Team:Stuttgart/Software

Software

Abstract

During our project, we needed more information on the distribution of rare Vibrio natriegens codons in proteins of interest. To obtain it without manual counting, we developed a Python-based tool that is easy to use for all team members involved in this part of the project. At the same time, we wanted the algorithm to be freely available and easy to use for everyone else. The result is our algorithm GeneCodonSearch (GeCoS).

Approach

We developed an algorithm (written in Python 3) that analyses the codon composition of any input DNA sequence(s) in FASTA format. GeCoS can analyse whole proteomes (i.e. all coding sequences) of organisms in seconds (e.g. Escherichia coli) to minutes (e.g. Homo sapiens), depending on the hardware and the proteome size. GeCoS is publicly available in the Galaxy Tool Shed as open-source software. Figure 1 shows the GeCoS repository in the Galaxy Tool Shed, which is accessible at https://bit.ly/2YMcRZl.
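The idea of a codon composition analysis can be illustrated with a minimal sketch. It is not the GeCoS source itself (that is listed in the Program section); the function names read_fasta and codon_composition and the toy sequence are assumptions made for illustration only. The sketch reads FASTA-formatted text and counts every complete codon in the first reading frame.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA-formatted text into a dict {sequence_id: sequence}."""
    records = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after '>'
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            records[seq_id] = ""
        elif seq_id is not None:
            records[seq_id] += line.upper()
    return records


def codon_composition(sequence):
    """Count every complete codon, starting at the first base."""
    return Counter(sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3))


# Hypothetical toy input, split over two lines like a real FASTA record
example = ">gene_a toy example\nATGAGAAGG\nAGATAA\n"
for name, seq in read_fasta(example).items():
    print(name, dict(codon_composition(seq)))
```

Incomplete codons at the end of a sequence are ignored in this sketch, and no check is made that the sequence starts with a start codon.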

Figure 1 – Screenshot of the Galaxy Tool Shed repository containing GeCoS.

Open-Source

User-friendliness is as important for an algorithm as its functionality. We therefore decided to provide a graphical user interface (GUI), which is accessed in the browser via an instance of a Galaxy server. We chose the Galaxy Project (https://www.galaxyproject.org/) as the platform for GeCoS because Galaxy is an open, web-based platform for accessible, reproducible and transparent computational biomedical research and is OSI licensed (https://galaxyproject.org/admin/license/#images-and-documentation-license). The GUI for each tool can be embedded with a single XML file. Figure 2 shows the GUI of GeCoS in Galaxy with a possible user input.

Figure 2 – Screenshot of the Galaxy Project GUI of GeCoS with a possible input.

Relation to PhyCoVi

Using GeCoS and the Ensembl database [1], we identified proteins whose genes contain an increased proportion of codons that are rare in Vibrio natriegens. Table 1 lists a selection of these proteins.

Table 1 – List of selected proteins from the human proteome (GRCh38) with an increased rare codon ratio. The rare codon ratio is the percentage of rare codons in the gene sequence.
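| Protein                        | UniProt | Rare codon ratio | AGA | AGG | CGG | TCC | TGC |
|--------------------------------|---------|------------------|-----|-----|-----|-----|-----|
| NF-kappa-B essential modulator | Q9Y6K9  | 5.4 %            | 19  | 8   | 2   | 13  | 2   |
| Centrosomal protein of 131 kDa | Q9UPN4  | 5.3 %            | 33  | 6   | 9   | 4   | 3   |
| Neurosecretory protein VGF     | O15240  | 5.0 %            | 80  | 32  | 43  | 24  | 29  |
| Histone deacetylase 4          | P56524  | 4.9 %            | 9   | 12  | 15  | 4   | 6   |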

All proteins listed in Table 1 might be poorly expressed in Vibrio natriegens or Escherichia coli [2] because of their rare codon ratios, which were calculated with GeCoS. Additionally, proteins with more than 5 % rare codons were found to be less soluble in Escherichia coli strains with increased tRNA levels [3]. With GeCoS, we provide an easy and fast way to evaluate codon ratios in heterologous genes and thus to investigate expression issues that might occur during heterologous gene expression.
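As a rough illustration of how such a ratio can be checked against the 5 % threshold, the following sketch computes the percentage of rare codons in a sequence, following the definition in the caption of Table 1. The names RARE_CODONS, rare_codon_ratio and exceeds_threshold are hypothetical, the codon set is simply taken from the column headers of Table 1, and the sequence is a made-up toy example rather than a real gene.

```python
# Codons taken from the column headers of Table 1 (an assumption for this sketch)
RARE_CODONS = ("AGA", "AGG", "CGG", "TCC", "TGC")


def rare_codon_ratio(sequence, rare_codons=RARE_CODONS):
    """Return the percentage of codons in `sequence` that are rare codons."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    if not codons:
        return 0.0
    rare = sum(1 for codon in codons if codon in rare_codons)
    return 100.0 * rare / len(codons)


def exceeds_threshold(sequence, threshold_percent=5.0):
    """True if the rare codon ratio is above the given percentage."""
    return rare_codon_ratio(sequence) > threshold_percent


# Toy sequence: 10 codons, one of which (AGA) is in the rare codon set
toy_gene = "ATG" + "AGA" + "GCT" * 7 + "TAA"
print(rare_codon_ratio(toy_gene), exceeds_threshold(toy_gene))
```

This sketch uses a plain percentage. The total score written by the program in the Program section is normalised differently, so the two values are not interchangeable.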

tRNA availability and the elongation factor thermo-unstable (EF-Tu) were shown to be two key factors in cell-free protein synthesis (CFPS) [4]. GeCoS could therefore also be used to adjust CFPS systems according to the rare codon ratios of the genes that are to be expressed.


Outlook

For the future, we are planning to embed the following features into GeCoS:

  • Selection of output format, which can be further analysed with different tools (csv format, tab-separated table)
  • Graphical output (distribution plot of the codons, highlighting of rare codons in gene sequence)
  • Analysis for tandem codons (codon repeats)
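Two of the planned features can be sketched in a few lines. This is only one possible approach and not part of the current GeCoS release; the function names tandem_codons and counts_to_csv, the parameter min_repeats and the column layout are assumptions. The first function reports runs of identical codons of interest, the second writes codon counts as CSV text.

```python
import csv
import io


def tandem_codons(sequence, codons_of_interest, min_repeats=2):
    """Return (codon, start_position, run_length) for runs of identical codons."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    runs = []
    start = 0
    while start < len(codons):
        # Extend the run while the next codon equals the current one
        end = start
        while end + 1 < len(codons) and codons[end + 1] == codons[start]:
            end += 1
        length = end - start + 1
        if codons[start] in codons_of_interest and length >= min_repeats:
            # start * 3 is the 0-based nucleotide position of the run
            runs.append((codons[start], start * 3, length))
        start = end + 1
    return runs


def counts_to_csv(rows, codons):
    """Format rows of (gene_id, {codon: count}) as CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Gene_ID"] + list(codons))
    for gene_id, counts in rows:
        writer.writerow([gene_id] + [counts.get(codon, 0) for codon in codons])
    return buffer.getvalue()


# Toy sequence with an AGA pair and a TGC triplet
print(tandem_codons("ATGAGAAGAAGGTGCTGCTGCTAA", {"AGA", "TGC"}))
print(counts_to_csv([("gene_a", {"AGA": 2, "TGC": 3})], ["AGA", "AGG", "TGC"]))
```

A tab-separated table could be produced the same way by passing delimiter="\t" to csv.writer.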

Program

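    #1: Modules and command-line arguments
    import re
    import argparse
    import numpy as np
    # Print the complete result matrix instead of a truncated summary
    np.set_printoptions(threshold=np.inf)
    
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--minimum", type=float, required=True, help="cutoff value; only sequences with a higher total score are reported")
    parser.add_argument("-c", "--codon", type=str, required=True, help="codons of interest, comma-separated (e.g. AGA,AGG)")
    parser.add_argument("-i", "--input_sequences", type=str, required=True, help="path to the input FASTA file")
    parser.add_argument("-r", "--result", type=str, required=True, help="path to the result file")
    
    args = parser.parse_args()
    
    sequences = args.input_sequences
    cutoff = args.minimum
    resultfile = args.result
    
    # Split the comma-separated codons and drop surrounding whitespace
    codonlist = [codon.strip() for codon in args.codon.split(",")]
    
    # One counter per codon of interest
    codon_dictionary = dict.fromkeys(codonlist, 0)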
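    #2: Input
    dict1 = {}   # sequence ID -> DNA sequence
    dict2 = {}   # running index -> sequence ID
    seq_id = ""
    
    with open(sequences, mode='r') as input_sequences:
        for line in input_sequences:
            if line.startswith(">"):
                # The sequence ID is the first whitespace-delimited token of the header
                match = re.search(r">(\S+)", line)
                if match is None:
                    # Header without an ID: ignore the sequence lines that follow it
                    seq_id = ""
                    continue
                seq_id = match.group(1)
                dict1[seq_id] = ""
            elif seq_id != "" and line.strip() != "":
                dict1[seq_id] += line.strip()
    
    for i, name in enumerate(dict1):
        dict2[i] = name
    
    testsequences = len(dict2)
    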
    
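    #3: Codon-Scoring of the sequences
    m=1   # next free row; row 0 is reserved for the header
    
    # Column 0: sequence ID, column 1: total score, remaining columns: counts per codon
    scoremat=np.zeros((len(dict2)+1,len(codon_dictionary)+2)).astype(object)
    
    for i in range(len(dict2)):
        testseq=dict1[dict2[i]]
        if len(testseq) < 3:
            # No complete codon, so no score can be calculated
            continue
        # Reset the counters for every sequence
        codon_dictionary=dict.fromkeys(codon_dictionary,0)
        
        # Read the sequence codon by codon, starting at the first base
        for l in range(0,len(testseq)-2,3):
            codon=testseq[l:l+3]
            if codon in codon_dictionary:
                codon_dictionary[codon]+=1
        
        # Total score: counted codons of interest, divided by the number of codons
        # in the sequence and by the number of codons of interest
        scorecomp=sum(codon_dictionary.values())/(float(len(testseq)/3)*len(codon_dictionary))
        codonscores=list(codon_dictionary.values())
            
        if scorecomp > cutoff :
            scoremat[m,0]=dict2[i]
            scoremat[m,1]=round(scorecomp,7)
            scoremat[m,2:]=codonscores
            m+=1
    
    # Sort by total score (ascending); for a non-negative cutoff the unused all-zero rows come first
    scoremat_sort=scoremat[np.lexsort((scoremat[:, 1], ))]
    # Write the header into the first (all-zero) row
    scoremat_sort[0,0]="Gene_ID"
    scoremat_sort[0,1]="Total Score"
    scoremat_sort[0,2:]=codonlist        
    # Drop the remaining all-zero rows
    scoremat_cleaned = scoremat_sort[[i for i, x in enumerate(scoremat_sort) if x.any()]]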
    
    #4: Output
    with open(resultfile, 'w') as file_object:
        file_object.write("DNA sequences with a higher total score than ")
        file_object.write(str(cutoff))
        file_object.write("\n\n")
        file_object.write(str(scoremat_cleaned))
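To make the total score from step #3 easier to follow, the calculation can be reproduced on a toy sequence. The sketch below mirrors the formula of the listing above in a standalone function; the function name total_score and the example sequence are assumptions for illustration, and it is not a replacement for the program itself.

```python
def total_score(sequence, codons_of_interest):
    """Return (score, counts), using the same normalisation as step #3."""
    counts = dict.fromkeys(codons_of_interest, 0)
    if len(sequence) < 3 or not counts:
        # No complete codon or no codon of interest: nothing to score
        return 0.0, counts
    for i in range(0, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        if codon in counts:
            counts[codon] += 1
    # Hits divided by (number of codons in the sequence * number of codons of interest)
    score = sum(counts.values()) / ((len(sequence) / 3) * len(counts))
    return round(score, 7), counts


# Toy sequence with 5 codons: ATG AGA AGG AGA TAA
score, counts = total_score("ATGAGAAGGAGATAA", ["AGA", "AGG"])
print(score, counts)
```

For this toy input, three of the five codons are codons of interest and two codons were searched for, so the score is 3 / (5 * 2). Because the score is also divided by the number of codons of interest, it is not the same quantity as a plain percentage of rare codons, and the cutoff passed with -m has to be chosen accordingly.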
    


References

  1. Zerbino, D. R. et al. Ensembl 2018. Nucleic acids research 46, D754-D761; 10.1093/nar/gkx1098 (2018).
  2. Drott, D. Overcoming the codon bias of E. coli for enhanced protein expression. inNovations 12 (2001).
  3. Rosano, G. L., Bruch, E. M. & Ceccarelli, E. A. Insights into the Clp/HSP100 chaperone system from chloroplasts of Arabidopsis thaliana. The Journal of biological chemistry 286, 29671–29680; 10.1074/jbc.M110.211946 (2011).
  4. Nieß, A., Failmezger, J., Kuschel, M., Siemann-Herzberg, M. & Takors, R. Experimentally Validated Model Enables Debottlenecking of in Vitro Protein Synthesis and Identifies a Control Shift under in Vivo Conditions. ACS synthetic biology 6, 1913–1921; 10.1021/acssynbio.7b00117 (2017).