Team:TUDelft/Osman

Sci-Phi 29


Codon Usage - Cross-species codon harmonization

When genetic circuits are transferred between organisms, protein production depends not only on translation initiation but also on codon usage. When a heterologous protein is expressed in a new bacterial host cell, its level often differs from that in the original microorganism. One of the reasons for a lower expression level is the difference in codon usage between the original organism and the new host cell (Angov et al, 2008). This difference is rooted in the DNA sequence itself: the DNA sequence determines the protein, which is produced through two consecutive cellular processes, transcription and translation (Angov et al, 2008).

In general, bacterial proteins are built from 20 different amino acids encoded by 61 sense codons (64 codons in total, 3 of which are stop codons). As a result, most of the 20 amino acids are encoded by more than one codon, a phenomenon called synonymous codon usage (Nascimento et al, 2018). Nascimento et al. (2018) showed that cells make extensive use of the codon choice this offers, since codon usage directly affects both the number of mRNA copies and the translation rate. They showed that proteins expressed at high levels have more mRNA copies and contain more frequently used codons, which speeds up translation (Nascimento et al, 2018). To increase the expression level of heterologous proteins in a host cell, codon optimization tools were developed. However, it remains difficult to predict which tool will design the optimal sequence (Mignon et al, 2008). The tools currently available can be divided into two main groups based on how the tool's algorithm functions:


  1. Codon optimization tools: The basic idea of a codon optimization tool is to achieve the highest translation rate possible. The translation rate is increased by substituting each codon with the codon that is used most frequently for the corresponding amino acid, and by keeping the ribosomal binding site (RBS) freely accessible for the ribosomal subunit by avoiding hairpin formation at the translation initiation site (Puigbò et al, 2018).
    The relative codon frequencies are the per-codon weights from which the Codon Adaptation Index (CAI) is calculated, as shown in the following equation. In this equation $w_i$ is the relative adaptiveness of codon $i$, $f_i$ is the frequency of that codon, and $\max(f_i)$ is the frequency of the codon that is used most frequently for the corresponding amino acid.

    $$w_i = \frac{f_i}{\max(f_i)}$$

  2. Codon harmonization tools: The basic idea of a codon harmonization tool is to mimic the native translation rate in the host organism by using rare codons at specific places and avoiding hairpin formation as much as possible. The codon usage of the original microorganism functions as the reference point (Athey et al, 2017). This approach allows pre-folding of the protein during translation in order to reduce the chance of protein misfolding as much as possible (Figure 7).

    Figure 7: Schematic representation of protein translation. Parts indicated by the green arrow are encoded with high-frequency codons in order to speed up the translation rate. Parts indicated by the red arrow are encoded by rare codons in order to slow down the translation rate, which limits misfolding of proteins by creating a small time window for protein pre-folding. Codon harmonization aims to create the same codon usage pattern as the native host in order to increase the amount of functional protein.
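The per-codon weight used by codon optimization tools can be illustrated with a minimal Python sketch. The codon counts and the small codon table below are made-up illustrative values, not data from our database, and the function name is hypothetical. Within one synonymous codon family, the ratio of counts equals the ratio of frequencies, so counts are used directly.

```python
from collections import defaultdict

# Hypothetical codon counts for two amino acids (illustrative numbers only).
CODON_TO_AA = {"CTG": "Leu", "CTC": "Leu", "TTA": "Leu", "AAA": "Lys", "AAG": "Lys"}
counts = {"CTG": 50, "CTC": 10, "TTA": 5, "AAA": 30, "AAG": 10}


def relative_adaptiveness(counts, codon_to_aa):
    """Return w_i = f_i / max(f_i) within each synonymous codon family."""
    best = defaultdict(int)  # highest count seen per amino acid
    for codon, n in counts.items():
        aa = codon_to_aa[codon]
        best[aa] = max(best[aa], n)
    return {codon: n / best[codon_to_aa[codon]] for codon, n in counts.items()}


weights = relative_adaptiveness(counts, CODON_TO_AA)
```

The most frequent codon of each amino acid receives a weight of 1, and an optimization tool would pick exactly those codons at every position.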


The current limitation of both types of codon adaptation tools is that they adapt the DNA coding sequence for only one organism at a time. There is no option to produce a single DNA coding sequence adapted to function across multiple species. To enable consistent gene expression between bacteria, we created the first cross-species codon harmonization tool. This harmonization tool provides the user with a single DNA coding sequence that is designed to yield the same protein expression level in different bacterial host cells. The core of our algorithm is the codon harmonization approach described above. We modified this algorithm by making use of the statistical method of least square variance. Furthermore, we made our tool BioBrick RFC compatible by removing type II standard restriction sites.


  • More about the codon harmonization

      Preselection data

      Our harmonization tool is based on a database containing the codon usage of 152903 microorganisms, obtained from Athey et al. (2017). Since working with such a large database slows down the tool, we designed a MATLAB script that pre-selects the data of the microorganisms of interest. This MATLAB script is available in the supplementary list below.

      The taxonomy identification number (taxid) is used as input for the preselection in order to reduce the chance of a mistake caused by a typing error in the organism name. Each organism name corresponds to a specific taxid, which makes it easier to search for a specific organism or strain in the NCBI Taxonomy database (Federhen et al, 2011).
      As mentioned before, the codon harmonization approach functions as the core of our code, which means that we recreate the codon usage pattern of the native host in order to increase the amount of functional protein. The native organism's own codon usage functions as the reference point for the tool. Regions encoded by high-frequency codons in the native organism are encoded by high-frequency codons of the organism of interest, and regions encoded by low-frequency codons are encoded by low-frequency codons.

      In our case we used eGFP as the heterologous protein and Escherichia coli strain BL21(DE3) as the native host (taxid 469008). The actual native organism for the eGFP protein is the jellyfish Aequorea victoria (taxid 6100). However, we first designed our tool to function in bacterial hosts before expanding it further to eukaryotic cells.
      When generating the filtered database containing data of only the organisms of interest, we designed the code in such a way that the first row contains the data of the reference organism and the rows below contain the data of the potential new host organisms. Each new row is a new potential host organism. A schematic representation of the preselection is shown in figure 8.

      Figure 8: A schematic representation of the preselection code to obtain data of only the organisms of interest from the main database. The top row corresponds to the codon usage of the reference organism, while each following row corresponds to the codon usage of one of the organisms of interest. In the preselection code, the taxid of each organism of interest is used as a query to find the right row in the main database. The selected rows are combined to form a filtered database containing the codon usage of only the reference organism and the hosts of interest.
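The preselection step can be sketched as follows. This is a minimal Python illustration and not the MATLAB script itself. The row layout (a dictionary with `taxid`, `name` and `codon_counts`), the function name `preselect` and the placeholder codon counts are assumptions made for the example; only the taxids are taken from the text.

```python
def preselect(database, reference_taxid, host_taxids):
    """Return the reference organism's row first, followed by one row per host."""
    by_taxid = {row["taxid"]: row for row in database}
    wanted = [reference_taxid] + list(host_taxids)
    missing = [t for t in wanted if t not in by_taxid]
    if missing:
        raise KeyError(f"taxid(s) not found in database: {missing}")
    return [by_taxid[t] for t in wanted]


# Tiny stand-in for the main database; the codon counts are placeholders.
database = [
    {"taxid": 6100, "name": "Aequorea victoria", "codon_counts": {"AAA": 1}},
    {"taxid": 224308, "name": "Bacillus subtilis", "codon_counts": {"AAA": 2}},
    {"taxid": 469008, "name": "Escherichia coli BL21(DE3)", "codon_counts": {"AAA": 3}},
    {"taxid": 1219067, "name": "Vibrio natriegens", "codon_counts": {"AAA": 4}},
]

filtered = preselect(database, reference_taxid=469008, host_taxids=[1219067, 224308])
```

Querying by taxid rather than by name means that a misspelled organism name cannot silently select the wrong row; an unknown taxid raises an error instead.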


      Harmonization code

      Before the harmonized coding sequence for the organisms of interest is generated, the deviation of the endogenous GC content of each organism of interest from the mean GC content is calculated. This calculation has been added to inform the user whether the GC contents of the organisms of interest are too different from each other. If the GC content of one of the organisms deviates more than 5% from the mean, a notification informs the user that the GC contents are too different from each other, which might result in a harmonized coding sequence that does not function in the same way in every organism.
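The GC check can be sketched in Python as below. Two assumptions are made that the text does not specify: the GC content is derived from the codon usage counts, and the 5% threshold is read as 5 percentage points. The function names and the GC values in the example are hypothetical.

```python
def gc_content(codon_counts):
    """GC content in percent, derived from codon usage counts."""
    total = sum(codon_counts.values())
    gc = sum(n * sum(base in "GC" for base in codon)
             for codon, n in codon_counts.items())
    return 100.0 * gc / (3 * total)


def gc_deviations(gc_by_organism, threshold=5.0):
    """Return the mean GC content, each organism's deviation, and flagged names."""
    mean_gc = sum(gc_by_organism.values()) / len(gc_by_organism)
    deviations = {name: gc - mean_gc for name, gc in gc_by_organism.items()}
    flagged = [name for name, d in deviations.items() if abs(d) > threshold]
    return mean_gc, deviations, flagged


# Hypothetical GC contents (percent) for three organisms of interest.
mean_gc, deviations, flagged = gc_deviations(
    {"host_a": 50.0, "host_b": 44.0, "host_c": 38.0}
)
if flagged:
    print("GC content deviates more than 5 points from the mean for:", flagged)
```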

      First, the codon frequency is calculated in the same way as described by Athey et al. (2017). We used this formula instead of the CAI formula described before, since the CAI calculation requires a reference gene. Because our system functions across species, an adapted version of the CAI is used, as shown in the equation below, obtained from Athey et al. (2017). Instead of the codon use relative to the most frequent synonymous codon, this adapted version calculates the frequency of the codon among all codons that encode the same amino acid. Here $codon_i$ is the number of occurrences of codon $i$, and the sum runs over all codons $j$ that encode the same amino acid as codon $i$.

      $$freq_{codon,i} = \frac{codon_i}{\sum_{j} codon_j}$$
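As a small worked example of this frequency, the following Python sketch normalizes codon counts within each amino acid. The counts and the reduced codon table are illustrative values only, and the function name is hypothetical.

```python
from collections import defaultdict

# Reduced codon table and made-up counts, for illustration only.
CODON_TO_AA = {"CTG": "Leu", "CTC": "Leu", "TTA": "Leu", "AAA": "Lys", "AAG": "Lys"}
counts = {"CTG": 50, "CTC": 30, "TTA": 20, "AAA": 30, "AAG": 10}


def synonymous_frequencies(counts, codon_to_aa):
    """Frequency of each codon among all codons encoding the same amino acid."""
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[codon_to_aa[codon]] += n
    return {codon: n / totals[codon_to_aa[codon]] for codon, n in counts.items()}


freqs = synonymous_frequencies(counts, CODON_TO_AA)
```

Unlike the relative adaptiveness $w_i$, these frequencies sum to 1 within each amino acid, so they can be compared directly between organisms.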

      Secondly, the variance for each codon position is calculated separately, as an intermediate step in the calculation of the least square variance. During the calculation of the variance, the codon frequency at that particular sequence position is also taken into account in order to remove outliers as much as possible. For each position, the candidate codons are then ordered by their calculated variance from lowest to highest.
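One possible reading of this ranking step is sketched below in Python. The scoring rule is an assumption and not necessarily the exact calculation in our script: for one sequence position, each synonymous candidate codon is scored by the mean squared difference between its frequency in every host and the frequency of the original codon in the reference organism. All names and frequencies are hypothetical.

```python
# Reduced codon table, for illustration only.
CODON_TO_AA = {"AAA": "Lys", "AAG": "Lys"}


def rank_candidates(ref_codon, ref_freq, host_freqs, codon_to_aa):
    """Order synonymous codons from lowest to highest squared deviation.

    Assumed score: mean over hosts of
    (frequency of candidate in host - frequency of ref_codon in reference)^2.
    """
    target = ref_freq[ref_codon]
    amino_acid = codon_to_aa[ref_codon]
    candidates = [c for c, aa in codon_to_aa.items() if aa == amino_acid]

    def score(codon):
        return sum((host[codon] - target) ** 2 for host in host_freqs) / len(host_freqs)

    return sorted(candidates, key=score)


# Hypothetical synonymous codon frequencies.
reference = {"AAA": 0.75, "AAG": 0.25}
hosts = [{"AAA": 0.70, "AAG": 0.30}, {"AAA": 0.60, "AAG": 0.40}]

ranking_common = rank_candidates("AAA", reference, hosts, CODON_TO_AA)
ranking_rare = rank_candidates("AAG", reference, hosts, CODON_TO_AA)
```

With these numbers a frequently used reference codon is matched by the codon that is frequent in both hosts, and a rare reference codon by the codon that is rare in both hosts, which is the behaviour codon harmonization aims for.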

      In the first iteration of generating the final sequence, we use the lowest-variance codon at every position. The generated sequence is then screened for type II restriction enzyme recognition sequences to make it MoClo compatible and easier to use in a construct. If a site is found, the codon at that particular position is substituted with the synonymous codon that has the second lowest variance. By going through this iteration cycle multiple times, we derive a single nucleotide sequence that is cleared of type II restriction sites and codon harmonized for all organisms of interest, in order to achieve the same translation rate in each organism of interest.
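The iteration cycle can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `ranked` is a hypothetical list that holds, for every codon position, the synonymous codons ordered from lowest to highest variance; both DNA strands are screened; and the BsaI recognition sequence GGTCTC is used only as an example site.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_site(dna, sites):
    """Return (start, end) of the first recognition site on either strand, or None."""
    for site in sites:
        for pattern in (site, reverse_complement(site)):
            start = dna.find(pattern)
            if start != -1:
                return start, start + len(pattern)
    return None


def remove_sites(ranked, sites, max_rounds=1000):
    """Build the sequence from the best-ranked codons and swap codons until no site is left."""
    choice = [0] * len(ranked)  # index of the codon currently used at each position
    for _ in range(max_rounds):
        dna = "".join(options[i] for options, i in zip(ranked, choice))
        hit = find_site(dna, sites)
        if hit is None:
            return dna
        start, end = hit
        # Move the first overlapping codon that still has an alternative
        # to its next-ranked synonymous codon, then screen again.
        for pos in range(start // 3, (end - 1) // 3 + 1):
            if choice[pos] + 1 < len(ranked[pos]):
                choice[pos] += 1
                break
        else:
            raise ValueError("no synonymous alternative left to remove the site")
    raise RuntimeError("site removal did not converge")


# Hypothetical ranking: GGT followed by CTC would create a GGTCTC site.
ranked = [["ATG"], ["GGT", "GGC"], ["CTC", "CTG"], ["TAA"]]
cleaned = remove_sites(ranked, ["GGTCTC"])
```

A swap does not always destroy the site, because the changed base can lie outside the recognition sequence; in that case the next round simply tries the next alternative.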

      Inputs and Outputs

      Our codon harmonization tool requires three inputs and generates two outputs.
      The harmonization script requires the following three inputs:

      1. The filtered database containing only the codon usage of the organisms of interest and the reference organism's codon usage (data_formatted = output file name of the filtered database).
      2. A database containing recognition sites for type II restriction enzymes (restriction_enzymes_database).
      3. The nucleotide sequence of the gene of interest that will be harmonized.

      The codon harmonization script generates the following two outputs:

      1. A codon harmonized nucleotide sequence usable in all organisms of interest.
      2. The deviation of the endogenous GC content of each organism of interest from the mean GC content across these organisms. (This output is added to inform the user whether the GC contents of the organisms of interest are too different from each other. If the GC contents are too different, a notification pops up on the screen.)

    • Obtained output
      For the validation of our model we used the microorganisms listed below:

      Name organism                                          taxid
      Escherichia coli BL21(DE3)                             469008
      Vibrio natriegens NBRC 15636 = ATCC 14048 = DSM 759    1219067
      Bacillus subtilis subsp. subtilis str. 168             224308

      As input sequence we used the coding sequence for eGFP (click here to get the sequence).
      The generated output sequence is available here (click here to get the sequence).

  • Experimental validation

    Qualitative validation expression of the gene

    We tested the prediction of the codon harmonization by first showing that the coding sequence for the harmonized GFP results in a functional, fluorescing protein. To do so, we implemented the coding sequence for harmonized eGFP in a level 1 MoClo construct by assembling the following basic parts together. (Include list of the basic parts used to assemble the level 1 construct).
    The assembled construct was transformed into chemically competent E. coli BL21(DE3) cells by heat shock transformation (heat shock protocol). 50 µL of the transformed cells was plated out on agar plates with 50 µL of Xgal (stock concentration 2 µg Xgal / 20 mg DMSO) and 25 µL of the antibiotic ampicillin (stock concentration 5 mg/mL).
    From the plate, a single colony was picked, sent for sequencing for confirmation, and stored as a glycerol stock for further use. No mutations were detected in the coding sequence for the harmonized GFP.
    Following the protocols for IPTG induction and the gel doc, the expression of the harmonized GFP gene is shown qualitatively in figure ........

    Cross species codon harmonization


    When heterologous proteins are expressed in new bacterial host cells, altered protein expression levels are observed due to the difference in codon usage between the original organism and the new host cell (Angov et al, 2008). To increase the expression level of the heterologous protein in the host cell, codon optimization tools have been developed. The current limitation of codon adaptation tools is that the DNA coding sequence can be adapted for only one organism at a time. Therefore we created the first cross-species codon harmonization tool. The tool provides the user with a single DNA coding sequence that is designed to yield the same protein expression level in different bacterial host cells. We demonstrated functional protein expression using our own coding sequence harmonized for E. coli, B. subtilis, and V. natriegens.

    • Results -- functional protein production
      • To validate whether the cross-species codon harmonization results in a functional protein, we designed a cross-species codon harmonized GFP coding sequence. The sequence was harmonized using E. coli BL21(DE3) as the reference organism, and V. natriegens (NBRC 15636 = ATCC 14048 = DSM 759) and B. subtilis subsp. subtilis str. 168 as the organisms of interest.
        The harmonized GFP construct was built through MoClo assembly of the WT T7 promoter, the universal RBS, the harmonized GFP and the WT terminator. The assembled plasmid was transformed into E. coli BL21(DE3).
        We measured the GFP fluorescence using the gel doc. As reference for background fluorescence we used untransformed E. coli BL21(DE3) pLysS. As reference for fluorescence we used E. coli BL21(DE3) pLysS with the BBa_K2918030 plasmid.

        Furthermore, to validate the harmonized GFP further, a flow cytometry experiment was performed with the harmonized GFP under promoters of two different strengths. The two promoters, the med T7sp1 promoter and the T7sp1 promoter, were cloned in the same backbone (pICH47761) and paired with the same RBS (universal RBS). The fluorescence of the harmonized GFP was measured by flow cytometry, and E. coli BL21(DE3) cells were used as blank. Click here for the protocol. FCSalyzer v.0.9.18-alpha was used to analyze the data from the flow cytometry experiment.

        Results

        Both the positive control and the harmonized GFP showed some level of fluorescence. The harmonized GFP showed fluorescing cells, but the positive control was barely visible. For better validation a flow cytometer measurement was performed.

        Figure ...: Qualitative data of a fluorescence measurement of pelleted cells in Eppendorf tubes after IPTG induction. From left to right: E. coli BL21(DE3) pLysS with no plasmid (negative control), E. coli BL21(DE3) pLysS with JuniperGFP (positive control), and E. coli BL21(DE3) pLysS with harmonized GFP. A) The fluorescence image taken using the gel doc. As seen, the cell pellets are all fluorescing. B) After removing the autofluorescence by subtracting the negative control from the image, the green fluorescing cells are visualized. The cell pellets that still fluoresce are marked with a green circle. C) A 3D plot of the fluorescing cells. The height of the graph corresponds to the intensity of the measured GFP.


        The raw data of the flow cytometer is shown in figure….
        Figure ...: Raw fluorescence data. The curves represent fluorescence values of E. coli BL21(DE3) cells (black), clones with GFP expressed from T7sp1 (red) and clones with GFP expressed from T7sp1 (blue).

    Conclusion

    The fluorescence measurement using the gel doc did not show a clear separation between the positive and negative control. The result was insufficient to determine whether the harmonized GFP sequence resulted in functional protein production.

    The flow cytometer gave better qualitative data. The clear shift between the blank cells and the GFP-expressing clones indicates active GFP production. We therefore conclude that the harmonized GFP coding sequence encodes functional GFP.


References