Team:TUDelft/Osman

Sci-Phi 29


Codon Usage - Cross-species codon harmonization

To support our orthogonal system, we developed a novel codon adaptation tool that aims to equalize translation rates across different species. Similar translation rates should make heterologous protein expression levels more predictable across bacterial species.
A protein's structure is determined by its DNA sequence, which is converted into a functional protein through two successive cellular processes: transcription and translation (Angov, Hillier, Kincaid, & Lyon). The 20 standard amino acids are encoded by 61 of the 64 possible codons; the remaining three codons signal the end of translation. As a result, most amino acids are encoded by more than one codon, a phenomenon called synonymous codon usage (Nascimento et al., Crick, Gun, Yumiao, Haixian, & Liang). Nascimento, Kelly et al. (2018) have shown that cells make extensive use of this freedom of codon choice, since codon usage directly affects both the mRNA level and the translation rate. They showed that highly expressed proteins have more mRNA copies and contain more frequently used codons, which speeds up translation (Nascimento et al.).

After the development of gene editing techniques, scientists started to express heterologous proteins in new host cells. Heterologous expression often yields protein levels that differ from those in the original microorganism. One reason for a lower expression level is the difference in codon usage between the original organism and the new host cell. To increase the expression level of the heterologous protein in the host cell, codon adaptation tools were developed. The tools currently available can be divided into two main groups based on how their algorithms work:

  1. Codon optimization tools: The basic idea is to achieve the highest possible translation rate, while avoiding hairpin formation, by substituting each codon with the codon that is used most frequently for the corresponding amino acid (Hanson & Coller 2017). Codon preference is quantified through the relative adaptiveness $w_i$, on which the Codon Adaptation Index (CAI) is based, as shown in the following equation. In this equation $w_i$ is the relative adaptiveness of codon $i$, $f_i$ is the frequency of that codon, and $\max(f_i)$ is the frequency of the codon that is used most frequently for the corresponding amino acid.

     $$w_i = \frac{f_i}{\max(f_i)}$$

  2. Codon harmonization tools: The basic idea is to reproduce the native translation-rate profile in the host organism by placing rare codons at specific positions, while avoiding hairpin formation as much as possible. The codon usage of the original microorganism functions as the reference point. This approach allows the protein to pre-fold during translation, which reduces the chance of protein misfolding (Figure 1).

    Figure 1: Schematic representation of protein translation. The green parts are encoded by high-frequency codons to speed up the translation rate. The red parts are encoded by rare codons to slow down translation, which creates a short time window for protein pre-folding and thereby reduces misfolding. Codon harmonization aims to recreate the codon usage pattern of the native host in order to increase the amount of functional protein.
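The relative adaptiveness used by codon optimization tools can be illustrated with a short Python sketch. The codon counts and the two-amino-acid table below are made up for illustration, and the function names are hypothetical. The `cai` helper assumes the usual definition of the CAI of a gene as the geometric mean of the $w_i$ values of its codons.

```python
from math import exp, log

# Hypothetical synonymous codon families and counts (illustrative numbers only).
SYNONYMOUS = {
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
}
COUNTS = {"AAA": 30, "AAG": 10, "TTT": 5, "TTC": 20}


def relative_adaptiveness(counts, synonymous):
    """Return w_i = f_i / max(f_i) within each synonymous codon family.

    Raw counts are used directly: the ratio is the same as for frequencies.
    """
    weights = {}
    for codons in synonymous.values():
        top = max(counts[c] for c in codons)
        for c in codons:
            weights[c] = counts[c] / top
    return weights


def cai(sequence, weights):
    """Geometric mean of w_i over the codons of a coding sequence."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    return exp(sum(log(weights[c]) for c in codons) / len(codons))


WEIGHTS = relative_adaptiveness(COUNTS, SYNONYMOUS)
```

In this toy table the most frequent codon of each family gets $w_i = 1$, so a sequence built only from those codons, such as `AAATTC`, reaches the maximum CAI of 1.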


Since our project is all about creating a universal toolbox, we extended our toolbox with the first cross-species codon harmonization tool. We developed this tool to make heterologous protein expression more predictable in multiple bacterial host species by taking into account the variability in translation between organisms. The tool is designed to yield comparable expression levels of functional protein in different host cells from a single coding sequence. The codon harmonization approach forms the core of our algorithm, which we extended with a statistical analysis. Furthermore, we made the output of our tool BioBrick RFC compatible by removing the recognition sites of type II restriction enzymes.

  • How does our universal harmonization tool work?

      Our harmonization tool is based on a database containing the codon usage of 152903 microorganisms, obtained from the paper published by Athey et al. (2017). Since working with such a large database slows down the tool, we wrote a Matlab script that pre-selects only the data of the microorganisms of interest. The Matlab code is available in the supplementary list below. The taxonomy identification number (taxid) is used as input, since each microbial species has its own unique taxid.
      As mentioned before, the codon harmonization approach functions as the core of our code, which means our tool requires a reference organism. The reference organism is the organism from which the heterologous protein originates. In our case we used eGFP as the heterologous protein and Escherichia coli strain BL21(DE3) (taxid 469008) as the organism of origin.
      When generating the filtered database, which contains only the data of the organisms of interest, the code places the data of the reference organism in the first row and the data of the organisms of interest in the rows below.

      Figure 2: Schematic representation of how the data is filtered from the main database. The top row contains the data of the reference organism.
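The pre-selection step can be sketched in Python as follows. This is not the team's Matlab script: the function name, the row layout, and the codon counts are assumptions for illustration, and the sketch assumes one row per taxid. Apart from the reference taxid 469008 named above, the taxids are arbitrary examples.

```python
def filter_by_taxid(rows, reference_taxid, taxids_of_interest):
    """Return the reference row first, then the rows of the organisms of interest."""
    by_taxid = {row["taxid"]: row for row in rows}
    ordered = [reference_taxid] + [
        t for t in taxids_of_interest if t != reference_taxid
    ]
    missing = [t for t in ordered if t not in by_taxid]
    if missing:
        raise KeyError(f"taxid(s) not in database: {missing}")
    return [by_taxid[t] for t in ordered]


# Toy stand-in for the full codon usage table; the counts are made up.
DATABASE = [
    {"taxid": 1423, "AAA": 50, "AAG": 20},
    {"taxid": 469008, "AAA": 30, "AAG": 10},
    {"taxid": 303, "AAA": 8, "AAG": 40},
]

# Reference organism (taxid 469008) ends up in the first row.
data_formatted = filter_by_taxid(DATABASE, 469008, [303, 1423])
```

Building a dictionary keyed by taxid keeps each lookup cheap, which matters when the source table holds many thousands of organisms.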


      Our codon harmonization script requires three inputs and generates two outputs.
      The script requires the following inputs:

      1. The newly generated database containing the data of the organisms of interest (data_formatted).
      2. A database containing the recognition sites of type II restriction enzymes (restriction_enzymes_database).
      3. Nucleotide sequence of the gene of interest that will be harmonized.

      The script generates the following outputs:

      1. A single codon-harmonized nucleotide sequence for all the organisms of interest.
      2. The deviation of the GC content of each organism from the mean GC content.
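The second output can be illustrated with a small Python sketch. It assumes that a GC percentage is available per organism and that the mean is taken over the selected organisms; the organism labels and values are hypothetical, and the source does not specify how the script derives these numbers.

```python
def gc_deviation(gc_by_organism):
    """Return each organism's GC content minus the mean GC content."""
    mean_gc = sum(gc_by_organism.values()) / len(gc_by_organism)
    return {org: gc - mean_gc for org, gc in gc_by_organism.items()}


# Hypothetical GC percentages for a reference and two hosts.
GC_CONTENT = {"reference": 50.0, "host_A": 60.0, "host_B": 40.0}
DEVIATIONS = gc_deviation(GC_CONTENT)
```

A large deviation for one host would indicate that its overall base composition differs strongly from the other organisms of interest.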

    • From nucleotide input to nucleotide output

      First, the codon frequency is calculated in the same way as in Athey et al. (2017). We used this formula instead of the formula for the CAI, since the CAI calculation requires a reference gene. The CAI is very useful when you want to change the expression of a gene within the same organism. However, our system functions across species, so an adapted version of the CAI is used, as shown in the equation below. Here $codon_i$ is the number of times codon $i$ occurs in an organism's coding sequences, and the denominator is the summed codon count for that organism.

      $$freq_{codon,i} = \frac{codon_i}{\sum_j codon_j}$$
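As a minimal Python sketch of this frequency calculation, the function below divides each codon count by the summed count of one organism. It assumes the sum runs over all codons in that organism's table; the counts are invented and the function name is hypothetical.

```python
def codon_frequencies(counts):
    """Return count / total count for every codon of a single organism."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Made-up counts for four codons of one organism.
COUNTS = {"AAA": 30, "AAG": 10, "TTT": 5, "TTC": 15}
FREQUENCIES = codon_frequencies(COUNTS)
```

Because every organism is normalized by its own total, the resulting frequencies can be compared between organisms without choosing a reference gene.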

      Secondly, the variance is calculated separately for each codon position. In this step the variance is determined for every synonymous codon that encodes the amino acid at that position. The codon frequency at that particular sequence position is also included in the calculation in order to suppress outliers as much as possible. For each position, the synonymous codons are then ordered from lowest to highest variance.
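The ranking step might look like the following Python sketch. It assumes the variance of a codon is the population variance of its frequency across the selected organisms, and it leaves out the position-specific frequency term mentioned above, because the text does not specify how that term enters the calculation. All names and frequencies are illustrative.

```python
from statistics import pvariance

# Hypothetical synonymous codon family for lysine.
SYNONYMOUS = {"K": ["AAA", "AAG"]}

# One row per organism, reference organism first (made-up frequencies).
FREQS = [
    {"AAA": 0.60, "AAG": 0.40},
    {"AAA": 0.20, "AAG": 0.45},
    {"AAA": 0.70, "AAG": 0.35},
]


def rank_synonymous(amino_acid, freqs, synonymous):
    """Order the synonymous codons of one amino acid by ascending variance."""
    scored = []
    for codon in synonymous[amino_acid]:
        values = [row[codon] for row in freqs]
        scored.append((pvariance(values), codon))
    return [codon for _, codon in sorted(scored)]


RANKED_K = rank_synonymous("K", FREQS, SYNONYMOUS)
```

In this toy data `AAG` is used at a similar frequency in all three organisms, so it ranks ahead of `AAA`, whose usage differs widely between them.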

      In the first iteration of generating the final sequence, we use the lowest-variance codon at each position. The generated sequence is then screened for type II restriction enzyme recognition sequences. If a site is found, the codon at that position is substituted with the codon that has the second-lowest variance. By repeating this cycle, we arrive at a single nucleotide sequence that is free of type II restriction sites and codon-harmonized for all organisms of interest, with the aim of achieving the same translation rate in each of them.
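The iteration cycle can be sketched in Python as shown below. This is an illustration under stated assumptions, not the team's script: the function names are hypothetical, only the forward strand is scanned, and the EcoRI site GAATTC stands in for the contents of the restriction enzyme database. When a site is found, the sketch moves the first overlapping codon that still has an alternative to its next-lowest-variance synonym.

```python
def first_site(sequence, sites):
    """Return (start, end) of the leftmost recognition site, or None."""
    hits = [(sequence.find(s), s) for s in sites if s in sequence]
    if not hits:
        return None
    start, site = min(hits)
    return start, start + len(site)


def harmonize(ranked, sites, max_rounds=100):
    """Build a site-free sequence.

    ranked[p] lists the synonymous codons for position p, lowest variance first.
    """
    choice = [0] * len(ranked)
    for _ in range(max_rounds):
        sequence = "".join(r[c] for r, c in zip(ranked, choice))
        hit = first_site(sequence, sites)
        if hit is None:
            return sequence
        start, end = hit
        # Codon positions that overlap the recognition site.
        for p in range(start // 3, (end - 1) // 3 + 1):
            if choice[p] + 1 < len(ranked[p]):
                choice[p] += 1  # fall back to the next-lowest variance
                break
        else:
            raise ValueError("no synonymous alternative left to remove the site")
    raise ValueError("restriction sites remain after max_rounds")


# Toy ranking for the peptide E-F-K; the first choices spell out an EcoRI site.
RANKED = [["GAA", "GAG"], ["TTC", "TTT"], ["AAA", "AAG"]]
RESULT = harmonize(RANKED, ["GAATTC"])
```

Here the first pass yields `GAATTCAAA`, which contains the site, so the first codon is replaced by `GAG` and the second pass returns a clean sequence. Each round moves exactly one position to a higher-variance codon, so the loop always terminates.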