Team:TUDelft/Osman

Sci-Phi 29


Codon Usage - Cross-species codon harmonization

To support our orthogonal system, we developed a novel codon adaptation tool that aims to equalize translation rates across different species. Similar translation rates should make heterologous protein expression levels more predictable across bacterial species.
The structure of a protein depends on the DNA sequence that encodes it, which is converted into a functional protein by two consecutive cellular processes: transcription and translation (Angov, Hillier, Kincaid, & Lyon). The standard genetic code contains 64 codons, 61 of which encode the 20 amino acids, while the remaining three are stop codons. As a result, most amino acids are encoded by more than one codon, a phenomenon called synonymous codon usage (Nascimento et al.; Crick; Gun, Yumiao, Haixian, & Liang). Nascimento, Kelly et al. (2018) showed that cells make extensive use of this freedom in codon choice, since codon usage directly affects both the mRNA level and the translation rate. They found that highly expressed proteins have more mRNA copies and contain more frequently used codons, which speeds up translation (Nascimento et al.).

After the development of gene editing techniques, scientists started to express heterologous proteins in new host cells. Heterologously expressed proteins often reach different levels than in the original microorganism. One of the reasons for a lower expression level is the difference in codon usage between the original organism and the new host cell. To increase the expression level of a heterologous protein in the host cell, codon adaptation tools were developed. The tools available now can be divided into two main groups based on how their algorithms work:

  1. Codon optimization tools: The basic idea is to achieve the highest possible translation rate and to avoid hairpin formation by substituting each codon with the codon that is used most frequently for the corresponding amino acid (Hanson & Coller 2017). The relative codon frequencies are the weights that underlie the Codon Adaptation Index (CAI), as shown in the following equation. In this equation $w_i$ is the relative adaptiveness of codon $i$, $f_i$ is the frequency of that codon, and $max(f_i)$ is the frequency of the codon that is used most frequently for the corresponding amino acid. The CAI of a gene is the geometric mean of $w_i$ over its codons.

     $$w_i = \frac{f_i}{max( f_i )}$$

  2. Codon harmonization tools: The basic idea is to mimic, in the new host organism, the translation rate of the native organism by placing rare codons at specific positions while avoiding hairpin formation as much as possible. The codon usage of the original microorganism serves as the reference point. This approach allows the protein to pre-fold during translation, which reduces the chance of misfolding (Figure 1).

     Figure 1: Schematic representation of protein translation. The green parts are encoded by high-frequency codons, which speed up translation. The red parts are encoded by rare codons, which slow down translation and limit misfolding by creating a small time window for protein pre-folding. Codon harmonization aims to recreate the codon usage pattern of the native host in order to increase the amount of functional protein.
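The relative adaptiveness and the CAI described in item 1 can be illustrated with a short Python sketch. It is not part of our tool: the codon frequencies are made-up numbers, and `relative_adaptiveness` and `cai` are hypothetical helper names introduced only for this example.

```python
from math import exp, log

# Made-up codon frequencies for two amino acids (illustrative values only).
SYNONYMOUS_FREQ = {
    "K": {"AAA": 30.0, "AAG": 10.0},
    "F": {"TTT": 20.0, "TTC": 10.0},
}

def relative_adaptiveness(freq_by_amino_acid):
    """Return w_i = f_i / max(f) for every codon, computed per amino acid."""
    weights = {}
    for codons in freq_by_amino_acid.values():
        top = max(codons.values())
        for codon, f in codons.items():
            weights[codon] = f / top
    return weights

def cai(sequence, weights):
    """Geometric mean of w_i over the complete codons of a coding sequence."""
    usable = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    return exp(sum(log(weights[c]) for c in codons) / len(codons))

weights = relative_adaptiveness(SYNONYMOUS_FREQ)
print(weights)
print(cai("AAATTT", weights), cai("AAGTTC", weights))
```

A sequence built only from the most frequent codons scores the maximum CAI, which is exactly what a pure optimization tool aims for and what harmonization deliberately avoids at selected positions.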


Since our project is about creating a universal toolkit, we built the first cross-species codon harmonization tool. We developed it to increase the predictability of heterologous protein expression in multiple bacterial host species by taking the variability in translation between organisms into account. The tool provides the user with a single DNA coding sequence that is designed to yield the same protein expression level in different bacterial host cells. The codon harmonization approach explained above forms the core of our algorithm, which we extended with a statistical analysis. Furthermore, we made the output BioBrick RFC compatible by removing the recognition sites of the standard type II restriction enzymes.

  • How does our universal harmonization tool work?

      Our harmonization tool is based on a database containing the codon usage of 152903 microorganisms, obtained from Athey et al. (2017). Since working with such a large database slows down the tool, we wrote a MATLAB script that pre-selects the data of the microorganisms of interest. This MATLAB script is available in the supplementary list below. It takes taxonomy identification numbers (taxids) as input, since each microbial species has its own unique taxid.
      As mentioned before, the codon harmonization approach is the core of our code, which means our tool requires a reference organism. As the reference organism we use the native host of the protein in question. In our case we used eGFP as the heterologous protein and Escherichia coli strain BL21(DE3) as the native host (taxid 469008).
      The filtered database contains data of only the organisms of interest. We designed the code so that the first row contains the data of the reference organism and the remaining rows contain the data of the potential new host organisms.

      Figure 2: Schematic representation of how data has been filtered from the main database. Again, the top row corresponds with data of the reference organism.
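The filtering step can be sketched in Python as follows. This is not the MATLAB script itself but a minimal illustration, assuming the database can be represented as a mapping from taxid to codon counts. The function name `select_organisms`, the toy counts, and every taxid other than 469008 are placeholders.

```python
def select_organisms(database, reference_taxid, host_taxids):
    """Return (taxid, codon_counts) rows with the reference organism first."""
    ordered = [reference_taxid] + [t for t in host_taxids if t != reference_taxid]
    missing = [t for t in ordered if t not in database]
    if missing:
        raise KeyError(f"taxid(s) not found in database: {missing}")
    return [(taxid, database[taxid]) for taxid in ordered]

# Toy stand-in for the full codon usage table: taxid -> {codon: count}.
toy_database = {
    469008: {"AAA": 300, "AAG": 100},  # reference organism used on this page
    1001: {"AAA": 250, "AAG": 150},    # placeholder host
    1002: {"AAA": 80, "AAG": 320},     # placeholder host
    1003: {"AAA": 210, "AAG": 190},    # placeholder organism that is not selected
}

data_formatted = select_organisms(toy_database, 469008, [1002, 1001])
for taxid, counts in data_formatted:
    print(taxid, counts)
```

Keeping the reference in the first row means that later steps can always treat row 0 as the native host without an extra lookup.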


      Our codon harmonization script requires the following three inputs:

      1. The newly generated database containing data of the organisms of interest (data_formatted).
      2. Database containing recognition sites for type II restriction enzymes (restriction_enzymes_database).
      3. Nucleotide sequence of the gene of interest that will be harmonized.

      The script generates the following two outputs:

      1. A codon harmonized nucleotide sequence usable in all organisms of interest.
      2. The deviation of the endogenous GC content of each organism from the mean GC content across these organisms.
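The second output can be illustrated with a small Python sketch. It assumes that the GC content of each organism is estimated from its codon counts, which may differ from how the actual script obtains it. The helper names and the toy counts are hypothetical.

```python
def gc_content(codon_counts):
    """GC fraction of the coding sequences summarized by a codon count table."""
    gc_bases = sum(
        n * sum(base in "GC" for base in codon) for codon, n in codon_counts.items()
    )
    return gc_bases / (3 * sum(codon_counts.values()))

def gc_deviation(gc_by_organism):
    """Deviation of each organism's GC content from the mean across organisms."""
    mean_gc = sum(gc_by_organism.values()) / len(gc_by_organism)
    return {org: gc - mean_gc for org, gc in gc_by_organism.items()}

# Made-up codon counts for three organisms.
toy_counts = {
    "reference": {"GCG": 50, "AAA": 50},
    "host_a": {"GCG": 25, "AAA": 75},
    "host_b": {"GCG": 75, "AAA": 25},
}

gc = {org: gc_content(counts) for org, counts in toy_counts.items()}
deviation = gc_deviation(gc)
print(deviation)
```

A large deviation for one organism is a hint that a single shared sequence may fit that host less well than the others.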

    • From nucleotide input to nucleotide output

      First, the codon frequency is calculated in the same way as in Athey et al. (2017). We used this formula instead of the CAI formula because the CAI calculation requires a reference gene. The CAI is very useful when you want to change the expression of a gene within the same organism. Our system, however, has to function across species, so an adapted version is used, as shown in the equation below. Here $n_{codon,i}$ is the number of occurrences of codon $i$ in an organism and the denominator is the sum of the codon counts.

      $$freq_{codon,i} = \frac{n_{codon,i}}{\sum_{j} n_{codon,j}}$$
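A minimal Python sketch of this frequency calculation is given below. The text does not state which codons the sum covers, so the sketch assumes that counts are normalized within each group of synonymous codons. Normalizing over all codons would be an equally simple variant. The function name and the counts are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def codon_frequencies(codon_counts):
    """Frequency of each codon within its synonymous group (assumed normalization)."""
    totals = {}
    for codon, n in codon_counts.items():
        aa = CODON_TABLE[codon]
        totals[aa] = totals.get(aa, 0) + n
    return {
        codon: n / totals[CODON_TABLE[codon]]
        for codon, n in codon_counts.items()
        if totals[CODON_TABLE[codon]] > 0
    }

# Made-up counts for lysine (AAA, AAG) and glutamate (GAA, GAG).
counts = {"AAA": 300, "AAG": 100, "GAA": 150, "GAG": 50}
freq = codon_frequencies(counts)
print(freq)
```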

      Secondly, the variance is calculated separately for each codon position. The codon frequency at that particular sequence position is also taken into account during this calculation in order to remove outliers as much as possible. The variances calculated for each position are then sorted from lowest to highest.
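One possible reading of this ranking step is sketched below in Python. It assumes that the variance is the population variance of a codon's frequency across the organisms of interest and that the synonymous codons are then sorted by it. The outlier handling mentioned above is not specified in detail, so it is left out. The frequencies and the function name are hypothetical.

```python
from statistics import pvariance

# Made-up within-amino-acid frequencies for the isoleucine codons.
# Row 0 is the reference organism, the other rows are candidate hosts.
freq_tables = [
    {"ATT": 0.5, "ATC": 0.4, "ATA": 0.1},
    {"ATT": 0.2, "ATC": 0.5, "ATA": 0.3},
    {"ATT": 0.4, "ATC": 0.4, "ATA": 0.2},
]

def rank_by_variance(synonymous_codons, tables):
    """Sort synonymous codons from lowest to highest variance across organisms."""
    scored = []
    for codon in synonymous_codons:
        values = [table[codon] for table in tables]
        scored.append((pvariance(values), codon))
    scored.sort()
    return [codon for _, codon in scored]

ranking = rank_by_variance(["ATT", "ATC", "ATA"], freq_tables)
print(ranking)
```

The first codon in the ranking is the one whose usage differs least between the organisms, which makes it the natural first choice for a sequence that has to behave similarly in all of them.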

      In the first iteration of generating the final sequence, we use the lowest-variance codon at every position. The generated sequence is then screened for type II restriction enzyme recognition sequences. If a site is found, the codon at that position is substituted with the synonymous codon that has the second-lowest variance. By going through this cycle multiple times, we derive a single nucleotide sequence that is free of type II restriction sites and codon harmonized for all organisms of interest, with the aim of achieving the same translation rate in each of them.
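The iteration cycle can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual script. It assumes that the forbidden sites are the five BioBrick RFC 10 recognition sequences, that each position comes with a list of synonymous codons already sorted by variance, and that on each hit the first overlapping codon with an alternative left is moved to its next-ranked option. All function names are hypothetical.

```python
# Recognition sequences of the RFC 10 enzymes (all palindromic).
SITES = {
    "EcoRI": "GAATTC",
    "XbaI": "TCTAGA",
    "SpeI": "ACTAGT",
    "PstI": "CTGCAG",
    "NotI": "GCGGCCGC",
}

def find_site(sequence, sites):
    """Return (start, end) of the leftmost forbidden site, or None if clean."""
    hits = [(sequence.find(s), len(s)) for s in sites.values() if s in sequence]
    if not hits:
        return None
    start, length = min(hits)
    return start, start + length

def remove_sites(ranked_codons, sites):
    """Start from the lowest-variance codons and swap codons until no site remains."""
    choice = [0] * len(ranked_codons)
    max_rounds = sum(len(options) for options in ranked_codons) + 1
    for _ in range(max_rounds):
        sequence = "".join(opts[i] for opts, i in zip(ranked_codons, choice))
        hit = find_site(sequence, sites)
        if hit is None:
            return sequence
        start, end = hit
        # Try the codon positions that overlap the site, left to right.
        for pos in range(start // 3, (end - 1) // 3 + 1):
            if choice[pos] + 1 < len(ranked_codons[pos]):
                choice[pos] += 1
                break
        else:
            raise ValueError("no synonymous alternative left to remove the site")
    raise ValueError("site removal did not converge")

# Ranked options for Glu-Phe-Lys; the first choices would spell an EcoRI site.
ranked = [["GAA", "GAG"], ["TTC", "TTT"], ["AAA", "AAG"]]
harmonized = remove_sites(ranked, SITES)
print(harmonized)
```

Because every round either returns a clean sequence or advances one codon to a higher-variance alternative, the loop is bounded by the total number of available synonymous options.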