Modelling
A Shannon entropy approach to the assessment of codon harmonisation methods
Introduction
The concept of entropy, introduced by Claude Shannon in the foundational work of information theory, A Mathematical Theory of Communication (1948), is a measure of the information content of a random variable. The entropy \(H\) of a given codon \(C_{XYZ}\), measured in bits, where \(XYZ\) belongs to the set of sixty-four triplet codons and \(p(C_{XYZ})\) is the probability of selecting that codon at random from some exonic sequence, is:
$$H(C_{XYZ}) = - p(C_{XYZ})\log_{2}{p(C_{XYZ})}$$
We can then characterise the entropy of the ‘alphabet’ of all codons (\(c\)) for a species (\(S\)) as:
$$H(S) = - \sum_{c \in S}p(c)\log_{2}p(c)$$
We suppose that the variability of codon usage has some impact on the process of translation. For example, rare codons may reflect a scarcity of the corresponding tRNA, and as such could moderate the rate of translation, often with some biological purpose, e.g. to decelerate translation in difficult-to-fold regions.
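The two definitions above can be sketched in Python as follows. This is only an illustration: the function names and the toy sequence are hypothetical and are not taken from Codonator 3000, and the probabilities here are estimated from a single short sequence rather than from the entire exonic content of a species.

```python
import math
from collections import Counter


def codon_probabilities(cds: str) -> dict:
    """Relative frequency of each codon in an in-frame coding sequence."""
    usable = len(cds) - len(cds) % 3  # ignore any trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def codon_entropy(p: float) -> float:
    """Per-codon term -p * log2(p) in bits; taken as 0 when p is 0."""
    return -p * math.log2(p) if p > 0 else 0.0


def species_entropy(probs: dict) -> float:
    """H(S): the sum of the per-codon terms over every codon observed."""
    return sum(codon_entropy(p) for p in probs.values())


# Hypothetical toy sequence: ATG GCT GCT TAA
toy_probs = codon_probabilities("ATGGCTGCTTAA")
toy_entropy = species_entropy(toy_probs)
```

With real data, `probs` would instead be built from genome-wide codon counts, such as those in a codon usage table.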
It is often possible to optimise heterologous protein expression by taking advantage of codon degeneracy. Codon harmonisation is one such approach, wherein each codon in a given coding sequence is replaced by a synonymous codon that has some similarity in usage to the original in a destination organism.
In the information-theoretic sense of the word, the information encoded in the harmonised sequence should closely mirror the information in the sequence in the source organism. Consequently, we hypothesise that minimising the difference in information entropy at each codon would improve heterologous expression.
Using the Codonator 3000 software tool, we have generated four recoded sequences based on VVD36-C73A, a fluoroprotein derived from the ascomycete Neurospora crassa, for expression in Escherichia coli. Three of these sequences correspond to the rank order, nearest frequency, and relative proportion harmonisation methods, and the fourth to codon optimisation.
We introduce the measure of the average entropy deviation (\(D\)) per codon. For a sequence of \(n\) codons, this measure is the square root of the sum of squared deviations between the entropy of the codon \(XYZ_i\) at position \(i\) in the source species (A) and the entropy of the harmonised codon \(LKJ_i\) at the same position in the destination species (B), divided by \(n\).
$$D = \frac{\sqrt{\sum\limits_{i=1}^{n}[H(C(A)_{XYZ_i})-H(C(B)_{LKJ_i})]^2}}{n}$$
Accordingly, we determine the entropy at each codon of VVD36-C73A in N. crassa. For brevity, we will show results only for the first ten codons in the sequence. The table below shows both the probability \(p\) of a given codon \(XYZ\) being selected at random from the entire exonic content of N. crassa, and the entropy \(H\) for each codon.
$$\textbf{Table 1: }\text{Calculation of the Shannon entropy for the first ten codons of VVD36-C73A in }\textit{N. crassa}$$
\begin{array} {|r|r|r|}\hline
\mathbf{XYZ} & \mathbf{p(XYZ)\text{ in }\textit{N. crassa}} & \mathbf{H(XYZ)} \\ \hline
ATG & 0.0218 & 0.120325712 \\ \hline
CAT & 0.00945 & 0.063555691 \\ \hline
ACG & 0.01354 & 0.084037749 \\ \hline
CTC & 0.02679 & 0.139901709 \\ \hline
TAC & 0.01746 & 0.101962954 \\ \hline
GCT & 0.02113 & 0.117579225 \\ \hline
CCC & 0.02242 & 0.122840747 \\ \hline
GGC & 0.02902 & 0.148199588 \\ \hline
GGT & 0.01828 & 0.105541227 \\ \hline
TAT & 0.00847 & 0.058302587 \\ \hline
\end{array}
Below we calculate the squared entropy deviation of each codon in our codon-optimised VVD36-C73A from the corresponding codon in the original sequence; these are the terms summed in the average entropy deviation (\(D\)).
$$\textbf{Table 2: }\text{Calculation of the average entropy deviation in the first ten codons of a codon-optimised VVD36-C73A in }\textit{E. coli}$$
\begin{array} {|r|r|r|r|}\hline
\mathbf{LKJ} & \mathbf{p(LKJ)\text{ in }\textit{E. coli}} & \mathbf{H(LKJ)} & \mathbf{(H(XYZ)-H(LKJ))^{2}} \\ \hline
ATG & 0.02375 & 0.128153306 & 6.12712 \times 10^{-5} \\ \hline
CAT & 0.01241 & 0.078584502 & 0.000225865 \\ \hline
ACC & 0.01894 & 0.108382632 & 0.000592673 \\ \hline
CTG & 0.03744 & 0.177438484 & 0.001409009 \\ \hline
TAT & 0.02162 & 0.119590807 & 0.000310741 \\ \hline
GCA & 0.023 & 0.125171114 & 5.76368 \times 10^{-5} \\ \hline
CCG & 0.0145 & 0.088563148 & 0.001174954 \\ \hline
GGT & 0.02372 & 0.128034682 & 0.000406623 \\ \hline
GGT & 0.02372 & 0.128034682 & 0.000505955 \\ \hline
TAT & 0.02162 & 0.119590807 & 0.003756246 \\ \hline
\end{array}
As the method of calculation is the same for all codons and for all harmonisation methods, the remaining calculations are not shown here. The full table is available here.
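As a cross-check, the per-codon entropies and squared deviations in Tables 1 and 2 can be recomputed from the tabulated probabilities alone. The sketch below does this for the ten codons shown and then applies the formula for \(D\) to those ten rows only. The names `ROWS`, `codon_entropy` and `entropy_deviation` are illustrative, and the resulting value is not comparable with the full-sequence figures, which use every codon in VVD36-C73A.

```python
import math

# (source codon, p in N. crassa, recoded codon, p in E. coli),
# copied from Tables 1 and 2 (codon-optimised sequence, first ten codons).
ROWS = [
    ("ATG", 0.0218, "ATG", 0.02375),
    ("CAT", 0.00945, "CAT", 0.01241),
    ("ACG", 0.01354, "ACC", 0.01894),
    ("CTC", 0.02679, "CTG", 0.03744),
    ("TAC", 0.01746, "TAT", 0.02162),
    ("GCT", 0.02113, "GCA", 0.023),
    ("CCC", 0.02242, "CCG", 0.0145),
    ("GGC", 0.02902, "GGT", 0.02372),
    ("GGT", 0.01828, "GGT", 0.02372),
    ("TAT", 0.00847, "TAT", 0.02162),
]


def codon_entropy(p: float) -> float:
    """Per-codon term -p * log2(p), in bits."""
    return -p * math.log2(p)


def squared_deviations(rows):
    """(H(XYZ) - H(LKJ))^2 for each aligned codon pair."""
    return [
        (codon_entropy(p_src) - codon_entropy(p_dst)) ** 2
        for _, p_src, _, p_dst in rows
    ]


def entropy_deviation(rows) -> float:
    """D: square root of the summed squared deviations, divided by n."""
    return math.sqrt(sum(squared_deviations(rows))) / len(rows)


squared_terms = squared_deviations(ROWS)
d_first_ten = entropy_deviation(ROWS)
```

On these ten rows alone, `d_first_ten` comes to roughly 0.0092. This is larger than the full-sequence values because, with this definition, \(D\) tends to shrink as \(n\) grows.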
The average entropy deviation for each method was found to be:
Optimisation: \(D = 0.003204981\)
Rank order: \(D = 0.00145138\)
Nearest frequency: \(D = 0.001277842\)
Relative proportion: \(D = 0.001447648\)
The nearest frequency harmonisation approach produces the output with the lowest average entropy deviation, at 0.001277842 bits per codon.
GC content is highly correlated with entropy
Total exonic GC content for N. crassa and E. coli was determined from total codon counts in the Codonator 3000 database (derived from the Kazusa codon usage database). GC content was found to be 47.3% for E. coli and 56.1% for N. crassa. We then calculated GC content for the input VVD36-C73A sequence, as well as for the output harmonisation sequences.
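For a single sequence, GC content is simply the percentage of bases that are G or C. A minimal sketch, assuming a plain nucleotide string with no ambiguity codes (the function name is illustrative and is not part of Codonator 3000):

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)
```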
\begin{array} {|r|r|}\hline
\mathbf{\text{Sequence}} & \mathbf{\text{GC content (percent)}} \\ \hline
\text{Input} & 50.2 \\ \hline
\text{Optimisation} & 41.0 \\ \hline
\text{Rank order} & 43.2 \\ \hline
\text{Nearest frequency} & 44.5 \\ \hline
\text{Relative proportion} & 43.6 \\ \hline
\end{array}
It is notable that the output of the nearest frequency method has not only the lowest average entropy difference, but also the closest GC content to the source organism. Interestingly, GC content was found to have a strong inverse correlation with average entropy difference (\(r = -0.9577\)). To explore further, we determined sequence identity with the original input using a discontiguous megablast.
\begin{array} {|r|r|}\hline
\mathbf{\text{Sequence}} & \mathbf{\text{Sequence identity}} \\ \hline
\text{Optimisation} & 340 \\ \hline
\text{Rank order} & 316 \\ \hline
\text{Nearest frequency} & 322 \\ \hline
\text{Relative proportion} & 332 \\ \hline
\end{array}
In this case, the correlation between average entropy difference and sequence identity was found to be \(r = 0.7877\). GC content therefore appears to be more closely related to the average entropy difference than sequence identity is. We have plotted the average entropy difference for each harmonisation output against GC content below, revealing a roughly parabolic relationship, in line with previous literature (Frappat et al., 2003).
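The two correlation coefficients quoted above can be approximately re-derived from the tabulated values. The sketch below assumes that \(r\) is the Pearson product-moment coefficient, which the text does not state explicitly, and uses the rounded GC percentages from the table, so small differences in the later decimal places are expected.

```python
import math


def pearson_r(xs, ys) -> float:
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Order: optimisation, rank order, nearest frequency, relative proportion
d_values = [0.003204981, 0.00145138, 0.001277842, 0.001447648]
gc_percent = [41.0, 43.2, 44.5, 43.6]
identity = [340, 316, 322, 332]

r_gc = pearson_r(gc_percent, d_values)      # GC content vs. D
r_identity = pearson_r(identity, d_values)  # sequence identity vs. D
```

With only four data points per comparison, both coefficients should be read as indicative rather than statistically robust.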
Conclusion
In summary, we have constructed a model to assess the optimal codon harmonisation method using Shannon entropy as a measure of the information content of a given coding sequence. Based on these results, we conclude that the nearest frequency harmonisation method is optimal, and that the average entropy difference is highly correlated with the GC content of the output sequence. The results of our model were confirmed experimentally, with the nearest frequency harmonisation variant of VVD36-C73A exhibiting the greatest fluorescence in culture. Consequently, we will extend our Shannonian model to the psilocybin biosynthesis genes from Psilocybe cubensis in our attempt to optimise the production of psilocybin in our E. coli expression system.
References
- Frappat, L., Minichini, C., Sciarrino, A. & Sorba, P. Universality and Shannon entropy of codon usage. Phys Rev E Stat Nonlin Soft Matter Phys 68, 061910 (2003).
- Shannon, C.E. A Mathematical Theory of Communication. Bell System Technical Journal 27, 379-423 (1948).