Software
Codonator 3000
CODONATOR 3000 Copyright © 2019, Tristan Ofner, Fahad Ali, Merrie Caruana, Benj Gonzaga, Nathan Hawkins, Isobel McGrath, Emma Todd.
This program is licensed under GNU General Public License v3.0 or later.
Documentation
Introduction to the Codonator 3000
Codonator 3000 is a web-based tool that generates codon-harmonised sequences for any input coding sequence. It produces an output for each of four algorithms: codon optimisation, and the rank order, nearest frequency, and relative proportion codon harmonisation methods. The latter two also let the user set a minimum frequency threshold, so that codons rarer than the threshold are not accepted in the harmonised sequence.
How to use the Codonator 3000
- Enter the source coding sequence to be harmonised into the gene sequence entry box.
- Search for the name of your source species. You must use the scientific name of the species, or some substring of it: for example, "Escherichia" or "coli" will return results, but "E. coli" will not. The search results are ordered by the number of coding sequences on which each species' codon frequency table is based. If the database is not in your browser cache, this may take some time, as the data file is retrieved from the iGEM servers.
- Search for the name of your destination/target species.
- If you want to exclude rare codons from the harmonised output, click the checkbox and enter the minimum frequency threshold (in codons per thousand).
- Click Execute.
- Codon Usage Database Search: When you search for an organism, the Codonator retrieves and searches its codon usage database of 6148 species for the species name specified.
- Presentation of Codon Usage Biases: As all species have unique codon usage biases, the Codonator displays these biases in an easy-to-read table, for direct comparison of the source and destination species.
- Execution: On execution, the Codonator 3000 reads the input gene sequence in ‘chunks’ of three nucleotides and translates each codon according to the destination species’ codon usage bias, using four different algorithms:
  - Codon Optimisation
  - Codon Harmonisation – Rank Order
  - Codon Harmonisation – Nearest Frequency
  - Codon Harmonisation – Relative Proportion
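The search and chunking steps above can be sketched in Python as follows. This is a minimal illustration, not the tool's JavaScript source: the function names, the case-insensitive matching, the rejection of sequences whose length is not a multiple of three, and the toy CDS counts are all assumptions. Only the field names `speciesName` and `CDS` come from the processing script in the appendix.

```python
def search_species(species, query):
    """Return species whose name contains `query`, most coding sequences first."""
    needle = query.lower()  # assumption: the search ignores case
    hits = [s for s in species if needle in s["speciesName"].lower()]
    return sorted(hits, key=lambda s: s["CDS"], reverse=True)


def split_codons(sequence):
    """Split a coding sequence into consecutive three-nucleotide chunks."""
    sequence = sequence.strip().upper()
    if len(sequence) % 3 != 0:
        # Assumption: an incomplete trailing codon is rejected, not dropped.
        raise ValueError("sequence length must be a multiple of three")
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


# Toy records with made-up CDS counts, for illustration only.
toy_species = [
    {"speciesName": "Bacillus subtilis", "CDS": 50},
    {"speciesName": "Escherichia coli K-12", "CDS": 120},
    {"speciesName": "Escherichia coli", "CDS": 300},
]

print([s["speciesName"] for s in search_species(toy_species, "coli")])
print(search_species(toy_species, "E. coli"))  # abbreviated names find nothing
print(split_codons("atgaaatag"))
```

A plain substring match reproduces the behaviour described in the usage notes: "coli" matches both toy Escherichia entries, while "E. coli" is not a substring of any scientific name.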
Codon Optimisation
For each codon in the input sequence, codon optimisation selects, as the translated codon, the triplet that occurs most frequently in the destination species among those coding for the same amino acid.
The following pseudocode describes the implementation of Codon Optimisation in the Codonator 3000:
- For each amino acid:
  - Sort source species codons by frequency, ascending.
  - Sort destination species codons by frequency, ascending.
- For each triplet in input sequence:
  - AALength = number of codons coding for the amino acid.
  - Take codonOut at the AALength-th index in the destination species (the last, most frequent codon).
  - Append codonOut to output sequence.
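The pseudocode above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tool's JavaScript source: the function name `optimise`, the table layout (amino acid mapped to codon frequencies) and the toy frequencies are hypothetical.

```python
def optimise(sequence, destination):
    """Replace every codon with the destination's most frequent synonym."""
    # Reverse lookup: codon -> amino acid, built from the usage table itself.
    lookup = {codon: aa for aa, usage in destination.items() for codon in usage}
    output = []
    for i in range(0, len(sequence), 3):
        codon_in = sequence[i:i + 3]
        usage = destination[lookup[codon_in]]
        ranked = sorted(usage, key=usage.get)  # ascending by frequency
        output.append(ranked[-1])              # last entry is the most frequent
    return "".join(output)


# Toy destination table with made-up frequencies (per thousand codons).
toy_destination = {
    "Lys": {"AAA": 30, "AAG": 10},
    "Phe": {"TTC": 5, "TTT": 20},
}

print(optimise("AAGTTC", toy_destination))  # AAATTT
```

The source species table is not needed here, because the result depends only on the amino acid each input codon encodes.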
Codon Harmonisation – Rank Order
The rank-order codon harmonisation algorithm replaces each input codon with the destination species codon that holds the same frequency rank, within its amino acid, as the input codon holds in the source species codon usage table.
The following pseudocode describes the implementation of Codon Harmonisation by Rank in the Codonator 3000:
- For each amino acid:
  - Sort source species codons by frequency, ascending.
  - Sort destination species codons by frequency, ascending.
- For each triplet in input sequence:
  - Find index (rank) of codonIn within relevant amino acid in source species.
  - Take codonOut at same index in destination species.
  - If frequency cutoff threshold specified:
    - While codonOut frequency < threshold frequency AND index < codonCount:
      - Increase index by 1.
      - Get codonOut at index in destination species.
  - Append codonOut to output sequence.
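A minimal Python sketch of the rank-order method follows. It is an illustration under assumptions rather than the tool's JavaScript source: the function name, the table layout and the toy frequencies are hypothetical, and ties in frequency are broken by table order because Python's sort is stable.

```python
def harmonise_by_rank(sequence, source, destination, threshold=None):
    """Map each codon to the destination codon holding the same frequency rank."""
    lookup = {codon: aa for aa, usage in source.items() for codon in usage}
    output = []
    for i in range(0, len(sequence), 3):
        codon_in = sequence[i:i + 3]
        aa = lookup[codon_in]
        src_ranked = sorted(source[aa], key=source[aa].get)            # ascending
        dst_ranked = sorted(destination[aa], key=destination[aa].get)  # ascending
        index = src_ranked.index(codon_in)
        if threshold is not None:
            # Move up the ranking until the codon is frequent enough,
            # stopping at the most frequent codon.
            while (destination[aa][dst_ranked[index]] < threshold
                   and index < len(dst_ranked) - 1):
                index += 1
        output.append(dst_ranked[index])
    return "".join(output)


# Toy tables with made-up frequencies (per thousand codons).
toy_source = {"Lys": {"AAA": 10, "AAG": 30}, "Phe": {"TTC": 25, "TTT": 15}}
toy_destination = {"Lys": {"AAA": 35, "AAG": 5}, "Phe": {"TTC": 8, "TTT": 22}}

print(harmonise_by_rank("AAATTC", toy_source, toy_destination))                # AAGTTT
print(harmonise_by_rank("AAATTC", toy_source, toy_destination, threshold=10))  # AAATTT
```

In the toy data, AAA is the rarer lysine codon in the source, so it maps to AAG, the rarer lysine codon in the destination. With a threshold of 10, AAG (frequency 5) is rejected and the next codon up the ranking is used instead.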
Codon Harmonisation – Nearest Frequency
The absolute frequency codon harmonisation algorithm translates each source codon into the synonymous destination species codon whose frequency of occurrence is closest to that of the source codon. Frequencies are absolute: each is measured across all codons in the species' coding sequences, not within a single amino acid.
The following pseudocode describes the implementation of Codon Harmonisation by Absolute Frequency in the Codonator 3000:
- For each amino acid:
  - Sort source species codons by frequency, ascending.
  - Sort destination species codons by frequency, ascending.
- For each triplet in input sequence:
  - Set nearest frequency difference arbitrarily large.
  - For each destination species triplet coding for relevant amino acid:
    - Calculate frequency difference between codonIn and triplet.
    - If calculated frequency difference < nearest frequency difference:
      - Set nearest frequency difference to calculated difference.
      - Set codonOut to this triplet.
  - If frequency cutoff threshold specified:
    - While codonOut frequency < threshold frequency:
      - Get codonOut with next highest frequency.
  - Append codonOut to output sequence.
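The nearest-frequency method can be sketched in Python as below. The function name, table layout and toy frequencies are hypothetical, and the upper bound on the threshold loop (so that it stops at the most frequent codon) is an added assumption, not something the pseudocode states.

```python
def harmonise_nearest(sequence, source, destination, threshold=None):
    """Map each codon to the synonymous destination codon of closest frequency."""
    lookup = {codon: aa for aa, usage in source.items() for codon in usage}
    output = []
    for i in range(0, len(sequence), 3):
        codon_in = sequence[i:i + 3]
        aa = lookup[codon_in]
        target = source[aa][codon_in]
        dst = destination[aa]
        # Smallest absolute difference wins; ties go to the first codon listed.
        codon_out = min(dst, key=lambda c: abs(dst[c] - target))
        if threshold is not None:
            ranked = sorted(dst, key=dst.get)  # ascending by frequency
            index = ranked.index(codon_out)
            # Assumption: stop at the most frequent codon if none passes.
            while dst[ranked[index]] < threshold and index < len(ranked) - 1:
                index += 1
            codon_out = ranked[index]
        output.append(codon_out)
    return "".join(output)


# Toy tables with made-up frequencies (per thousand codons), partial Leu set.
toy_source = {"Leu": {"CTG": 50, "CTA": 4, "TTA": 14}}
toy_destination = {"Leu": {"CTG": 10, "CTA": 6, "TTA": 27}}

print(harmonise_nearest("CTACTGTTA", toy_source, toy_destination))               # CTATTACTG
print(harmonise_nearest("CTACTGTTA", toy_source, toy_destination, threshold=8))  # CTGTTACTG
```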
Codon Harmonisation – Relative Proportion
The relative frequency codon harmonisation algorithm translates each source codon into the synonymous destination species codon with the closest frequency of occurrence, after accounting for how often each amino acid itself appears. This approach differs from the absolute frequency method in that codon frequencies are normalised within each amino acid.
The following pseudocode describes the implementation of Codon Harmonisation by Relative Frequency in the Codonator 3000:
- For each amino acid:
  - Calculate source codon frequency, as proportion of all codons coding for amino acid in source species.
  - Calculate destination codon frequency, as proportion of all codons coding for amino acid in destination species.
- For each triplet in input sequence:
  - Set nearest relative frequency difference arbitrarily large.
  - For each destination species triplet coding for relevant amino acid:
    - Calculate relative frequency difference between codonIn and triplet.
    - If calculated relative frequency difference < nearest relative frequency difference:
      - Set nearest relative frequency difference to calculated difference.
      - Set codonOut to this triplet.
  - If frequency cutoff threshold specified:
    - While codonOut absolute frequency < threshold frequency:
      - Get codonOut with next highest absolute frequency.
  - Append codonOut to output sequence.
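A minimal Python sketch of the relative-proportion method follows. As with any illustration of this kind, the function names, table layout and toy frequencies are hypothetical, and the bound that stops the threshold loop at the most frequent codon is an added assumption.

```python
def relative(usage):
    """Normalise codon frequencies so they sum to 1 within one amino acid."""
    total = sum(usage.values())
    return {codon: count / total for codon, count in usage.items()}


def harmonise_relative(sequence, source, destination, threshold=None):
    """Map each codon to the destination codon of closest within-amino-acid share."""
    lookup = {codon: aa for aa, usage in source.items() for codon in usage}
    output = []
    for i in range(0, len(sequence), 3):
        codon_in = sequence[i:i + 3]
        aa = lookup[codon_in]
        target = relative(source[aa])[codon_in]
        dst_rel = relative(destination[aa])
        codon_out = min(dst_rel, key=lambda c: abs(dst_rel[c] - target))
        if threshold is not None:
            dst_abs = destination[aa]
            ranked = sorted(dst_abs, key=dst_abs.get)  # ascending, absolute
            index = ranked.index(codon_out)
            # The cutoff is applied to absolute, not relative, frequency.
            while dst_abs[ranked[index]] < threshold and index < len(ranked) - 1:
                index += 1
            codon_out = ranked[index]
        output.append(codon_out)
    return "".join(output)


# Toy tables with made-up frequencies (per thousand codons).
toy_source = {"Lys": {"AAA": 10, "AAG": 30}}
toy_destination = {"Lys": {"AAA": 4, "AAG": 1}}

print(harmonise_relative("AAAAAG", toy_source, toy_destination))               # AAGAAA
print(harmonise_relative("AAAAAG", toy_source, toy_destination, threshold=2))  # AAAAAA
```

In the toy data, AAA makes up a quarter of source lysine codons, which is closest to the one-fifth share of AAG in the destination, so AAA maps to AAG even though the absolute frequencies of AAA (10 and 4) are closer to each other.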
The codon frequency tables were collected from the codon usage database provided by the Kazusa DNA Research Institute. We wrote a custom Python script to aggregate and transform the usage frequencies into a structured JSON text file suitable for integration with a JavaScript front-end. This script may be found in the appendix. The steps in the process are described below:
- Parsing the source data. For each species, the source data extracted from Kazusa gives the name of the species, followed by the number of coding sequences, and then, for each codon, the number of times it occurs among those coding sequences. Kazusa also divides the codon usage data between species types: for example, bacteria, plants, invertebrates, etc. Our script identifies the start and end of a single species’ usage data according to the presence of the new-line (“\n”) character.
- For each species as defined by the presence of the new line characters, the script gathers the ‘metadata’ of the species, including the name of the species, the number of coding sequences, and the number of codons, calculated independently by the summation of all codons. This is structured into a string of key-value pairs.
- Using a lookup table, we then append a series of key-value pairs for each amino acid. The name of the amino acid becomes a key, and the value is a nested object of key-value pairs, where each key and value, respectively, indicates the codon and the number of times the codon occurs among the coding sequences.
- The species, each with a unique integer identifier, and their codon usages are appended into a single string, structured as one JSON document. Storing the data as a single, large JSON document allows simple integration with a JavaScript front-end. The codon usages are stored as integer counts, to be converted to a frequency per 1000 codons at runtime, in order to reduce the demands on throughput as the data is requested from the server: a float (expressed as a decimal) requires more memory than an integer.
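To illustrate the record layout and the runtime conversion described above, the sketch below parses one toy record and converts its integer counts to frequencies per thousand codons. The field names (`speciesID`, `speciesName`, `CDS`, `nbCodons`, `AminoAcids`) follow the processing script in the appendix. The species, the counts and the helper `per_thousand` are made up for illustration, and the tool itself performs this step in JavaScript.

```python
import json

# Toy record with made-up counts; only two amino acids are shown.
record_text = (
    '{"speciesID":0,"speciesName":"Example species","CDS":12,"nbCodons":2000,'
    '"AminoAcids":{"Lys":{"AAA":1200,"AAG":500},"Trp":{"TGG":300}}}'
)


def per_thousand(record):
    """Convert integer codon counts to frequencies per 1000 codons."""
    total = record["nbCodons"]
    return {
        amino_acid: {codon: 1000 * count / total for codon, count in usage.items()}
        for amino_acid, usage in record["AminoAcids"].items()
    }


record = json.loads(record_text)
frequencies = per_thousand(record)
print(frequencies["Lys"])  # {'AAA': 600.0, 'AAG': 250.0}
```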
The next step of the development process was to build the logic that produces the optimised and harmonised output. The algorithms were implemented as described in the pseudocode above. The implementation uses JavaScript, which runs natively in all modern browsers, embedded within an HTML file. Unit tests were run on a Google Cloud server, to test both the technical integration of the front-end HTML and JavaScript with the JSON-structured text-file database, and the integrity of the algorithm implementations.
Once these tests had passed, an HTML GUI was created so that prospective users can use the Codonator 3000 intuitively.
Codon Database Processing Script (Python)
src = ['gbbct', 'gbpln.spsum', 'gbvrt.spsum', 'gbvrl.spsum', 'gbrod.spsum', 'gbpri.spsum', 'gbphg.spsum', 'gbmam.spsum', 'gbinv.spsum']
fulltext = '['
i = 0
#read each species-type file in turn.
for spec in range(len(src)):
    with open('workingFolder/' + src[spec]) as srcFile:
        bacteria = srcFile.read().split('\n')
    codonUsage = []
    for index in range(len(bacteria)):
        #each species spans two lines: join the pair, then split the result by colons.
        if index%2 == 0:
            try:
                bacteria[index] = bacteria[index]+":"+bacteria[index+1]
                bacteria[index] = bacteria[index].split(':')
                #a species name containing colons is split into extra fields:
                #merge them back into field 1 and delete the leftovers.
                if len(bacteria[index]) == 7:
                    bacteria[index][1] = bacteria[index][1] + bacteria[index][2] + bacteria[index][3] + bacteria[index][4]
                    del bacteria[index][2:5]
                elif len(bacteria[index]) == 6:
                    bacteria[index][1] = bacteria[index][1] + bacteria[index][2] + bacteria[index][3]
                    del bacteria[index][2:4]
                elif len(bacteria[index]) == 5:
                    bacteria[index][1] = bacteria[index][1] + bacteria[index][2]
                    del bacteria[index][2]
                codonUsage.append(bacteria[index])
            except IndexError:
                #a trailing line without a partner line is skipped
                pass
    bacteria = list(codonUsage)
    #lookup table of codons and amino acids
    text = ""
    arg = ['CGA', 'CGC', 'CGG', 'CGT', 'AGA', 'AGG']
    leu = ['CTA', 'CTC', 'CTG', 'CTT', 'TTA', 'TTG']
    ser = ['TCA', 'TCC', 'TCG', 'TCT', 'AGC', 'AGT']
    thr = ['ACA', 'ACC', 'ACG', 'ACT']
    pro = ['CCA', 'CCC', 'CCG', 'CCT']
    ala = ['GCA', 'GCC', 'GCG', 'GCT']
    gly = ['GGA', 'GGC', 'GGG', 'GGT']
    val = ['GTA', 'GTC', 'GTG', 'GTT']
    lys = ['AAA', 'AAG']
    asn = ['AAC', 'AAT']
    gln = ['CAA', 'CAG']
    his = ['CAC', 'CAT']
    glu = ['GAA', 'GAG']
    asp = ['GAC', 'GAT']
    tyr = ['TAC', 'TAT']
    cys = ['TGC', 'TGT']
    phe = ['TTC', 'TTT']
    ile = ['ATA', 'ATC', 'ATT']
    met = ['ATG']
    trp = ['TGG']
    stp = ['TAA', 'TAG', 'TGA']
    triplets = ['CGA', 'CGC', 'CGG', 'CGT', 'AGA', 'AGG', 'CTA', 'CTC', 'CTG', 'CTT', 'TTA', 'TTG', 'TCA', 'TCC', 'TCG', 'TCT', 'AGC', 'AGT', 'ACA', 'ACC', 'ACG', 'ACT', 'CCA', 'CCC', 'CCG', 'CCT', 'GCA', 'GCC', 'GCG', 'GCT', 'GGA', 'GGC', 'GGG', 'GGT', 'GTA', 'GTC', 'GTG', 'GTT', 'AAA', 'AAG', 'AAC', 'AAT', 'CAA', 'CAG', 'CAC', 'CAT', 'GAA', 'GAG', 'GAC', 'GAT', 'TAC', 'TAT', 'TGC', 'TGT', 'TTC', 'TTT', 'ATA', 'ATC', 'ATT', 'ATG', 'TGG', 'TAA', 'TAG', 'TGA']
    #for each species. Ignore the bacteria, script tested against the bacteria file.
    for bacterium in bacteria:
        bacterium[1] = bacterium[1].replace('"',"'")
        if (len(bacterium)) != 4:
            print(bacterium[0])
        #skip species with fewer than 10 coding sequences
        if int(bacterium[2]) < 10:
            continue
        text = text + "{"
        row = '"speciesID":' + str(i) + ","
        #build species metadata
        row = row + '"speciesName":"' + bacterium[1] + '",'
        #row = row + '"speciesType":' + speciesName[spec] + ','
        row = row + '"CDS":' + bacterium[2] + ',"nbCodons":'
        j = 0
        sumCodons = 0
        argtxt = '"Arg":{'
        leutxt = '"Leu":{'
        sertxt = '"Ser":{'
        thrtxt = '"Thr":{'
        protxt = '"Pro":{'
        alatxt = '"Ala":{'
        glytxt = '"Gly":{'
        valtxt = '"Val":{'
        lystxt = '"Lys":{'
        asntxt = '"Asn":{'
        glntxt = '"Gln":{'
        histxt = '"His":{'
        glutxt = '"Glu":{'
        asptxt = '"Asp":{'
        tyrtxt = '"Tyr":{'
        cystxt = '"Cys":{'
        phetxt = '"Phe":{'
        iletxt = '"Ile":{'
        mettxt = '"Met":{'
        trptxt = '"Trp":{'
        stptxt = '"STOP":{'
        amAcid = ',"AminoAcids":{'
        codonCnt = ',"Codons":{'
        for value in bacterium[3].split(' '):
            if value == '':
                pass
            #check the lookup table for which amino acid the codon codes for
            else:
                append = '"' + triplets[j] + '":' + value + ','
                if triplets[j] in arg:
                    argtxt = argtxt + append
                elif triplets[j] in leu:
                    leutxt = leutxt + append
                elif triplets[j] in ser:
                    sertxt = sertxt + append
                elif triplets[j] in thr:
                    thrtxt = thrtxt + append
                elif triplets[j] in pro:
                    protxt = protxt + append
                elif triplets[j] in ala:
                    alatxt = alatxt + append
                elif triplets[j] in gly:
                    glytxt = glytxt + append
                elif triplets[j] in val:
                    valtxt = valtxt + append
                elif triplets[j] in lys:
                    lystxt = lystxt + append
                elif triplets[j] in asn:
                    asntxt = asntxt + append
                elif triplets[j] in gln:
                    glntxt = glntxt + append
                elif triplets[j] in his:
                    histxt = histxt + append
                elif triplets[j] in glu:
                    glutxt = glutxt + append
                elif triplets[j] in asp:
                    asptxt = asptxt + append
                elif triplets[j] in tyr:
                    tyrtxt = tyrtxt + append
                elif triplets[j] in cys:
                    cystxt = cystxt + append
                elif triplets[j] in phe:
                    phetxt = phetxt + append
                elif triplets[j] in ile:
                    iletxt = iletxt + append
                elif triplets[j] in met:
                    mettxt = mettxt + append
                elif triplets[j] in trp:
                    trptxt = trptxt + append
                elif triplets[j] in stp:
                    stptxt = stptxt + append
                #codonCnt = codonCnt + '"' + triplets[j] +'":' + val
                sumCodons = sumCodons + int(value)
                #if j < 63:
                #codonCnt = codonCnt + ','
                j+=1
        #strip the trailing ',' from each amino acid entry before closing it
        amAcid = amAcid + argtxt[:-1] + '},' + leutxt[:-1] + '},' + sertxt[:-1] + '},' + thrtxt[:-1] + '},' + protxt[:-1] + '},' + alatxt[:-1] + '},' + glytxt[:-1] + '},' + valtxt[:-1] + '},' + lystxt[:-1] + '},' + asntxt[:-1] + '},' + glntxt[:-1] + '},' + histxt[:-1] + '},' + glutxt[:-1] + '},' + asptxt[:-1] + '},' + tyrtxt[:-1] + '},' + cystxt[:-1] + '},' + phetxt[:-1] + '},' + iletxt[:-1] + '},' + mettxt[:-1] + '},' + trptxt[:-1] + '},' + stptxt[:-1] + '}'
        row = row + str(sumCodons) + amAcid + '}},\n'
        text = text + row
        i +=1
    fulltext = fulltext + text
#write to output, replacing the final ',\n' with the closing ']'
with open('data.txt','w') as bacteriaScript:
    bacteriaScript.write(fulltext[:-2]+']')