Team:KU LEUVEN/Modeling/Bioinformatics

KUL iGEM wiki 2019


KUL iGEM wiki 2019

Bioinformatics

Codon optimization

In order to maximally increase the chances of detecting YFP after S-TIP37 infection of its bacterial host, the codon distribution of the S-TIP37 capsid gene, putatively the most strongly expressed gene on the S-TIP37 genome, was applied to the YFP gene. As no codon optimization tools exist for adapting the codon distribution of a new gene to match the codon distribution of a specific gene (such as the S-TIP37 capsid protein), an in silico investigation was conducted and a custom bioinformatics pipeline was devised.

The pipeline consists of two separate scripts. The first script’s purpose is to capture the codon usage of the capsid protein and then convert it to a table that holds the frequency of occurrence of each codon with respect to the translated target amino acid. The second script reads these frequencies and the gene that needs to be codon-optimized; in our case, that is YFP. Consequently, the script applies this distribution to the gene’s sequence to assemble a codon-optimized nucleotide sequence. Both scripts were written in Python3 and can be found below.

Script: get_codon_freqs

Initialization

from Bio import SeqIO
import pickle

gbk_filename = "S-TIP37.gbk"

codons = {}
codons_counts = {}
codons_frequencies = {}

translation_table = {"AAA": "lys","AAC": "asn","AAG": "lys","AAT": "asn","ACA": "thr","ACC": "thr","ACG": "thr","ACT": "thr","AGA": "arg","AGC": "ser","AGG": "arg","AGT": "ser","ATA": "ile","ATC": "ile","ATG": "met","ATT": "ile","CAA": "gln","CAC": "his","CAG": "gln","CAT": "his","CCA": "pro","CCC": "pro","CCG": "pro","CCT": "pro","CGA": "arg","CGC": "arg","CGG": "arg","CGT": "arg","CTA": "leu","CTC": "leu","CTG": "leu","CTT": "leu","GAA": "glu","GAC": "asp","GAG": "glu","GAT": "asp","GCA": "ala","GCC": "ala","GCG": "ala","GCT": "ala","GGA": "gly","GGC": "gly","GGG": "gly","GGT": "gly","GTA": "val","GTC": "val","GTG": "val","GTT": "val","TAA": "STOP","TAC": "tyr","TAG": "STOP","TAT": "tyr","TCA": "ser","TCC": "ser","TCG": "ser","TCT": "ser","TGA": "STOP","TGC": "cys","TGG": "trp","TGT": "cys","TTA": "leu","TTC": "phe","TTG": "leu","TTT": "phe"}

Read the capsid gene of the phage S-TIP37 and count how many times each codon is used in that gene

for seq_record in SeqIO.parse(gbk_filename, "genbank"):
    seq = seq_record.seq
    for seq_feature in seq_record.features :
        if seq_feature.type=="CDS":
            if seq_feature.qualifiers['product'][0] == 'capsid protein':
                CDS_start = seq_feature.location.start
                CDS_end = seq_feature.location.end

                gene_sequence = seq[CDS_start:CDS_end]
                for i in range(0, len(gene_sequence)//3):
                    codon = str(gene_sequence[i*3:i*3+3])
                    if codon in codons.keys():
                        codons[codon] += 1
                    else:
                        codons[codon] = 1

Count how many times each codon is being used per its respective amino acid

for codon in codons.keys():
    if translation_table[codon] in codons_counts:
        codons_counts[translation_table[codon]][codon] = codons[codon]
    else:
        codons_counts[translation_table[codon]] = {codon: codons[codon]}

Convert counts to frequencies per amino acid (ranging from 0 to 1)

for aa in codons_counts.keys():
    sum = 0
    for aa_codon in codons_counts[aa].keys():
        sum += codons_counts[aa][aa_codon]
    for aa_codon in codons_counts[aa].keys():
        codons_counts[aa][aa_codon] = codons_counts[aa][aa_codon]/sum

codons_counts["trp"] = {"TGG": 1.0}

Write output to file and exit

with open('codon_counts_capsid.p', 'wb') as fp:
    pickle.dump(codons_counts, fp, protocol=pickle.HIGHEST_PROTOCOL)

print("Done")

exit()

Script: codon_optimize

Initialization

from Bio import SeqIO
import pickle
import operator
import math
import random
from collections import Counter

with open('codon_counts_capsid.p', 'rb') as fp:
    codon_counts = pickle.load(fp)

aa_letter_to_trip = {"A": "ala","R": "arg","D": "asp","N": "asn","C": "cys","E": "glu","Q": "gln","G": "gly","H": "his","I": "ile","L": "leu","K": "lys","M": "met","F": "phe","P": "pro","S": "ser","T": "thr","W": "trp","Y": "tyr","V": "val",}
aa_trip_to_letter = {v: k for k, v in aa_letter_to_trip.items()}

aa_list = []
aa_counts = {}

A function that returns a list of codons per amino acid to be used when assembling the new gene_sequence

def optimize_occurences(codon_counts, n_occ):
    number_roundup = n_occ - sum([int(elem) for elem in codon_counts.values()])
    k = Counter(codon_counts)
    codons_to_roundup = []
    for roundup_entry in k.most_common(number_roundup):
        codons_to_roundup.append(roundup_entry[0])

    add_roundup = False
    for codon in codon_counts.keys():
        if codon in codons_to_roundup:
            if int(math.ceil(codon_counts[codon])) == int(codon_counts[codon]):
                add_roundup = True
            codon_counts[codon] = int(math.ceil(codon_counts[codon]))
        else:
            if add_roundup:
                codon_counts[codon] = int(math.ceil(codon_counts[codon]))
                add_roundup = False
            else:
                codon_counts[codon] = int(codon_counts[codon])

    codon_list = []
    for codon in codon_counts.keys():
        for i in range(0, codon_counts[codon]):
            codon_list.append(codon)

    return codon_list

Count the amount of amino acids in the translated gene

for aa in aa_sequence:
    letter = aa_letter_to_trip[aa]
    aa_list.append(letter)

    if letter in aa_counts.keys():
        aa_counts[letter] += 1
    else:
        aa_counts[letter] = 1

Make a dictionary, where every amino acid is a key that corresponds to a set value containing a number of codons distributed according to the capsid gene distribution

dict_of_codons_per_aa = {}
for aa in aa_counts.keys():
    aa_codon_distr = []
    number_occ = aa_counts[aa]
    for codon in codon_counts[aa].keys():
        codon_counts[aa][codon] = codon_counts[aa][codon]*number_occ
    dict_of_codons_per_aa[aa_trip_to_letter[aa]] = optimize_occurences(codon_counts[aa], number_occ)

Replace every amino acid from the gene sequence with a codon from the distribution dictionary, randomly selected from the codon list corresponding to each amino acid

final_sequence = ""
for letter in aa_sequence:
    codon = random.choice(dict_of_codons_per_aa[letter])
    list_copy = dict_of_codons_per_aa[letter].copy()
    list_copy.remove(codon)
    dict_of_codons_per_aa[letter] = list_copy
    final_sequence += codon

Print the assembled codon optimized sequence and exit

print("Codon optimized sequence:")
print(final_sequence)

print("Done")

exit()

KUL iGEM wiki 2019