BioBrick Optimization Tool - synthesis
A software tool to help iGEM teams optimize hard-to-synthesize sequences for expression and synthesis.
- Remove Repeats
- Reduce GC Content
- Remove Hairpins
- And More!
Results
How good is BOT?
With its SPEA2 algorithm (Coello et al., 2007), BOTs can optimize sequences for both expression and ease of synthesis. This is evidenced by its ability to reduce IDT Gene Fragment synthesis scores from well over a hundred to 7, a task that cannot be done by hand in a reasonable time.
This score for SGR, an enzyme in the degradation pathway, was achieved in minutes with BOTs, whereas reaching the same level by hand took two weeks of tireless work.
Background
Why do we want to improve codon optimization?
This year, one of the team's enzymes was nearly impossible to optimize for both expression and synthesis. Tools that optimize for expression are plentiful, but tools that optimize for synthesis are nowhere to be found. This led to frustration, as two weeks were spent on painstaking, iterative manual work. Thus the idea of a novel tool for synthetic biology was born.
Codon optimization is a standard problem in synthetic biology. To produce a protein in a given host, we need to think not only about restriction sites and how the protein might fold in the host, but also about achieving a high level of production. That high production is achieved through codon optimization. Expression optimization is important, but sometimes you get unlucky and your sequence cannot be synthesized. You then have to deal with repeats, GC richness, hairpins and other features that make synthesis harder. Removing all of these features is tedious work, because removing one may introduce another.
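The three features named above are easy to detect even though they are hard to remove together. The following is a minimal sketch of such detectors, not the code used in BOTs: the function names, the k-mer length, and the stem and loop lengths are all illustrative assumptions.

```python
from collections import defaultdict


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_repeats(seq: str, k: int = 8) -> dict:
    """Map every k-mer that occurs more than once to its start positions."""
    seq = seq.upper()
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: starts for kmer, starts in positions.items() if len(starts) > 1}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_hairpin_stems(seq: str, stem: int = 6, min_loop: int = 3) -> list:
    """Return (i, j) pairs where seq[i:i+stem] could base-pair with
    seq[j:j+stem] further downstream, leaving at least min_loop bases between."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - stem + 1):
        target = reverse_complement(seq[i:i + stem])
        j = seq.find(target, i + stem + min_loop)
        while j != -1:
            hits.append((i, j))
            j = seq.find(target, j + 1)
    return hits
```

A real synthesis score, such as the one IDT reports, weighs many more factors; these checks only show that identifying a feature is a simple scan over the sequence.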
How
Steps in the approach
We divided the problem into three parts.
- Finding current solutions
- High Usability
- High Performance Code
Current solutions
codon-harmony
A first glance at current solutions yielded very few results. Many tools could optimize sequences for expression, and some could remove known undesirable features, but only one, codon-harmony by Brian Weitzner, claimed to be able to do both. Although it was extremely hard to use, it did run. That is not to say it functioned well: scores would often increase. Furthermore, it was a command-line tool, which made it much harder to use.
Usability
Using Django
To make it usable, we had to steer away from the command line. We decided it was best to create a website. Because codon-harmony by Brian Weitzner is written in Python, we decided to use Django, a Python web framework that lets us combine Python code with HTML. The first step was to modify the code to accept arguments that do not come from the command line. In its original state, the code could only take command-line arguments, which is very user-unfriendly. We decided to let people provide the arguments as a dictionary by using dataclasses in Python. This way, scripts can easily be set up to run codon-harmony. Once this is done, the code can easily be used from the website, and the input can be checked beforehand. One of the major flaws of the program is that it does not check whether an input file is in FASTA format. We therefore changed it so that it rejects non-FASTA files and throws a warning instead of outputting empty sequences.
Whilst codon-harmony can currently only be run locally, we hope to have it fully deployed by the Jamboree, so that all iGEMers can use BOTs in the future.
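The two changes described above, dictionary-based arguments and a FASTA check, could look roughly like this. This is a sketch under stated assumptions: `OptimizerArgs`, its field names and defaults, and `looks_like_fasta` are hypothetical and do not reproduce codon-harmony's actual options or our modified code.

```python
from dataclasses import dataclass, fields


@dataclass
class OptimizerArgs:
    # Hypothetical option names; codon-harmony's real arguments may differ.
    input_path: str
    max_gc: float = 0.65
    cycles: int = 10

    @classmethod
    def from_dict(cls, options: dict) -> "OptimizerArgs":
        """Build the arguments from a plain dictionary, rejecting unknown keys."""
        known = {f.name for f in fields(cls)}
        unknown = set(options) - known
        if unknown:
            raise ValueError(f"Unknown option(s): {sorted(unknown)}")
        return cls(**options)


def looks_like_fasta(text: str) -> bool:
    """True if the text starts with a '>' header and every other non-empty
    line contains only letters, '*' or '-'."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        return False
    has_sequence = False
    for ln in lines:
        if ln.startswith(">"):
            continue
        if not ln.replace("*", "").replace("-", "").isalpha():
            return False
        has_sequence = True
    return has_sequence
```

A web view can then call `OptimizerArgs.from_dict(form_data)` and refuse the upload with a warning when `looks_like_fasta` returns `False`, before any optimization is started.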
Better Program
Genetic Algorithms
Although codon-harmony claimed to be able to optimize for synthesis, testing showed that not only could it not optimize, it often made the sequences worse. This was because codon-harmony was iterative: it optimized a sequence one function at a time, and each improvement could then be undone by the next function.
So we created a tool that removes repeats, keeps GC richness below a certain percentage and removes hairpins.
The initial design was to use a simple genetic algorithm. The advantage of a genetic algorithm is that it approaches the problem in a very different way. Identifying all the repeats in a sequence is trivial compared to removing all of them. The genetic algorithm relies only on identification and does not solve the problem algorithmically. Solving the problem is left to evolution, i.e. random events.
This did not work. It operated on the sum of all the fitness functions, which, while valid for many applications, is unsuitable here because we have no idea what weights to assign; weights can only be assigned after extensive research into the solution space. Thus came the third and final form of BOT, which is an application of the SPEA2 algorithm. SPEA2 is one of the best evolutionary algorithms and is used as a benchmark for new genetic algorithms; for the most part, the best new ones are "comparable to" it and rarely better, at least across a breadth of problem statements. So if you are building a genetic algorithm, SPEA2 is your friend.
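A minimal sketch of such a simple genetic algorithm is given here. It assumes that mutation swaps a codon for a synonymous one, so the encoded protein never changes; the text does not specify the operators we used, so `mutate`, `evolve` and all parameter values are illustrative. The fitness function only has to score a sequence (lower is better), it never has to repair it.

```python
import itertools
import random
from collections import defaultdict

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(c): aa for c, aa in zip(itertools.product(BASES, repeat=3), AMINO)
}
SYNONYMS = defaultdict(list)
for _codon, _aa in CODON_TO_AA.items():
    SYNONYMS[_aa].append(_codon)


def translate(dna: str) -> str:
    """Translate a coding sequence whose length is a multiple of three."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


def mutate(dna: str, rate: float, rng: random.Random) -> str:
    """Replace each codon, with probability `rate`, by a random synonymous codon."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    for idx, codon in enumerate(codons):
        if rng.random() < rate:
            codons[idx] = rng.choice(SYNONYMS[CODON_TO_AA[codon]])
    return "".join(codons)


def evolve(dna, fitness, generations=50, pop_size=20, rate=0.1, seed=0):
    """Keep the better half of the population each generation and refill it
    with mutated copies of the survivors."""
    rng = random.Random(seed)
    population = [dna] + [mutate(dna, rate, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness)
        survivors = population[: pop_size // 2]
        children = [
            mutate(rng.choice(survivors), rate, rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return min(population, key=fitness)
```

For example, `evolve("GCCGGCGGGCCC", lambda s: s.count("G") + s.count("C"))` searches for a lower-GC sequence that still encodes the same four amino acids. Note that `fitness` here is a single number, which is exactly the limitation that a multi-objective problem runs into.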
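To illustrate why SPEA2 needs no weights, here is a short sketch of its fitness assignment as described in the SPEA2 literature: strength, raw fitness, and a k-th nearest neighbour density term. This is not the code in BOTs. The function names are ours, all objectives are assumed to be minimized, and the archive and truncation steps are left out.

```python
import math


def dominates(a, b) -> bool:
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def spea2_fitness(objectives):
    """Return (raw, fitness) lists for a list of objective vectors.

    strength[i] = number of individuals that i dominates
    raw[i]      = sum of the strengths of all individuals dominating i
    fitness[i]  = raw[i] + 1 / (distance to k-th nearest neighbour + 2)
    """
    n = len(objectives)
    strength = [
        sum(dominates(objectives[i], objectives[j]) for j in range(n))
        for i in range(n)
    ]
    raw = [
        sum(strength[j] for j in range(n) if dominates(objectives[j], objectives[i]))
        for i in range(n)
    ]
    k = max(1, int(math.sqrt(n)))
    fitness = []
    for i in range(n):
        dists = sorted(
            math.dist(objectives[i], objectives[j]) for j in range(n) if j != i
        )
        sigma_k = dists[min(k, len(dists)) - 1] if dists else 0.0
        fitness.append(raw[i] + 1.0 / (sigma_k + 2.0))
    return raw, fitness
```

Each objective vector could hold, for instance, a repeat count, a GC excess and a hairpin count for one candidate sequence. Non-dominated candidates get a raw fitness of 0 and therefore a total fitness below 1, without any of the objectives ever being added together.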
Future Directions
There are several ways we could improve BOTs:
- Improve the truncation function and the raw fitness calculation function.
- Integrate a more compact version of the graph for more efficient calculations.
- Use linked lists rather than arrays, as arrays lose their data more easily during truncation.
- Host the website publicly rather than only having a functional local version.
Access
GitHub
References
Coello, C. A. C., Lamont, G. B., & Van Veldhuizen, D. A. (2007). Evolutionary Algorithms for Solving Multi-Objective Problems (2nd ed.). Boston (MA): Springer.