Software
Introduction
To generate our synthetic promoters, we built promoter sequence models based on hidden Markov models (HMMs) (see the model page). To handle the many calculations and large volumes of data needed to build these models, we built a user-friendly program that does all the work. We call it proHMMoter, and you can download it here! Besides making it significantly easier, not to mention faster, to work with the model, it also allowed us to implement all the features we wanted our promoters to have in a way that is easy for future users to extend and modify.
proHMMoter is written as a command-line program that ties a small suite of scripts together in a single convenient command. Thanks to the command-line interface and the consistent naming and location of output files, users can easily embed our software as a tool in a larger pipeline.
Features
- Domestication
  proHMMoter outputs promoter sequences that are compatible with iGEM’s Type IIS standards, and incorporates a penalty system that can automatically domesticate for other relevant synthetic biology standards when this can be done without major disruption.
- Shortening sequences
  proHMMoter can identify evolutionarily non-conserved regions. The program can therefore remove non-essential regions upstream of the actual promoter region, creating shorter promoter sequences. This can be essential when working with higher eukaryotes, as these gene constructs can get very large and thereby more difficult and expensive to work with.
- Non-homology
  proHMMoter can generate promoter sequences that in theory should have the same function as the original promoter, but with a drastically different sequence. This is useful because it reduces the chance of unwanted homologous recombination.
- Synthesizability
  The promoters generated by proHMMoter automatically comply with the complexity rules of major DNA synthesis vendors, making it easy to get your freshly generated promoters synthesised.
- Host versatility
  proHMMoter can generate consensus sequences representing the promoters in diverse sets of genomes. As long as these genomes are sufficiently related, the promoters should be usable in any of the organisms, and further in silico analysis can help verify this. As a proof of concept, we generated our synthetic Aspergillus niger promoters based on all the genomes within the Aspergillus genus.
- Induction of noise
  proHMMoter has built-in functionality that lets the user inject varying amounts of noise into the promoter generation process. This gives the user the opportunity to create artificially weakened promoters, and thereby allows for the creation of promoters with different strengths but broadly similar expression dynamics. For our project, we injected noise into several of our promoters and showed that expression was still achieved, as a step towards building a comprehensive promoter ladder.
How it works
As Figure 1 illustrates, the program requires two inputs: the genome data from the set of organisms you want the modeling process to take into account, and a set of protein-coding candidate genes with valuable expression characteristics. To generate the synthetic promoters for our project, we used genome data from Mycocosm[1] and RNA-Seq data to identify candidate genes by their expression levels.
The program first searches each genome for homologs of the genes you have chosen, comparing amino acid sequences with the HMMER[3] tool phmmer. It then approximates the ortholog in each genome as the highest-scoring homolog found there.
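The "highest-scoring homolog in each genome" rule can be sketched in a few lines of Python. This is only an illustration, not the code inside proHMMoter: it assumes phmmer was run once per genome with its `--tblout` option, that the table rows are already in memory, and the function name `best_hit_per_genome` and all example rows are made up for this sketch.

```python
def best_hit_per_genome(tblout_by_genome):
    """Return {(query, genome): (target, score)} for the top-scoring hit.

    tblout_by_genome maps a genome label to the lines of a phmmer --tblout
    table. In that format, field 0 is the target name, field 2 the query
    name, and field 5 the full-sequence bit score.
    """
    best = {}
    for genome, lines in tblout_by_genome.items():
        for line in lines:
            # Skip blank lines and the '#' comment lines HMMER writes.
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.split()
            target, query, score = fields[0], fields[2], float(fields[5])
            key = (query, genome)
            if key not in best or score > best[key][1]:
                best[key] = (target, score)
    return best


# Made-up rows (only the first seven columns are given here).
example = {
    "genome_A": [
        "# target  acc  query  acc  E-value  score  bias",
        "protA1 - geneX - 1e-50 170.2 0.1",
        "protA2 - geneX - 1e-10 40.5 0.0",
    ],
    "genome_B": [
        "protB7 - geneX - 3e-80 265.0 0.2",
    ],
}

hits = best_hit_per_genome(example)
print(hits[("geneX", "genome_A")])  # ('protA1', 170.2)
```

Taking the single best hit is an approximation of orthology, as noted above; paralogs with similar scores would need a closer look.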
Once the orthologs are found, the program locates them in their respective genomes and extracts the putative promoters upstream of these genes. These promoter regions are then aligned with the multiple sequence alignment tool MUSCLE[2]. From the multiple sequence alignment we build the actual HMM, using the HMMER tool hmmbuild. To further refine this model, we perform an expectation-maximization step: we align all native promoter sequences to the model using hmmalign, rebuild the HMM from this alignment while trimming unaligned ends, and iterate until the consensus sequence converges. To ensure minimal promoter lengths, we extract the position-weight matrix (PWM) underlying the HMM to use in generating the synthetic promoters.
We implement the linear program described on our model page in clear, concise code that is easy to extend. In it, we apply constraints such as limits on homopolymers and GC content to reduce synthesis complexity to levels that major vendors consider acceptable. We also domesticate with respect to the features considered in our design, complying with strictly required synthetic biology standards and making a best effort to follow merely desirable ones. The latter is accomplished via a penalty system, which ensures that the amount of sequence conservation sacrificed when domesticating (measured in bits, as sequence conservation is essentially a measure of information) falls below preset levels.
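To make "conservation measured in bits" concrete, the sketch below computes the information content of PWM columns in the usual sequence-logo sense, that is, log2(4) minus the Shannon entropy of the column. It assumes a uniform background over A, C, G and T, which may differ from the background used on our model page, and the function names and example columns are hypothetical.

```python
import math


def column_information(probs):
    """Information content in bits of one PWM column.

    Assumes a uniform background, so the maximum is log2(len(probs)),
    which is 2 bits for the four DNA bases.
    """
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return math.log2(len(probs)) - entropy


def pwm_information(pwm):
    """Total information content in bits over all columns of a PWM."""
    return sum(column_information(col) for col in pwm)


# Made-up columns, each giving probabilities for A, C, G, T.
fully_conserved = [1.0, 0.0, 0.0, 0.0]    # always A
two_way_split = [0.5, 0.5, 0.0, 0.0]      # A or C, equally likely
unconserved = [0.25, 0.25, 0.25, 0.25]    # no preference

print(column_information(fully_conserved))  # 2.0
print(column_information(two_way_split))    # 1.0
print(column_information(unconserved))      # 0.0
```

Under this measure, changing a base in a fully conserved column puts far more information at stake than changing one in an unconserved column, which is the intuition behind penalising domestication in bits.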
We then solve this linear program to find the sequence that, with respect to sequence conservation, comes as close to the consensus sequence as the constraints allow.
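The idea of "as close to the consensus as the constraints allow" can be shown on a toy example. The sketch below is not the linear program from our model page: it swaps in an exhaustive search, which only works for very short sequences, and scores a sequence by its log-probability under the PWM. The homopolymer limit, the use of the BsaI recognition site (GGTCTC and its reverse complement) as the forbidden Type IIS site, and the toy PWM are all assumptions chosen for illustration.

```python
import itertools
import math

ALPHABET = "ACGT"


def log_score(seq, pwm):
    """Log2-probability of seq under the PWM (columns ordered as ALPHABET)."""
    return sum(math.log2(col[ALPHABET.index(b)]) for b, col in zip(seq, pwm))


def satisfies(seq, max_run=3, forbidden=("GGTCTC", "GAGACC")):
    """True if seq has no homopolymer longer than max_run and no forbidden site."""
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return False
    return not any(site in seq for site in forbidden)


def best_allowed(pwm, **limits):
    """Brute-force stand-in for the optimisation: best-scoring allowed sequence."""
    candidates = ("".join(t) for t in itertools.product(ALPHABET, repeat=len(pwm)))
    allowed = (s for s in candidates if satisfies(s, **limits))
    return max(allowed, key=lambda s: log_score(s, pwm))


# Toy PWM whose consensus, AAAAA, breaks the homopolymer limit of 3.
strong_a = [0.7, 0.1, 0.1, 0.1]
weak_a = [0.4, 0.3, 0.2, 0.1]
toy_pwm = [strong_a, strong_a, weak_a, strong_a, strong_a]

print(best_allowed(toy_pwm))  # AACAA
```

The search breaks the run of A's at the least conserved column, where the substitution costs the least. A linear program reaches the same kind of answer without enumerating all 4^L sequences, which is what makes realistic promoter lengths tractable.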
We also adjust the distribution implied by the PWM as described on our model page and sample from it, using the linear program to reject non-solutions. This gives a fair sample of the promoter sequences that meet our defined standards.
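Sampling and rejecting can be sketched as follows. This is a simplified illustration under stated assumptions: it samples each position independently from the PWM as given (the adjustment described on our model page is not shown), and it uses a plain Python check in place of the linear program to decide whether a sequence is acceptable. The GC and homopolymer thresholds are placeholders, not the rules of any particular vendor, and all names are hypothetical.

```python
import random

ALPHABET = "ACGT"


def gc_fraction(seq):
    """Fraction of G and C bases in seq."""
    return sum(b in "GC" for b in seq) / len(seq)


def longest_run(seq):
    """Length of the longest homopolymer run in a non-empty seq."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def acceptable(seq, gc_min=0.3, gc_max=0.7, max_run=4):
    """Placeholder constraint check standing in for the linear program."""
    return gc_min <= gc_fraction(seq) <= gc_max and longest_run(seq) <= max_run


def sample_promoters(pwm, n, rng, max_tries=100_000, **limits):
    """Draw sequences from the PWM and keep the first n that pass the check."""
    accepted = []
    for _ in range(max_tries):
        seq = "".join(rng.choices(ALPHABET, weights=col)[0] for col in pwm)
        if acceptable(seq, **limits):
            accepted.append(seq)
            if len(accepted) == n:
                break
    return accepted


# Made-up PWM: 20 unconserved columns, sampled with a fixed seed.
toy_pwm = [[0.25, 0.25, 0.25, 0.25]] * 20
samples = sample_promoters(toy_pwm, 5, random.Random(0))
for s in samples:
    print(s, round(gc_fraction(s), 2), longest_run(s))
```

Because rejected draws are simply discarded, the accepted sequences follow the PWM distribution restricted to the sequences that pass the check, which is the sense in which the sampling is fair.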
In our project, we have shown that our promoters work even when a substantial amount of noise has been injected, and that this appropriately resulted in a relatively weak promoter.
Usability
Because we wanted proHMMoter to be useful for future teams that might not have strong bioinformatics experience, it is designed to require minimal programming or computational knowledge. A README file describes which programs and files are needed to run the software and how to obtain and organize those files. To further assist in the process, we have included an example of how to find genome data and what format it should be in.
To help us test whether our software is user-friendly enough for fellow synthetic biologists to use productively, the BrownStanfordPrinceton team offered to try it out. Their story can be read here. We were very pleased to hear from the team that they found our software easy to use, and that they consider its capabilities valuable. This also confirms that the program works consistently, even with organisms other than Aspergillus.
Publication of the source code
If you now want to generate synthetic promoters too, you can contact us for our software (cwor@dtu.dk). The files relevant to you as a user of the software are as follows:
- prohmmoter.py
- The program to run.
- README.md
- README file describing how to use the program, what files to include, and the output formats.
- Quick guide to Mycocosm
- A guide to finding and downloading the necessary genome data from Mycocosm, which hosts many fungal genomes.
- hmm-match-state-emission.jl
- If you want to modify the constraints that the generated sequences will live up to, or adjust the noise levels, this is the file to edit.
- LICENSE
- The GNU GPLv3 license
[1] H. Nordberg, M. Cantor, S. Dusheyko, S. Hua, A. Poliakov, I. Shabalov, T. Smirnova, I. V. Grigoriev, and I. Dubchak, “The genome portal of the Department of Energy Joint Genome Institute: 2014 updates,” Nucleic Acids Res., 42(1):D26-31, 2014.
[2] R. C. Edgar, “MUSCLE: Multiple sequence alignment with high accuracy and high throughput,” Nucleic Acids Res., 2004.
[3] http://hmmer.org/.