Team:DTU-Denmark/Model

Model

To generate synthetic promoters, we needed more than data and an idea of the properties promoters ought to have. We needed a coherent mathematical and conceptual framework that could drive our understanding: we needed a model.

Introduction

Using RNA-Seq read counts as a proxy for gene expression, it is possible for genes with desirable properties, such as expression in specific growth phases or with a certain strength, to be identified. Based on this the upstream sequences of the genes should act as promoters, producing expression with those same properties. However, there are many reasons why such a native sequence might be undesirable. First of all, if your organism of choice has not been studied sufficiently, it may not even be possible to identify genes with a desirable expression. Even if you find them, the boundaries of their promoters are hard to define, and nothing guarantees that their sequences are compatible with your cloning or assembly method of choice. Worse yet, when employing these sequences in synthetic biology work, the large regions of homology with the genome of your target organism makes your constructs prone to inappropriate homologous recombination, complicating the construction of your strain and undermining its stability.
For these reasons and more, synthetic promoters with methodically defined boundaries that have a predictable performance with sequences dissimilar from anything found natively, yet complying with arbitrary requirements on their sequences, are highly desirable. But how can such promoters be obtained? Given the constant growth in whole-genome sequencing and increasing feasibility of working in non-model organisms, we sought an approach that did not require a detailed understanding of the intricacies of the regulatory mechanisms of a single organism but could leverage genomic data by exploiting general principles that hold true for all known life.
Inspired by the conservation of genes across taxonomic ranks as high as domains, we aimed to take advantage of homology to find promoters that would work across a broad range of species by making a simple assumption: homologous genes must have homologous promoters.

Background knowledge

To model the homology of promoter sequences between species, we needed a model that could take into account not only point mutations but also the possibility of insertions and deletions. For this purpose, we chose to use a hidden Markov model (HMM) which, as the name implies, is related to Markov processes, which can be thought of as a process which correlates the current state with previous states although it is not known what happened in the previous states. This “lack of memory” is called the Markov property. An HMM consists of a set of symbols on top of an underlying, but unobservable, Markov process. Each state in the process contains probabilities for emitting each symbol when the process enters that state. The adjective “hidden” refers to the fact that the sequence of states cannot be observed directly, only the sequence of emitted symbols.

HMMER structure

To tackle our original problem of modeling homology, while keeping it simple enough to handle computationally, we decided to use a HMM with the same structure as the one used in the HMMER software, illustrated. The HMMER structure can be thought of as an extension to a position-weight matrix, essentially considering our entire promoter sequence as a giant motif with structure on top.

Hidden Markov Model showing distribution of A, C, G, and T in M1, M2, M2, I1, I2, and I3.Fig. 1: Hidden markov model

An overview of a HMM can be seen in figure 1. The structure contains 3 types of nodes: match nodes Mi, which represent bases in the motif, and have emission probabilities for each possible base; delete nodes Di, which represent deletions, and don’t emit anything; and insert nodes Ii, that represent insertions, and have emission probabilities for each base.
From each match node Mi, there are only 3 non-zero transition probabilities: transitioning to the next match node Mi+1; to the insertion node Ii (which represents the beginning of an insertion); or to the delete state Di+1 (representing the next base being deleted).
From each delete node, Di, there are 2 non-zero transition probabilities: transitioning to Mi+1 (representing the next base being present), or to Di+1 (representing the next base being deleted as well).
From each insert node Ii, there are 2 non-zero transition probabilities as well: transitioning to Mi+1 (representing the insertion being complete}; or to Ii (representing the insertion continuing).

In this way, we can model deletions of multiple bases always occurring concurrently, but insertions are modeled as being of geometrically distributed length, and each base in an insertion is drawn from the same probability distribution.

Our model

With this model structure in hand, we could determine the parameters needed to construct a specialized model for each set of promoter sequences stemming from the orthologs of a specific gene. For this, after constructing an initial multiple sequence alignment, we employed an expectation-maximization approach, alternating between building an HMM-based on alignment and aligning the sequences to the HMM.

In an attempt to minimize our promoter lengths, we decided to take advantage of the conceptualization previously mentioned of the model structure as being an extension to a position-weight matrix, by simply extracting the PWM, thus ridding the sequences of insertions.

Sequence logo showing sequence conservation for DNA sequence.Fig. 2: Sequence logo.

To determine the model parameters, we used a set of scripts we wrote to transform the genomic data of the entire set of accessible Aspergillus spp. genomes, into a format we could use as an input for the HMMER software suite, and then further transformed the results into the minimized form described above.
Using these parameters, we could then set up a mixed-integer linear program whose feasible region would represent the set of synthesizable sequences complying with the hard domestication constraints described on the design page. These would also comply with an objective function pointing towards the sequence of lowest surprisal (a measure of how unlikely a sequence is, given the parameters). Thus, the solution would be the minimally perturbed consensus sequence, that still lives up to our domestication requirements and is possible to de novo synthesize.

The process of representing our promoter sequence generation in the form of a linear program led to an intriguing insight. When representing the promoter sequences as position-weight matrices, we are considering each base as being drawn independently from their own categorical distribution over the bases A, C, G, and T, with the consensus sequence being the mode of the joint distribution. An alternative way of conceptualizing the consensus sequence arises from considering the individual distributions as being Gibbs distributions, with the probabilities px being proportional to exp(-ex/(kT)) for the probability px of a given base, with ex being a measure of some form of energy associated with that base.
We do not need to consider physical interpretation of these energies, but merely observe that as T goes to 0, the probability of sampling the consensus sequence approaches 1. We can then vary T to continuously move from the single consensus sequence to promoters with much higher diversity.

We hoped that this would allow us to sample promoters of varying strengths from a single model, thus expanding a model of the promoter for a single ortholog group into an entire library of promoters with similar expression characteristics.

For any model, validation is crucial. As Aspergillus is remarkably diverse, the possibility for validation was especially critical. [1] Thanks to the framework of sequence conservation and its relationship to information theory, we were able to procure a remarkably simple measure for “goodness of conservation”, namely the entropy of our models. Furthermore, we could use the score of the promoters of our specific target organisms against our models to give an indication of whether this model would be able to generate promoters that would work in A. niger.
However, despite this great promise, we soon realized that we had taken a LEAP into the unknown. The evolutionary conservation of promoter sequences has simply not been studied sufficiently to allow us to determine the biological meaning of a given level of entropy.
In fact, even from a purely mathematical point of view, the distribution of conservation of promoters is simply unknown.
Not to let that stop us, we placed our trust in our models and reasoning, picked out genes with interesting expression levels, and using our software and IDT’s gBlocks, we translated our design intentions into physical DNA. We did not expect all computed promoters to work in vivo, but we had no easy way of knowing without trying. So in the spirit of innovation and experimentation, we decided to try anyway.
This paid off handsomely: Not only did our homology modeling approach result in several novel synthetic promoters, which are compatible with multiple cloning standards and keep their dependencies on the growth phase. Our noise injection approach was validated, as we showed that we were able to generate “high-temperature” samples with promoter activity and to specifically weaken a strong promoter, gdpA_1, producing gdpA_2 with similar promoter dynamics.



[1] J. G. Gibbons and A. Rokas, “The function and evolution of the Aspergillus genome,” Trends in Microbiology, vol. 21, no. 1. pp. 14–22, Jan-2013.

The logos of our three biggest supporters, DTU Blue Dot, Novo Nordisk fonden and Otto Mønsted fonden The logos of all of our sponsors, DTU, BioNordica, Eurofins Genomics, Qiagen, NEB New England biolabs, IDT Integrated DNA technologies and Twist bioscience