Difference between revisions of "Team:DTU-Denmark/Model"

m
m
Line 50: Line 50:
 
<h2>Model</h2>
 
<h2>Model</h2>
 
 
<p><div>Early on, we decided to generate synthetic promoters based on homology within the Aspergillus genus, based on a hypothesis that regulatory elements such as transcription factor binding sites would have been under greater selective pressure than irrelevant regions.</div></p>
+
<p><div>Using RNA-Seq read counts as a proxy for gene expression, it is possible for genes with desirable properties, such as expression in specific growth phases or with a certain strength, to be identified. Based on this the upstream sequences of the genes should act as promoters, producing expression with those same properties. However, there are many reasons why such a native sequence might be undesirable. First of all, if your organism of choice has not been studied sufficiently, it may not even be possible to identify genes with a desirable expression. Even if you find them, the boundaries of their promoters are hard to define, and nothing guarantees that their sequences are compatible with your cloning or assembly method of choice. Worse yet, when employing these sequences in synthetic biology work, the large regions of homology with the genome of your target organism makes your constructs prone to inappropriate homologous recombination, complicating the construction of your strain and undermining its stability.<br>
 +
For these reasons and more, synthetic promoters with methodically defined boundaries that have a predictable performance with sequences dissimilar from anything found natively, yet complying with arbitrary requirements on their sequences, are highly desirable.
 +
But how can such promoters be obtained? Given the constant growth in whole-genome sequencing and increasing feasibility of working in non-model organisms, we sought an approach that did not require a detailed understanding of the intricacies of the regulatory mechanisms of a single organism but could leverage genomic data by exploiting general principles that hold true for all known life.<br>
 +
Inspired by the conservation of genes across taxonomic ranks as high as domains, we aimed to take advantage of homology to find promoters that would work across a broad range of species by making a simple assumption: homologous genes must have homologous promoters.</div></p>
 
</div>
 
</div>
 
</div><!-- /end col-md-4 -->
 
</div><!-- /end col-md-4 -->
Line 85: Line 88:
 
<div class="sm-no-float col-md-10 col-xs-12">
 
<div class="sm-no-float col-md-10 col-xs-12">
  
<p>Introduction
 
Using RNA-Seq read counts as a proxy for gene expression, it is possible for genes with desirable properties, such as expression in specific growth phases or with a certain strength, to be identified. Based on this the upstream sequences of the genes should act as promoters, producing expression with those same properties. However, there are many reasons why such a native sequence might be undesirable. First of all, if your organism of choice has not been studied sufficiently, it may not even be possible to identify genes with a desirable expression. Even if you find them, the boundaries of their promoters are hard to define, and nothing guarantees that their sequences are compatible with your cloning or assembly method of choice. Worse yet, when employing these sequences in synthetic biology work, the large regions of homology with the genome of your target organism makes your constructs prone to inappropriate homologous recombination, complicating the construction of your strain and undermining its stability.<br>
 
For these reasons and more, synthetic promoters with methodically defined boundaries that have a predictable performance with sequences dissimilar from anything found natively, yet complying with arbitrary requirements on their sequences, are highly desirable.
 
But how can such promoters be obtained? Given the constant growth in whole-genome sequencing and increasing feasibility of working in non-model organisms, we sought an approach that did not require a detailed understanding of the intricacies of the regulatory mechanisms of a single organism but could leverage genomic data by exploiting general principles that hold true for all known life.<br>
 
Inspired by the conservation of genes across taxonomic ranks as high as domains [TODO source this, then replace by as high as could be found sources for], we aimed to take advantage of homology to find promoters that would work across a broad range of species by making a simple assumption: homologous genes must have homologous promoters.
 
  
Background knowledge
+
<h2>Background knowledge</h2>
To model the homology of promoter sequences between species, we needed a model that could take into account not only point mutations but also the possibility of insertions and deletions. For this purpose, we chose to use a hidden Markov model (HMM) which, as the name implies, is related to Markov processes, which can be thought of as a process which correlates the current state with previous states although it is not known what happened in the previous states. This “lack of memory” is called the Markov property. An HMM consists of a set of symbols on top of an underlying, but unobservable, Markov process. Each state in the process contains probabilities for emitting each symbol when the process enters that state. The adjective “hidden” refers to the fact that the sequence of states cannot be observed directly, only the sequence of emitted symbols.  
+
<p>To model the homology of promoter sequences between species, we needed a model that could take into account not only point mutations but also the possibility of insertions and deletions. For this purpose, we chose to use a hidden Markov model (HMM) which, as the name implies, is related to Markov processes, which can be thought of as a process which correlates the current state with previous states although it is not known what happened in the previous states. This “lack of memory” is called the Markov property. An HMM consists of a set of symbols on top of an underlying, but unobservable, Markov process. Each state in the process contains probabilities for emitting each symbol when the process enters that state. The adjective “hidden” refers to the fact that the sequence of states cannot be observed directly, only the sequence of emitted symbols. </p>
  
  
HMMER structure
+
<h2>HMMER structure</h2>
To tackle our original problem of modeling homology, while keeping it simple enough to handle computationally, we decided to use a HMM with the same structure as the one used in the HMMER software, illustrated. The HMMER structure can be thought of as an extension to a position-weight matrix, essentially considering our entire promoter sequence as a giant motif with structure on top.
+
<p>To tackle our original problem of modeling homology, while keeping it simple enough to handle computationally, we decided to use a HMM with the same structure as the one used in the HMMER software, illustrated. The HMMER structure can be thought of as an extension to a position-weight matrix, essentially considering our entire promoter sequence as a giant motif with structure on top.
 
<figure>
 
<figure>
 
<img style="padding:28px; width:100%" src="https://static.igem.org/mediawiki/2019/f/f5/T--DTU-Denmark--Marcus1.png" alt="Hidden Markov Model showing distribution of A, C, G, and T in M1, M2, M2, I1, I2, and I3."  
 
<img style="padding:28px; width:100%" src="https://static.igem.org/mediawiki/2019/f/f5/T--DTU-Denmark--Marcus1.png" alt="Hidden Markov Model showing distribution of A, C, G, and T in M1, M2, M2, I1, I2, and I3."  
 
<figcaption>Fig. 1: Hidden markov model</figcaption>
 
<figcaption>Fig. 1: Hidden markov model</figcaption>
 
</figure>
 
</figure>
 +
</p>
  
 
+
<p>
 
An overview of a HMM can be seen in figure 1. The structure contains 3 types of nodes: match nodes M<sub>i</sub>, which represent bases in the motif, and have emission probabilities for each possible base; delete nodes D<sub>i</sub>, which represent deletions, and don’t emit anything; and insert nodes I<sub>i</sub>, that represent insertions, and have emission probabilities for each base. <br>
 
An overview of a HMM can be seen in figure 1. The structure contains 3 types of nodes: match nodes M<sub>i</sub>, which represent bases in the motif, and have emission probabilities for each possible base; delete nodes D<sub>i</sub>, which represent deletions, and don’t emit anything; and insert nodes I<sub>i</sub>, that represent insertions, and have emission probabilities for each base. <br>
 
From each match node M<sub>i</sub>, there are only 3 non-zero transition probabilities: transitioning to the next match node M<sub>i+1</sub>; to the insertion node I<sub>i</sub> (which represents the beginning of an insertion); or to the delete state D<sub>i+1</sub> (representing the next base being deleted). <br>
 
From each match node M<sub>i</sub>, there are only 3 non-zero transition probabilities: transitioning to the next match node M<sub>i+1</sub>; to the insertion node I<sub>i</sub> (which represents the beginning of an insertion); or to the delete state D<sub>i+1</sub> (representing the next base being deleted). <br>
Line 110: Line 109:
 
In this way, we can model deletions of multiple bases always occurring concurrently, but insertions are modeled as being of geometrically distributed length, and each base in an insertion is drawn from the same probability distribution.<br>
 
In this way, we can model deletions of multiple bases always occurring concurrently, but insertions are modeled as being of geometrically distributed length, and each base in an insertion is drawn from the same probability distribution.<br>
  
Our model
+
</p>
With this model structure in hand, we could determine the parameters needed to construct a specialized model for each set of promoter sequences stemming from the orthologs of a specific gene. For this, after constructing an initial multiple sequence alignment, we employed an expectation-maximization approach, alternating between building an HMM-based on alignment and aligning the sequences to the HMM.<br><br>
+
  
In an attempt to minimize our promoter lengths, we decided to take advantage of the conceptualization previously mentioned of the model structure as being an extension to a position-weight matrix, by simply extracting the PWM, thus ridding the sequences of insertions.  
+
<h2>Our model</h2>
 +
<p>With this model structure in hand, we could determine the parameters needed to construct a specialized model for each set of promoter sequences stemming from the orthologs of a specific gene. For this, after constructing an initial multiple sequence alignment, we employed an expectation-maximization approach, alternating between building an HMM-based on alignment and aligning the sequences to the HMM.<br><br></p>
 +
 
 +
<p>In an attempt to minimize our promoter lengths, we decided to take advantage of the conceptualization previously mentioned of the model structure as being an extension to a position-weight matrix, by simply extracting the PWM, thus ridding the sequences of insertions. </p>
  
 
<figure>
 
<figure>
Line 120: Line 121:
 
</figure>
 
</figure>
  
To determine the model parameters, we used a set of <a href=”https://2019.igem.org/Team:DTU-Denmark/Software” target=”_blank”>scripts we wrote</a> to transform the genomic data of the entire set of accessible <i>Aspergillus</i> spp. genomes, into a format we could use as an input for the HMMER software suite, and then further transformed the results into the minimized form described above. <br>
+
<p>To determine the model parameters, we used a set of <a href=”https://2019.igem.org/Team:DTU-Denmark/Software” target=”_blank”>scripts we wrote</a> to transform the genomic data of the entire set of accessible <i>Aspergillus</i> spp. genomes, into a format we could use as an input for the HMMER software suite, and then further transformed the results into the minimized form described above. <br>
Using these parameters, we could then set up a mixed-integer linear program whose feasible region would represent the set of synthesizable sequences complying with the hard domestication constraints described on the <a href=”https://2019.igem.org/Team:DTU-Denmark/Design_Promoter” target=”_blank”>design page</a>. These would also comply with an objective function pointing towards the sequence of lowest surprisal (a measure of how unlikely a sequence is, given the parameters). Thus, the solution would be the minimally perturbed consensus sequence, that still lives up to our domestication requirements and is possible to <i>de novo</i> synthesize.
+
Using these parameters, we could then set up a mixed-integer linear program whose feasible region would represent the set of synthesizable sequences complying with the hard domestication constraints described on the <a href=”https://2019.igem.org/Team:DTU-Denmark/Design_Promoter” target=”_blank”>design page</a>. These would also comply with an objective function pointing towards the sequence of lowest surprisal (a measure of how unlikely a sequence is, given the parameters). Thus, the solution would be the minimally perturbed consensus sequence, that still lives up to our domestication requirements and is possible to <i>de novo</i> synthesize.</p>
The process of representing our promoter sequence generation in the form of a linear program led to an intriguing insight. When representing the promoter sequences as position-weight matrices, we are considering each base as being drawn independently from their own categorical distribution over the bases A, C, G, and T, with the consensus sequence being the mode of the joint distribution. An alternative way of conceptualizing the consensus sequence arises from considering the individual distributions as being Gibbs distributions, with the probabilities p<sub>x</sub> being proportional to exp(-e<sub>x</sub>/(kT)) for the probability p<sub>x</sub> of a given base, with e<sub>x</sub> being a measure of some form of energy associated with that base. <br>
+
<p>The process of representing our promoter sequence generation in the form of a linear program led to an intriguing insight. When representing the promoter sequences as position-weight matrices, we are considering each base as being drawn independently from their own categorical distribution over the bases A, C, G, and T, with the consensus sequence being the mode of the joint distribution. An alternative way of conceptualizing the consensus sequence arises from considering the individual distributions as being Gibbs distributions, with the probabilities p<sub>x</sub> being proportional to exp(-e<sub>x</sub>/(kT)) for the probability p<sub>x</sub> of a given base, with e<sub>x</sub> being a measure of some form of energy associated with that base. <br>
 
We do not need to consider physical interpretation of these energies, but merely observe that as T goes to 0, the probability of sampling the consensus sequence approaches 1.
 
We do not need to consider physical interpretation of these energies, but merely observe that as T goes to 0, the probability of sampling the consensus sequence approaches 1.
 
We can then vary T to continuously move from the single consensus sequence to promoters with much higher diversity.<br><br>
 
We can then vary T to continuously move from the single consensus sequence to promoters with much higher diversity.<br><br>
  
 
We hoped that this would allow us to sample promoters of varying strengths from a single model, thus expanding a model of the promoter for a single ortholog group into an entire library of promoters with similar expression characteristics.<br><br>
 
We hoped that this would allow us to sample promoters of varying strengths from a single model, thus expanding a model of the promoter for a single ortholog group into an entire library of promoters with similar expression characteristics.<br><br>
 
+
</p>
For any model, validation is crucial. As <i>Aspergillus</i> is remarkably diverse, the possibility for validation was especially critical. [1] Thanks to the framework of sequence conservation and its relationship to information theory, we were able to procure a remarkably simple measure for “goodness of conservation”, namely the entropy of our models. Furthermore, we could use the score of the promoters of our specific target organisms against our models to give an indication of whether this model would be able to generate promoters that would work in <i>A. niger</i>.<br>
+
<p>For any model, validation is crucial. As <i>Aspergillus</i> is remarkably diverse, the possibility for validation was especially critical. [1] Thanks to the framework of sequence conservation and its relationship to information theory, we were able to procure a remarkably simple measure for “goodness of conservation”, namely the entropy of our models. Furthermore, we could use the score of the promoters of our specific target organisms against our models to give an indication of whether this model would be able to generate promoters that would work in <i>A. niger</i>.<br>
 
However, despite this great promise, we soon realized that we had taken a LEAP into the unknown. The evolutionary conservation of promoter sequences has simply not been studied sufficiently to allow us to determine the biological meaning of a given level of entropy. <br>
 
However, despite this great promise, we soon realized that we had taken a LEAP into the unknown. The evolutionary conservation of promoter sequences has simply not been studied sufficiently to allow us to determine the biological meaning of a given level of entropy. <br>
 
In fact, even from a purely mathematical point of view, the distribution of conservation of promoters is simply unknown<br>
 
In fact, even from a purely mathematical point of view, the distribution of conservation of promoters is simply unknown<br>
 
Not to let that stop us, we placed our trust in our models and reasoning, picked out genes with interesting expression levels, and using <a href=”https://2019.igem.org/Team:DTU-Denmark/Software” target=”_blank”>our software</a> and IDT’s gBlocks, we translated our design intentions into physical DNA.  
 
Not to let that stop us, we placed our trust in our models and reasoning, picked out genes with interesting expression levels, and using <a href=”https://2019.igem.org/Team:DTU-Denmark/Software” target=”_blank”>our software</a> and IDT’s gBlocks, we translated our design intentions into physical DNA.  
We did not expect all computed promoters to work <i>in vivo</i>, but we had no easy way of knowing without trying.<br>
+
We did not expect all computed promoters to work <i>in vivo</i>, but we had no easy way of knowing without trying.
This paid off handsomely: Not only did our homology modeling approach result in several novel synthetic promoters, which are compatible with multiple cloning standards and keep their dependencies on the growth phase. Our noise injection approach was validated, as we showed that we were able to generate “high-temperature” samples with promoter activity and to specifically weaken a strong promoter, <a href=http://parts.igem.org/wiki/index.php?title=Part:BBa_K3046003”” target=”_blank”>gdpA_1</a>, producing <a href=http://parts.igem.org/wiki/index.php?title=Part:BBa_K3046004”” target=”_blank”>gdpA_2</a> with similar promoter dynamics.  
+
So in the spirit of innovation and experimentation, we decided to try anyway<br>
 +
This paid off handsomely: Not only did our homology modeling approach result in several novel synthetic promoters, which are compatible with multiple cloning standards and keep their dependencies on the growth phase. Our noise injection approach was validated, as we showed that we were able to generate “high-temperature” samples with promoter activity and to specifically weaken a strong promoter, <a href=http://parts.igem.org/wiki/index.php?title=Part:BBa_K3046003”” target=”_blank”>gdpA_1</a>, producing <a href=http://parts.igem.org/wiki/index.php?title=Part:BBa_K3046004”” target=”_blank”>gdpA_2</a> with similar promoter dynamics. </p>
  
 
</p>
 
</p>

Revision as of 03:14, 22 October 2019

Model

Using RNA-Seq read counts as a proxy for gene expression, it is possible for genes with desirable properties, such as expression in specific growth phases or with a certain strength, to be identified. Based on this the upstream sequences of the genes should act as promoters, producing expression with those same properties. However, there are many reasons why such a native sequence might be undesirable. First of all, if your organism of choice has not been studied sufficiently, it may not even be possible to identify genes with a desirable expression. Even if you find them, the boundaries of their promoters are hard to define, and nothing guarantees that their sequences are compatible with your cloning or assembly method of choice. Worse yet, when employing these sequences in synthetic biology work, the large regions of homology with the genome of your target organism makes your constructs prone to inappropriate homologous recombination, complicating the construction of your strain and undermining its stability.
For these reasons and more, synthetic promoters with methodically defined boundaries that have a predictable performance with sequences dissimilar from anything found natively, yet complying with arbitrary requirements on their sequences, are highly desirable. But how can such promoters be obtained? Given the constant growth in whole-genome sequencing and increasing feasibility of working in non-model organisms, we sought an approach that did not require a detailed understanding of the intricacies of the regulatory mechanisms of a single organism but could leverage genomic data by exploiting general principles that hold true for all known life.
Inspired by the conservation of genes across taxonomic ranks as high as domains, we aimed to take advantage of homology to find promoters that would work across a broad range of species by making a simple assumption: homologous genes must have homologous promoters.

Background knowledge

To model the homology of promoter sequences between species, we needed a model that could take into account not only point mutations but also the possibility of insertions and deletions. For this purpose, we chose to use a hidden Markov model (HMM) which, as the name implies, is related to Markov processes, which can be thought of as a process which correlates the current state with previous states although it is not known what happened in the previous states. This “lack of memory” is called the Markov property. An HMM consists of a set of symbols on top of an underlying, but unobservable, Markov process. Each state in the process contains probabilities for emitting each symbol when the process enters that state. The adjective “hidden” refers to the fact that the sequence of states cannot be observed directly, only the sequence of emitted symbols.

HMMER structure

To tackle our original problem of modeling homology, while keeping it simple enough to handle computationally, we decided to use a HMM with the same structure as the one used in the HMMER software, illustrated. The HMMER structure can be thought of as an extension to a position-weight matrix, essentially considering our entire promoter sequence as a giant motif with structure on top.

Hidden Markov Model showing distribution of A, C, G, and T in M1, M2, M2, I1, I2, and I3.Fig. 1: Hidden markov model

An overview of a HMM can be seen in figure 1. The structure contains 3 types of nodes: match nodes Mi, which represent bases in the motif, and have emission probabilities for each possible base; delete nodes Di, which represent deletions, and don’t emit anything; and insert nodes Ii, that represent insertions, and have emission probabilities for each base.
From each match node Mi, there are only 3 non-zero transition probabilities: transitioning to the next match node Mi+1; to the insertion node Ii (which represents the beginning of an insertion); or to the delete state Di+1 (representing the next base being deleted).
From each delete node, Di, there are 2 non-zero transition probabilities: transitioning to Mi+1 (representing the next base being present), or to Di+1 (representing the next base being deleted as well).
From each insert node Ii, there are 2 non-zero transition probabilities as well: transitioning to Mi+1 (representing the insertion being complete}; or to Ii (representing the insertion continuing).

In this way, we can model deletions of multiple bases always occurring concurrently, but insertions are modeled as being of geometrically distributed length, and each base in an insertion is drawn from the same probability distribution.

Our model

With this model structure in hand, we could determine the parameters needed to construct a specialized model for each set of promoter sequences stemming from the orthologs of a specific gene. For this, after constructing an initial multiple sequence alignment, we employed an expectation-maximization approach, alternating between building an HMM-based on alignment and aligning the sequences to the HMM.

In an attempt to minimize our promoter lengths, we decided to take advantage of the conceptualization previously mentioned of the model structure as being an extension to a position-weight matrix, by simply extracting the PWM, thus ridding the sequences of insertions.

Sequence logo showing sequence conservation for DNA sequence.Fig. 2: Sequence logo.

To determine the model parameters, we used a set of scripts we wrote to transform the genomic data of the entire set of accessible Aspergillus spp. genomes, into a format we could use as an input for the HMMER software suite, and then further transformed the results into the minimized form described above.
Using these parameters, we could then set up a mixed-integer linear program whose feasible region would represent the set of synthesizable sequences complying with the hard domestication constraints described on the design page. These would also comply with an objective function pointing towards the sequence of lowest surprisal (a measure of how unlikely a sequence is, given the parameters). Thus, the solution would be the minimally perturbed consensus sequence, that still lives up to our domestication requirements and is possible to de novo synthesize.

The process of representing our promoter sequence generation in the form of a linear program led to an intriguing insight. When representing the promoter sequences as position-weight matrices, we are considering each base as being drawn independently from their own categorical distribution over the bases A, C, G, and T, with the consensus sequence being the mode of the joint distribution. An alternative way of conceptualizing the consensus sequence arises from considering the individual distributions as being Gibbs distributions, with the probabilities px being proportional to exp(-ex/(kT)) for the probability px of a given base, with ex being a measure of some form of energy associated with that base.
We do not need to consider physical interpretation of these energies, but merely observe that as T goes to 0, the probability of sampling the consensus sequence approaches 1. We can then vary T to continuously move from the single consensus sequence to promoters with much higher diversity.

We hoped that this would allow us to sample promoters of varying strengths from a single model, thus expanding a model of the promoter for a single ortholog group into an entire library of promoters with similar expression characteristics.

For any model, validation is crucial. As Aspergillus is remarkably diverse, the possibility for validation was especially critical. [1] Thanks to the framework of sequence conservation and its relationship to information theory, we were able to procure a remarkably simple measure for “goodness of conservation”, namely the entropy of our models. Furthermore, we could use the score of the promoters of our specific target organisms against our models to give an indication of whether this model would be able to generate promoters that would work in A. niger.
However, despite this great promise, we soon realized that we had taken a LEAP into the unknown. The evolutionary conservation of promoter sequences has simply not been studied sufficiently to allow us to determine the biological meaning of a given level of entropy.
In fact, even from a purely mathematical point of view, the distribution of conservation of promoters is simply unknown
Not to let that stop us, we placed our trust in our models and reasoning, picked out genes with interesting expression levels, and using our software and IDT’s gBlocks, we translated our design intentions into physical DNA. We did not expect all computed promoters to work in vivo, but we had no easy way of knowing without trying. So in the spirit of innovation and experimentation, we decided to try anyway
This paid off handsomely: Not only did our homology modeling approach result in several novel synthetic promoters, which are compatible with multiple cloning standards and keep their dependencies on the growth phase. Our noise injection approach was validated, as we showed that we were able to generate “high-temperature” samples with promoter activity and to specifically weaken a strong promoter, gdpA_1, producing gdpA_2 with similar promoter dynamics.



[1] J. G. Gibbons and A. Rokas, “The function and evolution of the Aspergillus genome,” Trends in Microbiology, vol. 21, no. 1. pp. 14–22, Jan-2013.

The logos of our three biggest supporters, DTU Blue Dot, Novo Nordisk fonden and Otto Mønsted fonden The logos of all of our sponsors, DTU, BioNordica, Eurofins Genomics, Qiagen, NEB New England biolabs, IDT Integrated DNA technologies and Twist bioscience