Difference between revisions of "Team:DTU-Denmark/Software"

(Prototype team page)
 
 
(22 intermediate revisions by 4 users not shown)
Line 1: Line 1:
 
{{DTU-Denmark}}
 
{{DTU-Denmark}}
 +
{{DTU-Denmark/mainCSS}}
 +
{{DTU-Denmark/carouselCSS}}
 
<html>
 
<html>
  
  
<div class="column full_size judges-will-not-evaluate">
+
 
<h3>★  ALERT! </h3>
+
<img class="effect" id="effect2" src="https://static.igem.org/mediawiki/2019/e/ea/T--DTU-Denmark--background3.png">
<p>This page is used by the judges to evaluate your team for the <a href="https://2019.igem.org/Judging/Medals">medal criterion</a> or <a href="https://2019.igem.org/Judging/Awards"> award listed below</a>. </p>
+
<img class="effect" id="effect3" src="https://static.igem.org/mediawiki/2019/4/4b/T--DTU-Denmark--background4.png">
<p> Delete this box in order to be evaluated for this medal criterion and/or award. See more information at <a href="https://2019.igem.org/Judging/Pages_for_Awards"> Instructions for Pages for awards</a>.</p>
+
 
 +
 
 +
<head>
 +
<meta name="viewport" content="width=device-width, initial-scale=1">
 +
 
 +
</head>
 +
 
 +
<section id="top" class="header_con ">
 +
<div class="container">
 +
<div class="row flex-center sm-no-flex" style="padding-top:75px;">
 +
 
 +
 
 +
 
 +
 +
 
 +
<div class=" topimg pull-right sm-no-float col-md-7 ">
 +
 +
 
 +
 
 +
<img src="https://static.igem.org/mediawiki/2019/0/08/T--DTU-Denmark--modelheader.svg" alt="" style="margin-top:75px;max-width:70%;margin-right:auto; margin-left:auto;display: block;
 +
    height:auto;z-index:5;">
 +
 
 +
 
 +
</div><!-- /end col-md-8 -->
 +
 +
     
 +
     
 +
      <div class="pull-left col-md-5 sm-text-center com-sm-12 ">
 +
<div class="team-overview" style="font-size: 1.0em;">
 +
<h2>Software</h2>
 +
 +
<p><div>We have developed new state-of-the-art software that can generate promoter sequences for different organisms. This page provides that software and takes LEAP towards making custom synthetic promoters accessible to all.</div></p>
 +
</div>
 +
</div><!-- /end col-md-4 -->
 +
     
 +
     
 +
     
 +
     
 +
   
 +
   
 +
   
 +
   
 +
   
 +
</div><!-- /end row -->
 +
 
 +
 
 +
 
 +
 
 +
 
 +
</div><!-- /end container -->
 +
</section>
 +
 
 +
 
 +
 
 +
 
 +
<section class="grey_con">
 +
  <div class="container">
 +
 
 +
 
 +
<div class="row flex-center sm-no-flex interlabspace">
 +
 
 +
 +
 
 +
 
 +
 
 +
<div class="sm-no-float col-md-10">
 +
 
 +
<h2>Introduction</h2>
 +
<p>To generate our synthetic promoters, we built promoter sequence models based on hidden Markov models (HMMs) (see <a target="_blank" href="https://2019.igem.org/Team:DTU-Denmark/Model">the model page</a>). To handle the many calculations and large volumes of data needed to build these models, we wrote a user-friendly program that does all the work. We call it <code>proHMMoter</code>, and you can download it <a target="_blank" href="https://static.igem.org/mediawiki/2019/2/24/T--DTU-Denmark--proHMMoter.zip"><b>here</b></a>. Besides making the model significantly easier and faster to work with, it also allowed us to implement all the features we wanted our promoters to have in a way that future users can easily extend and modify.
 +
<code>proHMMoter</code> has been written as a command-line program that ties a small suite of scripts together in a single convenient command. Due to the command-line interface as well as consistent naming and location of output files, users can easily embed our software as a tool in a bigger pipeline.
 +
 
 +
</p>
 +
 
 +
 
 +
  </div>
 +
 
 +
 
 +
 
 
</div>
 
</div>
  
 +
<div class="row flex-center sm-no-flex interlabspace">
  
<div class="clear"></div>
+
 +
      <div class="sm-no-float col-md-10">
 +
 
 +
 
 +
 
 +
<h2>Features</h2>
 +
<p>
 +
<dl>
 +
<dt>Domestication</dt>
 +
<dd><code>proHMMoter</code> outputs promoter sequences that are compatible with iGEM’s Type IIS standards, and incorporates a penalty system that can automatically domesticate for other relevant synthetic biology standards when this can be done without major disruption.</dd>
 +
<dt>Shortening sequences</dt>
 +
<dd><code>proHMMoter</code> can identify evolutionarily non-conserved regions. Accordingly, the program can remove non-essential regions upstream of the actual promoter region, creating shorter promoter sequences. This can be essential when working with higher eukaryotes, as these gene constructs can get very large, and thereby more difficult and expensive to work with.</dd>
 +
<dt>Non-homology</dt>
 +
<dd><code>proHMMoter</code> can generate promoter sequences that in theory should have the same function as the original promoter, but with a drastically different sequence. This can be very useful, since it reduces the chance of unwanted homologous recombination. </dd>
 +
<dt>Synthesizability</dt>
 +
<dd>The promoters generated by <code>proHMMoter</code> automatically comply with the complexity rules of major DNA synthesis vendors, making it easy to get your freshly generated promoters synthesised.</dd>
 +
<dt>Host versatility</dt>
 +
<dd><code>proHMMoter</code> can generate consensus sequences representing the promoters in diverse sets of genomes. As long as these genomes are sufficiently related, the promoters should be usable in any of the organisms, and further <i>in silico</i> analysis can help verify this. As a proof of concept, we generated our synthetic <i>Aspergillus niger</i> promoters based on all the genomes within the  <i>Aspergillus</i> genus.</dd>
 +
<dt>Induction of noise</dt>
 +
<dd><code>proHMMoter</code> has built-in functionality that lets the user inject varying amounts of noise into the promoter generation process. This makes it possible to create artificially weakened promoters, and thereby promoters with different strengths but broadly similar expression dynamics. For our project, we injected noise into several of our promoters and showed that expression was still achieved, a step towards the aim of building a comprehensive promoter ladder.</dd>
 +
</dl>
 +
 
 +
</p>
  
  
<div class="column full_size">
 
<h1>Software</h1>
 
 
</div>
 
</div>
<div class="column two_thirds_size">
 
<h3>Best Software Tool Special Prize</h3>
 
<p>Regardless of the topic, iGEM projects often create or adapt computational tools to move the project forward. Because they are born out of a direct practical need, these software tools (or new computational methods) can be surprisingly useful for other teams. Without necessarily being big or complex, they can make the crucial difference to a project's success. This award tries to find and honor such "nuggets" of computational work.
 
  
  
<br><br>
+
 
To compete for the <a href="https://2019.igem.org/Judging/Awards">Best Software Tool prize</a>, please describe your work on this page and also fill out the description on the <a href="https://2019.igem.org/Judging/Judging_Form">judging form</a>.
+
 
<br><br>
+
 
To be eligible, your software has to be documented and made available under an OSI approved open source license. You must also delete the alert box on the top of this page to be eligible for this prize.
+
  </div>
 +
 
 +
<div class="row flex-center sm-no-flex interlabspace">
 +
 
 +
 +
 
 +
 
 +
 
 +
<div class="sm-no-float col-md-10">
 +
 
 +
<h2>How it works</h2>
 +
 
 +
<figure>
 +
<img style="padding:28px; width:100%" src="https://static.igem.org/mediawiki/2019/c/ce/T--DTU-Denmark--MikkelMarcusGraf.png" alt="A flow chart showing arrows, symbolizing actions, pointing to nodes, symbolizing the files these actions generate, starting from annotated genomes and choices for candidate genes, and ending at both consensus promoters and promoters with artificial noise." class=""/>
 +
<figcaption>Fig. 1: Flowchart showing how <code>proHMMoter</code> works.  </figcaption>
 +
 
 +
</figure>
 +
As Figure 1 illustrates, the program requires two inputs: the genome data from the set of organisms you want the modeling process to take into account, and a set of protein-coding candidate genes with valuable expression characteristics. To generate the synthetic promoters for our project, we took genome data from Mycocosm[1] and used RNA-Seq data to identify candidate genes by their expression levels.
 +
<p>The program first searches for orthologs of the genes you have chosen, using the HMMER[3] tool <code>phmmer</code> to compare amino acid sequences; as an approximation, it takes the highest-scoring homolog in each genome to be the ortholog.
 +
Once the orthologs are found, the program locates them in their respective genomes and extracts the putative promoters upstream of these genes. These promoter regions are then aligned with the multiple sequence alignment tool MUSCLE[2]. From the multiple sequence alignment we then build the actual HMM, using the HMMER tool <code>hmmbuild</code>. To further refine this model, we perform an expectation maximization step, wherein we align all native promoter sequences to the model using <code>hmmalign</code>, then rebuild the HMM from this alignment while trimming unaligned ends, and iterate until the consensus sequence converges. To ensure minimal promoter lengths, we extract the position-weight matrix (PWM) underlying the HMM and use it to generate the synthetic promoters. Implementing the linear program described on <a target="_blank" href="https://2019.igem.org/Team:DTU-Denmark/Model">our model page</a>
 +
in clear, concise code that is easy to extend, we apply constraints such as limitations on homopolymers and GC content to reduce synthesis complexity to levels considered acceptable to major vendors, and domesticate with respect to the features considered in
 +
<a target="_blank" href="https://2019.igem.org/Team:DTU-Denmark/Design_Promoter">our design</a>, complying with strictly required synthetic biology standards and making a best effort to follow merely desirable ones. The latter is accomplished via a penalty system, which ensures that the amount of sequence conservation (measured in bits, as sequence conservation is essentially a measure of information) sacrificed when domesticating stays below preset levels.
 +
We then solve this linear program, finding the sequence which, with respect to sequence conservation, comes as close to the consensus sequence as the constraints allow.
 +
We also adjust the distribution implied by the PWM as described on <a target="_blank" href="https://2019.igem.org/Team:DTU-Denmark/Model">our model page</a> and sample from it, using the linear program to reject non-solutions, thus fairly sampling promoter sequences that meet our defined standards.
 +
In our project, we showed that our promoters still work after a substantial amount of noise has been injected, and that the injected noise resulted, as intended, in a relatively weak promoter.
 +
 
 
</p>
 
</p>
 +
 +
 +
  </div>
 +
  
  
 
</div>
 
</div>
  
<div class="column third_size">
+
<div class="row flex-center sm-no-flex">
<div class="highlight decoration_A_full">
+
 
<h3> Inspiration </h3>
+
 +
      <div class="sm-no-float col-md-10">
 +
 
 +
 
 +
 
 +
<h2>Usability</h2>
 
<p>
 
<p>
Here are a few examples from previous teams:
+
Since we wanted to ensure that <code>proHMMoter</code> would be useful for future teams that might not have strong bioinformatics experience, it has been designed in a way that requires minimal programming/computational knowledge to use. A README file has been written on what programs and files are needed to run the software and how to obtain and organize those files. To further assist in the process, an example of how to find genome data and what format it should be in has been included.
 +
To help us test if our software was sufficiently user-friendly for fellow synthetic biologists to make productive use of, the BrownStanfordPrinceton team offered to test it for us. Their story can be read <a target="_blank" href="https://2019.igem.org/Team:BrownStanfordPrinctn/Collaborations">here</a>. We were very pleased to hear from the team that they found our software easy to use, and that they consider its capabilities valuable. This also confirms that the program works consistently, even with other organisms besides <i>Aspergillus</i>.
 +
 
 +
<h2>Publication of the source code</h2>
 +
If you now want to generate synthetic promoters too, you can contact us for our software (cwor@dtu.dk).
 +
The files relevant to you as a user of the software are as follows:
 +
<dl>
 +
<dt>prohmmoter.py</dt>
 +
<dd>The program to run.</dd>
 +
 
 +
<dt>README.md </dt>
 +
<dd>README file describing how to use the program, what files to include, and the output formats.</dd>
 +
 
 +
<dt>Quick guide to Mycocosm</dt>
 +
<dd>A guide to finding and downloading the necessary genome data from Mycocosm, which hosts a large collection of fungal genomes.</dd>
 +
 
 +
<dt>hmm-match-state-emission.jl</dt>
 +
<dd>If you want to modify the constraints that the generated sequences will live up to, or adjust the noise levels, this is the file to edit.</dd>
 +
 
 +
<dt>LICENSE</dt>
 +
<dd>The GNU GPLv3 license</dd>
 +
</dl>
 +
 
 +
 
 
</p>
 
</p>
<ul>
+
 
<li><a href="https://2016.igem.org/Team:BostonU_HW">2016 BostonU HW</a></li>
+
 
<li><a href="https://2016.igem.org/Team:Valencia_UPV">2016 Valencia UPV</a></li>
+
<li><a href="https://2014.igem.org/Team:Heidelberg/Software">2014 Heidelberg</a></li>
+
<li><a href="https://2014.igem.org/Team:Aachen/Project/Measurement_Device#Software">2014 Aachen</a></li>
+
</ul>
+
 
</div>
 
</div>
 +
 +
 +
 +
 +
 +
  </div>
 +
 +
<div class="column full_size" >
 +
 +
<p style="color:#000; font-size:14px;"><br><br>
 +
[1] The genome portal of the Department of Energy Joint Genome Institute: 2014 updates.
 +
Nordberg H, Cantor M, Dusheyko S, Hua S, Poliakov A, Shabalov I, Smirnova T, Grigoriev IV, Dubchak I.
 +
Nucleic Acids Res. 2014,42(1):D26-31.
 +
 +
[2] R. C. Edgar, “MUSCLE: Multiple sequence alignment with high accuracy and high throughput,” Nucleic Acids Res., 2004.
 +
 +
[3] http://hmmer.org/.
 +
</p>
 +
 +
  </div>
 
</div>
 
</div>
 +
 +
 +
</section>
 +
 +
 +
 +
 +
 +
 +
    <footer id="myFooter" class="blue">
 +
        <div class="container">
 +
            <div class="row">
 +
                <div class="col-sm-2">
 +
                                     
 +
  <a href="https://2019.igem.org/Team:DTU-Denmark">
 +
                  <img id="footylogo" src="https://static.igem.org/mediawiki/2019/f/fe/T--DTU-Denmark--happylogotemp.png" style="max-width: 50%;padding-top:30px;display: flex;margin-left:auto; margin-right:auto;">
 +
    </a>
 +
                </div>
 +
                <div class="col-sm-3">
 +
                    <h5>Useful links</h5>
 +
                    <ul>
 +
                      <p class="footer-links">
 +
<a href="https://2019.igem.org/Team:DTU-Denmark">Home</a>
 +
·
 +
<a href="https://2019.igem.org/Team:DTU-Denmark/Description">Project description</a>
 +
·
 +
                      <a href="https://2019.igem.org/Team:DTU-Denmark/Model">Modelling</a>
 +
·
 +
<a href="https://2019.igem.org/Team:DTU-Denmark/Parts">Parts overview</a>
 +
·
 +
<a href="https://2019.igem.org/Team:DTU-Denmark/Notebook">Notebook</a>
 +
·
 +
<a href="https://2019.igem.org/Team:DTU-Denmark/Team">Team</a>
 +
·
 +
<a href="https://2019.igem.org/Team:DTU-Denmark/Safety">Safety</a>
 +
·
 +
<a href="https://2019.igem.org/Team:DTU-Denmark/Human_Practices">Human practices</a>
 +
</p>
 +
                        <li style="font-size: 32px;"><a href="#top"><i class="fa fa-arrow-up" title="Back to top"></i> </a></li>
 +
                     
 +
                    </ul>
 +
                </div>
 +
                <div class="col-sm-4">
 +
                    <h5>About us</h5>
 +
                    <ul>
 +
                        <li><a href="mailto:dtubiobuilders@gmail.com">dtubiobuilders@gmail.com</a></li>
 +
                        <li><i class="fa fa-map-marker"></i><a href="https://www.google.dk/maps/@55.7854419,12.5196676,16z"> Anker Engelunds Vej 1 Bygning 101A, 2800 Kgs. Lyngby, Denmark</a></li>
 +
                       
 +
                    </ul>
 +
                </div>
 +
               
 +
                <div class="col-sm-3">
 +
                    <div class="social-networks">
 +
                        <a href="https://twitter.com/iGEM_DTU" class="twitter"><i class="fa fa-twitter"></i></a>
 +
                        <a href="https://www.facebook.com/dtubiobuilders?fref=ts" class="facebook"><i class="fa fa-facebook"></i></a>
 +
                        <a href="https://www.instagram.com/igem_dtubiobuilders/" class="insta"><i class="fa fab fa-instagram"></i></a>
 +
                    </div>
 +
                    <button type="button" class="btn btn-default" onclick="window.location.href='mailto:dtubiobuilders@gmail.com'">Contact us</button>
 +
                </div>
 +
            </div>
 +
        </div>
 +
       
 +
    </footer>
 +
<img class="footergrants" src="https://static.igem.org/mediawiki/2019/9/91/T--DTU-Denmark--biggrants.svg" title="The logos of our three biggest supporters, DTU Blue Dot, Novo Nordisk fonden and Otto Mønsted fonden" alt="The logos of our three biggest supporters, DTU Blue Dot, Novo Nordisk fonden and Otto Mønsted fonden">
 +
<img class="footersponsors" src="https://static.igem.org/mediawiki/2019/d/d9/T--DTU-Denmark--sponsorsfooter.svg" title="The logos of all of our sponsors, DTU, BioNordica, Eurofins Genomics, Qiagen, NEB New England biolabs, IDT Integrated DNA technologies and Twist bioscience" alt="The logos of all of our sponsors, DTU, BioNordica, Eurofins Genomics, Qiagen, NEB New England biolabs, IDT Integrated DNA technologies and Twist bioscience">
 +
</div></div>
 +
 +
  
 
</html>
 
</html>

Latest revision as of 09:14, 19 November 2019

Software

We have developed new state-of-the-art software that can generate promoter sequences for different organisms. This page provides that software and takes LEAP towards making custom synthetic promoters accessible to all.

Introduction

To generate our synthetic promoters, we built promoter sequence models based on hidden Markov models (HMMs) (see the model page). To handle the many calculations and large volumes of data needed to build these models, we wrote a user-friendly program that does all the work. We call it proHMMoter, and you can download it here. Besides making the model significantly easier and faster to work with, it also allowed us to implement all the features we wanted our promoters to have in a way that future users can easily extend and modify. proHMMoter has been written as a command-line program that ties a small suite of scripts together in a single convenient command. Due to the command-line interface as well as consistent naming and location of output files, users can easily embed our software as a tool in a bigger pipeline.

Features

Domestication
proHMMoter outputs promoter sequences that are compatible with iGEM’s Type IIS standards, and incorporates a penalty system that can automatically domesticate for other relevant synthetic biology standards when this can be done without major disruption.
Shortening sequences
proHMMoter can identify evolutionarily non-conserved regions. Accordingly, the program can remove non-essential regions upstream of the actual promoter region, creating shorter promoter sequences. This can be essential when working with higher eukaryotes, as these gene constructs can get very large, and thereby more difficult and expensive to work with.
Non-homology
proHMMoter can generate promoter sequences that in theory should have the same function as the original promoter, but with a drastically different sequence. This can be very useful, since it reduces the chance of unwanted homologous recombination.
Synthesizability
The promoters generated by proHMMoter automatically comply with the complexity rules of major DNA synthesis vendors, making it easy to get your freshly generated promoters synthesised.
Host versatility
proHMMoter can generate consensus sequences representing the promoters in diverse sets of genomes. As long as these genomes are sufficiently related, the promoters should be usable in any of the organisms, and further in silico analysis can help verify this. As a proof of concept, we generated our synthetic Aspergillus niger promoters based on all the genomes within the Aspergillus genus.
Induction of noise
proHMMoter has built-in functionality that lets the user inject varying amounts of noise into the promoter generation process. This makes it possible to create artificially weakened promoters, and thereby promoters with different strengths but broadly similar expression dynamics. For our project, we injected noise into several of our promoters and showed that expression was still achieved, a step towards the aim of building a comprehensive promoter ladder.

How it works

Fig. 1: Flowchart showing how proHMMoter works.
As Figure 1 illustrates, the program requires two inputs: the genome data from the set of organisms you want the modeling process to take into account, and a set of protein-coding candidate genes with valuable expression characteristics. To generate the synthetic promoters for our project, we took genome data from Mycocosm[1] and used RNA-Seq data to identify candidate genes by their expression levels.

The program first searches for orthologs of the genes you have chosen, using the HMMER[3] tool phmmer to compare amino acid sequences; as an approximation, it takes the highest-scoring homolog in each genome to be the ortholog. Once the orthologs are found, the program locates them in their respective genomes and extracts the putative promoters upstream of these genes. These promoter regions are then aligned with the multiple sequence alignment tool MUSCLE[2]. From the multiple sequence alignment we then build the actual HMM, using the HMMER tool hmmbuild. To further refine this model, we perform an expectation maximization step, wherein we align all native promoter sequences to the model using hmmalign, then rebuild the HMM from this alignment while trimming unaligned ends, and iterate until the consensus sequence converges. To ensure minimal promoter lengths, we extract the position-weight matrix (PWM) underlying the HMM and use it to generate the synthetic promoters. Implementing the linear program described on our model page in clear, concise code that is easy to extend, we apply constraints such as limitations on homopolymers and GC content to reduce synthesis complexity to levels considered acceptable to major vendors, and domesticate with respect to the features considered in our design, complying with strictly required synthetic biology standards and making a best effort to follow merely desirable ones. The latter is accomplished via a penalty system, which ensures that the amount of sequence conservation (measured in bits, as sequence conservation is essentially a measure of information) sacrificed when domesticating stays below preset levels. We then solve this linear program, finding the sequence which, with respect to sequence conservation, comes as close to the consensus sequence as the constraints allow.
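The synthesis constraints and the conservation cost described above can be illustrated with a small Python sketch. It is neither the linear program from the model page nor code taken from proHMMoter. The function names, the thresholds (max_run, gc_range) and the way lost conservation is scored (log-odds against the most probable base) are assumptions made for illustration, and the PWM is assumed to be a list of per-position base-probability dictionaries.

```python
import math

BASES = "ACGT"

def column_information(col):
    """Information content (bits) of one PWM column, assuming a uniform background."""
    entropy = -sum(p * math.log2(p) for p in col.values() if p > 0)
    return 2.0 - entropy

def consensus(pwm):
    """Most probable base at every position (ties go to the first base in BASES)."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in pwm)

def bits_sacrificed(pwm, seq):
    """Summed log-odds (bits) lost by choosing seq instead of the consensus.

    Illustrative scoring only; the actual penalty is defined on the model page.
    """
    total = 0.0
    for col, base in zip(pwm, seq):
        if col[base] == 0:
            return math.inf  # a base never seen at this position
        total += math.log2(max(col.values()) / col[base])
    return total

def max_homopolymer(seq):
    """Length of the longest run of a single repeated base."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)

def is_synthesizable(seq, max_run=8, gc_range=(0.25, 0.65)):
    """Hypothetical vendor-style check; the thresholds are placeholders."""
    low, high = gc_range
    return max_homopolymer(seq) <= max_run and low <= gc_content(seq) <= high
```

A candidate sequence would then be kept only if is_synthesizable(seq) holds and bits_sacrificed(pwm, seq) stays under a chosen budget. The linear program reaches the same kind of result by optimization rather than by checking candidates one at a time.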
We also adjust the distribution implied by the PWM as described on our model page and sample from it, using the linear program to reject non-solutions, thus fairly sampling promoter sequences that meet our defined standards. In our project, we showed that our promoters still work after a substantial amount of noise has been injected, and that the injected noise resulted, as intended, in a relatively weak promoter.
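The sample-and-reject idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the proHMMoter implementation: noise is modelled here by mixing each PWM column with a uniform distribution (the actual adjustment is the one described on the model page), the accept callback stands in for the linear program's feasibility check, and all function and parameter names are hypothetical.

```python
import random

BASES = "ACGT"

def add_noise(col, noise):
    """Mix one PWM column with the uniform distribution.

    noise=0.0 leaves the column unchanged, noise=1.0 makes it uniform.
    (Assumed noise model, chosen for illustration.)
    """
    return {b: (1.0 - noise) * col[b] + noise * 0.25 for b in BASES}

def sample_sequence(pwm, noise, rng):
    """Draw one base per position from the noise-adjusted PWM."""
    seq = []
    for col in pwm:
        adjusted = add_noise(col, noise)
        weights = [adjusted[b] for b in BASES]
        seq.append(rng.choices(BASES, weights=weights)[0])
    return "".join(seq)

def sample_valid(pwm, noise, accept, rng, max_tries=10000):
    """Rejection sampling: redraw until accept(seq) is true.

    Every accepted sequence is an unbiased draw from the adjusted
    distribution restricted to the sequences that pass accept.
    """
    for _ in range(max_tries):
        seq = sample_sequence(pwm, noise, rng)
        if accept(seq):
            return seq
    raise RuntimeError("no acceptable sequence found within max_tries")
```

Because rejected draws are simply discarded, the constraints never distort the relative probabilities of the sequences that do pass, which is what makes the sampling fair. Passing a seeded random.Random instance as rng keeps runs reproducible.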

Usability

Since we wanted to ensure that proHMMoter would be useful for future teams that might not have strong bioinformatics experience, it has been designed in a way that requires minimal programming/computational knowledge to use. A README file has been written on what programs and files are needed to run the software and how to obtain and organize those files. To further assist in the process, an example of how to find genome data and what format it should be in has been included. To help us test if our software was sufficiently user-friendly for fellow synthetic biologists to make productive use of, the BrownStanfordPrinceton team offered to test it for us. Their story can be read here. We were very pleased to hear from the team that they found our software easy to use, and that they consider its capabilities valuable. This also confirms that the program works consistently, even with other organisms besides Aspergillus.

Publication of the source code

If you now want to generate synthetic promoters too, you can contact us for our software (cwor@dtu.dk). The files relevant to you as a user of the software are as follows:
prohmmoter.py
The program to run.
README.md
README file describing how to use the program, what files to include, and the output formats.
Quick guide to Mycocosm
A guide to finding and downloading the necessary genome data from Mycocosm, which hosts a large collection of fungal genomes.
hmm-match-state-emission.jl
If you want to modify the constraints that the generated sequences will live up to, or adjust the noise levels, this is the file to edit.
LICENSE
The GNU GPLv3 license



[1] The genome portal of the Department of Energy Joint Genome Institute: 2014 updates. Nordberg H, Cantor M, Dusheyko S, Hua S, Poliakov A, Shabalov I, Smirnova T, Grigoriev IV, Dubchak I. Nucleic Acids Res. 2014,42(1):D26-31. [2] R. C. Edgar, “MUSCLE: Multiple sequence alignment with high accuracy and high throughput,” Nucleic Acids Res., 2004. [3] http://hmmer.org/.

The logos of our three biggest supporters, DTU Blue Dot, Novo Nordisk fonden and Otto Mønsted fonden. The logos of all of our sponsors, DTU, BioNordica, Eurofins Genomics, Qiagen, NEB New England biolabs, IDT Integrated DNA technologies and Twist bioscience.