Difference between revisions of "Team:Vilnius-Lithuania/Software"

Line 921: Line 921:
 
        
 
        
 
                 <div class="content-block">
 
                 <div class="content-block">
                    <!--TEKSTAS-->
+
<h4 class="page-heading">Motivation</h4>
                    <img src="https://2019.igem.org/wiki/images/d/d3/T--Vilnius-Lithuania--Software-1.gif" style="width:50%;margin-left:25%"><br><br>
+
 
                    <img src="https://2019.igem.org/wiki/images/2/22/T--Vilnius-Lithuania--Software-2.gif" style="width:50%;margin-left:25%"><br><br>
+
<p>Recent advances in metagenomics suggest that metagenomes hold untapped potential to provide humanity with new drugs, antibiotics, industrial enzymes, and more. However, designing functional metagenomic screens is a labor-intensive and time-consuming process, and an essential part of it is knowing where to look. Extremophiles with enzymes that have biotechnological applications can be lurking anywhere from hydrothermal vents to permafrost or volcanic regions. Thankfully, metagenomic sequences from these and many more environments are being extracted faster and more cheaply than ever before, creating terabytes upon terabytes of information. Unfortunately, because of the complicated analysis pipeline and the sophisticated software it requires, these data goldmines have been locked away from the public and, more importantly, from researchers. Given the complexity of this field and the amount of knowledge it requires, it is not surprising that these issues have remained unchallenged, but solving them would unlock these data mines and quite possibly mark the beginning of a new era in metagenomic research.</p><br>
 +
 +
<p>The Vilnius-Lithuania 2019 team proposes a novel approach to in-silico metagenomic research that could change the way we think about metagenomics. We tackled the problem at its core and came up with a solution that addresses a variety of the problems metagenomics holds. Our software provides an easy-to-use interface and makes it possible to extract highly rewarding results. It takes advantage of the <a href="https://www.mg-rast.org/" class="pdf-link">MG-RAST</a> database API and allows users to search for and download hundreds of thousands of metagenomic samples to our servers with just one click in their web browser. It then automatically standardizes and cleans the selected sequences and prepares them for further analysis. The user can choose from a variety of pre-made protein profiles obtained from the Pfam database; the primary tool used for similarity search is HMMER. Thus, the software allows anyone with only a basic understanding of IT to contribute to metagenomic research and help advance science.</p><br>
 +
 
 +
<h4 class="page-heading">Tackling the problem</h4>
 +
 
 +
<p>The primary objective of our IT team was to screen metagenomic databases for light-inducible protein sequences for our wet lab team. For this task, we chose a popular tool, <a href="http://hmmer.org/" class="pdf-link">HMMER</a>, which uses profile hidden Markov models (HMMs) to search a profile against a sequence database. We chose to download our sequences from MG-RAST, currently the largest professionally maintained metagenomic database. The first problem we encountered was gathering the sample sequences from the database: we discovered that access, even to public data, is quite limited. For example, we were unable to download the data in bulk. We therefore consulted an experienced researcher in this field, Dr. Kazlauskas. We found that his team had had the same issue with MG-RAST and had likewise been unable to perform such downloads. During the discussion, we learned that even the downloading tool alone (which we built later on) would be a valuable asset to researchers in this field. We saw that the need for such software was pressing. This was the beginning of our journey towards the Big-Seq software.</p><br>
 +
 
 +
<p>It didn't take long for us to discover that the data-gathering problem was just the tip of the iceberg and that much bigger problems lay underneath. We discovered why performing such an analysis is so hard, and the few scientific articles that mentioned MG-RAST didn't offer any promising solutions either. The specific needs of a protein discovery analysis are:</p><br>
 +
 
 +
<ul style="margin-left:16px!important">
 +
<li>Getting to know a selected metagenomic database takes time, especially if we intend to use its application programming interface (API). Most databases do not offer convenient tools to download the data.</li>
 +
 
 +
<li>Messy data – varying standards and dozens of formats bring confusion and make it hard to adapt software settings to every kind of metagenomic data.</li>
 +
 
 +
<li>Storage space – a personal computer usually does not have a few spare terabytes to store the downloaded data. This is also a considerable problem in academic research, where computers are usually not equipped with this kind of storage capacity.</li>
 +
 
 +
<li>Processing power – the processing power that tools like our sequence extractor and indexer require is usually found only in server-grade machines built specifically for these kinds of tasks.</li>
 +
 
 +
<li>Knowledge of particular tools (HMMER, BLAST, MMseqs2) – it takes weeks of testing, learning, and configuring to make the software work just the way you intend, and for someone entirely new to this, it may seem daunting and time-consuming.</li>
 +
 
 +
<li>Programming knowledge and experience – since the data varies from one database to another, the researcher will almost inevitably run into problems such as format conversion or removing an unwanted character. These tasks require decent scripting knowledge, which is a must in this field yet not something one can pick up on the go.</li>
 +
 
 +
<li>Bioinformatics knowledge (profile building, file reading, formats) – a general understanding of what the FASTA format is, why a profile needs to be pressed, how to handle errors, and many more topics that researchers encounter in this and similar analyses.</li>
 +
<li>Linux operating system – although our primary analysis tool, HMMER, may have a version for the Windows operating system, the complementary applications usually don't. In most cases, they are built for Linux, so the consensus is to work in a Linux environment for this type of analysis.</li>
 +
 
 +
</ul>
 +
<br>
 +
 
 +
<h4 class="page-heading">The solution</h4>
 +
 
 +
<p>Our state-of-the-art software features a user-friendly interface and a self-explanatory protein discovery pipeline, while also providing Pfam protein profiles for users to choose from.</p><br>
 +
 
 +
<p>The whole protein discovery pipeline involves several steps. The first is gathering the metagenomic data in which to search for proteins. We took advantage of the MG-RAST database API and built a more convenient search engine and data extraction tool that can download huge volumes of data with little to no effort. This approach serves any student, scientist, or enthusiast who wants to analyze metagenomic data but cannot because the data is so hard to reach.</p><br>
 +
 
 +
<p>After the data is downloaded, the next step is cleaning and standardizing it. We came up with a solution to the well-known "duplicates and residues" problem, which could also solve similar problems in other big-data fields. Our solution enables users to prepare their data for any tool and analysis by cleaning and standardizing it. The output data can be used for further analysis with other popular tools.</p><br>
 +
 
 +
<p>The last step is a protein profile scan against a sequence database. The sophisticated yet effective open-source tool HMMER and its miniapps now work in the background, so users do not need to set complicated options in the terminal for each task or spend countless hours reading the manual. Our servers also provide the necessary computational power, enabling users to perform the analysis from any machine that has access to the internet.</p><br>
 +
 +
<h4 class="page-heading">How it works</h4>
 +
 
 +
<p>Type any keyword of your choice, for example a biome name such as "water", in the search bar and let the engine display the search results.</p><br>
 +
 
 +
<img src="https://2019.igem.org/wiki/images/d/d3/T--Vilnius-Lithuania--Software-1.gif" style="width:50%;margin-left:25%"><br><br>
 +
 
 +
<p>At this point, our server sends a request with the user's keyword to the MG-RAST API, which searches the MG-RAST database. If any data is associated with the keyword, it is downloaded automatically to our server. The files should then appear in the results section on our website.</p><br>
 +
 
 +
<img src="https://2019.igem.org/wiki/images/2/22/T--Vilnius-Lithuania--Software-2.gif" style="width:50%;margin-left:25%"><br><br>
 +
 
 +
<p>After the data has been downloaded, you are asked to select a protein profile from the drop-down menu. The menu contains the majority of popular protein profiles from Pfam.</p><br>
 +
 
 
                     <img src="https://2019.igem.org/wiki/images/1/18/T--Vilnius-Lithuania--Software-3.gif" style="width:50%;margin-left:25%"><br><br>
 
 +
 +
<p>The last step is to analyze the data. At this point, everything happens "behind the curtains." The downloaded data is scanned for duplicate sequences, and any that are found are removed from the original file. The sequence file also goes through "standardization," which removes illegal characters and sequences that are too short. After that, the files are ready for HMMER analysis, which searches your downloaded sequence database for proteins similar to the selected profile.</p><br>
 +
 +
<p>Output sequences show up in the bottom section of the website. These are the sequences that HMMER determined to be similar to the profile of your choice.</p><br>
 +
  
 
                     <img src="https://2019.igem.org/wiki/images/6/6b/T--Vilnius-Lithuania--Software-4.png" style="width:100%"><br><br>
 
 +
 +
 +
<h4 class="page-heading">Software integration</h4>
 +
 +
 +
<p>The FASTA file format used in our software is a standard in bioinformatics and is recognized by popular tools such as BLAST, HMMER, MMseqs2, and many more. Output data can be used for further analysis with similar tools. Not only are the input and output data in a well-known standard format, but the HMMER software that runs in the background is also one of the most popular tools for this kind of task.</p><br>
 +
 +
<h4 class="page-heading">Pfam</h4>
 +
 +
<p>Pfam is a protein family database that conveniently stores HMMER-generated profiles. HMM profiles serve as the description of the selected protein family that is searched against the sequence database. They are provided in our application's drop-down menu for users to choose from. They are essential for protein discovery analysis, as they are the reference against which HMMER compares sequences.</p><br>
 +
 +
<h4 class="page-heading">Hidden Markov Models</h4>
 +
 +
<p>HMMER is a sequence analysis tool based on profile hidden Markov models (HMMs), written by Sean Eddy. It is widely used in genomics for sequence search and serves our purpose well while being more precise than other popular tools, such as BLAST. HMMER is the primary analysis software in our pipeline: it scans a sequence database and compares the sequences against a selected profile.</p><br>
 +
 +
<h4 class="page-heading">Optimization tools</h4>
 +
 +
<p>One of our great concerns was finding a way to remove duplicated sequences from the large files of initial data. Duplicates in the analyzed data can cause unexpected results, and most tools do not tolerate such inconsistencies, so removing them is a must before we can continue with our analysis. One of the options we tried was the popular clustering tool MMseqs2, suggested by our leading bioinformatics expert, Dr. Justas Dapkunas. Although it is a prominent tool, it still left duplicates in the output data. We discovered that this was a standing <a href="https://github.com/soedinglab/MMseqs2/issues/180" class="pdf-link">issue</a>, so we decided to write our own algorithm, which not only removes the duplicates but also cleans the file of other residues. Our algorithm allows us to standardize the file for the requirements of a specific tool, in our case HMMER. It is a long-needed tool for the scientific community, and it helps to advance metagenomics even further.</p><br>
 +
 +
<p>HMMER's requirements for specific tasks allowed us to define the standards our data must meet. For example, a profile database has to be "pressed" with the hmmpress program before it can be used for profile database scanning. In this case, the protein profile has to have the right extension and has to be prepared with the same version of the HMMER software.</p><br>
 +
 +
<h4 class="page-heading">Code</h4>
 +
 +
<p>The code is written in PHP, JavaScript, and Bash, and uses the Bootstrap framework. Everything is set up on a Linux Ubuntu 18.04 LTS server.</p><br>
 +
 +
<h4 class="page-heading">Future improvements</h4>
 +
 +
<p>Considering the fascinating applications of this software, we are currently planning the following improvements:</p><br>
 +
<ul style="margin-left:16px!important">
 +
<li>Clustering the most used data and maintaining it on our server for even faster analysis.</li>
 +
<li>Building a passive protein-miner bot that would analyze the database over a long period and store the findings in the database as a mapping file, so that eventually we could have a metagenomic protein indexing engine.</li>
 +
<li>For code improvements, we are planning to rebuild our application with more modern technologies such as Go, React, Node, and Python.</li>
 +
<li>Hardware improvements would include more storage, more memory, and higher processing power.</li>
 +
</ul><br>
 +
 +
 +
<h4 class="page-heading">Potential</h4>
 +
 +
<p>By providing users with our software, we give them the ability to find proteins no one has seen before. These could include numerous drugs, antibiotics, enzymes, and many other protein compounds new to the world, leading to advances in medicine, biology, and chemistry. With future improvements, it might become a cornerstone technology used in everyday biotechnology to look for novel parts in metagenomic samples.</p>
 +
 +
  
 
                     <img src="https://2019.igem.org/wiki/images/3/38/T--Vilnius-Lithuania--Software-5.png" style="width:100%"><br><br>
 

Revision as of 02:58, 22 October 2019

Software

Motivation

Recent advances in metagenomics suggest that metagenomes hold untapped potential to provide humanity with new drugs, antibiotics, industrial enzymes, and more. However, designing functional metagenomic screens is a labor-intensive and time-consuming process, and an essential part of it is knowing where to look. Extremophiles with enzymes that have biotechnological applications can be lurking anywhere from hydrothermal vents to permafrost or volcanic regions. Thankfully, metagenomic sequences from these and many more environments are being extracted faster and more cheaply than ever before, creating terabytes upon terabytes of information. Unfortunately, because of the complicated analysis pipeline and the sophisticated software it requires, these data goldmines have been locked away from the public and, more importantly, from researchers. Given the complexity of this field and the amount of knowledge it requires, it is not surprising that these issues have remained unchallenged, but solving them would unlock these data mines and quite possibly mark the beginning of a new era in metagenomic research.


The Vilnius-Lithuania 2019 team proposes a novel approach to in-silico metagenomic research that could change the way we think about metagenomics. We tackled the problem at its core and came up with a solution that addresses a variety of the problems metagenomics holds. Our software provides an easy-to-use interface and makes it possible to extract highly rewarding results. It takes advantage of the MG-RAST database API and allows users to search for and download hundreds of thousands of metagenomic samples to our servers with just one click in their web browser. It then automatically standardizes and cleans the selected sequences and prepares them for further analysis. The user can choose from a variety of pre-made protein profiles obtained from the Pfam database; the primary tool used for similarity search is HMMER. Thus, the software allows anyone with only a basic understanding of IT to contribute to metagenomic research and help advance science.


Tackling the problem

The primary objective of our IT team was to screen metagenomic databases for light-inducible protein sequences for our wet lab team. For this task, we chose a popular tool, HMMER, which uses profile hidden Markov models (HMMs) to search a profile against a sequence database. We chose to download our sequences from MG-RAST, currently the largest professionally maintained metagenomic database. The first problem we encountered was gathering the sample sequences from the database: we discovered that access, even to public data, is quite limited. For example, we were unable to download the data in bulk. We therefore consulted an experienced researcher in this field, Dr. Kazlauskas. We found that his team had had the same issue with MG-RAST and had likewise been unable to perform such downloads. During the discussion, we learned that even the downloading tool alone (which we built later on) would be a valuable asset to researchers in this field. We saw that the need for such software was pressing. This was the beginning of our journey towards the Big-Seq software.


It didn't take long for us to discover that the data-gathering problem was just the tip of the iceberg and that much bigger problems lay underneath. We discovered why performing such an analysis is so hard, and the few scientific articles that mentioned MG-RAST didn't offer any promising solutions either. The specific needs of a protein discovery analysis are:


  • Getting to know a selected metagenomic database takes time, especially if we intend to use its application programming interface (API). Most databases do not offer convenient tools to download the data.
  • Messy data – varying standards and dozens of formats bring confusion and make it hard to adapt software settings to every kind of metagenomic data.
  • Storage space – a personal computer usually does not have a few spare terabytes to store the downloaded data. This is also a considerable problem in academic research, where computers are usually not equipped with this kind of storage capacity.
  • Processing power – the processing power that tools like our sequence extractor and indexer require is usually found only in server-grade machines built specifically for these kinds of tasks.
  • Knowledge of particular tools (HMMER, BLAST, MMseqs2) – it takes weeks of testing, learning, and configuring to make the software work just the way you intend, and for someone entirely new to this, it may seem daunting and time-consuming.
  • Programming knowledge and experience – since the data varies from one database to another, the researcher will almost inevitably run into problems such as format conversion or removing an unwanted character. These tasks require decent scripting knowledge, which is a must in this field yet not something one can pick up on the go.
  • Bioinformatics knowledge (profile building, file reading, formats) – a general understanding of what the FASTA format is, why a profile needs to be pressed, how to handle errors, and many more topics that researchers encounter in this and similar analyses.
  • Linux operating system – although our primary analysis tool, HMMER, may have a version for the Windows operating system, the complementary applications usually don't. In most cases, they are built for Linux, so the consensus is to work in a Linux environment for this type of analysis.

The solution

Our state-of-the-art software features a user-friendly interface and a self-explanatory protein discovery pipeline, while also providing Pfam protein profiles for users to choose from.


The whole protein discovery pipeline involves several steps. The first is gathering the metagenomic data in which to search for proteins. We took advantage of the MG-RAST database API and built a more convenient search engine and data extraction tool that can download huge volumes of data with little to no effort. This approach serves any student, scientist, or enthusiast who wants to analyze metagenomic data but cannot because the data is so hard to reach.


After the data is downloaded, the next step is cleaning and standardizing it. We came up with a solution to the well-known "duplicates and residues" problem, which could also solve similar problems in other big-data fields. Our solution enables users to prepare their data for any tool and analysis by cleaning and standardizing it. The output data can be used for further analysis with other popular tools.


The last step is a protein profile scan against a sequence database. The sophisticated yet effective open-source tool HMMER and its miniapps now work in the background, so users do not need to set complicated options in the terminal for each task or spend countless hours reading the manual. Our servers also provide the necessary computational power, enabling users to perform the analysis from any machine that has access to the internet.


How it works

Type any keyword of your choice, for example a biome name such as "water", in the search bar and let the engine display the search results.




At this point, our server sends a request with the user's keyword to the MG-RAST API, which searches the MG-RAST database. If any data is associated with the keyword, it is downloaded automatically to our server. The files should then appear in the results section on our website.
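
The request step can be pictured with a small sketch. It is not the Big-Seq code (which is written in PHP and Bash); it only shows how a keyword might be turned into a search URL. The base URL, the `/search` path, and the `all` and `limit` parameter names are assumptions that should be checked against the MG-RAST API documentation before use.

```python
from urllib.parse import urlencode

# Assumed base URL and endpoint; verify against the MG-RAST API docs.
API_BASE = "https://api.mg-rast.org"


def build_search_url(keyword, limit=25, base=API_BASE):
    """Return a search URL for a free-text keyword.

    The parameter names ("all", "limit") are hypothetical placeholders
    for whatever the real API expects.
    """
    keyword = keyword.strip()
    if not keyword:
        raise ValueError("keyword must not be empty")
    query = urlencode({"all": keyword, "limit": limit})
    return f"{base}/search?{query}"


# Example: the biome keyword used in the walkthrough above.
search_url = build_search_url("water")
```

Building the URL separately from sending it keeps the keyword validation and encoding easy to test without touching the network.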




After the data has been downloaded, you are asked to select a protein profile from the drop-down menu. The menu contains the majority of popular protein profiles from Pfam.




The last step is to analyze the data. At this point, everything happens "behind the curtains." The downloaded data is scanned for duplicate sequences, and any that are found are removed from the original file. The sequence file also goes through "standardization," which removes illegal characters and sequences that are too short. After that, the files are ready for HMMER analysis, which searches your downloaded sequence database for proteins similar to the selected profile.
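
As an illustration of the "standardization" step, here is a minimal sketch that reads FASTA records, strips illegal characters, and drops sequences that are too short. It assumes protein sequences; the allowed alphabet and the `MIN_LENGTH` cutoff are hypothetical choices, since the text does not state the values Big-Seq uses.

```python
import re

# Assumption: keep the 20 standard amino acids plus X (unknown residue).
ILLEGAL = re.compile(r"[^ACDEFGHIKLMNPQRSTVWYX]")
MIN_LENGTH = 30  # hypothetical cutoff, not taken from the text


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def standardize(records, min_length=MIN_LENGTH):
    """Upper-case each sequence, strip illegal characters, drop short ones."""
    for header, seq in records:
        cleaned = ILLEGAL.sub("", seq.upper())
        if len(cleaned) >= min_length:
            yield header, cleaned
```

Both functions are generators, so a large file can be processed record by record instead of being loaded into memory at once.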


Output sequences show up in the bottom section of the website. These are the sequences that HMMER determined to be similar to the profile of your choice.
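
To give an idea of how such hits can be read programmatically, the sketch below parses the tabular output that `hmmsearch` writes with its `--tblout` option. It relies on the documented column order (target name first, query name third, full-sequence E-value fifth, bit score sixth). The E-value threshold is an illustrative assumption, and this is not necessarily how Big-Seq collects its results.

```python
def parse_tblout(lines):
    """Yield (target, query, evalue, score) from hmmsearch --tblout lines."""
    for line in lines:
        # Comment lines start with '#'; blank lines carry no data.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # 0: target name, 2: query name, 4: full-sequence E-value, 5: bit score
        yield fields[0], fields[2], float(fields[4]), float(fields[5])


def significant_hits(lines, max_evalue=1e-5):
    """Return hits whose full-sequence E-value is at or below max_evalue."""
    return [hit for hit in parse_tblout(lines) if hit[2] <= max_evalue]
```

A lower E-value means a more significant match, so tightening `max_evalue` trades sensitivity for fewer false positives.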




Software integration

The FASTA file format used in our software is a standard in bioinformatics and is recognized by popular tools such as BLAST, HMMER, MMseqs2, and many more. Output data can be used for further analysis with similar tools. Not only are the input and output data in a well-known standard format, but the HMMER software that runs in the background is also one of the most popular tools for this kind of task.


Pfam

Pfam is a protein family database that conveniently stores HMMER-generated profiles. HMM profiles serve as the description of the selected protein family that is searched against the sequence database. They are provided in our application's drop-down menu for users to choose from. They are essential for protein discovery analysis, as they are the reference against which HMMER compares sequences.


Hidden Markov Models

HMMER is a sequence analysis tool based on profile hidden Markov models (HMMs), written by Sean Eddy. It is widely used in genomics for sequence search and serves our purpose well while being more precise than other popular tools, such as BLAST. HMMER is the primary analysis software in our pipeline: it scans a sequence database and compares the sequences against a selected profile.
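
For readers curious about what runs in the background, the following sketch assembles an `hmmsearch` command line of the kind a web front end could hand to the server. The options `--tblout`, `-E`, and `--cpu` are standard HMMER options; the file names and default values are hypothetical, and the exact invocation used by Big-Seq is not given in the text.

```python
import shlex


def hmmsearch_command(profile, seqdb, tblout, evalue=1e-5, cpus=2):
    """Build the argument list for: hmmsearch [options] <hmmfile> <seqdb>."""
    return [
        "hmmsearch",
        "--tblout", str(tblout),  # machine-readable table of per-sequence hits
        "-E", str(evalue),        # report sequences with E-value <= this value
        "--cpu", str(cpus),       # number of worker threads
        str(profile),
        str(seqdb),
    ]


def as_shell_string(argv):
    """Quote each argument so the command can be logged or pasted safely."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

Passing the list form to a process launcher such as `subprocess.run` avoids shell quoting problems when file names contain spaces.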


Optimization tools

One of our great concerns was finding a way to remove duplicated sequences from the large files of initial data. Duplicates in the analyzed data can cause unexpected results, and most tools do not tolerate such inconsistencies, so removing them is a must before we can continue with our analysis. One of the options we tried was the popular clustering tool MMseqs2, suggested by our leading bioinformatics expert, Dr. Justas Dapkunas. Although it is a prominent tool, it still left duplicates in the output data. We discovered that this was a standing issue, so we decided to write our own algorithm, which not only removes the duplicates but also cleans the file of other residues. Our algorithm allows us to standardize the file for the requirements of a specific tool, in our case HMMER. It is a long-needed tool for the scientific community, and it helps to advance metagenomics even further.
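
The team's algorithm is not published in this text, so the following is only one possible way to remove exact duplicates from a large stream of records: remember a fixed-size hash of every sequence seen so far and skip any record whose hash is already known. The choice of SHA-1 and the case-insensitive comparison are assumptions made for this sketch.

```python
import hashlib


def deduplicate(records):
    """Yield the first record of each distinct sequence; skip later copies.

    `records` is an iterable of (header, sequence) pairs. Only a 20-byte
    digest per unique sequence is kept in memory, not the sequence itself.
    """
    seen = set()
    for header, seq in records:
        digest = hashlib.sha1(seq.upper().encode("ascii", "ignore")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield header, seq
```

This catches only identical sequences; near-duplicates would still need a clustering tool with an identity threshold.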


HMMER's requirements for specific tasks allowed us to define the standards our data must meet. For example, a profile database has to be "pressed" with the hmmpress program before it can be used for profile database scanning. In this case, the protein profile has to have the right extension and has to be prepared with the same version of the HMMER software.
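
A pipeline can check this requirement before starting a scan. `hmmpress` writes four binary index files next to the profile database, with the suffixes `.h3m`, `.h3i`, `.h3f`, and `.h3p`. The sketch below only tests that those files exist; it does not verify that they were built with a matching HMMER version, and the function names are hypothetical.

```python
from pathlib import Path

# Index files that hmmpress creates alongside <name>.hmm
PRESSED_SUFFIXES = (".h3m", ".h3i", ".h3f", ".h3p")


def missing_index_files(hmm_path):
    """Return the names of hmmpress index files that are not present."""
    hmm_path = Path(hmm_path)
    return [
        hmm_path.name + suffix
        for suffix in PRESSED_SUFFIXES
        if not Path(str(hmm_path) + suffix).exists()
    ]


def is_pressed(hmm_path):
    """True when all four hmmpress index files exist next to the profile."""
    return not missing_index_files(hmm_path)
```

If `is_pressed` returns False, running `hmmpress` on the profile file produces the missing indexes.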


Code

The code is written in PHP, JavaScript, and Bash, and uses the Bootstrap framework. Everything is set up on a Linux Ubuntu 18.04 LTS server.


Future improvements

Considering the fascinating applications of this software, we are currently planning the following improvements:


  • Clustering the most used data and maintaining it on our server for even faster analysis.
  • Building a passive protein-miner bot that would analyze the database over a long period and store the findings in the database as a mapping file, so that eventually we could have a metagenomic protein indexing engine.
  • For code improvements, we are planning to rebuild our application with more modern technologies such as Go, React, Node, and Python.
  • Hardware improvements would include more storage, more memory, and higher processing power.

Potential

By providing users with our software, we give them the ability to find proteins no one has seen before. These could include numerous drugs, antibiotics, enzymes, and many other protein compounds new to the world, leading to advances in medicine, biology, and chemistry. With future improvements, it might become a cornerstone technology used in everyday biotechnology to look for novel parts in metagenomic samples.