Team:UESTC-Software/Design

Description

...

Database

Starting from the more than 40,000 biobricks in the iGEM Registry, we integrated 10 databases to extend the available information and describe biobricks more accurately: UniProt, PromEC, EPD, QuickGO, KEGG, BRENDA, ExplorEnz, STRING, BioGRID and NCBI PubMed. All the information contained in BioMaster has been experimentally verified, which keeps the data reliable.
Fig.1 Database integration (All the above logos are from the official databases.)

Data Integration

In 2018, we used a local sequence alignment method based on BLAST (BLAST+ 2.7.1) to compare part sequences against all sequences in the other databases and find further biobrick-related information.
In 2019, we found that aligning only the full sequence of each part loses some necessary information. At the same time, the sequence annotations in the iGEM Registry contain many errors and redundancies, so the annotated regions cannot be relied on alone. To make the information more complete, we kept both the full sequence of each part and the feature-region sequences related to the 3 types of coding proteins as BLAST[1] queries, and aligned these sequences against those of the other databases.
Fig.2 Process of data integration
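The query preparation described above can be sketched as follows. This is a minimal illustration, not our production pipeline: the `Feature` class, the `build_queries` and `to_fasta` helpers, the query ID layout and the part ID `BBa_X0001` are all hypothetical names introduced for this example. It emits the full sequence of a part plus one query per annotated region, and skips annotations whose coordinates do not fit the sequence.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """One annotated region of a part; coordinates are 1-based and inclusive."""
    label: str
    start: int
    end: int


def build_queries(part_id: str, sequence: str, features: list[Feature]) -> list[tuple[str, str]]:
    """Return (query_id, sequence) pairs: the full part first, then each valid feature region."""
    queries = [(part_id, sequence)]
    for feat in features:
        # Skip annotations whose coordinates fall outside the sequence,
        # since Registry annotations can contain errors.
        if not (1 <= feat.start <= feat.end <= len(sequence)):
            continue
        region = sequence[feat.start - 1:feat.end]
        queries.append((f"{part_id}|{feat.label}|{feat.start}-{feat.end}", region))
    return queries


def to_fasta(queries: list[tuple[str, str]], width: int = 60) -> str:
    """Format the queries as FASTA text, wrapping sequence lines at `width` characters."""
    lines = []
    for query_id, seq in queries:
        lines.append(f">{query_id}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


queries = build_queries(
    "BBa_X0001",  # hypothetical part ID
    "ATGGCTAGCAAAGGAGAATAA",
    [Feature("cds", 1, 21), Feature("bad_annotation", 15, 40)],
)
fasta_text = to_fasta(queries)
```

The resulting FASTA text could then be passed to a BLAST+ program as the query file, for example with tabular output (`-outfmt 6`).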

Database Structure

This year, we used MySQL to store data for better stability and extensibility. We increased the number of tables and imported last year's CSV data into MySQL. To reduce redundancy and make the data easier to query, we improved the normalization of the schema and strengthened the relations among tables.
Fig.3 Database structure
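A minimal sketch of the idea, assuming a flat CSV in which a part is repeated once per cross-reference. The table names, column names and CSV rows below are hypothetical, and Python's built-in `sqlite3` module stands in for MySQL so that the example runs without a server. Each part is stored once, and its cross-references move into a separate link table.

```python
import csv
import io
import sqlite3

# Hypothetical flat CSV: one row per part/cross-reference pairing.
LEGACY_CSV = """part_id,part_type,uniprot_id
BBa_X0001,Coding,UNIPROT_A
BBa_X0001,Coding,UNIPROT_B
BBa_X0002,Promoter,
"""

conn = sqlite3.connect(":memory:")  # sqlite3 stands in for MySQL here
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE parts (
    part_id   TEXT PRIMARY KEY,
    part_type TEXT NOT NULL
);
CREATE TABLE part_uniprot (
    part_id    TEXT NOT NULL REFERENCES parts(part_id),
    uniprot_id TEXT NOT NULL,
    PRIMARY KEY (part_id, uniprot_id)
);
""")

for row in csv.DictReader(io.StringIO(LEGACY_CSV)):
    # Each part is stored once, however many CSV rows mention it.
    conn.execute(
        "INSERT OR IGNORE INTO parts (part_id, part_type) VALUES (?, ?)",
        (row["part_id"], row["part_type"]),
    )
    if row["uniprot_id"]:
        conn.execute(
            "INSERT OR IGNORE INTO part_uniprot (part_id, uniprot_id) VALUES (?, ?)",
            (row["part_id"], row["uniprot_id"]),
        )
conn.commit()

part_count = conn.execute("SELECT COUNT(*) FROM parts").fetchone()[0]
links = conn.execute(
    "SELECT p.part_id, u.uniprot_id FROM parts p "
    "JOIN part_uniprot u ON u.part_id = p.part_id ORDER BY u.uniprot_id"
).fetchall()
```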

iGEM Parts

We downloaded the latest iGEM parts tables directly from the iGEM Registry through its POINT-IN-TIME DATABASE DUMP to keep the information accurate and up to date. At the same time, we established and expanded the mapping relationships among the different databases based on the iGEM parts.
Fig.4 iGEM POINT-IN-TIME DATABASE DUMP

Interaction

We provide interaction information between biobricks and certain components. We collected the genes contained in each part from the iGEM Registry and analyzed their interactions in STRING[2]. BioGRID enriches the interaction information further.
To visualize the interaction between biobricks, we utilized the interactive graphics package Cytoscape.js to draw interactive scatter plots.
Fig.5 Interaction of biobricks
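Cytoscape.js accepts its graph as a list of elements, where a node carries `data.id` and an edge carries `data.source` and `data.target`. The sketch below shows one way interaction pairs could be turned into that structure on the server side. The helper name `to_cytoscape_elements`, the gene names and the scores are hypothetical.

```python
import json


def to_cytoscape_elements(interactions):
    """Convert (source, target, score) triples into Cytoscape.js element dicts."""
    nodes, edges = {}, []
    for source, target, score in interactions:
        # Register each gene once as a node, in order of first appearance.
        for name in (source, target):
            nodes.setdefault(name, {"group": "nodes", "data": {"id": name}})
        edges.append({
            "group": "edges",
            "data": {
                "id": f"{source}--{target}",
                "source": source,
                "target": target,
                "score": score,
            },
        })
    return list(nodes.values()) + edges


# Hypothetical gene pairs with hypothetical confidence scores.
elements = to_cytoscape_elements([
    ("geneA", "geneB", 0.92),
    ("geneA", "geneC", 0.71),
])
elements_json = json.dumps(elements, indent=2)
```

The JSON string can be handed to the page and passed as the `elements` option when the Cytoscape.js instance is created.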

Enzymes

Enzyme fragments are often the key to iGEM parts or devices. To give experimenters better guidance, KEGG, BRENDA[3] and ExplorEnz[4] were included to supplement the enzyme-related information. We used the EC number as the key to connect the enzyme-related databases. These databases supply information on enzyme activity conditions, reactants and products, references and so on.
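Joining on the EC number can be sketched as below. The function name `merge_by_ec`, the field names and the placeholder values are hypothetical; only the four-level EC number format and the name of EC 1.1.1.1 (alcohol dehydrogenase) are real. Partially specified classes such as "1.1.1.-" are skipped because they do not identify a single enzyme.

```python
import re

EC_PATTERN = re.compile(r"^\d+\.\d+\.\d+\.\d+$")  # fully specified EC numbers only


def merge_by_ec(**sources):
    """Merge per-database records that share an EC number into one record per enzyme."""
    merged = {}
    for source_name, records in sources.items():
        for ec_number, fields in records.items():
            if not EC_PATTERN.match(ec_number):
                continue  # skip partial entries such as "1.1.1.-"
            entry = merged.setdefault(ec_number, {})
            # Prefix each field with its source so databases never overwrite each other.
            for key, value in fields.items():
                entry[f"{source_name}.{key}"] = value
    return merged


# Placeholder records; the values in angle brackets are not real data.
merged = merge_by_ec(
    kegg={"1.1.1.1": {"name": "alcohol dehydrogenase"}},
    brenda={"1.1.1.1": {"conditions": "<activity conditions>"}},
    explorenz={
        "1.1.1.1": {"reaction": "<reactants and products>"},
        "1.1.1.-": {"reaction": "<incomplete class, skipped>"},
    },
)
```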

Search

To improve user experience, BioMaster offers several search methods. Multiple IDs (such as iGEM_ID, UniProt_ID, EC number, gene name or EPD_ID) and keywords can be used to find the corresponding biobricks. In addition, we provide a sequence alignment tool (BLAST[1]), which makes it possible to find matching biobricks directly from a sequence.
Fig.6 Different search methods
Keywords Search
The keyword search of BioMaster 1.0 was well received, and many users hoped we would improve it further. With the maintainability and scalability of the data in mind, we used Elasticsearch to implement keyword search.
ID Search
BioMaster supports many kinds of database ID retrieval methods, enabling users to get biobricks in a variety of ways and view the data from multiple perspectives.
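One way to route a search box entry to the right lookup is to recognize the ID type from its shape. The sketch below is an assumption about how this could be done, not a description of BioMaster's code: the `classify_query` helper and the `BBa_` pattern are illustrative, the UniProt pattern follows the accession format published by UniProt, and anything unrecognized (gene names, EPD IDs, free text) falls back to keyword search.

```python
import re

# Patterns are tried in order; the first full match decides the ID type.
ID_PATTERNS = [
    ("igem_id", re.compile(r"BBa_[A-Za-z0-9_]+")),
    ("ec_number", re.compile(r"\d+\.\d+\.\d+\.\d+")),
    ("uniprot_id", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    )),
]


def classify_query(text):
    """Return the recognized ID type, or "keyword" when no pattern matches."""
    query = text.strip()
    for id_type, pattern in ID_PATTERNS:
        if pattern.fullmatch(query):
            return id_type
    return "keyword"


# Example inputs; the part ID and accession are illustrative strings only.
kinds = [classify_query(q) for q in ("BBa_X0001", "1.1.1.1", "P12345", "lacZ")]
```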
Sequence Search
We noticed that many users wanted to search by sequence, so BioMaster provides sequence alignment search. Users can locate information by submitting a sequence and setting appropriate thresholds.
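Threshold filtering can be sketched on BLAST's default tabular output (`-outfmt 6`), whose twelve columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. The helper names, the default thresholds and the sample hits below are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    identity: float  # percent identity
    length: int      # alignment length
    evalue: float
    bitscore: float


def parse_tabular(text):
    """Parse BLAST tabular output (default -outfmt 6 columns) into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        hits.append(Hit(cols[0], cols[1], float(cols[2]), int(cols[3]),
                        float(cols[10]), float(cols[11])))
    return hits


def filter_hits(hits, min_identity=90.0, max_evalue=1e-5, min_length=50):
    """Keep hits that pass every threshold, best bit score first."""
    kept = [h for h in hits
            if h.identity >= min_identity
            and h.evalue <= max_evalue
            and h.length >= min_length]
    return sorted(kept, key=lambda h: h.bitscore, reverse=True)


# Hypothetical hits, written with explicit tabs to mirror the tabular format.
SAMPLE = "\n".join([
    "\t".join(["query1", "BBa_X0001", "98.50", "200", "3", "0", "1", "200", "1", "200", "1e-100", "360"]),
    "\t".join(["query1", "BBa_X0002", "75.00", "120", "30", "2", "5", "124", "10", "129", "2e-10", "80.5"]),
    "\t".join(["query1", "BBa_X0003", "99.00", "30", "0", "0", "1", "30", "1", "30", "1e-8", "55.4"]),
])
matches = filter_hits(parse_tabular(SAMPLE))
```

With these thresholds the second hit is dropped for low identity and the third for being too short.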
Team Wiki Search
Studying previous projects is very helpful for new teams. Building on version 1.0, we updated the team information and profile pictures of the 306 teams from 2018. Single awards are also listed for any team that has earned one. Users can now search by "Year & Single Award" to find the teams that won single awards in a given year.
Fig.7 Team wiki search

Cloud Database Service

Our project combines a local server with cloud storage. The files are stored in the Amazon S3 cloud storage service. Storing the data in the cloud safeguards the local project and makes it easier to provide data services to other teams.
Users can download biobrick data in FASTA and GenBank formats. The whole database is available for download in SQL, CSV and XML formats, so users can recreate an identical database from it.

Web Page

To make BioMaster more user-friendly, we designed a guiding home page and a top navigation bar. Screen adaptations for different devices were added this year. On the search results page, we modeled the side navigation and filtering on the designs of NCBI PubMed and UniProt. All of this is intended to create a smooth user experience.
Additionally, we used the Bootstrap responsive framework and other plugins in the front end to make the web pages responsive, which meets users' needs on mobile devices.
In the back end, we adopted the PHP framework Laravel. Laravel is one of the most popular and widely used PHP frameworks, valued for its simplicity and elegance, which frees us from messy code.
Docker is an open-source application container engine that lets developers package an application and its dependencies into a portable image and publish it to any popular Linux or Windows machine, which makes it easy to share a development environment.
We developed a version of BioMaster 2.0 that runs in Docker. After downloading the website files, users only need to install Docker and run the images to run the website independently. This makes the data self-contained and portable. The website is also completely open source: other developers can pull the image by following our documentation and share the development environment with us.
Elasticsearch, a widely used open-source search engine, was used for full-text keyword search.
BioMaster 1.0 implemented keyword search with MySQL full-text matching and its default dictionary, which performed poorly, offered little flexibility, and required the dictionary to be updated whenever the database was updated. To improve on this, we turned to Elasticsearch. Elasticsearch is a search engine based on the Lucene library; it is fast and offers a wide range of operations. Elasticsearch also does not need a separately maintained dictionary, so database updates do not affect search performance.
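As an illustration of what such a keyword search can look like, the sketch below builds an Elasticsearch request body using the `multi_match` query from the Elasticsearch query DSL. The field names (`name`, `description`, `notes`), the boost and the helper name are assumptions made for this example, not BioMaster's actual mapping.

```python
import json


def build_keyword_query(keywords, size=20):
    """Build an Elasticsearch search request body for a full-text keyword search."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": keywords,
                # Hypothetical field names; "^3" boosts matches in the part name.
                "fields": ["name^3", "description", "notes"],
            }
        },
        # Ask Elasticsearch to return highlighted snippets from the description.
        "highlight": {"fields": {"description": {}}},
    }


body = build_keyword_query("green fluorescent protein")
body_json = json.dumps(body)
```

The body would be sent to the `_search` endpoint of the index that holds the biobrick documents.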

User-friendliness

BioMaster made great efforts to improve user experience:
We designed a graphical guide home page and a simple, easy-to-use navigation bar, so that users have a pleasant experience on their first visit to BioMaster. We also integrated frequently used functions such as ID conversion between databases and the web version of NCBI BLAST, and improved the web design.
According to our survey, ranking of search results came first among the features users expected BioMaster to have. We therefore improved the search process and the display of results, and added filtering and sorting functions. By adjusting the ranking method on the search results page, users can get the results they want.
Furthermore, many users need to find the latest parts or parts with more descriptive information. To meet this demand, we devised a weight system. Users can adjust the weights of "document quantity", "keyword matching degree" and "descriptive information quality", and BioMaster shows the re-ranked recommendations on the refreshed page.
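A weight system of this kind can be sketched as a weighted sum over per-result scores. The formula, the assumption that each score is already scaled to the range 0 to 1, the key names and the sample values are all hypothetical; the sketch only shows how changing the weights changes the order.

```python
def rank_results(results, weights):
    """Sort results by the weighted sum of their scores, highest first."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Normalize so that only the ratio between the weights matters.
    normalized = {key: weight / total for key, weight in weights.items()}

    def score(item):
        return sum(normalized[key] * item[key] for key in normalized)

    return sorted(results, key=score, reverse=True)


# Hypothetical results with scores assumed to lie between 0 and 1.
results = [
    {"id": "BBa_X0001", "document_quantity": 0.9, "keyword_match": 0.4, "description_quality": 0.5},
    {"id": "BBa_X0002", "document_quantity": 0.2, "keyword_match": 0.9, "description_quality": 0.6},
]

by_documents = rank_results(
    results, {"document_quantity": 3, "keyword_match": 1, "description_quality": 1})
by_keywords = rank_results(
    results, {"document_quantity": 1, "keyword_match": 3, "description_quality": 1})
```

Favoring "document quantity" puts the first part on top, while favoring "keyword matching degree" puts the second part on top.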

New Function

EC Number Prediction

The EC number prediction tool[5] is an automated method that predicts enzymatic function, expressed as EC numbers, from amino acid sequences. It takes an ensemble approach, combining the results of 3 different predictors with different qualities. The tool provides probabilistic enzymatic function predictions for uncharacterized protein sequences.
Fig.8 EC prediction[6]
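The ensemble idea can be illustrated with a weighted mean of per-predictor probabilities. This is a generic sketch and not the tool's actual code: the predictor names, weights, scores and the 0.5 threshold are hypothetical, and the exact combination rule used by the tool is described in [5].

```python
def ensemble_probability(scores, weights):
    """Weighted mean of the per-predictor probabilities for one EC class."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same predictors")
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in scores) / total


def predict(scores, weights, threshold=0.5):
    """Return the combined probability and whether it reaches the threshold."""
    probability = ensemble_probability(scores, weights)
    return probability, probability >= threshold


# Hypothetical outputs of three predictors for one EC class, with
# hypothetical weights reflecting how much each predictor is trusted.
scores = {"predictor_a": 0.9, "predictor_b": 0.6, "predictor_c": 0.3}
weights = {"predictor_a": 0.5, "predictor_b": 0.3, "predictor_c": 0.2}
probability, is_member = predict(scores, weights)
```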

Visualization of SBOL Diagrams

People usually search for parts and works of interest in the SynBioHub design repository. When search results are displayed, the visual interface lets the user grasp the structure of a biobrick intuitively, which is very satisfying.
Fig.9 Example on SynBioHub
However, SynBioHub does not update its library contents in a timely manner and does not correct erroneous features. We therefore combined the contents of BioMaster 2.0 with SynBioHub to optimize the visual interface[7] and give users a better search experience.
Fig.10 Visual interface in BioMaster 2.0

SBOL Design Tool

SBOL Designer-3.0[8] is a Java-based visual design aid that assists users in designing new biobricks or simple gene sequences.
The software can be downloaded from GitHub, but most synthetic biology users are not familiar with Java and find it extremely difficult to install. For this reason, our team created a web-based SBOL[9] design tool that users can open with a single click. The web page visualizes the genetic design. With this tool, users can add different components such as promoters and CDSs, and attach information such as introductions and sequences to them. Once the design is complete, users can save it as a picture or in GenBank format, which helps them store the design to a certain extent.
Fig.11 SBOL-design tool
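The data behind such a design can be sketched as an ordered list of components. The class names, fields and example sequences below are hypothetical and do not reflect the tool's internal model. The sketch shows how the full sequence and the 1-based, inclusive feature coordinates needed for a GenBank-style export follow from the component order.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    role: str  # e.g. "promoter", "RBS", "CDS"
    name: str
    sequence: str
    description: str = ""


@dataclass
class Design:
    name: str
    components: list[Component] = field(default_factory=list)

    def add(self, component: Component) -> "Design":
        """Append a component and return the design so calls can be chained."""
        self.components.append(component)
        return self

    def sequence(self) -> str:
        """Full sequence: the component sequences joined in design order."""
        return "".join(c.sequence for c in self.components)

    def features(self) -> list[tuple[str, str, int, int]]:
        """(role, name, start, end) per component, 1-based and inclusive."""
        out, offset = [], 0
        for c in self.components:
            start = offset + 1
            offset += len(c.sequence)
            out.append((c.role, c.name, start, offset))
        return out


# Hypothetical components with short placeholder sequences.
design = (
    Design("demo_device")
    .add(Component("promoter", "p_demo", "TTGACA"))
    .add(Component("RBS", "rbs_demo", "AGGAGG"))
    .add(Component("CDS", "cds_demo", "ATGAAATAA", description="placeholder coding sequence"))
)
feature_table = design.features()
```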

Others

We inherited last year's promoter prediction tools. Please refer to BioMaster1.0_description and BioMaster1.0_model for details.
For users' convenience, we added two small but useful functions to BioMaster 2.0: NCBI BLAST within the website, preset with commonly used parameters, and an ID conversion tool between databases, implemented with the UniProt API.

Reference

Reference for Database

[1] McGinnis, Scott, and Thomas L. Madden. "BLAST: at the core of a powerful and diverse set of sequence analysis tools." Nucleic acids research 32.suppl_2 (2004): W20-W25.

[2] Szklarczyk, Damian, et al. "The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible." Nucleic acids research (2016): gkw937.

[3] Lisa Jeske, Sandra Placzek, Ida Schomburg, Antje Chang, Dietmar Schomburg, BRENDA in 2019: a European ELIXIR core data resource, Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D542–D549, https://doi.org/10.1093/nar/gky1048

[4] McDonald, A.G., Boyce, S. and Tipton, K.F. ExplorEnz: the primary source of the IUBMB enzyme list. Nucleic Acids Res. 37, D593–D597 (2009). [DOI: 10.1093/nar/gkn582]

Reference for EC Number Prediction

[5] Dalkiran A, Rifaioglu AS, Martin MJ, Cetin-Atalay R, Atalay V, Doğan T. ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinformatics. 2018;19(1):334. Published 2018 Sep 21. doi:10.1186/s12859-018-2368-y

[6] UniProt Consortium T. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2018;46:2699. doi: 10.1093/nar/gky092.
Madden T. The BLAST sequence analysis tool. Bethesda (MD): National Center for Biotechnology Information (US); 2013.
Rice P, Longden I, Bleasby A. EMBOSS: the European molecular biology open software suite. Trends Genet. 2000;16:276–277. doi: 10.1016/S0168-9525(00)02024-2.
Chang C, Lin C. LIBSVM : a library for support vector machines. ACM Trans Intell Syst Technol. 2013;2:1–39. doi: 10.1145/1961189.1961199.
Joachims, T., 1999. Making large-Scale SVM Learning Practical (Book Chapter). Advances in Kernel Methods—Support Vector Learning, MIT Press.

Reference for SBOL Diagrams

[7] Structure and Function in Biological Design with SBOL 2.0. Nicholas Roehner, et al. ACS Synth. Biol. 2016, 5 (6), 498-506

[8] Synthetic Biology Open Language Visual (SBOL Visual) Version 2.1. Curtis Madsen, et al. Journal of Integrative Bioinformatics, Volume 16, Issue 2, 20180101, ISSN (Online) 1613-4516

[9] Synthetic Biology Open Language (SBOL) Version 2.3. Curtis Madsen et al. Journal of Integrative Bioinformatics, Volume 16, Issue 2, 20190025, ISSN (Online) 1613-4516