Team:UESTC-Software/Model

Description

...

Overview

We spent a great deal of time and energy constructing our models, which played an important role in our project.
To obtain more accurate data, we derived a formula to help filter the original BLAST results. Since the accuracy of the data directly affects the credibility of our project, we fitted the frequency histograms of the formula results with Poisson and normal distributions, and adjusted each coefficient in the formula until it performed well enough. Testing showed that the new filtration model performed much better than last year's. This model can be generalized to big-data processing of iGEM parts to help other iGEMers.
Additionally, the EC prediction model uses machine learning to construct multiple binary classifiers. It combines three independent predictors based on subsequences, sequence similarities, and nucleotide physicochemical features to predict a probabilistic EC number, helping synthetic biologists find the right enzymes.

BLAST Filtration Model

During data filtration, a single nucleotide sequence can yield many BLAST results[1]. We sorted these results in three ways: by score, by expect value, and by identity[2]. A 10% overlapping filtering strategy was then used to obtain the best-matched UniProt entry. However, the results of these three sorting methods differed considerably. We therefore found it better to build a single formula that combines all three values with adjustable parameters and coefficients[3]. The results are no longer obtained by one sorting method alone, which minimizes the deviation between single-method results and the real data.

Data Preparation

Get Data by Three Different Filtering Strategies


We downloaded the latest iGEM part tables directly from the iGEM Registry via its point-in-time database dump, including each iGEM ID and the corresponding feature information.
First, feature fragments that are themselves labeled as BioBricks were removed. For these deleted BioBricks we provide an external link when the final web page is displayed, so users can click through to the information on the shorter BioBricks contained in the current one.
Using the complete gene sequences and the feature information of the BioBricks, we cut out the fragment sequences corresponding to each feature while retaining the original complete sequences. From these sequences we generated the input dataset for the BLAST algorithm, with UniProt as the target database.
Fig.1 Take out "minimal BLAST sequence"
The original BLAST results were extracted into corresponding csv files. Because many BLAST results were obtained from the same query sequence, some of them had to be deleted. For this we used two strategies.
Fig.2 Sequence alignment
① Sort the BLAST Results of the Same Query Sequence by Different Parameters
The BLAST results were sorted by expect value, identity, and score. Score and identity were sorted in descending order, and expect value in ascending order. This step yields three different data lists for further processing.
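The three sorted lists can be produced with a sketch like the following (the hit records and field names are illustrative, not the actual columns of our csv files):

```python
# Illustrative hit records; real BLAST rows carry more fields.
hits = [
    {"uniprot": "P1", "score": 120.0, "identity": 0.95, "evalue": 1e-30},
    {"uniprot": "P2", "score": 200.0, "identity": 0.80, "evalue": 1e-50},
    {"uniprot": "P3", "score": 150.0, "identity": 0.99, "evalue": 1e-10},
]

by_score = sorted(hits, key=lambda h: h["score"], reverse=True)        # descending
by_identity = sorted(hits, key=lambda h: h["identity"], reverse=True)  # descending
by_evalue = sorted(hits, key=lambda h: h["evalue"])                    # ascending
```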
② 10% Overlapping Filtering Strategy
Fig.3 Preliminary filtration
Scan the BLAST results of the same fragment in sorted order. The first pointer points to the first row's BLAST result; the second pointer walks over the following rows in turn. If the overlap of the query sequences of the two data blocks exceeds 10%, the result pointed to by the second pointer is deleted. This continues until all results have been scanned, and the whole procedure repeats until the first pointer reaches the last data block.
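The scan above can be sketched in Python. This is a minimal sketch: the definition of "repetition rate" as the overlap fraction of the shorter query interval, and the `span` field, are assumptions for illustration.

```python
def overlap_fraction(a, b):
    """Fraction of the shorter query interval covered by the other hit.
    Intervals are (start, end) positions on the query sequence (illustrative)."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    if end <= start:
        return 0.0
    shorter = min(a[1] - a[0], b[1] - b[0])
    return (end - start) / shorter

def overlap_filter(hits, threshold=0.10):
    """Walk the hits in sorted (best-first) order; drop any later hit that
    overlaps an already-kept hit by more than the threshold."""
    kept = []
    for hit in hits:
        if all(overlap_fraction(hit["span"], k["span"]) <= threshold for k in kept):
            kept.append(hit)
    return kept
```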

Merge Data

After using the above two strategies to select the corresponding data under each sorting method and store it in csv files, the data filtered by the three methods could be merged.
① Internal Mergence
The original sequence and the sequences cut out by feature may share the same matching result; only one copy needs to be kept.
② External Mergence
Because the same result may occur two or three times across the lists, we keep only one copy and mark the source (identity, expect value, or score) at the beginning of the row to help later processing. Some merged results are shown below.
Table.1 Merged BLAST results
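The external mergence can be sketched as follows (keys and field names are illustrative, not our actual csv schema):

```python
def merge_results(results_by_method):
    """Keep each (query, UniProt) pair once and record which sorting
    method(s) produced it, mirroring the source marks described above."""
    merged = {}
    for method, hits in results_by_method.items():  # e.g. "identity", "evalue", "score"
        for h in hits:
            key = (h["query"], h["uniprot"])
            merged.setdefault(key, {"hit": h, "sources": []})["sources"].append(method)
    return merged
```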

Formula Derivation

To make the comparison possible, a preprocessing equation was needed to shrink each parameter into a common range. With the parameters made fairly comparable, we could then adjust the coefficient of each parameter to make the formula meet our functional expectations.

P_identity

$$V_{id}=\frac{the\; number\; of\; nucleotide\; matched\; completely}{the\; total\; number\; of\; nucleotide\; in\; the\; alignment}\tag{ 1 }$$
We directly used V_id as P_identity, the parameter related to identity in our formula.

P_score

Since the original score does not lie in (0, 1), a mathematical transformation was needed to bring it into that range so that multi-parameter filtering could be applied once a threshold was set. Following the BLAST+ documentation, we first divided the score by hsplen (the alignment length of the HSP) to obtain the mean score per nucleotide, and then divided by 7, the highest possible score of a single nucleotide: $$P_{score} = \frac{V_{score}}{7\times hsplen}\tag{ 2 }$$

P_evalue

To obtain P_evalue, the logarithm of the expect value was taken twice. First, the base-10 logarithm of the expect value was taken to reduce its dimension; this spread the expect values in the table over a range from about 1 to nearly 200 in absolute value, magnifying the differences between them before they were scaled back down to at most 1 in the next step. Since the maximum absolute value of this exponent was close to 200, its base-6 logarithm was then taken, shrinking it into (0, 3); dividing by 3 finally gave a value in (0, 1). For the case where the expect value is exactly 0, we set P_evalue to 1 directly. Through this transformation the result is negatively correlated with the original expect value: the larger the expect value, the smaller P_evalue. Since a smaller expect value means the hit is more likely to be correct, P_evalue is positively correlated with the accuracy of the result. $$P_{evalue} = \frac{\log_6({-\lg{V_{evalue}}})}{3}\qquad\tag{ 3 }$$

Threshold

After obtaining expressions for identity, score, and expect value, we needed to select a threshold for filtering. Hits above the threshold are kept; hits below it are discarded.
$$result=V_{id}\times w1+\frac{V_{score}}{7\times hsplen}\times w2+\frac{\log_6({-\lg{V_{evalue}}})}{3}\times w3\qquad\tag{ 4 }$$
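The three preprocessing steps and the combined score can be sketched as follows. This is a minimal sketch: the weights passed in are illustrative, and hits with an expect value of 0.1 or more would need extra guarding.

```python
import math

def p_identity(matched, total):
    """Eq. (1): fraction of completely matched nucleotides."""
    return matched / total

def p_score(score, hsplen):
    """Eq. (2): score normalised by alignment length and the
    maximum per-nucleotide score of 7."""
    return score / (7 * hsplen)

def p_evalue(evalue):
    """Eq. (3): double-logarithmic rescaling of the expect value.
    Assumes evalue < 0.1; evalue == 0 maps to 1 as described above."""
    if evalue == 0:
        return 1.0
    return math.log(-math.log10(evalue), 6) / 3

def result(matched, total, score, hsplen, evalue, w1, w2, w3):
    """Eq. (4): weighted combination of the three preprocessed values."""
    return (p_identity(matched, total) * w1
            + p_score(score, hsplen) * w2
            + p_evalue(evalue) * w3)
```

A hit is then kept when its result exceeds the chosen threshold.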

Multiple Combination

A total of 36 different coefficient combinations were set up, and the filtering results under 10 different threshold values were examined, giving 360 tables in total. The thresholds ranged from 0.40 to 0.85 in steps of 0.05, and the coefficient at each position ranged from 1.0 down to 0.8 in steps of 0.1. The structure of the tables is as follows:
Table.2 Merged BLAST results
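The parameter grid can be enumerated with a sketch like this. The candidate coefficient values are an assumption that only illustrates the enumeration pattern; the exact combinations that produce 36 groups are not fully specified above.

```python
import itertools

# Thresholds: 0.40 to 0.85 in steps of 0.05 (10 values, as in the text).
thresholds = [round(0.40 + 0.05 * i, 2) for i in range(10)]

# Candidate values per coefficient position -- an illustrative assumption.
candidates = [1.0, 0.9, 0.8]
coefficients = list(itertools.product(candidates, repeat=3))

# Every (coefficients, threshold) pair yields one filtered table.
grid = list(itertools.product(coefficients, thresholds))
```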

Coefficient and Threshold Modification

Threshold

Using all the results for one set of coefficients, we drew the corresponding frequency histogram and the probability density function fitted by a Poisson distribution. In each graph we marked with dotted lines the boundaries above which the area under the fitted probability density function was 0.15, 0.1, and 0.05, and saved the boundary values into lists L1, L2, and L3. The mean values of the three boundaries were 0.67, 0.66, and 0.63 respectively. To avoid the error of directly taking mean values, we used L1, L2, and L3 to draw another frequency histogram, fitted its probability density with a Gaussian distribution, and displayed the center line in the graph, from which the best threshold appeared to be around 0.66.
Fig.4 PDF fitted by Poisson distribution
Fig.5 PDF fitted by normal distribution
After comparing the result values obtained with different coefficients, we looked for further support in confirming the formula threshold. The following figure shows, over the range (0.5, 0.85), the number of mappings between UniProt IDs and iGEM IDs, the number of UniProt IDs, and the number of iGEM IDs. It can be seen that within (0.5, 0.75) these counts drop slowly and evenly, indicating that a threshold in this range preserves both the accuracy and the completeness of the data. Moreover, 0.65 is closest to retaining 5%-15% of the BLAST results, so we finally chose 0.65 as our threshold.
Fig.6 Comparison between 8 thresholds

Coefficient

In our analysis, a hit that appears in all three sorted lists (a triple intersection) is highly credible, so such cases can be regarded as positive samples.
The following figure shows the percentage of triple intersections in each data set produced during filtration. Choosing the best coefficient combination from the line with threshold 0.65, we settled on 1_5_4 as the coefficients and 0.65 as the threshold for our final filtering equation.
Fig.7 Comparison between 360 different coefficients
$$result=V_{id}\times w1+\frac{V_{score}}{7\times hsplen}\times w2+\frac{\log_6({-\lg{V_{evalue}}})}{3}\times w3\qquad\tag{ 4 }$$ A hit is saved if result > 0.65 and deleted otherwise.
The curves of different colors correspond to different thresholds; the abscissa of each point is the coefficient ratio of the filtering formula, and the ordinate is the proportion of triple intersections in the corresponding file.

Comparison of Accuracy and Data Amount with Last Year

We processed last year's BLAST results and computed the average of the three values over their data sets. The table below compares the three values between last year and this year. As the figure shows, this year's accuracy improved across the board, since the accuracy of the results is positively correlated with these values.
Fig.8 Comparison with last version-1
By comparing the total numbers of UniProt IDs and iGEM IDs, we can clearly see that this year's dataset is more complete than last year's.
Fig.9 Comparison with last version-2

Sample Test

Doubting the Accuracy of the iGEM Biobricks
Fig.10 Feature info of one Biobrick of iGEM Registry
According to the Feature table, BBa_C0052 should correspond to only one sequence (the 5th row in the table). In this case we should trust the joint choice of expect value and score and ignore the result recommended by identity. However, since the expect values and scores of the 2nd and 4th rows are very close to each other, and the identity of the 2nd row is very high, only the 4th row of results can be deleted. Two conclusions follow: first, our formula cannot make the results completely accurate; second, the feature table can only give a range of possible gene intervals.
Table.3 Origin data (before filtrate)
Fig.11 P16117 is one of the originally detected UniProt ID of Web BLAST in UniProt
Fig.12 P69202 is another one of the originally detected UniProt ID of Web BLAST in UniProt
Table.4 P69202 is deleted from our origin data after using our formula
BBa_K1740003

Click for more details

BBa_E0050

Click for more details

BBa_K1718004

Click for more details

Prediction Model

Our EC number prediction tool adopts a supervised ensemble classification approach, incorporating different predictors based on subsequences, sequence similarities, and nucleotide physicochemical features. The system was trained with the EC number annotations of characterized enzymes in the UniProtKB/Swiss-Prot database[4]. In the prediction results, the first digit of an EC number indicates which of the six main classes the annotated enzyme belongs to, the second digit the subclass, the third digit the sub-subclass, and the fourth digit the substrate of the enzyme[5] (e.g., EC 3.1.3.16, protein-serine/threonine phosphatase).
Fig.16 Prediction flow chart

Predictors

One of the predictors we used is the Subsequence Profile Map (SPMap)[6], a subsequence-based method for predicting protein functions.
In the subsequence extraction module, all possible subsequences of length 5 were extracted from the training dataset using the sliding-window technique.
Fig.17 Sliding window used for subsequence extraction.
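The sliding-window extraction amounts to taking every length-5 slice of the sequence:

```python
def subsequences(sequence, length=5):
    """Slide a window of the given length over the sequence with step 1,
    returning every subsequence."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]
```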
We used the Blocks substitution matrix (BLOSUM62)[7] to calculate the similarity score between two subsequences by a simple string-comparison procedure. The similarity score s(x, y) between two subsequences was calculated as follows.
$$s(x,y)=\sum\limits_{i=1}^{l}M(x(i),y(i))\tag{5 }$$
where l is the length of the subsequences, x(i) is the nucleotide at the ith position of subsequence x, and M(x(i), y(i)) is the BLOSUM62 score for the ith positions of x and y.
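Equation (5) amounts to summing substitution-matrix entries over aligned positions. A sketch with a tiny illustrative slice of the matrix (real use would load the full 20x20 BLOSUM62):

```python
# A tiny slice of BLOSUM62 for three symbols; the values match the
# real matrix entries for A, R and N.
M = {
    ("A", "A"): 4, ("R", "R"): 5, ("N", "N"): 6,
    ("A", "R"): -1, ("R", "A"): -1,
    ("A", "N"): -2, ("N", "A"): -2,
    ("R", "N"): 0, ("N", "R"): 0,
}

def similarity(x, y):
    """Eq. (5): sum of substitution-matrix scores over the aligned
    positions of two equal-length subsequences."""
    assert len(x) == len(y)
    return sum(M[(a, b)] for a, b in zip(x, y))
```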

Fig.18 BLOSUM 62

When creating the position-specific scoring matrix, let S_c denote the total number of subsequences in cluster c, and let aa_count(i, j) denote the count of nucleotide j at the ith position of a subsequence. PP_c(i, j) is then the smoothed, log-scaled probability of nucleotide j occurring at the ith position of a subsequence in cluster c. Next, feature vectors were generated: each subsequence ss was compared with the probabilistic profile PP_c, and the cth element of the feature vector V was determined as follows.[8]
$$PP_c(i,j)=\log{\frac{aa_{count}(i,j)+0.01}{S_c}}\tag{ 6 }$$
$$P(ss|PP_c)=\sum\limits_{i=1}^{l}PP_c(i,ss(i))\tag{ 7 }$$
$$V(c)=\max\limits_{ss\in E}{P(ss|PP_c)}\tag{ 8 }$$
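Equations (6)-(8) can be sketched as follows. The tiny alphabet and clusters are illustrative; SPMap's real implementation differs in scale and detail.

```python
import math

def build_profile(cluster, length, alphabet="ARN"):
    """Eq. (6): smoothed log-probability of each symbol at each position,
    built from the subsequences of one cluster."""
    s_c = len(cluster)
    counts = {(i, aa): 0 for i in range(length) for aa in alphabet}
    for ss in cluster:
        for i, aa in enumerate(ss):
            counts[(i, aa)] += 1
    return {k: math.log((v + 0.01) / s_c) for k, v in counts.items()}

def profile_score(ss, pp):
    """Eq. (7): score of one subsequence under profile PP_c."""
    return sum(pp[(i, aa)] for i, aa in enumerate(ss))

def feature_value(subseqs, pp):
    """Eq. (8): the cluster's feature-vector element is the best
    profile score over all subsequences of the protein."""
    return max(profile_score(ss, pp) for ss in subseqs)
```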
The same operations were applied to the proteins in both the positive and negative datasets, and finally a training file was created. A support vector machine (SVM) classifier was then used for classification.
The second predictor was based on sequence similarities: we used the k-nearest-neighbor algorithm to classify a target protein, with the similarities between the query protein and the proteins in the training dataset calculated by the NCBI-BLAST tool.[9] The last predictor was based on peptide physicochemical properties, computed with the Pepstats tool.

The Combination of Predictors


Fig.19 Structure of classifier
The tool combines the individual prediction scores of the three predictors. Five-fold cross-validation was applied to each method and the area under the receiver operating characteristic curve (AUROC) was calculated for each predictor. Using these AUROC values, the three methods were combined and a weighted mean score was calculated. The weight for method m is calculated as follows:[10]
$$W(m)=\frac{R_m^4}{R^4_{BLAST-kNN}+R^4_{SPMap}+R^4_{PEPSTATS-SVM}}\tag{ 9 }$$
where W(m) is the weight of method m and R_m is its AUROC value.
Each prediction score is multiplied by the class-specific weight and summed to produce the weighted mean score, which is the final prediction score of the query protein for that EC number.
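Equation (9) and the weighted combination can be sketched as follows (the AUROC values in the usage example are illustrative):

```python
def weights(auroc):
    """Eq. (9): each method's weight is the 4th power of its AUROC,
    normalised over all methods."""
    denom = sum(r ** 4 for r in auroc.values())
    return {m: r ** 4 / denom for m, r in auroc.items()}

def combined_score(scores, w):
    """Weighted mean of the per-method prediction scores for one EC class."""
    return sum(scores[m] * w[m] for m in scores)
```

Raising the AUROC to the 4th power sharpens the contrast between methods, so the better-performing predictor dominates the combination.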

Result

We used the F1 score to measure the accuracy of the test. It considers both the precision and the recall of the test: it is their harmonic mean, reaching its best value at 1 (perfect precision and recall) and its worst at 0. They are calculated as follows:
$$Recall = \frac{TP}{TP+FN}\tag{10}$$
$$Precision = \frac{TP}{TP+FP}\tag{ 11 }$$
$$F1\;score=2\times\frac{precision\times recall}{precision+recall}\tag{12}$$
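Equations (10)-(12) translate directly into code, starting from the raw confusion-matrix counts:

```python
def f1_score(tp, fp, fn):
    """Eqs. (10)-(12): precision, recall, and their harmonic mean,
    from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```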
We integrated data sets from several different sources to test the model. The proteins come from UniProtKB/Swiss-Prot, the Pfam protein families database[11], and the COFACTOR dataset[12]; none of them were used in training. The test results below indicate the effectiveness of our prediction tool:
Table.9 The test results.

Reference

Reference for BLAST Filtration

[1] O'Driscoll A, Belogrudov V, Carroll J, Kropp K, Walsh P, Ghazal P, Sleator RD. HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool. J Biomed Inform. 2015;54.

[2] Mount DW. Using the Basic Local Alignment Search Tool (BLAST). CSH Protoc. 2007;2007.

[3] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3).

Reference for EC Prediction

[4] The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2018;46:2699. doi: 10.1093/nar/gky092.

[5] Cornish-Bowden A. Current IUBMB recommendations on enzyme nomenclature and kinetics. Perspect Sci. 2014;1:74–87. doi: 10.1016/j.pisc.2014.02.006.

[6] Sarac OS, Gürsoy-Yüzügüllü Ö, Cetin-Atalay R, Atalay V. Subsequence-based feature map for protein function classification. Comput Biol Chem. 2008;32:122–130. doi: 10.1016/j.compbiolchem.2007.11.004.

[7] Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci. 1992;89:10915–10919. doi: 10.1073.

[8] Dalkiran A, Rifaioglu AS, Martin MJ, Cetin-Atalay R, Atalay V, Doğan T. ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinformatics. 2018;19(1):334. Published 2018 Sep 21. doi:10.1186/s12859-018-2368-y

[9] Madden T. The BLAST sequence analysis tool. Bethesda (MD): National Center for Biotechnology Information (US); 2013.

[10] Sarac OS, Atalay V, Cetin-Atalay R. GOPred: GO molecular function prediction by combined classifiers. PLoS One. 2010;5:e12382. doi: 10.1371/journal.pone.0012382.

[11] Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 2016;44:D279–D285. doi: 10.1093/nar/gkv1344.

[12] Roy A. et al. (2012) Cofactor: an accurate comparative algorithm for structure-based protein function annotation. Nucleic Acids Res., 40, W471–W477.