Overview
We devoted considerable time and effort to constructing the models described below, and they played an important role in
our project.
To obtain more accurate data, we derived a formula to help sift the original BLAST results. Because the
accuracy of the data directly affects the credibility of our project, we fitted the frequency histogram of the
formula results with Poisson and normal distributions, and adjusted each coefficient in the formula
until it performed well enough. Testing showed that the new filtration model performed much better
than last year's. This model can be generalized to big-data processing of iGEM parts to help other iGEMers.
Additionally, the EC prediction model uses machine learning and constructs multiple binary classifiers. It combines three
independent predictors based on subsequences, sequence similarities, and nucleotide physicochemical features to
predict a probabilistic EC number, providing synthetic biologists with more help in finding the right enzymes.
Blast Filtration Model
During data filtration, a single nucleotide sequence can yield many BLAST results[1].
We sorted these results in three ways: by score, by expect value, and by identity[2]. A 10% overlapping
filtering strategy was then used to obtain the best-matched UniProt number. However, the results of these three
sorting methods differ considerably. We therefore formed a single formula that combines all three
values of the BLAST results through parameters and coefficients[3]. The results are no longer
obtained by one sorting method alone, which minimizes the offset between results sorted in a single way and the real data.
Data Preparation
Get Data with Three Different Filtration Strategies
We downloaded the latest iGEM part tables directly from the iGEM Registry through a POINT-IN-TIME
DATABASE DUMP, including iGEM IDs and the corresponding feature information.
First, the feature fragments labeled as Biobricks were removed. For these deleted
Biobricks, we provide an external link on the final web page, where users can click
to view the information for the shorter Biobricks contained in the current Biobrick.
Using the complete gene sequences and feature information in the Biobricks, we cut out
the corresponding fragment sequences according to the features and retained the original complete sequences. From these sequences,
we generated the input dataset for the BLAST algorithm, with UniProt as the target database.
Fig.1 Take out "minimal BLAST sequence"
The original BLAST results were extracted into corresponding csv files. Because many BLAST
results were obtained from the same query sequence, some of them had to be deleted. For
this process, we had two strategies.
Fig.2 Sequence alignment
① Sort the BLAST Results of the Same Query Sequence by Different Parameters
The BLAST results were sorted by expect value, identity, and score. Score and identity were sorted in
descending order, and expect value in ascending order. This step produced three different data lists for further processing.
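As a concrete illustration, the three sorted lists can be produced as follows (a minimal Python sketch; the hit records and field names here are hypothetical, not our actual csv schema):

```python
# Hypothetical BLAST hit records for one query sequence.
hits = [
    {"uniprot": "P69202", "score": 120.0, "identity": 0.98, "evalue": 1e-30},
    {"uniprot": "P16117", "score": 150.0, "identity": 0.92, "evalue": 1e-45},
    {"uniprot": "Q9XYZ1", "score": 90.0,  "identity": 0.99, "evalue": 1e-20},
]

by_score    = sorted(hits, key=lambda h: h["score"],    reverse=True)  # descending
by_identity = sorted(hits, key=lambda h: h["identity"], reverse=True)  # descending
by_evalue   = sorted(hits, key=lambda h: h["evalue"])                  # ascending
```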
② 10% Overlapping Filtering Strategy
Fig.3 Preliminary filtration
Scan the BLAST results of the same fragment in sorted order. The first pointer points to the
BLAST result data block in the first row, and the second pointer visits each subsequent data block in turn. If the
overlap rate of the query sequences corresponding to the two pointed-to data blocks exceeds 10%,
the result pointed to by the second pointer is deleted. This continues until all results have been scanned, and
the whole operation repeats until the first pointer reaches the last data block.
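The two-pointer scan can be sketched as follows (a simplified Python sketch; the exact definition of the query-overlap rate is an assumption, here taken as the fraction of the shorter query interval covered by the overlap):

```python
def overlap_fraction(a, b):
    """Fraction of the shorter query interval covered by the overlap
    (an assumed definition of the 'repetition rate')."""
    ov = min(a["qend"], b["qend"]) - max(a["qstart"], b["qstart"]) + 1
    if ov <= 0:
        return 0.0
    shorter = min(a["qend"] - a["qstart"], b["qend"] - b["qstart"]) + 1
    return ov / shorter

def filter_overlaps(sorted_hits, max_overlap=0.10):
    """Keep a hit only if it overlaps every already-kept,
    better-ranked hit by at most max_overlap."""
    kept = []
    for hit in sorted_hits:  # hits must already be sorted best-first
        if all(overlap_fraction(hit, k) <= max_overlap for k in kept):
            kept.append(hit)
    return kept
```

Keeping a hit only when it clears every already-kept, better-ranked hit is equivalent to the deletion-while-scanning description above.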
Merge Data
After using the above two strategies to select the corresponding data under the different sorting
methods and store them in csv files, the data from the three filtration methods could be merged.
① Internal Mergence
The original sequence and the sequences cut by feature may share the same matching result; we
only need to keep one of them.
② External Mergence
Because the same result may occur twice or three times across the three lists, we keep only one copy and
mark the source (identity, expect value, or score) at the beginning of the row to aid later processing.
Some of the mergence results are shown below.
Table.1 Merged BLAST results
Formula Derivation
To enable comparison, a preprocessing equation was needed to shrink each parameter into a
common range. On this premise of comparing the parameters fairly, we could further adjust the coefficient
in front of each parameter and ensure the formula met our functional expectations.
P_identity
$$V_{id}=\frac{\text{number of nucleotides matched completely}}{\text{total number of nucleotides in the alignment}}\tag{ 1 }$$We directly used V_id as P_identity, the parameter related to identity in our formula.
P_score
Since the original score does not fall in this range, a mathematical transformation was needed to bring it into
(0, 1), so that multi-parameter filtering could be implemented after setting a threshold. According to the
BLAST+ documentation, to compute P_score we first divide the raw score by hsplen (the alignment length, which
gives the mean score per nucleotide of the query sequence) and then divide by 7, the highest score a single
nucleotide can attain: $$P_{score} =
\frac{V_{score}}{7\times hsplen}\tag{ 2 }$$
P_evalue
To process P_evalue, the logarithm of the expect value was taken twice. First, the
base-10 logarithm was taken to reduce the dimension, mapping the exponents of all
expect values in the table into a range from 1 to nearly 200. In other words, this step magnified the
differences between expect values, which were then reduced to at most 1 in the next step. This scaling made the
differences between numbers more obvious and made it convenient to deduce the screening formula. Since the
maximum absolute value of the exponent is close to 200, the base-6 logarithm of the absolute exponent was then
taken to reduce it to (0, 3). Dividing the result by 3 finally gave a value in the range (0, 1). For the case
where the expect value is 0, we directly set P_evalue to 1.
Through this transformation, the result is negatively correlated with the original value: the higher
the expect value, the smaller P_evalue. Because a smaller expect value means the hit is more
likely to be correct, P_evalue is positively correlated with the accuracy of the result. $$P_{evalue} =
\frac{\log_6({-\lg{V_{evalue}}})}{3}\tag{ 3 }$$
Threshold
After obtaining expressions for identity, score, and expect value, we needed to select a
filtering threshold: hits above the threshold are kept, and hits below it are discarded.
$$result=V_{id}\times w1+\frac{V_{score}}{7\times hsplen}\times
w2+\frac{\log_6({-\lg{V_{evalue}}})}{3}\times w3\qquad\tag{ 4 }$$
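Equations (1)-(4) translate directly into code (a minimal sketch; `hsplen` is the alignment length as above, and the weights w1-w3 are passed in):

```python
import math

def p_identity(matched, aligned_total):
    """Eq. (1): fraction of completely matched nucleotides."""
    return matched / aligned_total

def p_score(raw_score, hsplen):
    """Eq. (2): raw score scaled by alignment length and the
    maximum per-nucleotide score of 7."""
    return raw_score / (7 * hsplen)

def p_evalue(evalue):
    """Eq. (3): double logarithm of the expect value; only meaningful
    for expect values well below 1, as in the filtered BLAST hits."""
    if evalue == 0:          # special case defined in the text
        return 1.0
    return math.log(-math.log10(evalue), 6) / 3

def combined(v_id, raw_score, evalue, hsplen, w1, w2, w3):
    """Eq. (4): weighted sum of the three preprocessed parameters."""
    return v_id * w1 + p_score(raw_score, hsplen) * w2 + p_evalue(evalue) * w3
```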
Multiple Combination
A total of 36 different coefficient combinations were set up, and the screening results under
10 different threshold values were examined, yielding 360 tables in total. The thresholds ranged from
0.4 to 0.85 with an interval of 0.05, and the coefficient at each position ranged from 1.0 down to 0.8 with an
interval of 0.1. The content structure of the tables is as follows:
Table.2 Merged BLAST results
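The enumeration of coefficient/threshold combinations can be sketched as follows (the exact 36 coefficient triples are not spelled out above, so `triples` here is a hypothetical placeholder):

```python
# Threshold axis described above: 0.40, 0.45, ..., 0.85 (10 values).
thresholds = [round(0.40 + 0.05 * i, 2) for i in range(10)]

def grid(thresholds, triples, rows, score_fn):
    """One filtered table per (w1, w2, w3, threshold) combination:
    keep the rows whose combined score clears the threshold."""
    tables = {}
    for w1, w2, w3 in triples:
        for t in thresholds:
            tables[(w1, w2, w3, t)] = [r for r in rows
                                       if score_fn(r, w1, w2, w3) >= t]
    return tables
```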
Coefficient and Threshold Modification
Threshold
Using all results under each coefficient setting, we drew the frequency histogram together with a
fitted Poisson probability density curve. Where the tail area of the fitted density
exceeded 0.15, 0.1, or 0.05, we drew a dotted line at the
boundary and saved the boundary value into lists L1, L2, and L3. The average values of the three boundaries were 0.67,
0.66, and 0.63, respectively. To avoid the error caused by directly taking the mean, we used
lists L1, L2, and L3 to draw another frequency histogram fitted with a Gaussian probability density curve, and
displayed its center line in the graph, from which we could see
that the best threshold should be around 0.66.
Fig.4 PDF fit by Poisson distribution
Fig.5 Pdf fit by Normal distribution
After comparing the values calculated under different coefficients, we sought additional
support for the chosen formula threshold. The following figure shows, over the threshold
range (0.5, 0.85), the number of mappings between UniProt IDs and iGEM IDs, the number of UniProt IDs, and the
number of iGEM IDs.
The figure shows that when the threshold lies within (0.5, 0.75), the counts drop
more slowly, indicating that in this range the formula could guarantee both
the accuracy and the completeness of the data. Each count also decreases evenly and slowly across this
range. Since 0.65 was closest to the value retaining 5%-15% of the BLAST results, we finally
determined 0.65 as our threshold.
Fig.6 Comparison between 8 thresholds
Coefficient
In our analysis, a triple intersection (a hit kept under all three sorting methods) indicates high
credibility, so such cases can be regarded as positive samples.
The following figure shows the percentage of triple intersections in each data set during filtration.
Choosing the appropriate coefficient collocation from the line with the threshold at 0.65, we decided on 1_5_4 as the coefficients and 0.65 as the threshold of our final filtration equation.
Fig.7 Comparison between 360 different coefficients
The curves of different colors correspond to different thresholds; the abscissa of each point is
the coefficient ratio of the filter formula, and the ordinate is the ratio of triple intersections in the
corresponding file.
Comparison of Accuracy and Amount with Last Year's Data
We processed last year's BLAST results and obtained the average of the three values in their data
sets. The table below compares the three values between last year and this year.
As the figure shows, this year's accuracy improved across the board,
because the accuracy of the results is positively correlated with these values.
Fig.8 Comparison with last version-1
By comparing the total numbers of UniProt IDs and iGEM IDs, we can clearly see that this year's dataset is more complete than last year's.
Fig.9 Comparison with last version-2
Sample Test
Questioning the Accuracy of the iGEM Biobricks
Fig.10 Feature info of one Biobrick of iGEM Registry
From the data given in the feature table, BBa_C0052 should correspond to only one sequence (the 5th row in the table). In this case,
we should trust the joint choice of expect value and score and ignore the result recommended by identity.
However, since the expect values and scores of the 2nd and 4th rows are very close to each other,
and the identity value of the 2nd row is very high, only the 4th row of results can be deleted.
Two conclusions follow: first, our formula cannot make the results completely accurate;
second, the feature table can only give a range of possible gene intervals.
Table.3 Origin data (before filtrate)
Fig.11 P16117 is one of the originally detected UniProt IDs of Web BLAST in UniProt
Fig.12 P69202 is another originally detected UniProt ID of Web BLAST in UniProt
Table.4 P69202 is deleted from our origin data after using our formula
Prediction Model
Our EC number prediction tool adopts a supervised ensemble classification approach, incorporating different
predictors based on subsequences, sequence similarities, and nucleotide physicochemical features. The system
was trained with the EC number annotations of characterized enzymes in the UniProtKB/Swiss-Prot database[4]. In
the prediction results, the first digit of an EC number indicates which of the six main classes the annotated
enzyme belongs to, the second digit the subclass, the third digit the sub-subclass, and the fourth digit the
substrate of the enzyme[5] (e.g., EC 3.1.3.16, protein-serine/threonine phosphatase).
Fig.16 Prediction flow chart
Predictors
One of the predictors is the Subsequence Profile Map (SPMap)[6], a subsequence-based method for predicting
protein functions. In its subsequence extraction module, all possible subsequences of a given length of 5 were
extracted from the training dataset using the sliding window technique.
Fig.17 Sliding window used for subsequence extraction
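The sliding-window extraction can be sketched in a few lines of Python:

```python
def subsequences(seq, window=5):
    """Extract every length-`window` subsequence of seq with a sliding window."""
    return [seq[i:i + window] for i in range(len(seq) - window + 1)]
```

For example, `subsequences("ATGCGTA")` yields the three length-5 windows of a 7-nucleotide sequence.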
We used the blocks substitution matrix (BLOSUM62)[7] to calculate the similarity score between two subsequences
with a simple string comparison procedure. The similarity score s(x, y) between two subsequences was calculated as follows.
$$s(x,y)=\sum\limits_{i=1}^{l}M(x(i),y(i))\tag{5 }$$
where l is the length of the subsequences, x(i) is the nucleotide at the ith position of
subsequence x, and M(x(i), y(i)) is the BLOSUM62 similarity score for the ith positions of x and y.
Fig.18 BLOSUM 62
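Eq. (5) is a per-position sum of matrix entries. A minimal sketch of the mechanics follows; the tiny symmetric matrix below is only a toy stand-in for BLOSUM62:

```python
# Toy substitution matrix (stand-in for BLOSUM62; values are illustrative).
M = {
    ("A", "A"): 4, ("G", "G"): 6, ("T", "T"): 5,
    ("A", "G"): 0, ("G", "A"): 0,
    ("A", "T"): 0, ("T", "A"): 0,
    ("G", "T"): -2, ("T", "G"): -2,
}

def similarity(x, y):
    """Eq. (5): sum of per-position substitution scores for equal-length subsequences."""
    assert len(x) == len(y)
    return sum(M[(a, b)] for a, b in zip(x, y))
```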
To create the position-specific scoring matrix, let S_c denote the total number of subsequences in cluster c,
and let aa_count(i, j) denote the count of nucleotide j at the ith position of those subsequences.
PP_c(i, j) is then the (logarithmic) probability of nucleotide j occurring at the ith position of the subsequence.
Next, feature vectors were generated: each subsequence ss was compared with the probabilistic profile PP_c, and
the cth element of the feature vector V was determined.[8]
$$PP_c(i,j)=\log{\frac{aa_{count}(i,j)+0.01}{S_c}}\tag{ 6 }$$
$$P(ss|PP_c)=\sum\limits_{i=1}^{l}PP_c(i,ss(i))\tag{ 7 }$$
$$V(c)=\max\limits_{ss\in E}{P(ss|PP_c)}\tag{ 8 }$$
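Equations (6)-(8) translate directly into code (a minimal sketch over a toy DNA alphabet; `profile`, `score`, and `feature` are illustrative names, not SPMap's API):

```python
import math

def profile(cluster):
    """Eq. (6): probabilistic profile of a cluster of equal-length subsequences."""
    S_c, l = len(cluster), len(cluster[0])
    pp = {}
    for i in range(l):
        for j in "ACGT":
            count = sum(1 for ss in cluster if ss[i] == j)    # aa_count(i, j)
            pp[(i, j)] = math.log((count + 0.01) / S_c)
    return pp

def score(ss, pp):
    """Eq. (7): log-probability of subsequence ss under profile pp."""
    return sum(pp[(i, ss[i])] for i in range(len(ss)))

def feature(subseqs, pp):
    """Eq. (8): the cth feature is the best-matching subsequence's score."""
    return max(score(ss, pp) for ss in subseqs)
```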
The same operations were applied to the proteins in both the positive and negative datasets, and finally a
training file was created. A support vector machine (SVM) classifier was then used for the classification.
The second predictor is based on sequence similarities: we used the k-nearest neighbor algorithm to classify a
target protein, with the similarities between the query protein and the proteins in the training dataset
calculated using the NCBI-BLAST tool.[9] The last predictor is based on peptide physicochemical properties,
for which we used the Pepstats tool.
The Combination of Predictors
Fig.19 Structure of classifier
The tool combines the individual prediction scores of the three
predictors. A 5-fold cross-validation was applied for each method, and the area under the receiver operating
characteristic curve (AUROC) was calculated for each predictor. Using these AUROC values, the three methods were
combined and a weighted mean score was calculated. The weight for method m is calculated as
follows:[10]
$$W(m)=\frac{R_m^4}{R^4_{BLAST-kNN}+R^4_{SPMap}+R^4_{PEPSTATS-SVM}}\tag{ 9 }$$
where W(m) is the weight of method m and R_m is the AUROC value for method m.
Each prediction score was multiplied by the class-specific weights and summed to produce the weighted mean
score, which is the final prediction score of the query protein for that EC number.
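Equation (9) and the weighted mean can be sketched as follows (function names are illustrative):

```python
def weights(aurocs):
    """Eq. (9): class-specific weight of each method from its AUROC (R_m ** 4,
    normalized over all three methods)."""
    denom = sum(r ** 4 for r in aurocs.values())
    return {m: r ** 4 / denom for m, r in aurocs.items()}

def weighted_mean_score(scores, w):
    """Weighted mean of the per-method prediction scores."""
    return sum(scores[m] * w[m] for m in scores)
```

The fourth power sharpens the weighting: a method with a clearly higher AUROC dominates the combination.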
Result
We used the F1 score to measure the accuracy of the test. It considers
both the precision and the recall of the test: it is their harmonic mean, reaching its best value at 1
(perfect precision and recall) and its worst at 0. The quantities are calculated as follows:
$$Recall = \frac{TP}{TP+FN}\tag{10}$$
$$Precision = \frac{TP}{TP+FP}\tag{ 11 }$$
$$F1\;score=2\times\frac{precision\times recall}{precision+recall}\tag{12}$$
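Equations (10)-(12) in code (a minimal sketch in terms of true positives, false positives, and false negatives):

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    recall = tp / (tp + fn)                                # Eq. (10)
    precision = tp / (tp + fp)                             # Eq. (11)
    return 2 * precision * recall / (precision + recall)   # Eq. (12)
```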
We integrated data sets from several different sources to test the model. The proteins
come from UniProtKB/Swiss-Prot, the Protein Families Database (Pfam)[11], and the COFACTOR dataset[12]; none of
them were used in training. The test results below indicate the effectiveness of our prediction tool:
Table.9 The test results.
Reference
Reference for BLAST Filtration
[1] O'Driscoll A, Belogrudov V, Carroll J, Kropp K, Walsh P, Ghazal P, Sleator RD. HBLAST: parallelised
sequence similarity--a Hadoop MapReducable basic local alignment search tool. J Biomed Inform. 2015;54.
[2] Mount DW. Using the Basic Local Alignment Search Tool (BLAST). CSH Protoc. 2007;2007.
[3] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol.
1990;215(3).
Reference for EC Prediction
[4] UniProt Consortium T UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2018;46:2699. doi:
10.1093/nar/gky092.
[5] Cornish-Bowden A. Current IUBMB recommendations on enzyme nomenclature and kinetics. Perspect Sci.
2014;1:74–87. doi: 10.1016/j.pisc.2014.02.006.
[6] Sarac OS, Gürsoy-Yüzügüllü Ö, Cetin-Atalay R, Atalay V. Subsequence-based feature map for protein function
classification. Comput Biol Chem. 2008;32:122–130. doi: 10.1016/j.compbiolchem.2007.11.004.
[7] Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci.
1992;89:10915–10919. doi: 10.1073.
[8] Dalkiran A, Rifaioglu AS, Martin MJ, Cetin-Atalay R, Atalay V, Doğan T. ECPred: a tool for the prediction of
the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinformatics. 2018;19(1):334.
Published 2018 Sep 21. doi:10.1186/s12859-018-2368-y
[9] Madden T. The BLAST sequence analysis tool. Bethesda (MD): National Center for Biotechnology Information
(US); 2013.
[10] Sarac OS, Atalay V, Cetin-Atalay R. GOPred: GO molecular function prediction by combined classifiers. PLoS
One. 2010;5:e12382. doi: 10.1371/journal.pone.0012382.
[11] Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, et al. The Pfam protein families
database:
towards a more sustainable future. Nucleic Acids Res. 2016;44:D279–D285. doi: 10.1093/nar/gkv1344.
[12] Roy A. et al. (2012) Cofactor: an accurate comparative algorithm for structure-based protein function
annotation. Nucleic Acids Res., 40, W471–W477.