Team:SJTU-software/Model

Careful model design makes our software PHOSYME a more powerful tool for plant synthetic biology. Our modelling team uses techniques such as deep learning and graph theory to improve the performance of our Enzyme Prediction Function and SBML Online Modification Function.




Enzyme Prediction Function
Introduction
In plant synthetic metabolic engineering, choosing the proper enzyme for the rate-limiting reaction is key to increasing the yield of the target product. Based on the numerous photosynthesis-related reactions and corresponding enzymes in our database, we built a deep learning model to predict enzymes for reactions in the field of synthetic photosynthesis. The model outperforms a traditional CNN in our tests (see Result) and has the potential to increase the efficiency of synthetic carbon fixation.
Model architecture
As shown in figure 1, we first one-hot encode the enzyme sequence and the SMILES string of the substrate, so that each element is represented as a vector that is orthogonal to, and equally distant from, every other. We then use two separate text convolutional neural networks to extract features from the two encoded matrices. Finally, the features extracted by the two networks are merged and passed through fully connected layers to compute the probability value.


Figure 1. Architecture of our deep learning model
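The one-hot encoding step described above can be sketched as follows. This is a minimal illustration, not the team's implementation: the two alphabets, the character-level treatment of SMILES, the fixed `max_len` with zero padding, and the example strings are all assumptions.

```python
import numpy as np

# Assumed alphabets for illustration only; the actual vocabularies are not given.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SMILES_CHARS = "CNOSPH()[]=#+-@123456cnos"


def one_hot(text, alphabet, max_len):
    """Return a (max_len, len(alphabet)) 0/1 matrix, one row per character.

    Rows beyond the end of the input, and rows for unknown symbols, stay zero.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    mat = np.zeros((max_len, len(alphabet)), dtype=np.float32)
    for row, ch in enumerate(text[:max_len]):
        col = index.get(ch)
        if col is not None:
            mat[row, col] = 1.0
    return mat


# Hypothetical inputs: a short peptide fragment and a carboxyl-like SMILES string.
enzyme = one_hot("MKVLA", AMINO_ACIDS, max_len=8)
substrate = one_hot("C(=O)O", SMILES_CHARS, max_len=8)
```

Any two distinct one-hot rows have a dot product of 0 and a Euclidean distance of sqrt(2), which is the "orthogonal and equally distant" property mentioned above.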



Algorithm
TextCNN and Batch Normalization
The convolutional neural network (CNN) is a common model in deep learning. It extracts features from an input matrix through convolutional and pooling layers. Traditional CNNs, such as LeNet-5 and DenseNet, are oriented toward extracting multi-layer features of an image. Our task is to extract features from two different matrices, so two separate convolutional neural networks are trained. To make the model more suitable for inputs of varying length, we use TextCNN in place of the traditional CNN and compare its performance with that of the traditional CNN.
To improve the original network, we apply Batch Normalization to the layer activations so that the model generalizes better [1]. The following formula is used:
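A sketch of the standard batch normalization transform, written with assumed notation: x_1, ..., x_m are the activations of one unit over a mini-batch of size m, ε is a small constant for numerical stability, and γ and β are learned scale and shift parameters.

```latex
\[
\begin{aligned}
\mu_B &= \frac{1}{m} \sum_{i=1}^{m} x_i, &
\sigma_B^2 &= \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2, \\
\hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, &
y_i &= \gamma \hat{x}_i + \beta .
\end{aligned}
\]
```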
Convolution and pooling
To extract information from the one-hot encoded matrices, we apply convolution kernels of different sizes.
Suppose a convolution kernel is a matrix w of width d and height h; then w has h × d parameters that need to be updated. For an input sequence (a "sentence" in TextCNN terms), the encoding layer produces a matrix A. Let A[i : j] denote rows i to j of A; the convolution operation can then be expressed as follows:
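In the formulation of [2], and with two assumed symbols (s for the number of rows of A and o_i for the output at position i), the operation can be written as below, where the dot denotes the sum of element-wise products between w and the h-row window of A.

```latex
\[
o_i = w \cdot A[i : i + h - 1], \qquad i = 1, \dots, s - h + 1
\]
```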
We then add the bias b and apply the activation function f to obtain the desired feature. The formula is as follows:
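Following [2], and writing o_i for the convolution output at position i and c_i for the resulting feature (both symbols are assumed here), this step reads:

```latex
\[
c_i = f(o_i + b)
\]
```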
Each convolution kernel yields one set of features, and using more kernels with different heights h gives a richer feature representation [2].
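The convolution and pooling steps can be illustrated with a small NumPy sketch. It is not the team's network: the ReLU activation, the max-over-positions pooling, the kernel heights (2, 3, 5) and the random toy input are all assumptions chosen for illustration.

```python
import numpy as np


def conv_feature_map(A, w, b, f=lambda x: np.maximum(x, 0.0)):
    """Slide kernel w (h x d) down the rows of A (s x d).

    Position i uses rows i .. i+h-1 (0-indexed), giving s - h + 1 features
    f(sum(w * window) + b). ReLU is an assumed choice for f.
    """
    h = w.shape[0]
    s = A.shape[0]
    return np.array([f(np.sum(w * A[i:i + h]) + b) for i in range(s - h + 1)])


def text_cnn_features(A, kernels):
    """Max-pool each kernel's feature map and concatenate the pooled values."""
    return np.array([conv_feature_map(A, w, b).max() for w, b in kernels])


rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(10, 4)).astype(float)  # toy encoded input: s=10, d=4
kernels = [(rng.normal(size=(h, 4)), 0.1) for h in (2, 3, 5)]
features = text_cnn_features(A, kernels)  # one pooled value per kernel
```

Because pooling reduces every feature map to a single value, the output length depends only on the number of kernels and not on the input length, which is what makes this architecture convenient for inputs of varying length.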
Loss function
We treat the task as a binary classification problem and use the cross-entropy function to calculate the loss [3]. The formula is as follows:
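In its standard form, with assumed notation in which p is the true distribution and q the predicted distribution over outcomes x_i, the cross entropy is:

```latex
\[
H(p, q) = -\sum_{i} p(x_i) \log q(x_i)
\]
```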
Entropy is the expectation of the information content over all outcomes. Because ours is a binary classification problem, the formula can be converted into:
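A sketch of the standard binary cross-entropy for a single sample, with assumed symbols: y ∈ {0, 1} is the true label and ŷ is the predicted probability of the positive class. Over a batch, the loss is typically averaged across samples.

```latex
\[
L = -\bigl[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\bigr]
\]
```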
Result
Accuracy
To test our model, we used data sets from seven different species (six plants and one fungus, Nectria haematococca) and compared it with a traditional CNN. We first used accuracy as the reference metric, with five-fold cross-validation: in each fold, 80% of the data set serves as the training set and the remaining 20% as the test set. As shown in figure 2, our model performs better than the traditional CNN on all seven species.


Figure 2. The accuracy of two models
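The five-fold split and the accuracy metric can be sketched in plain Python as follows. The function names, the shuffling seed and the sample count of 100 are hypothetical; the actual data handling is not described on this page.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs; each fold is the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


splits = list(five_fold_splits(100))  # 5 splits of 80 training / 20 test indices
```

With five folds, each sample is used for testing exactly once and for training four times, which gives the 80%/20% ratio in every fold.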
AUC
To further verify the model, we use the AUC (area under the ROC curve) as a second metric. As figure 3 shows, our model again clearly outperforms the traditional CNN.


Figure 3. The AUC of two models
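For reference, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counting one half. The sketch below is illustrative only; the function name and the example scores are assumptions, not the team's evaluation code.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


example = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly
```

This pairwise version is quadratic in the number of samples, so it suits small checks; rank-based implementations are preferable for large data sets.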

Abbreviations of species:
ALY: Arabidopsis lyrata (lyrate rockcress)
BNA: Brassica napus (rape)
BRP: Brassica rapa (field mustard)
CMAX: Cucurbita maxima (winter squash)
CSAT: Camelina sativa (false flax)
GAB: Gossypium arboreum
NHE: Nectria haematococca




SBML Online Modification Function
Introduction
An SBML model is typically a metabolite-reaction network, which encodes a very complex graph. Comparing such genome-scale models through topological analysis is challenging and computationally expensive. In many cases, however, it is unnecessary to compare the global graphs. Instead, it is useful to extract important subnetworks from an SBML model and simplify them with the help of our SBML online modification function.
Graph model of metabolic networks
The metabolite-reaction network is a tripartite graph consisting of three types of nodes: metabolites, reactions and enzymes. As shown in figure 4, circles represent metabolites, squares represent reactions and triangles represent enzymes or enzyme complexes. For example, R3 transforms M5 into M2 and M4 and is catalyzed by an enzyme complex (E6 and E7). The network can be expressed as a graph model G(M,R,E,L), where M is the set of metabolite nodes, R the set of reaction nodes, E the set of enzyme nodes and L the set of links between nodes.
First, we extract the reaction-centric network, which represents the relationships between reactions. Every node has the same type: reaction. A directed edge linking two reactions means that at least one metabolite produced by the source reaction is consumed by the target reaction. The transformed network can be expressed as G(R,L). In this way, figure 4 is transformed into figure 5. For example, there is a directed edge from R3 to R1 because R3 produces the metabolite M2, which is consumed by R1.
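The edge rule for the reaction-centric network can be sketched in a few lines of Python. Only R3 (M5 into M2 and M4) and the fact that R1 consumes M2 come from the text; R1's product M1, the whole of reaction R2, and the choice to skip self-loops are hypothetical additions for illustration.

```python
# Each reaction maps to (substrates, products).
# R3 and R1's substrate M2 follow the text; the remaining entries are made up.
reactions = {
    "R1": ({"M2"}, {"M1"}),
    "R2": ({"M4"}, {"M3"}),
    "R3": ({"M5"}, {"M2", "M4"}),
}


def reaction_centric_edges(reactions):
    """Return directed edges (src, dst): src produces a metabolite that dst consumes."""
    edges = set()
    for src, (_, products) in reactions.items():
        for dst, (substrates, _) in reactions.items():
            # Self-loops are skipped here (an assumption of this sketch).
            if src != dst and products & substrates:
                edges.add((src, dst))
    return edges


edges = reaction_centric_edges(reactions)
```

In this toy network the result contains the R3 to R1 edge described above (via M2), plus an R3 to R2 edge via the hypothetical link through M4.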



Next, based on the reaction-centric network, we extract the enzyme-centric network. Each reaction node expands into the enzymes that catalyze it: because R1 is catalyzed by E1, E2 and the E3-E4 complex, it expands to three nodes. G(R,L) is thus transformed into G(E,L). As shown in figure 5, the topology of the enzyme-centric network is very different from that of the reaction-centric network; it is worth analyzing and can be used to determine the metabolic distance between enzymes.
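A minimal sketch of this expansion, assuming each enzyme complex is treated as a single node and every catalyst of the source reaction is linked to every catalyst of the target reaction. The catalyst lists for R1 and R3 and the R3 to R1 edge come from the text; the function name and data layout are hypothetical.

```python
def enzyme_centric_edges(reaction_edges, catalysts):
    """Expand each reaction node into its catalysts and link them pairwise."""
    edges = set()
    for src, dst in reaction_edges:
        for e_src in catalysts[src]:
            for e_dst in catalysts[dst]:
                if e_src != e_dst:  # an enzyme catalyzing both reactions gets no self-loop
                    edges.add((e_src, e_dst))
    return edges


catalysts = {
    "R1": ["E1", "E2", "E3-E4"],  # three alternative catalysts, one of them a complex
    "R3": ["E6-E7"],              # a single enzyme complex
}
reaction_edges = {("R3", "R1")}   # R3 produces M2, which R1 consumes
enzyme_edges = enzyme_centric_edges(reaction_edges, catalysts)
```

The single reaction edge becomes three enzyme edges, from the E6-E7 complex to each of E1, E2 and E3-E4, which shows how the enzyme-centric topology can diverge from the reaction-centric one.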



Floyd's Algorithm
The metabolic distance between each pair of enzymes can be calculated with Floyd's Algorithm [4] as follows:
Representation of the enzyme-centric network by the adjacency matrix:
- In the enzyme-centric network G(E,L), E represents the enzymes and L represents the edges.
- For a network G with n vertices (enzymes), the adjacency matrix A is an n × n matrix.
Setting up the distance matrix D with elements D[i][j] = 0 if i = j, D[i][j] = 1 if there is an edge from enzyme i to enzyme j (unit edge length assumed), and D[i][j] = ∞ otherwise.
Using Floyd's Algorithm to calculate the metabolic distance between enzymes:
- Input: adjacency matrix A (n × n matrix)
- Procedure: for each intermediate vertex k = 1, ..., n and every pair (i, j), set D[i][j] = min(D[i][j], D[i][k] + D[k][j])
- Output: distance matrix D, which contains the distance between every pair of enzymes
- Time complexity: O(n^3)
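The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the team's code: every edge is given unit length, unreachable pairs keep an infinite distance, and the four-enzyme network is a made-up example.

```python
import math


def floyd_warshall(adjacency):
    """All-pairs shortest path lengths for a directed graph given as a 0/1 matrix."""
    n = len(adjacency)
    # Initial distances: 0 on the diagonal, 1 per edge (assumed), infinity otherwise.
    dist = [
        [0 if i == j else (1 if adjacency[i][j] else math.inf) for j in range(n)]
        for i in range(n)
    ]
    # Allow each vertex k in turn as an intermediate stop between i and j.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


# Hypothetical enzyme-centric network: 0 -> 1 -> 2 -> 3 and a shortcut 0 -> 2.
A = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
D = floyd_warshall(A)
```

The three nested loops over n vertices give the O(n^3) running time stated above. Because the network is directed, D is generally not symmetric: here enzyme 3 is reachable from enzyme 0, but not the other way round.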
By extracting and analyzing enzyme-centric networks, we can obtain the critical precursors of physiological representation from genome-scale models.
References
[1] Santurkar S, Tsipras D, Ilyas A, et al. How Does Batch Normalization Help Optimization?[J]. 2018.
[2] Zhang Y, Wallace B. A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification[J]. Computer Science, 2015.
[3] Spurek P, Tabor J, Byrski K. Active Function Cross-Entropy Clustering[J]. Expert Systems with Applications, 2017, 72: 49-66.
[4] Larson R, Odoni A. Shortest Paths between All Pairs of Nodes. §6.2.2 in Urban Operations Research. 1981.
[5] Scharm M, Wolkenhauer O, Jalili M, Salehzadeh-Yazdi A. GEMtractor: Extracting Views into Genome-scale Metabolic Models. bioRxiv 790725. doi: https://doi.org/10.1101/790725












Presented by
SJTU - software