Team:NCTU Formosa/Mutagenicity Prediction


Revision as of 05:19, 14 November 2019


Mutagenicity Prediction

Introduction

   In an interview with Professor Ethan Lan of NCTU, we learned that the mutagenicity of a compound is highly related to its chemical substructures. Inspired by this, we came up with an idea: what if we used machine learning to predict mutagenicity based on chemical substructures?

   We then extracted 67 features based on different chemical substructures, trained a Support Vector Machine (SVM) on our data, and used a scoring function to quantify mutagenicity. To validate the model, we compared the mutagenicity computed by our AI with the number of bacterial colonies from Ames test results to confirm the score’s reliability.

Introduction to Machine Learning

   Artificial intelligence is a prevalent technique now trending all over the world, and experts in many different fields have started to use it as an essential tool to extend their research. Machine learning is a branch of artificial intelligence; concisely, it processes input data to generate useful predictions. Unlike typical programming methods, machine learning uses statistics instead of explicit logic to solve problems, which allows it to handle complicated tasks with accessible programming.

Basic Concept of Machine Learning

   In general, training a machine learning model can be represented by the following flow chart:

Figure 1: Flow chart of machine learning

Label: The thing we are predicting. In our model, the label is “mutagen” or “non-mutagen.”

Features: Features in machine learning are the input variables. In our model, the features are the substructures.

Loss: Loss is the mean square error between the label and the model’s prediction.

Model: The function that predicts the label. We choose a support vector machine as the classification algorithm.

   In supervised machine learning, we divide the training data into features and labels. The model computes the input feature vectors, makes predictions, and compares them with the labels. Initially, the loss is significant, so a loss function and an optimizer adjust the parameters of the model to make the predictions more accurate. Given sufficient training data, once the training process reaches a certain number of iterations, the loss converges to a stable value, indicating that training is over.
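As a minimal illustration of the loss described above, a plain-Python sketch computing the mean square error between labels and predictions (the numbers here are hypothetical, not from our training run):

```python
def mse_loss(labels, predictions):
    """Mean square error between true labels and model predictions."""
    n = len(labels)
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / n

# Hypothetical labels (1 = mutagen, 0 = non-mutagen) and raw model outputs
# early in training; the optimizer adjusts parameters to shrink this value.
labels = [1, 0, 1, 1, 0]
predictions = [0.9, 0.2, 0.4, 0.8, 0.1]
loss = mse_loss(labels, predictions)
print(loss)
```

With these made-up numbers the loss comes out to roughly 0.092; as training converges, it settles to a stable value.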

Data Preprocess

Figure 2: Database preprocessing

1. Database Source

   For trustworthy machine learning, sufficient data is crucial. We used the QSAR Toolbox, open software that offers transparent chemical hazard assessment. We collected chemical structures in the form of SMILES as training data and the results of the Ames test as target data for machine learning.

2. SMILES

   The Simplified Molecular Input Line Entry System (SMILES) is a set of chemical notations commonly used in molecular databases. Its key characteristic is that a 1-dimensional syntax can represent a 3-dimensional chemical structure. In other words, once we have the SMILES of a chemical, we have its structure encoded as a 1-dimensional string.

Figure 3: SMILES notation transition

3. Feature Extraction

   For human practices, we interviewed a postdoctoral researcher in a computer science laboratory at NCTU about machine learning. He suggested that we should not feed SMILES into the model directly without any preprocessing; instead, we should extract features for different substructures. After researching the literature, we discovered a paper suggesting 67 kinds of substructures with mutagenic potential. Since one chemical structure can be written in multiple ways in SMILES, we could not merely match substrings of the SMILES. Instead, we used RDKit, an open-source Python library commonly used in cheminformatics, to detect substructures. After detecting all the substructures, we took them as input features for machine learning.
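A sketch of how such a substructure search might look with RDKit. The SMARTS pattern below (an aromatic nitro group) is one illustrative alert of our choosing, not the team's actual 67-feature list; it shows why pattern matching beats substring matching, since the same molecule can have many SMILES spellings:

```python
from rdkit import Chem

# Hypothetical alert: aromatic nitro group, a classic mutagenic substructure.
# SMARTS matches the structure itself, regardless of how the SMILES is written.
nitro_aromatic = Chem.MolFromSmarts("[c][N+](=O)[O-]")

def has_substructure(smiles: str, pattern) -> int:
    """Return 1 if the molecule contains the substructure, else 0."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable SMILES
        return 0
    return int(mol.HasSubstructMatch(pattern))

# Nitrobenzene written two different ways in SMILES -- same structure, same feature.
print(has_substructure("c1ccccc1[N+](=O)[O-]", nitro_aromatic))  # 1
print(has_substructure("O=[N+]([O-])c1ccccc1", nitro_aromatic))  # 1
print(has_substructure("CCO", nitro_aromatic))                   # ethanol: 0
```

Running every molecule against each alert pattern yields a binary feature vector per compound.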

4. Labeling

   The result of an Ames test is “Positive” or “Negative.” However, the machine cannot work with these strings directly, so we label “Positive” as “1” and “Negative” as “0.”
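In code, this labeling is a simple mapping (a trivial sketch; the variable names are ours):

```python
# Map Ames test outcomes to numeric labels the model can learn from.
LABELS = {"Positive": 1, "Negative": 0}

ames_results = ["Positive", "Negative", "Negative", "Positive"]
y = [LABELS[r] for r in ames_results]
print(y)  # [1, 0, 0, 1]
```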

Support Vector Machine

   SVM mainly emphasizes data classification in pattern recognition. Generally speaking, this method finds a plane equation that separates the data into two classes. Because this plane may classify multi-dimensional data, it is called a “hyperplane.” For example, to divide a 3-D space, we need a 2-D plane; more generally, an N-dimensional space can be split by an (N-1)-dimensional hyperplane. Among all the data, a few points determine the optimal hyperplane, and those critical points are called “support vectors.”

Figure 4: Demonstration of support vector machine classifier
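A minimal sketch of such a classifier using scikit-learn's SVC on toy 2-D data; scikit-learn and the toy clusters are our assumptions, since the source does not name an implementation:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two well-separated Gaussian clusters standing in for
# non-mutagen (0) and mutagen (1) feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
               rng.normal(3.0, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# In this 2-D space the separating hyperplane is a 1-D line.
clf = SVC(kernel="linear")
clf.fit(X, y)

# The critical points that determine the hyperplane are the support vectors.
print(len(clf.support_vectors_))
print(clf.predict([[0.0, 0.0], [3.0, 3.0]]))
```

Points at the two cluster centers fall on opposite sides of the hyperplane and are classified accordingly.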

Results

Confusion Matrix

Figure 5: The confusion matrix of predicted class and true class

$$Accuracy = \frac{TPN+TNN}{TPN+TNN+FPN+FNN} = \frac{17518}{21000} = 0.834 $$

$$Sensitivity = \frac{TPN}{TPN+FNN} = \frac{5138}{7231} = 0.71 $$

$$Specificity = \frac{TNN}{TNN+FPN} = \frac{12380}{13769} = 0.899 $$

   TPN: True-Positive Number

   TNN: True-Negative Number

   FPN: False-Positive Number

   FNN: False-Negative Number

   Validation: k-fold cross-validation (k = 5)

   Training data: 7231 mutagens and 13769 non-mutagens

  The confusion matrix shows the counts for each true class and predicted class. “0” and “1” are the labels in our model. The X-axis is the prediction of our model, and the Y-axis is the true label.
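The three scores above can be recomputed directly from the reported counts, as a quick consistency check rather than a new result:

```python
# Counts from the confusion matrix section above.
TPN, TNN = 5138, 12380      # true-positive / true-negative numbers
FNN = 7231 - TPN            # 7231 mutagens in total   -> false negatives
FPN = 13769 - TNN           # 13769 non-mutagens total -> false positives

accuracy = (TPN + TNN) / (TPN + TNN + FPN + FNN)
sensitivity = TPN / (TPN + FNN)
specificity = TNN / (TNN + FPN)
print(round(accuracy, 3), round(sensitivity, 2), round(specificity, 3))
# 0.834 0.71 0.899
```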

ROC Curve

   A receiver operating characteristic (ROC) curve, also known as a relative operating characteristic curve, is one of the most important evaluation metrics for checking the performance of a classifier. It compares two operating characteristics, the true positive rate (TPR) and the false positive rate (FPR), to diagnose the ability of the model.

Figure 6: The ROC curve analysis of Support vector machine (SVM)

   The area under the curve (AUC) of the ROC is the standard used to judge the model’s ability. The score lies between 0 and 1, and the higher the score, the higher the accuracy. If AUC > 0.5, the model has predictive value.
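A sketch of how the TPR, FPR, and AUC are computed, using scikit-learn and hypothetical scores; a real curve would use the SVM's decision scores on held-out data:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical true labels and classifier scores (e.g. decision_function
# output); these stand in for real validation-set predictions.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.8, 0.4, 0.6, 0.7, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # FPR vs. TPR per threshold
roc_auc = auc(fpr, tpr)  # area under the curve; > 0.5 means predictive value
print(roc_auc)
```

For these made-up scores the AUC is 0.8125, well above the 0.5 no-skill baseline.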

Demonstration

   We designed a UI for the public and you. Click the link below to experience our artificial intelligence!

Try it!

Reference

1. Toshihiro Ohta, et al. (2000). "A comparison of mutation spectra detected by the Escherichia coli Lac+ reversion assay and the Salmonella typhimurium His+"

2. Paolo Mazzatorta, et al. (2006). "Integration of Structure-Activity Relationship and Artificial Intelligence Systems To Improve in Silico Prediction of Ames Test Mutagenicity"

3. Donald J. Abraham. (2003). "History of Quantitative Structure-Activity Relationships"

4. Katsuhiko Sawatari, et al. (2001). "Relationships between Chemical Structures and Mutagenicity: A Preliminary Survey for a Database of Mutagenicity Test Results of New Work Place Chemicals"

5. Seal, A., et al. (2012). "In-silico predictive mutagenicity model generation using supervised learning approaches"

6. Romualdo Benigni, et al. (2005). "Structure-Activity Relationship Studies of Chemical Mutagens and Carcinogens: Mechanistic Investigations and Prediction Approaches"

7. Yicheng Zhu, et al. (2018). "Machine learning techniques for classifying the mutagenic origins of point mutations"

8. Norinder, U., et al. (2019). "Predicting Ames Mutagenicity Using Conformal Prediction in the Ames/QSAR International Challenge Project"

9. Moorthy, N. S. Hari Narayana, et al. (2017). "Classification of carcinogenic and mutagenic properties using machine learning method"

10. Yunyi Wu, et al. (2018). "Machine Learning Based Toxicity Prediction: From Chemical Structural Description to Transcriptome Analysis"

11. Kristina Preuer, et al. (2019). "Interpretable Deep Learning in Drug Discovery"

12. Defang Fan, et al. (2018). "In silico prediction of chemical genotoxicity using machine learning methods and structural alerts"

13. Suman K. Chakravarti, et al. (2019). "Descriptor Free QSAR Modeling Using Deep Learning With Long Short-Term Memory Neural Networks"