Terrychan999 (Talk | contribs) |
Revision as of 05:19, 14 November 2019
Mutagenicity Prediction
Introduction
In an interview with Professor Ethan Lan of NCTU, we learned that the mutagenicity of a compound is closely related to its chemical substructures. This inspired an idea: why not use machine learning to predict mutagenicity from chemical substructures?
We then extracted 67 features based on different chemical substructures, trained a Support Vector Machine (SVM) on our data, and used a scoring function to quantify mutagenicity. To validate our model, we compared the mutagenicity score computed by our AI with the number of bacterial colonies in Ames test results to check the score’s reliability.
Introduction to Machine Learning
Artificial intelligence is now widely used all over the world. More and more experts in different fields are starting to use it as an essential tool to extend their research. Machine learning is a branch of artificial intelligence; concisely, it processes input data to generate useful predictions. Unlike typical programming methods, machine learning uses statistics instead of hand-written logic to solve problems, which allows it to handle complicated tasks with relatively simple programs.
Basic Concept of Machine Learning
In general, training a machine learning model can be represented by the following flow chart:
Figure 1: Flow chart of machine learning
In the supervised machine learning process, we divide the training data into features and labels. The model takes the extracted features as input vectors and makes predictions, which are then compared with the labels. Initially the loss is large, so a loss function and an optimizer are used to adjust the model's parameters and make the predictions more accurate. Once training has run for enough iterations on sufficient data, the loss converges to a stable value, indicating that training is complete.
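The loop described above can be sketched in a few lines. This is a minimal toy example rather than our actual training code: it assumes a one-parameter model `y = w * x`, a mean squared error loss, and plain gradient descent as the optimizer, and the data, learning rate, and iteration count are all made up for illustration.

```python
# Toy supervised training loop: fit y ≈ w * x by gradient descent on the
# mean squared error. All data and hyperparameters here are illustrative.

def mse(w, xs, ys):
    """Mean squared error between predictions w * x and labels y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, lr=0.05, iterations=100):
    """Return the fitted weight and the loss recorded at every iteration."""
    w = 0.0
    losses = []
    for _ in range(iterations):
        losses.append(mse(w, xs, ys))
        # Gradient of the MSE with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad  # optimizer step: move against the gradient
    return w, losses

features = [1.0, 2.0, 3.0, 4.0]
labels = [2.0, 4.0, 6.0, 8.0]  # generated with w = 2
w, losses = train(features, labels)
print(f"w = {w:.3f}, first loss = {losses[0]:.3f}, last loss = {losses[-1]:.6f}")
```

The recorded losses start large and shrink toward a stable value, which is the convergence behaviour described in the paragraph above.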
Data Preprocess
Figure 2: Database preprocessing
1. Database Source
For trustworthy machine learning, sufficient data is crucial. We used the QSAR Toolbox, an openly available software tool that supports transparent chemical hazard assessment. We collected chemical structures in SMILES format as training data and the corresponding Ames test results as target data for machine learning.
2. SMILES
The Simplified Molecular Input Line Entry System (SMILES) is a chemical notation commonly used in molecular databases. Its defining characteristic is that it represents a chemical structure as a one-dimensional text string. In other words, once we have the SMILES of a chemical, we can recover its chemical structure from that single line of text.
Figure 3: SMILES notation transition
3. Feature Extraction
For human practices, we conducted an interview about machine learning with a postdoctoral researcher in the computer science laboratory at NCTU. He suggested that we should not feed SMILES strings directly into the model without any preprocessing. Instead, we should extract features describing different substructures. After researching the literature, we found a paper suggesting 67 kinds of substructures with mutagenic potential. Since one chemical structure can be written in multiple ways in SMILES, we could not simply search for substrings in the SMILES text. Instead, we used the Python API of RDKit, an open-source library commonly used in cheminformatics, to detect substructures. After detecting all the substructures, we used them as input features for machine learning.
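A minimal sketch of this kind of substructure matching with RDKit is shown below. The three SMARTS patterns and the names `EXAMPLE_PATTERNS` and `substructure_features` are placeholders chosen for illustration; they are not the 67 substructures from the paper, and this is not necessarily how our own pipeline is written.

```python
# Illustrative substructure patterns written in SMARTS. These three are
# placeholders chosen for the example, NOT the 67 substructures from the paper.
EXAMPLE_PATTERNS = {
    "nitro": "[N+](=O)[O-]",
    "epoxide": "C1OC1",
    "aromatic_primary_amine": "c[NH2]",
}

def substructure_features(smiles, patterns=EXAMPLE_PATTERNS):
    """Return a 0/1 feature vector with one entry per substructure pattern."""
    from rdkit import Chem  # third-party dependency, imported on first use

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None when it cannot parse the SMILES string.
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return [
        int(mol.HasSubstructMatch(Chem.MolFromSmarts(smarts)))
        for smarts in patterns.values()
    ]
```

For instance, `substructure_features("c1ccccc1[N+](=O)[O-]")` and `substructure_features("O=[N+]([O-])c1ccccc1")` both describe nitrobenzene and should yield the same vector even though the two strings differ, which is why matching on the parsed molecule is preferable to substring search.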
4. Labeling
The result of the Ames test is either “Positive” or “Negative.” A model cannot work with these words directly, so we encode “Positive” as 1 and “Negative” as 0.
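The encoding can be written as a small lookup. The helper name `encode_ames_result` is hypothetical, and the tolerance for case and whitespace is an assumption about how raw database entries might look.

```python
LABELS = {"Positive": 1, "Negative": 0}

def encode_ames_result(result):
    """Map an Ames test outcome string to the numeric label used for training."""
    try:
        # Normalise case and surrounding whitespace before the lookup.
        return LABELS[result.strip().capitalize()]
    except KeyError:
        raise ValueError(f"unexpected Ames result: {result!r}") from None

print([encode_ames_result(r) for r in ["Positive", "Negative", " positive "]])
```

Raising an error on anything other than the two expected values keeps ambiguous or missing results from silently ending up in the training set.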
Support Vector Machine
SVM is mainly used for data classification in pattern recognition. Generally speaking, the method fits a plane equation that separates the data into two classes. Because this plane may have to classify multi-dimensional data, it is called a “hyperplane.” For example, dividing a 3-D space requires a 2-D plane; more generally, an N-dimensional space can be split by an (N-1)-dimensional hyperplane. Among all the data, a few data points determine the optimal hyperplane, and these critical points are called “support vectors.”
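To make the hyperplane idea concrete, the sketch below classifies points by which side of a fixed hyperplane they fall on. It only illustrates the decision rule of a linear classifier; it does not train an SVM, and the weights `w` and offset `b` are hypothetical values rather than parameters of our model.

```python
def decision_value(w, b, x):
    """Signed score w . x + b; its sign says which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Return 1 for points on the positive side of the hyperplane, else 0."""
    return 1 if decision_value(w, b, x) >= 0 else 0

# Hypothetical hyperplane in a 3-D feature space: x1 + x2 - x3 - 0.5 = 0
w, b = [1.0, 1.0, -1.0], -0.5
print(classify(w, b, [1, 1, 0]))  # prints 1
print(classify(w, b, [0, 0, 1]))  # prints 0
```

Training an SVM amounts to choosing `w` and `b` so that the hyperplane separates the two classes with the widest possible margin, and the support vectors are the training points that lie closest to it.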
Figure 4: Demonstration of support vector machine classifier
Results
Confusion Matrix
Figure 5: The confusion matrix of predicted class and true class
$$\text{Accuracy} = \frac{TPN+TNN}{TPN+TNN+FPN+FNN} = \frac{5138+12380}{21000} = 0.834 $$
$$\text{Sensitivity} = \frac{TPN}{TPN+FNN} = \frac{5138}{7231} = 0.711 $$
$$\text{Specificity} = \frac{TNN}{TNN+FPN} = \frac{12380}{13769} = 0.899 $$
TPN: True-Positive Number
TNN: True-Negative Number
FPN: False-Positive Number
FNN: False-Negative Number
Validation: k-fold cross-validation (k = 5)
Training data: 7231 mutagens and 13769 non-mutagens
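As a rough illustration of how 5-fold cross-validation partitions a dataset, the sketch below splits sample indices into five folds and uses each fold once as the held-out test set. The function name, the shuffling, and the fixed seed are assumptions made for this example; the exact splitting procedure we used may differ.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 7231 mutagens + 13769 non-mutagens = 21000 samples in total.
for fold_number, (train, test) in enumerate(k_fold_indices(21000), start=1):
    print(fold_number, len(train), len(test))
```

Every sample appears in exactly one test fold, so each prediction used for evaluation comes from a model that did not see that sample during training.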
The confusion matrix shows how many samples fall into each combination of true class and predicted class. The labels "0" (non-mutagen) and "1" (mutagen) are the classes in our model. The X-axis is the prediction of our model, and the Y-axis is the true label.
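The three metrics can be recomputed from the four confusion-matrix counts. In the sketch below, the true-positive and true-negative counts and the class totals are taken from the figures above, while the false-negative and false-positive counts are derived by subtraction; the function name is chosen for this example only.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

tp, tn = 5138, 12380
fn = 7231 - tp   # mutagens predicted as non-mutagens
fp = 13769 - tn  # non-mutagens predicted as mutagens

for name, value in classification_metrics(tp, tn, fp, fn).items():
    print(f"{name}: {value:.3f}")
```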
ROC Curve
A receiver operating characteristic (ROC) curve, also known as a relative operating characteristic curve, is one of the most important tools for evaluating classifier performance. It plots the true positive rate (TPR) against the false positive rate (FPR) as the decision threshold varies, which makes it possible to diagnose the ability of the model.
Figure 6: The ROC curve analysis of Support vector machine (SVM)
The area under the ROC curve (AUC) is a standard measure of a model’s discriminative ability. It ranges from 0 to 1, and a higher value indicates better separation of the two classes. An AUC of 0.5 corresponds to random guessing, so AUC > 0.5 means that the model has predictive value.
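One way to compute the AUC without drawing the curve uses its rank interpretation: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below implements that definition directly; the labels and scores are made-up illustrative values, not outputs of our model.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # prints 0.75
```

This pairwise version is quadratic in the number of samples, so for a dataset of our size a sorting-based implementation from a standard library would be the more practical choice.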
Demonstration
We designed a UI for the public. Click the link below to try our artificial intelligence!
Try it! https://sites.google.com/view/nctuformosa2019/homepage
Reference
1. Toshihiro Ohta, et al. (2000). "A comparison of mutation spectra detected by the Escherichia coli Lac+ reversion assay and the Salmonella typhimurium His+"
2. Paolo Mazzatorta, et al. (2006). "Integration of Structure-Activity Relationship and Artificial Intelligence Systems To Improve in Silico Prediction of Ames Test Mutagenicity"
3. Donald J. Abraham. (2003). "History of Quantitative Structure-Activity Relationships"
4. Katsuhiko Sawatari, et al. (2001). "Relationships between Chemical Structures and Mutagenicity: A Preliminary Survey for a Database of Mutagenicity Test Results of New Work Place Chemicals"
5. Seal A, et al. (2012). "In-silico predictive mutagenicity model generation using supervised learning approaches"
6. Romualdo Benigni, et al. (2005). "Structure-Activity Relationship Studies of Chemical Mutagens and Carcinogens: Mechanistic Investigations and Prediction Approaches"
7. Zhu, Yicheng, et al. (2018). "Machine learning techniques for classifying the mutagenic origins of point mutations"
8. Norinder, U., et al. (2019). "Predicting Ames Mutagenicity Using Conformal Prediction in the Ames/QSAR International Challenge Project"
9. Moorthy, N. S. Hari Narayana, et al. (2017). "Classification of carcinogenic and mutagenic properties using machine learning method"
10. Yunyi Wu, et al. (2018). "Machine Learning Based Toxicity Prediction: From Chemical Structural Description to Transcriptome Analysis"
11. Kristina Preuer, et al. (2019). "Interpretable Deep Learning in Drug Discovery"
12. Defang Fan, et al. (2018). "In silico prediction of chemical genotoxicity using machine learning methods and structural alerts"
13. Suman K. Chakravarti, et al. (2019). "Descriptor Free QSAR Modeling Using Deep Learning With Long Short-Term Memory Neural Networks"
Label: The thing we are predicting. In our model, the label is “mutagen” or “non-mutagen.”
Features: Features in machine learning are input variables. In our model, the features are the substructures.
Loss: Loss is the mean square error of label and model prediction.
Model: The function to predict the label. We choose support vector machine as a classification algorithm.