Team:NCTU Formosa/Mutagenicity Prediction


Mutagenicity Prediction

Introduction

   In an interview with Professor Ethan Lan of NCTU, we learned that the mutagenicity of a compound is highly related to its chemical substructure. Inspired by this, we came up with an idea: why not use machine learning to predict mutagenicity based on chemical substructures?

   We then extracted 67 features based on different chemical substructures, trained a Support Vector Machine (SVM) on the data, and used a scoring function to quantify mutagenicity. To validate our model, we compared the mutagenicity scores computed by our AI with the number of bacterial colonies from Ames test results to demonstrate the score's reliability.

Introduction to Machine Learning

   Artificial intelligence is a prevalent technique now trending all over the world, and more and more experts in different fields are starting to use it as an essential tool to extend their research. Machine learning is a branch of artificial intelligence; concisely, it processes input data to generate useful predictions. Unlike typical programming methods, machine learning uses statistics instead of explicit logic to solve problems, which allows it to handle complicated tasks with accessible programming.

Basic Concept of Machine Learning

   In general, training a machine learning model can be represented by the following flow chart:

Figure 1: Flow chart of machine learning

Label: The thing we are predicting. In our model, the label is “mutagen” or “non-mutagen.”

Features: Features in machine learning are input variables. In our model, the features are the substructures.

Loss: Loss is the mean squared error between the label and the model prediction.

Model: The function that predicts the label. We chose Support Vector Machine as our classification algorithm.

   In supervised machine learning, we divide the training data into features and labels. From the extracted features, the model computes the input vectors and makes predictions, which are then compared with the labels. Initially, the loss will be large, so a loss function and an optimizer adjust the parameters in the model to make the predictions more accurate. Once the training data is sufficient and the training process has run for enough iterations, the loss will converge to a stable value, indicating that training is complete.

Data Preprocess

Figure 2: Database preprocessing

1. Database Source

   For trustworthy machine learning, sufficient data is crucial. We used the QSAR Toolbox, an open software package that offers transparent chemical hazard assessment. We collected chemical structures in SMILES form as training data and the corresponding Ames test results as target data for machine learning.

Table 1: Database Source

Chemical reactivity COLIPA, Experimental pKa GSH, Experimental RC50, Phys-chem EPISUITE, pKa OASIS, ECHA CHEM, Bioconcentration NITE, Bioaccumulation Canada, Biota-Sediment Accumulation Factor US-EPA, Biodegradation in soil OASIS, REACH Bioaccumulation database (normalised), kM database Environment Canada, Hydrolysis rate constant OASIS, Bioaccumulation fish CEFIC LRI, ECOTOX, Biodegradation NITE, Aquatic ECETOC, Food TOX Hazard EFSA, Aquatic Japan MoE, Aquatic OASIS, Micronucleus ISSMIC, ToxRefDB US-EPA, Micronucleus OASIS, Skin irritation, Receptor Mediated Effects, Rep Dose Tox Fraunhofer ITEM, Dendritic cells COLIPA, Biocides and plant protection ISSBIOC, Skin sensitization ECETOC, Rodent Inhalation Toxicity Database, Acute Oral toxicity ToxCastDB, Cell transformation Assay ISSCTA, Repeated Dose Toxicity HESS, Keratinocyte gene expression LuSens, Genotoxicity pesticides EFSA, ZEBET database, Developmental & Reproductive Toxicity (DART), Human Half-Life, Skin Sensitization, REACH Skin Sensitisation database (normalised), GARD Skin sensitization, Yeast estrogen assay database, Transgenic Rodent Database, Genotoxicity & Carcinogenicity ECVAM, Carcinogenic Potency Database (CPDB), Carcinogenicity ISSCAN, Toxicity Japan MHLW, Genotoxicity OASIS, Keratinocyte gene expression Givaudan, Eye irritation ECETOC, Toxicity to reproduction (ER), Developmental toxicity ILSI, MUNRO non-cancer EFSA, Bacterial mutagenicity ISSSTY, Developmental toxicity database (CAESAR), ADME database

2. SMILES

   Simplified Molecular Input Line Entry System (SMILES) is a set of chemical notations commonly used in molecular databases. Its key characteristic is that a one-dimensional string can represent a three-dimensional chemical structure. In other words, once we have the SMILES of a chemical, we can recover its full chemical structure from that one-dimensional syntax.

Figure 3: SMILES notation transition

3. Feature Extraction

   For human practices, we conducted an interview about machine learning with a postdoctoral researcher in a computer science laboratory at NCTU. He suggested that we should not feed SMILES strings into the model directly without any preprocessing; instead, we should extract features for the different substructures. After a literature search, we found a paper suggesting 67 kinds of substructures with mutagenic potential (Figure 4). Since one chemical structure can be written in multiple ways in SMILES, we could not simply search for substrings in the SMILES text. We therefore used RDKit, an open-source Python library commonly used in cheminformatics, to match substructures. After matching all the substructures, we took them as input features for machine learning.

Figure 4: The chemical substructures with mutagen potential
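The substructure-matching step can be sketched with RDKit. Below is a minimal illustration; the three SMARTS patterns are hypothetical stand-ins for a few of the 67 structural alerts, whose actual definitions come from the cited literature.

```python
from rdkit import Chem

# Hypothetical structural-alert patterns written as SMARTS.
# The real model uses the 67 substructures reported in the literature.
ALERTS = [
    Chem.MolFromSmarts("[N+](=O)[O-]"),  # nitro group
    Chem.MolFromSmarts("C1CO1"),         # epoxide ring
    Chem.MolFromSmarts("N=[N+]=[N-]"),   # azide
]

def extract_features(smiles):
    """Return a binary feature vector: 1 if the alert substructure is present."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return [int(mol.HasSubstructMatch(alert)) for alert in ALERTS]

# Nitrobenzene contains a nitro group but no epoxide or azide.
print(extract_features("c1ccccc1[N+](=O)[O-]"))  # [1, 0, 0]
```

Matching on the parsed molecule rather than the raw string is exactly why RDKit is needed here: two different SMILES spellings of the same molecule produce the same feature vector.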

4. Labeling

   The result of an Ames test is “Positive” or “Negative.” However, it is hard for the machine to interpret these strings, so we label “Positive” as “1” and “Negative” as “0.”

Support Vector Machine

   We built a Support Vector Machine (SVM) model to quantify the mutagenicity of chemical compounds. SVM is a machine learning algorithm mainly used for data classification in pattern recognition. Generally speaking, this kind of method fits a plane equation that separates the data into two classes. Because this plane may classify multi-dimensional data, it is called a “hyperplane.” For example, to divide a 3-D space, we need a 2-D plane; in general, an N-dimensional space can be split by an (N-1)-dimensional hyperplane. Among all the data points, those that determine the optimal hyperplane are called “support vectors.”

Figure 5: Demonstration of support vector machine classifier
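Training such a classifier can be sketched with scikit-learn's SVC. The dataset below is a synthetic stand-in for our real 67-feature vectors; the sample count and labeling rule are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 200 samples with 67 binary
# substructure features, labeled 1 (mutagen) or 0 (non-mutagen).
X = rng.integers(0, 2, size=(200, 67)).astype(float)
y = (X[:, :5].sum(axis=1) > 2).astype(int)  # toy labeling rule

model = SVC(kernel="rbf")  # RBF kernel, as described below
model.fit(X, y)

print(model.predict(X[:3]))          # predicted class labels (0 or 1)
print(model.support_vectors_.shape)  # support vectors found during training
```

After fitting, `model.support_vectors_` holds exactly the critical points described above: the subset of training data that determines the hyperplane.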

SVM Scoring Function

Now, with the SVM model, we can classify whether a chemical is a mutagen based on its structure. However, our ultimate goal was to quantify mutagenicity. To achieve this, we had to compute the intensity of each data point, which corresponds to its mutagenicity. After further research, we found a solution in the SVM scoring function. This function calculates the relative distance between the input data point and each support vector and sums the contributions into a score, which can be taken as a standard for quantifying mutagenicity.

$$f(x) = \sum_{i=1}^{m}\alpha _{i}y^{(i)}K(x^{(i)},x)+b $$

Table 2: The parameters of SVM scoring function

Parameter Meaning Value
$$\alpha_i$$ The coefficient associated with the $i$-th training data point. variable
$$y^{\left(i\right)}$$ The class label to divide into two groups, which has only one of two values. 1 or -1
$$K\left(x^{\left(i\right)},x\right)$$ Kernel function variable
$$b$$ Scalar value -0.544

   In the table above, we can observe that every data point corresponds to a parameter α. α serves as a critical variable: it gives each data point's coefficient, and the points with non-zero α are the support vectors that determine the hyperplane.
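The scoring function above corresponds to scikit-learn's `decision_function`. As a sketch, the code below recomputes it by hand from the fitted dual coefficients (which store the products α_i·y^(i)), the support vectors, and the intercept b, on synthetic stand-in data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 67))   # stand-in for 67-feature vectors
y = (X[:, 0] > 0).astype(int)    # toy labels

GAMMA = 0.1
model = SVC(kernel="rbf", gamma=GAMMA).fit(X, y)

def svm_score(x):
    """f(x) = sum_i alpha_i * y^(i) * K(x^(i), x) + b over the support vectors.
    model.dual_coef_ holds alpha_i * y^(i); model.intercept_ is b."""
    K = rbf_kernel(model.support_vectors_, x.reshape(1, -1), gamma=GAMMA)
    return (model.dual_coef_ @ K).item() + model.intercept_[0]

x_new = rng.normal(size=67)
manual = svm_score(x_new)
builtin = model.decision_function(x_new.reshape(1, -1))[0]
print(np.isclose(manual, builtin))  # the hand-rolled score matches sklearn's
```

A positive score falls on the mutagen side of the hyperplane, a negative score on the non-mutagen side, and the magnitude reflects distance from the boundary.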

Table 3: The instances of SVM score

Chemical Score
MNNG 29.13
MMS 0.073
EMS 0.073
Azacytidine 5.50
Aminopurine 4.62
Glyoxal 0.70
Formaldehyde 0.70
Captan 18.36
Phosmet 3.79

Kernel Function

   The kernel function in the SVM scoring system maps each data point into a higher-dimensional space, where data that was not linearly separable initially can be separated and further analyzed. It also expresses the high-dimensional computation in matrix form, which simplifies the calculation and reduces training time.

   The following figure shows the input and output data points of the kernel function in different dimensions.

   When a new chemical is introduced to our model, its coordinates are first determined from the 67 features. The coordinates are then arranged into a matrix, and the SVM scoring function generates its score. The higher the score, the higher the compound's mutagenicity; conversely, a negative score indicates that the compound is a non-mutagen.

Figure 6: The examples of input and output of kernel function
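As a concrete illustration, the RBF kernel value for a pair of points can be computed directly. This is a generic sketch with an arbitrary gamma, not the exact kernel settings of our model.

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    """K(x1, x2) = exp(-gamma * ||x1 - x2||^2): equals 1 for identical
    points and decays toward 0 as the points move apart."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

print(rbf_kernel([0, 0], [0, 0]))  # 1.0 (identical points)
print(rbf_kernel([0, 0], [1, 1]))  # exp(-1.0), roughly 0.368
```

Because K depends only on the distance between points, the full kernel matrix over a dataset can be computed once and reused, which is the matrix-form simplification mentioned above.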

Results

Confusion Matrix

Figure 7: The confusion matrix of predicted class and true class

$$Accuracy = \frac{TPN+TNN}{TPN+TNN+FPN+FNN} = \frac{17518}{21000} = 0.834 $$

$$Sensitivity = \frac{TPN}{TPN+FNN} = \frac{5138}{7231} = 0.71 $$

$$Specificity = \frac{TNN}{TNN+FPN} = \frac{12380}{13769} = 0.899 $$

TPN: True-Positive Number

TNN: True-Negative Number

FPN: False-Positive Number

FNN: False-Negative Number

Validation: k-fold cross validation (k = 5)

Training data: 7231 mutagens and 13769 non-mutagens

  The confusion matrix shows the number of samples in each true class and predicted class. "0" and "1" are the labels in our model. The X-axis is our model's prediction, and the Y-axis is the true label.
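The reported metrics follow directly from the confusion-matrix counts: TPN = 5138 out of 7231 mutagens, and TNN = 12380 out of 13769 non-mutagens.

```python
# Counts taken from the confusion matrix reported above.
TPN = 5138           # true positives
FNN = 7231 - 5138    # false negatives (total mutagens minus TPN)
TNN = 12380          # true negatives
FPN = 13769 - 12380  # false positives (total non-mutagens minus TNN)

accuracy = (TPN + TNN) / (TPN + TNN + FPN + FNN)
sensitivity = TPN / (TPN + FNN)
specificity = TNN / (TNN + FPN)

print(round(accuracy, 3), round(sensitivity, 2), round(specificity, 3))
# 0.834 0.71 0.899
```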

ROC Curve

   A receiver operating characteristic (ROC) curve, also known as a relative operating characteristic curve, is one of the most important evaluation metrics for checking the performance of classifiers. It compares two operating characteristics, the true positive rate (TPR) and the false positive rate (FPR), to diagnose the model's ability.

Figure 8: The ROC curve analysis of Support vector machine (SVM)

   The area under the curve (AUC) of the ROC is the standard for judging the model's ability. The score lies between 0 and 1, and the higher the score, the higher the accuracy. If AUC > 0.5, the model has predictive value.
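Computing an ROC curve and its AUC can be sketched with scikit-learn. The labels and decision scores below are toy values, not our model's actual outputs.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy true labels and decision scores (e.g. SVM decision_function outputs).
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
scores = np.array([-1.2, -0.5, 0.3, 0.8, 1.5, -0.1, -2.0, 2.2])

# roc_curve sweeps the decision threshold and records (FPR, TPR) pairs;
# roc_auc_score integrates the area under that curve.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(round(auc, 4))  # AUC > 0.5 means the model has predictive value
```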

Statistical Histogram

   We were not satisfied with the result we had obtained, so we started to look for ways to increase our accuracy. Finally, during an interview with Ms. ZHAO, SHU-RU, a postdoctoral research fellow at the Disaster Prevention & Water Environment Research Center in NCTU, we identified the problem. She told us that the reason our model's accuracy could not go higher lay in our data source: Ames test results have a high false-positive rate. Incorrect data therefore accumulated errors and brought down the accuracy. To solve this problem, we used the ISSTOX database (not in our training data) and collected only the data that had been verified by more than one method. As we can see below, the false-negative rate (FNR) is now lower than before. Most importantly, our model is now more reliable and accurate.

Figure 9: Statistical data of SVM score

Comparison of Ames Test Colonies and SVM scores

   The result of an Ames test can be taken as a quantified measure of mutagenicity. To prove our artificial intelligence is trustworthy, we fit a linear regression between the Ames test results and our model's predictions. The result shows that the two have an R² of 0.9699, which indicates a strong correlation.

$$Coefficient\,\,of\,\,determination: R^2 = 0.9699 $$

Figure 10: Regression of Ames test and model prediction
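R² between colony counts and SVM scores can be computed as below. The paired values are hypothetical placeholders, not our measured data.

```python
import numpy as np

# Hypothetical paired observations: model SVM scores vs. Ames colony counts.
scores   = np.array([0.1, 0.7, 3.8, 4.6, 5.5, 18.4, 29.1])
colonies = np.array([12., 35., 160., 200., 240., 800., 1250.])

# Fit a least-squares line, then compute the coefficient of determination
# R^2 = 1 - (residual sum of squares) / (total sum of squares).
slope, intercept = np.polyfit(scores, colonies, 1)
predicted = slope * scores + intercept
ss_res = np.sum((colonies - predicted) ** 2)
ss_tot = np.sum((colonies - colonies.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 4))
```

For simple linear regression, this R² equals the squared Pearson correlation between the two variables, so either computation gives the same figure.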

Demonstration

   We designed a UI for the public and you. Click the link below to experience our artificial intelligence!

Try it!

Reference

1. Toshihiro Ohta, et al. (2000). "A comparison of mutation spectra detected by the Escherichia coli Lac+ reversion assay and the Salmonella typhimurium His+ reversion assay"

2. Paolo Mazzatorta, et al. (2006). "Integration of Structure-Activity Relationship and Artificial Intelligence Systems To Improve in Silico Prediction of Ames Test Mutagenicity"

3. Donald J. Abraham. (2003). "History of Quantitative Structure-Activity Relationships"

4. Katsuhiko Sawatari, et al. (2001). "Relationships between Chemical Structures and Mutagenicity: A Preliminary Survey for a Database of Mutagenicity Test Results of New Work Place Chemicals"

5. Seal A, et al. (2012). "In-silico predictive mutagenicity model generation using supervised learning approaches"

6. Romualdo Benigni, et al. (2005). "Structure-Activity Relationship Studies of Chemical Mutagens and Carcinogens: Mechanistic Investigations and Prediction Approaches"

7. Yicheng Zhu, et al. (2018). "Machine learning techniques for classifying the mutagenic origins of point mutations"

8. Norinder, U., et al. (2019). "Predicting Ames Mutagenicity Using Conformal Prediction in the Ames/QSAR International Challenge Project"

9. Moorthy, N. S. Hari Narayana, et al. (2017). "Classification of carcinogenic and mutagenic properties using machine learning method"

10. Yunyi Wu, et al. (2018). "Machine Learning Based Toxicity Prediction: From Chemical Structural Description to Transcriptome Analysis"

11. Kristina Preuer, et al. (2019). "Interpretable Deep Learning in Drug Discovery"

12. Defang Fan, et al. (2018). "In silico prediction of chemical genotoxicity using machine learning methods and structural alerts"

13. Suman K. Chakravarti, et al. (2019). "Descriptor Free QSAR Modeling Using Deep Learning With Long Short-Term Memory Neural Networks"