The Parseq model
Neural networks predicting antimicrobial activity
We are proud to present the Parseq model: a neural-network-based model able to predict the antimicrobial activity of peptides by studying both pairs of amino acids and the amino acid sequence. This combination of pair and sequence analysis gave it its name, Parseq. Given a peptide sequence, the model analyses it and predicts whether the peptide is antimicrobial or not. Parseq has been verified to work with 88% accuracy on 1000 peptides and has also been used to create two new peptides, Parseq Alpha and Parseq Beta. These have been tested in the lab against gram-negative and gram-positive bacteria with successful results. The model has also been used to analyse how mutations and post-translational modifications affect the activity of the other peptide candidates used in our project: LL-37, Magainin 2 and Pln1.
Download our model here
The Parseq model is now available for download. Press the icon to download a zip file containing the models and an analysis tool. To run the script, you will need Python 3 and the libraries used in the script. Best of luck with your peptide predictions!
So how did we come up with this idea? Our project uses antimicrobial agents other than antibiotics and silver to kill bacteria. We avoid antibiotics and silver in our wound dressing because of the risk of resistance against these compounds. However, resistance against antimicrobial peptides and enzymes could eventually become a problem as well. To stay one step ahead, we wanted to create a model that can predict the antimicrobial activity of peptides, so that it can be used to create new peptide candidates for coupling with CBD and application to wound dressings.
 Silver S, Phung LT, Silver G. Silver as biocides in burn and wound dressings and bacterial resistance to silver compounds. Journal of Industrial Microbiology & Biotechnology. 2006 Jul;33(7):627–34.
Our inspiration for this modelling project began in 2017 at the iGEM Jamboree. There we met the team from NCTU Formosa and their project Parabase: a machine-learning-based model able to find antifungal peptides. So, when our iGEM project this year was set to involve antimicrobial peptides, we decided to improve upon Parabase and adapt it to antimicrobial peptides. Our improvements to their model can be summarized in three steps: first, we used more advanced modelling with neural networks; second, we used much more data when training the model; third, we improved the data analysis to also study the specific peptide sequence.
If you are interested in Parabase, you can press their icon to reach their wiki where you can read all about their project!
Parseq is a neural-network-based model. Neural networks are a popular machine learning tool used in many different fields. The basis of all machine learning is that a computer finds patterns in datasets too large and complicated for humans to analyse, which makes machine learning a great tool for bioinformatics studies of large sets of protein and DNA sequences. A neural network consists of three parts: an input layer, hidden layers, and an output layer. Each layer consists of several nodes, each connected to all nodes in the layers before and after it. The input layer comes first and matches the data entering the model, i.e. there is one node for each data point entering the model; a data point can, for example, be a single pixel of an image. The hidden layers sit between the input and output layers and are the layers that analyse the data; commonly there are just a few of them. The output layer comes last and matches the answer the model gives, i.e. there is one node for each category the model can divide the input into. Two such categories can, for example, be antimicrobial and non-antimicrobial.
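As a concrete illustration of these three parts, here is a toy fully connected network in plain Python, with placeholder weights rather than anything trained: four input nodes, one hidden layer of three nodes, and two output nodes.

```python
def dense(inputs, weights, biases, activation):
    """One fully connected layer: every node sees every node in the prior layer."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

relu = lambda x: max(0.0, x)      # common hidden-layer activation
identity = lambda x: x

# Toy network: 4 input nodes -> 3 hidden nodes -> 2 output nodes
hidden_w = [[0.1] * 4 for _ in range(3)]   # placeholder weights, not trained
hidden_b = [0.0, 0.0, 0.0]
out_w = [[0.5] * 3 for _ in range(2)]
out_b = [0.0, 0.0]

x = [1.0, 0.0, 1.0, 0.0]                   # one value per input node
hidden = dense(x, hidden_w, hidden_b, relu)
output = dense(hidden, out_w, out_b, identity)
print(len(hidden), len(output))            # prints: 3 2
```

During training, the computer adjusts the weights and biases so that the output nodes separate the categories; the structure above stays fixed.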
Neural networks are inspired by biology: they are created to mimic how neurons in the brain behave and communicate with each other. In the network, the nodes behave like neurons. Each node receives signals from several other nodes and passes a signal forward if the combined input is strong enough, mimicking how a neuron fires an action potential along its axon when its dendrites receive enough input. The exact signal strengths and firing thresholds in the network are determined by the computer as the model trains on large sets of data. The goal is for the model to identify important parameters and characteristics it can use to distinguish the data.
The function of a neural network is determined by the training data its nodes are calibrated against. The model therefore needed both a positive and a negative training set, to identify both the parts of a sequence necessary for antimicrobial activity and the parts that are unnecessary or disruptive. All peptides characterized as antimicrobial were therefore extracted from the Antimicrobial Peptide Database, while the peptides serving as the negative control were gathered from UniProt. From both databases, only peptides between 10 and 59 amino acids in length were taken, and sequences containing unknown amino acids were excluded.
Antimicrobial peptides were then removed from the control list: all sequences that appeared both in the antimicrobial dataset and in the control list were removed, as well as peptides in the control list containing parts that matched a peptide from the antimicrobial dataset. Since many antimicrobial peptides are products of post-translational modification, all starting methionines were also removed from the peptide sequences, to eliminate this obvious pattern from the control list. In total, 2271 negative control peptides were used from the control list, the same number as antimicrobial peptides.
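The filtering rules above can be summarised in a short script. The function name and the toy sequences below are hypothetical; they only illustrate the length filter, the unknown-residue filter, the methionine stripping, and the overlap removal:

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def clean_candidates(amp_seqs, control_seqs, min_len=10, max_len=59):
    """Filter raw database sequences by the rules described above."""
    def valid(seq):
        # length 10-59 and no unknown amino acid codes (X, B, Z, ...)
        return min_len <= len(seq) <= max_len and set(seq) <= STANDARD_AA

    amps = [s for s in amp_seqs if valid(s)]
    controls = []
    for s in control_seqs:
        if not valid(s):
            continue
        if s.startswith("M"):            # strip the starting methionine
            s = s[1:]
        # drop controls that equal, or contain, a known antimicrobial peptide
        if any(a in s or s in a for a in amps):
            continue
        controls.append(s)
    return amps, controls

amps, controls = clean_candidates(
    ["KWKLFKKIGAVLKVL"],                              # toy antimicrobial entry
    ["M" + "A" * 12, "KWKLFKKIGAVLKVLGG", "G" * 12])  # toy control entries
```

In this toy run, the second control is dropped because it contains the antimicrobial sequence, and the first loses its leading methionine.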
In total, 4542 peptides were used for training and testing the model. These were divided into two sets, one training set and one testing set, each containing equal numbers of antimicrobial and control peptides. The training set consisted of 3542 peptides, which the model used to train and adapt. The testing set consisted of 1000 peptides, used to show how well the trained model performs on new data.
With the data collected, it needed to be converted from sequence data into data readable by the model. We did this conversion in two ways: one pair-based and one sequence-based.
Below we show examples of both conversions on the peptide LL-37, which has the sequence
The pair-based conversion is the same one Parabase used in 2017. The sequence is transformed into a 20x20 matrix in which each cell represents an amino acid pair. The image shows what the antimicrobial peptide LL-37 looks like using this conversion. Since 20x20 = 400, a neural network analysing this data needs 400 nodes in its input layer.
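A minimal sketch of this pair-based conversion is shown below. We assume here that each cell holds the count of that adjacent amino acid pair; a frequency-normalised variant would work the same way:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def pair_matrix(seq):
    """20x20 matrix counting every adjacent amino acid pair in the sequence."""
    m = [[0] * 20 for _ in range(20)]
    for a, b in zip(seq, seq[1:]):   # all adjacent pairs
        m[INDEX[a]][INDEX[b]] += 1
    return m

m = pair_matrix("KWKK")                 # pairs: KW, WK, KK
flat = [x for row in m for x in row]    # 400 values -> 400 input nodes
```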
The sequence-based conversion converts the sequence into a 20x60 matrix: 20 rows, one for each amino acid, and 60 columns, the sequence length limit. The image shows what LL-37 looks like using this conversion, excluding the last 23 empty columns, as LL-37 has 37 amino acids and the matrix is 60 wide. Since 20x60 = 1200, a neural network analysing this data needs 1200 nodes in its input layer.
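This sequence-based conversion is a plain one-hot encoding, which can be sketched as:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MAX_LEN = 60                            # sequence length limit

def sequence_matrix(seq):
    """20x60 one-hot matrix: row = amino acid identity, column = position."""
    m = [[0] * MAX_LEN for _ in range(20)]
    for pos, aa in enumerate(seq):
        m[INDEX[aa]][pos] = 1
    return m

m = sequence_matrix("LLGDF")            # first residues of LL-37
flat = [x for row in m for x in row]    # 1200 values -> 1200 input nodes
```

Columns beyond the peptide's length simply stay all-zero, which is why shorter peptides leave empty columns at the right of the matrix.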
With all the data collected and transformed into images readable by a neural network, it was time to create the model. We created two separate neural networks, one for pair analysis and one for sequence analysis, which we later combined into the Parseq model. Both were created using Python and the TensorFlow library.
Each of the two models analyses its respective image of a peptide and outputs a number from 0 to 1: if the model considers the peptide antimicrobial, the number is above 0.5, and vice versa. The specific parameters of the models are collected in Table 1 below. The network sizes and numbers of training runs were set using the tool TensorBoard and were chosen to avoid overfitting of the models.
Table 1. The sizes of the two models.
| Model | Layers | Input Nodes | Hidden Nodes | Output Nodes | Training |
Some other important parameters of the models are:
Activation function, hidden layers - Rectified Linear Unit (ReLU)
Activation function, output layer - Softmax
Optimisation function - Adagrad
Loss function (the quantity optimised) - Sparse Categorical Cross-Entropy
Evaluation metric - Sparse Categorical Accuracy
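With these settings, each sub-network could be defined in TensorFlow roughly as below. The hidden-layer size (64) is a placeholder of our choosing; the actual sizes are given in Table 1, so treat this as a sketch of the configuration rather than our exact script:

```python
import tensorflow as tf

def build_model(n_inputs, n_hidden):
    """One sub-network: ReLU hidden layer, softmax output over two classes."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_inputs,)),
        tf.keras.layers.Dense(n_hidden, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),  # [not AMP, AMP]
    ])
    model.compile(optimizer="adagrad",
                  loss="sparse_categorical_crossentropy",
                  metrics=["sparse_categorical_accuracy"])
    return model

pair_model = build_model(400, 64)   # 20x20 pair matrix, flattened
seq_model = build_model(1200, 64)   # 20x60 sequence matrix, flattened
```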
The Parseq model was then created by combining the two models: the mean of their two outputs is used as the final prediction. In this way, we created a single model that analyses both the pair composition of a peptide and its sequence.
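Combining the models is a one-line mean; the numeric scores below are made up for illustration:

```python
def parseq_score(pair_score, seq_score):
    """Final Parseq prediction: the mean of the two sub-model outputs."""
    return (pair_score + seq_score) / 2

score = parseq_score(0.9, 0.7)                               # toy sub-model outputs
verdict = "antimicrobial" if score > 0.5 else "not antimicrobial"
```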
To see how well the Parseq model performs, we used the testing dataset of 1000 peptides: 500 antimicrobial and 500 not. Figure 1 shows a histogram of the prediction values for the testing peptides, with regular peptides represented in blue and antimicrobial peptides in purple. A prediction value greater than 0.5 means the peptide is predicted to be antimicrobial. The graph shows that the model can, overall, predict whether a peptide is antimicrobial or not.
Figure 1. Histogram showing the results of the Parseq model on the testing dataset. Regular peptides are marked in blue and antimicrobial peptides in purple. The x-axis gives each peptide's prediction value from the Parseq model. The concentration of blue to the left and purple to the right is desired, as it means the model classifies the peptides correctly.
A more in-depth study of how well the model performs, and how the Parseq model compares to the individual pair and sequence models, is given in Table 2 below. The table shows each model's accuracy on the testing dataset and the sum of squared errors (SSE) of the predictions, comparing each predicted value with 1 or 0 depending on whether the peptide is antimicrobial or not. The table shows that the Parseq model is better than each of the two individual models, as both the accuracy and the SSE are improved.
Table 2. Comparison between the Parseq model and the two individual models, showing the accuracy of correctly predicted peptides and the sum of squared errors of the peptide predictions.
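The two metrics in Table 2 can be computed from a list of predictions as follows; the numbers here are made up for illustration, not our actual test data:

```python
def evaluate(predictions, labels, threshold=0.5):
    """Accuracy and sum of squared errors for predictions in [0, 1]
    against 0/1 labels (1 = antimicrobial)."""
    correct = sum((p > threshold) == bool(y)
                  for p, y in zip(predictions, labels))
    sse = sum((p - y) ** 2 for p, y in zip(predictions, labels))
    return correct / len(labels), sse

# Toy example: four peptides, one truly antimicrobial
accuracy, sse = evaluate([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0])
```

Lower SSE means the predictions sit closer to the true 0/1 labels, so it complements accuracy by also rewarding confident correct predictions.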
After showing that the Parseq model was superior to its parts, we wanted to see how we could improve its accuracy. As we were going to use this model to find new antimicrobial peptides, we wanted to explore how certain we could be that a peptide classified as antimicrobial really is antimicrobial. We therefore started by studying only the peptides above the threshold of 0.5, i.e. those the model classified as antimicrobial. This group contained 517 of the 1000 testing peptides, and 86.7% of them were antimicrobial. To improve this accuracy, we increased the threshold. Table 3 collects the number of peptides above each threshold and the percentage of those peptides that are antimicrobial. The table shows that a higher model output increases the certainty that the peptide is antimicrobial. It is important to note, however, that the number of peptides drops at high thresholds, making these percentages less reliable.
Table 3. An increasing threshold applied to the model output increases the certainty that a peptide is antimicrobial. As the threshold increases, fewer and fewer of the 1000 peptides fall above it.
| Model Threshold | Peptides above threshold | Percentage antimicrobial |
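The threshold sweep behind Table 3 amounts to the following computation; the predictions and labels below are toy values, not our test set:

```python
def threshold_table(predictions, labels, thresholds=(0.5, 0.7, 0.9)):
    """For each threshold: how many peptides score above it, and what share
    of those really are antimicrobial (label 1)."""
    rows = []
    for t in thresholds:
        above = [y for p, y in zip(predictions, labels) if p > t]
        share = sum(above) / len(above) if above else float("nan")
        rows.append((t, len(above), share))
    return rows

rows = threshold_table([0.95, 0.8, 0.6, 0.3], [1, 1, 0, 0])
```

As the threshold rises, the share of true antimicrobials among the survivors grows, but the group shrinks, which is exactly the trade-off noted above.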
Analysis of modified peptides
In the project we wanted to use three antimicrobial peptides: LL-37, Pln1 and Magainin 2. Our project design involved the use of thrombin to release these peptides, which would add two amino acids, G and S, to the peptide sequences. We therefore used the Parseq model to analyse whether this change would affect their antimicrobial activity. As shown in Table 4, all peptides were predicted to be antimicrobial both with and without the GS amino acids.
Table 4. Parseq model predictions for the peptides with and without the residues remaining after thrombin cleavage.
| Peptide | Antimicrobial or not | Model prediction |
| gs-LL37 | Antimicrobial | 0.5110 |
| gs-Pln1 | Antimicrobial | 0.8210 |
| gs-Mag 2 | Antimicrobial | 0.9064 |
It was finally time to use the Parseq model to create new peptides. To do this, we created an evolution script in Python that used the model as a scoring function to determine each peptide's fitness.
The evolution was set to run for 100 generations. In each generation, 10,000 peptides were analysed; only the best were kept, and these mutated 10-20% of their amino acids to create mutated copies. Some additional peptides were also generated in each generation, to add new information to the evolution and to prevent it from getting caught in a local optimum.
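A minimal sketch of such an evolution loop is shown below. The keep/fresh counts and the toy scoring function are our illustrative assumptions; in the real script the fitness score is the Parseq model's prediction:

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"

def score(seq):
    # Toy stand-in fitness; in the real script this is the Parseq prediction.
    return sum(aa in "KRW" for aa in seq) / len(seq)

def mutate(seq, low=0.10, high=0.20):
    """Return a copy with 10-20% of the amino acids replaced at random."""
    n = max(1, round(len(seq) * random.uniform(low, high)))
    positions = set(random.sample(range(len(seq)), n))
    return "".join(random.choice(AA) if i in positions else aa
                   for i, aa in enumerate(seq))

def evolve(generations=100, pop_size=10_000, length=19, keep=100, fresh=100):
    pop = ["".join(random.choices(AA, k=length)) for _ in range(pop_size)]
    for _ in range(generations):
        best = sorted(pop, key=score, reverse=True)[:keep]
        # keep the best, refill with mutated copies of them, and add fresh
        # random peptides to avoid getting stuck in a local optimum
        pop = best + [mutate(random.choice(best))
                      for _ in range(pop_size - keep - fresh)]
        pop += ["".join(random.choices(AA, k=length)) for _ in range(fresh)]
    return max(pop, key=score)

best = evolve(generations=5, pop_size=50, keep=5, fresh=5)   # small demo run
```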
Now we are proud to present the two Parseq peptides. They are two of the many peptides created with our evolution script and were selected based on a few parameters. Most importantly, the stability of the peptides was analysed using ExPASy, where stable peptides were wanted. The peptides were also BLASTed to ensure that no toxic protein shared their structure. Lastly, a methionine (encoded by the start codon) was added as the first amino acid to make them compatible with bacterial expression.
Parseq Alpha - MLPWKIKAWGVHHNRWKFK
Parseq score: 0.999
Basic part: BBa_K3182009
Parseq Beta - MIHHVWKTWGIKFNRYELK
Parseq score: 0.9998
Basic part: BBa_K3182010
The peptides were then provided to us by Caslo, a Danish company specialising in synthetic peptide production. In this way, we could analyse their activity directly, without having to express them in E. coli like our other peptides.
When the peptides arrived from Caslo, we dissolved them in 6 M urea to a concentration of 10,000 µg/ml. These stocks were then used when testing the peptides against bacteria. The results from these tests have been successful: both peptides showed antimicrobial effects against both gram-positive and gram-negative bacteria. These results are presented on our Results page.