C3Pred - Cell-Penetrating Peptide Predictor
Cell-penetrating peptides (CPPs) are short peptides, typically 4-30 amino acids long, that can transport different cargos across the cell membrane. These cargos include proteins, nanobodies, DNA molecules, and small-molecule drugs. In recent years, numerous promising clinical and pre-clinical trials have been launched with CPPs as carriers for pharmacologically active small molecules.
CPPs can be classified into three general groups: cationic, amphipathic, and hydrophobic. Each group has different physicochemical properties and therefore a different internalization mechanism. The pathways CPPs exploit to enter the cell fall into two broad groups: endocytic pathways and direct cell penetration. Because the mechanisms involved are diverse and not fully understood, simple models that describe the relationship between sequence and function are particularly challenging to build. Hence, the discovery of novel CPPs and the optimization of their activity rely mostly on large screening approaches.
Motivation
As part of our project, we planned to use cell-penetrating peptides (CPPs) to transport a cargo, the Exendin-4 protein, across a eukaryotic cell layer. A challenging part of our project design was choosing a specific CPP from the tremendous variety of transporting peptides. In October 2019, CPPsite 2.0, the largest database for CPPs, contained more than 1850 different peptide motifs that are capable of delivering conjugates into cells. However, the CPPsite database only contains qualitative information about whether a peptide has cell-penetrating activity, not quantitative information about how efficiently cargo is transported across the cell membrane.
Therefore, we decided to build a predictive software tool that assigns a transport efficiency score to CPPs. This makes it possible to take educated decisions during the design process.
Overview of our tool
Our goal was to create an easy-to-use software tool that allows scientists to make design choices for their system based on quantitative transport activity scores. We put a strong emphasis on usability for different kinds of users. Firstly, we implemented an intuitive browser-based graphical user interface. Secondly, we released our software tool as a Python package that can be installed with pip, allowing other developers to incorporate it into their own scripts.
Short peptide sequences can be submitted to the program as a sequence string in FASTA format, a UniProt ID, or an iGEM Registry ID of a coding DNA sequence. Our tool computes a transport efficiency score for each sequence, which is automatically interpreted for the user by comparing it to well-characterized CPPs.
Installation and usage of our tool
Our tool can be used on Linux, Windows, and macOS. Please follow the installation instructions LINK.
When the tool starts, the landing page allows navigating to the three running modes (from protein sequence, UniProt accession number, and iGEM Registry ID). Additional information about the running modes can be displayed by clicking on “more info”.
Prediction using a sequence string
To predict the transport efficiency of a FASTA-formatted amino acid string, please make sure that the sequence contains only the one-letter codes of the 20 standard amino acids. Moreover, please make sure that the submitted sequence is not longer than 50 residues.
Using a UniProt accession number
Since UniProt is one of the largest publicly accessible resources for proteins, our tool accepts a UniProt accession number as input, which consists of 6 or 10 characters (e.g. Q86FU0). Information about the accession numbers can be found on the UniProt website.
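As an illustration, an identifier can be checked against the accession pattern published in the UniProt documentation before any request is sent. This is a minimal sketch; the function name is hypothetical, and whether C3Pred performs such a check internally is an assumption.

```python
import re

# Accession pattern as published in the UniProt help pages (6 or 10 characters).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_uniprot_accession(identifier: str) -> bool:
    """Return True if the string has the shape of a UniProtKB accession.

    Hypothetical helper: it only checks the format, not whether the
    entry actually exists in UniProt.
    """
    return UNIPROT_ACCESSION.fullmatch(identifier.strip().upper()) is not None
```

A format check like this cannot replace the lookup itself, but it allows rejecting obvious typos without a network round trip.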
Prediction using an iGEM Registry ID
Predictions for sequences in the iGEM Registry can only be made for parts tagged in the subcategory “coding sequence”; any other identifier is rejected. The DNA sequences are automatically translated into protein sequences.
Implementation
Input
C3Pred accepts three possible input formats for protein data:
- FASTA-formatted sequences
- UniProtKB Accession Number
- iGEM Part ID
C3Pred automatically fetches and parses the information about the given identifiers using the UniProt website REST API or using the iGEM Registry API, respectively. For this function, an internet connection is required.
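The fetch-and-parse step for UniProt identifiers could look roughly like the following sketch. The URL template reflects the layout of the current UniProt REST service and is an assumption about what C3Pred calls; the function names are hypothetical, and the iGEM Registry request is not shown.

```python
from urllib.request import urlopen

# Assumed URL layout of the UniProt REST service (not taken from C3Pred).
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text: str) -> dict:
    """Parse FASTA text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines of one record are concatenated.
            records[header] += line
    return records


def fetch_uniprot_sequence(accession: str, timeout: float = 10.0) -> str:
    """Download one entry as FASTA and return its sequence (needs internet)."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    records = parse_fasta(text)
    if not records:
        raise ValueError(f"No sequence found for {accession!r}")
    return next(iter(records.values()))
```

Keeping the parser separate from the download makes it possible to reuse `parse_fasta` for sequences that the user pastes directly.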
Input requirements
All proteins submitted to the tool must fulfill certain criteria so that a transport efficiency score can be computed. CPPs usually have a length of 4-30 amino acids, so only protein sequences with a length in this range are accepted as input. Moreover, only the 20 standard amino acids are allowed. Our tool provides an additional option that also allows non-standard encodings (e.g. B for asparagine/aspartate or U for selenocysteine). Since those amino acids are not covered by the training data, results produced for such peptides must be interpreted with caution.
For iGEM Part IDs, further requirements need to be fulfilled before a prediction can be made. C3Pred only accepts parts defined as coding sequences. If the sequence is coding, our tool automatically translates it into a protein sequence, beginning the translation at the start codon. If no start codon is present, the sequence is translated from the first frame onwards.
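The requirements above can be sketched as a small validation function. The function name, the exact set of accepted non-standard codes, and the error messages are assumptions made for illustration; only the 4-30 length range and the 20 standard amino acids come from the description.

```python
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")
# Assumed set of extra one-letter codes (IUPAC ambiguity codes plus U and O).
NON_STANDARD_CODES = frozenset("BZJXUO")


def validate_peptide(sequence, min_length=4, max_length=30,
                     allow_non_standard=False):
    """Return the normalized sequence or raise ValueError if it is rejected."""
    seq = sequence.strip().upper()
    if not min_length <= len(seq) <= max_length:
        raise ValueError(
            f"length {len(seq)} is outside the range {min_length}-{max_length}"
        )
    allowed = STANDARD_AMINO_ACIDS
    if allow_non_standard:
        allowed = allowed | NON_STANDARD_CODES
    invalid = sorted(set(seq) - allowed)
    if invalid:
        raise ValueError(f"unsupported residue codes: {', '.join(invalid)}")
    return seq
```

For example, `validate_peptide("RRRRB")` is rejected by default, while `validate_peptide("RRRRB", allow_non_standard=True)` passes.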
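The translation rule described above can be illustrated as follows. This is a sketch under several assumptions: the start codon is taken to be the first ATG found anywhere in the sequence, translation stops at the first stop codon, and codons that are not in the standard table are written as X. None of these details are confirmed for C3Pred itself.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_coding_sequence(dna: str) -> str:
    """Translate DNA from the first ATG, or from position 0 if there is none."""
    dna = "".join(dna.split()).upper()
    start = dna.find("ATG")
    if start == -1:
        start = 0  # no start codon: fall back to the first frame
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # assumed behavior: stop at the first stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)
```

With these assumptions, `translate_coding_sequence("GGATGCGTCGTTAA")` skips the leading `GG`, reads `ATG CGT CGT`, and stops at `TAA`.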
Output
For each peptide, a transport efficiency score is computed. For peptides without a carrier property, the score is close to zero. For extremely active peptides, the score is close to ten. The values computed for different peptides can be directly compared. Since the output is a dimensionless numerical value, a straightforward classification into no/low/medium/high activity is provided for the user to facilitate the interpretation process. Moreover, a range of activity values of frequently used CPPs in the literature is also provided in the GUI for direct comparison.
Core Algorithm
XGBoost
The algorithm C3Pred uses to compute transport activity scores for peptides is the gradient boosted trees implementation of the XGBoost library. Gradient boosted trees are a commonly used algorithm for supervised machine learning problems. Their key advantages are fast training and high accuracy on problems that cannot be solved with deep learning techniques because too little data is available. The key idea of the gradient boosted trees algorithm is to map a set of input features onto a single numerical value. In our case, the input is the encoded protein sequence and the predicted numerical value is the activity score. For this purpose, numerous decision trees are generated and iteratively added to the model, such that the model improves with each addition.
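The idea of iteratively adding trees can be illustrated with a deliberately small, pure-Python sketch of gradient boosting for squared error, in which every "tree" is a single split (a decision stump) fitted to the current residuals. This is not the XGBoost implementation, which adds second-order gradient information, regularization, and deeper trees; all names below are hypothetical.

```python
def fit_stump(X, residuals):
    """Find the single split (feature, threshold) that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    if best is None:  # no valid split: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_gradient_boosting(X, y, n_estimators=50, learning_rate=0.4):
    """Start from the mean and repeatedly add a stump fitted to the residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [target - p for target, p in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, row)
                       for p, row in zip(predictions, X)]
    return base, stumps, learning_rate


def predict(model, row):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)
```

Because each stump is fitted to what the current model still gets wrong, the training error cannot increase when a further stump is added with a learning rate between 0 and 1.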
Data
To train our model, we used the publicly available dataset by Ramaker et al., consisting of transport efficiency values for 474 short peptide motifs. In their experiments, short peptides were coupled to a fluorophore and the transport across the membrane was then measured using fluorescence as a read-out. The transport efficiency data for each peptide were log2-transformed to facilitate the fitting process.
Encoding
To encode the peptides for the machine learning step, we chose an extended version of the BLOMAP encoding. BLOMAP is a machine-learning-oriented representation of amino acids that is based on a transformed BLOSUM substitution matrix. We extended the encoding with numerical values for physicochemical properties of each amino acid, including flexibility, weight, isoelectric point, hydrophobicity, polarity, and area. Each peptide is represented as the set of all possible sliding windows with a size of 14 residues. Shorter peptides are padded at the N- and C-terminus. An advantage of such an encoding is that, after training, the final prediction tool can be analyzed for the most influential features. This allows a better understanding of the transport mechanism and the general characteristics of CPPs. A drawback of this encoding is that modified amino acids that are not among the 20 standard ones cannot be represented properly.
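The windowing step can be sketched as follows. The padding character and the way the padding is distributed between the two termini are assumptions, and the numerical BLOMAP and physicochemical values that would be assigned to each residue of a window are not shown.

```python
def sliding_windows(sequence, window_size=14, pad_char="-"):
    """Return all windows of `window_size` residues for one peptide.

    Peptides shorter than the window are padded at both termini; the
    symmetric split of the padding is an assumption of this sketch.
    """
    seq = sequence.upper()
    missing = window_size - len(seq)
    if missing > 0:
        left = missing // 2
        right = missing - left
        seq = pad_char * left + seq + pad_char * right
    return [seq[i:i + window_size] for i in range(len(seq) - window_size + 1)]
```

A 16-residue peptide therefore yields three windows, while a peptide of 14 residues or fewer yields exactly one.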
Training
For each sliding window of a peptide, an activity score is computed. The total activity of the peptide is the mean of the activity values of all its sliding windows. Our CPP transport activity prediction model was fitted to maximize Pearson's correlation coefficient between the predicted values and the experimentally derived ones. To find good hyperparameters for the gradient boosted trees algorithm, a random search was performed. The performance of our predictor was evaluated using 10-fold cross-validation.
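Two of the steps described here, averaging the window scores and splitting the data into ten folds, can be sketched as shown below. The function names are hypothetical, `score_window` stands for any trained per-window predictor, and splitting by peptide rather than by window is an assumption of this sketch.

```python
import random
from statistics import mean


def peptide_score(windows, score_window):
    """Average the per-window activity scores of one peptide."""
    return mean(score_window(window) for window in windows)


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes into the same fold.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Splitting at the peptide level keeps all windows of one peptide in the same fold, so a window in the test fold never has an overlapping sibling in the training folds.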
Hyperparameters List
- window_size = 14
- n_estimators = 600
- learning_rate = 0.4
- max_depth = 5
- subsample = 0.6
- gamma = 3
- min_child_weight = 0.75
- colsample_bytree = 0.8
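For reference, the listed values could be passed to XGBoost as in the sketch below. Using the scikit-learn style `XGBRegressor` wrapper is an assumption about how C3Pred builds its model; `window_size` belongs to the encoding step and is therefore kept separate.

```python
# window_size is a parameter of the sliding-window encoding, not of XGBoost.
WINDOW_SIZE = 14

# Remaining hyperparameters from the list above.
XGB_PARAMS = {
    "n_estimators": 600,
    "learning_rate": 0.4,
    "max_depth": 5,
    "subsample": 0.6,
    "gamma": 3,
    "min_child_weight": 0.75,
    "colsample_bytree": 0.8,
}


def build_regressor():
    """Create an untrained regressor (requires the xgboost package)."""
    # Imported lazily so that the parameter dictionary is usable on its own.
    from xgboost import XGBRegressor
    return XGBRegressor(**XGB_PARAMS)
```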
Performance
The activity values predicted in the 10-fold cross-validation are plotted against the experimentally derived ones. Each point represents a peptide. If the predictions were perfectly accurate, all points would lie on the diagonal.
Pearson correlation: 0.804
Euclidean distance: 27.795
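Both metrics can be computed from the vectors of predicted and measured values as in this sketch. That the Euclidean distance reported above is taken between these two vectors is an assumption.

```python
from math import sqrt


def pearson_correlation(predicted, measured):
    """Pearson's r between predicted and measured activity values."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    covariance = sum((p - mean_p) * (m - mean_m)
                     for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return covariance / sqrt(var_p * var_m)


def euclidean_distance(predicted, measured):
    """Length of the difference vector between predictions and measurements."""
    return sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)))
```

The correlation measures how well the ranking and the linear trend are reproduced, while the Euclidean distance also penalizes a constant offset between predictions and measurements.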
The importance of each feature of the BLOMAP encoding is shown in this visualization. The higher the value, the greater the influence of the feature on the prediction. The isoelectric point is the most important feature, which suggests that charged amino acids contribute most to high activity values.
Results
Validation against experimental data
We used the data obtained from the experimental work of previous iGEM teams to validate the results of our prediction tool. In the past years, numerous teams have worked with CPPs. Some of them conducted experiments to compare the transport efficiency of the peptides, which again indicates the potential usefulness of our tool to other teams.
The ATOMS Turkiye team of 2013 compared the efficiency of the two CPPs TAT and MPG in transporting an apoptosis-inducing protein into different cell lines. Their finding that the TAT fusion protein was transported into the cell better than the MPG fusion protein is in accordance with our predicted values.
Predicted activity values (higher is better):
TAT: 5.522
MPG: 5.099
The iGEM Team DLUT-China of 2018 conducted a similar comparative experiment for multiple CPPs. They compared R8, TAT, Pep-1, and the cyclic heptapeptide DNP.
Predicted activity values (higher is better):
R8: 5.714
TAT: 5.522
Pep-1: 5.496
cyclic heptapeptide DNP: 2.943
Their results for R8, TAT, and Pep-1 are in accordance with our predictions. R8 was the most efficiently transported peptide in their experiments and also scored best in our prediction. However, our tool predicts a low score for the cyclic heptapeptide DNP, which showed high transport efficiency in their experiment. This can be explained by the fact that our tool was trained only on linear protein sequences containing the 20 standard amino acids. Special peptides like this one probably cannot be predicted correctly by our tool.
How our software influenced our project
We used the results of C3Pred to decide on a suitable CPP for our project. We analyzed a series of CPPs that were already present in the iGEM Registry, as well as additional peptides that are frequently referred to in the literature.
Penetratin: 7.259
Tp10: 6.654
TAT: 5.522
Since Penetratin showed the highest score among the parts available in the iGEM Registry, we selected this CPP as the carrier for the Exendin-4 protein in our project.
Moreover, we computed an elevated transport activity for the CPP fusion protein TAT-LK15, which has been reported to enhance the properties of TAT.
References
- https://www.nature.com/articles/s41467-018-04874-6
- https://www.nature.com/articles/s41598-018-30790-2
- https://www.doi.org/10.1016/j.tips.2017.01.003
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6324683/
- http://dx.doi.org/10.1145/2939672.2939785
- https://doi.org/10.1142/9781860947322_0014
- https://doi.org/10.1080/10717544.2018.1458921