Team:Tuebingen/Software

GLP.exe - Software

C3Pred - Cell-Penetrating Peptide Predictor

Our Software

Cell-penetrating peptides (CPPs) are short peptides, 4-30 amino acids long, that can transport different cargos across the cell membrane. These cargos include proteins, nanobodies, DNA molecules, and small-molecule drugs. In recent years, numerous promising clinical and pre-clinical trials have been launched with CPPs as carriers for pharmacologically active small molecules.

CPPs can be classified into three general groups: cationic, amphipathic, and hydrophobic. Each group has different physicochemical properties and therefore a different internalization mechanism. The pathways that CPPs exploit to enter the cell fall into two larger groups: endocytic pathways and direct cell penetration [2, 3]. Because several mechanisms are involved and not yet fully understood, simple models describing the relationship between sequence and function are particularly challenging to generate. Hence, the discovery of novel CPPs and the optimization of their activity rely mostly on large screening approaches.

Figure 1: Proteins can be fused to a cell-penetrating peptide (CPP) (left). CPPs allow proteins to penetrate cells via different mechanisms and can be used to transport proteins into cells or through cells. Proteins without a CPP can normally not penetrate cells (right).

Motivation

As part of our project, we planned to use cell-penetrating peptides (CPPs) to transport a cargo, the Exendin-4 protein, across a eukaryotic cell layer. A challenging part of our project design was choosing a specific CPP from the tremendous variety of transporting peptides. In October 2019, CPPsite 2.0, the largest database for CPPs, contained more than 1850 different peptide motifs capable of delivering conjugates into cells. However, the CPPsite database only contains qualitative information about whether a peptide has cell-penetrating activity, and no quantitative information about how efficiently cargo is transported across the cell membrane.

Therefore, we decided to build a predictive software tool that assigns a transport efficiency score to CPPs, allowing educated decisions in the design process.

Overview of our tool

C3Pred is an easy-to-use software tool that allows scientists to make design choices for their CPP-utilizing system based on quantitative transport activity scores. We put a strong emphasis on usability for different kinds of users. First, we implemented an intuitive browser-based graphical user interface. Second, we released our software tool as a Python package that can be installed with pip, allowing other developers to incorporate it into their scripts.

Short peptide sequences can be submitted to the program as a sequence string in FASTA format, a UniProt ID, or an iGEM Registry ID of a coding DNA sequence. Our tool computes a transport efficiency score for each sequence, which is automatically interpreted for the user by comparing it to well-characterized CPPs.

Why use our tool?

Our tool offers several distinct advantages for researchers interested in CPPs. First, it predicts real numerical values, which allows a much more fine-grained evaluation and comparison of CPP efficacy. Second, it is easy to use. Installing a local version is simple, which also makes integration into more advanced workflows straightforward. Moreover, our tool is accessible as a web app here. The web interface offers three input options: peptides can be entered as raw text sequences, via a UniProt accession number, or via an iGEM Registry ID.

Third, since our algorithm returns the CPP efficiency instantly, several CPPs with different modifications or lengths can be compared quickly, which supports both the exploration and the design of novel CPPs.

Installation and usage of our tool

Our tool can be used on Linux, Windows, and macOS. Please follow the installation instructions here.

When starting our tool, the landing page allows navigating to the three running modes (from protein sequence, Uniprot Accession number & iGEM Registry ID). Additional information about the running modes can be displayed by clicking on “more info”.

Prediction using a sequence string

To predict the transport efficiency of a FASTA-formatted amino acid string, please make sure that the sequence contains only the one-letter codes of the 20 standard amino acids. Moreover, the submitted sequence must not be longer than 40 residues.

Using a Uniprot Accession number

Since UniProt is one of the largest publicly accessible resources for proteins, our tool accepts the UniProt accession number as input, which consists of 6 or 10 alphanumeric characters (e.g. Q86FU0). Information about accession numbers can be found on the UniProt website.

Prediction using an iGEM Registry ID

Predictions for sequences in the iGEM Registry can only be made for parts tagged in the subcategory “coding sequence”; any other identifier is rejected. The DNA sequences are automatically translated into protein sequences.
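The translation step can be illustrated with a minimal sketch. It uses the standard genetic code and is not taken from the C3Pred source; the function name `translate` and the decision to stop at the first stop codon are assumptions made for this example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence into a protein sequence.

    Assumptions of this sketch: translation starts at the first base,
    stops at the first stop codon, and ignores trailing bases that do
    not fill a complete codon.
    """
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        aa = CODON_TABLE.get(codon)
        if aa is None:
            raise ValueError(f"Unknown codon: {codon!r}")
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)
```

The resulting protein string can then be checked against the same input requirements as a directly submitted sequence.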

Figure 2: C3Pred usage demonstration.

Implementation

Input

C3Pred accepts three possible input formats for protein data:

  • FASTA-formatted sequences
  • UniProtKB Accession Number
  • iGEM Part ID

C3Pred automatically fetches and parses the information about the given identifiers using the UniProt website REST API or using the iGEM Registry API, respectively. For this function, an internet connection is required.
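As an illustration of the UniProt lookup, the sketch below downloads an entry as FASTA and extracts its sequence. It is not the C3Pred implementation: the URL pattern is an assumption about the public UniProt REST API and should be checked against the UniProt documentation, and the function names are hypothetical. The iGEM Registry lookup is not shown.

```python
from typing import Tuple
from urllib.request import urlopen

# Assumed URL pattern of the public UniProt REST API (not taken from C3Pred).
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text: str) -> Tuple[str, str]:
    """Return (header, sequence) of the first record in a FASTA string."""
    header, chunks = "", []
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith(">"):
            if chunks:  # a second record starts; keep only the first
                break
            header = line[1:]
        elif line:
            chunks.append(line)
    return header, "".join(chunks).upper()


def fetch_uniprot_sequence(accession: str, timeout: float = 10.0) -> str:
    """Download a UniProt entry as FASTA and return its sequence.

    Requires an internet connection.
    """
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urlopen(url, timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))[1]
```

Keeping the parser separate from the download makes it usable for sequences pasted directly in FASTA format as well.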

Input requirements

All proteins submitted to the tool must fulfill certain criteria so that a transport efficiency score can be computed. CPPs usually have a length between 4 and 40 amino acids, so only protein sequences with a length in this range are accepted as input. Moreover, only sequences containing the 20 standard amino acids should be used. Non-standard amino acid codes such as B for asparagine/aspartate or U for selenocysteine can also be parsed and used for prediction, but this feature is experimental. Since those amino acids are not covered by the training data, results for such peptides must be interpreted with caution.
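A strict version of these checks could look like the following sketch. The function name `validate_peptide` is hypothetical, and the sketch deliberately rejects non-standard codes, so it does not reproduce the experimental handling of such codes described above.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH, MAX_LENGTH = 4, 40  # length range stated in the text


def validate_peptide(sequence: str) -> str:
    """Return the cleaned sequence, or raise ValueError if it is not acceptable."""
    # Remove whitespace and line breaks, normalize to upper case.
    cleaned = "".join(sequence.split()).upper()
    if not MIN_LENGTH <= len(cleaned) <= MAX_LENGTH:
        raise ValueError(
            f"Length {len(cleaned)} is outside {MIN_LENGTH}-{MAX_LENGTH} residues"
        )
    unknown = sorted(set(cleaned) - STANDARD_AMINO_ACIDS)
    if unknown:
        raise ValueError(f"Non-standard residue codes: {', '.join(unknown)}")
    return cleaned
```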

For iGEM part IDs, a further requirement must be fulfilled before a prediction can be made: C3Pred only accepts parts defined as coding sequences. If the sequence is coding, our tool automatically translates it into a protein sequence.

Output

For every peptide, a transport efficiency score is computed. For peptides without a carrier property, the score is close to zero. For extremely active peptides, the score reaches about 250-300. The values computed for different peptides can be compared directly. Since the output is a dimensionless numerical value, a straightforward classification into no/low/medium/high activity is provided to facilitate interpretation. High activity is defined as a score in the top 25% of all CPP motifs (threshold activity value: 80.08). Medium activity is defined as a score in the top 50% (threshold activity value: 40.15). Moreover, the GUI provides a range of activity values of CPPs frequently used in the literature for direct comparison.
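The classification can be sketched with the two thresholds given above. The boundary between low and no activity is not stated here, so this sketch merges both into one class; treating the thresholds as inclusive is also an assumption.

```python
HIGH_THRESHOLD = 80.08    # top 25% of CPP motif activities (from the text)
MEDIUM_THRESHOLD = 40.15  # top 50% of CPP motif activities (from the text)


def classify_activity(score: float) -> str:
    """Map a transport efficiency score to a coarse activity class."""
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= MEDIUM_THRESHOLD:
        return "medium"
    # The low/no boundary is not given, so both are reported together.
    return "low or none"
```

With the scores reported on this page, Penetratin (153.16) falls into the high class, TAT (45.97) into the medium class, and MPG (34.27) below the medium threshold.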

Figure 3: The workflow of C3Pred: Information about the protein can be obtained directly from a sequence string, from the iGEM part ID of a coding sequence, or from a UniProtKB accession number. The protein is then used as input for our machine learning model, and the activity score is computed as output.

Core Algorithm

XGBoost

C3Pred is based on gradient boosted trees, using the implementation of the XGBoost library [5]. Gradient boosted trees are a commonly used algorithm for supervised machine learning problems. Key advantages are their fast training speed and high accuracy on problems where small datasets rule out deep learning techniques. The algorithm maps a set of input features onto a single numerical value. In our case, the input is the encoded protein sequence and the predicted numerical value is the activity score. For this purpose, numerous decision trees are generated and added to the model iteratively, each new tree correcting the errors of the trees before it, so that the model improves with each addition.
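The iterative idea can be shown with a deliberately small sketch of gradient boosting for squared-error regression. It is not XGBoost and not the C3Pred model: it uses a single input feature, one-split trees (stumps), and no regularization, and all names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals.

    Requires at least two distinct values in xs.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, rounds=50, learning_rate=0.3):
    """Fit an additive model of stumps; each stump fits the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left, right = fit_stump(xs, residuals)
        stumps.append((threshold, left, right))
        predictions = [
            p + learning_rate * (left if x <= threshold else right)
            for x, p in zip(xs, predictions)
        ]

    def predict(x):
        return base + learning_rate * sum(
            left if x <= threshold else right
            for threshold, left, right in stumps
        )

    return predict
```

For squared error, the residuals are the negative gradient of the loss, which is why fitting each new tree to them reduces the training error step by step.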

Data

To train our model we used the publicly available dataset by Ramaker et al. 2018, consisting of transport efficiency values for 474 short peptide motifs [6]. In their experiments, short peptides were coupled with a fluorophore and the transport over the membrane was then measured using fluorescence as a read-out. The transport efficiency data for each peptide were log2-transformed to facilitate the fitting process.

Encoding

To encode the peptides for the machine learning step, we chose an extended version of the BLOMAP encoding. BLOMAP is a machine-learning-oriented representation of amino acids [7] based on a transformed BLOSUM substitution matrix. We extended the encoding with numerical values for physicochemical properties of each amino acid, including flexibility, weight, isoelectric point, hydrophobicity, polarity, and area.

Each peptide is represented as the set of all possible sliding windows of 14 residues. Shorter peptides are padded at the N- and C-terminus. An advantage of such an encoding is that, after training, the final predictor can be analyzed for its most influential features. This allows a better understanding of the transport mechanism and the general characteristics of CPPs.
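The windowing can be sketched as follows. The window size of 14 comes from the text; the padding symbol and the way the padding is distributed between the two termini are assumptions of this example, not details of C3Pred.

```python
WINDOW_SIZE = 14  # window length stated in the text
PAD_CHAR = "-"    # hypothetical padding symbol


def sliding_windows(sequence, size=WINDOW_SIZE, pad=PAD_CHAR):
    """Return all windows of `size` residues.

    Peptides shorter than `size` are padded on both ends (assumption:
    as evenly as possible, with any extra symbol at the C-terminus).
    """
    missing = max(0, size - len(sequence))
    left = missing // 2
    padded = pad * left + sequence + pad * (missing - left)
    return [padded[i:i + size] for i in range(len(padded) - size + 1)]
```

A 10-residue peptide yields a single padded window, while a 16-residue peptide yields three overlapping windows.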

Training

For each sliding window of a peptide, an activity score is computed. The total activity of the peptide is the mean of the activity values of all its sliding windows. Our CPP transport activity prediction model was fitted to optimize Pearson's correlation coefficient between the predicted values and the experimentally derived ones. To find good hyperparameters for the gradient boosted trees algorithm, a random search was performed. The performance of our predictor was evaluated using 10-fold cross-validation. In a k-fold cross-validation, the data is split into k subsets of equal size. The predictor is trained and tested k times, each time with k-1 subsets in the training set and the remaining subset as test data. In the end, each of the k subsets has been used as the test set exactly once.
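The splitting scheme of a k-fold cross-validation can be sketched in a few lines. The function name and the fixed random seed are assumptions; how C3Pred shuffles and splits its data is not described here.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds; fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

For the 474 peptides of the training dataset and k = 10, every peptide ends up in the test set of exactly one of the ten iterations.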

Performance

Predicted activity values from the 10-fold cross-validation are plotted against the experimentally derived ones. Each point represents a peptide. If the predictions were perfectly accurate, all points would lie on the diagonal. Our algorithm achieved a Pearson correlation of 0.804, which indicates a strong linear relationship between predicted and measured activity.
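For reference, Pearson's correlation coefficient between predicted and measured values can be computed as in this small sketch; the function name is hypothetical and the code is independent of the C3Pred source.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("Need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("Correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)
```

A value of 1 means that all points lie on a rising straight line, 0 means no linear relationship, and -1 a perfectly falling line.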

Figure 4: Prediction results of C3Pred compared to experimentally derived values. Each point represents a peptide. The correlation between the predicted and the experimental values is 0.804.

The importance of each feature of the extended BLOMAP encoding is shown in this visualization. The higher the value, the greater the influence of the feature on the prediction. The isoelectric point is the most important feature, which suggests that charged amino acids play the largest role in high activity values.

Figure 5: Feature importance plot of the XGBoost model. Each bar represents one of the features of the BLOMAP encoding. The most important feature for the prediction is the isoelectric point, indicating that charge has the most influential role.

Results

Validation against experimental data

We used the data obtained from the experimental work of previous iGEM teams to validate the results of our prediction tool. In the past years, numerous teams have worked with CPPs. Some of them conducted experiments to compare the transport efficiency of the peptides, which again indicates the potential usefulness of our tool to other teams.

ATOMS Turkiye

The ATOMS Turkiye team of 2013 compared the efficiency of the two CPPs TAT and MPG in transporting an apoptosis-inducing protein into different cell lines. Their finding that the TAT fusion protein was transported into the cell better than the MPG fusion protein is in accordance with our predicted values.

Our predicted activity values (higher is better):
TAT: 45.97
MPG: 34.27

Results of the ATOMS Turkiye team

DLUT-China

The iGEM team DLUT-China of 2018 conducted a similar comparative experiment for multiple CPPs. They compared R8, TAT, Pep-1, and the cyclic heptapeptide DNP.
Their results for R8, TAT, and Pep-1 are in accordance with our predictions. R8 was the most efficiently transported peptide in their experiments and also scored best in our prediction. However, our tool predicts a low score for the cyclic heptapeptide DNP, which showed high transport efficiency in their experiment. This can be explained by the fact that our tool was trained only on linear sequences containing the 20 standard amino acids. Special peptides such as this one probably cannot be predicted correctly by our tool.

Our predicted activity values (higher is better):
R8: 52.50
TAT: 45.97
Pep-1: 45.12
cyclic heptapeptide DNP: 7.69

Results iGEM Team DLUT-China

How our software influenced our project

We used the results of C3Pred to decide on a suitable CPP for our project. We analyzed a series of CPPs that were already present in the iGEM Registry, as well as additional peptides frequently referred to in the literature.

Penetratin: 153.16
Tp10: 100.68
TAT: 45.97

Since Penetratin showed the highest score among the parts available in the iGEM Registry, we selected this CPP as the carrier for the Exendin-4 protein of our project.

Moreover, we computed an elevated transport activity for the CPP fusion protein TAT-LK15, which has been reported to enhance the properties of TAT.

TAT-LK15: 228.81

How our software influenced our project (2.0)

We used the results of C3Pred to analyze a series of CPPs to decide on a suitable peptide for our project.

Details on the selection process can be found on the modeling page.
