Team:St Andrews/Model



Rational Design

Rationally Designing Isopeptide Crosslinks


Tertiary Structural Prediction for in silico mutant screening

Isopeptide hunt

Searching for unknown Isopeptide Domains

Rational Design


Rational design guided mutant design for our mutant antibody domains. Our method was based on a previous paper that successfully engineering proteins to contain isopeptide bonds1. This paper used structural alignment of proteins to inform where to create an isopeptide bond, by aligning a protein’s structure with its most structurally similar Isopeptide Domain (IPD) and seeing which residues in the protein matched those forming the isopeptide bond in the IPD. The precise detail of our implementation of this can be found in our design page. Here we will describe the process from a modelling standpoint, outlining and justifying any relevant assumptions.

Core Idea

The two core assumptions to this process is that position of an amino acid’s alpha carbon is a good indicator of side chain position, and that the geometry of the side chains determines the capability of an isopeptide bond to form. Thus, we assume that if a triad of alpha carbons are held in a geometry that forms an isopeptide bond when the correct side chains are attached, substitution of side chains on another triad that has the correct alpha carbon geometry but not the correct side chains, should create a side chain geometry capable of forming an isopeptide bond. In short, we assume that this process is a good predictor of isopeptide bond formation!


In order to assess the alpha carbon geometry of ‘isopeptide triads’ and compare with our antibody domains, we used the DALI2 server. Loosely, DALI works by generating protein structure as an $n\times n$ matrix, for a protein of $n$ residues. The entries of this matrix then describe the distances between alpha carbons within the molecule i.e. the ith, jth entry represents the distance between the alpha carbons of the ith and jth residues. DALI uses these 2D representations of protein structure to align structures, by trying to match sub-matrices between protein matrices3.

Users interpret DALI outputs as a Z-score, a dimensionless score describing how significant the similarity between structures is, with $Z>2.0$ being defined as the threshold for a significant match4. Additionally, DALI defines a “strong match” between structures as a Z-score with $Z >\frac{n}{10} – 4$, where $n$ is the number of residues in the structure being queried 4.

We used the DALI server to query structural similarity of our antibody CH3 domains against all known IPDs (a list of these can be found on the design page). We attempted to single out IPDs with a strong match as defined above. As our CH3 domains were all circa 100 residues in length, we were thus searching for IPDs with a Z-score of approximately 6.0.

Structural Superposition

After identifying a range of IPDs with a Z-score approaching 6.0, the idea was to visualise the structural alignment to identify residues in our query CH3 domain homologous to the residues forming an isopeptide bond in the IPDs. We hoped by taking a variety of IPDs we could identify multiple sites with the correct geometry for isopeptide bond formation and induce relevant mutations at these sites.

To visualise this, we used the CCP4 Molecular Graphics software5 with its Superpose tool to structurally align the crystal structures of our query CH3 domain and the IPD. It is important to note that here we assume that the structural alignment returned by DALI is equivalent to that when we use the Superpose tool, i.e. that the ‘good result’ returned in the DALI search is a similarly ‘good result’ when superposed.

You can find the final execution of this process on our design page , with details regarding the specific crystal structures used.



With our AI and machine learning approach to the modelling, our goal was to be able to determine from the peptide sequence whether the protein would fold. This would allow us to generate mutants of our antibody domain in silico and then screen the mutant library by predicting the structure for each mutant, then attempting to assess the liklihood of an isopeptide bond forming in this predicted structure. The biggest challenge with this approach was the limited number of known isopeptide domains - only 29. As we were focusing on one specific protein feature, which to our knowledge is exhibited in fewer than 30 out of over 100,000 proteins on the PDB database, our biggest challenge was a modelling approach that would notice such a rare feature.


Our first approach to view the protein structure was to use Rosetta, the industry standard for folding proteins7. However, even though the most recent version of Rosetta came out in August 2016 and the first intramolecular isopeptide bond was discovered in 2007 8, because they are so rare, Rosetta had not been trained to look for them. The lack of training for isopeptide bonds meant that when we tried to fold a protein known to contain one, the result was not very successful (Fig. 1.).

Rosetta vs. solved protein structure for 2HH6

Fig. 1.a: Rosetta model (grey) vs. solved protein structure (lilac) for 2HH6 (non-isopeptide domain).

Rosetta vs. solved protein structure for 2OKM

Fig. 1.b: Rosetta model (grey) vs. solved protein structure (green) for 2OKM (isopeptide domain).

As seen in Fig. 1.a and Fig. 1.b, although the Rosetta model is a great fit for the non-isopeptide domain (Fig. 1.a), it is a relatively poor fit for the isopeptide domain (Fig. 1.b).

As part of this process, we developed a series of mini guides for our team on how to use PyRosetta as well as the online Robetta tool to make protein fragments. During the UK iGEM meet-up we discussed this approach with the King’s College London team , who were interested in using Rosetta as part of their modelling. Consequently, we put together a cohesive guide for their team (available here) on how to use Rosetta, PyRosetta and Robetta.


From here we started researching other models that were already trained to find isopeptide bonds or that we could train ourselves. This led us to a GitHub Recurrent Geometric Networks (RGN) repository created by Dr M. AlQuraishi from Harvard in 2018 accompanying by his paper on the topic 9. The idea behind it is to use RGN models to fold FASTA sequences by comparing them to a database of proteins (called ProteinNet) 9. For our project, we used the most recent version possible, ProteinNet12, and the pre-trained model CASP12 (Fig. 2.) (the model number must match the ProteinNet number).

Diagram showing how the CASP model is made

Fig. 2: ProteinNets construction pipeline. For each ProteinNet, all proteins with PDB structures available prior to the start of its corresponding CASP (“All data”, top circle) are clustered using an MSA-based clustering technique (left inset) to yield large clusters where intra-cluster sequence identity is as low as 10%. One exemplar from each cluster is then selected (right inset) to yield the 10% seq. id. validation set. This process is iteratively repeated, by reclustering the data remaining outside of all initial clusters to yield validation sets of higher sequence identity (20% -90%). Once the final validation set is extracted, all remaining data is used to form the training set. Based on this set (“100% thinning”), filtered training sets are created at lower sequence identity thresholds to provide coarser sampling of sequence space. Left inset: Each protein sequence is queried against a large sequence database (filtered to only include sequences publicly available prior to the beginning of the corresponding CASP) using JackHMMer to create an MSA that is subsequently filtered to 90% HHblits is then used to perform an all-against-all sequence alignment ofMSAs. Finally, alignment distances are fed to MMseqs2 to cluster their corresponding sequences. Right inset: The center-most protein of each cluster is chosen to ensure that the desired sequence identity constraints are satisfied, as proteins near cluster boundaries may be closer than the pre-specified radius of each cluster (pink vs. gray measuring tapes), while the distances between cluster centroids must satisfy the sequence identity constraints (blue measuring tape). The centroids are then used to form tight clusters of 95% seq. id. that are intersected with the original clusters to yield candidate exemplars ranked by multiple quality metrics (see main text). The top-ranked candidate is picked as the exemplar protein of each cluster. Figure and caption taken from Mohammed AlQuraishi, ProteinNet: a standardized data set for machine learning of protein structure, 2019.

ProteinNet12 was made by downloading all publicly available PDB structures that were available before 01/05/2016, excluding structures that had less than two residues or were >90% unresolved 9. CASP12 (Critical Assessment of protein Structure Prediction) was also released in 2016 and trained off of ProteinNet12, so we thought that it would be the most likely to predict isopeptide bonding.

RGN set-up

The process to prepare our computers to run the model took a long time, and will be briefly run through. To run this, you need to be using either Linux or OS X and have around 120GB free space as well as a GPU compatible with CUDA 10.0.

The script runs on Python-2.7 and needs two packages to work properly. The first is called setproctitle and can be pip installed easily. Tensorflow-gpu is the second package needed but has a lot of prerequisites and should not be pip installed until they are satisfied. The version of Tensorflow-gpu needed is 1.4 or higher. For tensorflow-gpu to be able to communicate with the gpu successfully, CUDA 10.0 and the corresponding version of cuDNN (v7.6.2) must be correctly installed first. This can be a lengthy process, but fortunately there is an in-depth manual for the installation of both of these programs on the Nvidia website.

The other things needed are the CASP12 (6.5GB) pretrained model to use, ProteinNet12 (102.2GB), and finally, Hmmer. CASP12 and ProteinNet12 can both be downloaded from the RGN GitHub repository.

Hmmer lets you compare single or multiple proteins to a sequence database and find homologies (similarities). Hmmer once again has documentation which is very easy to follow on the EddyLabs website.

Now that everything is set up, follow the instructions from the GitHub page, being careful with the file directories. One thing that will need changed is that the file may point to the wrong directory to find the easel miniapps which are located in: hmmer-3.2.1/easel/miniapps/

The code we would run to fold an individual protein was:

>sudo rgn/data_processing/ <protein>.fasta.txt proteinnet12.fa

>rgn/data_processing/ <protein>.fasta.txt

>rgn/data_processing/ <protein>.fasta.txt.proteinnet .fasta.txt.tfrecord 42

>cp <protein>.fasta.txt.tfrecord RGN12/data/ProteinNet12Thinning90/testing/rgn/model/

>RGN12/runs/CASP12/ProteinNet12Thinning90/configuration -d RGN12 -p -e weighted_testing -g0 -a

Once the final line of code has run, the outputs will be in the highest runs folder, so for our file directory: RGN12/runs/CASP12/ProteinNet12Thinning90/11/outputsTesting

With the .tertiary file containing the atomic coordinates and the .recurrent_states file containing the RGN latent representation of the sequence.

Output converter

In order to be able to view these outputs in PyMol they need to be converted to a PDB format. We did this by writing a python script which converts .tertiary files to .pdb and .xyz file, which can be found here in pdf format.

The script is designed to be used from your terminal and you can follow the instruction guide below to do so.

Step 1: copy the content of the file into a python script and save it as ''

Step 2: copy the script into the same folder you have your .tertiary files in (from now on this will be refered to as your ouputs folder). This will also be where you .pdb and .xyz files will be deposited.

Step 3: open a terminal window and enter the following code line by line, changing only the list of protein for which you want to run the scrip. The example is given with '2okm' and '1d2p'

>cd path/to/outputs/folder

>echo "set completion-ignore-case On" >> ~/.inputrc

>declare -a arr=("2okm" "1d2p")



>for name in "${arr[@]}"; do new_name="$name$f_ext"; ter_name="$name$ter_ext"; python --fasta_sequence $new_name --tertiary_file $ter_name; done

Finally, as this model only produces the backbone of the folded protein, but a refiner such as ModRefiner can be used to put in the sidechains.

A limitation of the model is that it can only fold proteins with one chain. This did not have a negative effect on our project because our end goal was to use the model to permute different residues in a single chain of a protein. To resolve this issue when testing it on proteins with multiple chains, we made a python script called to remove all but the first chain.

Automating the process for a large number of proteins

For our project, we needed to fold a large number of proteins, so we wanted to automate the process as much as possible. To that end we wrote a bash script so that we could simply input a list of the protein names we wanted to fold, press enter, and then leave it to run and fold them all.

In order to speed up the process, we ran the first line of code on the Edinbugh cluster, as this is when the fasta sequence is compared to the ProteinNet database, and so is the most computationally demanding.

For the scripts below, all that would need to be changed is the list of proteins declared at the start.

Below is an example of the first line of the code that we automated:

> declare -a arr=("2okm" "1d2p" "2z1p" "2wza" "2woy" "2x5p" "2xi9" "2xic" "2xid" "2xtl" "2y1v" "3b2m" "3hr6" "3htl" "3kpt" "3rkp" "3pf2" "3phs" "3qdh" "3uxf" "3v10" "4hss" "4hsq" "4uzg" "5f44" "5fie" "6bbt" "6bbw" "6fwv" )



>for name in "${arr[@]}"; do down_link="$first$name$third"; new_name="$name$f_ext"; wget -O fasta_sequences/"$new_name" $down_link; python fasta_sequences/ --fasta_sequence fasta_sequences/"$new_name"; rgn/data_processing/ fasta_sequences/"$new_name" fasta_sequences/proteinnet12.fa; done

We would then download the output files and run the next lines on our local computer with the following bash script to once again automate it:




>declare -a arr=("2okm" "1d2p" "2z1p" "2wza" "2woy" "2x5p" "2xi9" "2xic" "2xid" "2xtl" "2y1v" "3b2m" "3hr6" "3htl" "3kpt" "3rkp" "3pf2" "3phs" "3qdh" "3uxf" "3v10" "4hss" "4hsq" "4uzg" "5f44" "5fie" "6bbt" "6bbw" "6fwv" )

>for name in "${arr[@]}"; do echo $name; fasta_name="$name$f_ext"; python rgn/data_processing/ $fasta_name; prot="$fasta_name$proteinnet"; tf="$fasta_name$tfrecord"; python rgn/data_processing/ $prot $tf 42; cp $tf RGN12/data/ProteinNet12Thinning90/testing/ ; python rgn/model/ RGN12/runs/CASP12/ProteinNet12Thinning90/configuration -d RGN12 -p -e weighted_testing -a -g0; done

After this is done, we would then run the .tertiary to PDB file converter script, making sure the converter is in the same folder as the outputs from the line above:

>echo "set completion-ignore-case On" >> ~/.inputrc

>declare -a arr=("2okm" "1d2p" "2z1p" "2wza" "2woy" "2x5p" "2xi9" "2xic" "2xid" "2xtl" "2y1v" "3b2m" "3hr6" "3htl" "3kpt" "3rkp" "3pf2" "3phs" "3qdh" "3uxf" "3v10" "4hss" "4hsq" "4uzg" "5f44" "5fie" "6bbt" "6bbw" "6fwv" )



>for name in "${arr[@]}"; do new_name="$name$f_ext"; ter_name="$name$ter_ext"; python --fasta_sequence $new_name --tertiary_file $ter_name; done

2OKM folded from ai (yellow) vs. solved structure (pink)

Fig. 3: Comparison of the AI folded 2OKM (pink) vs. solved protein structure (yellow)


We were able to use this part of the model in designing our improved part, mOrange + SpyTag. We used the RGN model to predict the tertiary structure of a candidate part based on the peptide sequence, and found it to behave as desired (see our improve page for more details). The output of this can be seen in figure 4, below.

Figure 4: Predicted tertiary structure of mOrange + SpyTag. mOrange is shown in green, joined via a linker to SpyTag (blue).

We attempted a small-scale mutant screen at the end of the project to retrospectively predict solubility of the mutant proteins, with some showing positive results (figure 5). We will present these results to any students or teams wishing to continue this project, as a starter point to any future work.

Alignment of IgG edit 1 with wt IgG

Figure 5: Predicted tertiary structure of our IgG CH3 domain edit 1 (green) superpositioned onto the wild-type CH3 domain for IgG (dark green). The similar topologies suggest that this mutant could be soluble and thus may be capable of forming an isoeptide bond.

Isopeptide Search


While the overarching aim of our project is to introduce isopeptide bonds into proteins as a stabilization tool, we also had to contend with the fact that there is very little research into isopeptide bonds. So far, intramolecular isopeptide bonds are only known to exist in the structural pilin and adhesin proteins of gram-positive bacteria. If intramolecular isopeptide bonds would be discovered in other protein domains, that would greatly inform any engineering effort of inducing these bonds. In particular, as our rational design process is limited to using Isopeptide Domains with a similar topology to the query protein, discovering new topologies could vastly broaden the scope of our modelling approach. Therefore, we realised that we had to explore the scope of these bonds in their naturally occurring environments, and we started to build an isopeptide finder algorithm. Designing the isopeptide finder turned into an 11-week project that developed in complexity as we learned more about the relevant proteins.

Our strategy was to scan through all available solved protein structures for yet undiscovered intramolecular isopeptide bonds. When intramolecular isopeptide bonds were discovered in 2007 10, it was revealed that multiple proteins, whose 3D structure was believed to be completely solved, actually contained intramolecular isopeptide bonds (such as 2OKM deposited only months before the discovery). After consulting with one of the leading experts in the field of protein crosslink domains (Dr Schwarz-Linek), we decided that our novel attempt of a search algorithm for undiscovered isopeptide bonds could greatly accelerate the advancement of isopeptide bond engineering and would be a great part of our iGEM modelling efforts.

Python Programming

We wrote a programme in Python to analyse over 150,000 PDB files from the RCSB Protein Data Bank 11. In order to design our programme, we required an in-depth understanding of the conditions under which isopeptide bonds form.

Intramolecular isopeptide bonds form between the side chains of Lysine and Asparagine/Aspartate residues in a reaction that also requires a catalytic Aspartate/Glutamate residue in close proximity. This triad of side chains has to be buried in the hydrophobic core of the protein for the auto-catalytic reaction to occur. All of the discovered intramolecular isopeptide bonds occur between Lysine residues and Asparagine or Aspartate residues 12. Theoretically, intramolecular isopeptide bonds should also be able to occur between the Lysine and Glutamate or Glutamine. For this reason, we designed our python programme to search for all instances where Lys (NZ) atoms were within 4 angstroms with one of the following atoms: Asp (CG), Asn (CG), Glu (CD) or Gln (CD). These pairs of atoms provided us with a basis for our search.

We could then refine our search by introducing additional considerations. Through a process of screening the output of our Python search and introducing new criteria to filter out unpromising candidates, we developed the following criteria:

  • Catalytic residue: We filtered for all pairs where an Asp (OD1/OD2) or Glu (OE1/OE2) atom was within a Euclidean distance of 5 Å.

  • Difference in residue number: Crosslinks at either end of the protein sequence are more stabilising than those between adjacent residues 6 , so we assume isopeptide bonds more likely to occur here.

  • Same chain: We filtered out all pairs that were on different protein chains and solely focused on intramolecular protein bonds

  • Surface: Isopeptide bonds can’t form on a protein’s surface due to water contact so we engineered a novel measure for approximating whether the pairs of residues were located on the surface of or deeper within the protein 12. This was done by using spherical coordinates to divide the 10 Å volume around the bond into segments; we then determined the maximum amount of connected segments of empty space using a modified flood fill algorithm. A low value indicated that the residues were buried in the hydrophobic core of the protein.

  • Presence of nearby aromatic group: Dr Schwarz-Linek suggested that aromatic groups might support the formation of isopeptide bonds by shielding the bond from solvent

  • Presence of nearby metal ion: The presence of a metal ion in close proximity to the pair of residues will prevent the formation of an isopeptide bond 12.

  • Year deposited: Isopeptide bonds were discovered in 2007 but many PDB files were deposited before this so their resolved structure so will document these bonds.

One of the crucial insights of this process was the discovery of many proteins that fit our criteria, but did not contain isopeptide bonds, instead containing double salt bridges. In these instances, two basic residues (Lys or Arg) formed salt bridges with two acidic residues (Asp or Glu). We created a list of all those proteins with double salt bridges and it is attached here. This was one instance where the modelling informed our experimental design. We reasoned that an appropriate substitution of one of these residues for an inert amino acid could induce spontaneous formation of an isopeptide bond within the remaining residues. To test this idea, we attempted to edit one of the proteins from this list – the Beta-glucosidase A from the thermophile Thermatoga maritima.

Furthermore, our modelling efforts also lead to a collaboration with the iGEM team of the College of William and Mary. We helped the team to determine whether any of the proteins they were working with included a double salt bridge with the aim of further stabilising those.

The result of our extensive python search was a list of around 10,000 candidates which we further investigated.

Jmol Scripting

For the second part of the Isopeptide Search, we wished to incorporate electron density data into our search for unknown isopeptide bonds. On the RCSB Protein Data Bank, electron density data is provided for many proteins in the form of 2Fo-Fc and Fo-Fc maps. While the 2Fo-Fc map displays the electron density data that fits the 3D model that was created by the crystallographer, the Fo-Fc map documents all the instances at which the crystallographer’s model does not match the experimentally collected data 13.

We examined the PDB files of all the proteins with known isopeptide bonds. We noted that in all cases where the crystallographer was unaware of the presence of an isopeptide bond, the model did not account for the electron density data between the binding residues. Therefore, the Fo-Fc map displayed a lot of positive electron density data between these. The images below show an example of such a case for the protein 2OKM.

2OKM FoFc map around the isopeptide bond between residues 42 and 155

Fig. 6: 2OKM FoFc map around the isopeptide bond between residues 42 and 155

For that reason, we decided to determine if there was any unaccounted electron density data between any of our 10,000 pairs of residues. To automate this process, we initially attempted to develop our own parser to process the electron density data. However, although we did make some promising progress, we eventually had to consider a different approach due to the limited timescales of the project. We were fortunate enough to arrange a meeting with Prof. Bob Hanson, the principal developer 14 of Jmol (an open-source viewer for chemical structures in 3D). He introduced us to a method to quantify the amount of electron density data in a specific, isolated region.

We then developed a Jmol script. The input to our script were the 10,000 most promising candidates as determined by the first part of our search in the form of a protein name, the two residues of interest and the Euclidean distance between the relevant atoms of these residues. For each candidate, we loaded its PDB file and Fo-Fc map from the RCSB Protein Data Bank before isolating the region of positive electron density of interest. This region was defined as being within the Euclidean distance between the residues of both residues (forming an intersection). We then used the command “isosurface AREA SET” to quantify how much discrepancy existed in this region.

When analysing the results, we were interested in which candidates yielded the relatively highest values of discrepancy in electron density data. As expected, among our highest results were several known isopeptide bonds such as those in the proteins 2OKM and 2WZA. These acted as controls and confirmed the working of our algorithm.

The highest results of our 10,000 candidates was a pair of residues in the protein 2X9X. This instance fitted all the criteria of an intramolecular isopeptide bond. Our algorithm was a success. However, a literature search indicated that although this bond is mentioned on neither the RCSB webpage nor in its PDB file, it was in fact already published in a paper. Furthermore, we deemed none of the other candidates to have high enough results. Nevertheless, it would take closer investigation to confirm whether or not any of these are indeed isopeptide bonds. We demonstrated a proof of concept; our algorithm could be extended and used for the search of other types of intramolecular crosslink bonds, such as thioester and ester bonds.


  1. Kwon, H., Young, P.G., Squire, C.J., and Baker, E.N (2017) – ‘Engineering a Lys-Asn isopeptide bond into an immunoglobulin-like protein domain enhances its stability’ Scientific Reports 7, e42753
  2. Holm L. (2019) – ‘Benchmarking fold detection by DaliLite v.5’ Bioinformatics btz536 (
  3. Holm, L. and Sander, C. (1996) – ‘Mapping the Protein Universe’ Science 273(5275), pp. 595-602
  4. Holm, L., Kaariainen, S., Rosenstrom, P., and Schenkel, A. (2008) – ‘Searching protein structure databases with DaliLite v.3’ Bioinformatics 24(23), pp. 2780–2781
  5. McNicholas, S., Potterton, E., Wilson, K.S., and Noble, M.E.M. (2011) – ‘Presenting your structures: the CCP4mg molecular-graphics software’ Acta Cryst. D67, pp. 386-394 (
  6. Kristian W. Kaufmann,†Gordon H. Lemmon,†Samuel L. DeLuca,†Jonathan H. Sheehan,†and Jens Meiler*, Practically Useful: What the Rosetta Protein Modeling Suite Can Do For You, Biochemistry2010,49,2987–2998 2987DOI: 10.1021/bi902153g
  7. Kang HJ, Coulibaly F, Clow F, Proft T, Baker EN, Stabilizing isopeptide bonds revealed in gram-positive bacterial pilus structure, Science, 2007;318:1625–1628. doi: 10.1126/science.1145806.
  8. Mohammed AlQuraishi, ProteinNet: a standardized data set for machine learning of protein structure, 2019
  10. Kang, H. and Baker, E. (2019). Intramolecular isopeptide bonds: protein crosslinks built for stress?.
  11. Kabsch, W. (2010) XDS. Acta Crystallogr. D Biol. Crystallogr. 66, 125–132

School of Biology

School of Chemistry

School of Mathematics

School of CS

School of Physics

School of Philosophy

Sir Kenneth Murray Endowment Fund

iGEM St Andrews 2019