Team:Freiburg/Software

Software

Introduction

The standard method for identifying D-peptide binders is the laborious mirror-image phage display (MIPD). Although MIPD is proven and well established, it has its restrictions: it takes time to establish and execute, and both the choice of targets and the size of the phage library are limited by the current state of biotechnology. While modelling the results of our MIPD for further analysis, we wondered whether we could ease the process of finding possible ligands.

The virtual screening of peptide candidates against a chosen target is a well-known in silico equivalent of phage display. Yet we could not find any tool aimed specifically at D-peptides. This is aggravated by the fact that the Protein Data Bank (PDB), one of the most common sources of protein structure models, consists mostly of L-protein structures; the mirror-image D-form is usually not available at all.

With our software, we offer a solution for these limitations of in silico mirror-image screening and thereby accelerate the identification of D-peptide ligands. We have developed the tools required for the virtual screening for D-peptide ligands. Our L-to-D Converter converts any structure from the PDB or other sources to its mirror-image form. We also developed PDBLibrary, a toolset for creating an up-to-date virtual peptide library to screen against. Finally, we provide finDr VS, an efficient program that performs a virtual screening (VS) on computing clusters, with the option of converting the target protein to its D-form beforehand.

Our software even goes beyond the current limitations of the wet lab. finDr’s second modality, finDr GA, imitates natural selection by applying evolutionary principles, an optimization technique known as a genetic algorithm (GA) [1]. Defining a peptide’s binding affinity towards a certain target as its fitness, finDr is able to discover high-affinity binders through several generations of selection, recombination and mutation. To see the power of evolution in action, check out the Evolutionary Word Finder further down this page.

Overview

Fig. 1: Workflow of our software tools for the identification of D-peptides

L-to-D Converter

The biggest obstacle to modelling D-proteins is arguably the scarce availability of mirror-image D-protein structures in the Protein Data Bank and other sources, which greatly limits the choice of target. Our L-to-D Converter removes this hurdle.

Fig. 2: Schematic representation of the L-to-D Converter function. The NMR structure of L-PSMα3 (PDB id 5kgy), obtained from the Protein Data Bank, is shown in cyan. Inverting the X-coordinate of each atom generates a model of D-PSMα3 (red).

The L-to-D Converter is a small but powerful Python tool that converts any L-protein into its mirror-image D-form for further in silico modelling or for use as a target for finDr. Protein structure models are stored as PDB files, which list the coordinates of every atom in the protein along with other structural information such as its secondary structure. The L-to-D Converter works by inverting every X-coordinate in a given PDB file [2]. We demonstrated that our program works faithfully on monellin, one of the rare proteins for which both the L- and the D-form have been crystallized to obtain a structural model. Figure 3 shows the alignment of the crystal structure of D-monellin (PDB id 2q33) [3], obtained from the Protein Data Bank (red), with a model of D-monellin (blue) that we created with the L-to-D Converter from the L-structure (PDB id 1mol) [4].
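The coordinate inversion described above can be sketched in a few lines of Python. This is a minimal illustration and not the code of the L-to-D Converter itself: the function names are hypothetical, and the sketch assumes the fixed-column PDB layout in which the X coordinate of an ATOM or HETATM record occupies columns 31-38 with three decimals.

```python
def mirror_pdb_line(line: str) -> str:
    """Negate the X coordinate of an ATOM/HETATM record; return other lines unchanged."""
    if not line.startswith(("ATOM", "HETATM")):
        return line
    x = float(line[30:38])  # columns 31-38 (1-based) hold X in %8.3f format
    return f"{line[:30]}{-x:8.3f}{line[38:]}"


def mirror_pdb(text: str) -> str:
    """Mirror every atom record of a PDB file given as one string."""
    return "\n".join(mirror_pdb_line(line) for line in text.splitlines())


# Hypothetical example record: a 30-character prefix followed by X, Y and Z.
sample = "ATOM      1  N   MET A   1".ljust(30) + f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"
mirrored = mirror_pdb_line(sample)
print(mirrored[30:54])
```

Reflecting along a single axis is enough to turn every chiral centre into its mirror image, which is why only one coordinate has to change. Applying the function twice restores the original record.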


Fig. 3: Structures of L-monellin (blue, PDB id 1mol) and D-monellin (red, PDB id 2q33). The structure of L-monellin was converted to a model of D-monellin and aligned with the crystal structure of D-monellin using PyMOL.

Head to https://github.com/kcaliban/LtoD to get the latest version and for instructions on how to convert your digital proteins.

PDBLibrary


Fig. 4: Schematic depiction of the process for library generation.

To perform virtual screening for peptide binders, a large number of structures is required: a library. The largest openly available source of protein structure files is the Protein Data Bank; however, it consists mainly of proteins, not peptides. We provide a toolset that extracts helical peptides of a chosen length from this database, enabling the fast and flexible preparation of peptide libraries from the current version of the Protein Data Bank. We demonstrated this by generating a library of four million helical 8-12mer peptides for virtual screening.

PDBLibrary consists of three tools: Helix Extractor, Helix Slicer and Duplicate Remover. Helix Extractor extracts all alpha-helical structures from a set of PDB files. Helix Slicer cuts slices of one or more specified lengths out of the given PDB files. Duplicate Remover removes redundant sequences and any files containing alternate locations for CA atoms.
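The slicing and de-duplication steps can be illustrated on plain sequences. The sketch below is a simplification under stated assumptions: the real tools operate on PDB structure files, whereas this example works on one-letter sequence strings, and it assumes that slicing means taking every contiguous window of the requested lengths. The function names are hypothetical.

```python
from typing import Iterable, List


def slice_helix(sequence: str, lengths: Iterable[int]) -> List[str]:
    """Return every contiguous window of each requested length."""
    slices = []
    for n in lengths:
        # A helix shorter than n yields no window of that length.
        for start in range(len(sequence) - n + 1):
            slices.append(sequence[start:start + n])
    return slices


def remove_duplicates(peptides: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each sequence, preserving order."""
    return list(dict.fromkeys(peptides))


# Hypothetical 12-residue helix cut into 10-mers.
windows = slice_helix("ABCDEFGHIJKL", [10])
print(windows)
```

Under this windowing assumption, a helix of length L contributes L - n + 1 slices of length n, which is how a handful of structures can yield a comparatively large library.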

Applying these tools sequentially, a large library of alpha-helical peptides can be generated. In the example (Fig. 4) we generated a library of 117 helices of length 10 from just three protein structures. For more information, visit https://github.com/kcaliban/LtoD.

finDr - a tool for computational D-peptide identification and optimization

finDr VS - Virtual Screening

finDr VS combines the power of cluster computing with the immense size of the libraries that can be created with PDBLibrary to find optimal binders in silico. It performs a virtual screening that is spread across all available computing nodes for fast and efficient identification of high-affinity peptide binders. A virtual screening runs a molecular docking simulation against the chosen target for each possible binder in the given library. Since these simulations are not a trivial task, finDr VS relies on the well-established and benchmarked program AutoDock Vina [5] to calculate realistic binding affinities.

finDr VS is provided under https://github.com/kcaliban/finDrVS.

finDr GA

Darwinian evolution in silico

The application of evolutionary principles in computers was first suggested in the 1950s by Alan Turing, one of the fathers of computer science [7]. Since then, it has been used to solve many different optimization problems [6]. finDr GA is an attempt to apply Darwinian principles to the rapid design of D-peptides in silico, adapting and accelerating a natural process.

To understand finDr GA, it is useful to reiterate the basic principles of evolution: species evolve through repeated cycles of mutation, selection and recombination. A selection pressure selects the individuals with the highest fitness from a diverse population. Fitness in an evolutionary sense is determined by the probability that an individual survives under a certain selection pressure. The genetic information remaining in the population after selection is then recombined into new genotypes by sexual reproduction, with a certain chance that new properties appear through random mutation. The expression of this new genetic information as diverse phenotypes again determines the fitness of the individuals in the new population, and the cycle of evolution is repeated for several generations.

How does this process translate into computers? To illustrate, we implemented a small browser application that applies a genetic algorithm to a simple problem: we are given a goal string of letters, numbers and spaces, for example “iGEM 2019”, and define the geno- and phenotype of an individual as a string of the same kind. The fitness of an individual is then the length of the goal string minus the Hamming distance (the number of positions where they differ) between the individual's string and the goal string.

For example, the string “iGEM 2018” has fitness 8, since 8 characters match the goal string and only one differs. Starting from a random population of strings, for example the words of a poem or song lyrics, a genetic algorithm driven by this fitness will be able to get close to the optimum. Try it yourself!

Evolutionary Word Finder


The interactive demo takes four inputs (goal string, population size, number of generations and probability of mutation) and reports the current generation and the best individual.
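The fitness measure used by the word finder can be written down directly. The following Python sketch is an illustration only, not the code of the browser application; the function names and the simple "keep the top k" selection are assumptions.

```python
from typing import List


def fitness(individual: str, goal: str) -> int:
    """Length of the goal string minus the Hamming distance to it."""
    if len(individual) != len(goal):
        raise ValueError("individual and goal must have the same length")
    hamming = sum(a != b for a, b in zip(individual, goal))
    return len(goal) - hamming


def select_fittest(population: List[str], goal: str, keep: int) -> List[str]:
    """Return the `keep` individuals with the highest fitness (assumed selection rule)."""
    return sorted(population, key=lambda s: fitness(s, goal), reverse=True)[:keep]


print(fitness("iGEM 2018", "iGEM 2019"))  # 8 matching positions out of 9
```

A perfect match has a fitness equal to the length of the goal string, so the maximum for “iGEM 2019” is 9.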

In finDr GA, the genetic information of an individual is a string of amino acids, which finDr GA translates into a 3D peptide structure. An individual's fitness is its binding affinity towards a given target, as calculated by AutoDock Vina, relative to the average binding affinity of the population. Two individual peptides A and B are recombined by splitting both amino acid sequences at a randomly picked position and joining the split parts of A and B. To increase the variance within the population and to avoid getting stuck in a local optimum, an additional mutation step mimics the natural occurrence of point mutations. Depending on the application and population size, the user can set the probability of point mutation for a custom-tailored evolutionary approach.
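Recombination and point mutation can be illustrated on plain amino acid strings. This is a minimal sketch and not the finDr GA source: the function names, the 20-letter alphabet constant and the choice to join the head of A with the tail of B are assumptions made for the example, and both sequences are assumed to have at least two residues.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def recombine(a: str, b: str, rng: random.Random) -> str:
    """Single-point crossover: head of A up to a random cut, then the tail of B."""
    cut = rng.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]


def mutate(sequence: str, probability: float, rng: random.Random) -> str:
    """Replace each residue by a random amino acid with the given probability."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < probability else residue
        for residue in sequence
    )


rng = random.Random(0)  # fixed seed so the example is reproducible
child = mutate(recombine("AAAAAAAA", "WWWWWWWW", rng), 0.1, rng)
print(child)
```

A low mutation probability keeps most of the recombined sequence intact, while a higher one explores more of the sequence space at the cost of losing good parents faster.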

Fig. 5: Workflow of finDr GA. In every generation, the best binders of the population are selected based on their binding affinity to the target. The good binders are then recombined with each other, and point mutations are inserted with a certain frequency to obtain a new population for the next generation.

Like evolution, this process requires a starting point. The user can specify a directory of prepared peptide structures to be used in the initial generation. Alternatively, finDr GA can pick structures at random from a different directory for a blinder approach. This way, in vitro as well as in silico results can be optimized while introducing some variety.

Since finDr alternates between the amino acid sequence level and the secondary structure level, it needs to create peptide structures de novo from their amino acid sequences. For this purpose finDr uses a PyMOL function that builds a peptide structure from a given FASTA sequence. Because this initial structure is highly artificial, finDr GA then performs a molecular dynamics simulation with GROMACS for every new peptide it generates and extracts the most frequently occurring conformation. This keeps the simulated ligand models comparable to realistic structures in each generation.

finDr GA is provided under https://github.com/kcaliban/finDrGA.

Fig. 6: To generate a faithful model of each peptide ligand and of its recombined and mutated versions in every generation, finDr GA starts from a FASTA amino acid sequence in the one-letter code. An artificial peptide model is generated using PyMOL. To retain comparability to realistic peptide structures, a short molecular dynamics simulation is performed, and the most frequently occurring state is extracted and used for the next selection.

Demonstrating the functionality of finDr

We successfully applied finDr to our search for a D-peptide ligand against PSMα3, thereby validating its usefulness and functionality. First, we used finDr VS to screen a library of four million helical 8-12mer peptides against D-PSMα3 and identified two candidates. Since those were not ideal binders in our wet lab binding analyses, we applied finDr GA for optimization. Starting from an initial population consisting of finDr VS results, a small number of random peptides and the ligands we identified in wet lab MIPD, we achieved an improvement of binding affinities over 14 generations of directed evolution in silico.

Fig. 7: Improvement of peptide ligands to PSMα3 with finDr. Depicted is the binding energy of the best binder to D-PSMα3 in each generation of the genetic algorithm.

We were able to validate the results of our in silico prediction of D-peptide ligands for PSMα3, which demonstrates the potential of finDr GA.

To demonstrate the broad applicability of finDr GA and its potential in many fields of science, we applied it to a range of diverse targets.

With a suitably chosen grid box, finDr can be applied to large protein targets such as antibodies, as we demonstrated on a structure model of a patient-derived anti-DNA autoantibody (PDB id 5GKR; Fig. 8a). These antibodies play an important role in the autoimmune disease systemic lupus erythematosus [8], and a therapeutic D-peptide that neutralizes them could be beneficial.

Furthermore, we show that finDr might also be beneficial for the treatment of plant diseases, using as a target Necrosis and Ethylene inducing Protein 2 (NEP2; PDB id 3ST1; Fig. 8b), a causative agent of witches' broom disease in cocoa plants. A D-peptide neutralizing this target could possibly be applied as a green, non-toxic pesticidal agent that protects agricultural plant growth without harming other members of the ecosystem [10].

finDr can also serve to identify ligands against very small peptides and even cyclic depsipeptides such as the mycotoxin enniatin B (PubChem CID 164754; Fig. 8c). This mycotoxin is produced by Fusarium species and can lead to severe food and feed poisoning upon fungal contamination [9]. The peptide ligands we identified and optimized using finDr GA could potentially serve for the development of an antidote to be used in agriculture and livestock feeding.

Fig. 8: Application of finDr GA to different targets. A: patient-derived anti-DNA autoantibody (PDB id 5GKR). B: Necrosis and Ethylene inducing Protein 2 (PDB id 3ST1). C: mycotoxin enniatin B (PubChem CID 164754). For each target, peptide ligands were identified and optimized de novo.

This shows that finDr GA is applicable to a broad range of possible targets, from antibodies to cyclic peptides. finDr therefore offers great potential for various fields of application, from medicine to plant physiology to agriculture. All in all, we created, validated and provided software tools for the identification of L- or D-peptide ligands for a target of choice. To apply our tools to your research, you can obtain them on GitHub: https://github.com/kcaliban/.

Software architecture of finDr

Separation of genetic algorithm principles

To completely separate the application of evolutionary principles from the way an individual's fitness is calculated, for better testing and cleaner code, finDr GA is based on an abstract genetic algorithm framework, CXX-GA, implemented by our team. This framework was tested using the same problem presented earlier: a goal string of letters, numbers and spaces is given, for example “iGEM 2019”, and the algorithm, starting from a randomly generated initial population of strings, tries to reach the goal string. The code for the library and the string example can be found on GitHub: https://github.com/kcaliban/CXX-GA

Fig. 9: Schematic representation of the interaction between the GA framework CXX-GA and finDr GA

The schematic above shows the interaction between CXX-GA and finDr GA. In this instance of CXX-GA, genetic information is represented as a string of one-letter amino acid codes. A user-specified fraction of high-fitness sequences is copied directly. Recombination and mutation are done by simple string manipulation. The newly generated sequences are used by finDr GA to generate new structures, simulate molecular dynamics and calculate binding affinities. The resulting values are then used for another round of recombination and mutation. This process repeats for a user-defined number of generations.
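The separation between the evolutionary loop and the fitness calculation can be sketched as a loop that receives fitness, recombination and mutation as plain callbacks. This Python sketch is not the CXX-GA interface (CXX-GA is a separate library whose API is not reproduced here); all names, the rule of drawing parents from the better half of the population and the default elite fraction are assumptions. It expects a population of at least two individuals.

```python
import random
from typing import Callable, List


def next_generation(population: List[str],
                    fitness: Callable[[str], float],
                    recombine: Callable[[str, str], str],
                    mutate: Callable[[str], str],
                    elite_fraction: float,
                    rng: random.Random) -> List[str]:
    """Copy the elite unchanged, then fill up with mutated children."""
    ranked = sorted(population, key=fitness, reverse=True)
    n_elite = max(1, int(len(ranked) * elite_fraction))
    parents = ranked[:max(2, len(ranked) // 2)]  # assumed parent pool: better half
    children = []
    while n_elite + len(children) < len(population):
        a, b = rng.sample(parents, 2)
        children.append(mutate(recombine(a, b)))
    return ranked[:n_elite] + children


def run_ga(population, fitness, recombine, mutate,
           generations, elite_fraction=0.2, seed=0):
    """Run the loop for a fixed number of generations and return the best individual."""
    rng = random.Random(seed)
    for _ in range(generations):
        population = next_generation(population, fitness, recombine,
                                     mutate, elite_fraction, rng)
    return max(population, key=fitness)
```

Because the loop only ever calls the `fitness` callback, the same code can be exercised with a cheap string comparison in tests and with an expensive docking score in production. In a real docking setting the scores would need to be cached, since this sketch re-evaluates fitness on every sort.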

Distributed Computing

Both finDr GA and finDr VS are designed to run on computing clusters. They rely on the Message Passing Interface (MPI) for communication between computing nodes and Open Multi-Processing (OpenMP) for multithreading on nodes. In finDr VS, a master node distributes all dockings equally among available nodes, on which every available thread runs one docking at a time. This is done by sending a list of filenames, a workload, to each node, followed by each node working through its load by further spreading it onto all available threads using OpenMP.
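The equal distribution of dockings can be illustrated without MPI. The sketch below only shows how a list of filenames might be cut into nearly equal workloads; the function name and the rule of giving leftover files to the first nodes are assumptions, and the actual sending of workloads via MPI is not shown.

```python
from typing import List, Sequence


def split_evenly(filenames: Sequence[str], n_nodes: int) -> List[List[str]]:
    """Cut the list into n_nodes contiguous workloads whose sizes differ by at most one."""
    base, extra = divmod(len(filenames), n_nodes)
    workloads, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)  # first `extra` nodes take one more
        workloads.append(list(filenames[start:start + size]))
        start += size
    return workloads


# Hypothetical library of ten peptide files spread over three nodes.
files = [f"peptide_{i}.pdb" for i in range(10)]
print([len(w) for w in split_evenly(files, 3)])
```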

finDr GA works in a similar way but sends sequences instead of filenames. Since the genetic algorithm part of finDr GA requires all docking results to be available before it can proceed, it is vital that no node takes much longer than the others for its computation. Hence, the master node distributes sequences according to the number of available threads per node: a node with 8 threads gets 8 sequences, a node with 4 threads gets 4.
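The thread-proportional distribution can be sketched as follows. This is an illustration under assumptions rather than the finDr GA implementation: the function name is hypothetical, node names and thread counts are passed in as a dictionary, and sequences are dealt out one "thread slot" at a time.

```python
from typing import Dict, List, Sequence


def split_by_threads(sequences: Sequence[str],
                     threads_per_node: Dict[str, int]) -> Dict[str, List[str]]:
    """Give each node a share of the sequences proportional to its thread count."""
    # One slot per thread: a node with 8 threads appears 8 times.
    slots = [node for node, threads in threads_per_node.items()
             for _ in range(threads)]
    workloads: Dict[str, List[str]] = {node: [] for node in threads_per_node}
    for i, sequence in enumerate(sequences):
        workloads[slots[i % len(slots)]].append(sequence)
    return workloads


# Hypothetical cluster: one 8-thread node and one 4-thread node, 12 sequences.
cluster = {"master": 8, "worker-1": 4}
shares = split_by_threads([f"SEQ{i}" for i in range(12)], cluster)
print({node: len(seqs) for node, seqs in shares.items()})
```

With as many sequences as there are threads in total, every thread processes exactly one sequence, so all nodes should finish at roughly the same time.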

After receiving its workload, every node generates its required structures using PyMOL and distributes molecular dynamics simulations and the consecutive molecular docking simulations onto all available threads.

Computing cluster

BinAC

All our MD and docking simulations were performed on the high-performance computing cluster for bioinformatics and astrophysics (BinAC) in Tübingen, Germany. The programs we used and the cluster itself are operated entirely through Bash, which is laborious for people unfamiliar with it, so we wrote a script that greatly reduces the number of commands that have to be typed. The script also acts as a pipeline covering every step: constructing the peptide from a given sequence, running an MD simulation, inverting the structure from L to D, docking it against a target and performing some analyses. Each step requires only one command (MD simulations and dockings are computed on the cluster without additional commands). The Molecular Simulation Pipeline can be found on GitHub.

BERT - Our self-built computing cluster

Fig. 10: Selfmade computing cluster “BERT” running finDr GA to identify D-peptide ligands against PSMα3.

Even though we were glad to have access to the BinAC computing cluster, we wanted to enable everyone to use our software. We therefore found a way to build and set up our own computing cluster and are glad to share it with you.

What you need:

  • A couple of computers, maybe from your fellow iGEM team members, your colleagues or your friends. The operating system and hardware do not matter, though the computers should not be too old. All data on the computers will be preserved and the hardware will not be touched.
  • Enough LAN cables and a LAN switch to connect the computers together
  • As many USB sticks with space >=32GB as there are computers
  • An external hard drive

1. Prepare the USB drives

  • Download the image we provide to your computer.
  • Write its contents to all USB sticks you have gathered. For more information, Google “write img to USB”.

2. Setup computers


  • Find a room that fits all your computers. Set up each computer and connect all of them to the LAN switch.
  • Insert a USB stick into each computer and boot from USB. For more information, google “boot from USB”.
  • Pick a computer, usually the fastest, to be your “master” node
    • Now might be a good time to think about a name. We called our cluster “Bert”. Do not ask us why.

    Setup and test LAN connections

    • Follow these steps on every computer:
      • Open wired connection settings (top right corner)

      • Open settings of LAN connection
      • In the IPv4 tab set the Method to “Manual”, Netmask to 255.255.255.0 and the IP address to 10.0.0.X where X is a number different from the ones given to all other computers. It is advised to give every non-master node a name such as “worker-1”, “worker-2”, …, and to give IP addresses accordingly.
      • On your master node, open /etc/hosts (“sudo gedit /etc/hosts”, password is “bert”) and add a line for each computer in your cluster with their IP address, as well as a line for the master node itself with 127.0.0.1 as the IP address:
      • On each non-master node, do the same but with the actual IP address of the master
      • You should now be able to ping any computer by using “ping master” or “ping worker-x”. For more information, google “linux ping”.
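The /etc/hosts entries described above follow a regular pattern, so they can be generated rather than typed by hand. The Python sketch below is only an illustration: the function name is hypothetical, and it assumes that the master has the address 10.0.0.1 and that worker-1, worker-2, … were given 10.0.0.2, 10.0.0.3, … Adjust the addresses to whatever you actually assigned.

```python
from typing import List


def hosts_lines(n_workers: int, on_master: bool,
                master_ip: str = "10.0.0.1") -> List[str]:
    """Build the lines to append to /etc/hosts on one node of the cluster."""
    # On the master itself, "master" points to the loopback address;
    # on workers it points to the master's real address.
    lines = [("127.0.0.1" if on_master else master_ip) + " master"]
    for i in range(1, n_workers + 1):
        lines.append(f"10.0.0.{i + 1} worker-{i}")  # assumed numbering scheme
    return lines


print("\n".join(hosts_lines(2, on_master=True)))
```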

    Setup passwordless SSH:

    • Let’s start with our master node:
      • Open a terminal
      • Type “ssh-keygen -t rsa” and press enter a few times, no password is required
      • Type “ssh-copy-id worker-x” and press Enter, once for every worker. This gives the master node passwordless access to all workers.
      • Also type “ssh-copy-id master” and enter, since the master node has to be able to ‘connect to itself’
      • Now try logging into every worker as well as the master by typing “ssh worker-x” or ”ssh master” and enter each time. After a successful login, logout by typing “logout” and enter.
      • After successfully logging in and out of every node, type “eval $(ssh-agent); ssh-add ~/.ssh/id_rsa” followed by Enter. Done!

    Setup a shared filesystem

    • All nodes need to have access to a filesystem to work with. This is where your external hard drive comes in. Plug it into the master node and follow these steps for the master node:
      • Open a terminal
      • “cd /media/bert/DISKNAME” and enter, where DISKNAME is the name of your hard drive. If you only plugged in one external hard drive, there should only be one name. Press “TAB” after typing “cd /media/bert/” to automatically find it.
      • “mkdir shared” and enter; creates the shared directory
      • Type “cd ~; rm -rf shared; ln -s /media/bert/DISKNAME/shared /home/bert/shared” and enter.
      • Open the file /etc/exports (“sudo gedit /etc/exports”) and add the following line at the end: /media/bert/DISKNAME/shared *(rw,sync,no_root_squash,no_subtree_check)
      • Type into a terminal: “sudo exportfs -a; sudo service nfs-kernel-server restart” and enter.
    • For the worker nodes:
      • Open a terminal.
      • Type “sudo mount -t nfs master:/media/bert/DISKNAME/shared ~/shared” and enter.
    • All nodes should now have access to the “shared” directory. You can try it out by creating a file and seeing if it is visible for every computer.

    Happy computing!

    That’s all there is to it. You can now download the latest version of finDr GA or finDr VS (onto a USB drive if your cluster does not have internet access), put it in the “shared” directory and run it on your own, self-built cluster from the master node:

    • finDr GA with 2 nodes:
      • mpirun -np 1 --host master,worker-1,worker-2 ./finDrGA -n 100 -m 50 -p 0.5 -c 0.2 : -np 3 ./PoolWorker
    • finDr GA with 10 nodes:
      • mpirun -np 1 --host master,worker-1,worker-2,worker-3,worker-4,worker-5,worker-6,worker-7,worker-8,worker-9 ./finDrGA -n 100 -m 50 -p 0.5 -c 0.2 : -np 10 ./PoolWorker
    • finDr VS with 2 nodes:
      • mpirun -np 1 --host master,worker-1,worker-2 ./finDrVS : -np 2 ./finDrVS -w
    • finDr VS with 10 nodes:
      • mpirun -np 1 --host master,worker-1,worker-2,worker-3,worker-4,worker-5,worker-6,worker-7,worker-8,worker-9 ./finDrVS : -np 10 ./finDrVS -w

    Attributions

    We gratefully acknowledge the support by the High Performance and Cloud Computing Group at the Zentrum für Datenverarbeitung of the University of Tübingen, the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no INST 37/935-1 FUGG. Having access to the BinAC high performance computing cluster was of essential help for our modelling, especially for performing the simulations with GROMACS and the docking of large libraries.

    References

    [1] Yagi, Y., et al. In silico panning for a non-competitive peptide inhibitor (2007). BMC Bioinformatics 8, 11.

    [2] Garton, M. et al. Method to generate highly stable D-amino acid analogs of bioactive helical peptides using a mirror image of the entire PDB (2018). Proceedings of the National Academy of Sciences 115, 1505-1510

    [3] Hung, L., et al. Structure of an Enantiomeric Protein,D-Monellin at 1.8 Å Resolution (1998). Acta Crystallographica Section D Biological Crystallography 54, 494-500

    [4] Somoza, JR., et al. Two crystal structures of a potently sweet protein. Natural monellin at 2.75 A resolution and single-chain monellin at 1.7 A (1993). J Mol Biol. 234(2):390-404.

    [5] Trott, O., et al. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading (2010). Journal of Computational Chemistry 31, 455-461. doi:10.1002/jcc.21334

    [6] Lingaraj, Haldurai. A Study on Genetic Algorithm and its Applications (2016). International Journal of Computer Sciences and Engineering. 4. 139-143.

    [7] Turing, A. M. Computing Machinery and Intelligence (1950). Mind LIX(236), 433-460.

    [8] Sakakibara, S. et al. Clonal evolution and antigen recognition of anti-nuclear antibodies in acute systemic lupus erythematosus (2017). Scientific Reports. 7

    [9] Prosperini, A. et al. A Review of the Mycotoxin Enniatin B (2017). Frontiers in Public Health 5.

    [10] Zaparoli, G. et al. The Crystal Structure of Necrosis- and Ethylene-Inducing Protein 2 from the Causal Agent of Cacao’s Witches’ Broom Disease Reveals Key Elements for Its Activity (2011). Biochemistry 50, 9901-9910.