Team:Athens/Software


MEDEA/Software

Overview

Overview

In order to kickstart our Continuous Directed Evolution / Adaptive Laboratory Evolution based system, we will need a prototype that works, even barely, because otherwise, even minute amounts of selection pressure would be enough to kill off our population. In our specific system, this translates to we need to begin with an aptamer that already works.

There has already been some bioinformatics work relating to the discovery of novel aptamers. Much of this work is focused on the SELEX procedure, especially when integrated with Next Generation Sequencing (NGS)⁠. In our case, we needed a novel, fully in silico pipeline to identify candidate aptamers to use a starting point for evolution. There has already been some work in this field with aptamers being rationally designed in silico for paraoxon (an organophosphate), optimization of an aptamer to the carcinoembryonic antigen, the Epithelial Cell Adhesion Molecule (EPCAM)1 and a review has been recently written on the subject⁠, as well as a Methods article pertaining to the design of riboswitches from known aptamer sequences⁠ those are just some very recent publications on the subject (2018 and 2019). Older work, that also deserves to be mentioned and has already been reviewed in the literature, has been performed by Dr. Nahar’s group, who used a completely in silico pipeline in order to identify candidate aptamers for Estrogen Receptor α (ERα). Lastly, a lot of groups have focused on using Molecular Dynamics (MD) simulations in order to probe and assess the nature of aptamer-protein interactions. Such work has been performed for MUC1⁠, Thrombin, Flavin⁠ and Theophylline⁠, among others.

Onto our own pipeline, we decided to tackle the development of an aptamer against human C Reactive Protein (CRP). First, we searched Aptagen’s Aptaindex for an aptamer for CRP based on an old publication⁠, which we believed had room for improvement. Then we obtained the structure of the protein from the PDB, based on another publication⁠. After a lot of deliberation, we decided that our pipeline must contain the following:

1. A method to predict secondary structure from an RNA sequence. We achieved this using ViennaRNA2⁠. There is a slew of other alternatives that one could use, such as RNAStructure, UNAFold, Mfold or RNA123, among many others.

2. A method, given an RNA’s sequence and its secondary structure, to predict its tertiary structure. We decided to use SimRNA3⁠ for this, due it being readily available, easy to implement and runs fast, while also being easily parallelizable. Other options we considered were Ernwin and RNAComposer, but some bugs in Ernwin and the unavailability of RNAComposer as a standalone local installation deterred us from using them. One other factor that also pushed us to use SimRNA was its good performance in the RNA-Puzzles⁠ competition, which is a 3D structure prediction competition for RNA, similar to the CAPRI.

3. A method, given an RNA’s structure file, to “clean up” the structure and refine it. We decided to use QRNAS4 for this step, as it is the most recent tool for the job.

4. A method, given an RNA’s structure file and CRP’s structure file, both in .pdb format, to dock the two structures and calculate a score. We decided to use HADDOCK5,6⁠ for this step, due its ease of use and great community. Other alternatives that we considered were 3dRPC7⁠ (which presented compilation problems), NP-DOCK8⁠ (which we were able to obtain a standalone installation by the Bujnicki lab due to dependency problems) and P3-DOCK9 (which is only available as a webserver).

The Genetic Algorithm

The approach we followed in the implementation of our pipeline is an elitist genetic algorithm. A genetic algorithm takes a set of putative solutions to a problem, generates offspring from them by introducing differences (very akin to mutations) and then selects the best solutions from the pool, based on some kind of scoring function and one of the two schemes: elitism or breeding probabilities. In elitism, the top n scored candidates are always selected for the next generation, whereas when breeding a breeding probability scheme is used, each candidate is assigned a probability of bearing offspring. We chose an elitism-based scheme in the interest of time, even if it can be stuck more easily in local optima.

Translating all of this to our model, given a sequence in .fasta format, we first introduce mutations, beginning from a random index. The mutation matrix we utilize is presented in matrix format to the figure below. Each time a mutation occurs, we scale down the probability of mutations by a factor of 2 and raise the probability of no mutations by the difference, in order to not introduce wildly different sequences at each round.




The probability of mutation matrix before and after a new mutation:


It should also be noted that in each generation of the Algorithm, the top performing parents are kept, along with their children. Suppose you have n parents and they each give birth to one offspring, you now have 2*n possible candidates to evaluate and select from. A summary of the pipeline is shown on the following flowchart.

Description of a generation

In this section, we will describe the process of generating a generation, using a single sequence as an example.

1. Beginning from a .fasta file, which is simply the sequence of our current aptamer and only has an index we compute the secondary structure using ViennaRNA’s RNA fold.

2. Using that, along with the sequence as input, we use SimRNA to predict the sequences tertiary structures. We use 800.000 Monte Carlo iterations and the Replica Exchange mode (in the interest of time) with as many Replicates as we can.


3. After generating the files necessary for the structure’s creation (orientation files, .trafl), we cluster the top 1% of them, using a cutoff that is 0.1 * the sequence’s length, as suggested in the user manual. From those clusters we perform an all atom reconstruction of the RNA, using the first cluster. We note that this might not be the most accurate way to obtain good structures, but we just need a rough idea to use for the subsequent docking.
Lastly, we note that we turned off the secondary structure detection for SimRNA, as we are using ViennaRNA for that particular purpose.



4. Using that sequence, we refine it using QRNAS, in order to minimize its energy. For this step, we use 10.000 iterations

5. Using DSSR to obtain the secondary structure from the tertiary one, we change HADDOCK’s needed files (the DNA-RNA_restraints.def file and the run.cns file), reparametrize it for this specific candidate sequence and begin the docking. In this step, it should be noted that we use 480 iterations for the Rigid Body minimization step (as opposed to the default 1000) and 160 iterations for the Semi Flexible Simulated Annealing step (as opposed to the default 200) and another 160 iterations for the Explicit Water Solvent Docking step (as opposed to the default 200). Those decisions are all made in the interest of time. Other parameters we use are random Ambiguous Interaction Restraints, because we do not have the prior knowledge needed to define them, Center of Mass restraints to enforce contact between the two molecules and Base Pairing restraints, obtained from the secondary structure computed by DSSR, in order to avoid the RNA from unraveling. The structures are then clustered and the best one, as well as its score is obtained.



Other approaches

This is not the first time a team attempts to computationally generate aptamers. A team we are very fond of, Heidelberg 2015 had a similar project idea and created two marvelous pieces of software (MAWS and JAWS). Needless to say, we tried to compare our software to theirs. We did face some difficulty in the process though. First, using iGEM Heidelberg 2017 revamp of MAWS, which used antechamber from the AmberTools package as a subprogram, we got stuck at the minimization stage, possibly because our protein is CRP, which is quite bigger than a small molecule like Theophylline. Then, after switching to the original version of MAWS created by iGEM Heidelberg 2015, we faced two problems. Firstly, their software produces only DNA aptamers and not RNA ones, thereby rendering the comparison unfair and secondly, it crashed the computer we were using it at when we tried to create an aptamer larger than 35 base pairs. Therefore, the comparison made below is between our best aptamer and the 35 nucleotide long DNA aptamer produced by MAWS, as it was developed by iGEM Heidelberg 2015.

Our shortcomings

It is imperative that we achieve good running times, as a genetic algorithm’s performance is heavily based upon the number of generations elapsed. For that reason, we opted to use a Google Cloud VM that had 80 and then 160 cores, achieving times of 40 minutes per generation, at a cost of 0.8$ per hour. The reason we chose a Google Cloud VM was the ease of use and the fact that they offer 300$ in free credits, for trying out your systems. It also offers permanent data storage and easy transfers in order to save our states. More information about the setup of the VM is available in our Github repository.We took advantage of multiprocessing to the best of our ability in order to achieve these sub one hour times.

One major problem we faced during the implementation of our pipeline is the problem of accessibility. Specifically, many of the programs that are required and are suggest by HADDOCK are free for use with an academic license, but they are not free for distribution. Regrettably, that means that installation is not as smooth as we’d want it to be. There are programs whose files need to be placed in the working directory, before the installation can proceed. Once more, there is more information to be found in our Github repository.

And it wouldn’t be fair to say that our pipeline is without errors. Most of these decisions are intentional and were made in the interest of saving time:

First, in order to achieve good running times, we had to lower the number of iterations in docking, during each of the 3 phases of HADDOCK docking (Rigid Body Minimisation, Semi-flexible Simulated Annealing, Explicit Solvent docking, more info here). That can lead to inaccurate results, but we reasoned that we only required a rough idea of the binding affinity, in order to obtain a rank for each sequence.

Second, once again in order to achieve better times, we used Elitism in our Genetic Algorithm, which is known to get stuck in local optima.

Third, our scoring is based only on HADDOCK’s docking score. More specifically, we do not integrate the SimRNA energy score, or the QRNAS energy score post refinement in our scoring. We hope that computational biologists more skilled than us will use this idea and refine it, by integrating the score.

Fourth, the programs used in this pipeline were chosen arbitrarily. We make no claim as to whether they are the best fit for the purpose and any suggestions/criticism is more than welcome on this part.

Lastly, the pipeline is heavily dependant on computing power and not as readily available for the public (especially non academia) as we would like it to be.



Results

This software was ran for approximately 72 hours on a Cloud VM in Google Cloud, producing about 200 novel sequences starting from a single one. The sequences were then fed to the pipeline and were ranked according to their HADDOCK score. We created a simple graphic to visualize our results and have made a comparison with MAWS, developed in 2015 by iGEM Heidelberg. Feel free to visit our results page for more information!

Sequence Program Notes HADDOCK Score
GCCUGUAAGGUGGUCGGUGUGGCGAGUGUGUUAGGAGAGAUUGC None Initial sequence from Aptagen -136
CCUGUAAGGUGGUGGGGUUGGCCGGUGAGUUAGAAGAGAUUGC MPDR Best out of 200 sequences from MPDR -221
ACTTGAGGGAGAGAAACACCGGTCATGGGGTGGGG MAWS Heidelberg 2015 Unlike the above ones, this is a DNA aptamer -163

It can be seen that at least for this example of CRP, our pipeline outperforms both MAWS, when given a good starting point and creates a better aptamer than the one initially provided. Since MAWS's performance is not bad, we integrate MAWS, in order to create an initial aptamer in the cases that none exist in the literature, which then we proceed to evolve, using our Genetic Algorithm.


In order to better visualize the aptamers and the relationships between them, we created a rooted tree, which shows the relationships of the aptamers and their scores. Due to resolution issues, we suggest you open the image itself to have a better look.


It seems that our method might get bottlenecked at some points, as the best sequence (74.fasta) remained the best for several iterations. A possible approach to circumvent this problem would be to increase the mutation chance and make the process a little less stepwise and perhaps more erratic.


Furthermore, we sought to compare the initial aptamer conformation, when docked to CRP with the one from the best Aptamer, also when docked. To our surprise, the interaction surface was quite different and the number of interactions was increased. Additionally, the aptamer became more "compact" with a larger number of intramolecular hydrogen bonds and an abundance of G and A.


Initial
Final

Finally, in order to check whether our algorithm was biased in favour of some sequences, we created a logo from the top 10 of the sequences that were created during the run, in regards to HADDOCK score. A logo is a picture that holds information about the probability of each nucleotide appearing in a specific position. The largest the nucleotide in the picture, the larger the probability. It seems there are few very variable positions and many of them are very stable.

Logo

  1. Bavi, R., Liu, Z., Han, Z., Zhang, H. & Gu, Y. In silico designed RNA aptamer against epithelial cell adhesion molecule for cancer cell imaging. Biochem. Biophys. Res. Commun. (2019). doi:10.1016/j.bbrc.2019.01.028
  2. Lorenz, R. et al. ViennaRNA Package 2.0. Algorithms Mol. Biol. (2011). doi:10.1186/1748-7188-6-26
  3. Boniecki, M. J. et al. SimRNA: A coarse-grained method for RNA folding simulations and 3D structure prediction. Nucleic Acids Res. (2015). doi:10.1093/nar/gkv1479
  4. Stasiewicz, J., Mukherjee, S., Nithin, C. & Bujnicki, J. M. QRNAS: Software tool for refinement of nucleic acid structures. BMC Struct. Biol. (2019). doi:10.1186/s12900-019-0103-1
  5. Van Dijk, M., Van Dijk, A. D. J., Hsu, V., Rolf, B. & Bonvin, A. M. J. J. Information-driven protein-DNA docking using HADDOCK: It is a matter of flexibility. Nucleic Acids Res. (2006). doi:10.1093/nar/gkl412
  6. Dominguez, C., Boelens, R. & Bonvin, A. M.
  7. J. J. HADDOCK: A protein-protein docking approach based on biochemical or biophysical information. J. Am. Chem. Soc. (2003). doi:10.1021/ja026939x
  8. Huang, Y., Li, H. & Xiao, Y. 3dRPC: A web server for 3D RNA-protein structure prediction. Bioinformatics (2018). doi:10.1093/bioinformatics/btx742
  9. Tuszynska, I., Magnus, M., Jonak, K., Dawson, W. & Bujnicki, J. M. NPDock: A web server for protein-nucleic acid docking. Nucleic Acids Res. (2015). doi:10.1093/nar/gkv493
  10. Zheng, J., Hong, X., Xie, J., Tong, X. & Liu, S. P3DOCK: a protein–RNA docking webserver based on template-based and template-free docking. Bioinformatics (2019). doi:10.1093/bioinformatics/btz478