Aptamer Folding – iGem Madrid

APTAMER FOLDING

1 Why?

Resolving the 3D conformational shape of aptamers is key to understanding the underlying principles of their performance. The experimental analyses used in the protein folding field are expensive and present additional difficulties when extrapolated to aptamer analysis.

Computational folding techniques appear to be a promising new approach to solving these issues.

One of the main challenges of the SELEX process is the selection of the initial library of aptamer molecules, and of the molecules that will take part in each round. An important consideration is the protocol used to fold aptamers into their active conformations and bring them into contact with the target protein. The folding conditions include multiple variables, such as temperature, buffer components, incubation time and aptamer concentration. To understand how variations in folding conditions affect aptamer function, we developed a novel AI algorithm that performs this folding computationally, in order to shorten the SELEX process and to extract the aptamer, and its particular conformation, that binds best to the target protein.

3D aptamer folding is used in order to understand how the aptamer works. It gives information about the aptamer and its interactions: the exact regions of linkage with the protein, its nucleotide structure, its folding degrees, the weakest or most powerful regions, etc. By acquiring this knowledge, we are able to study the aptamer and improve it.

Our idea is to improve the aptamer by constructing a library (or database) of aptamers and folding them in 3D computationally. From the computed structures we can identify the region of the protein where each aptamer binds (the target part) and select the aptamers in the library that bind best to the protein. With this information, the next step is to build a new library of aptamers similar to those that bound the protein, randomizing only the nucleotides in the contact region.

Then the protein linkage is performed computationally again and we extract the main parts. By going through these steps again and again we can shorten the SELEX procedure and reduce laboratory time, helping scientists to find the desired aptamers that dock perfectly with the target protein.
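The library-refinement step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `focused_library`, the example sequence and the contact positions are all hypothetical, and in the real workflow the contact region would come from the computed aptamer-protein structure.

```python
import random

NUCLEOTIDES = "ACGT"

def focused_library(parent, contact_positions, size, rng=None):
    """Return `size` variants of `parent` that differ only at the contact positions."""
    rng = rng or random.Random()
    variants = []
    for _ in range(size):
        bases = list(parent)
        for pos in contact_positions:
            # Only the contact region is randomized; the rest is kept as in the parent.
            bases[pos] = rng.choice(NUCLEOTIDES)
        variants.append("".join(bases))
    return variants

parent = "GGTTGGTGTGGTTGG"   # illustrative sequence only
contact = [3, 4, 8, 9]       # hypothetical contact region (0-based indices)
library = focused_library(parent, contact, size=5, rng=random.Random(0))

for variant in library:
    print(variant)
```

Each round of the loop would then fold and dock the new library, pick the best binders, and repeat with an updated contact region.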

The main advantage of this method is to show the 3D structure of the aptamer, allowing one to see the region of linkage with the protein. SELEX can then be centred on this exact region, rather than relying on the random libraries of aptamers at each step of the process. This allows us to better understand and so improve the aptamers.

2 Current Problem

Currently, a few techniques exist for understanding aptamer structure. The first is experimental: the 3D structure of the aptamer is observed through cryo-electron microscopy [1], nuclear magnetic resonance [2] or X-ray crystallography [3], in order to extract the nucleotides and their degrees, that is, their torsion angles (gamma, epsilon, delta, chi and zeta). However, obtaining crystal structures of aptamer-target complexes has proven difficult, and only a few co-crystal structures have become available over the years. The costs and time associated with the experimental methods are also very high, and the number of possible samples (nucleotides and angles for each aptamer) is practically unlimited. Experimental aptamer folding is therefore highly complicated and, for our purposes, not viable [4].

The next technology used is the “brute force” algorithm, which performs the computer folding of aptamers and selects the best one (the one that best docks to the target protein) based on the construction of all the possible structures of one given aptamer. The main problem with this type of approach is that each aptamer has millions and millions of conformations (different folding structures), so it is difficult and costly - in time and money - to model them. There is a very high number of interactions between nucleotides to take into account. As noted in Levinthal’s paradox [5], it would take longer than the age of the universe to enumerate all the possible configurations of a typical aptamer before reaching the right 3D structure. So, the “brute force” algorithm is normally discarded.
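A rough back-of-the-envelope calculation shows the scale of the problem. The numbers below are assumptions chosen for illustration, not measurements: a 30-nucleotide aptamer, the five torsion angles named above, only three allowed states per angle, and a machine that evaluates 10^12 conformations per second.

```python
AGE_OF_UNIVERSE_S = 4.35e17   # about 13.8 billion years, expressed in seconds

def enumeration_time_seconds(n_nucleotides, angles_per_nt=5,
                             states_per_angle=3, rate_per_s=1e12):
    """Time needed to visit every conformation once, under the stated assumptions."""
    conformations = states_per_angle ** (angles_per_nt * n_nucleotides)
    return conformations / rate_per_s

seconds = enumeration_time_seconds(30)
ratio = seconds / AGE_OF_UNIVERSE_S

print(f"enumeration time: {seconds:.2e} s")
print(f"that is {ratio:.2e} times the age of the universe")
```

Even with this coarse discretization the enumeration time is on the order of 10^59 seconds, which is why exhaustive search is not an option.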

AlphaFold [6]:

This is the most widely used AI-based algorithm for the 3D folding of proteins. The main problem is that proteins differ from aptamers in many characteristics, and their 3D folding is entirely different: proteins are composed of amino acids, whereas aptamers are composed essentially of nucleotides. Aptamer folding therefore cannot be performed with this software; however, we took from it the idea of using AI to develop our own algorithm.

Mfold [7]:

This predicts the secondary structure of single-stranded nucleic acids, so it can generate the secondary structure of an aptamer. The main problem is that SELEX requires the 3D fold in order to identify the region that binds the target protein. The Mfold software therefore does not meet our needs.

Rosetta [8]:

This is a dynamic and evolving macromolecular modeling suite for biomolecular structure prediction and design. It includes algorithms for computational modeling and analysis of protein structures, and is currently used for the 3D modeling of biomolecular structures. This technology has two problems. First, the Rosetta software does not perform the 3D folding of DNA aptamers directly: the sequence has to be folded as RNA and the resulting structure converted to DNA (see iGEM INSA-Lyon 2016 Modeling Technology). Second, the software uses the “Monte Carlo” algorithm [9]. The Monte Carlo method is a numerical method of solving mathematical problems by random sampling, so the resulting structure is good but not always the best one for the specific problem.
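The "good but not always the best" behaviour of random sampling can be seen on a toy problem. The sketch below is plain random sampling over a made-up one-dimensional energy function; it is not Rosetta's protocol, and `toy_energy` is purely hypothetical. Different seeds return slightly different minima, none of which is guaranteed to be the global one.

```python
import math
import random

def toy_energy(x):
    """A bumpy 1D 'energy landscape' with several local minima (illustrative only)."""
    return 0.1 * x * x + math.sin(3.0 * x)

def random_sampling_minimum(n_samples, seed):
    """Draw n_samples random points and keep the one with the lowest energy."""
    rng = random.Random(seed)
    best_x = rng.uniform(-5.0, 5.0)
    best_e = toy_energy(best_x)
    for _ in range(n_samples - 1):
        x = rng.uniform(-5.0, 5.0)
        e = toy_energy(x)
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

for seed in range(3):
    x, e = random_sampling_minimum(100, seed)
    print(f"seed={seed}: x={x:.3f}, energy={e:.4f}")
```

Drawing more samples improves the result on average, but the cost grows with every extra sample, which is the trade-off that motivates a learned model.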

The iGEM INSA-Lyon 2016 Modeling code [10]:

The INSA-Lyon team created software that uses ViennaRNA [11] and Rosetta to perform the 3D folding of aptamers. The problem is that, because it relies on Rosetta, the resulting 3D structures were not always the best ones for binding the protein.

So, the problem of the 3D folding of aptamers had not yet been solved, and our task was to create a new technology to fill this gap.

3 Our Solution

We decided to use Computational Folding, a technique that is becoming more and more important every day. This is still challenging due to the high number of possible combinations and the fact that, at the moment, it is very computationally demanding to simulate three-dimensional structures.

Our solution is based on AI technology. Experimental methods depend on a lot of trial and error, which can take years and cost tens of thousands of dollars per structure. This is why biologists are turning to AI methods as an alternative to this long and laborious process. AI algorithms also make the 3D simulation and computational modeling of aptamers with their target proteins easier.

We created an Artificial Intelligence algorithm to predict the optimal DNA molecule shapes, which is our way of improving aptamers, predicting 3D structures and modeling the docking with the target protein.

Our AI algorithm was programmed based on the relevant 3D aptamer folding algorithm, and we developed the software based on the biological information and using three main components:

1. A Generative Adversarial Network (GAN) that works with biological information and is adapted to our challenges. It is divided into two neural networks, Generative and Discriminative, both built from Convolutional Neural Networks (CNN) [12].

2. A database for training the GAN, created with the iGEM INSA-Lyon 2016 and Rosetta code.

3. The Rosetta software, used to evaluate and score the aptamers we create and to judge the performance of our algorithm.

Generative Adversarial Networks (GAN)

Generative Adversarial Networks [13] are machine-learning neural net architectures composed of two nets, pitting one against the other (thus the “adversarial”), generating new information from previous data. We created new 3D aptamer structures from known sequences. And we adapted machine learning algorithms to biological environments. The GAN components are:

Generative

The generative net models or generates data that is very similar to the training data. Our structure: CNN with Diabolo form.

Discriminative

The Discriminative net identifies whether the data is real or fake and returns its conclusions to the Generative. Our structure: Classifier Network with 2 CNNs as inputs. In this way, the algorithm works in both directions: the Generative keeps generating structures and the Discriminative keeps giving feedback on how good those creations are.

Input

The input is given to the Generative network in the form of questions (in our case, nucleotide sequences), from which it creates artificial samples.

As stated above, our algorithm is built from CNNs. A Convolutional Neural Network is a class of neural network that specializes in processing and analyzing visual imagery. Each of its layers performs 2D convolutions over the input images. Convolution is a simple mathematical operation that performs spatial filtering and feature detection in images.
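To make the convolution operation concrete, here is a minimal pure-Python sketch with no padding and a stride of 1. Like most CNN libraries, it does not flip the kernel (strictly speaking this is cross-correlation). The tiny image and the edge-detecting kernel are made up for the example.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output

# A 3x4 "image" whose right half is bright, and a kernel that responds
# to a dark-to-bright transition from left to right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, edge_kernel)
print(feature_map)   # [[0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only where the vertical edge sits, which is the sense in which convolution performs feature detection.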

Database

An organized collection of data with aptamer structures that serves as “knowledge” for the nets. We started with the code of the previous iGEM INSA-Lyon edition (2016) and added multithreading and other optimizations, such as terminal use, creation of the database (CSV) from the command line, and support for the Linux and MacOS operating systems. We ended up with a collection of more than 10,000 optimal samples. This means that each and every sequence of DNA bases was run through the Rosetta software 100 times in search of the best possible angles between the nucleotides. This repetition for each nucleotide sequence was done in order to optimize the structure and allow our GAN to learn from the best examples, improving its work.

Rosetta Software

A dynamic and evolving macromolecular modeling suite addressing biomolecular structure prediction and design. It is well-known software in the scientific community and was of great help to us in determining the effectiveness of our algorithm. At this point we can generate aptamers with very good scores. Rosetta scoring is based on the free energy of three-dimensional folding, so the lower the score, the better (more stable) the structure.

4 Our Software

The code [14] is divided into two groups: the database creation, and the training and use of the GAN. All the software is written in Python and depends on several Python modules (numpy, pandas, os, and sklearn) as well as different parts and protocols of code from Rosetta, ViennaRNA and Keras.

Database

In the Database part, we programmed the code to create the database used in the GAN, in CSV format, in order to learn from created 3D aptamer structures. This can be executed in Linux and MacOS operating systems.

The code is a “for” loop that creates one entry in the database in each round. This loop was optimized through the use of threads and several CPU cores. These optimizations can be used, changed and combined depending on the characteristics of the computer, and all the versions are in the team’s GitHub.
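As a rough idea of how such a loop can be spread over worker threads, the sketch below uses Python's standard `concurrent.futures` module. It is not the team's code: `build_entry` is a hypothetical stand-in for one loop iteration, and the real version would call ViennaRNA and Rosetta instead of only generating a sequence.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def build_entry(index, length=20):
    """Hypothetical stand-in for one loop iteration (sequence -> database entry)."""
    rng = random.Random(index)   # per-task generator, so runs are reproducible
    sequence = "".join(rng.choice("ACGT") for _ in range(length))
    return {"id": index, "sequence": sequence}

def build_database(n_entries, n_workers=4):
    """Run the loop body in a pool of worker threads and collect the entries."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(build_entry, range(n_entries)))

entries = build_database(8)
for entry in entries:
    print(entry["id"], entry["sequence"])
```

Threads suit this kind of loop when most of the time is spent waiting on external programs; the number of workers would be tuned to the machine, as the text notes.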

In each iteration, in order to create an entry in the CSV, a random nucleotide sequence is generated, with the number of nucleotides that the user wants. This chain is then converted to RNA in FASTA format, and its secondary structure is computed with the ViennaRNA software. This information is passed to a Rosetta protocol that, for the given sequence, folds 100 different structures using the “Monte Carlo” algorithm and selects the best one. Next, the best structure is minimized and extracted in PDB format, in order to optimize the structure and eliminate noise. The folded RNA structure is converted to a DNA structure, because the aptamers are DNA sequences, following the iGEM INSA-Lyon 2016 code. Finally, the pose format (a format that Rosetta uses for the energy calculation) is extracted from this PDB and a score is given to the final structure.
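The first steps of this pipeline, together with the "keep the best of 100" selection, can be sketched without any external tool. Everything here is illustrative: the helper names are hypothetical, and the candidate scores are random numbers standing in for the scores that Rosetta would assign to real folded structures.

```python
import random

def random_dna(length, rng):
    """Random DNA sequence of the requested length."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def dna_to_rna(dna):
    """Rewrite the DNA sequence as RNA by replacing thymine with uracil."""
    return dna.replace("T", "U")

def to_fasta(name, sequence):
    """Format a single record in FASTA: a '>' header line, then the sequence."""
    return f">{name}\n{sequence}\n"

def best_of(candidates, score):
    """Lower scores mean lower free energy, so the minimum is the best candidate."""
    return min(candidates, key=score)

rng = random.Random(42)
dna = random_dna(30, rng)
rna = dna_to_rna(dna)
fasta = to_fasta("aptamer_0", rna)
print(fasta, end="")

# Placeholder for the 100 folded structures: (structure_id, score) pairs.
candidates = [(i, rng.uniform(-50.0, 50.0)) for i in range(100)]
best = best_of(candidates, score=lambda c: c[1])
print("best candidate:", best)
```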

The degrees of each nucleotide (gamma, epsilon, delta, chi and zeta) are saved in a matrix in the database with the sequence used and the scoring. All the entries together form the database, which is currently composed of 10,000 entries.
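One possible way to lay out such an entry as a CSV row is shown below. The column naming (`gamma_0`, `epsilon_0`, ...) and the angle values are assumptions made for the example; the actual database layout is defined by the team's code.

```python
import csv
import io

ANGLES = ["gamma", "epsilon", "delta", "chi", "zeta"]

def entry_to_row(sequence, score, angles):
    """Flatten one entry; `angles` holds one list of five values per nucleotide."""
    row = {"sequence": sequence, "score": score}
    for i, per_nucleotide in enumerate(angles):
        for name, value in zip(ANGLES, per_nucleotide):
            row[f"{name}_{i}"] = value
    return row

sequence = "ACG"
# Made-up angle values, one set of five per nucleotide.
angles = [[60.0, -150.0, 140.0, -110.0, -90.0] for _ in sequence]
row = entry_to_row(sequence, -12.5, angles)

# Write to an in-memory buffer; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

With this layout a sequence of n nucleotides produces 2 + 5n columns, so all sequences in one file need the same length.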

Generative Adversarial Networks (GAN)

The GAN part covers the design of the GAN based on the database characteristics and the user's needs, the training of this network (which takes 10 minutes), the saving of the components for future use, and the testing of the trained networks with the Rosetta code in order to make sure that the network works. Once the network is trained, one or many new nucleotide sequences, specified by the user, can be fed into the Generative network. The 3D structure of these aptamers is obtained in 3-4 seconds, with a low Rosetta score (minimum free energy). To this end, the code is structured in different parts:

First, the code reads the previously created database and takes 90% of the entries for training the network and the remaining 10% for testing it. The entries are separated into score, sequence and degrees (of the nucleotides) and reshaped into matrices so that the networks can read them. Then, the networks are created:
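A minimal sketch of this preparation step follows. The 90/10 split comes from the text; the one-hot encoding of the sequence is an assumption (the text only says the entries are reshaped into matrices), and the helper names are hypothetical. In practice `sklearn.model_selection.train_test_split` could replace the hand-written split.

```python
import random

def split_90_10(entries, seed=0):
    """Shuffle a copy of the entries and split it 90% / 10%."""
    shuffled = entries[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10
    return shuffled[:cut], shuffled[cut:]

def one_hot(sequence, alphabet="ACGT"):
    """Encode a sequence as a len(sequence) x 4 matrix of zeros and ones."""
    return [[1 if base == letter else 0 for letter in alphabet]
            for base in sequence]

# Dummy entries standing in for rows read from the CSV database.
entries = [{"sequence": "ACGT", "score": float(-i)} for i in range(100)]
train, test = split_90_10(entries)
print(len(train), len(test))   # 90 10

matrix = one_hot("ACGT")
print(matrix)
```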

Generative:

The input is a batch of sequences and the output is the degrees associated with these sequences. The structure is a CNN with “Diabolo” form: the left half has three 2D convolution layers interleaved with MaxPooling layers (“reshape” layers), and the right half has three 2D transpose convolution layers interleaved with UpPooling layers.
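The name “Diabolo” refers to the hourglass shape of the feature maps: they shrink through the first half and grow back through the second. The sketch below only traces shapes and is not a trained model. It assumes that the convolutions keep the spatial size (same padding), that each pooling step halves it and each up-pooling step doubles it, and the 32 x 8 input size is arbitrary.

```python
def trace_diabolo(height, width, depth=3, pool=2):
    """Return the feature-map size after each pooling / up-pooling stage."""
    shapes = [(height, width)]
    for _ in range(depth):
        # Contracting half: convolution (size kept) followed by max pooling.
        height, width = height // pool, width // pool
        shapes.append((height, width))
    for _ in range(depth):
        # Expanding half: transpose convolution (size kept) followed by up-pooling.
        height, width = height * pool, width * pool
        shapes.append((height, width))
    return shapes

shapes = trace_diabolo(32, 8)
for stage, shape in enumerate(shapes):
    print(stage, shape)
```

The narrow middle forces the network to compress each sequence into a compact representation before expanding it into angles. The output recovers the input size exactly only when both dimensions are divisible by 2^3 = 8.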

Discriminative:

The input is a batch of sequences, processed by one CNN (2D convolutional layers interleaved with MaxPooling layers), together with their associated degrees, processed by another CNN. The two network outputs are then flattened and combined in a final layer that computes a weighted sum of its inputs and applies an activation function to it. The output is a number between 0 and 1, a “score” that tells the Generative how similar its creations are to the database (reality). A “1” means that a creation cannot be distinguished from the database, and a “0” means that it is easily told apart from it.
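The last step, turning the two branch outputs into a single score between 0 and 1, can be illustrated as follows. The sigmoid activation is an assumption (it is the usual choice for a 0-to-1 output), and the feature values and weights are invented for the example rather than learned.

```python
import math

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def discriminator_head(seq_features, angle_features, weights, bias):
    """Concatenate the two flattened branch outputs, then apply weighted sum + sigmoid."""
    merged = list(seq_features) + list(angle_features)
    weighted_sum = sum(w * f for w, f in zip(weights, merged)) + bias
    return sigmoid(weighted_sum)

score = discriminator_head(
    seq_features=[0.2, -0.1],        # made-up output of the sequence branch
    angle_features=[0.4, 0.3],       # made-up output of the angle branch
    weights=[1.0, 0.5, -0.3, 2.0],   # made-up weights, normally learned
    bias=0.0,
)
print(f"score = {score:.3f}")
```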

The GAN Model:

This encapsulates the Discriminative and Generative: it connects the output of the Generative, along with the database, to the input of the Discriminative, allowing the Discriminative to score this input and give feedback to the Generative.
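The feedback between the two networks is usually expressed through a pair of loss functions. The sketch below shows the common binary cross-entropy formulation as an assumption; the text does not state which loss the team's code uses.

```python
import math

def bce(prediction, target, eps=1e-7):
    """Binary cross-entropy for one prediction, clipped to avoid log(0)."""
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def discriminator_loss(score_real, score_fake):
    """The Discriminative wants database samples near 1 and generated ones near 0."""
    return bce(score_real, 1.0) + bce(score_fake, 0.0)

def generator_loss(score_fake):
    """The Generative wants its creations to be scored as real (near 1)."""
    return bce(score_fake, 1.0)

# A creation that fools the Discriminative gives the Generative a lower loss.
print(generator_loss(0.9), generator_loss(0.1))
# A Discriminative that separates real from generated has a lower loss.
print(discriminator_loss(0.9, 0.1), discriminator_loss(0.5, 0.5))
```

Training alternates between lowering the first loss and lowering the second, which is the two-way feedback described above.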

Next, the networks are trained with different accuracies (specified by the user), and the models are tested with the test set.
Once the models have good structures, they are saved in order to be used in the desired aptamer folding.

5 Results and Discussion

The Rosetta scores of the resulting aptamers are similar to those in the database, and in some cases lower. The database contains the best Rosetta structure out of 100 per sequence, minimized and with noise eliminated.

The 3D folding of an aptamer is obtained from its sequence and its nucleotide degrees. The execution time is lower than that of the existing software we considered; the Rosetta protocol for computation of an RNA sequence, for example, takes 30 minutes for one aptamer.

Our software takes 10-15 minutes for the GAN training and 3 seconds for the aptamer folding, once the GAN is trained.
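Using the figures quoted above, a quick calculation shows how the two approaches scale with the number of aptamers. It takes the upper training time of 15 minutes and assumes the aptamers are folded one after another.

```python
ROSETTA_MIN_PER_APTAMER = 30.0   # from the text: 30 minutes per aptamer
GAN_TRAINING_MIN = 15.0          # from the text: 10-15 minutes (upper bound used)
GAN_SEC_PER_APTAMER = 3.0        # from the text: 3 seconds per aptamer

def rosetta_minutes(n_aptamers):
    return n_aptamers * ROSETTA_MIN_PER_APTAMER

def gan_minutes(n_aptamers):
    # One-off training cost plus a small per-aptamer cost.
    return GAN_TRAINING_MIN + n_aptamers * GAN_SEC_PER_APTAMER / 60.0

for n in (1, 10, 1000):
    print(f"{n:>5} aptamers: Rosetta {rosetta_minutes(n):>8.1f} min, "
          f"GAN {gan_minutes(n):>6.2f} min")
```

For 1,000 aptamers this gives 30,000 minutes (about 21 days) for the Rosetta protocol against 65 minutes for the trained GAN, training included.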

The scores from the Rosetta code compare favorably with other software, and so the challenge of this modeling part is fulfilled. We have developed software that creates reliable 3D folding of desired aptamers in only 10 minutes, with good Rosetta scoring, permitting a considerable shortening of the SELEX process and enabling more efficient study and improvement of aptamers.

6 Future Ideas

We propose four future ideas that could be developed based on our code.

The first is the use of a better database as the knowledge base of the network. A future team could take the aptamers generated by the network that outperform the Rosetta folding and create a new database from them. With this database they could train the network again in order to improve the aptamer-folding results.

Another idea is the use of a metaheuristic [15] in the Discriminative network. A metaheuristic provides a sufficiently good solution to an optimization problem, especially with incomplete or imperfect information or limited computation capacity. This method could improve the Discriminative network and extend the training, because the database would be discriminated correctly from the generated sequences. The solutions would improve because the Discriminative would perform better selection and force the Generative to create better structures with the same database.

Another future idea would be an adaptation of the database code to the Windows operating system. The only modification would be the Rosetta command line change in the database file: when the Rosetta protocols call the command line, the protocols depend on the operating system, so it would be necessary to find the same Rosetta protocol in the Windows command line and change it.

And finally, knowing the sequence of an aptamer, future iGEM teams could perform a new GAN training using a database that contains only that sequence with multiple structures. This new database could help to improve the structure of a specific aptamer, giving an optimal folding for a given sequence.

References

[1] D. Elmlund and H. Elmlund, “Cryogenic Electron Microscopy and Single-Particle Analysis,” Annual Review of Biochemistry, vol. 84, no. 1, pp. 499–517, Feb. 2015.
[2] H. Rattle, “NMR Studies of Amino Acids, Peptides, and Proteins: A Brief Review, 1980-1982,” Annual Reports on NMR Spectroscopy, pp. 1–71, 1985.
[3] “Principles of Protein X-Ray Crystallography,” 2007.
[4] V. J. B. Ruigrok, M. Levisson, J. Hekelaar, H. Smidt, B. W. Dijkstra, and J. V. D. Oost, “Characterization of Aptamer-Protein Complexes by X-ray Crystallography and Alternative Approaches,” International Journal of Molecular Sciences, vol. 13, no. 8, pp. 10537–10552, 2012.
[5] P. Tompa and G. D. Rose, “The Levinthal paradox of the interactome,” Protein Science, vol. 20, no. 12, pp. 2074–2079, Sep. 2011.
[6] “AlphaFold: Using AI for scientific discovery,” Deepmind. [Online]. Available: https://deepmind.com/blog/article/alphafold. [Accessed: 19-Oct-2019].
[7] “The mfold Web Server,” UNAFold. [Online]. Available: http://unafold.rna.albany.edu/?q=mfold. [Accessed: 19-Oct-2019].
[8] “Welcome to RosettaCommons,” RosettaCommons. [Online]. Available: https://rosettacommons.org/. [Accessed: 19-Oct-2019].
[9] “Monte Carlo Algorithm,” Encyclopedia of Social Network Analysis and Mining, pp. 982–982, 2014.
[10] “Team:INSA-Lyon/Software”. [Online]. Available: https://2016.igem.org/Team:INSA-Lyon/Software. [Accessed: 19-Oct-2019].
[11] “The ViennaRNA Package,” TBI. [Online]. Available: https://www.tbi.univie.ac.at/RNA/. [Accessed: 19-Oct-2019].
[12] Q. Zhang, “Convolutional Neural Networks,” 3rd International Conference on Electromechanical Control Technology and Transportation, 2018.
[13] A. Odena, “Open Questions about Generative Adversarial Networks,” Distill, vol. 4, no. 4, Sep. 2019.
[14] igemsoftware2019, “igemsoftware2019/MADRID_UCM,” GitHub, 18-Oct-2019. [Online]. Available: https://github.com/igemsoftware2019/MADRID_UCM. [Accessed: 19-Oct-2019].
[15] “Overview of metaheuristic optimization,” Metaheuristic Optimization in Power Engineering, pp. 1–38.