Resolving the 3D conformation of aptamers is key to understanding the principles underlying their performance. All the experimental analyses used in the protein folding field are expensive and present additional difficulties when extrapolated to aptamer analysis.
Computational folding techniques appear as a promising new approach to solve these issues.
2 Current Problem
3 Our Solution
We created an Artificial Intelligence algorithm that predicts optimal DNA molecule shapes. This is our way of improving aptamers: predicting their 3D structures and modeling their docking with the target protein. It relies on three components:
A Generative Adversarial Network (GAN) that works with biological information and is adapted to our challenges. It is divided into two neural networks, Generative and Discriminative, both of which are Convolutional Neural Networks (CNN) [12].
A database for training the GAN, created with the iGEM INSA-Lyon 2016 code and Rosetta.
The Rosetta software, used to evaluate and score the aptamers we create and to judge the performance of our algorithm.
[Figure: the Generative Adversarial Network (GAN), with its Generative and Discriminative networks and their input, together with the database and the Rosetta software.]
4 Our Software
Database
The code runs in a for loop that creates one database entry per iteration. The loop was optimized with threads and with the use of several CPU cores. These optimizations can be used, changed and combined depending on the characteristics of the computer, and all the versions are available on the team's GitHub.
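A minimal sketch of such a parallel loop, assuming Python and the standard library's concurrent.futures module. The function build_entry is a hypothetical stand-in for the real work of one iteration (sequence generation, folding and scoring), and the number of workers is an assumption to adapt to the machine:

```python
import os
import random
from concurrent.futures import ThreadPoolExecutor


def build_entry(index, length=30):
    """Hypothetical placeholder for one loop iteration: returns one entry.

    The real iteration would also fold and score the sequence; here only a
    random sequence is produced so that the sketch stays runnable.
    """
    rng = random.Random(index)  # per-entry seed keeps the runs reproducible
    sequence = "".join(rng.choice("ACGT") for _ in range(length))
    return {"id": index, "sequence": sequence}


def build_database(n_entries, workers=None):
    """Create n_entries entries, spreading the iterations over a worker pool."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves the input order, so entry i ends up at position i
        return list(pool.map(build_entry, range(n_entries)))


entries = build_database(8, workers=4)
```

Threads are adequate when each iteration mostly waits on an external program such as Rosetta. For work that is CPU-bound inside Python, ProcessPoolExecutor from the same module offers the same interface and spreads the load over several cores.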
In each iteration, an entry in the CSV file is created as follows. First, a random nucleotide sequence of the length requested by the user is generated. This chain is then converted to RNA in FASTA format, and its secondary structure is computed with the ViennaRNA software. This information is passed to a Rosetta protocol that folds 100 different structures for the given sequence with the Monte Carlo algorithm and selects the best one. Next, the best structure is minimized and exported in PDB format, in order to optimize the structure and eliminate noise. Because aptamers are DNA sequences, the folded RNA structure is converted to a DNA structure, following the iGEM INSA-Lyon 2016 code. Finally, the pose (the format that Rosetta uses for energy calculations) is extracted from this PDB file and the final structure is scored.
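The first steps of this pipeline (random sequence, conversion to RNA, FASTA output) can be sketched as follows. This is an illustration with the Python standard library only, not the team's code: the function names are hypothetical, and the ViennaRNA and Rosetta steps are left out because they depend on the installed software.

```python
import random

DNA_ALPHABET = "ACGT"


def random_dna(length, rng=random):
    """Return a random DNA sequence with the requested number of nucleotides."""
    return "".join(rng.choice(DNA_ALPHABET) for _ in range(length))


def dna_to_rna(sequence):
    """Rewrite a DNA sequence in the RNA alphabet (thymine becomes uracil)."""
    return sequence.upper().replace("T", "U")


def to_fasta(name, sequence, width=60):
    """Format a FASTA record: a '>' header line, then the wrapped residues."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


rng = random.Random(0)  # fixed seed, only to make the example reproducible
dna = random_dna(20, rng)
record = to_fasta("entry_0", dna_to_rna(dna))
```

The record string can then be written to a file and handed to the folding tools; the exact FASTA conventions that the Rosetta protocol expects are not covered here.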
The torsion angles of each nucleotide (gamma, epsilon, delta, chi and zeta) are saved in a matrix in the database, together with the sequence used and its score. All the entries together form the database, which currently contains 10,000 entries.
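One possible layout for such an entry is sketched below. The column order (sequence, score, then the flattened angle matrix) is an assumption made for illustration, not necessarily the layout of the team's CSV file, and the angle values are invented.

```python
import csv
import io

ANGLE_NAMES = ("gamma", "epsilon", "delta", "chi", "zeta")


def entry_to_row(sequence, angles, score):
    """Flatten one entry into a CSV row: sequence, score, then the angles.

    `angles` is a matrix with one row per nucleotide and one column per
    torsion angle, in the order given by ANGLE_NAMES.
    """
    if len(angles) != len(sequence):
        raise ValueError("one row of angles is needed per nucleotide")
    if any(len(row) != len(ANGLE_NAMES) for row in angles):
        raise ValueError("each nucleotide needs one value per torsion angle")
    flat = [value for row in angles for value in row]
    return [sequence, score] + flat


def row_to_entry(row):
    """Inverse of entry_to_row for a row read back from the CSV (all strings)."""
    sequence, score = row[0], float(row[1])
    values = [float(v) for v in row[2:]]
    n = len(ANGLE_NAMES)
    angles = [values[i:i + n] for i in range(0, len(values), n)]
    return sequence, angles, score


# Illustrative values only, not real Rosetta output.
example_angles = [[60.0, -150.0, 80.0, -160.0, -70.0],
                  [55.0, -148.0, 82.0, -158.0, -72.0]]
buffer = io.StringIO()
csv.writer(buffer).writerow(entry_to_row("AC", example_angles, -12.5))
restored = row_to_entry(next(csv.reader(io.StringIO(buffer.getvalue()))))
```

Keeping one row per entry makes the file easy to append to from the generation loop and easy to read back when the network is trained.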
Generative Adversarial Networks (GAN)
First, the code reads the previously created database and takes 90% of the entries to train the network and the remaining 10% to test it. The entries are separated into score, sequence and nucleotide angles, and reshaped into matrices so that the networks can read them. The networks are then created.
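The split and the reshaping can be sketched as below, under two assumptions that the text does not confirm: the entries are shuffled before the split, and the sequences are one-hot encoded (one row per nucleotide, one column per base). All names are hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"


def split_entries(entries, train_fraction=0.9, seed=0):
    """Shuffle the entries and split them into training and test sets."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def one_hot(sequence):
    """Encode a sequence as a matrix with one row per nucleotide."""
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in sequence]


# Toy entries standing in for the rows of the real database.
entries = [{"sequence": "ACGT", "score": float(-i)} for i in range(100)]
train, test = split_entries(entries)
encoded = one_hot(train[0]["sequence"])
```

A fixed seed keeps the split reproducible, so that the same test entries are held out every time the networks are retrained.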
Once the models produce good structures, they are saved so that they can be used to fold the desired aptamer.
5 Results and Discussion
The 3D folding of an aptamer is obtained from its sequence and its nucleotide angles. The execution time is lower than that of any existing software; the Rosetta protocol for an RNA sequence, for example, takes 30 minutes for one aptamer.
Our software takes 10-15 minutes to train the GAN and, once the GAN is trained, 3 seconds to fold an aptamer.
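To see how these figures add up over several aptamers, the short calculation below uses the numbers quoted above (30 minutes per Rosetta fold, the upper bound of 15 minutes for one GAN training, 3 seconds per GAN fold). It is only arithmetic on those figures; the time needed to build the training database is not counted.

```python
def rosetta_minutes(n_aptamers, minutes_per_fold=30.0):
    """Total time if every aptamer is folded with the Rosetta protocol."""
    return n_aptamers * minutes_per_fold


def gan_minutes(n_aptamers, training_minutes=15.0, seconds_per_fold=3.0):
    """Total time for one GAN training plus one prediction per aptamer."""
    return training_minutes + n_aptamers * seconds_per_fold / 60.0


for n in (1, 10, 100):
    print(n, rosetta_minutes(n), gan_minutes(n))
```

With these figures, folding 100 aptamers takes 3000 minutes with the Rosetta protocol and 20 minutes with the GAN, training included.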
6 Future Ideas
The first idea is to use a better database as the knowledge base of the network. A future team could take the aptamers generated by the network that outperform the Rosetta folding and build a new database from them. The network could then be trained again with this database in order to improve the aptamer-folding results.
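The selection step of this idea could look like the sketch below. The dictionary keys are hypothetical, and the comparison assumes that lower scores are better, as is usual for energy-like scores such as Rosetta's; the comparison should be flipped if the scoring used works the other way.

```python
def select_improved(candidates):
    """Keep generated structures whose score beats the Rosetta reference.

    Each candidate is a dict with the hypothetical keys 'sequence',
    'gan_score' and 'rosetta_score'. Lower scores are treated as better.
    """
    return [c for c in candidates if c["gan_score"] < c["rosetta_score"]]


# Invented scores, for illustration only.
candidates = [
    {"sequence": "ACGT", "gan_score": -20.0, "rosetta_score": -15.0},
    {"sequence": "TTGA", "gan_score": -5.0, "rosetta_score": -9.0},
]
new_database = select_improved(candidates)
```

Here only the first candidate is kept, because it is the only one that scores better than its Rosetta counterpart.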
Another idea is to use a metaheuristic [15] in the Discriminative network. A metaheuristic provides a sufficiently good solution to an optimization problem, especially with incomplete or imperfect information or limited computation capacity. This method could improve the Discriminative network's ability to tell database entries apart from generated sequences, which would extend the useful part of the training. The solutions would improve because the Discriminative network would select better and force the Generative network to create better structures with the same database.
Another future idea is to adapt the database code to the Windows operating system. The only modification needed is the Rosetta command line in the database file: the Rosetta protocols are called through the command line and depend on the operating system, so the equivalent Rosetta protocol for the Windows command line would have to be found and substituted.
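One way to isolate that change is to choose the command from the operating system in a single place, as in the sketch below. The executable names are assumptions that follow the usual naming of Rosetta builds and must be replaced with those of the local installation; the text does not state which protocol the database code calls, and the Windows entry is deliberately left unset.

```python
import platform

# Hypothetical executable names: replace them with the ones of the local
# Rosetta installation. The Windows entry is deliberately left unset.
ROSETTA_COMMANDS = {
    "Linux": "rna_denovo.linuxgccrelease",
    "Darwin": "rna_denovo.macosclangrelease",
    "Windows": None,
}


def rosetta_command(system=None, commands=ROSETTA_COMMANDS):
    """Return the command configured for the given (or current) system."""
    system = system or platform.system()
    command = commands.get(system)
    if command is None:
        raise NotImplementedError("no Rosetta command configured for " + system)
    return command


print(rosetta_command("Linux"))
```

With this layout, porting the database code means filling in one dictionary entry instead of editing every call site.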
Finally, for an aptamer of known sequence, future iGEM teams could train a new GAN using as database input a single sequence with multiple structures. This new database could help to improve the structure of a specific aptamer by giving an optimal folding for the given sequence.
References
[2] H. Rattle, “NMR Studies of Amino Acids, Peptides, and Proteins: A Brief Review, 1980-1982,” Annual Reports on NMR Spectroscopy, pp. 1–71, 1985.
[3] “Principles of Protein X-Ray Crystallography,” 2007.
[4] V. J. B. Ruigrok, M. Levisson, J. Hekelaar, H. Smidt, B. W. Dijkstra, and J. V. D. Oost, “Characterization of Aptamer-Protein Complexes by X-ray Crystallography and Alternative Approaches,” International Journal of Molecular Sciences, vol. 13, no. 8, pp. 10537–10552, 2012.
[5] P. Tompa and G. D. Rose, “The Levinthal paradox of the interactome,” Protein Science, vol. 20, no. 12, pp. 2074–2079, Sep. 2011.
[6] “AlphaFold: Using AI for scientific discovery,” Deepmind. [Online]. Available: https://deepmind.com/blog/article/alphafold. [Accessed: 19-Oct-2019].
[7] “The mfold Web Server,” UNAFold. [Online]. Available: http://unafold.rna.albany.edu/?q=mfold. [Accessed: 19-Oct-2019].
[8] “Welcome to RosettaCommons,” RosettaCommons. [Online]. Available: https://rosettacommons.org/. [Accessed: 19-Oct-2019].
[9] “Monte Carlo Algorithm,” Encyclopedia of Social Network Analysis and Mining, pp. 982–982, 2014.
[10] “Team:INSA-Lyon/Software”. [Online]. Available: https://2016.igem.org/Team:INSA-Lyon/Software. [Accessed: 19-Oct-2019].
[11] “The ViennaRNA Package,” TBI. [Online]. Available: https://www.tbi.univie.ac.at/RNA/. [Accessed: 19-Oct-2019].
[12] Q. Zhang, “Convolutional Neural Networks,” 3rd International Conference on Electromechanical Control Technology and Transportation, 2018.
[13] A. Odena, “Open Questions about Generative Adversarial Networks,” Distill, vol. 4, no. 4, Sep. 2019.
[14] igemsoftware2019, “igemsoftware2019/MADRID_UCM,” GitHub, 18-Oct-2019. [Online]. Available: https://github.com/igemsoftware2019/MADRID_UCM. [Accessed: 19-Oct-2019].
[15] “Overview of metaheuristic optimization,” Metaheuristic Optimization in Power Engineering, pp. 1–38.