Team:SASTRA Thanjavur/Software

Software

In this aspect of our project, we worked on the predictive modelling of the efficacy of toehold switches according to their dynamic range.

Our software consists of the following facets.

  1. The machine learning problem
  2. Domain parsing
  3. Feature Engineering
  4. Predictive modelling
  5. Tools
    
The Machine Learning Problem

Toehold switches are de novo designed, high- performance, mRNA based riboregulators for translational regulation in prokaryotic systems. Three properties of toehold switches buttress their integration in the genome and in synthetic genetic circuits for diagnostic applications in the upcoming field of synthetic biology:

  1. Dynamic range
  2. Low crosstalk
  3. High Orthogonality

Each of the above properties of toehold switches could be characterized by well-established protocols. In the software aspect of House of Toeholds, we aimed to predictively model toehold switches as a sensory element for the detection of short RNA oligonucleotides called microRNAs that are indicative of cervical cancer. We came up with an in silico predictive modelling technique of toehold switches for candidate mRNA sequences through machine learning techniques. The machine learning problem at hand is as follows: for a given toehold construct as input, we would like to predict the efficacy of the toehold measured by its dynamic range (ON/OFF ratio). The features of interest to solve this problem could stem from sequence properties as well as structural considerations.To determine the optimal toehold switches for the problem at hand, we made use of regression analysis. The design workflow for the machine learning problem was the following:

  1. Collect toehold switch data from scientific literature and build an exhaustive dataset of toehold switch sequences with their respective properties.
  2. Generate a feature matrix for the sequences that includes sequence based and free energy features.
  3. After feature selection through a correlation analysis, we develop a model for the efficacy of a toehold switch.
  4. Extend the scope of the toehold switch design to the detection of microRNA biomarkers for cervical cancer.

The primary step of the project was to obtain all the sequences from the scientific literature and the Registry of Standard Biological Parts that contained a toehold switch structure. A thorough literature review was performed and the toehold switch sequences with the variable that indicates their performance, the dynamic range, were obtained from various sources. The majority of the toehold switches with the dynamic range measured and experimentally validated by their ON/OFF ratios were available from Green et al. (2014) and Pardee et al (2014, 2016). It was crucial to choose this particular subset because the remaining hundred odd sequences are not characterized by any defined metric. A total of 230 toehold switch sequences with ON/OFF ratios were obtained from the scientific literature on toehold switches, a vast majority of the total number existing toehold switches.

The dot-bracket notation is a simplistic grammar for representing the secondary structure of RNA without actually visualizing a diagram. The only rules of secondary structure notation are represented in the following manner: unpaired bases by dots ‘.’ and paired bases by parentheses ‘(’ and ‘)’. Toehold switches have multiple unique domains that can be extracted by pattern matching of their dot-bracket structure. We realised that each toehold switch possesses a generic pattern of dots and brackets by virtue of their secondary structure. We wrote regular expression (REGEX) pattern matching scripts in Python and passed them to a program called RNAfold. RNAfold outputs the dot-bracket structure for an inputted sequence. It is part of the ViennaRNA suite of RNA analysis programs. We have used a number of programs from this software package to engineer most of our toehold switch features. The general grammar of the toehold switch structure is shown in the figure below. We used this as a template for our domain parsing REGEX.

Toehold Switch Grammar

Area

Description

d1 dots

The toehold domain that is unbound in its inactive state

b1 open brackets

The toehold domain based paired with the linker, forming the stem bottom

3 dots

Sequestered nucleotides opposite to AUG (part of the secondary loop)

b2 open brackets

Mini linker domain base-paired with a linker between the primary and secondary loop

d2 dots

The primary loop containing the RBS

b2 closed brackets

The mini linker between the RBS and start codon

3 dots

The start codon AUG (the secondary loop)

b1 closed brackets

Linker base-paired with the toehold domain.  This linker connects the toehold switch to a gene of interest

However, our initial REGEX script did not parse out the exact domains of the toehold switches for all the instances in our dataset, so we analysed the dot-bracket more carefully and discovered that most of the engineered toehold switches did not adhere to this grammar. Some switches included a wobble base pair between G-U in the secondary loop containing the start codon, others had base-paired toehold domains, some even had base-paired linker regions. Wobble base pairs offered better stability in the OFF state and these forward engineered designed were, as a result, less leaky in comparison to older designs of the toehold switch. We also found that a couple of unpaired bases in the top stem of the toehold switch also resulted in an erroneous match. The unpaired bases were found near the primary loop sequence and we learned that they could easily allow the unfolding of the switch in the ON state while maintaining stability in the OFF state. As a result of our observations, we rewrote our domain parser in order to account for thoughtful new designs of toehold switches. We were able to match 228 out of 230 switches. We eliminated the exceptions because: We had written an optimal domain parsing program. The dynamic ranges of the eliminated switches were quite low and we realized that they would serve as outliers in our machine learning models and skew our linear fit.

Feature Engineering

We categorized out predicting variables, or features into sequence features involving the length and composition of domains, structural features and engineered structural features. We linked our domain parsing program to match the obtained dot-bracket sequence to the nucleotide sequence so that we could slice a full toehold switch sequence into its composite domains both by secondary structure and primary composition. This way, we did not lose any information in the secondary structure while retaining the base composition. We came up with the following sequence features:

  1. toehold domain length
  2. number of GC base pairs in the top stem
  3. number of GC base pairs in the bottom stem
  4. length of the top stem
  5. length of the bottom stem
  6. length of the primary loop
  7. number of A/U rich regions before RBS

The ViennaRNA Package is a set of standalone programs as well as a comprehensive web suite available for the prediction and analysis of RNA secondary structures. The following structural features were calculated from the ViennaRNA suite:

  1. overall minimum free energy (MFE) of the toehold switch construct using RNAfold
  2. minimum free energy of the bottom region of the toehold switch using RNAcofold
  3. the ensemble diversity of the switch and the frequency of the MFE structure in an ensemble using an option in RNAfold
  4. the solvent accessibility of the entire construct
  5. the net MFE of the complex
  6. MFE of the top stem
  7. MFE of the bottom stem
  8. MFE of the RBS-linker domain
    
Preliminary Feature Selection

After engineering the above features, we did a correlation analysis amongst the features to determine if any feature is correlated to another one. This helps in the dimensionality reduction of the dataset and retains only the best of the predictor variables. After the correlation analysis, we set the correlation at R=0.85 to indicate redundancy these are the features that we retained:

  1. Overall MFE of the toehold switch construct
  2. Frequency of the MFE structure in the ensemble
  3. Net MFE (difference in energy between the MFE of the switch-trigger complex and the sum of individual MFE values of the switch and trigger)
  4. MFE of the bottom region containing the toehold domain, bottom stem and the horizontal linker
  5. MFE of the RBS-linker domain, starting from the primary loop till the end of the construct
  6. Specific heat of the toehold switch construct at 37 degrees Celsius
Multivariate regression Model

With the above feature set, different models based on different subsets of features were constructed. Based on the adjusted R2, the goodness of fit of the models was evaluated, and the model with the best metric was chosen. The feature matrix was generated with the dynamic range as the response variable for all 228 instances. The volume of this work deals with regressing the dynamic range measured by the ON/OFF ratio with the individual attributes present in the feature matrix. The multivariate regression model was constructed using Scikit learn, a Python library for machine learning. We used an 80-20 train-test split and cross-validated our model by 10-fold cross validation. The cross validated model metrics output an adj. R2 value = 0.59, implying an R value = 0.77. We also applied a log10 transform on our response variable to validate if the features better predict the log transformed ON/OFF ratio value. However, we obtained an optimal model without the log transformation of the ON/OFF ratio.Our model outperforms the only other existing multivariate model proposed by CUHK’s iGEM 2017, whose reported R2 value was 0.22.

Predictor variables and their coeffients of linear regression

S.No

Feature Name

Coefficient Value

P value

1

Overall_MFE

16.38

0.002

2

Bottom_region_MFE

-32.80

0.000

3

RBS_linker_MFE

19.20

0.000

4

Net_MFE

5.24

0.003

5

Freq_MFE_struct

-327.32

0.024

6

OVerall_SH_37

-68.21

0.000

The intercept of the model was -160.74 with a p-value of 0.067. Since the coefficient estimates are large, we thought more conservative estimates might be useful. In this regard, we performed ridge and lasso regression to detect moderation in the estimates. We also applied a ridge regression, a regularization technique to improve the final model. We found that the ridge regression did not improve the coefficients to fit the model. However, the lasso regularization technique improved the coefficient of the frequency of the MFE structure (from -327.32 to -294.23) (S.No. 5).The P values of most of our features are extremely significant (less than 0.001), and the overall model is extremely significant, given its cross-validation and robust feature selection techniques.

Toehold switch web tool

We used our regression model to validate our toehold switch design tool hosted on our university’s website. Our toehold switch design tool improves on iGEM CUHK’s (2017) and iGEM CLSB’s (2017) tool by offering second generation toehold switch design parameters and validating a switch’s dynamic range on our multivariate regression model.

For our toehold switches, we applied this regression model and the results were as follows: our model predicted 50.526 as the ON/OFF ratio for the second generation toehold switch for hsa-miR-21-5p and 100.203 (ON/OFF) for the second generation toehold switch for hsa-miR-20a-5p.

Visit our GitHub repo for the ML and modelling code:https://github.com/igemsoftware2019/SASTRA_Thanjavur

Visit our evolving toehold switch designer:https://sas.sastra.edu/rnatoedesign/

References

  1. AA Nielsen, BS Der, J Shin, P Vaidyanathan, V Paralanov, EA Strychalski, D Ross, D Densmore, CA Voigt. (2016) Genetic circuit design automation. Science 352, aac7341
  2. Alexander A. Green, Pamela A. Silver, James J. Collins, Peng Yin. (2014) Toehold Switches: De-Novo-Designed Regulators of Gene Expression. Cell 159, 925–939 Elsevier BV
  3. HM Salis, EA Mirsky, CA Voigt. (2009) Automated design of synthetic ribosome binding sites to control protein expression. Nat Biotechnol 27, 946-50
  4. Keith Pardee, Alexander A. Green, Tom Ferrante, D. Ewen Cameron, Ajay Daley Keyser, Peng Yin, James J. Collins. (2014) Paper-Based Synthetic Gene Networks.Cell 159, 940–954 Elsevier BV
  5. Keith Pardee, Alexander A. Green, Melissa K. Takahashi, Dana Braff, Guillaume Lambert, Jeong Wook Lee, Tom Ferrante, Duo Ma, Nina Donghia, Melina Fan, Nichole M. Daringer, Irene Bosch, Dawn M. Dudley, David H. O’Connor, Lee Gehrke, James J. Collins. (2016) Rapid Low-Cost Detection of Zika Virus Using Programmable Biomolecular Components. Cell 165, 1255–1266 Elsevier B
  6. J Kim, P Yin, AA Green. Ribocomputing: Cellular Logic Computation Using RNA Devices.. Biochemistry 57, 883-885 (2018)
  7. SJ Schroeder, ME Burkard, DH Turner. (1999) The energetics of small internal loops in RNA. Biopolymers 52, 157-67
  8. R Kierzek, ME Burkard, DH Turner. (1999) Thermodynamics of single mismatches in RNA duplexes. Biochemistry 38, 14214-23
  9. Andreas R. Gruber, Stephan H. Bernhart, Ronny Lorenz. (2014) The ViennaRNA Web Services. 307–326 In Methods in Molecular Biology. Springer New York
  10. Ronny Lorenz, Stephan H Bernhart, Christian Höner zu Siederdissen, Hakim Tafer, Christoph Flamm, Peter F Stadler, Ivo L Hofacker. (2011) ViennaRNA Package 2.0.Algorithms for Molecular Biology 6 Springer Nature
  11. V Vimberg, A Tats, M Remm, T Tenson. (2007) Translation initiation region sequence preferences in Escherichia coli. BMC Mol Biol 8, 100

Our Sponsors