Team:Rice/Software

Software Innovation

Most existing RNA thermometers operate optimally at around 37°C. Because most plants grow best at temperatures far below this, new RNA thermometers must be designed. They must not only be sensitive in a lower temperature range but also be specific to the constructs we design. To this end, we decided to synthesize custom RNA thermometers. To speed up the design process, we decided to design thermometers computationally in bulk rather than create each one manually through rational design.

Because designing RNA thermometers from first principles is inherently difficult, we decided to select good candidates from a pool of randomly generated sequences. Since this is an optimization problem with a large search space, we used a genetic algorithm, which produces better thermometers by "reproducing" the better-performing sequences. We wrote a Python program that recombines and reassesses a population of potential RNA thermometers when given an upper and a lower temperature bound.

NuPack is used to evaluate base pairing and secondary structure. A Python 3 library called Distributed Evolutionary Algorithms in Python (DEAP) provides the components we used to build the genetic algorithm. To parallelize the evaluation of candidates, another library, Scalable Concurrent Operations in Python (SCOOP), is used, allowing quick turnover in candidate selection. The advantage of our program over traditional design processes lies in its speed and automation. It allows a library of RNA thermometers optimized for a custom temperature range to be created and tested without requiring specialized technical expertise. For iGEM teams that need temperature-dependent components not present in the existing literature, this is an expeditious way to produce candidates for wet-lab testing.

Our software can be downloaded from GitHub.

Figure 1. The mean free energy decreases for a less stable conformation, meaning translation will be more likely to occur.

Figure 2. A screenshot of the program.

Figure 3. The top candidates were kept and underwent crossover and mutation.

General Software Procedure and Rationale

1. For each base of the complement of the RBS-containing context, create three variants in which that base is mutated to each of the other three bases. Together, these variants form the initial population. The baseline is defined as the full sequence: the context before the variable region, the variable region (the complement of the RBS-containing context), and the RBS-containing context.
2. For every pair of sequences in the population, we perform a two-point crossover with a programmatically specified probability. Then, with another programmatically specified probability, we shuffle the bases of a sequence or introduce a point mutation. After this, any duplicates are removed.
3. Running the NuPack command complexes -T TEMPERATURE -material rna -pairs -mfe -quiet (where TEMPERATURE is the temperature in °C) on an input file containing the sequence to be tested produces a number of files that contain the base-pairing probabilities and the minimum free energy secondary structure of the given sequence.
4. We use the files output by that NuPack command to calculate the difference between our two temperature bounds in the expected number of ribosome binding site bases that are bound. Since this is a probabilistic estimate, we also measure the change in the predicted minimum free energy secondary structure between the two temperature bounds. The procedures for these calculations are outlined below.
5. We perform steps 3 and 4 for every sequence in the population, generating an expected pairing difference and a secondary structure change for each.
6. A programmatically specified number of the best candidates are selected. Another portion of the population is then selected through automatic epsilon lexicase selection.
7. Steps 2 through 6 are repeated for an arbitrarily large number of generations or until we reach a local maximum and the best sequence no longer changes.
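
The population-building and variation steps above can be sketched in plain Python. This is a minimal illustration and not our actual DEAP-based code: the function names, the pairing of neighbouring sequences for crossover, and the even split between shuffling and point mutation are all assumptions made for the example.

```python
import random

BASES = "ACGU"


def single_point_mutants(variable_region):
    """Return every sequence that differs from variable_region at exactly one base."""
    mutants = []
    for i, original in enumerate(variable_region):
        for base in BASES:
            if base != original:
                mutants.append(variable_region[:i] + base + variable_region[i + 1:])
    return mutants


def two_point_crossover(a, b, rng):
    """Swap the segment between two random cut points of two equal-length parents."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def mutate(seq, rng):
    """Shuffle all bases or change a single base (50/50 split is an assumption)."""
    if rng.random() < 0.5:
        bases = list(seq)
        rng.shuffle(bases)
        return "".join(bases)
    i = rng.randrange(len(seq))
    new_base = rng.choice([b for b in BASES if b != seq[i]])
    return seq[:i] + new_base + seq[i + 1:]


def vary(population, cx_prob, mut_prob, rng):
    """Apply crossover to neighbouring pairs, mutate, then drop duplicates."""
    offspring = list(population)
    for k in range(0, len(offspring) - 1, 2):
        if rng.random() < cx_prob:
            offspring[k], offspring[k + 1] = two_point_crossover(
                offspring[k], offspring[k + 1], rng
            )
    offspring = [mutate(s, rng) if rng.random() < mut_prob else s for s in offspring]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(offspring))
```

A variable region of length n yields 3n initial candidates, for example `single_point_mutants("ACGU")` gives 12 sequences. Passing a seeded `random.Random` instance as `rng` makes a run reproducible.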

Percentage base pair optimization algorithm

The .ocx-ppairs file output by the command above lists, by position number, every base together with each base with which it has a > 0.001 probability of pairing. Find the base pairings where one of the bases is in the RBS-containing region and add up the probabilities corresponding to those base pairings. The nature of this NuPack command should prevent duplicates. Then, subtract the probability that these bases are unpaired.
Subtract the expected number of base pairings at the higher temperature from the expected number at the lower temperature. We want to maximize the number of RBS base pairings that disappear as the temperature increases from 25°C to 30°C. The resulting number is the base-pairing fitness value.
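
The calculation above can be sketched as follows. This is a minimal illustration that works on pair probabilities already read from the file, so no parsing is shown. The function names and the (i, j, probability) tuple layout are hypothetical, and the sketch assumes that an unpaired base is recorded with a partner index one greater than the sequence length.

```python
from typing import Iterable, Tuple

# (base i, base j, probability), with 1-indexed positions.
PairProb = Tuple[int, int, float]


def expected_rbs_pairing(pairs: Iterable[PairProb], seq_len: int,
                         rbs_start: int, rbs_end: int) -> float:
    """Sum pairing probabilities that involve an RBS base, minus unpaired probabilities."""
    unpaired_index = seq_len + 1  # assumption: partner N+1 marks "unpaired"
    total = 0.0
    for i, j, p in pairs:
        if j == unpaired_index:
            if rbs_start <= i <= rbs_end:
                total -= p  # probability that this RBS base is unpaired
        elif rbs_start <= i <= rbs_end or rbs_start <= j <= rbs_end:
            total += p  # a pairing with at least one base in the RBS region
    return total


def base_pairing_fitness(pairs_low: Iterable[PairProb],
                         pairs_high: Iterable[PairProb],
                         seq_len: int, rbs_start: int, rbs_end: int) -> float:
    """Lower-temperature value minus higher-temperature value; larger is better."""
    low = expected_rbs_pairing(pairs_low, seq_len, rbs_start, rbs_end)
    high = expected_rbs_pairing(pairs_high, seq_len, rbs_start, rbs_end)
    return low - high
```

A sequence whose RBS is mostly paired at the lower bound and mostly free at the upper bound scores highest under this measure.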

Secondary structure change optimization algorithm

The .ocx-mfe file output by the command above contains the dot-parentheses representation of the minimum free energy secondary structure of the given sequence at the given temperature.
Find the Levenshtein distance between the dot-parentheses strings at the two different temperatures. The Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into another. The resulting numerical value serves as a proxy for the degree of change in secondary structure between the two temperatures.
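
For reference, the Levenshtein distance between two dot-parentheses strings can be computed with the standard dynamic-programming recurrence. The sketch below is a generic textbook implementation and not necessarily the one our program uses. The function name is chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]
```

Calling `levenshtein(structure_low, structure_high)` with the two structures of one candidate gives its secondary structure change value, where `structure_low` and `structure_high` are placeholder names for the strings read at the two temperature bounds.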