Team:Thessaloniki/GeneticA

<!DOCTYPE html> <!DOCTYPE html>

Genetic Algorithm

Genetic Algorithm

Selecting Sequences for the DSD System

Choosing the sequence of nucleotides that composes each domain and toehold, is of great importance in a DSD circuit. Even for systems that seem to work perfectly at the domain level, bad sequences can lead to undesired results by affecting the time the system needs to reach equilibrium (system kinetics), the bond energy between the strands (system thermodynamics), the final consistency and the final products (system equilibrium).

The existing approach is using a computer to create a large number of random systems, and evaluate them in order to select the best. Creating the system by hand might seem easy for a simple system, however, the task is extremely resourceful, as every possible combination of the strands that form the system must be examined. To overcome these difficulties and to ensure that the DNA to be ordered from the manufacturer will compose a system with minimal deviation from the ideal, we created an algorithm that mimics the Genetic Algorithm for selecting, evaluating and improving the sequences of the system. The scripts for the Genetic Algorithm can be found in our Github. https://github.com/iGEM-Thessaloniki-2019/Genetic-Algorithm-System-Sequencing

Figure 1: The Genetic Evolution Flowchart

The general idea behind our algorithms is as follows:

(0) Choose a starting system of sequences → (1) Evaluate it → (2) Change randomly one or more nucleotides → (3) Evaluate the new system. If the new system is evaluated as better than the initial, it becomes the starting system and the algorithm begins again. Otherwise, the starting system stays the same and the algorithm continues from step (2).

The scripts make use of the NUPACK package as well as the python packages Peppercompiler, Stickydesign, Piperine and PoBT (Patterns of Binding Sites by John Letey and David Knox). Note that some of these packages use other packages not mentioned here. Additionally, our program can mutate multiple nucleotides during a generation and the users can specify the strands which they would like to be mutated.

The initial nucleotides can be designed by the user, using tools found in the bibliography, or can be created using peppercompiler. Our starting system included parts from the system proposed by Dr. Soloveichik at his publication “Programmable Chemical Controllers Made From DNA”.

The evaluation was made according to the heuristic measures proposed in the paper “Enzyme-free nucleic acid dynamical systems”. Additionally, two more custom measures were added, aiming to minimize undesired TF-DNA interaction and the variance between toehold energies. The measures are described in greater detail in the following section.

Heuristic Measures

As mentioned in the previous tab, the measures used are those described in “Enzyme-free nucleic acid dynamical systems” plus two custom measures TF score and dG Variance. The script written from us creates the inputs for the piperine functions, calculates the two custom measures and mutates the sequences. There were also modifications to the piperine algorithm for selecting the best sequence in order to add more measures. All measures used are described in the following tabs in detail.

Using PoBT the algorithm obtains every hit occurring in the system and returns a table with the score of each hit according to the Position Weight Matrix. The score can be any value between -inf up to 1, but we count as hits the ones above 0.35. The ‘TF score’ value is the sum of the scores of every desired hit, minus the sum of the scores of every undesired hit.

TF score

A score showing how close we are to the system that contains the least non desired sites for protein - DNA interaction. Higher score indicates less undesired binding sites.

For all possible combinations between two signal strands SS1,SS2 of the system, we calculate the concentration of the complexes SS1:SS1, SS1:SS2, SS2:SS2 at equilibrium considering that the starting concentrations were cs = 10-6M. We call the sum of these concentrations, multiplied by 100 and divided by c=2*10-6, I(Si,Sj). TSI average is the sum of I(Si,Sj) coming from every possible pair of signal strands. TSI max is the maximum I(Si,Sj).

TSI avg

The Average percentage of Interactions between any two signal strands

TSI max

The Maximum percentage of Interactions between any two signal strands

For all possible combinations between a toehold complementary t1* and a part of signal strand or single strand SS1 that does not contain t1 of the system, we calculate the concentration of the complexes SS1:SS1, SS1:t1*, t1*:t1* at equilibrium considering that the starting concentrations were cs = 10-6 M. We call the sum of these concentrations divided by 2*10-6, F(t,S). TO average is the sum of F(t,S) coming from every possible pair of signal strands. TOI max is the maximum F(t,S).

TO avg

The Average percentage of interaction between a toehold complementary and the largest part/parts of strands not containing the toehold

TO max

The Maximum percentage of interaction between a toehold complementary and the largest part/parts of strands not containing the toehold

For two distinct branch migration domains, which are part of the set of branch migration domains {Ri}, the largest sequence of nucleotide matches, s,s’ , is identified. A weight of 1 is assigned for a match of 5 nucleotides, a weight of 2 for 6 nucleotides, a weight of 4 for 7 nucleotides and so on. The sum of those weights for every pair of distinct branch migration domains is called WS-BM and the maximum value among them is called Max-BM.

WS-BM

The weighted sum of subsequence matches. The largest identified matches between distinct branch migration domains, are assigned a weight analogously to their length.

Max-BM

The Maximum weighted sub-sequence matches. The largest identified matches between distinct branch migration domains, are assigned a weight analogously to their length.

In a solution containing a signal strand with initial concentration 10-6M, the fraction of the population of each nucleotide of the signal strand to be unpaired in the equilibrium, is called p_unpaired. SSU average is the average between all p_unpaired for all signal strands and SSU min is the minimum value among them.

SSU min

The Minimum among the probabilities of nucleotides to be unpaired.

SSU avg

The Average of the probabilities of a nucleotide to be unpaired.

In a solution containing a signal strand with initial concentration 10-6M, he fraction of the population of each nucleotide within a toehold of the signal strand to be unpaired in the equilibrium, is called SSTU. SSTU average is the average between all SSUs for all signal strands and SSTU min is the minimum value among them.

SSTU min

The Minimum among the probabilities of nucleotides within toeholds to be unpaired.

SSTU avg

The Average of the probabilities of a nucleotide within a toehold to be unpaired.

BN% measures how well does a complex of the system form at equilibrium. The calculation is done for 1 μM of each component strand. This well-forming, is tied to the structural defect, which refers to the nucleotides that form incorrect pairs within their complex. The measures related with BN% were not a part of the repeated algorithm because they are time-consuming. The two terms in the numerator are values for structural and concentration defect. The result of these two summed and divided by the target nucleotide concentration, gives the fraction of the nu- nucleotide expected to form incorrect base-pairing states.

BN% max

The maximum fraction between the nucleotides that are expected to form incorrect base-pairing states in the system.

Max Defect Component

The complex that has the nuclotide with the maximum BN% score

BN% avg

The average value between fraction of nucleotides expected to form incorrect base-pairing states in the system.

For two distinct strands, the non intended sub-sequences of nucleotide matches are identified. As weight, the length of the match minus the 5 is assigned for a match from 6 to 12 nucleotides and weight of 7 for 12 or more nucleotides. The sum of those weights for every pair of distinct strands is called WSIS. Additionally, the summed weights of sub-sequence matches that have a one base mismatch is called WSIS-M

WSIS

The weighted sum of unintended sub-sequence matches between strands. The identified matches between strands, are assigned a weight analogously to their length and these weights are summed.

WSIS-M

The weighted sum of unintended sub-sequence matches between strands that have one base mismatch. The identified matches between strands, are assigned a weight analogously to their length and these weights are summed.

For a strand, the non intended sub-sequences of nucleotide matches are identified. As weight, the length of the match minus the 5 is assigned for a match from 6 to 12 nucleotides and weight of 7 for 12 or more nucleotides. The sum of those weights for every strand is called WSAS. Additionally, the summed weights of sub-sequence matches that have a one base mismatch is called WSAS-M

WSAS

The weighted sum of unintended sub-sequence matches within a strand. The identified matches, are assigned a weight analogously to their length and these weights are summed.

WSAS-M

The weighted sum of unintended sub-sequence matches within a strand that have one base mismatch. The identified matches, are assigned a weight analogously to their length and these weights are summed.

Spurious checks for matches within the same strand (i=1), matches between different strands of the same complex (i=2) and between different complexes (i=3). The number of undesired subsequence matches of length k is given by the elements of matrix C(i,k) and the number of matches that contain one mismatch is given by the matrix C1(i,k). The number of each match C(i,k) or C1(i,k) is being multiplied by a factor b^x, that is exponentially increases with the match length. b is equal to 0.5. Finally, the result is being multiplied by spurious_i, an initial penalty for the unintended match which differs for the different types of matches.

Verboten

The weighted sum of four specific sequences, that often cause structural faults described in the supplementary material of “Enzyme free nucleic acid dynamical system”.

Spurious

The weighted sum of subsequences that are complementary or partially complementary but shouldn't be. These interactions are called Spurious.

After calculating the toehold energy for each toehold in the system with the help of stickydesign energy model EnergeticsBasic(), the variance, dG Variance, of the set is calculated. Additionally, dG Range is the the difference between the minimum and the maximum toehold energy.

dG Variance

The Variance of all toehold binding energies.

dG Range

The absolute difference between minimum and maximum toehold binding energy.

Choosing the best system

In order to choose which system is better, between the initial and the mutated one, we used an evaluation method from the python package Piperine. As some measures need to be minimum for a better system, we invert them for easier calculations. So, SSU, SSTU and TF_score are multiplied with -1, so that they also improve when decreasing.

The next step is to compare each measure’s value between the two systems. Initially, the minimum between the two values is identified. Then, the minimum value is subtracted from each value, M (M_old, M_new).

Then, the same difference divided with the two values is being calculated. These two measures indicate how much the value has changed.

Until now it has been clear which system is better for each measure and how much this measure is affected by the mutations. The next challenge is to determine if the overall measures in the mutated system have improved enough compared to the initial system. In general, some measures are improving while others are deteriorating. So, determining to keep or not to keep the mutated system means that some measures are favored at the expense of others. To make such decisions some meta-ranks are introduced. The metascores make extensive use of importance weights of each measure, which are calculated through experiments. The weights we are using are provided by the package piperine and were calculated in Caltech. We have set the weight for the TF score to 2, a pretty important weight, as the minimization of undesired DNA-Protein interaction is crucial for the project. Additionally, given the fact that this interaction could make the toolkit dysfunctional, a mutated system does not replace the initial if the mutated TF score is lower than the initial.

Firstly, a matrix with two lines is created. If a measure in one system is greater than in the other, its value in its system line is 1 and in the other line is 0. If the values are equal in both systems, then both are 0. This matrix looks like the following and it is called matrix R.

Set Index Tf_score TSI avg TSI max ... Verboten Spurious dG Variance dG Range
Initial 1 1 0 ... 0 1 0 0
Mutated 0 0 0 ... 1 0 1 1

Furthermore, a matrix like the one above is created for the f (Eq. 1) values and a matrix for the p (Eq. 2) values of each measure.

The following scores are the metascores that we use.

Metascores Initial System Mutated System Example Values
R matrix MR_1 [1, 1]
MR_2 [0.3, 0.2]
MR_3 [3, 11]
MR_4 [0.43, 0.82]
F matrix M_F1 [0.0539, 2.1367]
M_F2 [0.0051, 0.1111]
P matrix M_P1 [300.0, 1100.0]
M_P2 [43.0, 82.0]

Using the metascores, a new matrix is created that is similar to matrix R. The new matrix, MR, contains two lines with each element being equal to the metascore of each system. If a measure is greater in one system than in the other, its value in its system’s row is 1 and in the other row is 0. These values are named metaranks.

Finally the best system is the one with the lowest sum of metaranks.

Software

The scripts composing the Genetic Algorithm program are used to mutate one or more nucleotides in a existing DNA Strand Displacement system or a specific domain, to reevaluate the system and to choose to keep the mutated system or discard it.

In order to use the software, the user has to put some input files and make some small changes to the code as well as install the software dependencies described here. https://github.com/iGEM-Thessaloniki-2019/Genetic-Algorithm-System-Sequencing

Creating the system files

First of all the users have to know the system they want to create the sequences for. This system should be described by a .comp file like those in "Genetic_Algorithm/Examples/P65_example/System" and "Genetic_Algorithm/Examples/ELK1_example/System". Note that if the user does not intend to put restrictions for specific nucleotides, the .comp nucleotide restriction has to be H for the upper strands, which indicates that only A,T,C nucleotides are allowed (ex. sequence tb = "H"). Otherwise the .comp nucleotide restriction has to be in accordance with the user's restriction alphabet.

The system will be initially imported in peppercompiler, so a .sys file is also needed. Examples of .sys files can be found in "Genetic_Algorithm/Examples/P65_example/System" and "Genetic_Algorithm/Examples/ELK1_example/System". The .comp and the .sys files have to be in the file Genetic_Algorithm/System and the .sys file has to be named systemsys.sys. The .comp file has to have a different name.

Once those files are ready the users are ready to create the system.pil file which will contain the user's restrictions on the nucleotides. This is done with the following command in the terminal:

$ pepper-compiler systemsys.sys --output system.pil

The users can open the system.pil and put nucleotides in the wanted positions. Those nucleotides will never be mutated through the Genetic Algorithm runs. A file named system.save will be also produced by the above command.

About the TF files, two files are needed. The first contains data about the Position Frequency Matrix for the TF. To create the file the users have to search for their Transcription Factor in Jaspar and download the RAW PFM file or they just have to create a file similar to the one provided by JASPAR. This file must be renamed to TF.pfm and in its first line should be edited in order to contain only the four letters, that declare the order of rows providing the data for each nucleotide of the PFM, for example ACGT. Examples of TF.pfm can be found in "Genetic_Algorithm/Examples/P65_example/System" and "Genetic_Algorithm/Examples/ELK1_example/System". The second file defines the intentend positions of the binding sites. The file must be named tf.config and its first line must be the number of those intended positions, so if there are 2 intended positions, the first line is: Hits = 2. Every intended hit, is defined by the name of the strand and the position of the first nucleotide of the binding site. Note that counting for the position starts from 0. The name of the strands are found in "Genetic_Algorithm/PoBT-master/src/data". Examples of tf.config files are found in Genetic_Algorithm/Examples/P65_example/System" and "Genetic_Algorithm/Examples/ELK1_example/System".

Those five files, systemsys.sys, .comp, system.pil, TF.pfm and tf.config must be in the System directory.

Creating the initial sequences

This step is not necessary but is strongly recommended. Using the system.pil and system.save the user can create sequences containing the specified restrains, if any. It can be done by using the following commands:

$ pepper-design-spurious system.pil
$ pepper-finish system.mfe

Two new files are produced system.mfe and system.seqs. Now the initial sequences are produced but the initial files are not ready yet. All files should now be copied to a different directory. The users now open the system.seqs file and copy all lines containing the sequences for the toeholds and the domains. Next, a file .fixed is created by the user and the copied text is pasted. Examples of .fixed files can be found in Genetic_Algorithm/Examples/ELK1_example/ELK1.fixed and Genetic_Algorithm/Examples/ELK1_example/p65.fixed. By running the following commands, the initial sequences are created: $ pepper-compiler systemsys.sys --fixed name.fixed $ pepper-design-spurious system.pil $ pepper-finish system.mfe. The file needed for the initial sequences is the system.pil. This file must be renamed to current.pil and must be placed in the Best_Species directory.

If no initial sequences are given, the software creates a .pil with random sequences, respecting the user specified restrictions.

Running on a different computer or different dna system

In the Process.sh the users have to change the lines specifying the NUPACK 3.0.6 directory. Those lines are 21, 23 and 29. More specifically the users have to change the "/home/lamphs/Desktop/iGem/dry_lab/packages/nupack3.0.6/nupack3.0.6" with the path to their computer's installation of NUPACK 3.0.6 .

Another important change in done in the flag --sig of the EvalutionInputs.py. This flag must followed by the names of the system's signal strands. The lines 91 and 81 must look like this $ python Genetic_Algorithm/EvaluationInputs.py --sys System/systemsys -bn Best_Species/current -sig solo-Dtg solo-Btd solo-Ntb solo-tbB solo-taA solo-Btr solo-Rtq solo-Ctb solo-Itc solo-tcC .

Using the software

Once all above are ready, the users can run the software by opening the terminal in Genetic_Algorithm/ and typing the following command:

$ bash Process.sh

From here the user should specify the NUPACK 3.0.6 path, the number of generations and the number of mutations in each generation, following the instructions.

The user can specify a particular domain to be mutated by adding the flag --specific_mut with the domain in the line 87 of the Process.sh file, for example $ python Genetic_Algorithm/SpecMutate.py -f $j --specific_mut solo-b.

In usual runs, the BN% measures are not calculated due to their excessive computational time. If the users want to calculate the BN% measures the have to put the --ted or -t flag in the lines 91 and 81, of the Process.sh. Specifically, the line should be $ python Genetic_Algorithm/EvaluationInputs.py --sys System/systemsys -bn Best_Species/current -t -sig signal1 signal2 signalN.

Calculating the toehold energies

The toehold energies can be calculated by using the script Toeholds_Energies/toeholds.py. The users have to open toeholds.py, and change the array b. In this array the users have to import the toehold sequences. The binding energy of each toehold, the mean value of the energies, the energy variance, the energy range and a heatmap are printed in the terminal by running the following command:

$ python toeholds.py

Results

As mentioned before, we designed a custom system based on previous work and used the Genetic Algorithm in order to improve its characteristics. The Genetic algorithm is unable to produce a worse system as long as the evaluation measures work as expected. However, as we discovered, the weight assigned to every score plays an important role. Tuning the weights of the scores is a difficult procedure and needs a lot of experimenting, as every score-characteristic is favored at the expense of another. The Genetic algorithm did improve our initial system. As the scores keep improving, the computer needs more time to find a better system. The final sequences are the ones used in our experiments. In the following tables the evaluation measures of the system we used can be found.

The DNA sequences ordered from the manufacturer for the p65 system were created using an older version of the Genetic Algorithm. The initial sequences were based on the Soloveichik system [2], which had low Protein-DNA interaction for the p65 homodimer. The final DNA sequences can be found in the github examples in the file current.seqs https://github.com/iGEM-Thessaloniki-2019/Genetic-Algorithm-System-Sequencing/tree/master/Genetic_Algorithm/Examples/P65_example/Best_Species.

The scores for the system used for p65 Homodimer is the following:

Table I: Evaluation Measures for the p65 system.
Set Index Tf_score TSI avg TSI max TO avg TO max WS-BM Max-BM SSU min SSU avg SSTU min SSTU avg
p65 -291.970 0.375 0.883 0.244 0.394 33 7 0.301 0.895 0.348 0.920
BN% max Max Defect Component BN% avg WSAS WSIS WSAS-M WSIS-M Verboten Spurious dG Variance dG Range
4.794 tcC_Ctc 3.255 17 304 62 1912 518.5 559.64 0.132 1.030
Figure 2: Toehold energies for the p65 system. The heatmap describes the interaction of each toehold with its complementary and with other toeholds.

The scores for the system used for ELK1 were better than those of p65 at most measures, because a better and more recent version of the Genetic Evolution was used. This system was designed to have the same reporter sequences with the previous one, for financial reasons. The DNA sequences can be found in the github examples in the file current.seqs https://github.com/iGEM-Thessaloniki-2019/Genetic-Algorithm-System-Sequencing/tree/master/Genetic_Algorithm/Examples/ELK1_example/Best_Species. The final measures are demonstrated in the following table:

Table II: Evaluation Measures for ELK1 system.
Set Index Tf_score TSI avg TSI max TO avg TO max WS-BM Max-BM SSU min SSU avg SSTU min SSTU avg
ELK1 -289.472 0.568 2.566 0.333 0.613 39 7 0.318 0.900 0.609 0.927
BN% max Max Defect Component BN% avg WSAS WSIS WSAS-M WSIS-M Verboten Spurious dG Variance dG Range
4.983 solo-tcC_Ctc 2.960 30 534 21 1837 382 1581.204 0.459 1.680



The mean toehold energy is 4.22 kcal/mol. As for the toehold energies, the following heatmap describes the strength that each toehold binds with its complementary and with the other toeholds. In general, the binding of each toehold to its complementary differ from the other interactions significantly, as intended.

Figure 3:Toehold energies for the ELK1 system. The heatmap describes the interaction of each toehold with its complementary and with other toeholds

References

[1] Srinivas S., Parkin J., Seelig G., Winfree E., Soloveichik D. (2017) Enzyme-free nucleic acid dynamical systems.

[2] Chen Yuan-Jyue, Dalchau Neil, Srinivas Niranjan, Phillips Andrew, Cardelli Luca, Soloveichik David and Seelig Georg (2013) Programmable chemical controllers made from DNA.

[3] Piperine https://github.com/DNA-and-Natural-Algorithms-Group/piperine

[4] Evans C. G., Winfree E. (2013) DNA Sticky End Design and Assignment for Robust Algorithmic Self-assembly. [5] Stickydesign https://github.com/DNA-and-Natural-Algorithms-Group/stickydesign

[6] Wolfe B. R., Porubsky N. J., Zadeh J. N., Dirks R. M., Pierce N. A. Constrained Multistate Sequence Design forNucleic Acid Reaction Pathway Engineering

[7] NUPACK http://nupack.org/

[8] Peppercompiler https://github.com/DNA-and-Natural-Algorithms-Group/peppercompiler

[9] Letey J., D. Knox Patterns of Binding SiteoBT https://github.com/johnletey/PoBT