Difference between revisions of "Team:uOttawa/Software"

Line 3: Line 3:
  
 
<html>
 
<html>
 +
 +
<script>
 +
 +
function r(){
 +
 +
document.getElementById("hiddenRCode").style.display="block";
 +
}
 +
 +
function python(){
 +
 +
document.getElementById("hiddenPythonCode").style.display="block";
 +
}
 +
</script>
 +
 +
<style>
 +
#hiddenPythonCode{
 +
 +
display:none;
 +
}
 +
 +
#hiddenRCode{
 +
display:none;
 +
}
 +
</style>
  
 
<h1> Software </h1>
 
<h1> Software </h1>
Line 8: Line 32:
 
<p> With the hope to help the students of the team with the design of primer, I tried to develop a software to do primer design. There is still work to do on it, but I think what I did could help some others students with the goal to write a software like this ! They could use it as a draft or as a page with links to resources.
 
<p> With the hope to help the students of the team with the design of primer, I tried to develop a software to do primer design. There is still work to do on it, but I think what I did could help some others students with the goal to write a software like this ! They could use it as a draft or as a page with links to resources.
 
  You can click on this button to get all the code I wrote in Python : </p>
 
  You can click on this button to get all the code I wrote in Python : </p>
<button> Python </button>
+
<button onclick="python()"> Python </button>
  
<p id="hiddenPythonCode"> </p>
+
<p id="hiddenPythonCode">test </p>
  
 
<p> And on this one to get the code in R: </p>
 
<p> And on this one to get the code in R: </p>
<button> R </button>
+
<button onclick="r()"> R </button>
  
<p id="hiddenRCode"> </p>
+
<p id="hiddenRCode">test </p>
  
 
<p> So, in this page, I would like to present this software. The chapter 0 explains how to use it, the chapter 1 and 2 how it was programmed and the chapter 3, what would be to improve next.  </p>
 
<p> So, in this page, I would like to present this software. The chapter 0 explains how to use it, the chapter 1 and 2 how it was programmed and the chapter 3, what would be to improve next.  </p>
Line 418: Line 442:
 
<img src="https://2019.igem.org/wiki/images/thumb/d/d9/T--uOttawa--softwareTable1.png/459px-T--uOttawa--softwareTable1.png"/>
 
<img src="https://2019.igem.org/wiki/images/thumb/d/d9/T--uOttawa--softwareTable1.png/459px-T--uOttawa--softwareTable1.png"/>
 
<img src="https://2019.igem.org/wiki/images/thumb/8/8c/T--uOttawa--softwareTable2.png/484px-T--uOttawa--softwareTable2.png"/>
 
<img src="https://2019.igem.org/wiki/images/thumb/8/8c/T--uOttawa--softwareTable2.png/484px-T--uOttawa--softwareTable2.png"/>
 
+
<img src="https://2019.igem.org/wiki/images/thumb/d/dd/T--uOttawa--softwareTable3.png/447px-T--uOttawa--softwareTable3.png"/>
 +
<img src="https://2019.igem.org/wiki/images/thumb/1/14/T--uOttawa--SoftwareTable4.png/430px-T--uOttawa--SoftwareTable4.png"/>
  
  

Revision as of 02:59, 22 October 2019

logo of uottawa
Menu

Software

With the hope to help the students of the team with the design of primer, I tried to develop a software to do primer design. There is still work to do on it, but I think what I did could help some others students with the goal to write a software like this ! They could use it as a draft or as a page with links to resources. You can click on this button to get all the code I wrote in Python :

test

And on this one to get the code in R:

test

So, in this page, I would like to present this software. The chapter 0 explains how to use it, the chapter 1 and 2 how it was programmed and the chapter 3, what would be to improve next.

0 - How to use the programm

  • Copy-paste the whole python code in repl.it ;
  • Replace the string contained in the variable seqDNA (line 807) with the total sequence of DNA that is going to enter in the PCR;
  • Replace the string contained in the variable target (line 809) with the part of the total sequence of DNA you want to duplicate (and so, the primers must be around it)
  • Run the program
  • Wait for a dozen of minutes
  • There should appear a list of suggested primers, with the one at the bottom being the most recommended

I - Code of each constraint

So, each primer must respect a list of constraints. This list varies from website to website, but in general, we can find something like written below. (I mainly took my informations from this website : http://bioweb.uwlax.edu/GenWeb/Molecular/seq_anal/primer_design/primer_design.htm who himself adapted it from Innis and Gelfand,1991; I changed some value if I was seeing a lot of website using different ones). The coding of some of those constraints is less obvious than other, so under them, I am writing more indications !

  • 1. primers should be 17-28 bases in length;
  • 2. The GC content should be 50-60%
  • 3. primers should end (3') in a G or C, or CG or GC:
  • 4. The melting temperature should be between 52 and 58 degree Celsius
  • They are several ways to calculate the melting temperature of a sequence of DNA, like the basic rule and the nearest-neighbour method. (For the different formulas, see Alejandro Panjkovich, Francisco Melo, Comparison of different melting temperature calculation methods for short DNA sequences, Bioinformatics, Volume 21, Issue 6, 15 March 2005, Pages 711–722, https://doi.org/10.1093/bioinformatics/bti066 at the paragraph melting temperatures calculation ; for an explanation of how the nearest-neighbour method works, see Albert Courey, Nearest-Neighbour method, https://www.youtube.com/watch?v=jy5h9NwBEgk ; ). The formula for this method differs a bit in every article I saw. This might be caused for example, by the use of different experimental data from one place of to another; some use Breslauer tables, some use SantaLucia, ... (see table 1 of this article ). Also, I think sometimes I was seeing differences from one table to an other what were probably errors of copy - so I think it's a good time to think that a good science principle is to double-check what is being said (especially double-checking what an undergraduate student like me says too !) and try to get as many points of view and references as possible ! In conclusion, this is the formula I used :

    Tm = (-H/(A-S/1000+R*(math.log(gamma/4))))+ T0 + 16.6*math.log10(Na)

    With

    • Tm = melting temperature in °C
    • A = constant of -0.0108 kcal K-1 ᐧ mol-1 (accounts for helix initiation during annealing / melting)
    • R = gas constant of 0.001987 kcal K-1 ᐧ mol-1 (constant that scales energy to temperature)
    • T0 = - 273.15, to pass from Kelvin to Celsius
    • Na = [Na+] = sodium ion concentration in M or mol L-1 (we use 0.05, i.e. 50 mM)
    • gamma = 50*10**(-9) = dimensionless (molar) concentration
    • t = -21.6 = empirical temperature correction in Celsius

    Those values were mainly taken from : here

    The S and H variables where found with an algorithm constructed following the tutorial I recommended above ( video ).

    • S = ΔS = entropy change in kcal K-1 ᐧ mol-1 (accounts for energy unable to do work, i.e. disorder)
    • H = ΔH = enthalpy change in kcal mol-1 (accounts for the energy change during annealing / melting)

    This is the python program I wrote:

    
    def delta_H_S(p1,p2):
     """
     A function that returns the delta_H (free enthalpy in kcal/mol) and the delta_S (entropy in cal/K) of a pair of nuclotides, using the tables from Breslauer (1986)
     
     Parameters
     ----------
     p1 : str
       a one-letter string ("A", "T", "C" or "G") to represent the first nucleotide
     p2 : str
       a one-letter string to represent the second nucleotide of the pair
     """
     
     p3 = str(p1) + str(p2) #concatenate the two nucleotides
     
     if(p3 == "AA" or p3 =="TT"):
       H = 9.1
       S = 24.0
     elif(p3 == "AT"):
       H = 8.6
       S = 23.9
     elif(p3 == "TA"):
       H = 6.0
       S = 16.9
     elif(p3 == "CA" or p3 == "TG"):
       H = 5.8
       S = 12.9
     elif(p3 == "GT" or p3 == "AC"):
       H = 6.5
       S = 17.3
     elif(p3 == "CT" or p3 == "AG"):
       H = 7.8
       S = 20.8
     elif(p3 == "GA" or p3 == "TC"):
       H = 5.6
       S = 13.5
     elif(p3 == "CG"):
       H = 11.9
       S = 27.8
     elif(p3 == "GC"):
       H = 11.1
       S = 26.7
     elif(p3 == "GG" or p3 == "CC"):
       H = 11.0
       S = 26.6
     #this is working for oligo
     #return H, S/1000
     return H, S
     
    def sum_deltas_H_S(x):
     """
     A function that returns the total enthalpy (Htot) and entropy (S) of a sequence of DNA, summing the delta_enthalpy (h) and delta_entropy (s) of each pair of nucleotides. For example, to calculate H of "AATTGCCG" we add up h of "AA", h of "AT", h of "TT", ...
     
     
     Parameters
     ----------
     x : str
       the sequence of DNA from who enthalpy and entropy is calculated
     
     """
     H, S = 0,0
     for i in range(0, len(x)-1):
       h, s = delta_H_S(x[i],x[i+1])
       #print("h = ", h, "s = ", s)
       H = H + h
       S = S + s
     #print("H = ", H, "S = ", S)
     return H, S
     
    

    I don't achieve to retrieve the file right now (slow internet connection)- but I did compare the result I had with my programm to other programms, and it was well in the average.

  • 5. 3'-ends of primers should not be complementary (ie. base pair), as otherwise primer dimers could be form (when primer stick together instead of sticking to the sequence that must be replicated).
  • 6. primer self-complementarity (ability to form 2o structures such as hairpins) should be avoided;

    Due to low internet connections, I don't achieve to check it, but I think the link of the article from where I took my algorithms was this one

     len(y)): #it's easier to code if x is always smaller than y
       x,y = y,x
     n,m = len(x), len(y)
     for i in range(m-1):
       y = "o" + y + "o" #to not provoke an error
     for k in range(m-1,3*(m-1)+1):
       tot = 0
       #print("For km = ", k-2*(m-1), "soit k = ", k)
       for i2 in range(0, n):
         #print(x[i2], y[k-i2])
         tot = tot + s(x[i2],y[k-i2])
       #print("-> ", tot)
       align.append(tot)
     #print(align)
     return align
     
     
    def Sprime(x,y):
     """this function returns the number and strengh of bonds (higher score, higher probability to bind)between two sequences in an ininterrupted way from the start in different alignements"""
     #the same code than for S except that you break if there is an interruption in the bond making
     x = x.upper()
     y = y.upper()
     align = []
     if(len(x) > len(y)):
       temp = x
       x = y
       y = temp
     n,m = len(x), len(y)
     for i in range(m-1):
       y = "o" + y + "o"
     for k in range(m-1,3*(m-1)+1):
       tot = 0
       print("For km = ", k-2*(m-1), "soit k = ", k)
       for i2 in range(0, n):
         bond = s(x[i2],y[k-i2])
         print(x[i2], y[k-i2])
         print("k-i2 = ", k-i2)
         if(bond == 0 and y[k-i2] != 'o'): #the first bond between the two sequences has to score more than 0
           break
         tot = tot + bond
       print("-> ", tot)
       align.append(tot)
     #print(align)
     return align
     
    def pa(p,q):
     """so pair annealing"""
     return max(S(p,q))
     
    def pea(p,q):
     """so pair-end annealing"""
     return max(Sprime(p,q))
     
    #why is that working like that, who knows
    def sea(seqDNA):
     """self-end annealing"""
     return max(Sprime(seqDNA[::-1],seqDNA))
     
    def selfComplementarity(seqDNA):
     """self annealing"""
     return max(S(seqDNA, seqDNA))
    
    
    

    On R:

    
    s <- function(a,b){
      if((a == "A" & b == "T") | (a == "T" & b == "A")){
        return (2)
      }
      else if((a == "C" & b == "G") | (a == "G" & b == "C")){
        return (4)
      }
      else{
        return (0)
      }
    }
    
    
    #this function returns the number of bonds (higher score, higher probability to bind)between two sequences
    S <- function(x,y){
      align <- list()
      if(nchar(x)>nchar(y)){
        temp <- x
        x <- y
        y <- temp
      }
      n <- nchar(x)
      m <- nchar(y)
      
     
      for(i in 1:(m-1)){
        y <- paste("o", y,"o", sep="")
      }
      xsplit <- unlist(strsplit(x, split=""))
      ysplit <- unlist(strsplit(y, split=""))
      i<-1
      for(k in (m-1):(3*(m-1))){
        tot <- 0
        for(i2 in 1:n){
          #print(c(xsplit[i2], ysplit[k-i2+2]))
          tot <- tot + s(xsplit[i2],ysplit[k-i2+2])
          #print(c("->", tot))
        }
        #print(c("TOT", "->", tot))
        align[[i]] <- tot
        i <- i +1
      }
      return(align)
    }
    
    
    
    Sprime <- function(x,y){
      align <- list()
      if(nchar(x)>nchar(y)){
        temp <- x
        x <- y
        y <- temp
      }
      n <- nchar(x)
      m <- nchar(y)
      
      
      for(i in 1:(m-1)){
        y <- paste("o", y,"o", sep="")
      }
      xsplit <- unlist(strsplit(x, split=""))
      ysplit <- unlist(strsplit(y, split=""))
      i<-1
      for(k in (m-1):(3*(m-1))){
        tot <- 0
        for(i2 in 1:n){
          #print(c(xsplit[i2], ysplit[k-i2+2]))
          bond <-  s(xsplit[i2],ysplit[k-i2+2])
          if(bond == 0 & ysplit[k-i2 +2] != 'o'){
            break
          }
          tot <- tot + bond
          #print(c("->", tot))
        }
        #print(c("TOT", "->", tot))
        align[[i]] <- tot
        i <- i +1
      }
      return(align)
    }
    
    pa <- function(p, q){
      return(max(unlist(S(p,q))))
    }
    
    pea <- function(p,q){
      return(max(unlist(Sprime(p,q))))
    }
    
    sea <- function(seqDNA){
      return(max(unlist(Sprime(stri_reverse(seqDNA), seqDNA))))
    }
    
    selfComplementarity <- function(seqDNA){
      return(max(unlist(S(seqDNA, seqDNA))))
    }
    
    

  • 7. runs of three or more Cs or Gs at the 3'-ends of primers may promote mispriming at G or C-rich sequences should be avoided.

II - Select the best primer respecting all the constraints

There is the case where it’s impossible to find a primer that respects perfectly all those constraints. In this case, we have to choose a primer that respects the most important, because it’s still possible that the experiment works, if for example, the primer is one nucleotide too long according to some website. So I decided to use a score system, with the primer with the lowest score being the best. For example, let’s say that the primer is 10 bp length, which is bad so, he would get a +10 score; if the primer is 15 bp length, it’s still bad, but less, so he would get a +5 score; and if it’s 20 length, then it’s good, so the primer would get +0 points. For problems that are too bad, put a +1000 should eliminate the primer for sure To choose what points to assign to what constraints, I saw two methods :

  • 1 - using tables with key points : I was thinking of the system of graph editor used in 3d animation software, for example like in Blender : https://docs.blender.org/manual/en/latest/editors/graph_editor/fcurves/introduction.html. You choose a few key points and the computer calculate the images in between those images. You can also choose how the computer calculates those in between (interpolation mode), like linearly, exponentially, constant.
  • 2 - using functions ( A function F that to a value X attributes a value Y with X being the characteristic of the actual sequence of DNA and Y the penalty it returns. For example, for the number of repetitions in a sequence : (the following part is taken from an e-mail my Professor sent me)
    pMax = maximal penalty
    N = number of repetitions in sequence (e.g., N=4 for AAAA or GGGG and for AGAGAGAG, GAGAGAGA)
    n = function parameter
    K= function parameter
    
    p = pMax * N^n / (K^n + N^n)
    
    With reasonables values being:
    pMax = 100
    K = 3 (penalty is 50 for 3 repeats)
    n = 4
    (end of the part my Professor sent me).

Here is a comparison of the two methods:

because of time of uploads and the fact I have 8 picture in this section, I do not upload this section on the website

Using the key point method, this is the table I wrote. I choose the score points from an average of what is the worst in PCR that I got after reading several articles - but a biologist that have actually experience in the wet lab could probably provide a better point system.

TABLE 1 : CONSTRAINT BASED ON THE SINGLE PRIMER ITSELF (FORWARD OR REVERSE)

IV - What stays to improve:

The most important thing to do would be to check that this software give the same results than other primer design software or manual selection ! Maybe resources as this could help : Kumar A., Chordia N. (2015) In Silico PCR Primer Designing and Validation. In: Basu C. (eds) PCR Primer Design. Methods in Molecular Biology, vol 1275. Humana Press, New York, NY Then, it would be great to improve the documentation, double-check everything I wrote and find an academic scientific proof for the values and formulas I found on the internet ! (I wrote this article on the deadline day on the 21 October 2019 with pieces of what I wrote and what I remembered from August 2019; so although I think this page could be of use for future developers, mistakes are possible !). Create a more user friendly environment. In R, the pist I have would be to use Shiny. In Python, there is Tkinter. Solve the time problems (see Big O notation). Given the time my software takes to give an answer, I probably wrote a horrible algorithm of brutal force ! Here is the part that is probably taking the most time :

  def findPrimer(seq, direc = "F"):
 """A function that takes in entry the whole sequence of DNA and the direction of the primers we are looking for (by default "F" for forward) and returns a list of the possibles primers"""
 
 allP = [] #contains all the primers (of all lengths)
 bpMin = 15 #mininum base length
 bpMax = 35 # maximum base length
 
 if bpMax > len(seq.seqDNA): #we don't want to have this absurdity of a primer bigger than a sequence - so in case the sequence provided by the user is very short, we have to go down with the length of the primer
   bpMin = 0
   bpMax = round(len(seq.seqDNA)/3)
 for bp in range(bpMin, bpMax):
   #we test for the primers of all lengths possibles
   allbpP = [] #contains all the primers of length bp
   if direc == "F":
     for i in range(seq.start, seq.target_start - bp, 1): #the first forward primer will have his start on the start of the whole DNA sequence  - and the last primer will have his end on the start of the target sequence that we don't want to bind on
       p = Primer(seq.seqDNA[i:i+bp])
       allbpP.append(p)
   elif direc == "R":
     for i in range(seq.end, seq.target_end + bp, -1):
       p = Primer(seq.seqDNA[i-bp:i])
       allbpP.append(p)
   allP.append(allbpP)
 print("allbpP = ", allbpP)
 return allbpP
 
def findPair(allF, allR, seq):
 """
 A function that takes in entry all the possible forwards (allF) and reverse (allR) primers in the sequence (seq) and returns the list of all the possible pairs of primers sorted by penalties score
 """
 #allF, allR = sorted(allF1), sorted(allR1)
 #took the following code from here : https://stackoverflow.com/questions/403421/how-to-sort-a-list-of-objects-based-on-an-attribute-of-the-objects/48731059
 try: import operator
 except ImportError: keyfun= lambda x: x.score # use a lambda if no operator module
 else: keyfun= operator.attrgetter("score") # use operator since it's faster than lambda
 allF.sort(key=keyfun, reverse=True) # sort in-place
 allR.sort(key=keyfun, reverse=True) # sort in-place
 allPair = []
 for f in allF:
   for r in allR:
     pair = pairPrimer(f, r, seq)
     allPair.append(pair)
 
 try: import operator
 except ImportError: keyfun= lambda x: x.score # use a lambda if no operator module
 else: keyfun= operator.attrgetter("score") # use operator since it's faster than lambda
 allPair.sort(key=keyfun, reverse=True) # sort in-place

A way to make it faster would be probably to cut the program once we have found the 20 first best primers

Conclusion

Well, I hope this can help someone ! It was for me a wonderful opportunity to work with the iGEM team this summer. I learned a great amount, above all about biology and coding (I am a math student). If you need anything else concerning this page, please contact the uOttawa iGEM team, I stay available to help !