Software
With the hope to help the students of the team with the design of primer, I tried to develop a software to do primer design. There is still work to do on it, but I think what I did could help some others students with the goal to write a software like this to ! You can click on this button to get all the code I wrote in Python :
And on this one to get the code in R:
So, in this page, I would like to present this software. The chapter 0 explains how to use it, the chapter 1 and 2 how it was programmed and the chapter 3, what would be to improve next.
0 - How to use the programm
- Copy-paste the whole python code in repl.it ;
- Replace the string contained in the variable seqDNA (line 807) with the total sequence of DNA that is going to enter in the PCR;
- Replace the string contained in the variable target (line 809) with the part of the total sequence of DNA you want to duplicate (and so, the primers must be around it)
- Run the program
- Wait for a dozen of minutes
- There should appear a list of suggested primers, with the one at the bottom being the most recommended
I - Code of each constraint
So, each primer must respect a list of constraints. This list varies from website to website, but in general, we can find something like written below. (I mainly took my informations from this website : http://bioweb.uwlax.edu/GenWeb/Molecular/seq_anal/primer_design/primer_design.htm who himself adapted it from Innis and Gelfand,1991; I changed some value if I was seeing a lot of website using different ones). The coding of some of those constraints is less obvious than other, so under them, I am writing more indications !
- 1. primers should be 17-28 bases in length;
- 2. The GC content should be 50-60%
- 3. primers should end (3') in a G or C, or CG or GC:
- 4. The melting temperature should be between 52 and 58 degree Celsius
- Tm = melting temperature in °C
- A = constant of -0.0108 kcal K-1 ᐧ mol-1 (accounts for helix initiation during annealing / melting)
- R = gas constant of 0.001987 kcal K-1 ᐧ mol-1 (constant that scales energy to temperature)
- T0 = - 273.15, to pass from Kelvin to Celsius
- Na = [Na+] = sodium ion concentration in M or mol L-1 (we use 0.05, i.e. 50 mM)
- gamma = 50*10**(-9) = dimensionless (molar) concentration
- t = -21.6 = empirical temperature correction in Celsius
- S = ΔS = entropy change in kcal K-1 ᐧ mol-1 (accounts for energy unable to do work, i.e. disorder)
- H = ΔH = enthalpy change in kcal mol-1 (accounts for the energy change during annealing / melting)
- 5. 3'-ends of primers should not be complementary (ie. base pair), as otherwise primer dimers could be form (when primer stick together instead of sticking to the sequence that must be replicated).
- 6. primer self-complementarity (ability to form 2o structures such as hairpins) should be avoided;
Due to low internet connections, I don't achieve to check it, but I think the link of the article from where I took my algorithms was this one
def s(a,b): """return the score of bonding of two nucleotides (it's stronger for GC since it has 3 bonds of hydrogen and AT has only 2)""" if (a == "A" and b == "T") or (a =="T" and b == "A"): return 2 elif (a == "C" and b == "G") or (a == "G" and b == "C"): return 4 else: return 0 def S(x,y): """this function returns the complementarity (higher score, higher probability to bind) between two sequences""" #this takes in account the different possible alignement. Ex for x= CAGT and y = CAGT, we have for example those 2 differents alignements: #C - T C - G #A - o A - T #G - o G - o #T - o T - o x = x.upper() y = y.upper() align = [] if(len(x) > len(y)): #it's easier to code if x is always smaller than y x,y = y,x n,m = len(x), len(y) for i in range(m-1): y = "o" + y + "o" #to not provoke an error for k in range(m-1,3*(m-1)+1): tot = 0 #print("For km = ", k-2*(m-1), "soit k = ", k) for i2 in range(0, n): #print(x[i2], y[k-i2]) tot = tot + s(x[i2],y[k-i2]) #print("-> ", tot) align.append(tot) #print(align) return align def Sprime(x,y): """this function returns the number and strengh of bonds (higher score, higher probability to bind)between two sequences in an ininterrupted way from the start in different alignements""" #the same code than for S except that you break if there is an interruption in the bond making x = x.upper() y = y.upper() align = [] if(len(x) > len(y)): temp = x x = y y = temp n,m = len(x), len(y) for i in range(m-1): y = "o" + y + "o" for k in range(m-1,3*(m-1)+1): tot = 0 print("For km = ", k-2*(m-1), "soit k = ", k) for i2 in range(0, n): bond = s(x[i2],y[k-i2]) print(x[i2], y[k-i2]) print("k-i2 = ", k-i2) if(bond == 0 and y[k-i2] != 'o'): #the first bond between the two sequences has to score more than 0 break tot = tot + bond print("-> ", tot) align.append(tot) #print(align) return align def pa(p,q): """so pair annealing""" return max(S(p,q)) def pea(p,q): """so pair-end annealing""" return max(Sprime(p,q)) #why is that working like that, who knows def sea(seqDNA): """self-end annealing""" return max(Sprime(seqDNA[::-1],seqDNA)) def selfComplementarity(seqDNA): """self annealing""" return max(S(seqDNA, seqDNA))
On R:
s <- function(a,b){ if((a == "A" & b == "T") | (a == "T" & b == "A")){ return (2) } else if((a == "C" & b == "G") | (a == "G" & b == "C")){ return (4) } else{ return (0) } } #this function returns the number of bonds (higher score, higher probability to bind)between two sequences S <- function(x,y){ align <- list() if(nchar(x)>nchar(y)){ temp <- x x <- y y <- temp } n <- nchar(x) m <- nchar(y) for(i in 1:(m-1)){ y <- paste("o", y,"o", sep="") } xsplit <- unlist(strsplit(x, split="")) ysplit <- unlist(strsplit(y, split="")) i<-1 for(k in (m-1):(3*(m-1))){ tot <- 0 for(i2 in 1:n){ #print(c(xsplit[i2], ysplit[k-i2+2])) tot <- tot + s(xsplit[i2],ysplit[k-i2+2]) #print(c("->", tot)) } #print(c("TOT", "->", tot)) align[[i]] <- tot i <- i +1 } return(align) } Sprime <- function(x,y){ align <- list() if(nchar(x)>nchar(y)){ temp <- x x <- y y <- temp } n <- nchar(x) m <- nchar(y) for(i in 1:(m-1)){ y <- paste("o", y,"o", sep="") } xsplit <- unlist(strsplit(x, split="")) ysplit <- unlist(strsplit(y, split="")) i<-1 for(k in (m-1):(3*(m-1))){ tot <- 0 for(i2 in 1:n){ #print(c(xsplit[i2], ysplit[k-i2+2])) bond <- s(xsplit[i2],ysplit[k-i2+2]) if(bond == 0 & ysplit[k-i2 +2] != 'o'){ break } tot <- tot + bond #print(c("->", tot)) } #print(c("TOT", "->", tot)) align[[i]] <- tot i <- i +1 } return(align) } pa <- function(p, q){ return(max(unlist(S(p,q)))) } pea <- function(p,q){ return(max(unlist(Sprime(p,q)))) } sea <- function(seqDNA){ return(max(unlist(Sprime(stri_reverse(seqDNA), seqDNA)))) } selfComplementarity <- function(seqDNA){ return(max(unlist(S(seqDNA, seqDNA)))) }
- 7. runs of three or more Cs or Gs at the 3'-ends of primers may promote mispriming at G or C-rich sequences should be avoided.
They are several ways to calculate the melting temperature of a sequence of DNA, like the basic rule and the nearest-neighbour method. (For the different formulas, see Alejandro Panjkovich, Francisco Melo, Comparison of different melting temperature calculation methods for short DNA sequences, Bioinformatics, Volume 21, Issue 6, 15 March 2005, Pages 711–722, https://doi.org/10.1093/bioinformatics/bti066 at the paragraph melting temperatures calculation ; for an explanation of how the nearest-neighbour method works, see Albert Courey, Nearest-Neighbour method, https://www.youtube.com/watch?v=jy5h9NwBEgk ; ). The formula for this method differs a bit in every article I saw. This might be caused for example, by the use of different experimental data from one place of to another; some use Breslauer tables, some use SantaLucia, ... (see table 1 of this article ). Also, I think sometimes I was seeing differences from one table to an other what were probably errors of copy - so I think it's a good time to think that a good science principle is to double-check what is being said (especially double-checking what an undergraduate student like me says too !) and try to get as many points of view and references as possible ! In conclusion, this is the formula I used :
Tm = (-H/(A-S/1000+R*(math.log(gamma/4))))+ T0 + 16.6*math.log10(Na)
With
Those values were mainly taken from : here
The S and H variables where found with an algorithm constructed following the tutorial I recommended above ( video ).
This is the python program I wrote:
def delta_H_S(p1,p2): """ A function that returns the delta_H (free enthalpy in kcal/mol) and the delta_S (entropy in cal/K) of a pair of nuclotides, using the tables from Breslauer (1986) Parameters ---------- p1 : str a one-letter string ("A", "T", "C" or "G") to represent the first nucleotide p2 : str a one-letter string to represent the second nucleotide of the pair """ p3 = str(p1) + str(p2) #concatenate the two nucleotides if(p3 == "AA" or p3 =="TT"): H = 9.1 S = 24.0 elif(p3 == "AT"): H = 8.6 S = 23.9 elif(p3 == "TA"): H = 6.0 S = 16.9 elif(p3 == "CA" or p3 == "TG"): H = 5.8 S = 12.9 elif(p3 == "GT" or p3 == "AC"): H = 6.5 S = 17.3 elif(p3 == "CT" or p3 == "AG"): H = 7.8 S = 20.8 elif(p3 == "GA" or p3 == "TC"): H = 5.6 S = 13.5 elif(p3 == "CG"): H = 11.9 S = 27.8 elif(p3 == "GC"): H = 11.1 S = 26.7 elif(p3 == "GG" or p3 == "CC"): H = 11.0 S = 26.6 #this is working for oligo #return H, S/1000 return H, S def sum_deltas_H_S(x): """ A function that returns the total enthalpy (Htot) and entropy (S) of a sequence of DNA, summing the delta_enthalpy (h) and delta_entropy (s) of each pair of nucleotides. For example, to calculate H of "AATTGCCG" we add up h of "AA", h of "AT", h of "TT", ... Parameters ---------- x : str the sequence of DNA from who enthalpy and entropy is calculated """ H, S = 0,0 for i in range(0, len(x)-1): h, s = delta_H_S(x[i],x[i+1]) #print("h = ", h, "s = ", s) H = H + h S = S + s #print("H = ", H, "S = ", S) return H, S
Due to technical problems I don't achieve to retrieve the file right now - but I did compare the result I had with my programm to other programms, and it was well in the average.