Team:ZJU-China/HCRStatistical

Part 1. Introduction Part 2. Experimental Result Verification Part 3. Theoretical Design Proof Part 4. Contribution

Part 1. Introduction

During the experiment, we found that the hybridization reaction produced DNA strands of different lengths whose abundance peaked at medium-length fragments and fell off progressively toward both longer and shorter fragments. This led us to ask: do the lengths of the DNA strands produced by this reaction follow a normal distribution? A literature search showed that similar patterns have appeared in earlier experiments, so we set out to demonstrate the intrinsic relationship between this experiment and the normal distribution.

The verification process here is mainly carried out on two levels:

1. First, an image processing tool is used to analyze the results of the electrophoresis experiment and to verify that they are consistent with the assumption of a normal distribution.

2. Then, starting from the reaction principle, we prove that the ideal final result under our experimental design is a normal distribution.

Both propositions were successfully demonstrated, and the details of this work are presented below.

Part 2. Experimental Result Verification

First, we verify the experimental results. We selected an electrophoresis lane after the hybridization chain reaction, performed grayscale analysis with ImageJ, and obtained the grayscale curve shown on the right side of the figure below. Each peak corresponds to a band. Because the last peak represents the unseparated fragments, it is excluded from the analysis. We recorded the coordinates (X, Y) of each peak for further processing.

In electrophoresis, the migration distance of each band is related logarithmically to its actual molecular weight, and in a digital image the Y coordinate increases from top to bottom by default. After correcting the coordinates for both effects, we obtain the following curve. It closely resembles the bell curve of a normal distribution.

Of course, "looks like" is imprecise, so we converted the Y-axis values into frequencies, turned the image information into a data set, and analyzed it statistically. To show the normality of this data set visually, we made a Q-Q plot.

Overall, the measured data lie close to the reference line (Q-Q line). A more precise result was obtained with the Shapiro-Wilk normality test, from which we concluded with more than 95.104% confidence that our data set follows a normal distribution. In summary, we have sufficient evidence that our experimental results are consistent with the normal distribution.
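The Q-Q comparison described above can be sketched in C++ as follows. This is only an illustrative sketch, not the analysis we actually ran: it assumes each peak has already been reduced to a position and an integer frequency, all function names are hypothetical, and it reports the correlation of the Q-Q points as a simple stand-in for the Shapiro-Wilk statistic, which it does not compute.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Standard normal CDF, written with the complementary error function.
double normalCdf(double z) {
    return 0.5 * std::erfc(-z / std::sqrt(2.0));
}

// Standard normal quantile found by bisection on normalCdf (simple, not fast).
double normalQuantile(double prob) {
    assert(prob > 0.0 && prob < 1.0);
    double lo = -10.0, hi = 10.0;
    for (int i = 0; i < 100; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (normalCdf(mid) < prob) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

// Turn (position, frequency) pairs into a data set:
// positions[i] is repeated counts[i] times. Both inputs are hypothetical.
std::vector<double> expandSample(const std::vector<double>& positions,
                                 const std::vector<int>& counts) {
    assert(positions.size() == counts.size());
    std::vector<double> sample;
    for (std::size_t i = 0; i < positions.size(); ++i) {
        assert(counts[i] >= 0);
        sample.insert(sample.end(), static_cast<std::size_t>(counts[i]), positions[i]);
    }
    return sample;
}

// Pearson correlation between the sorted sample and the matching normal
// quantiles. A value close to 1 means the Q-Q points lie near a straight line.
double qqCorrelation(std::vector<double> sample) {
    const std::size_t n = sample.size();
    assert(n >= 3);
    std::sort(sample.begin(), sample.end());
    std::vector<double> q(n);
    double meanX = 0.0, meanQ = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        q[i] = normalQuantile((i + 0.5) / n);  // plotting position (i + 0.5) / n
        meanX += sample[i];
        meanQ += q[i];
    }
    meanX /= n;
    meanQ /= n;
    double sxy = 0.0, sxx = 0.0, sqq = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sxy += (sample[i] - meanX) * (q[i] - meanQ);
        sxx += (sample[i] - meanX) * (sample[i] - meanX);
        sqq += (q[i] - meanQ) * (q[i] - meanQ);
    }
    return sxy / std::sqrt(sxx * sqq);
}
```

Calling `qqCorrelation(expandSample(positions, counts))` on bell-shaped frequencies should give a value near 1, while strongly skewed frequencies should give a noticeably lower one.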

Part 3. Theoretical Design Proof

Given the encouraging results above, we also wanted to explain this experimental phenomenon from a theoretical perspective.

Consider a reaction system of unit volume and assume that it contains N molecules of S2. After the reaction is complete, N DNA double strands of different lengths will have been produced. Because this number does not change during the reaction, we can think of the N S2 molecules as N boxes.

We think of the two short chains carrying Biotin as one small ball. Since the growth of an elongating DNA chain does not affect the binding choice of any free fragment, the reaction can be described by a probability model: P balls are placed into N identical boxes, and each ball is equally likely to land in any box. We want to show that when P and N are both large enough, the number of balls per box is approximately normally distributed.

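In fact, we only need to consider one of the boxes. Writing n = P for the number of balls, each ball lands in this box with probability p = 1/N, and the probability of having k balls in this box is:

$$P_n(k)=C_n^k p^k(1-p)^{n-k},\qquad p=\frac{1}{N}$$

In what follows, P_in denotes the fraction k/n of the balls that land in this box.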

According to Stirling's formula:

$$n!\sim\sqrt{2\pi n}\left(\frac{n}{e}\right)^n\quad (n\to\infty)$$

The binomial coefficient can then be approximated as follows, where ε is a small correction that vanishes as n, k and n-k grow.

$$C_n^k=\frac{n!}{k!(n-k)!}\approx \frac{1+\epsilon}{\sqrt{2\pi k\left(1-\frac{k}{n}\right)}\cdot \left(\frac{k}{n}\right)^k \left(1-\frac{k}{n}\right)^{n-k}}$$
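To get a feeling for how small ε is, the approximation can be compared numerically with the exact binomial coefficient. The sketch below is only an illustration with hypothetical function names. It works with logarithms through `std::lgamma` to avoid overflow, and defines ε by exact = (1 + ε) × approximation.

```cpp
#include <cassert>
#include <cmath>

// Exact ln C(n, k), using ln m! = lgamma(m + 1).
double logBinomialExact(double n, double k) {
    assert(k > 0 && k < n);
    return std::lgamma(n + 1) - std::lgamma(k + 1) - std::lgamma(n - k + 1);
}

// ln of the Stirling-based approximation with epsilon dropped:
// C(n, k) ~ 1 / ( sqrt(2 pi k (1 - k/n)) * (k/n)^k * (1 - k/n)^(n-k) ).
double logBinomialStirling(double n, double k) {
    assert(k > 0 && k < n);
    const double pi = std::acos(-1.0);
    const double x = k / n;
    return -0.5 * std::log(2 * pi * k * (1 - x))
           - k * std::log(x)
           - (n - k) * std::log(1 - x);
}

// Relative correction epsilon such that exact = (1 + epsilon) * approximation.
double stirlingEpsilon(double n, double k) {
    return std::exp(logBinomialExact(n, k) - logBinomialStirling(n, k)) - 1.0;
}
```

For example, `stirlingEpsilon(10, 5)` has a magnitude of a few percent, while `stirlingEpsilon(1000, 500)` is far smaller, in line with ε vanishing as n and k grow.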

To compute the expected number of boxes containing k balls over all N boxes, we multiply the number of boxes by the probability that a single box contains k balls. This probability follows the binomial distribution.

$$E_k=NC_n^kp^k(1-p)^{n-k}$$

Having derived the approximation of the binomial coefficient, we substitute it into the expression for the expectation. In addition, all power terms are rewritten as exponentials with base e, and an auxiliary function H(x) is used to represent the exponent. The expectation then takes the following form.

$$E_k=\frac{N(1+\epsilon)}{\sqrt{2\pi n(1-P_{in})P_{in}}}e^{-nH(P_{in})}$$

$$H(x)=x\ln\frac{x}{p}+(1-x)\ln\frac{1-x}{1-p}$$
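The step that produces the exponent can be written out explicitly. The short derivation below assumes that P_in stands for k/n, so that k = nP_in and n-k = n(1-P_in). It collects the power terms of the binomial probability and of the approximated binomial coefficient into one exponential.

$$\frac{p^k(1-p)^{n-k}}{P_{in}^{\,k}(1-P_{in})^{n-k}}=\exp\left[-k\ln\frac{P_{in}}{p}-(n-k)\ln\frac{1-P_{in}}{1-p}\right]=\exp\left[-n\left(P_{in}\ln\frac{P_{in}}{p}+(1-P_{in})\ln\frac{1-P_{in}}{1-p}\right)\right]=e^{-nH(P_{in})}$$

The remaining factor of the approximation, with k = nP_in, gives the square-root prefactor:

$$\sqrt{2\pi k\left(1-\frac{k}{n}\right)}=\sqrt{2\pi nP_{in}(1-P_{in})}$$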

We then expand H(x) in a Taylor series around x = p and evaluate it at x = P_in.

$$H(P_{in})=H(p)+H'(p)(P_{in}-p)+\frac{1}{2}H''(p)(P_{in}-p)^2+O((P_{in}-p)^3)$$

Because H(p) = 0 and H'(p) = 0, only the quadratic term survives at leading order.

$$H(P_{in})\approx \frac{1}{2}\left(\frac{1}{p}+\frac{1}{1-p}\right)(P_{in}-p)^2$$
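The derivatives used in this expansion can be checked directly from the definition of H(x). The worked steps below are a sketch of that check and again assume P_in = k/n.

$$H(p)=p\ln 1+(1-p)\ln 1=0$$

$$H'(x)=\ln\frac{x}{p}-\ln\frac{1-x}{1-p},\qquad H'(p)=0$$

$$H''(x)=\frac{1}{x}+\frac{1}{1-x},\qquad H''(p)=\frac{1}{p}+\frac{1}{1-p}=\frac{1}{p(1-p)}$$

Multiplying the quadratic term by n and inserting P_in = k/n turns it into the familiar Gaussian exponent:

$$nH(P_{in})\approx\frac{n}{2p(1-p)}\left(\frac{k}{n}-p\right)^2=\frac{(k-np)^2}{2np(1-p)}$$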

Finally, substituting this approximation of H(P_in) and using P_in = k/n, we obtain the expression of the expectation when n and N are both large enough, even as they tend to infinity.

$$E_k=NP_n(k)=\frac{N(1+\epsilon)}{\sqrt{2\pi nP_{in}(1-P_{in})}}e^{-\frac{(k-np)^2}{2np(1-p)}}$$
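The quality of this limit can be checked numerically by comparing the exact binomial expectation with the Gaussian form. The sketch below is only an illustration. The function names are hypothetical, p = 1/N is taken from the assumption that every box is equally likely, ε is dropped, and P_in in the prefactor is replaced by p, which is a further simplification that is reasonable near the peak.

```cpp
#include <cassert>
#include <cmath>

// Exact expected number of boxes holding k balls:
// N * C(n, k) * p^k * (1 - p)^(n - k), with p = 1 / N.
// Computed in log space so that large n does not overflow.
double expectedBoxesExact(int boxes, int balls, int k) {
    assert(boxes > 1 && k >= 0 && k <= balls);
    const double p = 1.0 / boxes;
    const double logPmf = std::lgamma(balls + 1.0) - std::lgamma(k + 1.0)
                        - std::lgamma(balls - k + 1.0)
                        + k * std::log(p) + (balls - k) * std::log(1.0 - p);
    return boxes * std::exp(logPmf);
}

// Gaussian approximation: N / sqrt(2 pi n p (1 - p)) * exp(-(k - np)^2 / (2 n p (1 - p))).
double expectedBoxesGaussian(int boxes, int balls, int k) {
    assert(boxes > 1 && balls > 0);
    const double pi = std::acos(-1.0);
    const double p = 1.0 / boxes;
    const double mean = balls * p;
    const double var = balls * p * (1.0 - p);
    return boxes / std::sqrt(2.0 * pi * var)
           * std::exp(-(k - mean) * (k - mean) / (2.0 * var));
}
```

With, say, 200 boxes and 200000 balls (illustrative values, not the parameters of our simulation), the two functions should agree closely for k near the mean of 1000 and drift apart slowly in the tails.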

This means that the frequency of DNA strands of different lengths follows the normal distribution formula. To verify this conclusion, we also simulated the process with a C++ program (k=106, N=200) and performed statistical tests. The simulation results are consistent with the proof. The analysis and assumptions about the ideal outcome of this experiment therefore have a convincing theoretical basis.
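A simulation of the balls-in-boxes model could look like the minimal sketch below. It is not our original program: the function names, the seed, and any parameter values used with it are assumptions made for illustration. It throws the balls uniformly at random, tallies how many boxes hold each number of balls, and provides the sample mean and variance so they can be compared with np and np(1-p).

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Throw `balls` balls into `boxes` boxes uniformly at random and
// return the number of balls that ended up in each box.
std::vector<int> throwBalls(int boxes, long balls, unsigned seed) {
    assert(boxes > 0 && balls >= 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick(0, boxes - 1);
    std::vector<int> occupancy(boxes, 0);
    for (long i = 0; i < balls; ++i) ++occupancy[pick(rng)];
    return occupancy;
}

// Histogram of the occupancies: result[k] = number of boxes holding exactly k balls.
std::vector<int> occupancyHistogram(const std::vector<int>& occupancy) {
    int maxCount = 0;
    for (int c : occupancy) if (c > maxCount) maxCount = c;
    std::vector<int> hist(maxCount + 1, 0);
    for (int c : occupancy) ++hist[c];
    return hist;
}

// Mean number of balls per box.
double sampleMean(const std::vector<int>& v) {
    assert(!v.empty());
    double sum = 0.0;
    for (int c : v) sum += c;
    return sum / v.size();
}

// Population variance of the number of balls per box.
double sampleVariance(const std::vector<int>& v) {
    const double m = sampleMean(v);
    double ss = 0.0;
    for (int c : v) ss += (c - m) * (c - m);
    return ss / v.size();
}
```

For instance, with 2000 boxes and 200000 balls the mean is exactly 100 balls per box, and the variance should come out close to np(1-p) ≈ 99.95. The histogram from `occupancyHistogram` can then be passed to a normality test.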

Part 4. Contribution

In conclusion, the HCR statistical model gives us confidence that the experimental results are consistent with the theoretical prediction and that the experiment was carried out successfully. The argument has two parts: from the experimental result to the ideal outcome, and from the chemical reaction to the ideal result. To our surprise, both parts are well supported by this model.