Abstract
Starting with the simplest situation, we first consider a sequence with only one mismatch.
Our model implies that, when comparing off-targets with a single mismatch to the cognate sequence, the relative cleavage probability is sigmoidal in the mismatch position, irrespective of the values of the model parameters.
To understand what constitutes the targeting principles of RGNs, we introduce a simplified model where:
For the PAM state ($i = 0$) we have $\Delta_0 = \Delta_{PAM}$; for a partial R-loop ($i\in[1, N-1]$), we have $\Delta_i = \Delta_C$ if the $i$-th base in the R-loop is correctly matched, and $\Delta_i = -\Delta_I$ if mismatched; for a completed R-loop ($i = N$), we have $\Delta_N = \Delta_C - \Delta_{clv}$ if the terminal base is correctly matched, and $\Delta_N = -\Delta_I - \Delta_{clv}$ if mismatched. An R-loop in which $n$ base pairs are incorporated, out of which $n_{\mathrm{C}}(n)$ form correct Watson-Crick pairs, is then described by the total bias \begin{equation} \Delta T_{n}=\Delta_{\mathrm{PAM}}+n_{\mathrm{C}}(n) \Delta_{\mathrm{C}}-\left(n-n_{\mathrm{C}}(n)\right) \Delta_{\mathrm{I}}-\delta_{n, N} \Delta_{\mathrm{clv}}, \quad n=0, \ldots, N \end{equation}
where $\delta_{n, N}$ is the Kronecker delta: $\delta_{n, N} = 1$ if $n = N$ and $\delta_{n, N} = 0$ otherwise.
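As a minimal sketch of Equation 1, the total bias can be tabulated for any match pattern. The function name `total_bias`, its argument names, and the numbers in the usage note are hypothetical and chosen only for illustration; all biases are taken to be in units of $k_BT$.

```python
def total_bias(matches, d_pam, d_c, d_i, d_clv):
    """Return [DeltaT_0, ..., DeltaT_N] following Equation 1.

    matches[k] is True when R-loop position k+1 is a correct base pair.
    d_pam, d_c, d_i, d_clv stand for Delta_PAM, Delta_C, Delta_I, Delta_clv
    (hypothetical argument names, all in units of k_B T).
    """
    n_total = len(matches)
    biases = []
    n_correct = 0  # n_C(n): correct pairs among the first n positions
    for n in range(n_total + 1):
        if n > 0 and matches[n - 1]:
            n_correct += 1
        n_wrong = n - n_correct
        bias = d_pam + n_correct * d_c - n_wrong * d_i
        if n == n_total:
            # Kronecker delta: the cleavage term only enters at n = N
            bias -= d_clv
        biases.append(bias)
    return biases
```

For example, `total_bias([True, False, True], 2.0, 1.0, 4.0, 0.5)` describes a three-position R-loop with a mismatch at position 2: the bias rises by $\Delta_C$ at each match and drops by $\Delta_I$ at the mismatch.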
Analytic expression
We define $n_{seed}$ as the mismatch position at which the relative cleavage probability is half of its maximum $p_{max}$.
To obtain an analytic expression for $n_{seed}$, we place a single mismatch at position $n_{seed}$, so that the number of correct base pairs among the first $n$ is: $$ n_{\mathrm{C}}(n)=\left\{\begin{array}{ll}{n} & {n < n_{\mathrm{seed}}} \\ {n-1} & {n \geq n_{\mathrm{seed}}}\end{array}\right. $$
From this we obtain a recursion for the total biases $\Delta T_{seed}$ of this off-target (written $\Delta T_{n}$ on the right-hand side, with the cleavage term at $n = N$ omitted for brevity): $$ \Delta T_{seed}=\left\{\begin{array}{ll}{\Delta_{PAM}} & {n = 0} \\ {\Delta_{C}+\Delta T_{n-1}} & {1 \leq n < n_{\mathrm{seed}}} \\ {\Delta_{PAM}+(n_{\mathrm{seed}}-1)\Delta_{\mathrm{C}}-\Delta_{\mathrm{I}}} & {n=n_{\mathrm{seed}}}\\{\Delta_{C}+\Delta T_{n-1}} & {n > n_{\mathrm{seed}}}\end{array}\right. $$
For a completely matched sequence, we have: $$ \Delta T_{max}=\left\{\begin{array}{ll}{\Delta_{PAM}} & {n = 0} \\ {\Delta_{C}+\Delta T_{n-1}} & {n \geq 1}\end{array}\right. $$
$\Delta T_{seed}$ therefore splits into two arithmetic sequences with common difference $\Delta_C$ (before and after the mismatch), while $\Delta T_{max}$ is a single arithmetic sequence. From equation (1-7) we obtain the relation between the two:
$$ P_{\mathrm{max}} =\frac{1}{1+\sum_{n=0}^{N} \exp \left(-\Delta T_{max}\right)} = 2P_{\mathrm{seed}}=2\frac{1}{1+\sum_{n=0}^{N} \exp \left(-\Delta T_{seed}\right)} $$ \begin{equation} 2+2\sum_{n=0}^{N} \exp \left(-\Delta T_{max}\right)=1+\sum_{n=0}^{N} \exp \left(-\Delta T_{seed}\right) \end{equation}
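The cleavage probability used above, $1 /\left(1+\sum_{n} e^{-\Delta T_{n}}\right)$, can also be evaluated by direct summation. The sketch below does this for a single mismatch at each position and takes the ratio to the cognate target. All function names and every value in `PARAMS` are hypothetical, picked only so that the curve has a visible seed region.

```python
import math

def cleavage_probability(biases):
    """P = 1 / (1 + sum_n exp(-DeltaT_n)) for a list [DeltaT_0, ..., DeltaT_N]."""
    return 1.0 / (1.0 + sum(math.exp(-b) for b in biases))

def single_mismatch_biases(pos, n_total, d_pam, d_c, d_i, d_clv):
    """Total biases for one mismatch at R-loop position pos (pos=None: cognate)."""
    biases = []
    for n in range(n_total + 1):
        n_wrong = 1 if (pos is not None and n >= pos) else 0
        bias = d_pam + (n - n_wrong) * d_c - n_wrong * d_i
        if n == n_total:
            bias -= d_clv  # cleavage term only at the completed R-loop
        biases.append(bias)
    return biases

def relative_cleavage(pos, n_total, d_pam, d_c, d_i, d_clv):
    """Cleavage probability of the single-mismatch target relative to the cognate."""
    args = (n_total, d_pam, d_c, d_i, d_clv)
    p_off = cleavage_probability(single_mismatch_biases(pos, *args))
    p_on = cleavage_probability(single_mismatch_biases(None, *args))
    return p_off / p_on

# Hypothetical parameter values in k_B T, for illustration only
PARAMS = dict(n_total=20, d_pam=4.0, d_c=1.8, d_i=22.0, d_clv=15.15)
curve = [relative_cleavage(pos, **PARAMS) for pos in range(1, 21)]
```

With these illustrative values, `curve` stays near zero for PAM-proximal mismatches and rises to a plateau for PAM-distal ones; the position where it crosses half of the plateau is the numerical counterpart of $n_{seed}$.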
Once $\Delta T_{n}$ is placed in the exponential, each arithmetic sequence turns into a geometric series. We can therefore solve for the length of the seed region using the summation formula of the geometric series: \begin{equation} S_{n}=\frac{a_{1}\left(1-q^{n}\right)}{1-q}(q \neq 1) \end{equation}
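As a sketch of the intermediate algebra, the sums can be written out explicitly with $q = e^{-\Delta_{\mathrm{C}}}$. Here $m$ is a label introduced only for this illustration, denoting the position of the single mismatch ($1 \leq m \leq N$); the total biases are those of Equation 1.

```latex
% On-target: one geometric series over n = 0..N-1, plus the cleavage term at n = N
\sum_{n=0}^{N} e^{-\Delta T_{n}^{\mathrm{on}}}
  = e^{-\Delta_{\mathrm{PAM}}}\,
    \frac{1-e^{-N \Delta_{\mathrm{C}}}}{1-e^{-\Delta_{\mathrm{C}}}}
  + e^{-\left(\Delta_{\mathrm{PAM}}+N \Delta_{\mathrm{C}}-\Delta_{\mathrm{clv}}\right)}

% Single mismatch at position m: two geometric series, split at the mismatch
\sum_{n=0}^{N} e^{-\Delta T_{n}}
  = e^{-\Delta_{\mathrm{PAM}}}\,
    \frac{1-e^{-m \Delta_{\mathrm{C}}}}{1-e^{-\Delta_{\mathrm{C}}}}
  + e^{-\Delta_{\mathrm{PAM}}+\Delta_{\mathrm{I}}+\Delta_{\mathrm{C}}}\,
    \frac{e^{-m \Delta_{\mathrm{C}}}-e^{-N \Delta_{\mathrm{C}}}}{1-e^{-\Delta_{\mathrm{C}}}}
  + e^{-\left(\Delta_{\mathrm{PAM}}+(N-1) \Delta_{\mathrm{C}}-\Delta_{\mathrm{I}}-\Delta_{\mathrm{clv}}\right)}
```

The only dependence on the mismatch position is through $e^{-m \Delta_{\mathrm{C}}}$, which is why the ratio of the two cleavage probabilities takes a sigmoidal form in $m$.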
From Equations 2 and 3 we obtain the analytical solution for $p_{\max}$ and $n_{seed}$: \begin{equation} \begin{aligned} p_{\max } &=\frac{\left(1-e^{-\Delta_{\mathrm{C}}}\right) e^{\Delta_{\mathrm{PAM}}}\left(1+e^{-\Delta T_{N}^{\mathrm{on}}}\right)+1-e^{-\Delta R_{N}^{\mathrm{on}}}}{\left(1-e^{-\Delta_{\mathrm{C}}}\right) e^{\Delta_{\mathrm{PAM}}}\left(1+e^{-\Delta T_{N}^{\mathrm{tm}}}\right)+1-e^{-\Delta R_{N}^{\mathrm{tm}}}} \\ n_{\mathrm{seed}} &=\frac{1}{\Delta_{\mathrm{C}}} \ln \left[\frac{e^{\Delta_{\mathrm{I}}+\Delta_{\mathrm{C}}}-1}{\left(1-e^{-\Delta_{\mathrm{C}}}\right) e^{\Delta_{\mathrm{PAM}}}\left(1+e^{-\Delta T_{N}^{\mathrm{tm}}}\right)+1-e^{-\Delta R_{N}^{\mathrm{tm}}}}\right] \end{aligned} \end{equation}
Here we have introduced the R-loop completion bias for a cognate and a terminal-mismatch target, respectively, $$ \Delta R_{N}^{\mathrm{on}}=N \Delta_{\mathrm{C}}, \quad \Delta R_{N}^{\mathrm{tm}}=(N-1) \Delta_{\mathrm{C}}-\Delta_{\mathrm{I}}=\Delta R_{N}^{\mathrm{on}}-\left(\Delta_{\mathrm{C}}+\Delta_{\mathrm{I}}\right) $$
as well as the total bias toward cleavage for the on-target and for the off-target with a terminal mismatch, respectively, $$ \Delta T_{N}^{\mathrm{on}}=\Delta R_{N}^{\mathrm{on}}+\Delta_{\mathrm{PAM}}-\Delta_{\mathrm{clv}}, \quad \Delta T_{N}^{\mathrm{tm}}=\Delta R_{N}^{\mathrm{tm}}+\Delta_{\mathrm{PAM}}-\Delta_{\mathrm{clv}}=\Delta T_{N}^{\mathrm{on}}-\left(\Delta_{\mathrm{C}}+\Delta_{\mathrm{I}}\right) $$
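The closed-form expressions for $p_{\max}$ and $n_{seed}$ can be checked against a brute-force evaluation of the sums. The sketch below is one way to do that check; the function names and the parameter values used in the usage note are hypothetical.

```python
import math

def exact_pmax_nseed(n_total, d_pam, d_c, d_i, d_clv):
    """Evaluate the closed-form p_max and n_seed (all biases in k_B T)."""
    r_on = n_total * d_c                 # Delta R_N^on
    r_tm = r_on - (d_c + d_i)            # Delta R_N^tm
    t_on = r_on + d_pam - d_clv          # Delta T_N^on
    t_tm = t_on - (d_c + d_i)            # Delta T_N^tm
    pref = (1.0 - math.exp(-d_c)) * math.exp(d_pam)
    numer = pref * (1.0 + math.exp(-t_on)) + 1.0 - math.exp(-r_on)
    denom = pref * (1.0 + math.exp(-t_tm)) + 1.0 - math.exp(-r_tm)
    p_max = numer / denom
    n_seed = math.log((math.exp(d_i + d_c) - 1.0) / denom) / d_c
    return p_max, n_seed

def brute_force_relative(pos, n_total, d_pam, d_c, d_i, d_clv):
    """Relative cleavage probability for one mismatch at pos, by direct summation."""
    def p_clv(mismatch):
        total = 0.0
        for n in range(n_total + 1):
            wrong = 1 if (mismatch is not None and n >= mismatch) else 0
            bias = d_pam + (n - wrong) * d_c - wrong * d_i
            if n == n_total:
                bias -= d_clv
            total += math.exp(-bias)
        return 1.0 / (1.0 + total)
    return p_clv(pos) / p_clv(None)
```

For instance, with the illustrative values `n_total=20, d_pam=4.0, d_c=1.8, d_i=22.0, d_clv=15.15`, the brute-force ratio at every mismatch position agrees with `p_max / (1 + exp(-(pos - n_seed) * d_c))` to within floating-point error, so under the model's assumptions the sigmoid is exact rather than a fit.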
Approximate solution
For the correct PAM we expect a considerable PAM bias, and assuming at least a moderate bias for R-loop extension over correct base pairs, we should be able to take $\left(1-e^{-\Delta_{C}}\right) e^{\Delta_{PAM}} \gg 1$ in Equation 4. Further, we expect the overall bias on an on-target to be strongly toward cleavage $\left(\Delta T_{N}^{\mathrm{on}} \gg 1\right)$, as well as a large change in total bias when comparing a correctly and an incorrectly matched base pair $\left(\Delta_{\mathrm{I}}+\Delta_{\mathrm{C}} \gg 1\right)$. With these assumptions Equation 4 becomes \begin{equation} \begin{aligned} p_{\max } & \approx \frac{1}{1+e^{-\Delta T_{N}^{\mathrm{tm}}}}\\ n_{\mathrm{seed}} & \approx \frac{\Delta_{\mathrm{I}}+\Delta_{\mathrm{C}}-\Delta_{\mathrm{PAM}}}{\Delta_{\mathrm{C}}}+\frac{\ln p_{\max }-\ln \left(1-e^{-\Delta_{\mathrm{C}}}\right)}{\Delta_{\mathrm{C}}} \approx \frac{\Delta_{\mathrm{I}}-\Delta_{\mathrm{PAM}}}{\Delta_{\mathrm{C}}}+1 \end{aligned} \end{equation} and the relative cleavage probability $p(n)$ for a single mismatch at position $n$ follows the sigmoid $$ p(n) \approx \frac{p_{\max}}{1+\exp \left[-\left(n-n_{\mathrm{seed}}\right) \Delta_{\mathrm{C}}\right]} $$
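To get a feel for how much the approximations give up, they can be coded next to each other and compared with the exact expressions. This is a sketch only: the function names are hypothetical, and the parameter values in the usage note are illustrative choices that satisfy the stated assumptions.

```python
import math

def approx_pmax(n_total, d_pam, d_c, d_i, d_clv):
    """p_max ~ 1 / (1 + exp(-Delta T_N^tm)), all biases in k_B T."""
    t_tm = (n_total - 1) * d_c - d_i + d_pam - d_clv
    return 1.0 / (1.0 + math.exp(-t_tm))

def approx_nseed(d_pam, d_c, d_i, p_max=None):
    """Approximate seed length.

    With p_max=None this returns the crudest form (Delta_I - Delta_PAM)/Delta_C + 1;
    otherwise it keeps the logarithmic correction term.
    """
    if p_max is None:
        return (d_i - d_pam) / d_c + 1.0
    correction = (math.log(p_max) - math.log(1.0 - math.exp(-d_c))) / d_c
    return (d_i + d_c - d_pam) / d_c + correction

def sigmoid_relative_cleavage(pos, p_max, n_seed, d_c):
    """Relative cleavage probability for a single mismatch at position pos."""
    return p_max / (1.0 + math.exp(-(pos - n_seed) * d_c))
```

With the illustrative values `n_total=20, d_pam=4.0, d_c=1.8, d_i=22.0, d_clv=15.15`, the crude estimate gives a seed of 11 positions and the corrected one lands slightly below that, so for parameters in this regime the simplest form already captures the seed length to within a fraction of a base pair.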
Parameters Setting
Among the systems we estimate parameters for, the dataset from (Anderson et al.) traces out the sigmoidal trend particularly well. For this dataset, we fit a kinetic seed of about 11.0 and an average bias per correct base pair of about $\Delta_C = 1.80\,k_BT$. This positive bias indicates that association with the RGN stabilizes the hybrid, which is in line with recent studies demonstrating that the protein has a strong contribution to the energetics of the resulting bound complex. The relative cleavage probability levels off around $p_{max} = 0.74$, indicating that spCas9 retains some specificity even against errors that are outside the seed.
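The three fitted numbers are enough to redraw the sigmoid and to read off two rough consequences of the approximate formulas. In the sketch below, the 20-nt guide length is an assumption about the spCas9 system, the names are hypothetical, and the two "implied" quantities are back-of-envelope estimates from the approximations rather than fitted values.

```python
import math

N_SEED = 11.0    # kinetic seed length reported in the text
DELTA_C = 1.80   # bias per correct base pair in k_B T, reported in the text
P_MAX = 0.74     # plateau of the relative cleavage probability, reported in the text
N_GUIDE = 20     # assumption: 20-nt guide

def fitted_curve(pos):
    """Sigmoid p_max / (1 + exp(-(pos - n_seed) * Delta_C)) at mismatch position pos."""
    return P_MAX / (1.0 + math.exp(-(pos - N_SEED) * DELTA_C))

curve = {pos: fitted_curve(pos) for pos in range(1, N_GUIDE + 1)}

# From n_seed ~ (Delta_I - Delta_PAM) / Delta_C + 1 (crudest approximation)
implied_di_minus_dpam = (N_SEED - 1.0) * DELTA_C

# From p_max ~ 1 / (1 + exp(-Delta T_N^tm))
implied_t_tm = math.log(P_MAX / (1.0 - P_MAX))
```

Under these approximations, the mismatch penalty would exceed the PAM bias by roughly 18 $k_BT$, and the total bias toward cleavage with a terminal mismatch would be only about 1 $k_BT$, which is consistent with a plateau visibly below 1.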
Summary
One should recognize that our minimal model does not capture all the physics of the targeting process. Nucleic acid interactions are explicitly sequence dependent, RGNs are known to undergo conformational changes prior to cleavage, and the $\Delta_C$ we fit in Figure 6 technically only reports the matching bias at the end of the seed, which leaves room for variable biases along the R-loop. Although these are all topics that need to be explored for improved quantitative predictions in the future, such extensions are not needed to explain the observed targeting rules and will not qualitatively alter the trends predicted by our model.
In conclusion, our model is capable of explaining the observed off-targeting rules of CRISPR and Argonaute systems in simple kinetic terms. After having established the general utility of this approach, the next step will be to move beyond our minimal model and gradually allow for conformational control and sequence effects by letting our parameters depend on the nature of matches/mismatches as well as their positions.
SJTU-BioX-Shanghai
Contact us: sjtuigem@gmail.com
Bio-X Institute, Shanghai Jiao Tong University, Dongchuan Rd. 800