Team:Ruperto Carola/Model

Modelling evolution

Usually, models of cell growth and gene expression assume a fixed genetic background for the cells involved. This does not play well with the idea of evolution, introducing mutation and selection. Once we add these, standard models break down. To properly model evolution, we need to integrate both the behaviour of our cells' gene regulatory networks and their competition in the context of a larger population. Understanding the model of this at once can be quite the brain-melter, so let us take this step-by-step. We begin by setting down the equations for the population dynamics of a single type of cells, assuming our media can sustain the growth of infinitely many cells. As we're looking at the dynamics of our population \(N : T \to C\) here, we're looking at the rate of change in cell concentration in time, which is made up of a rate of cell division \(\gamma : \mathbb{R}_+\) and a death rate \(\delta : \mathbb{R}_+\). \[ \frac{dN}{dt} = \gamma N - \delta N \] Such media of course do not exist, and realistic conditions dictate a finite carrying capacity \(K : C\). We can easily adjust above equation to enforce a given \(K\) using logistic growth \[ \frac{dN}{dt} = (\gamma - \delta) \cdot N \left( 1 - \frac{N}{K} \right) \] or by explicitly coupling to nutrient concentration \(S : T \to C\) via the Monod equations: \[ \frac{dN}{dt} = \frac{\gamma S}{K + S} N - \delta N \] \[ \frac{dS}{dt} = -\upsilon \frac{\gamma S}{K + S} N \] where \(\upsilon : \mathbb{R}_+\) is a yield factor giving nutrient consumption per unit growth. That's it for the basics. To take the first step towards modelling evolution, consider a set of possible genotypes \(\Gamma : \mathsf{FinSet}\), and the concentration of a subpopulation of a given genotype \(N : \Gamma \to T \to C\), with genotype-specific growth rate \(\gamma : \Gamma \to \mathbb{R}_+\). You can think of differences in \(\gamma\) across \(\Gamma\) as _differences in fitness_ between the genotypes. Thus, subpopulations of fixed genotype compete for nutrients in media: \[ \frac{dN_k}{dt} = \frac{\gamma_k S}{K + S} N_k - \delta N_k \] At this point, we have recovered several aspects of evolving systems – the presence of multiple genotypes, as well as selection _via_ competition of multiple subpopulations for growth and media – but we are still missing one last aspect, without which everything breaks down. We are missing mutation, which provides a way for individuals to move between genotypes \(k : \Gamma\). In most systems, DNA mutation is tightly coupled to DNA replication, which is in turn coupled to cell division. Thus, we want an extension to our model, which moves individuals across genotypes as they divide. We can achieve this by introducing an additional term \(\pi_{k' \to k} : \Gamma \to \Gamma \to \Delta^1\) giving the probability of mutating from genotype \(k'\) to \(k\), at each cell division. Coupled into our model, we arrive at: \[ \frac{dN_k}{dt} = \sum_{k' : \Gamma} \pi_{k' \to k} \cdot \frac{\gamma_{k'} S}{K + S} \cdot N_{k'} - \delta N_k \] where we gather up contributions of mutation from each subpopulation by summing over \(k' : \Gamma\). Now, we have all the parts and we can rewrite this equation into a more succinct, descriptive form: \[ \frac{d\mathbf{N}}{dt} = \gamma_0 \mathbf{\pi}\mathbf{f} \mathbf{N} - \delta \mathbf{N} \] where \(\mathbf{N} : \Gamma \to T \to C\) is the vector of concentrations \(N_k\) for each subpopulation of genotype \(k : \Gamma\), \(\gamma_0 : \mathbb{R}_+\) a maximum growth rate, \(\mathbf{f} : \Gamma \to (0, 1)\) a relative fitness (relative growth rate) and mutation matrix \(\mathbf{\pi} : \Gamma \to \Gamma \to \Delta^1\) with elements \(\pi_{k' \to k}\). Note that for evolution of traits involving interaction between cells, the fitness \(\mathbf{f}\) may have an arbitrary dependence on \(\mathbf{N}\). Now, the observant reader may wonder, where we get our fitness \(\mathbf{f}\) from (here we just assume that till this point this is the only thing that left you wondering). Depending on the level of abstraction we want to treat our system at, we may indeed directly specify a function \(f\) from genotypes \(\Gamma\) to fitness values \((0, 1)\). However, we may also specify actual gene regulatory networks in our cells, and work our way up from there. Briefly, the second case involves treating gene expression as a fast process, relative to cell division and assuming partial steady state behaviour. Our model differs from the majority of the literature on modelling evolution, as we focus on non-saturated liquid-culture with population growth. Having laid the foundations for our model, let us now move on to a specific system for directed evolution in yeast.

The OrthoRep system

We use the OrthoRep mutagenesis system for our experiments. It relies on a linear cytoplasmic high-copy plasmid which replicates orthogonally to the yeast genome. The choice of mutagenesis system defines the form of the mutation matrix \(\mathbf{\pi}\), which is why we have to adapt our model for the brain-melting complexity that is OrthoRep.

To avoid any melting of brains, we will again increase the complexity step by step, until we arrive at the full model. We will, in turn, introduce a perfect, single copy mutagenesis plasmid, before subjecting the reader to the pain of constant copy number-mutation. We will proceed to the case of no-holds-barred, arbitrary plasmid copy numbers with non-unit probability of plasmid replication, coupled to genotype. This will take us into statistical physics territory and we will discuss solving the chemical master equation Cao2017 in the context of truncated state spaces. Things will be put into the context of the brutal game that is survival of the fittest, to restore the connection to directed evolution. We will finally put everything together and present the final model for OrthoRep evolution, with 100% less brain-melting action.

The perfect mutagenizer

In the spirit of the perfect spherical cow let us consider a 'perfect' non-spherical single-copy plasmid, which replicates exactly once each cell division (no, not in vacuum). It has a mutation rate of \(\mu : \mathbb{R}_+\) mutations per basepair and replication. It has \(n : \mathbb{N}\) possible effective genotypes indexed by single base pair mutations. From this, we can construct a pure mutation matrix \(\pi^{mut}_{k' \to k}\). As we keep the template strand of DNA constant during replication, the resulting final mutation matrix becomes: \[ \pi_{k' \to k} = \frac{1}{2} \cdot (\pi^{mut}_{k' \to k} + \mathbf{1}) \]

To better understand this simple system, we shall consider an example set of genotypes. Consider a gene with two possible states: an active "on" state and an inactive "off" state, for example due to an early stop codon. We denote them by \(1\) and \(0\) respectively, giving \(\Gamma = \{0, 1\}\). We shall assume that the probability of crossing from the on to the off state and vice versa is given by the mutation rate \(\mu\), say by having a single base decide the on or off state. This results in a mutation matrix: \[ \pi_{i \to i} = 1 - \mu;\quad \pi_{i \to j, i \neq j} = \mu;\quad i, j : \Gamma \] which is nice and simple to understand. Every cell has just a single allele of each gene, mutation probabilities on sequences translate directly to mutation probabilities on \(\Gamma\), everything is happy and peaceful. As we shall see, lasting peace and wellbeing is but an illusion.

N-copy nightmares

Sadly, OrthoRep is not a perfect system for mutagenesis. OrthoRep is multi-copy, with many copies of the plasmid per cell Ravikumar2018. Now, we have to take into account, that mutation happens on the level of single plasmids, while selection happens on the level of cells. The effective genotype (con)trolling the fitness of a given cell is now given by the combination of all plasmids contained therein.

We thus define our set of effective \(n\)-copy genotypes \(\Gamma_{eff} : \mathbb{N} \to \mathsf{FinSet}\) in terms of single plasmid genotypes \(\Gamma_{sp} : \mathsf{FinSet}\) as the subtype of the free commutative monoid \(\Gamma_{m} := \mathbb{N}[\Gamma_{sp}]\) on \(\Gamma_{sp}\) defined by \[ \Gamma_{eff} := \prod n: \mathbb{N}.\; \left\{x : \Gamma_m\;\middle|\; \sum_i x(i) = n\right\} \] Take a moment (or a week like we did without cheating and looking up the answer) to think about this: if we have exactly \(n: \mathbb{N}\) plasmids per cell, the number of possible of genotypes of a cell grows as: \[ |\Gamma_{eff}(n)| = \binom{|\Gamma_{sp}| + n - 1}{n} \] This grows fast in the number of copies. Thankfully, we "only" have to deal with on average 5 copies of plasmid in the OrthoRep system Ravikumar2018. In the following, we shall refer to \(\Gamma_{eff}\) as just \(\Gamma\) and assume a copy number \(n\) by abuse of notation.

Other things change within our model, and the changes are certainly not for the simpler. Fitness experiences little to no change – it now depends on effective genotypes. More drastic changes appear in the mutation matrix. Upon cell division, plasmids are replicated, shuffled and randomly redistributed to both daughter cells. This has to be reflected in \(\pi_{k' \to k}\), which results in \(\pi\) splitting into a product of two terms - pure mutation and shuffling. This process is shown in Fig. 1.

Figure 1. Illustration of the mutation process in the constant copy-number model. Plasmids are first mutated according to a mutation matrix \(\pi_{k' \to k}\), resulting in a mutant population (top track). We carry along the probability distribution of the unmutated population, as we assume that the template strand is never mutated during replication (bottom track). Finally, we pool both the mutated and unmutated population and draw a fixed-size random set of plasmids without replacement for the daughter cells.

We define the mutation matrix as a product of pure mutation \(\pi^{mut}_{k \to k'} : \Gamma \to \Gamma \to \Delta^1\) and shuffling \(\sigma_{k,k' \to k''} : \Gamma \to \Gamma \to \Gamma \to \Delta^1\), as \[ \pi_{k' \to k} := \sum_{x : \Gamma} \pi^{mut}_{k' \to x} \sigma_{k', x \to k} \] where we can define the shuffling probability as the probability of picking a sample of \(n\) plasmids without replacement, for a given set of counts for each genotype. This is simply given by the multinomial hypergeometric distribution: \[ \sigma_{k,k' \to k''} := \frac{\prod_{x : \Gamma} \binom{k(x) + k'(x)}{k''(x)}} {\binom {\sum_{x : \Gamma} k(x) + k'(x)} {\sum_{x : \Gamma} k''(x)}} \] as our definition of genotypes changes, so does our mutation matrix: \[ \pi^{mut}_{k \to k'} := \sum_{ X : \Gamma \to \mathbb{N}[\Gamma];\; leftsum(k, X);\; rightsum(k', X) } \prod_{i, j : \Gamma} \binom{k(i)}{X(i, j)} \pi_{i \to j}^{X(i, j)} \] \[ leftsum(X, k) := \forall i : \Gamma.\; \sum_{j : \Gamma} X(i, j) = k(i) \] \[ rightsum(X, k) := \forall j : \Gamma.\; \sum_{i : \Gamma} X(i, j) = k'(j) \] which tallies up the transition probabilities for each single plasmid and normalizes by the number of ways available to reach a final mutated multi-copy genotype from an initial multi-copy genotype. So far, so good - we are moving from a single-copy to a multi-copy system just by using some simple combinatorics to take account of multiple plasmids being present. This model suffices to describe the evolution of arbitrary cells with constant copy-number plasmids, has a large but finite state space, and should be enough for any reasonable application of directed evolution. The attentive reader of our Wiki will notice, that we're hinting at something here. Yes, our application of directed evolution is not reasonable; you heard it here first. We model a system, where our genotype directly impacts our plasmid copy-number, which is a MaSIF pain to model correctly (software crossover, here we go). For the biology, take a look at (LINK).

Copy numbers, melting brains and the chemical Master Equation

For unreasonable applications of directed evolution, we may need arbitrary plasmid copy numbers, which will lead us to scrap our previous model (congratulations yeast, you win!). We need a further term in our mutation matrix \(\pi_{k' \to k}\), which specifies the probability with which a given plasmid is replicated: \(\rho_{k \to k'} : \Gamma \to \Gamma \to \Delta^1\): \[ \rho_{k \to k'} := \prod_{x:\Gamma} \binom{k(x)}{k'(x)} p_r^{k'(x)} (1 - p_r)^{k(x) - k'(x)} \] where \(p_r : \Delta^1\) is a basal probability of replication, which may depend on the genotype of a given cell. This is just a simple multinomial distribution for picking a set of plasmids to replicate with probability \(p_r\). Furthermore, \(\sigma_{k, k' \to k''}\) actually simplifies in the presence of arbitrary copy-numbers, to just another multinomial distribution: \[ \sigma_{k, k' \to k''} := \prod_{x:\Gamma} \left(\frac{1}{2}\right)^{k(x) + k'(x)} \binom{k(x) + k'(x)}{k''(x)} \] which may not be that surprising, as the hypergeometric distribution previously present in \(\sigma\) arises from the restriction to a constant copy number.

As our plasmids no longer replicate with unit probability, our copy number may not stay constant across cell divisions. This makes our space of genotypes blow up, badly. Where before we had been restricted to a constant-sum slice of the free commutative monoid \(\Gamma_m := \mathbb{N}[\Gamma_{sp}]\) over \(\Gamma_{sp}\), our genotypes now potentially reside in all of \(\Gamma_m\). Our space of genotypes has become (countably) infinitely large - meltage of brain followed by meltage of computers ensues. The inclined reader may notice, that taken independently from cell division, our plasmids follow a chemical master equation: \[ \frac{d\mathbf{P}}{dt} = \mathbf{\pi} \mathbf{P} \] The usual procedure for approximating solutions to the chemical master equation in the presence of infinitely large state spaces, is to resort to state-space truncation Cao2017. This means summarizing a low-probability segment of state-space with a single, terminal state, where probability may leak to. For practical purposes, we truncate state-space to a subspace where for each single-plasmid genotype \(x\), and total genotype \(k\), we have: \[ \forall x : \Gamma_{sp},\; k : \Gamma.\; k(x) \leq n \] for some maximum copy number \(n : \mathbb{N}\), which stops our brains and computers from melting and allows us to solve our system approximately. State-space truncation errors for a cutoff of \(n=20\) are on the order of \(10^{-16}\), which approaches the order of numerical accuracy. Finally, our mutation matrix is given by: \[ \pi_{k' \to k} := \sum_{x, y: \Gamma} \rho_{k' \to x} \pi^{mut}_{x \to y} \sigma_{x, y \to k} \] completing the transition to arbitrary copy numbers. Now, we have arrived at our most general model for mutation, and it is time to take a look at the selection component of our model of evolution.

Figure 2. Illustration of the mutation process in the variable copy-number model. First, replicating plasmids are sampled from the parent's plasmid population. The non-replicating plasmids are carried along for subsequent shuffling (bottom track). The replicating plasmids are then mutated according to a mutation matrix \(\pi_{k' \to k}\), resulting in a mutant population (top track). We carry along the probability distribution of the replicating, unmutated population, as we assume that the template strand is never mutated during replication (bottom track). Finally, we pool both the mutated and unmutated population and draw a random-sized set of plasmids without replacement according to a binomial sampling distribution for the daughter cells.

Figure 3. Comparison of model and Monte-Carlo computation of mutation matrices. Mutation matrices were computed via both Monte Carlo simulation and our exact model. The model results (exact) look qualitatively similar to the Monte Carlo simulations with 50 samples (empirical). Quantitative agreement is also good, with errors (difference) staying well below 10%.

Figure 4. Examples for variable-copy mutation matrices at plasmid count \(n = 5\). Exponential size of truncated state space is clearly visible. For \(p = 1\), transition probabilities are symmetrical, as for both lower and higher plasmid copy numbers, all plasmids unconditionally replicate. For both \(p = 0.5\) and \(p = (#0 + #1) / 5\), there is a drift towards lower overall copy numbers, as seen by probability being concentrated in the lower half, as expected for non-unit replication probabilities. Plasmids do not always replicate, leading to overall lower plasmid counts.

Figure 5. Directed evolution of a two-state gene with a hostile fitness landscape, providing a qualitative model for URA3. (a) Concentrations of optimal evolved strain (all URA3 in off state) as a function of time. Different plots show different selection strategies, with \(n\) steps of drift followed by selection. It can be seen, that initially longer drift times can benefit selection. (b) Expansion of URA3 selection as an extensive form game, colored by strain optimality. X-axis gives the number of steps of directed evolution, y-axis gives the weighted genotype of the population (higher is better). (b) Bifurcation in the extensive form game at a moderate number of selection steps (versus drift steps). It can be seen, that small perturbations to a directed evolution experiment can result in a large impact on overall fitness.