<h2 class="page-subtitle"><a href="https://2019.igem.org/Team:Calgary/Modelling">Computing Biology</a></h2>
<h1 class="page-title">Software</h1>

Proteins are widely used in the iGEM community, but iGEM teams have few tools for understanding their protein’s behaviour at the atomic level. At iGEM Calgary we wanted to develop a quantitative method that lets other teams characterize each amino acid of their proteins.

The aim of this model is to find an emulsion that maximizes the removal of chlorophyll from the oil by identifying suitable values for temperature and for the concentrations of oil, water, and surfactant. Supervised machine-learning classification methods are used to predict emulsion phase equilibria (known as the Winsor classifications) from previously gathered in vitro data, in order to formulate optimal emulsion conditions. Through an iterative development process, we explored and implemented Support Vector Classification (SVC), k-Nearest Neighbours (kNN), and multilayer perceptron models to densely interpolate and extrapolate the desired phase data from the experimentally gathered data.



To support dynamic characterization by other teams, we set out to develop a methodology that calculates and aggregates Brownian motion measurements for each amino acid in a sequence. The Brownian motion measurement chosen was the Root Mean Square Fluctuation (RMSF), calculated for every atom of a protein in ten-picosecond intervals.

The RMSF data were calculated from a nanosecond Molecular Dynamics Simulation (MDS) completed within GROMACS, a widely used MDS software package. These values were then averaged over each amino acid, which ensured that the measurement was reported at a scale that teams can actually modify. This resulted in a series of curves that quantitatively express the dynamics of each amino acid.
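The per-atom calculation and per-residue averaging described above can be sketched in plain Python. This is a minimal illustration, not the team's pipeline: the function names, the toy trajectory, and the `atom_to_residue` mapping are all hypothetical, and the frames are assumed to be already aligned to a reference structure so that overall rotation and translation do not contribute. Applying it to successive ten-picosecond windows of a trajectory would give one RMSF value per residue per window.

```python
import math

def atom_rmsf(frames):
    """RMSF of each atom over a list of frames.

    frames[t][a] is the (x, y, z) position of atom a at frame t.
    Assumes the frames are already aligned to a common reference.
    """
    n_frames = len(frames)
    n_atoms = len(frames[0])
    rmsf = []
    for a in range(n_atoms):
        # Time-averaged position of this atom.
        mean = [sum(frames[t][a][k] for t in range(n_frames)) / n_frames
                for k in range(3)]
        # Mean squared deviation from that average position.
        msd = sum(
            sum((frames[t][a][k] - mean[k]) ** 2 for k in range(3))
            for t in range(n_frames)
        ) / n_frames
        rmsf.append(math.sqrt(msd))
    return rmsf

def residue_rmsf(rmsf, atom_to_residue):
    """Average per-atom RMSF values over the atoms of each residue."""
    totals, counts = {}, {}
    for value, res in zip(rmsf, atom_to_residue):
        totals[res] = totals.get(res, 0.0) + value
        counts[res] = counts.get(res, 0) + 1
    return {res: totals[res] / counts[res] for res in totals}

# Toy trajectory (hypothetical): 2 frames, 3 atoms.
# Atoms 0 and 1 belong to residue 1, atom 2 to residue 2.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)],
    [(0.0, 0.0, 2.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)],
]
atom_to_residue = [1, 1, 2]

per_atom = atom_rmsf(frames)
per_residue = residue_rmsf(per_atom, atom_to_residue)
```

In the toy data only atom 0 moves, so residue 1 receives the average of one mobile and one static atom, while residue 2 stays at zero.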

Above is a complete view of the movement attributed to every amino acid of a protein. Having all of the curves present at the same time can be quite confusing. The true power of this measurement is best seen when amino acids are displayed in smaller clusters. Below are the dynamics of the 25th, 80th, and 90th amino acids of the 6GIX protein.



Detailed formulation of \(k\)-Nearest Neighbours and Support Vector Classification

Support Vector Classification

Support Vector Classification (SVC) is a classification approach that finds a hyperplane dividing two classes of vectors within a space. The goal is to find the maximum margin between the labelled data and to generate parameters for a hyperplane that divides this margin. The optimization problem of generating a separating hyperplane between two classes holding \(n\) data points can be summarized as: $$ \max_{\beta_0, \beta_1, \beta_2, \beta_3, \epsilon_1, \ldots, \epsilon_n} \mathcal{M} $$ subject to $$\beta_1^2 + \beta_2^2 + \beta_3^2 = 1, $$ $$ y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}) \geq \mathcal{M}(1-\epsilon_i),$$ $$ \sum\limits_{i=1}^{n} \epsilon_i \leq \mathcal{C}, \>\>\>\>\> \epsilon_i \geq 0, \>\>\>\>\> y_i \in \{1, -1\},$$ where \(\mathcal{M}\) is the size of the margin, \(\beta_j\) are the parameters defining the hyperplane, and \(y_i\) is the label of each vector, which can only be 1 or -1. \(\epsilon_i\) is the error (slack) for each vector, and the total error is constrained by \(\mathcal{C}\), the cost parameter (James et al. 2017).
Since we have four phase classes to be separated, we applied the one-versus-one approach, where divisions were constructed for each pair of classes, meaning this optimization was solved 6 times (\( \binom{4}{2} = 6 \) is the number of distinct pairs among \(4\) elements). Since the data is not linearly separable, a non-linear radial basis function (RBF) was used as a kernel: $$K(\vec{v_0}, \vec{v_i}) = e^{- \gamma \, \lVert \vec{v_0} - \vec{v_i} \rVert^2}$$ where \(\vec{v_0}\) is the vector to be labelled, and the kernel is applied to each training vector \(\vec{v_i}\) for this test observation. \( \gamma \) is a parameter subject to choice.
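The pair count and the kernel evaluation can be checked with a few lines of plain Python. This is only a sketch: the phase labels and the function name `rbf_kernel` are hypothetical, and the kernel is written in the standard RBF form based on the squared Euclidean distance between the two vectors.

```python
import math
from itertools import combinations

def rbf_kernel(v0, vi, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between v0 and vi)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(v0, vi))
    return math.exp(-gamma * sq_dist)

# Hypothetical labels for the four phase classes.
phases = ["phase_1", "phase_2", "phase_3", "phase_4"]

# One-versus-one: one binary classifier per unordered pair of classes.
pairs = list(combinations(phases, 2))

# Identical vectors give the maximum kernel value of 1.
k_same = rbf_kernel((1.0, 2.0, 3.0), (1.0, 2.0, 3.0), gamma=0.5)

# Distant vectors give a value close to 0 (squared distance here is 25).
k_far = rbf_kernel((0.0, 0.0, 0.0), (3.0, 4.0, 0.0), gamma=0.5)
```

A larger \( \gamma \) makes the kernel decay faster with distance, so each training vector influences a smaller neighbourhood.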
The second parameter \(\mathcal{C}\) controls how many errors are tolerated around the separating hyperplane, allowing adjustment of the model’s bias-variance trade-off. This trade-off is an important consideration in the approximation of any function. Approximations that are more flexible have greater variance (they tend to follow the data closely) and lower bias. In the budget form written above, \(\mathcal{C}\) bounds the total error, so a small budget permits few margin violations. Many SVM implementations instead expose \(\mathcal{C}\) as a cost placed on each violation, which reverses the direction: a large cost \(\mathcal{C}\) means the separation cannot allow for many errors, so the model will be more flexible and may overfit.

\(k\)-Nearest Neighbours

The aim of a general classification model is to provide the likelihood that a new unlabelled vector lies within a class. The \(\mathcal{K}\)-Nearest Neighbours method is a non-parametric approach which looks at the \(\mathcal{K}\) nearest (in terms of distance) vectors within the space and assigns a label based on those closest neighbours. The probability that a vector \(\vec{v}\) will be labelled with phase \(y\) can be calculated with kNN by: $$ Pr( \> Y = y \> | \> X = \vec{v}) = \frac{1}{\mathcal{K}} \sum_{i \in \mathcal{N}}^{} I(\> y_i = y \>)$$ where \(i\) indexes through the \(\mathcal{K}\) nearest vectors in \(\mathcal{N}\) and \(I\) is the indicator function, which outputs 1 if the label of the neighbour is equal to \(y\) and 0 otherwise (James et al. 2017).
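The probability estimate above can be sketched directly in plain Python. The function name, the toy training vectors, and the labels "A" and "B" are hypothetical stand-ins for measured conditions and their phase classes, and Euclidean distance is assumed as the distance measure.

```python
import math
from collections import Counter

def knn_probabilities(train_x, train_y, v, k):
    """Estimate Pr(Y = y | X = v) as the fraction of the k nearest
    training vectors that carry label y (Euclidean distance assumed)."""
    # Indices of the training vectors, sorted by distance to v.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], v))
    neighbours = order[:k]
    # Count how often each label occurs among the k nearest neighbours.
    counts = Counter(train_y[i] for i in neighbours)
    return {label: counts[label] / k for label in set(train_y)}

# Toy training data (hypothetical): three features per vector.
train_x = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (5, 5, 5), (6, 5, 5)]
train_y = ["A", "A", "B", "B", "B"]

probs = knn_probabilities(train_x, train_y, v=(0.2, 0.2, 0.0), k=3)
predicted = max(probs, key=probs.get)
```

Here the three nearest neighbours carry the labels A, A, and B, so the new vector is assigned to class A with an estimated probability of 2/3.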

Appendix A

Procedure of data collected in vitro

