Difference between revisions of "Team:Calgary/Software"

<div class="text-area">

<div class="page-banner">

<h2 class="page-subtitle"><a href="https://2019.igem.org/Team:Calgary/Modelling">Computing Biology</a></h2>

<h1 class="page-title">Software</h1>

</div>

</div>

Revision as of 23:58, 11 October 2019


Inspiration

WHAT IS THE IMPACT?

Proteins are widely used in the iGEM community, but there is very little a team can do to understand its protein’s atomic behaviour. At iGEM Calgary we wanted to create a quantitative method that allows other teams to characterize each amino acid of their proteins.

The aim of this model is to find an emulsion that maximizes the removal of chlorophyll from the oil by optimizing temperature and the concentrations of oil, water, and surfactant. Supervised machine-learning classification methods are used to predict emulsion phase equilibria (the Winsor classifications) from previously gathered in vitro data in order to formulate optimal emulsion conditions. Through an iterative development process, we explored and implemented Support Vector Classification (SVC), k-Nearest Neighbours (kNN), and multilayer perceptron models to densely interpolate and extrapolate the desired phase data from the experimentally gathered data.
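The model-comparison loop described above can be sketched with scikit-learn. The feature layout (temperature plus the three concentrations) and the random placeholder data below are assumptions for illustration only; they are not the team's in vitro measurements.

```python
# Sketch of comparing SVC, kNN, and multilayer perceptron classifiers on
# emulsion conditions. Each row is a hypothetical [temperature, oil, water,
# surfactant] vector; labels 1-4 stand in for the four Winsor classes.
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))        # placeholder emulsion conditions
y = rng.integers(1, 5, size=200)      # placeholder Winsor classes 1-4

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "SVC (RBF)": SVC(kernel="rbf", decision_function_shape="ovo"),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "MLP": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "test accuracy:", model.score(X_te, y_te))
```

With real phase data, the same loop lets the three approaches be compared on held-out accuracy before densely interpolating over a grid of candidate conditions.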

Measurement

WHAT DID WE QUANTIFY?

To assist other teams with dynamic characterization, we developed a methodology for calculating and aggregating Brownian motion measurements for each amino acid in a sequence. The Brownian motion measurement chosen was the Root Mean Square Fluctuation (RMSF), calculated for every atom of a protein in ten-picosecond intervals.

The RMSF data was calculated from a nanosecond molecular dynamics simulation (MDS) run in GROMACS, an industry-standard MDS package. These per-atom values were then averaged over each amino acid, ensuring that the unit of measurement was observed on a scale that teams can modify. The result is a series of curves that quantitatively express the dynamics of each amino acid.
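The per-residue aggregation step can be sketched as follows. The arrays are illustrative stand-ins for per-atom RMSF values (e.g. parsed from a GROMACS `gmx rmsf` output file) and the residue index each atom belongs to; the numbers are not real simulation output.

```python
# Average per-atom RMSF values over each amino acid, yielding one
# fluctuation measurement per residue.
import numpy as np

atom_rmsf = np.array([0.12, 0.15, 0.11, 0.30, 0.28, 0.25])  # nm, per atom
residue_of_atom = np.array([0, 0, 0, 1, 1, 1])              # residue index

n_res = residue_of_atom.max() + 1
per_residue = np.array([
    atom_rmsf[residue_of_atom == r].mean() for r in range(n_res)
])
print(per_residue)  # one mean RMSF value per amino acid
```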

Above is a complete view of the movement attributed to every amino acid of a protein. Having all of the curves present at once can be quite confusing; the true power of this measurement is harnessed when amino acids are displayed in smaller clusters. Below are the dynamics of the 25th, 80th, and 90th amino acids of the 6GIX protein.

Functionality

WHAT CAN RMSF ACCOMPLISH?


Theory

Detailed formulation of \(k\) -Nearest Neighbours and Support Vector Classification

Support Vector Classification

Support Vector Classification (SVC) is a classification approach that finds a hyperplane dividing two classes of vectors within a space. The goal is to find the maximum margin between the labelled data and generate parameters for a hyperplane that divides this margin. The optimization problem of generating a separating hyperplane between two classes holding \(n\) data points can be summarized as: $$ \max_{\beta_0, \beta_1, \beta_2, \beta_3, \epsilon_1, \ldots, \epsilon_n} \mathcal{M} $$ subject to $$\beta_1^2 + \beta_2^2 + \beta_3^2 = 1,$$ $$ y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}) \geq \mathcal{M}(1-\epsilon_i),$$ $$ \sum\limits_{i=1}^{n} \epsilon_i \leq \mathcal{C}, \>\>\>\>\> \epsilon_i \geq 0, \>\>\>\>\> y_i \in \{1, -1\},$$ where \(\mathcal{M}\) is the size of the margin, the \(\beta_j\) are the parameters defining the hyperplane, \(y_i\) is the label of each vector, which can only be 1 or -1, and \(\epsilon_i\) is the error for each vector, constrained in total by \(\mathcal{C}\), the cost parameter (James et al. 2017).
Since we have four phase classes to separate, we applied the one-versus-one approach, in which a division is constructed for each pair of classes; this optimization was therefore solved \(\binom{4}{2} = 6\) times, the number of distinct pairs among four elements. Since the data is not linearly separable, a non-linear radial basis function (RBF) was used as the kernel: $$K(\vec{v_0}, \vec{v_i}) = e^{- \; \gamma \; \lVert \vec{v_0} - \vec{v_i} \rVert^2},$$ where \(\vec{v_0}\) is the vector to be labelled and the kernel is applied to each training vector \(\vec{v_i}\) for this test observation. \( \gamma \) is a parameter subject to choice.
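The RBF kernel above can be checked numerically against scikit-learn's implementation. The vectors and the choice \(\gamma = 0.5\) below are arbitrary, for illustration only.

```python
# Direct evaluation of K(v0, vi) = exp(-gamma * ||v0 - vi||^2), compared
# against scikit-learn's rbf_kernel on the same pair of vectors.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

gamma = 0.5
v0 = np.array([[1.0, 2.0, 0.5, 0.1]])
vi = np.array([[0.5, 1.5, 0.0, 0.4]])

direct = np.exp(-gamma * np.sum((v0 - vi) ** 2))
sklearn_val = rbf_kernel(v0, vi, gamma=gamma)[0, 0]
print(direct, sklearn_val)  # the two evaluations agree
```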
The second parameter \(\mathcal{C}\) specifies the total amount of error allowed within the separating hyperplane, allowing adjustment of the model’s bias-variance trade-off. This trade-off is an important consideration in the approximation of any function: approximations that are more flexible have greater variance (they tend to follow the data closely) and low bias. A small value of \(\mathcal{C}\) means the separation cannot allow many errors, which implies the model will be more flexible and possibly overfit.
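The effect of the cost parameter can be seen in a quick sketch on synthetic two-class data (not the team's emulsion measurements). Note one assumption about the library: scikit-learn's `C` penalizes margin violations, so it plays the inverse role of the error budget \(\mathcal{C}\) above; a large scikit-learn `C` gives the flexible, possibly overfitting model.

```python
# Compare a heavily regularized SVC against a flexible one on overlapping
# synthetic blobs; the flexible model follows the training data closely.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

rigid = SVC(kernel="rbf", C=0.01).fit(X, y)       # strongly regularized
flexible = SVC(kernel="rbf", C=1000.0).fit(X, y)  # follows the data closely

print("training accuracy, small C:", rigid.score(X, y))
print("training accuracy, large C:", flexible.score(X, y))
```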

\(k\) -Nearest Neighbours

The aim of a general classification model is to provide the likelihood that a new unlabelled vector lies within a class. The \(\mathcal{K}\)-Nearest Neighbours method is a non-parametric approach that looks at the \(\mathcal{K}\) nearest (in terms of distance) vectors within the space and assigns a label based on those closest neighbours. The probability that a vector \(\vec{v}\) as described above will be labelled with phase \(y\) can be calculated with kNN by: $$ Pr( \> Y = y \> | \> X = \vec{v}) = \frac{1}{\mathcal{K}} \sum_{i \in \mathcal{N}}^{} I(\> y_i = y \>),$$ where \(i\) indexes through the \(\mathcal{K}\) nearest vectors in \(\mathcal{N}\) and \(I\) is the indicator function, which outputs 1 if the label of the neighbour is equal to \(y\) and 0 otherwise (James et al. 2017).
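The probability formula above can be computed by hand on a tiny made-up training set (the vectors and labels below are illustrative, with \(\mathcal{K} = 3\)):

```python
# Pr(Y = y | X = v) = (1/K) * sum of I(y_i = y) over the K nearest
# training vectors, computed directly from distances.
import numpy as np
from collections import Counter

X_train = np.array([[0.0], [0.1], [0.2], [1.0], [1.1]])
y_train = np.array([1, 1, 2, 2, 2])   # two placeholder phase labels
v = np.array([0.05])                  # vector to classify
K = 3

dists = np.linalg.norm(X_train - v, axis=1)
nearest = np.argsort(dists)[:K]       # indices of the K closest vectors
counts = Counter(y_train[nearest])
prob = {label: counts[label] / K for label in set(y_train)}
print(prob)  # {1: 0.667, 2: 0.333}: two of the three neighbours carry label 1
```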


Appendix A

Procedure for data collected in vitro
