Directed Protein Modification


What inspired our protein modification?

After successfully using the 6GIX water-soluble chlorophyll-binding protein to purify canola oil, we looked deeper into the industrial application of our solution. To industrialize it, we identified the need to optimize our system at every level, down to its smallest interactions: those between our protein, its environment, and chlorophyll. To optimize 6GIX, we set out to make informed modifications that increase its stability while maintaining its affinity for chlorophyll. To accomplish this, we used molecular dynamics simulations to create a modified 6GIX protein called ModGIX.


Steps taken to generate this model

When developing ModGIX, we went back to the molecular dynamics models generated for 6GIX to identify areas where modifications may have an impact. We employed a six-step system to collect this data.

Step 1: Molecular Dynamics Simulation.

To establish a starting point for our model, we ran a one-nanosecond molecular dynamics simulation of a single 6GIX monomer. This simulation used the same methodology as the other molecular dynamics models detailed on our In Silico Emulsion Verification models page.

Step 2: Characterize the Protein's Dynamics.

From this simulation, we used Root Mean Square Fluctuation (RMSF) curves to characterize the dynamics of the protein's individual amino acids. This resulted in one curve per amino acid, quantifying that amino acid's dynamics over time.
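As a rough illustration of this step, the sketch below computes RMSF per residue from a toy trajectory and then repeats the calculation over successive time windows to obtain one curve per residue. The windowing scheme, the function names, and the toy coordinates are assumptions made for illustration; the frames are assumed to be already fitted to a common reference, and this is not the team's actual pipeline.

```python
import math

def rmsf_per_residue(frames):
    """RMSF of each residue over the given frames.

    frames: list of frames, each a list of (x, y, z) tuples, one per residue.
    Assumes the frames were already aligned to a common reference.
    """
    n_frames = len(frames)
    n_residues = len(frames[0])
    values = []
    for r in range(n_residues):
        # Time-averaged position of residue r.
        mean = [sum(frame[r][k] for frame in frames) / n_frames for k in range(3)]
        # Mean squared deviation from that average position.
        msd = sum(
            sum((frame[r][k] - mean[k]) ** 2 for k in range(3))
            for frame in frames
        ) / n_frames
        values.append(math.sqrt(msd))
    return values

def rmsf_curves(frames, window):
    """One RMSF value per residue per non-overlapping window (hypothetical scheme)."""
    n_windows = len(frames) // window
    per_window = [
        rmsf_per_residue(frames[w * window:(w + 1) * window])
        for w in range(n_windows)
    ]
    n_residues = len(frames[0])
    # Transpose so that each residue gets its own curve over time.
    return [[per_window[w][r] for w in range(n_windows)] for r in range(n_residues)]

# Toy trajectory: 4 frames, 2 residues (made-up coordinates).
toy_frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
]
overall = rmsf_per_residue(toy_frames)
curves = rmsf_curves(toy_frames, window=2)
print(overall)
print(curves)
```

In this toy input the first residue never moves, so its curve is flat at zero, while the second residue fluctuates more in the second window than in the first.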

Step 3: Perform Functional Principal Component Analysis on the RMSF data.

After generating the dynamics data, we conducted functional principal component analysis (fPCA) on the RMSF curves. This provided a set of principal components that represent the curves in a finite-dimensional space, along with each amino acid's principal component scores and the proportion of total variance explained (PVE) by each component.
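To make the distinction between components, scores, and PVE concrete, here is a minimal sketch that applies ordinary PCA to curves sampled at common time points and extracts only the first component by power iteration. This is a simplified stand-in for fPCA, which typically also involves smoothing and handles irregular sampling; the function name and the toy curves are hypothetical.

```python
import math

def first_principal_component(curves, iters=200):
    """Leading principal component of discretized curves.

    curves: list of equal-length lists (one curve per amino acid).
    Returns (component, scores, pve), where pve is the share of total
    variance explained by this component.
    """
    n = len(curves)
    m = len(curves[0])
    # Centre the curves around the mean curve.
    mean = [sum(c[j] for c in curves) / n for j in range(m)]
    centered = [[c[j] - mean[j] for j in range(m)] for c in curves]
    # Sample covariance between time points (m x m).
    cov = [
        [sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(m)]
        for a in range(m)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue,
    # provided the starting vector is not orthogonal to it.
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue; the trace gives total variance.
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)
    )
    total_variance = sum(cov[a][a] for a in range(m))
    pve = eigenvalue / total_variance if total_variance > 0 else 0.0
    # A score is the projection of one centred curve onto the component.
    scores = [sum(row[j] * v[j] for j in range(m)) for row in centered]
    return v, scores, pve

# Toy curves that differ only by a scale factor, so one component captures everything.
toy_curves = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 6.0, 9.0],
    [4.0, 8.0, 12.0],
]
component, scores, pve = first_principal_component(toy_curves)
print(component, scores, pve)
```

Each amino acid receives one score per component, and these scores are what a clustering step would consume; the PVE is a property of the component, not of any single amino acid.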

Step 4: Use Clustering Algorithms on the Principal Component Scores.

On the newly generated principal component scores, we performed clustering by fitting a Gaussian mixture model with the Expectation-Maximization (EM) algorithm. This was intended to give tight, representative clusters of amino acids. Clustering resulted in 4 distinct clusters, each defined by the proportion that it contributes to the overall variance from the crystal structure.
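The following is a minimal sketch of EM for a one-dimensional Gaussian mixture, assuming each amino acid is summarized by a single score. The quantile-based initialization, the variance floor, and the toy scores are assumptions chosen for illustration; two components are used here to keep the example small, whereas the analysis above fixed the number of clusters at 4.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_em(data, k, iters=100, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with k components by EM."""
    n = len(data)
    xs = sorted(data)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [xs[int((i + 0.5) * n / k)] for i in range(k)]
    overall_mean = sum(xs) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in xs) / n, min_var)
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens) or 1e-300  # guard against division by zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)
    # Hard assignment: the component with the highest responsibility.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels

# Toy scores forming two well-separated groups (made-up values).
toy_scores = [0.0, 0.2, -0.2, 0.1, 10.0, 10.2, 9.8, 9.9]
weights, means, variances, labels = fit_gmm_em(toy_scores, k=2)
print(means, labels)
```

EM only finds a local optimum, so in practice results can depend on the starting values; running several initializations and comparing the likelihoods is a common safeguard.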

Step 5: Use HotSpot Wizard to Avoid Inhibiting Binding.

After identifying the amino acids that contribute the most to structural variance, the team used HotSpot Wizard to identify key amino acids responsible for structural and binding functions. This tool helped us ensure that any further modifications would not cause a loss in the form or function of 6GIX.

Step 6: Use a Genetic Algorithm to Optimize the Amino Acid Sequence.

Once the problematic amino acids had been identified and cleared by HotSpot Wizard, we made modifications to the amino acid sequence of 6GIX. The team developed and used iGAM, an R-based genetic algorithm that optimizes portions of an amino acid sequence in the context of the entire sequence. This software package is available here. After a five-hundred-generation run of the iGAM algorithm, our team obtained the final ModGIX sequence.
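The general idea of optimizing only selected positions while scoring the whole sequence can be sketched as a small genetic algorithm. This is not iGAM: the fitness function below is a placeholder, and the starting sequence, the mutable positions, and all parameter values are hypothetical. Elitism is used so that the best fitness never decreases from one generation to the next.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_fitness(seq):
    # Placeholder objective, NOT iGAM's fitness: counts A, L, and E residues
    # over the entire sequence.
    return sum(1 for aa in seq if aa in "ALE")

def mutate(seq, mutable, rng, rate=0.2):
    # Only positions listed in `mutable` may change.
    chars = list(seq)
    for i in mutable:
        if rng.random() < rate:
            chars[i] = rng.choice(AMINO_ACIDS)
    return "".join(chars)

def crossover(a, b, mutable, rng):
    # Uniform crossover restricted to the mutable positions.
    chars = list(a)
    for i in mutable:
        if rng.random() < 0.5:
            chars[i] = b[i]
    return "".join(chars)

def run_ga(start, mutable, generations=50, pop_size=30, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    population = [mutate(start, mutable, rng, rate=1.0) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=toy_fitness, reverse=True)
        history.append(toy_fitness(population[0]))
        # Elitism: the top fifth survives unchanged into the next generation.
        elite = population[: pop_size // 5]
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            children.append(mutate(crossover(p1, p2, mutable, rng), mutable, rng))
        population = elite + children
    best = max(population, key=toy_fitness)
    return best, history

# Hypothetical starting sequence and mutable positions (not the 6GIX sequence).
start_seq = "MKTGSWPQHD"
mutable_positions = [2, 5, 7]
best_seq, fitness_history = run_ga(start_seq, mutable_positions)
print(best_seq, fitness_history[-1])
```

Restricting mutation and crossover to a list of allowed positions is one simple way to keep protected residues fixed while still evaluating each candidate as a full sequence.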

In generating ModGIX, several assumptions were made and validated.
Assumption 1. Principal component analysis assumes correlation between the parameters of the data. We consider this assumption valid because the protein is a continuous strand, which ensures correlation among the different amino acids.

Assumption 2. For the clustering algorithm, we assumed normality and an even number of clusters to ensure appropriateness of fit. We checked this assumption by fixing the number of clusters at 4; the clusters obtained from the data were distinct enough that we considered these assumptions reasonable.

Assumption 3. For data collection, we used a one-nanosecond simulation and assumed that it is a representative sample of the protein's dynamics. We made this assumption because of the extra computational load that longer simulations would require. Increasing the simulation time would create a roadblock for other teams attempting to replicate our modelling schema for their own projects.


Deliverables Generated

The RMSF curves from the one-nanosecond simulation were generated, and functional principal component analysis was conducted on them, yielding the principal components and their scores. Clustering was then conducted on these results to generate the following clusters:

Figure 1. Root mean square fluctuation characterization of our 6GIX monomer. More on the generation of this figure can be found here.

After generating the list of amino acids for modification, we used the iGAM algorithm to determine suitable replacements. After 101 generations, the iGAM algorithm had identified ideal replacements. The progression of the algorithm towards its final maximum is shown below.

Figure 2. Monotonic increase of max fitness value per generation.

Ultimately, this resulted in the following sequence being generated.

Figure 3. ModGIX sequence

The amino acids in red represent modifications introduced by the algorithm that were not present in the hotspots provided by HotSpot Wizard. The blue amino acids indicate hotspots that were not changed. The green amino acid was also located in a hotspot, but was replaced by an amino acid commonly substituted in scientific applications of this protein. With these changes in place, our sequence for a modified 6GIX (ModGIX) was complete. ModGIX shows no direct inhibition of chlorophyll binding. However, to justify the high costs associated with purifying proteins in the lab, we had to generate a proof of concept to earn the confidence of the team.

To gain this confidence, mutagenesis was performed on the PDB structure of 6GIX to generate a usable structural starting point for a dynamics simulation. After a six-nanosecond simulation, an estimate of ModGIX's structure was complete. The resulting structure was then aligned with the original 6GIX structure to identify any catastrophic differences between the two.

Figure 4. ModGIX aligned by proline to 6GIX.

From this, we observed that after simulation the ModGIX protein maintained a tetramer structure and kept the chlorophyll binding pocket at its core. With the HotSpot Wizard and dynamics verification complete, ModGIX was sent to the wet lab for experiments. It is currently being cloned into E. coli. We hope to purify this protein soon and compare its chlorophyll binding efficiency with that of 6GIX.

ModGIX can be found in the registry under part BBa_K3114007.

Future Directions

The key next step for ModGIX and its development strategy is the successful purification and characterization of the ModGIX protein. Once characterization is complete, we will reevaluate the modification strategy to address any potential weaknesses. Beyond refining the strategy itself, applying a ModGIX-style approach to the other proteins in our project will be integral to developing it into a potent solution for protein engineering.


1. Palm, D. M., Agostini, A., Averesch, V., Girr, P., Werwie, M., Takahashi, S., . . . Paulsen, H. (2018). Chlorophyll a/b binding-specificity in water-soluble chlorophyll protein. Nature Plants, 4(11), 920-929.
2. Abraham M.J., van der Spoel D., Lindahl E., Hess B., and the GROMACS development team (2018). GROMACS User Manual version,
3. Tran, N. M. (2008). AN INTRODUCTION TO THEORETICAL PROPERTIES OF FUNCTIONAL PRINCIPAL COMPONENT ANALYSIS. Department of Mathematics and Statistics, The University of Melbourne.
4. Lemkul J.A. (2018). "From Proteins to Perturbed Hamiltonians: A Suite of Tutorials for the GROMACS-2018 Molecular Simulation Package, v1.0" Living J. Comp. Mol. Sci. In Press.
5. Osorio, D., Rondon-Villarreal, P. & Torres, R. (2015). Peptides: A package for data mining of antimicrobial peptides. The R Journal, 7(1), 4-14.
6. Xiongtao D, Pantelis Z. Hadjipantelis, Kynghee H & Hao J (2019). Fdapace: Functional Data Analysis and Empirical Dynamics. R package version 0.4.1.
7. Dempster, A., Laird, N., & Rubin, D. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Serie
8. Páll, S., Abraham, M. J., Kutzner, C., Hess, B., Lindahl, E. (2015). Tackling exascale software challenges in molecular dynamics simulations with GROMACS. In: Markidis, S., Laure, E. (eds.), Solving Software Challenges for Exascale. Vol. 8759. Springer International Publishing Switzerland, 3–27.
9. R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL
10. Sumbalova, L., Stourac, J., Martinek, T., Bednar, D., Damborsky, J. (2018). HotSpot Wizard 3.0: Web Server for Automated Design of Mutations and Smart Libraries based on Sequence Input Information. Nucleic Acids Research 46 (W1): W356-W362.