Team: UNSW Australia

Molecular Dynamic

Analysis of Assemblase

Aims of Molecular Dynamics

Determine the flexibility of our scaffold.
Obtain distance measures between the enzymes attached in our Assemblase system.
Understand the total size of our system.

Molecular Dynamics Theory

It is important to understand what assumptions and simplifications a simulation makes. This allows you to understand the limitations of its representation of a natural phenomenon. Molecular Dynamics (MD) is a computational method for simulating the movement of atoms. A MD simulation is composed of a discrete and finite set of timesteps, in a similar manner to how a film consists of many still frames. The movement of atoms from one timestep to the next is determined by algorithms that characterise every atom’s velocity. An atom’s velocity is based on the forces acting upon it. These forces can be broken down into bonded and non-bonded forces (eq. 1).

E_Total = E_Bonded + E_Non-bonded - Equation 1

E_Bonded = E_Bond + E_Angle + E_Dihedral - Equation 2

E_Non-bonded = E_{Electrostatic} + E_{Van Der Waals} - Equation 3

Bonded forces are modelled as springs (eq. 2). This simplification occurs by treating; a chemical bond as a spring between two spherical masses, the angle between two bonds as a spring or similarly the torision angle between three bonds (fig. 1). Hooke’s law dictates that the force is proportional to the compression or extension of these springs (fig. 1). This model is a simplification in comparison to more accurate empirical energy functions that do a better job of representing the behaviour of quantum mechanics (fig. 2).

Figure 1: The bonded forces of an atom in a MD simulation are springs that exist in three forms. They act according to Hooke's Law; energy and force increase proportionally to the compressing and stretching of the spring away from its equilibrium distance.

Figure 2: The Morse Potential (blue) is an energy potential function that better models the bond energy between two atoms compared to the harmonic oscillator (green) that is used in MD simulations. The harmonic approximation is good for small deviations away from equilibrium values but lacks the asymmetry of the more empirically verified Morse potential.

Alternatively, non-bonded forces are modelled by two equations; the Leonard Jones Potential and Coloumbs Law. The former approximates the effects of Van Der Waals interactions and Pauli repulsion, whilst the later accounts for electrostatic interactions. These forces act on atoms that are not bonded together where the distance between two atoms informs the degree of attraction or repulsion felt by both atoms. However, for both the non-bonded functions, their tendency to attract or repel two atoms approaches zero whilst the distance between the two atoms tends towards positive infinity (fig. 3). To reduce the use of computational resources a cut-off value is commonly employed. This lowers the total number of interactions that the simulation has to account for occurring, by limiting non-bonded interactions to atoms that are within this cut-off proximity of each other.

Figure 3: The Leonard-Jones Potential (LEFT) - The steep gradient to the left of the global minimum reflects Pauli repulsion as the distance between two atoms is reduced, whilst the slow limit towards zero potential energy in the positive direction along the x-axis represents the attractive force of Van Der Waals interactions. There are multiple methods to reduce the computational costs associated with the right-hand side of the Leonard-Jones Potential, such as a cut-off value which sets the potential of two atoms separated by a distance greater than the cut-off equal to zero. Coloumb's Law on the RIGHT - acts in a similar fashion to The Leonards Jones Potential however is dependent on whether the charges of two ions are of the same or opposite signs. In the former Coloumb's law describes repulsion, whilst the later describes attraction. Moreover, in a similar fashion to the Leonard-Jones Potential a cut-off value can be implemented to address the distance between two ions approaching positive infinity.

While the bonded and non-bonded model (eq. 1) is generic for all-atom MD simulations, the intensity of these forces varies greatly between different elements and between different bond types. Force fields account for this disparity, they are a set of functions and parameters informed by experimental data. Different force fields can be chosen at the discretion of the researcher to simulate a diverse range of systems and biomolecules. The two main inputs for a MD simulation is a force field and a structural file of the molecule you wish to simulate. A common structural file format is the pbd file that lists all the atoms in a biomolecule with its Carteisan coordinates and experimental data. A guide on how to find or create the appropriate pdb file for future iGEM projects can be found here. Once the structural file has been inputted and the appropriate force field selected, there are still several steps that need to be performed before a simulation can take place:

A simulation space is defined and undergoes solvation.
The energy of the structure is minimised.
The temperature and pressure of the simulation space are equilibrated in discrete steps
Production MD run - here the simulation takes place
Analysis of production run's trajectory files. The trajectory files describes the movement of every atom at each time step.

Starting Point for Learning MD

A fantastic starting point for any future iGEM teams wishing to learn and incorporate MD analysis into their analysis is the Lysozyme in Water tutorial created by Justin A. Lemkul from the Virginia Tech Department of Biochemistry. This tutorial is the first of many that Lemkul has created to teach one how to best utilise the GROMACS suite to achieve their MD goals¹. GROMACS was used by our team to perform our MD analysis and the Lysozyme tutorial was also very informative for our project.

The need for High Performance Computing for running MD

For future iGEM teams that are seeking to run MD simulations, depending on the aims, it could be highly advised to take advantage of High Performance Computing (HPC). HPC infrastructure is available, upon request, at many Universities and it can make your desired process considerably more manageable or in some cases is necessary. Understandably, it is not difficult to imagine that a simulation of hundreds of thousands of atoms moving over thousands of time steps is a quite computationally intensive task. As a result one should not be surprised for such production run to take hours, if not days or weeks to be completed on a single personal computer. Granted the Assemblase system that our team sought to simulate is a quite large structure relative to other biological structures that undergo all-atom simulations. Regardless, there are unignorable benefits to utilising a computer cluster to perform MD analysis. Certainly, our teams access to UNSW's Katana cluster allowed us to run multiple tasks in parallel to maximise time-efficiency as well as distributing tasks to multiple nodes preventing the need to keep the program running on our own devices.

Distance between Active Sites

When our team started modelling the first thing we did was speak to Alex Mars, the iGEM team member of the 2018 UNSW team who led modelling. We asked him what we could do to improve upon the modelling efforts that the last's years team had done for their Assemblase system. This is where we were advised to improve upon the MD analysis that last year's team had. Truly, the MD work done by the 2018 UNSW was limited to a production run of half the hexamer alone, since the distance vs diffusion mathematical model was their main focus. Therefore our first goal for modelling was to obtain a better distance parameter that better represented the scale between the active sites of our co-localised enzymes.

From our human practices we discovered a second, more sustainable pathway for Paciltaxel production. During the project our lab team became less confident in expressing and assembling this pathway. It therefore, became a priority for our team to model the viability of this novel pathway. In response, we focused our MD analysis on the Assemblase structure for the second pathway.

To find the distance between enzyme active sites, we ran a 3 ns simulation with our full Assemblase structure (details on how we assembled the structure are here). However, there were a few challenges we had to be overcome. Firstly, we needed to amend the structure, by removing selenomethionines which were causing fatal errors to our simulation. These selenomethionines were not present in the sequence we were expressing in the lab, so with some simple scripting we were able to parse through the structure file and replace them with methionine.

The next obstacle we faced was the energy minimisation step, where a short simulation is run to rearrange the atoms in the structure to a more energetically favourable arrangement. Since our Assemblase model was constructed from some manual editing methods this caused some unfavourable bond angles. Furthermore, due to Assemblase’s large size simulating the model was a computationally intensive task. As a result, our progress was halted at this step due to a mixture of different errors. Troubleshooting on online forums enabled our team to overcome this hurdle by performing an energy minimisation run on the structure before as well as, after the solvation of the simulation space. With no water molecules in the simulation space resulting considerably less molecules the overall computational burden was lessened on the first energy minimisation on the structure.

Finally, once we had successfully completed a production run of our Assemblase the command gmx pairdist was used to calculate the distance between the active sites. The command requires an index file that records the location of the atoms, with a numbered label, that you want to find the pairwise distance between. To find the active sites and index them appropriately our group wrote a script that parsed through the simulation’s output, that contained hundreds of thousands of atoms. Due to the sheer size of our simulation space this task could not have feasibly been completed manually.

Figure 4: Viusalistion of 1ns simulation of Assemblase in Pymol suite with distances shown between the actives sites of one LYXL and four DBATS

Figure 5: Each graph shows the distance from one of the LYXL's active sites to each of the four DBAT's on the assemblase system over a 1ns production run

The most striking feature of the distance measures (fig. 5) is how little variation there is over time. This was certainly an unexpected result for our team. This suggested that the structure as a whole was relatively static over the course of the simulation. We expected to see large swinging motions of the Assemblase's hexameric subunits, instead we saw that the vibrating motion of structure was far more noticeable (fig. 4). This unexpectedly restricted range of motion is best explained by the Root Squared Mean Deviation (RMSD) output of the production run (fig. 6). RMSD is a measure of the average distance each atom in structure has deviated from its original position at the start of the simulation (fig. 6). The persistent increasing trend over the simulation's run-time informed us that the simulation did not converge ². This means we did not see the model's full range of motion and that the production run needed to be run for longer. For comparison the RMSD of the 10 ns second simulation of just the hexamer structure (fig. 6) is more akin to what we were expecting to observe - the evening out and deviation around a better-defined range of RMSD. However, as a warning to future iGEMs who decide to utilise MD analysis caution must be taken to determine if convergence was reached over their simulation by viewing the RMSD alone ³. Considering that the run time of the 3 ns simulation of the Assemblase model took 690 hours of real time, to complete a simulation for 10ns, if not longer considering the size difference compared to the hexamer, this would be on a timescales of multiple months. This poses a serious impracticality as a modelling methodology for the iGEM competition, since it leaves no time for the model to produce results timely to inform the project. Regardless, the distance measures approximately 15 nm between the active sites of LYXL and DBAT at any time. This represents a much improved measure compared to previous MD work done by last years team. As such this measure was used as input for the distance vs diffusion kinetic model.

Figure 6: Graphs of RMSD plotted against time for a 3 ns simulation of the full Assemblase model on the LEFT and for a 10ns simulation of the prefoldin hexamer on the RIGHT.

Alternatively, another approach that can be investigated by future iGEM teams who wish to run MD analysis on larger biological systems is the use of coarse-grained MD (CGMD). Indeed, CGMD simplifies the representation of biomolecules to a further extent than all-atom MD allowing longer simulations at the expense of molecular detail⁴. We would recommend using the LAMMPS program and here you can find plenty of learning resources to achieve your simulation goals ⁵. Furthermore, it is possible to use the Visual Molecular Dynamics program to convert PDB structural files into the appropriate LAMMPS input file with the topotool ⁶.

>Figure 7: This diagram was taken from a the review "Molecular dynamics simulations of large macromolecular complexes" from the journal "Current Opinion in Structural Biology".

Size of Assemblase System

When our team was looking to better assess the commercial viability of our system, we became interested in modelling the size of our Assemblase system, to inform a hypothetical recovery process via ultrafiltration. The 3 ns production run was again visualised, but unlike the determination of a distance parameter between the active sites, there was no clear 'end points' to fix our distance measures too. From visual inspection of the structure approximated boundary positions were chosen. This allowed the overall hexagonal impression of Assemblase to be emphasised. These approximated boundary points were paired in a manner that would be expected to maximise their separation, creating the circumference in a regular hexagon (fig. 8). This is in accordance with Euclidean geometry. However, this assumption has obvious limitations, given that the hexameric components are not uniform in sequence identity nor are they stationary in nature. Regardless, this simplification of Assemblase's shape, gave a reasonable methodology to approach to measure its' size. It can be seen that an approximated maximum distance of about 370 angstroms and a minimum distance of 250 angstroms (fig. 9). This estimate however will not likely show the native structure's maximum size due to the previously mentioned limitation of the simulation not running until convergence.

Figure 8: A regular polygon is equiangular, equilateral and concyclic, where all the corners are points on a common circle. So, by extension the maximum distance that the hexagon will be a straight line between opposing corners that is also the circumference of the circumscribed circle - this was reasoning for our teams pairing of our approximated boundary positions.

Figure 9: The 3 ns production viewed in pymol, with distances calculated between the approximated boundary points of Assemblase.

After retrieving size measurements from modelling we sought out expertise on how we could be incorporate our results into a hypothetical Assemblase recovery process. We approached Helene Lebar, the manager of the Recombinant Product Facility (RPF) that provides protein production research for both public and private sectors. In the lab we were shown the cross-flow filtration recovery process that we could hypothetically use for an Assemblase bioreactor, but on a smaller scale than what would be used industrially (fig. 10). Furthermore, Helene told us that most ultrafiltration membranes are sold based on a nominal molecular weight of the protein, however this assumes that protein mass is correlated with protein size. This assumes proteins are uniform in shape, density and flexibility, which in reality is not the case. Thus, our modelling of Assemblase's size is indeed a better estimate for determining the most appropriate pore size. From the minimum size estimate, it was suggested that we would use a 300kDa membrane to retain Assemblase, as that corresponds to a 20nm pore size. Without modelling the size of our Assemblase system we could have made the pore size just large enough to allow Paciltaxel, that is approximately 1.7nm in size, to pass through the membrane. However, the smaller the pore sizes are the slower the filtration process takes to remove the desired product from the bioreactor. By extension our modelling efforts helped inform a faster recovery process.

Figure 10: Lab-scale cross-flow filtration; the Pellicon membrane filter is attached to the mixture of protein and product, beneath a pump forces this mixture up past the membrane filter, the product that is small enough to pass through the membrane's pores exits as filtrate into the bucket, while the protein is fed back into the original reaction mix as retentate.

Flexibilty of Scaffold

MD analysis provided us with the opportunity to better understand the flexibility of our Assemblase system. We hoped to re-affirm previous findings on the hexamer scaffold that the alpha and beta coiled-coil subunits should display high mobility due to a lack of interaction between the subunits⁷.Practically, this validation of Assemblase's superior flexibility would justify our ability to attach larger enzymes, because the structure should consequently be able to spread out in a more planar fashion to minimise the enzymes (conjugated to the scaffold) from crowding. By extension, the greater flexibility would in-effect broaden the total library of enzymes and reaction pathways that the Assemblase system could be repurposed for. Certainly, this objective became of much importance to the team at the start of the project upon discovering on the uniprot that Phenylalanine Aminomutase has a tetrameric structure.

Molecular Dynamics Results

Figure 11: 5ns simulation of the Prefoldin scaffold exppressed with the Spy and Snoop catcher tag systems.

From inspecting a production run of our Assemblase system without any enzymes attached, we saw that the coiled-coil regions of the scaffold began to splay outwards over time, the beta subunit in the front left of fig. 11 is particularly noticeable. We also assumed, from inspecting the RMSD output (fig. 12) of the simulation, that our simulation did not converge, for the same reasons mentioned above. Thus, it is likely that this splaying motion could have been further emphasised given a longer simulation run time. Additionally, we observed a great degree of movement and flexibility from the GSG linker region between the catcher-tag systems and the hexameric subunits (fig. 11). However, despite observing our system's flexibility, we sought more quantifiable data to better understand our system.

Figure 12: RMSD plotted against time for a 5ns simulation of prefoldin expressed with catcher-tag system joined by gsg linkers.

Normal Mode Analysis

Whilst seeking advice from honours student Aiden McMahon-Smith for methods to gain quantitative data from MD to analyse the flexibility of our structure he surprisingly diverted our focus away from MD entirely. Instead, he suggested looking into Normal Mode Analysis (NMA), to get a better understanding of the scaffolds flexibility at different regions along the structure's backbone. The ElNemo server was used to retrieve the data shown below (fig. 13) and the counsel from Prof. Paul Curmi was indispensible to helping our team, who lack a background in physics, make sense of the data^8,9. NMA examines the movement of a biomolecule as normal modes. Normal modes are harmonic oscillations with the same frequency. Higher modes have a higher frequency and thus a shorter period of motion. Prof. Curmi provided a demonstration of dangling the branch of a potted tree in his office as an example to convey the idea. The swaying motion of the whole branch would be described by a lower mode since the distance of the motion is relatively large while the smaller vibrations of twigs and leaves would be very high modes. Since, we were interested in the regions that are the most flexible and would hence show the greatest degree of motion, the location of the atoms belonging to the lowest mode were of more interest to us. Mode 7 was the smallest mode produced by the ElNemo server and the results can be seen in fig. 13 below. Interestingly, the regions of the generic Assemblase model (prefoldin expressed with catcher-tag systems but no enzymes) that demonstrated the greatest degree of periodic motion were the Catcher-Tag systems. Conceptually, this makes sense if one can imagine the catcher-tag system being analogous to a ball on a string, the string being the corresponding alpha or beta subunit and the joining gsg linker regions. The period of the ball will share the same direction of motion as any point on the string but will cover the largest distance. In terms of how this affects the function of our system it is a positive result. The enzymes that are colocalised on our Assemblase system are being conjugated at the site that experiences the greatest mobility and hence can move in a fashion that accommodates their size and shape.

Figure 13: The graph on the LEFT shows the R² values of all the atoms included in mode 7, the lowest mode from the ElNemo output.To RIGHT the corresponding model that was the input for ElNemo has the residues highlighted that are in regions of significantly high R² values. Regions highlighted in purple are the two large peaks on the graph, whereas the pink regions are the four smaller peaks.

References

Lindahl, E., Hess, B. & van der Spoel, D. J Mol Model (2001) 7: 306. https://doi.org/10.1007/s008940100045
W. Schreiner, R. Karch, B. Knapp, and N. Ilieva, Relaxation Estimation of RMSD in Molecular Dynamics Immunosimulations. Computational and Mathematical Methods in Medicine. 2012
B. Knapp, S. Frantal, M. Cibena, 1 W. Schriener, and P. Bauer, Is an Intuitive Convergence Definition of Molecular Dynamics Simulations Solely Based on the Root Mean Square Deviation Possible? Journal of Computaional Biology. Volume 18, Number 8, Pp. 997–1005 2011
J.R. Perilla, B.C. Goh, C.K. Cassidy, B. Liu, R.C. Bernardi, T. Rudack, H. Yu, Z. Wu and K. Schulten, Molecular dynamics simulations of large macromolecular complexes. Current Opinion in Structural Biology. 31: pp.64-74 2015
S. Plimpton, Fast Parallel Algorithms for Short-Range Molecular Dynamics, J Comp Phys, 117, 1-19 (1995)
Humphrey, W., Dalke, A. and Schulten, K., "VMD - Visual Molecular Dynamics", J. Molec. Graphics, 1996, vol. 14, pp. 33-38.
R. Siegert, M.R. Leroux, C. Scheufler, F.U. Hartl, and I. Moarefi, Structure of the Molecular Chaperone Prefoldin: Unique Interaction of Multiple Coiled Coil Tentacles with Unfolded Proteins. Cell, Vol. 103, 621–632, November 10, 2000
K. Suhre & Y.H. Sanejouand, ElNemo: a normal mode web-server for protein movement analysis and the generation of templates for molecular replacement. Nucleic Acids Research, 32, W610-W614, 2004.
K. Suhre & Y.H. Sanejouand, On the potential of normal mode analysis for solving difficult molecular replacement problems. Acta Cryst. D vol.60, p796-799, 2004 © International Union of Crystallography.

unsw.igem@gmail.com

/unswigemteam

@unsw_igem

Team:UNSW Australia/Molecular Dynamics