Team:UNSW Australia/ModellingGuide


How to Create 3D Structure Files of Protein Systems

Overview

Creating a 3D model of a protein system can be an invaluable asset for a synthetic biologist. By visualising the system's structure, one can hope to improve one's understanding of its cognate function[1]. Most researchers store and view their protein models in the form of a pdb file. The file lists all the atoms in a molecule, paired with their Cartesian coordinates and other experimentally determined data. For more information, the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB) has a guide to understanding pdb data. A pdb file is also the primary input for molecular dynamics analysis, which was our team's main modelling effort. Here we have collated a detailed explanation of the methods our team employed to create the 3D structure of our Assemblase system. Our aim is for these instructions to provide a reference point for future iGEM teams that wish to create pdb files of their own protein systems. This guide also acts as a non-intimidating entry point to the vast array of free software and web-tools that are at a synthetic biologist's disposal for structural prediction, evaluation, editing and viewing.
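To make the file layout concrete, the sketch below reads the fixed-width ATOM and HETATM records of a pdb file using the column positions defined in the PDB format specification. The function names and the sample line are our own illustrations (the coordinates are made up, not taken from any of our models), and the sketch ignores every other record type.

```python
def parse_atom_line(line):
    """Parse one fixed-width ATOM/HETATM record into a dictionary.

    Slices follow the PDB format columns (1-based): serial 7-11,
    atom name 13-16, residue name 18-20, chain 22, residue number 23-26,
    x 31-38, y 39-46, z 47-54.
    """
    return {
        "serial": int(line[6:11]),
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "resseq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


def read_atoms(lines):
    """Return parsed atoms for every ATOM or HETATM line in an iterable."""
    return [
        parse_atom_line(line)
        for line in lines
        if line.startswith(("ATOM", "HETATM"))
    ]


# Illustrative record only; the values are hypothetical.
SAMPLE = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"
```

Passing an open file handle to `read_atoms` returns one dictionary per atom, which is enough to count residues or measure distances before loading the structure into a viewer.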

1. Collecting component protein parts from the Protein Data Bank

To create the 3D models of our two Assemblase systems, the pdb files of the component parts were first collected from the Protein Data Bank. These included the prefoldin hexamer scaffold, the Catcher-Tag systems and the two pairs of enzymes that are co-localised in our Assemblase systems. The Protein Data Bank is a database of structure files derived from X-ray crystallography, NMR spectroscopy or cryo-electron microscopy[2]. Almost all the parts needed to construct our Assemblase structures had pdb files in the database:

  • 1FXK: Crystal structure of archaeal prefoldin. Be sure to download 1fxk.pdb1 (the biological assembly) for the full hexamer structure rather than 1fxk.pdb, which contains only half of it (1 alpha and 2 beta subunits).
  • 4MLI: Crystal structure of the SpyTag/SpyCatcher complex.
  • 2WW8: Structure of the pilus adhesin (RrgA) from Streptococcus pneumoniae. This structure contains the Snoop Catcher-Tag system, which had to be retrieved by deleting the unwanted residues in PyMOL (see section 5).
  • 5N82: Crystal structure of an engineered TycA variant in complex with a beta-Phe-AMP analog.
  • 4BAA: Redesign of a Phenylalanine Aminomutase into a beta-Phenylalanine Ammonia Lyase.
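The entries above can also be fetched from a script rather than through the website. The following is a minimal sketch, assuming the RCSB file service serves entries at `https://files.rcsb.org/download/<ID>.pdb` and biological assemblies at `<ID>.pdb1`; that URL pattern is our assumption and should be checked against the current RCSB documentation. The helper names `pdb_url` and `fetch` are hypothetical.

```python
from urllib.request import urlretrieve

# Assumed base URL of the RCSB file download service.
RCSB_DOWNLOAD = "https://files.rcsb.org/download"


def pdb_url(pdb_id, assembly=None):
    """Build the download URL for an entry or one of its biological assemblies."""
    suffix = "pdb" if assembly is None else f"pdb{assembly}"
    return f"{RCSB_DOWNLOAD}/{pdb_id.upper()}.{suffix}"


def fetch(pdb_id, assembly=None):
    """Download the file into the working directory and return its name."""
    url = pdb_url(pdb_id, assembly)
    filename = url.rsplit("/", 1)[-1]
    urlretrieve(url, filename)  # requires network access
    return filename


# Example usage (not run on import):
#   fetch("1FXK", assembly=1)   # full prefoldin hexamer
#   for entry in ("4MLI", "2WW8", "5N82", "4BAA"):
#       fetch(entry)
```

Keeping the assembly number as an explicit argument makes it harder to download the half-structure of 1FXK by mistake.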

2. Creating predicted models of proteins whose structures have not been determined experimentally

Protein Structure Prediction Overview

Depending on the nature of your protein system, this may be the only step that you need to worry about. Fortunately, there are many free software tools to choose from for creating a predicted model of a protein's tertiary structure. The methodologies employed by these tools can be categorised as either template-based or de novo protein structure prediction[3]. As the name implies, template-based methods, such as homology modelling, rely on a template: a protein whose structure has been experimentally determined and whose amino acid sequence is similar to that of the query sequence to be modelled. The predicted model takes on the backbone structure of the template, so the efficacy of template-based methods is largely dependent on the quality of the template. The higher the sequence similarity between template and query, the more likely it is that the query's tertiary structure really does mimic that of the template[4]. Alternatively, de novo or ab initio methods of protein structure prediction do not produce models based on a template. Instead, ab initio methods take one of two approaches. The first is an algorithmic approach that uses scoring functions and available structural and sequence data to construct a model from a combination of highly probable local conformations. The second is to simulate the folding of the protein using methods similar to molecular dynamics[5]. Thankfully for a synthetic biologist, it is not necessary to understand in depth how these continually improving tools work. Most structure prediction web-tools require only a query sequence as input and return a pdb file of the predicted structure.

Our approach

7-β-Xylosidase and 10-β-Acetyltransferase (LXYL and DBAT) from our second pathway did not have experimentally determined structures on the PDB, which prompted our investigation into structure prediction software. Thankfully, Ping Zhu's team had already constructed a predicted model of DBAT using the SWISS-MODEL homology modelling web-tool, with the entry 5kjt.1 as a template [6]; this predicted structure is shown in fig. 1. For LXYL, however, we had to create our own predicted structure and assess its quality. We chose the I-TASSER server over SWISS-MODEL because the available templates lacked sufficient coverage of our query sequence (they aligned with only a short subset of it). This mattered because our later molecular dynamics analysis required a model that accurately portrayed the size of the enzymes attached to the scaffold. With SWISS-MODEL, only the parts of LXYL covered by the template would have been modelled, whereas I-TASSER produces a model that covers the whole query sequence, not just the regions that match a template. It does so by combining template-based threading with ab initio simulation methods (see the I-Tasser references below).

Figure 1: The predicted model of the mutant DBAT created by Ping Zhu's team using SWISS-MODEL - a freely available homology modelling web-tool.

Figure 2: Our team's predicted model of LXYL, created using Zhang's I-TASSER server.

3. Assess the quality of predicted models and make refinements

Without knowing the protein's native structure you cannot determine the accuracy of a predicted model. There are, however, tools available to estimate the quality of a protein model. These tools employ stereochemistry checks, statistical potentials, machine learning methods and molecular-mechanics energy-based functions to assess the model's quality[7]. Model refinement tools, by contrast, employ ab initio methods to minimise the energy of template-based predicted models[8].
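As an illustration of what a stereochemistry check works with, Ramachandran-style analyses rely on backbone torsion (dihedral) angles computed from four consecutive atom positions. The sketch below shows that calculation in plain Python. It is not the code used by PROCHECK or MolProbity, and the function name `dihedral` and its argument order are our own choices.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 bond, in the range -180 to 180."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond.
    length = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / length for x in b1)
    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))
```

For a backbone phi angle, the four points would be the C of the previous residue followed by the N, CA and C of the current residue; psi uses N, CA, C and the N of the next residue.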

The two quality assessment tools our team employed were SAVES and MolProbity. The former is a combination of five different scoring tools (fig. 3), whereas the latter, developed by the Richardson group at Duke University, provides a summary of scoring statistics (fig. 4). MolProbity also provides a single combined score, the MolProbity score, which makes it easy to directly compare models produced by different methods.

Figure 3: SAVES results for the I-TASSER predicted model.

Figure 4: MolProbity results for the I-TASSER predicted model.

The scores for our initial I-TASSER model were less than desirable. Its MolProbity score fell in only the 19th percentile (fig. 4), and only one of the nine PROCHECK tests in our SAVES output was passed (fig. 3). With scores this poor, it would be advisable to seek out and test other prediction methodologies unless refinement improves the model substantially. To refine our model we used the GalaxyRefine server, which offers either template-based prediction with refinement or refinement alone of a supplied model using ab initio methods. The server usually takes about 1-2 hours to complete a refinement, depending on the structure. GalaxyRefine returns five refined models with associated statistics to help the user decide which to progress with (fig. 5). Our team prioritised the refined I-TASSER model with the lowest (best) MolProbity score (fig. 5). This refined model was submitted to SAVES and MolProbity to measure the improvement over the original model (figs. 6 and 7). Fig. 7 shows a positive outcome: the model's MolProbity percentile rose from the 19th to the 57th.

Figure 5: GalaxyWeb results for the I-TASSER predicted model, showing five refined models with varying degrees of improvement.

Figure 6: SAVES results for the GalaxyWeb refined I-TASSER predicted model.

Figure 7: MolProbity results for the GalaxyWeb refined I-TASSER predicted model.

4. Build Protein System From Component Parts

In synthetic biology it is routine to make additions, deletions and general amendments to biological parts. When constructing an in silico model of such synthetic structures, visualisation tools such as Chimera and PyMOL are very helpful. For example, to improve the extraction of our proteins we expressed His-Tags on their C-termini. To mirror this in our models we built the six-histidine chain in Chimera, a freely available visualisation and analysis tool for molecular structures.

Steps to join models via a peptide bond in Chimera

  1. Load both models into the same Chimera session: click the 'File' tab, choose 'Open...' and select the pdb files you wish to join. In fig. 8 two histidine residues (blue and yellow) are loaded.
  2. Select the nitrogen and carboxyl carbon atoms between which you wish to make the bond. This step requires you to see the atomic structure. If your models are represented as ribbons, click the 'Actions' tab, 'hide' the 'Ribbon' representation and 'show' the 'Atoms/Bonds'. Make the first selection with Ctrl+Click; the selected atom will be highlighted in green. Make subsequent selections with Ctrl+Shift+Click, otherwise the previous selection(s) will be replaced by the new Ctrl+Click (this applies even if you click on empty space, so be careful!).
  3. Once the relevant carbon and nitrogen are both selected (they must be on different models), go to the 'Tools' tab, scroll down to 'Structure Editing' and select 'Build Structure'.
  4. The previous step opens a new tab in the top panel. Select the 'Join Models' option so that you get a tab like the one seen in fig. 9. If the right selections have been made and the 'C-N peptide bond' option has been chosen, you can press the 'Apply' button and the desired peptide bond is created instantly.
  5. The 'Build Structure' tool has options to edit the bond angles around your newly created peptide bond. If you are satisfied with the new combined model and wish to save it as a new pdb file, go to the 'File' tab and select 'Save PDB...', where you can name the file and choose a suitable storage location on your computer.

Our team learned to use this tool in Chimera with the aid of Daniel Winter's instructional PDF, which you can access here.
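After joining models it can be worth confirming that the saved file really contains a continuous backbone. The sketch below is one hedged way to do so outside Chimera: it measures the distance between the carbonyl C of each residue and the amide N of the next, and reports pairs that deviate from a typical peptide bond length of about 1.33 Å. The function name, the input layout and the tolerance are assumptions chosen for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math

IDEAL_CN = 1.33  # typical peptide C-N bond length in angstroms


def find_chain_breaks(atoms, tolerance=0.15):
    """Return (residue, next_residue, distance) for suspicious C-N links.

    atoms: list of (residue_number, atom_name, (x, y, z)) for a single chain.
    tolerance: allowed deviation from IDEAL_CN in angstroms (assumed value).
    """
    carbonyl_c = {res: xyz for res, name, xyz in atoms if name == "C"}
    amide_n = {res: xyz for res, name, xyz in atoms if name == "N"}
    breaks = []
    for res in sorted(carbonyl_c):
        nxt = res + 1
        if nxt in amide_n:
            distance = math.dist(carbonyl_c[res], amide_n[nxt])
            if abs(distance - IDEAL_CN) > tolerance:
                breaks.append((res, nxt, round(distance, 2)))
    return breaks
```

An empty result means every consecutive residue pair that was checked sits at a plausible bond distance; it does not check bond angles, which can still be adjusted in 'Build Structure'.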

Figure 8: Two histidine pdb files loaded into the same Chimera session. The N-terminus has been selected on the yellow model, whilst the C-terminus has been selected on the blue model.

Figure 9: The 'Build Structure' - 'Join Models' tool is open. With the proper selections made and the 'C-N peptide bond' option selected, the 'Apply' button becomes available.

In addition to creating the His-Tags for our proteins, this method was used to join our Spy and Snoop Catcher-Tag systems to our prefoldin scaffold and to our enzymes, either PAM and TycA or LXYL and DBAT. This tool was therefore critical to building our Assemblase systems (fig. 10).

Figure 10: The first Assemblase model was created using the methods outlined above. The dark and light blue regions are a bioconjugate of alpha prefoldin, the Spy Tag-Catcher system and the predicted model of LXYL, whereas the four remaining coloured regions are bioconjugates consisting of beta prefoldin, the Snoop Tag-Catcher system and the mutated DBAT predicted model.

5. Viewing protein models and deleting unwanted residues in PyMOL

Whilst Chimera offers many of the same viewing options for your molecular structures, one feature of PyMOL that made it worthwhile to use both platforms in parallel is the ability to see the amino acid sequence of your proteins above the structure's viewing space (fig. 11). This feature made the task of isolating single residues, bonds or atoms in the large hexameric structures of our Assemblase systems considerably easier, especially because selected residues are highlighted both on the structure and on the sequence bar (fig. 11). Selecting residues on the sequence bar also made it relatively easy to delete regions of the amino acid chain from the structure. This was done by left-clicking on the sequence to select the residues and then right-clicking on the selected region to display the option to 'remove' the selection (fig. 12). Our group used this, for example, to isolate the region of the Snoop Tag and Catcher system that we were expressing from the larger structural file retrieved from the PDB (fig. 13). This isolation involved deleting not only the majority of the peptide chain but also the unwanted ions and water molecules (fig. 13).
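The same kind of trimming can be approximated without a viewer by filtering the lines of the pdb file. The sketch below is not PyMOL's own remove function: it is a plain-Python stand-in that keeps only the ATOM records of one chain within a residue range and drops everything else, including HETATM records such as waters and ions. The function name `keep_region` and any chain or residue numbers passed to it are hypothetical, and the real boundaries should be read off the sequence bar as described above.

```python
def keep_region(pdb_lines, chain, first_res, last_res):
    """Keep ATOM records of `chain` with residue numbers in [first_res, last_res].

    All other lines are dropped, including HETATM records (waters, ions,
    ligands and any modified residues stored as heteroatoms).
    """
    kept = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        # Fixed-width columns: chain ID is column 22, residue number 23-26.
        in_chain = line[21] == chain
        resseq = int(line[22:26])
        if in_chain and first_res <= resseq <= last_res:
            kept.append(line)
    return kept
```

Writing the returned lines to a new file, followed by an END line, gives a trimmed structure that can be reloaded in PyMOL or Chimera to confirm that the right region was kept.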

Figure 11: View of Pymol workspace with the SnoopCatcher-SnoopTag model loaded in the workspace. The model's sequence is present at the top left portion of the image. The green highlighted residues correspond to the pink selected regions on the model itself.

Figure 12: View of the option to remove selected residues from the model, displayed by right-clicking on one of the selected residues.

Figure 13: View of PDB entry 2WW8, which contains the Snoop Tag-Catcher system that needed to be isolated for our Assemblase model. The selected regions are not expressed in our system and were removed.

Concluding Remarks

The creation of 3D models for the Assemblase systems and their component parts was, at many times, an indispensable tool for guiding our project. However, the model used for our molecular dynamics analysis could have been better had a guide such as this one been available to us. In hindsight, the quality of our predicted model for LXYL in particular could have been greatly improved. Further investigation into different prediction tools did lead to better results, but this came too late in our project's timeline, when our models were already undergoing molecular dynamics simulation. The Robetta server, which provides the Rosetta ab initio prediction tool, produced a higher-quality model of LXYL than I-TASSER. It passed more WHAT_CHECK tests and its MolProbity score ranked in the 99th percentile. For future teams wishing to create predicted models for their systems, we recommend visiting the Critical Assessment of protein Structure Prediction (CASP) website to find the most reputable prediction suites. CASP is a competition held every two years to encourage the development of better protein structure prediction tools.

Figure 14: SAVES results for the Robetta model.

Figure 15: MolProbity results for the Robetta model.

In summary, our team hopes that this page will help guide future iGEM teams to recreate models of their protein systems in silico, no matter the complexity of their design. The means exist to search for existing models, predict the structure of proteins for which no model exists, cut and paste different biological parts together, and make refinements where appropriate. The tools mentioned above, particularly those that predict tertiary structure from a sequence, are improving in accuracy every year. This progress is driven by the marriage of accumulating experimental data with computational learning techniques, and these tools will likely grow in importance to all synthetic biologists.

References

  1. https://www.ncbi.nlm.nih.gov/books/NBK26820/
  2. http://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/introduction
  3. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2680823/
  4. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1166865/
  5. J Lee, S Wu, Y Zhang. Ab initio protein structure prediction. From Protein Structure to Function with Bioinformatics, Chapter 1, Edited by D. J. Rigden, (Springer-London, 2009), P. 1-26.
  6. Li, B.-J. et al. Improving 10-deacetylbaccatin III-10-b-O-acetyltransferase catalytic fitness for Taxol production. Nat. Commun. 8, 15544 doi: 10.1038/ncomms15544 (2017).
  7. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2808711/
  8. https://onlinelibrary.wiley.com/doi/full/10.1002/prot.24858

I-Tasser

  1. Roy, A., Kucukural, A., and Zhang, Y. (2010). I-TASSER: a unified platform for automated protein structure and function prediction. Nature Protocols, 5, pp.725-738
  2. Yang, J., Yan, R., Roy, A., Xu, D., Poisson, J., and Zhang, Y. (2015). The I-TASSER Suite: Protein structure and function prediction. Nature Methods, 12, pp.7-8
  3. Yang, J., and Zhang, Y. (2015). I-TASSER server: new development for protein structure and function predictions. Nucleic Acids Research, 43, pp.174-181

Swiss-Model

  1. Arnold, K., Bordoli, L., Kopp, J. and Schwede, T. (2005). The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics, 22(2), pp.195-201.
  2. Benkert, P., Biasini, M. and Schwede, T. (2011). Toward the estimation of the absolute quality of individual protein structure models. Bioinformatics, 27(3), pp.343-350.
  3. Bienert, S., Waterhouse, A., de Beer, T., Tauriello, G., Studer, G., Bordoli, L. and Schwede, T. (2016). The SWISS-MODEL Repository—new features and functionality. Nucleic Acids Research, 45(D1), pp.D313-D319.
  4. Waterhouse, A., Bertoni, M., Bienert, S., Studer, G., Tauriello, G., Gumienny, R., Heer, F., de Beer, T., Rempfer, C., Bordoli, L., Lepore, R. and Schwede, T. (2018). SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Research.

SAVES

  1. Bowie, J., Luthy, R. and Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253(5016), pp.164-170.
  2. Colovos, C. and Yeates, T. (1993). Verification of protein structures: Patterns of nonbonded atomic interactions. Protein Science, 2(9), pp.1511-1519.
  3. https://swift.cmbi.umcn.nl/gv/refs/. (n.d.). WHAT_CHECK Reference List. [online] Available at: https://swift.cmbi.umcn.nl/gv/whatcheck/ [Accessed 4 August 2019].
  4. Laskowski, R., MacArthur, M., Moss, D. and Thornton, J. (1993). PROCHECK: a program to check the stereochemical quality of protein structures. Journal of Applied Crystallography, 26(2), pp.283-291.
  5. Pontius, J., Richelle, J. and Wodak, S. (1996). Deviations from Standard Atomic Volumes as a Quality Measure for Protein Crystal Structures. Journal of Molecular Biology, 264(1), pp.121-136

MolProbity

  1. Davis, I., Leaver-Fay, A., Chen, V., Block, J., Kapral, G., Wang, X., Murray, L., Arendall, W., Snoeyink, J., Richardson, J. and Richardson, D. (2007). MolProbity: all-atom contacts and structure validation for proteins and nucleic acids. Nucleic Acids Research, 35(Web Server), pp.W375-W383.
  2. Chen, V., Arendall, W., Headd, J., Keedy, D., Immormino, R., Kapral, G., Murray, L., Richardson, J. and Richardson, D. (2009). MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallographica Section D Biological Crystallography, 66(1), pp.12-21.

Galaxy Refine

  1. J. Ko, H. Park, L. Heo, and C. Seok. (2012). GalaxyWEB server for protein structure prediction and refinement. Nucleic Acids Res. 40 (1), pp.294-297.
  2. W. -H. Shin, G. R. Lee, L. Heo, H. Lee, and C. Seok. (2014). Prediction of Protein Structure and Interaction by GALAXY protein modeling programs. Bio Design, 2 (1), pp.1-11.