Determining pathogenicity of the Xanthomonas species
By: Dennie te Molder
This subproject started off a little differently from the way it ended up. At first, the idea was to do in-depth research on the Xylella fastidiosa genome in order to find new targets for biocontrol strategies. Later in our project, we decided not to work on X. fastidiosa because of safety considerations. This weakened the link between this subproject and the overall Xylencer project. I continued to build the pipeline for X. fastidiosa while looking for a more suitable opportunity to leverage the existing pipeline. This new opportunity came with the introduction of the phage delivery bacterium (PDB), which required a non-pathogenic, highly related bacterium. We realized that we could adapt and extend the pipeline built for X. fastidiosa to find the best possible candidate for the PDB. It turned out that the Xanthomonas species were the most promising candidates, but we needed more certainty on the non-pathogenicity of certain strains. Thus, this project set out to determine the non-pathogenicity of Xanthomonas species. If you want to know the outcome of this subproject, read the results page.
May
Week 1 (13th of May – 19th of May)
Literature research and start of proposal writing.
Week 2 (20th of May – 26th of May)
Continued literature research and proposal writing.
Week 3 (27th of May – 2nd of June)
Set up my computer, installed software and finished proposal writing. The idea of the subproject at this point in time was to retrieve all of the publicly available X. fastidiosa genomes and reannotate them to ensure consistency. We would then compare the core genome of X. fastidiosa to a related non-pathogenic bacterium and analyze the genes that are not shared between the two groups to find genes important for pathogenicity. This would be complemented by metabolic modeling of X. fastidiosa.
June
Week 4 (3rd of June – 9th of June)
Collected 55 Xylella genomes from the NCBI database. Set up the SAPP platform and started re-annotating the genomes with genes and protein domains.
I also got in contact with the biointeractions research group (WUR), who are currently sequencing already existing X. fastidiosa genomes that were sampled at two points in time, which allows us to observe mutations that occur over a short period of time. They offered us the raw data so we could analyze it with our own pipeline. We intend to use this data to select optimal primers for the detection of X. fastidiosa in our detection device.
Week 5 (10th of June – 16th of June)
Finished annotating the genomes with domains. Built a SPARQL database with GraphDB and uploaded the genomes to this database. Learned the basics of the SPARQL language and explored the best options for interfacing with this database.
Week 6 (17th of June – 23rd of June)
Started analyzing the genomes with an R script that queries the database directly. Constructed queries to obtain basic statistics like genome sizes, number of genes and number of domains. Turns out 11 of the 55 genomes have a size that is completely outside of the expected range (all measuring less than 60 kB). Most likely cause: the plasmids that were assembled together with the genomes were linked to the genome accession in NCBI instead of the genomes themselves.
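The size check described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the R script that was actually used; the record names, the sizes and the reading of the cutoff as 60,000 are all assumptions made for the example.

```python
# Hypothetical genome records and sizes; names and values are illustrative only.
genome_sizes = {
    "genome_A": 2_500_000,
    "genome_B": 2_700_000,
    "genome_C": 45_000,  # plasmid-sized record that was filed as a genome
    "genome_D": 2_600_000,
}


def flag_small_genomes(sizes, min_size):
    """Return the names of records whose size falls below min_size, sorted by name."""
    return sorted(name for name, size in sizes.items() if size < min_size)


# Assumed cutoff: anything below 60,000 is treated as suspicious.
suspicious = flag_small_genomes(genome_sizes, min_size=60_000)
print(suspicious)
```

Records flagged this way would then be candidates for manual retrieval and reannotation.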
Week 7 (24th of June – 30th of June)
This was the week of the deadline for the descriptions page on our wiki. Worked all week on building a template for the website and constructing the descriptions page and temporary home page.
I also obtained the first set of reads from biointeractions, but I still need to wait for the other half to do a high-quality assembly.
July
Week 8 (1st of July – 7th of July)
Manually retrieved and reannotated the genomes that were previously incorrect. Created the functionality to directly interface with the NCBI database to retrieve genome information automatically. Wrote a simple parsing algorithm to get out genus, species, and strain.
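A parser of this kind could look something like the sketch below. It assumes the organism name arrives as one whitespace-separated string with the genus first and the species second, which is not confirmed by the notes above; the function name and the handling of the "strain" keyword are hypothetical.

```python
def parse_organism_name(name):
    """Split an organism name into genus, species and strain.

    Assumes the layout "<genus> <species> [strain] <strain name>".
    Anything after the species (including a subspecies) ends up in 'strain'.
    """
    tokens = name.split()
    genus = tokens[0] if tokens else ""
    species = tokens[1] if len(tokens) > 1 else ""
    rest = tokens[2:]
    # Drop a leading "strain" or "str." keyword if one is present.
    if rest and rest[0].lower() in {"strain", "str."}:
        rest = rest[1:]
    return {"genus": genus, "species": species, "strain": " ".join(rest)}


parsed = parse_organism_name("Xylella fastidiosa Temecula1")
print(parsed)
```

A real organism field often contains subspecies or pathovar labels, so a production version would need extra rules for those.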
Week 9 (8th of July – 14th of July)
Built the domain presence/absence matrix and made PCA plots and dendrograms. This showed that the X. fastidiosa subspecies pauca has two very distinct groups that correspond to the American and European lineages of the subspecies. Interestingly, the European lineage seemed to be more related to the other subspecies, multiplex and fastidiosa, than to the American lineage of the same subspecies, indicating that it might be more correct to split up the pauca subspecies into two new subspecies.
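To illustrate what a domain presence/absence matrix is, here is a small Python sketch. The genome names and domain identifiers are made up, and the Jaccard distance is my assumption for a measure a dendrogram could be built on; the notes do not state which distance was actually used.

```python
def presence_absence_matrix(genome_domains):
    """Build a 0/1 row per genome over the sorted union of all domains."""
    all_domains = sorted(set().union(*genome_domains.values()))
    matrix = {
        genome: [1 if domain in domains else 0 for domain in all_domains]
        for genome, domains in genome_domains.items()
    }
    return all_domains, matrix


def jaccard_distance(row_a, row_b):
    """1 - (domains present in both) / (domains present in at least one)."""
    shared = sum(1 for a, b in zip(row_a, row_b) if a and b)
    either = sum(1 for a, b in zip(row_a, row_b) if a or b)
    return 1.0 - shared / either if either else 0.0


# Hypothetical genomes with hypothetical domain identifiers.
genome_domains = {
    "g1": {"D1", "D2", "D3"},
    "g2": {"D2", "D3", "D4"},
    "g3": {"D1", "D2", "D3"},
}
columns, matrix = presence_absence_matrix(genome_domains)
print(columns)
print(matrix["g1"], matrix["g2"])
print(jaccard_distance(matrix["g1"], matrix["g2"]))
```

The rows of such a matrix can be fed to a PCA, and the pairwise distances to a hierarchical clustering.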
Week 10 (15th of July – 21st of July)
We decided to adopt the material design philosophy for our project. Redesigned the wiki from the ground up to be in line with our new design philosophy.
Week 11 (22nd of July – 28th of July)
Started the comparison of Xylella genomes to non-pathogenic Xanthomonas strains to eliminate genes that are not important for pathogenicity. This was achieved through pan- and core-domainome analysis of both groups.
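As a rough sketch of the idea: the pan-domainome of a group is the union of the domains of all its genomes, and the core domainome is their intersection. One possible way to shortlist candidates, assumed here and not necessarily the exact procedure used, is to take the core of the pathogen group and subtract the pan of the non-pathogen group. All domain identifiers below are hypothetical.

```python
def pan_domainome(genomes):
    """Domains found in at least one genome of the group."""
    return set().union(*genomes)


def core_domainome(genomes):
    """Domains found in every genome of the group."""
    genomes = list(genomes)
    return set.intersection(*genomes) if genomes else set()


# Hypothetical domain sets per genome.
xylella = [{"D1", "D2", "D3"}, {"D1", "D2", "D4"}]
xanthomonas_nonpathogenic = [{"D1", "D5"}, {"D1", "D6"}]

# Domains shared by all pathogen genomes and absent from every non-pathogen genome.
candidates = core_domainome(xylella) - pan_domainome(xanthomonas_nonpathogenic)
print(candidates)
```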
Week 12 (29th of July – 4th of August)
Realized that non-pathogenicity for the Xanthomonas strains is ill-defined. Need to obtain a higher-quality set of non-pathogenic Xanthomonas genomes. No such database is known in the literature, nor is there a single paper summarizing this information. Decided to manually curate this dataset from literature reports and to analyze the reliability of the set based on genome clustering analysis. This information is also crucial for selecting our phage delivery bacterium, and obtaining a candidate Xanthomonas strain for the PDB is now the main aim of the subproject.
August
Week 13 (5th of August – 11th of August)
Before spending a lot of time on literature, I decided to see if I could quickly get an overview of the natural groupings within the Xanthomonas species. Maybe the set of non-pathogens naturally clusters away from the pathogens. To analyze this, I copied an already existing database from our department and started adapting the script to work with its different database structure. I did observe two distinct groups, one with an average domain copy number of 2.5 and one with an average domain copy number of 3.5. Could this be related to pathogenicity?
Week 14 (12th of August – 18th of August)
Looked through the results more carefully, and it seems that the grouping arises from an inconsistency in the database structure generated by SAPP. Adapted my queries to account for this inconsistency; now the group with the higher copy numbers is gone. Since no natural grouping can be observed in the dataset, I decided to try a machine learning approach to find a metric that separates non-pathogens from pathogens. This requires a high-quality set of confirmed non-pathogens. To curate this dataset, I started literature research on the pathogenicity of Xanthomonas.
Week 15 (19th of August – 25th of August)
Obtained the first set of 90 strains reported as non-pathogenic in the literature. It turned out that only 22 of these strains had a publicly available genome. This is too few to train a model, so I continued with more literature research.
I also received the final set of reads from biointeractions for the mutation analysis of X. fastidiosa. This is on hold until I have more time.
Wrote the wiki coding guide and gave a "how-to wiki" crash course to the wiki team.
Week 16 (26th of August – 1st of September)
Vacation
September
Week 17 (2nd of September – 8th of September)
Turns out that only 7 of the 22 non-pathogenic genomes are in the current database. Rebuilt the database by retrieving all 1397 available Xanthomonas genomes from NCBI. Also obtained 5 new genomes for X. fastidiosa. Re-annotated all genomes with the pipeline and uploaded them to the database. It also occurred to me that I should have kept track of confirmed pathogenic bacteria, so I am also reading literature to select these.
Week 18 (9th of September – 15th of September)
Expanded the list of non-pathogens from 22 to 44 genomes, but also decided to be stricter on the criteria for a non-pathogen, removing all genomes with ambiguous reports. This resulted in a set of 34 non-pathogens. Also curated a set of 70 confirmed pathogenic bacteria, resulting in a total set of 104 Xanthomonas strains with known pathogenicity.
Week 19 (16th of September – 22nd of September)
Built the first random forest model. Discussions with my supervisors led to the realization that I should correct for the imbalance in the dataset when sampling the training set. Redesigned the model building procedure to incorporate downsampling.
Started assembling the reads from biointeractions and performed quality correction of the assemblies.
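Downsampling for a balanced training set can be sketched as below. This is a minimal Python illustration and not the actual model building procedure: it only shows the sampling step, uses the class sizes from the curated set (34 non-pathogens, 70 pathogens), and the function name and seed are hypothetical.

```python
import random


def downsample_balanced(labels, rng):
    """Return sorted indices with every class reduced to the size of the smallest class."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    smallest = min(len(indices) for indices in by_class.values())
    selected = []
    for indices in by_class.values():
        # Sample without replacement so no strain appears twice.
        selected.extend(rng.sample(indices, smallest))
    return sorted(selected)


# Class sizes taken from the curated dataset; the seed is arbitrary.
labels = ["non-pathogen"] * 34 + ["pathogen"] * 70
train_idx = downsample_balanced(labels, random.Random(42))
print(len(train_idx))
```

Repeating the sampling with different seeds gives different subsets of the majority class, so no pathogen genome is permanently discarded.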
Wrote the wiki writing style guide.
Week 20 (23rd of September – 29th of September)
Presented progress to the department. It was brought to my attention that a reliable model depends on a closed pan-domainome in the pathogenicity dataset. Wrote a script for pan-domainome analysis; it turns out the pan-domainome is closed.
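One simple way to look at pan-domainome closure is an accumulation curve: count the total number of distinct domains as genomes are added one by one. The sketch below is an assumption about how such a check could look, not the script that was written; a thorough analysis would typically average over many random genome orderings. The domain identifiers are hypothetical.

```python
def pan_accumulation(genomes):
    """Size of the pan-domainome after adding each genome, in the order given."""
    seen = set()
    curve = []
    for domains in genomes:
        seen |= domains
        curve.append(len(seen))
    return curve


# Hypothetical domain sets; later genomes add no new domains.
genomes = [{"D1", "D2"}, {"D2", "D3"}, {"D1", "D3"}, {"D1", "D2", "D3"}]
curve = pan_accumulation(genomes)
print(curve)
```

A curve that flattens out as more genomes are added suggests a closed pan-domainome, while a curve that keeps rising suggests an open one.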
Week 21 (30th of September – 6th of October)
Model tweaking and analysis. Built the SNP analysis workflow for the assembled genomes from biointeractions and created a dataset for Niels to analyze.
Started remodeling the most important proteins with the newly designed workflow.
October
Week 22 (7th of October – 13th of October)
Realized that "accuracy" is a poor metric for assessing a model with an imbalanced test set. Decided to optimize specificity (pathogen prediction rate) instead. Added a bootstrapping procedure to assess the confidence of predictions. This extra procedure made the modeling process 100 times longer. Incorporated the caret package to build models in parallel, speeding up the procedure. Obtained final results and wrote the wiki piece.
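The difference between accuracy and specificity can be illustrated with a small Python sketch. It assumes that "pathogen" is treated as the negative class, so that specificity is the fraction of true pathogens predicted as pathogens; the test set and predictions are invented for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def specificity(y_true, y_pred, negative="pathogen"):
    """Fraction of true negatives (here: pathogens) that were predicted as negative."""
    predictions = [p for t, p in zip(y_true, y_pred) if t == negative]
    if not predictions:
        raise ValueError("no samples of the negative class in y_true")
    return sum(p == negative for p in predictions) / len(predictions)


# Hypothetical imbalanced test set: 3 pathogens, 7 non-pathogens.
y_true = ["pathogen"] * 3 + ["non-pathogen"] * 7
# The model misses two of the three pathogens but gets all non-pathogens right.
y_pred = ["non-pathogen", "non-pathogen", "pathogen"] + ["non-pathogen"] * 7

print(accuracy(y_true, y_pred))
print(specificity(y_true, y_pred))
```

In this made-up case the accuracy looks acceptable even though most pathogens are misclassified, which is exactly the kind of error that matters when selecting a non-pathogenic strain.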
Week 23 (14th of October – 20th of October)
Wiki building!!