Team:Wageningen UR/Results/Pathogenicity

Xylencer

Determining pathogenicity of the Xanthomonas species

The phage delivery bacterium (PDB) demands a strain that can replicate X. fastidiosa phages and is not harmful to the treated plant. This requires an evolutionarily closely related, non-pathogenic bacterium. An excellent candidate is Xanthomonas, of which multiple non-pathogenic strains have been reported. However, Xanthomonas is known as a predominantly pathogenic genus. By combining extensive literature research and genomic information in a machine learning approach, this subproject establishes a genetic basis for discerning pathogens from non-pathogens in the Xanthomonas genus. This allowed us to confidently select the non-pathogenic X. arboricola strain CITA 44 for use as our PDB.

Principal component analysis plot of the average model predictions. This gives an overview of the performance of the model on the different samples. Samples that were mislabeled in more than 5% of the cases are designated as 'missed' and are symbolized by a crossed-out dot.

A total of 1372 Xanthomonas genomes were retrieved from the NCBI database and reannotated with protein domains, the functional units of proteins. Extensive literature research was performed to curate a high-quality dataset of 104 Xanthomonas genomes with experimentally verified pathogenicity. This dataset was used to train a random forest machine learning model. The performance of 100 models was combined to estimate the sensitivity and specificity of the model, resulting in an average sensitivity (non-pathogen prediction rate) of 0.90 ± 0.12 (SD) and an average specificity (pathogen prediction rate) of 0.81 ± 0.11 (SD). The non-pathogens that could be correctly predicted with high certainty were selected as promising candidates for the PDB. Protein domains important for prediction were extracted from the model. This deepened our understanding of the biology of pathogenicity in the Xanthomonas genus by identifying transposable DNA elements and the type III secretion system as key factors in Xanthomonas pathogenicity.

Box-plots of the sensitivity (prediction rate of non-pathogens) and specificity (prediction rate of pathogens) of the 100 model replicates. This gives a measure of consistency of the predictions.
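
To make the two reported rates concrete, the following is a minimal Python sketch of how sensitivity and specificity could be computed per model replicate and then summarized as mean ± SD. It treats non-pathogens as the positive class, as in the summary above. The function name, the labels and the toy predictions are illustrative assumptions and not the data or code used in this project.

```python
from statistics import mean, stdev

def sensitivity_specificity(y_true, y_pred, positive="non-pathogen"):
    """Return (sensitivity, specificity) with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical predictions of three model replicates on the same five strains.
truth = ["non-pathogen", "non-pathogen", "pathogen", "pathogen", "pathogen"]
replicates = [
    ["non-pathogen", "non-pathogen", "pathogen", "pathogen", "non-pathogen"],
    ["non-pathogen", "pathogen", "pathogen", "pathogen", "pathogen"],
    ["non-pathogen", "non-pathogen", "pathogen", "non-pathogen", "pathogen"],
]

scores = [sensitivity_specificity(truth, pred) for pred in replicates]
sens = [s for s, _ in scores]
spec = [s for _, s in scores]
print(f"sensitivity: {mean(sens):.2f} ± {stdev(sens):.2f} (SD)")
print(f"specificity: {mean(spec):.2f} ± {stdev(spec):.2f} (SD)")
```

The sketch assumes that every replicate contains at least one pathogen and one non-pathogen; otherwise one of the two rates is undefined.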

Introduction

The phage delivery bacterium is an integral part of Xylencer. Early on, through our contact with phage experts, it became apparent that delivering the phage to the infected areas of a diseased plant is one of the biggest bottlenecks holding back agricultural phage therapy.

Two major factors at play here are:

  • The difficulty of precise delivery to the areas of infection.
  • Damage to the phage during and after delivery due to environmental factors, of which the main culprit is UV damage.

Xylencer overcomes these hurdles with the introduction of the phage delivery bacterium (PDB). This bacterium forms a protective shell for the phage and facilitates targeted delivery of the phage inside the plant. However, one big question remains: Which specific bacterial species and strain do we use as the PDB?

  • What criteria do we apply to our PDB?

    There are two main criteria when considering a bacterium for use as a PDB:

    1. The bacterium should be able to replicate the phage, by having the right machinery to transcribe and translate the phage proteins.
    2. The bacterium should not be harmful to the plant to which it will be applied as a cure.

    To satisfy the first criterion, the PDB needs to have machinery sufficiently similar to that of X. fastidiosa, enabling it to correctly replicate the phage. The more closely related the PDB is to X. fastidiosa in evolutionary terms, the more likely this is to hold. To satisfy the second criterion, the bacterium should be non-pathogenic to the plant that is to be treated. Since X. fastidiosa can infect over 350 plant species, the PDB should be non-pathogenic to all of them. Simply put, it should be a general non-pathogen. Having a non-pathogenic PDB has the added benefit of increasing the safety of the therapy.

    In the case of X. fastidiosa, these criteria are at apparent odds with each other. The family Xanthomonadaceae, of which X. fastidiosa is a part, is considered to be entirely pathogenic [1], which violates the second criterion. But if we look outside of the Xanthomonadaceae, the evolutionary distance becomes so large that the first criterion no longer holds. Luckily, there might be a way out of this paradox.

Over two decades ago, reports of non-pathogenic Xanthomonas strains started to appear [2], and this list has been slowly growing. Xanthomonas is the genus most closely related to X. fastidiosa, at 95% 16S rRNA similarity. Additionally, Xanthomonas produces a yellow pigment called xanthomonadin that protects it from photobiological damage [3]. This same pigment could also allow Xanthomonas to better protect the phage from UV damage. On top of that, it has already been experimentally confirmed that X. fastidiosa phages can be replicated by Xanthomonas [4]. This means that these non-pathogenic Xanthomonas strains meet both of the criteria set for the PDB.

However, non-pathogenicity is hard to prove. There is always a risk of testing the wrong host, testing under the wrong conditions or using the wrong method of inoculation, to name a few. An example of this is a set of X. arboricola strains isolated from infected strawberries that were unable to cause symptoms when manually inoculated onto strawberries again [5]. Based on this evidence, these strains could be considered non-pathogenic, but when the pathogenicity assay was repeated at an increased humidity, symptoms did show [6]. Of course, testing all possible hosts and conditions makes for a near-infinite parameter space and is unfeasible. Still, it would be desirable and responsible to extend the evidence for non-pathogenicity of the PDB beyond mere literature reports. This sub-project aims to provide additional evidence by examining whether a genetic basis for delineating pathogens from non-pathogens can be established using an in silico approach.

General Approach

This sub-project is based on the hypothesis that the diagnosis of a bacterium as a non-pathogen becomes more reliable if we can accurately separate non-pathogens from pathogens on a genetic basis. This genetic basis can be established by combining literature information on pathogenicity with genome sequencing information and feeding it to a classifier (Figure 1). The reliability of this genetic basis can then be assessed by examining the performance of the classifier. This requires three components: a list of information on the pathogenicity of different Xanthomonas strains, a database consisting of the corresponding genomes with functional annotations and, lastly, a classifier that can interpret this data and make a prediction on the pathogenicity of a given strain. The next sections will go over these components in more detail.

Figure 1: A schematic overview of the approach. Literature study is combined with all publicly available genomes to train a random forest machine learning model to predict pathogenicity of Xanthomonas strains.

The Genomic Database

Publicly available Xanthomonas genomes were retrieved from the NCBI database, which yielded 1397 genomes. Quality control filtered out 25 genomes, resulting in 1372 genomes spanning 35 different subspecies. To prevent differences in annotation date and method from contaminating the results, all genomes were re-annotated using the SAPP platform [7]. Prodigal was used to call genes [8], which were subsequently annotated with Pfam protein domains [9] using InterProScan [10].

Figure 2: Genome size versus gene count of the 1372 Xanthomonas genomes obtained after filtering.
  • Why use protein domains?

    Gene function was inferred based on protein domain content instead of traditional global sequence similarity-based methods. The motivation for using protein domains was two-fold: First, the Hidden Markov Models (HMMs) used for predicting domains are more sophisticated than the matrices used for scoring sequence similarity. The HMMs each have their predetermined thresholds that can vary between domains, forgoing the problem encountered with sequence similarity methods of having to select arbitrary thresholds. This makes protein domains a better proxy for biological function [11]. Secondly, when compared to the calculation of bi-directional best hits required for sequence-similarity-based approaches, annotation of protein domains is less computationally intensive. This allows the computation time to scale linearly with the database size as opposed to the quadratic relationship observed for traditional methods [12].

  • Database statistics

    All of the annotated genomes were uploaded to a graph database to allow for easy querying. The database was queried using SPARQL and R was used for further analyses. The genomes had an average size of 4.8 Mb and contained 4425 genes on average. Figure 2 shows the relationship between genome size and gene count. Most of the genomes follow the expected linear trend, with only a subgroup of X. oryzae showing a higher gene density than expected. This is most probably a product of the high amount of recombination and rapid evolution observed in this species [13]. X. albilineans, X. fragariae and X. dyei all show a reduced genome size, which is in line with the reported ongoing genome reduction in X. albilineans [14].

    Figure 3: Genome size versus gene domain coverage of the 1372 Xanthomonas genomes obtained after filtering.

    On average, domain annotation yielded at least one domain for 79% of the genes. The average domain abundance was 5213, of which 2370 were distinct. The distribution is visualized in Figure 3. X. oryzae shows the lowest coverage but is still in line with the observed trend. Coverage plateaus at 83%, with strains undergoing genome reduction showing a lower than expected coverage for their gene count. This can indicate that the set of unannotated genes is enriched with biologically important genes that are preserved under selective pressure. Figure 4 shows a linear relationship between domain abundance and the number of distinct domains, with higher domain abundance giving rise to a more diverse repertoire of domains. Again, the subgroup of X. oryzae that showed a higher gene density in Figure 2 stands out because it has fewer distinct domains than expected for its domain abundance. This could again be explained by the increased recombination activity, by which new genes are created from combinations of existing domains, increasing only abundance, not diversity.

    Figure 4: Distinct number of domains versus total number of domains of the 1372 Xanthomonas genomes obtained after filtering.

A binary domain presence/absence matrix, with one row per genome, was generated as the main starting point for further analyses (Figure 5). On this matrix a PCA was performed to examine the natural grouping present in the data set. The PCA plot (Figure 6) clusters genomes together on a species level, with two high-density areas near the bottom right. PC-1 is responsible for the separation of most subspecies and explains considerably more variation than PC-2. PC-2 is mainly responsible for separating X. oryzae from the rest of the species. This is a by-product of the over-representation of this species in the data set (33% of all genomes), which causes the PCA to put a lot of weight on separating these genomes.

Figure 5: Binary domain presence/absence matrix for the Xanthomonas genus. Dark red indicates presence of a domain, light red indicates absence of a domain. X-axis represents the different domains, y-axis the different genomes. The colored bar at left indicates the species of the genome in that row.
Figure 6: Principal component analysis of all obtained Xanthomonas genomes based on domain presence/absence.
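
As an illustration of the data structure behind Figure 5, the sketch below builds a binary presence/absence matrix from per-genome sets of Pfam accessions. It is a minimal Python example under stated assumptions: the strain names and the domain sets are hypothetical, and the project itself queried a graph database and used R rather than this code.

```python
def presence_absence_matrix(genome_domains):
    """Build a binary matrix: rows are genomes, columns are sorted Pfam accessions."""
    domains = sorted(set().union(*genome_domains.values()))
    genomes = sorted(genome_domains)
    matrix = [[int(d in genome_domains[g]) for d in domains] for g in genomes]
    return genomes, domains, matrix

# Hypothetical annotation result: genome name -> set of Pfam domains found in it.
annotations = {
    "strain_A": {"PF01609", "PF09487", "PF05426"},
    "strain_B": {"PF05426", "PF09386"},
    "strain_C": {"PF01609", "PF05426"},
}

genomes, domains, matrix = presence_absence_matrix(annotations)
print(domains)
for name, row in zip(genomes, matrix):
    print(name, row)
```

Each row of such a matrix can serve both as input for a PCA and as the feature vector of a strain in a classifier.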

Curation Of The Pathogenicity Dataset

A model will only ever be as good as the data it is built on. Because of the importance of the model, it is vital that a high-quality dataset is available. In the case of Xanthomonas, there is no centralized database or any other publicly accessible resource that holds a large amount of information on pathogenicity. This created the need to manually curate our own dataset from literature reports. Ideally, this dataset would be backed up by our own experimental data, but the safety risks involved in handling a large number of pathogenic bacteria and the time-consuming nature of pathogenicity assays have led to the decision to focus solely on computational research. Curation of the dataset was very strict, using only data that was backed up by experimental confirmation of (non-)pathogenicity. If conflicting evidence existed for a strain, it was excluded from the dataset altogether. This also meant that weakly pathogenic strains were excluded because of the ambiguity in their classification. Genome availability of the selected strains was confirmed using the previously described database. This resulted in a pathogenicity dataset consisting of 104 experimentally verified and publicly available genomes [5, 15–27].

Figure 7: Overview of the distribution of the pathogenicity dataset over the total Xanthomonas database. The PCA was generated from the domain presence/absence matrix. Genomes shown in gray are not included in the pathogenicity dataset. This figure can be compared to Figure 6 to get the species distribution.

Model Building

The final piece of the puzzle is a suitable classifier. The pathogenicity dataset is very "wide", at 104 samples and 2241 features, making it prone to over-fitting. A Random Forest model [29] was chosen because it is one of the top-performing models on bioinformatics datasets across the board [30] and on pathogenicity data [31], has great out-of-the-box performance, doing well without any hyperparameter tuning [32], and is naturally resilient against over-fitting [29].

Figure 11: Overview of the model building procedure. The dataset of 70 pathogens and 34 non-pathogens is bootstrapped 100 times; each of the resulting datasets is then used to build the random forest model using 100 times repeated 5-fold cross-validation. The training set is balanced using down-sampling to avoid bias.
Figure 13: PCA plot of the average model predictions. Samples that were mislabeled in more than 5% of the cases are designated as "missed" and symbolized by a crossed-out dot.
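
The resampling steps named in Figure 11 can be sketched as follows. This is a simplified Python illustration, not the R script used for the results: it only shows bootstrapping and down-sampling on a placeholder dataset with the class sizes from the caption, it omits the cross-validation and the model fit, and the function names and the fixed seed are assumptions made for the example.

```python
import random

def bootstrap(samples, rng):
    """Draw len(samples) items with replacement."""
    return [rng.choice(samples) for _ in samples]

def downsample(samples, rng):
    """Randomly shrink every class to the size of the smallest class."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append((features, label))
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Placeholder dataset with the class sizes from Figure 11; features are left empty.
dataset = [((), "pathogen")] * 70 + [((), "non-pathogen")] * 34

for replicate in range(100):
    resampled = bootstrap(dataset, rng)
    train = downsample(resampled, rng)
    # A random forest would be fitted and evaluated on `train` here (not shown).

print(len(resampled), len(train))
```

Down-sampling discards part of the majority class in every replicate, which is one reason to repeat the procedure many times: across replicates, most pathogen genomes still contribute to some model.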

From this list, X. arboricola strain CITA 44 was selected as the most promising candidate: it has a 100% confidence interval, it lacks the type III secretion system, which is shown to be central to pathogenicity later in this subproject, and it has the most rigorous experimental data backing it up, as it was tested on five different hosts [20].

Samples with a lower confidence bound (5% CI) below 50% were regarded as "hard to predict" and are displayed in Table 2. The model seems to be biased towards labeling X. arboricola strains as non-pathogenic: the non-pathogenic strains are all predicted correctly, but their pathogenic counterparts are often mislabeled. Again, this could also mean that these pathogenic strains are incorrectly annotated, but only in planta experiments can provide more conclusive evidence for such claims.

  • Table 2: Samples difficult to predict
    Table 2: Samples that were predicted with a lower bound of less than 50%; these are regarded as "difficult" to predict.
    (a) Non-pathogens
    Species         Strain      Source  Accuracy (%)  5% CI  95% CI
    X. translucens  CS2         [17]    0             0.00   1.00
    X. maliensis    M97         [23]    2             0.00   20.47
    X. axonopodis   NCPPB 1159  [21]    16            7.00   31.00
    X. axonopodis   ORST4       [21]    48            26.75  73.77
    (b) Pathogens
    Species         Strain      Source  Accuracy (%)  5% CI  95% CI
    X. arboricola   CFBP 6771   [22]    0             0.00   0.00
    X. arboricola   CFBP6827    [22]    0             0.00   0.00
    X. arboricola   CFBP 7410   [22]    0             0.00   1.00
    X. arboricola   NCPPB 1832  [19]    0             0.00   1.00
    X. arboricola   NCPPB 1630  [19]    0             0.00   2.00
    X. arboricola   CITA 14     [20]    1             0.00   9.00
    X. arboricola   CFBP 3122   [22]    3             0.00   26.00
    X. arboricola   CFBP3123    [22]    9             1.00   38.95
    X. arboricola   CFBP 7407   [22]    26            1.78   76.00
    X. pisi         CFBP4643    [22]    50            10.91  84.40
    X. axonopodis   CFBP1851    [21]    16            13.48  22.00
    X. theicola     CFBP 4691   [22]    50            17.53  74.11
    X. albilineans  CFBP2523    [22]    59            21.00  89.00
    X. axonopodis   ORST17      [21]    49            32.00  56.23
    X. sacchari     NCPPB 4393  [24]    36            36.00  36.00
    X. hyacinthi    CFBP1156    [22]    79            36.95  85.25
    X. melonis      CFBP4644    [22]    81            40.00  98.00
    X. sacchari     CFBP4641    [22]    43            43.00  43.00
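
The page does not state how the 5% and 95% bounds in Table 2 were derived. One plausible reading, shown here purely as an assumption, is that they are percentiles of a strain's accuracy across the model replicates. The Python sketch below flags strains whose lower bound falls below the 50% cut-off; the strain names, the accuracies and the interpolation rule are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

def hard_to_predict(accuracy_per_replicate, threshold=50.0):
    """Return {strain: (5th, 95th percentile)} for strains with a lower bound below threshold."""
    flagged = {}
    for strain, accuracies in accuracy_per_replicate.items():
        low = percentile(accuracies, 5)
        high = percentile(accuracies, 95)
        if low < threshold:
            flagged[strain] = (low, high)
    return flagged

# Hypothetical per-replicate accuracies (%) for two made-up strains.
accuracies = {
    "strain_easy": [100, 100, 98, 100, 99],
    "strain_hard": [10, 40, 0, 25, 60],
}
print(hard_to_predict(accuracies))
```

Under this reading, a strain is only considered reliably predicted when even its worst replicates are right more often than wrong.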

Learning from machine learning

Not only can we use the model to select a list of candidates, but we can also study the most important predictors of the model. This set of predictors might hold new information about the biology of Xanthomonas pathogenicity. Random Forest is an ensemble method, meaning that we cannot directly interpret the model. What we can do is estimate the importance of a variable (here, a domain) by observing the mean decrease in accuracy when the given domain is left out of the model. The mean decrease in accuracy of the 30 most important domains is plotted in Figure 14. The mean decrease in accuracy has no directionality, making it impossible to assess whether a domain is important for either pathogens or non-pathogens. In an attempt to visualize the propensity of the domains, heatmaps of the top 30 domains were created (Figures 15 & 16). The heatmaps show that no single domain is uniquely present in either of the groups, but most domains do show a clear difference in abundance between the two groups. A summary of the top 30 domains is given in Table 3.
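
The idea behind a mean decrease in accuracy can be illustrated with a small permutation experiment. In random forest implementations this measure is usually estimated by shuffling the values of one variable and recording how much the accuracy drops, rather than by refitting without that variable. The Python sketch below applies that idea to a toy rule that stands in for a trained model; the rule, the data and the function names are hypothetical and only serve to show the mechanics.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=50, seed=0):
    """Mean drop in accuracy after shuffling each column, one column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    """Stand-in for a trained classifier: pathogenic if domain 0 is present."""
    return "pathogen" if row[0] == 1 else "non-pathogen"

# Hypothetical presence/absence rows for three domains.
rows = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
labels = ["pathogen", "pathogen", "non-pathogen",
          "non-pathogen", "pathogen", "non-pathogen"]

print(permutation_importance(toy_model, rows, labels))
```

The resulting score only says how much the model relies on a domain. It does not say whether presence of the domain points towards pathogens or non-pathogens, which is why a separate look at domain presence per group is needed.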

Figure 14: Top 30 most predictive Pfam domains for delineating pathogens from non-pathogens in Xanthomonas, based on the scaled mean decrease in accuracy.
  • Figures 15 & 16: Domain presence matrices
    Figure 15: Heat-map of domain presence/absence of the top 30 most important Pfam domains. Domains are ordered from most to least important. Dark red indicates presence of a domain, light red indicates absence of a domain. X-axis represents the different domains, y-axis the different genomes.
    Figure 16: Heat-map of domain presence/absence of the top 30 most important Pfam domains. The domains and samples are clustered using complete linkage and the binary distance metric. Dark red indicates presence of a domain, light red indicates absence of a domain. X-axis represents the different domains, y-axis the different genomes.
  • Table 3: Most important domains
    Table 3: Top 30 Pfam domains and their description. Propensity contains the groups in which the domain is most commonly found. Groups: 1 = non-pathogenic X. arboricola, 2 = non-pathogens excluding X. arboricola, 3 = pathogenic X. arboricola, 4 = pathogens excluding X. arboricola.
    Domain   Description                                              Propensity
    PF01845  Toxin CcdB                                               4
    PF07362  Post-segregation antitoxin CcdA                          4
    PF09487  Type III secretion protein HrpB2                         3, 4
    PF01609  Transposase; IS4-like                                    2, 3, 4
    PF05394  Avirulence B/C                                           3, 4
    PF07532  Bacterial Ig-related                                     2, 3, 4
    PF09838  Protein of unknown function DUF2065                      -
    PF09483  Type III secretion protein HpaP                          3, 4
    PF09502  Type III secretion protein HrpB4                         3, 4
    PF01548  Transposase; IS111A/IS1328/IS1533; N-terminal            2, 3, 4
    PF05426  Alginate lyase domain                                    1, 2, 3
    PF01095  Pectinesterase; catalytic                                1, 2, 4
    PF13855  Leucine-rich repeat                                      3, 4
    PF09486  Type III secretion protein HrpB7                         3, 4
    PF02638  Glycosyl hydrolase-like 10                               4
    PF09386  Antitoxin ParD                                           1
    PF09613  Type III secretion system; HrpB1/HrpK                    3, 4
    PF02371  Transposase; IS116/IS110/IS902                           2, 3, 4
    PF17263  Protein of unknown function DUF5329                      1, 2, 3
    PF05621  Bacterial TniB                                           3
    PF05932  Tir chaperone protein (CesT) family                      3, 4
    PF05015  Toxin HigB-1                                             2, 3, 4
    PF03412  Peptidase C39; bacteriocin processing                    2, 3, 4
    PF09286  Peptidase S53; activation domain                         2, 3, 4
    PF02498  BRO N-terminal domain                                    3, 4
    PF09907  Toxin-antitoxin system; toxin component; HigB; putative  -
    PF10899  Abortive phage resistance protein AbiGi                  3
    PF07638  RNA polymerase sigma-70 ECF-like                         1, 2, 3
    PF13438  Domain of unknown function DUF4113                       1
    PF17784  Sulfotransferase; S. mansonii-type                       2, 3, 4
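
The caption of Table 3 describes propensity as the groups in which a domain is most commonly found, without giving the exact rule. A minimal Python sketch of one possible rule is shown below: a group is listed when at least half of its genomes carry the domain. The 50% cut-off, the function name and the example genomes are assumptions made for illustration and are not taken from the original analysis.

```python
def domain_propensity(presence, groups, min_fraction=0.5):
    """Return the sorted group ids in which a domain occurs in at least
    `min_fraction` of the genomes (hypothetical cut-off)."""
    totals, hits = {}, {}
    for genome, group in groups.items():
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + presence[genome]
    return sorted(g for g in totals if hits[g] / totals[g] >= min_fraction)

# Hypothetical genomes assigned to the four groups of Table 3.
groups = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 3, "f": 4, "g": 4}
# Presence (1) or absence (0) of one domain in each genome.
presence = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1, "g": 0}

print(domain_propensity(presence, groups))
```

A domain that reaches the cut-off in no group would get an empty list, which corresponds to the "-" entries in the table under this reading.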

From this list of most predictive domains and their propensity we can make the following important observations:

  • Two parts of the Ccd toxin-antitoxin system seem to be crucial for pathogens outside of X. arboricola. This system is known as a plasmid retention system [35] and might be linked to a plasmid important for pathogenicity that could have ended up in the genome, either as an assembly error or as a real biological phenomenon.
  • Amongst the top-ranking domains are three domains related to transposases. Transposases are often found to flank pathogenicity islands in genomes of Xanthomonas and X. fastidiosa [36] and could be a more reliable predictor than any of the genes found in these islands.
  • Type III secretion related domains make up the most abundant class of domains, with five unique domains. Type III secretion proteins HrpB1/HrpK, HrpB2, and HrpB4 are all part of a set of secretion proteins that are essential but non-conserved in X. campestris [37]. HrpB1/HrpK and HrpB4 are membrane-bound proteins, in contrast to HrpB2, which is secreted. HrpB2 possibly acts as a translocator of effectors into the host cell. Both HrpB7 and HpaP serve an unknown function but are commonly found in type III secretion operons. Avirulence gene products are common effectors secreted by the type III secretion system and are used to suppress plant immune responses. Certain plants have evolved hypersensitive responses against these avirulence proteins, allowing the plant to detect and respond to the pathogen with an effector-triggered immunity response [38]. Interestingly, the avirulence domain has a propensity towards pathogens. This could mean that the domain is not causing a hypersensitive response and is actually important for virulence.
  • The bacterial Ig-related fold is commonly found on the outside of bacteria and is a target for receptors that trigger immune responses against these bacteria in mammals. The function of this fold is still unknown [39].
  • Alginate degradation has not been linked to pathogenicity or non-pathogenicity in Xanthomonas, but for Flavobacterium there are reports of alginate as an anti-microbial compound, protecting the bacterium and its plant host from pathogens [40].
  • The leucine-rich repeat (LRR) domain is most likely associated with a type III effector that can be used by pathogens to suppress the plant’s defenses. Interestingly, one of the few LRRs known in Xanthomonas confers avirulence to X. campestris in A. thaliana [41].
  • Tir chaperone proteins are also involved in type III secretion and are a strong predictor of pathogenic potential in E. coli. These small cytosolic proteins serve to stabilize secreted effectors in a secretion-competent state [42].

Overall, non-pathogenicity seems to be dictated by a lack of the domains deemed important by the model, i.e. pathogenicity seems to be caused by a gain of function in domains that allow the bacterium to exploit its host. Proteins related to type III secretion are over-represented in the set of important predictors and have a propensity towards pathogenic bacteria. This class of proteins is already hypothesized to be important for pathogenicity in Xanthomonas [13], and their predictive power underlines the importance of understanding this system.

Data availability

All genomes used are available from the NCBI FTP database. Genomes were annotated using the SAPP platform and were uploaded to a GraphDB SPARQL database. The manually curated pathogenicity dataset is available here and the R Markdown version of the script that was used to generate the results is available here.

  • References
    1. Ania M Cutiño-Jiménez, Marinalva Martins-Pinheiro, Wanessa C Lima, Alexander Martín-Tornet, Osleidys G Morales, and Carlos FM Menck. Evolutionary placement of xanthomonadales based on conserved protein signature sequences. Molecular phylogenetics and evolution, 54(2):524–534, 2010.
    2. Luc Vauterin, Ping Yang, Anne Alvarez, Yuichi Takikawa, Don A Roth, Anne K Vidaver, Robert E Stall, Karel Kersters, and Jean Swings. Identification of non-pathogenic xanthomonas strains associated with plants. Systematic and applied microbiology, 19(1):96–105, 1996.
    3. AR Poplawsky, SC Urban, and W Chun. Biological role of xanthomonadin pigments in xanthomonas campestris pv. campestris. Appl. Environ. Microbiol., 66(12):5123–5127, 2000.
    4. Stephen J Ahern, Mayukh Das, Tushar Suvra Bhowmick, Ry Young, and Carlos F Gonzalez. Characterization of novel virulent broad-host-range phages of xylella fastidiosa and xanthomonas. Journal of bacteriology, 196(2):459–471, 2014.
    5. Joachim Vandroemme, Bart Cottyn, Joël F Pothier, Valentin Pflüger, Brion Duffy, and Martine Maes. Xanthomonas arboricola pv. fragariae: what’s in a name? Plant Pathology, 62(5):1123–1131, 2013.
    6. Patrizia Ferrante and Marco Scortichini. Xanthomonas arboricola pv. fragariae: a confirmation of the pathogenicity of the pathotype strain. European journal of plant pathology, 150(3):825–829, 2018.
    7. Jasper J Koehorst, Jesse CJ van Dam, Edoardo Saccenti, Vitor AP Martins dos Santos, Maria Suarez-Diez, and Peter J Schaap. Sapp: functional genome annotation and analysis through a semantic framework using fair principles. Bioinformatics, 34(8):1401–1403, 2017.
    8. Doug Hyatt, Gwo-Liang Chen, Philip F LoCascio, Miriam L Land, Frank W Larimer, and Loren J Hauser. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC bioinformatics, 11(1):119, 2010.
    9. Sara El-Gebali, Jaina Mistry, Alex Bateman, Sean R Eddy, Aurélien Luciani, Simon C Potter, Matloob Qureshi, Lorna J Richardson, Gustavo A Salazar, Alfredo Smart, et al. The pfam protein families database in 2019. Nucleic acids research, 47(D1):D427–D432, 2018.
    10. Philip Jones, David Binns, Hsin-Yu Chang, Matthew Fraser, Weizhong Li, Craig McAnulla, Hamish McWilliam, John Maslen, Alex Mitchell, Gift Nuka, et al. Interproscan 5: genome-scale protein function classification. Bioinformatics, 30(9):1236–1240, 2014.
    11. Chris P Ponting and Robert R Russell. The natural history of protein domains. Annual review of biophysics and biomolecular structure, 31(1):45–71, 2002.
    12. Jasper J Koehorst, Edoardo Saccenti, Peter J Schaap, Vitor AP Martins dos Santos, and Maria Suarez-Diez. Protein domain architectures provide a fast, efficient and scalable alternative to sequence-based methods for comparative functional genomics. F1000Research, 5, 2016.
    13. Robert P Ryan, Frank-Jörg Vorhölter, Neha Potnis, Jeffrey B Jones, Marie-Anne Van Sluys, Adam J Bogdanove, and J Maxwell Dow. Pathogenomics of xanthomonas: understanding bacterium–plant interactions. Nature Reviews Microbiology, 9(5):344, 2011.
    14. Isabelle Pieretti, Monique Royer, Valérie Barbe, Sébastien Carrere, Ralf Koebnik, Stéphane Cociancich, Arnaud Couloux, Armelle Darrasse, Jérôme Gouzy, Marie-Agnès Jacques, et al. The complete genome sequence of xanthomonas albilineans provides new insights into the reductive genome evolution of the xylem-limited xanthomonadaceae. BMC genomics, 10(1):616, 2009.
    15. Kanika Bansal, Amandeep Kaur, Samriti Midha, Sanjeet Kumar, Suresh Korpole, and Prabhu B Patil. Xanthomonas sontii sp. nov., a non-pathogenic bacterium isolated from healthy basmati rice (oryza sativa) seeds from india. bioRxiv, page 738047, 2019.
    16. Kanika Bansal, Samriti Midha, Sanjeet Kumar, Amandeep Kaur, Ramesh V Sonti, and Prabhu B Patil. Ecological and evolutionary insights into pathogenic and non-pathogenic rice associated xanthomonas. bioRxiv, page 453373, 2019.
    17. Salwa Essakhi, Sophie Cesbron, Marion Fischer-Le Saux, Sophie Bonneau, Marie-Agnès Jacques, and Charles Manceau. Phylogenetic and variable-number tandem-repeat analyses identify nonpathogenic xanthomonas arboricola lineages lacking the canonical type iii secretion system. Appl. Environ. Microbiol., 81(16):5395–5410, 2015.
    18. Yunxia Fang, Haiyan Lin, Liwen Wu, Deyong Ren, Weijun Ye, Guojun Dong, Li Zhu, and Longbiao Guo. Genome sequence of xanthomonas sacchari r1, a biocontrol bacterium isolated from the rice seed. Journal of biotechnology, 206:77–78, 2015.
    19. Jerson Garita-Cambronero, Ana Palacio-Bielsa, and Jaime Cubero. Xanthomonas arboricola pv. pruni, causal agent of bacterial spot of stone fruits and almond: its genomic and phenotypic characteristics in the x. arboricola species context. Molecular plant pathology, 19(9):2053–2065, 2018.
    20. Jerson Garita-Cambronero, Ana Palacio-Bielsa, María M López, and Jaime Cubero. Pan-genomic analysis permits differentiation of virulent and non-virulent strains of xanthomonas arboricola that cohabit prunus spp. and elucidate bacterial virulence factors. Frontiers in microbiology, 8:573, 2017.
    21. Carolina Gonzalez, Silvia Restrepo, Joe Tohme, and Valérie Verdier. Characterization of pathogenic and nonpathogenic strains of xanthomonas axonopodis pv. manihotis by pcr-based dna fingerprinting techniques. FEMS microbiology letters, 215(1):23–31, 2002.
    22. Déborah Merda, Sophie Bonneau, Jean-François Guimbaud, Karine Durand, Chrystelle Brin, Tristan Boureau, Christophe Lemaire, Marie-Agnès Jacques, and Marion Fischer-Le Saux. Recombination-prone bacterial strains form a reservoir from which epidemic clones emerge in agroecosystems. Environmental microbiology reports, 8(5):572–581, 2016.
    23. Lindsay R Triplett, Valérie Verdier, Tony Campillo, Cinzia Van Malderghem, Ilse Cleenwerck, Martine Maes, Loïc Deblais, Rene Corral, Ousmane Koita, Bart Cottyn, et al. Characterization of a novel clade of xanthomonas isolated from rice leaves in mali and proposal of xanthomonas maliensis sp. nov. Antonie Van Leeuwenhoek, 107(4):869–881, 2015.
    24. G Karamura, Julian Smith, David Studholme, Jerome Kubiriba, and E Karamura. Comparative pathogenicity studies of the xanthomonas vasicola species on maize, sugarcane and banana. Afr. J. Plant Sci, 9:385–400, 2015.
    25. Wei Qian, Yantao Jia, Shuang-Xi Ren, Yong-Qiang He, Jia-Xun Feng, Ling-Feng Lu, Qihong Sun, Ge Ying, Dong-Jie Tang, Hua Tang, et al. Comparative and functional genomic analyses of the pathogenicity of phytopathogen xanthomonas campestris pv. campestris. Genome research, 15(6):757–767, 2005.
    26. David J Studholme, Eric Kemen, Daniel MacLean, Sebastian Schornack, Valente Aritua, rd Thwaites, Murray Grant, Julian Smith, and Jonathan DG Jones. Genome-wide sequencing data reveals virulence factors implicated in banana xanthomonas wilt. FEMS microbiology letters, 310(2):182–192, 2010.
    27. Issa Wonni, Bart Cottyn, Liselot Detemmerman, S Dao, L Ouedraogo, S Sarra, C Tekete, S Poussier, R Corral, L Triplett, et al. Analysis of xanthomonas oryzae pv. oryzicola population in mali and burkina faso reveals a high level of genetic and pathogenic diversity. Phytopathology, 104(5):520–531, 2014.
    28. Lars Snipen and Kristian Hovde Liland. micropan: an r-package for microbial pangenomics. BMC bioinformatics, 16(1):79, 2015.
    29. Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
    30. Randal S Olson, William La Cava, Zairah Mustahsan, Akshay Varik, and Jason H Moore. Data-driven advice for applying machine learning to bioinformatics problems. arXiv preprint arXiv:1708.05070, 2017.
    31. Tjerko Kamminga, Jasper J Koehorst, Paul Vermeij, Simen-Jan Slagman, Vitor AP Martins dos Santos, Jetta JE Bijlsma, and Peter J Schaap. Persistence of functional protein domains in mycoplasma species and their role in host specificity and synthetic minimal life. Frontiers in cellular and infection microbiology, 7:31, 2017.
    32. Raphael Couronné, Philipp Probst, and Anne-Laure Boulesteix. Random forest versus logistic regression: a large-scale benchmark experiment. BMC bioinformatics, 19(1):270, 2018.
    33. Andy Liaw, Matthew Wiener, et al. Classification and regression by randomforest. R news, 2(3):18–22, 2002.
    34. Max Kuhn. Caret: classification and regression training. Astrophysics Source Code Library, 2015.
    35. Laurence Van Melderen, Philippe Bernard, and Martine Couturier. Lon-dependent proteolysis of ccda is the key control for activation of ccdb in plasmid-free segregant bacteria. Molecular microbiology, 11(6):1151–1157, 1994.
    36. Claudia B Monteiro-Vitorello, Mariana C De Oliveira, Marcelo M Zerillo, Alessandro M Varani, Edwin Civerolo, and Marie-Anne Van Sluys. Xylella and xanthomonas mobil’omics. Omics: a journal of integrative biology, 9(2):146–159, 2005.
    37. Ombeline Rossier, Guido Van den Ackerveken, and Ulla Bonas. Hrpb2 and hrpf from xanthomonas are type iii-secreted proteins and essential for pathogenicity and recognition by the host plant. Molecular microbiology, 38(4):828–838, 2000.
    38. Yulei Shang, Xinyan Li, Haitao Cui, Ping He, Roger Thilmony, Satya Chintamanani, Julie Zwiesler-Vollick, Suresh Gopalan, Xiaoyan Tang, and Jian-Min Zhou. Rar1, a central player in plant immunity, is targeted by pseudomonas syringae effector avrb. Proceedings of the National Academy of Sciences, 103(50):19200–19205, 2006.
    39. Qian Han, Ning Liu, Howard Robinson, Lin Cao, Changli Qian, Qianfu Wang, Lei Xie, Haizhen Ding, Qian Wang, Yongping Huang, et al. Biochemical characterization and crystal structure of a gh10 xylanase from termite gut bacteria reveal a novel structural feature and significance of its bacterial ig-like domain. Biotechnology and bioengineering, 110(12):3093–3103, 2013.
    40. Q-D An, G-L Zhang, H-T Wu, Z-C Zhang, G-S Zheng, L Luan, Yoshiyuki Murata, and X Li. Alginate-deriving oligosaccharide production by alginase from newly isolated flavobacterium sp. lxa and its potential application in protection against pathogens. Journal of applied microbiology, 106(1):161–170, 2009.
    41. Rong-Qi Xu, Servane Blanvillain, Jia-Xun Feng, Bo-Le Jiang, Xian-Zhen Li, Hong-Yu Wei, Thomas Kroj, Emmanuelle Lauber, Dominique Roby, Baoshan Chen, et al. Avracxcc8004, a type iii effector with a leucine-rich repeat domain from xanthomonas campestris pathovar campestris confers avirulence in vascular tissues of arabidopsis thaliana ecotype col-0. Journal of bacteriology, 190(1):343–355, 2008.
    42. Robin M Delahay, Robert K Shaw, Simon J Elliott, James B Kaper, Stuart Knutton, and Gad Frankel. Functional analysis of the enteropathogenic escherichia coli type iii secretion system chaperone cest identifies domains that mediate substrate interactions. Molecular microbiology, 43(1):61–73, 2002.