Team:TAU Israel/Software

UNDER THE HOOD

perfect switch

An integration of several algorithms aimed to find suitable tails for a complete switch with the prf15 tail.

Virus-HostDB - The database on which we've worked, contains data of about ~400,000 genes from 10,365 different viruses (most of them bacteriophages)[2]

Heuristic Search (our own BOOM Algorithm) - In order to detect genes with a potential to produce our new tail proteins in the pyocin complex, we looked for genes with similarity to the original pyocin's tail gene (prf15). With intent to achieve that, we divided each gene to its amino-acids trios (in a congruent manner, the first trio is from amino acid 1 to amino acid 3, the second is from amino acid 2 to 4 and so on, we named every trio as a word). Then we checked how many of its words exits in the pyocin's original tail. Finally we calculated each gene's score using the following formula:

Gene Name Detection - An algorithm which searches for genes with indicative names such as "tail fiber" and "tail spike".

Half-Semi-Global alignment - An algorithm which aims to determine how much the beginning of a gene is similar to the beginning of the pyocin's original tail (prf15). In order to achieve that we created a novel dynamic programming alignment algorithm, based on the famous Needleman-Wunsch[3] and Smith-Waterman[4] algorithms: We took the beginning of prf15 (first 164 amino acids, which according to the literature is long enough for a successful connecting to the pyocin's baseplate), and aligned it against the tested gene using the Needlman-Wunsch matrix. We have altered the dynamic programming equations in such a way that we will not reduce the accumulative score in case of a gap in the end of the tested gene (similar to semi-global alignment, but only for the tested gene). In that way, our algorithm determines how well the beginning of tested gene resembles to the beginning of prf15, without defining how long the beginning of the tested gene should be.
The new dynamic programming equation:

(There is a different penalty for gap opening and gap extension)

Below is the visualization of the matrix. As you can see, we were looking for the best path from the top left corner to the lower row.

Filtering Before continuing to the optimization phase we filtered out BOOM's low-scored genes in order to focus only on the genes with the highest probability to be a suitable match for our pyocin complex. In addition, we took only one gene from each bacteriophage - the one with the highest score.

Parameters Optimization - The half-semi-global alignment algorithm uses different parameters such as the gap opening penalty, gap extension penalty and amino acid substitution matrix (BLOSUM numX). We wanted to find the parameters that are most accurate for our dataset. For that manner we used the evolutionary tree as our distance function: for each set of parameters we aligned our genes using these parameters and created an evolutionary tree from it (using Neighbor Joining algorithm[5]). Then we compared this tree to the original evolutionary tree of these species (given by the NCBI[6]) using Robinson Foulds metric[7]. Our assumption is that with the best parameters, our genes evolutionary tree will look like NCBI's species evolutionary tree. In order to quickly find these parameters, we preformed 3-dimensional hill-climbing. Each dimension represented one of the parameters (gap open, gap extension and amino acid substitution matrix) and together they create a discrete space. We tried to start from different points in that space, and for each point we checked all 27 adjacent points - if one of them represented a set of parameters with smaller Robinson Foulds distance, we moved on to this point and checked its neighbors. Eventually we found out that the best results are achievable with many sets of parameters, and we chose to use one of them - gap open penalty of -11, gap extension penalty of -1 and BLOSUM62 as our amino acids substitution matrix.

Validation - In order to validate our alignment algorithm, we used the tail fiber of Pseudomonas phage PS17. We added this protein to our data set as it was successfully used as a new pyocin tail[8] and we wanted to see if our algorithm will be able to detect this automatically. Our alignment algorithm gave this protein an extremely high grade (444) as it comes in second place in comparison to all 3174 proteins in our final data set. This result suggests that indeed good switching candidates will get high score in our algorithm.

Back

fusion

As part of the experiment our project is based upon, a protein fusion of Psueodmonas aroginosa's pyocin tail fiber and V10 phage tail fiber was created. The pyocin containing the fusion protein as its tail fiber was shown to be able to retarget and destroy a new set of bacteria. To accommodate the fusion approach in our Tail-or Swift mechanism, we've integrated data of protein's secondary structure and protein's domains to determine relevant fusion position in each two given proteins. Tail-orSwift allows the user to provide domain and secondary structure data for his favorite tail fiber protein, and we provide him, using an integrated algorithm, with optional positions in the original pyocin tail fiber and the input protein in which the fusion is most likely to succeed. We determined the domains of a protein using Pfam algorithm which runs a multiple sequence alignment over a large database of protein families and out of those, determines the domains using a Hidden Markov Model (HMM)[9]. For the secondary structure we've used an algorithm that runs a BLAST search of the input protein against UniprotKB/SwissProt and uses the alignment results as an input for a neural network[10].

Back

Refrences

[1] Michel-Briand, Y. & Baysse, C. The pyocins of Pseudomonas aeruginosa. Biochimie 84, 499–510 (2002).
[2] Mihara, T., Nishimura, Y., Shimizu, Y., Nishiyama, H., Yoshikawa, G., Uehara, H., Hingamp, P., Goto, S., and Ogata, H.; Linking virus genomes with host taxonomy. Viruses 8, 66 doi:10.3390/v8030066 (2016).
[3] Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48, 443–453 (1970)
[4] Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. Journal of Molecular Biology 147, 195–197 (1981)
[5] Saitou, N. & Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4, 406–425 (1987)
[6] Federhen, S. The NCBI Taxonomy database. Nucleic Acids Research 40, D136–D143 (2011)
[7] Robinson, D. F. & Foulds, L. R. Comparison of phylogenetic trees. Mathematical Biosciences 53, 131–147 (1981)
[8] Williams, S. R., Gebhart, D., Martin, D. W. & Scholl, D. Retargeting R-Type Pyocins To Generate Novel Bactericidal Protein Complexes. Appl. Environ. Microbiol. 74, 3868 (2008)
[9] El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Research 47, D427–D432 (2018)
[10] PredictProtein–-an open resource for online prediction of protein structural and functional features. Yachdav, G.; Kloppmann, E.; Kajan, L.; Hecht, M.; Goldberg, T.; Hamp, T.; Hönigschmid, P.; Schafferhans, A.; Roos, M.; Bernhofer, M.; and others Nucleic acids research,gku366. 2014.

Back