
Alaive

SOFTWARE

The computational part of this project is a data pipeline that measures similarity within the set of antibodies and identifies the most representative sequences. The first step of the pipeline is to embed the 7-letter phage sequences as numerical vectors. Biological sequences, like many other natural sequences, can be viewed as a complex genetic language that closely resembles human language: the information carried by a biological sequence is analogous to the semantic information carried by words. From a language-processing perspective, then, if a sequence is properly segmented and tokenized, the information it contains, such as structure, folding, and interaction, can be represented as vectors of an appropriate dimension.

We train a SentencePiece model to segment the sequences into individual “words” and build a vocabulary. The model starts from a large seed vocabulary and finds the best segmentation of each sequence by maximizing the probability of the segmentation, computed as the product of the occurrence probabilities of its words. The vocabulary is then refined by computing, for each candidate word, the change in loss, measured as marginal likelihood over the entire corpus, when that word is removed. After the corpus is properly segmented, we embed each word as a vector. The approach we adopted is the classical word2vec skip-gram model from language processing: the model is trained to predict the words surrounding a central word, yielding a weight matrix W whose rows are the representations of all the words. We set the dimension of the vector representation to 100, the default value suggested by the author of word2vec.
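The sketch below illustrates this segmentation-and-embedding step under some assumptions: the sequences are stored one per line in a file named sequences.txt, and the vocabulary size, context window, epoch count, and the averaging used to turn word vectors into a per-sequence vector are illustrative placeholders rather than the values or aggregation scheme actually used in the pipeline.

import numpy as np
import sentencepiece as spm
from gensim.models import Word2Vec

# 1. Train a SentencePiece unigram model; the unigram algorithm seeds a large
#    vocabulary and prunes it by the marginal-likelihood loss described above.
spm.SentencePieceTrainer.train(
    input="sequences.txt",        # one sequence per line (assumed path)
    model_prefix="phage_spm",
    model_type="unigram",
    vocab_size=2000,              # illustrative; tune to the corpus
    character_coverage=1.0,
)

# 2. Segment every sequence into "words" with the trained model.
sp = spm.SentencePieceProcessor()
sp.load("phage_spm.model")
with open("sequences.txt") as f:
    corpus = [sp.encode_as_pieces(line.strip()) for line in f if line.strip()]

# 3. Train a skip-gram word2vec model (sg=1) with 100-dimensional vectors,
#    the dimensionality mentioned in the text.
w2v = Word2Vec(
    sentences=corpus,
    vector_size=100,
    sg=1,            # skip-gram: predict context words from the central word
    window=3,        # illustrative context window for short sequences
    min_count=1,
    workers=4,
    epochs=10,
)

# 4. One simple way to obtain a per-sequence vector is to average its word
#    vectors (this aggregation is an assumption, not specified in the text).
def sequence_vector(pieces):
    return np.mean([w2v.wv[p] for p in pieces if p in w2v.wv], axis=0)

embeddings = np.vstack([sequence_vector(p) for p in corpus])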

After obtaining the embedded vectors, we reveal the similarity between sequences by clustering. Because the data are high-dimensional, a dimensionality-reduction step is needed to bring them down to a dimension at which clustering is feasible. In high-dimensional datasets, distance-based methods such as K-means often perform poorly because L2 distances lose contrast in high-dimensional Euclidean space. We therefore combine hierarchy-based and density-based clustering to ensure clustering quality. The choice of dimensionality-reduction algorithm is likewise non-trivial. Regular t-SNE, a powerful visualization method, is impractical for a dataset of this size (roughly 10^6 sequences) because of its runtime. Faster variants exist, for example FIt-SNE, which accelerates t-SNE with interpolation and the fast Fourier transform, but since we rely on density-based clustering, the fact that t-SNE preserves neither distances nor densities is a fatal drawback.
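The loss of contrast in L2 distances can be seen with a quick numerical check; the snippet below is only an illustration of this effect, not part of the pipeline, and its sample sizes are arbitrary.

import numpy as np

# As dimensionality grows, the gap between the nearest and farthest neighbour
# shrinks relative to the distances themselves, so distance-based clustering
# such as K-means becomes less informative.
rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    points = rng.normal(size=(1000, dim))
    # pairwise distances from the first point to all the others
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  relative contrast={contrast:.3f}")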

The final choice is UMAP-enhanced HDBSCAN clustering. UMAP packs points within a cluster more tightly (which highlights density information) and increases inter-cluster distance (which highlights hierarchy information). HDBSCAN requires no a priori knowledge of the number of clusters and combines the benefits of hierarchical and density-based clustering, making it the optimal choice.
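A minimal sketch of this final step is shown below, assuming "embeddings" is the array of sequence vectors produced earlier; the UMAP and HDBSCAN parameters are illustrative defaults rather than the tuned values used in the project.

import umap
import hdbscan

# UMAP: reduce the 100-dimensional embeddings to a lower-dimensional space that
# keeps points within a cluster tightly packed and pushes clusters apart.
reducer = umap.UMAP(
    n_neighbors=15,     # illustrative; balances local vs. global structure
    min_dist=0.0,       # tighter packing, which helps density-based clustering
    n_components=10,    # assumed target dimension for clustering
    metric="euclidean",
    random_state=42,
)
reduced = reducer.fit_transform(embeddings)

# HDBSCAN: hierarchical density-based clustering; no cluster count is required,
# and sparse points are labelled -1 (noise).
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=50,   # illustrative; smallest group treated as a cluster
    min_samples=10,        # illustrative; larger values are more conservative
)
labels = clusterer.fit_predict(reduced)

The resulting cluster labels can then be used to pick representatives, for example the member of each cluster closest to its centroid, though the exact selection criterion is one possible choice rather than a detail specified above.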