Team:Toronto/drylab/CRAUT

CRAUT




Introduction


Our software is designed as a simplified version of the code provided by the 2014 Heidelberg iGEM team. Their wiki page detailing their version of the software can be found here.

Their version of the software was designed to generate linkers for any generic protein. For our specific case, we only needed it for one protein whose N-terminus and C-terminus were close together, so we had the option to simplify their algorithm. They considered linkers with 1 to 4 alpha helices, while we considered only linkers with 2 alpha helices. Their approach to finding such linkers involved generating huge numbers of potential geometric paths, then translating the valid paths into linkers that closely resembled them.

Our approach takes advantage of the fact that the output of the program is a ranking of permutations (with repetitions) of 2 helices and 1 angle segment, of which there are only 8*8*7 = 448 possible linkers (8 choices for each helix and 7 angle segments). As a further simplification, we don’t consider all angle segments when given 2 helices, and instead pick the closest choice of angle segment given the angle between the two helices. This leaves a total of 8*8 = 64 permutations to test. However, we do not take into account the fact that there are small flexible ends between the ends of the linker and the termini. The Heidelberg team considered this, which increased the complexity of their algorithm. The simplicity of our algorithm results in a quick runtime, typically a few seconds, as well as easily maintainable code.
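The enumeration above can be sketched in a few lines. This is a minimal illustration only: the helix sequences and the angle-segment table below are placeholders, not the Heidelberg team’s calibrated building blocks, and the closest-angle shortcut is the simplification described in the text.

```python
from itertools import product

# Hypothetical building blocks: 8 alpha helices and 7 angle segments.
# The real calibrated sequences and angles live in the repository.
HELICES = ["AEAAAK" * n for n in range(1, 9)]
ANGLE_SEGMENTS = {40.0: "KTA", 60.0: "ANVL", 80.0: "AADGTL",
                  100.0: "VNLTA", 120.0: "ASLPAA",
                  140.0: "ATGDLA", 160.0: "AADL"}

def closest_angle_segment(angle_between_helices):
    """Pick the angle segment whose calibrated angle is nearest."""
    best = min(ANGLE_SEGMENTS, key=lambda a: abs(a - angle_between_helices))
    return ANGLE_SEGMENTS[best]

# Full output space: 8 * 8 * 7 = 448 linkers; with the closest-angle
# shortcut, only the 8 * 8 = 64 helix pairs need to be scored.
pairs = list(product(HELICES, repeat=2))
assert len(pairs) == 64
```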

Problems Encountered


We faced a large number of problems when trying to work with the Heidelberg team’s code. Their code can be found here.

Their code was made in a very short timeframe and they had little time to add documentation, so their resulting product is not as high quality as was likely intended. What follows is a detailed explanation of various problems, many of which would likely have been fixed if the 2014 Heidelberg team had more time.

The first problem encountered is that their code does not work out of the box. Running the compiled C version returns an error complaining about a missing resource. We found no mention of such a resource in their documentation, so we could not get the compiled program working.

When trying to run their Python code, there are numerous errors. The first is that 8 alpha helix building blocks are given (in the form of amino acid sequences), yet only 7 calibrated lengths are provided. We eventually figured out how to assign a reasonable length to the last helix (1.5 times the number of amino acids, matching the roughly 1.5 Å rise per residue of an ideal alpha helix), but it is not calibrated like the rest of the given lengths. Another error was encountered when their code failed to keep any possible linkers about 2000 lines into the program. Even after spending many hours, we failed to understand why their code was not working.
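The fallback length we used for the uncalibrated eighth helix is simple enough to state as code. This is our workaround, not a calibrated value; the function name is ours.

```python
# Uncalibrated fallback: an ideal alpha helix rises ~1.5 angstroms per
# residue, so the helix length is approximated as 1.5 * residue count.
def helix_length_angstroms(sequence):
    return 1.5 * len(sequence)

# An 11-residue helix comes out to 16.5 angstroms.
length = helix_length_angstroms("AEAAAKEAAAK")
```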

This leads to a huge problem that hindered development of our version of CRAUT. The Heidelberg team’s code was poorly documented, messy, buggy, and partially written in German. They provided vague explanations of some functions, but not nearly enough to understand their code. There are many instances where they introduce numbers (such as 150, 15, 154) or formulas (such as one involving logarithms and amino acid counts) with no explanation. What exactly these numbers and formulas mean remained a mystery, which prevented us from replicating their code properly. Out of necessity, our attempts to replicate such formulas amounted to copying their messy code into our own; most of the code created this way was later cut to simplify the software. Their code in general is very convoluted, which makes it hard to understand why some steps are done. A significant number of hours were spent trying to understand some pieces of code in order to replicate their weighting components.

Their code would have greatly benefited from being refactored during development. It is provided as one Python file about 3500 lines long, with a lot of commented-out code and some unclear loading and saving of program state (used as checkpoints). Their software design appears to be mostly limited to separating some code into individual functions, many of which are huge blocks of code that are difficult to understand. With more time and proper software design, they could likely have simplified their software greatly, which would have made working with their code significantly easier.

Another problem we encountered was that their code takes far too long to run. They reported an early version taking 11 days to run, and a later version taking 1 day. We could not confirm this: even when the errors we encountered were forcibly ignored, the code did not appear to generate results, so it was never run to completion. The overwhelming runtime comes from considering a very large space of geometric paths, as well as from inefficiencies in the code itself. They use Python, which is a slow language, but attempt to make up for this with NumPy, which provides efficient array operations. Some segments of the code could be optimized, perhaps by using NumPy differently, using different algorithms, or implementing better data structures.
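The kind of optimization meant here can be shown with a toy example, not taken from their code: computing distances from one point to many candidate points with a pure-Python loop versus a single vectorized NumPy call that does the same arithmetic in C.

```python
import numpy as np

def distances_loop(origin, points):
    # Pure-Python loop over rows: interpreter overhead on every point.
    out = []
    for p in points:
        out.append(((p[0] - origin[0]) ** 2 +
                    (p[1] - origin[1]) ** 2 +
                    (p[2] - origin[2]) ** 2) ** 0.5)
    return out

def distances_numpy(origin, points):
    # The same computation as one vectorized call.
    return np.linalg.norm(points - origin, axis=1)

points = np.arange(30.0).reshape(10, 3)
origin = np.zeros(3)
same = np.allclose(distances_loop(origin, points),
                   distances_numpy(origin, points))
```

On arrays the size of their path sets, the difference between these two styles is often orders of magnitude.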

The biggest cause of the long runtime is the large space of geometric paths being considered. To represent a point being connected to another by a helix or flexible end, they consider points on the boundary of a sphere around the starting point, with radius equal to the length of that helix or flexible end. They take discrete samples spaced uniformly at 5-degree intervals over the entire boundary of the sphere to generate the possible next points in the path, which gives 5184 points considered when moving 1 step in the path. When building a linker out of many segments, the number of paths increases exponentially, with each path branching into another 5184 possible paths for every segment. When considering linkers with 2 alpha helix segments and 1 angle segment, there are 5184^3 = 139,314,069,504 paths (there are 3 intermediary points because of the 2 helices and 2 flexible ends). (The actual generation of paths differs slightly from this and is harder to explain, but the total number of paths should be similar in magnitude.) As mentioned earlier, the output space of linker sequences is much smaller, so many of these paths are interpreted as the same linker. This is confirmed in their documentation, where they describe clustering the paths and averaging the results within each cluster to get the linker weighting. A problem with this overall algorithm is that the weights depend on the geometric paths, which are approximations, so the weight of a linker depends on which geometric paths were generated. This results in discretization error.
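A schematic of the sampling step makes the blow-up concrete. This is our reading of their scheme, not their code: two angles sampled at 5-degree steps reproduce the reported 72 * 72 = 5184 candidate directions per segment (note that sweeping both angles over a full 360° double-covers the sphere, but it matches the count they state).

```python
import numpy as np

STEP_DEG = 5
angles = np.deg2rad(np.arange(0, 360, STEP_DEG))  # 72 samples per angle

def next_points(center, radius):
    """All candidate endpoints one segment length away from `center`."""
    az, po = np.meshgrid(angles, angles)
    dirs = np.stack([np.cos(az) * np.sin(po),
                     np.sin(az) * np.sin(po),
                     np.cos(po)], axis=-1).reshape(-1, 3)
    return center + radius * dirs

branches = len(next_points(np.zeros(3), 10.0))
# 5184 branches per segment; over 3 intermediary points the path count
# grows to 5184 ** 3 = 139,314,069,504.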

Because of the large sizes of the arrays, they claim to have stored them on disk during processing. This is an extremely costly choice, since disk access is tremendously slow compared to memory access, especially on a hard disk drive rather than a solid state drive. However, if the disk is accessed rarely, the negative effect on runtime is mitigated.
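For reference, one standard way to keep NumPy arrays on disk is `numpy.memmap`, which pages data in and out through the operating system; we do not know whether this is the mechanism they used, so the sketch below is illustrative only.

```python
import numpy as np
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "paths.dat")

# Create a disk-backed array; writes are paged out to the file.
arr = np.memmap(path, dtype=np.float64, mode="w+", shape=(100000, 3))
arr[:] = 1.0
arr.flush()
del arr  # release the mapping

# Reopen read-only; elements are read from disk on access.
reopened = np.memmap(path, dtype=np.float64, mode="r", shape=(100000, 3))
value = float(reopened[0, 0])
```

Sequential scans over a memmap are tolerable, but random access patterns on a hard disk drive are where the slowdown relative to RAM becomes severe.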


The Heidelberg team provided some examples of results from their software, as shown at the bottom of this page.

When trying to replicate these weighting contributions (i.e. length contribution, angle contribution, etc.), we were not able to get similar results when testing the same linker on the same protein. This is a huge problem: their model was trained using their weighting scheme, so if our weight contributions don’t match, we will not get reliable predictions from the model. As an example, for the linker they named “ord1”, the length contribution is 4.936. Our software only produced a length contribution that high for its longest linkers, which were much longer than ord1. As another example, the distance contribution in their code is simply the distance between the linker and the surface, yet their distances are significantly greater than ours, even after we attempted to match their computation. We were also unable to find many examples of their results, so it would be hard to know whether our linker ranking matched theirs if we used differently computed weighting contributions.
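Our reading of the distance contribution is sketched below: the minimum distance between linker coordinates and protein surface coordinates. This is a guess at the intended definition (the function name and inputs are ours), and as noted above, even under this reading their reported values did not match ours.

```python
import numpy as np

def distance_contribution(linker_xyz, surface_xyz):
    """Minimum pairwise distance between linker and surface points.

    Assumes (N, 3) and (M, 3) coordinate arrays; this definition is our
    interpretation, not the Heidelberg team's confirmed formula.
    """
    diffs = linker_xyz[:, None, :] - surface_xyz[None, :, :]
    return float(np.min(np.linalg.norm(diffs, axis=-1)))

linker = np.array([[0.0, 0.0, 5.0]])
surface = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
d = distance_contribution(linker, surface)  # nearest surface point is 5.0 away
```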

Furthermore, their linear model scaled some weight contributions such that the length contribution is insignificant and the binding site contribution is much more significant than everything else. This suggests that the model mostly chooses linkers far from the binding site when the binding site might be in the way. Such a model does not generalize well to proteins where linkers will not be near the binding site, such as ours. Considering that their model was trained on a particular enzyme where intersection with the binding site was possible, it is likely overfit to the few samples they trained on.

Results


The software we made takes a protein database file and outputs the linkers sorted from best to worst (smallest weight to largest weight), along with information on their weighting components. The software, along with usage instructions and technical documentation, may be found at this GitHub repository.

What follows are the best 5 linkers generated by the program, using our simplified version of the Heidelberg team’s weighting scheme. The binding site contribution was omitted due to missing data and time constraints. The software was run with a protein file for isPETase.

Amino Acid Sequence | Weight | Length Contribution | Angle Contribution | Distance Contribution
AEAAAKEAAAKEAAAKAKTAAEAAAKEAAAK | 3.595392 | 2.704839 | 5.645027 | 0.359070
AEAAAKEAAAKEAAAKEAAAKANVLAEAAAKEAAAKEAAAKA | 3.632183 | 3.822929 | 5.740434 | 0.341162
AEAAAKEAAAKEAAAKAKTAAEAAAKEAAAKA | 3.709622 | 2.785181 | 5.776163 | 0.398118
AEAAAKEAAAKEAAAKEAAAKANVLAEAAAKEAAAKEAAAKEAAAKA | 3.712502 | 4.325065 | 5.690876 | 0.449891
AEAAAKEAAAKEAAAKAKTAAEAAAKEAAAKEAAAKA | 3.825248 | 3.320793 | 5.670519 | 0.574310


The linkers generated by this program are likely unreliable, considering that our weight contributions don’t match the Heidelberg team’s. The generated linkers appear quite large, which may or may not be good; this is not surprising given that we used the linear regression model that mostly disregards length. These outputs should not be considered good linkers unless a regression model trained on these specific weighting contributions is used.

Future Work


Any future work on this should avoid using the Heidelberg team’s code. Too much time must be invested to understand it, and we were unable to get their software to work. Instead, their idea of using alpha helix building blocks should be implemented with new code. A new regression model should be trained using software that is much easier to maintain and build upon. The software we made may help, but the weighting scheme it implements may need to be extended to train an effective model.


Our software only considers permutations of 2 alpha helix building blocks, and finds the angle segment closest to the angle between the two helices. It may be possible to efficiently find the optimal angle segment between two alpha helices with respect to the weighting, while allowing the flexible ends to point in different directions. We recommend not discretizing the space of possible geometric paths and then finding linkers that resemble each path; this is what the Heidelberg team did, and it resulted in searching a very large space of paths. Instead, we recommend taking existing linker sequences and ranking them directly.
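The recommended direction can be sketched as follows. The ranking scaffold reflects what our software already does (smallest weight first); the weight function here is an explicitly labeled placeholder, since a real one would combine the length, angle, and distance contributions discussed above.

```python
def rank_linkers(candidates, weight_fn):
    """Return (weight, sequence) pairs sorted best (smallest weight) first."""
    return sorted((weight_fn(seq), seq) for seq in candidates)

# Placeholder weight for illustration only: prefer shorter linkers.
ranking = rank_linkers(
    ["AEAAAKEAAAK", "AEAAAK", "AEAAAKEAAAKEAAAK"],
    weight_fn=lambda seq: len(seq) / 10.0,
)
best_weight, best_seq = ranking[0]
```

Scoring a fixed library of candidate sequences this way keeps the search space at the size of the library, instead of the billions of geometric paths the Heidelberg approach enumerates.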