Difference between revisions of "Team:Tongji Software/Model"

Line 624: Line 624:
 
             }
 
             }
 
         }
 
         }
          /*  -----------------------------  footer style --------------------------- */
+
        /*  -----------------------------  footer style --------------------------- */
         #FOOTER p{
+
       
              font-size:13px;
+
         #FOOTER p {
        }
+
            font-size: 13px;
 +
        }
 +
       
 
         .footer-distributed {
 
         .footer-distributed {
 
             margin-bottom: 0;
 
             margin-bottom: 0;
Line 717: Line 719:
 
<body>
 
<body>
 
     <div style="position:absolute; margin:0vh 0vw 0vh -10px; width:100vw; color:white" id="section">
 
     <div style="position:absolute; margin:0vh 0vw 0vh -10px; width:100vw; color:white" id="section">
<div style="width:87vw;  --color-bg: #1f174e; --color-bg-2: #151436;--color-bg-3: #060029;background: linear-gradient(90deg, var(--color-bg-3), var(--color-bg-2),var(--color-bg),var(--color-bg-2), var(--color-bg-3));position:absolute; height:100%;box-shadow:10px 0px 15px rgba(0,0,0,0.5) "></div>
+
        <div style="width:87vw;  --color-bg: #1f174e; --color-bg-2: #151436;--color-bg-3: #060029;background: linear-gradient(90deg, var(--color-bg-3), var(--color-bg-2),var(--color-bg),var(--color-bg-2), var(--color-bg-3));position:absolute; height:100%;box-shadow:10px 0px 15px rgba(0,0,0,0.5) "></div>
    <nav id="cd-vertical-nav" style="padding-top: 20vh">
+
        <nav id="cd-vertical-nav" style="padding-top: 20vh">
        <ul>
+
            <ul>
            <li>
+
                <li>
                <a href="#Pathway_search" data-number="1">
+
                    <a href="#Pathway_search" data-number="1">
                    <span class="cd-dot"></span>
+
                        <span class="cd-dot"></span>
                    <span class="cd-label">Pathway Search</span>
+
                        <span class="cd-label">Pathway Search</span>
                </a>
+
                    </a>
            </li>
+
                </li>
            <li>
+
                <li>
                <a href="#Enzyme_selection" data-number="2">
+
                    <a href="#Enzyme_selection" data-number="2">
                    <span class="cd-dot"></span>
+
                        <span class="cd-dot"></span>
                    <span class="cd-label">Enzyme Selection</span>
+
                        <span class="cd-label">Enzyme Selection</span>
                </a>
+
                    </a>
            </li>
+
                </li>
            <li>
+
                <li>
                <a href="#Codon_optimization" data-number="3">
+
                    <a href="#Codon_optimization" data-number="3">
                    <span class="cd-dot"></span>
+
                        <span class="cd-dot"></span>
                    <span class="cd-label">Codon Optimization</span>
+
                        <span class="cd-label">Codon Optimization</span>
                </a>
+
                    </a>
            </li>
+
                </li>
            <li>
+
                <li>
                <a href="#Reference" data-number="4">
+
                    <a href="#Reference" data-number="4">
                    <span class="cd-dot"></span>
+
                        <span class="cd-dot"></span>
                    <span class="cd-label">Reference</span>
+
                        <span class="cd-label">Reference</span>
                </a>
+
                    </a>
            </li>
+
                </li>
        </ul>
+
            </ul>
    </nav>
+
        </nav>
    <a class="cd-nav-trigger cd-img-replace">Open navigation<span></span></a>
+
        <a class="cd-nav-trigger cd-img-replace">Open navigation<span></span></a>
  
    <div id="SectionContainer">
+
        <div id="SectionContainer">
        <div style="position:absolute; margin:0vh 0vw 0vh -10px; width:100vw; color:white" id="section">
+
            <div style="position:absolute; margin:0vh 0vw 0vh -10px; width:100vw; color:white" id="section">
  
            <section class="cd-section">
+
                <section class="cd-section">
                <h1 id="medalMainTitle"><b style="font-size:1.2em; font-weight: bold;">M</b>ODEL</h1>
+
                    <h1 id="medalMainTitle"><b style="font-size:1.2em; font-weight: bold;">M</b>ODEL</h1>
                <img src="https://static.igem.org/mediawiki/2019/d/d0/T--Tongji_Software--model_title.png" style="width:100vw"></img>
+
                    <img src="https://static.igem.org/mediawiki/2019/d/d0/T--Tongji_Software--model_title.png" style="width:100vw"></img>
            </section>
+
                </section>
  
            <!-- <h1 class="HpSubTitle"><b>X</b>IANG <b>M</b>ING <b>H</b>IGH <b>S</b>CHOOL</h1> -->
+
                <!-- <h1 class="HpSubTitle"><b>X</b>IANG <b>M</b>ING <b>H</b>IGH <b>S</b>CHOOL</h1> -->
            <section id="Pathway_search" class="cd-section">
+
                <section id="Pathway_search" class="cd-section">
                <h1 class="medalSubTitle"><b>P</b>ATHWAY<b>S</b>EARCH</h1>
+
                    <h1 class="medalSubTitle"><b>P</b>ATHWAY<b> S</b>EARCH</h1>
                <p><b>Overview</b></p>
+
                    <p><b>Overview</b></p>
                <p>Based on last year's Tongji_software team's algorithm, we optimized it and implemented the new search algorithm. At the same time, we established the model and combined the above two algorithms, which greatly improved the search speed
+
                    <p>Building on the algorithm from last year's Tongji_software team, we optimized the existing search and implemented a new search algorithm. We then built a model that combines the two algorithms, which greatly improved the search speed
                    of the path.
+
                        of the path.
                </p>
+
                    </p>
                <br>
+
                    <br>
  
                <div class="TextContainer">
+
                    <div class="TextContainer">
                    <br><br>
+
                        <br><br>
                    <p><b>Previous work</b></p>
+
                        <p><b>Previous work</b></p>
  
                    <div>
+
                        <div>
                        <!--500px-->
+
                            <!--500px-->
                        <p>Last year's team abstracted the reaction pathway into a directed graph, searched it with the graph search algorithm depth first search (DFS) algorithm (Fig. 1), and scored the pathway according to the thermodynamic feasibility&
+
                            <p>Last year's team abstracted the reaction pathways into a directed graph, searched it with depth-first search (DFS) (Fig. 1), and scored each pathway according to thermodynamic feasibility,
                            precursor competition, toxicity of metabolites and atom conservation. Details are shown in last year's model.</p>
+
                                precursor competition, toxicity of metabolites and atom conservation. Details are shown in last year's model.</p>
 +
                        </div>
 +
                        <img src="https://static.igem.org/mediawiki/2019/c/c2/T--Tongji_Software--model_DFS.png" class="PushImage" width="40%" alt="DFS">
 +
                        <br>
 +
                        <p style="text-align: center;font-size: 1.1vw;">Fig1. Search process of DFS</p><br><br>
 
                     </div>
 
                     </div>
                    <img src="https://static.igem.org/mediawiki/2019/c/c2/T--Tongji_Software--model_DFS.png" class="PushImage" width="40%" alt="DFS">
+
 
 
                     <br>
 
                     <br>
                    <p style="text-align: center;font-size: 1.1vw;">Fig1. Search process of DFS</p><br><br>
 
                </div>
 
  
                <br>
+
                    <br>
 +
                    <p><b>Optimization of DFS</b></p>
 +
                    <p>We optimized the internal logic and other implementation details of the DFS algorithm to reduce its time complexity. We also compiled it from a Python script into a dynamic link library using Cython, which makes it run much faster when
 +
                        it is invoked. As Fig. 2 shows, the optimized DFS ran nearly ten times faster on average.</p>
 +
                    <br>
 +
                    <img src="https://static.igem.org/mediawiki/2019/3/37/T--Tongji_Software--model_Optimization_of_DFS.png" class="PushImage" width="80%" alt="DFS"><br>
 +
                    <p style="text-align: center;font-size: 1.1vw;">Fig2. Running time of DFS and optimized DFS at different search depths</p><br><br>
 +
                    <br>
  
                <br>
 
                <p><b>Optimization of DFS</b></p>
 
                <p>We optimize the internal implementation logic and other details of the DFS algorithm to reduce its time complexity. We compile it from python script to dynamic link library through Cython module, which makes it run much faster when it
 
                    is invoked. As can be seen from the following figure (Fig.2), after optimization, the running speed of DFS was accelerated by nearly ten times on average.</p>
 
                <br>
 
                <img src="https://static.igem.org/mediawiki/2019/3/37/T--Tongji_Software--model_Optimization_of_DFS.png" class="PushImage" width="80%" alt="DFS"><br>
 
                <p style="text-align: center;font-size: 1.1vw;">Fig2. Time consuming between DFS & optimized DFS in different depths</p><br><br>
 
                <br>
 
  
 +
                    <p><b>Another Choice: Greedy</b></p>
 +
                    <p>Replotting Fig. 2 on a logarithmic axis (Fig. 3) shows that the running time of DFS increases exponentially with search depth. Therefore, we need another algorithm to handle searches at high search
 +
                        depth. We implemented a heuristic algorithm, greedy best-first search (BFS). Greedy scores paths while it searches, and the highest-scoring path among those searched so far is selected for further searching
 +
                        each time, so as to quickly reach the local optimal solution (Gif.1).</p>
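<p>The greedy best-first idea described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the team's implementation: the graph is a plain dictionary, <code>score</code> is a hypothetical callable that rates a partial path (higher is better), and <code>max_depth</code> stands in for the search depth.</p>

```python
import heapq


def greedy_best_first(graph, score, start, target, max_depth):
    """Return one start->target path, always expanding the best-scoring path first.

    graph: dict mapping a compound to the compounds it can be converted into.
    score: hypothetical callable that rates a partial path (higher is better).
    """
    # heapq is a min-heap, so scores are negated to pop the highest score first.
    frontier = [(-score([start]), [start])]
    while frontier:
        _, path = heapq.heappop(frontier)
        node = path[-1]
        if node == target:
            return path
        if len(path) - 1 >= max_depth:
            continue  # depth limit reached, do not extend this path
        for nxt in graph.get(node, ()):
            if nxt not in path:  # avoid cycles within a single path
                new_path = path + [nxt]
                heapq.heappush(frontier, (-score(new_path), new_path))
    return None  # no pathway found within the depth limit


# Illustrative toy data (made up for this sketch).
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
toy_weights = {"A": 0, "B": 1, "C": 5, "D": 0}
print(greedy_best_first(toy_graph, lambda p: sum(toy_weights[c] for c in p), "A", "D", 3))
```

<p>Maintaining the priority queue is what makes each Greedy step more expensive than a DFS step, and when no pathway exists the loop has to drain the whole queue before it returns.</p>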
 +
                    <br>
 +
                    <img src="https://static.igem.org/mediawiki/2019/d/da/T--Tongji_Software--model_Another_choice_BFS1.png" class="PushImage" width="80%" alt="DFS"><br>
 +
                    <p style="text-align: center;font-size: 1.1vw;">Fig3. Running time of DFS increases exponentially with search depth</p><br><br>
 +
                    <img src="https://static.igem.org/mediawiki/2019/5/5a/T--Tongji_Software--model_Another_choice_BFS2.gif" class="PushImage" width="40%" alt="DFS"><br>
 +
                    <p style="text-align: center;font-size: 1.1vw;">Gif1. Search process of Greedy</p><br><br>
 +
                    <br>
  
                <p><b>Another Choice: Greedy</b></p>
+
                    <p><b>Validation of Greedy Validity</b></p>
                <p>By transforming Fig.2 into a logarithmic axis (Fig.3), we can find that the operation time of DFS increases exponentially with the increase of search depth. Therefore, we need another algorithm to deal with the search at high search depth.
+
                    <p>Because Greedy is a heuristic algorithm, its search results converge to a local optimum rather than the global optimum, so we need to verify the accuracy of its search results.</p>
                    Finally, we implemented a heuristic algorithm: greedy best first algorithm (BFS).Greedy is scoring at the same time of searching, and the one with the highest score in the current searched path is selected for further searching each
+
                    <br>
                    time, so as to quickly reach the local optimal solution (Gif.1).</p>
+
                <br>
+
                <img src="https://static.igem.org/mediawiki/2019/d/da/T--Tongji_Software--model_Another_choice_BFS1.png" class="PushImage" width="80%" alt="DFS"><br>
+
                <p style="text-align: center;font-size: 1.1vw;">Fig3. Time consuming of DFS exponentially increases by search depth</p><br><br>
+
                <img src="https://static.igem.org/mediawiki/2019/5/5a/T--Tongji_Software--model_Another_choice_BFS2.gif" class="PushImage" width="40%" alt="DFS"><br>
+
                <p style="text-align: center;font-size: 1.1vw;">Gif1. Search process of Greedy</p><br><br>
+
                <br>
+
  
                <p><b>Validation of Greedy Validity</b></p>
+
                    <p>41,000 random test samples showed that 40,316 of them were correct, i.e. the accuracy rate reached 98.33%. We calculated the search accuracy of Greedy under different conditions and converted it into a heat map (Fig. 4). The results
                <p>As a heuristic algorithm, the search results of Greedy will converge to the local optimal solution instead of the global optimal solution, so we need to verify the accuracy of the search results of Greedy.</p>
+
                        show that Greedy performs very well when users request fewer paths, and when only one path is requested the error rate rarely exceeds 1%. Overall, the Greedy error rate is very low and does not significantly affect user
                <br>
+
                        usage.</p>
 +
                    <br>
 +
                    <p>In terms of speed, we computed for each case the proportion of runs in which Greedy took less time than DFS and plotted it as a heat map. The results show that the deeper the search, the more likely Greedy is to run faster than DFS. However,
 +
                        the test data also showed that the running time of Greedy is very unstable (Fig. 5). Since Greedy needs to maintain a priority queue, each of its search steps takes much longer than a DFS step. When there is no pathway between the two
 +
                        compounds, Greedy needs to search the same amount of data as DFS. In that case, the running time of Greedy is unacceptably long, so we cannot simply use Greedy instead of DFS.</p>
 +
                    <br>
 +
                    <img src="https://static.igem.org/mediawiki/2019/2/24/T--Tongji_Software--model_BFS_Heatmap.png" class="PushImage" width="60%" alt="DFS"><br><br><br>
 +
                    <p style="text-align: center;font-size: 1.1vw;">Fig4. Comparison of running time between DFS and Greedy</p><br>
 +
                    <img src="https://static.igem.org/mediawiki/2019/e/e0/T--Tongji_Software--model_Validation_of_BFS.png" class="PushImage" width="80%" alt="DFS"><br>
 +
                    <p style="text-align: center;font-size: 1.1vw;">Fig5. Comparison of running time between DFS and Greedy</p><br><br>
  
                <p>41,000 random test samples showed that 40,316 of them were correct, i.e. the accuracy rate reached 98.33%. We calculated the search accuracy of Greedy under different conditions and converted it into a heat map (Fig. 4). The results show
+
                    <p><b>Build a Model</b></p>
                    that when users require less needed paths, Greedy performs very well, and when users required only 1, the error rate is almost no more than 1%. Overall, the Greedy error rate is very low and does not significantly affect user usage.</p>
+
                    <p>We trained a machine learning model, a random forest, to predict before each search which algorithm will be faster, and we then use that algorithm to perform the search.</p>
                <br>
+
                    <br>
                <p>In addition, in terms of speed, we convert the proportion of greedy running time less than DFS in each case into a heat map. The results show that the deeper the search, the more likely the greedy is to run faster than DFS. However, the test data also shown that the operation time of Greedy is very unstable(Fig. 5). Since the Greedy needs to maintain a priority queue, its single-step search takes much longer than DFS. When there is no pathway between the two compounds, the Greedy needs to search data of the same size as DFS. At that point, the operation time of Greedy is unacceptably long. So we can't just use Greedy instead of DFS..</p>
+
                    <p>We obtained 41,000 random test samples and extracted 7 features from the input data: starting compound number, target compound number, search depth, required path number, similarity between the starting compound and the target
                <br>
+
                        compound, number of compounds that can be reached after 2 search steps from the starting compound, and number of compounds that can be reached after 2 search steps from the target compound. We use a sigmoid function to map the difference between
                <img src="https://static.igem.org/mediawiki/2019/2/24/T--Tongji_Software--model_BFS_Heatmap.png" class="PushImage" width="60%" alt="DFS"><br><br><br>
+
                        the running times of the two algorithms to a range of 0 to 1 as labels. A label greater than 0.5 indicates that Greedy takes longer than DFS, so DFS should be used, and vice versa. The advantage of using the sigmoid function is that
                <p style="text-align: center;font-size: 1.1vw;">Fig4. Comparison of time consuming between DFS & Greedy</p><br>
+
                        when the time difference is large, the label value tends to 0 or 1, which clearly tells the machine learning model which category the sample should fall into.</p>
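<p>As a rough illustration of the labelling step, the sketch below maps a running-time difference to a label between 0 and 1. The function names and the <code>scale</code> parameter are assumptions made for this example; the page does not state how the time difference is scaled before the sigmoid is applied.</p>

```python
import math


def sigmoid_label(greedy_seconds, dfs_seconds, scale=1.0):
    """Map (Greedy time - DFS time) to (0, 1); above 0.5 means Greedy was slower."""
    # `scale` is a hypothetical knob for how quickly the label saturates.
    diff = (greedy_seconds - dfs_seconds) / scale
    # Numerically stable form of 1 / (1 + exp(-diff)).
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def choose_algorithm(label, threshold=0.5):
    """Pick DFS when the label says Greedy is slower, otherwise Greedy."""
    return "DFS" if label > threshold else "Greedy"


print(choose_algorithm(sigmoid_label(12.0, 0.8)))  # Greedy much slower than DFS
print(choose_algorithm(sigmoid_label(0.3, 9.5)))   # Greedy much faster than DFS
```

<p>Large time differences push the label towards 0 or 1, while near-ties stay close to 0.5, which is the behaviour the paragraph above relies on.</p>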
                <img src="https://static.igem.org/mediawiki/2019/e/e0/T--Tongji_Software--model_Validation_of_BFS.png" class="PushImage" width="80%" alt="DFS"><br>
+
                    <br>
                <p style="text-align: center;font-size: 1.1vw;">Fig5. Comparison of time consuming between DFS & Greedy</p><br><br>
+
                    <img src="https://static.igem.org/mediawiki/2019/b/bb/T--Tongji_Software--model_Build_a_model1.png" class="PushImage" width="60%" alt="DFS"><br>
  
                <p><b>Build a Model</b></p>
 
                <p>We trained a machine learning model, the random forest, to determine which algorithm could perform the search faster before the search and use it to perform the search.</p>
 
                <br>
 
                <p>We obtained 41,000 random test samples and extracted 7 features from the input data, respectively : starting compound number, target compound number, search depth, required path number, similarity between starting compound and target compound,
 
                    number of compounds that can be reached after 2 steps of starting compound search, number of compounds that can be reached after 2 steps of target compound search. We use sigmoid function to map the difference between the running time
 
                    of the two algorithms to a range of 0 to 1 as labels. Greater than 0.5 indicates that Greedy takes longer than DFS, that is, DFS should be used, and vice versa. The advantage of using the sigmoid function is that when the time difference
 
                    is large, the tag value tends to 0 or 1, which clearly tells the machine learning model which category the sample should fall into.</p>
 
                <br>
 
                <img src="https://static.igem.org/mediawiki/2019/b/bb/T--Tongji_Software--model_Build_a_model1.png" class="PushImage" width="60%" alt="DFS"><br>
 
  
 +
                    <p>The 41,000 samples processed above were used as training data for the random forest regression model, and the model was scored according to the following formula: the ratio of the total running time of the algorithms selected
 +
                        by the model to the total running time when the optimal algorithm is selected every time. The smaller the ratio, the closer the predictions are to the optimum.</p>
 +
                    <br>
 +
                    <img src="https://static.igem.org/mediawiki/2019/e/ef/T--Tongji_Software--model_Build_a_model2.png" class="PushImage" width="60%" alt="DFS">
 +
                    <p>Considering the higher stability and accuracy of DFS, the final threshold for choosing between the two algorithms is adjusted to the value that minimizes the scoring function, which is around 0.3.</p>
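<p>The scoring ratio and the threshold adjustment can be sketched as below. This follows the verbal description only (the formula itself is shown as an image), and the function names, the candidate thresholds and the toy timings are all hypothetical.</p>

```python
def selection_score(dfs_times, greedy_times, use_dfs):
    """Total time of the chosen algorithms divided by the total time of the best choices."""
    chosen = sum(d if pick else g for d, g, pick in zip(dfs_times, greedy_times, use_dfs))
    optimal = sum(min(d, g) for d, g in zip(dfs_times, greedy_times))
    return chosen / optimal  # 1.0 means every choice was optimal


def best_threshold(predictions, dfs_times, greedy_times, candidates):
    """Return the candidate threshold whose choices give the smallest score."""
    def score_at(threshold):
        # A prediction above the threshold is read as "Greedy is slower, use DFS".
        use_dfs = [p > threshold for p in predictions]
        return selection_score(dfs_times, greedy_times, use_dfs)
    return min(candidates, key=score_at)


# Made-up timings in seconds for two samples.
dfs_times = [1.0, 4.0]
greedy_times = [2.0, 1.0]
predictions = [0.4, 0.1]  # hypothetical model outputs
print(best_threshold(predictions, dfs_times, greedy_times, [0.3, 0.5]))
```

<p>Lowering the threshold below 0.5 biases the choice towards DFS, which matches the stated preference for the more stable algorithm.</p>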
 +
                    <br>
 +
                    <img src="https://static.igem.org/mediawiki/2019/8/86/T--Tongji_Software--model_Build_a_model3.png" class="PushImage" width="60%" alt="DFS">
 +
                    <p>After the above steps, we trained a random forest regression model and obtained 3000 random test samples to verify it. The results showed that the calculation of these test samples only using DFS took about 10,000 seconds. With Greedy
 +
                        alone, it takes about 260,000 seconds; when the best algorithm is used every time, it takes about 7,000 seconds. When using the model for prediction, it takes 9,100 seconds. Our model brings the running time close to the theoretical
 +
                        optimal value, within about 30% of it.</p>
 +
                </section>
 +
                <!-- cd-section -->
  
                 <p>The 41,000 test data processed above were used as training data to train the random forest regression model, and the model was graded according to the following formula. That is, the ratio of the total time of the algorithm selected by
+
                 <section id="Enzyme_selection" class="cd-section">
                     the model to the total time of the optimal algorithm selected each time. The smaller the ratio, the closer the prediction results are to the optimal value.</p>
+
                    <h1 class="medalSubTitle"><b>E</b>NZYME<b> S</b>ELECTION</h1>
                <br>
+
                    <br><br>
                <img src="https://static.igem.org/mediawiki/2019/e/ef/T--Tongji_Software--model_Build_a_model2.png" class="PushImage" width="60%" alt="DFS">
+
                    <p><b>Overview</b></p>
                <p>Considering the higher stability and accuracy of DFS, the final dividing line of which algorithm will be adjusted to a value that minimizes the scoring function , which is around 0.3.</p>
+
                    <p>For each enzyme in a searched pathway, or for a single enzyme chosen directly, we can retrieve its physical and chemical properties in each organism that contains the enzyme. First, we assume that the users’ experimental
                <br>
+
                        engineering host is E. coli or yeast, so we compare every source organism with E. coli or yeast in terms of these physical and chemical properties.</p>
                <img src="https://static.igem.org/mediawiki/2019/8/86/T--Tongji_Software--model_Build_a_model3.png" class="PushImage" width="60%" alt="DFS">
+
                    <br>
                <p>After the above steps, we trained a random forest regression model and obtained 3000 random test samples to verify it. The results showed that the calculation of these test samples only using DFS took about 10,000 seconds. With Greedy
+
                    <p>After reviewing the literature, we found no standard method or formula for this comparison, and the environment is stable[] in every organism. We therefore built our enzyme score model in consultation with our teachers.</p>
                     alone, it takes about 260,000 seconds; When the best algorithm is used every time, it takes about 7000 seconds. When using the model for prediction, it takes 9,100 seconds. Our model makes our running speed close to the theoretical
+
                     <br>
                     optimal value of about 30%.</p>
+
                    <p><b>KKM and KM Comparison Score</b></p>
            </section>
+
                    <p>Because the collected enzyme records differ by orders of magnitude, we use the standardization of dispersion (min-max normalization) to calculate the comparison score.</p>
            <!-- cd-section -->
+
                    <br>
 +
                    <p>For these statistics, if the database contains a KM record, it is added to the KM list. If not, we collect the KM values of other enzymes that catalyze the same substrate and use their median as a substitute.</p>
 +
                    <br>
 +
                    <p>First, we collect all KM and KKM values for the selected enzymes into two lists. We then calculate the position of each KM and KKM value within the interval spanned by its list. This relative position is the KM or KKM comparison score.</p>
 +
                    <img src="https://static.igem.org/mediawiki/2019/f/f7/T--Tongji_Software--model_KM_score.png" class="PushImage" width="50%" alt="DFS"><br>
 +
                    <p>The KKM score is calculated in the same way.</p>
 +
                    <img src="https://static.igem.org/mediawiki/2019/b/b0/T--Tongji_Software--model_KKM_score.png" class="PushImage" width="50%" alt="DFS"><br>
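<p>A minimal sketch of the two steps described in this subsection, median substitution for missing records and the position of a value within the interval of its list, might look like this. The function names are hypothetical, and whether a lower or higher position is rewarded is determined by the formulas shown in the images, which this sketch does not try to reproduce.</p>

```python
from statistics import median


def impute_with_median(value, related_values):
    """Use the record if present, otherwise the median of related records."""
    return value if value is not None else median(related_values)


def position_score(value, values):
    """Relative position of `value` in [min(values), max(values)], from 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # assumption: a degenerate interval gets a neutral score of 0
    return (value - lo) / (hi - lo)


# Made-up KM values; None marks a missing record.
km_records = [0.02, None, 1.5, 0.4]
same_substrate_km = [0.1, 0.3, 0.2]  # hypothetical values from other enzymes
km_list = [impute_with_median(v, same_substrate_km) for v in km_records]
print([round(position_score(v, km_list), 3) for v in km_list])
```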
 +
                    <br>
 +
                    <p><b>pH and Temperature Similarity Score</b></p>
 +
                    <p>We compiled statistics from the data to build a model for comparing the experimental host with the selected enzyme.</p>
 +
                    <br>
 +
                    <p>We collected the physical and chemical properties of every enzyme that comes from E. coli or yeast and plotted them as histograms. As the plots show, the data are normally distributed, so we use this model to score the
 +
                        similarity between the enzyme and the host’s environment.</p>
 +
                     <br>
 +
                    <p>For these statistics, we first combine pHR (pH range) with pHO (pH optimum) and TR (temperature range) with TO (temperature optimum), and choose the median as the suitable pH and temperature value for the selected enzymes.</p>
 +
                    <br>
 +
                    <p>For these statistics, if the database contains a pH record, it is added to the pH list. If not, we collect the pH values of other enzymes that exist in the same organism and use their median as a substitute. Temperature is handled in the same way.</p>
 +
                     <img src="https://static.igem.org/mediawiki/2019/5/57/T--Tongji_Software--model_flow.png" class="PushImage" width="110%" alt="DFS"><br>
 +
                    <p style="text-align: center;font-size: 1.1vw;">Fig4. pH and temperature distributions in the E. coli and yeast environments</p><br><br>
 +
                    <p>Based on this model, we take the frequency as the similarity score.</p>
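<p>One way to read "frequency as the similarity score" is to fit a normal distribution to the host values and evaluate its density at the enzyme's value. The sketch below does that with Python's <code>statistics.NormalDist</code>. Dividing by the peak density so the score lies between 0 and 1 is an assumption added for this example, as are the function name and the sample pH values.</p>

```python
from statistics import NormalDist


def similarity_score(enzyme_value, host_values):
    """Normal density at the enzyme's value, rescaled so the host mean scores 1.0."""
    # Fit a normal distribution to the host data (needs at least two
    # distinct values so that the standard deviation is positive).
    dist = NormalDist.from_samples(host_values)
    # Rescaling by the peak density is an assumption of this sketch.
    return dist.pdf(enzyme_value) / dist.pdf(dist.mean)


# Made-up pH optima of enzymes recorded for a host organism.
host_ph = [6.5, 7.0, 7.5]
for ph in (7.0, 7.2, 9.0):
    print(ph, round(similarity_score(ph, host_ph), 3))
```

<p>Values near the centre of the host distribution score close to 1, and the score falls off as the enzyme's preferred pH or temperature moves away from the host environment.</p>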
 +
                    <br>
 +
                    <p><b>Score Formula</b></p>
 +
                    <br>
 +
                    <p>Considering the priority of each physical and chemical property, we devised the following formula as the enzyme score model.</p>
 +
                    <br>
 +
                    <img src="https://static.igem.org/mediawiki/2019/8/8f/T--Tongji_Software--model_enzyme_score-white.png" class="PushImage" width="70%" alt="DFS"><br><br>
 +
                    <p>Ranking the selected enzymes by the score obtained from this formula gives the enzyme selection result for each enzyme.</p>
  
            <section id="Enzyme_selection" class="cd-section">
+
                 </section>
                 <h1 class="medalSubTitle"><b>E</b>NZYME<b> S</b>ELECTION</h1>
+
                 <!-- cd-section -->
                 <br><br>
+
                <p><b>Overview</b></p>
+
                <p>For each enzyme in the pathway by searching or just for one enzyme selection, we can get it’s physical and chemical properties in each organism that exist the selected enzyme. First of all, we make a hypothesis that the users’ experimental
+
                    engineering bacteria is E.coli or yeast. So we compare all the organism with E.coli or yeast with physical and chemical properties. </p>
+
                <br>
+
                <p>After reading paper, we found that there is no standard or formula for this compare methods, and the environment is stable[] in every organism. So we build our enzyme score model by talking about with teachers.</p>
+
                <br>
+
                <p><b>KKM and KM Comparison Score</b></p>
+
                <p>After we get many enzyme records, considering the difference of magnitude, we use the standardization of dispersion to calculate the comparison score. </p>
+
                <br>
+
                <p>Among this statistic, if there is KM record in database, it will be added to the KM list. If not, we count the KM value from other enzymes that catalytic the same substrate. And get the median as the substitute.</p>
+
                <br>
+
                <p>First, we get all KM value and KKM value for selected enzymes as two list. We calculate the KM and KKM position in the interval of KM list and KKM list. And the position distribution is the KKM and KM comparison score.</p>
+
                <img src="https://static.igem.org/mediawiki/2019/f/f7/T--Tongji_Software--model_KM_score.png" class="PushImage" width="50%" alt="DFS"><br>
+
                <p>The way of calculate KKM score is the same.</p>
+
                <img src="https://static.igem.org/mediawiki/2019/b/b0/T--Tongji_Software--model_KKM_score.png" class="PushImage" width="50%" alt="DFS"><br>
+
                <br>
+
                <p><b>pH and Temperature Similarity Score</b></p>
+
                <p>We make a statistic in the data to build a model for the comparison between experimental engineering bacteria and select enzyme. </p>
+
                <br>
+
                <p>We statistic every physical and chemical properties of enzyme come from E.coli or yeast. And we make the statistic data as a hist plot. Like the plot show, this distribution is normally distributed. So we use this model to score the similarity
+
                    between enzyme and host’s environment.</p>
+
                <br>
+
                <p>For this statistic, we first combine pHR (pH range) and pHO(pH optinism) together, combine TR(Tempereature range) and TO(Temperature optimism) together, choose the median as the suitable pH and Temperature value for the selected enzymes.</p>
+
                <br>
+
                <p>Among this statistic, if there is pH record in database, it will be added to the pH list. If not, we count the KM value from other enzymes that exists in the same organism. And get the median as the substitute. Temperature is the same.</p>
+
                <img src="https://static.igem.org/mediawiki/2019/5/57/T--Tongji_Software--model_flow.png" class="PushImage" width="110%" alt="DFS"><br>
+
                <p style="text-align: center;font-size: 1.1vw;">Fig4. pH & temperature distribution in E.coli & yeast environment</p><br><br>
+
                <p>Base on this model, we consider the frequency as the similarity score.</p>
+
                <br>
+
                <p><b>Score Formula</b></p>
+
                <br>
+
                <p>Considering the priority of each physical and chemical properties, we simulation this formula as the enzyme score model.</p>
+
                <br>
+
                <img src="https://static.igem.org/mediawiki/2019/8/8f/T--Tongji_Software--model_enzyme_score-white.png" class="PushImage" width="70%" alt="DFS"><br><br>
+
                <p>After ranking selected enzymes by the score get from this formula, we can get the enzyme selection result for each enzyme.</p>
+
  
            </section>
+
                <section id="Codon_optimization" class="cd-section">
            <!-- cd-section -->
+
                    <h1 class="medalSubTitle"><b>C</b>ODON<b> O</b>PTIMIZATION</h1>
 +
                    <br><br>
 +
                    <p><b>Codon Termed</b></p>
 +
                    <p>In codon optimization, we want the enzyme sequences obtained from different organisms to have a stable, and ideally high, expression level in our target organism, while keeping the original codon environment
 +
                        of exogenous gene as possible as we could. So, we abandoned the popular method called “codon harmonization”, using the thinking of random point mutation to instead, only replace specific codons but not almost change them all like
 +
                        what “codon harmonization” do. So, in order to decide which codons could be replaced, we introduce a parameter to represent the relative adaptiveness of a codon termed wi. The wi represents the ratio between the using frequency
 +
                        of the current codon (fi) and the using frequency of the most frequent synonymous codon for that amino acid (max(fj)), and in a way wi represents the codon usage bias of chassis.
 +
                    </p><br>
 +
                    <img src="https://static.igem.org/mediawiki/2019/3/33/T--Tongji_Software--model_wi.png" class="PushImage" width="60%" alt="DFS"><br>
 +
                    <p><b>GC Content</b></p>
 +
                    <p>The usage frequency of every codon in E. coli and yeast can be obtained from the Kazusa online database [2] or GenBank [3]. Given a table of usage frequencies, a threshold can be set to screen for codons with a low wi score, which may
 +
                        have a negative influence on expression. The codons picked out in this step form a candidate codon list, from which we randomly choose some to be replaced with a synonymous codon that
 +
                        has a higher usage frequency. To determine how many codons should be replaced and to evaluate the probable effect on the expression level, we introduce GC content and CAI as parameters. The GC content of a sequence is
 +
                        the ratio between the number of G and C bases and the total number of bases. If the GC content of an exogenous gene is similar to that of the target organism’s genome, the gene is considered to have a better
 +
                        expression performance in the target organism.
  
            <section id="Codon_optimization" class="cd-section">
+
                    </p>
                <h1 class="medalSubTitle"><b>C</b>ODON<b> O</b>PTIMIZATION</h1>
+
                    <br>
                <br><br>
+
                    <img src="https://static.igem.org/mediawiki/2019/8/8c/T--Tongji_Software--model_GC.png" class="PushImage" width="60%" alt="DFS"><br>
                <p><b>Codon Termed</b></p>
+
                    <p><b>Codon Adaptation Index (CAI)</b></p>
                <p>In codon optimization, we want the enzyme sequences obtained from different organisms to reach a stable, and ideally high, expression level in the target organism while keeping the original codon environment of
+
                    <p>The Codon Adaptation Index (CAI) is the most widespread technique for analyzing codon usage bias and predicting the expression level of a gene from its codon sequence [4]. CAI ranges between 0 and 1; if its value
                    the exogenous gene as far as possible. We therefore abandoned the popular “codon harmonization” method and instead follow the idea of random point mutation, replacing only specific codons rather than almost all of them, unlike what
+
                        is closer to 1, the sequence may have a better expression level in that target organism. The CAI is calculated as the geometric mean of the relative adaptiveness of each codon (wi) over the total number of codons
                    “codon harmonization” does. To decide which codons may be replaced, we introduce a parameter for the relative adaptiveness of a codon, termed wi. It is the ratio between the usage frequency of the current
+
                        (L).</p>
                    codon (fi) and the usage frequency of the most frequent synonymous codon for that amino acid (max(fj)); in this way, wi reflects the codon usage bias of the chassis.
+
                    <br>
                </p><br>
+
                    <img src="https://static.igem.org/mediawiki/2019/3/33/T--Tongji_Software--model_CAI.png" class="PushImage" width="20%" alt="DFS"><br>
                <img src="https://static.igem.org/mediawiki/2019/3/33/T--Tongji_Software--model_wi.png" class="PushImage" width="60%" alt="DFS"><br>
+
                    <p>With these preparations done, we turn to the codon optimization itself. First, we generate a list of candidate sequences from the input sequence by replacing codons from the candidate codon list in different random combinations. Then we consider
                <p><b>GC Content</b></p>
+
                        both GC content and CAI to filter and rank the candidate sequences, and finally return the optimized sequence with the best GC content or the highest CAI value.</p>
                <p>The usage frequency of every codon in E. coli and yeast can be obtained from the Kazusa online database [2] or GenBank [3]. Given a table of usage frequencies, a threshold can be set to screen for codons with a low wi score, which may cause
+
                    <br>
                    a negative influence on expression. The codons picked out in this step form a candidate codon list, from which we randomly choose some to be replaced with a synonymous codon that has a higher
+
                    <p><b>Example</b></p>
                    usage frequency. To determine how many codons should be replaced and to evaluate the probable effect on the expression level, we introduce GC content and CAI as parameters. A sequence’s GC content is computed as the ratio between
+
                    <p>As an example, we take the sequences of acetyl-CoA C-acetyltransferase [EC:2.3.1.9] from different organisms (donors) and optimize them to raise the chance of good expression in E. coli. The average
                    the number of G and C bases and the total number of bases. If the GC content of an exogenous gene is similar to that of the target organism’s genome, the gene is considered to have a better expression performance
+
                        GC content of E. coli is 0.58, so we set the target GC content range to [0.574, 0.586]. The results are as follows.</p>
                     in the target organism.
+
                     <img src="https://static.igem.org/mediawiki/2019/5/55/T--Tongji_Software--model_condon_plot1.png" class="PushImage" width="100%" alt="DFS"><br>
 +
                    <p style="text-align: center;font-size: 1.1vw;">Fig. 5. Codon optimization results shown by CAI value (left) and GC content (right)</p><br><br>
  
                </p>
 
                <br>
 
                <img src="https://static.igem.org/mediawiki/2019/8/8c/T--Tongji_Software--model_GC.png" class="PushImage" width="60%" alt="DFS"><br>
 
                <p><b>Codon Adaptation Index (CAI)</b></p>
 
                <p>The Codon Adaptation Index (CAI) is the most widespread technique for analyzing codon usage bias and predicting the expression level of a gene from its codon sequence [4]. CAI ranges between 0 and 1; the closer its value is
 
                    to 1, the better the sequence is expected to be expressed in the target organism. The CAI is calculated as the geometric mean of the relative adaptiveness of each codon (wi) over the total number of codons (L).</p>
 
                <br>
 
                <img src="https://static.igem.org/mediawiki/2019/3/33/T--Tongji_Software--model_CAI.png" class="PushImage" width="20%" alt="DFS"><br>
 
                <p>With these preparations done, we turn to the codon optimization itself. First, we generate a list of candidate sequences from the input sequence by replacing codons from the candidate codon list in different random combinations. Then we consider both
 
                    GC content and CAI to filter and rank the candidate sequences, and finally return the optimized sequence with the best GC content or the highest CAI value.</p>
 
                <br>
 
                <p><b>Example</b></p>
 
                <p>As an example, we take the sequences of acetyl-CoA C-acetyltransferase [EC:2.3.1.9] from different organisms (donors) and optimize them to raise the chance of good expression in E. coli. The average GC content
 
                    of E. coli is 0.58, so we set the target GC content range to [0.574, 0.586]. The results are as follows.</p>
 
                <img src="https://static.igem.org/mediawiki/2019/5/55/T--Tongji_Software--model_condon_plot1.png" class="PushImage" width="100%" alt="DFS"><br>
 
                <p style="text-align: center;font-size: 1.1vw;">Fig. 5. Codon optimization results shown by CAI value (left) and GC content (right)</p><br><br>
 
  
 +
                </section>
  
            </section>
+
                <section id="Reference" class="cd-section">
 +
                    <h1 class="medalSubTitle"><b>R</b>EFERENCE</h1>
 +
                    <br><br>
 +
                    <p style="font-size:1vw;">[1] C. Brininger, S. Spradlin, L. Cobani, C. Evilia. The more adaptive to change, the more likely you are to survive: Protein adaptation in extremophiles. Seminars in Cell & Developmental Biology, Volume 84, 2018, Pages 158-169, ISSN
 +
                        1084-9521, https://doi.org/10.1016/j.semcdb.2017.12.016.
 +
                    </p>
 +
                    <p style="font-size:1vw;">[2] http://www.kazusa.or.jp/codon/</p>
 +
                    <p style="font-size:1vw;">[3] https://www.ncbi.nlm.nih.gov/genbank/</p>
 +
                    <p style="font-size:1vw;">[4] Sharp, Paul M.; Li, Wen-Hsiung. The codon adaptation index: a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research, 1987, 15 (3): 1281–1295. doi:10.1093/nar/15.3.1281.</p>
  
            <section id="Reference" class="cd-section">
+
                 </section>
                 <h1 class="medalSubTitle"><b>R</b>EFERENCE</h1>
+
                 <!-- cd-section -->
                 <br><br>
+
                <p style="font-size:1vw;">[1] C. Brininger, S. Spradlin, L. Cobani, C. Evilia. The more adaptive to change, the more likely you are to survive: Protein adaptation in extremophiles. Seminars in Cell & Developmental Biology, Volume 84, 2018, Pages 158-169, ISSN 1084-9521,
+
                    https://doi.org/10.1016/j.semcdb.2017.12.016.
+
                </p>
+
                <p style="font-size:1vw;">[2] http://www.kazusa.or.jp/codon/</p>
+
                <p style="font-size:1vw;">[3] https://www.ncbi.nlm.nih.gov/genbank/</p>
+
                <p style="font-size:1vw;">[4] Sharp, Paul M.; Li, Wen-Hsiung. The codon adaptation index: a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research, 1987, 15 (3): 1281–1295. doi:10.1093/nar/15.3.1281.</p>
+
  
            </section>
+
                <footer class="footer-distributed">
            <!-- cd-section -->
+
                    <div class="footer-all">
 
+
                        <div class="footer-right">
            <footer class="footer-distributed">
+
                            <h3 style="font-size: 1.5vw;color:white"><b>Contact</b></h3>
                <div class="footer-all">
+
                            <p class="my-content-p3">College of Life Science and Technology</p>
                    <div class="footer-right">
+
                            <p class="my-content-p3">Tongji University</p>
                        <h3 style="font-size: 1.5vw;color:white"><b>Contact</b></h3>
+
                            <p class="my-content-p3">No.1239, Yangpu District, Shanghai, China</p>
                        <p class="my-content-p3">College of Life Science and Technology</p>
+
                            <p class="my-content-p3" style="word-wrap:break-word;">Email: Tongji_Software2019@126.com</p>
                        <p class="my-content-p3">Tongji University</p>
+
                        </div>
                        <p class="my-content-p3">No.1239, Yangpu District, Shanghai, China</p>
+
                        <p class="my-content-p3" style="word-wrap:break-word;">Email: Tongji_Software2019@126.com</p>
+
 
                     </div>
 
                     </div>
                </div>
 
  
                <div class="footer-left">
+
                    <div class="footer-left">
 +
 
 +
                        <div class="footer-links">
 +
                            <img class="sponsors" src="https://static.igem.org/mediawiki/2019/d/dd/T--Tongji_Software--picture-footer-TJ.png" alt="Tongji University">
 +
                            <img class="sponsors" src="https://static.igem.org/mediawiki/2019/6/65/T--Tongji_Software--picture-footer-TJLF.png" alt="Tongji University">
 +
                            <img class="sponsors" src="https://static.igem.org/mediawiki/2019/0/09/T--Tongji_Software--picture-footer-GENEWIZI.PNG" alt="Tongji University">
 +
                            <br><br><br>
 +
                            <img class="sponsors" src="https://static.igem.org/mediawiki/2019/d/d3/T--Tongji_Software--picture-footer-kegg.png" alt="Tongji University">
 +
                            <img class="sponsors" src="https://static.igem.org/mediawiki/2019/5/52/T--Tongji_Software--picture-footer-BRENDA.png" alt="Tongji University">
 +
                            <img class="sponsors" src="https://static.igem.org/mediawiki/2019/f/fa/T--Tongji_Software--picture-footer-alpha_ant.png" alt="Tongji University">
 +
                            <img class="sponsors" src="https://static.igem.org/mediawiki/2019/6/68/T--Tongji_Software--picture-footer-Pathlab-white.png" alt="Tongji University">
 +
                            <img class="sponsors" src="https://static.igem.org/mediawiki/2019/e/e1/T--Tongji_Software--picture-logo2.png" alt="Tongji University">
 +
                            <img class="sponsors" src="https://static.igem.org/mediawiki/2019/4/46/T--Tongji_Software--picture-footer-EUCST.png" alt="Tongji University">
 +
                        </div>
  
                    <div class="footer-links">
 
                        <img class="sponsors" src="https://static.igem.org/mediawiki/2019/d/dd/T--Tongji_Software--picture-footer-TJ.png" alt="Tongji University">
 
                        <img class="sponsors" src="https://static.igem.org/mediawiki/2019/6/65/T--Tongji_Software--picture-footer-TJLF.png" alt="Tongji University">
 
                        <img class="sponsors" src="https://static.igem.org/mediawiki/2019/0/09/T--Tongji_Software--picture-footer-GENEWIZI.PNG" alt="Tongji University">
 
                        <br><br><br>
 
                        <img class="sponsors" src="https://static.igem.org/mediawiki/2019/d/d3/T--Tongji_Software--picture-footer-kegg.png" alt="Tongji University">
 
                        <img class="sponsors" src="https://static.igem.org/mediawiki/2019/5/52/T--Tongji_Software--picture-footer-BRENDA.png" alt="Tongji University">
 
                        <img class="sponsors" src="https://static.igem.org/mediawiki/2019/f/fa/T--Tongji_Software--picture-footer-alpha_ant.png" alt="Tongji University">
 
                        <img class="sponsors" src="https://static.igem.org/mediawiki/2019/6/68/T--Tongji_Software--picture-footer-Pathlab-white.png" alt="Tongji University">
 
                        <img class="sponsors" src="https://static.igem.org/mediawiki/2019/e/e1/T--Tongji_Software--picture-logo2.png" alt="Tongji University">
 
                        <img class="sponsors" src="https://static.igem.org/mediawiki/2019/4/46/T--Tongji_Software--picture-footer-EUCST.png" alt="Tongji University">
 
 
                     </div>
 
                     </div>
  
                </div>
+
                    <div id="footer-info" class="inner">
 +
                        <p class="my-content-p3" style="color: #777; text-align: center !important;">Copyright © 2019 Tongji_Software</p>
 +
                    </div>
 +
                </footer>
  
                <div id="footer-info" class="inner">
+
            </div>
                    <p class="my-content-p3" style="color: #777; text-align: center !important;">Copyright © 2019 Tongji_Software</p>
+
            <script src="https://2019.igem.org/Template:Tongji_Software/js/jquery_210_min_js?action=raw&ctype=text/javascript"></script>
                </div>
+
            <script src="https://2019.igem.org/Template:Tongji_Software/js/ProjectMain_js?action=raw&ctype=text/javascript"></script>
            </footer>
+
  
 
         </div>
 
         </div>
         <script src="https://2019.igem.org/Template:Tongji_Software/js/jquery_210_min_js?action=raw&ctype=text/javascript"></script>
+
         <!-- Resource jQuery -->
        <script src="https://2019.igem.org/Template:Tongji_Software/js/ProjectMain_js?action=raw&ctype=text/javascript"></script>
+
 
+
 
     </div>
 
     </div>
    <!-- Resource jQuery -->
 
</div>
 
 
</body>
 
</body>
  
 
</html>
 
</html>

Revision as of 23:19, 21 October 2019

Tongji Software | Pathlab


MODEL

ENZYME SELECTION



Overview

For each enzyme in a searched pathway, or for a single selected enzyme, we can retrieve its physical and chemical properties in every organism that contains the enzyme. We first assume that the user's engineered host is E. coli or yeast, so we compare every source organism with E. coli or yeast in terms of these physical and chemical properties.


After reviewing the literature, we found no standard method or formula for this comparison, and that the environment is stable[] in every organism. We therefore built our enzyme score model in consultation with our teachers.


KKM and KM Comparison Score

Because the collected enzyme records differ by orders of magnitude, we use dispersion standardization (min-max normalization) to calculate the comparison score.


If the database holds a KM record for the enzyme, it is added to the KM list. If not, we collect the KM values of other enzymes that catalyze the same substrate and use their median as a substitute.


First, we collect the KM and KKM values of all selected enzymes as two lists. We then calculate the position of each KM and KKM value within the interval spanned by its list. This position is the KM or KKM comparison score.

[Formula image: KM comparison score]

The KKM score is calculated in the same way.

[Formula image: KKM comparison score]
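A minimal Python sketch of the comparison score described above, assuming that "position in the interval" means a min-max scaling of each value within its list. The function names and the sample KM values are hypothetical and are not taken from the Pathlab code; whether a low or a high position is preferable is left to the score formula.

```python
from statistics import median


def km_or_substitute(km, same_substrate_kms):
    """Use the recorded KM if present, else the median KM of other
    enzymes that catalyze the same substrate."""
    return km if km is not None else median(same_substrate_kms)


def position_score(value, values):
    """Position of `value` inside [min(values), max(values)], scaled to 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:          # all values identical: no spread to scale by
        return 0.0
    return (value - lo) / (hi - lo)


# Made-up KM records (mM); None marks a missing database entry.
records = [0.12, None, 0.50, 2.40]
km_list = [km_or_substitute(km, [0.3, 0.5, 0.9]) for km in records]
scores = [position_score(km, km_list) for km in km_list]
```

The same two functions apply unchanged to the KKM list.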

pH and Temperature Similarity Score

We compute statistics over the data to build a model for comparing the engineered host with the selected enzyme.


We collect every physical and chemical property of the enzymes that come from E. coli or yeast and plot the data as histograms. As the plots show, the data are approximately normally distributed, so we use this model to score the similarity between an enzyme and the host's environment.


For this statistic, we first combine pHR (pH range) and pHO (pH optimum), and likewise TR (temperature range) and TO (temperature optimum), and take the median of each combined set as the suitable pH and temperature value for the selected enzyme.


If the database holds a pH record for the enzyme, it is added to the pH list. If not, we collect the pH values of other enzymes that exist in the same organism and use their median as a substitute. Temperature is handled in the same way.


Fig. 4. pH and temperature distributions in the E. coli and yeast environments



Based on this model, we take the frequency as the similarity score.
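The frequency-based similarity score can be sketched in Python as follows. This is an illustration under stated assumptions: the histogram bin width, the function names and the sample pH values are all hypothetical, and the score is taken to be the relative frequency of the histogram bin that the enzyme's value falls into.

```python
import math
from collections import Counter
from statistics import median


def suitable_value(range_values, optimum_values):
    """Median of the combined range and optimum records of one enzyme."""
    return median(list(range_values) + list(optimum_values))


def frequency_table(host_values, bin_width):
    """Relative frequency of the host's enzyme values per histogram bin."""
    counts = Counter(math.floor(v / bin_width) for v in host_values)
    return {b: n / len(host_values) for b, n in counts.items()}


def similarity_score(value, table, bin_width):
    """Frequency of the bin that `value` falls into (0.0 for an empty bin)."""
    return table.get(math.floor(value / bin_width), 0.0)


# Made-up pH values of host enzymes, for illustration only.
host_ph = [6.5, 7.0, 7.0, 7.2, 7.5, 7.5, 7.5, 8.0]
table = frequency_table(host_ph, bin_width=0.5)

enzyme_ph = suitable_value([6.8, 8.0], [7.4])  # pH range records + pH optimum
score = similarity_score(enzyme_ph, table, bin_width=0.5)
```

An enzyme whose suitable pH lies in a densely populated bin of the host distribution gets a high score; a value outside the observed range scores zero. Temperature would be scored the same way with its own table.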


Score Formula


Considering the priority of each physical and chemical property, we formulated the following enzyme score model.


[Formula image: enzyme score model]

After ranking the selected enzymes by the score obtained from this formula, we get the enzyme selection result for each enzyme.

CODON OPTIMIZATION



Codon Termed

In codon optimization, we want the enzyme sequences obtained from different organisms to reach a stable, and ideally high, expression level in the target organism while keeping the original codon environment of the exogenous gene as far as possible. We therefore abandoned the popular “codon harmonization” method and instead follow the idea of random point mutation: only specific codons are replaced, rather than almost all of them as in “codon harmonization”. To decide which codons may be replaced, we introduce a parameter for the relative adaptiveness of a codon, termed wi. It is the ratio between the usage frequency of the current codon (fi) and the usage frequency of the most frequent synonymous codon for that amino acid (max(fj)); in this way, wi reflects the codon usage bias of the chassis.


$$w_i = \frac{f_i}{\max_j f_j}$$
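As a sketch of how wi can be derived from a codon usage table, the following Python snippet divides each codon's frequency by the highest frequency among its synonymous codons. The two amino acids and their frequencies are made-up illustration values, not Kazusa data, and the function name is hypothetical.

```python
def relative_adaptiveness(usage, synonymous):
    """w_i = f_i / max(f_j), where j runs over the codons synonymous with i."""
    w = {}
    for codons in synonymous.values():
        top = max(usage[c] for c in codons)
        for c in codons:
            w[c] = usage[c] / top
    return w


# Illustrative subset of a codon table with made-up frequencies (per thousand).
SYNONYMOUS = {"Lys": ["AAA", "AAG"], "Phe": ["TTT", "TTC"]}
USAGE = {"AAA": 30.0, "AAG": 10.0, "TTT": 20.0, "TTC": 16.0}

w = relative_adaptiveness(USAGE, SYNONYMOUS)
```

The most frequent codon of each amino acid always gets wi = 1, and rarer synonyms get proportionally smaller values.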

GC Content

The usage frequency of every codon in E. coli and yeast can be obtained from the Kazusa online database [2] or GenBank [3]. Given a table of usage frequencies, a threshold can be set to screen for codons with a low wi score, which may have a negative influence on expression. The codons picked out in this step form a candidate codon list, from which we randomly choose some to be replaced with a synonymous codon that has a higher usage frequency. To determine how many codons should be replaced and to evaluate the probable effect on the expression level, we introduce GC content and CAI as parameters. The GC content of a sequence is the ratio between the number of G and C bases and the total number of bases. If the GC content of an exogenous gene is similar to that of the target organism’s genome, the gene is considered to have a better expression performance in the target organism.


$$\mathrm{GC\ content} = \frac{N_G + N_C}{N_{\mathrm{total}}}$$
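The threshold screen and the GC content calculation can be sketched in Python as below. The wi values, the threshold of 0.5 and the function names are assumptions chosen for illustration; codons missing from the table are treated as fully adapted so that they are never selected.

```python
def gc_content(seq):
    """Fraction of G and C bases among all bases of the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def candidate_positions(seq, w, threshold):
    """Indices (in codon units) of codons whose w_i is below the threshold."""
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return [k for k, codon in enumerate(codons) if w.get(codon, 1.0) < threshold]


# Made-up relative adaptiveness values, for illustration only.
W = {"AAA": 1.0, "AAG": 0.25, "TTT": 1.0, "TTC": 0.8}

sequence = "AAGTTCAAA"
weak = candidate_positions(sequence, W, threshold=0.5)
gc = gc_content(sequence)
```

Here only the first codon (AAG) falls below the threshold, so it is the single replacement candidate.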

Codon Adaptation Index (CAI)

The Codon Adaptation Index (CAI) is the most widespread technique for analyzing codon usage bias and predicting the expression level of a gene from its codon sequence [4]. CAI ranges between 0 and 1; the closer its value is to 1, the better the sequence is expected to be expressed in the target organism. The CAI is calculated as the geometric mean of the relative adaptiveness of each codon (wi) over the total number of codons (L).


$$\mathrm{CAI} = \left(\prod_{i=1}^{L} w_i\right)^{1/L}$$
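A small Python sketch of the CAI as a geometric mean, computed through logarithms to avoid underflow on long sequences. The wi table is made up, the function name is hypothetical, and skipping codons that are absent from the table is an assumption of this sketch.

```python
import math


def cai(seq, w):
    """Geometric mean of w_i over the codons of `seq` that appear in `w`."""
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    scores = [w[c] for c in codons if c in w]
    if not scores:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Made-up relative adaptiveness values, for illustration only.
W = {"AAA": 1.0, "AAG": 0.25, "TTT": 1.0, "TTC": 0.8}

value = cai("AAGAAA", W)   # geometric mean of 0.25 and 1.0
```

A sequence built only from the most frequent codons scores exactly 1.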

With these preparations done, we turn to the codon optimization itself. First, we generate a list of candidate sequences from the input sequence by replacing codons from the candidate codon list in different random combinations. Then we consider both GC content and CAI to filter and rank the candidate sequences, and finally return the optimized sequence with the best GC content or the highest CAI value.
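The procedure above can be sketched in Python as follows. This is one possible reading, not the Pathlab implementation: candidates are filtered by a target GC range and then ranked by CAI, with a fallback to all candidates if none meets the GC range. The wi table, the `best_synonym` mapping, the number of random draws and all function names are assumptions made for illustration.

```python
import math
import random


def split_codons(seq):
    """Split a sequence into codons, ignoring a trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


def cai(seq, w):
    scores = [w[c] for c in split_codons(seq) if c in w]
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def optimize(seq, w, best_synonym, threshold, gc_range, n_candidates=200, seed=0):
    """Randomly replace low-w_i codons, keep candidates inside the GC range,
    and return the candidate with the highest CAI."""
    rng = random.Random(seed)
    codons = split_codons(seq)
    tail = seq[len(codons) * 3:]
    weak = [k for k, c in enumerate(codons)
            if w.get(c, 1.0) < threshold and c in best_synonym]
    candidates = {seq}                       # the input itself stays eligible
    for _ in range(n_candidates if weak else 0):
        chosen = rng.sample(weak, rng.randint(1, len(weak)))
        mutated = list(codons)
        for k in chosen:
            mutated[k] = best_synonym[mutated[k]]
        candidates.add("".join(mutated) + tail)
    low, high = gc_range
    in_range = [s for s in candidates if low <= gc_content(s) <= high]
    pool = sorted(in_range or candidates)    # fall back if the filter is empty
    return max(pool, key=lambda s: cai(s, w))


# Made-up table: w_i values and the most frequent synonym of each weak codon.
W = {"AAA": 1.0, "AAG": 0.25, "TTT": 1.0, "TTC": 0.8}
BEST = {"AAG": "AAA", "TTC": "TTT"}

original = "AAGTTCAAG"
best = optimize(original, W, BEST, threshold=0.5, gc_range=(0.0, 1.0))
```

Because every replacement is synonymous, the encoded protein is unchanged; only the weak codons below the threshold are ever touched, which keeps the rest of the donor's codon environment intact.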


Example

As an example, we take the sequences of acetyl-CoA C-acetyltransferase [EC:2.3.1.9] from different organisms (donors) and optimize them to raise the chance of good expression in E. coli. The average GC content of E. coli is 0.58, so we set the target GC content range to [0.574, 0.586]. The results are as follows.


Fig. 5. Codon optimization results shown by CAI value (left) and GC content (right)



REFERENCE



[1] C. Brininger, S. Spradlin, L. Cobani, C. Evilia. The more adaptive to change, the more likely you are to survive: Protein adaptation in extremophiles. Seminars in Cell & Developmental Biology, Volume 84, 2018, Pages 158-169, ISSN 1084-9521, https://doi.org/10.1016/j.semcdb.2017.12.016.

[2] http://www.kazusa.or.jp/codon/

[3] https://www.ncbi.nlm.nih.gov/genbank/

[4] Sharp, Paul M.; Li, Wen-Hsiung. The codon adaptation index: a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research, 1987, 15 (3): 1281–1295. doi:10.1093/nar/15.3.1281.