Team:BM-AMU/Model

Team:BM-AMU- 2019.igem.org

Modeling



General Description

With regard to the modeling part, we have two goals.

First, in the EMT “Monitor” model, we label the proteins expressed during EMT with red and green fluorescence and build a quantitative model of fluorescence intensity to better detect the different stages of EMT. To do this, we photograph the fluorescence, extract color and texture features from the images, merge them, and train an SVM. The result is a database linking fluorescence intensity to cell subtype, so that the EMT subtype of a cell can be determined from its fluorescence intensity.

Second, in the EMT “Controller” model, we use a polynomial regression fitting model of the enzymatic reaction to relate fluorescence intensity to the dose of the inducer Dox/Tamoxifen, so that the EMT process can be manipulated more accurately.

Modeling Details

• Part 1: EMT “Monitor” modeling

In the project, we measured the intensity of red and green fluorescence and sorted different cell subtypes by flow cytometry. We want to study the cell subtypes that arise during EMT from the perspective of fluorescence level. The key step is to establish the correspondence between fluorescence intensity and cell subtype. The usual approach is to extract features such as color, texture, SIFT and SURF from every image in the image library and store them. An index of these features is built; the same features are then extracted from the query image and compared with the features in the database to find the closest picture.

We first extracted color and texture features with image feature extraction functions. We then assigned image categories with a machine learning method. After constructing a labeled test set, we used a support vector machine to learn the relationship between a whole fluorescence image and the cell subtype, so that the specific cell subtype can be roughly judged from any fluorescence intensity. This is a process of inference from phenotype to genotype.

i. Data set establishment

We took a picture of the fluorescence at each time point in the EMT process, sorted out four different cell subtypes by flow cytometry, and created four sets of images, as shown below:

[Image: the four image sets, one per cell subtype]

We label the four cell subtypes 4, 3, 2, 1 in turn.

According to the figure, the red fluorescence intensity decreases with time, and this decrease indicates the progress of EMT. Three thresholds in the EMT process were found by flow cytometry, dividing it into four stages.

ii. Image feature extraction

The color and texture characteristics of different fluorescent images are different.

We selected two kinds of features. Color features are among the most common image features, mainly because color is usually strongly tied to the scene or objects in an image. Compared with other image features, color features depend less on the size and viewing angle of the image and are therefore more robust. A color feature describes the image region, or the scene corresponding to the whole image; it is a global feature computed from individual pixels. Here, the color features are the mean, variance, and third moment of each of the three channels H, S, and V, which gives a 9-dimensional feature vector representing the color of the image.[1]
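The nine color features can be sketched as follows. This is a minimal Python illustration, not the team's MATLAB code: the function name `hsv_color_moments` and the sample pixels are hypothetical, the third moment is taken as the raw third central moment, and hue is treated as an ordinary number even though it is circular.

```python
import colorsys

def hsv_color_moments(rgb_pixels):
    """Return 9 values: (mean, variance, third central moment) for H, S and V.

    rgb_pixels: iterable of (r, g, b) tuples with 8-bit values in 0..255.
    """
    # colorsys works on floats in 0..1 and returns (h, s, v) in 0..1.
    hsv = [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
           for r, g, b in rgb_pixels]
    n = len(hsv)
    features = []
    for channel in range(3):  # 0 = H, 1 = S, 2 = V
        values = [pixel[channel] for pixel in hsv]
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        features.extend([mean, variance, third])
    return features

# Hypothetical pixels: two reddish and two greenish ones.
pixels = [(255, 0, 0), (200, 30, 30), (0, 255, 0), (20, 180, 40)]
color_features = hsv_color_moments(pixels)
```

For a real image, the pixel list would be the flattened contents of the M×N×3 array described in the appendix.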

A texture feature represents homogeneous patterns in an image. It reflects periodic or slowly varying structure on the surface of an object. Texture is expressed by the gray-level distribution of pixels and their spatial neighborhoods: it shows non-random arrangement, local repetition, and overall uniformity. Like color, texture is a global feature that describes the surface of the image region or of the scene corresponding to the whole image. Unlike color, it is not based on a single pixel; it must be computed statistically over a region containing many pixels. The final 15 texture features are: small gradient advantage, large gradient advantage, gray distribution non-uniformity, gradient distribution non-uniformity, energy, gray average, gradient average, gray mean square error, gradient mean square error, correlation, gray entropy, gradient entropy, mixed entropy, inertia, inverse moment.[2]
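The feature names listed above match those usually derived from a gray-gradient co-occurrence matrix, so the sketch below assumes that construction. It is a Python illustration only: it computes 8 of the 15 features, and the function name, the 16 quantization levels and the simple difference gradient are all assumptions rather than the team's settings.

```python
import math

def gray_gradient_features(gray, levels=16):
    """Subset of gray-gradient co-occurrence features for a 2-D gray image.

    gray: list of rows, each a list of gray values in 0..255.
    """
    rows, cols = len(gray), len(gray[0])
    # Gradient magnitude from neighbour differences (indices clamped at borders).
    grad = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            gx = gray[r][min(c + 1, cols - 1)] - gray[r][max(c - 1, 0)]
            gy = gray[min(r + 1, rows - 1)][c] - gray[max(r - 1, 0)][c]
            grad[r][c] = math.hypot(gx, gy)
    max_grad = max(max(row) for row in grad) or 1.0  # avoid dividing by zero

    # Joint distribution p[i][j] of quantized gray level i and gradient level j.
    p = [[0.0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            i = min(int(gray[r][c] * levels / 256), levels - 1)
            j = min(int(grad[r][c] * levels / max_grad), levels - 1)
            p[i][j] += 1.0 / (rows * cols)

    p_gray = [sum(row) for row in p]                                  # marginal over j
    p_grad = [sum(p[i][j] for i in range(levels)) for j in range(levels)]

    def entropy(probs):
        return -sum(q * math.log(q) for q in probs if q > 0.0)

    cells = [(i, j, p[i][j]) for i in range(levels) for j in range(levels)]
    return {
        "energy": sum(q * q for _, _, q in cells),
        "gray_mean": sum(i * q for i, q in enumerate(p_gray)),
        "gradient_mean": sum(j * q for j, q in enumerate(p_grad)),
        "gray_entropy": entropy(p_gray),
        "gradient_entropy": entropy(p_grad),
        "mixed_entropy": entropy([q for _, _, q in cells]),
        "inertia": sum((i - j) ** 2 * q for i, j, q in cells),
        "inverse_moment": sum(q / (1.0 + (i - j) ** 2) for i, j, q in cells),
    }

# Hypothetical 4x4 gray image with a vertical edge in the middle.
demo_image = [[10, 10, 200, 200] for _ in range(4)]
texture_features = gray_gradient_features(demo_image)
```

The remaining features (gradient advantages, non-uniformities, mean square errors, correlation) are further sums over the same matrix `p` and can be added in the same way.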

For color feature extraction of the image, the calculation results are as follows:


Figure 1.1: Color Feature Training Set

For textural feature extraction of the image, the calculation results are as follows:


Figure 1.2 : Texture Feature Training Set

Within each of the four image sets, the images have similar features. We therefore concatenate the color features and the texture features of each image into one vector, which forms one row of the feature matrix. The sample label of each image is the name of the folder that contains it, and the labels are stored in the same order as the feature vectors.

iii. Support vector machine training and testing

1) Establishment of classifier

We randomly select part of each class as training data and use the remainder as test data. A Gaussian radial basis kernel function maps the input feature vectors into a high-dimensional feature space, and the optimal separating hyperplane is constructed in that space, yielding a radial basis function classifier.
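The two ingredients named here, a random per-class split and the Gaussian radial basis kernel, can be sketched in Python as follows. The 70% training fraction, the seed, the class sizes and the function names are assumptions for illustration; the original does not state how large the training part was.

```python
import math
import random

def split_per_class(labels, train_fraction=0.7, seed=0):
    """Randomly pick a fraction of each class for training, the rest for testing."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        cut = max(1, int(round(train_fraction * len(indices))))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

def rbf_kernel(x, z, g):
    """Gaussian radial basis kernel K(x, z) = exp(-g * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-g * squared_distance)

# Hypothetical data set: 10 images in each of the four subtype folders.
labels = [1] * 10 + [2] * 10 + [3] * 10 + [4] * 10
train_idx, test_idx = split_per_class(labels)
similarity = rbf_kernel([1.0, 0.0], [0.0, 0.0], g=1.0)
```

Splitting inside each class keeps all four subtypes represented in both the training and the test data.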

2) Image testing and prediction

SVM prediction depends on two important parameters, c and g (see the appendix for their meaning), which affect prediction performance. To improve the model, these two parameters were optimized with the grid search method (GS). To avoid over-learning and under-learning, 5-fold cross-validation was used, with the highest classification accuracy on the training set as the fitness function. The c and g that give the highest classification accuracy on the validation set are the best parameters. In GS, a global search is conducted at intervals of 0.5, and the range of c and g is (2^-10, 2^10).
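The search loop can be sketched in Python as below. It assumes that the interval of 0.5 applies to the exponent, so that c and g each take the values 2^-10, 2^-9.5, ..., 2^10; this is a common convention but is not stated explicitly above. The scoring function `toy_cv_accuracy` is a hypothetical stand-in: in a real run it would train an RBF SVM on each of the five training folds and return the mean validation accuracy.

```python
import itertools
import math

def exponent_grid(low=-10.0, high=10.0, step=0.5):
    """Exponents low, low + step, ..., high (inclusive)."""
    count = int(round((high - low) / step)) + 1
    return [low + i * step for i in range(count)]

def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for held_out in range(k):
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, folds[held_out]

def grid_search(cv_accuracy, exponents):
    """Return (c, g, score) for the pair with the highest score."""
    best_c, best_g, best_score = None, None, float("-inf")
    for log2_c, log2_g in itertools.product(exponents, repeat=2):
        c, g = 2.0 ** log2_c, 2.0 ** log2_g
        score = cv_accuracy(c, g)
        if score > best_score:
            best_c, best_g, best_score = c, g, score
    return best_c, best_g, best_score

def toy_cv_accuracy(c, g):
    # Hypothetical smooth surface that peaks at c = 2**2 and g = 2**-3.
    return 1.0 / (1.0 + (math.log2(c) - 2.0) ** 2 + (math.log2(g) + 3.0) ** 2)

exponents = exponent_grid()
best_c, best_g, best_score = grid_search(toy_cv_accuracy, exponents)
splits = list(k_fold_indices(20, k=5))  # the folds a real scorer would loop over
```

With 41 exponents per parameter the grid has 1681 (c, g) pairs, each evaluated on five folds, which is why the result is normally shown as the 3D and contour plots of accuracy over log2(c) and log2(g).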

Using the optimal c and g obtained above, we established the prediction model and input 10 images to test its classification performance. The results were as follows:

Figure 1.3 : SVM Parameter Selection Result Diagram(3D Plot)

Figure 1.4 : SVM Parameter Selection Result Diagram(Contour Plot)

Figure 1.5 : SVM Parameter Selection Result Diagram(3D Plot)

Figure 1.6 : SVM Parameter Selection Result Diagram(Contour Plot)

We compared the classifications produced by the support vector machine from the extracted image features with the experimental results. The following figure shows the predicted classification of a random fluorescence image from the test data:

Figure 1.7 : Classification Result Diagram


3) Evaluation of classification results

Finally, we compare the predicted classes with the actual labels to evaluate the accuracy of the EMT “Monitor” model: Accuracy = (number of samples correctly classified / total number of test samples) × 100%.

Accuracy = 90.9091%
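The accuracy formula translates directly into code. In this Python sketch the label lists are invented for illustration: 10 correct predictions out of 11 is simply one combination of counts that yields 90.9091%, not a record of the team's test set.

```python
def accuracy_percent(predicted, actual):
    """Percentage of samples whose predicted label equals the actual label."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)

# Hypothetical labels for 11 test images (subtypes 1..4); one is misclassified.
actual_labels = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4]
predicted_labels = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4]
accuracy = accuracy_percent(predicted_labels, actual_labels)
```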


iv. Result analysis

We established a classifier for cell fluorescence images based on color and texture features. It provides an accurate and reliable way to distinguish cell subtypes in the experiment, and its efficient detection and high classification accuracy make it practically useful.



• Part 2: EMT “Controller” modeling

EMT is traditionally induced with TGF-beta, but this method offers limited accuracy. The Dox-Tet-On system and the Tamoxifen-ERT2 system can overcome this limitation. We therefore recorded, for each Dox/Tamoxifen concentration, the induced levels of the EMT transcription factors and proteins, which include Snail1, ZEB1, Twist1, E-cadherin, N-cadherin and Vimentin. Using a polynomial regression fitting model of the enzymatic reaction, we established the relationship between transcription factor levels, protein expression levels and the Dox/Tamoxifen dose, in order to manipulate the EMT process more precisely.

i. Data collection

We induced EMT to graded extents with different concentrations of the Dox/Tamoxifen inducer. The relationships between inducer concentration and the transcription factor and protein expression levels during EMT were as follows:

Table 2.1 : The relationship between Dox-induced dose and expression of transcription factors and functional proteins

Table 2.2 : The relationship between Tamoxifen -induced dose and expression of transcription factors and functional proteins

ii. Polynomial regression fitting model of enzymatic reaction

We fitted curves relating transcription factor levels and protein expression levels to the Dox/Tamoxifen inducer concentration, as follows:

With Dox:

Figure 2.1 : The fitting curve of Dox-induced dose, transcription factor and functional protein expression



With Tamoxifen:

Figure 2.2 : The fitting curve of Tamoxifen -induced dose, transcription factor and functional protein expression
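A polynomial regression of expression level against inducer dose can be sketched as an ordinary least-squares fit. The Python example below is illustrative only: the dose and expression values are invented (the measured values are in Tables 2.1 and 2.2), the degree 2 is an assumption, and the helper names are hypothetical.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [a0, a1, ...] of a0 + a1*x + a2*x**2 + ..."""
    n = degree + 1
    # Normal equations (A^T A) a = A^T y for the design matrix A[i][j] = xs[i]**j.
    ata = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    aty = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for row in range(col + 1, n):
            factor = ata[row][col] / ata[col][col]
            for k in range(col, n):
                ata[row][k] -= factor * ata[col][k]
            aty[row] -= factor * aty[col]
    # Back substitution.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(ata[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (aty[row] - tail) / ata[row][row]
    return coeffs

def evaluate(coeffs, x):
    """Value of the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

# Hypothetical dose-response data lying exactly on 1 + 2x + 0.5x^2.
doses = [0.0, 1.0, 2.0, 3.0, 4.0]
expression = [1.0, 3.5, 7.0, 11.5, 17.0]
coefficients = fit_polynomial(doses, expression, degree=2)
predicted_at_2_5 = evaluate(coefficients, 2.5)
```

Once a curve has been fitted for each transcription factor or protein, it can be read in the other direction to choose the dose expected to give a target expression level.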

iii. Result analysis

The induction curves of the Dox-Tet-On system and the Tamoxifen-ERT2 system show similar characteristics: the expression of the transcription factors Snail1, ZEB1 and Twist1 increased with increasing inducer concentration. The expression levels of the functional proteins Vimentin and E-cad rose, while the expression level of N-cad fell.

According to the graph analysis, the function expressions of the three are (function expression). We conclude that protein expression is highly responsive to the Dox/Tamoxifen inducer and that the control sensitivity of both the Dox-Tet-On system and the Tamoxifen-ERT2 system is very high, which helps in choosing the induction conditions and determining the precise control concentration.



• Part 3: The fusion protein linker selection modeling

The goal of this project is to construct fluorescently labeled proteins using CRISPR/Cas9 technology. Whether the fluorescent protein still functions properly once fused to the target protein is a concern, and gene fusion technology is one of the keys to success. A fusion protein requires a linker design: different genes of interest are connected through a suitable nucleotide sequence so that they are expressed as a single peptide chain in an appropriate organism. The amino acid sequence that makes this connection is called the linker. Whether the two components of the fusion protein can fold into the correct spatial structure and retain good biological activity is closely related to the linker sequence between them. The design and selection of linker sequences is therefore crucial for the construction of fusion genes. We model the selection of the linker sequence, aiming to improve the function and activity of the fusion protein by choosing the optimal linker.

At present, three types of linker are most commonly used: the IRES sequence, the 2A peptide sequence and the flexible glycine sequence. According to the literature, when an IRES is placed between two open reading frames, both are expressed simultaneously and two completely independent, unmodified proteins are translated. However, the presence of an IRES sometimes affects the structure of the mRNA. The IRES and the 5' cap of the mRNA bind the translation initiation complex differently, so the translation level of the open reading frame behind the IRES may differ from that of the one in front. This would prevent us from using fluorescence intensity to reflect the amount of protein on the membrane and the progress of EMT, so we focus on the functional activity of the 2A peptide and the flexible glycine chain. Both link two genes into one open reading frame, and the mRNA is translated into a fusion protein, so that the molar ratio of fluorescent protein to E/N-cadherin is theoretically 1:1. The difference is that the 2A polypeptide is cleaved by proteases, leaving a peptide chain of more than 20 amino acids on the tail of the upstream protein. If that protein has a small molecular weight, its function will be affected. For the flexible glycine chain, length is an important factor in the construction of fusion genes. If the linker is too long, the fusion protein becomes sensitive to proteases, which reduces the yield of active fusion protein during production. A shorter linker overcomes the protease degradation problem, but the two fused molecules may then sit too close to each other, which can lead to loss of protein function.

Finally, we use the effects of different protein sizes and lengths on protein functional activity, together with locally weighted regression fitting, to find the optimal activity of the 2A peptide and the flexible glycine chain and thereby improve the efficiency of protein function.

i. Data collection

For this model, activity metrics of different 2A peptide binding proteins and flexible glycine chain binding proteins were taken from the Unified Protein Database. The activity metrics and protein sizes are listed in the appendix.

ii. Locally weighted linear regression algorithm

Through the locally weighted linear regression algorithm, the overall trend was fitted with sample points, and the results were as follows:

Figure 3.1 : The fitting curve of flexible glycine chain binding proteins activity and size association

Figure 3.2 : The fitting curve of 2A peptide activity and size association

iii. Result analysis

For proteins of different sizes, the mean and highest activity levels of 2A peptide binding proteins are both lower than those of flexible glycine chain binding proteins. We therefore chose the flexible glycine chain, which shows the highest activity, as the linker for the experiment, so that the function of the target protein is disturbed as little as possible.

Possible improvements

a) Because of time constraints, the experimental results are relatively rough and preliminary. The training set of the SVM model is too small, which lowers classification accuracy. In the future, more experimental samples will be added to improve classification accuracy further.

b) We hope to build a bridge between fluorescent phenotype and transcriptional landscape to characterize dynamic EMT. To this end, a prediction software tool was designed. When an image tagged with specific fluorescent proteins is input, features such as fluorescence intensity and cellular texture are extracted for training, and the most probable sketch map of the transcriptional landscape is output by a support vector machine (SVM) or a convolutional neural network (CNN). An updated, more researcher-friendly version is on the way, and we plan to expand the types of input data so that the tool can also handle immunofluorescence images and even label-free ones. [4]

Reference

1. Gong Yanhua, Zhu Aihong, Dai Lingyun. Color feature extraction based on color histogram. Fujian Computer 2007(5): 96-97.

2. Chen Yuanyuan, Jiao Jiao. Texture feature extraction. Hua Zhang 2012(26).

3. Kitazume T. Enzymatic Reactions, 2005.

4. Ounkomol C, Fernandes DA, Seshamani S, Maleckar MM, Collman F, Johnson GR. Three dimensional cross-modal image inference: label-free methods for subcellular structure prediction. 2017.

5. Adankon MM, Cheriet M. Support Vector Machine. Computer Science 2002, 1(4): 1-28.

Appendix

Note

i. RGB

The amount of each RGB component refers to its brightness and is represented by an integer. Typically, each component has 256 brightness levels, numbered 0, 1, 2, ... up to 255. Combining 256 levels of each of the three components gives about 16.78 million colors, that is, 256 × 256 × 256 = 16777216. In MATLAB, an RGB color image is represented as a three-dimensional M×N×3 matrix, in which each pixel at a given spatial position has three components: red, green and blue. An image can be read with imread. The resulting matrix appears in the workspace, although a 3D matrix cannot be displayed directly. The 3D array has three planes, corresponding in turn to red, green and blue, and the data in each plane are the intensity values of that color. If the resulting matrix X has size (256, 256, 3), then X(:,:,1) is the 2-dimensional matrix of the red component, X(:,:,2) that of the green component, and X(:,:,3) that of the blue component. Each entry takes a value from 0 to 255 (2^8 levels, 1 byte), and the index of the third dimension runs from 1 to 3.

ii. HSV

The values in a histogram are statistical: they describe the quantitative characteristics of the colors in an image and reflect the statistical distribution and dominant colors of the image. A histogram contains only the frequency of each color value in the image and loses the spatial position of each pixel. Any image has exactly one histogram, but different images may have the same color distribution and therefore the same histogram, so the relationship between histograms and images is one-to-many. If an image is divided into several sub-regions, the sum of the histograms of all sub-regions equals the histogram of the whole image. Finally, when the background color and the foreground object of an image differ markedly, the histogram is bimodal; this is not the case when the background and foreground colors are similar.
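Two of these properties, additivity over sub-regions and the one-to-many relationship between histograms and images, can be checked on a toy example. The Python sketch below uses an invented 2×4 single-channel image; the helper name `histogram` is hypothetical.

```python
from collections import Counter

def histogram(block):
    """Count how often each value occurs in a 2-D block (list of rows)."""
    return Counter(value for row in block for value in row)

# Hypothetical 2x4 single-channel image.
image = [[0, 0, 255, 255],
         [0, 128, 128, 255]]

# Split into a left and a right sub-region.
left = [row[:2] for row in image]
right = [row[2:] for row in image]

full_hist = histogram(image)
summed_hist = histogram(left) + histogram(right)

# A different image with the same pixel values in other positions.
rearranged = [[255, 0, 128, 0],
              [255, 255, 0, 128]]
same_histogram = histogram(rearranged) == full_hist
```

Here `summed_hist` equals `full_hist`, and `rearranged` shares that histogram even though it is a different image, which is exactly the loss of spatial information described above.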

iii. SVM

The Support Vector Machine (SVM) was first proposed by Cortes and Vapnik as a supervised pattern recognition method. Its main idea is to establish a classification decision surface. An SVM uses a kernel function to map the data into a high-dimensional space in which they are as close to linearly separable as possible. Commonly used kernel functions include the linear kernel, polynomial kernel, radial basis function (RBF) kernel, Fourier kernel, spline kernel and sigmoid kernel. Comparing the data characteristics that suit each kernel, the RBF kernel shows good classification performance whether the sample features are high- or low-dimensional and whether the data set is large or small. RBF is therefore selected as the kernel function of the SVM.[5]

The SVM processes the data as follows. Let the feature data be N-dimensional, with L sets of data in total, namely

The decision plane can be expressed as

In the formula, ϖ is the weight coefficient of the decision surface, g(x) is the nonlinear mapping function, and b is the threshold.

To minimize structural risk, the optimal classification hyperplane should satisfy the following condition:

Nonnegative slack variables ξi are introduced so that the classification error stays within a specified range. The optimization problem is therefore transformed into:

In the formula, c is the penalty factor, which controls the complexity and generalization ability of the model.

By introducing Lagrange multipliers, the optimization problem is transformed into its dual form.

Here, the RBF kernel function is introduced:

In the formula, g is the kernel function parameter, which controls the range of the input space. The above optimization problem becomes:

It can be seen that the optimization problem depends on two important parameters c and g, which will affect the prediction performance of SVM.
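For reference, the quantities described in this section can be written out in the usual soft-margin form. This is a sketch in standard textbook notation under stated assumptions: the binary labels y_i ∈ {−1, +1} and the multipliers α_i are symbols not named in the text, and a four-class problem such as this one is normally built from several such binary classifiers. Note that g(·) with an argument is the mapping function, while the bare scalar g is the kernel parameter.

```latex
% Training data: L samples of dimension N (binary labels are an assumption)
\[ \{(x_i, y_i)\}_{i=1}^{L}, \qquad x_i \in \mathbb{R}^{N}, \qquad y_i \in \{-1, +1\} \]

% Decision surface with weight \varpi, mapping g(x) and threshold b
\[ f(x) = \varpi^{\top} g(x) + b \]

% Condition on the optimal separating hyperplane
\[ y_i \left( \varpi^{\top} g(x_i) + b \right) \ge 1, \qquad i = 1, \dots, L \]

% Soft-margin problem with slack variables \xi_i and penalty factor c
\[ \min_{\varpi,\, b,\, \xi} \; \frac{1}{2} \lVert \varpi \rVert^{2} + c \sum_{i=1}^{L} \xi_i
   \quad \text{s.t.} \quad y_i \left( \varpi^{\top} g(x_i) + b \right) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \]

% RBF kernel with parameter g
\[ K(x_i, x_j) = g(x_i)^{\top} g(x_j) = \exp\!\left( -g \, \lVert x_i - x_j \rVert^{2} \right) \]

% Dual problem expressed through the kernel
\[ \max_{\alpha} \; \sum_{i=1}^{L} \alpha_i
   - \frac{1}{2} \sum_{i=1}^{L} \sum_{j=1}^{L} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
   \quad \text{s.t.} \quad \sum_{i=1}^{L} \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le c \]
```

In this form the only free quantities are c, which bounds the multipliers, and g, which sets the width of the kernel.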

In conclusion, the SVM prediction process is as follows:

(1) Input the data: the training input, training output, prediction input and prediction output are specified.

(2) Normalize the data to accelerate convergence.

(3) Optimize the parameters by grid search.

(4) Obtain the optimal parameters and establish the prediction model. To avoid over-learning and under-learning, the highest classification accuracy on the validation set is used as the fitness function for parameter optimization; the c and g that achieve it are the best parameters.

(5) Input the prediction data.

(6) Obtain the classification result.

iv. LOESS

The locally weighted linear regression algorithm is a non-parametric method.
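In the usual formulation of locally weighted linear regression, the weighted squared error J(θ) and the bell-shaped weight take the following form. This is a sketch in standard notation, assuming a Gaussian-shaped weight with bandwidth k; y^{(i)}, the response measured at sample point x^{(i)}, is a symbol not named in the text.

```latex
% Weighted squared error minimized for each prediction point x
\[ J(\theta) = \sum_{i=1}^{m} w^{(i)} \left( y^{(i)} - \theta^{\top} x^{(i)} \right)^{2} \]

% Bell-shaped weight of sample point x^{(i)} with bandwidth k
\[ w^{(i)} = \exp\!\left( - \frac{\left( x^{(i)} - x \right)^{2}}{2 k^{2}} \right) \]
```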

The formula above is read as follows: x is a prediction point and x(i) is a sample point. The closer a sample point is to the prediction point, the larger its weight, that is, its contribution to the error; the farther away it is, the smaller its weight. In our code, the sample points themselves are taken as the prediction points. Here k is the bandwidth parameter, which controls the width of the bell-shaped weight function w, much like the standard deviation of a Gaussian function.

Algorithm idea: suppose the prediction point is the i-th sample point (of m sample points in total). Traverse sample points 1 to m and calculate the distance between each sample point and the prediction point, which gives the weight of each sample's contribution to the error. The weight w is thus a vector with m elements, which is written as a diagonal matrix and substituted into the formula for J(θ) above.

Using the least squares method, a θ vector can then be calculated; each prediction point has its own θ vector.
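For a single input variable such as protein size, the procedure reduces to solving a 2×2 weighted least-squares problem at every prediction point. The Python sketch below assumes a Gaussian-shaped weight with bandwidth k and uses invented size and activity values; the function name `lwlr_predict` is hypothetical, and the real data are those of Tables 3 and 4.

```python
import math

def lwlr_predict(x_query, xs, ys, k=1.0):
    """Locally weighted linear regression prediction at x_query (1-D input)."""
    # Bell-shaped weights: sample points near x_query count the most.
    weights = [math.exp(-((x - x_query) ** 2) / (2.0 * k * k)) for x in xs]
    sw = sum(weights)
    swx = sum(w * x for w, x in zip(weights, xs))
    swy = sum(w * y for w, y in zip(weights, ys))
    swxx = sum(w * x * x for w, x in zip(weights, xs))
    swxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    # Weighted normal equations for theta = (intercept, slope).
    det = sw * swxx - swx * swx
    if abs(det) < 1e-12:
        return swy / sw  # degenerate case: fall back to the weighted mean
    slope = (sw * swxy - swx * swy) / det
    intercept = (swy - slope * swx) / sw
    return intercept + slope * x_query

# Hypothetical protein sizes and activity metrics.
sizes = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
activities = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]

# As in the text, the sample points themselves serve as prediction points.
fitted_curve = [lwlr_predict(x, sizes, activities, k=1.0) for x in sizes]
```

A small k follows the data closely and a large k approaches an ordinary straight-line fit, so k plays the same role as the smoothing span in LOESS.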

Data/Code

Part1:

Pictures

Test Pictures

Color characterization

Texture characterization

SVM


Part2:

Table 1: The relationship between Dox-induced dose and expression of transcription factors and functional proteins


Table 2: The relationship between Tamoxifen -induced dose and expression of transcription factors and functional proteins



Part3:


Protein activity and size association


Table 3: protein size and activity association of 2A peptide

Table 4: protein size and activity association of flexible glycine chain binding proteins