{
"header": "Contents
- 1 MODEL
- 1.1 Overview
- 1.2 Question restatement
- 1.3 Problem analysis
- 1.4 Model assumptions
- 1.5 Symbol description
- 1.6 Model building
- 1.7 Final Dataset
- 1.8 NGA2_final.py
- 1.9 PYFinal.py
- 1.10 Model solution
- 1.11 Analysis and test of model results
- 1.12
- 1.13 References
MODEL
", "content": "Overview
\nBacterial flavin-dependent monooxygenase (bFMO) is the most important enzyme in bio-indigo production. After validating this part, we wanted to find out how to maximize production by optimizing the culture conditions.
\nMany variables may influence production, and we do not know in advance whether they interact. Finding the best condition by experiment alone would mean testing a huge number of combinations, and we have neither the time nor the equipment to do so. Most teams face the same problem, so we decided to find the best condition by modeling.
\nWe built the model using single-factor experiments, response surface analysis, neural network modeling and genetic algorithm iteration. We also built a model that simplifies our indigo quantification workflow by eliminating the effect of ultrasonication. The model was validated by experiment.
\nQuestion restatement
\nFor strains of Escherichia coli BL21(DE3), different experimental groups were set up with time, temperature and tryptophan concentration as the variables. We explored how these experimental conditions influence the yield of indigo, the target product, in order to optimize them.
\nProblem analysis
\nThe variables include time, temperature, tryptophan concentration and strain volume. Because microbial metabolism consists of enzyme-catalyzed reactions, the approximate range of each experimental variable can be determined from the optimal-condition curves of the enzymes.
\nIn the usual workflow, indigo is measured after the cells have been disrupted by ultrasonication, and the product concentration is determined from the lysate. Ultrasound accelerates dissolution and is suitable for extracting active components, but prolonged sonication ruptures cells in large numbers, so cellular slime and insoluble material end up as impurities in the extract. Ultrasound can also damage the structure of indigo and lower the extraction efficiency. Given our experimental conditions, we therefore began by measuring the absorbance of the cells directly, without lysis, to determine the product concentration. The data before and after lysis were then compared, and the relationship between them was obtained by regression fitting.
\n- Control time as the only variable. Using a dynamic differential equation and the experimental data, all coefficients of the equation can be fitted to obtain the time-gradient model and to find the relationship between substrate concentration and the equation coefficients.
\n- With time held fixed, the reaction is too complicated to be described by a simple single-variable equation. We therefore used 59 sets of data relating tryptophan concentration, strain volume and temperature to product concentration as the training set, and built a regression model with a neural network, which handles nonlinear problems well. Six sets of data that were not used in training served as the validation set. A genetic algorithm was then used to find the optimal values of the three variables, which were verified by experiment.
Model assumptions
\nAssumption 1: all experimental data are reliable after cleaning and screening.
\nAssumption 2: the change in the data before and after ultrasonication is mainly due to differences in bacterial mass and solubility.
\nSymbol description
\nSymbols | \nMeaning | \n
---|---|
<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"40px\" /> | \nTemperature (℃) | \n
<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"40px\" /> | \nTime | \n
<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"40px\" /> | \nInput of the k-th sample (variable) | \n
<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"40px\" /> | \nPredicted output of the k-th sample | \n
<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"40px\" /> | \nTryptophan concentration (g/L) | \n
<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"40px\" /> | \nStrain volume (mL) | \n
<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"40px\" /> | \nWeight of the j-th node in the i-th layer of the network | \n
<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"40px\" /> | \nOutput of the k-th sample | \n
<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"40px\" /> | \nInput to the i-th node | \n
<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"40px\" /> | \nAverage of the prediction results of the i-th modeling run | \n
<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"40px\" /> | \nMathematical expectation of the sample | \n
Model building
\n- Univariate analysis
\n- Response surface optimization
\nOrthogonal design is a simple method of experimental design, expressed as an orthogonal array. Each column corresponds to one influencing factor under study, and the number of rows is chosen according to the experimental needs. Each row combines particular levels of the factors into one independent experiment. An orthogonal experiment selects a few representative runs spread evenly over all possible experimental schemes, and statistical analysis of these few runs yields a better scheme.
\nOrthogonal design is especially advantageous when factors interact. Its key advantage is that it keeps the number of experiments to a minimum while still providing as much experimental information as possible.
\nResponse surface methodology (RSM) is an optimization approach that combines mathematical and statistical methods and has been widely used in recent years. RSM is often used to locate optimal values, and it also describes the relationship between experimental variables and responses through a fitted model. Although the influence of the factors on the response cannot be expressed completely by a mathematical model, it can be approximated mathematically, and the conditions can then be optimized on the response surface described by the model.
\nRSM is an efficient and systematic tool that can shorten the time required for tasks such as studying drug dose composition and improving extraction parameters. It is also often used to screen factors in order to determine which ones affect the response most. When curvature in the response surface needs to be investigated, the most common experimental designs are the central composite design (CCD) and the Box-Behnken design. When the factor values change from low level to high level, the Box-Behnken model can reveal the relationship between dependent and independent variables better than other optimization designs.
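To make the Box-Behnken layout concrete, the following is a minimal sketch of the textbook three-factor design in coded levels (-1, 0, +1). It is not necessarily the set of runs the team performed; the function names and the choice of three center points are assumptions made for illustration, and the real-unit ranges are the minimum and maximum values listed in Table 1.

```python
from itertools import combinations, product


def box_behnken_3(n_center=3):
    # Coded runs of the three-factor Box-Behnken design.
    runs = []
    # For every pair of factors, take the 2x2 factorial at -1/+1
    # while the remaining factor stays at its center level 0.
    for i, j in combinations(range(3), 2):
        for a, b in product((-1, 1), repeat=2):
            run = [0, 0, 0]
            run[i], run[j] = a, b
            runs.append(tuple(run))
    # Replicated center points (their number is a free choice).
    runs.extend([(0, 0, 0)] * n_center)
    return runs


def decode(run, ranges):
    # Map coded levels to real units: -1 -> min, 0 -> midpoint, +1 -> max.
    return tuple(lo + (c + 1) * (hi - lo) / 2 for c, (lo, hi) in zip(run, ranges))


# Ranges from Table 1: temperature (deg C), tryptophan (g/L), strain volume (mL).
TABLE1_RANGES = [(27, 38), (0, 15), (3, 50)]

design = box_behnken_3()
real_runs = [decode(r, TABLE1_RANGES) for r in design]
print(len(design), 'runs; first run in real units:', real_runs[0])
```

Every non-center run varies exactly two factors, which is why the design never tests all three factors at their extremes simultaneously.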
\nIn this project, the level ranges of culture temperature, tryptophan concentration and strain volume were chosen according to the results of the single-factor experiments, and a three-factor response surface test was designed and analyzed with the Design-Expert 11.1.2.0 statistical analysis software. The factor levels are shown in Table 1 and the test results in Table 2.
\nLevel | A: culture temperature (℃) | B: tryptophan concentration (g/L) | C: strain volume (mL) |
---|---|---|---|
Min | 27 | 0 | 3 |
Max | 38 | 15 | 50 |
Table 1 Factors and levels in response surface analysis
\nExperiment number | A | B | C | Indigo |
---|---|---|---|---|
1 | 37 | 50 | 0 | 20.952612 |
2 | 37 | 50 | 3 | 29.190765 |
3 | 37 | 50 | 6 | 25.472661 |
4 | 37 | 50 | 9 | 21.46294 |
5 | 37 | 50 | 12 | 19.786148 |
6 | 37 | 50 | 15 | 19.859052 |
7 | 37 | 3 | 0 | 16.5054678 |
8 | 37 | 3 | 0.5 | 26.2746051 |
9 | 37 | 3 | 1 | 23.07897934 |
Table 2 Box-Behnken design of the response surface optimization test (excerpt). The full data set contains 65 combinations.
\nNeural networks
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\nThe BP (back-propagation) algorithm, proposed by Rumelhart and McClelland in 1986, trains a multi-layer feedforward network by propagating the error backward: starting from the error at the output layer, it corrects the weights of the preceding layers step by step.
\nThe process of neural network modeling:
\nPredicted output:
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"50px\" />\nSum of errors for sample k:
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\nThe error function E depends on the weights. Given the learning rate, the adjustment of each weight is:
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\nThe input of the j-th node is:
\nThe output is:
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\nwhere f is the activation function.
\nAccording to the chain rule of differentiation, the weight adjustment can be expressed as:
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\nFrom the input-layer and error formulas, we have:
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />is the partial derivative of the output with respect to the input. In this work the logistic function is used, namely:
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\nThe final result is:
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\nThe core of a neural network is adjusting the weights according to the result of each iteration.
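The weight-update rule described above can be illustrated with a single logistic output node. This is a minimal sketch under stated assumptions: a squared-error loss E = (out - target)^2 / 2, a learning rate eta, and made-up weights and inputs. It is not the network used in the project, and all names are hypothetical.

```python
import math


def logistic(x):
    # Logistic activation f(x) = 1 / (1 + exp(-x)); its derivative is f(x) * (1 - f(x)).
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, inputs):
    # Node input is the weighted sum of the inputs; node output is f(input).
    net = sum(w * x for w, x in zip(weights, inputs))
    return logistic(net)


def loss(weights, inputs, target):
    # Squared-error loss for one sample.
    return 0.5 * (forward(weights, inputs) - target) ** 2


def update(weights, inputs, target, eta=0.5):
    # Chain rule: dE/dw_i = (out - target) * out * (1 - out) * x_i.
    out = forward(weights, inputs)
    delta = (out - target) * out * (1.0 - out)
    # Gradient descent step: w_i <- w_i - eta * dE/dw_i.
    return [w - eta * delta * x for w, x in zip(weights, inputs)]


# Illustrative numbers only.
w0 = [0.2, -0.4, 0.1]
x = [1.0, 0.5, -1.5]
y = 0.9
w1 = update(w0, x, y)
print('loss before:', loss(w0, x, y), 'loss after:', loss(w1, x, y))
```

One update step moves each weight against its error gradient, so the loss on this sample decreases; repeating the step over all samples is what the iteration in the text refers to.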
\n(1) Use the logistic function as the activation function.
\n(2) Normalize and standardize the data, i.e. rescale each variable to zero mean and a standard deviation of 1, which puts most values between -1 and 1. The original data have the form [T, V, C]:
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" /> <img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\nAfter standardization (with the StandardScaler class from Python's scikit-learn package):
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" /> <img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\nSome of these values lie outside the range -1 to 1, because standardization fixes the variance at 1 rather than bounding the values.
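The effect of standardization can be sketched with the standard library alone. The example below applies the same transformation as StandardScaler (subtract the column mean, divide by the population standard deviation) to a handful of temperatures that occur in the final dataset; the helper name `standardize` and the choice of these six values are assumptions for illustration.

```python
from statistics import fmean, pstdev


def standardize(column):
    # z = (value - mean) / population standard deviation (divides by N).
    mu = fmean(column)
    sigma = pstdev(column)
    return [(v - mu) / sigma for v in column]


temps = [37, 30, 34, 28, 26, 31]  # a few temperatures from the final dataset
z = standardize(temps)
print([round(v, 2) for v in z])
```

The result has mean 0 and standard deviation 1, and the extreme temperatures map to values beyond -1 and 1, which is the behavior noted in the text.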
\n(3) Network construction. Input layer: 27 sets of data as the training set. Hidden layer: 1 layer. Output layer: 27. Nodes: 5.
\nThe weights are initialized as decimals between -1 and 1.
\n- Genetic algorithm
\nA genetic algorithm simulates the heredity of natural organisms through selection, crossover and mutation. Through iteration it retains good individuals and eliminates inferior ones, until the model converges to an optimum.
\nA genetic algorithm consists of the following elements:
\nEncoding: an object is abstracted, by some encoding mechanism, into a sequence of specified symbols. This mirrors biological inheritance, where chromosomes are strings of genes.
\nInitial population: a set of individuals is generated at random. This set is called the initial population, and the number of individuals in it is the population size.
\nFitness function: the genetic algorithm uses the value of the fitness function to evaluate the quality of an individual (solution). The larger the fitness value, the better the solution.
\nSelection: individuals with high fitness have a high probability of being passed on to the next generation; individuals with low fitness are less likely to be passed on.
\nCrossover: depending on the crossover probability, two paired chromosomes exchange parts of their genes with each other to form two new individuals.
\nMutation: with the mutation probability Pm, some gene values in an individual's coding string are replaced with other gene values to form a new individual.
\nThe genetic algorithm proceeds as follows:
\nThe initial population is generated and sorted by the evaluation function. If the evaluation function does not yet meet the stopping condition, the main loop is entered:
\n- Selection: the best individuals are passed on to the next generation; the rest die.
\n- Crossover: new individuals are generated through crossover.
\n- Mutation: randomly chosen individuals are mutated.
\n- Check the stopping condition again; if it is not met, repeat the steps above until it is.
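The cycle above can be sketched as a compact toy program. This is an illustrative version that is independent of NGA2_final.py: the fitness function (count of '1' bits), the parameter defaults and all names are assumptions, and the loop runs for a fixed number of generations instead of testing a stopping condition.

```python
import random


def run_ga(fitness, n_bits, pop_size=20, dead_frac=0.8, n_mut=10, generations=30, seed=0):
    rng = random.Random(seed)
    # Initial population: random bit strings stored as lists of '0'/'1'.
    pop = [[rng.choice('01') for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Crossover: refill the population with children of random parent pairs.
        while len(pop) < pop_size:
            father, mother = rng.sample(pop, 2)
            start = rng.randrange(n_bits)
            end = rng.randrange(start, n_bits) + 1
            pop.append(father[:start] + mother[start:end] + father[end:])
        # Mutation: copy a random individual and flip one random bit.
        for _ in range(n_mut):
            child = list(rng.choice(pop))
            pos = rng.randrange(n_bits)
            child[pos] = '1' if child[pos] == '0' else '0'
            pop.append(child)
        # Selection: sort by ascending fitness and let the worst dead_frac die.
        pop.sort(key=fitness)
        pop = pop[int(dead_frac * len(pop)):]
        history.append(fitness(pop[-1]))
    return pop[-1], history


def count_ones(chrom):
    # Toy fitness: number of '1' bits in the chromosome.
    return chrom.count('1')


best, history = run_ga(count_ones, n_bits=9)
print(''.join(best), history[-1])
```

Because selection always keeps the fittest individual, the best fitness recorded per generation never decreases, which is the shape expected of an optimization curve.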
In this work, temperature, tryptophan concentration and strain volume were binary-encoded. Temperature ranges over 31-38 ℃ and uses a 3-bit code; tryptophan concentration ranges up to 8 g/L and uses a 4-bit code (decoded in 0.5 g/L steps in PYFinal.py); strain volume takes the values 3, 5, 10 and 50 mL and uses a 2-bit code.
\nFor example, the chromosome ['0', '1', '1', '1', '1', '1', '1', '1', '1'] represents 34 ℃, a tryptophan concentration of 8 g/L and 50 mL of bacterial culture.
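The decoding of such a chromosome can be sketched as follows. The bit layout and the arithmetic follow the `func` method of PYFinal.py (3 bits temperature, 4 bits tryptophan, 2 bits volume); the function name `decode_chromosome` and the input check are additions made for this illustration.

```python
VOLUMES_ML = [3, 5, 10, 50]


def decode_chromosome(bits):
    # bits: nine characters, each '0' or '1'.
    if len(bits) != 9 or any(b not in ('0', '1') for b in bits):
        raise ValueError('expected nine characters, each 0 or 1')
    s = ''.join(bits)
    temperature = int(s[:3], 2) + 31          # 31..38 deg C
    tryptophan = int(s[3:7], 2) * 0.5 + 0.5   # 0.5..8 g/L in 0.5 g/L steps
    volume = VOLUMES_ML[int(s[7:], 2)]        # 3, 5, 10 or 50 mL
    return temperature, volume, tryptophan


print(decode_chromosome(['0', '1', '1', '1', '1', '1', '1', '1', '1']))  # (34, 50, 8.0)
```

With this layout the nine bits cover 8 x 16 x 4 = 512 candidate conditions, which is the search space the genetic algorithm explores.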
Final Dataset
\n<code> temp/℃\tvolume/mL\ttrp g/L\tindigo mg/L\n 37\t50\t0\t152.8660685\n 37\t50\t3\t213.4017821\n 37\t50\t6\t186.0803572\n 37\t50\t9\t156.616068\n 37\t50\t12\t144.2946411\n 37\t50\t15\t144.8303553\n 37\t3\t0\t120.1875\n 37\t3\t0.5\t191.9732143\n 37\t3\t1\t168.4910714\n 37\t3\t1.5\t199.2053571\n 37\t3\t2\t203.0446429\n 37\t3\t2.5\t189.3839285\n 37\t3\t3\t194.3839286\n 37\t5\t0.5\t90.04821426\n 37\t5\t1\t84.04821425\n 37\t5\t1.5\t92.56607143\n 37\t5\t2\t105.9053571\n 37\t5\t2.5\t101.6196428\n 37\t5\t3\t94.54821432\n 37\t5\t5\t114.4767857\n 37\t5\t15\t82.24107144\n 30\t5\t3\t255.9055804\n 34\t3\t3\t115.2232096\n 28\t10\t3.5\t56.52678571\n 26\t10\t3.5\t79.11607144\n 28\t10\t3.5\t58.22321429\n 26\t10\t3.5\t81.97321426\n 37\t5\t3\t96.10178569\n 30\t5\t3\t239.9844494\n 30\t5\t3\t239.9844494\n 27\t5\t3\t86.16964283\n 37\t5\t0\t99.11607146\n 37\t3\t0\t102.2410715\n 37\t3\t0.5\t175.0089286\n 37\t3\t1\t121.1696428\n 37\t3\t1.5\t110.8125\n 37\t3\t2\t147.3303571\n 37\t3\t3\t215.3660714\n 37\t5\t1.5\t81.61607142\n 37\t5\t2\t141.7946429\n 37\t5\t2.5\t143.6696429\n 37\t5\t3\t126.2589285\n 34\t3\t3\t163.0446429\n 34\t10\t3.5\t103.2232143\n 34\t10\t3.5\t101.9732143\n 30\t10\t3.5\t193.9375\n 31\t10\t0.5\t118.6696429\n 31\t10\t1\t121.7053571\n 31\t10\t3.5\t187.8660714\n 31\t10\t6\t120.3660714\n 31\t10\t8\t109.1160714\n 31\t10\t0.5\t102.6875\n 31\t10\t1\t97.95535716\n 31\t10\t3.5\t164.4732143\n 31\t10\t6\t117.5982143\n 31\t10\t8\t103.8482143\n 31\t3\t3.5\t120.7232143\n 31\t5\t3.5\t139.7410715\n 31\t50\t3.5\t235.9017857\n 31\t50\t2.5\t230.2053571\n 31\t50\t2.5\t222.2946429\n 31\t50\t2.5\t217.1339286\n 31\t50\t2.5\t241.0446429\n 31\t50\t2.5\t251.1160714\n</code>\n
NGA2_final.py
\n<code>import numpy.random as npr
from random import randint


class Individual:
    # One candidate solution: a fixed-length binary chromosome plus its fitness.
    def __init__(self, n):
        self._n = n
        self.eval = 0.0
        # Random n-bit string, left-padded with zeros to exactly n characters.
        self.chromsome = list(bin(randint(0, 2**n - 1))[2:].zfill(n))


class NGA:
    def __init__(self, popu, dimension, deadProb, mutationProb, maxIterTime, evalFunc):
        # Per-instance list, so that separate NGA objects do not share individuals.
        self.population = []
        for i in range(popu):
            oneInd = Individual(dimension)
            oneInd.eval = evalFunc(oneInd.chromsome)
            self.population.append(oneInd)
        self.deadProb = deadProb          # fraction of the population removed per generation
        self.mutationProb = mutationProb  # number of mutations performed per generation
        self.maxIterTime = maxIterTime    # number of generations
        self.evalFunc = evalFunc          # fitness function, larger is better
        self.popu = popu                  # population size restored by crossover
        self.dimension = dimension        # chromosome length in bits

    def dead(self):
        # Sort by ascending fitness and drop the worst deadProb fraction.
        self.population.sort(key=lambda x: x.eval)
        self.population = self.population[int(self.deadProb * len(self.population)):]

    # Crossover operation
    def crossover(self):
        fatherPos = npr.randint(0, len(self.population))
        motherPos = npr.randint(0, len(self.population))
        while motherPos == fatherPos:
            motherPos = npr.randint(0, len(self.population))
        father = self.population[fatherPos]
        mother = self.population[motherPos]
        startPos = npr.randint(self.dimension)        # starting position of the crossover
        jeneLength = npr.randint(self.dimension) + 1  # length of the exchanged segment
        endPos = min(startPos + jeneLength, self.dimension)
        son1 = Individual(self.dimension)
        son2 = Individual(self.dimension)
        # Each child copies one parent and takes the segment [startPos, endPos) from the other.
        son1.chromsome = (father.chromsome[:startPos] + mother.chromsome[startPos:endPos]
                          + father.chromsome[endPos:])
        son2.chromsome = (mother.chromsome[:startPos] + father.chromsome[startPos:endPos]
                          + mother.chromsome[endPos:])
        son1.eval = self.evalFunc(son1.chromsome)
        son2.eval = self.evalFunc(son2.chromsome)
        self.population.append(son1)
        self.population.append(son2)

    def mutation(self):
        # Copy a random individual and flip one randomly chosen bit.
        fatherpos = npr.randint(len(self.population))
        father = self.population[fatherpos]
        son = Individual(self.dimension)
        son.chromsome = father.chromsome.copy()
        mutationPos = npr.randint(self.dimension)  # position of the mutation
        if son.chromsome[mutationPos] == '0':
            son.chromsome[mutationPos] = '1'
        else:
            son.chromsome[mutationPos] = '0'
        son.eval = self.evalFunc(son.chromsome)
        self.population.append(son)

    def solve(self):
        # Run maxIterTime generations and record the best fitness after each one.
        i = 0
        self.result = []
        while i < self.maxIterTime:
            while len(self.population) < self.popu:
                self.crossover()
            for j in range(self.mutationProb):
                self.mutation()
            self.dead()
            i = i + 1
            self.result.append(self.population[-1].eval)
        return self.result

    def getAnswer(self):
        # Return the chromosome of the fittest individual.
        self.population.sort(key=lambda x: x.eval)
        return self.population[-1].chromsome</code>\n
PYFinal.py
\n<code># -*- coding: utf-8 -*-
import random

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

from NGA2_final import NGA  # assumes the GA listing above is saved as NGA2_final.py


class PYJ:
    arfa = 0.05  # significance level alpha

    def __init__(self, n):
        # data.txt: one sample per row; the first n columns are inputs, column n is the indigo yield.
        self.filenamec = 'data.txt'
        self.data0 = np.loadtxt(self.filenamec)
        self.X = self.data0[:, :n]
        self.y = self.data0[:, n]
        self.N = len(self.y)
        self.key = 0

    def Analyse(self):
        # Descriptive statistics, histogram, box plot and Shapiro-Wilk normality test.
        self.var = np.var(self.y)
        self.mean = np.mean(self.y)
        ptp = np.ptp(self.y)
        plt.hist(self.y)
        plt.title('hist of indigo concentration')
        plt.xlabel('indigo concentration')
        plt.ylabel('number')
        plt.show()
        plt.boxplot(self.y)
        plt.ylabel('indigo concentration')
        plt.show()
        print('variance is:', self.var)
        print('mean value is:', self.mean)
        print('range is:', ptp)
        res = scipy.stats.shapiro(self.y)
        print('Shapiro-Wilk W is:', res[0])
        print('p-value is:', res[1])

    def GetCMode(self, X):
        # Report the fit on the whole sample and plot true against predicted values.
        yhat = self.clf.predict(X)
        w = np.mean(abs(yhat - self.y) / self.y)
        print('average error:', w)
        print('whole sample error is:', yhat - self.y)
        print('whole sample loss is:', sum(np.square(yhat - self.y)) / 2)
        plt.plot(self.y, 'bo', label='true')
        plt.plot(yhat, 'r*', label='predict')
        plt.ylabel('indigo concentration')
        plt.legend()
        plt.show()

    def GetDMode(self):
        X = self.DealData(self.X)
        self.clf = MLPRegressor(solver='lbfgs', alpha=1e-6,
                                hidden_layer_sizes=(40, 20, 10), random_state=1)
        # Refit on fresh random 90/10 splits until the mean relative test error is at most 10%.
        for i in range(200):
            X_train, X_test, y_train, y_test = train_test_split(X, self.y, test_size=0.1)
            self.clf.fit(X_train, y_train)
            yhat = self.clf.predict(X_test)
            w = np.mean(abs(yhat - y_test) / y_test)
            self.w = w
            if w <= 0.1:
                print(w)
                break
        self.GetCMode(X)
        wucha = yhat - y_test  # residuals on the test set
        self.wuchamax = max(wucha)
        self.wuchamin = min(wucha)
        # Half-width of the (1 - alpha) confidence interval, used as the acceptance tolerance.
        # self.var is set by Analyse(), which must be called first.
        U = scipy.stats.norm.ppf(1 - self.arfa / 2)
        wucha = U * np.sqrt(self.var) / np.sqrt(self.N - 1)
        self.wucha = wucha
        num = 0
        self.y_test = y_test
        for i in range(len(yhat)):
            if abs(yhat[i] - y_test[i]) <= wucha:
                num += 1
        return [num, yhat]

    def DealData(self, x):
        # Standardize x with the mean and variance of the full input matrix.
        scaler = StandardScaler()
        scaler.fit(self.X)
        x = scaler.transform(x)
        return x

    def func(self, lis):
        # Fitness for the GA: decode a 9-bit chromosome and predict the indigo yield.
        Vol = [3, 5, 10, 50]
        t = int(''.join(lis[:3]), 2) + 31          # temperature, 31..38
        c = int(''.join(lis[3:7]), 2) * 0.5 + 0.5  # tryptophan, 0.5..8 in 0.5 steps
        v = Vol[int(''.join(lis[7:]), 2)]          # culture volume
        self.resx = np.array([[t, v, c]])
        self.x = self.DealData(self.resx)
        res = self.clf.predict(self.x)[0]
        return res

    def OptimistNGA(self):
        # Temperature: 31~38, 3 bits
        # Concentration: 0.5~8, 4 bits
        # Culture volume: 3, 5, 10, 50, 2 bits
        self.n = NGA(50, 9, 0.8, 50, 50, self.func)
        res = np.array(self.n.solve())
        bestx = self.n.getAnswer()
        self.func(bestx)
        print('best condition:', self.resx)
        print('best result:', res[-1])
        plt.xlabel('generation')
        plt.ylabel('Indigo')
        plt.title('optimize curve')
        plt.plot(res, 'k', label='result')
        plt.plot(res + self.wuchamax, 'b', label='max result')
        plt.plot(res + self.wuchamin, 'r', label='min result')
        randres = []
        for i in range(len(res)):
            ran = random.random() * (self.wuchamax - self.wuchamin) + self.wuchamin
            randres.append(res[i] + ran)
        plt.plot(randres, 'm', label='random result')
        plt.legend()
        plt.show()
        return [res, self.resx]

    def GetPredict(self, x):
        x = self.DealData(x)
        y = self.clf.predict(x)
        return y


p = PYJ(3)
p.Analyse()
s = p.GetDMode()
res0 = p.OptimistNGA()
print('Correct number of predictions:{}/{}'.format(s[0], len(s[1])))
print(s[1], p.y_test)</code>\n
Model solution
\n- The suitable value range of each variable was determined through the single-factor experiments:
\n Temperature: 27~38 ℃; Cell volume: 3~50 mL; Tryptophan concentration: 0~15 g/L
\n- The correlation between the product and each variable was assessed through response surface analysis. The plots of the fitted model and the descriptive statistics of the sample were also used to check whether the sample is normally distributed. This is detailed in the analysis of the results.
\n- Once the degree of correlation was established, the strongly correlated variables were passed on to neural network modeling:
\nAccording to the empirical formula for choosing the number of nodes:
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" /> \n
Here m is the number of input-layer nodes, n is the number of output-layer nodes, the number of nodes in the single hidden layer is determined from these two numbers, and a is an integer from 0 to 10. We selected a structure with 10 nodes in a single hidden layer and used the neural_network module of Python's scikit-learn package. StandardScaler was used to standardize the data (mostly into the range -1~1), and train_test_split was used to randomly divide the data into a 90% training set and a 10% test set. The reliability of the model was verified by repeating the accuracy test 150 times. The following table shows the result of one of the tests:
\nPredicted results | 248.76565404 | 103.89746256 | 90.63697776 | 115.43969243 | 118.10792376 | 181.94505283 |
---|---|---|---|---|---|---|
Actual results | 239.9844494 | 102.6875 | 84.04821425 | 114.4767857 | 144.8303553 | 187.8660714 |
Error (predicted - actual) | 8.78120464 | 1.20996256 | 6.58876351 | 0.96290673 | -26.72243154 | -5.92101857 |
In the program output, the average error is the mean relative error, the error is the residual (predicted value minus actual value), and the loss is the value of the loss function (half the sum of squared residuals). The larger the loss, the farther the predicted values deviate from the real values, that is, the worse the result.
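As a worked example, the three quantities can be computed for the six test points in the table above. The formulas mirror those in the `GetCMode` method of PYFinal.py; the variable names are chosen for this sketch.

```python
predicted = [248.76565404, 103.89746256, 90.63697776, 115.43969243, 118.10792376, 181.94505283]
actual = [239.9844494, 102.6875, 84.04821425, 114.4767857, 144.8303553, 187.8660714]

# Residual: predicted value minus actual value.
residuals = [p - a for p, a in zip(predicted, actual)]
# Mean relative error: average of |residual| / actual.
mean_relative_error = sum(abs(r) / a for r, a in zip(residuals, actual)) / len(actual)
# Loss: half the sum of squared residuals.
loss = sum(r * r for r in residuals) / 2

print('residuals:', [round(r, 2) for r in residuals])
print('mean relative error:', mean_relative_error)
print('loss:', loss)
```

For this split the mean relative error is just under 6%, and most of the loss comes from the single large residual in the fifth column.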
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"600px\" />\nThe model established by the neural network served as the evaluation function of the genetic algorithm. A population size of 100, a mortality rate of 80% and a mutation rate of 70% were selected for optimization. The results are as follows:
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\nBlue indicates that the error is too large.
\nThe model established by the neural network again served as the evaluation function of the genetic algorithm. This time a population size of 50, a mortality rate of 80% and 50 mutations per generation were selected for optimization. The results are as follows:
\nThe maximum indigo yield was obtained at 31 ℃, with 50 mL of bacterial culture and a tryptophan concentration of 2.5 g/L.
\nAnalysis and test of model results
\nTo check the first assumption, that all sample data are reasonable, real and reliable, the original sample was first analyzed briefly. The range of each of the three variables had been determined by the single-factor method. Expecting the results to follow a roughly normal distribution (as in a Galton board), we first drew the histogram and box plot and calculated the mean, variance and range, and then applied a normality test at significance level α = 0.05. The results are as follows:
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" /> <img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\nVariance: 2393.406280506604; mean: 137.40059618915254; range: 199.37879469
\nThe histogram and box plot look like a left-skewed normal distribution. The skew arises because, when the gradients were chosen, the single-factor results had already shown that some conditions do not favor bacterial growth, so those conditions were left out of the experiment.
\nThe normality test (Shapiro-Wilk, computed with scipy.stats.shapiro in PYFinal.py) gave the following results:
\nW statistic: 0.9382255673408508; p-value: 0.004948027431964874
\nThe p-value is below α = 0.05, so the test rejects the null hypothesis of normality at this level: strictly speaking, the sample deviates from a normal distribution, which is consistent with the skew seen in the histogram. The W statistic of 0.938 is nevertheless close to 1, so we treat the data as approximately normal in the modeling that follows.
\nRegression model establishment and variance analysis
\nWe established a polynomial regression model with the Design-Expert 11.1.2.0 statistical analysis software. Models of increasing order (linear, quadratic, cubic and quartic) were fitted in sequence, with the following results:
\nModel | P-value | R² | Significant |
---|---|---|---|
Linear | 0.0517 | 0.0716 | |
Quadratic | < 0.0001 | 0.3127 | |
Cubic | < 0.0001 | 0.7311 | * |
Quartic | 0.0513 | 0.7745 | |
Table 3 Variance analysis of the regression equation
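\nThe model has the form Y = a0 + a1·A + a2·B + a3·AB + a4·A² + a5·B² + ..., where higher-order and interaction terms are added as the order of the model increases.
\nR is the correlation coefficient, which indicates how strongly the predicted data correlate with the real data. R² is the coefficient of determination used to evaluate goodness of fit: R² = 0.7311 means that the model explains 73.11% of the variance in the measured results.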
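For reference, the coefficient of determination can be computed directly from measured and fitted values. The sketch below uses the usual definition R² = 1 - SS_res / SS_tot; the four data points are made-up illustration values, not project data.

```python
def r_squared(actual, predicted):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up illustration values.
measured = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]
print(r_squared(measured, fitted))
```

A model that reproduces every measurement gives R² = 1, while a model that always predicts the sample mean gives R² = 0, so 0.7311 sits between a useless and a perfect fit.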
\nThe cubic model (R² = 0.7311) was selected for analysis, with the following results:
\nSources of variation | Degrees of freedom | F | P-value | Significant |
---|---|---|---|---|
Regression model | 19 | 6.63 | < 0.0001 | * |
T | 1 | 18.35 | < 0.0001 | * |
V | 1 | 4.04 | 0.0504 | |
C | 1 | 2.62 | 0.1122 | |
TV | 1 | 26.9 | < 0.0001 | * |
TC | 1 | 32.51 | < 0.0001 | * |
VC | 1 | 21.61 | < 0.0001 | * |
T² | 1 | 30.69 | < 0.0001 | * |
V² | 1 | 1.85 | 0.1808 | |
C² | 1 | 4.32 | 0.0435 | * |
TVC | 1 | 21.54 | < 0.0001 | * |
T²V | 1 | 30.7 | < 0.0001 | * |
T²C | 1 | 29.99 | < 0.0001 | * |
TV² | 1 | 10.34 | 0.0024 | * |
TC² | 1 | 0.0993 | 0.7541 | |
V²C | 1 | 2.92 | 0.0941 | |
VC² | 1 | 0.4942 | 0.4857 | |
T³ | 1 | 6.1 | 0.0173 | * |
V³ | 1 | 1.01 | 0.3192 | |
C³ | 1 | 3.05 | 0.0876 | |
Table 4 Analysis of the regression equation
\nNote: * significant (p < 0.05)
\nA p-value below 0.05 indicates a significant effect; a larger p-value indicates that the effect of the term is not significant. According to the data, the effect of T is strongly significant and that of V is borderline (p = 0.0504). The linear effect of C alone is weak, but C² and the interaction terms involving C (TC, VC and TVC) are significant. The product concentration is therefore related to all three variables. Response surfaces are used to show more intuitively how the product concentration depends on each pair of variables (for each pair, the third variable is fixed at an intermediate value of its range):
\nVolume and temperature
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\nFIG. 4 Three-dimensional response surface of indigo against temperature (℃) and volume (mL), at a tryptophan concentration of 2.55 g/L
\nTemperature and concentration
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\nFIG. 5 Relationship between indigo, tryptophan concentration and temperature, at a volume of 50 mL
\nVolume and concentration
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\nFIG. 6 Relationship between indigo concentration, volume and tryptophan concentration, at a temperature of 33.5 ℃
\nThe most intuitive relationship is that between temperature and product: the curve closely resembles that of an enzymatic reaction, with an optimal temperature. How the product changes with volume also depends on the other variables, and there is no obvious single-factor relationship. Tryptophan concentration was negatively correlated with the product.
\nResponse surface analysis can be used to intuitively express the strength of correlation between product concentration and each variable, but the established regression equation results are not accurate (R=0.7311), and the approximate relationship between the data can only be analyzed from a qualitative perspective. Through the above analysis, we have determined that all three variables have a strong correlation with product concentration, which can be input into the neural network as three-dimensional variables for learning, and make up for the deficiency of regression equation with the good modeling ability of the neural network for nonlinear problems. According to the established model, the scatter diagram of the predicted value and the real value of each set of data is made, and then the residual of all data sets is combined (the deviation degree of red and blue points is shown in the image) :
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"300px\" />\nJudging from the samples, the predictions look good. Prediction quality is usually measured by the average relative error. However, because the training set is selected at random, the size of the relative error alone does not show whether a result is good or bad, and the average over the training samples is not necessarily the true mean. Assuming the samples follow a normal distribution with parameters (μ, σ), for a given confidence probability 1 - α (α = 0.05) the confidence interval of μ is:
\n<img class=\"img-fluid rounded shadow-lg\" src=\"\" width=\"600px\" />\nTherefore, a prediction is considered successful if it falls within the confidence interval and failed otherwise. After 150 random selections of the training set, each time using the remaining data as the test set, the accuracy rate was 90%, so we consider the model reliable. The following table shows the prediction results of one such run:
\nPredicted results | 248.76565404 | 103.89746256 | 90.63697776 | 115.43969243 | 118.10792376 | 181.94505283 |
---|---|---|---|---|---|---|
\nActual results | 239.9844494 | 102.6875 | 84.04821425 | 114.4767857 | 144.8303553 | 187.8660714 |
\nError | 8.78120464 | 1.20996256 | 6.58876351 | 0.96290673 | -26.72243154 | -5.92101857 |
Table 5 Evaluation of the prediction set results
\nThe entry highlighted in blue (predicted 118.10792376, actual 144.8303553) lies outside the confidence interval.
\nOver repeated predictions the accuracy is stable at 89%, so we consider the model reliable.
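The success criterion can be sketched as follows. This is a minimal sketch under stated assumptions: the interval uses the normal quantile (a t quantile would be the usual choice for small samples), the `demo_residuals` list is synthetic, and the acceptance band of ±10 is hypothetical and only illustrates the counting step. The predicted and actual values are copied from Table 5.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(samples, alpha=0.05):
    """Two-sided (1 - alpha) interval for the mean, normal approximation."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * stdev(samples) / sqrt(len(samples))
    centre = mean(samples)
    return centre - half_width, centre + half_width

def success_rate(values, interval):
    """Fraction of values that fall inside the closed interval."""
    low, high = interval
    return sum(1 for v in values if low <= v <= high) / len(values)

# Synthetic residuals, used only to show the interval calculation.
demo_residuals = [-2.0, -1.0, 0.0, 1.0, 2.0]
demo_interval = mean_confidence_interval(demo_residuals)

# Predicted and actual values copied from Table 5.
predicted = [248.76565404, 103.89746256, 90.63697776,
             115.43969243, 118.10792376, 181.94505283]
actual = [239.9844494, 102.6875, 84.04821425,
          114.4767857, 144.8303553, 187.8660714]
errors = [p - a for p, a in zip(predicted, actual)]

# Hypothetical acceptance band; the real bounds come from the training data.
band = (-10.0, 10.0)
rate = success_rate(errors, band)
```

With this hypothetical band, five of the six Table 5 predictions count as successful; the fifth entry, with an error of about -26.7, falls outside.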
\nBased on this model, the genetic algorithm gives the following optimum: temperature 31 ℃; bacterial liquid volume 50 mL; tryptophan concentration 2.5 g/L.
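A genetic-algorithm search over a surrogate model can be sketched as follows. This is a minimal sketch, not NGA2_final.py: the `surrogate` function is a toy stand-in chosen so that its maximum sits at the reported optimum (it is not our trained network), and the bounds, population size, tournament selection, blend crossover, and Gaussian mutation are all assumptions made for illustration.

```python
import random

# Hypothetical search bounds: temperature (deg C), volume (mL), tryptophan (g/L).
BOUNDS = [(27.0, 40.0), (10.0, 50.0), (0.1, 5.0)]

def surrogate(x):
    """Toy stand-in for the trained network (NOT our model)."""
    t, v, c = x
    return -(t - 31.0) ** 2 + 0.5 * v - 4.0 * (c - 2.5) ** 2

def clip(x):
    """Force every gene back inside its bounds."""
    return [min(max(g, lo), hi) for g, (lo, hi) in zip(x, BOUNDS)]

def tournament(pop, scores, rng, k=3):
    """Return the fittest of k randomly drawn individuals."""
    picks = rng.sample(range(len(pop)), k)
    return pop[max(picks, key=lambda i: scores[i])]

def crossover(a, b, rng):
    """Blend two parents with one random weight."""
    w = rng.random()
    return [w * ga + (1.0 - w) * gb for ga, gb in zip(a, b)]

def mutate(x, rng, rate=0.2, scale=0.05):
    """Add Gaussian noise to each gene with probability `rate`."""
    out = []
    for g, (lo, hi) in zip(x, BOUNDS):
        if rng.random() < rate:
            g += rng.gauss(0.0, scale * (hi - lo))
        out.append(g)
    return clip(out)

def genetic_search(fitness, pop_size=50, generations=150, elite=2, seed=0):
    """Maximize `fitness` inside BOUNDS and return the best individual."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(x) for x in pop]
        ranked = sorted(range(pop_size), key=lambda i: scores[i], reverse=True)
        new_pop = [pop[i][:] for i in ranked[:elite]]  # keep the elite unchanged
        while len(new_pop) < pop_size:
            parent_a = tournament(pop, scores, rng)
            parent_b = tournament(pop, scores, rng)
            new_pop.append(mutate(crossover(parent_a, parent_b, rng), rng))
        pop = new_pop
    return max(pop, key=fitness)

best = genetic_search(surrogate)
```

Clipping after mutation lets the search settle on a boundary value, which matters here because the reported optimal volume sits at the upper end of the tested range.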
\nThe optimal temperature of 31 ℃ lies within the optimal temperature range for bacterial growth, and response surface analysis gives a similar result. Likewise, within the range of our experiments, a larger bacterial liquid volume leads to a higher yield, and 50 mL is the maximum volume we can use. However, the single factor method and response surface analysis disagree on the optimal tryptophan concentration. Response surface analysis predicts the largest product at a tryptophan concentration of 0, which does not conform to experimental experience. We therefore took the result of the neural network genetic algorithm (NGA), 2.5 g/L, as the optimum for experimental testing.
\nShake flask number | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
\nIndigo yield under initial conditions | 230.2053571 | 222.2946429 | 217.1339286 | 241.0446429 | 251.1160714 |
\nError | -22.8927429 | -30.8034571 | -35.9641714 | -12.0534571 | -1.9820286 |
Table 6 Comparison before and after optimization by shake-flask fermentation
\nThe error of the data point marked in red (flask 3, -35.9641714) is large and just exceeds the confidence range obtained before. Because the other four groups of data are in good agreement, we attribute it to experimental error.
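The Error row of Table 6 can be cross-checked with a few lines of arithmetic. In this minimal sketch, subtracting each error from its yield gives the same value for every flask, 253.0981, which we read as the yield the errors are measured against; that reading is an assumption, since the table does not name it.

```python
# Values copied from Table 6.
yields = [230.2053571, 222.2946429, 217.1339286, 241.0446429, 251.1160714]
errors = [-22.8927429, -30.8034571, -35.9641714, -12.0534571, -1.9820286]

# Yield minus error recovers the value each flask is compared against.
references = [y - e for y, e in zip(yields, errors)]
reference = sum(references) / len(references)

# Relative error of each flask, in percent of the reference value.
relative_errors = [100.0 * e / reference for e in errors]

# 1-based number of the flask with the largest absolute error.
worst_flask = max(range(len(errors)), key=lambda i: abs(errors[i])) + 1
```

The largest deviation belongs to flask 3, which matches the outlier discussed above.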
\n\n
\n
References
\n[1] Ikeda M. Towards bacterial strains overproducing L-tryptophan and other aromatics by metabolic engineering[J]. Appl Microbiol Biotechnol, 2006, 69: 615-626.
\n[2] Marteijn RC, Jurrius O, Dhont J, et al. Optimization of a feed medium for fed-batch culture of insect cells using a genetic algorithm[J]. Biotechnol Bioeng, 2003, 81(3): 269-278.
\n}