M O D E L L I N G
The "Unreasonable Effectiveness of Mathematics in the Natural Sciences" is the title of a very
well-known article
published by nobel laureate Eugene Wigner in the 1960s.
Although this dictum is common reality in fields such as physics, many biologists still neglect the usefulness
of these rigorous methods.
This year, our interdisciplinary team has worked hard to change this impression and incorporated many
state-of-the-art methods from various scientific fields into the project.
We placed a strong emphasis on standardization, which emerged from our push for a meticulous quantitative approach to cyanobacterial research. In particular, our interest lay in determining the optimal growth parameters of our organism S. elongatus, as these differ greatly across the literature. The development of a state-of-the-art machine learning model allowed us to speed up this process considerably and guide us towards our ultimate goal. To extend our standardization efforts, we additionally implemented a light model to accurately predict the light intensities our cultures receive.
Furthermore, modelling played a crucial role both in the search for and design of suitable genome integration sites and in the construction of a synthetic terminator library based on an extensive biophysical model. Without these rigorous analytical methods our project would not have been feasible.
G R O W T H C U R V E
M O D E L
Growth Curve Model
Growth Curves
Synthetic biology was created by introducing engineering principles into the previously existing discipline of biology.
This brought numerous advantages, one of the most important being the standardization and characterization of the parts that larger biological systems are built of.
Only with this toolbox of modular, well-characterized parts were the current achievements of companies like Ginkgo Bioworks or of the teams in the iGEM competition made possible; the BioBrick standard is a great example.
Not only does this process yield standardized parts, it also allows one to critically question generally agreed-upon methodologies that might otherwise negatively influence the reproducibility or performance of experiments.
However, this standardization is not a completed achievement but an ongoing process.
While researching how to optimally grow our cyanobacteria, we noticed that this process still needs a lot of standardization.
This was when we decided to critically question the current state of the art, and this developed into a project in which we used expertise from chemistry, physics, mathematics and biology in addition to synthetic biology.
For example, in the literature the optical density of a culture is sometimes measured as the absorption at 730 nm (Ungerer et al., 2018) and sometimes at 750 nm (Russo et al., 2019).
Since many labs do not have a spectrometer that can measure absorption at 750 nm, we decided, after valuable input from James Golden, to measure OD at 730 nm.
Light intensity measurements
One of the first aspects that we found to be insufficiently or contradictorily documented was the light intensity the cultures need for optimal growth.
Moreover, the state-of-the-art measure of light intensity used when describing the growth of cyanobacteria is the einstein, a non-SI unit.
It describes the number of photons arriving per second on an area of one square metre, one einstein being one mole (6.022×10²³) of photons per m² per s.
Most of the time, when this unit is used in combination with photosynthetic organisms, not all photons are counted but only photosynthetically active photons with wavelengths between 400 and 700 nanometres.
Since the einstein is not an SI unit, there is no authoritative definition of how to use it, which opens up possibilities for introducing errors.
For example, the exact definition of photosynthetic photons may differ between research groups and is not explicitly stated in most publications.
To investigate whether there is a better unit to use, and if so which one and how it should be used, we carried out the analysis of light units described in the following foray:
As previously mentioned, the einstein describes the number of photosynthetically available photons (400-700 nm) per square metre per second and is not the SI unit for light intensity (luminous flux). The SI unit is the lumen, defined as candela multiplied by steradian. The candela is the unit of luminous intensity, and the steradian is the three-dimensional equivalent of a two-dimensional angle [Figure 2].
Candela, and therefore also lumen, do not simply add up the intensities of the different wavelengths; they weight them using the so-called luminosity function [Figure 1].
This function weights each wavelength according to how well it is perceived by the human eye, which has major applications in professional photography.
While this is very useful in photography, where the perception of the human eye matters, it is not useful for photosynthetic purposes, since photons of various wavelengths can be utilized in a similar way.
The einstein, despite not being an SI unit, currently seems to be the preferred unit for photosynthetic purposes, but as discussed above it is not precisely defined.
The einstein can also be expressed purely in SI units, as moles of photons of wavelengths between 400 and 700 nm per square metre per second.
In this form there are no ambiguities about the definition, and while it is more verbose to write out, we strongly believe it is much preferable.
We still use the einstein as laboratory jargon to communicate more efficiently on a daily basis, but in scientific publications the SI-unit version should be used at all times to avoid communication errors.
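To make the conversion explicit, here is a minimal sketch translating between the laboratory jargon and the SI-unit form (the function names are ours):

```python
# Conversion between the laboratory unit "einstein" and its SI-unit form.
# 1 E = 1 mol photons (400-700 nm) per m^2 per s; lab values are usually in µE.

AVOGADRO = 6.022e23  # photons per mole

def microeinstein_to_si(value_ue: float) -> float:
    """Convert µE (µmol photons m^-2 s^-1) to mol photons m^-2 s^-1."""
    return value_ue * 1e-6

def microeinstein_to_photons(value_ue: float) -> float:
    """Convert µE to the absolute photon flux in photons m^-2 s^-1."""
    return value_ue * 1e-6 * AVOGADRO

# Example: a typical growth light intensity of 1500 µE
print(microeinstein_to_si(1500))       # 1.5e-3 mol photons m^-2 s^-1
print(microeinstein_to_photons(1500))  # ~9.03e20 photons m^-2 s^-1
```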
Practical light measurements
In addition to the problems introduced by using a non-SI unit, the measurement process itself is not standardized.
The light intensity can be measured either at the lamp or at the cultures; it can be measured with the cultures inside the incubator, which yields lower light intensities due to their absorbance, or without the cultures at the position where they will later grow.
There are two significantly different devices for measuring light intensity: one with a planar sensor and one with a spherical one.
Since the planar sensor has less area and only measures light from one side, it yields lower light intensities than the spherical one.
Depending on the setup and how the planar sensor is used, it can also yield light intensities that are far too high (if pointed at the lamp) or too low (if pointed at, e.g., the wall).
The spherical device gives both more reproducible and more accurate light intensities.
There are empirical tables for converting values measured with the planar sensor into values of the spherical measurement and vice versa, but they should be used with great caution.
Again following valuable input from James Golden, we decided to use the spherical sensor to measure the light intensity at any given position.
The need for a light model
Even with the spherical light sensor, some difficulties remain: where, for example, should the cultures be placed to achieve a specific light intensity, and the light intensity has to be measured anew before every growth curve. To solve both of these problems we decided to build a light model that describes the light distribution in our incubator. With the help of this model we could enter the exact light intensity we wanted to grow the cultures at and obtain multiple possible positions where this light intensity could be achieved.
Light model
Our overarching goal in this year's project was standardization in cyanobacterial research.
To this end, we placed a strong focus on our growth curves and developed a comprehensive machine learning model with which we were able to systematically approximate the ideal growth parameters.
But as is generally known, a model can only be as good as the data you provide it with.
In order to make our data as accurate as possible, a detailed analysis of all optimization parameters
and their implementation was required.
While some of the parameters, such as CO2, were fairly easy to measure, we were particularly concerned about the standardization of light measurement.
Having studied this subject thoroughly, we concluded that our complex incubator setup should be measured with a spherical light meter.
This promised a high level of accuracy, but was associated with a significant amount of work, unfeasible given the number of experiments we ran this year.
Furthermore, this only allowed for discrete measurements of our setup and didn’t provide any
comprehensive overview.
To tackle this problem, a light model was desperately needed.
In the beginning we tried to address this with a simple grid approximation, but quickly realized that this did not meet our demands for accurate measurement and standardization.
So we looked for a better method and finally found it in numerical mathematics, in so-called splines. A spline is a special function defined piecewise by polynomials; the method is widely used in computer science and has been employed to model automobile and airplane bodies since the early 1960s (de Casteljau, 1963).
After a detailed review of the methods, we decided on so-called B-spline (basis spline) surfaces, which allow for excellent surface interpolation.
To use this method, a precise equidistant measurement of the incubator along its two axes was needed. We chose a relatively small spacing to obtain a high-resolution map of the light intensity. Each of these points was then carefully measured and recorded, to be later interpolated by the B-splines.
In general, an order-k B-spline is formed by joining several polynomial pieces of degree k-1 with some continuity at the breakpoints. A set of ascending breakpoints defines a so-called knot vector T = (t₀, t₁, ...), which determines the parametrization of the basis functions; the role of these knots is assumed by our coordinates along one axis. Given the knot vector T, the associated B-spline basis functions are constructed via the standard Cox-de Boor recursion:
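$$N_{i,1}(t) = \begin{cases} 1 & \text{if } t_i \le t < t_{i+1} \\ 0 & \text{otherwise} \end{cases}$$

$$N_{i,k}(t) = \frac{t - t_i}{t_{i+k-1} - t_i}\, N_{i,k-1}(t) + \frac{t_{i+k} - t}{t_{i+k} - t_{i+1}}\, N_{i+1,k-1}(t)$$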
Among other nice properties such as positivity, local support, partition of unity and continuity, these basis functions allow for fast and easy recursion (Hoschek et al., 1993). We used these simple B-spline curves to construct a surface representing the light intensity at each point of our incubator setup. This B-spline surface is a tensor-product surface defined by a topologically rectangular set of control points aij and two knot vectors U and V associated with the coordinates x and y (de Boor, 1980). The corresponding B-spline surface is simply given by:
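$$S(x, y) = \sum_{i=0}^{m} \sum_{j=0}^{n} a_{ij}\, N_{i,k}(x)\, N_{j,l}(y)$$

where the $N_{i,k}$ and $N_{j,l}$ are the basis functions over the knot vectors $U$ and $V$, respectively.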
To check the accuracy of our model, we generated predictions for various random positions and verified them with our measuring device, yielding an accuracy of ±17 µE. In addition to now being able to place each flask accurately, the continuity of this method enabled us to generate specific contour lines, which allowed us to position multiple flasks at the same light intensity.
Satisfied with these results, we used this model as the basis for the subsequent growth curves and continuously checked it for accuracy. Due to its versatility we hope that it will have a meaningful impact on future iGEM teams.
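What follows is a minimal sketch of this kind of surface interpolation, assuming the intensities were recorded on a rectangular grid; the grid spacing and values below are illustrative placeholders, and SciPy's RectBivariateSpline stands in for our own B-spline implementation:

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

# Equidistant measurement grid inside the incubator (cm); placeholder spacing.
x = np.arange(0, 50, 5.0)   # positions along the first axis
y = np.arange(0, 30, 5.0)   # positions along the second axis
# Measured light intensities in µE at each (x, y) grid point (placeholder data).
intensity = np.random.default_rng(0).uniform(800, 1900, size=(len(x), len(y)))

# Fit a cubic (degree 3, i.e. order 4) B-spline surface through the grid.
surface = RectBivariateSpline(x, y, intensity, kx=3, ky=3)

# Predict the light intensity at an arbitrary position between grid points.
print(surface.ev(12.5, 7.5))

# Contour lines of equal intensity can be extracted from a dense evaluation:
xx = np.linspace(x.min(), x.max(), 200)
yy = np.linspace(y.min(), y.max(), 200)
grid = surface(xx, yy)  # dense intensity map, e.g. for matplotlib's contour()
```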
Early Growth Curves
With the measurement question for light solved (for us), we started to record growth curves. Many of the publications that we used as templates employed specialized cultivation systems that were not at our disposal. With our chosen system of Erlenmeyer flasks in our incubator, there were many adjustable parameters that we encountered once we started. Many of these parameters are categorical variables, but some are numerical. We decided to run comparative growth curves for these parameters to determine which combination allows for the best possible doubling time.
Flask Geometry
This categorical variable was a major factor for us. Limited by the available space in the incubators, our first growth curves were designed to evaluate which flask volume would provide the best growing conditions. It turned out that small flasks with 50 ml capacity supported growth to a higher optical density; at the same time, however, those cultures tended to fade to a yellowish green, compared with the firm green tone of healthy S. elongatus UTEX 2973 cultures. Flasks with much higher capacities were tested too, revealing that a large flask capacity slowed down culture growth. As cyanobacteria grow on CO2 as their primary carbon source, we speculated this could be due to poorer gas exchange and lower light intensities towards the centre of the flask. From these experiments, we settled on a medium flask capacity of 250 ml.
While speculating about gas exchange, another geometrical flask variant came to mind: flasks with baffles. These promised high turbulence inside the flask, providing better nutrient and CO2 distribution within the liquid culture medium. However, we were concerned that too-high velocities would cause physical damage to our cyanobacteria. Nevertheless, we conducted the experiments. The results, visualized in Figure 5, illustrate the positive influence of baffled flasks on growth rates. Due to the limited availability of flasks with four baffles, we continued to use flasks with three baffles and a capacity of 250 ml. Although they did not deviate much from non-baffled flasks in our experiments, we were confident that baffles support better growth rates in the long run, as indicated by the smaller and therefore more CO2-restricted 100 ml flasks.
Lid Types
Next we had to figure out how to keep the culture safe from contamination while at the same time providing enough CO2 for the medium to support the rapid growth of S. elongatus UTEX 2973. We took several approaches. Closing the flask opening tightly with gas-permeable film under the sterile workbench seemed to us the optimal solution. At the same time we tested foam stoppers, rubber lids and transparent plastic lids (Figure 6). The rubber lid closes tightly, while the plastic lid is engineered to keep a small gap between glass and plastic, allowing air to circulate. In the end we were quite surprised that the plastic lids provided the conditions under which the cyanobacteria grew fastest. The plastic lids were the best option for us because they not only ensured the best growing conditions but also allowed for easy handling of the flasks during measurements.
Fill Volume
The fill volume had to be considered as well. Flask capacity and geometry contribute to this factor, but we found 1/5 of the flask's capacity to be the most practical fill volume. Although cultures with lower fill volumes grew better based on optical density, we were not comfortable with them because they mostly took on a yellowish tone and produced a lot of yellow foam on top when shaken in the incubator. We suspected the foam and the yellowish colour might be traced back to higher concentrations of cell fragments, since the turbulence appeared more violent at lower fill volumes. However, we never put that speculation to the test. In the future it might therefore be interesting to assess the relation between optical density and living cell number at lower versus higher fill volumes via fluorescence-activated cell sorting (FACS).
Culture Media
Being in contact with the cyano community, we soon realized that a culture medium is not simply a culture medium, even when one speaks of the same medium. This is owed to the fact that different laboratories use different protocols when preparing them. After gathering protocols, we decided on four promising ones and tested them (Figure 7). Of those four media, the one best supporting rapid growth was BGM, which was adopted as the main growth medium and replaced BG11. BGM conferred twice as fast growth within 14 h after inoculation to an optical density of around 10. During preparation, all media were buffered to a neutral pH of around 7. Measuring the pH after 840 min of growth, a lower pH could be linked to a lower growth rate and final optical density (Table 1). In which direction pH and growth affect each other could not be clarified.
Medium | pH value after 14 h of growth |
---|---|
BGM | 8.21 |
BG11 | 7.79 |
Medium A | 8.57 |
Medium B | 7.77 |
Growth Curves Development
Having resolved the first parameters of our growth curves, many more detailed adjustments had to be made and were updated throughout the experimental phase. As we were aiming for doubling times under 2 h, we went back to the literature to look for hints on how to push cyanobacterial growth rate and reproducibility.
The cultivation method has been found to play a key role in rapid, reproducible cyanobacterial growth. A semi-continuous cultivation was proposed, avoiding limitations of nutrients such as light, CO2 and trace elements (Tillich et al., 2014).
The culture is diluted at least once a day to keep the optical density almost constant under reproducible conditions; as a result, the growth rate increases. For us this meant diluting our cultures twice from an exponentially growing preculture before inoculating the growth curve flasks.
Two inoculations were chosen as a compromise between a bearable amount of effort, as we aimed for inoculation every 8 h to keep the cultures in their exponential phase, and pushing the cyanobacterial cultures towards rapid reproduction. For a few growth curves, such as those evaluated under the proposed optimal conditions and counted via FACS, up to eight precultures were made, each inoculated from an exponentially growing preculture.
Typically, we timed inoculation so that the precultures were at an optical density of 0.6, which was considered exponential.
Another way of providing sufficient nutrients for rapid doubling times was the use of modified media. BGM, a modified version of BG11 (which is used to grow freshwater strains of cyanobacteria), was our base medium.
It contains phosphate and nitrate concentrations equal to MAD medium. During the last period of the project we used 5xBGM medium, providing even more nutrients to the cultures.
Contrary to what has been published before (Włodarczyk, Selão, Norling, & Nixon, 2019), our cultures grew better on BGM than on BG11 medium (see above).
When inoculating cultures and diluting them with fresh medium, we experienced that cooled medium delays growth. To counter this, we let the medium warm up to room temperature before inoculation.
The adaptation to high light intensities, for instance from 1000 µE to about 1800 µE when cultures were inoculated from plates grown at lower intensities, was another important factor. Before growth curves under high light conditions could be performed, the cultures were adapted to the desired intensities by increasing the intensity stepwise over the precultures.
During growth curve measurements, further precautions had to be taken. When sampling, volumes were kept below 0.25 ml per sample to minimize the reduction of culture volume over the experiment. To quantify the amount of water evaporating over a period of 36 h, equivalent to an extended growth curve experiment, we analysed the evaporated water mass. At the same time, opening the incubator while growth curves were running was kept to a minimum to exclude strong CO2 deviations.
Finally, to improve statistical rigour, two biological parallels, serving as biological replicates, were cultivated for every growth curve experiment with flasks. Of each biological replicate, two technical replicates were taken.
Growth Curves Model
Variables responsible for growth
As previously described, for categorical differences one can simply run growth curves with all levels of these categories (i.e. the different lids). This makes it possible to determine, at least for the chosen parameters, which level of each category allows for the fastest growth. In reality, all parameters influencing the growth conditions of the cultures are interlinked and change when other parameters are changed. For some parameters, however, the assumption that they do not change upon changing other parameters is probably a fair approximation, while drastically reducing the complexity of the investigated problem.

Some categorical variables (lid type, number of baffles in the flask) are probably largely uncorrelated with the heavily correlated parameters (light intensity, rpm, CO2, temperature), while others (total flask volume, fill volume, medium) are more or less correlated with them. While we think the no-correlation assumption is fair for the aforementioned categorical variables, for the fill volume of the flask we do not think it is a good approximation. This variable, which as a further approximation we chose to treat as categorical, has a big influence on the amounts of oxygen and carbon dioxide in the flask. However, since there were already a lot of variables to look at, and the fill volume is heavily correlated with the CO2 percentage that we investigate later, we chose to fix this parameter. This introduces a (small) error into our model, but it reduces the complexity, and the oxygen and carbon dioxide levels in the flask can still be adjusted via the carbon dioxide concentration in the incubator.

For numerical parameters (light intensity, rpm, CO2 %, temperature, fill volume) it would also be possible to measure a set of values for each variable and use the one that fits best, but it is also possible to model the combined effect of these parameters on the doubling time. We made the no-correlation assumption for the previously described categorical variables (lid type, flask geometry, fill volume) and developed, based on biological criteria, a measurement workflow for the other parameters (i.e. how many precultures are used). For the four remaining numerical parameters (temperature, carbon dioxide concentration, light intensity and shaker speed) we do think that they are heavily interlinked, and we decided to investigate them in conjunction with each other. We used the previously established growth curve protocol and collected data points varying these four parameters. Due to problems with the incubator and the associated time constraints, we were not able to collect as many data points as we would have liked.
Importance of a mathematical model for growth curve prediction
The data collected is displayed in the following table:
 | doubling time [min] | light intensity [µmol photons / (m²·s), 400-700 nm] | shaking speed [rpm] | CO2 [%] | temperature [°C] |
---|---|---|---|---|---|
0 | 89.145 | 1500 | 130 | 5 | 41 |
1 | 100.014 | 1000 | 220 | 5 | 41 |
2 | 99.171 | 1500 | 220 | 5 | 41 |
3 | 96.956 | 1800 | 220 | 5 | 41 |
4 | 118.375 | 1800 | 130 | 5 | 41 |
5 | 113.305 | 1000 | 220 | 5 | 38 |
6 | 117.254 | 1500 | 220 | 5 | 38 |
7 | 122.141 | 1800 | 220 | 5 | 38 |
8 | 77.047 | 1000 | 220 | 3 | 41 |
9 | 81.442 | 1500 | 220 | 3 | 41 |
10 | 104.293 | 1000 | 220 | 5 | 43 |
11 | 96.914 | 1500 | 220 | 5 | 43 |
12 | 97.678 | 1800 | 220 | 5 | 43 |
13 | 102.040 | 1800 | 220 | 7 | 41 |
14 | 110.560 | 1500 | 220 | 7 | 41 |
This data alone highlights the importance of measuring these parameters in conjunction with each other. Figure 8 shows the doubling times at three different light intensities for temperatures of either 38 °C or 41 °C. While the doubling times at the lower temperature are longer, the trend with light intensity is also reversed: at the high temperature, the higher the intensity the lower the doubling time, while at the low temperature the contrary is the case. This shows that these parameters are not independent of each other and should be investigated not in isolation but in conjunction with each other.
To investigate these parameters in conjunction with each other we decided to build a model that predicts the doubling time based on the investigated parameters.
Boundary behaviour
Something that is not part of our data are the boundaries that naturally exist for growth curves of cyanobacteria. These are partially given by the machines we are using (e.g. the maximal power of the lamps, the maximum rpm of our shaker) and partially by the constitution of the cyanobacteria (e.g. the maximal/minimal temperature at which they can grow). With the knowledge acquired while handling this cyanobacterium, we decided on the following cutoffs:
Parameter | Value |
---|---|
min light [µmol photons / (m²·s), 400-700 nm] | 100
max light [µmol photons / (m²·s), 400-700 nm] | 3000
min shaking speed [rpm] | 30
max shaking speed [rpm] | 260/300
min temperature [°C] | 30
max temperature [°C] | 50
min CO2 [%] | 1
max CO2 [%] | 10/20
For light and temperature, and for the lower boundaries of CO2 and rpm, we used cutoffs at which we are convinced that no proper growth is possible. For the upper boundaries of CO2 (10 %) and rpm (260), we used the highest values possible with the hardware used. For these two boundaries we also tried increasing the values further, so as not to punish the maximal possible values too much while still incentivizing our model to avoid values near the boundaries. We added one data point for each of these boundaries and used the most common values found in our dataset for the remaining variables. As an example, the data points added for the low-light and low-rpm boundaries are shown in the following table:
 | doubling time [min] | light intensity [µmol photons / (m²·s), 400-700 nm] | shaking speed [rpm] | CO2 [%] | temperature [°C] |
---|---|---|---|---|---|
min light | 1000 | 100 | 220 | 5 | 41 |
min rpm | 1000 | 1500 | 30 | 5 | 41 |
We added a very high doubling time instead of a doubling time of 0 to ensure that our model shows the correct behaviour in edge cases; for example, that the model predicts an increasing doubling time the hotter the temperature gets, instead of predicting very low doubling times at those edges because we fed it a doubling time of 0. When we entered this data into our model, however, the performance dropped drastically. We even experimented with different doubling times for this sub-dataset, but in all cases the performance of the model was worse than without adding the dataset in the first place. Again, due to the small amount of data in the original dataset, these 8 added data points have a huge effect on the model even outside the boundary cases. Due to this decrease in performance we decided not to use this dataset, but we remain convinced that with enough data it would increase the accuracy of the model, especially in boundary cases.
Modelling approach
Due to the small amount of data we were able to collect, we decided to use a polynomial regression model instead of a more data-demanding approach like k-nearest neighbours, support vector machines or neural networks. This regression model was built using scikit-learn (Pedregosa et al., 2011). Even with this approach, the amount of data at our disposal is not enough to deliver a model we would describe as accurate within, and especially not outside of, our training data. Nevertheless, we think a model like this is the best way forward if we want to properly predict the doubling time, and with more data a very accurate model can be built. We followed a common approach to polynomial regression in performing a linear regression on nonlinear functions of the data. This means we take the previously established variables (temperature, rpm, light intensity, CO2) and construct the polynomial features of the dataset. For two variables x1 and x2 and a polynomial of degree 2, this yields the feature vector [1, x1, x2, x1*x2, x1*x1, x2*x2]. This works because a linear model is only required to be linear in its parameters, not in the input variables it is built on. The code used to build this model is shown here:
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

degree_polynomial = 8
size_test = 1

data_model = pd.read_csv("data_model_clean_neu.csv")
data_prep = data_model.drop("Unnamed: 0", axis=1)

# Engineered data points encoding the boundary behaviour of the system.
# Format: doubling time, light, rpm, CO2, temperature.
low_light  = [1000,  100, 220,  5, 41]
high_light = [1000, 3000, 220,  5, 41]
low_rpm    = [1000, 1500,  30,  5, 41]
high_rpm   = [1000, 1500, 300,  5, 41]
low_co2    = [1000, 1500, 220,  1, 41]
high_co2   = [1000, 1500, 220, 20, 41]
low_temp   = [1000, 1500, 220,  5, 30]
high_temp  = [1000, 1500, 220,  5, 50]

boundary = pd.DataFrame([high_temp, low_temp, high_co2, low_co2,
                         high_rpm, low_rpm, high_light, low_light])
boundary.columns = ["doubling_time", "light_intensity", "rpm", "co2", "temp"]
# The combined dataset reduced performance (see "Boundary behaviour" above),
# so it is assembled here but not used for the fit.
result = pd.concat([boundary, data_prep])

# Split the data into features x and target y.
x = data_prep.drop(["doubling_time"], axis=1)
y = data_prep["doubling_time"]

# Very simple (and, with this little data, fragile) train/test split;
# with enough data, sklearn's train_test_split would be used instead.
x_train, x_test = x[:-size_test], x[-size_test:]
y_train, y_test = y[:-size_test], y[-size_test:]

# Polynomial feature expansion and the points we want to predict.
poly = PolynomialFeatures(degree=degree_polynomial)
to_predict = pd.DataFrame({
    "light_pred": [1388, 1541, 1750, 1850],
    "rpm_pred":   [147, 147, 147, 147],
    "co2_pred":   [3.8, 3.8, 3.8, 3.8],
    "temp_pred":  [40.5, 40.5, 40.5, 40.5],
})
to_predict_pol = poly.fit_transform(to_predict)

# The actual model: a pipeline of polynomial feature expansion followed by a
# regression that is linear in the polynomial coefficients. (The normalize
# option existed in the scikit-learn version used here; newer versions expect
# the features to be scaled beforehand.)
model = Pipeline([
    ("poly", PolynomialFeatures(degree=degree_polynomial)),
    ("linear", LinearRegression(fit_intercept=True, normalize=True)),
])
model = model.fit(x, y)

predictions = model.named_steps["linear"].predict(to_predict_pol)
score = model.score(x_test, y_test)

# Print the predictions together with score and diagnostic values.
to_predict["predictions"] = predictions
print(to_predict)
print(score)
print(predictions)

y_poly_pred = model.predict(x)
rmse = np.sqrt(mean_squared_error(y, y_poly_pred))
r2 = r2_score(y, y_poly_pred)
print(rmse, r2)
```
Again, due to the lack of data, the usual ways of benchmarking a model, such as train/test splits and cross-validation, are not reasonably possible. If there were more data, we would use LASSO regression, because this would allow us to eliminate variables that are not useful and avoid a high-variance mistake. To showcase how this model predicts new data using our existing dataset, we decided to predict and then measure three new growth curves at unsampled regions within the boundaries of our measurement data. We decided not to calculate the minima that our model predicts, but points inside the range of our existing data, to properly estimate how well this suboptimal model is working. The predictions of different model versions, differing only in the degree of the polynomial used, and the measured doubling times are shown in Figure 9.
As Figure 9 shows, the prediction quality of the model is poor. The degree of the polynomial influences the performance of the model, but no clear trend is visible. The data for polynomial degree 3 was excluded since its predictions were negative. The ranking of the doubling times is the same in all model predictions except for degree 1, with the model predicting smaller doubling times for the growth curves with higher light intensities. However, not only are the predicted doubling times significantly different from the measured ones, the measured ones are also ranked in a different order (1388 > 1850 > 1541 > 1750). In addition, the spread of values is larger in the predicted doubling times than in the measured ones. As expected, the model's performance is not good enough to produce quantitatively or even qualitatively correct predictions.
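As mentioned above, with more data LASSO regression would be a natural next step. A hypothetical sketch, simply swapping the linear step of the pipeline above for cross-validated LASSO (untested on our small dataset):

```python
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# LASSO shrinks uninformative polynomial coefficients to exactly zero, which
# both regularizes the model and reveals how high the polynomial degree
# effectively needs to be. Requires enough data for the internal 5-fold CV.
lasso_model = Pipeline([
    ("poly", PolynomialFeatures(degree=8)),
    ("lasso", LassoCV(cv=5, max_iter=100000)),
])
# lasso_model.fit(x_train, y_train)  # x_train/y_train as in the script above
```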
Summary and Outlook
During this investigation into how to grow UTEX 2973 optimally, we stumbled upon many things that we found insufficiently documented or standardized.
We investigated how to optimally measure light intensity and thought critically about the state-of-the-art light units.
We built a model of the light intensity in our incubator, making it possible to grow cultures at specific light intensities and easing the everyday work of the wetlab team.
After investigating at which wavelength to optimally measure the optical density of our cultures, we started to measure comparative growth curves and developed a reproducible growth curve protocol.
For all parameters affecting the growth curves of UTEX 2973, we critically questioned whether they could be approximated as independent of other parameters, and we decided to investigate temperature, shaking speed, carbon dioxide concentration and light intensity in conjunction with each other.
Since the investigation of four or more interdependent parameters and their effect on growth cannot be done exhaustively by hand, we built an easily extendable model that uses polynomial regression to predict the doubling time of arbitrary parameter combinations.
However, since measuring a single doubling time (or several) is a very time-consuming process, we did not manage to collect enough data to train a model that can accurately predict doubling times.
Beyond "just" supplying it with more data, more data would also enable further steps to increase the performance of the model.
In addition to a train/test split and cross-validation, which improve performance and decrease the bias of the model towards new data, LASSO regression could be used, making it easy to investigate how high the degree of the polynomial the model utilizes has to be.
The data we collected spans only a few different levels of each parameter.
While this made it much easier for humans to analyse the data, it drastically reduces the data's usefulness for the model.
If the data points were measured at more randomized values, with all data points differing along all dimensions of the input, the data would sample the given range much more evenly.
With data like that, a more robust model could be built thanks to the better sampling of the input space.
[Figure: a visual representation of the sampling along the rpm axis]
However, for many of the parameters we cannot do that within one measurement run, since the rpm, CO2 concentration and temperature have to be identical for all flasks in the incubator.
For the light intensity there could have been more sampling, which would have improved the performance of the model.
In addition, the doubling times we used were calculated by hand, with the data points for the calculations chosen manually, which can also introduce errors.
By automating that process, and perhaps predicting not only doubling times but the optical densities at different time points, this manual error could be avoided.
However, the automated calculation of doubling times can be troublesome for some suboptimal growth curves, since automatically delimiting the exponential phase is difficult.
If this problem were solved, all the manual work could be taken out of the process, further improving the model.
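As a sketch of such an automated calculation (assuming the exponential-phase samples have already been identified, which is exactly the hard part mentioned above), the doubling time follows from a linear fit of log2(OD) against time:

```python
import numpy as np

def doubling_time(times_min, od_values):
    """Estimate the doubling time [min] from exponential-phase OD measurements.

    Fits log2(OD) = a * t + b; during exponential growth the slope a is the
    number of doublings per minute, so the doubling time is 1 / a.
    """
    slope, _intercept = np.polyfit(np.asarray(times_min), np.log2(od_values), 1)
    return 1.0 / slope

# Example with synthetic data: OD doubling every 90 minutes.
t = np.array([0, 60, 120, 180, 240])
od = 0.1 * 2 ** (t / 90)
print(doubling_time(t, od))  # ~90.0
```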
References
Ungerer, J., Wendt, K. E., Hendry, J. I., Maranas, C. D., & Pakrasi, H. B. (2018). Comparative genomics reveals the molecular determinants of rapid growth of the cyanobacterium Synechococcus elongatus UTEX 2973. Proceedings of the National Academy of Sciences, 115(50), E11761-E11770.
Russo, D. A., Zedler, J. A. Z., Wittmann, D. N., Möllers, B., Singh, R. K., Batth, T. S., ... & Jensen, P. E. (2019). Expression and secretion of a lytic polysaccharide monooxygenase by a fast-growing cyanobacterium. Biotechnology for Biofuels, 12(1), 74.
de Casteljau, P. (1963). Surfaces à pôles. INPI.
Hoschek, J. & Lasser, D. (1993). Fundamentals of computer-aided geometric design. Wellesley, Mass: A.K. Peters.
R., J., & de Boor, C. (1980). A Practical Guide to Splines. Mathematics of Computation, 34(149), 325.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), 2825-2830.
A R T I F I C I A L N E U T R A L
I N T E G R A T I O N
S I T E O P T I O N S
Algorithm for identification of artificial Neutral integration Site options (aNSo)
As conventional neutral integration sites for cyanobacteria affect cellular fitness by knocking out existing genes (Dempwolff et al., 2012), we sought to find new integration sites that are truly independent of the genomic and cellular context. The identification of potential artificial Neutral integration Site options (aNSo) in the genome of S. elongatus UTEX 2973 is paramount for the integration of orthogonal circuits and metabolic pathways. To address this issue we developed a custom algorithm written in Python.
We achieved this by processing the GenBank (gbk) file containing all annotated genes and transcription start sites (TSS) of the S. elongatus UTEX 2973 genome. All lines containing the word "gene", along with the corresponding genomic location information given as indices on the plus strand, were parsed. These indices give the positions of the first and last base of each gene. Every intergenic region can therefore be described by the index of the last base of the upstream gene and the first base of the downstream gene, independent of which strand the genes are located on. These indices were stored in Python tuples.
Subsequently, all intergenic regions shorter than 500 bp were filtered out, leaving us with eligible sites. This was accomplished by calculating the difference between the start index of one gene and the end index of the previous, upstream gene, resulting in 56 potential aNSo.
All these potential aNSo were then translated from tuple form into sequences. To ensure efficient homologous recombination, sequences with a length of at least 2500 bp were required. The missing nucleotides that could not be covered by the intergenic region itself were filled in with sequence from the upstream and downstream genes. For this, a FASTA file containing the genomic sequence of S. elongatus UTEX 2973 (Yu et al., 2015) was read into the environment, and the potential intergenic sequences were extracted based on the indices plus/minus the missing nucleotides and stored in the tuple as well.
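A condensed sketch of these parsing and extraction steps (using Biopython instead of our line-based parsing; the file name is hypothetical, and sorted, non-overlapping gene annotations are assumed):

```python
from Bio import SeqIO

MIN_GAP = 500       # minimum intergenic length [bp]
TARGET_LEN = 2500   # minimum total length needed for homologous recombination

record = SeqIO.read("UTEX2973_genome.gbk", "genbank")  # hypothetical file name
genome = str(record.seq)

# Collect (start, end) plus-strand indices of all annotated genes, sorted.
genes = sorted(
    (int(f.location.start), int(f.location.end))
    for f in record.features if f.type == "gene"
)

candidates = []
for (_, prev_end), (next_start, _) in zip(genes, genes[1:]):
    gap = next_start - prev_end
    if gap < MIN_GAP:
        continue
    # Pad the intergenic region symmetrically into the flanking genes
    # until the sequence reaches the length required for recombination.
    pad = max(0, (TARGET_LEN - gap + 1) // 2)
    seq = genome[prev_end - pad : next_start + pad]
    candidates.append((prev_end, next_start, seq))

print(len(candidates), "potential aNSo")
```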
Subsequently, the number of potential aNSo was narrowed down by excluding all sequences containing BsmBI and BsaI restriction sites. This was accomplished by eliminating all entries whose sequences contained the substrings "CGTCTC" or "GAGACG" (BsmBI) or "GGTCTC" or "GAGACC" (BsaI). Only 19 of the 56 previously identified regions fulfilled these criteria.
The final step of the identification of aNSo consisted of eliminating all entries with a TSS in the intergenic region. Using as input the gbk file comprising all TSS identified in a transcriptomics study by Tan et al. (2018), the indices of the TSS in the genome were parsed into a list. Afterwards, a set was created containing all intergenic regions that harbour a TSS, and the tuple containing all potential aNSo was transformed into a set as well. By subtracting the set of all TSS-containing intergenic regions from the set of all potential aNSo, we obtained a set describing only intergenic regions free of BsmBI and BsaI restriction sites and of TSS. Of the previously identified 19 potential aNSo, 17 contained a TSS, leaving only two entries in the final set fulfilling all required criteria. To make this final set easily accessible, a CSV file and additionally a FASTA file were generated.
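The last two filters reduce to substring checks and a set subtraction. A sketch continuing from the candidate list above (the TSS positions are assumed to have been parsed into a list of integers beforehand):

```python
# Golden-Gate compatibility: drop any candidate containing a BsmBI or BsaI
# recognition site on either strand.
FORBIDDEN = ("CGTCTC", "GAGACG",   # BsmBI site and its reverse complement
             "GGTCTC", "GAGACC")   # BsaI site and its reverse complement

no_sites = [c for c in candidates
            if not any(site in c[2] for site in FORBIDDEN)]

# TSS filter: keep only intergenic regions that contain no transcription
# start site. tss_positions would be parsed from the Tan et al. (2018) gbk
# file analogously to the genes above.
tss_positions = []  # fill with TSS genome indices

with_tss = {(start, end, seq) for (start, end, seq) in no_sites
            if any(start <= tss <= end for tss in tss_positions)}
final_anso = set(no_sites) - with_tss
```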
The Python script, required input files as well as the generated results can be found in our Github Repository
aNSo_1 BioBrick parts (BBa_K3228000, BBa_K3228002)
Gene 1 start | 139038 |
---|---|
Gene 1 end | 139727 |
Gene 2 start | 140309 |
Gene 2 end | 140875 |
Intergenic region length | 582 |
Sequence of aNSo_1 5’ to 3’ |
TTCAAAATTTGGTGCGCTGGCAGGTCTGTGAACCGGAAACCGCGATCATGCTGGCGACCCTAGCACCTCTGCGGGCCTTGGGGGTGGATTGGTCGGATCCGCGTCTTCTCTATTTGTCCCGTCCCGTCTGTCAGCTGCTGCGCTGGCACCAGTCCGACACGGGAGAACTGACTTGGCAGAGGCTCTGCGAAAACGACGAATTACCGACTCCTACGTCGATCTAGGTCAGTCGGAATATTAGAATCGTCTGCGAAGATGCCGCCCTTGCCATGACAGCCCTCGACGACAAAACTATCGTTCGTGACTATTTCAACGCCACGGGCTTCGATCGCTGGAGCCGGATCTATGGCGATGGCGAGGTCAATTTCGTCCAGAAGAACATCCGCATTGGTCACCAGCGCACCGTCGACACCGTGCTGAGTTGGCTGGAAGCCGATGGCAATCTGAGCGATCGCAGCTTTTGTGATGCCGGCTGCGGTGTCGGCAGCCTCAGCTTACCCCTAGCACAGCGGGGGGCACAGGCCGTTTATGCCAGCGACATCTCCGCCAAGATGGTGGAAGAGGCTCGCGAGCGGGCCAGTCAGATCCCCAATTTGAACAACATTCAGCTCGAAGTTTCGGACCTTGCTTCTCTGAGCGGTCGCTACGACACCGTCATCTGTTTGGATGTGTTGATTCACTATCCAGAATCCGACGCGGCGGCCATGTTGAGCCATCTTTGCAGCTTAGCTGAGCAACGGGTTTTGGTGAGCTTCGCGCCCAAATCCCCTGTCTTGAATGTGCTCAAGCGCATTGGACAGTTCTTCCCGGGGGCCAGCAAAACGACCCGCGCATATCAGCACAGTGAAACCGCGATCGCAGCAGCCTTAGCGGCGAATGGCTTCCAAGTGCAACGTCGGGCCTTCAACAAAGCACCCTTCTATTTCTCACTTCTGCTCGAAGCTGTCCGAACTGCCTAATCAATTGTTGTTCGAGAGGTATCGCAGATTGAAGACTGAACTGGCATTTGCATTAATCAGCTGCAATCACCTCTCAGATTGACTAGACACTCAAGCATACTGAAGGTTTCAAACATCAGTAACAAGCAATAATTTTGAATTTCACAGCAACCTCAGGCGGTAGCATTGCTGCAATTAAATGGCATCTTTCGCCATACCATTCTCTACAGTTTAAGGATGTATTGTTAAATCTTTTTCTTGAGTATCGTGTATCTTCTGCATGGAATCGAATTAACTGATCAGCGATGCAAGCTGCTTCTTCTAAGAAGTAATTTTCTTGGCGTTCTTTCCGTTGTTGCTGCTTGAATATGGAAGGCCGATTATGAGGTGATTTAGGCCAAGAATTAAGTTTTTCCTTCAAGTTTTCTACTTCCCTGAGATGGCAATTAATTTTTTTGTTATCTTGGGCACGAAATAATAGGACTTGAGGATTAGGACAAGCAGTAACGGTTAAATGTGACTGCCCCCCTAAAATAGAGTATCTAGAAGAACTTTTCTTCCCCTGTTCTTTCCTAGAATCGGAGCCCGAGAGAAGAGGTGAACTACGTGGAGTAGGTAAAGTTGATCGTACCGGCAAAGACATCGAGATCAACTGCAGCTTGGCGGTTTTCTGGGGTATCTGCAGCACCACCAAGGAACCATAAAACATCTGCAGAGATACTGTAGTAGTCTTGGGTTCGTTGATAGATATCTGCAGCTTCAATTTTGGCAAGTTGACATTCACCAATAATTCGATAGCCCGTCGAGAAAACGACTGCAACATCTGCAATTCGACCATTCTTTCCAGCTTCTGAAATTGGATGTTCAATTTCAATAAACGCTTCTTGGGCATCAATCATCCCCTTATAAACTTCTTGAAAGTACTTACTGATTTCCAACTTTCCTTGCAAGTGCTCTGGAGATTCCGGATGATGTTCCATTACTGTGGTGCAAGGATGAGTATGAACAAAGTGCAATGAGGTATTTTGTCTCTTTCTGGGAAACATTAATGTTTGACAGAAAGGACAAAAAAGACTTCCTTTGGGAAAATTTTTTCTGATTTCAAGGACTGACTTAAAATCTGTCGCAAGGACTATGTTACCCTGTTGATCTTTTGCTTTGAAAGGCATGATCAAATCTATTCCTTTATTGATACTTCTCGTTTAGAGAGTCAGTATAGTCTTCTTGTAAATCCTGATCACTAGAAGTTGTTCCATGGCTTTTATCAATCCCCCCTAGTCCAGTCAACGTACCAAGAGTAATAGCCTATTTACGAGTTGGGGTCTGTTTTTGCTAAAGAAACACTGCAAAGTGCAGGATTTCATTGATCTCCTCTTCAGGTATTGTCTGGATCAGCTGATAGAGCTTTTCAGTAGCAGTCATAGATTGCAGCGCATAAGAGATCTATATTCTGAGCAATCTCGACGGATCAAGCGATTGAGCTATCGGCGGCGATGCTTGGGGGGATCGTGGCGATCGTAGAAATCGGGTGGATGGCGGCGTACCCATTTCAGAAAACGCTG |
aNSo_2 BioBrick parts (BBa_K3228001, BBa_K3228003)
Gene 1 start | 1744903 |
---|---|
Gene 1 end | 1745412 |
Gene 2 start | 1746009 |
Gene 2 end | 1746731 |
Intergenic region length | 597 |
Sequence of aNSo_2 5’ to 3’ |
GTTAGTGCCTGCAGCCAAGCCCTAGAACTCCAGCCCAGCGCCGCGCGGGCTCGATATTTGCGGGCCTTGGCTTACTGGCAATTGCATCAGCCGCAAGCCGCGATCGCTGATTTACGACAAGCCTGTGATGCCTTTGCACAAGCTGGAGCAACGGTCCAACTCGATCGAGCCCGTCAGCTTCTGCAACACTGGCAGCAACAGTCCAGCCTCGTCGCCCAGGCTCCTCGCCTACAATCCAAGAACTGGCCTGGAGCTGTAACCTATGCAATGGATTTGGCGAACTGCCACGATCGCAGTCCTCTTAACGAGTTGGAGTTCTGCTGCGATCGCGCATTCCAACAATGCTGATGTCAATCAGTGTCATCACGATCGTCGCACCGGCGAATATCACTGCCACTAGGCCTGACAATAGAGTCGTTTTGATCTTTGCTGATTAGCTTCAATGATGCTTCCGACCCTGAGCACCCTGAAAACAGCGGTGCTCCTGCTTCCTTTGGCAATTCCAACGGCTGCTCTTGCCCTACCTCAAACCGCTGTTTGGCGACTGGCTGATGCTCAAAATCATCAGCACCAGAATCATCAACATCAAAGCGGGGCTGGCCATTCCCATGGCAGCTTGGCGGTGCCAACAGGCACTCCACAACCGACTGTCAATTTAGTGGTTGAACGCGACCGCAAAAGTGGTTGGAATCTCCGGCTAACTACCACTAACTTCCAGTTTGCCCCCGAGGAACTTGACAAAACAAATCGAGTTGATTCCGGGCATGCCCATTTGTTCCTTAATGGGAAAAAGATTGCGAGACTTTACGGACCTTGGTATCACTTGGCTTCGCTCCCAGCCGGGAAGCAGACTCTCATGGTGGAATTGACCAGCAATCAACACAATGTAATTACGGTTAATGGTCAACCTGTCATTGCCAAAGTGACTGTAGACGTTCCAGCGATGAAGTAATTTTCATACTGAGCTACTACGGTAGCCTCTGCCTCTCTTCCAGCAAATGGGGAGAGGCCTTGACAACTAACAGTGTTCAATCGACAGATTTTCAGACCTTGAACGATCGGATCGTAATCCTACCTGAGCGATCGTAAAATCTGTCACGGCAAAGGATATAAATACACTTGAGTTAAAGGTTTAATTCTCAGTCGCTACAGTTGTTTTTTGATTGACTGAATGAAGGTCAAGGAATCAGTTTTAGCGATAGCTTTTCAGTATTAATAATAGTAACCTTCATGCATCGGCCGTAGCTGAAAATGCAAAATAATACTTTGACTATCGTAGGCCAATATCGAGTGACTTATTGCCTGCTCTTAGTCAATGGAATAAATAAAATGCCCATCAAGCTGTCAGTGCTGGCTCGAAGCGATCTGAATCTTGTCCTAGTAGGCTAGCAAGATAATCTCGATGAGAAAAGCGATCGCCCTTAAACCAGATTTTTTGACTTTCTTGATCAATCTATTGTCCAAAAAGACCTAGGTGCGATAATTATAAAAACTATAATTCACTCTAGGGATAGAAGCTTGGCTTTGCACTCTCGTCGTTGGCTATTGATGGTGCTCACAAGCTGCTTCGCGACTAGCCTGTTCGCTAGACCTGCAATCGCTGCTGATGGCTGGTGGATCGATCAGTATGCGGTCATTCTCTTTACTGCCACGGGACGGCTCGATGCAGAACTGAAAGAAATGCGCATCGAAGGAGCCGATACGCTGCTCGTCCATGCGGATAGCCTGCCCCCACTGCTGCTACGTTGGGTTGCTTGGCGTGCCTCTCTACAGAATATGAAGTCAGTCGCCTGGGTTCAGCGTCCCACTCTCCAGCGACTCAAACATGCTAGCTCTCTCAATGGCTATGCTGCGTTGCAAGTGGATGATCACTTTTTTGCTGATCCCATTGTGAGCTTCAGTCAGCTGCGCCAAATGATTGGCAAGAAGCAGCTTTGGTGCTCTTTTCAACCGAATCAATTTTCGGAGTTTCTAGCGCGGAATTGTGATCATGTGGATGTACAAATCTACCGAATGAGTTGCCCTGCCACAATCGATTTAGCCGATAGATTGGGGTTGCTAGGTCGTCCTCAATCTGCGATCGCGGTCTATCATGATGGCACCTCTCAAGCCGATCGCGATCTCCAATGCTTCCGTCAAGCAGGTCGCGATGTTCGTAATTCAATCTTTGTTTTCAAATGGAAGAATCCAGGATCTGTCTTGTCGCGATTTTTGAAGCATCCATTAGTAGCACGACTGGAACGGATATATATTCAGCTATTTAAGGACTAGCGCTGAACTATAATCGAGCGATCAAATTTTATTGTCATCACTAAATTCTTGTGCAATTTCCCTCAAAAATTGGTTGATTTGTTGAGGCGATCGCAAATGGTAGACTTTGCGGTTTGTTCGAGCTGTCTCAATATACTCTCGATATTGAGGTGTTAATCGCTGGTGGCAAAGCCAAAGAACGCGGTAGCTACTCATTGAGCTTTTAAATAAAGGACTGTCCTCAGGCCAGC |
T E R M I N A T O R
M O D E L
Terminator Model
Talking to numerous experts in the field of phototrophic research made clear the need for strong transcriptional termination in large genetic engineering projects.
In bacteria, two processes are responsible for proper transcript termination: intrinsic, Rho-independent terminators, generally low-energy RNA hairpins; and Rho-dependent terminators, which rely on the binding of the Rho protein.
The majority of bacteria have a homolog of the E. coli Rho protein, with a few exceptions such as our organism S. elongatus (de Hoon et al., 2005).
We therefore first concentrated on investigating the natural intrinsic terminators of our strain UTEX 2973. To do this, we took a closer look at how these intrinsic terminators function.
Rho-independent terminators typically consist of short (7-20 base pairs), mostly GC-rich hairpins. The loop structure is followed by a stretch of uracil residues. A protein bound to the RNA polymerase then binds to the stem loop tightly enough to cause the polymerase to temporarily stall. The pausing of the polymerase coincides with the transcription of the poly-uracil region. The weak adenine-uracil bonds then lower the destabilization energy of the RNA-DNA duplex, allowing it to unwind and dissociate from the RNA polymerase (Krebs et al., 2014).
It’s important to note that, especially in our organism S. elongatus, not all terminators cause
complete
termination. In some cases, these terminators are found in between ORFs inside the same operon and might
be involved in creating complex transcription structures. From here on, however, our analysis will be
mainly focused on the standard case.
Our first-stage objective was to find promising natural terminators. To achieve this goal, we applied several state-of-the-art bioinformatics tools to obtain a comprehensive overview of as many candidates as possible. The software tools we used were:
- ARNold, which in itself consists of two complementary programs: Erpin (Gautheret et al., 2001) and RNAmotif (Macke et al., 2001).
- TransTermHP (Kingsford et al., 2007)
- FindTerm (Solovyev et al., 2011)
Due to the design of these tools, the resulting list of 2113 sequences contained many false-positive and duplicate terminator candidates.
To analyze the data, we split it in two and ordered the sequences according to strand. The next step was to clear the list of possible duplicates. This was done by analyzing the intersection of the respective bp positions: if both the intersection and the symmetric difference of two separate terminator candidates were non-empty, we expanded the candidate's definition by the difference. To refine the selection, we later analyzed the secondary RNA structure via kinetic modelling.
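This duplicate handling is essentially the classic merging of overlapping intervals; a small sketch of the idea, with candidates given as (start, end) positions on one strand:

```python
def merge_overlapping(intervals):
    """Merge terminator candidates whose bp positions overlap.

    Two candidates with a non-empty intersection and a non-empty symmetric
    difference are combined by expanding to their union.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:   # overlaps the previous one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(merge_overlapping([(10, 50), (40, 90), (200, 240)]))
# [(10, 90), (200, 240)]
```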
In order to filter out the misrecognized terminators from our list, we decided to use the much more
detailed transcriptomics data of both UTEX 2973 and its closely related strain PCC 7942.
Our approach was divided into two parts:
- Identify if the sequence is contained inside an open reading frame.
- Determine the approximate in vivo termination efficiency of each candidate.
For the first part of this approach, we took into account the Joint Genome Institute (JGI) predictions and transcriptionally identified ORFs. To avoid considering wrong candidates, we removed any sequence whose intersection with an ORF exceeded a threshold of 15%.
For the in vivo efficiency approximation, we calculated the relative decline in average base counts in 25-base windows before and after the terminator candidates (Creecy et al., 2015). Sequences with an approximated efficiency below a high threshold of 80% were excluded from further consideration.
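A sketch of this efficiency estimate, assuming per-base read counts are available as a NumPy array and that efficiency is taken as one minus the downstream/upstream coverage ratio:

```python
import numpy as np

def termination_efficiency(coverage, term_start, term_end, window=25):
    """Approximate in vivo termination efficiency from RNA-seq coverage.

    Compares the mean per-base read count in a 25-base window upstream of the
    terminator candidate with the window directly downstream of it.
    """
    before = coverage[term_start - window : term_start].mean()
    after = coverage[term_end : term_end + window].mean()
    return 1.0 - after / before   # 1.0 = complete termination

# Example: coverage drops from ~1000x to ~20x across the candidate.
cov = np.concatenate([np.full(100, 1000.0), np.full(100, 20.0)])
print(termination_efficiency(cov, term_start=100, term_end=100))  # ~0.98
```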
After careful separation of the unsuitable candidates, we were left with the most promising terminators. To further analyze their function, a kinetic approach was indispensable. The RNA secondary structures were predicted using KineFold. To choose the most likely conformation, we performed multiple independent runs with different random seeds and chose the most frequent structure. Based on these results, we had to correctly identify the U-tract, hairpin and A-tract regions. The predicted secondary structures were often hairpins extending beyond the terminator hairpin; the reason is the formation of base pairs between upstream poly(A) sequences and the U-tract. For the precise identification of these regions, it was important that the poly(U) region be part of the U-tract and not of the hairpin. To correctly distinguish the two, several steps had to be taken. Given a stem-loop structure, we screened for possible U-tracts in the region between the sixth nucleotide of the 3'-arm of the stem loop and the eighth nucleotide after the stem, evaluating every window of 8 base pairs.
For this we calculated the Gibbs free energy of all possible U-tracts with the formula

$$\Delta G_U = \sum_{i=1}^{N_U - 1} \Delta G_{\mathrm{RNA:DNA}}(i,\, i+1)$$

where $N_U = 8$ is the length of the U-tract and $\Delta G_{\mathrm{RNA:DNA}}(i,\, i+1)$ is the free-energy contribution of the RNA:DNA hybridization of the two nucleotide pairs at positions $i$ and $i+1$.
The hybridization energies were calculated using the nearest-neighbour thermodynamic parameters at the respective positions (Sugimoto et al., 1996).
The 8 bp sequence with the highest ΔG_U value was then selected as the U-tract.
With the U-tract properly identified, it was now possible to precisely define each region.
ID | Strand | Starting Site | End Site | Length | bp counts before | bp counts after | read-through | A-tract | Hairpin | Loop | U-tract | Structure |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1036 | + | 166606 | 166645 | 39 | 2182747.4 | 2.76 | 1.26E-06 | CAACUAAAGA | GAGUCGCUCAGAGAGCGGCUC | AGA | UUUUUUGUUG | (((((((((...))))))))) |
1000 | + | 2622887 | 2622925 | 38 | 295441.36 | 0.76 | 2.57E-06 | UAAACAACCU | CUUCAGUCACAGGACUGAGG | ACAG | GUUUUGUUUU | ((((((((....)))))))) |
904 | - | 2379456 | 2379414 | 42 | 585709.4 | 3.08 | 5.26E-06 | AGCAAAAAGC | CUGUCUAAGCAUUGUCUUGGACAG | CAUUGU | GCUUUUUGCU | (((((((((......))))))))) |
743 | - | 1899716 | 1899683 | 33 | 23739.36 | 0.16 | 6.74E-06 | UAAAAAACGC | CCGGGCAACGCUCGG | AAC | GCGUUUUUUA | ((((((...)))))) |
279 | - | 709603 | 709570 | 33 | 437192.08 | 12.24 | 2.80E-05 | CGAACCCCUA | GUCAUCAAUGGUGAU | CAAUG | AGGGGUUCGU | ((((((...)))))) |
1193 | - | 1170678 | 1170642 | 36 | 69043.84 | 7.96 | 0.000115289 | UAUCAGGAUG | UGACUGAGAACUCAAUCA | GAAC | UCCUGAUCGU | (((.(((....))).))) |
349 | + | 908409 | 908444 | 35 | 73266.04 | 10.08 | 0.000137581 | CAAACCCAGU | GUCUUCUUGUUGGAGGC | UUGUU | UGGGUUUUUG | ((((((.....)))))) |
498 | + | 1270707 | 1270744 | 37 | 134.76 | 0.04 | 0.000296824 | GGCAUUUGGG | GGGCGGCGGUGGGUCGCCC | GGUGG | UUUUUUUCUG | (((((((.....))))))) |
586 | + | 1518890 | 1518927 | 37 | 683.16 | 0.4 | 0.000585514 | CCACAUUAGC | GCUCUCGCCUGUCGAGAGC | CCUGU | UUUUUUAUGC | (((((((.....))))))) |
909 | + | 2385885 | 2385928 | 43 | 731.4 | 0.56 | 0.000765655 | GUCUAAAACC | CCGCUGGUUCCCAGAGAGCUAGCGG | CCAGA | UUUUCCUUAU | ((((((((((.....)))))))))) |
We then wanted to use these records to analyze the impact of mutations in different terminator regions.
To test this experimentally, we established a workflow that allows us to screen a huge combinatorial library of terminators.
For this we selected 3 of the strongest terminators with mutually distinct features, such as different hairpin and loop lengths.
Based on research experience, we decided to include mutations in the respective U- and A-tracts.
The synthetic library was ordered as degenerate oligos.
To test terminator efficiency in vivo, we built GoldenGate Lvl2 constructs with a terminator placeholder flanked by two fluorescent proteins.
Because of the different emission spectra of these fluorescent proteins, we can measure both independently, which allows for an indirect measurement of terminator strength.
For this we calculate the ratio between induced mTurquoise and induced YFP, normalized by the control (a plasmid with no terminator inserted).
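As a sketch, this readout can be condensed into a single strength value per terminator; the exact formula below is our assumption based on the description above, with the terminator-free plasmid as control:

```python
def terminator_strength(mturq_term, yfp_term, mturq_ctrl, yfp_ctrl):
    """Relative terminator strength from the dual-reporter construct.

    The upstream mTurquoise / downstream YFP ratio of the terminator construct
    is normalized by the same ratio of the terminator-free control, so a value
    of 1 means no termination and higher values mean stronger termination.
    (Assumed formula, not taken verbatim from our protocol.)
    """
    return (mturq_term / yfp_term) / (mturq_ctrl / yfp_ctrl)

# Example with made-up fluorescence readings:
print(terminator_strength(mturq_term=900, yfp_term=30,
                          mturq_ctrl=1000, yfp_ctrl=950))  # ~28.5
```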
With the help of FACS we will be able to systematically separate the different terminators and analyze
the impact of different mutations.
We hope that this approach will inspire other teams to build and screen large libraries of synthetic
parts so that the scientific community can gain a deeper insight into the inner workings of elementary
molecular building blocks.
References
Chen, J., Morita, T., & Gottesman, S. (2019). Regulation of Transcription Termination of Small RNAs and
by Small RNAs: Molecular Mechanisms and Biological Functions. Frontiers in Cellular and Infection
Microbiology, 9. https://doi.org/10.3389/fcimb.2019.00201
de Hoon, M. J. L., Makita, Y., Nakai, K., & Miyano, S. (2005). Prediction of Transcriptional Terminators
in Bacillus subtilis and Related Species. PLoS Computational Biology, 1(3), e25.
https://doi.org/10.1371/journal.pcbi.0010025
Krebs, J., Lewin, B., Kilpatrick, S. & Goldstein, E. (2014). Lewin's genes XI. Burlington, Mass: Jones &
Bartlett Learning.
Gautheret, D., & Lambert, A. (2001). Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles. Journal of Molecular Biology, 313, 1003-1011.
Macke, T., Ecker, D., Gutell, R., Gautheret, D., Case, D. A., & Sampath, R. (2001). RNAMotif, an RNA secondary structure definition and search algorithm. Nucleic Acids Research, 29, 4724-4735.
Kingsford, C. L., Ayanbule, K., & Salzberg, S. L. (2007). Rapid, accurate, computational discovery of
Rho-independent transcription terminators illuminates their relationship to DNA uptake. Genome Biology,
8(2), R22. https://doi.org/10.1186/gb-2007-8-2-r22
V. Solovyev, A Salamov (2011) Automatic Annotation of Microbial Genomes and Metagenomic Sequences. In
Metagenomics and its Applications in Agriculture, Biomedicine and Environmental Studies (Ed. R.W. Li),
Nova Science Publishers, p. 61-78
Chen, Y.-J., Liu, P., Nielsen, A. A. K., Brophy, J. A. N., Clancy, K., Peterson, T., & Voigt, C. A.
(2013). Characterization of 582 natural and synthetic terminators and quantification of their design
constraints. Nature Methods, 10(7), 659–664. https://doi.org/10.1038/nmeth.2515
Tan, X., Hou, S., Song, K., Georg, J., Klähn, S., Lu, X., & Hess, W. R. (2018). The primary
transcriptome of the fast-growing cyanobacterium Synechococcus elongatus UTEX 2973. Biotechnology for
Biofuels, 11(1). https://doi.org/10.1186/s13068-018-1215-8
Vijayan, V., Jain, I. H., & O’Shea, E. K. (2011). A high resolution map of a cyanobacterial
transcriptome. Genome Biology, 12(5), R47. https://doi.org/10.1186/gb-2011-12-5-r47
Creecy, J. P., & Conway, T. (2015). Quantitative bacterial transcriptomics with RNA-seq. Current Opinion
in Microbiology, 23, 133–140. https://doi.org/10.1016/j.mib.2014.11.011
Sugimoto, N., Nakano, S. -i., Yoneyama, M., & Honda, K. -i. (1996). Improved Thermodynamic Parameters
and Helix Initiation Factor to Predict Stability of DNA Duplexes. Nucleic Acids Research, 24(22),
4501–4505. https://doi.org/10.1093/nar/24.22.4501