Team:UPNAvarra Spain/Model


Model

Proposal
Vision-based techniques have become increasingly relevant in science over the past 20 years. The reason is twofold. First, good sensors have become dramatically cheaper and better in recent years. Second, society is becoming used to visual information, which eases the introduction of vision tools in production items.

In this context, we consider it important to base our project on vision-based tools, specifically built-in phone-like cameras. This could simplify the distribution of the final product and make the tools more familiar to end users.

For this project we have used this kind of visual sensor to detect the concentration of some heavy metals and nitrates. The intensity of the color shown depends on the concentration of the metal or nitrate, so we use this property to build a model whose objective is to predict concentrations from the RGB values of a photograph.

The first step in building our model is to take photos of all the inductions (in this case we know the concentration of each sample). We took these photos on different days, so the lighting, background and glare differ from photo to photo. This is deliberate, as it adapts the model to varied situations: there is no need for a photography chamber or exhaustive control of the conditions when taking the photo, and some control over the lighting is enough for the model to be useful.

The next step is to extract the data from the induction photos. The result of the imaging workflow is a list of datasets (one per heavy metal or nitrate) relating the RGB color of the pellet to the numerical concentration of the material. The representative color is taken as the average color over the colored area of the pellet. This dataset is meant to be used as training data for a predictor of the concentration in untested waters. We create a CSV file from the data with four columns: the first three are the components of the RGB vector (Red, Green and Blue) and the fourth is the concentration of the metal or nitrate in the sample.
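The extraction step described above can be sketched as follows. This is a minimal illustration, not the team's actual pipeline: the function names `mean_rgb` and `write_dataset`, the boolean mask marking the colored area, the CSV header names, and the synthetic image are all assumptions made for the example.

```python
import csv
import io

import numpy as np


def mean_rgb(image, mask):
    """Average R, G, B over the pixels where mask is True.

    image: (H, W, 3) array of RGB values.
    mask:  (H, W) boolean array marking the colored area of the pellet
           (how the mask is obtained is outside this sketch).
    """
    pixels = image[mask]          # shape (n_pixels, 3)
    return pixels.mean(axis=0)    # shape (3,)


def write_dataset(rows, handle):
    """Write (RGB, concentration) pairs as a four-column CSV.

    The header row is an assumption; the source only names the columns.
    """
    writer = csv.writer(handle)
    writer.writerow(["R", "G", "B", "concentration"])
    for (r, g, b), conc in rows:
        writer.writerow([f"{r:.2f}", f"{g:.2f}", f"{b:.2f}", conc])


# Synthetic 4x4 image: a 2x2 "pellet" of one color on a black background.
image = np.zeros((4, 4, 3), dtype=float)
image[1:3, 1:3] = [10.0, 120.0, 200.0]
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

color = mean_rgb(image, mask)
buffer = io.StringIO()
write_dataset([(color, 2.0)], buffer)
csv_text = buffer.getvalue()
```

Averaging over a mask, rather than over the whole photo, keeps the background out of the representative color, which matters when photos are taken under uncontrolled conditions.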

The relationship between color and (square root of) concentration was well captured by a linear (least-squares) regression model. No higher-order regression was needed, and there was no need for color gamut training or adjustment. In fact, the nitrates require only one channel for a fair linear regression to be fitted (red for nitrate blue, and blue for nitrate yellow).
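A least-squares fit of this kind can be sketched with NumPy. The sketch assumes the regression target is the square root of the concentration and that an intercept term is included; the helper names and the data below are synthetic and purely illustrative, not the team's measurements.

```python
import numpy as np


def fit_linear(X, concentration):
    """Least-squares fit of sqrt(concentration) = X @ w + b.

    X: (n_samples, n_channels) array of average channel values.
    Returns the coefficient vector [w_1, ..., w_k, b].
    """
    X = np.asarray(X, dtype=float)
    target = np.sqrt(np.asarray(concentration, dtype=float))
    A = np.column_stack([X, np.ones(len(X))])   # append intercept column
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef


def predict_concentration(coef, X):
    """Predict concentration by squaring the fitted square-root value."""
    X = np.asarray(X, dtype=float)
    A = np.column_stack([X, np.ones(len(X))])
    root = np.clip(A @ coef, 0.0, None)         # a square root cannot be negative
    return root ** 2


# Synthetic single-channel data: the red value falls linearly as
# sqrt(concentration) rises from 0 to 4.
red = np.array([[200.0], [160.0], [120.0], [80.0], [40.0]])
concentration = np.array([0.0, 1.0, 4.0, 9.0, 16.0])

coef = fit_linear(red, concentration)
predicted = predict_concentration(coef, [[100.0]])
```

The same two functions work unchanged for two or three channels, since the number of columns in `X` sets the dimension of the regression.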





Figure 1. Color model.


Mathematical Model
The mathematical model generated from the experimental results obtained in the lab might look like a classical regression model, but it is not. Normally, regression models are used to predict results from an a priori unknown set of data. Hence, the underlying assumption is that the data is learnable (predictable), and the quest is to produce a mathematical model able to do it.

In the present experiments, our quest is rather the reverse: we intend to prove that the data produced in the experiments is in fact learnable. Our focus is therefore not on building the most complex mathematical model possible to predict future data, but on showing that the (imaging) data we have gathered in the lab can be learned by simple regression models.

This somewhat non-standard idea has driven our experiments. In the data analysis part of the process, we have attempted to learn the data we gathered using a simple regression model. We have opted for a standard least-squares (linear) regression model, which has been run on the dataset obtained in the imaging part. This dataset consists of the average RGB color of the colored part of the pellets at different concentrations of each heavy metal or nitrate.

For each contaminant, we have selected the channels that are of interest for the problem. That is:
  • Nitrate (blue): Red channel
  • Nitrate (yellow): Blue channel
  • Copper: Red and Green channels
  • Mercury: Red and Green channels
  • Cadmium: Red, Green and Blue channels
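The channel selection listed above can be expressed as a small lookup table. This is only a sketch: the dictionary keys, the function name `select_channels`, and the column order (R, G, B at indices 0, 1, 2) are assumptions chosen for illustration.

```python
import numpy as np

# Column indices into an (n_samples, 3) RGB array: 0 = Red, 1 = Green, 2 = Blue.
CHANNELS = {
    "nitrate_blue": [0],
    "nitrate_yellow": [2],
    "copper": [0, 1],
    "mercury": [0, 1],
    "cadmium": [0, 1, 2],
}


def select_channels(rgb, contaminant):
    """Keep only the channels used for the given contaminant."""
    rgb = np.asarray(rgb, dtype=float)
    return rgb[:, CHANNELS[contaminant]]


# Two synthetic RGB samples.
rgb = np.array([[10.0, 20.0, 30.0],
                [40.0, 50.0, 60.0]])

nitrate_yellow_inputs = select_channels(rgb, "nitrate_yellow")   # shape (2, 1)
copper_inputs = select_channels(rgb, "copper")                   # shape (2, 2)
```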


Hence, the problem ranges from 1D to 3D regression, depending on the material we are working with in each specific experiment. For each of the experiments, 5 or 6 samples have been used:




Figure 2. Modeling a Nitrate biosensor. A) Input data; B) Regression model.




Figure 3. Modeling a Nitrate biosensor. A) Input data; B) Regression model.




Figure 4. Modeling a Copper biosensor. A) Input data; B) Regression model.




Figure 5. Modeling a Mercury biosensor. A) Input data; B) Regression model.




Figure 6. Modeling a Cadmium biosensor. A) Input data; B) Regression model.


It can be seen that the data is easily learnable by linear regression models in the case of the Nitrate (Figures 2 and 3). The accuracy in learning the Copper (Figure 4) and Mercury (Figure 5) models is also rather good (note that those figures are 1D representations of 2D models, and hence do not display the linearity of the model itself). The model for Cadmium (Figure 6), however, was not successful. As can be seen from the images, the changes in coloring with respect to the concentration are hardly perceptible. Note also that the concentration learned is the square root of the actual concentration, in order to better distribute the measurements over the testing range.

As can be seen from the models displayed above, the error in the model training is rather small. Specifically, the average error is the following:

Contaminant      | Avg. Error | Testing concentration range
Nitrate (Blue)   | 0.18       | [0,4]
Nitrate (Yellow) | 0.17       | [0,4]
Copper           | 1.88       | [0,22.36]
Mercury          | 0.96       | [0,12]
Cadmium          | 0.38       | [0,10]
Table 1. Error in the model training.


We can hence observe that the data is learnable from both the perspective of the visual model (Figures 2-6) and the results in the error measurement (Table 1). Even in the case of Cadmium the error seems to be low, but that is misleading, since the visual display in Figure 6 shows that the data is clearly not learnable.

Note that the data could be fitted better using other types of regression, specifically higher-order regression. However, this would undermine our main point here, which is to prove that the data is learnable by simple means.

Machine Learning Model
We have shown that the data is learnable by a simple linear regression model (OLS). To obtain more accurate predictions, however, we should train our model on more data. These extra data are extracted from images of pellets induced with a known concentration of each specific substance.

We will consider every channel for each specific substance, since we do not have a large amount of data and we consider that a dimensionality reduction is not necessary to reduce the computational complexity of our model. Nevertheless, in order to present a complete study, we will apply a dimensionality reduction (Principal Component Analysis) in the next section.

In order to test our model, we have applied cross-validation. We divided our data into n = 4 partitions, each with 25% of the data. In each round, n-1 partitions form the training data and the remaining one is the test data. Every partition is selected once as the test data, so training and testing are done 4 times.
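The 4-fold scheme described above can be sketched as follows. The shuffling with a fixed seed, the helper names, and the synthetic data are assumptions for illustration; the source does not say how the samples were assigned to partitions.

```python
import numpy as np


def k_fold_indices(n_samples, k=4, seed=0):
    """Yield (train, test) index arrays; each fold is the test set once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)      # shuffled sample indices
    folds = np.array_split(order, k)        # k nearly equal partitions
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test


def cross_validate(X, y, k=4):
    """Fit a least-squares linear model on k-1 folds, test on the other."""
    errors = []
    for train, test in k_fold_indices(len(X), k):
        A_train = np.column_stack([X[train], np.ones(len(train))])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        A_test = np.column_stack([X[test], np.ones(len(test))])
        errors.append(float(np.mean(np.abs(A_test @ coef - y[test]))))
    return errors


# Synthetic data: 24 samples, 3 channels, target exactly linear in the inputs.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 255.0, size=(24, 3))
y = X @ np.array([0.01, -0.02, 0.005]) + 1.0

fold_errors = cross_validate(X, y, k=4)
```

With 24 samples and k = 4, each round trains on 18 samples and tests on 6, which matches the 25% partitions described in the text.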

For each specific substance we have collected data from 4 different samplings of each concentration, under different imaging conditions. Hence, the number of data points is 24 for Nitrate (Blue), Nitrate (Yellow) and Mercury, with 6 different concentrations, and 20 for Copper, with 5 different concentrations:



Figure 7. Modeling a Nitrate biosensor.


Figure 8. Modeling a Nitrate biosensor.


Figure 10. Modeling a Copper biosensor.


Figure 9. Modeling a Mercury biosensor.


Applying our model, the R2 (mean of the R2 values of the 4 samplings) and the average error (difference between the actual value and the predicted value) are the following:

Contaminant      | R2    | Avg. Error | Testing concentration range
Nitrate (Blue)   | 0.930 | 0.289      | [0,4]
Nitrate (Yellow) | 0.949 | 0.236      | [0,4]
Copper           | 0.833 | 2.520      | [0,22.36]
Mercury          | 0.760 | 2.137      | [0,15.5]
Table 2. Error in the model during cross validation.
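For reference, the two metrics reported in the table can be computed as in the sketch below. It assumes that R2 is the usual coefficient of determination and that the average error is the mean absolute difference between actual and predicted values; the source does not state whether the difference is taken in absolute value. The numbers are synthetic.

```python
import numpy as np


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def average_error(actual, predicted):
    """Mean absolute difference between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


# Synthetic values on a [0, 4] range, with two predictions off by 0.5.
actual = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
predicted = np.array([0.5, 1.0, 2.0, 3.0, 3.5])

r2 = r2_score(actual, predicted)
avg_err = average_error(actual, predicted)
```

In a cross-validation setting, these functions would be evaluated on each test partition and the results averaged.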


We can conclude that, by providing our model with more data, we could predict the concentration of each specific substance with reliable results. In our case, the errors reported in Table 2 are due to the lack of more, uniformly distributed concentration values. Nevertheless, we demonstrated that the data is easily learnable by linear regression models (Figures 7-10), obtaining accurate predicted values.


PCA Model
We aim to demonstrate that our model is able to predict reliably while reducing the number of variables involved. To achieve this goal, we apply PCA (Principal Component Analysis) to our data in order to perform a dimensionality reduction. PCA is an unsupervised machine learning algorithm, i.e., a statistical procedure that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables.

Hence, by applying this technique, the computational complexity of our model in each specific experiment is significantly reduced (Figures 11-13).
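A PCA reduction of RGB data can be sketched with a singular value decomposition. The function name `pca_reduce`, the rule of keeping the fewest components whose cumulative explained variance reaches a threshold, and the synthetic data are assumptions for illustration, not a description of the tooling actually used.

```python
import numpy as np


def pca_reduce(X, variance_threshold=0.95):
    """Project X onto the fewest principal components reaching the threshold.

    Returns (scores, explained), where scores has shape
    (n_samples, n_components) and explained holds the variance ratio
    of every component.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)                 # PCA needs centered data
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)           # variance ratio per component
    cumulative = np.cumsum(explained)
    n_components = int(np.searchsorted(cumulative, variance_threshold)) + 1
    n_components = min(n_components, len(S))      # guard against rounding
    scores = centered @ Vt[:n_components].T
    return scores, explained


# Synthetic RGB samples lying on a line in color space, so one component
# carries essentially all of the variance.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X = t[:, None] * np.array([10.0, 20.0, 20.0]) + np.array([50.0, 60.0, 70.0])

scores, explained = pca_reduce(X, variance_threshold=0.95)
```

The reduced `scores` would then replace the raw RGB columns as inputs to the linear regression.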

Nitrate Blue. We managed to reduce the RGB data to just one variable that retained more than 95% of the explained variance.


Figure 11. Modeling a Nitrate biosensor.


Nitrate Yellow. We managed to reduce the RGB data to two variables that retained more than 96% of the explained variance.


Figure 12. Modeling a Nitrate biosensor.


Mercury. We managed to reduce the RGB data to two variables that retained more than 98% of the explained variance.


Figure 13. Modeling a Mercury biosensor.


Copper. We have not been able to obtain a good PCA, since the values were few and widely separated.

Applying our model, the R2 and the average error are the following:

Contaminant      | R2    | Avg. Error | Testing concentration range
Nitrate (Blue)   | 0.838 | 0.40       | [0,4]
Nitrate (Yellow) | 0.950 | 0.232      | [0,4]
Mercury          | 0.770 | 2.04       | [0,15.5]
Table 3. Error in the model training applying PCA.


We can hence observe that the data is learnable from both the perspective of the visual model (Figures 11-13) and the results in the error measurement (Table 3). We conclude that, with this quantity of data, we can significantly reduce the number of variables. With a larger amount of data for each specific experiment, and hence applying further dimensionality reduction, our model would be more precise.
