# Model Inspiration

## Our problem and our solution

Our chlorophyll separation system is enabled using water-in-oil emulsions. However, using emulsions can be extremely tricky. What total solution composition of oil, water, and surfactant is optimal? Are there any combinations of variables that we need to avoid? To ensure that we can recover the processed canola oil and remove chlorophyll from it, we sought to rationally design our emulsion system. To do this, we studied machine learning classification algorithms to build models which predicted the microstructure of the emulsion solution as a function of its chemical composition.

**The aim of this model** was to find an emulsion that allows maximum removal of chlorophyll from the oil by identifying suitable values of temperature and of the concentrations of oil, water, and surfactant. **Supervised machine-learning classification** methods were used to predict emulsion phase equilibria (known as the Winsor classifications) from previously gathered in vitro data, in order to formulate optimal emulsion conditions. Through an iterative development process, we explored and implemented Support Vector Classification (SVC), k-Nearest Neighbours (KNN), and multilayer perceptron (MLP) models to classify the experimentally gathered data and densely interpolate the desired phase data from it.

We determined that the phase diagram model generated with the SVC algorithm produced the most physically accurate results. We used the results from the model generated at 27 °C to construct and validate our emulsion separation system in the wet-lab.

# Methodology

Phase diagrams are graphs used in chemical and materials engineering that map the physical properties of a solution as a function of its chemical constituents and conditions. However, there is no widely accepted method to model solutions of emulsions, and the few methods that do exist require extremely difficult and tedious measurements. This is because emulsions are structured fluids, meaning that their equilibrium behaviour is highly dependent on their geometric properties. This precludes the use of thermodynamic models, which are the main method for modelling the equilibrium behaviour of a liquid mixture. Because the team needed to predict the phases of emulsions for our chlorophyll extraction system, we developed an alternative modelling methodology.

For the purposes of rapid design of our emulsion system, we were more interested in correlating the chemical components to their Winsor type than in describing the underlying thermodynamics. Machine learning excels at identifying correlation and can provide an abstract reading without needing to understand the underlying thermodynamic and molecular details, making it the ideal method for our purposes. The machine learning algorithms we explored in this modelling project were SVC, KNN, and MLP.

SVC and KNN are both well-known classification algorithms. Classification algorithms work by estimating how likely a new data point is to belong to a specific class. The KNN algorithm assigns a label to a data point based on the labels of neighbouring classified data. As an alternative approach, SVC attempts to find the boundaries between classification labels, and then assigns new data points a label depending on which side of the classification boundaries they fall. Due to the nonlinear nature of phases, the SVC used in this project utilized a radial basis function (RBF) kernel. MLPs are densely connected feedforward artificial neural networks (ANNs), which means that they are function approximators that can train on large amounts of labeled input data, enabling them to become more accurate.
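To make the two ideas above concrete, here is a minimal standard-library sketch of a KNN majority vote and of the RBF similarity that an RBF-kernel SVC is built on. It is an illustration only, not the team's implementation: the function names, the sample compositions, and their labels are all made up.

```python
import math
from collections import Counter

def rbf_kernel(a, b, gamma):
    """RBF similarity exp(-gamma * ||a - b||^2) between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up (oil, water, surfactant) volume fractions and labels, for illustration only.
points = [(0.70, 0.20, 0.10), (0.65, 0.25, 0.10), (0.20, 0.70, 0.10),
          (0.25, 0.65, 0.10), (0.30, 0.30, 0.40)]
labels = ["Winsor 2", "Winsor 2", "Winsor 1", "Winsor 1", "Winsor 4"]

print(knn_predict(points, labels, (0.68, 0.22, 0.10), k=3))
print(rbf_kernel(points[0], points[1], gamma=0.5))
```

A larger `gamma` makes the RBF similarity fall off faster with distance, which is why it controls how flexible the resulting class boundaries can become.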

## Modelling Stages

We divided the use of the above machine learning algorithms into **two stages** of modelling. The primary stage involved choosing a machine learning algorithm to fit to the experimental data. This enabled the team to model the phase behaviour of the emulsion system at the specific temperatures at which the experimental data was gathered. Using those models, uniformly distributed and densely populated phase diagram data at the experiment temperatures could be generated.
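As a sketch of what "uniformly distributed and densely populated" data could look like, the snippet below enumerates every composition on a regular ternary grid. The 1% spacing is an assumption, chosen because it yields exactly 5151 compositions, the per-temperature count quoted for the generated data; the function name is hypothetical.

```python
def ternary_grid(steps=100):
    """All (oil, water, surfactant) fractions on a grid of spacing 1/steps."""
    grid = []
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            k = steps - i - j  # the remainder, so the three fractions sum to 1
            grid.append((i / steps, j / steps, k / steps))
    return grid

grid = ternary_grid(100)
print(len(grid))  # 5151
```

Each grid point would then be passed to the fitted stage 1 classifier to obtain a predicted Winsor label.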

The secondary stage utilized the large amounts of data generated in the primary stage to train an MLP model, as ANN accuracy scales markedly well with training dataset size. With multiple temperatures of generated phase diagram data, the training data enabled the MLP model to approximate phases as a function of temperature in addition to the three volume fractions of the system. Subsequently, the team was able to utilize the secondary stage model to interpolate and extrapolate entire ternary phase diagrams between temperatures.

Through several **rounds** of experiments and iteration, we were able to refine our modelling process and final results.

## Modelling Motivations

The work done for this modelling project sets out to achieve the following:

- Identify the best classification algorithm to develop these models
- Develop and interpret a phase diagram model that is consistent with existing literature
- **Inform our project design** by identifying combinations of water, oil, and surfactant that yield a desirable Winsor type to test our EBP chlorophyll removal system
- Accomplish the above without needing to conduct ponderously long experiments

**Model Assumptions**

- Each variable phase was assumed to be a homogeneous species; instead of considering the organic phase as a combination of fatty acids and triglycerides, the oil was considered to be a homogeneous component. This applied to the aqueous and surfactant phases as well.
- Input data was assumed to not be affected by extraneous factors, like the creation of foams by mixing in air.
- Emulsion structure was assumed to not be affected by hysteresis effects.
- Any error in the experimental input data was assumed to be negligible, and was therefore not factored into accuracy and error calculations.
- Phase classification is a nonlinear problem, hence the use of the nonlinear RBF kernel in SVC.
- Phase data can be represented in an n-dimensional feature space that supports distance calculations, which is required for KNN. In this situation, n = 4.
- Phase equilibrium classes are a function of at least one of the four input variables (3 volume fractions and temperature). If this assumption were false, MLP would have been unable to model phase diagrams to any degree of accuracy.

## Emulsion Classes

**Figure 1**: A depiction of the various Winsor emulsion classifications. Four classes exist: Winsor 1, 2, 3, 4.

Compositions that yield the Winsor 1 classification separate into a pure oil phase and an oil-in-water emulsion. Winsor 2 compositions separate into a pure water phase and a water-in-oil emulsion. Winsor 3 compositions separate into pure oil and water phases and a bicontinuous microemulsion. Winsor 4 compositions form a complete water-in-oil microemulsion that is extremely difficult to separate.

The desired Winsor classifications were Winsor 1 and Winsor 3 type emulsions, as these settle out to produce a pure oil phase for our oil recovery process. These Winsor classes also had the added benefit of eliminating a secondary purification step to remove surfactants from solutions.

## Emulsion Representation

Each data point has four components: the three volume fractions of oil, water, and surfactant, and the observed equilibrium phase. Our model seeks a function \(F : v \longrightarrow y\) that maps a given composition vector to a Winsor phase class, such that

$$ v = \begin{pmatrix} r_{oil} \\ r_{water} \\ r_{surfct} \end{pmatrix} $$

$$ r_{oil} + r_{water} + r_{surfct} = 1 $$

$$ y \in \{\text{Winsor 1}, \text{Winsor 2}, \text{Winsor 3}, \text{Winsor 4}\} $$

**Figure 2**: Our approach to finding the Winsor phase boundaries of our emulsions by using data points on the left collected in lab as input. Each axis is the side of the triangle, and each point represents an emulsion with different compositions of oil, water, and surfactant. Color represents Winsor phase class.
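For readers who want to place a composition on a ternary diagram like this one, the following sketch checks the constraint \(r_{oil} + r_{water} + r_{surfct} = 1\) and converts the composition to 2-D plot coordinates. The corner layout is an assumption made for illustration and may not match the orientation used in the figures; the function name is hypothetical.

```python
import math

def to_plot_xy(r_oil, r_water, r_surfct, tol=1e-9):
    """Map a composition to 2-D coordinates inside a unit-side triangle.

    Assumed corner layout: oil at (0, 0), water at (1, 0),
    surfactant at (0.5, sqrt(3)/2).
    """
    fractions = (r_oil, r_water, r_surfct)
    if any(f < -tol for f in fractions):
        raise ValueError("volume fractions must be non-negative")
    if abs(sum(fractions) - 1.0) > tol:
        raise ValueError("volume fractions must sum to 1")
    x = r_water + 0.5 * r_surfct
    y = (math.sqrt(3) / 2.0) * r_surfct
    return x, y

print(to_plot_xy(0.5, 0.3, 0.2))
```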

# Round 1 Experiments

The wet lab conducted dilution line experiments using a system of green oil, polysorbate 80, and distilled water. The oil phase was created by extracting chlorophyll from spinach with 97% ethanol and mixing the extract with commercial-grade canola oil at a 1:4 volume ratio, respectively. The ethanol in the solution was then boiled off at 363.15K for 30 minutes, and the oil was allowed to cool at room temperature for 1 hour. Samples were thoroughly mixed with a vortex mixer, before being given 24 hours to reach equilibrium at their given temperature. After this period, the emulsion compositions were classified. 80 emulsion compositions at each of the 5 temperatures of 300.15K, 310.15K, 315.15K, 328.15K, and 343.15K were manually assigned a Winsor class via observation.

\* It is important to note that round 1 contained no protein in the aqueous phase. In addition, the parameters chosen for SVC were not justified in this round.

**Figure 3**: Examples of different Winsor classes mapping to locations on a ternary phase diagram.

# Round 1 Results

## Proof of Concept

The wet lab members of our team developed an initial set of emulsion experiments using canola oil with added chlorophyll, water, and surfactant. The data points were evaluated at 300.15K, 310.15K, 315.15K, 328.15K, and 343.15K. Each temperature yielded 80 data points, labeled with their respective Winsor class (Figure 2). Using this data set, we sought to determine whether or not our modelling approach was valid. For the first round of our model, we applied the three aforementioned classification approaches: support vector classification (SVC), k-nearest neighbours (KNN), and a multilayer perceptron neural network (MLP). Their respective phase diagrams are shown below (Figure 4), along with the table of their classification error (Table 1). The classification error is the fraction of test points the model predicted incorrectly, averaged over all partitions of train and test using cross-validation.
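The cross-validated classification error described above can be sketched as follows. This is a minimal standard-library illustration, not the team's code: the number of folds, the nearest-neighbour stand-in classifier, and the two-cluster sample data are all assumptions.

```python
import math
import random

def nearest_neighbour(train_x, train_y, query):
    """Stand-in classifier: label of the single closest training point."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]

def cv_error(xs, ys, classify, n_folds=5, seed=0):
    """Mean fraction of misclassified test points over n_folds partitions."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_x = [xs[i] for i in idx if i not in held_out]
        train_y = [ys[i] for i in idx if i not in held_out]
        wrong = sum(classify(train_x, train_y, xs[i]) != ys[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / len(errors)

# Two well-separated, made-up clusters of 25 points each.
cluster_a = [(0.1 + 0.02 * i, 0.1 + 0.02 * j) for i in range(5) for j in range(5)]
cluster_b = [(x + 0.6, y + 0.6) for x, y in cluster_a]
xs = cluster_a + cluster_b
ys = ["A"] * 25 + ["B"] * 25

print(cv_error(xs, ys, nearest_neighbour))
```

Any classifier with the same `(train_x, train_y, query)` signature could be swapped in for `nearest_neighbour`, which is what makes the error rates of different algorithms comparable.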

The purpose behind this round of modelling was to explore different machine learning algorithms and produce a proof of concept for this modelling project. Therefore, the overall project relevancy of the experiments and the accuracy of the models in this round were deemphasized. Instead, the objective of round 1 was to quickly evaluate whether our approach could feasibly model ternary phase diagrams, given enough effort and development time.

## Round 1 Stage 1 Model Classifications

Going into this stage of modelling, exploration of methodology was the main concern. Therefore, the two independent parameters (gamma and cost) for the SVC models in this round were not thoroughly optimized. The KNN models, however, were optimized in this round, since KNN has only one parameter (K) and is relatively straightforward to implement. Finally, the team lowered its expectations for the stage 1 MLP models, as 80 data points per MLP model is far from sufficient to achieve reasonable model accuracy.

**Figure 4**: A visual of all ternary phase diagrams generated with round 1 wet-lab data by Support Vector Classification (SVC), k-Nearest Neighbours (KNN), and a Multi-Layered Perceptron neural network (MLP) classification over four temperatures.

Model | 300K Error Rate | 310K Error Rate | 315K Error Rate | 328K Error Rate | 343K Error Rate | Mean Classification Error |
---|---|---|---|---|---|---|
SVC | 0.2251 | 0.2710 | 0.3712 | 0.2571 | 0.2285 | 0.2693 |
KNN | 0.3222 | 0.4553 | 0.4267 | 0.3696 | 0.3553 | 0.3858 |
MLP | 0.2500 | 0.5714 | 0.5714 | 0.5000 | 0.2857 | 0.4301 |

**Table 1**: The classification error rate of three models at five temperatures: Support Vector Classification (SVC), k-Nearest Neighbours (KNN), and Multi-Layered Perceptron neural network (MLP). The classification error is the proportion of all test points the model predicted incorrectly.

## Round 1 Stage 1 Model Conclusions

By inspection, we can see that SVC had the lowest classification error, even though it was obtained without optimizing its hyperparameters. With SVC remaining the best option for classification, optimizing the hyperparameters became a major point of concern for future modelling rounds.

In regards to KNN, it performed markedly worse than SVC, despite K being optimized. Therefore, while a stage 2 model was created from the KNN models, KNN was not utilized for subsequent rounds of modelling.

The MLP models performed as expected, which is to say, poorly. Consequently, the team did not even bother with creating a stage 2 model from the stage 1 MLP models.

## Round 1 Stage 2 Model Classifications

The MLP architecture for round 1 was 4-512-4.

- **4 input layer nodes:** Handled the 4 input variables of oil volume fraction, water volume fraction, surfactant volume fraction, and temperature.
- **512 hidden layer nodes:** Utilized a softmax activation function and a dropout rate of 0.2 to improve training performance.
- **4 output layer nodes:** One node for each of the four Winsor classes.

The train-validation-test split used was 0.64:0.16:0.20.
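A 0.64:0.16:0.20 split is what results from holding out 20% of the data for testing and then 20% of the remainder for validation (0.8 × 0.2 = 0.16). The sketch below illustrates this under that assumption, using the 25 755 generated points described for the stage 2 training data; the function name, seed, and two-step procedure are hypothetical rather than the team's documented method.

```python
import random

def three_way_split(n_points, test_frac=0.20, val_frac_of_rest=0.20, seed=0):
    """Shuffle indices, then carve off the test set and the validation set."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    n_test = round(test_frac * n_points)
    n_val = round(val_frac_of_rest * (n_points - n_test))
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(25755)
print(len(train), len(val), len(test))  # 16483 4121 5151
```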

For both the SVC and KNN models, the 5151 data points generated at each temperature, totalling 25 755 data points, were fed into MLP models. The final accuracy values and training histories for the MLP models are displayed in the table below. The overall MLP model accuracy values were taken as the test data accuracy multiplied by the overall accuracy of the model (KNN or SVC) that they were based on, where the overall accuracy of a stage 1 model is one minus its mean classification error.

MLP Model | Training Epochs | Training Data Accuracy | Validation Data Accuracy | Test Data Accuracy | Overall Accuracy |
---|---|---|---|---|---|
Using KNN Data | 7000 | 93.800% | 94.500% | 94.000% | 57.735% |
Using SVC Data | 70000 | 93.800% | 94.600% | 94.200% | 68.832% |

**Table 2**: Round 1 stage 2 MLP model training, validation, test, and overall accuracy.
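The overall accuracy values in Table 2 can be reproduced from the mean classification errors in Table 1, taking the stage 1 accuracy as one minus the mean classification error. The short check below uses only numbers quoted in this document; the function name is hypothetical.

```python
def overall_accuracy(stage1_mean_error, stage2_test_accuracy):
    """Stage 1 accuracy (1 - error) multiplied by the stage 2 test accuracy."""
    return (1.0 - stage1_mean_error) * stage2_test_accuracy

knn_overall = overall_accuracy(0.3858, 0.940)
svc_overall = overall_accuracy(0.2693, 0.942)

print(f"{knn_overall:.3%}")  # 57.735%
print(f"{svc_overall:.3%}")  # 68.832%
```

Because the two accuracies are multiplied, the stage 2 model can never be credited with a higher overall accuracy than the stage 1 model that produced its training data.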

**Figure 5**: Training history of the round 1 SVC MLP model.

Once trained, the stage 2 MLP models generated phase diagram data in a similar fashion to the stage 1 SVC and KNN interpolations, although here phase diagrams for the temperatures from 295.15K to 344.15K, in steps of 1K, were uniformly and densely generated. The ternary phase diagrams generated from the MLP models are displayed below.

**Figure 6**: Display of phase equilibria in round 1 changing as a function of temperature in three dimensional cartesian space using an MLP trained on KNN interpolated data.

**Figure 7**: Display of phase equilibria in round 1 changing as a function of temperature in three dimensional cartesian space using an MLP trained on SVC interpolated data.

## Round 1 Stage 2 Model Conclusions

By eye, the round 1 stage 2 models were indistinguishable from each other. Closer inspection of the data revealed that, while there were minute differences, the two models were essentially the same. Additionally, the models achieved a similar level of accuracy to their training datasets.

The results suggested that the choice of machine learning algorithm used in stage 1 was inconsequential in terms of producing differences in stage 2 models. This conclusion contributed to future rounds of modelling only using 1 machine learning algorithm for stage 1.

## Round 1 Wet Lab Conclusions

Our wet lab studied the initial predicted phase diagrams and compared them to existing literature. Generally speaking, the boundaries between phases on the diagrams are highly parametric. The boundaries between Winsor types are also non-linear, because the relationship between the composition variables and the emulsion structure is non-linear. The KNN-generated model contained both a non-parametric boundary and a random, discontinuous phase distribution, which is not consistent with experimental literature. The MLP-generated model, on the other hand, was highly underfit. This was evident not only in the low predictive accuracy of the model, but also in the linear structure of the boundaries between the phases. In comparison to the other two, SVC produced the most physically representative phase diagram, as its boundaries were both continuous and non-linear.

# Round 2 Experiments

Having confirmed that the SVC phase diagram model both had a low mean classification error and better represented the physical properties of emulsions, our team began developing a second set of phase diagram models.

The second round of experiments altered the physical components of the solution to be more representative of the actual chlorophyll removal process that we are developing. The solution was composed of the green oil donated by Milligan biofuels, the aqueous buffer used for protein suspension, and a co-surfactant mixture.

71 experimental data points were gathered for each of the examined temperatures of 300.15K, 310.15K, 323.15K, and 343.15K.

# Round 2 Results

## Introduction

With the results of round 1 analyzed, the team decided that further work into emulsion phase equilibria modelling was worth pursuing. Additionally, potential uses of the models were identified and clarified. However, in order to achieve this potential, refinements would need to be made to the modelling process. Firstly, the team selected the approach of using only SVC for stage 1 modelling. Additionally, exhaustive SVC parameter optimization was developed and implemented, in order to minimize error.

Another benefit of this optimization was that, in evaluating SVCs to determine the minimum error, SVCs ranging from highly underfit to highly overfit could be collected. By utilizing these models in a confidence function, confidence heatmaps could be generated for the temperatures examined in the experiments. By using these confidence heatmaps to identify specific phase equilibria of maximum confidence, the team could potentially **form the desired Winsor 1 phases for optimal chlorophyll extraction without any further trial-and-error.**

In regards to the stage 2 models, the fact that they interpolate and extrapolate phase data in temperature enables recording ternary phase areas over temperature. This would provide another method of validating the model results.

## Round 2 Stage 1 Model Classifications

After round 2 data was gathered from the wet lab, SVC with an RBF kernel was applied over the four experiment temperatures, focusing on finding the optimal parameters for each. The parameters giving the lowest mean error rate were chosen after iterating through the parameter space of gamma and cost.
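Iterating through the gamma and cost parameter space amounts to an exhaustive grid search. The sketch below shows the shape of such a search using only the standard library. The grid values are arbitrary, and `toy_error` is a placeholder surface standing in for the cross-validated error of an SVC trained with the given parameters; neither reflects the team's actual search.

```python
import math
from itertools import product

def grid_search(gammas, costs, error_fn):
    """Return ((gamma, cost), error) with the lowest error on the grid."""
    scored = [((g, c), error_fn(g, c)) for g, c in product(gammas, costs)]
    return min(scored, key=lambda item: item[1])

def toy_error(gamma, cost):
    # Placeholder surface with its minimum at gamma=1, cost=100. In practice
    # this would be the cross-validated classification error of the SVC.
    return 0.2 + 0.05 * (abs(math.log10(gamma)) + abs(math.log10(cost) - 2))

best_params, best_error = grid_search([0.01, 0.1, 1, 10], [1, 10, 100, 1000], toy_error)
print(best_params, best_error)
```

Keeping every scored model rather than only the best one is what later allows the underfit-to-overfit range of models to be reused.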

The four completed phase diagrams using SVC are shown below (Figure 8), along with a table that includes their minimized classification error along with their respective parameters (Table 3). The data presented in these round 2 stage 1 diagrams were used for constructing a scaled version of our separation system, using our 6GIX protein in the aqueous phase of the emulsion.

**Figure 8**: Round 2 optimized Support Vector Classification (SVC) at four temperatures, 300K, 310K, 323K, 343K. (Dark blue: Winsor 1, pink: Winsor 2, green: Winsor 3, light blue: Winsor 4)

Model | 300K Error Rate | 310K Error Rate | 323K Error Rate | 343K Error Rate | Mean Classification Error |
---|---|---|---|---|---|
SVC | 0.2428 | 0.1714 | 0.1571 | 0.2428 | 0.2035 |

Parameter | Optimal at 300K | Optimal at 310K | Optimal at 323K | Optimal at 343K |
---|---|---|---|---|
Gamma | 0.66 | 0.25 | 0.75 | 0.17 |
Cost | 18400 | 12200 | 8800 | 10600 |

**Table 3**: The classification error rate of Support Vector Classification (SVC) for round 2 data. The parameters that defined the support vector model are specified as gamma and cost, and were chosen as they produced the lowest classification error when iterating through the parameter space.

## Round 2 Confidence Regions

We wanted to find the class areas that changed the most when iterating through the support vector model parameters. These areas reveal the emulsions that are unstable, allowing us to avoid such compositions. To determine these areas, we generated 180 different support vector models of increasing flexibility (Figure 9). Here are 60 of the 180 models:

**Figure 9**: An animation showing the class area change when iterating through all constructed models of the SVC, starting from underfit and ending with overfit.

Of the changing class areas shown in Figure 9, we sought to find the emulsion points which changed class labels most frequently. These are the points that our model was unable to classify with great certainty, allowing us to avoid constructing our emulsions in these regions. Figure 10 (below) depicts how we determined a 'confidence' value for each emulsion point. Dark areas in the final ternary diagram display regions of low classification confidence.

**Figure 10**: The approach to finding a confidence value for each emulsion point. Darker areas of the color gradient show those points that were difficult to classify. Yellow regions show no phase change over all Support Vector Classification parameter changes.
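One way to turn a set of models into a per-point confidence value is sketched below. The exact confidence function used by the team is not given here, so this sketch assumes confidence is the share of models that agree with the most common label for a point. Under that assumption, a value of 1 corresponds to the regions where the label never changes. The labels and function names are made up.

```python
from collections import Counter

def confidence(labels):
    """Share of models that agree with the most common label for one point."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def confidence_map(predictions_by_model):
    """predictions_by_model[m][p] is model m's label for grid point p."""
    per_point = zip(*predictions_by_model)
    return [confidence(labels) for labels in per_point]

# Three hypothetical models labelling four grid points (made-up labels).
models = [
    ["W1", "W1", "W3", "W2"],
    ["W1", "W3", "W3", "W2"],
    ["W1", "W3", "W4", "W2"],
]
print(confidence_map(models))
```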

## Round 2 Stage 2 Classification

The MLP architecture for round 2 was 4-512-5. The only difference between the round 2 and round 1 architectures is the extra output layer node, to represent the N/A phase (where no emulsion can be formed due to instability).

With a stage 1 model accuracy of 0.7965 and a stage 2 model test data accuracy of 0.942, the overall accuracy of the round 2 stage 2 model is 0.750.

The model training history and accuracy data is displayed below:

MLP Model | Training Epochs | Training Data Accuracy | Validation Data Accuracy | Test Data Accuracy | Overall Accuracy |
---|---|---|---|---|---|
Using Optimized SVC Data | 63229 | 97.20% | 98.30% | 94.20% | 75.00% |

**Table 4**: Round 2 stage 2 MLP model training, validation, test, and overall accuracy.

**Figure 11**: Training history of the optimized Round 2 Stage 2 SVC MLP model.

The trained MLP model result is displayed below:

**Figure 12**: Display of phase equilibria in round 2 changing as a function of temperature in three dimensional cartesian space. Stage 1 interpolation used SVC with optimized hyperparameters. Dark blue: Winsor 1, white: Winsor 2, green: Winsor 3, light blue: Winsor 4.

## Round 2 Model Phase Area Analysis

The data density used in our models is set so that each generated ternary phase diagram contains 5151 data points per temperature. Therefore, the area of each phase can be plotted against the temperature of the system by dividing the number of data points of a particular phase by the total number of data points at that temperature (5151).
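The area calculation described above is a simple counting exercise, sketched here for a single temperature. The label counts are invented purely so that they sum to 5151; they are not model output, and the function name is hypothetical.

```python
from collections import Counter

def phase_area_fractions(labels):
    """Fraction of grid points assigned to each phase at one temperature."""
    total = len(labels)
    return {phase: count / total for phase, count in Counter(labels).items()}

# Made-up labels for the 5151 grid points of one generated phase diagram.
labels_at_t = (["Winsor 1"] * 2000 + ["Winsor 2"] * 1500
               + ["Winsor 3"] * 1000 + ["Winsor 4"] * 651)

fractions = phase_area_fractions(labels_at_t)
for phase, fraction in sorted(fractions.items()):
    print(f"{phase}: {fraction:.3f}")
```

Repeating this for each generated temperature gives one curve per phase, which is the kind of plot used to compare the model against physical expectations.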

We were interested in studying the presence of the Winsor phases as a function of temperature because it can be a useful tool for checking whether the model's predictions for the emulsion system are physically reasonable.

The first trend that we were interested in was the evolution of the Winsor 1 region. From a physical standpoint, we were expecting to see the Winsor 1 area decrease with temperature. Our emulsion system uses non-ionic surfactants, which create the emulsion structure by hydrogen-bonding with the aqueous phase. Hydrogen bonds dissociate easily as temperature rises, meaning that as temperature increases there should be more surfactant mixing into the oil phase. The graph does show an eventual net decrease in the Winsor 1 area, albeit after a steady increase up to 327.15K. The interplay between the Winsor 2 and 3 regions is also accurate, as the phase inversion process requires the emulsions to pass through a thermally stable Winsor 3 region. The eventual collapse of emulsion structure with increasing temperature is also an accurate portrayal, as an emulsion’s stability slowly decreases with temperature.

A point of concern, however, is the decrease in the presence of the Winsor 2 phase between 305.15K and 319.15K, because it should be either increasing or staying constant.

**Figure 13**: Area of phase equilibria in phase diagrams in experiment 2 changing as a function of temperature.

## Round 2 Wet Lab Integration

With the SVC-generated phase diagram model developed, we sought to use our model to optimize the emulsions used to develop our chlorophyll removal process in the wet lab. For our wet lab experiments, we set out to identify emulsion compositions that meet the following design criteria:

- Composition is classified as a Winsor 1 type emulsion - this emulsion type settles out to produce a pure oil phase for our oil recovery process
- Model predicts the Winsor label with a confidence of 1 - we do not want to use any compositions whose predicted label varies with the model's parameter optimization
- Water-to-oil ratio is between 0 and 1 - this is in order to maximize the amount of green oil that can be processed and minimize the amount of aqueous phase needed while maintaining the Winsor 1 structure

Based on these criteria, we identified five composition candidates to test the performance of our emulsion system.
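The three design criteria can be expressed as a simple filter over the model output. The sketch below is illustrative only: the record layout, the sample points, and the strict reading of "between 0 and 1" are assumptions, and none of the values are the team's data.

```python
def is_candidate(point):
    """Apply the three design criteria to one labelled grid point."""
    oil, water = point["oil"], point["water"]
    if oil <= 0:
        return False  # the water-to-oil ratio is undefined without oil
    ratio = water / oil
    return (point["label"] == "Winsor 1"
            and point["confidence"] == 1.0
            and 0 < ratio < 1)

# Made-up model output for four compositions (volume fractions).
grid_points = [
    {"oil": 0.60, "water": 0.30, "surfactant": 0.10, "label": "Winsor 1", "confidence": 1.0},
    {"oil": 0.30, "water": 0.60, "surfactant": 0.10, "label": "Winsor 1", "confidence": 1.0},
    {"oil": 0.55, "water": 0.35, "surfactant": 0.10, "label": "Winsor 1", "confidence": 0.8},
    {"oil": 0.50, "water": 0.20, "surfactant": 0.30, "label": "Winsor 4", "confidence": 1.0},
]

candidates = [p for p in grid_points if is_candidate(p)]
print(candidates)
```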

**Figure 14**: We validated the emulsion structure of these candidates in the lab and found that the phase diagram model had correctly predicted the Winsor type for each.

# Future Directions

A future direction in developing and using our phase diagram models would be to study the effect of the volume fraction of surfactant on the Winsor type. This would essentially use the same data, but study it in a way that can identify the effect of temperature on different emulsion compositions without needing to manipulate the composition of the existing solution. Creating a confidence function that varies by interpolation in temperature could be extremely useful in the improved design of our separation system. If we can develop our phase diagram to accurately predict the effect of manipulating the process temperature, we can enable our emulsion system to maximize the rate at which it separates, since the rates of emulsion coalescence and separation are both correlated with temperature. This same principle also applies to maximizing the extent of our oil recovery.

# References

Figures made with Plotly Technologies Inc. Collaborative data science. Montréal, QC, 2015. https://plot.ly.

A Winsor, B. P., & Hahn, von. (1932). HYDROTROPY, SOLUBILISATION AND RELATED EMULSIFICATION PROCESSES. PART I. Aqueous Solutions of Paraffin Chain Salts (Vol. 62). Retrieved from https://pubs-rsc-org.ezproxy.lib.ucalgary.ca/en/content/articlepdf/1948/tf/tf9484400376

Abdulkarim, M. F., Abdullah, G. Z., Sakeena, M. H. F., Chitneni, M., Yam, M. F., Mahdi, E. S., … Noor, A. M. (2011). Study of Pseudoternary Phase Diagram Behaviour and the Effect of Several Tweens and Spans on Palm Oil Esters Characteristics. International Journal of Drug Delivery, 3, 95–100. https://doi.org/10.5138/ijdd.2010.0975.0215.03058

Diosady, L. L. (2005). Chlorophyll Removal From Edible Oils. International Journal of Applied Science and Engineering, 3(2), 81–88.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2017). An introduction to statistical learning: With applications in R. Springer.

Luciński, R., & Jackowski, G. (2006). The structure, functions and degradation of pigment-binding proteins of photosystem II. Retrieved from www.actabp.pl

Ramamurthi, S., & Low, N. H. (1995). Effect of Possible Chlorophyll Breakdown Products on Canola Oil Stability. J. Agric. Food Chem. (Vol. 43). Retrieved from https://pubs.acs.org/sharingguidelines

Srinivasan, R. (2011). Advances in application of natural clay and its composites in removal of biological, organic, and inorganic contaminants from drinking water. Advances in Materials Science and Engineering, 2011. https://doi.org/10.1155/2011/872531

Wrolstad, R. E. (2004). Symposium 12 : Interaction of Natural Colors with Anthocyanin Pigments — Bioactivity and Coloring Properties. Journal of Food Science, 69(5), 419–421.