Team-iGEM SJTU BioX 201

Overview

What is bio-storage

Just like we have mentioned in the judging form, we believe that the storage should not be limited to the off-target information, but can be extended to a broader binary world. We consider lightened cell represent 1 and unlightened cell represent 0. There for, the storage can be expanded to ;iterally everything.

Why we model

Through our experiment, we notice that sometimes the fluorescence is too dim and cannot be distingutished from unstored cell. However, storage error is un accepetable in any storage system. Therefore, we must establish a codec model that can tolerate storage error. Thus we introduce you our Bio-storage codec----

Introduction

Use a dictionary

Error correcting codes might seem like a difficult mathematical concept, but they are in fact based on an intuitive idea with an ingenious mathematical implementation: let's make the data structured, in a way that we can "guess" what the data was if it gets corrupted, just by "fixing" the structure. Mathematically-wise, we use polynomials from the Galois Field to implement this structure.

Let's take a more practical analogy: let's say you want to communicate messages to someone else, but these messages can get corrupted along the way. The main insight of error correcting codes is that, instead of using a whole dictionary of words, we can use a smaller set of carefully selected words, a "reduced dictionary", so that each word is as different as any other. This way, when we get a message, we just have to lookup inside our reduced dictionary and correct the error in 2 steps.

1) detect which words are corrupted (as they are not in our reduced dictionary
2) correct corrupted words by finding the most similar word in our dictionary

To fully illustrate our model's principle, we present it in comic:

A and B both have a reduced dictionary with only three words of 4 letters: bike, fire and five. One day, A write a letter to B. Unfortunately, the letter is corrupted on the way. Is B able to read the letter?

Let's say B receive a corrupted word: bi??, where ? is an erasure. Since we have only 3 words in our dictionary, we can easily compare our received word with our dictionary to find the word that is the closest. In this case, it's corn. Thus the missing letters are ke.

Now let's say we receive the word fi??. Here the problem is that we have two words in our dictionary that match the received word: fire and five. In this case, we cannot be sure which one it is, and thus we cannot decode.

Here we introduce two significant defintion here：maximum Hamming distance and maximum separability

Back to the second letter B received. That means that our dictionary is not very good, and we should replace fire with another more different word, such as fork to maximize the difference between each word. This difference, or more precisely the minimum number of different letters between any 2 words of our dictionary, is called the maximum Hamming distance of our dictionary. Making sure that any 2 words of the dictionary share only a minimum number of letters at the same position is called maximum separability.

The same principle is used for our codec: we generate only a reduced dictionary containing only words with maximum separability , and then we communicate only with the words of this reduced dictionary. What Galois Fields provide is the structure (ie, reduced dictionary basis), and Reed–Solomon is a way to automatically create a suitable structure (make a reduced dictionary with maximum separability tailored for a dataset), as well as provide the automated methods to detect and correct errors (ie, lookups in the reduced dictionary). To be more precise, Galois Fields are the structure and Reed–Solomon is the codec based on Galois Fields.

If a word gets corrupted in the communication, that's no big deal since we can easily fix it by looking inside our dictionary and find the closest word, which is probably the correct one (there is however a chance of choosing a wrong one if the input message is too heavily corrupted, but the probability is very small). Also, the longer our words are, the more separable they are, since more characters can be corrupted without any impact.

Redundancy

The simplest way to generate a dictionary of maximally separable words is to make words longer than they really are. Back to the dictionary, if we add redundancy data, it will turn to be like this:

fiveabcd

firebcde

bikecdef

Therefore, the coruppted letter will be fi??abcd, and it can be decoded to fiveabcd.

With these examples, we can see the advantage of redundancy in recovering lost information: redundant characters help us recover our original data.

Our core idea based on Reed–Solomon algorithm is similar: appending redundant data to a message based on Galois Field mathematics. The original error correcting decoder was similar to the error example above, search sub-sets of a received message that correspond to a valid message, and choose the one with the most matches as the corrected message. This isn't practical for larger messages, so mathematical algorithms were developed to perform error correction in a reasonable time.

Encoding

Reed Solomon code, or RS code, introduced by Irving S. Reed and Gustave Solomon in 1960, provides a method to encode the message with a redundant in the form of an error correcting code, or ECC.

The main idea of encoding is that we can regard the message $x=(x_i)\in F^k ,i=1...k$ as a sequence of coefficients of a polynomial $p_x$.

$$ p_x(a)=\Sigma x_ia^{i-1}|_{i=1}^n $$

We can receive the message $x$ by evaluating $p_x$ at different points $a_1...a_n\in F^n$. Then we define the RS code by,

$$ C(x)=(p_x(a_1),..,p_x(a_n)) $$

Or, we can write it as $C(x)=xA$, $A$ is a Vandermonde matrix over $F$.

The redundant information $y=(y_i), i=1..m$ can be created by

$$ y=Bx $$

B is $m\times k$ matrix. In this case, even some data is lost, the reciever can revive the original polynomial by interpolation.

In this case, if the numbers of lost information is lower than m, and we know the position of the lost information, we can revive the original data.

For example, if the message we want to send is $D_1, D_2,D_3,D_4$, we can construct a matrix to generate a series $D_1,D_2,D_3,D_4,C_1,C_2,C_3$ by,

$$ \left( \begin{array}{ccc} %该矩阵一共3列，每一列都居中放置 I_{4\times4}\\ %第一行元素 B_{3\times4}\\ %第二行元素 \end{array} \right) \left[ \begin{array}{ccc} %该矩阵一共3列，每一列都居中放置 D_1\\ %第一行元素 D_2\\ %第二行元素 D_3\\ D_4\\ \end{array} \right]=\left[ \begin{array}{ccc} %该矩阵一共3列，每一列都居中放置 D_1\\ %第一行元素 D_2\\ %第二行元素 D_3\\ D_4\\ E_1\\ E_2\\ E_3\\ \end{array} \right] $$

The matrix $\left(\begin{array}{ccc} %该矩阵一共3列，每一列都居中放置 I_{4\times4}\\ %第一行元素 B_{3\times4}\\ %第二行元素 \end{array}\right)$ should satisfy that every sub matrix of it is inversible. This is the process of encoding.

For the receiver, if $D_3,E_1,E_2$ is lost, the message he or she receives is $[D_1,D_2,??,D_4,??,??,E_3]$.

Then the relationship becomes,

$$ \left( \begin{array}{} %该矩阵一共3列，每一列都居中放置 1\ \ \ \ \ 0\ \ \ \ \ 0\ \ \ \ \ 0\\ %第一行元素 0\ \ \ \ \ 1\ \ \ \ \ 0\ \ \ \ \ 0\\ 0\ \ \ \ \ 0\ \ \ \ \ 0\ \ \ \ \ 1\\ B_{31}\ B_{32}\ B_{33}\ B_{34}\\ \end{array} \right) \left[ \begin{array}{ccc} %该矩阵一共3列，每一列都居中放置 D_1\\ %第一行元素 D_2\\ %第二行元素 D_3\\ D_4\\ \end{array} \right]=\left[ \begin{array}{ccc} %该矩阵一共3列，每一列都居中放置 D_1\\ %第一行元素 D_2\\ %第二行元素 D_4\\ E_3\\ \end{array} \right] $$

Let the new matrix be $B'$, and we can find the inverse matrix of $B'$.

$$ B'^{-1}=B'' $$

Therefore, we can obtain the original information $[D_1,D_2,D_3,D_4]$ by

$$ \left[ \begin{array}{ccc} %该矩阵一共3列，每一列都居中放置 D_1\\ %第一行元素 D_2\\ %第二行元素 D_3\\ D_4\\ \end{array} \right]=\left[ \begin{array}{ccc} %该矩阵一共3列，每一列都居中放置 D_1\\ %第一行元素 D_2\\ %第二行元素 D_4\\ E_3\\ \end{array} \right]·B'' $$

Reed–Solomon codes are encoded by dividing the polynomial representing the message by an irreducible generator polynomial, and then the remainder is the RS code, which we will just append to the original message.

Why? We previously said that the principle behind error correcting codes, is to use a reduced dictionary with very different words as to maximize the distance between words, and that longer words have greater distance: here it's the same principle, first because we lengthen the original message with additional symbols (the remainder) which raise the distance, and secondly because the remainder is almost unique (thanks to the carefully designed irreducible generator polynomial), so that it can be exploited by clever algorithms to deduce parts of the original message.

To summary, with an approximated analogy to encryption: our generator polynomial is our encoding dictionary, and polynomial division is the operator to convert our message using the dictionary (the generator polynomial) into a RS code.

Decoding

Reed–Solomon decoding is the process that, from a potentially corrupted message and a RS code, returns a corrected message. In other words, decoding is the process of repairing your message using the previously computed RS code.

Although there is only one way to encode a message with Reed–Solomon, there are lots of different ways to decode them, and thus there are a lot of different decoding algorithms.

However, we can generally outline the decoding process in 5 steps:

1.Compute the syndromes polynomial. This allows us to analyze what characters are in error using Berlekamp-Massey (or another algorithm), and also to quickly check if the input message is corrupted at all.

2.Compute the erasure/error locator polynomial (from the syndromes). This is computed by Berlekamp-Massey, and is a detector that will tell us exactly what characters are corrupted.

3.Compute the erasure/error evaluator polynomial (from the syndromes and erasure/error locator polynomial). Necessary to evaluate how much the characters were tampered (ie, helps to compute the magnitude).

4.Compute the erasure/error magnitude polynomial (from all 3 polynomials above): this polynomial can also be called the corruption polynomial, since in fact it exactly stores the values that need to be subtracted from the received message to get the original, correct message (i.e., with correct values for erased characters). In other words, at this point, we extracted the noise and stored it in this polynomial, and we just have to remove this noise from the input message to repair it.

5.Repair the input message simply by subtracting the magnitude polynomial from the input message.

It is possible for a Reed–Solomon decoder to decode both erasures and errors at the same time, up to a limit (called the Singleton Bound) of $2*e+v <= (n-k)$, where e is the number of errors, v the number of erasures and (n-k) the number of RS code characters (called nsym in the code). Basically, it means that for every erasures, you just need one RS code character to repair it, while for every errors you need two RS code characters (because you need to find the position in addition of the value/magnitude to correct).

SJTU-BioX-Shanghai

Contact us: sjtuigem@gmail.com

Bio-X Institute, Shanghai Jiao Tong University, Dongchuan Rd. 800

Team:SJTU-BioX-Shanghai/Model/Codec