Team:Tsinghua-A/Model

 

 

Model

Limitation in length

Currently, large-scale DNA data storage relies on oligo pool chip synthesis technology, so a file must be divided into several data chunks, each encoded into an oligo of 100-200 bases. To recover the data, an index must be added to each data chunk so that the file can be reassembled from all these fragments.
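The chunk-and-index idea can be sketched as follows. This is only a minimal illustration, not our actual encoder: the function names, the 30-byte payload and the 2-byte index are assumptions chosen so that, at 2 bits per base, each chunk would occupy 128 bases.

```python
def split_with_index(data: bytes, payload_bytes: int = 30, index_bytes: int = 2) -> list[bytes]:
    """Split data into chunks and prefix each one with a big-endian index.

    payload_bytes and index_bytes are illustrative values (assumptions).
    int.to_bytes raises OverflowError if the index does not fit in index_bytes.
    """
    chunks = []
    for i, start in enumerate(range(0, len(data), payload_bytes)):
        payload = data[start:start + payload_bytes]
        chunks.append(i.to_bytes(index_bytes, "big") + payload)
    return chunks


def reassemble(chunks: list[bytes], index_bytes: int = 2) -> bytes:
    """Sort chunks by their index prefix and join the payloads."""
    ordered = sorted(chunks, key=lambda c: int.from_bytes(c[:index_bytes], "big"))
    return b"".join(c[index_bytes:] for c in ordered)


data = bytes(range(100))
chunks = split_with_index(data)
# Sequencing returns reads in no particular order, so reverse them here
# to show that the index alone is enough to restore the file.
restored = reassemble(list(reversed(chunks)))
print(restored == data)
```

The index costs bases that could otherwise hold data, so a longer index supports larger files at the price of lower information density.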

Channel noise: Errors within a sequence

Errors, including substitutions, insertions and deletions, may occur within a sequence during synthesis, decay and sequencing. So if a file is encoded directly to DNA with a base-bits mapping, it may not be fully recoverable after sequencing because of the noise introduced in the DNA data storage channel, as shown below.
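A small sketch makes the difference between error types concrete. The mapping 00→A, 01→C, 10→G, 11→T is an assumption for illustration (any fixed 2-bit mapping behaves the same way), and the helper names are hypothetical.

```python
BASES = "ACGT"  # assumed mapping: 00->A, 01->C, 10->G, 11->T


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    return "".join(BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0))


def dna_to_bytes(seq: str) -> bytes:
    """Decode four bases per byte; trailing bases that do not fill a byte are dropped."""
    out = bytearray()
    for i in range(0, len(seq) - len(seq) % 4, 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


original = b"DNA storage"
seq = bytes_to_dna(original)

# Substitution: replace the base at position 5 with a different one.
substituted = seq[:5] + ("A" if seq[5] != "A" else "C") + seq[6:]
# Deletion: drop the base at position 5, shifting every later base.
deleted = seq[:5] + seq[6:]

print(dna_to_bytes(substituted))  # only the byte containing position 5 changes
print(dna_to_bytes(deleted))      # every byte from position 5 onward is misread
```

A substitution damages one byte, while an insertion or deletion shifts the reading frame and corrupts everything after it, which is why a plain base-bits mapping cannot tolerate channel noise.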

 

Channel noise: Loss of sequences

 

Besides errors within a sequence, a whole sequence might also be lost, for a couple of reasons:

• Decay and PCR are sequence dependent, which leads to an uneven distribution of sequences in the pool

• Sampling and sequencing amount to drawing from a pool, so some sequences might never be drawn
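The two effects can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a model of real chemistry: the log-normal abundance weights, the `skew` parameter and the pool sizes are all hypothetical choices made only to show how uneven abundance and finite sampling cause dropout.

```python
import random


def simulate_dropout(n_sequences: int = 1000, n_reads: int = 5000,
                     skew: float = 1.0, seed: int = 0) -> int:
    """Return how many sequences receive zero reads.

    skew is the sigma of an assumed log-normal abundance distribution;
    skew == 0 means every sequence is equally abundant.
    """
    rng = random.Random(seed)
    if skew == 0:
        weights = [1.0] * n_sequences
    else:
        weights = [rng.lognormvariate(0.0, skew) for _ in range(n_sequences)]
    # Sequencing is modelled as drawing reads with replacement,
    # with probability proportional to abundance.
    reads = rng.choices(range(n_sequences), weights=weights, k=n_reads)
    return n_sequences - len(set(reads))


print("lost with even pool:  ", simulate_dropout(skew=0))
print("lost with skewed pool:", simulate_dropout(skew=1.0))
```

Even with an even pool, a finite number of reads leaves some sequences unobserved, so the encoding has to tolerate missing chunks rather than assume every oligo comes back.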

Bio constraints

It is well known that sequences with high GC content and long repeats are hard to synthesize and sequence, so these biology-related patterns must be avoided in encoding. However, they are frequently observed when binary files are translated to bases directly:

  • For most file types, the majority of encoded sequences naturally have a GC content between 0.4 and 0.6; the exception is text files, which have extremely high GC content, as will be explained later.
  • 4/5 of the encoded sequences contain repeats longer than 4 bases
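Both constraints are easy to check on a candidate sequence. In the sketch below, the GC window of 0.4 to 0.6 and the run limit of 4 bases come from the figures above; treating "repeats" as runs of one identical base (homopolymers) is an assumption, and the function names are hypothetical.

```python
from itertools import groupby


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def longest_run(seq: str) -> int:
    """Length of the longest run of one repeated base (assumed meaning of 'repeat')."""
    return max((len(list(group)) for _, group in groupby(seq)), default=0)


def fits_bio_constraints(seq: str, gc_min: float = 0.4, gc_max: float = 0.6,
                         max_run: int = 4) -> bool:
    """True if GC content lies in [gc_min, gc_max] and no run exceeds max_run."""
    return gc_min <= gc_content(seq) <= gc_max and longest_run(seq) <= max_run


for candidate in ("ACGTACGTAC", "AAAAAACGTG", "GGCCGCGCGC"):
    print(candidate, gc_content(candidate), longest_run(candidate),
          fits_bio_constraints(candidate))
```

A filter like this can only reject bad sequences; an encoding method still needs a way to produce sequences that pass it without losing data.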

 

In summary, to encode a file to DNA and recover it perfectly, the file must be split into small data chunks and encoded into DNA sequences that fit the bio constraints, and the encoding method must tolerate channel noise, both errors within a sequence and the loss of whole sequences.

In addition, some extra requirements for a data storage method should also be taken into account, such as information density (how many bits can be stored per base) and data security.

Can we put forward a method that meets all these requirements? Click here to see our methods.