Team:Tsinghua-A/encode decode
<a name="Introduction"></a>
 
<h3>Introduction</h3>
 
From our model it can be seen that if we want to encode any type of data into DNA and fully recover the files from it, several characteristics of the DNA data channel must be taken into consideration:
 
Errors introduced in the DNA storage channel
 
Two kinds of errors might be introduced in the DNA storage channel:
 
  • Substitution, deletion and insertion within a sequence

  • Loss of whole sequences

    Luckily for us, the original file can be fully recovered even if some parts of the data are damaged, by using channel encoding methods to add redundancy. Applications of such encoding methods can be seen in daily life. Take the following QR code for example: even though part of the original QR code is damaged, you can still scan it and visit our website. This is achieved by adding an RS (Reed-Solomon) error-correction code to the QR code: with the corresponding decoding algorithm, a certain number of errors can be located and corrected, so the original data can be recovered as long as it is not damaged too badly (see the toy sketch after this list).
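A toy Python sketch of the redundancy idea (deliberately much simpler than a real RS code, which can also correct errors, not just erasures): one XOR parity chunk lets us rebuild any single lost chunk.

    # Toy erasure recovery via one XOR parity chunk (not a real RS code).
    chunks = [b"GATT", b"ACAG", b"TTCA"]
    parity = bytes(a ^ b ^ c for a, b, c in zip(*chunks))  # one redundant chunk

    # Suppose chunks[1] is lost: XOR of the survivors and the parity restores it.
    restored = bytes(x ^ y ^ z for x, y, z in zip(chunks[0], chunks[2], parity))
    assert restored == chunks[1]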
 
Biochemical constraints
 
Different from other channel encoding problems, there are certain biochemical constraints when encoding data in DNA: long homopolymers and extreme GC content must be avoided in the encoding result, as they are hard to synthesize and prone to sequencing errors.
 
These requirements cannot be satisfied by encoding files into DNA with a simple bits-to-quaternary transformation: long homopolymers are frequently observed in pictures and compressed files, and the GC content of text files tends to be very high under current character encodings. Strategies to solve this problem must be implemented when encoding data into DNA.
 
Besides solving the above problems, we also want the information density of the encoded data to be as high as possible. How can we combine these requirements?
 
 
<a name="Source Encoding"></a>
 
<a name="Source Encoding"></a>
 
<h3>Source Encoding : DNA ASCII</h3>
 
<h3>Source Encoding : DNA ASCII</h3>
Line 155: Line 139:
 

 

 

Source Encoding : DNA ASCII

Problem of text encoding

When encoding text files into DNA by directly transforming bits to A/T/G/C, we found that most sequences satisfy the maximum-homopolymer-length requirement, but many sequences have high GC content: the average GC content of DNA chunks encoded from a text file is above 60%. Examining the character distribution of a document reveals the reason: most characters are lower-case letters and frequently used punctuation such as the space character, and the ASCII codes of these characters fall in a narrow interval (the codes of 'a' to 'z' run from 97 to 122). This arrangement is convenient for computers, but it is unnecessary and problematic when storing data in DNA, because the DNA representations of these frequent characters have a high average GC rate, which skews the GC distribution of an encoded text file. How can we solve this problem?
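The skew is easy to see with a minimal Python sketch; the base assignment below (00→A, 01→C, 10→G, 11→T) is our assumption, but any fixed assignment shows a similar bias for ASCII text:

    # GC content of English text under a naive bits-to-bases mapping.
    BASES = "ACGT"   # assumed assignment: 00->A, 01->C, 10->G, 11->T

    def bytes_to_dna(data: bytes) -> str:
        out = []
        for byte in data:
            for shift in (6, 4, 2, 0):        # four 2-bit groups per byte
                out.append(BASES[(byte >> shift) & 0b11])
        return "".join(out)

    def gc_content(seq: str) -> float:
        return (seq.count("G") + seq.count("C")) / len(seq)

    text = "the quick brown fox jumps over the lazy dog. " * 20
    print(gc_content(bytes_to_dna(text.encode("ascii"))))  # ~0.58, above 0.5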

Inspiration from NEFU

The project of team NEFU gave us inspiration. They put forward a novel method to encode text into DNA by mapping each English letter to one or several codons, such that the frequency of occurrence of a letter matches the summed natural frequency of its corresponding codons, making the encoding result of a text look "natural".

But as the character set is limited to 26 letters and the information density is low (0.8125 bits per base), the method is a step away from being practical, so we made an improvement upon their project. Borrowing the idea of frequency-based encoding, we developed a new method that maps ASCII codes to DNA ASCII, encoding text into DNA with GC content close to 0.5 and no long homopolymers, while extending the character set to the full ASCII set and maintaining an information density of 2 bits per base.

Improvement: DNA ASCII Design

DNA ASCII encodes each ASCII symbol as a 4-base DNA strand. To satisfy the biochemical constraints, DNA ASCII should meet the following requirements:

  • The frequency-weighted average GC rate of all encoded 4-base strands should be around 0.5, and characters with higher frequency of occurrence should be assigned strands with GC rate closer to 0.5

  • 4-base strands containing longer homopolymers should be allocated to symbols with lower frequency of occurrence

    This is what the frequency table and the ASCII mapping look like:

From a comparison of sequences encoded with the original ASCII and with DNA ASCII, we can see that DNA ASCII performs much better when encoding text files.
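One plausible way to construct such a table, as a Python sketch of our own (the team's actual mapping comes from the frequency table above): rank symbols by frequency, rank all 256 4-mers by how close their GC rate is to 0.5 (breaking ties by homopolymer length), and pair them off.

    # Hypothetical construction of a DNA ASCII table.
    from collections import Counter
    from itertools import product

    def gc_rate(s: str) -> float:
        return (s.count("G") + s.count("C")) / len(s)

    def max_homopolymer(s: str) -> int:
        best = run = 1
        for a, b in zip(s, s[1:]):
            run = run + 1 if a == b else 1
            best = max(best, run)
        return best

    sample = "some representative english text used to estimate frequencies"
    symbols = [c for c, _ in Counter(sample).most_common()]  # frequent first

    # Rank all 256 4-mers: GC closest to 0.5 first, short homopolymers first.
    kmers = sorted(("".join(p) for p in product("ACGT", repeat=4)),
                   key=lambda k: (abs(gc_rate(k) - 0.5), max_homopolymer(k)))

    dna_ascii = dict(zip(symbols, kmers))  # most frequent symbol -> best 4-mer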

 

Channel Encoding : DNA Fountain

Design

We chose the DNA Fountain code, put forward by Erlich and Zielinski in 2017 [1], as our channel code for encoding data in the large-scale DNA data storage experiment.

The basic idea of a fountain code is as follows: to encode, divide a file into N data chunks and regard each chunk as a variable; construct (1 + alpha)N equations by selecting several chunks and adding them together over the binary field (bitwise XOR). To decode, one only needs to obtain any (1 + k)N of those equations and solve them to get the data chunks back (a typical value of k is 0.05).

An intuitive picture of the above process: every equation can be thought of as a droplet. When encoding, a fountain sprays droplets almost endlessly (there are very many combinations of chunks), and one just has to hold up a glass and collect any (1 + k)N droplets to decode the original data.
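A minimal droplet generator in Python, under assumptions we make for brevity (the real DNA Fountain draws the degree from a robust soliton distribution; we cap it at 4 here):

    # Fountain-style droplet: seed + XOR of the chunks the seed selects.
    import random

    def make_droplet(chunks: list[bytes], seed: int) -> bytes:
        rng = random.Random(seed)            # the seed fully determines the droplet
        degree = rng.randint(1, 4)           # simplified degree distribution
        indices = rng.sample(range(len(chunks)), degree)
        payload = bytes(len(chunks[0]))      # zero block to XOR into
        for i in indices:
            payload = bytes(a ^ b for a, b in zip(payload, chunks[i]))
        return seed.to_bytes(4, "big") + payload  # decoder re-derives indices from seed

    chunks = [b"chunkAA!", b"chunkBB!", b"chunkCC!", b"chunkDD!"]
    droplets = [make_droplet(chunks, s) for s in range(10)]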

It is a very smart idea to use a fountain code when encoding data into DNA, as it can:

  • ensure the biochemical constraints are satisfied: a screening step can be added during encoding. For each generated droplet, check whether it meets the GC-content requirement and contains no long homopolymer; if it passes the screening, add it to the encoding result; if not, discard it and generate a new droplet. Since any desired number of droplets can be generated, we eventually obtain a sufficient number of valid droplets (a screening function is sketched after this list).
  • provide resistance to sequence loss: even if several DNA sequences are lost, the original file can still be recovered as long as the number of remaining sequences is above (1 + k)N.
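A screening function along these lines (the 0.45-0.55 GC window and maximum run of 3 match the thresholds quoted later on this page, but the exact values are a design choice):

    # Screen a candidate droplet sequence for the biochemical constraints.
    def passes_screen(seq: str, gc_low=0.45, gc_high=0.55, max_run=3) -> bool:
        gc = (seq.count("G") + seq.count("C")) / len(seq)
        if not (gc_low <= gc <= gc_high):
            return False                      # extreme GC content
        run = 1
        for a, b in zip(seq, seq[1:]):
            run = run + 1 if a == b else 1
            if run > max_run:
                return False                  # long homopolymer
        return True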

This figure shows the encoding and decoding process of the fountain code:

Limitations

Although DNA Fountain has the appealing features above, we also observed several limitations while testing the fountain code on a computer and analyzing sequencing data from the real-world experiment:

Severe consequences when decoding fails

The first disadvantage arises from the difference between communication and storage. In communication applications such as mobile TV, where fountain codes were originally used, when dropouts increase and receivers do not collect enough droplets, the sender can keep sending new droplets until the number is sufficient for decoding (and the sender does not need to know which droplets were lost, which is the advantage of a fountain code). In storage applications, however, when too many dropouts occur and the number of droplets is insufficient for decoding, the original file simply cannot be recovered. Due to the way the fountain code is constructed, the consequence is sometimes very severe: reducing the number of droplets past a certain point (like the destination of the red arrow in the figure) may lead to failure to decode most of the data chunks, whereas without fountain coding the other chunks would still be available.

Error spreading and error-correcting efficiency

The fountain code itself cannot correct errors within a droplet, and one error in a droplet may influence many chunks, so an extra error-correction code must be applied to each chunk to ensure that every chunk used in decoding is 100% correct. In the original paper [1] and in our experiment, an RS code is added to every chunk to correct errors within that chunk. However, the error distribution across the analyzed sequencing reads is uneven: many chunks contain no error at all, so the error-correcting capacity cannot be utilized efficiently, and the problem cannot be solved with Microsoft's approach [2] of adding an RS code over a large data block consisting of multiple chunks.
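A tiny illustration of the spreading effect, with hypothetical values: a degree-2 droplet is used to solve one chunk from another, so a single flipped bit in the droplet lands in the solved chunk and propagates to anything solved from it.

    # One corrupted droplet corrupts every chunk solved through it.
    c0 = bytes([0b0110])
    droplet = bytes([0b1010])                       # encodes c0 XOR c1
    c1 = bytes(a ^ b for a, b in zip(droplet, c0))  # correct decode: 0b1100
    bad = bytes([droplet[0] ^ 0b0001])              # one bit error in the droplet
    c1_bad = bytes(a ^ b for a, b in zip(bad, c0))  # solved chunk inherits the error
    assert c1 != c1_bad   # ...and any chunk later solved from c1_bad is wrong too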

Extra redundancy, especially for small files

While encoding, alpha*N redundant DNA sequences are added, but only (alpha - k)*N strands are allowed to be lost. This gap may be negligible for large files with many data chunks (such as the 72,000 in the original paper [1]), but for the small files in our experiment k can be large (over 15%), so this is not an ideal way to encode small files.
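A worked example with assumed numbers: for N = 50 chunks and alpha = 0.2, we synthesize (1 + 0.2) * 50 = 60 sequences; if k = 0.15, decoding needs (1 + 0.15) * 50 = 57.5, i.e. at least 58 of them, so despite adding 10 redundant sequences we can only afford to lose about 2.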

Inspirations

Encryption

When we obtain a number of DNA droplets and wish to recover the original data, we have to know which chunks each droplet came from; this is done by feeding the same seed into the same random number generator to reproduce the same chunk numbers. So if we want to encrypt the original file, we can:

  • Encrypt the seed: encrypt the seed (an integer) so that the seed an attacker reads is not the real one; without knowing how the seed was encrypted, the attacker cannot recover the correct chunk numbers.
  • Encrypt the random number generator: even if attackers obtain the right seed, the chunk numbers will still be wrong if they cannot set up the same random number generator. This can be done in multiple ways, such as generating random numbers with the traditional linear congruential method x <- (A*x + C) mod M and varying the parameters A, C and M, adjusting the degree distribution, or discarding the first D random numbers generated (see the sketch below).
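A sketch of the second option; the LCG parameters below are standard textbook constants, not the team's values:

    # A parameterized LCG acting as a shared secret between encoder and decoder.
    class LCG:
        def __init__(self, seed: int, a: int, c: int, m: int, skip: int = 0):
            self.x, self.a, self.c, self.m = seed, a, c, m
            for _ in range(skip):             # optionally discard the first D outputs
                self.next()

        def next(self) -> int:
            self.x = (self.a * self.x + self.c) % self.m
            return self.x

    def chunk_ids(seed: int, n_chunks: int, degree: int, **params) -> list[int]:
        rng = LCG(seed, **params)
        return [rng.next() % n_chunks for _ in range(degree)]

    # Same seed, different generator parameters -> different chunk numbers.
    print(chunk_ids(42, 100, 3, a=1103515245, c=12345, m=2**31))
    print(chunk_ids(42, 100, 3, a=6364136223846793005, c=1442695040888963407, m=2**64))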

Because errors spread in the DNA setting, a few changes in the chunk numbers may lead to a relatively big change in the decoding result, as seen in the figure below:

Iterative generation and screening

The generate-and-screen strategy is great for encoding 'perfect DNA' in terms of satisfying the biochemical constraints: it adds no extra redundancy, and the DNA can be shaped however you want by changing the screening condition. But for now it must be used together with the fountain code, which may not be the best option for every data storage need. Can we design a method that deals only with the biochemical constraints and integrates well with other error-correcting codes?

Channel Encoding : DNA Mask

Inspiration from QR code

To avoid certain unwanted patterns in a QR code, such as large blank areas, a masking strategy transforms the original picture into a desired one during encoding. With the mask index stored in the scanned QR code, the original mask can be looked up and used to transform the picture back, recovering the data. As extreme GC content and long homopolymers can also be regarded as undesired patterns, can the same strategy be implemented in DNA channel encoding?

XOR data with a mask

Performing an XOR with a single random mask cannot guarantee sequences that fit the biological requirements; it only randomizes the data, i.e. the masked sequence behaves like a random sequence and has the same pass rate as a randomly generated one. In computational experiments we found the pass rate of a randomly generated 96 nt DNA sequence to be around 1/5 (0.45 < GC < 0.55, no homopolymer longer than 3).
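That figure is easy to reproduce with a quick Monte Carlo run (same screening rule as the earlier sketch, in compact form):

    # Estimate the pass rate of random 96 nt sequences.
    import random, re

    def passes(seq: str) -> bool:
        gc = (seq.count("G") + seq.count("C")) / len(seq)
        return 0.45 < gc < 0.55 and not re.search(r"(.)\1{3}", seq)

    trials = 100_000
    hits = sum(passes("".join(random.choices("ACGT", k=96))) for _ in range(trials))
    print(hits / trials)    # roughly 0.2, matching the ~1/5 quoted above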

For comparison, in Microsoft's pipeline [2] the biochemical constraints are met using Goldman's approach in the second-to-last stage, in which data is transformed to base 3 and then encoded back to base 4 in a way that avoids repeats, at the cost of a roughly 1/4 decrease in information density.
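A sketch of that rotating trick (the idea, not the exact published table): each base-3 digit selects one of the three bases that differ from the previous base, so no base ever repeats.

    # Rotating base-3 -> base-4 encoding: adjacent bases are always different.
    BASES = "ACGT"

    def trits_to_dna(trits: list[int], prev: str = "A") -> str:
        out = []
        for t in trits:                        # each t is 0, 1 or 2
            choices = [b for b in BASES if b != prev]  # 3 bases differ from prev
            prev = choices[t]
            out.append(prev)
        return "".join(out)

    print(trits_to_dna([0, 0, 0, 2, 1]))       # 'CACTC': no two adjacent bases equal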

XOR data with a mask set using the generate-and-screen strategy

What happens if we XOR the DNA with a set of masks? If the success probability with one mask is 1/5, then, since masks are generated randomly, the same generate-and-screen strategy with N masks raises the success probability to 1 - (4/5)^N. With 16 masks the success rate is about 97.2%, and with 64 masks about 0.9999994; even if no mask fits perfectly, we can still select a relatively good sequence. This method satisfies the biochemical constraints at a relatively low cost (2 nt to store the index of 16 masks, or 3 nt for 64 masks, compared with the 1/4 * 96 nt overhead of Goldman's approach), and it integrates well with other encoding methods. For example, in Microsoft's former work [2], Goldman's approach could be replaced by DNA Mask without changing any other part of the encoding pipeline.
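A sketch of the whole scheme under the same assumed screening rule; the fixed RNG seed stands in for whatever shared mask set the encoder and decoder agree on:

    # DNA Mask: XOR with masks from a shared set until the result passes screening.
    import random, re

    BASES = "ACGT"
    B2I = {b: i for i, b in enumerate(BASES)}

    def xor_seq(seq: str, mask: str) -> str:
        # XOR the 2-bit values of each base pair; self-inverse, so unmasking
        # is just xor_seq(masked, mask) again.
        return "".join(BASES[B2I[a] ^ B2I[b]] for a, b in zip(seq, mask))

    def passes(seq: str) -> bool:
        gc = (seq.count("G") + seq.count("C")) / len(seq)
        return 0.45 < gc < 0.55 and not re.search(r"(.)\1{3}", seq)

    rng = random.Random(2019)              # shared so both sides derive the same masks
    masks = ["".join(rng.choice(BASES) for _ in range(96)) for _ in range(64)]

    def mask_sequence(seq: str):
        for idx, mask in enumerate(masks):  # 64 masks -> index fits in 3 nt
            masked = xor_seq(seq, mask)
            if passes(masked):
                return idx, masked          # store idx alongside the strand
        return 0, xor_seq(seq, masks[0])    # fallback: keep a relatively good one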

[1] Erlich, Yaniv, and Dina Zielinski. "DNA Fountain enables a robust and efficient storage architecture." Science 355.6328 (2017): 950-954.

[2] Organick, Lee, et al. "Random access in large-scale DNA data storage." Nature biotechnology 36.3 (2018): 242.