Difference between revisions of "Team:Tsinghua-A/encode decode"


Revision as of 14:00, 9 December 2019

 

 

Source Encoding: DNA ASCII

Problem of text encoding

When encoding text files into DNA by directly transforming bits to ATGC, we found that most sequences fit the requirement on maximum homopolymer length, but many sequences have high GC content: the average GC content of DNA chunks encoded from a text file is 60% or higher. Examining the character distribution of a document, the reason is easy to find: most characters in a document are lower-case letters and frequently used punctuation marks such as ' ', and the ASCII codes of these characters fall in a narrow interval (the codes of 'a'-'z' run from 97 to 122). This arrangement is convenient for computer arithmetic, but it is unnecessary and problematic when storing data in DNA, because the DNA representations of these frequently used characters have a high average GC rate, which leads to an uneven GC distribution across a text file. How can this problem be solved?

Inspiration from NEFU

The NEFU project gave us an inspiration. They put forward a novel method to encode text into DNA by mapping each English letter to one or several codons, such that the frequency of occurrence of a letter is similar to the summed natural frequency of occurrence of its corresponding codons, so that the encoding result of a text may look "natural".

But as the character set is limited to the 26 letters, their method is a step away from being practical, so we made an improvement upon their project. Borrowing the idea of encoding based on frequency of occurrence, we developed a new method that maps ASCII codes to DNA ASCII, encoding text into DNA with GC content close to 0.5 and no long homopolymers, while extending the character set to the full ASCII set and maintaining an information density of 2 bits per base.

Improvement: DNA ASCII Design

DNA ASCII encodes each ASCII symbol as a 4-base DNA strand. To satisfy biochemical constraints, DNA ASCII should meet the following requirements:

  • The frequency-weighted average GC rate of all encoded 4-base strands should be around 0.5, and characters with a higher frequency of occurrence should have a GC rate closer to 0.5

  • 4-base strands with longer homopolymers should be allocated to symbols with a lower frequency of occurrence

    This is what the frequency table and the ASCII mapping look like:
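One way to build such a table can be sketched as follows: rank all 256 four-base codewords by how well-behaved they are (GC rate near 0.5, short homopolymers), rank characters by frequency, and pair them up. The tiny corpus used for the frequency table here is an illustrative assumption; a real table would be built from a large body of text.

```python
from collections import Counter
from itertools import groupby, product

def max_homopolymer(seq):
    """Length of the longest run of identical bases."""
    return max(len(list(g)) for _, g in groupby(seq))

def codeword_cost(cw):
    """Lower cost = GC rate closer to 0.5 and shorter homopolymers."""
    gc = (cw.count("G") + cw.count("C")) / 4
    return (abs(gc - 0.5), max_homopolymer(cw))

# All 4^4 = 256 codewords, more than enough for the 128 ASCII symbols.
codewords = sorted(("".join(p) for p in product("ACGT", repeat=4)),
                   key=codeword_cost)

# Character frequencies from a tiny illustrative corpus.
corpus = "the quick brown fox jumps over the lazy dog"
freq = Counter(corpus)
chars_by_freq = sorted((chr(i) for i in range(128)),
                       key=lambda c: -freq.get(c, 0))

# The most frequent characters get the best-behaved codewords.
dna_ascii = dict(zip(chars_by_freq, codewords))
print(dna_ascii[" "], dna_ascii["e"])
```

By construction, the most frequent symbol (here the space) receives a codeword with GC rate exactly 0.5 and no repeated bases, which is the behaviour the two requirements above ask for.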

From a comparison of sequences encoded with the original ASCII and with DNA ASCII, we can see that DNA ASCII performs much better for encoding text files.

 

Channel Encoding: DNA Fountain

Design

We chose DNA Fountain code, put forward by Erlich and Zielinski in 2017 [1], as the channel code for encoding data in our large-scale DNA data storage experiment.

The basic idea of the Fountain code design is as follows: to encode, divide a file into N data chunks and regard each chunk as a variable, then construct (1+alpha)N equations by selecting several chunks and adding them together in the binary field (XOR); to decode, one only needs to obtain any (1+k)N of those equations and solve them to recover the data chunks (a typical value of k is 0.05).

An intuitive picture of the above process: every equation can be thought of as a droplet. When encoding, a fountain sprays droplets almost endlessly (as there are so many possible combinations of chunks), and one only has to hold up a glass and collect any (1+k)N droplets to decode the original data.
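The droplet idea above can be sketched as a toy program. This is an illustration of the principle, not the paper's exact code: each droplet records a random seed plus the XOR of the chunks that seed selects, so the receiver can re-derive the chunk indices, and a simple peeling decoder solves the equations.

```python
import random

CHUNK_SIZE = 4  # bytes per chunk; kept tiny for illustration

def make_droplet(chunks, seed):
    """One droplet = (seed, XOR of the chunks selected by that seed)."""
    rng = random.Random(seed)
    degree = rng.randint(1, len(chunks))       # toy degree distribution
    idxs = rng.sample(range(len(chunks)), degree)
    payload = bytes(CHUNK_SIZE)
    for i in idxs:
        payload = bytes(a ^ b for a, b in zip(payload, chunks[i]))
    return seed, payload

def peel_decode(droplets, n_chunks):
    """Recover chunks with the classic peeling decoder."""
    recovered = [None] * n_chunks
    pending = []
    for seed, payload in droplets:
        rng = random.Random(seed)              # re-derive the selection
        degree = rng.randint(1, n_chunks)
        idxs = set(rng.sample(range(n_chunks), degree))
        pending.append((idxs, payload))
    progress = True
    while progress:
        progress = False
        for idxs, payload in pending:
            unknown = [i for i in idxs if recovered[i] is None]
            if len(unknown) == 1:              # exactly one unsolved chunk
                solo = unknown[0]
                for i in idxs:
                    if i != solo:              # XOR out the known chunks
                        payload = bytes(a ^ b
                                        for a, b in zip(payload, recovered[i]))
                recovered[solo] = payload
                progress = True
    return recovered

chunks = [b"DNA!", b"data", b"demo"]
droplets = [make_droplet(chunks, seed) for seed in range(60)]
print(peel_decode(droplets, len(chunks)))
```

Collecting more droplets than chunks gives the slack that decoding needs; real DNA Fountain uses a soliton degree distribution rather than the uniform one assumed here.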

It is a very smart idea to use Fountain code for encoding data into DNA, as it can:

  • ensure that biochemical constraints are satisfied: a screening step can be added during encoding: for any generated droplet, check whether it meets the requirements on GC content and homopolymer length; if it passes the screen, add it to the encoding result; if not, discard it and generate a new droplet. Since any desired number of droplets can be generated, we can eventually obtain a sufficient number of valid droplets.
  • provide resistance to sequence loss: even if several DNA sequences are lost, the original file can still be recovered as long as the number of remaining sequences is above (1+k)N.
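The screening step in the first point is a simple predicate. The thresholds below (GC between 0.45 and 0.55, homopolymers no longer than 3) match the figures quoted later on this page; in a real pipeline they would be set by the synthesis and sequencing platform.

```python
from itertools import groupby

def passes_screen(seq, gc_low=0.45, gc_high=0.55, max_run=3):
    """Keep a candidate droplet only if GC content and the longest
    homopolymer both fall inside the allowed ranges."""
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    longest_run = max(len(list(g)) for _, g in groupby(seq))
    return gc_low <= gc <= gc_high and longest_run <= max_run

# Failing droplets are simply discarded and regenerated.
print(passes_screen("ACGTACGT"))   # balanced GC, no long runs -> True
print(passes_screen("GGGGACGT"))   # run of four G's -> False
```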

This figure shows the encoding and decoding process of Fountain code:

Limitations

Although DNA Fountain has the appealing features above, we also observed several limitations while testing Fountain code in simulation and analyzing sequencing data from our real-world experiment:

Severe consequences when decoding fails

The first disadvantage arises from the difference between communication and storage. In communication applications such as mobile TV, where Fountain code was originally used, when the number of dropouts increases and receivers do not receive enough droplets, the sender can keep sending new droplets until the number is sufficient for decoding (and the sender does not need to know which droplets were lost, which is the advantage of Fountain code). In storage applications, however, when too many dropouts occur and the number of droplets is not enough for decoding, the original file simply cannot be recovered. Due to the encoding structure of Fountain code, the consequence is sometimes very severe: reducing the number of droplets below a certain point (like the destination of the red arrow in the figure) may cause decoding of most data chunks to fail, whereas without Fountain code the other chunks would still be available.

Error spreading and error-correcting efficiency

Fountain code itself cannot correct errors within a droplet, and one error in a droplet may influence many chunks; thus an extra error-correcting code must be applied to each chunk to ensure that every chunk used in decoding is 100% correct. In the original paper and in our experiment, an RS code is added to every chunk to correct errors within that chunk. However, the error distribution in the analyzed sequencing reads is uneven: a large number of chunks may have no errors at all, so the error-correcting capacity cannot be utilized efficiently, and the problem cannot be solved with the approach, used by Microsoft [2], of adding RS code to a large data block consisting of multiple chunks.

Extra redundancy, especially for small files

Alpha*N redundant DNA sequences are added during encoding, but only (alpha - k)*N strands are allowed to be lost. The influence may be trivial for large files with many data chunks (like the 72,000 in the original paper), but for the small files in our experiment k can be large (15% or more), so Fountain code is not a perfect way to encode small files.
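A quick arithmetic check makes the budget concrete; the alpha and k values below are illustrative assumptions, not measured figures from the experiment.

```python
def overhead(n_chunks, alpha, k):
    """Return (total strands synthesized, strands that may be lost)."""
    strands = round((1 + alpha) * n_chunks)
    loss_budget = round((alpha - k) * n_chunks)
    return strands, loss_budget

# A small file: 20% redundancy, but k = 15% leaves a thin loss budget:
# 200 extra strands are synthesized, yet only 50 of them may be lost.
print(overhead(1000, 0.20, 0.15))   # (1200, 50)
```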

Inspirations

Encryption

We also found an encryption method embedded in DNA Fountain encoding, which will be discussed later in the encryption section.

Iterative generation and screening

The generate-and-screen strategy is great for encoding "perfect DNA" in terms of satisfying biochemical constraints: it adds no extra redundancy, and the DNA can be shaped however you want by changing the screening condition. But at present it must be used together with Fountain code, which may not be the best option for every data storage need. Can we design a method that deals only with the biochemical constraints and integrates well with other error-correcting codes?

Channel Encoding: DNA Mask

Inspiration from QR code

To avoid certain unwanted patterns in QR codes, such as large blank areas, a masking strategy is used to transform the original picture into a desired one during encoding. With the mask index stored in the scanned QR code, the original mask can be found to transform the picture back and recover the data. As extreme GC content and long homopolymers can also be regarded as undesired patterns, can the same strategy be implemented in DNA channel encoding?

XOR data with a mask

Performing an XOR operation with a single random mask cannot guarantee a sequence that fits the biological requirements; it can only randomize the data, i.e., the masked sequence behaves like a random sequence and has the same probability of fitting the requirements as a randomly generated one. In computational experiments we found the pass rate of a randomly generated 96 nt DNA sequence to be around 1/5 (0.45 < GC < 0.55, no homopolymer longer than 3).
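The ~1/5 figure can be reproduced with a small Monte Carlo sketch: generate random 96 nt sequences and count how many satisfy 0.45 < GC < 0.55 with no homopolymer longer than 3. The trial count and seed below are arbitrary.

```python
import random
from itertools import groupby

def passes(seq):
    """The same biological screen: GC near 0.5, homopolymers <= 3."""
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    run = max(len(list(g)) for _, g in groupby(seq))
    return 0.45 < gc < 0.55 and run <= 3

rng = random.Random(0)
trials = 5000
hits = sum(passes("".join(rng.choice("ACGT") for _ in range(96)))
           for _ in range(trials))
print(hits / trials)   # roughly 0.2, matching the ~1/5 estimate
```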

In Microsoft's previous work [2], the biological constraints are instead met using Goldman's approach[] in the second-to-last stage, in which data is transformed to base 3 and then encoded back to base 4 to avoid repeats, resulting in a 1/4 decrease in information density.

XOR data with a mask set using the generate-and-screen strategy

What happens if we XOR the DNA with a set of masks? If the success probability with one mask is 1/5, then, since masks are randomly generated, the same generate-and-screen strategy with N masks raises the success probability to 1 - (4/5)^N. With 16 masks the success rate is about 97.2%, and with 64 masks about 99.99994%; even if no mask yields a perfectly fitting sequence, we can still select a relatively good one. This coding method satisfies the biochemical constraints at a relatively low cost (a 2 nt or 3 nt mask index, compared to 1/4 * 96 nt), and it integrates well with other encoding methods. For example, in Microsoft's former work [2], Goldman's approach (in which data is transformed to base 3 and encoded back to base 4 to avoid repeats, resulting in a 1/4 decrease in information density) can be replaced by DNA Mask without changing other parts of the encoding pipeline.
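The whole scheme can be sketched as follows: XOR the data sequence with each mask in a fixed, pre-shared mask set, keep the first masked sequence that passes the screen, and store the mask index (2 nt for 16 masks, 3 nt for 64) so decoding can undo the XOR. The base-level XOR over 2-bit values and the fixed mask-set seed are illustrative assumptions.

```python
import random
from itertools import groupby

BASES = "ACGT"
B2I = {b: i for i, b in enumerate(BASES)}

def xor_seq(a, b):
    """XOR two sequences over the 2-bit representation of each base."""
    return "".join(BASES[B2I[x] ^ B2I[y]] for x, y in zip(a, b))

def passes(seq):
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    run = max(len(list(g)) for _, g in groupby(seq))
    return 0.45 < gc < 0.55 and run <= 3

def make_masks(n_masks, length, seed=0):
    """A fixed, pre-shared mask set; both sides derive it from the seed."""
    rng = random.Random(seed)
    return ["".join(rng.choice(BASES) for _ in range(length))
            for _ in range(n_masks)]

def mask_encode(seq, masks):
    for idx, mask in enumerate(masks):
        masked = xor_seq(seq, mask)
        if passes(masked):
            return idx, masked        # store index alongside the strand
    return None  # a full design would fall back to the best-scoring mask

def mask_decode(idx, masked, masks):
    return xor_seq(masked, masks[idx])  # XOR is its own inverse

masks = make_masks(64, 96)
data = "A" * 96                         # pathological input: all one base
result = mask_encode(data, masks)
if result is not None:
    idx, masked = result
    assert mask_decode(idx, masked, masks) == data
```

Because A maps to 0 here, masking an all-A strand just yields the mask itself, which is exactly the "masked data behaves like a random sequence" property the previous section describes.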

This picture compares the encoding and decoding processes of DNA Mask and DNA Fountain. The same generate-and-screen technique can easily be seen in the picture.

DNA Mask encoding can be used like a BioBrick in other encoding pipelines, solving the problem of biological constraints with little loss of information density. For example, it can easily be embedded into Microsoft's previous work [2] in place of Goldman's approach, saving more than 20% of the oligos.

So with the encoding techniques mentioned above, we can now encode a file into DNA and fully retrieve it. But how can we ensure that only we can get the file back, preventing others from stealing our secrets? How can data encryption be implemented in DNA data storage? Click here to see our approach.

[1] Erlich, Yaniv, and Dina Zielinski. "DNA Fountain enables a robust and efficient storage architecture." Science 355.6328 (2017): 950-954.

[2] Organick, Lee, et al. "Random access in large-scale DNA data storage." Nature biotechnology 36.3 (2018): 242.