Team:Tsinghua-A/Experiments

Experiment

Encoding

The first step of the experiment is to choose the files to store, preferably ones with little repetition and high disorder. We then open the file in binary mode and set the fountain-code parameters, such as the droplet length. Calling the generator function produces chunks continuously. Each generated chunk must satisfy the constraints on homopolymer length and GC content. Generation continues until there are enough chunks to recover the original file, and then 5% more chunks are generated as redundancy. The file's address code need not match the chunk data in any particular way. The last step is to append a pair of addresses to the two ends of each chunk and save them into a CSV file. In this design, the total redundancy is close to 20%, and it decreases as the file size grows.
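The constraint screening step above can be sketched as follows. This is an illustrative example, not our actual pipeline code; the 2-bit byte-to-base mapping, the homopolymer limit of 3, and the GC window of 45-55% are assumptions chosen for demonstration.

```python
BASES = "ACGT"

def to_dna(data: bytes) -> str:
    """Map each byte to four bases (2 bits per nucleotide)."""
    return "".join(BASES[(b >> s) & 3] for b in data for s in (6, 4, 2, 0))

def passes_constraints(seq, max_homopolymer=3, gc_low=0.45, gc_high=0.55):
    """Reject oligos with long homopolymer runs or skewed GC content."""
    # Scan for homopolymer runs (e.g. "AAAA"), which are error-prone
    # to synthesize and sequence.
    run = 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        if run > max_homopolymer:
            return False
    # Keep GC content inside the target window.
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return gc_low <= gc <= gc_high
```

A generated droplet is converted to bases with `to_dna` and only counts toward the required chunk total if `passes_constraints` accepts it; otherwise the generator simply produces the next droplet, which is the key convenience of a fountain code.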


Synthesis

Compared with outsourcing to a commercial company, synthesis in a school lab costs much more and is far less efficient, so we had our DNA sequences synthesized through the iGeneTech 12k chip oligo pool service. The achievable DNA strand length is determined by the synthesis conditions. Because the error rate of single-base synthesis is about 0.5%, we selected 126 nt as our chunk length. In this design, the coding efficiency is 51%.


Storage

The storage phase requires no active operation. DNA should be stored at -20 °C in a dry environment to minimize mutation and chemical damage. Although DNA remains stable under harsher conditions, treating it gently yields higher-quality sequences, which helps demonstrate the feasibility of DNA storage.


Access

Before reading the data, the DNA containing the required information must be extracted, and its concentration must be high enough for high-throughput sequencing. In our design, all chunks of one file share the same address, and different files use different addresses. Therefore, by using a specific address as the PCR primer, only the matching file is amplified. The amplified DNA then vastly outnumbers the unamplified DNA, so the amplification can be viewed as a read operation. In the first PCR round the primer is an adapter-address composite primer, and in the second round it is an index-adapter composite primer. The sample can then be sent to the sequencing machine. The PCR protocol is shown in the table below.
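The random-access idea can be illustrated in silico: strands flanked by the target file's address pair are multiplied while everything else is left alone, so the selected file comes to dominate the pool. The function name, address arguments, and amplification factor are illustrative assumptions, not part of the wet-lab protocol.

```python
def amplify(pool, addr_fwd, addr_rev, factor=1000):
    """In-silico sketch of address-based random access.

    pool maps strand sequences to copy counts; strands that start with
    the forward address and end with the reverse address are 'amplified'
    by the given factor, mimicking PCR selection of one file.
    """
    out = {}
    for strand, count in pool.items():
        selected = strand.startswith(addr_fwd) and strand.endswith(addr_rev)
        out[strand] = count * (factor if selected else 1)
    return out
```

After amplification, sampling the pool almost always returns strands of the addressed file, which is why the operation behaves like a read.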


Sequencing

Sequencing is in the same situation as synthesis, so we chose the Illumina NextSeq 500 system from iGeneTech to sequence our sample. We amplified each of the 9 files separately under two PCR conditions. Because of paired-end sequencing, we finally obtained 36 high-throughput sequencing data files, totaling 2 GB of raw data in gz format. The files from the 10-cycle PCR condition were taken for analysis because of their higher quality.


Decoding

Decoding is divided into three parts: collecting, processing, and analyzing. Data collection means unzipping the high-throughput sequencing files into FASTQ format; each FASTQ record contains the sequence, its per-base quality, and other information. In data processing, quality screening is performed on all reads and poor-quality sequences are removed. Address screening then keeps only the reads whose addresses are completely correct, and the necessary cuts are made to remove the adapters. Finally, the occurrences of each distinct sequence are counted to produce a frequency table. In data analysis, reads with too low a frequency are ignored, and two methods are tried to restore the stored data. The first method compares the obtained reads with the stored sequences. A good measure of the similarity of two sequences is the Levenshtein distance, also known as the edit distance: if the distance is within a threshold, the read is regarded as having the same origin as the reference. Mismatches may come from mutations or sequencing errors. In this way we recovered most of the original information and proved that the sequencing files did contain enough information for recovery. The second method is to self-cluster the reads using other parameters: when one read matches another, the read with the higher frequency becomes the center of the class. Once the number of classes reaches its upper limit, the consensus sequence of each class's records is computed as the input to the decoder. With this approach, however, we found it difficult to decode large files successfully without any prior knowledge.
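The first analysis method can be sketched as below: a standard dynamic-programming Levenshtein distance, plus a matcher that assigns each read to a stored reference within a threshold. The threshold value and function names are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def assign_reads(reads, references, threshold=3):
    """Map each read (with its frequency) to the first stored reference
    within the edit-distance threshold; reads matching nothing are dropped.
    Returns accumulated counts per reference.
    """
    hits = {}
    for read, count in reads.items():
        for ref in references:
            if levenshtein(read, ref) <= threshold:
                hits[ref] = hits.get(ref, 0) + count
                break
    return hits
```

This is the comparison-with-stored-data method, so it requires knowing the references; it verifies that the sequencing output carries enough information, but it is not a blind decoder.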

Taking one file as an example, 1,143,116 reads were found in the untreated sequencing file, and 494,562 reads remained after quality screening and address screening. Of these, 375,563 reads passed sequence alignment, but only 46,299 reads were exactly identical to a stored sequence, equivalent to only about 5% of all reads. This also explains why the second, clustering-based method struggles: because the correct reads are not in the absolute majority, it is hard to ensure that a generated class corresponds to an original strand. In practice it often happened that all classes were used up while the decoding program still needed more correct data. There are many reasons why decoding sometimes failed, but judging from the differences in the analyses of different files, the direct cause is that the amount of qualified data is too small, and the fundamental cause is that the error rate is too high. Fortunately, device accuracy is improving over time. What we can do is reduce the number of PCR rounds, because every time a base is added to a DNA strand, there is a chance of mismatch.
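The failure mode of the second method can be seen in a greedy self-clustering sketch like the one below: reads are walked from most to least frequent, each read joins the first class whose center is close enough, and otherwise founds a new class until the class limit is hit. The distance measure here is a simple mismatch count for equal-length reads, and the threshold and class limit are illustrative assumptions, not our actual parameters.

```python
def mismatches(a, b):
    """Mismatch count for equal-length reads; unequal lengths never match.
    (A simplification of the edit distance used elsewhere in the pipeline.)"""
    if len(a) != len(b):
        return len(a) + len(b)
    return sum(x != y for x, y in zip(a, b))

def cluster_reads(freq_table, threshold=1, max_classes=100):
    """Greedy self-clustering: higher-frequency reads become class centers,
    later reads within the threshold join them. Once max_classes centers
    exist, unmatched reads are discarded.
    """
    classes = []  # each class: {"center": read, "members": [(read, count)]}
    for read, count in sorted(freq_table.items(), key=lambda kv: -kv[1]):
        for c in classes:
            if mismatches(read, c["center"]) <= threshold:
                c["members"].append((read, count))
                break
        else:
            if len(classes) < max_classes:
                classes.append({"center": read, "members": [(read, count)]})
    return classes
```

When only ~5% of reads are exactly correct, many high-frequency erroneous reads seize class slots before the true strands do, which is exactly the "classes are fully used but correct data is still missing" situation we observed.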

Result

We found the distribution of oligos similar to the distribution predicted in our modeling. The 5% redundancy we set was a little risky: slightly less than 5% of the sequences were lost, and although most of the files could be decoded successfully, we failed to get enough sequences to decode the first file. We also analyzed the error rate at different positions within a sequence (blue for insertion, green for deletion, red for substitution, and orange for the total error rate, in %).

To make DNA data storage available to more people, we also developed a piece of software for encoding and decoding data into DNA.