Team:Tsinghua-A/Description

DNA Data Storage

It is increasingly difficult for current data storage media to keep up with the explosive growth of data generated around the world. Take the first image of a black hole, released in April 2019, as an example: to produce the picture, scientists at MIT had to process 5 petabytes of data, which takes over 1000 pounds of hard drives to store. Modern astronomical telescopes can generate 2 petabytes of this kind of data overnight, and that data has to be stored for decades for future generations of scientists. Given the poor data density and short lifetime (only a few years) of current media, new data storage media must be found.

DNA, the molecule that encodes biological information with remarkable longevity and enormous information density, has emerged as a promising storage medium. All the data in a data center the size of a football field could be stored in DNA about the size of a sugar cube, and it can last for thousands of years thanks to the stability of DNA. Fully exploring the potential of DNA storage would open up a series of thrilling applications: for example, astronauts could carry all of human knowledge into space with just a box of DNA and a sequencing machine!

Our project

DNA synthesis, decay, and sequencing inevitably introduce errors, so to fully recover a file we have to encode it with a channel coding technique. The technique must also respect the biochemical limitations of DNA synthesis: current technology can only synthesize DNA molecules of 100–200 nucleotides, so the file must be broken into pieces when encoding and reassembled when decoding, and each DNA sequence must have a suitable GC content and no long homopolymers.
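
The constraints above can be illustrated with a minimal Python sketch. It is not our actual encoder: the two-bits-per-base mapping, the 32-byte chunk size (128 nucleotides), and the thresholds (45–55% GC, homopolymer runs of at most 3) are assumptions chosen only for illustration.

```python
from itertools import groupby

BASES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def bytes_to_dna(data: bytes) -> str:
    """Map each byte to four nucleotides, two bits per base."""
    return "".join(BASES[(byte >> shift) & 0b11]
                   for byte in data for shift in (6, 4, 2, 0))


def dna_to_bytes(seq: str) -> bytes:
    """Inverse of bytes_to_dna; len(seq) must be a multiple of 4."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        value = 0
        for base in seq[i:i + 4]:
            value = (value << 2) | BASES.index(base)
        out.append(value)
    return bytes(out)


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base."""
    return max(len(list(run)) for _, run in groupby(seq))


def is_synthesizable(seq: str, gc_min=0.45, gc_max=0.55, max_run=3) -> bool:
    """Check the (assumed) GC-content and homopolymer limits."""
    return (gc_min <= gc_content(seq) <= gc_max
            and longest_homopolymer(seq) <= max_run)


def split_payload(data: bytes, chunk_bytes: int = 32) -> list:
    """Break a file into pieces short enough for one synthesized strand."""
    return [data[i:i + chunk_bytes] for i in range(0, len(data), chunk_bytes)]
```

A direct mapping like this gives no guarantee that a chunk passes `is_synthesizable`, which is one reason a real encoder has to combine the mapping with a constraint-aware coding scheme.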

After modeling the DNA channel and trying some state-of-the-art encoding techniques such as the DNA Fountain code [3], we found that current techniques are far from perfect. Some characteristics of the DNA channel, such as the uneven molecule distribution caused by PCR and other manipulations, have not been taken into account, and the fountain code takes very long to run on large files. We therefore aim either to improve on current encoding techniques or to find a new encoding technique better suited to DNA storage.
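
For readers unfamiliar with fountain codes, the following is a toy sketch of the general idea (XOR random subsets of file segments into "droplets", then peel them apart when decoding). It is a simplification, not the published DNA Fountain implementation: the uniform degree distribution with at most 3 segments per droplet, the function names, and the `accept` screening hook are all assumptions made for illustration, and all segments are assumed to have equal length.

```python
import random


def choose_indices(seed, n_segments, max_degree=3):
    """Derive a droplet's segment subset from its seed alone (toy distribution)."""
    rng = random.Random(seed)
    degree = rng.randint(1, min(max_degree, n_segments))
    return set(rng.sample(range(n_segments), degree))


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def encode(segments, n_droplets, accept=lambda payload: True):
    """Emit (seed, payload) droplets; `accept` can reject unsuitable payloads."""
    droplets, seed = [], 0
    while len(droplets) < n_droplets:
        payload = bytes(len(segments[0]))
        for i in choose_indices(seed, len(segments)):
            payload = xor_bytes(payload, segments[i])
        if accept(payload):
            droplets.append((seed, payload))
        seed += 1  # rejected seeds are simply skipped
    return droplets


def decode(droplets, n_segments):
    """Peeling decoder: repeatedly resolve droplets with one unknown segment."""
    pending = [[choose_indices(seed, n_segments), payload]
               for seed, payload in droplets]
    recovered = {}
    progress = True
    while progress and len(recovered) < n_segments:
        progress = False
        for entry in pending:
            indices, payload = entry
            # Remove the contribution of segments that are already known.
            for i in [i for i in indices if i in recovered]:
                payload = xor_bytes(payload, recovered[i])
                indices.discard(i)
            entry[1] = payload
            if len(indices) == 1:
                recovered[indices.pop()] = payload
                progress = True
    return [recovered.get(i) for i in range(n_segments)]


segments = [b"ACME", b"DATA", b"TEST", b"DONE"]
droplets = encode(segments, n_droplets=40)
restored = decode(droplets, len(segments))  # entries are None if unrecovered
```

Because any sufficiently large set of droplets can be decoded, losing individual strands is tolerable. The `accept` hook marks where a sequence screen (for example on GC content and homopolymers) could be plugged in.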

To retrieve a particular file from a DNA pool, we do not want to read all the sequences, so random access and retrieval techniques are also crucial in DNA storage. Random access can be performed by applying PCR to the pool with a file-specific primer, as has been done in various papers. To make this more practical, we are going to explore applying multiplex PCR and Relay PCR [5] in DNA storage to access multiple files at one time (among other benefits). Because a primer sequence by itself carries no information about the content it retrieves, we also analyze how meaning can be attached to primers. For image retrieval, we apply image feature extraction and similarity search: we cluster images using a VGG16 neural network and PCA, then map the clusters to primers.
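
Primer-based random access can be pictured with a small in-silico sketch. It models PCR in a highly idealized way (exact primer matches, no amplification bias), and the strand layout, primer sequences, and cluster labels below are hypothetical examples rather than sequences used in our experiments.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def tag(payload: str, fwd: str, rev: str) -> str:
    """Assumed strand layout: forward primer + payload + reverse-primer site."""
    return fwd + payload + reverse_complement(rev)


def amplify(pool, primer_pairs):
    """Return payloads of strands flanked by any of the given primer pairs.

    Passing several pairs at once mimics retrieving multiple files in a
    single multiplex reaction.
    """
    selected = []
    for strand in pool:
        for fwd, rev in primer_pairs:
            if strand.startswith(fwd) and strand.endswith(reverse_complement(rev)):
                selected.append(strand[len(fwd):len(strand) - len(rev)])
                break
    return selected


# Hypothetical mapping from image-cluster labels to primer pairs.
primers_by_cluster = {
    "cluster_0": ("ACGTAC", "CAGTCA"),
    "cluster_1": ("GGATCC", "AAGGTC"),
}

pool = [
    tag("TTTT", *primers_by_cluster["cluster_0"]),
    tag("GGGG", *primers_by_cluster["cluster_0"]),
    tag("CCCC", *primers_by_cluster["cluster_1"]),
]
one_cluster = amplify(pool, [primers_by_cluster["cluster_0"]])
both_clusters = amplify(pool, list(primers_by_cluster.values()))
```

In this picture, a similarity search reduces to choosing the cluster whose images are closest to the query and then amplifying with that cluster's primer pair.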

To improve the security of DNA storage, we also encrypt files with a chaotic encryption algorithm [6, 7]. Chaotic encryption provides a large key space, high sensitivity to the key, and high efficiency (in terms of time and space complexity).
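
As a rough illustration of how a chaotic map can drive encryption, the sketch below derives a keystream from the logistic map and XORs it with the data. The logistic map is used here only because it is the simplest textbook example; it is not necessarily the map used in [6], [7], or our project, and the parameter values (`r = 3.99`, a burn-in of 100 iterations) are assumptions. This toy cipher should not be considered secure.

```python
def logistic_keystream(x0: float, r: float, length: int, burn_in: int = 100) -> bytes:
    """Iterate x -> r * x * (1 - x) and quantize each state to one byte.

    x0 must lie strictly between 0 and 1; (x0, r) together act as the key.
    """
    x = x0
    for _ in range(burn_in):  # discard the transient so nearby keys diverge
        x = r * x * (1 - x)
    stream = bytearray()
    for _ in range(length):
        x = r * x * (1 - x)
        stream.append(int(x * 256) % 256)
    return bytes(stream)


def chaotic_xor(data: bytes, x0: float, r: float = 3.99) -> bytes:
    """Encrypt or decrypt (the operation is symmetric) with the keystream."""
    keystream = logistic_keystream(x0, r, len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))


message = b"DNA storage with chaotic encryption!"
ciphertext = chaotic_xor(message, x0=0.3141)
plaintext = chaotic_xor(ciphertext, x0=0.3141)   # same key restores the data
wrong_key = chaotic_xor(ciphertext, x0=0.3142)   # a nearby key should not
```

The sensitivity claimed above shows up as the divergence of keystreams generated from nearly identical initial values, and the cost is one multiplication pass over the data with constant extra memory.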

Related projects

NEFU_China

References

[1] Stewart, K., Chen, Y. J., Ward, D., Liu, X., Seelig, G., Strauss, K., & Ceze, L. (2018, October). A content-addressable DNA database with learned sequence encodings. In International Conference on DNA Computing and Molecular Programming (pp. 55-70). Springer, Cham.

[2] Heckel, R., Mikutis, G., & Grass, R. N. (2018). A characterization of the DNA data storage channel. arXiv preprint arXiv:1803.03322.

[3] Erlich, Y., & Zielinski, D. (2017). DNA Fountain enables a robust and efficient storage architecture. Science, 355(6328), 950-954.

[4] Adleman, L. M. (1994). Molecular computation of solutions to combinatorial problems. Science, 266(5187), 1021-1024.

[5] Yasmeen, A., Du, F., Zhao, Y., Dong, J., Chen, H., Huang, X., ... & Tang, Z. (2016). Sequence-Specific Biosensing of DNA Target through Relay PCR with Small-Molecule Fluorophore. ACS chemical biology, 11(7), 1945-1951.

[6] Matthews, R. (1989). On the derivation of a “chaotic” encryption algorithm. Cryptologia, 13(1), 29-42.

[7] Gao, H., Zhang, Y., Liang, S., & Li, D. (2006). A new chaotic algorithm for image encryption. Chaos, Solitons & Fractals, 29(2), 393-399.