DNA digital data storage is the process of encoding and decoding binary data to and from synthesized strands of DNA. While DNA as a storage medium has enormous potential because of its high storage density, its practical use is currently severely limited because of its high cost and very slow read and write times. In June 2019, scientists reported that all 16 GB of text from the English Wikipedia had been encoded into synthetic DNA. In 2021, scientists reported that a custom DNA data writer had been developed that was capable of writing data into DNA at 1 Mbps.


Encoding methods

Many methods for encoding data in DNA are possible. The optimal methods are those that make economical use of DNA and protect against errors. If the message DNA is intended to be stored for a long period of time, for example, 1,000 years, it is also helpful if the sequence is obviously artificial and the reading frame is easy to identify.


Encoding text

Several simple methods for encoding text have been proposed. Most of these involve translating each letter into a corresponding "codon", consisting of a unique small sequence of nucleotides in a lookup table. Some examples of these encoding schemes include Huffman codes, comma codes, and alternating codes.
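As a minimal sketch of the lookup-table idea, the following assumes a hypothetical fixed-length code in which each of 27 symbols maps to one three-nucleotide codon. The alphabet, the codon order, and the function names are made up for illustration and do not correspond to any of the published schemes named above.

```python
from itertools import product

# Hypothetical 27-symbol alphabet: capital letters plus space.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

# All 64 three-nucleotide codons in a fixed order; only the first 27 are used.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

ENCODE = dict(zip(ALPHABET, CODONS))
DECODE = {codon: ch for ch, codon in ENCODE.items()}


def encode_text(text: str) -> str:
    """Translate each character into its codon via the lookup table."""
    return "".join(ENCODE[ch] for ch in text.upper())


def decode_text(dna: str) -> str:
    """Read the strand three nucleotides at a time and reverse the lookup."""
    return "".join(DECODE[dna[i:i + 3]] for i in range(0, len(dna), 3))


print(encode_text("DNA"))
print(decode_text(encode_text("DNA STORAGE")))
```

A fixed codon length keeps the reading frame trivial to recover, at the price of using the same number of nucleotides for common and rare letters; variable-length schemes such as Huffman codes trade that simplicity for a shorter strand.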


Encoding arbitrary data

To encode arbitrary data in DNA, the data is typically first converted into ternary (base 3) data rather than binary (base 2) data. Each digit (or "trit") is then converted to a nucleotide using a lookup table. To prevent homopolymers (repeating nucleotides), which can cause problems with accurate sequencing, the result of the lookup also depends on the preceding nucleotide. In one example lookup table, if the previous nucleotide in the sequence is T (thymine) and the trit is 2, the next nucleotide will be G (guanine). Various systems may be incorporated to partition and address the data, as well as to protect it from errors. One approach to error correction is to regularly intersperse synchronization nucleotides between the information-encoding nucleotides. These synchronization nucleotides can act as scaffolds when reconstructing the sequence from multiple overlapping strands.
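The trit-to-nucleotide step can be sketched as follows. The full lookup table is not reproduced in this text, so the table in the code is an assumption: a rotating table chosen to be consistent with the one example given (previous nucleotide T, trit 2, next nucleotide G). The fixed six trits per byte and the starting nucleotide "A" are also simplifications made for this sketch, not part of any published scheme.

```python
# Assumed table: NEXT[previous][trit] gives the next nucleotide.
# No row contains its own key, so two equal nucleotides never follow each other.
NEXT = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def bytes_to_trits(data: bytes) -> list[int]:
    """Write each byte as six base-3 digits, most significant first (3**6 >= 256)."""
    trits = []
    for value in data:
        digits = []
        for _ in range(6):
            value, remainder = divmod(value, 3)
            digits.append(remainder)
        trits.extend(reversed(digits))
    return trits


def trits_to_bytes(trits: list[int]) -> bytes:
    """Inverse of bytes_to_trits for well-formed input."""
    out = bytearray()
    for i in range(0, len(trits), 6):
        value = 0
        for trit in trits[i:i + 6]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


def encode(trits: list[int], prev: str = "A") -> str:
    """Map each trit to a nucleotide that depends on the preceding one."""
    strand = []
    for trit in trits:
        prev = NEXT[prev][trit]
        strand.append(prev)
    return "".join(strand)


def decode(strand: str, prev: str = "A") -> list[int]:
    """Recover the trits by looking up each nucleotide in the row of its predecessor."""
    trits = []
    for nucleotide in strand:
        trits.append(NEXT[prev].index(nucleotide))
        prev = nucleotide
    return trits


strand = encode(bytes_to_trits(b"DNA"))
print(strand, trits_to_bytes(decode(strand)))
```

Because each nucleotide is chosen from the three that differ from its predecessor, the strand carries one trit per nucleotide and cannot contain a homopolymer run, which is the property the text asks for.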


In vivo

The genetic code within living organisms can potentially be co-opted to store information. Furthermore, synthetic biology can be used to engineer cells with "molecular recorders" to allow the storage and retrieval of information stored in the cell's genetic material. CRISPR gene editing can also be used to insert artificial DNA sequences into the genome of the cell. For encoding developmental lineage data (a "molecular flight recorder"), roughly 30 trillion cell nuclei per mouse × 60 recording sites per nucleus × 7–15 bits per site yields about 2 terabytes per mouse written (but only very selectively read).
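As a rough check of the capacity estimate, the sketch below multiplies out the stated factors. Taken at face value, 30 trillion nuclei give roughly 1.6 to 3.4 petabytes rather than terabytes; a count of 30 billion nuclei, which is an assumption introduced here for comparison and not a figure from the text, gives roughly 1.6 to 3.4 terabytes, in line with the quoted 2 terabytes.

```python
def recorded_bytes(nuclei: int, sites_per_nucleus: int, bits_per_site: int) -> float:
    """Total recordable capacity in bytes for the given factors."""
    return nuclei * sites_per_nucleus * bits_per_site / 8


SITES_PER_NUCLEUS = 60

for label, nuclei in (
    ("30 trillion nuclei (as stated)", 30 * 10**12),
    ("30 billion nuclei (assumed for comparison)", 30 * 10**9),
):
    low = recorded_bytes(nuclei, SITES_PER_NUCLEUS, 7)
    high = recorded_bytes(nuclei, SITES_PER_NUCLEUS, 15)
    print(f"{label}: {low / 1e12:,.1f} to {high / 1e12:,.1f} TB")
```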


In-vivo light-based direct image and data recording

A proof-of-concept in vivo direct DNA data recording system was demonstrated by incorporating optogenetically regulated recombinases into an engineered "molecular recorder", which allows light-based stimuli to be encoded directly into engineered E. coli cells. This approach can also be parallelized to store and write text or data in 8-bit form through the use of physically separated individual cell cultures in cell-culture plates. It relies on the editing of a "recorder plasmid" by the light-regulated recombinases, which allows cell populations exposed to different stimuli to be identified, so the physical stimulus is encoded directly into the recorder plasmid through recombinase action. Unlike other approaches, this one does not require manual design, insertion and cloning of artificial sequences to record the data into the genetic code. In this recording process, the cell population in each culture well of a cell-culture plate can be treated as a digital "bit", functioning as a biological transistor capable of recording a single bit of data.
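The "one well, one bit" idea can be sketched in software. The layout below, eight wells per character with an illuminated well standing for 1, is a hypothetical arrangement chosen for illustration; the text does not specify how wells are assigned to bits.

```python
def char_to_wells(ch: str) -> list[bool]:
    """Return eight well states (True = exposed to light), most significant bit first."""
    code = ord(ch)
    if code > 255:
        raise ValueError("only 8-bit characters fit into eight wells")
    return [bool((code >> shift) & 1) for shift in range(7, -1, -1)]


def wells_to_char(wells: list[bool]) -> str:
    """Read eight well states back into one character."""
    value = 0
    for lit in wells:
        value = (value << 1) | int(lit)
    return chr(value)


def text_to_plate(text: str) -> list[list[bool]]:
    """One row of eight wells per character (hypothetical plate layout)."""
    return [char_to_wells(ch) for ch in text]


plate = text_to_plate("DNA")
for row in plate:
    print("".join("#" if lit else "." for lit in row))
print("".join(wells_to_char(row) for row in plate))
```

In the biological system the state of each well would be read out from the recorder plasmid of that cell population rather than from a Boolean value.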


History

The idea of DNA digital data storage dates back to 1959, when the physicist Richard P. Feynman, in "There's Plenty of Room at the Bottom: An Invitation to Enter a New Field of Physics", outlined the general prospects for creating artificial objects similar to objects of the microcosm (including biological ones) and having similar or even more extensive capabilities. In 1964–65, the Soviet physicist Mikhail Samoilovich Neiman published three articles about microminiaturization in electronics at the molecular-atomic level, which independently presented general considerations and some calculations regarding the possibility of recording, storing, and retrieving information on synthesized DNA and RNA molecules. After the publication of Neiman's first paper, and after the editor had received the manuscript of his second paper (8 January 1964, as indicated in that paper), an interview with the cybernetician Norbert Wiener was published. Wiener expressed ideas about the miniaturization of computer memory that were close to those proposed independently by Neiman, and Neiman mentioned Wiener's ideas in the third of his papers. This story has been described in detail.

One of the earliest uses of DNA storage occurred in a 1988 collaboration between artist Joe Davis and researchers from Harvard University. The image, stored in a DNA sequence in E. coli, was organized in a 5 × 7 matrix that, once decoded, formed a picture of an ancient Germanic rune representing life and the female Earth. In the matrix, ones corresponded to dark pixels while zeros corresponded to light pixels.

In 2007 a device was created at the
University of Arizona using addressing molecules to encode mismatch sites within a DNA strand. These mismatches could then be read out by performing a restriction digest, thereby recovering the data.

In 2011, George Church, Sri Kosuri, and Yuan Gao carried out an experiment to encode a 659 kB book co-authored by Church. To do this, the research team used a two-to-one correspondence in which a binary zero was represented by either an adenine or a cytosine and a binary one by either a guanine or a thymine. After examination, 22 errors were found in the DNA.

In 2012, George Church and colleagues at Harvard University published an article in which DNA was encoded with digital information that included an HTML draft of a 53,400-word book written by the lead researcher, eleven JPEG images and one JavaScript program. Multiple copies were added for redundancy, and 5.5 petabits can be stored in each cubic millimeter of DNA. The researchers used a simple code in which bits were mapped one-to-one with bases, which had the shortcoming that it led to long runs of the same base, the sequencing of which is error-prone. This result showed that, besides its other functions, DNA can also serve as a storage medium in the manner of hard disk drives and magnetic tapes.

In 2013, an article led by researchers from the European Bioinformatics Institute (EBI), submitted at around the same time as the paper of Church and colleagues, detailed the storage, retrieval, and reproduction of over five million bits of data. All the DNA files reproduced the information with an accuracy between 99.99% and 100%. The main innovations in this research were the use of an error-correcting encoding scheme to ensure the extremely low data-loss rate, as well as the idea of encoding the data in a series of overlapping short oligonucleotides identifiable through a sequence-based indexing scheme. Also, the sequences of the individual strands of DNA overlapped in such a way that each region of data was repeated four times to avoid errors. Two of these four strands were constructed backwards, also with the goal of eliminating errors. The costs per megabyte were estimated at $12,400 to encode data and $220 for retrieval. However, it was noted that the exponential decrease in DNA synthesis and sequencing costs, if it continued, should make the technology cost-effective for long-term data storage by 2023.

In 2013, software called DNACloud was developed by Manish K. Gupta and co-workers to encode computer files into their DNA representation. It implements a memory-efficient version of the algorithm proposed by Goldman et al. to encode (and decode) data to DNA (.dnac files).

The long-term stability of data encoded in DNA was reported in February 2015, in an article by researchers from
ETH Zurich. The team added redundancy via Reed–Solomon error correction coding and protected the DNA by encapsulating it within silica glass spheres via sol-gel chemistry.

In 2016, research by Church and Technicolor Research and Innovation was published in which 22 MB of an MPEG-compressed movie sequence were stored in and recovered from DNA. The recovery of the sequence was found to have zero errors.

In March 2017, Yaniv Erlich and Dina Zielinski of Columbia University and the New York Genome Center published a method known as DNA Fountain that stored data at a density of 215 petabytes per gram of DNA. The technique approaches the Shannon capacity of DNA storage, achieving 85% of the theoretical limit. The method was not ready for large-scale use, as it cost $7,000 to synthesize 2 megabytes of data and another $2,000 to read it.

In March 2018,
the University of Washington and Microsoft published results demonstrating storage and retrieval of approximately 200 MB of data. The research also proposed and evaluated a method for random access of data items stored in DNA. In March 2019, the same team announced that they had demonstrated a fully automated system to encode and decode data in DNA.

Research published by Eurecom and Imperial College in January 2019 demonstrated the ability to store structured data in synthetic DNA. The research showed how to encode structured or, more specifically, relational data in synthetic DNA, and also demonstrated how to perform data processing operations (similar to SQL) directly on the DNA as chemical processes.

In April 2019, through a collaboration with TurboBeads Labs in Switzerland, Mezzanine by Massive Attack was encoded into synthetic DNA, making it the first album to be stored in this way.

In June 2019, scientists reported that all 16 GB of Wikipedia had been encoded into synthetic DNA. In 2021, CATALOG reported that they had developed a custom DNA writer capable of writing data at 1 Mbps into DNA.

The first article describing data storage on native DNA sequences via enzymatic nicking was published in April 2020. In the paper, scientists demonstrated a new method of recording information in the DNA backbone, which enables bit-wise random access and in-memory computing.

In 2021, a research team at
Newcastle University led by N. Krasnogor implemented a stack data structure using DNA, allowing for last-in, first-out (LIFO) data recording and retrieval. Their approach used hybridization and strand displacement to record DNA signals in DNA polymers, which were then released in reverse order. The study demonstrated that data structure-like operations are possible in the molecular realm. The researchers also explored the limitations and future improvements for dynamic DNA data structures, highlighting the potential for DNA-based computational systems.
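The two-to-one bit mapping from the 2011 experiment described earlier in this section (a zero as A or C, a one as G or T) can be sketched as below. The rule used here to choose between the two candidate bases, taking whichever differs from the preceding base, is an assumption made for illustration and is not taken from the published procedure.

```python
ZERO_BASES = "AC"  # either base encodes a binary 0
ONE_BASES = "GT"   # either base encodes a binary 1


def encode_bits(bits: str) -> str:
    """Encode a string of '0'/'1' characters, never repeating the previous base."""
    strand = []
    prev = ""
    for bit in bits:
        options = ZERO_BASES if bit == "0" else ONE_BASES
        # Assumed tie-break: use the first option unless it would repeat the last base.
        base = options[0] if options[0] != prev else options[1]
        strand.append(base)
        prev = base
    return "".join(strand)


def decode_bases(strand: str) -> str:
    """Decoding ignores which of the two bases was chosen."""
    return "".join("0" if base in ZERO_BASES else "1" for base in strand)


print(encode_bits("0011"))
print(encode_bits("0000"))
print(decode_bases(encode_bits("0110100001101001")))
```

The freedom to pick between two bases costs density (one bit per base instead of a possible two) but gives the encoder room to avoid runs of the same base.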


Davos Bitcoin Challenge

On January 21, 2015, Nick Goldman from the European Bioinformatics Institute (EBI), one of the original authors of the 2013 Nature paper, announced the Davos Bitcoin Challenge at the World Economic Forum annual meeting in Davos. During his presentation, DNA tubes were handed out to the audience, with the message that each tube contained the private key of exactly one bitcoin, all coded in DNA. The first person to sequence and decode the DNA could claim the bitcoin and win the challenge. The challenge was set for three years and would close if nobody claimed the prize before January 21, 2018. Almost three years later, on January 19, 2018, the EBI announced that a Belgian PhD student, Sander Wuyts, of the University of Antwerp and Vrije Universiteit Brussel, was the first to complete the challenge. Alongside the instructions on how to claim the bitcoin (stored as a plain text file and a PDF file), the logo of the EBI, the logo of the company that printed the DNA (CustomArray), and a sketch of James Joyce were retrieved from the DNA.


The Lunar Library

The Lunar Library, launched on the Beresheet Lander by the Arch Mission Foundation, carries information encoded in DNA, which includes 20 famous books and 10,000 images. DNA was one of the preferred storage choices because it can last a long time; the Arch Mission Foundation suggests that it can still be read after billions of years. The lander crashed on 11 April 2019 and was lost.


DNA of things

The concept of the DNA of Things (DoT) was introduced in 2019 by a team of researchers from Israel and Switzerland, including Yaniv Erlich and Robert Grass. DoT encodes digital data into DNA molecules, which are then embedded into objects. This makes it possible to create objects that carry their own blueprint, similar to biological organisms. In contrast to the Internet of things, which is a system of interrelated computing devices, DoT creates objects that are independent storage objects, completely off-grid. As a proof of concept for DoT, the researchers 3D-printed a Stanford bunny which contains its blueprint in the plastic filament used for printing. By clipping off a tiny bit of the ear of the bunny, they were able to read out the blueprint, multiply it and produce a next generation of bunnies. In addition, the ability of DoT to serve steganographic purposes was shown by producing non-distinguishable lenses which contain a YouTube video integrated into the material.


See also

* DNA computing
* DNA nanotechnology
* Nanobiotechnology
* Natural computing
* Plant-based digital data storage
* 5D optical data storage


Further reading

* DNA Sequencing Caught in Deluge of Data. The New York Times (NYTimes.com).