Storing Data in DNA

by Will Hartridge


Since the digital revolution, technology has progressed rapidly, and data storage has become ever larger, faster and more widely available. As a modern civilisation, we rely heavily on data storage: the International Data Corporation predicts that by 2025 the Global Datasphere (the sum of all data that exists in the world) will be around 175 zettabytes, where one zettabyte is roughly 1 trillion gigabytes, the unit of data most people are familiar with. But there is a problem. What if we are producing too much data? With data production growing exponentially, we will soon reach the limits of density possible with our current storage methods. Furthermore, some projections state that by 2040 we will have used up all available memory-grade silicon, as it is rare in its pure form. A new solution is needed, and this is where DNA data storage comes in: using the molecule that codes for the complexity and beauty of every living organism to store data. I find this to be a remarkably strange loop: data, which is essentially human thoughts or ideas (in the form of photos, books, videos etc.), can now be stored in the very molecule that created them (by coding for the human brain).

DNA data storage is not just a research project carried out to show what is possible. It has several real advantages over traditional data storage, which could lead to it being adopted in the future.

  • DNA is very dense: able to store 215 petabytes per gram, it is “the densest known storage medium in the universe, according to the laws of physics”.

  • DNA is highly durable and can still be read after thousands, if not hundreds of thousands, of years if stored in optimal conditions. This is immense compared to the 3-5 year average lifespan of hard disk drives.

  • The technology will not go obsolete like other data storage technologies have done, for example, floppy drives. This is because humans will always have a need and desire to sequence DNA as long as we exist.

  • The process will always improve because its core technology is based on other popular and growing fields such as biotechnology and biochemistry, so their advances will go hand in hand.

  • When copying data normally, the time taken is proportional to the amount of data, so large amounts are very slow to copy. With DNA, all of the strands in a sample are copied in parallel using a technique called Polymerase Chain Reaction (PCR), so making copies takes roughly the same time however much data is stored.
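To put the density figure from the list above in perspective, here is a rough back-of-the-envelope calculation using only the two numbers quoted in this article (175 zettabytes and 215 petabytes per gram). It is an idealised estimate: it assumes the quoted density could be reached in practice and ignores the redundancy and indexing overhead described later.

```python
# Idealised estimate: how much DNA would hold the projected 2025 Global Datasphere?
ZETTABYTE = 10**21  # bytes
PETABYTE = 10**15   # bytes

datasphere_bytes = 175 * ZETTABYTE       # IDC projection quoted above
density_bytes_per_gram = 215 * PETABYTE  # DNA density quoted above

grams_needed = datasphere_bytes / density_bytes_per_gram
print(f"{grams_needed / 1000:.0f} kg of DNA")  # prints "814 kg of DNA"
```

In other words, under these assumptions all of the world's data would fit in under a tonne of DNA.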


However, this technology also has its fair share of hurdles to overcome before it becomes mainstream, and it will likely not be seen on a large scale for many years to come.


History

[Figure: The stored rune and its binary form]


In 1988, the artist Joe Davis collaborated with Dana Boyd, a molecular biologist, to demonstrate the first ever example of storing data using DNA. They stored a binary image of an ancient Germanic rune in the DNA of E. coli bacteria.

[Figure: The automated DNA data storage device]


Many attempts soon followed, but they failed to store more than tens of bytes. The first breakthrough in this field came in 2012 from a team led by George Church, which successfully stored and retrieved 659 kilobytes of data. Since then, many other researchers have also demonstrated the ability to store data using DNA, with examples including the novel “War and Peace” by Leo Tolstoy, the entire collection of Shakespeare’s sonnets, audio of Martin Luther King’s 1963 speech, a GIF of a galloping horse and much more. In 2019, a team of researchers from Microsoft and the University of Washington developed and demonstrated “the first fully automated system to store and retrieve data in manufactured DNA”. This system stored and retrieved the word “HELLO” without any human intervention and is a promising first look at what the future of this technology could be like.

How it works

Everything you can access with modern technology is simply made up of a sequence of 1s and 0s. This sequence of numbers, or binary digits (“bits” for short) can be used to represent almost every type of digital information such as videos, images, websites and games. The functionality of binary is vast and thus far, we have had no need to change the fundamental way we store our data.


Encoding- converting data to a DNA sequence

To store digital data in DNA, it first has to be converted into a sequence of bases. As is commonly known, there are 4 bases in DNA: Adenine (A), Thymine (T), Guanine (G) and Cytosine (C). There are also exactly four possible combinations of 2 bits (00, 01, 10 and 11), so if each base represents one pair of bits, any binary data can be converted into a nucleic acid sequence. Shown below is an example of how this could be achieved.

A- 00, G- 10, C- 01, T- 11
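As a minimal sketch of this mapping, the Python below converts bytes into a string of bases, two bits at a time. The function name `encode` is just an illustrative choice, and the code deliberately ignores the error handling, chunking and indexing that a real system needs.

```python
# Mapping taken from the example above: A = 00, G = 10, C = 01, T = 11.
BITS_TO_BASE = {"00": "A", "10": "G", "01": "C", "11": "T"}


def encode(data: bytes) -> str:
    """Convert raw bytes into a DNA base string, two bits per base."""
    # Write every byte as 8 binary digits, then join them into one bit string.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Translate each pair of bits into its base.
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


print(encode(b"HELLO"))  # prints CAGACACCCATACATACATT
```

Each byte becomes four bases, so the five letters of "HELLO" produce a 20-base sequence.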


However, it is not as simple as just converting each pair of bits into a base and then creating one long DNA sequence. As with most data storage methods, there are bound to be errors when reading and writing data, so these need to be accounted for to ensure they have no noticeable effect. The encoded DNA is split into overlapping chunks to create data redundancy, so that a base lost or misread in one fragment can be recovered from another. Furthermore, indexing information has to be added to the start and end of each fragment to ensure the DNA can be decoded in the correct order. This overhead, along with the errors that occur in the later stages of synthesis and sequencing, means that the theoretical maximum of 2 bits per base cannot actually be reached.
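The idea of overlapping, indexed fragments can be sketched as follows. This is a simplified illustration and not the scheme used by any particular research group: the fragment size, the step between fragments and the function names are all assumptions, and the index is kept as a plain Python integer rather than being encoded as bases at the ends of each fragment.

```python
def fragment(sequence: str, size: int = 8, step: int = 4) -> list[tuple[int, str]]:
    """Split a base sequence into overlapping fragments tagged with their start index."""
    return [(start, sequence[start:start + size]) for start in range(0, len(sequence), step)]


def reassemble(fragments: list[tuple[int, str]]) -> str:
    """Rebuild the sequence from indexed fragments supplied in any order."""
    sequence = ""
    for start, chunk in sorted(fragments):
        # Skip the bases already recovered from earlier fragments.
        sequence += chunk[max(len(sequence) - start, 0):]
    return sequence


original = "CAGACACCCATACATACATT"
fragments = fragment(original)

# Read the fragments back out of order, with one fragment lost entirely.
damaged = list(reversed(fragments))
del damaged[1]
print(reassemble(damaged) == original)  # prints True
```

Because each fragment starts halfway through the previous one, most bases are stored twice, which is why one missing fragment does no harm here. The sketch does not detect larger gaps (two neighbouring fragments missing) and stores the first few bases only once, and it also shows the cost of redundancy: roughly twice as many bases are synthesised as the data strictly requires.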


[Figure: A labelled phosphoramidite]


Synthesis- storing this data in DNA

After the corresponding DNA sequence is generated, it needs to be synthesised to turn it into a physical molecule for storage. This is carried out using common DNA synthesis techniques, with phosphoramidite-based oligonucleotide synthesis being the most successful so far. In this method, DNA nucleotides are modified to add protective groups and groups that allow for the required reactions. These modified nucleotides (phosphoramidites) are then used in a synthesis cycle to build the required DNA strand one nucleotide at a time.


[Figure: The synthesis cycle for phosphoramidite based DNA synthesis]




[Figure: The process of nanopore sequencing]


Sequencing- extracting stored data from DNA

This DNA can then be stored and left until the data is required again, much like conventional hard drives. When the data is required, the DNA is sequenced to extract the stored data, again using techniques from existing fields like biochemistry and biotechnology. This is most commonly achieved by one of two methods: nanopore sequencing or sequencing by synthesis (SBS). For the sake of brevity, only nanopore sequencing will be explained. In this method, DNA is aligned with a small pore (either a biological protein pore in a lipid membrane or a nanometre-scale hole in a synthetic solid-state membrane). A small voltage is applied across the pore and one strand of the DNA is drawn through, either using motor proteins or by the electric field itself, as DNA is negatively charged. As each nucleotide moves through the nanopore, its base disrupts the current across the pore, with each base creating a characteristic and different magnitude of disruption. This change in current is measured by an array of electrodes on a sensor chip and processed to determine the order of bases on the DNA strand, which can subsequently be converted back into the original raw data by reversing the encoding method.
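Reversing the encoding is straightforward once the order of bases is known. The sketch below assumes the example mapping given earlier (A = 00, G = 10, C = 01, T = 11) and a sequence that has already been read without errors; the function name `decode` is an illustrative choice.

```python
# Inverse of the example mapping: A = 00, G = 10, C = 01, T = 11.
BASE_TO_BITS = {"A": "00", "G": "10", "C": "01", "T": "11"}


def decode(sequence: str) -> bytes:
    """Convert a DNA base string back into the bytes it encodes."""
    if len(sequence) % 4 != 0:
        # Four bases make one byte, so anything else means bases are missing.
        raise ValueError("sequence length must be a multiple of 4 bases")
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    # Read the bit string back 8 bits (one byte) at a time.
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


print(decode("CAGACACCCATACATACATT"))  # prints b'HELLO'
```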

 

Random access

Sequencing all of the stored DNA is an effective way to retrieve all of the data at once, as in the historical cases mentioned. However, for DNA to function as an efficient storage method, random access is needed so that only the small part of the DNA containing the relevant data has to be sequenced.


One way this can be carried out is by using PCR primers. These are short single strands of DNA that provide a starting point for the Polymerase Chain Reaction, a method used to rapidly copy a chosen sequence of DNA. A specific primer sequence can be attached to each strand of DNA belonging to a file. When that file needs to be accessed, its strands are amplified (many copies are created) using the PCR process with the matching primer, and only these amplified pools of DNA are sequenced.
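The selection step can be modelled in a few lines of Python. Everything specific here is made up for illustration: the file names, the primer sequences and the payloads are hypothetical, and the model is idealised, since it uses a single primer at the start of each strand and assumes every matching strand doubles perfectly in every PCR cycle.

```python
# Hypothetical primers: one made-up primer sequence per stored file.
PRIMERS = {"cat.jpg": "ACGTAC", "notes.txt": "TGCATG"}

# Hypothetical pool of strands, each a primer followed by a payload.
pool = [
    "ACGTAC" + "CAGACACC",
    "TGCATG" + "CATACATA",
    "ACGTAC" + "CATTCAGA",
]


def amplify(pool: list[str], primer: str, cycles: int) -> dict[str, int]:
    """Idealised PCR: each strand starting with the primer doubles every cycle."""
    return {strand: 2 ** cycles for strand in pool if strand.startswith(primer)}


def read_file(pool: list[str], primer: str, cycles: int = 10) -> list[str]:
    """Amplify one file's strands, then return their payloads without the primer."""
    amplified = amplify(pool, primer, cycles)
    return [strand[len(primer):] for strand in amplified]


print(read_file(pool, PRIMERS["cat.jpg"]))  # prints ['CAGACACC', 'CATTCAGA']
```

After 10 cycles each matching strand has 1,024 copies in this model while the other file's strands remain single, so the sequencer reads almost nothing but the requested file.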


Using DNA as a data storage device is a highly promising area of research. Once more development has been carried out to overcome the technical issues, and an easy-to-use, end-to-end system has been built that can compete with existing storage media, the advantages of this technology could see it taking off in the mainstream, and we could be storing our data on the same molecules that give us life.



Sources:

https://www.nature.com/articles/s41576-019-0125-3.epdf

https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf

https://www.twistbioscience.com/blog/perspectives/advances-dna-data-storage-random-access-memory

https://blog.storagecraft.com/data-storage-lifespan/

https://www.spiria.com/en/blog/big-data/dna-next-storage-medium-big-data/

https://www.microsoft.com/en-us/research/video/microsoft-and-uw-demonstrate-first-fully-automated-dna-data-storage/

https://www.clotmag.com/biomedia/joe-davis

https://academic.oup.com/nsr/article/7/6/1092/5711038

https://www.theguardian.com/science/2013/jan/23/shakespeare-sonnets-encoded-dna

https://www.wired.com/story/the-rise-of-dna-data-storage/

https://mitjafelicijan.com/encoding-binary-data-into-dna-sequence.html

https://www.sigmaaldrich.com/technical-documents/articles/biology/dna-oligonucleotide-synthesis.html

https://nanoporetech.com/

https://www.nature.com/articles/nbt.4079


 
