Researchers at the European Bioinformatics Institute in the UK have devised a new error-tolerant coding scheme for storing digital information in DNA. Retrieving the information requires reading the DNA sequence and could thus become a new market for DNA sequencing in the future.
In a proof-of-concept paper published online in Nature last week, the scientists, led by EBI researchers Nick Goldman and Ewan Birney, encoded several computer files as DNA oligonucleotides: an mp3 file of Martin Luther King's "I Have a Dream" speech; a jpg file of a photo of the EBI; a pdf file of Watson and Crick's paper "Molecular structure of nucleic acids;" a txt file of all Shakespeare sonnets; and a file describing the encoding. They then had the DNA oligos synthesized by Agilent Technologies and read them back out on Illumina sequencers, allowing them to retrieve the original information.
Under the right conditions — cold, dry, and dark — DNA can last for 10,000 years or longer. This would be a vast improvement over magnetic tape, which is the preferred long-term storage medium today but degrades within about a decade.
DNA is also a very dense information storage medium, taking up little space and adding little weight, and it doesn't require electricity to maintain the information. A gram of DNA, for example, can hold the same amount of information as a little more than a million CDs, according to the researchers.
Because DNA is so universal, it is highly unlikely that the information would become unreadable in the future because technology has moved on, as has happened with other storage media that have become obsolete. "There will always be DNA reading technology as long as there is DNA-based life around on Earth," Birney said.
What limits DNA information storage today is the cost of generating it. "The bottleneck is the cost of synthesis," Birney said. "All the other things are solvable. The really key question is, 'Could you make DNA synthesis to be cheap enough for this to work?'"
He and his colleagues calculated that at today's cost, DNA storage makes sense if the information is to be stored for between 600 and 5,000 years, at which point the maintenance cost of conventional storage would exceed the cost of DNA storage.
Making the process of DNA synthesis cheaper by a factor of 100 would make DNA storage cost-competitive with other methods for storage times of about 50 years. "That's quite an achievable change in technology; we've seen the same change over the last decade," Birney said.
Goldman told In Sequence that in the short and medium term, DNA storage will probably not become a major application for DNA sequencing. "If we keep working on DNA storage for about a decade, and all goes well, then [vendors] might start noticing it as a market to pay attention to," he said. And it might take another decade after that for a DNA sequencer to be purpose-built for reading out archived information. "How it would be designed would probably be closely linked to what DNA synthesis methods were being used at the time," he said.
"Even though this isn't an application we're focusing on as a company, we do expect that the constant improvements in sequencing quality and cost that we're achieving will enable all kinds of as yet unimagined applications," said Alex Dickinson, senior vice president of cloud genomics at Illumina.
The EBI's paper is not the first proof-of-concept study showing that DNA is a suitable medium for storing digital information: Last summer, a team of researchers at Harvard University and Johns Hopkins University encoded a 5.27-megabit book using DNA and read the information out by next-generation sequencing, a study they published in Science.
According to Goldman, the two groups developed their coding schemes independently around the same time, and while they are similar in many respects, only the EBI's includes an error-correction code that is designed to deal with the types of errors frequently introduced by synthesizing and sequencing DNA.
Error correction is a ubiquitous technology, contained, for example, in hard disk drives and mobile phones today. "Mostly, we don't even think about it," Goldman explained. "In almost every circumstance, the information we'd like to store and transmit gets a little bit corrupted along the way, and the point of an error-correcting code is to be able to not be too upset by that and be able to recover and correct mistakes like that."
In another departure from the Harvard study, the EBI researchers did additional modeling to assess how well DNA storage would scale up. "We've gone further and shown that this could work on a much bigger scale," Goldman said. "If you've got enough money, you can work on a big scale now, and in the future, if the price comes down, it will work on a big scale … for large corporations or governments," and eventually, even for individuals.
To encode digital information as DNA, the researchers converted each byte – a block of eight bits, or zeros and ones – into a five-letter word made up of A, C, G, and T, the four bases of DNA.
These words were strung together into oligonucleotide sequences of about 120 nucleotides, a length that is limited by the synthesis process and could probably go up to about 200 nucleotides now. Because the information they wanted to encode is much bigger than a single oligo can hold, the researchers encoded it in overlapping DNA fragments.
"Every point in the code is written independently four times, and each time we write it, we make about a million DNA molecules," Birney explained, "so there is a lot of redundancy in this process to ensure that you can recover everything at the end of the day." Each piece of information is contained in two overlapping DNA fragments in one direction, and two overlapping fragments in the opposite direction of DNA.
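The overlapping, fourfold layout described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the EBI team's actual pipeline: the function names, the 100-base fragment length, the 25-base step, and the practice of carrying each fragment's index alongside it as a tuple are all assumptions, chosen so that every interior position lands in four fragments, with alternate fragments written as the reverse complement.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite-strand reading of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def split_into_fragments(dna, length=100, step=25):
    """Cut dna into overlapping fragments (length and step are assumed values).

    Odd-numbered fragments are reverse complemented, so each covered
    position is written in both directions.
    """
    fragments = []
    starts = range(0, len(dna) - length + 1, step)
    for index, start in enumerate(starts):
        segment = dna[start:start + length]
        if index % 2 == 1:
            segment = reverse_complement(segment)
        fragments.append((index, segment))
    return fragments


def coverage(dna_length, length=100, step=25):
    """Count how many fragments contain each position."""
    counts = [0] * dna_length
    for start in range(0, dna_length - length + 1, step):
        for pos in range(start, start + length):
            counts[pos] += 1
    return counts


def reassemble(fragments, total_length, step=25):
    """Rebuild the sequence by majority vote over the overlapping copies.

    Assumes the fragments tile the whole sequence, i.e. every position
    is covered by at least one fragment.
    """
    votes = [Counter() for _ in range(total_length)]
    for index, segment in fragments:
        if index % 2 == 1:
            segment = reverse_complement(segment)  # undo the flip
        start = index * step
        for offset, base in enumerate(segment):
            votes[start + offset][base] += 1
    return "".join(v.most_common(1)[0][0] for v in votes)
```

In this sketch, a 400-base sequence yields 13 fragments, and positions away from the two ends are each covered four times, so a single wrong base in one fragment is outvoted by the three other copies. The ends of the sequence have fewer copies here; a real scheme would need to handle them separately.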
To avoid homopolymer repeats in the oligos — which oftentimes lead to errors in synthesis as well as sequencing — they designed a code that would never allow the same base to repeat itself. Essentially, they allowed only three bases — the ones that were not used just before — as the next base.
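A minimal sketch of such a "never repeat the previous base" code is shown below. It is not the researchers' published implementation: the function names are hypothetical, the starting "previous base" of A is an assumption, and the byte conversion uses a fixed six base-3 digits per byte for simplicity, since five base-3 digits cover only 243 of the 256 possible byte values. Each base-3 digit (0, 1, or 2) selects one of the three bases that differ from the base written just before it.

```python
BASES = "ACGT"


def bytes_to_trits(data, width=6):
    """Convert bytes to fixed-width base-3 digits (width=6 is an assumption)."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(width):
            byte, remainder = divmod(byte, 3)
            digits.append(remainder)
        trits.extend(reversed(digits))  # most significant digit first
    return trits


def trits_to_bytes(trits, width=6):
    """Inverse of bytes_to_trits."""
    out = bytearray()
    for i in range(0, len(trits), width):
        value = 0
        for trit in trits[i:i + width]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


def encode(trits, prev="A"):
    """Map base-3 digits to bases, never repeating the previous base."""
    dna = []
    for trit in trits:
        choices = [b for b in BASES if b != prev]  # the three allowed bases
        prev = choices[trit]
        dna.append(prev)
    return "".join(dna)


def decode(dna, prev="A"):
    """Recover the base-3 digits from a sequence produced by encode."""
    trits = []
    for base in dna:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits
```

For example, `encode([0, 0, 0])` gives `CAC` rather than a run of one letter, and `trits_to_bytes(decode(encode(bytes_to_trits(data))))` returns the original `data`. Because the choice at each step excludes the previous base, no two adjacent bases in the output can ever be the same.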
Overall, most errors happen during DNA synthesis rather than during the storage process, the researchers said, based on experience with ancient DNA.
Besides labor, the single biggest cost of the project was the DNA synthesis. While Agilent, whose scientists are co-authors on the paper, provided that to the EBI free of charge, Goldman estimated that at today's commercial rates, it would cost more than $10,000. Birney said several methods for synthesizing DNA exist, but Agilent's Oligo Library Synthesis technology has the lowest error rate.
By contrast, the cost of sequencing the DNA to decode it, about 750 kilobases in all, totaled "a couple of thousand dollars," or about an order of magnitude less than DNA synthesis. Reading out the sequence using Illumina's HiSeq 2000 took about two weeks, but that time is coming down rapidly. "Sequencing technology is moving faster than the synthesis technology," Birney said, adding that the project did not attempt to optimize cost or time for coding and decoding.
Practical applications of DNA storage would likely start with large, long-term archives. "If this is going to be useful, it would be for replacing magnetic tape archives, for very long-term storage," Birney said, citing digital archives of big corporations or libraries or nuclear waste archives as examples. One thing to keep in mind is that language evolves, so a few thousand years from now, people might not be able to understand the information even if they are able to decode it correctly.
If the cost of DNA synthesis drops by 100-fold, which the researchers say could happen within a decade, DNA storage might also become interesting for individuals. "Then you can take your wedding video and write that in DNA and put it away safely," Goldman said. "And that would be economically viable and cost-effective on a 50-year timescale, which is about when your grandchildren would want to see it."
He said his "latest business plan" is to set up a company that would provide DNA storage as a service, like today's cloud service companies. "But the cloud now would be a company that writes that into DNA, and there will be a range of services. One of the services will be, they just mail it back to you, and you put it somewhere safe – put it in your refrigerator or bury it in your garden or send it to your relatives in Sweden or somewhere where it's cold. Or another option would be that they maintain a facility somewhere and they store it for you. And now, at no additional cost to them, it's going to be safe for hundreds of years." He declined to provide additional information about commercialization plans for the technology.
Goldman told In Sequence that he and his team are planning to improve their coding and decoding algorithms in order to store more information in less DNA with improved error correction. They also want to work on the miniaturization and automation needed for a "real industrial-scale application" of their methods. In addition, he is collaborating with Charlotte Jarvis, a British artist, who plans to store a newly commissioned piece of music in DNA and have exhibitions and performances around that, "with the aim of stimulating thought and discussion about modern biological research."