Study Highlights Potential of Next-Gen Sequencing to Read Digital Information Archived in DNA


The availability of affordable, high-throughput next-generation sequencing and DNA synthesis methods is paving the way for a form of long-term digital information storage based on a collection of short stretches of DNA.

In a proof-of-principle study appearing online last week in Science, researchers from Harvard Medical School, the Wyss Institute for Biologically Inspired Engineering, and Johns Hopkins University showed that it was possible to encode millions of bits of binary data — representing a book converted to html format, nearly a dozen JPEG images, and a JavaScript program — into nearly 55,000 short oligonucleotides that could be read back by next-generation sequencing.

In their system, each nucleotide represents either a zero or a one, allowing for 96 bits of data to be encoded in as many bases of DNA. The locations of the 96-bit data chunks within the larger dataset are tracked using short address barcodes tacked onto the oligos encoding them.

"People have thought about this in the past and even commented that [DNA] is a very dense form of information and that it's extremely long-lived," explained the study's senior author Sriram Kosuri, a researcher at the Wyss Institute for Biologically Inspired Engineering whose lab is interested in both DNA synthesis and next-generation sequencing technologies.

As the scale and cost of both DNA synthesis and sequencing have improved, the notion of using DNA as a dense storage device has become more and more feasible, he told In Sequence.

"We thought it might be a good time to think about re-approaching this problem. Because if we could leverage the massive price drops in [DNA] synthesis and sequencing, we'd actually be able to do this at scale and be able to jump on the cost curves, which are declining very fast for these technologies."

Without the availability of next-generation DNA synthesis and sequencing technologies, Kosuri explained, such an undertaking would have been "such a long and horrendously expensive project that no one really would have attempted it."

Although DNA-based storage of digital data is more time-consuming than conventional data storage methods — encoding information in a linear form that cannot be easily re-written — it is also an extremely data-dense and long-lasting option, researchers explained. As such, those involved in the effort suggested that this approach could eventually prove useful for archiving information over several thousand years.

"We could try to increase our density on silica and magnetic media to get to something like DNA," Kosuri said. "But we're already at this density for DNA and all we have to do is reduce costs. So it's kind of an alternative approach that we could consider."

"Its first set of applications would be, in effect, for archival storage or the equivalent of what people use tape for at the moment," agreed European Bioinformatics Institute Associate Director Ewan Birney, who was not involved in the new study.

In collaboration with EBI researcher Nick Goldman, Birney's team has come up with its own method for storing digital information long term in DNA oligo libraries. That study has not yet been published, but has been submitted for review.

Though he could not provide specifics about his group's information storage method, Birney told IS that the two strategies are broadly similar but differ in some details.

"I think it's great that, in some sense, two groups have come up with extremely similar ideas, which shows that it's probably the right approach," he said.

From Bits to Bases

Researchers have been finding ways to sneak messages into DNA since the 1980s. And DNA has been used for stowing away information ranging from encoded images and music to encrypted messages or entire sentences.

In one notable recent example, investigators from the J. Craig Venter Institute and elsewhere incorporated a watermark message into the synthetic bacterial genome sequence that they successfully transplanted into Mycoplasma capricolum cells. That work, described in Science in 2010, involved using a DNA-based code representing each of the letters in the English alphabet.

In contrast to the method described by the JCVI researchers, the new approach does not require putting data together into a longer stretch of continuous DNA. Nor does it involve a living host cell.

"We purposefully avoided living cells," Harvard geneticist George Church, the study's first author, said in a statement.

"In an organism, your message is a tiny fraction of the whole cell, so there's a lot of wasted space," he explained. "But more importantly, almost as soon as a DNA goes into a cell, if that DNA doesn't earn its keep, if it isn't evolutionarily advantageous, the cell will start mutating it, and eventually the cell will completely delete it."

Instead, the group set out to store arbitrary digital information by encoding binary data into an oligo library where adenine and cytosine nucleotides represented zeros and guanine and thymine bases represented ones — a strategy that makes it possible to code one bit of information for each base while still allowing for some wiggle room with respect to the DNA sequences themselves.
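The one-bit-per-base mapping and the "wiggle room" it leaves can be illustrated with a minimal Python sketch. Only the A/C-for-zero and G/T-for-one mapping comes from the study as described here; the function names, the `max_run` limit, and the rule for picking between the two candidate bases are assumptions made for illustration, not the researchers' actual procedure.

```python
import random

# Mapping described in the article: A or C encodes 0, G or T encodes 1.
BASES_FOR_BIT = {"0": "AC", "1": "GT"}
BIT_FOR_BASE = {"A": "0", "C": "0", "G": "1", "T": "1"}


def encode_bits(bits, max_run=3, seed=0):
    """Encode a string of '0'/'1' characters as DNA, one base per bit.

    max_run is a hypothetical cap on homopolymer length; the choice rule
    below is an illustration, not the study's published method.
    """
    rng = random.Random(seed)
    out = []
    for bit in bits:
        choices = list(BASES_FOR_BIT[bit])
        rng.shuffle(choices)
        # Take the first candidate that would not extend a run past max_run.
        # With two candidates per bit, at least one always qualifies.
        for base in choices:
            if out[-max_run:] != [base] * max_run:
                break
        out.append(base)
    return "".join(out)


def decode_bits(dna):
    """Map each base back to its bit; either base of a pair gives the same bit."""
    return "".join(BIT_FOR_BASE[base] for base in dna)
```

Because two bases are available for every bit, a long run of identical bits never forces a long run of identical bases, and decoding does not need to know which of the two bases was chosen.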

"Certain stretches of sequences are sometimes problematic," Kosuri noted. "Because we can encode an arbitrary bit in many different ways … it gives us a lot of flexibility. So we can just avoid any sequence we want to avoid."

In an effort to take advantage of existing technologies, which make and read relatively short stretches of DNA in a massively parallel manner, the researchers started with a draft of a book that had been co-authored by writer Ed Regis and Harvard's Church.

After swapping the book over to an html format, the team divvied up the 5.27 million bits of data — representing the book's 53,426 words, 11 black-and-white JPEG images, and a JavaScript program — into a library of nearly 55,000 oligos.

"We took an html file — that's the set of ones and zeros — and converted those to a set of bases, where we have one bit per base, and then we basically chopped them up into 12 character chunks, which is about 96 bits," Kosuri explained.

Each oligo consisted of a 96-bit coding sequence along with a barcode sequence that can be used to map that oligo back to its location within the larger dataset.

Together, the 96-base coding sequence and the "address" barcode brought the size of each oligo in the library up to around 115 bases. The researchers synthesized nearly 55,000 of these oligos on an Agilent Oligo Library Synthesis microarray, a high-fidelity DNA chip that can be ink-jet printed.
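The chunk-and-address layout can be sketched in a few lines of Python. The 96-bit payload comes from the article. The 19-bit address width is an assumption inferred from the "around 115 bases" total (115 minus 96), and the function names and zero-padding of the final block are hypothetical choices made for illustration.

```python
DATA_BITS = 96     # payload bits per oligo, as reported in the article
ADDRESS_BITS = 19  # assumption: roughly 115 total bases minus 96 data bases


def text_to_bits(text):
    """Turn text into a bit string, 8 bits per UTF-8 byte."""
    return "".join(format(byte, "08b") for byte in text.encode("utf-8"))


def split_into_blocks(bits):
    """Return (address, payload) pairs; the final payload is zero-padded."""
    blocks = []
    for address, start in enumerate(range(0, len(bits), DATA_BITS)):
        payload = bits[start:start + DATA_BITS].ljust(DATA_BITS, "0")
        blocks.append((address, payload))
    return blocks


def block_to_bits(address, payload):
    """Prefix the payload with a fixed-width binary address."""
    if address >= 2 ** ADDRESS_BITS:
        raise ValueError("address does not fit in ADDRESS_BITS")
    return format(address, f"0{ADDRESS_BITS}b") + payload


def oligos_needed(total_bits):
    """Number of 96-bit blocks needed to hold total_bits (ceiling division)."""
    return -(-total_bits // DATA_BITS)
```

Twelve single-byte characters fill exactly one 96-bit block, and `oligos_needed(5_270_000)` gives a block count just under 55,000, in line with the library size quoted in the article (the 5.27 million figure is itself rounded).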

After oligos were cleaved off of the chip on which they were synthesized, the team was left with a "super tiny amount" of DNA, Kosuri explained, around 50 nanograms or so.

To read back this DNA-encoded manuscript, the team nabbed around 1 percent of the oligo library, amplified it by PCR, and sequenced it to an average of almost 5,000-fold coverage on one Illumina HiSeq 2000 lane.

"Because these sequencers have short reads and it's not a big chunk, we can directly sequence on these sequences and read out the information," Kosuri said.

After discarding reads that lacked perfect barcode sequences, the team assembled a consensus sequence for each address designated by the barcodes.
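A consensus step of this kind can be sketched as a per-position majority vote over all reads that share a barcode. The following Python is a simplified illustration, not the study's pipeline: the function name, the default lengths, the exact-match filter against a set of known barcodes, and the decision to vote on bases rather than bits are all assumptions.

```python
from collections import Counter, defaultdict


def consensus_by_address(reads, known_addresses, address_len=19, data_len=96):
    """Group reads by barcode and take a majority vote at each position.

    address_len and data_len default to values consistent with the article's
    roughly 115-base oligos; the 19/96 split is an assumption.
    """
    groups = defaultdict(list)
    for read in reads:
        address, payload = read[:address_len], read[address_len:]
        # Keep only reads with an exactly matching barcode and a full payload.
        if address in known_addresses and len(payload) == data_len:
            groups[address].append(payload)

    consensus = {}
    for address, payloads in groups.items():
        # zip(*payloads) yields one column of bases per payload position.
        consensus[address] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*payloads)
        )
    return consensus
```

With deep coverage, an isolated sequencing error in one read is outvoted by the other reads carrying the same barcode, which is why high coverage helps keep the final error count low.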

Compared with their original oligo library, the researchers found fewer than two dozen errors in the information read back by sequencing. Of these, 10 fell within data-encoding sequences.

That figure could come down further by applying filtering to remove stretches of sequence where the same base is repeated several times, they noted in the study's supplemental information, since sequences in the oligo library were designed to avoid such homopolymers.

As it stands now, though, the cost of doing the type of information storage and reading described in the study would be a few thousand dollars, according to Kosuri. Since the current project was done in collaboration with Agilent Technologies, he noted that the cost was somewhat lower than it would have been otherwise.

The same DNA-based digital information storage approach should be compatible with any of the existing next-generation sequencing technologies, and Kosuri said the oligo design could be tweaked to best suit the particular platform available.

He also noted that smaller, cheaper, and/or more convenient sequencing methods on the horizon, such as the USB memory stick-sized sequencer anticipated from Oxford Nanopore (IS 2/21/2012), might make such experiments easier to pull off routinely down the road.

Long-Term Promise

DNA-based digital storage is not expected to suit every type of information, since it stores data in a sequential manner, cannot be re-written, and takes time to encode and decode. But whereas those features might be a disadvantage in some digital storage applications, Kosuri explained, they are a boon for others, including archival information storage.

"All of these things kind of point toward archival storage as the application — where it's not that important that you get [the information] this instant and you don't really need to change it," he noted. "In fact, not changing it is a virtue."

DNA-based information storage has another advantage in the archival storage realm as well, according to Kosuri: the density of the information that can be stored within DNA.

For instance, the researchers estimated that around 1.5 milligrams of DNA would eventually be sufficient to store one petabyte of data — around 1,000 times more information than can be stored on existing terabyte-sized computer hard drives.
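The 1.5-milligram figure can be put in rough context with a back-of-the-envelope calculation. The sketch below is not taken from the study: it assumes an average mass of about 330 grams per mole for a single-stranded DNA nucleotide, one bit per base, and a decimal petabyte, and it computes the mass of a single copy with no addresses, primers, or redundancy.

```python
AVOGADRO = 6.022e23        # molecules per mole
NT_MASS_G_PER_MOL = 330.0  # assumption: approximate average ssDNA nucleotide mass


def single_copy_mass_grams(n_bytes, bits_per_base=1.0):
    """Mass of one single-stranded copy holding n_bytes at the given density."""
    bases = n_bytes * 8 / bits_per_base
    return bases * NT_MASS_G_PER_MOL / AVOGADRO


petabyte = 1e15  # bytes, decimal definition (an assumption)
floor_g = single_copy_mass_grams(petabyte)
overhead = 1.5e-3 / floor_g  # ratio of the quoted 1.5 mg to the single-copy floor

print(f"single-copy floor: {floor_g * 1e6:.1f} micrograms")
print(f"quoted 1.5 mg is about {overhead:.0f} times the floor")
```

Under these assumptions a single bare copy would weigh only a few micrograms, so the quoted estimate leaves a margin of a few hundred-fold. Presumably that margin covers redundant copies of each oligo plus barcode and other non-data sequence, though the article does not break the estimate down.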

Though their current experiment is a million-fold short of that goal, the rate at which synthesis and sequencing technologies have advanced already suggests that such future improvements might not be outside the realm of possibility.

"We would need about a million-fold improvement in our technology — which we've seen in the past five to ten years for sequencing," Kosuri said. "It sounds like a lot, but only 10 years ago we were a million-fold worse."

Another potential advantage of storing information in DNA is the long-lasting nature of the genetic material itself. It is that longevity that Birney and his team see as the primary advantage of a DNA-based information storage system.

"Ancient DNA labs regularly recover 10,000-year-old DNA," Birney said. "As long as you keep it cold, dry, and dark, you really don't need anything else."

"That's a very rare scenario for a storage technology," he added. "Most of the other digital storage technologies do not have those properties."

For his part, Birney noted that while improvements in sequencing technology could decrease the time it takes to access information stored in a DNA format, he believes the primary financial hurdle to routinely using DNA as an archival data storage medium will be the cost of DNA synthesis.

"We think [the cost of] synthesis needs to come down by one order of magnitude for this to be really, really feasible as a practical way of long-term archiving," he said.

"The rate-limiting aspect to this is synthesis, not sequencing," Birney argued. "Reading in this scheme is really easy."
