NEW YORK – Researchers from the Massachusetts Institute of Technology have devised a scheme for retrieving digital data stored in DNA without using PCR, instead employing oligonucleotide barcodes and fluorescence-activated sorting.
The method amounts to a sort of "file system" that provides random-access memory and enables searches with Boolean logic in databases potentially millions of terabytes in size. Specifically, it involves plasmids encased in glass capsules that are tagged with up to three single-stranded barcodes, each up to 25 nucleotides long, that encode file metadata. The capsules can be picked out using complementary probes that hybridize to the barcodes, followed by fluorescence-activated sorting to isolate the plasmids for sequencing.
In a proof-of-concept study published Thursday in Nature Materials, the researchers demonstrated that they were able to retrieve particular image files stored in DNA from a database of such images. "We show that if you search for 'orange' and 'cat' you get a picture of a tabby," said Mark Bathe, a professor in the department of biological engineering at MIT and the senior author of the study.
The selection sensitivity using just one barcode was one in a million, suggesting the ability to operate on archival data pools containing millions of terabytes of data or more when using combinations of barcodes.
The scheme provides direct access to DNA-based data without the use of PCR, which has so far been the method of choice. "[PCR] was a great first step," Bathe said, "but it takes enzymes and machines, it's hard to scale up, and there's always crosstalk in PCR." In contrast, his method is comparatively low-cost, requiring only the hybridizing oligos and access to a fluorescence-activated cell sorter, which is expensive but "effectively free to run once you have one," Bathe said.
"Doing Boolean searches, this is quite innovative," said Yaniv Erlich, a genomics researcher who studied DNA-based data storage while a professor at Columbia University. "That's something that cannot be done easily using PCR." Erlich, now CEO of Eleven Therapeutics, was not involved in the study.
Like many DNA-based data storage technologies, the MIT researchers' file system has applications in archival storage, but DNA synthesis costs must come down before that is commercially feasible. In the near term, Bathe suggested that it could be used to store DNA or RNA from clinical samples.
Bathe and his postdoc James Banal, a co-first author on the paper, have filed for patents covering the technology and have founded a startup, called Cache DNA, to pursue applications of DNA-based data storage. Harvard Medical School professor George Church and MIT professors Jeremiah Johnson and Paul Blainey are on the firm's scientific advisory board; however, Bathe said the company has not raised any funds and is not yet operational.
This file system's origins trace back to a 2016 workshop on DNA-based data storage held by the Intelligence Advanced Research Projects Activity, the US intelligence community's version of the Defense Advanced Research Projects Agency, where Bathe presented on the topic of file access. Getting hold of files using PCR was akin to searching for a book in a library in the days before the internet. "You'd go to a library, look it up in a card catalog index, go to the aisle, and then find the book," Bathe said. "We can now do an arbitrary search for any piece of information. I wanted to implement how we search on computers in a liquid, DNA-based dataset. That's what this paper fundamentally solves."
His group went about the problem by mimicking cells. They created a "nucleus" for the DNA data, drawing on glass encapsulation developed by Robert Grass at ETH Zurich. "[Grass] did a bunch of studies, which we also replicated, showing it is impervious to salt, fire, or even acids."
The MIT team combined the glass capsules with surface barcodes made from orthogonal primers developed by Stephen Elledge's group at Harvard University. Using magnetic beads bearing complementary primers to pull out the capsules provided an alternative to PCR. Besides adding cost and hands-on time, Bathe said, PCR-based data retrieval may not scale well. Bigger databases would require more PCR primer sites in DNA files, taking up space that could otherwise be used to encode more information. Moreover, large volumes of liquid aren't easy to handle in a thermocycler, he said, and adding enough primers to be able to find the specific information in a large, dense data pool would make everything "sticky."
Using a library of 240,000 orthogonal primers, "we've eliminated the stickiness problem," Bathe said. Still, it takes time to perform a search using the barcodes, approximately one to two days. But given that the killer app is archival data storage, "we weren't as concerned about it taking a day or two as we were with being able to do it at all," he said.
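To give a rough sense of how combinations of barcodes expand the address space, the short Python sketch below counts the distinct barcode sets available from a 240,000-primer library with up to three barcodes per capsule. It is back-of-the-envelope arithmetic, not the paper's capacity estimate: it assumes barcodes on a capsule are distinct and unordered, and the function name `distinct_labels` is hypothetical.

```python
from math import comb

LIBRARY_SIZE = 240_000  # orthogonal primers, as quoted in the article
MAX_BARCODES = 3        # maximum barcodes per capsule, as described in the article


def distinct_labels(library_size, max_barcodes):
    """Count unordered sets of 1..max_barcodes distinct barcodes (an assumed model)."""
    return sum(comb(library_size, k) for k in range(1, max_barcodes + 1))


total = distinct_labels(LIBRARY_SIZE, MAX_BARCODES)
print(f"{total:.2e} possible barcode sets")
```

Under these assumptions the count is on the order of 10^15. How many files that translates to depends on how tags are shared among files, which the article does not specify.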
The barcodes can represent whatever information one may desire, including classifiers, time stamps, and location data. The barcode sequences and their associated information can be stored and organized in a database on a laptop. Not only can the barcodes be used to locate particular files, "you can compute on this information," Bathe said.
In one instance, the researchers were able to distinguish between two images of US presidents in their database, George Washington and Abraham Lincoln, using Boolean logic that combined AND and NOT operations. The picture of Washington had been tagged with barcodes representing the concepts "president" and "18th century." Selecting capsules with the terms "president" and "NOT 18th century" returned the picture of Lincoln.
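The selection logic in the presidents example can be sketched in a few lines of Python. This is an illustration only, not the researchers' software: each capsule is modeled simply as a set of barcode labels, the file names and the `select` helper are hypothetical, and only Washington's two tags come from the article. In the lab, the equivalent step is carried out with hybridizing probes and fluorescence-activated sorting rather than set operations.

```python
# Hypothetical model: each capsule is a file name mapped to its set of barcode labels.
# Only Washington's two tags come from the article; the others are assumed for illustration.
capsules = {
    "washington.jpg": {"president", "18th century"},
    "lincoln.jpg": {"president", "19th century"},  # assumed tags
    "tabby.jpg": {"orange", "cat"},                # assumed tags
}


def select(pool, require=(), exclude=()):
    """Return sorted names of capsules carrying every `require` tag and no `exclude` tag."""
    required, excluded = set(require), set(exclude)
    return sorted(
        name
        for name, tags in pool.items()
        if required <= tags and not (excluded & tags)
    )


# "president" AND NOT "18th century"
print(select(capsules, require=["president"], exclude=["18th century"]))
# "orange" AND "cat"
print(select(capsules, require=["orange", "cat"]))
```

With this toy pool, the first query returns only the Lincoln file and the second returns only the tabby file, mirroring the two searches described in the article.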
Files were read out by sequencing on Illumina MiSeq and MiniSeq platforms.
The team used plasmids in their study, but the method could also work with chemically or enzymatically synthesized DNA. Companies such as Twist Bioscience and France's DNA Script are working on lowering the cost of synthetic DNA for use in data storage; however, they are still some way off. "It'll very likely take a decade" to bring down the cost of DNA synthesis to enable data storage in DNA, Bathe said.
Erlich said that one limitation of the approach is that it requires the files to be separated from the outset, while cost savings in DNA synthesis will come from being able to synthesize many files together at once. But even if PCR is used to pick out the files for encapsulation, using the new approach for searches would still be useful, he said.
"Microfluidics would be the bridge to integrate our file system with writing files simultaneously," Banal said in an email. "We can co-opt some of the technologies already being implemented for single-cell sequencing."
The authors noted another limitation in their paper, which is that fluorescence-activated sorting may not be fast enough to work with the largest possible databases. But they suggested several possibilities for increasing that speed, including custom flow nozzles for the sorting instrument or direct pulldown using magnetic extraction.
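The sorting-speed concern can be put in rough numbers. The sketch below assumes a search rate of 1,000 files per second, the figure attributed to the researchers for metadata searching, and further assumes that every capsule in the pool passes the sorter once in a single serial stream. The function name `screening_time_seconds` and the pool sizes are hypothetical and chosen only to show how the time scales.

```python
RATE_PER_SECOND = 1_000  # files per second, the rate cited for metadata searching


def screening_time_seconds(n_capsules, rate=RATE_PER_SECOND):
    """Seconds for every capsule to pass the sorter once (assumed single serial pass)."""
    return n_capsules / rate


# Illustrative pool sizes, not figures from the paper.
for n in (10**6, 10**9, 10**12):
    days = screening_time_seconds(n) / 86_400
    print(f"{n:.0e} capsules: {days:,.2f} days")
```

Because the time grows linearly with pool size under these assumptions, a thousandfold larger pool takes a thousand times longer to screen. That scaling is why faster nozzles or bulk magnetic pulldown are attractive options.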
"Everything they show seems technically accurate and reasonable," Chris Takahashi, a postdoc at the University of Washington who is also working on DNA-based data storage, said in an email. "They are also open about their limitations so you can take those at face value." He noted that the researchers used bytes per second as a metric for throughput in file metadata searching. "Files per second is more reasonable and their stated rate (1,000 per second) is very slow from a computing standpoint," he said.
"If we use files per second, then we're already at par with cloud [computing] systems when one requests data from the cloud, assuming each object contains only a file," Banal said. "It might be slow in terms of throughput from a bytes per second perspective, but what you get is this massive parallelization you achieve when one uses molecules to compute."
Even if archival DNA data storage isn't feasible for several years, the ability to encapsulate genomes and access arbitrary subsets of them could be useful now. Storing SARS-CoV-2 samples tagged with date and location data from positive PCR-based tests, for example, could aid epidemiology and variant tracking. Storing patient cancer samples to potentially sequence at a later date is another application, Bathe said.
The team is now working to enable even more computation using the barcodes, including accessing numerical ranges. It may even be possible to search for information within the stored DNA, Bathe said, using information theory to help represent the contents of the encased file on the exterior barcodes.