NEW YORK – Catalog DNA is entering the prove-it phase of its bid to become the leader in DNA-based information technology.
With more than $10 million in new funding and an automated device ready to turn ones and zeroes into As, Ts, Gs, and Cs, the Massachusetts Institute of Technology spinout is signing deals for pilot programs that would test its data storage technology, which offers greater storage volumes than traditional computing with the added benefits of lower power consumption and increased portability.
"We can have hundreds or thousands of petabytes in a test tube you can hold in your hand," Catalog Cofounder and CEO Hyunjun Park said in an interview. One petabyte is equivalent to 1 million gigabytes. Last year, the company completed its internal proof-of-concept project by encoding the entirety of Wikipedia in DNA using its prototype device.
Now, Catalog is looking for similar success with client data. Park declined to disclose which companies the firm is partnering with, but said they come from several industries including oil and gas, film and media, and even sports.
The Boston-based firm is a leader in the embryonic DNA-based data storage field, along with Microsoft, Twist Bioscience, and Micron Technology, according to Nick Heudecker, an IT industry analyst with Gartner. Whoever breaks through first will likely have plenty of interested customers in a potentially lucrative market. "The need is there, it just needs to be met," he said.
As companies like Catalog progress, they could create a unique market for DNA sequencers where speed and cost efficiency are prized over base pair accuracy. Catalog is primarily using instruments from Oxford Nanopore Technologies and is in talks with other cutting-edge nanopore technology companies to create custom analyzers, Park said.
Founded in 2016, Catalog draws on ideas developed by Park and cofounder Nathaniel Roquet while they were colleagues in the lab of MIT Professor Tim Lu. The firm has participated in the IndieBio accelerator program and has also received funding from venture firms OS Fund and NEA.
Earlier this month, the firm announced it had raised $10 million in Series A financing, led by Horizons Ventures and joined by Airbus Ventures. It has 10 full-time employees and is hiring more now, Park said.
In general, DNA-based data storage is attractive because it is low maintenance. "You can store data in DNA with minor care for about 500 years," Heudecker said. Magnetic tape, which remains a favored technology for long-term storage, has a lifespan of about a decade and takes up a lot of space. Petabytes of data stored that way would take up an entire room; petabytes' worth of data stored in DNA could fit in a small box, making it portable. In January, the Intelligence Advanced Research Projects Activity (IARPA) granted $48 million to two groups to pursue DNA-based data storage.
In addition to established firms like Microsoft and Twist, several other firms are gunning for their share of that market, including Micron Technology, a Boise, Idaho-based semiconductor manufacturer, and France's DNA Script. What sets Catalog apart is its encoding scheme, Park said. The information is not stored in the precise sequence of DNA "letters," he said, which is a departure from the way life has stored information in DNA for billions of years.
Catalog's technology works more like solid-state memory, as used in USB thumb drives, where electrical charge is stored at a predefined address and the presence or absence of a charge is interpreted as a bit. "We're just doing it with molecules of DNA," Park said.
The firm uses premade synthetic oligos (double-stranded linear DNA between 20 and 30 base pairs long) and combines them into longer molecules of several hundred base pairs. The sequence of each assembled molecule designates a bit address, and each pool contains a complex mixture of such molecules. "The fact a molecule was there means it's a one," Park explained. "If you don't read back the address it means it's a zero… You don't need a separate designation between addresses and the value."
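The presence-or-absence idea Park describes can be illustrated with a minimal Python sketch. It assumes a hypothetical library of short components arranged in layers, with one component chosen per layer to form each address molecule. The component sequences, the layer count, and the function names are invented for illustration and are not Catalog's actual design.

```python
from itertools import product

# Hypothetical component library: each layer offers a few premade oligos.
# Sequences are invented placeholders, much shorter than the 20-30 base
# pairs mentioned in the article.
LAYERS = [
    ["ACGT", "TGCA"],
    ["GGAA", "CCTT"],
    ["ATAT", "CGCG"],
]

# Every combination of one component per layer serves as one bit address.
ADDRESSES = ["".join(parts) for parts in product(*LAYERS)]


def encode(bits):
    """Return the pool of molecules to assemble: one per bit equal to 1."""
    if len(bits) > len(ADDRESSES):
        raise ValueError("not enough addresses for this many bits")
    return {ADDRESSES[i] for i, bit in enumerate(bits) if bit}


def decode(pool, n_bits):
    """Read bits back: an address found in the pool is 1, a missing one is 0."""
    return [1 if ADDRESSES[i] in pool else 0 for i in range(n_bits)]


message = [1, 0, 1, 1, 0, 0, 1, 0]
pool = encode(message)
assert decode(pool, len(message)) == message
```

In this toy version, three layers of two components each give eight addresses, so eight bits can be stored. No molecule has to carry a separate value field, because the address alone encodes the bit by being present or absent.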
So far, the team has encoded 14 gigabytes' worth of Wikipedia data using its scheme and its prototype, dubbed Shannon, in honor of information theory pioneer Claude Shannon. The L-shaped machine takes up a room, measuring 14 feet by 12 feet. Using modified inkjet print heads, it deposits droplets containing DNA, which are later pooled, on a polymer "webbing," Catalog CTO Dave Turek explained.
Data starts out digitally encoded, runs through several software filters, and ends up in a test tube, where it can be extracted using next-generation sequencing. Shannon can write data at speeds of over 10 megabytes per second and can store up to 1.6 terabytes of compressed data in a single run. But the future of Shannon is unclear, and Turek said the firm is not planning to mass produce the machine and ship it to customer sites. The pilot projects will also help Catalog determine whether the market is more in need of products or services.
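As a rough back-of-the-envelope check on those two figures, the short Python sketch below estimates how long a full run would take at the quoted write speed. It assumes decimal (SI) units, a sustained rate of exactly 10 megabytes per second, and that the rate applies to the compressed data; none of these assumptions is stated in the article, and because the quoted speed is "over" 10 megabytes per second, the result is an upper bound.

```python
# Figures quoted in the article; decimal (SI) units are an assumption.
WRITE_RATE_BYTES_PER_S = 10 * 10**6    # "over 10 megabytes per second"
RUN_CAPACITY_BYTES = 1_600 * 10**9     # "up to 1.6 terabytes" per run


def hours_to_write(n_bytes, rate=WRITE_RATE_BYTES_PER_S):
    """Hours needed to write n_bytes at a constant rate in bytes per second."""
    return n_bytes / rate / 3600


full_run_hours = hours_to_write(RUN_CAPACITY_BYTES)
print(f"{full_run_hours:.1f} hours")  # prints "44.4 hours"
```

Under these assumptions, filling a single run would take a little under two days.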
For reading data, Catalog primarily uses nanopore sequencers, including the PromethION, both because it needs to analyze molecules longer than what many short-read Illumina platforms can sequence directly and because its data format can tolerate lower fidelity in exchange for higher throughput.
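One way to see why a presence-or-absence format might tolerate noisy reads: the reader only has to decide which known address a read most resembles, not recover every base. The Python sketch below illustrates that idea with nearest-match assignment by Hamming distance. It is an illustration only, not a description of Catalog's pipeline; the address sequences, the mismatch threshold, and the function names are hypothetical, and it handles substitutions only, whereas real nanopore reads also contain insertions and deletions.

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_read(read, addresses, max_mismatches=2):
    """Return the closest known address, or None if the read is too noisy
    or equally close to two addresses."""
    scored = sorted(
        (hamming(read, addr), addr)
        for addr in addresses
        if len(addr) == len(read)
    )
    if not scored or scored[0][0] > max_mismatches:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # ambiguous: two addresses are equally close
    return scored[0][1]


# Hypothetical address set (invented sequences).
KNOWN_ADDRESSES = ["ACGTGGAAATAT", "TGCACCTTCGCG", "ACGTCCTTATAT"]

noisy_read = "ACGTGGAAATTT"  # one substitution relative to the first address
assert assign_read(noisy_read, KNOWN_ADDRESSES) == "ACGTGGAAATAT"
```

Because the addresses in this toy set differ from one another at several positions, a single miscalled base still leaves the read closest to the right address, and the bit is recovered correctly.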
In addition to providing archival storage, DNA-based data could be embedded into glass or other materials, Heudecker said, potentially surreptitiously. DNA could be used to ensure authenticity or provenance or used to encode instructions.
Some use cases verge on science fiction. "If we're going to send people to Mars, we'd want to equip them with the most important knowledge," Park said. "The only way to send that much information would be something like DNA. Going the other direction, a space probe could collect loads of information and send that back to Earth."
But Catalog's future isn't just on the final frontier. The company also plans to enable computing on the data stored in DNA, using enzymes and molecular processes.
It's the "opposite end of the spectrum from quantum computing, which is very complex, but can only be done on small amounts of data," Heudecker said. The complexity of operations that could be performed "will be less than what you find in traditional or quantum computing options, but the volume of data and [low] power consumption makes it much more competitive for a variety of use cases," he said.
In cases where you might want to perform the same operation across huge volumes of data, DNA-based computing could be the only way to do it. Pattern matching within datasets or searching over large amounts of unstructured data are two potential use cases for such technology. And the low power consumption of a DNA-based computer could make it appealing to people developing machine learning algorithms, where vast amounts of data are needed to train models.
Park said Catalog is already on its way to creating a working DNA-based computing architecture, having proven it can do random-access memory and duplicate data.
DNA-based data storage still needs to see improvements on the write and read steps, Heudecker said, noting that progress in DNA synthesis hasn't seen the same leaps made by sequencers. But he expects the field to progress in the next decade, with commercially viable DNA-based data storage solutions appearing in as little as two to three years. "We're probably a decade away from commercial DNA computing where you're processing that data as well," he said.