NEW YORK – The Institut Pasteur in Paris has won €2 million ($2.1 million) in EU funding to create a "search engine for DNA sequencing data," indexing next-generation sequencing data available in the Sequence Read Archive in order to make it searchable and more accessible.
The five-year IndexThePlanet project, led by Rayan Chikhi, a group leader in computational biology at Institut Pasteur, is supported by a European Research Council consolidator grant, one of several such grants totaling €657 million that were awarded to 321 researchers last month.
Chikhi's team will use the funding to index all data available in the Sequence Read Archive (SRA), the largest public repository of DNA sequencing data.
"The ideal result will be something like Google for DNA sequencing data," said Chikhi. "It will be a website that is easy to access and navigate."
Using a portal, a researcher could input a sequence of interest and retrieve data on where else that sequence has been reported across the globe. In this way, researchers could more easily track antimicrobial-resistant bacteria, for instance, or look up different viruses and their strains. Chikhi likened the concept to the Basic Local Alignment Search Tool (BLAST) that bioinformaticians use to find regions of similarity between sequences.
According to Chikhi, BLAST can only index small amounts of data, such as a non-redundant set of assembled genomes. The tool cannot index large amounts of raw reads, and "definitely not the whole SRA." However, in terms of interface, he said his proposed tool would be similar to BLAST, in that a user would type in a sequence and the search tool would retrieve a list of hits. Chikhi hopes to have a prototype ready within two years.
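To make the "type in a sequence, get a list of hits" idea concrete, here is a minimal sketch of one common way such lookups can work: an inverted index from short substrings (k-mers) to the datasets that contain them. It is a toy illustration only, not Chikhi's undisclosed method and not how BLAST works internally. The function names, the `min_fraction` threshold, and the sample data are all hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring (k-mer) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(datasets, k):
    """Map each k-mer to the set of dataset names whose reads contain it.

    `datasets` is a hypothetical dict: name -> list of read strings.
    """
    index = defaultdict(set)
    for name, reads in datasets.items():
        for read in reads:
            for kmer in kmers(read, k):
                index[kmer].add(name)
    return index


def search(index, query, k, min_fraction=0.8):
    """Return (dataset, fraction of query k-mers found), best hits first."""
    query_kmers = set(kmers(query, k))
    if not query_kmers:
        return []
    counts = defaultdict(int)
    for kmer in query_kmers:
        # .get() avoids creating empty entries in the defaultdict
        for name in index.get(kmer, ()):
            counts[name] += 1
    hits = [
        (name, count / len(query_kmers))
        for name, count in counts.items()
        if count / len(query_kmers) >= min_fraction
    ]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


if __name__ == "__main__":
    toy = {
        "sample_A": ["ACGTACGTGA", "TTGACCA"],
        "sample_B": ["GGGTTTCCC"],
    }
    idx = build_index(toy, k=4)
    print(search(idx, "ACGTACG", k=4))
```

A plain dictionary like this grows with every distinct k-mer it sees, which hints at why indexing raw reads at the scale of the whole SRA calls for far more compact data structures.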
For more than 15 years, scientists have churned out next-generation sequencing data. A 2015 estimate in PLOS Biology forecast the annual output of sequencing data to reach between 2 and 40 exabytes by 2025. Last June, the SRA reported that its holdings comprised 32 petabases of data. The US National Center for Biotechnology Information and the European Bioinformatics Institute jointly maintain the SRA and its European mirror, the European Nucleotide Archive. Both the SRA and ENA, as well as the DNA Data Bank of Japan (DDBJ), are members of the International Nucleotide Sequence Database Collaboration (INSDC), an initiative that supports the collection and dissemination of sequencing data.
Chikhi has for years been trying to devise approaches for making sequencing data more accessible while reducing its computational burden. In 2021, he and colleagues at the Massachusetts Institute of Technology published a method in Cell Systems that could be used to assemble genomes from long-read sequencing data using a laptop computer. The algorithm relied on minimizer-space de Bruijn graphs, or mdBGs, that represent short stretches of nucleotide sequences rather than individual nucleotides and can therefore store genome sequences more efficiently.
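The idea of working with short stretches of sequence rather than single nucleotides can be sketched with minimizers. The example below uses the classic sliding-window scheme with lexicographic ordering, which is an assumption made for readability and may differ from the selection rule used in the published method. Function names and parameters are hypothetical.

```python
def window_minimizers(seq, k, w):
    """Pick the lexicographically smallest k-mer in each window of w
    consecutive k-mers, recording each chosen position only once."""
    kmer_list = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(kmer_list) - w + 1):
        window = kmer_list[start:start + w]
        # min() returns the leftmost offset when several k-mers tie
        offset = min(range(w), key=lambda j: window[j])
        pos = start + offset
        if not picked or picked[-1][0] != pos:
            picked.append((pos, kmer_list[pos]))
    return picked


def minimizer_space_kmers(minimizers, order):
    """Slide a window of `order` minimizers over the list, giving the
    tuples that would act as nodes in a minimizer-space graph."""
    return [
        tuple(minimizers[i:i + order])
        for i in range(len(minimizers) - order + 1)
    ]


if __name__ == "__main__":
    picks = window_minimizers("ACGTTGCA", k=3, w=3)
    tokens = [kmer for _, kmer in picks]
    print(picks)
    print(minimizer_space_kmers(tokens, order=2))
```

Each node then stands for a tuple of minimizers instead of a string of bases, so a long read is summarized by far fewer symbols than it has nucleotides when realistic window sizes are used.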
Chikhi said that for the IndexThePlanet project, he will aim to do something "completely different," which he described as large-scale indexing beyond what previously published approaches, such as BItsliced Genomic Signature Index (BIGSI), HowDe-SBT, and MetaGraph, have accomplished.
Such approaches, in Chikhi's words, have "hit a wall" in terms of indexing, as they require "enormous amounts of disc space and/or memory" and are incapable of indexing the entire SRA. His own approach, in contrast, will rely on a more efficient data structure for indexing, but he declined to divulge details at this time.
"I can only speculate at this point, and the proof will be in the pudding," he said.
Part of the grant funding will go toward accessing computing resources. This will include buying additional storage at Institut Pasteur, as the project might require several petabytes of storage, though cloud computing resources offer another option. Chikhi noted that downloading the data from the SRA is not possible, making cloud-based analysis the way to go.
He underscored that while the data does reside in the SRA, it is relatively inaccessible, as it is too large to be downloaded by any single lab. Even downloading a certain dataset can take a day, he said, and researchers confront the limitations of current technology. "The internet," said Chikhi, "is not infinitely scalable." The data, he added, is mostly held in the US and the UK, and infrastructure, such as undersea communication cables, that would allow researchers in France to access that data quickly does not yet exist.
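Rough arithmetic shows why transfers at this scale are impractical. The sketch below estimates the ideal transfer time for a given volume of data over a given link. The 10 gigabit-per-second link speed is an assumption chosen for illustration and does not come from Chikhi or the SRA, and real-world throughput would typically be lower than this best case.

```python
PETABYTE = 10**15        # bytes, decimal definition
SECONDS_PER_DAY = 86_400


def transfer_days(num_bytes, link_bits_per_second):
    """Ideal transfer time in days, ignoring protocol overhead,
    congestion, and storage write speed."""
    seconds = num_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    assumed_link = 10 * 10**9  # hypothetical sustained 10 Gbit/s
    for petabytes in (1, 10):
        days = transfer_days(petabytes * PETABYTE, assumed_link)
        print(f"{petabytes} PB at 10 Gbit/s: about {days:.1f} days")
```

Under that assumption, a single petabyte takes more than nine days to move and 10 petabytes take about three months.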
"We are talking about petabytes of data," he said. "It's inconceivable that a lab would download tens of petabytes of data per day."
Chikhi is motivated by the prospect of making this massive cache of data searchable, as well as by lessons from the COVID-19 pandemic, when researchers wanted to look at coronaviruses from around the world but lacked the resources to easily find those sequences, which were tangled up in stacks of sequencing data.
"My project aims to make [sequence data] accessible, enabling global analyses to be produced all over this data," he said.
Chikhi was already engaged in such efforts before the IndexThePlanet project. A year ago, he and fellow researchers published a paper in Nature that described a new cloud computing infrastructure called Serratus that allowed them to comb through sequence data from 5.7 million biologically diverse samples, resulting in the discovery of nine new coronavirus species, as well as other RNA viruses.
Pierre Peterlongo, a research associate at Inria, France's National Institute for Research in Digital Science and Technology, said in an email that a resource such as the one Chikhi is working on is needed.
"We are spending billions of dollars for sequencing and storing data, but these data sleep in archives, undisturbed, as there exists no way to query them," Peterlongo said. "If we could query these data efficiently, we could bring together pieces of information that would have a great impact on health, ecology, and agronomy."
Peterlongo, who has collaborated with Chikhi before but is not affiliated with his current project, said that while sequencing data is accessible via the SRA, it is not queryable. "This is what the internet would be like with no search engines," he said. "Mainly underexploited."
Guy Cochrane, joint head of the ENA, said in an email that he welcomed Institut Pasteur's effort to build new ways to search, access, and make use of INSDC data.
According to Cochrane, the databases already offer a "breadth of search and retrieval tools" that rely on metadata and assembled sequence similarity, for example. These are in regular use at scale, he said, but he also acknowledged that the institutions that maintain the databases cannot support all possible search modalities.
"INSDC is a data foundation that feeds tools and services around the world," he wrote. A new search system drawing data from INSDC, such as the one proposed by Chikhi, "illustrates nicely the global open biodata ecosystem of which INSDC is a part."