NCBI to End Support for Sequence Read Archive as Federal Purse Strings Tighten


By Uduak Grace Thomas

This article has been updated to include information about the amount of data stored in the SRA and NCBI's implementation plan for phasing out the database.

The National Center for Biotechnology Information will phase out the Sequence Read Archive and other database resources over the next year as a result of reduced federal research dollars.

"The bottom line is, budgets are tight ... our budget is somewhat less than what we need to do everything," David Lipman, NCBI's director, explained to BioInform. "We proposed various possibilities to the leadership at [the National Institutes of Health] and one of the resources that was agreed upon to phase out was the Sequence Read Archive."

A statement from the NCBI said that the database will be closed in phases. SRA and Trace will stop accepting some types of submissions in the coming weeks, and all submissions within the next 12 months.

Lipman said that there is funding available to run the SRA for "a number of projects," such as the Cancer Genome Atlas, "for at least eight months and maybe 12 months." During that time, he said, "we will be making arrangements with staff at the other institutes as to what happens for longer term access to it."

Other databases that are falling victim to budget constraints include the Conserved Domains Database and the OSIRIS (Open Source Independent Review and Interpretation System) and Peptidome projects.

Lipman said that "a number of considerations" were behind the decision to eliminate SRA, but he did not elaborate.

NCBI said in an online statement that SRA and the Trace Archive "will stop accepting some types of submissions in the coming weeks, and all submissions within the next 12 months." It added that it plans to work with staff from NIH Institutes that fund large-scale sequencing projects "to develop an approach for future access to and storage of the existing data."

NCBI added that it will "continue to support and develop information resources for biological data derived from next-generation sequencing such as genotypes, common variations, rare variations, sequence assemblies, and gene expression data," and encouraged researchers to continue submitting such data to the appropriate databases, including the Gene Expression Omnibus, dbVar, dbGaP, dbSNP, and GenBank.

As for Peptidome, NCBI said that it will phase out the online browser, query, and display interfaces "over the next few weeks," though it will continue to make all existing data and metadata files available from its ftp server "indefinitely." Furthermore, NCBI said it hopes to eventually deposit all Peptidome data "in a different public mass spectrometry repository," and noted that it will provide further information about this effort "soon."

Reports of the phaseout first surfaced earlier this week, based on an e-mail from Lipman leaked to the Tree of Life blog.

The report comes on the heels of the 2012 federal budget announcement, in which US President Barack Obama proposed increasing NIH's funding 2.4 percent in 2012 to $31.83 billion from $30.78 billion in 2010.

Responding to a question about the increased budget, Lipman said "I think at this stage in the game, it's really hard to know what's going to be in the president's budget. Lots of negotiations are going to be happening and so I think it's hard to know."

Furthermore, the NIH budget for fiscal year 2011, which began Oct. 1, 2010, is still under debate. Earlier this month, the House Appropriations Committee outlined $100 billion in cuts to discretionary spending for 2011, including a $1 billion cut for NIH.

Under the auspices of the International Nucleotide Sequence Database Collaboration, NCBI, the European Bioinformatics Institute, and the DNA Data Bank of Japan established the SRA several years ago to provide a publicly available source for short reads generated on next-generation sequencing platforms. The database was initially called the Short Read Archive and later renamed the Sequence Read Archive.

Each partner set up mirrored databases in which deposits made at any of the archives would be available for search and download at all of them. The three SRAs share a common data model, one accession space, and mirror data and metadata updates on a daily basis.

It isn't clear at this stage what impact the loss of NCBI's SRA will have on the other two sites, but Lipman said that both groups will be "looking at what things they can step up and handle."

Guy Cochrane, who leads the EBI's European Nucleotide Archive, told BioInform via e-mail that "EBI's SRA has no immediate plans to reduce or cease operation," adding that although NCBI and EBI enjoy "strong collaboration, [they] are separate [and] operate services independently."

A paper published in the Jan. 1 issue of Nucleic Acids Research said that as of September last year, the SRA contained more than 500 billion reads consisting of 60 trillion base pairs.

Lipman said that the current volume of the resource is 100 terabases, representing approximately 67 percent growth since September.

Furthermore, almost 80 percent of the sequencing data came from the Illumina Genome Analyzer platform, with the SOLiD and Roche 454 platforms providing 15 percent and 5 percent of the data, respectively.

The paper also said that the largest submitters were the Broad Institute, Washington University in St. Louis, the Wellcome Trust Sanger Institute, and Baylor College of Medicine, which provided 34 percent, 15 percent, 13 percent, and 12 percent of sequenced bases, respectively.

The NAR paper highlighted the growing costs of storing such massive amounts of sequence data. "With the growth of the next-generation sequence data surpassing the growth of disk-storage capacity, the value of storing different types of data is being evaluated," the authors wrote.

"The cost of archiving Illumina GA and SOLiD signal data are now considered to significantly exceed the value of making this data available for any subsequent analysis," they added, noting that NCBI began storing this data "on a less accessible secondary-storage system" that was "no longer guaranteed to be permanently available as part of the SRA archives."

The SRA strategy, they noted, "is to balance data reduction and compression in light of infrastructure costs and usage patterns."

In an e-mail to BioInform, Toby Bloom, director of informatics at the Broad Institute, said that the 12-month phaseout period will "give us some time to plan a transition." However, she added, "I don't yet know how we will archive sequence data after that time."

David Dooling, assistant director of informatics at the Genome Center at Washington University, told BioInform that, in addition to submitting data to the SRA, the institute's policy is to archive all data locally using a tiered storage system with a customized "information lifecycle management [system] built into our LIMS and analysis pipelines."

"We don’t keep images or run data or anything like that," he explained. "We essentially keep just a BAM file that comes out at the end of the Illumina pipeline, for example, or the SFF file out of the [Roche] 454 ... we keep a minimal amount of data and eventually it ages out and lands on tape."

As word that the SRA would be phased out trickled into the scientific community, researchers commented on NCBI's decision on several blogs, and a lively discussion ensued on the next-generation sequencing forum SEQanswers.

Stuart Brown, an associate professor of cell biology and director of the Research Computing Resource at New York University, said he was surprised to see the NCBI memo. "I had always assumed that archiving biomedical research data was a primary responsibility of [NCBI's parent institute the National Library of Medicine], and, like the Library of Congress, its mission was above politics," he told BioInform via e-mail. "Also, the actual cost of data storage for the NCBI must be an extremely tiny fraction of the entire budget of the NIH. How can we spend hundreds of millions to generate scientific discoveries, yet not fund the data infrastructure on which it rests?"

When asked whether this latest cut might be an ominous portent for other publicly available resources, Brown said, "I have no idea about the funding stability of other public resources (for science, or anything else). I am sure that no one really knows the true nature of the US Federal budget: what parts will be maintained as 'essential' and what parts are subject to political negotiation. It is clearly not a rational process."

Have topics you'd like to see covered in BioInform? Contact the editor at uthomas [at] genomeweb [.] com.
