An emerging project to develop a new data format for short sequence reads has enlisted the help of several next-generation sequencing vendors with hopes of harmonizing efforts in the quickly developing field.
The project, a collaboration among sequencing vendors, genome centers, and other organizations, has developed a new DNA sequence data format called SSR (short sequence read format) that stores base calls and quality scores in a single file.
It is also noteworthy for next-gen shops because of a plan by the National Center for Biotechnology Information to use it as the standard for a new repository that will store reads from next-generation sequencing instruments.
The repository, demand for which has been “pretty urgent,” according to an NCBI researcher, will be similar to the center’s Trace Archive, which stockpiles raw data from capillary sequencers.
Representatives from Applied Biosystems, 454 Life Sciences, Illumina, and Helicos BioSciences are taking part in the SSR initiative, alongside researchers from the major genome sequencing centers, the National Center for Biotechnology Information, and the European Bioinformatics Institute.
Asim Siddiqui, who began developing SSR while working at the British Columbia Cancer Research Center’s Genome Sciences Center, and is now vice president of research at Sirius Genomics, said that all the vendors participating in the project have agreed to adopt the standard “in principle.”
In addition, NCBI intends to use the format for a new repository, slated to launch by the end of the year, that will store reads from next-generation sequencing instruments.
Siddiqui said that despite its name, SSR is applicable to Sanger reads as well as shorter reads from new sequencing instruments, but noted that next-gen technologies were the impetus behind the project.
“The development has been spurred on by the availability of this type of sequence data and the realization that we need better ways of representing the data,” he told In Sequence sister publication BioInform last week. “Otherwise we’re going to end up being inundated and overwhelmed by the quantities of genome sequence data that are going to be coming online.”
Siddiqui said that the format grew out of his experiences with a Solexa [now Illumina] instrument while he was working at the BC GSC.
“We had a meeting with the folks from Solexa, and we started to talk about assemblies and how assemblies should be represented, and realized it was a fairly complex problem and one we needed to tackle,” he said. “But prior to approaching that problem, we realized that we needed to have a sequence format for data that would support that.”
Siddiqui said once the project got underway, it was “quite easy” to enlist the help of other next-gen sequencing vendors. “There was a lot of interest in doing this work, so that wasn’t a problem at all.”
Siddiqui noted that vendor involvement is crucial for the success of the initiative. “I wanted to make sure that it wasn’t just an academically driven project,” he said. “A key step is that the vendors have to actually start producing data in this format, and they’ve agreed to do that in principle.”
Flexibility for Vendor-Specific Data
Mike Attili, a software architect at Helicos, confirmed that the company plans to support SSR in its instrument, but noted that the standard is “still a moving target.”
Nevertheless, he cited a number of advantages of SSR over common formats currently used for Sanger data, including FASTA, FASTQ, SCF, and ZTR. One benefit, he noted, “is that it provides a common framework for compression of all this data, because there’s a huge amount of data coming off these next-gen instruments and there’s not a standard way to compress the other formats, but SSR has that built in.”
In addition, he said, “it provides a way to keep track of all the metadata that’s associated with the reads that come off the machine.”
The format is designed to be flexible so that different kinds of quality scores and vendor-specific information can be stored — an important feature for ensuring cross-platform interoperability, Attili said.
“There is a place for common information and there’s a place for vendor-specific information,” he said. “When we have some data that is different than some of the other suppliers, we’ve got a place to put that, and then … if all [users] want is the standard sequences and quality scores, they can get that. If they want to dig down into a deeper level and then parse the custom data that we provide, they can do that, too.”
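The split Attili describes, with common fields in one place and vendor-specific fields in another, can be pictured with a short Python sketch. This is only an illustration of the idea and not the SSR layout, which the article does not specify; the class name `ReadRecord`, its field names, and the key `example_vendor_field` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ReadRecord:
    """Hypothetical in-memory record; NOT the actual SSR file layout."""
    read_id: str
    bases: str
    qualities: List[int]
    # Free-form extras defined by the instrument vendor (hypothetical keys).
    vendor_data: Dict[str, Any] = field(default_factory=dict)

    def common_view(self) -> Tuple[str, str, List[int]]:
        """Return only the fields every consumer is assumed to understand."""
        return (self.read_id, self.bases, self.qualities)


record = ReadRecord(
    read_id="run1:read42",
    bases="ACGT",
    qualities=[30, 28, 25, 31],
    vendor_data={"example_vendor_field": [0.91, 0.88, 0.79, 0.93]},
)

# A generic tool reads only the common part and ignores the vendor section.
read_id, bases, qualities = record.common_view()
```

A tool that only needs sequences and quality scores would call `common_view()`, while a vendor-aware tool could also inspect `vendor_data`.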
The format also associates a unique identifier with each read “so that when you go to your assembly format, you don’t need to incorporate all the read information directly into the assembly file itself,” Siddiqui said. “You can just reference the reads by their ID, and then if you have multiple assemblies you’re able to save space because you’re not incorporating the DNA sequencing information each time.”
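The space saving Siddiqui describes comes from storing each read once and letting assemblies point to it by ID. A minimal Python sketch of that idea follows; the read IDs, the `(read_id, offset)` placement tuples, and the `resolve` helper are assumptions made for illustration, not part of SSR or of any assembly format.

```python
from typing import Dict, List, Tuple

# Hypothetical read store: read ID -> bases. Each read is stored once.
reads: Dict[str, str] = {
    "r1": "ACGTAC",
    "r2": "GTACGG",
}

# Each assembly lists (read_id, offset) placements instead of copying bases.
assembly_a: List[Tuple[str, int]] = [("r1", 0), ("r2", 2)]
assembly_b: List[Tuple[str, int]] = [("r2", 0)]


def resolve(assembly: List[Tuple[str, int]],
            store: Dict[str, str]) -> List[Tuple[int, str]]:
    """Look up the bases for each placement by read ID."""
    return [(offset, store[read_id]) for read_id, offset in assembly]


placed_a = resolve(assembly_a, reads)
placed_b = resolve(assembly_b, reads)
```

Here read `r2` appears in both assemblies, but its bases live only in `reads`; each assembly carries just the ID and an offset.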
Representatives from other vendors were not available for comment before deadline.
New Archive for Short Reads
Eugene Yaschenko, an NCBI developer who is working on the SSR project, told BioInform that NCBI plans to support the format for a new repository it is developing that will store reads from next-generation sequencing instruments.
The repository will be similar to NCBI’s Trace Archive for raw data from capillary sequencers.
“NCBI decided to develop a completely new archive for short reads, and … this group is creating standards, so we don’t have to parse or analyze different kinds of files,” he said. “The smaller variety of standards we have, the better it will be for us.”
Yaschenko said that the demand for a short read repository is “pretty urgent because the large genome centers are getting those machines and are doing test runs and are about to submit some huge amounts of data.”
So far, that has not happened. “We have had some of the 454 data submitted to the Trace Archive, but it was a relatively small amount of data, a few million reads,” Yaschenko said. “But when they start to produce billions of reads, we’ll need more compact storage for them. That’s why we are developing the new archive.”
He said that NCBI hopes to have the repository running by the fall, but the timing will depend on the creation of the standard.
Siddiqui said that SSR version 1 is available, and version 1.1 should follow closely behind. “There are a few limitations that we want to correct, and there will be minor modifications,” he said. “It’s still in the process of refinement and we’re still writing the [application programming interface] just to test it and make sure that it’s going to work, but we’re fairly far along.”
A mailing list for the SSR format is available here.