Recent trends in genome sequencing — such as metagenomics studies and the availability of next-generation sequencing instruments — are generating unprecedented amounts of new data that the bioinformatics community is just starting to grapple with in many ways.
While these new methods present a number of informatics challenges related to genome assembly, interpretation, and visualization [BioInform 03-16-07], they have also spurred several efforts to standardize genomic data formats across different platforms to enable the community to efficiently exchange data.
Two independent projects are currently addressing this issue from separate directions. One, a collaboration between next-generation sequencing vendors, genome centers, and other organizations, has developed a new DNA sequence data format called SSR, for short sequence reads.
The second, a collection of around 30 organizations called the Genomic Standards Consortium, has put together a “checklist” for sequencing experiments called MIGS, or minimum information about a genome sequence.
Oddly, while the two projects list some of the same organizations as participants, none of the individual participants overlap, and the organizers of each project had not heard of the other until this week.
In scope, however, the two efforts do not appear to overlap. SSR is designed as a file format for sequencing reads that stores base calls and quality scores together, while MIGS is a higher-level specification designed to describe information related to the experiment itself.
“The approaches should complement one another,” said Asim Siddiqui, who began developing SSR while working at the British Columbia Cancer Research Center’s Genome Sciences Center and is now vice president of research at Sirius Genomics. “The area [GSC] appears to be tackling is the higher-level problem of how to describe a genome, while we are focused on the nuts and bolts of how to handle data from sequencers and to put it together into a genome assembly.”
Dawn Field, head of the molecular evolution and bioinformatics section at the Center for Ecology and Hydrology at the UK’s Natural Environment Research Council and coordinator for the GSC effort, agreed.
“I think in many ways, the interactions should be close, but they’re two very separate projects,” she said. “They’re describing reads, so that’s really how industry packages up its data to allow biologists and others to exchange data in order to do something useful with it in the future, while we’re doing this very top-level checklist: ‘Why did you do this genomics experiment, what isolate did you use, what phenotypes did it have?’”
Both Siddiqui and Field said that they plan to coordinate their efforts in the future. “We’re working with a lot of the same partners, and it would be nice just for the visibility if we could circulate our stuff to their lists and their stuff to our lists, because it’s all people doing genomics,” Field said.
The SSR Format
Siddiqui said that despite its name, SSR is applicable to Sanger reads as well as shorter reads from next-gen instruments from 454 Life Sciences, Illumina, Applied Biosystems, and Helicos. The demand for the format, however, grew out of his experience with a Solexa instrument while he was working at the Genome Sciences Center.
“We had a meeting with the folks from Solexa, and we started to talk about assemblies and how assemblies should be represented, and realized it was a fairly complex problem and one we needed to tackle,” he said. “But prior to approaching that problem, we realized that we needed to have a sequence format for data that would support that.”
Siddiqui convinced representatives from 454, ABI, Illumina, and Helicos to take part in the development process, as well as researchers from the major genome sequencing centers, the National Center for Biotechnology Information, and the European Bioinformatics Institute.
“It was really quite easy to get people involved,” he said. “There was a lot of interest in doing this work, so that wasn’t a problem at all.”
Siddiqui noted that vendor involvement is crucial for the success of the initiative. “I wanted to make sure that it wasn’t just an academically driven project,” he said. “A key step is that the vendors have to actually start producing data in this format, and they’ve agreed to do that in principle.”
Mike Attili, a software architect at Helicos, told BioInform that the company plans to support SSR in its instrument, but noted that “it’s still a moving target.”
Nevertheless, he cited a number of advantages of the SSR over common formats currently used for Sanger data, such as FASTA, FASTQ, SCF, ZTR, and others. One benefit, he noted, “is that it provides a common framework for compression of all this data, because there’s a huge amount of data coming off these next-gen instruments and there’s not a standard way to compress the other formats, but SSR has that built in.”
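The point about built-in compression can be illustrated with a rough sketch. SSR defines its own binary layout, which is not reproduced here; the following uses Python's standard `zlib` on a FASTQ-style text record simply to show how much the highly repetitive structure of read data compresses, which is the redundancy a format-level compression scheme exploits.

```python
import zlib

# A FASTQ-style record: plain text stores bases and quality scores verbosely.
# (Illustrative only -- SSR has its own compression scheme; this just shows
# why building compression into a read format pays off.)
record = (
    "@read_0001\n"
    "ACGTACGTACGTACGTACGTACGTACGTACGT\n"
    "+\n"
    "IIIIIIIIIIIIHHHHHHHHGGGGFFFFEEEE\n"
) * 1000  # many similar reads, as one run off a sequencer would produce

raw = record.encode("ascii")
compressed = zlib.compress(raw, level=9)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes "
      f"(ratio {len(compressed) / len(raw):.3f})")
```

A format without a standard compression convention forces every site to invent its own wrapper; standardizing it means any SSR-aware tool can read any SSR file directly.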
In addition, he said, “it provides a way to keep track of all the metadata that’s associated with the reads that come off the machine.”
The format is designed to be flexible so that different kinds of quality scores and vendor-specific information can be stored — an important feature for ensuring cross-platform interoperability, Attili said.
“There is a place for common information and there’s a place for vendor-specific information,” he said. “When we have some data that is different than some of the other suppliers, we’ve got a place to put that, and then … if all [users] want is the standard sequences and quality scores, they can get that. If they want to dig down into a deeper level and then parse the custom data that we provide, they can do that, too.”
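The "common core plus vendor extensions" design Attili describes might be sketched as follows. The field names here are invented for illustration and do not reflect the actual SSR specification: the idea is only that standard fields sit alongside an open-ended slot for platform-specific data that consumers are free to ignore.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical sketch of a read record with a common core and a
# vendor-specific extension area. Names are illustrative, not SSR's.
@dataclass
class ShortRead:
    read_id: str                     # unique identifier for the read
    bases: str                       # called sequence
    qualities: List[int]             # per-base quality scores
    vendor: str = ""                 # originating platform
    vendor_data: Dict[str, Any] = field(default_factory=dict)  # extras

read = ShortRead(
    read_id="HEL-0001",
    bases="ACGTTGCA",
    qualities=[30, 31, 28, 33, 30, 29, 27, 32],
    vendor="Helicos",
    vendor_data={"raw_signal_trace": [0.8, 1.1, 0.9]},  # platform-specific
)

# A consumer that only wants standard sequences and quality scores can
# ignore vendor_data entirely:
print(read.bases, read.qualities)
```

A tool that does want to "dig down into a deeper level," in Attili's phrase, parses `vendor_data` on its own terms without breaking interoperability for everyone else.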
The format also associates a unique identifier with each read “so that when you go to your assembly format, you don’t need to incorporate all the read information directly into the assembly file itself,” Siddiqui said. “You can just reference the reads by their ID, and then if you have multiple assemblies you’re able to save space because you’re not incorporating the DNA sequencing information each time.”
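The space saving Siddiqui describes comes from storing each read exactly once and letting every assembly refer to reads by ID. A minimal sketch of that indirection, with invented data:

```python
# Hypothetical illustration: reads are stored once, keyed by unique ID,
# as (bases, quality scores). Assemblies reference the IDs rather than
# embedding the sequence data again.
reads = {
    "r1": ("ACGTACGT", [30] * 8),
    "r2": ("TTGCAACG", [28] * 8),
    "r3": ("GGCCTTAA", [32] * 8),
}

# Two alternative assemblies of the same data: each stores only contig
# names and read IDs, not the sequences themselves.
assembly_a = [("contig1", ["r1", "r2"])]
assembly_b = [("contig1", ["r1", "r3"]), ("contig2", ["r2"])]

def sequences_for(assembly, read_store):
    """Resolve read IDs back to base calls only when they are needed."""
    return {
        contig: [read_store[rid][0] for rid in rids]
        for contig, rids in assembly
    }

print(sequences_for(assembly_b, reads))
```

With many assemblies of the same run, the per-assembly cost is a list of IDs rather than another full copy of the sequencing data.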
Eugene Yaschenko, an NCBI developer who is working on the project, said that NCBI plans to support SSR for a new repository it is developing that will store reads from next-generation sequencing instruments. The repository will be similar to its Trace Archive for raw data from capillary sequencers.
“NCBI decided to develop a completely new archive for short reads, and … this group is creating standards, so we don’t have to parse or analyze different kinds of files,” he said. “The smaller variety of standards we have, the better it will be for us.”
Yaschenko said that the demand for a short read repository is “pretty urgent because the large genome centers are getting those machines and are doing test runs and are about to submit some huge amounts of data.”
He said that NCBI hopes to have the repository running by the fall, but the timing will depend on the creation of the standard.
Siddiqui said that SSR version 1 is available, and version 1.1 should follow closely behind. “There are a few limitations that we want to correct, and there will be minor modifications,” he said. “It’s still in the process of refinement and we’re still writing the API just to test it and make sure that it’s going to work, but we’re fairly far along.”
A mailing list for the SSR format is available here.
GSC and MIGS
The Genomic Standards Consortium was formed in September 2005 in an effort to standardize the description of genomes and the exchange of genomic data. A paper describing MIGS, the group’s first specification, is currently in community consultation at Nature Biotechnology.
The concept behind MIGS is in line with a family of “Minimum Information” specifications that grew out of the MIAME (minimum information about a microarray experiment) standard in the microarray community and now includes MIAPE, MIARE, MIRIAM, and a host of others.
The GSC is also developing MIMS (minimum information about a metagenomic sample) as an extension to MIGS, and Field said that current development efforts for MIMS are focused around incorporating more features of interest to the metagenomics community.
One driver for that, she said, is the involvement of the CAMERA (Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis) project that is creating a repository for metagenomics data from the J. Craig Venter Institute’s Global Ocean Sampling expedition and other sources [BioInform 03-16-07].
GSC has recently released MIGS 1.1 and is now working on version 1.2, “which will meet the CAMERA requirements, which in essence really means more complex metagenomics studies,” Field said. “We’re hoping to get to MIGS 2.0 by the end of this year, and that’s when we would really sort of ask people to think about implementing.”
GSC has already begun one implementation, which it calls the Genome Catalogue, or GCat. It’s designed as a repository for researchers to submit genome reports and provides the MIGS checklist in an XML schema. “The GCat software just takes the schema and automatically generates input forms and a basic repository so you can search, browse, and edit,” Field said.
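The schema-driven approach Field describes for GCat might look roughly like the following miniature. The element layout here is invented for illustration, and the real MIGS schema differs; the point is only that the input form falls out of the schema automatically, so a checklist revision does not require hand-editing forms.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the GCat idea: a checklist defined in XML,
# from which input-form fields are generated automatically. This layout
# is invented for illustration and is not the actual MIGS schema.
checklist_xml = """
<checklist name="MIGS-mini">
  <field id="project_name" label="Why did you do this genomics experiment?"/>
  <field id="isolate" label="What isolate did you use?"/>
  <field id="phenotypes" label="What phenotypes did it have?"/>
</checklist>
"""

root = ET.fromstring(checklist_xml)

def form_fields(schema_root):
    """Turn each <field> element into an (id, label) pair for a form."""
    return [(f.get("id"), f.get("label")) for f in schema_root.iter("field")]

for fid, label in form_fields(root):
    print(f"{label} [{fid}]: ____________")
```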
Field said that CAMERA and the Alpine Microbial Observatory are also planning implementations of MIGS.
Field cited CAMERA’s experience with metagenomics data as a case where the SSR format might also come in handy. “Their issue is they get one unique identifier from NCBI for the GOS survey, but underneath that, there are 120 sampling sites, and at each site they may have filtered the water sample four different ways,” she said.
“Because they might end up doing multiple assemblies, they want to tag the metadata to the read, so for them, the discrete unit of data is a read, which for people who are doing genomes, that’s completely foreign — you assemble your genome, you publish it, it’s complete.”
The concept of reads as the key unit of information in sequencing studies is also gaining momentum from the availability of new sequencing technologies, she said. “It used to just be Sanger, but now you’ve got 454 and others, and you really would like to know how this read was generated, what kind of quality it has, and any way that we can do that with computers is really going to facilitate a lot of things downstream.”
Further information on the GSC is available here.