Spurred by the expected avalanche of data from new high-throughput sequencers, two groups have created separate but complementary sets of guidelines, or checklists, for describing sequencing experiments in enough detail that they can be compared and reproduced.
Earlier this month, participants in a recent workshop organized by the Microarray and Gene Expression Data Society posted online a draft of MINSEQE, or Minimum Information about a high-throughput Nucleotide SeQuencing Experiment.
Meanwhile, the Genomics Standards Consortium is preparing to publish its MIGS specification, or Minimum Information about a Genome Sequence, in an upcoming issue of Nature Biotechnology. The group debuted a draft of the specification last year.
The MINSEQE proposal, which is available online, covers both short-read and long-read technologies and broadly follows the microarray guidance called MIAME, or Minimum Information About a Microarray Experiment, that the MGED Society published in 2001.
The workshop that yielded MINSEQE was organized by the MGED Society in collaboration with the Genomics Standards Consortium and took place near Lawrence Berkeley National Laboratory in California. Participants included representatives from large genome centers and genome data repositories, other users of new sequencing technologies, sequencing platform vendors, and NIH institutes.
The draft proposal notes that new high-throughput sequencing technologies are increasingly being used for applications other than genome sequencing, such as transcriptomics, epigenomics, and genotyping.
These types of experiments “measure DNA or RNA in a particular biological state, or compare levels across several different biological states,” and many of them are “directly comparable to microarray experiments,” according to the authors.
Like microarray experiments, they “require a solid understanding of the biological samples used, as well as the analyses carried out on the data.” Therefore, the MINSEQE guidelines focus on six elements for describing an experiment that the authors deem “essential” for publication:
- A description of the biological system and its states that are under study;
- Sequence-read data for each assay “in a recognized format,” including quality scores, raw intensities, and processing parameters;
- Final processed data for the assays in the study;
- Experimental design, including sample-data relationships;
- General information about the experiment; and
- Essential experimental and data processing protocols that allow other scientists to reproduce the experiment.
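The six elements above lend themselves to a simple completeness check. The following is a minimal sketch in Python, not part of the MINSEQE draft: the class name, field names, and the example file name are hypothetical labels chosen here to mirror the six items, and the draft itself does not prescribe any data structure.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class MinseqeRecord:
    """Hypothetical container mirroring the six MINSEQE elements.

    Field names are illustrative assumptions, not terms defined by the draft.
    """
    biological_system: Optional[str] = None    # system and states under study
    sequence_read_data: Optional[str] = None   # reads, quality scores, parameters
    processed_data: Optional[str] = None       # final processed data
    experimental_design: Optional[str] = None  # sample-data relationships
    general_information: Optional[str] = None  # general experiment information
    protocols: Optional[str] = None            # experimental and processing protocols


def missing_elements(record: MinseqeRecord) -> List[str]:
    """Return the names of elements that are absent or blank, in checklist order."""
    return [
        f.name
        for f in fields(record)
        if not (getattr(record, f.name) or "").strip()
    ]


# Example: a partially described experiment (values are placeholders).
record = MinseqeRecord(
    biological_system="placeholder description of the biological system",
    sequence_read_data="reads.fastq",  # hypothetical file name
)
print(missing_elements(record))
```

A submission tool built along these lines could refuse a record until the returned list is empty, which is one way to enforce the elements the authors deem essential for publication.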
Because the new high-throughput sequencing platforms are “still maturing,” the authors recommend that researchers initially provide detailed information about their protocols, such as DNA amplification, selection of genome regions, and other steps.
“As the HTS technologies mature and become more standardized, the description of some of these protocol parameters may become redundant,” they write.
The MINSEQE guidelines cover both short-read and long-read technologies, and “while the applications that they are being used for tend to differ, we contend that sufficient experiment annotation is necessary for both.”
According to Paul Spellman, a researcher at LBNL who helped organize the workshop, the MINSEQE guidelines are “a fairly direct porting of the concepts in MIAME into the sequencing space.”
MGED Society members realized that technologies other than arrays would soon become important for measuring gene expression, “and, most likely, the sequencing technology, in the next five or 10 years, might even replace arrays completely,” he told In Sequence’s sister publication BioInform last week.
The next goal, Spellman said, will be to define a format in which to share the required information about the experiment, which will most likely be a spreadsheet-like document.
The raw sequence data could be linked and stored in the short-read format used by the short-read archives that are currently set up by the National Center for Biotechnology Information and the European Bioinformatics Institute, he said.
“We are going to try and build consensus for this MINSEQE document and begin getting all the vested parties on board for a way of communicating the data,” he said.
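No such sharing format had been defined at the time of writing. Purely as an illustration of what a spreadsheet-like document linked to archived reads might look like, the sketch below writes and reads a tab-delimited sheet with Python's standard csv module. The column names and the accession placeholder are assumptions made here, not part of MINSEQE or of any archive's specification.

```python
import csv
import io
from typing import Dict, List

# Hypothetical columns; the real format was still to be defined.
COLUMNS = ["sample_name", "biological_state", "protocol", "read_archive_accession"]


def write_sheet(rows: List[Dict[str, str]]) -> str:
    """Serialize experiment rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_sheet(text: str) -> List[Dict[str, str]]:
    """Parse tab-delimited text back into one dictionary per row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text), delimiter="\t")]


rows = [
    {
        "sample_name": "sample_1",
        "biological_state": "control",
        "protocol": "placeholder protocol description",
        # Placeholder only; not a real archive accession.
        "read_archive_accession": "PLACEHOLDER-0001",
    }
]
sheet = write_sheet(rows)
print(sheet)
```

Keeping the raw reads in an archive and storing only an accession in the sheet keeps the annotation document small, which matches the linking approach Spellman describes.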
“The efforts to establish MINSEQE early in the development of the new next-generation sequencing technologies is a great idea,” Rick Jensen, a professor of biological sciences at Virginia Tech, said in an e-mail. Back when the MIAME requirements were finally adopted, he said, “much of the essential data needed to interpret early microarray results was already lost.”
He noted that researchers are still figuring out what experimental variables contribute to differences in results from next-generation sequencing experiments, including sample quality, library preps, and the different platforms. Studies of standardized reference samples, such as those used in the Microarray Quality Control, or MAQC, study, “would go a long way toward revealing the important differences that must be clearly documented” in the MINSEQE guidelines, he said.
Genomes and Metagenomes
For its part, the Genomics Standards Consortium has been specifying “a formal way to describe genomes and metagenomes in more detail than is captured at present in DDBJ, EMBL, and GenBank documents,” according to its website. The result is MIGS, or Minimum Information about a Genome Sequence.
The group includes representatives from the genome repositories, large sequencing centers, the Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis, and a number of other research institutions.
Last year it published a draft of the MIGS specification, which is available on the “Community Consultation” page of the Nature Biotechnology website, and plans to publish a description of MIGS in an upcoming issue of the journal.
The MIGS checklist focuses on detailed descriptions of the sample, its biological content, and the DNA sequenced, as well as of the sequencing method.
The GSC has also developed a portal, called the Genome Catalogue, where researchers can submit MIGS-compliant reports.
The group recommends that authors of genome and metagenome publications file such a report after submitting the sequence data to one of the repositories. MIGS-compliant reports could be a supplementary table in a paper, but “far more beneficial to the wider community would be to submit this information to the Genome Catalogue and report the GCat identifier and the URL of this database,” they write.
MIGS “could be viewed as a subset of MINSEQE that is more strongly typed for genomes and metagenomes,” said Dawn Field, director of the molecular evolution and bioinformatics group at the Natural Environment Research Council’s Environmental Bioinformatics Center in Oxford, UK, who coordinates the GSC.
In an e-mail to BioInform last week, she said that in some areas, MIGS asks for information that does not feature in gene-expression experiments, such as details about the environment from which a biological sample was collected. That information “can now be extended and interwoven into the larger, more complex MINSEQE that describes state-dependent molecules [such as] RNAs, chromatin structure, et cetera,” she said.
Field added that “the GSC is going strong and we are open to working with MGED in this area, fully acknowledging the maturity of that community and the value of the MIAME checklist, and in particular the complementary experience of MGED in describing experimental design, or study design, which should be directly applicable to any sequencing project.”