A recent workshop organized by the Microarray and Gene Expression Data Society has resulted in the latest “minimum information” checklist for reporting data from high-throughput biological experiments.
The proposed guideline adds to a list of 20 such documents that have been developed since MGED’s “Minimum Information About a Microarray Experiment,” or MIAME, paper was first published in 2001.
Dubbed MINSEQE, for Minimal Information about a high-throughput Nucleotide Sequencing Experiment, the checklist was drafted in response to growing adoption of next-generation sequencing technology, and highlights a number of elements that the authors believe researchers should consider when submitting data from these instruments to public repositories.
The draft proposal for MINSEQE, available here, grew out of a March workshop MGED organized to discuss the development of standards for ultra-high throughput sequencing data. Attendees included representatives from the Genomics Standards Consortium, which has already drafted its own “minimum information” checklist called MIGS, for Minimum Information about a Genome Sequence; developers of the SRF, or Short Read Format, data standard; as well as representatives from sequencing instrument manufacturers, genomic data repositories, and funding agencies.
Paul Spellman, a researcher at Lawrence Berkeley National Laboratory who helped organize the meeting, told BioInform that MGED recently began looking a bit more closely at next-generation sequencing because the technology can be used in many application areas that were once the sole domain of microarrays.
“We recognize that arrays are a technology for dealing with a type of data that MGED’s been very interested in, but that other technologies exist and most likely the sequencing technology in the next five or 10 years might even replace arrays completely,” Spellman said.
“So as people within MGED started expanding their interests and technologies and scope, we recognized that it made sense to reach out to other communities who were using these technologies and form a group of people who were interested in using the next-generation sequencing technologies … to answer critical biological questions, and what would be necessary to share that data with the rest of the community,” he added.
He stressed that MGED is an “interested stakeholder” in the MINSEQE guideline, “but this is not an MGED initiative.”
MINSEQE is “a fairly direct porting” of MIAME concepts into the sequencing space, Spellman said. Like MIAME, the document outlines the type of information that should accompany a data submission. Specifically, MINSEQE identifies “six elements of experiment description [that] are considered essential for making available data supporting HTS based publications.”
These include: a description of the biological system and the particular states that are studied; the sequence read data for each assay; the “final” processed (or summary) data for the set of assays in the study; the experiment design including sample data relationships; general information about the experiment; and essential experimental and data-processing protocols.
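As a rough illustration of how a repository or submitter might check a deposition against these six elements, the sketch below encodes them as a simple lookup and reports which ones are missing. The key names, the `missing_elements` helper, and the sample submission are all hypothetical; MINSEQE itself does not define field names or a file format.

```python
# Hypothetical keys for the six MINSEQE elements; the draft itself
# does not prescribe field names or a file format.
MINSEQE_ELEMENTS = {
    "biological_system": "Description of the biological system and the states studied",
    "sequence_reads": "Sequence read data for each assay",
    "processed_data": "Final processed (or summary) data for the set of assays",
    "experiment_design": "Experiment design, including sample-data relationships",
    "general_information": "General information about the experiment",
    "protocols": "Essential experimental and data-processing protocols",
}


def missing_elements(submission):
    """Return the checklist keys that are absent or empty in a submission dict."""
    return [key for key in MINSEQE_ELEMENTS if not submission.get(key)]


# Made-up submission, for illustration only.
example = {
    "biological_system": "Yeast, heat shock vs. control",
    "sequence_reads": ["run1.srf", "run2.srf"],
    "processed_data": "",
    "experiment_design": "2 conditions x 1 replicate",
}

print(missing_elements(example))
```

A check like this only confirms that each element is present, not that its content is adequate, which is the harder question the workshop participants raise.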
Spellman noted that MINSEQE “does not specify how the data should be shared,” so the group also plans to develop a format to actually capture and share that information. While that work is still in its very early stages, he said it’s likely that they will agree on an equivalent to the MAGE-TAB format that corresponds to MIAME, which he described as “a spreadsheet-like document for explaining what an experiment was about.”
It’s also likely, he said, that the group will agree on “something like SRF” to describe the raw sequence data itself.
MINSEQE adds to several emerging standards initiatives for next-generation sequencing data, but Spellman and others noted that these efforts are all complementary. For example, SRF “exists to describe the sequences themselves,” Spellman said, while MIGS was developed to describe the annotation for a single sequencing run. Neither one, however, describes all the experimental factors involved in a study.
This point is particularly important for next-gen sequencing, Spellman said, because these instruments are increasingly being used for applications well beyond genome sequencing.
“If all you did was sequence a genome, then you don’t really need what’s in this MINSEQE concept, because the result of a sequenced genome is well understood,” Spellman said. “Our interest for this group was an experiment where the sequence is not the ultimate answer.”
Spellman explained that while the genome remains “stable” across every cell of an organism regardless of environmental conditions, the transcriptome, on the other hand, “is the function of a genome in an environment, so we’re interested in those experiments where the sequence results are in the context of an experiment — a genome in an environment.”
Ron Edgar, a workshop attendee and a researcher at the National Center for Biotechnology Information who works on capturing metadata for NCBI’s Short Read Archive, agreed that it is becoming increasingly important to include this experimental context as part of next-gen sequencing data depositions.
“There are very diverse types of experiments done with [high-throughput sequencing] and although they share the technology, they do not necessarily need to be treated in the same way — in fact I assert that they don't,” Edgar explained via e-mail. “For example, a genomic sequencing experiment could be quite useful if all you know are few facts like the sample source and species … The results are genomic sequences, which will be the same whichever (reasonable) protocols were used.”
However, he noted, “a transcriptomic or ChIP-Seq experiment would be virtually pointless to anyone without a much richer set of descriptions of the bio-samples, conditions, variants, habitat, et cetera.”
GSC coordinator Dawn Field, who is head of the molecular evolution and bioinformatics section at the Center for Ecology and Hydrology at the UK’s Natural Environment Research Council, said that while GSC’s MIGS checklist also includes descriptions of environmental conditions and assay design, it “could be viewed as a subset of MINSEQE that is more strongly typed for genomes and metagenomes.”
Field noted via e-mail that MINSEQE is more than just an “orthologous ‘rethink’ of MIAME,” and includes a number of new features due to GSC’s contributions. The end result, she said, “contains some things that have not been in MIAME before, including an emphasis on ‘Environment’ and a requirement to report latitude/longitude and habitat, among other things.”
MINSEQE is also “more complex” than MIGS in that it describes “state-dependent” molecules like RNAs and chromatin structure, she said.
Field said that the GSC is “open to working with MGED in this area, fully acknowledging [the] maturity of that community and the value of the MIAME checklist, and in particular the complementary experience of MGED in describing experimental design (or study design), which should be directly applicable to any sequencing project.”
She said that a paper describing MIGS is slated for publication in Nature Biotechnology next month.
Up for Debate
The MINSEQE authors are soliciting feedback on the draft proposal, and Spellman said that the next steps for the group include building “consensus” for the document and “getting all the vested parties on board for a way of communicating the data.”
NCBI’s Edgar said that even with the draft in place, there are still questions about the level of detail that would be required to satisfy MINSEQE and how to capture it. “I think there is no longer a debate that raw data is absolutely required in support of any publication, [but] there is a debate on what constitutes that raw data and what level should be captured,” he said.
Edgar noted that the genomics community could likely learn a great deal from its experience with MIAME, which he helped implement as part of NCBI’s Gene Expression Omnibus. “MIAME has so far been the only successful checklist,” he said. “The reason is not because it was first, but because it evolved into something useful in a process that took time and not a negligible amount of friction, and was enforced by databases like GEO and ArrayExpress with a mandate given to us by the scientific journals and the community at large.
“Without these two factors,” he said, “there would have been no MIAME today.”
NCBI has actually been using its experience with MIAME to handle high-throughput sequencing data for several years, Edgar said, noting that GEO began accepting “first-generation Solexa data” in 2004 and 454 submissions in 2006 “by applying MIAME-like requirements” to the data.
But despite MIAME’s success, Edgar said that there are risks in introducing new guidelines for submitting data. “One must realize that checklists like MIAME or MINSEQE are awfully subjective to interpretation, and difficult to actually standardize, validate, and qualify,” he said.
“While biological context is essential to interpret the underlying data, choosing the right level of detail to require is where we all struggle: too little and it may not be useful or meet demand; too much and scientists will be discouraged from submitting, or may inadvertently enter inaccurate information.”
Edgar said that the main challenges NCBI has faced with next-gen sequencing data submissions don’t involve standardization. One particular problem, he said, is that “data transfer using current protocols can be frustrating on both ends.”
In addition, he said, “journals are not yet up to speed with a clear requirement to submit HTS data into repositories and that data should comply with a checklist.”
Further information about MINSEQE is available here.