With more and more scientists incorporating technologies such as microarrays or protein mass spectrometry into their research, datasets have reached unprecedented sizes, a challenge for scientific journals publishing their results.
Journals have traditionally seen their role as making information available and putting it into context. But since the advent of genome sequencing and advances in protein structure determination, “making available” has not necessarily meant “hosting.” Depositing data into GenBank, the Protein Data Bank, or other public databases emerged as a de facto standard during the 1990s, supported by both journals and researchers. But since Celera Genomics published its human genome sequence in Science last year and provided access to the data solely through its own database (a precedent that permitted Syngenta to follow suit with the recent publication of the rice genome in Science), scientists have been wondering what standards journals will adopt for making gene expression data and other new data types available.
As datasets have become too large to print, journals are left with three options: requiring their submission to public databases, hosting them as supplementary material, or allowing authors to make them available at their own websites. Most journals have long required sequence data and three-dimensional structures to be deposited in a publicly available “appropriate database,” such as GenBank, EMBL or the DDBJ; PDB; or SwissProt. But when it comes to SNP data, gene expression data, or protein interaction data, policies are vague or simply don’t exist (see table, p. 13).
One reason is that the number of papers submitted in some of these areas is still low. Regarding SNP data, “We have so few papers that have relevant data that we don’t have guidelines there,” said Richard Roberts, chief US executive editor of Nucleic Acids Research. NAR, similar to the Proceedings of the National Academy of Sciences, updates its submission policies at annual editorial meetings.
The gene expression literature, on the other hand, has been exploding. Yet gene expression data is rarely mentioned in journal policies. Genome Research, which lists the Gene Expression Omnibus and ArrayExpress as recommended repositories, is a rare exception. Standards for microarray data, such as the MIAME standard by the MGED working group, are not universally accepted, nor has a single resource become the GenBank of microarray data.
In the meantime, many journals have adopted a wait-and-see strategy, making gene expression data available as supplementary material. “It may take several years for a single microarray standard to emerge,” said Theodora Bloom, editor of Genome Biology. “I am all for it, but I don’t think journals can do it alone. I think the community has to do it.” She is not alone in her view: “There is not yet a single repository that everyone in the community has adopted,” commented Donald Kennedy, editor in chief of Science.
Technical constraints are one contributing factor to the reluctance within the community to agree on a one-stop shop for gene expression data. Compared to sequence data, microarray data files are huge. Chris Thorpe, information architect for Genome Biology and BioMedCentral, estimated that microarray data is likely to double every year, and will quickly reach an unmanageable size. “That’s clearly not something that any central public repository is going to be able to do,” he said. “Storing hundreds of terabytes of data is immensely costly.” Also, directing large data files through a single resource is likely to create data traffic jams. Instead, Thorpe and others believe in Napster-style distributed repositories that could be queried using file-sharing protocols. “If you have a common standard that everyone agrees to and a decentralized resource, then it doesn’t matter if your data is held in Genome Biology’s repository, or in ArrayExpress, or in both…It will still come back as a query from the decentralized community repository,” he said, adding that all that is required is a web services protocol like DAS to describe collections of data.
At the moment, most journals choose to host microarray data and other large datasets as supplementary material on their websites. “That’s the sensible intermediate stage; the journal arranges for deposition in connection with the paper, and you can go right to it,” said Kennedy. This has not been a particular challenge for most journals, but it incurs extra costs that need to be recovered. Genome Biology, which encourages authors to submit large datasets, images, or movies, charges an article processing fee that includes data hosting. PNAS charges $100 for making supplementary material available online.
The third way of making data available — through an author’s website — is the most controversial. For a start, publishers lose control over access to the data, or any changes to it. Most of them keep a backup version on their own servers for that reason. PNAS, for example, allows certain types of data, which are part of a paper but do not provide the basis for a direct conclusion, to be hosted solely by the author because “you overwhelm the [PNAS] website, and there are cost problems and other problems,” said Nicholas Cozzarelli, editor of PNAS. Nature’s policy states that microarray data should be made available on the authors’ or another freely available website, “until a public database is available.”
When Science allowed Celera to provide its version of the human genome on its own website last year, the journal created a new precedent for bypassing existing public databases. “People felt that some standard had been violated. The point is, what is the standard that we are talking about?” said Robin Schoen, who directs a study on the responsibility of authorship in the biological sciences at the National Academy of Sciences. A committee conducting the study was created in response to Celera’s paper and plans to publish a report in August. “The analyses are built on more and more data…The question is, how much of that underlying data are you supposed to provide; what is your responsibility?” said Schoen.
But whatever publication standards emerge, and whatever ways of presenting data prevail, they will serve the same purpose: “There is an unwritten agreement that you want to allow people to replicate or falsify your finding. And you need to provide them enough to do that,” said Schoen.