Beginning with the Human Genome Project, pre-publication data release has become an integral part of genomics research. Since the meetings in Bermuda, during which researchers developed the initial guidelines for data release, and follow-up meetings in Fort Lauderdale and Amsterdam, the field has changed. Data producers are no longer only large genomics centers — more and more, smaller groups are able to produce a large amount of data. "There's a much broader dissemination of technologies and data sets being produced," says Tom Hudson from the Ontario Institute for Cancer Research, who co-chaired the most recent meeting along with Ewan Birney from the European Bioinformatics Institute. Funding agencies, Hudson says, thought that the policies from the Human Genome Project would be automatically adopted by these smaller projects. "But they realized that hasn't been the case," he says. About 80 to 100 scientists, funders, editors, and ethicists descended on Toronto last year to hammer out a consensus statement on which genomic data should be shared prior to publication.
Building on the past, the group put together a statement based on the discussion, published in Nature, that encourages pre-publication data release from certain projects. That statement recommends that large-scale projects — those with "broad utility," those that create reference data sets, or those that are part of a community resource — should release their data prior to publication. For example, the statement recommends that data from the whole genome sequencing of a reference organism be shared, but says release is optional if the project sequences only a small number of loci from just a few samples. That way, Hudson says, there is balance between the large discovery or reference projects and the smaller, hypothesis-driven work. "What we wanted to establish [was] what should be a pre-publication data set and what should be hypothesis-driven and, at the same time, realizing there's a gray zone," he says. "It's really about the funding agencies making a determination."
Having recommendations and having the research community follow them are different stories. Here, the Toronto group turned to the various funding agencies. For new funding opportunities, the group suggests that applicants be told up front whether there is a data-release policy. It also says that projects fulfilling those three main criteria require rapid-release policies. "Scientists don't like to be told after the fact," Hudson says. "It was understood [at the meeting] that if we are going to be applying new guidelines … that these rules be made explicit to applicants." Furthermore, the group suggests that data-sharing plans be added to grant applications.
There are also responsibilities that fall on the shoulders of both the data producers and the data users. To make the plans for their projects clear, data producers can issue a marker paper that data users can then cite. For example, the International HapMap Consortium published a 2003 Nature paper that set out the goals of the consortium and outlined its data-release policy. "All data on new SNPs, assay conditions, and allele and genotype frequencies will be released rapidly into the public domain," the authors wrote, adding that "the only condition for data access is that users must agree not to restrict use of the data by others and to share the data only with others who have agreed to the same condition." Researchers who then make use of such data should abide by those restrictions and cite the marker paper in their work.
Of course, pre-publication data hasn't gone through all the quality-control checks that published data has. "You have to tell the users that there's inherent risk in using the data, but the data producers can check a sample of the data and say that you are mostly right, so 95 percent of the time, these are true mutations," Hudson says.