January/February 2006: A Defense of Data Sharing


On the afternoon of August 10, 1675, John Flamsteed created a horoscope for his favorite telescope, but made sure that his friends realized that it was just a practical joke. He wrote, tongue-in-cheek, Risum Teneatis Amici — or “This will keep you laughing, my friends.”

Not long after, Flamsteed himself found little to laugh about. Appointed by King Charles II to the Royal Observatory at Greenwich to gather astronomical data aimed at accurately calculating longitudes, he developed a complex method based on the moon’s movement — collecting a vast amount of data in the process — but couldn’t get it to work well. Meanwhile, Isaac Newton needed his peer’s data for his own research, but the two couldn’t come to terms on a collaboration. Ultimately, Flamsteed burned every copy of his data to prevent Newton from using it.

You are laughing, my friends! Three centuries later, we still struggle with the same problems. In early 2002, Mitchell Sogin, a researcher at the Marine Biological Laboratory in Woods Hole, shut down a public website containing his team’s raw sequence data for Giardia lamblia when he discovered that a colleague had published a paper using MBL’s sequence information. At the time, the event was much discussed in the field. An article in Science raised questions, such as, “How much control should DNA sequencers wield over the data they gather? And should they be forced to share preliminary results — as many are now required to do — before they publish their own analysis?” In an article in Genome Technology’s sister publication BioInform, Sogin said at the time that he had received “quite a number of supportive statements” from his peers, and that even without enforcement of data access policies, “most of the scientific community has behaved in a respectable fashion … I do still believe that we should be releasing data and making it available to the community.”

Learning to Share

A particularly interesting insight into our contemporary views about genomics data comes from the so-called Bermuda Principle. It requires that “all human genomic sequence information, generated by centers funded for large-scale human sequencing, should be freely available and in the public domain in order to encourage research and development and to maximize its benefit to society: … Assemblies of greater than 1 Kb would be released automatically on a daily basis; finished annotated sequence should be submitted immediately to the public databases.”

Similar principles have been adopted by other funding agencies and are now commonly understood to cover all sequence data. The principle is also assumed to embrace other international collaborative projects such as the Mammalian Gene Collection, the SNP Consortium, and the International HapMap Project.

There are many positive examples involving what can be thought of as community resource projects: protein structure determination, gene expression analysis, and disease-related databases. Development of common data standards like MIAME (minimum information about a microarray experiment) or MIAPE (minimum information about a proteomics experiment); standardized mark-up languages, constrained vocabularies, and ontologies; sharable browsers; integration software — all of these point to a spirit of openness.

A particularly interesting experiment in this kind of community-wide openness is being carried out by NCI’s cancer Biomedical Informatics Grid, or caBIG. This grid, which connects researchers and institutions in an attempt to help create and share tools and data, has a goal of ramping up innovation targeted at treating or preventing cancer.

Complexity and Challenges

Despite these examples of successful data sharing, we still have many new problems to brood over. For instance, the issues become rather complex when one includes the conflict between the need for open data sharing and the need for intellectual property protection. Recently, a federal appellate court rejected the claim that the so-called “experimental use” legal defense protects academic researchers from patent infringement liability. In response, the National Research Council of the National Academy of Sciences has endorsed an extended Bermuda Principle that would cover protein-structure data and would require scientists to avoid seeking patents for genes or proteins of unknown function, or for haplotype blocks identifiable by a few biomarkers but not directly shown to be associated with a disease. Are we now swinging too far in the other direction, at the expense of the societal need to encourage innovation?

An even more interesting question comes up as we face technological innovations that make data gathering rather routine and banal. One can envision high-throughput data factories capable of generating massive amounts of genomic, transcriptomic, or proteomic data that can be delivered quickly to any researcher for a small fee. The technical work could be outsourced and performed where skilled technical workers are cheaply available. Even better, with cheaper and faster whole-genome sequencing machines and high-density gene chips, it is conceivable that the task could eventually be robotized. If such infrastructures are controlled by only a few, will we not revert to a world separated into data-haves and data-have-nots?

I think that ultimately the solution is to invest in innovations that would make data gathering cheaper, simpler, less labor-intensive, and widely accessible. In such a world, the time and energy of decent scientists will no longer be wasted in data hoarding, data burning, data controlling, and data fighting. Only then, perhaps, will our collective human ingenuity finally focus on seeking out the hidden principles of nature from the data, and do so in the most creative, playful, and joyful collaborations.

Risum Teneatis Amici!

Bud Mishra is a professor of computer science and mathematics at NYU’s Courant Institute of Mathematical Sciences and a professor of cell biology at NYU School of Medicine. He founded the NYU/Courant Bioinformatics Group.
