Science, Nature Papers Address Data Concerns Tied to Genome Sequencing


This week’s publication of the Human Genome Project’s and Celera’s sequencing papers in Nature and Science, respectively, promises to end months of speculation about the pros and cons of each side’s approach to the monumental accomplishment.

Two aspects of the publication have been of particular interest to bioinformaticists: the terms of access to Celera’s data for academic researchers and the relative quality of the two sequence data sets. Each topic is addressed by individual papers appearing alongside the main sequencing papers in the journals.

Science’s primary sequencing paper is accompanied by more than a dozen research papers and related commentaries. Notable among these is a provocative essay by the University of Pennsylvania’s David Roos that criticizes the journal for agreeing to Celera’s terms for academic access to its data.

Echoing the comments of the controversial open letter penned by Ewan Birney and Sean Eddy in December, Roos wrote, “Bioinformatics research is particularly dependent on unencumbered access to data, including the ability to reanalyze and repost results.”

“For example,” Roos wrote, “a genomewide analysis and reannotation of additional features identified in Celera’s database could not be published or posted on the Web without compromising the proprietary nature of the underlying data. Nor could this information be combined with the resources available from other databases.”

At press time, the full terms of academic access to Celera’s data were not publicly available and a Celera spokesperson did not return calls for comment.

Roos told BioInform that he has not yet seen the data release policy in its final form. “I gather that [it] is still in flux,” he said.

“My understanding is that Celera is quite happy to permit further analysis of their data, and posting of the results in the form of coordinates defining genome features and/or links to Celera’s web site,” Roos said. “Thus it is quite possible to look at the Celera data, but it cannot be incorporated into other databases (e.g. GenBank).”

Roos said that he and Science “went back and forth a bit” regarding publication of his opposing view. He said he’s pleased that the journal agreed to publish his paper, but added, “I suspect I may ultimately wish I had kept out of this mess!”

Of the 24 total items published in Nature on the sequencing project, one details a computational comparison of the Celera and HGP draft sequences. Authored by members of George Church’s lab at Harvard Medical School, it indicates that Celera’s sequence contains approximately 6 percent fewer called bases than the HGP sequence, while Celera’s data includes more information about the locations of sequences on chromosomes.

John Aach, one of the authors of the paper, told BioInform via e-mail that “there’s no reason to believe that the two teams won’t be able to overcome such differences in future versions.”

“Whatever might be said about the relative plusses and minuses of the two sequences today will probably be false tomorrow, since both teams continue to refine and complete their versions,” Aach added.

Aach said that the Church lab wrote a “very simple” algorithm specifically for analyzing and comparing the two sequences. Two bits are allocated for each possible stretch of 15 nucleotides, referred to as 15-mers, of which there are over a billion. The team then read through the reference sequence and added 1 to the counter for each 15-mer, which enabled them to determine the number of times each 15-mer appeared in each of the two sequences and whether it was unique. The same algorithm was run against an input sequence to identify the location of all the unique 15-mers for each sequence.
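For readers who want to see the idea in concrete terms, the following is a minimal Python sketch of a two-bit counting scheme of the kind Aach describes. It is not the Church lab’s code: the function names, the packing of four counters per byte, the choice to saturate each counter at 3, and the skipping of non-ACGT characters are all assumptions made for illustration, and the sketch ignores reverse-complement strands. With k = 15 there are 4^15 (about 1.07 billion) possible 15-mers, so a two-bit table would occupy roughly 256 MB; the usage lines below use a small k so the example runs instantly.

```python
# Illustrative sketch only; names and design details are assumptions,
# not the Church lab's implementation.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_codes(seq, k):
    """Yield (start_position, integer_code) for each k-mer made only of A/C/G/T."""
    mask = (1 << (2 * k)) - 1
    code = 0
    valid = 0  # consecutive valid bases ending at the current position
    for i, ch in enumerate(seq.upper()):
        b = BASE_CODE.get(ch)
        if b is None:  # ambiguous base (e.g. N): restart the window
            valid = 0
            code = 0
            continue
        code = ((code << 2) | b) & mask
        valid += 1
        if valid >= k:
            yield i - k + 1, code


def count_kmers(seq, k):
    """Return packed 2-bit counters, one per possible k-mer, saturating at 3."""
    counters = bytearray((4 ** k + 3) // 4)  # four 2-bit counters per byte
    for _, code in kmer_codes(seq, k):
        byte, shift = code >> 2, (code & 3) * 2
        if (counters[byte] >> shift) & 3 < 3:  # 3 means "three or more"
            counters[byte] += 1 << shift
    return counters


def get_count(counters, code):
    """Read the 2-bit counter for a given k-mer code."""
    return (counters[code >> 2] >> ((code & 3) * 2)) & 3


def unique_positions(seq, counters, k):
    """Positions in seq whose k-mer was counted exactly once in the reference."""
    return [pos for pos, code in kmer_codes(seq, k)
            if get_count(counters, code) == 1]


# Usage with a toy sequence and k = 3 (the article's analysis used k = 15).
reference = "ACGTACGA"
table = count_kmers(reference, 3)
print(unique_positions(reference, table, 3))
```

In this toy run, the 3-mer ACG occurs twice, so its two start positions are excluded from the list of unique locations. Two bits are enough for this purpose because the only distinctions needed are “absent,” “seen once,” and “seen more than once.”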

Aach said that the team is currently interested in using the algorithm to identify unique primers and probes for sequence features of interest.

Both Aach and Roos discuss the computational challenges that bioinformaticists face in meeting the needs of future genomics research, and both anticipate a not-too-distant future when genomics-scale studies are considered routine biology operations.

However, as Roos pointed out, “[S]uch research is only possible if data remain available not only for examination, but also to build upon. It is hard to swim in a sea of data while bound and gagged!”

Roos said that these problems would require “a fair amount of thought” and suggested that patent law may provide some useful precedents for the industry to follow.

— BT
