Why Researchers Need to Review Raw Data


By Todd Smith & Sandra Porter


Todd Smith is the president of Geospiza, a Seattle company that sells laboratory data management systems and bioinformatics software. Sandra Porter is a senior scientist at Geospiza and has a long-standing interest in education. She manages the company's partnership with Bio-Link, an NSF-funded Advanced Technology Education Center, and develops educational activities in bioinformatics through grant funding. Sandy and Todd are married with two daughters, Andrea and Roxanne. Send comments to [email protected]. The authors also recommend reading cfm?pageID=10506537.


In January 2002, district court judge Louis Pollak ruled that fingerprint evidence from crime scenes failed to meet three of the four standards established by the US Supreme Court for scientific techniques. Judge Pollak ruled that fingerprinting does not qualify as a scientific technique because it has not been subject to review by the scientific community, does not have a known error rate, and cannot be tested in a rigorous manner.

Genome sequencers are fortunate that Judge Pollak doesn’t review papers for Nature or Science, since many of these same criticisms can be applied to genome sequencing and SNP projects. Very few publications include data on sequencing error rates. Fewer still provide the program parameters that would allow independent verification of their results. And the greatest problem, despite the ubiquitous cries for data sharing, is that very few researchers provide access to the original data, thus ensuring that the assemblies, error estimates, or results from most genome and SNP projects can never be reproduced by other groups.

One might assume from the criticisms leveled at private sequencing efforts that all the sequence data from publicly funded projects is freely accessible and widely available. However, one only has to try getting the trace files referenced in published papers to find that sharing trace files is not a common practice. With the exception of Washington University in St. Louis, most laboratories either don’t understand the importance of these data or, for a variety of reasons, won’t make the files available.

The 1996 Bermuda principles have helped. NCBI’s trace archive contains large numbers of ESTs and shotgun sequences as a result of these measures. But the Bermuda principles don’t go far enough, and the trace archive is far from complete. Notably missing are whole shotgun data from bacterial genomes, viral genome sequences, and trace files from mutation studies and SNP genotyping experiments. A rare opportunity exists at present for researchers to write to the National Human Genome Research Institute ([email protected]) and request that these types of files be included in the trace archive.

Why is it important for researchers to have access to the raw data? In the case of trace files, access is important for several reasons. Depending on the file format, trace files include the electropherogram with peaks and colors, the sample ID, information about the sequencing primer and the dye chemistry, the type of instrument used for data collection, and a string of bases. Base-calling programs such as PHRED require this information in order to calculate quality values for each base. Researchers engaged in SNP discovery, HIV research, and genotyping need quality values in order to determine the likelihood of a miscalled base or an alternative peak. Further, assembly algorithms such as PHRAP and SNP-finding programs such as POLYPHRED or POLYBAYES require quality data for optimum performance. These data cannot be obtained from a text file of A’s, C’s, G’s, and T’s; they can only be generated from trace files. Researchers must also have access to trace files to compare base-calling programs and to develop better technology for base-calling, heterozygote detection, or sequence assembly. Finally, if sequencing is to be considered a scientific technique, it’s time for all sequencers to consider Judge Pollak’s ruling and make their raw data available for review.
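
To make the point about quality values concrete, here is a minimal Python sketch of what per-base qualities let a reviewer compute. It assumes qualities on the standard Phred scale, where Q = -10 * log10(P) and P is the estimated probability that the base call is wrong. The function names, the example read, and the cutoff of 20 are illustrative assumptions, not part of PHRED or any other program named above.

```python
import math


def error_probability(quality):
    """Convert a Phred-scaled quality value to an estimated error probability."""
    return 10 ** (-quality / 10)


def phred_quality(probability):
    """Convert an estimated error probability to a Phred-scaled quality value."""
    return -10 * math.log10(probability)


def expected_errors(qualities):
    """Sum the per-base error probabilities to estimate miscalls in a read."""
    return sum(error_probability(q) for q in qualities)


def low_quality_positions(bases, qualities, threshold=20):
    """List (index, base, quality) for calls below the threshold.

    The default threshold of 20 (a 1-in-100 error estimate) is an
    illustrative assumption; choose a cutoff suited to the study.
    """
    return [
        (i, base, q)
        for i, (base, q) in enumerate(zip(bases, qualities))
        if q < threshold
    ]


if __name__ == "__main__":
    # Hypothetical read: the base string alone says nothing about reliability.
    bases = "ACGT"
    qualities = [40, 12, 35, 8]
    print(low_quality_positions(bases, qualities))
    print(round(expected_errors(qualities), 4))
```

With only the string "ACGT" in hand, none of these numbers can be computed. That is the practical difference between receiving a text file of bases and receiving the trace files from which quality values are derived.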


Opposite Strand is a column for readers to express opinions and ideas about trends and issues in genomics. Submissions should be kept to 550 words and may be submitted to [email protected].

