By Todd Smith & Sandra Porter
Todd Smith is the president of Geospiza, a Seattle company that sells laboratory data management systems and bioinformatics software. Sandra Porter is a senior scientist at Geospiza and has a long-standing interest in education. She manages the company's partnership with Bio-Link, an NSF-funded Advanced Technology Education Center, and develops educational activities in bioinformatics through grant funding. Sandy and Todd are married with two daughters, Andrea and Roxanne. Send comments to [email protected]. The authors also recommend reading genome.gov/page.cfm?pageID=10506537.
In January 2002, district court judge Louis Pollak ruled that fingerprint evidence from crime scenes failed to meet three of the four standards established by the US Supreme Court for scientific techniques. Judge Pollak ruled that fingerprinting does not qualify as a scientific technique because it has not been subject to review by the scientific community, does not have a known error rate, and cannot be tested in a rigorous manner.
Genome sequencers are fortunate that Judge Pollak doesn’t review papers for Nature or Science, since many of these same criticisms can be applied to genome sequencing and SNP projects. Very few publications include data on sequencing error rates. Fewer still provide the program parameters that would allow independent verification of their results. And the greatest problem, despite the ubiquitous cries for data sharing, is that very few researchers provide access to the original data, thus ensuring that the assemblies, error estimates, or results from most genome and SNP projects can never be reproduced by other groups.
One might assume from the criticisms lobbed at private sequencing efforts that all the sequence data from publicly funded projects is freely accessible and widely available. However, one only has to try getting the trace files referenced in published papers to find that sharing trace files is not a common practice. With the exception of Washington University in St. Louis, most laboratories either don’t understand the importance of these data, or for a variety of reasons won’t make files available.
The 1996 Bermuda principles have helped. NCBI’s trace archive contains large numbers of ESTs and shotgun sequences as a result of these measures. But the Bermuda principles don’t go far enough, and the trace archive is far from complete. Notably missing are whole shotgun data from bacterial genomes, viral genome sequences, and trace files from mutation studies and SNP genotyping experiments. A rare opportunity exists at present for researchers to write to the National Human Genome Research Institute ([email protected]) and request that these types of files be included in the trace archive.
Why is it important for researchers to have access to the raw data? In the case of trace files, access is important for several reasons. Depending on the file format, trace files include the electropherogram with peaks and colors, the sample ID, information about the sequencing primer and the dye chemistry, the type of instrument used for data collection, and a string of bases. Base-calling programs such as PHRED require this information in order to calculate quality values for each base. Researchers engaged in SNP discovery, HIV research, and genotyping need quality values in order to determine the likelihood of a miscalled base or an alternative peak. Further, assembly algorithms such as PHRAP and SNP-finding programs such as POLYPHRED or POLYBAYES require quality data for optimum performance. These data cannot be obtained from a text file of A’s, C’s, G’s, and T’s; they can only be generated from trace files. Researchers must also have access to trace files to compare base-calling programs and to develop better technology for base-calling, heterozygote detection, or sequence assembly. Finally, if sequencing is to be considered a scientific technique, it’s time for all sequencers to consider Judge Pollak’s ruling and make their raw data available for review.
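To make the role of quality values concrete, here is a minimal sketch of how a per-base quality value relates to the likelihood of a miscalled base. It assumes the standard PHRED definition, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The function names and the sample quality values are hypothetical and chosen only for illustration; they are not taken from PHRED or any other program named above.

```python
import math


def phred_to_error_prob(q):
    """Convert a quality value Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to a quality value, Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def expected_errors(quals):
    """Expected number of miscalled bases in a read: the sum of per-base error probabilities."""
    return sum(phred_to_error_prob(q) for q in quals)


def low_quality_positions(quals, threshold=20):
    """Return 0-based positions whose quality falls below the threshold (an assumed cutoff)."""
    return [i for i, q in enumerate(quals) if q < threshold]


if __name__ == "__main__":
    # Hypothetical quality values for a four-base read.
    quals = [40, 12, 35, 8]
    for position, q in enumerate(quals):
        print(position, q, phred_to_error_prob(q))
    print("expected errors:", expected_errors(quals))
    print("positions below Q20:", low_quality_positions(quals))
```

Under this definition, a base with a quality value of 20 has a 1 in 100 chance of being wrong, and a base with a quality value of 30 has a 1 in 1,000 chance. A plain text file of bases carries none of this information, which is why it cannot substitute for the trace file.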
Opposite Strand is a column for readers to express opinions and ideas about trends and issues in genomics. Submissions should be kept to 550 words and may be submitted to [email protected]