Publicly available genomics data may be one of the most important resources for researchers, but as Affymetrix recently learned, the information does carry its share of risks.
Last week Affymetrix said that, because of ambiguities in the publicly available UniGene database, it had misinterpreted the data and placed sequence information from the wrong strand of DNA on its chips. As a result, Affymetrix may take a hit of up to $4 million, along with some loss of credibility.
Although the company took the blame, the responses of a handful of experts indicate that people in academia, industry, and the public projects themselves do not take the data’s reliability for granted, and they warn users to be vigilant in checking for accuracy.
“The problem really comes from the issue of posting [sequence information] in the public domain,” said Andrew Brooks, director of the Functional Genomics Center at the University of Rochester Medical Center. “When we accepted the advantages and limitations of Affymetrix technology, in part we also accepted the limitations of what is available in the public domain.”
NCBI Director David Lipman defended the data, noting that it is well known that different research centers submit sequence data from different strands of DNA.
“The database has always had mixed strandedness,” Lipman said.
In fact, the introduction to UniGene comes with the following disclaimer:
“It should be noted that the procedures for automated sequence clustering are still under development and the results may change from time to time as improvements are made. Feedback from users has been especially useful in identifying problems and we encourage you to report any problems you encounter.
“It should also be noted that no attempt has been made to produce contigs or consensus sequences. There are several reasons why the sequences of a set may not actually form a single contig. For example, all of the splicing variants for a gene are put into the same set. Moreover, EST-containing sets often contain 5’ and 3’ reads from the same cDNA clone, but these sequences do not always overlap.”
As a result of the known and documented limitations, the burden of checking the quality falls on the users, experts said.
“This has nothing to do with the quality of the public data; it has to do with the quality of their QC process,” Ewan Birney, head of genome annotation at the European Bioinformatics Institute, said, referring to Affymetrix’s quality control operation. “All bioinformaticists know that you can’t trust a massive amount in the databases – you have to treat everything with distrust.”
Birney said that, in contrast to highly curated protein databases such as Swiss-Prot, the sequence databases may contain about one percent faulty data. “It’s certainly enough to cause problems,” Birney added.
To improve its quality control process, Affymetrix said, its subsidiary Affymetrix Berkeley (formerly Neomorphic) has been charged with identifying glitches in the data such as those that led the company to put the wrong information on its mouse chips. In that instance, the company picked probes from the wrong DNA strand because of an inconsistency in the UniGene EST record.
“We have been doing much more detailed analysis of the genome and ESTs to determine which information should have been put on the chips in the first place,” said Cyrus Harmon, Affymetrix’s vice president of computational genomics. “You can’t rely on a single source of information.”
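The kind of cross-check Harmon describes can be illustrated with a minimal Python sketch. It assumes a trusted reference mRNA is available for the gene and tests whether an EST matches that reference directly or only as its reverse complement. The function names and the toy sequences are hypothetical, and exact substring matching stands in for the sequence alignment a real pipeline would use; this is not a description of Affymetrix’s actual process.

```python
# Hypothetical sketch: names and sequences are illustrative only and do not
# come from Affymetrix, NCBI, or any real pipeline.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_strand(est: str, reference_mrna: str) -> str:
    """Report which strand of the reference an EST appears to come from.

    Uses exact substring matching as a stand-in for alignment (an assumption
    made to keep the example short).
    """
    est = est.upper()
    ref = reference_mrna.upper()
    on_sense = est in ref
    on_antisense = reverse_complement(est) in ref
    if on_sense and on_antisense:
        return "ambiguous"
    if on_sense:
        return "sense"
    if on_antisense:
        return "antisense"
    return "unmatched"


# Toy data: a made-up reference and two ESTs read from opposite strands.
reference = "ATGGCCAAGTTCGGATTACCGTAA"
est_forward = "GCCAAGTTCGG"
est_reverse = reverse_complement(est_forward)

print(classify_strand(est_forward, reference))
print(classify_strand(est_reverse, reference))
```

An EST flagged as antisense would need to be reverse-complemented before probes are chosen from it, and one flagged as ambiguous or unmatched would call for a second source of evidence.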
Affymetrix said it would replace the defective chips, which have up to 60 percent non-functional probes, within six weeks.
Other industry insiders said that Affymetrix’s story should serve as a cautionary tale for researchers.
“People have to be vigorous in the way they do research,” said Christian Marcazzo, director of product marketing for Lion Bioscience. “People need to understand the databases and how information is organized in the databases and they need to make connections between the data in different data sources in order to corroborate whether something is true or false.”
Roy Whitfield, CEO of Incyte, a supplier of data to several microarray developers including Agilent and Corning, said that ensuring high quality takes many rounds of sequencing the areas of interest.
“For any given nucleotide we will have sequenced it many times more than the public domain,” said Whitfield.
Despite the problem with the data and the sharp 15 percent tumble in Affy’s share price in the two days following the news, the company said it would remain committed to supporting the public database efforts because it believes the data represent the best, not to mention the cheapest, source of information available.
Affymetrix can only hope that its customers will be as generous with the company as it has been with the public resource.
Brooks of the University of Rochester said that on the day the story broke he received a handful of calls from Affy’s competitors trying to profit from the news.
“I’ve already gotten calls from other companies’ sales reps and product managers trying to use this to their advantage and saying, ‘our product is different,’” said Brooks.
—JF and MMJ