Researchers at the National Cancer Institute have identified a flaw in Microsoft’s Excel program that can inadvertently change gene names to non-gene names in large genomic data sets. The glitch — a result of the spreadsheet program’s default settings — could affect more than 2,000 genes, according to the NCI team, and has already led to errors that have been propagated in LocusLink and other public data resources.
Barry Zeeberg, John Weinstein and their colleagues in the genomics and bioinformatics group at NCI describe the problem in the current issue of BMC Bioinformatics. While beta-testing internally developed software on microarray data, the team noticed that some gene names were bouncing back as “unknown.” It turned out that Excel, which is commonly used in bioinformatics data processing, was introducing errors via two separate default settings. The first, an automatic date conversion feature, changes gene names that look like dates, such as the tumor suppressor DEC1, to the default date format (in this case, 1-DEC). According to the NCI team, 30 gene names could be affected by this feature.
The second problem affects Riken clone identifiers — or any other clone designation derived from plate coordinates — of the form “nnnnnnnnEnn,” where “n” is a digit and “E” designates row E in the plate. Excel reads the “E” as a signal to convert the identifier to a floating point number. In an example provided by Zeeberg and colleagues, the Riken identifier 231000E13 was changed to the floating-point number 2.31E+13. This could affect more than 2,000 Riken identifiers. “A non-expert user might well fail to notice that approximately 3 percent of the identifiers on a microarray with tens of thousands of genes had been converted to an incorrect form, yet the potential for 2,000 identifiers to be transmogrified without notice is a considerable concern,” the authors wrote.
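The two conversion triggers described above can be approximated with a pair of pattern checks. The following is a minimal sketch, not the NCI team's script: the function name conversion_risk and both regular expressions are assumptions made for illustration, and Excel's actual parsing rules are broader and depend on locale and version.

```python
import re
from typing import Optional

# Assumed, simplified patterns. Excel's real date and number parsing
# is broader and locale-dependent; these only cover the two cases
# described in the article.
MONTHS = "JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC"
DATE_LIKE = re.compile(rf"^(?:{MONTHS})-?\d{{1,2}}$", re.IGNORECASE)
SCI_LIKE = re.compile(r"^\d+E\d+$", re.IGNORECASE)


def conversion_risk(identifier: str) -> Optional[str]:
    """Return "date", "float", or None for a single identifier.

    Hypothetical helper: flags strings that resemble a month
    abbreviation followed by a day number, or digits-E-digits.
    """
    s = identifier.strip()
    if DATE_LIKE.match(s):
        return "date"
    if SCI_LIKE.match(s):
        return "float"
    return None


if __name__ == "__main__":
    for name in ["DEC1", "231000E13", "TP53"]:
        print(name, conversion_risk(name))
```

Running such a check over an identifier column before opening the file in a spreadsheet would give a rough count of the names at risk, though it may miss forms that Excel also converts.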
Even more disconcerting, these errors are irreversible: the original gene names cannot be recovered from the converted values.
Zeeberg et al. suggest a number of work-arounds for the problem — such as pre-processing the data by placing a character or space in front of the gene name — and also developed a script to detect conversion problems. The script is available on the team’s website (http://discover.nci.nih.gov/symbolmutation/), and will also be implemented in NCI’s MatchMiner and GoMiner software packages.
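Both ideas in the paragraph above, prefixing identifiers before import and scanning for values that have already been converted, might look something like the sketch below. It is an illustration under stated assumptions rather than the script on the team's website: the underscore prefix, the function names, and the two patterns for converted values are all hypothetical choices.

```python
import re

# Assumed shapes of already-converted cells, based on the article's
# examples "1-DEC" and "2.31E+13". Other Excel display formats exist
# and would not be caught by these patterns.
CONVERTED_DATE = re.compile(
    r"^\d{1,2}-(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)$",
    re.IGNORECASE,
)
CONVERTED_FLOAT = re.compile(r"^\d\.\d+E\+\d+$", re.IGNORECASE)


def looks_converted(cell: str) -> bool:
    """Heuristically flag a cell that resembles a converted identifier."""
    s = cell.strip()
    return bool(CONVERTED_DATE.match(s) or CONVERTED_FLOAT.match(s))


def protect(identifier: str, prefix: str = "_") -> str:
    """Prepend a character so the value no longer looks like a date or number.

    The underscore is a hypothetical choice; the article only says
    "a character or space".
    """
    return prefix + identifier


def unprotect(value: str, prefix: str = "_") -> str:
    """Strip the prefix added by protect(), if present."""
    return value[len(prefix):] if value.startswith(prefix) else value


if __name__ == "__main__":
    print(looks_converted("1-DEC"), looks_converted("DEC1"))
    print(unprotect(protect("DEC1")))
```

Because the conversion is irreversible, a detector of this kind can only flag suspect cells; restoring the original names requires going back to the unconverted source data.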
Even if Excel’s date and floating-point conversions were to be made non-default options, “there will be a lag time before all researchers have the new program version and an even longer time of confusion before the existing errors and inconsistencies have been expunged from all public and private databases,” the authors wrote. In the meantime, they advised, “it is important to be alert to the problem.”