There’s supposedly a finite amount of space on a standard microarray, but somehow vendors just keep cramming more on there. Affy’s latest arrays include the whole-human-genome chip, weighing in with 1.3 million probes arrayed in 54,000 sets, and the new exon array, which includes 5.3 million probes separated by just five microns. Competitors, likewise, are ramping up their tools, measuring anywhere from hundreds of thousands to well into millions of probes.
With each iteration, scientists marvel at the space-defying feat. And for every appreciative scientist, it seems, there’s a bioinformaticist groaning as the slew of data that can be generated from these chips also looks like it’s approaching infinity.
Truth be told, scientists adopt each new version of this technology so quickly that the accompanying bioinformatics tools often just can’t keep pace. The use of high-density arrays will likely happen just as quickly, predicts Jason Goncalves, general manager of the software business unit at Stratagene. “There’s pretty rapid growth from what we’re seeing,” he says. He believes it won’t take long for these arrays to displace their more traditional brethren, noting that the community switch to using whole-genome chips happened much faster than industry experts had expected.
But it may take longer before the sophisticated visualization and other data analysis tools needed to make sense of all that information are widely available. In the meantime, scientists who have relied on the de facto standard in this field — Microsoft Excel — to keep track of their chip experiments will find that the program just can’t handle this massive volume of data. What to do? Genome Technology spoke with leaders in the field to get the best advice on the pros and cons of using high-density arrays, the informatics and bioinformatics resources you’ll need, and how to handle the tidal wave of data that’s coming at you.
The high-density argument
neighborhood with 55,000 probes. NimbleGen, which makes its own arrays but does not sell them directly to customers, has a platform with 400,000 features per array; a chip with twice that number is in beta testing, and the company is anticipating an even bigger leap this year.
In general, though, scientists aren’t pining for the days when it took several chips to interrogate a whole genome. “It is a very rare customer that doesn’t want more data for their penny,” says Emil Nuwaysir, vice president of business development at NimbleGen. “We are pushing as hard as we can on the density question,” he adds — and so far, customer demand has followed each increase.
Part of the density issue has surfaced recently as arrays have been released for applications other than gene expression. Major vendors have released chips that test for copy number, comparative genomic hybridization, methylation or transcript mapping (tiling arrays), and alternative splicing (exon chips). As chips like Affy’s exon offering, which wins the current heavyweight record with 1.4 million probe sets for a total of 5.3 million probes, interrogate for qualities other than gene expression, data management is facing a new set of monkey wrenches. “It’s not a 2D view anymore,” says Nuwaysir. “What you really want to focus on is the ability to manage multiple data sets from multiple tests.” He envisions a customer requesting data from the intersection of altered gene expression and altered methylation results as one example of this.
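At its simplest, the kind of cross-assay query Nuwaysir envisions reduces to a set intersection over gene lists. Here is a minimal Python sketch; the gene symbols, and the assumption that each assay yields a flat list of “hits,” are illustrative placeholders, not results from any real study:

```python
# Hypothetical gene lists; a real pipeline would derive these from
# per-probe statistics, not hard-coded sets.
expression_hits = {"TP53", "BRCA1", "MYC", "EGFR"}      # altered expression
methylation_hits = {"BRCA1", "MLH1", "EGFR", "CDKN2A"}  # altered methylation

# The intersection Nuwaysir describes: genes flagged by both tests.
both = sorted(expression_hits & methylation_hits)
print(both)  # ['BRCA1', 'EGFR']
```

In practice each assay type has its own normalization and statistics, so the hard part is producing comparable gene-level calls in the first place; the final join is the easy step.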
As density has risen, the cost per probe has fallen, making these microarrays accessible even to people who budgeted for the traditional chips. Atul Butte, an assistant professor of medicine and medical informatics at Stanford, points out that “the cost of these arrays has dropped,” which has given researchers the ability to do an “enormous” number of gene expression experiments.
Add to that claims for better data, and you’ve got a slam dunk of a product. Peter Park, a biostatistician at the Harvard-Partners Center for Genetics and Genomics who recently published a paper comparing various analysis tools for array-CGH data, says that higher-density arrays are more reliable than ever because they correct for data problems discovered in earlier generations of chips. New arrays give “both more data and … better data,” Park says. “We’re more confident about the data.”
Scott Kahn, chief information officer at Illumina, says that by their very nature, high-density chips “are affording much cleaner data.” Having so much content on each chip gives manufacturers more sampling space. Steve Lincoln, vice president of informatics at Affymetrix, says these arrays “provide the greatest level of robustness by using multiple independent (and deliberately different) probe features to contribute to each resulting data point.”
When you get right down to it, Park says, “if you want to [use] something other than the high-density oligo arrays, you should have a good reason.” But just because high density will be the way to go for most people doesn’t mean there aren’t very real reasons not to use these chips.
“As you get more and more probes on a particular array, the problems that we knew about have become much more compounded,” says Wendell Jones, senior manager of statistics and bioinformatics at Expression Analysis, an array services provider. He points to managing false positives and false negatives as two particular challenges that are more pronounced with higher-density chips.
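The compounding Jones describes is easy to make concrete: at a fixed per-probe significance threshold, the number of chance “hits” grows linearly with probe count. A minimal sketch, where the 0.05 alpha is a statistical convention chosen for illustration, not a figure Jones cites:

```python
# With a conventional per-test threshold of 0.05, chance hits scale
# linearly with probe count. (Probe counts below match the chips
# discussed in the article; the alpha is illustrative.)
def expected_false_positives(n_probes: int, alpha: float = 0.05) -> float:
    """Expected false positives when no probe is truly changed."""
    return n_probes * alpha

print(expected_false_positives(54_000))     # 2700.0 on a 54,000-set chip
print(expected_false_positives(1_400_000))  # 70000.0 on 1.4 million exon probe sets
```

This is why multiple-testing corrections such as false discovery rate control have become standard in microarray statistics: tens of thousands of spurious calls would otherwise swamp any real signal.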
Also, if you’re someone who gets along by prefiltering your data, these arrays are just not for you, says Tom Downey, CEO of Partek, a data analysis software vendor. “In the past people have filtered out data they don’t think is important. With these high-density chips, everything [might be] important.”
Harvard’s Park says that platform differences — and even generational differences in the same platform — continue to plague array vendors. “Even Affymetrix data done over many years [is] very hard to put together,” he says. His team studied high-density arrays on behalf of a colleague who was interested in converting but was worried about being able to compare new data with data from older experiments. “We don’t want to throw away all the data that’s sitting in labs,” Park says. Affymetrix offers conversion charts for its arrays, he adds, but the algorithm is of limited use, particularly in its ability to sort by different variables.
Dealing with data
Once you’re confident in your infrastructure, the first step in any high-density microarray experiment should be checking in with data repositories, says Atul Butte at Stanford. “There is an enormous amount of data out there already,” he says, so researchers should “consider all of this data before you even start an experiment.” He says the number of gene expression measurements entered in databases has been growing at between 200 and 300 percent annually, and promises to increase even faster as scientists deposit data from these high-density experiments.
The most important step, of course, is actually analyzing your data. Up until now, Excel has been recognized as the most popular program to sort through chip data. But that tool has a limit of 65,536 rows per worksheet. With exon chips containing 1.4 million probe sets, says Stratagene’s Goncalves, “you’re now breaking Excel.”
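Past a spreadsheet’s fixed row ceiling, the usual workaround is to stream the file rather than load it whole. A minimal sketch, assuming probe-level results exported as a plain CSV; the function name and file layout are hypothetical:

```python
import csv

def count_probe_rows(path: str) -> int:
    """Stream a probe-level CSV one row at a time; memory use stays flat
    no matter how many millions of rows the file holds."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        next(reader, None)            # skip the header row
        return sum(1 for _ in reader)
```

Purpose-built tools take the same idea further, pushing probe data into databases or statistical environments instead of a worksheet.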
Even if Excel could hold the data, it would no longer suffice, contends Nuwaysir at NimbleGen. “You need a tool that’s visual,” he says. “The data has so far leaped past the ability of your brain to pull a pattern out of just a list of information.”
Vendors are throwing their hats into this ring; in late January, for instance, Genomatix announced that it would soon make available a data analysis package designed to harness the information coming out of exon-array experiments. But most software offerings still fall short of the analysis power scientists will need to really make sense of this data onslaught. “Most of the gene expression software out there today is not optimized [for high-density chips] and can’t handle it,” says Tom Downey at Partek, which offers software compatible with Affy’s high-density and other chips. Downey says that Partek has seen increased demand for its product specifically from people looking for high-density solutions.
Slimming down has been one of the goals for algorithms aimed at this kind of data, says Kahn. “We’ve been trying to make the applications be extremely efficient; from a memory standpoint, not do calculations when you don’t have to,” he says. “We’ve started to do a lot more offline processing.” Kahn recommends that biologists bring informaticists on board early in planning experimental design to be sure that resources and analytical tools are considered from the outset.
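The memory discipline Kahn describes, never materializing the full dataset at once, can be sketched as a streaming computation. This is an illustrative pattern, not Illumina’s actual implementation:

```python
def streaming_mean(values) -> float:
    """Mean over any iterable of intensities. Because it accepts a
    generator, a multi-million-probe dataset never has to fit in
    memory at once; only the running total and count are kept."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total / count if count else 0.0

# Works identically on a list or a lazy generator:
print(streaming_mean(range(1, 5)))  # 2.5
```

The same single-pass structure extends to variances, normalization factors, and other summary statistics, which is one way “offline processing” keeps interactive tools responsive.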
Park notes that there are actually several software packages that are sophisticated enough to handle the data — but they’re not designed for biologists. “Most of them are available in R packages,” he says, but “most biologists can’t use R.” There’s a need to take the statistical prowess of those tools and package it in interfaces that working scientists can actually use.
Researchers “do need to consider the software they’re going to use for these chips,” says Jason Goncalves, noting that traditional software is not designed to accommodate high-density chips. He points out that the complexity involved in data emerging from exon and other new-application arrays verges on pathway analysis — so software solutions for these arrays may be coming from a separate field. Atul Butte says Ariadne, Ingenuity, and Genstruct (he’s a scientific advisor for Genstruct) are among those who have products that may prove useful with these arrays. However, he says, “I would want something that’s even more comprehensive, even more in depth, and free — and we’re not there yet.”