Arrays Go HD

By Meredith W. Salisbury

It’s getting crowded here. When Affymetrix microarrays were just kicking off in 1994, they had 16,000 features per chip, and the spots were set at 100 microns. Less than 10 years later, the company had increased that to 65,000 features squeezed in at only 50 microns.

There’s supposedly a finite amount of space on a standard microarray, but somehow vendors just keep cramming more on there. Affy’s latest arrays include the whole-human-genome chip, weighing in with 1.3 million probes arrayed in 54,000 sets, and the new exon array, which includes 5.3 million probes separated by just five microns. Competitors, likewise, are ramping up their tools, measuring anywhere from hundreds of thousands to well into millions of probes.

With each iteration, scientists marvel at the space-defying feat. And for every appreciative scientist, it seems, there’s a bioinformaticist groaning as the slew of data that can be generated from these chips also looks like it’s approaching infinity.

Truth be told, scientists adopt each new version of this technology so quickly that the bioinformatics tools meant to accompany it often can’t keep pace. The adoption of high-density arrays will likely happen just as fast, predicts Jason Goncalves, general manager of the software business unit at Stratagene. “There’s pretty rapid growth from what we’re seeing,” he says. He believes it won’t take long for these arrays to displace their more traditional brethren, noting that the community’s switch to whole-genome chips happened much faster than industry experts had expected.

But it may take longer before the sophisticated visualization and other data analysis tools needed to make sense of all that information are widely available. In the meantime, scientists who have relied on the de facto standard in this field — Microsoft Excel — to keep track of their chip experiments will find that the program just can’t handle this massive volume of data. What to do?

Genome Technology spoke with leaders in the field to get the best advice on the pros and cons of using high-density arrays, the informatics and bioinformatics resources you’ll need, and how to handle the tidal wave of data that’s coming at you.


The high-density argument

The simplest reason to use high-density arrays is that you don’t have much choice. Across the board, chip vendors and service providers are working to ensure that their products include more probes than ever. Affymetrix has long been the industry leader, and as such it essentially sets the tone for the rest of the field. Illumina just released a 300,000-SNP chip, and last year announced plans to have a million-SNP chip on the market by the middle of 2006. Agilent approaches Affy’s density with its own 44,000-probe whole-human-genome chip, and GE Healthcare’s CodeLink system is in the neighborhood with 55,000 probes. NimbleGen, which makes its own arrays but does not sell them directly to customers, has a platform with 400,000 features per array; a chip with twice that number is in beta testing, and the company is anticipating an even bigger leap this year.

In general, though, scientists aren’t pining for the days when it took several chips to interrogate a whole genome. “It is a very rare customer that doesn’t want more data for their penny,” says Emil Nuwaysir, vice president of business development at NimbleGen. “We are pushing as hard as we can on the density question,” he adds — and so far, customer demand has followed each increase.

Part of the density issue has surfaced recently as arrays have been released for applications other than gene expression. Major vendors have released chips that test for copy number, comparative genomic hybridization, methylation or transcript mapping (tiling arrays), and alternate splicing (exon chips). As chips like Affy’s exon offering, which wins the current heavyweight record with 1.4 million probe sets for a total of 5.3 million probes, interrogate for qualities other than gene expression, data collection is looking at a new set of monkey wrenches. “It’s not a 2D view anymore,” says Nuwaysir. “What you really want to focus on is the ability to manage multiple data sets from multiple tests.” He envisions a customer requesting data from the intersection of altered gene expression and altered methylation results as one example of this.
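The cross-assay query Nuwaysir envisions can be sketched as a simple set intersection. The gene names below are hypothetical placeholders, not data from any actual experiment:

```python
# Minimal sketch of a cross-assay query: which genes show both
# altered expression and altered methylation? Gene IDs are
# illustrative only.
altered_expression = {"TP53", "BRCA1", "MYC", "EGFR"}
altered_methylation = {"BRCA1", "EGFR", "CDKN2A"}

# Genes flagged in both assays -- the "intersection" view of the data
hits = altered_expression & altered_methylation
print(sorted(hits))  # ['BRCA1', 'EGFR']
```

In practice the hard part is not the set operation but reconciling probe-to-gene mappings across the two platforms before any comparison is meaningful.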

As density has risen, the cost per probe has fallen, making these microarrays accessible even to people who budgeted for the traditional chips. Atul Butte, an assistant professor of medicine and medical informatics at Stanford, points out that “the cost of these arrays has dropped,” which has given researchers the ability to do an “enormous” number of gene expression experiments.

Add to that claims for better data, and you’ve got a slam dunk of a product. Peter Park, a biostatistician at the Harvard-Partners Center for Genetics and Genomics who recently published a paper comparing various analysis tools for array-CGH data, says that higher-density arrays are more reliable than ever because they build on lessons from the data problems discovered in earlier generations of chips. New arrays give “both more data and … better data,” Park says. “We’re more confident about the data.”

Scott Kahn, chief information officer at Illumina, says that by their very nature, high-density chips “are affording much cleaner data.”

Having so much content on each chip gives manufacturers more sampling space. Steve Lincoln, vice president of informatics at Affymetrix, says these arrays “provide the greatest level of robustness by using multiple independent (and deliberately different) probe features to contribute to each resulting data point.”

When you get right down to it, Park says, “if you want to [use] something other than the high-density oligo arrays, you should have a good reason.” But just because high density will be the way to go for most people doesn’t mean there aren’t very real reasons not to use these chips.

“As you get more and more probes on a particular array, the problems that we knew about have become much more compounded,” says Wendell Jones, senior manager of statistics and bioinformatics at Expression Analysis, an array services provider. He points to managing false positives and false negatives as two particular challenges that are more pronounced with higher-density chips.
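The compounding Jones describes is the classic multiple-testing problem: at a raw 0.05 cutoff, a million-probe chip yields on the order of 50,000 false calls by chance alone. One standard remedy — not named in the article or tied to any vendor’s software — is the Benjamini-Hochberg false discovery rate procedure, sketched here in plain Python with made-up p-values:

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return indices of tests called significant at FDR level alpha
    using the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    # Sort p-values, remembering their original positions
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = -1
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            cutoff = rank  # largest rank satisfying the BH condition
    return sorted(order[:cutoff]) if cutoff > 0 else []

# A raw 0.05 threshold would pass the first four p-values below;
# BH at FDR 0.05 keeps only the two strongest signals.
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.35, 0.40, 0.55, 0.70, 0.95]
print(benjamini_hochberg(pvals))  # [0, 1]
```

With millions of probes the same logic applies unchanged; only the correction becomes far more aggressive, which is exactly why uncorrected hit lists balloon on high-density chips.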

Also, if you’re someone who gets along by prefiltering your data, these arrays are just not for you, says Tom Downey, CEO of Partek, a data analysis software vendor. “In the past people have filtered out data they don’t think is important. With these high-density chips, everything [might be] important.”

Harvard’s Park says that platform differences — and even generational differences in the same platform — continue to plague array vendors. “Even Affymetrix data done over many years [is] very hard to put together,” he says. His team studied high-density arrays on behalf of a colleague who was interested in converting but was worried about being able to compare new data with data from older experiments. “We don’t want to throw away all the data that’s sitting in labs,” Park says. Affymetrix offers conversion charts for its arrays, he adds, but the algorithm is of limited use, particularly in its ability to sort by different variables.


Dealing with data

For researchers to get the most out of these arrays, they’ll need the right compute infrastructure, data repositories, and data analysis software. While you won’t necessarily need a supercomputer to process the output from your high-density arrays, these experiments can’t be handled by your econo-line desktop. Scott Kahn at Illumina says a “practical” computer for this kind of work will have at least 2 gigs of memory. Of course, the more complex your studies are, the more robust your resources need to be. “The real change, of course, is that everybody’s starting to see what numbers of samples you need to put in an association study to get a meaningful answer,” Kahn says, pinning that number around 5,000 or 10,000 samples. Each of those, of course, will be run on a microarray that has thousands — or hundreds of thousands — of probes. “You’re talking about a matrix,” Kahn says. “A larger number times a large number is a really large number.”
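Kahn’s matrix arithmetic is easy to make concrete. A back-of-the-envelope sketch, assuming 8-byte double-precision intensities and the illustrative study sizes quoted above:

```python
# Rough in-memory size of a samples-by-probes intensity matrix,
# assuming one 8-byte double per measurement. Figures illustrative.
def matrix_gigabytes(samples, probes, bytes_per_value=8):
    return samples * probes * bytes_per_value / 1e9

# A 10,000-sample association study on a 300,000-SNP chip:
print(round(matrix_gigabytes(10_000, 300_000), 1))  # 24.0 (GB)
```

That single raw matrix already dwarfs the 2 GB of memory Kahn calls practical for routine work, which is why association-scale studies push analysis off the desktop and into offline or batch processing.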

Once you’re confident in your infrastructure, the first step in any high-density microarray experiment should be checking in with data repositories, says Atul Butte at Stanford. “There is an enormous amount of data out there already,” he says, so researchers should “consider all of this data before you even start an experiment.” He says the number of gene expression measurements entered in databases has been growing at between 200 and 300 percent annually, and promises to increase even faster as scientists deposit data from these high-density experiments.

The most important step, of course, is actually analyzing your data. Until now, Excel has been the most popular program for sorting through chip data. But that tool tops out at 65,536 rows. With exon chips containing 1.4 million probe sets, says Stratagene’s Goncalves, “you’re now breaking Excel.”
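One way around the row ceiling — sketched here with synthetic probe-level data standing in for real chip output — is to stream the results file record by record rather than load it into a spreadsheet at all:

```python
import csv
import io

# Pre-2007 Excel tops out at 65,536 rows, but streaming a probe-level
# file record by record has no such ceiling. The CSV contents below
# are synthetic, standing in for real chip output.
rows = "probe_id,intensity\n" + "\n".join(
    f"probe_{i},{i % 100}" for i in range(100_000))

high = 0
for record in csv.DictReader(io.StringIO(rows)):
    if int(record["intensity"]) > 90:  # simple per-row filter
        high += 1

print(high)  # 9000 probes pass, counted without holding the table in memory
```

The same streaming pattern works on files of millions of rows, since only one record is ever resident at a time.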

Even if Excel could hold the data, it would no longer suffice, contends Nuwaysir at NimbleGen. “You need a tool that’s visual,” he says. “The data has so far leaped past the ability of your brain to pull a pattern out of just a list of information.”

Vendors are throwing their hats into this ring; in late January, for instance, Genomatix announced that it would soon make available a data analysis package designed to harness the information coming out of exon-array experiments. But most software offerings still fall short of the analysis power scientists will need to really make sense of this data onslaught. “Most of the gene expression software out there today is not optimized [for high-density chips] and can’t handle it,” says Tom Downey at Partek, which offers software compatible with Affy’s high-density and other chips. Downey says that Partek has seen increased demand for its product specifically from people looking for high-density solutions.

Slimming down has been one of the goals for algorithms aimed at this kind of data, says Kahn. “We’ve been trying to make the applications be extremely efficient; from a memory standpoint, not do calculations when you don’t have to,” he says. “We’ve started to do a lot more offline processing.” Kahn recommends that biologists bring informaticists on board early in planning experimental design to be sure that resources and analytical tools are considered from the outset.

Park notes that there are actually several software packages that are sophisticated enough to handle the data — but they’re not designed for biologists. “Most of them are available in R packages,” he says, but “most biologists can’t use R.” There’s a need to take the statistical prowess of those kinds of tools and include them in technologies that are accessible to scientists.
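As one example of the kind of routine that lives in those R packages, here is a naive Python sketch of quantile normalization, a staple preprocessing step in array analysis. It is an illustration of the technique, not a substitute for the Bioconductor implementations Park alludes to, and it handles ties crudely:

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length intensity lists
    (one per chip): each chip's k-th smallest value is replaced by
    the mean of the k-th smallest values across all chips."""
    n = len(columns[0])
    # Mean of the k-th smallest value across chips, for each rank k
    rank_means = [sum(sorted(col)[k] for col in columns) / len(columns)
                  for k in range(n)]
    out = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = rank_means[rank]
        out.append(normalized)
    return out

chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(chips))  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization both chips share an identical intensity distribution, which is precisely what makes cross-chip comparisons defensible — and precisely the sort of statistical plumbing biologists need wrapped in a friendlier interface.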

Researchers “do need to consider the software they’re going to use for these chips,” says Jason Goncalves, noting that traditional software is not designed to accommodate high-density chips. He points out that the complexity involved in data emerging from exon and other new-application arrays verges on pathway analysis — so software solutions for these arrays may be coming from a separate field. Atul Butte says Ariadne, Ingenuity, and Genstruct (he’s a scientific advisor for Genstruct) are among those who have products that may prove useful with these arrays. However, he says, “I would want something that’s even more comprehensive, even more in depth, and free — and we’re not there yet.”
