As experimental tools have continued to evolve and the datasets they produce have grown larger and larger, other research tools have refused to grow up. Such is the case with trusty statistical analysis techniques like classification trees and forest-based methods, which can be used to examine high-throughput genome data. But individual researchers without serious compute power who want to use these tools on the large datasets from genome-wide association studies, which can contain multiple gigabytes of SNP marker data, run into trouble.
"When people initially designed these classification trees, they didn't have this genotyping data in mind. Those tree methods began formally in 1984; they were designed for generic statistical modeling," says Heping Zhang, a professor of biostatistics at the Yale School of Public Health. "The problem is when you have genomic data, you're talking about several gigabytes of size."
To help researchers work around the memory limitations of their desktops and use these classic statistical tools on GWAS datasets, Zhang and his colleagues recently released Willows, a freely available software package that combines statistical tools with data compression algorithms to analyze gigantic SNP datasets. "If you use the existing software that implements trees or forests, you cannot do any genetic analysis because you cannot even get the data," Zhang says. "In genomic data there's millions of SNPs to be used; because the numbers [are] so high you cannot possibly get all the data into your existing desktop, so if you don't do anything, it will overwhelm the memory."
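To give a sense of why compression helps: a SNP genotype takes only a handful of values, so it can be stored in far fewer bits than a general-purpose number. The sketch below is a hypothetical illustration in Python, not Willows' actual scheme, which is not described here. It assumes genotypes coded as 0, 1, or 2 copies of an allele, plus a made-up code 3 for missing calls, and packs four of them into each byte. The function names are invented for the example.

```python
# Illustrative only: 2-bit packing of SNP genotype codes.
# This is an assumed encoding, not the method Willows uses.
MISSING = 3  # hypothetical code for a missing genotype call


def pack_genotypes(genotypes):
    """Pack genotype codes (0, 1, 2, or MISSING) four per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if g not in (0, 1, 2, MISSING):
            raise ValueError(f"invalid genotype code: {g}")
        # Each genotype occupies 2 bits at offset 0, 2, 4, or 6 in its byte.
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, count):
    """Recover the first `count` genotype codes from packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]


calls = [0, 1, 2, MISSING, 2, 2, 0]
packed = pack_genotypes(calls)
assert unpack_genotypes(packed, len(calls)) == calls
print(len(calls), "genotypes stored in", len(packed), "bytes")
```

Packed this way, each genotype takes a quarter of a byte rather than the 4 or 8 bytes of a typical integer, which is the kind of saving that can bring a multi-gigabyte dataset within reach of desktop memory.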
The paper that presented Willows was highly accessed, and Zhang confirms that there's good reason for the interest — after all, everyone can relate to these memory bottleneck issues for GWAS. "The response has been positive because this is a very practical and important issue, so if you don't deal with it you can't go anywhere unless you buy some ridiculous machine," he says. "What most people do is take the genomic data piece by piece to get into the computer and do the analysis, but that's not even close to ideal so people are desperately looking for a method to manage the memory and get the data into the computer for the analysis."