Experts agree that the ideal way to study individual genetic differences — and to figure out how those differences relate to disease — is to fully sequence each person's genome and use that as the foundation for tracking variation. But sequencing is a long way from being cheap enough to make that scenario feasible. Good thing, then, that advances are still being made in the SNP genotyping domain.
Over the past several years, the field of genotyping has exploded, and with an array of new technologies available to do the research, mining for single nucleotide polymorphisms has never been easier — or more affordable. Driving down the cost has meant packing more SNP content onto chips and, more generally, moving to multiplexed assays. However, issues such as data handling and integration, as well as correlating polymorphisms with phenotypes, remain hurdles to making SNP genotyping a diagnostic reality.
Today, SNP genotyping studies are typically divided into multiple stages depending largely on throughput needs. A whole genome scan fishes for gene regions of interest, and then follow-up candidate, or fine mapping, studies look at particular SNPs. Further allelic discrimination assays can be used to boil down the disease association to just a handful of SNPs.
For whole genome studies, Affymetrix and Illumina are two of the big vendors. Within the past several months, both have launched million-SNP chips, each of which scans 1 million representative SNPs chosen from HapMap data. Illumina's Human1M chip builds off several previous products, including the HumanHap300-Duo, HumanHap550 and 550-Duo, and the HumanHap650Y, all launched within the past two years. "[There] is really a market demand, where the market demands more content and the content here [is] defined as better coverage of the genome," says Carsten Rosenow, Illumina's senior marketing manager of DNA analysis.
The two highest-density chips on the market were populated with slightly different content. The Illumina 1M probes about 950,000 tag SNPs and about 100,000 additional non-HapMap SNPs. The 1M also includes about 260,000 copy number variation probes, chosen from both new and reported copy number polymorphic regions. Meanwhile, Affymetrix's Genome-wide Human SNP Array 6.0 also contains both SNP and CNV content, in this case 906,000 SNPs and 946,000 CNV probes.
It seems the field never stops moving, though, and many vendors say that up-and-coming tools will focus on increasing copy number content, multiplexing, and building out existing data analysis software, especially algorithms for CNV analysis. According to Affymetrix genotyping specialist Jessica Tonani, "Due to the fact that we're really hitting diminishing return on content, we're really going to begin focusing on the ease of use, and improvements on allowing customers to process more arrays and get easier information off of more arrays." Examples include creating different applications for analysis, increasing automation, and ironing out workflow concerns with customers.
"At this point, your return on more information by adding more SNPs is asymptotic," Illumina's Rosenow says. Illumina's future focus will be on CNV content and next-gen sequencing, since "once we can genotype [people] individually, I think we have the best power to identify disease-associated genes."
Choosing a Platform
Whole genome association studies have been flourishing for the past few years, but this year they finally made their mark in the mainstream news, thanks to studies in diabetes and psychological illnesses, to name just a few. A quick search in PubMed for "genome-wide association study" turned up 210 hits for the first half of 2007, compared to 287 in all of 2006 and 236 in 2005. Many of these initial studies have focused on cardiovascular and metabolic diseases or common cancers, and a number of them have been sponsored by large consortia like the Wellcome Trust or public-private partnerships like NIH's GAIN.
Jeanette Erdmann, chief of the molecular genetic laboratory at the Department of Cardiology of the University of Luebeck, partnered with the Wellcome Trust Case Control Consortium in the summer of 2006 to run replication scans of original data predicting gene association for coronary artery disease. When she and her colleagues planned the study, Erdmann says, the Affy 500K was the best on the market, and also was what Wellcome had used in its initial GWA studies. "This consortium already started with the Affymetrix 500K array, therefore this decision was really easy," she says of choosing the Affy platform.
Bob Welch, director of operations at the National Cancer Institute's Core Genotyping Facility in Bethesda, Md., uses GWA studies to find common variants associated with cancer. Some of the facility's recent research has looked for SNPs in breast cancer, prostate cancer, and non-Hodgkin lymphoma. Currently, Welch's lab employs Illumina's HumanHap550 and 1M chips as well as Affy's 5.0 and 6.0 chips. "Every different assay has different content, and the content is usually based on how well that panel covers the genome," he says.
The tag SNP approach, developed through the HapMap initiative, exploits the fact that stretches of the genome tend to be inherited together in blocks, so neighboring SNPs are strongly correlated with one another. Genotyping a relatively small number of carefully chosen SNPs can therefore yield information on a much larger group of SNPs, letting scientists perform fewer experiments and spend less money without sacrificing the utility of the data. Illumina has designed all of its chips by choosing SNPs from the HapMap based on linkage disequilibrium data, whereas Affy only recently incorporated tag SNPs, with its 6.0 chip.
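The tagging idea can be illustrated with a small greedy sketch. This is purely illustrative: the r-squared-on-genotypes shortcut and the 0.8 threshold are common conventions, not any vendor's actual selection algorithm, and the SNP names are made up.

```python
# Illustrative greedy tag SNP selection from pairwise r^2 (a standard
# measure of linkage disequilibrium). Genotypes are coded 0/1/2
# (minor-allele counts); all names and thresholds here are hypothetical.

def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2)) / n
    v1 = sum((a - m1) ** 2 for a in g1) / n
    v2 = sum((b - m2) ** 2 for b in g2) / n
    if v1 == 0 or v2 == 0:  # monomorphic SNP: no LD information
        return 0.0
    return cov * cov / (v1 * v2)

def greedy_tag_snps(genotypes, threshold=0.8):
    """Pick tag SNPs so every SNP has r^2 >= threshold with some tag.

    genotypes: dict mapping SNP id -> genotype vector.
    """
    untagged = set(genotypes)
    tags = []
    while untagged:
        # Choose the SNP that tags the most currently untagged SNPs.
        best, best_covered = None, set()
        for snp in untagged:
            covered = {s for s in untagged
                       if r_squared(genotypes[snp], genotypes[s]) >= threshold}
            if len(covered) > len(best_covered):
                best, best_covered = snp, covered
        if best is None:  # only monomorphic SNPs left; nothing can tag them
            break
        tags.append(best)
        untagged -= best_covered
    return tags

# Two SNPs in perfect LD plus one independent SNP need only two tags.
genos = {
    "rs1": [0, 1, 2, 0, 1, 2],
    "rs2": [0, 1, 2, 0, 1, 2],  # identical to rs1: r^2 = 1
    "rs3": [2, 0, 1, 1, 0, 2],  # uncorrelated with rs1/rs2
}
print(greedy_tag_snps(genos))
```

In this toy example, one tag covers the rs1/rs2 pair and a second covers rs3, so two assays report on three SNPs; production tools such as Haploview's Tagger apply the same principle with haplotype-based LD and far larger marker sets.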
Hakon Hakonarson, from the Children's Hospital of Philadelphia, plans to genotype 100,000 children in the search for associations to common childhood diseases. Since he and his staff started in July 2006, they've already genotyped about 30,000 children using Illumina's platform. A few months ago, they added Affy's 6.0 chip because "there's a lot of data out there on the Affymetrix platform and some of the people who had expressed interest in working with us … had either done Affy before, or had access to Affy data [and] they didn't really want to change platforms," Hakonarson says.
While both platforms have very good throughput, Hakonarson says, his team's original choice of Illumina was based on workflow issues. He finds the Illumina workflow to be less complicated, and the double-chip setup of the Affy 500K array meant twice as much work. "You needed almost twice the personnel for that, and there's always high risk of human error the more people you have," he says. Now that his group is using technology from both vendors, he says, "my feeling is that these platforms are relatively comparable in the information content that you get, meaning that it's not exactly the same, but the information is so large that you will get the information you need from using one or the other."
CNV Joins the Fray
It wasn't long ago that SNP-seeking scientists realized they could tremendously increase the power of their data by merging it with information about copy number variation, which today is thought to play even more of a role in disease onset than SNPs do. While for years people have been using Affy's and Illumina's SNP chips to extract copy number data, the newer chips have greatly expanded coverage of this content. Illumina has a CNV-only chip, and its 1M chip covers both SNPs in regions of known copy number variation and dedicated CNV probes. Affy's 5.0 chip contains 420,000 non-polymorphic probes and the 6.0 chip has 946,000 such probes. Both vendors' probe sets are taken from the Toronto Database of Genomic Variants, which catalogs regions of known copy number variation.
"The idea is that in one experiment on these, you'll be able to readily capture SNP and CNV-type data," says Steve Scherer of the Hospital for Sick Children, whose lab hosts the Database of Genomic Variants. While there's a lot of interest in using current chips to perform whole-genome CNV studies, "most people in the field feel that it will work in some instances but probably there's not enough coverage yet to have a robust screen for a whole genome CNV association study," Scherer says. "We probably need to design specialty arrays that really target the characteristics of CNVs at a higher specificity."
Still, the newer chips are definitely better than the older ones, which weren't designed to do copy number analysis, says Lars Feuk, a postdoc in Scherer's lab and co-author on the 2004 paper that first identified the importance of genomic structural variation in disease susceptibility. "How much better they are, and how well they compare to CGH arrays that are specifically designed for CNV calling, is not known yet," he says.
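The principle behind pulling copy number out of array data can be sketched in a few lines. This is a deliberately simplified illustration: the thresholds and the moving-average smoother are hypothetical choices, and real CNV pipelines add normalization and proper segmentation (hidden Markov models or circular binary segmentation, for instance).

```python
# Simplified sketch of log2-ratio copy number calling from probe
# intensities. All parameter values are illustrative, not those of
# any actual vendor pipeline.
import math

def log2_ratios(sample, reference):
    """Per-probe log2 ratio of sample vs. reference intensity."""
    return [math.log2(s / r) for s, r in zip(sample, reference)]

def call_copy_number(ratios, window=3, gain=0.3, loss=-0.3):
    """Smooth ratios over a small window, then threshold each probe."""
    calls = []
    for i in range(len(ratios)):
        lo = max(0, i - window // 2)
        hi = min(len(ratios), i + window // 2 + 1)
        mean = sum(ratios[lo:hi]) / (hi - lo)
        if mean > gain:
            calls.append("gain")
        elif mean < loss:
            calls.append("loss")
        else:
            calls.append("normal")
    return calls

# A duplicated region roughly doubles intensity: log2(4/2) = 1 per probe,
# so calls flip from "normal" to "gain" across the boundary.
ratios = log2_ratios([2.0, 2.0, 2.0, 4.0, 4.0, 4.0], [2.0] * 6)
print(call_copy_number(ratios))
```

The limitation Scherer and Feuk describe falls directly out of this picture: with too few probes over a variable region, the smoothed ratio never clears the threshold, which is why denser, CNV-targeted designs are expected to give more robust calls.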
Although CGH chips have been used to do broad-survey CNV detection, there is still much more to be learned about the diversity of this type of structural variation. "It is not likely that existing microarray products, whether they contain SNP-based content and/or CNV-based content, will adequately capture the spectrum of genome variation that will enable advances in complex disease association studies," says Peggy Eis, director of the array CGH business for Roche NimbleGen. "However, it is anticipated that microarrays focused on the CNV content from studies currently in progress may provide new insights."
Finding a data analysis algorithm to help analyze the CNV calls has proven tricky, but Scherer says he's fairly satisfied with the pace of tool development so far. "I don't think anybody would have imagined that they would have come out with these arrays so soon," he says. "I think the field is generally quite happy, but now we're also realizing that there's a lot more that needs to be tested for."
It's not news to anyone that the large amount of data continues to present challenges. Some of the biggest issues today are around storing, managing, and integrating the data with other large-scale biology datasets.
"The software is going to have to scale to very large [studies], and it's going to have to tie together the genetic information with the information from proteins and RNA and things like that," says Tom Downey, president of Partek, a vendor of scientific data analysis software. Partek focuses on creating one product that can do all types of analysis, and Downey predicts that SNP data will be increasingly integrated with other types of genomic data.
UCLA's Jenny Papp is not particularly impressed with the databases that come with the genotyping platforms, so she suggests building and managing your own. "I think what you really need to do is pull your data out of all of these different systems and then upload it into your lab database," Papp says. "I don't think it's a good idea to rely on the software and the databases that come with the platforms."
The sheer amount of data coming out of many of the whole genome studies is daunting, and managing that requires more and more bioinformatic skill — and staff. One big bottleneck, says Hakonarson, who has 20 billion genotypes in his database, is storing all that data and moving it around. "You need an extremely powerful infrastructure to store that information in such a way that you can access it upon need."
While SNP genotyping has made significant headway in the past year, largely thanks to the ever-evolving toolsets, data analysis is ultimately a means to an end, and that end, at least in part, is developing predictive diagnostics. Some believe that separate chips will eventually be marketed for higher-accuracy calling and diagnostic use in the clinic.
Over the next several years, Hakonarson sees the field continuing to do whole genome scans, but leveling off after a while and focusing on candidate and resequencing studies. Eventually, he predicts, companies will create focused arrays for various diagnostic purposes.
Jeanette Erdmann sees the GWA study as the "gold standard for the next two to five years." She sees a need for better custom arrays, better gene-centric coverage, and improved ways to detect gene-gene interactions. In the end, pinpointing function is key to making use of association data — and SNPs will have to stand up to higher levels of scrutiny. "Today, researchers can publish results of a genome-wide association [study] in very high-ranking journals without having an idea of the functional link between the associated gene [or] region and the disease," Erdmann says. "This will change in the very near future."
The View from the Lower-Throughput Realm
A number of vendors have staked their claim with tools for the medium- to low-throughput markets, including Sequenom's MassArray, ABI's TaqMan and SNPlex, Beckman's SNPstream, and Biotrove's OpenArray platforms. While these are certainly not the only vendors offering such tools, what they have in common is that they focus on being a source for secondary scans — their tools offer simpler, quicker, more flexible, and less costly assays to genotype small numbers of SNPs as a follow-up to genome-wide association studies.
Sequenom's SNP assay is primarily used for confirmation analysis. Based on PCR and single-base primer extension, the readout is mass spec, which offers a "very quantitative and sensitive aspect of detection, which fluorescence-based methods [do] not necessarily always provide," says Dirk van den Boom, senior director of molecular applications at Sequenom. "As for the future," adds van den Boom, "we're working very actively … to simplify our assays, to increase multiplexing, to make it easier for people to analyze the large amounts of data."
Applied Biosystems has leveraged its large installed base of capillary electrophoresis machines to take SNPlex to market. Combining a ligation-based event, PCR, and capillary electrophoresis, the SNPlex platform is used for medium-throughput screening, "so in this moderate level SNPlex fits in because it's highly multiplexed to drive down the cost of the study itself," says Tony Dodge, real-time PCR and genetic systems field scientist at ABI. It can multiplex up to 48 SNPs in one test; however, "the hurdle that people deal with in multiplex technologies is trying to get many SNPs to work together in the same tube," Dodge says. "The benefit is that you're getting lower cost per genotype; the cost is multiplex technologies tend to have higher failure rate than single-plex technologies, like TaqMan."
With so many tools on the market, scientists use a number of criteria to determine which technology will best match their research efforts.
Katia Sol-Church, director of the Biomolecular Core Lab at Alfred I. duPont Hospital for Children in Delaware, currently researches Costello syndrome, a rare congenital disorder in children. She typically uses direct sequencing or pyrosequencing for mutation analysis, and ABI's TaqMan for allelic discrimination. Previously, she was doing whole genome mapping for osteoporosis, and she used Illumina's genotyping service instead of purchasing the equipment for her small lab. "It doesn't make sense for us to buy a very expensive piece of equipment which is not going to be used at capacity," Sol-Church says. "If it's a technology that you use only once a month, then you may have a lot of variability in the technique, which has nothing to do with the biological variability that you're looking for."
Jenny Papp, director of UCLA's genotyping and sequencing core, has a smorgasbord of instruments and platforms on hand for fine mapping and candidate gene studies: ABI's SNPlex and TaqMan assay, pyrosequencers, Beckman's CEQ, and Roche's LightCycler. As for which platform she chooses to run, "what platform is cost effective for what size of study is really the primary consideration," Papp says. For medium-sized studies, she uses the SNPlex, and for low-throughput, TaqMan.
Many variables can lead to genotyping errors and loss of data, including multiplexing in an attempt to get higher throughput, using machines infrequently, and variations in the consistency of the DNA. "When a customer comes in with a 384-well plate and the [DNA] quality and consistency is all over the place, then we end up losing a lot of data," Papp says. "It's better to have a whole plate full of poor quality DNA than to have a plate that's all spotty with some good quality and some bad quality."