By Julia Karow
Many ongoing studies aim to pinpoint genetic variations that play a role in human disease by sequencing the exomes and genomes of thousands of patients and controls. But several research teams that recently tested the methods used in these projects caution that the methods have technical limitations that may leave important genetic variants undiscovered.
A team at Cold Spring Harbor Laboratory evaluated early versions of the Agilent and Roche NimbleGen in-solution exome capture kits that have been used in many studies, comparing them with each other and with whole-genome sequencing, and found that they do not cover many functionally important regions of the genome, largely because they do not target them.
The CSHL study is among a handful of exome-capture comparisons published within the last month that have all come to similar conclusions. One of those studies, led by Mike Snyder's group at the Center for Genomics and Personalized Medicine at Stanford University, found that enrichment failed for some regions of the exome, but also determined that exome sequencing picked up many important variants that were not detected by whole-genome sequencing (IS 9/27/2011).
The same Stanford team has also compared whole-genome sequencing by Illumina and Complete Genomics and found that each platform picks up some variants that the other does not, and that they both generate thousands of sequencing errors per genome.
Some of these limitations can be overcome by using newer versions of the target enrichment kits, as well as by combining more than one sequencing platform in whole-genome sequencing studies, the scientists said.
Casting the Net for the Exome
The CSHL researchers published their study online in Genome Biology last month, one of three exome comparisons published in the same issue (the others were led by researchers at BGI in China and the Institute for Molecular Medicine Finland at the University of Helsinki).
The CSHL team focused its comparison on the first version of the Roche NimbleGen SeqCap EZ Exome Library SR and the Agilent SureSelect Human All Exon Kit. Though both companies have since increased the target size of their exome kits, the results of the comparison are still relevant because many research studies have been conducted with the original kits, said Dick McCombie, the senior author of the article and a professor at CSHL. "There have been a lot of papers published, and there are a lot of kits with these early reagents floating around that people are still using," he said.
His team conducted the comparison because they wanted to see how well the two exome platforms covered a list of genes they were interested in for disease studies. They did not include Illumina's TruSeq exome kit because it was not available at the time, but they are testing it at the moment, he said.
While the NimbleGen kit they evaluated targets about 26.2 megabases of sequence, the Agilent kit covers about 37.6 megabases. Both cover most of the Consensus Coding Sequences, or CCDS, which comprise about 30 megabases and represent highly curated protein-coding exons that agree between several databases.
The researchers used both kits to pull down the exomes of two HapMap samples that had previously undergone whole-genome sequencing at high coverage by the 1000 Genomes Project's trio pilot study. They also captured the exomes of four additional trio HapMap samples using only the NimbleGen kit. All exomes were sequenced in one or more lanes on the Illumina Genome Analyzer, using 76-base paired-end reads.
The NimbleGen kit, they found, captured its targets more effectively, requiring less sequence data to reach saturation than the Agilent kit. However, Agilent captured about 90 percent of CCDS annotations at 20X or greater depth, while NimbleGen covered only about 85 percent.
Compared to exome capture, whole-genome sequencing by the 1000 Genomes Project covered a greater fraction of the CCDS exons at 20X depth — more than 95 percent. More importantly, because of their limited targets, both capture platforms missed about half of the Reference Sequence collection, or RefSeq, a much larger set of functionally important genomic regions than CCDS. RefSeq comprises about 67 megabases of sequence and includes not only protein-coding exons but also 5' and 3' untranslated regions and non-coding RNAs.
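The "percent covered at 20X" figures above can be illustrated with a minimal sketch. It assumes per-base read depths are already available as a list of integers; the function name and the example depths are hypothetical and are not taken from the CSHL study.

```python
def fraction_covered(depths, min_depth=20):
    """Return the fraction of positions whose read depth is at least min_depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Hypothetical per-base depths for a tiny 10-base target region.
example_depths = [35, 28, 22, 20, 19, 5, 0, 41, 30, 25]

print(f"{fraction_covered(example_depths):.0%} of bases at 20X or greater")
```

In practice the depths would come from an alignment of the sequencing reads, and the calculation would be restricted to the annotation set of interest, such as CCDS or RefSeq.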
Interestingly, the researchers found that "quite a few" medically important genes were either not targeted or not sufficiently covered by both NimbleGen and Agilent. These included two genes they are particularly interested in: CACNA1C, a candidate gene for bipolar disorder, and MLL2, a gene implicated in leukemia.
Both vendors have since increased the target size of their exome kits, and "that clearly will make some difference, particularly with the CCDS coverage but also with some of the RefSeq coverage," McCombie said. For example, Agilent now offers the SureSelect Human All Exon 50 Mb kit, which includes additional exons and non-coding RNAs, and Roche NimbleGen's SeqCap EZ Exome Library v2.0 covers 44 megabases. NimbleGen also said recently that it will soon launch v3.0 of its kit, which will cover 64 megabases of exons and miRNA (IS 9/27/2011).
Nevertheless, the results provide new fodder for the continuing debate about the benefits of exome versus whole-genome sequencing. According to McCombie, it is still about eight to 10 times cheaper to sequence a human exome than a genome, allowing researchers to study many more samples. He said the capture and sequencing cost for several thousand exomes has reached "well under" $1,000 per sample and is "approaching" $500 per sample, while the cost of whole-genome sequencing at 30X coverage by Illumina is about $4,000 per genome today.
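The trade-off McCombie describes can be made concrete with a rough sketch. The per-sample costs are the figures quoted above; the $1 million budget and the function name are hypothetical and serve only as an illustration.

```python
EXOME_COST = 500     # dollars per sample, the "approaching" figure quoted above
GENOME_COST = 4_000  # dollars per genome at 30X, as quoted above


def samples_for_budget(budget, cost_per_sample):
    """Return how many whole samples a budget covers at a given per-sample cost."""
    return budget // cost_per_sample


budget = 1_000_000  # hypothetical project budget in dollars

exomes = samples_for_budget(budget, EXOME_COST)
genomes = samples_for_budget(budget, GENOME_COST)
print(exomes, genomes, exomes / genomes)
```

At these prices the ratio is eight to one, the low end of the range McCombie cites.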
However, some disease-associated variants may not be found in the areas targeted by the exome kits, and structural variants, such as copy number variants, translocations, and gene fusions, are much better analyzed by whole-genome sequencing. Those limitations, particularly of the earlier exome kits, "make it very difficult to interpret negative results," McCombie said. "People need to be aware of that when they are evaluating the cost benefit of exome versus genome."
The Cold Spring Harbor researchers are currently involved in several large exome sequencing projects that target more than 1,000 exomes, for example a collaborative study with Johns Hopkins University and the University of Iowa to sequence the exomes of patients with bipolar disorder and controls, and another study that focuses on schizophrenia.
At the moment, the researchers use NimbleGen's exome kit in their data production, supplemented with some custom targets of additional regions of interest. "Everyone is moving more towards custom [capture]," McCombie said, and NimbleGen "was pretty fast" in making custom capture available to them. However, Agilent and Illumina also offer custom enrichment kits.
Illumina and Complete Genomics
Mike Snyder's group at Stanford also recently evaluated several exome capture methods and found, among other results, that exome sequencing detected some variants that whole-genome sequencing missed, because exome sequencing provided better coverage of the exome regions.
But his group also recently compared whole-genome sequencing by Illumina and Complete Genomics, a study that has yet to be published. Snyder recently provided a preview of the results in a podcast interview with Mendelspod.com.
For their comparison, the researchers sequenced the same human sample using both technologies. Each platform, they found, has an error rate of about 1 in 100,000 bases, meaning that every genome will have about 60,000 sequencing errors. "That's just a huge background on which we are trying to interpret the causative variants of a disease," he said.
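The arithmetic behind the 60,000 figure can be sketched as follows. The article does not state the genome size used; the sketch assumes a diploid human genome of roughly 6 billion bases, which is the value consistent with the numbers quoted.

```python
BASES_PER_ERROR = 100_000             # one error per 100,000 bases, as quoted above
DIPLOID_GENOME_BASES = 6_000_000_000  # assumption: about 6 billion bases per diploid genome


def expected_errors(genome_bases, bases_per_error):
    """Return the expected number of sequencing errors across a genome."""
    return genome_bases / bases_per_error


errors = expected_errors(DIPLOID_GENOME_BASES, BASES_PER_ERROR)
print(f"about {errors:,.0f} expected errors per genome")
```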
The high error rate also means that researchers might spend only on the order of $4,000 to $5,000 to sequence a genome, but "tens of thousands of dollars trying to find out which mutations are real and which ones are sequencing errors."
The solution, he said, might be to sequence each genome with multiple technologies. About 90 percent of the variants called by Illumina and Complete Genomics overlapped, and those variants had "incredibly high accuracy," according to Snyder, although variants of interest should always be confirmed by a "more conventional method."
So as the cost of sequencing drops in the future, "the appropriate thing to do will be to sequence genomes with two different technologies," he said.
Because the error profiles for the two platforms differ, "what can get missed by one technology gets picked up by the other," Snyder said.
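The idea of cross-checking two platforms can be sketched as a comparison of two variant call sets. This is an illustration only, not the Stanford group's method: the variant tuples are invented, the function name is hypothetical, and the article does not say how the 90 percent overlap was computed (here overlap is measured against the union of both call sets).

```python
def compare_call_sets(calls_a, calls_b):
    """Split two variant call sets into shared and platform-specific calls."""
    a, b = set(calls_a), set(calls_b)
    shared = a & b
    union = a | b
    return {
        "shared": shared,
        "only_a": a - b,
        "only_b": b - a,
        # Fraction of all distinct calls that both platforms agree on.
        "concordance": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical calls, each written as (chromosome, position, reference, alternate).
platform_a = {("chr1", 1000, "A", "G"), ("chr2", 2000, "C", "T"), ("chr5", 500, "G", "A")}
platform_b = {("chr1", 1000, "A", "G"), ("chr2", 2000, "C", "T"), ("chr7", 700, "T", "C")}

result = compare_call_sets(platform_a, platform_b)
print(len(result["shared"]), "shared calls; concordance", result["concordance"])
```

Calls in the shared set correspond to the high-accuracy overlap Snyder describes, while calls unique to one platform are the candidates that need confirmation by another method.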
For example, one of the platforms but not the other detected a mutation in a telomerase gene that is predicted to cause aplastic anemia, and that mutation was confirmed by another method to be real. Though the individual whose genome was sequenced does not suffer from the disease, the result led to subsequent tests and future monitoring for the disease.
There is still a small percentage of the genome that cannot be sequenced by any currently available platform — for example repeat regions and other complex regions that may contain important genes. "We do need to fill in these regions as well," Snyder said, adding that several companies are working on ways of doing so, for example through longer sequence reads.
"How fast, I don't know, but I think there is a lot of pressure to make that happen quickly," he said. "People need this right away, and companies that provide accurate sequences will be viewed very favorably by the consumers."