By Julia Karow
This article, originally posted June 25, has been updated to include comment from other researchers.
SAN FRANCISCO — Current versions of human whole-exome capture products from Roche NimbleGen and Agilent do not include or don't efficiently capture many medically relevant genes, so researchers using them in ongoing large-scale disease studies might miss important results.
That is the conclusion of a pilot project by researchers at SAIC Frederick and the National Cancer Institute that evaluated exome-capture methods in conjunction with several sequencing platforms. The results were presented at Cambridge Healthtech Institute's Beyond Sequencing conference here last week.
Both NimbleGen and Agilent are aware of the problem and are currently redesigning their capture products to include missing genes and improve the capture efficiency, according to Kevin Jacobs, director of scientific operations of bioinformatics at the NCI's core genotyping facility.
His group discovered the omission of the genes during a pilot project designed to evaluate several exome-enrichment and sequencing strategies. He and his colleagues want to use these strategies to analyze samples from a collection of families with a high incidence of various cancers that are suspected to have a genetic cause.
For their evaluation, the researchers selected NimbleGen's 2.1M Human Exome Array and Agilent's SureSelect Human All Exon kit. They sequenced 10 samples enriched with NimbleGen arrays on the 454 GS FLX with Titanium chemistry, using four runs per sample; 10 samples enriched with SureSelect on the Applied Biosystems SOLiD system, using one quadrant per sample; and two samples enriched with SureSelect on the Illumina GAIIx, using one lane per sample.
They analyzed what percentage of coding sequences contained in the Reference Sequence, or RefSeq, database was covered at more than eight-fold depth per base on average in each sample — which they defined as sufficient coverage — and found this number to be about 65 percent for the NimbleGen-enriched samples, 75 percent for the Agilent-enriched samples sequenced on SOLiD, and 71 percent for the Agilent-enriched samples sequenced on the Illumina.
The numbers looked even worse when the researchers analyzed how many genes had more than 90 percent of their bases sequenced at an average coverage depth of at least eight-fold: 42 percent of genes for NimbleGen-enriched samples, 55 percent of genes for Agilent-enriched samples sequenced on SOLiD, and 45 percent of genes for Agilent-enriched samples sequenced on the Illumina platform met that requirement.
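To make the two metrics concrete, the following is a minimal sketch of how they could be computed from per-base depth values. It is not the NCI group's pipeline: the function names, the toy gene names, and the depth values are hypothetical, and the sketch assumes a base counts as covered only when its depth is strictly greater than eight (the article describes the cutoff both as "more than" and "at least" eight-fold).

```python
from typing import Dict, List

MIN_DEPTH = 8             # assumption: a base is "covered" when depth > 8
MIN_GENE_FRACTION = 0.90  # a gene qualifies when more than 90% of its bases are covered


def count_covered(depths: List[int], min_depth: int = MIN_DEPTH) -> int:
    """Number of bases whose depth exceeds the threshold."""
    return sum(1 for d in depths if d > min_depth)


def percent_coding_bases_covered(gene_depths: Dict[str, List[int]]) -> float:
    """Percentage of all coding bases (pooled over genes) that are covered."""
    total = sum(len(depths) for depths in gene_depths.values())
    if total == 0:
        return 0.0
    covered = sum(count_covered(depths) for depths in gene_depths.values())
    return 100.0 * covered / total


def percent_genes_well_covered(gene_depths: Dict[str, List[int]]) -> float:
    """Percentage of genes with more than 90% of their bases covered."""
    genes = [depths for depths in gene_depths.values() if depths]
    if not genes:
        return 0.0
    well = sum(
        1 for depths in genes
        if count_covered(depths) / len(depths) > MIN_GENE_FRACTION
    )
    return 100.0 * well / len(genes)


# Hypothetical per-base depths for four made-up genes.
toy_depths = {
    "GENE_A": [20] * 10,            # every base covered
    "GENE_B": [20] * 5 + [2] * 5,   # half the bases covered
    "GENE_C": [0] * 10,             # not captured at all
    "GENE_D": [9] * 19 + [8],       # 19 of 20 bases covered
}

base_pct = percent_coding_bases_covered(toy_depths)
gene_pct = percent_genes_well_covered(toy_depths)
print(f"coding bases covered: {base_pct:.1f}%")
print(f"genes well covered:   {gene_pct:.1f}%")
```

The per-gene figure is lower than the per-base figure in this toy data for the same reason it is in the study: a gene with even a modest stretch of poorly covered bases fails the 90 percent requirement.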
Some of this lack of coverage can be explained by GC-rich exons, which are difficult to sequence, Jacobs said, but much of it results from the fact that certain genes are either not targeted at all or only poorly captured by the probes in the capture products.
One reason for the spotty capture is that both the current NimbleGen and the Agilent products are based on the Consensus CDS or CCDS database, which contains a core set of human and mouse protein-coding regions with high-quality annotations, but which lacks a number of annotated human genes that are contained in the RefSeq database.
Jacobs' analysis showed that as many as 23 percent of human coding sequences in RefSeq were not targeted by probes on the NimbleGen array, and that 17 percent were not targeted by Agilent SureSelect probes.
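A comparison of this kind amounts to checking annotated coding regions against the regions a capture design targets. The sketch below is one hedged way to do that, not the method Jacobs used: the interval coordinates are made up, the helper names are hypothetical, and it assumes a coding region counts as "not targeted" when no target interval overlaps it at all.

```python
from typing import List, Tuple

# (chromosome, start, end), with half-open coordinates: start inclusive, end exclusive
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals lie on the same chromosome and share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def percent_untargeted(coding: List[Interval], targets: List[Interval]) -> float:
    """Percentage of coding regions that no target interval overlaps."""
    if not coding:
        return 0.0
    untargeted = sum(
        1 for region in coding
        if not any(overlaps(region, target) for target in targets)
    )
    return 100.0 * untargeted / len(coding)


# Hypothetical coding regions and capture targets.
toy_coding = [
    ("chr1", 100, 200),
    ("chr1", 300, 400),
    ("chr2", 100, 200),
    ("chr2", 500, 600),
]
toy_targets = [
    ("chr1", 150, 250),
    ("chr2", 90, 110),
]

untargeted_pct = percent_untargeted(toy_coding, toy_targets)
print(f"coding regions not targeted: {untargeted_pct:.1f}%")
```

The brute-force comparison is fine for a toy example; a genome-scale version would more likely sort the intervals or use an interval index.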
Missing from the designs are medically important genes such as insulin, the ABO blood group genes, and genes involved in genetic diseases such as xeroderma pigmentosum, as well as genes for transporters, transcription factors, and complement components.
Other genes that have been implicated in disease are covered poorly by the capture probes, for example the apolipoprotein E gene, several HLA genes, and a number of cancer genes.
Jacobs said he was surprised that many groups he talked to that have been using the NimbleGen and Agilent whole-exome capture products had not realized that these genes were missing from the designs. There is a danger, he said, that large-scale exome-sequencing studies — for example, those trying to pinpoint the causes of Mendelian diseases — using the current versions could be missing important results.
"There certainly are many genes that we know to be important — and probably many others whose importance we have yet to realize — that are not represented in CCDS, and therefore were not targeted, or are targeted but not covered well," Jay Shendure, an assistant professor of genome sciences at the University of Washington, agreed. "At the same time, we should not lose sight of how far this field has moved in a relatively short period of time, and a little patience is warranted."
Shendure, who is involved in several exome-sequencing projects, added that he assumes the next generation of capture products will correct these deficiencies.
Jacobs "is right to be concerned that important information in regions which are systematically under-covered will be missed by all sequencing projects," said Matthew Bainbridge, a researcher at Baylor College of Medicine who has been developing exon-capture methods. "An important advancement in capture technology will be to do better in these sub-optimally covered regions."
To that end, Bainbridge and his colleagues have already devised a set of new exome-capture reagents "that better represent the full knowledge of gene-coding regions in the genome," and plan to make their designs public soon, in conjunction with a scientific publication, he said.
Bainbridge also pointed out that many genomic regions "are simply difficult to sequence," due to low complexity or high GC content, and would likely not be covered well, "whether using capture or whole-genome sequencing."
The final sequence coverage, he explained, is "a confluence of many factors," including how well a region can be captured or sequenced, and how well the sequence reads can be mapped.
Existing capture products are still useful, though, if the genes researchers are interested in are well covered by capture probes, Jacobs said. "Even if your genes are not ideally covered, you could still get useful data."
But even if the companies improve their exome capture products in the near future to make them more complete, he said, their life cycle might be short, and down the road, they might be eclipsed by whole-genome sequencing as it becomes less expensive.
Even if they had whole-genome data, researchers could still focus their analysis on exons, he said, but they would have the entire genome as a "fallback" option if they could not find the answer they were looking for in the exome.