By Julia Karow
Researchers at the University of Washington have demonstrated that they can pinpoint the gene responsible for a Mendelian disorder by sequencing the exomes of a small number of affected individuals.
One application of the approach is to study rare diseases where the underlying genetic cause is unknown. According to a study describing the method, published online two weeks ago in Nature, the strategy could also be used to uncover genes involved in diseases with more complex genetics, although those studies will require larger sample sizes.
The paper is one of the first outcomes of the Exome Project, a two-year, $12 million technology development and research effort that got underway last year (see In Sequence 11/7/2008). The project is sponsored by the National Heart, Lung, and Blood Institute and jointly run by NHLBI and the National Human Genome Research Institute. It involves three research teams, led by the Broad Institute, Harvard Medical School, and the University of Washington.
Starting this fall, the three groups plan to sequence the exomes of small numbers of patient samples, which are currently being selected from nominations by the project's steering committee.
The Nature study is "a beautiful paper clearly showing the feasibility of sequencing entire human exomes, and the feasibility of plucking out from the many variants identified those contributing to a disease influenced by highly penetrant variants," said David Goldstein, director of the Duke Institute of Genome Sciences and Policy's Center for Human Genome Variation, in an e-mail message. "It emphasizes clearly that human discovery genetics from this point will depend largely on the sequencing of entire exomes and entire genomes."
"The thing I am really excited about is showing how you can apply [exome sequencing] to learn something" about disease, Jay Shendure, an assistant professor of genome sciences at the University of Washington and the senior author of the study, told In Sequence.
In the study, the researchers sequenced the exomes -- each approximately 27 megabases of sequence -- of eight HapMap samples, including four Yoruba, two East Asians, and two European-Americans, as well as four unrelated individuals suffering from Freeman-Sheldon syndrome, a rare, autosomal dominant disorder that is known to be caused by mutations in the MYH3 gene.
According to Shendure, the scientists chose to include the HapMap samples because extensive genotyping and sequence data were already available for them, which allowed the team to assess the quality of its exome data. Last year, for example, the Human Genome Structural Variation project published an analysis of the same eight samples using a clone-based Sanger sequencing approach (see In Sequence 5/20/2008). The set also includes sample NA18507, which was sequenced independently by Illumina and Applied Biosystems on their respective short-read sequencing platforms.
In addition to the HapMap samples, the UW researchers chose Freeman-Sheldon syndrome as their test case in order to see whether exome sequencing could identify the known mutations in the MYH3 gene that cause the disease. The gene was originally identified a few years ago through a candidate gene approach, since not enough families were available to use a linkage mapping approach, according to Shendure.
"We had the samples, we knew the answer, we were familiar with the disease, so it made sense as a proof of concept to address," Shendure said. "Could we directly identify the causes of monogenic disease through exome sequencing, without any linkage mapping or a candidate gene approach?"
To capture exonic DNA, the researchers used two custom-designed Agilent 244K microarrays for each sample, starting with about 10 micrograms of genomic DNA, and sequenced the captured material on an Illumina Genome Analyzer II, using unpaired 76-base reads.
Earlier this year, Shendure's group published another capture method that uses molecular inversion, or padlock, probes (see In Sequence 4/21/2009), but he said that this method still needs to be optimized for the whole exome and is currently better suited to analyze fewer targets in larger numbers of samples.
On average, the scientists generated 6.4 gigabases of mappable sequence data per individual -- about 20 times less than what Illumina produced for its African HapMap sample. About half the captured data mapped to the targeted exons.
Each exome was covered 51-fold on average, and 96.3 percent of the target bases were covered sufficiently to call variants.
Compared to other published exon-sequencing reports, the method showed "roughly equivalent capture specificity, but greater completeness in terms of coverage and variant calling," according to the paper, probably because of greater sequencing depth and differences in array design and experimental conditions.
In order to pinpoint the causal gene in the FSS samples, the scientists looked for genes with non-synonymous coding SNPs, splice site disruptions, or coding indels in all four samples. They narrowed the initial list of more than 2,000 genes down to one -- MYH3 -- by requiring that the variant was neither in dbSNP nor in one of the eight HapMap exomes.
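The filtering logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual pipeline: the function name, the tuple layout, the effect labels, and the toy variants are all assumptions made for the example, and the `known` set merely stands in for dbSNP plus the eight HapMap exomes.

```python
# Effect classes the study screened for; the string labels are hypothetical.
FUNCTIONAL_EFFECTS = {"nonsynonymous", "splice_site", "coding_indel"}


def candidate_genes(variants_by_sample, known_variants):
    """Return genes carrying a novel functional variant in every sample.

    variants_by_sample maps a sample ID to a list of (gene, variant_id, effect)
    tuples; known_variants is a set of variant IDs to exclude (standing in for
    dbSNP plus the control exomes).
    """
    gene_sets = []
    for variants in variants_by_sample.values():
        genes = {
            gene
            for gene, variant_id, effect in variants
            if effect in FUNCTIONAL_EFFECTS and variant_id not in known_variants
        }
        gene_sets.append(genes)
    # Different patients may carry different variants, so intersect by gene.
    return set.intersection(*gene_sets) if gene_sets else set()


# Toy data: all variant IDs and the genes other than MYH3 are made up.
affected = {
    "patient1": [("MYH3", "v1", "nonsynonymous"), ("GENE_A", "v2", "nonsynonymous")],
    "patient2": [("MYH3", "v3", "nonsynonymous"), ("GENE_A", "v2", "nonsynonymous")],
    "patient3": [("MYH3", "v1", "nonsynonymous"), ("GENE_B", "v4", "synonymous")],
    "patient4": [("MYH3", "v5", "coding_indel"), ("GENE_A", "v2", "nonsynonymous")],
}
known = {"v2"}  # pretend v2 is a common variant already catalogued

result = candidate_genes(affected, known)
```

In this toy case, GENE_A drops out because its only variant is already catalogued, and GENE_B drops out because its variant is synonymous, leaving MYH3 as the one gene shared by all four patients.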
The method could still be improved by using paired-end reads and increasing their density, and by optimizing the capture and the analysis, Shendure said. This would allow the researchers to sequence one exome per flow cell lane, eliminating the need for barcoding samples, he added.
Kun Zhang, who is part of another group funded by the Exome Project, told In Sequence that the UW researchers' results on the HapMap samples provide a good reference for future work. "Now we know how many coding and non-coding variants you should expect from each population of different genome backgrounds," said Zhang, an assistant professor in the department of bioengineering at the University of California, San Diego.
He also pointed out that the group came up with a "clever way" to computationally identify insertions and deletions that are longer than a couple of base pairs from the short-read sequence data.
The results on the disease samples are significant, he said, because they show that only a few patients with a genetic disease -- in this case, four -- are required to identify the causative mutation. That, he said, will likely have "a big impact on guiding the study design of future gene-mapping efforts."
Getting to the Root of Rare Diseases
According to Shendure, one of the first applications of the exome sequencing approach is indeed to identify mutations underlying rare diseases. "There are tons of rare diseases that are essentially unsolved, and this is a great approach to address those," he said.
More than 7,000 rare diseases -- defined as diseases that affect fewer than 200,000 individuals in the US -- are currently known, many of which are likely to be genetic and have unknown causes.
According to the paper, the strategy might be even easier to apply to recessive diseases, since there are "far fewer genes" in the exome that are homozygous or compound heterozygous for rare non-synonymous variants.
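One way to picture the recessive filter is the sketch below, which keeps a gene if an individual carries either a homozygous rare variant or at least two heterozygous rare variants in it. This is an assumed simplification rather than the paper's method: the function name, the zygosity labels, and the example genes are hypothetical, and the input is presumed to be already restricted to rare non-synonymous variants.

```python
from collections import Counter


def recessive_candidates(rare_variants):
    """Return genes consistent with a recessive model in one individual.

    rare_variants is a list of (gene, zygosity) tuples, one per rare
    non-synonymous variant, with zygosity "hom" or "het" (hypothetical labels).
    """
    homozygous = {gene for gene, zyg in rare_variants if zyg == "hom"}
    het_counts = Counter(gene for gene, zyg in rare_variants if zyg == "het")
    # Two or more heterozygous hits are treated as possible compound
    # heterozygosity; without phasing they may sit on the same allele.
    compound_het = {gene for gene, n in het_counts.items() if n >= 2}
    return homozygous | compound_het


# Made-up gene names and genotypes for illustration.
example = [("GENE_A", "het"), ("GENE_B", "het"), ("GENE_B", "het"), ("GENE_C", "hom")]
recessive = recessive_candidates(example)
```

Here GENE_A is discarded because a single heterozygous variant does not fit a recessive model, which illustrates why the per-individual candidate list would be expected to shrink.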
Shendure and his team have already used exome sequencing to study a rare disease with an unknown cause and have "had some success," Shendure said, adding that he plans to talk about the results at the Personal Genomes conference at Cold Spring Harbor Laboratory in September.
One of the reasons for tackling such diseases by exome sequencing rather than whole-genome sequencing is that it is less expensive. Shendure said that he and his team produced about 20-fold less sequence data than Illumina for its published NA18507 genome, "and we achieved essentially equivalent or better sensitivity and specificity with respect to coding variation over the targeted exome."
Besides the cost of sequencing, their method also has to account for the cost of sequence capture, but that is still a much smaller fraction of the overall cost, he said, though he did not elaborate on absolute costs.
Despite the fact that exome sequencing does not identify non-coding and many structural variants, there is a reasonable chance that it can find disease-causing variants. "For Mendelian diseases, the history is that highly penetrant variants, by and large, tend to fall in coding sequences or adjacent splice sites," Shendure said. "That's as well documented as can be from several decades of studies."
He cautioned, however, that it takes more than one affected individual at the moment to find the causative variants. Currently, "having an unrelated individual that appears to have the same disease is still necessary to make a really compelling case," Shendure said, but "as functional prediction improves, as filters for pretty common variants improve, for example through the 1000 Genomes Project, the number of individuals that are required may drop."
Aravinda Chakravarti, director of the Center for Complex Disease Genomics at the McKusick-Nathans Institute of Genetic Medicine at Johns Hopkins University School of Medicine, said that the paper is a good demonstration of how far exon sequencing has come. "I applaud these guys for moving fast in the right direction," he told In Sequence last week. "They have used much less DNA, [and] the data are much more accurate and comprehensive in coverage" than in previous studies, he said. It is unclear, though, how far the cost has come down. "Cost is an important consideration because it conveys to the rest of us [whether] this is really a feasible project," he said.
"I think they are on the right path in making the technology much more accessible," he added. "I don't mean to the genome centers, but to the Tom, Dick, and Mary who have these patients. ... I think technology, like personal computers, should be disseminated. But it's early days."
Chakravarti called FSS a "well-chosen example" for the demonstration project, but said more examples will be needed to find out how well the approach works in general, and how many samples will be needed in each case. "I think we will need examples such as this to learn, and there is no doubt that even if they take genomes from four patients of every single [Mendelian disease] and do this test, they will crack some of them, [but] they will not crack all of them."
What is going to be particularly difficult is the data interpretation. "In this case, since they knew the mutation, they could say, 'We found the variant,' " he said. "The usual problem that we have in genetics is, when we see a variant, we don't know that that is the variant."
For example, he said, most variants in protein-coding sequences are missense mutations, which are harder to interpret than nonsense mutations. Also, recent studies have shown that the human genome harbors loss-of-function mutations that do not seem to have a negative effect. "The lack of function of something in the genome does not necessarily equate to disease," he said.
The Exome Project
NHLBI, for its part, is currently seeking nominations for cohorts of human DNA samples to be sequenced during the production phase of the two-year Exome Project, slated to start this fall. The three groups funded under the project will perform the sequencing of these samples.
According to its website, the institute is looking for existing cohorts with "potential importance to research related to heart, lung, blood, or sleep diseases and disorders" that range in size "from a few to 400 samples."
Evidence of heritability and pre-existing genome-wide association study data "are valuable attributes" of the samples, which must be consented for whole-genome genetic analysis and data sharing via the dbGaP database.
Weiniu Gan, one of the program officers for the Exome Project at NHLBI, told In Sequence that the first year of the project was devoted to technology development, and that the UW article "is pretty much a progress report for the first year for that center."
The goal for the second year is to test the new methods on patient samples, which is why the institute has begun to look for appropriate cohorts. "Next year, we would like to see another publication for the proof of concept in terms of applying the technology to patient samples," Gan said.
"We would like to have cases that are very likely to lead to some significant discoveries by sequencing just a small number of patient samples," he said. The institute has already received a number of submissions: "It seems like the community is very interested in trying out these new technologies," Gan added.
According to Shendure, exome sequencing might even be useful to study complex common diseases, "although not all the answers, not all the signal, is going to be [in the exome], and it's no question that the rest of the genome is interesting," he said.
Also, larger sample sizes would probably be needed. "The key there to making it successful is being judicious about the right study design, the right sample selection," he said.
Chakravarti cautioned that "the lack of our understanding of non-Mendelian disorders is not [just] technology. We have this view on the part of many that 'it's just technology, [so] if we sequence everyone's genome, and we put it in a big enough computer, the answer is going to fall out.' I can assure you that some answers will fall out, but the answers will not fall out easily."