The recent discovery of a gene for the fatal inherited disorder Leigh Syndrome proves you can’t have too much data when it comes to genomics research.
“No single functional genomics measure is perfect,” said Vamsi Mootha, a postdoctoral fellow at the Whitehead/MIT Center for Genome Research and lead author of a paper on the finding published last week in the Proceedings of the National Academy of Sciences. Taking a more-is-better approach, Mootha integrated several sets of genomic, gene expression, and proteomics data from an international team of researchers to whittle a two-megabase region of the genome down to a single candidate gene, LRPPRC, in about a week.
The computational approach had a bit of a head start thanks to old-fashioned genetic techniques: About two and a half years ago, linkage analysis and disease association studies had implicated a two-megabase interval on chromosome 2 as the region likely to be harboring the gene that causes Leigh Syndrome French Canadian type, or LSFC, an autosomal recessive disorder that is associated with high infant mortality in the Saguenay-Lac St. Jean region of Quebec.
The project had stalled, however, until the researchers devised a computational approach based on the hypothesis that the disease was linked to mitochondrial function. “The idea was to try to relate any genes in that interval to mitochondria properties using whatever functional genomics data sets were available,” said Mootha.
They first downloaded all of the genome annotations for the 2 Mb region, including RefSeq genes, Ensembl predicted genes, spliced ESTs, mouse/human homology, and Genscan predicted genes. This set was collapsed into distinct, non-overlapping genes, and they ended up with a final set of 30 genes: 15 with some experimental support and 15 that were computationally predicted.
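The collapsing step described above can be pictured as a simple interval merge. The following is a minimal sketch in Python, not the team's actual code (which is described only as a few Perl scripts). The coordinates, source labels, and the function name `collapse_annotations` are illustrative assumptions, and the sketch ignores strand and exon structure.

```python
from typing import List, Set, Tuple

# (start, end, source) for one annotation; coordinates are made up for illustration.
Annotation = Tuple[int, int, str]
Locus = Tuple[int, int, Set[str]]


def collapse_annotations(annotations: List[Annotation]) -> List[Locus]:
    """Merge overlapping annotations into distinct, non-overlapping loci.

    Each locus remembers which annotation sources contributed to it, so
    experimentally supported and purely predicted loci can be told apart.
    """
    loci: List[Locus] = []
    for start, end, source in sorted(annotations):
        if loci and start <= loci[-1][1]:
            # Overlaps the previous locus: extend it and record the source.
            prev_start, prev_end, sources = loci[-1]
            loci[-1] = (prev_start, max(prev_end, end), sources | {source})
        else:
            loci.append((start, end, {source}))
    return loci


# Hypothetical input: four annotations from different tracks.
annotations = [
    (100, 500, "RefSeq"),
    (400, 900, "Genscan"),
    (1500, 2000, "Ensembl"),
    (1200, 1300, "EST"),
]
loci = collapse_annotations(annotations)
for start, end, sources in loci:
    print(start, end, sorted(sources))
```

In this toy input, the RefSeq and Genscan annotations overlap and collapse into one locus, leaving three distinct loci in total.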
The researchers then moved on to mRNA data, downloading four large publicly available expression data sets: samples from a Whitehead Institute cancer classification project; the Riken Expression Array Database; and the Genomics Institute of the Novartis Research Foundation’s expression atlases for human and for mouse. Based on a set of around 300 known genes that encode proteins localized to mitochondria, the researchers developed a statistical software tool to identify the gene expression signature for mitochondria among a total set of around 1.4 million expression measures. Once they had that gene expression signature in hand, it was a snap to go back to the 30 genes to see if any of them had an expression profile resembling that of the known mitochondrial genes. “As it turns out, one of them had a striking similarity to the known expression profiles for mitochondria,” said Mootha.
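One simple way to illustrate the idea of an expression signature is to average the profiles of the known mitochondrial genes and rank each candidate by its correlation with that average. The article does not say which statistic the team's tool used, so the sketch below swaps in plain Pearson correlation as an assumption; the gene names and expression values are invented.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def signature(profiles: List[List[float]]) -> List[float]:
    """Mean expression per sample across the reference (known) genes."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def rank_candidates(
    known: List[List[float]], candidates: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Rank candidates by similarity to the reference signature, best first."""
    sig = signature(known)
    scores = {name: pearson(profile, sig) for name, profile in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented data: expression of each gene across four samples (e.g. tissues).
known_mito = [[9, 2, 8, 1], [8, 3, 9, 2], [10, 1, 7, 1]]
candidates = {
    "gene_A": [7, 2, 6, 1],  # tracks the reference genes
    "gene_B": [1, 8, 2, 9],  # opposite pattern
    "gene_C": [5, 5, 5, 5],  # flat profile
}
ranked = rank_candidates(known_mito, candidates)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

With real data, the reference set would be the roughly 300 known mitochondrial genes and the profiles would span all samples in the four data sets; a permutation test or similar would be needed to judge whether a top score is significant.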
The third data set the researchers used was from an ongoing proteomics project that the Whitehead is conducting with MDS Proteomics to characterize the peptides in mitochondria. “We took all the peptides from the tandem mass spectra project, and we plastered those onto this two-megabase interval,” said Mootha. They found about a dozen peptides that were extremely high-scoring and were piled up in the region corresponding to the exact same gene identified by the expression analysis. This method also allowed the researchers to annotate the gene: While the cDNAs deposited in GenBank corresponded to a 35-exon gene structure, they had found some peptides that landed upstream of that 35-exon structure, and discovered that the proper gene structure actually consists of 38 exons. “In other words, we were able to annotate the genome with the proteomics data to elucidate its proper gene structure. To our knowledge, that has not been done before,” said Mootha.
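The peptide step can be sketched in the same spirit. Assuming each peptide match has already been reduced to a genomic position and a match score (the search against the translated interval is not shown, and is where most of the real work lies), the snippet below counts high-scoring hits per gene and lists hits that fall outside every annotated exon, the kind of evidence that could point to a missing exon. All coordinates, scores, names, and the 0.9 threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Hit = Tuple[int, float]   # (genomic position, match score); values are invented
Span = Tuple[int, int]    # (start, end), inclusive


def peptides_per_gene(
    hits: List[Hit], genes: Dict[str, Span], min_score: float
) -> Counter:
    """Count high-scoring peptide hits that land inside each gene's span."""
    counts: Counter = Counter()
    for position, score in hits:
        if score < min_score:
            continue
        for name, (start, end) in genes.items():
            if start <= position <= end:
                counts[name] += 1
    return counts


def hits_outside_exons(
    hits: List[Hit], exons: List[Span], min_score: float
) -> List[int]:
    """Positions of high-scoring hits that no annotated exon accounts for."""
    return [
        position
        for position, score in hits
        if score >= min_score
        and not any(start <= position <= end for start, end in exons)
    ]


genes = {"gene_A": (1000, 5000), "gene_B": (6000, 9000)}
exons = [(1000, 1500), (1700, 2000), (4000, 4300), (6900, 7100)]
hits = [(1200, 0.95), (1800, 0.97), (4100, 0.99),
        (6500, 0.40), (7000, 0.92), (800, 0.96)]

counts = peptides_per_gene(hits, genes, min_score=0.9)
orphans = hits_outside_exons(hits, exons, min_score=0.9)
print(counts.most_common())
print(orphans)
```

In this toy example the hits pile up on `gene_A`, and the one confident hit just upstream of its first annotated exon is the sort of signal that would prompt a second look at the gene structure.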
Of course, the work was far from over once the candidate gene was identified. But after four months of resequencing the gene and the single exon that harbored the mutation in patients, parents, and controls, the international team was able to experimentally verify that they had found the right gene. Before this approach, Mootha said, this experimental process would have had to be performed on all 30 genes within that region of interest, a process that would have taken ten years.
Mootha said that the Whitehead is currently applying the integrated functional genomics approach to other common diseases, and that most of the software he wrote for this project should transfer easily to other disease areas. In other cases, however, the process may not run as smoothly as it did in this study, he warned. For example, in more complex diseases, the initial genomic interval of interest may be much bigger than the 2 Mb region that his team started with. In addition, Mootha said, his team happened to have plenty of data to identify a mitochondrial gene expression signature, but depending on the disease area and the clinical hypothesis, that may not always be the case.
The paper serves as proof of concept that such an integrated approach can be effective, Mootha said. The software for the project — “basically a few Perl scripts” — is available “if there are scientists out there interested in the code.”