Scientists will miss the complete genome picture if they continue to use the old standby tools for annotation and finding protein-coding genes, according to Akhilesh Pandey. “As a community, we need to remind ourselves time and again that the problem is not solved, and we must use alternative methods,” says Pandey, an assistant professor at the McKusick-Nathans Institute of Genetic Medicine at Johns Hopkins.
In his laboratory, the India-born biochemist focuses on what he calls “using proteomics for genome annotation.” This approach lies at the intersection of genomics, proteomics, and bioinformatics.
“If one wants to find protein-coding genes, ideally, one should look at proteins,” Pandey says. “Unfortunately, because this is technology-intensive — [involving] mass spectrometry — and requires a substantial bioinformatics effort, it has been somewhat ignored. Therefore, we generally settle with mRNA sequences as proof of protein-coding genes or with gene predictions.”
But that approach has its pitfalls. For example, mRNAs might be incomplete or fail to represent all splice forms. In addition, once a start codon is assigned, researchers often work only on that particular form of the protein, even though the initial assignment could be erroneous. When protein sequences are inferred from predicted genes, the exact intron-exon structure and the N-terminus may be predicted incorrectly, and some genes might not be predicted at all. Protein-based methods can overcome all these obstacles, Pandey contends.
“Finding unique peptides corresponding to regions not annotated with any protein-coding genes should indicate that genes have been missed by prediction programs,” he says.
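The idea in that quote can be illustrated with a toy sketch. It is not Pandey's pipeline: it assumes exact string matching of peptides against a six-frame translation of a short DNA sequence, ignores introns and splice junctions, and treats annotation as a simple list of 0-based, half-open coding intervals. All function and variable names here are hypothetical.

```python
from itertools import product

# Standard genetic code, built in TCAG order (assumption: no alternative code tables).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def locate_peptide(genome, peptide):
    """Yield (start, end, strand) forward-strand coordinates of exact matches
    in all six reading frames."""
    n = len(genome)
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                start = frame + 3 * pos
                end = start + 3 * len(peptide)
                if strand == "-":
                    # Convert reverse-strand offsets back to forward coordinates.
                    start, end = n - end, n - start
                yield start, end, strand
                pos = protein.find(peptide, pos + 1)


def unannotated_hits(genome, peptides, annotated):
    """Return peptide matches that overlap no annotated coding interval."""
    novel = []
    for pep in peptides:
        for start, end, strand in locate_peptide(genome, pep):
            overlaps = any(start < a_end and a_start < end
                           for a_start, a_end in annotated)
            if not overlaps:
                novel.append((pep, start, end, strand))
    return novel


# Toy example: "MAK" lies in an annotated region; "MGW" is encoded on the
# reverse strand in a stretch with no annotation.
genome = "ATGGCTAAA" + "TAA" + "CCAACCCAT"
annotated = [(0, 9)]
print(unannotated_hits(genome, ["MAK", "MGW"], annotated))
# prints [('MGW', 12, 21, '-')]
```

In this sketch, a peptide that maps only outside the annotated intervals is the kind of evidence the quote describes. A real analysis would also need to handle peptides that span exon boundaries and the scoring uncertainty of mass spectrometry identifications.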
In the last year, Pandey has had success using proteomics for genome annotation. Mass spectrometry coupled with bioinformatics analysis allowed his group to identify several novel proteins and isoforms in human blood, bile, and pancreatic juice, many of which could turn out to be biomarkers for diseases. In a study published in Nature Genetics last spring, Pandey and his colleagues looked for similarities between the human X chromosome’s protein-encoding instructions and corresponding regions in the mouse. In regions conserved between the two species, the scientists found 43 new genes. Almost half of the new genes don’t look like any previously known genes, and some sit in regions long tied to X-linked mental retardation syndromes.
More recently, Pandey and his colleagues published a study identifying proteins and annotating the genome of Anopheles gambiae. Their efforts confirmed genome sequence data, identified two sequence polymorphisms that were not annotated as SNPs in databases, and found a novel gene that was not predicted by automatic annotation pipelines. The scientists also corrected the translational start sites and UTR assignment of proteins.
Pandey says more researchers need to analyze existing information in different ways and develop new approaches to further crack the genome code. In recent years, scientists have begun to examine the role of microRNAs and ponder the significance of conserved noncoding regions. His lab is currently focusing on building software tools that will make it much easier to search genomes using mass spectrometry-derived data.
“We need complementary methods,” Pandey says. “Just beating on one aspect of it is not going to give us all of the answers.”