NEW YORK (GenomeWeb News) – A combination of proteomics and genomics can help scientists refine gene annotations and even discover new genes, according to a paper scheduled to appear online this week in the Proceedings of the National Academy of Sciences.
A team of American and German researchers used tandem mass spectrometry to catalogue amino acid sequences in four Arabidopsis thaliana tissues and one cell line. Their data suggest that previous annotation efforts left gaps in the well-studied Arabidopsis genome: the latest work turned up nearly 800 new protein-coding genes and revised gene models for almost 700 more. And, the researchers argued, implementing proteomics techniques will likely benefit other gene annotation efforts too.
“Historically, the proteomic and genomic communities have operated independently, with the genomic community in charge of annotation efforts,” senior authors Vineet Bafna and Steven Briggs, both researchers at the University of California at San Diego, and their colleagues wrote. “We assert that much is to be gained by joining forces, and incorporating proteomic evidence upfront into the genomics pipelines.”
The researchers applied tandem mass spec to Arabidopsis leaf, root, flower, and silique tissue samples. They also performed phosphopeptide enrichment experiments in an Arabidopsis MM2d cell culture line in an effort to assess the phosphoproteome. The researchers filtered their data to achieve a one percent false-discovery rate and required at least two distinct peptides from a protein before confirming that protein’s presence.
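The two filtering steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the article does not say how the false-discovery rate was estimated, so the sketch assumes a common target-decoy estimate (decoy matches divided by target matches above a score cutoff), and the function names, tuple layout, and scores are hypothetical.

```python
from collections import defaultdict


def filter_at_fdr(psms, max_fdr=0.01):
    """Keep the largest top-scoring set of matches whose estimated FDR <= max_fdr.

    psms: list of (peptide, protein, score, is_decoy); higher score = better.
    ASSUMPTION: FDR is estimated as decoys / targets (target-decoy approach).
    """
    ranked = sorted(psms, key=lambda p: p[2], reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of top-ranked matches to keep
    for i, (_, _, _, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = i
    # Report only the real (non-decoy) matches above the cutoff.
    return [p for p in ranked[:cutoff] if not p[3]]


def confident_proteins(psms, min_peptides=2):
    """Return proteins supported by at least min_peptides distinct peptides."""
    peptides_by_protein = defaultdict(set)
    for peptide, protein, *_ in psms:
        peptides_by_protein[protein].add(peptide)
    return {
        protein
        for protein, peptides in peptides_by_protein.items()
        if len(peptides) >= min_peptides
    }


# Hypothetical toy data: (peptide, protein, score, is_decoy)
example_psms = [
    ("PEPA", "AT1", 9.0, False),
    ("PEPB", "AT1", 8.0, False),
    ("PEPC", "AT2", 7.0, False),
    ("XXXX", "DECOY", 6.0, True),
    ("PEPD", "AT2", 5.0, False),
]
kept = filter_at_fdr(example_psms)
proteins = confident_proteins(kept)
```

In this toy data set, the protein labeled AT2 is dropped because only one of its peptides survives the score cutoff, which is the kind of single-peptide identification the two-peptide rule is meant to exclude.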
Overall, they identified 144,079 peptides in the Arabidopsis tissues that mapped to the Arabidopsis Information Resource database, a 6-frame translation of the Arabidopsis genome, or an Arabidopsis exon splice graph.
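For readers unfamiliar with the term, a 6-frame translation reads a DNA sequence in all three codon offsets on both strands, so that a peptide can be matched to the genome without relying on an existing gene model. The Python sketch below illustrates the idea under simple assumptions: it uses the standard genetic code, marks stop codons with `*`, ignores introns, and uses a made-up 12-base sequence. The function names are hypothetical and do not come from the study.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Map (strand, offset) to the protein string read in that frame."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = translate(template[offset:])
    return frames


def frames_containing(peptide, seq):
    """List every frame whose translation contains the peptide."""
    return [
        frame
        for frame, protein in six_frame_translation(seq).items()
        if peptide in protein
    ]


# Made-up example sequence, not real Arabidopsis data.
toy_dna = "ATGGCCAAATAA"
forward_hit = frames_containing("MAK", toy_dna)   # found on the forward strand
reverse_hit = frames_containing("FGH", toy_dna)   # found on the reverse strand
```

Because each peptide is found in a specific strand and offset, a match of this kind also indicates which reading frame is actually translated at that location.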
Of these, 126,055 peptides were consistent with existing Arabidopsis gene models, representing 12,769 proteins, or roughly 40 percent of annotated Arabidopsis genes. The remaining 18,024 peptides had not been described in Arabidopsis prior to the study. Most of these newly identified sequences, 16,348 peptides in all, mapped to just one spot in the genome.
The researchers focused on 1,765 clusters containing more than 5,400 new peptides and used the gene prediction program Augustus to identify 778 hitherto unknown protein-coding genes — 52 of which have already been incorporated into TAIR 8, the latest Arabidopsis genome database release.
They also revised the annotation of 695 previously described gene models, including some apparently coding regions that had been classified differently, for instance as non-coding pseudogenes or transposons. In addition, the team found 70 instances in which their proteomic data prompted them to correct an annotated reading frame.
“Assignment of reading frame is particularly difficult for nucleotide-based genome annotation,” the authors noted. “However, proteomic evidence unambiguously defines the frame of translation.”
Based on comparisons with the NCBI’s non-redundant protein database, the researchers concluded that 539 of the new loci were derived from homologous genes. For instance, the researchers found a new gene involved in photosynthesis that aligned with sequences for proteins in the chloroplast.
The work also provides insights into the prevalence of alternative splicing in the Arabidopsis genome. From the 47 genes identified with multiple splice forms, the researchers predict that there are between 6,718 and 8,983 genes with alternatively spliced forms in the Arabidopsis genome.
Based on their results, they estimated that traditional annotation approaches have missed some 13 percent of the Arabidopsis proteome. And, they argued, the proteogenomic approach can help fill in those gaps.
“By investing in proteogenomics to complement more traditional cDNA and EST data at the onset of genome annotation, a more complete and accurate proteome can be achieved even in the early releases,” the authors wrote.