NEW YORK (GenomeWeb) – To help researchers in the genomic community use protein data in their projects, scientists from Georgetown University Medical Center, the European Bioinformatics Institute, and elsewhere have been mapping protein function annotation from the UniProt database to the most recent release of the human reference genome.
The move is at least partly the result of a push from within the protein community for more communication with their counterparts in the genomics space, according to Peter McGarvey, an associate professor of bioinformatics at Georgetown University Medical Center and one of the researchers involved in the project. "We have gotten feedback from our scientific advisors and others that we need to outreach more to the genomics community," he told GenomeWeb.
Combining the information is a logical step for the genomic and protein communities because protein data is crucial to understanding the impact of mutations in the genome, Maria Martin, leader of UniProt Development at EBI and one of the researchers involved in the project, noted.
"Is that change in the gene affecting the function of a certain protein? Is it located in an active site? To look into that you need to see what is happening at the protein level," she said. "UniProt is doing quite a lot of work finding all that information and annotating the protein sequence." However, difficulties aligning protein data to the reference have limited its use. Adding the UniProt data to the reference will make it that much easier for genomics researchers to access that information and incorporate it into their studies, she told GenomeWeb.
The researchers began mapping protein annotations to the genome a few years ago, largely on a volunteer basis. They were unfunded, so progress was slow until last September, when the project received a bit of a boost. Specifically, the researchers received $125,000 in grant funding from a supplement to the UniProt grant from the National Human Genome Research Institute, McGarvey told GenomeWeb. The funds support ongoing annotation mapping efforts as well as outreach to other communities. This includes developers of databases like ClinGen and members of the proteomics community such as researchers involved in the Clinical Proteomic Tumor Analysis Consortium, he said. "It's not paying full time for anybody, [but] with the money things got a little faster."
There are clear benefits to bringing the resources together. For example, researchers will have access to more fine-grained information on proteins, McGarvey said. Existing tools help researchers identify things like introns and truncated proteins but don't seem to do much more, he said. Now curators can determine, for instance, whether or not those changes fall within functional domains.
However, the task is not trivial. Incompatible annotation approaches and challenges with mapping features from the protein space into the genome space have historically made integrating information from the respective databases problematic. Resources like Ensembl and RefSeq do map some protein domains to the genome, but those mappings are difficult to locate and are not as comprehensive, McGarvey said. "We are doing much finer detail and [a] wider variety of features like well annotated [amino acid] variants, active sites, and more."
Difficulties with combining the two resources have been mitigated somewhat as genomic and protein databases have become more aligned in recent years, McGarvey said. For example, projects like the Consensus Coding Sequence database, a collaborative effort that involves the EBI, the National Center for Biotechnology Information, and other institutions, are coming up with consistent annotations for mouse and human protein-coding regions.
"Because we've been working together, a lot of things have gotten more standardized," he said. "Our sequences agree better [and] people will call attention to discrepancies." In spite of these developments, combining datasets from the two communities is still challenging. "It's not impossible but difficult."
To map UniProt's data to the reference genome, the researchers compared UniProt protein sequence coordinates to the coordinates of peptide sequences in repositories such as Ensembl and RefSeq, EBI's Martin explained to GenomeWeb. When they found identical sequences, they would then search for the gene transcript that codes for the UniProt protein sequence in question.
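The matching step Martin describes can be pictured with a small Python sketch. It is not the UniProt team's pipeline: the accessions, transcript IDs, and sequences below are made up, and it assumes the translated peptide for each transcript is already in hand. It simply indexes transcripts by the exact peptide they encode and reports the transcripts whose translation is identical to a UniProt sequence.

```python
from collections import defaultdict


def index_by_sequence(translations):
    """Group transcript IDs by the exact peptide sequence they encode."""
    index = defaultdict(list)
    for transcript_id, peptide in translations.items():
        index[peptide].append(transcript_id)
    return index


def match_transcripts(uniprot_entries, translations):
    """Return {accession: [transcript IDs]} for identical sequences only."""
    index = index_by_sequence(translations)
    return {
        accession: sorted(index[sequence])
        for accession, sequence in uniprot_entries.items()
        if sequence in index  # no fuzzy matching: the strings must be equal
    }


# Toy data; identifiers and sequences are hypothetical.
uniprot_entries = {"PROT_A": "MKTAYIAK", "PROT_B": "MSSHLLQW"}
translations = {
    "TX_1": "MKTAYIAK",
    "TX_2": "MKTAYIAR",  # differs by one residue, so it is not a match
    "TX_3": "MKTAYIAK",
}

matches = match_transcripts(uniprot_entries, translations)
print(matches)  # {'PROT_A': ['TX_1', 'TX_3']}
```

Because the lookup is keyed on the sequence itself, gene and protein names never enter the comparison.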
There were other ways the researchers could have done the mapping, but comparing gene and protein coordinates was the most straightforward, Martin said. It let the team sidestep differences in the naming conventions and annotations that the genomics and protein communities use. Annotation approaches are both diverse and error-prone, and trying to compare sequences by gene names quickly becomes messy because "researchers are going to call the same thing in very different ways," she said.
For example, UniProt researchers assign primary protein names based on how frequently the name in question appears in the literature, she said. Any alternate names are catalogued in a separate list. Ensembl's developers, on the other hand, have their own criteria for naming proteins and may choose one of the synonyms as the primary name for the same protein based on their conventions.
The mapping process was also somewhat less arduous because genomic researchers consult protein databases when they annotate the genomes, and so there is some overlap between resources. For example, they'll search databases like UniProt to find out which protein sequences have been experimentally validated and use that information as supporting evidence for gene transcripts, Martin said. "Because they were doing that already, we were able to map quite a lot of protein sequences because we could find the exact [location]."
The UniProt team's annotation efforts have also resulted in a nice little feedback mechanism for the database. When they find protein sequences in Ensembl and RefSeq that are supported by evidence from the literature and are not currently in UniProt, the researchers add them to the database, Martin said. They've also been able to extract supporting evidence from databases like Ensembl for some proteins in UniProt that have limited functional data associated with them, she said.
So far, the UniProt team has mapped over 76,000 protein isoforms to the reference. In addition, they have mapped 27 structural and functional features including enzyme active sites, modified residues, protein-binding domains, protein isoforms, post-translational modification sites, and more. The mappings and related annotations, including features' names, supporting literature, and links to full UniProt entries, are provided in BED and bigBed files. These can be used with genome browsers from the University of California, Santa Cruz, Ensembl, and others. Users can download these files from an FTP site, and they will soon be available as public track hubs on the UCSC and Ensembl genome browsers, McGarvey said. Researchers can also simply download the full sequences and annotations from UniProt directly.
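To give a sense of what mapping a protein feature into genome space involves, here is a minimal Python sketch. It assumes the coding exons of the matching transcript are known, and it converts a residue range into genomic blocks and a standard 12-column BED line. The function names, coordinates, and feature label are hypothetical, and the files UniProt distributes may carry additional columns not shown here.

```python
def protein_range_to_genome(cds_exons, strand, aa_start, aa_end):
    """Map a 1-based, inclusive residue range to genomic blocks.

    cds_exons: coding segments as (start, end) pairs, 0-based and half-open,
    sorted by genomic position. Returns blocks in the same convention.
    """
    cds_from = 3 * (aa_start - 1)  # first coding base of the range
    cds_to = 3 * aa_end            # one past the last coding base
    # Walk the exons in transcription order.
    ordered = cds_exons if strand == "+" else list(reversed(cds_exons))
    blocks = []
    offset = 0  # coding bases consumed by earlier exons
    for start, end in ordered:
        length = end - start
        lo = max(cds_from, offset)
        hi = min(cds_to, offset + length)
        if lo < hi:  # this exon carries part of the feature
            if strand == "+":
                blocks.append((start + lo - offset, start + hi - offset))
            else:
                blocks.append((end - (hi - offset), end - (lo - offset)))
        offset += length
    return sorted(blocks)


def to_bed12(chrom, blocks, name, strand):
    """Format genomic blocks as one tab-separated BED12 line."""
    chrom_start, chrom_end = blocks[0][0], blocks[-1][1]
    sizes = ",".join(str(end - start) for start, end in blocks)
    starts = ",".join(str(start - chrom_start) for start, _ in blocks)
    fields = [chrom, chrom_start, chrom_end, name, 0, strand,
              chrom_start, chrom_end, "0,0,0", len(blocks), sizes, starts]
    return "\t".join(str(field) for field in fields)


# Toy transcript: two coding exons, 30 coding bases, 10 residues.
cds_exons = [(100, 110), (200, 220)]
blocks = protein_range_to_genome(cds_exons, "+", 4, 4)
bed_line = to_bed12("chr1", blocks, "active_site", "+")
print(blocks)    # [(109, 110), (200, 202)]
print(bed_line)
```

In this toy case the codon for residue 4 straddles an intron, so a single-residue feature becomes two blocks on the genome. That is one reason such features are represented with BED block fields rather than a single start and end.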
The researchers have also developed a protein data browser that lets researchers scroll through protein sequences and visualize associated annotations. They are now trying to link the protein browser with genome browsers so that researchers can both see and interact with protein features within genome browsers. Currently, protein features and annotations show up as "snapshots" in genome browsers and are not interactive, Martin explained. "That's something that we would like to develop in the next few months," she said.
The researchers are also working on an application programming interface that will provide programmatic access to the protein data, McGarvey said. They are putting the finishing touches on the API and plan to release it sometime this summer. Other activities for the group include mapping protein variation and annotation information to variants in ClinGen.
In a poster presented at this year's Joint Summits on Translational Science, held in San Francisco last month, McGarvey shared the results of some initial comparisons between pathogenic and non-pathogenic variants in both UniProt and ClinVar. One table included in the poster reported the results of comparing ClinVar SNPs and analogous UniProt amino acid variants. McGarvey found that 45 percent of UniProt disease variants are also present in ClinVar. His comparison also showed a 35 percent overlap in pathogenic SNPs between UniProt and ClinVar.
The limited overlap is most likely the result of differing annotation approaches and the fact that each database contains information that the other does not, McGarvey said. The comparison was also limited to ClinVar SNPs, which could be a contributing factor as well.
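For readers who want to see how an overlap figure of this kind is computed, the following Python sketch works through the arithmetic on toy data. The variant keys (protein label, residue position, reference and altered amino acid) and the sets themselves are hypothetical, and this is not the analysis behind the poster.

```python
def percent_shared(variants_a, variants_b):
    """Percentage of the variants in A that also occur in B."""
    a, b = set(variants_a), set(variants_b)
    if not a:
        return 0.0
    return 100.0 * len(a & b) / len(a)


# Toy variant sets; every entry is made up for illustration.
uniprot_variants = {
    ("PROT_A", 12, "G", "D"),
    ("PROT_A", 61, "Q", "H"),
    ("PROT_B", 248, "R", "W"),
    ("PROT_C", 5, "L", "P"),
}
clinvar_variants = {
    ("PROT_A", 12, "G", "D"),
    ("PROT_B", 248, "R", "W"),
    ("PROT_D", 90, "A", "T"),
}

uniprot_in_clinvar = percent_shared(uniprot_variants, clinvar_variants)
clinvar_in_uniprot = percent_shared(clinvar_variants, uniprot_variants)
print(uniprot_in_clinvar)            # 50.0
print(round(clinvar_in_uniprot, 1))  # 66.7
```

The percentage depends on which database supplies the denominator, so the same two sets can yield different overlap figures.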
"We haven't done a full analysis yet," he said. "We are just starting to look at the data and seeing what it tells us."