Nat Goodman peruses the protein sequence databases and finds there are no easy answers
If you’ve ever set out to navigate the jumble of protein sequence databases, surely you can empathize with our poor comic strip character Dataslave. There’s GenPept, TrEMBL, DAD. RefSeq, SWISS-PROT, PIR-PSD. Nr, SPTR, PIR-NREF. And these are just the general-purpose ones.
In an effort to bring some order to this chaos, the European Bioinformatics Institute recently introduced a new database called the International Protein Index. EBI says IPI aims to “provide a minimally redundant yet maximally complete set of human proteins.”
I gave it a try to see how much easier my life (and Dataslave’s) could be. I’ve had great luck with other EBI databases that consolidate information from multiple sources — for instance, SWISS-PROT and InterPro — so I had high hopes for this new entrant.
Most protein sequences enter the database world by way of the primary nucleotide sequence databases: what most Americans think of as GenBank, or what is officially known worldwide as the tri-partite International Nucleotide Sequence Database Collaboration. Member organizations of the tri-partite collaboration are the Japan National Institute of Genetics, which operates the DNA Data Bank of Japan; EBI, which operates the EMBL Nucleotide Sequence Database; and the US National Center for Biotechnology Information, which runs GenBank. By longstanding agreement, the three databases exchange data on a daily basis, and thus are virtually identical. Why the world needs three nearly identical nucleotide sequence databases is beyond me. It’s not like there’s a shortage of real work for bioinformaticians to do.
Protein sequences are extracted from the coding region features of nucleotide sequence entries and placed in what are essentially the primary protein sequence databases. The resulting databases are DAD in Japan, TrEMBL in Europe, and GenPept in the US. One subtlety is that TrEMBL does not include any entries that have been merged into SWISS-PROT (discussed below), making it not precisely equivalent to the others.
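For readers who want to see what "extracted from the coding region features" amounts to, here is a minimal sketch in Python. It assumes a single coding region on the forward strand, given as 0-based start and end coordinates, and the standard genetic code. Real entries can have coding regions on the complementary strand, spliced across several exons, or translated with other genetic codes, and this sketch ignores all of that. The function name and the toy sequence are made up for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(nucleotides, start, end):
    """Translate nucleotides[start:end] (0-based, end exclusive) into protein."""
    cds = nucleotides[start:end].upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        # "X" stands in for codons containing ambiguous bases such as N.
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":  # a stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


# Toy entry: two flanking bases on each side of ATG GCC AAA TGA.
toy_entry = "ccATGGCCAAATGAgg"
protein = translate_cds(toy_entry, 2, 14)
print(protein)
```

Because the translation is purely mechanical, any site running it on the same nucleotide entry gets the same protein sequence. Only the accession number attached to the result differs.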
The European and US organizations do this independently and do not exchange these entries. This doesn’t matter from a biological perspective as sequence translation works the same on both sides of the pond. But it’s a nuisance from an informatics standpoint, because the protein sequence entries end up with different accession numbers in the two databases.
I’m not sure how Japan does this, but it seems that DAD uses the same accession numbers as GenPept. The net effect is that we have three nearly identical primary nucleotide sequence databases feeding portions of their entries to three nearly identical primary protein sequence databases.
Further redundancy comes from the fact that a given gene or genomic region may be sequenced numerous times. This has become endemic as large-scale cDNA and genomic sequencing projects plow ground both new and old. Many known human genes appear at least three times in the primary databases: once from the original submission, again through large-scale cDNA sequencing, and yet again via annotation of the human genome sequence. Even more entries may exist if the gene was discovered simultaneously by multiple investigators, or if alternate splice forms of the gene were discovered after the original submission.
These sequences may vary slightly from one entry to the next (or extensively in the case of alternate splice forms) due to experimental error or bona fide variation.
An even greater problem is that the biological descriptions of a gene almost always vary among redundant entries. This can be for reasons as mundane as different investigators choosing different words to describe the same function, or more profoundly because the community’s understanding of the gene’s function may improve over time. Some entries contain no description at all or just a pro forma description such as “sequence of clone XYZ1234” or “predicted gene ABC000567,” leaving the biological characterization to others.
The primary databases make no effort to weed out or merge even the most obvious duplications, since they view themselves primarily as repositories for all publicly available sequence data. This is a good thing. But it’s too much of a good thing for many users, since typical queries will return long lists of nearly identical entries.
To mitigate this problem, EBI and NCBI maintain so-called non-redundant versions of the protein sequence databases, called SPTR and nr respectively. The term “non-redundant” is a tad optimistic. Each organization starts with its own primary protein sequence database (TrEMBL or GenPept), then adds data from some of the secondary databases (see below), then eliminates entries that have exactly the same sequence and are from the same organism. Entries whose sequences differ in even a single letter are not merged.
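The merging rule described above is simple enough to sketch in a few lines of Python. This is an illustration of the rule as stated (same organism, exactly the same sequence), not the code EBI or NCBI actually runs. The accessions and sequences are invented.

```python
from collections import defaultdict


def merge_exact_duplicates(entries):
    """Group entries that share an organism and an identical sequence.

    `entries` is an iterable of (accession, organism, sequence) tuples.
    Returns a list of (organism, sequence, [accessions]) records.
    """
    groups = defaultdict(list)
    for accession, organism, sequence in entries:
        # Exact match only: a single differing residue gives a different key.
        groups[(organism, sequence.upper())].append(accession)
    return [(org, seq, accs) for (org, seq), accs in groups.items()]


# Hypothetical accessions and toy sequences, for illustration only.
toy_entries = [
    ("A1", "Homo sapiens", "MKTAYIAKQR"),
    ("B7", "Homo sapiens", "MKTAYIAKQR"),  # same sequence and organism: merged
    ("C3", "Homo sapiens", "MKTAYIAKQK"),  # one residue differs: kept separate
    ("D9", "Mus musculus", "MKTAYIAKQR"),  # other organism: kept separate
]
merged = merge_exact_duplicates(toy_entries)
for organism, sequence, accessions in merged:
    print(organism, sequence, accessions)
```

Four toy entries collapse to three records, which shows why "non-redundant" is optimistic: the near-duplicate that differs by one residue survives untouched.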
Another organization that operates protein sequence databases is the Protein Information Resource of the National Biomedical Research Foundation, a private, not-for-profit institution based in Washington, DC. It, too, offers a non-redundant version called PIR-NREF. Being independent of both EBI and NCBI, it starts with both TrEMBL and GenPept, adds in some secondary databases, and merges identical entries as above.
Secondary, curated databases
Beyond the primary databases there exist three secondary protein sequence databases that try to combine biologically redundant entries. These are EBI’s SWISS-PROT, produced through a collaboration with the Swiss Institute of Bioinformatics; NCBI’s RefSeq and its companion LocusLink; and PIR’s Protein Sequence Database, produced in collaboration with the Munich Information Centre for Protein Sequences and the Japan International Protein Information Database. Both RefSeq and PIR-PSD are freely available to all. SWISS-PROT is freely available to academics for non-commercial purposes, but commercial users require a license from GeneBio, a private firm.
The secondary databases seek to coalesce all sequence entries for a given protein into a single comprehensive entry that includes all known splice forms and other variations. The databases also strive to provide an accurate and up-to-date summary of what is currently known about the protein’s biological function. This can only be done by human curators with substantial biological expertise who can review the sequence data and literature on each protein to figure out what’s going on. The databases also include entries for “hypothetical” proteins whose sequences don’t match any known ones, although the coverage varies.
In addition to its role as a secondary database, SWISS-PROT is also a primary database for proteins that were sequenced de novo, in contrast to those whose sequence is inferred from an mRNA transcript. There aren’t very many of these.
International Protein Index
EBI’s IPI is a new type of protein database that tries to sit between non-redundant databases and secondary ones. Like the non-redundant databases, IPI is computer generated without the assistance of human curators. Like the secondary databases, it attempts to merge biologically redundant entries with the aim of creating one entry per transcript. It remains to be seen whether this mid-ground is terribly useful.
The IPI website provides a good description of its approach. At present, IPI is limited to human sequences. It starts with all human protein sequences from SWISS-PROT, TrEMBL, RefSeq, Ensembl’s genome annotation, and NCBI’s genome annotation. It does all pairwise comparisons under very stringent conditions, and clusters sequences using a multi-step procedure that ensures that no cluster contains more than one SWISS-PROT and RefSeq entry. It does not attempt to link alternate splice forms of the same protein, or to highlight sequence variation that shows up in different members of the same cluster. Nor does it try to select the best description of the protein’s biological function — indeed, there’s no mention of how it derives the biological description that it reports, though it seems to prefer EBI data sources (SWISS-PROT and TrEMBL) over all others.
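To make the clustering constraint concrete, here is a rough Python sketch of one way to cluster entries so that no cluster ends up with more than one entry from a given curated source. It is not IPI's actual multi-step procedure, which is only summarized above. It assumes the pairwise comparison has already produced a list of matching pairs, strongest first, and it reads the constraint as "at most one SWISS-PROT entry and at most one RefSeq entry per cluster." The function name and entry identifiers are hypothetical.

```python
def cluster_with_curated_limit(entries, matches, curated=("SWISS-PROT", "RefSeq")):
    """Single-linkage clustering that refuses merges violating the limit.

    entries: dict mapping entry id -> source database name.
    matches: iterable of (id_a, id_b) pairs, strongest match first.
    """
    parent = {e: e for e in entries}
    # For each cluster root, how many members come from each curated source.
    counts = {
        e: {db: int(src == db) for db in curated} for e, src in entries.items()
    }

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in matches:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        if any(counts[ra][db] + counts[rb][db] > 1 for db in curated):
            continue  # would put two entries from one curated source together
        parent[rb] = ra
        for db in curated:
            counts[ra][db] += counts[rb][db]

    clusters = {}
    for e in entries:
        clusters.setdefault(find(e), []).append(e)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical entries and matches, for illustration only.
toy_sources = {
    "sp1": "SWISS-PROT", "sp2": "SWISS-PROT",
    "tr1": "TrEMBL", "rs1": "RefSeq", "ens1": "Ensembl",
}
toy_matches = [("sp1", "tr1"), ("tr1", "rs1"), ("tr1", "sp2"), ("sp2", "ens1")]
toy_clusters = cluster_with_curated_limit(toy_sources, toy_matches)
print(toy_clusters)
```

In the toy data, sp2 matches an entry in sp1's cluster but is kept out because the cluster already has a SWISS-PROT member. It forms a second cluster with ens1 instead. A rule like this can therefore leave closely related sequences in separate clusters.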
The Lowdown on IPI
To assess the overall quality of the IPI clusters, I selected 20 at random and poked and prodded.
Five of the 20 IPI clusters contained entries from three or more databases, and all five had descriptions that indicated a known biological function. These were IPI00002142, IPI00012449, IPI00012753, IPI00032838, and IPI00100386. All five of these were in RefSeq, although IPI missed the RefSeq entry for two of them. In two cases, the IPI description was at odds with and probably inferior to the LocusLink description of the corresponding RefSeq entry. One was IPI00002142, which IPI described as “KIAA1400 protein (Fragment),” while LocusLink identified it as a known splice variant of a known gene, protocadherin 10 (PCDH10). The other was IPI00032838, which IPI called “TRKC protein,” while LocusLink recognized it as a known gene, neurotrophic tyrosine kinase, receptor, type 3 (NTRK3). LocusLink also indicated that IPI’s description, TRKC, is an obsolete name for the same gene.
Another four clusters contained entries from one or two databases and had descriptions suggesting a known biological function: IPI00003623, IPI00005204, IPI00044667, and IPI00098320. Only one of the four, IPI00098320, was listed as having a RefSeq entry. Its IPI description was “polo-like kinase (Drosophila).” LocusLink has basically the same description, but included a gene name, PLK.
I found a RefSeq entry for a second cluster, IPI00005204, that was missed by IPI. The IPI and LocusLink descriptions were identical for this case, but curiously the RefSeq entry was contained in a different IPI cluster, IPI00020989.
To test IPI’s clustering approach, I FASTAed all 20 clusters against the rest of the IPI database to check for overlaps. Twelve of the 20 showed strong evidence of overlaps with other clusters in the database, including seven that seemed to overlap multiple other clusters. Most of what I’m calling “strong” matches were 95 percent or more identical over 100 or more residues. Some were much stronger than this, up to 100 percent identity over 672 residues and 97 percent identity over 1,310 residues.
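The "strong match" rule of thumb used here can be written down as a small filter. The Python sketch below assumes the search results have already been parsed into records with a percent identity and an aligned length. The record layout and the cluster names are hypothetical and do not correspond to any particular FASTA output format.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    aligned_residues: int


def is_strong_overlap(hit, min_identity=95.0, min_length=100):
    """True when a hit against a *different* cluster meets both thresholds."""
    return (
        hit.query != hit.subject
        and hit.percent_identity >= min_identity
        and hit.aligned_residues >= min_length
    )


# Hypothetical hits; the identity and length figures echo those in the text.
toy_hits = [
    Hit("clusterA", "clusterA", 100.0, 900),   # self-hit, ignored
    Hit("clusterA", "clusterB", 100.0, 672),   # strong
    Hit("clusterA", "clusterC", 97.0, 1310),   # strong
    Hit("clusterA", "clusterD", 99.0, 40),     # too short
    Hit("clusterA", "clusterE", 80.0, 500),    # identity too low
]
strong = [h.subject for h in toy_hits if is_strong_overlap(h)]
print(strong)
```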
I looked closely at one overlap: IPI00005204 in my sample vs. IPI00020989. The two clusters are identical across 1,300 residues, except for a 15-residue insertion in one and a roughly 100-residue insertion in the other. This smells like alternative splice forms of the same gene, although neither is annotated as such.
Poor Dataslave. It was a long weekend after all. Once he discovered that IPI was not a miracle cure, it was back to the usual grunt work. Search each of Carbonoid’s genes against the databases, collect the results in a local database or spreadsheet, merge the ones that are identical and investigate those that overlap but aren’t precisely the same.
It would sure be nice if someone would automate this. Come to think of it, isn’t that what IPI-done-right would accomplish?
Basic Stats on Databases
Most of the databases mentioned in this article can be downloaded from the web as FASTA files. I did so for all but the Japanese databases and calculated the number of human entries. For TrEMBL, it was not possible to infer the organism from the FASTA file, so I obtained the number of human entries by querying the EBI website. I got similar information from websites in a few other cases, as well, to make sure the FASTA files and web databases were at least roughly in synch.
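Counting entries in a FASTA file takes only a few lines. The Python sketch below assumes the organism appears in each header line as a bracketed tag such as "[Homo sapiens]". That is an assumption about the header format, not something every database guarantees, and it is exactly what fails for files whose headers omit the organism. The toy records are invented.

```python
import io


def count_entries(fasta_handle, organism_tag="[Homo sapiens]"):
    """Return (total entries, entries whose header contains organism_tag)."""
    total = tagged = 0
    for line in fasta_handle:
        if line.startswith(">"):  # each header line starts one entry
            total += 1
            if organism_tag in line:
                tagged += 1
    return total, tagged


# Toy FASTA content with hypothetical accessions.
toy_fasta = io.StringIO(
    ">acc1 hypothetical protein [Homo sapiens]\n"
    "MKTAYIAKQR\n"
    ">acc2 some kinase [Mus musculus]\n"
    "MKVLAAGIVG\n"
    ">acc3 another protein [Homo sapiens]\n"
    "MAAPLLK\n"
)
total, human = count_entries(toy_fasta)
print(total, human)
```

For a real download, an open file handle can be passed in place of the `io.StringIO` object.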
For the primary databases, I found that TrEMBL (including updates, but not SWISS-PROT) contained 37,816 human entries, while GenPept contained 88,633. Among non-redundant databases, SPTR contained 45,903 human entries, NCBI’s nr had 112,269, and PIR-NREF had 77,043. For the secondary databases, the numbers were 8,000 for SWISS-PROT, 15,067 for RefSeq, and 6,507 for PIR-PSD. Ensembl’s genome annotation contained 21,619 “known” genes and 7,457 “novels” for a total of 29,076. NCBI’s annotation contained 45,374 entries. The number for IPI was 66,986.
It’s apparent that databases that are supposed to be equivalent are not. The discrepancy between SPTR and nr is amazing: nr holds 112,269 human entries to SPTR’s 45,903, nearly two and a half times as many.
I next broke down IPI by component database. The database contained a total of 112,710 references to its component databases, an average of 1.68 references per cluster. Of the 66,986 IPI clusters, 7,880 contained entries from SWISS-PROT, 19,731 from TrEMBL, 14,571 from RefSeq, 27,109 from Ensembl, and 43,419 from NCBI’s genome annotation. 39,807 clusters contained entries from one database, 11,832 contained entries from two, 12,338 from three, 2,820 from four, and 189 clusters contained entries from all five component databases.
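The figures in this breakdown can be cross-checked against one another with a little arithmetic, shown here in Python. The numbers come straight from the paragraph above. The third check assumes that a "reference" means one cluster containing entries from one component database, so that a cluster drawing on three databases contributes three references.

```python
# Clusters containing entries from each component database.
clusters_with_source = {
    "SWISS-PROT": 7880,
    "TrEMBL": 19731,
    "RefSeq": 14571,
    "Ensembl": 27109,
    "NCBI annotation": 43419,
}
# Number of clusters drawing on exactly 1, 2, 3, 4, or 5 databases.
clusters_by_database_count = {1: 39807, 2: 11832, 3: 12338, 4: 2820, 5: 189}

total_references = sum(clusters_with_source.values())
total_clusters = sum(clusters_by_database_count.values())
# Each cluster contributes one reference per database it draws on.
references_from_counts = sum(n * c for n, c in clusters_by_database_count.items())
average_references = total_references / total_clusters

print(total_references, total_clusters, references_from_counts)
print(f"{average_references:.2f} references per cluster")
```

All three totals agree with the text: 112,710 references, 66,986 clusters, and an average of about 1.68.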
Looking at the situation from the perspective of the component databases: 99 percent of SWISS-PROT entries were present in IPI, as were 52 percent of TrEMBL entries, 97 percent of RefSeq, 93 percent of Ensembl, and 96 percent of NCBI’s genome annotation.
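These percentages appear to follow from dividing the number of IPI clusters that contain entries from a database by that database's count of human entries. That derivation is an assumption, but it reproduces the reported figures once they are rounded to whole numbers. A short Python check, using only numbers given in this sidebar:

```python
# (IPI clusters containing the source, human entries in the source)
component_counts = {
    "SWISS-PROT": (7880, 8000),
    "TrEMBL": (19731, 37816),
    "RefSeq": (14571, 15067),
    "Ensembl": (27109, 29076),
    "NCBI annotation": (43419, 45374),
}

coverage = {
    name: 100.0 * in_ipi / human_entries
    for name, (in_ipi, human_entries) in component_counts.items()
}
for name, percent in coverage.items():
    print(f"{name}: {percent:.1f} percent")
```

The calculation treats each cluster as covering one entry from a source, which can only be approximately true. TrEMBL is the one database where roughly half of the entries are unaccounted for.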
I also looked to see how many IPI clusters represented known vs. novel genes. I found that 23,075 clusters (34 percent of the total) had description lines that suggested a real biological function, while the remainder were described as “hypothetical proteins,” or as “similar to” some other protein, or had no description line at all.
Protein Data Organizations
European Bioinformatics Institute http://www.ebi.ac.uk/
Geneva Bioinformatics http://www.genebio.com/
Japan International Protein Information Database No good URL available. See http://www.sut.ac.jp/edocs/edu/rib.html
Japan National Institute of Genetics http://www.nig.ac.jp/index-e.html
Munich Information Centre for Protein Sequences http://www.mips.biochem.mpg.de/
Protein Information Resource http://pir.georgetown.edu/
Swiss Institute of Bioinformatics http://www.isb-sib.ch/
US National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/
Japan National Institute of Genetics
DAD No good URL available. See DNA Data Bank of Japan
DNA Data Bank of Japan http://www.ddbj.nig.ac.jp/
European Bioinformatics Institute
EMBL Nucleotide Sequence Database http://www.ebi.ac.uk/embl/
International Protein Index http://www.ebi.ac.uk/IPI/IPIhelp.html
SPTR http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-page+LibInfo+-id+1gAE51IeoVe+-lib+SWALL. Or access by clicking SWALL (SPTR) from the SRS search page at EBI.
TrEMBL http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-page+LibInfo+-id+1gAE51IeoVe+-lib+SPTREMBL. Or access by clicking SpTrEMBL from the SRS search page at EBI.
Ensembl’s human genome annotation http://www.ensembl.org/Homo_sapiens/
Protein Information Resource
PIR Protein Sequence Database http://pir.georgetown.edu/pirwww/search/textpsd.shtml
PIR Non-redundant REFerence protein database http://pir.georgetown.edu/pirwww/search/pirnref.shtml
US National Center for Biotechnology Information
GenPept No good URL available. See remarks in ftp://ftp.ncbi.nih.gov/genbank/README.genbank and http://www.ncbi.nlm.nih.gov/Sitemap/
NCBI’s human genome annotation http://www.ncbi.nlm.nih.gov/genome/guide/human/
nr Access by clicking “choose database” from the protein-protein BLAST search page at NCBI.
I must admit that I don’t understand the technical rationale for IPI’s approach.
It would seem better to start with the best curated databases — SWISS-PROT, RefSeq, and PIR-PSD — resolve any discrepancies, and create an IPI entry that combines the curated entries for each gene. Resolving the discrepancies among the curated databases would probably require human intervention, but it would surely be worth it to fix any problems in these important databases.
Next, sweep up into each IPI cluster the entries from the primary databases that were used to create the curated entries. What remains at this point is the dross the curators couldn’t figure out as well as any new sequences added to the databases since the curators last looked. Next try to add these leftovers to the existing clusters. And finally perform de novo clustering of what remains.
This approach would guarantee that the new database would be no worse than the best existing databases, which seems a reasonable objective.
There used to be a database built along these lines, OWL from the University of Leeds. Sadly it is no longer maintained.