Database reliability is a crucial aspect of bioinformatics, since researchers want to be sure, for example, that the predicted structure of a protein-coding gene is correct. In a recent paper published in BMC Bioinformatics, László Patthy at the Biological Research Centre of the Hungarian Academy of Sciences and his colleagues describe a method they developed called MisPred that identifies “suspicious proteins,” meaning ones likely to be mispredictions, in public databases.
MisPred’s developers believe the tool can be used to not only flag errors in current resources, but to help guide the correction of those errors and possibly even improve the quality of future gene-prediction methods.
MisPred uses five “dogmas” or “conflicts” to identify proteins that may be incorrectly annotated.
In the paper, Patthy and colleagues analyzed predicted Ensembl protein sequences of 11 species and found that “the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions.” They also applied MisPred to the National Center for Biotechnology Information’s GNOMON annotation pipeline and found that the rates of mispredictions were “comparable” to those of Ensembl.
The authors note that even the manually curated UniProtKB/Swiss-Prot dataset “is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON-predicted entries.”
Patthy is group leader at the Institute of Enzymology at the Biological Research Centre of the Hungarian Academy of Sciences. For the last 25 years he and his colleagues have been working on multidomain proteins, both experimentally and computationally. This work includes genomic and protein sequence analysis for gene and protein prediction, such as predicting the most probable function of a gene or protein.
BioInform recently interviewed Patthy via e-mail about MisPred and the implications of its findings for bioinformatics databases and software. The following text is an edited version of that correspondence.
Could you explain a bit about your institute? Is it affiliated with a university?
The Institute of Enzymology, which is part of the Biological Research Centre of the Hungarian Academy of Sciences, is not affiliated with a university. The Hungarian Academy of Sciences runs a large network of research institutes covering all major areas of natural sciences.
The Biological Research Centre of HAS was founded in 1973 to promote molecular biology in Hungary and is one of five institutes that focus on biophysics, biochemistry, genetics, plant biology, and enzymology.
Our functional genomics and bioinformatics work includes prediction of gene and domain structure, and predicting protein functions. … In the early 1980s, we were the first to define numerous protein module families of multidomain proteins.
Our team is a participant in the BioSapiens Network of Excellence of the FP6 Framework Program of the European Union, the goal of which is to annotate genome data with the help of bioinformatic tools and experimental results. In particular, our team is working to characterize [the] human genome gene set, with an emphasis on protein-coding genes. Our team is also a participant in the eScience Regional University Knowledge Centre of Budapest, where we focus on the annotation of human genes of potential interest for biotechnology.
In addition to bioinformatics, our group also has expertise in molecular biology: we clone and express the novel proteins and novel protein domains that we defined in silico, characterize their biological function and, in collaboration with nuclear magnetic resonance groups in the US and Australia, determine their 3D structure.
What motivated you to launch MisPred?
In the last 25 years my research group has been studying various multidomain proteins that have major medical importance, such as tissue plasminogen activator, and in these studies we usually combined the results of bioinformatic analyses with wet lab experiments to determine their structure and function.
In the course of our analyses of public sequence databases we have frequently encountered protein sequence entries that looked suspicious. For example, we thought that a sequence is unlikely to be correct if it contains just a fragment of a folding domain or — in the case of a protein destined to be extracellular — it is missing sequence signals that could direct it to the extracellular space. These proteins are unlikely to be viable since they would not fold and function properly. Whenever we took a closer look at these entries and performed additional analyses, we could not only show that they are incorrect, but could actually determine their correct sequence.
Perhaps I can best illustrate this point by one of our current experimental projects, which is actually the first practical application of the MisPred approach and which led to the discovery of two novel, medically important proteins that we named WFIKKN1 and WFIKKN2.
First, we concluded that the predicted structures of two candidate protein-coding genes in the human genome are probably incorrect: although the hypothetical multidomain proteins encoded by these candidate genes contained several extracellular domains, they did not possess signal peptides and they contained only a fragment of another extracellular domain.
Next, using appropriate bioinformatics tools, we predicted the correct structure of the genes; that is, we identified those parts of the genes [that] encoded the missing signal peptide and the missing part of the truncated domain. Then we cloned the full-length cDNAs of these proteins with the help of PCR primers based on the “corrected” structure and thus have verified that the “original” prediction was incorrect and that the MisPred-directed correction is valid.
Finally, it was shown recently that both proteins bind myostatin and GDF11 (members of the TGF-β family), thus implicating them in the regulation of important biological processes such as muscle growth, neurogenesis, etc.
So, encouraged by such successes of the MisPred concept we decided to develop a MisPred tool that could be used to systematically identify errors in public databases.
Who will benefit from your work?
First, we thought that it is very important to identify erroneous entries in public databases and inform academic users since this would help protect them from drawing erroneous conclusions based on erroneous data. Biotechnology also relies heavily on information originating from genome projects; therefore errors in gene prediction also have a major impact on biotechnology.
Second, the MisPred approach not only cautions users about the fact that an entry might be erroneous, but actually guides the correction process and in this way it helps the definition of the correct, or “novel,” structure of genes and proteins.
So, in our opinion, both academic users and biotechnology would benefit from such tools.
What kind of reactions have you received thus far?
Although our paper was published just a few weeks ago, we have already been contacted by several leading experts from various genome research institutes. The reactions were positive: they ranged from collaboration proposals to suggestions for additional MisPred dogmas. Most experts seem to agree that downstream producers of biomedical research pay a high cost for the errors in genome annotation and that there is an enormous need for approaches by which gross errors at the major databases can be fixed.
What is next? Are you planning to commercialize this tool?
We will continue to develop MisPred by adding additional dogmas to the pipeline for the identification of additional types of sequence errors. Moreover, we have already taken [the] first steps to develop FixPred for the automated correction of sequences identified by MisPred as erroneous.
Correction of erroneous sequences is a complex task, usually performed manually by expert bioinformaticians, and to the best of our knowledge no attempt has been made by others to automate this process. Since these tools may be expected to be of interest for research institutes as well as biotechnology companies working on exploiting the results of various genome projects, we plan to explore the possibility of commercializing these products.
At the core of MisPred are five dogmas: that the subcellular localization of extracellular and transmembrane proteins is defined by the presence of appropriate sequence signals; that transmembrane proteins with cytoplasmic and extracellular parts must also have a transmembrane segment; that extracellular and nuclear domains do not occur in the same protein; that in a domain family, protein fold is conserved, so the number of amino acid residues in a domain of closely related proteins falls into a narrow range; and that a protein is encoded by exons on a single chromosome.
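As an illustration only, the five dogmas can be written as simple rules over pre-computed annotations. The sketch below is not MisPred code: the `ProteinFeatures` record, its field names, and the `violated_dogmas` function are hypothetical, and treating a transmembrane segment as an acceptable sequence signal for dogma 1 is an assumption based on the interview text.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinFeatures:
    """Hypothetical pre-computed annotations for one predicted protein."""
    has_extracellular_domain: bool = False
    has_cytoplasmic_domain: bool = False
    has_nuclear_domain: bool = False
    has_signal_peptide: bool = False
    has_transmembrane_segment: bool = False
    has_abnormal_size_domain: bool = False
    exon_chromosomes: set = field(default_factory=set)


def violated_dogmas(p):
    """Return the numbers (1-5) of the dogmas this record conflicts with."""
    violations = []
    # Dogma 1: extracellular domains need a sequence signal to reach their location.
    if p.has_extracellular_domain and not (
        p.has_signal_peptide or p.has_transmembrane_segment
    ):
        violations.append(1)
    # Dogma 2: extracellular plus cytoplasmic parts imply a transmembrane segment.
    if (
        p.has_extracellular_domain
        and p.has_cytoplasmic_domain
        and not p.has_transmembrane_segment
    ):
        violations.append(2)
    # Dogma 3: extracellular and nuclear domains do not co-occur.
    if p.has_extracellular_domain and p.has_nuclear_domain:
        violations.append(3)
    # Dogma 4: domain sizes in a family fall into a narrow range.
    if p.has_abnormal_size_domain:
        violations.append(4)
    # Dogma 5: all exons of a gene lie on a single chromosome.
    if len(p.exon_chromosomes) > 1:
        violations.append(5)
    return violations


# A made-up record resembling the kind of misprediction described in the
# interview: extracellular domains, no signal peptide, one truncated domain.
truncated = ProteinFeatures(
    has_extracellular_domain=True,
    has_abnormal_size_domain=True,
    exon_chromosomes={"chr17"},
)
print(violated_dogmas(truncated))  # [1, 4]
```

Keeping each dogma as an independent check means new rules can be appended without touching the existing ones.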
How did you devise these dogmas? Might someone critique [that] there should be more, fewer, or different ones?
The MisPred approach is based on the very general principle that a protein-coding gene is likely to be mispredicted if some of its features or features of the protein it encodes conflict with our current knowledge about protein-coding genes and proteins.
Of course, there are many more than five dogmas about viable proteins. In the paper we have just published we illustrated this approach with just five routines, based on five dogmas. But MisPred is being further developed by adding additional dogmas.
The reasons for the choice of the first five dogmas were both historical and practical. First, our original observation was that a large proportion of erroneous entries violate either the first dogma, which specifies that the extracellular localization of extracellular proteins is defined by the presence of appropriate sequence signals, or the fourth, which states that the protein fold is highly conserved in a domain family and therefore does not tolerate truncation. So these tools were historically the first to be developed and are also very important since they affect many entries.
Second, it was relatively easy to automate these tools since they could be constructed from existing, reliable bioinformatics tools.
It was also clear from our analyses that the five routines detect only a fraction of the truly erroneous sequences; therefore there is significant need for additional routines. We are now in the process of developing additional routines, progressing from simple routines to routines requiring more complex analyses.
Can you elaborate on your results, such as those for the UniProtKB/TrEMBL and NCBI’s Gnomon pipeline? Are you concerned that people will question MisPred’s findings?
Since the Swiss-Prot section of UniProtKB is the gold standard of protein databases and each entry is manually annotated and curated by experts in the field, we have used Swiss-Prot as the benchmark with which to validate the concepts behind the MisPred pipeline.
We expected that very few, if any, of the Swiss-Prot sequences would be erroneous. Interestingly, examination of Swiss-Prot entries has identified a number of truly erroneous sequences that we could correct as guided by MisPred. Nevertheless, the fact that the number of Swiss-Prot entries identified by MisPred as erroneous is very low attests to the high quality of this database and the reliability of the MisPred approach.
On the other hand, our analysis of protein sequences of the TrEMBL database has revealed that the error rates are orders of magnitude higher than those for the Swiss-Prot dataset. Indeed, 58 percent of human TrEMBL proteins containing at least one extracellular domain were found by MisPred to lack a signal peptide and/or a transmembrane segment in contrast to just 1.05 percent in the case of human Swiss-Prot entries.
Similarly, 14.8 percent of human TrEMBL entries containing at least one member of the Pfam-A domain families suitable for the study of domain integrity were found to contain a domain of abnormal size, while this value is only 0.14 percent in Swiss-Prot.
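To put "orders of magnitude" in concrete terms, the short calculation below divides the TrEMBL percentages quoted above by the corresponding Swiss-Prot percentages. It uses only the four figures given in the interview; the `fold_difference` helper is a hypothetical name, and the comparison is rough because each percentage is taken over its own subset of entries.

```python
import math

# Percentages of human entries flagged by MisPred, as quoted in the interview.
rates = {
    "dogma 1 (missing sequence signal)": {"TrEMBL": 58.0, "Swiss-Prot": 1.05},
    "dogma 4 (abnormal domain size)": {"TrEMBL": 14.8, "Swiss-Prot": 0.14},
}


def fold_difference(trembl_pct, swissprot_pct):
    """How many times higher the TrEMBL rate is than the Swiss-Prot rate."""
    return trembl_pct / swissprot_pct


for name, r in rates.items():
    fold = fold_difference(r["TrEMBL"], r["Swiss-Prot"])
    orders = math.log10(fold)
    print(f"{name}: about {fold:.0f}-fold ({orders:.1f} orders of magnitude)")
```

On these figures the TrEMBL rates are roughly 55 and 106 times the Swiss-Prot rates, that is, between one and two orders of magnitude.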
The large number of erroneous sequences detectable by dogmas 1 and 4 is due primarily to protein fragments virtually translated from non-full-length cDNAs: the incomplete proteins lack signal peptides or domain parts.
I think that most expert users are aware of the fact [that] TrEMBL contains a large proportion of “protein fragments” translated from non-full-length cDNAs. In fact, the vast majority of the entries identified by MisPred as suspicious are also annotated as fragments in TrEMBL. Another reason why I think that it is unlikely that people will question the validity of MisPred is that the same MisPred that detects so many errors in TrEMBL detects very, very few errors in the high-quality Swiss-Prot section of UniProtKB.
MisPred analyses of sequences predicted by the EnsEMBL and GNOMON gene prediction pipelines have revealed that in the case of both pipelines, the majority of erroneous entries are returned for dogmas 1 and 4. The relatively high number of erroneous proteins detectable with the tool for dogma 1 is due to the fact that detection of exons encoding signal peptides is one of the most difficult tasks in gene prediction.
In vertebrates, secretory signal peptides are frequently encoded by distinct, short, poorly conserved exons that may be easily missed by gene-finding programs. A significant proportion of sequences were returned for dogma 4 for all vertebrates analyzed, suggesting that erroneous omission or insertion of exons, which causes the domain size to deviate, is a major source of error in gene prediction.
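The domain-size check behind dogma 4 can be pictured as a comparison of a predicted domain's length with the lengths seen in its family. The sketch below is an assumption-laden illustration, not the criterion used in the paper: the `has_abnormal_size` function, the median-based rule, the 25 percent tolerance, and the family lengths are all invented for the example.

```python
from statistics import median


def has_abnormal_size(domain_length, family_lengths, tolerance=0.25):
    """Flag a domain whose length deviates from the family median
    by more than `tolerance` (a fraction of the median)."""
    typical = median(family_lengths)
    return abs(domain_length - typical) > tolerance * typical


# Made-up lengths (in residues) for members of one hypothetical domain family.
family_lengths = [51, 52, 53, 53, 54]

print(has_abnormal_size(53, family_lengths))  # typical length
print(has_abnormal_size(30, family_lengths))  # too short, e.g. an omitted exon
print(has_abnormal_size(80, family_lengths))  # too long, e.g. an inserted exon
```

A two-sided test matters here because, as noted above, both omitted and wrongly inserted exons shift the domain size away from the family norm.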
What is the most important result you would like bioinformaticists to consider?
The novelty of the current MisPred pipeline is the combination of existing, widely used programs such as BLAST, HMMER, PrediSi [prediction of signal peptides], [and] TMHMM [transmembrane helix prediction]. For example, the routine for dogma 1 uses Pfam tools to identify proteins that contain Pfam domains that occur only in the extracellular space, and then uses PrediSi and TMHMM to ask whether such proteins have sequence signals that can direct these domains to the extracellular space.
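The dogma 1 routine described here is essentially a composition of existing predictors. The following minimal sketch shows that composition with the external tools passed in as plain functions; it does not call Pfam, PrediSi, or TMHMM, and every name in it (`dogma1_suspicious`, the three callables, the domain labels, the stub predictors) is hypothetical.

```python
from typing import Callable, FrozenSet, Iterable


def dogma1_suspicious(
    sequence: str,
    find_domains: Callable[[str], Iterable[str]],    # stand-in for a Pfam search
    has_signal_peptide: Callable[[str], bool],       # stand-in for PrediSi
    has_transmembrane_helix: Callable[[str], bool],  # stand-in for TMHMM
    extracellular_only: FrozenSet[str],
) -> bool:
    """Return True if the protein has an extracellular-only domain
    but no sequence signal that could export or anchor it."""
    domains = set(find_domains(sequence))
    if not domains & extracellular_only:
        return False  # dogma 1 does not apply to this protein
    return not (has_signal_peptide(sequence) or has_transmembrane_helix(sequence))


# Stub predictors with made-up behavior, for demonstration only.
EXTRACELLULAR_ONLY = frozenset({"DomainA", "DomainB"})


def stub_domains(seq: str):
    return ["DomainA"] if "CXC" in seq else []


def stub_signal(seq: str) -> bool:
    return seq.startswith("MKLL")


def stub_tm(seq: str) -> bool:
    return "LLLLLLLLLL" in seq


fragment = "GGCXCGG"           # extracellular domain, no signal
full_length = "MKLLAAGGCXCGG"  # same domain, with a leading signal

print(dogma1_suspicious(fragment, stub_domains, stub_signal, stub_tm, EXTRACELLULAR_ONLY))
print(dogma1_suspicious(full_length, stub_domains, stub_signal, stub_tm, EXTRACELLULAR_ONLY))
```

Passing the predictors in as arguments keeps the rule itself independent of any particular tool, so a different signal-peptide predictor could be swapped in without changing the check.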