Researchers from Genentech have released under an open-source license a web-based protein sequence database that they claim addresses the "data preparation burden" facing most bioinformatics research groups.
Reece Hart, scientific manager of research computing and informatics at Genentech, along with Kiran Mukhyala, a bioinformatics programmer and analyst at the firm, began developing the platform five years ago to support the company's target discovery efforts, Hart told BioInform via e-mail.
Hart and his colleagues built the system, called Unison, to reduce the amount of time they were spending on data preparation prior to bioinformatics analysis — a tedious process that entails downloading, formatting, characterizing, and integrating many different types of data from multiple sources.
Recognizing that many scientists outside of Genentech are grappling with the same issues — and in many cases developing very similar systems on their own — they decided to release the system under the open source Academic Free License.
"I would be thrilled to see Unison become a community endeavor," Hart said. "We've done a lot, but a lot more is possible."
Unison is a data warehouse that includes approximately 12 million protein sequences from 20 sources, as well as 200 million pre-computed protein predictions — including protein domains and motifs, signal and transmembrane domains, secondary and tertiary structure, disorder, cellular localization, phosphorylation, and genomic alignment and clustering.
Hart and his colleagues formally released Unison earlier this year at the Pacific Symposium on Biocomputing, where they presented a paper on the project.
Russ Altman, professor of bioengineering, genetics, medicine and computer science and chairman of the department of bioengineering at Stanford University, who helps run PSB, told BioInform in an e-mail that he was "very pleased" that Genentech "made [its] tool available to the public, including the full database schema, API, and some predictions."
In their paper, Hart and Mukhyala reported that the platform has "improved the consistency and currency of data" used in discovery and sequence analysis at Genentech and has become a foundation for many in-house projects. Among these is the discovery, published in 2006 in the journal Cell Death & Differentiation, that the zebrafish genome encodes homologs of the mammalian Bcl-2 protein family, indicating that the organism could be a useful model for studying apoptosis.
Hart said that Unison received 60,000 connections from 15 Genentech users last year. "These are direct-to-database connections, typically for [data] mining campaigns. In addition, the data are also used anonymously by our in-house web pages."
Externally, he said Unison usage is still "light," likely because the PSB 2009 paper and presentation were really the "first broad public announcement" of the project.
Hart is hoping to raise awareness of Unison, however. "The current public release is a seed of an idea, with a framework that scales for open science."
The researchers created a Unison feature request page on SourceForge in hopes of gathering user input, but Hart said he already has a long wish-list of his own, ranging from minor "blemishes" to new user features such as full-text search. Unison's worth, he said, "is based on the biological problems it solves."
Unison is updated internally about every other week, he said. "Updates of the external site and data are performed as we're able, which historically has meant every three months or so," he said. Unison uses a PostgreSQL relational database running in a GNU/Linux environment, and installation requires around 200 gigabytes of disk space. The API is in BioPerl.
Hart said that Genentech's internal version of the database relies on Kerberos for authentication, but the public version of Unison "is wide-open," with no registration required.
Open Sharing and Integrating
Unison aggregates public and proprietary sequences, and stores proteins that were predicted with both public and proprietary algorithms, but the public version contains only public sequences and predictions made with freely available tools or web services.
"We are extremely fortunate to work in a field where many scientists are personally committed to the principles of open access, open sharing, and free software; they are the ones who enable a resource like Unison," Hart said.
The principle behind Unison, the scientists explain in their paper, is to offer users a "standardized, integration platform" of databases and protein predictions so that scientists can turn to a single resource for a range of questions about protein composition, structure, and function.
The platform employs two levels of integration, Hart said. Semantic integration pulls together concepts of "fundamentally different types" from distinct sources, such as protein sequences, domains, and structures. Unison also integrates data at a second level, which Hart called "concept consolidation," to reconcile the "specialized representations from source databases into a single, abstract representation."
This representation contains only the "salient characteristics" of a concept and "explicitly discards the proprietary information," he said. Through this degree of integration, the representations are "minimalistic."
"Within Unison, relationships among entities are made primarily at this level," he said.
Hart and Mukhyala acknowledge in the paper that while "high-quality and well-known integrative databases" such as ATLAS, BioMart, InterPro, RefSeq, STRING, and UniProt offer data for scientists, data preparation is often still a bottleneck.
Some data prep challenges have to do with limitations of the databases in terms of licensing restrictions, incompleteness, and access methods, Hart said.
As Hart explained, a computational biologist's job typically starts with downloading the data needed for a given study, for example, protein sequences or SNPs, followed by filtering the data, integrating it, and then undertaking the necessary computing and storing that data.
"The tasks are often performed manually and ad hoc," he said, and often scientists complete them only just for the job at hand. "When the project needs to be updated, they often repeat their efforts from scratch." As a result, re-running a similar process with other data is not straightforward nor is the reproduction of a study by a separate group of researchers.
"Unison helps with all of these problems — downloading, integration, pre-computing, currency, and consistency — for a variety of data that are commonly used for proteomic mining," he said.
The Genentech platform isn't trying to replace current databases, he said, but rather it aggregates specific biological entities and relies on links back to the source databases for additional information.
"One way to view this is that Unison represents biological concepts, not the proprietary representations of source databases," Hart said. "By representing concepts, users have an easier time understanding the schema and the relationships among the tables."
Unison uses BabelFish to translate sequence accessions based on exact sequence identity, Hart said. He said that while he and other scientists have used this approach before, "having such a tool for such a comprehensive set of sequences is new."
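The article does not describe how BabelFish is implemented, but translation of accessions by exact sequence identity can be sketched by keying records on a hash of the sequence itself; the sources, accessions, and sequences below are invented purely for illustration:

```python
import hashlib

def seq_key(seq: str) -> str:
    """Canonical key: MD5 of the upper-cased amino-acid sequence."""
    return hashlib.md5(seq.upper().encode("ascii")).hexdigest()

# Toy records from two sources that happen to share one identical sequence.
records = [
    ("RefSeq",  "NP_000001", "MKTAYIAKQR"),
    ("UniProt", "P99999",    "MKTAYIAKQR"),
    ("UniProt", "Q88888",    "MNSLVSWQLL"),
]

# Group accessions by exact sequence identity.
by_seq: dict = {}
for source, accession, seq in records:
    by_seq.setdefault(seq_key(seq), []).append(f"{source}:{accession}")

# Translating one accession to its aliases in other databases is then a lookup.
aliases = by_seq[seq_key("MKTAYIAKQR")]
print(aliases)  # ['RefSeq:NP_000001', 'UniProt:P99999']
```

Because the key is derived from the sequence alone, the mapping works across any number of source databases without pairwise cross-reference tables.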
Better Data Prep, Better Science
Besides helping scientists reduce the burden of data prep, Hart and Mukhyala stated in their paper that Unison's pre-computed data can help develop new scientific hypotheses.
One class of problem they use to demonstrate this approach relates to proteins containing an immunoreceptor tyrosine-based inhibitory motif, or ITIM, which are immune system regulators. These proteins contain an extracellular immunoglobulin domain, a transmembrane domain, and an intracellular ITIM. Searching for them in Unison involves translating the query into a Structured Query Language command. Hart pointed out that Unison's "extensive pre-computed data" allows such queries to be extended to explore new scientific hypotheses.
Each of the features of these proteins, Hart said, is "notoriously non-specific" and occurs thousands of times in the human proteome. "Unison is an ideal tool to look for conjunctions of non-specific features rapidly and concurrently, as opposed to serially, as most researchers do," he said.
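As a rough illustration of that non-specificity, an ITIM can be matched with a short regular expression. The consensus pattern and the sequence fragments below are assumptions for illustration; the article does not specify which expressions Unison actually uses:

```python
import re

# One commonly cited ITIM consensus: (S/I/V/L)-x-Y-x-x-(I/V/L).
ITIM_RE = re.compile(r"[SIVL].Y..[IVL]")

# Hypothetical intracellular tail fragments.
tails = {
    "protA": "KRSQEVTYSEVRFK",  # contains VTYSEV, which fits the consensus
    "protB": "KRGQESAPQLRFK",   # no ITIM-like motif
}

hits = {name: bool(ITIM_RE.search(seq)) for name, seq in tails.items()}
print(hits)  # {'protA': True, 'protB': False}
```

A six-residue pattern this loose will match thousands of proteome positions by chance, which is why combining it with other features is essential.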
Querying ITIMs simply, Hart said, would involve searching for an immunoglobulin domain by alignment to an immunoglobulin hidden Markov model from the Pfam database; trying to find a transmembrane domain with the TMHMM server, which predicts transmembrane helices in proteins; and then looking for an ITIM using one of several regular expressions, all in a single query. "Unison provides a means to extend this query in important ways," he said. "For example, we have also used structure prediction to predict [immunoglobulin] domains."
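A minimal sketch of such a conjunction query, using an in-memory SQLite database in place of Unison's PostgreSQL schema; the table names and toy data are invented, not Unison's actual tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pfam_hit   (pseq_id INTEGER, domain TEXT);  -- pre-computed HMM alignments
CREATE TABLE tm_region  (pseq_id INTEGER);               -- pre-computed TMHMM helices
CREATE TABLE itim_match (pseq_id INTEGER);               -- pre-computed ITIM regex hits
""")
conn.executemany("INSERT INTO pfam_hit VALUES (?, ?)",
                 [(1, "ig"), (2, "ig"), (3, "fn3")])
conn.executemany("INSERT INTO tm_region VALUES (?)", [(1,), (3,)])
conn.executemany("INSERT INTO itim_match VALUES (?)", [(1,), (2,), (3,)])

# The conjunction of three non-specific features becomes a single join,
# rather than three sequential tool runs with manual filtering in between.
rows = conn.execute("""
    SELECT DISTINCT p.pseq_id
    FROM pfam_hit p
    JOIN tm_region  t ON t.pseq_id = p.pseq_id
    JOIN itim_match i ON i.pseq_id = p.pseq_id
    WHERE p.domain = 'ig'
""").fetchall()
print(rows)  # [(1,)] -- only protein 1 has all three features
```

Because the predictors' outputs are already materialized as tables, the database engine evaluates all three conditions concurrently.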
The "real power" of the pre-computed data is the ability to ask exploratory questions such as "what extra-cellular domains occur in the context of transmembrane and ITIM domains?" This sort of question can't be posited with a "traditional serial/pipeline search strategy," of the kind he outlined above which means plowing through the different traits of the proteins one by one.
As the Genentech scientists outlined in the paper, the first known ITIM-containing proteins had immunoglobulin extra-cellular domains. Suspecting that diverged ITIM protein homologs might have alternative extra-cellular interaction domains, the researchers used Unison to find all extra-cellular Pfam domains that occur in the context of both a transmembrane domain and an intracellular ITIM.
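That exploratory search can be sketched as a grouping query over the same kind of pre-computed feature tables: instead of fixing the extracellular domain in advance, the query groups over it. The SQLite schema and toy data here are stand-ins, not Unison's actual tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pfam_hit   (pseq_id INTEGER, domain TEXT);
CREATE TABLE tm_region  (pseq_id INTEGER);
CREATE TABLE itim_match (pseq_id INTEGER);
""")
conn.executemany("INSERT INTO pfam_hit VALUES (?, ?)",
                 [(1, "ig"), (2, "ig"), (3, "fn3"), (4, "cadherin"), (5, "fn3")])
conn.executemany("INSERT INTO tm_region VALUES (?)", [(1,), (2,), (3,), (4,), (5,)])
conn.executemany("INSERT INTO itim_match VALUES (?)", [(1,), (2,), (3,), (4,)])

# Which Pfam domains co-occur with a transmembrane region and an ITIM?
# Protein 5 has no ITIM, so its fn3 hit is excluded from the tally.
rows = conn.execute("""
    SELECT p.domain, COUNT(DISTINCT p.pseq_id) AS n
    FROM pfam_hit p
    JOIN tm_region  t ON t.pseq_id = p.pseq_id
    JOIN itim_match i ON i.pseq_id = p.pseq_id
    GROUP BY p.domain
    ORDER BY n DESC, p.domain
""").fetchall()
print(rows)  # [('ig', 2), ('cadherin', 1), ('fn3', 1)]
```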
It turns out that fibronectin III and cadherin domains are also common in this context, Hart said, and they "stood out" as alternative extracellular domains. "This immediately caught my eye because they and immunoglobulins are all beta sandwich structures," he said. Subsequent searches, on which the scientists do not elaborate, revealed "several candidates," at least one of which was later shown to be an ITIM protein, the scientists noted.
"It is this sort of exercise that best illustrates the value of having a huge compendium of sequences, annotations, and pre-computed predictions," Hart said.