Hoping to bring together “a million minds” to annotate proteins, a team of researchers has launched a large-scale, community-based project called WikiProteins that combines automated text, data, and concept mining, manual annotation, and a newly developed software component called the Knowlet.
The project, which aims to annotate proteins and protein-related biomedical concepts such as diseases or organisms, will be powered by a platform technology designed by a group of scientists and a Rockville, Md.-based startup and will be made available free of charge in perpetuity for the scientific community and the public.
Called WikiProfessional, the platform “in a technical sense powers the community-version that we have now put out there, but you could take exactly the same technology platform and install it locally at a pharmaceutical company for them to do drug-lead discovery,” said Albert Mons, a computational linguist and co-founder of the startup, Knewco.
“Philosophically we wanted to hand off the responsibility for the quality of the content of the system to the community and then provide generic technology that can be used to grow the knowledgebase on a daily basis,” Mons told BioInform this week.
Given high-throughput data and the increasing number of papers describing them, “comprehensive and timely annotation of the literature for facts by any central team of experts [is] an unachievable goal. Computer assistance in the annotation process is, therefore, urgently needed,” the scientists wrote in a paper published in the current issue of Genome Biology.
WikiProteins was created over the last two years and announced on May 28 by a team of scientists from the Swiss Institute of Bioinformatics; the GO consortium and the IntAct database at the European Molecular Biology Laboratory-European Bioinformatics Institute; Erasmus Medical Centre and Leiden University Medical Centre, both in the Netherlands; the Brazilian Stela Institute; the WikiMedia Foundation; and Knewco.
Four researchers working in molecular biology and computer science along with two businesspeople and a project manager founded Knewco in January 2006 and have collaborated on the platform they are now launching.
The project is an interactive and semantically supported workspace based on Wiki pages and contains a knowledgebase, a navigation tool, and a section on the people in its annotating community. Beneath that layer is a relational Wiki based on WikiData software, an indexer, and software that creates components called Knowlets, which store the relationships between all the mined concepts.
Knowlets, a proprietary concept-mining software component and ontology format that Knewco has developed over the last two years, are at the core of the platform.
Writing in the Genome Biology paper, the team said information is mined from scientific publications and the Knowlet links two given concepts, but records that information only once.
“This approach results in a minimal growth of the ‘concept space’ as compared to the text space,” the authors wrote. New and unique facts in the scientific literature expand the corpus to a much lesser degree than the totality of the text generated by new academic journal articles.
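The record-once principle can be illustrated with a minimal sketch. The concept names, source identifiers, and data structure below are invented for illustration; Knewco's actual Knowlet format is proprietary. The idea is simply that a Knowlet keeps one entry per concept pair and accumulates evidence, rather than duplicating the relationship for every new paper that mentions it:

```python
# Minimal, invented sketch of the "record once" idea: one entry per
# concept pair, with evidence updated in place as new papers arrive.

knowlet = {}  # (concept_a, concept_b) -> evidence record

def record(a: str, b: str, source: str) -> None:
    key = tuple(sorted((a, b)))            # the pair is order-independent
    entry = knowlet.setdefault(key, {"sources": set(), "cooccurrences": 0})
    entry["sources"].add(source)
    entry["cooccurrences"] += 1            # evidence grows; the pair count does not

record("malaria", "artemisinin", "PMID:1")
record("artemisinin", "malaria", "PMID:2")
print(len(knowlet))  # 1 -- the relationship is stored once
```

This is why the concept space grows far more slowly than the text space: a new article that restates a known relationship adds evidence to an existing entry instead of a new one.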
Concept pairs are placed in what the WikiProteins creators call a “related concept cloud.” By applying a meta-analysis algorithm, the software calculates a semantic association to reflect the strength and type of relationship the concepts have.
The relationship is dynamic and recalculated based on newly mined information. Its calculated value is based on three factors: factual statements found in the scientific literature or databases, increasing co-occurrence of two concepts in a sentence or a paragraph, and predictive associations based on the overlap between the two concepts’ related concept clouds.
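A rough sketch of how such a score might combine the three evidence types follows. The weighting function and all parameter values here are assumptions for illustration only; the actual Knowlet meta-analysis algorithm is proprietary and not described in detail in the article:

```python
import math

# Hypothetical sketch of a Knowlet-style association score. The three
# factors follow the article's description: curated facts, co-occurrence
# counts, and predicted (indirect) associations. Weights are invented.

def association_strength(facts: int, cooccurrences: int, overlap: float,
                         w_fact: float = 1.0, w_cooc: float = 0.1,
                         w_pred: float = 0.5) -> float:
    """Combine the three evidence types into one dynamic score."""
    fact_term = w_fact if facts > 0 else 0.0        # a curated fact counts once
    cooc_term = w_cooc * math.log1p(cooccurrences)  # damp repeated co-mentions
    pred_term = w_pred * overlap                    # predicted overlap in [0, 1]
    return fact_term + cooc_term + pred_term
```

Because the score is a function of the mined evidence, re-running it after each mining pass makes the relationship “dynamic” in the sense the article describes.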
WikiProteins is the first in a series of what Knewco calls WikiProfessional projects that it plans to help launch with partners. As a derivation of the WikiData model, WikiProfessional is open-source and available for download, Mons said.
“An empty WikiProfessional doesn’t make a lot of sense, so our partners, the National Library of Medicine, Swiss-Prot, and others agreed that they will donate their databases so we could import those in the Wiki,” he said.
That mined information is a starting point for the community to begin annotating data and to “become guardians, hopefully, for the concept,” he said.
As it stands, the Wiki pages contain links to Knowlets with over 1 million biomedical concepts mined from such sources as the Unified Medical Language System, UniProtKB/Swiss-Prot, IntAct, and Gene Ontology. Users exploring the Wiki can filter these concept associations according to, for example, disease or organism, to show strong co-occurrence or more indirect associations.
The WikiProteins terminology has been mapped to concept identifiers in the Wiki-based terminology system OmegaWiki. WikiProteins and OmegaWiki are driven by a relational database that is linked to the Knowlets through on-the-fly indexing of all Wiki pages. That indexing is performed by Peregrine, an indexer designed to recognize concepts in Wiki pages, which is coupled to a terminology system derived from OmegaWiki.
Each biomedical concept has its own page that includes up-to-date annotations. Registered users can become annotators and edit records. WikiProteins shows the new record alongside the original one that was mined from the authoritative databases. Professional annotators at their respective databases can choose to incorporate some of the new community-entered information into their databases.
Past community annotation ideas have not worked, Mons said, because scientists lacked incentive. WikiProteins contributors must register with their full name and e-mail address. “The extra power is that the Wiki is directly connected to the knowledgebase so once you start making changes, it is automatically recognized,” he said. An annotator is credited for that change and other users receive alerts about it.
WikiProteins “is almost an ego-system rather than an ecosystem where we have said if you contribute it is going to be good for science in general, which is a good argument but not good enough for individuals,” Mons said. “If you do it, you get recognition for every single contribution you make and you will get news alerts on your field.
“If there is one thing that people are afraid of in science is missing a … change,” he added.
Could errors sneak into the system? “Posting something on WikiProteins that is scientifically not sound is scientific suicide,” said Mons. “Any change will be marked with your name and a time stamp, so when people mouse over the change they will see you.”
“We want to be a technology provider that changes the way the world handles information overload,” he added. Mons stressed that this desire does not make Knewco a content-hosting service, which would create new silos for content and “new problems rather than solve them.” Rather, Knewco stores only the Knowlets, and not the journal articles from which they were mined, which results in “basically very small files” that show the relationships concepts have in the biomedical space, he said.
It Takes a Village
Many scientific organizations lack the manpower and the funding to curate and annotate a complete corpus of data and literature, said Mons, adding that such efforts are “always incomplete and it’s lagging behind.”
PubMed contains around 1.5 million authors, and Mons said he and the other WikiProteins founders thought, “Why not involve the million minds … to bring the information up to speed?”
Their Genome Biology article invites members of the biomedical community to annotate “minimally one Knowlet in which they are an expert.” In their annotations, they might, for example, include sentences from scientific journals along with references, information from the closed-access world that is not accessible to many text-mining tools.
Besides the over 1 million Knowlets that have been created to date, WikiProteins also contains concept profiles of more than one million authors mined from PubMed. Traditional text mining has problems with names, for example differentiating studies by different people with the same last name. “In China the problem is the biggest … so if you do traditional concept text mining, it is a nightmare,” said Mons.
A disambiguation algorithm helped the WikiProteins organizers collect papers for each author and separate publication profiles. “It is highly unlikely that two of the same Wangs have exactly the same publication profile,” said Mons. Ultimately, he said, this system could lead to a Wiki-based author ID in which each author has his or her own Wiki page.
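The intuition behind that disambiguation step can be sketched in a few lines. The concept sets, similarity measure, and threshold below are invented for illustration; the actual WikiProteins algorithm is not described in detail in the article. The core idea is that two same-surname author records are merged only if the concept profiles of their papers overlap strongly:

```python
# Hypothetical illustration of publication-profile disambiguation:
# merge two "Wang" records only when their mined concept profiles match.

def jaccard(a: set, b: set) -> float:
    """Overlap between two concept profiles (sets of mined concepts)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def same_author(profile_a: set, profile_b: set, threshold: float = 0.3) -> bool:
    return jaccard(profile_a, profile_b) >= threshold

wang_1 = {"CLB2", "cell cycle", "S. cerevisiae", "cyclin"}
wang_2 = {"cyclin", "cell cycle", "CLB2", "mitosis"}
wang_3 = {"dental primer", "adhesion", "enamel"}

print(same_author(wang_1, wang_2))  # True  -- likely the same Wang
print(same_author(wang_1, wang_3))  # False -- likely a different Wang
```

As Mons notes, two distinct authors sharing a surname are unlikely to share an entire publication profile, which is what makes this overlap test workable at scale.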
Language presents particular challenges for database searches and this Wiki. Mons said that unpublished studies by the research team of which he is a part revealed that roughly 40 percent of all gene names have homonymy problems.
“For each and every concept that is homonymous, we created a context-concept cloud per concept,” said Mons. By associating a set of documents with each homonymous term, it is easier to determine the context of a given term. The same technique is applied to synonyms, which are plentiful among gene and protein names.
For example, the authors explained in their study that the yeast protein CLB2 has many synonyms. Spelled Clb2, a query leads to 25 entries in UniProtKB/Swiss-Prot, only one of which is for CLB2. The Clb2 spelling is not listed in the corresponding Swiss-Prot record, but Clb2 is a synonym of a C. elegans gene called emb-9. In the Saccharomyces Genome Database, Clb2 is not listed as a synonym of CLB2, but a query with that spelling will lead to the correct gene. In PubMed, Clb2 also retrieves papers on dental self-etching primers such as Clearfil Liner Bond 2.
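Using the Clb2 example, the context-concept-cloud idea can be sketched as follows. The clouds and matching rule here are invented for illustration; the article does not specify how WikiProteins scores a context against a cloud. A new mention of an ambiguous term is assigned to the sense whose cloud best matches the surrounding text:

```python
# Hypothetical sketch of homonym resolution with context-concept clouds.
# Each sense of an ambiguous term keeps a cloud of concepts drawn from
# its associated documents; a mention is mapped to the best-matching sense.

CONTEXT_CLOUDS = {
    "CLB2 (yeast cyclin)": {"mitosis", "cell cycle", "S. cerevisiae", "kinase"},
    "Clearfil Liner Bond 2": {"dentin", "adhesive", "enamel", "bonding"},
}

def resolve(term_context: set) -> str:
    """Pick the sense whose cloud shares the most concepts with the context."""
    return max(CONTEXT_CLOUDS,
               key=lambda sense: len(CONTEXT_CLOUDS[sense] & term_context))

print(resolve({"cell cycle", "kinase", "budding yeast"}))
# -> "CLB2 (yeast cyclin)"
```

The same machinery works in reverse for synonyms: different surface forms whose contexts match the same cloud can be mapped to a single concept.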
In WikiProteins, Mons said, the Knowlet algorithms address the ambiguity challenge presented by homonyms and synonyms and minimize the havoc they can create. For Clb2, the Wiki points out the synonym issue and states that the query has been mapped to several Knowlets.
Mons believes this computational method avoids some problems of traditional text-mining or concept-mining, which usually “work in a silo” using a particular indexer. “You can only compare notes with a resource that has been indexed with the same indexer,” he said. “Otherwise you are comparing apples and oranges.”
He views concept mapping and word mapping technologies as “sophisticated and highly efficient” ways to analyze a particular corpus. “What they are not good at is going global, bringing in all kinds of repositories, all kinds of operating platforms, all kind of database systems together that they are navigable from one angle,” Mons said.
Projects like WikiProteins are crucial because they consider “all the interests of stakeholders in a portal,” said Vinícius Kern, research director at the Florianópolis, Brazil-based non-profit Stela Institute. The center, which focuses on information and knowledge engineering technologies such as indexing and text mining, is a WikiProteins and WikiProfessional partner.
For instance, he said, government changes in Latin America can disrupt initiatives that are underway or discourage user participation. “We saw in the [WikiProfessional] initiative something that cannot be broken by political changes,” Kern said.
Kern and colleagues have developed user-driven platforms, one of which enables researchers to write and edit their CVs, research summaries, and publications lists. Called CV Lattes, it is part of a multi-country platform in Latin America called ScienTI slated to be integrated into WikiProfessional, expanding the system, as the research team pointed out in their article, “with authors who may not be easily found in PubMed.”
Another platform family to be included is SciELO, a curated Latin American electronic library of Spanish- and Portuguese-language journals. WikiProfessional lets researchers “participate in building knowledge in his or her area,” Kern said, so this computational project could expand the reach of Latin American scientists and enhance research of importance to their countries, for example in the study of the mechanisms of tropical diseases.
Right now WikiProfessional is in English, but as Kern explained, the team is exploring computational methods, for example using Universal Networking Language, to provide automatic translations to further internationalize it.
Potentially this project offers users the possibility to browse the “concept space for interesting relationships” and empower knowledge discovery, the scientists wrote. They run through one example in which the concept space around enzyme inhibitors is explored. It yields a connection to a cancer drug that shows potential as an anti-malarial drug.
“What we hope will happen is that we gave an example of WikiProteins and we may offer one or two more to get the ball rolling but then other people, communities, should pick up the ball and do it themselves,” said Mons. For example, the Duchenne Muscular Dystrophy community has started using WikiProfessional for their own needs.
Knewco, founded in January 2006, is now ramping up its commercial arm. Earlier this year, Knewco hired a CEO and COO. The company is privately funded by, among others, technology investors Bill Melton, with experience at VeriFone, AOL, and CyberCash; and Alfred Berkeley III, CEO of Pipeline Trading, chairman of Kintera Corp., and former NASDAQ president.
The firm has both a non-profit and a commercial business model. “Basically the private world will generate the revenues to keep supporting this,” Mons said of Knewco’s future now that WikiProteins is launched. The firm does not yet have commercial customers but is “talking to prospects,” he said. It also plans to offer premium services to scientists.
The system, in beta testing, is available here.