
Wiki-Based Annotation Takes Off in 2008, But Some Say Data-Mining Tools are Lacking

During 2008, wiki-based collaborations gained a foothold in molecular biology with the launch of a number of wiki-based annotation projects such as WikiGenes, WikiProteins, Gene Wiki, and WikiPathways.
In the most recent example of this trend, the journal RNA Biology has decided to mandate Wikipedia entries from authors submitting papers to a new section on RNA families — a requirement that is, “as far as we are aware, a first for any scientific publication,” according to an editorial in the journal by Paul Gardner, the editor of the new RNA Families section, and Alex Bateman, the head of the Rfam database of RNA alignments and secondary structures.
The primary reason for requiring Wikipedia entries, Gardner and Bateman said, is that these pages are usually among the top-ranked hits in Google searches with molecular biology keywords. Because their goal is to “ensure that the RNA-relevant information in Wikipedia is both reliable and current,” they believe that the time experts spend on these entries will help “improve the record.” To that end, they said, “the Wikipedia update will be reviewed alongside the submitted article.”
The creation of Wikipedia entries is also likely to benefit the Rfam database, they said, because the resource currently draws annotations from Wikipedia, so any Wikipedia articles written for the journal “can be used directly by the database as well as the community.”
But even as wiki-based annotation gains in popularity, some in the bioinformatics community are questioning the value of this approach because there are very few tools that enable downstream data-mining of Wikipedia pages.
For example, Masanori Arita, a computational biologist at the University of Tokyo, published a paper in December in Briefings in Bioinformatics that called wiki-based web sites “overrated” as the solution for large-scale management and for resolving data inconsistency in bioinformatics.
The challenge, he wrote, is that wiki pages are “independent of each other,” so that changes made on any one page are not replicated on pages with related information.
In an e-mail interview with BioInform, Arita said that as a frequent user of Wikipedia he finds that “the idea of community annotation is great” and that applications such as Gene Wiki, WikiGenes, WikiProteins, and WikiPathways “may achieve high-level annotations in every single page.” The problem, he said, is for “concepts that span multiple pages.”
This challenge stems from the lack of page dependency inherent in Wikipedia and its underlying MediaWiki software. “To keep the consistency of information, when an original page is updated, all its proper copies in other pages must also be updated,” he said. Currently, authors must duplicate information by cutting and pasting from one page to another, Arita said.
A Growing RNA Family
The challenge of manual curation is one of the reasons behind RNA Biology’s new guidelines. Gardner and Bateman note in their editorial that Rfam’s alignments and structures are derived from the literature, but “due to a lack of standards for publishing RNA alignments and structures, often the curators resort to manually typing in the sequence and structure from published figures.”
This approach, they said, “is not going to scale well in an era of comparative genomics, deep sequencing of RNAs, and RNA gene prediction tools,” so they envision the deposition of these alignments and annotations in the journal’s RNA Families track as a means of building a standardized archive of this information.
According to the journal’s guidelines, submissions to the new RNA Families section are to focus on either “substantial updates of existing RNA families” or descriptions of novel ones. Authors are required to submit material to the journal and to Wikipedia. Landes Bioscience, the publisher of the journal, did not respond to queries by BioInform about this section or its new policy before deadline.

Gardner and Bateman said in their editorial that the journal’s new track is a forum for short publications that detail the structure, function, and sequence conservation for RNA families. “There will be two extra requirements for publication in this track,” the scientists wrote. One of the requirements is deposition of an alignment and secondary structure in Stockholm format. The other is the “generation or update of a corresponding entry in the online encyclopedia Wikipedia.”
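For readers unfamiliar with Stockholm format, the following is a minimal sketch of what such a deposit might look like and how it could be read programmatically. The two sequences and the consensus structure are invented placeholders rather than a real RNA family, and parse_stockholm is a hypothetical helper that handles only simple, single-block files.

```python
# Toy Stockholm alignment: sequence names, residues, and structure are
# made-up placeholders for illustration only.
TOY_STOCKHOLM = """\
# STOCKHOLM 1.0
#=GF ID toy_family
seq1         GGGAAACCC
seq2         GGCAAAGCC
#=GC SS_cons <<<...>>>
//
"""


def parse_stockholm(text):
    """Parse a single-block Stockholm file into (sequences, ss_cons).

    Hypothetical helper: skips most annotation lines and does not
    handle interleaved (multi-block) alignments.
    """
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith("# STOCKHOLM"):
        raise ValueError("missing '# STOCKHOLM' header")

    sequences = {}
    ss_cons = None
    for line in lines[1:]:
        line = line.strip()
        if not line:
            continue
        if line == "//":  # end-of-alignment marker
            break
        if line.startswith("#=GC SS_cons"):
            # Per-column consensus secondary structure annotation.
            ss_cons = line.split()[2]
        elif line.startswith("#"):
            continue  # other annotation lines are skipped in this sketch
        else:
            name, aligned = line.split()
            sequences[name] = aligned
    return sequences, ss_cons


sequences, ss_cons = parse_stockholm(TOY_STOCKHOLM)
```

Because the alignment and the structure line share the same columns, a curator's tools can read the base pairing directly from the file instead of retyping it from a published figure.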
According to the journal’s guidelines, the submission must include “at least one stub article” for Wikipedia centered around the RNA in question to be added either at the author’s user space on Wikipedia, which the publisher describes as the “preferred route,” or to the main Wikipedia space.
RNA Biology offers open access to its articles one year after publication, while authors who want open access upon publication can pay a fee. For the RNA Families track, however, the articles will be published as open-access texts both online and in print “at least in the first years of the track,” although articles with color figures or more than four journal pages will incur a fee. The Wikipedia entries are to be peer-reviewed along with the manuscript, the guidelines state.
The first RNA Biology article to be published in this fashion is “A survey of nematode SmY RNAs” by Peter Stadler of the University of Leipzig and the Santa Fe Institute, Sean Eddy of the Howard Hughes Medical Institute’s Janelia Farm Research campus, and colleagues at the University of Vienna.
The article can be found here and the Wikipedia entry here.
Wickedly Wiki
Andrew Su, Senior Research Investigator in the Computational Biology Group of the Genomics Institute of the Novartis Research Foundation, who spearheaded the Gene Wiki project, told BioInform that RNA Biology’s venture is “a great experiment worth trying.”
It is a “nice, discrete well-wrapped pilot project” in which the subject matter of the RNA wiki matches the subject matter of the journal, he said.
Scientists might create Wikipedia entries “to varying degrees of enthusiasm” and it may be “tough to have very well-defined criteria [as to what comprises a qualifying submission] but it at least requires the authors to make an effort,” he said. Overall Su does not fear that the added requirement will discourage scientists from submitting papers to the journal. “I don’t think it will be that much of a big deal,” he said.
“I'm most excited about the prospect of ‘community intelligence,’” he said, which underlies projects such as Gene Wiki, which was developed to encourage scientists to contribute information about specific genes to Wikipedia [BioInform 07-11-08], and other wiki-based collaborations in molecular biology.
Once in Wikipedia, data are visible and accessible. “Wikipedia provides that framework for the community to continually edit and summarize and improve these articles, and that's still relatively unique in biology,” he said.
But the University of Tokyo’s Arita said that simply making this biological information available online is not enough. “In a sense, a wiki can be a good tool for data collection, nothing more,” he told BioInform.
“The essence of scientific activity is to organize and extract knowledge out of collected data.” Accumulating data itself is “not science,” Arita said. “We need a tool for knowledge management.”
In his Briefings in Bioinformatics paper, Arita noted that wiki-based websites are a poor substitute for structured databases because they lack a mechanism to check data consistency. “As long as wiki is used as a weblog or encyclopedia, this independency is more than natural: authors take the responsibility for the contents, and they should not be changed automatically by other contents,” he wrote. “For a database system, on the other hand, the ideal design is the opposite: original, consistent contents are managed in the background, and its update affects all views that users create through queries.”
Arita wrote that the “inherent lack of measure for checking consistency may be fatal for forward-thinking biologists who use wiki for the community-driven data management; however, this drawback seems often unnoticed.”
As an example, he told BioInform, “Suppose a gene name is updated. I need to search all its occurrences in all pages and update them one by one,” but if there were a mechanism to propagate an update, it would be a lot easier.
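The contrast Arita draws can be illustrated with a minimal sketch, which is not his implementation. It assumes a hypothetical central table of gene records and pages that refer to a record by ID instead of pasting the name in; the IDs, names, and the {{gene:ID}} placeholder syntax are all invented for illustration.

```python
import re

# Central records: the single authoritative copy of each fact.
# Gene IDs and names here are hypothetical placeholders.
gene_records = {
    "gene:001": {"symbol": "OLDNAME1"},
}

# Page text refers to records by ID instead of pasting the name in.
pages = {
    "Page A": "{{gene:001}} is discussed here.",
    "Page B": "See also {{gene:001}} on a related page.",
}

PLACEHOLDER = re.compile(r"\{\{(gene:\d+)\}\}")


def render(page_text, records):
    """Fill each {{gene:ID}} placeholder from the central records."""
    return PLACEHOLDER.sub(lambda m: records[m.group(1)]["symbol"], page_text)


before = {title: render(text, gene_records) for title, text in pages.items()}

# One update to the central record...
gene_records["gene:001"]["symbol"] = "NEWNAME1"

# ...shows up on every page that references it, with no page edited.
after = {title: render(text, gene_records) for title, text in pages.items()}
```

In this design the pages behave like views over the records, which is the database behavior Arita describes; in a conventional wiki, each page would hold its own pasted copy of the name and would have to be found and edited separately.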
Arita said that he appreciates efforts to construct data tables in Wikipedia, such as an entry that lists cities by population, but said “their maintenance would be extremely hard. Data in such pages are usually inconsistent with those in other pages. We need many such charts in science. How can we manage them on wikis?”
His suggestion is to create a hybrid structure, which is part wiki and part database. “In fact, major wikis are built on relational database systems. In this perspective, wiki is only a sandbox inside a database,” he said.
The idea would combine the strengths of both since “databases and wikis serve two different purposes. One is for structured, quantitative, or well-defined ideas, the other [for] unstructured, qualitative, or indeterminate ideas.” Arita said that he thinks that “half-structured, half-free formatted design is useful especially for biology research.”
One way to achieve this hybrid design, Arita suggested, is to take the path being paved by Semantic MediaWiki, a semantic extension of the MediaWiki platform that organizes content, tags it, and allows users to browse and share.
Another approach that Arita said is a more “straightforward” translation from the relational model to wiki pages involves users embedding scripts into web pages. This method would achieve “more powerful search function, efficient page design, and most of all, we can propagate updates.” He used this approach in a website run by his group and researchers at two other Japanese universities for research results relating to metabolism.
“My wish is to organize a project team to design a next-generation wiki or cyberinfrastructure that can manage data integrity while maintaining good parts of the current wikis,” he said.
Arita told BioInform that he has begun discussions with researchers involved in the iPlant Collaborative and “implemented a small prototype of my idea.”
GNF’s Su said that Arita “raises some interesting points” in that the arguments he lays out are in line with those for the Semantic MediaWiki.
“Semantic MediaWiki would [be] fantastic in terms of its applicability to genetics and biology,” Su said.
Semantic MediaWiki, however, “has yet to find its really great application,” Su said. Although it is finding users, it is “nothing on the scale of Wikipedia,” which evolved from an idea about an application in which many users can edit a text with the software built “to satisfy that need,” Su said.
Su acknowledged that after information has been entered into Wikipedia, pulling data out in a structured way and mining it is a significant hurdle. “Now all the data miners are saying Wikipedia is great, but it doesn’t allow me to do downstream data mining.”
Researchers are cognizant of some of these shortfalls, Su said, but when scientists choose to distance their project from Wikipedia, they risk losing visibility. “A one-off wiki solution can easily languish without a user base, which is one reason why we are going with Wikipedia,” he said, referring to Gene Wiki.
“The semantic part really bothers people who are trying to get data out of Wikipedia or out of the Gene Wiki to do downstream data mining,” he said. “But everything to this point has been about encouraging people to get data into the Gene Wiki and into Wikipedia, and that is where you don’t really care about Semantic MediaWiki.”
In Gene Wiki, for example, a change on the page dedicated to the gene utrophin will not propagate to pages about other genes that are associated with the cytoskeleton, Su said. Even a hyperlink lacks context about the relationships between genes, so that searching for all genes connected to the cytoskeleton “is very difficult right now.”
“You don’t know, for example, does utrophin promote cytoskeletal development or does it promote cytoskeletal destruction, or is it involved in disease processes related to the cytoskeleton, or is it just a link to another concept?” Su said. With Semantic MediaWiki, users or Wiki programmers such as his colleagues in the Gene Wiki project would be able to launch those types of queries, he added.
Semantic MediaWiki tools include the SPARQL query language and protocol; the Resource Description Framework (RDF), which describes data; and the Web Ontology Language (OWL), which lends the RDF terms meaning.
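The difference between a bare hyperlink and a typed relationship can be sketched in a few lines. This example uses plain Python tuples rather than actual RDF, SPARQL, or Semantic MediaWiki syntax, and the gene names and relationship labels are hypothetical placeholders that make no biological claim.

```python
# Each fact is a (subject, predicate, object) triple, the basic shape of
# an RDF statement. Gene names and predicates are hypothetical placeholders.
triples = [
    ("GeneA", "promotes_assembly_of", "cytoskeleton"),
    ("GeneB", "promotes_breakdown_of", "cytoskeleton"),
    ("GeneC", "mentioned_with", "cytoskeleton"),
    ("GeneA", "mentioned_with", "GeneD"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the given fields; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "Which genes are connected to the cytoskeleton, and how?"
linked = match(triples, obj="cytoskeleton")

# "Which genes promote cytoskeletal assembly?" A bare hyperlink cannot
# answer this, because it carries no predicate.
promoters = [
    s
    for (s, _, _) in match(
        triples, predicate="promotes_assembly_of", obj="cytoskeleton"
    )
]
```

Here the wildcard lookup plays the role that a query language would play over a real triple store: the relationship type is part of the data, so it can be filtered on.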
Ideally, Su said, Wikipedia could adopt Semantic MediaWiki technology, “but that is a longer row to hoe because of all sorts of technical and bureaucratic hurdles.”
Until now, wiki-based collaborative projects have focused on lowering the barriers and structural hurdles for participants to entice them to contribute data. “The Gene Wiki and Wikipedia are very focused on making it as easy as possible for people to contribute data, meaning they require little or no structure,” Su said. The more structure the data require, the less likely it is that users will be willing to adhere to that structure and contribute, he added.
Even the layout templates of his Gene Wiki pages, he acknowledged, offer a bit of structure for “what is fundamentally an unstructured platform.”
“It’s sort of fake structure, it’s a structure in terms of the layout but it’s not structuring the data and so we’re not allowing people to do downstream data-mining,” he said.
