
Researchers Develop Software to Jump-Start ‘Long Tail’ Gene Annotation via Wikipedia

On the heels of the recently announced WikiProteins project — a community-based effort to curate proteins using wiki technology — another project is taking a similar approach to bolster the online encyclopedia Wikipedia with detailed information about genes.
In a proposal published last week in PLoS Biology, researchers from San Diego State University, the Genomics Institute of the Novartis Research Foundation, and the Washington University School of Medicine outlined their plan for the Gene Wiki project, which automatically generates incomplete “stub” articles on genes in Wikipedia with the goal of encouraging the research community to contribute more detailed annotation.
The project was developed separately from WikiProteins, which does not rely on Wikipedia but instead uses a technology called WikiProfessional developed by a firm called Knewco [BioInform 06-06-08]. Despite this difference, the two groups have begun talking about collaborating.
Andrew Su, a GNF researcher who led the Gene Wiki effort, told BioInform that Barend Mons, the co-founder of WikiProteins, is slated to visit his group soon to brainstorm about ways to collaborate. “We think there is great potential for synergy between the two,” he said.
In a conversation with BioInform, Mons, a biologist at the Erasmus Medical Centre of the University of Rotterdam, confirmed the groups are hashing out plans on how to work together, noting that Gene Wiki’s “approach is very complementary to what we do.”
The key to the Gene Wiki project is software that automatically generates Wikipedia pages from information in existing gene databases. To date, the researchers have used the software to seed 7,500 new page stubs, expanding on the 650 pages on genes that existed in Wikipedia prior to the project’s start.
“Our goal was to bring those 650 plus our 7,500 at least up to a common baseline level, so every gene page has the same level of gene annotation and a relatively complete view of gene annotation harvested from the public gene portals,” Su said.
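As a rough illustration of how stub seeding of this kind could work, the Python sketch below renders a minimal wikitext stub from a single gene record. It is not the Gene Wiki software: the record fields, the "Hypothetical gene box" template name, and the render_stub function are all assumptions made for the example, and the identifier and citation are placeholders.

```python
def render_stub(record):
    """Render a minimal wikitext stub from one gene record (hypothetical fields)."""
    # Infobox-style template call; the template and parameter names are made up.
    box_lines = ["{{Hypothetical gene box"]
    for key in ("symbol", "name", "entrez_id"):
        box_lines.append(f"| {key} = {record[key]}")
    box_lines.append("}}")

    # One-sentence lead so the page reads as an article, not only a data box.
    lead = f"'''{record['name']}''' ('''{record['symbol']}''') is a human gene."

    # Citations harvested from the source database, rendered as a bulleted list.
    refs = [f"* {title}" for title in record.get("references", [])]

    sections = ["\n".join(box_lines), lead]
    if refs:
        sections.append("==Further reading==\n" + "\n".join(refs))
    return "\n\n".join(sections) + "\n"


example = {
    "symbol": "ITK",
    "name": "IL2-inducible T-cell kinase",
    "entrez_id": "0000",  # placeholder, not a real identifier
    "references": ["Placeholder citation title"],
}
stub = render_stub(example)
print(stub)
```

Running a function like this over every record in a gene database would give each page the same baseline layout, which is the "common baseline level" Su describes; human editors would then add free text around the generated parts.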
This initial collection of pages in Gene Wiki does have a bias of sorts. “All these pages have a human slant, they do have some mouse links, but [there is] definitely a human slant,” said Su. The genes were also chosen based on their high citation rates in PubMed, he said.
The project builds on gene annotation work at GNF, said Su, but the Gene Wiki itself was “a little outside of our core mission,” which is why his team sought academic collaborators.
Wikipedia has a “huge critical mass, huge name recognition, good rankings by the search engines,” said Su, explaining the choice of Wikipedia as the model for this project.
The PLoS Biology paper wasn’t timed to a particular phase of the project, but was intended as an indication to the community that the group “created a tool that we think will be useful for harnessing community intelligence,” said Su.


Once the stub pages with their “base-level of utility” start drawing in a critical mass of readers, “we hope that some percentage of those readers will hopefully stay around and make an edit, fix a typo, add a line, add a reference, and in making that edit they make the utility of that page a little higher,” said Su. More editors means a higher utility factor, which, in turn, should attract more readers and then more editors in a “positive feedback loop,” he said.
While acknowledging the importance of model organism databases as “definitive sources” for gene annotation, Su and colleagues said in their paper that these resources require “a high degree of oversight by expert curators,” and proposed the Wiki-based approach as an alternative.
The proposed model for collaboratively synthesizing knowledge draws on a concept often called the “long tail” in reference to a term coined by author Chris Anderson to describe the benefits of a niche business that manages to garner many customers. “The idea [is] that you have a lot of people making small contributions,” said Su.
The authors pointed out that despite Wikipedia’s popularity for topics of general interest, its use for scholarly subjects has been “uneven.” Other researchers have discussed the benefits of wiki technology for gene annotation, among them Steven Salzberg of the University of Maryland Center for Bioinformatics and Computational Biology, who published a paper on the subject in Genome Biology last year. Even so, the biological community has been slow to adopt the approach.
“In principle, a comprehensive gene wiki could have naturally evolved out of the existing Wikipedia framework,” the authors note in the paper. “However, we hypothesized that growth could be greatly accelerated by systematic creation of gene page stubs, each of which would contain a basal level of gene annotation harvested from authoritative sources.”
Gene-Human Interaction
The template gene page Su and colleagues used to illustrate the project in their paper is the Wikipedia page for the ITK gene, IL2-inducible T-cell kinase — a page entry that did not exist prior to Gene Wiki.
The page indicates that ITK is a human gene and includes additional information such as an image of the protein structure, orthologs in human and mouse, and a list of recent publications that reference the gene. Clicking the Wikipedia “edit” command opens the source code of the page to allow an editor to make any changes.
The synonyms for genes do not have separate entries but re-direct users to the main page for a given gene. That feature makes Gene Wiki similar to gene portals such as Ensembl or Entrez Gene. But where this project differs from traditional gene portals is the two-way communication on the page, Su said.
“Genes aren’t islands in space; they are links to other concepts like diseases, like other aspects of human biology,” he said. “The fact that Wikipedia already has those related concepts to link from and to link to is a big advantage.” 
Synonym identifiers are listed on the right-hand side of the page below a diagram from the Protein Data Bank, structured gene annotation, links to primary databases on this gene, and citations for relevant publications.
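The synonym handling described above maps onto Wikipedia's standard redirect markup, in which a page whose text is "#REDIRECT [[Target]]" forwards readers to the target page. The Python sketch below shows one way redirect pages could be generated for a gene's synonyms. The function name, the page title, and the alias strings are hypothetical and are not taken from the Gene Wiki software.

```python
def redirect_pages(canonical_title, synonyms):
    """Map each synonym title to wikitext that redirects to the canonical page."""
    pages = {}
    for synonym in synonyms:
        # Skip empty strings and the canonical title (a page cannot redirect to itself).
        if not synonym or synonym == canonical_title:
            continue
        pages[synonym] = f"#REDIRECT [[{canonical_title}]]"
    return pages


# Hypothetical titles, used only for illustration.
pages = redirect_pages("ITK (gene)", ["ITK", "ALIAS1", "ITK (gene)"])
for title, text in pages.items():
    print(title, "->", text)
```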
Wondering About Accuracy
Trey Lathe, chief scientific officer at Open Helix, a service provider that trains researchers on genomics resources, told BioInform via e-mail that Gene Wiki pages could have value for the biomedical community, “but there needs to be a certain level of involvement to make them so.”
Even though some projects of this kind have been successful, he said, the Internet is “riddled with dead blogs, wikis, and discussion boards.” In his blog on the Open Helix website, Lathe wrote that a “big hurdle” for Gene Wiki is that Wikipedia has not been able to deliver the “level of completeness and accuracy” required by scientific research.
Lathe told BioInform that in his view, what is needed is a large enough “community of knowledgeable and active contributors to bring a wiki beyond that first difficult stage and that knowledgeable community has to be a large enough portion of the larger community to not be overwhelmed by superfluous or false edits.”
He added that he is not sure scientists will be motivated to edit Gene Wiki. “I don't see the carrot, or the stick.”
While he sees the value and the possibilities of the “long tail” as discussed in the PLoS Biology paper, Lathe said that Gene Wiki maintenance is “going to take a dedicated core of contributing researchers and a very long tail to get the coverage needed beyond the 'sexy' genes to make it useful,” adding that those are the very genes that are already covered in other databases and portals.
Another tricky area could be cases where there is disagreement about certain genes. “Wikipedia is not necessarily a great place to resolve controversy,” said Su. The project “relies on the fact [that] the community has an oversight function; there is no master curator.”
While a curator can assure that every piece of information is accurate in a model organism database or gene portal, “it is a huge bottleneck in terms of getting new data in,” which is exactly the problem that Gene Wiki is trying to address, said Su. He added that the idea of having a curator runs “a little counter” to the mission of the project.
“Tweaks on this model” are possible, he said, citing as an example wikis that require editors to use their real names. “It discourages vandalism,” he said, but “it also increases the barrier to entry a little bit.”
Su said that “it will be interesting to see how much contention there will be on these gene pages,” but stressed that “the goal of Wikipedia pages is not to say the fact is one way or the other.” Rather, the page could indicate there is debate in the field and outline the evidence for either side of an issue.
Lathe warned that pages about genes linked to diseases such as autism or dyslexia, or that touch upon areas such as sexual orientation or other controversial subjects, “could be overwhelmed with edits by large numbers of highly motivated but uninformed individuals” from the broader community.
“I'm not sure researchers would be able to keep those gene records accurate on something like Wikipedia,” he said.
Lathe said he is “sold” on the utility of genome browsers such as Ensembl and the University of California, Santa Cruz, Genome Browser, which offer a lot of contextual information for genes, and he expressed doubt about whether a Wikipedia article can come close to that level of information.
That said, he added that he does see the potential for Gene Wiki to add utility. Overall, he is “not convinced” the project will succeed, but will “remain optimistic,” he said.
“I'm taking a wait-and-see attitude and might do a few edits myself,” said Lathe.
Su noted that users will “need to have the appropriate mindset with reading a Gene Wiki entry,” and emphasized that the resource is not a replacement for highly curated databases, but rather a complement to those tools.
He acknowledged that Wikipedia’s anonymous editing could pose some challenges because users will not immediately see if an expert made a change or a high school student. “There is no oversight, no adult in the room saying, ‘You are allowed and you are not’ — speaking metaphorically, of course.”
However, while conceding that “at any given point, a given gene article may be incomplete, misleading, or just plain incorrect,” he said that the vast majority of content will likely be correct and up-to-date.
“The gene pages at Wikipedia are a good starting point to get an overview of a topic, but that doesn’t substitute for doing literature searching or consulting other resources,” Su said.
Wikis Working Together
Among the resources researchers might choose to consult is WikiProteins, a collaborative effort that allows scientists to jointly annotate proteins.
WikiProteins “is a little bit more difficult to use” than Gene Wiki, said Su, in part because of that project’s “desire to have more structured content.”
In WikiProteins, “every piece of data you contribute has to be tagged as to what type of data it is,” he said. While that process “greatly facilitates” downstream processing by bioinformaticians, it “puts a little bit of a barrier to entry for non-tech geeks to contribute data,” which might discourage contribution, said Su.
Gene Wiki, on the other hand, has “a very low barrier to entry” and thus should attract more contribution from the community, he said. 
“With WikiProteins there is quite a bit of structure around how you contribute,” he said. “It is not, ‘You click edit, you make an edit, you click save,’ [which] is essentially how it is on the Wikipedia pages.”
On the plus side for WikiProteins, said Su, is that its information has a higher likelihood of being integrated into traditional gene portals. “Making its way back to the portals is a process that is difficult in the Wikipedia model,” he said, adding that this feature might be one area where the two projects could find synergy.
WikiProteins co-founder Mons said that Gene Wiki bears a “certain risk” that people will start to violate some of Wikipedia’s “pillars” or values — for example one that prohibits the inclusion of original research.
Scientists who have no idea about these pillars may have the inclination to start adding original research data into a Wikipedia article, “which is not what Wikipedia wants,” Mons said. “They want to stay encyclopedic.”
WikiProteins “is at the other end of the spectrum” with a highly structured database and may establish itself as a resource where only the most earnest researchers will take the time to learn how to navigate the space, said Mons.
Given the amount of genomic and proteomic information that has to be described, “there is no way this can be done … in a centralized fashion,” said Mons, who noted that community annotation appears to be the future of the field, regardless of which wiki-based resource researchers adopt.
“Both of us were interested in not competing; the systems have different strengths and different weaknesses,” said Su.
Added Mons, “We have decided to sit down together and it looks like things are clicking very well.”
