Harvard Study Validates Crowdsourcing as Effective Model for Bioinformatics Development


Researchers led by Harvard University have conducted a proof-of-concept study that found that an incentivized crowdsourcing model can solve algorithmic problems in biomedical research, and in some cases provide solutions that are more accurate and faster than existing algorithms.

In a letter published this week in Nature Biotechnology, the researchers describe a project that they worked on with TopCoder, a firm that develops crowdsourcing infrastructure. The project, conducted in 2010, was a two-week genetic sequence annotation challenge related to immune repertoire profiling that offered $6,000 worth of cash prizes for the best algorithms.

By the time the two weeks were up, the researchers had received more than 600 code submissions from 122 participants, most of whom didn’t have life science backgrounds. About 16 of the submitted solutions proved more accurate and much faster than two other programs that had been used for the problem, including the National Center for Biotechnology Information's MegaBlast algorithm.

The researchers have released five of the best performing methods on the TopCoder website.

"This is a proof-of-concept demonstration that we can bring people together not only from different schools and different disciplines, but from entirely different economic sectors, to solve problems that are bigger than one person, department, or institution," according to Eva Guinan, director of the Harvard Catalyst Linkages Program, an associate professor of radiation oncology at Dana-Farber Cancer Institute, and a lead author on the paper.

It also shows that researchers in biomedicine can learn from their colleagues in social science fields such as economics and management, according to Kevin Boudreau, assistant professor of strategy and entrepreneurship at London Business School and a co-author on the Nature Biotech paper.

"We hope this provides a model of how social science and medical researchers can collaborate to solve real-world problems that matter to people," he said in a statement.

The Problem with Good Immunity

Harvard's Guinan told BioInform that her interest in crowdsourcing began with trying to find a way to use expertise from other disciplines to make academic biomedical research more efficient.

That interest led her to a symposium at Harvard Business School where her co-author Karim Lakhani, an associate professor in HBS' Technology and Operations Management unit, gave a talk about prize-based contests and crowdsourcing in the context of the for-profit sector.

"I was interested in … trying to take for-profit-based innovation techniques and applying them in biomedicine," she said. "We started talking about how we could do that, what would be appropriate venues, and how we could not just do it to get a result but also to study the process of doing it and what was going on."

That discussion ultimately birthed the 2010 challenge that she, Lakhani, and their colleagues have published in Nature Biotech. It's also spawned at least five other contests whose results will be published at a later date, she said.

According to the paper, challenge participants were asked to develop a method that could annotate "recombined and mutated" genetic sequences based on "which gene segments contribute[d] to each recombined gene."

This particular challenge came from the laboratory of Ramy Arnaout, an assistant professor of pathology at Beth Israel Deaconess Medical Center.

Arnaout, also a lead author on the paper, told BioInform that his lab had been searching for a better computational solution for his work in immunogenomics.

Specifically, his team wanted to accurately annotate genetic sequences that encode for T-cell receptors and for antibodies secreted by B-cells. Genes for antibodies and TCRs are not encoded as single genes, but are built from gene segments, so the actual DNA sequence can vary between cells.

This is "a problem when it comes to sequencing," Arnaout said, because even though scientists can sequence all the genes that encode for antibodies in a blood sample, in order to understand their function, they would need to know which specific genome segments contributed to their creation.

Arnaout tried to use MegaBlast for annotating these regions but found that it was too slow. Also, MegaBlast looks for matches between known and unknown sequences, so it was thrown off by slight differences in the ways the antibody genes were put together, he said.

Rather than trying to force MegaBlast to meet his needs, Arnaout developed his own annotation program, called Identifying Antibodies, or idAb, which produced results about 30 times faster than MegaBlast and was more accurate, he said.

But then Arnaout's team worried that the amount of data being generated would eventually be too much for idAb to handle, he said. Also, even though it was faster than MegaBlast, idAb still took longer than the researchers wanted.

That’s when he got in touch with Guinan and Lakhani and the partners began working on framing the challenge for TopCoder's community, Arnaout said.

Before they posted the challenge, the researchers had to rephrase the data in more generic terms so that it would be more palatable to non-life scientists. They did this by using strings and substrings to represent genomic sequences.
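To give a sense of what the generic framing can look like, here is a minimal Python sketch that treats the unknown sequence as a string and the candidate gene segments as shorter strings, then picks the segment that shares the most short substrings (k-mers) with the read. This is a toy illustration only: the segment names, the sequences, the `best_segment` function, and the k-mer overlap heuristic are all assumptions made for this example, not the contest's actual problem statement or any submitted method.

```python
def kmers(text, k):
    """Return the set of all length-k substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def best_segment(read, segments, k=4):
    """Return (name, shared_count) for the segment sharing the most k-mers with read.

    Ties go to whichever segment appears first in the dictionary.
    """
    read_kmers = kmers(read, k)
    scores = {
        name: len(read_kmers & kmers(seq, k))
        for name, seq in segments.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical toy data: these names and sequences are made up for illustration.
SEGMENTS = {
    "seg_A": "ACGTACGTTAGC",
    "seg_B": "TTGACCAGGTCA",
    "seg_C": "GGCATCGATCGA",
}

# A toy "recombined and mutated" read: seg_B with one letter changed,
# flanked by unrelated letters on both sides.
READ = "CCC" + "TTGACCTGGTCA" + "GGG"

name, shared = best_segment(READ, SEGMENTS)
print(name, shared)
```

Because the comparison counts overlapping pieces rather than requiring one exact match, a single changed letter lowers the score for the true segment without eliminating it. That is the kind of tolerance to slight differences the annotation problem calls for.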

They also compiled test data for solution generation and scoring, which included a public training dataset, a private dataset that contestants used to evaluate themselves, and a third dataset that was used to score the final submissions.

The final step, the researchers said, was to create a scoring metric for evaluating the submissions in terms of accuracy and speed.

After scoring all the submissions they received, the researchers found that 30 did better than MegaBlast, with the best methods providing results that were up to 1,000 times faster than the NCBI algorithm, the paper states.

The researchers also reported that 16 submissions outperformed idAb's accuracy score of 77 percent, 30 outperformed MegaBlast's accuracy score of 72 percent, and eight entries achieved an 80 percent accuracy score.

In terms of speed, the three fastest submissions completed their computations in 16 seconds, which is 178 times faster than idAb and nearly 1,000 times faster than MegaBlast, the researchers wrote.
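Those ratios imply rough runtimes for the two older programs on the same task. The short Python calculation below works them out. It assumes the reported speedups apply directly to the 16-second figure and rounds "nearly 1,000" up to exactly 1,000, so the MegaBlast number is an upper-end estimate rather than a reported measurement.

```python
FASTEST_SECONDS = 16         # reported runtime of the three fastest submissions
SPEEDUP_VS_IDAB = 178        # reported speedup over idAb
SPEEDUP_VS_MEGABLAST = 1000  # "nearly 1,000": rounded up, so an approximation

# Implied runtimes, assuming the ratios apply to the same workload.
idab_seconds = FASTEST_SECONDS * SPEEDUP_VS_IDAB
megablast_seconds = FASTEST_SECONDS * SPEEDUP_VS_MEGABLAST

print(f"idAb: about {idab_seconds / 60:.0f} minutes")
print(f"MegaBlast: about {megablast_seconds / 3600:.1f} hours")
```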

Arnaout has been using the top scoring methods in his own lab. He told BioInform that he's been looking at each solution's algorithmic components to see how they differ from idAb.

He's also trying to see if he can incorporate the codes into his lab's computational pipeline and use them for other kinds of analysis besides sequence annotations.

Arnaout said he's also using the methods as an "independent check" for idAb calls.

This isn't Pollyanna

In the last year, Guinan and her colleagues have run at least four other proof-of-concept crowdsourcing studies with TopCoder based on suggestions they received from colleagues at Harvard.

One contest called for an algorithm that could distinguish between HIV variants and noise in sequencing data; a second challenge focused on looking for new endpoints for clinical studies in glaucoma data from a new digital imaging technique; a third challenge focused on developing an algorithm for a de novo genomics problem, and the fourth contest sought an algorithm that could use epidemiology data to identify communities in Boston that have more medical needs than others.

She said the team will soon launch a fifth crowdsourcing challenge that will attempt to address an image resolution problem associated with colonoscopies.

By doing many challenges that cover a wide range of research questions, and then releasing the results publicly, Guinan and her colleagues hope that they can convince the biomedical research community that incentivized crowdsourcing is a good supplement to efforts like the Critical Assessment of Genome Interpretation.

In fact, prize-based contests offer some benefits that academic crowdsourcing challenges like CAGI don’t, according to the Nature Biotech paper.

One of these is access to the commercial TopCoder platform, which in this case is used by more than 400,000 computer scientists and software developers who can "immediately attack the problem" and "deliver submissions in the multiple hundreds," compared to tens for challenges like CAGI, the paper states.

Using this model, researchers won't have to invest in expensive equipment to run their analysis and can also save hours that would otherwise have been spent trying to solve problems that may not be central to the research in question, Guinan pointed out.

The authors also compare prize-based development contests to online games like Foldit, which also rely on crowdsourcing but don’t have the financial incentive attached.

They argue that contests like theirs can be tailored to work for practically any life science question that can be translated into generic computer science terms. Foldit and similar projects, on the other hand, were developed to answer very specific biological questions and therefore aren't as flexible, they said.

Guinan said she believes that an incentive-based model could work well for large academic consortia, for example. These groups could provide avenues for their scientists to submit research questions they can't answer locally and get results quickly for relatively little money.

"We are working on what the economic model for that would be," she added.

However, "one can't be Pollyannaish" about crowdsourcing contests, Guinan said.

It's important to realize that this isn't a "no cost" solution, she said. On the surface, $6,000 seems a rather trivial sum, but it doesn’t account for hidden development costs such as time. For their immunology challenge, the researchers estimate that contest participants spent more than 2,600 hours coding solutions, according to the Nature Biotech paper.
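To put those two figures side by side, the Python snippet below divides the prize pool by the estimated hours. It is a back-of-the-envelope illustration, not a calculation from the paper: it assumes the full $6,000 is spread evenly over exactly 2,600 hours, and because participants spent "more than" that, the true figure would be lower still.

```python
PRIZE_POOL_USD = 6_000  # total cash prizes offered in the challenge
REPORTED_HOURS = 2_600  # "more than 2,600", so the real rate is at most this

# Prize money per hour of participant effort, under the even-spread assumption.
usd_per_hour = PRIZE_POOL_USD / REPORTED_HOURS

print(f"at most about ${usd_per_hour:.2f} per hour of participant effort")
```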

It also takes a lot of work to reframe biomedical research questions into a form that non-scientists can understand, Guinan said.

Arnaout told BioInform that framing his question for the contest was "pretty straightforward" — they just had to represent the unknown sequences as strings and the gene segments as substrings — but he added that there are situations where it "would be a lot more challenging."

An example would be a research question that involves a biological concept that is "foreign" to computer science and not as easy to translate, he explained.

Finally, it's quite likely that there are some research questions that just can't be adapted to fit a general crowdsourcing mold, Guinan said.

Guinan also pointed out that crowdsourcing firms may have to revise their policies if they want to put down deeper roots in academia, particularly in biomedicine.

For example, they would have to be able to reassure researchers that their intellectual property will be safe if they share their data openly, she said.

TopCoder, for its part, has been "very detailed in the way that they have dealt with IP issues and the releases that solvers and people who post challenges sign and agree to," she said.
