NEW YORK (GenomeWeb) – The Broad Institute, Harvard Business School (HBS), and the Crowd Innovation Lab at Harvard University are running a series of prize-based computational challenges aimed at identifying better algorithms for tasks such as sequence alignment and assembly, and at improving the accuracy of inferred gene expression.
The idea for the so-called Crowdsourcing Advances in Precision Medicine challenges, which are run on infrastructure from TopCoder, grew out of two separate efforts at HBS and the Broad. For its part, the Broad was interested in benchmarking new computational tools on FireCloud, its internally developed genomic analysis system, as part of its efforts to develop infrastructure for the National Cancer Institute's Cancer Cloud Pilots initiative, Gad Getz, co-principal investigator on the FireCloud contract and one of the organizers of the bioinformatics challenges, told GenomeWeb.
Getz is also director of the Broad's Cancer Genome Computational Analysis group. He and Anthony Philippakis, the Broad's chief data officer and co-PI on the FireCloud cancer pilot, met with Karim Lakhani, an HBS associate professor of business administration and PI of the Crowd Innovation Lab, to discuss the possibility of running bioinformatics challenges on FireCloud with funds from a $20 million endowment pledged last year by the Kraft Family Foundation.
The so-called Kraft Endowment for Advancing Precision Medicine is part of a larger $6.5 billion capital campaign by Harvard University. The $20 million gift is to support research and other activities that enable precision medicine. Planned activities for the funds include the Precision Trials challenge, an HBS-led effort to engage the biomedical community to find ways to reduce the costs and time needed for clinical trials; and the bioinformatics challenges being organized in collaboration with the Broad.
The idea was to craft a series of challenges that would appeal to computational scientists and software developers from other domains who do not have experience with biomedical data, Getz said. This would provide an avenue for the much broader community of computer scientists, machine learning experts, data scientists, and others who participate in TopCoder challenges to bring their expertise to bear on computational biology problems. "This is a pilot for us," Getz said. "[It tests our] ability to explain a problem in a way that would be appealing to a computer scientist or a machine learning [expert] that has no prior experience with biology or with biological problems, and [to] describe it in a way that they could write code."
For the challenges, the Broad researchers are starting with simple but well-defined questions and then working up to more complex ones. The winners of each challenge share a $20,000 prize purse. The first challenge, which launched in April and ran for 21 days, has already been completed. It is the first in a set of three successive challenges of increasing difficulty focused on DNA sequence alignment and assembly. The so-called DNA sequencing 1 (DNAS1) challenge asked participants to submit algorithms that could align simulated DNA sequences to their appropriate locations in the reference DNA. In this case, the sequences had only minor differences from the reference.
The task for participants was to submit algorithms that could align genomic sequences to a reference quickly and accurately. They were provided with a reference genome and three datasets of increasing size: a small dataset of 10,000 simulated read pairs from chromosome 20; a medium-sized dataset of one million simulated read pairs from chromosomes 1, 11, and 20; and a full simulated genome with 10 million read pairs.
For each alignment, participants were asked to provide a score that quantified how confident their alignment algorithms were that the reads were mapped to the right location in the reference. They also had to provide written reports that described the algorithms that they used, any additional methods they considered, and any local programs they used en route to getting their results.
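The article does not describe the winning algorithms or the exact submission format, but the shape of the task — map each read to a position in the reference and attach a confidence value — can be illustrated with a deliberately naive Python sketch. The brute-force scan and the gap-based confidence heuristic below are assumptions made purely for illustration; real aligners index the reference and handle paired, reverse-complemented reads at far larger scale.

```python
# Deliberately naive sketch: scan every position in the reference, count
# mismatches, and report a confidence based on how much better the best hit
# is than the runner-up. Not drawn from any challenge submission.

def align_read(read, reference):
    """Return (best_position, confidence) for one read against a reference string."""
    best_pos, best_mm, second_mm = -1, len(read) + 1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        mismatches = sum(1 for a, b in zip(read, reference[pos:pos + len(read)]) if a != b)
        if mismatches < best_mm:
            best_pos, second_mm, best_mm = pos, best_mm, mismatches
        elif mismatches < second_mm:
            second_mm = mismatches
    # Confidence: gap between best and second-best hit; 0 means the read maps
    # equally well to more than one location (e.g. a repeat).
    return best_pos, second_mm - best_mm

if __name__ == "__main__":
    ref = "ACGTACGTTAGCACGTTAGC"
    print(align_read("ACGTTAGC", ref))  # (4, 0): the 8-mer occurs twice, so confidence is 0
```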
More than 1,100 people registered for the first challenge, but only 96 competed in the round. Between them, those participants contributed 410 alignment submissions, or about 4.2 submissions per competitor. Submissions were scored on the accuracy of the alignments and the speed of the algorithms. Participants had to achieve a minimum score of 1 million points for their alignments to be considered acceptable. The scores of the top five contestants, the winners of this round, were all over 2 million.
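The article does not disclose how accuracy and runtime were combined into the point total, so the function below is purely hypothetical; it only illustrates the general shape of a score that rewards accurate alignments and discounts slow solutions, with the 1 million points mentioned above as the bar to clear.

```python
# Hypothetical only: not the actual DNAS1 scoring formula, which was not
# published. A toy combination of accuracy and speed for illustration.

def combined_score(correct_alignments, total_reads, runtime_sec, time_limit_sec=7200.0):
    accuracy = correct_alignments / total_reads                  # fraction of reads placed correctly
    speed_factor = max(0.0, 1.0 - runtime_sec / time_limit_sec)  # 1.0 = instant, 0.0 = at the limit
    return 2_000_000 * accuracy * (0.5 + 0.5 * speed_factor)

print(combined_score(9_800_000, 10_000_000, runtime_sec=1800))  # ~1.7 million "points"
```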
The first-place winner received $8,000, the second-place contestant received $4,000, and the third-place winner received $2,500, while the fourth- and fifth-place contestants won $1,500 and $1,000, respectively. There were also consolation prizes for reaching the million-point mark on the smallest test dataset: the developer of the first solution to hit that threshold received $2,000, and the developer of the second received $1,000.
The second challenge in the DNA sequencing series, which will launch this week on May 25, will build on the first one and will be harder, according to the organizers. In DNAS2, participants will have to align sequences that are significantly different from the reference. They will have to align those sequences correctly and also properly classify any sequences that do not align.
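As a rough illustration of that added classification requirement, one simple approach is to reject any best hit whose mismatch count is too high and call that read unaligned. The 20 percent threshold below is an assumption, not a challenge rule, and the sketch reuses the align_read helper from the earlier example.

```python
# Illustrative only: flag reads as "unaligned" when even the best hit has too
# many mismatches. Threshold and return format are assumptions.

def classify_read(read, reference, max_mismatch_frac=0.2):
    pos, confidence = align_read(read, reference)
    window = reference[pos:pos + len(read)]
    mismatches = sum(1 for a, b in zip(read, window) if a != b)
    if mismatches > max_mismatch_frac * len(read):
        return "unaligned", None
    return "aligned", pos
```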
A third challenge that is expected to start on June 10 focuses on gene expression. It uses information from the Connectivity Map (CMAP), a collection of genome-wide transcriptional expression data gleaned from cultured human cells that were treated with bioactive small molecules. The data is used to assess functional connections between drugs, genes, and diseases in order to help researchers find new treatments for disease. For the challenge, called CMAP1, participants will be asked to improve the accuracy of gene expression values inferred from the data while minimizing the number of gene expression measurements.
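The article does not specify how participants are expected to infer the missing values, but CMAP-style inference is often framed as predicting the rest of the transcriptome from a small set of measured "landmark" genes. The sketch below uses synthetic data and a plain least-squares model purely to illustrate that framing; the dimensions, the linear model, and the variable names are all assumptions, not the challenge's data or required method.

```python
import numpy as np

# Synthetic illustration of inferring unmeasured genes from a small set of
# measured ones: fit a linear map on training profiles, then predict the
# remaining genes for a new sample from its measured genes alone.

rng = np.random.default_rng(0)
n_train, n_landmark, n_target = 500, 100, 1_000

landmark_train = rng.normal(size=(n_train, n_landmark))          # measured genes
true_weights = rng.normal(size=(n_landmark, n_target))
target_train = landmark_train @ true_weights + 0.1 * rng.normal(size=(n_train, n_target))

# Fit one least-squares model mapping landmark expression -> all other genes.
weights, *_ = np.linalg.lstsq(landmark_train, target_train, rcond=None)

# Infer the unmeasured genes for a new sample from its landmark measurements alone.
landmark_new = rng.normal(size=(1, n_landmark))
inferred = landmark_new @ weights
print(inferred.shape)  # (1, 1000) inferred expression values
```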
A fourth challenge, and the third in the DNA sequencing series, DNAS3, focuses on assembly. Participants will receive reads from a high-coverage genome that is significantly different from the reference, and the task will be to reconstruct the genome from those pieces. The start date for that challenge has yet to be determined. Although the DNA challenges build on one another, participants do not need to have taken part in the preceding challenges in order to contribute their code; a new participant who missed DNAS1, for example, could still participate in DNAS2.
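For readers unfamiliar with assembly, the toy greedy assembler below shows the basic idea of stitching a sequence back together from overlapping fragments. It is illustrative only; it is not a competitive approach and is not drawn from the challenge materials.

```python
# Toy greedy assembler: repeatedly merge the pair of fragments with the
# longest suffix/prefix overlap. For illustration of the assembly task only.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b (at least min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap length, left index, right index)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:  # nothing overlaps any more; return the remaining fragments joined
            return "".join(reads)
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0]

print(greedy_assemble(["ACGTAC", "GTACGG", "ACGGTT"]))  # ACGTACGGTT
```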
For now, the researchers are running the challenges on TopCoder's platform, but they are working on connecting it with FireCloud so that in the future participants will sign up for challenges on TopCoder's system but actually run their algorithms on FireCloud itself, Getz said. They are currently putting in place the tools needed to automatically package participants' code once it is uploaded to the TopCoder platform and then deposit it in the portion of the FireCloud platform allotted for challenges. "The reason that we didn't do it for the first one is that building these connections [is] not trivial. It would take too much time," he explained. "So we said 'let's start to run these challenges ... and then in parallel work on the engineering efforts to connect the systems.'"
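That packaging pipeline is not described in the article, so the following is only a hypothetical sketch of the kind of glue code involved: bundle a participant's submission with minimal metadata so it could later be deposited in a FireCloud workspace. The bundle layout, metadata fields, and the deposit step are assumptions made for illustration.

```python
import json
import pathlib
import tarfile

# Hypothetical glue code: package one submission directory into an archive
# with a small metadata file. The actual Broad/TopCoder/FireCloud pipeline is
# not public; nothing here reflects a real API.

def package_submission(code_dir: str, challenge: str, handle: str) -> pathlib.Path:
    code_path = pathlib.Path(code_dir)
    metadata = {"challenge": challenge, "participant": handle}
    (code_path / "submission.json").write_text(json.dumps(metadata, indent=2))

    bundle = code_path.parent / (code_path.name + ".tar.gz")
    with tarfile.open(bundle, "w:gz") as tar:
        tar.add(code_path, arcname=f"{challenge}-{handle}")
    return bundle  # next step (not shown): deposit the bundle in a FireCloud workspace

# Example: package_submission("dnas2_entry", "DNAS2", "coder42")
```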
This is not the first time that Harvard researchers have tried to crowdsource algorithms for bioinformatics tasks through competition. In 2013, they published the results of a proof-of-concept study showing that an incentivized crowdsourcing model can solve algorithmic problems in biomedical research and, in some cases, provide solutions that are more accurate and faster than existing algorithms. That study centered on a two-week sequence annotation challenge that offered $6,000 worth of cash prizes for the best algorithms. In that time, the researchers received more than 600 code submissions from 122 participants, most of whom did not have life science backgrounds. Sixteen of those solutions were more accurate than existing methods such as the MegaBlast algorithm.
Before that, in 2012, researchers at the Harvard Clinical and Translational Science Center, or Harvard Catalyst, turned to crowdsourcing to solve algorithmic challenges in biomedical research. They launched a service on TopCoder through which researchers at the university could submit computational problems in areas such as genomics, proteomics, radiology, pathology, and epidemiology. Harvard Catalyst would then open the challenge to software developers worldwide, who would compete to solve the problem and win the prize money.
"This idea of using crowdsourcing for development of algorithms [is one] I like very much [because] it's an interesting and effective way to gauge the community worldwide," Getz said. Also, "[researchers] don't have to be local and they could still support solving computational problem," he added. The winner of the first challenge, for instance, is an assistant professor of computer science from the University of Warsaw in Poland.