This article has been updated to remove previously reported information about the funding source for the pilot service and to correct the earlier stated deadline for submissions.
The Harvard Clinical and Translational Science Center is investigating the value of crowdsourcing approaches to solve algorithmic challenges in biomedical research.
The center, which also goes by the name of Harvard Catalyst, has launched a pilot service through which researchers at the university can submit computational problems in areas such as genomics, proteomics, radiology, pathology, and epidemiology. Harvard Catalyst then issues each challenge to software developers worldwide, who compete to solve the problem and win prize money.
The program is run on crowdsourcing infrastructure provided by TopCoder.
Challenges are selected based on the suitability of the question; the availability of sufficient information to support the efforts of the solvers; the availability of a validation dataset; evidence that the algorithmic problem is important to the success of the submitter's research or progress in a particular field; the relationship of the problem to important classes of algorithmic problems; and the problem's congruence with the overarching mission of Harvard Catalyst.
Once a problem has been accepted, Harvard Catalyst and TopCoder work with the researcher to develop a problem statement, test data, and a scoring algorithm before the contest is launched on the crowdsourcing platform.
One such challenge, dubbed FitnessEstimator, commenced yesterday. The aim of the project is to use next-generation sequencing data to measure the abundance of specific DNA sequences at multiple time points and, from those measurements, to determine the fitness of those sequences in the presence of selective pressure.
As an example, the project abstract notes that such an approach might be used to measure how certain bacterial sequences become enriched or depleted in the presence of antibiotics. "Those cells with a survival advantage will replicate and increase the abundance of their sequences in the population, whereas those cells with a disadvantage will die off and their sequences will become relatively depleted from the population," the abstract states.
Harvard Catalyst is accepting submissions for a method of analyzing "count data that reflects library member abundance before and after application of a selective pressure at possibly multiple time points."
One of the contest creators, Uri Laserson, a graduate student in biomedical engineering and mathematics at the Harvard-MIT Division of Health Sciences and Technology, explained to BioInform that he and his colleagues hope to find the best algorithms for "most accurately determining which are the most enriched and most depleted sets" in libraries of DNA sequences.
He explained that the FitnessEstimator challenge grew out of experiments involving large libraries of gene sequences where each sequence has some kind of fitness that the researcher is interested in — for example, which genes bind best to particular substances.
Usually, he said, researchers characterize a DNA library using next-generation sequencing and then run experiments that apply some kind of selective pressure, which causes certain sequences to become enriched or depleted depending on their fitness relative to the rest of the population.
As the selective pressure is increased, researchers can resequence the library at different time points and compare the results to the previous population to discover which sequences are being enriched or depleted and which aren't, Laserson said.
A lab might have the resources to follow up on a few population members or perhaps on a few proteins, so "you have a certain kind of mathematical set up ... where you are essentially counting reads for populations at multiple time points" in order to determine which sequences are the most enriched or the most depleted at that point in time, he said.
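For illustration only, the read-counting setup Laserson describes can be sketched in a few lines of Python. This is not the contest's scoring method or any entrant's solution; it assumes a simple log2 fold change in relative abundance between two time points, and the function names, the pseudocount value, and the example read counts are all hypothetical.

```python
import math


def log2_enrichment(before, after, pseudocount=0.5):
    """Return {sequence_id: log2 fold change in relative abundance}.

    `before` and `after` map sequence IDs to read counts at two time
    points. The pseudocount (an assumed value) avoids division by zero
    for sequences that are missing from one of the samples.
    """
    ids = set(before) | set(after)
    n = len(ids)
    # Normalize by total reads so that sequencing depth does not
    # masquerade as enrichment or depletion.
    total_before = sum(before.values()) + pseudocount * n
    total_after = sum(after.values()) + pseudocount * n
    scores = {}
    for seq_id in ids:
        freq_before = (before.get(seq_id, 0) + pseudocount) / total_before
        freq_after = (after.get(seq_id, 0) + pseudocount) / total_after
        scores[seq_id] = math.log2(freq_after / freq_before)
    return scores


def rank_by_enrichment(before, after, pseudocount=0.5):
    """List sequence IDs from most enriched to most depleted."""
    scores = log2_enrichment(before, after, pseudocount)
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scores, key=lambda s: (-scores[s], s))


# Hypothetical read counts before and after selection.
before = {"seqA": 100, "seqB": 100, "seqC": 100}
after = {"seqA": 400, "seqB": 100, "seqC": 10}

scores = log2_enrichment(before, after)
ranking = rank_by_enrichment(before, after)
```

A naive ratio like this treats every count as exact and ignores sampling noise, which is especially large for low-abundance sequences. It also compares only two time points, whereas the challenge involves data from possibly multiple time points.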
Additionally, most sequence libraries are quite large and, as a result, previous methods developed by Laserson and his colleagues for ranking all of the population members have been "computationally expensive" or made assumptions that "we didn’t like," he said.
Contest participants are provided with 60 training cases whose solutions are already known, which they can use to test their algorithms locally. They are also provided with a network model that explains how the data in the test cases were generated and how the data relate to one another.
Contestants have until June 21 to complete their solutions and turn them in.
Entries will be scored based on the closeness of their results to the values computed for a separate set of test data, Laserson said.
As of press time, 50 registrants had signed up on the TopCoder website under the FitnessEstimator challenge page.
This contest is the third of several challenges accepted by Harvard Catalyst as part of its crowdsourcing investigative efforts.
Patrick Gaule, a research associate at Harvard University and one of the organizers of the crowdsourcing venture, declined to provide additional details about the program. He explained to BioInform in an e-mail that the development team intends to complete and evaluate an initial set of algorithmic challenges before speaking publicly about the service.