NCBI Hackathons Aim to Develop Standard Computational Pipelines for Bioinformatics Tasks


NEW YORK (GenomeWeb) – The National Center for Biotechnology Information is organizing a three-day hackathon to be held early next year that will bring groups of investigators together to develop tools for analyzing next-generation sequencing data in the context of specific research questions.

The hackathon will take place Jan. 4-6 at the National Library of Medicine. In this iteration of the hackathon, teams will develop pipelines for automated epitope analysis from TCR clones; for verifying structural variant data using public information; and for enabling network coalescence of SNPs. They will also build a Bioconductor module to import data from SRA; a tool for deriving quaternary interactions of clinically relevant mutations from protein structure; and an interface for metadata sorting, according to the NCBI.

Teams will have three days to spend on tool development, and participants are expected to commit to the full three days. The NCBI provides datasets from its repositories for the hackathon, but participants also have the opportunity to use their own datasets or data from third-party databases. However, participants who choose to use their own data will be required to submit it to a public database within six months of the end of the hackathon, according to the NCBI.

Ben Busby, NCBI's genomics outreach coordinator and organizer of the hackathons, told GenomeWeb that the January event will be the third hackathon that the NCBI has organized. The first, held in January this year at the NCBI, featured four projects and yielded three working pipelines; a second was held seven months later in August.

So far, reactions to the hackathons have been very positive and participation levels, once applicants get onsite, have been high. "The thing that shocked me the most about running these hackathons is that people actually want more structure rather than less," he told GenomeWeb. "We try not to be insanely prescriptive … we come up with a fairly detailed outline about what we are thinking [but] in both hackathons, they've asked us for even more structure."

Also, previous participants are typically heavily invested in finishing projects that they start, according to Busby. "I think that's really the biggest thing," he said. "I've been involved in some hackathons where the working process is not so much a big deal, it's more about the process but in these hackathons, it's about getting something done that the community can use."

In the August hackathon, for example, six teams participated (a total of about 50 people, with five to seven people, on average, per team), and each group successfully developed a functional software product in the allotted three days. One of these, a cloud-based education tool that teaches users to map RNA sequences to a reference, has since been used in an online course taken by 650 people, Busby said. A second program, called BamDiff, from the same team of six that developed the RNA-seq tool, helps users differentiate BAM files.

All pipelines, scripts, software, and programs generated during hackathons are placed in a public GitHub repository set up for the purpose. In addition to use by the academic community, Busby noted that there is significant interest from industry in hackathon-developed tools. "We hear back from a lot more industry folks [saying] that they are using them and in a few cases making some pull requests and so on," he said. "[That's] one thing that's been surprising."

In addition to the aforementioned software packages, other software tools that have come out of previous hackathons include a sequence retrieval application which uses indexes to find matches to query protein sequences, and a pipeline that combines sequencing data, biological annotations, and high-throughput drug screening data to predict drug sensitivity. "We do a little bit of testing [so] we watch [participants] run things on datasets that are already in the [Short Read Archive] and we can see some results there," Busby said.

The organizers haven't yet set up the GitHub repositories for the January hackathon because they are waiting for the team lists to be finalized, Busby told GenomeWeb. "What we found with the August [hackathon] is we put a bunch of potential projects out there and then they change a little bit in the month before the hackathon," he explained. In the January hackathon, one of the previously announced categories, submitting RNAseq-derived variants to ClinVar, has been canceled. "Sometimes team leads drop out [or] they want to change the ideas a little bit, so it's not until a couple of weeks before the hackathon that we really have all of [the] topics totally finalized and that's when I go ahead and make the GitHub repos."

Another reason for the wait is to prevent participants from trying to get started on projects ahead of time, which defeats the purpose of hackathons. "One of the big values ... is you have people butting heads and that [helps] reduce the frequency of people going down blind alleys," Busby said.

Topics selected for hackathons are based on research questions that Busby and others hear frequently in their interactions with the genomics community, he told GenomeWeb. For example, "there's a whole community that's very excited about TCR epitope analysis ... and we'd really like to give people tools to look at epitopes from a whole bunch of cancer genomes and start thinking about aligning them with various TCRs," he said.

Previous hackathons were run on Amazon Web Services, and this third one will likely use the same infrastructure, but that does not mean that pipelines are restricted to that cloud vendor. For example, one of the projects in the last hackathon focused on building a cloud-based educational resource that currently runs on AWS as well as Google Cloud, and the developers hope to have it running on Microsoft Azure, Busby said.

Organizers and previous participants have published a manuscript describing the results of the January hackathon on bioRxiv. They plan to publish the results of the second hackathon, possibly in F1000Research, but are also considering journals such as PeerJ or Briefings in Bioinformatics, Busby told GenomeWeb. It's not clear when that manuscript will be published. "It really depends on how proactive the participants are in editing the manuscript," he said.

Applications for the hackathon were due on Dec. 1. Team leaders then had a chance to review the applications and pick people they thought would be best suited for their particular projects. "I tend to think that the biggest criteria is what [applicants] do in the lab, what scripting languages they know, in some cases, but also people's motivations," Busby said.

The basic requirements for participation in hackathons are a working knowledge of a scripting or programming language and a working laptop. The event targets students, postdoctoral students, and investigators who develop and use genomics analysis pipelines. To ensure that younger researchers aren't put off from applying, applicants' resumes and publication pedigrees aren't checked, Busby said. It's a strategy that seems to have paid off. Participants in previous hackathons have largely been senior graduate and postdoctoral students, but there has been some participation from young researchers working in industry, according to Busby.

He told GenomeWeb this week that the organizers received applications from nearly 80 qualified applicants, including 14 international applicants, for the January hackathon. From this pool, the organizers plan to form six groups of five to six people each. Applicants are expected to confirm their participation this week. If they are unable to attend, their spots will be opened up to other candidates.

Busby hopes to run, ideally, seven or eight hackathons in 2016, and not all of them will be at the NIH. That could reel in hackers who would not otherwise make the trip to Bethesda: no financial aid is provided for applicants, and remote participation is not currently an option. "One of [the] things I would be interested in is getting some of these modules to stack on top of each other so you get a critical mass of bioinformaticians," he said. "The idea is if you can get sets of software modules that you have 50 genomics professionals from across the country interested [in] ... then you can get a critical mass [of] people working on standard toolsets ... and really build a community around modular software for answering some of these big bioinformatics questions."

Busby is mulling hackathons in Boston, Kansas City, Las Vegas, and possibly New York. Currently, one hackathon is planned for the third week of March in Shreveport, Louisiana, and a second one at Brandeis University in Waltham, Massachusetts, in April. "A lot of this is subject to approval, NCBI priorities, and so on ... but that would be the ideal," he said. "We have some [hackathons] that are in various stages of planning ... the biggest [problem] is finding team leads that are willing to head up a project." The organizers have also posted documents that describe the protocols and procedures they have used to plan and run previous hackathons so that interested researchers can organize and run hackathons in their local contexts as well.