Following a successful first outing last year, the Open Bioinformatics Foundation will this year once again participate in the Google Summer of Code internship program along with newcomers the National Resource for Network Biology and the Organization for Genome Informatics.
Google included the three bioinformatics projects among more than 160 mentoring organizations selected for this summer's internship.
For the internship, OBF will oversee BioPerl, BioPython, BioJava, BioRuby, BioSQL, BioLib, and BioDAS; the NRNB will oversee development of GenMAPP, Cytoscape, and WikiPathways; and the Genome Informatics group will manage the joint efforts of WormBase, Reactome, and GBrowse (BI 03/25/2011).
Among its goals, Google's program aims to create and release open source code; help open source projects identify and bring in new developers and committers; and give students more exposure to real-world software development scenarios.
Members of the community can submit their ideas on each sub-project's wiki page and interested students can select which projects they would like to work on or suggest their own original ideas.
Although last year was OBF's first time as a GSoC mentor, prior to 2010, the Bio- projects had recruited students for Google's internship program under the auspices of the National Evolutionary Synthesis Center, or NESCent, a group focused on phyloinformatics.
Although NESCent and OBF's programs are independent — NESCent does evolutionary biology projects while OBF does projects related to the OBF toolkits — "there are some projects that could fit into either organization," Robert Buels, OBF's program administrator for the internship, told BioInform.
For instance, since it began participating in the GSoC in 2007, "NESCent has had several projects where students contributed code to OBF toolkits," Buels said. "Conversely, in 2010, OBF had a student, Sara Rayburn, who worked on BioRuby, implementing algorithms for species tree handling, which could have fit just as well under NESCent's program."
NESCent is currently seeking interns for its summer projects, which involve things like developing Galaxy's phylogenetics pipeline and extending Jalview capabilities to support RNA sequence alignment annotation and secondary structure visualization.
"In actuality, there's a lot of communication back and forth between OBF and NESCent to make sure that students get routed to the right organization, based on their project and how many student slots each organization has been awarded by Google," Buels added.
Last year, OBF received about 30 applications to fill six slots for the GSoC program. This year, Buels said, the group hopes more students apply and to that end is advertising the program as much as it can via its mailing list, at universities, and through other avenues.
Student applications are due on April 8.
Sequence Alignments and PTMs
GSoC mentors are expected to "introduce" their mentees to community members as well as facilitate interactions between the students and the original code developers, Andreas Prlic, a senior scientist at the Protein Data Bank at the University of California, San Diego, told BioInform. Mentors also ensure that interns stick to the project's "standards of quality."
Prlic, who is the project leader for BioJava, said last year's project ideas attracted applications from a lot of students, although ultimately only two were selected by Google: a proposal to develop a new multiple sequence alignment algorithm and a method for identifying and classifying post-translational modifications in proteins.
Peter Rose, a scientific lead at the Research Collaboratory for Structural Bioinformatics, along with Prlic supervised Jianjiong Gao, a doctoral student at the University of Missouri, who worked on the PTM project.
Gao worked on developing BioJava packages to identify PTMs in three-dimensional protein structures, generate sequence diagrams with an option to add PTM annotations, and generate two-dimensional tree images of carbohydrate structures.
Meantime, Prlic supervised Mark Chapman, a graduate student in computer science at the University of Wisconsin, Madison, who worked on developing a Java implementation of a multiple sequence alignment algorithm, along with co-mentors, Scooter Willis, a data analyst scientist at the Scripps Research Institute, and Kyle Ellrott, a software engineer for the University of California, Santa Cruz, Genome Browser.
Chapman explained to BioInform via email that he "designed and implemented a module for calculating and storing alignments." The result, he said, is "an open source codebase for use by the bioinformatics community that is both a ready to use alignment toolkit and a foundation on which to build next generation alignment routines."
Prlic noted that Chapman's implementation of the algorithm is both "flexible and extensible," allowing developers to plug in different components as well as add new layers to it.
Both Chapman's and Gao's projects are currently available to the community as BioJava modules.
Prlic said he is "satisfied" with last year's internships and anticipates that, judging from the applications received so far, this year's program will be just as "exciting" for BioJava.
Workflow 'Pain Points'
Prior to becoming a mentor for the BioPython project, Eric Talevich, a bioinformatics doctoral student at the University of Georgia, said he worked as an intern under the NESCent group creating a parser for the phyloXML format.
Talevich, who works in a structural biology lab at UGA, told BioInform via e-mail that last year he submitted a project suggesting some features for BioPython that involved improvements in PDB file manipulations as well as integrating the Modeller software into a homology modeling pipeline.
However, he said his mentee, João Rodrigues, submitted a proposal that actually improved on his original project idea and was selected for the summer program. Talevich supervised the project with Diana Jaunzeikare, a software engineer at Google; and Peter Cock, a researcher at the James Hutton Institute in Scotland.
"It was great working on code with another scientist who really knows the problem domain," Talevich said. "In academic projects there seems to be less room for overlapping skill sets in this way. If you can write the code for a project yourself, you don't need a collaborator who can do the same thing — you might get scooped. Open source lets scientists who are facing the same problems support each other."
For his project, Rodrigues worked on "pain points" in the structural biology workflow, "such as transferring data between BioPython and Modeller for a homology modeling pipeline" and then "[wrote] the code to smooth them over," Talevich explained.
Specifically, Talevich said his mentee worked on automatically renumbering the residues in a PDB file; extracting the peptide sequence of a protein structure; removing disordered atoms; identifying disulfide bridges in a structure; and developing several methods for coarse-graining a structure.
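To give a sense of what the first of those tasks involves, the following is a minimal sketch of residue renumbering written against the fixed-column PDB text format. It is not Rodrigues' code and does not use the BioPython API; the function name `renumber_residues` and its `start` parameter are hypothetical, and the sketch assumes well-formed ATOM and HETATM records.

```python
def renumber_residues(pdb_lines, start=1):
    """Renumber residues sequentially within each chain (illustrative sketch).

    Assumes standard fixed-width PDB records: chain ID in column 22,
    residue number in columns 23-26, insertion code in column 27.
    """
    new_numbers = {}  # (chain, old residue number, insertion code) -> new number
    next_number = {}  # chain -> next number to hand out
    renumbered = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            chain = line[21]
            key = (chain, line[22:26], line[26])
            if key not in new_numbers:
                new_numbers[key] = next_number.get(chain, start)
                next_number[chain] = new_numbers[key] + 1
            # Write the new number right-justified and blank the insertion code.
            line = line[:22] + f"{new_numbers[key]:>4d}" + " " + line[27:]
        renumbered.append(line)
    return renumbered
```

Records other than ATOM and HETATM pass through unchanged, so the sketch leaves headers and terminators alone; a production tool would also need to update any records that refer to residues by number.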
Furthermore, Talevich said the programmers went on to begin work on a sub-package in BioPython for structural biology, "which would make way for supporting RNA structures in addition to protein structures — and even DNA — through a consistent API."
Talevich and Rodrigues are currently merging the new code into BioPython in increments, because the existing structural biology module in BioPython has been in place since around 2003, and, as such, "we've had to be careful not to break or change any functionality that users might be relying on," Talevich explained.
This year, BioPython has submitted two project ideas for the GSoC. Although Talevich didn’t submit a project this year and probably won't be a mentor, he expects that OBF's participants "will get some great work done this year."
Inferring Gene Duplications
Christian Zmasek, a research associate at Sanford-Burnham Medical Research Institute in La Jolla, Calif., described Google's program as a "unique opportunity" for students to get involved in developing open source tools and help get things accomplished.
Zmasek, who has participated in the GSoC for three years and primarily mentors students involved in BioRuby projects, worked with a student named Sara Rayburn last year to implement an algorithm for gene duplication inference in BioRuby.
Specifically, Rayburn worked on implementing in BioRuby the speciation-vs.-duplication inference algorithm, or SDI, which was developed by Zmasek and others.
This year's BioRuby projects include creating methods to process and analyze next-generation sequencing data in Ruby, as well as a tool for visualizing three-dimensional protein structures.
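For readers unfamiliar with duplication inference, the general idea can be sketched as follows: each node of a gene tree is mapped to the lowest common ancestor, in the species tree, of the species found beneath it, and a node is called a duplication when it maps to the same species-tree node as one of its children. The Python below is a simplified illustration of that mapping rule for rooted binary trees, not Rayburn's BioRuby code; the `Node` class and function names are hypothetical.

```python
class Node:
    """Minimal tree node; `species` is set only on gene-tree leaves."""
    def __init__(self, name, children=(), species=None):
        self.name = name
        self.children = list(children)
        self.species = species
        self.parent = None
        for child in self.children:
            child.parent = self


def depth(node):
    """Number of edges between a node and the root of its tree."""
    d = 0
    while node.parent is not None:
        node = node.parent
        d += 1
    return d


def lca(a, b):
    """Lowest common ancestor of two nodes in the same tree."""
    da, db = depth(a), depth(b)
    while da > db:
        a, da = a.parent, da - 1
    while db > da:
        b, db = b.parent, db - 1
    while a is not b:
        a, b = a.parent, b.parent
    return a


def infer_events(gene_root, species_leaves):
    """Label each internal gene-tree node as 'duplication' or 'speciation'.

    `species_leaves` maps a species name to its leaf in the species tree.
    """
    mapping, events = {}, {}

    def visit(gene_node):
        if not gene_node.children:
            mapping[gene_node] = species_leaves[gene_node.species]
            return
        for child in gene_node.children:
            visit(child)
        left, right = gene_node.children  # binary trees assumed
        mapping[gene_node] = lca(mapping[left], mapping[right])
        duplicated = (mapping[gene_node] is mapping[left]
                      or mapping[gene_node] is mapping[right])
        events[gene_node.name] = "duplication" if duplicated else "speciation"

    visit(gene_root)
    return events
```

The sketch recomputes node depths on every lookup, which is fine for small trees; faster variants exist for large ones.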
Zmasek said he hopes to find at least one good student for the projects in which he is involved. He noted that one of the difficulties with the internship is that some projects are particularly challenging and would require more than the three-month timeframe as well as the efforts of more experienced programmers who know their way around the programming languages.
However, on the upside, the limited time frame reduces distractions and makes the internships more focused, he said.