Among other key themes, talks at the 14th Bioinformatics Open Source Conference, held two days before the 21st annual International Conference on Intelligent Systems for Molecular Biology in Berlin, Germany, this week, highlighted software options and ongoing efforts to make experimental methods, tools, and data more open, reproducible, and interoperable.
This focus reflected the mission of BOSC's sponsor, the Open Bioinformatics Foundation, which through its activities seeks to promote the practice and philosophy of open source software development within the biological research community. Topics at this year's meeting included a new open science session with an emphasis on tools and data sources that try to make that vision a reality, as well as topics more familiar to regular BOSC attendees, such as software interoperability and cloud computing.
Among the talks in the open science session was one on the Open Science Data Framework, or OSDF, which provides infrastructure for storing and analyzing metagenomics data. Anup Mahurkar, executive director of software engineering and information technology at the University of Maryland's Institute for Genome Sciences, who presented the tool, explained that it grew out of efforts to manage data generated by the National Institutes of Health's Human Microbiome Project, for which his institution served as the data analysis and coordination center.
OSDF, according to its developers, is a cloud-based system that lets users store, retrieve, query, and track changes to their data. Its features include mechanisms for modeling and defining relationships between data elements and a RESTful application programming interface that makes it compatible with multiple programming languages. It also offers access control lists that are assigned and tracked on a per-document basis, allowing project administrators to control access to their data, as well as a version history feature that lets users track changes across different versions of the data.
In a separate presentation, Stian Soiland-Reyes, a research associate in the University of Manchester's School of Computer Science, described research objects developed by myExperiment — an open scientific workflow repository created and maintained by developers at the universities of Southampton, Manchester, and Oxford — for use in openly sharing detailed information about scientific research workflows and in silico experiments. These objects, Soiland-Reyes explained, enable researchers to aggregate and share data used in their projects and results, the methods employed to produce and analyze the data, and annotations, thus facilitating reproducibility and reuse.
On the data side of things, Fiona Nielsen described DNAdigest, a non-profit organization out of Cambridge, UK, which aims to provide a platform for sharing genomic data openly without compromising the privacy of the individual contributors. Nielsen, the organization's CEO, told BOSC participants that the platform lets users ask research questions through an API, which provides a response based on aggregated data collected from multiple repositories, thus ensuring anonymity.
Meanwhile, Luis Pedro Coelho, a computational biologist at the European Molecular Biology Laboratory, presented Jug, a Python-based resource for running tasks in parallel. Jug lets users write code that is broken up into tasks, then run those tasks on different processors. Its feature list includes two storage backends: one that uses a file system to communicate between and coordinate processes, and a second that uses a Redis database. Markus List, a PhD student at the University of Southern Denmark, presented OpenLabFramework, a web-based open-source laboratory information management system that lets users track lab samples.
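The file-system coordination described for Jug can be illustrated with a short sketch. This is not Jug's code or API. It is a simplified illustration, using only the Python standard library, of how independent worker processes might share a directory to claim tasks and store results. The function names (task_id, run_task, load_result, square) and the .lock/.result file naming are hypothetical.

```python
import hashlib
import os
import pickle
import tempfile


def task_id(func, args):
    """Derive a stable identifier from the function name and its arguments."""
    key = repr((func.__name__, args)).encode("utf-8")
    return hashlib.sha1(key).hexdigest()


def run_task(store_dir, func, args):
    """Run one task unless it is already finished or claimed by another worker.

    Returns "done" if this call computed the result, "cached" if a result
    file already existed, and "locked" if another worker holds the claim.
    """
    tid = task_id(func, args)
    result_path = os.path.join(store_dir, tid + ".result")
    lock_path = os.path.join(store_dir, tid + ".lock")

    if os.path.exists(result_path):
        return "cached"
    try:
        # O_EXCL makes creation fail if the lock file already exists,
        # so only one worker at a time can claim a given task.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return "locked"
    os.close(fd)
    try:
        # Re-check after claiming: another worker may have finished meanwhile.
        if os.path.exists(result_path):
            return "cached"
        value = func(*args)
        tmp_path = result_path + ".tmp"
        with open(tmp_path, "wb") as handle:
            pickle.dump(value, handle)
        # Rename into place so readers never see a half-written result.
        os.replace(tmp_path, result_path)
    finally:
        os.remove(lock_path)
    return "done"


def load_result(store_dir, func, args):
    """Read back the stored result of a finished task."""
    path = os.path.join(store_dir, task_id(func, args) + ".result")
    with open(path, "rb") as handle:
        return pickle.load(handle)


def square(x):
    """Stand-in for an expensive computation."""
    return x * x
```

In this sketch, several processes could each loop over the same task list and call run_task on a shared directory. Each task would be computed by whichever process claims it first, and later runs would find the stored result instead of recomputing it.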
Other talks during the two-day meeting provided a snapshot of the breadth of freely available bioinformatics tools for visualizing genomic data. Examples include Refinery, a web-based visualization and analysis platform developed by researcher Nils Gehlenborg and colleagues at Harvard Medical School's Center for Biomedical Informatics. Refinery comprises a data repository with metadata capabilities based on the ISA-tab file format for describing biological experiments (BI 2/3/2012), a Galaxy-based workflow engine, and visualization tools for exploring and interpreting data. In his presentation, Gehlenborg highlighted features such as Refinery's metadata annotation and provenance tracking capabilities and its ability to implement analyses as Galaxy workflows and execute them using the Galaxy API. The tool is scheduled for release later this summer.
Still on the visualization front, DGE-Vis, a web-based tool from the Victorian Bioinformatics Consortium in Australia, lets users statistically analyze differential gene expression data from RNA-sequencing experiments using two software tools: limma (linear models for microarray data) and edgeR (empirical analysis of digital gene expression data in R). Also discussed was MetaSee, developed by a team from the Chinese Academy of Sciences, which offers visualization tools for analyzing and comparing metagenomics samples. It includes a visualization engine that provides different sorts of views for comparing samples, metagenomics models, and a portal for developing plugins. Its developers said they are currently working on a GPU implementation of the tool. Finally, Dalliance, developed by researchers at the Wellcome Trust Sanger Institute, offers an interactive look at genomic data. Its features include support for the standard Distributed Annotation System (DAS) protocol, which is used to gather sequence, annotations, and alignments, as well as a vector graphic model through which users interact with and explore data.
Meanwhile, developers of open source projects such as Biopython, BioRuby, BioLinux, and GenoCAD updated the community on their work in the last year. Biopython, for instance, has had two releases since the last BOSC that include updates such as a major refresh of the sequence motif handling code and a reworked feature location object model. The developers have also published a manuscript about its phylogenetics module. For its part, GenoCAD now includes a grammar editor that lets users revise existing grammars or develop new ones as needed.
BOSC also featured tools that support interoperability, such as BioBlend, a Python-based library for scripting and running applications on Cloudman, the cloud version of Galaxy's infrastructure, which was presented by Enis Afgan. Meanwhile, Donal Fellows of the University of Manchester's School of Computer Science discussed efforts to develop workflow components for Taverna, an open source workflow management system that offers tools for designing and running workflows. These components are shareable, reusable units of functionality that are designed to perform specific tasks.
A presentation about the UGENE Workflow Designer, a C++-based toolkit that is part of the UGENE genome analysis suite, highlighted its utility for creating and running complex workflow schemas. Similar to projects like Galaxy and Taverna, UGENE provides access to a variety of popular bioinformatics tools such as MUSCLE, ClustalW, and HMMER, and also includes tools that let users add new workflow elements and features as well as share the workflows they develop. Other tools include GEMBASSY, a software package that features 53 tools for things like predicting replication origins and estimating gene expression from codon usage, and is implemented with methods from the G-language genome analysis environment associated with the European Molecular Biology Open Software Suite, or EMBOSS; and Rubra, a command line interface for running bioinformatics pipelines that was developed by researchers at the University of Melbourne using the Python-based Ruffus library.
Subject-specific interoperability presentations explored tool options such as the Online Quantitative Transcriptome Analysis, or Oqtans, a Galaxy-based workbench for quantitative transcriptome analysis developed and maintained by researchers at Memorial Sloan Kettering Cancer Center. It includes internally developed tools such as Palmapper, mTIM, rQuant, and rDiff, as well as external ones such as RNA-geeq and Cufflinks. Also discussed was PhyloCommons, a semantic web-based annotated repository of phylogenetic trees that converts trees into the Resource Description Framework (RDF) and then makes them available for querying and other activities.
Rounding out the conference were discussions on computing capacity for genomics data and translational genomics. Presentations focused on tools such as MyGene.info, which provides REST web services, built on the ElasticSearch query engine, for querying and retrieving gene annotation data; and the Gene Prioritization Extended Tool, or Geppeto, an open-source framework for prioritizing genes that incorporates six prioritization modules based on gene sequence, protein-protein interactions, gene expression, disease-causing probabilities, protein evolution, and genomic context. Also presented was the Robust Automatic Multiple Assembler Toolkit, or RAMPART, an automatic parameter optimization pipeline for de novo genome assembly that was developed by Daniel Mapleson, a scientific programmer at the UK's Genome Analysis Center.
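As a rough illustration of what querying a REST gene annotation service such as MyGene.info looks like, the sketch below builds a query URL without sending it. The endpoint path and the q, species, and fields parameters reflect the service's present-day public API as best understood here, not necessarily the version shown at BOSC, and should be treated as assumptions. The helper name build_gene_query is hypothetical.

```python
from urllib.parse import urlencode

# Assumed present-day query endpoint; the version presented at BOSC may differ.
BASE_URL = "https://mygene.info/v3/query"


def build_gene_query(term, species="human", fields=("symbol", "name", "entrezgene")):
    """Build a query URL for a gene annotation lookup.

    term    -- query string, for example "symbol:CDK2"
    species -- species filter for the search
    fields  -- annotation fields to include in the response
    """
    params = {
        "q": term,
        "species": species,
        "fields": ",".join(fields),
    }
    # urlencode percent-escapes reserved characters such as ":" and ",".
    return BASE_URL + "?" + urlencode(params)


url = build_gene_query("symbol:CDK2")
```

Fetching the resulting URL with any HTTP client would be expected to return a JSON document listing the matching genes and the requested fields.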
Under the translational genomics umbrella, Brad Chapman, a biologist and programmer with Harvard School of Public Health's bioinformatics core, discussed efforts to develop human variant calling and validation pipelines in bcbio-nextgen, a Python toolkit that provides best-practice pipelines for fully automated high-throughput sequencing analysis. It offers multiple variant calling approaches including tools provided by the Broad Institute's Genome Analysis Toolkit, variant call validation based on reference data from the Genome in a Bottle consortium, and the ability to query variants using the Gemini framework. Future plans include porting the tool to cloud environments such as Amazon Web Services and Microsoft Azure, Chapman said, as well as integrating it with front-end tools such as STORMSeq (BI 12/7/2012).
Meanwhile, Jeremy Goecks, a postdoctoral researcher in Emory University's departments of biology and math and computer science, highlighted ways in which Galaxy pipelines and visualization tools could be used to analyze cancer genomes. These include a transcriptome workflow for finding things like gene fusions and small variations in RNA-seq data, and an interactive Circos plot that could be used to explore genome-wide data and genomic rearrangements.