Researchers from Boston College and the University of Michigan recently received about $865,000 from the National Human Genome Research Institute to continue developing software for identifying variants and assessing their functions.
The project is part of a larger NHGRI initiative called the Genome Sequencing Informatics Tools (GS-IT) program, through which the agency funds and supports efforts to develop tools for analyzing large datasets and to make both new and existing software more accessible to the research community.
Seven research centers and universities are listed as participants in the program, according to its website. They include the Broad Institute, Boston College, the University of Michigan, Washington University in St. Louis, Scripps Research Institute, University of Southern California, and Harvard Medical School. Groups at these institutions are building systems to perform tasks such as sequence mapping and alignment, calling structural and short variants, analyzing differential gene expression, and so on.
NHGRI provided the first grants for the GS-IT program about two years ago. The BC and U of M researchers used those early funds, and grants awarded subsequently, to develop a tool and pipeline management system called Gkno. The system packages well-tested bioinformatics applications, many of which have been used to analyze data from large-scale sequencing projects such as the 1000 Genomes Project, into downloadable workflows for processing and analyzing next-generation sequencing data. The workflows cover steps such as read mapping, variant identification and annotation, and linking variants to phenotypes.
Included in the pipelines are tools that were developed in the lab of Gabor Marth, an associate professor of biology at BC and one of the principal investigators on the grant. These include MOSAIK, an open-source sequence mapping program, and Freebayes, a short polymorphism detector. The pipelines also include third-party tools like the Genome Analysis Toolkit. The workflows can be used as they are, but more experienced users can reconfigure them based on their analysis needs or incorporate additional applications to create more complex pipelines.
With the development of the analysis framework largely completed, the researchers are working on a variety of access options for the biomedical research community. They'll use the current funds, Marth told BioInform, to develop a cloud-based alternative to the downloadable version of Gkno. It will provide a virtual image of the pipelines and workflows that run on Amazon's cloud infrastructure, along with documentation on how to set up and use the system, for users who don't have sufficient servers to analyze large datasets in house.
Meanwhile, Marth and colleagues are working on another system called IoBio, supported by NHGRI funds and other sources, that will provide web-based access to Gkno's tools but is intended as something of a testing ground for users. The idea for IoBio, Marth explained, is to let users analyze sections of their data (sequences that are relevant just to genes of interest, for example) using Gkno's workflows to give them a sense of how the system works and what it has to offer. Once they've had a taste, he said, they can download the entire Gkno framework or use the cloud virtual image, once it's available, to analyze the complete dataset.