The informatics group at the Genome Institute at Washington University School of Medicine has released an integrated analysis and information-management system called the Genome Modeling System.
The system borrows concepts from traditional laboratory information-management systems, such as tracking methods and data-access interfaces, and applies them to genomic analysis. The result is a standardized system that integrates both analysis and management capabilities, David Dooling, the assistant director of informatics at Wash U and one of the developers of GMS, explained to BioInform.
Dooling described the system during a presentation at the Genome Informatics conference hosted by Cold Spring Harbor Laboratory earlier this month.
During his talk, Dooling said that his group is working on integrating GMS with the Galaxy platform so that each system can be accessed from the other.
Currently, users can download various analysis tools that Wash U researchers developed internally. However, the team is working on packaging all the GMS capabilities into a single virtual machine image for the Ubuntu platform, a version of which they plan to release by the end of the year, Dooling told BioInform in a conversation this week.
The VM image will include tools that are still being developed, including a web-based interface for users to enter and track individuals, associated tissue samples, sequencing libraries, sequencing instrument data, and analysis progress and results. It will also include a text search engine and database that will allow users to find and retrieve their data, he said.
Additionally, the source code will be made available through the group's GitHub page so that users can develop analysis applications that the Wash U team will release in future versions of the system, he said.
Currently, GMS provides several of Wash U's bespoke tools. These include SomaticSniper, which identifies single-nucleotide positions that differ between tumor and normal samples in BAM file data; BreakDancer, which predicts several kinds of structural variants in next-generation paired-end sequencing reads; and Joinx, which performs a series of set operations on genomic data in .bed files.
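To give a sense of what a set operation on .bed data involves, here is a minimal Python sketch that intersects two sets of genomic intervals. It is not Joinx's code or interface; the function names `parse_bed` and `intersect` are hypothetical, and the sketch assumes that the intervals within each input do not overlap one another.

```python
from typing import List, Tuple

# (chromosome, start, end) using BED's 0-based, half-open coordinates
Interval = Tuple[str, int, int]


def parse_bed(text: str) -> List[Interval]:
    """Parse the first three columns of BED-formatted text into sorted intervals."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and common BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return sorted(intervals)


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the regions shared by two sorted interval lists.

    Assumes intervals within each list do not overlap one another.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        chrom_a, start_a, end_a = a[i]
        chrom_b, start_b, end_b = b[j]
        if chrom_a == chrom_b:
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:
                out.append((chrom_a, start, end))
            # Advance whichever interval finishes first.
            if end_a <= end_b:
                i += 1
            else:
                j += 1
        elif chrom_a < chrom_b:
            i += 1
        else:
            j += 1
    return out
```

For example, intersecting a file containing chr1:100-200 and chr1:300-400 with one containing chr1:150-350 yields the two shared regions chr1:150-200 and chr1:300-350.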
Another tool, the Mutational Significance In Cancer package, or MuSIC, comprises a series of statistical analysis tools to identify significantly mutated genes and altered pathways, investigate the proximity of amino acid mutations in the same gene, search for gene- or site-based correlations to mutations, correlate mutations to clinical features, and cross-reference findings with databases such as Pfam and the Catalogue of Somatic Mutations in Cancer.
Under the hood, these tools run several well-known bioinformatics programs, including the Burrows-Wheeler Aligner, Novoalign, ALLPATHS, VarScan, SAMtools, and Picard.
The Wash U team is also developing new applications for RNA-seq and methyl-seq data analysis, as well as improving existing genomic assemblers and incorporating new ones, Dooling said.
Wash U is using GMS internally to analyze data from several sequencing efforts, including the 1000 Genomes Project, the Cancer Genome Atlas, the Pediatric Cancer Genome Project, and other exome and targeted sequencing studies, he said.
GMS is also finding use in clinical applications. Wash U researchers published two articles earlier this year in the Journal of the American Medical Association in which they discuss their efforts to sequence and analyze patient genomic information using GMS.
In the first study, the researchers used tools from the system to identify a genetic mutation that was linked to cancer susceptibility in whole-genome sequence data from a patient who had previously been diagnosed with breast and ovarian cancer, as well as acute myeloid leukemia.
The second study explored whether whole-genome sequencing could be used to identify "clinically actionable mutations" in patient data, and whether the results could be communicated to physicians in a timely fashion. In this case, GMS was used to analyze data from a second patient with AML.
Dooling added that Wash U researchers currently have several additional papers in the publication pipeline but could not provide more details.
The informatics group initially developed the system to analyze the AML genome, which researchers at Wash U's Genome Institute sequenced in 2008.
Dooling explained that at the time there weren't any tools that scaled up adequately to handle the large quantities of genomic data from the AML project and other large sequencing projects.
While "there were systems that would allow people to design and execute workflows at the time, a lot of them were limited to running on a single workstation" instead of adopting a distributed approach, he said.
These systems also required users to upload data into databases and create copies of the information so that the results could be reproduced at a later date — a tactic that isn't feasible for experiments that generate terabytes of data.
Furthermore, next-generation sequencing technologies were still young at the time and took a long time to generate data, while analysis tools were still being developed and versions changed rapidly, he said.
As a result, once the data had been generated and analyzed, "we found it difficult to recreate the exact results that we had because older [sequence] runs had been aligned and variants called with older versions of software and newer versions of software changed the details of how things were" done, he said.
He added that most researchers at the time tracked their work in lab notebooks or in multiple files stored in different locations.
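The reproducibility problem Dooling describes can be pictured with a short Python sketch: instead of copying terabytes of data, a system can store a small record of the tool, its version, its parameters, and checksums of its inputs, and derive an identifier from that record. This is an illustration only, not a description of how GMS works; `AnalysisRecord`, `sha256_of`, and the example tool name and version strings are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class AnalysisRecord:
    """Hypothetical provenance record for one analysis step."""
    tool: str                        # e.g. an aligner or variant caller
    version: str                     # exact software version used
    params: Dict[str, str]           # command-line or config parameters
    input_checksums: Dict[str, str]  # input name -> SHA-256 of its contents

    def key(self) -> str:
        """Derive a stable identifier from every field of the record."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return sha256_of(payload.encode("utf-8"))


# Illustrative values only: the same inputs processed by two software versions.
reads = sha256_of(b"ACGTACGT")
old_run = AnalysisRecord("aligner", "0.5.9", {"threads": "4"}, {"reads": reads})
new_run = AnalysisRecord("aligner", "0.6.1", {"threads": "4"}, {"reads": reads})

# Different versions give different identifiers, so results are never conflated.
print(old_run.key() != new_run.key())
```

Because the identifier changes whenever the version, parameters, or inputs change, results produced by an older aligner release would stay distinguishable from those produced by a newer one.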
GMS came out of "a desire to make analysis reproducible in the same way that laboratory work is reproducible" and standardized, Dooling said. The "advantage of that is that ... scientists can take their time thinking about experiments and the results and interpreting [them] as opposed to just following recipes to align and call variants."