An informatics team at the Broad Institute of MIT and Harvard has created a freely available, open source web-based tool to help researchers visualize large and diverse genomic datasets in an integrated fashion.
Jill Mesirov, the Broad Institute’s chief informatics officer and director of computational biology, told BioInform that the tool, called the Integrative Genomics Viewer, was developed in response to a “driving scientific need” for real-time panning and zooming on huge sets of omics data along a wide spectrum of resolutions — from the single-base level to entire chromosomes.
“You really want to try to look at these things together to try to understand how they are interacting: mutations, copy number, expression, epigenetic data, and so on,” she said, explaining that her team has become increasingly involved in integrative genomics projects that simultaneously visualize and analyze different data types.
Users can select from various display options: data can be viewed as a heat map, histogram, scatter plot, and other formats.
Jim Robinson, senior software engineer at the Broad and the IGV’s chief developer, told BioInform in an e-mail that it took the team approximately nine months to create and test the viewer, which is a Java application designed to let users integrate or superimpose data to help determine how changes on one level can affect another. IGV was also programmed to provide smooth zooming and panning across all resolution scales.
While acknowledging that other genome visualization tools are available, such as the University of California at Santa Cruz, NCBI, and Ensembl browsers, Mesirov said that when she and her colleagues began working on the IGV, the UCSC browser, which she called “a great browser” for sequence data, wasn’t integrated and was not able to visualize large datasets.
Her team also considered the time it takes to scoot around a dataset. “Once you have these huge datasets you want to move around them at different scales in some reasonable amount of time,” she said. The response time for many viewers that the team tested, she said, was “fairly slow” for big datasets such as the Cancer Genome Atlas dataset or epigenetic studies that use short reads from second-generation sequencing technologies.
“You want to move from base pair to whole chromosome quickly; you want to pan just like you do in Google Maps,” she said. “You want to look at your route at the county level and when you get closer to your goal, you want to zoom in, and look at the street level or the house level.” It was Robinson’s observation, said Mesirov, that Google’s type of technology could enable a genomics browser.
As Robinson explained, scientists were not just seeking a tool to allow interactive exploration and visualization of multiple data types from the Cancer Genome Atlas project, including DNA copy number, loss of heterozygosity, gene expression, methylation, and sequence data. Rather, “there was a need for a tool capable of visualizing large ChIP-Seq tracks at resolution scales ranging from whole chromosome to sequence level,” he said.
With currently available tools, he said, “high-resolution visualization is possible but [is] inconvenient” because it involves “breaking the dataset into manageably sized subsets.” In addition, he said, no current tool can visualize the complete range of resolution scales for very large datasets that IGV targets.
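One common way to get map-style zooming without splitting a dataset by hand is to precompute summaries at several resolutions and pick a level to match the requested view. The sketch below illustrates that general idea only; it is not taken from the IGV code, and the function names (`build_pyramid`, `query`), the averaging summary, and the factor-of-two levels are all assumptions made for the example.

```python
def build_pyramid(values, factor=2):
    """Precompute coarser and coarser summaries of a per-base signal.

    Level 0 is the raw data; each later level averages `factor` adjacent
    bins of the level below it. (Hypothetical helper, not IGV's API.)
    When a level's length is not a multiple of `factor`, the last bin
    averages fewer values, so coarse bins are approximate means.
    """
    levels = [list(values)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([
            sum(prev[i:i + factor]) / len(prev[i:i + factor])
            for i in range(0, len(prev), factor)
        ])
    return levels


def query(levels, start, end, max_bins, factor=2):
    """Return (level, bins) for the half-open base range [start, end).

    Picks the finest level whose bins covering the range number at most
    `max_bins` (for example, the pixel width of the display), falling
    back to the coarsest level. Assumes max_bins >= 1.
    """
    for level, data in enumerate(levels):
        size = factor ** level          # bases covered by one bin at this level
        lo = start // size              # first bin touching the range
        hi = -(-end // size)            # ceiling division: one past the last bin
        if hi - lo <= max_bins or level == len(levels) - 1:
            return level, data[lo:hi]


signal = [1, 1, 3, 3, 5, 5, 7, 7]       # toy per-base values
pyramid = build_pyramid(signal)

# Whole "chromosome" on a 2-bin display: served from a coarse level.
print(query(pyramid, 0, 8, max_bins=2))
# Zoomed in on two bases with room to spare: served from the raw data.
print(query(pyramid, 2, 4, max_bins=4))
```

With this layout, the amount of data read for any view is bounded by the display width rather than by the size of the dataset, which is the property the Google Maps comparison points at.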
Jim Kent, who directs genome browser development and quality assurance at UCSC’s genome bioinformatics group, told BioInform in an e-mail that the UCSC Browser has a number of visualization tools online, for example, the browser’s Genome Graphs display tool.
“We are working on similar tools [to IGV], though there are some significant differences,” Kent said. “Most of ours are already online … but there [are] a few new displays in the works here, too.”
Mesirov said that she weighs each project’s importance before investing considerable time and resources, as her group did in this project. As part of her decision-making, she connects with colleagues to consider “what are the things that will bring the most scientific benefit quickly and have an effect on the most important projects that we are working on.”
Some projects, like this one, are “red-hot,” said Michael Reich, director of cancer informatics development at the Broad. The informatics team’s projects are demand-driven, a sign of which is “when Jim [Robinson] has a whole line of biologists outside his office with requests,” Reich said.
Once the IGV was programmed, testing began. “We have a number of internal customers that put the project through extensive testing on real-world projects that include TCGA … epigenomics, and genome biology projects,” said Reich. The testing process took about two months. “It was only after a while of real-world use, that we said, ‘OK, this is ready for prime time.’”
The effort to create the viewer originally began as work to develop a SNP-visualization module for the Broad’s software platform GenePattern. It involved re-working Harvard University’s dChip software for analyzing and visualizing gene expression and SNP microarrays.
“We re-architected it so it could fit into the GenePattern framework,” said Mesirov. Working through that led up to projects that are now part of the Cancer Genome Atlas project, at which point “we began to see this vision for the IGV,” she said.
At the time Robinson was working closely with users in that group, said Mesirov, crediting him as the person who “saw how to take this to the next level and really enable scientists.”
Robinson worked with Gaddy Getz, who does data analysis as part of the Broad’s efforts in the Cancer Genome Atlas project; John Rinn, assistant professor of pathology at Harvard University and Beth Israel Deaconess Medical Center, who also has a lab at the Broad; and MIT graduate student Mitch Guttman, who was trying to better understand non-coding RNAs.
“Guttman would write a little code [for the viewer], they would look at things; it was agile programming at its best,” she said.
The constant communication between computational biologists, software developers, and experimental biologists, such as Todd Golub, who directs the Broad’s cancer program, or Matthew Meyerson, who is the principal investigator for the TCGA characterization grant, helps “get you invested in a deep and fundamental way,” Mesirov said of her team’s projects.
Going forward, she said the team plans to expand the software’s ability to integrate new data types, new types of tracks, and new internal visualizations, or renderers.
The software has been released under the GPL open-source software license “so that other people can install an instantiation of the software and add what they want,” Mesirov said.
The hope is that the research community at large will help to improve the software to grow it in the directions scientists need. “We strongly believe in making our software available,” Mesirov said.
Companies, too, can adapt the IGV for their own internal uses. “I couldn’t believe that people wouldn’t do that,” she said.
Robinson added that IGV includes internal interfaces “that can be used to extend the tool to new data types or to add new renderers for existing data types.”
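The article does not describe what those interfaces look like. As a rough illustration of the general pattern of pluggable renderers, here is a minimal sketch in which each renderer is a function registered under a name and the viewer looks it up at draw time. Everything in it (`RENDERERS`, `register_renderer`, `render`, the text-based output) is hypothetical and written for this example; it is not IGV's actual API.

```python
from typing import Callable, Dict, List, Sequence

# A renderer turns a track's numeric values into lines of text output.
Renderer = Callable[[Sequence[float]], List[str]]

# Hypothetical registry mapping a display name to its renderer.
RENDERERS: Dict[str, Renderer] = {}


def register_renderer(name: str) -> Callable[[Renderer], Renderer]:
    """Decorator that adds a renderer to the registry under `name`."""
    def decorator(fn: Renderer) -> Renderer:
        RENDERERS[name] = fn
        return fn
    return decorator


@register_renderer("heatmap")
def heatmap(values: Sequence[float]) -> List[str]:
    """One row of characters; denser characters mean larger values."""
    shades = ".:*#"
    top = max(values)
    if top <= 0:
        return [shades[0] * len(values)]
    row = []
    for v in values:
        # Scale to a shade index and clamp it into the valid range.
        idx = int(v / top * len(shades))
        row.append(shades[max(0, min(len(shades) - 1, idx))])
    return ["".join(row)]


@register_renderer("histogram")
def histogram(values: Sequence[float]) -> List[str]:
    """One horizontal bar per value, with length equal to the rounded value."""
    return ["#" * max(0, round(v)) for v in values]


def render(name: str, values: Sequence[float]) -> List[str]:
    """Draw `values` with the named renderer; raises KeyError if unknown."""
    return RENDERERS[name](values)


track = [0, 1, 2, 4]
for display in ("heatmap", "histogram"):
    print(display, render(display, track))
```

In a design like this, supporting a new display style means writing one function and registering it; the code that loads data and handles navigation does not need to change.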
Reich explained that he and his colleagues are thinking about future ways to apply the IGV for second-generation sequencing data management.
“One of the things on the board for the IGV is to be able to operate as a client/server” environment, he said. “You don’t need to throw your data around the network; you can move it minimally to get it into and [be able to] visualize it in IGV.”
The IGV can be accessed here.