That so many types of data can be incorporated into a single study is both a boon to cancer research and the field's biggest informatics challenge. To enable discoveries that will eventually find their way into the clinic, software tools that facilitate the integration of disparate molecular data sets with clinical information are becoming ever more essential. "We can profile the same set of tumors on many levels, a systems biology approach. We might look at microRNA levels, messenger RNA levels, SNPs, and structural variants, not to mention looking at localization, as tumors are very heterogeneous," says Kevin Silverstein, principal investigator of the biostatistics and informatics group at the University of Minnesota's Masonic Cancer Center. "Being able to microdissect the tumors and analyze everything you find would affect ultimate outcomes, and yet we have no good framework for really combining all that data to meaningfully understand how these cancers are developing."
Indeed, the cutting edge of cancer informatics is the ability to integrate clinical data, biospecimen data, and genomics data, says Rakesh Nagarajan, director of the Alvin J. Siteman Cancer Center Bioinformatics Core at Washington University in St. Louis. However, an all-in-one solution is still out of reach. "I don't know of many systems today that really do that out of the box. While you can try and put the data together yourself, the hard part is having technical folks who can do that easily," Nagarajan says. "The other facet of this is that if you have a data set that is not clinically rich, but that has different genomics and proteomics modalities — like expression and methylation information — there aren't very many tools today that, in a single step, can analyze that data together. That is really where the cutting edge would be."
A grab bag of tools
Not surprisingly, bioinformaticists working with cancer data rely on a hodgepodge of open-source software tools to construct the integrative analysis workflows they need. Because almost no commercial software toolkits can cull meaningful findings from multiple types of data sets, many researchers piece together a virtual integrative workflow from freely available software. One such package is BioConductor, an open-source suite built on the R statistical programming language, with more than 460 packages for analyzing microarray and sequence data and for working with annotation. While it is constantly being improved by its users and the open-source bioinformatics community, BioConductor is formally released twice a year.
But BioConductor does not cover all bioinformatics needs. As such, Minnesota's Silverstein maintains a grab bag of open-source software, including the popular TopHat, a fast splice junction mapper for RNA-seq reads that uses the high-throughput short-read aligner Bowtie. Another widely used tool is Cufflinks, an application that assembles transcripts, estimates their abundance, and looks for differential expression and regulation in RNA-seq samples. "Our analysis uses a variety of programs depending on what type of 'omics data we are dealing with. A lot of our data is the high-throughput next-gen sequencing data, and there's rapid production of tools and not a lot of software platforms from proprietary vendors," Silverstein says. "We rely a lot on open-source, commonly used tools."
Silverstein's group does plenty of structural variation analyses on cancer cells and populations, so he says it is important for his team to be able to understand which gene fusions are present, and whether they can be detected at the DNA or RNA level. To study gene fusions in tumor RNA-sequencing data, Silverstein will often reach for deFuse, a program that considers all possible alignments and locations for fusion boundaries, as well as TopHat-Fusion, an enhanced version of TopHat that aligns reads across fusion points.
For structural variation data analysis, meanwhile, he uses Hydra, a program that detects structural variation breakpoints from discordant paired-end read mappings. However, Hydra does require that the reads first be mapped with other alignment programs, such as the Burrows-Wheeler Aligner and NovoAlign, which must be installed separately.
Last but not least is Galaxy, a Web-based genomics analysis platform that provides users with access to genome annotation databases and rich visualization features. Silverstein and many of his collaborators at Minnesota are active Galaxy developers and have worked to integrate many alignment algorithms into the Galaxy framework.
The cloud outlook
Researchers recently developed a cloud computing version of Galaxy, called CloudMan, which raises the question: Might the cloud provide the type of environment that cancer researchers need for integrated analyses? There are open-source cloud computing solutions built specifically for scientific research, like the Nimbus Platform, which provides an integrated set of tools designed to help researchers run their software in a cloud environment.
"I think the Nimbus project is aiming for that. They have visions of pulling all the essential resources together, everything from The Cancer Genome Atlas, NCBI, all the different data," Silverstein says. "Ideally, if all of it was on the cloud, we could have seamless tools; we wouldn't have to have a mirror site and reproduce all those data sets everywhere. That would be a fantastic advance, but we're not there yet."
The problem of network bandwidth — uploading and downloading data sets and results to and from the cloud — remains an unresolved issue for cloud computing in most academic research settings. WashU's Nagarajan says that, in the meantime, he is content grappling with data integration on his lab's two large compute clusters. "If you have clinical data sets as well as transfer of raw data, the cloud is really not tenable with its current networking approach — we just haven't really needed it. And especially with identified clinical data, that's a no-go in the current regulatory environment," he adds.
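The bandwidth problem is easy to put in rough numbers. The short Python sketch below is a back-of-the-envelope estimate, not a measurement: the function name, the 500-gigabyte data set, the link speeds, and the 80 percent link efficiency are all assumptions chosen for illustration, and none of them come from the researchers quoted here.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move size_gb gigabytes over a network link.

    link_mbps is the nominal link speed in megabits per second. efficiency is
    the assumed fraction of that speed actually sustained (a guess, not a
    measured value).
    """
    if size_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, link speed > 0, efficiency in (0, 1]")
    size_megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    seconds = size_megabits / (link_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 500 GB batch of raw sequencing data on two assumed links.
    for mbps in (100, 1000):
        hours = transfer_hours(500, mbps)
        print(f"{mbps} Mbps link: about {hours:.1f} hours")
```

Under these assumptions, the same upload that ties up a 100-megabit link for more than half a day takes under two hours on a gigabit link, which is why the answer depends so heavily on a site's network.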
The University of Arizona Cancer Center's bioinformatics core is working to combine not only several different types of 'omics data, but clinical data as well, for a multi-dimensional analysis. "Integrating the information obtained with clinical data, patient data — especially survival data — it's amazing what kind of information you can get out of, for instance, a gene expression array, if you have a lot of clinical data," says David Mount, director of informatics and bioinformatics at the center. "I think that's where we fit in — trying to do more of a deep data integration analysis — and for that we use the BioConductor suite, which has so many statistical tools for high-throughput data analysis that it takes care of most of our needs."
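At its simplest, the kind of integration Mount describes is a join on patient identifiers followed by a comparison across groups. The Python sketch below illustrates that idea only. It is not the Arizona group's workflow and does not use BioConductor; the function names, field names, and toy values are invented for the example, and a real analysis would use proper survival statistics that account for censored follow-up.

```python
from statistics import median


def integrate(clinical: dict, expression: dict) -> list:
    """Join clinical records and expression values on shared patient IDs."""
    shared = sorted(clinical.keys() & expression.keys())
    return [
        {"patient": pid, **clinical[pid], "expression": expression[pid]}
        for pid in shared
    ]


def median_survival_by_expression(records: list) -> dict:
    """Split patients at the median expression value and summarize survival."""
    cutoff = median(r["expression"] for r in records)
    high = [r["survival_months"] for r in records if r["expression"] > cutoff]
    low = [r["survival_months"] for r in records if r["expression"] <= cutoff]
    return {
        "cutoff": cutoff,
        "high": median(high) if high else None,
        "low": median(low) if low else None,
    }


# Toy values, made up for illustration; they are not real patient data.
clinical = {
    "P1": {"survival_months": 12},
    "P2": {"survival_months": 40},
    "P3": {"survival_months": 18},
    "P4": {"survival_months": 55},
    "P5": {"survival_months": 30},  # no expression data, dropped by the join
}
expression = {"P1": 9.1, "P2": 4.2, "P3": 8.4, "P4": 3.9, "P6": 7.0}

records = integrate(clinical, expression)
summary = median_survival_by_expression(records)

if __name__ == "__main__":
    print(len(records), "patients with both data types")
    print(summary)
```

Patients missing either data type fall out of the join, which is one reason clinically rich data sets are so valuable: every unmatched record shrinks the cohort available for analysis.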
For their translational research projects, Mount's group relies on many of the tools developed as part of the National Cancer Institute's cancer Biomedical Informatics Grid (caBIG) project to track and manage its tumor data. These include caTissue, caBIG's biorepository tool for specimen management, tracking, and annotation, as well as caArray, an open-source array data management system that can either be locally installed on a cluster or accessed through NCI's Web site. Another caBIG tool Mount uses is caIntegrator2, a software package that incorporates the BioConductor suite and is used to set up caBIG-compatible Web portals for integrative research. Researchers can use the caIntegrator Web portals to bring clinical, microarray, and medical imaging data together into one graphical interface application.
Mount says that while data storage is always on his group's collective mind, it's not at the "data deluge" stage that plagues next-generation sequencing centers. Rather, it's more like a slow flood kept at bay with the addition of storage arrays and management software. While storage may not be an issue, researchers' expectations of what an informatics core can deliver must be kept in check, at least until someone does develop an all-inclusive analytics platform. "A big problem that we have to deal with is just education of our research faculty in the cancer center about what kind of analyses they can do and how we can build genomics into their research project," Mount says. "The way we deal with that is by having workshops about the tools we offer and what types of analytics workflows can be set up for them, and just do the best we can until we get truly integrative tools."