JMP, a business unit of statistics software firm SAS, has expanded an existing partnership with the National Center for Genome Resources to repurpose its array-analysis software for gene-expression analysis on second-generation sequencers.
The partnership is based on the JMP Genomics software package, a “hybrid application that uses JMP as the visualization tool and SAS as the analytic engine behind the scenes,” Shannon Conners, product manager for JMP Genomics, told BioInform.
NCGR and JMP began collaborating several years ago on a project to analyze array data, but as NCGR has moved more aggressively into sequencing — the center now has six Illumina Genome Analyzers — it drew JMP into the field as well.
“JMP Genomics has been a tool that has been used with traditional array technology,” Conners said, adding that she views the move from arrays into sequence analysis as a “logical extension to a new technology.”
Under the terms of the partnership, NCGR will provide input for new applications in JMP Genomics, so that the software can better support second-generation sequencing data, Conners said.
“There’s a co-development aspect of it — us creating new processes in JMP Genomics as well as NCGR creating appropriate export tools on their side.”
One aspect of the project involves NCGR’s Alpheus software, which was developed to identify sequence variants in very large data sets. NCGR is developing tools to help researchers take datasets out of Alpheus and put them into a format JMP Genomics can read, Conners said.
From Customer to Partner
NCGR, based in Santa Fe, NM, has been using JMP Genomics to analyze array expression data for several years.
Last year, the center, which traditionally had a focus on bioinformatics development, began transforming itself into a sequencing center [BioInform 01-12-07]. With funding from the state of New Mexico, it purchased its first sequencer to create the New Mexico Genome Sequencing Center.
As NCGR researchers began generating data on the instrument, they soon realized that “we could get expression data out of it,” Faye Schilkey, associate director of the New Mexico Genome Sequencing Center, told BioInform.
Schilkey said that NCGR chose to stick with JMP Genomics for gene-expression analysis on the new platform for a number of reasons, including its “great graphical features.”
The two organizations “just clicked,” she said, in trying to make the software work better. “We are doing things to help each other.”
Conners said that after the JMP Genomics team took part in NCGR’s symposium on second-generation sequencing in March, the firm ran a workshop that gave NCGR researchers “a first-hand look at how they could plug their data into JMP Genomics and work with it.”
That experience “kicked off some very cool power users at NCGR just taking the product and running with it working with their own datasets,” she said.
For example, the NCGR scientists found that second-generation sequencers could generate count data that could be imported and analyzed in JMP Genomics, so “they wrote an export tool to be able to export the count data and read them into JMP Genomics,” she said.
Building on the existing collaboration between NCGR and JMP, the two organizations decided to “expand our customer relationship into a partnership,” Conners said.
JMP doesn’t anticipate in-licensing any software from NCGR, because many of the processes in place for Alpheus require “massive compute power” and JMP Genomics is a desktop application, Conners said. Instead, the company is looking for feedback from NCGR researchers on ways to improve JMP Genomics to handle second-gen sequencing data.
The partners are also planning co-promotion events and other ways to help JMP Genomics users who are just getting into second-generation sequencing explore NCGR’s approach for mining and analyzing the data.
A study published in PLoS ONE last week highlights how Alpheus and JMP Genomics can be used together to analyze mRNA-seq data.
In the study, researchers from NCGR, Illumina, SAS, and elsewhere used the Genome Analyzer to generate 16.7 billion nucleotides of mRNA sequence from cerebellum samples of 14 patients with schizophrenia and six controls. They found that 215 genes were expressed “significantly” differently between cases and controls.
The scientists performed read alignment using the GMAP algorithm and Alpheus, and used JMP Genomics version 3.2 for statistical analysis.
“Much to our surprise, conventional tools for analysis of array-based expression data work very well for mRNA-seq,” NCGR CEO Stephen Kingsmore told BioInform via e-mail.
Kingsmore noted that a number of recent papers — one, for example, by Christopher Burge at MIT and colleagues, and another by Hans Lehrach at the Max Planck Institute for Molecular Genetics — are revealing that mRNA-seq is “better for measurement of gene expression than arrays, at least Affymetrix arrays,” and that it is “more sensitive and has better precision” than array-based gene expression.
One downside of mRNA-seq, however, is that second-generation sequencers generate orders of magnitude more data than microarrays. If a second-gen sequencing experiment reveals 35,000 novel transcripts, “You don’t want to send that into a statistical program,” Schilkey said.
Therefore, the researchers first used Alpheus to align the reads to the genome and calculate read frequencies for each sample and locus. As the short reads were aligned in a shotgun fashion, an expression pattern began to emerge.
“If a bunch start aligning, piling up, as we call it, to a region of the genome or the transcriptome, the number of reads can be turned into an expression value,” Schilkey said.
“Once we could convert the next-gen data into expression values based on read count, we were able to import it into JMP Genomics,” she said.
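As a rough illustration of the idea Schilkey describes, the Python sketch below tallies aligned reads per sample and locus and turns each tally into an expression value. It is not NCGR's code: the function names are hypothetical, and the scaling used here (reads per million within a sample, then log2 of the value plus one) is an assumption chosen for the example, since the article does not say how Alpheus normalizes counts.

```python
from collections import Counter
import math


def count_reads_per_locus(alignments):
    """Tally aligned reads for each (sample, locus) pair.

    `alignments` is an iterable of (sample, locus) tuples, one per
    aligned read. This input shape is an assumption for the sketch.
    """
    counts = Counter()
    for sample, locus in alignments:
        counts[(sample, locus)] += 1
    return counts


def to_expression(counts):
    """Convert raw counts to log2(reads-per-million + 1) per sample.

    The normalization is an assumed, commonly used choice, not the
    method reported for Alpheus.
    """
    totals = Counter()
    for (sample, _locus), n in counts.items():
        totals[sample] += n
    return {
        (sample, locus): math.log2(n * 1_000_000 / totals[sample] + 1)
        for (sample, locus), n in counts.items()
    }
```

Reads that "pile up" on a locus simply produce a larger count for that locus, and the per-sample scaling keeps samples with different sequencing depths comparable.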
Smoothing the Transition
NCGR and JMP are currently working on smoothing the transition of data from Alpheus to JMP Genomics.
Kingsmore outlined the steps that users must take to move their data from one system to the other. The reads with quality scores are aligned to a reference database, and “we import the best alignment for each read into a relational database … together with annotations — any SNPs or indels — and enumeration of all events,” he said. The sample information and metadata is imported into Alpheus.
“You log in and use Alpheus to visualize your data, pick samples, apply filters, such as frequency cutoffs or certain gene sets or fold change cutoffs, and then, when you're happy you know what you're looking for to test your hypothesis, you hit return,” he said. This step queries the relational database and a clickable gene list is returned in the web browser.
Next, a researcher can click an export tool on the gene list that downloads a flat file of the requested data in a format that is acceptable for import into JMP Genomics, he said. Once users open JMP Genomics, the file opens in that software, letting users run routines for data transformation, normalization, quality control, variance decomposition, PCA, ANOVA, graphing, and annotation.
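The filter-then-export step Kingsmore outlines can be pictured with a short Python sketch. Everything specific in it is assumed for illustration: the column names, the default cutoffs, the add-one fold-change calculation, and the tab-delimited layout are hypothetical, and the article does not describe the actual file format Alpheus writes or JMP Genomics reads.

```python
import csv

# Hypothetical column layout for the exported flat file.
FIELDS = ["gene", "case_count", "control_count", "fold_change"]


def filter_genes(rows, min_count=10, min_fold=2.0, gene_set=None):
    """Keep genes that pass a frequency cutoff, an optional gene-set
    filter, and a fold-change cutoff in either direction."""
    kept = []
    for row in rows:
        case, control = row["case_count"], row["control_count"]
        if max(case, control) < min_count:
            continue  # too few reads in both groups
        if gene_set is not None and row["gene"] not in gene_set:
            continue  # not in the requested gene set
        # Add 1 to both counts so a zero count cannot divide by zero.
        fold = (case + 1) / (control + 1)
        if fold >= min_fold or fold <= 1 / min_fold:
            kept.append({**row, "fold_change": round(fold, 3)})
    return kept


def export_flat_file(rows, handle):
    """Write the filtered genes as a tab-delimited flat file."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
```

Passing an open file or an in-memory buffer as `handle` yields one header line followed by one line per gene that survived the filters.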
Conners said that the JMP Genomics development team is working to “make that quick and easy every time and streamline the interface between the programs.”
One request that NCGR has made under the partnership is to be able to look simultaneously at gene-expression data and information on variant percentages from each detected allele, Conners said, “so we made some tweaks to JMP Genomics 3.2 to allow some of our processes to do that in a more straightforward way.”
The next version of JMP Genomics, due out in April 2009, will include more features that have grown out of the NCGR partnership, she said, but did not elaborate.
Alpheus also includes a number of new features that arose from the schizophrenia project. In addition to the JMP Genomics export functionality, the software now uses the gSNAP algorithm to align short reads using genome-wide indexing for paired-read alignments, as well as the GMAP genomic mapping and alignment program for mRNA and EST sequences for singleton alignments, Kingsmore said.
NCGR offers a fee-based online analysis service based on Alpheus, with pricing based on the level of use.
Kingsmore believes this offering will appeal to researchers who don’t have access to the massive computational resources required to analyze next-gen sequencing data.
“Whereas next-gen sequencing technologies allow any well-funded lab to generate hundreds of gigabases of sequence, analysis of that data is beyond the reach of all but major genome centers and expert computational biologists,” he said.
“By offering sequence alignment services in combination with reasonably intuitive web-based visualization, query, and analysis tools, we believe that any lab that can generate next-generation data can also analyze it,” he said. “We don't think it makes sense for most labs to invest in the massive compute infrastructure or bioinformatics expertise to copy the Alpheus system.”
Alpheus currently supports gene expression and polymorphism detection. “We'll add other capabilities in time,” Kingsmore said.
NCGR has posted all the data from the schizophrenia study online, within the context of Alpheus, so that researchers can re-analyze it for free.
“This is the first time that mRNA Seq data has been freely available to anyone without advanced compute capabilities,” Kingsmore said. “A unique feature of the data is that it is freely available for anyone to re-analyze any way that they wish using a web-based query tool set.” Users can download it into Excel or JMP Genomics once they have run their own comparisons, he said.
Making the dataset freely available within Alpheus opens up more possibilities for analysis, Kingsmore explained. While scientists could previously submit sequences to Genbank for others to download and analyze on their laptop or desktop, second-generation sequencing “has broken the Genbank model,” he said.
Second-generation sequencers have “democratized” genome-scale data generation, and the compute infrastructure and computational skills needed to analyze that data have become the new barrier in genomics, Kingsmore said. “This pilot is an ambitious new way to truly democratize genome analysis.”