Illumina this week launched an updated version of its analysis software suite that contains two new modules for sequence-based data in an effort to improve data analysis for its Genome Analyzer.
The new software, called GenomeStudio, replaces Illumina’s BeadStudio data-analysis software and will be available to users next month. In addition to existing tools that analyze data from both the BeadArray and sequencing platforms, it contains two new modules for sequencing-based applications.
In a conversation with BioInform sister publication In Sequence, Scott Kahn, Illumina’s Chief Information Officer, explained that GenomeStudio is an attempt to bring tools together into an integrated package with a common user interface.
“There is also some nice functionality that lets you go between the sequencing and the microarray worlds,” Kahn said. “For example, you can detect SNPs in the sequencing world, and you can output them in a form that you can create a custom array, and then apply that to very large numbers of samples. So these are convenience tools that let the sequencing world and the array world really act as one, as opposed to two very disparate worlds.”
One of the two new modules has a DNA focus. “It basically allows you to look at one or more runs worth of data and look at SNPs,” Jordon Stockton, Illumina’s marketing manager for computational biology, told In Sequence.
The other is designed for mRNA sequencing and lets researchers align sequences and view them “in the context of the transcriptome,” Kahn said. The functionality of mRNASeq was introduced for full-length cDNA sequencing on the Genome Analyzer and helps research groups avoid the need to design probes or primers.
Kahn added that this RNA data analysis feature allows scientists to take a larger number of reads and explore splice variants or splice junctions. “There really is not a great number of tools that are out there [for that],” he said. “That’s one of the reasons why we have GenomeStudio, to deploy some of the more late-breaking technologies.”
Illumina intends to extend the integration between the modules that make up GenomeStudio, for example by “making it easier to compare data and convert data between applications,” Stockton said. “We are trying to empower technologies that, to be quite frank, are pretty young but growing very, very rapidly. And the best way to do that is empower the broadest range of biologists that you can. So we didn’t want to make the computer interface a barrier.”
GenomeStudio is bundled with service contracts for Illumina’s platforms, and one seat costs $1,500.
Just about all aspects of GenomeStudio were developed in-house, said Stockton.
The company offers access to third-party developers through a program called Illumina Connect. “What we do is provide the necessary API so that third-party developers can add tools kind of after the fact or during our development cycles,” said Kahn.
A Look Under the Hood
GenomeStudio includes algorithms for detecting copy number variation, calling SNPs, and visualizing data. Looking for variants between sequencing runs is “fueled,” as Illumina said in a press release, by a processing module called CASAVA, or Consensus Assessment of Sequence and Variation.
According to the press release, this feature lets users align reads and quantify genes, exons, and splice junctions.
Richard Carter, a data analyst from Illumina, presented a poster on CASAVA at the Genome Informatics meeting in Hinxton, UK, in September [BioInform 09-12-08].
CASAVA takes the output from Illumina’s alignment tool Eland, then collates, bins, and sorts it so that researchers “can easily go” from all reads, organized by chromosome and position, to a consensus sequence call for every base, Carter told BioInform at the time.
CASAVA includes a Bayesian allele-calling algorithm that the company developed called Bacon.
“What you can do is then say, I am confident I am calling a base ‘C’ here or an ‘A’, then the next step of CASAVA is, it filters on that data, [for example it can] filter all the consensus calls and show where all the SNP positions are,” he said. So it gives users a list of reads, a sequence call for every base position, and then a list of SNPs, he said.
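The three outputs Carter describes (sorted reads, a call for every base, and a list of SNPs) can be illustrated with a minimal Python sketch. This is not CASAVA's code: it swaps the Bayesian allele caller for a simple majority vote, assumes gapless alignments, and every function name, parameter, and data value below is hypothetical.

```python
from collections import Counter, defaultdict


def pileup(reads):
    """Group aligned read bases by (chromosome, position).

    Each read is a (chromosome, start, sequence) tuple with a 0-based start.
    Gapless alignment is assumed for this toy example.
    """
    columns = defaultdict(Counter)
    for chrom, start, seq in sorted(reads):  # order by chromosome, then position
        for offset, base in enumerate(seq):
            columns[(chrom, start + offset)][base] += 1
    return columns


def consensus_calls(columns, min_depth=2):
    """Majority-vote call per site (a stand-in for a Bayesian allele caller)."""
    calls = {}
    for site in sorted(columns):
        counts = columns[site]
        if sum(counts.values()) >= min_depth:  # skip thinly covered sites
            calls[site] = counts.most_common(1)[0][0]
    return calls


def snp_list(calls, reference):
    """Filter the consensus calls down to sites that differ from the reference."""
    snps = []
    for (chrom, pos), base in calls.items():
        ref_base = reference[chrom][pos]
        if base != ref_base:
            snps.append((chrom, pos, ref_base, base))
    return snps


# Toy data, invented for illustration only.
reference = {"chr1": "ACGTACGT"}
reads = [
    ("chr1", 2, "GTAC"),
    ("chr1", 0, "ACGAAC"),
    ("chr1", 1, "CGAA"),
    ("chr1", 3, "AACGT"),
]

columns = pileup(reads)
calls = consensus_calls(columns)
snps = snp_list(calls, reference)
print(snps)
```

In this toy data, three of the four reads covering position 3 carry an A where the reference has a T, so that site is the only one reported. A real caller would also weigh base qualities and heterozygous genotypes, which a majority vote ignores.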
This tool has been partly developed through interaction with users and partly through the company’s own sequencing analysis needs, he explained. “We need to write and validate the methodology by finding out how many SNPs we can call and we found it validated very well.”
In validation experiments, the company has found a 99.5 percent agreement with genotyping data, he said. “It’s validating the data; it’s not just experimental metrics.”
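An agreement figure of this kind is typically a percent concordance over sites typed by both methods. The sketch below shows that arithmetic under stated assumptions: genotypes are two-letter strings, allele order is ignored, and the function name and sample data are hypothetical. The article does not describe how Illumina computed its figure.

```python
def concordance(seq_calls, array_calls):
    """Percent of shared sites where sequencing and array genotypes agree.

    Both arguments map a site identifier to a genotype string such as "AG".
    Allele order is ignored, so "AG" and "GA" count as the same genotype.
    """
    shared = seq_calls.keys() & array_calls.keys()
    if not shared:
        raise ValueError("no sites in common")
    matches = sum(
        sorted(seq_calls[site]) == sorted(array_calls[site]) for site in shared
    )
    return 100.0 * matches / len(shared)


# Toy genotypes, invented for illustration only.
seq_calls = {"rs1": "AG", "rs2": "CC", "rs3": "TT", "rs4": "GA", "rs5": "AA"}
array_calls = {"rs1": "GA", "rs2": "CC", "rs3": "CT", "rs4": "AG"}

print(concordance(seq_calls, array_calls))
```

Here four sites are shared and three agree, so the function returns 75.0. Sites typed by only one method, such as rs5, are left out of the denominator.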
He highlighted that even if the experimental metrics are fine, the data could contain systematic errors. With CASAVA, “What we have found is that we are reasonably confident there are no systematic errors in there.” Users want to know if there are real biological signals in their data. “So you can get to the biological data through our system.”
— Julia Karow, editor of In Sequence, contributed to this article.