Illumina is developing an "analysis-workflow framework" for its sequencing technology that will run on both cloud and in-house infrastructures, a company official said this week.
The new product, which is targeted for release later this year and has not yet been named, "will make the workflow capabilities of analysis pipelines available in a way that’s well integrated with the sequencers," Illumina CIO Scott Kahn told BioInform this week. It is also "configurable, has a good ease of use, and is able to integrate with other tools."
Illumina CEO Jay Flatley hinted at the company’s efforts during its first-quarter earnings call last week, in which he said Illumina is working on a product that will enable customers to "very graphically and easily build complex workflows."
He added that the firm already has "pieces of our software running up in the cloud," and over the next six to 12 months the firm plans to "allow customers to migrate data and compute capabilities up to the cloud in an Illumina-centric environment."
Illumina's move to the cloud has been underway for more than a year (BI 2/12/2010) and is motivated by the rapid increase in sequencing output. Researchers "can generate enough data, but until they are able to analyze it, they are not able to do other experiments," Kahn said.
"It makes sense to try to provide better capabilities to try to analyze and aggregate data that allows the next experiment to be understood, defined, and ultimately run," he added. "That's the business opportunity."
However, he noted that not all analysis tools are meant for the cloud.
"In general we try to move things over [to the cloud] that make sense — that is, [they] have a high computational demand," Kahn said. "Things like alignment do seem to make sense ... and I think a lot of the multi-sample analysis that people will start to do ... [is an] obvious candidate for the cloud."
However, there is work involved in getting things to function on compute clouds, Kahn cautioned, highlighting some challenges Illumina faced in an internal project to move its existing Casava secondary-analysis software to a cloud environment.
He explained that a lot of software packages, including Casava, are written with a "very strict high-performance computing mentality," meaning that there is "a certain assumption of how you can store things to disk and how you can get things back from disk."
Furthermore, in the cloud, the facilities are "either much harder to use, much slower, or the balance between what you can compute and what you can store on disk is shifted."
As a result, "one of the big challenges was to try to reduce the amount of disk writes that we did with the software to take advantage of the size of the memory that each of the cloud compute nodes has," he said.
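The general technique Kahn describes — accumulating intermediate results in memory rather than writing them to disk at each step — can be sketched as follows. This is a hypothetical illustration, not Illumina's actual Casava code; the `process_in_memory` function and its record format are invented for the example.

```python
import io

def process_in_memory(records):
    """Aggregate intermediate results in RAM instead of writing each one to disk.

    On cloud compute nodes, where disk I/O is slow or constrained but memory
    is relatively plentiful, buffering in RAM and persisting once at the end
    replaces many small disk writes with a single large one.
    """
    buffer = io.StringIO()          # in-memory buffer, not a file on disk
    for rec in records:
        buffer.write(f"{rec}\n")    # would otherwise be one disk write per record
    return buffer.getvalue()        # single payload, written out once if needed

result = process_in_memory(["read_001", "read_002", "read_003"])
```

The trade-off is the one Kahn notes: the approach only works when the memory on each compute node is large enough to hold the intermediate state that an HPC-style pipeline would have spilled to disk.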
To address the challenges of moving large datasets, Illumina employs a three-fold strategy: investing in more bandwidth for its internet connection, reducing the amount of storage required for individual experiments, and using file compression techniques.
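The third part of that strategy — compressing data before it travels over the wire — can be illustrated with a minimal sketch using Python's standard `gzip` module. The repetitive byte string below is a stand-in for sequence data, which compresses well precisely because it is highly redundant; the example is not Illumina's pipeline.

```python
import gzip

# Stand-in for highly redundant sequence data.
raw = b"ACGT" * 10000

# Compress before transfer to cut bandwidth requirements...
compressed = gzip.compress(raw)
assert len(compressed) < len(raw)

# ...and decompress losslessly on the receiving end.
restored = gzip.decompress(compressed)
assert restored == raw
```

Real sequencing formats use domain-specific compression (e.g., BAM's BGZF blocks), but the bandwidth arithmetic is the same: every byte removed before transfer is a byte that does not need to cross the internet connection.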
In addition to Amazon's cloud, Illumina has explored IBM's cloud offering and Microsoft Azure, among other options.
"We see the cloud as an attractive place to have genomic data [because] it facilitates sharing and collaboration," Kahn said, citing the Galaxy project as a good example of a data analysis workflow package that integrates a number of tools and takes advantage of the cloud infrastructure.
However, Illumina's new offering won't compete with academic tools like Galaxy, Kahn said.
"Our philosophy in software is that we should try to do things that are missing," he said. "There are some characteristics in the way that data is generated and how it comes off the sequencer that current workflow engines don't handle and we have some ideas on how to facilitate [that] and make that more efficient."
In addition, he said Illumina sees the Galaxy development community more as partners in the space rather than competitors.
Illumina is planning further activity in the data analysis space. For example, there needs to be "a lot of improvement" around "more completely and accurately calling insertions and deletions and other types of [structural] variation," Kahn said.
The company is also looking into enabling multi-sample analysis as well as developing "capabilities that lead to the annotation of the information [that defines] what a variant means or what the characteristics of the sample are telling you about the biology."