Backed by a grant from the National Human Genome Research Institute, Seattle-based software shop Insilicos said this week that it is partnering with neighboring bioinformatics firm LabKey to port LabKey’s CPAS proteomics software to Amazon’s Elastic Compute Cloud, or EC2.
The one-year grant, for around $180,000, follows on a $1 million Phase II Small Business Innovation Research grant that NHGRI awarded Insilicos last summer for a similar project with the Institute for Systems Biology. On that project, Insilicos has been working with ISB and Amazon Web Services, a subsidiary of Amazon.com, to modify ISB’s Trans Proteomic Pipeline for Amazon EC2.
When complete, both projects could eventually allow researchers to run complex proteomic analysis pipelines through a web interface on Amazon’s servers — an option that Insilicos and LabKey expect to appeal to research groups without large IT budgets.
“With the combination of these two grants, we’re going to have a pretty complete proteomics tool suite where people will just be able to go to a website and they’ll be able to do proteomics [analysis] without having to have their own compute cluster, without having their own database expert,” Insilicos CEO Erik Nilsson told BioInform this week.
Nilsson said he expects both suites to be up and running on the cloud in about a year. The project with ISB is further along, and “we have demonstrated to ourselves that we have stuff that works in that environment,” he said, “but as far as getting the whole proteomics world to work in that way, where you can just walk up to a website and use it, I think that’s a year out.”
Kevin Banks, director of marketing at LabKey, said that the company has found that “the complexity of installing the IT infrastructure [for computational proteomics] was technically and cost prohibitive for a lot of our target customers.”
LabKey provides services around a suite of tools built on its open-source LabKey Server data integration platform. Among them is CPAS, the Computational Proteomics Analysis System, which comprises a database and a pipeline of proteomics search engines and analytical tools.
Peter Hussey, a founding partner of LabKey, said that while CPAS could run on a single server, “for serious repeated proteomic analysis, you need at least a front-end web server machine and a database machine, and often you need another set of machines that do the search engine step.”
LabKey has found that it’s a “challenge” for many smaller proteomics labs to build or acquire the IT resources to support systems like CPAS, Banks said, adding that the problem is becoming “more of an issue” as the declining cost of mass spectrometers drives an increase in data.
“It’s more cost effective to get a mass-spec machine to create this data and there are more researchers doing this type of research. The barrier is not collecting the data, the barrier is processing and analyzing the data,” he said.
Cloud and Data Downpours
Under the model that Insilicos and LabKey envision, CPAS will be available under the same type of software-as-a-service model that has proved successful in other industries.
“The direction we’re thinking is something like Salesforce.com,” Banks said. “Ten years ago, everyone would buy Siebel and bring their whole [customer relationship management] solution in house and run it on their own servers, and over time they began to realize that they could have a third party host that data so they could access it.”
Banks said that LabKey sees a similar transition taking place in the life science market, “where people need access to very powerful IT systems but don’t necessarily want the cost of having to maintain those, in which case the hosted model and the cloud computing become more attractive.”
Under the cloud computing model, proteomics researchers would first transfer their data to the Amazon system, “and when they want to analyze their data, they’ll start up a virtual server on the Amazon environment, and then they’ll do what they want to do and when they’re done they’ll shut it down,” Insilicos’ Nilsson said.
Amazon’s EC2 pricing varies based on the requirements for a given job, but it starts at $0.10 per hour for computational time. Data transfer costs $0.10 per gigabyte for data transferred in and $0.17 per gigabyte for data transferred out, while storage is $0.15 per gigabyte per month.
Nilsson said that this pricing ensures that a typical lab can afford most computational proteomics jobs. “If they use a server for a couple hours, that’s going to be the better part of a dollar, and if they want to do a bunch of analyses in parallel, or if they want to do a bunch of computationally intensive quantitative proteomics or something like that, and they want to have, say 30 or 40 servers going for an hour, then that’s going to cost [just a few] dollars.”
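The arithmetic behind such estimates can be sketched in a few lines of Python. This is only an illustration that uses the starting rates quoted above; the function name and its parameters are hypothetical, and actual EC2 charges depend on the requirements of a given job.

```python
# Starting rates quoted in this article, in US dollars.
# Actual pricing varies with the requirements of a given job.
COMPUTE_PER_SERVER_HOUR = 0.10
TRANSFER_IN_PER_GB = 0.10
TRANSFER_OUT_PER_GB = 0.17
STORAGE_PER_GB_MONTH = 0.15


def estimate_cost(servers, hours, gb_in=0.0, gb_out=0.0,
                  gb_stored=0.0, months=0.0):
    """Rough cost of one analysis run (hypothetical helper, not an Amazon API)."""
    compute = servers * hours * COMPUTE_PER_SERVER_HOUR
    transfer = gb_in * TRANSFER_IN_PER_GB + gb_out * TRANSFER_OUT_PER_GB
    storage = gb_stored * months * STORAGE_PER_GB_MONTH
    # Round to whole cents for display.
    return round(compute + transfer + storage, 2)


if __name__ == "__main__":
    # Nilsson's parallel scenario: 30 to 40 servers running for one hour.
    print(estimate_cost(servers=30, hours=1))
    print(estimate_cost(servers=40, hours=1))
```

At the quoted compute rate, 30 to 40 servers running for an hour come to between $3 and $4 before any data transfer or storage charges, which is in line with the "few dollars" Nilsson describes.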
Ultimately, he said, the cloud model could enable research labs to better manage their IT budgets. “You can throw CPUs at the problem of proteomics, but you don’t really know they’re going to be used for important problems,” he said. “This way, you’re not waiting in line to get to the supercomputer. When you’re ready for your answer, just buy it out of your grant funds.”
Under this model, computing power “is like a reagent,” Nilsson said. “If you need it, buy it. And if that answer isn’t really worth $10 or $100 or whatever it is, then don’t buy it.”
Insilicos and LabKey also envision the cloud model fostering collaborative research because it will allow disparate researchers to analyze large data sets via Amazon’s services rather than e-mailing huge files around or sharing access codes for secured systems.
Yet despite their high hopes for the project, both Insilicos and LabKey stressed that it is still in its very early stages. “At this point we’re more in the proof-of-concept phase. We don’t want to get people too excited too early,” Banks said.
LabKey’s Hussey noted that porting the complex code base for a system like CPAS to the cloud environment is “not trivial” and “will take some time.”
Insilicos’ Nilsson added that while LabKey Server “is a pretty solid project,” the software was designed to work on a particular network “where you have certain expectations for how things are going to run, and in this case, this is software as a service, so some of those assumptions are no longer valid.”
The partners also face a few technical challenges related to the cloud infrastructure itself. “The biggest obstacle that we’re looking at right now is that the data files are big, so moving the raw data from the instrument over the Internet onto the cloud cluster could take some time,” Banks said. “It’s not like uploading a Word doc that takes seconds. It may take potentially minutes or hours to move those files onto the network, so we still need to get early adopters onto the platform to find out if those bottlenecks are acceptable and if people are willing to work in that scenario.”
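A back-of-the-envelope calculation shows why upload time is a concern. The Python sketch below is illustrative only: the function is hypothetical, the file size and link speed in the example are assumed values rather than figures from either company, and it ignores protocol overhead and network congestion.

```python
def transfer_minutes(file_gb, uplink_mbps):
    """Idealized upload time in minutes (hypothetical helper).

    file_gb     -- file size in decimal gigabytes (assumed value)
    uplink_mbps -- sustained upload speed in megabits per second (assumed value)
    """
    if uplink_mbps <= 0:
        raise ValueError("uplink_mbps must be positive")
    megabits = file_gb * 8 * 1000  # 1 GB = 8,000 megabits (decimal units)
    return megabits / uplink_mbps / 60


if __name__ == "__main__":
    # Assumed example: a 1 GB raw data file over a 10 Mbps uplink.
    print(round(transfer_minutes(1, 10), 1), "minutes")
```

Under those assumptions a single 1 GB file takes roughly 13 minutes to upload, so a batch of instrument runs could plausibly stretch into hours, as Banks suggests.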
Nilsson said that an important aspect of the effort will be testing the performance and usability of the web-based cloud model with a handful of proteomics researchers before rolling it out to the broader community. He said that Insilicos is working with researchers at the University of Chicago and Vanderbilt University who will serve as guinea pigs.
He said that there may be some additional obstacles related to how researchers use the cloud environment, which relies on virtual server “images” that are uploaded to the system. “If you’re running a database on that image and then you want to shut it down, and you want all the data in the database to get somewhere, you have to think about how you’re going to do that efficiently and how you’re going to do that without corrupting the database,” he said.
“When you’ve got your server humming away in your own server room, somebody doesn’t go and unplug that thing kind of randomly. … But in this case, you’re unplugging it yourself whenever you don’t want to pay for it, so some things that don’t need to happen in the corporate computing environment do need to happen in the software-as-a-service environment,” he said.
In addition, LabKey’s Banks said that some scientists may be reluctant to move to the cloud computing model because they “tend to want ownership of their data. They want to feel like it’s close and secure and it’s something that they can touch, so they keep it in house.” However, he noted that many people felt the same way about banking and CRM a decade ago, and both these sectors “have now transitioned to more of an on-demand model.”
Another potential obstacle is security, but Nilsson noted that Amazon has its own security provisions, “and frankly Amazon is going to do a better job of that than most organizations are going to do for themselves.”
The e-commerce giant “doesn’t lose Visa numbers,” he quipped.
Both firms are looking to use the cloud-based system as the basis for a commercial services offering, but neither has formed any concrete plans in that area yet.
“We would most likely provide the commerce front end for accessing” the Amazon servers, Banks said. “So from a business model perspective, LabKey would have the relationship with Amazon and the customer would form the relationship with LabKey.”
Hussey noted, however, that “Insilicos could well be a reseller of the same sort of service,” and stressed that “the first part of this relationship is to figure out technically whether this works on the Amazon cloud service.”
He said that the two firms, which are located down the road from each other in Seattle and have collaborated on a number of projects in the past, have agreed to keep the commercial aspects of the project up in the air for now. “We’ll figure out the right business roles for each of us down the road,” he said.
Nilsson said that Insilicos is “looking for an opportunity to add value on top of what these tools can do,” but he noted that the company’s long-term goal is actually diagnostic development rather than software development, so it isn’t pinning its commercial hopes on the success of this project.
“Our destiny is to be [in diagnostics],” he said, “so I look at a strong vibrant academic proteomics environment as something that we need in order for the sector to be successful. We need academic researchers who want to do proteomics analysis of their research, of their biologically relevant research, because otherwise drug companies won’t turn to us for solutions to their molecular, physiological problems because the basic science isn’t going to be there to support that.”