Amazon.com subsidiary Amazon Web Services last week kicked off a new initiative called “Public Data Sets on AWS,” in which it provides access to several public data sets, including human-genome data from Ensembl, in hopes of spurring greater interest in its cloud-computing offerings.
Amazon said on its website that it plans to add the National Center for Biotechnology Information’s GenBank, UniGene, and PubChem resources “soon.” A 3D version of PubChem provided by Indiana University is already available through AWS. Amazon is also hosting other scientific data through the initiative, such as databases from the US Census Bureau.
By hosting these public data sets at no charge, the company hopes to entice researchers to use its Elastic Compute Cloud offering to analyze their data against the hosted sets. The data is available for free, but users pay for the compute and storage they require for their own applications.
“We are going to be assisting Amazon in terms of keeping some of those data sets up to date,” Stan Gloss, managing director of consulting firm BioTeam, told BioInform in an interview.
While Amazon’s EC2 isn’t new, what is new is the access to public data sets, Gloss said, noting that with Amazon hosting the data, shared data sets become “internal to the system, [which] saves everybody a tremendous headache.”
Currently, bioinformatics researchers who want to use EC2’s computational services must move their data to the cloud and then remove it when their analysis is over.
“Some of these public domain data sets, if everyone keeps replicating them and moving them up and down, there is a lot of bandwidth charge,” Gloss said.
Amazon Web Services did not respond to questions from BioInform prior to deadline.
A Database Lives on the Cloud
Under the hosted model, many researchers can access a copy of a given public data set on the cloud. Currently, a lab team might download GenBank to its own computers and then upload all of that data to the cloud to perform its analyses.
By hosting this data, the cloud offers researchers flexibility and saves them time, Gloss said. Amazon is not only hosting the service but also plans to include all updates, keeping the copy of a given data resource current. “It’s a real convenience,” he said.
Scientists might prefer being able to do one project on the cloud and then shut their computers down to analyze results rather than manage a compute cluster and pay for upkeep, Gloss said.
Amazon also offers so-called elastic storage for scientists who wish to store data, at 10 cents per gigabyte per month of provisioned storage, in addition to data transfer fees, according to Amazon’s website.
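As a rough illustration of what that rate means for a lab budget, the arithmetic can be sketched in a few lines of Python. The 10-cent rate is the figure quoted above; the function name, the 500-gigabyte volume, and the transfer-fee parameter are hypothetical, since the article does not give Amazon's transfer rates.

```python
from decimal import Decimal

# Rate quoted in the article: 10 cents per gigabyte per month of provisioned storage.
STORAGE_RATE_USD_PER_GB_MONTH = Decimal("0.10")


def monthly_storage_cost(provisioned_gb, transfer_fees_usd="0"):
    """Estimate one month's bill in US dollars.

    provisioned_gb: gigabytes of provisioned storage (hypothetical input).
    transfer_fees_usd: data transfer fees for the month. The article does not
    state these rates, so the caller must supply a figure; it defaults to zero.
    """
    gb = Decimal(str(provisioned_gb))
    fees = Decimal(str(transfer_fees_usd))
    if gb < 0 or fees < 0:
        raise ValueError("storage size and fees must be non-negative")
    return gb * STORAGE_RATE_USD_PER_GB_MONTH + fees


# Hypothetical example: a lab provisioning 500 GB for one month, no transfer fees.
print(monthly_storage_cost(500))  # prints 50.00
```

Under these assumptions, a lab that keeps 500 gigabytes provisioned would pay $50 a month for storage before any transfer charges, which is the kind of recurring cost that hosted public data sets are meant to trim.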
Generally speaking, Amazon is not known as an information provider in the sciences. “I think that is why Amazon is looking to work with us,” said Gloss.
BioTeam ported its iNquiry software suite to the EC2 environment earlier this year.
Amazon has set up pre-configured Amazon Machine Images, or AMIs, through which software providers like BioTeam can offer their clients a “virtual machine” on the cloud, said Gloss, “with applications and everything that you want to run.”
Porting iNquiry to EC2 was “not hard at all,” he said. “It was just a matter of creating a machine image that would run iNquiry.”
Amazon’s EC2 supports a number of operating systems such as Linux, Windows Server 2003, and OpenSolaris. Amazon lets its customers build AMIs with the software of their choice and already offers free and paid AMIs for software such as MySQL Enterprise, Microsoft SQL Server Express, and Apache HTTP as well as application development environments such as a Java Application Server or Ruby on Rails.
According to a technical description on Amazon’s site, EC2 lets users run their applications in a software environment of their choice, and they can pack the operating system, configuration settings, libraries, and other modules into the AMI. “Think of this as zipping up the contents of your hard drive,” the site states.
For scientists, Gloss said, the cloud can offer new ways of doing bioinformatics. “You open up a couple of computers, put Blast on your machine, GenBank is over here, you point to it on the system, and bingo you can run your Blast search,” he said.
In a statement, Harvard Medical School’s Peter Tonellato said that Public Data Sets on AWS will help him and his colleagues collaborate by sharing commonly used data sets, research environments, and tools.
According to a case study published on the Amazon website, Tonellato wanted to avoid setting up servers and writing code for a large-scale study in which his lab used computer simulations to assess the clinical value of new genetic tests in “patient avatars.”
“I wanted to devise a system where postdoctoral researchers can scope a genetic risk situation, determine the appropriate simulation and analysis to create the avatars, and then quickly build web applications to run the simulations, rather than spend their time troubleshooting computing technology,” he said in the study.
Working with Oracle, Tonellato’s group took 10 days to customize the company’s private Linux AMIs for his data modeling. Two weeks later, according to the case study, the web application was “up and running.”
“We can set up a controlled environment in minutes, run our computational analysis for a couple of hours, and shut down the environment,” Tonellato said. “I only pay for the compute time I use, and more importantly I can spend more time focusing on research, not downloading and setting up computational infrastructure.”
“Ensembl’s approach has always been to try and lower the barriers to entry so that researchers using a desktop PC in a lab or a laptop in an airport departure lounge have access to high-quality, up-to-the-minute genetic information that they can use in their work,” Glenn Proctor, Ensembl Software Coordinator at the European BioInformatics Institute, said in a statement.
“Amazon EC2 allows us to go even further and make all our data available in a robust, scalable and flexible form that anyone with an AWS account can use.”
Proctor could not be reached for further comment.