The jury is still out on how useful a new low-cost cloud storage solution from Amazon will be for life science computing, according to several bioinformatics firms.
Amazon Glacier, launched last week, is designed for archiving rarely used data. In conversations with BioInform this week, representatives from several bioinformatics firms familiar with Amazon’s cloud infrastructure offered slightly different takes on the usefulness of this deep storage capability for genomics and other life science disciplines.
Glacier was designed to hold data that is infrequently accessed but needs to be retained for future reference. It allows customers to offload the administrative burdens of operating and scaling archival storage to Amazon Web Services, removing the need for hardware provisioning, data replication across multiple facilities, or hardware failure detection and repair.
Furthermore, for each item stored, the service automatically replicates all data across multiple facilities and performs ongoing data integrity checks, using redundant data to perform automatic repairs if hardware failure or data corruption is discovered.
The price for storing data in Glacier is $0.01 per gigabyte per month in the US, much lower than the cost of storing data in Amazon’s Simple Storage Service, which starts at about $0.13 per gigabyte per month for standard storage and $0.093 per gigabyte per month for reduced redundancy storage.
Glacier is also cheaper than Amazon’s Elastic Block Store, where customers are charged $0.10 per gigabyte per month for provisioned storage and $0.10 per one million input/output requests for standard volumes.
Google, meantime, which recently launched its own cloud computing infrastructure, charges $0.12 per gigabyte per month to store up to one terabyte of data (BI 7/20/2012).
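To put the per-gigabyte rates above side by side, here is a minimal Python sketch of a monthly storage-cost comparison. The rate table simply transcribes the figures quoted in this article (US pricing at launch); the function name and the flat-rate model are illustrative simplifications, since the real services bill on tiered schedules.

```python
# Monthly storage rates per gigabyte, as quoted in the article (US, at launch).
# Flat rates for illustration; actual billing uses tiered pricing.
RATES_PER_GB_MONTH = {
    "glacier": 0.01,
    "s3_standard": 0.13,
    "s3_reduced_redundancy": 0.093,
    "ebs": 0.10,
    "google": 0.12,
}

def monthly_storage_cost(service, terabytes):
    """Return the monthly cost in dollars of storing `terabytes` of data."""
    gigabytes = terabytes * 1024
    return gigabytes * RATES_PER_GB_MONTH[service]

# Archiving 10 TB: Glacier vs. standard S3.
print(round(monthly_storage_cost("glacier", 10), 2))
print(round(monthly_storage_cost("s3_standard", 10), 2))
```

At these rates, parking 10 TB costs roughly $102 a month in Glacier versus more than $1,300 in standard S3, which is the order-of-magnitude gap driving the interest described below.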
Among other benefits, customers who use Glacier don’t have to make upfront capital commitments and all ongoing operational expenses are included in the cost, Amazon said. Furthermore, businesses can scale their usage up or down as needed rather than guess what their capacity requirements will be ahead of time.
When it announced Glacier’s launch, Amazon indicated that the life sciences — genomics in particular — would be one of its target markets.
“We believe life sciences is a very big market for this service,” an Amazon representative told BioInform in an e-mail. “Outside of genomics, we see the following as important target segments — biomedical, biochemistry, computational neuroscience, environmental science, health sciences, medical devices, and medical imaging.”
Keith Raffel, Complete Genomics’ senior vice president and chief commercial officer, said in a statement that Glacier will enable the company to “provide cost-effective, long-term storage,” which is required for patient data in the clinical space. He added that this capability will “eliminate a barrier to providing whole-genome sequencing for medical treatment of cancer and other genetic diseases.”
Another potential customer is DNANexus, which uses AWS for its cloud-based sequence analysis service. Andreas Sundquist, the company’s co-founder and CEO, told BioInform that his firm is currently evaluating Glacier as a possible offering for its clients.
“This could possibly be a great value-add, and a money saver, for users who have large data sets that they need to park somewhere but not access on a daily basis,” he said in an e-mail. “It's a perfect example of how cloud-based technologies enable economies of scale and the rapid integration of new innovations.”
Richard Holland, chief business officer of bioinformatics consulting firm Eagle Genomics, said he expects researchers in the life science space will find the resource “extremely useful.”
Holland told BioInform that Eagle is looking into ways it can make use of the storage solution in customers’ projects.
But not everyone agrees that Glacier can penetrate life sciences markets just yet.
Chris Dagdigian, a founder of bioinformatics consultancy BioTeam and its director of technology, told BioInform that he believes life scientists will likely pass on the new infrastructure, at least for now, in favor of cloud storage solutions that provide faster access to their data.
“My gut feeling is that Glacier is absolutely intended, designed, and architected for a deep, cold, infrequently accessed archive and I don’t see a lot of scientists clamoring for the deep, cold, infrequently accessed archive,” he said.
With Glacier, “you have to package up your data into an archive and then you upload the archive and then you wait until Amazon notifies you that the data is available and then the economics are such that there are financial penalties if you retrieve archives in unusual ways or more rapid ways,” he explained. “I think for most of the data that the scientist has, they probably want it in a form that’s a bit quicker to access and quicker to get at.”
Indeed, data retrieval from Glacier takes three to five hours. And while both Glacier and S3 offer free retrieval of up to one gigabyte per month, Glacier charges far more than S3 for retrieval beyond that point — $0.201 per gigabyte for up to 10 terabytes per month as opposed to $0.120 per gigabyte for up to 10 terabytes with S3. The price differential widens as volumes increase, so that transferring between 100 terabytes and 350 terabytes per month would cost $0.127 per gigabyte with Glacier as opposed to $0.050 per gigabyte with S3.
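The retrieval penalty Dagdigian describes can be sketched with the rates quoted above. This is an illustrative model only: the article gives rates for just two volume bands, the first gigabyte each month is free on both services, and a real bill would apply Amazon's full tier schedule rather than a single flat band rate.

```python
# Illustrative retrieval/transfer-out cost model built from the rates quoted
# in the article. Only two volume bands are given; a real bill would apply
# the full tier schedule, so this is a simplification.
RETRIEVAL_RATE_PER_GB = {
    # (service, band): dollars per gigabyte after the free first gigabyte
    ("glacier", "up_to_10tb"): 0.201,
    ("s3", "up_to_10tb"): 0.120,
    ("glacier", "100_to_350tb"): 0.127,
    ("s3", "100_to_350tb"): 0.050,
}

def retrieval_cost(service, band, gigabytes):
    """Cost of transferring `gigabytes` out: first GB free, flat band rate."""
    billable = max(gigabytes - 1, 0)
    return billable * RETRIEVAL_RATE_PER_GB[(service, band)]

# Pulling back a 500 GB archive in the lowest band (499 GB billed).
print(round(retrieval_cost("glacier", "up_to_10tb", 500), 2))
print(round(retrieval_cost("s3", "up_to_10tb", 500), 2))
```

Even at modest volumes the differential is visible: retrieving a 500 GB archive costs roughly $100 from Glacier against about $60 from S3 under these quoted rates.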
Dagdigian added that Glacier is not designed to facilitate data sharing. “Generally speaking, the person who owns the data is the only person who has the ability to download it,” he explained. “You can’t point a web browser at it and download the files. You have to navigate the archives, press download, let the [data] percolate down to your system, and then you might have to unpack it and distribute it.”
While Glacier’s pricing is attractive, because of the time to data access and the difficulty of sharing data with third parties, “from an implementation perspective, it might be better to pay the slightly higher rates for Amazon EBS or Amazon S3” since access to data in these systems is much quicker and they have built-in sharing mechanisms, he said.
But Eagle’s Holland doesn’t foresee delayed data retrieval being a problem for users. He told BioInform that he believes researchers would be more willing to wait a few extra hours to get access to their data because of the cost savings associated with Glacier’s storage.
Plus, “the idea is that you put data into [Glacier] only at the point where you’ve done the bulk of your analysis of that data and you just need to archive it away for long-term storage in case you have a need to retrieve it,” he said.
Holland expects researchers will use Glacier to store raw and intermediate next-generation sequence data as well as image data and FASTQ databases.
For his part, Dagdigian believes that it’s more likely that information technology professionals and vendors of data storage and backup systems will be Glacier’s initial adopters.
“It’s incredibly likely to me that the next time you buy an enterprise storage product with a bunch of disk drives, that an automatic backup to Glacier is simply just going to be a point-and-click software feature in the storage array,” he said.
But that doesn’t mean that the life science community won’t find use for Glacier at some point, although in most cases, its use is likely to be controlled by IT experts rather than the researchers themselves, Dagdigian noted.
Eventually, “I think it’s going to be in widespread use in the life sciences but it’s going to show up via the IT and the network and the backup people or it's going to be baked into a new version of software that scientists are already using and it will just show up as a feature,” he said.
For example, a link to Glacier could be included in the software that comes with sequencers so that scientists can archive experiment directories or other system data.
Eagle’s Holland and BioTeam’s Dagdigian both described Glacier’s launch as a positive development because of its low price per gigabyte.
“It … add[s] another data point on the inescapable economics that are achievable from some of these cloud providers that operate at such a ridiculous scale,” who can “squeeze cost efficiencies and operational efficiencies down,” Dagdigian said.
It’s also “quite clear that this really is the cloud economy of scale,” he added. “It’s difficult for smaller competitors, and … I would argue, probably impossible for someone to replicate this kind of structure internally, if they were honest about the true costs involved.”
While cloud infrastructures marketed by competitors like Google could pose a threat, “Amazon has a lead time measured in years” over most of its competitors in the cloud space, he said.
“There are a lot of people who are doing cloud storage but if you look at what they are doing … it's relatively simplistic copies of Amazon S3 and EBS,” he said. To catch up to Glacier, “first of all, I think people would have to do a tremendous amount of engineering and probably have to re-architect new things to copy what Glacier does.”
Furthermore, users have the option to ship disk drives directly to Amazon, where the data is uploaded and made available via the cloud, a feature not offered by other providers. Amazon also offers reduced redundancy storage, which lets users cut their storage costs by keeping non-critical data at a lower level of redundancy, “something that I haven’t seen other competitors match,” Dagdigian pointed out.
Google representatives declined to comment on Glacier for this article.