By Matthew Dublin
One of the final talks at Bio-IT World Expo on Thursday was given by Giles Day, managing director of Distributed Bio, an informatics consultancy that caters to pharma and biotech companies. Day said that the firm typically sells its services to small companies with limited facilities, small IT budgets, and an informatics staff of no more than two or three people. Most of these outfits are also managing increasingly complex automation in their workflows, with unwieldy applications that produce exponentially expanding datasets.
Interestingly, Day said that a large part of his time is spent weaning clients off their local compute clusters, even after they have essentially hit the wall in terms of storage and compute power. Alas, the life of a cloud salesman is not an easy one; the biggest barrier that Day and his company must help customers overcome in adopting the cloud is their phobia of sending work outside the firewall and into the world beyond, or more specifically, up into Amazon’s EC2 cloud. Potential cloud users have a hard time believing that their data and intellectual property could ever be truly secure on Amazon’s EC2. But as he points out, roughly 98.9 percent of cloud users in the life sciences use EC2, and at the end of the day, Amazon really does know how to protect data in regulated environments, as it handles tons of financial transactions every day without any breaches. Because of this, he argued, Amazon has the security know-how that makes it one million times more secure than a pharma or biotech IT infrastructure could ever hope to be.
Security aside, the biggest issue with cloud computing has always been, and still is, I/O latency. There are several tools for addressing data transfer, including rsync, Aspera, bbcp, and Bulk Ingest. But the folks running the Amazon cloud suggested to Day that the best method for transferring large data sets is an application called Tsunami. Developed by researchers at Indiana University in 2002, Tsunami uses TCP (Transmission Control Protocol) for control and UDP (User Datagram Protocol) for data, which lets it achieve more throughput over very high-speed, long-distance networks than is traditionally possible over the Internet.
One customer use case Day highlighted describes how the cloud can improve genomic annotation workflows, which are a classic embarrassingly parallel problem. Before the cloud, the client was running its genomic analysis pipeline on a 100-CPU cluster housed onsite, regularly processing upwards of 700,000 small genomes using a range of applications including BLAST, hmmalign, hmmpfam, psort, and signalp, which resulted in terabytes of data. With the local cluster, these jobs usually took two weeks to complete. But after coming to terms with the fact that it needed to rethink its whole approach, the client relented and switched over to the cloud, reducing time-to-completion to just a few days.
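For readers unfamiliar with the term, "embarrassingly parallel" means each genome can be annotated independently of every other, so the work spreads across as many workers as are available. The following is only a minimal sketch of that idea, not the client's actual pipeline: `annotate_genome` and `annotate_all` are hypothetical names, and the function body is a placeholder for where a real workflow would launch a tool such as BLAST or hmmpfam as an external process.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, Tuple


def annotate_genome(genome_id: int) -> Tuple[int, str]:
    """Hypothetical stand-in for annotating a single genome.

    A real pipeline would run external annotation tools here; this
    placeholder just returns a label so the sketch stays runnable.
    """
    return genome_id, f"annotated-{genome_id}"


def annotate_all(genome_ids: Iterable[int], workers: int = 4) -> Dict[int, str]:
    """Annotate every genome independently across a pool of workers.

    No task depends on another task's result, which is what makes the
    problem embarrassingly parallel.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands each genome ID to whichever worker is free.
        return dict(pool.map(annotate_genome, genome_ids))


if __name__ == "__main__":
    results = annotate_all(range(10), workers=4)
    print(len(results), "genomes annotated")
```

Because the tasks share nothing, the same pattern scales from a few local threads to many rented cloud instances, with throughput limited mainly by how many workers are available and how quickly data can be moved to them.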
Day did stop singing the cloud’s praises long enough for a moment of “cloud sobriety,” during which he raised a question. Amazon is really the only game in town for cheap and reliable cloud computing, and it is the platform that the entire life sciences community interested in cloud computing is gravitating toward and developing methods for, so what happens if EC2 goes out of business? While it’s hard to imagine the behemoth that is Amazon closing down its cloud operations any time soon, the question underscores the fact that this is still a nascent technology. Combined with the I/O issues and the learning curve involved in making workflows function smoothly in the cloud, that means IT staff need to proceed with caution and do the numbers before they commit.