Amazon Comes Clean About Cloud Failure

By Matthew Dublin

It turns out that Amazon’s big failure last week was caused by a malfunction of its storage service, Elastic Block Store (EBS), a replicated storage resource for Amazon’s EC2 virtual compute instances.

Amazon finally came clean with a lengthy report on their Amazon Web Services blog detailing how a series of errors led to the widespread service outage. Basically, a network change that occurred on April 21, intended to upgrade capacity, kicked off a domino effect of complicated failures.

During the upgrade, primary network data traffic was mistakenly shifted to a slower, secondary network, which couldn’t handle the amount of data. It took about 12 hours for an Amazon team to get control of the networking hiccup, but then the real issue was recovering customer data, a significant amount of which is reported to be permanently lost. While Amazon’s usual protocol when a node fails is to replicate the data on the node before it is reused, the replication mechanisms were maxed out, and adding physical capacity to accommodate the replication took the team two days to set up.

Some of the lessons learned are the need for an improved network upgrade process, including more available free capacity in each EBS cluster, as well as improved isolation between zones. Despite all the “I told you so” chatter from skeptics of the cloud, StorageMojo contends that the Amazon team’s response was commendable and that ultimately, the question is not about the reliability of public clouds, like Amazon’s EC2, versus a private cloud, but rather, the type of architectures that can be implemented. While traditional large-scale networks have been built with a focus on Mean Time Between Failures, clouds are designed with fast Mean Time To Repair in mind.
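To make the MTBF-versus-MTTR trade-off concrete, here is a minimal sketch using the standard steady-state availability formula, MTBF / (MTBF + MTTR). The two sets of numbers are hypothetical and chosen only for illustration; they are not figures from Amazon's report.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time a system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical design A: fails rarely, but each repair takes two days.
rare_but_slow = availability(mtbf_hours=10_000, mttr_hours=48)

# Hypothetical design B: fails ten times as often, but recovers in 30 minutes.
frequent_but_fast = availability(mtbf_hours=1_000, mttr_hours=0.5)

print(f"rare failures, slow repair: {rare_but_slow:.5f}")
print(f"frequent failures, fast repair: {frequent_but_fast:.5f}")
```

Under these assumed numbers, the design that fails more often but repairs quickly ends up with the higher availability, which is the argument for designing around repair time.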

The AWS team concludes their failure novella with a sincere apology and promise of credit for affected users.

NNSA Spreads Its Supercomputing Wealth

By Matthew Dublin

The National Nuclear Security Administration (NNSA) has a video describing some of the many non-nuclear research projects that are taking advantage of the NNSA's supercomputers, including the Roadrunner supercomputer located at the Los Alamos National Laboratory. While NNSA's IT staff in its Advanced Simulation & Computing division prepares its large-compute resources for the nuclear stockpile stewardship program, they regularly lend out computing hours to a variety of research projects.

In order to test Roadrunner and establish "system stabilization," LANL system technicians invited researchers working in a range of areas to run their applications and large data sets on the system. One project, led by researcher Bette Korber, a professor at LANL, used the NNSA supercomputer to model HIV proteins in an ongoing effort to find a possible cure for AIDS. Roadrunner provided Korber with insight into how HIV replicates and aided her in the identification of common themes in all the variants that could one day be used to pinpoint a possible vaccine.

Biophysics With IBM Blue Gene

By Matthew Dublin

A team of researchers at Brown University led by George Karniadakis is working with Joe Insley from the Argonne Leadership Computing Facility to develop a method to improve the diagnosis and treatment of blood flow complications. Insley is helping Karniadakis take advantage of Argonne's Blue Gene/P supercomputer system, where they were allotted 50 million processor-hours. Argonne's Blue Gene system is capable of 500 trillion calculations per second.

"Previous computer models haven't been able to accurately account for, say, the motion of the blood cells bending or buckling as they ricochet off the walls," Insley said in a press release. "This simulation is powerful enough to incorporate that extra level of detail."

One part of the study is mapping exactly how red blood cells move through the brain as well as the relationship between cerebrospinal fluid and blood flow in the brain.

Below is a multi-scale model and visualization of blood created using Argonne's Blue Gene supercomputer:

Preparations are currently underway for the installation of Mira, Argonne's next-generation IBM Blue Gene/Q system that will be capable of 10 petaflops peak performance and include more than 750,000 cores and three-quarters of a petabyte worth of memory.

Goliath Stubs Toe: Amazon Cloud's Morning Malfunction

By Matthew Dublin

It looks like Giles Day knew whereof he spoke at last week's Bio-IT World Expo when he called for a bit of "cloud sobriety," asking attendees: what if Amazon's cloud fails? What then? While most in the audience probably thought of that as the remotest of possibilities, like the entire national power grid failing, the unthinkable did in fact happen today. Early this morning at 1:48 AM PDT, Amazon's cloud failed, crippling many social networking sites and other services including Foursquare, Quora, Reddit, Hootsuite, Discovr, Wildfire, Livefyre, CampgroundManager, Totango, ESchedule, ZeHosting, Recorded Future, PercentMobile, the Cydia Store, and whatever other jobs were being run by private users at the time.

The technical failures affected Amazon EC2, Relational Database Service (RDS), Elastic Beanstalk, CloudFormation, and Elastic Block Store (EBS).

In what is a sobering blow to the idea that the cloud's ubiquity and robustness insulate users from having the rug pulled out from underneath them, these failures were actually localized to Amazon cloud servers in Northern Virginia, according to the AWS Service Health Dashboard. While technicians did maintain a detailed log of their efforts to get the servers back online, the cause of the failures remains unclear. The logs do seem to indicate that there were massive latency and error rates with EBS volumes and connectivity errors affecting EC2 instances, as well as other API errors. Let's just hope nothing more mission critical than a few Tweets and news about Lady Gaga was in jeopardy this morning. What was that about the cloud being ready for health care data?...

At The Computer Forensics Show, an IT security conference held in New York City last week, several security consultants balked at the idea of hosting health care data on the cloud, let alone sensitive business information or financial records. However, cloud users were encouraged by several speakers to take it upon themselves to audit the security of their cloud providers. The Cloud Security Alliance (CSA), a non-profit organization focused on promoting best practices for providing security assurance within cloud computing, was endorsed as a go-to resource for users looking to audit their cloud providers by John Kinsella, founder of Protected Industries, who is a co-chair of a CSA working group. The CSA will soon be releasing guidelines to arm users with information on how to best kick the tires on their cloud and the right questions to ask providers.

NSF Funds 13 Teams to Advance Cloud Computing

The National Science Foundation has announced a list of 13 research teams that will receive funding through the collaborative cloud computing initiative kicked off in February 2010 by NSF together with Microsoft. The awards total roughly $4.5M and will provide researchers with access to Microsoft’s Windows Azure cloud computing platform for a three-year period in order to develop innovative approaches to get more out of cloud computing.

"Cloud computing represents a new generation of technology in this new era of science, one of data-driven exploration. It creates precedent-setting opportunities for discovery," said Farnam Jahanian, assistant director of the NSF Directorate for Computer and Information Science and Engineering, in a press release. "We are especially proud of these excellent projects, led by top researchers at universities throughout the country that we think will best capitalize on the NSF-Microsoft partnership. They will use the resources Microsoft will provide to explore and experiment with cloud computing in order to address some of society's greatest challenges."

Some of the NSF-funded cloud computing projects include the following:

Kenneth Birman of Cornell University - “Building Scalable Trust in Cloud Computing”

This project will focus on issues such as availability, secure access, fault tolerance and the preservation of privacy and real-time responsiveness.

Andrey Tovchigrechko of the J. Craig Venter Institute - “Bettering Interactive Protein-Protein Docking”

This project will focus on computationally modeling protein-to-protein interactions in three dimensions, also known as "protein-protein docking."

Zhengchang Su of the University of North Carolina at Charlotte - “Predicting Transcription Factor Binding Sites for Genes”

Su’s project will focus on using cloud computing to address the lack of efficient computational and experimental methods for predicting regulatory DNA sequences.

Wuchun Feng of Virginia Tech - “Conducting Intensive Biocomputing”

Wu Feng, one of the fathers of “Green HPC,” is using his grant to create a new generation of efficient data management and analysis software for large-scale, data-intensive scientific applications (with a focus on DNA sequence analysis) for use in the cloud.

Closing the Cloud Computing Deal

By Matthew Dublin

One of the final talks at Bio-IT World Expo on Thursday was given by Giles Day, managing director of Distributed Bio, an informatics consultancy that caters to pharma and biotech companies. Day said that they typically sell their services to small companies with an informatics staff of usually no more than two or three people with small IT budgets and limited facilities. Most of these outfits are also managing increasingly complex automation in their workflows with unwieldy applications that produce exponentially expanding datasets.

Interestingly, Day said that a large part of his time is spent weaning clients off of their local compute clusters even after they have essentially hit the wall in terms of storage and compute power. Alas, the life of a cloud salesman is not an easy one; the biggest barrier that Day and his company must help customers overcome in adopting the cloud is moving beyond their phobia of sending work outside the firewall and into the world beyond, or more specifically, up into Amazon’s EC2 cloud. Potential cloud users have a hard time believing that their data and intellectual property could ever be truly secure on Amazon’s EC2. But as he points out, roughly 98.9 percent of cloud users in the life sciences use EC2, and at the end of the day, Amazon really does know how to protect data in regulated environments as they handle tons of financial transactions every day without any breaches. Because of this, he argued, they have the security know-how that makes them one million times more secure than a pharma or biotech IT infrastructure could ever hope to be.

Security aside, the biggest issue with cloud computing has always been, and still is, I/O latency. There are several tools for addressing data transfer including rsync, Aspera, bbcp, and Bulk Ingest. But the folks running the Amazon cloud suggested to Day that the best method for transfer of large data sets is an application called Tsunami. Developed by researchers at Indiana University in 2002, Tsunami uses TCP (Transmission Control Protocol) for control and UDP (User Datagram Protocol) for data, so that transfers over very high-speed, long-distance networks can achieve more throughput than is traditionally possible over the Internet.
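The split between a reliable control channel and an unreliable data channel can be illustrated with a small in-memory simulation. This is only a sketch of the general idea, not Tsunami's actual protocol or code: no real sockets are used, and the function names, block size, and loss rate are all assumptions made up for the example.

```python
import random


def split_blocks(data: bytes, block_size: int) -> list:
    """Cut the payload into numbered, fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def lossy_datagram_send(blocks, wanted, loss_rate, rng):
    """Simulated UDP-style data channel: each requested block may be dropped."""
    return {i: blocks[i] for i in wanted if rng.random() >= loss_rate}


def transfer(data: bytes, block_size: int = 4, loss_rate: float = 0.3, seed: int = 0):
    """Send all blocks, then resend whatever the receiver reports as missing."""
    if not 0 <= loss_rate < 1:
        raise ValueError("loss_rate must be in [0, 1)")
    rng = random.Random(seed)
    blocks = split_blocks(data, block_size)
    received = {}
    rounds = 0
    missing = list(range(len(blocks)))  # first pass: everything is missing
    while missing:
        rounds += 1
        received.update(lossy_datagram_send(blocks, missing, loss_rate, rng))
        # Simulated TCP-style control channel: the receiver reliably reports
        # which block numbers it still needs, and only those are resent.
        missing = [i for i in range(len(blocks)) if i not in received]
    return b"".join(received[i] for i in range(len(blocks))), rounds


payload = b"terabytes of sequence data, in miniature"
result, rounds = transfer(payload)
print(result == payload, "after", rounds, "round(s)")
```

The design point is that data blocks are never individually acknowledged, so the sender is not throttled by round-trip time on a long-distance link; only the short list of missing block numbers travels over the reliable channel.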

One customer use case Day highlighted describes how the cloud can improve genomic annotation workflows, which is a classic embarrassingly parallel problem. Before the cloud, the client was running their genomic analysis pipeline with a 100 CPU cluster housed onsite, regularly processing upwards of 700,000 small genomes using a range of applications including Blast, hmmalign, hmmpfam, psort, and signalp, that resulted in terabytes of data. With their local cluster these jobs usually took two weeks to complete. But after coming to terms with the fact that they needed to rethink their whole approach, the client relented and switched over to the cloud, enabling them to reduce time-to-completion to just a few days.
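As a rough sketch of why an embarrassingly parallel workload moves to the cloud so easily, the code below maps an independent per-genome job across a worker pool and estimates how many cores an ideally scaling job would need to hit a deadline. The `annotate` function is a hypothetical stand-in for the real tools, threads are used only to keep the example short (CPU-bound work would normally use processes or separate machines), and the three-day target is an assumed reading of "a few days."

```python
import math
from concurrent.futures import ThreadPoolExecutor


def annotate(genome: str) -> dict:
    """Hypothetical stand-in for one annotation job; needs no other genome."""
    gc = genome.count("G") + genome.count("C")
    return {"length": len(genome), "gc_fraction": gc / len(genome) if genome else 0.0}


def annotate_all(genomes, workers: int = 4) -> list:
    """Run every job independently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(annotate, genomes))


def cores_for_deadline(cores_now: int, days_now: float, days_target: float) -> int:
    """Cores needed to finish in days_target, assuming perfect linear scaling."""
    return math.ceil(cores_now * days_now / days_target)


results = annotate_all(["GGCC", "ATAT", "GATC"])
print([r["gc_fraction"] for r in results])

# 100 CPUs for two weeks, squeezed into an assumed three-day window.
print(cores_for_deadline(cores_now=100, days_now=14, days_target=3))
```

Because no job depends on another, the only limits on speedup in this idealized picture are the number of cores rented and the time spent moving data in and out.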

Day did stop singing the cloud’s praises long enough for a moment of “cloud sobriety” during which he pointed out that, because Amazon is really the only game in town for cheap and reliable cloud computing, and the one that the entire life sciences community interested in cloud computing is gravitating towards and developing methods for, what if EC2 goes out of business? While it’s hard to imagine the behemoth that is Amazon closing down its cloud operations any time soon, the question underscores the fact that this is still such a nascent technology, and when combined with the I/O issues and the learning curve to make workflows function smoothly with the cloud, IT staff need to proceed with caution and do the numbers before they commit.

Genomes, Clouds, and No Headaches

By Matthew Dublin

Probably the best sound bite from day two of the Bio-IT World Expo in Boston was provided by Nicholas Socci, assistant director of the Bioinformatics Core at Memorial Sloan Kettering Cancer Center: “Either the computers are ready for me to use, in the way that I want to use them, or they’re not ready, and those are the only real pros and cons.”

The other pros and cons that Socci was dismissing are the often-cited default debating points about what cloud computing brings to the table for researchers (scalability, no hardware ownership costs, etc.) and what it’s lacking (security concerns, application porting issues, etc.). But for Socci, cloud computing is only worth using if it requires absolutely nothing from him or his IT staff. “If I have to worry about getting data up into all kinds of clouds I will never get anything done,” said Socci. “Up until this point, I have completely resisted using the cloud because if the cloud doesn’t allow me to run what I’m already running, then it’s no use to me.” The turning point for Socci was a solution whereby Life Technologies’ LifeScope Genomics Analysis Software is hosted on Penguin Computing’s POD (Penguin On Demand) cloud computing service.

Socci proceeded to impress upon the audience that what next-generation sequencing analysis really needs is cloud computing. This is not just in the sense that cloud computing, as an elastic storage option, could provide relief from the massive amount of data being generated by NGS platforms, but also in that investigators and IT staff now have a better way to manage an increasingly diverse number of data types or classes of data from new applications. Essentially for Socci, the cloud is actually about harnessing people power because as he exclaimed during his talk: “We have too many things to do and too many new things to do with all this next-generation sequencing data!” He pointed out that NGS is creating an environment wherein collaboration is the name of the game as folks with a range of expertise are increasingly called upon to deal with and analyze the data.

Following up on the idea that if the cloud means having to think very hard about getting things to work properly, then forget it, Angel Pizarro, director of the ITMAT bioinformatics facility at the University of Pennsylvania School of Medicine, said the three pillars of cloud computing for life science research are: automatic provisioning of compute instances, automatic configuration of those instances with your applications of choice, and automatic execution (i.e. it should just work when you need it to, no excuses). Without the possibility of a seamless automated workflow that can be initialized at a moment's notice, the cloud is pointless. This is why Pizarro is a big champion of the Chef project, an open source systems integration framework built to bring the benefits of configuration management to your entire infrastructure. The basic idea behind Chef, which Pizarro admitted one needs some training to use, is that users can write source code to basically construct an automated infrastructure on any server (or any cloud computing infrastructure) that they like.
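The three pillars (provision, configure, execute) can be sketched as a tiny pipeline. This is purely illustrative and is not Chef's API, whose recipes are written in Ruby; every function and field name below is hypothetical. The one idea it borrows from configuration management is that the configure step describes a desired state and can be re-run safely.

```python
def provision(count: int) -> list:
    """Pillar 1: hypothetical stand-in for requesting compute instances."""
    return [{"name": f"node-{i}", "packages": set()} for i in range(count)]


def configure(instance: dict, desired_packages) -> list:
    """Pillar 2: converge an instance to the desired state.

    Installs only what is missing and returns what changed, so running it
    a second time changes nothing.
    """
    missing = set(desired_packages) - instance["packages"]
    instance["packages"] |= missing
    return sorted(missing)


def run_workflow(count: int, desired_packages, job) -> list:
    """Pillar 3: execute the job on every provisioned, configured instance."""
    results = []
    for instance in provision(count):
        configure(instance, desired_packages)
        results.append(job(instance))
    return results


def describe(instance: dict) -> tuple:
    """Example job: report which tools the node ended up with."""
    return instance["name"], sorted(instance["packages"])


print(run_workflow(2, ["blast", "velvet"], describe))
```

Because all three steps are code, the whole environment can be rebuilt on demand, which is the "initialized at a moment's notice" property described above.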

A measured talk presented by Victor Jongeneel, a senior research scientist at both the Institute for Genomic Biology (IGB) and the National Center for Supercomputing Applications (NCSA), further explored the role of cloud computing and genomics by examining whether or not the cloud is a good environment for genome assembly. Jongeneel reported performance benchmarks of the genome assembly algorithms Velvet, ABySS, and Contrail on Amazon EC2 instances and large-memory local compute clusters assembling E. coli and S. pombe genomes. While the cloud performs as well as some of the NCSA’s local compute clusters on smaller genome assemblies, Jongeneel pointed out that there are no currently available software implementations for highly parallel genome assembly for large genomes that make using the cloud worthwhile. In the future, he and his colleagues are planning on developing genome assembly methods that can utilize lots of cores to do high-throughput assembly.

Cloud Computing Aversion Therapy Workshop

By Matthew Dublin

Some of the presenters at a pre-conference cloud computing workshop held at the Bio-IT World Expo in Boston this week were not unlike an overzealous dog owner introducing their Great Dane to someone with a bad case of cynophobia. The general tone of many of the talks seemed to indicate that, for the most part, cloud computing is still in the get-to-know-me phase. Several of the speakers felt the need to play down the hype and instead play up the idea that cloud computing can provide a surprisingly user-friendly solution with lots of possibilities if only everyone just got to know it a little bit better. But they also understood that the real issue was not drumming up interest among researchers and IT managers, but rather, coming up with solid use cases that those folks can use when trying to convince their superiors or funding bean counters that cloud computing is not a technology to be feared and could potentially save money while enhancing research.

Case in point is the widespread use of REDCap (Research Electronic Data Capture), a secure Web application for building online surveys and databases that has been implemented in the cloud at over 200 institutes to capture clinical research data. According to one presenter, Neil Bahroos, director of the Initiative in Biomedical Informatics at the University of Chicago, when coupled with the cloud, REDCap can be used in a mobile form so that physicians can feed data into a clinical study without ever having to worry about common security concerns, such as a misplaced laptop or hard drive, or just the accidental deletion of valuable data.

There was also much talk about Galaxy, which was described as a sort of "desktop informatics solution for next-generation sequencing" that when used with Amazon's EC2 is a good example of how the cloud can be used to enrich a workflow. Users can present reproducible results through Galaxy's History feature, which allows collaborators to track the steps they took in an analysis and share them. In what started to feel like a late-night infomercial, the cloud was also pushed to attendees as something that no researcher should be without, especially if they have to pick up and move somewhere ("The cloud moves with you! No fuss, no muss!"). For those still on the fence, it was promised that more solutions like BioNimbus, a 212-node, 1,568-core cloud that comes preconfigured with 2PB of data ready for you to use ("More data than you can shake a dongle at!"), will be coming online down the road; all users need to do is mount it onto their virtual machines and they'll be publishing highly accessed papers and winning grants in no time.

As for identity and credential management in the cloud, there is a growing list of solutions that one could implement to ensure that only authorized users are permitted to gain access, including OpenID (which is the same secure login backbone that Gmail uses), InCommon, Shibboleth, and MyProxy. Bahroos acknowledged that the rise in "private clouds" built by those tasked with shepherding and maintaining clinical data has been a natural result of the need to keep bosses happy by steering clear of breaking any HIPAA privacy rules. However, he hopes that this variant implementation of the technology, which seems to take away many of the benefits of what it's ideally supposed to provide (i.e. eliminating the burden of hardware), will soon die off as people become more comfortable with the cloud; something that very well might happen if it doesn't jump up and bite anyone too badly.

UMN's $3.6M New Machine

By Matthew Dublin

A powerful new $3.6M supercomputer is now fully operational at the University of Minnesota Supercomputing Institute for Advanced Computational Research (MSI). Dubbed "Koronis," the new system was built to aid investigators with a slew of molecular dynamic modeling projects, bioinformatics, and biomedical imaging.

The system is a serious step up for UMN researchers in that Koronis is built on top of a powerful shared-memory system, high-speed hard drives with a robust I/O network, and high-end visualization capabilities, way beyond the University's previous Altix system.

“The large memory feature is important because many cutting edge research problems are memory-intensive, and Koronis gives researchers a unique tool to tackle them,” said Jeff McDonald, assistant director of high performance computing operations at MSI, in an announcement. “Koronis is the largest shared-memory system at MSI, and it also boasts the highest performance of any MSI system.”

Several researchers are already using Koronis to run complex calculations to study the detection and mitigation of chemical and biological warfare agents, as well as neuronal networks in psychiatric disorders.

The system contains 1,152 processor cores spread across a range of SGI components including an Altix UV 1000 server with 1,140 compute cores and UV 100 systems each with 66 compute cores. For data visualization and analysis, the system also contains four dedicated Nvidia Tesla S2050 and Quadro FX5800 adapters as well as SGI's remote visualization software.

All systems in this constellation run SUSE Linux Enterprise Server 11 SP1 with SGI ProPack 7 SP1. The Koronis system can access a complete data storage and management solution that includes 760 TB of disk storage with an automated tape library for automated data migration and archival.

Koronis also uses a large-scale, cache-coherent Non-Uniform Memory Access (ccNUMA) architecture, which allows for efficient processing of jobs requiring large amounts of shared memory exceeding what is available on other MSI HPC resources. Although codes running on the system can be programmed using a message passing model, communication among the processes will occur via shared memory. OpenMP or other threaded codes should run well on this resource.

MD Anderson's New Cancer Cloud

By Matthew Dublin

Researchers at the University of Texas MD Anderson Cancer Center have their sights set on building the largest computing resource in the world dedicated solely to cancer research. A team led by MD Anderson vice president and CIO Lynn Vogel has already begun implementing a system that uses a private cloud to provide computational power from 8,000 processors and hundreds of terabytes of storage along with a Web portal called ResearchStation that is based on a service-oriented architecture.

The rollout of the cloud coincides with MD Anderson's expansion into a third data center slated to come online later this summer, making it the second new data center opened at the center in a four-year period.

While Vogel says they initially looked at public cloud offerings, they opted for a private cloud solution not only for better performance, where there's less network latency than with a public cloud which is accessed through the web, but also because much of the data is sensitive patient information which might be more easily compromised off site.

"We've looked into this, but quite honestly, we've found on performance, access and in the management of that data, going to a public cloud is more risky than we're willing to entertain," Vogel says. "This goes directly to the point that this is identifiable patient data...and we're just not comfortable with the cloud given the actionable capability of a patient should there be a breach."

MD Anderson cancer researchers can request large blocks of the campus cloud on demand for analyzing human genomes and investigating radiation physics, epidemiology, and simulation for clinical trial activities.

"When you're in the business of biology, which we are, it's a different ballgame in terms of understanding the structures of data, the kinds of access and models used, and the applications that need to be available," Vogel says. "As much as public cloud providers would like us all to believe, this is not just about dumping data into a big bucket and letting somebody else manage it."

Crash Test Dummy Computer

By Matthew Dublin

In order to learn about what works for an HPC system and what doesn't, North Carolina State University professor Frank Mueller recently completed construction on a supercomputer designed to take a serious beating.

“There is no way that large-scale HPC system operators, like Oak Ridge National Labs, would let us experiment with their systems,” Mueller is quoted as saying in a release. “We could break them.”

Mueller, who is known for building one of the first PlayStation 3 clusters for biomedical research, has used funding from the National Science Foundation, Nvidia, and NC State to build a sort of "crash test dummy" cluster. The purpose of the cluster is to determine what types of code max out the processors and what changes to the system's software stack and operating system push it to the limit until system failure occurs. The cluster, called ARC (A Root Cluster for Research into Scalable Computer Systems), contains 1,728 cores and 36 Nvidia GPUs across 108 compute nodes, each with 32GB of RAM.

“We can do anything we want with it,” says Mueller. “We can experiment with potential solutions to major problems, and we don’t have to worry about delaying work being done on the large-scale systems at other institutions.”

As anyone running a cluster or supercomputer knows, when hardware or software failures happen, downtime can be lengthy. These failures affect not only the IT managers tasked with repairing the system; they also back up the job submission pipeline, resulting in delayed research projects and wasted research funding. Mueller and his colleagues also hope to use the new system to better prepare researchers for developing code and hardware to deal with the monolithic data sets of the future. If technical failures on a teraflop system drive IT managers to the brink, imagine what the blue screen of death would do to someone managing research jobs on an exaflop system.

GPU Whole-Cell Simulations

By Matthew Dublin

A team from the University of Illinois has published research in PLoS Computational Biology that details their use of molecular dynamics and steered molecular dynamics on GPUs to study point-mutation-induced Tamiflu resistance in both avian and swine flu N1-subtype neuraminidases.

According to the authors, Tamiflu is currently the frontline antiviral drug employed to fight the flu virus in infected individuals by inhibiting neuraminidase, but drug resistance has become a critical problem due to rapid mutation.

The team, led by University of Illinois professor Zaida Luthey-Schulten, used the GPUs to construct an in silico model of the inner workings of a bacterial cell developed in collaboration with researchers at the Max Planck Institute of Biology in Germany and theoretical scientists at the University of Illinois.

Their simulations revealed "an electrostatic binding funnel that plays a key role in directing Tamiflu into and out of its binding site on N1 neuraminidase. The binding pathway for oseltamivir suggests how mutations disrupt drug binding and how new drugs may circumvent the resistance mechanisms."

GPU Apples vs CPU Oranges

By Matthew Dublin

When it comes to GPUs and CPUs, there's really no comparison - at least according to a post on the TeraGrid news site that attempts to lay to rest the ongoing debate about the pros and cons of each for large-scale scientific research. While the new generation of GPUs is capable of solving high-end computational problems sometimes 20 times faster than their CPU counterparts, this does not of course mean that GPUs are pushing out CPUs in HPC. Both do different things really well, and only in some cases overlap to compete for the user's favor. And although Nvidia and AMD have announced plans to release a combined CPU-GPU chip, most GPUs are still dependent on CPUs to access data from the disk or to exchange data from node to node in a cluster environment.

The TeraGrid currently hosts several GPU-based systems, including the Lincoln cluster at the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, Nautilus at the National Institute for Computational Sciences, TeraDRE at Purdue University, and the Longhorn and Spur systems at the Texas Advanced Computing Center at the University of Texas at Austin. There’s also the Keeneland Project, a $12 million Track 2D grant awarded by the National Science Foundation for the deployment of an experimental high performance system under a partnership that includes the Georgia Institute of Technology, the University of Tennessee at Knoxville, and Oak Ridge National Laboratory.

Below is a video comparing CPU against GPU performance for a protein folding simulation of TRPCage, an artificially designed protein, using a single Intel E5462 2.80GHz CPU versus an Nvidia C1060 GPU: