Sequencing and Analysis of the Hydra Genome
Chapman, Kirkness et al., Nature
An international research collaboration reports its sequencing and analysis of the Hydra magnipapillata genome and compares it to the genomes of several other organisms. "The Hydra genome has been shaped by bursts of transposable element expansion, horizontal gene transfer, trans-splicing, and simplification of gene structure and gene content that parallel simplification of the Hydra life cycle," the authors write. The team suggests that comparing the Hydra genome to the reported sequences of other animals has helped elucidate the evolution of several of the organism's characteristics.
Considering a Cloud? Cost Isn't Everything...
For those of you who missed it, Edward Walker, a research scientist at the Texas Advanced Computing Center at the University of Texas at Austin, has a paper out that provides a bit of a reality check for cloud computing. In "The Real Cost of a CPU Hour," Walker starts by asking whether outsourcing your cyberinfrastructure needs to the cloud is really all it's cracked up to be; the answer seems to be no. The low price of cloud computing doesn't make up for the performance drop compared to a cluster. Using macro- and micro-benchmarks, the article describes a significant performance gap between running HPC computations on a traditional scientific cluster and on an EC2-provisioned scientific cluster. Walker used the NAS Parallel Benchmarks, a small set of programs designed to help evaluate the performance of parallel supercomputers, to measure the performance of the clusters for common scientific calculations, and examined the performance of the Message Passing Interface library with the mpptest micro-benchmark. "The opportunity of using commercial cloud computing services for HPC is compelling. It unburdens the large majority of computational scientists from maintaining permanent cluster fixtures, and it encourages free open-market competition, allowing researchers to pick the best service based on the price they are willing to pay. However, the delivery of HPC performance with commercial cloud computing services such as Amazon EC2 is not yet mature," Walker writes. "For cloud computing to be a viable alternative for the computational science community, vendors will need to upgrade their service offerings, especially in the area of high-performance network provisioning, to cater to this unique class of users."
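To make the "cost isn't everything" point concrete, here is a minimal sketch of how a cheap CPU hour can end up costing more once a performance gap is factored in. The function name, the prices, and the relative-performance figure are all hypothetical and chosen purely for illustration; none of them come from Walker's paper.

```python
def effective_cost_per_hour(price_per_cpu_hour, relative_performance):
    """Price of one cluster-equivalent CPU hour.

    relative_performance is the fraction of the reference cluster's
    throughput a platform achieves on the same job (1.0 means parity).
    """
    if relative_performance <= 0:
        raise ValueError("relative_performance must be positive")
    return price_per_cpu_hour / relative_performance


# Hypothetical numbers for illustration only; not measured values.
cluster = effective_cost_per_hour(price_per_cpu_hour=0.20, relative_performance=1.0)
cloud = effective_cost_per_hour(price_per_cpu_hour=0.10, relative_performance=0.25)

print(f"cluster: ${cluster:.2f} per effective CPU hour")
print(f"cloud:   ${cloud:.2f} per effective CPU hour")
```

With these made-up inputs, the cloud's sticker price is half the cluster's, but a job that runs at a quarter of the speed makes each unit of finished work twice as expensive.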
Why HMMER Users Shouldn't Bother Sean Eddy

Sean Eddy, a bioinformaticist at HHMI's Janelia Farm campus, is giving all HMMER users out there one to grow on in his latest post over at Cryptogenomicon. Eddy, HMMER's developer, says that since the release of HMMER3 the number of emails he is bombarded with from users has become unmanageable. Most of the emails are from people who are just not reading the darn documentation or disclaimers (you know who you are). Because most of these inquiries seem to center around the three questions below, Eddy has listed the following responses/disclaimers in an attempt to keep his inbox from overflowing and to spare you any hurt feelings when you don't get a reply, because he can't possibly respond to every single email:
How do I do X in HMMER? If you ask me a question that’s already covered in HMMER3’s documentation, I may not reply, in the hope that you realize you can just do your homework. It’s a little awkward to get into a “please read the documentation”, “I did”, “no you didn’t; look at page XX” discussion with a lot of people during the day; partly because I tend to get more testy than I ought to be, partly because I tend to answer in a hurry and make embarrassing mistakes, and partly because it becomes faster to just answer the question rather than have that discussion, but I didn’t have time to answer the question in the first place. Of course, if someone finds something that I haven’t documented well, I always reply, and I always fix the documentation so I don’t have to reply to similar questions in the future. I’m much more likely to reply if someone indicates that they’ve already read the documentation carefully, rather than taking advantage of my easy email accessibility.
How do I do X in this interface to HMMER? If you’re using someone else’s software or web interface to use HMMER, and your problem is in their software not mine, your first point of contact should be the person who developed the interface. That includes commercial software packages that bundle HMMER, and things like BioPerl and other Bio* interfaces. I’m much more likely to reply if a question is directly about HMMER’s own input/output, not something that’s been filtered through someone else’s interface.
How do I do X in this thing that’s called *HMMER* but really isn’t? Programs like MPI-HMMER, GPU-HMMER, LD-HMMER and the like are using the name HMMER without our permission, in annoying disregard of my attempts to get them to use a name that doesn’t confuse people and burden me with a bunch of extra email. Except for the software released by us from hmmer.org, we don’t have anything to do with these other forks or clones, and you again need to contact the people responsible for them.
Python users interested in taking advantage of multicore or GPU hardware should keep an eye out for CLyther, a Python tool for OpenCL (Open Computing Language), the open source programming platform for heterogeneous processing, intended to make scripting OpenCL code as trivial as Python itself. Similar to Cython for C, CLyther is still under development but is set for beta release sometime before April 14. Its features will include prototyping for OpenCL code, OpenCL kernel function creation using Python language definition, and a Python emulation mode for OpenCL code, just to name a few. While OpenCL has not made as much noise as NVIDIA's CUDA, Douglas Eadline says that the future of the open source platform is no longer as dubious as many in the HPC community once thought. The Khronos Group, an industry consortium geared towards creating open standards for the authoring and acceleration of parallel computing and graphics, and a long list of vendors, including NVIDIA, have elected to make OpenCL the industry standard, so the platform is probably here to stay. "From an Independent Software Developer standpoint, OpenCL is the gateway to hybrid (CPU/GPU) computing," writes Eadline. "As anyone with scar tissue in the HPC industry can tell you, investing resources and time into non-standard Applications Programming Interfaces is a risky business. MPI was developed for similar reasons, (i.e. programmers did not want to recode every time a new parallel computer architecture hit the server room)."
Washington University Genome Center Receives $14M in Stimulus Funds
Thanks to a $14.3M grant courtesy of the American Recovery and Reinvestment Act by way of the National Center for Research Resources, Washington University's Genome Center will be expanding its entire data center, which is now at about 80-90% capacity. Washington University is also shelling out $8M of its own money to extend the data center's capabilities and double its size to some 32,000 square feet. The Genome Center's data center currently houses some 5,000 computer processors with more than 5 petabytes of storage.
David Dooling, who oversees the Analysis Developers, Laboratory Information Management Systems (LIMS), and Information Systems groups at The Genome Center, told us that they are still evaluating exactly how they will go about equipping the expanded facility. He says they are currently assessing aspects of data center design such as cooling and electrical configurations, so the expanded parts of the data center could differ from the original facility's infrastructure.
Click here for some pics of the construction.
Every data deluge comes with a free power supply...

In case you missed it, head over to Intel's public idea test drive site and check out the recently released Intel Energy Checker SDK, a relatively easy-to-use software package that provides tools to measure software and hardware energy consumption. The Intel Energy Checker is intended to expose metrics of "useful work" done by an application, whereas activity is usually measured by how hard a server is working while running an application. What's notable about this approach is that it lets data center managers or HPC application developers look at the value of an application in terms of not just how many watts are being used, but also how many of those watts the software is using for actual productive work. The SDK works with Windows, Linux, Solaris 10, and Mac OS X.
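The "useful work" idea is easy to sketch on the back of an envelope. The snippet below is not the Intel Energy Checker API; it is a generic, hypothetical calculation (the function name, the "records" unit, and every number are assumptions) showing how work per unit of energy can rank two runs differently than raw wattage does.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 1000 W * 3600 s


def work_per_kwh(useful_work_units, avg_power_watts, runtime_seconds):
    """Units of useful work completed per kilowatt-hour consumed."""
    if avg_power_watts <= 0 or runtime_seconds <= 0:
        raise ValueError("power and runtime must be positive")
    energy_kwh = avg_power_watts * runtime_seconds / JOULES_PER_KWH
    return useful_work_units / energy_kwh


# Hypothetical runs that each process one million records.
# Run A draws more power but finishes in one hour.
run_a = work_per_kwh(1_000_000, avg_power_watts=400, runtime_seconds=3600)
# Run B draws less power but needs two hours.
run_b = work_per_kwh(1_000_000, avg_power_watts=250, runtime_seconds=7200)

print(f"run A: {run_a:,.0f} records per kWh")
print(f"run B: {run_b:,.0f} records per kWh")
```

In this made-up comparison, the run with the lower wattage is the less efficient one, because it takes twice as long to get the same work done.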
By all accounts, 2010 looks to be the year of the multicore processor, but does this finally mean the emergence of HPC at the desk side or just really expensive space heaters that you can Tweet with? Despite a delayed rollout in 2009, Intel is planning on releasing a 6-core processor code-named "Gulftown" sometime in Q2 of this year. The chip is capable of running 12 threads in parallel and will supposedly increase processing performance by some 50% over quad-core processors while drawing roughly the same amount of power. Intel is also working on a 6-core version of the Nehalem processor, which was originally released with 8 cores, in order to reduce heat issues, and will also be releasing an HPC version of the Nehalem, which is slated to be called the Xeon 7500. The chipmaker's tera-scale computing research program is also touting its monster multicore experimental "single-chip cloud computer," a 48-core chip that it describes as architecturally resembling a cloud of computers integrated into silicon. Whatever that means...
And not to be outdone, AMD is also releasing a 12-core processor called Magny-Cours that is clocked at 2.2 GHz, is chock full of memory channels, and will run cooler when idle than AMD's 6-core Opteron.
The AMD Server team has also kicked off a contest for the best response to the question "What Would You Do With 48 Cores?" The winner will be awarded four new AMD Opteron processors, a TYAN S8812 motherboard that features 4 processor sockets with the capacity to install up to 8 DIMMs per socket, and one copy of Windows Server 2008. The approximate retail value of all prizes is $8,189 USD. Contestants can write an essay or blog post, or create a YouTube video, expounding upon how they would use a 48-core machine to help society.
As Douglas Eadline points out, HPC users have other concerns besides core count, including parallel I/O, memory contention, and GP-GPUs, not to mention heat, power, and noise. Questions remain as to whether 48 cores or more even make sense for a single node, and how GP-GPUs, which definitely have a place in a lot of HPC application areas, can be used alongside multicore processing to tackle large data sets. So as per usual, there's a lot of marketing noise about tremendously powerful hardware coming down the pipeline, and it is drowning out the voices of software developers scratching their heads, wondering how they can win the catch-up game.
Georgia Tech Launches HPC Institute
Georgia Tech announced this week the creation of the Institute for Data and High Performance Computing (IDH). The purpose of the institute will be to promote the development of software and tools to enhance multidisciplinary research and innovation for high performance computing and large-scale data management. The IDH will initially be headed up by Richard Fujimoto, head of the School of Computational Science & Engineering in the College of Computing.
“Georgia Tech has made substantial infrastructure and personnel investments in high performance computing, and achieved many important successes, over the last five years,” said Fujimoto. “I fully anticipate that IDH will enable us to advance beyond prototypes to new levels of accomplishment in the high performance computing area.”
Rice University Receives IBM POWER7-based Supercomputer
Rice University and IBM have announced the rollout of "BlueBioU," an 18.8-teraflop supercomputer based on IBM's new energy-efficient POWER7 processors. The supercomputer comes to Rice as part of a $7.6 million IBM Shared University Research (SUR) award to Rice for advanced biomedical research. BlueBioU is a Linux-based system specifically tailored for parallel processing and includes 608 POWER7 processors capable of running 2,432 tasks simultaneously.
Researchers at the Texas Medical Center plan on using their new machine to accelerate research into complex diseases including cancer and AIDS using a genomics and proteomics approach.
Baylor College of Medicine, a Texas Medical Center partner, will use BlueBioU to explore cancer through genome analysis technologies, including large-scale genome sequencing.
IBM's New Energy-Efficient Data Analysis Method
IBM researchers in Zurich have announced a new method that uses an algorithm to reduce the complexity, cost, time, and, as a result, energy usage of analyzing large-scale data sets. They demonstrated the new method using a Blue Gene/P system at the Forschungszentrum Jülich in Germany to validate nine terabytes of data in less than 20 minutes without compromising accuracy. Usually this process would take roughly a day. The breakthrough, which was presented yesterday at the Society for Industrial and Applied Mathematics conference in Seattle, used just one percent of the energy that would normally be required for such a job. Needless to say, many a bioinformatician is probably waiting with bated breath to see the possible applications of such a method to monstrous next-generation sequencing data sets.
NCSA to Offer Free Webinar on HPC Performance Tools
The National Center for Supercomputing Applications is offering a free webinar on high-performance computing tools from 1:30 to 3 p.m. CST this Thursday, Feb. 25. The webinar will be led by NCSA system engineer Galen Arnold, who will walk you through an introduction to performance tools and techniques, including widely used applications such as the High Performance Linpack.
Convey Computer Rolls Out Life Sciences Division
Convey Computer Corporation, maker of the hybrid-core Convey HC-1 platform, has announced the rollout of a life sciences division to be headed up by industry veteran George Vacek, formerly of Hewlett-Packard. Convey began making its pitch to the HPC community in late 2008 at SC08 with a power-efficient rack unit that combines Intel Xeon processors with commodity FPGAs. About six months later, the University of California, San Diego decided to install the HC-1 to reduce the time needed to run blind search proteomics experiments that look for post-translational modifications in massive protein databases. Regarding Convey's new play for the life sciences market, Vacek says that the company is focusing much of its efforts "on genomics because of the large data volumes being generated by the current generation of sequencers. And, as sequencer technology continues to improve, data volumes will continue their dramatic growth. Although genomics applications generally run fine on x86 clusters, it can be cost prohibitive to quickly complete the data analysis required to support genomics research."
Zero Tolerance Policy on Cloud Computing Balderdash
David Dooling is not having it. In a recent post on his blog, Dooling says enough is enough with all this cloud computing tomfoolery that touts the technology as some kind of panacea for life sciences. His understandable frustration at all the hype is directed towards what he dismisses as a "puff" piece by Jason Stowe, CEO of Cycle Computing, called "Is the Future Of High-Performance Computing For Life Sciences Cloudy?" Dooling criticizes Stowe's piece for implying that programs which run well on one or ten computers will run well on hundreds of computers; clearly this is not always the case. Another bone to pick is that Stowe fails to mention what Dooling feels is the considerable expertise needed to get a cloud up and running with a user's desired applications. I would submit that it goes without saying that any broad, sweeping statements about how cloud computing can be a game changer for life sciences penned by someone in the business of selling cloud computing should be taken with several huge pieces of rock salt anyway.
However, this does not mean that an attempt to spread awareness of cloud computing's potential for life sciences, even if it does originate from a vendor, is completely worthless. The technology is new, and the more folks in the life sciences community who are aware of its pros and cons, the better. But when trusted publications like Nature Biotechnology seem to be spreading straight-up misinformation, the indignation is definitely understandable. Dooling takes apart a recent article in the aforementioned publication entitled "Gathering clouds and a sequencing storm." Among other things, the article contends that "...bioinformaticians might not be willing to spend the time to familiarize themselves with hadoop, the open source program needed to process large data sets on a cloud." I agree completely with Dooling's criticism that it is just plain wrong to say that users need to familiarize themselves with Hadoop, let alone use it to begin with. Not even Google, which uses Hadoop's precursor, MapReduce, requires users to have an intimate knowledge of Hadoop. For more debunking, go check out the article "Hype Cycle for Cloud Computing, 2009." When is Consumer Reports going to take a stab at cloud computing?...
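On the point that a program which runs well on one or ten computers won't necessarily run well on hundreds: one standard way to see why is Amdahl's law, which caps the speedup of a job by whatever fraction of it cannot be parallelized. Dooling's post does not invoke Amdahl's law as far as we know; it is swapped in here as a textbook model, and the 95% parallel fraction below is a hypothetical figure. The model also ignores communication and I/O overhead, which in practice make things worse.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Ideal speedup on n_workers when only parallel_fraction of the job scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Hypothetical job in which 95% of the work parallelizes perfectly.
for n in (1, 10, 100, 1000):
    print(f"{n:>5} workers: {amdahl_speedup(0.95, n):5.1f}x speedup")
```

Under these assumptions the job gets close to a 7x speedup on ten machines but can never reach 20x, no matter how many hundreds of machines are thrown at it.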
HUBZero, a platform used to create dynamic websites for research projects, will be available in an open-source release for the first time at the upcoming HUBBub 2010 workshop on April 13 and 14. Described as either a cloud computing platform or a Facebook for scientists, HUBZero has its roots in the PUNCH web infrastructure, an Internet computing platform developed at Purdue University in the mid-1990s. The folks behind PUNCH then developed the National Science Foundation-funded NanoHUB.org, a cyberinfrastructure resource for nanoscience and technology. The HUBZero developers say that their new platform is different from both its precursors and the other research collaboration platforms out there because any user can start an account and upload any research tool they want; there are no gatekeepers.
