HPC & Cryo-Electron Microscopy Data

The University of Texas at Austin's Chandrajit Bajaj is using the computational horsepower of the Texas Advanced Computing Center to identify a small group of molecules that might have therapeutic potential.

Bajaj is using practically every piece of computing hardware at TACC to conduct his research, including TACC's Ranger and Lonestar supercomputers, the Longhorn remote visualization system, and Stallion, a super high-resolution tiled display in the TACC/ACES Visualization Laboratory.

"Computers are a good way to accelerate the process of drug design," Bajaj says. "It takes 10 years to proof out a drug, and a billion dollars or more. Hence computational drug discovery is not only timesaving, but economics tells you this is the way we should be going."

Using TACC's HPC resources — including both CPU and GPU-based systems — Bajaj and his colleagues were able to run their algorithms to detect secondary structures of proteins through intermediate and coarse resolution 3D maps that were reconstructed from single particle cryo-electron microscopy.

Bajaj, who regularly collaborates with big pharma, says that increasingly, computational drug discovery is becoming the de facto process of sorting through target compounds.

"They're moving into the computational drug screening arena, and more and more it's teams of people working together," he says. "The biophysicist, the biochemist, and the synthetic chemist are sitting together with the computational expert, and they say it's giving them clues as to what they should be doing next."

His recent research is described in the Journal of Structural Biology.

New Program Powers Population Diversity Research

Database designers at the Cornell University Center for Advanced Computing are providing researchers with a powerful new tool for exploring the microbial world.

Called "CatchAll," this program lives up to its name by computing 12 different diversity estimates with standard errors and goodness-of-fit assessments at various levels of outlier deletion. In instances wherein low-frequency counts may be false, CatchAll computes a discounted estimate by adjusting the diversity component of the selected mixture model.
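To give a feel for what a "diversity estimate" computes, here is a minimal sketch of the bias-corrected Chao1 estimator, a classic nonparametric richness estimator. It is shown purely as an illustration: the text above does not list CatchAll's 12 estimates, and the function name `chao1` and the sample counts are assumptions made for this example.

```python
from collections import Counter

def chao1(abundances):
    """Bias-corrected Chao1 estimate of total species richness.

    `abundances` holds one count per observed species (hypothetical data).
    """
    counts = [n for n in abundances if n > 0]
    observed = len(counts)           # species actually seen
    freq = Counter(counts)
    f1, f2 = freq[1], freq[2]        # singletons and doubletons
    # Many rare species suggest that many more remain unseen.
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))

sample = [1, 1, 1, 1, 2, 2, 5, 9, 14]   # illustrative counts only
estimate = chao1(sample)
```

With 9 observed species, 4 singletons, and 2 doubletons, the estimate is 9 + (4 × 3) / (2 × 3) = 11 species.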

"The massive data produced by sequencers require advanced statistical tools capable of accurately estimating the total diversity or 'species richness' in a microbial population," says John Bunge, an associate professor in Cornell's Department of Statistical Science.

The software is described in greater detail by Bunge and his colleagues in "Estimating population diversity with CatchAll," published in the journal Bioinformatics.

CatchAll has both command line and GUI interfaces and an associated Excel spreadsheet with graphical displays. Executable downloads for Linux, Windows, and Mac OS platforms with a manual and source code are available here.

An Imperfect Processor

A team of researchers from Rice University is challenging the notion that has motivated the computing industry for over 50 years: accuracy is the ideal.

It might be counterintuitive to think of developing computer hardware that delivers anything less than the most accurate results. Yet allowing for a small amount of not-so-accurate computation can save users time and money.

This type of "inexact design" can seriously decrease power consumption and enhance processing speed by managing the probability of errors and placing limitations on which calculations produce errors. These less-than-perfect processors are designed by "pruning," or trimming away, rarely used sections of digital circuits and by creating what the team calls "confined voltage scaling," which trades certain areas of performance to further cut down on power usage.
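The accuracy-for-efficiency trade can be illustrated in software. The sketch below is not the Rice team's circuit-level technique; it swaps in a much simpler stand-in, an adder that ignores the lowest bits of its inputs, and measures the average relative error that results. The function names, operand range, and trial count are all assumptions chosen for the example.

```python
import random

def truncated_add(a, b, dropped_bits):
    """Add two non-negative ints after zeroing their lowest `dropped_bits` bits."""
    mask = ~((1 << dropped_bits) - 1)
    return (a & mask) + (b & mask)

def mean_relative_error(dropped_bits, trials=10_000, seed=0):
    """Average relative error of truncated_add over random 16-bit operands."""
    rng = random.Random(seed)        # fixed seed keeps the run repeatable
    total = 0.0
    for _ in range(trials):
        a = rng.randrange(1 << 15, 1 << 16)
        b = rng.randrange(1 << 15, 1 << 16)
        exact = a + b
        total += abs(exact - truncated_add(a, b, dropped_bits)) / exact
    return total / trials

# Dropping more low-order bits simplifies the "hardware" but raises the error.
errors = {bits: mean_relative_error(bits) for bits in (0, 4, 8)}
```

Dropping zero bits reproduces exact addition, and each extra dropped bit trades a little more accuracy for a simpler computation.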

"In the latest tests, we showed that pruning could cut energy demands 3.5 times with chips that deviated from the correct value by an average of 0.25 percent," says Avinash Lingamneni, a Rice graduate student who is a co-developer on the project. "When we factored in size and speed gains, these chips were 7.5 times more efficient than regular chips. Chips that got wrong answers with a larger deviation of about 8 percent were up to 15 times more efficient."

Inexact computer chips like this prototype are about 15 times more efficient than today's microchips:

While skepticism is understandable, the Rice team is certainly turning a lot of heads in the processor design community: their research has already earned best-paper honors at this week's ACM International Conference on Computing Frontiers in Cagliari, Italy.

The obvious question: What good is a processor that makes mistakes?

According to the team, there are certain application areas that can tolerate a significant amount of error. The example they cite is an image that was rendered with the "inexact" processor with relative errors up to 0.54 percent that were virtually indiscernible to the human eye. So there might be room for these imperfect processors in big data visualization or molecular modeling, although they probably won't be that popular in most areas of bioinformatics where mistakes simply cannot be tolerated.

For now, instead of pitching these processors as ideal for general purpose use, these researchers are saying their design may best be used as application-specific processors, such as embedded microchips in devices.

A New and Improved Database

A group of researchers at Yale University may have leveled the playing field for database performance. What's interesting is their solution is not even technically a database.

Called "Calvin," their new "database" is actually a transaction scheduling and replication coordination service its developers say could provide a nice alternative to the pricey distributed relational databases offered by Oracle and IBM.

Two of Calvin's developers, Daniel Abadi and Alexander Thomson, raise some interesting questions on their blog that were also part of the impetus for developing Calvin:

Why are Oracle's 11g and 10g databases, as well as versions of IBM's DB2 database, all technologies that were developed several decades ago, still at the top of the TPC-C list? What about all the new general-purpose database management system technologies that can supposedly scale easily?

The reason is that scalability cannot be achieved without a few sacrifices — some quite painful — which they detail in their blog post.

So how does Calvin compare to SQL, "NewSQL" (various new scalable/high performance SQL database vendors), and NoSQL?

Abadi and Thomson write that Calvin should not be compared to any of the three as they "designed the system to integrate with any data storage layer, relational or otherwise. Calvin allows user transaction code to access the data layer freely, using any data access language or interface supported by the underlying storage engine (so long as Calvin can observe which records user transactions access)."

Worthy of note is that Calvin can reportedly run 500,000 transactions per second on 100 Amazon EC2 instances in the cloud and can maintain strongly-consistent, up-to-date 100-node replicas in Amazon’s Europe and US West data centers — at no cost to throughput, they write. Not too shabby.

For more information on Calvin, check out their paper here.

Gene Expression Analysis with Apple's iOS

The latest in a growing number of efforts aimed at exploiting mobile app technology for life sciences research, a group at Baylor College of Medicine has developed a prototype app for displaying gene expression data using Apple's iOS mobile operating system.

The group, led by Baylor's Chad Shaw, describes its new mobile "Hematopoietic Expression Viewer" app in a paper published online last week in the journal Bioinformatics.

On Shaw's website, the developers offer both the source code, for users with Apple's Xcode development environment, and a local distribution.

According to Shaw and his colleagues:

Many important data sets in modern biological science comprise hundreds, thousands, or more individual results. These massive data sets require computational tools to navigate results and effectively interact with the content. Mobile device apps are an increasingly important tool in the everyday lives of scientists and non-scientists alike. These software present individuals with compact and efficient tools to interact with complex data while at meetings or otherwise remote from their main computing environment. We believe apps will be important tools for biologists, geneticists and physicians to review content while participating in biomedical research or practicing medicine.

Here are some screen shots of the app:

RHadoop Project: Big Data Analytics with R and Hadoop

Below is a video featuring data scientist and RHadoop project lead Antonio Piccolboni in which he introduces Hadoop and explains how to write map-reduce statements in the R language to drive the Hadoop cluster. The RHadoop project is an open-source initiative that aims to better enable researchers to extract data from Hadoop for analysis with R and to run R within the nodes of a Hadoop cluster.
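For readers new to the map-reduce model itself, the following is a minimal single-machine sketch in plain Python. It does not use Hadoop or the RHadoop packages, and every function name and the sample reads are assumptions for illustration. It only shows the three stages: map each record to key/value pairs, group the pairs by key, then reduce each group.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}

def base_mapper(read):
    """Emit (base, 1) for every nucleotide in a sequence read."""
    for base in read:
        yield base, 1

def sum_reducer(key, values):
    return sum(values)

reads = ["ACGT", "AACC", "GGTA"]     # toy input, not real sequencing data
counts = reduce_phase(shuffle(map_phase(reads, base_mapper)), sum_reducer)
```

On a real cluster the map and reduce calls run in parallel across nodes and the shuffle moves data over the network, but the programmer still supplies only the mapper and the reducer.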

Although it is roughly two years old, a paper by Pacific Northwest National Laboratory's Ronald Taylor provides a thorough roundup of Hadoop use in bioinformatics.

Professor Says Big Data Is Just a "Fetish"

Here's one for the contrarian folder. According to Wharton School of Business professor Peter Fader, the increasing emphasis on "big data" is turning into a data hoarding fetish that will ultimately result in a wild goose chase wherein researchers won't learn nearly as much as they hope to from all of their data.

In a Q&A published in Technology Review, Fader compares those who have put all of their faith into big data to technical stock analysts, the folks who attempt to predict future stock prices based on past prices. The only issue with this approach, he says, is that, typically, it just doesn't work out that well. Their mathematical models do not take into account the myriad reasons why a stock's price may have changed over time.

What worries Fader is that data scientists are currently doing the exact same thing: loading lots and lots of data into a formula in the hope that some pattern fits.

While Fader maintains he is no big data luddite, he even finds fault with the potential of machine learning and of new platforms engineered to take advantage of huge datasets, such as Hadoop, a data processing framework with a bright future in the bioinformatics community.

"I make sure my PhD students learn all these emerging technologies, because they are all very important for certain kinds of tasks. Machine learning is very good at classification—putting things in buckets. … The problem is that there are many decisions that aren't as easily 'bucketized'; for instance, questions about 'when' as opposed to 'which.' Machine learning can break down pretty dramatically in those tasks," Fader says. "It's important to have a much broader skill set than just machine learning and database management, but many 'big data' people don't know what they don't know."

Dell Attempts to Win Back Mac OSX Developers with Linux Laptop

In an effort to win back developers who have switched over to Apple's Mac OSX operating system, Dell today unveiled an experimental laptop bundled with Ubuntu Linux and a slew of patches, drivers, and other utilities.

Dell is calling its new effort, launched today at the Ubuntu Developers Summit in Oakland, Calif., "Project Sputnik."

Why Ubuntu? It seems as though this was the lowest-hanging fruit for Dell, given this Linux flavor's wide popularity among desktop users. While almost all Linux distributions come packed with a combo of Linux, Apache HTTP server, MySQL, and Perl or Python, Dell is also intent upon eventually providing Sputnik users with a software stack from GitHub to eliminate some of the workarounds for hardware and software compatibility that Linux developers usually have to engineer.

Why a laptop? Dell's Barton George, director of marketing for Dell’s Web vertical group, stated the following on his blog:

As we continued talking to customers and developers the topic of Ubuntu kept coming up and we came across a fair number of devs who were asking for a Dell laptop specifically based on it. To my knowledge, no other OEM has yet made a system specifically targeted at devs and figured it was time to see what that might mean.

This should pique the interest of bioinformatics software developers, as there are a number of specifically bio-Linux flavored releases, including DNALinux, Debian, BioBrew, and Bioknoppix. In theory, a development community with Dell's money behind it could prove to be a great resource.

Only time will tell if Dell's efforts are enough to win back OSX developers.

Here's a video featuring Dell's Barton George discussing Sputnik.

Novel Algorithm Teaches Old Dog New Compression Tricks

A team of bioinformatics researchers at Illumina Cambridge in the UK has developed a new compression algorithm for next-gen sequencing data that improves upon an old compression standby, the Burrows-Wheeler transform, or BWT. The 18-year-old BWT serves as the basis for numerous compression and data indexing methods. However, because of its design, the technique cannot be applied successfully to the large datasets typically produced by large genome sequencing runs.

The team, led by Illumina's Anthony Cox, describes a novel algorithm that can allow the BWT of genome data to be analyzed using only "moderate" hardware, i.e. a workstation or a small cluster.

With 45x coverage of human genome sequence data that takes up roughly 135.3 GB of space, their technique can squash that data down to 8.2 GB. This is more than four times smaller than what can be achieved using a standard BWT-based compressor, such as the bzip2 format.

In addition to saving space and therefore money, the Illumina team's approach can help facilitate the construction of compressed full text indexes on large sequence collections.
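For context, squeezing 135.3 GB down to 8.2 GB is a compression ratio of roughly 16.5 to 1. The transform at the heart of the method can be sketched in a few lines. What follows is the naive textbook BWT, which sorts every rotation of the input; it is emphatically not the Illumina team's algorithm, and its memory use is exactly why the plain approach does not scale to genome-sized data. The function names and the `$` sentinel are assumptions for this illustration.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel              # sentinel marks the end of the string
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(transformed, sentinel="$"):
    """Rebuild the original text by repeatedly prepending and re-sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]

encoded = bwt("banana")              # "annb$aa"
decoded = inverse_bwt(encoded)       # "banana"
```

The transform is reversible and tends to place identical characters next to each other, which is what makes the output easier for a downstream compressor to shrink.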

Online Game as Diagnostic Tool for Malaria?

A new application that uses crowdsourcing to diagnose malaria is the latest in a continuing trend of bioinformatics being put into the hands of the masses via online gaming.

A team led by Aydogan Ozcan, an associate professor at UCLA, describes its diagnostic game, called BioGames, in a paper "Distributed Medical Image Analysis and Diagnosis Through Crowd-Sourced Games," which has been accepted for publication in PLoS One.

In the game, players distinguish malaria-infected red blood cells from healthy ones by viewing images obtained from microscopes.

Before the game begins, each player is given a brief online tutorial about what malaria-infected red blood cells look like. After completing training, players are presented with multiple frames of red blood cell images and can use a "syringe" tool to "kill" the infected cells one-by-one and use a "collect-all" tool to designate the remaining cells in the frame as "healthy."

So far, Ozcan's research indicates that a small group of non-experts playing BioGames was collectively able to diagnose malaria-infected red blood cells with an accuracy that was within 1.25 percent of the decisions made by a medical professional.
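One simple way to turn many players' answers into a single diagnosis is a majority vote per cell, which can then be scored against an expert's labels. The sketch below makes that assumption explicit: the text above does not say how BioGames combines player decisions, and the function names, threshold, and sample votes are hypothetical.

```python
def crowd_diagnosis(votes, threshold=0.5):
    """Return True (infected) if more than `threshold` of players say so.

    `votes` is a list of booleans, one per player, for a single cell image.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    return sum(votes) / len(votes) > threshold

def agreement(crowd_labels, expert_labels):
    """Fraction of cells on which the crowd matches the expert."""
    if len(crowd_labels) != len(expert_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(c == e for c, e in zip(crowd_labels, expert_labels))
    return matches / len(expert_labels)

# Hypothetical votes from three players on four cell images.
votes_per_cell = [
    [True, True, False],
    [False, False, True],
    [True, False, True],
    [False, False, False],
]
crowd = [crowd_diagnosis(votes) for votes in votes_per_cell]
expert = [True, False, False, False]   # made-up reference labels
score = agreement(crowd, expert)
```

Here the crowd disagrees with the expert on one cell out of four, for an agreement of 0.75. Individual players can be wrong as long as the majority is right.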

In the last few years, several online games have been developed to solve scientific problems using the solutions players find simply by "winning" the game. These include FoldIt, a game in which players attempt to digitally simulate the folding of various proteins, and EteRNA, which also makes use of crowds to get a better understanding of RNA folding.

The use of crowdsourcing in this context could help overcome limitations in the diagnosis of malaria.

According to Ozcan, "scaling up accurate, automated and remote diagnosis of malaria through a crowd-sourced gaming platform may lead to significant changes for developing countries."

What's in a Name?

A PLoS Computational Biology paper poses the question of whether or not the term "bioinformatics" is still in vogue.

In a paper entitled "Rise and Demise of Bioinformatics? Promise and Progress," Christos Ouzounis, a visiting professor at the University of Toronto and associate researcher and principal investigator at CERTH, argues that compared to a decade ago, when the word "bioinformatics" was used with excitement, it is now in decline. In fact, Ouzounis points to analytics from Google Trends that suggest a pattern of decline in appearances of the term "bioinformatics" in Google News, which have diminished sixfold over the last seven years.

Ouzounis writes that "such a trend cries out for an explanation. Why is it that a field that appeared unstoppable in all its glory just a few years ago might already be exhibiting signs of (media) fatigue? And does this trend indicate lack of progress, lack of interest, both, or none of the above?"

The author traces the evolution of bioinformatics from its "infancy period" (1996-2001) through the "adolescence period" (2002-2006) to the "adulthood" period (2007-2011).

In his paper, Ouzounis is essentially trying to assess the development of the field of bioinformatics and its promise by looking at the predictions made when it first came onto the scene (and into the literature). He concludes by writing that it can be argued that the declining trend of the use of the word "bioinformatics" "might be attributed mostly to the nature of the field, which found itself in the midst of the turmoil of a wider transformation, driven by industrial and social needs. In other words, it is not lack of interest and definitely not lack of progress: instead, it might be exactly the opposite. The vast progress and the dislocation of traditional biological research into a more precise and quantitative science has moved computational biology from the fringes to the eye of the storm."

Illumina Hosts App SDK Resource on AWS Cloud & Releases iPad App

Illumina announced yesterday at the BioIT World Expo in Boston that its BaseSpace Apps initiative will be coupled with the Amazon Web Services cloud.

The BaseSpace Apps offering is essentially a software development kit that allows developers to code bioinformatics software tools.

Last October, Illumina launched its BaseSpace service, which provides free data management and analysis in the cloud for users of its MiSeq sequencing platform.

In order to make BaseSpace Apps a reality, Illumina partnered with a number of companies and small startups including Diagnomics, GenoLogics Life Sciences, Genomatix, Golden Helix, Ingenuity Systems, Knome, Omicia, Spiral Genetics, Omixon, Real Time Genomics, Station X, Integromics, Biomax Informatics AG and Strand Life Sciences.

Illumina is essentially looking to create a version of Apple's App Store for bioinformatics software tools, giving users an eventual ecosystem of software with "one-click access."

In another example of how Illumina is taking its cue from Apple, the company also yesterday announced the release of its MyGenome iPad app. For a cost of $0.99, users can download the new app from Apple's App Store and have access to their genetic data right on their iPad.

MyGenome includes the following features:

Genome Map — for touring the landscape of chromosomes and visualizing how genetic variants in different locations translate into health impacts or biological traits. Users can view individual genes, their locations, and biological impacts as well as visualize where and how genome sequences differ from the "reference" human genome.

Health Cards — for exploring genetically determined conditions and predispositions, and carrier traits. Users can discover how different genetic variants can contribute to health risks and can be passed on to children, as well as find out how changes in the genome may affect drug response.

Reports — for investigating the possible health impacts of genetic variants for more than 250 conditions.

Apple's Siri App Facilitates Voice-Controlled Experimental Workflow

After an apparently prolonged period of tinkering in the lab, the BioTeam and BT Compute unveiled an impressive voice-controlled experimental workflow implementation at the BioIT World Expo in Boston this week.

Using simple voice commands, Bas Burger, president of global commerce at BT Global Services, used Apple's Siri iPhone app to initiate an experiment on BT Compute's cloud computing service that utilized the molecular dynamics program NAMD running on Accelrys' analysis pipeline software.

Siri works, at least most of the time, by sending a recording of the user's voice command to Apple's servers, where it is statistically analyzed; the result is then returned to the user's device, hopefully delivering the desired action or response.

But in order to create a seamless experience with no misunderstood commands, this method reroutes the voice command to a proxy server hosted by BT Compute, which screens it for a list of phrases related to the workflow analysis pipeline. If none of the key phrases is detected, the proxy server passes the voice command along to Apple's servers.
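The routing decision described above can be sketched as a simple phrase screen. This is only an illustration of the idea, not BT Compute's implementation: it assumes the command is already available as text, and the phrase list, action names, and function name are all hypothetical.

```python
# Hypothetical key phrases mapped to hypothetical workflow actions.
WORKFLOW_PHRASES = {
    "run simulation": "start_namd_job",
    "job status": "report_status",
}

def route_command(transcript, phrases=WORKFLOW_PHRASES):
    """Decide whether the proxy handles a command or forwards it upstream."""
    # Normalize case and whitespace so matching is not overly brittle.
    normalized = " ".join(transcript.lower().split())
    for phrase, action in phrases.items():
        if phrase in normalized:
            return ("workflow", action)        # handled by the pipeline
    return ("forward_to_apple", None)          # no key phrase: pass it along

handled = route_command("Please  RUN simulation now")
passed_on = route_command("What's the weather in Boston?")
```

The first command contains a key phrase and is routed to the workflow, while the second matches nothing and is forwarded unchanged.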

Here's a video of the demonstration: