Note to Readers

Informatics Iron ceased publication Monday, August 13, 2012. For more informatics news, please check out GenomeWeb's other coverage here.

Supercomputer Provides Simulations for Alzheimer's Study

For the past five years, Joan-Emma Shea, a professor at the University of California, Santa Barbara, has run thousands of amyloid peptide simulations on the Ranger supercomputer at the Texas Advanced Computing Center (TACC) in an attempt to better understand Alzheimer's disease.

Shea is using the 579-teraflop supercomputer to look for data in support of the hypothesis that toxicity in the brain is caused by small, transient molecules.

While the accumulation of amyloid plaques (long, knotty fibrils that result from misfolded proteins) has long been associated with brain cell death, researchers have now begun to look at oligomers, the precursors of fibrils.

Her work is made possible with a grant from the National Institutes of Health and the National Science Foundation-funded Extreme Science and Engineering Discovery Environment (XSEDE) initiative, which aims to make computational resources widely available to the US research community.

Relatively speaking, a 579-teraflop supercomputer is not that impressive in a world where the fastest supercomputer currently clocks in at 16.32 petaflops. But this goes to show that the XSEDE program has the right stance: researchers don't need the fanciest, fastest system in the world; they just need easy access to a reliable HPC system with respectable flops performance.

The Ranger supercomputer allows Shea to model or simulate what structures the amyloid peptides are adopting, at resolutions far exceeding what is possible experimentally.

"The number of atoms is huge; we need a lot of computational resources to simulate them. Nothing that we're doing here is something that we could do on our home clusters. The scale of it is intractable," says Shea. "With growing computational resources and capabilities, we'll be able to look at how these proteins interact with membranes ... We're far away from simulating a whole cell, but we can start incorporating additional elements that may turn out to be important."

Ranger is a Linux-based system whose 82 compute racks house quad-socket compute nodes; it runs the Lustre file system across 72 I/O servers.

Soon, Shea will be able to take advantage of the TACC's "Stampede" supercomputer, which is slated to come online in early 2013 and will be 20 times more powerful than Ranger.
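For a sense of scale, the performance figures quoted in this item can be put side by side. The short Python sketch below uses only the numbers given above (579 teraflops for Ranger, 16.32 petaflops for the current fastest system, and the "20 times" projection for Stampede); the variable names are illustrative, and the Stampede figure is simply that multiplier applied to Ranger, not an official specification.

```python
TERA = 1e12
PETA = 1e15

ranger_flops = 579 * TERA        # Ranger, as quoted in the article
fastest_flops = 16.32 * PETA     # fastest system, as quoted in the article

# How many Rangers would it take to match the fastest machine?
ratio_to_fastest = fastest_flops / ranger_flops

# "20 times more powerful than Ranger", taken literally
stampede_flops = 20 * ranger_flops

print(f"Fastest system vs. Ranger: {ratio_to_fastest:.1f}x")
print(f"Projected Stampede: {stampede_flops / PETA:.2f} petaflops")
```

By these figures the top system is a little over 28 times faster than Ranger, and a machine 20 times Ranger's size would land at roughly 11.6 petaflops.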

To submit a proposal to request an allocation, you can visit the XSEDE website.

Chinese Team Accelerates Distance Matrix Algorithm with GPUs

Chinese researchers are taking advantage of GPU computing to develop treatments for conditions such as hemophilia, cystic fibrosis, Down syndrome, and sickle-cell disease.

A group at China's Shanghai Jiao Tong University has accelerated the DNADist application, a program used to compute distance matrices from nucleotide sequences, by a factor of 16 using GPUs.

DNADist enables researchers to extract information from data which could lead to a better understanding of the causes of, and treatments for, genetic diseases.

The speedups were achieved using the OpenACC programming standard for parallel computing, which uses "directives" or hints for the compiler to identify which portions of code can be accelerated.
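To see why a distance-matrix program is a natural target for this kind of acceleration, it helps to look at the shape of the computation: every pair of sequences is compared independently of every other pair, so the nested loop is exactly the sort of region a directive can flag for offload. The Python sketch below is only an illustration of that loop. It is not OpenACC (which annotates C, C++, or Fortran code) and it is not DNADist's source; the function names are hypothetical, and the Jukes-Cantor correction is chosen for simplicity, not because the Shanghai team is known to have used it.

```python
import math


def p_distance(a, b):
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed difference fraction p."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at or beyond saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise distances between aligned sequences."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    # Each (i, j) pair is independent of the others: this is the loop nest
    # that a compiler directive would mark as safe to run in parallel.
    for i in range(n):
        for j in range(i + 1, n):
            d = jukes_cantor(p_distance(seqs[i], seqs[j]))
            matrix[i][j] = matrix[j][i] = d
    return matrix


aligned = ["ACGTACGT", "ACGTACGA", "ACGTTCGA"]
dm = distance_matrix(aligned)
```

The work grows with the square of the number of sequences, which is why spreading the pairwise comparisons across many GPU threads pays off on large inputs.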

Accelerating the DNADist application allows researchers to study a significantly larger range of input data and obtain actionable information earlier in the disease treatment research process.

If you would like to learn more, there will be a webinar on OpenACC and DNADist September 6th, which you can register for here.

NCI Develops Roadmap for Open Development

The National Cancer Informatics Program, part of the National Cancer Institute, has a blog post describing its journey on the road to an open-development software ecosystem.

Juli Klemm, associate director of Integrative Cancer Research Products and Programs at the Center for Biomedical Informatics and Information Technology at NCI, writes that for more than two years, she and her colleagues have been discussing and thinking about how to create such an ecosystem around the digital resources developed through the caBIG program.

They are not starting from scratch — the caBIG program is already open-source, with the "caBIG Open Source License" that allows users to download and alter code to suit their needs.

"We recognize that meeting the challenge of creating robust and useful tools to support the rapidly evolving needs of the cancer research community requires an open and collaborative approach to software development," Klemm writes.

NCI has looked to a number of similar efforts to help it build a roadmap for how to create this ecosystem. Its role models have included the open-development efforts of the Veterans Administration and NASA as well as the open-source efforts of Apache, Mozilla, and Google.

In June of 2011, the NCIP issued an RFI in an attempt to collect input from the community on what their open-development ecosystem should look like.

While they only received 20 responses, Klemm writes that some of the respondents joined them at their Rockville offices and formed a sort of open-source think tank.

Click here to read the results of the meeting and what guidelines Klemm and her colleagues are moving forward with.

Cloud Compliance for Genomics

Well, it looks like the argument that the cloud can never be used to host patient data securely is becoming increasingly moot, or at least harder to make.

GenomeQuest announced this week the rollout of a Health Insurance Portability and Accountability Act-certified "genomic decision support system" in the cloud, referred to as the GQ-Dx platform.

GQ-Dx is basically an IT-support system that allows labs to create diagnostic reports from next-generation sequencing data.

Just to review, HIPAA compliance means that medical data, genomic or otherwise, must be stored, transmitted, and accessed according to a strict set of security and privacy protection standards. The certification steps entail specific training for IT personnel, audits by HIPAA inspectors, as well as required reporting and guarantees to ensure that data is kept safe at all times.

HIPAA also requires that patient data never leave the US and that the physical security of the facility housing the hosting hardware be adequate.

The concern with the cloud and patient data has centered on whether or not there can ever be a simple and effective way of ensuring that every identifiable piece of patient data will never be exposed as it is being moved and stored on either a private cloud or a large public cloud hosting service, such as Amazon's EC2.

As is often the case in cloud computing, the GenomeQuest announcement is a bit vague: there's no explanation of where the hosting will physically be located or how exactly the company plans to secure genomic data at the networking and hosting levels.

The Centers for Disease Control and Prevention has made some progress in this area using Amazon's AWS GovCloud to create a secure, HIPAA-compliant cloud for hosting a national repository of syndromic surveillance data. The CDC has also built HIPAA-compliant gateways, including data exchanges with Beth Israel Deaconess and the Boston Public Health Commission for the transfer of data to the CDC cloud.

Other hosting services have also claimed HIPAA compliance in the last year, including Firehost, Symform, Logicworks, and ClearDATA, so it looks like cloud service providers are tackling the challenge of HIPAA compliance head-on.

TACC's API for Scientific Computing on the Web

A new tool developed by a team at the Texas Advanced Computing Center promises to make scientific computing via web interfaces easier and more powerful.

Called the AGAVE API — A Grid and Virtualization Environment — this programming tool allows developers and users to launch a computational experiment using supercomputing resources in a relatively seamless fashion.

"When services have been built to that level, research starts moving really fast," says Rion Dooley, a research associate at TACC and one of the creators of the API. "You can start leveraging manpower and focus exclusively on the science rather than the computation and technology needed to accomplish that science."

For software developers, the AGAVE API provides a way to add tools to Web interfaces and to offload data management and experiment execution to a supercomputer with a few lines of code.

For users, the AGAVE API essentially provides science as a service.

This API is currently being used within the iPlant project, which allows users to take advantage of compute resources at the Pittsburgh Supercomputing Center, San Diego Supercomputer Center, and TACC through the XSEDE project.

The developers of the BioExtract Server, a distributed database that consolidates data from heterogeneous biomolecular databases, are also using the AGAVE API to help scale out their resource to meet the demands of bioinformatics researchers.

Software Licensing for the Scientist-Software Developer

A PLoS Computational Biology paper by a team from Harvard Medical School and the University of California, Berkeley's School of Law addresses the challenges of navigating the complex legal landscape of software licensing.

The aim of their guide is to better enable researchers with their own software to engage with their institutions' tech transfer office.

The paper provides an overview of various types of software licenses, such as proprietary licensing, free and open source software licensing, and hybrid software licensing. It also covers how to choose the right license and how to actually apply one to your software.

[Figure: schematic representation of license directionality]

The authors of the paper are members of SBGrid.org, a consortium of scientific software developers that acts as a middleman between developers and end users of lab-generated software tools.

Taking a Cue From Angry Birds

A group of software developers from the University of Alabama at Birmingham has designed an image analysis app that installs and runs in much the same way as Angry Birds or Instagram.

Their new app — called ImageJS — is described in a recent paper published in the Journal of Pathology Informatics entitled "ImageJS: Personalized, participated, pervasive, and reproducible image bioinformatics in the web browser."

ImageJS allows pathologists to load digital pathology slides into a Web browser app, where the slides can be analyzed for evidence of tumor cell growth.

"ImageJS" gets its name from "ImageJ," an image-analysis application developed by the National Institutes of Health that was written in the Java language, and which allegedly took hours to program and integrate into a hospital's patient data system.

[Video: demonstration of ImageJS]

According to the team, led by Jonas Almeida, director of the Division of Informatics in the UAB School of Medicine's Department of Pathology, the pathology modules are only the first in a series that will eventually include genomics analysis capabilities. The idea is that enabling such quick-and-easy comparisons will increase diagnostic accuracy and improve treatment plans.

Almeida and his team hope that pathologists will take this app and run with it, partnering with bioinformaticians to create other modules using the ImageJS open-source code.

ImageJS is currently available from the Google Chrome App store, Google Code, and GitHub.

GigaScience Offers Readers More Than Just Open-Access Papers

The inaugural issue of GigaScience, together with its sister database GigaDB, is now available.

GigaScience describes itself as an online, open-access, open-data journal that takes submissions related to life sciences that use large-scale data. What makes GigaScience interesting is that standard manuscript publications are offered with an extensive database that hosts all the related data in the study as well as analysis tools and cloud-computing resources.

For example, a research article by University College London's Stephan Beck that looks at whole-genome analysis of DNA methylation is available with an 84-gigabyte supplemental file of research data.

The first issue features papers by EMBL-European Bioinformatics Institute's Guy Cochrane on the future of DNA sequence archiving and Cold Spring Harbor's Michael Schatz discussing the concept of a "digital immune system" composed of 'omics data to improve public health.

GigaScience is published jointly by BGI and BioMed Central.

Intel-NextBio Collaboration Aims to Perfect Hadoop for Genomics

Intel has teamed up with NextBio in a collaboration focused on perfecting the Hadoop stack for analyzing large-scale genomic datasets.

Hadoop is a Java-based programming framework that supports the processing of large datasets in a distributed computing environment. It grew out of ideas Google published on its MapReduce and Google File System technologies, and was later picked up and developed for enterprise scenarios by Yahoo.

Engineering teams from NextBio and Intel say they will offer up their solutions and innovations to the open-source community once completed.

In the quotes below, the phrase "big data" is apparently now being referred to as "Big Data" — with a capital "B" and "D." This IT marketing neologism is meant to ensure that proper respect is given to this data — which is of a very big nature.

"Intel is firmly committed to the wide adoption and use of Big Data technologies such as [the Hadoop Distributed File System], Hadoop, and HBase across all industries that need to analyze large amounts of data," says Girish Juneja, CTO and General Manager of Big Data Software and Services at Intel. "Complex data requiring compute-intensive analysis needs not only Big Data open source, but a combination of hardware and software management optimizations to help deliver needed scale with a high return on investment. Intel is working closely with NextBio to deliver this showcase reference to the Big Data community and life science industry."

Compression Genomics

A collaborative effort between researchers at MIT and Harvard University has produced a new, high-speed genome search algorithm described in the latest issue of Nature Biotechnology.

The new algorithm combines the power of data compression algorithms with genome alignment search tools.

Capitalizing on the fact that most currently sequenced genomes are very similar to previously collected ones, the team exploited this redundancy to allow for computation on compressed genome data. For a set of highly similar genomes, this approach cuts analysis time to roughly the time it takes to operate on a single genome.

“You have all this data, and clearly, if you want to store it, what people would naturally do is compress it,” says Bonnie Berger, a professor at MIT and senior author on the paper. “The problem is that eventually you have to look at it, so you have to decompress it to look at it. But our insight is that if you compress the data in the right way, then you can do your analysis directly on the compressed data. And that increases the speed while maintaining the accuracy of the analyses.”
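The idea of analyzing compressed data directly can be illustrated with a deliberately simplified toy. In the Python sketch below, "compression" just means storing each distinct sequence once, along with the names of every sample that shares it; a search then scans each distinct sequence a single time and fans the hits back out. This is an assumption-laden illustration of the principle only: it handles exact duplicates, whereas the published method deals with sequences that are merely similar, and none of the function or sample names come from the paper.

```python
from collections import defaultdict


def compress(genomes):
    """Store each distinct sequence once, with the names of all samples sharing it."""
    unique = defaultdict(list)
    for name, seq in genomes.items():
        unique[seq].append(name)
    return dict(unique)


def find_all(seq, query):
    """Return every start offset of query in seq, overlapping matches included."""
    hits = []
    start = seq.find(query)
    while start != -1:
        hits.append(start)
        start = seq.find(query, start + 1)
    return hits


def search_compressed(unique, query):
    """Scan each distinct sequence once, then report the hits for every owner."""
    results = {}
    for seq, names in unique.items():
        hits = find_all(seq, query)
        if hits:
            for name in names:
                results[name] = list(hits)
    return results


# Hypothetical samples: three are identical, one is different.
genomes = {
    "sample_a": "ACGTACGTTAGC",
    "sample_b": "ACGTACGTTAGC",
    "sample_c": "ACGTACGTTAGC",
    "sample_d": "TTTTACGAAAAA",
}
unique = compress(genomes)
hits = search_compressed(unique, "ACGT")
```

Here four samples collapse to two stored sequences, so the search does half the scanning while still reporting results for every sample. The more redundant the collection, the larger the saving.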


As described in their Nature Biotechnology paper, the researchers have implemented accelerated versions of both BLAST and BLAT and underscore the importance of compression as a way to cope with ever-increasing amounts of genome data.

One obvious drawback of an approach like this is that, as more genomes are added to a database, the speed resulting from the analysis of compressed genomes decreases.

Click here to download the source code for the prototype of their implementations.

A Linux Server on the Cloud in Under One Hour (Maybe)

Don't have your Red Hat engineer certification yet but need to set up a Linux server? No problem, says an article over at InfoWorld, you can just spin up your own using a release called "TurnKey Linux" on Amazon's cloud.

They walk you through the process of establishing this flavor of Linux on the cloud step-by-step here.

While TurnKey Linux can run on any ordinary server, InfoWorld says it can be "mindlessly" set up on Amazon's cloud in under one hour. Chances are that even a Red Hat expert would be hard pressed to make everything work flawlessly in under 60 minutes, but TurnKey is an interesting offering nonetheless.

It will soon be released with a version built upon Debian Squeeze, a reportedly very stable release of the Debian Linux distribution. And this version actually comes in an assortment of prepackaged, ready-to-use servers including Linux, Apache, MySQL, and PHP/Python/Perl, just to name a few.

Get your stopwatches ready and good luck.

Healthcare and the Cloud

Mount Sinai School of Medicine and cloud-solutions provider Cloudera have partnered up to develop new methods for analyzing data in genomics and multiscale biology, among other disciplines.

In an accompanying video, Cloudera's chief scientist Jeff Hammerbacher talks about the collaboration and other work they've done using the cloud to support healthcare.