BGI's Genome Superhighway

BGI announced today the successful transfer of roughly 24 gigabytes of genomic data from Beijing to UC Davis in under 30 seconds using a link that connects US and Chinese research networks.

The transfer demonstration actually took place on June 22 during an event in Beijing celebrating a new 10 Gigabit US-China network connection supported by Internet2, an advanced networking consortium led by the US research community that is focused on developing multi-terabit networking pipelines for researchers across the globe.

That same 24 gigabyte file sent over the public Internet would take over 26 hours from upload to completed transfer.
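As a rough check on these figures, the sketch below computes the ideal transfer time for 24 gigabytes over a 10 gigabit-per-second link, and the average throughput implied by a 26-hour transfer. It assumes decimal units and ignores protocol overhead; the function names are illustrative and not taken from any tool mentioned here.

```python
def transfer_seconds(size_gigabytes: float, link_gbps: float) -> float:
    """Ideal time to move a file over a link, ignoring protocol overhead."""
    size_gigabits = size_gigabytes * 8  # 1 byte = 8 bits
    return size_gigabits / link_gbps


def implied_mbps(size_gigabytes: float, hours: float) -> float:
    """Average throughput (megabits per second) implied by a transfer time."""
    size_megabits = size_gigabytes * 8 * 1000  # decimal units assumed
    return size_megabits / (hours * 3600)


fast = transfer_seconds(24, 10)
slow = implied_mbps(24, 26)
print(f"10 Gbps link: {fast:.1f} s")
print(f"26-hour transfer implies about {slow:.2f} Mbps")
```

At full line rate the file would need about 19 seconds, which is consistent with the reported sub-30-second transfer, while the 26-hour figure corresponds to an average of roughly 2 megabits per second.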

"The 10 Gigabit network connection is even faster than transferring data to most local hard drives," says Dawei Lin, director of the bioinformatics core at the UC Davis Genome Center. "The use of a 10 Gigabit network connection will be groundbreaking, very much like email replacing hand delivered mail for communication. It will enable scientists in the genomics-related fields to communicate and transfer data more rapidly and conveniently, and bring the best minds together to better explore the mysteries of life science."

Nature-Nurture Data Visualization

Researchers at King's College London's Institute of Psychiatry are using data visualization software to study nature versus nurture in a geographical context.

The research team is headed up by the Twins Early Development Study at the MRC Social Genetic and Developmental Psychiatry Centre, which studied 13,000 pairs of both identical and non-identical twins born between 1994 and 1996.

When 6,759 of these twin pairs were 12 years old, the investigators conducted a number of tests to measure their behavioral traits, including hyperactivity, cognitive abilities, and IQ scores, and recorded their geographic environments.

Using the open-source visualization software package spACE, the UK team created a color-coded map of genetic and environmental variation.

The spACE package is freely available for Windows, Mac OS X, and Linux.


Among the results revealed by the map, 60 percent of the difference in traits — such as classroom behavior — could be linked to genes. However, in South East London, it seems that environment plays more of a role than genes.

For more detail, their paper "Visual analysis of geocoded twin data puts nature and nurture on the map" appears in Molecular Psychiatry.

Twisted Terabit Transfers

In a paper published this week in the journal Nature Photonics, an international team of researchers led by a group at the University of Southern California in Los Angeles describe a method for transferring data at 2.56 terabits per second.

That's 85,000 times faster than today's 30 megabit-per-second broadband Internet, roughly the equivalent of transferring 70 full-length DVDs in about one second.
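The comparison can be checked with a few lines of arithmetic. This sketch assumes the broadband baseline is 30 megabits per second and that a single-layer DVD holds 4.7 gigabytes; both values are assumptions made for illustration.

```python
TWISTED_LIGHT_BPS = 2.56e12  # 2.56 terabits per second, as reported
BROADBAND_BPS = 30e6         # assumed baseline: 30 megabits per second
DVD_BYTES = 4.7e9            # assumed single-layer DVD capacity

# Ratio of the two line rates.
speedup = TWISTED_LIGHT_BPS / BROADBAND_BPS

# Convert bits per second to bytes per second, then to DVDs per second.
dvds_per_second = (TWISTED_LIGHT_BPS / 8) / DVD_BYTES

print(f"speedup: about {speedup:,.0f}x")
print(f"DVDs per second: about {dvds_per_second:.0f}")
```

Under those assumptions the ratio comes out a little above 85,000 and the DVD count just under 70, in line with the figures quoted.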

The method in question uses "twisted light" beams to carry data through a new data stream channel, like a radio with its very own radio station.

The team, composed of investigators from China, Pakistan, and Israel, says that the benefit of their data transmission technology is that it eliminates the need for bandwidth altogether.

If you'd like to dive into the photonic nitty-gritty, the details are in their Nature Photonics paper.

But before you get too excited, this technology is not going to be available any time soon. Apparently the Earth's atmosphere interferes with twisted light data transmission over long distances. So while this may not prove to be a viable replacement for the current Internet infrastructure, there could be a future for this technology in the data center in the form of interconnects or networking fabric, assuming there are processing cores fast enough to handle that rate of data transfer.

A group from the University of Pittsburgh demonstrated a different data transmission technique earlier this year, at rates that also left current bandwidth limitations in the dust. The team, led by Hrvoje Petek, generated a "frequency comb" spanning over 100 terahertz of bandwidth using a group of atomic motions in a silicon semiconductor crystal.

The Ins and Outs of the World's Fastest Supercomputer

While processors are the obvious place to look when attempting to understand how the world's fastest supercomputers can deliver such speed, two very crucial factors that contribute to performance are the storage resource and the file system.

The architects of Lawrence Livermore National Laboratory's Sequoia supercomputer — currently the world's fastest supercomputer — needed a storage and file system that could top Japan's 10-petaflop K Computer.

LLNL selected NetApp's High Performance Computing Solution for Lustre, which combines the open-source Lustre parallel file system with NetApp's scalable rack storage solutions.

Thanks in part to the new upgrade, Sequoia can now process roughly 1.3 terabytes of data per second — resulting in a 16.3 petaflop peak performance capability.

In this video produced by InsideHPC, Bryan Berezdivin from NetApp describes the company's recent storage deployment for Sequoia.

NetApp has a track record of building out storage facilities for life science informatics cores, including the Duke Institute for Genome Sciences and Policy and the Stanford Genome Technology Center at Stanford University.

Broken Cooling Fan Takes Down Amazon's Cloud

Amazon Web Services experienced yet another outage last week, this time due to the failure of a cooling fan. Yup, you read that right — a cooling fan brought down the great Amazon cloud. The cloud outage affected a number of websites, including Heroku, Pinterest, Quora, and HootSuite.

The outage originated at a northern Virginia data center when the facility lost utility power. The center switched to backup generator power, but nine minutes later a defective cooling fan caused one of the backup generators to overheat and fail.

The incident began on June 14 at 8:44 pm PDT and lasted until 10:19 pm PDT.

According to the AWS Service Health Dashboard blog:

At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity).

So it seems that cloud computing continues to receive doses of reality that counter its posturing as the ultimate replacement for the onsite data center as well as the promise of infinite resiliency and reliability.

Last week's outage was the third major outage in the last 14 months for the "US-East-1 availability zone" — Amazon's oldest availability zone based in a data center in Ashburn, Virginia.

Last April, the US-East-1 zone had a major outage as well as another less serious incident in March.

And back in 2010, the US East region experienced a series of four outages in a single week.

US #1 on Top 500 Thanks to IBM

Looks like the US is back on top of the supercomputing game, thanks to IBM.

The latest edition of the Top 500 list includes the Department of Energy's Lawrence Livermore National Laboratory's Sequoia BlueGene/Q supercomputer, an IBM system that has 1.6 million compute cores capable of 16.32 petaflops.

Sequoia beat out Japan's K Computer for the number one spot on the list.

The June Top 500 list seems to indicate growth in the use of both co-processors and GPUs. According to the list, 58 of the Top 500 supercomputers use accelerators, up from 39 six months ago, and 53 of those supercomputers use GPU chips.

Amazon Cloud Reaches Its First Trillion Objects

The first trillion objects are always the hardest.

Amazon Web Services has announced that its S3 cloud storage service is now storing one trillion objects (1,000,000,000,000 or 10^12). That's 142 objects for every person on the planet or 3.3 objects for every star in our galaxy.

Basically, it would take you 31,710 years to count them all, says Amazon's blog, so just to reiterate, the idea is: Their cloud stores a lot of data.
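The arithmetic behind these comparisons is easy to reproduce. The sketch below assumes a counting rate of one object per second and a 365-day year, then backs out the population and star counts implied by the per-person and per-star figures; the variable names are illustrative.

```python
OBJECTS = 10**12                      # one trillion objects
SECONDS_PER_YEAR = 365 * 24 * 3600    # assumes a 365-day year

# Time to count every object at an assumed rate of one per second.
years_to_count = OBJECTS / SECONDS_PER_YEAR

# Totals implied by "142 objects per person" and "3.3 objects per star".
implied_population = OBJECTS / 142
implied_stars = OBJECTS / 3.3

print(f"years to count: {years_to_count:,.0f}")
print(f"implied world population: {implied_population:.2e}")
print(f"implied stars in the galaxy: {implied_stars:.2e}")
```

The results line up with a world population of roughly seven billion and a galaxy of roughly 300 billion stars.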

Amazon notes that the total has kept growing even as customers use the S3 object expiration feature, which allows users to specify an expiration date on their data. So far, S3 users have used this feature to delete over 125 billion objects since its release last year.

Two things about this announcement are notable. First, Amazon's cloud as a storage solution is here to stay; the days of the cloud as an IT novelty or fad to be viewed with suspicion are clearly over.

The other noteworthy takeaway is the object expiration feature.

This new feature might come in handy for small labs or individual investigators if they have an informatics pipeline that connects a sequencing platform directly to the cloud, a model currently being pursued by Illumina in the form of their BaseSpace cloud service.
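For a sense of what an expiration policy might look like in such a pipeline, here is a minimal sketch that builds a lifecycle rule as a plain Python dictionary. The rule layout follows the lifecycle configuration format that the S3 API accepts; the key prefix, rule ID, bucket name, and 90-day retention period are all hypothetical.

```python
import json


def expiration_rule(prefix: str, days: int, rule_id: str) -> dict:
    """Build one lifecycle rule that expires objects under a key prefix."""
    if days <= 0:
        raise ValueError("days must be a positive integer")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


# Hypothetical policy: drop raw sequencer output 90 days after upload.
lifecycle = {"Rules": [expiration_rule("raw-runs/", 90, "expire-raw-runs")]}
print(json.dumps(lifecycle, indent=2))

# With boto3 installed and credentials configured, the same dictionary could
# be applied as follows (the bucket name is a placeholder):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-lab-bucket", LifecycleConfiguration=lifecycle)
```

Scoping the rule to a prefix keeps processed results out of its reach, so only the bulky raw data ages out.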

Drug Discovery with 65,000 Processors

In this video, the University of Tennessee at Knoxville's Cynthia Peterson discusses an Integrated Graduate Education and Research Training Program project for Scalable Computing and Leading Edge Innovative Technologies for Biology.

Peterson's particular area of focus within this project is to develop a supercomputing-based research tool based on validated and widely used docking approaches adapted for high-throughput screening of millions of compounds in a single day. With virtual screening, target protein computations can now be run on 65,000 or more processors in parallel on a supercomputer, completing in one day what typically took several weeks.

Galaxy Gets an HPC Injection

Researchers can thank the Pittsburgh Supercomputing Center, or PSC, for seriously ramping up the Galaxy platform. The folks at PSC have just completed a new super high-speed link from Galaxy to the National Science Foundation's Extreme Science and Engineering Discovery Environment, or XSEDE, using their Three Rivers Optical Exchange, a 10-gigabit-per-second fiber-optic high-performance Internet networking hub.

There are currently over 10,000 Galaxy users running 4,000 to 5,000 analyses per day, and that number is growing, so the need for an HPC resource on steroids is clear.

"The network connection to XSEDE through PSC is a huge breakthrough," says Anton Nekrutenko, Galaxy co-developer and associate professor at Penn State. "It provides us with the ability to run up to 150,000 jobs per month, and we expect to quadruple that as this link gets fully up and running. It allows biologists to take advantage of HPC resources in ways they otherwise could not, not only the computing, but the storage resources at XSEDE sites. It democratizes research by making XSEDE useful for a scientific community that traditionally has not been a heavy user of high-performance computing."

The new bandwidth hookup is made possible by a four-year, $1.5 million grant from NSF's Academic Research Infrastructure program.

Big Data Is "Creepy"

At the recent DataEDGE conference hosted by the UC Berkeley School of Information, senior Microsoft Researcher Danah Boyd had this to say about the issue of privacy and Big Data:

Privacy is a source of tremendous tension and anxiety in Big Data, it's a general anxiety that you can't pinpoint, this odd moment of creepiness.

While Boyd spends most of her efforts studying privacy and children's use of social media platforms like Twitter and Facebook, she pointed to personalized genetic data as an example of just how creepy things can get. "If I give away data to 23andMe, I'm giving away some of my brother's data, my mother's data, my future kid's data. … Who owns the e-mail chain between you and me?"

Boyd's discussion is explored on The New York Times' Bits blog in a post by Quentin Hardy. In it, Hardy points out that the definition of privacy changes depending upon who you're talking to, but regardless of the definition, privacy is not the same as security or anonymity — two things no one really has anymore.

But whereas youngsters can hide their identities and lives on Facebook in plain sight by continually destroying and creating new accounts or by using steganography, ensuring privacy within the realm of personalized genomics is exponentially more complex. For that, the only antidote is regulation, because right now, ignorance is breeding anxiety.

"Regulation is coming, you may not like it, you may close your eyes and hold your nose, but it is coming…Technologists need to re-engage with regulators — we need to get to a model where we really understand usage," she says. "We have very low levels of computational literacy, data literacy, media literacy, and all of these are contributing to the fears."

Commercial Flash Storage Offerings Continue to Grow

The latest storage vendor to roll out a flash/disk array that uses dedupe and compression to deliver impressive cost per gigabyte numbers is Tegile. Their new Zebi product is a multi-protocol box that comes equipped with iSCSI, SMB, NFS, Fibre Channel, dedupe, and compression.

The Zebi box has 10 to 90 terabytes of raw storage capacity, and with the inline dedupe and compression it can provide up to three to five times the raw capacity. The Zebi HA2100EP model has 96 gigabytes of RAM, 1.2 terabytes of flash, and 16 terabytes of disk, and the A J2100 expansion tray has 800 gigabytes of flash and 28 terabytes of disk capacity.
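To put the data-reduction claim in concrete terms, the sketch below multiplies the quoted raw capacities by the quoted three-to-five-times range. It is simple arithmetic on the vendor's figures, not a model of how dedupe or compression behaves on real data, and the function name is illustrative.

```python
def effective_capacity_tb(raw_tb: float, reduction_ratio: float) -> float:
    """Usable capacity if data shrinks by the given ratio before storage."""
    return raw_tb * reduction_ratio


# Quoted raw capacity range (terabytes) and quoted reduction range.
for raw in (10, 90):
    low = effective_capacity_tb(raw, 3)
    high = effective_capacity_tb(raw, 5)
    print(f"{raw} TB raw -> {low:.0f} to {high:.0f} TB effective")
```

Actual ratios depend heavily on how repetitive and compressible the stored data is, so these numbers are best read as upper-end estimates.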

The company’s main pitch for their system is that roughly 75 percent less capacity is needed for seven times the IOPS.

So far, Tegile’s customer list for their Zebi box includes Washington and Lee University, which has used the box to achieve seven times the IOPS of its previous storage and a 70 percent reduction in virtual desktop infrastructure capacity needs.

Flash as a large-scale storage medium came to the forefront in 2009, when the San Diego Supercomputer Center built the first flash-memory based supercomputer, called Dash. Since then, Appro, Amax, and Nimbus (to name a few) have all started offering flash-based solutions targeting the HPC market.

So why flash? It provides consistently faster data transfer and lower latency than traditional mechanical hard drives. And because there is no motor to power, flash drives use less energy, thus saving money.

In addition to SDSC, academic researchers from Carnegie Mellon University and Intel Labs Pittsburgh have been working with an experimental cluster architecture called FAWN, or Fast Array of Wimpy Nodes. Each node consists of an embedded single-core 500 MHz AMD Geode LX processor board and a 4 GB compact flash card.

Bacterial Proteins and Boolean Logic

At the heart of even the most sophisticated processors are good ol' Boolean logic gates.

Johns Hopkins University School of Medicine researchers are using a technique called chemically inducible dimerization, or CID, to engineer cells to function like Boolean gates.

While there is research demonstrating logic gates using biomolecules in Petri dishes, using whole cells is a different story. Previous efforts have tried to take advantage of transcriptional machinery, but this can be a slow process, often taking anywhere between minutes and days, which is way too long for computing.

"People like to have speedy computation," says Takanari Inoue, an assistant professor at Johns Hopkins University School of Medicine. "We were hoping to achieve computation in cells on the order of seconds, which is significantly faster than what people have achieved thus far."

Because AND and OR gates need two different inputs, either together or separately, Inoue's team had to develop two different CID systems that didn't interfere with each other.

One CID system uses two proteins called FRB and FKBP in combination with a drug called rapamycin, which is derived from bacteria, and the second CID system uses proteins GID1 and GAI.
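For readers who want a refresher on the logic involved, here is a plain-software sketch of two-input AND and OR gates. It is not a model of the team's cellular system; the two Boolean inputs simply stand in for the presence or absence of each chemical inducer, and all names are illustrative.

```python
from itertools import product


def and_gate(input_a: bool, input_b: bool) -> bool:
    """Output is on only when both inputs are present."""
    return input_a and input_b


def or_gate(input_a: bool, input_b: bool) -> bool:
    """Output is on when at least one input is present."""
    return input_a or input_b


def truth_table(gate):
    """Evaluate a two-input gate for every combination of inputs."""
    return [(a, b, gate(a, b)) for a, b in product([False, True], repeat=2)]


for name, gate in (("AND", and_gate), ("OR", or_gate)):
    print(name)
    for a, b, out in truth_table(gate):
        print(f"  A={a!s:<5} B={b!s:<5} -> {out}")
```

The truth tables make clear why the two inputs must not interfere with each other: each gate's output depends on being able to set one input without disturbing the other.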

So far, Inoue's testing has shown that their logic gates can produce the desired responses reliably, in a matter of seconds.

More detail on their research is available in their recently accepted Nature Chemical Biology paper "Generation of Intracellular Logic Gates with Two Orthogonal Chemically Inducible Systems."

ORNL's GPU-Powered Titan Does Real Science

Well, it looks like the age of GPU-HPC has officially arrived. In the video below, Oak Ridge National Laboratory Director of Science Jack Wells presents a series of research projects accelerated by the hybrid supercomputer Titan.

This monster computer, which weighs in at roughly 20 petaflops of computing power, is actually ORNL's Jaguar supercomputer reborn as a Cray supercomputer crammed with GPUs.

Titan, which became operational earlier this year, uses both AMD Opteron processors and Nvidia's Kepler series GPUs, which are specifically designed with a high core count for HPC applications.