Michael Schatz: Genome Assembly and the Cloud

Title: Assistant professor, Cold Spring Harbor Laboratory
Education: PhD, University of Maryland, 2010
Recommended by: Steven Salzberg, University of Maryland

It'd be an understatement to say that Michael Schatz supports the use of cloud computing for genomics research. In fact, he says, "cloud computing is the only way to move forward" with the large data sets that sequencers are putting out, because it would be unrealistic to expect one machine "to be able to store and sift through it all." To keep ahead of the curve, Schatz says computational biologists "need to develop very scalable ... efficient methods for analyzing huge volumes of data" and take them to the cloud.

To that end, he's developing a cloud computing-based genome assembler, called Contrail, that can handle exceedingly large data sets. Schatz's tool is as much a solution to his own research problems as it is to others'. As part of his work on a collaborative project with investigators at Washington University in St. Louis, Schatz will assemble exome sequences for approximately 3,000 families in which only one child is affected with autism. So far, he says, his assembler has shown "some very strong preliminary results" that are "competitive with ABySS and SOAPdenovo," two currently popular assembly packages.

Papers of note

Contrail is not Schatz's first foray into cloud computing. In a Bioinformatics paper published in April 2009, he described the first cloud-based sequence analysis tool, CloudBurst, which he developed while a research assistant in Steven Salzberg's lab at the University of Maryland. CloudBurst, Schatz says, harnesses the power of cloud computing to map short sequence reads to a reference genome. By using 100 computers on the Amazon cloud, he says, he can run analyses "100 times faster than doing it on my desktop." Later that year, with Johns Hopkins' Ben Langmead and their colleagues, Schatz described Crossbow, a cloud-based SNP-calling program, in a Genome Biology paper. With Crossbow and "320 computers at Amazon, we can call SNPs in a whole human genome in about four hours, for about $100," he says.
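For readers curious how read mapping can be spread across many machines, the sketch below illustrates the general seed-and-extend idea in a MapReduce style, the programming model CloudBurst is built on. It is a toy, single-process illustration and not CloudBurst's actual code: the function names, the seed length `K`, and the `max_mismatches` parameter are all assumptions made for the example. On a real cluster, the map and reduce steps would run in parallel on separate machines, with each machine handling its own share of the seeds.

```python
from collections import defaultdict

K = 4  # seed length; hypothetical value, real tools use longer seeds


def map_phase(name, seq, is_reference):
    """Emit (k-mer, record) pairs for every k-mer in a sequence."""
    for offset in range(len(seq) - K + 1):
        yield seq[offset:offset + K], (is_reference, name, offset)


def shuffle(pairs):
    """Group emitted records by k-mer, as a MapReduce framework would."""
    groups = defaultdict(list)
    for key, record in pairs:
        groups[key].append(record)
    return groups


def reduce_phase(records):
    """For one shared k-mer, pair each read seed with each reference seed."""
    ref_hits = [(n, o) for is_ref, n, o in records if is_ref]
    read_hits = [(n, o) for is_ref, n, o in records if not is_ref]
    for read_name, read_off in read_hits:
        for _, ref_off in ref_hits:
            start = ref_off - read_off  # implied start of the read on the reference
            if start >= 0:
                yield read_name, start


def map_reads(reference, reads, max_mismatches=1):
    """Return {read name: [(start, mismatches), ...]} for acceptable alignments."""
    pairs = list(map_phase("ref", reference, True))
    for name, seq in reads.items():
        pairs.extend(map_phase(name, seq, False))

    # Seed step: collect candidate placements from shared k-mers.
    candidates = set()
    for records in shuffle(pairs).values():
        candidates.update(reduce_phase(records))

    # Extend step: compare the whole read against the reference window.
    alignments = {}
    for name, start in sorted(candidates):
        read = reads[name]
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            alignments.setdefault(name, []).append((start, mismatches))
    return alignments


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTAACCGG"
    reads = {"r1": "TGCAAGGC", "r2": "TTAACTGG", "r3": "GGGGGGGG"}
    print(map_reads(reference, reads))
```

In this made-up example, `r1` matches the reference exactly, `r2` matches with a single mismatch, and `r3` shares no seed with the reference and so is never considered. Because each k-mer group is processed independently, the work divides naturally among machines, which is the property that lets adding computers shorten the run time.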

Looking ahead

Schatz says that deciphering how best to analyze large-scale genomic data sets is the "driving force" behind most of his work and that he expects the glut of genomics data to increase over time. "The sequencing machines are still improving, but we're playing catch-up in developing the right tools and the right pipelines and the right infrastructure to make sense of it all," he says.

And the Nobel goes to ...

Though he notes that "they don't offer Nobel Prizes for computer science," Schatz says that he's hopeful for a positive outcome from the autism collaboration. "It would be awesome if we could identify a pattern in the genomics," he says. Ultimately, Schatz hopes to apply "large-scale sequencing and data analysis towards understanding the origins of various diseases and, ultimately, life itself."