Michael Schatz: Genome Assembly and the Cloud


Title: Assistant professor, Cold Spring Harbor Laboratory
Education: PhD, University of Maryland, 2010
Recommended by: Steven Salzberg, University of Maryland

It'd be an understatement to say that Michael Schatz supports the use of cloud computing for genomics research. In fact, he says, "cloud computing is the only way to move forward" with the large data sets that sequencers are putting out, because it would be unrealistic to expect one machine "to be able to store and sift through it all." To keep ahead of the curve, Schatz says computational biologists "need to develop very scalable ... efficient methods for analyzing huge volumes of data" and take them to the cloud.

To that end, he's developing a cloud computing-based genome assembler, called Contrail, that can handle exceedingly large data sets. Schatz's tool is as much a solution to his own research problems as it is to others'. As part of his work on a collaborative project with investigators at Washington University in St. Louis, Schatz will assemble exome sequences for approximately 3,000 families in which only one child is affected with autism. So far, he says, his assembler has shown "some very strong preliminary results" that are "competitive with ABySS and SOAPdenovo," two currently popular packages.

Papers of note

Contrail is not Schatz's first foray into cloud computing. In a Bioinformatics paper published in April 2009, he described the first cloud-based sequence analysis tool, CloudBurst, which he developed while a research assistant in Steven Salzberg's lab at the University of Maryland. CloudBurst, Schatz says, harnesses the power of cloud computing to map short sequence reads to a reference genome. By using 100 computers on the Amazon cloud, he says, he can run analyses "100 times faster than doing it on my desktop." Later that year, with Johns Hopkins' Ben Langmead and their colleagues, Schatz described Crossbow, a cloud-based SNP-calling program, in a Genome Biology paper. With Crossbow and "320 computers at Amazon, we can call SNPs in a whole human genome in about four hours, for about $100," he says.

Looking ahead

Schatz says that deciphering how best to analyze large-scale genomic data sets is the "driving force" behind most of his work and that he expects the glut of genomics data to increase over time. "The sequencing machines are still improving, but we're playing catch-up in developing the right tools and the right pipelines and the right infrastructure to make sense of it all," he says.

And the Nobel goes to ...

Though he notes that "they don't offer Nobel Prizes for computer science," Schatz says that he's hopeful for a positive outcome from the autism collaboration. "It would be awesome if we could identify a pattern in the genomics," he says. In the long run, Schatz hopes to apply "large-scale sequencing and data analysis towards understanding the origins of various diseases and, ultimately, life itself."
