Assistant Professor, Department of Biology
Name: Gabor Marth
Title: Assistant Professor, Department of Biology, Boston College
Experience and Education: Staff Scientist (with Stephen Altschul), National Center for Biotechnology Information, 2000-2003
Postdoc (with Robert Waterston), Washington University Human Genome Center, 1995-2000
D.Sc. (Systems Science and Mathematics), Washington University, 1994
BS-MS (Electrical Engineering), Budapest Technical University, 1987
At last week’s Biology of Genomes meeting at Cold Spring Harbor Laboratory, Gabor Marth gave a talk about the informatics challenges of next-generation sequencing, and presented some software tools his group at Boston College has developed.
In Sequence spoke with Marth last week to get more details.
Tell me about your background. Where does your interest in next-gen sequencing derive from?
Originally, I was at the Wash U Genome Center, [where I was a postdoc] after my PhD, where we did genome sequencing informatics for the Human Genome Project. When people started thinking about not only sequencing a single genome, but to see what the difference is between the different genomes, we started writing computer software and developing methods to find polymorphisms. I developed an algorithm called polyBayes, which was one of the first comprehensive polymorphism discovery tools, looking for SNPs and short insertions and deletions in sequences.
Then I went to the [National Center for Biotechnology Information], where we used these and other tools for the first large-scale organismal SNP discoveries, and collaborated with a bunch of other places to publish the first big polymorphism map of the human genome that came out in Nature in 2001.
In addition, I did population genetics and ancestral demographic modeling, but when these machines made their presence felt, there was a real need to apply the old methods, and update a lot of the methods, and write new methods to leverage the next-generation sequencing data.
Why was there such a need? What is different about the new technologies, compared to traditional Sanger sequencing?
Virtually everything. [One difference is that the] signals that come from these machines are fundamentally different from the Sanger machines. [The closest to the] four-color traces that the old Sanger machines produce is the Illumina, or Solexa, sequencer, which produces a four-color image, but it’s still different, because it’s discrete positions where you measure color intensities. The 454 sequencer is very different because it does not measure individual nucleotides; it measures intensities of two or three or as many nucleotides as are incorporated in a single mononucleotide run. And then the [Applied Biosystems] SOLiD technology is again very different, because the measurement is made in what they call color space. So basically, just to interpret these [data] and produce the nucleotides and base confidence values, or base quality values, it’s actually quite different for all these machines.
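To make the 454 signal model concrete, here is a minimal sketch of how a flow-based read might be turned into bases. It assumes a repeating nucleotide flow order of "TACG" and simply rounds each intensity to the nearest whole number of incorporated bases; the function name and the sample signals are hypothetical, and a real base caller such as the one shipped with the instrument is considerably more sophisticated.

```python
def call_bases_from_flows(flow_order: str, signals: list) -> str:
    """Naively convert per-flow intensities into a base sequence.

    Each flow presents one nucleotide; the intensity is roughly
    proportional to how many copies of it were incorporated in a row.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Round to the nearest integer run length, never below zero.
        run_length = max(0, int(signal + 0.5))
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Hypothetical intensities for two cycles of the assumed flow order.
signals = [1.02, 0.05, 2.1, 0.9, 0.1, 2.96, 0.0, 1.1]
print(call_bases_from_flows("TACG", signals))
```

An intensity close to a half-integer, such as 2.5, could plausibly be two or three bases, which is one way to see why homopolymer run lengths are the uncertain part of this kind of read.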
The second [difference] is the read length. Sanger reads tended to be 750, even up to 1,000 base pairs after they optimized the technology. With these [new machines], even the 454 FLX machine, [produces only] about 250 base-pair reads, and the really high-throughput sequencers, the Illumina and the SOLiD, they produce what’s called short reads, up to 50 [base pairs, but] typically [you get] more like 30 base pair reads.
And the data that comes off of them is just humungous. We are talking about multiple gigabytes per run. In the old paradigm, you looked at sequences as individual files, for example. You can no longer do that. Just being able to manage this data on a computer and access it fast enough [so] you can do something with 100 million or 200 million reads in a project is just a huge challenge.
How do you analyze the data?
The first challenge [is] you have to look at the raw data that these machines produce, and you have to interpret them and translate them into DNA bases, and [assign] confidence values, which tell you how accurate you think that base is. The general name for such software is ‘base caller.’ For some of the technologies it’s more important to write base callers because the ones that are supplied with the machine don’t perform very well.
We have written several other base callers for various needs, so we have the methodology down, and at least for the 454 machine, [we] were able to write a base caller [called PyroBayes] that seems to perform a lot better. But for example, for the Solexa platform, we didn’t have to write one, because the base calls that come with the machine are actually quite accurate. There is one step that we do with them, but it’s a fairly simple step, more like an adjustment, a calibration step.
[For] the other technologies — the SOLiD technology, for example — we are only starting to get data from them [now]. And [ABI is] actually very interested in us working with their data, but we haven’t actually seen much of their data.
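The confidence values a base caller assigns are commonly expressed on the Phred scale, where a quality Q corresponds to an estimated error probability of 10^(-Q/10). The sketch below shows that conversion only; it assumes the Phred convention rather than describing how PyroBayes or any vendor base caller computes its probabilities, and the function names are illustrative.

```python
import math


def phred_quality(error_probability: float) -> int:
    """Convert an estimated per-base error probability to a Phred-scaled quality."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return round(-10 * math.log10(error_probability))


def error_probability(quality: int) -> float:
    """Convert a Phred-scaled quality back to an error probability."""
    return 10 ** (-quality / 10)


# A base with a 1-in-1,000 chance of being wrong has quality 30.
print(phred_quality(0.001))
```

On this scale, a calibration step of the kind described for the Solexa base calls would amount to adjusting reported qualities so that they match the error rates actually observed.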
The second [step], which I think is really the crux of dealing with these short reads, is the sequence alignment. [We developed a program for that called Mosaik.] There is a lot of commonality between [different aligners currently being developed] in an algorithmic sense. They have to take these reads and quickly find where, in large genomes, they could possibly fit. Usually, [this kind of] software has a first, sort of quick-and-dirty step, where you are looking through the genome very fast and have an initial scan of where this read could be aligned. Usually, there is a secondary step where you take a more in-depth look. That’s common between many of these programs. What differentiates the programs is the specifics of how they actually do it and how much effort goes into optimizing the code to various read lengths.
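The two-step pattern described here can be sketched as follows. This is a generic illustration, not Mosaik's actual implementation: it assumes a hash index of fixed-length words (k-mers) for the quick scan and a simple mismatch count for the in-depth check, and all names and parameters are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_positions(read: str, index: dict, k: int) -> set:
    """Quick scan: each k-mer of the read proposes a start position for the read."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if start >= 0:
                candidates.add(start)
    return candidates


def align_read(read, reference, index, k, max_mismatches=2):
    """In-depth step: compare the whole read at each candidate position."""
    hits = []
    for start in sorted(candidate_positions(read, index, k)):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # the read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


reference = "ACGTTGCAAGGCTTACGATCGGA"
index = build_kmer_index(reference, k=4)
# A read taken from position 7 with one substituted base.
print(align_read("AAGGCTAACG", reference, index, k=4))
```

The second step in this sketch only tolerates substitutions, which is exactly the limitation discussed next.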
Another thing that makes [our] software different is, [it can] deal with situations where a short read has an inserted or a deleted base, relative to the reference genome. For example, I know that Illumina’s own software, called Eland, does not have that capability. I know some of the other software [packages] that people are writing can deal with substitution-style differences, but not insertions or deletions. If you cannot align reads that have insertions or deletions relative to the reference sequence, you cannot detect polymorphisms that are insertions or deletions. [This capability] would [also] be an absolute requirement for the 454 reads, because the number of bases in a homopolymeric run is highly variable with the 454 technology. So if you cannot align reads with a couple of base pairs of insertions or deletions, you are going to be throwing out a lot of the reads.
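A small example shows why gap-aware alignment matters. The sketch below compares a read against a reference in two ways: an ungapped comparison that counts only substitutions, and a standard dynamic-programming edit distance that also allows insertions and deletions and lets the read start anywhere in the reference. The scoring (one unit per difference) and the function names are assumptions for illustration, not a description of any particular aligner.

```python
def best_ungapped_mismatches(read: str, reference: str) -> int:
    """Fewest substitutions needed at any position, with no gaps allowed."""
    n = len(read)
    return min(
        sum(a != b for a, b in zip(read, reference[s:s + n]))
        for s in range(len(reference) - n + 1)
    )


def best_gapped_distance(read: str, reference: str) -> int:
    """Fewest substitutions, insertions, or deletions needed to fit the read
    somewhere in the reference (start and end in the reference are free)."""
    previous = [0] * (len(reference) + 1)  # the read may start anywhere
    for i, read_base in enumerate(read, start=1):
        current = [i]  # aligning i read bases against nothing costs i
        for j, ref_base in enumerate(reference, start=1):
            substitution = previous[j - 1] + (read_base != ref_base)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return min(previous)  # the read may end anywhere


reference = "ACGTTGCAAGGCTTACGATCGGA"
# The reference segment GCAAGGCTTACG with one T of the "TT" run missing,
# as might happen when a homopolymer run is under-called.
read = "GCAAGGCTACG"
print(best_gapped_distance(read, reference))
print(best_ungapped_mismatches(read, reference))
```

With gaps allowed, the read fits with a single difference; without them, every base after the missing T is shifted out of register and the read looks much worse than it is.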
There are a couple of other things that are going into the algorithmic details. Sometimes, what you want from an assembler is to take a read and place it somewhere in the genome, as long as it can be uniquely placed. But there are many reads that come from really repetitive regions, so there is not a single place where you can place them. Then decisions have to be made: Do you just throw this read out, or do you report every position it can be placed? Sometimes, you are interested in a read not only if it exactly matches somewhere in the genome, but if it matches with a couple of mismatches or insertions.
To find all these locations for a read is actually computationally very intensive. So assemblers will vary in terms of their performance, and their philosophy as to how they will deal with this situation, and whether they are capable of reporting and really finding every position. And it all depends on what your application is, because sometimes it’s not a problem if you don’t find them all. If you just want to know whether there is a single location, or if there is more than a single location, that’s one possible way to look at it. Another application might declare that you find them all, so you know every place where this read could be placed. Our [assembler] is flexible in the sense that we are able to specify how we want our alignments.
Can Mosaik also be used for different read lengths?
Yes, that was the No. 1 design consideration, that we can do it for the short reads, up to 50 base pairs in length; we can do it for the medium size, the 100- to 250-base-pair 454 reads; and the ABI [capillary electrophoresis] reads, which are up to 1,000 base pairs [long]. Because the idea is that [for] some applications, you want to co-assemble reads from the different platforms. People are still exploring how to use these machines for de novo sequencing and resequencing, for structural variations, and SNP discovery and mutational profiling, so you want flexibility in the aligner, so that you can try out various assembly strategies, and then you pick the best one, the one that gives you accuracy at the lowest cost. Plus, the way we view our assembler is basically as a research tool. If we need to align transcriptome sequences, as opposed to genome sequences, there may be different algorithmic requirements for it.
The third difference between different aligners is that, it’s one thing to align a single read to a genome. And it’s another thing to align many reads to the genome and then make multiple alignments from all those reads, where each read is not only aligned relative to the reference genome, but relative to each other. That’s what sometimes people call a multiple alignment, or sometimes people call it an assembly. Mosaik has functional units, it has the aligner, and it has the assembler. And the assembler takes each read aligned to the genome individually and then makes a montage out of it, [which is] the multiple alignment. Most programs don’t actually do that; very few programs can do this assembly step.
Have you published a description of Mosaik?
No, it’s new. I have a phenomenal student, Michael Strömberg, who is developing it, and he is a real pro, but he is a second-year graduate student. He just developed this, and the publication plan is for this summer, and a beta release is [due] hopefully next month.
In your talk, you mentioned the concept of ‘resequenceability.’ Can you explain that a little?
If you take a read and you can place it into two or more different locations in the genome, because it aligns to all those locations, then you really cannot say with confidence where the DNA came from, because it could have been coming from any of those locations. So regions that are so repetitive that you cannot assign a read to them uniquely are not really resequenceable because you really don’t know whether the read came from there or someplace else. Of course it’s not an absolute concept because it may be that with a single read, you cannot really decide whether this read came from here or there. But with a paired-end read, because the other end of that DNA fragment can be uniquely placed, you [can now] choose between the locations where this read came from. So really, resequenceability is a relative concept that depends on read length. It may even depend on the number of errors you expect in a read, and it depends on your strategy, whether you are doing single reads or you are doing paired-end reads. And then for each of these technologies, you can make reasonable decisions of what you consider resequenceable or not.
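The dependence on read length can be illustrated with a toy calculation. The sketch below takes every possible error-free read of a given length from a sequence and reports the fraction that occur exactly once, meaning they could be placed uniquely. It is a deliberate simplification under stated assumptions: exact matches only, one strand only, single reads rather than paired ends, and a made-up ten-base "genome"; the function name is hypothetical.

```python
from collections import Counter


def uniquely_placeable_fraction(genome: str, read_length: int) -> float:
    """Fraction of start positions whose exact read occurs only once in the genome."""
    reads = [
        genome[i:i + read_length]
        for i in range(len(genome) - read_length + 1)
    ]
    counts = Counter(reads)
    unique = sum(1 for read in reads if counts[read] == 1)
    return unique / len(reads)


toy_genome = "ACGTAACGTC"  # contains the repeated segment ACGT
print(uniquely_placeable_fraction(toy_genome, 3))
print(uniquely_placeable_fraction(toy_genome, 5))
```

In this toy case, reads of length 3 that fall inside the repeat cannot be placed uniquely, while reads of length 5 always extend past it, so the same sequence is more "resequenceable" with longer reads.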
Tell me about the assembly format working group you are heading.
That relates to the data volumes, the huge amounts of data as it comes off the machines. After that, when we take those reads, and we align them to the genome and produce an assembly, all the data has to be represented in a way that the downstream software can use. For example, if you have a viewer application, so you want to look at the assembly, and you have to look at, say, 200 million reads in a 200-megabase genome, the amount of data will be so large that you can’t keep all that stuff in the computer’s memory. So you have to find ways in which you can pan across the genome, or focus in on specific regions of the genome. And you have to manage the data in such a way that it’s not all kept in the memory, but it’s very fast to read them from disk, for example. The take-home message is that you really have to keep the data in formats that are conducive for easy and fast access by other software applications that people then use.
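The access pattern described here, fetching only the reads that cover the region being viewed, can be sketched with a position-sorted index. The example below keeps everything in memory for brevity; the same lookup could point at file offsets instead, so that only the requested slice is read from disk. The class name, record layout, and half-open coordinate convention are assumptions for illustration and do not describe any existing assembly format.

```python
import bisect


class RegionIndex:
    """Find reads overlapping a region without scanning every alignment."""

    def __init__(self, alignments):
        # Each alignment is a (start, length, read_id) tuple; sort by start.
        self._records = sorted(alignments)
        self._starts = [record[0] for record in self._records]
        self._max_length = max((record[1] for record in self._records), default=0)

    def overlapping(self, region_start: int, region_end: int) -> list:
        """Return ids of reads overlapping the half-open region [start, end)."""
        # No read starting before this point can be long enough to reach the region.
        lo = bisect.bisect_left(self._starts, region_start - self._max_length + 1)
        # Reads starting at or after region_end cannot overlap it.
        hi = bisect.bisect_left(self._starts, region_end)
        return [
            read_id
            for start, length, read_id in self._records[lo:hi]
            if start + length > region_start
        ]


index = RegionIndex([(0, 5, "r1"), (3, 5, "r2"), (10, 5, "r3"), (20, 5, "r4")])
print(index.overlapping(4, 11))
```

Because the lookup is a binary search plus a short scan, the cost of a query depends on how many reads fall near the region rather than on the total number of reads in the assembly.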
There are two groups: the first one is the short-read format group that’s managed by the University of British Columbia. Their [goal] is to produce data formats that the machine manufacturers would be subscribing to. When they produce their data, the way it comes off the machine, in that standard format, then it’s easy for genome centers and other users to immediately use.
The other [group deals with] the assembly format, which we moderate here at Boston College. Here the thrust is slightly different: It is to produce data formats that are conducive for applications. In addition to just the file format, we are also collaborating to produce software libraries that people could use and other software developers would have access to [for] their applications. They would produce just pre-canned methodologies to access the data in an efficient way.
You also developed a viewer, EagleView?
One of my postdocs is developing [EagleView] to be able to look at very large assemblies of tens of millions, or hundreds of millions of reads and be able to browse through [them] very fast. The function of the viewer has also changed. Back in the old days, when people were finishing genomes based on long reads, they would edit reads if they thought that the base caller was making a mistake. And that’s gone away; there is no way anybody would edit 100 million reads; that’s just not going to happen. Primarily, these applications are there for quality assessment and for software development, because that way, you can look at the data and you can see whether your software tools are doing the right thing with the data.
What about your update of polyBayes, your SNP calling software?
Basically, I am developing quite new, quite different versions of that software now for use with these short reads. The major differences are regarding performance. Looking at a few thousand, even 100,000-long ABI reads is quite different from looking at 100 million reads, or 5 million reads even, with these short reads. The performance had to be really improved in this application. Plus the data types are changing; these data that we collect with these new short-read machines are all haploid, meaning that they only sequence one or the other chromosome. Also, for SNP calling, it’s very important to know the number of DNA molecules involved in your sequences. [In a cancer sample, it could be] many-ploid and only a small fraction of the cells [might] actually have the cancer mutations, whereas others don’t. So these are algorithmic details, but they are very important for accurate mutation detection.
You mentioned testing the software in a number of projects involving the 454 and the Illumina platforms. You said you are now starting to analyze ABI SOLiD data. Are you hoping to add other platforms as they come along, like Helicos?
I have not seen any Helicos data, [and] I don’t know anyone who has seen Helicos data other than the Helicos guys themselves, but I understand that there will be some data released there, too. We are obviously very interested in their methodology. Their machines will be different yet again; the type of data they produce will be different from the other four. That’s what my lab does: We want to look at every new data source. We look at them critically and see where software is needed, and if we can, would like to develop software to be able to leverage the data for the community.
When are you going to analyze a mammalian genome using short reads?
The projects that you have heard about so far with the short-read machines have been on smaller than human genomes. It turns out that the informatics scale-up is actually very substantial even up to that point. The C. elegans genome is 100 megabases, the Drosophila genome is 180 megabases, the human genome is 3 gigabases. So we have another 20- to 30-fold scale-up [to do]. If you need maybe a couple of runs of Solexa data to cover the C. elegans genome, then you need 20 to 30 times that much to cover the human genome at the same coverage.
I really think that’s the last big scale-up there. I think on the data end, people have whole human datasets, but I think to do the comprehensive study that you can do for a 100-megabase genome, we are [still] working on the informatics of that; I have not seen that happening. I would still give it another probably four to six months before we can reliably, confidently do mammalian-style comprehensive genome analysis with these short-read sequences.