What’s the next big thing for bioinformatics? It could be an old thing, if developers of next-generation sequencing technology have anything to do with it. With a host of faster and cheaper sequencing technologies slated to hit the market over the next decade, it seems that fresh opportunities abound for what appeared to be yesterday’s bioinformatics news.
“Everything old is new again,” said Jim Golden, manager of business development for CuraGen subsidiary 454. “Back in 1991 I was writing base callers for the ABI 373, then I got into drug discovery, and now here I am in 2002 working on base callers for a whole different kind of sequence because it’s not fluorescent traces anymore.”
454 is one of a handful of groups working furiously on new methods to accelerate the sequencing process. The company’s technology, which sequences DNA fragments in parallel in thousands of picoliter-scale wells, is up against several other approaches, including single-molecule detection methods that require neither the amplification nor the assembly steps of traditional Sanger-style sequencing.
The post-human-genome perception is that “sequencing is over, it’s a commodity, it’s passé,” said Golden, “and I said the same thing. Until I suddenly realized that if you can do it fast — and I mean fast — then it’s different.”
With the goal of sequencing entire organisms within minutes or seconds, 454 and its competitors hope to deliver on the promise of genomics by enabling large-scale comparative approaches, rapid genotyping and haplotyping, pharmacogenomics, diagnostics, and other application areas that remain out of reach for an industry dependent on capillary- and gel-based sequencing approaches.
But before any of these technologies can deliver on this promise, they have to deliver the data — and that could be a bigger stumbling block than it appears at first glance.
Image is Everything
Image processing will play a crucial role in next-generation sequencing approaches. 454 and UK-based sequencing startup Solexa use CCD cameras to capture the raw data from their systems, while the “polony” (PCR colony)-based method developed by Robi Mitra in George Church’s lab at Harvard Medical School uses a microarray scanner. Companies pursuing single-molecule approaches, such as VisiGen and US Genomics, are building proprietary image-based detection systems.
Spot detection is the first hurdle for Mitra’s polony-based method, he said. While the method works much like microarray spot identification, “there’s the added twist of it being a non-random distribution of spots and there’s also a problem with possible overlapping.” The development group, which only recently increased in size from two to five people, is not far enough along yet to have tackled some of the informatics requirements for the technology, “but we’re starting to really think about it,” Mitra said.
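To make the spot-detection problem concrete, here is a toy sketch (an assumed approach for illustration, not the Church lab’s actual software): threshold a grayscale image, then group bright pixels into spots by flood fill. The overlap problem Mitra mentions shows up as merged components with unusually large areas.

```python
# Toy spot finder: threshold an image, then group bright pixels into
# connected components via breadth-first flood fill. Overlapping spots
# would appear as single components with larger-than-expected areas.
from collections import deque

def find_spots(image, threshold):
    """Return a list of pixel-count areas, one per connected bright region."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    areas = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] >= threshold and not seen[r][c]:
                area, queue = 0, deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    area += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] >= threshold
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                areas.append(area)
    return areas

img = [
    [0, 9, 0, 0, 0],
    [0, 9, 0, 8, 8],
    [0, 0, 0, 8, 8],
]
print(find_spots(img, threshold=5))  # [2, 4]: two spots, areas 2 and 4
```

A real polony caller would of course work on far larger images and exploit the known (non-random) spot layout that Mitra describes; this sketch only illustrates the component-labeling step.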
Mitra added that new sequencing technologies will have to approach base calling very carefully. “You have to be extremely accurate,” he said, “because if you’re talking about resequencing the genome, I could just give you the public sequence and your genome sequence is 99.9 percent accurate. So you have to be better than [that] to even bother.” He plans to develop software that models intensity fluctuations for fluorescent molecules in different sequence contexts to help improve the accuracy of the polony-based system.
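Mitra’s 99.9 percent argument can be seen with back-of-the-envelope arithmetic (the genome size and variant rate below are assumed round figures for illustration, not from the article):

```python
# Why resequencing must beat 99.9% per-base accuracy: at that error
# rate, miscalls and true person-to-person variation are the same
# order of magnitude, so individual differences drown in noise.
GENOME_SIZE = 3_000_000_000   # ~3 Gb haploid human genome (assumed)
ERROR_RATE = 1e-3             # 99.9% accurate base calling
VARIANT_RATE = 1e-3           # ~1 true variant per 1,000 bases (assumed)

expected_miscalls = GENOME_SIZE * ERROR_RATE
expected_variants = GENOME_SIZE * VARIANT_RATE

print(f"expected miscalls:      {expected_miscalls:,.0f}")
print(f"expected true variants: {expected_variants:,.0f}")
```

With roughly three million errors against roughly three million genuine variants, a 99.9-percent-accurate resequencing run tells you little more than the public reference already does, which is exactly Mitra’s point.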
While declining to disclose the details of his technology’s detection method, Eugene Chen, CEO of US Genomics, said that its detection accuracy “seems better than current sequencing approaches so far.” The approach, which threads single DNA molecules past a reader ticker-tape style, requires only a single tagging step, Chen said, which greatly reduces the errors that multi-step sequencing approaches are prone to. However, the current system is only able to read fluorescent tags spaced around 1,000 base pairs apart. Base-by-base scanning is still three to four years down the road, Chen said.
No Assembly Required
Single-molecule approaches like that of US Genomics promise to eliminate one of the biggest bioinformatics headaches of current techniques: assembling genomic fragments into contiguous stretches of sequence, or contigs. Although US Genomics’ system is currently only capable of streaming molecules of 200,000 base pairs, Chen plans to eventually be able to run an entire chromosome through his reader at a time. Whether the assembly problem will indeed become a thing of the past depends on how well this approach scales up.
Technology under development at VisiGen Biotechnologies, which has engineered DNA polymerase to detect base identity directly from the chromosome, also promises to eliminate the assembly step in whole-genome sequencing. Susan Hardin, a biochemist from the University of Houston, and four colleagues started the company in 2000, and have since received DARPA and NIH funding to develop and commercialize the rapid sequencing technology. The company did not return calls for comment for this article, however.
Other technologies on the horizon still rely on DNA fragments and therefore require assembly, in the case of de novo sequencing, or mapping to reference sequences, in the case of resequencing.
Solexa, which places 100 million 25-mer molecules on an array and then sequences each fragment, is working on a proprietary alignment method to map each fragment against the reference human genome. CTO Tony Smith said that current alignment methods are not adequate for the high-throughput approach. “We need to do it highly efficiently because the rate at which we are going to generate data means that we’ll be aligning 1,000 25-mers per second. It’s not that it hasn’t been done before, it’s just that we need to do it more efficiently.”
Additionally, because Solexa is looking for differences in genomic regions rather than similarities, the algorithms it is developing will have to align inexact matches. “That of course means that the amount of compute power you need in order to tell the algorithm [that] it’s got to find not just perfect matches, but also look at the ability to align non-perfect matches is quite important,” Smith said. Finally, the algorithm will have to be sensitive to sequence variations such as insertions and deletions rather than just SNPs.
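The kind of inexact matching Smith describes can be sketched with a simple edit-distance scan (Solexa’s actual alignment method is proprietary, and would have to be vastly more efficient than this brute-force illustration to reach 1,000 reads per second):

```python
# Minimal sketch of inexact short-read placement: score each candidate
# position by Levenshtein distance, which tolerates SNP-style
# mismatches as well as the insertions and deletions Smith mentions.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions plus insertions/deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # (mis)match
        prev = cur
    return prev[-1]

def place_read(read: str, reference: str, max_edits: int = 2):
    """Return (position, edits) of the best placement within max_edits."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        d = edit_distance(read, reference[pos:pos + len(read)])
        if d <= max_edits and (best is None or d < best[1]):
            best = (pos, d)
    return best

ref = "GATTACAGATTACAGGGCCC"
read = "ATTACTGGG"             # ref[8:17] with one SNP (A -> T)
print(place_read(read, ref))   # (8, 1): position 8, one edit
```

Scanning every position this way costs time proportional to read length times reference length per read; a production aligner for 25-mers against a 3-billion-base reference would need indexing tricks this sketch deliberately omits.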
This additional IT intelligence, however, must remain computationally inexpensive. Noted Smith, “We don’t want to develop something where the actual sequencer is a smaller instrument than the IT infrastructure. We want to develop something that uses hardware effectively by having efficient algorithms.”
The Real Data Flood is Still to Come
Despite the variety of their approaches to data capture, base calling, and assembly, developers of post-Sanger sequencing are in agreement that the real opportunities for bioinformatics will occur downstream of these steps.
“We’re going to be dealing with essentially the equivalent of a terabyte of data coming out from our system in a second,” said Chen. “Are the current technology platforms plausible for handling that? The answer is no.”
The company is developing a method to sift through the raw signal traces and convert them to genetic information while reducing that terabyte of data to hundreds of kilobytes. “Once we simplify it, each person’s genetic information can be compared line-by-line on an Excel spreadsheet,” he said. However, he added, the complexity inherent in correlating that volume of data with “true biological problems” will present its own set of challenges.
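The intuition behind that data reduction can be illustrated with a hypothetical sketch (not US Genomics’ actual pipeline): once a genome is expressed as its differences from a reference, storage scales with the number of variants rather than the genome’s length.

```python
# Illustrative only: encode an individual's sequence as a sparse list
# of (position, reference_base, sample_base) records. For sequences
# that are mostly identical, this is dramatically smaller than the
# raw signal from which the base calls were derived.

def diff_against_reference(sample: str, reference: str):
    """Return sparse SNP records where sample and reference disagree."""
    assert len(sample) == len(reference)
    return [(i, r, s)
            for i, (r, s) in enumerate(zip(reference, sample))
            if r != s]

ref    = "ACGTACGTACGT"
sample = "ACGTTCGTACGA"
print(diff_against_reference(sample, ref))
# [(4, 'A', 'T'), (11, 'T', 'A')]
```

Each person’s record set could then be compared entry by entry, much as Chen describes comparing genetic information “line-by-line” in a spreadsheet.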
The need to visualize and analyze hundreds or even thousands of genomes at a time will drive bioinformatics developers to look at genomic data in an entirely new way, according to 454’s Golden. Traditional genome browsers “are what I need for one genome, but with 100 genomes I’m not going to be able to scroll through 100 contig viewers for 100 organisms, I’m going to need to do something different,” he said. Data mining techniques, as well as new approaches to storage and good old-fashioned computational power, will also be required to cope with the next wave of genomic data, Golden noted.
But the new sequencing methods aren’t that disruptive in everybody’s book. “You’ll have more sequence data, so you’ll be able to ask more questions and newer questions, so new tools will be developed in response to that,” said Mitra, “but it’s not something that’s necessarily unique to the technology.”