HINXTON, UK — Second-generation sequencing is driving rapid development across the full range of bioinformatics tools that make up sequence-analysis pipelines, according to speakers at this year’s Genome Informatics conference, held at the Wellcome Trust Genome Campus here this week.
Cold Spring Harbor Laboratory’s Richard McCombie told BioInform that data management and analysis are “crucial,” particularly now that the capacities of second-generation sequencers are increasing so radically. “We have to re-think and re-work our pipelines,” he said.
Indeed, next-generation sequencing pipelines were a common theme at the meeting, as a number of speakers discussed new approaches for variant calling, alignment, assembly, and analysis, as well as ways of integrating those tools into streamlined computational frameworks.
While manufacturers provide their own data-analysis pipelines for second-generation sequencing instruments, “We have specific applications and want to do further analysis after running the pipeline provided by the vendors,” said computational biologist Quang Trinh from the Ontario Institute for Cancer Research.
One computational challenge discussed at the conference is detecting structural variation in genomes using short reads from second-generation sequencers like the Illumina Genome Analyzer and Applied Biosystems’ SOLiD. Aligning these reads to a reference genome and interpreting them is one area where new algorithms and software are needed, said Richard Durbin, a principal investigator at the Wellcome Trust Sanger Institute.
“It is a challenge to find truly novel variations that are not false positives,” he said. Placing the read on the genome is “tricky,” he said, while another challenge is calling the variant once “you know where you are.”
A better understanding of variation and its role in disease requires digging ever deeper for variants.
Erin Pleasance, a postdoctoral researcher at the Sanger Institute, garnered significant interest with her talk about the institute’s Cancer Genome Project, an attempt to “sequence the cancer genome” using ABI’s SOLiD and identify “the somatic mutations that are cancer-specific.”
With whole-genome shotgun sequencing and analysis, she and her colleagues are looking across the full spectrum of variation, from small substitutions to large-scale structural changes, including copy-number changes across the entire genome, in order to understand the functional impact of these changes.
The project is currently in a pilot phase, and is studying one cancer genome from the small cell lung cancer cell line NCI-H209 and the genome from a matched normal cell from the same individual. The researchers are looking to obtain 20x coverage of both the normal and the cancer genomes, with the goal of about 120 billion bases, she said.
The researchers already have capillary sequencing data for approximately 15 megabases of exons in this cell line from previous work, so they have cataloged the known mutations and known SNPs, which is “useful for validation, obviously,” she said.
This sequence will help to validate the variant-calling algorithms under development in her group, she explained. Also in the works are methods to identify insertions and deletions of various sizes, as well as ways to map rearrangement breakpoints to the base-pair level directly from the short read data.
The group is trying to pick out the subset of reads that fall near a breakpoint, which they pinpoint through comparison with the reference genome. Using the European Bioinformatics Institute's Velvet algorithm [BioInform 11-16-07], they perform a local de novo assembly of these reads and map the resulting contig back against the reference genome.
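In outline, the workflow described above can be sketched as follows: collect reads mapped near a candidate breakpoint, assemble them locally, and use the resulting contig to localize the break. The greedy exact-overlap merge below is a toy stand-in for a real assembler such as Velvet, and all names and data are illustrative.

```python
def reads_near(alignments, chrom, pos, window=500):
    """alignments: (chrom, mapped_position, read_sequence) tuples.
    Return reads mapped within `window` bp of a candidate breakpoint."""
    return [seq for c, p, seq in sorted(alignments)
            if c == chrom and abs(p - pos) <= window]

def greedy_merge(reads, min_overlap=5):
    """Merge position-sorted reads on exact suffix/prefix overlaps."""
    contig = reads[0]
    for read in reads[1:]:
        for k in range(min(len(contig), len(read)), min_overlap - 1, -1):
            if contig.endswith(read[:k]):
                contig += read[k:]
                break
    return contig

alignments = [
    ("chr1", 1000, "ACGTACGTAC"),
    ("chr1", 1004, "ACGTACGGTT"),  # read extending past the (toy) breakpoint
    ("chr2", 5000, "TTTTTTTTTT"),  # far away; excluded by the window
]
reads = reads_near(alignments, "chr1", 1002)
contig = greedy_merge(reads)
# Re-aligning this contig to the reference would localize the break
# to base-pair resolution.
```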
OICR’s Trinh, as well as other speakers, agreed that there is no tried-and-true computational approach for large-scale cancer genome analysis. Trinh said there is currently a lack of tools for visualizing structural variation, for example.
“When there are no tools for something, we have to write them in-house.”
Both Trinh and Pleasance said that they use the alignment tool Maq for variant calling. Maq was developed by Heng Li and Durbin along with Jue Ruan of the Beijing Genomics Institute.
Durbin and colleagues recently published a paper describing Maq in Genome Research.
Maq, or Mapping and Assembly with Qualities, maps short reads to reference sequences and can also call variants. It assigns each alignment a mapping quality score, a phred-scaled estimate of the probability that the read has been placed at the wrong position, so users can judge how much to trust each mapping.
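Mapping qualities follow the same phred convention used for base qualities: a score Q corresponds to an error probability of 10^(-Q/10). A minimal sketch of that relationship (illustrative only, not Maq's internal code):

```python
import math

def phred_to_error_prob(q):
    """A phred-scaled score Q encodes an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Inverse: convert an error probability to a phred score."""
    return -10 * math.log10(p)

# A mapping quality of 30 means roughly a 1-in-1,000 chance that the
# read alignment is wrong; quality 20 means about 1 in 100.
q30_error = phred_to_error_prob(30)
```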
Trinh told BioInform that he also uses the University of Toronto's SHRiMP, the SHort Read Mapping Package, as well as other tools, including those provided by vendors.
“We are trying to understand the data generated from these tools and be able to compare [them],” he said. “We need to be able to trust the data coming off the instrument … and be sure that each instrument is consistent … We can’t expect one instrument to provide an error rate of 0.5 [percent] and the other has 1.25 percent.”
He and his colleagues want to be able to compare the data coming off the different platforms in order to assure consistency. “What if the same sample is run on a different instrument, do we get the same result?” he asked.
With large-scale lab ventures come large-scale challenges, involving much data that requires increasing amounts of computing power. But “more data doesn’t mean we need to buy more machines, so we need to come up with a pipeline to process that data more efficiently,” Trinh said.
Data-analysis pipelines for second-generation sequencing usually include image analysis, base calling, and alignment. “For us, it is additional processing as well, [and] that is where there is a big gap — the post-processing analysis that we need to do,” Trinh said.
Short Reads, Large Memory
Inanc Birol of Canada’s Michael Smith Genome Sciences Centre in Vancouver said in his presentation that existing de novo assembly tools allow scientists to build contigs from short reads, but for larger genomes the computer memory requirements make assembly at the gigabase scale a challenge.
The tool he and his colleagues developed, ABySS, for Assembly By Short Sequences, constructs a directed de Bruijn graph from read substrings; unlike other assembly algorithms, it partitions that graph across processors and can therefore parallelize assembly.
“If you do a de Bruijn graph, it is basically an n log n algorithm, it is close to linear time,” Birol explained to BioInform. With Sanger sequencing, the read number is low and other approaches to assembly can work, but with short reads there are “prohibitively high” memory requirements, he said.
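The de Bruijn strategy Birol described decomposes reads into overlapping k-mers and links each k-mer to its successor, so the graph can be built in a single pass over the reads rather than by all-against-all read comparison. A minimal, single-process sketch (ABySS itself additionally distributes this graph across cluster nodes):

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge from its
    (k-1)-base prefix to its (k-1)-base suffix. One pass over the reads,
    so construction scales with the total number of k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Overlapping reads reuse the same nodes, which keeps the graph compact.
g = de_bruijn_graph(["ACGTG", "CGTGA"], k=3)
```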
ABySS distributes the job of contig growth over a CPU cluster. He and his colleagues have found that a Linux cluster with 160 AMD Opteron CPUs with 2 GB RAM per core is sufficient to assemble the human genome using 36-base pair paired-end reads covering the genome with 30-fold to 40-fold redundancy.
“One of the motivations to develop this software was how to achieve results on commodity hardware,” he said, adding that the fewer nodes scientists need for assembly, the better.
Birol said that his team has just released a parallelized version of ABySS. Any Linux cluster that supports the Message Passing Interface, or MPI, protocol can run the software, he said.
Eliot Margulies of the Genome Informatics Section at the NIH’s National Human Genome Research Institute told BioInform that ABySS is “a unique way to partition things on the computational side.”
Margulies spoke about a different partitioning approach to address the challenges of short-read assembly. It involves creating so-called “reduced representation,” or RR, libraries that make up a subset of the genome and which can then be run on a short-read sequencing platform.
The basic idea is to cut the genome with a restriction enzyme, run the digest on a gel, and excise different bands. Because each RR library contains a defined size fraction of DNA, the libraries can be assembled without a reference sequence. “We create these different libraries and when we assemble them we know that we are assembling just a smaller part of the whole genome,” he said.
“Hopefully it will be the standard approach to wanting to sequence a large genome, where you don’t have to worry about the cumbersome process of creating clonal libraries which are very expensive, very laborious, and take a long time to do,” Margulies said.
His group has applied this concept to the Drosophila genome, showing that it is possible to capture the majority of the roughly 125-Mb genome and assemble it with Velvet.
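The reduced-representation scheme can be mimicked in silico: split the sequence at every occurrence of a restriction site, then keep only fragments in a chosen size band, the computational analog of cutting one band out of a gel. The enzyme site and size band below are arbitrary examples.

```python
def digest(sequence, site):
    """Split the sequence at each occurrence of a restriction site.
    (A real enzyme cuts within its site; splitting on it is enough
    to illustrate the fragment-size idea.)"""
    return sequence.split(site)

def size_fraction(fragments, lo, hi):
    """Keep only fragments within one size band, like excising a
    single band from a gel."""
    return [f for f in fragments if lo <= len(f) <= hi]

genome = "AAAAGAATTCCCCGAATTCTTTTTTTTGAATTCGG"
fragments = digest(genome, "GAATTC")   # EcoRI recognition site
band = size_fraction(fragments, 3, 6)  # one reduced-representation library
```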
‘Toe in the Water’
Next-gen sequencing vendors are also developing new methods for handling the data from their instruments. It’s about “putting our toe in the water,” said Richard Carter, a data analyst from Illumina who presented a poster at the meeting on the company’s software tool CASAVA.
CASAVA allows scientists to combine sequencing data to obtain a consensus sequence call and a list of SNPs. It includes Bacon, a Bayesian allele-calling algorithm the company developed. CASAVA is currently in beta testing, Carter said, and will be available to users “shortly” as an add-on to the company’s current analysis pipeline.
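Bacon’s internals were not described in detail, but in its simplest form a Bayesian allele caller scores each candidate genotype by the likelihood of the observed bases under that genotype, weighted by a prior. A toy diploid version with a made-up error rate and heterozygosity prior (this is not Illumina’s algorithm):

```python
import math
from itertools import combinations_with_replacement

def call_genotype(observed_bases, error_rate=0.01, het_prior=0.001):
    """Toy Bayesian diploid caller: pick the genotype maximizing
    prior * likelihood of the observed base calls. Illustrative only."""
    best, best_score = None, float("-inf")
    for gt in combinations_with_replacement("ACGT", 2):
        prior = (1 - het_prior) if gt[0] == gt[1] else het_prior
        score = math.log(prior)
        for base in observed_bases:
            # Each allele emits the true base with prob 1 - error_rate,
            # or one of the three other bases with prob error_rate / 3.
            p = sum((1 - error_rate) if base == allele else error_rate / 3
                    for allele in gt) / 2
            score += math.log(p)
        if score > best_score:
            best, best_score = gt, score
    return best
```

With uniformly concordant base calls this returns a homozygous genotype, while an even split of two bases returns a heterozygote, since the heterozygous likelihood outweighs its small prior.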
“Essentially, it will take the Eland [Illumina’s alignment tool] output, collate it, bin it, sort it in such a way that you can easily go from all your reads organized by chromosome and position, and then make a consensus sequence call for every base,” he said.
In validation experiments, the company has found 99.5 percent agreement with genotyping data, he said. CASAVA can run on a single node or be parallelized.
Helicos also presented a poster on implementing de Bruijn graph-based approaches to detect sequence variations within heterogeneous samples such as cancer and normal cells and heterozygous variants within a single individual.
An Open Source Processing Pipeline
Other groups are concentrating on building complete toolkits, rather than single algorithms, for second-generation sequence analysis.
Brian O’Connor, a bioinformatician in Stanley Nelson’s laboratory at the University of California, Los Angeles, presented an open source toolkit called SeqWare, formerly known as SolexaTools, which is slated for a 1.0 release in the next few weeks and is freely available.
O’Connor told BioInform after his talk that he wanted to pick up where the Illumina analysis pipeline left off — after base calling, quality scores, and the Eland alignment algorithm.
“It imposed certain limitations and was only designed, for example, to have a small number of mismatches and you couldn’t do something like detecting indels, which is something we are really interested in,” said O’Connor.
Scientists want to obtain, for example, a list of small insertions and deletions or single nucleotide variants along with a p-value, and “a lot of those details are still being worked out,” he said.
O’Connor noted that one reason he attended the meeting was to seek out software that will give him “high confidence in what I am calling a single nucleotide variant or small indels.”
SeqWare offers a LIMS to track sequencing runs as well as a pipeline framework to organize data analysis. “When we started [with second-generation sequencing], I was getting Post-it notes about what was being run on the machine,” he said, adding that tracking which material goes onto a given flow cell is not necessarily a problem sequencing companies are trying to solve.
Unlike microarray data, which one can process and normalize “blindly,” in sequencing, “you really need to know what the experiment was…so you know what to align that to,” he said.
O’Connor said that about two-thirds of his time is spent helping researchers within his lab group and the rest of his time is devoted to UCLA scientists who use the lab’s technology and know-how as a core facility, for example in second-generation sequencing.
While large labs can parcel out the computational challenges for next-gen sequencing, smaller labs with two to four sequencers struggle with getting an automated pipeline in place, he said. Small labs are confronting a range of questions, such as “How do you move the data around, where do you put it, where do you put the base calling, all the way to, ‘Now that we have the data what reference do we align it to?’” said O’Connor.
O’Connor said that his team is already developing SeqWare version 2.0 in collaboration with an unnamed corporate sponsor. The next version of the software is being developed to support a cancer genome sequencing project at UCLA called CanSeq.
Other scientists expressed great interest in the approach O’Connor is taking with SeqWare. “We are doing the same exact thing he is doing, but he is taking the extra effort and making it open source, taking it outside,” said Margulies.
Margulies added that O’Connor’s talk served as a reminder that his systems engineers might want to use outside tools as a resource, and perhaps help co-develop them, before they set out to create their own.