With all of the nail-biting that supposedly goes hand-in-hand with the next-generation sequencing "data deluge," the non-informaticist may be surprised to learn that the real worry of the folks tasked with making sense of these data lies not in their quantity, but in their ambiguity. Issues such as error rates and how to improve base calls to account for those errors leave researchers with a sort of informatics hoarding disorder, in which they sometimes feel the need to store images, base calls, second-best base calls, third-best base calls, and processed intensity information, all for lack of knowledge about the data.
"The data constantly changes because the sequencing environment is highly fluid — nobody knows what's going to happen a year from now," says David Jaffe, director of computational research and development in the Genome Sequencing and Analysis Program at the Broad Institute. Jaffe's group deals with data from a wide array of DNA sequencing technology platforms and works on both the computational and analytical aspects of genome sequencing. "There are sequencing companies operating under the radar now, and there are companies that are operating above the radar, but nobody knows what is going to materialize, hence the informatics lag because the data appears, and then what do you?" Jaffe says. "It's different from the data that was around before, and this happens over and over."
David Dooling, who oversees the analysis developers, laboratory information management systems, and information systems groups at Washington University in St. Louis, says that his approach to grappling with next-generation sequencing data analysis includes widely used alignment, variant calling, and genotyping tools that are freely available from the academic community, as well as applications developed in-house. Ultimately, making sense of all this next-gen data is not really about which software analysis application works best. "Algorithms aren't really going to address the data deluge situation. It's more of a framework issue than an algorithm issue," he says. "If we can provide tools that would allow people to take these large amounts of data and iterate through these different variations, such as visualization and statistical tools that look at the data and see where the discrepancies are, I think we're not going to be talking about data deluges at that point."
Some bioinformatics developers say that, while it's true that overall it looks like they have not caught up to the next-gen sequencing platforms, things are getting better. There are open-source and commercial packages for many different tasks, but it all really depends on what one wants to know. "If you're interested in just finding SNPs, we have caught up, but if you're interested in finding the structural variations, it gets a bit trickier because there's a few tools out there — but I don't think that anybody trusts the results just out of the box," says Michael Brudno, an assistant professor of computer science at the University of Toronto.
As far as standouts from the academic community, Brudno points to tools such as Mapping and Assembly with Quality, or MAQ — a tool specifically designed for Illumina and SOLiD reads that maps those short reads to a reference and calls the genotypes from the alignment — and the Burrows-Wheeler Alignment Tool, which has gained popularity for mapping reads ahead of variant calling. He also likes the Velvet assembler developed by Ewan Birney and Daniel Zerbino at the European Bioinformatics Institute, not only because of its popularity, but because it's a solid piece of software engineering that allows users to tweak it — to handle their current data, but also data coming down the pike.
With next-generation sequencing ramping up efforts for variant detection, there is a growing need for developing read-mapping algorithms that can help researchers identify structural variants in a reliable way. Brudno also says that a new open-source algorithm called Pindel, developed by a group at EBI, shows great promise for identifying exact break points of large insertions or deletions.
Some commercial vendors are offering alignment tools that tout improved performance when compared to academic solutions. Earlier this year, a research team from Genentech published a paper in Bioinformatics describing a new method called the Genomic Short-read Nucleotide Alignment Program, which is capable of aligning reads of any length, even as short as 14 nucleotides. The team demonstrated that GSNAP could perform alignment on reads longer than 70 nucleotides with slightly better performance than academic tools like MAQ. Unlike algorithms based on the Burrows-Wheeler method, GSNAP uses a data structure approach called a hash table, which is better suited for picking out complex variants. On a test set of 100,000 simulated 36-nucleotide reads with three mismatches, GSNAP completed the analysis in six minutes, while MAQ churned away for roughly five hours.
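GSNAP's own index is considerably more engineered than this, but the hash-table idea it relies on can be sketched minimally: store every k-mer of the reference in a table, then look up each k-mer of a read to collect candidate alignment positions. All names below are illustrative, not taken from GSNAP:

```python
from collections import defaultdict

def build_kmer_index(reference, k=12):
    """Hash table mapping each k-mer to its start positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_positions(read, index, k=12):
    """Candidate alignment starts: every reference hit of every read k-mer,
    shifted back by that k-mer's offset within the read. A mismatch inside
    one k-mer doesn't hide the read, because other k-mers still hit exactly;
    this tolerance for local differences is why hashing suits variant-rich
    reads better than exact-match indexing."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            candidates.add(pos - offset)
    return candidates

reference = "ACGTACGTGGCATTACGGATCCATGCAAT" * 4
read = reference[10:28]  # an 18-mer drawn from the reference
print(sorted(seed_positions(read, build_kmer_index(reference))))
```

Each candidate position would then be verified with a full (mismatch- or gap-aware) comparison, which is where the real alignment work happens.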
For the big sequencing centers, putting together an effective toolbox of assembly and alignment analysis software tends to be a mixed-bag approach. "Sometimes there's analysis software that works off the shelf and other times we have to invent something totally new, or that works better," Jaffe says. "Things like assembly software we build ourselves, but it's an open problem as to how to do it best because it's an area that people are still trying to figure out."
One example of an effective homegrown tool is the Broad's Allpaths, a whole-genome shotgun assembler capable of generating high-quality assemblies from short reads. "It's something we're very actively working on and we think it's the right approach, but it's not just software," Jaffe says. "We're proposing what we think is a practical approach for sequencing and assembling of novel genomes so we've come up with, effectively, [a] molecular biology protocol of what to do in the lab and what to do computationally, because those are tied together — and they should be."
Other big sequencing centers have been making waves with their own tools. The BGI in Shenzhen published several papers late last year showcasing the effectiveness of SOAPdenovo, a de novo assembly method designed for short Illumina Genome Analyzer reads. BGI researchers published a paper in Nature describing their success in generating and assembling a draft sequence of the giant panda genome. Their efforts seem to point the way forward by demonstrating that next-generation sequencing platforms can be used for cheap and fast de novo assembly of sizable eukaryotic genomes.
Another area that could use some consensus is storage formats for alignments. One glimmer of hope is the Sequence Alignment Map (SAM)/Binary Alignment Map (BAM) format, which ultimately will move data analysis forward by bringing researchers together. SAM describes the alignment of query sequences or sequencing reads to a reference sequence or assembly and BAM is its binary equivalent, intended for use in a data-intensive production pipeline setting. "It's been both a very positive force and a bit troublesome," Jaffe says. "The downside is that the format is laden with ambiguity that will probably be worked out in a second version. But it's something which has made a whole level of collaboration possible because it's much easier to exchange data."
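As a rough illustration of why a shared format eases the data exchange Jaffe describes, a SAM alignment record is just eleven tab-separated mandatory columns followed by optional tags. A minimal parser, using a made-up read rather than real data, might look like:

```python
# The 11 mandatory, tab-separated columns of a SAM alignment line,
# in the order given by the SAM specification.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ",
              "CIGAR", "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    """Split one alignment line into a dict of the mandatory fields;
    any columns past the 11th are optional TAG:TYPE:VALUE fields."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    record["OPTIONAL"] = cols[11:]
    return record

# An invented alignment line for illustration (not real data):
line = ("read_001\t0\tchr1\t11874\t37\t36M\t*\t0\t0\t"
        + "A" * 36 + "\t" + "I" * 36)
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], rec["CIGAR"])  # chr1 11874 36M
```

In practice, production pipelines read the compressed binary BAM form through libraries rather than parsing text, but the column semantics are identical, which is what makes exchanging alignments between groups straightforward.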
While researchers can spend time tinkering with the settings on their sequencing platforms to improve the base calls and the quality values, most informaticists should concern themselves with the end product rather than sweating the small stuff. "We're interested in base qualities as they relate to a mapping quality and genotype quality confidence, because when you run a full analysis pipeline all the way out through the data generation, you find out that the difference between a base quality of 30 and a base quality of 25 isn't all that substantial," Dooling says, "because you're not going to rely on a single observation or a single read that hits that spot with that base to call a variant." He adds, "We take a holistic approach, find out the areas of ambiguity that are problematic and understand them as best we can, as opposed to chasing every gremlin that may affect the uncertainty that essentially doesn't affect your end results that much."
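Dooling's comparison of qualities 30 and 25 can be grounded in the Phred scale that base qualities use: Q = -10 log10(p), where p is the probability that the base call is wrong. A small sketch of the arithmetic:

```python
def phred_to_error_prob(q):
    """Phred scale: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

# Quality 30 means a 1-in-1,000 chance the call is wrong;
# quality 25 means roughly 1-in-316. Both are small once several
# independent reads cover the same position.
p30 = phred_to_error_prob(30)   # 0.001
p25 = phred_to_error_prob(25)   # ~0.00316
# Chance that, say, three independent overlapping reads are all
# wrong at the same spot:
print(p30, p25, p25 ** 3)
```

Under this (idealized) independence assumption, even the lower-quality reads drive the joint error probability down to a few parts per hundred million with three-fold coverage, which is the intuition behind not relying on any single base observation.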