A newly published comparison of next-generation sequencing platforms is underscoring the notion that data from no one instrument captures all of the single nucleotide variants present in a given human genome.
As they reported last week in PLOS One, researchers based at Pennsylvania State University and Genentech looked for SNVs in genome sequencing data generated with Roche 454 GS FLX, SOLiD 3, or Illumina GAII/HiSeq 2000 instruments. All of the sequences represented the same Khoisan individual — first described in a Nature study a few years ago — whose genome is nicknamed KB1.
"The main goal of the study was this idea: 'Can we use more than one platform to reveal systematic biases in the analysis from each of the three platforms?'" the study's first author, Aakrosh Ratan, with the Pennsylvania State University Center for Comparative Genomics and Bioinformatics, told In Sequence.
Moreover, Ratan noted that while Illumina's HiSeq 2000 currently dominates the market, there are benefits to comparing HiSeq reads with those generated by other platforms in terms of better understanding the data that the machine produces.
"When you look at one platform in isolation, you can only tell so much," he explained. "When you have data from multiple platforms, it helps to understand the data."
"What we've tried to say with this paper is that we actually do have options," agreed senior author Stephen Schuster, also with Penn State's University Center for Comparative Genomics and Bioinformatics. "And if we use them and make these comparisons, it is clear that a single platform … does not cover the human genome with the completeness that we need."
For instance, the group saw a shared set of around 3.3 million SNVs that could be picked up using data from all three platforms for the KB1 genome. But depending on the platform, between roughly 71,500 and 443,000 additional variants were detected with reads from that platform alone.
Indeed, the study's authors uncovered variant call variability due to everything from platform-based technical biases to differences in the methods used to filter and analyze each read type. Based on these findings, they argued that it may be worthwhile to invest in data from not just one, but two sequencing platforms — at least for certain sequencing studies.
"It all depends on what you're trying to do," Ratan said. "If you're trying to create a reference genome and you want the whole genome to be covered, then it makes sense to spend that extra amount and actually sequence using more than one technology."
For their part, Ratan and his colleagues are currently using such multi-platform analyses on new reference genome development. Platform selection is also a priority when doing metagenomic sequencing studies of microbial communities containing sequences skewed toward a high or low guanine and cytosine content, they explained, since their analyses uncovered GC bias within both Illumina and SOLiD reads.
Stanford University geneticist Michael Snyder said the findings from the latest comparison are in line with what he and his team found when they compared data generated by Complete Genomics and Illumina instruments in late 2011 (IS 12/20/2011). There, too, the researchers reported missing a subset of variants in a given genome using reads from a single platform.
"Basically the conclusion is the same," Snyder said. "When you try several different platforms there's a lot of overlap but there are still plenty of differences."
The latest findings may not be especially heartening for human genome sequencing centers, according to Schuster, since they "might make their life more difficult."
"It clearly might require them to spend more work on a single genome," he said.
He noted that the new analysis has received a somewhat more enthusiastic reception from diagnostic and pharmaceutical companies, which are "really, really thrilled" with the potential of generating more complete variant sets for each human genome, since such datasets could contain heretofore-unappreciated variants contributing to small molecule associations or disease risk.
The current comparison centered on the KB1 genome, which represents a Khoisan individual from a present-day hunter-gatherer population in Namibia who was sequenced in 2010 as part of the Southern African genome project (IS 2/23/2010).
From the initial data generated for that genome, Schuster explained, researchers suspected that they might be missing variants or even stretches of sequence when relying on just one platform for genome resequencing.
"When we assembled the genome from only 10x coverage of the 454, we found regions of the genome that are not part of the human reference genome," he said. "If we now repeat this with the current technology from 454 we find the same regions again."
To investigate this more systematically, the team brought together the original Roche 454 data generated for the KB1 genome, along with KB1 genome sequence data generated on the SOLiD 3 platform and on Illumina's GAII and HiSeq instruments.
Two of the three manufacturers included in the study provided free data for the analyses, Schuster added, explaining that "the companies were very interested in the comparison being done."
For their analyses, the researchers selected depths that were both affordable and on par with those being generated for studies typically done using each of the instruments.
For reads generated using Roche 454 GS FLX Titanium chemistry, for instance, the depth hovered around 10-fold, while most Illumina and SOLiD reads covered the genome at between 30- and 60-fold coverage.
"What we wanted to do was not to investigate the adequate coverage in the human resequencing effort," Ratan said, "but we looked at this question from the other end: 'If you look at human data at normally used coverages, what kind of biases can you see?'"
In their analyses of the 454, SOLiD, and Illumina data, for example, the researchers found that just 3.3 million SNVs turned up in reads from all three platforms. Schuster called that observation the study's most surprising finding, particularly since the KB1 genome is known to house some four million validated variants.
Nevertheless, reads from each of the three instruments added a significant number of otherwise undescribed variants to the overall tally, bringing the total number of potential SNVs in the KB1 sequences up to roughly five million.
For example, 442,674 apparent variants in the KB1 sequence that were unearthed from Roche 454 reads weren't found by an alternative platform. Another 225,981 SNVs were specific to the Illumina dataset, while 71,567 unique variants turned up in the SOLiD reads.
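The bookkeeping behind such tallies can be illustrated with simple set operations. The following is a minimal sketch, not the authors' pipeline: it assumes each platform's calls have already been reduced to a set of (chromosome, position, alternate allele) tuples, and the function name `partition_calls` and the toy variants are hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical variant key: (chromosome, position, alternate allele).
Variant = Tuple[str, int, str]


def partition_calls(
    calls: Dict[str, Set[Variant]],
) -> Tuple[Set[Variant], Dict[str, Set[Variant]]]:
    """Return (variants called on every platform, variants unique to each platform)."""
    # Variants present in every platform's call set.
    shared = set.intersection(*calls.values())
    unique: Dict[str, Set[Variant]] = {}
    for name, variants in calls.items():
        # Union of the calls made by all the other platforms.
        others = set().union(*(v for n, v in calls.items() if n != name))
        unique[name] = variants - others
    return shared, unique


# Toy data for illustration only; these are not calls from the KB1 genome.
toy_calls = {
    "454": {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G")},
    "Illumina": {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr4", 10, "T")},
    "SOLiD": {("chr1", 100, "A"), ("chr3", 75, "C")},
}
shared, unique = partition_calls(toy_calls)
for platform, variants in unique.items():
    print(platform, "only:", sorted(variants))
print("all platforms:", sorted(shared))
```

Variants seen on exactly two platforms fall into neither group here, which is why the shared and platform-specific counts in a comparison like this do not add up to the overall total.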
Furthermore, many of the newly identified variants seem to be authentic, with some 80 percent to 90 percent of the variants found by at least two of the three platforms being subsequently verified by mass spectrometry.
The reasons for the discrepancies in SNV calls were multi-faceted. In some cases, variants weren't called in 454 reads owing to inadequate coverage, the team noted, while other SNVs got missed in Illumina and SOLiD datasets owing to biases in the way GC-rich regions of the genome are represented by those reads.
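One common way to look for this kind of GC bias is to bin genomic windows by their GC fraction and compare the average read depth in each bin. The sketch below illustrates that idea under stated assumptions; it is not the method used in the study. The helper names `gc_fraction` and `mean_depth_by_gc`, the five-bin scheme, and the toy windows are all hypothetical.

```python
from typing import Dict, Iterable, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of called bases (A, C, G, T) that are G or C; 0.0 if none are called."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def mean_depth_by_gc(
    windows: Iterable[Tuple[str, float]], n_bins: int = 5
) -> Dict[int, float]:
    """Average depth per GC bin, given (window sequence, mean depth) pairs.

    Bin i covers GC fractions from i/n_bins up to (i+1)/n_bins;
    a GC fraction of exactly 1.0 goes into the last bin.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for seq, depth in windows:
        idx = min(int(gc_fraction(seq) * n_bins), n_bins - 1)
        sums[idx] += depth
        counts[idx] += 1
    # Report only the bins that contain at least one window.
    return {i: sums[i] / counts[i] for i in range(n_bins) if counts[i]}


# Toy windows for illustration only; depths are invented, not measured.
toy_windows = [
    ("ATATATAT", 40.0),
    ("ATATGCAT", 35.0),
    ("GCGCGCGC", 12.0),
    ("GGCCGGCA", 18.0),
]
profile = mean_depth_by_gc(toy_windows)
print(profile)
```

In a profile like this, depth that falls steadily as GC fraction rises would point to under-representation of GC-rich regions, while the reverse trend would point to under-representation of AT-rich regions.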
Still other variants were missed because of the way reads from each platform are aligned and processed or due to the methods used to filter variants.
For example, nearly two-thirds of the variants called from 454 and Illumina reads but missed in the SOLiD-based analyses did show up in SOLiD reads, but got filtered out prior to SNV calling, the researchers reported.
Indeed, Schuster explained that computational methods used to deal with the sequence data can have as much influence on the ability to detect SNVs as the technical biases of platforms themselves.
Along with differences between platforms offered by different companies, the researchers also had the opportunity to compare two Illumina instruments — the GAII and HiSeq — and their corresponding chemistries.
In particular, whereas the GAII platform produced sequences that were skewed toward parts of the genome with high GC content, the HiSeq generated reads with nearly the opposite bias, Ratan explained.
"Initially, there was this idea that lower GC-content regions were not being sequenced well by Illumina with the GAII platform," he said. "As we moved on to the HiSeq platform, what we noticed was that the high GC-content regions were not being adequately sequenced."
Schuster said such chemistry-dependent read differences are a potential concern, since the precise chemistry used for sequencing on a given platform is often not documented in sequencing studies.
"The ability to really compare apples to apples is restricted," he explained. "It would be good if the journals would start requiring a standard and then they'd become alerted to the sensitivity of the variant calling depending on the platform and depending on the chemistry."
At the time the sequencing for the study was done, Schuster estimated, the 454 data cost around half a million dollars to generate, while the price tags for the SOLiD and Illumina reads came in at around $63,000 and $25,000, respectively.
Nowadays, it's possible to resequence a human genome for around $5,000 using Illumina's HiSeq, he noted, though generating genome resequencing data with the 454 instrument remains roughly 100 times as expensive.
Nevertheless, results of the study indicate that the additional information gained from investing in a second sequencing platform can be important — particularly when asking certain research questions.
For Schuster's group, that realization has come at the same time as an improved appreciation of the sorts of sequence differences that exist between human populations, which is spurring renewed interest in human reference genome sequencing by that team.
"We are proposing that we will need [additional] reference genomes," Schuster said. "We are convinced that this reference genome approach, in the time to come, will be very, very important."
To that end, he and his colleagues are currently pursuing new reference genomes from populations in Africa and other parts of the world using read data from multiple platforms.
"There are efforts underway to sequence very large cohorts, where we literally will have thousands of genomes," Schuster said. "Within those projects, it would make perfect sense to spend a significant amount of money on references and then go to large-scale sequencing later, after we finish these references."
The team is also turning to a combination of 454 and Illumina sequencing platforms for metagenomic sequencing of certain microbial communities, explained Schuster, who said the two platforms "make a very nice complement to one another," providing long reads in tandem with very affordable short reads.
"If you are doing something that you a priori know has high GC content, then it makes sense to do a small subset of sequencing using 454," Ratan said. "This is especially relevant in metagenomic studies, because there are all sorts of organisms, some with very high GC content, that you might miss if you were using just one of the other platforms."
They have also done experiments using overlapping libraries to generate longer-than-usual contiguous reads from Illumina short read data, stretching the reads from 250 base pairs out to 450 or more bases apiece.
For his part, Snyder said the availability of data from multiple platforms is usually not crucial when considering overall variant loads across many, many genomes.
Still, Snyder noted that he and his team are most apt to turn to multiple sequencing platforms when dealing with genomes that are expected to contain clinically significant variants.
"We always do use more than one platform whenever we have a genome of high clinical value," Snyder said. "We either use two whole-genome platforms or a whole-genome and an exome platform, because we also found that exome sequencing can catch a lot of things that whole-genome sequencing misses."
"Two different platforms and orthologous methods are quite valuable," he added. "And we pretty much do that routinely now for genomes that have high medical significance."
In terms of the read profiles associated with sequencing machines in general, Schuster and Ratan emphasized that the sorts of comparisons presented in the current paper continue to be a moving target, meaning that additional comparisons will be needed as new instruments hit the market.
For instance, the Penn State researchers are using Illumina's MiSeq for some library validation and mitochondrial genome sequencing experiments, Ratan noted. So far, they haven't done a formal comparison of the GC biases and other features of that platform, he added.
Over at Stanford, meanwhile, Snyder said his group is waiting on newer platforms such as Ion Torrent's Ion Proton platform. They have also been trying out the Pacific Biosciences RS for some time.
"Every time a new chemistry comes out — or a new anything — we test it out," Snyder said. "We're always evaluating these."