In a head-to-head comparison of the Roche 454 GS FLX and Illumina GAII sequencing platforms, fish population researchers found that the two systems were equally capable of finding SNPs in the sequenced transcriptomes of a fish species that lacks a reference genome. Versions of the Illumina system introduced since the study was conducted, however, appear to have a cost advantage over newer 454 instruments.
In a study appearing in PLoS ONE late last month, investigators from FishPopTrace and another project searched for SNPs in European hake by sequencing the transcriptomes of a handful of individuals from across the fish's normal distribution range. After isolating RNA from fish muscle tissue, they sequenced the complementary DNA using either Roche or Illumina instruments, assembled and annotated contigs using reads from each platform, and identified SNP candidates, which they then attempted to validate through array-based genotyping of more than 200 hake.
"The first objective was to discover as many SNPs as possible — and, of course, the best ones," University of Padova researcher Luca Bargelloni, the study's senior author, told In Sequence. "But there was also this idea of comparing [sequencing platform] methods to have sort of a pilot study or a case study for other species" that lack reference genomes.
From the subset of the candidate SNPs that appeared most promising, the team found in follow-up experiments that a comparable proportion of SNPs from each sequencing platform were genuine polymorphisms — around 43 percent in both cases. The price per SNP was also similar, though the researchers' estimates suggest Illumina would edge out Roche if such studies were done using the newest instruments available from each firm.
Improved Characterization
European hake, which are found in the northeast Atlantic Ocean and Mediterranean Sea, represent an important species for both European and Mediterranean fisheries, Bargelloni and his colleagues explained. But while a few main hake stocks have been defined so far, those populations have primarily been characterized based on non-genetic features.
In an effort to find new polymorphisms that might help in defining, classifying, and managing hake, the team decided to look for such variants via high-throughput sequencing of hake muscle tissue transcripts.
Muscle tissue made an attractive sample for such studies, Bargelloni noted, because it typically has a relatively simple transcriptome compared to tissues such as the brain or liver. It is also easy to sample and not subject to the high level of RNA degradation seen in other tissues.
The researchers focused on transcript sequences, in particular, in the hopes of uncovering markers that could discriminate between hake populations — a feature they suspected would be more likely in expressed sequences.
"The idea was to go for SNP discovery in the transcriptome rather than in the genome to have a higher number of potential SNPs under selection," Bargelloni explained.
As they did for other species evaluated through the FishPopTrace consortium, the researchers sequenced muscle transcriptomes for several hake collected over the species' range using the Roche 454 GS FLX platform.
Because some of the study authors also had funding to do Illumina GAII sequencing of the hake transcriptome for another project, called MerSNPs, Bargelloni said they decided to take the opportunity to compare the feasibility of finding SNPs in transcriptomes sequenced on each platform.
For the Roche 454-based SNP discovery, the researchers made complementary DNA libraries using messenger RNA isolated from the muscle tissue of eight European hake — two fish each from four commercial fisheries in the Aegean Sea, North Tyrrhenian, North Sea, and Iberian Atlantic coast.
Their Illumina cDNA libraries, meanwhile, came from five fish, two from the North Sea site and one fish each from the remaining three sites.
After preparing platform-appropriate cDNA libraries for each fish, the team generated 100 million base pairs of Roche 454 sequence and almost four billion base pairs of Illumina sequence.
Using the CAP3 program, the researchers then assembled around half a million Roche 454 reads into 5,702 contigs that were each longer than 100 base pairs. Illumina reads, meanwhile, were assembled de novo with ABySS.
"The possibility to assemble de novo, even just Illumina transcriptome data, makes it possible to use this approach of in silico discovery feasible, even if you don't have 454 data to have a scaffold of the transcriptome," Bargelloni noted. Even so, he added, the shorter reads produced on the Illumina GAII correspond to far fewer contigs that are long enough for the contig annotation and SNP detection steps.
Of the 9,258 contigs that they initially assembled from 50.7 million GAII reads, the researchers tossed out all but 3,756 contigs, again focusing on contigs that were at least 100 base pairs long. The average length of the remaining GAII contigs was 190 base pairs compared to an average contig length of 331 base pairs for the 454 transcripts.
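The length filter described above can be pictured with a minimal Python sketch. It is not the study's pipeline: the function name `filter_contigs` and the toy sequences are hypothetical, and only the 100-base-pair threshold comes from the article.

```python
from statistics import mean

def filter_contigs(contigs, min_len=100):
    """Keep only contigs whose sequence is at least min_len bases long."""
    return {name: seq for name, seq in contigs.items() if len(seq) >= min_len}

# Toy data, made up for illustration (not hake sequences).
toy_contigs = {
    "c1": "ACGT" * 30,  # 120 bp, kept
    "c2": "ACGT" * 10,  # 40 bp, discarded
    "c3": "A" * 100,    # exactly 100 bp, kept
}

kept = filter_contigs(toy_contigs)
avg_len = mean(len(seq) for seq in kept.values())
print(sorted(kept), avg_len)
```

Applied to real assemblies, the same two steps would give the retained contig count and the average contig length that the article compares across platforms.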
The availability of longer contigs is particularly useful when trying to annotate transcriptome data against sequence databases, Bargelloni explained. For instance, the team was able to annotate 4,221 of the 454 contigs and just 2,644 of the GAII contigs.
"The biggest difference between the 454 data and Illumina data was … due to the different length of the contigs," he said. "If you use longer contigs, the chance of a positive match, a significant match, against a known protein-coding gene is higher."
On the other hand, when the researchers predicted polymorphisms in each set of sequences using Gigabayes, an extension of the PolyBayes SNP detection algorithm, they found that the sequence depth was much higher at candidate SNP sites in the Illumina contigs.
Overall, they found more than 4,000 possible SNPs spread out over 889 of the 454 contigs and 8,606 SNP candidates in 2,384 of the GAII contigs. The average coverage depth at candidate SNP sites was 89-fold in the 454 data, compared with 674-fold in the GAII contigs.
Nevertheless, the proportion of candidate variants verified by array-based genotyping was comparable for both platforms. When the researchers attempted to genotype 1,536 candidate SNPs with custom Illumina GoldenGate arrays in fin samples from 207 hake, they were able to verify 409 of the 944 candidate SNPs selected for follow-up testing from 454 contigs.
Meanwhile, 296 of 684 candidate SNPs from GAII contigs were validated, meaning the conversion rate — the proportion of SNP candidates that turned out to be scoreable and authentic polymorphisms — was around 43 percent for each platform.
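As a quick check on that arithmetic, the conversion rates can be recomputed from the counts reported here. The Python sketch below is illustrative only; the function name `conversion_rate` is hypothetical and does not come from the study.

```python
def conversion_rate(validated: int, tested: int) -> float:
    """Return the percentage of tested SNP candidates that were validated."""
    if tested <= 0:
        raise ValueError("tested must be a positive count")
    return 100.0 * validated / tested

# Counts as reported in the article.
rate_454 = conversion_rate(409, 944)   # candidates from 454 contigs
rate_gaii = conversion_rate(296, 684)  # candidates from GAII contigs

print(f"454: {rate_454:.1f}%  GAII: {rate_gaii:.1f}%")
# prints "454: 43.3%  GAII: 43.3%"
```

Both figures round to 43.3 percent, consistent with the "around 43 percent" reported for each platform.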
Of the possible SNPs that turned out to be genuine, 67.5 percent could be found in GAII data and 63.3 percent turned up in sequence generated on the 454 instrument.
As a result, both platforms were found to be "suitable for large-scale identification of SNPs in transcribed regions of non-model species, although the lack of a reference genome profoundly affects the genotyping success rate," the study authors noted.
While it's difficult to predict how much SNP discovery could be improved by having access to a reference genome, Bargelloni noted, past studies indicate that SNP conversion rates tend to be higher for species with a reference genome available, as well as species that are closely related to another species that has had its genome sequenced.
"In studies where they made the comparison between having and not having the reference genome, clearly that makes a big difference," he noted. "So possibly, in the future, with a lower cost of genome sequencing, that may be something that could be solved."
The researchers found that the cost per SNP discovered was similar for the GAII and GS FLX platforms used in the study, Bargelloni said. But the team's estimates based on sequencing output for newer versions of the instruments, such as the Illumina GAIIx or HiSeq 2000 and the Roche 454 GS FLX Titanium or FLX+, suggest that the Illumina HiSeq would likely be most cost-effective for transcriptome-based SNP discovery.
For example, the reagent cost of producing the amount of Illumina sequence data used in the current study with the GAIIx rather than the GAII would be less than half as much as the reagent cost associated with sequencing the amount of 454 sequence analyzed in the study using the Roche 454 GS FLX Titanium, the study authors calculated.
That price difference is expected to be even more pronounced when comparing the Illumina HiSeq 2000 with the Roche 454 GS FLX+, they noted.
"With the older systems, both 454 and Illumina, the GAII, they were sort of even," Bargelloni said, "there were no big [cost] differences between the two methods."
"But now with the HiSeq, there are more advantages to using the Illumina technology rather than the 454," he added. In particular, the paper notes that the HiSeq could enable an increase in the number of individuals included in the SNP discovery panel "without decreasing coverage depth."
"We tried to represent a range of costs rather than a fixed position, but the trend is that Illumina is going to be better in this sort of application than 454," Bargelloni said.
The authors note in the paper, however, that the most expensive steps in their analysis pipeline were the in silico analysis of sequence reads and the high-throughput genotyping assays used in the verification step.
For his part, Bargelloni said he is interested in combining transcriptome-based SNP discovery methods with sequence capture-based enrichment strategies that make it possible to sequence specific regions of the genome, such as protein-coding sequences, in many more individuals in a cost-effective way.
That, in turn, may help to not only verify potential SNPs, but also to predict the frequency with which certain SNPs occur in a given population.
"If you have more individuals, you can estimate the frequencies of each allele for individual SNPs," Bargelloni said. "So maybe there will be a more articulate approach in the future."
European hake transcriptome sequence data that was generated for the study has been submitted to the European Bioinformatics Institute's Sequence Read Archive.