Researchers at the Translational Genomics Research Institute have devised a barcoding method for sequencing multiple samples in parallel on Illumina’s Genome Analyzer, and have tested it by resequencing several regions in 46 HapMap individuals.
The scientists plan to apply the method in targeted resequencing studies, for example in follow-up projects to genome-wide association studies. Coupled with genome capture or partitioning methods, it could be used to sequence hundreds of samples in parallel, they said.
The aim of the study, which was published in Nature Methods this week, was to develop a way to resequence multiple targeted genomic regions in parallel, and to develop an analysis framework for discovering genetic polymorphisms.
The researchers used six-base barcodes to index 46 HapMap samples. In each sample, they amplified multiple 5-kilobase regions by long-range PCR — 10 in one experiment and 14 in another — most of which had previously been sequenced as part of the Encyclopedia of DNA Elements, or ENCODE, project.
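To make the indexing idea concrete, here is a minimal sketch of how pooled reads could be sorted back to their samples by barcode. It assumes the six-base barcode sits at the start of each read and that only exact matches are accepted. The article confirms neither detail, and the barcode sequences, sample labels, and function name below are hypothetical.

```python
from collections import defaultdict

BARCODE_LEN = 6  # six-base index, as described in the study


def demultiplex(reads, barcode_to_sample):
    """Group reads by sample using the first BARCODE_LEN bases of each read.

    Assumption: the barcode is a read prefix and must match exactly.
    Returns (reads grouped by sample with the barcode trimmed off,
    list of reads whose prefix matched no known barcode).
    """
    by_sample = defaultdict(list)
    unassigned = []
    for read in reads:
        barcode, insert = read[:BARCODE_LEN], read[BARCODE_LEN:]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            unassigned.append(read)
        else:
            by_sample[sample].append(insert)
    return dict(by_sample), unassigned


# Hypothetical barcodes and sample labels, for illustration only.
barcodes = {"ACGTAC": "sample_01", "TGCATG": "sample_02"}
reads = ["ACGTACGGATTACAGGC", "TGCATGTTTACCGGAAT", "NNNNNNACGTACGTACG"]
assigned, unassigned = demultiplex(reads, barcodes)
```

Trimming the barcode before alignment is what costs each read a few bases, the trade-off Craig describes later in the article.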
They sequenced these regions using a single lane of an Illumina Genome Analyzer flowcell and analyzed the results using a statistical approach called Bayes factors to figure out whether a difference from the reference sequence stemmed from a sequencing error or a polymorphism.
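As a rough illustration of the Bayes factor idea, the sketch below compares two simple hypotheses at a single site: the sample is heterozygous, or it matches the reference and any mismatching reads are sequencing errors. This is not the model from the paper. The binomial likelihoods, the 1 percent error rate, and the function names are assumptions chosen for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def bayes_factor_het_vs_error(k, n, error_rate=0.01):
    """Ratio of P(data | heterozygous site) to P(data | sequencing error only).

    k: reads showing the non-reference base; n: total reads covering the site.
    Assumptions: under the heterozygous hypothesis each read shows the variant
    with probability 0.5; under the error hypothesis, with probability error_rate.
    """
    return binom_pmf(k, n, 0.5) / binom_pmf(k, n, error_rate)


# Values above 1 favor a real polymorphism; values below 1 favor error.
print(bayes_factor_het_vs_error(k=25, n=50))  # half the reads differ at deep coverage
print(bayes_factor_het_vs_error(k=1, n=50))   # a single mismatch at deep coverage
print(bayes_factor_het_vs_error(k=1, n=2))    # the same kind of call at shallow coverage
```

In this toy model, the evidence for a heterozygous call is far stronger at 50 reads than at two, which fits the article's emphasis on coverage.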
According to the paper, the results suggest “that achieving adequate coverage is one of the most important factors in the design of a multiplexed targeted resequencing experiment.”
With that in mind, the researchers now usually design their experiment for 50-fold coverage, knowing that the actual coverage may dip well below this target in certain samples or sequence areas.
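A back-of-the-envelope calculation shows what a 50-fold target implies for read counts. The sketch below assumes that coverage is spread evenly and that every read is usable, neither of which holds in practice. The 36-base read length is also an assumption, since the article does not state one.

```python
def reads_needed(target_bp, n_samples, coverage, read_len, barcode_len=6):
    """Estimate the reads required for a given mean coverage per sample.

    Assumptions: every read aligns to the target, coverage is uniform across
    samples and positions, and barcode_len bases of each read are spent on
    the index.
    """
    usable = read_len - barcode_len
    if usable <= 0:
        raise ValueError("read length must exceed barcode length")
    total_bases = target_bp * n_samples * coverage
    return -(-total_bases // usable)  # ceiling division on integers


# Ten 5-kilobase regions in 46 samples at 50-fold coverage,
# with a hypothetical 36-base read length.
n_reads = reads_needed(target_bp=10 * 5_000, n_samples=46, coverage=50, read_len=36)
print(n_reads)
```

Because real runs show uneven barcode representation and regional dropouts, an estimate like this is a lower bound. That is consistent with the researchers' choice to aim well above the minimum the algorithms require.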
“The algorithms show us that we can go quite a bit lower [in coverage],” said David Craig, an investigator at TGen and associate director of the neurogenomics division. “But there is this real-life, day-to-day variability, and just to get rid of all of that [we decided to] just cover it really well.”
Their barcoding, or indexing, approach is similar in principle to barcodes that various research groups, as well as 454 Life Sciences, have developed for the 454 technology (see In Sequence 8/7/2007).
Illumina is working on its own barcoding method for multiplexing, which it plans to launch before the end of the year, CEO Jay Flatley said during the company’s second-quarter conference call in July (see In Sequence 7/29/2008).
Also, another research team, led by the Pacific Northwest Research Station of the USDA Forest Service, published a paper in Nucleic Acids Research last month in which they described a multiplexing approach that uses three-base barcodes to sequence four chloroplast genomes per lane on the Illumina Genome Analyzer.
The TGen scientists observed considerable variability in how the 46 barcodes were represented within a sequence run, but running qPCR prior to pooling samples has improved that, according to Craig.
Because of the length of the barcode, it would be possible, in principle, to sequence small regions of several hundred samples in a single run, Craig said, though he and his colleagues “are typically doing anywhere from 10 to 40” because of the size of the regions they target, usually between 40 and 200 megabases.
In their study, they amplified the DNA by long-range PCR, which is expensive and requires many separate reactions. However, Craig said he believes that the barcoding approach will also work in conjunction with capture microarrays. Combining the two methods would “drive down the cost considerably and it removes sample-to-sample variability,” Craig said.
He and his colleagues are exploring microarrays from Roche NimbleGen, Agilent Technologies, and Febit, but still need to improve the yield, he said.
So far, they have applied their barcoding method to resequencing studies of autism and neurological diseases, such as multiple sclerosis, sequencing approximately 200 samples in each study.
“In one [sequencing] experiment, you can tackle a 100-kilobase or 200-kilobase region for 180 people,” Craig said. “That’s one five-day sequencing reaction.”
He and his team have also used the method with five-base barcodes for paired-end sequencing, which he said “works just fine.”
The method could be equally applicable to Applied Biosystems’ SOLiD platform, he said, although he has not yet tested it on the instrument. TGen, he said, has one Illumina Genome Analyzer and one ABI SOLiD on site.
One drawback of barcoding, Craig said, is the loss of several bases from each sequence read, but “losing a few bases to an index for targeted resequencing is not terribly painful, we found.”
The method “would be fairly straightforward” to implement by others, he added.
One of its most relevant applications, he said, will be in follow-up studies to large-scale SNP genotyping studies. “We are following up on a lot of genome-wide association studies that end with an interval, … and the question is, ‘What is really the functional variant that is causing disease?’” Craig said. “So after we fine-map, in the second stage of a GWAS, we really do need to exhaustively sequence.”
Others are also using PCR-based amplification and Solexa sequencing to follow up on GWAS. Scientists at the Wellcome Trust Sanger Institute, for example, are currently resequencing genomic regions identified by the Wellcome Trust Case Control Consortium (see In Sequence 7/8/2008).
Beyond these types of studies, the method might be useful for multiplexed sequencing of bacteria, viruses, or BACs, Craig said.
And it might be a long time, if ever, before targeted resequencing studies become obsolete, according to Craig. “When you want to sequence 10,000 people for BRCA1 and BRCA2, you still need a targeted resequencing approach,” he said. “I think we are a long ways off for where we would not need something like that.”