By Monica Heger
As next-generation sequencing becomes faster and cheaper, researchers are applying the technology to an ever wider range of applications, but for many tasks, such as assessing structural variation, there is no standard method for capturing all the relevant information in a genome.
In response to this challenge, researchers from Scripps Translational Science Institute and the University of California, San Diego, have attempted to provide some guidance on how to best design sequencing experiments in a paper published last month in BMC Genomics.
The researchers, led by computer scientist Vineet Bafna at UCSD, evaluated sequencing methods for structural variation detection and for estimating transcript abundance. For detecting structural variation, they found that the optimal design is to create two libraries with different insert sizes: one with an insert size matching the desired resolution, and the other with an insert as large as possible. For estimating transcript abundance, they found that gene expression follows a distribution curve that can help researchers determine how much sequencing is needed to detect expression of a given gene.
The goal was to come up with optimal sequencing designs specific to the application. "There are no guidelines out there," Vikas Bansal, a research scientist at Scripps and an author of the study, told In Sequence. "For [detecting] structural variation it's not clear what kind of sequencing you should do."
The study will be useful for researchers trying to sort through a variety of sequencing parameters, "such as the number of reads, read length, single-read versus paired end, and insert size," to determine which combination would work best for a given experiment, John Castle, head of genomics and bioinformatics at the Center for Translational Oncology and Immunology in Germany, who was not affiliated with the study, told In Sequence in an e-mail.
Bansal's team found that insert size affects both the resolution and the power of structural variation detection: short inserts yield finer resolution, while longer inserts give greater power to detect the variations. In the study, the team found that creating two libraries was optimal, one with an insert size as large as possible and another with a smaller insert. On the Illumina Genome Analyzer, for instance, libraries with insert sizes of 200 base pairs and 2 kilobases work well, he said. Combining the two libraries improved breakpoint detection by 15 percent at the same experimental cost, compared with using the short-insert library alone.
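The trade-off can be sketched with a back-of-the-envelope calculation (an illustration, not the paper's exact model): with N read pairs of insert size L on a genome of length G, the expected number of pairs whose insert straddles a fixed breakpoint is roughly N * L / G, while the breakpoint can only be localized to within roughly the insert size. The genome size, read count, and support threshold below are illustrative assumptions:

```python
import math

def spanning_pairs(n_pairs, insert_size, genome_len):
    """Expected number of read pairs whose insert straddles a fixed breakpoint."""
    return n_pairs * insert_size / genome_len

def detection_power(n_pairs, insert_size, genome_len, min_support=2):
    """P(at least min_support spanning pairs), assuming Poisson-distributed coverage."""
    lam = spanning_pairs(n_pairs, insert_size, genome_len)
    return 1 - sum(math.exp(-lam) * lam**k / math.factorial(k)
                   for k in range(min_support))

G = 3_000_000_000   # human genome, ~3 Gb
N = 100_000_000     # 100 million read pairs (illustrative)
for L in (200, 2_000):
    # Larger inserts span a breakpoint more often, boosting detection power,
    # but localize it less precisely.
    print(f"insert {L} bp: power = {detection_power(N, L, G):.3f}")
```

The numbers show why pairing a short-insert library (fine resolution, modest power) with a long-insert library (coarse resolution, high power) is attractive.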
Somewhat surprisingly, creating more libraries with various insert sizes did not improve the method, Bansal said. "Doing more [sequencing] of the original two libraries is just as good as creating a third library."
The other problem Bansal's team tackled was detecting low-expressed transcripts. He said there are no guidelines for how deeply one should sequence the transcriptome to ensure that all transcripts are detected.
The team sequenced a transcriptome to a depth of one million reads and showed that the transcripts followed a distribution curve ranging from highly expressed to low-expressed genes.
"You can estimate the probability of distribution levels of the genes, and from that you can extrapolate how much sequencing you'd have to do to detect the particular gene you're interested in," Bansal said. "All transcriptomes won't follow the exact same distribution, but this can be used as a guide."
He said it could help to reduce redundancy, and also to ensure that enough sequencing is done to detect the gene of interest. "Some people sequence 10 million reads per transcriptome, but where does that number come from? There's no mathematical or statistical rationale behind it, so we tried to provide some basis for determining how deep you should sequence transcriptomes."
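The kind of extrapolation Bansal describes can be sketched with a simple probabilistic model: if a transcript contributes a fraction p of all reads, the number of reads hitting it in an N-read experiment is approximately Poisson(N * p), which can be inverted to find the depth needed to detect it with a chosen probability. The abundance, detection threshold, and target probability here are illustrative assumptions, not values from the paper:

```python
import math

def p_detect(n_reads, frac, min_reads=1):
    """P(transcript at relative abundance frac yields >= min_reads reads)."""
    lam = n_reads * frac
    return 1 - sum(math.exp(-lam) * lam**k / math.factorial(k)
                   for k in range(min_reads))

def reads_needed(frac, target=0.95, min_reads=1):
    """Smallest depth (doubling search from 1,000 reads) with detection
    probability at or above the target."""
    n = 1000
    while p_detect(n, frac, min_reads) < target:
        n *= 2
    return n

# A rare transcript contributing one read per million:
print(reads_needed(1e-6))   # prints 4096000: a few million reads suffice
```

Under this toy model, a transcript at one read per million needs roughly 3 to 4 million reads for 95 percent detection, which is one way to put a rationale behind a depth target rather than defaulting to a round number.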
Bansal added that the methods would be applicable across all sequencing platforms.
The study "provides good insights for experimental design," said Christopher Maher, a research investigator at the Center for Computational Medicine and Bioinformatics at the University of Michigan. Maher, who was not affiliated with the study, focuses on translating genomic information from sequencing data for diagnostics. "The questions they are asking are ongoing ones in the community," he said.
With regard to the method for assessing structural variation, Maher said it was not particularly surprising that creating two libraries with different insert sizes was optimal, but the benefit had never before been quantified. "People have thought that the different insert sizes would help, but this empirically demonstrates it," he said. "I think if labs have the resources, they will use it, because it gives the most comprehensive view of somatic rearrangements."
He said the technique for determining the amount of coverage needed in an RNA-seq experiment was also a "useful framework," but cautioned that the protocol had biases, and would vary depending on both the platform and sample. Nevertheless, "they lay out a nice model for labs that want to determine how much coverage they need," he said.
Bansal said the next problem his team is working on is haplotype assembly. He said that there is interest in doing haplotype phasing from sequence data, but that it cannot be done with insert sizes of only 200 base pairs. To phase haplotypes from sequence data, paired-end reads need to span more than one variant. The density of heterozygous SNPs is about one every 1,200 to 1,500 base pairs, Bansal said, so in a paired-end read with a 200-base-pair insert, even if one end contains a SNP, it is very unlikely that the other end will contain one as well. He said his team is still working out the optimal insert lengths, but it will likely be a combination of different sizes.
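Bansal's point can be illustrated with a rough calculation: if heterozygous SNPs occur at random with an average spacing of about 1,350 base pairs (the midpoint of the range he cites), a fragment of length L contains approximately Poisson(L / 1350) of them, and phasing requires at least two on the same fragment. The Poisson-spacing assumption is a simplification for illustration:

```python
import math

def p_two_snps(fragment_len, snp_spacing=1350):
    """P(a fragment carries >= 2 het SNPs), assuming Poisson-spaced SNPs."""
    lam = fragment_len / snp_spacing
    return 1 - math.exp(-lam) * (1 + lam)

for L in (200, 2_000, 10_000):
    print(f"{L:>6} bp fragment: P(>=2 het SNPs) = {p_two_snps(L):.3f}")
```

Under this model a 200-base-pair insert links two heterozygous SNPs only about 1 percent of the time, while inserts of several kilobases do so routinely, which is consistent with Bansal's argument for longer and mixed insert sizes in phasing experiments.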
"We want to provide a first step to standardized guidelines for how you should design sequencing experiments," Bansal said.