Researchers Design Low-Pass WGS Technique to Analyze Copy Number in Degraded Samples


NEW YORK (GenomeWeb) – Researchers at the VU University Medical Center in Amsterdam have designed a low-pass whole-genome sequencing technique especially suited for analyzing copy number variation in DNA from formalin-fixed paraffin-embedded tissue.

The method, published in Genome Research this month, relies on single-end 50 bp sequencing and incorporates a correction step in the bioinformatics pipeline to account for GC bias and repetitive regions. According to lead author Ilari Scheinin, the method cost approximately €200 ($257) per sample when the study was conducted a year ago, and the goal is to use it on clinical cancer samples to quickly identify copy number variations.

Scheinin, who is now an independent bioinformatics consultant with Quiwaes in Finland, told In Sequence that historically the VU University Medical Center lab had used array CGH to analyze copy number variation from archival cancer tissue. But with the development of sequencing technologies, experiments increasingly began moving from arrays to next-generation sequencing, and "we wanted to see what we could do in terms of copy number and sequencing technology," he said.

Methods for analyzing copy number using NGS fall into four categories, Scheinin said: assembly-based methods, which do not require a reference to align reads to but are generally costly because they need high sequence coverage; split-read and read-pair methods, which work best with longer DNA fragments and so are not ideal for FFPE samples; and depth of coverage methods, which infer copy number from the sequence depth and do not require both ends of the molecule to be sequenced.

Scheinin said that the existing methods in these categories were designed for different types of experiments than what his group wanted to run. "We were working mostly with paraffin," he said. In addition, the lab was analyzing large sets of archived samples, whereas many of the other methods were designed for groups looking at smaller numbers of patients and were also focused on identifying all the mutations in the cancer genome, not just copy number.

"Our starting point was that the method has to be robust enough to work with paraffin and degraded DNA, and it has to be as cost efficient as possible to analyze large sets of samples," Scheinin said.

The method falls into the depth of coverage category, with a few key differences. Rather than paired-end sequencing, it uses single-end sequencing with reads only 50 bp long. The shorter read length allows the method to work better on fragmented DNA, he said, while also reducing the cost and sequencing time.

Similar to other depth of coverage methods, this one uses a binning approach. In this case, the researchers first divided the human reference genome into 15 kbp bins. Of those, 12,893 bins were removed because they were composed of uncharacterized bases, leaving 179,187 autosomal bins.
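
The binning step can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and the toy alignments are hypothetical, and only the 15 kbp bin width comes from the article.

```python
from collections import Counter

BIN_SIZE = 15_000  # 15 kbp, the bin width reported in the article


def bin_index(position, bin_size=BIN_SIZE):
    """Return the 0-based bin index for a 0-based alignment start position."""
    return position // bin_size


def count_reads_per_bin(read_starts, bin_size=BIN_SIZE):
    """Count reads per (chromosome, bin index) from (chromosome, start) pairs."""
    counts = Counter()
    for chrom, start in read_starts:
        counts[(chrom, bin_index(start, bin_size))] += 1
    return counts


# Hypothetical toy alignments (chromosome, 0-based start), for illustration only.
reads = [("chr1", 100), ("chr1", 14_999), ("chr1", 15_000), ("chr2", 45_010)]
counts = count_reads_per_bin(reads)
```

In this toy input, the first two chr1 reads land in bin 0 and the third starts bin 1. Per-bin counts of this kind are the raw signal that a depth of coverage method then corrects and segments.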

The other main differences are in the way that the data is processed, he said. The researchers developed a bioinformatics approach, dubbed QDNA-seq, that incorporates a correction step for GC bias, which affects raw read counts, and mappability of repetitive regions. While other methods also employ correction steps, they typically do them sequentially and independently of each other, Scheinin said. "Independent correction for GC and mappability is appropriate only if these two factors do not interact in their effects on read counts," the authors wrote. But because FFPE samples contain artifactual variation that impacts both GC content and mappability, the group wanted to see whether doing simultaneous correction improved results.
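
To make the idea of simultaneous correction concrete, here is a deliberately simplified Python sketch. It is an assumption-laden illustration, not the QDNA-seq implementation: it groups bins by their combined (GC, mappability) values and divides each count by the median count of its group, so any interaction between the two factors is absorbed in one step. The field names, the rounding to whole percentages, and the use of a stratum median are all choices made here for illustration.

```python
from collections import defaultdict
from statistics import median


def stratum(b):
    """Joint (GC %, mappability %) key; rounding to integers is an assumption."""
    return (round(b["gc"]), round(b["mappability"]))


def joint_correct(bins):
    """Divide each bin count by the median count of bins in the same joint stratum.

    bins: list of dicts with hypothetical keys 'count', 'gc', 'mappability'.
    Returns one ratio per bin; about 1.0 means the bin matches its stratum.
    """
    by_stratum = defaultdict(list)
    for b in bins:
        by_stratum[stratum(b)].append(b["count"])
    expected = {key: median(values) for key, values in by_stratum.items()}

    ratios = []
    for b in bins:
        e = expected[stratum(b)]
        ratios.append(b["count"] / e if e > 0 else float("nan"))
    return ratios


# Toy data: the third bin has twice the reads expected for its stratum.
toy_bins = [
    {"count": 100, "gc": 40, "mappability": 100},
    {"count": 100, "gc": 40, "mappability": 100},
    {"count": 200, "gc": 40, "mappability": 100},
    {"count": 20, "gc": 60, "mappability": 50},
    {"count": 20, "gc": 60, "mappability": 50},
]
ratios = joint_correct(toy_bins)
```

A sequential scheme would instead normalize by GC alone and then by mappability alone, which is only equivalent when the two effects do not interact, the condition the authors describe.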

The simultaneous correction "in some cases performs equally well compared to consecutive corrections and in some cases performs better, but in our experience, it never performs worse," Scheinin said.

Another difference is that the team used data from the 1000 Genomes Project to identify regions of the genome that were problematic, either because they contain repetitive sequence or because they are not well characterized. The group then generated a "blacklist" of those areas and incorporated a step in the algorithm to filter them out prior to the correction step.

In the study, the researchers demonstrated the method on FFPE samples from 15 low-grade gliomas and two oral squamous cell carcinomas, as well as a breast cancer cell line.

Sequencing was performed on the Illumina HiSeq 2000 with between 18 and 22 samples multiplexed per lane. On average, they generated 9.2 million reads per sample. After aligning and filtering, they ended up with around 6 million reads per sample, corresponding to around 0.1x coverage of the genome.
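
The coverage figure follows from simple arithmetic, sketched below in Python. The read counts and read length come from the article; the genome size of roughly 3.1 billion bases is an assumption added here, and the variable names are hypothetical.

```python
READ_LENGTH_BP = 50                # single-end read length from the article
RAW_READS = 9_200_000              # average reads generated per sample
FILTERED_READS = 6_000_000         # approximate reads left after aligning and filtering
GENOME_SIZE_BP = 3_100_000_000     # assumed approximate human genome size


def mean_coverage(n_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_size


coverage = mean_coverage(FILTERED_READS, READ_LENGTH_BP, GENOME_SIZE_BP)
retained_fraction = FILTERED_READS / RAW_READS

print(f"coverage ~ {coverage:.2f}x, reads retained ~ {retained_fraction:.0%}")
```

Under these assumptions, 6 million 50 bp reads amount to about 300 million sequenced bases, or roughly one tenth of the genome length, consistent with the reported figure of around 0.1x.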

Looking at one representative low-grade glioma sample, the researchers reported that their method identified whole chromosome losses involving chromosomes 10 and 22, as well as a gain of chromosome 20, a focal amplification on chromosome 7, and a homozygous deletion on chromosome 9.

Comparing the two approaches, the researchers found that array CGH "costs more and yields a poorer signal-to-noise ratio than shallow WGS."

The lab has since performed the technique on over 1,000 samples from more than 25 hospitals in five countries.

Scheinin said that the next step of the work is to see how much sequence coverage would be needed to detect loss of heterozygosity. One limitation of the technique is that, because the sequence depth is so shallow, it cannot identify SNVs or even LOH. In addition, he said, many laboratories are now using exome sequencing as their primary method of analysis. However, for FFPE samples, exome sequencing is not ideal for detecting copy number variation. One possibility, then, would be to use this method prior to target enrichment for exome sequencing. He said that this extra step would add approximately 5 percent to the total cost, but it would also help with quality control.

The eventual aim is to use the method for clinical samples. However, the current protocol multiplexes 20 samples per HiSeq lane, or 160 samples per flow cell, which is a much higher volume than is typical for clinical labs, so switching to a lower throughput platform like the MiSeq would be necessary, he said.

The software is all open source and the wet lab steps are just adaptations of existing methods, so Scheinin did not think the lab would be looking to commercialize the process.

Outside of cancer, he said the method could potentially have applications in prenatal sequencing to detect fetal aneuploidy.