NEW YORK (GenomeWeb) – Researchers from Case Western Reserve University have published a paper in Genome Medicine that describes a freely available computational framework that they developed for detecting somatic copy number alterations (sCNAs) in whole-exome sequencing data from multiple platforms.
The researchers believe that the framework, called the Extreme Value Distribution-based Copy-Number Variation Estimation (ENVE) computational methodology, could be a useful tool in efforts to find more effective treatments for various kinds of cancers and other genetic conditions.
The methodology uses a probabilistic modeling approach to distinguish between true copy number changes and false alterations resulting from variations in tumor content across samples and technical variability introduced during sample preparation and sequencing, helping to reduce the risk of false positives.
It also detects CNAs without requiring researchers to define analysis parameters for factors such as ploidy and tumor content upfront, one of the key benefits of the system, according to senior author Kishore Guda, an assistant professor of general medical sciences-oncology in the Case Comprehensive Cancer Center.
Guda and his colleagues point in their paper to a previously published review of a number of algorithmic approaches designed to distinguish between true and false sCNAs. The review showed "substantial variability" in the sensitivity and specificity of these algorithms and reported that having to choose algorithm-specific parameters posed the biggest problem.
Currently, researchers try to select appropriate parameters either by trial and error or by using complex algorithms that can be tedious to run, Guda told GenomeWeb. ENVE sidesteps the problem by using data from normal samples to generate models that represent noise in the data and then calculating a probabilistic score that indicates, based on a given threshold, the presence of a copy number amplification or deletion in the sample.
A second benefit of the approach, Guda noted, is that it was developed and tested on data collected from actual tissue samples. This is in contrast to some current methods developed using simulated datasets, which may not give a true picture of what real samples look like in practice, he said.
In its simplest sense, ENVE has two major modules. The first — which comprises four submodules — uses data from non-tumor diploid normal samples to capture and model noise in WES data that comes from sample prep steps and variations in sequencing platforms used. It does so by comparing random normal-normal samples from separate input datasets and estimating segmental LogRatios for each normal-normal comparison based on read depth and circular binary segmentation. Deviations from the base LogRatio — zero in this case — indicate noise in the data or the presence of germline copy number variations in sample pairs being compared.
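The normal-normal comparison described above can be illustrated with a minimal Python sketch. It is not ENVE's code: the function name `log_ratios`, the per-region depth lists, and the `pseudocount` parameter are assumptions made for illustration, and the circular binary segmentation step that would merge these per-region values into segmental LogRatios is left out.

```python
import math

def log_ratios(depth_a, depth_b, pseudocount=0.5):
    """Per-region log2 ratio of library-size-normalized read depth (A over B).

    Hypothetical helper: depth_a and depth_b are read counts for the same
    ordered list of exome regions in two samples.
    """
    if len(depth_a) != len(depth_b):
        raise ValueError("both samples must cover the same regions")
    total_a = sum(depth_a)
    total_b = sum(depth_b)
    ratios = []
    for a, b in zip(depth_a, depth_b):
        # Normalize by total depth so that a sample sequenced twice as
        # deeply does not look like a genome-wide amplification.
        norm_a = (a + pseudocount) / total_a
        norm_b = (b + pseudocount) / total_b
        ratios.append(math.log2(norm_a / norm_b))
    return ratios

# Two made-up diploid normals that differ only in sequencing depth:
# every LogRatio sits at the baseline of zero.
normal_1 = [100, 200, 150, 50]
normal_2 = [200, 400, 300, 100]
print(log_ratios(normal_1, normal_2, pseudocount=0))
```

In real normal-normal pairs the values scatter around zero rather than landing on it exactly, and that scatter is the noise the first module sets out to model.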
A third submodule in this portion of the system identifies LogRatio deviations associated with noise in the data and personalizes them to specific chromosomes. It does so by dividing each chromosome into non-overlapping 10-kilobase windows and then calculating the frequency of segmental coverage within each window, focusing on segments with absolute LogRatio values that are at or above the noise threshold. Numbers above the preset threshold indicate the presence of germline copy number alterations in the genomic segments, while numbers below the threshold indicate that the observed variation is due to random noise. A fourth submodule generates so-called generalized extreme value (GEV) distribution-based models of noise in the data using the genomic segments from the previous step that fall below the noise threshold.
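The windowing step can be sketched in Python as follows. The details are assumptions rather than anything taken from the paper: segments are represented as hypothetical `(start, end, log_ratio)` tuples with half-open coordinates, and "frequency" is read here as the fraction of normal-normal comparisons in which a window is covered by at least one segment at or above the noise threshold.

```python
WINDOW_SIZE = 10_000  # 10-kilobase windows, as described in the article

def window_frequencies(comparisons, chrom_length, noise_threshold,
                       window_size=WINDOW_SIZE):
    """Fraction of comparisons covering each window with a high-|LogRatio| segment.

    comparisons: one list of (start, end, log_ratio) segments per
    normal-normal comparison, all on the same chromosome (hypothetical layout).
    """
    if not comparisons:
        raise ValueError("need at least one normal-normal comparison")
    n_windows = -(-chrom_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for segments in comparisons:
        hit = set()  # windows touched by this comparison, counted once each
        for start, end, log_ratio in segments:
            if abs(log_ratio) < noise_threshold or end <= start:
                continue
            first = start // window_size
            last = min((end - 1) // window_size, n_windows - 1)
            hit.update(range(first, last + 1))
        for w in hit:
            counts[w] += 1
    return [c / len(comparisons) for c in counts]

# Made-up example: a 50 kb chromosome and two comparisons.
comparisons = [
    [(0, 15_000, 0.6), (30_000, 40_000, 0.1)],
    [(5_000, 9_000, -0.7)],
]
print(window_frequencies(comparisons, 50_000, noise_threshold=0.5))
```

Windows that recur across many comparisons would be candidates for germline copy number variation under this reading, while the remaining segments would feed the noise model.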
The second module essentially repeats the first two steps in module one, but this time applied to tumor-normal pairs. After it generates LogRatio values for the tumor-normal pairs, it then applies the previously-derived GEV model to the data and calculates the probability that a candidate amplification or deletion within a chromosome is due to noise in the data.
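As a rough illustration of how a GEV noise model can turn a LogRatio into a probability, the Python sketch below evaluates the standard GEV cumulative distribution function and reports the upper-tail probability for a tumor-normal segment. The function names, the parameter values `mu`, `sigma` and `xi`, and the `alpha` cutoff are all hypothetical; in practice the parameters would be fitted to the normal-normal noise segments, and ENVE's actual scoring may differ.

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """CDF of the generalized extreme value distribution at x.

    mu is the location, sigma the scale (> 0), xi the shape parameter.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    if abs(xi) < 1e-12:
        s = -z  # Gumbel limit: F(x) = exp(-exp(-z))
    else:
        t = 1.0 + xi * z
        if t <= 0:
            # x lies outside the support: below it if xi > 0, above if xi < 0.
            return 0.0 if xi > 0 else 1.0
        s = -math.log(t) / xi  # exp(s) equals t ** (-1 / xi)
    if s > 700:
        return 0.0  # exp(s) would overflow; the CDF is effectively zero
    return math.exp(-math.exp(s))

def noise_probability(log_ratio, mu, sigma, xi):
    """Chance that noise alone yields a deviation at least this large."""
    return 1.0 - gev_cdf(abs(log_ratio), mu, sigma, xi)

def call_segment(log_ratio, mu, sigma, xi, alpha=0.01):
    """Label a segment using a hypothetical probability cutoff alpha."""
    if noise_probability(log_ratio, mu, sigma, xi) >= alpha:
        return "noise"
    return "amplification" if log_ratio > 0 else "deletion"

# Made-up noise model and three made-up tumor-normal segments.
params = dict(mu=0.1, sigma=0.05, xi=0.1)
for lr in (0.8, -0.8, -0.12):
    print(lr, call_segment(lr, **params))
```

A segment is called only when its deviation from zero is unlikely under the noise model, which is the sense in which no ploidy or tumor-content parameter has to be supplied up front.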
As part of tests to validate and demonstrate the performance of their method, the researchers used ENVE to analyze data from two independent matched tumor/normal whole-exome datasets gleaned from fresh-frozen tissue samples collected from Caucasian and African American individuals with colorectal cancer. Specifically, they compared ENVE's sCNA calls to results from SNP array- and qPCR-based assessments of those same samples. They also compared its results to calls made using the Control-FREEC detection algorithm — purported to be the best algorithm for sCNA detection by a separate review of existing methods also referenced in the paper.
According to results provided in the paper, ENVE showed higher sensitivity and specificity in detecting sCNAs in whole-exome datasets than Control-FREEC. ENVE also showed "higher concordance" with the results provided by both SNP arrays and qPCR-based assessments compared to Control-FREEC.
Next, the researchers used the method to characterize sCNAs found in colon cancers specifically in African Americans, something which no other studies have done previously, according to Guda. Using the method, they were able to identify some focal copy number changes in the genomes that could potentially be linked to colon cancer growth and development in this particular population.
"Our next objective," Guda said, "is to compare even more cancerous colon tissue samples from African American and Caucasian patients, sequenced using the same platform, to confirm these focal copy-number alterations selectively identified in African American colon cancers." Once we have that, we can then focus on figuring out if these copy-number alterations have a role in contributing to the aggressive colon tumor phenotypes in African Americans."
Meanwhile, the researchers have begun developing a second iteration of ENVE that will be able to detect copy number alterations in archival formalin-fixed paraffin-embedded (FFPE) tissue samples. "To date, there are no algorithms available for profiling such alterations in deep sequencing datasets derived from FFPE samples," according to Guda. "We anticipate releasing a newer version of ENVE in the near future that incorporates this module such that researchers would be able to make use of the vast FFPE resources held in hospital pathology archives." He told GenomeWeb that the researchers expect to release the updated version of ENVE in the next seven to eight months. That should give them sufficient time to test the algorithm on real datasets, he said.
They are also reaching out to members of the community and asking them to test drive the ENVE algorithm. Guda said that his team has reached out to researchers at the New York Genome Center and other institutions as well as some companies, but is open to additional testers putting the algorithm through its paces. Although the researchers used the method to study colon cancer, it can be applied to other kinds of cancer. For example, one of the co-authors on the ENVE paper is already using the method to look at data from a breast cancer cohort, Guda said.