Researchers at the Children’s Hospital of Philadelphia have developed a software tool that they claim more accurately identifies disease-causing copy number variations in genetic association studies than existing software such as CNVtools and PLINK.
According to a paper published in a recent issue of Nucleic Acids Research, CHOP’s ParseCNV automatically corrects for varying CNV lengths; flags genomic features that can produce misleading calls, such as segmental duplications and regions of high or low guanine-cytosine content; and includes quality-tracking information for filtering confident calls.
The CHOP team said that current CNV association algorithms do not address these limitations, making ParseCNV of particular benefit for researchers doing CNV-based genetic association studies, whether these are case-control-, quantitative trait-, or family-based studies.
In the NAR paper, they focus primarily on the software's application to disease-based case-control studies, where researchers compare samples from healthy and sick individuals and look for telltale differences in how CNVs are overrepresented or underrepresented.
"One person may have a 60-kilobase deletion, while another may have a 100-kilobase deletion. That may determine the difference between a healthy state versus disease," Hakon Hakonarson, director of CHOP’s Center for Applied Genomics and an author on the paper, explained.
Because of these variations, other CNV detection software "may misread the boundary of a CNV region, which could lead to a misclassification and result in false-positive or false-negative associations," he said.
In an interview with BioInform this week, Hakonarson explained further that a particular CNV might be present in the intronic region of a gene with no impact on its activity in the control population. However, within the case population, that CNV might straddle both the intronic and exonic regions of the gene with far more harmful consequences because of its impact on the exon.
Many software programs will “interpret the intronic CNVs in the same way as the CNVs that affect the exons,” failing to note the difference between the two, which is problematic if the CNV impacts the disease condition being studied, he said.
Joseph Glessner, a doctoral student at CHOP and a co-author on the NAR paper, further noted that many CNV detection methods were initially developed to analyze specific regions of the genome rather than conduct genome-wide studies focused on locating CNVs, “which we need to do.”
ParseCNV is the culmination of several years of CHOP research into copy number variation and disease association, Hakonarson told BioInform.
He said that his group has published over 30 research papers on the subject, many of which have looked for CNVs associated with neuropsychiatric and neurodevelopmental disorders like autism, attention deficit hyperactivity disorder, and schizophrenia.
While conducting these studies, he said, the researchers found that they had to run multiple quality checks on their data to ensure that they'd accounted for all possible sources of error in the analysis.
That led to the development of ParseCNV, which gathers all these checks into a single program that can “analyze genome-wide SNP data and basically hone in on regions of highest interest” in an automated fashion, he said.
The program “collapses the information from the individual SNP data into copy number variation regions, which provide much more accurate information about actual potential disease association and flag all of these problematic issues that many [current] programs are missing,” he explained.
According to the NAR paper, ParseCNV generates “probe-based statistics for CNV occurrence” using CNV calls generated by variation detection programs such as PennCNV, and then summarizes them based on CNV regions. It then checks for false-positive calls caused by segmental duplications, high or low guanine-cytosine content regions, and CNV regions with high population frequency.
Basically, ParseCNV “segments … different sub-regions of the genome and [then] looks for specific sub-regions that show up repeatedly in cases but not in controls,” Glessner explained to BioInform.
It accepts CNV calls generated by external software tools such as PennCNV, and then “decomposes them … into SNP-based statistics,” he said.
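The decomposition step Glessner describes can be pictured with a short sketch. This is not ParseCNV's code: the function names, the tuple layout for calls, and the choice of a two-sided Fisher's exact test are all assumptions made for illustration. The idea is simply to count, at each probe, how many cases and how many controls carry a call covering that probe, and then test whether the two groups differ.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of seeing x carriers in the first row
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def per_probe_pvalues(probes, calls, n_cases, n_controls):
    """Hypothetical helper: turn segment-level calls into per-probe statistics.

    probes: probe positions on one chromosome
    calls:  (sample_id, is_case, start, end) tuples for one CNV type
    Returns (position, case_carriers, control_carriers, p_value) per probe.
    """
    results = []
    for pos in probes:
        cases = {s for s, is_case, start, end in calls if is_case and start <= pos <= end}
        ctrls = {s for s, is_case, start, end in calls if not is_case and start <= pos <= end}
        a, c = len(cases), len(ctrls)
        p = fisher_exact_two_sided(a, n_cases - a, c, n_controls - c)
        results.append((pos, a, c, p))
    return results

# Toy data (invented): three case deletions of different lengths, one control call
example_calls = [
    ("case1", True, 150, 350),
    ("case2", True, 180, 320),
    ("case3", True, 190, 420),
    ("ctrl1", False, 50, 120),
]
example_stats = per_probe_pvalues([100, 200, 300, 400], example_calls, 3, 3)
```

Because the statistic is computed probe by probe, calls with different boundaries in different samples still contribute to the same probes wherever they overlap, which is the situation Hakonarson describes with deletions of varying length.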
Then, “we reduce redundancy by merging together p-values that are similar in neighboring SNPs” and output “the most significant p-values for specific genomic regions where CNVs are recurrently found in cases and at a lower frequency in controls,” he said.
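As a rough illustration of that merging step, the sketch below collapses runs of neighboring probes into regions when their p-values are close. The article does not say how ParseCNV defines "similar," so the threshold used here (adjacent p-values within one order of magnitude) and the function name are assumptions, not the published method.

```python
import math

def merge_similar_neighbors(probe_stats, max_log10_diff=1.0):
    """Collapse adjacent probes with similar p-values into regions.

    probe_stats: (position, p_value) pairs sorted by position on one
                 chromosome; p-values must be greater than zero.
    max_log10_diff: assumed similarity threshold, in orders of magnitude.
    Returns (start, end, best_p, n_probes) tuples in genomic order.
    """
    regions = []  # each entry: [start, end, best_p, n_probes, last_p]
    for pos, p in probe_stats:
        if regions:
            last_p = regions[-1][4]
            if abs(math.log10(p) - math.log10(last_p)) <= max_log10_diff:
                # Extend the current region and keep its smallest p-value
                region = regions[-1]
                region[1] = pos
                region[2] = min(region[2], p)
                region[3] += 1
                region[4] = p
                continue
        regions.append([pos, pos, p, 1, p])
    return [tuple(r[:4]) for r in regions]

# Toy input (invented values): two strongly associated probes flanked by noise
example_regions = merge_similar_neighbors(
    [(100, 0.8), (200, 0.001), (300, 0.002), (400, 0.5)]
)
# Rank regions so the most significant one comes first
ranked_regions = sorted(example_regions, key=lambda r: r[2])
```

Each merged region carries its best p-value and the number of probes supporting it, so a short region is reported with that information attached instead of being discarded by a size cutoff.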
Merging the p-values means that users don't have to be "so stringent on upfront quality metrics," which is one of the limitations of current programs and can often lead to false calls, he said.
These programs might only include calls that have a "minimum of 20 SNPs or [those that are] larger than 100 kb" and exclude all others, he explained. "By doing all of the quality tracking in ParseCNV, we avoid that limitation and you have those features captured and tracked in the association process so you are not forced to do an up-front quality exclusion."
According to the NAR paper, once it's completed its analysis, ParseCNV provides users with a report that includes the p-value and odds ratios for each CNV region, as well as contributing sample IDs, their copy number states, closest gene, gene description, pathway, and the average number of probes underlying contributing CNV calls.
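For readers unfamiliar with the odds ratio in such a report, a minimal sketch of the standard calculation follows. The function name is hypothetical, and the half-count adjustment for empty cells (the Haldane-Anscombe correction) is a common convention assumed here; the article does not say how ParseCNV handles that case.

```python
def odds_ratio(case_carriers, n_cases, control_carriers, n_controls):
    """Odds ratio for carrying a CNV in cases versus controls."""
    a, b = case_carriers, n_cases - case_carriers
    c, d = control_carriers, n_controls - control_carriers
    if 0 in (a, b, c, d):
        # Assumed convention: add 0.5 to every cell to avoid division by zero
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return (a * d) / (b * c)
```

An odds ratio above 1 indicates that the CNV region is seen more often in cases than in controls, and a value below 1 indicates the reverse.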
The NAR paper also provides evidence of ParseCNV's improved accuracy. According to the researchers, it has successfully called 90 percent of CNVs in several studies, with those calls later validated by PCR. Hakonarson and Glessner told BioInform that a "reasonable estimate" for conventional CNV algorithms is roughly 50 percent call accuracy.
Commenting on the paper, Ioannis Ragoussis, head of genomics research at the Wellcome Trust Center for Human Genetics, noted that while there is a lot of research focused on whole-exome and genome sequencing, "there is a vast amount of data out there that, if mined properly, can provide further validation and replication."
ParseCNV's approach, he said, does offer some "significant" advances. One of these is that it deals with the problem of data artifacts by taking into account "all possible scenarios of CNV segment overlaps between samples in order to define CNV regions and distinguish them between cases and controls."
False positives, in particular, "are a major problem as a lot of work is wasted in validation efforts," he said in an email to BioInform.
Also on a positive note, ParseCNV provides a fully integrated pipeline that has "CNV call[s], QC, and filtering, as well as statistical methods to identify associations," he said.
The CHOP researchers said they've used ParseCNV internally since they first developed the algorithm in 2008, to explore CNVs associated with conditions like obesity and autism. One project, published in Nature in 2009, was a whole-genome CNV case-control study on a cohort of 859 autism spectrum disorder cases and 1,409 controls. The researchers in that study found some genes involved in neuronal cell-adhesion and ubiquitin pathways that were enriched for CNVs in the cases but not the controls.
Another case-control study, published in a 2010 issue of the Proceedings of the National Academy of Sciences, focused on schizophrenia and analyzed data from 977 schizophrenia cases and 2,000 healthy adults. This study found that a family of genes involved in synaptic transmission was "notably enriched" for CNVs in the case population.