Researchers from the Children's Hospital of Philadelphia and the University of Pennsylvania have released a database of copy number variations present in thousands of healthy individuals.
The resource, called the Copy Number Variation Project, joins a growing list of CNV databases, including the Database of Genomic Variants hosted by the Hospital for Sick Children in Toronto and the Wellcome Trust Sanger Institute's DECIPHER (Database of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources).
What differentiates CHOP's data collection from these other databases is that it includes information from a single study that was recently published in Genome Research and contains only data from healthy individuals who were screened with a single microarray platform. Because of this, researchers can use CHOP's data as controls in their search for pathogenic variants, according to its developers.
"Our data is [the] most uniform data set you can find," said Tamim Shaikh, an assistant professor in the pediatrics department at CHOP and lead author on the Genome Research paper. "All samples have been analyzed in the same lab on the same platform and with the same analysis tools," he told BioArray News last week.
Shaikh noted that DGV's data comes from a variety of sources, including bacterial artificial chromosome arrays, as well as Affymetrix, Illumina, and Agilent chips. DGV is a "compendium of studies from all over the world," he said. "Ours is based on a single study."
Hákon Hákonarson, director of CHOP's Center for Applied Genetics, led the effort to compile the database (see related Q&A, this issue). Specifically, the CHOP team used Illumina HumanHap550 BeadChips to look for structural variants in 2,026 individuals, including both children and parents. About 65 percent of participants were Caucasian and 34 percent were African-American.
On average, the team found that healthy individuals carried dozens of CNVs. Altogether, their search revealed 54,462 variants. Of these, 77.8 percent were found in more than one unrelated individual, and 51.5 percent had not been detected in previous studies. Nearly three-quarters of the remaining non-unique variants overlapped with entries in the DGV.
CHOP has already sent its data to the Hospital for Sick Children to be made available through the DGV. Shaikh stressed that it is not CHOP's ambition to build a database to rival the DGV, but simply to make its data available.
"Our goal was not to compete against the DGV," Shaikh said. "We are all for integration. The question is, will others want to display their data in our database? That's up to them."
One reason researchers like Shaikh need a dataset of healthy controls is that genome-wide screens often produce dozens of variants that could be associated with a particular disease. "You find a lot of variants with a genome-wide approach," Shaikh said. "The biggest problem has been that, on average, we find between 25 and 30 CNVs per person, but we don't know which one is pathogenic."
Using CHOP's database, researchers would still find just as many variants per individual, but could narrow in on the potentially pathogenic CNVs. "Say that I have 30 CNVs in a patient, but 29 are found in normal controls," Shaikh said. "That means the 29 are less likely to cause phenotype."
He cautioned that, given the "way [the] field is moving," one can "never be sure what is causing phenotype and not," but said the database puts researchers in a stronger position to make a call on what may be causing a particular disorder or disease.
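The filtering Shaikh describes can be illustrated with a minimal sketch. It is not CHOP's analysis pipeline: the `CNV` record, the function names, the coordinates, and the 50 percent reciprocal-overlap threshold are all assumptions chosen for illustration. A patient CNV is set aside when it substantially overlaps a same-type CNV seen in healthy controls, leaving a shorter list of candidates for follow-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CNV:
    """Hypothetical CNV record; coordinates are inclusive base positions."""
    chrom: str
    start: int
    end: int
    kind: str  # "gain" or "loss"


def reciprocal_overlap(a: CNV, b: CNV) -> float:
    """Return the smaller of the two overlap fractions (0.0 if no overlap)."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    len_a = a.end - a.start + 1
    len_b = b.end - b.start + 1
    return min(shared / len_a, shared / len_b)


def candidate_cnvs(patient, controls, min_overlap=0.5):
    """Keep patient CNVs that match no control CNV.

    min_overlap is an assumed threshold, not a value from the study.
    """
    return [
        p for p in patient
        if not any(reciprocal_overlap(p, c) >= min_overlap for c in controls)
    ]


# Made-up coordinates, for illustration only.
controls = [CNV("chr1", 1000, 2000, "loss"), CNV("chr2", 5000, 9000, "gain")]
patient = [CNV("chr1", 1100, 2050, "loss"), CNV("chr7", 300, 800, "gain")]

remaining = candidate_cnvs(patient, controls)
print(remaining)  # only the chr7 gain is left as a candidate
```

As Shaikh's caution suggests, a CNV that survives this filter is only a candidate, and one that is filtered out is less likely, not certain, to be benign.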
One potential issue regarding the CHOP database is how healthy its controls actually are. "We don't know what these kids will develop once they get older with regards to common and complex disease types," Shaikh said. "These kids will probably be followed for a long time at CHOP," he added. "We can collect data and see if they develop something because they carry a certain CNV."
Hákonarson agreed. "To take this dataset and reference it with respect to adult diseases is also of interest," Hákonarson said. "It will tell you if these CNVs are neutral or if they are protective or predictive."
Hákonarson said that CHOP's goal is to construct a database that contains longitudinal information on the same individuals over time. As that database grows, researchers will be able to compare the data to the appearance of various diseases. Using the database to "look at diseases and problems in older people, even healthy old people, will be very informative in my view," Hákonarson said. "You can assess the impact of pathogenic CNVs on disease risk."
Additionally, Shaikh said that CHOP is currently analyzing other datasets that it has generated at the Center for Applied Genetics to determine whether to add them to the database. "The more you put in, the more rare CNVs become frequent CNVs," he said.
Shaikh said that CAG will continue to generate most of its data on the Illumina platform to maintain "consistency" between datasets. He suggested that there should also be databases of healthy controls for researchers using platforms from Affymetrix and Agilent Technologies for CNV studies.
"Every platform is different, every analysis tool is different," Shaikh said. "It would make things a lot easier."