BioArray Q&A: CHOP's Hakonarson on Building a Map of Human Copy Number Variation


By Justin Petrone

Name: Hákon Hákonarson

Title: Director, Center for Applied Genomics, Children's Hospital of Philadelphia

Professional background: 2006-present, director, Center for Applied Genomics, Children's Hospital of Philadelphia; 2003-2006, vice president of clinical services, Decode Genetics, Reykjavik, Iceland; 1998-2003, director of respiratory, inflammatory, and pharmacogenomics research, Decode Genetics; 1995-1998, assistant professor of pediatrics, CHOP and University of Pennsylvania

Education: 2002, PhD, University of Iceland School of Medicine; 1986, MD, University of Iceland School of Medicine; 1980, MS, biology and mathematics, Graduate College of Akureyri, Iceland.

A new database of copy number variants found in healthy individuals is just the first release of a resource that is expected to continue growing, according to its developers at the Center for Applied Genomics at the Children's Hospital of Philadelphia (see related story, this issue).

Under the leadership of Hákon Hákonarson, the CAG has conducted tens of thousands of genome-wide screens, using mainly Illumina arrays, to assemble a large database of copy number variation that it can use to better resolve the relationship between rare genetic variation and human diseases and disorders.

With the ultimate goal of genotyping 100,000 individuals and following them over time, CAG hopes to use its database not only to better inform its own research, but to make all its data available to the public. Moreover, with the construction of CHOP's $400 million Colket Translational Research Building, expected to open later this year, CAG plans to drive more of its discoveries into clinical use.

BioArray News spoke with Hákonarson this week to learn more about all of these endeavors. Below is an edited version of that interview.

When did you join CHOP?

Exactly three years ago. Our center became operational in July 2006 and we recruited our first patients in the fall of 2006. I worked as a scientist and officer at Decode for about eight years before, and prior to that I was at CHOP, where I trained and subsequently became an assistant professor in pediatrics, mostly doing research at the time. The focus then was candidate genes in asthma and other airway allergic diseases. We didn't have tools to go beyond that at the time. Research was mostly done in animal models and cell-based assay work. Decode was building a database in the mid to late '90s using a family linkage approach, so that was attractive at the time.

Why did you decide to assemble this database?

The objective from day one was to generate a high-density CNV map in controls and to make that dataset available to the public. We started off with these 2,000 individuals back in 2007. Since then, we have recruited another few thousand controls and have thought of expanding that with a larger dataset. The versions of the chips we use keep changing and we felt it would make most sense to publish data from a single version in this paper and then have a follow-up paper with more depth later on. We also have large datasets in autism and diabetes and several other disease areas that are pending publications. Our goal is to make all these datasets available to the public once we have completed our initial analyses.

How does this dataset aid your research and how could it aid the research of others?

The most valuable use for us is two-fold. One is in the area of clinical cytogenomics where we receive samples from patients with suspected abnormalities. To have a reference database of thousands of healthy individuals is very valuable. You can take any CNV you observe in a child with developmental delay or other problems and cross-reference it against this database. If you see the same CNV in healthy individuals, it's less likely to be pathogenic. Obviously, if there is some overlap, it might persuade you to put more weight on a gene or CNV that is not present in the control. That dataset gives us valuable clinical information.
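The cross-referencing step described above can be sketched in a few lines of Python. This is only an illustration and not CAG's actual pipeline: the `CNV` record, the 50 percent reciprocal-overlap threshold, and the sample identifiers are all assumptions introduced here, and the coordinates are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CNV:
    chrom: str
    start: int  # inclusive
    end: int    # exclusive
    kind: str   # "del" or "dup"


def reciprocal_overlap(a: CNV, b: CNV) -> float:
    """Smaller of the two fractions of each CNV covered by the shared region."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def control_frequency(query, control_calls, n_controls, min_overlap=0.5):
    """Fraction of control individuals carrying a CNV that matches the query.

    control_calls is a list of (sample_id, CNV) pairs; n_controls is the
    total number of controls screened, including those with no matching call.
    The 0.5 threshold is an illustrative assumption.
    """
    carriers = {
        sample_id
        for sample_id, cnv in control_calls
        if reciprocal_overlap(query, cnv) >= min_overlap
    }
    return len(carriers) / n_controls


# Hypothetical data: three calls from a control set of four individuals.
control_calls = [
    ("c1", CNV("chr1", 100, 200, "del")),
    ("c2", CNV("chr1", 120, 210, "del")),
    ("c3", CNV("chr1", 100, 200, "dup")),
]
patient_cnv = CNV("chr1", 110, 205, "del")
print(control_frequency(patient_cnv, control_calls, n_controls=4))
```

A CNV seen at appreciable frequency among healthy controls would be weighted as less likely to be pathogenic, while one absent from the controls would merit a closer look.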

The other aspect is research. While these children are healthy today, many are likely to develop diseases later in life. To take this dataset and reference this with respect to adult diseases is also of interest. It will tell you if these CNVs are neutral or if they are protective or predictive of a disease.

How will you track these patients through life?

Our goal is to construct our database with longitudinal data from collection of ongoing data and information. If the parent has given us permission for an update, and most do, then we have the ability to collect longitudinal data on these kids. The database will therefore continue to accumulate data. You can also improve on the CNV analyses, and they will be more comprehensive once we have done sequencing on the samples as there will be more CNVs that come up in that context. And to compare that to various diseases and problems in older people, even healthy old people, could be very informative in my view. You can assess the impact of pathogenic CNVs on disease risk.

I was informed that you may eventually screen up to 100,000 people for these purposes.

We have genotyped on the high-density array platform somewhere around 90,000 individuals today. There are about 50,000 children, the rest are parents, and we have a few smaller adult projects. Our goal is to build this database with samples from about 100,000 children. Having parent samples is important, too. It allows you to assess events that are de novo or possibly inherited. We expect this database will be completed in about two years. Once we are up to 100,000, even rare events become common. You will have better ways of assessing what CNVs mean. Then there will be between 10,000 and 15,000 individuals in the database who are completely healthy as children — they will be a powerful control reference database. We will also have an array of values to assess pathogenicity of changes as well as for any common variant.
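The role of parent samples in separating de novo from inherited events can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the center's method: CNVs are plain `(chrom, start, end, kind)` tuples, a parental call counts as a match if it covers at least half of the child's CNV (an arbitrary threshold chosen here), and the example coordinates are invented.

```python
def covers(child_cnv, parent_cnv, min_fraction=0.5):
    """True if parent_cnv covers at least min_fraction of child_cnv.

    Each CNV is (chrom, start, end, kind), with start inclusive and
    end exclusive; kind is "del" or "dup".
    """
    if child_cnv[0] != parent_cnv[0] or child_cnv[3] != parent_cnv[3]:
        return False
    shared = min(child_cnv[2], parent_cnv[2]) - max(child_cnv[1], parent_cnv[1])
    return shared > 0 and shared / (child_cnv[2] - child_cnv[1]) >= min_fraction


def classify_origin(child_cnv, mother_cnvs, father_cnvs):
    """Label a child's CNV by which parent, if any, carries a matching call."""
    in_mother = any(covers(child_cnv, m) for m in mother_cnvs)
    in_father = any(covers(child_cnv, f) for f in father_cnvs)
    if in_mother and in_father:
        return "inherited (either parent)"
    if in_mother:
        return "inherited (maternal)"
    if in_father:
        return "inherited (paternal)"
    # Absence in both parents suggests, but does not prove, a de novo event;
    # a missed parental call would produce the same label.
    return "apparently de novo"


# Hypothetical trio
mother = [("chr2", 900, 5200, "del")]
father = []
print(classify_origin(("chr2", 1000, 5000, "del"), mother, father))
print(classify_origin(("chr7", 100, 900, "dup"), mother, father))
```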

You used Illumina arrays to construct this database. Why have you used these chips?

We built a fairly sizeable operation here three years ago with the Illumina platform. We have been using the HumanHap550 BeadChip and we are now using the Human 610-Quad BeadChip for most of our projects. There was one other version we used in between, called the 550-Duo, and we have fairly good consistency in SNP content across those platforms. We have an Affymetrix platform as well, and we have run several projects on that platform too, and we have done validation of cases of all individuals we type on the Illumina platform on the Affy platform and vice versa because these are different technologies. This has worked well for us. Our goal is basically to genotype on whatever platform gives us the most cost-effective way of analyzing data. At the time, we felt Illumina was most attractive for that purpose. Since then, Affy released the 6.0 chip, which is also highly valuable in the context of analyzing CNVs as you have close to a million SNPs and 900,000 CNV probes on that platform. You can analyze samples for both common variants and CNVs on both of these platforms and I prefer them over the various CGH platforms [that] only give you intensity data. It has made more sense for us to keep going with the same platform for consistency.

There has been some recent debate on the value of the information coming out of genome-wide association studies. What is your take?

I think there is no question that the GWAS approach has transformed the field of complex disease research with respect to biology. We have opened up multiple novel pathways that give the pharma industry tremendous power to design new and most likely more effective drugs because they could fix or normalize the underlying biological perturbations — not just treat symptoms as most current drugs do. Based on information from these studies, they will be able to build gene networks within each individual disease area and come up with the most attractive target [for intervention]. This was not possible before GWAS.

For diagnostic purposes, in most instances the value of GWAS is not close to being of meaningful clinical relevance today. We still know only a small fraction of common risk behind genetic diseases. Until we know more, we will not have meaningful diagnostics for complex disorders. We could also keep in mind that many common variants may be tracking rare variants. That's why many resequencing efforts focusing on specific loci have not come up with any new information. Once you have identified rare variants, you will have something more beneficial for diagnostics. Sequencing beyond the loci is what you need for that purpose. This is too costly today to be able to do on a large scale, but that will change.

You mentioned sequencing. How do you plan to integrate next-gen tools into your research?

Most of the efforts so far have been on targeting the loci that we have identified, either with common variants or altered copy number states, and to sequence both cases and controls in comparison. The 1000 Genomes Project will be very valuable in giving us controls. Most projects will be achievable by sequencing enough cases only to generate sufficient haplotype diversity across the genomic loci of interest, and then taking those variants and genotyping them using a custom array in a large cohort of individuals. That is where I expect most labs will be heading. Once the cost of sequencing comes down more, we can technically sequence all of these controls and get better resolution of copy number state in these individuals. Currently, we are picking out CNVs that are a few kilobases in size or bigger.

You have published on a variety of disorders and diseases, from autism to diabetes. What is your main focus?

At CHOP we have put together a large-scale effort to investigate any common complex disease, including autism, asthma, ADHD, diabetes, cancer, inflammatory bowel disease, and many others. Once we have reached our target of 100,000 patients, we will be able to address the underlying causes for the vast majority of them in a genetic-genomic context. That is the main goal, of dissecting out the key major variants that are responsible for or rooted in the biology of these diseases. Then we will have more focused information on these diseases for our translational efforts.

How will these findings be translated to the clinic?

Currently, we are moving forward on those discoveries we have made with the ultimate goal of translating those into the clinic. CHOP is building a new research building set to open later this year. The focus of activities within that building will be translational medicine. Our efforts will funnel into that translational program. There will be additional arrays, such as expression arrays and methylation arrays, that will be used in that translational process, all of which will be guided by the genotyping work we are doing here.