Name: Hakon Hakonarson
Position: Director, Center for Applied Genomics, Children's Hospital of Philadelphia
Associate professor of pediatrics, University of Pennsylvania School of Medicine
Experience and education:
Vice president of clinical sciences and other positions, Decode Genetics, 2002-2006
PhD, University of Iceland School of Medicine, 2002
MD, University of Iceland School of Medicine, 1986
As the director of the Center for Applied Genomics at Children's Hospital of Philadelphia, Hakon Hakonarson has led a variety of genome-wide association studies to detect the genetic causes of childhood diseases such as attention-deficit hyperactivity disorder, asthma, autism, diabetes, epilepsy, inflammatory bowel disease, pediatric cancer, and schizophrenia.
The center, which was founded in 2006 and has processed samples from more than 100,000 individuals, is equipped with several genotyping systems, including the Illumina BeadArray and VeraCode technology and the Affymetrix and TaqMan platforms. For its largest genotyping studies, it currently uses the Illumina Infinium assay.
Recently, the center started to bring in high-throughput sequencing, purchasing 15 Applied Biosystems SOLiD sequencers from Life Technologies — a mix of SOLiD 4 and 4hq — which will be installed by the end of the year.
In Sequence spoke with Hakonarson last week to find out how he plans to use the new instruments in his research. Below is an edited version of the conversation.
The Center for Applied Genomics at CHOP just purchased 15 SOLiD instruments. When will they be installed?
We have eight, currently. Some are up and running, and the rest are being installed this week. The capacity of those instruments, which are SOLiD 4s, is quite reasonable. But when the upgrade to the hq happens, that's when we are going to get the other batch of machines, sometime later this year, and that's going to scale it up to a much greater operation.
How are you planning to use these instruments, and how is this work going to build on your human disease genotyping studies?
We have built a very large-scale database with samples from well over 100,000 individuals, about 50,000 children as well as parents and samples from some adult projects we also work on. We have made multiple discoveries and published probably over 100 manuscripts so far, and we have a very large number of regions where we have captured the association, but we now need to do targeted sequencing in order to find the causative variants.
The other focus is to perform a hybrid of exome sequencing and whole-genome sequencing at a relatively low coverage level. Which one we lean more towards is mostly going to be driven by cost efficiency. And once the throughput goes up as we go to hq and more machines later on, we will start doing those genomes at a higher coverage.
The targeted projects will be more focused, and we will do them on a large number of individuals, whereas the exome and the whole-genome projects are more intended for general discovery, so we will not be doing them on a huge number but maybe looking at 100 to 200 individuals or pairs per group.
We then plan to take the variants that we identify and consider genotyping them in the very large number of individuals that we have in our database.
What kinds of DNA capture methods are you currently using?
We have a hybridization-based in-house method that we have been utilizing to some extent, and then we are using Agilent [SureSelect] for the exome. That can give us a hybrid of the exome plus some additional captured targeted regions. And we are also working with RainDance [Technologies], using that for smaller regions up to a few megabases.
Can you describe one specific project as an example?
We have, for example, the autism project, where we have already identified a region with a common variant that associates with autism [This work was published in Nature last year — Ed]. It sits in a very highly conserved region. There are two genes close to a megabase away — those are cadherin genes, which are very strong candidates for autism. We have now captured that whole region, and we are already starting to sequence that, but now, with the additional sequencing capacity, we are doing that on a much greater number of individuals, aiming for about 500 individuals for that particular locus.
In addition to that, we have numerous regions that we have co-captured where we have rare copy-number variants. Each of them is rare — sometimes we found them in only a couple or a handful of families among thousands of individuals that we have studied. So the question is, because we are capturing these CNVs with relatively rough methods — the GWAS method — where we have thousands of base pairs between markers, maybe other individuals will have smaller CNVs that we are missing with the GWAS, and we can now capture them with sequencing. That is one of the hypotheses we have, that there may be different types of variants, which may still have an impact in the same way. Even though they may not knock out the entire gene, they may knock out critical elements of the gene, and we are just not seeing them.
How did you test the SOLiD technology before you decided to purchase 15 instruments?
We had previously analyzed data that is available in the public domain, and we have generated our own data on both the Illumina/Solexa and the Applied Biosystems SOLiD platforms. I think the two technologies are quite comparable for what we do with them today.
The key advantage that I saw with SOLiD, and the reason we went in that direction, was twofold. First, we got a very good deal on it. Second, if we work with mitochondria, which is a very significant component of what we and our collaborators do, if we work with tumor samples to find somatic mutations, and if we do whole genomes at low coverage, then accuracy becomes very critical. And at low coverage, the accuracy, as far as I can tell from data we have analyzed, is better on the SOLiD machines. Once you get to higher coverage, it becomes much less relevant. But if you have low coverage, study somatic changes, or work with mitochondria, this is critical.
The ideal situation would be to have both platforms. That may be the direction that we go in. But we have to start somewhere, and this is where we decided to start, based on various things, and we may move forward with the aim of bringing as many of the available platforms in house as we can. There are companies that are currently working on single-molecule sequencing, which is also an attractive technology to us, but not something that we would want to invest in today, given the state of development.
Why is now the time to scale up significantly on sequencing equipment?
I would have done that sooner if the technology had been more advanced in terms of the ability to deliver, because it's still quite a task to do all the preparations that need to be done, making the libraries and so forth. It's not trivial. But this is improving now at a significant rate as it becomes more automated. It's being automated now to the extent that you can start doing this in a more high-throughput way, whereas it was very labor-intensive last year. And my view on it then was that I got much more for the money by doing dense genotyping than by putting a lot of effort into sequencing, not really knowing exactly what I would be getting.
Now, we have analyzed more data, and more data is being generated, and I think this is a balance: You are still going to continue with extensive genotyping, because there is still a high value in that, but we just need to refine these signals that we have with the capture.
The question is, if exome sequencing or low-coverage sequencing gives you a lot of new information — maybe of rare variants that are more highly impactful than the GWAS variants — that's added value. But there is still quite limited data on that. [The sequencing] technology has been very successful in identifying Mendelian types of variants in families, but no one has really shown yet that this is going to introduce the same wave of results with the complex diseases as GWAS did when GWAS came about. We believe that it will, at least to some extent.
Will you have to scale up your personnel and your IT capacity to store and analyze the data from these sequencers?
When we set up the genotyping lab here a few years ago, we did that in such a way that we built a very large-scale infrastructure. Therefore, we can leverage that today and plug in more storage space, but it's within the same framework. So if we expand from, say, eight to 15 sequencers, or whatever number, it's relatively trivial for us to accommodate that from the IT standpoint, given the infrastructure that we have.
Now, the data analysis, of course, is extremely labor-intensive. We have taken maybe a little shortcut there: we recently published a program called Annovar in Nucleic Acids Research [for the functional annotation of genetic variants from high-throughput sequencing data]. It's a very effective program in terms of variant reduction, if you will: it quickly reduces the number of variants to candidates for being disease-causing or disease-modifying, and we can apply this to multiple whole genomes and get results from screening and scanning through them in a relatively short period of time.
This is not a full analysis of the genome, but it's a very reasonable first-pass analysis to do to capture what may be the low-hanging stuff. We have already successfully identified a few disease mutations, but that's been more in families.
Did you ever consider outsourcing sequencing to service providers, for example BGI Americas or Complete Genomics, instead of building your own sequencing center?
We thought of that. We actually have very good relationships with both of them — we have done samples with Complete Genomics, and we have established a collaboration with BGI. There is also a flow of information that's bidirectional and helps with workflow, informatics, and so forth.
In terms of BGI, they are so big, and even though we have what we are building here, it's going to be very likely that we will be taking on some large-scale collaborations with them, independent of the fact that we have some [sequencing] units here. It really goes both ways. Given that we have this infrastructure and all these samples, I felt it was very difficult to be completely dependent on outside parties with respect to timelines and ways of customizing things.
Where does the funding for this purchase come from?
We are building this on hospital funds that have been allocated to research, with the notion that they are going to create opportunities, for example to gain back grant funding. This is similar to the genotyping effort a few years ago, which obviously was a very major internal investment on behalf of CHOP, but that has resulted in very substantial grant funding coming back, and made a huge impact across the entire institute here, as well as multiple collaborators, in terms of the research results it has generated.
What are you ultimately hoping to gain from this research?
We are one of many programs that are taking this on at a large scale. But I think we have a unique component here, being a pediatric healthcare institute, where we can hopefully generate results from the genotyping and sequencing efforts that will be implemented in the clinic, which was always the overall goal. And that's where other [sequencing or genotyping] platforms come in, such as Ion Torrent and these smaller, customized platforms, that we envision we might be utilizing in the clinic in the relatively near future.
The content that will be utilized there is what we have discovered through our genotyping and sequencing efforts. We would put that on a specific platform — it could be genotyped but it could also be sequenced through a more customized, focused sequencing platform such as Ion Torrent, and there are a couple of other ones that are in development.
Long-term, will this work have an impact on disease treatment?
I think that there is no question that this is going to make a change for the personalized medicine concept, and that persons who have certain variants are going to be treated differently when these variants that we find are in pathways that are targeted by specific drugs. We are already starting to see this on a small scale, but I think you are going to see a wave of this a couple of years from now, once sequencing has delivered a little bit more than it has today.
Will that be applicable to diseases other than cancer?
I would think so. The cancer field is ahead in these terms, largely because they have two genomes, the germline and the [tumor] tissue genome. Most often, when you are working with a disease like autism or schizophrenia, for example, you can't get brain tissue to validate and amplify your results. That's something the cancer field benefits from, and therefore is ahead of most other disease areas. But I think, essentially, all diseases are going to be there in a relatively short period of time.