Name: Deanna Church
Position: Staff scientist, National Center for Biotechnology Information, Bethesda, Md., since 1999
Experience and Education:
Postdoctoral fellowship, Mt. Sinai Hospital, Toronto, 1997-99
PhD in Biological Sciences, University of California, Irvine, 1997
BA in Liberal Arts, University of Virginia, 1990
Deanna Church is a staff scientist at the National Center for Biotechnology Information, where she coordinates clone-based assembly management, the NCBI Map Viewer, the NCBI structural variation database dbVar, and CloneFinder. She is also involved in various projects involving eukaryotic genome assembly, annotation, and data representation. A year ago, Church and colleagues from the Wellcome Trust Sanger Institute, the Genome Center at Washington University in St. Louis, and the European Bioinformatics Institute founded the Genome Reference Consortium, which aims to complete the human reference genome and improve its representation.
In Sequence spoke with Church recently to get an update on the project.
What has the Genome Reference Consortium achieved since it was launched a year ago?
Over the past year, we have spent a large amount of time putting operating procedures into place and identifying the regions of the human genome that needed either additional representation, because they are very complex, or correction, because there was an error in the previous version of the assembly.
All of that work led to a recent release of a new version of the human genome that's referred to as GRCh37.
How does GRCh37 differ from the previous version?
In addition to some issues being corrected, I would say that one of the big changes that we implemented was trying to formalize this notion of having alternate loci.
One of the problems that we now have a greater appreciation for, that I don't think we really did when the human genome project was initiated, was the level of complex variation that we see between individuals. We know that there are regions of the genome that can vary so much between two individuals that we really cannot annotate the differences. These are large-scale insertions, deletions, inversions, or blocks of substitutions, these sorts of things. And if we cannot annotate the variation, we really need to instantiate a separate sequence to represent that variation. Instead of thinking about the genome as a golden path — which I think has been a popular way to think about it, just one path to the final chromosome — we now have regions of the genome for which you might have alternate paths.
For instance, at the [major histocompatibility complex], we actually now have seven different haplotypes that are represented. There is one path that's represented in the reference chromosome, but you can, at the MHC, substitute the path that was chosen for the MHC and see what other representations there are. It's a much better way to start capturing diversity, and I think it's going to be important, especially in the context of things like the 1000 Genomes Project, where a lot of short-read sequences are being generated. For some of those short reads, if they were just aligned to the reference chromosome, you might not get a good alignment, because the individual that was sequenced might actually have a haplotype more related to one of the alternates.
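The "alternate path" idea above can be sketched in a few lines of code. This is a purely illustrative toy, not GRC software: the region coordinates, haplotype names, and sequences are all hypothetical, and real alternate loci are full sequence records with their own accessions and alignments to the chromosome.

```python
# Minimal sketch of alternate loci: the reference chromosome carries one
# haplotype through a complex region, and an alternate locus can be swapped
# in for that span. All names, coordinates, and sequences are hypothetical.

from dataclasses import dataclass

@dataclass
class AltLocus:
    name: str        # e.g. an alternate MHC-like haplotype
    sequence: str    # sequence that replaces the reference span

@dataclass
class Region:
    start: int       # 0-based start of the region on the chromosome
    end: int         # exclusive end of the region
    alternates: dict # haplotype name -> AltLocus

def substitute_alt(chromosome: str, region: Region, alt_name: str) -> str:
    """Return the chromosome with the reference span replaced by an alternate."""
    alt = region.alternates[alt_name]
    return chromosome[:region.start] + alt.sequence + chromosome[region.end:]

# Toy example: a 20-bp "chromosome" with one complex region (positions 5-10).
chrom = "AAAAACCCCCGGGGGTTTTT"
complex_region = Region(5, 10, {
    "hap2": AltLocus("hap2", "TTT"),         # alternates may differ in length
    "hap3": AltLocus("hap3", "CCAACCAACC"),
})

print(substitute_alt(chrom, complex_region, "hap2"))  # AAAAATTTGGGGGTTTTT
```

Note that the alternates need not match the length of the reference span, which is exactly why large insertions and deletions cannot simply be annotated onto a single golden path.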
How complex can it get? What is the greatest number of alternate loci you currently have?
For the MHC, since it's such a well-defined locus, you have a lot of alternates because people have been working on trying to sort the haplotypes there because of their clinical significance.
You can think about the complexity in multiple ways. You can think about the number of alternate locus contigs that we have, but you could also think about the number of regions for which we have alternate representation. Right now, we have three regions for which there are alternate representations: MHC has the most options, but the other two regions are actually fairly important because there is clinical significance for both, and they also both represent regions for which we had a mixed haplotype in the previous version of the assembly. One of these is on chromosome 4 around the UGT2B17 locus, the other is on chromosome 17 at the MAPT locus.
We really cannot take credit for fixing these regions, because they were actually looked at by other individuals, and they published papers on these regions, which we have referenced on our website, so you can see who did the work. These people were kind enough to contribute the information to us, so we could correct both the tiling paths on the reference chromosome and also generate a robust alternate locus, so that both haplotypes could be represented.
To what extent will the representation of variation in the GRC overlap with structural variation databases?
The difference, I would say, in what we have versus what a lot of these large-scale structural variation studies provide is that because we have sequence, we have base-pair resolution of the differences. I anticipate getting better base-pair resolution out of things like the 1000 Genomes Project, but most of the currently published data is largely array-based or paired-end sequencing, so you don't have a whole lot of base-pair resolution.
And you have the mechanism by which you can think about generating alternate representations of the reference chromosome. We don't have software in place for that yet, but it's one of the things we are thinking about building. This is work that we are doing some research and development on right now, in terms of thinking about how we could use the alternates to allow people to generate the chromosome that they wish to see. That type of software is a little bit in the future, but it's certainly one of the requests that has been made with respect to dealing with this data a little bit better.
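The kind of tool Church describes — letting a user generate the chromosome they wish to see — does not yet exist, but its core operation can be sketched as applying a set of chosen alternate spans to the reference. This is a hypothetical illustration, not GRC software; a real implementation would also have to remap annotations and coordinates onto the customized sequence.

```python
# Sketch of generating a user-chosen chromosome: apply a set of
# (start, end, alt_seq) replacements to the reference. Hypothetical code;
# regions and sequences are toy values.

def build_custom_chromosome(reference: str, choices: list) -> str:
    """Apply non-overlapping (start, end, alt_seq) replacements.

    Replacements are applied right-to-left so that earlier substitutions
    (which may change the sequence length) do not shift the coordinates
    of replacements further to the left.
    """
    for start, end, alt_seq in sorted(choices, reverse=True):
        reference = reference[:start] + alt_seq + reference[end:]
    return reference

ref = "ACGT" * 5                           # 20-bp toy reference
choices = [(2, 6, "NNNN"), (12, 14, "X")]  # two hypothetical alternate spans
print(build_custom_chromosome(ref, choices))  # ACNNNNGTACGTXGTACGT
```

Applying the replacements right-to-left is the key design point: because alternate loci can differ in length from the reference span, left-to-right application would invalidate the coordinates of every region downstream of the first substitution.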
I think it's a paradigm shift in terms of how people think about the reference assembly because it really means we cannot just think about reference chromosomes anymore, you need to start thinking about ways to also understand the alternate loci.
Do we need more de novo-assembled human genomes to provide more alternate loci for the reference?
There is a technology and a cost problem. We have used, for instance, the Venter assembly to compare to the reference assembly and get information, and we certainly do get useful information from that sort of work.
However, the difficulty that we face is that many of these regions that are sufficiently complex that we need an alternate representation are also challenging to sequence. They are duplicated, they are problematic, so what we really need, and where we still have to spend a fair bit of time, is kind of the old-school clone-based sequencing. We are certainly looking at ways that we can sequence the clones using new generation technology, but at this point, these regions are sufficiently complex that they tend to not be well represented in de novo assemblies, at least with most of the current assemblers.
I'm certainly not advocating against more sequencing of individuals, but we still can't ignore the fact that we need to do the more laborious, manual, clone-based work in some of these complex regions. At least until there is some new [sequencing] technology, long 40-kilobase reads, something along those lines.
What role have new sequencing technologies played so far in closing gaps in the human genome?
The Broad Institute has done a lot of work in using some of the next-gen technology to close some of the gaps that had remained on chromosomes that they had been involved in, and they kindly shared all of that information, so that we could use it in order to make a better version of the genome. We are happy to take information from where we can get it at this point.
Are you using new sequencing technologies to close more gaps now as part of the GRC, to go through them systematically?
We are systematically looking at all different sorts of technologies to try [to] close the remaining gaps. All available technology is on the table right now for us to try … and we certainly closed several gaps in this version of the human assembly.
Some groups that have sequenced human genomes using second-generation technologies have said that they have discovered novel sequence that is not represented in the reference genome. To what extent will this information help improve the reference further?
I completely believe that if you sequence other individuals, you will find sequences that are not currently represented in the reference assembly. However, doing a de novo assembly of a million genomes is still not technically feasible with next-gen, especially short-read technologies.
There have been some attempts that I am aware of that when you get reads that don't align to the reference assembly, you might be able to get some information about the rough area that they might belong to based on paired-end analysis, and you can do some de novo assembly of those reads and get some sequence information. We are very interested in looking at that sort of data, because we know there are regions in the current reference right now for which we are representing a deletion allele. So, the assembly is absolutely correct. It's just that the allele that was chosen for that region was a deletion allele.
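The strategy described — pooling reads that fail to align to the reference and assembling them de novo — can be illustrated with a toy greedy overlap assembler. This is only a sketch of the principle; real assemblers handle sequencing error, repeats, reverse complements, and paired-end constraints, none of which appear here.

```python
# Toy greedy overlap assembly: merge unaligned reads into a longer contig
# by repeatedly joining the pair with the largest suffix-prefix overlap.
# Illustrative only; reads here are error-free and same-strand.

def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads: list) -> str:
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:  # no overlaps left above the threshold; stop merging
            break
        merged = reads[i] + reads[j][k:]
        reads = [r for n, r in enumerate(reads) if n not in (i, j)] + [merged]
    return max(reads, key=len)

# Reads drawn from the hidden toy sequence "ATTAGACCTG"
print(greedy_assemble(["ATTAG", "TAGACC", "ACCTG"]))  # ATTAGACCTG
```

In the scenario Church describes, the resulting contigs would then be placed back into a rough genomic neighborhood using the paired-end information from mates that did align to the reference.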
We certainly want to get better representation if there is sequence that is missing, but we will need to get that data into public databases, because one of the requirements that we have is that any sequence that's used to generate the public assembly has to be available in a public database like GenBank, EMBL, DDBJ, the Short Read Archive, or Trace.
What are your goals for the next year, and where do you see the next technological advancements coming from that might help you?
We are working very hard right now on the mouse genome. We are doing some analysis right now for getting an improved version of the mouse assembly. We are continuing to work on [the] human [genome] to both make improvements but to also define these regions that need additional representation. That's a considerable amount of work. We are potentially going to add other organisms to the group, but I don't know if I am at liberty to say which ones yet.
As far as new technologies go, I certainly am interested in any work that — even using current technologies — is an attempt to close gaps. Although, I have to say, I am probably most excited about some of the new, longer-read sequencing technologies that are not quite available yet. I think they hold a fair bit of promise with respect to improving our ability to represent some of these regions.