By Monica Heger
This article was originally published Aug. 26.
Three years into a project to improve and complete the human and mouse reference genomes, the Genome Reference Consortium is beginning to incorporate next-gen sequencing into its efforts, is homing in on medically relevant portions of the human genome, and has added the zebrafish to the project. The consortium is also collaborating with other researchers to make use of technology like optical mapping to close gaps and validate assemblies.
As whole-genome sequencing becomes more common, the need for a high-quality human reference genome that takes into account variation — including both single-base polymorphisms and structural variation — is proving critical, particularly for clinical applications.
"Improvement of the human reference assembly is critical as we move towards an era of clinical and personal genomics," consortium members wrote in a review of the project in PLoS Biology last month. High quality assemblies "most accurately capture all forms of human genetic variation and facilitate investigation of human disease in model organisms."
The consortium includes researchers from the National Center for Biotechnology Information, the Wellcome Trust Sanger Institute, the Genome Institute at Washington University, and the European Bioinformatics Institute.
Transitioning to Next-Gen Sequencing
Recently, members of the GRC have started using next-gen sequencing to aid in the consortium's efforts. The Wellcome Trust Sanger Institute is now using Illumina as its primary sequencing platform for the zebrafish reference, and recently began using it for its clone sequencing efforts on the human and mouse genomes.
Katherine Auger, project coordinator for the GRC at the Sanger Institute, said the transition hasn't been without issues. For instance, the team had to adopt a PCR-free protocol for the zebrafish sequencing. The organism is very AT-rich, and the PCR step in the library preparation introduced too much bias, she said.
However, the increased throughput from the Illumina platform makes it worthwhile. Currently, the team is multiplexing 24 clones in one lane and plans to increase the multiplexing to 96 clones per sequencing lane.
This year, the Sanger team has also begun using Illumina for its clone sequencing needs on the mouse and human genomes, although it is still using capillary sequencing to finish the clone assemblies, to close some gaps, and to confirm sequence.
"We're finding [the Illumina platform] doesn't catch everything because of the short reads and short inserts," Auger said, and added that they are also looking to incorporate either a 454 or a Pacific Biosciences machine for longer reads.
"The plan is, because Illumina is quite a good platform to do a first pass, to run everything through to see how it assembles. Then we can run it through the 454 or PacBio," she said.
The team is evaluating both the 454 and the PacBio, and will decide on one or the other in the coming months.
Doing a combined assembly poses its own set of challenges, however. "You have two different data types and they have different ways to do assembly — different parameters, different insert sizes, different read lengths — and that poses some difficulty," Auger said.
Meanwhile, the sequencing team at Washington University's Genome Institute has begun using the 454 platform for clone sequencing and is looking to also implement Illumina, and possibly PacBio, said Tina Graves, leader of the reference genomes group at Wash U.
While switching to a next-gen platform has helped reduce costs and increase throughput, it has come with a number of challenges. For instance, Graves said, in a case where the team was resequencing an area of the human reference genome that had been found to be a mixed haplotype, pooled clone sequencing on the 454 proved unable to parse repeats.
"Some clones were repetitive with each other, so we couldn't sort out all the clones," said Graves. "We couldn't know just by sequencing on the 454 whether a repeat went to clone A or clone B." So, the researchers had to sequence the clones individually using capillary sequencing.
"For clone-based sequencing, we try to put as much as we can on the 454 and are working toward the Illumina, but we have to know whether the clones are going to be repetitive, or whether we can put enough variation in one pool so we don't have clones that overlap," Graves added.
Medically Relevant Patch Updates
One function of the GRC is to release quarterly "patch updates," said Deanna Church, a staff scientist at the NCBI. Rather than a full assembly update, a patch update provides the correct sequence and location without interrupting the assembly. "It is a local sequence that corresponds to that region, but it's outside the chromosome. We align it to the chromosome so you understand where it goes, but it doesn't disrupt the chromosome coordinates," she said.
It acts as a compromise between the two types of users of the reference genome, said Church.
Many of the users are doing "whole genome analysis and value chromosome stability," said Church. These users don't want frequent major updates because "remapping the annotation and redoing all the assembly is very time consuming." On the other hand, some users are more interested in individual loci, so if a locus is wrong in the assembly, they don't want to wait for the assembly to be updated.
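The coordinate stability Church describes can be illustrated with a minimal Python sketch. It is not the GRC's data model: the record type, field names, patch name, and all positions below are hypothetical and chosen only for illustration. The idea is that a patch is stored apart from the chromosome along with the location where it aligns, so a corrected sequence of a different length does not move anything downstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchPlacement:
    """Hypothetical record of where a patch aligns on a chromosome."""
    patch_name: str    # illustrative label, not a real accession
    chrom: str
    chrom_start: int   # 0-based start of the aligned chromosome region
    chrom_end: int     # end of the aligned region (exclusive)
    patch_length: int  # length of the patch's own corrected sequence


def overlaps_patch(placement, chrom, pos):
    """True if a chromosome position lies in the region the patch covers."""
    return (chrom == placement.chrom
            and placement.chrom_start <= pos < placement.chrom_end)


def shift_if_spliced_in(placement):
    """Offset that downstream coordinates WOULD suffer if the patch
    replaced the region in place instead of being stored separately."""
    return placement.patch_length - (placement.chrom_end - placement.chrom_start)


# Made-up numbers: a 62 kb patch aligned to a 50 kb chromosome region.
placement = PatchPlacement("example_patch", "chr17", 1_000_000, 1_050_000, 62_000)

# A locus inside the region can be redirected to the patch sequence...
locus_is_patched = overlaps_patch(placement, "chr17", 1_020_000)

# ...while a downstream annotation keeps its coordinate, because the
# patch lives outside the chromosome. Splicing it in would shift it.
avoided_shift = shift_if_spliced_in(placement)
```

In this toy case, a user interested in the locus can consult the patch right away, while a whole-genome user avoids remapping every annotation downstream of the region by 12,000 bases.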
"One of the things we're trying to focus on [for the patch updates] is regions we think are phenotypically important or medically important," Church said.
For instance, the team recently released a patch update of a region called CCL3L1 on chromosome 17. The consortium found that the region, which is associated with several different phenotypes, was misassembled on the chromosome.
For instance, "there is some controversy about whether it's associated with HIV infectivity, and possibly rheumatoid arthritis. So, we thought it was important to get the correct sequence out, so people could try to follow up with these assertions," Church said.
The team is currently working on the 1q21 region, which is associated with a range of developmental disorders in children and has also been linked to multiple myeloma. The GRC hopes to release a patch update either late this year or early next year.
The region is especially tricky because it is highly repetitive and also highly polymorphic. "Any given person, even on their own two chromosomes, will probably have a different structure," Church said. "That makes it very challenging to work with."
Church said the team has a single-haplotype BAC resource that it is using to help work through the region. While the researchers are still struggling with the repetitiveness of the region, the resource helps eliminate "haplotypic differences that you might see from the different alleles in a library that come from a diploid source."
Error Correcting
Regions that are highly repetitive and/or highly polymorphic are the most likely to contain either misassemblies or gaps. Sequencing technology, although it is improving, still has trouble dealing with such regions.
As such, the GRC has looked to other technology, such as optical mapping, to help close gaps and identify misassemblies in these regions.
The team has been working closely with David Schwartz, using his lab's optical mapping technology, to identify regions of both the human and mouse genomes that have been misassembled and to determine the size of gaps.
Optical mapping involves first stretching out a single molecule of DNA on a glass surface, and then using a restriction enzyme to cut the DNA and a fluorescent dye to label it, producing an ordered set of fragments known as an optical map.
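To make the idea concrete, here is a minimal Python sketch of the computational counterpart: an in silico digest that predicts the ordered fragment lengths a given sequence should produce, which can then be compared with the lengths measured on an optical map. This is an illustration, not the Schwartz lab's software. The function name and toy sequence are invented for the example, and the cut is placed at the start of each recognition site for simplicity (the real EcoRI enzyme cuts inside its GAATTC site).

```python
def restriction_fragments(sequence, site):
    """Return ordered fragment lengths after cutting at every `site`.

    Simplifying assumption: the cut falls at the first base of each
    recognition site rather than at the enzyme's true cut position.
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start)
        start = sequence.find(site, start + 1)

    # Fragment boundaries: sequence start, every cut, sequence end.
    boundaries = [0] + cuts + [len(sequence)]
    # Skip zero-length pieces (for example, a site at position 0).
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


# Toy sequence containing two EcoRI sites (GAATTC).
example_sequence = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAATTC" + "TT"
fragments = restriction_fragments(example_sequence, "GAATTC")
```

A sequence with no recognition site comes back as a single fragment, so a map of that region would carry no internal landmarks to compare against an assembly.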
Last year, Schwartz's team used his optical mapping technology to evaluate the gaps in Build 35 of the human genome, and for most of those gaps was able to determine the breadth of the gap and whether the gap was real. In addition, in the assemblies themselves, "we can pick up misjoins, places where there should be gaps, and just about every issue you can think of outside of correcting a single base," he said.
In addition, Schwartz tracks problems with the reference genome that researchers report to the GRC, and then uses optical maps to focus on the problem area. That helps identify supposed gaps that are not actually there, or clarify the size of a gap, he said.
Additionally, because optical mapping is a different technology than sequencing, it has proven to be an important tool to "have confidence that your assembly is correct," Church said. "Any non-sequencing based technology that can help confirm or deny that a region is assembled correctly," is important.
However, optical mapping is only effective if there are a sufficient number of cut sites in the region of the genome that needs to be assembled. So, the GRC is also using linkage maps, fingerprinting, and transcript sequencing to validate assemblies. The RefSeqGene database has been particularly useful for validating transcript sequences, Church said.
Another common assembly error involves polymorphic regions, particularly areas with structural variation, Church said. Individuals and groups doing whole-genome analysis specifically to look for structural variation, such as the 1000 Genomes Project and Evan Eichler's group at the University of Washington, have proven instrumental in helping to identify problem areas in the reference genome, she said.
"If you're looking for structural variation and every person is variant for a region, that's very suggestive that that region of the assembly, if not misassembled, is a very rare allele," Church said. The consortium can then go look at those specific regions.
Finally, problems also arise because most of the assembly was done from a diploid source. For example, the original assembly of the MAPT region on chromosome 17 came from the library of a single individual. However, there is a known inversion at that locus, and it turned out that the individual was heterozygous for the two alleles, which caused the original assembly in the reference to represent a mixed haplotype.
Only by going back and working with Eichler's group to do haplotyping of the region was the discrepancy resolved, allowing both haplotypes to be represented.
"Since most libraries are made from an individual, and that means two haplotypes, a lot of the misassemblies are caused by places where you can get haplotypic diversity, and that confounds the assembly," Church said.
Building a Pan-Genome
While de novo assembly from next-generation sequencing is continuing to improve and many researchers are aiming to eventually do de novo assembly for every whole-genome sequencing project, most researchers currently rely on alignment to a reference genome.
Figuring out the remaining gaps, misassemblies, and alternate loci will be critical for the correct alignment of whole genomes generated by next-gen sequencing technology.
In the PLoS Biology paper, the consortium evaluated the impact of its recent updates on aligning next-gen sequencing data. The authors chose two samples from the 1000 Genomes Project to align to the reference with and without incorporating the alternate loci. Without the alternate loci, they found that two-thirds of the alternate-locus-specific reads were misaligned.
Additionally, even as de novo whole-genome assembly becomes more common, a reference genome that takes into account the full extent of human variation will still provide value, Church said.
The goal, she said, is to create essentially a pan-genome, with population-specific variation. This will be useful for identifying within an individual the specific population or ethnicity a particular sequence comes from. For instance, she said, researchers could obtain a particular sequence, align it to the reference genome, determine what population it is from, and then look at individual assemblies from that specific population.
How much variation to include in the reference genome is still an open question. Currently, Church said, the team is "somewhat liberal about the variation we include because the reads are still short and alignment is still hard," but as technology evolves and as the understanding of variation evolves, that could change.
Have topics you'd like to see covered by In Sequence? Contact the editor at mheger [at] genomeweb [.] com.