This article has been updated to clarify that the current human reference assembly has regions with more than one haplotype.
A consortium of researchers from Penn State University, Nanyang Technological University in Singapore, Roche 454, the National Center for Biotechnology Information, Genentech, the Children's Hospital Research Institute, and the Genome Institute at Washington University has sequenced one of the 20 donors to the Human Genome Project using several next-gen sequencing platforms, and is working on a de novo assembly of his genome in order to improve the human reference assembly.
Stephan Schuster, who spearheads the project, presented preliminary results and a draft assembly of the genome, from donor RP11, for the first time during a workshop sponsored by Roche two weeks ago at the Advances in Genome Biology and Technology meeting in Marco Island, Fla.
The Human Genome Project originally collected samples from 20 anonymous donors from Buffalo, New York, but for reasons likely related to the quality of the BAC libraries made from their DNA, a single donor – RP11 – accounts for about 72 percent of the human reference.
The current version of the human reference assembly, GRCh37.p11, despite being the most complete and accurate genome to date, still has multiple gaps and ambiguities. It is also largely haploid, although it has many regions where more than one haplotype is represented.
An RP11 genome assembly could help close some of these gaps and correct single-base errors, according to Deanna Church, a staff scientist at the NCBI who is involved in the project and is a member of the Genome Reference Consortium that updates the human reference assembly.
In addition, RP11 could potentially provide phasing information for a diploid reference genome. "It's always been a dream of the field to have a way of resolving the two chromosomes in their entirety," Schuster, who holds appointments at both Penn State and Nanyang Technological University, told In Sequence.
He said he also likes that the human reference can be used to validate the accuracy and completeness of the RP11 assembly, which sets this project apart from others, for example the National Institute of Standards and Technology's "Genome in a Bottle" (Clinical Sequencing News 9/5/2012). While that effort has similar goals in terms of creating a high-quality human assembly, he said, it "lacks the ability to validate the sequence." However, Church, who is involved in both projects, said that RP11 would not be a good candidate for the NIST project because there is no cell line for RP11, so the amount of DNA available is limited.
According to Schuster, the RP11 project originated as a collaboration between his lab and Roche/454, although it now also includes large amounts of Illumina sequence data. He obtained additional IRB approval for sequencing RP11, he said, even though IRB approval existed for this sample from the days of the Human Genome Project.
In a previous collaboration with 454, Schuster's group at Penn State generated a de novo assembly of the genome of an African bushman from 454 data alone (IS 2/23/2010), which included contigs and scaffolds that were not part of the human reference.
Genentech, which is owned by Roche, also participates in the RP11 project, providing a collection of 800 sequenced human genomes from different disease cohorts for comparison against the RP11 genome, for example to assess whether DNA not present in the human reference does exist in other human genomes. Pharmaceutical companies like Genentech have "an enormous interest" in obtaining the most complete and accurate reference genome possible, Schuster said, allowing them to make disease associations that they otherwise could not.
So far, 454's service center has generated 20x GS FLX+ fragment read data for the project, as well as 2x GS FLX+ mate pair data from libraries ranging in size from 3 kilobases to 10 kilobases. Schuster said the quality of those data is extremely high and most of the runs had a modal read length of 950 base pairs, with many reads exceeding 1,000 base pairs.
Schuster's own laboratories have generated 34x Illumina HiSeq 100-base paired-end data as well as 22x MiSeq 250-base paired-end data. From the MiSeq data, they stitched together paired-end reads into single contiguous reads of about 450 base pairs, which have an average error rate of 0.6 percent and an error rate of around 1 percent in the overlap area of the two reads.
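To illustrate what stitching means here: two 250-base reads that merge into a read of about 450 bases must overlap by roughly 50 bases. The Python sketch below shows the general idea on toy data and is not the software the consortium used, which the article does not name. The function names, the `min_overlap` default, and the mismatch budget are all assumptions chosen for illustration.

```python
# Toy sketch of paired-end read stitching; NOT the consortium's actual tool.
# All names and thresholds below are illustrative assumptions.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def stitch_pair(read1: str, read2: str, min_overlap: int = 10,
                max_mismatch_rate: float = 0.05):
    """Merge a forward read and its mate into one sequence, or return None.

    read2 is given as sequenced (from the opposite strand), so it is
    reverse-complemented first. The longest suffix of read1 that matches a
    prefix of the flipped mate within the mismatch budget is the overlap.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for overlap in range(longest, min_overlap - 1, -1):
        tail, head = read1[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_rate * overlap:
            # Keep read1 through the overlap, then append the rest of the mate.
            return read1 + mate[overlap:]
    return None


# A 30-base toy fragment read from both ends with 20-base reads,
# so the two reads share the middle 10 bases.
fragment = "ACGTTGCAAGGCTTAACCGGATATCGCGTA"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
stitched = stitch_pair(read1, read2)
```

In this toy case `stitched` equals the original fragment. Production read mergers also weigh base quality scores when the two reads disagree in the overlap, which this sketch leaves out.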
In addition, researchers from CHORI have produced about 750,000 40-kilobase fosmid clones, pooled into 96 libraries that each represent about 10 percent of the diploid genome, which Schuster's lab sequenced on the HiSeq, generating another 102x of data.
WashU's Genome Institute has also contributed HiSeq reads, including 120x 100-base paired-end data.
For a draft assembly of the RP11 genome – called RP11_0.7 – the researchers used only a subset of the data, 16.5x of the 454 FLX+ shotgun reads, 1.8x of the 454 FLX+ mate pair data with 5-kilobase inserts, and 7.5x of the stitched MiSeq reads, because the assembly software – 454's Newbler – cannot handle more data at the moment.
A second draft assembly, called RP11_1.0, is currently in progress – it will include additional 454 FLX+ shotgun reads, MiSeq stitched reads, as well as HiSeq fosmid reads.
But even this preliminary assembly, Schuster said, resulted in the best de novo assembly of a human genome from next-gen sequencing data to date, judged by the total number of bases in contigs — 2.813 gigabases — and the contig N50, 127 kilobases. The next best de novo assembly is that of Schuster's bushman genome, KB1, followed by the ALLPATHS-LG assembly and the Chinese YH1 assembly, he said.
However, the RP11 assembly still lags behind the GRCh37.p11 Sanger assembly, which has 2.861 gigabases in contigs and a contig N50 of 46.4 megabases. Craig Venter's HuRef assembly, also generated from Sanger data, has 2.809 gigabases in contigs and a contig N50 of 107 kilobases.
Both Sanger assemblies have a better scaffold N50 — 46.4 megabases for GRCh37.p11 and 19.5 megabases for HuRef — than RP11, which has a scaffold N50 of 4.6 megabases.
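For readers unfamiliar with the metric, N50 is the length such that contigs (or scaffolds) of at least that length together contain half of all assembled bases. A minimal Python sketch of the calculation follows. The function name and the toy lengths are illustrative assumptions and are not taken from any of the assemblies discussed.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one sequence length")


# Toy lengths (hypothetical): total is 54, so half is 27.
# 10 + 9 + 8 = 27 reaches the halfway point, giving an N50 of 8.
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))  # 8
```

Because the metric is dominated by the longest sequences, two assemblies with nearly the same total bases in contigs can have very different N50 values.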
Importantly, RP11_0.7 has already improved 89 of the 223 internal gaps in the reference sequence, closing 32 gaps, providing scaffolds that span 44 gaps, and shortening 13 gaps by more than 20 kilobases, and Schuster believes later assemblies will close even more gaps.
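The gap figures can be checked with simple arithmetic. The short Python sketch below uses only the counts quoted above; the variable names are illustrative.

```python
# Gap counts as quoted in the article for the RP11_0.7 draft assembly.
gaps_total = 223
improved = {
    "closed": 32,
    "spanned by scaffolds": 44,
    "shortened by more than 20 kb": 13,
}

improved_total = sum(improved.values())
share = improved_total / gaps_total

print(f"{improved_total} of {gaps_total} gaps improved ({share:.1%})")
for outcome, count in improved.items():
    print(f"  {outcome}: {count} ({count / gaps_total:.1%})")
```

The three categories sum to the 89 improved gaps, or about 40 percent of the 223 internal gaps.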
Some of these gaps contain genes with potential roles in disease. For example, the assembly provided sequence data for a gene involved in cancer, ECSCR. While there was evidence for this gene from RNA-seq data, it is not contained in the current reference assembly, and any resequencing studies that map against the reference would miss it. "You cannot map against holes," Schuster said. "Now, we're fixing these things. Many people might go back and map their old data again and identify variants that are specific for their cohorts."
Besides closing gaps, the researchers were able to assemble the two haplotypes correctly in "problematic" regions of the reference genome, Schuster said, for example the H1 haplotype in a region of chromosome 17.
Researchers at 454 are currently working on a new version of the Newbler assembler that will hopefully be able to use all the sequence data generated for the project, Schuster said. The consortium might also partner with other groups to use additional software, for example the Phusion assembler, he added.
"This is a great project for folks interested in performing assemblies," said Church. Because so much data is available for RP11, sometimes for both haplotypes, "we can really evaluate how the assembler is performing."
While it would be interesting to include other types of data in the project — Illumina's Moleculo long synthetic reads and Pacific Biosciences' long single-molecule reads would be "logical candidates," Schuster said — the consortium wants to "guard this very precious DNA that is left" from the RP11 sample, so "if there is a huge advance in sequencing technology, with 100-kilobase accurate reads or something, we have enough material left so this can be made the most complete genome."
Long read technologies may help with phasing the RP11 genome, Church said, and they could be useful for resolving the haplotypes in regions where RP11 appears to be heterozygous for complex structural variations.
The current goal of the project is to generate an assembly of the RP11 genome that includes all sequence data available by mid-year, along with publications that describe the genome and "the degree of completeness that can now be obtained," Schuster said.
This timeline would allow the information to be included in the next major release of the human reference assembly. According to Church, the Genome Reference Consortium plans a data freeze for August, and the new assembly will be submitted to GenBank by early fall.
Church said the GRC team at NCBI is aligning the RP11 read data to the reference assembly in order to identify potential single-base errors in the reference. They will also align the RP11 assembly to the reference to find novel sequences. Technologies such as optical sequencing might be useful for validating the RP11 sequence, she added.
RP11 remains anonymous, but he is of mixed ancestry, and Schuster's team is studying this further. Most of his genome is of European background, Schuster said, but his ancestry also appears to be about 30 percent African Yoruban.