By Julia Karow
As a first step to build a "human pan-genome," researchers led by the Beijing Genomics Institute in Shenzhen have generated de novo assemblies for two human genomes, based entirely on data from Illumina's Genome Analyzer.
The genomes each contain about 5 megabases of novel sequences — some with potentially functional coding regions — that are not represented in the NCBI human reference genome. An analysis of the two genomes was published online in Nature Biotechnology this week and a detailed description of the assemblies is currently in press.
The results show not only that human genomes contain novel sequences that are specific both to populations and to individuals, but also that "combining individual-specific sequences with shared core human genome sequences will enable the creation of a human pan-genome that will be important for the better understanding of personal genomes and their use in medical genomics studies," said Jun Wang, executive director of BGI, via e-mail.
A complete human pan-genome, the researchers estimate, would contain approximately 19 to 40 megabases of novel sequence that is not currently contained in the reference genome and would require generating "extensive sequencing data," according to Wang. Array-based technologies that depend on the current reference genome would not be able to produce these data, he added.
With further technology improvements, the authors write, "sequencing is becoming a practical and affordable method for analyzing a large number of complete human genomes, making it feasible to establish a more comprehensive understanding of the human genome, to make discoveries in medical genomics, and to develop new applications for personalized medicine."
Wang said that BGI plans to sequence and assemble additional genomes from the Asian population but did not provide further details.
For Illumina, the study likely serves as validation that its technology is suited not only for resequencing applications but also for de novo sequencing of large genomes. Already, "we are seeing a big uptake of de novo sequencing by our customers," Jeremy Preston, marketing manager for sequencing at Illumina, told In Sequence, adding that such projects are becoming easier with the 2x100 base paired-end reads the company recently launched for the Genome Analyzer. "I think for a long time, Illumina was perceived as a short-read company, and people have this vision that short reads are 35 base pairs," he said.
For their published study, BGI researchers assembled the genomes of an African HapMap sample, NA18507, and of an Asian individual, YH, from 35-base pair reads. Both genomes, sequenced using Illumina's platform, were published a year ago by researchers at Illumina and BGI, respectively (see In Sequence 11/8/2008), but those were resequencing projects, where the teams mapped their reads to the NCBI reference genome.
Using the same sequence reads originally generated for last year's projects, and, in the case of YH, additional paired-end reads, the scientists now generated de novo assemblies, using an algorithm called SOAPdenovo that they developed in house and that is optimized for Illumina data. A more detailed description of the algorithm as well as the assembly results is in press at Genome Research, according to the paper. According to Wang, the assembly takes about two days on eight CPUs.
In total, they assembled about 118 gigabases of existing sequence data and 82.5 gigabases of new paired-end reads, with library insert sizes ranging from 200 base pairs to 9.6 kilobases, for the YH genome, and 135 gigabases of existing data for the HapMap genome.
The total assembled sequence size is 2.87 gigabases for sample YH, with an N50 scaffold size of 446.3 kilobases and an N50 contig size of 7.4 kilobases. For sample NA18507, the total assembled sequence size is 2.68 gigabases, with an N50 scaffold size of 61.9 kilobases and an N50 contig size of 6 kilobases.
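For readers unfamiliar with the N50 statistic quoted for both assemblies, the following is a minimal sketch of how it is commonly computed: sort the contig (or scaffold) lengths from longest to shortest and report the length at which the running total first reaches half of all assembled bases. The function name `n50` and the toy lengths are illustrative assumptions, not code or data from the BGI study, and SOAPdenovo's own reporting may differ in detail.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    account for at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in base pairs (not data from the study).
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))  # 80: the two longest contigs cover 180 of 300 bases
```

A higher N50 means that more of the assembly sits in long pieces, which is why the scaffold N50 values reported above are much larger than the contig N50 values.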
After aligning the scaffolds against the NCBI human reference genome, the scientists found approximately 7,000 sequences, covering about 5 megabases in total, in each genome that were absent from the reference. However, more than 80 percent of these novel sequences could be found in the Watson genome, the Venter genomes, or in human clones deposited in GenBank that did not make it into the reference genome.
They also found that the frequency of the novel sequences differs between human populations worldwide, showing distinct clusters.
Comparing the YH and NA18507 genomes directly to each other, they found that they differed in at least 8 megabases of sequence, consisting of 4 million SNPs and 4 megabases of individual-specific sequences.
There were fewer such differences — 1.8 megabases of individual-specific sequences — between YH and the genome of a recently sequenced Korean individual (see In Sequence 5/28/2009) because they belong to more closely related populations.
Given the differences they found, the researchers estimated that a complete human pan-genome will have between 19 and 40 megabases of novel sequences not included in the current reference genome.
These might be functionally significant: the BGI team found that the novel sequences in YH and NA18507 contained dozens of human NCBI RefSeq genes, or parts of them, that are not contained in the NCBI reference genome. Most of those genes are hypothetical and have no known functions, and another large fraction belongs to highly variable gene families.
"I strongly agree that de novo assembly is essential for the analysis of human genomes, as well as other genomes, and the authors are to be commended for pushing this approach while most other groups are focusing solely on resequencing," said Michael Egholm, CTO and vice president of R&D at 454 Life Sciences. He said that 454, too, "can now routinely assemble human-sized genomes," but no one has published such an assembly in a scientific journal yet. Recently, researchers in Norway presented results for an assembly of the 800-megabase cod genome from 454 data (see In Sequence 11/3/2009).
According to Deanna Church, a staff scientist at the National Center for Biotechnology Information and a co-founder of the Genome Reference Consortium, "this is a great first step towards identifying sequences that are missing and need to be represented in the reference, but not the final step." In the absence of details of the two assemblies, however, it is difficult to assess how useful they will be, she said.
"We know there is human sequence not represented in [human genome version] GRCh37. Finding that data, incorporating it into the assembly, and presenting it to users in a coherent fashion is a critical next step in genome biology," Church told In Sequence. "De novo assembly that produces long, contiguous scaffolds/contigs in the absence of a reference assembly will be incredibly useful for understanding the genome."