By Monica Heger
Researchers from the University of Washington recently characterized human genome sequences and copy number polymorphic insertions that have been missing from the human reference genome.
The sequence data, published in a study this week in Nature Methods, makes for a more complete reference genome. Characterizing structural variation, and in particular new insertions, has been difficult because the reference genome is a mosaic of individual genomes. So, in some areas it includes rare structural configurations, while in other areas it omits common sequences.
In addition, the team compared the data to a recent de novo assembly from short-read sequence data, highlighting the limitations and advantages of newer assembly methods from next-generation sequence data.
The researchers studied end-sequenced DNA clones from nine HapMap individuals: four Yorubans, two Europeans, one Japanese, one Han-Chinese, and one of unknown ethnicity. They searched for sequences that did not align to the reference genome and then, using fluorescence in situ hybridization (FISH) analysis and oligonucleotide arrays, predicted the location of those sequences and assessed whether they showed copy number polymorphism.
They then completely sequenced 156 of the new insertions, which identified new exons and noncoding regions not included in the reference genome. Additionally, one of the Yoruban genomes had been previously sequenced to 30-fold coverage on the Illumina Genome Analyzer and de novo assembled by a group from BGI in China, so the researchers were able to compare their results to the de novo assembly.
Typically, human genome sequencing studies "generate large numbers of short sequence reads and map those on to the existing reference," said Jeff Kidd, lead author of the study and now a postdoctoral fellow at Stanford University. "Now, we're finding new sequences, and adding them to the reference to make it more complete."
Jun Wang, executive director of BGI, told In Sequence in an e-mail that the University of Washington team "tackled the most difficult parts of the human genome," and that the findings "serve an important step in the long way to achieve a perfect reference genome."
The team first fragmented the DNA, and then subcloned 40-kilobase segments. They sequenced both ends of each fragment using capillary sequencing, generating 9.7 million end-sequence pairs, and mapped the clones to the reference genome. They identified 44,415 high-quality sequences that did not map to the reference, and assembled those sequences into 2,363 contigs with an N50 of 1,148 base pairs. The contigs corresponded to 720 loci ranging from 1 to 20 kilobase pairs in length. Of those, 400 fell within euchromatin, and 320 could not be assigned a position.
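For readers unfamiliar with the N50 statistic cited above, it is the contig length at which contigs of that length or longer account for at least half of all assembled bases. The short Python sketch below illustrates the standard calculation; the function name and the toy contig lengths are made up for illustration and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in base pairs)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    # Walk from the longest contig down, accumulating bases,
    # until at least half of the total assembly is covered.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths, not data from the paper.
toy_contigs = [2000, 1500, 1200, 900, 400]
print(n50(toy_contigs))  # prints 1500
```

In the toy example, the two longest contigs together cover 3,500 of 6,000 bases, so the N50 is 1,500 base pairs.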
FISH analysis enabled the researchers to predict the locations of the contigs. The analysis results "indicate that megabases of uncharacterized sequence remain within heterochromatin and euchromatin-heterochromatin transition regions of the human genome, but they also confirm the presence of missing euchromatic sequences that are copy number polymorphic," the authors wrote.
The team also designed oligonucleotide arrays to assess copy number polymorphism in the new sequences. They first tested sequences that could not be assembled into contigs. Of the sequences that their array could detect, 31 percent were copy number polymorphic. Of the sequences that could be assembled into contigs, 37 percent were copy number polymorphic.
Using capillary sequencing, the team was able to completely sequence 156 new insertion sequences, identifying novel exons and noncoding regions not annotated in the reference genome.
For instance, they showed that some of the common insertions differed in allele frequency among populations. In African populations, for example, the average insertion allele frequency for variable loci was significantly greater than among Europeans or Asians, which the authors wrote was a "pattern suggestive of either selection or genetic drift since the migration of humans out of Africa."
One of the Yoruban genomes they analyzed had also been sequenced and de novo assembled by a group from BGI, so the researchers were able to compare that approach to their clone-based approach. They found that many of the new contigs were only partially represented in the de novo assembly, and that more than one-third of the contigs were fragmented.
Kidd said that there were advantages to both methods. The BGI group achieved "a much higher coverage, so they found a lot of things we didn't find. But, in terms of getting the structure and assembly in regions where there was repetitive sequence or duplicate sequence, the [de novo] assembly was not as complete or accurate," he said.
BGI's Wang noted that his group sequenced and de novo assembled the Yoruban genome about two years ago and that next-generation technology has improved since that time.
The current paper shows next-generation sequencing's "low capability to deal with retrotransposon regions," he said. "This could be potentially solved by improved paired-end sequencing protocols, longer reads, and more powerful data-mining in bioinformatics; or even by short-read sequencing and de novo assembly clone-to-clone."
In the future, Wang added, "we do not think capillary sequencing will be unnecessary, but researchers should be able to solve most of the problems by new-generation sequencing technologies."
Kidd said that the group's next steps will be to continue to characterize the novel sequences, particularly in highly repetitive or duplicated regions. He said they would also incorporate next-generation data from newer whole-genome sequencing projects.
"Regions where there are larger segments or segments that are more variable or highly polymorphic are probably good segments to go after to make sure that we have all the sequence," he said.
Wang added that while the work contributed novel insertions to the reference genome, an important question to keep in mind going forward is how crucial having a reference genome will be.
For the time being, he said, having a high-quality reference genome is important. But, "as the sequencing costs continuously go down, the necessity is decreasing." Advances in assembly technology and further cost reductions could eventually allow researchers to assemble an appropriate reference genome de novo for each sequencing study, he said.
"And, even a perfect reference genome does not indicate other human genomes could be completely deciphered by mapping-based approaches, as structural variations and individual-specific sequences require assembly approaches to be properly identified," Wang said. "Therefore, we think the future form of reference genome may need to be adapted to variability, such as a graph-structured pan-genome," which BGI is currently working to build and which aims to include sequences specific to individuals as well as common shared sequences (IS 12/8/2009).