This article has been corrected to note that the genome sequenced is that of Seong-Jin Kim, the project leader, not that of an unidentified donor.
Researchers at Gachon University of Medicine and Science in Korea and at the Korean BioInformation Center have sequenced the genome of a Korean individual and have compared it to other recently sequenced genomes from different ethnic backgrounds.
Although it is most closely related to the Han Chinese genome published last year, the Korean genome differs significantly from that genome as well as from the other genomes, suggesting that genetic differences between individuals, even among members of similar ethnic groups, may be large.
The researchers published their results online last week in Genome Research.
For their project, they generated more than 80 gigabases, or almost 30-fold coverage, of paired-end sequence data from the DNA of Seong-Jin Kim, director of the Lee Gil Ya Cancer and Diabetes Institute at the university, using an Illumina Genome Analyzer. Kim is also the project leader.
Using three paired-end libraries with insert sizes of 100, 200, and 300 bases, the researchers generated 1.75 billion reads, about two-thirds of which are 36-base paired reads and the remainder 75-base paired reads. They first released the sequence data last December.
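The reported yield and coverage can be sanity-checked with simple arithmetic. The following is a minimal sketch, not the authors' calculation: it takes the read count and read lengths from the article, treats "about two-thirds" as exactly two-thirds, and assumes a haploid human genome size of roughly 3.1 billion bases, a figure that does not appear in the article.

```python
# Back-of-the-envelope check of sequencing yield and fold coverage.
TOTAL_READS = 1.75e9   # from the article
FRACTION_36 = 2 / 3    # "about two-thirds" of reads are 36 bases (treated as exact here)
GENOME_SIZE = 3.1e9    # ASSUMPTION: approximate haploid human genome size, in bases


def total_bases(n_reads, frac_short, short_len=36, long_len=75):
    """Total sequenced bases for a mix of short and long reads."""
    return n_reads * (frac_short * short_len + (1 - frac_short) * long_len)


def fold_coverage(bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return bases / genome_size


bases = total_bases(TOTAL_READS, FRACTION_36)
coverage = fold_coverage(bases, GENOME_SIZE)
print(f"{bases / 1e9:.1f} gigabases, {coverage:.1f}-fold coverage")
```

Under these assumptions the result is roughly 86 gigabases and about 28-fold coverage, consistent with the "more than 80 gigabases" and "almost 30-fold" figures the researchers reported.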
The scientists aligned the data to the NCBI human reference genome using the MAQ program and found that almost 6 percent of the reads could not be mapped. A small fraction of these reads mapped to unanchored NCBI human scaffolds as well as novel sequences of other recently sequenced human genomes, but most of them probably represent mapping inefficiencies, novel human sequence, or other factors such as sequencing errors or contamination with other species.
Using the Velvet de novo assembler, the researchers also assembled about 500,000 of the non-mapping reads into almost 30,000 contigs.
The team identified about 3.4 million SNPs in the Korean genome, of which about 420,000 were novel as they were not contained in dbSNP. In total, they found almost 9,500 non-synonymous SNPs in more than 5,300 genes.
They compared the SNPs of the Korean genome with those of the Venter, Watson, Han Chinese, and Yoruban genomes, all of which were published over the last two years.
Craig Venter's genome was sequenced using Sanger technology (see In Sequence 9/4/2007), while Jim Watson's genome was analyzed using 454's Genome Sequencer platform (see In Sequence 4/22/2008). Both the Han Chinese and Yoruban genomes were sequenced on Illumina's Genome Analyzer (see In Sequence 11/14/2008).
The Korean genome shared the largest fraction of its SNPs, 60 percent, with the Chinese genome, and the smallest fraction, about half of all SNPs, with the Venter genome. This result is not surprising, according to the authors, because genotyping studies have shown that ethnic groups in Asia, including Chinese, Japanese, and Koreans, diverged only relatively recently.
However, they noted that a significant genetic difference, of more than 1.3 million SNPs, exists between the Chinese and the Korean individuals.
For comparison, the Venter genome shares approximately 56 percent of SNPs with the Watson genome, according to the researchers; both individuals are of European descent.
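The sharing percentages quoted here amount to set overlaps between per-genome SNP calls. The sketch below illustrates the idea with tiny, hypothetical call sets; the variant tuples, the `shared_fraction` helper, and the choice to key a SNP by chromosome, position, and alternate allele are all assumptions made for illustration, and the article does not describe how the authors matched SNPs across genomes.

```python
def shared_fraction(query, other):
    """Fraction of variants in `query` that are also present in `other`."""
    if not query:
        return 0.0
    return len(query & other) / len(query)


# HYPOTHETICAL toy call sets: (chromosome, position, alternate allele).
korean = {
    ("chr1", 1000, "A"), ("chr1", 2000, "T"), ("chr2", 500, "G"),
    ("chr3", 42, "C"), ("chrX", 77, "T"),
}
chinese = {
    ("chr1", 1000, "A"), ("chr1", 2000, "T"), ("chr2", 500, "G"),
    ("chr7", 9, "A"),
}
venter = {("chr1", 1000, "A"), ("chr3", 42, "C"), ("chr9", 5, "G")}

for name, other in [("Chinese", chinese), ("Venter", venter)]:
    print(name, shared_fraction(korean, other))

# SNPs in the query genome that are absent from the comparison genome.
korean_only = korean - chinese
print(len(korean_only))
```

The measure is asymmetric: the fraction of Korean SNPs found in the Chinese genome generally differs from the fraction of Chinese SNPs found in the Korean genome, because the denominators differ. As a rough consistency check on the reported figures, 40 percent of about 3.4 million Korean SNPs is about 1.36 million, which is in line with the "more than 1.3 million" difference cited, although the article does not say exactly how that difference was counted.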
Using MAQ, the researchers also detected almost 350,000 short insertions or deletions, two-thirds of them single-nucleotide changes. About 67 percent of these indels are not in dbSNP, "probably because indels are under-represented in dbSNP, even though they are important genomic variations," the authors note.
The Korean genome shared almost half its indels with the Yoruban genome, far more than with any of the other genomes, but this probably resulted from the paired-end sequencing method used for both genomes rather than from ethnic similarities between the two individuals, according to the scientists. The Han Chinese genome was sequenced with a mix of paired and unpaired Illumina reads.
In addition, the researchers identified almost 3,000 deletions and more than 400 inversions in the size range of 100 bases to 100 kilobases, as well as almost 1,000 insertions about 200 bases in size. Deletions affected 21 coding genes, and may affect the structure and function of the proteins they encode. Again, the Korean genome shares more than 60 percent of its deletions with the Chinese genome, more than with any of the other genomes.
The authors conclude that because of the significant differences between individual genomes, and the fraction of sequence reads that do not map to the NCBI reference genome, it is possible that "building reference genomes for populations can be useful in reducing the cost and time in mapping and analyzing very large numbers of personal genomes."
However, they note that resequencing "has a limitation in building a truly diploid reference genome representing more accurate individual and ethnic differences."
For that reason, they say, they plan to analyze the Korean genome further by combining de novo genome assembly and targeted gap filling.