Scientists at the Genome Sciences Centre of the British Columbia Cancer Agency have assembled a human genome de novo from short paired-end reads that Illumina generated on its Genome Analyzer.
The unpublished analysis, first presented last month at the Short Read Special Interest Group meeting during the Intelligent Systems for Molecular Biology conference in Toronto, represents one of the first de novo assemblies — if not the first — of a human genome from short-read sequence data.
Though details about the quality of the assembly are currently unavailable, scientists say that de novo assemblies could unveil structural variations that might otherwise be missed.
“Less than a decade ago, it took one of the world’s most powerful supercomputers to assemble the human genome, and it took two or three weeks to come up with the first draft shotgun assembly,” said Inanc Birol, leader of the bioinformatics group at the Genome Sciences Centre, who led the project and discussed it at the conference. “Now, we can do it in three to four days on commodity hardware.”
“This is an extremely significant development in the field of fragment assembly,” said Mark Chaisson, a graduate student of bioinformatics at the University of California, San Diego, who attended Birol’s talk. Chaisson and his colleagues developed a short-read assembler, Euler-SR, which he presented at the same meeting.
For their assembly, the Canadian scientists downloaded sequence data from an African male HapMap sample, NA18507, from the National Center for Biotechnology Information’s Short Read Archive. Illumina generated the data in-house earlier this year (see In Sequence 2/26/2008 and 5/6/2008) and made it publicly available this spring.
That initial data set provided approximately 24-fold coverage of the genome, but Illumina recently added more data to the archive, bringing the coverage up to 42-fold, according to Birol.
“When the new data set came along, we were all excited about it,” he told In Sequence last month. “We wanted to see whether or not our assembler would be able to handle the increased volume of data.”
The Illumina data consists of paired-end reads with 200-base nominal insert sizes and read lengths ranging from 36 to 42 bases, though “we had to trim the 42-base reads a little because of quality concerns to make them work with the assembly process,” Birol said.
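As a rough illustration of what those coverage figures imply, average fold coverage is commonly estimated as the number of reads times the read length, divided by the genome size. The sketch below turns that relation around to estimate read counts. The genome size of about 3.1 billion bases is an assumption that does not come from the article, and the 36-base read length is simply the lower bound Birol quoted, so the outputs are order-of-magnitude estimates, not figures reported by Illumina or the Genome Sciences Centre.

```python
import math

# Assumption (not from the article): approximate haploid human genome size, in bases.
GENOME_SIZE = 3_100_000_000


def reads_needed(coverage: float, read_length: int, genome_size: int = GENOME_SIZE) -> int:
    """Smallest number of reads whose total bases reach the given average fold coverage."""
    return math.ceil(coverage * genome_size / read_length)


def fold_coverage(num_reads: int, read_length: int, genome_size: int = GENOME_SIZE) -> float:
    """Average fold coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


if __name__ == "__main__":
    # 24-fold and 42-fold are the coverage levels mentioned in the article.
    for cov in (24, 42):
        n = reads_needed(cov, read_length=36)
        print(f"{cov}-fold coverage with 36-base reads: roughly {n / 1e9:.2f} billion reads")
```

Under these assumptions, either data set amounts to billions of individual reads, which gives a sense of the volume the assembler had to handle.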
To stitch the reads together, they used their “Assembly by short sequences,” or AbySS, assembly algorithm and a 21-node cluster with 8-core nodes and 2 gigabytes per core. They said the assembly took three days: one day to assemble single ends and two days to add the paired-end information.
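The cluster figures above translate into aggregate resources with simple arithmetic. The sketch below multiplies them out; the class and property names are illustrative, and the core-day figure assumes that every core was busy for the full three days, which the article does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Cluster shape as described in the article (field names are hypothetical)."""
    nodes: int
    cores_per_node: int
    gb_per_core: int

    @property
    def total_cores(self) -> int:
        return self.nodes * self.cores_per_node

    @property
    def gb_per_node(self) -> int:
        return self.cores_per_node * self.gb_per_core

    @property
    def total_gb(self) -> int:
        return self.total_cores * self.gb_per_core


# 21 nodes, 8 cores per node, 2 gigabytes per core, as reported.
cluster = Cluster(nodes=21, cores_per_node=8, gb_per_core=2)

# Assumption: all cores in use for the whole three-day run (an upper bound).
RUN_DAYS = 3
core_days = cluster.total_cores * RUN_DAYS

print(f"cores: {cluster.total_cores}")            # 168
print(f"memory per node: {cluster.gb_per_node} GB")  # 16 GB
print(f"aggregate memory: {cluster.total_gb} GB")    # 336 GB
print(f"upper bound on compute: {core_days} core-days")  # 504
```

In other words, no single machine needed more than 16 gigabytes of memory, even though the cluster as a whole offered 336 gigabytes.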
Birol said they were able to keep memory requirements low by parallelizing the assembly, using a multithreaded message passing interface, or MPI, architecture. He said they did not use the human reference genome at all for the assembly.
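The article does not describe how the assembler divides its work, so the following is only a sketch of one common way a parallel assembler can keep per-node memory low: each short subsequence (k-mer) of a read is assigned to exactly one process by hashing it, so that every process stores only its own share of the data. The function names, the choice of CRC32 as the hash, and the tiny k of 5 are all assumptions made for illustration, and the processes are simulated with a plain list in place of real MPI ranks.

```python
import zlib
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement,
    so that both strands of the same sequence map to one key."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def owner_rank(kmer: str, num_ranks: int) -> int:
    """Pick the (simulated) process that stores this k-mer, using a deterministic hash."""
    return zlib.crc32(canonical(kmer).encode("ascii")) % num_ranks


def partition_kmers(reads, k: int, num_ranks: int):
    """Count k-mers, placing each canonical k-mer in the table of its owning rank."""
    tables = [Counter() for _ in range(num_ranks)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            tables[owner_rank(kmer, num_ranks)][canonical(kmer)] += 1
    return tables


# Toy input: the second read is the reverse complement of the first.
reads = ["ACGTACGTGA", "TCACGTACGT"]
tables = partition_kmers(reads, k=5, num_ranks=4)

for rank, table in enumerate(tables):
    print(f"rank {rank}: {len(table)} distinct k-mers")
```

Because ownership depends only on the k-mer itself, any process can work out where a neighboring k-mer lives without a central index; in a real MPI program that lookup would become a message to the owning rank.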
Other short-read assembly algorithms might not yet be suited for a similar-sized assembly task because of memory limitations, he said.
Chaisson confirmed that, in benchmarking studies conducted by his team, two other short-read assemblers use “too much memory”: Euler-SR, which he published last year in Genome Research, and Velvet, developed by researchers at the European Bioinformatics Institute (see In Sequence 3/18/2008). The Allpaths assembler, developed by scientists at the Broad Institute, “may soon work on the human genome since it can assemble sub-problems,” he added.
Chaisson said he and his colleagues have used Euler-SR to assemble eukaryotic genomes but no mammalian genomes so far. “We are working on paring down the memory required and will attempt it soon,” he said. However, unlike AbySS, that project will require computing resources at the San Diego Supercomputer Center “that most labs do not have access to.”
The BC researchers’ draft genome consists of “millions of contigs,” according to Birol, who did not provide further details, such as the contig size distribution or contig quality, because the study is not published yet.
He did disclose that the assembly is approximately 95-percent concordant with the human reference genome. That amount of variation is what he and his team predicted, he said, given that the HapMap sample originated from Africa.
The assembly could be improved in several ways. Longer reads and insert sizes, for example, would reduce the required coverage depth and might help resolve some “contig extension ambiguities,” Birol said.
However, mate-pair reads with longer inserts alone would not be sufficient, he said, because “the complexity of bridging different contigs would grow exponentially with the insert size.” As a result, a combination of small and large insert sizes would be best, he said.
Lowering the error rate in the data would also reduce these ambiguities, he added. Chaisson said he is now collaborating with Birol to decrease errors in the data, using error correction routines provided by Euler-SR to preprocess the data for the assembly.
Birol and his colleagues are also thinking about using SOLiD data to improve their assembly. ABI has internally sequenced the same HapMap sample, NA18507, on its SOLiD system. “We are thinking about bringing in that orthogonal dataset from SOLiD to fill in some of the gaps and bridge some of the contigs,” Birol said.
In general, de novo assemblies could help detect structural variations that would otherwise be missed when short reads are aligned to an existing reference genome. “If you would like to make any inferences about longer rearrangements, longer indels, translocations, and inversions, having an assembly would be quite helpful,” he explained.
He and his colleagues showed with several examples that “even though the genome you are investigating would not have certain features, it can be shoe-horned into the reference genome,” he said. “So you may be missing some of those structural variations that you are looking for if you build your study on alignments or reference-based assemblies.”
Chaisson added that the assembly might be especially useful to detect structural polymorphisms of intermediate sizes. While mapping unpaired Illumina reads to a reference can detect SNPs, and alignments of paired reads can unveil large structural variants, intermediate-size variants might be missed by such approaches, he explained.
Birol and his colleagues have also used their assembler on other human-genome data, for example from two individuals from a HapMap trio that are being analyzed as part of the 1000 Genomes Project. That data was generated by the Wellcome Trust Sanger Institute and the Broad Institute on the Illumina platform. They did not use 454 data that was also available for these samples because the data has a different error model that would not work with their algorithm — although they might be able to use 454 data after an error correction, Birol suggested.
In that case, the Canadian researchers combined data from two genomes, those of a mother and her daughter. “Taking only one individual was not deep enough coverage, so it did not assemble well,” according to Birol. He said he hopes that the results will convince the 1000 Genomes Project coordinators to sequence more genomes at high coverage “so that we can assemble and compare them to the reference and make some cross-comparisons between individuals.”
Although so far they have only assembled Illumina reads using AbySS, “there is no theoretical limit in using SOLiD data in our assembler,” he said.
AbySS will be available to researchers after the publication of the human genome assembly.