Long-Read Genome Assemblies Have Many Errors in Protein-Coding Regions, Study Finds

NEW YORK (GenomeWeb) – Researchers at the University of Edinburgh have found that three published long-read human genome assemblies (two from Pacific Biosciences sequencing data alone and one from Oxford Nanopore and Illumina data) contain considerably more errors in protein-coding regions than short-read assemblies. As a result, they suggested, more effort should go into fixing such errors going forward.

"[T]he results should serve as a cautionary note for those researchers seeking to sequence genomes (and seeking funding to sequence genomes) using single-molecule technologies and those wishing to use long-read technologies in clinical practice," co-authors Mick Watson and Amanda Warr of the University of Edinburgh's Roslin Institute wrote in their paper, published yesterday in Nature Biotechnology.

However, in a reply published in the same issue, one of the groups whose assembly was included in the analysis took issue with the results, arguing, among other things, that new bioinformatic tools have already improved their nanopore assembly and that "[f]urther advances in algorithms and technology will ultimately enable reference-grade consensus sequence from [Oxford Nanopore] and PacBio data alone."

For their study, Watson and Warr compared five published human genome assemblies: three long-read assemblies (two from PacBio data alone and one from Oxford Nanopore reads polished with Illumina data) and two from Illumina short reads alone.

Specifically, they looked at a PacBio-only genome assembly of NA12878 — a reference sample used by the Genome in a Bottle Consortium — that was published in 2015, with a stated accuracy of 99.7 percent; a PacBio-only assembly of CHM1, a haploid cell line, that was published in 2017 with a stated accuracy of 99.8 percent; and an Oxford Nanopore assembly of NA12878 that used Illumina polishing and was published last year, with a stated accuracy of 99.8 percent. In addition, they included an Illumina-only assembly of NA12878, published in 2011, and an Illumina-only assembly of CHM1, published in 2014.

They then compared these assemblies against a set of about 41,000 mRNA transcripts and found that the long-read assemblies had significantly more indel errors in their protein-coding regions than the short-read assemblies. In particular, the older PacBio assembly, which used an earlier sequencing chemistry, had almost 11,000 genes with indel errors, whereas the newer PacBio assembly had only 740. The Illumina-polished Oxford Nanopore assembly, on the other hand, had almost 4,000 genes with indel errors, and the two Illumina-only assemblies had about 400 (CHM1) and 600 (NA12878) genes with indel errors.

The researchers noted that the large improvement in the more recent PacBio assembly — partially enabled by the fact that the sample was haploid — "proves that it is possible to reduce the number of erroneous protein-coding regions to a few hundred, but it is important to note the resources and skills needed to do so."

The fact that the Oxford Nanopore assembly, despite using correction with Illumina reads, still had a large number of indel errors in protein-coding regions "should serve as a warning to those groups [working on nanopore assemblies] to pay particular attention to indel errors," they wrote.

Watson and Warr explained that their analysis "should not be considered a criticism of either PacBio or Oxford Nanopore" and that it is "not intended to be a comparison of sequencing technologies, nor should it be interpreted as such. Rather, it is an attempt to use published single-molecule sequencing assemblies of the human genome to demonstrate that indel errors, many of which can critically affect protein-coding transcripts and genes, remain prevalent."

However, others appeared to feel differently. In their reply, the authors of the published nanopore assembly, a group led by Nick Loman at the University of Birmingham and Matt Loose at the University of Nottingham, argued that the Edinburgh researchers "incorrectly focus on a single assembly from our previous paper," which relied on a base caller that is now obsolete.

They recently reassembled their original nanopore data using an updated base caller, they wrote, which improved the contiguity of the assembly and raised the consensus accuracy achieved with nanopore data alone to 99.77 percent.

In addition, they said they have posted an updated assembly of NA12878 from nanopore and Illumina data, using several rounds of polishing with two different tools, which resulted in a consensus accuracy greater than 99.99 percent.
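For a sense of scale, consensus accuracy figures like these can be converted into expected errors per megabase of assembled sequence. The short Python sketch below is illustrative arithmetic only, not a calculation from either paper. The function name is hypothetical, and it assumes that consensus accuracy is simply the fraction of correct bases and that errors are spread evenly across the assembly.

```python
def errors_per_megabase(accuracy_percent: float) -> float:
    """Expected erroneous bases per 1,000,000 bases at a given consensus accuracy.

    Assumption: accuracy is the plain fraction of correct bases.
    """
    error_fraction = (100.0 - accuracy_percent) / 100.0
    return error_fraction * 1_000_000


# Accuracy values quoted in the article; the labels are descriptive only.
quoted = [
    ("nanopore data alone", 99.77),
    ("nanopore plus Illumina polishing (lower bound)", 99.99),
]

for label, accuracy in quoted:
    print(f"{label}: about {errors_per_megabase(accuracy):,.0f} errors per Mb")
```

Under these assumptions, 99.77 percent corresponds to roughly 2,300 erroneous bases per megabase, and an accuracy greater than 99.99 percent corresponds to fewer than about 100.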

Assemblies should also be judged not only by their consensus error rate but also by their contiguity, absence of misassemblies, and other measures, they wrote, adding that long-read sequencing technologies "can produce dramatically improved assemblies as measured by a variety of assembly quality metrics."

Watson and Warr agreed that "[l]ong reads have transformed genome assembly, and we believe they should be the starting point for all new genome assembly projects."

The Vertebrate Genomes Project, for example, which aims to generate assemblies for all 66,000 vertebrate species, plans to initially use four complementary technologies, including PacBio sequencing, that generate long reads or other long-range mapping information.

However, the Edinburgh researchers argued that to maximize the accuracy of long-read assemblies, multiple rounds of polishing and "additional checks for remaining indels and errors" should be used, including manual inspection and error correction.

Loman and colleagues wrote that they "disagree that special resources or skills are required" to reduce indel errors in single-molecule assemblies, adding that bioinformatic and technology improvements will further increase the quality of assemblies from single-molecule platforms.