GenomeWeb Feature: The Lost Art of Genome Finishing


When the Human Genome Project announced that the entire human genome had been sequenced, the term "finished" was a matter of semantics. Technically, it was the most complete human genome, but researchers were still laboring away, putting bits and pieces together and trying to resolve some of the trickier regions. In fact, efforts to finish the human genome persist 10 years later.

The Genome Reference Consortium is continuing to improve on the human reference genome by releasing patches that fix various regions of the genome, and it is also constructing multiple iterations of regions that are too complex and variable to be represented by just one assembly.

Genome centers used to invest heavily in the finishing aspect of sequencing a genome, said Ian Korf, an associate professor at the University of California, Davis, Genome Center. But today, the focus is more centered on generating many genomes as quickly as possible for as low a cost as possible, he said.

"No one really finishes genomes anymore," he said.

The reduced emphasis on finishing can be attributed in part to advances in technology. Next-generation sequencing has made it possible to sequence a human genome in a little more than a day. But genome finishing typically requires time- and labor-intensive techniques like BAC-based sequencing or fosmid sequencing that researchers are less likely to invest in, Korf said.

Additionally, how one defines a finished genome is up for interpretation, and the importance of having a finished genome depends on the specific research question. Still, technological advances are needed so that genomes can be finished in a cost-effective, automated manner without relying on fosmid sequencing or BACs.

Defining a finished genome

The first problem in finishing genomes is defining what a finished genome is, said Lex Nederbragt, who coordinates sequencing on the Roche 454 and Pacific Biosciences RS at the Norwegian Sequencing Center in Oslo. "If you mean that you should not have any unknown bases and any error, then for certain organisms, like bacteria and worms, that's feasible," he said.

But for human and other animal genomes, "there are regions in these large eukaryotic genomes that are hardly accessible for any sequencing technology," like highly repetitive regions and centromeres, he said.

Additionally, "if you look at the human genome, your genome is completely different from mine in certain regions," he said. "There might even be sequences that people have never seen. So you can have a complete genome from one single individual, but the next one may have differences, so you might have missed some things."

This is one problem that the Genome Reference Consortium is now addressing, he added, by releasing patches to the reference genome and sequencing alternative loci.

Another issue with defining a finished genome is that humans and other animals are diploids, receiving half of their genome from their mother and half from their father. Plants, which are often polyploid, are even more complex. Technically, Korf said, a "finished genome would have to contain both the mother's and father's" haploid genomes. Currently, "the reference genome is a haploid representation," which he noted is an "artificial concept."

Eric Antoniou, manager of Cold Spring Harbor Laboratory's sequencing center, said that cancer genomes add even more complexity to the question of what a finished genome is. Tumors are frequently a collection of variable genomes, he said. "There are different [genome] populations within the same tumor mass," he said. Yet, sequencing does not distinguish between these various genomes, unless the researcher is doing single-cell sequencing. However, single-cell sequencing has its own technical challenges, such as amplifying DNA from one cell in an unbiased manner and generating sufficient coverage, that make de novo assembly and genome finishing extremely difficult and costly.

Cost and labor

Because methods to finish genomes still often rely on fosmid sequencing and bacterial cloning, the price and the amount of time required to finish genomes are simply too high, Nederbragt said.

With the advent of next-gen sequencing, researchers became more interested in sequencing many genomes quickly and cheaply, rather than completely finishing one genome, Korf said. "People want the $1,000 genome, not the $1 million genome," he said.

Additionally, few researchers involved in sequencing have the bacteria that would be required to do BAC-based sequencing to finish genomes, Korf said, since most labs are now using next-gen, which does not require bacteria.

Technology deficiencies

The main advances needed to finish a genome without using the BAC-based or fosmid approaches are longer reads and unbiased sequencing, Korf said. Pacific Biosciences' technology could help, he said, but it is currently too expensive for most projects.

With current technology, researchers can sequence the human genome and readily identify SNPs or short indels, Antoniou said. "But, if you want to look at inversions [and] translocations, you need much better assembly and long reads," he added.

The "new upgrades [to PacBio] are a good start," he said, although throughput is still an issue.

Researchers are also interested in long-read offerings from Moleculo and Oxford Nanopore. However, Antoniou said one problem with the Moleculo technology is that it still relies on Illumina sequencing, so it will likely have the same biases as Illumina sequencing, such as difficulty sequencing through GC-rich regions, although he said he has not yet tested Moleculo. Oxford Nanopore's technology is not yet available.

The longer reads will help generate better de novo assemblies, resolve structural variations, inversions, and translocations, and help sequence through highly repetitive regions, but Nederbragt anticipates that there will still be a level of manual work required to truly finish any specific genome.

"Technology improvements will result in better drafts and reference genomes without any additional finishing efforts," he said.

For instance, he said, the Genome Reference Consortium is conducting many localized experiments on specific areas of the genome to figure out what is causing the problems in terms of sequencing and assembling those regions.

That type of work is different from "tackling the approach by going shotgun or creating a new global set [of genomes]," he said. Rather, it's localizing the problem to a specific troublesome region to figure out what the issue is and then fixing it. "That's just manual work, and there's many years of work left," Nederbragt said.

Aside from improvements to the sequencing technology itself, alternative technologies like optical mapping and physical or linkage maps are helping to complete genomes. These maps "give longer range information," he added.

Korf anticipated that once longer read sequencing technology became cheaper, more researchers would start finishing genomes. But finished genomes are not necessary for every research application.

However, Antoniou said that finished human genomes will have important medical consequences because they will allow for the phasing of genomes and a better understanding of the large structural variations in cancer genomes. "There are things we can't even see now," he said.

Nederbragt also said that finished genomes are "important in the medical world." But for other organisms, it will depend on the research question.

If the researcher is interested in a gene list, then having a complete, assembled genome may not be necessary. But if the researcher is "developing a model system for gene regulation, for instance, then the regions between are also important," he said. "For example, long non-coding RNAs are not coding genes, but they do regulate them."

In the future, Korf predicted, once longer read sequencing technology — whether from PacBio, Moleculo, Oxford Nanopore, or another company — becomes cheap enough and high throughput enough to sequence whole genomes, "we will probably revisit many of the genomes we sequenced and realize how many mistakes we've made."

"We're just waiting for the technology that gives us 10-kilobase reads reliably and cheaply," he said.