New Assembly Method Addresses Challenges of Short-Read Single-Cell Sequencing


By Andrea Anderson

Researchers from the University of California at San Diego, the J. Craig Venter Institute, and Illumina Cambridge have designed a genome assembly method to deal with non-uniform sequence data generated during single-cell genome sequencing.

"Our approach enables acquisition of genome assemblies for individual uncultivated bacteria using only short reads," the study authors wrote, "providing cell-specific genetic information absent from metagenomic studies."

As they reported in Nature Biotechnology, the new approach combines a modified version of the open source Velvet assembly algorithm, called Velvet-SC, with an error-correction approach used in the Euler assembler.

Velvet-SC does not immediately toss out low-coverage reads as the original version of Velvet does. Instead, it assembles reads in an iterative manner, starting with low-coverage reads, to take into account the uneven genome coverage caused by non-uniform amplification of single cell DNA. The authors show in the paper that the complete Euler+Velvet-SC assembly method produced relatively long and complete contigs.

"What [co-author Pavel Pevzner's] group did was to make the software be responsive to this amplification bias to try make use of these rare reads," senior author Roger Lasken, a microbial and environmental genomics researcher at JCVI, told In Sequence.

"The problem they're addressing is a problem that is particular to single-cell sequencing, which is that the genome coverage is so variable," Johns Hopkins University's Steven Salzberg, who was not involved in the study, told IS.

Even so, he noted that this type of assembly approach might also prove useful for other situations where there is variable genome coverage, such as areas of the genome where guanine and cytosine nucleotides are particularly common. In next-generation sequence data, he explained, coverage tends to drop off as the GC content increases.

"It might be worth exploring it for other types of assemblies as well," Salzberg said. "I don't know if it would work, but it's certainly something worth looking at for other kinds of genome assembly problems."

After showing that the approach improved single-cell genome assemblies for Escherichia coli and Staphylococcus aureus cells, bacterial species with well-characterized genomes, Lasken and his colleagues turned their attention to an uncultured marine bacterium collected off the California coast. Again, the Euler+Velvet-SC method proved useful for assembling the single-cell genome for this previously uncharacterized Deltaproteobacterium.

A Computational Challenge

The multiple displacement amplification method used to amplify genomic DNA from individual bacterial cells has opened the door for sequencing studies on single cells by dramatically increasing the amount of DNA available for sequencing.

But sequencing DNA that has been amplified by MDA leads to variable and non-uniform genome coverage, JCVI's Lasken explained, owing to random amplification bias and variable template DNA quality.

Excess sequence data from some parts of the genome and very low coverage in others can confuse standard genome assemblers, which assume that reads that don't match much else in the genome are the result of sequence glitches.

"Whenever you have an error — one of the nucleotides is wrong — then all of the short sequences containing it will suddenly appear unique," Salzberg said. "They won't match anything that's actually in the genome."
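Salzberg's point can be illustrated with a toy example. The sketch below uses a made-up 30-base "genome" and a hypothetical helper, `kmers`, neither of which comes from the study. It introduces one substitution into an otherwise perfect read and counts how many of the read's short subsequences (k-mers) no longer match anything in the genome.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Made-up sequence for illustration only; not real data.
genome = "ACGTTGCAAGGCTTACGATCGGATCCTAGA"
k = 5

# A perfect 15-base read copied from the genome.
read = genome[5:20]

# Simulate one sequencing error: substitute the base at position 7.
pos = 7
wrong_base = "A" if read[pos] != "A" else "C"
bad_read = read[:pos] + wrong_base + read[pos + 1:]

genome_kmers = kmers(genome, k)

# k-mers from the clean read all occur in the genome ...
clean_novel = kmers(read, k) - genome_kmers
# ... but every k-mer that overlaps the error is new.
error_novel = kmers(bad_read, k) - genome_kmers

print(len(clean_novel), "unmatched k-mers in the clean read")
print(len(error_novel), "unmatched k-mers in the read with one error")
```

In this toy case a single wrong base produces k unmatched k-mers, one for each window that overlaps it. These appear only once in the data, which is why assemblers tend to treat very low-coverage sequence as probable error.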

To avoid including these sorts of mistakes in the genome assembly, most short-read assemblers chuck out or correct reads with low coverage relative to the rest of the genome.

While this can help reduce errors in assemblies based on sequence reads from populations of cells, it is problematic in single cells with non-uniform coverage where a lack of sequence depth might lead to short, difficult-to-assemble contigs or incomplete assemblies that are missing swaths of genome sequence.

Consequently, the study authors argued that the "challenges facing single-cell genomics are increasingly computational rather than experimental."

In an effort to address some of these issues, the team developed Velvet-SC, which progressively adds information for parts of the genome that have more and more depth of coverage, correcting the contigs as sequencing depth increases.

"[I]nstead of using a fixed cutoff to prune contigs … Velvet-SC uses a variable cutoff that starts at [one] and gradually increases," the researchers explained. "After the lowest contigs are removed based on the current cutoff, some contigs may merge into a larger contig, whose average coverage is recomputed."

Building contigs from lower-coverage reads before gradually excluding contigs that aren't supported by sequences in regions of the genome with deeper coverage allows for longer and longer contigs, they noted.

"The basic strategy is to start with the areas that have very, very low coverage and clean these up," study co-author Glenn Tesler, a mathematics professor at UCSD, told IS, "then move to the areas that have slightly higher coverage and clean those up, and just work progressively from cleaning up the lowest coverage regions to higher and higher coverage regions."
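The strategy Tesler describes can be sketched on a toy contig graph. The code below is not Velvet-SC. It is a simplified illustration that assumes contigs are nodes with a length and an average coverage, that merging two contigs simply adds their lengths (ignoring k-mer overlap), and that the merged coverage is the length-weighted mean. The function names and the example graph are hypothetical.

```python
from collections import Counter

def prune(contigs, edges, cutoff):
    """Drop contigs whose average coverage is below the cutoff."""
    keep = {n: c for n, c in contigs.items() if c["cov"] >= cutoff}
    return keep, {(u, v) for u, v in edges if u in keep and v in keep}

def merge_linear(contigs, edges):
    """Merge u->v whenever u has one successor and v one predecessor."""
    contigs = {n: dict(c) for n, c in contigs.items()}  # work on a copy
    edges = set(edges)
    while True:
        out_deg = Counter(u for u, _ in edges)
        in_deg = Counter(v for _, v in edges)
        pair = next(((u, v) for u, v in sorted(edges)
                     if u != v and out_deg[u] == 1 and in_deg[v] == 1), None)
        if pair is None:
            return contigs, edges
        u, v = pair
        a, b = contigs[u], contigs.pop(v)
        total = a["len"] + b["len"]
        # Recompute average coverage of the merged contig (length-weighted).
        a["cov"] = (a["cov"] * a["len"] + b["cov"] * b["len"]) / total
        a["len"] = total
        # Edges leaving v now leave the merged contig u.
        edges = {(u if x == v else x, y) for x, y in edges if (x, y) != (u, v)}

def assemble(contigs, edges, cutoffs):
    """Apply each cutoff in turn, merging after every pruning round."""
    for cutoff in cutoffs:
        contigs, edges = prune(contigs, edges, cutoff)
        contigs, edges = merge_linear(contigs, edges)
    return contigs

# Hypothetical graph: A and B are genuine but poorly amplified,
# C is well covered, E and X are short error-derived branches.
contigs = {
    "A": {"len": 1000, "cov": 3.0},
    "B": {"len": 500, "cov": 4.0},
    "C": {"len": 800, "cov": 40.0},
    "E": {"len": 30, "cov": 1.0},
    "X": {"len": 40, "cov": 2.0},
}
edges = {("A", "B"), ("A", "E"), ("B", "C"), ("X", "C")}

fixed = assemble(contigs, edges, [4])                # one fixed cutoff
iterative = assemble(contigs, edges, range(1, 5))    # cutoff 1, 2, 3, 4

print("fixed cutoff:    ", sorted(c["len"] for c in fixed.values()))
print("variable cutoff: ", sorted(c["len"] for c in iterative.values()))
```

In this example a fixed cutoff of 4 discards the low-coverage contig A along with the error branches. The gradually rising cutoff removes the error branches first, which lets A merge with its better-covered neighbors before its own coverage is tested, so the final contig is longer.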

Compared to other assembly methods, the researchers found that Euler+Velvet-SC assemblies of Illumina GAIIx reads generated from MDA-amplified DNA in individual E. coli or S. aureus cells typically produced single-cell assemblies that had long contigs, high gene content, and relatively low substitution errors.

In one of the E. coli Euler+Velvet-SC assemblies, for example, the team identified 3,943 genes, just over 91 percent of the 4,324 genes known for the E. coli genome, in 481 contigs with an N50 of 36,581. In contrast, a standard Velvet-based assembly of reads from that cell contained sequences for 3,131 genes, around 72 percent of E. coli genes, in 522 contigs with an N50 of 18,410, while an assembly using SOAPdenovo identified 3,353 genes in nearly 1,400 contigs with an N50 of 20,319.
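For readers unfamiliar with the N50 statistic, the sketch below uses the common definition: the contig length at which contigs of that length or longer account for at least half of the total assembled length. The contig lengths in the example are invented. Only the gene counts are taken from the figures reported above, and the function name `n50` is illustrative.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least
    half of the total assembly length."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Invented contig lengths, for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_lengths)
print("toy N50:", toy_n50)

# Gene recovery as a share of the 4,324 known E. coli genes,
# using the gene counts quoted in the article.
KNOWN_GENES = 4324
genes_found = {"Euler+Velvet-SC": 3943, "Velvet": 3131, "SOAPdenovo": 3353}
gene_percent = {name: round(100 * n / KNOWN_GENES, 1)
                for name, n in genes_found.items()}
for name, pct in gene_percent.items():
    print(f"{name}: {pct} percent of known genes")
```

A higher N50 with fewer contigs indicates that the assembly is less fragmented, which is why the figure is reported alongside gene content.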

Although the assembly approach theoretically carries some error risks, Lasken said, results from the E. coli and S. aureus single-cell sequencing experiments do not show evidence of an elevated error rate. Instead, he said, "The assemblies were highly accurate even though we were accepting regions that don't have great sequencing depth."

"Their substitution error rate was pretty low," Salzberg agreed, "so that gives you some comfort that [the assembler] is doing a good job."

Similarly, when the team sequenced an uncultured marine bacterial cell called SAR324 using MDA and Illumina paired-end sequencing, they found that the Euler+Velvet-SC assembly of these short reads produced fairly long contigs containing more recognizable open reading frames than assemblies generated by Velvet or Velvet-SC alone.

"The Euler+Velvet-SC ORFs were of higher quality, as evidenced by the greater number of ORFs with taxonomic affiliations identified using Blast and phylogenetic analysis using the Automated Phylogenetic Inference System, by the greater numbers of ORFs corresponding to orthologous genes in the Clusters of Orthologous Groups database, and by greater numbers of single-copy conserved genes detected," the team reported.

"By all these criteria, Euler+Velvet-SC yielded the most robust assembly for annotation," they wrote.

Their analyses of the single-cell genome indicated that SAR324, a Deltaproteobacterium collected off the coast of La Jolla, Calif., is likely aerobic, motile, and chemotactic and may help to break down photosynthetic organisms as they sink in the ocean.

Lasken was hesitant to estimate the cost of such single-cell sequencing and assembly methods, citing variable sequencing costs and labor costs. Still, he said, any assembly method that curbs the time and labor needed to assemble genomes should cut sequencing costs.

Complementing Metagenomics

Those involved in the study argued that the work illustrates the potential of using single-cell genome sequencing and assembly to access genetic data in microbes that can't be cultured in the lab.

"If you can sequence from a single cell, you don't need to grow cells," Lasken explained. "You could just see a cell on the microscope — anything from the environment — and if you could see it, you can amplify its DNA by MDA and then you can sequence its genome."

For his part, Pevzner argued that these sorts of single-cell sequencing approaches will complement metagenomic studies that look at the DNA sequences present in microbial communities as a whole by providing a detailed look at the genes present in each member of the bacterial community. And, he said, having an inventory of the genes present in each cell is expected to aid in proteomic studies of the bugs as well.

"We finally can say, for any bacteria, what it does, what is its lifestyle," Pevzner, a computer science researcher at UCSD, told IS. "This was part of the bacterial genome that was not reachable before and now it's reachable."

So far, those involved in the study say the single-cell sequencing and assembly approach they used is limited to prokaryotic cells, though Lasken said his team is interested in coming up with ways to sequence single human cells. At the moment, he explained, the size and complexity of the human genome have complicated that application of single-cell sequencing.

Other teams have been working on methods for sequencing the genomes of individual human cells. At the Biology of Genomes meeting last year, for instance, a Cold Spring Harbor team reported that it had some success sequencing individual tumor cells (IS 5/18/2010). So far, though, amplification problems have prevented researchers from sequencing more than a fraction of each cell's genome (IS 4/12/2011).

Lasken said he and his colleagues are currently using the MDA and Euler+Velvet-SC-based method to study bacteria in many different environments. For example, he said, the team is collaborating with researchers from the Scripps Institution of Oceanography to study bacteria in deep ocean trenches.

In collaboration with National Institute of Allergy and Infectious Diseases researchers, the team plans to use its single-cell sequencing approach to look at microbes in and on the human body for the Human Microbiome Project. Lasken said his group is also participating in bacterial sequencing studies centered at some UCSD hospitals to track possible pathogens.

Despite their progress so far, though, researchers are continuing to look for ways to improve their DNA amplification and assembly methods. For example, Lasken and his colleagues are looking at the DNA amplification process in more detail to try to figure out whether factors such as primer utilization and the DNA polymerase concentration might influence amplification bias during MDA.

"We study the basic chemistry of how the DNA gets amplified," he explained. "If we can understand the mechanism, then we might possibly understand ways to reduce bias."

Meanwhile, other study authors are continuing to hammer away at the assembly algorithm itself, looking for ways to make single-cell sequence assemblies even more accurate and complete.

"This is the first demonstration of what is possible," Pevzner said. "I'm absolutely convinced many people will move into this area and develop better and better assemblers."

Have topics you'd like to see covered in In Sequence? Contact the editor at anderson [at] genomeweb [.] com.