
U of Washington Team Uses Metagenomic Sequence, Mate-Pair Reads for De Novo Assembly of Novel Genomes


By Andrea Anderson

University of Washington researchers have developed a strategy for harnessing the information provided by mate-pair reads to assemble candidate genomes for individual microbes de novo from metagenomic sequences.

As they reported in a paper in Science earlier this month, the team used mate-pair sequence data to sort and stitch together chunks of partially assembled sequence generated for Puget Sound seawater samples on the SOLiD v3.0 instrument. In the process, they put together 14 candidate genomes from the samples, including the genome of an uncharacterized and uncultured archaeal species from marine group II.

"With this marine group II, there is no closely related genome available," first author Vaughn Iverson, a graduate student in senior author Virginia Armbrust's oceanography lab at the University of Washington, told In Sequence sister publication GenomeWeb Daily News. "So we were working completely blind. It's like putting together a puzzle and not having the box top."

The project "is a really good example of being able to assemble novel genomes representing microorganisms that had never been obtained in culture," Janet Jansson, a researcher affiliated with Lawrence Berkeley National Laboratory and the US Department of Energy's Joint Genome Institute, who was not involved in the new study, told In Sequence.

In a paper published in Nature last fall (GWDN 11/7/2011), Jansson and her colleagues assembled the draft genome for an uncultured soil methanogen using metagenomic sequence generated for an Alaskan permafrost sample. The team relied on Illumina GAII sequencing for that study and used a somewhat different read binning and assembly approach.

Early metagenomic studies focused on getting a census of the microbes present in an environmental sample using marker sequences such as 16S ribosomal DNA, while many of the studies reported more recently have used metagenome sequencing to catalog the genes and non-coding sequences present in a given microbial community.

But as metagenomic sequencing matures and researchers set their sights on complex diverse environments, approaches for extracting individual genome assemblies from metagenome samples are taking on increasing significance.

Even so, systematically assembling individual genomes from a mishmash of reads generated from an environmental sample has proven far more difficult than assembling reads for a single organism.

"The basic process of putting them together is pretty straightforward — you just put them on a big enough computer and you run them," said Titus Brown, a genomics, evolution, and development researcher from Michigan State University who was not involved in the new study.

"But," he told IS, "very few of the programs that do the stitching together really are optimized or think about metagenomic samples."

Though some metagenomics researchers have successfully assembled chunks of sequence or even nearly complete genomes for individual microbes in metagenome samples, there has been a dearth of tools for doing this routinely and without extensive manual assembly steps, especially for uncultured organisms.

With the publication of the University of Washington's metagenome assembly study, that routine assembly step appears to be a bit closer, Brown said.

"What this group did was they went another step," he explained. "They ran the assembler and then they developed an automated way — it's a little bit hard to tell how automated — of grouping these contigs based on additional information beyond the raw sequence into sequences that are likely to all come from the same organism."

Exploiting Mate Pairs

For the study, Iverson and his colleagues did SOLiD mate-pair metagenomic sequencing on two seawater samples: a surface sample taken from Puget Sound in the fall of 2008 and another collected in the spring of 2009.

The first of these samples was sequenced using a mate-pair method on the SOLiD v3.0 instrument at Life Tech, while the study authors did mate-pair sequencing on the second sample in their own lab, also on the SOLiD v3.0.

After doing quality control analyses and tossing out very short reads and reads with low complexity patterns that were suspected of stemming from SOLiD sequencing errors, the researchers did error correction on the best quality reads using the SOLiD Accuracy Enhancer Tool, or SAET, before assembling these reads into contigs using the open source fragment assembler Velvet.

"You have to do a lot of read quality screening," Iverson said. "We did a lot of work cleaning up, filtering, and trimming the reads to keep just the good parts."
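As a rough illustration of this kind of screening, the sketch below drops reads that are very short or that have low sequence complexity. It is not the team's software: it assumes plain base-space strings rather than SOLiD color-space reads, and the function names (shannon_entropy, keep_read, filter_reads) and the thresholds (min_length, min_entropy) are hypothetical choices made for the example.

```python
import math
from collections import Counter


def shannon_entropy(seq, k=2):
    """Shannon entropy (in bits) of the k-mer composition of a read."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def keep_read(seq, min_length=30, min_entropy=2.0):
    """Keep a read only if it is long enough and not low-complexity.

    Both thresholds are illustrative assumptions, not values from the study.
    """
    return len(seq) >= min_length and shannon_entropy(seq) >= min_entropy


def filter_reads(reads, min_length=30, min_entropy=2.0):
    """Return the reads that pass both the length and complexity filters."""
    return [r for r in reads if keep_read(r, min_length, min_entropy)]
```

A homopolymer run or a simple dinucleotide repeat has a dinucleotide entropy of about 1 bit or less, so it falls below the example threshold, while a read that uses most of the 16 possible dinucleotides scores close to 4 bits.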

From there, the researchers put contigs together into longer scaffolds and sorted the scaffolds into candidate genomes based on the information provided by mate-pair connection graphs. Indeed, Iverson said mate-pair data were the "key piece of information that we leveraged."

Custom software called Select and Estimate Abundance from Short Aligned Reads, or SEAStAR, that was used during contig processing also proved useful for estimating relative species abundance from metagenomic sequence data without doing additional 16S sequencing, the group reported.

"SEAStAR emits a mate-pair connection graph which represents contigs as nodes, mate-pair connections as edges, and encodes for each element of the graph, statistics such as total bit-score, coverage, [percent] GC content, sequence length, and mean aligned positions of mate-paired reads," the study's authors explained in supplementary information provided for the Science paper.
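The general idea of a mate-pair connection graph can be sketched in a few lines. The example below is a simplified stand-in, not SEAStAR itself, which was unreleased at the time of the article. It assumes the input is a list of (contig, contig) tuples, one per mate pair whose two reads aligned to contigs. It ignores the per-node and per-edge statistics quoted above, and the function names and the min_links cutoff are hypothetical.

```python
from collections import Counter, defaultdict


def build_mate_pair_graph(pair_alignments, min_links=2):
    """Build an undirected graph with contigs as nodes.

    An edge joins two contigs when at least `min_links` mate pairs have
    one read on each contig. Pairs that land on a single contig carry no
    linking information and are skipped.
    """
    link_counts = Counter()
    for contig_a, contig_b in pair_alignments:
        if contig_a != contig_b:
            link_counts[frozenset((contig_a, contig_b))] += 1

    graph = defaultdict(set)
    for edge, n_links in link_counts.items():
        if n_links >= min_links:
            contig_a, contig_b = tuple(edge)
            graph[contig_a].add(contig_b)
            graph[contig_b].add(contig_a)
    return dict(graph)


def connected_groups(graph):
    """Return sets of contigs that are linked, directly or indirectly."""
    seen = set()
    groups = []
    for start in graph:
        if start in seen:
            continue
        stack = [start]
        group = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node] - seen)
        groups.append(group)
    return groups
```

Requiring more than one supporting mate pair per edge is a common way to keep chimeric or misaligned pairs from merging unrelated contigs. A real pipeline would also weigh coverage, GC content, and insert-size consistency before calling a group a candidate genome.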

Among the 14 candidate genomes they assembled, the researchers found a genome resembling that of a bacterial strain from the Rhodobacterales order called HTCC2255.

They also identified the 2.06-million-base genome of a Euryarchaeota marine group II archaeon known mainly from 16S sequencing of environmental samples. Gaps in the genome were filled by Sanger sequencing.

Through inferences from metagenomic sequence data, researchers estimated that the marine group II organism made up 1.7 percent of the reads generated for the May metagenome sample. That genome was sequenced to an average 118-fold coverage.

Patterns in the genome provided insights into the archaeal organism's lifestyle and evolutionary history. For instance, the researchers' analyses indicated that MG-II archaea were likely the source of the proteorhodopsin genes found in some bacteria.

Other Approaches

Other methods have been proposed for sorting reads from metagenomic data, MSU's Brown noted, including strategies that exploit paired-end sequencing information or glean information from tetranucleotide frequency patterns in sequence reads.

Because groups of four nucleotides occur at different frequencies in microbial genomes in patterns that reflect a range of factors — from an organism's environment to the DNA repair molecules it uses — these tetranucleotide patterns may be a useful source of information during genome assembly.

"The distribution of those four-letter words is a signature of specific organisms," Brown explained. "So you can often take 15 different microbes and you can split them up based on their tetranucleotide sequence distributions."
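To make the "four-letter words" idea concrete, the sketch below computes a tetranucleotide frequency profile for a sequence and a simple distance between two profiles. It is a minimal illustration, not the binning method used in any of the studies mentioned here. The function names are hypothetical, Euclidean distance is just one possible comparison, and the sketch does not merge reverse-complement tetramers as many binning tools do.

```python
import math
from itertools import product

# All 256 possible four-letter DNA words, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_profile(seq):
    """Return the normalized frequency of each tetranucleotide in `seq`.

    Windows containing characters other than A, C, G, or T are skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    total = 0
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[k] / total for k in TETRAMERS]


def profile_distance(p, q):
    """Euclidean distance between two tetranucleotide profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

Contigs with small profile distances are candidates for the same bin. In practice such profiles are only reliable for contigs of at least a few thousand bases, which is one reason this signal is usually combined with read coverage.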

In last year's permafrost paper, for example, researchers from LBNL, JGI, and elsewhere put together the 1.9-million-base draft genome for an uncharacterized soil methanogen using Velvet-assembled contigs that had been binned based on tetranucleotide frequency patterns and read coverage information.

Researchers involved in that study are now tackling even more diverse soil microbial communities, Jansson said. But while their binning methods are generally producing long stretches of sequence from metagenomic sequence data, systematic assembly of microbial draft genomes from these samples remains elusive.

"We haven't really seen that many draft genomes falling out of the data," she said. "We do get really long contigs, but nothing like whole genomes."

The tetranucleotide frequency-based binning approach "doesn't necessarily give you the complete genome or the connections in the genome," explained Brown, who is currently working with Jansson and her team on large soil metagenome assemblies.

"To do that, you generally have to do a lot of sort of manual labor, where you run a couple programs, see how things connect with the paired-end [reads]," he said. Automating that work appears to be the main advance in the University of Washington team's newly reported assembly pipeline, Brown noted.

"If that's true," he said, "the implications really are that this process that was fairly technical and sophisticated suddenly became a lot less technical and sophisticated — and more accessible."

For his part, Brown said he is keen to see whether he can combine some of the approaches described in the Science study with computational methods that he and his colleagues have been developing for scaling up the metagenomic assembly process.

"What we've been working on is ways to deal with these massive amounts of information that we're generating," he said.

Rather than assembling metagenomic reads and then partitioning them into different organisms, Brown explained, his team is working out ways to compartmentalize metagenomic reads before feeding them into Velvet.

"We're taking a very computer science, abstract approach where we're looking at the connectivity of the reads in the assembly graph prior to doing the assembly — and doing it in a very efficient and low-memory way," he said, noting that this strategy seems to be producing "chunks of genomes."
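A toy version of partitioning reads by connectivity before assembly might look like the following. It is only a sketch of the general idea, not Brown's implementation: it groups reads that share at least one exact k-mer using a union-find structure, stores every k-mer in an ordinary dictionary rather than anything memory-efficient, and ignores reverse complements and sequencing errors. The function name and the default k are hypothetical.

```python
def partition_reads(reads, k=5):
    """Group reads that are connected through shared k-mers.

    Two reads land in the same partition if they share a k-mer directly
    or are linked through a chain of other reads that do.
    """
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Remember the first read in which each k-mer appeared and merge any
    # later read containing the same k-mer into that read's partition.
    first_seen = {}
    for idx, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            if kmer in first_seen:
                union(first_seen[kmer], idx)
            else:
                first_seen[kmer] = idx

    partitions = {}
    for idx, read in enumerate(reads):
        partitions.setdefault(find(idx), []).append(read)
    return list(partitions.values())
```

Each partition could then be handed to an assembler such as Velvet on its own, so that no single assembly run has to hold the whole metagenome in memory.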

Brown believes the assembly pipeline developed by the University of Washington team could "sort of bolt on to what we're doing," though he is disappointed that the software is not yet available.

"The paper really rests on how good their process is," Brown said. At the moment, he added, it's impossible for others to reproduce the results described in the paper or to apply the same tools to try to answer their own research questions.

Iverson told GWDN that the software outlined in the archaeal genome paper will likely be released in three phases, starting with the read quality analysis software. That is expected to be available sometime in the next few weeks, potentially coinciding with the publication of a more bioinformatics-focused paper.

The team plans to release the SEAStAR software after that, followed by some or all of the other software used in the candidate genome assembly pipeline.

Have topics you'd like to see covered in In Sequence? Contact the editor at anderson [at] genomeweb [.] com.
