NEW YORK (GenomeWeb News) – By coming up with methods to piece together chunks of sequence from metagenomic data, University of Washington researchers have successfully constructed de novo genome assemblies for microbial species within a marine mix, they reported online today in Science.
The team used mate-pair sequencing with Life Technologies' SOLiD instrument to assess the metagenomic sequences present in surface seawater samples collected in Puget Sound in the fall of 2008 and in the following spring. After finding the best quality reads and putting them together into longer stretches of sequence, investigators relied on new computational strategies to create and sort sequence scaffolds into more than a dozen candidate genomes, representing both cultured and uncultured marine microbes.
Among them: an uncultured representative from a poorly characterized group of archaea known as marine group II.
"This whole group, marine group II, is a mysterious, uncultured group of archaea that's quite distantly related to anything else growing in the ocean that we know about — and even quite distantly related to other archaea," first author Vaughn Iverson, a graduate student in University of Washington School of Oceanography Director Virginia Armbrust's lab, told GenomeWeb Daily News.
"With this marine group II, there is no closely related genome available," he added. "So we were working completely blind. It's like putting together a puzzle and not having the box top."
The genome was assembled using metagenomic sequence from one of two surface water samples collected in Puget Sound. The first of these was collected in October 2008 and the second was collected in May 2009 to see how the microbial community changed seasonally.
Mate-pair metagenomic sequencing for the fall sample was done on the SOLiD v3.0 instrument at Life Tech. Researchers sequenced the metagenome for the second sample, also by mate-pair sequencing on the SOLiD v3.0, in their own lab.
Although they used 16S rDNA sequencing to help determine the species composition in the first sample, investigators subsequently came up with a computational strategy for teasing apart information on species composition in the community using the metagenome sequence data itself, Iverson explained.
"We got so much sequence for the metagenomes, that I developed techniques that can actually use existing 16S databases to estimate the abundance of these groups straight from the metagenomic sequences," he noted.
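The general idea of estimating group abundance from reads that match 16S references can be illustrated with a minimal Python sketch. This is not the team's method: it assumes each metagenomic read has already been assigned to a taxon by alignment against a 16S database, and it simply turns those assignments into proportions. The function name `relative_abundance` and the toy read assignments are hypothetical.

```python
from collections import Counter

def relative_abundance(read_hits):
    """Convert per-read taxon assignments into relative abundances.

    read_hits: iterable of taxon labels, one per read that matched a
    16S reference (hypothetical input; the alignment step is not shown).
    """
    counts = Counter(read_hits)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}

# Toy data, invented for illustration only.
hits = ["Rhodobacterales", "Rhodobacterales", "Rhodobacterales", "MG-II"]
abundance = relative_abundance(hits)
print(abundance)
```

A real estimate would also need to account for factors such as reference length and reads that match more than one taxon, which this sketch ignores.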
After doing quality analyses of their short-read sequence data and cleaning up the reads accordingly, the researchers put reads together into contigs using the open source fragment assembler VELVET.
By exploiting information provided by mate-pair sequence reads, they then cobbled the contigs together into longer scaffold sequences using software that Iverson wrote for the project, allowing them to create 14 candidate genomes for organisms in the Puget Sound samples.
"That extra bit of information — that ability to connect two reads together — was really key," Iverson said, calling the mate-pair information "the key piece of information that we leveraged … to be able to connect up the contigs that were generated by the VELVET assembler."
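How mate-pair information can connect contigs is easiest to see in a small example. The Python sketch below is an illustration of the general principle, not Iverson's scaffolding software: it counts mate pairs whose two reads map to different contigs and keeps the contig pairs supported by at least a minimum number of such pairs. The function name `link_contigs`, the `min_support` threshold, and the toy read names are all assumptions made for the example.

```python
from collections import Counter

def link_contigs(mate_pairs, read_to_contig, min_support=2):
    """Find contig pairs joined by at least `min_support` mate pairs.

    mate_pairs: iterable of (read_a, read_b) tuples.
    read_to_contig: dict mapping a read name to the contig it aligned to.
    """
    support = Counter()
    for read_a, read_b in mate_pairs:
        contig_a = read_to_contig.get(read_a)
        contig_b = read_to_contig.get(read_b)
        # Skip pairs with an unmapped read or both reads on one contig.
        if contig_a is None or contig_b is None or contig_a == contig_b:
            continue
        # Sort so (A, B) and (B, A) count as the same link.
        support[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}

# Toy data, invented for illustration only.
mate_pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6"), ("r7", "r8")]
read_to_contig = {
    "r1": "contigA", "r2": "contigB",
    "r3": "contigA", "r4": "contigB",
    "r5": "contigB", "r6": "contigC",
    "r7": "contigA", "r8": "contigA",
}
links = link_contigs(mate_pairs, read_to_contig)
print(links)
```

Requiring more than one supporting pair is a common guard against chance or chimeric links. A full scaffolder would also use read orientation and the expected distance between mates to order and space the contigs.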
From the seawater sample collected in the fall, for instance, the team found a genome that most closely resembled that of a bacterial strain called HTCC2255 from the order Rhodobacterales.
Another genome, found in the sample collected in May, belonged to an uncharacterized archaeal organism from Euryarchaeota marine group II, or MG-II, known primarily from 16S rDNA clone libraries reported by groups sequencing various environmental samples.
Past research suggested that archaea in the MG-II group contained proteorhodopsin, a gene better known in bacteria, where it codes for a light-powered proton pump, Iverson explained. Until now, though, the relationship between the proteorhodopsin genes found in MG-II archaea and bacteria was poorly understood.
The MG-II genome, which was just over two million bases long, appears to represent at least two variants of MG-II, based on available data from hyper-variable regions, though other available sequence data suggests there are at least five marine group II strains in the Puget Sound sample.
Using data from the current study, as well as comparisons with sequences from GenBank data on other metagenomic and environmental samples, the team identified genes in the MG-II genome predicted to code for almost 1,800 proteins.
As expected, the MG-II genome contained proteorhodopsin similar to that seen in some bacteria. But it also housed a second group of rhodopsins that are more distantly related to the bacterial version of the gene. Together with other genomic data, this finding points to an MG-II archaeal origin for the proteorhodopsin found in marine bacteria.
"It's kind of the opposite of what had been assumed," Iverson said. "[The rhodopsin gene] had been seen in so many marine bacteria but only seen in this one case in the archaea that it had been widely assumed, but not proven, that it had jumped from bacteria to the archaea."
"Based on the evidence we have, in fact, the opposite happened: this gene originated in archaea, one of the two versions of it jumped to the bacteria, and it subsequently radiated to all these different groups of bacteria."
In contrast to genetic patterns found in types of archaea that can fix inorganic carbon, the MG-II genome did not contain the genes typically associated with autotrophy. Instead, its metabolic genes, flagellar genes, and genes related to protein and lipid breakdown suggest that MG-II archaea are heterotrophs that rely on organic carbon sources.
The team plans to release software used for the study in three stages, Iverson noted, starting with the software for doing read quality analyses, which should be available within a few weeks. Down the road, the "Select and Estimate Abundance from Short Aligned Reads," or SEAStAR, software used to estimate relative species abundance from metagenomic sequence will also be released.
Some or all of the software from the pipeline used to produce candidate genomes will be made available after that, Iverson explained, depending on the approaches developed by other groups in the meantime.
"There are definitely pieces of it that we will release," he said. "Whether we release the whole thing as a pipeline is going to depend on what happens in the next few months to a year in the bioinformatics community."