Name: Patrick Chain
Position: Team leader for metagenomics applications, genome science group, Los Alamos National Laboratory, since 2009
Member of Department of Energy Joint Genome Institute's metagenomics, microbial, and microbial interactions programs
Experience and Education:
PhD candidate, microbiology and molecular genetics, Michigan State University, since 2006
Microbial genome program finishing lead and head of microbial interaction program, JGI, since 2004
Group leader, biology and biotechnology research program, Lawrence Livermore National Laboratory, 2000-2009
Research scientist, department of biology, McMaster University, 1999-2000
MS in molecular microbiology and genetics, McMaster University, 1998
BS in molecular biology and biotechnology, McMaster University, 1996
This article was originally published December 10.
Earlier this fall, a group of scientists, including representatives of several genome centers, published a policy paper in Science in which they proposed a new set of standards for describing the quality of assembled genomes.
As a result of new sequencing technologies that can produce quick and dirty draft genomes, they wrote, there has been "an ever-widening gap between drafted and finished genomes that only promises to continue." For many of these draft genomes, it has been difficult to assess their quality, creating "some havoc for genome analysis pipelines" and contributing to "many wasted hours."
In response, the authors proposed six categories to describe different stages of genome completion: standard draft, high-quality draft, improved high-quality draft, annotation-directed improvement, noncontiguous finished, and finished.
Standard draft, "the minimum standard for a submission to the public databases," according to the authors, comprises "minimally or unfiltered data, from any number of different sequencing platforms, that are assembled into contigs." At the other end of the scale, finished genome assemblies have "less than 1 error per 100,000 base pairs" and can act "as a high-quality reference genome for comparative purposes."
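For readers who handle assemblies programmatically, the six categories and the quoted error threshold for finished genomes can be sketched in a few lines of Python. This is only an illustrative sketch: the names `AssemblyStandard` and `meets_finished_error_rate` are hypothetical, and the check covers only the quoted error rate, not the other criteria the paper attaches to each category.

```python
from enum import IntEnum


class AssemblyStandard(IntEnum):
    """The six categories named in the paper, ordered from least to most complete.

    The integer values are an assumption made here for easy comparison;
    the paper itself does not number the categories.
    """
    STANDARD_DRAFT = 1
    HIGH_QUALITY_DRAFT = 2
    IMPROVED_HIGH_QUALITY_DRAFT = 3
    ANNOTATION_DIRECTED_IMPROVEMENT = 4
    NONCONTIGUOUS_FINISHED = 5
    FINISHED = 6


def meets_finished_error_rate(errors: int, assembly_length_bp: int) -> bool:
    """Return True if the error rate is below 1 error per 100,000 base pairs.

    Integer arithmetic avoids floating-point rounding:
    errors / length < 1 / 100_000  is the same as  errors * 100_000 < length.
    """
    if assembly_length_bp <= 0:
        raise ValueError("assembly length must be positive")
    if errors < 0:
        raise ValueError("error count cannot be negative")
    return errors * 100_000 < assembly_length_bp
```

As a usage example, a 5 Mb assembly with 40 known errors would pass this one check, while the same assembly with 50 errors sits exactly at the threshold and would not, because the quoted standard says "less than."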
In Sequence recently spoke with Patrick Chain, leader of the metagenomics applications team at Los Alamos National Laboratory and the first author of the paper. Below is an edited version of the conversation.
When did you realize that it was necessary to develop new quality standards for genome sequences?
I was certainly not alone. It never would have been realized without a number of people listed in the author line [of the paper]. A lot of credit has to go to the "Finishing in the Future" meeting, now called the "Sequencing, Finishing, and Analysis in the Future" meeting, which is held every May. It has been a very good meeting for the sequencing community to get together and discuss issues with how we bring about a finished or complete or usable genome, in terms of analysis.
As part of these meetings, which have been ongoing for several years, we had a series of roundtable discussions, where we tried to address issues in both the new technology field and how to deal with this data. And throughout these meetings, as well as other interactions with people, we had recurrent themes come up, and one of them was defining better standards for genomes.
With the advent of the new sequencing technologies, the shorter reads, and the inherent errors in some of these platforms, as well as the difficulty with which we can assemble these shorter reads, a number of different 'genome products,' let's call them, have come about. Some of these are released to GenBank just as such, assembled with packaged software that either comes with the platform or that you can use for these new technologies, and released to the public. I have heard numerous interesting stories about wasted time and effort on chasing ghosts, meaning chasing sequencing errors rather than actual biological information.
At the sequencing centers, we have first-hand knowledge of these issues. We are still driven by costs, so we have to try and find the most cost-effective way to bring about a high-quality genome. We have understood that now, with the ease with which we can sequence genomes, there are a number of new applications that have come up, like resequencing a number of very highly similar strains, or even evolved strains. In those cases you don't really need a completed genome; you are going after just some of the differences.
With the different types of projects and the different types of platforms, we realized there was really a need to distinguish between the different types of genome products.
How did you come up with the six categories proposed in the Science paper?
Through many discussions and meetings. There have been dozens and dozens of conference calls regarding these standards: how many, and how they should be described. And we have run the gamut from being extremely specific to exceedingly vague, and [from] having three categories to having 10 categories. Every center has its own biases, because of what they are tasked with sequencing, and how they are asked to perform those sequencing projects with a specific allocated [amount of] money. Obviously, you want to look as good as you can with the tools in hand. There were a lot of negotiations. They were still very friendly negotiations; it was just difficult to come up with a set of standards that everyone was comfortable with.
To what extent are these standards still a work in progress? What do you expect to change over time?
The way it's written, they are quite broad categories. This has pros and cons. One con that has already been mentioned is the fact that it's not so easy to categorize a genome into a specific bin. There is no definitive set of rules that suggests 'this belongs to one category and no other category.' However, this does allow us to encompass a large number of different types of projects, and to encompass new technologies that we haven't seen before. So we are not pinning ourselves down to specify a depth of coverage, or a specific number of reads per base that must be sequenced, because the different platforms will give you different levels of quality.
Also, because of the broadness of our categories, and because we have written in that one can specify [certain] regions that meet a particular category, you don't need to categorize an entire genome into one category. You can categorize it into several, just by specifying regions that meet specific qualities.
With regards to the future, we certainly want to be a bit more specific for platforms as well as for processes, both specific to the type of genome product one is aiming for, as well as for processes specific to individual sequencing centers. The manuscript was written in a way to try and encompass all of these, but it is incumbent upon all major centers to publish, or at least make publicly available, their list of specifics, and how each center will attempt to get genomes into a specific category, or what those thresholds are, and those will likely be specific to platforms or a combination of platforms used to get the genome.
Do any of the recently deposited genomes not even meet the minimum standard?
No. I would say that the way things are written, all genome sequences meet one of the standards. The draft standard is not really what any of the major sequencing centers aspire to. That's because, particularly for unsequenced or uncharacterized genomes, we try [to] get a very high quality, and that's true of every center. So the minimum that we generally release is a very high quality draft.
I would say that even novice centers that are not very familiar with the platforms that they are using, or even with the assembly techniques, or removing contamination from their reads, do fall into the draft category.
Most people who are running sequencing centers do have a very good grasp of the technologies and issues, but it's not always possible to have a team, or even a few people, dedicated to finishing genomes. It's become increasingly difficult with the shorter-read platforms. So we expect to see a lot more of the intermediate genome products. And really, this was an attempt to get ahead of the curve, anticipating this large number of intermediate products, and to be able to at least try and classify them, so that users can better assess the quality.
What kind of feedback have you received since your paper came out in Science?
I'd say, overwhelmingly positive. Either that, or I'd just say 'happy.' There have been responses from a number of people, particularly in the microbial scientific community that I'm more familiar with, that have been overwhelmingly positive. For quite some time, they have seen a need for additional metrics or standards.
We have also had plenty of positive feedback from more eukaryotic-centric people, and this effort has even been hailed by human geneticists as a great accomplishment.
And there has been some very fair critique of this as well. One [point of criticism], by Svante Pääbo, [an expert in] ancient DNA, was published in [a news article in] Nature. He said that what was really required was more metadata to describe the genome project, and that, [since the] level of contamination and quality of the DNA affect the outcome much more than anything else, maybe just a simple category is not sufficient. I entirely agree with this. However, our goal was really not to address the complete metadata that should be associated with genomes. That's another ongoing effort, by the Genomic Standards Consortium, which we are a part of as well.
How do you interact with the Genomic Standards Consortium?
Very tightly. We have discussed this at length with them and made sure that these terms are adopted. These will also be part of the Minimum Information about a Genome Sequence [standard].
How are you making sure that these standards are being adopted?
The easiest way is complete enforcement. We are discussing this, in part through the GSC, with the data repositories (GenBank, EMBL, and such) to make sure that there is a required field that must be filled in before one can submit a genome sequence.
Many of the major sequencing centers have agreed to abide by these standards, and will begin labeling their genomes as one of these categories. The Human Microbiome Project consortium is also adopting these terms. So that's just an indication of one effort to make sure that the new genomes coming out will have one of the tags.
When will this start?
It is already ongoing. In fact, as part of the JGI, another effort that we have been heavily involved in is the Genomic Encyclopedia of Bacteria and Archaea, or GEBA, project, which I believe is going to be published very shortly. This will be an analysis of the first 56 or so genomes that have come out of that effort, and all of those will also have one of these tags associated with them.
So efforts are already underway, and I believe that the Genomes OnLine Database, or GOLD, which is run by Nikos Kyrpides, is also adopting these terms, and the Sanger Center, I believe, is even retroactively trying to go back to some past genomes and provide some terms.
The main goal, of course, is to have a common language, so we can understand each other's products. It will be a great boon to scientists who rely on these GenBank entries and these genomes to do their analyses.
How useful can a genome of any one of these qualities be?
I'd say all genomic data is useful, as long as you understand the caveats implicit in these genomes. Obviously, the closer to finished you get, the greater the number of different types of analyses one can perform.