This article has been updated from a previous version, which incorrectly stated that the Baylor Human Genome Sequencing Center opened in 1998. It opened in 1996.
Associate director, Genome Center
Name: George Weinstock
Position: Associate director, Genome Center at Washington University, and professor of genetics, Washington University School of Medicine, since Jan. 2008
Experience and Education:
— Co-director, Human Genome Sequencing Center, and professor, department of molecular and human genetics, Baylor College of Medicine, 1998-2008
— Co-director, center for the study of emerging and re-emerging pathogens, University of Texas Houston Medical School, 1995-2001
— Professor, department of biochemistry and molecular biology and department of molecular and microbial genetics, University of Texas Houston Medical School, 1984-2001
— Head of the DNA metabolism section, NCI-Frederick Cancer Research Facility, 1980-1984
— Postdoctoral fellow, biochemistry department, Stanford University Medical School, 1977-1980
— PhD in microbiology, Massachusetts Institute of Technology, 1977
— BS in biophysics, University of Michigan, 1970
George Weinstock has been associate director of the Genome Center at Washington University since January. He joined the center from Baylor College of Medicine’s Human Genome Sequencing Center, which he had co-directed since 1998.
At Wash U, his main activity is to oversee microbial genomics, in particular the center’s share of the National Institutes of Health’s Roadmap Human Microbiome Project.
In Sequence spoke with Weinstock two weeks ago at the Biology of Genomes Meeting at Cold Spring Harbor Laboratory about the status of the project.
How did the Human Microbiome Project get started?
Before there was an official [National Institutes of Health] Human Microbiome Project, there were a series of white papers that had been submitted to [the National Human Genome Research Institute] by the NHGRI genome centers. The very first one of those came from Washington University’s genome center, and it was to sequence the genomes of 100 bacteria from the gut. That was really presaging what has since evolved into the larger-scale Human Microbiome Project that now has many moving parts to it.
Following [the white paper project], there was growing interest at NHGRI and at NIH in the human microbiome. And as a result of that, there were then two subsequent white papers submitted to NHGRI by [the sequencing centers at] Wash U, the Broad Institute, and Baylor. The second of those was to sequence 200 [additional] genomes, more genomes from the gut and also some genomes from the vagina.
A third white paper that was submitted was to do metagenomic sequencing. The sequences that you get from that give you a snapshot of the community structure, sort of a census, plus some information about the abundance of the different species that are present, too. Again, that was going to be a sample from the gut, but since you can’t comfortably do invasive sampling of the intestine, this would be a fecal sample. Of course, when you do metagenomic sequencing, you are sequencing both culturable as well as unculturable organisms, whereas the previous two white papers were only focusing on organisms that you could culture. It is not really known what the ratio of those two is in the human microbiome. Probably, there are twice as many unculturable organisms as culturable ones, but nobody really knows. In contrast, in the soil, and maybe the ocean as well, we know that only about 1 percent of the organisms are culturable.
The purpose of doing the reference genomes — the individual genome sequences of bacteria — has to do with the utility of the metagenomic sequencing. So if you take a sample that is a community of organisms, and you do shotgun sequencing of those, and now you get lots of individual sequence reads, how do you know what they are? You compare them to a database, and if you have the organism that they come from in the database, then you can recognize that this individual read represents this particular species. But if you don’t have that organism’s genome in the database, then you just have a read and it doesn’t match anything, and you don’t know very much about it. In the human body, there are thousands of species of bacteria, and less than 10 percent of those have been sequenced, so we needed to expand the catalog of genome sequences in order to intelligently interpret metagenomic sequence data.
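The classification idea described above can be sketched in a few lines: each shotgun read is compared against a database of reference genomes, and a read is assigned to a species only if some reference sequence contains it. This is a hypothetical toy illustration; the sequences and species pairings here are made up, and the substring test stands in for a real alignment tool such as BLAST.

```python
# Toy reference database: species name -> (made-up) genome sequence.
# In the real project this would be the catalog of sequenced reference genomes.
REFERENCE_DB = {
    "Bacteroides fragilis": "ATGGCGTACGTTAGCCGTA",
    "Lactobacillus crispatus": "TTGACCGGATACCGATGCA",
}

def classify_read(read: str) -> str:
    """Return the species whose reference genome contains the read,
    or 'unclassified' if the read matches nothing in the database."""
    for species, genome in REFERENCE_DB.items():
        if read in genome:  # stand-in for a real alignment (e.g. BLAST)
            return species
    return "unclassified"

# Three toy shotgun reads: two match a reference, one matches nothing,
# mirroring the situation where an organism's genome is absent from the catalog.
reads = ["GCGTACG", "ACCGGAT", "CCCCCCC"]
assignments = [classify_read(r) for r in reads]
```

The unclassified read is exactly the case the reference-genome effort addresses: the more genomes in the catalog, the fewer reads fall through.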
How far have these three projects progressed?
The reference genomes are part of the catalog of genomes that we are building. Because we have so many we have to do — 300 genomes — [they] have not yet been analyzed to the same extent as [bacteria causing] cholera, or tuberculosis, [where] you really want to know how it causes the disease. At the time those projects were proposed, there were maybe a total of 500 [bacterial] genomes that had ever been sequenced, so this was a very significant contribution to the number of genomes that have been done.
We wanted to drive cost down, [and] what really opened the door to do this whole project is the new sequencing technology. We [also] had to think about how to improve the automated methods for doing gene predictions and for doing annotation. So there are a lot of developments on the technical side that need to be done so that we can do hundreds and hundreds of bacterial genome sequences, and that’s been the major focus of these [projects].
At Wash U, we have probably done half or so of the number [of genomes] that we set out to do in that first white paper, and that’s picking up. We have identified all the bacteria that we want to sequence, and we have samples of all of them. This is one of the big problems, obtaining [the bacteria], getting DNA from them, making sure the quality is good, so we are very much involved in that.
There has [also] been a big process of trying to identify the organisms that we want to do. That has pulled in different institutes at the NIH who have experts on those particular body sites and understand their microbiome.
For the metagenomics part of the project, there is really only limited experience in terms of using new sequencing technologies to shotgun-sequence metagenomic samples and match those to databases and try to interpret the data. We want to generate a very large dataset and put that out there for us and other people to analyze and develop the software tools that are necessary to work on that. Again, it’s all focused on next-generation sequencing. What’s in the white paper, I believe, is to do something like 100 454 runs, which would be 40 million reads and 10 gigabases of data, so it’s a very significant survey of those samples. These are fecal samples that come from trios, monozygotic twins and their mother. I think we have three trios, and from those nine individuals, not only does it give us a very large dataset to begin to analyze, but [we can] also start to ask questions about variability between humans and [their] microbiomes, and one would hope that variability between twins is not as great as variability between families. But that’s what we are out to learn. We have identified the samples, centers have the DNA, and we are just about to start cranking on the sequence.
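As a back-of-the-envelope check, the figures quoted above are internally consistent: 100 runs yielding 40 million reads works out to 400,000 reads per run, and 10 gigabases over 40 million reads implies a 250-base average read, in line with 454 read lengths of that era.

```python
# Sanity-check the dataset size quoted for the metagenomics pilot.
runs = 100
total_reads = 40_000_000          # 40 million reads
total_bases = 10_000_000_000      # 10 gigabases

reads_per_run = total_reads // runs            # reads expected per 454 run
mean_read_length = total_bases // total_reads  # average bases per read
```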
What have you learned about the different new sequencing technologies, and what are their distinct advantages?
We have been very pleased with how they work. We are all focusing on, at the moment, particular approaches that rely mainly on 454 for the data production and assemblies, and getting a good draft genome. Eighty-five percent of these [reference genomes] will be draft genomes. We said we would finish 15 percent, and they would be organisms found to be important or interesting or have a particular role in the microbiome.
But in the meantime, it looks like the current state of 454 sequencing gives a good draft genome, and in some cases, we found that doing a little bit of Solexa sequencing and adding that data improves the overall quality. But all of these platforms continue to improve, so the need for Solexa [sequencing] is maybe not as important now as it was in the beginning. And in addition, we are constantly exploring, as each new platform comes out with a longer read length or with some improvement in their technology, using even the short reads from Solexa or SOLiD as possible ways of doing genome sequences. But the goal is to be able to do the data production for bacteria very inexpensively — reduce the cost by an order of magnitude or two — and get a decent assembly, so we can really do hundreds and hundreds of genomes and not break the bank. And that seems to be well within reach.
Eventually, are you going to settle on one method for data production, or do you expect that these methods will keep changing?
We build fluidity into [the process] and constantly reevaluate things. I think from the point of view of the reference genomes, whenever you have a pipeline and you solve the first bottleneck, you create another bottleneck downstream. This would seem to be a situation where we have removed bottlenecks in terms of producing good draft assemblies, and now the bottleneck is on informatics to do annotation and to do analysis.
So we focus more of our attention on how to do those processes. Traditionally, they involve a lot of manual inspection, so now we have to automate that more. [It involves] making a list of genes, building metabolic pathways, and doing other things downstream from that. We sort of have ‘mark 1’ versions of all of those things, so we can move forward.
There has been a lot of discussion about what the standards should be. We want to, on the one hand, reduce cost. On the other hand, we don’t want to compromise quality, and so one is trying to define quality in a very operational way. Of course, if you go to very high coverage, and you put a lot of manual effort into it, we know you can get fabulous genomes. But, since this is targeted to taking metagenomic sequences and Blasting them, what quality of a genome do you need?
Based on sequencing a number of reference genomes that we [had already finished], we have some good metrics that we are still discussing, but we are sort of zeroing in on what the final versions will be. And that will set the bar, then, for genome assemblies and gene prediction lists, so that for each center, regardless of what platform or what software they are using, the public can be assured that what was submitted to the public databases was at least this good in terms of the assembly and the gene predictions. That gets all the centers on the same page in terms of giving some uniformity of the quality.
What happened after the three pilot projects got started?
While [the three white paper pilot projects] were going on, NIH was going through a process of choosing the Roadmap initiatives, special projects that are of interest to all the institutes of NIH. They come from a special funding source, which is a pool of funds from the institutes. The Human Microbiome Project was picked to be one of the Roadmap projects last year (see In Sequence 6/19/2007); the other one was an epigenetics project.
There is a full set of [requests for applications] and different subprojects that come out as part of this, called different initiatives. Some of them are additional technology development, [such as] ways to purify and sequence unculturable organisms, software development for the analysis of those, how to do transcriptional analysis of metagenomic samples, how to continue to increase the size of the catalog of genomes, all kinds of things like that. Many of [these] build on the NHGRI [pilot projects] and take them to a much larger scale — this is a five-year program with much more funding. There is one RFA to select a number of projects and focus on diseases, and try to show that the microbiome causes, or correlates with, particular diseases.
Once [the Roadmap project] was approved, to get it started, what they call the ‘jump-start’ phase, [NIH] gave supplements to NIH-funded centers like the centers at NHGRI to do some work for one year. In the meantime, the RFAs were issued [and] the process has started to collect those grants and to get those funded.
The main activity [right now] is this jump-start project, which is to do another 200 reference genomes on top of the 300 that NHGRI was already doing, so that gets us up to 500. And one of these RFAs is to do another 400 genomes, [so] that will get us up to 900. And then the expectation is that there will be another 100 done elsewhere in the world, so the number to shoot at is to try to sequence 1,000 genomes in this first round of things. That would certainly improve the catalog of genomes that are out there.
The jump-start focuses on five body sites. These are nose, mouth, skin, the vagina, and the gut. There is still some discussion about what particular body sites within those [should be the focus], because you have lots of sub-anatomical regions to look at.
And then there is going to be the recruitment of a number of individuals. What was originally proposed was 250 individuals, but this may change before it is all finally decided. These individuals will be sampled at all of these five body sites, and on those samples we would then just do 16S ribosomal RNA sequencing rather than full metagenomics. The idea was to get a measure of the diversity of the different body sites in a large cohort of healthy individuals.
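One common way to turn 16S taxon counts into a single diversity number is the Shannon index; the project documents quoted here don't specify a metric, so this is only an illustrative choice, and the per-site counts below are made up.

```python
import math

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxon proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Hypothetical 16S taxon counts for two body sites (not real data).
gut_counts = [50, 30, 20]        # three taxa, uneven abundances
skin_counts = [25, 25, 25, 25]   # four taxa, perfectly even abundances
```

For a perfectly even community of n taxa the index equals ln(n), its maximum, so the even four-taxon site scores higher than the uneven three-taxon one; comparing such scores across body sites and individuals is one way to summarize the cohort's diversity.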
The activities of which technology to use and where to set the standards and how to automate the analysis and things like that, those are all part of both the jump-start and the white paper process now, because they are all folded together. And the recruitment of the individuals and the sampling and the 16S sequencing has not started yet because it has been a very complicated clinical protocol to write, sampling each individual at five different body sites. Those are just about to enter into the phase of getting approval by the [Institutional Review Boards] and institutions, so we can start and do the actual work.
When will you start studying disease, and what are you hoping to learn about how the microbiome influences disease or disease susceptibility?
That is the RFA that is due in June, [so] sometime early next year [is] when those projects would start.
There are plenty of examples of how your normal microbiota causes pathological conditions. For example, why do you get cavities? Dental caries is caused by the bacteria on your teeth. Your teeth are covered in plaque, which is a bacterial biofilm where there is a community of bacterial organisms living on the surface of your teeth. And normally, evolution has created an ecological balance, so that this particular tissue and those particular microorganisms can live without creating damage. But if you eat a lot of sugar, the organisms that can use carbohydrates overgrow the other ones, and those organisms can cause damage; they can release proteases or secrete acids or things like that. That is one of many classic examples of how your normal microbiota, when the ecology is changed, has deleterious consequences for you as the host.
Of course, tooth decay is quite common but not fatal; there are, however, a number of very, very nasty diseases where one is interested in studying how the microbial communities have changed, and trying to make this kind of connection with the ecology of a particular tissue. Some of the ones that are getting the most attention are inflammatory bowel diseases, like ulcerative colitis and Crohn’s disease, but there are plenty of other conditions that are like this. We suspect that there is a real microbial component to this, but we don’t really know what it is yet, and that’s what we are going to study.
Are you also going to study how the microbial community interacts with the host’s genotype?
In all these samples in the jump-start [phase] where people are being recruited, we are also taking blood samples. There is no plan to do genotyping or sequencing of those yet, but the expectation is that down the road, there will be some interest in doing that. And you can imagine that in some of these syndromes, individuals have mutations in genes of innate immunity [or] in the inflammatory response, and this causes some alteration in their ability to maintain the microbes in their gut or in whatever tissue, and as a result of that, they are more sensitive to having problems caused by their microbiota than other people who don’t have those mutations. You can imagine that there will be a host genotype as well as a microbial component, although that will remain to be seen. I imagine there will be many, many mechanisms, and this will turn out to be a very rich area for a mechanistic understanding of disease, both from the host side as well as from the microbial side.
Is there going to be an international human microbiome project?
There is an international Human Microbiome Consortium that has been formed. There have been a certain number of discussions to define what the charter of that would be and what its mission would be. Now, the application process is under way for groups around the world that have their own microbiome projects to join. The EU has funded some projects; there are microbiome projects in China and Japan; and there is a lot of interest in Australia, India, Canada, and other places. It’s at the formative stages, and one expects that over the next year, formal participation in this by many or all of these countries will be certified, and then that group will have its own regular meetings. And I think the goal is not only to exchange knowledge and to compare notes, but to address whatever issues need to be addressed to standardize nomenclature, data release, analysis, access to samples, various things like that, so that each of these individual projects can sort of leverage and synergize with the other ones and get more out of this that way.