Q&A: NCGR's Greg May on Building a Next-Gen Genome Center from Scratch


Name: Gregory May
Age: 43
Title: Vice president and director, Genome Center, National Center for Genome Resources, since 2006
Experience and Education:
Associate scientist and head of genomics program, Samuel Roberts Noble Foundation, Ardmore, Okla., 1999-2006
Assistant scientist, Boyce Thompson Institute, Cornell University, 1995-1999
Postdoctoral fellow, Institute of Bioscience and Technology, Houston, 1993-1995
PhD in plant physiology, Texas A&M University, 1992
BS in biology, Southeast Missouri State University, 1987

In 2007, the National Center for Genome Resources installed its first Illumina Genome Analyzer, marking a shift from pure bioinformatics research to both data production and analysis. As of earlier this year, the Santa Fe-based non-profit research institute had expanded its fleet to six Genome Analyzers and said it had generated 20 gigabases of sequence data in a single run on one of its instruments. In Sequence last week talked with Greg May, who runs NCGR's genome center, and asked him about the ongoing work at the institute.

When you joined NCGR in 2006, the center did not have any sequencers. How did the transformation from a bioinformatics research center to a sequencing center come about?

While we would continue to work on bioinformatics solutions for complex biological problems, the idea that we should also be in the data-generation mode was discussed. At that time, a lot of the next-gen sequencing was coming online, 454 in particular. We looked at getting involved with that, and we found that some of the technologies — Solexa in particular — enabled a workflow [with] potentially very minimal lab requirements. The instruments were fairly self-supportive, and the workflow worked well with generating lots of data, and a lot of uptime [was possible] with very few technical personnel in place.

We [received our first instrument in the fall of 2007], and very soon after — I think after our first two weeks — we realized we needed our second one [which arrived in November], and we just continued to grow as new projects came on board.

Today, we have six instruments. We will have another two within the next month, and we will probably add another two by the end of the year, so we should have [at least] 10 by the end of the year.

We [also] had to make a significant investment in new hardware, and we continue to develop software and bioinformatics tools and apply that to our customers and to our collaborators. I think we are in a unique position because we have been in bioinformatics for so long, and we do have the compute resource and infrastructure. Lots of groups can generate data, but I think when it comes to the data analysis, we are in a unique position.

We do have folks that ask for just data alone, and then within a few weeks, they come back [and say], 'Actually, I do need some help with the analysis.'

How is NCGR funded?

We are a non-profit research organization, and most of the funding is from federal grants, either NIH, USDA, or NSF. And then we have research contracts and our fee-for-service.

We have collaborations with academic groups from universities or other institutes, we have collaborations with industry, and we have just fee-for-service from both these kinds of groups as well.


Could you highlight a few interesting research projects?

Our initial project was with 454 technology, a collaboration between Brigham and Women's Hospital, NCGR, and 454, and that was to [sequence transcripts from] lung biopsies of mesothelioma, and was published in PNAS (see In Sequence 2/26/2008).

Other projects include schizophrenia — sequencing the transcriptomes of schizophrenic brain samples to identify candidate genes that may be involved in the disease process.

We have other projects where we are sequencing near-isogenic lines of soybeans. They are quite similar, but they differ in traits. And we are trying to find the differences between these lines, either at the expression or at the SNP level.

We [also] have the cotton genome project. And I think probably one of the bigger plant projects will be the National Science Foundation Medicago HapMap project, where we will sequence 400 ecotypes of Medicago.

There are [also] a couple of projects where we are helping other groups in the assembly and annotation and characterization of human genomes, [but] the specifics of these projects are not disclosed yet.

There is also interest in orphan diseases with the Beyond Batten Disease Foundation, where we are looking to use sequencing to determine if someone is a carrier for rare diseases (see other article in this issue).

What kind of interest have you seen for different sequencing applications – such as genome sequencing, transcriptome sequencing, or ChIP sequencing?

We have had almost 300 Solexa instrument runs [so far], and I would say that 90 percent of those or greater have been on transcriptome sequencing. The rest is resequencing or de novo sequencing at the genome level, and then a much smaller percentage is either ChIP-Seq or methylation enrichment-type sequencing. Most folks are either interested in the gene expression aspect or the SNP discovery aspect.

We originally started RNA sequencing in collaboration with Illumina to identify SNPs, reducing the amount of sequencing that you are doing, because you are focusing only on the expressed portions of the genome in those tissues. So the idea was to try to find SNPs in the expressed genes. It turned out [that] RNA sequencing is highly quantitative, so now you can get not only SNPs you are interested in, but you can [also] get the expression levels. What we are really looking for are SNPs that affect expression.

You decided fairly early on to focus exclusively on the Illumina sequencing technology. How did you make that decision?

We had originally looked at all three [existing technologies, from 454, Illumina, and Applied Biosystems]. We did at that time a bake-off — essentially, we gave the same nucleic acid sample to all three groups, and we compared the data we got back. And we also compared the workflow and the cost per data point.

And for us and our small research group, the Illumina platform was our best choice at the time. All these groups have continued to make improvements. As I said before, after we had our first instrument, we realized quickly, 'This is really working well in our hands, let's get another one,' and the demand was there. And part of continuing with the same platform is just ease of management: not having two different platforms, two different supply chains, or technicians trained on two different platforms and instruments.

How many people do you have in the lab?

We have two technicians and a lab manager. That's pretty skinny.


How has the Illumina technology improved at your center since you installed your first sequencer?

[With] our first instrument, we were happy to get 400 to 500 megabases [per run]. And then, with gradual improvements both in our hands and in the chemistry and the software along the way, we were soon getting 700 and 900 [megabases], and then we finally started getting a gigabase of sequence routinely. And then with paired ends, we started getting 3 gigs of sequence. And then, with new chemistry, we began running longer runs. Starting off, we consistently only ran 36-base pair runs. Then we ran 48 base pairs, and now 90 base pairs for paired-end runs is standard, and we are running 106, and we are looking at trying a 124-base pair run. That's off of each end, so a 106-base pair paired-end run actually reads 212 bases per fragment. And that's how we got to a 20-gig run earlier [this year].
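The arithmetic behind those figures can be sketched in a few lines. The cluster count below is a hypothetical round number, not one May quotes; it is chosen only so the total lands near the 20-gigabase run he describes:

```python
# Illustrative throughput arithmetic for a paired-end sequencing run.
# The read count is an assumed round number, not a figure from the
# interview, picked so the result is close to 20 gigabases.

def run_yield_bp(reads: int, read_length: int, paired_end: bool = True) -> int:
    """Total bases generated: each fragment is read from one or both ends."""
    ends = 2 if paired_end else 1
    return reads * read_length * ends

# A 106-bp paired-end run reads 212 bases per fragment (106 from each end).
assert run_yield_bp(reads=1, read_length=106) == 212

# ~94 million clusters at 2 x 106 bp gives roughly 20 gigabases.
total = run_yield_bp(reads=94_000_000, read_length=106)
print(f"{total / 1e9:.1f} Gb")  # prints "19.9 Gb"
```

The same function also reproduces the earlier milestones: about 36 million single-end 36-bp reads is the ballpark of the early one-gigabase runs.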

Do these improvements enable new kinds of projects that you could not do two years ago?

Yes, they do. They increased the amount of output, and as a result, they decreased the price per base. So some projects that were on the bubble as far as being economically feasible are much less expensive today. It opens up the availability of large datasets [to] folks that can't normally afford to do this with traditional Sanger approaches. We have seen this with lots of international groups with small pots of money, [who] were able to generate tremendous amounts of data and accomplish their projects with this approach.

What hurdles did you have to overcome along the way?

[One of the] things folks have to deal with is the volumes of data, the terabyte-size files. As the runs continue to increase in size, there is that much more data. And when you spec out hardware for 1 gig of sequence, and now the instruments are putting out 20 gigs of sequence, you have to be flexible and forward-looking on what they are really going to be generating for you. So we continue to add instruments, and the projections for this summer are 30-50 gigabases per run on these longer runs, [and] there is just a scalability issue for folks to keep in mind.
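To get a rough sense of the scalability issue May describes, the sketch below estimates uncompressed storage under an assumed ~2 bytes per base in FASTQ-style output (roughly one byte for the base call and one for its quality score); the per-run yields and run count are illustrative, not NCGR's actual numbers:

```python
# Back-of-the-envelope storage estimate for sequencer output.
# Assumption (not from the interview): uncompressed FASTQ-style data
# costs about 2 bytes per base (base call + quality score).

BYTES_PER_BASE = 2

def storage_tb(gigabases_per_run: float, runs: int) -> float:
    """Approximate uncompressed footprint in terabytes."""
    total_bases = gigabases_per_run * 1e9 * runs
    return total_bases * BYTES_PER_BASE / 1e12

# Hardware specced for 1-Gb runs versus instruments producing 20-Gb runs:
print(f"{storage_tb(1, 300):.1f} TB")   # prints "0.6 TB"
print(f"{storage_tb(20, 300):.1f} TB")  # prints "12.0 TB"
```

A 20-fold jump in per-run yield turns a sub-terabyte planning problem into a multi-terabyte one before intermediate analysis files are even counted, which is the flexibility May is urging.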

I have heard that some labs are starting to spend more money on computers than on sequencers. Is that true for your center also?

It will be. I think we had spent heavily on the sequencer side, but now we are slowing down and really spending heavily on the infrastructure side. I think by the end of the year, it will be a dollar-for-dollar match on that. You can be drowning in your own data if you don't prepare and invest in the hardware side of things.

Looking forward, are you considering adding other sequencing technologies? What additional benefits would these need to provide?

The philosophy that we have here on this is, we can't be adding equipment that gives us what we have now. So if another company comes along and is a direct competitor with Illumina, [if] it gives the same performance, the same read length, the same output as Illumina, there is no reason to add another platform.

What we really need is a technology that comes along that is a paradigm shift. That could be in really long read lengths; I think that's one of particular interest. And I think for a lot of the applications in crop improvement, for de novo sequencing, or in metagenomics, having really long read lengths is attractive. We are following closely the progress of some of these groups, and we are hoping that we could maybe be a beta test site for some of these folks.