Q&A: Yale University's Mark Gerstein on the Real Cost of Sequencing


A recent study by scientists at Yale University suggests that the actual cost of sequencing may be much higher than some current estimates indicate, since those figures may not factor in the analysis costs that are necessary for a successful sequencing project.

In the paper, published in Genome Biology last month, Yale's Mark Gerstein and colleagues consider costs that weren’t taken into account in a survey conducted by the National Human Genome Research Institute, which pegged the cost per genome as of March 2011 at a little over $10,000.

Gerstein and colleagues note that the NHGRI survey, which analyzed data from the Large-Scale Genome Sequencing Program, omitted so-called "non-production activities," such as costs for the development of computational tools to improve sequencing pipelines or downstream analysis; quality assessment and quality control; technology development to improve sequencing pipelines; management of individual sequencing projects; informatics equipment; and downstream analyses such as sequence assembly, sequence alignment, identifying variants, and the interpretation of results.

They estimate that the cost of downstream analysis for a whole-genome sequencing project could add as much as $100,000 to the overall costs.

BioInform spoke with Gerstein earlier this month. What follows is an edited version of the conversation.

Why did you conduct this analysis?

A few months ago, the [National Center for Biotechnology Information] announced that it was potentially closing [the Short Read Archive]. That was big news in bioinformatics because the SRA is the resting place for a lot of sequence [data]. That precipitated a workshop that the NIH organized afterwards on the costs associated with storing and managing data and thinking about it in different communities such as DNA sequencing, RNA sequencing, metagenomics, and so forth.

A Genome Biology representative was at the workshop and they asked me if I wanted to write an opinion piece addressing these issues. That was the genesis of this particular piece.

In your paper, you include some graphs that show that sequencing has long since outpaced Moore's law and storage seems to be coming along nicely but then analysis is lagging behind. Why is that the case?

I think the thing about analysis that makes it much more problematic is that it's not a single thing that’s easily measured. There are certain analyses that are to some degree straightforward and that have certain scaling properties with more sequences, and then there are other things that are much less well defined. Usually the things that are fairly undefined, or not as precisely defined, tend to scale much worse. Mapping the reads to the genome would be a type of analysis that is fairly well defined and has very defined scaling properties relative to the number of reads.

On the other hand, here's an example of something that wouldn’t scale very well as you sequence more and more genomes: you need to interpret them and you might want to interpret the variants in light of annotation or to integrate variation with annotation. That is a reasonable thing to do but it’s just much less well defined what it means. Potentially, it could involve things that could take a lot of time and the amount of time could scale in a very nonlinear way relative to having two genomes, five genomes, a hundred genomes, and so forth.
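[Editor's illustration: the contrast Gerstein draws between well-defined and poorly defined analyses can be sketched with a toy calculation. The two cost models below are assumptions made for illustration and do not come from the paper: read mapping is treated as work proportional to the number of genomes, and interpretation is treated, hypothetically, as requiring every pair of genomes to be compared.]

```python
# Toy comparison of linear vs. nonlinear scaling.
# Both cost models are illustrative assumptions, not figures from the paper.

def mapping_work(num_genomes):
    """Assumed linear model: one unit of mapping work per genome."""
    return num_genomes

def pairwise_comparisons(num_genomes):
    """Assumed nonlinear model: every pair of genomes is compared once."""
    return num_genomes * (num_genomes - 1) // 2

for n in (2, 5, 100):
    print(f"{n} genomes: mapping={mapping_work(n)}, "
          f"pairwise={pairwise_comparisons(n)}")
```

[Under these assumptions, going from two genomes to a hundred multiplies the mapping work by 50 but the pairwise work by 4,950.]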

Isn’t the analysis problem made worse by insufficient funds?

Historically, analysis has always been underfunded relative to data production. I think genomics and biological science in general have historically always emphasized data production, and with good reason. Now people are coming to the realization that the data is almost free. You can produce a gargantuan amount of data for almost nothing and it's really changing people’s view, because previously they always saw the data as the valuable thing and the analysis was an afterthought and easy to do. Now the whole equation changes around. It’s easy to get the data, but suddenly now there is this whole new thing that hadn’t really been thought of before: the analysis, which is taking up this bigger place in people’s thinking about things.

Is the message that data analysis is a necessary component of the research process really getting out into the funding agencies?

That’s a hard question to answer. I would say that [the National Institutes of Health] and [the National Science Foundation] certainly support computational biology and they realize that next-generation sequencing is putting a premium on their offices and they are certainly issuing increasingly more [requests for applications] and programs that are pointed more at the development of analysis tools or workflows. That said, ... it’s not trivial being funded and ... it’s probably still considerably harder to garner funding for bioinformatics than for clinical medicine.

How well are researchers budgeting for analysis in their grant proposals?

I think increasingly when funds are allocated in budgets for projects that generate these datasets, part of that is for someone to do some sort of analysis. In these things, the analysis tends to be somewhat underfunded relative to the data production in the sense that usually, you are seeing the person who is doing the analysis not being able to keep up with things and I think that’s partially because the budget was written years ago and suddenly you can generate much more data for a given dollar. Scaling isn’t taken into account. I think also there is a historic de-emphasis on analysis relative to data production.

In the paper, you mention that the costs for experimental setup and design have increased. Why is that the case?

There are two aspects of [cost] going up. There is going up in the relative sense and in an absolute sense. Clearly, as the cost of NGS goes to essentially zero, almost by definition, the other components to doing an experiment have to increase in relative contribution. For example, if an experiment once cost $1 to collect samples, $1 to do the sequencing and $1 for the analysis, and the sequencing cost dropped to zero, the [relative] cost of the other things goes up even if the absolute cost goes down.
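[Editor's illustration: the arithmetic in Gerstein's $1/$1/$1 example can be checked with a short sketch. The function name and the dictionary keys are hypothetical labels chosen for this illustration; the dollar figures are the ones he gives above.]

```python
# Relative share of each cost component before and after the
# sequencing cost drops to zero, using the $1/$1/$1 example above.

def relative_shares(costs):
    """Return each component's fraction of the total cost."""
    total = sum(costs.values())
    return {name: value / total for name, value in costs.items()}

before = {"samples": 1.0, "sequencing": 1.0, "analysis": 1.0}
after = {"samples": 1.0, "sequencing": 0.0, "analysis": 1.0}

shares_before = relative_shares(before)
shares_after = relative_shares(after)

for name in before:
    print(f"{name}: {shares_before[name]:.0%} -> {shares_after[name]:.0%}")
```

[Sample collection and analysis each rise from about a third of the budget to half of it, while the total falls from $3 to $2.]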

Another aspect is that, because the cost of sequencing is dropping to zero and sequencing is becoming much easier, people are now tackling much harder-to-procure samples. Now, if you look at [an] experiment, most of it is procuring the specimen and very little is the actual sequencing.

Now that sequencing is moving into clinics, will analysis become even more expensive?

I think that the data reduction end of things can get commoditized and I can easily imagine in a clinic that a lot of standard analysis would be automatically run and I suspect that the sequencing companies would like to incorporate that analysis into their products. Thus, the machines would not only sequence the genome but they would automatically map [reads] against the reference and automatically call variants. I don’t think the interpretation and the downstream stuff would be that quickly commoditized. Those things will remain quite expensive.

What’s the way forward? How can data analysis catch up?

I don’t know if it’s a question of catching up. I think it’s just that the world has changed and it’s just become much, much easier to procure sequencing data and that the cost structure of a lot of things is going to fundamentally change.
