After the genome, DOE has a choice: Be a high-throughput data contractor, or a role model for the new biology?
by Oliver Baker
Trevor Hawkins, deputy director at the US Department of Energy’s Joint Genome Institute, sits behind a low barricade of papers, binders, and software boxes at his desk at the agency’s Walnut Creek Production Sequencing Facility. An empty cardboard carton lies in one corner of the room and a fresh lab coat lies in another. Only a big laser-print “B” tacked near the window by a construction crew decorates the wall.
“New office,” explains Hawkins, with a faint English accent. He wears wire-rim glasses and a collared shirt with no tie.
Hawkins now occupies a corner of a refurbished one-story cinderblock building — a twin to the 30,000-square-foot building next door where his office used to be. Both are remnants of a 1960s-era Dow Chemical agricultural research facility. Hawkins calls the older facility, which houses the JGI’s MegaBACE farm, the “engine that churns out bases.” The new addition will host research and production work of a new sort.
The campus is DOE’s stab at post-genomic paradise. Researchers here are automating and scaling up the process of first isolating proteins from 2-D gels and then analyzing them by mass spectrometry. They are doing the same for the process of collecting RNAs from cells and depositing sequences on microarrays. And they are readying to clone and express genomic sequences to purify and crystallize peptides and ship them over the hill to Lawrence Berkeley National Laboratory’s synchrotron light source.
DOE’s heritage in nuclear physics gives it a corner on the market for synchrotrons, and a chance to shine in structural genomics. Ari Patrinos, who heads the agency’s Office of Biological and Environmental Research, has said the prospective role for DOE in a worldwide structural genomics program excites him.
Patrinos’ sentiment isn’t surprising. His agency gave birth in 1986 to what evolved into the international Human Genome Project, but was relegated to second fiddle behind the National Institutes of Health when Congressional funders found genomes more relevant to health than to energy. In 1988 the two agencies received almost equal shares of nearly $30 million in public genome funds. This year, NIH’s $360 million slice of the genome pie was four times bigger than DOE’s.
Now could be the Energy Department’s time to gain back some glory in genome research. The JGI is clearly positioning itself for biology’s next revolution.
But is this to be a scientific revolution, or an industrial one? The human genome-sequencing race has turned science here into an assembly-line-like exercise in mass data production.
Few would deny that DOE would be making a useful contribution to the life sciences if it delivered more monolithic mounds of data, just as other agencies deliver dams or bridges.
But beneath the hard-hats of the DOE dam-builders are thousands of scientific minds. This is Science with a capital S, after all: it’s the agency that brought you many of the known constituents of matter. So if genomic data will enable a new approach to biology, these are folks who could be showing how it’s done.
Is DOE working toward a biological revolution, or is it locked in mass-production mode? On a visit to JGI the messages are mixed.
Provincialism at Play
Hawkins has an idea for a new Web portal that he hopes to have engineered for JGI. He describes it as a collection of windows on the whole spectrum of data that the institute will generate—sequences, SNPs, microarray images, structures, and tissue localization. “A layer cake is the way I see it,” he says.
His portal would enable researchers to flit through varieties of biological data. He says no such thing exists in the public domain.
But the goals he relates are provincial: to integrate just JGI-generated data. If expression data from another institution were to implicate a JGI-sequenced gene in cell proliferation, users of the portal would not be told.
As welcome as a broadly integrated portal would be, bioinformaticists say they need to be able to do more than browse through the cordoned paths and analytical tools of one institution. At companies that have lots of proprietary data and their own analytical goals, informaticists need to bring data in house—a task that pits them not against the portal but against the flat files that the portal’s supporting institution uses to export data.
The lack of a common file format and the surfeit of portals place a pile of busywork in the way of integration, and bioinformaticists around the world are indeed busying themselves with largely duplicative efforts. The problem is a theme for meetings year after year.
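The duplicative busywork can be made concrete with a toy sketch: two invented flat-file formats carrying the same kind of record, each demanding its own parser before an informaticist can merge the data in house. Every field name and format below is hypothetical, chosen only to illustrate the problem.

```python
# Two hypothetical flat-file layouts for the same record type.
# Neither format nor its field names come from any real database.

def parse_pipe_format(line):
    """Parse a record like 'ID123|BRCA1|human' (pipe-delimited)."""
    acc, symbol, organism = line.strip().split("|")
    return {"accession": acc, "symbol": symbol, "organism": organism}

def parse_tagged_format(lines):
    """Parse a record with 'AC   ID123'-style two-letter tagged lines."""
    record = {}
    tags = {"AC": "accession", "GN": "symbol", "OS": "organism"}
    for line in lines:
        tag, value = line[:2], line[5:].strip()
        if tag in tags:
            record[tags[tag]] = value
    return record

# Both parsers must be written, tested, and maintained separately —
# only then do the records land in one common representation:
a = parse_pipe_format("ID123|BRCA1|human")
b = parse_tagged_format(["AC   ID123", "GN   BRCA1", "OS   human"])
assert a == b
```

Multiply that pair of parsers by dozens of databases and dozens of institutions, and the scale of the duplication becomes clear.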
But it’s not an interest of JGI, says Hawkins. He calls the problem unfortunate, but says it’s not to be dwelled on while things are changing so rapidly. “Just put the data out there,” says Hawkins. “It’ll work out.”
Nor is Hawkins yet captivated by the quasi-integration of data that sequence annotation accomplishes. His calm monotone rises in pitch when he addresses the notion that an annotated human genome exists. “It’s just the stupidest thing I’ve ever heard,” he says. “Let’s annotate it with some real data.”
JGI is sequencing the mouse genome now, and Hawkins says that more than a dozen microbes are in line behind it. Chicken is also under consideration, and Hawkins says he hopes some day to sequence a mammal per year.
Hawkins remarks that biology is taking lessons from the automotive industry these days.
Drenched by Data
David Thomassen, who works in Washington as programs coordinator for the DOE office that funds JGI and other genome research, confirms that JGI’s emphasis lately has been on sequence production. “Essentially all JGI funds” have gone to high-throughput DNA sequencing “in one way or another,” he says. The mania was the same as what struck NIH and agencies overseas, he says, dating the policy to around the time of Celera Genomics’ entrée into genome sequencing.
Before production fever set in, Thomassen’s office routed as much as $11 million per year to basic bioinformatics research—such as for the development of tools for integrating and analyzing diverse sets of data. But the share of funds for JGI—and hence for bioinformatics tied to production—has risen, while the portion going to basic research has declined. The money for basic research this year was only half of its previous high.
From where he stands, Tom Slezak says it’s been a tough climate for developing tools for the post-genomic era. Slezak’s title is bioinformatics team leader for JGI, and he divides his time between Walnut Creek and Lawrence Livermore National Laboratory.
An example Slezak cites is the Data Foundry: A two-person project at Livermore with the goal of developing middleware to automate extraction and warehousing of data from different sources on the Web.
“It’s using metadata to automatically create the wrappers and the translators,” Slezak says. Translators recast data from source ontologies (the conceptual and textual schemes used for conveying biological information) into the warehouse’s own. A Sybase object-relational database archives the data.
With the metadata composed so far, Slezak says the Foundry team’s server can inhale data from PDB, SwissProt, SCOP, and dbEST — databases containing 3-D coordinates, annotated protein sequences, fold classifications, and expressed sequence tags. He says a patent has been filed and papers have been published in IEEE journals.
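The metadata-driven approach Slezak describes can be sketched in a few lines: per-source metadata maps each source’s field names (its “ontology”) onto the warehouse’s own schema, so a single generic translator serves every source. The schemas, mappings, and field names below are invented for illustration, not taken from the Data Foundry itself.

```python
# Warehouse-side field names (hypothetical).
WAREHOUSE_SCHEMA = {"accession", "description", "sequence"}

# Per-source metadata: how each source's fields map onto the warehouse's.
# Adding a new source means adding metadata, not writing a new translator.
SOURCE_METADATA = {
    "swissprot_like": {"AC": "accession", "DE": "description", "SQ": "sequence"},
    "pdb_like":       {"idCode": "accession", "title": "description"},
}

def translate(source, record):
    """Recast a source record into warehouse terms using its metadata."""
    mapping = SOURCE_METADATA[source]
    out = {mapping[k]: v for k, v in record.items() if k in mapping}
    assert set(out) <= WAREHOUSE_SCHEMA  # never emit unknown fields
    return out

translate("swissprot_like", {"AC": "P38398", "DE": "BRCA1 protein"})
# yields {"accession": "P38398", "description": "BRCA1 protein"}
```

The payoff is that the translator itself never changes: supporting a new database reduces to composing its metadata entry.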
But successive attempts never won the project a peer-review ranking that could have reaped it a helping from the diminishing pot of DOE funds earmarked for basic bioinformatics research. Slezak says he’s had to scrape to keep the project going.
He says he agrees one hundred percent with a comment Patrinos made in a recent speech: “It’s very, very hard to overestimate the difficulties we may face as we get drenched by all the data that are heading our way.”
Adds Slezak, “Within the DOE we’re going to be paying for the fact that we have not made significant long-term investment in the kinds of research tools needed.”
Meandering the Money Trail
Inroads on the problem may depend on superficially subtle changes in how genome research is organized within DOE. One factor—illustrated by Thomassen’s bioinformatics numbers—is the portion of DOE biology to happen within JGI. The other is who decides what projects JGI pursues.
Thomassen says DOE has always intended for JGI managers — whose offices are now at the 18-month-old facility in Walnut Creek — to provide a budget and formulate project proposals (while external and internal DOE advisors also figure, he adds).
The money trail hints at a more federated process, because the routing of JGI funds over its four-year history remains like it was when Lawrence Berkeley, Lawrence Livermore, and Los Alamos National Laboratories hosted independent genome centers. Even money for the Walnut Creek headquarters and production sequencing facility flows via the national labs, which receive their JGI funds from Washington as they do funds for non-JGI genome projects of their own initiative.
The say of member labs in JGI priorities and project choices has oscillated, comments one insider who asks not to be named, and heated quarrels have coincided with these swings. Thomassen says DOE is devising changes in how projects will be identified and funded in the future.
JGI might be viewed as an institute in draft phase. While Thomassen regards it as three labs plus the production sequencing facility — reflecting perhaps how he distributes its funds — Hawkins sees it as the sequencing facility plus five.
Hawkins counts Oak Ridge National Laboratory: though Oak Ridge has traditionally hosted its own bioinformatics efforts without JGI funds, researchers there are now contracting out services to the JGI.
Hawkins also counts the Stanford Human Genome Center. Though the center calls itself an NIH facility, it has done finishing work on chromosomes 5, 16, and 19, which are DOE’s dominion in the international sequencing effort.
Hawkins’ list of goals for the new building in Walnut Creek includes “wet-bench work” that is to take its lead from sequence data. Scientists will investigate gene regulation by first comparing the mouse genome and chromosomes 5, 16, and 19 of humans for shared arrangements of genes.
Hawkins says that with 80 million years of divergence between mice and humans, the sequences where proteins bind and orchestrate gene activation will stand like red flags in the desert of dissimilar junk. He wants JGI scientists to trace the co-activated genes to the enzymes they express, to their interactive role in cell physiology, and to the molecular mechanics of how they do it using structural biology. Microarray and other data will indicate what tissues these genes turn on in and point to their roles in health.
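The “red flags in the desert” logic lends itself to a back-of-the-envelope sketch: slide a window along aligned orthologous human and mouse sequence and flag windows whose identity exceeds a threshold, on the reasoning that noncoding blocks conserved across 80 million years are candidate regulatory sites. The sequences, window size, and cutoff below are illustrative only, not JGI’s actual method.

```python
def conserved_windows(human, mouse, window=10, cutoff=0.8):
    """Return (start, identity) for aligned windows at or above the cutoff."""
    hits = []
    for start in range(len(human) - window + 1):
        h = human[start:start + window]
        m = mouse[start:start + window]
        identity = sum(a == b for a, b in zip(h, m)) / window
        if identity >= cutoff:
            hits.append((start, identity))
    return hits

# Toy aligned sequences: a conserved block up front, divergence behind it.
human = "ACGTACGTACTTTTGCGCAT"
mouse = "ACGTACGTACAAGTGCGAAT"
print(conserved_windows(human, mouse))
```

In practice the comparison runs genome-scale and must first solve the alignment itself, but the principle — conservation as a flag for function — is the same.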
A Bioinformatics Czar?
A final organizational undulation that may play into the integration technology of the future is the creation of a new position—associate director for computational genomics. In July, Berkeley physicist Dan Rokhsar assumed the role, after heading LBNL’s Computational and Theoretical Biology Department. Conversations with Hawkins led to the job, says Rokhsar.
He says he’s a fan of the XML-based distributed annotation system (DAS) advocated by Cold Spring Harbor’s Lincoln Stein and others. Rokhsar likes the grass-roots structure of the Napster-like system, he says. Of the goal of integrating biological data, Rokhsar says, “I think that it’s going to have to be a common effort.”
Embracing DAS or something like it is crucial to effectively presenting the range of data that JGI intends to mass produce, Rokhsar says, and he wants the institution to have a role in the development of standards. He says he also wants JGI to assume the task of developing tools for comparative analysis, which will work with and exploit the new standard: “I’m putting it into the budget,” he says.
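The distributed-annotation idea Rokhsar favors can be illustrated with a toy exchange: a server returns annotations for a sequence region as XML, and any client can layer those features over sequence obtained elsewhere. The XML shape below is a stripped-down stand-in for illustration, not the actual DAS response format.

```python
import xml.etree.ElementTree as ET

# A hypothetical, simplified annotation response for a region of
# chromosome 19 (element names and attributes invented for this sketch).
response = """
<SEGMENT id="chr19" start="1000" stop="2000">
  <FEATURE id="f1" type="exon" start="1050" stop="1120"/>
  <FEATURE id="f2" type="binding_site" start="1500" stop="1512"/>
</SEGMENT>
"""

segment = ET.fromstring(response.strip())
features = [
    (f.get("type"), int(f.get("start")), int(f.get("stop")))
    for f in segment.findall("FEATURE")
]
print(features)  # [('exon', 1050, 1120), ('binding_site', 1500, 1512)]
```

Because the payload is plain XML over the Web, any lab can stand up its own annotation server — the Napster-like, grass-roots structure Rokhsar likes.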
Having only spent a month on the job, Rokhsar doesn’t venture to guess how many of his plans will actually garner funding—a matter largely up to Hawkins—but he says he’s hopeful.
The JGI’s short history may not seem to offer him grounds, but hey, this is a revolution.