In another indication that bioinformatics has entered the post-sequencing phase, the US National Center for Biotechnology Information has launched the Gene Expression Omnibus, a public database for gene expression data. Alex Lash, director of the fledgling resource, expects that the database, which began accepting submissions at the end of July and now contains 15,000-20,000 measurements, will mirror GenBank’s pattern of sustained, robust growth.
“We expect a target of 500 million in the first two years,” Lash said. There are 37 current submitters, including Brian Oliver of the US National Institutes of Health’s National Institute of Diabetes, Digestive, and Kidney Diseases and Marc Kenzelmann of the European Molecular Biology Laboratory, who were asked by publishers to make their data publicly available.
The omnibus grew out of NCBI’s SAGEmap project (SAGE stands for serial analysis of gene expression), which was developed for Bob Strausberg’s four-year-old Cancer Genome Anatomy Project at the National Cancer Institute, a catalog of genes expressed during oncogenesis. While SAGEmap is limited to expression data generated by the SAGE methodology, lessons learned during its development are applicable to all types of gene expression data and are being incorporated into the omnibus.
The omnibus accepts data from a host of platforms, including the two-channel microarray method invented by Pat Brown, SAGE, and Affymetrix high-density oligonucleotide arrays.
Expression data is complex and voluminous, so it is trickier to design the omnibus database than a genomic database like GenBank. Part of the problem is the sheer number of data elements.
“If a thousand microarrays produce a hundred experiments each, and each array has 10,000 spots, then we’re talking about a billion data measurements,” Lash said.
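The arithmetic behind Lash’s estimate can be checked with a short sketch. The three factors come straight from the quote; the storage figure at the end rests on an assumed 8 bytes per measurement, which is an illustration and not a number from NCBI.

```python
# Back-of-the-envelope check of the measurement count in Lash's quote.
first_factor = 1_000          # "a thousand" in the quote
experiments_each = 100        # "a hundred experiments each"
spots_per_array = 10_000      # "each array has 10,000 spots"

total_measurements = first_factor * experiments_each * spots_per_array
print(f"{total_measurements:,} measurements")  # 1,000,000,000 measurements

# ASSUMPTION (not from the article): each measurement is stored as one
# 8-byte floating-point value, with no metadata or index overhead.
ASSUMED_BYTES_PER_VALUE = 8
approx_gigabytes = total_measurements * ASSUMED_BYTES_PER_VALUE / 1e9
print(f"roughly {approx_gigabytes:.0f} GB of raw values under that assumption")
```

Even under that bare-minimum storage assumption, a naive scan touches a billion values per query.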
At an operational level this could make for unreasonably long database searches. The Sybase IQ database, which has an efficient way of indexing and retrieving data, is being adapted for querying the omnibus database to get around the problem.
Once it is in place, which is expected within the next few months, multi-field querying will be possible. For now, retrieval is by accession number or experimenter name only. An Entrez-like interface will also be added, as will an FTP batch submission feature. At present, data must be submitted one entry at a time through the omnibus website at NCBI.
A more fundamental design problem stems from the chronic lack of standards in bioinformatics, a result partly of the field’s spontaneous, fast-changing nature, and partly of the fact that its practitioners are by nature innovators who are disinclined to abide by formal standards. It doesn’t help that the hoped-for discoveries have not yet been described, which makes the design of a database to hold them something of a Catch-22.
Accordingly, the designers of the omnibus have, for now at least, opted for a simple design based on the tried and true GenBank format. A few data fields such as accession number are required, but allowance is made for investigators to make up their own fields according to the nature of their experiments or data.
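A design of this kind, with a small required core plus open-ended investigator fields, might be sketched as follows. This is only an illustration: the class, the `parse_record` helper, the `submitter` field, and the sample values are hypothetical, and the article names accession number as the only example of a required field.

```python
from dataclasses import dataclass, field
from typing import Dict

# HYPOTHETICAL required set; the article confirms only "accession number".
REQUIRED_FIELDS = ("accession", "submitter")


@dataclass
class ExpressionRecord:
    accession: str
    submitter: str
    # Anything else the investigator chooses to supply is kept as-is.
    custom_fields: Dict[str, str] = field(default_factory=dict)


def parse_record(raw: Dict[str, str]) -> ExpressionRecord:
    """Validate the required core and pass all other fields through."""
    missing = [name for name in REQUIRED_FIELDS if not raw.get(name)]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    custom = {k: v for k, v in raw.items() if k not in REQUIRED_FIELDS}
    return ExpressionRecord(raw["accession"], raw["submitter"], custom)


# Made-up submission with two investigator-defined fields.
record = parse_record({
    "accession": "EXAMPLE0001",
    "submitter": "A. Researcher",
    "tissue": "kidney",
    "dye": "Cy3",
})
print(record.custom_fields)
```

The trade-off is the one the article describes: the schema stays simple and tolerant of new experiment types, but free-form fields cannot be relied on for consistent querying across submitters.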
To keep things manageable from a storage and integration standpoint, the database does not include images. Submitters can attach a reference image if they choose, but fields in the database won’t be keyed to image grid locations. Here again is the standards beast rearing its head.
“There still isn’t a consensus on the best methods for image analysis or for what quality control metrics should be gathered,” said Lash.
Of course, the lack of error-handling standards is an argument that images are absolutely necessary, he conceded. But at this early stage in the game, he thinks post-image-analysis data is all most expressionists care about. “Some investigators will want to study the images and get very detailed in their analyses, but that’s not most users,” he said.
With the new data poised to start flooding in, though, the directors of the omnibus are anxiously looking for ways to accommodate future growth both in volume and diversity of data. Object-oriented approaches based on CORBA, the OMG’s middleware solution for linking together disparate data and processes in a platform-independent way, were considered but rejected as being too cumbersome.
Another organizational scheme that’s being given more serious thought is XML. While XML allows representation of elements in a nested hierarchical structure, a drawback is that its profuse tagging causes “data bloating.” For example, in representing a table, each cell has to be wrapped with tags to identify it, so instead of a single column header plus one tab delimiter per cell, as in GenBank’s ASN.1 format, an XML-formatted table might have up to 20 characters of identifier information per cell.
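The bloat is easy to measure on a toy table. The sketch below writes the same made-up rows once as plain tab-delimited text and once as XML with every cell wrapped in its own element, then compares the non-data characters spent per cell. The column names and values are invented for illustration, and neither layout is meant to reproduce an actual NCBI format.

```python
import xml.etree.ElementTree as ET

# Invented sample data: three columns, two rows.
header = ["spot", "gene", "ratio"]
rows = [["1", "geneA", "1.8"], ["2", "geneB", "0.4"]]

# Tab-delimited: the header appears once, then one delimiter per cell.
tsv_text = "\n".join("\t".join(line) for line in [header] + rows)

# XML: each cell is wrapped in an opening and closing tag named after its column.
table = ET.Element("table")
for row in rows:
    row_el = ET.SubElement(table, "row")
    for name, value in zip(header, row):
        ET.SubElement(row_el, name).text = value
xml_text = ET.tostring(table, encoding="unicode")

# Overhead = every character that is not part of a cell value.
payload_chars = sum(len(value) for row in rows for value in row)
cell_count = len(rows) * len(header)
tsv_overhead = (len(tsv_text) - payload_chars) / cell_count
xml_overhead = (len(xml_text) - payload_chars) / cell_count

print(f"tab-delimited overhead per cell: {tsv_overhead:.1f} characters")
print(f"XML overhead per cell: {xml_overhead:.1f} characters")
```

The gap widens with longer tag names and shrinks with longer cell values, so the per-cell cost in a real schema would depend on how elements are named.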
Cognizant of the pitfalls on the way to developing large, versatile databases, such as the one the omnibus is expected to become, NCBI is participating in expression standards-setting efforts through the European Bioinformatics Institute-sponsored annual microarray gene expression data meetings, the third of which will be held at Stanford University next year.
At this point expression analysis remains an esoteric discipline. It’s still relatively difficult to get an expression lab up and going, and it takes a rarefied set of skills to perform the experiments. Accordingly, expression analysis is still a cottage industry, much as sequencing was 15 or so years ago.
But because of the secrets contained in this data, the life sciences community is very interested in it. NIH is funding a number of efforts to advance the state of the art, with extramural grants from NIH’s neurological, kidney, aging, and genome institutes. Other funding is flowing from the NIH-associated National Center for Research Resources, as well as from private charities such as the Howard Hughes Medical Institute, and industry sources like Affymetrix.
The Association of Biomolecular Resource Facilities, an association of research labs based in Santa Fe, NM, has started keeping track of the technology. Chandi Griffin, of the University of California, San Francisco, is co-chair of ABRF’s Microarray Research Group, which started about a year and a half ago.
Pointing to a recently completed survey of 47 expression labs worldwide, Griffin said 73 percent were located in North America and only 19 percent in Europe. Seventy-six percent were academic and 15 percent pharma or biotech. Most labs have been started within the last two years, and have small and relatively inexperienced staffs.
Griffin added that a harbinger of surging interest in expression analysis is ABRF’s Listserv, which has a high level of participation and annual growth in the 25-30 percent range.