There are two schools of thought when it comes to bioinformatics standards -- that there aren't enough of them, and that there are too many of them. Geospiza is hoping to keep both camps happy with a new project that will adapt a standard commonly used in other scientific computing disciplines for bioinformatics applications.
The endeavor is likely to face an uphill climb because the project is "speculative and high risk," the format is "not [yet] perfect for bioinformatics," and the bioinformatics community is customarily loath to embrace new standards efforts. Nevertheless, Geospiza expects to have a prototype application ready by April and is confident that the format offers a number of benefits that will win over any skeptics.
Last month, the National Human Genome Research Institute awarded the company a six-month $150,000 Phase I Small Business Technology Transfer grant to test the feasibility of the Hierarchical Data Format (HDF) for use in bioinformatics [BioInform 09-12-05].
HDF was developed at the National Center for Supercomputing Applications at the University of Illinois for analyzing, visualizing, and converting large amounts of scientific data, and is widely used for applications such as astrophysics, remote sensing, and atmospheric modeling. The current version of the format, HDF5, is supported by a non-profit NCSA spin-off called THG (The HDF Group).
The hierarchical structure of HDF enables data from multiple experiments to be stored in the same file and organized in a directory-like structure. According to the HDF5 website (http://hdf.ncsa.uiuc.edu/index.html), the format stores two types of objects: datasets, which are multidimensional arrays of data elements; and groups, which are structures for organizing objects in a file.
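As a rough illustration of the two object types, the sketch below uses the third-party h5py and NumPy packages (not mentioned in this article) to build a small in-memory HDF5 file. The group names, the dataset name, and the quality values are hypothetical placeholders, and nothing here reflects how Geospiza or THG plan to lay out bioinformatics data.

```python
import h5py
import numpy as np


def build_demo_file():
    """Create an in-memory HDF5 file holding two hypothetical experiments."""
    # driver="core" with backing_store=False keeps the file in memory only.
    f = h5py.File("demo.h5", "w", driver="core", backing_store=False)

    # Groups play the role of directories: here, one per experiment.
    run1 = f.create_group("experiment_1")
    run2 = f.create_group("experiment_2")

    # Datasets are typed, multidimensional arrays stored inside groups.
    run1.create_dataset(
        "quality_values",
        data=np.array([[30, 32, 28], [40, 41, 39]], dtype=np.uint8),
    )
    run2.create_dataset(
        "quality_values",
        data=np.array([[20, 22, 25]], dtype=np.uint8),
    )
    return f


f = build_demo_file()
names = sorted(f.keys())                             # top-level groups
shape = f["experiment_1/quality_values"].shape       # path-style lookup
second_read = f["experiment_1/quality_values"][1, :].tolist()  # partial read
f.close()
```

The path-style lookup is what gives the file its directory-like feel: data from both experiments live in one file, and a slice of one dataset can be read without touching the other.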
"Using these two basic objects, one can create and store almost any kind of scientific data structure, such as images, arrays of vectors, and structured and unstructured grids. You can also mix and match them in HDF5 files according to your needs," according to NCSA.
Todd Smith, chairman and CEO of Geospiza, described HDF5 as a "file container" that "has real potential to meet some of the future scale needs that we're going to have in bioinformatics." Smith noted that as next-generation sequencing technologies enter the market over the next 10 to 20 years, "we need to assemble the data, we need to analyze through the assemblies for variation patterns in the data, we need to associate this information with quality values to get real statistical models of the variation to separate the true biological signal that's in the signal from the noise of the experiment."
In order to do that, Smith said, "We need applications that can communicate with scalable data-storage and file-management systems. So HDF might be good for this, and that's what we're going to test."
'Speculative and High Risk'
Geospiza plans to work closely with THG to adapt the format to handle bioinformatics data, starting with large volumes of resequencing data. Smith said that a prototype application demonstrating the feasibility of the format should be ready by April, at which time the company will apply for a Phase II STTR grant to continue the project.
Smith acknowledged that the project is "speculative and high risk," and that HDF5 is currently "not perfect for bioinformatics." However, he said that the company has two things in its favor: the format has been proven to work for large data sets in other scientific disciplines, and Geospiza is collaborating with THG rather than developing the format on its own.
"A lot of people, when they search for technology, they don't go and engage with the original groups to make the technology better," Smith said. "What we've done here is we have the group that has developed HDF doing the HDF development, we have a group that's very good in bioinformatics guiding the process."
Peter Good, an NHGRI program manager, called the project "a good collaboration where someone reached out to a group doing this, knowing that they would have to apply this to biomedical research, and forged this collaboration." He added, "It is an ideal example of how to go about doing it, I think."
Likewise, Rob Henson, manager of the bioinformatics development group at the MathWorks, whose Matlab scientific software supports HDF5, said the project is "a good idea," and that he's "pleased that the biology world is not trying to reinvent the wheel, as they've done multiple times in the past."
Henson said that the format is well-established in other industries with which the MathWorks deals, such as aerospace, telemetry, and geospatial mapping, and that it ought to be effective in bioinformatics, where "one of the big challenges we face as a developer of bioinformatics software is dealing with the fact that there aren't standard data formats."
The ability of HDF to "tie multiple formats together" could help address that problem, he said. While not aware of anyone currently using HDF5 with bioinformatics data in Matlab, Henson said, "we'd be very happy if it worked because it would plug right into our existing tools."
However, Henson noted that many people in the bioinformatics community are unfamiliar with the format, which may pose a barrier to adoption. "Someone will have to market it," he said. He added that HDF5 is a "complex format" that is "kind of scary if you get in at the low level," and there is a risk that people "won't buy into it because they have to convert the data. There's an awful lot of data that's not in this format, and somebody will have to take on the task of converting the data."
If that does happen, "and Geospiza does a good job of hiding the complexity of HDF5, then I think it will have a good chance of succeeding," Henson said. "If it doesn't, there's a risk that it just becomes an interesting academic exercise."
The bioinformatics community is customarily loath to embrace new standards efforts, and a number of promising formats never spread much farther than the groups that developed them. Smith acknowledged that adoption will be an uphill battle, and said that the company intends to address that challenge through some initial tools that will demonstrate the benefits of the format.
"First, you want to engage the community around interfaces to the technology, so we'll have to develop some APIs so that you can very easily move your data, say in an XML format, in and out of the file," Smith said. "The other thing we have to do is develop some successful applications to really show that, with this technology, you're better off with the overhead of working with the binary file format than you are with just writing your data as a stream to an ASCII text file."
Smith said that a binary format is better than ASCII for applications that require random read-write access. "It's very hard in XML to randomly access the data and then randomly write to the file, so a problem occurs if you have a program that's dealing with a lot of data and you need to make some changes to that data," he said. Most programs handle this problem by reading all the data into memory, he said, "so if you do a large project, you need a computer with a lot of memory, and that's just to handle the GUI tool."
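The random-access point can be sketched with a plain fixed-width binary file, used here only as a simple stand-in for a binary container such as HDF5. The record layout (a 32-bit position plus an 8-bit quality value), the file name, and the function names are all hypothetical. Because every record has the same size, one record can be read or overwritten in place by seeking to its offset, without loading the rest of the file into memory.

```python
import os
import struct
import tempfile

# Hypothetical record: little-endian uint32 position + uint8 quality value.
RECORD = struct.Struct("<IB")


def write_records(path, records):
    """Write (position, quality) pairs as fixed-width binary records."""
    with open(path, "wb") as fh:
        for position, quality in records:
            fh.write(RECORD.pack(position, quality))


def read_record(path, index):
    """Read one record by seeking directly to its byte offset."""
    with open(path, "rb") as fh:
        fh.seek(index * RECORD.size)
        return RECORD.unpack(fh.read(RECORD.size))


def update_quality(path, index, quality):
    """Overwrite one record's quality value in place."""
    with open(path, "r+b") as fh:
        fh.seek(index * RECORD.size)
        position, _ = RECORD.unpack(fh.read(RECORD.size))
        fh.seek(index * RECORD.size)
        fh.write(RECORD.pack(position, quality))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "calls.bin")
    write_records(path, [(100, 30), (250, 12), (900, 41)])
    update_quality(path, 1, 35)   # touch only the middle record
    middle = read_record(path, 1)
    last = read_record(path, 2)
    size = os.path.getsize(path)
```

Making the same change in an XML or other variable-width text file would generally mean rewriting everything after the edited value, which is why many tools fall back on holding the whole data set in memory.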
Geospiza intends to release a "light prototype" by April that offers the same functions as the PolyPhred SNP-analysis package. The prototype will likely include a "web-based visualization tool showing we can read and write from the file" and the ability to "handle multiple gigabytes of data fairly easily," Smith said.
Ultimately, Smith said, Geospiza sees a business opportunity in adapting HDF5 to bioinformatics because it will enable the firm to build "next generation applications." In addition to the expected performance advantages of the format, "there are engineering aspects of developing a scalable software application that come into play as well," he said.
-- Bernadette Toner