According to its developers, a new microbial sequence database from the US Department of Energy’s Joint Genome Institute is a relative rarity in the world of publicly available bioinformatics: a commercial-quality system built with industry-standard procedures.
Victor Markowitz, the head of Lawrence Berkeley National Laboratory’s Biological Data Management and Technology Center (BDMTC), led development of the database, called IMG (Integrated Microbial Genomes). Markowitz left his position as CIO at Gene Logic just over a year ago to set up the BDMTC [BioInform 2-2-04], and told BioInform that IMG is the first project to embody the principal goal he laid out for the center: to bring industrial-quality informatics development processes to public resources.
IMG “has provided some excellent experience in understanding how processes used in industry can be adapted to an academic setting,” he said.
In the public sector, Markowitz said, “there’s a lack of industry processes in developing systems, and that’s simply because of the mindset.” Academic and government developers tend to be more interested in publishing the results of their research than in building resources that will stand the test of time, he said, leading to “examples and prototypes” of many bioinformatics tools, but very few reusable systems.
IMG, Markowitz said, was built from the ground up to comply with commercial standards, including extensive documentation — a feature sorely lacking in many public bioinformatics projects, he said.
Public bioinformatics resources rely on federal funding to survive. If tools aren’t built to industry standards, Markowitz said, they quickly become obsolete once their funding is discontinued. But even if funding for IMG were to dry up, he said, “We designed this with the goal that someone could pick up our system in two years and work with it.”
Markowitz, who leads a team of four people at the BDMTC, also relied on the industry expertise of JGI colleague Nikos Kyrpides, formerly director of bioinformatics at Integrated Genomics. Kyrpides helped create Integrated Genomics’ ERGO microbial genomics database, and joined JGI last year to lead its Microbial Genome Analysis Program (MGAP) and guide development of IMG.
While IMG was developed to serve the entire microbial genomics community, the initial motivation for its development came from the MGAP group, which relies on comparative genomics to annotate new microbial genomes. Kyrpides said that the mantra at Integrated Genomics was that “it’s easier to annotate a thousand genomes than it is to annotate one,” and that the MGAP team is following a similar path.
The problem, however, was that the MGAP team didn’t have an ERGO of its own, so it built one.
Launched last week at http://img.jgi.doe.gov/, IMG 1.0 contains 296 microbial genomes: 263 bacterial, 24 archaeal, and 9 eukaryotic. Of these genomes, 224 are finished and 72 are in draft form, and 102 were sequenced at JGI. The database draws on a number of publicly available resources, including EBI Genome Reviews, RefSeq, UniProt/SwissProt, UniProt/TrEMBL, InterPro, GO, Pfam, COG, KEGG, and ChEBI.
The core of the system is a data warehouse implemented with Oracle 9i. A Perl-based ETL (extract, transform, load) toolkit integrates and loads data from the external resources into the IMG warehouse (see figure, this page, for the IMG architecture).
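To make the extract, transform, load pattern concrete, here is a minimal sketch of the three steps. It is not IMG's code: the real toolkit is written in Perl and loads into Oracle 9i, while this illustration uses Python with an in-memory SQLite database as a stand-in warehouse. The sample export, the `gene` table, the function names, and the "hypothetical protein" placeholder are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited export from an external resource (not real IMG data).
SAMPLE_EXPORT = (
    "gene_id\tgenome\tproduct\n"
    " g1 \tGenome A\t DNA polymerase \n"
    "g2\tGenome A\t\n"
)


def extract(text):
    """Extract: parse tab-delimited rows into dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def transform(rows):
    """Transform: trim whitespace and fill in missing product names."""
    cleaned = []
    for row in rows:
        # Placeholder text for an empty product field is an assumption of this sketch.
        product = (row["product"] or "").strip() or "hypothetical protein"
        cleaned.append((row["gene_id"].strip(), row["genome"].strip(), product))
    return cleaned


def load(conn, records):
    """Load: write the cleaned records into the stand-in warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gene ("
        "gene_id TEXT PRIMARY KEY, genome TEXT, product TEXT)"
    )
    # INSERT OR REPLACE lets a later refresh overwrite an earlier record.
    conn.executemany("INSERT OR REPLACE INTO gene VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(text, conn):
    """Run the three steps in order and return the number of records loaded."""
    records = transform(extract(text))
    load(conn, records)
    return len(records)


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")
    print(run_etl(SAMPLE_EXPORT, warehouse), "records loaded")
```

Keeping the steps as separate functions means each external resource would need only its own extract and transform logic while the load step is shared, which is one common way to organize a toolkit of this kind.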
JGI plans to update the database quarterly. An additional 200 microbial genomes are expected to be added to the resource this year.
According to Markowitz and Kyrpides, IMG departs from other public microbial genome resources, such as the Comprehensive Microbial Resource from the Institute for Genomic Research, in the level of integration that it offers. Kyrpides said that CMR and other databases offer sequence data for several hundred microbes, just as IMG does, but that those genomes are only available “in isolation,” rather than in a comparative format that supports annotation.
Markowitz said that there are two key elements in integrating the genomes in IMG. One is a “comparative graphical interface” that enables users to look at many genomes at once. The other is the fact that “everything is done in the context of phylogeny.”
In addition, Markowitz said, IMG was designed to have a more intuitive user interface than other resources, and it can also be integrated with in-house and third-party informatics resources.
Kyrpides said that the JGI team is working to improve the quality of its annotations relative to other databases. BDMTC developed a set of annotation and curation software tools for the MGAP group that Kyrpides claims allow for more accurate gene models. He said that the MGAP team plans to go back and re-annotate all the publicly available genomes from other resources using the comparative context of IMG. “It’s a huge task,” but a necessary one, he said, because the team has unearthed many inaccuracies in other annotations.
These curation tools will eventually be available to the biological community through IMG, but they are not yet user-friendly enough for broader use, Markowitz said, adding that they should be available to outside users by the end of the year.