BALTIMORE — Infectious disease research is an important driver for bioinformatics tool development, according to speakers at the ninth annual Computational Genomics conference.
This year’s conference, co-hosted by the Virginia Bioinformatics Institute, the Jackson Laboratory, and the Institute for Genomic Research, was the first time the annual meeting focused on computational methods for studying pathogens and infectious disease.
Last year, organizers decided to narrow the focus of future meetings in response to steady growth in genomics-based infectious disease research, as well as to help differentiate the event from other bioinformatics conferences [BioInform 11-21-05].
The narrower focus appears to have been a good idea, judging from the number of talks describing bioinformatics methods for studying pathogens and infectious disease, although many of the bioinformatics challenges discussed would be familiar to researchers in other fields.
Data integration, for example, is as much of an obstacle for pathogen research as it is for any other biological discipline. Lynn Schriml of TIGR described a system called Gemina (Genomic Metadata for Infectious Agents) that addresses this issue by integrating pathogen and epidemiological information with genomic sequence data.
Schriml said that the foundation for the system is the Microbial Rosetta Stone database developed by Ibis Biosciences, which links microorganism names, taxonomic classifications, diseases, and scientific literature for microbial pathogens to public genomic sequence databases. All the data in Gemina is linked via taxonomy identifiers from the National Center for Biotechnology Information.
Noting that it is a “challenge to standardize epidemiology data from many resources,” Schriml said that Gemina uses a combination of six ontologies and controlled vocabularies to help correlate terms from different resources. The data is organized around so-called “infection systems,” which describe relationships between pathogens and hosts for specific diseases. Users can query Gemina to explore associations between pathogens, hosts, diseases, symptoms, body tissues, transmission modes, and epidemiological information such as the gender or age of the host or the date or location of an outbreak.
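As a rough illustration of how data organized around “infection systems” might be queried, the following Python sketch models each system as a record keyed by an NCBI taxonomy identifier. The class, field names, `query` helper, and sample records are all hypothetical and are not drawn from Gemina, which was not publicly available at the time of the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfectionSystem:
    """One pathogen-host-disease relationship (hypothetical schema)."""
    taxon_id: int        # NCBI taxonomy identifier of the pathogen
    pathogen: str
    host: str
    disease: str
    transmission: str


def query(systems, **criteria):
    """Return the systems whose fields equal every supplied criterion."""
    return [
        s for s in systems
        if all(getattr(s, field) == value for field, value in criteria.items())
    ]


# Illustrative records only; they are not taken from the Gemina database.
SYSTEMS = [
    InfectionSystem(1392, "Bacillus anthracis", "Homo sapiens", "anthrax", "inhalation"),
    InfectionSystem(1392, "Bacillus anthracis", "Homo sapiens", "anthrax", "cutaneous contact"),
    InfectionSystem(632, "Yersinia pestis", "Homo sapiens", "plague", "flea bite"),
]

anthrax_systems = query(SYSTEMS, disease="anthrax")
inhaled = query(SYSTEMS, host="Homo sapiens", transmission="inhalation")
```

A production system would match terms through ontologies and controlled vocabularies rather than exact string equality, which is the standardization problem Schriml described.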
Schriml said that the TIGR team has fully curated 38 genomes for pathogens that the National Institute of Allergy and Infectious Diseases has classified as category A, B, and C priorities. These genomes comprise around 2,500 infection systems, she said, and represent a “huge curatorial effort.”
Gemina is not yet available online, but should be live in two to three months, Schriml said.
In a related effort, researchers at the Center for Bioinformatics and Computational Biology at the University of Maryland are creating a web-based system to help develop DNA and protein signature-based assays that use TaqMan PCR to detect pathogens. The system, called Insignia, is based on a set of pre-computed pair-wise alignments for more than 3,600 microbial genomes and enables researchers to identify unique signatures for a given organism “on the fly,” according to CBCB’s Adam Phillippy.
Phillippy noted that the concept for Insignia is similar to that of K-Path, a software package developed by Lawrence Livermore National Laboratory to identify pathogen signatures as part of the United States’ biodefense infrastructure [BioInform 10-13-03]. However, he said that the K-Path signatures have not been publicly released, “so we wanted to develop something that was publicly available.”
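The general idea of a unique signature, a stretch of sequence found in a target organism and in none of the others, can be sketched in a few lines of Python. This sketch deliberately swaps in a simple k-mer set difference for the pre-computed pair-wise alignments that Insignia relies on, so it should not be read as Insignia’s or K-Path’s method. The function names and toy sequences are hypothetical, and reverse-complement matches are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_signatures(target, background, k):
    """Return the k-mers present in target and absent from every background sequence."""
    candidates = kmers(target, k)
    for seq in background:
        candidates -= kmers(seq, k)
    return sorted(candidates)


# Toy sequences for illustration; real genomes are millions of bases long.
target_genome = "ACGTACGGA"
other_genomes = ["ACGTACG", "TTACGTA"]

signatures = unique_signatures(target_genome, other_genomes, k=4)
```

Recomputing k-mer sets for thousands of genomes on each request would be slow, which is presumably why a pre-computed comparison is needed to return signatures “on the fly.”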
NCBI’s Leonid Zaslavsky discussed an emerging challenge related to infectious disease research: extremely large datasets. In particular, Zaslavsky discussed how NCBI is managing data from NIAID’s Influenza Genome Sequencing Project, which has so far sequenced the genomes of more than 1,644 human and avian flu isolates.
Zaslavsky said that the large number of closely related genomes raises a number of data representation issues. NCBI has developed a “multi-scale” approach that enables researchers to analyze the dataset at different levels of resolution. The method will be implemented in the next release of NCBI’s Influenza Virus Resource.
Annotation ‘Not a Solved Problem’
Several talks addressed the challenges of annotating microbial genomes. For example, assigning functions to genes based on homology — a common practice in annotating eukaryotic genomes — is problematic in prokaryotes, according to TIGR’s Jeremy Selengut, because there is more genome rearrangement across species and strains. “Microbes are tricky little buggers,” he said.
Selengut described a “context-based” annotation approach based on the Genome Properties system developed as part of TIGR’s Comprehensive Microbial Resource. The Genome Properties system is a set of prokaryotic attributes whose status can be described by numerical values or controlled vocabulary terms.
Selengut said that his team is developing an algorithm to help automate the assignment of Gene Ontology process terms using the Genome Properties system, but stressed that the project is still in its early stages. “It’s difficult to develop an algorithm to churn up and spit out this data,” he said. “We need to come up with a better set of rules.”
Qiandong Zeng of the Broad Institute noted that gene prediction in prokaryotes is “not a solved problem.” Zeng, who discussed a project to sequence and annotate multiple strains of Mycobacterium tuberculosis, said that ab initio prediction tools generate too many false positives, while annotation based on comparative genomics or mass spectrometry data suffers from a high false-negative rate and incomplete data.
Zeng described an approach the Broad is taking called “synteny-based mapping,” in which annotation from a closely related bacterial genome is transferred to a newly sequenced genome. The method “takes advantage of the genomic context of open reading frames,” Zeng said, and is tolerant of sequencing errors and gaps. It also helps identify SNPs and large-scale structural differences, he said.
One drawback, however, is that the method doesn’t identify novel genes in the newly sequenced genome. In addition, it only works for closely related sequences.
Nevertheless, Zeng presented data showing that the approach successfully identified a SNP in M. tuberculosis F11 that is linked to the strain’s resistance to isoniazid, a first-line treatment for TB.
These results indicate that the method could help combat TB infection and drug resistance, he said.
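The annotation transfer step in synteny-based mapping can be illustrated with a minimal Python sketch. It assumes that collinear, ungapped alignment blocks between the reference and the newly sequenced genome are already available, and it shifts each reference gene into the new genome’s coordinates. The `Block` and `Gene` types, the `transfer` function, and all coordinates are hypothetical; this is not the Broad’s implementation, which also tolerates sequencing errors and gaps.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """An ungapped region shared by both genomes (end is exclusive)."""
    ref_start: int
    ref_end: int
    new_start: int


class Gene(NamedTuple):
    """An annotated gene on the reference genome (end is exclusive)."""
    name: str
    start: int
    end: int


def transfer(genes, blocks):
    """Map each gene that lies wholly inside one block onto the new genome."""
    mapped = {}
    for gene in genes:
        for block in blocks:
            if block.ref_start <= gene.start and gene.end <= block.ref_end:
                offset = block.new_start - block.ref_start
                mapped[gene.name] = (gene.start + offset, gene.end + offset)
                break
    return mapped


# Toy coordinates for illustration only.
blocks = [Block(0, 1000, 50), Block(1200, 2000, 1100)]
genes = [Gene("geneA", 100, 400), Gene("geneB", 900, 1300), Gene("geneC", 1500, 1800)]

annotations = transfer(genes, blocks)
```

In this sketch `geneB` spans a break between blocks and is left unmapped, and nothing is ever reported for sequence that exists only in the new genome, which mirrors the drawback that the method cannot identify novel genes.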