CAMBRIDGE, Mass. — This year's Computational Genomics conference highlighted an emerging trend in the bioinformatics community: Developers must continue to improve existing methods that are in common use, even as they scramble to devise new approaches to handle high-throughput data from emerging experimental platforms.
The eighth annual Computational Genomics conference, held here Nov. 9-12 and co-hosted by the Jackson Laboratory, The Institute for Genomic Research (TIGR), and the Center for Bioinformatics and Computational Biology at the University of Maryland, was the last of the series to address broad issues in the field of bioinformatics. Conference organizers said that next year's meeting, to be held in the Washington, DC, region, will focus on computational genomics for pathogens and infectious diseases.
Carol Bult, staff scientist at the Jackson Lab and a co-organizer of the conference, told BioInform that the organizers decided to narrow the focus of next year's meeting in order to distinguish it from other bioinformatics conferences. In addition, she said, pathogen informatics is an area of increasing interest for researchers and funding agencies. She cited the National Institute of Allergy and Infectious Diseases' network of eight national bioinformatics resource centers as one example [BioInform 10-24-05].
With TIGR as a co-organizer, this year's meeting already offered a healthy serving of microbial informatics, but talks ran the gamut from Bacillus anthracis and Escherichia coli to mice, humans, dogs, and even the emu. More importantly, speakers focused on the challenges of analyzing data from familiar experimental platforms, as well as from new systems that are just coming online.
Something Old …
Richard Roberts, chief scientific officer of New England Biolabs and winner of the Nobel Prize in Physiology or Medicine in 1993 for his work on gene splicing, kicked off the meeting with a discussion of the informatics challenges in identifying restriction enzymes.
While maintaining that "the best way to find genes is genomics — not genetics," Roberts noted that there are cases in which current sequencing methods fall short. For example, he noted, a gene of interest to a researcher may be lethal in E. coli, which would make it impossible to sequence. Following a hunch that this might result in "holes" in the regions of the raw sequence that corresponded to those lethal genes, Roberts said he asked TIGR to retrieve the original sequence traces for Haemophilus influenzae "from the basement," and sure enough, the gaps in the sequence revealed a novel restriction enzyme.
Roberts said he is confident that this method would be useful for other genes that are lethal in E. coli. He added that the approach is "a new use for shotgun sequence data that people thought was junk," and highlights the importance of enabling access to raw data so that researchers can find "real biological treasures."
Mark Gerstein of Yale University addressed another persistent problem in bioinformatics: pseudogenes, which are non-functional elements that bear just enough similarity to protein-coding genes to befuddle most gene-prediction algorithms. Gerstein described his lab's "pseudopipe" predictive pipeline for identifying pseudogenes, which identified around 20,000 pseudogenes in the human genome. Only 40 percent of these have a homolog in mouse, Gerstein said, indicating that the majority of human pseudogenes arose after the mouse/human divergence.
Gerstein said that his lab is currently using tiling arrays to study transcription in intergenic regions of the human genome to determine whether these predicted pseudogenes are indeed non-functional. While the results are still preliminary, he said that on chromosome 22, around 45 out of 525 pseudogenes are transcribed, and that on average, around 10 percent of pseudogenes in the human genome appear to have "some evidence of transcription."
Rebecca Jornsten of Rutgers University discussed another challenge that bioinformatics developers have been wrestling with for some time: missing values in data from cDNA arrays. Jornsten estimated that missing values could account for as much as 10 percent of a single slide because of smears, high background noise, and other experimental circumstances, a portion that is too high to ignore.
The alternative to ignoring those values is imputing them, she said. While there are a number of methods available to do that — including row-mean imputation, k-nearest neighbors, and Bayesian techniques — Jornsten said that researchers have a difficult time determining which of these is the best for their data, and what effect the imputation will have on their final results.
Jornsten and her colleagues developed a method called LinCmb, which draws on a library of all the currently available methods that is "weighted" based on available data. In an assessment of LinCmb and its component methods, Jornsten said that row-mean imputation and k-nearest neighbor were found to generate an undesirably high false-positive rate in downstream analysis. The study also indicated that it is better not to impute at all when very little data is missing.
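For readers unfamiliar with the two simplest imputation methods named in the talk, the following is a minimal Python sketch of row-mean and k-nearest-neighbor imputation on a toy matrix. It is not the LinCmb code and not any published implementation. The function names, the use of None to mark a missing spot, the root-mean-square distance, and the toy values are all assumptions made for illustration.

```python
import math

def row_mean_impute(matrix):
    """Replace each missing value (None) with the mean of the observed values in its row."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled

def _distance(a, b):
    """Root-mean-square difference over the columns observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def knn_impute(matrix, k=2):
    """Fill each missing value with the average of that column in the k most similar rows."""
    filled = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors are the other rows that have column j observed.
            candidates = [
                (_distance(row, other), other[j])
                for m, other in enumerate(matrix)
                if m != i and other[j] is not None
            ]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Hypothetical toy data: rows are genes, columns are arrays, None is a missing spot.
expression = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [5.0, 5.0, 5.0],
    [1.0, 1.9, 2.8],
]
row_mean_filled = row_mean_impute(expression)
knn_filled = knn_impute(expression, k=2)
```

In this toy case the row mean ignores the rising trend that the two most similar genes share, while the nearest-neighbor estimate borrows it. That is one way the choice of method can change the values passed to downstream analysis.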
In a similar vein, Safquat Rahman of Cornell University addressed another aspect of bioinformatics analysis that is often taken for granted. Biologists assume that genes with similar expression profiles are functionally related, he said, but how can you be sure that those expression profiles are actually similar? Most researchers currently use methods such as Euclidean distance and the Pearson correlation to assess similarity, while other methods such as Z-score are also used. The problem, Rahman said, is that none of these methods takes the background distribution of the expression data into account, so he and his colleagues developed a new method, called the mass-distance measure, which "adjusts to the distribution of the expression values."
In an evaluation of all the available methods for assessing expression similarity, the mass-distance measure outperformed all other methods in three out of four publicly available expression data sets, Rahman said.
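As a point of reference, the two conventional measures Rahman mentioned can be sketched in a few lines of Python. This sketch does not implement the mass-distance measure, whose details the talk summary does not give. The gene profiles are hypothetical values chosen to show that Euclidean distance and Pearson correlation can rank the same pairs differently.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two expression profiles (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_correlation(a, b):
    """Pearson correlation between two profiles (closer to 1 means more similar in shape)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical profiles across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [11.0, 12.0, 13.0, 14.0]  # same shape as gene_a, shifted upward
gene_c = [1.0, 2.0, 3.0, 3.0]      # close to gene_a in magnitude, different shape at the end

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(name, euclidean_distance(gene_a, other), pearson_correlation(gene_a, other))
```

By Euclidean distance gene_c is the closer match to gene_a, while by Pearson correlation gene_b is. Neither calculation refers to how expression values are distributed across the data set as a whole, which is the gap Rahman said the mass-distance measure is meant to address.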
… Something New
Other talks focused on analytical methods under development to handle data from new experimental platforms. Thomas LaFramboise of the Broad Institute, for example, discussed how his lab is "abusing" Affymetrix's 100K SNP mapping arrays for applications beyond the team's "original intent" in genotyping normal samples. LaFramboise said that researchers at the Broad are using the chips to analyze loss of heterozygosity, copy number gains and losses, and allele frequency — all of which place a burden on the bioinformatics team to develop effective analysis methods.
As an example, LaFramboise discussed a method his team developed to extract the allele-specific copy number at each SNP site. A statistical model was trained on normal samples and then applied to cancer samples, he said, to accurately identify the allele-specific copy number. One benefit of the approach, he added, is that it can be used to determine the haplotypes for amplified regions. "If you see the same haplotype amplified [in multiple samples], it may be tied to predisposition for tumor growth," he said.
LaFramboise noted that so far this "appears to be the case, but the results are very preliminary."
The Broad's Vamsi Mootha, meanwhile, discussed the somewhat-unexplored territory of the genomics behind mitochondrial function. Mootha said that even though mitochondrial dysfunction "may contribute to all degenerative diseases," the protein components of the organelle are still not well understood.
Web-Based Tools Discussed at Computational Genomics
AMOScmp, from the Center for Bioinformatics and Computational Biology at the University of Maryland, can assemble a set of shotgun reads from an organism by mapping them to the finished sequence of a related organism: http://amos.sourceforge.net/docs/pipeline/AMOScmp.html.
gMap, from the National Center for Biotechnology Information, uses synteny levels to align multiple genomic sequences for bacteria and archaea, which are color-coded and presented in a single view: http://www.ncbi.nlm.nih.gov/sutils/gmap.cgi.
GOMER (generalizable occupancy model for expression regulation), from Johns Hopkins University, is a software package that predicts transcriptional regulation by modeling the binding of transcription factors to genome sequences: http://biophysics.med.jhmi.edu/clarke/granek/GOMER/.
Mauve, from the University of Wisconsin, Madison, is a multiple alignment tool that accounts for genome rearrangements and inversions: http://gel.ahabs.wisc.edu/mauve/.
The mass-distance measurement tool for assessing the similarity of expression profiles, from Cornell University: http://biozon.org/tools/expression/.
The Mouse Phenome Database, from the Jackson Lab, is a collection of phenotypic and genotypic data for the laboratory mouse: http://jax.org/phenome.
The Mouse SNP Query Form, from the Jackson Lab, is a new interface for exploring SNP data for mouse: http://www.informatics.jax.org/javawi2/servlet/WIFetch?page=snpQF.
The Pseudogene database, from the Gerstein lab at Yale University, includes predicted and validated pseudogenes for human, mouse, worm, fly, yeast, and prokaryotes: http://pseudogene.org.
XcisClique, from Virginia Tech, predicts cis-regulatory elements for Arabidopsis thaliana, and can be adapted to other organisms: http://bioinformatics.cs.vt.edu/XcisClique.
Mootha said his team has estimated that there are around 1,500 loci in the human genome that encode proteins that make up the mitochondria, but only half of these are described in public databases. In a collaboration with Matthias Mann of the University of Southern Denmark, his group tried to use tandem mass spectrometry to identify the remainder of the mitochondrial proteins, but this method could only discover around 150 because it could not detect low-expression proteins.
Mootha said that computational prediction methods, meanwhile, tended to give a "high proportion of false positives," so the team settled on an approach that combined eight separate data sets using a data-integration tool called Maestro. The method, which integrated information on targeting signals, mass spec data, co-expression information, protein domain data, cross-species homology, and cis-regulatory motifs, was shown to have better sensitivity and specificity than any of the individual data sets alone, Mootha said. When applied to 33,000 predicted transcripts, the approach identified 709 novel mitochondrial proteins, which are currently being validated experimentally, Mootha said.
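The talk summary does not say how Maestro combines its inputs. One common way to integrate independent lines of evidence is a naive-Bayes-style score that adds log-likelihood ratios, and the Python sketch below illustrates that general idea only. It assumes this scoring scheme for the sake of example, simplifies it by ignoring evidence that is absent, and uses hypothetical evidence names and likelihood ratios. None of the values come from Mootha's data.

```python
import math

# Hypothetical likelihood ratios:
# P(evidence | mitochondrial) / P(evidence | not mitochondrial).
# These numbers are invented for illustration and are not from the study.
LIKELIHOOD_RATIOS = {
    "targeting_signal": 8.0,
    "mass_spec": 20.0,
    "coexpression": 4.0,
    "protein_domain": 3.0,
    "homology": 6.0,
    "cis_motif": 2.0,
}

def integrated_score(evidence):
    """Sum the log2 likelihood ratios of the evidence types observed for one transcript."""
    return sum(math.log2(LIKELIHOOD_RATIOS[name]) for name in evidence)

# Two moderate lines of evidence together can outweigh a single strong one.
two_moderate = integrated_score(["targeting_signal", "homology"])
one_strong = integrated_score(["mass_spec"])
print(two_moderate, one_strong)
```

Under this kind of scheme, a transcript missed by mass spectrometry because of low expression can still score well if several weaker signals agree, which is consistent with the motivation Mootha described for combining data sets.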
Another unsolved problem in bioinformatics is the assembly of shotgun sequence data from environmental samples that include DNA from multiple organisms. Mihai Pop of the Center for Bioinformatics and Computational Biology at the University of Maryland described his group's experience assembling shotgun sequence data from the human gut.
Pop described a pilot project to determine the feasibility of such an effort, which sequenced the gastrointestinal bacteria from two human subjects. Pop said that assembly challenges included low depth of coverage (on average 2X to 3X), as well as the fact that not all the organisms in the samples were sequenced to the same level of coverage. Rearrangements in closely related organisms within the environmental sample also posed a problem, he said.
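To give a sense of why 2X to 3X coverage is limiting, the standard Poisson approximation for shotgun sequencing puts the expected fraction of unsampled bases at e^(-c) for average depth c. The Python sketch below applies that formula. It assumes reads fall uniformly at random on a single genome, which is exactly the assumption Pop said does not hold for mixed samples, so the figures are optimistic for low-abundance organisms. The 8X comparison depth is an illustrative choice, not a number from the talk.

```python
import math

def fraction_uncovered(depth):
    """Expected fraction of bases with no read, assuming Poisson-distributed coverage."""
    return math.exp(-depth)

# 2X and 3X are the depths cited in the talk; 8X is an illustrative comparison point.
uncovered = {depth: fraction_uncovered(depth) for depth in (2, 3, 8)}
for depth, fraction in uncovered.items():
    print(f"{depth}X coverage: about {fraction:.1%} of bases unsampled")
```

Under these idealized assumptions, roughly 13.5 percent of bases go unsampled at 2X and about 5 percent at 3X, so gaps are frequent even before uneven abundance among organisms is taken into account.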
Another challenge, he said, is that "existing assemblers were built for a single piece of DNA, and for uniform coverage." Pop said that his group used the Celera Assembler to combine all the shotgun data into a single assembly, which they then separated into two samples using SNP data. They also developed a software tool called AMOScmp to identify the genomic sequence for known organisms in the human gut.
One goal of the study, Pop said, was to determine whether shotgun sequencing identified anything that wouldn't have been picked up via 16S ribosomal DNA-based identification of bacteria. Despite the fact that the "assembly was really horrible" due to the low coverage, Pop said that the approach did yield some new insights. For one thing, he said, there was "an unexpected abundance of archaea" in the human gut. In addition, he said, one of the samples included a number of antibiotic-resistance genes, which may have been due to the individual's use of antibiotics, or could possibly be from the use of antibiotics in animal feed.
-- Bernadette Toner