HINXTON, UK — Despite the flood of biological data that has flowed into research pipelines over the past decade, the bioinformatics community is still hungry for more.
As a number of speakers at the Genome Informatics conference held here Sept. 22-26 noted, there’s no such thing as too much information in bioinformatics.
“Right now, the experiments are more limiting than the informatics,” said Matthew Scott of the Stanford University School of Medicine. Scott, who delivered the keynote address at the annual meeting hosted jointly by Cold Spring Harbor Laboratory and the Wellcome Trust, noted that future biological discovery will depend on new experimental methods that can provide novel insights into the complexities of biology.
The “great fear” following the sequencing of the human genome, Scott said, is that “from now on, it’s just a bunch of details.” But Scott said he’s convinced that there is plenty more to discover, as long as the tools are there to detect the patterns.
Luckily, that data is on its way: Many new organisms are being sequenced for the first time, and multiple species of model organisms aren’t far behind. For Drosophila alone, 12 species are scheduled to be sequenced over the next year. As several conference presentations demonstrated, this new sequence data has proven invaluable in identifying conserved genomic regions across multiple species — both protein-coding regions and, perhaps more interestingly, non-coding regions that play a vital role in regulation.
But it’s not just new sequence data that’s coming online. A number of talks at the meeting discussed new data from novel experimental approaches, and how that information can be used in combination with sequence and gene expression data to accelerate biological discovery.
John Stamatoyannopoulos, CSO of Regulome, described a method his company has developed to map locations across the genome that are hypersensitive to the enzyme DNAseI and correspond to cis-regulatory elements, such as enhancers, silencers, promoters, and the like. The method, called DACS (Digital Analysis of Chromatin Structure), uses DNA tags of 19-20 base pairs in length to identify DNAseI cut sites in nuclear chromatin.
The tags are then mapped to the genome and an algorithm determines statistically significant “tag clustering events” to identify which chromatin elements correspond to functional elements. Stamatoyannopoulos said that Regulome has used an initial set of hypersensitive sites to train a support vector machine to identify such regions in new data. While the purely computational approach is still in development, “it seems to work,” he said.
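In schematic terms, the clustering step can be thought of as asking whether a genomic window holds more mapped tags than a background model would predict. The Python sketch below illustrates that idea with a simple Poisson test over fixed windows; the window size, significance cutoff, and background model are assumptions made for illustration, not details of Regulome's DACS algorithm or its SVM refinement.

```python
# Toy "tag clustering" test: flag a fixed-size window when its tag count is
# improbably high under a uniform Poisson background. Window size, alpha, and
# the Poisson model are assumptions for illustration, not the DACS method.
import random
from collections import Counter
from scipy.stats import poisson

def tag_clusters(tag_positions, chrom_length, window=500, alpha=1e-4):
    """Return (start, end, count, p-value) for windows enriched in mapped tags."""
    lam = len(tag_positions) * window / chrom_length   # expected tags per window
    counts = Counter(pos // window for pos in tag_positions)
    return [
        (win * window, (win + 1) * window, k, poisson.sf(k - 1, lam))
        for win, k in sorted(counts.items())
        if poisson.sf(k - 1, lam) < alpha               # P(X >= k) under background
    ]

# Simulated data: uniform background tags plus a dense cluster near 1.2 Mb
tags = [random.randrange(10_000_000) for _ in range(10_000)]
tags += [1_200_000 + random.randrange(400) for _ in range(60)]
for start, end, k, p in tag_clusters(tags, 10_000_000):
    print(f"{start}-{end}: {k} tags, p = {p:.2e}")
```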
Akhilesh Pandey of Johns Hopkins University and the Institute of Bioinformatics in Bangalore, India, discussed a new way to use data that is already being generated. Pandey noted that much of the peptide sequence data coming out of proteomics experiments in labs across the world could be “recycled” for use in genome annotation. Peptide sequences, he said, can be aligned to the genome just as ESTs are to identify exons and exon-exon boundaries. Because these sequences result from known transcribed proteins, he added, they could also be used to correct existing gene predictions, disprove pseudogenes, and identify protein isoforms.
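The underlying mapping step is conceptually simple. The sketch below, which is illustrative only, matches a peptide exactly against six-frame translations of a genomic sequence; a production pipeline would also handle spliced peptides, ambiguous residues, and alignment scoring.

```python
# Illustrative only: "recycling" a peptide for annotation by exact matching
# against six-frame translations of genomic DNA. Helper names are hypothetical.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[i]
               for i, (a, b, c) in enumerate((a, b, c)
               for a in BASES for b in BASES for c in BASES)}

def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def map_peptide(peptide, genome):
    """Yield (strand, frame, genomic offset of the first codon) for exact hits."""
    for strand, seq in (("+", genome), ("-", revcomp(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            hit = protein.find(peptide)
            while hit != -1:
                yield strand, frame, frame + 3 * hit
                hit = protein.find(peptide, hit + 1)

genome = "GGGATGAAAGATCTGTGA"              # encodes M K D L * on the forward strand
print(list(map_peptide("MKDL", genome)))   # -> [('+', 0, 3)]
```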
Pandey said he has developed a version of the Distributed Annotation System called PDAS, for Protein DAS, for annotating proteins in the Human Protein Reference Database [BioInform 09-08-03]. He said he is hoping to enlist the proteomics community “to post their mass spec results to the genome” via HPRD, and told BioInform that he plans to submit a paper outlining the mechanism by which these researchers will be encouraged to contribute their “discarded data” to the initiative.
Another new data type coming online is imaging data. An international effort is currently looking for a way to “feed cellular microscopy data back into the genome databases,” according to Jason Swedlow of the University of Dundee. Swedlow described a collaborative effort to develop an open system, called the Open Microscopy Environment, or OME, for storing image data, metadata, and analytical results in a single relational database.
Around 12 developers across six labs have contributed to the project so far, he said, and a number of commercial imaging vendors, such as Applied Precision and PerkinElmer, have already adopted the OME specifications. Swedlow said that the database does not yet support sequence-level data, but “a schema has been developed, and we’re waiting to bolt GO [the Gene Ontology] on once it’s stabilized.”
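The core design idea, keeping pixels, acquisition metadata, and analysis output in one relational store so they can be queried together, can be sketched in a few lines. The toy SQLite schema below is an assumption made for illustration and is not the actual OME data model.

```python
# Minimal sketch of a single relational store for images, metadata, and
# analysis results. The tables and columns are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE image (
    id INTEGER PRIMARY KEY,
    name TEXT,
    pixels_path TEXT                       -- location of the raw pixel data
);
CREATE TABLE image_metadata (
    image_id INTEGER REFERENCES image(id),
    key TEXT,                              -- e.g. objective, exposure_ms, channel
    value TEXT
);
CREATE TABLE analysis_result (
    image_id INTEGER REFERENCES image(id),
    module TEXT,                           -- which analysis produced the value
    feature TEXT,
    value REAL
);
""")
conn.execute("INSERT INTO image VALUES (1, 'cell_001', '/data/cell_001.tif')")
conn.execute("INSERT INTO image_metadata VALUES (1, 'exposure_ms', '120')")
conn.execute("INSERT INTO analysis_result VALUES (1, 'spot_counter', 'n_spots', 42.0)")

# Joint query across raw images, acquisition metadata, and analysis output:
for row in conn.execute("""
    SELECT i.name, m.value AS exposure_ms, r.value AS n_spots
    FROM image i
    JOIN image_metadata m ON m.image_id = i.id AND m.key = 'exposure_ms'
    JOIN analysis_result r ON r.image_id = i.id AND r.feature = 'n_spots'
"""):
    print(row)
```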
Opening Up
The annual Genome Informatics meeting tends to be an open-source-friendly affair, and this year’s gathering was no different. Signaling the prevailing trend toward openness in bioinformatics, two major vendors — Affymetrix and Applied Biosystems — announced during the conference that they would be publicly releasing informatics tools.
As a collaborator in the effort to develop the next version of DAS [BioInform 09-27-04], Affymetrix has released the source code for its Integrated Genome Browser (IGB, pronounced ig-bee) through its Affymetrix Developers Network website (http://www.affymetrix.com/support/developer/tools/affytools.affx). Affy’s Gregg Helt, principal investigator on the DAS2 grant, said that IGB will be available as a DAS2 client. The browser, released under the Common Public License, is Affy’s first contribution to the open source software community, “but it will not be the last,” Helt said.
IGB was developed to visualize the huge amounts of transcription data that Affy has collected through its tiling arrays “so that for any GeneChip, you can see where the probes land on the genome,” he said.
Affymetrix is collaborating with Cold Spring Harbor Laboratory, Dalke Scientific Software, and Ensembl on DAS2. CSHL is responsible for the server implementation, Affy for the client, and Dalke Scientific Software for a “test suite” that will allow DAS users to determine whether they are following the specification properly. “The main point of the grant is the specification,” Helt said. “We want to make sure it is stable.”
DAS2 is expected to retain the “core principles” of the original DAS, namely that “simplicity is key,” Helt said, but there will be a few improvements in addition to the stable spec. DAS2 will include a mechanism for registering DAS2 servers as web services, and enhancements to query URLs and XML responses. In addition, there will be a “writeback” feature that will permit client-to-server annotation.
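From a client’s point of view, a DAS-style interaction boils down to an HTTP query for a genomic region and an XML response listing annotations. The sketch below shows that pattern in Python; the endpoint path, query parameters, and element names are placeholders rather than the actual DAS2 specification.

```python
# Hedged sketch of a DAS-style client: fetch annotations for a region over
# HTTP and parse the XML response. URL scheme and tag names are placeholders.
from urllib.request import urlopen
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

def fetch_features(server, segment, start, end):
    """Query a hypothetical annotation server for features overlapping a region."""
    query = urlencode({"segment": segment, "overlaps": f"{start}:{end}"})
    url = f"{server}/features?{query}"
    with urlopen(url) as response:          # plain HTTP GET, XML comes back
        tree = ET.parse(response)
    return [
        (feat.get("id"), feat.get("type"), feat.get("start"), feat.get("end"))
        for feat in tree.getroot().iter("FEATURE")
    ]

# Example call (placeholder server and coordinates):
# features = fetch_features("http://example.org/das2/hg17", "chr7", 26_900_000, 27_000_000)
```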
Applied Biosystems, meanwhile, has decided to publicly release the PANTHER classification library developed at Celera for annotating the human and mouse genomes, according to Paul Thomas, senior director of the molecular and systems bioinformatics group at ABI.
The current version of the system, PANTHER 4.1, is available at https://panther.appliedbiosystems.com/navigation.jsp. The next version, PANTHER 5.0, will be available in December, and will also be integrated with the publicly available InterPro protein family database on an “incremental” basis, Thomas said.
PANTHER (Protein Annotation Through Evolutionary Relationships) was designed to classify genes to a “simplified” ontology of protein function suitable for browsing and high-level analysis of genomes, Thomas said. Curators (over 60 have been employed during the course of the project) define and name subfamilies of proteins on the basis of conserved function. They then associate each subfamily with ontology terms that describe the molecular functions of the proteins, as well as the pathways and biological processes the proteins participate in. New sequences can be automatically classified using hidden Markov models that are built using the properties of these subfamilies.
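That classification step can be pictured as scoring a new sequence against each subfamily model and assigning it the best-scoring subfamily’s ontology terms. The toy example below uses simple position-specific frequency profiles as stand-ins; PANTHER itself uses profile hidden Markov models, and the subfamilies and terms shown are invented for illustration.

```python
# Illustrative only: classify a sequence by scoring it against per-subfamily
# models and reporting the curator-assigned terms of the best match. The
# profiles, subfamily names, and ontology terms below are invented.
import math

SUBFAMILY_PROFILES = {
    # subfamily -> per-position amino-acid frequencies (toy, length-3 models)
    "kinase_SF1":   [{"M": 0.9, "L": 0.1}, {"K": 0.8, "R": 0.2}, {"D": 0.7, "E": 0.3}],
    "protease_SF2": [{"M": 0.5, "V": 0.5}, {"S": 0.9, "T": 0.1}, {"H": 0.8, "N": 0.2}],
}
SUBFAMILY_ONTOLOGY = {                      # hypothetical curator-assigned terms
    "kinase_SF1":   ["protein kinase activity", "signal transduction"],
    "protease_SF2": ["peptidase activity", "proteolysis"],
}

def score(seq, profile, pseudocount=0.01):
    """Log-likelihood of the first len(profile) residues under a profile."""
    return sum(math.log(pos.get(res, pseudocount))
               for res, pos in zip(seq, profile))

def classify(seq):
    """Assign a sequence to its best-scoring subfamily and report its terms."""
    best = max(SUBFAMILY_PROFILES, key=lambda sf: score(seq, SUBFAMILY_PROFILES[sf]))
    return best, SUBFAMILY_ONTOLOGY[best]

print(classify("MKDLQTWW"))   # -> ('kinase_SF1', ['protein kinase activity', ...])
```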
Thomas said that the current version of PANTHER contains more than 250,000 GenBank sequences organized into around 30,000 subfamilies.
Celera used the PANTHER system to assign function to genes in its 2001 human genome paper, Thomas said. After that, the proprietary system was kept under wraps as part of the commercial Celera Discovery System. As Celera’s business model changed, however, “it became clear that we would have to move [PANTHER] into the public domain if it was to survive and be of use to the research community,” Thomas said. After more than a year of internal wrangling, the public version of PANTHER 4.1 “cleared legal” in August.
Thomas said that he was meeting with members of the InterPro team at EBI following the meeting to discuss plans for integrating the resources.
— BT