Sweet Time for Informatics

While the nascent field of glycomics has not received nearly as much attention and funding as its more established systems biology siblings (namely, genomics and proteomics), this small but steadily growing research community is solving its own database and software challenges.

Just like genomics and proteomics, glycomics has an "ome" as its holy grail, although this one is a long, long way from completion. The glycome is a complete map of all the complex carbohydrate, or glycan, structures in a particular organism. These intricate sugar structures have been shown to play a key role in everything from pathogen recognition and sperm-egg interaction to immune system response. In addition, many glycoproteins have been identified as biomarkers for cancer and several other diseases.

Unlike DNA, RNA, or proteins, which are all template-driven, glycans are created by the actions of a large number of enzymes, which can result in a seemingly endless number of structural variations. "If someone wants to decode the glycome of a cell, it's a fairly complicated process. It's not the same as someone saying, 'I want to decode the genome or proteome,'" says Rahul Raman, director of the bioinformatics core for the Consortium for Functional Glycomics. "In a cell, you have different glycoproteins, and each protein has multiple glycosylation sites. At each site, you have a variability of the type of glycans that can be expressed, so you can see how complex it becomes when you want to know every glycan at every glycosylation site of every glycoprotein in a particular cell."
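
The arithmetic behind Raman's point can be sketched in a few lines. This is only a back-of-the-envelope illustration: the function name and the site counts are hypothetical, and it assumes, purely for simplicity, that each glycosylation site varies independently of the others.

```python
from math import prod

def glycoform_count(variants_per_site):
    """Number of distinct glycoforms of one protein, assuming (as a
    simplification) that every site varies independently."""
    return prod(variants_per_site)

# Hypothetical protein with three glycosylation sites, each able to
# carry any one of ten different glycans.
sites = [10, 10, 10]
print(glycoform_count(sites))  # 1000
```

Even with these made-up numbers, a single protein yields a thousand possible glycoforms, and a cell contains many different glycoproteins.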

Glycan databases

In both the availability and the sophistication of its databases and software tools, glycomics trails far behind genomics and proteomics. And just as each of those fields went through its own database growing pains, glycomics must first get its data resources up to speed before more researchers and commercial vendors have a reason to start seriously contributing software tools to this area. Currently, there are three major databases that house glycan structure data: KEGG Glycan, Glycosciences.de, and a relational database hosted by the Consortium for Functional Glycomics, an international initiative funded by the National Institute of General Medical Sciences. The CFG's database is reached through a Web portal that provides integrated interfaces to the diverse datasets in the CFG's relational databases, which contain content on glycan-binding proteins, glycan structures, and glycosyltransferases. KEGG Glycan is an extension of the Kyoto Encyclopedia of Genes and Genomes database and is managed and developed by the Kyoto University Bioinformatics Center. Glycosciences.de is maintained by the German Cancer Research Center and provides researchers with mass spec and glycan structure data as well as applications for glycan analysis.

While certainly not impressive to those familiar with current proteomics and genomics databases, these repositories do mark an important development for glycomics. Prior to their arrival, the only major glycan structure resource available to researchers was CarbBank, a database hosted by the University of Georgia that served as the de facto central repository for all glycan structures. CarbBank had its heyday in the 1990s and has since run out of funding, although it provided a large part of the glycan structure data and system architecture for two of the three newer databases.

Though the three current databases share the same initial collection of glycan structures, they use different file formats, a huge informatics stumbling block. KEGG Glycan uses its own KEGG Chemical Function Format; Glycosciences.de uses the LINUCS format; and CFG uses a format established by the International Union of Pure and Applied Chemistry.

Setting standards

In September 2006, a workshop was held at the National Institutes of Health so that glycobiologists from across the globe could assess bioinformatics needs and the current state of glycan structure analysis tools. An outgrowth of this meeting was the establishment of a standard file format for exchanging glycan structure data. Attendees chose the GLYDE-II XML file format, developed by William York, an assistant professor at the Complex Carbohydrate Research Center at the University of Georgia. And while this is certainly exciting to many, it's like a bunch of TV owners still using bunny ears learning about HDTV. "I think it's fantastic that the format was agreed on, and that's really going to help," says David Goldberg, a research fellow at the Palo Alto Research Center. "But it's not used much, not because there's anything wrong with the standard itself; it's just that there is not that much software out there yet that's designed to take advantage of it."

Encouraging glycomics researchers to adopt a standard file format for glycan structure data submission would be beneficial not only to facilitate independent database integration, but also to make incorporating experimental data published in journals easier. "Each individual database has made [its] own attempts to update their data according to the literature, but it's hard because of the variety of notations used to represent glycan structures," says Kiyoko Aoki-Kinoshita, an associate professor of bioinformatics at Soka University in Tokyo. "In general, it can be assumed that a lot of [glycan] structures are still not represented among all these databases, and it will take time and money for a repository like GenBank for glycobiology to be developed."

Aoki-Kinoshita and others believe that the most urgently needed improvement is the consolidation of these databases along with supplementary data such as pathways, interacting proteins, and binding affinity into a one-stop resource. The creation of such a resource was also deemed a priority at the 2006 NIH meeting, as leaders in the field hope that a standardized glycan structure data file format such as GLYDE-II XML will eventually lead to a centralized and curated glycan structure database.

"There's really a need to get the scientific community and the journals to agree on certain guidelines, so whenever someone wants to deposit a structure they could just do it through a central submission system — and then that structure will automatically go to the different large initiative databases," says Raman. "All of us right now are trying to manually collect this information because there is no system to deposit a structure that will automatically be piped into the different databases. That's the main challenge in maintaining and expanding the current glyco-databases."

Early tools

Still, serious gains have been made since the formation of the CFG and other large-scale initiatives geared toward mobilizing the glycomics community. But Goldberg says that it's difficult to predict how long it will take for glycomics to catch up to proteomics and genomics in terms of software development. Many in the field feel that the lag is due partly to the fact that glycomics is still too small a sector of the market for commercial developers to care about, although a handful of vendors have started offering tools. Proteome Systems has a glyco-database and a suite of software tools for mass spec data analysis and structure prediction. Premier Biosoft International is also pitching SimGlycan, its mass spec analysis software geared toward studying glycosylation, the post-translational modification in which glycans are attached to proteins and a key area of study in glycobiology. Still, many glycobiologists agree that, for now, most of the software development will come from academia.

Along these lines, Goldberg has developed an automatic annotation software tool called Cartoonist that works with single MS data to determine the composition of a particular glycan structure. The program works by selecting, from a library of possible cartoons (schematic drawings of candidate glycan structures), the most plausible annotations for each peak in a mass spectrum. Goldberg says the current version of Cartoonist is unique among software tools in this respect; earlier research in glycomics utilized MS/MS because researchers were merely copying the same techniques that worked in proteomics.
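
The general idea of matching a single-MS peak against a library of candidates can be illustrated with a minimal sketch. This is not Cartoonist's actual algorithm, which the article does not describe in detail. It is a simple brute-force composition lookup, and every function name, count limit, and tolerance below is an assumption made for illustration. The residue masses are standard monoisotopic values for common monosaccharides, and peaks are treated as neutral masses with no adducts or derivatization.

```python
from itertools import product

# Standard monoisotopic residue masses (Da) of common monosaccharides.
RESIDUE_MASS = {
    "Hex": 162.0528,     # hexose, e.g. mannose or galactose
    "HexNAc": 203.0794,  # N-acetylhexosamine, e.g. GlcNAc
    "dHex": 146.0579,    # deoxyhexose, e.g. fucose
    "NeuAc": 291.0954,   # N-acetylneuraminic acid (a sialic acid)
}
WATER = 18.0106  # one water is added back for the free glycan

def composition_mass(comp):
    """Neutral monoisotopic mass of a free glycan with this composition."""
    return sum(RESIDUE_MASS[r] * n for r, n in comp.items()) + WATER

def build_library(max_counts):
    """Enumerate every composition up to the given (hypothetical) limits."""
    names = list(max_counts)
    ranges = [range(max_counts[n] + 1) for n in names]
    library = []
    for counts in product(*ranges):
        if sum(counts) == 0:
            continue  # skip the empty composition
        comp = dict(zip(names, counts))
        library.append((composition_mass(comp), comp))
    return library

def annotate_peak(observed_mass, library, tol=0.02):
    """Return candidate compositions within tol Da, closest match first."""
    hits = [(abs(m - observed_mass), comp)
            for m, comp in library
            if abs(m - observed_mass) <= tol]
    hits.sort(key=lambda h: h[0])
    return [comp for _, comp in hits]

library = build_library({"Hex": 9, "HexNAc": 6, "dHex": 2, "NeuAc": 2})
print(annotate_peak(1234.4334, library)[0])
```

A real annotation tool would go well beyond this, for example by restricting candidates to biologically plausible structures rather than arbitrary compositions, which is the step the article attributes to Cartoonist's library of cartoons.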

"From single MS data, Cartoonist lets you figure out what the glycans' compositions are, and then it makes a very good first guess at what the actual structures [are]," says Goldberg. At the moment, those wishing to use Cartoonist must send their spectra directly to Goldberg, but he says that will change in the next year as he works out the kinks and beefs it up to include MS/MS data. He hopes to distribute the tool to CFG members, and then ultimately to make it more widely distributed. "The tools used to assist in the annotation of glycan mass spectra have made major contributions to this field," says Aoki-Kinoshita. "In particular, the Cartoonist suite of software, which is being used by the CFG to annotate the large amount of data they are generating, has been apparently very useful."

Over at the CCRC, GlycoVault, a Web-based informatics gateway that contains databases, ontologies, and other glycan structure-related data, is also promising. GlycoVault is hosted by the University of Georgia's Integrated Technology Resource for Biomedical Glycomics, an initiative funded by the National Center for Research Resources. This application also contains the Glycomics Browser, a Web-based visualization and analysis tool for glycan data.

Overall, the lack of software and the glyco-informatics community's small size may be a sort of vicious cycle hampering its growth. "It's a chicken or egg problem. Because the [glycomics] community is small … there isn't much of a demand for software," Goldberg says. "But on the other hand, if there was better software, maybe more people would do this kind of experiment, so I think it's going to ratchet up slowly." This leaves researchers like Goldberg and others with the onus of providing tools to the small but growing community.

And as was the case in the early days of genomics, once the divide between computer scientists and bench biologists is bridged and the databases become more developed, things will ramp up. "I think that once the data can be accumulated, the bioinformatics fields can start to develop methods specifically for glycobiology, but it is important for informaticians to work closely with experimentalists in order to develop useful tools," says Aoki-Kinoshita. "The language barrier between bioinformatics and glycobiology needs to be broken down [and] it is my hope that this can be overcome in the near future."