Bioinformatics, the analytical savior of the wet lab, is usually uniquely positioned to help researchers sort out the messiness of biology and make sense of all that data. But when it comes to pathway analysis, the sheer complexity of the processes involved, combined with the number of intersections in any given pathway, makes the challenges for bioinformaticians anything but trivial. While genomic technology allows for a clearer picture of all the genes in a cell, the picture of how the cell's proteins interact in different ways, in different cells, and at different moments is still pretty fuzzy. These days, the traditional linear understanding of a pathway, in which interactions happen protein to protein, has been replaced by a much more holistic model in which interactions occur in vast networks.
"We talk of pathways like the holy grail of biology, and the hurdles for pathway analysis informatics are pretty big because they are very abstract and very complicated," says Zhang-zhi Hu, an associate professor at Georgetown University's Lombardi Comprehensive Cancer Center. "The traditional understanding of pathways was as linear paths of protein-protein interactions in cells, but nowadays we know all the genes in the cell — at least for the completed genomes — so the pathway becomes so much more complicated because of the many intersections — just like airline flights, there are many hubs, you can go from Washington to LA via all kinds of different routes."
This complexity is not a problem in terms of developing analysis software and obtaining compute power robust enough to visualize and analyze the data, but rather a problem with the data itself. More specifically, it is the intrinsic complexity of pathways that makes it difficult for databases such as KEGG, the Pathway Interaction Database, or Reactome to claim to provide researchers with fully modeled pathways. This complexity is also why the problem of annotation comes up so frequently in discussions of informatics approaches to pathway analysis. "There are different databases that claim to be pathway databases, but the way they capture different pathways in a digital form is very different. Several databases try to capture as much information as reflected in the biology as possible so they start to model the molecular events between one molecule and another. They then have to assign different identifiers not only to the molecule itself, but to even interweave those two," Hu says. "But when you capture this database in a digital form, how do you decide what is the boundary of a particular pathway? Different databases have different boundaries. In terms of informatics challenges, you analyze your data using public information, but you can't use just one database, you have to use all available databases, because they all have different understandings of the pathway. So that's a big complication."
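Hu's point about differing pathway boundaries can be made concrete with a small sketch. The Python below is illustrative only: the database labels and gene symbols are invented placeholders rather than real exports from any pathway resource, and `compare_boundaries` is a hypothetical helper. It measures how much three definitions of the "same" pathway agree, then reports the genes all of them share and the genes any of them lists.

```python
from itertools import combinations

# Hypothetical memberships of the "same" pathway in three databases.
# Labels and gene symbols are placeholders, not real database content.
pathway_by_db = {
    "db_a": {"GENE1", "GENE2", "GENE3", "GENE4"},
    "db_b": {"GENE2", "GENE3", "GENE5"},
    "db_c": {"GENE3", "GENE4", "GENE5", "GENE6"},
}

def jaccard(a, b):
    """Fraction of genes two definitions share (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def compare_boundaries(definitions):
    """Return pairwise Jaccard overlaps, the shared core, and the union."""
    overlaps = {
        (x, y): jaccard(definitions[x], definitions[y])
        for x, y in combinations(sorted(definitions), 2)
    }
    core = set.intersection(*definitions.values())
    union = set.union(*definitions.values())
    return overlaps, core, union

overlaps, core, union = compare_boundaries(pathway_by_db)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
print("genes in every definition:", sorted(core))
print("genes in any definition:", sorted(union))
```

In this toy case only one gene appears in all three definitions, which is the situation Hu describes: an analysis restricted to a single database would see a different pathway than one that pools them.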
Bioinformaticians including Andrey Ptitsyn, an assistant professor at the Center for Bioinformatics at Colorado State University, often run headlong into serious difficulties when conducting pathway analysis studies. In his case, he faces issues in the study of new infectious disease data. "We're facing a terrible problem of improperly interconnected databases with naming conventions for different genes and, of course, with annotations," Ptitsyn says. "The only gene interaction and systems biology analysis you can do that is more or less well established is for the Drosophila genome. But if you're dealing with a mosquito [genome] you're facing a lot of challenges because the maps are not specifically charted, so you have to do analysis using a proxy genome. And many of the bioinformatics tools which are designed for analysis and visualization are made to accept the data with names and specific interactions with the Drosophila genome only."
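The proxy-genome workaround Ptitsyn describes amounts to translating gene identifiers before a tool will accept them. A minimal sketch of that step follows, assuming an ortholog table is already available; every identifier here is an invented placeholder and `map_to_proxy` is a hypothetical helper, not part of any tool named in this article.

```python
# Hypothetical ortholog table: mosquito gene ID -> Drosophila gene ID.
# All identifiers are invented placeholders for illustration.
orthologs = {
    "MOSQ_0001": "DMEL_A",
    "MOSQ_0002": "DMEL_B",
    "MOSQ_0004": "DMEL_B",  # two mosquito genes can share one fly ortholog
}

def map_to_proxy(gene_ids, ortholog_table):
    """Split a gene list into mapped proxy IDs and genes with no known ortholog."""
    mapped = {}
    unmapped = []
    for gene in gene_ids:
        if gene in ortholog_table:
            mapped[gene] = ortholog_table[gene]
        else:
            unmapped.append(gene)
    return mapped, unmapped

study_genes = ["MOSQ_0001", "MOSQ_0002", "MOSQ_0003", "MOSQ_0004"]
mapped, unmapped = map_to_proxy(study_genes, orthologs)
proxy_ids = sorted(set(mapped.values()))
print("proxy IDs for downstream tools:", proxy_ids)
print("genes with no ortholog:", unmapped)
```

Even this toy run loses information twice: one gene has no ortholog at all, and two distinct mosquito genes collapse onto a single fly identifier.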
Tools of the trade
The idea that pathway analysis software tools are only as good as their annotations is the crux of the informatics challenge. To manage these complications, pathway analysis investigators tend to turn to commercial tools over open-source options because commercial vendors are able to invest more resources in the annotation of their databases. "The current state of commercial tools is still very good, not because they have very robust algorithms or methods, but because several commercial pathway analysis programs, like Ingenuity and GeneGo solutions, have the money behind them to hire people to annotate those pathways from both existing databases, which they can integrate, but also by annotating from the literature, and that is very powerful," Hu says. "That's why I always say that pathway analysis is knowledge-based and qualitative, not quantitative. The coverage for gene ontology for the human genome is 80 to 90 percent … but the annotation of the pathways across the human genome is only less than 25 percent, so three-quarters or more of the genes are not annotated with pathways. If you use public databases, chances are you don't get any information, because they are not covered in the databases."
Ptitsyn's lab subscribes to both Ingenuity's pathway analysis and GeneGo's MetaCore, but he says he would like to have the resources to subscribe to Ariadne's Pathway Studio as well. Instead, he settles for augmenting his analysis arsenal with open-source platforms such as Cytoscape and PANTHER, which are freely available and have a larger user community from which to draw support. "These analysis tools essentially work like a scavenger hunt: every time an analysis is run, the user is trying to solve the problem of understanding the biological process. You might analyze a pathway using one approach trying to see what functional groups are overly represented, and if you can add the ranking for significance, great. If you can't, you might be better off with representation methods based on Fisher statistics for which there are plenty of commercial and open source tools available," Ptitsyn says. "A few years ago, we had a few breakthroughs, but still the majority of the bioinformatics tools follow along the same lines: improving the same approaches, developing extra bells and whistles around the same few ways to analyze the pathway, usually either something dealing with drawing the charts in different ways or with calculating a set of very simple statistics on representation analysis for certain pathways."
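Representation analysis of the kind Ptitsyn mentions is commonly done with a one-sided Fisher's exact test, which asks whether a pathway contains more of a study's hit genes than chance would predict. The sketch below is a generic textbook version written with the Python standard library, not the method of any tool named above; the function name and the toy numbers are assumptions made for illustration.

```python
from math import comb

def enrichment_p_value(hits_in_pathway, pathway_size, hits_total, universe_size):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Probability of seeing at least `hits_in_pathway` hit genes in a pathway of
    `pathway_size` genes when `hits_total` genes are drawn at random from a
    universe of `universe_size` annotated genes.
    """
    max_overlap = min(pathway_size, hits_total)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, hits_total - k)
        for k in range(hits_in_pathway, max_overlap + 1)
    )
    return tail / comb(universe_size, hits_total)

# Hypothetical toy numbers: 1,000 annotated genes, 50 of them study hits.
universe_size = 1000
hits_total = 50
pathways = {
    "pathway_x": (40, 8),    # (pathway size, hits falling in the pathway)
    "pathway_y": (200, 11),
}

for name, (size, overlap) in pathways.items():
    p = enrichment_p_value(overlap, size, hits_total, universe_size)
    print(f"{name}: {overlap}/{size} genes hit, p = {p:.3g}")
```

By chance alone, about 2 hits would be expected in the first toy pathway and about 10 in the second, so only the first stands out. The test treats every gene as equally likely to be a hit, which is exactly the kind of assumption that deserves scrutiny in practice.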
Both Hu and Ptitsyn say that the assumptions researchers make in setting up a pathway analysis must be accounted for. Many of the methods assume equal representation or a normal distribution of some quantity, which is not necessarily true when you get down to the molecular genetics and the way the cellular circuitry works. Oftentimes during analysis, investigators see certain pathways represented only because those pathways are expressed at a higher level and are the first to be found. Many methods of analysis therefore surface the expected results first, while something really important lies somewhere on the fringes of statistical significance or, more often, even beyond them.
For the time being, investigators like Hu say that the informatics challenge with pathway analysis is not really an informatics issue at all, but rather a problem of biology. "Ten years back, we sequenced the human genome, and it sounded like we've already completed the big task — like [the] annotation is easy or is done. But it's the pathways that make everything complicated," he says. "Without those protein-protein interactions, the genome is just a blueprint. It has to be converted into a reality, and making sense of those interactions is still a mind-bogglingly complex problem."