NEW YORK (GenomeWeb) – Seeking to help biologists easily analyze multi-omics data from genomic, proteomic, transcriptomic, and other kinds of studies, scientists from Sanford Burnham Prebys (SBP) Medical Discovery Institute, the Genomics Institute of the Novartis Research Foundation (GNF), and the University of California San Diego have developed Metascape, an open-access, web-based portal that automatically pulls information from various open-source repositories.
As explained in a paper published recently in Nature Communications, Metascape is designed to provide a comprehensive gene list annotation and analysis resource for experimental biologists. It combines functional enrichment, interactome analysis, gene annotation, and membership search from over 40 independent knowledgebases spanning 10 common model organisms.
"Even for computational scientists, compiling and analyzing large omics datasets can be a difficult and time-consuming task," Yingyao Zhou, first author of the study and director of GNF's Data Science and Data Engineering division, said in a statement. "Metascape provides biologists with a platform from which they can access the power of numerous analysis tools all within a simple interface and generate an easy-to-interpret report."
Biologists can use Metascape, for example, to perform comparative analyses of datasets across multiple independent and orthogonal experiments to identify new disease targets and better drugs for cancer subtypes, infectious diseases, and neurological disorders.
"This ability of orthogonal datasets to talk to each other, so proteomics data to talk to RNA-seq data to talk to metabolomics data, that’s a key piece," said Sumit Chanda, senior author of the study and director of the Immunity and Pathogenesis Program at SBP, in an interview. "Each platform gives you one level of insight into biology, but you only start seeing the whole picture if you look at the disease or the phenotype using multiple orthogonal approaches."
Metascape has already found widespread use in the research community. According to its developers, it has been used in more than 330 published studies to date. That's because it fills what they perceive as an unmet need for tools that support systems-based analysis encompassing multi-omics data, Chanda said. In working on their own systems-based analyses internally, "we would need to come up with our own tools to integrate and analyze the data or implement tools that other people have developed but there wasn't really a turnkey solution for analyzing big data without having significant levels of computational expertise," he said.
For example, a proteomics researcher might use one tool to convert protein identifiers into gene symbols, a second tool for pathway enrichment analysis, a third tool for assessing protein interaction networks, and other tools for visualizing the data, the researchers explained in their article. In some instances, users not only need to learn how to use each interface, they also need to be able to integrate the outputs of each individual tool. "We want[ed] to build something that was powerful yet intuitive and accessible for someone with limited training and expertise," essentially "a one-stop shop that you can do all your analysis [with]," Chanda said.
Most gene-based analysis tools focus solely on enrichment analysis, and in many cases, existing tools are not properly maintained, according to Zhou. He said in an interview that when the team surveyed a list of popular enrichment analysis tools, only 40 percent were well maintained. The remaining 60 percent were no longer up to date and were unable, in some instances, to recognize 10 percent of input genes. According to a 2016 analysis of 25 pathway enrichment tools and citations of these tools in over 3,800 publications, 42 percent were outdated by five or more years. The researchers also found that just over 2,600 publications from 2015 cited outdated tools.
Moreover, gene-based analysis covers far more than just enrichment analysis, Zhou noted. Researchers also need to assess, for example, which pathways or biochemical complexes are enriched, as well as the functions of any protein complexes. And simply performing gene enrichment analysis will not take full advantage of the omics dataset. Lastly, researchers often need to analyze long lists of genes rather than a single gene. Currently, "very few tools available are able to analyze multiple gene lists and synergize them," he said.
Metascape addresses all these issues, according to its developers. Specifically, it is capable of analyzing lists of genes from multiple assay types, including transcriptomics, epigenetics, and proteomics. To demonstrate its features, the researchers used three previously published genetic studies of influenza that used RNAi screening to identify factors that influence replication rates. The portal includes options for basic analysis tasks as well as tools for visualizing data and generating reports. It features an automated workflow composed of four components: identifier conversion, gene annotation, membership search, and enrichment analysis.
When researchers add their gene lists into the tool, the data is automatically processed to recognize popular gene identifiers as well as primary locus names from various model organism databases, including FlyBase and WormBase. To annotate genes, Metascape integrates multiple sources of information, including gene descriptions and summaries, disease implications, genomic variants, and tissue expression. The membership search feature lets users search key query words against knowledgebase category term names and description fields, and identify genes with specific functions or features. This reduces the reliance on existing hierarchical ontologies, the researchers wrote. Lastly, Metascape's enrichment analysis feature lets users compare their lists of genes to thousands of gene sets selected based on their involvement in specific biological processes, protein localization, enzymatic function, pathway membership, or other features.
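To make the enrichment step concrete, here is a minimal sketch of how a gene list can be scored against gene sets. It uses a one-sided hypergeometric test, a common choice for this kind of over-representation analysis; the article does not say which statistic Metascape itself applies, so treat the test, the function names, and the toy gene identifiers as assumptions made for illustration only.

```python
from math import comb


def hypergeom_enrichment_p(population_size, set_size, list_size, overlap):
    """Return P(X >= overlap) for a hypergeometric draw.

    population_size: number of genes in the background
    set_size:        number of background genes that belong to the gene set
    list_size:       number of background genes in the user's list
    overlap:         number of genes shared by the list and the gene set
    """
    max_overlap = min(set_size, list_size)
    total = comb(population_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(population_size - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def enrich(gene_list, gene_sets, background):
    """Score every gene set against the list; smallest p-value first."""
    background = set(background)
    genes = set(gene_list) & background
    results = []
    for name, members in gene_sets.items():
        members = set(members) & background
        hits = genes & members
        p = hypergeom_enrichment_p(
            len(background), len(members), len(genes), len(hits)
        )
        results.append((name, sorted(hits), p))
    return sorted(results, key=lambda r: r[2])


# Toy data: identifiers and set names are hypothetical, not real annotations.
background = [f"G{i}" for i in range(20)]
gene_sets = {
    "toy_pathway_A": ["G0", "G1", "G2", "G3", "G4"],
    "toy_pathway_B": ["G15", "G16", "G17"],
}
ranked = enrich(["G0", "G1", "G2", "G10"], gene_sets, background)
for name, hits, p in ranked:
    print(name, hits, round(p, 4))
```

In practice a tool would also correct the resulting p-values for the thousands of gene sets tested, which this sketch leaves out.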
Metascape also includes functionality for analyzing gene lists in the context of protein interactions. Its output includes an analysis report that summarizes key results, a Zip package containing supporting data files, and a PowerPoint presentation of the analysis, among other resources. It also offers various visualization functionalities, including Circos plots and clustered heatmaps.
Without Metascape, biologists would need multiple tools and knowledgebases to fully analyze their data. For example, "if we have three gene lists we want to analyze [and] we go to some enrichment tool website, we may have to do an ID conversion because that website only accepts gene symbols," Zhou explained. After enrichment analysis, a researcher might need to export the results to a tool like Cytoscape to construct a protein-protein interaction network and then move the data to a separate tool to identify densely connected components within the protein network. The next step might be to move the data back to the enrichment analysis tool for further exploration. This process is repeated for each gene list individually and then all those results have to be merged. All of that is "what’s behind Metascape," Zhou said.
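The first and last steps of the manual workflow Zhou describes, converting identifiers for each list and then merging the per-list results, can be sketched in a few lines. This is an illustrative outline only: the mapping table, the identifiers, the list names, and both function names are hypothetical, and it is not a description of how Metascape is implemented.

```python
def to_symbols(ids, id_to_symbol):
    """Convert raw identifiers to gene symbols; collect the unrecognized ones."""
    converted, unrecognized = set(), []
    for raw in ids:
        symbol = id_to_symbol.get(raw.strip().upper())
        if symbol is None:
            unrecognized.append(raw)
        else:
            converted.add(symbol)
    return converted, unrecognized


def merge_lists(named_lists, id_to_symbol):
    """Map each recognized gene symbol to the set of lists it appears in."""
    membership = {}
    for list_name, ids in named_lists.items():
        symbols, _ = to_symbols(ids, id_to_symbol)
        for symbol in symbols:
            membership.setdefault(symbol, set()).add(list_name)
    return membership


# Hypothetical identifier table and three hypothetical gene lists.
id_to_symbol = {"ID_001": "GENE_A", "ID_002": "GENE_B", "ID_003": "GENE_C"}
named_lists = {
    "screen_1": ["ID_001", "id_002"],
    "screen_2": ["ID_001", " ID_003 "],
    "screen_3": ["ID_002", "ID_999"],  # ID_999 is not in the table
}
membership = merge_lists(named_lists, id_to_symbol)

# Genes found in more than one list are candidates for follow-up.
shared = sorted(g for g, lists in membership.items() if len(lists) > 1)
print(shared)
```

Keeping track of which lists each gene came from is what lets results from separate experiments be compared after the fact rather than analyzed in isolation.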
Metascape's developers began work on the tool in 2014 and released a beta version in 2015. Since its release, it has been used in a wide variety of studies focused on diseases such as cancer and influenza. The list of studies includes one published by Chinese researchers, who sought to clarify key candidate genes and signaling pathways in an osteoarthritis rat model. Another study used Metascape for enrichment analysis as part of efforts to identify differential gene expression patterns and key biomarkers in cases of intervertebral disc degeneration. A third study used it for enrichment analysis as part of research focused on identifying genes associated with neuropathic pain. Yet another study used the portal in a meta-analysis of data from RNAi screens and protein interaction networks as part of research into virus-host interactions in influenza cases.
To improve Metascape further, the researchers plan to use artificial intelligence to enable it to extract more insights from research data. Specifically, they plan to use machine-learning methods to help researchers prioritize candidate genes for functional validation from their gene lists. "AI may help leverage a number of existing pieces of data that by themselves may not identify or highlight individual genes but that in combination allow you to rank or prioritize these genes," Lars Pache, one of the study authors and a research assistant professor at Sanford Burnham Prebys, explained. "That's not something you may be able to do manually but using AI, you can combine these multiple lines of evidence to prioritize your data."
In addition, the researchers plan to train Metascape to use known information about genes and protein complexes to identify new genes, protein complexes, and pathways in research data. "The goal here is to leverage known biology to learn new biology," Chanda said. "What can we learn from what we already know? What is it about the features of those factors and candidates that [enable] new factors and candidates … to be discovered?"