NEW YORK (GenomeWeb) – A team led by researchers at the University of California, San Diego has developed a mass spectrometry database for natural products research that could significantly aid data sharing in the field.
Presented in a paper published last week in Nature Biotechnology, the resource, called Global Natural Products Social Molecular Networking (GNPS), enables continuous re-searching and re-annotation of submitted mass spectra, which both improves the quality of the information in the database and encourages its use, UCSD researcher Nuno Bandeira, who led the development effort, told GenomeWeb.
Bandeira and his colleagues are now working to incorporate similar features into the proteomics portion of UCSD's mass spec database, Massive.
Natural products research focuses on the analysis of molecules found in the environment, such as metabolites generated by animals, plants, or microorganisms. This includes the discovery and characterization of antibiotics like penicillin, rapamycin, and vancomycin, as well as molecules used as drugs in a variety of other areas.
Natural products researchers are interested in, among other things, the structures of these molecules, and typically use a combination of methods like NMR and mass spectrometry to solve them, said Pieter Dorrestein, a UCSD researcher and developer of the GNPS.
Due to the complexity of these molecules, this work can take a significant investment of time and resources, Dorrestein said, noting that a very complex structure can take millions of dollars and several years to solve. Additionally, he said, solving the highly complex structures requires a level of expertise that can take five to 10 years of training.
Compounding the field's challenges, a solved structure often isn't shared with the larger community in any systematic way, Bandeira said.
"What has happened is that people invested sometimes months or years of time to figure out a single [spectrum], and then, at best, that [solution] would end up as a figure in supplementary materials in a paper that is not searchable," he said.
"So one of the things we wanted to do with GNPS was give people a place where they could share that spectrum with that identification and that automatically would become available to the whole community," Bandeira said.
Building a repository is one thing; getting researchers to use it consistently is another. To this end, the UCSD researchers developed the GNPS not only as a place to store mass spectra, but also as a tool for interpreting them.
All data added to the repository is searched against all data existing in the repository, meaning that by submitting their data, researchers are able to learn more about it. At the same time, the existing data is searched against all new datasets and re-annotated with this new information.
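The two-way search described above can be illustrated with a short Python sketch. It is a toy model only: the `Spectrum` and `Repository` classes, the 0.7 threshold, and the plain cosine score over binned peaks are assumptions made for illustration, not a description of how GNPS itself scores or stores spectra.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Spectrum:
    """Hypothetical record: binned m/z values mapped to intensities."""
    spectrum_id: str
    peaks: Dict[int, float]
    annotation: Optional[str] = None  # identification supplied by the depositor
    suggestions: List[str] = field(default_factory=list)  # learned from matches


def cosine(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Plain cosine similarity over shared m/z bins (a stand-in score)."""
    dot = sum(a[mz] * b[mz] for mz in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class Repository:
    """Toy repository that re-annotates in both directions on every deposit."""

    def __init__(self, threshold: float = 0.7):  # threshold is an assumption
        self.threshold = threshold
        self.spectra: List[Spectrum] = []

    def deposit(self, new: Spectrum) -> List[str]:
        for old in self.spectra:
            if cosine(new.peaks, old.peaks) < self.threshold:
                continue
            # New data learns from what the repository already knows.
            if old.annotation and old.annotation not in new.suggestions:
                new.suggestions.append(old.annotation)
            # Existing data is re-annotated with what the new data brings.
            if new.annotation and new.annotation not in old.suggestions:
                old.suggestions.append(new.annotation)
        self.spectra.append(new)
        return new.suggestions


repo = Repository()
unknown = Spectrum("u1", {100: 1.0, 150: 0.5, 200: 0.2})
repo.deposit(unknown)  # nothing to match against yet
known = Spectrum("k1", {100: 1.0, 150: 0.4, 200: 0.3}, annotation="compound A")
repo.deposit(known)    # the earlier unknown spectrum now carries a suggestion
print(unknown.suggestions)
```

Because suggestions accumulate rather than overwrite, conflicting identifications from different depositors would sit side by side in this sketch, which loosely mirrors the point that disagreement about an annotation also comes through.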
"When you upload your data, you get the maximum amount of knowledge that currently exists," Dorrestein said. "Even if there is disagreement about an annotation, that comes through as well, and you are going to continue to learn about your data. So the data repository is no longer just some dead weight, so to speak, that you do at the end of publication when you no longer care about data. You do it at the front end of the data analysis, and it becomes a part of your workflow, and that I think is really important."
Launched in 2014, the GNPS repository had, as of November 2015, roughly 10,000 users and had processed more than 93 million spectra from 250,000 mass spec runs.
Getting researchers to deposit the raw data underlying their studies has been a challenge for a number of fields beyond natural products, with proteomics being one notable example.
Proponents of raw data submission hold that it is important because it allows outside researchers to more thoroughly assess the accuracy of large mass spec experiments. It also allows for re-analysis of past experiments using different or novel informatics approaches, which could enable discoveries not made by the group that initially generated the data. Additionally, because many proteomics studies focus on a relatively small aspect of the data generated, such as protein fold changes or phosphorylation, deposition of raw data allows other researchers to investigate it from different angles.
One issue with regard to proteomics has been establishing and maintaining resources capable of hosting large amounts of raw mass spec proteomics data. For instance, several years ago the University of Michigan-based Tranche repository, which was at one time the only resource for hosting large raw mass spec datasets, significantly cut back its activities due to funding challenges.
Today a number of resources, including UCSD's Massive repository, accept raw mass spec proteomics data. Now the challenge has shifted to convincing researchers to take the time to submit their raw data.
One approach the field has taken is having journals mandate that researchers submit raw data for any papers they publish. Last year, for instance, leading proteomics journal Molecular & Cellular Proteomics revised its guidelines to require submission of raw mass spec data.
Bandeira suggested that a GNPS-style resource that actively helped proteomics researchers identify their spectra would also promote better data sharing.
Dorrestein noted that while efforts to improve data sharing have largely used negative incentives — for instance, a funding agency not providing funds for a project unless data is made publicly available — a repository like GNPS presents a different sort of appeal.
"It's, 'Hey, if you don't make this data publicly accessible or deposit it in this way you won't get the maximum amount of knowledge,'" he said.
"We definitely need something like this in proteomics," Bandeira said. "And that is something we are actively working on for the Massive resource, where we will be describing how to make proteomics data sets 'come alive' in the same way."
"In proteomics we are still unfortunately in the stage where the data goes to a repository, but a large majority of the [spectra] do not even have IDs," he said. "And even the ones that do are not necessarily easily searchable or correlated to each other. So those are the sorts of things that made GNPS such a success and that we definitely want to bring to proteomics in the very near future."
Many of the key features are currently available for proteomics data in Massive, but Bandeira said the community is largely unaware of them given that he and his colleagues have only recently begun that effort. He said he hoped to send out a paper detailing the repository's new capabilities in the next several months.