Northwestern's Lin and Kibbe on the Limits of XML-Based Data Standards

Simon Lin, assistant professor, Northwestern University Medical School Bioinformatics Core
Warren Kibbe, research associate professor and director of the Northwestern University Medical School Bioinformatics Core

A provocative article in the current issue of Expert Review of Proteomics, entitled "What is mzXML good for?", raises a number of issues about the current state of data standards in proteomics [Expert Rev Proteomics. 2005 Dec;2(6):839-45].

Despite the rise in XML-based standards efforts in the bioinformatics community over the last few years, the authors note, "XML is not a panacea for bioinformatics or a substitute for good data representation, and groups that want to use mzXML (or other XML-based representations) directly for data storage or computation will encounter performance and scalability problems."

Specific limitations of the standard, according to the authors, include that it only enables the exchange of processed m/z-intensity pairs and does not capture raw mass spec data; that it does not comply with the FDA's 21 CFR Part 11 regulations governing electronic records and signatures; and that it is an inefficient data structure for large-scale computation.

Last week, BioInform caught up with Simon Lin and Warren Kibbe of Northwestern University, two of the paper's five authors, to discuss some of the implications of their claims.

What was your motivation for looking into what mzXML can and cannot do?

Lin: Right now there are a lot of incentives for using proteomics to investigate biological problems, especially in cancer-related research. The first problem with using proteomics tools is that the data come in many different formats from different manufacturers, so there are a lot of formats that need to be converted before the data can share a common analysis pipeline. For that part, mzXML becomes a very handy tool, and that becomes the first critical step for us to move ahead.

The paper mentions how the MGED Society has led the way in bioinformatics standards initiatives, and how they've really helped encourage the microarray community to share data in public repositories. But I'm not aware of the equivalent of the GEO or ArrayExpress repositories in the proteomics community.

Lin: That is the right impression of the current status of proteomics databases. Right now, there are not many large databases for proteomics data like GEO or ArrayExpress. Some smaller databases are starting to appear, run largely by individual research centers and universities.

What is the relationship between resources like this and standards? Is the dearth of databases due to the fact that there is no clear standard, or is it possibly the reverse?

Kibbe: There are several things. One is that the standards for how people keep and analyze the data haven't completely solidified yet. What made GEO and databases like it possible was really the MIAME standard coming along and being accepted by the community, and that hasn't happened yet for the proteomics world. So, I hate to pick on standards as being an issue, but I think there are, if you will, too many standards in proteomics, and a database like GEO relies on everybody having things in a similar format so that at least when you pull the data out, you know what to expect from an analysis standpoint.

In proteomics right now, not only are there a lot of vendors of equipment that are not interoperating well — in terms of the kinds of formats they're using to internally represent the data and then to export it — but the analysis packages haven't really solidified yet on a single standard, either. So it's both sides.

What do you see as a solution to this? Is it just a matter of the community getting together and agreeing on MIAPE and mzXML and the other standards that are already available, or are there limitations to these particular standards that still need to be addressed?

Lin: I think that's probably the way to get it solved, and also a little bit of time.

Kibbe: I guess the limitation of the current methodologies, the current formats, and the current analyses is that, again, there are a lot of different technologies out there, and they all have slightly different requirements, and that's not a bad thing. They truly do have somewhat different requirements as far as how they go about data acquisition and how you go about analyzing the different types of data. In a lot of ways it's like comparing two-color microarray slides and Affymetrix arrays: they're very different from a data-collection and data-analysis standpoint.

I think that, as Simon said, it's just going to take a bit of time for everybody to understand how they're going to be using the proteomic data — so really what do they need to be able to extract from the data sets, how do they go about analyzing it — and that, in fact, having a plethora of standards doesn't really help get at that.

The other side is just people being comfortable — a new community essentially saying, 'Well, we really do need to describe our experiments in a very precise way that can be cross-correlated between experiments.' The microarray community has had 10 years or so to deal with that and become accustomed to capturing those kinds of elements. And I think that's going to happen very naturally in the proteomics community as well.

I got the impression from the paper that one of the drivers behind it was some level of misunderstanding in the community about the capabilities of standards like mzXML. Do you think people assume that these standards can do more than they were designed for?

Kibbe: I think that whenever there's a new technology, there's a danger that some people believe it's the panacea, it's the end-all, be-all, that it will solve all their problems. And each time a new technology comes along, it has a new set of limitations that people have to understand. From a software engineering standpoint, there are communities out there that love UML, the Unified Modeling Language, and there are groups that will swear up and down that it will solve all their problems. It doesn't. It helps rationalize certain aspects of what they do that maybe they hadn't thought about before. But there are other ways of doing that, and UML itself doesn't really help them solve all their problems.

Likewise, XML is a wonderful tool, particularly for data exchange. And I think the point that Simon really made in that paper is that XML is a great interchange format: it is so flexible, and you can annotate almost anything using XML. But there's a price to pay for that, and that price is performance. You don't want XML to be the format that you're actually using internally when you're writing analysis programs, or even, perhaps, as the internal representation of data inside an instrument; that isn't a natural way to go about it.

You need to think about XML as a way of exchanging data or a very flexible way of annotating data — not that it's good for how you actually go about the analysis. There need to be transformations, and people just have to understand that that's a limitation of XML. It becomes very cumbersome to make that the main way you represent data.

Again, every time a new tool becomes available to a new community, they need to realize what the limitations are. I think that's what the article was about — making sure people are clear what it's good for and what it's not so good for; in fact, what it's really bad at.
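To make that exchange-versus-computation distinction concrete, here is a minimal, hypothetical Python sketch (not code from the paper or from any mzXML toolkit) that parses a simplified mzXML-style file and decodes the packed peak list into flat numeric arrays, the kind of representation an analysis program would actually compute on. The element names, the absence of XML namespaces, and the 32-bit big-endian float packing are assumptions made for illustration.

```python
# Illustrative sketch only: read scans from a simplified, namespace-free
# mzXML-style file and convert the peak lists into plain float pairs.
# Assumes <peaks> holds base64-encoded 32-bit big-endian floats,
# interleaved as m/z, intensity pairs.
import base64
import struct
import xml.etree.ElementTree as ET


def load_scans(path):
    """Return a list of (scan_number, [(mz, intensity), ...]) tuples."""
    scans = []
    for scan in ET.parse(path).iter("scan"):
        peaks = scan.find("peaks")
        raw = base64.b64decode(peaks.text or "") if peaks is not None else b""
        floats = struct.unpack(">%df" % (len(raw) // 4), raw)
        pairs = list(zip(floats[0::2], floats[1::2]))
        scans.append((int(scan.get("num", "0")), pairs))
    return scans


if __name__ == "__main__":
    # Hypothetical file name: once decoded, downstream analysis works on the
    # flat numeric pairs, not on the XML tree itself.
    for num, pairs in load_scans("example.mzXML"):
        print(num, len(pairs), "peaks")
```

The point is the last step: after decoding, the data live in ordinary in-memory arrays, and the XML tree serves only as the interchange wrapper.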

I was interested in the issue you raised regarding regulatory submissions. Would an XML standard for, say, peak data as opposed to processed data, solve that particular problem?

Kibbe: We're involved a fair amount in clinical trial work here, and 21 CFR Part 11 is actually a beautifully written guideline, but really, to be in compliance with that regulation, you need to keep all the data that you ever had, and you need, in fact, to be able to demonstrate its chain of custody. That's really the important part. You need to show [that] any time data has been manipulated in any way that not only can you document exactly how it was manipulated, but [document] the data before the manipulation and the data afterwards.

So for that auditing piece, XML itself doesn't really help you. And that's okay. Again, I don't think that's such a big deal because there are existing wrappers for how to go about tracking and auditing information. And even if all of the raw data was kept in a single XML file, frankly, that to me doesn't really help; it's how you've gone about moving it from system to system, and if, in fact, you can guarantee (as much as anyone can guarantee anything) that the data that you say was unmanipulated really is unmanipulated, and when you've had manipulations occur (so, for instance, you've done analysis on the data) that you can document exactly how it happened.

Traditionally for clinical trials, that means that if you're using SPSS or SAS, you actually include the source code of your program. So you very explicitly show to the reviewer that if they run this little bit of code on this data, then here's the data they're going to end up with, and that gives the FDA a tremendous ability to go in and audit.

Thinking about proteomic data, it's almost mind-boggling to think about having to have that level of detail accessible to an auditor, so I think people are going to have to go back to the guidance behind 21 CFR Part 11 and say, 'What was really the intent of the guidance? Where do we need to be able to document the data stream, and how critical is it to the particular submission?'

So if it's something that is not directly relevant to the trial at hand, I'm hoping that they will just force [submitters] to say, 'Here is the initial data set, the original data set,' and demonstrate that that is unmanipulated. Again, part of it goes to standards, because everybody is changing their analysis software right now, so having to keep snapshots of all the software, all the source code [will be difficult]. Proteomics analyses are much more complicated than the typical statistical programs that people write to look at clinical trial data, where you only have maybe 400 variables that you're looking at, whereas in a proteomic data set you're typically talking about thousands, tens of thousands, hundreds of thousands of data points, and the algorithms aren't just 100-line programs; they're thousands, tens of thousands of lines.

And if you have to keep all of the intermediate results that you may end up generating when you do similar sets of analyses multiple times, the amount of data an auditor would have to go through would just go through the roof.

I think that from an FDA perspective, that's actually OK because they want to be able to do a spot check. It's kind of like the IRS, actually. They're not going to dig through everybody's files in great detail, but when there's a discrepancy, they're going to look at it with a fine-toothed comb. So they need to have the data there. But again, it's [asking] 'How important are those intermediate steps?' And I think that's where everybody needs to go back and think a bit more about it. And in fact it would be helpful to have more guidance from the FDA now in that area.
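As a rough illustration of the chain-of-custody idea Kibbe describes, and not a claim about what 21 CFR Part 11 or FDA guidance actually requires, the hypothetical Python sketch below hashes a data file before and after each processing step and appends a record of the step and tool version to an audit log, so a reviewer could later verify that the stated inputs and outputs were the ones actually used.

```python
# Hypothetical chain-of-custody log: record a cryptographic hash of the data
# before and after each manipulation, plus what was done and with which tool.
# Illustrative only; this is not a 21 CFR Part 11 compliance implementation.
import hashlib
import json
import time


def file_sha256(path):
    """Hash a file in chunks so large raw-data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_step(log_path, step, tool_version, input_path, output_path):
    """Append one audit entry describing a single data manipulation."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "step": step,
        "tool_version": tool_version,
        "input": {"path": input_path, "sha256": file_sha256(input_path)},
        "output": {"path": output_path, "sha256": file_sha256(output_path)},
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")


# Example usage with hypothetical file names:
# record_step("audit.jsonl", "peak_picking", "mytool 0.3",
#             "run01.mzXML", "run01_peaks.csv")
```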

Taking into account the limitations that exist with standards like mzXML, how do you recommend that people use these standards in order to benefit their research?

Lin: In practice, I think mzXML is an instrumental development. There are a lot of statisticians starting to work on proteomics data; they find it a very interesting and challenging research topic. And for either statisticians or computer scientists, the first thing is, 'Let's open the data and take a look.' Before mzXML, it used to be very hard even just to take a look at the data: you needed to deal with a thousand different binary formats before you could take a look at it. Right now, mzXML really makes that first step very easy.

Kibbe: If you think about that from a practical standpoint, now, each instrument manufacturer really only has to come up with a parser that will let them basically transform all their proprietary formats into mzXML, and then all the analysis software just needs to be able to take that format in and then translate it into their internal formats. So it becomes a very nice exchange standard, which is exactly what it's designed to be.
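A minimal sketch of that hub-and-spoke arrangement follows, with entirely hypothetical vendor and function names: each instrument vendor registers one exporter into a common exchange form (standing in for mzXML here), and analysis tools read only that form, so N vendors and M tools need roughly N + M converters rather than N x M pairwise ones.

```python
# Hypothetical hub-and-spoke exchange pattern: every vendor format is
# converted to one common form, and analysis tools only consume that form.
from typing import Callable, Dict, List, Tuple

Spectrum = List[Tuple[float, float]]  # (m/z, intensity) pairs

# Registry of vendor-specific readers; the vendor names are placeholders.
EXPORTERS: Dict[str, Callable[[str], Spectrum]] = {}


def register(vendor: str):
    """Decorator that adds a vendor-format reader to the registry."""
    def wrap(func: Callable[[str], Spectrum]) -> Callable[[str], Spectrum]:
        EXPORTERS[vendor] = func
        return func
    return wrap


@register("toy_text_vendor")  # stand-in name, not a real instrument format
def read_toy_text(path: str) -> Spectrum:
    """Parse a toy whitespace-separated 'mz intensity' file as a mock raw format."""
    pairs: Spectrum = []
    with open(path) as handle:
        for line in handle:
            if line.strip():
                mz, intensity = line.split()
                pairs.append((float(mz), float(intensity)))
    return pairs


def to_common_form(vendor: str, raw_path: str) -> Spectrum:
    """Single dispatch point: downstream analysis never sees vendor formats."""
    return EXPORTERS[vendor](raw_path)
```

The design point is simply that, once a single interchange representation is agreed on, the conversion burden grows linearly with the number of formats rather than with every vendor-tool pairing.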

Are you satisfied with the level of adoption for these exchange standards that you're seeing on the instrument side and the software side?

Kibbe: Not yet. I think this is a community where there are all different levels of expertise, and all different levels of both experience and interest, in the data generation side, the data analysis side, or the interpretation. And depending on who you are, as Simon pointed out, you may really want to be able to dig deep into the data, while other groups may just want to be able to plug existing pieces of software together. They're basically happy with the way the algorithms work, and they just want to be able to plug them together, make it work, and turn out the data.

I think in the end, everybody wants to be in that last group. That's where everybody wants to get to. And depending on your level of comfort or frustration with some of the existing analysis or interpretation packages, you may want to go in and tweak a lot. And again, that depends a bit on the community and the purpose. And it's just coming to the point now where it's easy to download some of these packages, figure out what they're really doing, and then actually manipulate them. And that's when everything becomes transparent. So then, each group, when they want to do proteomic analysis, can just plug things together, and it should all work more or less as advertised.

So think of Bioconductor for microarray analysis as a good example: you can have a series of instrument vendors generating various kinds of data, then apply a very open package to those data and, depending on who you are, generate various levels of interpretable data.

And you could build a really nice GUI-driven application where you just plug data in, and you can let people who aren't trained in the analysis side of it see the results. But we're a little ways off from that in proteomics. Right now, it's probably mass spec gurus, people who are very interested in the chemistry side of things, and the computational folks who are the most interested in proteomics, and it requires somebody like that right now to work with the data.

So it's not quite as transparent yet, but that's rapidly changing. And there are existing packages that I think are getting very, very close to being usable by someone with only passing training in proteomics.

What kind of feedback have you gotten on the paper so far?

Kibbe: I was surprised to see that Simon got requests for reprints before it even went out. That's unusual, so obviously we're either hitting a nerve or it's a very hot topic.

 
