In 2002, Harald Mischak, a professor in the department of Nephrology at the Medical School of Hannover in Germany, founded Mosaiques Diagnostics and Therapeutics with the goal of using mass spectrometry to identify disease-specific polypeptides.
While the company’s focus is on proteomic services and diagnostic development, it also has a strong emphasis on bioinformatics development. The firm has developed a suite of proprietary tools for its in-house use, and has released one of these tools, MosaiquesVisu, as an online service through a subsidiary, Biomosaiques Software.
BioInform spoke to Mischak recently to find out why Mosaiques opted to develop its own software, and the role of bioinformatics in its business.
Can you tell me about the proteomics software you developed at Mosaiques Diagnostics and Therapeutics, and how it fits into your broader business?
Our main focus is not software at all. It is to use mass spectrometry to define disease for diagnostic purposes, but also to evaluate therapeutics and drugs and so on. On the one hand we have hardware that we developed to do the analysis, which works quite well, but we soon found out that there is no software we could use either to interpret the spectra or to subsequently compare the different data sets we have.
So we have developed a bunch of proprietary software solutions, starting with software that enables us to identify all the compounds in complex mass spectra. These mass spectra contain 7,000 to 8,000 different compounds, and these are distributed over a certain period of time, I would say between 10 seconds and maybe 50 seconds or so. But it's an enormous problem, one that has never been solved by any supplier of mass spectrometers, to find all the compounds in there, to combine the signals that come from identical compounds, and to then have a combined amplitude or signal intensity for a certain compound, and also to do so-called charge deconvolution. Mass spectrometers measure mass per charge, which is not really very helpful if you want to know the exact mass.
So [we have developed] these algorithms, and the software works great, at least for all time-of-flight mass spectrometers. When it's used, one can interpret complex spectra within about five minutes. In order to do the same thing by hand, it would take at least three to four weeks per spectrum.
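The charge deconvolution Mischak describes can be illustrated with a minimal sketch. It is not Mosaiques' proprietary algorithm: it assumes positive ions of the form [M + zH]z+ and assumes the charge state of every peak is already known, whereas real software has to infer it. The function names, the peak layout, and the 0.05 Da tolerance are hypothetical choices made for the example.

```python
PROTON_MASS = 1.007276  # approximate mass of a proton in daltons


def neutral_mass(mz, charge):
    """Convert an observed m/z value to the neutral mass M.

    Assumes a positive ion [M + zH]z+, so m/z = (M + z * proton) / z.
    """
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (mz - PROTON_MASS)


def combine_charge_states(peaks, tol_da=0.05):
    """Merge peaks that deconvolute to the same neutral mass.

    peaks: list of (mz, charge, intensity) tuples (hypothetical layout).
    Returns a list of dicts with an intensity-weighted mean mass and the
    summed intensity, ordered by mass.
    """
    by_mass = sorted(
        ((neutral_mass(mz, z), inten) for mz, z, inten in peaks),
        key=lambda item: item[0],
    )
    compounds = []
    for mass, inten in by_mass:
        last = compounds[-1] if compounds else None
        if last is not None and abs(mass - last["mass"]) <= tol_da:
            # Same compound seen at another charge state: pool the signal.
            total = last["intensity"] + inten
            last["mass"] = (last["mass"] * last["intensity"] + mass * inten) / total
            last["intensity"] = total
        else:
            compounds.append({"mass": mass, "intensity": inten})
    return compounds


# A 2,000 Da peptide observed at charge 2 and 3, plus an unrelated 1,500 Da ion.
peaks = [(1001.007276, 2, 100.0), (667.673943, 3, 50.0), (1501.007276, 1, 10.0)]
for compound in combine_charge_states(peaks):
    print(round(compound["mass"], 3), compound["intensity"])
```

The two signals from the 2,000 Da peptide collapse into a single compound with one combined intensity, which is the kind of reduction the answer above refers to.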
So is this something that would replace Mascot or Sequest or another peptide mass fingerprinting search tool?
No. Mascot and Sequest are algorithms to find sequences based on MS-MS data. We don’t really care about the sequence. We define the biomarkers based on mass and migration time. Of course, in the next step we would like to obtain sequence as well, but that’s a completely different story and has nothing to do with the software we developed.
What benefits have you seen in terms of your in-house diagnostic development and biomarker discovery since you’ve been using this software?
Without this, we would not have been in a position to even define any biomarkers. We would not be in a position to use all the data from one of these complex mass spectra. Nobody else, as far as I know, is in a position to use that data if they don’t have proper software.
So … they are large spectra, between 200 megabytes and one gigabyte of data, and the software compresses [them] more or less to a few kilobytes, essentially to just the compounds that are in there, which is the only thing that's really interesting to us.
It's quite easy to compare array data because you have an exact position on your array for each gene, peptide, whatever. But it is a lot more difficult to compare spectra because you don't have an exact position. In other words, migration time and mass cannot be determined with infinite accuracy. So there's always some variation, and if you all of a sudden have thousands of different compounds and you want to compare these to thousands of different compounds in another spectrum, it is quite difficult to decide which are similar or identical compounds and which are not, especially if you now also have to allow for certain variation in the signal.
So this prompted us to also generate calibration software [so] that we can at least calibrate the spectra for migration time as well as signal intensity, and then software that allows us to more or less assign an ID to define which compounds — in our case, proteins and peptides — are identical to others and which are not. This is actually based on certain clustering algorithms. It looks for clusters in all the data sets when you combine the data sets, and it defines these clusters as being a certain peptide if they are of high enough quality, and of course there can be different levels of stringency.
So the first part of the software, which is called MosaiquesVisu and is for data evaluation, to read the mass spectra, is available to others through the web. The other pieces of software are not available to others, at least not yet, because we feel that would jeopardize our core business a little bit too much. At least for now.
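The ID assignment described in this answer, deciding which compounds in different spectra are the same despite small shifts in mass and migration time, can be sketched as a simple tolerance-based clustering. This is an illustration only, not the clustering Mosaiques uses: the greedy strategy, the function name, the tuple layout, and both tolerances are assumptions, and migration time is taken to be already calibrated and expressed in whatever unit the data uses.

```python
def assign_cluster_ids(features, mass_tol_ppm=50.0, time_tol=1.0):
    """Greedily group features that agree in mass and migration time.

    features: list of (sample_id, mass_da, migration_time) tuples
              (hypothetical layout).
    Returns (ids, clusters): ids[i] is the cluster index of features[i];
    each cluster holds the running mean mass, mean time, and member count n.
    """
    clusters = []
    ids = []
    for _sample, mass, time in features:
        best_id, best_dist = None, None
        for cid, c in enumerate(clusters):
            ppm = abs(mass - c["mass"]) / c["mass"] * 1e6
            dt = abs(time - c["time"])
            if ppm <= mass_tol_ppm and dt <= time_tol:
                # Scale both deviations by their tolerance and keep the closest.
                dist = ppm / mass_tol_ppm + dt / time_tol
                if best_dist is None or dist < best_dist:
                    best_id, best_dist = cid, dist
        if best_id is None:
            # Nothing within tolerance: start a new candidate compound.
            clusters.append({"mass": mass, "time": time, "n": 1})
            best_id = len(clusters) - 1
        else:
            # Update the running means of the matched cluster.
            c = clusters[best_id]
            c["n"] += 1
            c["mass"] += (mass - c["mass"]) / c["n"]
            c["time"] += (time - c["time"]) / c["n"]
        ids.append(best_id)
    return ids, clusters


features = [
    ("A", 1500.00, 20.1),
    ("B", 1500.03, 20.4),  # same compound as the first, slightly shifted
    ("A", 1500.02, 35.0),  # similar mass, very different migration time
    ("B", 2200.50, 20.2),  # different mass
]
ids, clusters = assign_cluster_ids(features)
print(ids)
```

Different levels of stringency could then be applied by keeping only clusters whose member count `n` reaches a chosen threshold, or by tightening the two tolerances.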
Why do you feel that releasing MosaiquesVisu doesn’t jeopardize your core business?
Even if you run MosaiquesVisu to evaluate the data, then you still have the problem of calibrating and comparing it, and we have a database of literally thousands of samples that have been run. If you would like to compete with us, you would also have to run thousands of samples, and you would first have to obtain these samples, and you would have to develop software and compare them. So we feel fairly safe in giving out the first piece of the software, which is also quite useful for basic science, I think, and also for completely different applications like finding post-translational modifications in proteins and so on.
But the other pieces of software are, first of all, only for the things we use them for, so they're not something you'd want to use in basic science. And also it would jeopardize our business a little bit. We're a small, young company with a certain amount of money but not billions of dollars.
How many people are in your company overall?
Do you have any particular short- or long-term goals for your software development?
Well, of course it should run faster, if possible. A big problem is to compare these enormous data sets. There is no such thing as the ideal biomarker. You have to combine an array of biomarkers into a pattern or whatever, which is pretty much what others do as well. We just have a lot more biomarkers to choose from. And we use support vector machines to generate a disease-specific model. Optimization of these models takes literally days, which is a bit of a pain.
This is using maybe 500 cases and 3,000 controls, so it’s 3,500 patients, for example, each of them with roughly 1,500 polypeptides, so that’s already more than 5 million features. So it’s just a lot. We need to make some progress in speeding up this process.
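The size of the problem in this example follows from simple multiplication, shown here as a short check. The variable names are illustrative, and the count is the total number of polypeptide measurements across all patients, using the round figures quoted in the answer.

```python
cases = 500
controls = 3_000
polypeptides_per_patient = 1_500  # "roughly 1,500" in the interview

patients = cases + controls
measurements = patients * polypeptides_per_patient

print(patients)      # 3500
print(measurements)  # 5250000
```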