Name: Sarah Calvo
Position: PhD candidate, bioinformatics and integrative genomics, Health Sciences & Technology, Massachusetts Institute of Technology/Harvard University, 2005 to present, focusing on computational identification of the parts, pathways, and pathogenesis of the human mitochondrion
Background: BA, computer science, Williams College
While mitochondria are essential for cellular life, death, and differentiation, little is known about the proteins that make up the organelles.
The 13 proteins encoded by the mitochondrial genome have been known since 1981 when it was sequenced, but more recent research suggests there may be as many as 1,500 nuclear-encoded mitochondrial proteins. Fewer than half have been identified with experimental support.
In work outlined in an article published during the summer in Cell, a group of researchers used mass spectrometry, microscopy, and machine learning to construct a library of proteins comprising the mitochondria. In total their library contains 1,098 genes and their protein expressions across 14 mouse tissues.
ProteoMonitor recently spoke with Sarah Calvo, co-first author of the article describing the library, which is dubbed MitoCarta. Below is an edited version of the conversation.
Describe MitoCarta.
The goal was to figure out what are all of the proteins and components that make up human mitochondria, and not just in one tissue but in a whole atlas of tissues. The reason this is important is once you know the protein composition of mitochondria, then we can do a systematic analysis [to figure] out what are disease-related genes, we can try to figure out its functions, what are the main pathways underlying mitochondrial form and function across tissues.
Is the idea to make this a catalog of proteins or do you plan also to include biological information such as expression profiles?
Ideally, you would love to model all sorts of information about the mitochondrion. But the MitoCarta atlas is a list of the proteins, and then a proxy for their protein abundance in 14 tissues.
We then intersect this list with all sorts of other biological information to look at functional insights, but the MitoCarta catalog itself is just a list and protein abundance.
We actually do have a list including the mRNA levels and then in terms of subcellular localization, they’re all in the mitochondria.
Is this a gene-centric approach you’re taking?
Yes. We found that currently the mass-spec technology is not sensitive [enough] to figure out all the splice variants, and so we aggregate the data at a gene level, just because the protein level is not robust enough right now. It’ll get there, but current technology [doesn’t allow it].
Right now, to figure out the proteins that you identify … you do an in silico look-up of a database of known proteins, and that database of known proteins is what’s the limiting factor right now.
In reality, there are many more splice forms than what we have in our databases.
Is there any redundancy between MitoCarta and other protein databases out there, or because you’re looking specifically at mitochondria proteins, MitoCarta would be able to answer a different set of questions?
I think that, yes, there should be redundancy. Hopefully proteins are being found by many different methods, and then by intersecting information from many different sources, you can answer different questions. The goal of this project is to specifically interrogate mitochondrial proteins, so you isolate the mitochondria first, and then you figure out the complex mixture of proteins making up the subcellular organelle.
Now, these proteins are likely to be found in many other proteomic studies. For example, any study that looks at proteins in the liver — liver has a lot of mitochondria, so you’re going to get a lot of mitochondrial proteins — but it’s very difficult to correctly isolate just the mitochondria from other cellular components, and many studies have tried to do this in the past.
The problem is the technology has gotten so good that previously, you only found the abundant proteins, which tended to be mitochondrial, but now the technology is so sensitive that you pick up any contaminants that are in your sample.
So you try to isolate your mitochondria but you get some [erroneous results] and you get some other stuff. Most current approaches identify a lot of mitochondrial proteins and a lot of contaminants, and they have no way of telling the difference.
The real effort that we’ve put into coming up with this bona fide set of mitochondrial proteins as opposed to contaminants is we integrate information from many other different data sources to help us tease apart which are mitochondrial versus non-mitochondrial.
And we also use an experimental technique to try to figure out — as you create more pure mitochondria and you look for the protein composition of purified mitochondria versus crude mitochondria — which proteins are gaining in abundance in the purified mitochondria. And that’s a way of separating out your truly mitochondrial proteins from your contaminants.
Other people have tried to do this. There’s a paper on protein correlation profiling by Matthias Mann’s group. Ours is in the same vein as this approach, but in general this is not used, but I think it works pretty well.
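The purified-versus-crude comparison described above can be sketched in a few lines. This is only an illustration, not the method used in the Cell paper: the protein names, the spectral counts, and the `enrichment` function with its pseudocount are all hypothetical. The idea is that a protein whose share of the sample grows as the mitochondria get purer is more likely to be truly mitochondrial, while a contaminant's share shrinks.

```python
import math
from typing import Dict


def enrichment(pure: Dict[str, int], crude: Dict[str, int],
               pseudocount: float = 1.0) -> Dict[str, float]:
    """Log2 ratio of each protein's share in pure vs. crude preparations.

    Inputs are hypothetical spectral counts per protein. A pseudocount
    keeps the ratio defined when a protein is missing from one sample.
    """
    proteins = set(pure) | set(crude)
    # Denominators include the pseudocounts so the shares still sum to 1.
    pure_total = sum(pure.values()) + pseudocount * len(proteins)
    crude_total = sum(crude.values()) + pseudocount * len(proteins)
    scores = {}
    for name in proteins:
        pure_share = (pure.get(name, 0) + pseudocount) / pure_total
        crude_share = (crude.get(name, 0) + pseudocount) / crude_total
        scores[name] = math.log2(pure_share / crude_share)
    return scores


# Made-up counts: protA gains share on purification, protB and protC lose it.
pure_counts = {"protA": 80, "protB": 15, "protC": 5}
crude_counts = {"protA": 40, "protB": 30, "protC": 30}
scores = enrichment(pure_counts, crude_counts)
```

A positive score marks a protein that gains in relative abundance on purification; a negative score marks a likely contaminant. In practice such a score would be one line of evidence among several rather than a verdict on its own.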
Describe the mass-spec work you did.
After the mitochondria were isolated ... you homogenize them to break up the membranes. And you run the protein mixture out on a 1D gel. And we cut 20 different bands for each tissue. You run it on a gel and you cut out 20 bands, which are then run under a mass spec, on an Orbitrap. Then it goes on a two-hour gradient.
Then separately, we did another set of mass spec experiments, where we take pure mitochondria and crude mitochondria, so we have two different tubes, and then we do in-solution digestion to get the peptides, and then run those out on two-hour gradients on an LC-MS.
The Cell article describes 1,098 genes that you identified. Have there been more you’ve identified since the article was published?
No, the list pretty much is the same as it was then. I think this is the basis for doing systematic biology. This was the first step, and then this enables all sorts of other investigations into mitochondrial form and function. Once we have this list, we can do RNAi screens, we can do chemical screens. We can systematically look for disease-related genes.
So this was a first step, and now we’re going from there. Other groups will continue to refine this list. I think there are about 100 things on it that are wrong, and there are going to be another 100 proteins that are going to need to be on it, but we’re not actively looking to find those.
Were there any proteins you found that were particularly interesting or surprising?
Once we have the list, then we can try to figure out what these newly discovered genes do. We applied an evolutionary approach for one of the most important pathways in mitochondrial functions, which is the electron transport chain.
So we could then annotate a whole bunch of proteins that previously no one knew anything about, and now we can say not only are they mitochondrial based on our mass spec data, but they’re involved in complex 1 assembly and then using that list, we discovered several disease-related genes. One is described in the Cell paper, and then another came out a couple of months later [in American Journal of Human Genetics] describing another one of these genes.
The paper says you had a 10 percent false-positive rate. Are you working on getting that figure down?
In fact, mass spec had a much higher false discovery rate. The 10 percent in our final list was as far down as we could get it.
We basically have a ranked list of 20,000 human or mouse genes, and the ones at the top of the list are [the ones] we’re really sure are mitochondrial, and then as you go down the list, there’s less confidence. You can draw a threshold anywhere you want and say ‘If you want fewer false-positives, then we just move the threshold up and say all right, now we have 800 genes that have only a 5 percent false-discovery rate.’
But we don’t really have any extra methods in the past couple of months to help us bring the list down.
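The trade-off between list length and false-discovery rate on a ranked list can be illustrated with a small sketch. It assumes, hypothetically, that each gene comes with an estimated probability of not being mitochondrial; the gene names, the probabilities, and the `largest_cutoff` function are invented for illustration and are not the scoring used for MitoCarta. The estimated false-discovery rate of the top k genes is taken as the average of those probabilities.

```python
from typing import List, Tuple


def largest_cutoff(ranked: List[Tuple[str, float]], max_fdr: float) -> int:
    """Return how many top-ranked genes can be kept with estimated FDR <= max_fdr.

    `ranked` is sorted from most to least confident. Each float is a
    hypothetical estimated probability that the gene is NOT mitochondrial.
    """
    best = 0
    error_sum = 0.0
    for k, (_gene, p_false) in enumerate(ranked, start=1):
        error_sum += p_false
        # Estimated FDR of the top-k list is the mean error probability.
        if error_sum / k <= max_fdr:
            best = k
    return best


# Made-up ranked list, most confident first.
ranked_genes = [
    ("geneA", 0.01), ("geneB", 0.02), ("geneC", 0.05),
    ("geneD", 0.20), ("geneE", 0.40), ("geneF", 0.90),
]

keep_at_10 = largest_cutoff(ranked_genes, 0.10)  # 4 genes
keep_at_5 = largest_cutoff(ranked_genes, 0.05)   # 3 genes
```

Tightening the tolerated false-discovery rate from 10 percent to 5 percent shortens the list, which is the same trade-off as moving the threshold up on the real ranking.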
What was the false-positive rate on the mass spec?
The false-positive is not a problem with the mass spec, it’s a problem of identifying non-mitochondrial contaminants that are present in the samples, and that was well over 50 percent, probably something like 75 percent, but they’re not mass spec false-positives, they’re experimental design false-positives.
We identified 4,000 proteins from our mass spec experiments, and only 900 of them are truly mitochondrial.
Aside from C1 diseases, to what other diseases would you like to apply MitoCarta?
We actually would like to apply it to all diseases related to primary mitochondrial disease. This is a very heterogeneous group of diseases where in total it has a prevalence of about one in 5,000 people.
The one we talked about is complex 1 deficiency, there’s also complex 2 deficiency, complex 3 deficiency. There are a bunch of mitochondrial myopathies, some other mitochondrial encephalopathies, but each one is very rare.
The paper also talks about probing the ancestry of some of these mitochondrial proteins.
The mitochondria have this fascinating evolutionary history, which is that way back when eukaryotic cells were being created, there’s this endosymbiotic theory where a proto-eukaryotic host cell engulfed an … aerobic respiration bacteria that then has evolved into the modern day mitochondria.
That endosymbiont had its own genome one and a half billion years ago, and that genome probably had about 3,000 genes, but over the past one-and-a-half billion years, almost all of that DNA has been transferred to the host, the nuclear genome, or lost.
This is kind of an interesting history of this organelle, but if you look at a snapshot right now of, say human mitochondria, 13 proteins in human mitochondria are coded for by the … mitochondrial DNA and all the rest are coded for by the nuclear DNA.
One of the questions is: Are the ones coming from the nuclear DNAs, did they originally come from that host, that endosymbiont, or did they come from the host nuclear genome?
So what I did was do an evolutionary analysis of all of the human mitochondrial proteins, this list of 1,098 MitoCarta proteins and I asked, ‘What’s their evolutionary history across 500 fully sequenced species?’
You can [compare that] by sequence similarity, and what we find is that three-quarters of modern day mitochondrial proteins are ancient in origin … they come from bacteria before it was the endosymbiont or the host.
And a quarter is really new innovations, and some of the new innovations are really interesting. [For example] in order to have the host create this mitochondria, you have to have proteins to allow for import of proteins into the organelle, so that whole machinery … as far as we can tell, just came from nowhere. It came into being right at the beginning of eukaryotic life, so that’s interesting, and we can look at what other proteins are similar to that.
And there are some proteins in human mitochondria that are just mammal-specific, or primate-specific, so then we can piece out what is the history of this organelle across bacteria and across eukaryotic evolution, and you can look at particular pathways and which ones are ancient and which ones are new … and look at the history and functions of this organelle.
I know that you said you’re no longer actively working on increasing the number of genes, but are there new technologies that exist now that didn’t when you were trying to identify the true mitochondrial proteins that would help add to that list of proteins?
Certainly, one in particular is we come up with these kinds of proxy protein abundance measurements. We use simple spectral counts, which can definitely be improved upon by new SILAC or quantitative iTRAQ technologies, so I think that to get a better handle on actual protein abundance, the new technology will be a huge boon.
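For readers unfamiliar with spectral counting, a minimal sketch of the idea follows. It assumes each identified spectrum has already been assigned to a tissue and a gene; the tissue and gene names, the counts, and the `spectral_count_abundance` function are hypothetical. A gene's proxy abundance in a tissue is simply its share of all spectra identified in that tissue.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def spectral_count_abundance(
    psms: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Turn (tissue, gene) pairs, one per identified spectrum, into
    per-tissue fractions of the total spectral count."""
    counts = Counter(psms)
    totals: Counter = Counter()
    for (tissue, _gene), n in counts.items():
        totals[tissue] += n
    abundance: Dict[str, Dict[str, float]] = {}
    for (tissue, gene), n in counts.items():
        abundance.setdefault(tissue, {})[gene] = n / totals[tissue]
    return abundance


# Made-up identifications: 8 spectra in liver, 4 in heart.
example_psms = (
    [("liver", "geneA")] * 6 + [("liver", "geneB")] * 2
    + [("heart", "geneA")] * 3 + [("heart", "geneC")] * 1
)
abundance = spectral_count_abundance(example_psms)
```

Counts like these are only a rough proxy, since longer or more easily detected proteins tend to produce more spectra, which is part of why labeling approaches such as SILAC or iTRAQ give a better handle on abundance.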