At A Glance
Name: Morgan Giddings
Position: Assistant professor, departments of microbiology and immunology, and biomedical engineering, University of North Carolina at Chapel Hill, since 2002.
Background: Postdoc, University of Utah, 1998-2001.
PhD in distributed departments, University of Wisconsin, 1997.
MS in computer science, University of Wisconsin, 1991.
BS in physics, University of Utah, 1989.
At this year’s PITTCON conference, held in Orlando, Fla., this week, Morgan Giddings, an assistant professor of microbiology at the University of North Carolina, Chapel Hill, gave a talk on the Genome Fingerprint Scanning program she developed to link the proteome to the genome.
ProteoMonitor caught up with Giddings before her talk to find out more about her background and work.
Can you tell me a bit about your background? How did you get into computational biology?
It’s a little bit of an odd background, but I started out in both physics and computer science — I thought I would do kind of applied computational science for physics problems. In my undergrad I got a physics degree and a computer science minor, and then I went on to get a master’s degree in computer science. As I was pursuing graduate school, I actually became interested in biological problems, and that’s when I started thinking, ‘Huh. Maybe biology is a more interesting place than physics to apply computer science.’ So it was in about 1991 that I really made that switch and started working with Lloyd Smith at the University of Wisconsin to develop algorithms to help with DNA sequencing. I developed a program called Base Finder that’s still available on our website, and still used by a few people.
What did you do after you completed your PhD?
After that I went to the University of Utah. That was in the human genetics department. And I joined an interesting lab that was studying the biochemistry and the biology of these things called recoding events. These are aberrant cases where the protein that gets read out by the ribosome is not what you would expect. So, for example, it may have a frameshift in it. It may be reading along in one codon frame, and then be reading in a different frame, and produce a protein product that you would not expect. So there are all sorts of cases like this, in all sorts of organisms. We actually built a database while I was a postdoc. It’s really quite intriguing how this whole mechanism works.
So that’s actually what got me into proteomics — trying to figure out ways we can use proteomics to look for these odd proteins that were being produced in these cases. That’s what has really motivated me since then in the development of all our software — really being able to help in the discovery of novel proteins, be they produced by alternative splicing, post-translational modification, or one of these frameshift events.
I started developing software for linking the proteome back to the genome when I was at the University of Utah, and I continued that as a faculty member at the University of North Carolina. I joined UNC in early 2002.
What did you start out doing at UNC?
One of the things I’ve developed is this approach called Genome Fingerprint Scanning. Basically, it takes the mass spec information and maps it back directly to the raw genome sequence. And it bypasses all the databases of proteins or known genes, and it says, ‘What part of the sequence in this organism encoded this protein?’ And the power of that is that a lot of the time those databases are incomplete or incorrect, especially when it comes to things like alternatively spliced genes. So generally the databases aren’t fully representative of all the protein products you might get. So when you’re doing searches for proteins that are modified by one of these processes, you may miss out on them entirely.
Also, there are all sorts of organisms being sequenced for which gene annotations aren’t ready yet. So another thing GFS can do is allow you to do identifications immediately as soon as the draft sequence is published, and well before annotations are usually published.
I developed that concept when at the University of Utah, but it was basically only a prototype. And here at UNC, that has been really expanded. We do quite a few different things, but that’s really our core focus — making that software so that we have a very nice, easy to use website that people can come to and use the software to map the proteomic data against whatever genome they want to.
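To make the core idea concrete: the sketch below is a toy illustration of the concept Giddings describes, not the actual GFS code. It digests the translation of a raw DNA sequence in silico and matches observed peptide masses against it directly, with no protein or gene database in between. The sequence, the trimmed codon table, and the abbreviated residue-mass list are all invented for the demo; a real implementation would cover all codons and residues, both strands, and proper mass bookkeeping.

```python
# Illustrative sketch only -- not the published GFS algorithm.
# Monoisotopic residue masses (Da) for a handful of amino acids.
RESIDUE_MASS = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'L': 113.08406, 'K': 128.09496,
    'R': 156.10111, 'F': 147.06841,
}
WATER = 18.01056  # mass of H2O added to a free peptide

CODON = {  # tiny codon-table fragment, enough for the demo sequence
    'GGT': 'G', 'GCT': 'A', 'AAA': 'K', 'CGT': 'R',
    'TTT': 'F', 'GTT': 'V', 'CCT': 'P', 'TCT': 'S',
}

def translate(dna):
    """Translate one reading frame; unknown codons become 'X'."""
    return ''.join(CODON.get(dna[i:i + 3], 'X') for i in range(0, len(dna) - 2, 3))

def tryptic_peptides(protein):
    """Cleave after K or R (ignoring the proline rule for simplicity)."""
    pep, out = '', []
    for aa in protein:
        pep += aa
        if aa in 'KR':
            out.append(pep)
            pep = ''
    if pep:
        out.append(pep)
    return out

def peptide_mass(pep):
    return WATER + sum(RESIDUE_MASS.get(aa, 0.0) for aa in pep)

def scan_genome(dna, observed_masses, tol=0.02):
    """Return (frame, peptide, mass) hits matching the observed masses."""
    hits = []
    for frame in range(3):  # forward frames only in this sketch
        for pep in tryptic_peptides(translate(dna[frame:])):
            m = peptide_mass(pep)
            if any(abs(m - obs) <= tol for obs in observed_masses):
                hits.append((frame, pep, round(m, 4)))
    return hits
```

For example, `scan_genome('GGTGCTAAACGTTTT', [274.164])` finds the tryptic peptide `GAK` in frame 0 — the "which stretch of raw sequence encoded this peptide" question, answered without any annotation.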
How does your program compare with genome-based protein databases?
Any time you get into the game of trying to predict from the genomic sequence what proteins might arise, those predictions just aren’t very good yet, especially for the problem of finding, again, alternatively spliced genes.
One of my postdocs studies an alternatively spliced gene family that can produce up to 38,000 different proteins from a single gene. It’s a pretty incredible case. The more we look into it, the more interesting it becomes. The prediction programs currently just aren’t built to deal with that kind of case.
Nobody knows how many of those cases there are in various vertebrate genomes, and how the proteins are produced. So if you build a protein database just by predicting from the genome, it’s going to be moderately representative, but it’s certainly not going to represent every possible product at the current time.
The idea behind GFS is to totally bypass that — to take the mass spec data and directly say, ‘Here are the places on the genome sequence that that falls.’ And then to go back and look at those sequences and say, ‘What series of exons makes sense given this mass spec data?’
Would the GFS program be able to handle that alternatively spliced gene family?
We’re working on that actively. We’re approaching it from a purely bioinformatics standpoint at this time, trying to understand how this intriguing alternatively spliced pattern evolved. So we’re doing a lot of cross-genome comparisons between the exons in this particular organism and exons in other organisms.
How are you looking to develop GFS more in the future?
The biggest challenge is being able to solve this problem of taking the mass spec data and successfully going back to find the set of exons that express that protein. So, to be able to take protein products from something like the Dscam [alternatively spliced] gene, and directly map it back and say, ‘Ah ha! This protein represented this particular splice pattern, this particular isoform.’ That’s the big goal.
And then an auxiliary goal is to add lots of new features — one of the things not currently available on the website is the ability to put in tandem mass spec data. It’s built in to the program itself, but we can’t put it on the website due to licensing issues. So we’re taking a different approach to interpreting the tandem data, and we should have that on the website pretty soon. That’s really important. A lot of the community is going a lot more towards using tandem data — it has a lot more specificity. So that’s a big step.
Then, there’s lots of other features — adding more genomes, allowing people to submit their own sequences, whatever those may be. Adding the ability to pick different enzymes besides trypsin — there are a lot of those sorts of little things we want to add to make it a very full-featured tool in the future that people can come to and use. Right now I see it as fairly limited, and I want to go beyond that.
What was the most challenging thing in developing GFS?
Initially, to prove the concept, it took about six to eight months. But then from there to here has been a pretty long road, mainly because of the sheer volume of data that these genomes represent.
What we do for each genome is we go in and calculate for that genome what are all the possible peptides that that genome might produce. And that’s a big number. For Saccharomyces, that’s eight million possible peptides, approximately. For the human genome, that’s over two billion peptides. So when you start talking about managing that much data, and searching that size data efficiently, it’s not a trivial problem. We’ve spent a lot of time trying to optimize routines to make searches more efficient and faster. And unfortunately, that doesn’t lead to interesting research publications, but it’s really crucial to getting to the point where we can take some mass spec data and quickly scan the human genome without waiting for two days.
Over the last two years, that’s taken a big part of the time. We’re really at the point where a lot of the optimization is complete, so we’re sort of moving on to these other things.
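The indexing idea described above — compute every candidate peptide mass once per genome, then answer each query fast — can be sketched with a sorted list and binary search. This is a hedged illustration of the general technique, not the actual GFS data structures; the toy entries are invented, and real indexes hold millions to billions of peptides, which is what makes the engineering hard.

```python
# Minimal sketch: precomputed, sorted peptide-mass index with
# binary-search lookup. Not the real GFS implementation.
import bisect

def build_index(peptides_with_masses):
    """Sort (mass, peptide, location) tuples by mass -- done once per genome."""
    return sorted(peptides_with_masses)

def query(index, target_mass, tol=0.02):
    """Binary-search the sorted index for masses within +/- tol of target."""
    lo = bisect.bisect_left(index, (target_mass - tol,))
    hi = bisect.bisect_right(index, (target_mass + tol,))
    return index[lo:hi]

# Toy index: (mass in Da, peptide, genome offset) -- invented values
index = build_index([
    (174.1117, 'R',    120),
    (274.1641, 'GAK',  300),
    (611.3286, 'LVTK', 415),
])
matches = query(index, 274.16)  # hits the GAK entry
```

Each lookup is O(log n) instead of a fresh scan of the genome, which is the difference between answering in seconds and "waiting for two days" at the two-billion-peptide scale of the human genome.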
What projects besides GFS are you working on?
We also have a program that’s aimed towards analyzing top-down proteomics data. That was published in Analytical Chemistry a little over a year ago. What that program does is take intact mass measurements, and a sequence of a protein that’s known to correspond to the mass, and then it searches the space of possible post-translational modifications, as well as truncation events, like N-terminal cleavages. It searches those to try to match up the mass input with the sequence that was input.
For example, you might have a protein that has an intact mass of 47,000 daltons. Let’s say you have a sequence for that protein that says it should be 49,233. So what the program does is it says, ‘OK, what modification events would I have to do to this original sequence such that it would match that mass that you actually got?’
That project is up and running. It works great for simple modifications and simple cleavages. The challenge is taking it in new directions, such as adding the ability to look at more complex modifications, like carbohydrate modifications. Another area is looking at polymorphisms.
In the long-term perspective, I really see these tools uniting into a combined approach where we can take top-down data, not just the intact mass, but also the peptide or fragments, and put those into a combined program of GFS and PROCLAME that completely characterizes that protein automatically.
We haven’t begun to combine the two programs in any kind of automated way yet. We do often combine them manually — running one, then the other and comparing results. My goal is to, three years from now, have a website where people can put in top-down data and have this kind of answer pop out.
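The mass-reconciliation search described above — find the modification and truncation events that explain the gap between the observed intact mass and the mass predicted from the sequence — can be sketched as a bounded enumeration. This is an illustrative toy, not the published PROCLAME algorithm: the event list and its mass shifts are examples, and real truncations are sequence-dependent rather than fixed shifts.

```python
# Hedged sketch of a modification-mass search, not the PROCLAME code.
from itertools import combinations_with_replacement

# Candidate events and approximate mass shifts in daltons.
EVENTS = {
    'phosphorylation': +79.9663,
    'acetylation':     +42.0106,
    'oxidation':       +15.9949,
    'loss of Met':     -131.0405,
}

def explain_mass(sequence_mass, observed_mass, max_events=3, tol=0.05):
    """Enumerate event combinations whose total shift explains the difference."""
    delta = observed_mass - sequence_mass
    hits = []
    names = list(EVENTS)
    for k in range(1, max_events + 1):
        for combo in combinations_with_replacement(names, k):
            shift = sum(EVENTS[name] for name in combo)
            if abs(shift - delta) <= tol:
                hits.append(combo)
    return hits
```

For instance, `explain_mass(10000.0, 10079.97)` returns a single phosphorylation as the explanation of the ~80 Da gap. The combinatorial blow-up as `max_events` grows is exactly why handling complex modifications, like carbohydrates, is the hard next step.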
Is all of this free of charge?
It is. We initially had some debate about how to deal with it, but recently we just decided to go with open source, and to make it available to the community — freely available.
What other projects are you working on for the future?
We do have a small wet lab effort going where we are specifically trying to use proteomics to detect single amino-acid substitutions and things like that in bacteria. What we do is we start with a parent strain of bacteria, expose them to antibiotics, and derive an antibiotic-resistant strain from that parent. Obviously, there are some genetic differences between the two that manifest as protein differences — differences in the proteome that somehow make one strain antibiotic resistant when the other wasn’t.
One way that people have addressed that is going back and sequencing known places where sequences change, but our goal is to say, ‘Well, if the changes are at the protein level, why not just look at the protein level?’ So that’s our main wet lab effort.
In some ways it motivates and drives our informatics development, because we always run into analysis problems — how do we detect these variants? What’s the best way? How do we then utilize the data once we have it?
There is one other project that I’m excited about, which is to do some modeling of a particular biochemical pathway. I’ve already given you the three to five-year vision, which is to have these nice proteomics tools. But let’s look 20 years out — then hopefully a lot of the proteomics technologies will be very solid, and that will be done in some sense.