Jim Kent is known for a number of things in the bioinformatics world, including his work on the University of California, Santa Cruz “Golden Path” genome browser and the development of BLAT (the BLAST-Like Alignment Tool), a rapid alignment algorithm. But Kent’s claim to fame is a grueling month he spent writing the GigAssembler program to assemble the public human genome in time for the White House announcement of the draft sequence in June 2000.
Kent, who spent 15 years as a computer animation programmer before returning to graduate school for a PhD in biology, is often credited with single-handedly keeping the public project on course to meet the June 2000 goal, an honor he modestly prefers to extend to the hundreds of researchers who contributed to the international consortium’s work. Nevertheless, Kent continues to receive accolades for that month’s worth of programming he did three years ago. This summer, Kent will pick up this year’s Overton Prize from the International Society for Computational Biology at ISMB in Brisbane, Australia. Most recently, Kent was granted the annual Benjamin Franklin Award from the open source bioinformatics advocacy group Bioinformatics.Org for promoting “freedom and openness in the field of bioinformatics.”
BioInform caught up with Kent after the Benjamin Franklin award ceremony and his keynote talk at the O’Reilly Bioinformatics Technology Conference held in San Diego last month to find out what he has in store next.
Congratulations on winning the Ben Franklin award. It makes me curious what it’s like for you to be known as ‘The Guy Who Saved the Human Genome Project.’ How do you follow up on something like that?
It is strange, from my perspective, getting famous for that. It was a lot of work, but it was really the last organizing step on top of work that hundreds of people had done, many of them working on it for 10 years. It turned out to be a good thing, but I’m happy to be back working on comparative genomics and regulatory stuff, because [the assembly] wasn’t actually the main focus of my research at all. It just really badly needed to be done.
How do you balance your research with software development and your work on the UCSC Browser?
For about a year now we’ve had some staff that’s devoted to the browser, and we have a reasonable NHGRI grant to support it as well. It was probably at its most hectic around a year ago, exactly, when I was training the new staff as well as maintaining it and trying to do a little bit of other research. And also the mouse was getting very intense at that point. But now, I don’t have to work so much on maintaining the browser. I still do work on it, but it’s mostly to extend it.
What is the breakdown time-wise between your research work and software or browser work?
Probably about half my time is research and the rest is split many ways, mostly actual management of the browser team — training and stuff like that — and then the usual administrivia, writing grants. Actually, I also do a lot of user support. I like to [laughs]. People say, ‘What? I always figured you’ve got better things to do,’ but it’s direct contact with the users, which is very valuable for me, especially because my PhD program was not in informatics, it was in pure biology. So while I was a grad student, that provided a lot of contact with the end users, but I feel that I have to make deliberate efforts to keep in touch with them because bioinformatics itself is growing into its own discipline.
When you write software, say in the case of BLAT, does the motivation come from particular research problems?
BLAT was from research that kind of pre-dated my involvement with the Human Genome Project. I was working in an alternative splicing lab, so good cDNA-mRNA alignments were very helpful for the raw material for detecting alternative splicing. My PhD was in an alternative splicing lab that had done some work on the human but was starting to work on the worm. [The principal investigator] had all this beautiful genome out there, and he had all these RNAs out there, which was sort of the raw material we needed, but the existing tools for putting them together just using Blast were not adequate in a lot of ways. Blast will tend to find just an exon at a time, it won’t find the whole gene, and then it will tend to bleed over the edges of the exon so it’s not clean, you get a little extra stuff. And then on top of that, the AceDB display that they had would just pile the ESTs and mRNAs right on top of each other, so you couldn’t tell whether it was alternative splicing or not.
So that was really my first [bioinformatics program]. My PI knew that I had a computer background, so he asked me to basically untangle that and make it so that he could use the data. So I wrote a program. My first browser was actually called the Intronerator, for the worm, and I worked out a first-generation cDNA-mRNA alignment algorithm. It took 12 days to run on five computers of various vintages that I had in my garage, aligning all the worm cDNAs against the worm genome. The UCSC computer setup was so backwards at the time that my garage was faster than what they had there. And I wasn’t a poor graduate student, because I had a business earlier selling software, so my garage was better equipped at the time. David Haussler grew embarrassed by this and quickly built up a cluster, so as soon as we did the human I was out of my garage [laughs]. I had the capacity for the worm in my garage, but not for the human.
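The indexing idea behind this kind of fast cDNA-to-genome alignment — index the genome’s non-overlapping k-mers once, then scan each query for seed hits — is the scheme BLAT later made famous. A minimal sketch, assuming nothing about Kent’s actual code; all names and the toy sequences are illustrative:

```python
K = 4  # k-mer size; real tools use larger values (BLAT defaults to 11 for DNA)

def index_genome(genome, k=K):
    """Map each NON-overlapping k-mer of the genome to its positions."""
    idx = {}
    for pos in range(0, len(genome) - k + 1, k):
        idx.setdefault(genome[pos:pos + k], []).append(pos)
    return idx

def seed_hits(query, idx, k=K):
    """Every (query_pos, genome_pos) pair sharing a k-mer.
    Hits on the same diagonal (genome_pos - query_pos) suggest
    one ungapped block, e.g. a single exon."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for gpos in idx.get(query[qpos:qpos + k], []):
            hits.append((qpos, gpos, gpos - qpos))
    return hits

genome = "AAAACCCCGGGGTTTTACGTACGT"
idx = index_genome(genome)
print(seed_hits("CCCCGGGG", idx))  # [(0, 4, 4), (4, 8, 4)] -- one diagonal
```

Clustering same-diagonal hits into blocks, and then stitching blocks across introns, is where the real work of a spliced aligner lies; this sketch only shows the seeding step.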
And that’s really what I thought I would be doing for the Human Genome Project: figuring out how to align all the ESTs to the genome. It was a big problem, because we had 20 times as many ESTs for the human and the genome was 30 times as big, so the problem was 600 times as big, and what took 12 days in the worm did not cut it any more. And it turned out that those algorithms could also very, very quickly be applied to genome assembly, in finding the overlaps between these clones, and that’s kind of how I backed into doing the assembly.
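The scaling Kent describes is easy to check. A back-of-the-envelope sketch using the figures from the interview (the assumption that alignment work grows with queries times target size is the illustration here, not his analysis):

```python
# Figures from the interview: 12 days for the worm alignments on
# five garage machines; the human problem had ~20x the ESTs and a
# ~30x larger genome.
worm_days = 12
est_factor = 20
genome_factor = 30

# Alignment work grows roughly with (number of queries x target size).
problem_scale = est_factor * genome_factor   # 600x the worm problem
naive_days = worm_days * problem_scale       # same hardware, same algorithm

print(problem_scale)  # 600
print(naive_days)     # 7200 days, i.e. nearly 20 years: hence the
                      # need for a much faster algorithm (BLAT)
```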
How did that work lead to the assembly?
I had this work done that would provide the raw material for the assembly, and they asked if I could generate the list of clone overlaps for them and organize it, so I did that. I guess that was in March 2000, and I thought that was all I was going to do. Once you’ve done the alignments, they’re just the first step toward actually identifying all the alternative splicing.
Then it turned out that the assembly wasn’t going very well. It looked like the other things were kind of getting bogged down because they were trying in a sense to do too good of a job, and the data wasn’t there to do a really good job in the first place, so even a relatively crude assembly was a huge improvement over no assembly at all.
So initially I focused on just trying to write a very simple thing that I knew wouldn’t be perfect, but would be a vast improvement over no assembly. Actually, though, it kept growing, and by the last version that we finally handed over to the NCBI, it had 13 inputs and was getting to be a very complicated program. Their own program, which was somewhat fancier, was started earlier and finished later, but when it was finished it was basically just as good…So by that point it was so much better for everybody to be on the same page, so we moved to their assembly once their algorithms got solid enough.
So for other organisms now, there are a couple of whole-genome assemblers out there — Arachne from the Whitehead and Phusion from Sanger.
Oh, I’m out of this now [laughs]. But I’m somewhat of an expert on assembly now, in spite of myself, and I do evaluate them. With mouse, we did a reasonable amount of the evaluation (NCBI did as well), and it was largely a comparison between Phusion and Arachne, and they went back and forth. The Phusion team was extremely gracious in the way they took it when Arachne was chosen for the final product…and from what we can tell it is higher quality than the Celera one, but neither of them is finished by any means.
Can you talk a little about the kind of research you’re doing on regulatory regions?
There are really two broad angles that we’re taking. One is looking for conserved non-coding things using comparative genomics with human and mouse, and now we’re starting to get a lot of species. It’s very clear that that’s not the final answer, but it works pretty reliably, and at this stage it’s so hard to get experimental data on these things, so conservation across genomes is one of the things you definitely can get.
Then, the other approach is based on finding clusters of co-regulated genes and just looking in the neighborhood of those genes for shared motifs. That is going more slowly. We have a pretty good tool called the Improbizer. In the worm it seemed to be working quite well if you feed it a whole bunch [of sequence] upstream from the start codon (worm genes are remarkably compact, so maybe just 500 bases for a co-regulated cluster of genes). One input is supposed to be enriched for the motif you’re looking for, and you give it another input, which is your background level. So if you’re looking for liver-specific genes, you might feed it putative liver promoter regions, and for the background you might give it all promoter regions. We’ve got a couple of researchers in the worm who have pursued it, but in human, we’re somewhat flummoxed on two ends: finding the transcription start site in the first place, and then also getting clean regulatory information.
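The foreground-versus-background setup Kent describes can be sketched with a far cruder method than the Improbizer’s actual algorithm (which fits motifs probabilistically): simply rank k-mers by how over-represented they are in the enriched set relative to the background. Everything below (function names, pseudocount, toy sequences) is illustrative, not the Improbizer:

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every k-mer across a set of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def enriched_kmers(fg_seqs, bg_seqs, k=6, pseudo=1.0):
    """Rank k-mers by over-representation in the foreground
    (e.g. putative liver promoters) relative to the background
    (e.g. all promoters). Pseudocounts avoid division by zero."""
    fg, bg = kmer_counts(fg_seqs, k), kmer_counts(bg_seqs, k)
    fg_total, bg_total = sum(fg.values()), sum(bg.values())

    def score(kmer):
        f = (fg[kmer] + pseudo) / (fg_total + pseudo)
        b = (bg[kmer] + pseudo) / (bg_total + pseudo)
        return f / b

    return sorted(fg, key=score, reverse=True)

# Toy example: the motif "TATAAT" is planted in every foreground sequence.
fg_seqs = ["GGTATAATCC", "ATTATAATGG", "CCTATAATAA"]
bg_seqs = ["GGGGCCCCGG", "ACGTACGTAC", "CCGGCCGGAA"]
print(enriched_kmers(fg_seqs, bg_seqs, k=6)[0])  # TATAAT
```

A real motif finder models degenerate positions with a weight matrix rather than exact k-mers; this sketch only illustrates why the second, background input matters.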
Like the genome itself, some of this information is helpful in many, many other ways, so I’m tending to focus on that at the moment: collaborating with the MGC [Mammalian Gene Collection] to get data on the genes, which for my personal research will hopefully define the transcription starts, and collaborating with Affymetrix to get nice, clean expression information.