Lincoln Stein is a well-known champion of open source bioinformatics software, a fact that earned him this year’s Benjamin Franklin Award in Bioinformatics. The award is presented every year by the Bioinformatics Organization to the person who has “promoted free and open access to the materials and methods used in the scientific field of bioinformatics.” Stein received his award at a ceremony on March 31 during the Bio-IT World Conference in Boston, and followed it with a keynote that strayed a bit from the topic of open source software. Instead, Stein suggested a model for “open source pharmaceutical R&D” that would replace the current patent-driven system, which promotes the development of “me-too” versions of blockbuster drugs while “neglected” diseases like malaria go untreated. He argued that the rise of open source development in the IT sector, in which companies like Red Hat and IBM provide services around software developed by volunteers, could serve as a model for pharma R&D. In this scenario, public-sector researchers would conduct the bulk of the R&D with government funding, patents would be eliminated, and drug firms would manufacture and distribute the products. While conceding that “this would never happen in the United States,” Stein said it’s worth exploring because “it’s a humanitarian issue.”
BioInform spoke with Stein after his talk to learn a bit more about this concept, and to catch up on some of the many bioinformatics projects he’s better known for.
I think the subject of your talk today surprised some people. Do you plan to play a bigger role in making this model of open source pharma a reality?
I was in South Africa from January to February, and they have a huge AIDS problem there. And then I saw a New York Times article about how all the great promises for making inexpensive [anti]retroviral therapy available had been postponed or broken, and I got very upset about that, so I decided not to talk about open source software, but to talk about pharmaceuticals instead. I don’t know if anything is going to happen, but if people keep talking about it, and particularly if drug prices continue to rise, and the states start rebelling and buying all their drugs from Canada, then I think this issue of where the drug sales money goes is going to become increasingly important.
Back to the work that more people know you for, what’s new with some of the projects that you lead, like GMOD (Generic Model Organism Database)?
Two weeks ago, we had our version 1.0 release of GMOD. We’ve had releases of components — Apollo, GBrowse, CMap, and PubSearch — but now we’ve had our first release of a packaged system. So it gives you GBrowse, it gives you the database schema Chado, and it gives you schema loading and querying tools. The next release is going to add Apollo for editing, CMap for genetic maps, and hopefully Textpresso, which is a literature-mining tool. And then sometime, hopefully before the end of the fourth quarter 2004, will come the website. Then people will have a functioning version of FlyBase or WormBase, where they can put their data into the database and they’ll have a website that does everything.
How many model organism communities are out there right now that would use the GMOD package?
There are a limited number of things that really qualify as model organisms, but when you consider that people are sequencing everything from possum to duck-billed platypus to snake, each of those is a model organism waiting to be born. There are twenty different fly strains being sequenced, more mosquitoes are coming in, so there’s an unlimited number of things that are coming in.
You’re also working on the Gramene project, which doesn’t seem to get discussed too much.
Gramene is a comparative genomics database for monocots. In essence it’s a genome database for the rice sequence and annotations, so we use the Ensembl software for all the display and browsing and mining, and then we add to it curated data on rice functional genomics. We also maintain ontologies of rice for phenotype. Then we make connections to other monocots: to wheat, to sorghum, to maize, to oats, to barley. The reason is that of all the monocots, rice has a relatively compact genome. It’s 400 megabases in size, about the size of Drosophila, whereas the other genomes are much larger. Maize has a genome that’s almost the size of human, and would be much more expensive to sequence because it has a high number of repetitive elements, so it would be quite difficult to assemble from a shotgun. Wheat has a genome size that’s 16 times [that of] human, so it’s not going to be sequenced any time soon. And even though these genomes are so much larger than rice, if you look at where the genes are, they tend to be in the same positions and in the same relative order. So really what we’re seeing in the other monocots is that they started out with a compact genome like rice, and were invaded by a sea of repetitive elements that forced the genes apart. So what you can do then is, if you have genetically mapped a trait of interest in maize, with Gramene you can take that region in maize and you can ask, ‘Well, what is the corresponding region in rice?’
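To make the idea of asking for “the corresponding region in rice” concrete, here is a minimal sketch of how conserved gene order can be used to translate an interval from a large genome into a compact one. It is not taken from Gramene or Ensembl; the anchor list, the coordinates, and the function name `corresponding_rice_region` are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical ortholog anchors on one pair of syntenic chromosomes:
# (maize position in Mb, rice position in Mb). Values are invented.
ANCHORS = [
    (10.0, 1.0),
    (55.0, 2.5),
    (120.0, 4.0),
    (180.0, 5.2),
    (240.0, 7.1),
]


def corresponding_rice_region(maize_start, maize_end, anchors=ANCHORS):
    """Return the rice interval spanned by anchor genes inside a maize interval.

    Returns None when no anchor gene falls inside the maize interval.
    """
    anchors = sorted(anchors)
    maize_positions = [maize_pos for maize_pos, _ in anchors]
    # Find the slice of anchors whose maize position lies in [start, end].
    lo = bisect_left(maize_positions, maize_start)
    hi = bisect_right(maize_positions, maize_end)
    rice_hits = [rice_pos for _, rice_pos in anchors[lo:hi]]
    if not rice_hits:
        return None
    return min(rice_hits), max(rice_hits)


# A 150 Mb maize interval collapses to a much smaller rice interval.
region = corresponding_rice_region(50.0, 200.0)
```

With these invented anchors, `region` is `(2.5, 5.2)`: the genes stay in the same relative order, but the repeat-filled gaps between them are far smaller in rice.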
Another project that we’re working on in collaboration with a group from Wisconsin is screening maize for genes that have undergone selection during domestication and improvement. This is really very fun because the ancient wild ancestor of maize is this ugly-looking roadside plant called teosinte that grows in Mexico. It doesn’t look very much like maize. It went through multiple cycles of selection, and it all happened so recently that you can really see the effects of that selection on the genome. So this is really a fast way of finding the genes that are responsible for the things we like about maize — the things that make the plant tall and the fruit big and the kernels sweet, or that make the kernels pop so nicely in the microwave. So that information will be coming out in the next year and it will be a fabulous resource for breeders.
Well, you are working on human variation as well, so what’s new with the HapMap project?
The goal of the project is to identify where the haplotypes are, to pick from each [haplotype] one or two or three SNPs that identify that block uniquely, and then to make the information about these tag SNPs (the SNPs that distinguish one block from another) available to the public to do their screening. It is expected that there will be roughly 400,000-500,000 tag SNPs. The caveats here are that the project is not going to find all the blocks. It’s probably going to cover about 85 percent of the genome. The rest will be regions that have very small blocks that we haven’t been able to map out, or that are just very variable. The other caveat is that we’re not doing the association studies ourselves. We’re not screening for diseases, we don’t know anything about the samples, so you can’t mine disease data out of the databases. It’s just an infrastructure like the genome sequence.
What criteria do you use to determine the tag SNPs?
The first step is to gather enough genotype data in order to be able to define the edges of blocks, and then many of the SNPs in the final blocks are candidates as tags, so what we would do is find the ones that perform best on genotyping platforms and are most informative — that is, they have the highest heterozygosity in the population. So some of it’s going to be practical: How well does it work in the assay? Does it give a nice strong signal? And some of it’s going to be based on the population.
The subtlety to this is that we’re looking at three different populations. We’re looking at a European population, an African population, and an Asian population. The blocks are going to be different in some cases between the three population groups, and the tag SNPs will be different as well, depending on their frequencies and the histories of the populations. So researchers who are running association studies will want to choose which SNPs to look at based on the ethnicity of their study group.
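The selection criteria described above (a SNP has to work well in the assay, and among those that do, the most heterozygous one in the study population is the most informative) can be sketched roughly as follows. This is an illustrative simplification, not the HapMap project’s actual procedure: the `Snp` record, the `assay_ok` flag, the allele frequencies, and the assumption that block labels are shared across populations are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    snp_id: str
    block: str         # hypothetical block label, assumed shared across populations
    assay_ok: bool     # hypothetical flag: gives a strong signal on the platform
    allele_freq: dict  # population name -> frequency of one of the two alleles


def heterozygosity(p):
    """Expected heterozygosity of a biallelic SNP: 2p(1 - p)."""
    return 2.0 * p * (1.0 - p)


def pick_tag_snps(snps, population):
    """For each block, pick the assay-passing SNP with the highest heterozygosity."""
    best = {}  # block -> (snp_id, heterozygosity)
    for snp in snps:
        if not snp.assay_ok or population not in snp.allele_freq:
            continue
        h = heterozygosity(snp.allele_freq[population])
        if snp.block not in best or h > best[snp.block][1]:
            best[snp.block] = (snp.snp_id, h)
    return {block: snp_id for block, (snp_id, _) in best.items()}


# Invented example data.
SNPS = [
    Snp("snp1", "blockA", True, {"European": 0.50, "African": 0.10}),
    Snp("snp2", "blockA", True, {"European": 0.20, "African": 0.45}),
    Snp("snp3", "blockA", False, {"European": 0.50, "African": 0.50}),
    Snp("snp4", "blockB", True, {"European": 0.30, "African": 0.30}),
]

european_tags = pick_tag_snps(SNPS, "European")
african_tags = pick_tag_snps(SNPS, "African")
```

In this toy data the best tag for `blockA` is `snp1` in the European sample but `snp2` in the African sample, which is the reason the choice of tag SNPs depends on the study population.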
My group is coordinating the project. We’re running the sample-tracking database, taking the data as it comes in and running the quality-control software. Then we release the data to dbSNP at monthly intervals and run a website that lets researchers download the datasets.
What data is available until the blocks are assigned?
Right now you can download the data sets, or you can also browse them. We then provide researchers with a tool that lets them interactively explore the block structure, and it provides access to three different block-calling algorithms. We haven’t decided which one is going to be the official blessed one. They each have multiple parameters that you can set, so researchers currently have to make their own judgments about what parameters to set. This is interim. As soon as we finish collecting the first data set (we’re doing it a population at a time) we’re going to make a judgment about which algorithm is most robust, and that will be the basis for the official HapMap project blocks.
When will the first population be done?
We’re just getting the last little dribs and drabs of the [European population] now. We did the European samples first because ... it took longer to get informed consent and perform community consultation with the African and Asian groups. In contrast, the European samples were the CEPH [Centre d’Etudes du Polymorphisme Humain] samples that have been used forever for genetic work, so someone just went to the refrigerator and pulled them out.
You recently began another project called the Genome Knowledgebase. How is that going?
It’s a nice project with a crappy name. We’re actually going to change the name, but it’s a biological pathways and reactions database, curated from the literature, and organized like an electronic journal. So we recruit authors to write a database module sort of like a review article. This is actually a collaboration between my lab and Ewan Birney at EBI, so we have curators and software engineers both in his group and in my group that work on this. The curators act like editors, and they make sure that the information has gone into the database correctly. Then the module goes out to other bench researchers for review, and often revision, and then it gets published on the web. The emphasis is on well-understood pathways in human.
It’s very well connected to actual proteins that are in Swiss-Prot. One of the big problems with the metabolic databases is that they use EC numbers. They speak about catalytic activities, which is really what the pathways are made out of, and they then make a guess at what human protein is responsible for the catalysis. This is very ambiguous. Let’s say we’re looking at tyrosine kinase activity: there’s an EC number for tyrosine kinase activity, and there are 20 or 30 different things in Swiss-Prot that have that activity, and only one of those is actually expressed in the cell at the time that that reaction is supposed to occur. So if you just use the catalytic activity EC number, there’s complete ambiguity. You don’t actually know what gene you’re talking about. So we do a lot of extra work in order to get that relationship in. It’s our best guess of what the protein is. We cover about 7 percent of human proteins now.
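The ambiguity described above can be shown with a small data-structure sketch. It contrasts a lookup keyed only on an EC number, which returns every protein with that activity, with a curated record that ties one reaction to one specific protein. This is not the Genome Knowledgebase schema; the protein identifiers, the reaction name, and the function names are hypothetical placeholders.

```python
# EC-number-only annotation: one activity, many candidate proteins.
# EC 2.7.10.1 is receptor protein-tyrosine kinase; the protein names are invented.
EC_TO_PROTEINS = {
    "2.7.10.1": ["PROT_A", "PROT_B", "PROT_C"],
}

# Curated annotation: each reaction records the specific protein that the
# curators judge to catalyze it, alongside the EC number.
CURATED_REACTIONS = {
    "hypothetical_reaction_1": {"ec": "2.7.10.1", "protein": "PROT_B"},
}


def candidates_by_ec(ec_number):
    """Return every protein annotated with the given catalytic activity."""
    return EC_TO_PROTEINS.get(ec_number, [])


def is_ambiguous(ec_number):
    """True when the EC number alone cannot identify a single protein."""
    return len(candidates_by_ec(ec_number)) > 1


def protein_for_reaction(reaction_id):
    """Return the single curated protein for a reaction."""
    return CURATED_REACTIONS[reaction_id]["protein"]


ambiguous = is_ambiguous("2.7.10.1")
curated = protein_for_reaction("hypothetical_reaction_1")
```

Here the EC number alone leaves three candidates, while the curated reaction record names exactly one protein. The cost of the second design is the extra curation effort mentioned above.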
We have this lovely visualization that is coming online, in which each of the molecules is a little star, and then we have arrows joining them. It looks like a constellation. And then as you move through the database and look at different pathways and interactions, the stars light up. So you get to know where you are in the biochemical milieu of the cell. You get to know where the TCA cycle is, and where cell division is, and then as you move around, you can say, ‘Oh yes, I’m somewhere in cell division. Is that chromatin metabolism that’s also lighting up?’
When will you rename it?
Soon. We have an SAB meeting in June, so certainly by then we will have chosen a name for it, and we’ll relaunch it at some point in the fall.
I’m curious: how do you keep all these balls in the air at once? How large is your research group?
About 20 people. It’s growing. I have a number of consultants, so if I count them it goes up to 23. The group is spread out around the world. One of my fellows works in Marseille, France. Another is based in Dallas, Texas. Two are in Columbus, Ohio, and two are in Iceland — Reykjavik. They telecommute, and come visit. It works out.