FORTALEZA, Brazil — It’s been a busy two weeks in Fortaleza for Amos Bairoch, leader of the SwissProt group at the Swiss Institute of Bioinformatics. Bairoch co-organized “In Silico Analysis of Proteins,” a five-day conference here that marked the 20th anniversary of the SwissProt database; delivered a keynote address at the Bioinformatics Open Source Conference just prior to ISMB; and then played a large role in the “New Frontiers” track at ISMB, leading a discussion on funding models for biomolecular data repositories.
SwissProt’s 20th anniversary is worth celebrating, since the long-term survival of the database has never been guaranteed, and the resource has had to rely on what Bairoch described as a “yo-yo” funding cycle: The initiative lost its public funding in the late ’90s and turned to a hybrid model in which for-profit users were required to license the resource through SIB’s commercial arm, Geneva Bioinformatics. Between 1997 and 2004, the company sold around 750 licenses to 400 companies and generated more than 30 million Swiss francs ($24.3 million) in cumulative revenue.
While Bairoch described that model as successful, some in the bioinformatics community feared a “domino effect,” he said, in which funding agencies would begin enforcing that model for other resources. In 2003, SwissProt reverted to a freely available public model with funding from the NIH, which created a new database called UniProt that linked the SwissProt groups at SIB and the European Bioinformatics Institute with the developers of the Protein Information Resource at Georgetown University Medical Center.
With this history as a backdrop, Bairoch is now focusing his efforts on a number of new initiatives to ensure continued development — and funding — for SwissProt and other bioinformatics resources. Two potential initiatives in the works for SwissProt involve community annotation.
One, called “Gray Matter Counts,” would rely on a network of retired scientists who volunteer to add their knowledge to the resource. Bairoch said that this idea came from several scientists nearing retirement age who asked him if they could aid in curation.
The second, called “Adopt a Protein,” would take a page from the success of Wikipedia and use a wiki-based annotation approach. Bairoch said that the success of this scheme is still uncertain since it’s very likely that only “a small percentage of life scientists have the time and are altruistic enough to fully participate.” Nevertheless, he said, a pilot project within the yeast community — selected due to its limited size and proven willingness to share — will soon be underway.
During the New Frontiers track at ISMB, Bairoch proposed the “Fortaleza declaration,” which would require experimental labs to set aside a certain portion of each grant to support long-term data management (see feature, this issue, for additional reporting). BioInform spoke to Bairoch after the session to discuss the proposal, the SwissProt meeting, and his thoughts on the future of bioinformatics resources. A transcript of the interview, edited for length, follows.
What were some of the highlights of the SwissProt meeting for you?
It was a really nice meeting in terms of the size. There were more than 200 people, so there was a lot of interaction. Each of the talks had a historical perspective, which I think was interesting to the audience to see what led [the speakers] to be where they are now. And some people extrapolated on what they were going to do, so it was really a mixture of different things. I think it was quite exciting because you could see the state of research for people like Janet Thornton [director of the European Bioinformatics Institute], Rich Roberts [of New England Biolabs], and so on — why they came into the field and what they were expecting to do.
Were there any surprises for you in terms of how people are using the resource, or what they’re getting from it?
During the conference, people from the SwissProt group in Geneva asked who wanted to be interviewed, and then we asked those people how they are using the database. So we did the survey, but we don’t have the results yet. There were 120 questionnaires, and the interviewers sat with each person and spent an hour going over the questions, because we didn’t want just to give them a questionnaire that they would fill out in five minutes.
So they spent an hour with them and now we have more than 100 questionnaires to try and synthesize. It won’t be a good survey of the user community in the world, because the people that came [to the conference] already knew about SwissProt. So we can’t say it’s the average user. It’s more like, what do users who know about SwissProt expect from it, which is interesting also. We need to know what our normal users, people who have known about it and have been using it for a long time, expect.
We told the audience that we would make it available in the autumn to everyone who participated, because it will take time to put together.
I know you already have some ideas for SwissProt going forward. You mentioned the Adopt a Protein and Gray Matter Counts initiatives in your talk at BOSC.
That’s more to pull data in, since there are a lot of other things we need to do. The question is what are we going to be able to do with limited resources? There is no way that we can deal with everything that comes in, so we have to prioritize. And the question is, how do we do more with the money we have?
I think that’s one of the challenges — people putting the data in. And not only raw data, like I was talking about during this session, but getting people to really get the important information and the knowledge in. Data is easy to get in. There’s a lot to be done, but still you can get people to submit their sequence, you can get people to submit their microarray data. But the knowledge is more difficult.
I think that there will be a lot of work to be done in collaboration between databases and journals. Because people submit knowledge to journals, and that’s the way they get credit, they build their CVs, and they get kudos for that. But every paper — that’s knowledge. Papers are supposed to summarize knowledge.
In fact, we had a number of talks on this during the [SwissProt] meeting, because there were people like Matthew Cockerill [from BioMed Central] and Rich Roberts, and both of them spoke about open access and the fact that this may help get data into databases directly. So that’s an interesting factor, but it [raises] a question: Are journals themselves going to become database providers? A lot of them are going to compete with existing databases.
I think it’s going to be a mixture. You’re going to have journals that are going to do this, you’re going to have journals that are going to collaborate with public databases, and you’re going to have databases that are going to become publishers and compete with editors.
We also had a very interesting talk from Barend Mons. He created a company called Collexis and they work on knowledge extraction, knowledge representation. Not to say that he has the best answer, but everyone is trying to build these tools. It’s, of course, based on text mining, part of it, but it’s based really on getting knowledge from mining the data resources. And all of those people are working together trying to build the tools that will be available to the general public, so that’s quite nice.
That will be a big challenge, getting people to get the data in, because now they publish and they forget about it. They don’t forget about the data, but they go to their next grant, and don’t have time to go back to the database, and, of course, two years later, they come and say, ‘Oh, it’s a pity we didn’t tell you about this. My data is not in the database but I didn’t have time. Don’t worry — when I come back from this meeting I will send you things.’ And it never happens. Some do, but…
There are a number of people who take the time, though. And not only for their own data, because you could say they have an incentive to have their data play a part in a database because it gets cited, but a lot of people say, ‘I published this, but there’s also this colleague that published this and shows this and showed that, and so on.’ So a lot of people are not only doing this in an egotistical way — ‘I will submit to the database, but I will only submit mine.’ If they do submit, they generally do a lot more than just submitting their work, and that’s nice, but it’s far less than what it should be.
There has been a lot of discussion of this at ISMB this year, and a number of speakers have mentioned that there aren’t many incentives for authors to submit their data.
Yes, what carrots do we have, and so on. One of the things I think about also for the future of annotation, at least for UniProt/SwissProt, and so on, where we do a lot more annotation, is to get more and more people involved. But of course, we’re not going to grow in Geneva or at EBI. I’m not saying that growth is over. We have 70 people in Geneva and there are about 100 people at EBI working on databases, not only UniProt but others. Doubling or tripling that is not good; 300, 400 people on a site becomes very bureaucratic, but that’s my personal opinion.
That’s one reason we did the 20-year anniversary in Brazil. We have a group of four people who started in Brazil at the LNCC [Laboratório Nacional de Computação Científica]. I met them I think in 2002, 2003, and at one point they said, ‘Could we help with SwissProt annotation?’ And I said, ‘Yes, but what do you mean by help?’ And they said, ‘Could we have annotators locally, and could you train them?’ And so we started this collaboration with four people in Brazil, and we’d like to do a lot more. They are annotating proteins from pathogenic bacteria, and Brazil is doing a lot of genome sequencing on that. What we will try to get is a grant to have maybe 10 or 15 people in Brazil working on protozoa — Leishmania, Trypanosoma — and it’s a wonderful place to do that annotation. They have access to the labs that are working on it and are directly interested in getting the results. These diseases are endemic in Brazil. So you have the researchers on site, which can help also with the annotation process.
Brazil is an example, but you can imagine this taking place at a number of places. Maybe not 1,000 labs, but 10 or 20 labs, each with their own funding, would help. It would be wonderful. We could double the number of people without having too much bureaucracy. …
So we now have in the UniProt consortium only three members, but we will soon have what we call affiliate members, like Brazil and so on. But we have to be careful how we grow. We can’t open it up to hundreds of people and have them pulling in every direction.
You mentioned in your talk at BOSC that when companies were paying for SwissProt they gave you feedback. I was wondering if that might be a side effect of the tax proposal you mentioned today. If people are paying for this service, in a sense, would you expect more interaction?
No, because it would be part of the grant. I think of this proposal not as a way to fund something like SwissProt, because SwissProt is not a data repository — it’s a knowledge base. So the original data is deposited in other databases. We massage it, we add onto it. I was thinking more of a database like PRIDE for protein identification, GenBank, EMBL, PDB, and so on — all of the repositories. Of course people will say, ‘I paid and I want to make sure that the data is correct,’ and that should be the case.
I’m not sure there’s going to be so much more feedback. To be honest, when the companies were buying SwissProt, yes, there was a little bit more feedback from companies than we have now because they thought, rightly, ‘If I’m paying $100,000 to get this database, I need this and this to be correct.’ But it was not a lot because, most of the time, whatever feedback the companies gave could have been mined so that others knew what they were doing. There’s a lawyer standing there saying, ‘Don’t tell the database that we need this.’
Sometimes it did happen, but most of the time it came later. You know, five years later saying, ‘We would have liked you to do this G-linker receptor. But of course I couldn’t tell you I wanted the G-linker receptor to be annotated, because that would have been a risk.’
I think that if academics were paying, they would give feedback. But with the companies, that was rare. If it was their lead target, they couldn’t give feedback.