We’re pretty conscious of the barrier needing to be low, and we work pretty hard to keep it as low as we can. If we add more instructions onto the website, it will just be to help them realize what to fill in rather than to make it a lot more complex.
TAIR’s Eva Huala on Forging the Database’s First Curation Partnership with a Journal
This week the Arabidopsis Information Resource, or TAIR, announced a partnership with the journal Plant Physiology that will allow authors to submit gene information to the model organism database at the same time that they submit their manuscripts to the journal for publication.
The relationship is the first of its kind in the model organism database community, and promises to greatly expand the amount of information in TAIR. According to the database organizers, its curators were only able to capture 50 out of 200 Arabidopsis papers published in the journal during 2006 and 2007.
The goal of the new effort is to increase the curation of Arabidopsis gene-function data from the journal to “as close to 100 percent as possible,” according to a statement from TAIR.
BioInform caught up with Eva Huala, director of TAIR, to get some background on this partnership, and whether it could be a sign of things to come for other model organism databases and journals. An edited version of the interview follows.
Can you explain how this partnership with Plant Physiology came about?
We take as our mission to curate as much of the gene function data as possible out of the literature into TAIR, and for quite a long time, actually since the project began, we’ve been constrained as to how much we can do because of our limited team and the amount of literature.
So it’s always been a question of how to prioritize our work, and we’ve tried various strategies, including a gene-by-gene [approach, where we] get all the papers for a certain gene and then get all the information. Then a few years ago we changed to a strategy based on journal impact factor, where we take the top impact factor journals and do those first.
Right now we’re curating about a quarter of all the papers, so we work down from the highest impact factor to the lowest. And of course, in the middle of that set, there are some journals like Plant Physiology and a few others that have a huge number of very important papers, but their impact factor is not so high as to put them into the group that we would do first. We know there’s a lot of really valuable, important data in those papers that we would love to get, but we haven’t been able to stretch our resources to cover those articles.
For a long time, we’ve been trying to leverage the community to help us with this task, and we’ve developed simple methods for researchers to submit data to TAIR … We’ve had that mechanism in place for quite a long time already, where people can just enter the names of their genes, and what the functions of their genes are from the paper, and the reference, and so on.
This has been a discussion for many years in the biocurator community: How are we going to get more community participation? We’re all faced with this tremendous disproportion in our resources versus the literature we have to cover — all the model organism databases have this problem.
So we’ve been throwing these ideas around at these meetings for a long time, and saying, ‘Wouldn’t it be great if we could get the journals to help us,’ but nobody really decided to go ahead and try and see what they would do. Finally, we realized that we had a fairly good connection with Plant Physiology. [TAIR principal investigator] Sue Rhee is on the editorial board and she knew who to contact and she knew there’d be a board meeting at the [American Society of Plant Biologists] meeting last summer. And since our curators were also going to that meeting, we thought it would be a good opportunity to try to pitch this idea.
So Tanya Berardini, our lead curator for gene function, went to the board meeting and gave a 10- or 15-minute presentation [and asked], ‘Is there any way you could capture this information at the time of submission?’ This is really the ideal time to get this information because it has a reference, which is important for us because it’s been peer-reviewed and it’s sort of solid data at that point. It’s trustworthy. At the same time, because it’s just at the point of submission, the postdoc or the graduate student or whoever worked on it is there with their lab notebook and it’s fresh in their mind, and that’s really the premier time to grab that data and get it into TAIR.
So we pitched this idea, and we had various levels where we were hoping they would go with us. The simplest level was just to give us the gene identifiers for Arabidopsis, which have a standard format. If they were willing to do that, we also said, ‘Well, it would be really great if we could also get the function data. This is how we do it at TAIR, here’s our form, and would you consider doing this?’ They were actually very enthusiastic and interested.
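As a rough illustration of that simplest level, a submission form could check that each entry looks like an Arabidopsis locus identifier. The sketch below assumes the commonly used AGI locus format (for example, AT1G01010). The pattern and the function name `is_agi_locus` are illustrative assumptions, not the validation that TAIR or Plant Physiology actually uses.

```python
import re

# Assumed AGI locus format: "AT", a chromosome code (1-5, C for chloroplast,
# M for mitochondrion), "G", then five digits. Illustrative only.
AGI_LOCUS = re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE)


def is_agi_locus(identifier: str) -> bool:
    """Return True if the string matches the assumed AGI locus pattern."""
    return bool(AGI_LOCUS.match(identifier.strip()))
```

A check like this only confirms the shape of the identifier; it cannot tell whether the locus actually exists in the database.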
How does the submission process work?
[Plant Physiology] eventually came up with a web form, which is hosted at the journal site, that’s designed to collect the gene function data on individual genes. The authors are encouraged to fill this out at the time of acceptance of the paper. So we don’t take data from papers that have only been submitted, because they may be sent back for revision or they may be rejected. The time of acceptance is when we try to grab the data.
So they collect the data, and they just opened the website a few weeks ago, in February, so we haven’t yet gotten the first set of data. They’re sending us monthly sets of what they’ve gathered over the previous month. So we’ll be getting something in the next few weeks.
On our end, what we do with that data is translate the gene functions into Gene Ontology terms associated with an experimental method and a reference, and load those into TAIR. They then get sent out and propagated to the various other places where GO annotations go, for example the GO website, which has all the different model organism GO annotations, and they also get into our NCBI submissions. When we do a genome release, we send all that information to NCBI and they actually capture all the GO annotations and put them into the RefSeq record.
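The translation step Huala describes can be pictured as turning an author's free-text entry into a structured annotation. The following is a minimal sketch of that idea, not TAIR's actual data model: the class names, field names, and the `annotate` function are hypothetical, and the GO term is passed in by hand because choosing it is the curator's job. The `attributed_to` field reflects the crediting of submitters that she describes later in the interview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """What an author enters (hypothetical field names)."""
    gene_id: str        # gene identifier
    function_text: str  # free-text description of the gene function
    method: str         # experimental method
    reference: str      # reference ID of the accepted paper
    submitter: str      # person who filled in the form


@dataclass(frozen=True)
class Annotation:
    """A structured record a curator could load (hypothetical)."""
    gene_id: str
    go_term: str
    method: str
    reference: str
    attributed_to: str


def annotate(submission: Submission, go_term: str) -> Annotation:
    """Attach a curator-chosen GO term, keeping method, reference, and credit."""
    return Annotation(
        gene_id=submission.gene_id,
        go_term=go_term,
        method=submission.method,
        reference=submission.reference,
        attributed_to=submission.submitter,
    )
```

Keeping the author's free text separate from the curator's term mirrors the division of labor in the interview: authors describe the function in their own words, and curators pick the ontology term.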
So they get spread far and wide, and Plant Physiology gets more recognition as a result because their name is attached to these GO terms. So that’s their angle: the data published in Plant Physiology gets more visibility, and people are going to go back to those articles and read them, and it may improve the impact factor.
Are you planning on forming similar partnerships with other journals?
We plan to approach other journals after we’ve seen the first few months of submissions, or even before, perhaps. The obvious next target is Plant Cell, because they’re also associated with ASPB, and they also have a lot of high-quality Arabidopsis papers.
If we can get a few more journals to do this, especially these large journals that are publishing a lot of Arabidopsis work, it will free up our curators to do the remainder. Our hope is that if we can get several of these going, we may be able to cover all the literature. That would be really big for the community: to get all the experimental data in a form that is available for computational analysis, or for comparison across species, associating with other plant genomes to see if the genes are similar, and so on.
How many curators do you have working on TAIR now?
We have several, but we’re split among several different projects as well, so the actual [full-time equivalent] time on literature curation is not that large. We have about 2.8 FTEs on gene function, but that includes not just literature curation, but also going to other websites and trying to find information that we can import into TAIR, taking community submissions and processing them, and so on. There are a whole bunch of other tasks they do other than just sitting down and curating the literature.
So it’s probably realistically only about 1 FTE or 1.2 FTE or something like that, although we have some volunteer curators, too. I’ve started recruiting former curators to help us out. We have various ways to get this job done.
This partnership has made it easier for authors to submit this information in a form that’s useful to TAIR, but what incentive is there for them to actually go through with that step?
When they do this, or when they submit the data directly to us outside of the Plant Physiology method, we attribute those GO annotations to the person who submitted them. So their name does appear on the TAIR record that shows the annotation. So they do get some exposure as a result, some recognition that they did send this data into TAIR.
Whereas if a curator curates the paper as part of our normal curation process, that record gets attributed to the curator, not the author of the paper.
If this went live a month ago, it’s probably too early to tell what sort of response you’re seeing.
It is too early to tell, but I’m pretty optimistic. There’s no real enforcement mechanism, but because of the way it’s done, I expect high compliance, because it’s being made part of the submission process. But that remains to be seen. It’s still a big question.
If we get a high compliance rate, it’s going to be a huge amount of data, so I’m pretty excited.
Are other model organism databases considering taking this approach?
There are. We’ve been getting inquiries from other databases, including the tomato database, SGN [Solanaceae Genomics Network]. They’re interested in following in our footsteps. And we’ve also begun discussing with the GO Consortium whether they could put resources toward making a general web submission form that any journal could adopt, because the format that we’re taking is not at all organism-specific. We’re asking for names of genes, and gene function, and experimental methods, and then it gets attached to the PubMed ID, or the reference ID. So this could be applied to any organism and any model organism database. It really could be a very general thing. So I think the GO Consortium is considering whether they can get involved in this, and take the work off the shoulders of the journals and make some sort of web interface that any of them could adopt and display on their website.
I think there’s great potential for a little more general mechanism.
So you guys are really the guinea pigs for this model.
We are, yes. But we’re happy to be in that situation.
Do you have any specific goals or outcomes from this project that you’d like to see over the next six months to a year?
Well, we’d like to see a high compliance rate, and if we do have great success then we’ll be working internally on ways to streamline the checking. Initially we’re going to be doing some checking of the data to make sure that the quality is what we would expect, and we may even contact the submitters if more information is needed.
I think we can refine this process both on the external side by providing more explicit instructions on the journal website if needed to help us get the best quality data, and also on our side to streamline the process of checking and loading the data. We do get a lot of data and we’ll be working on ways to load it more efficiently. I would expect to see that in the next few months.
I suppose you’ll have to find a way to strike a balance in terms of providing enough information so that submitters provide the right type of data, but not so much that it scares them off.
Exactly. We’ve always gone way on the side of simplicity, and not asking for too much, because it is such a problem. People are very busy and they often don’t see submission of data to the database as a priority. And we understand that there are a lot of things for a researcher to do, and we wouldn’t be surprised to hear that this isn’t at the top of their list. So we do try very hard to keep our methods simple.
If you compare the procedure for submitting a record to GenBank or a microarray experiment to one of the repositories, there are pages and pages of submission information. Here we’re just asking for one simple spreadsheet with a very small number of columns, and we actually do the translating of whatever they enter as the gene function into the GO term. We don’t ask them to enter the exact GO term.
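To give a sense of how little such a spreadsheet asks for, here is a small sketch that reads one exported as CSV text and separates complete rows from rows with blanks. The column names in `REQUIRED_COLUMNS` and the function `read_submissions` are assumptions made for illustration; the interview does not list the actual columns. The function text is kept as the author typed it, since the GO term is chosen by curators afterward.

```python
import csv
import io

# Hypothetical column names; the interview only says the sheet has few columns.
REQUIRED_COLUMNS = ("gene", "function", "method", "reference")


def read_submissions(text: str) -> tuple[list[dict], list[int]]:
    """Parse CSV text; return (complete rows, line numbers of incomplete rows)."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")

    complete, incomplete = [], []
    for row in reader:
        # Short rows yield None for absent cells, so normalize to "".
        values = {c: (row[c] or "").strip() for c in REQUIRED_COLUMNS}
        if all(values.values()):
            complete.append(values)
        else:
            incomplete.append(reader.line_num)
    return complete, incomplete
```

Rows flagged as incomplete would be candidates for the follow-up with submitters that Huala mentions, rather than being rejected outright.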