The European Bioinformatics Institute said this week that it will play a role in developing UK PubMed Central, a free online archive of full-text research papers modeled after the US version of PubMed Central maintained by the National Center for Biotechnology Information.
A group of nine UK funding agencies including the Wellcome Trust awarded the UKPMC contract to the British Library, the University of Manchester, and the EBI last week.
Funding for the project, which aims to ensure that research supported by the funding agencies is freely available, was not disclosed.
As the primary contractors on the five-year project, the British Library and the University of Manchester are charged with first developing a mirror version of the US site. The British Library will coordinate the project and help develop a process for handling author submissions, while the University of Manchester will host the service, which will launch in January 2007.
EBI’s role in the project is a bit longer term, Peter Stoehr, head of IT services at the EBI, told BioInform. The institute’s primary effort will be to “enhance the content of the literature by connecting it with biological databases,” including EMBL, GenBank, UniProt, PDB, and others, he said.
The goal, Stoehr said, is to hyperlink all molecular entities mentioned in the PubMed Central archive to their records in public data resources.
“In some cases, these will be quite obvious links like for accession numbers for sequences,” Stoehr said, noting that a number of electronic journals, such as the BMC family of journals, already do this. “But as time goes on, we expect to implement text-mining methods and tools that look for biological terms in the literature and mark these up to connect them with the underlying biological databases. So [it will include] not just accession numbers, but concepts such as gene names or drug names, gene ontology terms, chemical names, protein-protein interactions.”
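The simplest case Stoehr describes, linking accession numbers to database records, can be sketched in a few lines of Python. This is only an illustration and not the EBI's method: the pattern covers just two common accession shapes (one letter plus five digits, or two letters plus six digits), and both the `link_accessions` helper and the `example.org` URL template are hypothetical placeholders.

```python
import re

# Simplified pattern (an assumption for illustration): one letter + five digits,
# or two letters + six digits. Real accession formats are more varied.
ACCESSION_RE = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")

# Placeholder URL template, not a real database endpoint.
URL_TEMPLATE = "https://example.org/sequence/{accession}"


def link_accessions(text: str) -> str:
    """Wrap each accession-like token in an HTML anchor."""

    def to_anchor(match: re.Match) -> str:
        accession = match.group(0)
        url = URL_TEMPLATE.format(accession=accession)
        return f'<a href="{url}">{accession}</a>'

    return ACCESSION_RE.sub(to_anchor, text)


if __name__ == "__main__":
    sentence = "The sequence was deposited under accession AB123456."
    print(link_accessions(sentence))
```

Gene names, drug names, and ontology terms do not follow a fixed pattern like this, which is why the later stages Stoehr mentions would rely on text-mining tools rather than simple pattern matching.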
The US version of PubMed Central already offers similar functionality through a “related material” option on the left sidebar of every article that provides relevant links to other NCBI databases. Ed Sequeira, who manages PubMed Central for NCBI, told BioInform via e-mail that the center “generates links for, and across, all the Entrez databases automatically, using various computational methods. Links between databases are bidirectional and the network of links is updated daily.”
A text-mining research team at EBI led by Dietrich Rebholz-Schuhmann will contribute several software tools to the UK effort, Stoehr said, but he added that he expects other text-mining research groups in Europe to collaborate on the project as well.
These text-mining tools will be used to mark up legacy articles that are already in the archive and will become part of the processing pipeline for newly submitted articles, Stoehr said. Eventually, however, “we feel there may be some kind of mileage in trying to set up some kind of standards for the markup of biological text, so that well-described entities can be marked up at the author submission stage, rather than a text-mining tool doing its best to identify a drug name.”
Stoehr noted, however, that this do-it-yourself process will require “buy-in from publishers, authors, and the community” before it can be implemented.
Another area for consideration is the extent to which an article should be integrated with bioinformatics resources. “We have to be a little bit careful about what the end users will get from PubMed Central. We don’t want to confuse them or blind [them] with marked up documents where every word is a hyperlink to some underlying resource,” Stoehr said.
The project is an example of how the line between peer-reviewed literature and bioinformatics databases is growing increasingly blurred. Two trends are driving this: journals are publishing more articles on high-throughput molecular biology experiments, and these articles are being published in open-access digital formats that let readers mine the full text of each article rather than just the abstract.
“We increasingly view the literature – especially the data-intensive literature – as being an information source like a database, which we ought to integrate into an information system,” Stoehr said. However, he noted, “the main obstacle to making [PubMed Central] a useful resource right now is the quantity of the content.” The resource contains around 500,000 articles, he added, “which is a small percentage of the literature that’s out there.”
Indeed, according to NCBI’s website, there were more than 16 million citations in its Medline database as of June 30.
Initiatives like UK PubMed Central “will clearly help get more important content into the archive, and I would expect this will be followed up in other regions of Europe and the world,” Stoehr said.