The Natural History Museum, London, is midway through a project to convert the wealth of data locked in its card archives into an online database of biodiversity and taxonomic information.
The museum is partnering with the University of Essex on the development of the database, and has enlisted the aid of Boulder, Colo.-based Parascript, whose FieldScript software is being used to translate legacy typewritten, handprinted, or cursive handwritten data into a digital format.
Malcolm Scoble, head of biodiversity at the Natural History Museum, said the first stage of the project involves converting data from 29,000 index cards on the Pyraloidea family of moths. He is pleased with the progress of the project so far; re-typing the data manually would have required an estimated 430 man-years. While the current process is not entirely automated (a team of curators examines the data after it has been scanned in and analyzed by FieldScript), the project is on track to complete the first phase of the database in 18 months.
The VIADOCS (Versatile, Interactive, Archive Document Conversion System) project team at the University of Essex is coordinating the IT side of the project. Andy Downton of the university’s department of electronic systems engineering said the challenges of the museum project are unique, making many optical character recognition packages unsuitable. In particular, many recognition packages had difficulty with the specialized Latinate vocabulary used to describe the specimens. FieldScript, however, was able to identify and categorize the various fields on the cards and associate them with specific database fields with an acceptable error rate, according to Downton.
The VIADOCS team is also developing a web-based interactive verification tool for the project and currently houses the database.
Scoble said the museum intends to make the completed database part of the Species 2000 project at the University of Reading — a collection of 14 databases that currently catalogues over 220,000 species. Similar biodiversity informatics projects are on the rise worldwide, Scoble said. “There’s so much information stored on index cards in natural history museums across the globe and many of them are trying to get it accessible on the web now,” he said.
The UK’s Engineering and Physical Sciences Research Council and Biotechnology and Biological Sciences Research Council are funding the VIADOCS project. Essex has received £125,000 ($175,100), while the museum has received £71,600.
— BT