Cleaning up Messy Data with Google Refine

By Matthew Dublin

Rod Page over at iPhylo has a post describing how useful Google Refine is for cleaning up taxonomic databases. Google Refine, formerly known as Freebase Gridworks, is a freely available, web-based "power tool" that supports TSV, CSV, Excel, and XML file formats. Among other features, it lets users pull together disparate data sets and work with them in a collated, polished form.

Page, a professor of evolutionary biology at the University of Glasgow, is a big fan of Google Refine's "Reconciliation Services," which he uses for matching names to external identifiers.

So far, Page has used Google Refine with EOL, NCBI taxonomy, uBio, WoRMS, and GBIF.

Here's an introduction to Google Refine: