A ‘big issue’ that university research managers often grapple with is the variable, messy nature of bibliographic data. Authors don’t always use the same names, different people sometimes publish under the same name, research is published at different campuses, locations & countries, publishers & online indices don’t format data consistently, and, of course, typographic & human error can all impact the quality of research activity data.
Typically, cleaning this data is a labor-intensive, manual process – the data needs to be extracted in a format that is useful and manageable, and normally it’s a spreadsheet application that handles the data after it’s been extracted. Because of the nuanced variations and conventions in a bibliographic dataset, cleaning it usually means inspection, comparison and cognitive processing – it’s the sort of work that would make a fantastic study in machine learning, if only there were enough time and budget to teach a computer how to make sense of the data! After cleaning, sorting, de-duplication & re-organising, the data typically needs to be converted into a useful format, then fed back into research information systems, databases and reports.
The traditional approach to managing this data has been to capture and maintain it by hand. However, with increasing frequency it is being automated using scholarly index services like Scopus, Web of Science, Google Scholar & others, and this opens up some big challenges (and exciting possibilities) for keeping research metadata clean, accurate & reliable.
GoogleRefine is a tool that makes managing this type of data really, really easy. In its basic form the tool allows users to load data from various formats and sources, perform transformations and merge the results into something useful. Its advanced features include transforming & exporting data into different formats, looking up and validating values using web services, and scripting complex parsing rules. The most powerful aspect of GoogleRefine is that every change is ‘undo-able’ (if you make a mistake) and repeatable (so that data can be transformed through a series of iterations). GoogleRefine has a small footprint and in most cases can run without administrative rights, which is a big plus – and it’s Open Source and completely free!
So, enough back-story – let’s see GoogleRefine in action! I’m going to skip the basics of installation etc and jump right into a real-world example, but if you need help getting set up, there is plenty of easy-to-follow documentation available (http://code.google.com/p/google-refine/wiki/GettingStarted). Here’s what the tool should look like once it’s up and running:
My demonstration is going to be based on cleaning up a list of authors’ names – the data is fictional but the problem is representative of the typical jobs for which I use Refine. You’ll notice that the following list of names contains variations in title, typographic errors and different ordering:
Mr Bryan Albright
Dr Regina Troupe
Ms Natasha Lewis
Prof. Amy Seal
Dr Leblanc, Keith
Dr Keith Leblanc
Dr Melinda Chester
Mr Brandon Runnels
Mr Michael Redd
Prof Suzanne Rawlins
Dr Brandon Runnels
Prof Susanne Rawlins
First off, we need to load the data into Refine…
We just need to select ‘CSV’ as the input format, customise the character encoding (if needed), and assign a name to our project (I’m using “Names”), then click “Create Project”. Now we’re on to the exciting stuff…
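For readers who like to see the mechanics behind this step, parsing the same CSV data programmatically looks like this in Python (a rough sketch – the inline sample data stands in for the file we just loaded into Refine):

```python
import csv
import io

# A stand-in for the first few rows of our "Names" CSV (illustrative data only).
raw = "Author\nMr Bryan Albright\nDr Regina Troupe\n"

# csv.DictReader maps each row onto the header, much like Refine's column view.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["Author"])  # Mr Bryan Albright
```

When reading from a real file you would pass the appropriate `encoding` argument to `open()`, which is the programmatic equivalent of the character-encoding option in Refine’s import screen.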
Clicking on the name of any column in our dataset opens the Facet menu, where various grouping, sorting and transformation functions can be accessed. Let’s create a basic text facet by clicking Author -> Facet -> Text facet.
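Under the hood, a text facet is simply a count of the distinct values in a column. A rough Python equivalent (the author values below are taken from our sample list) would be:

```python
from collections import Counter

# Three rows from the Author column, including a duplicate entry.
authors = ["Dr Leblanc, Keith", "Dr Keith Leblanc", "Dr Keith Leblanc"]

# Counter groups identical values and counts them, just as the facet panel does.
facet = Counter(authors)
for value, count in facet.most_common():
    print(f"{value}  ({count})")
```

Note that, exactly as in Refine, this grouping only matches values that are byte-for-byte identical – “Dr Leblanc, Keith” still sits apart from “Dr Keith Leblanc”, which is where clustering comes in.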
Here, using the facet’s Cluster feature, we’ve picked up two values for Keith Leblanc with different formatting. Select Merge and Close – we’re returned to our original project view, only now we can see that the two original values representing Keith Leblanc have been merged into one. When it comes to clustering functions, there’s no shortage of options – for example:
Nearest Neighbor – Levenshtein Distance
Nearest Neighbor – PPM
Metaphone analysis (comparison based on how strings “sound” when spoken)
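To give a feel for what the first of these methods is doing, here is a minimal Python sketch of nearest-neighbour matching with Levenshtein (edit) distance – not Refine’s actual implementation, just an illustration using names from our sample list:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance: the minimum number of
    # single-character insertions, deletions and substitutions to turn a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

names = ["Prof Suzanne Rawlins", "Prof Susanne Rawlins", "Dr Melinda Chester"]

# Pair up names within a small edit distance, as a nearest-neighbour cluster would.
clusters = [(a, b) for i, a in enumerate(names)
            for b in names[i + 1:] if levenshtein(a, b) <= 2]
print(clusters)  # [('Prof Suzanne Rawlins', 'Prof Susanne Rawlins')]
```

The Suzanne/Susanne pair differs by a single substitution, so it falls inside the distance threshold and is flagged as a likely match – exactly the kind of near-duplicate Refine surfaces for us to merge.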
When we’re satisfied with our matches, we simply merge & then export the data to whatever format we need. We can even define our own data format templates, but that’s a topic for another day.
As we’ve seen, the tool is really flexible but this tutorial has only scraped the surface of Refine’s features. It’s flexible, scriptable & its true power lies in the ability to quickly and conveniently look up and validate data in external APIs & services. Stay tuned for another post in the coming weeks on how we can combine Refine with the Elsevier Scopus API to extract a count of indexed publication records for our authors.
GoogleRefine – http://code.google.com/p/google-refine/
GoogleRefine 2.0 Introduction – http://www.youtube.com/watch?v=B70J_H_zAWM
Getting Started Guide – http://code.google.com/p/google-refine/wiki/GettingStarted