Using OpenRefine to perform text mining on your data: food for thought

This is an introduction to OpenRefine and its named entity extraction extension, called NER, to help you understand when it is worth using.
We have published a short presentation on Slideshare to summarize the topic:


In a recent discussion on the OpenRefine mailing list, someone asked which industries are using OpenRefine:

What industries are using OpenRefine?

Probably the oldest and biggest community is the data driven journalism community. They’re pretty much self-sustaining in that they run their own seminars, teach each other how to use the tool, etc. It’s popular with curators of metadata from libraries to museums to research labs. It’s used by SEO folks normalizing keywords in logs. It’s used by patent attorneys and insect scientists and a whole host of other folks.

Who uses OpenRefine

If you don’t know what you can really do with OpenRefine, compared to other data parsing tools, there is an interesting response published on Open Data Stack Exchange:

  • results of transformation expressions are previewed interactively with live data
  • quick, interactive, filter facets which allow for easy browsing of instances/rows which match a variety of filters
  • exploratory analysis of data to do quick visualization via facets and explore interactions among columns
  • reconciliation of text data against reference data services containing strong identifiers (Freebase, OpenCorporates, any SPARQL or RDF, etc)
  • simple linking of reconciled entities to other info sources like Wikipedia, MusicBrainz, IMDB, etc
  • complete provenance/undo history of all modifications
  • combination of machine smarts and human review for tasks like clustering of names.
  • wide variety of input & output formats including both file formats and online repositories like Google Spreadsheets & Fusion Tables
  • one click selection of record boundaries to produce a grid of data from a JSON or XML API is great for exploring new API endpoints
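
The "clustering of names" point in the list above can be illustrated with a key-collision fingerprint. What follows is a minimal sketch in Python, not OpenRefine's own code: it is a simplified version of the fingerprint idea (no accent folding, for example), and the function names and sample company names are assumptions made for illustration.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Simplified key-collision fingerprint (illustrative, not OpenRefine's exact method)."""
    # Trim, lowercase, and drop ASCII punctuation.
    cleaned = value.strip().lower().translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate and sort the tokens, then rejoin them.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values sharing a fingerprint; keep only groups with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


# Made-up sample data: three spellings of one company, plus an unrelated one.
names = ["Acme Corp.", "acme corp", "Corp Acme", "Globex Inc"]
clusters = cluster(names)
```

The machine proposes the groups; a human still decides whether each cluster should really be merged, which is the "human review" half of that bullet point.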

At SpazioDati, Refine is one of the key tools in our data curation workflow. We use it in several ways, mainly to normalize data from different sources and to export our data modelled in RDF, aligned with our internal ontologies.

OpenRefine is inside our data curation controller

Not surprisingly, the same need is common in other enterprises too, as Phil Simon said:

For all of the talk about Big Data, many hidebound organizations struggle with basic blocking and tackling. That is, their Small Data is a mess.

Small Data and Big Data: two sides of the same coin.

When you need to do entity extraction in Refine

To understand why and when you should use entity extraction inside OpenRefine, it helps to recall what reconciliation is, quoting the “Free your metadata” group:

The goal of reconciliation is to connect your collection-specific vocabulary to a controlled vocabulary on the Web.
For example: does your label instruments indicate musical instruments, measuring instruments, or even aeronautical instruments?
Reconciliation is about giving meaning to field values, making your metadata interpretable by the whole wide world.

Reconciliation identifies keywords in flowing text and gives them a URL.

Why you need reconciliation

If you have a column containing domain-specific text, such as company names or location data, you can use a reconciliation service to normalize that kind of data.
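
To make the idea concrete, here is a rough sketch of what reconciling a single cell value amounts to: matching it against a controlled vocabulary and returning an identifier. It does not call a real reconciliation service. The vocabulary, the example.org URLs, the `reconcile` function and the 0.8 cutoff are all assumptions for illustration, and the fuzzy matching comes from Python's standard `difflib` module.

```python
import difflib

# Hypothetical controlled vocabulary: label -> identifier (the URLs are placeholders).
VOCABULARY = {
    "International Business Machines": "http://example.org/company/ibm",
    "SpazioDati": "http://example.org/company/spaziodati",
    "Trento": "http://example.org/place/trento",
}


def reconcile(value, cutoff=0.8):
    """Return (label, identifier) for the closest vocabulary label, or None if none is close enough."""
    by_lower = {label.lower(): label for label in VOCABULARY}
    matches = difflib.get_close_matches(
        value.strip().lower(), list(by_lower), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    label = by_lower[matches[0]]
    return label, VOCABULARY[label]


# A slightly different spelling still resolves to the same identifier.
match = reconcile("Spazio Dati")
```

A real reconciliation service usually returns several scored candidates and lets you choose among them; this sketch keeps only the best one to stay short.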

But why would you need more than a reconciliation service?
This is how the official book on OpenRefine describes the usage of entity extraction:

Reconciliation works great for those fields in your dataset that contain single terms, such as names of people, countries, or works of art.
However, if your column contains running text, then reconciliation cannot help you, since it can only search for single terms in the datasets it uses. Fortunately, another technique called named entity extraction can help us.

An extraction algorithm searches texts for named entities, which are text elements such as names of persons, locations, values, organizations, and other widely known things.
In addition to just extracting the terms, most algorithms also try to perform disambiguation. For instance, if the algorithm finds Washington in a text, it will try to determine whether the city or the person is mentioned.
This saves us from having to perform reconciliation on the extracted term.

This is Linked Data at its best; what used to be a human-only text field now has machine-interpretable links to the concepts it contains.

A new data curation workflow with NER and OpenRefine
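
The extraction and disambiguation steps described above can be sketched with a toy example. This is only an illustration under strong assumptions: real extraction services rely on far richer models, whereas here a small hand-written gazetteer and a few context cue words stand in for them. The `GAZETTEER` structure, the cue words, the example.org URIs and the `extract_entities` function are all hypothetical.

```python
import re

# Hypothetical gazetteer: surface form -> candidate entities with context cue words.
# The URIs are placeholders, not real identifiers.
GAZETTEER = {
    "Washington": [
        {"uri": "http://example.org/place/washington_dc", "type": "Location",
         "cues": {"city", "capital", "visited", "flew"}},
        {"uri": "http://example.org/person/george_washington", "type": "Person",
         "cues": {"president", "general", "born", "army"}},
    ],
    "Trento": [
        {"uri": "http://example.org/place/trento", "type": "Location", "cues": set()},
    ],
}


def extract_entities(text):
    """Find gazetteer terms in running text and pick the candidate whose cues best fit the context."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    found = []
    for surface, candidates in GAZETTEER.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", text):
            # max() keeps the first candidate on ties, so list order acts as the default.
            best = max(candidates, key=lambda c: len(c["cues"] & words))
            found.append((surface, best["type"], best["uri"]))
    return found


person = extract_entities("Washington was the first president and led the army.")
places = extract_entities("We flew to Washington, the capital, from Trento.")
```

In the first sentence the words "president" and "army" point to the person, while in the second "flew" and "capital" point to the city. Either way the extracted term arrives with a URI already attached, so no separate reconciliation step is needed.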

So, as we can see in the Open Data Community Forum, there are some really useful scenarios where we can successfully use OpenRefine and its NER extension:

  • parsing a curriculum vitae to extract meaningful information, such as names, telephone numbers and emails, as well as company names, dates, role titles, skills, etc.
  • performing text mining on judicial decisions to extract specific information (judge’s name, court, area of law and neutral citation number)
  • extracting relevant news about a precise topic (a person, a brand or a company): useful for data journalists, for example
  • writing a summary of a political speech, starting from the main concepts extracted from the text
  • performing text mining on tweets or on a single webpage to extract brands, places and concepts easily: useful for improving a website’s SEO ranking
  • understanding your own bank account statements: extracting useful information, such as brands and places, to categorize and classify your own expenses. It is like business intelligence on personal data, a topic usually associated with the “Quantified Self” movement
  • [...] your use case: tell us on Twitter using #dataTXT #ner #refine

Perform data mining on Social Media content with OpenRefine

We are working on some OpenRefine tutorials to better explain some of these scenarios: stay tuned!

Some other useful sources on OpenRefine