Using Named Entity Recognition to Generate Searchable Metadata

Published Jan 5, 2016

Ask any librarian and they'll tell you that good metadata makes for a positive and productive search experience for users. Trying to find resources about a historic person or place, produced in a particular time period, and especially about a specific topic, is always more easily achieved when resources have been analyzed and described by a trained professional, with metadata applied from a controlled vocabulary, a process long known as "cataloguing".

Sure, search engines do an ever better job of returning relevant search results based only on the full text of a resource, with little or no metadata, thanks to some pretty sophisticated algorithms. Google is a giant because Google works! And even the Apache Solr search engine in our Andornot Discovery Interface and VuFind is impressive in its ability to parse and return meaningful results from large amounts of non-catalogued, metadata-free text.

But good metadata, applied by a librarian, archivist, curator or other skilled person, is still an even better source of data for a search engine. However, producing it does take time and staff resources. So, many have asked, "What if a computer could help me figure out what this resource is about, who is mentioned in it, and where and when it takes place? What if the computer could extract the full text as well as metadata from a resource?"

We're very interested in some work being done on this. While automated subject analysis is still challenging, the Natural Language Processing Group at Stanford University has produced a Named Entity Recognition engine that shows great promise. In a nutshell, this engine does a fine job of reading a passage of text of any length and finding within it the names of people, organizations and locations.

Here's an example of a passage of text processed by the engine, with entities identified.

The screenshot shows that the engine did a pretty good job of identifying the names of people, organizations and places. This metadata can be used for increased searching options in a search engine, or fed back into a database for review and editing (as the engine may not always be perfect, there's still a role for professional review).
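For readers who prefer text to a screenshot, here is a minimal Python sketch of how tagged output from such an engine might be turned into a list of entities. It assumes the engine emits "word/LABEL" tokens (the Stanford engine can produce output in this style) with the labels PERSON, ORGANIZATION and LOCATION, and O for everything else. The sample sentence is made up for illustration and is not real engine output, and the function names are our own.

```python
from itertools import groupby


def parse_slash_tags(tagged):
    """Split 'word/LABEL' tokens into (word, label) pairs."""
    pairs = []
    for token in tagged.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains a slash keeps it.
        word, _, label = token.rpartition("/")
        pairs.append((word, label))
    return pairs


def group_entities(pairs):
    """Merge runs of adjacent tokens that share a label into one entity."""
    entities = []
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        if label == "O":  # "O" marks tokens that are not part of any entity
            continue
        entities.append((" ".join(word for word, _ in run), label))
    return entities


# Hand-written sample in the assumed format (not actual engine output).
sample = (
    "Jane/PERSON Smith/PERSON joined/O the/O Riverside/ORGANIZATION "
    "Historical/ORGANIZATION Society/ORGANIZATION in/O Vancouver/LOCATION ./O"
)

entities = group_entities(parse_slash_tags(sample))
for text, label in entities:
    print(label, text)
```

One known weakness of this simple grouping: two different names of the same type that sit side by side with nothing between them would be merged into one entity, which is one more reason to keep a person in the loop for review.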

We're researching the possible uses of this with some of our projects, such as those built from the Andornot Discovery Interface (AnDI). When importing the full text of documents, that text will be run through a Named Entity Recognition engine to generate name and place metadata. For unstructured data, this may prove to be a great means of populating the Names facet, for example.
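To give a rough idea of what that import step could look like, here is a short Python sketch that folds recognized entities into a record ready for indexing. Everything in it is an assumption for illustration: the field names ("names", "organizations", "places"), the record layout and the function name are hypothetical and do not reflect AnDI's actual schema or import process.

```python
from collections import defaultdict

# Hypothetical mapping from entity labels to facet field names.
FACET_FIELDS = {
    "PERSON": "names",
    "ORGANIZATION": "organizations",
    "LOCATION": "places",
}


def build_document(doc_id, full_text, entities):
    """Combine full text and recognized entities into one index record.

    `entities` is a list of (text, label) pairs. Repeated values are kept
    only once per facet, in order of first appearance.
    """
    facets = defaultdict(list)
    for text, label in entities:
        field = FACET_FIELDS.get(label)
        if field is None:  # ignore labels we have no facet for
            continue
        if text not in facets[field]:
            facets[field].append(text)

    document = {"id": doc_id, "text": full_text}
    document.update(facets)
    return document


# Made-up example record; the entity list would come from the engine.
doc = build_document(
    "doc-1",
    "Jane Smith joined the Riverside Historical Society in Vancouver.",
    [
        ("Jane Smith", "PERSON"),
        ("Riverside Historical Society", "ORGANIZATION"),
        ("Vancouver", "LOCATION"),
        ("Jane Smith", "PERSON"),  # a repeat mention, dropped from the facet
    ],
)
print(doc)
```

A record shaped like this could be sent to the search engine as-is, or written back to the source database first so that staff can review and correct the generated names before they appear in a facet.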

Stay tuned to this blog for further results, or contact us to discuss your collections and how they could be made more accessible with AnDI and Named Entity Recognition.