Advanced autocomplete with Solr Ngrams

Published Jul 3, 2013

Overview

The following approach is a good one if you require:

phrase suggestions, not just words
the ability to match user input against multiple fields
multiple fields returned
multiple field values to make up a unique suggestion
suggestion results collapsed (grouped) on a field or fields
the ability to filter the query
images with suggestions

I needed a typeahead suggestion (autocomplete) solution for a textbox that searches titles. In my case, I have a lot of magazines that are broken down so that each page is a document in the Solr index, with metadata that describes its parentage. For example, page 1 of Dungeon Magazine 100 has a title: "Dungeon 100"; a collection: "Dungeon Magazine"; and a universe: "Dungeons and Dragons". (Yes, all the material in my index is related to RPGs in some way.) A magazine like this might consist of 70 pages or so, whereas a sourcebook like the Core Rulebook for Pathfinder, a D&D variant, boasts 578 pages, so title suggestions have to group on title and ignore page counts. Further, the Warhammer 40k game Dark Heresy also has a Core Rulebook, so title suggestions have to differentiate between the two.

To build this typeahead solution, I:

added new Solr field types to schema.xml to support ngram matching
added a /suggest handler to solrconfig.xml that weights matches appropriately
bound the suggestions in JSON format to Twitter's typeahead.js

Example 1: two core rulebooks.

Example 2: "dark" matching in Title and Collection

Add new field types to Solr schema.xml

text_suggest_ngram

For partial matches that will be boosted lower than exact or left-edge matches, e.g. match 'bro' in "A brown fox".
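As a rough illustration of how such a field type can be put together, here is a minimal sketch. It is not the definition from the cominvent project: the choice of tokenizer, the gram sizes and the positionIncrementGap are assumptions picked for readability.

```xml
<!-- Sketch only: not copied from cominvent/autocomplete; gram sizes are assumptions -->
<fieldType name="text_suggest_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split the value into words, so "A brown fox" yields A, brown, fox -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index the leading grams of every word: b, br, bro, brow, brown -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <!-- No grams at query time: the partial word is matched as typed -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this analysis, the input 'bro' matches the indexed gram "bro" produced from "brown", wherever that word sits in the value.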

text_suggest_edge

For left-edge matches, e.g. match 'A bro' but not 'brown' in "A brown fox".
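A left-edge type can be sketched by keeping the whole value as a single token before generating grams. Again, this is an assumption-laden sketch rather than the cominvent definition, and the maximum gram size is arbitrary.

```xml
<!-- Sketch only: not copied from cominvent/autocomplete; gram sizes are assumptions -->
<fieldType name="text_suggest_edge" class="solr.TextField">
  <analyzer type="index">
    <!-- Keep the whole value as one token: "A brown fox" -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index prefixes of the whole value: "a", "a ", "a b", "a br", "a bro", ... -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <!-- The input is kept whole and lowercased, then compared against the prefixes -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because every indexed gram starts at the first character of the value, 'brown' on its own has nothing to match, while 'a bro' does.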

text_suggest

For whole term matches. These will be weighted the highest.
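The whole-term type needs no grams at all. A minimal sketch, assuming plain word tokenization and lowercasing are enough for your data:

```xml
<!-- Sketch only: not copied from cominvent/autocomplete -->
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- The same analysis is used at index and query time: whole, lowercased words -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```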

These field types are taken lock, stock and barrel from https://github.com/cominvent/autocomplete. In that project, the suggest engine takes the form of an entirely separate core; I have simplified matters for myself by adding the field types to my existing schema instead. Great stuff, though.

Make copies of relevant fields in Solr schema.xml

As noted above, the fields in play for me are title, collection, and universe. Note that I also make a string copy of each field to group on.
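The copies might look something like the following for title. The suffixed field names (title_suggest, title_str and so on) are hypothetical, and the sketch assumes a string field type backed by solr.StrField, as in the stock example schema. Collection would follow the same pattern, while universe, which is grouped on but not matched against, would need only the string copy.

```xml
<!-- Hypothetical field names; only title, collection and universe come from the post -->
<field name="title_suggest"       type="text_suggest"       indexed="true" stored="false"/>
<field name="title_suggest_edge"  type="text_suggest_edge"  indexed="true" stored="false"/>
<field name="title_suggest_ngram" type="text_suggest_ngram" indexed="true" stored="false"/>
<!-- Untokenized copy used for grouping -->
<field name="title_str"           type="string"             indexed="true" stored="true"/>

<copyField source="title" dest="title_suggest"/>
<copyField source="title" dest="title_suggest_edge"/>
<copyField source="title" dest="title_suggest_ngram"/>
<copyField source="title" dest="title_str"/>

<!-- universe is grouped on but not searched, so a string copy is enough -->
<field name="universe_str" type="string" indexed="true" stored="true"/>
<copyField source="universe" dest="universe_str"/>
```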

Add /suggest request handler to solrconfig.xml

The /suggest handler looks for user input matches within the suggest fields defined in the qf parameter. Each field has a boost assigned: the higher the boost number, the more a match on that field will contribute to the final document score. I found I had to play around with the boost numbers relative to each other before getting the behaviour I really wanted. Boosting the whole-term text_suggest fields highest was not an automatic route to success. Your mileage may vary.

The pf (phrase fields) parameter is additional to qf: it boosts documents in which the user's input terms appear in close proximity.
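To make the qf and pf description concrete, here is a sketch of the matching part of such a handler. It assumes the edismax query parser, the hypothetical field names title_suggest, title_suggest_edge, title_suggest_ngram and their collection equivalents, and placeholder boost values; none of the numbers are the ones I settled on.

```xml
<!-- Sketch only: field names and boost values are placeholders to tune for your own data -->
<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="wt">json</str>
    <str name="rows">10</str>
    <!-- Fields to match user input against, each with its own boost -->
    <str name="qf">
      title_suggest^30 title_suggest_edge^50 title_suggest_ngram^10
      collection_suggest^20 collection_suggest_edge^30 collection_suggest_ngram^5
    </str>
    <!-- Extra boost when the input terms appear close together in these fields -->
    <str name="pf">title_suggest_edge^50 collection_suggest_edge^30</str>
  </lst>
</requestHandler>
```

Adjust one boost at a time and re-run a handful of representative queries; the relative sizes matter more than the absolute numbers.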

Above, I mentioned that a Solr document in this index equates to a single page from a book. If a book is 50 pages long, then a naive suggester is going to return 50 documents when that book's title is matched. The suggest handler avoids that problem by collapsing (grouping) on the fields in play, which explains why the universe field is referenced there, even though it's not being used to match query input. With grouping, a unique suggestion consists of universe+collection+title. Note that the group.sort and sort parameters differ: the former orders the documents within each group, while the latter determines the order in which suggestions are displayed to the user.
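The grouping parameters could be sketched as below. This post does not show how the three string copies are combined into one grouping key, so the sketch assumes a hypothetical single-valued string field, suggest_key_str, populated at index time with the universe, collection and title values joined together. The sort values are placeholders too.

```xml
<!-- Sketch only: suggest_key_str is a hypothetical field holding universe+collection+title -->
<lst name="defaults">
  <str name="group">true</str>
  <!-- One group per unique universe+collection+title combination -->
  <str name="group.field">suggest_key_str</str>
  <!-- Return a single representative page for each group -->
  <str name="group.limit">1</str>
  <!-- Order of pages inside a group, which decides the representative page -->
  <str name="group.sort">score desc</str>
  <!-- Order of the groups themselves, i.e. the order the user sees -->
  <str name="sort">score desc</str>
  <!-- Stored fields returned with each suggestion -->
  <str name="fl">title,collection,universe</str>
</lst>
```

These entries would sit inside the /suggest handler's defaults alongside qf and pf.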

Conclusion

In a future post, I will describe how I bound the results from the /suggest handler to Twitter's typeahead.js on the front end to produce the suggestions shown in the examples above.