I learned an interesting lesson about Solr relevancy tuning when a client asked us to improve their search results. A search for chest tube was ranking a record titled "Heimlich Valve" over a record titled "Understanding Chest Tube Management," and a search for diabetes put "Novolin-Pen Quick Guide" above "My Diabetes Toolkit Booklet," for example.
Solr was using the usual default AnDI (Andornot Discovery Interface) boosts, so what was going wrong?
AnDI default boosts (pf is phrase matching):
qf=title^10 name^7 place^7 topic^7 text
pf=title^10 name^7 place^7 topic^7 text
The high-scoring records without terms in their titles had topic = "chest tube" or topic = "diabetes", yes, but so did the second-place records with the terms in their titles! Looking at the boosts, you would think that the total relevancy score would be a sum of (title score) plus (topic score) plus the others.
Well, you'd be wrong.
In Solr DisMax queries, the total relevancy score is not the sum of contributing field scores. Instead, the highest individual contributing field score takes precedence. It’s a winner-takes-all situation. Oh.
In the samples above, the boost on the incidence of “chest tube” or “diabetes” in the topic field was enough to overcome the title field's contribution, in the context of Solr’s TF-IDF scoring algorithm. That is, it’s not just a matter of “the term is there” versus “the term is not there”: the score is proportional to the number of times the query terms appear in the field and inversely proportional to how often those terms appear across the whole collection of documents. Field and document length matter too, as does whether the term appears nearer the front of the text.
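To see how a short, boosted topic field can edge out a longer title, here is a toy sketch of classic Lucene TF-IDF scoring. The document counts and per-field document frequencies below are invented for illustration, and real Lucene adds query norms and encodes the length norm lossily, so only the shape of the arithmetic is meaningful:

```python
import math

def field_score(query_terms, field_terms, boost, doc_count=1000, doc_freq=50):
    """Toy classic-Lucene similarity: sum of tf * idf * lengthNorm per matching term."""
    idf = 1 + math.log(doc_count / (doc_freq + 1))   # rarer terms score higher
    length_norm = 1 / math.sqrt(len(field_terms))    # shorter fields score higher
    score = 0.0
    for term in query_terms:
        freq = field_terms.count(term)
        if freq:
            score += math.sqrt(freq) * idf * length_norm
    return boost * score

query = ["chest", "tube"]

# Topic field "chest tube" (boost 7): only two terms, so a strong length norm.
# Assume the terms are relatively rare among topic values (doc_freq=30).
topic_score = field_score(query, ["chest", "tube"], boost=7, doc_freq=30)

# Title field "Understanding Chest Tube Management" (boost 10): four terms,
# weaker length norm; assume the terms are more common in titles (doc_freq=60).
title_score = field_score(
    query, ["understanding", "chest", "tube", "management"],
    boost=10, doc_freq=60)
```

With these assumed statistics the boosted topic score comes out higher than the boosted title score, despite the title's bigger boost: the two-term topic field's length norm and rarer terms more than make up the difference. That is exactly the kind of quiet arithmetic that put "Heimlich Valve" on top.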
So I could just ratchet up the boost on the title field and be done with it, right? Well, maybe.
As someone else* has said: DisMax is great for finding a needle in a haystack. It’s just not that good at searching for hay in a haystack.
The client’s collection has a small number of records, and the records themselves are quite short, consisting of a handful of highly focused metadata. The title and topic fields are pithy and the titles are particularly good at summarizing the “aboutness” of the record, so I focused on those aspects when re-arranging relevancy boosts.
New Solr field type: *_notf, a text field for title and topic that does not retain term frequencies or term positions. This means a term hit will not be correlated to term frequency in the field. It is not necessary to take term frequency into account in a title because the title’s “aboutness” isn’t related to the number of times a term appears in it. The logic of term frequency makes sense in the long text of an article, say, but not in the brief phrase that is a title. Or topic.
New Solr fields: title_notf, topic_notf
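In Solr's schema, suppressing term frequencies and positions is a per-field-type option. A sketch of what such a setup might look like in schema.xml follows; the analyzer chain and field attributes here are illustrative assumptions, not the actual AnDI schema:

```xml
<!-- Field type that records only term presence, not frequency or position -->
<fieldType name="text_notf" class="solr.TextField"
           omitTermFreqAndPositions="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title_notf" type="text_notf" indexed="true" stored="false"/>
<field name="topic_notf" type="text_notf" indexed="true" stored="false"/>

<!-- Populate from the original fields so both versions stay in sync -->
<copyField source="title" dest="title_notf"/>
<copyField source="topic" dest="topic_notf"/>
```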
Updated boosts (pf is phrase matching):
qf=title_notf^10 topic_notf^7 text
Note that phrase matching still uses the original version of the title and topic fields, because they index term positions. Thus they can score higher when the terms chest and tube appear together as the phrase “chest tube”.
Also, I added a tie=1.0 parameter to the DisMax scoring, so that the total relevancy score of any given record will be the sum of contributing field scores, like I expected in the first place.
total score = max(field scores) + tie * sum(other field scores)
So, lesson learned. Probably. And the lesson has particular importance to me because the vast majority of our clients are libraries, archives or museums who spend time honing their metadata rather than relying on keyword search across masses of undifferentiated text. Must. Respect. Cataloguer.
* Doug Turnbull, author of both articles above.