
Problem

Numbers mixed with alphabetic characters are sorted lexically in Solr. That means that 10 comes before 2, like this:

  • Title No. 1
  • Title No. 10
  • Title No. 100
  • Title No. 2

Solution

To force numbers to sort numerically, we need to left-pad any numbers with zeroes: 2 becomes 0002, 10 becomes 0010, 100 becomes 0100, et cetera. Then even a lexical sort will arrange values like this:

  • Title No. 1
  • Title No. 2
  • Title No. 10
  • Title No. 100

The Field Type

This alphanumeric sort field type converts any numbers found to 6 digits, padded with zeroes. (If you expect numbers larger than 6 digits in your field values, you will need to increase the number of zeroes when padding.)

The field type also removes English and French leading articles, lowercases, and purges any character that isn’t alphanumeric. It is English-centric, and assumes that diacritics have been folded into ASCII characters.
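A minimal sketch of such a field type, assuming a KeywordTokenizer-based analysis chain (the field type name and exact patterns here are illustrative, not the published definition):

<fieldType name="alphaNumericSort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
        <!-- treat the whole value as a single token -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- strip English and French leading articles -->
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="^(a |an |the |le |la |les |l')" replacement="" replace="first"/>
        <!-- purge anything that isn't a letter or digit -->
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="[^a-z0-9]" replacement="" replace="all"/>
        <!-- left-pad every run of digits with zeroes... -->
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="(\d+)" replacement="000000$1" replace="all"/>
        <!-- ...then keep only the last six digits of each padded run -->
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="0*(\d{6})" replacement="$1" replace="all"/>
    </analyzer>
</fieldType>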

Sample output

Title No. 1 => titleno000001
Title No. 2 => titleno000002
Title No. 10 => titleno000010
Title No. 100 => titleno000100


I just have to share this voyage of discovery, because I have wallowed in the doldrums of despair and defeat the last couple of days, only finding the way this morning, in 15 minutes, after sleeping on it. Isn't that always the way?

My Scylla and Charybdis were a client's oral history master and tracks textbases. The master record becomes the primary document in Solr, while the tracks atomically update that document. We've done this before: each track contributes an audio file to the document's list of media. No problem, it's easy to append something new to a primary document.

However, each track also has its own subjects, names and places, depending on the contents of the audio track. These also need to be appended to the primary document. Easy, right? Well, no. It is easy to blindly append something, but you start getting repeats in the primary document. For instance, if the name 'Blackbeard' is in the metadata for 8 out of 10 tracks, the primary document ends up with name=Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard. You get the picture.

Okay, so let's look in the existing primary record to see if Blackbeard already... oh, wait. You can't get at the existing values while doing an atomic update. Hm.

Ah, we can 'remove' values matching Blackbeard, then 'add' Blackbeard. That should work. And it does. But what about multiple entries coming out of Inmagic like 'Blackbeard|Kidd, William'? Dang it: that string doesn't match anything, so neither name gets removed, and we're back to multiples of each name. We'll need to script a split on the pipe before remove/add.

Split happening: great, great. Now 'Blackbeard' and 'Kidd, William' are going in nicely without duplication. Oh. But wait, what about when multiple textbase fields map to the same Solr field? For example, HistoricNeighbourhood and PlanningArea => place?

And here the tempest begins. It's relatively simple to deal with multiple mappings, or multiple Inmagic entries. But not both. The reason is that now the object representing all the possible values is a Java ArrayList, which doesn't translate perfectly to any JavaScript type. You can't treat it like an array and deal with the values separately, nor can you treat it like a string and split it to create an array. You can't enumerate it, you can't cast it; it's a black box that is elusive beyond imagining.

Everything I tried, failed. It was dismal. It was all the more maddening because it seemed like it should have been such a simple thing. "Appearances can be deceiving!" shouted the universe, putting its boot-heel to my backside again and again.

Finally this morning, a combination of transformers (including regex) saved my bacon and I am eating the bacon and now I want to lie down for a while, under a blanket made of bacon.

The Technical

I'm using a RegexTransformer to do the splits, THEN a script transformer to remove-and-append.

In Solr DataImportHandler config XML:

 

<entity 
    name="atomic-xml"
    processor="XPathEntityProcessor"
    datasource="atomic"
    stream="true"
    transformer="RegexTransformer,script:atomicTransform"
    useSolrAddSchema="true"
    url="${atomic.fileAbsolutePath}"
    xsl="xslt/dih.xsl"
>
    <!--
        Sequential order of transformers important: regex split, THEN script transform.
        Handles multiple entries plus multiple mappings. E.g.
        <field name="name_ignored">Kyd, William|Teach, Edward</field>
        <field name="name_ignored">Rackham, John</field>
    -->
    <field column="name_ignored" sourceColName="name_ignored" splitBy="\|" />
    <field column="place_ignored" sourceColName="place_ignored" splitBy="\|" />
    <field column="topic_ignored" sourceColName="topic_ignored" splitBy="\|" />

</entity>

 

In Solr DIH script transformer:

 

var atomic = {};

// Build an atomic-update instruction for one multi-valued Solr field:
// remove any existing occurrence of the incoming value, then add it back,
// so repeated track imports don't duplicate entries on the primary document.
atomic.appendTo = function (field, row) {

    var val = row.get(field + '_ignored');
    if (val === null) return;

    var hash = new java.util.HashMap();
    hash.put('remove', val);
    hash.put('add', val);
    row.put(field, hash);

};

// Called once per row by the DataImportHandler (see transformer="...script:atomicTransform").
var atomicTransform = function (row) {
    atomic.appendTo('name', row);
    atomic.appendTo('topic', row);
    atomic.appendTo('place', row);
    return row;
};

 

Overview

The following approach is a good one if you require:

  • phrase suggestions, not just words
  • the ability to match user input against multiple fields
  • multiple fields returned
  • multiple field values to make up a unique suggestion
  • suggestion results collapsed (grouped) on a field or fields
  • the ability to filter the query
  • images with suggestions


I needed a typeahead suggestion (autocomplete) solution for a textbox that searches titles. In my case, I have a lot of magazines that are broken down so that each page is a document in the Solr index, and has metadata that describes its parentage. For example, page 1 of Dungeon Magazine 100 has a title: "Dungeon 100"; a collection: "Dungeon Magazine"; and a universe: "Dungeons and Dragons". (Yes, all the material in my index is related to RPG in some way.) A magazine like this might consist of 70 pages or so, whereas a sourcebook like the Core Rulebook for Pathfinder, a D&D variant, boasts 578 pages, so title suggestions have to group on title and ignore page counts. Further, the Warhammer 40k game Dark Heresy also has a Core Rulebook, so title suggestions have to differentiate between them.

To build this typeahead solution, I:

  • added new Solr field types to schema.xml to support ngram matching
  • added a /suggest handler to solrconfig.xml that weights matches appropriately
  • bound the suggestions in JSON format to Twitter's typeahead.js

 

Example 1: two core rulebooks.

 

Example 2: "dark" matching in Title and Collection

 

 

Add new field types to Solr schema.xml

 

text_suggest_ngram

For partial matches that will be boosted lower than exact or left-edge matches, e.g. match 'bro' in "A brown fox".
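A rough sketch of this type, assuming a standard tokenizer with an edge n-gram filter applied per token at index time (the gram sizes and exact filter chain are my approximation, not the published definition):

<fieldType name="text_suggest_ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- n-gram each token so 'bro' can match 'brown' anywhere in the field -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>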

 

 

text_suggest_edge

For left-edge matches, e.g. match 'A bro' but not 'brown' in "A brown fox".
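A comparable sketch, assuming the whole field value is kept as a single token and n-grammed from its left edge only (again an approximation of the original):

<fieldType name="text_suggest_edge" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <!-- keep the whole value as one token, then n-gram its left edge -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>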

 

text_suggest

For whole term matches. These will be weighted the highest.
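This one is the simplest; a sketch, assuming plain tokenization with no n-gramming so only complete words match:

<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>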

These field types are taken lock, stock and barrel from https://github.com/cominvent/autocomplete. In that project, the suggest engine takes the form of an entirely separate core - I have simplified matters for myself. Great stuff, though.

 

Make copies of relevant fields in Solr schema.xml

As noted above, the fields in play for me are title, collection, and universe. Note I am also making a string copy of each to group on.
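For the title field, the copies might look roughly like this (field names are illustrative; collection and universe follow the same pattern):

<field name="title_suggest"       type="text_suggest"       indexed="true" stored="true"/>
<field name="title_suggest_edge"  type="text_suggest_edge"  indexed="true" stored="false"/>
<field name="title_suggest_ngram" type="text_suggest_ngram" indexed="true" stored="false"/>
<!-- string copy used only for grouping suggestions -->
<field name="title_group"         type="string"             indexed="true" stored="true"/>

<copyField source="title" dest="title_suggest"/>
<copyField source="title" dest="title_suggest_edge"/>
<copyField source="title" dest="title_suggest_ngram"/>
<copyField source="title" dest="title_group"/>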

 

Add /suggest request handler to solrconfig.xml

The /suggest handler looks for user input matches within the suggest fields defined in the qf parameter. Each field has a boost assigned: the higher the boost number, the more a match on that field will contribute to the final document score. I found I had to play around with the boost numbers relative to each other before getting the behaviour I really wanted. Boosting the whole-term text_suggest fields highest was not an automatic route to success. Your mileage may vary.

The pf parameter is additional to qf: it boosts documents in cases where user input terms appear in close proximity.

Above, I mentioned that a Solr document in this index is equated with a single page from a book. If a book is 50 pages long, then a naive suggester is going to return 50 documents when that book's title is matched. The suggest handler avoids that problem by collapsing (grouping) on the fields in play, which explains why the universe field is referenced there, even though it's not being used to match query input. With grouping, a unique suggestion consists of universe+collection+title. Note that the group.sort and sort parameters differ: the former must produce valid groups, while the latter determines the order in which suggestions are displayed to the user.
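A sketch of the kind of handler described above (the boost values, field names, and grouping parameters are illustrative assumptions rather than the exact configuration; the real handler groups so that a unique suggestion is universe+collection+title):

<requestHandler name="/suggest" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="wt">json</str>
        <str name="rows">10</str>
        <str name="fl">title,collection,universe</str>
        <!-- whole-term matches weighted highest, then left-edge, then ngram -->
        <str name="qf">
            title_suggest^30 collection_suggest^30
            title_suggest_edge^10 collection_suggest_edge^10
            title_suggest_ngram^5 collection_suggest_ngram^5
        </str>
        <!-- boost documents where the input terms appear close together -->
        <str name="pf">title_suggest^50 collection_suggest^50</str>
        <!-- collapse page-level documents into one suggestion per title -->
        <str name="group">true</str>
        <str name="group.field">title_group</str>
        <str name="group.sort">score desc</str>
        <str name="sort">score desc</str>
    </lst>
</requestHandler>

A request then looks something like /suggest?q=dark, and the grouped JSON response is what gets bound to the typeahead on the front end.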

Conclusion

In a future post, I will describe how I bound the results from the /suggest handler to Twitter's typeahead.js on the front end to produce what is seen in the screenshots above.


In 2010 Andornot was approached to develop a system to manage the patient education materials produced and recommended by Fraser Health Authority (FHA) staff. FHA serves over 1.6 million people and employs 26,000 staff spread out over 12 acute care hospitals and numerous other facilities.

The challenge was to identify and review all existing patient education resources, both in hard copy and electronic formats. Types of material included general informational pamphlets or posters on topics such as smoking cessation or breastfeeding, plus procedure-based factsheets and discharge instructions. Added to the mix were the multiple language versions created to serve the large Asian, Indo-Canadian, Korean, and Filipino populations, and the fact that there were multiple similar pamphlets on popular topics, as many hospitals had developed these on their own.

Andornot provides similar patient education resource systems to Vancouver Coastal Health and McGill University Health Centre, and so had considerable experience in the workflow associated with managing these types of publications.

FH Patient Education search results

We chose to emulate various well-respected websites for patient health information and index every item by multiple categories, including Disorders & Conditions, Body Location, and Demographic, as well as an uncontrolled Keywords field and MeSH headings. In addition, fields were added for FHA Program information and site locations. Considerable planning went into the choice and values of each category in order to create facets that allow users to easily limit or expand their searches.

We prototyped the catalogue aspect of the system using Inmagic DB/TextWorks, and a library technician with a medical transcriptionist background was hired to do the first pass of data entry. DB/TextWorks is a great tool for this, as the scope inevitably changed during the project as different types of resources were uncovered, and integration with the FHA Print Shop was added to facilitate the ordering of multiple copies.

Once a thousand or so items had been catalogued and the system parameters finalized, we transitioned to a SQL Server database with a Solr-based front-end search using our Andornot Discovery Interface (AnDI). This allows us to better specify relationships between documents, e.g. between multiple language versions, and it supports versioning. Searching for medical terms can be challenging, with acronyms and abbreviations as well as problems with correct spelling.

Apart from the refine-by-facets capability, the new system features automatic truncation and a Did You Mean capability: for example, if a user types “anasthesia” they will be directed to the correct spelling.

Did You Mean example

The collection includes full-text documents created by healthcare professionals in FHA, plus links to the URLs of full-text documents created by other reliable organizations. FHA professionals are reviewing each publication for appropriate content and for compliance with plain language and formatting standards.

An authentication system is in place to limit who can see what, and from where, depending on the Status field of each item. Many of the resources are best viewed with a health care professional, so access to these will be limited to PCs within the IP ranges of the FHA facilities. The aim is that in 2013, the catalogue will also be made available to the public through a link from the FHA website. In the meantime, this direct link shows just a small subset of materials already approved for the public at large. The default AnDI search results are displayed by relevance, but we were able to boost certain parameters to display the active and English-language items first.

FH Admin
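In Solr terms, the relevance preference mentioned above can be expressed as a boost query in the search handler defaults. A hypothetical illustration (the field names and values are assumptions, not AnDI's actual configuration):

<!-- favour active, English-language records in the default ordering -->
<str name="bq">status:Active^10 language:English^5</str>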

FHA staff are able to submit new patient education resources for evaluation using the resource submission function. These show up immediately with an In Process status so that other FHA staff can see whether a similar publication already exists or is under development.

Behind the scenes is an extensive administrative interface to allow FHA authorized staff to edit records, make batch changes to lookup fields and export reports of downloads for statistical analysis. The system has only been launched officially to the ER departments but it already provides a fascinating snapshot of the most in demand resources, thus helping guide ongoing review and translation priorities.

“Emergency departments in FH are now able to access the patient education catalogue to download patient discharge instructions as they send patients home. Plus they are able to capture reports showing how regularly each specific item is utilized. FH physicians, staff and volunteers are looking forward to accessing and sharing their patient education materials from across the health authority, to enhance the experience of patients, clients, residents and the public served by FH.” [Kathy Scarborough, MSN, RN, Clinical Practice Consultant, Professional Practice and Integration, Fraser Health]

This new system showcases Andornot’s expertise in both designing and implementing a custom, complex web application over a multi-year period.

Please contact us to discuss how we can help you develop a similar patient education system or for any other projects.

The Canadian Conservation Institute (CCI) in Ottawa, a long-time Andornot client, required a new version of their bilingual online catalogue and staff bibliography that would pass the strict requirements of W3C’s Web Content Accessibility Guidelines (WCAG). Andornot helped CCI boost the requirement into an opportunity to add new features, including facets, multi-database search, spelling suggestions, and faster search performance.

The CCI Library has one of the largest conservation and museology collections in the world. The collections are regarded as an important source for conservation and museology literature on a wide variety of topics, such as preventive conservation, industrial collections, architectural heritage, fire and safety protection, museum planning, archaeological conservation, preservation in storage and display, exhibition design, disaster preparedness, and museum education. The holdings include a large selection of books on textiles, furniture, paintings, sculptures, prints and drawings, and archaeological and ethnological objects.

-- "CCI Library". Canadian Conservation Institute. Retrieved 4 July 2012.

The upgraded website uses the Andornot Discovery Interface (AnDI for short), a modern and highly configurable web application that tempers cutting-edge open source search technology with many years of Andornot experience in search-focused design.

It was possible to meet WCAG compliance because AnDI provides complete control over every HTML tag and CSS statement. The HTML5 structure presents a clean cross-browser template that reads well on mobile devices and has backwards-compatible support for older browsers.

The CCI Library's French and English versions were created with AnDI's built-in multilingual support, and are triggered through the presence of "en" or "fr" in the URL. Moving from one to the other is a smooth transition: a user can switch the page language at any time without interrupting their experience or being redirected to a start page. Even errors and page-not-found messages are bilingual.

Facets and spelling suggestions (and many other features) are made possible by AnDI's open source search technology: Apache Solr. Solr is blazing fast, optimized for full-text search on the web, and relied on by some of the biggest names on the internet.

Every page is bookmarkable because the URL always holds the information needed to reconstruct the page. This makes the site friendly to permalinks and Search Engine Optimization (SEO).

The CCI Library retains its catalogue and staff bibliography collections in separate Inmagic DB/TextWorks databases that staff continue to update through the familiar desktop interface. Updates are extracted and indexed by Solr automatically on a regular basis via Andornot's Data Extraction Utility (internally nicknamed 'Extract-o-matic'), run from a PowerShell script. The index schema is a Dublin Core-derived metadata element set that Andornot helped map to both collections.

 

AnDI can be configured to reflect any field set from any data store or database, as well as rich documents such as PDF and Word, images with EXIF metadata, etc. Contact Andornot about AnDI for your own collection.
