Skip to the content Back to Top

I just have to share this voyage of discovery, because I have wallowed in the doldrums of despair and defeat the last couple of days, only finding the way this morning, in 15 minutes, after sleeping on it. Isn't that always the way?

My Scylla and Charybdis were a client's oral history master and tracks textbases. The master record becomes the primary document in Solr, while the tracks atomically update that document. We've done this before: each track contributes an audio file to the document's list of media. No problem, it's easy to append something new to a primary document.

However, each track also has its own subjects, names and places, depending on the contents of the audio track. These also need to be appended to the primary document. Easy, right? Well, no. It is easy to blindly append something, but you start getting repeats in the primary document. For instance, if the name 'Blackbeard' is in the metadata for 8 out of 10 tracks, the primary document ends up with name=Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard,Blackbeard. You get the picture.

Okay, so let's look in the existing primary record to see if Blackbeard already... oh, wait. You can't get at the existing values while doing an atomic update. Hm.

Ah, we can 'remove' values matching Blackbeard, then 'add' Blackbeard. That should work. And it does. But what about multiple entries coming out of Inmagic like 'Blackbeard|Kidd, William'? Dang it: that string doesn't match anything, so neither name gets removed, and we're back to multiples of each name. We'll need to script a split on the pipe before remove/add.

Split happening: great, great. Now 'Blackbeard' and 'Kidd, William' are going in nicely without duplication. Oh. But wait, what about when multiple textbase fields map to the same Solr field? For example, HistoricNeighbourhood and PlanningArea => place?

And here the tempest begins. It's relatively simple to deal with multiple mappings, or multiple Inmagic entries. But not both. The reason is that now the object representing all the possible values is a Java ArrayList, which doesn't translate perfectly to any javascript type. You can't treat it like an array and deal with the values separately, nor can you treat it like a string and split it to create an array. You can't enumerate it, you can't cast it, it's a black box that is elusive beyond imagining.

Everything I tried, failed. It was dismal. It was all the more maddening because it seemed like it should have been such a simple thing. "Appearances can be deceiving!" shouted the universe, putting its boot-heel to my backside again and again.

Finally this morning, a combination of transformers (including regex) saved my bacon and I am eating the bacon and now I want to lie down for a while, under a blanket made of bacon.

The Technical

I'm using a RegexTransformer to do the splits, THEN a script transformer to remove-and-append.

In Solr DataImportHandler config XML:


        Sequential order of transformers important: regex split, THEN script transform.
        Handles multiple entries plus multiple mappings. E.g.
        <field name="name_ignored">Kyd, William|Teach, Edward</field>
        <field name="name_ignored">Rackham, John</field>
    <field column="name_ignored" sourceColName="name_ignored" splitBy="\|" />
    <field column="place_ignored" sourceColName="place_ignored" splitBy="\|" />
    <field column="topic_ignored" sourceColName="topic_ignored" splitBy="\|" />



In Solr DIH script transformer:


var atomic = {};

atomic.appendTo = function (field, row) {

    var val = row.get(field + '_ignored');
    if (val === null) return;

    var hash = new java.util.HashMap();
    hash.put('remove', val);
    hash.put('add', val);
    row.put(field, hash);


var atomicTransform = function (row) {
    atomic.appendTo('name', row);
    atomic.appendTo('topic', row);
    atomic.appendTo('place', row);    
    return row;


MS Word uses characters from the Windows-1252 character encoding set which are not represented in ASCII or ISO-8859-1. This is often a pain in the butt. Special characters include:

  • the… ellipsis
  • ‘smart’ “quotes”
  • en – dash and em — dash
  • dagger † and double dagger ‡
  • and more, but these are most common.

If you want to replace them with ASCII cognates, here's a function to do that. (Daggers don't have cognates as far as I know.)


/// Replaces commonly-used Windows 1252 encoded chars that do not exist in ASCII or ISO-8859-1 with ISO-8859-1 cognates.
var replaceWordChars = function(text) {
    var s = text;
    // smart single quotes and apostrophe
    s = s.replace(/[\u2018\u2019\u201A]/g, "\'");
    // smart double quotes
    s = s.replace(/[\u201C\u201D\u201E]/g, "\"");
    // ellipsis
    s = s.replace(/\u2026/g, "...");
    // dashes
    s = s.replace(/[\u2013\u2014]/g, "-");
    // circumflex
    s = s.replace(/\u02C6/g, "^");
    // open angle bracket
    s = s.replace(/\u2039/g, "<");
    // close angle bracket
    s = s.replace(/\u203A/g, ">");
    // spaces
    s = s.replace(/[\u02DC\u00A0]/g, " ");
    return s;

C# extension method

public static string ReplaceWordChars(this string text)
            var s = text;
            // smart single quotes and apostrophe
            s = Regex.Replace(s, "[\u2018\u2019\u201A]", "'");
            // smart double quotes
            s = Regex.Replace(s, "[\u201C\u201D\u201E]", "\"");
            // ellipsis
            s = Regex.Replace(s, "\u2026", "...");
            // dashes
            s = Regex.Replace(s, "[\u2013\u2014]", "-");
            // circumflex
            s = Regex.Replace(s, "\u02C6", "^");
            // open angle bracket
            s = Regex.Replace(s, "\u2039", "<");
            // close angle bracket
            s = Regex.Replace(s, "\u203A", ">");
            // spaces
            s = Regex.Replace(s, "[\u02DC\u00A0]", " ");
            return s;


Highlight words and phrases within specified elements on the page. Search syntax is trimmed or eradicated, stopwords and words less than 3 characters  are ignored.


You will need:

   2: /*
   3:     methods to help highlight search words and terms 
   4:     depends on jquery
   5:     Peter Tyrrell, August 2009
   6: */
   8: // 
   9: var highlightTermsIn = function(jQueryElements, terms) {
  10:     var wrapper = ">$1<b style='font-weight:normal;color:#000;background-color:rgb(255,255,102)'>$2</b>$3<";
  11:     for (var i = 0; i < terms.length; i++) {
  12:         var regex = new RegExp(">([^<]*)?("+terms[i]+")([^>]*)?<","ig");
  13:         jQueryElements.each(function(i) {
  14:             $(this).html($(this).html().replace(regex, wrapper));
  15:         }); 
  16:     };
  17: }
  19: // returns array of unique search terms (words, phrases) found in value        
  20: var parseSearchTerms = function(value) {
  22:     // split string on spaces and respect double quoted phrases
  23:     var splitRegex = /(\u0022[^\u0022]*\u0022)|([^\u0022\s]+(\s|$))/g;
  24:     var rawTerms = value.match(splitRegex);
  26:     var terms = [];            
  27:     for (var i = 0; i < rawTerms.length; i++) {
  29:         // trim whitespace, quotes, apostrophes and query syntax special chars
  30:         var term = rawTerms[i].replace(/^[\s\u0022\u0027+-][\s\u0022\u0027+-]*/, '').replace(/[\s*~\u0022\u0027][\s*~\u0022\u0027]*$/, '').toLowerCase();
  32:         // ignore if <= 2 chars
  33:         if (term.length <= 2) {
  34:             continue;
  35:         }
  37:         // ignore stopwords
  38:         var stopwords = ["about","are","from","how","that","the","this","was","what","when","where","who","will","with","the"];
  39:         var isStopword = false;
  40:         for (var j = 0; j < stopwords.length; j++) {
  41:             if (term == stopwords[j]) {
  42:                 isStopword = true;
  43:                 break;
  44:             }
  45:         }
  46:         if (isStopword === true) {
  47:             continue;
  48:         }
  50:         // add term to term list
  51:         terms[terms.length] = term;
  52:     }
  53:     return terms;
  54: }

Example 1

Pass an array of terms to be highlighted in jquery-selected elements:

   1: <script type="text/javascript">
   2:     $(document).ready(function() {
   3:         var searchTerms = ["banana", "monkey"];
   4:         // highlight valid terms in search results          
   5:         highlightTermsIn($("#HighlightWrapper"), searchTerms);        
   6:     });
   7: </script>

Example 2

Parse raw search input to strip out stopwords, query syntax characters, and wee short words less than 3 characters that do nobody any good. Quoted phrases are treated as a single term.

   1: <script type="text/javascript">
   2:     var rawSearch = "give the banana* to a monkey";
   3:     var termsToHighlight = parseSearchTerms(rawSearch);
   4:     // termsToHighlight now = ["give", "banana", "monkey"]
   5: </script>

Example 3

Put it all together to retrieve raw search input from the query string, parse out terms to highlight, and highlight within specified containers.

   1: <script type="text/javascript">
   2:     $(document).ready(function() {
   3:         // get quick search value from query string
   4:         var quickSearch = $.query.get("q");
   5:         // highlight valid terms in divs marked class="HighlightWrapper"        
   6:         highlightTermsIn($("div.HighlightWrapper"), parseSearchTerms(quickSearch));
   7:     });     
   8: </script>



I just read, and then for good measure re-read, "JavaScript: The Good Parts" by Douglas Crockford, who is probably the foremost javascript authority on planet Earth. The book blew my mind. I thought I knew javascript; I thought I had a pretty good grasp of it; before I picked up this book I would have referred to myself as a javascript expert when introducing myself at parties. It turns out I had a lot more to think about. I find this delightful.

One of the valuable lessons I learned is a javascript module pattern, discussed with examples at the YUI (Yahoo! User Interface) blog: The take home message is that you can create objects that support private members. I didn't even know that was possible, but it turns out that it is, due to function scope and the concept of closure. It took me a few reads with furrowed brow to grok closure, so I can hardly explain it quickly, but essentially:

  1. A function can return a function and then wink out of existence.
  2. The returned inner function retains access to other members and data defined in its original parent function.
  3. Those other members and data are not directly accessible anymore, so they are private.


Further to my post of May 2007, where I described how to create a trace log in DB/Text and handle exceptions, here is a method that is handy for building more complex log statements.

   1: // construct formatted log statement
   2: function logFormat(val, arg1, arg2, arg3)
   3: {
   4:     var formattedVal = val;
   5:     if (arg1 != null)    {
   6:         formattedVal = formattedVal.replace("{0}", arg1);
   7:     }
   8:     if (arg2 != null) {
   9:         formattedVal = formattedVal.replace("{1}", arg2);
  10:     }
  11:     if (arg3 != null) {
  12:         formattedVal = formattedVal.replace("{2}", arg3);
  13:     }
  14:     log(formattedVal);
  15: }
  17: // write log statement to form box
  18: function log(val)
  19: {
  20:     var box = Form.boxes("boxDebugLog");
  21:     if (box == null)
  22:         return;    
  24:     if (box.content != '')
  25:     {
  26:         box.content += "\n";    
  27:     }    
  28:     box.content += val;
  29: }


logFormat takes a string value and up to 3 arguments, and replaces tokens found in the string value with those arguments. It's based on C#'s string.Format method, and others like it.


It's easier to read (and write) a format string that contains tokens than a string concatenated together with plus signs.

   1: var count = 10;
   2: var name = "Peter";
   3: var duration = 30;
   5: function LogTheOldWay()
   6: {
   7:     log(name + " was lashed with a wet noodle " + count + " times for " + duration + " seconds.");
   8: }
  10: function LogTheNewWay()
  11: {    
  12:     logFormat("{0} was lashed with a wet noodle {1} times for {2} seconds.", name, count, duration);
  13: }


My example above only proves how piddly my skills really are. There's a *much* better way of doing this that allows for n arguments instead of a maximum of three. I had forgotten that every javascript function has a local property called arguments that contains all parameters passed to the function as an array.

   1: function logFormat(val)
   2: {
   3:     var formattedVal = val;
   4:   for(i = 1; i < arguments.length; i++)
   5:   {
   6:     formattedVal = formattedVal.replace("{" + (i - 1) + "}", arguments[i]);
   7:   }
   8:   return formattedVal;
   9: }

Let Us Help You!

We're Librarians - We Love to Help People