Andornot has been championing discovery interfaces as the next generation of search interface for online collections for the past few years. We firmly believe they provide the best means for connecting users with the ever-growing amount of resources available online. While they have been widely adopted by public and academic libraries, and many other online content providers, they are not yet often used as the search interface for more specialized collections.

We’re pleased to announce the availability of a solution specifically designed for these situations: AnDI, the Andornot Discovery Interface.

AnDI is a web application based on both the Apache Solr search engine and our own extensive experience in developing search interfaces.

It provides the features users expect in a search interface in 2012, including:

  • relevancy-based search results;
  • automatic search term stemming and spelling corrections; and
  • facets to allow refinement of those results.

The intention is to deliver the most useful resources to the user with their initial search, while allowing them to quickly narrow the results further.

Great for Libraries, Archives, Museums and Other Collections

While other discovery interface systems, such as VuFind, are ideal for bibliographic data, AnDI works well with all descriptive formats. AnDI’s search index is based on the Dublin Core metadata standard, to accommodate materials described in many different ways. This includes archival descriptions, museum artifact records, bibliographic records, and more. By mapping fields from different data sources into the Dublin Core schema, almost anything can be made discoverable through AnDI.

AnDI includes permalinks and social bookmarking features to help users act on the results they find. Additional features are available to customize AnDI for specific projects (details are in our AnDI data sheet).

AnDI is a great choice for all organizations wishing to update their online collections to a system that meets their users’ expectations for a search experience.

Data Sources

AnDI can utilize a variety of data sources. This allows clients to retain their familiar legacy systems such as Inmagic DB/TextWorks, MS Access, Excel or other proprietary software for all data entry and other administrative tasks, and yet provide their end users with a sophisticated front-end web interface. AnDI can also be tied into SQL Server for a completely web-based application for both data entry and searching.

More Information

Further technical details and a data sheet with features and system requirements are available here.

Please contact us to discuss how to upgrade your search interface to AnDI, VuFind or a similar solution.

Don’t send UTF-8 files with BOMs

Any UTF-8 encoded XML files you intend to import to Solr via the DataImportHandler (DIH) must not be saved with a leading BOM (byte order mark). The UTF-8 standard does not require a BOM, since byte order has no meaning in UTF-8, and while the standard permits a BOM it does not recommend its use; many Windows apps (e.g. Notepad) include one anyway.

The BOM byte sequence is 0xEF,0xBB,0xBF. When the text is (incorrectly) interpreted as ISO-8859-1, it shows up as the stray characters ï»¿ at the start of the file.

The Java XML interpreter used by Solr DIH does not want to see a BOM and chokes when it does. You might get an error like this:

ERROR:  'com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Content is not allowed in prolog.'
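
If you need to clean up existing files, the BOM can be stripped programmatically before import. Here is a minimal Python sketch (Python is not part of the toolchain described in this post, and the directory is simply the pickup folder from the sample config below; adjust both to suit):

# strip_bom.py - remove a leading UTF-8 BOM from XML files before they reach DIH.
import codecs
import glob

for path in glob.glob("/var/lib/tomcat6/solr/xxx/import/*.xml"):
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):              # BOM_UTF8 is the byte sequence 0xEF 0xBB 0xBF
        with open(path, "wb") as f:
            f.write(data[len(codecs.BOM_UTF8):])      # rewrite the file without the BOM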

Avoid out-of-memory problems caused by very large import files

Very large import files may lead to out-of-memory problems with Solr’s Java servlet container (we use Tomcat, currently). “Very large” is a judgment call, but anything over 30 MB is probably going to be trouble. It is possible to increase the amount of memory allocated to Tomcat, but not necessary if you can break the large import files into smaller ones. I force any tools I’m using to cap files to 1000 rows/records, which ends up around 2 MB in size with the kind of library and archives data we tend to deal with.
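
As an illustration of such a split, here is a rough Python sketch (not the actual tool referred to above). It assumes the input is a standard Solr update XML file with <doc> elements inside a root <add> element; the file path is hypothetical:

import xml.etree.ElementTree as ET

def split_solr_add(path, chunk_size=1000):
    # Read all <doc> children of the root <add> element.
    docs = ET.parse(path).getroot().findall("doc")
    # Write them back out in chunks of chunk_size docs per file.
    for i in range(0, len(docs), chunk_size):
        chunk = ET.Element("add")
        chunk.extend(docs[i:i + chunk_size])
        out = "%s.part%03d.xml" % (path, i // chunk_size)
        ET.ElementTree(chunk).write(out, encoding="utf-8", xml_declaration=True)

split_solr_add("/var/lib/tomcat6/solr/xxx/import/bigexport.xml")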

I spent a lot of time in trial and error getting the Solr DataImportHandler (DIH) set up and working the way I wanted, mainly due to a paucity of practical examples on the internet at large, so here is a short post on Solr DIH for XML with a working sample, and may it save you many hours of nail-chewing.

This post assumes you are already somewhat familiar with Solr, but would like to know more about how to import XML data with the DataImportHandler.

DIH Overview

The DataImportHandler (DIH) is a mechanism for importing structured data from a data store into Solr. It is often used with relational databases, but can also handle XML with its XPathEntityProcessor. You can pass incoming XML through an XSLT stylesheet, as well as parse and transform the XML with built-in DIH transformers. You could translate your arbitrary XML to Solr's standard input XML format via XSL, or map/transform the arbitrary XML to the Solr schema fields right there in the DIH config file, or a combination of both. DIH is flexible.

Sample 1: dih-config.xml with FileDataSource

Here's a sample dih-config.xml from an actual working site (no pseudo-samples here, my friend). Note that it picks up XML files from a local directory on the LAMP server. If you prefer to post XML files directly via HTTP, you would need to configure a ContentStreamDataSource instead.

It so happens that the incoming XML is already in standard Solr update XML format in this sample, and all the XSL does is remove empty field nodes, while the real transforms, such as building the content of "ispartof_t" from "ignored_seriestitle", "ignored_seriesvolume", and "ignored_seriesissue", are done with DIH Regex transformers. (The XSLT is performed first, and the output of that is then given to the DIH transformers.) The attribute "useSolrAddSchema" tells DIH that the XML is already in standard Solr XML format. If that were not the case, another attribute, "xpath", on the XPathEntityProcessor would be required to select content from the incoming XML document.

<dataConfig>
    <dataSource encoding="UTF-8" type="FileDataSource" />
    <document>
        <!--
            Pickupdir fetches all files matching the filename regex in the supplied directory
            and passes them to other entities which parse the file contents. 
        -->
        <entity
            name="pickupdir"
            processor="FileListEntityProcessor"
            rootEntity="false"
            dataSource="null"
            fileName="^[\w\d-]+\.xml$"
            baseDir="/var/lib/tomcat6/solr/xxx/import/"
            recursive="true"
            newerThan="${dataimporter.last_index_time}"
        >

        <!--
            The xml entity parses standard Solr update XML.
            Incoming values are split into multiple tokens when given a splitBy attribute.
            Dates are transformed into valid Solr dates when given a dateTimeFormat to parse.
        -->
        <entity 
            name="xml"
            processor="XPathEntityProcessor"
            transformer="RegexTransformer,TemplateTransformer"
            datasource="pickupdir"
            stream="true"
            useSolrAddSchema="true"
            url="${pickupdir.fileAbsolutePath}"
            xsl="xslt/dih.xsl"
        >

            <field column="abstract_t" splitBy="\|" />
            <field column="coverage_t" splitBy="\|" />
            <field column="creator_t" splitBy="\|" />
            <field column="creator_facet" template="${xml.creator_t}" />
            <field column="description_t" splitBy="\|" />
            <field column="format_t" splitBy="\|" />
            <field column="identifier_t" splitBy="\|" />
            <field column="ispartof_t" sourceColName="ignored_seriestitle" regex="(.+)" replaceWith="$1" />
            <field column="ispartof_t" sourceColName="ignored_seriesvolume" regex="(.+)" replaceWith="${xml.ispartof_t}; vol. $1" />
            <field column="ispartof_t" sourceColName="ignored_seriesissue" regex="(.+)" replaceWith="${xml.ispartof_t}; no. $1" />
            <field column="ispartof_t" regex="\|" replaceWith=" " />
            <field column="language_t" splitBy="\|" />
            <field column="language_facet" template="${xml.language_t}" />
            <field column="location_display" sourceColName="ignored_class" regex="(.+)" replaceWith="$1" />
            <field column="location_display" sourceColName="ignored_location" regex="(.+)" replaceWith="${xml.location_display} $1" />
            <field column="location_display" regex="\|" replaceWith=" " />
            <field column="othertitles_display" splitBy="\|" />
            <field column="publisher_t" splitBy="\|" />
            <field column="responsibility_display" splitBy="\|" />
            <field column="source_t" splitBy="\|" />
            <field column="sourceissue_display" sourceColName="ignored_volume" regex="(.+)" replaceWith="vol. $1" />
            <field column="sourceissue_display" sourceColName="ignored_issue" regex="(.+)" replaceWith="${xml.sourceissue_display}, no. $1" />
            <field column="sourceissue_display" sourceColName="ignored_year" regex="(.+)" replaceWith="${xml.sourceissue_display} ($1)" />
            <field column="src_facet" template="${xml.src}" />
            <field column="subject_t" splitBy="\|" />
            <field column="subject_facet" template="${xml.subject_t}" />
            <field column="title_t" sourceColName="ignored_title" regex="(.+)" replaceWith="$1" />
            <field column="title_t" sourceColName="ignored_subtitle" regex="(.+)" replaceWith="${xml.title_t} : $1" />
            <field column="title_sort" template="${xml.title_t}" />
            <field column="toc_t" splitBy="\|" />
            <field column="type_t" splitBy="\|" />
            <field column="type_facet" template="${xml.type_t}" />
        </entity>
        </entity>
    </document>
</dataConfig>

Sample 2: dih-config.xml with ContentStreamDataSource

This sample receives XML files posted directly to the DIH handler. Whereas Sample 1 can batch process any number of files, this sample works on one file at a time as it is posted.

<dataConfig>
    <dataSource name="streamsrc" encoding="UTF-8" type="ContentStreamDataSource" />
    
    <document>
        <!--
            Parses standard Solr update XML passed as stream from HTTP post.
            Strips empty nodes with dih.xsl, then applies transforms.
        -->
        <entity
            name="streamxml"
            datasource="streamsrc"
            processor="XPathEntityProcessor"
            rootEntity="true"
            transformer="RegexTransformer,DateFormatTransformer"
            useSolrAddSchema="true"
            xsl="xslt/dih.xsl"
        >
            <field column="contributor_t" splitBy="\|" />
            <field column="coverage_t" splitBy="\|" />
            <field column="creator_t" splitBy="\|" />
            <field column="date_t" splitBy="\|" />
            <field column="date_tdt" dateTimeFormat="M/d/yyyy h:m:s a" />
            <field column="description_t" splitBy="\|" />
            <field column="format_t" splitBy="\|" />
            <field column="identifier_t" splitBy="\|" />
            <field column="language_t" splitBy="\|" />
            <field column="publisher_t" splitBy="\|" />
            <field column="relation_t" splitBy="\|" />
            <field column="rights_t" splitBy="\|" />
            <field column="source_t" splitBy="\|" />
            <field column="subject_t" splitBy="\|" />
            <field column="title_t" splitBy="\|" />
            <field column="type_t" splitBy="\|" />
        </entity>        
    </document>
</dataConfig>
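
To illustrate how this second configuration might be invoked, here is a rough Python sketch that posts one Solr update XML file straight to the DIH handler. It assumes the handler is registered at /update/dih as in the solrconfig.xml snippet below, that Solr is listening on the default port, and that record.xml is a placeholder filename; verify the details against your own setup.

import urllib.request

# Read the Solr update XML that will become the request body / content stream.
with open("record.xml", "rb") as f:
    body = f.read()

# ContentStreamDataSource should pick up the raw POST body as its content stream.
url = ("http://localhost:8983/solr/update/dih"
       "?command=full-import&clean=false&commit=true")
req = urllib.request.Request(url, data=body,
                             headers={"Content-Type": "application/xml; charset=utf-8"})
print(urllib.request.urlopen(req).read().decode("utf-8"))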

How to set up DIH

1. Ensure the DIH jars are referenced from solrconfig.xml, as they are not included by default in the Solr WAR file. One easy way is to create a lib folder in the Solr instance directory containing the DIH jars, as solrconfig.xml looks in the lib folder by default. The DIH jars are found in the apache-solr-x.x.x/dist folder of the downloaded Solr package.

2. Create your dih-config.xml (as above) in the Solr "conf" directory.

3. Add a DIH request handler to solrconfig.xml if it's not there already.

<requestHandler name="/update/dih" startup="lazy" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">dih-config.xml</str>
    </lst>
</requestHandler>

How to trigger DIH

There is a lot more information on full-import vs. delta-import and whether to commit, optimize, etc. in the wiki description of Data Import Handler Commands, but the following would trigger the DIH operation without deleting the existing index first, and commit the changes after all the files had been processed. The sample given above would collect all the files found in the pickup directory, transform them, index them, and finally commit the updates to the index (making them searchable the instant the commit finished).

http://localhost:8983/solr/update/dih?command=full-import&clean=false&commit=true

Here is a screencast of the lightning talk I gave at the Access 2011 Conference on October 21, 2011 in Vancouver, BC, entitled "Quick and Dirty's Guide to Slaying Monster PDFs," in which I show how to use PDFBox to slice up large PDFs for indexing to make search more meaningful.

http://youtu.be/Pn4MW6bs7a8

On Saturday I caught a cold. On Sunday, I caught a flight to San Francisco to attend Lucene Revolution, the biggest open source search conference on the planet, to catch up on the latest developments with Apache Lucene/Solr.

From the Solr project website:

“Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.”

I was joined there by developers and representatives from AT&T, CareerBuilder.com, Corbis, Ebay, eHarmony, EMC, Etsy.com, HathiTrust, Healthwise, Intuit, Travelocity, Trulia, Twitter, Woot, and Yelp, to name a few. (Yes, I’m name dropping – just trying to appear cool here.)

I’ve been working with Solr for the better part of a year, and I thought it very impressive, but the conference blew my socks off in terms of what Solr can do. To think that the best-of-breed search performance in the world is open source! (Solr beats Google in overall search performance. No, I’m not exaggerating.)

Solr is a game-changer, there’s no doubt. No longer is open source just a freebie alternative, it is the go-to standard that is beating the pants off of proprietary search engines. I heard quite a few stories of prominent household-name enterprises switching to Solr and reducing costs while simultaneously vastly increasing their capabilities and performance. 

Happily, Solr not only scales up to the largest data collections ever created by our species, but also down to the relatively modest needs of the rest of us. It democratizes search. A kid making a website in his parents' basement can utilize the same cutting-edge search features as a multinational corporation, and that's not just convenient, it's necessary, because search is now vital to every level of our interaction with information.

I can’t wait to apply what I learned at the conference back at El Rancho Andornot. Also, I need to keep ahead of that kid in the basement.
