Tips for Scaling Full Text Indexing of PDFs with Apache Solr and Tika

Published Jun 23, 2017

We often find ourselves indexing the content of PDFs with Solr, the open-source search engine beneath our Andornot Discovery Interface. Sometimes these PDFs are linked to database records also being indexed. Sometimes the PDFs are a standalone collection. Sometimes both. Either way, our clients often want to have this full-text content in their search engine. See the Arnrpior & McNab/Braeside Archives site, which has both standalone PDFs and PDFs linked from database records.

Solr, or rather its Tika plugin, does a good job of extracting the text layer in the PDF and most of my efforts are directed at making sure Tika knows where the PDF documents are. This can be mildly difficult when PDFs are associated with database records that point to the documents via relative file paths like where\is\this\document.pdf. Or, when the documents are pointed to with full paths like x:\path\to\document.pdf, but those paths have no meaning on the server where Solr resides. There are a variety of tricks which transform those file paths to something Solr can use, and I needn't get into them here. The problem I really want to talk about is the problem of scale.

When I say 'the problem of scale' I refer to the amount of time it takes to index a single PDF, and how that amount—small as it might be—can add up over many PDFs to an unwieldy total. The larger the PDFs are on average, the more time each unit of indexing consumes, and if you have to fetch the PDF over a network (remember I was talking about file paths?), the amount of time needed per unit increases again. If your source documents are numbered in the mere hundreds or thousands, scale isn't much of a problem, but tens or hundreds of thousands or more? That is a problem, and it's particularly tricksome in the case where the PDFs are associated with a database that is undergoing constant revision.

In a typical scenario, a client makes changes to a database which of course can include edits or deletions involving a linked PDF file. (Linked only in the sense that the database record stores the file path.) Our Andornot Discovery Interface is a step removed from the database, and can harvest changes on a regular basis, but the database software is not going to directly update Solr. (This is a deliberate strategy we take with the Discovery Interface.) Therefore, although we can quite easily apply database (and PDF) edits and additions incrementally to avoid the scale problem, deletions are a fly in the ointment.

Deletions from the database mean that we have to, at least once in a while (usually nightly), refresh the entire Solr index. (I'm being deliberately vague about the nature of 'database' here but assume the database does not use logical deletion, but actually purges a deleted record immediately.) A nightly refresh that takes more than a few hours to complete means the problem of scale is back with us. Gah. So here's the approach I took to resolve that problem, and for our purposes, the solution is quite satisfactory.

What I reckoned was: the only thing I actually want from the PDFs at index-time is their text content. (Assuming they have text content, but that's a future blog post.) If I can't significantly speed up the process of extraction, I can at least extract at a time of my choosing. I set up a script that creates a PDF to text file mirror.

The script queries the database for PDF file paths, checks file paths for validity, and extracts the text layer of each PDF to a text file of the same name. The text file mirror also reflects the folder hierarchy of the source PDFs. Whenever the script is run after the first time, it checks to see if a matching text file already exists for a PDF. If yes, the PDF is only processed if its modify date is newer than its text file doppelgänger. It may take days for the initial run to finish, but once it has, only additional or modified PDFs have to be processed on subsequent runs.

Solr is then configured to ingest the text files instead of the PDFs, and it does that very quickly relative to the time it would take to ingest the PDFs.

The script is for Windows, is written in PowerShell, and is available as a Github gist.

IIS Application Pool Resurrection Script

Published May 25, 2015

Overview

Default IIS application pool settings allow for no more than 5 uncaught exceptions within 5 minutes, and when this magic number is reached, the application pool shuts itself down. Uncaught exceptions are somewhat rare for us in the web applications we write because we have frameworks that catch and log errors. Some of our older web applications suffer from uncaught exceptions however, and so does Inmagic Webpublisher on servers where we host clients that use that software.

It used to be that text alerts would wake us up in the middle of the night screaming that sites dependent on Webpublisher were down, and we would remote in to the server to restart the relevant application pool. Well, that was pretty much untenable, so I wrote a script to restart the application pool automatically that would trigger when the application pool's shutdown was recorded in the Windows Application Event Log. A caveat here is that application pools usually shut themselves down for good reason - you shouldn't apply this script as a bandaid if you can fix the underlying causes.

Prerequisites

PowerShell v2 (get current version with $PSVersionTable.PSVersion).
PowerShell execution policy must allow the script to run (i.e. Set-ExecutionPolicy RemoteSigned or Set-ExecutionPolicy Unsigned).

Install the Script

Register a new Windows Application Event Log Source called 'AppPool Resurrector'. Do it manually or use my PowerShell script.
Put the AppPoolResurrector.ps1 script somewhere on the server, and take note of the name of the application pool you want to monitor.
Create a new task in Windows Task Scheduler once per application pool you want to monitor
1. Trigger is 'On an Event' Event ID: 1000, Source: Application Error, Log: Application
2. Action is 'Start a program', Program/script: PowerShell, Add arguments: -command &" 'c:\path\to\apppoolresurrector.ps1' 'name-of-app-pool' "

Note the script activates to check whether the named application pool is still running, and then proceeds to restart it if necessary. There will be times it is activated by a log event to find that the application pool is fine, probably because the log event was unrelated to the application pool in the first place.

Tips for Scaling Full Text Indexing of PDFs with Apache Solr and Tika

IIS Application Pool Resurrection Script

Overview

Prerequisites

Install the Script

Script Content

Categories