Pitfalls to avoid when importing XML files to Solr DataImportHandler

Published Jun 12, 2012

Don’t send UTF-8 files with BOMs

Any UTF-8 encoded XML files you intend to import to Solr via the DataImportHandler (DIH) must not be saved with a leading BOM (byte order mark). The UTF-8 standard does not require a BOM, but many Windows apps (e.g. Notepad) include one anwyay. (Byte order has no meaning in UTF-8. The standard does permit the BOM, but doesn’t recommend its use.)

The BOM byte sequence is 0xEF,0xBB,0xBF. When the text is (incorrectly) interpreted as ISO-8859-1, it looks like this:

The Java XML interpreter used by Solr DIH does not want to see a BOM and chokes when it does. You might get an error like this:

ERROR:  'com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Content is not allowed in prolog.'

Avoid out-of-memory problems caused by very large import files

Very large import files may lead to out-of-memory problems with Solr’s Java servlet container (we use Tomcat, currently). “Very large” is a judgment call, but anything over 30 MB is probably going to be trouble. It is possible to increase the amount of memory allocated to Tomcat, but not necessary if you can break the large import files into smaller ones. I force any tools I’m using to cap files to 1000 rows/records, which ends up around 2 MB in size with the kind of library and archives data we tend to deal with.