
Java is undergoing some considerable licensing changes, prompting us to plan an all-out move from Oracle Java 8 to OpenJDK Java 11 this Spring for every Solr instance we host. I have been running covertly about the hills setting traps for Java 11.0.1 to see what I might snare before unleashing it on our live servers. I caught something this week.

Dates! Of course it's about parsing dates! I noticed that the Solr Data Import Handler (DIH) transforms were failing to produce created dates during ingest. (In DIH, we use a script transformer and manipulate some Java classes with JavaScript. This includes the parsing of dates from text.) Up until now, our DIH has used an older method of parsing dates with a Java class called SimpleDateFormat. If you look for info on parsing dates in Java, you will find years and years of advice related to that class and its foibles, and then you will notice that in recent times experts advise using the java.time classes introduced in Java 8. Since SimpleDateFormat didn't work during DIH, I assumed that SimpleDateFormat was deprecated in Java 11 (it actually isn't), and set about converting the relevant DIH code to use java.time.

Many hours passed here, during which the output of two lines of code* made no goddamn sense at all. The javadocs that describe the behaviour of java.time classes are completely inadequate, with their stupid little "hello, world" examples, when dates are tricky, slippery, malicious dagger-worms of pure hatred. Long story short, a date like '2004-09-15 12:00:00 AM' produced by Inmagic ODBC from a DB/Textworks database could not be parsed. The parser choked on the string at "AM," even though my match pattern was correct: 'uuuu-MM-dd hh:mm:ss a'. Desperate to find the tiniest crack to exploit, I changed every variable I could think of one at a time. That was how I found that, when I switched to Java 8, the same exact code worked. Switch back to Java 11. Not working. Back to Java 8. Working. WTF?

I thought, maybe the Nashorn scripting engine that allows JavaScript to be interpreted inside the JVM is to blame, because this scenario does involve Java inside JavaScript inside Java, which is weird. So I set up a Java project with Visual Studio Code and Maven and wrote some unit tests in pure Java. (That was pretty fun. It was about the same effort as ordering a pizza in Italian when you don’t speak Italian: everything about the ordering process was tantalizingly familiar but different enough to delay my pizza for quite some time.) The problem remained: parsing worked as-written in Java 8, but not Java 11.

I started writing a Stack Overflow question. In so doing, I realized I hadn't tried the overload of java.time.format.DateTimeFormatter.ofPattern() that takes a locale. I had already dotted many i's and crossed a thousand t's, but I wanted to really impress anyone reading the question that I had done my homework, because I hate looking ignorant, so I wrote another unit test that passed in Locale.ENGLISH and, ohmigawd, that solved the problem entirely. If you have been following along, that means that "AM/PM" could not be understood by the parser, even with the right pattern, without the context of a locale, and evidently the default locale used by the simpler version of DateTimeFormatter.ofPattern() was inadequate to the task. I tested, and Locale.ENGLISH and Locale.US both worked with "AM/PM" but Locale.CANADA did not. Likely the latter is my default locale, because I do reside in Canada. Really? Really, Java? We have AM and PM here in the Great White North, I assure you.
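For anyone who wants to poke at this on their own machine, here is a minimal sketch of a locale probe. It tries the same pattern and sample string under several locales, and it also formats midnight under each locale so you can see the AM/PM text that the parser expects to find. The class and method names (LocaleProbe, parses, expectedText) are made up for illustration. Java 9 and later take their default locale data from CLDR, so the AM/PM text for a given locale can differ from Java 8; whether that explains the Locale.CANADA result is an assumption, not something confirmed here.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Hypothetical helper: reports which locales can parse the Inmagic-style date string.
final class LocaleProbe {
    static final String PATTERN = "uuuu-MM-dd hh:mm:ss a";
    static final String SAMPLE = "2004-09-15 12:00:00 AM";

    // Returns true if SAMPLE parses with PATTERN under the given locale.
    static boolean parses(Locale locale) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(PATTERN, locale);
        try {
            LocalDateTime.parse(SAMPLE, formatter);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Formats midnight under the locale, revealing the AM/PM text that locale uses.
    static String expectedText(Locale locale) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(PATTERN, locale);
        return LocalDateTime.of(2004, 9, 15, 0, 0).format(formatter);
    }

    public static void main(String[] args) {
        Locale[] candidates = {Locale.ENGLISH, Locale.US, Locale.CANADA, Locale.getDefault()};
        for (Locale locale : candidates) {
            System.out.println(locale + ": parses=" + parses(locale)
                    + ", formatter writes \"" + expectedText(locale) + "\"");
        }
    }
}
```

Run it under both Java 8 and Java 11 and compare the output for your default locale; if the formatter writes something other than "AM", that is the text the parser is looking for.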

I don’t know if this is a bug in Java 11. I’m merely happy to have understood the problem at this point. Just another day in the developer life, eh? Something that should be a snap becomes a grueling carnival ride that deposits you at the exit, white-faced and shaking, with an underwhelming sense of minor accomplishment. How do you explain to people that you spent 8 hours teaching a computer to treat an ordinary date as a date? Write a blog post, I guess.

* Two lines of code. 8 hours of frustration. Here it is, ready?


import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class App {
  // Parses dateText using pattern, e.g. "uuuu-MM-dd hh:mm:ss a".
  public LocalDateTime parse(String dateText, String pattern) {
    // The explicit Locale.ENGLISH is what lets "AM"/"PM" match on Java 11.
    DateTimeFormatter parser = DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH);
    return LocalDateTime.parse(dateText, parser);
  }
}
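If the source data might also contain lowercase "am"/"pm", one possible hardening is to build the formatter with DateTimeFormatterBuilder and switch on case-insensitive parsing while still pinning the locale. This is only a sketch under that assumption; the class name LenientMeridiemParser is hypothetical and nothing in the DIH setup above requires it.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

// Hypothetical variant: same pattern and locale, but "AM", "am" and "Am" are all accepted.
final class LenientMeridiemParser {
    static final DateTimeFormatter FORMATTER = new DateTimeFormatterBuilder()
            .parseCaseInsensitive()                  // applies to everything appended after it
            .appendPattern("uuuu-MM-dd hh:mm:ss a")
            .toFormatter(Locale.ENGLISH);            // explicit locale, never the JVM default

    static LocalDateTime parse(String dateText) {
        return LocalDateTime.parse(dateText, FORMATTER);
    }

    public static void main(String[] args) {
        System.out.println(parse("2004-09-15 12:00:00 AM"));
        System.out.println(parse("2004-09-15 01:30:00 pm"));
    }
}
```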

The Problem

1000s of date strings in short date format like m/d/yy. Fine as long as system date settings assume month/day/year. Then system date settings change to day/month/year to conform with international standards. 1000s of date strings are misinterpreted.

E.g. 06/01/2007: before = June 1, 2007; after = January 6, 2007.
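The same ambiguity is easy to reproduce in a few lines of Java. This is just an illustration of the two readings, not anything DB/TextWorks does internally; the class and method names are invented for the example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical demo: one short date string, two equally valid interpretations.
final class AmbiguousDate {
    static final DateTimeFormatter MONTH_FIRST = DateTimeFormatter.ofPattern("M/d/uuuu");
    static final DateTimeFormatter DAY_FIRST = DateTimeFormatter.ofPattern("d/M/uuuu");

    static LocalDate readMonthFirst(String text) {
        return LocalDate.parse(text, MONTH_FIRST);
    }

    static LocalDate readDayFirst(String text) {
        return LocalDate.parse(text, DAY_FIRST);
    }

    public static void main(String[] args) {
        String text = "06/01/2007";
        System.out.println(readMonthFirst(text)); // 2007-06-01 (June 1)
        System.out.println(readDayFirst(text));   // 2007-01-06 (January 6)
    }
}
```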

The Fix

  1. Export date field and unique ID field to delimited text file.
  2. Use regular expression to switch day and month:
    • find expression: (\d+)/(\d+)/(\d{4})
    • replace expression: \2/\1/\3
  3. Import modified file into Excel or Access, treating the date string field as DateTime so it's interpreted as a proper date, not a string.
  4. Change the format of the date field to a Long Date
    • Access query expression: Format([MyDate], "Long Date")
  5. Import the file with long date back into Textworks, matching on unique ID field; replace field values.
  6. Date strings are now in unambiguous Long Date format, e.g. MMM dd, yyyy.
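For comparison, steps 2 to 4 can be sketched in Java instead of a text editor plus Excel or Access. This is not the workflow used above, only an equivalent under the assumption that every input string was originally written month/day/year. The class and method names are hypothetical, and note that Java's replaceAll writes back-references as $1 rather than \1.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper mirroring steps 2-4: swap, read as a real date, write with a month name.
final class LongDateFix {
    static final DateTimeFormatter DAY_FIRST = DateTimeFormatter.ofPattern("d/M/uuuu");
    static final DateTimeFormatter LONG_DATE =
            DateTimeFormatter.ofPattern("MMM dd, uuuu", Locale.ENGLISH);

    // Step 2: swap the first two numeric groups, e.g. 06/01/2007 becomes 01/06/2007.
    static String swapDayAndMonth(String text) {
        return text.replaceAll("(\\d+)/(\\d+)/(\\d{4})", "$2/$1/$3");
    }

    // Steps 3-4: read the swapped string as day/month/year and write it unambiguously.
    static String toLongDate(String swapped) {
        return LocalDate.parse(swapped, DAY_FIRST).format(LONG_DATE);
    }

    public static void main(String[] args) {
        String swapped = swapDayAndMonth("06/01/2007");
        System.out.println(swapped + " -> " + toLongDate(swapped));
    }
}
```

In plain Java the swap is not strictly needed, since the original string could be parsed directly with a month-first pattern; it is kept here only so the sketch lines up with the numbered steps.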

UPDATE: This comic at xkcd.com is totally awesome. And relevant and stuff.

Okay, I did some tests and this is the conclusion: in v7 of dbtext anyway, the dbtext.ini file can be given a short date format, BUT this is very misleading because although dbtext uses the dbtext.ini format to *stamp* the date (when automatic), the dbtext.ini format is completely ignored when the date is indexed. And when indexing, dbtext indexes the date as an unambiguous absolute.

What does this mean? Let's say today is Feb 10, 2005. My machine's regional settings are M/d/yyyy. To my machine, Feb 10 2005 reads 2/10/2005 in short format. If I don't touch the dbtext.ini, any autodate in short format in dbtext is going to be stamped in as 2/10/2005, and that date string will be indexed as Feb 10 2005. All good. As soon as you try to *override* Windows regional settings with dbtext.ini, problems arise.

Let's summarize in shorthand, assuming the new record date is Feb 10, 2005: trouble shows up when dbtext.ini is set to the opposite of the Windows regional settings. The autodate gets *stamped* according to dbtext.ini, but *indexed* according to Windows regional settings. Remember "today" is Feb 10 2005 so the date should be getting indexed as Feb 10 2005. Wherever you see Indexed As Oct 2 2005 (Case 2 & 4) it's because dbtext.ini is set to the reverse of Windows regional settings.

The take-home message is: do not assume that a date format setting in dbtext.ini overrides Windows regional settings, because it doesn't in all actions. Also, because the date index is never subject to reinterpretation, whatever absolute date was laid down at the time of the original indexing (usually when the record was saved) is the absolute date that remains, *regardless* of whether the textbase is moved to another machine where the regional settings differ, and regardless of whether the machine the textbase resides on has its regional settings changed. You must re-index the date field to pick up on a different regional settings environment.

The consequences of this legacy index behaviour are not at all clear in the Inmagic kbase article 2606, "Troubleshooting when the Date format changes after upgrading operating system", nor are the risks of setting date formats in DBTEXT.INI discussed in Inmagic kbase article 2122, "Date Formats Supported by DB/TextWorks". In the case of the former, there should be instructions to re-index after the Windows regional settings change, and for the latter, it should be made clear that DBTEXT.INI date format settings do not override Windows regional settings when indexing.
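The stamp-versus-index mismatch can be modelled in a few lines of Java. To be clear, this is a toy model of the behaviour observed in the tests above, not DB/TextWorks code: one formatter stands in for the dbtext.ini format (the stamp) and another for the Windows regional settings (the index). All names are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Toy model: write a date with one format, then read the text back with another.
final class StampVersusIndex {
    static final DateTimeFormatter MONTH_FIRST = DateTimeFormatter.ofPattern("M/d/uuuu");
    static final DateTimeFormatter DAY_FIRST = DateTimeFormatter.ofPattern("d/M/uuuu");

    // stampFormat plays the role of dbtext.ini; indexFormat plays the Windows regional settings.
    static LocalDate stampThenIndex(LocalDate today,
                                    DateTimeFormatter stampFormat,
                                    DateTimeFormatter indexFormat) {
        String stamped = today.format(stampFormat);   // text stored in the record
        return LocalDate.parse(stamped, indexFormat); // absolute date laid down in the index
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2005, 2, 10);
        // Formats agree: the index matches the real date.
        System.out.println(stampThenIndex(today, MONTH_FIRST, MONTH_FIRST)); // 2005-02-10
        // Formats disagree: Feb 10 is indexed as Oct 2.
        System.out.println(stampThenIndex(today, DAY_FIRST, MONTH_FIRST));   // 2005-10-02
        System.out.println(stampThenIndex(today, MONTH_FIRST, DAY_FIRST));   // 2005-10-02
    }
}
```

Feb 10 happens to be a date where both readings are valid, so nothing fails loudly; a date like Feb 20 would throw under the mismatched format in this model, which says nothing about how dbtext itself would react.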
