For landboss we use NHibernate.Search to build and query Lucene indexes made from our NHibernate objects “automagically.” So far it has worked really well, and made full text search stupid-easy. Recently on our live environment we found a disturbing and very confusing issue surrounding it.
The Problem
Landboss is an ASP.NET web application, and we were noticing requests timing out on certain pages. We discovered that any page that did NHibernate.Search work (ie saved an object that was being indexed) the request would timeout. We had already run into threading issues with NHibernate.Search before so we immediately identified it as the culprit. Earlier issues were fixed by updating to more recent builds but we had no such luck this time – only trivial changes had been made since our last used build. Poking around further it looked like the Workspace class in the Nhibernate.Search.Backend namespace was hanging indefinitely when trying to lock a directory provider. Ok, so maybe it’s not releasing the lock in a previous request, right? Well, yes, but not for what you would think.
We spent days under the assumption that either NHibernate.Search wasn’t cleaning itself up correctly, or we weren’t using NHibernate.Search correctly. Would redirecting cause the worker threads to abort? Would not disposing of the IFullTextSession leave locks open? It was quite puzzling. The worst part was we couldn’t reproduce this behavior on any other environment no matter how hard we tried.
The Solution
Turns out it was simply a file access issue. Our IIS settings were such that the first thread that actually launched the app would have full access to all folders (including the index folders), but every request after that would have read-only. Don’t ask me why this was – it just was. Anyway this would cause NHibernate.Search worker threads to fail in a way that their locks were not released. NHibernate.Search catches and cleans up for IOExceptions, but apparently another exception type was being thrown (maybe AccessDeniedException?). Inadvertantly we added write ability to the directory for all threads (trying to fix logging – to help track this “issue” down!) and POOF! it worked perfectly again. How foolish we felt.
But we really shouldn’t have – Since the error only manifested itself in the worker threads, and only on our live environment (where we don’t have nice debuggers and things) that it would have been nearly impossible to make the connection there. Broken Thread Locks = Check NTFS Permissions?
Well, I suppose the lesson to take from this is don’t try to fix anything unless your logging works perfectly. If your log files aren’t complete, how can you expect anything else in your app to work?