I have a bug that shows up every few months on a customer site. It usually happens at 3am and it's not discovered until early the next morning when the customer arrives at their site. And usually when they discover it, they want everything to get working immediately, so our support people generally just reboot the computer. It's been driving me nuts for years. It never happens on my test machine or in the QA lab, only at certain customer sites. Over time, I've
- refactored some of the code that I thought was causing it
- added more debugging printouts around where it appears to be crashing
- redirected stdout so that next time I see it I can "
kill -3" the process
- given support some new tools to dump out the current state of database locks and the like.
- added diagnostics to make it more obvious when it does happen
It hasn't happened in a few months, and I've got my fingers crossed that I might have fixed it this time, but I'm not counting on it.