What is the procedure for debugging a production-only error?

Let me say upfront that I'm so ignorant on this topic that I don't even know whether this question has objective answers or not. If it ends up being "not," I'll delete or vote to close the post.

Here's the scenario: I just wrote a little web service. It works on my machine. It works on my team lead's machine. It works, as far as I can tell, on every machine except for the production server. The exception that the production server spits out upon failure originates from a third-party JAR file, and is skimpy on information. I search the web for hours, but don't come up with anything useful.

So what's the procedure for tracking down an issue that occurs only on production machines? Is there a standard methodology, or perhaps a category/family of tools, for this?

The error that inspired this question has already been fixed, but that was due more to good fortune than a solid approach to debugging. I'm asking this question for future reference.

EDIT:
The answer so far seems to be summed up in one word: logging. The one issue with logging is that it requires forethought. What if a situation comes up in an existing system with poor logging, or the client is worried about sensitive data and does not want extensive logging in the system in the first place?

Some related questions:
Test accounts and products in a production system
Running test on Production Code/Server

In addition to logging, which is invaluable, here are some other techniques my co-workers and I have used over the years... going back to 16-bit Windows on client machines we had no access to. (Did I date myself?) Granted, not everything can or will work.

Analyze any and all behavior you see.
Reproduce it, if at all possible.
Desk check: walk through the code you suspect.
Rubber duck it with team members AND people who have little or no familiarity with the code. The more you have to explain something to someone, the better chance you have of uncovering something.
Don't get frustrated. Take a 5-10 minute break. Take a quick walk across the building/street/whatever. Don't think about the problem for that time.
Listen to your instincts.

This is one of the most difficult debugging scenarios. The answer will depend on the details of the production system. Is it a system you have full control over? Or is it installed on a client's machine, where you need to get through numerous phone calls just to get access to a log file or modify a configuration parameter?

I believe most people will agree that the most effective way of debugging this is to use logging. You need to act proactively and add as much logging information as possible. However, you must be able to enable and disable logging on demand, because extensive debug logs in a production system could kill performance. For the same reason, you need to be able to enable only specific parts of the logging. Create logical groups of log statements and enable only the ones you think will give you the most relevant information.
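To make the "logical groups" idea concrete, here is a minimal sketch. It uses java.util.logging from the JDK rather than Log4j, only so that it runs without extra dependencies. The group names ("app.db", "app.http") and the helper class are made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: one named logger per logical group of log statements.
final class LogGroups {
    // Keep strong references; java.util.logging holds its loggers only weakly.
    static final Logger DB = Logger.getLogger("app.db");
    static final Logger HTTP = Logger.getLogger("app.http");

    private LogGroups() {}

    // Turn detailed logging on for one group without touching the others.
    static void enableDebug(Logger group) {
        group.setLevel(Level.FINE);
    }

    // Put the group back to a quiet level once the investigation is over.
    static void disableDebug(Logger group) {
        group.setLevel(Level.WARNING);
    }
}
```

How you trigger enableDebug at runtime (an admin page, a reloaded config file, and so on) depends on your system. Note that a handler attached to the logger must also accept FINE for anything to be written; the JDK's default ConsoleHandler only lets INFO and above through.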

I would start with the small, easy-to-check differences between production and test. Eliminate obvious stuff like permissions, firewalls, different versions, etc., through actual testing. The one time I cut corners and say "oh, that can't be it," it is.
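One cheap way to check for version-type differences is to dump a few runtime properties on each machine and compare them. This is only a sketch: the class name and the choice of properties are my own assumptions, and it will not catch things like permissions or firewalls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for comparing two environments.
final class EnvSnapshot {
    // Assumed list of properties worth comparing; extend it for your own setup.
    private static final String[] KEYS = {
        "java.version", "java.vendor", "os.name", "os.arch",
        "file.encoding", "user.timezone", "java.class.path"
    };

    private EnvSnapshot() {}

    // Capture the selected system properties of the current JVM.
    static Map<String, String> capture() {
        Map<String, String> snapshot = new TreeMap<>();
        for (String key : KEYS) {
            snapshot.put(key, String.valueOf(System.getProperty(key)));
        }
        return snapshot;
    }

    // List every key whose value differs between the two snapshots.
    static List<String> diff(Map<String, String> here, Map<String, String> there) {
        Set<String> keys = new TreeSet<>(here.keySet());
        keys.addAll(there.keySet());
        List<String> differences = new ArrayList<>();
        for (String key : keys) {
            String a = here.get(key);
            String b = there.get(key);
            if (!Objects.equals(a, b)) {
                differences.add(key + ": " + a + " != " + b);
            }
        }
        return differences;
    }
}
```

Run capture() on both machines, save the output, and feed the two maps to diff(). Anything it prints is a candidate to rule out by actual testing.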

Then I prioritize more expensive tests by likelihood and cost. Be creative. Think of really weird things that might cause the behaviour you see.

Typically speaking, "debugging" [i.e. attaching to a process and inspecting execution] is not viable, for many reasons, not the least of which is data sensitivity [e.g. developers are rarely qualified or cleared to inspect the data we manipulate].

So this usually comes down to inferring execution from secondary sources and artifacts. This then boils down to ...

Logging,
Logging,
Logging.

A large majority of software written these days falls into either the Java or the .NET camp, so leverage log4j and log4net respectively.

Also, having a bullet-proof, Ops-centric configuration guide and validation process helps. Remember that the people responsible for the hardware and environment rarely understand the configuration requirements of the applications they are hosting.

I've used a configurable logging system such as Log4J to see what's happening during production runs. This assumes that developers have put useful debugging information in the logs.

But beware that logging might expose sensitive private data, which should be encoded and/or skipped when possible.
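As a rough illustration of skipping sensitive values, a message can be passed through a masking step before it is logged. The two patterns below (card-like digit runs and password=... pairs) are assumptions picked for the example, not a complete list of what counts as sensitive in your system.

```java
import java.util.regex.Pattern;

// Hypothetical masking helper applied to a message before it reaches the log.
final class LogRedactor {
    // Assumed pattern: a standalone run of 13 to 16 digits, e.g. a card number.
    private static final Pattern CARD = Pattern.compile("\\b\\d{13,16}\\b");
    // Assumed pattern: password=<value>, up to the next '&' or whitespace.
    private static final Pattern PASSWORD = Pattern.compile("(?i)(password=)[^&\\s]+");

    private LogRedactor() {}

    static String redact(String message) {
        String masked = CARD.matcher(message).replaceAll("[REDACTED-NUMBER]");
        // $1 keeps the "password=" prefix and replaces only the value.
        return PASSWORD.matcher(masked).replaceAll("$1[REDACTED]");
    }
}
```

Usage would look like logger.info(LogRedactor.redact(message)). Pattern-based masking can miss things, so leaving a field out of the log entirely is safer when you can manage it.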

Along with logging, other techniques include saving request data that you can then feed into your own "identical" system later. This could be as simple as saving every HTTP request you receive to a file for later analysis. Right now you are likely logging much of this information (notably the URL for GETs); you just need to add headers and request bodies to the mix as well.
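Here is a minimal sketch of the "save every request to a file" idea in plain Java. The class name and the on-disk format are invented for the example. In a real service you would call record() from wherever your framework hands you the request, for instance a filter or interceptor.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;

// Hypothetical recorder that appends each request to a text file for later replay.
final class RequestRecorder {
    private final Path file;

    RequestRecorder(Path file) {
        this.file = file;
    }

    // Writes request line, headers, a blank line, the body, then a "----" separator.
    synchronized void record(String method, String url,
                             Map<String, String> headers, String body) throws IOException {
        StringBuilder entry = new StringBuilder();
        entry.append(method).append(' ').append(url).append('\n');
        for (Map.Entry<String, String> header : headers.entrySet()) {
            entry.append(header.getKey()).append(": ").append(header.getValue()).append('\n');
        }
        entry.append('\n').append(body).append("\n----\n");
        // CREATE + APPEND: make the file on first use, then keep adding to it.
        Files.write(file, entry.toString().getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

Recorded headers and bodies can contain the same kind of private data that logs can, so the same caution applies to this file.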

Adding more detail to error messages is also handy. For example, when you get an exception from a routine, you can add the parameters that were used in that call to the exception message. Or, at least, global state information (who was logged in, what high-level module they were in, what high-level function they were calling, etc.).
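A small sketch of what that can look like in Java: catch the terse exception, then rethrow it with the call's parameters in the message and the original as the cause. OrderService, its method names and its parameters are all hypothetical, and parseQuantity just stands in for a library call that fails with little information.

```java
// Hypothetical service used only to illustrate adding context to an exception.
final class OrderService {
    private OrderService() {}

    // Stand-in for a third-party call that throws a terse exception.
    static int parseQuantity(String raw) {
        return Integer.parseInt(raw);
    }

    static int quantityFor(String user, String orderId, String rawQuantity) {
        try {
            return parseQuantity(rawQuantity);
        } catch (RuntimeException e) {
            // Keep the original as the cause so its stack trace is not lost.
            throw new IllegalStateException(
                    "quantityFor failed: user=" + user
                            + ", orderId=" + orderId
                            + ", rawQuantity=" + rawQuantity, e);
        }
    }
}
```

Be selective about which parameters you include, since exception messages usually end up in logs too.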

Some advice:

Be prepared for the bug to have multiple causes; try not to narrow your search to just one.
Use an unhandled-error handler that keeps track of errors and aggregates similar defects (Graylog, ELMAH).
Consider post-mortem debugging with mini-dump files.
Set a fixed time frame for the quick-and-dirty approach, then switch to a systematic approach.
Try reviewing the defective module's code with one of your colleagues. A fresh view could be helpful.
Divide and conquer using your version control system (Git, SVN).
Be careful about fixes, because around 4% of all fixes end up introducing new bugs.
Don't let pressure for a quick fix in production make you skip your standard quality control procedures (e.g. code reviews).
After fixing, make sure you have written automated tests in case the bug comes back some time later.
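To illustrate the unhandled-error handler from the list above: in Java you can register a Thread.UncaughtExceptionHandler and group errors by a simple fingerprint. This is only a toy sketch of the idea, not how Graylog or ELMAH work internally. The fingerprint rule (exception type plus top stack frame) is my own assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical handler that counts uncaught exceptions per fingerprint.
final class ErrorAggregator implements Thread.UncaughtExceptionHandler {
    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    // Assumed fingerprint: exception class plus the class and method of the top frame.
    static String fingerprint(Throwable error) {
        StackTraceElement[] frames = error.getStackTrace();
        String top = frames.length > 0
                ? frames[0].getClassName() + "." + frames[0].getMethodName()
                : "unknown";
        return error.getClass().getName() + "@" + top;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        // Errors with the same fingerprint are counted as one defect.
        counts.computeIfAbsent(fingerprint(error), key -> new AtomicInteger())
              .incrementAndGet();
    }

    // How many times an error with this fingerprint has been seen.
    int count(String fingerprint) {
        AtomicInteger seen = counts.get(fingerprint);
        return seen == null ? 0 : seen.get();
    }
}
```

It would be installed once at startup with Thread.setDefaultUncaughtExceptionHandler(new ErrorAggregator()). A real handler should also log the full stack trace somewhere, which this sketch leaves out.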

Source: https://stackoverflow.com/questions/3015429/what-is-the-procedure-for-debugging-a-production-only-error

Tags

debugging

production-environment