What is the procedure for debugging a production-only error?

依然范特西╮ submitted on 2019-12-03 10:44:39

In addition to logging, which is invaluable, here are some other techniques my co-workers and I have used over the years, going back to 16-bit Windows on client machines we had no access to. (Did I date myself?) Granted, not all of them will work in every situation.

  • Analyze any and all behavior you see.
  • Reproduce it, if at all possible.
  • Desk check: walk through the code you suspect.
  • Rubber duck it with team members AND people who have little or no familiarity with the code. The more you have to explain something to someone, the better chance you have of uncovering something.
  • Don't get frustrated. Take a 5-10 minute break. Take a quick walk across the building/street/whatever. Don't think about the problem for that time.
  • Listen to your instincts.

This is one of the most difficult debugging scenarios. The answer will depend on the details of the production system. Is it a system you have full control over? Or is it installed on a client's machine, where you need to get through numerous phone calls just to get access to a log file or to modify a configuration parameter?

I believe most people will agree that the most effective way of debugging this is to use logging. You need to act proactively and add as much logging information as possible. However, you must be able to enable and disable logging on demand, because extensive debug logs in a production system could kill performance. For the same reason, you need to be able to enable only specific parts of the logging. Create logical groups of log output and enable only the ones you think will give you the most relevant information.
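As a rough illustration of logical groups that can be switched on and off at runtime, here is a minimal sketch using Python's standard `logging` module. The group names (`app.db`, `app.network`, `app.billing`) and the helper `set_group_debug` are made up for this example; how the switch gets triggered in production (config reload, admin endpoint, signal) is left open.

```python
import logging

# Hypothetical logical groups; each one maps to a named logger.
LOG_GROUPS = ("app.db", "app.network", "app.billing")

def configure_logging() -> None:
    """Quiet by default: every group only lets warnings and above through."""
    logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")
    for name in LOG_GROUPS:
        logging.getLogger(name).setLevel(logging.WARNING)

def set_group_debug(group: str, enabled: bool) -> None:
    """Turn debug output for a single group on or off while the process runs."""
    if group not in LOG_GROUPS:
        raise ValueError(f"unknown log group: {group}")
    level = logging.DEBUG if enabled else logging.WARNING
    logging.getLogger(group).setLevel(level)

configure_logging()
set_group_debug("app.db", True)

logging.getLogger("app.db").debug("slow query took %d ms", 840)  # emitted
logging.getLogger("app.network").debug("socket opened")          # suppressed
```

Only the group under suspicion pays the cost of debug output; the others stay at their normal level.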

I would start with the small, easy-to-check differences between production and test. Eliminate obvious stuff like permissions, firewalls, different versions, etc. through actual testing. The one time I cut corners and say "oh, that can't be it", it is.
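One cheap way to check for such differences is to capture the same facts from both environments and diff them mechanically instead of by eye. The sketch below assumes you have already collected each environment's facts into a plain dictionary; the keys and values shown are invented for illustration.

```python
def diff_environments(prod: dict, test: dict) -> dict:
    """Return {key: (prod_value, test_value)} for every key that differs."""
    missing = object()  # sentinel so that a stored None is not confused with "absent"
    diffs = {}
    for key in sorted(set(prod) | set(test)):
        p = prod.get(key, missing)
        t = test.get(key, missing)
        if p != t:
            diffs[key] = (
                "<absent>" if p is missing else p,
                "<absent>" if t is missing else t,
            )
    return diffs

# Hypothetical snapshots of the two environments.
prod_env = {"runtime": "3.11.4", "db_driver": "2.9.1", "tls": "1.2", "proxy": "corp-proxy:8080"}
test_env = {"runtime": "3.11.4", "db_driver": "2.9.6", "tls": "1.2"}

env_diff = diff_environments(prod_env, test_env)
for key, (p, t) in env_diff.items():
    print(f"{key}: production={p} test={t}")
```

Every difference that shows up is a candidate to eliminate through an actual test, not by assumption.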

Then I prioritize more expensive tests by likelihood and cost. Be creative. Think of really weird things that might cause the behaviour you see.

Typically speaking, "debugging" [i.e. attaching to a process and inspecting execution] is not viable, for many reasons, not the least of which is data sensitivity [e.g. developers are rarely qualified or cleared to inspect the data we manipulate].

So this usually comes down to inferring execution from secondary sources and artifacts. This then boils down to ...

  • Logging,
  • Logging,
  • Logging,

A large majority of software written these days falls into either the Java or the .NET camp, so leverage log4j or log4net respectively.

Having a bullet-proof, Ops-centric configuration guide and validation process also helps. Remember that the people responsible for the hardware and environment rarely understand the configuration requirements of the applications they are hosting.

I've used a configurable logging system such as Log4J to see what's happening in production runs. This assumes that developers have put useful debugging information in the logs.

But beware that logging might expose sensitive private data, which should be masked or left out whenever possible.
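A minimal sketch of masking at the logging layer, using a standard-library `logging.Filter`. The two regular expressions are deliberately crude assumptions made for this example (a 13 to 16 digit number, something shaped like an email address); a real system needs its own list of what counts as sensitive.

```python
import logging
import re

# Assumed patterns, for illustration only.
PATTERNS = [
    (re.compile(r"\b\d{13,16}\b"), "[CARD]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
]

def redact(text: str) -> str:
    """Replace every match of the known sensitive patterns with a placeholder."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

class RedactingFilter(logging.Filter):
    """Masks sensitive values in the fully formatted message of each record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())  # format first, then mask
        record.args = None                        # args are already merged into msg
        return True                               # never drop the record itself

handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
payments_log = logging.getLogger("payments")  # hypothetical logger name
payments_log.addHandler(handler)
payments_log.warning("charge failed for %s", "bob@example.com")
```

Attaching the filter to the handler, not to one logger, means everything written through that handler gets masked, whichever module logged it.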

Along with logging, other techniques include saving request data that you can then feed into your own "identical" system later. This could be as simple as saving every HTTP request you receive to a file for later analysis. You are likely logging much of this information already (notably the URL for GETs); you just need to add headers and request bodies to the mix as well.
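Here is one possible shape for that request capture, sketched with the standard library only. It appends one JSON object per request to a file and reads them back for replay. The function names and the file layout are assumptions made for this example; wiring `record_request` into a particular web framework is not shown. Headers and bodies often carry credentials or personal data, so the caution about sensitive data applies here too.

```python
import base64
import json
import os
import tempfile
import time
from typing import Iterator

def record_request(path: str, method: str, url: str, headers: dict, body: bytes) -> None:
    """Append one request as a JSON line; the body is base64 so binary data survives."""
    entry = {
        "ts": time.time(),
        "method": method,
        "url": url,
        "headers": dict(headers),
        "body_b64": base64.b64encode(body).decode("ascii"),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

def load_requests(path: str) -> Iterator[dict]:
    """Yield the saved requests in the order they were received."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            entry["body"] = base64.b64decode(entry.pop("body_b64"))
            yield entry

# Usage: capture two requests, then read them back for replay against a test system.
log_path = os.path.join(tempfile.mkdtemp(), "requests.jsonl")
record_request(log_path, "GET", "/orders?id=7", {"Accept": "application/json"}, b"")
record_request(log_path, "POST", "/orders", {"Content-Type": "application/json"}, b'{"id": 7}')
replayed = list(load_requests(log_path))
```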

Adding more detail to error messages is also handy. For example, when you get an exception from a routine, you can add the parameters that were used in that call to the exception message. Or, at the least, add global state information (who was logged in, what high-level module they were in, what high-level function they were calling, etc.).
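As a sketch of that idea, the decorator below re-raises any exception with the call's parameters and a snapshot of global state attached. `ContextError`, `with_call_context`, and the example state are all hypothetical names for this illustration; the original exception is kept as the cause so its traceback is not lost.

```python
import functools

class ContextError(Exception):
    """Hypothetical wrapper that carries call parameters and state in its message."""

def with_call_context(get_state=lambda: {}):
    """Decorator factory; get_state returns whatever global state is worth recording."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                raise ContextError(
                    f"{func.__name__} failed: {exc!r}; "
                    f"args={args!r} kwargs={kwargs!r} state={get_state()!r}"
                ) from exc
        return wrapper
    return decorator

# Assumed state for the example: who was logged in and which module they were in.
@with_call_context(get_state=lambda: {"user": "alice", "module": "billing"})
def compute_unit_price(total, quantity):
    return total / quantity

try:
    compute_unit_price(120, 0)
except ContextError as err:
    captured_error = err
    print(err)
```

If the parameters can contain private data, mask them before they go into the message.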

Some advice:

  • Be prepared for the bug to have multiple causes, and try not to narrow your mind to searching for just one.
  • Use an unhandled-error handler that keeps track of errors and aggregates similar defects (Graylog, ELMAH).
  • Consider post-mortem debugging with mini-dump files.
  • Set a fixed time frame for the quick and dirty approach, then switch to a systematic approach.
  • Try a code review of the defective module with one of your colleagues. A fresh view can be helpful.
  • Divide and conquer using your version control system (Git, SVN).
  • Be careful with fixes, because around 4% of all fixes end up introducing new bugs.
  • Don't let the pressure to fix a production bug quickly make you skip your standard quality control procedures (e.g. code reviews).
  • After fixing, make sure you have written automated tests in case the bug comes back some time later.
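
To make the unhandled-error handler from the list above a little more concrete, here is a toy standard-library sketch that counts uncaught exceptions by a simple fingerprint (exception type plus the file and function where it was raised). Tools such as Graylog and ELMAH do far more than this; the fingerprint scheme and the names `fingerprint`, `record_unhandled`, and `error_counts` are assumptions made for this example.

```python
import collections
import sys
import traceback

error_counts = collections.Counter()

def fingerprint(exc_type, tb) -> str:
    """Group errors by exception type and by the function that raised them."""
    frames = traceback.extract_tb(tb)
    if frames:
        where = f"{frames[-1].filename}:{frames[-1].name}"
    else:
        where = "<no traceback>"
    return f"{exc_type.__name__}@{where}"

def record_unhandled(exc_type, exc_value, tb) -> None:
    """Count the error under its fingerprint, then print the traceback as usual."""
    error_counts[fingerprint(exc_type, tb)] += 1
    sys.__excepthook__(exc_type, exc_value, tb)

# From here on, every uncaught exception passes through the counter first.
sys.excepthook = record_unhandled
```

Using the function name instead of the line number keeps the same defect in one bucket even after small edits shift the code; the trade-off is that two different failures inside one function are merged.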