Theory on error handling?

后端 未结 14 852
情歌与酒
情歌与酒 2021-01-29 18:28

Most advice concerning error handling boils down to a handful of tips and tricks (see this post for example). These hints are helpful but I think they don\'t answer all question

14条回答
  •  囚心锁ツ
    2021-01-29 18:59

    Introduction

    To understand what needs to be done for error handling, I think one needs clearly to understand the types of errors one encounters, and the contexts in which one encounters them.

    To me, it has been extremely useful to consider the two major types of errors as:

    1. Errors that should never happen, and are typically due to a bug in the code.

    2. Errors which are expected and cannot be prevented in normal operation, such as a database connection going down because of a database issue over which the application has no control.

    The way an error should be handled depends heavily on which type of error it is.

    The differing contexts which affect how errors should be handled are:

    • Application code

    • Library code

    The handling of errors in library code differs somewhat from the handling in application code.

    A philosophy for handling of the two major types of errors is discussed below. The special considerations for library code are also addressed. Finally, the specific practical questions in the original post are addressed in the context of the philosophy presented.

    Types of errors

    Programming errors - bugs - and other errors that should never happen

    Many errors are the result of programming mistakes. These errors typically cannot be corrected, since the specific programming mistake cannot be anticipated. That means we can't know in advance what condition the mistake leaves the application in, so we can't recover from that condition and shouldn't try.

    Ultimately, the fix to this kind of error is to fix the programming mistake. To facilitate that, the error should be surfaced as quickly as possible. Ideally, the program should exit immediately after identifying such an error and providing the relevant information. A quick and obvious exit reduces the time required to complete the debug and retest cycle, permitting more bugs to be fixed in the same amount of testing time; that in turn results in having a more robust application with fewer bugs when it comes time to deploy.

    The other major objective in handling this type of error should be to provide sufficient information to make it easy to identify the bug. In Java, for example, throwing a RuntimeException often provides sufficient information in the stack trace to identify the bug immediately; in clean code, immediate fixes can often be identified just from examining the stack trace. In other languages, one might log the call stack or otherwise preserve the necessary information. It is critical not to suppress information in the interests of brevity; don't worry about how much log space you are taking up when this type of error occurs. The more information that is provided, the quicker the bugs can be fixed, and the fewer bugs will remain to pollute the logs when the application makes it to production.

    Server applications

    Now, in some server applications, it's important that the server be sufficiently fault tolerant to continue operation even in the face of occasional programming errors. In this case, the best approach is to have a very clear separation between the server code that must continue operation and the task processing code that can be allowed to fail. For example, tasks can be relegated to threads or subprocesses, as is done in many web servers.

    In such a server architecture, the thread or subprocess handling the task can then be treated like an application which can fail. All the considerations above apply to such a task: the error should be surfaced as quickly as possible by a clean exit from the task, and sufficient information should be logged to permit the bug to be easily found and fixed. When such a task exits in Java, for example, the entire stack trace of any RuntimeException causing the exit should normally be logged.

    As much of the code as possible should be executed within the threads or processes handling the task, rather than in the main server thread or process. This is because any bug in the main server thread or process will still cause the entire server to go down. It's better to push the code - with the bugs it contains - into the task handling code where it won't cause a server crash when the bug manifests itself.

    Errors that are expected and cannot be prevented in normal operation

    Errors that are expected and cannot be prevented in normal operation, such as an exception from a database or other service separate from the application, require very different treatment. In these cases, the objective is not to fix the code, but rather to have the code handle the error when that makes sense, and inform users or operators who can fix the problem otherwise.

    In these cases, for example, the application may wish to throw away any results that have accumulated thus far, and retry the operation. In database access, use of transactions can help ensure that accumulated data is discarded. In other cases, it can be useful to write one's code with such retries in mind. The concept of idempotency can also be useful here.

    When automated retries won't sufficiently solve the problem, human beings should be informed. The user should be informed that the operation failed; often the user can be given the option of retrying. The user can then judge whether a retry is desirable, and can also make alterations in input that might help things go better on a retry.

    For this type of error, logging and perhaps email notices can be used to inform system operators. Unlike logging of programming errors, logging of errors that are expected in normal operation should be more succinct, since the error may happen many times and appear many times in the logs; operators will often be analyzing the pattern of many errors, rather than focusing on one individual error.

    Libraries and applications

    The above discussion of types of errors is directly applicable to application code. The other major context for error handling is library code. Library code still has the same two basic types of errors, but it typically cannot or should not communicate directly with the user, and and it has less knowledge about the application context, including whether an immediate exit is acceptable, than does the application code.

    As a result, there are differences in how libraries should handle logging, how they should handle errors that may be expected in normal operation, and how they should handle programming errors and other errors that should never happen.

    With respect to logging, the library should if possible support logging in the format desired by the client application code. One valid approach is to do no logging at all, and allow the application code to do all logging based on error information provided to the application code by the library. Another approach is to use a configurable logging interface, allowing the client application to provide the implementation for the logging, for example when the library is first loaded. In Java, for example, the library might use the logback logging interface, and allow the application to worry about what logging implementation to configure for logback to use.

    For bugs and other errors that should never happen, libraries still cannot simply exit the application, since that may not be acceptable to the application. Rather, libraries should exit the library call, providing the caller with sufficient information to help diagnose the problem. The information may be provided in the form of an exception with a stack trace, or the library may log the information if the configurable logging approach is being used. The application can then treat this as it would any other error of this type, typically by exiting, or in a server, by allowing the task process or thread to exit, with the same logging or error reporting that would be done for programming errors in the application code.

    Errors that are expected in normal operation should be also be reported to the client code. In this case, as with this type of error when encountered in the client code, the information associated with the error can be more succinct. Typically libraries should do less local handling of this type of error, relying more on the client code to decide things like whether to retry and for how many times. The client code can then pass along the retry decision to the user if desired.

    Practical questions

    Now that we have the philosophy, let's apply it to the practical questions you mention.

    • How to decide if an error should be handled locally or propagated to higher level code?

    If it is an error that is expected in normal operation, retry or possibly consult the user locally. Otherwise, propagate it to higher level code.

    • How to decide between logging an error, or showing it as an error message to the user?

    If it is an error that is expected in normal operation, and user input would be useful to determine what action to take, get user input and log a succinct message; if it seems to be a programming error, provide the user with a brief notification and log more extensive information.

    • Is logging something that should only be done in application code? Or is it ok to do some logging from library code.

    Logging from the library code should be under the control of the client code. At most, the library should log to an interface for which the client provides the implementation.

    • In case of exceptions, where should you generally catch them? In low-level or higher level code?

    Exceptions that are expected in normal operation can be caught locally and the operation retried or otherwise handled. In all other cases, exceptions should be allowed to propagate.

    • Should you strive for a unified error handling strategy through all layers of code, or try to develop a system that can adapt itself to a variety of error handling strategies (in order to be able to deal with errors from 3rd party libraries).

    The types of errors in third party libraries are the same types of errors that occur in application code. Errors should be handled primarily according to which error type they represent, with relevant adjustments for library code.

    • Does it make sense to create a list of error codes? Or is that old fashioned these days?

    Application code should provide a complete description of the error in the case of programming errors, and a succinct description in the case of errors that can occur in normal operation; in either case, a description is normally more appropriate than an error code. Libraries may provide an error code as a way of describing whether an error is a programming or other internal error, or whether the error is one which can occur in normal operation, with the latter type perhaps subdivided more finely; however, an exception hierarchy can be more useful than an error code in languages where such is possible. Note that applications run from the command line may act as libraries for shell scripts, however.

提交回复
热议问题