Erlang's let-it-crash philosophy - applicable elsewhere?

Posted by 北战南征 on 2019-11-29 19:46:31

It's applicable everywhere. Whether or not you write your software in a "let it crash" pattern, it will crash anyway, e.g., when hardware fails. "Let it crash" applies anywhere you need to withstand reality. Quoth James Hamilton:

If a hardware failure requires any immediate administrative action, the service simply won’t scale cost-effectively and reliably. The entire service must be capable of surviving failure without human administrative interaction. Failure recovery must be a very simple path and that path must be tested frequently. Armando Fox of Stanford has argued that the best way to test the failure path is never to shut the service down normally. Just hard-fail it. This sounds counter-intuitive, but if the failure paths aren’t frequently used, they won’t work when needed.

This doesn't precisely mean "never use guards," though. But don't be afraid to crash!

rvirding

Yes, it is applicable everywhere, but it is important to note the context in which it is meant to be used. It does not mean that the application as a whole crashes, which, as @PeterM pointed out, can be catastrophic in many cases. The goal is to build a system which as a whole never crashes but can handle errors internally. In our case it was telecom systems, which are expected to have downtimes on the order of minutes per year.

The basic design is to layer the system and isolate central parts of the system to monitor and control the other parts which do the work. In OTP terminology we have supervisor and worker processes. Supervisors monitor the workers, and other supervisors, and restart them in the correct way when they crash, while the workers do all the actual work. Structuring the system properly in layers, strictly separating the functionality in this way, allows you to move most of the error handling out of the workers and into the supervisors. You try to end up with a small fail-safe error kernel which, if correct, can handle errors anywhere in the rest of the system. It is in this context that the "let-it-crash" philosophy is meant to be used.
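To make the supervisor/worker split concrete outside Erlang, here is a minimal Python sketch. It is not OTP and not how OTP is implemented; `supervise`, `make_flaky_worker` and `RestartLimitExceeded` are hypothetical names made up for illustration. The point is only that the worker has no error handling at all, and the restart policy lives in one place.

```python
from typing import Callable, List


class RestartLimitExceeded(Exception):
    """Raised when a worker keeps crashing beyond the allowed restarts."""


def supervise(worker: Callable[[], str], max_restarts: int = 3) -> str:
    """Run `worker`, restarting it from scratch whenever it crashes.

    The worker contains no error handling of its own; the restart policy
    lives here, in the (hypothetical) supervisor.
    """
    crashes = 0
    while True:
        try:
            return worker()
        except Exception as exc:
            crashes += 1
            if crashes > max_restarts:
                # Escalate: a higher-level supervisor would decide what to do.
                raise RestartLimitExceeded(
                    f"gave up after {crashes} crashes"
                ) from exc


def make_flaky_worker(failures_before_success: int) -> Callable[[], str]:
    """Build a worker that crashes a fixed number of times, then succeeds."""
    attempts: List[int] = []

    def worker() -> str:
        attempts.append(1)
        if len(attempts) <= failures_before_success:
            raise RuntimeError("simulated crash")
        return f"done after {len(attempts)} attempts"

    return worker


result = supervise(make_flaky_worker(2))
print(result)
```

If the worker crashes more often than `max_restarts` allows, the supervisor itself fails, which is how the error would travel up to the next layer.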

You get the paradox that you think about errors and failures everywhere, with the goal of actually handling them in as few places as possible.

The best approach to handling an error depends, of course, on the error and the system. Sometimes it is best to catch errors locally within a process and try to handle them there, with the option of failing again if that doesn't work. If you have a number of worker processes cooperating, then it is often best to crash them all and restart them again. It is a supervisor which does this.
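The "crash them all and restart them" case can be sketched like this. OTP calls this restart strategy `one_for_all`; the Python below is only an illustration of the idea, with a hypothetical `Worker` class that crashes on chosen start counts so the behaviour is repeatable.

```python
from typing import Iterable, List


class Worker:
    """Hypothetical worker that crashes on chosen start counts."""

    def __init__(self, name: str, crash_on_starts: Iterable[int] = ()) -> None:
        self.name = name
        self.crash_on_starts = set(crash_on_starts)
        self.starts = 0

    def start(self) -> None:
        # A restart throws away whatever state the previous run had.
        self.starts += 1

    def run(self) -> None:
        if self.starts in self.crash_on_starts:
            raise RuntimeError(f"{self.name} crashed on start {self.starts}")


def run_group(workers: List[Worker], max_restarts: int = 3) -> int:
    """Run all workers; if any one crashes, restart the whole group.

    Returns the number of group restarts that were needed.
    """
    restarts = 0
    while True:
        for worker in workers:
            worker.start()
        try:
            for worker in workers:
                worker.run()
            return restarts
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise  # escalate to whoever supervises this supervisor


group = [Worker("reader"), Worker("parser", crash_on_starts=[1]), Worker("writer")]
restarts_needed = run_group(group)
print(restarts_needed, [w.starts for w in group])
```

Only the parser crashes, but all three workers end up started twice, so none of them keeps state that assumed the old parser was still alive.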

You do need a language which generates errors/exceptions when something goes wrong so you can trap them or have them crash the process. Just ignoring error return values is not the same thing.
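A small Python illustration of the difference, with two hypothetical parsing functions: one reports failure through a return value that the caller can forget to check, the other raises so the failure either gets trapped or crashes the caller.

```python
def parse_port_return_code(text: str) -> int:
    """Error-code style: returns -1 on bad input; callers may forget to check."""
    if not text.isdigit():
        return -1
    return int(text)


def parse_port_raising(text: str) -> int:
    """Exception style: bad input raises ValueError and cannot be ignored silently."""
    port = int(text)  # int() raises ValueError on non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


# Ignoring the error code: the bogus value flows on into later arithmetic.
silently_wrong = parse_port_return_code("80a") + 1

# With exceptions the caller must either trap the error or crash.
try:
    parse_port_raising("80a")
    crashed = False
except ValueError:
    crashed = True

print(silently_wrong, crashed)
```

In the first case nothing stops the program from carrying on with a meaningless port number, which is exactly the "just ignoring error return values" situation.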

I write programs that rely on data from real world situations and if they crash they can cause big $$ in physical damage (not to mention big $$ in lost revenue). I would be out of a job in a flash if I did not program defensively.

That said, I think Erlang must be a special case: not only can you restart things instantly, but a restarted program can pop up, look around and say "ahhh .. that was what I was doing!"

It is called fail-fast. It's a good paradigm provided you have a team of people who can respond to the failure (and do so quickly).

In the Navy, all pipes and electrical wiring are mounted on the exterior of a wall (preferably on the more public side of the wall). That way, if there is a leak or other issue, it is more likely to be detected quickly. In the Navy, people are punished for not responding to a failure, so it works very well: failures are detected quickly and acted upon quickly.

In a scenario where someone cannot act on a failure quickly, it becomes a matter of opinion whether it is more beneficial to allow the failure to stop the system or to swallow the failure and attempt to continue onward.

My colleagues and I thought about the topic not so much from a technology perspective as from a domain perspective, with a focus on safety.

The question is "Is it safe to let it crash?" or, better, "Is it even possible to apply a robustness paradigm like Erlang's 'let it crash' to safety-related software projects?".

To find an answer, we did a small research project using a close-to-reality scenario with an industrial, and especially medical, background. Take a look here (http://bit.ly/Z-Blog_let-it-crash). There is even a paper for download. Tell me what you think!

Personally, I think it is applicable in many cases and even desirable, especially when there is a lot of error handling to do (safety-related systems). You cannot always use Erlang (missing real-time features, no real embedded support, customer wishes ...), but I'm pretty sure you can implement it otherwise (e.g. using threads, exceptions, message passing). I haven't tried it yet, but I'd like to.
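As a rough idea of what "threads, exceptions, message passing" could look like, here is a Python sketch using only the standard `threading` and `queue` modules. It is an assumption about one possible shape, not a tested design for safety-related systems; `run_in_thread`, `supervise_thread` and `flaky_task` are hypothetical names.

```python
import queue
import threading
from typing import Callable, Optional


def run_in_thread(
    task: Callable[[int], str], attempt: int, mailbox: "queue.Queue"
) -> threading.Thread:
    """Start `task` in its own thread and report its fate as a message."""

    def body() -> None:
        try:
            mailbox.put(("ok", task(attempt)))
        except Exception as exc:
            # The thread dies here; the supervisor learns about it by message.
            mailbox.put(("crashed", exc))

    thread = threading.Thread(target=body, daemon=True)
    thread.start()
    return thread


def supervise_thread(task: Callable[[int], str], max_restarts: int = 3) -> str:
    """Restart the worker thread until it succeeds or the limit is hit."""
    mailbox: "queue.Queue" = queue.Queue()
    last_error: Optional[BaseException] = None
    for attempt in range(1, max_restarts + 2):
        thread = run_in_thread(task, attempt, mailbox)
        status, payload = mailbox.get(timeout=5)
        thread.join()
        if status == "ok":
            return payload
        last_error = payload
    raise RuntimeError("worker kept crashing") from last_error


def flaky_task(attempt: int) -> str:
    """Hypothetical task that crashes on its first two attempts."""
    if attempt < 3:
        raise ValueError(f"attempt {attempt} failed")
    return f"finished on attempt {attempt}"


outcome = supervise_thread(flaky_task)
print(outcome)
```

Unlike Erlang processes, Python threads share memory, so a crashed thread can leave shared state half-updated. The sketch sidesteps that by giving the worker no shared state except the mailbox.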

IMHO, some developers handle or wrap checked exceptions with code which adds little value. It is often simpler to allow a method to throw the original exception unless you are going to handle it and add some value.
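Python has no checked exceptions, so the following is only an analogy, but the same trade-off shows up. The three loader functions and `ConfigError` are hypothetical; `json.loads` and `json.JSONDecodeError` are from the standard library.

```python
import json


class ConfigError(Exception):
    """Hypothetical domain error that carries extra context."""


def load_wrapped_no_value(text: str) -> dict:
    """Wrapping that adds nothing: the specific error type is replaced by a vague one."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise Exception("error") from exc


def load_plain(text: str) -> dict:
    """Simplest option: let the original exception propagate to the caller."""
    return json.loads(text)


def load_with_context(text: str, source: str) -> dict:
    """Wrapping that earns its keep: it says which input was bad and where."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ConfigError(f"{source}: invalid JSON at line {exc.lineno}") from exc


try:
    load_with_context("{", "app.json")
    message = ""
except ConfigError as err:
    message = str(err)

print(message)
```

`load_plain` is shorter than `load_wrapped_no_value` and tells the caller more, since the caller still sees the original exception type.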
