Prevent DynamicSupervisor from shutting down if a child reaches max_restarts

Svilen Source

I have a DynamicSupervisor that starts children with restart: :transient. By default, if a child exits abnormally, it will be restarted by the supervisor.

However, by design, if the child has to be restarted more than max_restarts times within max_seconds (by default, 3 restarts within 5 seconds), the supervisor itself will exit. From the docs:

Notice that supervisor that reached maximum restart intensity will exit 
with :shutdown reason. In this case the supervisor will only be restarted
if its child specification was defined with the :restart option set to :permanent
(the default).
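For reference, the restart intensity mentioned in the quote is controlled by the :max_restarts and :max_seconds options. A minimal sketch with the defaults spelled out, assuming a module-based supervisor (the name JobSupervisor is made up for illustration):

```elixir
defmodule JobSupervisor do
  # Hypothetical module name, used only for this illustration.
  use DynamicSupervisor

  def start_link(opts \\ []) do
    DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    # These are the default limits: more than 3 restarts within
    # 5 seconds makes the supervisor itself exit with :shutdown.
    DynamicSupervisor.init(strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end
```

Raising these values only postpones the shutdown; it does not isolate a failing child from its siblings.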

Since killing the supervisor will also kill other children (background jobs that are in progress) I would like to avoid this scenario.

The question is: after reaching max_restarts, how can I kill the failing child process, preserving the supervisor and its other children?

Using Elixir 1.6 / OTP 20.

Update: I found this answer on StackOverflow that essentially suggests that the top-level DynamicSupervisor launches a DynamicSupervisor for each child; the top-level will start the child supervisors with restart: :permanent or :temporary. That's a good workaround, but I'd be interested if there is another solution.
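A rough sketch of that workaround, as I understand it. The module names (Jobs, Jobs.Worker, Jobs.TopSupervisor) are placeholders, and the worker is a stub standing in for a real background job:

```elixir
defmodule Jobs.Worker do
  # Placeholder worker; :transient means it is restarted only after
  # an abnormal exit.
  use GenServer, restart: :transient

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg), do: {:ok, arg}
end

defmodule Jobs do
  # Top-level supervisor: its children are per-job supervisors.
  def start_top do
    DynamicSupervisor.start_link(strategy: :one_for_one, name: Jobs.TopSupervisor)
  end

  # Each job gets its own single-child supervisor. If the worker
  # exceeds the restart limit, only this small supervisor exits.
  # Because it is :temporary, the top-level supervisor neither
  # restarts it nor counts the exit towards its own limit.
  def start_job(arg) do
    sup_opts = [strategy: :one_for_one, max_restarts: 3, max_seconds: 5]

    spec = %{
      id: :job_supervisor,
      start: {Supervisor, :start_link, [[{Jobs.Worker, arg}], sup_opts]},
      type: :supervisor,
      restart: :temporary
    }

    DynamicSupervisor.start_child(Jobs.TopSupervisor, spec)
  end
end
```

Calling Jobs.start_top() once and then Jobs.start_job(arg) per job gives every job its own restart budget, at the cost of one extra process per job.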



answered 8 months ago Alexei Sholik #1

DynamicSupervisor adheres to the same restart policy as the regular Supervisor and it works the way it does for a good reason. Instead of trying to work around this behaviour we need to understand why it is the way it is.

Understanding supervisor’s purpose

A supervisor monitors its children and in case an unexpected failure brings any of them down, it will restart it with a known initial state. The key to understanding the rationale behind restart limits lies in the definition of unexpected failures.

Unexpected here does not mean something you hadn’t thought about before pushing untested code to production. It’s something that only happens in rare circumstances which are difficult to simulate during normal testing, something that’s difficult to reproduce and that does not happen very often.

Catching such failures is difficult even with the default limit of 3 restarts within 5 seconds. In fact, this limit is way too conservative for live systems. I think it’s mostly useful for catching bugs early in development. When a bug is causing a process to shut down immediately or soon after being started, it won’t take long before it reaches 3 restarts and causes its supervisor to die. At that point you should look for the bug and fix it.

A different way to fail

Assuming you do test your code and are still observing processes die regularly, you’re probably experiencing a different kind of failure – an expected one. I highly suggest reading Fred Hebert's article It's About the Guarantees which covers in great detail the way supervisors should be used and the guarantees they’re supposed to provide. A very brief and abridged version of it:

Supervised processes provide guarantees in their initialization phase, not a best effort. This means that when you're writing a client for a database or service, you shouldn't need a connection to be established as part of the initialization phase unless you're ready to say it will always be available no matter what happens.

If you do require a connection to the database to be established in a process’s init/1 callback, failing to connect then really does mean the process cannot function and should die. When it’s restarted by the supervisor yet keeps failing, that does indeed mean the whole supervision tree cannot function correctly and should die. This continues recursively until the root supervisor is reached and the whole system goes down.

Now, Elixir provides a lot of solutions to various problems like this out of the box. In a way this is really nice, but it also often makes those problems invisible, leaving newcomers unaware of their existence. For example, Ecto depends on db_connection under the hood to provide a default exponential backoff when a connection to the database cannot be established. This behaviour is described in db_connection’s docs.
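To make the idea concrete, here is a hand-rolled sketch of a process that guarantees nothing about the connection in init/1 and instead connects afterwards, retrying with exponential backoff. This is not how db_connection is implemented; LazyClient, the backoff values, and connect_fun (a zero-arity function returning {:ok, conn} or {:error, reason}) are assumptions made for the example:

```elixir
defmodule LazyClient do
  use GenServer

  # Illustrative values, in milliseconds.
  @initial_backoff 100
  @max_backoff 5_000

  # `connect_fun` is a hypothetical zero-arity function returning
  # {:ok, conn} or {:error, reason}.
  def start_link(connect_fun) when is_function(connect_fun, 0) do
    GenServer.start_link(__MODULE__, connect_fun)
  end

  def status(pid), do: GenServer.call(pid, :status)

  @impl true
  def init(connect_fun) do
    # Only promise what init can always deliver: a process in a
    # known, disconnected state. The connection attempt comes later.
    send(self(), :connect)
    {:ok, %{connect_fun: connect_fun, conn: nil, backoff: @initial_backoff}}
  end

  @impl true
  def handle_call(:status, _from, %{conn: nil} = state) do
    {:reply, :disconnected, state}
  end

  def handle_call(:status, _from, state) do
    {:reply, :connected, state}
  end

  @impl true
  def handle_info(:connect, state) do
    case state.connect_fun.() do
      {:ok, conn} ->
        {:noreply, %{state | conn: conn, backoff: @initial_backoff}}

      {:error, _reason} ->
        # Retry later and double the delay, up to a ceiling.
        Process.send_after(self(), :connect, state.backoff)
        {:noreply, %{state | backoff: min(state.backoff * 2, @max_backoff)}}
    end
  end
end
```

With this shape, an unavailable service leaves the process alive in a disconnected state instead of crashing it and eating into the supervisor's restart limit.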

So what should you do?

Going back to your problem, at this point it should be clear that another approach has to be employed for a process that can fail often when no bug is causing the failures. You need to acknowledge that its failure is expected and handle it explicitly in your code.

Perhaps your process depends on an external service that may occasionally be unavailable. In that case, you need to use a circuit breaker. There’s one written in Erlang called fuse which is nicely described by its author in this comment on Hacker News.
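To show the principle only, here is a toy circuit breaker built on an Agent. It is not fuse's API; the Breaker module, its thresholds, and the convention that the wrapped function returns {:ok, value} or {:error, reason} are all assumptions made for this sketch:

```elixir
defmodule Breaker do
  # Illustrative thresholds, not taken from any library.
  @max_failures 3
  @reset_after_ms 1_000

  def start_link(_opts \\ []) do
    Agent.start_link(fn -> %{failures: 0, opened_at: nil} end)
  end

  # `fun` must return {:ok, value} or {:error, reason}.
  def call(breaker, fun) when is_function(fun, 0) do
    if open?(breaker) do
      # Fail fast without touching the external service.
      {:error, :circuit_open}
    else
      case fun.() do
        {:ok, _value} = ok ->
          Agent.update(breaker, fn _ -> %{failures: 0, opened_at: nil} end)
          ok

        {:error, _reason} = error ->
          Agent.update(breaker, &record_failure/1)
          error
      end
    end
  end

  defp open?(breaker) do
    Agent.get_and_update(breaker, fn
      %{opened_at: nil} = state ->
        {false, state}

      %{opened_at: opened_at} = state ->
        if now() - opened_at >= @reset_after_ms do
          # Cool-down elapsed: close the circuit and start over.
          {false, %{failures: 0, opened_at: nil}}
        else
          {true, state}
        end
    end)
  end

  defp record_failure(%{failures: n} = state) when n + 1 >= @max_failures do
    %{state | failures: n + 1, opened_at: now()}
  end

  defp record_failure(%{failures: n} = state), do: %{state | failures: n + 1}

  defp now, do: System.monotonic_time(:millisecond)
end
```

A caller would wrap each request, for example Breaker.call(breaker, fn -> fetch() end) with fetch being whatever talks to the service, and treat {:error, :circuit_open} as "service unavailable, try again later" rather than crashing.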

Netflix has a blog post showcasing the use of circuit breakers in their API which receives a pounding of billions of requests on a daily basis. That’s a mind-boggling scale and it’s even bigger now since that post is from 2011!

If that’s still not the kind of failure you’re experiencing, then, perhaps, you run untrusted code that cannot be relied on? Wrap it in a try-rescue block and return errors as values instead of relying on the supervisor to magically handle them for you.
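A minimal sketch of that last suggestion, assuming the untrusted code can be passed in as a zero-arity function (SafeRun is a made-up name):

```elixir
defmodule SafeRun do
  # Runs `fun` and turns any failure into a value instead of
  # letting it crash the calling process.
  def run(fun) when is_function(fun, 0) do
    try do
      {:ok, fun.()}
    rescue
      # Raised exceptions, e.g. RuntimeError or ArgumentError.
      exception -> {:error, exception}
    catch
      # Throws and exits are not exceptions and need `catch`.
      kind, value -> {:error, {kind, value}}
    end
  end
end
```

The caller can then pattern match on {:ok, result} or {:error, reason} and decide what to do, while the process and its supervisor stay untouched.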

I hope this helps.
