I love the Elixir language and the Phoenix framework. They are so good that we switched our consultancy entirely from writing Ruby on Rails to Elixir (and Phoenix), and we recommend them to our customers in the majority of cases, for both traditional applications and single-page app back-ends (we love GraphQL too!).
Things, however, fail on occasion. This is okay, and - in fact - you might have heard about Erlang’s “let it fail” philosophy of not writing a lot of defensive code and allowing things to crash.
While it is tempting to go this route with an Elixir/Phoenix web application and simply let the web workers crash on every error they experience, this is not always acceptable from a business perspective.
There is an array of cases where you want to, or are legally required to, keep a close eye on what is happening in your system. For most production systems, you will likely need to display friendly error pages to users, with an option to go back to the working part of the site, maybe a retry button, or information on how to report an issue. For systems processing sensitive personal information, you may need to record such processing along with metadata on who accessed which subset of data, and when. These two things are fairly straightforward to do with the happy path in mind (no errors) and a bit of defensive programming (knowing which errors to expect), but they get significantly harder if - for example - you need to track all requests no matter what type of crash happened along the way.
In this article we will have a look at how Phoenix, Plug and Cowboy serve your HTTP requests. We will also investigate the built-in defensive mechanisms in both Plug and Phoenix, and how we can utilize them if we need to. Finally, we will see how we can increase the robustness of our solution by handling things we never expected to happen in the first place.
How Phoenix serves HTTP requests
Your Phoenix application, just like any Cowboy & Plug based web application, starts a number of listeners and acceptors (managed by a library called Ranch), which are processes responsible for handling the underlying TCP connections. By default, a Phoenix application starts 100 acceptors, but in our example we reduced that to 4 for better visibility, see Figure 1.
When a web browser makes a request, Cowboy spawns another process - a handler - in the context of which all the Plugs are executed (including the Phoenix endpoint, router, controller, view and template code) - highlighted on Figure 2.
Since it’s a single process which executes a lot of framework-provided code and all of the user-provided HTTP request handling code, uncaught exceptions will bring it down. This results in Phoenix rendering descriptive error pages by default in the dev environment, while those details are hidden from the end user in prod. The programmer can also customize how to respond to such errors by providing custom error pages. In order to facilitate that behaviour, Phoenix installs its own try-catch-and-reraise section around the controller’s code - and in each pipeline. All captured exceptions are re-raised in wrapped form - as Plug.Conn.WrapperError.
If you check the relevant code in Phoenix.Controller.Pipeline and follow it a bit, you will clearly see how Phoenix wraps the raised exceptions. The question is: why does it do that? Well, there are a few different reasons, but the major one is that re-raised, wrapped exceptions carry the saved Plug.Conn with them - something that ordinary Elixir exceptions would not do.
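If you ever need to rescue one of these wrapped exceptions yourself, a minimal sketch (assuming a conn already in scope) could look like this - the Plug.Conn.WrapperError struct exposes conn, kind, reason and stack fields:

  try do
    AppWeb.Router.call(conn, AppWeb.Router.init([]))
  rescue
    e in Plug.Conn.WrapperError ->
      # e.conn is the Plug.Conn as it looked at the last wrapping point
      IO.inspect(e.conn.assigns)
      # hand the original error back, preserving kind/reason/stacktrace
      Plug.Conn.WrapperError.reraise(e)
  end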
If we go back to pondering the happy path, there are also certain things Phoenix & Plug do after the user’s code has executed. There are a few ways to do it, but a common solution is to use Plug.Conn.register_before_send/2 to provide a callback function which will be executed after your controller and all the Plug code has run. This is precisely how Plug’s Logger module logs the response status and the time it took to execute the action to the console/logs.
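As a minimal sketch of that mechanism (the plug name and the log message are made up for illustration), this is how such a callback can be registered from any plug function:

  def log_response_status(conn, _opts) do
    Plug.Conn.register_before_send(conn, fn conn ->
      # runs just before the response is sent - on the happy path only
      IO.inspect("Responding with status #{conn.status}")
      conn
    end)
  end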
The important thing to learn about Phoenix/Plug web request handlers is that Plug’s, Phoenix’s and the developer’s own code all execute in a single Elixir process. That process is directly started and internally supervised by Cowboy (via the Ranch library). It’s not a process started and managed by Plug nor Phoenix itself, and this is the source of a few gotchas we are going to discuss below.
Standard ways of handling errors
Having a misbehaving action such as:
  plug :set_current_user

  ...

  def index(conn, _params) do
    raise "oopsie"
    render conn, "index.html"
  end

  defp set_current_user(conn, _) do
    conn |> assign(:current_user, "Hubert")
  end

  ...
You can react to such a problem by displaying a custom error page pretty easily.
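A minimal sketch of what that typically involves in Phoenix (assuming your app is named App): point the endpoint’s render_errors configuration at your own error view, and define render clauses for the status codes you care about:

  config :app, AppWeb.Endpoint,
    render_errors: [view: AppWeb.ErrorView, accepts: ~w(html json)]

  defmodule AppWeb.ErrorView do
    use AppWeb, :view

    # rendered for unhandled exceptions resulting in a 500 response
    def render("500.html", _assigns) do
      "Something went wrong on our end - please try again."
    end
  end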
What if, however, you need to do more than that?
While there are a few global ways of capturing exceptions and crashes in Erlang (and Elixir), and I have a strong feeling that the upcoming Erlang 21 :logger integration will be my favourite, sometimes you want to capture more information along with the errors themselves. When handling requests, that information is generally found in the Plug.Conn struct.
Let’s say you want to report errors as they happen to some central exception collecting utility, or maybe send an error report to the user over a Phoenix channel, and you want to know exactly which user the error happened to.
We seem to be lucky, as we are setting conn.assigns.current_user already, so we know who the current user is. Let’s use Plug.ErrorHandler to log it to the console for now.
First, you need to add it to your router (or to the Endpoint, to also collect exceptions thrown from Plugs mounted in pipelines, the router, or the Endpoint itself):
  defmodule AppWeb.Router do
    use AppWeb, :router
    use Plug.ErrorHandler

    def handle_errors(conn, %{kind: _kind, reason: _reason, stack: _stack} = err) do
      IO.inspect "Something happened when handling web request for the user:"
      IO.inspect conn.assigns.current_user
      IO.inspect "The error was:"
      IO.inspect err
      conn
    end

    ...
As you can see, we attempt to catch the raised exception, log it to the console, and return the unchanged Plug.Conn. We could have modified the response status code or body here, but that could interfere with the error pages we set up earlier.
"Something happened when handling web request for the user:"
"Hubert"
"The error was:"
%{
kind: error,
reason: %RuntimeError{message: "oopsie"},
stack:
...
Plug.ErrorHandler also re-raises the exception after it has been caught and processed by your handle_errors/2 function, ensuring the behaviour of the application hasn’t changed just because you used it.
This is only true, however, if the behaviour of handle_errors/2 can be clearly predicted. You should not do too much work in this function, to avoid crashing your handlers with new, different exceptions. It gets a bit hairy to debug if you do, and you most likely want to employ a technique similar to the one AppSignal’s Plug uses internally - starting a dedicated error reporting process (or even an OS-level process, as they do).
If you want to ensure your error reporting does not alter the behaviour of the web request handlers in any way, you may want to start a separate process to send the error reports: use GenServer(s) that you send asynchronous error messages to with cast for further reporting. AppSignal even uses dedicated OS-level processes to collect and report the exceptions in bulk to their back-ends.
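A minimal sketch of that idea - the ErrorReporter name and its report/2 function are made up for illustration, and you would start this GenServer in your application’s supervision tree:

  defmodule ErrorReporter do
    use GenServer

    def start_link(_opts) do
      GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
    end

    # cast returns immediately, so a crash while reporting
    # never takes the web request handler down with it
    def report(conn, error) do
      GenServer.cast(__MODULE__, {:report, conn, error})
    end

    @impl true
    def init(:ok), do: {:ok, %{}}

    @impl true
    def handle_cast({:report, conn, error}, state) do
      # ship the error to your external service here
      IO.inspect({conn.assigns, error})
      {:noreply, state}
    end
  end

With this in place, the body of handle_errors/2 shrinks to a single ErrorReporter.report(conn, err) call, and any failure while reporting stays in the reporter process.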
Keeping an eye on contents of Plug.Conn
In the example above, we were reporting conn.assigns.current_user to the console when an error happened. We were lucky, because we were setting the variable in a plug function separate from the controller’s action. If we slightly changed our code, however, we would end up in some trouble:
  def index(conn, _params) do
    conn = conn |> assign(:current_user, "Hubert")
    raise "oopsie"
    render conn, "index.html"
  end
This code crashes badly, and our exception tracking is now broken:
[info] GET /
[debug] Processing with AppWeb.PageController.index/2
Parameters: %{}
Pipelines: [:browser]
"Something happened when handling web request for the user:"
[error] #PID<0.393.0> running AppWeb.Endpoint (cowboy_protocol) terminated
Server: localhost:4000 (http)
Request: GET /
** (exit) an exception was raised:
** (KeyError) key :current_user not found in: %{}
Moreover, while the user did see our custom error page, we would not see the corresponding Logger message in our application logs. The missing key error caused the protocol handler process to exit, aborting execution of any callbacks registered with Plug.Conn.register_before_send/2. This is precisely why you want to avoid doing too much work in this error handling function in the same process, and delegate the risky work to something else.
Let’s get back to the question at hand, however: why is conn.assigns.current_user missing in the first place? The answer is: immutability. Elixir’s data structures are immutable by design, and setting current_user does not alter the existing conn struct but creates a brand new one. The handler crashed before the new struct ever made it back to the Plug pipeline, so the error handler only saw the conn from before our assignment.
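A quick illustration of that property in isolation:

  # assign/3 returns a new struct; the original binding is untouched
  conn2 = Plug.Conn.assign(conn, :current_user, "Hubert")
  conn.assigns[:current_user]   # => nil
  conn2.assigns[:current_user]  # => "Hubert"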
Remember when we said that Phoenix solves this by installing a try/rescue clause and wrapping the exceptions in Plug.Conn.WrapperError? This happens just before our controller’s action is executed, but also when each of the defined pipelines is entered - and on these two occasions only.
This means it’s good practice to extract figuring out assigns into separate plugs, but - in case you want to provide more checkpoints at which the Plug.Conn gets saved - you need to create more pipelines too. Instead of one large :browser pipeline mounted in your router, you may want to split it into some framework-related code, an authentication pipeline, then a pipeline which finds resources - so the Plug.Conn is “saved” and can be rescued without losing much data at each step; see the router sketch below.
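A sketch of what such a split could look like in the router - the :auth and :load_resources pipeline names and their plugs are made up for illustration:

  pipeline :browser do
    plug :accepts, ["html"]
    plug :fetch_session
    plug :protect_from_forgery
  end

  pipeline :auth do
    plug :set_current_user
  end

  pipeline :load_resources do
    plug :load_post
  end

  scope "/", AppWeb do
    # the conn is wrapped (and thus "saved") on entering each pipeline
    pipe_through [:browser, :auth, :load_resources]

    get "/", PageController, :index
  end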
If you really, really want to provide your own checkpoint at the controller level (or deeper in your code), you can, however, by forcing a conn savepoint yourself:
  def index(conn, _params) do
    conn = conn |> assign(:current_user, "Hubertos")

    try do
      raise "oopsie"
    catch
      :error, reason ->
        Phoenix.Controller.Pipeline.__catch__(
          conn, reason, __MODULE__,
          conn.private.phoenix_action, System.stacktrace()
        )
    end

    render conn, "index.html"
  end
You probably don’t want to do that, and if you do, you probably want to wrap it up in a nice helper function. Please do keep in mind that you are being naughty by doing it, as you are using the framework’s undocumented internals in your code. You have been warned.
Understanding how Phoenix sets up custom try/catch clauses may also come in handy when you are concerned with other fields of Plug.Conn, not just assigns, as things like the response body or status may get lost by default when handling exceptions as well.
Beware of concurrency & links
Exceptions in Elixir (either raised exceptions or thrown values) that can be caught are a likely, but not the only, reason for a process being terminated. Another reason for your handler being shut down prematurely is a crash of one of its linked processes.
If you are thinking about starting linked processes from your controller’s actions, I would suggest you think again about whether this is a good solution to the problem. It likely is not, and you may be better off with a background job processing library such as exq, a dedicated pool of GenServer workers that you schedule work for asynchronously with cast, or a dedicated pool of tasks supervised with Task.Supervisor - a sketch of the latter follows.
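A minimal sketch of the Task.Supervisor approach - the App.TaskSupervisor name and fetch_file_from_s3/0 are made up for illustration, and the supervisor would live in your application’s supervision tree:

  # in your application's children list:
  #   {Task.Supervisor, name: App.TaskSupervisor}

  def index(conn, _params) do
    # async_nolink runs the work under the supervisor without
    # linking it to the handler, so a crash there cannot kill us
    task = Task.Supervisor.async_nolink(App.TaskSupervisor, fn ->
      fetch_file_from_s3()
    end)

    case Task.yield(task, 5_000) || Task.shutdown(task) do
      {:ok, body} -> render(conn, "index.html", body: body)
      _ -> send_resp(conn, 502, "upstream error")
    end
  end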
For certain types of use cases, however, you may still consider and go with a design where you do start linked processes from your web request handler. This can be tempting, for example, if you want to fetch a file from S3, or make an HTTP call to some other external service and then return a (maybe processed) value to the user. You have to be really careful with those cases, because crashing linked processes results in your handlers, and the Phoenix and Plug pipeline, misbehaving badly.
To simulate the issue, let’s amend our action:
  def index(conn, _params) do
    spawn_link fn -> raise "oops" end
    render conn, "index.html"
  end
What happens when a user visits such a page?
The user sees a blank page with a “This page isn’t working” message in the browser.
If we have a look at the console, we will see even weirder stuff:
[info] GET /
[debug] Processing with AppWeb.PageController.index/2
Parameters: %{}
Pipelines: [:browser]
[info] Sent 200 in 260µs
[error] Process #PID<0.436.0> raised an exception
** (RuntimeError) oops
(app) lib/app_web/controllers/page_controller.ex:7: anonymous fn/0 in AppWeb.PageController.index/2
[error] Ranch protocol #PID<0.434.0> of listener AppWeb.Endpoint.HTTP (cowboy_protocol) terminated
** (exit) an exception was raised:
** (RuntimeError) oops
(app) lib/app_web/controllers/page_controller.ex:7: anonymous fn/0 in AppWeb.PageController.index/2
Oh, so it crashed with an error and returned 200 OK to the user? Nope.
Our simulation code is sort of broken, but this case can actually happen in real life if we are not careful. The crash in the spawned process brings down the handler, but only after the rest of our controller code has executed and the response has been sent back to the user by Cowboy. This is not good, but more realistically we are going to be waiting in the controller for the spawned process when it shuts down. Let’s simulate that better:
  def index(conn, _params) do
    spawn_link fn -> raise "oops" end
    :timer.sleep(:infinity)
    render conn, "index.html"
  end
What happens now?
The user sees the same blank page.
But in our console, the usual log line reporting how the request was handled (the “Sent 200” line) is missing entirely:
[info] GET /
[debug] Processing with AppWeb.PageController.index/2
Parameters: %{}
Pipelines: [:browser]
[error] Process #PID<0.455.0> raised an exception
** (RuntimeError) oops
(app) lib/app_web/controllers/page_controller.ex:7: anonymous fn/0 in AppWeb.PageController.index/2
[error] Ranch protocol #PID<0.453.0> of listener AppWeb.Endpoint.HTTP (cowboy_protocol) terminated
** (exit) an exception was raised:
** (RuntimeError) oops
(app) lib/app_web/controllers/page_controller.ex:7: anonymous fn/0 in AppWeb.PageController.index/2
Which sort of makes sense, as you can see the RuntimeError in the crash report. Again, this is another reason why you should not trust Plug.Logger to output useful information on each failure. If you are only monitoring those log lines, you are doing things wrong.
Our handle_errors/2 function that we installed in the router earlier hasn’t been called at all, not to mention that no custom error page was rendered. It’s as if everything just stopped abruptly - and that’s precisely what happened.
When a linked process crashes, the parent process receives an exit signal. If that signal is not handled and the process is not trapping exits, the parent process is brought down as well. There is no exception and no error handling involved at all.
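You can observe both behaviours in a quick iex session:

  # without trapping exits, this crashes the calling process:
  spawn_link(fn -> raise "oops" end)

  # with trapping, the exit signal arrives as a plain message instead:
  Process.flag(:trap_exit, true)
  spawn_link(fn -> raise "oops" end)
  receive do
    {:EXIT, pid, reason} -> IO.inspect({pid, reason})
  end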
The standard way of dealing with such issues in the Erlang/OTP world is to rely on (possibly custom) supervisors to detect and react to such system failures.
As mentioned in the first part of this article, however, the handlers are started directly by Cowboy (or rather Ranch), and neither Plug nor Phoenix has any power over them.
How do we react to such events then? There surely must be a way.
Linking processes itself provides one possible answer. Since linking is bi-directional, we can start a dedicated process early in the Plug pipeline, whose responsibility is precisely to listen and react to the handler’s exit signals. We can fight fire with fire, can’t we?
Conceptually, our supervision tree will end up having one more dedicated process per request, started by and linked to the web request handler:
Let’s implement it. First, let’s modify our controller:
  defmodule AppWeb.PageController do
    use AppWeb, :controller

    plug :set_current_user
    plug :monitor_crashes

    def index(conn, _params) do
      spawn_link fn -> raise "oops" end
      :timer.sleep(:infinity)
      render conn, "index.html"
    end

    defp monitor_crashes(conn, _) do
      {:ok, _pid} = CrashesMonitor.start_link(conn)
      conn
    end

    defp set_current_user(conn, _) do
      conn |> assign(:current_user, "Hubert")
    end
  end
We will base our CrashesMonitor on a GenServer. We’ll pass it the Plug.Conn on initialization, to make sure it’s fairly up to date when a crash happens. This could be extended further by adding an explicit update function in the future, if we need to provide later savepoints for better tracking of what’s going on.
  defmodule CrashesMonitor do
    use GenServer

    def start_link(conn) do
      # started from the handler process, so the two become linked
      GenServer.start_link(__MODULE__, conn)
    end

    @impl true
    def init(conn) do
      # trap exits so the handler's crash arrives as a signal we can
      # act upon in terminate/2, instead of killing us too
      Process.flag(:trap_exit, true)
      {:ok, %{conn: conn}}
    end

    @impl true
    def terminate(:normal, _), do: :ok

    def terminate(reason, state) do
      IO.inspect "Web request handler crashed! Here's the conn I remember:"
      IO.inspect(state.conn)
      IO.inspect "And the reason for exit:"
      IO.inspect(reason)
    end
  end
When a user visits the page now, they will still - unfortunately - see a blank page. After inspecting the server logs, you will find out, however, that we were able to catch the exception that brought down the handler, and that we also retained the state of Plug.Conn:
"Web request handler crashed! Here's the conn I remember:"
%Plug.Conn{
adapter: {Plug.Adapters.Cowboy.Conn, :...},
...
"And the reason for exit:"
{%RuntimeError{message: "oops"},
{AppWeb.PageController, :"-index/2-fun-0-", 0,
[file: 'lib/app_web/controllers/page_controller.ex', line: 33]}
]}
Everything is saved, including the state of Plug.Conn just before the controller’s action kicks in, the stacktrace, and the exception message! In an alternative (possibly better and more robust) implementation, you could have a single GenServer act as a registry of currently running web workers, which listens to :DOWN signals from them to detect failures. This is precisely how AppSignal implemented it in their monitoring software.
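A minimal sketch of that alternative - the RequestRegistry name and its API are made up for illustration. Monitoring (unlike linking) is one-directional, so the registry observes handlers without being tied to their fate:

  defmodule RequestRegistry do
    use GenServer

    def start_link(_opts) do
      GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
    end

    # called from a plug; registers the calling handler process
    def track(conn) do
      GenServer.call(__MODULE__, {:track, self(), conn})
    end

    @impl true
    def init(:ok), do: {:ok, %{}}

    @impl true
    def handle_call({:track, pid, conn}, _from, state) do
      ref = Process.monitor(pid)
      {:reply, :ok, Map.put(state, ref, conn)}
    end

    @impl true
    def handle_info({:DOWN, ref, :process, _pid, reason}, state) do
      # a :DOWN message arrives for every tracked handler; report crashes only
      {conn, state} = Map.pop(state, ref)

      unless reason == :normal do
        IO.inspect {"Handler crashed", conn && conn.assigns, reason}
      end

      {:noreply, state}
    end
  end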
Outro
That’s all for now, I hope it’s been useful!
Post by Hubert Łępicki
Hubert is a partner at AmberBit. Rails, Elixir and functional programming are his areas of expertise.