I love Elixir language and Phoenix framework. It is so good, that we switched our consultancy entirely from writing Ruby on Rails to Elixir (and Phoenix), and we recommend it to our customers in majority of the cases for both: traditional applications and signle-page app back-ends (we love GraphQL too!).
Things, however, on occassion fail. This is okay, and - in fact - you might have heard about Erlang’s “let it fail” philosophy of not writing a lot of defensive code and allowing things to crash.
While tempting to go this route with Elixir/Phoenix web application, and just crash the web workers badly on each error they experience, this is not always acceptable from business perspective.
There is an array of cases where you want to, or you are legally required to keep a close eye on what is happening in your system. For most of the production systems you are likely in need to display friendly error pages to the users, with an option to go back to the working part of the page, maybe a retry button or information on how to report an issue. For systems processing sensitive personal information, you may need to record such processing along with metadata on who, when and which subset of data accessed. While it’s fairly straightforward thing to do these two things having in mind happy path (no errors) and a bit of defensive programming (but knowing which errors to expect), things get significantly harder to handle if - for example - you need to track all requestes no matter what type of crash happened along the way.
In this article we will have a look on how Phoenix, Plug and Cowboy serve your HTTP requests. We will also investigate what are the built-in defensive mechanisms in both Plug and Phoenix, and how we can utilize them if we need to. Finally, we will see how we can increase the robustness of our solution by handling things we never expected might happen in first place.
How Phoenix serves HTTP requests
Your Phoenix application, just like any Cowboy & Plug based web application, starts number of listeners and acceptors (managed by library called Ranch), which are processes responsible for handling underlying TCP connections. By default, a Phoenix application would start 100 acceptors, but in our example we reduced it to 4 for better visibility, see Figure 1.
When web browser makes a request, Cowboy spawns another process - handler, in context of which all the Plugs are executed (including Phoenix endpoint, router, controller’s, views and templates code) - highlighted on Figure 2.
Since it’s a single process which executes a lot of framework-provided, and all of the user-provided HTTP request handling code, uncatched exceptions will bring it down. This results in Phoenix rendering descriptive error pages by default in dev
environment, but these details are being hidden from end user for prod
. Programmer can also customize how to respond to such errors by providing custom error pages. In order to facilitate that behaviour, Phoenix installs own try-catch-and-reraise
section around controller’s code - and in each pipeline. All captured exceptions are being re-raised in wrapped form - as Plug.Conn.WrapperError
.
If you check the relevant
code
in Phoenix.Controller.Pipeline
and follow it a
bit,
you will clearly see how Phoenix is wrapping the raised exceptions. The
questions is: why does it do it?
Well, there are a few different reasons, but the major one is that
re-raised, wrapped exceptions, carry on the saved Plug.Conn
with them
- something that ordinary Elixir exceptions would not do.
If we go back to pondering about happy path, there are also certain things Phoenix & Plug do after user’s code is executed. There are a few ways to do it, but a common solution is to use Plug.Conn.register_before_send/2) to provide callback function which will be executed after your controller and all the Plug’s code executed. This is precisely how Plug’s Logger module handles logging response status and time it took to execute action to the console/logs.
The important thing to learn about Phoenix/Plug web request handlers is that code of Plug, Phoenix and developer’s own - all execute in single Elixir process. It is directly started and internally supervised by Cowboy (via Ranch library). It’s not a process started and managed by Plug nor Phoenix itself, and it might be the source of a few gotchas we are going to discuss below.
Standard ways of handling errors
Having a misbehaving action such as:
plug :set_current_user
...
def index(conn, _params) do
raise "oopsie"
render conn, "index.html"
end
defp set_current_user(conn, _) do
conn |> assign(:current_user, "Hubert")
end
...
You can react to such problem by displaying a custom error page pretty easily.
What if, however, you need to do more than that?
While there are a few global ways of capturing exceptions and crashes in
Erlang (and Elixir), and I have a strong feeling that upcoming Erlang 21
:logger
integration will be my
favourite,
sometimes you want to capture more information along with errors
themselves. Those, generally, when handling requests would be found in
Plug.Conn
struct.
Let’s say you want to report the errors as they happen to some central exception collecting utility, or maybe send an error report to user using a Phoenix channel, and you want to know to which user this error happened exactly.
We seem to be lucky, as we are setting the conn.assigns.current_user
already so we know who the current user is. Let’s use
Plug.ErrorHandler
to log it to the console for now.
First, you need to add it to your router (or Endpoint to collect even more exceptions thrown from Plugs mounted in pipelines and router or Endpoint itself):
defmodule AppWeb.Router do
use AppWeb, :router
use Plug.ErrorHandler
def handle_errors(conn, %{kind: kind, reason: reason, stack: stack} = err) do
IO.inspect "Something happened when handling web request for the user:"
IO.inspect conn.assigns.current_user
IO.inspect "The error was:"
IO.inspect err
conn
end
...
As you can see we will attempt to catch the raised exception, log it to
the console, and return unchanged Plug.Conn
. We could have modified
the response status code or body here, but it could interfere with our
error pages we set up earlier.
"Something happened when handling web request for the user:"
"Hubert"
"The error was:"
%{
kind: error,
reason: %RuntimeError{message: "oopsie"},
stack:
...
Phoenix.ErrorHandler
is also re-raising the exception after it’s being
caught and processed in your handle_errors/2
function, ensuring the
behaviour of the application hasn’t changed if you used it.
This is only true, however, if the behaviour of handle_errors/2
can be
clearly predicted. You should not do too much work in this function, to
avoid crashing your handlers with different exceptions. It gets a bit
hairy to debug if you do, and you most likely want to employ technique
similar to AppSignal’s Plug uses
internally -
this involves starting a dedicated error reporting process (or even
OS-level process as they do it).
If you want to ensure your error reporting does not alter the behaviour
of the web request handlers in any way, you may want to start separate
process, to send error reports, use GenServer(s) that you send async error
messages with cast
for further reporting. AppSignal even uses
dedicated OS level processes to collect and report the exceptions in
bulk to their back-ends.
Keeping an eye on contents of Plug.Conn
In the example above, we were reporting the conn.assigns.current_user
to the console when error happened. We were lucky, because we were
setting the variable in separate plug function from controller’s
actions. If we slightly changed our code, however, we would end up with
some trouble:
def index(conn, _params) do
conn = conn |> assign(:current_user, "Hubert")
raise "oopsie"
render conn, "index.html"
end
This code crashes badly, and our exception tracking is now broken:
[info] GET /
[debug] Processing with AppWeb.PageController.index/2
Parameters: %{}
Pipelines: [:browser]
"Something happened when handling web request for the user:"
[error] #PID<0.393.0> running AppWeb.Endpoint (cowboy_protocol) terminated
Server: localhost:4000 (http)
Request: GET /
** (exit) an exception was raised:
** (KeyError) key :current_user not found in: %{}
Moreover, while the user did see our custom error page, we would not see the corresponding Logger message in our application logs. The missing key error caused protocol handler process to exit, aborting execution of any callbacks registered in Plug.Conn.register_before_send/2). This is precisely why you want to avoid doing too much work in this error handling function in the same process, and delegate to something else to do the risky work for you.
Let’s get back to the question on hand, however: Why is
conn.assigns.current_user
missing now in first place?
The answer is: immutability. Elixir’s data structures are immutable by
design, and setting current_user
does not alter existing conn
struct
but creates a brand new one.
Remember when we said that Phoenix solves this by wrapping installing
try/rescue clause and wrapping the exceptions in
Plug.Conn.WrapperError
?
This happens just before our controller’s actions is executed, but also when each of defined pipelines is being entered, and on these two occassions only.
This means it’s good practice to extract figuring out some assigns to
separate plugs, but - in case you want to provide more check points for
Plug.Conn
to be saved - you need to create more pipelines too. Instead
of one large :browser
pipeline mounted in your router, you may want to
split to some framework-related code, authentication pipeline, then a
pipeline which finds resources - so the Plug.Conn
is being “saved” to
be rescued without losing much data at each step.
If you really really want to provide your own check-point on a controller-level (or deeper in your code), you can do it, however, by forcing a conn savepoint yourself:
def index(conn, _params) do
conn = conn |> assign(:current_user, "Hubertos")
try do
raise "oopsie"
catch
:error, reason ->
Phoenix.Controller.Pipeline.__catch__(
conn, reason, __MODULE__,
conn.private.phoenix_action, System.stacktrace()
)
end
render conn, "index.html"
end
You probably don’t want to do that, and if you do you probably want to wrap it up in a nice helper function. Please do keep in mind you are being naughty by doing it and you use undocumented framework’s internals in your code. You have been warned.
Understanding how Phoenix sets up custom try/catch clauses may also come
in handy when you are concerned with other fields of Plug.Conn
, not
just assigns as things like response body or status may get lost by
default when handling exceptions as well.
Beware of concurrency & links
Exceptions in Elixir (either raised or thrown values) that can be caught, are a likely, but not the only reason for a process being terminated. Another reason for your handler being shut down prematurely is a crash of one of it’s linked processes.
If you are thinking about starting linked processes from your
controller’s actions, I would suggest you think again if this is a good
solution for the problem. It likely is not, and you may be better off
with a background job processing library such as
exq, dedicated pool of GenServer workers
that you schedule work for asynchronously with cast
or a dedicated
pool of tasks supervised with
Task.Supervisor.
For certain types of use cases, however, you may consider and go with a design where you do start linked processes from your web request handler.
This can be tempting, for example, if you want to fetch a file from S3, or make a HTTP call to some other external service and then return maybe processed value to the user.
You have to be really careful with those cases because crashing linked processes results in your handlers, Phoenix and Plug pipeline misbehaving badly.
To simulate the issue, let’s amend our action:
def index(conn, _params) do
spawn_link fn -> raise "oops" end
render conn, "index.html"
end
What happens when user visits such page?
<img src=”fig4.png” title=”Figure 4” alt=”User sees blank page with ‘This page isn’t working’ message in browser”/>
User sees blank page with “This page isn’t working” message in the browser.
If we have a look at the console, we will see even weirder stuff:
[info] GET /
[debug] Processing with AppWeb.PageController.index/2
Parameters: %{}
Pipelines: [:browser]
[info] Sent 200 in 260µs
[error] Process #PID<0.436.0> raised an exception
** (RuntimeError) oops
(app) lib/app_web/controllers/page_controller.ex:7: anonymous fn/0 in AppWeb.PageController.index/2
[error] Ranch protocol #PID<0.434.0> of listener AppWeb.Endpoint.HTTP (cowboy_protocol) terminated
** (exit) an exception was raised:
** (RuntimeError) oops
(app) lib/app_web/controllers/page_controller.ex:7: anonymous fn/0 in AppWeb.PageController.index/2
Oh, so it crashed with error and returned 200 OK to the user? Nope.
Our simulation code is sort of broken, but this case can actually happen in real life if we are not careful. The spawned link brings down the handler process, but after the remaining of our controller code is being executed, when the response is being sent back to the user by Cowboy. This is not good, but more realistically we are going to wait in the controller for spawned processes to shut down. Let’s simulate that better:
def index(conn, _params) do
spawn_link fn -> raise "oops" end
:timer.sleep(:infinity)
render conn, "index.html"
end
What happens now?
User sees the same blank page.
<img src=”fig4.png” title=”Figure 4” alt=”User sees blank page with ‘This page isn’t working’ message in browser”/>
But in our console we are not having a log message reporting on the request handling failure at all:
[info] GET /
[debug] Processing with AppWeb.PageController.index/2
Parameters: %{}
Pipelines: [:browser]
[error] Process #PID<0.455.0> raised an exception
** (RuntimeError) oops
(app) lib/app_web/controllers/page_controller.ex:7: anonymous fn/0 in AppWeb.PageController.index/2
[error] Ranch protocol #PID<0.453.0> of listener AppWeb.Endpoint.HTTP (cowboy_protocol) terminated
** (exit) an exception was raised:
** (RuntimeError) oops
(app) lib/app_web/controllers/page_controller.ex:7: anonymous fn/0 in AppWeb.PageController.index/2
Which sort of makes sense as you can see the RuntimeError. Again, it is
another reason why you should not trust the Plug.Logger
to output
useful information on each failure. If you are only monitoring those -
you are doing things wrong.
Our error_handler/2
function in router that we installed earlier
hasn’t been called at all, not to mention that no custom error page was
rendered. It’s just like everything was stopped abruptly - and it’s
precisely what happened.
When a linked process crashes, parent process receives exit signal. If it’s not handled and such process is not trapping exits, the parent process is being brought down as well. There is no exception, no error handling involved at all.
Standard way of dealing with such issues in Erlang/OTP world is to rely on, possibly custom, supervisors to detect and react so such system failures.
As mentioned in the first part of this article, however, the handlers are started directly by Cowboy (or rather Ranch), and Plug nor Phoenix have no power over them.
How do we react so such events then? There surely must be a way.
Linking processes itself provides one of possible answers. Since linking is bi-directional, we can start a dedicated process early in the Plug pipeline, whose responsibility is precisely to listen to and react to handler’s exit signals. We can fight fire with fire, can’t we?
Conceptually, our supervision tree will end up having one more dedicated process per each request, started by and linked to a web request handler:
<img src=”fig3.png” title=”Figure 3” alt=”Dedicated linked process, trapping exits” />
Let’s implement it, first let’s modify our controller:
defmodule AppWeb.PageController do
use AppWeb, :controller
plug :set_current_user
plug :monitor_crashes
def index(conn, _params) do
spawn_link fn -> raise "oops" end
:timer.sleep(:infinity)
render conn, "index.html"
end
defp monitor_crashes(conn, _) do
{:ok, _pid} = CrashesMonitor.start_link(conn)
conn
end
def set_current_user(conn, _) do
conn |> assign(:current_user, "Hubert")
end
end
And we will base our CrashesMonitor
on GenServer
. We’ll pass it
Plug.Conn
on initialize to make sure it’s fairly up to date when crash
happens. This can be extended further by adding explicit update
functionality in the future if we need to provide later savepoints for
better tracking what’s going on.
defmodule CrashesMonitor do
use GenServer
def start_link(conn) do
GenServer.start_link(__MODULE__, conn)
end
@impl true
def init(conn) do
Process.flag(:trap_exit, true)
{:ok, %{conn: conn}}
end
@impl true
def terminate(:normal, _), do: :ok
def terminate(reason, state) do
IO.inspect "Web request handler crashed! Here's the conn I remember:"
IO.inspect(state.conn)
IO.inspect "And the reason for exit:"
IO.inspect(reason)
end
end
When a user visits a page now, they can still - unfortunately - see a
blank page. After inspecting the server logs, you will find out,
however, that we were able to catch the exception that brought down the
handler and also retained state of Plug.Conn
:
"Web request handler crashed! Here's the conn I remember:"
%Plug.Conn{
adapter: {Plug.Adapters.Cowboy.Conn, :...},
...
"And the reason for exit:"
{%RuntimeError{message: "oops"},
{AppWeb.PageController, :"-index/2-fun-0-", 0,
[file: 'lib/app_web/controllers/page_controller.ex', line: 33]}
]}
All being saved including state of Plug.Conn
just before the
controller’s action kicks in, stacktrace and exception message! In
alternative implementation (possibly better and more robust), you could
have a single GenServer
being a registry of currently running web
workers, which listens to :DOWN
signals from them to detect failures.
This is precisely hos
AppSignal implemented it in their monitoring software.
Outro
That’s all for now, I hope it’s been useful!
Post by Hubert Łępicki
Hubert is partner at AmberBit. Rails, Elixir and functional programming are his areas of expertise.