[Image: tracing]

Do you fancy watching your application logs live in production to detect anomalies?

If so, I am sorry I can’t help you!


A Country for Old Men!

I’ll admit it, in my younger days I would make good use of my cat-like reflexes to detect application anomalies while watching my production logs scroll by. Like in the movie, watching my screen, I’d gotten pretty good at spotting anomalies in the Matrix. Now, my dead-cat-like reflexes coupled with the microservice explosion in my deployment stacks render this once-acquired skill obsolete.

Last year, I became aware of OpenTracing and thought to myself: “This is a really cool piece of tech!”. In my mind it complements your monitoring solution nicely and provides a nice affordance to help you pick out root causes when your metrics are out of whack, or when you need to further tune your applications. One killer story here is the ability to cross process boundaries and yet give you a continuous view of your application performance across different services and languages.

Noace!!


What’s Erlang got to do with it?

OpenTracing specifies a set of standard APIs and conventions to enable tracing within your application. Various languages implement the standard. In Erlang, there is a project called Otter from the fine folks at Bluehouse Technology that implements the basic tenets set out by OpenTracing, though it is not fully compliant. So OpenTracing is the specification and Otter an Erlang implementation, but then one needs an actual tracing collector. Luckily Otter speaks Zipkin, which opens up support for Jaeger from the self-driven Uber folks. This is a big deal as Jaeger is now part of the CNCF!
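
For the record, a minimal config sketch of wiring Otter up to Jaeger’s Zipkin-compatible collector could look like the following. This assumes Jaeger exposes its Zipkin HTTP endpoint on port 9411, that the zipkin_collector_uri and zipkin_tag_host_service keys match the Otter version you run, and “traceme” is just a made-up service name.

  # config/config.exs
  # Point Otter at Jaeger's Zipkin-compatible collector (assumed setup).
  config :otter,
    zipkin_collector_uri:    'http://127.0.0.1:9411/api/v1/spans',
    zipkin_tag_host_service: "traceme"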


That’s great!! Let’s go!

NOTE: Links to the actual code samples are at the end!

Let’s take a trip into instrumenting a plain Phoenix application and take OpenTracing for a spin to see how it helps us build introspective applications.

Our sample application produces a collection of G.O.T. castles and associated avatars. Our first crack at this may result in the following Phoenix controller code.

  # castle_controller.ex

  def index(conn, _params) do
    castles = Traceme.Castle.all
    |> Enum.map(fn(c) ->
      %{c | avatar: Avatar.get_icon(c)}
    end)

    render conn, "index.json", castles: castles
  end

This retrieves our Castles from the database using Ecto by calling the Traceme.Castle.all function. It then retrieves each Castle’s avatar by calling another function that simulates generating the Castle’s icon and returns the file name for display purposes. Nothing special here.
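
For context, a hypothetical and very simplified sketch of those supporting pieces is shown below; Traceme.Repo, the schema fields and the Process.sleep are my own assumptions, not the project’s actual code.

  # lib/traceme/castle.ex (hypothetical sketch)
  defmodule Traceme.Castle do
    use Ecto.Schema

    schema "castles" do
      field :name,   :string
      field :avatar, :string
    end

    # Fetch all castles from the database
    def all, do: Traceme.Repo.all(__MODULE__)
  end

  # lib/traceme/avatar.ex (hypothetical sketch)
  defmodule Traceme.Avatar do
    # Simulate a slowish icon generation step and hand back a file name
    def get_icon(castle) do
      Process.sleep(200)
      "#{castle.name}.png"
    end
  end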

Let’s instrument it!

In OpenTracing parlance, we need to define spans that mimic our call stack, i.e. spans are created on a function invocation and closed upon exit, be it a normal return or an error. At its core, a span tracks call timing, tags and logs. A naive implementation might inline the necessary OpenTracing span calls within our controller actions and supporting functions…
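
To make that concrete, here is a rough sketch of what such inlined instrumentation could look like using Otter’s functional span API; the span names and tag values are made up for illustration.

  # Naive, inlined tracing (illustration only)
  def index(conn, _params) do
    span = :otter.start(:index_action)          # open a span by hand

    castles = Traceme.Castle.all
    |> Enum.map(fn(c) ->
      icon_span = :otter.start(:get_icon)       # ...and one per avatar call
      icon      = Avatar.get_icon(c)
      :otter.finish(icon_span)
      %{c | avatar: icon}
    end)

    span
    |> :otter.tag(:controller, "CastleController")
    |> :otter.finish()

    render conn, "index.json", castles: castles
  end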

Following that logic, one quickly realizes this is going to completely trash our code base, as tracing calls will litter it end to end and hence hurt readability, maintenance, etc… Upon further thought, AOP (Aspect Oriented Programming) seems like the perfect fit for decorating our calls with tracing concerns. For this purpose I humbly offer this little Elixir contraption ExRay that allows one to define pre/post span hooks and decorate the functions of interest by merely annotating the calls with @trace.

Let’s take a look at our new ExRay powered controller…

  # castle_controller.ex

  ...

  use ExRay, pre: :start_span, post: :end_span

  @trace kind: :action
  def index(conn, _params) do
    req_id = conn   |> ExRay.Phx.request_id
    p_span = req_id |> ExRay.Store.current

    castles = req_id
    |> Traceme.Castle.all
    |> Enum.map(fn(c) ->
      %{c | avatar: req_id |> Avatar.get_icon(p_span, c)}
    end)

    render conn, "index.json", castles: castles
  end


We introduce a macro that defines pre/post span hook functions to open/close spans upon invocation of functions annotated with @trace. Spans get created as we follow the call stack and get closed as the stack unwinds. One thing to note here is that we need a unique ID to identify an execution path per request. We leverage Phoenix’s request_id that’s injected into our connection for every new incoming request. This gives us the ability to track the span tree as we nest deeper into our calls. ExRay leverages ETS tables to store span stack information and pops spans off as we unwind.
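
As an aside, I’m assuming ExRay.Phx.request_id boils down to something along these lines, since Plug.RequestId stashes the id in the “x-request-id” response header:

  # Rough idea of fetching Phoenix's request id (assumes Plug.RequestId
  # is plugged in the endpoint, which sets the "x-request-id" header)
  def request_id(conn) do
    conn
    |> Plug.Conn.get_resp_header("x-request-id")
    |> List.first()
  end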

So in the case above, we retrieve the request_id from the connection and get a parent span whose children will be populated as we retrieve our Castles from the database and generate the associated avatars.

Let’s take a peek at what the pre/post span hooks might look like for our controller actions.

  # castle_controller.ex

  ...

  def start_span(ctx) do
    conn = ctx.args |> List.first

    conn
    |> ExRay.Phx.span_name
    |> ExRay.Span.open(conn |> Phx.request_id)
    |> :otter.tag(:kind      , ctx.meta[:kind])
    |> :otter.tag(:controller, conn |> Phx.controller_name)
    |> :otter.tag(:action    , conn |> Phx.action_name)
    |> :otter.log(">>> Starting action `#{conn |> Phx.action_name} at #{conn.request_path}")
  end

  def end_span(ctx, span, _rendered) do
    conn = ctx.args |> List.first

    span
    |> :otter.log("<<< Ending action `#{conn |> Phx.action_name}")
    |> ExRay.Span.close(conn |> Phx.request_id)
  end

In the pre hook, we set up our root span named after the annotated action, dress it up with a few tags for later filtering and introspection, and lastly add some logging information. Likewise, in the post hook we add logging info and close the span. That’s it!

If this seems a bit hard to grasp at first, fret not! Let’s take a look at the resulting spans in the Jaeger UI.

# Hit our service URL to generate traces
curl -XGET http://localhost:4001/api/castles
# Open Jaeger UI
open http://localhost:16686


[Image: span basic]

Here we see our index action took about 2s to run. Damn! You can see ExRay in action, as it decomposed our call stack into action -> ecto -> avatars and introduced spans with detailed tags and logging info at each stage using our @trace annotation technique. Pretty cool!! But then again I might be unanimous on that??

Upon further inspection you may notice the infamous span staircase effect going on while fetching our Castles’ avatars. That’s our sign! Surely we can do better?

This to me is a great way to visualize, dig in and get intimate with what your application is doing live, be it in dev or production.

IMHO this beats the heck out of plowing through gigs of logs! I hope you concur…

At this point we have two main options: 1) make the avatar generation faster or 2) introduce parallelism. Let’s go with the second and see how it might help.

  # castle_controller.ex

  ...
  @trace kind: :action
  def index(conn, _params) do
    req_id  = conn   |> Phx.request_id
    p_span  = req_id |> ExRay.Store.current

    castles = req_id
    |> Traceme.Castle.all
    |> Flow.from_enumerable
    |> Flow.partition
    |> Flow.reduce(fn -> [] end, fn(c, acc) ->
      acc ++ [%{c | avatar: req_id |> Avatar.get_icon(p_span, c)}]
    end)
    |> Enum.to_list()

    render conn, "index.json", castles: castles
  end

So here, we use the same avatar function and tracing as before, but introduce Flow (built on GenStage) into the mix to fetch the Castles’ avatars. Flow is another awesome Elixir addition. Thanks José!! As the results will show, we can further tune the partitioning to afford greater gains based on your use cases.
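
For instance, those tuning knobs would roughly slot into the same pipeline as below; the max_demand and stages values are arbitrary examples, not recommendations.

  # Same index action as above, with hypothetical partitioning options
  castles = req_id
  |> Traceme.Castle.all
  |> Flow.from_enumerable(max_demand: 1)    # hand out one castle at a time
  |> Flow.partition(stages: 8)              # fan out across 8 partitions
  |> Flow.reduce(fn -> [] end, fn(c, acc) ->
    acc ++ [%{c | avatar: req_id |> Avatar.get_icon(p_span, c)}]
  end)
  |> Enum.to_list()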

Let’s take a look at our new traces in the UI…

# Hit our service URL to generate traces
curl -XGET http://localhost:4001/api/castles_flow
# Open Jaeger UI
open http://localhost:16686


[Image: span flow]

Noace! We now see a clear improvement, from 2s down to 0.5s. The spans show that we’ve eliminated the staircase and gone for the superior elevator scheme. Bitchin!!

That is all!

There is much to look at and discuss here, but hopefully I’ve whetted your appetite to look at incorporating OpenTracing into your own Elixir stacks and see what it offers. My feeling is that OpenTracing will become the de facto standard for well written libraries and frameworks. In this new world of microservices, multiple language stacks and exchange protocols, we have introduced a management problem. Shifting out of the monolith introduces many moving parts. Monitoring solutions, while giving access to tons of metrics, don’t help you much in understanding what’s actually going on with your applications right now, or why the metrics are off the chart. OpenTracing gives you nice constructs to instrument your applications across process boundaries, languages and services, while giving you a continuous view of the call stack live when you need it most.


What’s in your servlet?





ExRay
ExRay Phoenix Tracers