Using the async-hooks API to integrate Envoy with our RPC library — Part II

Published in Tubi Engineering, Dec 10, 2018

In our previous post, we discussed how we pass context across the entire application using async_hooks. We demonstrated how monkey patching can simplify deeply nested code without requiring drastic changes. For better or worse, monkey patching is not always a viable (or sane) option. For example, we were not successful in doing so for node-grpc. Today, we examine a different way to use async_hooks without monkey patching.

Using some tricks from the call stack

Let’s take a look at this call stack structure:

Nodes (no pun intended) in the tree represent function calls. One call triggers another, from top to bottom and left to right (in-order). There are two separate calls in the picture (painted as two blue blocks). Middleware is executed before the main logic code, so it is usually found in the bottom left.

If we implement a context middleware and register it as the first one in the HTTP server, it usually ends up as shown above: its execution sits in the bottom left and runs earlier than everything on the right-hand side. So, we adopt the following strategy for managing the context:

Build the tree

First, add an async hook that constructs the trigger tree of the application: one asyncId triggers another (see the picture above).
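A minimal sketch of such a hook, assuming a plain Map is enough to hold the tree. The names `parentOf` and `hook` are ours for illustration and are not taken from the envoy-node source; cleanup of the Map is deliberately left out here because it is discussed later in this post.

```javascript
const async_hooks = require('async_hooks');

// Hypothetical store: asyncId -> triggerAsyncId, i.e. the "parent" edge
// of the trigger tree.
const parentOf = new Map();

const hook = async_hooks.createHook({
  // init fires synchronously whenever a new async resource is created.
  init(asyncId, type, triggerAsyncId) {
    parentOf.set(asyncId, triggerAsyncId);
  },
});

hook.enable();
```

With the hook enabled, every async resource created afterwards gets an entry pointing at the resource that triggered it.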

Set context

Set up a new middleware and make sure it runs first.

Set up the context for the middleware's asyncId.
For every ancestor of the middleware's asyncId that doesn't have a context yet, copy the context from the middleware's asyncId.
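The set step can be sketched as follows. This is an illustration under our own assumptions: `parentOf` and `contexts` are hypothetical Maps keyed by asyncId, and `setContext` is a name we made up, not the library's API.

```javascript
// Hypothetical stores, both keyed by asyncId.
const parentOf = new Map(); // asyncId -> triggerAsyncId
const contexts = new Map(); // asyncId -> context object

function setContext(asyncId, context) {
  // Set the context for the middleware's own asyncId.
  contexts.set(asyncId, context);
  // Walk up the ancestors; only mark those that have no context yet,
  // so territory claimed by an earlier request is left untouched.
  for (let id = parentOf.get(asyncId); id !== undefined; id = parentOf.get(id)) {
    if (!contexts.has(id)) {
      contexts.set(id, context);
    }
  }
}
```

For example, if request A sets its context first, it also marks the shared server ancestors. When request B sets its context later, only B's own branch is marked.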

Let’s see this process in the picture below:

The yellow part is where the context is being set up. As you can see, the first (left) request also marks executions outside its own territory. This is OK: once one request has marked those executions as its territory, the other requests will respect that and know where their own border is. We can prove that only the first request will do this.

Query context

1. Let currentId = asyncId.
2. Check whether the parent of currentId has a context:
Yes: copy the context to currentId and return the context.
No: let currentId be the parent's asyncId and go to step 2; when that returns, set the context of currentId to the returned value.

The picture below shows how the query works (the green part):

The query walks from the execution itself to its parent, then to its parent's parent, and so on. Once a query is done, the context is set up along the path it walked, so later queries have O(1) complexity on average.
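Here is a rough sketch of the query, written as a loop instead of the recursion described above. It assumes the same hypothetical `parentOf` and `contexts` Maps keyed by asyncId; `getContext` is our own name for illustration.

```javascript
// Hypothetical stores, both keyed by asyncId.
const parentOf = new Map(); // asyncId -> triggerAsyncId
const contexts = new Map(); // asyncId -> context object

function getContext(asyncId) {
  const visited = [];
  let id = asyncId;
  // Walk up until we reach an execution that already has a context.
  while (id !== undefined && !contexts.has(id)) {
    visited.push(id);
    id = parentOf.get(id);
  }
  if (id === undefined) {
    return undefined; // no ancestor has a context
  }
  const context = contexts.get(id);
  // Cache the result on every execution we walked through, so the
  // next query from any of them is a single Map lookup.
  for (const v of visited) {
    contexts.set(v, context);
  }
  return context;
}
```

The caching loop at the end is what makes repeated queries cheap: the walk is only paid once per path.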

Pros

No need to worry about how to monkey patch for different libraries.
The same computational complexity as the monkey patch method.

Cons

More complex implementation than monkey patching.
It doesn’t work when a library uses the same async resource for different requests. But from what we’ve observed, libraries usually start a new async resource for each new request.
We need to carefully manage the context store to avoid memory leaks (see below).
To avoid context corruption, it’s strictly required that:
a) Context must be set before any query, or the query will get the wrong context from another request.
b) Context can only be set once per request (otherwise, it will split into two contexts for the same request).

Life is never easy

We did mention that this is a more complex implementation, didn’t we? Well, even though we had a working implementation, we found more problems along the way that we needed to tackle.

Problem #1: Async hook life cycle

When you first skim through the async_hooks API, you’d probably make some assumptions:

init / before / after / destroy are all guaranteed to happen exactly once.
The ancestor’s after / destroy happens after the successor’s.

However, these assumptions are wrong. Counter-intuitively, before / after may never fire, and it’s not guaranteed that the ancestor’s after / destroy will happen after the successor’s.

To prove this, I’m providing an elaborate test code example below along with its output. Check out the following test code:

The example shows:

There can be no before / after fired (see 5 and 6)
Ancestor’s destroy may happen earlier than the successor’s before. (see 7 vs 51, where destroy(32) is earlier than destroy(166))

So, if we record the asyncId dependency information on init and delete it when the async resource is destroyed, the upper stack may not find enough information about its ancestors. (Upper stack: the currently executing code, at the bottom of the trees you see above.) But if we never delete that information, it will lead to a memory leak.

But there are two things we can guarantee:

For the same stack, the init / before / execEnd / after / destroy lifecycle calls happen in strict order.
The ancestor’s init / before / execStart are guaranteed to run earlier than the successor’s ones.

So here we introduce reference counting for the dependencies and context info:

For init:

The current context is referenced by the current execution and adds a reference to its parent context.

When an execution is destroyed:

Remove the reference to its own context.
If no child still references the context, delete the context and remove its reference to its parent.
Check whether the parent needs the same treatment.
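The reference counting above could look roughly like the sketch below. The `nodes` Map, the node shape, and the `onInit` / `onDestroy` names are assumptions made for this example; in a real hook they would be called from the init and destroy callbacks.

```javascript
// Hypothetical bookkeeping: asyncId -> { parent, context, refs }.
// refs counts the execution itself plus every child that points here.
const nodes = new Map();

function onInit(asyncId, triggerAsyncId) {
  // The new node is referenced by its own execution ...
  nodes.set(asyncId, { parent: triggerAsyncId, context: undefined, refs: 1 });
  // ... and it adds one reference to its parent, if we still track it.
  const parent = nodes.get(triggerAsyncId);
  if (parent) {
    parent.refs += 1;
  }
}

function onDestroy(asyncId) {
  let id = asyncId;
  while (nodes.has(id)) {
    const node = nodes.get(id);
    node.refs -= 1; // drop one reference (own execution, or a child)
    if (node.refs > 0) {
      break; // still needed by itself or by a child
    }
    nodes.delete(id); // nobody needs this node anymore
    id = node.parent; // release the reference held on the parent
  }
}
```

This way an ancestor that is destroyed early stays in the store until its last descendant is gone, and then the whole chain is released.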

Problem #2: Context Corruption

As mentioned above, there will be context corruption if:

The query happens before the context is set.
There are multiple sets happening in a single request.

To make matters worse, this corruption cannot be detected. But if the first request satisfies the above constraints and we mark those nodes as the first context, then for the following requests:

Query action: knows the context is not set if it ends up reaching the first context.
Set action: knows it’s not doing the set action again, if it ends up reaching the first context.

We can only set the context once per request, at the bottom left of the tree (before any query). That is OK for our use case. In other words, when this method is correctly used, we don’t have the corruption problem.

Problem #3: Destroy is not called

As of December 2018, this problem might have been fixed in the latest Node.js release. See Github/node#19859.

When we put this implementation into production, we found that it was leaking memory:

For comparison, this is what the process memory looked like a week before:

For debugging, we created a counter variable:

When the init hook is called, increase it by 1.
When the destroy hook is called, decrease it by 1.

And we got the following chart for 1.5 days:

The destroy hook is not called as expected! As we dug into the details, we found the cause: our client uses a keep-alive HTTP agent, so destroy() for that execution is never called. It might be called when the connection is closed, but as long as requests are active, the keep-alive connection will rarely close. As a result, the context is never deleted.

To solve this problem, we introduced a generational garbage collection scheme:

Two Maps for context storage: old and new.
For set, always write into the new Map.
For get, look in the new Map first; if the key is not found, try the old one and, if it is found there, copy the entry to the new one.
At regular intervals (5 minutes by default), delete the old Map, make the new Map the old one, and create an empty Map as the new Map.

As our requests never last for more than a few seconds, it’s safe to delete old data that has not been accessed in the past 5 minutes. And this manual GC method does not result in any noticeable performance degradation.
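The two-Map scheme can be sketched as a small class. `GenerationalStore` and its method names are hypothetical and chosen for this example; only the 5-minute default comes from the description above.

```javascript
class GenerationalStore {
  constructor(intervalMs = 5 * 60 * 1000) {
    this.oldMap = new Map();
    this.newMap = new Map();
    // Rotate generations periodically; unref() so the timer alone
    // does not keep the process alive.
    this.timer = setInterval(() => this.rotate(), intervalMs);
    this.timer.unref();
  }

  set(key, value) {
    // Writes always go to the new generation.
    this.newMap.set(key, value);
  }

  get(key) {
    if (this.newMap.has(key)) {
      return this.newMap.get(key);
    }
    if (this.oldMap.has(key)) {
      // Still in use: promote the entry so it survives the next rotation.
      const value = this.oldMap.get(key);
      this.newMap.set(key, value);
      return value;
    }
    return undefined;
  }

  rotate() {
    // Drop the old generation; the new one becomes old.
    this.oldMap = this.newMap;
    this.newMap = new Map();
  }

  stop() {
    clearInterval(this.timer);
  }
}
```

An entry that is never read again disappears after at most two rotations, even if the destroy hook for its asyncId never fires.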

Conclusion

By using the async_hooks API, we made our tracing work easier than before. You can also check out our open source envoy-node library for the details. Your feedback is warmly welcomed!