Captain Not-Too-Obvious™ Explains How to Improve Node.js Service CPU Performance

This is the second in a series of posts about load testing services built on Node.js.

When the platform team at WB Games in San Francisco began the endeavor of testing and optimizing our services code, we weren’t expecting it to be so hard! Code we thought was totally innocuous snuck up and bit us, and code that we were certain would come back to haunt us … well, it haunted us, too. Over this series of posts, we’ll describe the tools and techniques that helped us achieve optimal performance from Node.js services under heavy concurrent traffic.

Such Concurrency. Much Request. Wow

I’ve found it challenging to find information about what people are actually trying to test when they talk about Node.js load testing. That being the case, I’m going to jump in head first and be as transparent as I can. First, understand that the goal of WB Games Platform is to handle requests on behalf of millions of users per day without eating into the bottom line of our game studios. That is to say, we’re not interested in just throwing money at horizontal scaling and calling it a day. Second, I’d worked on service stacks in the past where inefficiency was king and service nodes were only capable of handling low hundreds of CCU, so I came into building this platform with the mindset that our service nodes should be able to handle high thousands of CCU.

Again, for the sake of transparency, I’ll qualify our idea of what “heavy concurrent traffic” is. My original, mythical target was that each production service node, doing a moderate amount of CPU-bound work, could handle 10,000 concurrent requests per second. What’s the machine, you ask? Good question. Depending on the to-be-specified-shortly job description of the node, it would be either a 4-core VM or an 8-core VM on Microsoft Azure.

So! If an 8-core VM can do 10k CCU, then the crappiest, 1-core VM — an “A1” with 1.75GB RAM and a ~2.0GHz CPU — can do 1250 CCU, right? RiGhT?!

Target Performance Metrics

Before this bit drifts away into the ether, here are the metrics we measure our services’ performance against, and the targets we wanted to hit, in priority order.

  • CPU Usage <= 80%
  • Response Times <= 100ms
  • Errors = 0

The Job Description

As promised, here is the CPU performance best case scenario:

  • Existing user with cached (in Redis) “profile” data comes to a service.
  • Authentication token is successfully validated against cached data (i.e., a call to Redis over an already-connected persistent connection).
  • User “profile” data is retrieved from cache and stored in a reference object.
  • Logic operates on and manipulates reference object.
  • User “profile” data is stored in cache, client receives successful response.
  • Secondary system is notified (via Redis List/Queue) to persist data to MongoDB.

And here is the worst case scenario:

  • Existing user’s “profile” data has expired from the cache.
  • User “profile” data is read from MongoDB.
  • Cache is repopulated from the MongoDB result.
  • Data is stored in a reference object.
  • Logic operates on and manipulates reference object, a lot ...
  • User “profile” data is stored in cache, client receives successful response.
  • Secondary system is notified (via Redis List/Queue) to persist data to MongoDB.

OK then, 10k CCU, here we come!

Having followed the tips outlined in the first part of this series, we fired up JMeter and were off to the races. As the ramp-up progressed, everything looked fine, but as we neared our peak of 1250 CCU, CPU usage spiked to 100%! And as the load average of the VM headed towards 1.0, response times climbed from ~300ms to as much as ~3000ms! In a word … yuck!

Things looked pretty bleak at that point given our target of 80% CPU and 100ms response times, but thankfully there were plenty of things we were able to do to improve the situation. After having gone through this exercise, we now ask ourselves the following questions when attempting performance optimizations.

Are you logging to disk?

When I started this performance optimization process, I had never fully considered the CPU impact of logging to disk. If you have a service receiving tens or even low hundreds of requests per second, you can console.log occasionally and not really feel it. However, when you’re attempting 1k+ requests per second, you’re going to want to curb your logging. In our case, we’re using the winston logging library, and chose to override JavaScript’s console object to internally use winston so that we can run our tests with a logging level that sends everything but console.error calls to /dev/null (more on this in a separate post).
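
Here’s a minimal sketch of that override, assuming a winston 1.x-style API; the transport configuration shown is illustrative rather than our exact setup:

var winston = require('winston');

// Only 'error' and above get written anywhere; everything below that
// level is filtered out before it can touch stdout or the disk.
var logger = new (winston.Logger)({
    transports: [
        new (winston.transports.Console)({ level: 'error' })
    ]
});

// Route the global console through winston so existing console.* calls
// respect the logging level instead of always writing out.
console.log = logger.info.bind(logger);
console.info = logger.info.bind(logger);
console.warn = logger.warn.bind(logger);
console.error = logger.error.bind(logger);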

Lesson Learned: Keep console from writing to disk in performance-sensitive code

Are you reading from or writing to disk at runtime?

After nipping the logging issue, we found other code reading files from disk at runtime. Ack! Since we’re building services that are long-running processes, we optimized away that problem by reading each file once at service boot time, then parsing and caching the results in the Node process’s memory.
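
Here’s a minimal sketch of the boot-time approach, assuming a directory of JSON config files; the directory and helper names are illustrative:

var fs = require('fs');
var path = require('path');

var fileCache = {};

// Pay the disk read and JSON.parse cost exactly once, at boot,
// before the service starts taking traffic.
function warmFileCache(dir)
{
    fs.readdirSync(dir).forEach(function (name) {
        var raw = fs.readFileSync(path.join(dir, name), 'utf8');
        fileCache[name] = JSON.parse(raw);
    });
}

warmFileCache(path.join(__dirname, 'config'));

// At request time, hit the in-process cache instead of the disk.
function getCachedFile(name)
{
    return fileCache[name];
}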

Lesson Learned: At service boot time, pre-parse and pre-cache any files that are needed at runtime

Are you excessively using JSON.parse or JSON.stringify?

Calls to JSON.parse and JSON.stringify are exorbitantly CPU intensive. Just to clarify: it’s not that they’re slow or inefficient implementations; it’s that each one is a blocking operation you don’t want to ask a CPU to perform 1,000 times a second. Just like with disk I/O, if you’re on a system taking tens or hundreds of requests per second … no big whoop. If you’re trying to push 1k+ on crappy “hardware”, you’re going to have to pull out every stop you can find.

In our case, we ended up optimizing our best case scenario so that there would be two JSON.parse calls and two JSON.stringify calls per request. Why two, you ask? Well, since we’re using HTTP, and explicitly chose to allow our game clients to use the application/json content-type, that’s what we need to support. That means we’ll be parsing the HTTP request body and the JSON data coming from Redis. On the way back, we’ll be stringifying the data going to Redis and the HTTP response to the client.
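
To make the accounting concrete, here’s a minimal sketch of that flow using node_redis; the handler, key names, and fields are illustrative assumptions, not our production code:

var redis = require('redis');
var client = redis.createClient();

function handleProfileUpdate(req, res, rawBody)
{
    var update = JSON.parse(rawBody); // parse #1: the HTTP request body

    client.get('profile:' + update.userId, function (err, cached) {
        if (err || !cached) {
            res.writeHead(500);
            return res.end();
        }

        var profile = JSON.parse(cached); // parse #2: cached profile from Redis

        profile.lastSeen = Date.now(); // logic operates on the reference object

        client.set('profile:' + update.userId, JSON.stringify(profile)); // stringify #1: back to Redis

        res.writeHead(200, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify({ ok: true })); // stringify #2: the HTTP response
    });
}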

In the worst case scenario, we’d also be paying the additional CPU cost of the mongodb library and its internal parsing calls when translating BSON data into JavaScript objects. Additionally, in the occasional unfortunate situation where we found ourselves pulling multiple keys from the (Redis) cache, each of which was a nested object stored as a JSON string, we’d aggregate the JSON strings into one manually so we only needed to call JSON.parse once. Here’s an example:

/**
 * Aggregate multiple JSON strings into one to save on CPU.
 *
 * @param {Array} jsonStringsArray An array of valid JSON strings to be parsed together.
 * @returns {Array} An array of JavaScript Objects, or an empty array if any member of the input array is invalid JSON.
 */
function aggregateParseJSON(jsonStringsArray)
{
    // Join the individual JSON strings into a single JSON array literal so
    // the whole batch costs one JSON.parse call instead of one per string.
    var joinedJson = '[' + jsonStringsArray.join(',') + ']',
        parsed = [];

    try {
        parsed = JSON.parse(joinedJson);
    } catch (e) {
        // One invalid member invalidates the whole batch; fall through to [].
    }
    return parsed;
}
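
For example, with node_redis (key names illustrative), MGET returns an array of JSON strings that can then be parsed with a single call:

client.mget(['profile:1', 'profile:2', 'profile:3'], function (err, replies) {
    // One JSON.parse for the whole batch instead of one call per key.
    var profiles = aggregateParseJSON(replies || []);
    // ... logic operates on profiles
});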

After we completed our JSON.parse and JSON.stringify related optimizations, we found that at 1k+ requests per second, each JSON.parse or JSON.stringify call we could remove netted roughly 10% less CPU usage. This, in conjunction with our disk I/O findings, covers what we need to pay attention to most of the time to identify and fix a performance bottleneck.

Lesson Learned: Keep calls to JSON.parse and JSON.stringify to a minimum in performance-sensitive code

Have you used process.nextTick or setImmediate where possible?

If you’re still with me, you may recall something about notifying a secondary system to persist data. In our case, we found that we could hit our 80% CPU usage / 100ms response time target by offloading the responsibility of persisting data to disk (via the mongodb Node library) to a secondary VM, if and only if we wrapped our calls to the ømq library with process.nextTick. For most test runs, this dropped our response times from ~100ms to ~10ms.
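
Here’s a minimal sketch of the pattern, with notifySecondarySystem standing in as a hypothetical wrapper around whatever publish/push call your messaging layer exposes (ours was ømq):

function notifySecondarySystem(payload)
{
    // Hypothetical stand-in: push the payload onto the queue that the
    // persistence tier consumes.
}

function finishRequest(res, profile)
{
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(profile));

    // Defer the "fire and forget" notification so it doesn't hold up the
    // response; process.nextTick runs before returning to the event loop,
    // while setImmediate yields to pending I/O first.
    setImmediate(function () {
        notifySecondarySystem(profile);
    });
}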

Lesson Learned: Use process.nextTick or setImmediate, whichever is more appropriate in your case, for “fire and forget” calls

Summary

In conclusion, while any one of the aforementioned tips is plain common sense, the number of moving parts in a typical production architecture makes it hard to remember the basics. If you make adjustments to your services based on each of these suggestions, and retest each independently, I'm confident that you'll get even more performance out of your apps.