Automating chaos experiments in production (2019)

[Image: Dwight heats up a door handle using a torch. Caption: “It’s f*cking lit”]

If you use a microservice architecture, you want to make sure that all those services play well with each other. You’ll also want to test what happens when a service suddenly becomes slow or fails completely. Should you create a production-like test environment for such tests? No, of course not – just test on production!

Why it matters

The Netflix streaming service is implemented using a microservice architecture. Each action you perform on the platform is handled by numerous microservices that communicate with each other using remote procedure calls (RPC).

These calls can fail, but that’s fine! All RPCs are configured with a timeout: If a call times out due to a temporary overload or networking issues, a retry on a different server will typically solve the problem. And if a service returns errors for all requests, a sensible default response can be used as fallback.
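To make that timeout, retry, and fallback pattern concrete, here is a minimal sketch. It is not Netflix's code: the function name `call_with_fallback` and the idea of passing each server as a plain callable are assumptions made purely for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallback(
    servers: Sequence[Callable[[float], T]],
    timeout_s: float,
    fallback: T,
) -> T:
    """Try each server in turn; return the fallback if every attempt fails.

    Each entry in `servers` is a hypothetical stand-in for "the same RPC,
    sent to a different server". It receives the timeout in seconds and is
    expected to raise TimeoutError or ConnectionError when the call fails.
    """
    for call in servers:
        try:
            return call(timeout_s)
        except (TimeoutError, ConnectionError):
            continue  # retry on the next server
    # Every server failed: serve a sensible default instead of an error.
    return fallback
```

If the first server times out and the second one answers, the caller gets the real response; if all of them fail, it quietly gets the default.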

This makes it possible to handle failures of single (or several) services gracefully, so that the user never notices anything – in theory at least. In practice, these failures don’t happen very often, so Netflix’s engineers aren’t always confident that the timeouts, retries, and fallbacks work as expected.

How the study was conducted

To solve this problem, the authors developed the Chaos Automation Platform (ChAP): an orchestration system that makes it possible to run chaos engineering experiments within Netflix’s microservice architecture.

Methodology

The experiments are made possible by the fact that Netflix’s microservices typically use a common set of Java libraries, like RPC clients, Hystrix, and Cassandra database clients. These libraries provide hooks that make it possible to inject faults at runtime.

Fault injection works as follows:

  1. Incoming requests are annotated with metadata that indicates that a particular call should fail. Requests are handled normally, until they reach the system under test.
  2. Then, one of two types of faults may occur. The library either adds latency before making its call to another service or immediately throws an exception instead of executing the call.
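The two steps above can be sketched roughly as follows. This is an illustration only, not the hooks from Netflix's Java libraries: `Fault`, `InjectedFailure`, `rpc_call`, and the `"fault"` metadata key are all made-up names, and the sketch is in Python rather than Java to keep it short.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Fault:
    """Hypothetical fault description carried in the request metadata."""
    target: str           # downstream service whose calls should be affected
    kind: str             # "latency" or "failure"
    delay_s: float = 0.0  # only used when kind == "latency"


class InjectedFailure(Exception):
    """Raised instead of making the real call."""


def rpc_call(
    target: str,
    do_call: Callable[[], str],
    metadata: dict,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Client-side hook: apply an injected fault, if any, before calling."""
    fault: Optional[Fault] = metadata.get("fault")
    if fault is not None and fault.target == target:
        if fault.kind == "failure":
            # Fail immediately without executing the real call.
            raise InjectedFailure(f"injected failure for {target}")
        if fault.kind == "latency":
            # Add latency, then carry on with the real call.
            sleep(fault.delay_s)
    # Requests without a matching fault are handled normally.
    return do_call()
```

Because the fault travels with the request, calls to any other service, and requests that carry no fault at all, pass through untouched.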

This isn’t as insane as it sounds, because the experiments will only affect a small part of the userbase:

  1. A Netflix engineer creates an experiment for a particular service using ChAP’s UI. The experiment will typically impact only a small, randomly selected portion of users (say, 1%). These will be assigned to a treatment group. An equal number of randomly selected users are assigned to a control group.
  2. Netflix’s continuous delivery system then provisions two smaller copies of the API cluster: a baseline that handles traffic for the control group, and a canary that handles traffic for the treatment group. Neither receives any traffic until the experiment starts.
  3. Telemetry data from microservices within Netflix is typically about five minutes old, but this is not acceptable for the experiments – if things really go south, the experiment must be aborted immediately. ChAP therefore sets up low-latency monitoring for the baseline and canary clusters.
  4. Then, Netflix’s reverse-proxy starts assigning users to the control and treatment groups. All requests for those users will then be sent to the baseline and canary clusters.
  5. The engineer follows the results of the experiment via a dashboard.
  6. When the experiment is finished, the reverse-proxy routes all traffic through the “real” cluster again and tears down the baseline and canary clusters.
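The routing decision in step 4 could look something like the sketch below. Note the swap: the paper summary only says users are selected randomly, whereas this sketch uses hash-based bucketing so that the same user lands in the same group on every request. The function name `assign_cluster` and the cluster labels are assumptions, not anything taken from Netflix's reverse-proxy.

```python
import hashlib


def assign_cluster(user_id: str, experiment_id: str, fraction: float = 0.01) -> str:
    """Pick the cluster that should serve this user's requests.

    `fraction` is the share of users in the treatment group; an equally
    large share goes to the control group. Hashing stands in for random
    selection here (an assumption) so the assignment is stable per user.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    threshold = round(fraction * 10_000)
    if bucket < threshold:
        return "canary"      # treatment group: faults are injected
    if bucket < 2 * threshold:
        return "baseline"    # control group: same setup, no faults
    return "production"      # everyone else is unaffected
```

With the default of 1%, roughly 98% of users never leave the regular cluster, which is what keeps the blast radius small.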

Safety first

The system under test will usually handle faults elegantly (its developers should’ve written tests for it!), but sometimes bad things do happen. ChAP therefore includes a number of safety mechanisms that limit the blast radius of experiments:

Automate all the things!

Engineers can use ChAP to manually define and run experiments on their own services, but ChAP can also generate and run experiments automatically. (It’s actually a different system called Monocle. Netflix seems to follow Amazon’s naming conventions, because none of the names make any sense to outsiders.)

These experiments can make calls fail in one of three ways:

  1. Immediate failure;
  2. Latency just below the configured timeout;
  3. Latency above the configured timeout, which will lead to failure.
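As a rough sketch of how those three variants could be derived from a call's configured timeout: the function `candidate_faults`, the dictionary layout, and the 50 ms margin are all illustrative assumptions, since the summary doesn't say how far below or above the timeout the injected latency sits.

```python
from typing import Dict, List, Union


def candidate_faults(
    timeout_ms: int, margin_ms: int = 50
) -> List[Dict[str, Union[str, int]]]:
    """Return the three fault variants for one RPC with the given timeout.

    `margin_ms` is an assumed distance from the timeout, not a value
    taken from the paper.
    """
    return [
        # 1. Immediate failure.
        {"kind": "failure", "delay_ms": 0},
        # 2. Latency just below the timeout: the call should still succeed.
        {"kind": "latency", "delay_ms": max(timeout_ms - margin_ms, 0)},
        # 3. Latency above the timeout: the call should time out and fail.
        {"kind": "latency", "delay_ms": timeout_ms + margin_ms},
    ]
```

The second variant is the interesting one: the call technically works, so it shows whether the caller can actually live with the timeout it has been configured with.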

Heuristics are used to identify experimental setups with a high likelihood of finding a vulnerability. These heuristics include (but are not limited to):

What discoveries were made

The authors claim that the experiments have revealed several cases where services did not handle timeouts as gracefully as they should have. There aren’t any real numbers to back this claim up, but given how much effort they put into ChAP, it seems a safe guess that the findings were worth it.

Finally, the authors list some challenges and insights that should prove useful for anyone who wants to implement chaos engineering in their own organisation:

The important bits

  1. Many types of inter-service communication faults can be modelled using a service that slows down or returns errors
  2. Resilience of remote procedure calls can be tested by having the library that makes the call simulate a slow or error response
  3. Running experiments on a small sample of your users allows you to test your configuration with a limited “blast radius”