Automating chaos experiments in production (2019)
If you use a microservice architecture, you want to make sure that all those services play well with each other. You’ll also want to test what happens when a service suddenly becomes slow or fails completely. Should you create a production-like test environment for such tests? No, of course not – just test on production!
Why it matters
The Netflix streaming service is implemented using a microservice architecture. Each action you perform on the platform is handled by numerous microservices that communicate with each other using remote procedure calls (RPC).
These calls can fail, but that’s fine! All RPCs are configured with a timeout: If a call times out due to a temporary overload or networking issues, a retry on a different server will typically solve the problem. And if a service returns errors for all requests, a sensible default response can be used as fallback.
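The retry-then-fallback behaviour described above can be sketched in a few lines. This is a minimal illustration, not Netflix's actual client code: `RpcError`, `call_with_fallback`, and `flaky_rpc` are hypothetical names, and the retry limit of two attempts is an assumption.

```python
class RpcError(Exception):
    """Any failed call: a timeout or an error response (hypothetical type)."""


def call_with_fallback(servers, rpc, fallback, max_attempts=2):
    """Try the call on up to `max_attempts` servers, then give up and
    return the sensible default instead of surfacing the error."""
    for server in servers[:max_attempts]:
        try:
            return rpc(server)
        except RpcError:
            continue  # e.g. a timeout: retry on a different server
    return fallback


def flaky_rpc(server):
    """Stand-in for a real RPC: the first server is overloaded."""
    if server == "server-a":
        raise RpcError("timed out")
    return f"recommendations from {server}"


print(call_with_fallback(["server-a", "server-b"], flaky_rpc, "popular titles"))
print(call_with_fallback(["server-a"], flaky_rpc, "popular titles"))
```

The first call succeeds on the retry, while the second runs out of servers and quietly falls back to the default.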
This makes it possible to handle failures of single (or several) services gracefully, so that the user never notices anything – in theory at least. In practice, these failures don’t happen very often, and consequently Netflix’s engineers aren’t always confident that the timeouts, retries, and fallbacks work as expected.
How the study was conducted
To solve this problem, the authors developed the Chaos Automation Platform (ChAP): an orchestration system that makes it possible to run chaos engineering experiments within Netflix’s microservice architecture.
The experiments are made possible by the fact that Netflix’s microservices typically use a common set of Java libraries, like RPC clients, Hystrix, and Cassandra database clients. These libraries provide hooks that make it possible to inject faults at runtime.
Fault injection works as follows:
- Incoming requests are annotated with metadata that indicates that a particular call should fail. Requests are handled normally, until they reach the system under test.
- Then, one of two types of faults may occur. The library either adds latency before making its call to another service or immediately throws an exception instead of executing the call.
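A rough sketch of the two fault types, assuming the fault description travels with the request as a small metadata dictionary. The names `FaultSpec`, `InjectedFailure`, and `call_service` are made up for illustration. The real hooks live in Netflix's Java libraries and will look different.

```python
import time
from dataclasses import dataclass


class InjectedFailure(Exception):
    """Raised instead of making the real call (hypothetical type)."""


@dataclass
class FaultSpec:
    target: str                 # name of the service whose calls are affected
    kind: str                   # "latency" or "failure"
    delay_seconds: float = 0.0  # only used for "latency"


def call_service(name, request_metadata, do_call, sleep=time.sleep):
    """Client-side hook: inspect the request metadata before calling out."""
    fault = request_metadata.get("fault")
    if fault is not None and fault.target == name:
        if fault.kind == "failure":
            # Fault type 2: fail immediately without executing the call.
            raise InjectedFailure(f"injected failure for {name}")
        if fault.kind == "latency":
            # Fault type 1: add latency, then make the call as usual.
            sleep(fault.delay_seconds)
    return do_call()


metadata = {"fault": FaultSpec(target="ratings", kind="latency", delay_seconds=0.1)}
print(call_service("bookmarks", metadata, lambda: "ok"))  # unaffected service
print(call_service("ratings", metadata, lambda: "ok"))    # delayed, then "ok"
```

Requests without fault metadata, or calls to any other service, pass through the hook untouched.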
This isn’t as insane as it sounds, because the experiments will only affect a small part of the userbase:
- A Netflix engineer creates an experiment for a particular service using ChAP’s UI. The experiment will typically impact only a small, randomly selected portion of users (say, 1%). These will be assigned to a treatment group. An equal number of randomly selected users are assigned to a control group.
- Netflix’s continuous delivery system then provisions two smaller copies of the API cluster: a baseline that handles traffic for the control group, and a canary that handles traffic for the treatment group. Neither receives any traffic until the experiment starts.
- Telemetry data from microservices within Netflix is typically about five minutes old, but this is not acceptable for the experiments – if things really go south, the experiment must be aborted immediately. ChAP therefore sets up low latency monitoring for the baseline and canary clusters.
- Then, Netflix’s reverse-proxy starts assigning users to the control and treatment groups. All requests for those users will then be sent to the baseline and canary clusters.
- The engineer follows the results of the experiment via a dashboard.
- When the experiment is finished, the reverse-proxy routes all traffic through the “real” cluster again and tears down the baseline and canary clusters.
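To make the routing step a bit more concrete, here is one way users could be split into equally sized treatment and control groups. The summary only says that users are selected randomly. Hashing the user ID, as done here, is a common stand-in that keeps a user in the same group for the whole experiment. It is an assumption rather than a description of Netflix's reverse proxy, and `assign_group` is a hypothetical function.

```python
import hashlib


def assign_group(user_id: str, experiment_id: str, fraction: float = 0.01) -> str:
    """Map a user to "canary" (treatment), "baseline" (control), or
    "production" (not part of the experiment)."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < fraction:
        return "canary"
    if bucket < 2 * fraction:
        return "baseline"
    return "production"


groups = [assign_group(f"user-{i}", "exp-42", fraction=0.01) for i in range(100_000)]
print(groups.count("canary"), groups.count("baseline"))
```

Both groups end up with roughly 1% of the users each, and everyone else keeps hitting the regular cluster.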
The system under test will usually handle faults elegantly (its developers should’ve written tests for it!), but sometimes bad things do happen. ChAP therefore includes a number of safety mechanisms that limit the blast radius of experiments:
- Experiments only run during business hours, so that engineers can respond quickly if anything goes wrong;
- Experiments that cause excessive impact for customers are aborted;
- All running experiments combined can never affect more than 5% of total traffic;
- Netflix’s control plane is deployed in three geographical regions. In case of an issue, traffic from a troubled region can be redirected to the other two regions. Experiments are not permitted during such failovers.
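Three of these safety mechanisms are simple preconditions, so they can be sketched as a single check. Aborting on excessive customer impact needs live metrics and is left out. The 5% cap comes from the text. The exact business hours (weekdays, 09:00 to 17:00) and the names `PlatformState` and `may_start` are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime

MAX_TOTAL_TRAFFIC = 0.05  # cap on traffic affected by experiments (from the text)


@dataclass
class PlatformState:
    now: datetime
    traffic_in_running_experiments: float  # fraction of total traffic, 0..1
    failover_in_progress: bool


def may_start(experiment_traffic: float, state: PlatformState) -> bool:
    """Return True only if every precondition holds."""
    is_weekday = state.now.weekday() < 5  # Monday is 0
    in_business_hours = is_weekday and 9 <= state.now.hour < 17  # assumed hours
    total = state.traffic_in_running_experiments + experiment_traffic
    within_cap = total <= MAX_TOTAL_TRAFFIC
    return in_business_hours and within_cap and not state.failover_in_progress


tuesday_morning = datetime(2019, 5, 7, 11, 0)
state = PlatformState(tuesday_morning, traffic_in_running_experiments=0.02,
                      failover_in_progress=False)
print(may_start(0.02, state))  # fits under the cap
print(may_start(0.04, state))  # would exceed the cap
```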
Automate all the things!
Engineers can use ChAP to manually define and run experiments on their own services, but ChAP can also generate and run experiments automatically. (It’s actually a different system called Monocle. Netflix seems to follow Amazon’s naming conventions, because none of the names make any sense to outsiders.)
These experiments can make calls fail in one of three ways:
- Immediate failure;
- Latency just below the configured timeout;
- Latency above the configured timeout, which will lead to failure.
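Given a call's configured timeout, the three variants can be derived mechanically. The sketch below assumes a fixed margin of 50 ms on either side of the timeout. That number, the dictionary layout, and the name `fault_variants` are invented for illustration.

```python
def fault_variants(timeout_ms: int, margin_ms: int = 50):
    """Return the three fault variants for a call with the given timeout."""
    return [
        # 1. Immediate failure: no latency, the call just fails.
        {"kind": "failure", "latency_ms": 0},
        # 2. Slow but successful: latency just below the timeout.
        {"kind": "latency", "latency_ms": max(timeout_ms - margin_ms, 0)},
        # 3. Too slow: latency above the timeout, so the caller gives up.
        {"kind": "latency", "latency_ms": timeout_ms + margin_ms},
    ]


for variant in fault_variants(timeout_ms=1000):
    print(variant)
```

The second variant is the interesting one: the call technically succeeds, so it checks whether the caller can really afford to wait for as long as its own timeout allows.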
Heuristics are used to identify experimental setups with a high likelihood of finding a vulnerability. These heuristics include (but are not limited to):
- How often a service is triggered compared to other services;
- The number of configured retries;
- The number of interactions with a service;
- Whether the experiment can be run safely, e.g. some services are explicitly blacklisted, while others might lack fallbacks.
What discoveries were made
The authors claim that the experiments have revealed several cases where services did not handle timeouts as gracefully as they should have. There aren’t any real numbers to back this claim up, but since they went through all this effort, we can probably make an educated guess.
Finally, the authors list some challenges and insights that should prove useful for anyone who wants to implement chaos engineering in their own organisation:
- Right now only one type of fault can be injected, but in reality services can fail in multiple ways.
- The systems under test are owned by different teams. Deploying new versions of chaos engineering features can therefore take many months, as you need to wait until all services have picked them up.
- Even if you build it, they might not come: ChAP was made available to internal users, but only a few teams were willing to actively use the service. This is why experiments are now often designed and executed centrally.
- Automated experiments must be created without domain knowledge about individual services and must have a low false positive rate, because otherwise one risks losing the confidence of developers. This greatly limits what can be done.
- Netflix software runs on many different device types. If an experiment only affects users of one particular device type (or device brand!), it’s not likely to be detected. You can attempt to solve this by oversampling rare device types, but that would have the unwanted side-effect of increasing the impact for users with that device type.
- The outcome of an experiment can be deduced from the error count. This is not always reliable, however, as some devices may generate errors over and over again, which leads to high error counts even though the actual impact is limited.
- Visualisation of existing data can sometimes reveal issues without the need to run ChAP experiments.
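The caveat about error counts in the list above is easy to illustrate with made-up data: a single device that retries endlessly can dominate the total. Counting distinct affected devices next to raw errors is one possible way to spot this. It is an assumption for this sketch, not something the authors describe, and `summarize_errors` is a hypothetical helper.

```python
from collections import Counter


def summarize_errors(error_events):
    """`error_events` is a list of (device_id, error_type) tuples."""
    per_device = Counter(device_id for device_id, _ in error_events)
    return {
        "total_errors": sum(per_device.values()),
        "affected_devices": len(per_device),
        "noisiest": per_device.most_common(1),
    }


# One TV that keeps retrying, plus two phones with a single error each.
events = [("tv-1", "playback")] * 98 + [("phone-1", "playback"), ("phone-2", "playback")]
print(summarize_errors(events))
```

Here the raw count says 100 errors, but only three devices were actually affected, and one of them produced 98 of those errors.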