
Netflix Chaos Monkey: Why Netflix Breaks Its Own Systems on Purpose
Most engineering teams spend their energy trying to prevent failures. Netflix decided to cause them on purpose.
That's the core idea behind Chaos Monkey — a tool that randomly terminates production instances during business hours, forces engineers to confront failure head-on, and in doing so, builds one of the most resilient streaming platforms on the planet.
If you're running distributed systems, managing microservices, or responsible for uptime at any meaningful scale, this is one of the most important engineering philosophies you need to understand.
Why Netflix built Chaos Monkey
In the late 2000s, Netflix's core business was still shipping DVDs. Then they bet on streaming, and with that came a radical shift in infrastructure: starting in 2008 they began moving everything to AWS and built a sprawling microservices architecture that today handles billions of requests per day.
The problem with distributed systems at that scale is simple: things will fail. Not might fail — will fail. Servers crash, network partitions happen, dependencies time out, and cloud providers have outages. The question isn't whether your system will encounter failure, it's whether your system can survive it.
Netflix's engineering team realized something uncomfortable: the only way to truly know if your system handles failure gracefully is to actually introduce failure. Running drills in staging environments doesn't cut it — staging never perfectly mirrors production. The only honest test is production itself.
So Netflix built Chaos Monkey, running it against production by 2010 and open-sourcing it in 2012. The name is intentional: a monkey loose in the data center, randomly smashing things. The philosophy behind it became known as Chaos Engineering.
"The Netflix culture around Chaos Engineering was built on a simple premise: if we don't test failure, we're just hoping everything works. Hope is not a strategy."
What Chaos Monkey actually does
At its core, Chaos Monkey is straightforward. It runs as a service inside your infrastructure and does one thing: it randomly selects and terminates virtual machine instances or containers in your production environment.
Not in staging. Not on a Friday at midnight when no one is watching. During business hours, when engineers are at their desks, traffic is flowing, and any failure will be immediately visible.
Here's what the basic workflow looks like:
- Chaos Monkey queries your instance groups (Auto Scaling Groups on AWS, for example)
- It randomly selects an instance from an enabled group
- It terminates that instance
- Your system either handles it gracefully — traffic reroutes, another instance picks up the load, users notice nothing — or it doesn't, and your on-call engineer gets paged
The termination is the easy part. The real value is in what happens next: does your system self-heal? Do your runbooks hold up? Do your alerts fire correctly? Do dependent services degrade gracefully or cascade into a full outage?
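To make that concrete, here is a minimal sketch of the select-and-terminate loop in Python. This is not Netflix's implementation (the real Chaos Monkey is written in Go and terminates instances through Spinnaker); it assumes boto3, configured AWS credentials, and a hypothetical opt-in tag named chaos_enabled.
import random
import boto3

def pick_and_terminate(region="us-east-1", opt_in_tag="chaos_enabled"):
    """Pick one instance from an opted-in Auto Scaling Group and terminate it."""
    autoscaling = boto3.client("autoscaling", region_name=region)
    ec2 = boto3.client("ec2", region_name=region)

    # 1. Find Auto Scaling Groups that have explicitly opted in via a tag
    groups = autoscaling.describe_auto_scaling_groups()["AutoScalingGroups"]
    eligible = [
        g for g in groups
        if g["Instances"]
        and any(t["Key"] == opt_in_tag and t["Value"] == "true" for t in g.get("Tags", []))
    ]
    if not eligible:
        return None

    # 2. Randomly select a group, then a single instance within it
    group = random.choice(eligible)
    victim = random.choice(group["Instances"])["InstanceId"]

    # 3. Terminate it; the Auto Scaling Group should replace it, and the
    #    interesting question is whether anyone's dashboard notices
    ec2.terminate_instances(InstanceIds=[victim])
    return victim
Everything else the real tool adds (scheduling, a minimum time between kills, a leash for dry runs) exists to control how often and how safely this loop fires.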
The Simian Army — Chaos Monkey's bigger family
Chaos Monkey didn't stay alone for long. Netflix expanded the concept into what they called the Simian Army — a collection of tools that each test a different failure mode:
Chaos Gorilla takes it up a notch from Chaos Monkey — instead of killing individual instances, it simulates the failure of an entire AWS Availability Zone. This tests whether your architecture is truly multi-AZ resilient or just claims to be.
Chaos Kong goes even further, simulating the failure of an entire AWS region. This is about as brutal as it gets — can your platform survive losing an entire geography?
Latency Monkey introduces artificial delays into REST client-server communication. This is particularly nasty because latency failures are harder to detect than hard crashes — services stay up but respond slowly, which can cause cascading timeouts and thread pool exhaustion upstream.
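To see why latency is so dangerous, consider a toy upstream handler calling a dependency that has started responding slowly. This is an illustrative sketch only (the requests library, the worker-pool size, and the downstream URL are assumptions), but it shows how an unbounded wait turns a slow dependency into an unavailable caller.
import requests
from concurrent.futures import ThreadPoolExecutor

# Hypothetical upstream service: one worker thread per in-flight request
pool = ThreadPoolExecutor(max_workers=10)

def call_dependency(url):
    # No timeout: if the dependency answers in 30 s instead of 30 ms, this
    # worker is pinned for 30 s. Ten slow calls and the pool is exhausted,
    # so the service is "up" but can no longer serve anyone.
    return requests.get(url).status_code

def call_dependency_bounded(url):
    # A short timeout turns the slow dependency into a fast, visible error
    # the caller (or a circuit breaker) can handle instead of a pile-up.
    return requests.get(url, timeout=0.5).status_code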
Conformity Monkey checks instances against a set of best practice rules and shuts down instances that don't comply — wrong instance type, missing health checks, improper tagging, and so on.
Security Monkey scans for security misconfigurations and policy violations — open security groups, missing SSL, incorrect IAM policies.
Doctor Monkey runs health checks on instances and triggers removal of unhealthy ones from service.
Together, the Simian Army covers a broad spectrum of the failure scenarios a production system is likely to encounter.
The principles behind Chaos Engineering
Netflix's approach eventually got formalized into a set of principles at principlesofchaos.org, which the broader industry now treats as the definitive guide. The key ideas are:
Build a hypothesis around steady state behavior. Before you break anything, define what "normal" looks like. Is it requests per second? Error rate? P99 latency? You need a baseline to measure against.
Vary real-world events. Don't just kill instances — simulate realistic failure modes. Server crashes, network packet loss, disk I/O saturation, dependency failures, bad deploys. The failures you introduce should reflect the failures you'd actually encounter.
Run experiments in production. Staging is a lie. The only environment that tells you the truth about your system's resilience is production.
Automate experiments to run continuously. One-off chaos tests are better than nothing, but the real value comes from running them constantly. Systems change — a deploy last Thursday might have introduced a regression in your failure handling.
Minimize blast radius. Start small. Chaos Engineering isn't about causing outages — it's about discovering weaknesses before an uncontrolled failure does. Begin with a small percentage of traffic or a single non-critical service group.
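One way to make the last two points concrete is to put guardrails around the chaos automation itself. The sketch below is hypothetical and not part of any particular tool; it assumes you supply a get_error_rate() probe, and it refuses to inject failure outside business hours, beyond a daily kill budget, or when the system has already drifted from its steady state.
from datetime import datetime

MAX_KILLS_PER_DAY = 1        # blast radius: at most one termination per day
ERROR_RATE_CEILING = 0.01    # steady state: no chaos if errors already exceed 1%

def safe_to_inject_failure(get_error_rate, kills_today, now=None):
    """Guardrails checked before every injected failure."""
    now = now or datetime.now()

    # Only during business hours on weekdays, when engineers can respond
    if now.weekday() >= 5 or not 9 <= now.hour < 16:
        return False

    # Respect the daily kill budget
    if kills_today >= MAX_KILLS_PER_DAY:
        return False

    # If the system is already outside its steady state, adding chaos
    # only muddies the signal (and the incident)
    if get_error_rate() > ERROR_RATE_CEILING:
        return False

    return True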
How to implement Chaos Engineering on your own infrastructure
You don't need to be Netflix to adopt these practices. The open-source version of Chaos Monkey is available for anyone to run; the current release is designed to work through Spinnaker rather than talking to cloud APIs directly.
Step 1: Set up Chaos Monkey with Spinnaker
Chaos Monkey is tightly integrated with Spinnaker, Netflix's open-source continuous delivery platform. If you're already using Spinnaker, enabling Chaos Monkey is straightforward.
# Clone the Chaos Monkey repo
git clone https://github.com/Netflix/chaosmonkey
cd chaosmonkey
# Build the chaosmonkey binary (the main package lives under cmd/chaosmonkey)
go build ./cmd/chaosmonkey
You'll need a MySQL-compatible database (5.6 or later) for Chaos Monkey to record its termination schedule, and it needs access to your Spinnaker API.
# Chaos Monkey config (chaosmonkey.toml)
[chaosmonkey]
enabled = true
schedule_enabled = true
leashed = false   # set to true to do dry runs first
accounts = ["my-aws-account"]

[database]
host = "localhost"
port = 3306
name = "chaosmonkey"
user = "chaosmonkey"
encrypted_password = "..."

# Terminations go through Spinnaker, so Chaos Monkey needs its API endpoint
[spinnaker]
endpoint = "http://spinnaker.example.com:8084"
Set leashed = true when you first start — this runs Chaos Monkey in dry-run mode, logging what it would have terminated without actually doing it. Use this to build confidence before you go live.
Step 2: Define your steady state
Before you flip the switch, instrument your system. You need to know what normal looks like so you can tell when chaos is causing real harm vs. when your resilience mechanisms are working correctly.
At minimum, define:
- Success rate — what percentage of requests succeed under normal load?
- Latency P99 — what does your 99th percentile response time look like?
- Error budget — how much degradation is acceptable before you call it an incident?
Tools like Prometheus, Grafana, Datadog, or New Relic work well here. The key is having dashboards running before you start introducing failures, not scrambling to set them up after something goes wrong.
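If your metrics live in Prometheus, the baseline check can be two queries against its HTTP API. The sketch below is an assumption-heavy example, not a drop-in: the Prometheus address and the metric names (http_requests_total, http_request_duration_seconds_bucket) are placeholders you'd replace with your own.
import requests

PROMETHEUS = "http://prometheus.internal:9090"  # placeholder address

def prom_query(expr):
    """Run an instant PromQL query and return the first value as a float."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": expr}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# Success rate over the last 5 minutes
success_rate = prom_query(
    'sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
)

# 99th percentile latency over the last 5 minutes, in seconds
p99_latency = prom_query(
    'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
)

print(f"steady state: success_rate={success_rate:.4f}, p99={p99_latency:.3f}s")
Record these values before, during, and after each experiment; the delta is your verdict.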
Step 3: Start small with a canary group
Don't enable Chaos Monkey across your entire infrastructure on day one. Pick one non-critical service — something with good health checks, proper auto-scaling, and ideally a stateless architecture — and enable chaos terminations only for that group.
With the Spinnaker integration, opt-in happens at the application level rather than by tagging individual Auto Scaling Groups; each Spinnaker application carries a Chaos Monkey configuration block along these lines:
{
  "chaosmonkey": {
    "enabled": true,
    "meanTimeBetweenKillsInWorkDays": 5,
    "minTimeBetweenKillsInWorkDays": 1,
    "grouping": "app"
  }
}
A meanTimeBetweenKillsInWorkDays of 5 means that, on average, one instance in this group gets terminated roughly once per work week. That's conservative enough to build confidence, and frequent enough to actually test things.
Step 4: Use Chaos Toolkit for more advanced experiments
If you're not on Spinnaker, or you want more control over experiment design, Chaos Toolkit is a vendor-neutral alternative that works with AWS, GCP, Azure, Kubernetes, and more.
pip install chaostoolkit
pip install chaostoolkit-aws
A simple experiment definition looks like this:
{
  "title": "Can our API survive an EC2 instance termination?",
  "description": "Terminate one instance in the API ASG and verify error rate stays below 1%",
  "steady-state-hypothesis": {
    "title": "API is healthy",
    "probes": [
      {
        "type": "probe",
        "name": "api-error-rate-under-threshold",
        "provider": {
          "type": "http",
          "url": "https://your-metrics-endpoint/api-error-rate",
          "timeout": 5
        },
        "tolerance": 0.01
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "terminate-api-instance",
      "provider": {
        "type": "python",
        "module": "chaosaws.ec2.actions",
        "func": "terminate_instances",
        "arguments": {
          "filters": [
            {"Name": "tag:aws:autoscaling:groupName", "Values": ["api-asg-prod"]}
          ],
          "az": "us-east-1a"
        }
      }
    }
  ]
}
Run it with:
chaos run experiment.json
Chaos Toolkit will verify your steady-state hypothesis before the experiment, execute the termination, then verify the steady state again afterwards. If the post-experiment check fails, the experiment is marked as deviated, and you've found a weakness before an uncontrolled failure did.