Break Things Deliberately Before Users Do
A structured 5-day chaos engineering engagement that injects real failures into your production-like environment, measures actual recovery behaviour, and fixes the gaps your runbooks assumed away.
You might be experiencing...
- Runbooks that look complete on paper but have never been executed against a real failure
- An MTTR you can't quote because recovery has never been measured
- No clear picture of your blast radius when a single dependency degrades
- Incidents that cascade in ways no runbook anticipated
Chaos engineering is the practice of deliberately injecting failures into systems to surface weaknesses before they become incidents. A structured sprint compresses months of accidental discovery into five focused days of scientific experimentation. Every experiment follows a hypothesis-driven approach: we define the expected system behaviour, inject the failure, measure what actually happens, and document the gap.
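As a minimal sketch of that loop, a single experiment can be expressed in a few dozen lines of Python; the probe URL, latency threshold, and kubectl selector below are illustrative placeholders, not our actual tooling.

```python
# Hypothesis -> inject -> measure -> document, as one self-contained script.
# All endpoints, thresholds, and commands are hypothetical examples.
import subprocess
import time
import requests

HYPOTHESIS = "p99 checkout latency stays under 800 ms while one replica is killed"
PROBE_URL = "https://staging.example.com/healthz"   # hypothetical probe endpoint
THRESHOLD_MS = 800

def worst_latency_ms(samples: int = 30) -> float:
    """Probe the endpoint repeatedly and return the worst latency observed, in ms."""
    worst = 0.0
    for _ in range(samples):
        start = time.monotonic()
        requests.get(PROBE_URL, timeout=5)
        worst = max(worst, (time.monotonic() - start) * 1000)
        time.sleep(1)
    return worst

baseline = worst_latency_ms()                        # steady state, before injection
subprocess.run(                                      # the failure we inject (example)
    ["kubectl", "delete", "pod", "-l", "app=checkout", "--wait=false"],
    check=True,
)
observed = worst_latency_ms()                        # behaviour under failure

print(f"Hypothesis: {HYPOTHESIS}")
print(f"Baseline: {baseline:.0f} ms, under failure: {observed:.0f} ms")
print("Result:", "hypothesis held" if observed < THRESHOLD_MS else "gap found, document it")
```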
The most common finding from a chaos engineering sprint is not a missing redundancy — it is a recovery procedure that works in theory but fails in practice. Runbooks assume clean failure modes; real incidents combine network latency with a degraded dependency and a misconfigured circuit breaker. Controlled experiments expose those combinations safely.
We use LitmusChaos and Chaos Mesh for Kubernetes workloads, Toxiproxy for network fault injection, and stress-ng for resource exhaustion. Each tool is chosen for its ability to inject precisely scoped failures with clean rollback. By Day 5, your team leaves with measured MTTR figures, a prioritised remediation backlog, and a chaos runbook they can execute independently going forward.
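To show how narrowly scoped an injection can be, here is a hedged example of adding and then removing a latency toxic through Toxiproxy's HTTP admin API, assuming the default admin port 8474; the proxy name, hosts, and ports are placeholders for your environment.

```python
# Add 500 ms of latency to one database connection, then roll it back cleanly.
# Assumes a Toxiproxy instance on its default admin port; names are examples.
import requests

ADMIN = "http://localhost:8474"

# Application traffic is pointed at the proxy instead of the real upstream.
requests.post(f"{ADMIN}/proxies", json={
    "name": "postgres",
    "listen": "127.0.0.1:21212",                 # the app connects here
    "upstream": "db.staging.internal:5432",      # the real database
}).raise_for_status()

# Inject the latency toxic for the duration of the experiment.
toxic = requests.post(f"{ADMIN}/proxies/postgres/toxics", json={
    "type": "latency",
    "stream": "downstream",
    "attributes": {"latency": 500, "jitter": 100},
}).json()

# ... observe dashboards and run load while the toxic is active ...

# Rollback: remove the toxic so the proxy passes traffic through untouched.
requests.delete(f"{ADMIN}/proxies/postgres/toxics/{toxic['name']}").raise_for_status()
```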
Engagement Phases
Experiment Design
We review your resilience assessment findings (or conduct a rapid architecture review) and design a 15-experiment chaos backlog covering CPU/memory stress, network partition, latency injection, dependency failure, and node termination scenarios.
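Every entry in the backlog is written down before anything is injected. A sketch of how a single entry might be captured is below; the field names and example values are illustrative, not a fixed schema.

```python
# Each experiment is specified before execution starts.
# Field names and the example values are illustrative only.
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str
    category: str          # cpu | memory | network | dependency | node
    hypothesis: str        # expected behaviour, stated up front
    blast_radius: str      # what the experiment is allowed to affect
    abort_condition: str   # the signal that stops the test immediately

backlog = [
    Experiment(
        name="payments-pod-kill",
        category="node",
        hypothesis="Checkout error rate stays below 1% when one payments pod is terminated",
        blast_radius="a single pod in staging; no shared datastores touched",
        abort_condition="error rate above 5% for two consecutive minutes",
    ),
    # ...14 further entries covering stress, partition, latency injection,
    # dependency failure, and node termination...
]
```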
Controlled Experiment Execution
We run experiments in your staging or canary environment with a clear abort condition defined before each test. Every experiment follows the scientific method: hypothesis, blast radius scoping, measurement, observation, rollback.
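As an illustration of what an abort condition looks like in practice, the sketch below watches an error-rate signal during the experiment and triggers the pre-verified rollback the moment it is breached. The Prometheus endpoint, query, threshold, and rollback script are all assumptions for the example.

```python
# Abort guard that runs alongside every experiment: if the agreed signal
# breaches its limit, the injection is rolled back immediately.
# Endpoint, query, threshold, and rollback command are placeholders.
import subprocess
import time
import requests

PROM = "http://prometheus.staging.internal:9090/api/v1/query"   # hypothetical
QUERY = 'sum(rate(http_requests_total{status=~"5.."}[1m]))'
ABORT_THRESHOLD = 5.0    # errors per second, agreed before the test starts
MAX_DURATION_S = 300     # hard stop even if the abort condition never fires

def current_error_rate() -> float:
    """Return the current 5xx rate from Prometheus, or 0 if no series matches."""
    body = requests.get(PROM, params={"query": QUERY}, timeout=5).json()
    series = body["data"]["result"]
    return float(series[0]["value"][1]) if series else 0.0

deadline = time.monotonic() + MAX_DURATION_S
while time.monotonic() < deadline:
    if current_error_rate() > ABORT_THRESHOLD:
        print("Abort condition breached, rolling back the injection")
        subprocess.run(["./rollback.sh"], check=True)   # placeholder rollback script
        break
    time.sleep(10)
```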
Analysis & Remediation Handoff
We triage findings by severity, document root causes, and produce a remediation playbook. We pair with your engineering team to implement the highest-priority fixes and verify them with a follow-up experiment.
Deliverables
- Measured MTTR figures for every failure mode tested
- A 15-experiment library with hypotheses, results, and rollback procedures
- A severity-triaged remediation playbook, with the highest-priority fixes implemented and re-verified
- A chaos runbook and installed tooling your team owns and can run independently
Before & After
| Metric | Before | After |
|---|---|---|
| MTTR | Unknown (untested) | 12 min measured |
| Failure modes tested | 0 | 15 validated |
| Cascading failures | Unknown | 4 identified and fixed |
Tools We Use
- LitmusChaos and Chaos Mesh for Kubernetes-native fault injection
- Toxiproxy for network fault injection (latency and partition scenarios)
- stress-ng for CPU and memory resource exhaustion
Frequently Asked Questions
Do you run chaos experiments in production?
We strongly prefer production-like staging environments for initial sprints. If you want production chaos, we design experiments with minimal blast radius, clear abort conditions, and progressive rollout — starting with 1% of traffic or a single availability zone. We never run production experiments without explicit sign-off from your engineering leadership.
What if an experiment breaks something we can't recover?
Every experiment begins with a defined steady state, an abort condition, and a verified rollback procedure. We scope blast radius before each test and have never caused an unrecoverable outage. In five years of running chaos sprints, the most common outcome of a 'bad' experiment is discovering a recovery procedure that didn't work — which is exactly the point.
Do we need LitmusChaos or Chaos Mesh already installed?
No. We install and configure the chaos tooling as part of the sprint setup on Day 1. At the end of the engagement you own the tooling, the experiment library, and a runbook for running future experiments independently.
Know Your Blast Radius
Book a free 30-minute resilience scope call with our chaos engineers. We review your architecture, identify your highest-risk failure modes, and recommend the experiments that will give you the most signal.
Talk to an Expert