What Happens When Stripe Goes Down at Checkout?
External dependencies fail in ways you cannot control. We simulate payment gateway outages, third-party API failures, and internal service degradation to validate that your system degrades gracefully instead of failing catastrophically.
Dependency failure testing validates the most common source of cascading outages in distributed systems: a single dependency that degrades and takes everything downstream with it. External payment gateways, authentication services, notification APIs, and internal microservices all fail in ways that are unpredictable in timing but very predictable in pattern — and those patterns can be tested.
The most dangerous failure mode is not a hard crash but a slow dependency: a payment API that takes 30 seconds to respond instead of 300 milliseconds. Without timeouts, your thread pool fills with waiting requests, your response queue backs up, and a dependency that handles 2% of your traffic can take down 100% of your capacity. Toxiproxy’s latency injection surfaces these exact failure modes in a controlled environment.
Circuit breaker validation is a core deliverable: we verify that circuit breakers open under the conditions they’re configured for, that they half-open correctly after the recovery period, and that they don’t create their own failure modes through misconfigured thresholds. We also test retry logic for amplification effects — a correctly implemented retry with exponential backoff and jitter looks very different from a retry loop that creates a storm.
Engagement Phases
Dependency Mapping & Risk Scoring
We map your complete dependency graph — internal services, external APIs, databases, message queues, caches, CDNs, and payment gateways. We score each dependency by blast radius and failure probability to prioritise the experiment backlog.
Failure Injection
We use Toxiproxy to inject latency, packet loss, and connection resets for each dependency in isolation and in combination. We test timeout behaviour, circuit breaker thresholds, retry configuration, and fallback paths. We use WireMock to simulate degraded API responses (slow, malformed, rate-limited).
Tracing Analysis & Remediation
We use Jaeger or Zipkin distributed tracing to analyse failure propagation across the service graph. We identify retry amplification patterns, missing circuit breakers, and absent fallback paths. We deliver a remediation playbook with configuration changes and code patterns.
Deliverables
Before & After
| Metric | Before | After |
|---|---|---|
| Circuit breakers validated | 0 tested | 12 validated |
| Retry storms identified | 0 known | 3 found and fixed |
| Graceful degradation paths | 2 documented | 8 implemented |
Frequently Asked Questions
Can you test against real third-party APIs like Stripe or Twilio?
We use API mocks (WireMock or MockServer) to simulate third-party behaviour without impacting real accounts or incurring costs. The mocks are configurable to simulate the exact failure modes we want to test: timeout, 500 errors, rate limiting, malformed responses. For internal dependencies, we can test against real staging instances.
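For illustration, a WireMock stub mapping that turns a payment endpoint into a 30-second responder might look like the following (the URL path and response body are hypothetical; `fixedDelayMilliseconds` is WireMock's built-in delay setting, and swapping the response for a `429` status or a `fault` value covers the rate-limiting and malformed-response cases):

```json
{
  "request": {
    "method": "POST",
    "urlPath": "/v1/charges"
  },
  "response": {
    "status": 200,
    "fixedDelayMilliseconds": 30000,
    "headers": { "Content-Type": "application/json" },
    "body": "{\"status\": \"succeeded\"}"
  }
}
```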
What is a retry storm and how common are they?
A retry storm occurs when multiple services simultaneously retry failed requests against a struggling dependency, amplifying the load and preventing it from recovering. They are extremely common — we find them in roughly 70% of microservice architectures. The fix is typically exponential backoff with jitter plus a circuit breaker, both of which require testing to validate they work correctly.
We use Istio — does that change the approach?
Yes. Istio's VirtualService fault injection allows us to inject failures at the mesh level without touching application code or installing additional tooling. We configure retry policies, timeouts, and circuit breakers in Istio and validate them with targeted load. If you use Istio, the engagement is typically faster because the injection tooling is already in place.
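As a sketch, assuming a service named `payments`, a VirtualService of this shape would inject 5 seconds of latency into half of the traffic with no sidecar or application changes (name and values are illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments-fault
spec:
  hosts:
    - payments
  http:
    - fault:
        delay:
          percentage:
            value: 50.0
          fixedDelay: 5s
      route:
        - destination:
            host: payments
```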
Know Your Blast Radius
Book a free 30-minute resilience scope call with our chaos engineers. We review your architecture, identify your highest-risk failure modes, and recommend the experiments that will give you the most signal.
Talk to an Expert