What Happens When Stripe Goes Down at Checkout?
External dependencies fail in ways you cannot control. We simulate payment gateway outages, third-party API failures, and internal service degradation to validate that your system degrades gracefully instead of failing catastrophically.
Dependency failure testing validates the most common source of cascading outages in distributed systems: a single dependency that degrades and takes everything downstream with it. External payment gateways, authentication services, notification APIs, and internal microservices all fail in ways that are unpredictable in timing but very predictable in pattern — and those patterns can be tested.
The most dangerous failure mode is not a hard crash but a slow dependency: a payment API that takes 30 seconds to respond instead of 300 milliseconds. Without timeouts, your thread pool fills with waiting requests, your response queue backs up, and a dependency that handles 2% of your traffic can take down 100% of your capacity. Toxiproxy’s latency injection surfaces these exact failure modes in a controlled environment.
Circuit breaker validation is a core deliverable: we verify that circuit breakers open under the conditions they’re configured for, that they half-open correctly after the recovery period, and that they don’t create their own failure modes through misconfigured thresholds. We also test retry logic for amplification effects — a correctly implemented retry with exponential backoff and jitter looks very different from a retry loop that creates a storm.
Engagement Phases
Dependency Mapping & Risk Scoring
We map your complete dependency graph — internal services, external APIs, databases, message queues, caches, CDNs, and payment gateways. We score each dependency by blast radius and failure probability to prioritise the experiment backlog.
Failure Injection
We use Toxiproxy to inject latency, packet loss, and connection resets for each dependency in isolation and in combination. We test timeout behaviour, circuit breaker thresholds, retry configuration, and fallback paths. We use WireMock to simulate degraded API responses (slow, malformed, rate-limited).
Tracing Analysis & Remediation
We use Jaeger or Zipkin distributed tracing to analyse failure propagation across the service graph. We identify retry amplification patterns, missing circuit breakers, and absent fallback paths. We deliver a remediation playbook with configuration changes and code patterns.
Deliverables
Before & After
| Metric | Before | After |
|---|---|---|
| Circuit breakers validated | 0 tested | 12 validated |
| Retry storms identified | 0 known | 3 found and fixed |
| Graceful degradation paths | 2 documented | 8 implemented |
Frequently Asked Questions
Can you test against real third-party APIs like Stripe or Twilio?
We use API mocks (WireMock or MockServer) to simulate third-party behaviour without impacting real accounts or incurring costs. The mocks are configurable to simulate the exact failure modes we want to test: timeout, 500 errors, rate limiting, malformed responses. For internal dependencies, we can test against real staging instances.
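For illustration, a WireMock stub mapping that turns a payment endpoint into a 30-second responder might look like the following (the URL path and response body are hypothetical; `fixedDelayMilliseconds` is WireMock's built-in delay setting, and swapping the response for a `429` status or a `fault` value covers the rate-limiting and malformed-response cases):

```json
{
  "request": {
    "method": "POST",
    "urlPath": "/v1/charges"
  },
  "response": {
    "status": 200,
    "fixedDelayMilliseconds": 30000,
    "headers": { "Content-Type": "application/json" },
    "body": "{\"status\": \"succeeded\"}"
  }
}
```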
What is a retry storm and how common are they?
A retry storm occurs when multiple services simultaneously retry failed requests against a struggling dependency, amplifying the load and preventing it from recovering. They are extremely common — we find them in roughly 70% of microservice architectures. The fix is typically exponential backoff with jitter plus a circuit breaker, both of which require testing to validate they work correctly.
We use Istio — does that change the approach?
Yes. Istio's VirtualService fault injection allows us to inject failures at the mesh level without touching application code or installing additional tooling. We configure retry policies, timeouts, and circuit breakers in Istio and validate them with targeted load. If you use Istio, the engagement is typically faster because the injection tooling is already in place.
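As a sketch, assuming a service named `payments`, a VirtualService of this shape would inject 5 seconds of latency into half of the traffic with no sidecar or application changes (name and values are illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments-fault
spec:
  hosts:
    - payments
  http:
    - fault:
        delay:
          percentage:
            value: 50.0
          fixedDelay: 5s
      route:
        - destination:
            host: payments
```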
Know Your Blast Radius
Book a free 30-minute resilience scope call with our chaos engineers. We review your architecture, identify your highest-risk failure modes, and recommend the experiments that will give you the most signal.
Talk to an Expert