Polly retries making outages worse: stop retry storms with backoff and jitter

Jan 29, 2026 · 14 min read


Category: .NET


When retries amplify failures instead of fixing them: how retry storms happen in .NET, how to prove it, and the four components that stop cascading failures.

Free download: Polly Retry Policies package. Jump to the download section.

Paid pack available. Jump to the Axiom pack.

The Real Cost: A 45-Minute Outage from One Missing Parameter

The incident: A team running a critical service added Polly retries to handle transient network blips. They used the simplest form: Retry(3). It looked safe. But one Tuesday night, an external API they depend on went down for 2 minutes. Their service should have degraded gracefully. Instead, it cascaded into a full outage lasting 45 minutes.

Here's why: With no backoff policy, every failed request retried immediately. Three times. All in microseconds. With 500 concurrent users hitting the service across 10 instances, they generated 15,000 failed calls to an API that was already struggling. When the API recovered 2 minutes later, it got hammered by the retry backlog and fell over again. The service that was supposed to be resilient became the problem amplifier.

The cost: 45 minutes of downtime (full outage for external customers), 8 engineers firefighting, post-mortems, customer trust erosion. All because of one line of code that looked reasonable.

This is a retry storm, and it's one of the most dangerous resilience mistakes in .NET. What makes it dangerous is that it feels safe. "Retries are resilience, right?" Wrong: retries without boundaries are chaos amplification.

Rescuing a .NET service in production? Start at the .NET Production Rescue hub and the .NET category.


If you only do three things
  • Classify what you retry. Retry only transient failures (timeouts, 429, selected 5xx, connection resets). Never retry bad requests or auth failures.
  • Add boundaries. Backoff + jitter + caps + stop rules, plus a total time budget so one call cannot burn a whole worker/thread.
  • Make retries observable. Log every retry attempt (attempt number, wait, reason, endpoint, correlation id) and alert on retry rate.

Fast triage table (what to check first)

| Symptom | Likely cause | Confirm fast | First safe move |
| --- | --- | --- | --- |
| Downstream recovers but your service stays slow/unhealthy | Retry backlog + synchronized retries (thundering herd) | Retry rate spikes; many attempts clustered by endpoint in seconds | Add jittered backoff + cap + stop rules; tighten circuit breaker |
| 429 storms after deploy | No rate-limit respect + too much concurrency | 429 count spikes; no Retry-After handling; concurrency high | Cap concurrency; honor Retry-After; add retry budget |
| p95 climbs and CPU is calm during retries | Timeouts missing or too large; retries pin threads | Timeouts rare but in-flight climbs; requests “hang” | Implement timeouts-first and total budget; stop retrying timeouts |
| One endpoint dominates failures | Permanent failure being retried | Same 4xx/401/403 repeats; no success even after retries | Fail fast on permanent errors; fix caller/request |
| Errors spike when dependency degrades | Retry policy is too permissive (retries everything) | High retry volume across many status codes | Classify retryable errors; reduce retry count |

Why retries amplify load during outages instead of recovering

Before implementing any retry policy, you need to understand what happens when you retry without care.

The mental model most teams have: "If a request fails, ask again. Simple."

What actually happens: You amplify load on a system that's already struggling. Here's the cascade:

  1. A downstream dependency slows down or fails (database, API, cache, whatever).
  2. Your service gets a failure. Retry immediately.
  3. All requests retry at once (synchronized failure).
  4. The downstream system gets 2x, 3x, 10x the load it was already choking on.
  5. Downstream goes slower or dies completely.
  6. More retries.
  7. System spirals into unrecoverable state.

The worst part: Your observability sees "lots of retries happening" but misses that retries are causing the cascade, not recovering from it.

Why this happens: Most teams think retries are free. They are not. Each retry consumes concurrency. Each attempt holds resources (threads, sockets, connection pools). When you multiply load during a downstream failure, you can exhaust your own service first.

Why teams choose naive retries: They've heard "retries are good," they want resilience, and the simplest Polly code is easy to write. There's no alarm bell telling you that Retry(3) is dangerous until you're in the middle of an outage. This pattern appears frequently in background jobs and bots where "just retry" feels safe but causes silent cascades.
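For reference, this is roughly what the dangerous one-liner looks like in Polly; a minimal sketch, with an illustrative variable name and handled exception:

```csharp
using System.Net.Http;
using Polly;

// Looks like resilience, but all three retries fire back-to-back with
// no backoff, no jitter, no cap, and no error classification.
var naiveRetry = Policy
    .Handle<HttpRequestException>()
    .RetryAsync(3); // 1 original attempt + 3 immediate retries = 4x load on a failing dependency
```

Every safe version in the rest of this post is a variation on replacing that call with WaitAndRetryAsync plus boundaries.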


When retries make failures worse: transient vs permanent errors

Here's what nobody tells you about retries: They're only safe if you know what you're retrying and why.

Transient failures (network hiccup, temporary unavailability, 503 Service Unavailable) benefit from retries. Waiting a moment and trying again often works.

Permanent failures (404 Not Found, 401 Unauthorized, malformed request) should never be retried. The same failure will happen every time. Retrying does not help. It just wastes time and resources.

Cascading failures (downstream service is down) should never be retried with aggressive policies. If the database is gone, retrying your query 3 times instantly doesn't bring it back. It just kills your own thread pool faster.

The decision of when to retry is business logic. Getting it wrong costs you. Teams often don't distinguish between these cases, which is why they end up with policies that hurt instead of help.

Another tradeoff: Retries delay failure. If you are going to fail anyway (say, a vendor API is down for 10 minutes), retrying for 5 minutes means your customer waits 5+ minutes for a failure message instead of getting it immediately. Some paths should fail fast, not retry hopefully.


What to retry and what to fail fast: decision checklist

Not all retries are created equal. Here's how to think about what to retry:

Retry if:

  • The error is transient (networking blip, temporary slowness, 503 Service Unavailable, connection timeout).
  • The error is not your fault (you didn't send a bad request; the system is just busy).
  • There's a reasonable chance waiting a bit will help (yes for network, maybe for 429, no for 404).

Don't retry if:

  • The error is permanent (404, 401, 400, validation failure).
  • The error indicates the downstream system is down and won't recover quickly.
  • Retrying will consume resources faster than failure would (e.g., database is fully locked; retrying queries makes it worse).

This distinction is critical. A single retry policy for "all errors" is how you end up in the 45-minute outage scenario.
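Here is a minimal sketch of that classification as code. The status list is a starting point taken from this checklist, not a universal rule, and the class name is illustrative:

```csharp
using System;
using System.Net;
using System.Net.Http;

static class RetryClassifier
{
    // Retry only outcomes that plausibly succeed on a later attempt.
    public static bool IsTransient(HttpResponseMessage response) =>
        response.StatusCode == HttpStatusCode.RequestTimeout        // 408
        || response.StatusCode == HttpStatusCode.TooManyRequests    // 429 (honor Retry-After)
        || (int)response.StatusCode >= 500;                         // selected 5xx

    public static bool IsTransient(Exception ex) =>
        ex is HttpRequestException or TimeoutException;             // connection resets, timeouts

    // 400, 401, 403, 404 and validation failures are permanent: fail fast and fix the caller.
}
```

Feeding these predicates into Handle/OrResult keeps the "what do we retry" decision in one reviewable place instead of scattered across call sites.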


Four components that stop retry storms: backoff, jitter, caps, stop rules

There are four components to a safe retry policy. Miss any one, and you are back in trouble.

Component 1: Exponential backoff: Wait longer with each retry, doubling each time. First retry waits 1 second, second waits 2 seconds, third waits 4 seconds. Why? Because if the downstream system is struggling, hammering it immediately does not help. Waiting gives it time to recover. The first retry is fast (good for transient blips), later retries are slower (good for real outages where recovery takes time).

Component 2: Jitter: Add randomness to the wait time. Instead of "wait 1 second", wait "1 second + random 0 to 1 second". Why? If 100 requests all fail at the exact same moment, you do not want all 100 retrying at the exact same moment 1 second later. That is a thundering herd, and it kills recovery. Jitter spreads retries across a time range, so the downstream system gets requests more steadily instead of in synchronized waves.

Component 3: Caps: Do not let exponential backoff grow forever. Cap the max wait at something reasonable (30 to 60 seconds). After you have waited that long for a single request, you are not recovering. You are just tying up resources.

Component 4: Stop rules: After N retries (usually 3 to 5), give up. If it has not recovered by then, it probably will not for this request. Fail fast and let upstream callers handle it (with their circuit breaker, fallback, etc.).

All four matter. Any one missing and you're vulnerable.
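If you would rather not hand-roll the jitter math, a maintained formula exists. A minimal sketch assuming the optional Polly.Contrib.WaitAndRetry package: the pre-computed delays cover backoff and jitter, the small retry count is the stop rule, and you can still clamp the values if your SLA needs a hard ceiling:

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.Contrib.WaitAndRetry;

// Pre-computed, jittered, exponentially growing delays (decorrelated jitter).
var delays = Backoff.DecorrelatedJitterBackoffV2(
    medianFirstRetryDelay: TimeSpan.FromSeconds(1),
    retryCount: 4); // keep this small so the sequence stays bounded

var retryWithJitter = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(delays);
```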


Shipped asset: production-ready Polly package

Download
Free

Resilient HTTP Client (C#): production-ready retries

Free download: a production-grade HTTP retry setup for .NET, built for real-world API failures.

When to use this (fit check)
  • You call external HTTP dependencies where timeouts, 429s, and 5xx spikes happen in production.
  • You want safe defaults (backoff + jitter + caps + stop rules) that don’t amplify outages.
  • You need retry attempts to be observable (structured events + retry-rate alerts).
When NOT to use this (yet)
  • You are retrying writes with side effects and you don’t have idempotency (add idempotency first).
  • You don’t have timeouts and a total budget (start with timeouts-first, then add retries).
  • You can’t classify errors (you’ll end up retrying permanent failures and wasting concurrency).

What you get (4 files):

  • ResilientHttpClient.cs: a copy-paste ready wrapper that implements backoff + jitter + caps + stop rules and exposes one place to enforce policy defaults
  • retry-policy-checklist.md: a decision checklist for what to retry, what to stop, and where to fail fast
  • retry-event-logging-schema.json: a structured event shape for retry attempts so you can measure retry rate and spot amplification quickly
  • README.md: integration notes (timeouts first, policy wiring, and operational boundaries)
Axiom Pack
$49

Retry Policy Kit: Battle-Tested Resilience for Production

Managing retries across multiple services? Get pre-configured Polly policies with monitoring integration, circuit breaker patterns, and incident runbooks. Stop debugging retry storms in production.

  • 10+ production-grade Polly policies for HTTP, gRPC, and database calls
  • Circuit breaker + retry coordination patterns
  • Monitoring integration (Prometheus, OpenTelemetry, Application Insights)
  • Incident runbooks for retry storm diagnosis and mitigation
Get Retry Policy Kit →

A policy shape that works in production

The exact code varies by client and SLA. The shape does not.

  1. Timeout budget first. One call has a maximum budget including retries, or you convert dependency failure into your own timeout cascade.
  2. Retry only what is plausibly transient. 429, 5xx, timeouts, connection resets.
  3. Backoff + jitter + cap + stop. Spread retries and stop after a small number.
  4. Circuit breaker for cascades. If failures cluster, stop hammering and fail fast.
  5. Structured retry logs. Retrying without a queryable event is how storms hide.

This is the smallest excerpt worth copying (the full version is in the shipped package):

```csharp
// Exponential backoff + jitter (capped)
var retry = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .OrResult(r => (int)r.StatusCode >= 500 || (int)r.StatusCode == 429)
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: attempt =>
        {
            var exponential = Math.Pow(2, attempt - 1); // 1, 2, 4
            var jitterSeconds = Random.Shared.NextDouble();
            var seconds = Math.Min(exponential + jitterSeconds, 30);
            return TimeSpan.FromSeconds(seconds);
        },
        onRetry: (outcome, delay, attempt, _) =>
        {
            // log attempt, delay, endpoint, reason, correlation id
        });
```
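If you use IHttpClientFactory, one common way to wire the excerpt in follows the timeouts-first rule: a total budget outside the retry, a per-attempt timeout inside it. A minimal sketch assuming the Microsoft.Extensions.Http.Polly package; the client name and budget values are illustrative:

```csharp
// Outermost: total budget for the whole call, including every retry and wait.
var totalBudget = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(10));

// Innermost: per-attempt timeout, deliberately smaller than the total budget.
var perAttempt = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(3));

// Wrap order matters: the first argument is outermost. "services" is your IServiceCollection.
services.AddHttpClient("orders")
    .AddPolicyHandler(Policy.WrapAsync(totalBudget, retry, perAttempt));
```

With this shape, a dead dependency costs a caller at most the total budget, no matter how the retries play out.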

What to log: make retries observable

Every retry should log context. You can't fix what you can't see. Log these fields every time:

```json
{
  "timestamp": "2026-01-27T14:30:20.123Z",
  "event": "api_call_retry",
  "url": "https://api.example.com/orders",
  "failure_reason": "timeout / 503 / 429",
  "attempt_number": 2,
  "max_attempts": 3,
  "wait_seconds_before_retry": 2.456,
  "total_elapsed_seconds_so_far": 4.200,
  "correlation_id": "req-abc123"
}
```

With these fields, you can:

  • See which endpoints are flaky (high retry rates).
  • Spot retry storms (many retries for same endpoint, all failing).
  • Correlate retries with downstream outages.
  • Measure recovery time ("how long did retries take before success?").

One additional metric matters in incident response: retry rate per instance. If it jumps, you are in the amplification phase.

Use correlation IDs to trace retries across services so you can see which original request spawned which retry attempts and how retries cascade through your dependency chain.
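Here is a sketch of what the onRetry stub from the policy excerpt can emit, assuming an injected ILogger and a URL stored in Polly's Context. The field names mirror the schema above, but the wiring is illustrative, not the shipped package's exact code:

```csharp
// Pass this as the onRetry argument in the excerpt above (the Context parameter is
// used here, not discarded). "logger" is an injected ILogger and the "url" key is
// an assumption about what the caller stores in the Context before execution.
Action<DelegateResult<HttpResponseMessage>, TimeSpan, int, Context> onRetry =
    (outcome, delay, attempt, context) =>
        logger.LogWarning(
            "api_call_retry url={Url} reason={FailureReason} attempt={AttemptNumber}/{MaxAttempts} wait_s={WaitSecondsBeforeRetry} correlation_id={CorrelationId}",
            context.TryGetValue("url", out var url) ? url : "unknown",
            outcome.Exception?.GetType().Name ?? ((int?)outcome.Result?.StatusCode)?.ToString(),
            attempt,
            3,
            delay.TotalSeconds,
            context.CorrelationId);
```

Because the message template uses named placeholders, structured logging sinks can index retry rate per endpoint directly, which is what the alerting item in the checklist below needs.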


How to stop a retry storm in production right now

This is a production rescue lane, not a best-practices checklist. When you are paging, you need actions that reduce load now.

  1. Confirm it is retries. Look for a jump in retry logs, especially clustered by one dependency and one endpoint.
  2. Reduce concurrency first. Cap outbound concurrency per dependency (see the bulkhead sketch below). Fewer in-flight requests means fewer retries.
  3. Tighten stop rules. Lower retry count temporarily. Lower cap. Shorten circuit breaker sampling windows.
  4. Enforce a total budget. If the downstream is down, do not spend minutes retrying per request.
  5. Degrade intentionally. Return partial data, cached data, or an explicit 503 quickly. Do not hold threads.
  6. After stabilization: classify which statuses/exceptions are retried, and add tests so a future change cannot reintroduce Retry(3).

If you are also seeing requests timing out with normal CPU, treat that as thread pool starvation and fix time budgets and cancellation across the call path first.
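Step 2 above (reduce concurrency) can be enforced in code rather than by hand. A minimal sketch using Polly's bulkhead, with illustrative limits and an assumed retryWithBackoff policy:

```csharp
using System.Net.Http;
using Polly;

// At most 20 concurrent calls to the flaky dependency; extra callers are rejected
// immediately instead of queueing retries behind an outage.
var bulkhead = Policy.BulkheadAsync<HttpResponseMessage>(
    maxParallelization: 20,
    maxQueuingActions: 0);

// Bulkhead outermost so the concurrency cap applies to the retries as well.
var guarded = Policy.WrapAsync(bulkhead, retryWithBackoff);
```

Lowering maxParallelization during an incident is often the single fastest lever for stopping amplification.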


Tradeoffs

  • Retries can hide deterministic failures. Poison inputs, auth failures, and invalid requests must stop and surface immediately.
  • Retries add latency. That latency must fit inside a request budget, or you convert a dependency outage into your own outage.
  • Circuit breakers change failure modes. That is the point. Make sure the caller path is designed to handle fast failures.

Common questions

Why do retries without backoff make outages worse?

Retries without backoff amplify load on a struggling system. If you retry immediately, you send 2x-10x traffic to something already failing. Add exponential backoff (1s, 2s, 4s) and jitter to spread retry attempts over time instead of hammering the dependency in synchronized waves.

How do I confirm that retries caused the cascade?

Check logs for clustered retry attempts to one endpoint within seconds. Look for a retry rate spike and an endpoint failure rate spike happening together. If the downstream recovered but your service didn't, retries caused the cascade. Query for: high retry attempt count, same endpoint, short time window.

Does a circuit breaker prevent retry storms?

Yes. A circuit breaker opens after N consecutive failures and stops sending requests entirely for a cooldown period. This prevents retry amplification during sustained outages. Use circuit breaker + retry together: retry handles transient blips (1-2 failures), circuit breaker handles sustained outages (5+ failures), as sketched below.
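A sketch of that pairing in Polly v7 syntax; the thresholds are illustrative, not recommendations for your SLA, and the retry variable is the backoff policy from earlier:

```csharp
using System;
using System.Net.Http;
using Polly;

// Breaker: after 5 handled failures in a row, fail fast for 30 seconds.
var breaker = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .OrResult(r => (int)r.StatusCode >= 500)
    .CircuitBreakerAsync(
        handledEventsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30));

// Retry on the outside handles short blips; the breaker on the inside stops
// the hammering once failures are sustained.
var resilient = Policy.WrapAsync(retry, breaker);
```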

Is Polly's Retry(3) safe on its own?

No, not by itself. Retry(3) with no backoff retries instantly three times. This amplifies load 4x (1 original + 3 retries) in microseconds. Under load with many concurrent requests, this creates a thundering herd. Always add WaitAndRetryAsync with exponential backoff + jitter to spread attempts over time.

Should I retry 404 or 401 responses?

No. 404 (Not Found) and 401 (Unauthorized) are permanent failures. Retrying won't fix them and wastes resources. Only retry transient errors: timeouts, 429 (rate limit), 503 (temporarily unavailable), connection resets, some 5xx errors. Classify errors in your policy: retry transient, fail fast on permanent.

Will retries help if my service has thread pool starvation?

No, retries make thread starvation worse. Each retry holds a thread waiting. If the pool is exhausted, adding more retries pins more threads. Fix thread starvation first: add timeouts, use async/await properly, avoid sync-over-async. Then add bounded retries with backoff so waiting threads are released faster.

How do I tune retry behavior during an incident without redeploying?

Feature flag your retry policy settings or use configuration-driven policies. During an incident: lower max retries from 3 to 1, reduce the backoff cap from 30s to 10s, tighten circuit breaker thresholds (open after 3 failures instead of 5). After recovery, restore the normal policy. This gives you a production escape hatch, as sketched below.
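A minimal sketch of that escape hatch with IOptionsMonitor, so configuration changes are picked up without a redeploy; the class, member names, and values are hypothetical:

```csharp
using System;
using Microsoft.Extensions.Options;

// Bound from configuration (appsettings, environment, or a feature-flag provider).
public sealed class RetrySettings
{
    public int MaxAttempts { get; set; } = 3;
    public double MaxBackoffSeconds { get; set; } = 30;
}

public sealed class RetryPolicyFactory
{
    private readonly IOptionsMonitor<RetrySettings> _settings;

    public RetryPolicyFactory(IOptionsMonitor<RetrySettings> settings) => _settings = settings;

    // Read CurrentValue when building the policy so incident-time overrides
    // (e.g. MaxAttempts = 1, MaxBackoffSeconds = 10) take effect on the next build.
    public (int RetryCount, TimeSpan Cap) CurrentLimits()
    {
        var s = _settings.CurrentValue;
        return (s.MaxAttempts, TimeSpan.FromSeconds(s.MaxBackoffSeconds));
    }
}
```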


Additional Questions

Should every call get a retry policy?

No. Use retries for external dependencies where transient failure is normal and you can tolerate a small extra latency budget.

If a call must be correct and fast (auth, writes with side effects, internal database calls in a hot path), failing fast is often safer than retrying.

Which errors are safe to retry?

Common safe candidates: timeouts, connection resets, 429 (rate limiting), and some 5xx responses.

Common unsafe candidates: 400 series from your own request mistakes (400, 401, 403, 404), validation failures, and business rule failures.

How many retry attempts should I configure?

Usually 2 to 4. Past that you are often just delaying failure and consuming concurrency.

The correct number is whatever fits inside your total budget while still giving the dependency a chance to recover.

Does the Polly version matter?

No. The principles are the same across versions: classify, backoff, jitter, cap, stop, and log.

Use whatever Polly version your stack supports, then lock in the behavior with tests so future refactors do not regress to naive Retry(3).

What should I do during a sustained outage?

Treat it like load shedding: reduce concurrency, reduce retry count, shorten budgets, and open the circuit earlier.

You can always loosen the policy after the dependency recovers. You cannot recover a thread pool that is pinned by endless waiting.


Resources


Checklist (copy/paste)

  • Total time budget is defined per dependency call path (not just “retry 3 times”).
  • Per-attempt timeouts exist and are smaller than the total budget.
  • Retryable errors are explicitly classified (timeouts/429/selected 5xx); permanent failures fail fast.
  • Backoff is exponential and capped (no unbounded waits).
  • Jitter is applied (no synchronized retry waves).
  • Max attempts is small and enforced (start 2–4).
  • Circuit breaker opens on sustained failure clusters (fast failure during outages).
  • Concurrency is capped around flaky dependencies (bulkhead) to prevent amplification.
  • Retry logs include: endpoint/dependency, attempt, delay, reason, total elapsed, correlation id.
  • Alert exists for retry rate spikes per dependency/endpoint.

Coming soon

If you want more runbooks like this (plus logging schemas and decision trees), that is what Axiom is becoming.

Join to get notified as we ship new operational assets for reliability.



Key takeaways

  • Naive retries cause retry storms. Retry(3) without backoff amplifies load and cascades failures.
  • Classify what you retry. Transient errors benefit from retries; permanent errors do not.
  • Backoff + jitter + caps + stop rules are not optional.
  • Circuit breakers prevent cascades by forcing fast failure.
  • Observable retries are debuggable retries. Log every attempt and alert on retry rate.

If you want a fast diagnosis of a retry storm, see .NET Production Rescue or contact me and include one example retry log event plus the endpoint and status distribution.

Recommended resources

Download the shipped checklist/templates for this post.

A small shipped kit for safe Polly retries: C# client wrapper, retry checklist, retry logging schema, and setup notes.

