Polly retry policies done right: backoff + jitter + caps + stop rules

Jan 29, 2026 · 10 min read

Category: .NET


Build retry policies that stop retry storms, thundering herds, and cascading failures in .NET.

The Real Cost: A 45-Minute Outage from One Missing Parameter

The incident: A team running a critical service added Polly retries to handle transient network blips. They used the simplest form: Retry(3). It looked safe. But one Tuesday night, an external API they depend on went down for 2 minutes. Their service should have degraded gracefully. Instead, it cascaded into a full outage lasting 45 minutes.

Here's why: with no backoff policy, every failed request retried immediately. Three times. Back to back, within milliseconds. Over the two-minute outage, 500 concurrent users across 10 instances generated roughly 15,000 failed calls to an API that was already struggling. When the API recovered, it got hammered by the retry backlog and fell over again. The service that was supposed to be resilient became the problem amplifier.

The cost: 45 minutes of downtime (full outage for external customers), 8 engineers firefighting, post-mortems, customer trust erosion. All because of one line of code that looked reasonable.

This is a retry storm, and it's one of the most dangerous resilience mistakes in .NET. What makes it dangerous is that it feels safe. "Retries are resilience, right?" Wrong: retries without boundaries are chaos amplification.


If you only do three things

  1. Classify what you retry. Retry only transient failures (timeouts, 429, 5xx, connection resets). Never retry bad requests or auth failures.
  2. Add boundaries. Backoff + jitter + caps + stop rules, and a total time budget so one call cannot burn a whole thread.
  3. Make retries observable. Log every retry attempt (attempt number, wait, reason, endpoint, correlation id) and alert on retry rate.

Why naive retries backfire: the mechanism

Before implementing any retry policy, you need to understand what happens when you retry without care.

The mental model most teams have: "If a request fails, ask again. Simple."

What actually happens: You amplify load on a system that's already struggling. Here's the cascade:

  1. A downstream dependency slows down or fails (database, API, cache, whatever).
  2. Your service gets a failure. Retry immediately.
  3. All requests retry at once (synchronized failure).
  4. The downstream system gets 2x, 3x, 10x the load it was already choking on.
  5. Downstream goes slower or dies completely.
  6. More retries.
  7. System spirals into unrecoverable state.

The worst part: Your observability sees "lots of retries happening" but misses that retries are causing the cascade, not recovering from it.

Why this happens: Most teams think retries are free. They are not. Each retry consumes concurrency. Each attempt holds resources (threads, sockets, connection pools). When you multiply load during a downstream failure, you can exhaust your own service first.

Why teams choose naive retries: They've heard "retries are good," they want resilience, and the simplest Polly code is easy to write. There's no alarm bell telling you that Retry(3) is dangerous until you're in the middle of an outage.
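
For contrast, here is roughly what that dangerous default looks like in Polly's classic syntax. A minimal sketch; httpClient and the endpoint are placeholders:

csharp
// The dangerous shape: no classification, no backoff, no jitter, no cap.
// Every failure becomes four immediate calls against a struggling dependency.
var naive = Policy
  .Handle<HttpRequestException>()
  .RetryAsync(3);

var response = await naive.ExecuteAsync(
  () => httpClient.GetAsync("https://api.example.com/orders"));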


The tradeoff most teams miss

Here's what nobody tells you about retries: They're only safe if you know what you're retrying and why.

Transient failures (network hiccup, temporary unavailability, 503 Service Unavailable) benefit from retries. Waiting a moment and trying again often works.

Permanent failures (404 Not Found, 401 Unauthorized, malformed request) should never be retried. The same failure will happen every time. Retrying does not help. It just wastes time and resources.

Cascading failures (downstream service is down) should never be retried with aggressive policies. If the database is gone, retrying your query 3 times instantly doesn't bring it back. It just kills your own thread pool faster.

The decision of when to retry is business logic. Getting it wrong costs you. Teams often don't distinguish between these cases, which is why they end up with policies that hurt instead of help.

Another tradeoff: Retries delay failure. If you are going to fail anyway (say, a vendor API is down for 10 minutes), retrying for 5 minutes means your customer waits 5+ minutes for a failure message instead of getting it immediately. Some paths should fail fast, not retry hopefully.


Building safer policies: a decision framework

Not all retries are created equal. Here's how to think about what to retry:

Retry if:

  • The error is transient (networking blip, temporary slowness, 503 Service Unavailable, connection timeout).
  • The error is not your fault (you didn't send a bad request; the system is just busy).
  • There's a reasonable chance waiting a bit will help (yes for network, maybe for 429, no for 404).

Don't retry if:

  • The error is permanent (404, 401, 400, validation failure).
  • The error indicates the downstream system is down and won't recover quickly.
  • Retrying will consume resources faster than failure would (e.g., database is fully locked; retrying queries makes it worse).

This distinction is critical. A single retry policy for "all errors" is how you end up in the 45-minute outage scenario.
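
A minimal sketch of that classification as one reusable predicate, so the retry handler and the "do not retry" list cannot drift apart. The exact status list is an assumption; tune it to the dependency:

csharp
// Retry only failures that are plausibly transient.
static bool IsTransient(HttpResponseMessage response)
{
  var code = (int)response.StatusCode;
  return code == 429 || code >= 500; // rate limited, or server-side/transient failure
}

// 400, 401, 403, 404 and validation failures fall through: fail fast and surface them.
var transientOnly = Policy<HttpResponseMessage>
  .Handle<HttpRequestException>()   // connection resets, DNS failures
  .Or<TaskCanceledException>()      // HttpClient timeouts surface as cancellation
  .OrResult(r => IsTransient(r));   // chain WaitAndRetryAsync(...) onto this builder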


The mechanics: what makes Polly policies actually resilient

There are four components to a safe retry policy. Miss any one, and you are back in trouble.

Component 1: Exponential backoff: Wait longer with each retry, doubling each time. First retry waits 1 second, second waits 2 seconds, third waits 4 seconds. Why? Because if the downstream system is struggling, hammering it immediately does not help. Waiting gives it time to recover. The first retry is fast (good for transient blips), later retries are slower (good for real outages where recovery takes time).

Component 2: Jitter: Add randomness to the wait time. Instead of "wait 1 second", wait "1 second + random 0 to 1 second". Why? If 100 requests all fail at the exact same moment, you do not want all 100 retrying at the exact same moment 1 second later. That is a thundering herd, and it kills recovery. Jitter spreads retries across a time range, so the downstream system gets requests more steadily instead of in synchronized waves.

Component 3: Caps: Do not let exponential backoff grow forever. Cap the max wait at something reasonable (30 to 60 seconds). After you have waited that long for a single request, you are not recovering. You are just tying up resources.

Component 4: Stop rules: After N retries (usually 3 to 5), give up. If it has not recovered by then, it probably will not for this request. Fail fast and let upstream callers handle it (with their circuit breaker, fallback, etc.).

All four matter. Any one missing and you're vulnerable.
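
Those four numbers also pin down the worst case, and it is worth computing before you ship: how long can one call hold a thread before it finally fails? A small sketch of the arithmetic, assuming a 10-second per-attempt timeout (use your own):

csharp
// Worst-case time one call can spend before giving up:
// every attempt can run to its timeout, plus the capped backoff waits in between.
static TimeSpan WorstCaseBudget(int retryCount = 3, double attemptTimeoutSeconds = 10, double capSeconds = 30)
{
  var total = (retryCount + 1) * attemptTimeoutSeconds;               // 4 attempts x 10s = 40s
  for (var attempt = 1; attempt <= retryCount; attempt++)
    total += Math.Min(Math.Pow(2, attempt - 1) + 1, capSeconds);      // waits: 2s + 3s + 5s (max jitter)
  return TimeSpan.FromSeconds(total);                                 // ~50s for the defaults above
}

If roughly 50 seconds is longer than your caller will wait, something has to give: fewer retries, a smaller base delay, or a tighter per-attempt timeout.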


Shipped asset: production-ready Polly package


Resilient HTTP Client (C#): production-ready retries

Free download: a production-grade HTTP retry setup for .NET, built for real-world API failures.

What you get (4 files):

  • ResilientHttpClient.cs: a copy-paste ready wrapper that implements backoff + jitter + caps + stop rules and exposes one place to enforce policy defaults
  • retry-policy-checklist.md: a decision checklist for what to retry, what to stop, and where to fail fast
  • retry-event-logging-schema.json: a structured event shape for retry attempts so you can measure retry rate and spot amplification quickly
  • README.md: integration notes (timeouts first, policy wiring, and operational boundaries)

A policy shape that works in production

The exact code varies by client and SLA. The shape does not.

  1. Timeout budget first. One call has a maximum budget, including retries.
  2. Retry only what is plausibly transient. 429, 5xx, timeouts, connection resets.
  3. Backoff + jitter + cap + stop. Spread retries and stop after a small number.
  4. Circuit breaker for cascades. If failures cluster, stop hammering and fail fast.
  5. Structured retry logs. Retrying without a queryable event is how storms hide.

This is the smallest excerpt worth copying (the full version is in the shipped package):

csharp
// Exponential backoff + jitter (capped)
var retry = Policy<HttpResponseMessage>
  .Handle<HttpRequestException>()
  .OrResult(r => (int)r.StatusCode >= 500 || (int)r.StatusCode == 429)
  .WaitAndRetryAsync(
    retryCount: 3,
    sleepDurationProvider: attempt =>
    {
      var exponential = Math.Pow(2, attempt - 1); // 1, 2, 4
      var jitterSeconds = Random.Shared.NextDouble();
      var seconds = Math.Min(exponential + jitterSeconds, 30);
      return TimeSpan.FromSeconds(seconds);
    },
    onRetry: (outcome, delay, attempt, _) =>
    {
      // log attempt, delay, endpoint, reason, correlation id
    });
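
Wiring the rest of that list around the retry policy gives the full shape. A hedged sketch in the same classic syntax; the 45-second budget, breaker thresholds, and httpClient are illustrative assumptions, not prescriptions:

csharp
// 1. Total budget for the whole call, retries included.
var totalBudget = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(45));

// 4. Circuit breaker: if failures cluster, stop hammering and fail fast for a while.
var breaker = Policy<HttpResponseMessage>
  .Handle<HttpRequestException>()
  .OrResult(r => (int)r.StatusCode >= 500 || (int)r.StatusCode == 429)
  .CircuitBreakerAsync(
    handledEventsAllowedBeforeBreaking: 5,
    durationOfBreak: TimeSpan.FromSeconds(30));

// Outermost first: the budget bounds everything, the breaker sits closest to the call.
var pipeline = Policy.WrapAsync<HttpResponseMessage>(totalBudget, retry, breaker);

var response = await pipeline.ExecuteAsync(
  ct => httpClient.GetAsync("https://api.example.com/orders", ct),
  CancellationToken.None);

Order matters: the budget sits outermost so retries can never exceed it, and the breaker sits inside the retry so every attempt counts toward tripping it.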

What to log: make retries observable

Every retry should log context. You can't fix what you can't see. Log these fields every time:

json
{
  "timestamp": "2026-01-27T14:30:20.123Z",
  "event": "api_call_retry",
  "url": "https://api.example.com/orders",
  "failure_reason": "timeout / 503 / 429",
  "attempt_number": 2,
  "max_attempts": 3,
  "wait_seconds_before_retry": 2.456,
  "total_elapsed_seconds_so_far": 4.200,
  "elapsed_since_original_request_seconds": 4.200,
  "correlation_id": "req-abc123"
}

With these fields, you can:

  • See which endpoints are flaky (high retry rates).
  • Spot retry storms (many retries for same endpoint, all failing).
  • Correlate retries with downstream outages.
  • Measure recovery time ("how long did retries take before success?").

One additional metric matters in incident response: retry rate per instance. If it jumps, you are in the amplification phase.
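
A minimal sketch of a logging helper for the onRetry placeholder in the policy above, assuming a Microsoft.Extensions.Logging ILogger plus the request URI and correlation id come from your own plumbing (they are assumptions, not Polly-provided values):

csharp
using Microsoft.Extensions.Logging;
using Polly;

// Call this from the onRetry hook so the code and the schema above cannot drift apart.
static void LogRetry(
  ILogger logger, DelegateResult<HttpResponseMessage> outcome, TimeSpan delay,
  int attempt, int maxAttempts, string requestUri, string correlationId)
{
  var reason = outcome.Exception is not null
    ? outcome.Exception.GetType().Name              // e.g. HttpRequestException, TaskCanceledException
    : ((int)outcome.Result.StatusCode).ToString();  // e.g. 503, 429

  logger.LogWarning(
    "api_call_retry url={Url} failure_reason={FailureReason} attempt_number={Attempt} " +
    "max_attempts={MaxAttempts} wait_seconds_before_retry={WaitSeconds:F3} correlation_id={CorrelationId}",
    requestUri, reason, attempt, maxAttempts, delay.TotalSeconds, correlationId);
}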


Operator playbook: stopping a retry storm

This is a production rescue lane, not a best-practices checklist. When you are getting paged, you need actions that reduce load now.

  1. Confirm it is retries. Look for a jump in retry logs, especially clustered by one dependency and one endpoint.
  2. Reduce concurrency first. Cap outbound concurrency per dependency (see the sketch after this playbook). Fewer in-flight requests means fewer retries.
  3. Tighten stop rules. Lower retry count temporarily. Lower cap. Shorten circuit breaker sampling windows.
  4. Enforce a total budget. If the downstream is down, do not spend minutes retrying per request.
  5. Degrade intentionally. Return partial data, cached data, or an explicit 503 quickly. Do not hold threads.
  6. After stabilization: classify which statuses/exceptions are retried, and add tests so a future change cannot reintroduce Retry(3).

If you are also seeing thread pool starvation symptoms, treat that as a secondary incident and fix time budgets and cancellation across the call path.
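
A minimal sketch of step 2 above: a per-dependency gate with SemaphoreSlim so in-flight calls, and therefore retries, are capped. The limit and wait time are assumptions; Polly's bulkhead policy is a declarative alternative:

csharp
// One gate per downstream dependency. Fewer in-flight calls means fewer retries to amplify.
static readonly SemaphoreSlim OrdersApiGate = new(initialCount: 20, maxCount: 20);

static async Task<HttpResponseMessage> CallOrdersApiAsync(
  Func<CancellationToken, Task<HttpResponseMessage>> call, CancellationToken ct)
{
  // Do not queue forever: if the gate is saturated, shed load instead of piling up.
  if (!await OrdersApiGate.WaitAsync(TimeSpan.FromSeconds(1), ct))
    return new HttpResponseMessage(HttpStatusCode.ServiceUnavailable);

  try { return await call(ct); }
  finally { OrdersApiGate.Release(); }
}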


Tradeoffs

  • Retries can hide deterministic failures. Poison inputs, auth failures, and invalid requests must stop and surface immediately.
  • Retries add latency. That latency must fit inside a request budget, or you convert a dependency outage into your own outage.
  • Circuit breakers change failure modes. That is the point. Make sure the caller path is designed to handle fast failures.

FAQ

Should every call get a retry policy?

No. Use retries for external dependencies where transient failure is normal and you can tolerate a small extra latency budget.

If a call must be correct and fast (auth, writes with side effects, internal database calls in a hot path), failing fast is often safer than retrying.

Which failures are safe to retry, and which are not?

Common safe candidates: timeouts, connection resets, 429 (rate limiting), and some 5xx responses.

Common unsafe candidates: 400 series from your own request mistakes (400, 401, 403, 404), validation failures, and business rule failures.

How many retries should I use?

Usually 2 to 4. Past that you are often just delaying failure and consuming concurrency.

The correct number is whatever fits inside your total budget while still giving the dependency a chance to recover.

Does any of this depend on a specific Polly version?

No. The principles are the same across versions: classify, backoff, jitter, cap, stop, and log.

Use whatever Polly version your stack supports, then lock in the behavior with tests so future refactors do not regress to naive Retry(3).
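
For teams already on v8, the same shape maps onto the ResiliencePipeline API. A rough sketch, assuming Polly 8.x (MaxDelay needs a recent 8.x release):

csharp
using Polly;
using Polly.Retry;

var pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
  .AddRetry(new RetryStrategyOptions<HttpResponseMessage>
  {
    ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
      .Handle<HttpRequestException>()
      .HandleResult(r => (int)r.StatusCode >= 500 || (int)r.StatusCode == 429),
    MaxRetryAttempts = 3,                        // stop rule
    BackoffType = DelayBackoffType.Exponential,  // backoff
    Delay = TimeSpan.FromSeconds(1),
    UseJitter = true,                            // jitter
    MaxDelay = TimeSpan.FromSeconds(30)          // cap (newer 8.x releases)
  })
  .Build();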

What should I do if I am in a retry storm right now?

Treat it like load shedding: reduce concurrency, reduce retry count, shorten budgets, and open the circuit earlier.

You can always loosen the policy after the dependency recovers. You cannot recover a thread pool that is pinned by endless waiting.




Coming soon

If you want more runbooks like this (plus logging schemas and decision trees), that is what Axiom is becoming.

Join to get notified as we ship new operational assets for reliability.



Key takeaways

  • Naive retries cause retry storms. Retry(3) without backoff amplifies load and cascades failures.
  • Classify what you retry. Transient errors benefit from retries; permanent errors do not.
  • Backoff + jitter + caps + stop rules are not optional.
  • Circuit breakers prevent cascades by forcing fast failure.
  • Observable retries are debuggable retries. Log every attempt and alert on retry rate.

If you want a fast diagnosis of a retry storm, see .NET Production Rescue or contact me and include one example retry log event plus the endpoint and status distribution.
