The real cost of retry logic: when “resilience” makes outages worse

Jan 26, 2026 · 10 min read

Category: .NET

Retry storms don’t look like a bug — they look like good engineering until production melts. Here’s how to bound retries with stop rules and proof.

Download available. Jump to the shipped asset.

Retry storms are what happens when a dependency is slow and your "resilience" multiplies the load. You get timeouts, queue pileups, and paging, even though nothing is technically down.

The deliverable is a production playbook: a stop / retry / escalate decision framework, safe defaults for budgets, and the exact log fields that turn retries into something you can measure and control.

This failure mode costs you twice. First you lose availability. Then you lose time: engineers arguing in the incident channel about whether retries helped, while the system keeps generating more work.

Mini incident pattern:

  • 14:03 a vendor API slows down (still returning 200s, just late)
  • 14:06 latency climbs, retry policies trigger, in-flight requests rise
  • 14:10 success rate drops, queue depth grows, thread pool and connection pools saturate
  • 14:14 scale out increases pressure on the same dependency, everything gets worse
  • 14:22 retries are disabled for the hot path, the system stabilizes

Rescuing a .NET service in production? Start at the .NET Production Rescue hub and the .NET category.


What is actually happening (the mechanism)

Retries do not add reliability by default. They add work.

In a healthy system that extra work is invisible. In a degraded system it is fuel. Each retry increases in-flight concurrency, and in-flight concurrency is what kills you: thread pool pressure, socket pressure, queue pressure, and a feedback loop that makes the dependency even slower.

The multiplier usually looks like this:

  • attempts per call (Polly)
  • fan-out per request (one API call triggers N downstream calls)
  • redelivery per message (queue retries)
  • concurrent callers (all instances and all users)

When the dependency is slow, each layer piles on. You do not get one slow call. You get many slow calls that overlap.
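
A quick back-of-the-envelope makes the amplification concrete. A minimal sketch with assumed numbers; nothing here comes from a real policy:

csharp
// Assumed numbers, purely illustrative: how the layers multiply per user action.
int attemptsPerCall = 3;     // 1 initial try + 2 retries from the HTTP policy
int fanOutPerRequest = 4;    // one API call triggers 4 downstream calls
int redeliveries = 5;        // the queue redelivers the message up to 5 times

int attemptsPerUserAction = attemptsPerCall * fanOutPerRequest * redeliveries; // 60
// And that is per user action, multiplied again by every instance and every user retrying.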

What it looks like on dashboards:

  • request rate rises while success rate falls
  • p95 and p99 latency explode
  • timeouts rise even if the dependency still returns 200 sometimes
  • queue depth climbs, then redelivery multiplies attempts

If scaling out did not help (or made it worse), treat retry amplification as a primary suspect.


How teams make it worse without realizing

Most retry storms are created by reasonable people making reasonable local decisions. The mistake is optimizing for a single call outcome instead of system behavior under backpressure.

The common pattern: the first incident triggers "add retries". The next incident triggers "add more retries". By the third incident, retries are everywhere and nobody can answer how many attempts a single user action generated.

Things that quietly turn retries into an outage multiplier:

  • retries without per-attempt timeouts (stacked waits)
  • no total time budget (unbounded work)
  • retrying non-transient failures (client errors, auth errors, conflicts)
  • ignoring throttling semantics (429 without honoring Retry-After)
  • layered retries (HTTP policy plus queue redelivery plus user refresh)

None of these look dramatic in a code review. All of them are dramatic at 2am.
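
The first two items are worth seeing as arithmetic. A minimal sketch with assumed numbers, using HttpClient's default 100-second timeout as the only cap:

csharp
using System;

// "Stacked waits": no per-attempt timeout, so each attempt can hang for the full
// HttpClient.Timeout default of 100 seconds. Numbers are assumptions for illustration.
var perAttemptWait = TimeSpan.FromSeconds(100); // HttpClient.Timeout default
var attempts = 3;                               // 1 try + 2 retries
var backoff = TimeSpan.FromSeconds(2 + 4);      // assumed backoff between attempts

var worstCase = perAttemptWait * attempts + backoff; // ~5 minutes holding a thread and a socket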


Diagnosis ladder (fast checks first)

Retries are only useful if you can prove their behavior. Start by answering one question: how many attempts did we generate per user request or per message during the incident window?

This ladder is designed to be fast, safe, and non-controversial.

  • Graph attempts per request (or infer it from logs) during the incident window
  • Check queue redelivery counts and retry delays (broker and consumer)
  • Identify layered retries (Polly, SDK retries, gateway retries, queue redelivery)
  • Check whether per-attempt timeouts exist (retries without timeouts are stacked waits)
  • Validate error classification (are 400/401/403/409 being retried?)
  • Confirm retry behavior honors throttling (429 and Retry-After)

If you cannot answer "how many attempts did we do per user request", you are not ready to tune policies. You are ready to add observability.
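
If attempt logs already exist, a quick aggregation answers that question during the incident. A minimal sketch, assuming the entries have been parsed into a small record; the names are illustrative and the fields mirror the per-call log shape shown later in the post:

csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative record; fields mirror the per-call log shape shown later in the post.
record AttemptLog(string CorrelationId, string Dependency, int Attempt);

static class RetryForensics
{
    public static void AttemptsPerRequest(IEnumerable<AttemptLog> logs)
    {
        // Highest attempt number seen per request + dependency = attempts generated for it.
        var perRequest = logs
            .GroupBy(l => (l.CorrelationId, l.Dependency))
            .Select(g => g.Max(l => l.Attempt))
            .ToList();

        Console.WriteLine($"requests: {perRequest.Count}");
        Console.WriteLine($"avg attempts per request: {perRequest.Average():F1}");
        Console.WriteLine($"max attempts per request: {perRequest.Max()}");
    }
}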


Containment during an incident

During the incident window, containment wins. Do the smallest change that reduces pressure on the dependency, then verify it on dashboards.

The safest containment changes reduce concurrency and remove retry multipliers. They are reversible and do not require a system redesign.

  • Reduce concurrency toward the dependency (bulkhead, semaphore, queue consumer limit)
  • Prefer fail fast over "try harder" when the dependency is clearly unhealthy
  • Respect server backpressure (429 + Retry-After)
  • Remove duplicate retries in the hottest path if you can do it safely
  • Add a short-lived circuit breaker or stop rule to prevent endless pressure

If you need a blunt instrument, turn off in-process retries for the dependency in the hottest path and rely on one slower mechanism (often controlled queue redelivery) until the dependency recovers.
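
One concrete way to reduce concurrency toward the dependency is a bulkhead in front of the client. A minimal sketch using Polly's bulkhead policy; the limits are assumptions to tune against what the dependency can actually absorb, not recommendations:

csharp
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

// Cap in-flight calls to the sick dependency; beyond a small waiting room, fail fast.
var bulkhead = Policy.BulkheadAsync<HttpResponseMessage>(
    maxParallelization: 20,   // assumed: at most 20 concurrent calls to the dependency
    maxQueuingActions: 10,    // assumed: small queue, then reject instead of piling up
    onBulkheadRejectedAsync: ctx =>
    {
        // log: correlationId, dependency, decision = "rejected-by-bulkhead"
        return Task.CompletedTask;
    });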

Verification signals that you picked the right lever:

  • in-flight requests decrease
  • queue depth stops growing
  • dependency latency stabilizes
  • success rate improves without increasing request volume

Decision framework: stop, retry, escalate

Bounded retries are not just a delay loop. They are a policy that makes a decision, writes evidence, and triggers an operator action when the dependency is sick.

In practice you need three outcomes:

  • stop: do not retry, return a useful error
  • retry: bounded attempts and bounded time, with jitter and backpressure
  • escalate: stop retrying and emit an operator payload a human can act on

Stop (no retry)

Stop immediately when the failure is not transient or a duplicate attempt can cause harm.

Common stop signals:

  • 400 validation
  • 401 authentication
  • 403 permission
  • 409 and 412 conflicts
  • many 404s when they mean "does not exist"

Retry (bounded)

Retry only when waiting has a reasonable chance of helping.

Common retry signals:

  • 429 with Retry-After (honoring the header is required)
  • some 5xx
  • transient network failures
  • timeouts with a sane per-attempt timeout and a sane total budget

Escalate (operator payload)

Escalate when you have enough evidence to act and continuing to retry adds pressure.

Common escalation triggers:

  • max attempts reached
  • total time budget exhausted
  • sustained 429 throttling
  • repeating failures across many calls
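
As a sketch, the three outcomes above can be written down as a single classification function. The names (RetryDecision, Classify) are illustrative, not from any library:

csharp
using System;
using System.Net;

// Illustrative names, not from any library.
enum RetryDecision { Stop, Retry, Escalate }

static class RetryRules
{
    public static RetryDecision Classify(HttpStatusCode status, int attempt, TimeSpan elapsed,
                                         int maxAttempts, TimeSpan totalBudget)
    {
        // Budgets first: once they are spent, more retries only add pressure.
        if (attempt >= maxAttempts || elapsed >= totalBudget)
            return RetryDecision.Escalate;

        return (int)status switch
        {
            400 or 401 or 403 or 409 or 412 => RetryDecision.Stop, // non-transient: waiting will not help
            429 => RetryDecision.Retry,                            // bounded, and honor Retry-After
            >= 500 => RetryDecision.Retry,                         // bounded, with backoff + jitter
            _ => RetryDecision.Stop                                // default to stop, not "try harder"
        };
    }
}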

Bounded retries in .NET (where policies belong)

In .NET services, retry behavior usually ends up split across layers: an SDK might retry, a Polly policy might retry, a queue might redeliver, and the caller might refresh.

Pick one place to own retries for a dependency and make it the source of truth. Everything else should be zero retry or minimal safety retry.

Use budgets, not vibes:

  • per-attempt timeout: caps one call
  • total time budget: caps the entire retry sequence
  • max attempts: caps how many tries

That policy must also be observable. If you cannot see attempts, delays, and stop decisions in logs, you will argue during the next incident.

Polly is a common place to centralize this in legacy .NET because it is additive. You can wrap one client, roll it out, and roll it back.


A bounded Polly policy (example)

This is a baseline that enforces the constraints above. It is intentionally boring.

Start by writing the policy in words. Then implement it. The policy should say:

  • which failures are retryable
  • max attempts
  • per-attempt timeout
  • total time budget
  • what gets logged on each attempt

csharp
// Baseline policy. Keep the code simple and make the policy observable.
// Wire this via IHttpClientFactory or your client wrapper (sketch below).
using System;
using System.Net.Http;
using Polly;

// Bounded retry: 1 initial try + 2 retries, linear backoff with jitter, transient failures only.
var retry = Policy
  .Handle<HttpRequestException>()
  .OrResult<HttpResponseMessage>(r => (int)r.StatusCode >= 500)
  .WaitAndRetryAsync(
    retryCount: 2,
    sleepDurationProvider: attempt =>
      TimeSpan.FromMilliseconds(200 * attempt) + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 150)),
    onRetry: (outcome, delay, attempt, ctx) =>
    {
      // log: correlationId, dependency, attempt, delay, timeoutMs, outcome
    });

// Per-attempt timeout (inner) wrapped by the bounded retry (outer).
var perAttemptTimeout = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromMilliseconds(1500));
var resilient = retry.WrapAsync(perAttemptTimeout);
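
Wiring this through IHttpClientFactory keeps the policy in one place, and a per-request cancellation token is one way to enforce the total budget. A hedged sketch, assuming the Microsoft.Extensions.Http.Polly package and the resilient policy built above; the client name, URL, and numbers are illustrative:

csharp
// Assumes Microsoft.Extensions.Http.Polly, an IServiceCollection named services,
// an injected IHttpClientFactory named httpClientFactory, and the resilient policy above.
services.AddHttpClient("VendorApi", c => c.BaseAddress = new Uri("https://vendor.example"))
        .AddPolicyHandler(resilient); // bounded retry + per-attempt timeout, owned in one place

// Call site: the total time budget caps the entire retry sequence, backoff sleeps included.
using var budget = new CancellationTokenSource(TimeSpan.FromMilliseconds(4000));
var client = httpClientFactory.CreateClient("VendorApi");
using var response = await client.GetAsync("/v1/orders/123", budget.Token);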

The missing piece is always the same: prove it in logs.

Here's a minimal per-call log shape that makes retry storms visible:

json
{
  "ts": "2026-01-21T14:06:11.902Z",
  "level": "warning",
  "correlationId": "c-1f2aa91d3a9c4a0b",
  "dependency": "http:VendorApi",
  "attempt": 2,
  "delayMs": 350,
  "timeoutMs": 1500,
  "status": 503,
  "decision": "retry",
  "totalBudgetMs": 4000
}
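
Emitting that shape is straightforward from the onRetry hook with structured logging. A minimal sketch, assuming an ILogger plus values captured from the retry context (correlationId, attempt, delay, outcome); the property names mirror the JSON above:

csharp
// Inside onRetry: one structured event per attempt, queryable by field.
// logger, correlationId, attempt, delay, and outcome are assumed to be in scope.
logger.LogWarning(
    "dependency retry {Dependency} {CorrelationId} attempt={Attempt} delayMs={DelayMs} " +
    "timeoutMs={TimeoutMs} status={Status} decision={Decision} totalBudgetMs={TotalBudgetMs}",
    "http:VendorApi",
    correlationId,
    attempt,
    (int)delay.TotalMilliseconds,
    1500,
    (int?)outcome.Result?.StatusCode,   // null when the failure was an exception, not a response
    "retry",
    4000);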

What to log (so retries stop being a debate)

When retries go wrong, teams argue because nobody has evidence. The way out is logs and metrics that turn retry behavior into something you can measure during the incident window.

You need to answer two questions quickly:

  1. How many attempts did we generate per user request/message?
  2. Did retries respect a total time budget, or did we keep trying indefinitely?

Minimum fields per dependency attempt:

  • correlationId
  • dependency
  • attempt
  • delayMs
  • timeoutMs
  • totalBudgetMs
  • decision (stop | retry | escalate)
  • reason (e.g. 429-retry-after, 5xx, timeout, non-transient-4xx)

Then add one aggregate metric that makes storms obvious: attempts per successful request (or attempts per message). If that number spikes while success rate drops, you're amplifying.
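
A minimal sketch of that aggregate using System.Diagnostics.Metrics; the meter and instrument names are illustrative. Charting attempts divided by successes per dependency is what makes a storm jump out:

csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

static class DependencyMetrics
{
    static readonly Meter AppMeter = new("MyApp.Dependencies");
    public static readonly Counter<long> Attempts  = AppMeter.CreateCounter<long>("dependency.attempts");
    public static readonly Counter<long> Successes = AppMeter.CreateCounter<long>("dependency.successes");
}

// On every attempt toward the dependency:
DependencyMetrics.Attempts.Add(1, new KeyValuePair<string, object?>("dependency", "VendorApi"));
// On each request that ultimately succeeds:
DependencyMetrics.Successes.Add(1, new KeyValuePair<string, object?>("dependency", "VendorApi"));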

Retries can smooth tiny transient blips. Under sustained backpressure they increase latency and load. Optimize for predictable degradation, not "try harder until it melts".

If you can't aggregate these logs during an incident, you'll always argue about retries instead of fixing them.


Fix and prevention plan (concrete steps)

After containment, fix the root cause in a bounded way. The goal is not "add more retries". The goal is "make failure predictable".

Do this in order:

  • Map every retry layer (SDK, Polly, queue, caller)
  • Delete redundant layers (keep one place where retries are decided)
  • Add per-attempt timeouts and a total time budget
  • Implement stop rules (non-transient errors do not retry)
  • Instrument and prove it (attempts per request, budget usage, escalation rate)

Rollout safety:

  • apply the policy to one dependency first
  • canary one service instance or one percentage of traffic
  • keep a fast rollback switch (see the sketch after this list)
  • verify: in-flight drops, queue stabilizes, success rate improves
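
For the rollback switch, a plain configuration flag is often enough. A sketch, assuming the resilient policy from earlier; the key name is an assumption, not a convention:

csharp
// "Retries:VendorApi:Enabled" is an illustrative key. Read the flag somewhere it can
// take effect quickly (per request, or via reloadable options), not only at startup.
var retriesEnabled = configuration.GetValue("Retries:VendorApi:Enabled", true);

IAsyncPolicy<HttpResponseMessage> active = retriesEnabled
    ? resilient                                  // bounded retry + per-attempt timeout
    : Policy.NoOpAsync<HttpResponseMessage>();   // pass-through: no in-process retries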

Shipped asset

Free download.

Stop / retry / escalate decision framework

Production template for defining bounded retries with clear stop rules. Prevents retry storms by classifying errors up front.

What you get (3 files):

  • stop-retry-escalate-decision-tree.md - Decision framework and operator actions
  • retry-budget-template.md - Per-attempt timeout, total budget, and caps
  • layered-retry-audit-checklist.md - Find and remove duplicate retry layers

Quick reference (decision logic):

code
If error is 400/401/403/409 then STOP (no retry)
If error is 429 with Retry-After then RETRY (bounded) and respect backpressure
If error is 5xx or timeout then RETRY (bounded), then ESCALATE with an operator payload
If total budget is exceeded then STOP and ESCALATE
If the same error repeats 3+ times then STOP and investigate the root cause



FAQ

How many retry attempts are enough?

Enough to ride out small transient failures, not enough to hide outages. In many production systems, 2-3 total attempts with exponential backoff + jitter is a safe baseline.

If you need more than that often, treat it as a signal. Timeouts are wrong, the dependency is unhealthy, or you are missing caching and concurrency limits.

Is it only safe to retry GET requests?

You can retry any method if the operation is idempotent. The real question is whether a duplicate attempt causes harm. If yes, add an idempotency key and make the server treat repeats as safe no-ops.

If you cannot guarantee idempotency, be conservative. Stop early, or gate side effects behind a workflow that can tolerate duplicates.

What if the dependency is slow instead of failing?

That is where retries do the most damage. A slow dependency creates long in-flight requests; retries create more in-flight requests.

The fix is budgets and bulkheads. Time out the slow call within a known budget, cap concurrency, and only retry when it is actually likely to improve.

When do I need a circuit breaker instead of retries?

Retries handle transient blips. Circuit breakers handle sustained failure by stopping traffic and allowing recovery.

If a dependency is consistently failing, retrying just adds load. A breaker (plus fallback or degrade behavior) prevents your app from becoming a denial-of-service client.

What happens when HTTP retries meet queue redelivery?

They multiply. If an HTTP call makes 3 attempts per delivery and the queue redelivers the message 5 times, you can accidentally create 15 attempts per message.

Make one layer the source of truth. Often that means: queue redelivery is limited and slow, and the in-process HTTP policy is very conservative.

How do I test a retry policy?

Test against controlled failure modes: inject 429s with Retry-After, inject timeouts, and inject sustained 503s. Then validate invariants: attempt caps, total budget cap, and logged decisions.

You're done when the system degrades predictably instead of "trying harder" until it melts.
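
A minimal sketch of the sustained-503 case, assuming the resilient policy from earlier (1 try + 2 retries); the stub handler and assertion style are illustrative, so adapt them to your test framework:

csharp
using System;
using System.Diagnostics;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;

// Stub transport that always returns 503 and counts how many calls reached it.
sealed class Always503Handler : HttpMessageHandler
{
    public int Calls;
    protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken ct)
    {
        Calls++;
        return Task.FromResult(new HttpResponseMessage(HttpStatusCode.ServiceUnavailable));
    }
}

static class RetryPolicyChecks
{
    // Invariant under sustained 503s: exactly 1 initial attempt + 2 retries, never more.
    public static async Task SustainedFailureRespectsAttemptCap(IAsyncPolicy<HttpResponseMessage> resilient)
    {
        var handler = new Always503Handler();
        var client = new HttpClient(handler) { BaseAddress = new Uri("https://vendor.example") };

        await resilient.ExecuteAsync(ct => client.GetAsync("/orders/123", ct), CancellationToken.None);

        Debug.Assert(handler.Calls == 3);
    }
}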


Coming soon

If you want more of these operator-grade templates (decision trees, log schemas, runbooks), that is what Axiom is becoming.

Join to get notified as we ship assets you can drop into incident response, with real files and real defaults, not generic advice.



Key takeaways

  • Retries are a multiplier. Under backpressure they can create the outage.
  • The safe default is bounded attempts + bounded time + classified failures + proof in logs.
  • Layered retries are the most common hidden amplifier. Map layers and delete redundancy.

If you want a fast diagnosis of your current retry behavior, see .NET Production Rescue or contact me and include one incident log with correlation IDs.
