The real cost of retry logic: when “resilience” makes outages worse

Jan 26, 2026 · 10 min read

Category: .NET

Retry storms don’t look like a bug — they look like good engineering until production melts. Here’s how to bound retries with stop rules and proof.

Download available. Jump to the shipped asset.

Retry storms are what happens when a dependency is slow and your "resilience" multiplies the load. You get timeouts, queue pileups, and paging, even though nothing is technically down.

The deliverable is a production playbook: a stop / retry / escalate decision framework, safe defaults for budgets, and the exact log fields that turn retries into something you can measure and control.

This failure mode costs you twice. First you lose availability. Then you lose time: engineers arguing in the incident channel about whether retries helped, while the system keeps generating more work.

Mini incident pattern:

  • 14:03 a vendor API slows down (still returning 200s, just late)
  • 14:06 latency climbs, retry policies trigger, in-flight requests rise
  • 14:10 success rate drops, queue depth grows, thread pool and connection pools saturate
  • 14:14 scale out increases pressure on the same dependency, everything gets worse
  • 14:22 retries are disabled for the hot path, the system stabilizes

Rescuing a .NET service in production? Start at the .NET Production Rescue hub and the .NET category.


What is actually happening (the mechanism)

Retries do not add reliability by default. They add work.

In a healthy system that extra work is invisible. In a degraded system it is fuel. Each retry increases in-flight concurrency, and in-flight concurrency is what kills you: thread pool pressure, socket pressure, queue pressure, and a feedback loop that makes the dependency even slower.

The multiplier usually looks like this:

  • attempts per call (Polly)
  • fan-out per request (one API call triggers N downstream calls)
  • redelivery per message (queue retries)
  • concurrent callers (all instances and all users)

When the dependency is slow, each layer piles on. You do not get one slow call. You get many slow calls that overlap.
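
A quick back-of-the-envelope makes the amplification concrete. A minimal sketch with assumed numbers; nothing here comes from a real policy:

csharp
// Assumed numbers, purely illustrative: how the layers multiply per user action.
int attemptsPerCall = 3;     // 1 initial try + 2 retries from the HTTP policy
int fanOutPerRequest = 4;    // one API call triggers 4 downstream calls
int redeliveries = 5;        // the queue redelivers the message up to 5 times

int attemptsPerUserAction = attemptsPerCall * fanOutPerRequest * redeliveries; // 60
// And that is per user action, multiplied again by every instance and every user retrying.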

What it looks like on dashboards:

  • request rate rises while success rate falls
  • p95 and p99 latency explode
  • timeouts rise even if the dependency still returns 200 sometimes
  • queue depth climbs, then redelivery multiplies attempts

If scaling out did not help (or made it worse), treat retry amplification as a primary suspect.


How teams make it worse without realizing

Most retry storms are created by reasonable people making reasonable local decisions. The mistake is optimizing for a single call outcome instead of system behavior under backpressure.

The common pattern: the first incident triggers "add retries". The next incident triggers "add more retries". By the third incident, retries are everywhere and nobody can answer how many attempts a single user action generated.

Things that quietly turn retries into an outage multiplier:

  • retries without per-attempt timeouts (stacked waits)
  • no total time budget (unbounded work)
  • retrying non-transient failures (client errors, auth errors, conflicts)
  • ignoring throttling semantics (429 without honoring Retry-After)
  • layered retries (HTTP policy plus queue redelivery plus user refresh)

None of these look dramatic in a code review. All of them are dramatic at 2am.
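
The first two items are worth seeing as arithmetic. A minimal sketch with assumed numbers, using HttpClient's default 100-second timeout as the only cap:

csharp
using System;

// "Stacked waits": no per-attempt timeout, so each attempt can hang for the full
// HttpClient.Timeout default of 100 seconds. Numbers are assumptions for illustration.
var perAttemptWait = TimeSpan.FromSeconds(100); // HttpClient.Timeout default
var attempts = 3;                               // 1 try + 2 retries
var backoff = TimeSpan.FromSeconds(2 + 4);      // assumed backoff between attempts

var worstCase = perAttemptWait * attempts + backoff; // ~5 minutes holding a thread and a socket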


Diagnosis ladder (fast checks first)

Retries are only useful if you can prove their behavior. Start by answering one question: how many attempts did we generate per user request or per message during the incident window?

This ladder is designed to be fast, safe, and non-controversial.

  • Graph attempts per request (or infer it from logs) during the incident window
  • Check queue redelivery counts and retry delays (broker and consumer)
  • Identify layered retries (Polly, SDK retries, gateway retries, queue redelivery)
  • Check whether per-attempt timeouts exist (retries without timeouts are stacked waits)
  • Validate error classification (are 400/401/403/409 being retried?)
  • Confirm retry behavior honors throttling (429 and Retry-After)

If you cannot answer "how many attempts did we do per user request", you are not ready to tune policies. You are ready to add observability.
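
If attempt logs already exist, a quick aggregation answers that question during the incident. A minimal sketch, assuming the entries have been parsed into a small record; the names are illustrative and the fields mirror the per-call log shape shown later in the post:

csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative record; fields mirror the per-call log shape shown later in the post.
record AttemptLog(string CorrelationId, string Dependency, int Attempt);

static class RetryForensics
{
    public static void AttemptsPerRequest(IEnumerable<AttemptLog> logs)
    {
        // Highest attempt number seen per request + dependency = attempts generated for it.
        var perRequest = logs
            .GroupBy(l => (l.CorrelationId, l.Dependency))
            .Select(g => g.Max(l => l.Attempt))
            .ToList();

        Console.WriteLine($"requests: {perRequest.Count}");
        Console.WriteLine($"avg attempts per request: {perRequest.Average():F1}");
        Console.WriteLine($"max attempts per request: {perRequest.Max()}");
    }
}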


Containment during an incident

During the incident window, containment wins. Do the smallest change that reduces pressure on the dependency, then verify it on dashboards.

The safest containment changes reduce concurrency and remove retry multipliers. They are reversible and do not require a system redesign.

  • Reduce concurrency toward the dependency (bulkhead, semaphore, queue consumer limit)
  • Prefer fail fast over "try harder" when the dependency is clearly unhealthy
  • Respect server backpressure (429 + Retry-After)
  • Remove duplicate retries in the hottest path if you can do it safely
  • Add a short-lived circuit breaker or stop rule to prevent endless pressure

If you need a blunt instrument, turn off in-process retries for the dependency in the hottest path and rely on one slower mechanism (often controlled queue redelivery) until the dependency recovers.
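
One concrete way to reduce concurrency toward the dependency is a bulkhead in front of the client. A minimal sketch using Polly's bulkhead policy; the limits are assumptions to tune against what the dependency can actually absorb, not recommendations:

csharp
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

// Cap in-flight calls to the sick dependency; beyond a small waiting room, fail fast.
var bulkhead = Policy.BulkheadAsync<HttpResponseMessage>(
    maxParallelization: 20,   // assumed: at most 20 concurrent calls to the dependency
    maxQueuingActions: 10,    // assumed: small queue, then reject instead of piling up
    onBulkheadRejectedAsync: ctx =>
    {
        // log: correlationId, dependency, decision = "rejected-by-bulkhead"
        return Task.CompletedTask;
    });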

Verification signals that you picked the right lever:

  • in-flight requests decrease
  • queue depth stops growing
  • dependency latency stabilizes
  • success rate improves without increasing request volume

Decision framework: stop, retry, escalate

Bounded retries are not just a delay loop. They are a policy that makes a decision, writes evidence, and triggers an operator action when the dependency is sick.

In practice you need three outcomes:

  • stop: do not retry, return a useful error
  • retry: bounded attempts and bounded time, with jitter and backpressure
  • escalate: stop retrying and emit an operator payload a human can act on

Stop (no retry)

Stop immediately when the failure is not transient or a duplicate attempt can cause harm.

Common stop signals:

  • 400 validation
  • 401 authentication
  • 403 permission
  • 409 and 412 conflicts
  • many 404s when they mean "does not exist"

Retry (bounded)

Retry only when waiting has a reasonable chance of helping.

Common retry signals:

  • 429 with Retry-After (honoring the header is required)
  • some 5xx
  • transient network failures
  • timeouts with a sane per-attempt timeout and a sane total budget

Escalate (operator payload)

Escalate when you have enough evidence to act and continuing to retry adds pressure.

Common escalation triggers:

  • max attempts reached
  • total time budget exhausted
  • sustained 429 throttling
  • repeating failures across many calls
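
As a sketch, the three outcomes above can be written down as a single classification function. The names (RetryDecision, Classify) are illustrative, not from any library:

csharp
using System;
using System.Net;

// Illustrative names, not from any library.
enum RetryDecision { Stop, Retry, Escalate }

static class RetryRules
{
    public static RetryDecision Classify(HttpStatusCode status, int attempt, TimeSpan elapsed,
                                         int maxAttempts, TimeSpan totalBudget)
    {
        // Budgets first: once they are spent, more retries only add pressure.
        if (attempt >= maxAttempts || elapsed >= totalBudget)
            return RetryDecision.Escalate;

        return (int)status switch
        {
            400 or 401 or 403 or 409 or 412 => RetryDecision.Stop, // non-transient: waiting will not help
            429 => RetryDecision.Retry,                            // bounded, and honor Retry-After
            >= 500 => RetryDecision.Retry,                         // bounded, with backoff + jitter
            _ => RetryDecision.Stop                                // default to stop, not "try harder"
        };
    }
}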

Bounded retries in .NET (where policies belong)

In .NET services, retry behavior usually ends up split across layers: an SDK might retry, a Polly policy might retry, a queue might redeliver, and the caller might refresh.

Pick one place to own retries for a dependency and make it the source of truth. Everything else should be zero retry or minimal safety retry.

Use budgets, not vibes:

  • per-attempt timeout: caps one call
  • total time budget: caps the entire retry sequence
  • max attempts: caps how many tries

That policy must also be observable. If you cannot see attempts, delays, and stop decisions in logs, you will argue during the next incident.

Polly is a common place to centralize this in legacy .NET because it is additive. You can wrap one client, roll it out, and roll it back.


A bounded Polly policy (example)

This is a baseline that enforces the constraints above. It is intentionally boring.

Start by writing the policy in words. Then implement it. The policy should say:

  • which failures are retryable
  • max attempts
  • per-attempt timeout
  • total time budget
  • what gets logged on each attempt

csharp
// Baseline policy. Keep the code simple and make the policy observable.
// Wire this via IHttpClientFactory or your client wrapper (sketch below).
using System;
using System.Net.Http;
using Polly;

// Bounded retry: 1 initial try + 2 retries, linear backoff with jitter, transient failures only.
var retry = Policy
  .Handle<HttpRequestException>()
  .OrResult<HttpResponseMessage>(r => (int)r.StatusCode >= 500)
  .WaitAndRetryAsync(
    retryCount: 2,
    sleepDurationProvider: attempt =>
      TimeSpan.FromMilliseconds(200 * attempt) + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 150)),
    onRetry: (outcome, delay, attempt, ctx) =>
    {
      // log: correlationId, dependency, attempt, delay, timeoutMs, outcome
    });

// Per-attempt timeout (inner) wrapped by the bounded retry (outer).
var perAttemptTimeout = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromMilliseconds(1500));
var resilient = retry.WrapAsync(perAttemptTimeout);
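
Wiring this through IHttpClientFactory keeps the policy in one place, and a per-request cancellation token is one way to enforce the total budget. A hedged sketch, assuming the Microsoft.Extensions.Http.Polly package and the resilient policy built above; the client name, URL, and numbers are illustrative:

csharp
// Assumes Microsoft.Extensions.Http.Polly, an IServiceCollection named services,
// an injected IHttpClientFactory named httpClientFactory, and the resilient policy above.
services.AddHttpClient("VendorApi", c => c.BaseAddress = new Uri("https://vendor.example"))
        .AddPolicyHandler(resilient); // bounded retry + per-attempt timeout, owned in one place

// Call site: the total time budget caps the entire retry sequence, backoff sleeps included.
using var budget = new CancellationTokenSource(TimeSpan.FromMilliseconds(4000));
var client = httpClientFactory.CreateClient("VendorApi");
using var response = await client.GetAsync("/v1/orders/123", budget.Token);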

The missing piece is always the same: prove it in logs.

Here's a minimal per-call log shape that makes retry storms visible:

json
{
  "ts": "2026-01-21T14:06:11.902Z",
  "level": "warning",
  "correlationId": "c-1f2aa91d3a9c4a0b",
  "dependency": "http:VendorApi",
  "attempt": 2,
  "delayMs": 350,
  "timeoutMs": 1500,
  "status": 503,
  "decision": "retry",
  "totalBudgetMs": 4000
}
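
Emitting that shape is straightforward from the onRetry hook with structured logging. A minimal sketch, assuming an ILogger plus values captured from the retry context (correlationId, attempt, delay, outcome); the property names mirror the JSON above:

csharp
// Inside onRetry: one structured event per attempt, queryable by field.
// logger, correlationId, attempt, delay, and outcome are assumed to be in scope.
logger.LogWarning(
    "dependency retry {Dependency} {CorrelationId} attempt={Attempt} delayMs={DelayMs} " +
    "timeoutMs={TimeoutMs} status={Status} decision={Decision} totalBudgetMs={TotalBudgetMs}",
    "http:VendorApi",
    correlationId,
    attempt,
    (int)delay.TotalMilliseconds,
    1500,
    (int?)outcome.Result?.StatusCode,   // null when the failure was an exception, not a response
    "retry",
    4000);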

What to log (so retries stop being a debate)

When retries go wrong, teams argue because nobody has evidence. The way out is logs and metrics that turn retry behavior into something you can measure during the incident window.

You need to answer two questions quickly:

  1. How many attempts did we generate per user request/message?
  2. Did retries respect a total time budget, or did we keep trying indefinitely?

Minimum fields per dependency attempt:

  • correlationId
  • dependency
  • attempt
  • delayMs
  • timeoutMs
  • totalBudgetMs
  • decision (stop | retry | escalate)
  • reason (e.g. 429-retry-after, 5xx, timeout, non-transient-4xx)

Then add one aggregate metric that makes storms obvious: attempts per successful request (or attempts per message). If that number spikes while success rate drops, you're amplifying.
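
A minimal sketch of that aggregate using System.Diagnostics.Metrics; the meter and instrument names are illustrative. Charting attempts divided by successes per dependency is what makes a storm jump out:

csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

static class DependencyMetrics
{
    static readonly Meter AppMeter = new("MyApp.Dependencies");
    public static readonly Counter<long> Attempts  = AppMeter.CreateCounter<long>("dependency.attempts");
    public static readonly Counter<long> Successes = AppMeter.CreateCounter<long>("dependency.successes");
}

// On every attempt toward the dependency:
DependencyMetrics.Attempts.Add(1, new KeyValuePair<string, object?>("dependency", "VendorApi"));
// On each request that ultimately succeeds:
DependencyMetrics.Successes.Add(1, new KeyValuePair<string, object?>("dependency", "VendorApi"));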

Retries can smooth tiny transient blips. Under sustained backpressure they increase latency and load. Optimize for predictable degradation, not "try harder until it melts".

If you can't aggregate these logs during an incident, you'll always argue about retries instead of fixing them.


Fix and prevention plan (concrete steps)

After containment, fix the root cause in a bounded way. The goal is not "add more retries". The goal is "make failure predictable".

Do this in order:

  • Map every retry layer (SDK, Polly, queue, caller)
  • Delete redundant layers (keep one place where retries are decided)
  • Add per-attempt timeouts and a total time budget
  • Implement stop rules (non-transient errors do not retry)
  • Instrument and prove it (attempts per request, budget usage, escalation rate)

Rollout safety:

  • apply the policy to one dependency first
  • canary one service instance or one percentage of traffic
  • keep a fast rollback switch (see the sketch after this list)
  • verify: in-flight drops, queue stabilizes, success rate improves
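
For the rollback switch, a plain configuration flag is often enough. A sketch, assuming the resilient policy from earlier; the key name is an assumption, not a convention:

csharp
// "Retries:VendorApi:Enabled" is an illustrative key. Read the flag somewhere it can
// take effect quickly (per request, or via reloadable options), not only at startup.
var retriesEnabled = configuration.GetValue("Retries:VendorApi:Enabled", true);

IAsyncPolicy<HttpResponseMessage> active = retriesEnabled
    ? resilient                                  // bounded retry + per-attempt timeout
    : Policy.NoOpAsync<HttpResponseMessage>();   // pass-through: no in-process retries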

Shipped asset

Free download.

Stop / retry / escalate decision framework

Production template for defining bounded retries with clear stop rules. Prevents retry storms by classifying errors up front.

What you get (3 files):

  • stop-retry-escalate-decision-tree.md - Decision framework and operator actions
  • retry-budget-template.md - Per-attempt timeout, total budget, and caps
  • layered-retry-audit-checklist.md - Find and remove duplicate retry layers

Quick reference (decision logic):

code
If error is 400/401/403/409 then STOP (no retry)
If error is 429 with Retry-After then RETRY (bounded) and respect backpressure
If error is 5xx or timeout then RETRY (bounded), then ESCALATE with an operator payload
If total budget is exceeded then STOP and ESCALATE
If the same error repeats 3+ times then STOP and investigate the root cause



FAQ

How many retry attempts are enough?

Enough to ride out small transient failures, not enough to hide outages. In many production systems, 2-3 total attempts with exponential backoff + jitter is a safe baseline.

If you need more than that often, treat it as a signal. Timeouts are wrong, the dependency is unhealthy, or you are missing caching and concurrency limits.

Is it only safe to retry GET requests?

You can retry any method if the operation is idempotent. The real question is whether a duplicate attempt causes harm. If yes, add an idempotency key and make the server treat repeats as safe no-ops.

If you cannot guarantee idempotency, be conservative. Stop early, or gate side effects behind a workflow that can tolerate duplicates.

What if the dependency is slow instead of failing?

That is where retries do the most damage. A slow dependency creates long in-flight requests; retries create more in-flight requests.

The fix is budgets and bulkheads. Time out the slow call within a known budget, cap concurrency, and only retry when it is actually likely to improve.

When do I need a circuit breaker instead of retries?

Retries handle transient blips. Circuit breakers handle sustained failure by stopping traffic and allowing recovery.

If a dependency is consistently failing, retrying just adds load. A breaker (plus fallback or degrade behavior) prevents your app from becoming a denial-of-service client.

What happens when HTTP retries meet queue redelivery?

They multiply. If an HTTP call makes 3 attempts per delivery and the queue redelivers the message 5 times, you can accidentally create 15 attempts per message.

Make one layer the source of truth. Often that means: queue redelivery is limited and slow, and the in-process HTTP policy is very conservative.

How do I test a retry policy?

Test against controlled failure modes: inject 429s with Retry-After, inject timeouts, and inject sustained 503s. Then validate invariants: attempt caps, total budget cap, and logged decisions.

You're done when the system degrades predictably instead of "trying harder" until it melts.
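
A minimal sketch of the sustained-503 case, assuming the resilient policy from earlier (1 try + 2 retries); the stub handler and assertion style are illustrative, so adapt them to your test framework:

csharp
using System;
using System.Diagnostics;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;

// Stub transport that always returns 503 and counts how many calls reached it.
sealed class Always503Handler : HttpMessageHandler
{
    public int Calls;
    protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken ct)
    {
        Calls++;
        return Task.FromResult(new HttpResponseMessage(HttpStatusCode.ServiceUnavailable));
    }
}

static class RetryPolicyChecks
{
    // Invariant under sustained 503s: exactly 1 initial attempt + 2 retries, never more.
    public static async Task SustainedFailureRespectsAttemptCap(IAsyncPolicy<HttpResponseMessage> resilient)
    {
        var handler = new Always503Handler();
        var client = new HttpClient(handler) { BaseAddress = new Uri("https://vendor.example") };

        await resilient.ExecuteAsync(ct => client.GetAsync("/orders/123", ct), CancellationToken.None);

        Debug.Assert(handler.Calls == 3);
    }
}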


Coming soon

If you want more of these operator-grade templates (decision trees, log schemas, runbooks), that is what Axiom is becoming.

Join to get notified as we ship assets you can drop into incident response, with real files and real defaults, not generic advice.



Key takeaways

  • Retries are a multiplier. Under backpressure they can create the outage.
  • The safe default is bounded attempts + bounded time + classified failures + proof in logs.
  • Layered retries are the most common hidden amplifier. Map layers and delete redundancy.

If you want a fast diagnosis of your current retry behavior, see .NET Production Rescue or contact me and include one incident log with correlation IDs.
