Axiom Module - Retry Policy Kit

A paid retry policy kit with stop/retry/escalate rules, backoff + jitter defaults, retry budgets/caps, and 429 Retry-After handling templates (Polly + HttpClient) to prevent retry storms.

$49 · Feb 01, 2026

Overview

This kit gives you one production-ready retry policy system you can standardize across services.

It covers the full loop end-to-end:

  • Decision rules: stop vs retry vs escalate
  • Timing rules: backoff + jitter + Retry-After
  • Damage limits: attempt caps + time budgets
  • Guardrails: timeouts-first, circuit breakers, and bulkheads
  • Implementation: copy-ready .NET templates (Polly + HttpClient)
  • Proof and visibility: test checklist + structured retry telemetry + queries

The goal is simple: when something breaks (especially during an incident), you don’t want to assemble guidance from 10 different tabs. You want a single, consistent policy you can trust.

If you’ve ever seen a dependency start failing and then watched your “resilience” layer turn into a self-inflicted outage, this is the pack.

Who it’s for

  • Engineers running production APIs and integrations
  • Teams dealing with rate-limited vendor APIs
  • Anyone who has seen “resilience” turn into a retry storm

When to use it

  • You own multiple outbound dependencies (payments, identity, market data, shipping, email, vendor REST APIs)
  • Different services currently “pick their own” retry behavior
  • You need bounded retries with a measurable blast radius
  • You want incident operators to have a decision framework, not folklore

What it prevents

  • Retry storms / thundering herd behavior
  • Vendor bans caused by aggressive retry loops
  • “Infinite retry” background jobs
  • Invisible failure amplification (high retry rate with no diagnosis)

Failure patterns covered (real incidents)

  • 429 throttling turns into a loop (clients retry in sync, queueing explodes)
  • 5xx clusters during vendor deploys (retries stack at multiple layers)
  • Timeouts mask “slow death” outages (calls never complete, thread pool pressure rises)
  • Connection resets / transient network faults (safe to retry, unsafe to spam)
  • Partial outage amplification (one dependency is degraded; retries turn it into a full outage)

What you’ll be able to do after using this

  • Define a single set of retry defaults your team can reuse across integrations
  • Stop retry loops before they turn into storms
  • Respect vendor contracts (Retry-After) instead of brute-forcing
  • Instrument retries so you can measure retry rate and detect storms early

Adoption path (90 minutes, no new architecture)

  1. Pick the “retry owner” layer (HttpClient/Polly or job runner) using the layered audit.
  2. Adopt the decision tables so “what we retry” is consistent.
  3. Standardize budgets/caps (attempt cap + total time cap) so retries are bounded.
  4. Add bulkheads/breakers so degraded deps fail fast instead of saturating your pool.
  5. Emit the retry event schema and add two or three queries/dashboards.

What you get

  • Editable source docs (Markdown) you can adapt for your org
  • A one-page cheat sheet for “what do we do right now?”
  • Implementation templates (.NET) that embody the rules
  • Operator artifacts: a policy registry template, a retry event schema, and a query pack
  • PDF versions of key docs for printing/sharing

PDFs are in the pdf/ folder and use the same filenames as the .md docs.

What’s inside

The pack ships as 14 core files (00–13), plus README.md, pack-manifest.json, and a pdf/ folder (8 PDFs). Here’s what each file is for.

00-START-HERE.md

The fastest adoption path.

Includes:

  • the recommended sequence to roll this out
  • a rollout plan that avoids shipping retry behavior blind

01-stop-retry-escalate-decision-tables.md

The core decision system.

Use it to answer, consistently:

  • What do we retry? (timeouts, transient 5xx, connection resets)
  • What do we stop immediately? (4xx validation/auth errors, bad requests)
  • What do we escalate? (vendor outage signals, repeated 429s, repeated timeouts)

Output:

  • A policy you can apply across services so incidents don’t depend on individual judgment.
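
As a rough sketch of what such a shared classifier can look like (the status-code groupings below are illustrative, not the pack’s exact tables):

```csharp
using System;
using System.Net.Http;

// Illustrative groupings only; the decision tables in 01-* are the source of truth.
public enum RetryDecision { Stop, Retry, Escalate }

public static class RetryDecider
{
    public static RetryDecision Decide(int? statusCode, Exception? error, int attempt, int maxAttempts)
    {
        if (attempt >= maxAttempts)
            return RetryDecision.Escalate;          // budget exhausted: hand off, don't keep looping

        if (error is HttpRequestException or TimeoutException)
            return RetryDecision.Retry;             // timeouts, connection resets, transient faults

        return statusCode switch
        {
            429 => RetryDecision.Retry,             // but only after honoring Retry-After (see 05-*)
            >= 500 => RetryDecision.Retry,          // transient 5xx
            >= 400 => RetryDecision.Stop,           // validation/auth/bad request: retrying won't help
            _ => RetryDecision.Stop
        };
    }
}
```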

02-backoff-and-jitter-defaults.md

Your “safe defaults” sheet.

Includes:

  • A recommended backoff shape (with jitter)
  • Practical cap guidance so retries don’t run forever
  • Notes on when to switch from retries to degradation

Output:

  • A set of defaults you can standardize across all outbound calls.
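
For orientation, a minimal sketch of exponential backoff with full jitter; the base delay and cap below are placeholders, not the kit’s recommended numbers (Random.Shared assumes .NET 6+):

```csharp
using System;

// Illustrative numbers only; 02-backoff-and-jitter-defaults.md defines the real defaults.
public static class Backoff
{
    // Exponential backoff with full jitter: a random delay in [0, min(cap, base * 2^attempt)).
    // Full jitter spreads synchronized clients apart so they don't retry in lockstep.
    public static TimeSpan NextDelay(int attempt, double baseMs = 200, double capMs = 10_000)
    {
        double ceiling = Math.Min(capMs, baseMs * Math.Pow(2, attempt));
        return TimeSpan.FromMilliseconds(Random.Shared.NextDouble() * ceiling); // Random.Shared: .NET 6+
    }
}
```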

03-retry-budgets-and-caps.md

The damage-control layer.

Defines:

  • Attempt caps (max tries)
  • Time caps (max total retry time)
  • Concurrency guidance (how retries multiply load)

Output:

  • “Bounded retries” that don’t turn a partial outage into a full outage.
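
A minimal sketch of what “bounded” means in practice, combining an attempt cap with a wall-clock budget (numbers are illustrative; the jittered delay reuses the Backoff helper sketched above):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Illustrative: bound retries with both an attempt cap and a total-time budget.
public static class RetryBudget
{
    public static async Task<T> RunAsync<T>(
        Func<Task<T>> operation, int maxAttempts = 4, TimeSpan? totalBudget = null)
    {
        var budget = totalBudget ?? TimeSpan.FromSeconds(15);
        var clock = Stopwatch.StartNew();

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception) when (attempt < maxAttempts && clock.Elapsed < budget)
            {
                // Real code would also apply the stop/retry decision tables here;
                // once the cap or budget is exhausted, the exception propagates instead.
                await Task.Delay(Backoff.NextDelay(attempt)); // Backoff helper from the previous sketch
            }
        }
    }
}
```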

04-polly-templates.cs

Copy-ready C# templates for implementing the decisions above using Polly.

Includes patterns for:

  • retry with backoff + jitter
  • stop rules and exception/result filtering
  • timeouts, bulkheads, and circuit breaker guardrails
  • logging hooks (so you can measure retry rate)

Output:

  • A consistent Polly policy setup you can reuse across services.
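
A rough sketch of this kind of wiring, assuming Polly v7-style syntax; the pack’s templates define the actual structure and defaults:

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.Timeout;

// Illustrative Polly v7-style wiring; the pack's templates define the real defaults.
public static class ResiliencePolicies
{
    public static IAsyncPolicy<HttpResponseMessage> Build()
    {
        // Retry only what the decision tables treat as retryable: transient network
        // faults, per-try timeouts, 5xx, and 429 (handled more precisely in 05-*).
        var retry = Policy<HttpResponseMessage>
            .Handle<HttpRequestException>()
            .Or<TimeoutRejectedException>()
            .OrResult(r => (int)r.StatusCode >= 500 || (int)r.StatusCode == 429)
            .WaitAndRetryAsync(
                3, // attempt cap; see 03-retry-budgets-and-caps.md
                attempt => TimeSpan.FromMilliseconds(
                    Random.Shared.NextDouble() * 200 * Math.Pow(2, attempt)), // backoff + full jitter
                (outcome, delay, attempt, context) =>
                {
                    // Logging hook: emit a structured retry event here (see 08-*).
                });

        // Per-try timeout so slow calls fail fast instead of hanging the pool.
        var perTryTimeout = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(5));

        // Retry wraps the per-try timeout; breakers/bulkheads (12-*) sit between them in the full templates.
        return Policy.WrapAsync(retry, perTryTimeout);
    }
}
```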

05-httpclient-429-retry-after-template.cs

Copy-ready C# template focused on vendor rate limits.

Includes patterns for:

  • respecting Retry-After correctly
  • fallback behavior when Retry-After is missing or invalid
  • guardrails to prevent 429 loops

Output:

  • A rate-limit-aware client behavior that prevents bans and runaway retries.
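
A minimal sketch of the idea, assuming a DelegatingHandler and illustrative caps; the pack’s template is more complete:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch only; the pack's template adds stricter caps, logging, and stop rules.
// Note: re-sending is only safe for requests without a body (or with rewindable content).
public sealed class RetryAfterHandler : DelegatingHandler
{
    private const int MaxAttempts = 3;
    private static readonly TimeSpan MaxWait = TimeSpan.FromSeconds(30);     // never sleep longer than this inline
    private static readonly TimeSpan FallbackWait = TimeSpan.FromSeconds(2); // used when Retry-After is missing/invalid

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        for (int attempt = 1; ; attempt++)
        {
            var response = await base.SendAsync(request, cancellationToken);
            if ((int)response.StatusCode != 429 || attempt >= MaxAttempts)
                return response; // success, non-429 failure, or attempt cap reached

            // Prefer the server's instruction (delta-seconds or HTTP date); fall back if absent.
            var retryAfter = response.Headers.RetryAfter;
            TimeSpan wait = retryAfter?.Delta
                ?? (retryAfter?.Date is { } date ? date - DateTimeOffset.UtcNow : FallbackWait);

            if (wait <= TimeSpan.Zero) wait = FallbackWait;
            if (wait > MaxWait) return response; // too long to wait inline: surface the 429 and escalate

            response.Dispose();
            await Task.Delay(wait, cancellationToken);
        }
    }
}
```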

06-test-harness-checklist.md

The “prove it works” checklist.

Use it to verify:

  • retries stop when they should
  • caps/budgets are enforced
  • 429 behavior respects Retry-After
  • retry events are observable in logs

Output:

  • Confidence that the policy behaves under real failure modes (not just happy paths).

07-retry-policy-registry-template.yaml

A config template to standardize retry rules per dependency.

Use it to:

  • keep policy changes auditable (versioned)
  • avoid teams reinventing different retry behavior per service
  • encode budgets and stop/escalation rules in one place

08-retry-event-logging-schema.json

A JSON schema for structured retry events.

Use it to:

  • make retry amplification measurable
  • correlate retry behavior with incidents
  • power dashboards and alerts
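
For a sense of what a structured retry event can carry, here is a sketch with hypothetical field names; the schema file defines the real ones:

```csharp
using System;

// Hypothetical field names for illustration; 08-retry-event-logging-schema.json defines the real schema.
public sealed record RetryEvent(
    DateTimeOffset Timestamp,
    string Dependency,       // logical name of the downstream system
    string Operation,        // e.g. "POST /v1/charges"
    int Attempt,             // 1-based attempt number
    int? StatusCode,         // HTTP status, if one was returned
    double DelayMs,          // backoff applied before this attempt
    string Decision,         // stop | retry | escalate
    string? CorrelationId);  // ties retries back to the originating request/incident
```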

09-query-pack.md

A query pack that shows what to graph/alert to detect retry storms.

Use it to:

  • identify top dependencies by retry rate
  • detect sustained 429/5xx patterns
  • validate budgets are respected

10-layered-retry-audit.md

A worksheet to find and remove stacked retries (HttpClient + Polly + queue redelivery).

Use it to:

  • compute worst-case amplification
  • choose a single “retry owner” layer
  • prevent accidental retry multiplication
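
For example, 3 retries at the HttpClient layer (4 attempts per call), 3 Polly retries around that (×4), and up to 5 queue redeliveries (×5) turn one logical request into as many as 4 × 4 × 5 = 80 calls against the dependency; the worksheet makes that multiplication explicit so you can pick a single layer to own retries.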

11-cheat-sheet-one-page.md

Print-friendly one-page summary.

Use it to:

  • standardize defaults quickly
  • make stop/retry/escalate decisions under pressure
  • keep 429/Retry-After rules and budgets visible during incidents

12-circuit-breakers-and-bulkheads.md

The “stop making it worse” guardrails.

Use it to:

  • cap concurrency per dependency (bulkhead)
  • fail fast when failures cluster (circuit breaker)
  • align breakers with retry budgets so you don’t oscillate

Output:

  • A sane protection layer that prevents retries from saturating your process.
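
A rough sketch of these guardrails, again assuming Polly v7-style syntax with placeholder thresholds:

```csharp
using System;
using System.Net.Http;
using Polly;

// Illustrative Polly v7-style guardrails; the thresholds are placeholders, not the pack's defaults.
public static class Guardrails
{
    // Bulkhead: at most 20 concurrent calls to this dependency, 10 more queued; beyond that, reject fast.
    public static readonly IAsyncPolicy<HttpResponseMessage> Bulkhead =
        Policy.BulkheadAsync<HttpResponseMessage>(20, 10);

    // Circuit breaker: after 5 consecutive handled failures, fail fast for 30 seconds.
    public static readonly IAsyncPolicy<HttpResponseMessage> Breaker =
        Policy<HttpResponseMessage>
            .Handle<HttpRequestException>()
            .OrResult(r => (int)r.StatusCode >= 500)
            .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

    // Breaker outside the bulkhead; retries (04-*) sit outside both so they see fast failures.
    public static readonly IAsyncPolicy<HttpResponseMessage> Combined =
        Policy.WrapAsync(Breaker, Bulkhead);
}
```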

13-implementation-notes-registry-and-telemetry.md

Implementation bridge: how to wire the registry and emit telemetry.

Use it to:

  • map YAML registry entries → runtime policy selection
  • standardize retry event emission from the templates
  • keep policy changes auditable (versioned) while keeping code stable
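
A hypothetical sketch of that mapping, assuming the registry YAML is deserialized into per-dependency entries keyed by name (all names below are illustrative):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape: registry entries (07-*) selected at runtime by dependency name.
public sealed record RetryPolicyEntry(
    string Dependency,
    int MaxAttempts,
    TimeSpan TotalBudget,
    bool HonorRetryAfter,
    string Version);          // bump on every change so rollouts stay auditable

public sealed class RetryPolicyRegistry
{
    private readonly IReadOnlyDictionary<string, RetryPolicyEntry> _entries;

    public RetryPolicyRegistry(IEnumerable<RetryPolicyEntry> entries)
    {
        var map = new Dictionary<string, RetryPolicyEntry>(StringComparer.OrdinalIgnoreCase);
        foreach (var e in entries) map[e.Dependency] = e;
        _entries = map;
    }

    // Fall back to conservative defaults for anything not explicitly registered.
    public RetryPolicyEntry For(string dependency) =>
        _entries.TryGetValue(dependency, out var entry)
            ? entry
            : new RetryPolicyEntry(dependency, MaxAttempts: 2,
                TotalBudget: TimeSpan.FromSeconds(5), HonorRetryAfter: true, Version: "default");
}
```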

During an incident (5-minute workflow)

  1. Confirm whether you’re in a retry storm (retry rate rising and attempts clustering in short windows)
  2. Identify the top dependency by retry volume and by user impact
  3. Check whether Retry-After is being honored (and whether clients are looping)
  4. Verify budgets/caps are holding (no infinite retries, no runaway time)
  5. Decide the action: stop, degrade, or escalate (based on the decision tables)

FAQ

Is this a library?

No. It’s intentionally not a black box. You get a decision framework + defaults + templates + operator artifacts you can tailor and standardize across your org.

Will this work in legacy .NET?

Yes. The core ideas are framework-agnostic. The templates are written to be copy/paste-friendly and adaptable.

What do I actually implement first?

Start with 10-layered-retry-audit.md, then enforce budgets/caps (03-*), then adopt 429 handling (05-*) for any rate-limited vendors.

Purchase

  • This resource is sold via MatrixTrak Shop (shop.matrixtrak.com).


Need help implementing this?

I can help you apply this to your systems without the drama.

Work with me
