Axiom Module - Retry Policy Kit
A paid retry policy kit with stop/retry/escalate rules, backoff + jitter defaults, retry budgets/caps, and 429 Retry-After handling templates (Polly + HttpClient) to prevent retry storms.
From this article
Browse allOverview
This kit gives you one production-ready retry policy system you can standardize across services.
It covers the full loop end-to-end:
- Decision rules: stop vs retry vs escalate
- Timing rules: backoff + jitter +
Retry-After - Damage limits: attempt caps + time budgets
- Guardrails: timeouts-first, circuit breakers, and bulkheads
- Implementation: copy-ready .NET templates (Polly + HttpClient)
- Proof and visibility: test checklist + structured retry telemetry + queries
The goal is simple: when something breaks (especially during an incident), you don’t want to assemble guidance from 10 different tabs. You want a single, consistent policy you can trust.
If you’ve ever seen a dependency start failing and then watched your “resilience” layer turn into a self-inflicted outage, this is the pack.
Who it’s for
- Engineers running production APIs and integrations
- Teams dealing with rate-limited vendor APIs
- Anyone who has seen “resilience” turn into a retry storm
When to use it
- You own multiple outbound dependencies (payments, identity, market data, shipping, email, vendor REST APIs)
- Different services currently “pick their own” retry behavior
- You need bounded retries with a measurable blast radius
- You want incident operators to have a decision framework, not folklore
What it prevents
- Retry storms / thundering herd behavior
- Vendor bans caused by aggressive retry loops
- “Infinite retry” background jobs
- Invisible failure amplification (high retry rate with no diagnosis)
Failure patterns covered (real incidents)
- 429 throttling turns into a loop (clients re-try in sync, queueing explodes)
- 5xx clusters during vendor deploys (retries stack at multiple layers)
- Timeouts mask “slow death” outages (calls never complete, thread pool pressure rises)
- Connection resets / transient network faults (safe to retry, unsafe to spam)
- Partial outage amplification (one dependency is degraded; retries turn it into a full outage)
What you’ll be able to do after using this
- Define a single set of retry defaults your team can reuse across integrations
- Stop retry loops before they turn into storms
- Respect vendor contracts (
Retry-After) instead of brute-forcing - Instrument retries so you can measure retry rate and detect storms early
Adoption path (90 minutes, no new architecture)
- Pick the “retry owner” layer (HttpClient/Polly or job runner) using the layered audit.
- Adopt the decision tables so “what we retry” is consistent.
- Standardize budgets/caps (attempt cap + total time cap) so retries are bounded.
- Add bulkheads/breakers so degraded deps fail fast instead of saturating your pool.
- Emit the retry event schema and add two or three queries/dashboards.
What you get
- Editable source docs (Markdown) you can adapt for your org
- A one-page cheat sheet for “what do we do right now?”
- Implementation templates (.NET) that embody the rules
- Operator artifacts: a policy registry template, a retry event schema, and a query pack
- PDF versions of key docs for printing/sharing
PDFs are in the pdf/ folder and match the same filenames as the .md docs.
What’s inside
The pack ships as 13 core files, plus README.md, pack-manifest.json, and a pdf/ folder (8 PDFs). Here’s what each file is for.
00-START-HERE.md
The fastest adoption path.
Includes:
- the recommended sequence to roll this out
- a rollout plan that avoids shipping retry behavior blind
01-stop-retry-escalate-decision-tables.md
The core decision system.
Use it to answer, consistently:
- What do we retry? (timeouts, transient 5xx, connection resets)
- What do we stop immediately? (4xx validation/auth errors, bad requests)
- What do we escalate? (vendor outage signals, repeated 429s, repeated timeouts)
Output:
- A policy you can apply across services so incidents don’t depend on individual judgment.
02-backoff-and-jitter-defaults.md
Your “safe defaults” sheet.
Includes:
- A recommended backoff shape (with jitter)
- Practical cap guidance so retries don’t run forever
- Notes on when to switch from retries to degradation
Output:
- A set of defaults you can standardize across all outbound calls.
03-retry-budgets-and-caps.md
The damage-control layer.
Defines:
- Attempt caps (max tries)
- Time caps (max total retry time)
- Concurrency guidance (how retries multiply load)
Output:
- “Bounded retries” that don’t turn a partial outage into a full outage.
04-polly-templates.cs
Copy-ready C# templates for implementing the decisions above using Polly.
Includes patterns for:
- retry with backoff + jitter
- stop rules and exception/result filtering
- timeouts, bulkheads, and circuit breaker guardrails
- logging hooks (so you can measure retry rate)
Output:
- A consistent Polly policy setup you can reuse across services.
05-httpclient-429-retry-after-template.cs
Copy-ready C# template focused on vendor rate limits.
Includes patterns for:
- respecting
Retry-Aftercorrectly - fallback behavior when
Retry-Afteris missing or invalid - guardrails to prevent 429 loops
Output:
- A rate-limit-aware client behavior that prevents bans and runaway retries.
06-test-harness-checklist.md
The “prove it works” checklist.
Use it to verify:
- retries stop when they should
- caps/budgets are enforced
- 429 behavior respects
Retry-After - retry events are observable in logs
Output:
- Confidence that the policy behaves under real failure modes (not just happy paths).
07-retry-policy-registry-template.yaml
A config template to standardize retry rules per dependency.
Use it to:
- keep policy changes auditable (versioned)
- avoid teams reinventing different retry behavior per service
- encode budgets and stop/escalation rules in one place
08-retry-event-logging-schema.json
A JSON schema for structured retry events.
Use it to:
- make retry amplification measurable
- correlate retry behavior with incidents
- power dashboards and alerts
09-query-pack.md
A query pack that shows what to graph/alert to detect retry storms.
Use it to:
- identify top dependencies by retry rate
- detect sustained 429/5xx patterns
- validate budgets are respected
10-layered-retry-audit.md
A worksheet to find and remove stacked retries (HttpClient + Polly + queue redelivery).
Use it to:
- compute worst-case amplification
- choose a single “retry owner” layer
- prevent accidental retry multiplication
11-cheat-sheet-one-page.md
Print-friendly one-page summary.
Use it to:
- standardize defaults quickly
- make stop/retry/escalate decisions under pressure
- keep 429/
Retry-Afterrules and budgets visible during incidents
12-circuit-breakers-and-bulkheads.md
The “stop making it worse” guardrails.
Use it to:
- cap concurrency per dependency (bulkhead)
- fail fast when failures cluster (circuit breaker)
- align breakers with retry budgets so you don’t oscillate
Output:
- A sane protection layer that prevents retries from saturating your process.
13-implementation-notes-registry-and-telemetry.md
Implementation bridge: how to wire the registry and emit telemetry.
Use it to:
- map YAML registry entries → runtime policy selection
- standardize retry event emission from the templates
- keep policy changes auditable (versioned) while keeping code stable
During an incident (5-minute workflow)
- Confirm if you’re in a retry storm (retry rate rising + attempts collapsing into short windows)
- Identify the top dependency by retry volume and by user impact
- Check whether
Retry-Afteris being honored (and whether clients are looping) - Verify budgets/caps are holding (no infinite retries, no runaway time)
- Decide the action: stop, degrade, or escalate (based on the decision tables)
Related articles
- The real cost of retry logic: when “resilience” makes outages worse
- Backoff + jitter: safe retries that don’t synchronize a thundering herd
- Polly retry policies done right (backoff, jitter, caps, stop rules)
- HttpClient keeps getting 429s: why retries amplify rate limiting in .NET
FAQ
Is this a library?
No. It’s intentionally not a black box. You get a decision framework + defaults + templates + operator artifacts you can tailor and standardize across your org.
Will this work in legacy .NET?
Yes. The core ideas are framework-agnostic. The templates are written to be copy/paste-friendly and adaptable.
What do I actually implement first?
Start with 10-layered-retry-audit.md, then enforce budgets/caps (03-*), then adopt 429 handling (05-*) for any rate-limited vendors.
Purchase
- This resource is sold via MatrixTrak Shop (
shop.matrixtrak.com).
Newsletter
Get the .NET production reliability newsletter
Weekly runbooks, failure patterns, and practical fixes.
No spam. Unsubscribe anytime.
Need help implementing this?
I can help you apply this to your systems without the drama.
Work with meSimilar resources
More resources to help you succeed