Retry backoff + jitter checklist (production defaults)
Download includes 2 copy/paste files: retry-backoff-jitter-recipes.md (safe defaults, stop rules, 429 handling, caps/budgets) and retry-telemetry-fields.md (log/metric fields + example line for incident debugging).
Free, Jan 16, 2026
This download is meant for on-call operators and owners of production automation (bots, workers, agents).
What’s inside the download
You get two copy/paste Markdown files:
- retry-backoff-jitter-recipes.md: safe defaults (backoff + full jitter), stop rules, and 429/Retry-After handling you can standardize across services.
- retry-telemetry-fields.md: the exact log/metric fields to emit per attempt, plus an example line you can grep during an incident.
Use it as a lightweight pre-flight before you ship (or change) retries.
Checklist preview
1) Retry classification
- Retry only transient failures (timeouts, connection resets, 502/503).
- Do not retry deterministic failures (400/401/403/422) unless you explicitly re-auth / refresh.
- Treat 429 as “slow down” (not “try harder”).
2) Backoff + jitter
- Exponential backoff: delay grows each attempt.
- Add jitter so clients do not synchronize.
- Cap max delay (e.g. 10-30s) to keep UX predictable.
Suggested defaults (tune per system):
- Base delay: 200-500ms
- Multiplier: 2x
- Max delay: 15s
- Attempts: 5 (or time-budgeted)
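A minimal sketch of these defaults, assuming "full jitter" means drawing the delay uniformly between zero and the capped exponential ceiling. The names `backoff_delay` and `delay_schedule` and the 300ms base are illustrative picks, not taken from the download.

```python
import random


def backoff_delay(attempt, base_s=0.3, multiplier=2.0, max_delay_s=15.0, rng=random.random):
    """Return a full-jitter delay in seconds for a 0-indexed retry attempt.

    base_s=0.3 is an assumed pick from the 200-500ms range; tune per system.
    """
    # Exponential ceiling, capped so late attempts never wait longer than max_delay_s.
    ceiling = min(max_delay_s, base_s * (multiplier ** attempt))
    # Full jitter: scale the ceiling by a random factor in [0, 1) so clients spread out.
    return rng() * ceiling


def delay_schedule(attempts=5):
    """Delays to sleep between attempts: 5 attempts means 4 waits."""
    return [backoff_delay(i) for i in range(attempts - 1)]
```

Passing `rng` explicitly keeps the function testable: with `rng=lambda: 1.0` it returns the ceiling itself, which makes the cap easy to verify.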
3) Budgets and caps
- Max attempts per request.
- Max total retry time budget (e.g. 30s).
- Max concurrent retries per dependency (bulkhead).
- Circuit breaker trips when the dependency is unhealthy.
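The first two caps can be sketched as a retry loop that stops on whichever limit is hit first. This is an illustration under assumptions: `TransientError`, `RetriesExhausted`, and `call_with_budget` are hypothetical names, and the bulkhead and circuit breaker are left out to keep the example short.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 502/503)."""


class RetriesExhausted(Exception):
    """Raised when the attempt cap or the time budget is used up."""


def call_with_budget(op, max_attempts=5, budget_s=30.0, base_s=0.3, max_delay_s=15.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call op() until it succeeds, attempts run out, or the time budget would be exceeded."""
    deadline = clock() + budget_s
    attempts_made = 0
    last_error = None
    while attempts_made < max_attempts:
        attempts_made += 1
        try:
            return op()
        except TransientError as exc:
            # Only transient failures are retried; anything else propagates immediately.
            last_error = exc
        if attempts_made == max_attempts:
            break  # attempt cap reached
        ceiling = min(max_delay_s, base_s * 2 ** (attempts_made - 1))
        delay = random.uniform(0.0, ceiling)
        if clock() + delay > deadline:
            break  # sleeping would overrun the total time budget
        sleep(delay)
    raise RetriesExhausted(f"gave up after {attempts_made} attempt(s)") from last_error
```

The `sleep` and `clock` parameters exist so the loop can be exercised with a fake clock instead of real waiting.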
4) Timeouts
- Every attempt has a timeout.
- Each per-attempt timeout is smaller than your overall retry budget, so one slow attempt cannot consume it.
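One way to enforce both timeout rules is to derive each attempt's timeout from whatever budget remains. A sketch under assumptions: the 5s per-attempt cap and the names `attempt_timeout` and `fetch_once` are illustrative, and `urllib.request` stands in for whatever HTTP client you use.

```python
import time
import urllib.request


def attempt_timeout(deadline, now, per_attempt_s=5.0):
    """Timeout for the next attempt: the per-attempt cap or the remaining budget, whichever is smaller."""
    remaining = deadline - now
    return max(0.0, min(per_attempt_s, remaining))


def fetch_once(url, deadline, per_attempt_s=5.0):
    """Make one bounded attempt; `deadline` is a time.monotonic() value set by the caller."""
    timeout = attempt_timeout(deadline, time.monotonic(), per_attempt_s)
    if timeout <= 0.0:
        # Budget already spent: fail fast instead of starting a request with no time left.
        raise TimeoutError("retry budget exhausted before the attempt started")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read()
```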
5) Telemetry (minimum viable)
- Log: dependency name, error class, status code, attempt number, delay.
- Emit metrics: retries count, success-after-retry, give-up count.
- Include a trace or correlation ID so incidents are debuggable.
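As a sketch of the minimum above, the helper below emits one greppable key=value line per retry attempt and bumps three counters. The field and metric names are assumptions for illustration (the download's retry-telemetry-fields.md may name them differently), and `Counter` stands in for a real metrics client.

```python
import logging
from collections import Counter

logger = logging.getLogger("retry")
metrics = Counter()  # stand-in for a real metrics client


def format_attempt(dependency, error_class, status_code, attempt, delay_ms, correlation_id):
    """Build one key=value line; values are assumed to contain no spaces."""
    fields = {
        "event": "retry_attempt",
        "dependency": dependency,
        "error_class": error_class,
        "status_code": status_code,
        "attempt": attempt,
        "delay_ms": delay_ms,
        "correlation_id": correlation_id,
    }
    return " ".join(f"{key}={value}" for key, value in fields.items())


def record_attempt(**fields):
    """Log the attempt and count it as a retry."""
    line = format_attempt(**fields)
    logger.warning(line)
    metrics["retries_total"] += 1
    return line


def record_outcome(succeeded, attempts):
    """Count the final result: success after at least one retry, or give-up."""
    if succeeded and attempts > 1:
        metrics["success_after_retry_total"] += 1
    elif not succeeded:
        metrics["give_up_total"] += 1
```

During an incident, a line in this shape can be filtered with something like `grep 'dependency=billing-api'` (the dependency name here is made up).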
6) Rate limits (429)
- Respect Retry-After when provided.
- Add jitter around the server-provided delay.
- Reduce concurrency when 429 spikes (token bucket / limiter).
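A sketch of the first two points, assuming the header arrives as a string in either of the two standard Retry-After forms (delta-seconds or an HTTP-date). The function names and the 10% jitter fraction are illustrative. Jitter is only added on top of the server's delay here, so the client never retries earlier than asked; that is a design choice for this example, not something the checklist prescribes. Concurrency reduction is not shown.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return the server-requested wait in seconds, or None if absent or unparseable."""
    if value is None:
        return None
    value = value.strip()
    if value.isascii() and value.isdigit():  # delta-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def delay_for_429(retry_after, attempt, base_s=0.3, max_delay_s=15.0, jitter_fraction=0.1):
    """Wait as long as the server asked plus a little jitter; otherwise fall back to full-jitter backoff."""
    server_delay = parse_retry_after(retry_after)
    if server_delay is None:
        return random.uniform(0.0, min(max_delay_s, base_s * 2 ** attempt))
    return server_delay + random.uniform(0.0, jitter_fraction * server_delay)
```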
Copy/paste: retry decision table
Timeout / reset / 502 / 503 -> retry with backoff+jitter
429 -> retry with Retry-After (or backoff+jitter) + reduce concurrency
401 / 403 -> stop, refresh credentials or alert (no blind retries)
400 / 422 -> stop (bug / invalid request)
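The table above can be encoded as a small function so every service makes the same call. This is a sketch: `Action` and `decide` are hypothetical names, timeouts and resets are assumed to surface as Python's built-in `TimeoutError` and `ConnectionResetError`, and stopping on any status the table does not list is an assumption of this example.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry with backoff+jitter"
    SLOW_DOWN = "retry with Retry-After (or backoff+jitter) + reduce concurrency"
    REAUTH_OR_ALERT = "stop, refresh credentials or alert"
    STOP = "stop (bug / invalid request)"


def decide(status_code=None, error=None):
    """Map a failed attempt to the action from the decision table."""
    if isinstance(error, (TimeoutError, ConnectionResetError)):
        return Action.RETRY
    if status_code in (502, 503):
        return Action.RETRY
    if status_code == 429:
        return Action.SLOW_DOWN
    if status_code in (401, 403):
        return Action.REAUTH_OR_ALERT
    # 400 / 422 per the table; statuses the table does not list also stop (assumption).
    return Action.STOP
```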
Canonical: https://matrixtrak.com/resources/retry-backoff-jitter-checklist