Retry backoff + jitter checklist (production defaults)

Download includes 2 copy/paste files: retry-backoff-jitter-recipes.md (safe defaults, stop rules, 429 handling, caps/budgets) and retry-telemetry-fields.md (log/metric fields + example line for incident debugging).

Free · Jan 16, 2026

This download is meant for on-call operators and owners of production automation (bots, workers, agents).

What’s inside the download

You get two copy/paste Markdown files:

  • retry-backoff-jitter-recipes.md: safe defaults (backoff + full jitter), stop rules, and 429/Retry-After handling you can standardize across services.
  • retry-telemetry-fields.md: the exact log/metric fields to emit per attempt, plus an example line you can grep during an incident.

Use it as a lightweight pre-flight before you ship (or change) retries.

Checklist preview

1) Retry classification

  • Retry only transient failures (timeouts, connection resets, 502/503).
  • Do not retry deterministic failures (400/401/403/422). The one exception is 401/403 after you explicitly re-authenticate or refresh credentials.
  • Treat 429 as “slow down” (not “try harder”).

2) Backoff + jitter

  • Exponential backoff: the delay grows by a fixed multiplier on each attempt.
  • Add jitter so clients do not synchronize.
  • Cap max delay (e.g. 10-30s) to keep UX predictable.

Suggested defaults (tune per system):

  • Base delay: 200-500ms
  • Multiplier: 2x
  • Max delay: 15s
  • Attempts: 5 (or time-budgeted)
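As a rough illustration of the defaults above, here is a minimal Python sketch of exponential backoff with full jitter. The function name `backoff_delay`, its parameters, and the injectable `rng` hook are hypothetical names chosen for this example, and the 0.3s base is just one value inside the suggested 200-500ms range.

```python
import random


def backoff_delay(attempt, base=0.3, multiplier=2.0, max_delay=15.0, rng=random.random):
    """Return a full-jitter delay in seconds for a 0-indexed retry attempt."""
    # Exponential ceiling: base * multiplier^attempt, capped at max_delay.
    ceiling = min(max_delay, base * (multiplier ** attempt))
    # Full jitter: draw uniformly between 0 and the ceiling so that
    # clients retrying at the same moment spread out instead of synchronizing.
    return rng() * ceiling


if __name__ == "__main__":
    for attempt in range(5):
        print(f"attempt={attempt} delay={backoff_delay(attempt):.3f}s")
```

Full jitter trades a lower average delay for better spreading; if you need a guaranteed minimum wait, a variant that adds a floor may fit better.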

3) Budgets and caps

  • Max attempts per request.
  • Max total retry time budget (e.g. 30s).
  • Max concurrent retries per dependency (bulkhead).
  • Circuit breaker trips when the dependency is unhealthy.
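The first two caps above (max attempts and a total time budget) could be combined in one retry loop, sketched below. All names here (`call_with_budget`, `RetryBudgetExceeded`, the `clock` and `sleep` hooks) are hypothetical, and the sketch assumes `op()` raises only transient errors. The bulkhead and circuit breaker are deliberately left out.

```python
import random
import time


class RetryBudgetExceeded(Exception):
    """Raised when the attempt cap or the time budget is exhausted."""


def call_with_budget(op, max_attempts=5, budget_s=30.0, base=0.3, max_delay=15.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Call op() until it succeeds, attempts run out, or the budget is spent."""
    deadline = clock() + budget_s
    last_error = None
    attempts_made = 0
    for attempt in range(max_attempts):
        attempts_made += 1
        try:
            return op()
        except Exception as exc:  # assumption: op() raises only transient errors
            last_error = exc
        if attempts_made == max_attempts:
            break  # attempt cap reached; do not sleep after the final attempt
        # Full-jitter exponential backoff, capped at max_delay.
        delay = random.uniform(0.0, min(max_delay, base * 2 ** attempt))
        if clock() + delay > deadline:
            break  # sleeping would overrun the total time budget
        sleep(delay)
    raise RetryBudgetExceeded(f"gave up after {attempts_made} attempt(s)") from last_error
```

Checking the budget before sleeping, rather than after, avoids waiting out a delay only to discover there is no time left for the attempt.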

4) Timeouts

  • Every attempt has its own timeout.
  • Each per-attempt timeout is smaller than the overall retry budget, so a single slow attempt cannot consume it.
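One way to keep the per-attempt timeout inside the overall budget is to clamp it to whatever time remains before the deadline. The helper below is a small sketch; `attempt_timeout` and its parameters are hypothetical names, and all values are in seconds.

```python
def attempt_timeout(per_attempt_s, deadline, now):
    """Timeout for the next attempt, never exceeding what is left of the budget."""
    remaining = deadline - now
    if remaining <= 0:
        return 0.0  # budget already spent: the caller should stop, not attempt
    # Use the configured per-attempt timeout unless the budget is nearly gone.
    return min(per_attempt_s, remaining)
```

A caller would typically treat a return value of 0.0 as a signal to give up rather than pass it to an HTTP client as a timeout.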

5) Telemetry (minimum viable)

  • Log: dependency name, error class, status code, attempt number, delay.
  • Emit metrics: retry count, success-after-retry count, give-up count.
  • Include a trace or correlation ID so incidents are debuggable.
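To make the logging bullet concrete, here is a sketch that emits one key=value line per attempt. The field names (`event`, `delay_ms`, `correlation_id`, and so on) are illustrative assumptions and are not necessarily the ones defined in retry-telemetry-fields.md.

```python
import logging

logger = logging.getLogger("retry")


def format_retry_event(dependency, error_class, status_code, attempt, delay_s, correlation_id):
    """Build one grep-friendly key=value line describing a retry attempt."""
    fields = {
        "event": "retry_attempt",
        "dependency": dependency,
        "error_class": error_class,
        # Network-level failures have no HTTP status; log a placeholder instead.
        "status_code": status_code if status_code is not None else "-",
        "attempt": attempt,
        "delay_ms": int(round(delay_s * 1000)),
        "correlation_id": correlation_id,
    }
    return " ".join(f"{key}={value}" for key, value in fields.items())


def log_retry(**fields):
    """Emit the formatted line at WARNING level (the level is an assumption)."""
    logger.warning(format_retry_event(**fields))
```

A flat key=value line keeps the event searchable with plain grep; structured JSON logging would work equally well if your pipeline already parses it.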

6) Rate limits (429)

  • Respect Retry-After when provided.
  • Add jitter around the server-provided delay.
  • Reduce concurrency when 429s spike (token bucket / limiter).
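The Retry-After handling above might look like the following sketch. It parses both header forms (delta-seconds and HTTP-date) and adds jitter on top of the server's value. Adding jitter only upward, so the client never retries earlier than the server asked, is a design assumption of this example; the names `retry_after_seconds`, `delay_for_429`, and `jitter_fraction` are hypothetical.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After header into seconds; return None if absent or invalid."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def delay_for_429(header_value, fallback_s, jitter_fraction=0.2, rng=random.random):
    """Wait at least the server-provided delay, plus up to jitter_fraction extra."""
    server_s = retry_after_seconds(header_value)
    base = server_s if server_s is not None else fallback_s
    return base + rng() * jitter_fraction * base
```

Here `fallback_s` would normally come from your regular backoff calculation. If the resulting delay exceeds the remaining retry budget, giving up is usually safer than retrying early.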

Copy/paste: retry decision table

Timeout / reset / 502 / 503 -> retry with backoff+jitter
429 -> retry with Retry-After (or backoff+jitter) + reduce concurrency
401 / 403 -> stop, refresh credentials or alert (no blind retries)
400 / 422 -> stop (bug / invalid request)
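The same table can be expressed as a small lookup function, sketched here in Python. The `Action` enum and `decide` function are hypothetical names. Stopping on any status code the table does not list is an assumption of this sketch, not something the table itself specifies.

```python
from enum import Enum


class Action(Enum):
    RETRY_BACKOFF = "retry with backoff+jitter"
    RETRY_RATE_LIMITED = "retry with Retry-After and reduce concurrency"
    STOP_REAUTH = "stop, refresh credentials or alert"
    STOP = "stop"


def decide(status_code=None, network_error=False):
    """Map a failure to an action, following the decision table."""
    # Timeouts and connection resets surface as network errors, not status codes.
    if network_error or status_code in (502, 503):
        return Action.RETRY_BACKOFF
    if status_code == 429:
        return Action.RETRY_RATE_LIMITED
    if status_code in (401, 403):
        return Action.STOP_REAUTH
    if status_code in (400, 422):
        return Action.STOP
    # Assumption: anything not listed in the table is treated as non-retryable.
    return Action.STOP
```

Defaulting to stop keeps unknown failures from triggering blind retries; extend the retryable set only for codes you have confirmed are transient for your dependency.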

Canonical: https://matrixtrak.com/resources/retry-backoff-jitter-checklist