# Layered retry audit checklist

Use this checklist to find retry multipliers. The failure mode is rarely a single retry policy. It is retries stacked across layers, where each outer attempt re-runs every inner attempt.

## Step 1) Map every retry layer

For each layer, write down:

- Where the retry lives
- Max attempts
- Delay strategy
- Per-attempt timeout
- Total time budget
- What gets logged
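One way to keep this inventory honest is to record it as data rather than prose. The sketch below is illustrative only: `RetryLayer`, `audit_gaps`, and every field name and sample value are assumptions made for this example, not part of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryLayer:
    """One row of the retry inventory (all field names are illustrative)."""
    name: str                         # e.g. "Polly policy"
    location: str                     # where the retry lives (repo, config, vendor default)
    max_attempts: int                 # total attempts, including the first call
    delay_strategy: str               # e.g. "fixed 200 ms", "exponential + jitter"
    per_attempt_timeout_s: float
    total_budget_s: Optional[float]   # None means no total budget is enforced
    logged_fields: tuple = ()         # what each attempt writes to logs


def audit_gaps(layer: RetryLayer) -> list:
    """Return the inventory gaps that make a layer hard to reason about."""
    gaps = []
    if layer.total_budget_s is None:
        gaps.append("no total time budget")
    if not layer.logged_fields:
        gaps.append("nothing logged per attempt")
    if layer.max_attempts < 1:
        gaps.append("max_attempts must count the first call")
    return gaps


# Hypothetical sample rows; replace with what you actually find.
layers = [
    RetryLayer("HttpClient handler", "service startup config", 3,
               "fixed 200 ms", 2.0, None),
    RetryLayer("Queue redelivery", "broker queue settings", 5,
               "broker default", 30.0, 150.0, ("delivery_count",)),
]

for layer in layers:
    print(layer.name, audit_gaps(layer))
```

A layer with gaps is usually the one that surprises you during an incident, so it is worth filling those in before moving to Step 2.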

Common layers:

- Browser or client SDK
- API gateway or reverse proxy
- Service-to-service HttpClient handler
- Polly policy
- Queue redelivery
- Scheduler retry (cron, orchestrator)
- User refresh behavior

## Step 2) Calculate the worst case multiplier

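Worst-case attempts per user action are often the product of the attempt counts at each layer, because every attempt at an outer layer can trigger the full set of attempts at the layer beneath it.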

Example:

- HttpClient retries: 3 attempts
- Polly retries: 3 attempts
- Queue redelivery: 5 deliveries

Worst case attempts per message: 3 * 3 * 5 = 45

If you cannot do this math quickly, you are flying blind.
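The arithmetic from the example above can be kept as a small script so it is easy to rerun when a layer changes. This is a minimal sketch. It assumes each number is the total attempt count for that layer and that every layer exhausts its attempts, and the function name is made up for illustration.

```python
from math import prod

# Attempt counts from the example above; each value is total attempts at that layer.
attempts_by_layer = {
    "HttpClient retries": 3,
    "Polly retries": 3,
    "Queue redelivery": 5,
}


def worst_case_attempts(attempts: dict) -> int:
    """Multiply per-layer attempt counts, assuming every layer exhausts its attempts."""
    return prod(attempts.values())


print(worst_case_attempts(attempts_by_layer))  # 45
```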

## Step 3) Pick one retry authority

Choose one place where retries are decided.

Rules:

- One owner sets max attempts and total budget.
- Other layers either do zero retries or only do minimal safety retries.
- The total time budget is enforced in exactly one place, by that owner.
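As a rough illustration of what a single retry authority could look like, the sketch below keeps max attempts and the total budget in one function. It is not Polly's or HttpClient's API. The names, defaults, and the exponential backoff are assumptions chosen for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Raised when the single retry authority gives up."""


def call_with_retry_authority(
    operation: Callable[[], T],
    max_attempts: int = 3,          # assumed default
    total_budget_s: float = 10.0,   # assumed default
    base_delay_s: float = 0.2,      # assumed default
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation with one owner of max attempts and total time budget."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    deadline = clock() + total_budget_s
    last_error: Optional[Exception] = None
    attempts_made = 0
    for attempt in range(1, max_attempts + 1):
        attempts_made = attempt
        try:
            return operation()
        except Exception as exc:  # a real policy would only catch retryable errors
            last_error = exc
        delay = base_delay_s * (2 ** (attempt - 1))  # exponential backoff
        # Stop if this was the last attempt or the next wait would exceed the budget.
        if attempt == max_attempts or clock() + delay > deadline:
            break
        sleep(delay)
    raise RetriesExhausted(f"gave up after {attempts_made} attempt(s)") from last_error
```

The sketch does not enforce a per-attempt timeout, so `operation` is expected to carry its own. Inner layers called by `operation` would be configured for zero retries so the counts here stay the real counts.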

## Step 4) Separate retry from escalation

When retries stop, escalation must be actionable.

Define:

- Who gets paged
- What payload is included
- What the first operator action is
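The three definitions above can be captured in one record that travels with the page. The sketch below is only one possible shape. `Escalation`, `to_page_payload`, and all sample values (team name, endpoint, error text) are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Escalation:
    """Hypothetical escalation record emitted when retries stop."""
    page_target: str    # who gets paged
    operation: str      # what was being attempted
    attempts: int       # attempts made before giving up
    elapsed_s: float    # total time spent across attempts
    last_error: str     # final error seen by the retry authority
    first_action: str   # first thing the operator should do


def to_page_payload(escalation: Escalation) -> str:
    """Serialize the record so the pager message carries the full context."""
    return json.dumps(asdict(escalation), sort_keys=True)


# Made-up values for illustration only.
example = Escalation(
    page_target="payments-oncall",
    operation="POST /charges",
    attempts=3,
    elapsed_s=9.8,
    last_error="HTTP 503 from dependency",
    first_action="Check dependency health before replaying failed messages",
)
print(to_page_payload(example))
```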

## Step 5) Verify backpressure behavior

- 429 is treated as retryable only when the retry is bounded by max attempts and the total time budget
- Retry-After is respected
- Concurrency toward the dependency is capped
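The three checks above can be sketched in a few lines. This is an assumption-laden example, not a reference implementation. The caps, the one-second fallback, and all function names are invented here. Only the `Retry-After` formats (delay in seconds or an HTTP date) come from the HTTP specification.

```python
import threading
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Callable, Optional

MAX_RETRY_AFTER_S = 30.0    # assumed cap on how long we are willing to wait
MAX_CONCURRENT_CALLS = 4    # assumed concurrency cap toward the dependency
FALLBACK_WAIT_S = 1.0       # assumed wait when Retry-After is missing or invalid

dependency_slots = threading.BoundedSemaphore(MAX_CONCURRENT_CALLS)


def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Parse Retry-After as delay-seconds or an HTTP-date; None if absent or invalid."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):  # exception type differs across Python versions
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def should_retry_429(header_value: Optional[str], attempt: int,
                     max_attempts: int, remaining_budget_s: float):
    """Retry a 429 only if attempts remain and the requested wait fits the bounds."""
    wait = retry_after_seconds(header_value)
    if wait is None:
        wait = FALLBACK_WAIT_S
    bounded = (attempt < max_attempts
               and wait <= MAX_RETRY_AFTER_S
               and wait <= remaining_budget_s)
    return bounded, wait


def call_dependency(send: Callable[[], object]) -> object:
    """Hold one concurrency slot for the duration of the call."""
    with dependency_slots:
        return send()
```

For example, `should_retry_429("5", attempt=1, max_attempts=3, remaining_budget_s=10.0)` returns `(True, 5.0)`, while the same header on the third attempt returns `(False, 5.0)`.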

## Step 6) Prove it with telemetry

Add these signals:

- attempts per request or attempts per message
- total elapsed time vs total budget
- percentage of calls that hit max attempts
- percentage of calls that escalated

During an incident window, you should be able to answer:

- How many attempts did we generate per user action?
- Did we respect total budgets?
- Where did the retries come from?
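To make the signals and questions above concrete, the sketch below aggregates them from per-request records. The record shape is an assumption: `RequestOutcome` and its fields are invented for illustration, and attributing all of a request's retries to a single `source_layer` is a simplification of what real telemetry would carry.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RequestOutcome:
    """One finished request or message (field names are illustrative)."""
    attempts: int        # total attempts, including the first call
    elapsed_s: float     # total elapsed time across attempts
    budget_s: float      # total time budget that applied
    max_attempts: int    # configured attempt limit
    escalated: bool
    source_layer: str    # layer that issued the retries (simplified to one layer)


def summarize(outcomes: list) -> dict:
    """Compute the four signals as averages and percentages."""
    total = len(outcomes)
    if total == 0:
        return {}
    return {
        "avg_attempts": sum(o.attempts for o in outcomes) / total,
        "pct_over_budget": 100 * sum(o.elapsed_s > o.budget_s for o in outcomes) / total,
        "pct_hit_max_attempts": 100 * sum(o.attempts >= o.max_attempts for o in outcomes) / total,
        "pct_escalated": 100 * sum(o.escalated for o in outcomes) / total,
    }


def retries_by_layer(outcomes: list) -> Counter:
    """Count retries (attempts beyond the first) per originating layer."""
    counts = Counter()
    for o in outcomes:
        counts[o.source_layer] += max(0, o.attempts - 1)
    return counts


# Made-up sample data for illustration only.
sample = [
    RequestOutcome(1, 0.5, 10.0, 3, False, "polly"),
    RequestOutcome(3, 12.0, 10.0, 3, True, "polly"),
    RequestOutcome(2, 4.0, 10.0, 3, False, "queue"),
    RequestOutcome(2, 3.0, 10.0, 3, False, "queue"),
]
print(summarize(sample))
print(retries_by_layer(sample))
```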
