# Retry policy checklist (Polly)

Use this before shipping retries to production.

## 1) What are you retrying?
- Is the dependency idempotent for the operation?
- Are you retrying reads vs writes?
- Do you have a per-endpoint risk rating?

## 2) Allowed retries (transient)
Retry (usually) when you see:
- `429 Too Many Requests` (with backoff)
- `503 Service Unavailable`
- `408 Request Timeout`
- network exceptions / connection resets
- timeouts (if you still have budget)

## 3) Never retry (permanent or unsafe)
Do not retry when you see:
- `400 Bad Request` (your bug)
- `401/403` (auth/config)
- `404` (wrong route)
- validation errors
- request bodies that are not safe to replay

## 4) Backoff + jitter
- Use exponential backoff and a hard cap
- Add jitter to prevent thundering herd
- Example: 1s, 2s, 4s (+ jitter), cap at 30s

## 5) Stop rules
- Max retries (`MaxRetryAttempts`)
- Total timeout budget (per request)
- Circuit breaker to stop hammering dead systems

## 6) Concurrency limits
- Set max outgoing concurrency per dependency
- Use bulkheads/queues if needed

## 7) Logging & metrics
Emit one event per retry attempt:
- endpoint, status/exception, attempt, delay, correlation id
- include dependency name + environment

## 8) Testing
- Unit test: retry happens for 429/503
- Unit test: no retry for 401/400
- Chaos test: dependency down -> circuit opens
