# Stop / Retry / Escalate decision tree

Use this to make retry behavior predictable during incidents.

The goal is not maximum success at any cost. The goal is bounded work with clear operator actions.

## Inputs (collect first)

- Dependency name
- Call path (web request, job step, message handler)
- Total time budget for the call path (seconds)
- Per-attempt timeout (milliseconds)
- Max attempts (integer)
- Whether the operation is idempotent (or has an idempotency key)
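
As a rough illustration, the inputs above can be captured in one record and sanity-checked before any retry logic runs. This is a minimal sketch: the `RetryInputs` class, its field names, and the `problems` method are hypothetical, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryInputs:
    """Hypothetical container for the inputs listed above."""
    dependency: str
    call_path: str               # e.g. "web request", "job step", "message handler"
    total_budget_s: float        # total time budget for the call path, in seconds
    per_attempt_timeout_ms: int  # per-attempt timeout, in milliseconds
    max_attempts: int
    idempotent: bool             # True if idempotent or protected by an idempotency key

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means the inputs are consistent."""
        issues = []
        if self.max_attempts < 1:
            issues.append("max_attempts must be at least 1")
        if self.per_attempt_timeout_ms <= 0:
            issues.append("per_attempt_timeout_ms must be positive")
        # Worst case ignores backoff delays, so the real total is even larger.
        worst_case_ms = self.max_attempts * self.per_attempt_timeout_ms
        if worst_case_ms > self.total_budget_s * 1000:
            issues.append("attempts x timeout exceeds the total time budget")
        if not self.idempotent and self.max_attempts > 1:
            issues.append("non-idempotent operation should not be retried")
        return issues
```

Running `problems()` at startup or in a config test is one way to catch a policy whose attempts cannot fit inside its budget.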

## Decision table

### STOP (no retry)

Stop immediately and return a clear error when the failure is not transient or a duplicate attempt can cause harm.

Common STOP signals:

- 400 validation
- 401 authentication
- 403 permission
- 404 when it means "does not exist" (not a transient lookup race)
- 409 / 412 concurrency conflicts (ETag, optimistic concurrency)
- 422 semantic validation (domain rules)
- Any response that proves the request is invalid

Operator action:

- Fix input, auth, or business logic
- Do not add more retries
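
The STOP signals above could be encoded as a small lookup. This is a sketch under assumptions: the function name and the `not_found_may_be_transient` flag are hypothetical, and the caller is expected to know whether a 404 on a given path can be a lookup race.

```python
# Status codes from the STOP list above.
STOP_STATUS_CODES = frozenset({400, 401, 403, 404, 409, 412, 422})


def is_stop_status(status_code: int, not_found_may_be_transient: bool = False) -> bool:
    """Return True when the status code means retrying cannot help."""
    if status_code == 404 and not_found_may_be_transient:
        # A 404 caused by a lookup race is not a STOP signal.
        return False
    return status_code in STOP_STATUS_CODES
```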

### RETRY (bounded, with backpressure)

Retry only when there is a reasonable chance that waiting helps and the work is bounded.

Common RETRY signals:

- 429, only when a Retry-After header is present
- 502 / 503 / 504 from an upstream
- Transient network failures (connection reset, DNS timeout, socket timeout)
- Timeouts where the dependency is known to recover quickly

Rules:

- Always enforce a per-attempt timeout
- Always enforce a total time budget
- Always include jitter
- Respect Retry-After when present
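
The four rules above might be combined as follows. This is a minimal sketch, not a definitive implementation: `TransientError`, `RetryExhausted`, and `call_with_retry` are hypothetical names, the backoff uses "full jitter" (one common choice among several), and `retry_after_s` is assumed to be already parsed into seconds from the Retry-After header.

```python
import random
import time
from typing import Callable, Optional


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (429, 502/503/504, resets)."""

    def __init__(self, message: str, retry_after_s: Optional[float] = None):
        super().__init__(message)
        self.retry_after_s = retry_after_s


class RetryExhausted(Exception):
    """Raised when attempts or the total budget run out; the caller should ESCALATE."""


def call_with_retry(
    attempt_fn: Callable[[float], object],  # receives the timeout for this attempt
    *,
    max_attempts: int,
    per_attempt_timeout_s: float,
    total_budget_s: float,
    base_delay_s: float = 0.2,
    max_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
):
    deadline = clock() + total_budget_s
    for attempt in range(1, max_attempts + 1):
        remaining = deadline - clock()
        if remaining <= 0:
            raise RetryExhausted("total time budget exhausted")
        try:
            # Never let one attempt outlive the remaining budget.
            return attempt_fn(min(per_attempt_timeout_s, remaining))
        except TransientError as err:
            if attempt == max_attempts:
                raise RetryExhausted("max attempts reached") from err
            # Full jitter: uniform between 0 and the capped exponential backoff.
            cap = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            delay = random.uniform(0, cap)
            if err.retry_after_s is not None:
                # Never retry sooner than the server asked.
                delay = max(delay, err.retry_after_s)
            if delay >= deadline - clock():
                raise RetryExhausted("next delay would exceed the total budget") from err
            sleep(delay)
    # Only reached when max_attempts is less than 1.
    raise RetryExhausted("max attempts reached")
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting; `attempt_fn` is responsible for actually applying the timeout it is given.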

### ESCALATE (operator payload)

Escalate when you stop retrying and a human can act on the failure.

Escalate triggers:

- Max attempts reached
- Total time budget exhausted
- Same failure repeats across many requests/messages
- 429 without Retry-After during sustained throttling
- A dependency is slow but not failing and in-flight work is rising

Operator action:

- Reduce concurrency toward the dependency
- Disable or degrade the feature that calls the dependency
- Switch to cached or stale data if possible
- Pause queue consumers or slow redelivery
- Open an incident with evidence
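
One way to evaluate the escalation triggers after a failed or slow attempt is sketched below. Everything here is an assumption for illustration: the `CallState` fields, the threshold constants, and the way "in-flight work is rising" is approximated by comparing two samples.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; tune them per dependency.
SUSTAINED_THROTTLE_THRESHOLD = 5
REPEATED_FAILURE_THRESHOLD = 20
IN_FLIGHT_THRESHOLD = 100


@dataclass(frozen=True)
class CallState:
    attempt: int                  # attempts made so far, including the one that just failed
    max_attempts: int
    elapsed_ms: int
    total_budget_ms: int
    status_code: Optional[int] = None
    retry_after_s: Optional[float] = None
    repeated_failures: int = 0    # same failure seen across recent requests/messages
    in_flight: int = 0            # current in-flight calls to the dependency
    in_flight_previous: int = 0   # in-flight calls at the previous sample


def escalation_reason(state: CallState) -> Optional[str]:
    """Return the first matching trigger, or None when no escalation is needed."""
    if state.attempt >= state.max_attempts:
        return "max attempts reached"
    if state.elapsed_ms >= state.total_budget_ms:
        return "total time budget exhausted"
    if (state.status_code == 429 and state.retry_after_s is None
            and state.repeated_failures >= SUSTAINED_THROTTLE_THRESHOLD):
        return "429 without Retry-After during sustained throttling"
    if state.repeated_failures >= REPEATED_FAILURE_THRESHOLD:
        return "same failure repeats across many requests"
    if state.in_flight >= IN_FLIGHT_THRESHOLD and state.in_flight > state.in_flight_previous:
        return "in-flight work is rising"
    return None
```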

## Operator payload (minimum fields)

Log this payload for every decision, and attach it to the alert when you ESCALATE:

- correlationId
- dependency
- operation
- attempt
- maxAttempts
- delayMs
- timeoutMs
- elapsedMs
- totalBudgetMs
- statusCode (if any)
- exceptionType (if any)
- decision (stop, retry, escalate)
- reason
- retryAfterSeconds (if any)
- nextAction (disable feature, reduce concurrency, check vendor status)
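
The payload could be emitted as one structured log line. The sketch below assumes JSON logging; the `OperatorPayload` class and `to_log_line` method are hypothetical, and fields marked "(if any)" are omitted from the output when unset.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class OperatorPayload:
    """Field names mirror the list above."""
    correlationId: str
    dependency: str
    operation: str
    attempt: int
    maxAttempts: int
    delayMs: int
    timeoutMs: int
    elapsedMs: int
    totalBudgetMs: int
    decision: str                      # "stop", "retry", or "escalate"
    reason: str
    nextAction: str
    statusCode: Optional[int] = None
    exceptionType: Optional[str] = None
    retryAfterSeconds: Optional[float] = None

    def to_log_line(self) -> str:
        # Drop unset optional fields so the log line carries only known facts.
        fields = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(fields, sort_keys=True)
```

Keeping the payload in one typed record makes it harder to ship an alert that is missing the budget or attempt counters an operator needs.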

## Layered retries audit (one place owns retries)

Retry storms usually come from stacked retries:

- HttpClient handler retries
- Polly retries
- Queue redelivery
- Scheduled job retries
- Users refreshing the page and upstream callers retrying

Pick one place where retries are decided and remove the rest.

## Notes for non-idempotent operations

If a duplicate attempt can create double writes, double charges, or double emails:

- STOP by default
- Add an idempotency key and make the server treat repeats as safe
- Only then consider bounded retries
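
To illustrate what "treat repeats as safe" can mean on the server side, here is a minimal in-memory sketch. `ChargeService` and its fields are hypothetical; a real service would need durable storage and an atomic check-and-store so that concurrent duplicates cannot both pass the check.

```python
import uuid


class ChargeService:
    """Hypothetical server that returns the stored result for a repeated key."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A repeated key returns the original result without charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"chargeId": self.charges_made, "amountCents": amount_cents}
        self._results[idempotency_key] = result
        return result


# The client generates the key once and reuses it on every retry of the same operation.
key = str(uuid.uuid4())
service = ChargeService()
first = service.charge(key, 500)
retry = service.charge(key, 500)  # duplicate attempt, no second charge
```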
