# Timeouts rollout plan (observe -> warn -> enforce)

Goal: add time budgets without creating a new outage.

This assumes a live system where rewrites are not an option.

## Phase 0: Inventory (1-2 hours)

- Identify the top 3 user flows or job types that page you
- List the dependencies involved (HTTP/SQL/cache/queue/file share)
- Pick one dependency call that is currently causing pain

## Phase 1: Observe (ship this first)

- Log per dependency call:
  - dependency name
  - elapsedMs
  - attempt
  - outcome
  - correlationId/runId
- Add histograms (p50/p95/p99) where you can
- Do NOT enforce timeouts yet unless a call can literally hang forever

Exit criteria:
- You can answer "what is slow?" with a single log query
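
The per-call fields listed in this phase can be captured with a small wrapper that only measures and logs. The following is a minimal sketch, assuming Python's standard `logging` module and JSON-formatted messages; `observe_call`, the logger name, and the dependency names are hypothetical, not part of any existing codebase.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("dependency_calls")  # hypothetical logger name


@contextmanager
def observe_call(dependency, correlation_id, attempt=1):
    """Log one structured record per dependency call. Observes only; enforces nothing."""
    record = {
        "dependency": dependency,
        "correlationId": correlation_id,
        "attempt": attempt,
        "outcome": "ok",
    }
    start = time.monotonic()
    try:
        yield record
    except Exception as exc:
        # Record the failure type as the outcome, then let the error propagate unchanged.
        record["outcome"] = type(exc).__name__
        raise
    finally:
        record["elapsedMs"] = round((time.monotonic() - start) * 1000, 1)
        logger.info(json.dumps(record))
```

Wrapping an existing call site would look like `with observe_call("orders-db", run_id): run_query()`. Because every record carries the same keys, one query filtered on `dependency` and sorted by `elapsedMs` is enough to answer the exit-criteria question.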

## Phase 2: Warn (budget violations are visible)

- Decide on a conservative per-attempt timeout (start well above the slowest latencies observed in Phase 1)
- Add a warning log when elapsedMs exceeds the budget
- Add an alert only for repeated violations or rising backlog (avoid paging on noise)

Exit criteria:
- You can see which callers and endpoints will be impacted
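
A warn-only budget check and a "repeated violations" gate might look like the sketch below. It is an illustration under assumptions: the budget values, dependency names, `check_budget`, and `RepeatedViolationAlert` are all hypothetical, and the alert threshold would need tuning against real traffic.

```python
import logging
from collections import deque

logger = logging.getLogger("dependency_budgets")  # hypothetical logger name

# Hypothetical budgets in milliseconds; real values should come from Phase 1 data.
BUDGET_MS = {"orders-db": 2000, "pricing-api": 1500}


def check_budget(dependency, elapsed_ms, caller, correlation_id):
    """Warn when a call exceeded its budget. Returns True on violation; never fails the call."""
    budget = BUDGET_MS.get(dependency)
    if budget is None or elapsed_ms <= budget:
        return False
    logger.warning(
        "budget exceeded dependency=%s caller=%s elapsedMs=%.0f budgetMs=%d correlationId=%s",
        dependency, caller, elapsed_ms, budget, correlation_id,
    )
    return True


class RepeatedViolationAlert:
    """Signals an alert only when `threshold` violations fall within `window_s` seconds."""

    def __init__(self, threshold, window_s):
        self.threshold = threshold
        self.window_s = window_s
        self._times = deque()

    def record(self, now_s):
        """Record a violation at time `now_s`; return True if the alert should fire."""
        self._times.append(now_s)
        # Drop violations that have aged out of the window.
        while self._times and now_s - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) >= self.threshold
```

Logging the caller alongside the dependency is what makes the exit criterion checkable: grouping warnings by `caller` shows who would be affected once the budget is enforced.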

## Phase 3: Enforce (bounded behavior)

- Enforce the attempt timeout
- Cap retries and add a total time budget that covers all attempts
- Ensure cancellation is wired through so the work itself stops, not just the caller's wait

Exit criteria:
- Timeouts produce an operator decision (stop/retry/quarantine), not a mystery
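
The three enforcement bullets can be combined in one helper. This is a sketch assuming an `asyncio`-based caller; `call_with_budget` and `BudgetExhausted` are hypothetical names, and backoff between attempts is left out to keep it short.

```python
import asyncio
import time


class BudgetExhausted(Exception):
    """Raised when the attempt cap or the total time budget runs out."""


async def call_with_budget(op, attempt_timeout_s, max_attempts, total_budget_s):
    """Run `op(attempt)` with a per-attempt timeout, capped retries, and a total deadline."""
    deadline = time.monotonic() + total_budget_s
    attempts_made = 0
    last_error = None
    for attempt in range(1, max_attempts + 1):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # total budget spent; do not start another attempt
        attempts_made = attempt
        try:
            # wait_for cancels the awaited coroutine on timeout, so the work is
            # asked to stop rather than being left running in the background.
            return await asyncio.wait_for(
                op(attempt), timeout=min(attempt_timeout_s, remaining)
            )
        except asyncio.TimeoutError as exc:
            last_error = exc
    raise BudgetExhausted(f"gave up after {attempts_made} attempt(s)") from last_error
```

Clamping each attempt to the remaining total budget means the caller never waits longer than `total_budget_s`, regardless of how the retry settings are tuned later. The explicit `BudgetExhausted` error gives the operator a single, named outcome to act on (stop, retry later, or quarantine).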

## Safety checks

- Verify cancellation is honored (no background work continuing after timeout)
- Verify connection pools and worker concurrency recover after a timeout burst
- Verify your fallback path (cached data, queue, partial response) actually works
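
The first check can be automated. Below is a sketch of one way to do it, assuming `asyncio`; `worker` is a hypothetical stand-in for real dependency work, and the durations are arbitrary. The idea is to time out the call, then wait past the point where leftover work would have finished and confirm that it did not.

```python
import asyncio


async def worker(state):
    """Stand-in for dependency work; records whether it completed or was cancelled."""
    try:
        await asyncio.sleep(0.2)
        state["completed"] = True
    except asyncio.CancelledError:
        state["cancelled"] = True
        raise  # re-raise so the cancellation is actually honored


async def check_cancellation_is_honored():
    state = {"completed": False, "cancelled": False}
    try:
        await asyncio.wait_for(worker(state), timeout=0.02)
    except asyncio.TimeoutError:
        pass
    # Wait longer than the work would have taken; zombie work would set "completed".
    await asyncio.sleep(0.3)
    return state
```

A passing run returns a state where `cancelled` is true and `completed` is false. The same pattern applies to real calls if the work exposes some observable side effect (a row written, a counter incremented) that can be checked after the timeout.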

## Common failure modes

- Adding a timeout that stops the waiting but not the work (zombie work)
- Timeouts + retries stacking into a longer total wait
- Enforcing everywhere at once and declaring "timeouts broke prod"
- Missing poison path: deterministic failures get retried instead of quarantined
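
The stacking problem is easy to underestimate, so it can help to compute the worst case before enforcing. A small sketch of the arithmetic, assuming a fixed backoff between attempts (`worst_case_wait_s` and the numbers are illustrative only):

```python
def worst_case_wait_s(attempt_timeout_s, max_attempts, backoff_s=0.0):
    """Upper bound on caller wait when every attempt times out and no total budget exists."""
    return max_attempts * attempt_timeout_s + (max_attempts - 1) * backoff_s


# A 5 s attempt timeout with 3 attempts and 1 s backoff can hold a caller for 17 s.
print(worst_case_wait_s(5, 3, 1))  # 17
```

If that bound is longer than the caller's own timeout, the retries will never finish from the caller's point of view, which is the argument for a total budget on top of the per-attempt one.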
