# Timeout matrix template (HTTP / SQL / jobs)

Purpose: turn "waiting" into bounded, observable behavior.

This template is intentionally boring. It exists so on-call can answer:
- What is the total budget?
- What are the per-dependency attempt budgets?
- How many attempts are allowed?
- When do we stop and escalate?

## 1) Context

Service: __________________________

Job/request type: __________________

Owner/on-call rotation: ____________

## 2) Total budget

Pick the maximum time you can afford before the system must choose a different path.

Total budget (end-to-end): ________ ms

Fallback behavior when budget is exceeded:
- [ ] return error
- [ ] return cached/stale data
- [ ] enqueue for async processing
- [ ] partial response
- [ ] manual intervention
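One way to make the total budget concrete is a deadline created once per request or job and consulted before every dependency call. The following is a minimal sketch under that assumption; `Budget`, its method names, and the injectable `clock` parameter are hypothetical and not part of this template.

```python
import time
from typing import Callable


class Budget:
    """Tracks one end-to-end budget measured from a fixed start time."""

    def __init__(self, total_ms: float, clock: Callable[[], float] = time.monotonic):
        self._clock = clock  # monotonic clock in seconds; injectable for tests
        self._deadline = clock() + total_ms / 1000.0

    def remaining_ms(self) -> float:
        """Milliseconds left before the deadline, never negative."""
        return max(0.0, (self._deadline - self._clock()) * 1000.0)

    def exceeded(self) -> bool:
        """True once the budget is spent; the caller should take the fallback path."""
        return self.remaining_ms() == 0.0

    def attempt_timeout_ms(self, per_attempt_ms: float) -> float:
        """Cap an attempt timeout so it cannot outlive the total budget."""
        return min(per_attempt_ms, self.remaining_ms())
```

A caller would pass `budget.attempt_timeout_ms(...)` as the timeout of each outgoing call and switch to the chosen fallback as soon as `budget.exceeded()` is true. A monotonic clock is used so that wall-clock adjustments cannot stretch or shrink the budget.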

## 3) Per-dependency budgets (attempt level)

Rules:
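- Attempt timeout <= total budget (end-to-end, from section 2)
- Each row's total time budget must cover every attempt plus backoff and queueing: max attempts × attempt timeout + waits
- Do not retry non-transient failures (section 4 lists them)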

| Dependency | Operation | Attempt timeout | Max attempts | Total time budget | Notes |
|---|---|---:|---:|---:|---|
| http:VendorApi | GET /foo | ____ ms | __ | ____ ms | Honor Retry-After on 429 |
| sql:OrdersDb | SELECT ... | ____ ms | __ | ____ ms | Include lock wait |
| cache:Redis | GET key | ____ ms | __ | ____ ms | Prefer fail-open where safe |
| queue:Payments | Process message | ____ ms | __ | ____ ms | Max runtime + poison path |
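The rules above can be checked mechanically once the blanks are filled in. The sketch below is illustrative only: `Row`, `worst_case_ms`, and `check` are hypothetical names, and the exponential backoff that doubles after each retry is an assumption, not something this template prescribes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Row:
    dependency: str
    attempt_timeout_ms: int
    max_attempts: int
    budget_ms: int            # the row's "Total time budget" column
    backoff_base_ms: int = 0  # assumed: wait doubles after each failed attempt


def worst_case_ms(row: Row) -> int:
    """Time spent if every attempt times out, including waits between attempts."""
    waits = sum(row.backoff_base_ms * 2**i for i in range(row.max_attempts - 1))
    return row.attempt_timeout_ms * row.max_attempts + waits


def check(rows: List[Row], total_budget_ms: int) -> List[str]:
    """Return a list of problems; an empty list means the matrix is consistent."""
    problems = []
    for r in rows:
        worst = worst_case_ms(r)
        if worst > r.budget_ms:
            problems.append(f"{r.dependency}: worst case {worst} ms > row budget {r.budget_ms} ms")
        if r.budget_ms > total_budget_ms:
            problems.append(f"{r.dependency}: row budget {r.budget_ms} ms > total budget {total_budget_ms} ms")
    return problems
```

For example, 3 attempts at 200 ms with a 50 ms backoff base gives a worst case of 600 + 50 + 100 = 750 ms, which fits a 1000 ms row budget. If dependencies are called one after another, the row budgets should also be summed and compared against the total budget; this sketch does not do that.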

## 4) Stop rules (operator-grade)

Stop rules are the point. They prevent zombie work: work that keeps running after nobody is waiting on its result.

- If total budget exceeded: STOP and emit an incident payload
- If max runtime exceeded (jobs): CANCEL, verify the cancellation was honored, then STOP
- If non-retryable error (400/401/403/409, validation errors): STOP and quarantine (poison path)
- If heartbeat missing: STOP and page (stuck work)
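The stop rules above can be read as a single decision function evaluated after each attempt. This is a sketch, not a definitive implementation: the `Decision` enum, the `decide` signature, and the order in which the rules are checked are all assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    RETRY = "retry"
    STOP = "stop"              # budget or attempts exhausted: emit incident payload
    CANCEL = "cancel"          # job over max runtime: cancel, verify, then stop
    QUARANTINE = "quarantine"  # non-retryable error: send to the poison path
    PAGE = "page"              # heartbeat missing: stuck work


NON_RETRYABLE_STATUS = {400, 401, 403, 409}  # taken from the stop rules above


def _over(value: Optional[int], limit: Optional[int]) -> bool:
    """True only when both numbers are known and the value exceeds the limit."""
    return value is not None and limit is not None and value > limit


def decide(
    *,
    elapsed_ms: int,
    total_budget_ms: int,
    attempt: int,
    max_attempts: int,
    status: Optional[int] = None,
    validation_error: bool = False,
    runtime_ms: Optional[int] = None,
    max_runtime_ms: Optional[int] = None,
    heartbeat_age_ms: Optional[int] = None,
    heartbeat_limit_ms: Optional[int] = None,
) -> Decision:
    # Assumed precedence: stuck work first, then runaway jobs, then budgets.
    if _over(heartbeat_age_ms, heartbeat_limit_ms):
        return Decision.PAGE
    if _over(runtime_ms, max_runtime_ms):
        return Decision.CANCEL
    if elapsed_ms > total_budget_ms:
        return Decision.STOP
    if validation_error or status in NON_RETRYABLE_STATUS:
        return Decision.QUARANTINE
    if attempt >= max_attempts:
        return Decision.STOP
    return Decision.RETRY
```

Keeping the rules in one pure function makes them easy to unit test and gives on-call a single place to read what the service will do next.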

## 5) Incident payload (what to include)

When you stop, ship enough context that the next responder can act.

- request/job type
- runId / correlationId
- dependency name
- attempt count
- elapsedMs + timeoutMs
- outcome (timeout, 5xx, 429, lock wait, etc.)
- decision (stop/retry/quarantine/escalate)
- next action (pause job type, reduce concurrency, contact vendor, etc.)
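The fields above map naturally onto one structured log line or alert body. The sketch below assumes JSON output; `build_incident_payload` and the exact key names are hypothetical, chosen to mirror the list.

```python
import json

ALLOWED_DECISIONS = {"stop", "retry", "quarantine", "escalate"}


def build_incident_payload(
    *,
    work_type: str,
    run_id: str,
    dependency: str,
    attempt_count: int,
    elapsed_ms: int,
    timeout_ms: int,
    outcome: str,
    decision: str,
    next_action: str,
) -> str:
    """Serialize the incident fields as one JSON line for logs or paging."""
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    payload = {
        "type": work_type,          # request/job type
        "runId": run_id,            # or a correlationId, whichever the service uses
        "dependency": dependency,
        "attemptCount": attempt_count,
        "elapsedMs": elapsed_ms,
        "timeoutMs": timeout_ms,
        "outcome": outcome,         # e.g. "timeout", "5xx", "429", "lock"
        "decision": decision,
        "nextAction": next_action,
    }
    return json.dumps(payload, sort_keys=True)
```

Rejecting unknown decisions at build time keeps the payload vocabulary small, so dashboards and alert routing can key on it.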
