# Stop / Retry / Escalate Decision Tree (AI Agents)

Goal: make agent behavior predictable under failure.

Use this decision tree whenever a tool call fails or the agent starts repeating itself.

## Step 1 — Classify the failure

Pick the best match:

- **Input/validation** (bad params, schema validation failure, 400)
- **Auth/permission** (401/403, missing scope)
- **Rate limit** (429)
- **Transient service** (timeouts, 5xx, DNS)
- **Safety/policy** (blocked action)
- **Ambiguous** (unknown error)
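
The classification above can be sketched as a small function. This is only an illustration: `FailureClass`, `classify_failure`, and the boolean signals (`timed_out`, `dns_error`, `policy_blocked`) are hypothetical names, and a real tool layer may expose failures differently. Only the status codes named in the list are mapped; everything else falls through to ambiguous.

```python
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    INPUT_VALIDATION = "input/validation"
    AUTH_PERMISSION = "auth/permission"
    RATE_LIMIT = "rate limit"
    TRANSIENT = "transient service"
    SAFETY_POLICY = "safety/policy"
    AMBIGUOUS = "ambiguous"


def classify_failure(
    status: Optional[int] = None,
    timed_out: bool = False,
    dns_error: bool = False,
    policy_blocked: bool = False,
) -> FailureClass:
    """Map one failed tool call to a failure class (hypothetical signals)."""
    # A policy block wins over any status code the tool happened to return.
    if policy_blocked:
        return FailureClass.SAFETY_POLICY
    if timed_out or dns_error:
        return FailureClass.TRANSIENT
    if status in (401, 403):
        return FailureClass.AUTH_PERMISSION
    if status == 429:
        return FailureClass.RATE_LIMIT
    if status == 400:
        return FailureClass.INPUT_VALIDATION
    if status is not None and 500 <= status <= 599:
        return FailureClass.TRANSIENT
    # Anything not recognized is treated as ambiguous.
    return FailureClass.AMBIGUOUS
```

For example, `classify_failure(status=503)` yields `FailureClass.TRANSIENT`, while an unrecognized status such as 418 yields `FailureClass.AMBIGUOUS`.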

## Step 2 — Apply the action

### A) Input/validation → STOP

- Stop immediately; retrying the same input will fail the same way.
- Explain what input is invalid.
- Provide a corrected example.

### B) Auth/permission → ESCALATE

- Do not retry.
- Escalate with:
  - tool name
  - missing permission/scope
  - remediation steps

### C) Rate limit (429) → RETRY (bounded)

- Respect `Retry-After` if present.
- Apply exponential backoff + jitter.
- Reduce concurrency.
- Cap retries (e.g., 3).
- If still failing: ESCALATE (rate limit sustained).
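
A minimal sketch of the bounded 429 retry, under some assumptions: the tool call is wrapped in a hypothetical `call()` that returns `(status, retry_after_seconds_or_None, body)`, and `Retry-After` has already been parsed to seconds (the header may also carry an HTTP date, which is not handled here). The names `rate_limit_delay` and `call_with_rate_limit_retry` are illustrative. Reducing concurrency is left out because it depends on how the agent schedules calls.

```python
import random
import time
from typing import Any, Callable, Optional, Tuple

# Hypothetical response shape: (status, retry_after_seconds_or_None, body).
Response = Tuple[int, Optional[float], Any]


def rate_limit_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    # The server's hint takes priority over our own backoff schedule.
    if retry_after is not None:
        return max(0.0, retry_after)
    # Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^(attempt-1))].
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return random.uniform(0, ceiling)


def call_with_rate_limit_retry(
    call: Callable[[], Response],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, Any]:
    """Retry on 429 up to `max_retries` times, then signal ESCALATE."""
    for attempt in range(1, max_retries + 2):
        status, retry_after, body = call()
        if status != 429:
            # Not rate limited; other failure classes are handled elsewhere.
            return ("PROCEED", (status, body))
        if attempt > max_retries:
            break
        sleep(rate_limit_delay(attempt, retry_after))
    return ("ESCALATE", "rate limit sustained")
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.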

### D) Transient service (timeouts/5xx) → RETRY (bounded)

- Retry with backoff + jitter.
- Cap retries (e.g., 2–3).
- If still failing: ESCALATE (service degraded).
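
One way to sketch the transient-failure path, assuming the tool raises Python's built-in `TimeoutError` or `ConnectionError` on timeouts and network faults. `retry_transient` and `ServiceDegraded` are hypothetical names; the base delay and cap are arbitrary illustration values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServiceDegraded(Exception):
    """Raised when bounded retries are exhausted; the caller should ESCALATE."""


def retry_transient(
    call: Callable[[], T],
    max_retries: int = 3,
    base: float = 0.5,
    cap: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying transient errors with exponential backoff + jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except (TimeoutError, ConnectionError) as exc:
            if attempt == max_retries:
                # Retry budget spent: surface a distinct error for escalation.
                raise ServiceDegraded(
                    f"gave up after {max_retries} retries"
                ) from exc
            # Full jitter: uniform in [0, min(cap, base * 2^attempt)].
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise AssertionError("unreachable")
```

Raising a dedicated exception keeps "service degraded" distinguishable from the original error while preserving it as the cause.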

### E) Safety/policy → STOP or ESCALATE

- If the action is disallowed: STOP.
- If a human can approve a safe alternative: ESCALATE.

### F) Ambiguous → RETRY once, then ESCALATE

- Retry once, in case the failure was a one-off.
- If it repeats: ESCALATE with full logs.
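
Taken together, the rules A through F of Step 2 can be read as a lookup table. The sketch below is one possible encoding, not a prescribed format: `Policy`, `POLICIES`, and `next_action` are hypothetical names, the class names are plain strings, and the retry cap of 3 for transient failures is just the upper end of the suggested range.

```python
from typing import NamedTuple


class Policy(NamedTuple):
    action: str        # first response to this failure class
    max_retries: int   # retry budget (0 means never retry)
    on_exhausted: str  # what to do once the budget is spent


POLICIES = {
    "input/validation": Policy("STOP", 0, "STOP"),
    "auth/permission": Policy("ESCALATE", 0, "ESCALATE"),
    "rate limit": Policy("RETRY", 3, "ESCALATE"),
    "transient service": Policy("RETRY", 3, "ESCALATE"),
    "safety/policy": Policy("STOP", 0, "STOP"),
    "ambiguous": Policy("RETRY", 1, "ESCALATE"),
}


def next_action(
    failure_class: str,
    retries_so_far: int,
    human_can_approve: bool = False,
) -> str:
    """Return STOP, RETRY, or ESCALATE for one classified failure."""
    # Safety/policy escalates only when a human could approve a safe alternative.
    if failure_class == "safety/policy" and human_can_approve:
        return "ESCALATE"
    # Unknown class names are handled like the ambiguous class.
    policy = POLICIES.get(failure_class, POLICIES["ambiguous"])
    if policy.action == "RETRY" and retries_so_far >= policy.max_retries:
        return policy.on_exhausted
    return policy.action
```

Keeping the table as data makes the retry caps easy to review and tune without touching control flow.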

## Loop-specific override

If any of the following is true, STOP or ESCALATE immediately, regardless of what Step 2 would otherwise allow:

- Same tool + same error repeats N times
- Same output repeats N times
- No plan changes after N iterations

Suggested defaults for N:

- N=2 for auth/safety
- N=3 for 429/timeouts
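
The three loop conditions can be checked with fixed-size windows. In this sketch `LoopDetector` is a hypothetical helper, "repeats N times" is interpreted as N consecutive identical entries, and the plan is assumed to be representable as a comparable string (for example a hash of the plan text).

```python
from collections import deque
from typing import Deque, Optional, Tuple


class LoopDetector:
    """Track the last N iterations and report which loop condition tripped."""

    def __init__(self, n: int = 3) -> None:
        self.n = n
        self._errors: Deque[Tuple[str, str]] = deque(maxlen=n)
        self._outputs: Deque[str] = deque(maxlen=n)
        self._plans: Deque[str] = deque(maxlen=n)

    def record(self, tool: str, error: Optional[str], output: str, plan: str) -> None:
        if error is None:
            # A successful call breaks any error streak.
            self._errors.clear()
        else:
            self._errors.append((tool, error))
        self._outputs.append(output)
        self._plans.append(plan)

    def tripped(self) -> Optional[str]:
        """Return the name of the first tripped condition, or None."""
        checks = (
            ("same tool + same error", self._errors),
            ("same output", self._outputs),
            ("no plan change", self._plans),
        )
        for name, window in checks:
            # Tripped when the window is full and every entry is identical.
            if len(window) == self.n and len(set(window)) == 1:
                return name
        return None
```

A caller would invoke `record(...)` after every iteration and STOP or ESCALATE as soon as `tripped()` returns a non-`None` value.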

## Escalation payload (minimum)

Include:

- run id
- goal
- last 5 actions (tool name + params summary)
- last 5 results (status + error)
- loop counters (iteration, repeats)
- recommended next action
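
As an illustration, the minimum payload could be modeled as a dataclass and serialized to JSON. The class name `EscalationPayload`, its field names, and the JSON layout are assumptions made for this sketch; only the list of fields comes from the checklist above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class EscalationPayload:
    run_id: str
    goal: str
    recommended_next_action: str
    # Each action: tool name + params summary. Each result: status + error.
    last_actions: List[Dict[str, str]] = field(default_factory=list)
    last_results: List[Dict[str, str]] = field(default_factory=list)
    # For example {"iteration": 7, "repeats": 3}.
    loop_counters: Dict[str, int] = field(default_factory=dict)

    def to_json(self, keep: int = 5) -> str:
        """Serialize, keeping only the most recent `keep` actions and results."""
        data = asdict(self)
        data["last_actions"] = data["last_actions"][-keep:]
        data["last_results"] = data["last_results"][-keep:]
        return json.dumps(data, indent=2)
```

Trimming at serialization time lets the agent keep a longer history internally while the escalation stays short enough for a human to scan.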
