Incident Runbook Builder

Build structured incident runbooks with detection criteria, decision trees, escalation paths, and post-incident checklists. Covers the 6 most common trading bot incidents.

6 incident types
4 severity levels
Markdown output

Runbooks aren't documentation — they're decisions under pressure. When your trading bot fails at 3 AM, you don't have time to think. You need a pre-built decision tree that tells you: what to check, what action to take, when to escalate. This builder generates production-ready runbooks you can paste into Slack, Notion, or your incident management tool.

Builder

Your bot is receiving error codes from the exchange API (4xx, 5xx, or specific error codes).

Degraded service. Trading continues but with reduced reliability. Respond within 30 min.

P2 (High): Exchange API Error incident runbook

Detection Criteria

  • Error rate exceeds 1% of total requests in a 5-minute window
  • Specific error codes: -1021 (timestamp), -2015 (auth), 10006 (rate limit), EAPI:Rate limit exceeded
  • Order failure rate spikes above baseline
  • Alert threshold: more than 5 consecutive errors on the same endpoint
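The two numeric criteria above (error rate over a 5-minute window, and consecutive errors per endpoint) can be sketched as a small in-memory tracker. This is a minimal illustration, not the builder's own implementation: the `ErrorDetector` class, its field names, and the choice to pass timestamps in explicitly are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Tuple


@dataclass
class ErrorDetector:
    """Hypothetical helper: evaluates the error-rate and consecutive-error criteria."""

    window_seconds: float = 300.0   # 5-minute window
    rate_threshold: float = 0.01    # 1% of total requests
    streak_threshold: int = 5       # alert on more than 5 consecutive errors
    _events: Deque[Tuple[float, bool]] = field(default_factory=deque)
    _streaks: Dict[str, int] = field(default_factory=dict)

    def record(self, timestamp: float, endpoint: str, is_error: bool) -> None:
        """Store one request outcome and drop outcomes older than the window."""
        self._events.append((timestamp, is_error))
        if is_error:
            self._streaks[endpoint] = self._streaks.get(endpoint, 0) + 1
        else:
            self._streaks[endpoint] = 0  # any success resets the streak
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def error_rate(self) -> float:
        """Fraction of requests in the current window that failed."""
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def should_alert(self, endpoint: str) -> bool:
        """True if either the rate criterion or the streak criterion is met."""
        rate_exceeded = self.error_rate() > self.rate_threshold
        streak_exceeded = self._streaks.get(endpoint, 0) > self.streak_threshold
        return rate_exceeded or streak_exceeded
```

Passing the timestamp in, rather than reading the clock inside `record`, keeps the tracker easy to test against replayed request logs.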

Response Steps

  1. Categorize the error code immediately: is it auth (401/403), rate limit (429), validation (400), or server (5xx)?
  2. Auth errors (401/403): check API key permissions in the exchange dashboard. Verify the IP whitelist. Regenerate the key if it is compromised.
  3. Rate limit errors (429): reduce the request rate to 50% of its current value immediately. Check the X-RateLimit-Remaining headers. Use the Rate Limit Headroom Calculator.
  4. Validation errors (400): check the failing request payload against exchange info. Verify LOT_SIZE, PRICE_FILTER, MIN_NOTIONAL.
  5. Server errors (5xx): wait 30s, retry once. If the error persists, check the exchange status page. Do NOT retry aggressively; the fault is on the exchange side.
  6. If the error persists > 5 min: escalate to on-call with error code, endpoint, and request count.
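Steps 1, 3, and 5 are mechanical enough to sketch in code. The example below is an illustration under stated assumptions: `categorize`, `reduced_rate`, and `retry_once_after_wait` are hypothetical helper names, and `send` stands in for whatever function your bot uses to issue a request and return the HTTP status code.

```python
import time
from typing import Callable, Tuple


def categorize(status: int) -> str:
    """Step 1: map an HTTP status code to the runbook's error categories."""
    if status in (401, 403):
        return "auth"
    if status == 429:
        return "rate_limit"
    if status == 400:
        return "validation"
    if 500 <= status <= 599:
        return "server"
    return "unknown"


def reduced_rate(current_requests_per_second: float) -> float:
    """Step 3: cut the request rate to 50% of its current value."""
    return current_requests_per_second * 0.5


def retry_once_after_wait(
    send: Callable[[], int],
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Step 5: on a 5xx, wait and retry exactly once.

    Returns (final_status, attempts). `send` is a hypothetical callable
    that issues the request and returns its HTTP status code.
    """
    status = send()
    if categorize(status) != "server":
        return status, 1
    sleep(wait_seconds)
    return send(), 2
```

The `sleep` parameter is injectable so the 30-second wait can be replaced with a stub in tests. If the second attempt still returns a 5xx, the caller gets that status back and should follow the status-page check and escalation steps rather than looping.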

Decision Tree

Is the error a 4xx (client error) other than 429?

✅ Yes: Fix the request; don't retry blindly. Go to step 2 (auth) or step 4 (validation).
❌ No: Go to next question.

Is the error a 429 (rate limit)?

✅ Yes: Immediate rate reduction. Pause all non-critical requests for one window cycle.
❌ No: Go to next question.

Is the error a 5xx (server error)?

✅ Yes: Wait and retry once with backoff. Do not escalate unless > 5 min.
❌ No: Unknown error — log full response and escalate.
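The tree can be expressed as a single function that maps a status code to the next action. This is a sketch, not the builder's output: the `Action` enum and `next_action` are hypothetical names. One design choice worth noting is that the code tests for 429 before the general 4xx range, because 429 is itself a 4xx and would otherwise never reach the rate-limit branch.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for the outcomes of the decision tree."""
    REDUCE_RATE = "reduce rate; pause non-critical requests for one window cycle"
    FIX_REQUEST = "fix the request; do not retry blindly"
    WAIT_AND_RETRY_ONCE = "wait and retry once with backoff"
    LOG_AND_ESCALATE = "log full response and escalate"


def next_action(status: int) -> Action:
    """Walk the decision tree for one HTTP status code."""
    if status == 429:
        # Checked first: 429 falls inside the 4xx range below.
        return Action.REDUCE_RATE
    if 400 <= status <= 499:
        return Action.FIX_REQUEST
    if 500 <= status <= 599:
        return Action.WAIT_AND_RETRY_ONCE
    return Action.LOG_AND_ESCALATE
```

For example, `next_action(403)` yields `Action.FIX_REQUEST` and `next_action(502)` yields `Action.WAIT_AND_RETRY_ONCE`; anything outside the 4xx and 5xx ranges falls through to escalation.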

Escalation Path

When: Error rate > 5% for 5+ minutes, or any single auth/403 error (possible key compromise)
Who: On-call engineer → Lead developer (if auth-related)
How: Slack #incidents channel with: error code, endpoint, rate, and sample request ID
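As a rough sketch, the escalation rule and the message contents could be encoded like this. All function names and the message layout are assumptions for illustration, and "rate" is read here as the error rate. The code only builds the text; posting it to Slack is left to whatever client your team already uses.

```python
from typing import List


def should_escalate(error_rate: float, minutes_elevated: float, auth_error_seen: bool) -> bool:
    """Escalate on any auth/403 error, or on > 5% errors sustained for 5+ minutes."""
    if auth_error_seen:
        return True  # possible key compromise: a single occurrence is enough
    return error_rate > 0.05 and minutes_elevated >= 5


def escalation_chain(auth_related: bool) -> List[str]:
    """On-call engineer first; add the lead developer for auth-related incidents."""
    chain = ["on-call engineer"]
    if auth_related:
        chain.append("lead developer")
    return chain


def incident_message(error_code: str, endpoint: str, error_rate: float, request_id: str) -> str:
    """Build the text for the #incidents channel (layout is illustrative)."""
    return (
        f"[P2] Exchange API Error | code={error_code} | endpoint={endpoint} "
        f"| error_rate={error_rate:.1%} | sample_request_id={request_id}"
    )
```

Keeping the rule in a pure function makes the thresholds easy to review and to tune alongside the runbook text.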

Post-Incident Checklist

Frequently asked questions

How is this different from a documentation runbook?
Documentation runbooks are written once and go stale. This builder generates runbooks with decision trees — if/then logic that guides the responder step by step. Each runbook maps to a specific incident type with concrete detection criteria, not vague 'check the logs' instructions. The output is copy-paste ready for Slack, PagerDuty, or Notion.
What severity should I assign to each incident type?
P1 (Critical): revenue-impacting — order failures, reconciliation drift. P2 (High): degraded service — WebSocket drops, rate limits. P3 (Medium): minor impact — elevated latency, intermittent errors. P4 (Low): cosmetic. The builder sets sensible defaults but you should tune them based on your bot's PnL sensitivity and trading frequency.
How do I keep runbooks updated?
Re-run the builder after every production incident to incorporate lessons learned. The post-incident checklist items should feed back into the runbook. Store runbooks in version control (git) alongside your bot code. Review and update quarterly even if no incidents occurred.
Can I customize the runbook for my specific bot?
Yes — copy the Markdown output and edit it. Add your specific alert thresholds, Slack channel names, on-call rotation details, and any bot-specific checks. The builder gives you the structure and decision logic; you customize the specifics.
What's the most important part of an incident runbook?
The decision tree. When an incident hits, the responder is stressed and sleep-deprived. A clear if/then tree prevents panic-driven decisions (like restarting the bot without checking state, or retrying orders without checking idempotency). The first decision should always be: 'Stop the bleeding, then investigate.'
