Incident Runbook Builder
Build structured incident runbooks with detection criteria, decision trees, escalation paths, and post-incident checklists. Covers the 6 most common trading bot incidents.
Runbooks aren't documentation; they're decisions made in advance for moments of pressure. When your trading bot fails at 3 AM, you don't have time to think. You need a pre-built decision tree that tells you what to check, what action to take, and when to escalate. This builder generates production-ready runbooks you can paste into Slack, Notion, or your incident management tool.
Builder
Your bot is receiving error codes from the exchange API (4xx, 5xx, or specific error codes).
Degraded service. Trading continues but with reduced reliability. Respond within 30 min.
Detection Criteria
- Error rate exceeds 1% of total requests in a 5-minute window
- Exchange-specific error codes: -1021 (timestamp) and -2015 (auth) on Binance, 10006 (rate limit) on Bybit, EAPI:Rate limit exceeded on Kraken
- Order failure rate spikes above baseline
- Alert threshold: > 5 consecutive errors on the same endpoint
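The two numeric criteria above (the 1% error rate over a 5-minute window and the streak of more than 5 consecutive errors per endpoint) could be checked with something like the following minimal sketch. The class name `ErrorDetector`, its `record` method, and the alert labels are hypothetical, and `MIN_REQUESTS` is an added assumption (not part of the criteria) that keeps a handful of requests from tripping the rate alert.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300          # 5-minute window from the detection criteria
ERROR_RATE_THRESHOLD = 0.01   # 1% of total requests
MAX_CONSECUTIVE_ERRORS = 5    # alert when the streak exceeds this value
MIN_REQUESTS = 100            # assumption: ignore the rate on tiny samples


class ErrorDetector:
    """Tracks request outcomes and reports which detection criteria fire."""

    def __init__(self):
        self._events = deque()            # (timestamp, is_error) pairs
        self._streaks = defaultdict(int)  # endpoint -> consecutive error count

    def record(self, endpoint, is_error, now):
        """Record one request outcome and return the list of triggered alerts."""
        self._events.append((now, is_error))
        # Drop events that have fallen out of the sliding window.
        while self._events and self._events[0][0] <= now - WINDOW_SECONDS:
            self._events.popleft()

        # A successful request resets the streak for that endpoint.
        if is_error:
            self._streaks[endpoint] += 1
        else:
            self._streaks[endpoint] = 0

        alerts = []
        total = len(self._events)
        errors = sum(1 for _, err in self._events if err)
        if total >= MIN_REQUESTS and errors / total > ERROR_RATE_THRESHOLD:
            alerts.append("error_rate")
        if self._streaks[endpoint] > MAX_CONSECUTIVE_ERRORS:
            alerts.append("consecutive_errors")
        return alerts
```

Call `record` once per API response, passing a monotonic timestamp in seconds as `now`; a non-empty return value is the signal to open this runbook.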
Response Steps
- Categorize the error code immediately: is it auth (401/403), rate limit (429), validation (400), or server (5xx)?
- Auth errors (401/403): check API key permissions in exchange dashboard. Verify IP whitelist. Regenerate key if compromised.
- Rate limit errors (429): reduce request rate to 50% immediately. Check the X-RateLimit-Remaining headers (exact header names vary by exchange). Use the Rate Limit Headroom Calculator.
- Validation errors (400): check the failing request payload against exchange info. Verify LOT_SIZE, PRICE_FILTER, MIN_NOTIONAL.
- Server errors (5xx): wait 30s, then retry once. If the error persists, check the exchange status page. Do NOT retry aggressively: the fault is on the exchange side, and extra requests only add load.
- If the error persists for more than 5 minutes: escalate to on-call with the error code, endpoint, and request count.
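The triage in the steps above can be sketched as a small lookup: classify the HTTP status, return the first action for that category, and switch to escalation once the error has lasted more than 5 minutes. The function names `categorize` and `next_step`, the category labels, and the `"other"` fallback are hypothetical; the action text is condensed from the steps above.

```python
ESCALATE_AFTER_SECONDS = 5 * 60  # "persists for more than 5 minutes"

# First response per category, condensed from the response steps.
FIRST_ACTION = {
    "auth": "Check API key permissions and IP whitelist; regenerate the key if compromised.",
    "rate_limit": "Cut request rate to 50% and check the rate-limit headers.",
    "validation": "Check the payload against exchange info (LOT_SIZE, PRICE_FILTER, MIN_NOTIONAL).",
    "server": "Wait 30s, retry once, then check the exchange status page.",
    "other": "No scripted step (assumed fallback); review the response manually.",
}


def categorize(status):
    """Map an HTTP status code to one of the runbook's error categories."""
    if status in (401, 403):
        return "auth"
    if status == 429:
        return "rate_limit"
    if status == 400:
        return "validation"
    if 500 <= status <= 599:
        return "server"
    return "other"


def next_step(status, seconds_failing):
    """Return the action to take for an error that has lasted seconds_failing."""
    if seconds_failing > ESCALATE_AFTER_SECONDS:
        return "Escalate to on-call with the error code, endpoint, and request count."
    return FIRST_ACTION[categorize(status)]
```

For example, `next_step(429, 20)` returns the rate-limit action, while `next_step(429, 400)` returns the escalation step because the error has outlasted the 5-minute limit.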
Decision Tree
Is the error a 4xx (client error)?
Is the error a 429 (rate limit)?
Is the error a 5xx (server error)?
Escalation Path
Post-Incident Checklist
Related tools
Frequently asked questions
How is this different from a documentation runbook?
What severity should I assign to each incident type?
How do I keep runbooks updated?
Can I customize the runbook for my specific bot?
What's the most important part of an incident runbook?