Bot reliability checklist: 20-point pre-flight for trading bots
Production-readiness checklist for crypto trading bots: rate limits, reconnects, idempotency, crash recovery, clock sync, and incident response.
Free · Jun 09, 2026
Source code
This resource is backed by a public GitHub repository with source code, templates, and documentation you can fork, review, and integrate.
This checklist is the minimum bar for a trading bot that interacts with exchange APIs. If you cannot answer "yes" to all items, you have a gap that will cause an incident.
A) Auth and connectivity
- API keys have minimum required permissions (read-only where possible).
- API keys are scoped to specific IPs or have IP whitelisting enabled.
- Key rotation process exists and is documented.
- Every signed request uses a fresh timestamp (Date.now()), not a cached or reused value.
- Clock is synced via NTP every 5-15 minutes (systemd-timesyncd or chronyd).
- Clock drift is logged per signed request (local time vs exchange server time offset).
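The timestamp and drift items above can be sketched in Python as follows. This is a minimal illustration, not any specific exchange's signing scheme: it assumes an HMAC-SHA256 signature over a URL-encoded query string, and the names sign_params, now_ms, and log_clock_offset are hypothetical. The clock is passed in as a function so that it is read on every call rather than cached.

```python
import hashlib
import hmac
import logging
import time
from urllib.parse import urlencode

log = logging.getLogger("bot.auth")


def now_ms() -> int:
    """Wall-clock time in milliseconds (the Python counterpart of Date.now())."""
    return int(time.time() * 1000)


def sign_params(params: dict, secret: str, clock=now_ms) -> dict:
    """Return a copy of params with a fresh timestamp and a signature.

    Assumption: the exchange expects HMAC-SHA256 over the URL-encoded
    parameters. Check your exchange's documentation for the real scheme.
    """
    signed = dict(params)
    signed["timestamp"] = clock()  # read the clock per request, never reuse
    query = urlencode(signed)
    signed["signature"] = hmac.new(
        secret.encode(), query.encode(), hashlib.sha256
    ).hexdigest()
    return signed


def log_clock_offset(local_ms: int, server_ms: int) -> int:
    """Log and return local time minus exchange server time, in milliseconds."""
    offset = local_ms - server_ms
    log.info("clock_offset_ms=%d", offset)
    return offset
```

In use, server_ms would come from whatever server-time endpoint or response header the exchange provides, sampled alongside each signed request.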
B) Rate limiting and backpressure
- Per-endpoint concurrency caps are enforced (private: 1-2, public: 2-4).
- 429 responses trigger backoff + jitter, not immediate retry.
- Retry budgets are bounded (max 2-3 attempts per request).
- Retries use exponential backoff with jitter (±500ms range).
- Reconnect attempts are singleflight (one at a time) with jittered backoff.
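A rough Python sketch of the backoff and retry-budget items above. It assumes the HTTP layer raises a RateLimited exception on a 429 (a hypothetical name), and uses a 1-second base delay and 30-second cap as placeholder values. The ±500ms jitter and the 3-attempt budget come from the checklist. Concurrency caps and singleflight reconnects are not covered here.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical exception the caller's HTTP layer raises on a 429."""


def backoff_delay(attempt, base_s=1.0, cap_s=30.0, jitter_s=0.5, rng=random.uniform):
    """Exponential delay for a 0-based attempt number, capped, with +/- jitter."""
    delay = min(cap_s, base_s * (2 ** attempt))
    return max(0.0, delay + rng(-jitter_s, jitter_s))


def call_with_budget(fn, max_attempts=3, sleep=time.sleep):
    """Call fn, retrying on RateLimited until the attempt budget is spent."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: let the caller escalate
            sleep(backoff_delay(attempt))
```

Passing sleep and rng as parameters keeps the function testable without real waiting. In production the defaults apply.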
C) Error handling
- Auth failures (401/403, signature/timestamp errors) are STOP rules — 0 retries, escalate to operator.
- 429 errors are treated as backpressure (reduce concurrency, backoff).
- 5xx/timeouts are retried with bounded budget, then escalate.
- Validation errors (4xx, schema errors) are STOP rules — 0 retries.
- Circuit breakers exist by failure class (auth, rate-limit, platform).
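One way to encode the rules above is a small classifier that maps an HTTP status to an action and a failure class, plus a per-class breaker. The sketch below is illustrative only: Action, classify, and CircuitBreaker are hypothetical names, a status of None stands in for a timeout, and the threshold of 3 is an arbitrary placeholder. Signature and timestamp errors often arrive as 4xx responses with an exchange-specific error code, so they land in a STOP branch either way.

```python
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    OK = "ok"
    STOP = "stop"                  # 0 retries, escalate to operator
    BACKPRESSURE = "backpressure"  # reduce concurrency and back off
    RETRY = "retry"                # bounded retries, then escalate


def classify(status: Optional[int]) -> Tuple[Action, str]:
    """Map an HTTP status (None means timeout) to (action, failure class)."""
    if status is None or status >= 500:
        return Action.RETRY, "platform"
    if 200 <= status < 300:
        return Action.OK, ""
    if status in (401, 403):
        return Action.STOP, "auth"
    if status == 429:
        return Action.BACKPRESSURE, "rate-limit"
    if 400 <= status < 500:
        return Action.STOP, "validation"
    return Action.STOP, "unexpected"


class CircuitBreaker:
    """Counts consecutive failures per class and opens at a threshold."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = {}

    def record_failure(self, failure_class: str) -> bool:
        self.failures[failure_class] = self.failures.get(failure_class, 0) + 1
        return self.is_open(failure_class)

    def record_success(self, failure_class: str) -> None:
        self.failures[failure_class] = 0

    def is_open(self, failure_class: str) -> bool:
        return self.failures.get(failure_class, 0) >= self.threshold
```

Keeping one counter per failure class means a burst of 429s does not trip the breaker that guards against auth failures, and vice versa.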
D) Crash recovery and state
- Crash recovery can reconcile state on restart without double orders.
- Idempotency keys are used for order placement and cancellation.
- Message sequence numbers are tracked for WebSocket gap detection.
- Resync after reconnect is bounded (deltas only, not full state).
- Kill switch exists to stop trading without redeploy.
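The idempotency, reconciliation, and gap-detection items above might look like this in Python. It is a sketch under stated assumptions: the exchange accepts a caller-supplied client order ID, the bot persists its pending IDs before submitting, and sequence numbers increase by exactly 1 per message. The names client_order_id, reconcile, and has_gap are hypothetical.

```python
import hashlib


def client_order_id(strategy: str, signal_id: str) -> str:
    """Derive a deterministic ID so a replayed signal maps to the same order."""
    digest = hashlib.sha256(f"{strategy}:{signal_id}".encode()).hexdigest()
    return f"bot-{digest[:20]}"


def reconcile(pending_ids, exchange_ids) -> dict:
    """Compare locally persisted pending orders with what the exchange reports.

    missing_on_exchange is NOT automatically safe to resubmit: an order may
    have filled or been cancelled already, so check order history first.
    """
    pending, live = set(pending_ids), set(exchange_ids)
    return {
        "already_live": sorted(pending & live),
        "missing_on_exchange": sorted(pending - live),
        "unknown_locally": sorted(live - pending),
    }


def has_gap(last_seq: int, new_seq: int) -> bool:
    """True when a WebSocket message was skipped, repeated, or reordered."""
    return new_seq != last_seq + 1
```

On restart, the bot would run reconcile before placing anything new. Orders in already_live are adopted rather than re-sent, which is what prevents the double order.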
E) Observability
- Every API request logs: endpoint, status, error_code, attempt, latency_ms, concurrency_inflight.
- Every disconnect event logs: close_code, last_message_ago_ms, uptime_seconds.
- Clock offset is logged per signed request with alert threshold at 50% of recvWindow.
- Bot health is monitored (process alive, websocket connected, orders flowing).
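The request log fields and the clock-offset alert above can be sketched as a structured log line plus a threshold check. The field names are the ones listed in the checklist. Emitting JSON and the 5000 ms recvWindow default are assumptions for illustration, and request_log_line and offset_alert are hypothetical names.

```python
import json


def request_log_line(endpoint, status, error_code, attempt, latency_ms,
                     concurrency_inflight) -> str:
    """Serialize the per-request fields as one JSON log line."""
    return json.dumps({
        "endpoint": endpoint,
        "status": status,
        "error_code": error_code,
        "attempt": attempt,
        "latency_ms": latency_ms,
        "concurrency_inflight": concurrency_inflight,
    }, sort_keys=True)


def offset_alert(offset_ms: float, recv_window_ms: float = 5000) -> bool:
    """True when the absolute clock offset reaches 50% of recvWindow."""
    return abs(offset_ms) >= 0.5 * recv_window_ms
```

Alerting at half the window leaves headroom to fix NTP before signed requests start being rejected.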
Canonical: https://matrixtrak.com/resources/bot-reliability-checklist