Bot reliability checklist: 20-point pre-flight for trading bots

Production-readiness checklist for crypto trading bots: rate limits, reconnects, idempotency, crash recovery, clock sync, and incident response.

Free | Jun 09, 2026

Source code

This resource is backed by a public GitHub repository with source code, templates, and documentation you can fork, review, and integrate.

This checklist is the minimum bar for a trading bot that interacts with exchange APIs. Any item you cannot answer "yes" to is a gap that is likely to surface as a production incident.

A) Auth and connectivity

  • API keys have minimum required permissions (read-only where possible).
  • API keys are restricted to specific IP addresses via the exchange's IP whitelisting.
  • Key rotation process exists and is documented.
  • Every signed request uses a fresh timestamp (Date.now()), not a cached or reused value.
  • Clock is synced via NTP every 5-15 minutes (systemd-timesyncd or chronyd).
  • Clock drift is logged per signed request (local time vs exchange server time offset).
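
The timestamp and clock-drift items above can be sketched as follows. This is a minimal illustration, not any exchange's actual signing scheme: it assumes HMAC-SHA256 over a URL-encoded query string, and the field names `timestamp` and `signature` as well as the helpers `now_ms`, `sign_request`, and `clock_offset_ms` are hypothetical. Check your exchange's documentation for the real format.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode


def now_ms() -> int:
    """Read the wall clock at call time; the result is never cached."""
    return int(time.time() * 1000)


def sign_request(params: dict, secret: str, clock=now_ms) -> dict:
    """Return a copy of params with a fresh timestamp and a signature.

    Assumption: the exchange expects HMAC-SHA256 over the URL-encoded
    parameters, with 'timestamp' and 'signature' as field names.
    """
    signed = dict(params)
    signed["timestamp"] = clock()  # fresh value on every call
    query = urlencode(signed)
    signed["signature"] = hmac.new(
        secret.encode(), query.encode(), hashlib.sha256
    ).hexdigest()
    return signed


def clock_offset_ms(local_ms: int, server_ms: int) -> int:
    """Positive means the local clock is ahead of the exchange clock."""
    return local_ms - server_ms
```

Passing the clock in as a parameter keeps the function testable while still reading the time at the moment of signing. The offset would be logged alongside each signed request, with `server_ms` taken from whatever server-time value the exchange exposes.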

B) Rate limiting and backpressure

  • Per-endpoint concurrency caps are enforced (private: 1-2, public: 2-4).
  • 429 responses trigger backoff + jitter, not immediate retry.
  • Retry budgets are bounded (max 2-3 attempts per request).
  • Retries use exponential backoff with jitter (±500ms range).
  • Reconnect attempts are singleflight (one at a time) with jittered backoff.
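
A bounded retry budget with exponential backoff and jitter, as described in the list above, might look like the sketch below. The base delay, cap, and the `send` callable (a hypothetical zero-argument function returning an HTTP status code) are illustrative assumptions. Only the 3-attempt budget and the ±500ms jitter range come from the checklist. Concurrency caps and singleflight reconnects are not covered here.

```python
import random
import time


def backoff_delays(max_attempts=3, base_ms=500, cap_ms=8000,
                   jitter_ms=500, rng=random.random):
    """Return the wait in ms before each retry.

    max_attempts includes the first try, so there are max_attempts - 1 waits.
    """
    delays = []
    for retry in range(max_attempts - 1):
        exp = min(cap_ms, base_ms * (2 ** retry))  # 500, 1000, 2000, ...
        jitter = (rng() * 2 - 1) * jitter_ms       # uniform in [-jitter_ms, +jitter_ms)
        delays.append(max(0, int(exp + jitter)))
    return delays


def call_with_retry(send, max_attempts=3, sleep=time.sleep, rng=random.random):
    """Call send() until the status is not retryable or the budget is spent.

    Returns (last_status, attempts_used). 429 and 5xx count as retryable.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts, rng=rng)
    for attempt in range(1, max_attempts + 1):
        status = send()
        retryable = status == 429 or status >= 500
        if not retryable or attempt == max_attempts:
            return status, attempt
        sleep(delays[attempt - 1] / 1000)  # back off before the next attempt
```

When the budget is exhausted the caller receives the last failing status and can escalate instead of retrying indefinitely.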

C) Error handling

  • Auth failures (401/403, signature/timestamp errors) are STOP rules — 0 retries, escalate to operator.
  • 429 errors are treated as backpressure (reduce concurrency, backoff).
  • 5xx/timeouts are retried with bounded budget, then escalate.
  • Validation errors (other 4xx such as 400/422, schema errors) are STOP rules: 0 retries.
  • Circuit breakers exist by failure class (auth, rate-limit, platform).
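
One way to express the rules above is a classifier that maps each response to a failure class and an action, plus a breaker that counts failures per class. This is a sketch under assumptions: the `Action` names, the `classify` function, and the threshold-based `CircuitBreaker` are hypothetical. Signature and timestamp errors often arrive as exchange-specific error codes rather than 401/403, which this status-only version does not model.

```python
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    PROCEED = "proceed"            # success, nothing to handle
    STOP = "stop"                  # 0 retries, escalate to operator
    BACKPRESSURE = "backpressure"  # reduce concurrency, back off
    RETRY = "retry"                # bounded retry budget, then escalate


def classify(status: Optional[int]) -> Tuple[str, Action]:
    """Map an HTTP status (None means timeout) to (failure_class, action)."""
    if status is None or status >= 500:
        return "platform", Action.RETRY
    if status in (401, 403):
        return "auth", Action.STOP
    if status == 429:
        return "rate-limit", Action.BACKPRESSURE
    if 400 <= status < 500:
        return "validation", Action.STOP
    return "ok", Action.PROCEED


class CircuitBreaker:
    """Opens for a failure class after `threshold` failures without a success."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = {}

    def record(self, failure_class: str) -> None:
        if failure_class == "ok":
            self.failures.clear()  # any success resets all counters
            return
        self.failures[failure_class] = self.failures.get(failure_class, 0) + 1

    def is_open(self, failure_class: str) -> bool:
        return self.failures.get(failure_class, 0) >= self.threshold
```

Keeping a separate counter per class means a burst of rate-limit responses does not trip the breaker that guards against platform outages, and vice versa.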

D) Crash recovery and state

  • Crash recovery can reconcile state on restart without double orders.
  • Idempotency keys are used for order placement and cancellation.
  • Message sequence numbers are tracked for WebSocket gap detection.
  • Resync after reconnect is bounded (deltas only, not full state).
  • Kill switch exists to stop trading without redeploy.
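
Two of the items above lend themselves to a short sketch: deriving an idempotency key from the order intent, and detecting gaps in a sequence-numbered WebSocket stream. The function `client_order_id`, its input fields, and the `SequenceTracker` class are hypothetical. Whether an exchange accepts a client-supplied order ID, and what length or character limits apply, depends on the exchange.

```python
import hashlib
from typing import Optional


def client_order_id(strategy: str, symbol: str, side: str, intent_seq: int) -> str:
    """Derive a deterministic key from the order intent.

    A retry or a restart that replays the same intent produces the same key,
    so the exchange (if it deduplicates on client IDs) can reject the repeat.
    """
    raw = f"{strategy}|{symbol}|{side}|{intent_seq}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]


class SequenceTracker:
    """Detect gaps in a stream whose messages carry increasing sequence numbers."""

    def __init__(self) -> None:
        self.last_seq: Optional[int] = None

    def observe(self, seq: int) -> str:
        if self.last_seq is None or seq == self.last_seq + 1:
            self.last_seq = seq
            return "ok"
        if seq <= self.last_seq:
            return "duplicate"  # stale or replayed message; safe to ignore
        # Gap: last_seq is left unchanged so the caller can request
        # only the missing deltas starting at last_seq + 1.
        return "gap"
```

For example, after observing 10 and 11, a message with sequence 14 returns "gap" and `last_seq` stays at 11, which bounds the resync to messages 12 onward.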

E) Observability

  • Every API request logs: endpoint, status, error_code, attempt, latency_ms, concurrency_inflight.
  • Every disconnect event logs: close_code, last_message_ago_ms, uptime_seconds.
  • Clock offset is logged per signed request with alert threshold at 50% of recvWindow.
  • Bot health is monitored (process alive, websocket connected, orders flowing).
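
The request log fields and the clock-offset alert listed above could be implemented roughly as follows. The field names come from the checklist. The JSON-lines format, the function names, and the example values are assumptions for illustration.

```python
import json


def request_log_line(endpoint, status, error_code, attempt,
                     latency_ms, concurrency_inflight) -> str:
    """Serialize one API request as a single JSON log line."""
    return json.dumps({
        "endpoint": endpoint,
        "status": status,
        "error_code": error_code,
        "attempt": attempt,
        "latency_ms": latency_ms,
        "concurrency_inflight": concurrency_inflight,
    }, sort_keys=True)


def clock_offset_alert(offset_ms: int, recv_window_ms: int) -> bool:
    """True once the absolute offset reaches 50% of recvWindow."""
    return abs(offset_ms) >= recv_window_ms * 0.5
```

With a recvWindow of 5000 ms (an example value, not a recommendation), an offset of 2500 ms in either direction would raise the alert while 1000 ms would not.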

Canonical: https://matrixtrak.com/resources/bot-reliability-checklist