
Category: Crypto Automation
Signature invalid but bot was working: why clock drift breaks auth suddenly
When bot gets signature invalid or 401 after working fine for hours: why clock drift breaks exchange auth suddenly, and the time calibration that prevents it.
Free download: Timestamp drift runbook + logging schema. Jump to the download section.
If your bot suddenly starts throwing signature invalid or timestamp out of range, it’s tempting to go hunting for a bug in your signing code.
Sometimes that’s correct.
But in production, a surprisingly large share of “signature bugs” are not code bugs at all. They’re time bugs.
Your bot’s clock drifts, the exchange enforces a strict timestamp window, and everything looks fine… until it doesn’t.
- it ran for hours
- you redeployed “the same code”
- and now private endpoints are failing with 401/403
This is not a tutorial. It is an incident playbook for operators running crypto automation in production.
This post shows you how to:
- diagnose timestamp drift quickly (without guessing)
- stop retry patterns that escalate blocks
- implement the boring fixes that keep bots alive
This post is in the Crypto Automation hub and the Crypto Automation category.
The short version:
- Stop retries on auth/timestamp failures (401/403/signature/recvWindow). Open an auth breaker and halt trading.
- Measure drift (offset_ms) against exchange time and log it with RTT and a deploy marker (signing_version).
- Gate startup: don’t enable private endpoints until time sync is healthy and offset is calibrated.
Fast triage table (what to check first)
| Symptom | Likely cause | Confirm fast | First safe move |
|---|---|---|---|
| Public endpoints fine; private signed endpoints spike 401/403 | Timestamp/signature path failing (not general connectivity) | Status codes split by endpoint class | Stop retries; open auth breaker; halt trading until classified |
| Errors start right after deploy/scale-out | Startup signing before time sync/offset is stable | Deploy marker timestamp aligns with first failures | Add startup gate; calibrate before enabling private endpoints |
| “Timestamp out of range” / recvWindow messages | Clock drift or time jump (pause/resume) | Call serverTime; compute offset_ms | Calibrate offset; fix host time sync; resume gradually |
| “Invalid signature” but offset is small/stable | Signing bug or key/secret mismatch | Drift is within window; failures persist | Stop trading; compare signing_version; verify key/secret and canonicalization |
| Multiple instances disagree (some succeed, some fail) | Fleet skew (different offsets per host) | Offset_ms differs by instance | Alert on offset; enforce per-host time sync + gating |
Why signature errors happen suddenly: clock drift after working for hours
An on-call engineer gets paged for 401 and 403 spikes on private endpoints. Public market data is fine.
The team rotates keys. Nothing changes. They redeploy a hotfix that touches nothing in signing. Still failing.
The root cause is boring: the instance clock is off by seconds after a pause/resume event, and every signed request is now outside the exchange tolerance window.
The failure mode (what’s actually happening)
Most exchange auth flows include a timestamp (or nonce) in the signed payload.
The exchange uses that timestamp for two things:
- Replay protection: it prevents someone from capturing a valid request and replaying it later.
- Abuse control: it prevents clients from sending stale traffic that looks like automated scanning.
That means the exchange must decide whether your timestamp is “close enough” to its own time.
If your local clock is ahead/behind by a few seconds, you’ll get errors like:
- timestamp_out_of_range
- Timestamp for this request is outside of the recvWindow
- invalid signature (because the exchange refuses to evaluate it)
The important part: the bot can be perfectly healthy in every other way. This is why it feels random.
What causes random signature failures: VM pause, restart, and time sync lag
Timestamp drift typically shows up after a real-world event, not after a code change:
- a VM host pauses/resumes
- an instance boots and time sync isn’t ready yet
- a container restarts on a host with drift
- your fleet scales out and new instances come online with uneven time
- NTP/Chrony loses upstream connectivity and slowly drifts
Engineers often miss it because:
- local development machines keep time reasonably well
- the signing code “looks deterministic”
- the error message points at the signature, not the clock
How to diagnose timestamp drift: offset measurement and deploy correlation
This ladder is designed to prevent the biggest operational mistake: treating auth failures like transient errors and retrying them.
1) Classify the error (don’t guess)
Put the incident into one bucket:
- timestamp: recvWindow / timestamp out of range / nonce too old
- signature: invalid signature with stable time
- permission: key scope / 403 forbidden
- platform: 5xx or gateway issues
If your logs don’t contain enough detail to classify it, your first fix is observability (see the shipped logging fields).
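If you want that bucketing to be mechanical instead of a judgment call at 3 a.m., a small classifier helps. This is a minimal sketch; the matched substrings are illustrative, so replace them with the error codes your exchange actually returns.

```ts
type AuthErrorBucket = "timestamp" | "signature" | "permission" | "platform" | "unknown";

// Map a failed response into one of the triage buckets above.
// The matched substrings are placeholders; use your exchange's real error codes.
export function classifyAuthError(status: number, message: string): AuthErrorBucket {
  const m = message.toLowerCase();
  if (status >= 500) return "platform";
  if (m.includes("recvwindow") || m.includes("timestamp") || m.includes("nonce")) return "timestamp";
  if (m.includes("signature")) return "signature";
  if (status === 403 || m.includes("permission") || m.includes("scope")) return "permission";
  return "unknown";
}
```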
2) Check “did we deploy or scale?”
Time incidents correlate with deploy/scale events because:
- new instances start signing requests immediately
- time sync may not be stable yet
- concurrency increases, which amplifies the blast radius
If errors started within minutes of a deploy, assume time is a candidate.
3) Check whether failures are endpoint-scoped
A useful signal:
- only private signed endpoints fail
- public market data still works
That points strongly to a signing/timestamp issue, not general connectivity.
4) Measure drift (don’t debate it)
If the exchange provides a serverTime endpoint, use it.
Compute:
offset_ms = server_time_ms - local_time_ms
If the offset magnitude is larger than the exchange’s window, the incident is explained.
5) Decide the safe behavior
If it’s auth/timestamp:
- stop retrying
- open an auth breaker
- halt trading
If it’s a 5xx/platform outage:
- backoff + jitter
- reduce concurrency
These are different problems and require different handling.
Stop signature auth failures: time calibration, startup gates, and fail-fast rules
1) Never retry auth failures
This one rule prevents a lot of “we got blocked” escalations.
When the bot is sending invalid signed requests repeatedly, the exchange sees a pattern that looks like an attacker.
Policy:
- 401/403, signature invalid, timestamp errors: fail fast
- open breaker, alert, require investigation
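One way to encode the policy is a small auth breaker that the request layer checks before every signed call. The class and method names here are illustrative, not from any exchange SDK; the behavior that matters is that it trips on the first auth/timestamp failure and only an operator resets it.

```ts
// Minimal auth breaker: trips on the first auth/timestamp failure and stays
// open until an operator resets it. No automatic half-open retries.
export class AuthBreaker {
  private open = false;
  private reason?: string;

  isOpen(): boolean {
    return this.open;
  }

  // Call from your HTTP error handler on 401/403, signature, or recvWindow errors.
  trip(reason: string): void {
    this.open = true;
    this.reason = reason;
    // Hook your alerting here (pager, chat, incident tooling).
    console.error(`auth breaker OPEN: ${reason} - halting signed requests`);
  }

  // Manual reset after investigation; never reset on a timer.
  reset(): void {
    this.open = false;
    this.reason = undefined;
  }
}
```

Before sending any signed request, check isOpen() and refuse to send if the breaker is open; on an auth or timestamp error, call trip() instead of scheduling a retry.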
2) Ensure host time sync is real (not assumed)
Many teams think they have time sync “because the OS does it”. In practice, fleets drift when:
- upstream NTP is flaky
- instances are snapshotted/restored
- containers are scheduled on unhealthy hosts
The fix is operational:
- verify time sync service health
- verify the last sync is recent
- verify the offset is stable
If you can’t measure it, you can’t trust it.
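Verification can be automated. On systemd hosts, for example, timedatectl reports whether NTP is synchronized; the sketch below assumes that tool is available and should be adapted for chrony (chronyc tracking) or whatever your fleet actually runs.

```ts
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Returns true only when the host reports NTP synchronization.
// Sketch for systemd hosts; swap in `chronyc tracking` parsing if you use chrony.
export async function hostTimeSyncHealthy(): Promise<boolean> {
  try {
    const { stdout } = await run("timedatectl", ["show", "--property=NTPSynchronized", "--value"]);
    return stdout.trim() === "yes";
  } catch {
    // If we cannot even ask, treat time as unhealthy and fail closed.
    return false;
  }
}
```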
3) Add exchange-time calibration (recommended)
If the exchange provides serverTime, you can harden your bot by calibrating:
- call serverTime
- compute offset_ms
- apply the offset to all signed request timestamps
This turns “clock drift” into “offset tracking”, which is much easier to control.
Operational rules:
- recalibrate every 5 to 15 minutes
- recalibrate on restarts
- recalibrate after sleep/resume events
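A sketch of those rules as a maintenance loop, assuming a calibrate function like the RTT-corrected probe shown later in this post; the interval and the time-jump heuristic are assumptions to tune against your exchange's tolerance window.

```ts
// Periodic offset maintenance. `calibrate` is an RTT-corrected probe
// (see calibrateOffsetMs later in this post); defaults here are illustrative.
export function startOffsetMaintenance(
  calibrate: () => Promise<{ offsetMs: number; rtt: number }>,
  onOffset: (offsetMs: number, rtt: number) => void,
  intervalMs = 10 * 60 * 1000, // recalibrate every 10 minutes
): ReturnType<typeof setInterval> {
  let lastTick = Date.now();

  const tick = async () => {
    const now = Date.now();
    // A tick arriving far later than scheduled suggests a pause/resume or
    // migration: the previous offset can no longer be trusted.
    const suspectedJump = now - lastTick > intervalMs * 2;
    lastTick = now;
    try {
      const { offsetMs, rtt } = await calibrate();
      onOffset(offsetMs, rtt); // publish as a metric so drift stays visible
      console.log(JSON.stringify({ event: "offset_calibrated", offsetMs, rtt, suspectedJump }));
    } catch (err) {
      console.error("offset calibration failed", err);
    }
  };

  void tick(); // calibrate once immediately (startup / restart)
  return setInterval(tick, intervalMs);
}
```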
4) Delay startup until time is sane
Many bots fail right after deploy because they start signing requests immediately.
A safer startup sequence:
- verify host time sync is running
- calibrate exchange offset
- only then enable private endpoints / trading
This single change removes a lot of post-deploy incidents.
5) Make drift visible (so it stops being “random”)
Add one dashboard chart:
applied_offset_ms over time
When that line spikes or oscillates, you’ve found the problem before the exchange blocks you.
recvWindow: what it is (and what it isn’t)
Many exchanges expose a parameter called recvWindow (or similar). It’s easy to misunderstand.
Think of recvWindow as a tolerance window that says: “accept my request if my timestamp is within this many milliseconds of your server time.”
It helps with minor jitter.
It does not fix:
- a host clock that’s drifting steadily
- a fleet where some instances are +4s and others are -3s
- instances that jump time after pause/resume
Practical guidance:
- Keep recvWindow small and sensible (exchange-specific).
- Treat increases as a temporary mitigation, not a root-cause fix.
- If you need a huge window, you are not solving time. You are hiding it.
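For concreteness, this is roughly how recvWindow travels on a Binance-style signed request: it rides along as a query parameter next to a timestamp you have already corrected with your measured offset, and the signature is an HMAC-SHA256 over the query string. Parameter names and signing schemes vary by exchange, so treat this as a sketch, not a drop-in client.

```ts
import { createHmac } from "node:crypto";

// Binance-style signed query sketch: timestamp + recvWindow as parameters,
// HMAC-SHA256 of the query string as the signature. Adapt names per exchange.
export function buildSignedQuery(
  params: Record<string, string>,
  apiSecret: string,
  offsetMs: number,
  recvWindowMs = 5000, // keep this small; it is a tolerance, not a fix
): string {
  const query = new URLSearchParams({
    ...params,
    recvWindow: String(recvWindowMs),
    timestamp: String(Date.now() + offsetMs), // calibrated, not raw local time
  }).toString();
  const signature = createHmac("sha256", apiSecret).update(query).digest("hex");
  return `${query}&signature=${signature}`;
}
```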
Implementing exchange-time offset safely
If the exchange provides serverTime, calibrating an offset is the single highest leverage fix you can ship.
The naïve approach is:
- call serverTime
- subtract your local time
But that ignores network latency. A better approach is to measure the request round-trip time (RTT) and assume the server timestamp corresponds roughly to the midpoint.
RTT-corrected offset
Let:
- t0 = local time before the request
- ts = server time from the response
- t1 = local time after the response
Approximate the local time at server timestamp as t0 + (t1 - t0)/2.
Then:
offset_ms = ts - (t0 + (t1 - t0)/2)
Here’s a TypeScript sketch:
```ts
type ServerTimeResponse = { serverTime: number };

// Measure the exchange/local clock offset, correcting for network latency by
// assuming the server timestamp corresponds to the midpoint of the round trip.
export async function calibrateOffsetMs(
  fetchServerTime: () => Promise<ServerTimeResponse>,
): Promise<{ offsetMs: number; rtt: number }> {
  const t0 = Date.now();
  const { serverTime } = await fetchServerTime();
  const t1 = Date.now();

  const rtt = t1 - t0;
  const midpointLocal = t0 + Math.floor(rtt / 2);
  const offsetMs = serverTime - midpointLocal;

  return { offsetMs, rtt };
}

// Apply the calibrated offset to every signed request timestamp.
export function signedTimestampMs(offsetMs: number): number {
  return Date.now() + offsetMs;
}
```

Operational notes:
- If RTT is huge or unstable, the offset will be noisy. Log rtt and treat high RTT as an infrastructure symptom.
- Do not recalibrate on every request. Calibrate periodically (5 to 15 minutes) and on restart.
- Store the offset in memory (or a small shared cache if you coordinate across instances).
Startup gating (the simplest way to stop post-deploy incidents)
Most “it broke right after deploy” stories are caused by a bad startup sequence.
If your bot enables trading as soon as the process starts, it’s signing requests during the noisiest time:
- time sync might not be stable
- offset not calibrated yet
- caches cold, so you’re about to burst several endpoints
A safer startup gate:
- Verify time sync service health (host-level)
- Calibrate exchange offset (application-level)
- Warm caches (exchange info, symbols, permissions)
- Enable private endpoints
- Enable trading
If step (1) or (2) fails, fail closed. Don’t “try your luck” with live keys.
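A sketch of that gate as a single startup function. checkHostTimeSync, warmCaches, and the 1-second offset threshold are placeholders for your own infrastructure hooks and limits, not a specific API.

```ts
// Fail-closed startup gate: verify time health before enabling private endpoints.
// The dependency functions are placeholders for your own infrastructure hooks.
export async function startupGate(deps: {
  checkHostTimeSync: () => Promise<boolean>;                   // e.g. the timedatectl/chrony check above
  calibrate: () => Promise<{ offsetMs: number; rtt: number }>; // exchange serverTime probe
  warmCaches: () => Promise<void>;                             // exchange info, symbols, permissions
  maxStartupOffsetMs?: number;
}): Promise<{ offsetMs: number }> {
  const maxOffset = deps.maxStartupOffsetMs ?? 1000;

  if (!(await deps.checkHostTimeSync())) {
    throw new Error("startup gate: host time sync unhealthy - refusing to enable trading");
  }

  const { offsetMs, rtt } = await deps.calibrate();
  if (Math.abs(offsetMs) > maxOffset) {
    throw new Error(`startup gate: offset ${offsetMs}ms exceeds ${maxOffset}ms (rtt=${rtt}ms)`);
  }

  await deps.warmCaches();
  return { offsetMs }; // only now enable private endpoints / trading
}
```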
Incident playbook (10 minutes)
When the alert fires:
- Stop retries on auth/timestamp immediately
- Open auth breaker and halt trading
- Measure offset with serverTime (log offset_ms + rtt_ms)
- Check the deploy marker (signing_version) to rule out a signing regression
- Confirm host time sync is enabled and stable
If you can’t do step (3), that’s a gap worth fixing first.
What to log (minimum viable)
Signature incidents are painful when you only log “401”.
Log enough to answer two questions immediately:
- Is the timestamp wrong?
- Did this start after a deploy/config change?
Minimum fields:
- request_id
- endpoint
- status
- error_code / error_message
- local_ts_ms
- server_ts_ms (if available)
- applied_offset_ms
- recv_window_ms
- signing_version (deploy marker)
Example incident log shape:
```json
{
  "ts": "2026-01-15T09:41:12.345Z",
  "bot_instance_id": "prod-1",
  "exchange": "example",
  "endpoint": "private/order/create",
  "status": 401,
  "error_code": "timestamp_out_of_range",
  "error_message": "outside recvWindow",
  "local_ts_ms": 1768489272345,
  "server_ts_ms": 1768489266122,
  "applied_offset_ms": -6223,
  "recv_window_ms": 5000,
  "signing_version": "2026-01-15.2",
  "request_id": "req-abc"
}
```

This is enough to answer the only question that matters in the first minute:
Is it time drift, or did we ship a signing change?
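If you want those fields enforced rather than remembered, a minimal type for the incident record might look like this; the field names mirror the example above.

```ts
// Minimal incident log record for signature/timestamp failures.
export interface SignatureIncidentLog {
  ts: string;                // ISO timestamp of the log event
  bot_instance_id: string;
  exchange: string;
  endpoint: string;
  status: number;
  error_code: string;
  error_message: string;
  local_ts_ms: number;
  server_ts_ms?: number;     // if the exchange exposes serverTime
  applied_offset_ms: number;
  recv_window_ms: number;
  signing_version: string;   // deploy marker
  request_id: string;
}
```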
Shipped asset
Time sync runbook + logging fields
Operational runbook you can hand to on-call engineers. Logging field list makes timestamp incidents diagnosable in minutes, not hours.
Use it when:
- You see intermittent signature/timestamp failures that correlate with deploys, restarts, or VM pause/resume.
- You need on-call to classify “time drift vs signing regression” within minutes.
- You want a repeatable startup gate and a minimal log schema for signature incidents.
Fix these gaps first if:
- You can’t measure server time/offset (add a serverTime probe or infrastructure time checks first).
- You treat auth failures like transient errors (stop retries and add breaker behavior first).
- You don’t have a deploy marker in logs (add signing_version / deploy_id first).
This is intentionally compact here. Full package details are on the resource page.
What you get (2 files):
time-sync-runbook.md - Incident response procedure
- Detection: what symptoms indicate time drift (signature failures on valid requests)
- Diagnosis: which systems to check first (NTP, clock sources, OS drift)
- Verification: how to confirm drift before mitigation (query exchange timestamp)
- Mitigation: immediate actions (sync clocks, restart services, escalate)
- Prevention: post-incident checks (monitoring, alerting, constraints)
logging-fields-for-signature-incidents.md - Required log fields
- Timestamp fields: request_time, server_time, local_time, drift_ms
- Context: exchange, endpoint, key_id, request_signature
- Diagnosis: ntp_offset_ms, system_clock_source, drift_direction
- Resolution: action_taken, time_corrected, incident_duration_ms
Quick reference (what to log):
```json
{
  "timestamp": "2026-01-27T14:30:20.123Z",
  "event_type": "signature_verification_failed",
  "exchange": "binance",
  "local_time_ms": 1706347820000,
  "server_time_ms": 1706347775000,
  "drift_ms": -45000,
  "ntp_status": "unsynchronized",
  "action": "time_sync_triggered"
}
```

What this solves:
- Gives on call a repeatable procedure during signature incidents
- Makes drift diagnosable with queryable logs, not guesswork
- Helps you stop harmful retries before they escalate blocks
Trading Bot Hardening Suite: Production-Ready Crypto Infrastructure
Running production trading bots? Get exchange-specific rate limiters, signature validation, and incident recovery playbooks. Stop losing money to preventable API failures.
- ✓ Exchange-specific rate limiting (Binance, Coinbase, Kraken, Bybit)
- ✓ Signature validation & timestamp drift detection
- ✓ API ban prevention patterns & key rotation strategies
- ✓ Incident runbooks for 429s, signature errors, and reconnection storms
Common mistakes (that keep repeating)
These mistakes are common because time is “invisible” until it breaks. The goal is to make time visible and treat it like any other dependency.
- Using recvWindow as a band-aid
A larger recvWindow can reduce false failures, but it doesn’t fix unstable time. If your offset jumps around, you still fail.
- Retrying auth failures
Auth failures are the wrong class of error to retry. They’re a “stop and investigate” signal.
- No deploy marker in logs
Without signing_version/deploy_id, you can’t quickly separate “time drift” from “signing code changed”.
- No offset metric
If you aren’t graphing offset, drift will always look random.
In practice, teams that fix this permanently do two things: they calibrate offset against exchange time and they alert on offset spikes.
Troubleshooting Questions Engineers Search
Why did my bot suddenly start getting “signature invalid” errors when nothing changed?
Because your server clock drifted away from exchange time while the bot was running. Common triggers: VM pause/resume, container migration, NTP upstream loss, or gradual drift on hosts under load. The signing code didn't change—the timestamp you're sending is now outside the exchange's tolerance window (often 5-30 seconds). The exchange rejects it as "signature invalid" or "timestamp out of range."
How do I tell whether it’s clock drift or a signing bug?
Measure the offset. If the exchange provides a serverTime endpoint, call it and compute: offset_ms = server_time - local_time. If the offset magnitude exceeds the exchange window (e.g., 5+ seconds), it's drift. If offset is small but you still get 401, investigate signing logic. Log: local timestamp, server timestamp, offset, and deploy version so you can separate time issues from code regressions.
Why do the failures start right after a deploy or restart?
Because time sync may not be stable yet when the bot starts signing requests. NTP/Chrony takes a few seconds to sync after boot. If your bot enables trading immediately, it's sending signed requests with potentially wrong timestamps. The fix: gate startup—verify time sync is healthy and calibrate exchange offset before enabling private endpoints.
Does increasing recvWindow fix it?
Only partially and temporarily. recvWindow is a tolerance buffer—it helps with minor jitter, but doesn't fix unstable clocks or hosts that drift steadily. If your offset oscillates or keeps growing, a larger window just masks the problem. The durable fix: ensure host time sync is reliable (NTP/Chrony) and/or calibrate against exchange serverTime so you're tracking offset instead of guessing.
Why does this happen in production but not locally?
Because production hosts drift in ways dev machines don't. VM pause/resume, noisy neighbors, bad host clocks, missing NTP upstream, cold starts, and container reschedules are common in production but rare locally. Your laptop usually has stable time sync and a single process. Production fleets experience time jumps that dev can't reproduce.
Should the bot retry signature or timestamp errors?
No. Treat signature and timestamp errors as stop rules, not transient failures. If you keep retrying invalid auth requests, the exchange sees a pattern that looks like an attacker or broken client. This can escalate to blocks. The safe action: open an auth circuit breaker, halt trading, log offset measurements, and require investigation before resuming.
How do I verify the fix worked?
Make offset visible. Track applied_offset_ms (or NTP offset) as a metric and alert on spikes. After the fix, signature failures should drop to near-zero. In logs, you should see: offset measurements stable, no timestamp errors, and any remaining auth failures correlated with specific drift events (not random). Graph offset over time—a stable line means the problem is solved.
FAQ
This is where teams usually get stuck when they try to “just fix the timestamp”.
Can I just bump recvWindow and move on?
Sometimes it reduces noise, but it’s not a real fix. recvWindow is a tolerance buffer; it doesn’t correct unstable clocks, jumpy VMs, or hosts that drift under load.
If you’re consistently outside the window, the only durable fixes are: make host time sync reliable and/or calibrate against exchange serverTime and log your offset.
How often should I recalibrate the exchange-time offset?
Start with every 5 to 15 minutes and on every process restart. Also recalibrate after events that cause time jumps (sleep/resume, VM migration, long GC pauses, container reschedules).
If you see offset oscillation, log RTT and investigate network instability; noisy calibration can be a symptom of infrastructure issues.
What if the exchange doesn’t expose a serverTime endpoint?
Then your only reliable option is to treat time sync as infrastructure. Ensure NTP/Chrony is configured correctly, alert on offset if you can, and gate startup until time is sane.
In that world, logging becomes even more important: record local timestamps and deploy markers so you can separate time drift from signing regressions.
Why does this only show up in production?
Production hosts drift in ways your laptop rarely does.
VM pause/resume, noisy neighbors, bad host clocks, missing NTP upstream connectivity, and cold start behavior are common triggers. Local dev tends to have stable time sync and a single process.
What should the bot do when it hits an auth or timestamp failure?
Fail closed.
Treat signature and timestamp errors as stop rules. Open an auth breaker, halt trading, and require investigation. If you keep sending invalid signed requests you can escalate to blocks.
How do I know the fix actually worked?
Make time visible.
Track applied offset (or NTP offset) as a metric, and alert on spikes. In logs, you should see signature failures drop to near zero, and you should be able to correlate any remaining errors with concrete drift measurements.
Checklist (copy/paste)
- Auth/timestamp failures are STOP rules (0 retries) and open an auth circuit breaker.
- We can compute and log offset_ms (exchange server time vs local time) plus rtt_ms.
- Logs include a deploy marker (signing_version / deploy_id) to separate drift from code regression.
- Startup is gated: time sync health + offset calibration before enabling private endpoints/trading.
- A metric exists for applied_offset_ms and alerts trigger on spikes (e.g., > 5s) or NTP unsynchronized.
- Fleet skew is detectable (offset differs by instance); we can identify the bad host quickly.
- recvWindow is treated as a small tolerance, not a root-cause fix.
- Post-incident, we verify time sync configuration and test by intentionally skewing a non-prod host.
Resources
This is intentionally compact. Full package details are on the resource page.
- Timestamp drift runbook + logging schema
- Crypto Automation hub
- Axiom (Coming Soon)
- Exchange API bans: how to prevent them
- Trading bot keeps getting 429s after deploy: stop rate limit storms - deploy-incident correlation
- The real cost of retry logic: when resilience makes outages worse - never retry auth failures
External references:
Coming soon
If you want more production-ready runbooks and templates like this, the Axiom waitlist is the right place.
Axiom (Coming Soon)
Get notified when we ship real operational assets (runbooks, templates, benchmarks), not generic tutorials.
Key takeaways
Timestamp drift is a boring problem with expensive consequences.
Fix it once by adopting three behaviors:
- treat auth failures as fail-fast (no retries)
- measure and log offset so the incident is obvious
- calibrate against exchange time so drift stops being a production roulette wheel
For more production bot operations work, see the Crypto Automation category.
Recommended resources
Download the shipped checklist/templates for this post.
On-call runbook for signature error incidents. Logging schema makes clock drift diagnostics fast. Know what to check in minutes, not hours.
Related posts

API key suddenly forbidden: why exchange APIs ban trading bots without warning
When API key flips from working to 403 forbidden after bot runs for hours: why exchange APIs ban trading bots for traffic bursts, retry storms, and auth failures, and the client behavior that prevents it.

Trading bot keeps getting 429s after deploy: stop rate limit storms
When deploys trigger 429 storms: why synchronized restarts amplify rate limits, how to diagnose fixed window vs leaky bucket, and guardrails that stop repeat incidents.

Agent keeps calling same tool: why autonomous agents loop forever in production
When agent loops burn tokens calling same tool repeatedly and cost spikes: why autonomous agents loop without stop rules, and the guardrails that prevent repeat execution and duplicate side effects.