Signature invalid but bot was working: why clock drift breaks auth suddenly

Jan 09, 2026 · 16 min read


Category: Automation, Crypto


When your bot gets signature invalid or 401 after working fine for hours: why clock drift breaks exchange auth suddenly, and the time calibration that prevents it.

Free download: Timestamp drift runbook + logging schema. Jump to the download section.

If your bot suddenly starts throwing signature invalid or timestamp out of range, it’s tempting to go hunting for a bug in your signing code.

Sometimes that’s correct.

But in production, a surprisingly large share of “signature bugs” are not code bugs at all. They’re time bugs.

Your bot’s clock drifts, the exchange enforces a strict timestamp window, and everything looks fine… until it doesn’t. The story is usually the same:

  • it ran for hours
  • you redeployed “the same code”
  • and now private endpoints are failing with 401/403

This is not a tutorial. It is an incident playbook for operators running crypto automation in production.

This post shows you how to:

  • diagnose timestamp drift quickly (without guessing)
  • stop retry patterns that escalate blocks
  • implement the boring fixes that keep bots alive

If you only do three things
  • Stop retries on auth/timestamp failures (401/403/signature/recvWindow). Open an auth breaker and halt trading.
  • Measure drift (offset_ms) against exchange time and log it with RTT and a deploy marker (signing_version).
  • Gate startup: don’t enable private endpoints until time sync is healthy and offset is calibrated.

Fast triage table (what to check first)

| Symptom | Likely cause | Confirm fast | First safe move |
| --- | --- | --- | --- |
| Public endpoints fine; private signed endpoints spike 401/403 | Timestamp/signature path failing (not general connectivity) | Status codes split by endpoint class | Stop retries; open auth breaker; halt trading until classified |
| Errors start right after deploy/scale-out | Startup signing before time sync/offset is stable | Deploy marker timestamp aligns with first failures | Add startup gate; calibrate before enabling private endpoints |
| “Timestamp out of range” / recvWindow messages | Clock drift or time jump (pause/resume) | Call serverTime; compute offset_ms | Calibrate offset; fix host time sync; resume gradually |
| “Invalid signature” but offset is small/stable | Signing bug or key/secret mismatch | Drift is within window; failures persist | Stop trading; compare signing_version; verify key/secret and canonicalization |
| Multiple instances disagree (some succeed, some fail) | Fleet skew (different offsets per host) | offset_ms differs by instance | Alert on offset; enforce per-host time sync + gating |

Why signature errors happen suddenly: clock drift after working for hours

An on-call engineer gets paged for 401 and 403 spikes on private endpoints. Public market data is fine.

The team rotates keys. Nothing changes. They redeploy a hotfix that touches nothing in signing. Still failing.

The root cause is boring: the instance clock is off by seconds after a pause/resume event, and every signed request is now outside the exchange tolerance window.


The failure mode (what’s actually happening)

Most exchange auth flows include a timestamp (or nonce) in the signed payload.

The exchange uses that timestamp for two things:

  1. Replay protection: it prevents someone from capturing a valid request and replaying it later.
  2. Abuse control: it prevents clients from sending stale traffic that looks like automated scanning.

That means the exchange must decide whether your timestamp is “close enough” to its own time.

If your local clock is ahead/behind by a few seconds, you’ll get errors like:

  • timestamp_out_of_range
  • Timestamp for this request is outside of the recvWindow
  • invalid signature (because the exchange refuses to evaluate it)

The important part: the bot can be perfectly healthy in every other way. This is why it feels random.
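
To make that concrete, here is a minimal sketch of how a timestamp typically ends up inside an HMAC-signed request (Binance-style signing of the query string; the parameters and secret handling are illustrative, not any specific exchange’s exact contract):

```ts
import { createHmac } from "node:crypto";

// Illustrative only: exact parameter names and signing rules vary by exchange.
function signQuery(params: Record<string, string | number>, secret: string): string {
  const query = new URLSearchParams(
    Object.entries(params).map(([k, v]) => [k, String(v)])
  ).toString();
  const signature = createHmac("sha256", secret).update(query).digest("hex");
  return `${query}&signature=${signature}`;
}

// The timestamp is part of the signed payload, so a drifting clock invalidates
// the request even though the signing code itself has not changed.
const signed = signQuery(
  { symbol: "BTCUSDT", side: "BUY", timestamp: Date.now(), recvWindow: 5000 },
  "YOUR_API_SECRET" // placeholder
);
```

If Date.now() is several seconds away from the exchange clock, the timestamp field is rejected before the signature itself is ever evaluated.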


What causes random signature failures: VM pause, restart, and time sync lag

Timestamp drift typically shows up after a real-world event, not after a code change:

  • a VM host pauses/resumes
  • an instance boots and time sync isn’t ready yet
  • a container restarts on a host with drift
  • your fleet scales out and new instances come online with uneven time
  • NTP/Chrony loses upstream connectivity and slowly drifts

Engineers often miss it because:

  • local development machines keep time reasonably well
  • the signing code “looks deterministic”
  • the error message points at the signature, not the clock

How to diagnose timestamp drift: offset measurement and deploy correlation

This ladder is designed to prevent the biggest operational mistake: treating auth failures like transient errors and retrying them.

1) Classify the error (don’t guess)

Put the incident into one bucket:

  • timestamp: recvWindow / timestamp out of range / nonce too old
  • signature: invalid signature with stable time
  • permission: key scope / 403 forbidden
  • platform: 5xx or gateway issues

If your logs don’t contain enough detail to classify it, your first fix is observability (see the shipped logging fields).
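
If you want this to be mechanical rather than a judgment call, a small classifier helps. A sketch, assuming you have the HTTP status and the exchange’s error message at hand (the bucket names mirror the list above):

```ts
type AuthFailureBucket = "timestamp" | "signature" | "permission" | "platform" | "unknown";

export function classifyAuthFailure(status: number, message: string): AuthFailureBucket {
  const m = message.toLowerCase();
  if (/recvwindow|timestamp|nonce/.test(m)) return "timestamp";
  if (/signature/.test(m)) return "signature";
  if (status === 403 || /permission|scope|forbidden/.test(m)) return "permission";
  if (status >= 500) return "platform";
  return "unknown"; // unknown should be treated like an auth failure: fail closed
}
```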

2) Check “did we deploy or scale?”

Time incidents correlate with deploy/scale events because:

  • new instances start signing requests immediately
  • time sync may not be stable yet
  • concurrency increases, which amplifies the blast radius

If errors started within minutes of a deploy, assume time is a candidate.

3) Check whether failures are endpoint-scoped

A useful signal:

  • only private signed endpoints fail
  • public market data still works

That points strongly to a signing/timestamp issue, not general connectivity.

4) Measure drift (don’t debate it)

If the exchange provides a serverTime endpoint, use it.

Compute:

  • offset_ms = server_time_ms - local_time_ms

If the offset magnitude is larger than the exchange’s window, the incident is explained.
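
In code, that check is a single comparison. A sketch, assuming you already have the exchange’s server time in milliseconds and know its tolerance window:

```ts
export function isExplainedByDrift(serverTimeMs: number, toleranceMs = 5000): boolean {
  // offset_ms = server_time_ms - local_time_ms
  const offsetMs = serverTimeMs - Date.now();
  // If the offset magnitude exceeds the tolerance window, time explains the failures.
  return Math.abs(offsetMs) > toleranceMs;
}
```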

5) Decide the safe behavior

If it’s auth/timestamp:

  • stop retrying
  • open an auth breaker
  • halt trading

If it’s a 5xx/platform outage:

  • backoff + jitter
  • reduce concurrency

These are different problems and require different handling.
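
One way to make sure that split is never “forgotten” under pressure is to encode it in a single retry-policy function. A sketch, reusing the buckets from step 1 (the names and limits are illustrative):

```ts
type AuthFailureBucket = "timestamp" | "signature" | "permission" | "platform" | "unknown";

type RetryDecision =
  | { action: "halt"; reason: string }       // open auth breaker, stop trading
  | { action: "retry"; backoffMs: number };  // platform/5xx problems only

export function decideRetry(bucket: AuthFailureBucket, attempt: number): RetryDecision {
  // Only platform errors are worth retrying; everything else fails closed.
  if (bucket !== "platform") {
    return { action: "halt", reason: `${bucket} failure: open breaker and investigate` };
  }
  const capMs = Math.min(30_000, 1_000 * 2 ** attempt); // exponential backoff, capped at 30s
  return { action: "retry", backoffMs: Math.floor(Math.random() * capMs) }; // full jitter
}
```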


Stop signature auth failures: time calibration, startup gates, and fail-fast rules

1) Never retry auth failures

This one rule prevents a lot of “we got blocked” escalations.

When the bot is sending invalid signed requests repeatedly, the exchange sees a pattern that looks like an attacker.

Policy (a minimal breaker sketch follows this list):

  • 401/403, signature invalid, timestamp errors: fail fast
  • open breaker, alert, require investigation
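
The breaker itself doesn’t need to be sophisticated; a latch that blocks signed requests until an operator clears it is enough. A minimal sketch (class and method names are made up for illustration):

```ts
export class AuthBreaker {
  private open = false;
  private reason = "";

  trip(reason: string): void {
    this.open = true;
    this.reason = reason;
    // Alerting is part of tripping: an open breaker should page someone.
    console.error(`[auth-breaker] OPEN: ${reason}`);
  }

  // Call before every signed request; throws while the breaker is open.
  assertClosed(): void {
    if (this.open) throw new Error(`auth breaker open: ${this.reason}`);
  }

  // Closing requires an explicit operator action after investigation, not a timer.
  reset(): void {
    this.open = false;
    this.reason = "";
  }
}
```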

2) Ensure host time sync is real (not assumed)

Many teams think they have time sync “because the OS does it”. In practice, fleets drift when:

  • upstream NTP is flaky
  • instances are snapshotted/restored
  • containers are scheduled on unhealthy hosts

The fix is operational:

  • verify time sync service health
  • verify the last sync is recent
  • verify the offset is stable

If you can’t measure it, you can’t trust it.

3) Calibrate against exchange server time

If the exchange provides serverTime, you can harden your bot by calibrating:

  • call serverTime
  • compute offset_ms
  • apply offset to all signed request timestamps

This turns “clock drift” into “offset tracking”, which is much easier to control.

Operational rules (a sketch follows the list):

  • recalibrate every 5 to 15 minutes
  • recalibrate on restarts
  • recalibrate after sleep/resume events
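
A sketch of those rules as a calibration loop; it assumes the calibrateOffsetMs helper shown later in this post is in scope, and the 10-minute interval is just one point inside the 5 to 15 minute guidance:

```ts
export async function startCalibrationLoop(
  fetchServerTime: () => Promise<{ serverTime: number }>,
  intervalMs = 10 * 60 * 1000
) {
  let appliedOffsetMs = 0;

  const recalibrate = async () => {
    // calibrateOffsetMs is defined in "Implementing exchange-time offset safely" below.
    const { offsetMs, rtt } = await calibrateOffsetMs(fetchServerTime);
    appliedOffsetMs = offsetMs;
    console.log(`[time] recalibrated offset_ms=${offsetMs} rtt_ms=${rtt}`);
  };

  await recalibrate();                               // on startup, before trading is enabled
  setInterval(() => void recalibrate(), intervalMs); // periodic recalibration
  return () => appliedOffsetMs;                      // accessor for the signing path
}
```

Restart and sleep/resume events are covered by running the same loop from your process startup and resume hooks.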

4) Delay startup until time is sane

Many bots fail right after deploy because they start signing requests immediately.

A safer startup sequence:

  1. verify host time sync is running
  2. calibrate exchange offset
  3. only then enable private endpoints / trading

This single change removes a lot of post-deploy incidents.

5) Make drift visible (so it stops being “random”)

Add one dashboard chart:

  • applied_offset_ms over time

When that line spikes or oscillates, you’ve found the problem before the exchange blocks you.
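
If you run Prometheus-style metrics, this is one gauge. A sketch assuming the prom-client library (the metric name is up to you):

```ts
import { Gauge } from "prom-client";

// Exposed via your /metrics endpoint; chart it and alert on spikes (e.g. |offset| > 5s).
const appliedOffsetGauge = new Gauge({
  name: "bot_applied_offset_ms",
  help: "Exchange server time minus local time, as applied to signed requests",
});

export function recordAppliedOffset(offsetMs: number): void {
  appliedOffsetGauge.set(offsetMs);
}
```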


recvWindow: what it is (and what it isn’t)

Many exchanges expose a parameter called recvWindow (or similar). It’s easy to misunderstand.

Think of recvWindow as a tolerance window that says: “accept my request if my timestamp is within this many milliseconds of your server time.”

It helps with minor jitter.

It does not fix:

  • a host clock that’s drifting steadily
  • a fleet where some instances are +4s and others are -3s
  • instances that jump time after pause/resume

Practical guidance (a small offset health check follows this list):

  • Keep recvWindow small and sensible (exchange-specific).
  • Treat increases as a temporary mitigation, not a root-cause fix.
  • If you need a huge window, you are not solving time. You are hiding it.
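
Rather than growing recvWindow, alert when your measured offset starts consuming a large share of it. A sketch with illustrative thresholds:

```ts
export function offsetHealth(offsetMs: number, recvWindowMs: number): "ok" | "warn" | "critical" {
  const usedFraction = Math.abs(offsetMs) / recvWindowMs;
  if (usedFraction > 0.8) return "critical"; // one time jump away from rejected requests
  if (usedFraction > 0.5) return "warn";
  return "ok";
}
```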

Implementing exchange-time offset safely

If the exchange provides serverTime, calibrating an offset is the single highest leverage fix you can ship.

The naïve approach is:

  • call serverTime
  • subtract your local time

But that ignores network latency. A better approach is to measure the request round-trip time (RTT) and assume the server timestamp corresponds roughly to the midpoint.

RTT-corrected offset

Let:

  • t0 = local time before request
  • ts = server time from response
  • t1 = local time after response

Approximate the local time at server timestamp as t0 + (t1 - t0)/2.

Then:

  • offset_ms = ts - (t0 + (t1 - t0)/2)

Here’s a TypeScript sketch:

```ts
type ServerTimeResponse = { serverTime: number };

export async function calibrateOffsetMs(fetchServerTime: () => Promise<ServerTimeResponse>) {
  const t0 = Date.now();
  const { serverTime } = await fetchServerTime();
  const t1 = Date.now();

  const rtt = t1 - t0;
  const midpointLocal = t0 + Math.floor(rtt / 2);
  const offsetMs = serverTime - midpointLocal;

  return { offsetMs, rtt };
}

export function signedTimestampMs(offsetMs: number) {
  return Date.now() + offsetMs;
}
```

Operational notes (a short usage sketch follows this list):

  • If RTT is huge or unstable, offset will be noisy. Log rtt and treat high RTT as an infrastructure symptom.
  • Do not recalibrate every request. Calibrate periodically (5 to 15 minutes) and on restart.
  • Store the offset in memory (or a small shared cache if you coordinate across instances).
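
Wiring it together might look like this; fetchServerTime is whatever thin wrapper you already have around the exchange’s server-time endpoint:

```ts
async function refreshSigningClock(fetchServerTime: () => Promise<{ serverTime: number }>) {
  const { offsetMs, rtt } = await calibrateOffsetMs(fetchServerTime);

  if (rtt > 1_000) {
    // A noisy RTT makes the midpoint assumption shaky; treat it as an infra symptom.
    console.warn(`[time] noisy calibration: rtt_ms=${rtt}`);
  }

  // Every signed request now uses the calibrated timestamp instead of raw Date.now().
  return () => signedTimestampMs(offsetMs);
}
```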

Startup gating (the simplest way to stop post-deploy incidents)

Most “it broke right after deploy” stories are caused by a bad startup sequence.

If your bot enables trading as soon as the process starts, it’s signing requests during the noisiest time:

  • time sync might not be stable
  • offset not calibrated yet
  • caches cold, so you’re about to burst several endpoints

A safer startup gate:

  1. Verify time sync service health (host-level)
  2. Calibrate exchange offset (application-level)
  3. Warm caches (exchange info, symbols, permissions)
  4. Enable private endpoints
  5. Enable trading

If step (1) or (2) fails, fail closed. Don’t “try your luck” with live keys.
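
A sketch of that sequence as a gate; every dependency here (checkHostTimeSync, warmCaches, and so on) is a placeholder for whatever your stack actually provides:

```ts
export async function startupGate(deps: {
  checkHostTimeSync: () => Promise<boolean>;              // e.g. chrony/ntp health probe
  fetchServerTime: () => Promise<{ serverTime: number }>;
  warmCaches: () => Promise<void>;                        // exchange info, symbols, permissions
  enablePrivateEndpoints: () => void;
  enableTrading: () => void;
}) {
  // 1) Host-level time sync must be healthy before we sign anything.
  if (!(await deps.checkHostTimeSync())) {
    throw new Error("startup gate: host time sync unhealthy, failing closed");
  }

  // 2) Application-level calibration against exchange time (calibrateOffsetMs from above).
  const { offsetMs, rtt } = await calibrateOffsetMs(deps.fetchServerTime);
  console.log(`[startup] calibrated offset_ms=${offsetMs} rtt_ms=${rtt}`);

  // 3) Warm caches before bursting private endpoints.
  await deps.warmCaches();

  // 4) and 5) Only now enable private endpoints, then trading.
  deps.enablePrivateEndpoints();
  deps.enableTrading();
}
```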


Incident playbook (10 minutes)

When the alert fires:

  1. Stop retries on auth/timestamp immediately
  2. Open auth breaker and halt trading
  3. Measure offset with serverTime (log offset_ms + rtt_ms)
  4. Check deploy marker (signing_version) to rule out signing regression
  5. Confirm host sync is enabled and stable

If you can’t do step (3), that’s a gap worth fixing first.


What to log (minimum viable)

Signature incidents are painful when you only log “401”.

Log enough to answer two questions immediately:

  1. Is the timestamp wrong?
  2. Did this start after a deploy/config change?

Minimum fields:

  • request_id
  • endpoint
  • status
  • error_code/message
  • local_ts_ms
  • server_ts_ms (if available)
  • applied_offset_ms
  • recv_window_ms
  • signing_version (deploy marker)

Example incident log shape:

```json
{
  "ts": "2026-01-15T09:41:12.345Z",
  "bot_instance_id": "prod-1",
  "exchange": "example",
  "endpoint": "private/order/create",
  "status": 401,
  "error_code": "timestamp_out_of_range",
  "error_message": "outside recvWindow",
  "local_ts_ms": 1768489272345,
  "server_ts_ms": 1768489266122,
  "applied_offset_ms": -6223,
  "recv_window_ms": 5000,
  "signing_version": "2026-01-15.2",
  "request_id": "req-abc"
}
```

This is enough to answer the only question that matters in the first minute:

Is it time drift, or did we ship a signing change?


Shipped asset


Time sync runbook + logging fields

Operational runbook you can hand to on-call engineers. Logging field list makes timestamp incidents diagnosable in minutes, not hours.

When to use this (fit check)
  • You see intermittent signature/timestamp failures that correlate with deploys, restarts, or VM pause/resume.
  • You need on-call to classify “time drift vs signing regression” within minutes.
  • You want a repeatable startup gate and a minimal log schema for signature incidents.

When NOT to use this (yet)
  • You can’t measure server time/offset (add a serverTime probe or infrastructure time checks first).
  • You treat auth failures like transient errors (stop retries + add breaker behavior first).
  • You don’t have a deploy marker in logs (add signing_version/deploy_id first).

This is intentionally compact here. Full package details are on the resource page.

What you get (2 files):

time-sync-runbook.md - Incident response procedure

  • Detection: what symptoms indicate time drift (signature failures on valid requests)
  • Diagnosis: which systems to check first (NTP, clock sources, OS drift)
  • Verification: how to confirm drift before mitigation (query exchange timestamp)
  • Mitigation: immediate actions (sync clocks, restart services, escalate)
  • Prevention: post-incident checks (monitoring, alerting, constraints)

logging-fields-for-signature-incidents.md - Required log fields

  • Timestamp fields: request_time, server_time, local_time, drift_ms
  • Context: exchange, endpoint, key_id, request_signature
  • Diagnosis: ntp_offset_ms, system_clock_source, drift_direction
  • Resolution: action_taken, time_corrected, incident_duration_ms

Quick reference (what to log):

```json
{
  "timestamp": "2026-01-27T14:30:20.123Z",
  "event_type": "signature_verification_failed",
  "exchange": "binance",
  "local_time_ms": 1769524220123,
  "server_time_ms": 1769524175123,
  "drift_ms": -45000,
  "ntp_status": "unsynchronized",
  "action": "time_sync_triggered"
}
```

What this solves:

  • Gives on-call a repeatable procedure during signature incidents
  • Makes drift diagnosable with queryable logs, not guesswork
  • Helps you stop harmful retries before they escalate blocks

Axiom Pack ($99)

Trading Bot Hardening Suite: Production-Ready Crypto Infrastructure

Running production trading bots? Get exchange-specific rate limiters, signature validation, and incident recovery playbooks. Stop losing money to preventable API failures.

  • Exchange-specific rate limiting (Binance, Coinbase, Kraken, Bybit)
  • Signature validation & timestamp drift detection
  • API ban prevention patterns & key rotation strategies
  • Incident runbooks for 429s, signature errors, and reconnection storms

Coming soon

Common mistakes (that keep repeating)

These mistakes are common because time is “invisible” until it breaks. The goal is to make time visible and treat it like any other dependency.

  1. Using recvWindow as a band-aid

A larger recvWindow can reduce false failures, but it doesn’t fix unstable time. If your offset jumps around, you still fail.

  2. Retrying auth failures

Auth failures are the wrong class of error to retry. They’re a “stop and investigate” signal.

  3. No deploy marker in logs

Without signing_version/deploy_id, you can’t quickly separate “time drift” from “signing code changed”.

  4. No offset metric

If you aren’t graphing offset, drift will always look random.

In practice, teams that fix this permanently do two things: they calibrate offset against exchange time and they alert on offset spikes.


FAQ

This is where teams usually get stuck when they try to “just fix the timestamp”.

Why did signature errors start suddenly when the bot had been working for hours?

Because your server clock drifted away from exchange time while the bot was running. Common triggers: VM pause/resume, container migration, NTP upstream loss, or gradual drift on hosts under load. The signing code didn't change; the timestamp you're sending is now outside the exchange's tolerance window (often 5 to 30 seconds), so the exchange rejects it as "signature invalid" or "timestamp out of range."

How do I tell whether it's clock drift or a signing bug?

Measure the offset. If the exchange provides a serverTime endpoint, call it and compute: offset_ms = server_time - local_time. If the offset magnitude exceeds the exchange window (e.g., 5+ seconds), it's drift. If the offset is small but you still get 401, investigate signing logic. Log the local timestamp, server timestamp, offset, and deploy version so you can separate time issues from code regressions.

Why do failures start right after a deploy or scale-out?

Because time sync may not be stable yet when the bot starts signing requests. NTP/Chrony can take a few seconds to sync after boot. If your bot enables trading immediately, it's sending signed requests with potentially wrong timestamps. The fix: gate startup, verifying time sync is healthy and calibrating the exchange offset before enabling private endpoints.

Does increasing recvWindow fix it?

Only partially and temporarily. recvWindow is a tolerance buffer: it helps with minor jitter, but it doesn't fix unstable clocks or hosts that drift steadily. If your offset oscillates or keeps growing, a larger window just masks the problem. The durable fix: ensure host time sync is reliable (NTP/Chrony) and/or calibrate against exchange serverTime so you're tracking offset instead of guessing.

Why doesn't this happen in local development?

Because production hosts drift in ways dev machines don't. VM pause/resume, noisy neighbors, bad host clocks, missing NTP upstream, cold starts, and container reschedules are common in production but rare locally. Your laptop usually has stable time sync and a single process. Production fleets experience time jumps that dev can't reproduce.

Should the bot retry after a signature or timestamp error?

No. Treat signature and timestamp errors as stop rules, not transient failures. If you keep retrying invalid auth requests, the exchange sees a pattern that looks like an attacker or a broken client, and that can escalate to blocks. The safe action: open an auth circuit breaker, halt trading, log offset measurements, and require investigation before resuming.

How do I know the fix worked?

Make offset visible. Track applied_offset_ms (or NTP offset) as a metric and alert on spikes. After the fix, signature failures should drop to near zero. In logs, you should see stable offset measurements, no timestamp errors, and any remaining auth failures correlated with specific drift events (not random). Graph offset over time: a stable line means the problem is solved.


How often should I recalibrate the offset?

Start with every 5 to 15 minutes and on every process restart. Also recalibrate after events that cause time jumps (sleep/resume, VM migration, long GC pauses, container reschedules).

If you see offset oscillation, log RTT and investigate network instability; noisy calibration can be a symptom of infrastructure issues.

What if the exchange doesn't expose a serverTime endpoint?

Then your only reliable option is to treat time sync as infrastructure. Ensure NTP/Chrony is configured correctly, alert on offset if you can, and gate startup until time is sane.

In that world, logging becomes even more important: record local timestamps and deploy markers so you can separate time drift from signing regressions.


Checklist (copy/paste)

  • Auth/timestamp failures are STOP rules (0 retries) and open an auth circuit breaker.
  • We can compute and log offset_ms (exchange server time vs local time) plus rtt_ms.
  • Logs include a deploy marker (signing_version / deploy_id) to separate drift from code regression.
  • Startup is gated: time sync health + offset calibration before enabling private endpoints/trading.
  • A metric exists for applied_offset_ms and alerts trigger on spikes (e.g., > 5s) or NTP unsynchronized.
  • Fleet skew is detectable (offset differs by instance); we can identify the bad host quickly.
  • recvWindow is treated as a small tolerance, not a root-cause fix.
  • Post-incident, we verify time sync configuration and test by intentionally skewing a non-prod host.

Resources

This is intentionally compact. Full package details are on the resource page.


If you want more production-ready runbooks and templates like this, the Axiom waitlist is the right place.


Axiom (Coming Soon)

Get notified when we ship real operational assets (runbooks, templates, benchmarks), not generic tutorials.


Key takeaways

Timestamp drift is a boring problem with expensive consequences.

Fix it once by adopting three behaviors:

  • treat auth failures as fail-fast (no retries)
  • measure and log offset so the incident is obvious
  • calibrate against exchange time so drift stops being a production roulette wheel

For more production bot operations work, see the Crypto Automation category.

Recommended resources

Download the shipped checklist/templates for this post.

On-call runbook for signature error incidents. Logging schema makes clock drift diagnostics fast. Know what to check in minutes, not hours.


Related posts