# Job heartbeat logging schema

Goal: stop treating "it is running" as a health signal, and make "it is making progress" measurable.

## What to log per run

Minimum fields:

- job (job type name)
- runId (stable id for the run)
- attempt (integer)
- step (coarse stage name)
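- startedAt (when the run started)
- elapsedMs (milliseconds since startedAt)
- heartbeat (true on heartbeat events)
- processed (items done so far, optional)
- total (items expected, optional)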

Recommended fields:

- instance (worker name)
- tenantId (if multi-tenant)
- dependency (if calling out)
- dependencyElapsedMs
- timeoutMs
- outcome (success, fail, timeout)
- decision (continue, stop, retry, escalate)
- reason
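
As an illustration only, the two field lists above could be captured in one typed record. This is a minimal sketch: the class name `JobEvent`, the `to_json` helper, and the choice to omit unset fields from the output are assumptions, not part of the schema itself.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class JobEvent:
    # Minimum fields
    job: str
    runId: str
    attempt: int
    step: str
    startedAt: str                  # assumed to be an ISO 8601 UTC string
    elapsedMs: int
    heartbeat: bool = True
    processed: Optional[int] = None
    total: Optional[int] = None
    # Recommended fields
    instance: Optional[str] = None
    tenantId: Optional[str] = None
    dependency: Optional[str] = None
    dependencyElapsedMs: Optional[int] = None
    timeoutMs: Optional[int] = None
    outcome: Optional[str] = None   # success, fail, timeout
    decision: Optional[str] = None  # continue, stop, retry, escalate
    reason: Optional[str] = None

    def to_json(self) -> str:
        """Serialize to a single JSON line, omitting fields that were not set."""
        return json.dumps({k: v for k, v in asdict(self).items() if v is not None})
```

Keeping the optional fields as `None` by default lets the same record type serve both heartbeat events and dependency call events.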

## Heartbeat event example

Emit this periodically while the job is making progress. `processed` against `total` shows how far along the run is.

```json
{
  "ts": "2026-01-29T02:22:10.110Z",
  "level": "info",
  "job": "NightlyExport",
  "runId": "run-20260129-0210",
  "attempt": 1,
  "step": "ExportOrders",
  "startedAt": "2026-01-29T02:10:10.110Z",
  "processed": 12000,
  "total": 50000,
  "elapsedMs": 720000,
  "heartbeat": true
}
```
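
One way to build such an event is sketched below. The function name `build_heartbeat` and its parameters are hypothetical; the sketch assumes timezone-aware `datetime` values and derives `elapsedMs` from `startedAt` so the two fields cannot drift apart.

```python
import json
from datetime import datetime, timedelta, timezone


def _iso(dt: datetime) -> str:
    """Format an aware datetime as ISO 8601 UTC with milliseconds and a Z suffix."""
    text = dt.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    return text.replace("+00:00", "Z")


def build_heartbeat(job, run_id, attempt, step, started_at, now,
                    processed=None, total=None):
    """Return a heartbeat event as a dict (hypothetical helper)."""
    event = {
        "ts": _iso(now),
        "level": "info",
        "job": job,
        "runId": run_id,
        "attempt": attempt,
        "step": step,
        "startedAt": _iso(started_at),
        # Integer division of timedeltas avoids float rounding.
        "elapsedMs": (now - started_at) // timedelta(milliseconds=1),
        "heartbeat": True,
    }
    # processed and total are optional, so only include them when known.
    if processed is not None:
        event["processed"] = processed
    if total is not None:
        event["total"] = total
    return event


if __name__ == "__main__":
    start = datetime(2026, 1, 29, 2, 10, 10, 110000, tzinfo=timezone.utc)
    now = start + timedelta(minutes=12)
    print(json.dumps(build_heartbeat("NightlyExport", "run-20260129-0210", 1,
                                     "ExportOrders", start, now, 12000, 50000)))
```

Calling this on a timer from the worker loop, and writing the result as one JSON line per event, is one reasonable way to wire it in.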

## Dependency call event example

Emit this for each external call, once the call returns or times out.

```json
{
  "ts": "2026-01-29T02:24:42.902Z",
  "level": "warning",
  "job": "NightlyExport",
  "runId": "run-20260129-0210",
  "attempt": 1,
  "step": "CallVendorApi",
  "dependency": "http:VendorApi",
  "dependencyElapsedMs": 31000,
  "timeoutMs": 30000,
  "outcome": "timeout",
  "decision": "stop",
  "reason": "budget-exhausted"
}
```
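
A wrapper along these lines could produce the event. This is a sketch under assumptions: `call_with_event` is a hypothetical name, the wrapped callable is assumed to accept a timeout in seconds and to raise the built-in `TimeoutError` on timeout, and mapping other exceptions to level `error` is a choice made here, not something the schema prescribes. `decision` and `reason` are left to the caller, since they depend on the job's retry and budget policy.

```python
import time


def call_with_event(base, dependency, timeout_ms, call):
    """Run call(timeout_seconds) and return (result, event).

    base holds the run-level fields (job, runId, attempt, step) and is
    copied, not mutated.
    """
    started = time.monotonic()  # monotonic clock is safe against wall-clock jumps
    result = None
    try:
        result = call(timeout_ms / 1000)
        outcome, level = "success", "info"
    except TimeoutError:
        outcome, level = "timeout", "warning"
    except Exception:
        # Any other error is recorded as a failure; the caller decides what to do.
        outcome, level = "fail", "error"
    elapsed_ms = int((time.monotonic() - started) * 1000)
    event = dict(
        base,
        level=level,
        dependency=dependency,
        dependencyElapsedMs=elapsed_ms,
        timeoutMs=timeout_ms,
        outcome=outcome,
    )
    return result, event
```

The caller would then add `decision` and `reason` (for example `stop` and `budget-exhausted`) before writing the event.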

## Alerts that actually work

These alerts are simple and effective:

- time since last success over threshold (per job type)
- oldest message age over threshold
- max runtime exceeded (runId is still in-flight beyond budget)
- repeated failure for the same record id (poison detection)
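
The checks above reduce to a few small predicates. The sketch below is illustrative only: the function names, the use of epoch seconds, and the input shapes (a dict of in-flight runs, a list of failed record ids) are assumptions about how the data is collected.

```python
from collections import Counter


def is_overdue(last_ts: float, now: float, threshold_s: float) -> bool:
    """True when more than threshold_s seconds have passed since last_ts.

    Covers both "time since last success" and "oldest message age".
    """
    return now - last_ts > threshold_s


def runs_over_budget(in_flight: dict, now: float, max_runtime_s: float) -> list:
    """Return runIds still in flight whose runtime exceeds the budget.

    in_flight maps runId -> start time in epoch seconds.
    """
    return sorted(run_id for run_id, started in in_flight.items()
                  if now - started > max_runtime_s)


def poison_records(failed_record_ids: list, min_failures: int) -> list:
    """Return record ids that have failed at least min_failures times."""
    counts = Counter(failed_record_ids)
    return sorted(rid for rid, n in counts.items() if n >= min_failures)
```

Thresholds are deliberately passed in as parameters so they can be set per job type.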

## Operator payload for escalation

When you escalate, include:

- job, runId, attempt
- current step
- elapsedMs and maxRuntimeMs
- dependency (if any)
- queue depth and oldest message age
- last error or status
- next action (pause job, reduce concurrency, check dependency)
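
As a sketch, the payload could be assembled from the latest event plus queue metrics. The function name `build_escalation` and the key names `queueDepth`, `oldestAgeMs`, `lastStatus`, and `nextAction` are hypothetical; only `maxRuntimeMs` and the event fields come from the lists above.

```python
def build_escalation(event: dict, max_runtime_ms: int, queue_depth: int,
                     oldest_age_ms: int, last_status: str, next_action: str) -> dict:
    """Build an operator-facing escalation payload from the latest job event."""
    payload = {
        "job": event["job"],
        "runId": event["runId"],
        "attempt": event["attempt"],
        "step": event["step"],
        "elapsedMs": event["elapsedMs"],
        "maxRuntimeMs": max_runtime_ms,
        "queueDepth": queue_depth,      # hypothetical key name
        "oldestAgeMs": oldest_age_ms,   # hypothetical key name
        "lastStatus": last_status,      # last error or status text
        "nextAction": next_action,      # e.g. "pause job", "reduce concurrency"
    }
    # dependency is only present when the run was calling out.
    if event.get("dependency"):
        payload["dependency"] = event["dependency"]
    return payload
```

Putting the suggested next action in the payload means the operator does not have to look up a runbook before acting.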
