# Stuck job runbook

Use this when a queue piles up or a scheduled job appears to run forever.

Goal: contain blast radius, capture evidence, recover safely, and prevent repeats.

## 0) Collect the minimum context (2 minutes)

- Job type or name
- Environment and instance (service name, worker name)
- Last successful run time
- First observed failure time
- Queue depth and oldest message age
- In-flight job count and completion rate
- One example run id (or message id)

If you cannot answer "when was the last success?", you are already flying blind.
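
One way to make the checklist hard to skip is to fill in a small record and have it report what is still missing. This is a minimal sketch; the class and field names are illustrative assumptions, not part of any existing tool.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import List, Optional


@dataclass
class IncidentContext:
    """Minimum context from step 0. All names here are hypothetical."""
    job_type: Optional[str] = None
    environment: Optional[str] = None
    worker: Optional[str] = None
    last_success_at: Optional[datetime] = None
    first_failure_at: Optional[datetime] = None
    queue_depth: Optional[int] = None
    oldest_message_age_s: Optional[float] = None
    in_flight: Optional[int] = None
    completions_per_min: Optional[float] = None
    example_run_id: Optional[str] = None

    def missing(self) -> List[str]:
        # Any field still set to None has not been collected yet.
        return [f.name for f in fields(self) if getattr(self, f.name) is None]
```

Paste the output of `missing()` into the incident channel so the gaps are visible to everyone.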

## 1) Classify the failure mode

You are usually in one of these buckets:

- Dependency slowdown: jobs are running, but blocked on HTTP, SQL, file IO, or a vendor SDK
- Poison input: the same record fails repeatedly
- Wedge: the worker is alive but not making progress (deadlock, lock contention, infinite wait)

Pick the bucket before you change code.
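
As a rough first pass, the buckets can be told apart from a few signals. The sketch below is a heuristic under assumed inputs and made-up thresholds (`0.8`, `3.0`); tune or replace them for your system, and treat the result as a starting hypothesis.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical inputs; gather them from your own metrics and logs."""
    in_flight: int                  # jobs currently running
    completions_last_10m: int       # jobs finished in the last 10 minutes
    top_failing_input_share: float  # share of recent failures from one input id
    dependency_p95_ratio: float     # current dependency p95 / normal p95


def classify(s: Signals) -> str:
    # One input causing most failures points at poison input.
    if s.top_failing_input_share >= 0.8:
        return "poison_input"
    # Work is in flight but nothing completes: the worker is wedged.
    if s.in_flight > 0 and s.completions_last_10m == 0:
        return "wedge"
    # Jobs still complete, but a dependency is much slower than usual.
    if s.dependency_p95_ratio >= 3.0:
        return "dependency_slowdown"
    return "unclassified"
```

A dependency that hangs completely will show up as a wedge here, which matches the "infinite wait" case above.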

## 2) Find what the job is waiting on

Look for the last progress log for the run id:

- step name (coarse stage)
- dependency name (if calling out)
- elapsed time

If you do not have progress logs, start with a dump of in-flight job metadata from the scheduler/queue and pick one job to inspect.
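
If progress logs exist as JSON lines, a short script can pull the last progress event for a run id. This sketch assumes each event has `run_id`, `step`, an optional `dependency`, and an ISO 8601 `ts` with a UTC offset; those field names are assumptions, so adapt them to your log schema.

```python
import json
from datetime import datetime
from typing import Iterable, Optional, Tuple


def last_progress(lines: Iterable[str], run_id: str) -> Optional[Tuple[datetime, dict]]:
    """Return (timestamp, event) for the newest progress event of run_id."""
    latest: Optional[Tuple[datetime, dict]] = None
    for line in lines:
        try:
            event = json.loads(line)
            if event.get("run_id") != run_id:
                continue
            ts = datetime.fromisoformat(event["ts"])
        except (ValueError, KeyError, TypeError, AttributeError):
            continue  # skip lines that are not well-formed progress events
        if latest is None or ts > latest[0]:
            latest = (ts, event)
    return latest


def seconds_waiting(ts: datetime, now: datetime) -> float:
    # Elapsed time since the last recorded progress.
    return (now - ts).total_seconds()
```

The step and dependency on the returned event tell you where the job stopped; `seconds_waiting` tells you for how long.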

## 3) Contain (make the system stop generating pressure)

Prefer containment actions that are reversible:

- Pause this job type (feature flag, disable schedule)
- Reduce concurrency (reduce workers for this job type)
- Cap retries and total time budgets
- Add a stop rule that skips retries for known non-transient failures

If the dependency is unhealthy, failing fast beats retrying harder: every retry adds load to a service that is already struggling.
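
The retry cap, total time budget, and stop rule can live in one small wrapper. This is a sketch, not a drop-in library: `NonTransientError` and `run_with_budget` are hypothetical names, and the defaults are placeholders.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class NonTransientError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


def run_with_budget(
    call: Callable[[], T],
    max_attempts: int = 3,
    total_budget_s: float = 30.0,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    deadline = clock() + total_budget_s
    attempt = 1
    while True:
        try:
            return call()
        except NonTransientError:
            raise  # stop rule: known non-transient failures are never retried
        except Exception:
            delay = base_delay_s * 2 ** (attempt - 1)  # exponential backoff
            # Give up when attempts are exhausted or the next wait would
            # push past the total time budget.
            if attempt >= max_attempts or clock() + delay > deadline:
                raise
            sleep(delay)
            attempt += 1
```

The `sleep` and `clock` parameters exist so the wrapper can be tested without waiting in real time.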

## 4) Recover safely

Recovery only works if duplicates are safe.

- Confirm idempotency or dedupe rules for the job
- If duplicates are not safe, stop and escalate with an operator payload (the run id, failing input, and last error an operator needs to decide what to do)

Safe recovery patterns:

- Dead-letter poison records with a human review path
- Reprocess with controlled concurrency
- Re-run a bounded window of work (not the entire backlog)
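
The three patterns combine naturally in a single reprocessing loop. The sketch below assumes each record is a dict with a unique `id` usable as a dedupe key, and uses an in-memory set and list as stand-ins for a real dedupe store and dead-letter queue. It runs sequentially, which is the most conservative concurrency setting.

```python
from typing import Callable, Iterable, List, Set


def reprocess(
    records: Iterable[dict],
    handler: Callable[[dict], None],
    processed_keys: Set,
    dead_letter: List[dict],
    max_attempts: int,
) -> int:
    """Re-run a bounded window of records; return how many succeeded."""
    attempted = 0
    succeeded = 0
    for record in records:
        if attempted >= max_attempts:
            break  # bounded window: stop before touching the rest of the backlog
        key = record["id"]
        if key in processed_keys:
            continue  # dedupe: already handled, skipping is safe
        attempted += 1
        try:
            handler(record)
        except Exception as exc:
            # Poison path: park the record with enough context for human review.
            dead_letter.append({"id": key, "error": repr(exc), "record": record})
            continue
        processed_keys.add(key)
        succeeded += 1
    return succeeded
```

Start with a small `max_attempts`, check the dead-letter contents, then widen the window.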

## 5) Evidence to capture during the incident

Capture at least one artifact that makes the failure mechanism obvious:

- A single run id with progress logs showing the last step
- Dependency call logs showing elapsed time and timeouts
- Queue metrics: depth, oldest age, completion rate
- Worker saturation: in-flight count and max concurrency

This evidence is what prevents tomorrow's repeat incident.

## 6) Prevent repeats (the minimum guardrails)

These three guardrails map to the three failure modes in section 1:

- Heartbeat or progress logs with a max runtime budget
- Per-attempt timeouts for dependencies
- A poison path (dead-letter or quarantine) with operator payload
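
The first guardrail only helps if something checks it. A watchdog can compare each in-flight job against a max runtime budget and a max heartbeat gap. This is a sketch under assumptions: `InFlightJob` and its fields are hypothetical, and timestamps are plain epoch seconds.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class InFlightJob:
    run_id: str
    started_at: float         # epoch seconds
    last_heartbeat_at: float  # epoch seconds of the latest progress log


def find_stuck(
    jobs: Iterable[InFlightJob],
    now: float,
    max_runtime_s: float,
    max_heartbeat_gap_s: float,
) -> Dict[str, str]:
    """Map run_id to the reason it looks stuck."""
    stuck: Dict[str, str] = {}
    for job in jobs:
        if now - job.started_at > max_runtime_s:
            stuck[job.run_id] = "over_runtime_budget"
        elif now - job.last_heartbeat_at > max_heartbeat_gap_s:
            stuck[job.run_id] = "heartbeat_stale"
    return stuck
```

A job that still heartbeats but exceeds its budget is flagged too, which catches slow loops that never technically hang.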

Also track these per job type:

- time since last success
- success rate (1h, 24h)
- runtime p50, p95
- in-flight count
- queue depth and oldest age
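
Most of these can be derived from a list of finished runs. The sketch below assumes each run is a dict with `finished_at` (epoch seconds), `ok` (bool), and `runtime_s`; the names are illustrative. It uses the nearest-rank percentile, one of several common definitions. In-flight count and queue depth come from the scheduler or queue and are not computed here.

```python
import math
from typing import List, Optional


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def job_health(runs: List[dict], now: float, window_s: float) -> dict:
    # Rates and runtimes use the window; last success looks at all runs.
    recent = [r for r in runs if now - r["finished_at"] <= window_s]
    last_success: Optional[float] = max(
        (r["finished_at"] for r in runs if r["ok"]), default=None
    )
    runtimes = [r["runtime_s"] for r in recent]
    return {
        "seconds_since_last_success": None if last_success is None else now - last_success,
        "success_rate": sum(1 for r in recent if r["ok"]) / len(recent) if recent else None,
        "runtime_p50_s": percentile(runtimes, 50) if runtimes else None,
        "runtime_p95_s": percentile(runtimes, 95) if runtimes else None,
    }
```

Call it once with `window_s=3600` and once with `window_s=86400` to get the 1h and 24h views.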
