Why your background jobs hang forever (and no one notices)

Jan 19, 2026 · 10 min read

Category: .NET

Queues and scheduled jobs fail quietly: missing timeouts, missing heartbeats, and retries that hide failure. A practical runbook-style playbook for .NET systems.

Download available. Jump to the shipped asset.

Background jobs are where teams learn the difference between "it is running" and "it is working".

In request and response systems, failure is loud. A user gets an error, someone refreshes, and traces usually have a correlation id. Job systems do not get that feedback loop for free. A worker can be stuck for hours and the only symptom is a growing queue or missing reports.

Mini incident pattern:

At 02:10 the nightly export starts. The scheduler UI says the worker is running. At 06:30 customers notice missing reports. Nothing crashed. The queue is growing and oldest message age is climbing. Someone restarts the Windows service and the backlog drains until tomorrow.

The cost is never just the missed report. It is the second-order mess: reruns you cannot trust, duplicates you have to unwind, and on-call time burned staring at a process that looks healthy.

Here is the production playbook for .NET job systems: add boundaries, add a heartbeat, and make stuck work provable.

Rescuing a .NET service in production? Start at the .NET Production Rescue hub and the .NET category.


The mechanism: quiet failure is missing boundaries

Background jobs feel simple, so teams treat them like backend UI: if it did not crash, it must be fine.

But jobs do not have the safety rails that exist in request and response systems. User requests usually come with upstream timeouts, cancellation, and a user who complains when something breaks. Jobs often have none of that. A slow dependency becomes an infinite wait, and an infinite wait becomes a worker slot that never returns.

That is the mechanism: missing boundaries create zombie work. Zombie work captures threads, connections, locks, and worker concurrency, and the backlog grows quietly until customers notice.

The mindset shift: jobs are not backend UI. They are production automation. They need explicit budgets, stop rules, and operator-grade signals.

Once you build those boundaries, "hangs forever" turns from a mysterious outage into a diagnosable condition with a clear response.


Diagnosis ladder (fast checks first)

When someone says "jobs are stuck", the worst move is to jump straight into code. Start with checks that classify the failure.

Do these in order:

  • Time since last success for the job type (this catches silent failure fast)
  • Queue depth + oldest message age (backlog vs spike)
  • In-flight jobs vs completions (is work finishing at all?)
  • One example run: last heartbeat/progress log + where it stopped
  • Dependency health: any vendor/DB/file share incident overlapping the window

If oldest message age is rising and completions are flat, you are not slow. You are accumulating work.
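
To make that last check concrete, here is a small C# sketch. The QueueSample shape is an illustrative assumption, not a specific queue API; feed it whatever your queue or metrics store exposes for two samples taken a few minutes apart.

csharp
// Hypothetical sketch: classify two metric samples taken a few minutes apart.
// QueueSample is illustrative, not a real queue SDK type.
public sealed record QueueSample(DateTimeOffset TakenAt, int Depth, TimeSpan OldestMessageAge, int CompletedSinceLastSample);

public static string ClassifyBacklog(QueueSample earlier, QueueSample later)
{
    bool oldestAgeRising = later.OldestMessageAge > earlier.OldestMessageAge;
    bool completionsFlat = later.CompletedSinceLastSample == 0;

    if (oldestAgeRising && completionsFlat)
        return "accumulating"; // work is not finishing: treat as stuck, not slow
    if (later.Depth > earlier.Depth)
        return "spike";        // intake is outpacing completion, but work still finishes
    return "draining";
}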


Why jobs hang forever

Most job failures are not dramatic. They are quiet.

The failure mode is usually a missing boundary: no timeout, no cancellation, no heartbeat, no stop rules. Without boundaries, jobs become zombie work that holds resources and delays discovery.

These are the usual causes.

1) No timeouts (infinite waits)

Jobs often call external systems:

  • vendors
  • file shares
  • databases
  • APIs

If a call can hang forever, your job can hang forever.

This is why "the worker is running" is not a health check. A worker can be fully alive while a single call is waiting on a socket, a file share, or a lock that never resolves.

2) No cancellation

A timeout should stop work.

If your job ignores cancellation and keeps processing after the operator gave up, your system accumulates "zombie work" that never completes.

Operators hate this because it creates the illusion of control: you can press "stop," but the system keeps doing work in the dark.
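
A sketch of the counterpart on the job side, assuming your per-item work is an async delegate: check the token between items and pass it down, so a stop request actually stops the loop.

csharp
// Processing loop that actually honors cancellation.
public static async Task ProcessBatchAsync<T>(
    IEnumerable<T> batch,
    Func<T, CancellationToken, Task> processItemAsync,
    CancellationToken ct)
{
    foreach (var item in batch)
    {
        // Check between items so "stop" means stop, not "finish the remaining 40,000 rows".
        ct.ThrowIfCancellationRequested();
        await processItemAsync(item, ct); // pass the token down; don't swallow it
    }
}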

3) Poison messages / bad inputs

Some jobs fail the same record repeatedly:

  • malformed data
  • unexpected state
  • missing permissions

If you treat these as retryable, you create endless reprocessing and hide the real issue.

4) Lock contention and accidental serialization

Many job systems accidentally serialize work:

  • global locks
  • single shared resource
  • "only one job at a time" assumptions

Under load, it looks like "it is running", but nothing completes.
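
A hedged sketch of making that serialization bounded and visible, assuming a single shared SemaphoreSlim guards the resource: wait with a budget, and fail loudly if the lock never frees. The 2-minute wait is an example.

csharp
public static class ExclusiveRunner
{
    private static readonly SemaphoreSlim ExportLock = new(1, 1);

    public static async Task RunExclusiveAsync(Func<CancellationToken, Task> work, CancellationToken ct)
    {
        // Bounded wait: if the lock is not free in time, fail loudly instead of queueing silently.
        if (!await ExportLock.WaitAsync(TimeSpan.FromMinutes(2), ct))
            throw new TimeoutException("Could not acquire the export lock within 2 minutes.");

        try
        {
            await work(ct);
        }
        finally
        {
            ExportLock.Release();
        }
    }
}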

5) Retries that hide failure

Retries are useful, until they become denial.

If a job retries forever, you don't have reliability. You have delayed discovery.
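
A minimal sketch of a bounded retry with jitter. In production you would likely reach for a library such as Polly; the classification in IsTransient is an assumption to adapt to your own failure taxonomy.

csharp
public static async Task<T> RetryAsync<T>(Func<CancellationToken, Task<T>> action, int maxAttempts, CancellationToken ct)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await action(ct);
        }
        catch (Exception ex) when (attempt < maxAttempts && IsTransient(ex))
        {
            // Backoff with jitter so a wave of workers does not hammer a recovering dependency.
            var delay = TimeSpan.FromSeconds(Math.Pow(2, attempt))
                      + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 500));
            await Task.Delay(delay, ct);
        }
        // When the budget is spent, the last exception propagates: a clear outcome, not silent retrying.
    }
}

static bool IsTransient(Exception ex) =>
    ex is HttpRequestException or TimeoutException; // classify; don't retry everything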


Decision framework: stop, retry, quarantine, or escalate

During an incident, the fastest path to stability is not "make it succeed." It is "make it bounded." You want clear conditions that tell an operator what to do next without guessing.

Use this framework per job type and per failure:

  • Retry (bounded) when the failure is transient and safe: network blips, short vendor instability, temporary DB throttling. Cap attempts, add jitter, and fail with a clear outcome when the budget is spent.
  • Stop + alert when the job is consuming capacity without making progress: no heartbeat, max runtime exceeded, or oldest message age rising while completions are flat.
  • Quarantine (poison path) when the same input fails deterministically: malformed data, auth/permission failures, unexpected state, invariant violations. Move it aside with an operator payload instead of retrying forever.
  • Escalate when recovery is unsafe or ambiguous: non-idempotent side effects, unknown progress, or cancellation is not honored. Collect one example run's logs + history and stop the blast radius first.

The rule of thumb: prioritize preventing hidden work over chasing success. You can always re-run later, but you cannot get back lost capacity.
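
A sketch of what that framework can look like in code. The exception types and the sideEffectsUnknown flag are illustrative assumptions; the point is that each failure maps to exactly one operator action.

csharp
public enum FailureAction { Retry, StopAndAlert, Quarantine, Escalate }

public static FailureAction Classify(Exception ex, int attempt, int maxAttempts, bool sideEffectsUnknown)
{
    if (sideEffectsUnknown)
        return FailureAction.Escalate;                   // unknown progress: stop the blast radius first
    if (ex is FormatException or UnauthorizedAccessException)
        return FailureAction.Quarantine;                 // deterministic: retrying will not help
    if (ex is TimeoutException && attempt >= maxAttempts)
        return FailureAction.StopAndAlert;               // budget spent without progress
    return attempt < maxAttempts ? FailureAction.Retry : FailureAction.StopAndAlert;
}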


What to log (so stuck becomes provable)

You don't need a perfect tracing setup to operate jobs well. You need a few structured fields that make one question answerable during an incident:

What is the job waiting on right now, and how long has it been waiting?

At minimum, log per job run:

  • job (job type name)
  • runId (stable ID for that run)
  • attempt (if retries exist)
  • step (coarse stage name)
  • elapsedMs
  • heartbeat=true periodically while making progress
  • dependency + elapsedMs + timeoutMs for external calls

Example progress + dependency logs:

json
{
  "ts": "2026-01-21T02:22:10.110Z",
  "level": "info",
  "job": "NightlyExport",
  "runId": "run-20260121-0210",
  "step": "ExportOrders",
  "processed": 12000,
  "total": 50000,
  "elapsedMs": 720000,
  "heartbeat": true
}
json
{
  "ts": "2026-01-21T02:24:42.902Z",
  "level": "warning",
  "job": "NightlyExport",
  "runId": "run-20260121-0210",
  "step": "CallVendorApi",
  "dependency": "http:VendorApi",
  "elapsedMs": 31000,
  "timeoutMs": 30000,
  "outcome": "timeout",
  "action": "stop"
}

Those fields are boring on purpose. They make it possible to answer "is it making progress?" and "what is it waiting on?" without a war room.

If you want to stop arguing in incidents, add one more field: a max runtime budget. That turns "running" into "violating a boundary".
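
A hedged sketch of that budget check, assuming Microsoft.Extensions.Logging and a Stopwatch started at the beginning of the run. Call it periodically, for example alongside each heartbeat, with a per-job budget.

csharp
using System.Diagnostics;
using Microsoft.Extensions.Logging;

public static class RuntimeBudget
{
    public static void EnforceBudget(Stopwatch runClock, TimeSpan budget, ILogger logger, string job, string runId)
    {
        if (runClock.Elapsed <= budget) return;

        // Emit the same fields as the schema above so the violation is searchable.
        logger.LogWarning("job={Job} runId={RunId} elapsedMs={ElapsedMs} budgetMs={BudgetMs} outcome={Outcome} action={Action}",
            job, runId, runClock.ElapsedMilliseconds, (long)budget.TotalMilliseconds, "max-runtime-exceeded", "stop");
        throw new TimeoutException($"{job} exceeded its {budget.TotalMinutes:N0}-minute runtime budget.");
    }
}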


The minimum health signals every job system needs

The goal of job observability is not pretty dashboards. The goal is fast answers:

  • Is the job system healthy?
  • If not, when did it start failing?
  • Are we building backlog right now?

You don't need a massive observability program. You need a few obvious dials.

Track these (per job type):

  • success rate (last 1h / 24h)
  • time since last success
  • runtime duration (p50/p95)
  • in-flight count
  • queue depth (and oldest message age)

If you can add one more signal, make it this: a heartbeat/progress log that includes a max-runtime budget. It turns "it's running" into "it's making progress" and makes alerts defensible.

A very simple alert that prevents embarrassment:

  • "time since last success > threshold"

That catches silent failures faster than most dashboards.
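
A sketch of that alert condition as code, assuming you record a last-success timestamp per job type somewhere (a jobs table, a metrics store); the threshold is set per job type.

csharp
public static bool IsSilentlyFailing(DateTimeOffset? lastSuccess, TimeSpan threshold, DateTimeOffset now)
{
    // No success ever recorded counts as failing: "we don't know" should page someone.
    if (lastSuccess is null) return true;
    return now - lastSuccess.Value > threshold;
}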


A stuck-job runbook (copy/paste)

When a job hangs, operators need decisions, not panic.

The point of a runbook is to stop hero debugging. It should tell someone what to check first, what to collect, and when to stop digging and escalate.

Scope

  • Which job type is failing/hanging?
  • Is it one tenant/customer, or everyone?
  • When did the last success happen?

Blocker

  • Is the job waiting on HTTP, SQL, file IO, or a lock?
  • Do you see retry amplification?
  • Is there a dependency incident (vendor outage, database slowdown)?

Contain

  • pause the job type (feature flag / disable schedule)
  • cap concurrency (reduce workers)
  • apply timeouts / stop rules

Recover safely

  • reprocess with idempotency guarantees
  • dead-letter poison records with a human review path
  • document "stop/retry/escalate" conditions

Patterns that prevent repeat incidents

Here is the uncomfortable discussion point: most job systems fail because teams treat jobs like backend UI. They assume that if it did not crash, it worked.

Reliable job systems behave more like production automation: bounded work, clear stop rules, and explicit operator signals.

1) Heartbeats

A job should regularly emit "I'm alive and still making progress".

If you can't implement a full heartbeat, at least emit progress logs every N records/items.

json
{
  "ts": "2026-01-21T02:22:10.110Z",
  "level": "info",
  "job": "NightlyExport",
  "runId": "run-20260121-0210",
  "processed": 12000,
  "total": 50000,
  "elapsedSec": 720,
  "heartbeat": true
}
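
A hedged C# sketch of that fallback, assuming Microsoft.Extensions.Logging and an async delegate per item: emit a progress log every N items (1,000 here is arbitrary).

csharp
using Microsoft.Extensions.Logging;

public static async Task RunWithProgressAsync<T>(
    IReadOnlyList<T> records,
    Func<T, CancellationToken, Task> processItemAsync,
    ILogger logger, string job, string runId, CancellationToken ct)
{
    for (int i = 0; i < records.Count; i++)
    {
        await processItemAsync(records[i], ct);

        // Emit a heartbeat every 1,000 items so "alive" is observable, not assumed.
        if ((i + 1) % 1000 == 0)
            logger.LogInformation("heartbeat job={Job} runId={RunId} processed={Processed} total={Total} heartbeat={Heartbeat}",
                job, runId, i + 1, records.Count, true);
    }
}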

2) Max runtime guardrails

A job should have a max runtime. Otherwise "it's slow" becomes "it never ends".

  • if it exceeds the limit, fail with a clear reason
  • operators get an alert

This prevents zombie work.

3) Cancellation + timeouts wired through

If a job times out but keeps doing work, you've created invisible load.

Here's a minimal pattern (DoWorkAsync stands in for your own steps):

csharp
public async Task RunAsync(CancellationToken ct)
{
    // Give the whole run a budget, linked to the host's stop/shutdown token.
    using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
    timeoutCts.CancelAfter(TimeSpan.FromMinutes(5));

    try
    {
        // Every downstream call must honor timeoutCts.Token, or the budget is fiction.
        await DoWorkAsync(timeoutCts.Token);
    }
    catch (OperationCanceledException) when (!ct.IsCancellationRequested)
    {
        // The budget expired (not an operator stop): fail with a clear reason so alerts can fire.
        throw new TimeoutException("Job exceeded its 5-minute runtime budget.");
    }
}

4) Idempotency / dedupe

Jobs are often retried or rerun. If a rerun can duplicate side effects (emails, invoices, writes), you'll hesitate to recover.

Idempotency makes recovery safe.
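
A minimal sketch of a dedupe guard, assuming a hypothetical IDedupeStore backed by any table with a unique constraint on the key; the invoice example is illustrative.

csharp
// Hypothetical store: any backing table with a unique constraint on the key works.
public interface IDedupeStore
{
    Task<bool> TryRecordFirstExecutionAsync(string key, CancellationToken ct);
}

public sealed class InvoiceJobStep
{
    private readonly IDedupeStore _dedupeStore;
    public InvoiceJobStep(IDedupeStore dedupeStore) => _dedupeStore = dedupeStore;

    public async Task SendInvoiceOnceAsync(Func<Task> sendInvoiceAsync, string runId, string invoiceId, CancellationToken ct)
    {
        // Record the key first; a unique constraint turns reruns into safe no-ops.
        var dedupeKey = $"{runId}:{invoiceId}";
        if (!await _dedupeStore.TryRecordFirstExecutionAsync(dedupeKey, ct))
            return; // already done on a previous attempt/run: skip the side effect

        await sendInvoiceAsync();
    }
}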

5) Poison handling

If the same record fails repeatedly, treat it as:

  • stop and send to a dead-letter path
  • include an operator payload (what failed, why, what record, what to do)

That prevents infinite retry until someone notices.
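
A sketch of what that operator payload could look like; the record shape is a suggestion, not a required schema.

csharp
// Operator payload for the dead-letter path: a human should be able to act on it
// without re-debugging the job.
public sealed record PoisonRecord(
    string Job,
    string RunId,
    string RecordId,
    string FailureReason,    // e.g. "malformed date in order line 3"
    string SuggestedAction,  // e.g. "fix source data, then requeue with the same runId"
    int Attempts,
    DateTimeOffset QuarantinedAt);

Park it wherever you already have storage: a dead-letter queue, a quarantine table, or even a folder of JSON files, as long as someone reviews it.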


Does this apply regardless of the job framework or queue you use?

Yes. The failure modes are the same:

  • infinite waits
  • missing cancellation
  • missing health signals
  • retries that hide failure

The tooling differs, but the reliability properties are identical.


Shipped asset


Stuck job runbook (template)

Free download. Runbook and heartbeat schema for stuck jobs and queue pileups.

What you get (2 files):

  • stuck-job-runbook.md
  • job-heartbeat-logging-schema.md

Preview:

text
Scope: which job type, last success time, queue depth, oldest age
Blocker: waiting on HTTP/SQL/file IO/lock? retries amplifying?
Contain: pause job type, cap concurrency, add timeouts/stop rules
Recover: reprocess safely (idempotency), dead-letter poison records
Prevent: heartbeat + max runtime + operator alerts



FAQ

How do I tell a slow job from a stuck job?

You need a progress signal, not just a "started at" timestamp. A slow job still changes state: counters increase, steps transition, and heartbeats keep arriving. A stuck job stops moving.

The simplest boundary is a max-runtime budget + heartbeat. Once you have both, "stuck" becomes an alert condition instead of a feeling.

Why not just retry until it succeeds?

Because retries can hide deterministic failures by turning them into "eventually succeeds". In the meantime, you burn worker capacity and grow the queue.

Retries only help when they're bounded and classified. Poison inputs and auth/permission failures should stop and quarantine, not retry forever.

Should a stuck job be killed and rescheduled automatically?

Sometimes, but only when you can recover safely. The safe pattern is: hit max runtime -> cancel -> verify cancellation was honored -> reschedule with a dedupe key.

If you can't guarantee idempotency, killing jobs can create duplicates. In that case, stop the blast radius first and escalate with a useful incident payload.

What should I add first if none of this exists today?

Three things: a max runtime guardrail, a heartbeat/progress log, and a poison/quarantine path. Those boundaries turn "silent failure" into "diagnosable failure".

Once those exist, you can tighten timeouts and retry policies without guessing.

Do we need to migrate to a different job platform?

Usually no. Most "jobs hang forever" incidents are boundaries + observability problems you can fix in the current system.

If you migrate platforms without adding boundaries, you'll recreate the same silent failures, just with different dashboards.

What about jobs with side effects like emails or invoices?

Make the job idempotent: use stable run ids, write dedupe records, and design external calls so repeats become safe no-ops.

If a side effect cannot be made idempotent, treat it as a gated step: require human confirmation or stop early when uncertainty is high.


Coming soon

If you want more runbooks like this (plus logging schemas and decision trees), that is what Axiom is becoming.

Join to get notified as we ship new operational assets for reliability.



Key takeaways

  • Background jobs fail quietly unless you build explicit health signals.
  • The fastest wins are timeouts, cancellation, heartbeats, and max runtime guardrails.
  • A runbook + poison handling prevents tomorrow's repeat incident.

If you want a fast diagnosis of a queue pileup or stuck jobs, see .NET Production Rescue or contact me and include one example run's logs + job history.
