Why your background jobs hang forever (and no one notices)

Jan 19, 2026 · 10 min read

Category: .NET

Queues and scheduled jobs fail quietly: missing timeouts, missing heartbeats, and retries that hide failure. A practical runbook-style playbook for .NET systems.

Download available. Jump to the shipped asset.

Background jobs are where teams learn the difference between "it is running" and "it is working".

In request and response systems, failure is loud. A user gets an error, someone refreshes, and traces usually have a correlation id. Job systems do not get that feedback loop for free. A worker can be stuck for hours and the only symptom is a growing queue or missing reports.

Mini incident pattern:

At 02:10 the nightly export starts. The scheduler UI says the worker is running. At 06:30 customers notice missing reports. Nothing crashed. The queue is growing and oldest message age is climbing. Someone restarts the Windows service and the backlog drains until tomorrow.

The cost is never just the missed report. It is the second-order mess: reruns you cannot trust, duplicates you have to unwind, and on-call time burned staring at a process that looks healthy.

Here is the production playbook for .NET job systems: add boundaries, add a heartbeat, and make stuck work provable.

Rescuing a .NET service in production? Start at the .NET Production Rescue hub and the .NET category.


The mechanism: quiet failure is missing boundaries

Background jobs feel simple, so teams treat them like backend UI: if it did not crash, it must be fine.

But jobs do not have the safety rails that exist in request and response systems. User requests usually come with upstream timeouts, cancellation, and a user who complains when something breaks. Jobs often have none of that. A slow dependency becomes an infinite wait, and an infinite wait becomes a worker slot that never returns.

That is the mechanism: missing boundaries create zombie work. Zombie work captures threads, connections, locks, and worker concurrency, and the backlog grows quietly until customers notice.

The mindset shift: jobs are not backend UI. They are production automation. They need explicit budgets, stop rules, and operator-grade signals.

Once you build those boundaries, "hangs forever" turns from a mysterious outage into a diagnosable condition with a clear response.


Diagnosis ladder (fast checks first)

When someone says "jobs are stuck", the worst move is to jump straight into code. Start with checks that classify the failure.

Do these in order:

  • Time since last success for the job type (this catches silent failure fast)
  • Queue depth + oldest message age (backlog vs spike)
  • In-flight jobs vs completions (is work finishing at all?)
  • One example run: last heartbeat/progress log + where it stopped
  • Dependency health: any vendor/DB/file share incident overlapping the window

If oldest message age is rising and completions are flat, you are not slow. You are accumulating work.
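
To make that last check concrete, here is a small C# sketch. The QueueSample shape is an illustrative assumption, not a specific queue API; feed it whatever your queue or metrics store exposes for two samples taken a few minutes apart.

csharp
// Hypothetical sketch: classify two metric samples taken a few minutes apart.
// QueueSample is illustrative, not a real queue SDK type.
public sealed record QueueSample(DateTimeOffset TakenAt, int Depth, TimeSpan OldestMessageAge, int CompletedSinceLastSample);

public static string ClassifyBacklog(QueueSample earlier, QueueSample later)
{
    bool oldestAgeRising = later.OldestMessageAge > earlier.OldestMessageAge;
    bool completionsFlat = later.CompletedSinceLastSample == 0;

    if (oldestAgeRising && completionsFlat)
        return "accumulating"; // work is not finishing: treat as stuck, not slow
    if (later.Depth > earlier.Depth)
        return "spike";        // intake is outpacing completion, but work still finishes
    return "draining";
}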


Why jobs hang forever

Most job failures are not dramatic. They are quiet.

The failure mode is usually a missing boundary: no timeout, no cancellation, no heartbeat, no stop rules. Without boundaries, jobs become zombie work that holds resources and delays discovery.

These are the usual causes.

1) No timeouts (infinite waits)

Jobs often call external systems:

  • vendors
  • file shares
  • databases
  • APIs

If a call can hang forever, your job can hang forever.

This is why "the worker is running" is not a health check. A worker can be fully alive while a single call is waiting on a socket, a file share, or a lock that never resolves.

2) No cancellation

A timeout should stop work.

If your job ignores cancellation and keeps processing after the operator gave up, your system accumulates "zombie work" that never completes.

Operators hate this because it creates the illusion of control: you can press "stop," but the system keeps doing work in the dark.
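
A sketch of the counterpart on the job side, assuming your per-item work is an async delegate: check the token between items and pass it down, so a stop request actually stops the loop.

csharp
// Processing loop that actually honors cancellation.
public static async Task ProcessBatchAsync<T>(
    IEnumerable<T> batch,
    Func<T, CancellationToken, Task> processItemAsync,
    CancellationToken ct)
{
    foreach (var item in batch)
    {
        // Check between items so "stop" means stop, not "finish the remaining 40,000 rows".
        ct.ThrowIfCancellationRequested();
        await processItemAsync(item, ct); // pass the token down; don't swallow it
    }
}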

3) Poison messages / bad inputs

Some jobs fail the same record repeatedly:

  • malformed data
  • unexpected state
  • missing permissions

If you treat these as retryable, you create endless reprocessing and hide the real issue.

4) Lock contention and accidental serialization

Many job systems accidentally serialize work:

  • global locks
  • single shared resource
  • "only one job at a time" assumptions

Under load, it looks like "it is running", but nothing completes.
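
A hedged sketch of making that serialization bounded and visible, assuming a single shared SemaphoreSlim guards the resource: wait with a budget, and fail loudly if the lock never frees. The 2-minute wait is an example.

csharp
public static class ExclusiveRunner
{
    private static readonly SemaphoreSlim ExportLock = new(1, 1);

    public static async Task RunExclusiveAsync(Func<CancellationToken, Task> work, CancellationToken ct)
    {
        // Bounded wait: if the lock is not free in time, fail loudly instead of queueing silently.
        if (!await ExportLock.WaitAsync(TimeSpan.FromMinutes(2), ct))
            throw new TimeoutException("Could not acquire the export lock within 2 minutes.");

        try
        {
            await work(ct);
        }
        finally
        {
            ExportLock.Release();
        }
    }
}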

5) Retries that hide failure

Retries are useful, until they become denial.

If a job retries forever, you don't have reliability. You have delayed discovery.
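
A minimal sketch of a bounded retry with jitter. In production you would likely reach for a library such as Polly; the classification in IsTransient is an assumption to adapt to your own failure taxonomy.

csharp
public static async Task<T> RetryAsync<T>(Func<CancellationToken, Task<T>> action, int maxAttempts, CancellationToken ct)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await action(ct);
        }
        catch (Exception ex) when (attempt < maxAttempts && IsTransient(ex))
        {
            // Backoff with jitter so a wave of workers does not hammer a recovering dependency.
            var delay = TimeSpan.FromSeconds(Math.Pow(2, attempt))
                      + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 500));
            await Task.Delay(delay, ct);
        }
        // When the budget is spent, the last exception propagates: a clear outcome, not silent retrying.
    }
}

static bool IsTransient(Exception ex) =>
    ex is HttpRequestException or TimeoutException; // classify; don't retry everything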


Decision framework: stop, retry, quarantine, or escalate

During an incident, the fastest path to stability is not "make it succeed." It is "make it bounded." You want clear conditions that tell an operator what to do next without guessing.

Use this framework per job type and per failure:

  • Retry (bounded) when the failure is transient and safe: network blips, short vendor instability, temporary DB throttling. Cap attempts, add jitter, and fail with a clear outcome when the budget is spent.
  • Stop + alert when the job is consuming capacity without making progress: no heartbeat, max runtime exceeded, or oldest message age rising while completions are flat.
  • Quarantine (poison path) when the same input fails deterministically: malformed data, auth/permission failures, unexpected state, invariant violations. Move it aside with an operator payload instead of retrying forever.
  • Escalate when recovery is unsafe or ambiguous: non-idempotent side effects, unknown progress, or cancellation is not honored. Collect one example run's logs + history and stop the blast radius first.

The rule of thumb: prioritize preventing hidden work over chasing success. You can always re-run later, but you cannot get back lost capacity.
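
A sketch of what that framework can look like in code. The exception types and the sideEffectsUnknown flag are illustrative assumptions; the point is that each failure maps to exactly one operator action.

csharp
public enum FailureAction { Retry, StopAndAlert, Quarantine, Escalate }

public static FailureAction Classify(Exception ex, int attempt, int maxAttempts, bool sideEffectsUnknown)
{
    if (sideEffectsUnknown)
        return FailureAction.Escalate;                   // unknown progress: stop the blast radius first
    if (ex is FormatException or UnauthorizedAccessException)
        return FailureAction.Quarantine;                 // deterministic: retrying will not help
    if (ex is TimeoutException && attempt >= maxAttempts)
        return FailureAction.StopAndAlert;               // budget spent without progress
    return attempt < maxAttempts ? FailureAction.Retry : FailureAction.StopAndAlert;
}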


What to log (so stuck becomes provable)

You don't need a perfect tracing setup to operate jobs well. You need a few structured fields that make one question answerable during an incident:

What is the job waiting on right now, and how long has it been waiting?

At minimum, log per job run:

  • job (job type name)
  • runId (stable ID for that run)
  • attempt (if retries exist)
  • step (coarse stage name)
  • elapsedMs
  • heartbeat=true periodically while making progress
  • dependency + elapsedMs + timeoutMs for external calls

Example progress + dependency logs:

json
{
  "ts": "2026-01-21T02:22:10.110Z",
  "level": "info",
  "job": "NightlyExport",
  "runId": "run-20260121-0210",
  "step": "ExportOrders",
  "processed": 12000,
  "total": 50000,
  "elapsedMs": 720000,
  "heartbeat": true
}
json
{
  "ts": "2026-01-21T02:24:42.902Z",
  "level": "warning",
  "job": "NightlyExport",
  "runId": "run-20260121-0210",
  "step": "CallVendorApi",
  "dependency": "http:VendorApi",
  "elapsedMs": 31000,
  "timeoutMs": 30000,
  "outcome": "timeout",
  "action": "stop"
}

Those fields are boring on purpose. They make it possible to answer "is it making progress?" and "what is it waiting on?" without a war room.

If you want to stop arguing in incidents, add one more field: a max runtime budget. That turns "running" into "violating a boundary".
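
A hedged sketch of that budget check, assuming Microsoft.Extensions.Logging and a Stopwatch started at the beginning of the run. Call it periodically, for example alongside each heartbeat, with a per-job budget.

csharp
using System.Diagnostics;
using Microsoft.Extensions.Logging;

public static class RuntimeBudget
{
    public static void EnforceBudget(Stopwatch runClock, TimeSpan budget, ILogger logger, string job, string runId)
    {
        if (runClock.Elapsed <= budget) return;

        // Emit the same fields as the schema above so the violation is searchable.
        logger.LogWarning("job={Job} runId={RunId} elapsedMs={ElapsedMs} budgetMs={BudgetMs} outcome={Outcome} action={Action}",
            job, runId, runClock.ElapsedMilliseconds, (long)budget.TotalMilliseconds, "max-runtime-exceeded", "stop");
        throw new TimeoutException($"{job} exceeded its {budget.TotalMinutes:N0}-minute runtime budget.");
    }
}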


The minimum health signals every job system needs

The goal of job observability is not pretty dashboards. The goal is fast answers:

  • Is the job system healthy?
  • If not, when did it start failing?
  • Are we building backlog right now?

You don't need a massive observability program. You need a few obvious dials.

Track these (per job type):

  • success rate (last 1h / 24h)
  • time since last success
  • runtime duration (p50/p95)
  • in-flight count
  • queue depth (and oldest message age)

If you can add one more signal, make it this: a heartbeat/progress log that includes a max-runtime budget. It turns "it's running" into "it's making progress" and makes alerts defensible.

A very simple alert that prevents embarrassment:

  • "time since last success > threshold"

That catches silent failures faster than most dashboards.
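
A sketch of that alert condition as code, assuming you record a last-success timestamp per job type somewhere (a jobs table, a metrics store); the threshold is set per job type.

csharp
public static bool IsSilentlyFailing(DateTimeOffset? lastSuccess, TimeSpan threshold, DateTimeOffset now)
{
    // No success ever recorded counts as failing: "we don't know" should page someone.
    if (lastSuccess is null) return true;
    return now - lastSuccess.Value > threshold;
}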


A stuck-job runbook (copy/paste)

When a job hangs, operators need decisions, not panic.

The point of a runbook is to stop hero debugging. It should tell someone what to check first, what to collect, and when to stop digging and escalate.

Scope

  • Which job type is failing/hanging?
  • Is it one tenant/customer, or everyone?
  • When did the last success happen?

Blocker

  • Is the job waiting on HTTP, SQL, file IO, or a lock?
  • Do you see retry amplification?
  • Is there a dependency incident (vendor outage, database slowdown)?

Contain

  • pause the job type (feature flag / disable schedule)
  • cap concurrency (reduce workers)
  • apply timeouts / stop rules

Recover safely

  • reprocess with idempotency guarantees
  • dead-letter poison records with a human review path
  • document "stop/retry/escalate" conditions

Patterns that prevent repeat incidents

Here is the uncomfortable discussion point: most job systems fail because teams treat jobs like backend UI. They assume that if it did not crash, it worked.

Reliable job systems behave more like production automation: bounded work, clear stop rules, and explicit operator signals.

1) Heartbeats

A job should regularly emit "I'm alive and still making progress".

If you can't implement a full heartbeat, at least emit progress logs every N records/items.

json
{
  "ts": "2026-01-21T02:22:10.110Z",
  "level": "info",
  "job": "NightlyExport",
  "runId": "run-20260121-0210",
  "processed": 12000,
  "total": 50000,
  "elapsedSec": 720,
  "heartbeat": true
}
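
A hedged C# sketch of that fallback, assuming Microsoft.Extensions.Logging and an async delegate per item: emit a progress log every N items (1,000 here is arbitrary).

csharp
using Microsoft.Extensions.Logging;

public static async Task RunWithProgressAsync<T>(
    IReadOnlyList<T> records,
    Func<T, CancellationToken, Task> processItemAsync,
    ILogger logger, string job, string runId, CancellationToken ct)
{
    for (int i = 0; i < records.Count; i++)
    {
        await processItemAsync(records[i], ct);

        // Emit a heartbeat every 1,000 items so "alive" is observable, not assumed.
        if ((i + 1) % 1000 == 0)
            logger.LogInformation("heartbeat job={Job} runId={RunId} processed={Processed} total={Total} heartbeat={Heartbeat}",
                job, runId, i + 1, records.Count, true);
    }
}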

2) Max runtime guardrails

A job should have a max runtime. Otherwise "it's slow" becomes "it never ends".

  • if it exceeds the limit, fail with a clear reason
  • operators get an alert

This prevents zombie work.

3) Cancellation + timeouts wired through

If a job times out but keeps doing work, you've created invisible load.

Here's a minimal pattern (DoWorkAsync stands in for your own steps):

csharp
public async Task RunAsync(CancellationToken ct)
{
    // Give the whole run a budget, linked to the host's stop/shutdown token.
    using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
    timeoutCts.CancelAfter(TimeSpan.FromMinutes(5));

    try
    {
        // Every downstream call must honor timeoutCts.Token, or the budget is fiction.
        await DoWorkAsync(timeoutCts.Token);
    }
    catch (OperationCanceledException) when (!ct.IsCancellationRequested)
    {
        // The budget expired (not an operator stop): fail with a clear reason so alerts can fire.
        throw new TimeoutException("Job exceeded its 5-minute runtime budget.");
    }
}

4) Idempotency / dedupe

Jobs are often retried or rerun. If a rerun can duplicate side effects (emails, invoices, writes), you'll hesitate to recover.

Idempotency makes recovery safe.
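
A minimal sketch of a dedupe guard, assuming a hypothetical IDedupeStore backed by any table with a unique constraint on the key; the invoice example is illustrative.

csharp
// Hypothetical store: any backing table with a unique constraint on the key works.
public interface IDedupeStore
{
    Task<bool> TryRecordFirstExecutionAsync(string key, CancellationToken ct);
}

public sealed class InvoiceJobStep
{
    private readonly IDedupeStore _dedupeStore;
    public InvoiceJobStep(IDedupeStore dedupeStore) => _dedupeStore = dedupeStore;

    public async Task SendInvoiceOnceAsync(Func<Task> sendInvoiceAsync, string runId, string invoiceId, CancellationToken ct)
    {
        // Record the key first; a unique constraint turns reruns into safe no-ops.
        var dedupeKey = $"{runId}:{invoiceId}";
        if (!await _dedupeStore.TryRecordFirstExecutionAsync(dedupeKey, ct))
            return; // already done on a previous attempt/run: skip the side effect

        await sendInvoiceAsync();
    }
}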

5) Poison handling

If the same record fails repeatedly, treat it as:

  • stop and send to a dead-letter path
  • include an operator payload (what failed, why, what record, what to do)

That prevents infinite retry until someone notices.
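
A sketch of what that operator payload could look like; the record shape is a suggestion, not a required schema.

csharp
// Operator payload for the dead-letter path: a human should be able to act on it
// without re-debugging the job.
public sealed record PoisonRecord(
    string Job,
    string RunId,
    string RecordId,
    string FailureReason,    // e.g. "malformed date in order line 3"
    string SuggestedAction,  // e.g. "fix source data, then requeue with the same runId"
    int Attempts,
    DateTimeOffset QuarantinedAt);

Park it wherever you already have storage: a dead-letter queue, a quarantine table, or even a folder of JSON files, as long as someone reviews it.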


Does this apply regardless of the job framework or queue you use?

Yes. The failure modes are the same:

  • infinite waits
  • missing cancellation
  • missing health signals
  • retries that hide failure

The tooling differs, but the reliability properties are identical.


Shipped asset


Stuck job runbook (template)

Free download. Runbook and heartbeat schema for stuck jobs and queue pileups.

What you get (2 files):

  • stuck-job-runbook.md
  • job-heartbeat-logging-schema.md

Preview:

text
Scope: which job type, last success time, queue depth, oldest age
Blocker: waiting on HTTP/SQL/file IO/lock? retries amplifying?
Contain: pause job type, cap concurrency, add timeouts/stop rules
Recover: reprocess safely (idempotency), dead-letter poison records
Prevent: heartbeat + max runtime + operator alerts



FAQ

How do I tell a slow job from a stuck job?

You need a progress signal, not just a "started at" timestamp. A slow job still changes state: counters increase, steps transition, and heartbeats keep arriving. A stuck job stops moving.

The simplest boundary is a max-runtime budget + heartbeat. Once you have both, "stuck" becomes an alert condition instead of a feeling.

Why not just retry until it succeeds?

Because retries can hide deterministic failures by turning them into "eventually succeeds". In the meantime, you burn worker capacity and grow the queue.

Retries only help when they're bounded and classified. Poison inputs and auth/permission failures should stop and quarantine, not retry forever.

Should a stuck job be killed and rescheduled automatically?

Sometimes, but only when you can recover safely. The safe pattern is: hit max runtime -> cancel -> verify cancellation was honored -> reschedule with a dedupe key.

If you can't guarantee idempotency, killing jobs can create duplicates. In that case, stop the blast radius first and escalate with a useful incident payload.

What should I add first if none of this exists today?

Three things: a max runtime guardrail, a heartbeat/progress log, and a poison/quarantine path. Those boundaries turn "silent failure" into "diagnosable failure".

Once those exist, you can tighten timeouts and retry policies without guessing.

Do we need to migrate to a different job platform?

Usually no. Most "jobs hang forever" incidents are boundaries + observability problems you can fix in the current system.

If you migrate platforms without adding boundaries, you'll recreate the same silent failures, just with different dashboards.

What about jobs with side effects like emails or invoices?

Make the job idempotent: use stable run ids, write dedupe records, and design external calls so repeats become safe no-ops.

If a side effect cannot be made idempotent, treat it as a gated step: require human confirmation or stop early when uncertainty is high.


Coming soon

If you want more runbooks like this (plus logging schemas and decision trees), that is what Axiom is becoming.

Join to get notified as we ship new operational assets for reliability.



Key takeaways

  • Background jobs fail quietly unless you build explicit health signals.
  • The fastest wins are timeouts, cancellation, heartbeats, and max runtime guardrails.
  • A runbook + poison handling prevents tomorrow's repeat incident.

If you want a fast diagnosis of a queue pileup or stuck jobs, see .NET Production Rescue or contact me and include one example run's logs + job history.
