Requests hang forever: why missing timeouts cause recurring outages in .NET

Jan 21, 2026 · 12 min read

Category: .NET

When requests hang forever and only a recycle releases the stuck work: why missing timeouts create backlog, how to add budgets safely, and the rollout plan that prevents new incidents.

Free download: Timeout matrix template (HTTP / SQL / jobs). Jump to the download section.

Most production incidents do not start as "down." They start as waiting.

At 09:12 a dependency slows down. Your ASP.NET instances look healthy. CPU is fine. Memory is fine. But requests stop finishing. In-flight count climbs. Connection pools stop turning over. You scale out and it does not help because the new instances just join the waiting.

The cost is not subtle. Backlog grows, SLAs fail, and on-call starts recycling processes because it is the only thing that releases the stuck work. Then the incident repeats next week because nothing changed about the waiting.

This post gives you a production playbook for .NET: how to set time budgets, wire cancellation, and roll it out without triggering a new outage.

Rescuing an ASP.NET service in production? Start at the .NET Production Rescue hub and the .NET category.

If you only do three things
  • Write down a total budget per request/job (then enforce it).
  • Set per-attempt timeouts for each dependency and log elapsedMs, timeoutMs, and the decision (retry/stop/fallback).
  • Propagate cancellation end-to-end so work stops (no zombie work after timeouts).
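To see why the third item matters, here is a minimal, self-contained sketch (the `ZombieWorkDemo` name and the `Task.Delay` standing in for a dependency call are illustrative): with the token propagated, the work stops when the caller gives up; with `CancellationToken.None`, it becomes zombie work.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ZombieWorkDemo
{
    // Returns false when cancellation reached the work and stopped it,
    // true when the work ran to completion anyway (zombie work).
    public static async Task<bool> DoWorkAsync(bool propagateToken, CancellationToken ct)
    {
        try
        {
            // Simulated dependency call. Passing CancellationToken.None here is
            // the zombie-work bug: the caller gives up, the work keeps running.
            await Task.Delay(TimeSpan.FromSeconds(5), propagateToken ? ct : CancellationToken.None);
            return true;
        }
        catch (OperationCanceledException)
        {
            return false; // cancellation honored: the work actually stopped
        }
    }
}
```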

Why requests hang forever: infinite waits capture capacity

Missing timeouts are not a performance problem. They are a capacity problem.

When a call can wait forever, it will eventually wait longer than your system can afford. While it waits, it holds something your service needs to operate: a worker slot, a thread, a connection, a lock, or a request budget.

Once enough requests or jobs are holding those resources, the system stops behaving like a service and starts behaving like a queue you did not design. From the outside it looks like "everything is slow." Underneath, you are accumulating work you cannot complete.

Timeouts are not a tuning knob. They are a product decision about how long you are willing to wait before you choose a different path.


How to diagnose: find what's waiting forever without budgets

Before you change numbers, classify the waiting. This prevents the classic failure: enforcing a strict timeout everywhere and then declaring "timeouts broke production."

Do these checks in order:

  • Identify the top 3 dependencies on the slow path (HTTP/SQL/vendor SDK)
  • Find the longest observed durations (not averages)
  • Confirm whether work stops after a timeout decision (cancellation honored) or keeps running (zombie work)
  • Map retries and total time budget (timeouts + retries are one policy)
  • Look for backlog signals: queue depth, oldest age, in-flight rising while completions flatten
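The "does work stop after a timeout" and logging checks above can be prototyped with a small wrapper. This is a sketch, not a library API: `DependencyTimer` is a hypothetical helper whose log fields mirror the checklist, and the measured call is anything that accepts a `CancellationToken`.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class DependencyTimer
{
    // Wraps one dependency attempt: enforces the attempt budget and logs
    // dependency, elapsedMs, timeoutMs, and the outcome/decision fields.
    public static async Task<(string Outcome, long ElapsedMs)> MeasureAsync(
        string dependency,
        TimeSpan attemptTimeout,
        Func<CancellationToken, Task> call,
        CancellationToken ct = default)
    {
        using var attemptCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        attemptCts.CancelAfter(attemptTimeout);

        var sw = Stopwatch.StartNew();
        try
        {
            await call(attemptCts.Token);
            Console.WriteLine($"dependency={dependency} elapsedMs={sw.ElapsedMilliseconds} " +
                              $"timeoutMs={(long)attemptTimeout.TotalMilliseconds} outcome=success");
            return ("success", sw.ElapsedMilliseconds);
        }
        catch (OperationCanceledException) when (!ct.IsCancellationRequested)
        {
            Console.WriteLine($"dependency={dependency} elapsedMs={sw.ElapsedMilliseconds} " +
                              $"timeoutMs={(long)attemptTimeout.TotalMilliseconds} outcome=timeout decision=stop");
            return ("timeout", sw.ElapsedMilliseconds);
        }
    }
}
```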

Fast triage table (what to check first)

| Symptom | Likely cause | Confirm fast | First safe move |
| --- | --- | --- | --- |
| In-flight climbs, completions flatten, CPU looks fine | Infinite waits capturing capacity | Backlog signals (oldest request age, queue depth) rise during the slowdown | Add explicit per-attempt timeouts + a total budget; wire cancellation end-to-end |
| Requests "time out" but downstream keeps working | Zombie work (timeouts without cancellation propagation) | Work continues after client has given up; late side effects appear | Pass CancellationToken into every async call; use linked CTS with CancelAfter |
| Retry policies make the incident worse | Timeouts + retries not treated as one budget | Attempt count rises while latency rises; total time grows unbounded | Cap attempts + add a total budget; stop retrying timeouts into a slow dependency |
| One dependency dominates the slow window | Budget violated by one downstream (SQL/vendor) | Dependency logs show one name repeatedly exceeding timeoutMs | Add bulkhead/caps + a conservative timeout; add a fallback/queue path |

If you cannot answer "what is our total time budget per request or job run," that is the first gap to close.


Why infinite waits keep happening: common patterns in .NET

Infinite waits are rarely one bug. They are an emergent property of a system that is allowed to wait without limits.

These are the patterns that create repeat incidents:

  • HTTP calls without an attempt budget or without cancellation propagated into the call
  • SQL commands waiting behind locks, or long queries with no operator boundary
  • background work with no max runtime and no heartbeat
  • integrations that block on a vendor outage while you keep retrying

This is why process recycling "works." It does not fix the dependency. It discards the waiting work and frees resources temporarily.


How to fix: add budgets, fallbacks, and stop rules

Teams avoid adding timeouts because they have seen the short-term effect: more errors.

That story is usually true, and it is still the wrong conclusion.

Timeouts do not create fragility. They reveal fragility that already exists: cancellation not wired through, retries stacking waits, and no fallback path when the budget is spent.

Decide three things per request or job type:

  • Total budget: how long the system can afford before it must choose a different path
  • Attempt budgets: how long any single dependency call is allowed to hold resources
  • Stop rules: when you stop, quarantine, or escalate instead of retrying optimistically

Fallback choices that work in the real world:

  • return a clear error and stop doing work
  • serve cached or stale data
  • enqueue work and respond immediately
  • partial response (where safe)

If your only behavior is "hang forever," you have guaranteed that dependency slowdowns will become outages.
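A minimal sketch of all three decisions working together, assuming a hypothetical `vendorCall` delegate: a 5-second total budget, 2-second attempt budgets, capped attempts, and a stale-data fallback when the budget is spent. The names and numbers are illustrative, not a prescription.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BudgetedCall
{
    // Total budget: 5s. Attempt budget: 2s. Stop rule: max 2 attempts,
    // or stop immediately when the total budget is spent.
    public static async Task<string> GetQuoteAsync(Func<CancellationToken, Task<string>> vendorCall)
    {
        using var totalCts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
        for (var attempt = 1; attempt <= 2; attempt++)
        {
            using var attemptCts = CancellationTokenSource.CreateLinkedTokenSource(totalCts.Token);
            attemptCts.CancelAfter(TimeSpan.FromSeconds(2));
            try
            {
                return await vendorCall(attemptCts.Token);
            }
            catch (OperationCanceledException) when (totalCts.IsCancellationRequested)
            {
                break; // total budget spent: stop retrying
            }
            catch (OperationCanceledException)
            {
                // attempt budget spent: fall through to the next (capped) attempt
            }
        }
        return "cached-quote"; // fallback: serve stale data instead of hanging
    }
}
```

The key design choice is that retries live inside the same total budget, so a slow vendor can never multiply the waiting.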


A timeout matrix you can actually operate

You do not need perfect numbers. You need consistent budgets and an operator story.

Start with a total budget, then allocate smaller budgets inside it. Keep it boring.

Practical starting points:

User-facing web requests

  • Total budget: 3-10 seconds depending on the page and fallback
  • Attempt budgets: usually 1-3 seconds per dependency call

Internal service-to-service calls

  • Total budget: 2-10 seconds depending on criticality
  • If slow: prefer predictable failure + queue/fallback over waiting

Background jobs

  • Budget must be explicit (seconds/minutes)
  • Always include a max runtime guardrail and a poison path
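The max-runtime guardrail can be as small as this sketch (`JobGuardrail` and `processBatchAsync` are illustrative names; the batch must honor the token it is given):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class JobGuardrail
{
    // Returns true if the batch finished inside the max runtime,
    // false if the guardrail fired and the job was stopped.
    public static async Task<bool> RunAsync(TimeSpan maxRuntime, Func<CancellationToken, Task> processBatchAsync)
    {
        using var cts = new CancellationTokenSource(maxRuntime);
        try
        {
            await processBatchAsync(cts.Token);
            return true;
        }
        catch (OperationCanceledException)
        {
            // Budget spent: route the work item to the poison/quarantine path
            // instead of letting the job run forever.
            return false;
        }
    }
}
```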

Database calls

  • Budget depends on the query type and lock profile
  • "It is slow" must become "it exceeded budget X" (observable + bounded)
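Written down as data, a starting matrix might look like this sketch. The dependency names and numbers are placeholders to replace with your own measured p95/p99 plus headroom, not recommendations.

```csharp
using System.Collections.Generic;

public sealed record Budget(int AttemptTimeoutMs, int Retries, int TotalBudgetMs);

public static class TimeoutMatrix
{
    // One row per dependency; review these numbers like any other product decision.
    public static readonly Dictionary<string, Budget> Budgets = new()
    {
        ["http:VendorApi"] = new Budget(AttemptTimeoutMs: 2000, Retries: 1, TotalBudgetMs: 5000),
        ["sql:OrdersDb"]   = new Budget(AttemptTimeoutMs: 5000, Retries: 0, TotalBudgetMs: 5000),
        ["http:Cache"]     = new Budget(AttemptTimeoutMs: 300,  Retries: 0, TotalBudgetMs: 300),
    };
}
```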

Implementation patterns (bounded and boring)

The goal is not clever code. The goal is to make waiting impossible without you choosing it.

Two rules prevent most repeat incidents:

  • Always have a total budget (end-to-end)
  • Always propagate cancellation so work stops, not just waiting

Example: enforce a budget and pass the token through the call.

csharp
public async Task<VendorResult> CallVendorAsync(HttpClient httpClient, CancellationToken ct)
{
    // Attempt budget, linked to the caller's token: either one can stop the work.
    using var budgetCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
    budgetCts.CancelAfter(TimeSpan.FromSeconds(2));

    using var req = new HttpRequestMessage(HttpMethod.Get, "/foo");
    using var res = await httpClient.SendAsync(req, budgetCts.Token);
    res.EnsureSuccessStatusCode();

    return await res.Content.ReadFromJsonAsync<VendorResult>(cancellationToken: budgetCts.Token)
        ?? throw new InvalidOperationException("Vendor returned an empty body.");
}

SQL is similar: command timeout is different from connection timeout, and you still need an operator budget.

csharp
using var cmd = new SqlCommand(sql, connection);
cmd.CommandTimeout = 5; // seconds (0 means wait indefinitely: avoid it)

// Also pass a CancellationToken to the execution method where available:
// await cmd.ExecuteNonQueryAsync(ct);

Rollout without a new incident

If you have lived without budgets for years, enforcing everywhere at once creates the wrong narrative:

"Timeouts broke production."

Timeouts did not break production. They revealed where production was already fragile.

Safe rollout in a live estate:

  • Observe: instrument durations and identify the long tail
  • Warn: log budget violations with enough context to act
  • Enforce: turn budgets on gradually and verify cancellation is honored

This is how you turn "it is hanging" into a diagnosable statement.
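The warn and enforce phases can share one code path behind a flag. A sketch, with `BudgetGate` as a hypothetical helper: in warn mode it only logs the violation; flipping `enforce` turns the same budget into a real cancel.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class BudgetGate
{
    // Warn phase: enforce=false, only log budget violations.
    // Enforce phase: enforce=true, the same budget cancels the call.
    public static async Task<T> WithBudgetAsync<T>(
        string dependency, bool enforce, TimeSpan budget,
        Func<CancellationToken, Task<T>> call, CancellationToken ct = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        if (enforce) cts.CancelAfter(budget);

        var sw = Stopwatch.StartNew();
        var result = await call(cts.Token);
        if (sw.Elapsed > budget)
            Console.WriteLine($"budget-violation dependency={dependency} " +
                              $"elapsedMs={sw.ElapsedMilliseconds} budgetMs={(long)budget.TotalMilliseconds}");
        return result;
    }
}
```

Running in warn mode for a week per dependency gives you the violation log to justify each enforced budget.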


What to log so timeouts become diagnosable

Per dependency call, log:

  • correlationId or runId
  • dependency name
  • attempt
  • timeoutMs (attempt budget)
  • totalBudgetMs
  • elapsedMs
  • outcome (success/timeout/429/5xx/lock)
  • decision (stop/retry/quarantine)

Example:

json
{
  "ts": "2026-01-21T11:46:18.003Z",
  "level": "warning",
  "correlationId": "c-77d3f2f7b4fa4f35",
  "dependency": "http:VendorApi",
  "attempt": 1,
  "elapsedMs": 2105,
  "timeoutMs": 2000,
  "totalBudgetMs": 5000,
  "outcome": "timeout",
  "decision": "stop",
  "next": "fallback"
}

This turns "timeouts happened" into "vendor X exceeded budget Y for endpoint Z." That is a fixable statement.


Shipped asset

Download
Free

Timeout matrix template (HTTP / SQL / jobs)

Free download. Timeout matrix worksheet and a safe rollout plan for live .NET systems.

When to use this (fit check)
  • You have recurring “hangs” where recycling releases stuck work.
  • You can’t explain a request/job’s total time budget (timeouts + retries + queue time).
  • You need a rollout sequence that prevents the “timeouts broke production” narrative.
When NOT to use this (yet)
  • Your dominant issue is compute saturation (CPU pegged) and waits are not the primary symptom.
  • You’re looking for one global timeout value (this template is about per-dependency budgets).
  • You cannot instrument durations/log budget violations (start there first).

What you get (2 files):

  • timeout-matrix-template.md
  • timeouts-rollout-plan.md

Preview:

text
Request budget: ____ms
 
Dependency budgets:
- http:VendorApi   timeout=____ms  retries=__  totalBudget=____ms
- sql:OrdersDb     timeout=____ms  retries=__  totalBudget=____ms
- queue:Payments   maxRuntime=____  maxAttempts=__
 
Stop rules:
- if totalBudget exceeded -> stop + escalate payload



FAQ

Why do missing timeouts turn a slow dependency into an outage?

Because without explicit timeouts, a request can wait indefinitely. While it waits, it holds a worker thread, a connection, or other resources your service needs. When enough requests are waiting, the system runs out of capacity and new work queues up. The dependency slowdown becomes a platform incident because nothing forces a decision to stop waiting.

Why does recycling the process appear to fix it?

Because recycling discards all in-flight work and frees captured resources. It doesn't fix the code or the slow dependency; it just resets state temporarily. If the same traffic pattern returns and the same infinite waits still exist, requests will hang again and the incident repeats. A recycle is evidence of missing budgets, not a fix.

How do I roll out timeouts safely?

Observe first (measure durations, identify the long tail), warn second (log budget violations without enforcing), then enforce gradually (one dependency at a time). Start with the dependency that's already causing pain. Verify cancellation is honored so work stops, not just waiting. This prevents "timeouts broke production" narratives.

What happens if a timeout fires but cancellation is not propagated?

You stop waiting but the work continues in the background (zombie work). The timeout fires, your code moves on, but the HTTP call or SQL query keeps running. This wastes resources and can create confusing logs. Always propagate the CancellationToken through the call so cancellation is honored end-to-end.

What timeout values should I start with?

It depends on the dependency and your fallback behavior, but a common baseline is 1-3 seconds per attempt with a total budget of 3-10 seconds including retries. Start conservative, measure actual durations, and adjust based on p95/p99. The timeout should be long enough for the dependency to respond under normal load, but short enough that waiting doesn't capture resources indefinitely.

Do SQL commands need explicit timeouts too?

Yes. Without an explicit CommandTimeout, SQL queries can wait indefinitely (often behind locks or long table scans). Set CommandTimeout on the SqlCommand, and also pass a CancellationToken if the API supports it. Remember: connection timeout (how long to wait for a connection) is different from command timeout (how long to wait for query results). Both need budgets.

Where do I add the first timeout?

Pick one dependency call that is already causing pain (often: a vendor HTTP call or a lock-heavy SQL path). Start by measuring durations and logging budget violations. Then enforce a conservative attempt timeout and verify cancellation is honored.

Avoid enforcing budgets everywhere at once. That is how you trigger surprise cascades and blame the timeout instead of the waiting.

Is a long timeout good enough?

A long timeout is better than an infinite timeout because it still forces a decision eventually. But if it is longer than your system can afford, it just delays discovery and captures resources while you build backlog.

The safer approach is budgets: set a total request/job budget and assign smaller per-dependency timeouts inside it, with fallbacks and stop rules.

What is the difference between HttpClient.Timeout and a CancellationToken?

HttpClient.Timeout enforces an upper bound at the client level. A CancellationToken is how you propagate a timeout/cancel decision through your code and into downstream calls that accept it.

In production you typically want a total budget that cancels work end-to-end, plus per-attempt timeouts so a single call cannot stall a worker indefinitely. If the token is not passed into the call, you may stop waiting while the work continues in the background.

Should every dependency share the same timeout value?

No. Timeouts are tied to user expectations and fallback behavior. Your payment provider call and your cache call should not share the same budget.

If you standardize anything, standardize the method: total budget first, then per-dependency attempt budgets, then logs that prove when budgets were exceeded.

What if error rates go up after we add timeouts?

That is a normal phase of rollout. You are turning hidden slow failures into visible failures. The key is to roll out safely (observe, warn, enforce) and make sure your fallback paths and stop rules are correct.

If timeouts create a new incident, it usually means cancellation was not honored, retries are stacking waits, or the fallback path is missing.

How should timeouts and retries interact?

Treat them as one policy. A retry policy without timeouts stacks waits. Set a per-attempt timeout and a total budget, then cap attempts. Log both so you can prove the policy behavior under stress.

If your retry policy has no total budget, you do not have a policy. You have optimism.


Coming soon

If you want more assets like the timeout matrix (plus runbooks and logging schemas), that is what Axiom is becoming.

Join to get notified as we ship practical operational assets you can use during incidents and during rollout work, not generic tutorials.


Checklist (copy/paste)

  • Each request/job has a documented total budget.
  • Each dependency call has an explicit attempt timeout (HTTP/SQL/vendor SDK).
  • Cancellation is propagated end-to-end (no zombie work after timeout).
  • Retries + timeouts are treated as one policy (total budget + capped attempts).
  • Dependency logs include: correlation/run id, dependency, attempt, elapsedMs, timeoutMs, totalBudget, outcome, decision.
  • Backlog signals exist (in-flight, queue depth, oldest age) and are monitored.
  • Rollout follows observe → warn → enforce and is done one dependency at a time.
  • A fallback exists when the budget is spent (stop, degrade, cache, queue).

Key takeaways

  • Infinite waits capture capacity and create backlog.
  • Timeouts are budgets + fallbacks, not a magic performance knob.
  • Roll out safely: observe -> warn -> enforce, and wire cancellation end-to-end.

Recommended resources

Download the shipped checklist/templates for this post.

A practical worksheet to set request/job budgets, allocate per-dependency timeouts, and define stop rules and logging fields.
