Timeouts first: why infinite waits create recurring outages in .NET

Jan 21, 2026 · 8 min read

Category: .NET

Infinite waits do not look like crashes. They look like calm dashboards and growing backlog. This is the production playbook for adding time budgets safely in .NET.

Download available. Jump to the shipped asset.

Most production incidents do not start as "down." They start as waiting.

At 09:12 a dependency slows down. Your ASP.NET instances look healthy. CPU is fine. Memory is fine. But requests stop finishing. In-flight count climbs. Connection pools stop turning over. You scale out and it does not help because the new instances just join the waiting.

The cost is not subtle. Backlog grows, SLAs fail, and on-call starts recycling processes because it is the only thing that releases the stuck work. Then the incident repeats next week because nothing changed about the waiting.

This post gives you a production playbook for .NET: how to set time budgets, wire cancellation, and roll it out without triggering a new outage.

Rescuing an ASP.NET service in production? Start at the .NET Production Rescue hub and the .NET category.


The mechanism: infinite waits capture capacity

Missing timeouts are not a performance problem. They are a capacity problem.

When a call can wait forever, it will eventually wait longer than your system can afford. While it waits, it holds something your service needs to operate: a worker slot, a thread, a connection, a lock, or a request budget.

Once enough requests or jobs are holding those resources, the system stops behaving like a service and starts behaving like a queue you did not design. From the outside it looks like "everything is slow." Underneath, you are accumulating work you cannot complete.
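
Here is the shape of the problem in code, as a minimal sketch (the client setup and URL are illustrative):

csharp
// Anti-pattern: this call can wait as long as the dependency wants.
// HttpClient's default Timeout is 100 seconds; disabling it removes even that bound.
var client = new HttpClient { Timeout = Timeout.InfiniteTimeSpan };

// While this await is pending it holds a connection, a request slot, and your capacity.
var response = await client.GetAsync("https://vendor.example/slow");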

Timeouts are not a tuning knob. They are a product decision about how long you are willing to wait before you choose a different path.


Diagnosis ladder (fast checks first)

Before you change numbers, classify the waiting. This prevents the classic failure: enforcing a strict timeout everywhere and then declaring "timeouts broke production."

Do these checks in order:

  • Identify the top 3 dependencies on the slow path (HTTP/SQL/vendor SDK)
  • Find the longest observed durations (not averages)
  • Confirm whether work stops after a timeout decision (cancellation honored) or keeps running (zombie work)
  • Map retries and total time budget (timeouts + retries are one policy)
  • Look for backlog signals: queue depth, oldest age, in-flight rising while completions flatten

If you cannot answer "what is our total time budget per request or job run," that is the first gap to close.
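
The third check is the one that bites hardest. A minimal sketch of what zombie work looks like, assuming a hypothetical DoWorkAsync:

csharp
// Zombie work: the caller stops waiting, but the work keeps running.
var work = DoWorkAsync(); // hypothetical; note that no CancellationToken is passed
var winner = await Task.WhenAny(work, Task.Delay(TimeSpan.FromSeconds(2)));

if (winner != work)
{
  // We made a timeout decision, but DoWorkAsync still holds its
  // connection/lock in the background. Nothing actually stopped.
}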


Where infinite waits come from in .NET estates

Infinite waits are rarely one bug. They are an emergent property of a system that is allowed to wait without limits.

These are the patterns that create repeat incidents:

  • HTTP calls without an attempt budget or without cancellation propagated into the call
  • SQL commands waiting behind locks, or long queries with no operator boundary
  • Background work with no max runtime and no heartbeat
  • Integrations that block on a vendor outage while you keep retrying

This is why process recycling "works." It does not fix the dependency. It discards the waiting work and frees resources temporarily.


Decision framework: budget, fallback, and stop rules

Teams avoid adding timeouts because they have seen the short-term effect: more errors.

That story is usually true, and it is still the wrong conclusion.

Timeouts do not create fragility. They reveal fragility that already exists: cancellation not wired through, retries stacking waits, and no fallback path when the budget is spent.

Decide three things per request or job type:

  • Total budget: how long the system can afford before it must choose a different path
  • Attempt budgets: how long any single dependency call is allowed to hold resources
  • Stop rules: when you stop, quarantine, or escalate instead of retrying optimistically

Fallback choices that work in the real world:

  • Return a clear error and stop doing work
  • Serve cached or stale data
  • Enqueue the work and respond immediately
  • Return a partial response (where safe)

If your only behavior is "hang forever," you have guaranteed that dependency slowdowns will become outages.
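
A minimal sketch of a total budget with one of those fallbacks, assuming a hypothetical GetStaleFromCacheAsync and the CallVendorAsync shown later in this post:

csharp
// Total budget: 5 seconds end-to-end, linked to the request's own token.
using var totalCts = CancellationTokenSource.CreateLinkedTokenSource(httpContext.RequestAborted);
totalCts.CancelAfter(TimeSpan.FromSeconds(5));

try
{
  return await CallVendorAsync(httpClient, totalCts.Token);
}
catch (OperationCanceledException) when (totalCts.IsCancellationRequested)
{
  // Budget spent: choose a different path instead of waiting.
  return await GetStaleFromCacheAsync(); // hypothetical fallback
}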


A timeout matrix you can actually operate

You do not need perfect numbers. You need consistent budgets and an operator story.

Start with a total budget, then allocate smaller budgets inside it. Keep it boring.

Practical starting points:

User-facing web requests

  • Total budget: 3-10 seconds depending on the page and fallback
  • Attempt budgets: usually 1-3 seconds per dependency call

Internal service-to-service calls

  • Total budget: 2-10 seconds depending on criticality
  • If slow: prefer predictable failure + queue/fallback over waiting

Background jobs

  • Budget must be explicit (seconds/minutes)
  • Always include a max runtime guardrail and a poison path
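
A minimal sketch of that guardrail, assuming a hypothetical ProcessBatchAsync:

csharp
public async Task RunJobAsync(CancellationToken shutdownToken)
{
  using var cts = CancellationTokenSource.CreateLinkedTokenSource(shutdownToken);
  cts.CancelAfter(TimeSpan.FromMinutes(5)); // explicit max runtime

  try
  {
    await ProcessBatchAsync(cts.Token); // hypothetical job body
  }
  catch (OperationCanceledException) when (!shutdownToken.IsCancellationRequested)
  {
    // Max runtime exceeded (not a shutdown): route the work to the poison path.
  }
}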

Database calls

  • Budget depends on the query type and lock profile
  • "It is slow" must become "it exceeded budget X" (observable + bounded)

Implementation patterns (bounded and boring)

The goal is not clever code. The goal is to make waiting impossible without you choosing it.

Two rules prevent most repeat incidents:

  • Always have a total budget (end-to-end)
  • Always propagate cancellation so work stops, not just waiting

Example: enforce a budget and pass the token through the call.

csharp
public async Task<VendorResult> CallVendorAsync(HttpClient httpClient, CancellationToken ct)
{
  // Link to the caller's token so an upstream cancel (or a total budget) still flows through.
  using var budgetCts = CancellationTokenSource.CreateLinkedTokenSource(ct);

  // Attempt budget: this call may hold a connection for at most 2 seconds.
  budgetCts.CancelAfter(TimeSpan.FromSeconds(2));

  using var req = new HttpRequestMessage(HttpMethod.Get, "/foo");
  using var res = await httpClient.SendAsync(req, budgetCts.Token);
  res.EnsureSuccessStatusCode();

  // ReadFromJsonAsync can return null for a literal "null" body; fail loudly instead.
  return await res.Content.ReadFromJsonAsync<VendorResult>(cancellationToken: budgetCts.Token)
    ?? throw new InvalidOperationException("Vendor returned an empty payload.");
}

SQL is similar: command timeout is different from connection timeout, and you still need an operator budget.

csharp
// CommandTimeout is in seconds; 0 means wait indefinitely, the exact failure mode above.
using var cmd = new SqlCommand(sql, connection);
cmd.CommandTimeout = 5;

// Pass the token so the command is cancelled when the budget is spent, not just abandoned.
await cmd.ExecuteNonQueryAsync(ct);

Rollout without a new incident

If you have lived without budgets for years, enforcing everywhere at once creates the wrong narrative:

"Timeouts broke production."

Timeouts did not break production. They revealed where production was already fragile.

Safe rollout in a live estate:

  • Observe: instrument durations and identify the long tail
  • Warn: log budget violations with enough context to act
  • Enforce: turn budgets on gradually and verify cancellation is honored

This is how you turn "it is hanging" into a diagnosable statement.
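
A minimal sketch of the warn phase, assuming an ILogger is in scope and the attempt budget from your matrix is 2000ms:

csharp
// Warn phase: measure against the budget and log violations, but enforce nothing yet.
var sw = System.Diagnostics.Stopwatch.StartNew();
var result = await CallVendorAsync(httpClient, ct); // a variant without CancelAfter, for now
sw.Stop();

if (sw.ElapsedMilliseconds > 2000)
{
  logger.LogWarning(
    "Budget violation: {Dependency} took {ElapsedMs}ms (budget {TimeoutMs}ms)",
    "http:VendorApi", sw.ElapsedMilliseconds, 2000);
}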


What to log so timeouts become diagnosable

Per dependency call, log:

  • correlationId or runId
  • dependency name
  • attempt
  • timeoutMs (attempt budget)
  • totalBudgetMs
  • elapsedMs
  • outcome (success/timeout/429/5xx/lock)
  • decision (stop/retry/quarantine)

Example:

json
{
  "ts": "2026-01-21T11:46:18.003Z",
  "level": "warning",
  "correlationId": "c-77d3f2f7b4fa4f35",
  "dependency": "http:VendorApi",
  "attempt": 1,
  "elapsedMs": 2105,
  "timeoutMs": 2000,
  "totalBudgetMs": 5000,
  "outcome": "timeout",
  "decision": "stop",
  "next": "fallback"
}

This turns "timeouts happened" into "vendor X exceeded budget Y for endpoint Z." That is a fixable statement.


Shipped asset

Timeout matrix template (HTTP / SQL / jobs)

Free download. Timeout matrix worksheet and a safe rollout plan for live .NET systems.

What you get (2 files):

  • timeout-matrix-template.md
  • timeouts-rollout-plan.md

Preview:

text
Request budget: ____ms
 
Dependency budgets:
- http:VendorApi   timeout=____ms  retries=__  totalBudget=____ms
- sql:OrdersDb     timeout=____ms  retries=__  totalBudget=____ms
- queue:Payments   maxRuntime=____  maxAttempts=__
 
Stop rules:
- if totalBudget exceeded -> stop + escalate payload


FAQ

Where should we add timeouts first?

Pick one dependency call that is already causing pain (often: a vendor HTTP call or a lock-heavy SQL path).

Start by measuring durations and logging budget violations. Then enforce a conservative attempt timeout and verify cancellation is honored.

Avoid enforcing budgets everywhere at once. That is how you trigger surprise cascades and blame the timeout instead of the waiting.

Is a long timeout good enough?

A long timeout is better than an infinite timeout because it still forces a decision eventually.

But if it is longer than your system can afford, it just delays discovery and captures resources while you build backlog.

The safer approach is budgets: set a total request/job budget and assign smaller per-dependency timeouts inside it, with fallbacks and stop rules.

What is the difference between HttpClient.Timeout and a CancellationToken?

HttpClient.Timeout enforces an upper bound at the client level. A CancellationToken is how you propagate a timeout/cancel decision through your code and into downstream calls that accept it.

In production you typically want a total budget that cancels work end-to-end, plus per-attempt timeouts so a single call cannot stall a worker indefinitely.

If the token is not passed into the call, you may stop waiting while the work continues in the background.
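
A minimal sketch of the two layers working together (the inline client construction is for illustration only):

csharp
// Client-level upper bound: applies to every request made through this instance.
var client = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };

// Per-call budget, expressed as a token so it flows into the call and beyond it.
using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
cts.CancelAfter(TimeSpan.FromSeconds(2));
var res = await client.GetAsync("https://vendor.example/health", cts.Token);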

Should every call share the same timeout?

No. Timeouts are tied to user expectations and fallback behavior.

Your payment provider call and your cache call should not share the same budget.

If you standardize anything, standardize the method: total budget first, then per-dependency attempt budgets, then logs that prove when budgets were exceeded.

What if enforcing timeouts creates more errors?

That is a normal phase of rollout. You are turning hidden slow failures into visible failures.

The key is to roll out safely (observe, warn, enforce) and make sure your fallback paths and stop rules are correct.

If timeouts create a new incident, it usually means cancellation was not honored, retries are stacking waits, or the fallback path is missing.

How do timeouts and retries fit together?

Treat them as one policy. A retry policy without timeouts stacks waits.

Set a per-attempt timeout and a total budget, then cap attempts. Log both so you can prove the policy behavior under stress.

If your retry policy has no total budget, you do not have a policy. You have optimism.
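
One way to express the single policy is a resilience pipeline. A sketch assuming Polly v8 (strategies added first are outermost):

csharp
using Polly;
using Polly.Retry;

var pipeline = new ResiliencePipelineBuilder()
  .AddTimeout(TimeSpan.FromSeconds(5)) // total budget (outer)
  .AddRetry(new RetryStrategyOptions
  {
    MaxRetryAttempts = 2,
    Delay = TimeSpan.FromMilliseconds(200)
  })
  .AddTimeout(TimeSpan.FromSeconds(2)) // per-attempt budget (inner)
  .Build();

// The pipeline's token carries both budgets into the call.
var result = await pipeline.ExecuteAsync(
  async token => await CallVendorAsync(httpClient, token), ct);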


Coming soon

If you want more assets like the timeout matrix (plus runbooks and logging schemas), that is what Axiom is becoming.

Join to get notified as we ship practical operational assets you can use during incidents and during rollout work, not generic tutorials.


Key takeaways

  • Infinite waits capture capacity and create backlog.
  • Timeouts are budgets + fallbacks, not a magic performance knob.
  • Roll out safely: observe -> warn -> enforce, and wire cancellation end-to-end.
