Thread pool starvation: the silent killer of ASP.NET performance

Jan 24, 2026 · 12 min read

Category: .NET

When CPU looks fine but everything times out: how thread pool starvation happens, how to prove it with real signals, and the smallest fixes that stop repeat incidents.

Download available. Jump to the shipped asset.

At 09:12, p95 doubles across the app. Error rate creeps up. CPU is calm. Memory is flat. The database looks "fine".

By 09:31 someone recycles IIS (or restarts the service) and latency snaps back to normal. By 10:20 the slowdown returns.

That failure mode is thread pool starvation. Too many worker threads get captured waiting, so new request work queues up and everything times out.

On call, two questions decide what to do next:

  • What evidence proves this is queueing behind blocked work?
  • What is the smallest change that stops the repeats without a risky rewrite?

If you are the tech lead, the goal is a bounded stabilization plan you can ship this sprint. If you are the CTO, the goal is measurable risk reduction: fewer timeouts, less paging, and fewer "restart fixed it" incidents.

Rescuing a .NET service in production? Start at the .NET Production Rescue hub and the .NET category.


What thread pool starvation is

Thread pool starvation is not "the server is overloaded". It is "the server is waiting".

In ASP.NET, each request needs a worker thread to run its work. If enough worker threads get stuck waiting (sync-over-async waits, lock contention, slow dependency calls with no timeout budget), the process still has CPU available but it cannot schedule new request work quickly. Requests queue. Latency rises across endpoints. Timeouts appear in places that look unrelated.

Operationally, you have accidentally turned your service into a queue you did not design, size, or instrument.


Why CPU looks fine while latency explodes

CPU measures compute. Starvation is usually waiting.

A thread that is blocked on a lock or blocked on synchronous I/O is not burning CPU, but it is still consuming your most limited resource during peak traffic: runnable request capacity. When enough threads are captured, new work queues. Queueing is what drives p95 and p99 through the roof.

That is also why "restart fixed it" is a recurring clue. A restart does not make the code faster. It discards the backlog of blocked work and resets state. If the mechanism that captures threads is still present, the backlog rebuilds under the same traffic shape.


What it looks like in production

Starvation has a specific smell: everything gets slower together.

It is not one slow endpoint. It is the whole process losing the ability to schedule work. That is why symptoms spread across unrelated routes and unrelated downstreams.

Look for a cluster like this:

  • latency rises across many endpoints at the same time
  • throughput drops (req/sec decreases)
  • CPU is not pegged (often 20 to 60 percent)
  • downstream timeouts show up in bursts (HTTP, SQL, cache)
  • recycling or restarting makes it look "fixed" for a while

For background workers, the shape is similar:

  • queue depth climbs
  • oldest message age climbs
  • workers are "running" but completions slow down


Why it happens (common causes)

Most teams do not create starvation intentionally. They create it one reasonable decision at a time.

The job here is not to collect symptoms. The job is to find the mechanism that captures threads.

Sync-over-async (most common)

Somewhere, async work is forced into a blocking wait:

csharp
// Common offenders in request paths
var result = SomeAsyncCall().Result;
SomeAsyncCall().Wait();
SomeAsyncCall().GetAwaiter().GetResult();

In classic ASP.NET this can deadlock. In ASP.NET Core it often appears to work until load increases. Either way, it blocks a thread waiting for I/O. Enough blocked threads and your service becomes a queue.

The durable fix is not "add more async". The durable fix is making the hot request path async end-to-end and removing synchronous waits.
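
As a minimal before-and-after sketch (the action and repository names are illustrative, not from a real codebase):

csharp
// Before: the request thread is parked until the database call completes.
public IActionResult GetOrder(int id)
{
    var order = _ordersRepository.GetAsync(id, CancellationToken.None).Result;
    return Ok(order);
}

// After: async end-to-end. The thread returns to the pool while the I/O is in flight.
public async Task<IActionResult> GetOrder(int id, CancellationToken ct)
{
    var order = await _ordersRepository.GetAsync(id, ct);
    return Ok(order);
}

The change is mechanical, but it has to reach the whole call chain: one .Result halfway down undoes the rest.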

Long I/O with no timeouts

If a call can wait forever, eventually it will.

Typical culprits:

  • HTTP calls without a real timeout budget
  • SQL calls stuck behind locks
  • vendor SDK calls that block internally

The consequence is not just one slow call. It is thread capture. That is how a dependency glitch becomes a platform incident.
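
For illustration only (the client name and budget values are assumptions, "services" and "connection" stand in for your DI setup and open connection, and this presumes Microsoft.Extensions.Http and Microsoft.Data.SqlClient), the point is simply that every external call gets an explicit budget:

csharp
// HttpClient's default timeout is 100 seconds; a request path cannot afford that.
services.AddHttpClient("VendorApi", client =>
{
    client.Timeout = TimeSpan.FromSeconds(3);
});

// SqlCommand's default CommandTimeout is 30 seconds; make the budget deliberate.
using var command = new SqlCommand("SELECT 1", connection)
{
    CommandTimeout = 3 // seconds
};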

Lock contention and critical sections

Starvation does not require I/O.

If many threads are blocked on locks (or waiting on a single shared resource), the effect is similar: a backlog forms, but CPU looks "fine".
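
A hedged illustration (the client and method names are made up): any slow call made while holding a lock turns every other caller into a blocked thread.

csharp
private static readonly object _gate = new();

public PriceSheet GetPrices()
{
    lock (_gate)
    {
        // If this call takes 4 seconds, every other request that needs prices
        // spends those 4 seconds as a blocked thread pool thread.
        return _vendorClient.DownloadPriceSheet();
    }
}

The fix is usually to move the slow work out from under the lock (or cache its result), not to add more threads.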

Retry storms amplify the backlog

Retries are concurrency multipliers.

If a dependency slows down and you retry aggressively, you create more in-flight work. That creates more blocked threads. That increases queueing. That increases timeouts.

This is how a small blip becomes a full incident.


Decision framework: confirm starvation, then find the captor

Starvation is a scheduling failure. The question is what is capturing threads.

Use this quick framing to avoid the two most common mistakes: blaming the wrong dependency and "fixing" symptoms by adding capacity.

  • If CPU is high and stays high, you have compute saturation. Starvation may exist too, but compute is the first-order problem.
  • If GC is dominating and allocations spike, you have memory pressure. Starvation symptoms can be downstream of GC pauses.
  • If CPU is moderate and p95 rises everywhere while throughput drops, assume queueing. Then prove what is holding threads.

The rest of this post focuses on that third case: proving queueing and finding what is holding threads.


Diagnosis ladder (fast checks first)

During an incident, starvation becomes a debate. One person blames SQL. Another blames networking. Another wants to scale out.

Skip the debate. Collect evidence that explains the mechanism: what is queued, what is waiting, and what is holding threads.

1) Prove queueing, not compute

You want a picture that says: latency is rising, throughput is falling, CPU is not pegged.

Signals that count:

  • p95 and p99 rise across many endpoints at the same time
  • req/sec drops during the slowdown window
  • active requests climb and stay elevated

Interpretation: when latency rises and throughput drops while CPU stays moderate, you are looking at queueing behind a shared bottleneck.
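
If you want a cheap in-process backlog signal, .NET Core 3.0 and later expose thread pool counters you can sample; a minimal sketch, assuming you can run a small background loop:

csharp
// Sample thread pool state every few seconds. A queue that climbs and stays
// elevated while CPU is moderate is the queueing signature.
public static async Task LogThreadPoolSnapshotsAsync(CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        Console.WriteLine(
            $"threads={ThreadPool.ThreadCount} " +
            $"queued={ThreadPool.PendingWorkItemCount} " +
            $"completed={ThreadPool.CompletedWorkItemCount}");

        await Task.Delay(TimeSpan.FromSeconds(5), ct);
    }
}

If you cannot ship code during the incident, dotnet-counters can surface similar runtime counters from outside the process.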

2) Identify the dependency budget that is being violated

Queueing alone does not tell you where the fix belongs. Add structured dependency call logs and watch for one pattern: a dependency that consistently exceeds its timeout budget during the slowdown window.

This log shape is enough to be actionable:

json
{
  "ts": "2026-01-21T09:18:34.120Z",
  "level": "warning",
  "correlationId": "c-8f5e9b8f5fdd4e29",
  "endpoint": "GET /orders/{id}",
  "dependency": "sql:OrdersDb",
  "elapsedMs": 4120,
  "timeoutMs": 3000,
  "attempt": 1,
  "outcome": "timeout",
  "note": "suspect queueing / blocked threads"
}

Interpretation: if one dependency starts exceeding its budget, that is your likely captor. The fix is not more threads. The fix is timeouts, bounded concurrency, and stop rules.

3) Capture a short artifact while it is slow

One short artifact taken during the slowdown prevents days of guesswork. Aim for a 20 to 60 second capture.

Good artifacts:

  • PerfView trace
  • dotnet-trace
  • a process dump (dotnet-dump) for later analysis

The artifact should answer one question: what are thread pool threads waiting on?

4) Find the captor: sync waits, locks, or unbounded fan-out

Now look for the smoking gun mechanism:

  • sync waits on HTTP calls
  • sync waits on database calls
  • contention on one shared lock
  • unbounded parallel fan-out around a slow dependency

At this point you should be able to name the captor in one sentence.


Containment moves (during the fire)

Containment is not a perfect fix. Containment is stopping the system from getting worse.

The common chain looks like this: a dependency slows down, threads get captured, queues grow, retries amplify, then everything collapses. Containment breaks the chain.

Moves that usually buy you time:

  • reduce concurrency around the hot dependency (bulkhead)
  • lower timeouts so blocked work is released
  • disable expensive or non-essential features behind a flag
  • cap retries hard, or temporarily disable retries for the failing dependency

Scaling out can help if the service is truly capacity bound. It can also make the incident worse by increasing pressure on the same slow dependency. Treat scale-out as a hypothesis, not a reflex.


Fixes that stop repeats (smallest first)

Once you have seen starvation once, the goal is to remove thread capture, not to tune around it.

Thread pool tweaks and more servers can hide symptoms. They do not change the fact that a captured thread is still a captured thread. The same mechanism will return under load.

1) Remove sync waits in request paths

Find offenders like .Result, .Wait(), and .GetAwaiter().GetResult() in hot paths and remove them.

This is usually the highest ROI stabilization work because it reduces thread capture immediately without changing business behavior.

2) Make timeouts explicit, consistent, and logged

Timeouts are how you prevent infinite waits from capturing threads. A timeout is also a decision point: fail fast, degrade, or stop.

Prefer a consistent budget per dependency and log when the budget is exceeded. If a dependency can stall your process indefinitely, it will.
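
A sketch of one way to enforce and log a per-call budget (the 3-second value, the logger, and the repository are placeholders):

csharp
public async Task<Order> GetOrderWithBudgetAsync(int id, CancellationToken requestCt)
{
    // Combine the caller's cancellation with a hard per-dependency budget.
    using var budget = CancellationTokenSource.CreateLinkedTokenSource(requestCt);
    budget.CancelAfter(TimeSpan.FromSeconds(3));

    var sw = Stopwatch.StartNew();
    try
    {
        return await _ordersRepository.GetAsync(id, budget.Token);
    }
    catch (OperationCanceledException) when (!requestCt.IsCancellationRequested)
    {
        // The budget expired, not the caller. Log it in the dependency-call shape.
        _logger.LogWarning(
            "dependency=sql:OrdersDb elapsedMs={ElapsedMs} timeoutMs=3000 outcome=timeout",
            sw.ElapsedMilliseconds);
        throw;
    }
}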

3) Cap concurrency at the call site (bulkheads)

If your SQL server can safely handle 50 heavy queries, do not allow 500. Put the limit where it matters: the place that fans out requests.

One safe pattern is a dependency bulkhead:

csharp
private static readonly SemaphoreSlim OrdersDbBulkhead = new(50);
 
public async Task<Order> GetOrderAsync(int id, CancellationToken ct)
{
  await OrdersDbBulkhead.WaitAsync(ct);
  try
  {
    return await _ordersRepository.GetAsync(id, ct);
  }
  finally
  {
    OrdersDbBulkhead.Release();
  }
}

This does not make the dependency faster. It prevents one dependency slowdown from consuming the entire process.

4) Make retries bounded and conditional

Retries must have a budget and a stop rule. Retrying timeouts into a slow dependency multiplies concurrency and increases starvation pressure.

Bounded retries are a reliability feature. Unbounded retries are an outage amplifier.
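
A minimal sketch with no retry library (the attempt budget, the backoff, and the exception type that counts as retryable are assumptions):

csharp
public async Task<Order> GetOrderWithRetryAsync(int id, CancellationToken ct)
{
    const int maxAttempts = 2; // budget: one retry, not "until it works"

    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return await _ordersRepository.GetAsync(id, ct);
        }
        catch (TimeoutException) when (attempt < maxAttempts)
        {
            // Stop rules: only this exception type is retryable, never more than
            // the budget allows, and the caller's cancellation ends the loop.
            await Task.Delay(TimeSpan.FromMilliseconds(200 * attempt), ct);
        }
    }
}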


What to log (so the next incident is obvious)

The goal is not "more logs". The goal is one answer during an incident: are we slow because we are busy, or slow because we are queued behind blocked work?

Minimum fields that make this diagnosable (per request and per dependency call):

  • correlationId
  • endpoint (or job type)
  • dependency (e.g. sql:OrdersDb, http:VendorApi)
  • elapsedMs
  • timeoutMs (if enforced)
  • attempt (if retries exist)
  • outcome (ok | timeout | exception)

Pair those logs with two charts:

  • latency percentiles (p95 and p99)
  • throughput (req/sec)

If you can add one additional signal, add a backlog signal (active requests, request queue length, oldest request age). That turns "feels slow" into a measurable queue.
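
One low-risk way to get an active-requests signal in ASP.NET Core is a tiny counting middleware; a sketch, with export of the value left to whatever metrics pipeline you already have:

csharp
public sealed class ActiveRequestsMiddleware
{
    private static int _active;
    private readonly RequestDelegate _next;

    public ActiveRequestsMiddleware(RequestDelegate next) => _next = next;

    // Read this from your metrics exporter or a diagnostics endpoint.
    public static int ActiveRequests => _active;

    public async Task InvokeAsync(HttpContext context)
    {
        Interlocked.Increment(ref _active);
        try
        {
            await _next(context);
        }
        finally
        {
            Interlocked.Decrement(ref _active);
        }
    }
}

Register it early in the pipeline (app.UseMiddleware<ActiveRequestsMiddleware>()) so every request is counted.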

When latency rises and throughput drops while CPU stays moderate, you are looking at queueing. The dependency logs tell you what is capturing threads.


A safe rollout plan (reduce risk while you fix it)

Stabilization changes are high leverage. They are also easy to ship in a way that creates a different incident (new timeout paths, new error handling bugs, unexpected fan-out).

Roll out in a bounded way, in this order:

  • instrument first (dependency logs: duration, timeout budget, attempt number)
  • pick one hot path (the route or job that dominates the slow window)
  • remove one captor mechanism (sync wait, lock contention, or unbounded fan-out)
  • add explicit timeouts with conservative budgets and log every timeout
  • cap concurrency at the call site and watch backlog signals
  • validate under load: p95 stabilizes, throughput recovers, no recycle required

If you can only ship two changes this week, remove sync waits in the hot path and add explicit timeouts. That combination prevents a large share of "CPU is fine but everything times out" incidents.


Shipped asset

Use this when p95 is climbing, CPU is calm, and you need a disciplined way to prove queueing and find the captor.


Thread pool starvation triage package

A checklist and a logging schema you can apply to legacy services without risky rewrites. The download is a real local zip.

Download, skim, then use it as your incident checklist.

Included files:

  • thread-pool-starvation-triage-checklist.md
  • dependency-call-logging-schema.md

Preview (log shape excerpt):

json
{
  "correlationId": "c-8f5e9b8f5fdd4e29",
  "endpoint": "GET /orders/{id}",
  "dependency": "sql:OrdersDb",
  "elapsedMs": 4120,
  "timeoutMs": 3000,
  "attempt": 1,
  "outcome": "timeout"
}

Resources

Package details and download live on the resource page. The external links are the official tooling references.


FAQ

Edge cases and misconceptions that show up in real incidents.

Is thread pool starvation the same as a deadlock?

No. A deadlock is usually a hard logical wait cycle where progress can't happen at all. Thread pool starvation is typically a capacity problem: enough threads are blocked (often on I/O or locks) that the remaining runnable work can't get scheduled quickly.

The key operational difference: starvation often looks like "it sometimes recovers" and "recycling helps for a while", because the underlying code path is still capable of completing. It's just queueing behind blocked work.

Should I just raise the ThreadPool minimum thread count?

Sometimes raising ThreadPool minimums can reduce short spikes, but it can also hide the root problem and increase pressure on a slow dependency.

Treat it as a short-term containment move, not a fix. If threads are blocked on a slow SQL call, adding more threads often creates more concurrent SQL calls and makes the incident worse.

Why does restarting or recycling the process "fix" it?

Because you're throwing away the backlog and resetting thread state. That can temporarily restore latency even though the code path is still capturing threads.

If the same traffic pattern returns and the same dependency is still slow (or the same sync waits still exist), the backlog rebuilds and the incident repeats.

What's the fastest way to confirm starvation during an incident?

Capture one short artifact while the system is slow (a dotnet-trace trace, a dump, or a PerfView trace) and combine it with two charts: p95 latency + throughput.

If latency rises across endpoints while throughput drops and CPU is moderate, you're looking at queueing. The trace/dump then tells you what the threads are blocked on.

Does moving to async/await automatically fix it?

Async helps when it's end-to-end and you aren't blocking on the result. If you keep sync waits (.Result, .Wait()) in hot paths, you still capture threads.

Also, async doesn't solve lock contention or unbounded concurrency. You still need timeouts, bulkheads, and sane retry policies.

What should we monitor to catch it early?

Monitor latency percentiles and throughput together, and add structured logs for dependency calls (duration, timeout, attempt, outcome). Those three things let you detect queueing and identify which dependency is capturing threads.

If you can add one new signal: track "active requests" and "oldest request age" for the app. It turns "feels slow" into a measurable backlog.


Coming soon

Axiom is where we ship operator-grade assets for production teams: runbooks, templates, and decision trees that are designed for legacy constraints.

Join to get notified when new incident packages ship.



Key takeaways

If you only change a few things, change these.

  • Thread pool starvation is usually thread capture (sync waits, long I/O, locks), not high CPU.
  • A restart is evidence, not a fix. It discards blocked work.
  • Durable fixes remove thread capture: async end-to-end, explicit timeouts, bounded retries, and capped concurrency.

If you are dealing with this now, start at the .NET Production Rescue hub or contact me with one slow-request log (include a correlation ID and dependency timing).
