Thread pool starvation: the silent killer of ASP.NET performance

Jan 24, 2026 · 12 min read

Category: .NET

When CPU looks fine but everything times out: how thread pool starvation happens, how to prove it with real signals, and the smallest fixes that stop repeat incidents.

Download available. Jump to the shipped asset.

At 09:12, p95 doubles across the app. Error rate creeps up. CPU is calm. Memory is flat. The database looks "fine".

By 09:31 someone recycles IIS (or restarts the service) and latency snaps back to normal. By 10:20 the slowdown returns.

That failure mode is thread pool starvation. Too many worker threads get captured waiting, so new request work queues up and everything times out.

On call, two questions decide what to do next:

  • What evidence proves this is queueing behind blocked work?
  • What is the smallest change that stops the repeats without a risky rewrite?

If you are the tech lead, the goal is a bounded stabilization plan you can ship this sprint. If you are the CTO, the goal is measurable risk reduction: fewer timeouts, less paging, and fewer "restart fixed it" incidents.

Rescuing a .NET service in production? Start at the .NET Production Rescue hub and the .NET category.


What thread pool starvation is

Thread pool starvation is not "the server is overloaded". It is "the server is waiting".

In ASP.NET, each request needs a worker thread to run its work. If enough worker threads get stuck waiting (sync-over-async waits, lock contention, slow dependency calls with no timeout budget), the process still has CPU available but it cannot schedule new request work quickly. Requests queue. Latency rises across endpoints. Timeouts appear in places that look unrelated.

Operationally, you have accidentally turned your service into a queue you did not design, size, or instrument.


Why CPU looks fine while latency explodes

CPU measures compute. Starvation is usually waiting.

A thread that is blocked on a lock or blocked on synchronous I/O is not burning CPU, but it is still consuming your most limited resource during peak traffic: runnable request capacity. When enough threads are captured, new work queues. Queueing is what drives p95 and p99 through the roof.

That is also why "restart fixed it" is a recurring clue. A restart does not make the code faster. It discards the backlog of blocked work and resets state. If the mechanism that captures threads is still present, the backlog rebuilds under the same traffic shape.


What it looks like in production

Starvation has a specific smell: everything gets slower together.

It is not one slow endpoint. It is the whole process losing the ability to schedule work. That is why symptoms spread across unrelated routes and unrelated downstreams.

Look for a cluster like this:

  • latency rises across many endpoints at the same time
  • throughput drops (req/sec decreases)
  • CPU is not pegged (often 20 to 60 percent)
  • downstream timeouts show up in bursts (HTTP, SQL, cache)
  • recycling or restarting makes it look "fixed" for a while

For background workers, the shape is similar:

  • queue depth climbs
  • oldest message age climbs
  • workers are "running" but completions slow down


Why it happens (common causes)

Most teams do not create starvation intentionally. They create it one reasonable decision at a time.

The job here is not to collect symptoms. The job is to find the mechanism that captures threads.

Sync-over-async (most common)

Somewhere, async work is forced into a blocking wait:

csharp
// Common offenders in request paths
var result = SomeAsyncCall().Result;
SomeAsyncCall().Wait();
SomeAsyncCall().GetAwaiter().GetResult();

In classic ASP.NET this can deadlock. In ASP.NET Core it often appears to work until load increases. Either way, it blocks a thread waiting for I/O. Enough blocked threads and your service becomes a queue.

The durable fix is not "add more async". The durable fix is making the hot request path async end-to-end and removing synchronous waits.
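
As a minimal before-and-after sketch (the action and repository names are illustrative, not from a real codebase):

csharp
// Before: the request thread is parked until the database call completes.
public IActionResult GetOrder(int id)
{
    var order = _ordersRepository.GetAsync(id, CancellationToken.None).Result;
    return Ok(order);
}

// After: async end-to-end. The thread returns to the pool while the I/O is in flight.
public async Task<IActionResult> GetOrder(int id, CancellationToken ct)
{
    var order = await _ordersRepository.GetAsync(id, ct);
    return Ok(order);
}

The change is mechanical, but it has to reach the whole call chain: one .Result halfway down undoes the rest.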

Long I/O with no timeouts

If a call can wait forever, eventually it will.

Typical culprits:

  • HTTP calls without a real timeout budget
  • SQL calls stuck behind locks
  • vendor SDK calls that block internally

The consequence is not just one slow call. It is thread capture. That is how a dependency glitch becomes a platform incident.
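
For illustration only (the client name and budget values are assumptions, "services" and "connection" stand in for your DI setup and open connection, and this presumes Microsoft.Extensions.Http and Microsoft.Data.SqlClient), the point is simply that every external call gets an explicit budget:

csharp
// HttpClient's default timeout is 100 seconds; a request path cannot afford that.
services.AddHttpClient("VendorApi", client =>
{
    client.Timeout = TimeSpan.FromSeconds(3);
});

// SqlCommand's default CommandTimeout is 30 seconds; make the budget deliberate.
using var command = new SqlCommand("SELECT 1", connection)
{
    CommandTimeout = 3 // seconds
};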

Lock contention and critical sections

Starvation does not require I/O.

If many threads are blocked on locks (or waiting on a single shared resource), the effect is similar: a backlog forms, but CPU looks "fine".
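
A hedged illustration (the client and method names are made up): any slow call made while holding a lock turns every other caller into a blocked thread.

csharp
private static readonly object _gate = new();

public PriceSheet GetPrices()
{
    lock (_gate)
    {
        // If this call takes 4 seconds, every other request that needs prices
        // spends those 4 seconds as a blocked thread pool thread.
        return _vendorClient.DownloadPriceSheet();
    }
}

The fix is usually to move the slow work out from under the lock (or cache its result), not to add more threads.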

Retry storms amplify the backlog

Retries are concurrency multipliers.

If a dependency slows down and you retry aggressively, you create more in-flight work. That creates more blocked threads. That increases queueing. That increases timeouts.

This is how a small blip becomes a full incident.


Decision framework: confirm starvation, then find the captor

Starvation is a scheduling failure. The question is what is capturing threads.

Use this quick framing to avoid the two most common mistakes: blaming the wrong dependency and "fixing" symptoms by adding capacity.

  • If CPU is high and stays high, you have compute saturation. Starvation may exist too, but compute is the first-order problem.
  • If GC is dominating and allocations spike, you have memory pressure. Starvation symptoms can be downstream of GC pauses.
  • If CPU is moderate and p95 rises everywhere while throughput drops, assume queueing. Then prove what is holding threads.

The rest of this post focuses on that third case: proving queueing and finding what is holding threads.


Diagnosis ladder (fast checks first)

During an incident, starvation becomes a debate. One person blames SQL. Another blames networking. Another wants to scale out.

Skip the debate. Collect evidence that explains the mechanism: what is queued, what is waiting, and what is holding threads.

1) Prove queueing, not compute

You want a picture that says: latency is rising, throughput is falling, CPU is not pegged.

Signals that count:

  • p95 and p99 rise across many endpoints at the same time
  • req/sec drops during the slowdown window
  • active requests climb and stay elevated

Interpretation: when latency rises and throughput drops while CPU stays moderate, you are looking at queueing behind a shared bottleneck.
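
If you want a cheap in-process backlog signal, .NET Core 3.0 and later expose thread pool counters you can sample; a minimal sketch, assuming you can run a small background loop:

csharp
// Sample thread pool state every few seconds. A queue that climbs and stays
// elevated while CPU is moderate is the queueing signature.
public static async Task LogThreadPoolSnapshotsAsync(CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        Console.WriteLine(
            $"threads={ThreadPool.ThreadCount} " +
            $"queued={ThreadPool.PendingWorkItemCount} " +
            $"completed={ThreadPool.CompletedWorkItemCount}");

        await Task.Delay(TimeSpan.FromSeconds(5), ct);
    }
}

If you cannot ship code during the incident, dotnet-counters can surface similar runtime counters from outside the process.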

2) Identify the dependency budget that is being violated

Queueing alone does not tell you where the fix belongs. Add structured dependency call logs and watch for one pattern: a dependency that consistently exceeds its timeout budget during the slowdown window.

This log shape is enough to be actionable:

json
{
  "ts": "2026-01-21T09:18:34.120Z",
  "level": "warning",
  "correlationId": "c-8f5e9b8f5fdd4e29",
  "endpoint": "GET /orders/{id}",
  "dependency": "sql:OrdersDb",
  "elapsedMs": 4120,
  "timeoutMs": 3000,
  "attempt": 1,
  "outcome": "timeout",
  "note": "suspect queueing / blocked threads"
}

Interpretation: if one dependency starts exceeding its budget, that is your likely captor. The fix is not more threads. The fix is timeouts, bounded concurrency, and stop rules.

3) Capture a short artifact while it is slow

One short artifact taken during the slowdown prevents days of guesswork. Aim for a 20 to 60 second capture.

Good artifacts:

  • PerfView trace
  • dotnet-trace
  • a process dump (dotnet-dump) for later analysis

The artifact should answer one question: what are thread pool threads waiting on?

4) Find the captor: sync waits, locks, or unbounded fan-out

Now look for the smoking gun mechanism:

  • sync waits on HTTP calls
  • sync waits on database calls
  • contention on one shared lock
  • unbounded parallel fan-out around a slow dependency

At this point you should be able to name the captor in one sentence.


Containment moves (during the fire)

Containment is not a perfect fix. Containment is stopping the system from getting worse.

The common chain looks like this: a dependency slows down, threads get captured, queues grow, retries amplify, then everything collapses. Containment breaks the chain.

Moves that usually buy you time:

  • reduce concurrency around the hot dependency (bulkhead)
  • lower timeouts so blocked work is released
  • disable expensive or non-essential features behind a flag
  • cap retries hard, or temporarily disable retries for the failing dependency

Scaling out can help if the service is truly capacity bound. It can also make the incident worse by increasing pressure on the same slow dependency. Treat scale-out as a hypothesis, not a reflex.


Fixes that stop repeats (smallest first)

Once you have seen starvation once, the goal is to remove thread capture, not to tune around it.

Thread pool tweaks and more servers can hide symptoms. They do not change the fact that a captured thread is still a captured thread. The same mechanism will return under load.

1) Remove sync waits in request paths

Find offenders like .Result, .Wait(), and .GetAwaiter().GetResult() in hot paths and remove them.

This is usually the highest ROI stabilization work because it reduces thread capture immediately without changing business behavior.

2) Make timeouts explicit, consistent, and logged

Timeouts are how you prevent infinite waits from capturing threads. A timeout is also a decision point: fail fast, degrade, or stop.

Prefer a consistent budget per dependency and log when the budget is exceeded. If a dependency can stall your process indefinitely, it will.
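
A sketch of one way to enforce and log a per-call budget (the 3-second value, the logger, and the repository are placeholders):

csharp
public async Task<Order> GetOrderWithBudgetAsync(int id, CancellationToken requestCt)
{
    // Combine the caller's cancellation with a hard per-dependency budget.
    using var budget = CancellationTokenSource.CreateLinkedTokenSource(requestCt);
    budget.CancelAfter(TimeSpan.FromSeconds(3));

    var sw = Stopwatch.StartNew();
    try
    {
        return await _ordersRepository.GetAsync(id, budget.Token);
    }
    catch (OperationCanceledException) when (!requestCt.IsCancellationRequested)
    {
        // The budget expired, not the caller. Log it in the dependency-call shape.
        _logger.LogWarning(
            "dependency=sql:OrdersDb elapsedMs={ElapsedMs} timeoutMs=3000 outcome=timeout",
            sw.ElapsedMilliseconds);
        throw;
    }
}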

3) Cap concurrency at the call site (bulkheads)

If your SQL server can safely handle 50 heavy queries, do not allow 500. Put the limit where it matters: the place that fans out requests.

One safe pattern is a dependency bulkhead:

csharp
private static readonly SemaphoreSlim OrdersDbBulkhead = new(50);
 
public async Task<Order> GetOrderAsync(int id, CancellationToken ct)
{
  await OrdersDbBulkhead.WaitAsync(ct);
  try
  {
    return await _ordersRepository.GetAsync(id, ct);
  }
  finally
  {
    OrdersDbBulkhead.Release();
  }
}

This does not make the dependency faster. It prevents one dependency slowdown from consuming the entire process.

4) Make retries bounded and conditional

Retries must have a budget and a stop rule. Retrying timeouts into a slow dependency multiplies concurrency and increases starvation pressure.

Bounded retries are a reliability feature. Unbounded retries are an outage amplifier.
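
A minimal sketch with no retry library (the attempt budget, the backoff, and the exception type that counts as retryable are assumptions):

csharp
public async Task<Order> GetOrderWithRetryAsync(int id, CancellationToken ct)
{
    const int maxAttempts = 2; // budget: one retry, not "until it works"

    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return await _ordersRepository.GetAsync(id, ct);
        }
        catch (TimeoutException) when (attempt < maxAttempts)
        {
            // Stop rules: only this exception type is retryable, never more than
            // the budget allows, and the caller's cancellation ends the loop.
            await Task.Delay(TimeSpan.FromMilliseconds(200 * attempt), ct);
        }
    }
}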


What to log (so the next incident is obvious)

The goal is not "more logs". The goal is one answer during an incident: are we slow because we are busy, or slow because we are queued behind blocked work?

Minimum fields that make this diagnosable (per request and per dependency call):

  • correlationId
  • endpoint (or job type)
  • dependency (e.g. sql:OrdersDb, http:VendorApi)
  • elapsedMs
  • timeoutMs (if enforced)
  • attempt (if retries exist)
  • outcome (ok | timeout | exception)

Pair those logs with two charts:

  • latency percentiles (p95 and p99)
  • throughput (req/sec)

If you can add one additional signal, add a backlog signal (active requests, request queue length, oldest request age). That turns "feels slow" into a measurable queue.
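
One low-risk way to get an active-requests signal in ASP.NET Core is a tiny counting middleware; a sketch, with export of the value left to whatever metrics pipeline you already have:

csharp
public sealed class ActiveRequestsMiddleware
{
    private static int _active;
    private readonly RequestDelegate _next;

    public ActiveRequestsMiddleware(RequestDelegate next) => _next = next;

    // Read this from your metrics exporter or a diagnostics endpoint.
    public static int ActiveRequests => _active;

    public async Task InvokeAsync(HttpContext context)
    {
        Interlocked.Increment(ref _active);
        try
        {
            await _next(context);
        }
        finally
        {
            Interlocked.Decrement(ref _active);
        }
    }
}

Register it early in the pipeline (app.UseMiddleware<ActiveRequestsMiddleware>()) so every request is counted.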

When latency rises and throughput drops while CPU stays moderate, you are looking at queueing. The dependency logs tell you what is capturing threads.


A safe rollout plan (reduce risk while you fix it)

Stabilization changes are high leverage. They are also easy to ship in a way that creates a different incident (new timeout paths, new error handling bugs, unexpected fan-out).

Roll out in a bounded way, in this order:

  • instrument first (dependency logs: duration, timeout budget, attempt number)
  • pick one hot path (the route or job that dominates the slow window)
  • remove one captor mechanism (sync wait, lock contention, or unbounded fan-out)
  • add explicit timeouts with conservative budgets and log every timeout
  • cap concurrency at the call site and watch backlog signals
  • validate under load: p95 stabilizes, throughput recovers, no recycle required

If you can only ship two changes this week, remove sync waits in the hot path and add explicit timeouts. That combination prevents a large share of "CPU is fine but everything times out" incidents.


Shipped asset

Use this when p95 is climbing, CPU is calm, and you need a disciplined way to prove queueing and find the captor.


Thread pool starvation triage package

A checklist and a logging schema you can apply to legacy services without risky rewrites. The download is a real local zip.

Download, skim, then use it as your incident checklist.

Included files:

  • thread-pool-starvation-triage-checklist.md
  • dependency-call-logging-schema.md

Preview (log shape excerpt):

json
{
  "correlationId": "c-8f5e9b8f5fdd4e29",
  "endpoint": "GET /orders/{id}",
  "dependency": "sql:OrdersDb",
  "elapsedMs": 4120,
  "timeoutMs": 3000,
  "attempt": 1,
  "outcome": "timeout"
}

Resources

Package details and download live on the resource page. The external links are the official tooling references.


FAQ

Edge cases and misconceptions that show up in real incidents.

Is thread pool starvation the same as a deadlock?

No. A deadlock is usually a hard logical wait cycle where progress can't happen at all. Thread pool starvation is typically a capacity problem: enough threads are blocked (often on I/O or locks) that the remaining runnable work can't get scheduled quickly.

The key operational difference: starvation often looks like "it sometimes recovers" and "recycling helps for a while", because the underlying code path is still capable of completing. It's just queueing behind blocked work.

Should I just raise the ThreadPool minimum thread count?

Sometimes raising ThreadPool minimums can reduce short spikes, but it can also hide the root problem and increase pressure on a slow dependency.

Treat it as a short-term containment move, not a fix. If threads are blocked on a slow SQL call, adding more threads often creates more concurrent SQL calls and makes the incident worse.

Why does restarting or recycling the process "fix" it?

Because you're throwing away the backlog and resetting thread state. That can temporarily restore latency even though the code path is still capturing threads.

If the same traffic pattern returns and the same dependency is still slow (or the same sync waits still exist), the backlog rebuilds and the incident repeats.

What's the fastest way to confirm starvation during an incident?

Capture one short artifact while the system is slow (a dotnet-trace trace, a dump, or a PerfView trace) and combine it with two charts: p95 latency + throughput.

If latency rises across endpoints while throughput drops and CPU is moderate, you're looking at queueing. The trace/dump then tells you what the threads are blocked on.

Does moving to async/await automatically fix it?

Async helps when it's end-to-end and you aren't blocking on the result. If you keep sync waits (.Result, .Wait()) in hot paths, you still capture threads.

Also, async doesn't solve lock contention or unbounded concurrency. You still need timeouts, bulkheads, and sane retry policies.

What should we monitor to catch it early?

Monitor latency percentiles and throughput together, and add structured logs for dependency calls (duration, timeout, attempt, outcome). Those three things let you detect queueing and identify which dependency is capturing threads.

If you can add one new signal: track "active requests" and "oldest request age" for the app. It turns "feels slow" into a measurable backlog.


Coming soon

Axiom is where we ship operator-grade assets for production teams: runbooks, templates, and decision trees that are designed for legacy constraints.

Join to get notified when new incident packages ship.



Key takeaways

If you only change a few things, change these.

  • Thread pool starvation is usually thread capture (sync waits, long I/O, locks), not high CPU.
  • A restart is evidence, not a fix. It discards blocked work.
  • Durable fixes remove thread capture: async end-to-end, explicit timeouts, bounded retries, and capped concurrency.

If you are dealing with this now, start at the .NET Production Rescue hub or contact me with one slow-request log (include a correlation ID and dependency timing).
