# Thread pool starvation triage checklist

Use this during an incident to confirm the starvation pattern, stop amplification, and collect one artifact that makes the root cause obvious.

This checklist is written for legacy ASP.NET and ASP.NET Core services. It assumes the system is live and the goal is a safe, bounded stabilization plan.

## Quick decision

If p95 latency rises across many endpoints while throughput drops and CPU stays moderate, assume queueing. Then prove what is capturing threads.

## What not to do first

- Do not start by raising thread pool minimums.
- Do not start by scaling out unless you have evidence the service is compute-bound.
- Do not add retries to "make it more reliable" during the slowdown window.

Those moves often hide the symptom while increasing pressure on the real failing dependency.

## 1) Confirm the symptom cluster

- Latency rises across many endpoints at once
- Throughput drops (requests per second down)
- CPU is not pegged
- Downstream timeouts appear (HTTP, SQL)
- Restart or recycle appears to fix it temporarily

Interpretation:

- This is usually queueing behind blocked work, not a single slow endpoint.
- Recycling does not fix the mechanism. It discards the backlog.
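One way to back the symptom cluster with numbers is to log a thread pool snapshot from inside the process. The sketch below assumes .NET Core 3.0 or later, where `ThreadPool.ThreadCount` and `ThreadPool.PendingWorkItemCount` exist; legacy ASP.NET on .NET Framework only has the `Get*Threads` calls. The class name `ThreadPoolSnapshot` is hypothetical.

```csharp
using System;
using System.Threading;

// Hypothetical helper: one log line describing thread pool pressure.
public static class ThreadPoolSnapshot
{
    public static string Describe()
    {
        ThreadPool.GetMaxThreads(out int maxWorkers, out _);
        ThreadPool.GetAvailableThreads(out int freeWorkers, out _);
        ThreadPool.GetMinThreads(out int minWorkers, out _);

        // Worker threads currently in use by the pool.
        int busyWorkers = maxWorkers - freeWorkers;

        return $"busyWorkers={busyWorkers} minWorkers={minWorkers} " +
               $"threadCount={ThreadPool.ThreadCount} " +
               $"pendingWorkItems={ThreadPool.PendingWorkItemCount}";
    }
}
```

Logged every few seconds during the slowdown, a rising `pendingWorkItems` alongside a high `busyWorkers` and moderate CPU is consistent with queueing behind blocked threads rather than a compute problem.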

## 2) Find sync-over-async offenders

Search the hottest paths for:

- `.Result`
- `.Wait()`
- `.GetAwaiter().GetResult()`

If you find these in request handling or high-throughput background work, treat them as primary suspects.

Targets to search:

- Controller actions and middleware
- HTTP client wrappers
- Database repositories
- Background job handlers
- Startup and singleton initializers that are hit per request

Fix direction:

- Remove sync waits and make the hot path async end-to-end.
- Do not mix async with blocking waits in the same call chain.
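As an illustration of the fix direction, here is a minimal before/after sketch for an HTTP client wrapper. `PricingClient` and its method names are hypothetical; only the `HttpClient` calls are real API.

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper; names are illustrative only.
public sealed class PricingClient
{
    private readonly HttpClient _http;

    public PricingClient(HttpClient http) => _http = http;

    // Before: .Result blocks a thread pool thread for the full duration of the call.
    public string GetPriceBlocking(string url)
    {
        return _http.GetStringAsync(url).Result;
    }

    // After: the thread returns to the pool while the I/O is in flight.
    public async Task<string> GetPriceAsync(string url, CancellationToken ct)
    {
        using HttpResponseMessage response =
            await _http.GetAsync(url, ct).ConfigureAwait(false);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync().ConfigureAwait(false);
    }
}
```

The async version only helps if every caller up the chain also awaits it. A single `.Result` in a controller or middleware above it brings the blocking back.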

## 3) Verify explicit timeouts exist

What to check:

- HTTP calls: set an explicit per-attempt timeout budget and log it.
- SQL calls: set `CommandTimeout` (not just the connection timeout) and log it.
- Jobs and queue consumers: enforce a max runtime guardrail and a stop rule for poison work.

If a call can wait forever, eventually it will.
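A sketch of what explicit timeouts can look like in code. The budget values and the `TimeoutBudgets` class are assumptions for illustration; choose real values from your own latency data. `DbCommand` is used so the example does not depend on a specific SQL client package.

```csharp
using System;
using System.Data.Common;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; budget values are placeholders, not recommendations.
public static class TimeoutBudgets
{
    public static readonly TimeSpan HttpAttemptTimeout = TimeSpan.FromSeconds(2);
    public const int SqlCommandTimeoutSeconds = 5;

    // Per-attempt HTTP timeout: cancels this attempt only and still honors the caller's token.
    public static async Task<HttpResponseMessage> GetWithTimeoutAsync(
        HttpClient http, string url, CancellationToken callerToken)
    {
        using var attemptCts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        attemptCts.CancelAfter(HttpAttemptTimeout);
        return await http.GetAsync(url, attemptCts.Token).ConfigureAwait(false);
    }

    // The command timeout is separate from the connection timeout in the connection string.
    public static void ApplyCommandTimeout(DbCommand command)
    {
        command.CommandTimeout = SqlCommandTimeoutSeconds; // value is in seconds
    }
}
```

`HttpClient.Timeout` (100 seconds by default) still applies on top of the per-attempt token, so the shorter of the two wins.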

## 4) Check for lock contention hotspots

Fast smell tests:

- A global lock that protects "everything" (global cache, token refresh, config reload)
- A singleton with shared mutable state
- A single-threaded critical section on a hot path, especially one that does I/O

Fix direction:

- Reduce lock scope.
- Move I/O out of locks.
- Add concurrency caps around dependencies instead of using locks as a throttle.
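One possible shape for the token refresh case: replace a `lock` that wraps I/O with a lock-free fast path plus an async gate, so callers waiting on a refresh do not hold threads. This is a sketch, not the only option. `TokenCache` and the `fetchToken` delegate are hypothetical stand-ins for the real refresh call.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical token cache; fetchToken stands in for the real I/O call.
public sealed class TokenCache
{
    private sealed class Entry
    {
        public Entry(string token, DateTime expiresUtc) { Token = token; ExpiresUtc = expiresUtc; }
        public string Token { get; }
        public DateTime ExpiresUtc { get; }
    }

    private readonly Func<CancellationToken, Task<string>> _fetchToken;
    private readonly TimeSpan _lifetime;
    private readonly SemaphoreSlim _refreshGate = new SemaphoreSlim(1, 1);
    private volatile Entry _current;

    public TokenCache(Func<CancellationToken, Task<string>> fetchToken, TimeSpan lifetime)
    {
        _fetchToken = fetchToken;
        _lifetime = lifetime;
    }

    public async Task<string> GetAsync(CancellationToken ct)
    {
        // Fast path: no lock and no I/O while the cached token is still valid.
        Entry entry = _current;
        if (entry != null && entry.ExpiresUtc > DateTime.UtcNow) return entry.Token;

        // Slow path: waiters are suspended asynchronously instead of blocking threads.
        await _refreshGate.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            entry = _current; // another caller may have refreshed while we waited
            if (entry != null && entry.ExpiresUtc > DateTime.UtcNow) return entry.Token;

            string token = await _fetchToken(ct).ConfigureAwait(false);
            _current = new Entry(token, DateTime.UtcNow + _lifetime);
            return token;
        }
        finally
        {
            _refreshGate.Release();
        }
    }
}
```

The refresh is still serialized, but only callers that actually need a new token wait, and they wait without occupying a thread pool thread.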

## 5) Stop amplification

- Cap concurrency around the slow dependency
- Cap retries (attempts and total time budget)
- Temporarily disable nonessential features

Containment goal:

- Free threads by reducing in-flight work.
- Stop retries from multiplying the backlog.
- Stop the slow dependency from consuming the whole process.

If you can only do one containment move, cap concurrency around the slow dependency and lower timeouts to a real budget.
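The two caps above can be sketched with a `SemaphoreSlim` bulkhead and a retry loop bounded by both attempt count and a total time budget. `Bulkhead` and `BoundedRetry` are hypothetical names, and the linear backoff is a placeholder; a resilience library may already provide equivalents in your codebase.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical bulkhead: caps in-flight calls to one dependency and sheds load when full.
public sealed class Bulkhead
{
    private readonly SemaphoreSlim _slots;
    private readonly TimeSpan _maxQueueWait;

    public Bulkhead(int maxConcurrent, TimeSpan maxQueueWait)
    {
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);
        _maxQueueWait = maxQueueWait;
    }

    public async Task<T> RunAsync<T>(Func<CancellationToken, Task<T>> call, CancellationToken ct)
    {
        // Fail fast instead of queueing without bound behind a slow dependency.
        if (!await _slots.WaitAsync(_maxQueueWait, ct).ConfigureAwait(false))
            throw new TimeoutException("Bulkhead full: dependency call rejected.");
        try
        {
            return await call(ct).ConfigureAwait(false);
        }
        finally
        {
            _slots.Release();
        }
    }
}

// Hypothetical retry helper: bounded by attempts and by a total time budget.
public static class BoundedRetry
{
    public static async Task<T> RunAsync<T>(
        Func<CancellationToken, Task<T>> call,
        int maxAttempts,
        TimeSpan totalBudget,
        CancellationToken ct)
    {
        using var budgetCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        budgetCts.CancelAfter(totalBudget);

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await call(budgetCts.Token).ConfigureAwait(false);
            }
            catch (Exception) when (attempt < maxAttempts && !budgetCts.IsCancellationRequested)
            {
                // Simple linear backoff; the delay also stops when the budget expires.
                await Task.Delay(TimeSpan.FromMilliseconds(100 * attempt), budgetCts.Token)
                    .ConfigureAwait(false);
            }
        }
    }
}
```

When the two are combined, putting the retry loop outside the bulkhead means each attempt competes for a slot again, so retries cannot exceed the concurrency cap.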

## 6) Capture one artifact while slow

Capture one short artifact while the system is still slow:

- PerfView trace
- `dotnet-trace`
- `dotnet-dump`

Goal: identify what threads are blocked on.

Practical target:

- Capture for 20 to 60 seconds during the slowdown window.
- Capture from one instance that is clearly affected.

### Option A: EventPipe trace (dotnet-trace)

Commands (run on the host with the right permissions):

1) Find the PID:

	`dotnet-trace ps`

2) Collect a short trace:

	`dotnet-trace collect -p <pid> --duration 00:00:30 --providers Microsoft-DotNETCore-SampleProfiler,Microsoft-Windows-DotNETRuntime:0x1c14fccbd:5`

Keep the trace small and time-boxed.

### Option B: Dump (dotnet-dump)

Use a dump when the process looks hung or when you need stacks for blocked work.

`dotnet-dump collect -p <pid>`

### Option C: PerfView

PerfView is often the fastest path for teams already used to it. Time-box the capture and keep it attached to the incident ticket.

## 7) What to send for a fast diagnosis

- A short incident timeline
- 3 to 10 representative slow request logs with correlation IDs
- One trace or dump captured during the slowdown

If you have structured dependency logs, include:

- the slowest 3 dependency calls for the window (elapsedMs, timeoutMs, outcome)
- retry attempt counts
- one correlationId that shows request and dependency calls together
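If structured dependency logs do not exist yet, a thin wrapper like the sketch below can produce them. The field names `elapsedMs`, `timeoutMs`, `outcome`, and `correlationId` come from the list above; `DependencyLog`, `dependency`, `attempt`, and the `writeLine` sink are assumptions. It uses `System.Text.Json`, which ships with .NET Core 3.0 and later and needs a package on .NET Framework.

```csharp
using System;
using System.Diagnostics;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical wrapper: emits one JSON log line per dependency call.
public static class DependencyLog
{
    public static async Task<T> MeasureAsync<T>(
        string dependency,
        string correlationId,
        int attempt,
        int timeoutMs,
        Func<Task<T>> call,
        Action<string> writeLine)
    {
        var stopwatch = Stopwatch.StartNew();
        string outcome = "ok";
        try
        {
            return await call().ConfigureAwait(false);
        }
        catch (OperationCanceledException)
        {
            outcome = "timeout_or_canceled";
            throw;
        }
        catch (Exception)
        {
            outcome = "error";
            throw;
        }
        finally
        {
            // Runs on success and failure, so every call leaves exactly one record.
            writeLine(JsonSerializer.Serialize(new
            {
                correlationId,
                dependency,
                attempt,
                elapsedMs = stopwatch.ElapsedMilliseconds,
                timeoutMs,
                outcome
            }));
        }
    }
}
```

Sorting these records by `elapsedMs` for the incident window gives the "slowest 3 dependency calls" directly.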

## After the incident (bounded fixes)

Pick the smallest fix that removes thread capture:

- remove sync waits in the hottest path
- add explicit timeouts and log budgets
- cap concurrency around slow dependencies (bulkheads)
- bound retries with stop rules

Validation:

- p95 and p99 stabilize
- req/sec recovers
- no recycle required to recover
