# Dependency call logging schema (thread pool starvation)

The goal is to make one incident question answerable fast:

Are we slow because we are busy, or slow because we are queued behind blocked work?

This schema is designed for legacy ASP.NET and ASP.NET Core services.

If you already log dependency calls but not the timeout budget (`timeoutMs`) and the attempt number (`attempt`), you will still end up debating root cause during incidents.

## Principles

- Log one event per dependency call (SQL, HTTP, cache)
- Include the timeout budget and attempt number
- Always include a correlation ID that connects request to dependency calls
- Keep the fields stable so you can query them during incidents

Practical rule:

- If it is not queryable in 60 seconds during an incident, it is not a useful field.

## Required fields

These fields are the minimum that make starvation diagnosable.

### Request context

- `timestamp`
- `correlationId`
- `service`
- `environment`
- `endpoint` (route template if possible)
- `httpMethod`

Notes:

- Use a route template for `endpoint` when possible (`/orders/{id}`), not the raw path.
- Keep `correlationId` consistent across request logs and dependency logs.

### Dependency call

- `dependency` (example: `sql:OrdersDb`, `http:VendorApi`, `cache:Redis`)
- `dependencyOperation` (example: `SELECT Orders`, `POST /v1/payments`)
- `elapsedMs`
- `timeoutMs`
- `attempt`
- `outcome` (one of `ok`, `timeout`, `exception`, `canceled`)

Notes:

- `timeoutMs` should be the enforced budget for that attempt.
- `elapsedMs` should be the end-to-end wall-clock time for that attempt, so it can be compared directly with `timeoutMs`.
- Always log `attempt`, even if most calls are attempt 1.
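
As a rough illustration of the fields above, here is a minimal, language-neutral sketch in Python of a wrapper that times one attempt and emits one event for it. The names `log_dependency_call`, `CallCanceled`, `context`, and `emit` are hypothetical and not part of any library. A .NET service would do the equivalent in a `DelegatingHandler`, a command interceptor, or similar.

```python
import json
import time
from datetime import datetime, timezone


class CallCanceled(Exception):
    """Hypothetical marker for a call abandoned because the caller went away."""


def log_dependency_call(call, *, dependency, operation, timeout_ms, attempt,
                        context, emit=print):
    """Run call(timeout_seconds) once and emit exactly one structured event."""
    started = time.monotonic()  # monotonic clock: immune to wall-clock jumps
    outcome, exc_type = "ok", None
    try:
        return call(timeout_ms / 1000.0)
    except TimeoutError as exc:
        outcome, exc_type = "timeout", type(exc).__name__
        raise
    except CallCanceled as exc:
        outcome, exc_type = "canceled", type(exc).__name__
        raise
    except Exception as exc:
        outcome, exc_type = "exception", type(exc).__name__
        raise
    finally:
        # The event is written on every path, including failures.
        now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
        event = {
            "timestamp": now.replace("+00:00", "Z"),
            **context,  # correlationId, service, environment, endpoint, httpMethod
            "dependency": dependency,
            "dependencyOperation": operation,
            "elapsedMs": int((time.monotonic() - started) * 1000),
            "timeoutMs": timeout_ms,  # the budget enforced for this attempt
            "attempt": attempt,
            "outcome": outcome,
        }
        if exc_type is not None:
            event["exceptionType"] = exc_type
        emit(json.dumps(event))
```

The wrapper re-raises after logging, so callers keep their existing error handling and the log line is a side effect only.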

### Optional but high value

- `statusCode` (for HTTP dependencies)
- `exceptionType`
- `exceptionMessage` (truncate)
- `threadPoolAvailableWorkers` (sampled)
- `threadPoolAvailableIo` (sampled)
- `activeRequests` (if you can expose it)
- `queueLength` (if you have a request queue)
- `oldestRequestAgeMs` (if you track in-flight request start times)

Keep these optional fields sampled. Do not compute expensive values per call.

If you can only add one optional field, add one that helps prove backlog (`activeRequests`, `queueLength`, or `oldestRequestAgeMs`).
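
One way to keep optional fields sampled is to cache the expensive reading and refresh it at most once per interval, so each dependency event reuses the last snapshot. The sketch below is an assumption about how that might look. `SampledSnapshot`, `read`, and the one-second default are illustrative names and values, not an existing API.

```python
import threading
import time


class SampledSnapshot:
    """Cache an expensive reading and refresh it at most once per interval."""

    def __init__(self, read, interval_s=1.0, clock=time.monotonic):
        self._read = read              # callable returning a dict of fields
        self._interval_s = interval_s  # assumed default; tune per service
        self._clock = clock            # injectable so tests can fake time
        self._lock = threading.Lock()
        self._value = None
        self._expires_at = None

    def get(self):
        with self._lock:
            now = self._clock()
            if self._expires_at is None or now >= self._expires_at:
                # Only this branch pays the cost of the real measurement.
                self._value = self._read()
                self._expires_at = now + self._interval_s
            return self._value
```

A logging wrapper would then merge `snapshot.get()` into each event instead of querying thread pool or queue state per call.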

## Example event

```json
{
  "timestamp": "2026-01-21T09:18:34.120Z",
  "level": "warning",
  "correlationId": "c-8f5e9b8f5fdd4e29",
  "service": "orders-api",
  "environment": "prod",
  "endpoint": "/orders/{id}",
  "httpMethod": "GET",
  "dependency": "sql:OrdersDb",
  "dependencyOperation": "SELECT order by id",
  "elapsedMs": 4120,
  "timeoutMs": 3000,
  "attempt": 1,
  "outcome": "timeout"
}
```

If you log at info level by default, consider raising the level to warning when `elapsedMs` is close to or above `timeoutMs`.
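
A minimal sketch of that escalation rule, assuming "close to" means 80% of the budget. The `near_ratio` threshold and the `level_for` name are illustrative choices, not values from this schema.

```python
def level_for(elapsed_ms, timeout_ms, near_ratio=0.8):
    """Return "warning" when a call used most or all of its budget."""
    if timeout_ms <= 0:
        # No usable budget was logged, so there is nothing to compare against.
        return "info"
    return "warning" if elapsed_ms >= near_ratio * timeout_ms else "info"
```

For the example event above, `level_for(4120, 3000)` yields `"warning"` because the call ran past its 3000 ms budget.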

## Queries to keep ready

- Top dependencies by p95 `elapsedMs`
- Timeouts by `dependency` and `endpoint`
- Retry amplification: count of dependency calls grouped by `correlationId`
- Incident proof: p95 latency rises while throughput falls, and dependency timeouts increase over the same window
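
If the log backend is unavailable or you are working from an exported batch of events, the first three questions can be approximated offline. The sketch below assumes `events` is a list of dicts shaped like the example event. The function names are hypothetical, and the p95 uses a simple nearest-rank method, so it may differ slightly from a backend's approximate percentile.

```python
from collections import Counter, defaultdict


def timeouts_by_dependency_and_endpoint(events):
    """Count timeout events per (dependency, endpoint) pair."""
    return Counter(
        (e["dependency"], e["endpoint"])
        for e in events
        if e["outcome"] == "timeout"
    )


def calls_per_correlation_id(events):
    """Retry amplification signal: dependency calls made per request."""
    return Counter(e["correlationId"] for e in events)


def p95_elapsed_by_dependency(events):
    """Nearest-rank p95 of elapsedMs for each dependency."""
    by_dependency = defaultdict(list)
    for e in events:
        by_dependency[e["dependency"]].append(e["elapsedMs"])
    result = {}
    for dependency, values in by_dependency.items():
        values.sort()
        # Integer form of ceil(0.95 * n), which avoids float rounding surprises.
        rank = (95 * len(values) + 99) // 100
        result[dependency] = values[rank - 1]
    return result
```

`Counter.most_common()` on the first two results gives the same "order by count descending" view as the backend queries.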

### Kusto-style examples (Azure Monitor, Application Insights)

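These are intentionally generic. Adapt table and field names to your schema. In Application Insights, custom properties usually land in `customDimensions`, so a filter typically looks like `tostring(customDimensions.outcome) == "timeout"` and numeric fields need a cast such as `todouble(customDimensions.elapsedMs)`.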

- Timeouts by dependency and endpoint:

  `traces | where outcome == "timeout" | summarize count() by dependency, endpoint | order by count_ desc`

- p95 latency by dependency:

  `traces | summarize p95 = percentile(elapsedMs, 95) by dependency | order by p95 desc`

- Retry amplification (calls per correlationId):

  `traces | summarize calls = count() by correlationId | order by calls desc`

### Splunk-style examples

- Timeouts by dependency:

  `index=prod outcome=timeout | stats count by dependency | sort -count`

- Retry amplification:

  `index=prod | stats count as calls by correlationId | sort -calls | head 50`

## Operator guidance

- If `elapsedMs` frequently exceeds `timeoutMs`, calls are running past their budget and capturing threads. Lower timeouts, cap concurrency, and stop retries.
- If retries exist, cap attempts and enforce a total time budget. Unbounded retries create queueing.
- If you cannot add metrics, structured dependency logs are the next best tool. They make the incident explain itself.
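
The retry guidance above (cap attempts and enforce a total time budget) can be sketched as follows. This is an illustration under stated assumptions: `call_with_budget` and its default limits are hypothetical, `call` is assumed to accept `(timeout_ms, attempt)` and raise `TimeoutError` when it exceeds its budget, and backoff and jitter are left out for brevity.

```python
import time


def call_with_budget(call, *, max_attempts=3, per_attempt_timeout_ms=1000,
                     total_budget_ms=2500, clock=time.monotonic):
    """Retry on timeout, bounded by both an attempt cap and a total deadline."""
    deadline = clock() + total_budget_ms / 1000.0
    attempts_made = 0
    last_error = None
    for attempt in range(1, max_attempts + 1):
        remaining_ms = int((deadline - clock()) * 1000)
        if remaining_ms <= 0:
            break  # total budget spent: stop instead of queueing more work
        # Never give a single attempt more time than the overall budget has left.
        timeout_ms = min(per_attempt_timeout_ms, remaining_ms)
        attempts_made = attempt
        try:
            return call(timeout_ms, attempt)
        except TimeoutError as exc:
            last_error = exc
    raise TimeoutError(
        f"gave up after {attempts_made} attempt(s) within {total_budget_ms} ms"
    ) from last_error
```

Passing the shrinking `timeout_ms` and the `attempt` number into the call means both values are available to log as `timeoutMs` and `attempt` on each event.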

What good looks like:

- You can name the captor in one minute ("sql:OrdersDb exceeded a 3s budget and retries multiplied the backlog").
- You can prove queueing (p95 up, throughput down, CPU moderate) and show the exact dependency budget violations.
- After the fix, timeouts by dependency drop and the backlog signal stops climbing.
