
Feb 04, 2026 · 10 min read
Category: .NET
Structured logging that actually helps: Serilog fields that matter in .NET incidents
When logs are noisy but useless: why incidents stay unsolved, which fields actually explain failures, and the minimal schema that makes .NET outages diagnosable.
Download available. Jump to the shipped asset.
PagerDuty fires: 5xx_rate{service=billing-api} > 5%. p95 is climbing, CPU is boring, and the log stream is a flood of exception text with no consistent join key. You cannot answer the on-call questions that matter: which dependency is slow, whether retries are stacking, and which route is burning the budget. Someone suggests a recycle because it is the only lever that predictably changes the graph.
The outage is expensive. The repeat outage is worse. When logs cannot explain cause and sequence, every fix is a guess and every postmortem ends with “add more logging.” This playbook is the opposite: the smallest structured field set that makes incidents diagnosable in .NET without turning your log bill into a second incident.
- Log a correlation id, dependency name, and total duration for every request.
- Record retry attempts and timeouts as structured fields, not plain text.
- Standardize field names and types so queries work across services.
Why logs fail during incidents even when there are lots of them
Logs fail when they do not map to decisions. The on-call engineer needs to answer three questions in minutes: what is failing, why it is failing, and whether the failure is spreading. If the logs do not include dependency names, durations, or retry behavior, they cannot answer those questions.
Most .NET systems log messages but not structure. They rely on human scanning and string matching. That works in development. It fails in production because the failure mode is multi step. A request fails because a dependency slowed down, which caused retries, which exhausted a pool, which triggered timeouts. Without structured fields, the log stream is a wall of noise that hides the causal chain.
A second failure is inconsistent field names across services. One service writes `CorrelationId`, another writes `correlation_id`, and a third only writes it inside the message text. During an incident, queries become unreliable, which leads to false conclusions: a team sees a partial slice of the data and assumes it is the whole system.
A third failure is logging the wrong unit. Teams log the response time of the handler but omit total request time, or log dependency time without the retries. That makes slow requests look fast and masks the actual failure budget.
The incident pattern this playbook is built for (500s, timeouts, and no answer in logs)
A stable system starts producing noisy errors. You see 500s, timeouts, and “dependency failed” warnings, but you cannot answer where time was spent or which downstream caused the cascade. CPU and memory look fine. The logs exist, but they do not let you slice by dependency, retries, timeout budget, or request path.
The underlying issue is rarely “we need more logs.” It is “we cannot reconstruct one request path with evidence.” That is a schema problem: inconsistent field names, missing durations, missing dependency identity, and retry behavior trapped in unqueryable text.
Mini incident timeline
- Alert: 5xx crosses 5% and p95 crosses 2s.
- Logs: you can see exceptions, but you cannot group by `route`, `dependency_name`, or even a shared `correlation_id`.
- Response: a “quick” redeploy adds one-off log lines to one service, but the next slow dependency hop shifts the failure signature and the new lines don’t help.
- Containment: a recycle / restart drops the in-flight pileup and buys time, but nothing was proven.
The failure is not “no logs.” It is “no narrative you can query.”
Diagnosis ladder: make logs incident-ready before you add volume
Start with the fastest checks. If any of these fails, fix it before you expand logging volume.
- Can you trace a single request across services? If you cannot follow a request from entry to dependency, your logs are not incident-ready (a minimal middleware sketch follows this list).
- Can you filter by dependency? If you cannot filter to all calls to a single downstream system, you will not find the bottleneck during a slowdown.
- Can you measure retries and timeouts? If you do not log retry attempt count and timeout outcome, you will misdiagnose retry storms as dependency flakiness.
- Can you measure total request time? If you only log inner durations, you will not see time spent waiting.
- Can you query by result class? If you cannot separate success from transient failure from permanent failure, your alerting is blind.
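For the first check, here is a minimal correlation-id middleware sketch, assuming ASP.NET Core minimal hosting (`app` is the `WebApplication`) and Serilog configured with `Enrich.FromLogContext()`; the `X-Correlation-ID` header name is a common convention, not a requirement:

```csharp
using Serilog.Context;

// Sketch: reuse the caller's correlation id when present, mint one
// otherwise, echo it back to the caller, and push it into Serilog's
// LogContext so every event logged during this request carries a
// correlation_id field.
app.Use(async (context, next) =>
{
    var correlationId = context.Request.Headers["X-Correlation-ID"].FirstOrDefault()
                        ?? Guid.NewGuid().ToString("N");
    context.Response.Headers["X-Correlation-ID"] = correlationId;

    using (LogContext.PushProperty("correlation_id", correlationId))
    {
        await next();
    }
});
```

Outgoing `HttpClient` calls should forward the same header so the id survives service boundaries; the related correlation-ID post below covers that half.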
Why teams get structured logging wrong
Common traps:
- “Add more logs.” More lines without structure increase noise and cost. The gap is missing fields, not missing volume.
- “We need a platform first.” You don’t. Serilog can emit JSON today. The hard part is agreeing on names/types and sticking to them.
- “Correlation IDs are enough.” They only let you stitch events together. You still need `dependency_name`, durations, retry/timeout outcomes, and a result class.
- “Verbose by default is safer.” In production, cardinality is a budget. Keep the schema small and stable; put high-detail payloads behind sampling or a debug toggle.
Decision framework: what to log and what to ignore
Your schema should answer these questions for every request and dependency call:
- Which request failed?
- Which dependency caused the wait or error?
- How long did the request spend waiting overall?
- Did retries or timeouts occur?
- Which error class occurred, and is it transient?
Everything else is optional. If a field does not help answer those five questions, it should not be a standard field. This keeps queries fast and dashboards stable.
The tradeoff is that you lose some ad hoc detail. That is acceptable. During an incident, clarity matters more than fullness. You can always add temporary debug logging behind a flag when you need it. Standard logging should remain small and consistent.
A minimal, incident-ready logging schema for .NET
The schema below is the minimum that makes incidents diagnosable without exploding cost. It is not a full observability stack. It is the field set that lets an on-call engineer reconstruct the incident in minutes.
Request fields
`correlation_id`, `request_id`, `route`, `method`, `status_code`, `duration_ms`, `result_class` (success, transient_failure, permanent_failure)
Dependency fields
`dependency_name`, `dependency_type` (http, sql, cache, queue), `dependency_duration_ms`, `dependency_status`, `timeout_ms`
Retry and timeout fields
`retry_attempt`, `retry_max`, `retry_wait_ms`, `timeout_result` (none, canceled, timed_out)
Context fields
`service`, `instance`, `environment`, `version`
These fields are small, stable, and queryable. They allow cross-service comparisons without guessing.
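To get the context fields onto every event, here is a minimal Serilog configuration sketch, assuming the Serilog, Serilog.Sinks.Console, and Serilog.Formatting.Compact packages; the property values are illustrative:

```csharp
using Serilog;
using Serilog.Formatting.Compact;

// Sketch: attach stable context fields once as enrichers so every event
// carries service identity, and emit newline-delimited JSON instead of
// free-form text.
Log.Logger = new LoggerConfiguration()
    .Enrich.FromLogContext()                        // picks up per-request properties
    .Enrich.WithProperty("service", "billing-api")  // illustrative values
    .Enrich.WithProperty("instance", Environment.MachineName)
    .Enrich.WithProperty("environment", "prod")
    .Enrich.WithProperty("version", "2026.02.04.1")
    .WriteTo.Console(new CompactJsonFormatter())
    .CreateLogger();
```

Note that `CompactJsonFormatter` emits compact built-in names such as `@t` and `@l` for timestamp and level; if you want the exact shape shown in the example below, rename those at the sink or use a custom formatter.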
Example: a structured log line that answers the incident questions
```json
{
  "timestamp": "2026-02-04T02:17:41.522Z",
  "level": "Error",
  "service": "billing-api",
  "instance": "api-3",
  "environment": "prod",
  "version": "2026.02.04.1",
  "correlation_id": "c7f2d8f1e9a044ef",
  "request_id": "r-9d1f8",
  "route": "POST /v1/charge",
  "method": "POST",
  "status_code": 500,
  "duration_ms": 4821,
  "dependency_name": "payments-db",
  "dependency_type": "sql",
  "dependency_duration_ms": 4700,
  "dependency_status": "timeout",
  "retry_attempt": 2,
  "retry_max": 3,
  "retry_wait_ms": 400,
  "timeout_ms": 5000,
  "timeout_result": "timed_out",
  "result_class": "transient_failure"
}
```

This single event explains the incident: a SQL dependency timed out, retries were attempted, and the total request time consumed almost the entire timeout budget. You can search by dependency and by retry attempt without guessing.
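For reference, here is a hedged sketch of a Serilog call that could produce an event with this shape, assuming the JSON configuration above; named holes in the message template are captured as structured fields, and `ForContext` attaches the rest:

```csharp
using Serilog;

// Sketch: {route}, {status_code}, and {duration_ms} in the template become
// queryable fields; ForContext attaches the dependency and retry fields.
Log.ForContext("dependency_name", "payments-db")
   .ForContext("dependency_type", "sql")
   .ForContext("dependency_duration_ms", 4700)
   .ForContext("dependency_status", "timeout")
   .ForContext("retry_attempt", 2)
   .ForContext("retry_max", 3)
   .ForContext("retry_wait_ms", 400)
   .ForContext("timeout_ms", 5000)
   .ForContext("timeout_result", "timed_out")
   .ForContext("result_class", "transient_failure")
   .Error("Request {route} returned {status_code} in {duration_ms} ms",
       "POST /v1/charge", 500, 4821);
```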
Fix plan: how to introduce the schema without causing a new incident
Start with one service that is already painful. Add the schema there and validate that the fields are consistent and queryable. Do not roll out to every service at once. This is operational work. Treat it like a migration.
Phase 1: Read-only instrumentation
- Add the schema fields but keep logging volume the same.
- Verify that field names are consistent across environments.
- Create two queries that will be used during incidents: by dependency and by correlation id.
Phase 2: Tighten and normalize
- Standardize field types. Strings for ids, integers for durations.
- Remove high cardinality fields that break queries.
- Add `result_class` and `dependency_type` if they are missing.
Phase 3: Roll out across services
- Ship a shared logging helper in a small library (one possible shape is sketched after this list).
- Enforce the schema in code review for any new endpoints.
- Build a runbook that tells the on-call engineer which fields to look at first.
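Here is one possible shape for that shared helper; the names `DependencyLog` and `TrackAsync` are illustrative, not a published library. The wrapper times the call and logs the standard dependency fields on both success and failure:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Serilog;

// Hypothetical shared helper: one place knows the field names, so call
// sites cannot drift. Assumes Serilog's ILogger.
public static class DependencyLog
{
    public static async Task<T> TrackAsync<T>(
        ILogger log, string dependencyName, string dependencyType,
        int timeoutMs, Func<Task<T>> call)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            var result = await call();
            Enrich(log, dependencyName, dependencyType, timeoutMs, sw)
                .ForContext("dependency_status", "ok")
                .Information("Dependency call completed");
            return result;
        }
        catch (Exception ex)
        {
            Enrich(log, dependencyName, dependencyType, timeoutMs, sw)
                .ForContext("dependency_status", ex is TimeoutException ? "timeout" : "error")
                .Error(ex, "Dependency call failed");
            throw;
        }
    }

    private static ILogger Enrich(
        ILogger log, string name, string type, int timeoutMs, Stopwatch sw) =>
        log.ForContext("dependency_name", name)
           .ForContext("dependency_type", type)
           .ForContext("dependency_duration_ms", sw.ElapsedMilliseconds)
           .ForContext("timeout_ms", timeoutMs);
}
```

A call site then reads `await DependencyLog.TrackAsync(Log.Logger, "payments-db", "sql", 5000, () => LoadInvoiceAsync(id))`, where the inner call is whatever hypothetical data access your service already does.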
What to log during retries and timeouts
Retries and timeouts are where most incidents hide. If you do not log them as fields, you will misclassify the failure. Use explicit fields and keep them consistent across services.
- `retry_attempt` as an integer starting at 1
- `retry_max` as the configured cap
- `retry_wait_ms` for the delay before the attempt
- `timeout_ms` as the configured budget
- `timeout_result` as `timed_out`, `canceled`, or `none`
This makes retry storms visible. It also gives you a quick way to see whether timeouts are configured at all.
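One way to capture these fields at the retry site is sketched below, assuming the Polly library for retries and Serilog for logging; the dependency name, exception type, and backoff values are illustrative:

```csharp
using Polly;
using Serilog;

// Sketch: every retry emits the structured retry fields, so a retry storm
// shows up as queryable events instead of buried exception text.
const int RetryMax = 3;

var retryPolicy = Policy
    .Handle<TimeoutException>()
    .WaitAndRetryAsync(
        retryCount: RetryMax,
        sleepDurationProvider: attempt => TimeSpan.FromMilliseconds(200 * attempt),
        onRetry: (exception, wait, attempt, _) =>
            Log.ForContext("retry_attempt", attempt)
               .ForContext("retry_max", RetryMax)
               .ForContext("retry_wait_ms", (int)wait.TotalMilliseconds)
               .ForContext("dependency_name", "payments-db") // illustrative
               .Warning(exception, "Retrying dependency call"));
```

Pair this with a terminal event that records `timeout_result` once the call either succeeds or exhausts `retry_max`, so the final outcome is queryable alongside the attempts.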
Tradeoffs and limits
Structured logging is not a substitute for tracing. It is a substitute for chaos. It gives you predictable fields that can answer the most important questions quickly. That is its value.
The cost is schema discipline. Teams have to agree on field names and types. That is not glamorous work. It does not feel like progress until the next incident happens. Then it is the difference between a 20-minute fix and a 2-hour firefight.
Do not try to solve everything with logging. Use logging to tell the incident story. Use tracing for deeper cross service latency analysis. The two work together when the fields align.
Shipped asset
Structured logging field checklist for .NET
Minimal schema and Serilog starter config for incident-ready logs (free, email delivery)
What you get (4 files):
- structured-logging-fields-checklist.md
- serilog-json-starter.json
- example-error-log.json
- README.md
Production Observability Templates
Standardizing fields across multiple .NET services? Get a complete field taxonomy, query packs, and a rollout playbook so incidents stop turning into string-search archaeology.
- ✓ Field taxonomy for request, dependency, retries, and time budgets
- ✓ Ready-to-run queries and dashboards (so on-call can answer “what waited” fast)
- ✓ Rollout guidance for legacy systems (avoid breaking production while you standardize)
Resources
Internal:
- The .NET Production Rescue hub
- The .NET category
- Requests hang forever: missing timeouts in .NET
- Correlation IDs in .NET
Troubleshooting Questions Engineers Search
Why don’t more logs make incidents easier to diagnose?
Most logs are message-only and lack structure. When you need to filter by dependency, duration, or retry attempt, you cannot. The fix is not more lines. It is a small set of consistent fields that show what happened and where time was spent.
Do I need distributed tracing to fix this?
No. Tracing is useful, but it is not the first fix. A minimal structured logging schema lets you diagnose failures immediately. Once logs are consistent, tracing becomes far easier to roll out because the same fields map to spans.
Which fields should I add first?
Correlation id, dependency name, total duration, status code, and retry or timeout fields. If you add only those, you will already answer most incident questions. Add context fields like service and version so you can tie failures to deployments.
What should I avoid logging?
Do not log raw payloads or user identifiers as top-level fields. Keep ids as opaque strings and avoid unbounded values like full URLs. If you must log more detail, include it in a message field behind sampling or a debug toggle.
Does every log line need the full schema?
No. Only request and dependency logs need the full schema. Background tasks and health checks can use a smaller subset. The key is that the incident-critical paths use consistent fields.
Why does the schema need to be shared across services?
During an incident, you will query across services. If each service uses different field names, your queries break and you miss the failure path. A shared schema is the difference between one query and ten guesses.
Coming soon
If your .NET services are noisy but still impossible to diagnose, the next step is a production-ready observability pack. It standardizes logging fields, correlation, and query packs so your next incident is explainable in minutes.
Axiom .NET Rescue (Coming Soon)
Get notified when we ship production-grade observability templates, logging schemas, and incident runbooks for .NET systems.
Key takeaways
- Logs fail when they do not map to incident decisions.
- A small, consistent schema is more valuable than high volume logging.
- Retry and timeout fields must be structured to diagnose cascades.
- Roll out logging in phases to avoid breaking production or blowing costs.
- Use logging to tell the incident story, then add tracing for deeper analysis.
Recommended resources
Download the shipped checklist/templates for this post.
A minimal schema and Serilog starter config that makes production incidents diagnosable in .NET services.
Related posts

Cannot trace requests across services: why correlation IDs die at boundaries in .NET
A production playbook for when logs exist but cannot be joined—correlation IDs die at HttpClient boundaries, jobs, and queues, making incidents unreproducible.

OpenTelemetry for .NET: minimum viable tracing for production debugging
When incidents span multiple services and logs cannot explain latency: the smallest OpenTelemetry setup that makes production debugging possible without a full rewrite.

Background jobs stuck but look healthy: why workers hang forever with no alerts in .NET
When background jobs hang but workers look healthy and queue pileup grows: why jobs fail silently without timeouts or heartbeats, and the runbook that stops repeat incidents.