OpenTelemetry for .NET: minimum viable tracing for production debugging

Feb 04, 2026 · 8 min read

Category: .NET

When incidents span multiple services and logs cannot explain latency: the smallest OpenTelemetry setup that makes production debugging possible without a full rewrite.

Latency alarms fire, but every service looks “fine” in isolation. One API shows 200s with slow durations, a downstream service shows normal CPU, and a queue consumer shows no obvious errors. The logs don’t join into a single path, and the team wastes the first hour doing timestamp archaeology: “is this the same request or a different one?”

This is the moment logs stop being enough. You need traces that show one request end-to-end across services, dependencies, and background hops. The goal here is minimum viable OpenTelemetry for .NET: a small tracing setup that makes the next incident a timeline you can prove.

If you only do three things
  • Propagate trace id and span id across every inbound request.
  • Instrument only your critical entry points and top dependencies first.
  • Log trace id with every error so logs and traces join quickly.

Why traces are required when incidents cross services

Logs show events. Traces show relationships. When latency or errors occur across services, you need to see the timeline of a request. That is impossible with logs alone unless every service logs perfectly and you have the correlation id in every log line. Most systems do not.

OpenTelemetry gives you a standard trace context and a way to export spans. The goal is not to trace everything. The goal is to trace the path that explains incidents. A minimum viable tracing setup focuses on entry points, critical dependencies, and error paths.

Tracing also helps with negative evidence. When you see that a request spent 4 seconds in a downstream dependency, you stop chasing the wrong service. This saves time and prevents unnecessary rollbacks.

The incident pattern this playbook targets (latency spikes with no single culprit)

  • Latency spikes with no obvious local cause.
  • Requests hop across multiple services and queues.
  • Logs exist but do not show the path or timing.
  • The team spends hours comparing timestamps and still cannot explain the delay.

If this pattern sounds familiar, you need traces. Not a full platform. A small, consistent setup that gives you a single timeline.

Mini incident timeline

  • Symptom: p95 jumps from ~200ms to >1.5s while error rate stays low.
  • The API reports “slow request” but can’t say whether it waited on SQL, cache, or an internal HTTP hop.
  • A rollback happens because it’s the only reversible action; it changes nothing.
  • The real issue is downstream: a queue consumer got scaled down and the backlog grew. The API is waiting on work completion but can’t prove it.

Tracing would have shown the critical fact immediately: most of the request time was spent inside one downstream hop (and exactly which hop).

Diagnosis ladder for missing traces

  1. Do you have a trace id on every request? If not, add trace propagation before adding more spans.
  2. Do your logs include the trace id? If not, you will not be able to jump from logs to traces during incidents. (A quick verification sketch follows this list.)
  3. Do you instrument at least the entry points? If only internal spans exist, you still cannot see the full request path.
  4. Do you see dependency spans? If you do not, the slow hop remains hidden.
  5. Is sampling configured? Without sampling, you might flood the system and lose data quality.
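
To check the first two rungs quickly, surface the current trace id from inside a request. A minimal sketch, assuming ASP.NET Core (which creates an Activity per request even before OpenTelemetry is added); the endpoint path and log message template are illustrative:

```csharp
// Minimal sketch: confirm that each inbound request carries a trace id and
// that the same id reaches your logs. Assumes ASP.NET Core; the endpoint
// path and message template are illustrative.
using System.Diagnostics;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.Use(async (context, next) =>
{
    var activity = Activity.Current;
    var logger = context.RequestServices.GetRequiredService<ILogger<Program>>();

    // Rung 1: is there a trace id at all? Rung 2: does it show up in logs?
    logger.LogInformation(
        "trace_id={TraceId} span_id={SpanId} incoming_traceparent={TraceParent}",
        activity?.TraceId.ToString() ?? "<none>",
        activity?.SpanId.ToString() ?? "<none>",
        context.Request.Headers["traceparent"].ToString());

    await next();
});

// Hit this endpoint from another service to verify end-to-end propagation.
app.MapGet("/debug/trace", () =>
    Results.Ok(new { trace_id = Activity.Current?.TraceId.ToString() }));

app.Run();
```

If the trace id comes back empty, or the id in the log line does not match the one returned to the caller, fix propagation before adding any more spans.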

Common misconceptions that slow teams down

What slows tracing rollouts in legacy .NET systems isn’t tooling. It’s scope mistakes.

  • You don’t need a “platform” first. Start with OpenTelemetry + a single exporter and focus on a usable timeline.
  • Instrumenting everything first backfires. It increases noise and overhead before you trust the basics. Instrument the incident-critical path first.
  • Traces don’t replace logs. Logs explain what happened at a point; traces explain how the request got there. Join them with trace_id.
  • It’s not only for microservices. Queues and background jobs create “invisible hops” even inside a single product. Traces are how you make those hops visible.

Minimum viable tracing for .NET in production

This is the smallest setup that makes incidents diagnosable without creating an observability project. A startup configuration sketch follows the lists below.

Instrument these first

  • All inbound HTTP endpoints
  • One to two critical dependencies per service, such as SQL or an external API
  • Background job entry points if they affect user requests

Add these fields to every span

  • service.name
  • deployment.environment
  • http.method
  • http.route
  • http.status_code
  • net.peer.name or db.system
  • error and exception.type when relevant

Sampling rule

  • Start with 10 percent sampling on success
  • Sample 100 percent of error responses
  • Increase only after you have query confidence
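
Wired together, the instrumentation targets, span fields, and sampling rule above fit into a few lines of startup code. A minimal sketch, assuming ASP.NET Core plus the standard OpenTelemetry packages (OpenTelemetry.Extensions.Hosting, the ASP.NET Core, HttpClient, and SqlClient instrumentations, and the OTLP exporter); the service name is a placeholder, and keeping 100 percent of error traces usually requires tail-based sampling in a collector rather than this head sampler:

```csharp
// Program.cs: minimum viable tracing. Service name and exporter target are
// placeholders; adjust to your environment.
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService(serviceName: "orders-api")                 // service.name
        .AddAttributes(new KeyValuePair<string, object>[]
        {
            new("deployment.environment", builder.Environment.EnvironmentName),
        }))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()   // inbound HTTP: method, route, status code
        .AddHttpClientInstrumentation()   // outbound HTTP dependencies
        .AddSqlClientInstrumentation()    // SQL dependencies
        // ~10% of successful traces, respecting the caller's sampling decision.
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.10)))
        .AddOtlpExporter());              // one exporter is enough to start

var app = builder.Build();
app.Run();
```

ParentBasedSampler keeps the sampling decision consistent across services: if the caller sampled the trace, downstream services keep it, so the timeline stays whole.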

A practical trace and log bridge

Tracing is only useful if you can jump from an error log to a trace in seconds. Do this by logging trace id and span id on every error and every slow request.

```json
{
  "timestamp": "2026-02-04T14:03:21.188Z",
  "level": "Error",
  "service": "orders-api",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "route": "POST /v1/orders",
  "status_code": 504,
  "duration_ms": 5200,
  "dependency_name": "inventory-service",
  "dependency_duration_ms": 4900,
  "result_class": "transient_failure"
}
```

This line tells you exactly where to look. You can pull the full trace for the slow dependency and stop guessing.
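
Getting those ids into the log line does not require a logging rewrite. A minimal sketch, assuming Microsoft.Extensions.Logging in an ASP.NET Core app; the 2-second slow-request threshold is illustrative, and how the fields render depends on your log formatter or sink:

```csharp
// Minimal sketch: make trace_id and span_id available to log output, and
// write an explicit error log for failed or slow requests. The 2-second
// threshold is illustrative; field names depend on your formatter/sink.
using System.Diagnostics;

var builder = WebApplication.CreateBuilder(args);

// Option 1: stamp TraceId/SpanId on every log scope via the logger factory.
builder.Logging.Configure(options =>
    options.ActivityTrackingOptions =
        ActivityTrackingOptions.TraceId | ActivityTrackingOptions.SpanId);

var app = builder.Build();

// Option 2: enrich error and slow-request logs explicitly from Activity.Current.
app.Use(async (context, next) =>
{
    var stopwatch = Stopwatch.StartNew();
    await next();
    stopwatch.Stop();

    if (context.Response.StatusCode >= 500 ||
        stopwatch.Elapsed > TimeSpan.FromSeconds(2))
    {
        var activity = Activity.Current;
        var logger = context.RequestServices.GetRequiredService<ILogger<Program>>();
        logger.LogError(
            "request failed or slow trace_id={TraceId} span_id={SpanId} route={Route} status_code={StatusCode} duration_ms={DurationMs}",
            activity?.TraceId.ToString(),
            activity?.SpanId.ToString(),
            context.Request.Path.Value,   // raw path; use the route template if available
            context.Response.StatusCode,
            (long)stopwatch.Elapsed.TotalMilliseconds);
    }
});

app.Run();
```

Whichever option you use, verify during a drill that pasting a trace_id from an error log into your tracing backend lands on the right trace.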

Fix plan: how to roll out tracing safely

Phase 1: Trace propagation and logging bridge

  • Add trace id and span id to structured logs.
  • Verify that trace context is propagated across HTTP calls.
  • Enable spans only for inbound requests.

Phase 2: Add critical dependency spans

  • Add spans for SQL and external HTTP calls.
  • Confirm that dependency spans include duration and status.
  • Verify sampling does not drop error traces.

Phase 3: Expand cautiously

  • Add spans for background job entry points (see the consumer sketch after this list).
  • Add one domain-specific span where incidents are frequent.
  • Monitor overhead and adjust sampling.
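
For the background-job entry points in Phase 3, spans have to be started by hand because there is no inbound HTTP request to instrument. A minimal sketch, assuming the producer injected W3C trace context (traceparent/tracestate) into message headers; the OrderConsumer type, header dictionary, and the "orders-worker" source name are illustrative, and the source must also be registered in startup with tracing.AddSource("orders-worker"):

```csharp
// Minimal sketch: start a consumer span for a queue message and link it to
// the producer's trace. Assumes message headers carry W3C trace context;
// the type, header shape, and source name are illustrative.
using System.Diagnostics;
using OpenTelemetry.Context.Propagation;

public sealed class OrderConsumer
{
    private static readonly ActivitySource Source = new("orders-worker");

    public async Task HandleAsync(
        IReadOnlyDictionary<string, string> headers,
        Func<Task> processMessage)
    {
        // Extract the upstream trace context from the message headers.
        var parent = Propagators.DefaultTextMapPropagator.Extract(
            default,
            headers,
            (carrier, key) => carrier.TryGetValue(key, out var value)
                ? new[] { value }
                : Array.Empty<string>());

        // Treat the job entry point like an inbound request.
        using var activity = Source.StartActivity(
            "orders.process", ActivityKind.Consumer, parent.ActivityContext);

        activity?.SetTag("messaging.system", "queue");   // adjust to your broker

        try
        {
            await processMessage();
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}
```

Starting the activity with ActivityKind.Consumer and the extracted parent context is what turns the queue hop from an invisible gap into a visible segment of the same trace.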

What to log so traces remain useful

  • trace_id and span_id on every error and slow request
  • result_class for success, transient failure, permanent failure
  • dependency_name and dependency_duration_ms for slow dependencies

Without these fields, traces become isolated artifacts that cannot be joined to operational data.
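
The same fields can also live on the span itself, so a trace query can filter on them without joining logs first. A minimal sketch, assuming the OpenTelemetry API package is referenced (for the RecordException extension); the result_class values, helper names, and "inventory-service" are illustrative:

```csharp
// Minimal sketch: stamp the current span with the same incident fields the
// logs carry, so traces and logs can be filtered the same way. Values such
// as "transient_failure" and "inventory-service" are illustrative.
using System.Diagnostics;
using OpenTelemetry.Trace;   // RecordException extension

public static class SpanIncidentTags
{
    public static void MarkSuccess(Activity? activity)
        => activity?.SetTag("result_class", "success");

    public static void MarkDependencyFailure(
        Activity? activity,
        string dependencyName,
        TimeSpan dependencyDuration,
        Exception ex,
        bool transient)
    {
        if (activity is null) return;

        activity.SetTag("result_class", transient ? "transient_failure" : "permanent_failure");
        activity.SetTag("dependency_name", dependencyName);
        activity.SetTag("dependency_duration_ms", (long)dependencyDuration.TotalMilliseconds);
        activity.SetStatus(ActivityStatusCode.Error, ex.Message);
        activity.RecordException(ex);   // adds exception.type / exception.message as a span event
    }
}

// Usage around a dependency call (names are illustrative):
// try { await CallInventoryAsync(); SpanIncidentTags.MarkSuccess(Activity.Current); }
// catch (HttpRequestException ex)
// {
//     SpanIncidentTags.MarkDependencyFailure(
//         Activity.Current, "inventory-service", /* elapsed */ TimeSpan.FromSeconds(4.9), ex, transient: true);
//     throw;
// }
```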

Tradeoffs and limits

Tracing adds overhead. The goal is to control it with sampling and scope. If you instrument too much, you will burn budgets and lose trust. If you instrument too little, you will still be blind. Minimum viable tracing is the middle path: enough to explain incidents, not enough to drown you in data.

Also remember that traces reveal latency but not necessarily correctness. A fast failure can still be wrong. Traces are for timing and causality. Logs and domain checks are for correctness.

Shipped asset

Download
Free

OpenTelemetry tracing starter kit for .NET

Trace propagation checklist and starter config for production debugging (free, email delivery)

What you get (4 files):

  • otel-trace-starter-checklist.md
  • otel-attribute-map.md
  • sampling-defaults.md
  • README.md
Axiom Pack
$69

Production Observability Templates

Need consistent trace + log joining across multiple services? Get field taxonomy, trace/log bridge defaults, query packs, and a rollout playbook designed for incident response.

  • Trace/log bridge defaults (trace_id + incident fields) that actually join
  • Attribute taxonomy + naming guidance to keep traces queryable
  • Rollout plan and verification checks to avoid “we added tracing and it got slower”
Get Observability Templates →

FAQ

Do I need a full observability platform or a specific backend to start?

No. OpenTelemetry is a standard and can export to many backends. You can start with a basic exporter and still get value. The key is consistent trace context and a small number of spans that explain incidents.

How many spans should I add first?

Start with one span for every inbound request and one span for each critical dependency. That is often enough to see the slow hop. Add more only after the first set is reliable.

How do I keep overhead and cost under control?

Use sampling. Sample all errors and only a fraction of successful requests. Avoid adding high-cardinality attributes and do not trace every internal function until you have a reason.

Does every log line need a trace id?

Not every log line. Only errors and slow requests require it. The purpose is to join logs and traces during incidents. That keeps volume down while still enabling fast diagnosis.

Can I trace background jobs and queue consumers?

Yes. Treat the job entry point like an inbound request and create a new trace. Propagate the trace id into any downstream calls so you can see the timeline across jobs and services.

What do traces not tell me?

Traces show timing and dependency hops. They do not explain correctness or business rules. Combine traces with structured logs that capture the result class and domain context.

If your incidents span multiple services and logs still do not explain the failure, the observability pack provides production-ready trace and logging templates that make failures diagnosable fast.

Axiom .NET Rescue (Coming Soon)

Get notified when we ship observability templates, trace defaults, and incident runbooks for .NET systems.

Key takeaways

  • Traces show relationships. Logs show events. You need both during incidents.
  • Minimum viable tracing focuses on entry points and critical dependencies.
  • Log trace id and span id on errors so you can jump from logs to traces.
  • Sampling controls cost and keeps data usable.
  • Expand instrumentation only after the core path is stable.

Recommended resources

Download the shipped checklist/templates for this post.

A minimal tracing checklist and attribute map for debugging production incidents in .NET services.
