OpenTelemetry for .NET: minimum viable tracing for production debugging

Feb 04, 2026 · 8 min read

Category: .NET

When incidents span multiple services and logs cannot explain latency: the smallest OpenTelemetry setup that makes production debugging possible without a full rewrite.

Latency alarms fire, but every service looks “fine” in isolation. One API shows 200s with slow durations, a downstream service shows normal CPU, and a queue consumer shows no obvious errors. The logs don’t join into a single path, and the team wastes the first hour doing timestamp archaeology: “is this the same request or a different one?”

This is the moment logs stop being enough. You need traces that show one request end-to-end across services, dependencies, and background hops. The goal here is minimum viable OpenTelemetry for .NET: a small tracing setup that makes the next incident a timeline you can prove.

If you only do three things
  • Propagate trace id and span id across every inbound request.
  • Instrument only your critical entry points and top dependencies first.
  • Log trace id with every error so logs and traces join quickly.

Why traces are required when incidents cross services

Logs show events. Traces show relationships. When latency or errors occur across services, you need to see the timeline of a request. That is impossible with logs alone unless every service logs perfectly and you have the correlation id in every log line. Most systems do not.

OpenTelemetry gives you a standard trace context and a way to export spans. The goal is not to trace everything. The goal is to trace the path that explains incidents. A minimum viable tracing setup focuses on entry points, critical dependencies, and error paths.

Tracing also helps with negative evidence. When you see that a request spent 4 seconds in a downstream dependency, you stop chasing the wrong service. This saves time and prevents unnecessary rollbacks.

The incident pattern this playbook targets (latency spikes with no single culprit)

  • Latency spikes with no obvious local cause.
  • Requests hop across multiple services and queues.
  • Logs exist but do not show the path or timing.
  • The team spends hours comparing timestamps and still cannot explain the delay.

If this pattern sounds familiar, you need traces. Not a full platform. A small, consistent setup that gives you a single timeline.

Mini incident timeline

  • Symptom: p95 jumps from ~200ms to >1.5s while error rate stays low.
  • The API reports “slow request” but can’t say whether it waited on SQL, cache, or an internal HTTP hop.
  • A rollback happens because it’s the only reversible action; it changes nothing.
  • The real issue is downstream: a queue consumer got scaled down and the backlog grew. The API is waiting on work completion but can’t prove it.

Tracing would have shown the critical fact immediately: most of the request time was spent inside one downstream hop (and exactly which hop).

Diagnosis ladder for missing traces

  1. Do you have a trace id on every request? If not, add trace propagation before adding more spans.
  2. Do your logs include the trace id? If not, you will not be able to jump from logs to traces during incidents. (A quick verification sketch follows this list.)
  3. Do you instrument at least the entry points? If only internal spans exist, you still cannot see the full request path.
  4. Do you see dependency spans? If you do not, the slow hop remains hidden.
  5. Is sampling configured? Without sampling, you might flood the system and lose data quality.
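
To check the first two rungs quickly, surface the current trace id from inside a request. A minimal sketch, assuming ASP.NET Core (which creates an Activity per request even before OpenTelemetry is added); the endpoint path and log message template are illustrative:

```csharp
// Minimal sketch: confirm that each inbound request carries a trace id and
// that the same id reaches your logs. Assumes ASP.NET Core; the endpoint
// path and message template are illustrative.
using System.Diagnostics;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.Use(async (context, next) =>
{
    var activity = Activity.Current;
    var logger = context.RequestServices.GetRequiredService<ILogger<Program>>();

    // Rung 1: is there a trace id at all? Rung 2: does it show up in logs?
    logger.LogInformation(
        "trace_id={TraceId} span_id={SpanId} incoming_traceparent={TraceParent}",
        activity?.TraceId.ToString() ?? "<none>",
        activity?.SpanId.ToString() ?? "<none>",
        context.Request.Headers["traceparent"].ToString());

    await next();
});

// Hit this endpoint from another service to verify end-to-end propagation.
app.MapGet("/debug/trace", () =>
    Results.Ok(new { trace_id = Activity.Current?.TraceId.ToString() }));

app.Run();
```

If the trace id comes back empty, or the id in the log line does not match the one returned to the caller, fix propagation before adding any more spans.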

Common misconceptions that slow teams down

What slows tracing rollouts in legacy .NET systems isn’t tooling. It’s scope mistakes.

  • You don’t need a “platform” first. Start with OpenTelemetry + a single exporter and focus on a usable timeline.
  • Instrumenting everything first backfires. It increases noise and overhead before you trust the basics. Instrument the incident-critical path first.
  • Traces don’t replace logs. Logs explain what happened at a point; traces explain how the request got there. Join them with trace_id.
  • It’s not only for microservices. Queues and background jobs create “invisible hops” even inside a single product. Traces are how you make those hops visible.

Minimum viable tracing for .NET in production

This is the smallest setup that makes incidents diagnosable without creating an observability project. A startup configuration sketch follows the lists below.

Instrument these first

  • All inbound HTTP endpoints
  • One to two critical dependencies per service, such as SQL or an external API
  • Background job entry points if they affect user requests

Add these fields to every span

  • service.name
  • deployment.environment
  • http.method
  • http.route
  • http.status_code
  • net.peer.name or db.system
  • error and exception.type when relevant

Sampling rule

  • Start with 10 percent sampling on success
  • Sample 100 percent of error responses
  • Increase only after you have query confidence
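
Wired together, the instrumentation targets, span fields, and sampling rule above fit into a few lines of startup code. A minimal sketch, assuming ASP.NET Core plus the standard OpenTelemetry packages (OpenTelemetry.Extensions.Hosting, the ASP.NET Core, HttpClient, and SqlClient instrumentations, and the OTLP exporter); the service name is a placeholder, and keeping 100 percent of error traces usually requires tail-based sampling in a collector rather than this head sampler:

```csharp
// Program.cs: minimum viable tracing. Service name and exporter target are
// placeholders; adjust to your environment.
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService(serviceName: "orders-api")                 // service.name
        .AddAttributes(new KeyValuePair<string, object>[]
        {
            new("deployment.environment", builder.Environment.EnvironmentName),
        }))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()   // inbound HTTP: method, route, status code
        .AddHttpClientInstrumentation()   // outbound HTTP dependencies
        .AddSqlClientInstrumentation()    // SQL dependencies
        // ~10% of successful traces, respecting the caller's sampling decision.
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.10)))
        .AddOtlpExporter());              // one exporter is enough to start

var app = builder.Build();
app.Run();
```

ParentBasedSampler keeps the sampling decision consistent across services: if the caller sampled the trace, downstream services keep it, so the timeline stays whole.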

A practical trace and log bridge

Tracing is only useful if you can jump from an error log to a trace in seconds. Do this by logging trace id and span id on every error and every slow request.

```json
{
  "timestamp": "2026-02-04T14:03:21.188Z",
  "level": "Error",
  "service": "orders-api",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "route": "POST /v1/orders",
  "status_code": 504,
  "duration_ms": 5200,
  "dependency_name": "inventory-service",
  "dependency_duration_ms": 4900,
  "result_class": "transient_failure"
}
```

This line tells you exactly where to look. You can pull the full trace for the slow dependency and stop guessing.
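
Getting those ids into the log line does not require a logging rewrite. A minimal sketch, assuming Microsoft.Extensions.Logging in an ASP.NET Core app; the 2-second slow-request threshold is illustrative, and how the fields render depends on your log formatter or sink:

```csharp
// Minimal sketch: make trace_id and span_id available to log output, and
// write an explicit error log for failed or slow requests. The 2-second
// threshold is illustrative; field names depend on your formatter/sink.
using System.Diagnostics;

var builder = WebApplication.CreateBuilder(args);

// Option 1: stamp TraceId/SpanId on every log scope via the logger factory.
builder.Logging.Configure(options =>
    options.ActivityTrackingOptions =
        ActivityTrackingOptions.TraceId | ActivityTrackingOptions.SpanId);

var app = builder.Build();

// Option 2: enrich error and slow-request logs explicitly from Activity.Current.
app.Use(async (context, next) =>
{
    var stopwatch = Stopwatch.StartNew();
    await next();
    stopwatch.Stop();

    if (context.Response.StatusCode >= 500 ||
        stopwatch.Elapsed > TimeSpan.FromSeconds(2))
    {
        var activity = Activity.Current;
        var logger = context.RequestServices.GetRequiredService<ILogger<Program>>();
        logger.LogError(
            "request failed or slow trace_id={TraceId} span_id={SpanId} route={Route} status_code={StatusCode} duration_ms={DurationMs}",
            activity?.TraceId.ToString(),
            activity?.SpanId.ToString(),
            context.Request.Path.Value,   // raw path; use the route template if available
            context.Response.StatusCode,
            (long)stopwatch.Elapsed.TotalMilliseconds);
    }
});

app.Run();
```

Whichever option you use, verify during a drill that pasting a trace_id from an error log into your tracing backend lands on the right trace.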

Fix plan: how to roll out tracing safely

Phase 1: Trace propagation and logging bridge

  • Add trace id and span id to structured logs.
  • Verify that trace context is propagated across HTTP calls.
  • Enable spans only for inbound requests.

Phase 2: Add critical dependency spans

  • Add spans for SQL and external HTTP calls.
  • Confirm that dependency spans include duration and status.
  • Verify sampling does not drop error traces.

Phase 3: Expand cautiously

  • Add spans for background job entry points (see the consumer sketch after this list).
  • Add one domain-specific span where incidents are frequent.
  • Monitor overhead and adjust sampling.
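
For the background-job entry points in Phase 3, spans have to be started by hand because there is no inbound HTTP request to instrument. A minimal sketch, assuming the producer injected W3C trace context (traceparent/tracestate) into message headers; the OrderConsumer type, header dictionary, and the "orders-worker" source name are illustrative, and the source must also be registered in startup with tracing.AddSource("orders-worker"):

```csharp
// Minimal sketch: start a consumer span for a queue message and link it to
// the producer's trace. Assumes message headers carry W3C trace context;
// the type, header shape, and source name are illustrative.
using System.Diagnostics;
using OpenTelemetry.Context.Propagation;

public sealed class OrderConsumer
{
    private static readonly ActivitySource Source = new("orders-worker");

    public async Task HandleAsync(
        IReadOnlyDictionary<string, string> headers,
        Func<Task> processMessage)
    {
        // Extract the upstream trace context from the message headers.
        var parent = Propagators.DefaultTextMapPropagator.Extract(
            default,
            headers,
            (carrier, key) => carrier.TryGetValue(key, out var value)
                ? new[] { value }
                : Array.Empty<string>());

        // Treat the job entry point like an inbound request.
        using var activity = Source.StartActivity(
            "orders.process", ActivityKind.Consumer, parent.ActivityContext);

        activity?.SetTag("messaging.system", "queue");   // adjust to your broker

        try
        {
            await processMessage();
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}
```

Starting the activity with ActivityKind.Consumer and the extracted parent context is what turns the queue hop from an invisible gap into a visible segment of the same trace.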

What to log so traces remain useful

  • trace_id and span_id on every error and slow request
  • result_class for success, transient failure, permanent failure
  • dependency_name and dependency_duration_ms for slow dependencies

Without these fields, traces become isolated artifacts that cannot be joined to operational data.
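
The same fields can also live on the span itself, so a trace query can filter on them without joining logs first. A minimal sketch, assuming the OpenTelemetry API package is referenced (for the RecordException extension); the result_class values, helper names, and "inventory-service" are illustrative:

```csharp
// Minimal sketch: stamp the current span with the same incident fields the
// logs carry, so traces and logs can be filtered the same way. Values such
// as "transient_failure" and "inventory-service" are illustrative.
using System.Diagnostics;
using OpenTelemetry.Trace;   // RecordException extension

public static class SpanIncidentTags
{
    public static void MarkSuccess(Activity? activity)
        => activity?.SetTag("result_class", "success");

    public static void MarkDependencyFailure(
        Activity? activity,
        string dependencyName,
        TimeSpan dependencyDuration,
        Exception ex,
        bool transient)
    {
        if (activity is null) return;

        activity.SetTag("result_class", transient ? "transient_failure" : "permanent_failure");
        activity.SetTag("dependency_name", dependencyName);
        activity.SetTag("dependency_duration_ms", (long)dependencyDuration.TotalMilliseconds);
        activity.SetStatus(ActivityStatusCode.Error, ex.Message);
        activity.RecordException(ex);   // adds exception.type / exception.message as a span event
    }
}

// Usage around a dependency call (names are illustrative):
// try { await CallInventoryAsync(); SpanIncidentTags.MarkSuccess(Activity.Current); }
// catch (HttpRequestException ex)
// {
//     SpanIncidentTags.MarkDependencyFailure(
//         Activity.Current, "inventory-service", /* elapsed */ TimeSpan.FromSeconds(4.9), ex, transient: true);
//     throw;
// }
```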

Tradeoffs and limits

Tracing adds overhead. The goal is to control it with sampling and scope. If you instrument too much, you will burn budgets and lose trust. If you instrument too little, you will still be blind. Minimum viable tracing is the middle path: enough to explain incidents, not enough to drown you in data.

Also remember that traces reveal latency but not necessarily correctness. A fast failure can still be wrong. Traces are for timing and causality. Logs and domain checks are for correctness.

Shipped asset

Download
Free

OpenTelemetry tracing starter kit for .NET

Trace propagation checklist and starter config for production debugging (free, email delivery)

What you get (4 files):

  • otel-trace-starter-checklist.md
  • otel-attribute-map.md
  • sampling-defaults.md
  • README.md
Axiom Pack
$69

Production Observability Templates

Need consistent trace + log joining across multiple services? Get field taxonomy, trace/log bridge defaults, query packs, and a rollout playbook designed for incident response.

  • Trace/log bridge defaults (trace_id + incident fields) that actually join
  • Attribute taxonomy + naming guidance to keep traces queryable
  • Rollout plan and verification checks to avoid “we added tracing and it got slower”
Get Observability Templates →

FAQ

Do I need a full observability platform or a specific backend to start?

No. OpenTelemetry is a standard and can export to many backends. You can start with a basic exporter and still get value. The key is consistent trace context and a small number of spans that explain incidents.

How many spans should I add first?

Start with one span for every inbound request and one span for each critical dependency. That is often enough to see the slow hop. Add more only after the first set is reliable.

How do I keep overhead and cost under control?

Use sampling. Sample all errors and only a fraction of successful requests. Avoid adding high-cardinality attributes and do not trace every internal function until you have a reason.

Does every log line need a trace id?

Not every log line. Only errors and slow requests require it. The purpose is to join logs and traces during incidents. That keeps volume down while still enabling fast diagnosis.

Can I trace background jobs and queue consumers?

Yes. Treat the job entry point like an inbound request and create a new trace. Propagate the trace id into any downstream calls so you can see the timeline across jobs and services.

What do traces not tell me?

Traces show timing and dependency hops. They do not explain correctness or business rules. Combine traces with structured logs that capture the result class and domain context.

If your incidents span multiple services and logs still do not explain the failure, the observability pack provides production-ready trace and logging templates that make failures diagnosable fast.

Axiom .NET Rescue (Coming Soon)

Get notified when we ship observability templates, trace defaults, and incident runbooks for .NET systems.

Key takeaways

  • Traces show relationships. Logs show events. You need both during incidents.
  • Minimum viable tracing focuses on entry points and critical dependencies.
  • Log trace id and span id on errors so you can jump from logs to traces.
  • Sampling controls cost and keeps data usable.
  • Expand instrumentation only after the core path is stable.

Recommended resources

Download the shipped checklist/templates for this post.

A minimal tracing checklist and attribute map for debugging production incidents in .NET services.
