Legacy .NET production rescue (stabilize first, modernize safely)

If you have an older .NET system that must stay running (Web Forms, WCF, .NET Framework, or “it’s complicated”), I help you stop repeat incidents in production (timeouts, stuck jobs, slow pages, flaky integrations) and then modernize with low-risk cutovers.

The goal is boring reliability: clear stop/retry rules, usable observability, and a runbook your team can operate.

Stabilize-first
Stop repeat incidents before starting a rewrite.
Low-risk cutovers
Strangler patterns, parallel runs, safe rollouts.
Operate calmly
Observability + runbooks, not hero debugging.

Quick questionnaire → clear next step

Send answers to these and I’ll tell you the smallest set of changes that will stop the repeats.

  • What breaks today (timeouts, slowdowns, deadlocks, stuck jobs, thread pool starvation, random 500s)?
  • What stack is it (Web Forms/WCF/MVC, .NET Framework version, IIS/Azure, Windows services, scheduled jobs)?
  • What are the “money risks” (downtime cost, missed orders, SLA, compliance)?
  • Any artifacts you can share (logs, error screenshots, environment details, repo access)?
  • Do you need modernization soon (security, .NET upgrade deadline, cloud move), or just stability first?
  • How urgent is it, and are there budget constraints I should respect?

What usually goes wrong

These are common failure patterns in long-lived .NET systems.

  • Slow pages / timeouts under load (and no clear reason)
  • Thread pool starvation (sync-over-async, blocked threads, long IO)
  • Stuck background jobs / Windows services that “hang”
  • Deadlocks + lock contention (requests queue, CPU looks “fine”)
  • Retries amplify incidents (retry storms, queue pileups)
  • Socket exhaustion / connection pool exhaustion (HTTP/SQL) under bursts
  • Random 500s with logs that don’t explain the root cause
  • GC pressure / memory leaks (LOH growth, pauses, random slowdowns)
  • Deployments feel risky because behavior is unpredictable
  • Cancellation/timeouts not wired through (work continues after the user gave up)
  • “Modernization” becomes a rewrite because there’s no safe cutover plan

If you suspect thread pool starvation

This is one of the most common “everything is slow, CPU is fine” production problems in .NET. It often comes from sync-over-async, blocking calls, long IO, or lock contention that ties up request threads.

What it looks like
  • Requests queue up; latency climbs across the board
  • CPU isn’t pegged, but throughput collapses
  • “Random” timeouts appear in HTTP/SQL calls under bursts
  • Recycling the app “fixes it” temporarily
What to send (fastest diagnosis)
  • A short incident timeline (when it starts, how it ends, how often)
  • Metrics: request rate/latency, error rate, queue length (if you have it)
  • A thread dump / trace around the slowdown (PerfView or ProcDump on .NET Framework; dotnet-trace or dotnet-dump on .NET Core and later)
  • A few representative logs with correlation IDs for slow requests
Typical fixes: remove blocking calls, make async truly async, propagate cancellation/timeouts, and cap concurrency where upstream systems can’t keep up.
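
To make those fixes concrete, here is a minimal sketch of one way to bound a call with a timeout, pass cancellation through, and cap concurrency toward a slow upstream. The `BoundedCaller` class and its parameters are hypothetical names used for illustration, and the right limits depend on your system.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not from any library): wraps an async call with
// a concurrency cap and a timeout that is linked to the caller's token.
public sealed class BoundedCaller
{
    private readonly SemaphoreSlim _gate;
    private readonly TimeSpan _timeout;

    public BoundedCaller(int maxConcurrency, TimeSpan timeout)
    {
        _gate = new SemaphoreSlim(maxConcurrency, maxConcurrency);
        _timeout = timeout;
    }

    public async Task<T> RunAsync<T>(
        Func<CancellationToken, Task<T>> work,
        CancellationToken callerToken)
    {
        // The linked token fires when the caller gives up OR the timeout elapses.
        // The timeout covers both the wait for a slot and the work itself.
        using (var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken))
        {
            cts.CancelAfter(_timeout);

            // Wait asynchronously for a slot; no thread is blocked while queued.
            await _gate.WaitAsync(cts.Token).ConfigureAwait(false);
            try
            {
                // The work must observe the token, otherwise it keeps running
                // after the timeout.
                return await work(cts.Token).ConfigureAwait(false);
            }
            finally
            {
                _gate.Release();
            }
        }
    }
}
```

With this in place, a blocking call such as `client.GetAsync(url).Result` could become `await caller.RunAsync(ct => client.GetAsync(url, ct), requestToken)`, where `caller`, `client`, and `requestToken` are assumed to exist in the calling code. The sketch sticks to `using (...) { }` blocks so it also compiles with the older C# versions common on .NET Framework projects.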

.NET Production Rescue Audit

Choose this if: you need a clear diagnosis and a short, prioritized fix list.

What you walk away with
  • Top failure modes + “stop/retry/escalate” rules
  • Performance bottleneck shortlist (what to fix first)
  • Observability checklist + runbook outline
How we do it (technical)
  • Map incident patterns (timeouts, hangs, retries, slow SQL)
  • Check for thread pool starvation + lock contention patterns
  • Define safe timeouts + retry policy boundaries
  • Specify logging fields + correlation IDs + runbook steps

Stabilization Sprint

Choose this if: you want the top fixes shipped into production fast.

What changes immediately
  • Timeouts and hangs get bounded (caps + stop rules)
  • Errors become diagnosable (structured logs + traces)
  • Incident response becomes repeatable (runbook + decision paths)
How we do it (technical)
  • Fix “retry storms” (Polly/backoff+jitter/Retry-After)
  • Instrument the hotspots (OpenTelemetry / structured logs)
  • Ship safe rollouts (feature flags, parallel runs)
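
As an illustration of the retry-storm item above, here is a hand-rolled sketch of bounded retries with exponential backoff, full jitter, and an optional Retry-After hint. It is not Polly's API. `BoundedRetry`, `isTransient`, and `retryAfterOf` are hypothetical names, and in a real system a maintained library like Polly is usually the better choice.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: a stop rule (maxAttempts) plus backoff with jitter.
public static class BoundedRetry
{
    private static readonly Random Rng = new Random();

    // Full jitter: a random delay in [0, min(maxDelay, baseDelay * 2^attempt)).
    // If the server supplied a Retry-After hint, that hint is used instead.
    public static TimeSpan NextDelay(
        int attempt, TimeSpan baseDelay, TimeSpan maxDelay, TimeSpan? retryAfter)
    {
        if (retryAfter.HasValue) return retryAfter.Value;

        double ceilingMs = Math.Min(
            maxDelay.TotalMilliseconds,
            baseDelay.TotalMilliseconds * Math.Pow(2, attempt));
        double jitteredMs;
        lock (Rng) { jitteredMs = Rng.NextDouble() * ceilingMs; } // Random is not thread-safe
        return TimeSpan.FromMilliseconds(jitteredMs);
    }

    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> action,
        Func<Exception, bool> isTransient,        // caller decides what is retryable
        Func<Exception, TimeSpan?> retryAfterOf,  // may be null; extracts a Retry-After hint
        int maxAttempts,
        TimeSpan baseDelay,
        TimeSpan maxDelay,
        CancellationToken ct)
    {
        for (int attempt = 0; ; attempt++)
        {
            TimeSpan? hint = null;
            try
            {
                return await action(ct).ConfigureAwait(false);
            }
            catch (Exception ex) when (attempt + 1 < maxAttempts && isTransient(ex))
            {
                // Retryable and attempts remain; anything else propagates to the caller.
                hint = retryAfterOf?.Invoke(ex);
            }

            await Task.Delay(NextDelay(attempt, baseDelay, maxDelay, hint), ct)
                .ConfigureAwait(false);
        }
    }
}
```

The jitter spreads retries out so that many clients failing at the same moment do not all retry at the same moment, and the attempt cap keeps a failing dependency from being hammered indefinitely.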

Modernization Plan (low-risk)

Choose this if: you need to get off .NET Framework / WCF / legacy hosting without a big-bang rewrite.

What you get
  • Strangler plan (what to peel off first)
  • Parallel run + cutover checklist (risk controls)
  • Target architecture + “operate it” runbooks
Typical tools
  • YARP, System.Web Adapters, Upgrade Assistant
  • Containerization (where it reduces risk)
  • Incremental refactors + safe rollout strategy

Typical pricing (USD)

The ranges below are meant to set expectations. Final scope depends on access (logs, repro, deploy pipeline), urgency, and whether you want hands-on shipping or a fix plan your team implements.

Rapid Production Triage (48–72 hours)

Choose this if: you need answers fast before committing to a larger engagement.

$750–$1,500
What you get
  • Incident hypothesis (what’s most likely happening)
  • Immediate stop/containment recommendations
  • Clear “do this first” checklist
Typical timing
1–2 days, no long-term commitment. Often rolls directly into an Audit or Sprint.
Rescue Audit
Usually $3.5k–$7.5k
1–2 weeks. Prioritized fix list, stop/retry rules, and a runbook outline.
Stabilization Sprint
Usually $7.5k–$15k
1–3 weeks. Ship the top fixes into production with safe rollouts.
Modernization Plan
Usually $5k–$10k
1–3 weeks. Strangler plan + parallel run + cutover checklist.
Hourly advisory
$175–$250/hr (typical). Emergency / same-day: $250–$325/hr.
Best for: architecture review, incident triage, or a short diagnosis block. (Hourly work typically starts after an initial diagnostic block and usually has a minimum block.)
Weekly engagement
$4k–$8k/week (limited slots)
Best for: ongoing stabilization, shipping fixes, and calm operations while your team keeps building.

Want the fastest answer?

Send the questionnaire answers and a couple of example logs. I’ll reply with a recommended lane (Audit vs Sprint) and why.

Reliability
Bounded timeouts, safe retries, and calmer incident response.
Maintainability
Make the system debuggable and operable for the next engineer.
Safe modernization
Strangler cutovers and parallel runs instead of a rewrite gamble.