Legacy .NET production rescue (stabilize first, modernize safely)
If you have an older .NET system that must stay running (Web Forms, WCF, .NET Framework, or “it’s complicated”), I help you stop recurring production incidents such as timeouts, stuck jobs, slow pages, and flaky integrations, and then modernize with low-risk cutovers.
The goal is boring reliability: clear stop/retry rules, usable observability, and a runbook your team can operate.
Quick questionnaire → clear next step
Answer these questions and I’ll tell you the smallest set of changes likely to stop the recurring incidents.
- What breaks today (timeouts, slowdowns, deadlocks, stuck jobs, thread pool starvation, random 500s)?
- What stack is it (Web Forms/WCF/MVC, .NET Framework version, IIS/Azure, Windows services, scheduled jobs)?
- What are the “money risks” (downtime cost, missed orders, SLA, compliance)?
- Any artifacts you can share (logs, error screenshots, environment details, repo access)?
- Do you need modernization soon (security, .NET upgrade deadline, cloud move), or just stability first?
- How urgent is this, and are there budget constraints I should respect?
What usually goes wrong
These are common failure patterns in long-lived .NET systems.
- Slow pages / timeouts under load (and no clear reason)
- Thread pool starvation (sync-over-async, blocked threads, long IO)
- Stuck background jobs / Windows services that “hang”
- Deadlocks + lock contention (requests queue, CPU looks “fine”)
- Retries amplify incidents (retry storms, queue pileups)
- Socket exhaustion / connection pool exhaustion (HTTP/SQL) under bursts
- Random 500s with logs that don’t explain the root cause
- GC pressure / memory leaks (LOH growth, pauses, random slowdowns)
- Deployments feel risky because behavior is unpredictable
- Cancellation/timeouts not wired through (work continues after the user gave up)
- “Modernization” becomes a rewrite because there’s no safe cutover plan
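As an illustration of the cancellation/timeout item above, here is a minimal sketch of bounding a call with a time budget that is linked to the caller's cancellation token. `BoundedWork` and `RunWithBudgetAsync` are hypothetical names made up for this example, not part of any library, and the pattern only helps if the wrapped work actually observes the token it is given.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public static class BoundedWork
{
    // Runs 'work' under a time budget. The token passed to 'work' is cancelled
    // when either the caller gives up (callerToken) or the budget elapses.
    public static async Task<T> RunWithBudgetAsync<T>(
        Func<CancellationToken, Task<T>> work,
        TimeSpan budget,
        CancellationToken callerToken)
    {
        using (var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken))
        {
            cts.CancelAfter(budget);
            try
            {
                return await work(cts.Token).ConfigureAwait(false);
            }
            catch (OperationCanceledException) when (!callerToken.IsCancellationRequested)
            {
                // The caller did not cancel, so treat the cancellation as a timeout.
                throw new TimeoutException(
                    $"Work exceeded its {budget.TotalMilliseconds} ms budget.");
            }
        }
    }
}
```

The design choice here is to distinguish "the user gave up" (the original `OperationCanceledException` propagates) from "the dependency was too slow" (a `TimeoutException`), because the two usually call for different log levels and different retry decisions.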
If you suspect thread pool starvation
This is one of the most common “everything is slow, CPU is fine” production problems in .NET. It often comes from sync-over-async, blocking calls, long IO, or lock contention that ties up request threads.
Typical symptoms:
- Requests queue up; latency climbs across the board
- CPU isn’t pegged, but throughput collapses
- “Random” timeouts appear in HTTP/SQL calls under bursts
- Recycling the app “fixes it” temporarily
What helps me diagnose it:
- A short incident timeline (when it starts, how it ends, how often)
- Metrics: request rate/latency, error rate, queue length (if you have it)
- A thread dump / trace around the slowdown (PerfView, dotnet-trace, dotnet-dump)
- A few representative logs with correlation IDs for slow requests
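If you have no thread pool metrics at all, a cheap first step is to log a periodic snapshot of pool usage next to your request metrics. The sketch below uses only `System.Threading.ThreadPool` calls that exist on both .NET Framework and modern .NET; the `ThreadPoolProbe` class and the log format are assumptions for illustration.

```csharp
using System;
using System.Threading;

// Hypothetical helper for illustration only.
public static class ThreadPoolProbe
{
    // Returns a single log-friendly line describing current thread pool usage.
    public static string Snapshot()
    {
        ThreadPool.GetMaxThreads(out int maxWorker, out int maxIo);
        ThreadPool.GetMinThreads(out int minWorker, out int minIo);
        ThreadPool.GetAvailableThreads(out int freeWorker, out int freeIo);

        // "Busy" is derived: the pool reports maximum and available, not in-use.
        int busyWorker = maxWorker - freeWorker;
        int busyIo = maxIo - freeIo;

        return $"threadpool busyWorker={busyWorker} minWorker={minWorker} maxWorker={maxWorker} " +
               $"busyIo={busyIo} minIo={minIo} maxIo={maxIo}";
    }
}
```

A busy worker count that stays well above `minWorker` while CPU is low is a hint, not proof, that threads are blocked rather than working. A dump or trace taken during the slowdown is still what confirms it.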
.NET Production Rescue Audit
Choose this if: you need a clear diagnosis and a short, prioritized fix list.
- Top failure modes + “stop/retry/escalate” rules
- Performance bottleneck shortlist (what to fix first)
- Observability checklist + runbook outline
- Map incident patterns (timeouts, hangs, retries, slow SQL)
- Check for thread pool starvation + lock contention patterns
- Define safe timeouts + retry policy boundaries
- Specify logging fields + correlation IDs + runbook steps
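To make “stop/retry/escalate” rules concrete, here is a minimal sketch of what such a rule can look like once it is written down as code. The `FailurePolicy` class, the `FailureAction` enum, and the specific status-code choices are assumptions for illustration; real boundaries depend on the dependency and on whether the operation is idempotent.

```csharp
using System;

// Hypothetical types for illustration only.
public enum FailureAction { Retry, Stop, Escalate }

public static class FailurePolicy
{
    // statusCode: HTTP status of the failed call, or null if there was no response
    //             (timeout, connection failure).
    // attempt:    1-based number of the attempt that just failed.
    public static FailureAction Decide(int? statusCode, int attempt, int maxAttempts)
    {
        bool transient =
            statusCode == null ||   // no response at all
            statusCode == 408 ||    // request timeout
            statusCode == 429 ||    // throttled
            statusCode >= 500;      // server-side failure

        // Non-transient failures (for example 400 or 404) will not improve on retry.
        if (!transient) return FailureAction.Stop;

        // Transient, but the retry budget is spent: hand it to a human or a dead-letter path.
        if (attempt >= maxAttempts) return FailureAction.Escalate;

        return FailureAction.Retry;
    }
}
```

For example, under these assumed rules `FailurePolicy.Decide(503, 1, 3)` yields `Retry`, `Decide(503, 3, 3)` yields `Escalate`, and `Decide(404, 1, 3)` yields `Stop`.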
Stabilization Sprint
Choose this if: you want the top fixes shipped into production fast.
- Timeouts and hangs get bounded (caps + stop rules)
- Errors become diagnosable (structured logs + traces)
- Incident response becomes repeatable (runbook + decision paths)
- Fix “retry storms” (Polly, backoff with jitter, honoring Retry-After)
- Instrument the hotspots (OpenTelemetry / structured logs)
- Ship safe rollouts (feature flags, parallel runs)
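For the retry-storm item above, here is a sketch of the delay calculation behind “backoff with jitter”, written in plain C# with no library dependency. It is not Polly's API; `RetryDelay.Next` and its parameters are hypothetical names. In this sketch a server-provided Retry-After value is treated as a lower bound, with jitter added on top so that clients given the same hint do not all return at the same instant.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class RetryDelay
{
    // attempt:    1-based number of the attempt that just failed.
    // baseDelay:  backoff unit; the jitter ceiling doubles with each attempt.
    // cap:        upper bound on the jittered part of the delay.
    // retryAfter: value parsed from a Retry-After header, if the server sent one.
    public static TimeSpan Next(
        int attempt, TimeSpan baseDelay, TimeSpan cap, TimeSpan? retryAfter, Random rng)
    {
        if (attempt < 1) throw new ArgumentOutOfRangeException(nameof(attempt));
        if (rng == null) throw new ArgumentNullException(nameof(rng));

        // "Full jitter": pick uniformly in [0, min(cap, baseDelay * 2^(attempt-1))).
        double exponential = baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1);
        double ceiling = Math.Min(cap.TotalMilliseconds, exponential);
        TimeSpan jittered = TimeSpan.FromMilliseconds(rng.NextDouble() * ceiling);

        // Never retry sooner than the server asked.
        return retryAfter.HasValue ? retryAfter.Value + jittered : jittered;
    }
}
```

A caller would typically pair this with an attempt limit and an overall deadline, and give up instead of waiting when the Retry-After value alone exceeds that deadline.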
Modernization Plan (low-risk)
Choose this if: you need to get off .NET Framework / WCF / legacy hosting without a big-bang rewrite.
- Strangler plan (what to peel off first)
- Parallel run + cutover checklist (risk controls)
- Target architecture + “operate it” runbooks
- YARP, System.Web Adapters, Upgrade Assistant
- Containerization (where it reduces risk)
- Incremental refactors + safe rollout strategy
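To show what a parallel run can look like in code, here is a minimal sketch: the legacy path stays the source of truth, and the new path is executed only to compare results. `ParallelRun.CompareAsync` and the `reportMismatch` callback are hypothetical names for this example. It assumes the candidate call is read-only or otherwise safe to execute twice, and it runs the two calls one after the other, so it adds the candidate's latency to the request.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public static class ParallelRun
{
    // Always returns the legacy result. The candidate is only compared against it,
    // and a candidate failure is reported but never surfaces to the caller.
    public static async Task<T> CompareAsync<T>(
        Func<Task<T>> legacy,
        Func<Task<T>> candidate,
        Action<string> reportMismatch)
    {
        T expected = await legacy().ConfigureAwait(false);
        try
        {
            T actual = await candidate().ConfigureAwait(false);
            if (!EqualityComparer<T>.Default.Equals(expected, actual))
            {
                reportMismatch($"mismatch: legacy={expected} candidate={actual}");
            }
        }
        catch (Exception ex)
        {
            reportMismatch($"candidate threw {ex.GetType().Name}: {ex.Message}");
        }
        return expected;
    }
}
```

Once the mismatch rate stays at zero for an agreed period, the cutover checklist can flip the roles behind a feature flag instead of changing code paths in one step.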
Typical pricing (USD)
Ranges below are to set expectations. Final scope depends on access (logs, repro, deploy pipeline), urgency, and whether you want hands-on shipping or a fix plan your team implements.
Choose this if: you need answers fast before committing to a larger engagement.
- Incident hypothesis (what’s most likely happening)
- Immediate stop/containment recommendations
- Clear “do this first” checklist
Want the fastest answer?
Send the questionnaire answers and a couple of example logs. I’ll reply with a recommended lane (Audit vs Sprint) and why.