Monday, April 28, 2008 10:38 PM
ahopper
Instrumentation, Your App, and You - Part I: Tracing
It's been a while since I've blogged about anything technical, and since I'm here in my new home at Wintellect I decided that I need to begin anew. So, without further ado...
I wish I could find a way to make instrumentation sexy; every time I try to explain to someone about why they should invest in instrumentation, I see them get the “deer in the headlights” look and try to find a way to get away from this crazy bald man. It’s just so BORING… and invaluable.
Basically, my philosophy is that attaching a debugger is well and good, but there are only a few scenarios where that’s going to be possible or acceptable. With Enterprise-facing Web or LOB applications, you have a lot of leeway – they can afford to have you (or one of their technical resources) go to a machine and attach WinDBG or CDB – but I hold Internet-facing sites and client-side code to a much higher bar. With a live Internet-facing site, you can’t just hit a breakpoint and dig around as it brings the app to a screeching halt. Even scripting a breakpoint that snaps a minidump and resumes is no good as it still blocks the app for an indeterminate (although admittedly pretty short) amount of time. With a client-side application, you simply cannot ask a customer to install a debugger and get that minidump to you.
So, how can you instrument your application? Well, there are three primary ways and one interesting way. You can use trace logging (which will be the core focus of this post), event logging, performance counters, and finally, WMI.
There are plenty of times where that hard-won minidump simply isn’t enough; after all, not only is it rocket science to most developers, but all you’ll see stack traces and current values for members – not HOW you got to this point. A lot of the art debugging is intuiting what path the app took to get to this crashing stack, and a good trace log can let the app do a lot of that detective work for you. Even minimal stack tracing (entry/exit) gives you a great deal of behavioral information, and this can be invaluable when you’re trying to figure out WHY you’re waiting on that Mutex again.
Tracing undeniably has a non-zero overhead associated with it, and this overhead takes two forms. First, there’s the behavioral/code overhead that goes with getting people in the mindset of leaving post-it notes in their code for the unfortunate soul who has to troubleshoot their misbehaving app. Second, there’s the performance overhead that goes along with adding more calls to the application. The first is manageable, and the second is fixable.
On the behavioral front, it’s very much an evangelism issue and it’s kinda like selling insurance for your career. Once you have an app out in the field with zero instrumentation and some high-profile customer is screaming for your head because you STILL don’t know why your app is hitting this exception, you gain a deep appreciation for how much you would have loved to know what the response from that SOAP call was. As for the amount of work it requires from the developer, it all depends upon the class of application; there’s a dial you’ll want to turn for the amount of tracing code you write. When I’m wearing my hard hat I want tracing at every decision block, every exception handler to dump the exception and logging the parameters to methods at “verbose” levels. When I’m wearing my “warm fuzzy” hat for simple LOB applications, I want tracing before and after major events in the code like network/IO calls. But in practice, it’s going to be up to the dev to understand how their code might misbehave and to leave a trail of tracing breadcrumbs to help some poor soul noodle out why some button didn’t get enabled.
Now on to the performance overhead. Most developers don’t stop to think about the cost associated with constructing the call stack for a trace call, and when they see that their application’s runtime performance suffers because they have a bazillion trace calls that call ToString or box up a bunch of integers, they’re quick to blame tracing in general and yank it altogether (seen it happen first-hand, that). There’s no reason it has to be this expensive, though. If tracing isn’t enabled, it should just get out of the way – it’s only when we’re debugging that we want that overhead, and then we won’t care as much if we take a hit. It’s an unfortunate reality that all of the wonderful work the driver guys did on TraceWPP is lost in the managed world; there’s no (publicly available, that is) tool for preprocessing managed code and wrapping trace calls with runtime-evaluated conditionals to shortcut the construction of the call stack if tracing is not enabled.
So, if we can't use TraceWPP, what can we do, then? Well, for the time being (hint, hint), we're stuck with hand-generating conditional blocks to detect whether tracing is enabled for a particular trace level. However, I hope to remedy that situation here in this blog: we'll walk through the implementation of a preprocessor in the style of TraceWPP and use it for our managed apps.
Well, this post has become long enough for now. We'll take a look at the other ways your application can help you debug itself later.