This is a continuation of my inaugural blog series on instrumentation (see part one here).

Eventing is one of those features in the .NET Framework that is incredibly useful, especially if you are an Enterprise or Internet-facing application developer. Proper eventing instrumentation can make the difference between an operable application and something your IT guys would never allow in the datacenter. Unfortunately, it is also really easy to abuse.

OK, this is where I don my Pedantic Instrumentation Guy hat. Yes, I'm going to be an annoying PIG that asks you to do a bit of work in order to have good instrumentation so you don't have to pay some ex-green-beret a lot of money to wade through mini-dumps. (Not that we mind the income, mind you!)

All too often, developers will use the event log as a poor man's trace log - they'll dump informational messages into the Application log or, worse yet, perform full stack dumps into the log when an exception is caught. In addition, it's just too easy to configure programmatically, which can set you up for embarrasing crashes.

You see, the event log was not intended to act as a log file; instead, it is intended to be a place where you can go to see a comprehensive view of important things happening on your system. There are two magic words in there I want to emphasize: comprehensive and important. You must understand that you are not the only application writing to the event log. In fact, there's a phenomenon known as an "event storm" where a poorly written application can flood a computer's event log with so much useless or redundant data that other things that are really useful to see like warnings from SQL or disk I/O errors get missed. This logically brings us to event importance. You really should only be logging significant events. Things like startup, shutdown, catastrophic failures but most importantly, recoverable errors.

Ahhh, recoverable errors. PIG hat still on? Ah, good. This topic could (and probably will) merit a whole series of posts dedicated to knowing/learning how your applications may fail. Here's a Golden Rule of eventing: Do not log an error if all your poor sysadmin can do is see that the application has logged an error. Trust me; he's not going to appreciate the event message for its innate beauty. If there is no possible recovery action documented for a particular error event ID (or even better, included in the event's message), then all your app is doing is indicating to the sysadmin that he should seriously consider uninstalling your application. Consider this: you see the following error event:

Log Name:      Application
Source:        MyApplication
Date:          5/8/2008 7:00:00 AM
Event ID:      0
Task Category: (0)
Level:         Error
Keywords:     
User:          N/A
Computer:      datacenter1
Description:
Object not set to an instance of an object.

Did this tell you anything about how you can prevent this error in the future? If not, why is it in the event log? If it doesn't provide me with a way to help my system be happy again, it's just noise. The best I can do is to attach a debugger and snap a minidump when I hit a NullReferenceException, and this is not something to consider lightly when you've got an Internet-facing application (Or tight purse-strings. Do you know how much it costs to hire ex-green-berets these days?).

No, a much better error event would be:

Log Name:      Application
Source:        MyApplication
Date:          5/8/2008 7:00:00 AM
Event ID:      1033
Task Category: (3)
Level:         Error
Keywords:     
User:          N/A
Computer:      datacenter1
Description:
Unable to retrieve customer information from the database. Please verify that the 'customer' database at server 'database2' is accessible.

So, what's the difference here? Well, there's the obvious difference and the hidden one. The obvious difference is that the event gives you a pretty good indication of what's going on. In addition, it gave you some things to go check to help your app get back on its feet: is the database server up and running? If so, is the database available? The hidden difference is that the developer anticipated that he may not get a valid object from the call to retrieve an object from the server and wrote error-handling code to capture the information needed to troubleshoot the problem and to write the above entry. I'm not going to lie and say that it's easy to write code that anticipates failure. It's actually pretty hard to know all of the ways in which some component can fail, especially if you didn't write it yourself. Some of this will come from experience and testing, and there are also static analysis tools that can analyze your code's call stacks for exceptions that aren't handled but if you're really lucky, the developers who wrote the component will have included things like <exception> information in their XML documentation. With all of the above and some grunt-work on the developers' part, you should be able to tell your sysadmin what is wrong with your app. So what if you encounter a truly exceptional exception? Well, that falls under the category of catastrophic failure. For obvious reasons, it's your responsibility to try to minimize the scenarios where one of these can occur, but if it were to happen this is where all of the helpful advice I provided in the tracing article (you DID read that, right?) comes in.

Oh, one other thing you may notice in my "new and improved" log message is that there was an event ID. This is also a pretty important concept: once you've spent the time to analyze the manner in which your application can fail, try to assign event IDs to the cause of failures you are logging. For example: an inability to connect to the database server is 1033, an access denied to a configuration file may be 2021, and so on. If you do your homework here you can not only make it easier for a human being to find and correct the issue preventing your application from functioning but you can facilitate the development of auto-recovery actions in monitoring packages like System Center Operations Manager. Spending the time to itemize your event messages also allows you to leverage the string formatting and localization capabilities of the Message Compiler (mc.exe) utility. I'll dedicate a separate post to how one can use MC.exe to generate message files and how you can register and use event sources that use message files.

Finally, let's talk about event source registration. Again, this is one of those areas where things were made a wee bit too easy. A pattern you see too often in managed code is:

if (!EventLog.SourceExists("Foo"))
{
    EventLog.CreateEventSource("Foo", "Application");
}
EventLog.WriteEntry("Foo", "Some event", EventLogEntryType.Information);

In fact, I was apalled to discover this code in a Microsoft certification exam! You should never ship code like the above for a couple of reasons, but the most obvious one is that it requires elevated permissions to create an event source. If an unprivileged user account attempts to run the above you will get a SecurityException. Instead, have your installation package leverage Windows Installer's support for registering event message files and use the WriteEvent overload instead - that way, you can create an EventInstance object that contains the event message ID and supply format strings that get plugged into the appropriate place in the message.

Well, there you have it. A whirlwind tour of the philosophy behind proper* event logging in your application.

Whew. Now I can take this stupid thing off.

* "Proper" is in the eye of the beholder. Feel free to ignore any and all of my advice. Tax, tag and title not included in the advertised sale price.