Following my post about the bus factor, I thought it’d be a good idea to focus on a topic that hasn’t been shared much with my coworkers but is a great tool to have in your toolbelt: how to triage production issues as they happen, when the stakes are high and you need to get the system working again quickly while figuring out what went wrong. This series will primarily cover gathering and analyzing memory dumps with WinDbg, but before diving into the details of memory dumps, I want to cover something that’s very important for any application deployed to production.
Logging is your first line of defense when things start going wrong. With effective logging, you can avoid digging through a memory dump in all but the worst cases. Why is that important? Because debugging through a memory dump is time consuming, very time consuming, and all of that time could be better spent on other things. It also means you can fix production issues faster, minimizing the system’s downtime. In some cases, what you see happening in the log lets you take proactive steps before an issue becomes serious enough to take the system down.
Make logging a first-class citizen in your software architecture, not an afterthought. There are many mature logging frameworks to choose from, so you just need to pick one that covers your needs. Think about what might need to be logged in the system. Do different components need to log separately? Does the logging need to be asynchronous for performance? Do different logs need to be created for each execution? For example, if you’re creating an application that imports files, does it make sense to log each file import separately, or is it fine to combine everything into one file? One of the most important things to keep in mind, though, is don’t log so much that you have to dig through a pile of junk to find the useful parts. It takes some time and trial and error to get it right, but it really makes a difference.
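To make the per-execution idea concrete, here’s a minimal sketch of the file-import example using Python’s standard logging module (Python is just for illustration here; the mature frameworks in the .NET world offer equivalent file-per-run configuration). The `make_import_logger` helper, the `importer` logger name, and the `logs` directory are all hypothetical names I’ve made up for the sketch:

```python
import logging
from pathlib import Path

def make_import_logger(import_id: str, log_dir: str = "logs") -> logging.Logger:
    """Create a dedicated logger for a single file-import run (hypothetical helper)."""
    Path(log_dir).mkdir(exist_ok=True)
    logger = logging.getLogger(f"importer.{import_id}")
    logger.setLevel(logging.INFO)
    # One log file per execution, so each import can be reviewed in isolation
    # instead of being interleaved with every other run.
    handler = logging.FileHandler(Path(log_dir) / f"import_{import_id}.log")
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger

log = make_import_logger("orders_2024_01")
log.info("Starting import")
log.warning("Row 42 skipped: missing customer id")
```

Whether the separate-file approach is worth it depends on how often the process runs and how often you expect to review a single run in isolation.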
Logs are useless if no one looks at them
In general it’s good to review logs periodically to make sure there’s nothing out of the ordinary going on. On the other hand, errors that occur shouldn’t be buried in a log file waiting to be found days after they’ve occurred. While it’s a good idea to still log them with everything else, they also need to be more in your face. I’ve mostly done this with email alerts, but there are other ways to create alerts too, such as IM, text message, or third-party monitoring software. You have to be careful to limit how many alerts get created, or you’ll start to ignore all of the error emails that come in. Keep that in mind as you’re writing error handling code to make sure you don’t “cry wolf”. Reserve email for the exceptions that are critical. Context is also essential when emailing an error. The more details included, the greater the chance that you won’t even need to open the full log. For example, if you catch a SqlException, include the parameters used in the SQL statement. Maybe you’ll immediately spot the edge case you hadn’t tested for.
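The SqlException example can be sketched like this, again using Python’s stdlib logging purely as illustration (a ValueError stands in for the database exception, and `save_order` plus its parameters are hypothetical). The point is that the caught exception gets logged together with the inputs that produced it, so the alert alone may reveal the bug:

```python
import logging

logging.basicConfig(level=logging.ERROR,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("orders")

def save_order(customer_id, amount):
    # Hypothetical data-access call; the raise simulates a database
    # constraint failure (the SqlException case from the text).
    raise ValueError("CHECK constraint violated: amount must be positive")

params = {"customer_id": 1234, "amount": -5.00}
try:
    save_order(**params)
except ValueError:
    # log.exception records the full traceback; including the statement
    # parameters means the email might be enough to spot the edge case
    # (here, a negative amount) without opening the full log.
    log.exception("Order insert failed; parameters=%r", params)
```

Here the negative amount jumps out of the alert immediately, which is exactly the kind of shortcut that context buys you.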
Determining a good error handling strategy has to be done on a case-by-case basis, and it will vary from application to application. There are a lot of factors to consider. Is this a client application or a backend mission-critical system? Are the integration points going to be prone to errors? Is it OK for the process to crash, or does that need to be avoided as much as possible? The frequency of errors and how they get handled will affect how they get logged. In some cases emailing each error works great, while in others it will make more sense to only send an alert if errors occur repeatedly. It might take a few iterations to get things right, to find the places that need more detail logged, and to decide what works best for the application.
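The “only alert if errors occur repeatedly” idea boils down to counting errors within a sliding time window. Here’s one possible sketch in Python; the `ErrorAlerter` class, its threshold and window values, and the `send` hook (a stand-in for whatever email/IM/pager mechanism you use) are all assumptions of mine, not a prescribed design:

```python
import time
from collections import deque

class ErrorAlerter:
    """Send an alert only when errors repeat within a time window,
    so a single transient failure doesn't cry wolf."""

    def __init__(self, threshold=3, window_seconds=300, send=print):
        self.threshold = threshold
        self.window = window_seconds
        self.send = send            # stand-in for an email/IM/pager hook
        self.timestamps = deque()   # recent error times, oldest first

    def record_error(self, message, now=None):
        now = time.time() if now is None else now
        self.timestamps.append(now)
        # Drop errors that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.threshold:
            self.send(f"ALERT: {len(self.timestamps)} errors in "
                      f"{self.window}s; latest: {message}")
            self.timestamps.clear()  # avoid repeat alerts for the same burst
```

With a threshold of 3 in 5 minutes, one transient timeout stays in the log only, while a burst of failures triggers a single alert; tune both numbers per application.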
In my next post, I’ll cover the different ways of creating memory dumps and opening them in WinDbg. After that I’ll be posting about using WinDbg to dig through the mountains of data within the dump.