Developer’s tool belt: debugging production issues–part 1

Following my post about the bus factor, I thought it’d be a good idea to focus on a topic I haven’t shared much with my coworkers, but that’s a great tool to have in your tool belt: how to triage production issues as they happen, when the stakes are high and you need to get things working and figure out what went wrong as quickly as possible.  This series will primarily cover gathering and analyzing memory dumps with WinDbg, but before I dive into the details of memory dumps, I want to cover something that’s very important for any application deployed to production.

Logging

Logging is your first line of defense when things start going wrong.  With effective logging, you can avoid digging through a memory dump in all but the worst cases.  Why is that important?  Because debugging through a memory dump is very time consuming, and all of that time could be better spent on other things.  It also means you can fix production issues faster, minimizing the system’s downtime.  In some cases, what you see happening in the log lets you take steps proactively, before an issue becomes serious enough to take the system down.

Make logging a first-class citizen in your software architecture, not an afterthought.  There are many mature logging frameworks to choose from, so you just need to pick one that covers your needs.  Think through what might need to be logged in the system.  Do different components need to log separately?  Does the logging need to be asynchronous for performance?  Do different logs need to be created for each execution?  For example, if you’re creating an application that imports files, does it make sense to log each file import separately, or is it fine to combine everything into one file?  One of the most important things to keep in mind, though, is don’t log so much that you have to dig through a pile of junk to find the useful parts.  It takes some time and trial and error to get it right, but it really makes a difference.
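As a concrete sketch, here’s what that might look like with log4net, one of the mature frameworks mentioned above (the `FileImporter` class and its logic are hypothetical):

```csharp
using System;
using log4net;
using log4net.Config;

public class FileImporter
{
    // One logger per component, so each can be routed and filtered separately.
    private static readonly ILog Log = LogManager.GetLogger(typeof(FileImporter));

    public void Import(string path)
    {
        Log.InfoFormat("Starting import of {0}", path);
        try
        {
            // ... import logic ...
            Log.InfoFormat("Finished import of {0}", path);
        }
        catch (Exception ex)
        {
            // Log the exception with context before letting it bubble up.
            Log.Error("Import failed for " + path, ex);
            throw;
        }
    }
}

// At application startup, e.g. in Main(), load the config-driven appenders:
// XmlConfigurator.Configure();
```

Whether each import gets its own log file or everything goes to one file then becomes purely a configuration decision, which is exactly where you want it.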

Logs are useless if no one looks at them

In general, it’s good to review logs periodically to make sure there’s nothing out of the ordinary going on.  On the other hand, errors shouldn’t be buried in a log file waiting to be found days after they’ve occurred.  While it’s still a good idea to log them with everything else, they also need to be more in your face.  I’ve mostly done this with email alerts, but there are other ways to create alerts too, such as IM, text message, or third-party monitoring software.  You have to be careful to limit how many alerts get created, or you’ll start to ignore all of the error emails that come in.  Keep that in mind as you’re writing error handling code, to make sure you don’t “cry wolf”.  Reserve sending emails for the exceptions that are critical.  Context is also essential when emailing an error.  The more details included, the greater the chance that you won’t even need to open the full log.  For example, if you catch a SqlException, include the parameters used in the SQL statement.  Maybe you’ll immediately spot the edge case you hadn’t tested for.
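A sketch of that SqlException case, capturing the parameters in both the log and the alert (`SendCriticalAlert` is a hypothetical helper that emails the on-call developer; `log` is your logging framework’s logger):

```csharp
// Requires: using System.Linq; using System.Data.SqlClient;
try
{
    command.ExecuteNonQuery();
}
catch (SqlException ex)
{
    // Flatten the statement's parameters into a readable string, e.g.
    // "@CustomerId=42, @FileName=orders.csv"
    string parameters = string.Join(", ",
        command.Parameters.Cast<SqlParameter>()
               .Select(p => p.ParameterName + "=" + p.Value));

    // Log with everything else, and also put it in your face.
    log.Error("SQL statement failed. Parameters: " + parameters, ex);
    SendCriticalAlert(ex, parameters);
    throw;
}
```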

Determining a good error handling strategy has to be done on a case-by-case basis, and will vary from application to application.  There are a lot of factors to consider.  Is this a client application or a backend mission-critical system?  Are the integration points prone to errors?  Is it OK for the process to crash, or does that need to be avoided as much as possible?  The frequency of errors and how they get handled will affect how they get logged.  In some cases emailing each error works great, while in others it makes more sense to only send an alert if errors occur repeatedly.  It might take a few iterations to get things right, to find the places that need more detail logged, and to decide what works best for the application.
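The “only alert if errors occur repeatedly” approach can be sketched with a simple sliding-window throttle (class and method names are hypothetical, not from any framework):

```csharp
using System;
using System.Collections.Generic;

// Only signal an alert once errors start repeating, e.g. 5 errors
// within a 10-minute window.
public class AlertThrottle
{
    private readonly int threshold;
    private readonly TimeSpan window;
    private readonly Queue<DateTime> recentErrors = new Queue<DateTime>();

    public AlertThrottle(int threshold, TimeSpan window)
    {
        this.threshold = threshold;
        this.window = window;
    }

    // Record an error; returns true when the count within the
    // window reaches the threshold and an alert should go out.
    public bool RecordError(DateTime now)
    {
        recentErrors.Enqueue(now);

        // Drop errors that have aged out of the window.
        while (now - recentErrors.Peek() > window)
            recentErrors.Dequeue();

        return recentErrors.Count >= threshold;
    }
}
```

Usage would be something like `if (throttle.RecordError(DateTime.UtcNow)) SendAlert(ex);` inside the error handler, so a one-off transient failure stays in the log while a repeating one gets in your face.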

In my next post, I’ll cover the different ways of creating memory dumps and opening them in WinDbg.  After that I’ll be posting about using WinDbg to dig through the mountains of data within the dump.


The bus factor

In this case, “bus” is used literally.  The bus factor measures risk to a project based on how many project members could be “hit by a bus” before the project fails. 

In software development, this typically refers to the expertise of each developer in a team or organization, and ensuring the knowledge of the systems built and maintained for the organization is spread amongst the entire development team.  Just like adding redundancy to servers helps ensure high availability, it is important to have a level of redundancy amongst the people developing the systems that power the business’s operations, so the business isn’t left scrambling if a key person leaves.

This is something my colleagues and I are always aware of, and are continuously working to improve.  At a team level, we try to ensure a majority of the team members are familiar with all aspects of the projects we work on.  We may have our individual areas of expertise, but we also make sure we aren’t the lone knowledgeable person.  We pair program, do code reviews, and collaborate on the design of the solution.  We bounce ideas off each other and share insights.  Across teams, we rotate among projects and take time to transfer knowledge of a system to the next team that will work on it.  There may be an initial barrier to entry that slows down the team’s progress, but we recognize the long-term benefits.

“But don’t I lose job security?”

Maybe.  It depends on whether you’re working in the right environment.  Does your employer recognize the importance of technology to their business?  Do they recognize the risk of cutting corners in their IT department?  If the company would replace you with a junior developer as soon as you shared the knowledge of the systems you’ve built, do you really want to work there in the first place? 

If you’re working in the right place, your job security comes from your ability to solve the tough challenges presented to you, from your expertise in the tools and frameworks used, and from your ability to use those skills collaboratively with others.  In fact, it’s to your employer’s benefit that you distribute as much knowledge as possible, so you can continue to provide the business the most value you have to offer, helping it grow by developing new solutions rather than being stuck as the only one able to maintain the systems already developed.

The most effective way to become indispensable is by being good at your job

Take the time to learn new technologies, read blogs, and stay up to speed on the tools you use.  Read The Clean Coder.  Become a professional software craftsman, rather than a code monkey.  These things will do far more for your job security and usefulness than being the person who will leave a gaping hole in the knowledge pool if “hit by a bus”.

Rhino.ServiceBus saga persistence with RavenDB

I want to document the RavenDB saga persister that I added to GitHub about a year ago.  Thanks to Corey Kaylor for his help with writing the initial code.

RavenDB is a natural fit for saga state storage, and the client’s support for dynamic objects makes it very easy to do.  Setting up your project to use this persister requires a couple of registrations added to the service bus bootstrapper.  The following shows an example using Windsor:
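A rough sketch of those registrations, assuming Windsor’s fluent registration API and Rhino.ServiceBus’s `ISagaPersister<T>` abstraction (`MySaga` is a hypothetical saga type; the persister and message module types are the ones described below):

```csharp
using Castle.MicroKernel.Registration;
using Castle.Windsor;
using Raven.Client;
using Raven.Client.Document;

var container = new WindsorContainer();

container.Register(
    // The document store the saga persister will use; configure
    // URL/connection string however you choose.
    Component.For<IDocumentStore>()
             .Instance(new DocumentStore { Url = "http://localhost:8080" }.Initialize()),

    // Creates message-scoped IDocumentSessions for the persister.
    Component.For<RavenStoreProviderMessageModule>(),

    // The saga persister itself, registered per saga type.
    Component.For<ISagaPersister<MySaga>>()
             .ImplementedBy<RavenSagaPersister<MySaga>>());
```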

The RavenStoreProviderMessageModule lazily creates IDocumentSessions, scoped to the processing of a message, when the saga persister is invoked.  It does this using the IDocumentStore injected into it, so you can configure which document store saga state gets persisted to however you choose.

The RavenSagaPersister handles loading and saving saga state using an IDocumentSession.  It sets the document ID using the convention ‘{saga type}/{saga ID}’.  For example: MySagaType/25C209BD-F655-4526-92E0-6A2CB59AEAE2.
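In code, that ID convention amounts to something like this (variable names are illustrative):

```csharp
// Document ID convention: '{saga type}/{saga ID}'
string documentId = saga.GetType().Name + "/" + sagaId;
// e.g. "MySagaType/25C209BD-F655-4526-92E0-6A2CB59AEAE2"
```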

You can view the full RavenSagaPersister on GitHub here.

Please note: the project’s dependencies are currently out of date, so it’ll need to be recompiled against the versions of RavenDB and Rhino.ServiceBus you’re using, or you can just add assembly binding redirects to your project’s config.
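An assembly binding redirect in app.config would look something like this (the assembly name shown is RavenDB’s client assembly; the version numbers and public key token are placeholders — substitute the values from the assemblies you actually reference):

```xml
<configuration>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity name="Raven.Client.Lightweight"
                          publicKeyToken="PUBLIC_KEY_TOKEN_HERE"
                          culture="neutral" />
        <!-- Redirect the compiled-against version to the one you're using. -->
        <bindingRedirect oldVersion="0.0.0.0-9.9.9.9" newVersion="9.9.9.9" />
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
</configuration>
```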

Can’t open a solution? Check your line endings

I’ve run into this issue occasionally over the past few months, but just wrote it off as a minor annoyance.  There are some Visual Studio solutions that wouldn’t open when I’d double-click on them.  Nothing would happen at all.  I’d have to open Visual Studio first, then open the solution from there.  I also noticed that the icon of the file didn’t have the little “10” in the corner:

Bad solution

The issue cropped up again this week, and also happened to a coworker, so I decided to dig into the solution file itself to find out what was wrong.  That’s when I noticed the file had Unix line endings (\n) instead of Windows line endings (\r\n).  Correcting the line endings fixed the file, and now the icon looked right:
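If you hit the same problem, the fix can be scripted rather than done by hand in an editor; here’s a one-off sketch in C# (the solution path is hypothetical):

```csharp
using System.IO;
using System.Text.RegularExpressions;

class FixLineEndings
{
    static void Main()
    {
        // Hypothetical path to the broken solution file.
        string path = @"C:\src\MyApp\MyApp.sln";
        string text = File.ReadAllText(path);

        // Replace any \n not already preceded by \r with \r\n,
        // leaving existing \r\n pairs untouched.
        File.WriteAllText(path, Regex.Replace(text, "(?<!\r)\n", "\r\n"));
    }
}
```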

Good solution

The strange part was that the other guys in the office had no trouble opening the solution, so why didn’t it work for me?  I figured the “10” in the icon must have something to do with it, and noticed I still had Visual Studio 2008 installed.  I’m guessing this issue happens when you have two versions installed and the file gets opened through the Visual Studio Version Selector, which can’t seem to handle files with incorrect line endings.

OpenWithVersionSelector

At least this reminded me to finally uninstall Visual Studio 2008.  It makes me a little sad.  That one performed so much better than 2010.