Developer’s tool belt: debugging production issues–part 3

Before I dive into the next post in this series, I want to recommend reading Debugging Microsoft .NET 2.0 Applications.  I rely on this book whenever I can't remember how to do something in WinDBG, and much of what I'll be blogging about for investigating memory issues and deadlocks I learned from it.  In addition to its great detail about using WinDBG, it has a lot of other great tips for debugging with Visual Studio.

There are a few commands I run on a dump when I first open it in WinDBG to get a quick look at the state of the process.  The first command must always be run in order to analyze .NET dumps:

.loadby sos clr

This loads SOS.dll into WinDBG and enables all of the .NET related commands you’ll need to use.  If the dump was created due to a crash, something like the following might be displayed when it’s first loaded:

This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(80c.3fc): CLR exception - code e0434352 (first/second chance not available)
ntdll!ZwWaitForMultipleObjects+0xa:
00000000`771718ca c3              ret

You can execute .ecxr, but it isn't aware of .NET, so it will just display the registers and the unmanaged stack trace at the time of the exception.  Instead, !analyze -v will display the Windows Error Reporting information related to the exception, including the stack trace of the thread the exception occurred on.  Typically, I want to see what all of the threads looked like when the dump was made.  First, I use !threads:

[Screenshot: !threads output listing the managed threads, including one with an OutOfMemoryException]

Oh look, there's the OutOfMemoryException that crashed the process.  Threads with XXXX as the thread number in the leftmost column are dead threads that haven't been cleaned up yet.  You won't be able to switch to those threads or view their stacks, but if a lot of them are listed, that could be an indication of the issue.  If a particular thread doesn't catch my eye, I'll usually use !EEStack next.  It works the same as !DumpStack, but gives the full stack trace (managed and unmanaged) for all threads, which helps you get a good idea of what was going on at the time of the dump.  !ClrStack is similar but only displays the managed portion of the stack; it's useful when you want to see the parameters passed to methods and the local variables once you get deeper into the dump.  In the case of !EEStack, seeing the unmanaged stack makes it easy to spot threads that are currently sleeping, whether because they have nothing to do, they're waiting for a lock, or some other action is causing them to block.  This is usually indicated by a WaitFor… at the top of the stack trace:

ntdll!ZwWaitForMultipleObjects

For example, here’s the top of a stack trace of a thread waiting for a SQL query to complete:

[Screenshot: stack trace of a thread blocked on a SQL query, with a WaitForSingleObject at the top]

You can see the difference between how managed and native methods are displayed, and at the top there’s a WaitForSingleObject in this case.
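If one particular thread looks interesting, you can switch to it and dump just its managed frames along with the parameters passed to each method.  A quick sketch (the thread number 12 is arbitrary; use the debugger thread index shown by !threads):

$$ Switch to thread 12, then display its managed stack with method parameters
~12s
!clrstack -p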

If my intent was to try to track down a memory issue, my next step would be to look at the heap with !DumpHeap -stat.  I'll go into the details of debugging memory issues in my next post.
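To recap, here's the first pass I usually make over a dump, roughly in the order covered above (a sketch, not a recipe; adjust based on what each command turns up):

$$ Load SOS so the managed (.NET) commands are available
.loadby sos clr
$$ Windows Error Reporting details for the exception (crash dumps only)
!analyze -v
$$ List the managed threads; look for exceptions and dead (XXXX) threads
!threads
$$ Full managed and unmanaged stacks for every thread
!eestack
$$ Heap summary by type, the starting point for memory issues
!dumpheap -stat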

Developer’s tool belt: debugging production issues–part 2

So your shiny new application is ready to deploy.  You've written unit tests and have great code coverage.  What could go wrong?  Then, after running in production for some time, an issue crops up.  Maybe it freezes and you suspect there might be a deadlock, or it crashes without any error logged, or there's a memory leak you can't track down.  If the issue can't be reproduced and the logs don't show the issue, then digging into a memory dump might be the last resort.

This post will cover setting up WinDBG and taking memory dumps.  I’ll dive into details about analyzing memory dumps in my next post.  The first step is to install WinDBG.  This used to be a separate install but I guess someone thought that was too easy, so now it’s bundled into the Windows SDK.  Download the web installer and choose Debugging Tools for Windows from the installation options:

[Screenshot: Windows SDK installation options with Debugging Tools for Windows selected]

Once WinDBG is installed, an environment variable named "_NT_SYMBOL_PATH" needs to be created to specify where WinDBG can find debugging symbols for modules.  Set this variable to "SRV*C:\Symbols*http://referencesource.microsoft.com/symbols;SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols".  This instructs WinDBG to download symbols from http://referencesource.microsoft.com/symbols and http://msdl.microsoft.com/download/symbols and to save them to C:\Symbols, from which they'll be loaded when needed.
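One way to set it is with setx from a command prompt (it can also be added through the System Properties dialog); restart WinDBG afterwards so it picks up the new value:

setx _NT_SYMBOL_PATH "SRV*C:\Symbols*http://referencesource.microsoft.com/symbols;SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols"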

Creating Dumps

There are a few ways to create memory dumps; which one to use basically boils down to user preference.  The easiest way to create a dump is by using Task Manager, which has had this ability since Windows Vista.  Right-click on the process you want to create the dump of and select Create Dump File:

[Screenshot: Task Manager with Create Dump File selected in the process context menu]

This will take a minute or two and display the location of the dump:

[Screenshot: Task Manager dialog showing the path of the created dump file]

One thing to keep in mind, which I just learned when creating the dump of Visual Studio for the screenshot above, is that Task Manager can’t create dumps of 32-bit processes by default.  When trying to open the resulting dump file it errors with the message “Debugging a 64-bit dump of a 32-bit process is not supported, please collect a 32-bit dump of a 32-bit process.”  A 32-bit dump can be created by running the 32-bit version of Task Manager from C:\Windows\SysWOW64.  This blog post has more information.  Since I’ve always been creating dumps from 64-bit processes, I haven’t been affected by this before, thanks to .NET targeting Any CPU when compiling by default.

The traditional method for creating dumps is by using ADPlus, which will attach CDB (the console debugger installed with WinDBG) and create a dump file.  The great thing about ADPlus is that you can use it to create a dump immediately when the process has hung (known as a "hang" dump) or have it wait until the process crashes and create a crash dump.  To use ADPlus, it's easiest to add "C:\Program Files\Debugging Tools for Windows (x64)" to the Path environment variable and open an administrator command prompt.  Just type "adplus" to view a summary of the options available.  Details on the various options can be found here.

It might look complicated, but I usually only need three options.  First, specify -hang or -crash to indicate whether to create a dump immediately or to just attach CDB now and wait for a crash.  Second, specify where to save the dump with -o.  And third, specify which process to attach to, which is most easily done by using -p and entering the process ID (you can get this from Task Manager using View > Select Columns… > PID) or by using -pn and entering the process file name if there's only one instance running.  So, if I wanted to create a hang dump for process 7900, I'd type "adplus -hang -o C:\Dumps -p 7900".  This opens a separate console window while the dump is being created that displays output from CDB, which will close when it's finished.
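To wait for a crash instead, the command looks the same; for example, against a single running instance of a process by file name (MyService.exe is just a placeholder here):

adplus -crash -o C:\Dumps -pn MyService.exe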

The last method of obtaining a dump is by using the dump created by Windows Error Reporting when the process crashes, if you were lucky enough that one was created.  Look through the Application event log for an Information event from Windows Error Reporting logged just after the Error event for the crash.  This log entry will have details about the crash, like the following example:

Fault bucket , type 0
Event Name: CLR20r3
Response: Not available
Cab Id: 0

Problem signature:
P1: 3KQHBFJJVVOHCIEKMSLMA1UKFTHQY1VE
P2: 2.1.0.63
P3: 507c3954
P4: mscorlib
P5: 4.0.0.0
P6: 4f1967ce
P7: 1f45
P8: 10
P9: System.OutOfMemoryException
P10: 

Attached files:

These files may be available here:
C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_3KQHBFJJVVOHCIEK_aabae3283b257b1c1dde673025c79c8073f44c84_23ee8796

Analysis symbol: 
Rechecking for solution: 0
Report Id: e566d53e-24ab-11e2-bc24-005056990056
Report Status: 4

The AppCrash folder may have an .hdmp file in it, which can be copied to another folder and opened with WinDBG.  You’ll want to copy it so you can use it after the report queue is cleared.

In my next post, I’ll detail the techniques I use to triage a memory dump taken from a crash or a hang. Coming up after that, I’ll cover how to search for memory leaks in a memory dump.

Developer’s tool belt: debugging production issues–part 1

Following my post about the bus factor, I thought it’d be a good idea to focus on a topic that hasn’t been shared much with my coworkers but is a great tool to have in your toolbelt: how to triage production issues as they happen, when the stakes are high and you need to get things working quickly and figure out what went wrong as fast as possible.  This series will primarily cover gathering and analyzing memory dumps with WinDBG, but before I dive into details about memory dumps, I want to cover something that’s very important for any application that’s been deployed to production.

Logging

Logging is your first line of defense when things start going wrong.  With effective logging, you can avoid digging through a memory dump in all but the worst cases.  Why is that important?  Because debugging through a memory dump is time consuming, very time consuming, and all of that time could be better spent on other things.  It also means you can fix production issues faster, minimizing the system’s downtime.  In some cases, from what you see happening in the log you can proactively take steps before an issue becomes serious enough to take the system down.

Make logging a first class citizen in your software architecture, not an afterthought.  There are many mature logging frameworks to choose from, so you just need to pick one that covers your needs.  Keep in mind the things that might need to be logged in the system.  Do different components need to log separately?  Does the logging need to be asynchronous for performance?  Do different logs need to be created for each execution?  For example, if you're creating an application that imports files, does it make sense to log each file import separately or is it fine to combine everything into one file?  One of the most important things to keep in mind, though, is don't log so much that you have to dig through a pile of junk to find the useful parts.  It takes some time and trial and error to get it right, but it really makes a difference.

Logs are useless if no one looks at them

In general it's good to review logs periodically to make sure there's nothing out of the ordinary going on.  On the other hand, errors that occur shouldn't be buried in a log file waiting to be found days after they've occurred.  While it's a good idea to still log them with everything else, they also need to be more in your face.  I've mostly done this with email alerts, but there are other ways to create alerts too, such as IM, text message, or third party monitoring software.  You have to be careful to limit how many alerts get created, or you'll start to ignore all of the error emails that come in.  Keep that in mind as you're writing error handling code to make sure you don't "cry wolf".  Reserve sending emails for the exceptions that are critical.  Context is also essential when emailing an error.  The more details included, the greater the chance that you won't even need to open the full log.  For example, if you catch a SqlException, include the parameters used in the SQL statement.  Maybe you'll immediately spot the edge case you hadn't tested for.
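As a quick sketch of what I mean (Logger is a stand-in for whatever logging or alerting wrapper you use, and the Orders table is made up for the example):

using System;
using System.Data.SqlClient;

public class OrderRepository
{
    public void UpdateOrderStatus(SqlConnection connection, int orderId, string status)
    {
        try
        {
            using (var command = new SqlCommand(
                "UPDATE Orders SET Status = @status WHERE OrderId = @orderId", connection))
            {
                command.Parameters.AddWithValue("@status", status);
                command.Parameters.AddWithValue("@orderId", orderId);
                command.ExecuteNonQuery();
            }
        }
        catch (SqlException ex)
        {
            // Include the parameters so the alert has enough context to spot an
            // untested edge case without digging through the full log.
            // Logger is a placeholder for your logging framework (log4net, NLog, etc.).
            Logger.Error(string.Format(
                "Failed to update order status. OrderId={0}, Status={1}", orderId, status), ex);
            throw;
        }
    }
}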

Determining a good error handling strategy has to be done on a case by case basis, and will vary from application to application.  There are a lot of factors to consider.  Is this a client application or a backend mission critical system?  Are the integration points going to be prone to errors?  Is it ok for the process to crash or does that need to be avoided as much as possible?  The frequency of errors and how they get handled will affect how they get logged.  In some cases emailing each error works great, while in others it will make more sense to only send an alert if errors occur repeatedly.  It might take a few iterations to get things right, to find the places that need more detail logged, and decide what works best for the application.

In my next post, I’ll cover the different ways of creating memory dumps and opening them in WinDbg.  After that I’ll be posting about using WinDbg to dig through the mountains of data within the dump.

Rhino.ServiceBus saga persistence with RavenDB

I want to document the RavenDB saga persister that I added to GitHub about a year ago.  Thanks to Corey Kaylor for his help with writing the initial code.

RavenDB is a natural fit for saga state storage, and the client's support for dynamic objects makes it very easy to do.  Setting up your project to use this persister requires a couple of registrations added to the service bus bootstrapper.  The following shows an example using Windsor:
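(A sketch only, not the exact code from the project: usings are omitted, and registering the persister as an open generic against ISagaPersister<> is an assumption, so adjust to match the actual bootstrapper.)

var container = new WindsorContainer();
container.Register(
    // The document store saga state will be persisted to; point it at whichever
    // RavenDB instance and database you want to use.
    Component.For<IDocumentStore>()
        .Instance(new DocumentStore { Url = "http://localhost:8080" }.Initialize()),

    // Message module that opens an IDocumentSession per message for the persister.
    Component.For<IMessageModule>()
        .ImplementedBy<RavenStoreProviderMessageModule>(),

    // The saga persister itself.
    Component.For(typeof(ISagaPersister<>))
        .ImplementedBy(typeof(RavenSagaPersister<>)));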

The RavenStoreProviderMessageModule lazily creates IDocumentSessions scoped to the processing of a message when the saga persister is invoked.  It does this using the IDocumentStore injected into it, so you can configure which document store saga state gets persisted to however you choose.

The RavenSagaPersister handles loading and saving saga state using an IDocumentSession.  It sets the document ID using the convention ‘{saga type}/{saga ID}’.  For example: MySagaType/25C209BD-F655-4526-92E0-6A2CB59AEAE2.

You can view the full RavenSagaPersister on GitHub here.

Please note: The project's dependencies are currently out of date, so it'll need to be recompiled with the versions of RavenDB and Rhino.ServiceBus you're using, or you can just add an assembly redirect to your project's config.

Can’t open a solution? Check your line endings

I've run into this issue occasionally over the past few months, but just wrote it off as a minor annoyance.  Some Visual Studio solutions wouldn't open when I double-clicked on them.  Nothing would happen at all.  I'd have to open Visual Studio first, then open the solution from there.  I also noticed that the icon of the file didn't have the little "10" in the corner:

[Screenshot: solution file icon missing the "10" overlay]

The issue cropped up again this week, and also happened to a coworker, so I decided to dig into the solution file itself to find out what was wrong.  That’s when I noticed the file had Unix line endings (\n) instead of Windows line endings (\r\n).  Correcting the line endings fixed the file and now the icon looked right:

[Screenshot: solution file icon with the "10" overlay]

The strange part was that the other guys in the office had no trouble opening the solution, so why didn't it work for me?  I figured the "10" in the icon must have something to do with it, and noticed I still had Visual Studio 2008 installed.  I'm guessing this issue happens when you have two versions installed and the file gets opened through the Visual Studio Version Selector, which can't seem to handle files with incorrect line endings.

[Screenshot: Open with menu showing Microsoft Visual Studio Version Selector]

At least this reminded me to finally uninstall Visual Studio 2008.  It makes me a little sad.  That one performed so much better than 2010.

Rhino.ServiceBus 3.0 Delivery Options

This post delves into the details of a feature I contributed to Rhino.ServiceBus, available in the recently released version 3.0.  Previously, you were able to add custom headers to outgoing messages by registering an implementation of ICustomizeMessageHeaders with the container.  That interface has been renamed to ICustomizeOutgoingMessages, and now allows you to also set a DeliverBy and/or MaxAttempts.  You can register any number of implementations with the container, and the bus will invoke each one as it builds outgoing messages.

Rhino.Queues has full support for DeliverBy and MaxAttempts.  If either of these options is set and the message couldn't be delivered in time, the message will be moved to the OutgoingHistory and marked as a failed transmission.

MSMQ supports DeliverBy by setting the TimeToReachQueue property on the MSMQ message.  MaxAttempts is only supported when set to 1, which sets TimeToReachQueue to TimeSpan.Zero, meaning MSMQ only makes one attempt to deliver the message.  If MaxAttempts is set to a value other than 1, an InvalidUsageException is thrown.

Because it’s up to the user to create an implementation of ICustomizeOutgoingMessages, the user can set delivery options in a way that works best for them, whether it’s by setting options for specific message types, specific destinations, or reflecting over messages to look for specific attributes or interfaces.

Examples:

If you want to limit how many times the bus tries delivering messages to endpoints that aren't always online, you could add a query parameter to the endpoint and then customize using the Destination on the OutgoingMessageInformation.
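A minimal sketch of what that could look like (usings omitted; the maxAttempts query parameter is just a convention made up for this example, and I'm assuming the interface's single Customize(OutgoingMessageInformation) method and that Destination exposes the endpoint's Uri):

public class UnreliableEndpointCustomizer : ICustomizeOutgoingMessages
{
    public void Customize(OutgoingMessageInformation messageInformation)
    {
        // Endpoints that aren't always online are marked with a query parameter,
        // e.g. msmq://remote-kiosk/some_queue?maxAttempts=1
        var query = messageInformation.Destination.Uri.Query;
        if (query.Contains("maxAttempts=1"))
            messageInformation.MaxAttempts = 1;
    }
}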

 

Maybe you have a message that’s only useful for a short period of time.
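For example (usings omitted; StatusPing is a made-up message type and the 30 second window is arbitrary):

public class ShortLivedMessageCustomizer : ICustomizeOutgoingMessages
{
    public void Customize(OutgoingMessageInformation messageInformation)
    {
        // If the message can't be delivered within 30 seconds it's no longer
        // useful, so let the transport drop it instead of retrying forever.
        if (messageInformation.Messages.OfType<StatusPing>().Any())
            messageInformation.DeliverBy = DateTime.Now.AddSeconds(30);
    }
}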

 

Using the OutgoingMessageInformation instance, you can inspect the Messages being sent, the Destination endpoint, and the Source endpoint to make any customizations you need.  You can modify the Headers collection on that instance, set MaxAttempts, or set DeliverBy.

Four years later, a retrospective

Welcome back!  I’ve joined a few of my coworkers in a weekly blog challenge, to write a new post each week.  To start, I figured I should finally write the post I meant to write four years ago, about a small piece of code that saved our rear ends from a huge issue plaguing our production environment. 

Right about this time four years ago, our busy season was just beginning, and we had recently deployed our custom-built phone call distributor, Complemax.  We started seeing an issue crop up that was so severe, Complemax effectively stopped routing calls in our call center.  Complemax works by building a grid of phone calls and agents and scoring each match.  Then the calls are routed to the best matching agents.  After digging through logs, we found that Complemax was processing messages fine, calls were being added and removed as they went on hold or hung up, and employees were updating as they came online or changed their status, but the grid wasn't being scored!

So now we had found the symptom, but what was the cause?  After analyzing memory dumps, digging through the code, and learning more about low level debugging than I ever had, we finally found that the issue was caused by our threading.  The grid uses a ReaderWriterLockSlim to protect its state as it's updated.  Every five seconds, a read lock would be taken so the current matches could be scored, during which pending updates would wait a short time.  Unfortunately, that lock prioritizes writes over reads, causing our all-important grid calculation thread to be starved by the large number of updates coming in as we received more and more phone calls.

How would we prioritize the read lock, to freeze the grid while we looped through it?  The solution needed to be simple, so we could fix the issue in production as fast as possible.  That’s when I came up with the idea of a “dual” reader-writer lock.  Use two locks on top of each other to give the less common read operation higher priority.  When an update is needed, take a read lock on the first lock and then a write lock on the second lock.  When the grid needs to be frozen, take a write lock on the first lock and pending “reads” on that lock will wait, preventing more updates to the grid.
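Here's a rough sketch of the idea (not the actual production code, just the shape of it):

using System;
using System.Threading;

public class DualReaderWriterLock
{
    private readonly ReaderWriterLockSlim outer = new ReaderWriterLockSlim();
    private readonly ReaderWriterLockSlim inner = new ReaderWriterLockSlim();

    // The frequent update path: a read lock on the outer lock lets updates run
    // unless the grid is frozen, and the inner write lock serializes the actual
    // grid mutations.
    public void Update(Action updateGrid)
    {
        outer.EnterReadLock();
        try
        {
            inner.EnterWriteLock();
            try
            {
                updateGrid();
            }
            finally
            {
                inner.ExitWriteLock();
            }
        }
        finally
        {
            outer.ExitReadLock();
        }
    }

    // The scoring pass: taking a write lock on the outer lock makes pending
    // updates (outer readers) wait, freezing the grid while it's scored.
    public void Freeze(Action scoreGrid)
    {
        outer.EnterWriteLock();
        try
        {
            scoreGrid();
        }
        finally
        {
            outer.ExitWriteLock();
        }
    }
}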

I’m sure if we had caught the issue before it reached production, we would’ve ended up using something other than a ReaderWriterLockSlim in the first place, but surprisingly in the four years since we added the “dual” lock, we haven’t changed it.  It’s simple, it works, and it handles the load thrown at it.  It was certainly a good learning experience, one I’m glad I had once, but not one I’d ask for.