Four years later, a retrospective

Welcome back!  I’ve joined a few of my coworkers in a weekly blog challenge.  To start, I figured I should finally write the post I meant to write four years ago, about a small piece of code that saved our rear ends from a huge issue plaguing our production environment.

Right about this time four years ago, our busy season was just beginning, and we had recently deployed our custom-built phone call distributor, Complemax.  We started seeing an issue so severe that Complemax effectively stopped routing calls in our call center.  Complemax works by building a grid of phone calls and agents and scoring each possible match.  Calls are then routed to the best-matching agents.  After digging through logs, we found that Complemax was processing messages fine: calls were being added and removed as they went on hold or hung up, and employees were updating as they came online or changed their status.  But the grid wasn’t being scored!

So now we knew what was failing, but what was the cause?  After analyzing memory dumps, digging through the code, and learning more about low-level debugging than I ever had, we finally traced the issue to our threading.  The grid is protected by a ReaderWriterLockSlim, with every update to its state taking a write lock.  Every five seconds, a read lock would be taken so the current matches could be scored, during which pending updates would wait a short time.  Unfortunately, that lock prioritizes writes over reads, so our all-important grid calculation thread was starved by the large number of updates coming in as we received more and more phone calls.

How could we prioritize the read lock, to freeze the grid while we looped through it?  The solution needed to be simple, so we could fix the issue in production as fast as possible.  That’s when I came up with the idea of a “dual” reader-writer lock: use two locks on top of each other to give the less common read operation higher priority.  When an update is needed, take a read lock on the first lock and then a write lock on the second lock.  When the grid needs to be frozen, take a write lock on the first lock; pending “reads” on that lock will wait, preventing more updates to the grid.  Because the lock favors writers, making the rare scoring pass the “writer” on the first lock puts it at the front of the line.
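To make the idea concrete, here is a minimal sketch of the dual lock in Python.  It is not the Complemax code.  Python has no built-in reader-writer lock, so the small `RWLock` class below is a hypothetical, writer-preferring stand-in for ReaderWriterLockSlim, and the names `DualLock`, `update`, `freeze`, `add_call`, `score_grid` and `demo` are all made up for illustration.

```python
import threading
from contextlib import contextmanager


class RWLock:
    """Hypothetical writer-preferring reader-writer lock (a stand-in, not the .NET class)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0          # threads currently holding a read lock
        self._writer = False       # True while a thread holds the write lock
        self._writers_waiting = 0  # a waiting writer holds back new readers

    @contextmanager
    def read(self):
        with self._cond:
            while self._writer or self._writers_waiting:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @contextmanager
    def write(self):
        with self._cond:
            self._writers_waiting += 1
            while self._writer or self._readers:
                self._cond.wait()
            self._writers_waiting -= 1
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()


class DualLock:
    """Two stacked locks that give the rare 'freeze' priority over frequent updates."""

    def __init__(self):
        self._first = RWLock()
        self._second = RWLock()

    @contextmanager
    def update(self):
        # Frequent operation: read lock on the first lock, write lock on the second.
        with self._first.read(), self._second.write():
            yield

    @contextmanager
    def freeze(self):
        # Rare operation: write lock on the first lock. New updates queue up
        # behind it as "readers", and it is granted once in-flight updates finish.
        with self._first.write():
            yield


def add_call(lock, grid, call_id):
    with lock.update():
        grid[call_id] = "waiting"


def score_grid(lock, grid):
    with lock.freeze():
        return len(grid)  # placeholder for the real scoring pass


def demo(n=50):
    lock, grid = DualLock(), {}
    threads = [threading.Thread(target=add_call, args=(lock, grid, i)) for i in range(n)]
    for t in threads:
        t.start()
    snapshot = score_grid(lock, grid)  # taken while updates may still be arriving
    for t in threads:
        t.join()
    return snapshot, score_grid(lock, grid)
```

Calling `demo()` returns two counts: one taken while updates may still be in flight, which can be anything from 0 to 50, and a final one taken after every update thread has finished, which is 50.  The design choice worth noticing is that nothing about the underlying lock changes; the roles are simply swapped on the first lock so that the operation we care about is the one the lock already favors.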

I’m sure that if we had caught the issue before it reached production, we would’ve ended up using something other than a ReaderWriterLockSlim in the first place.  Surprisingly, though, in the four years since we added the “dual” lock, we haven’t changed it.  It’s simple, it works, and it handles the load thrown at it.  It was certainly a good learning experience, one I’m glad I had once, but not one I’d ask for.
