Learning about error handling the hard way!

Let's be honest, error handling is not fun. We like to spend time on the functionality right? The cool stuff that people see or use and then remark how good your software is! Error handling is like sweeping the roads or emptying the bins - necessary but very dull.

So two things have happened recently that has taught me the hard way about trying to see potential errors and handling them properly. One of them more serious than the other. You should know that I do have error handling. For instance, all of my database functions will catch exceptions, log them and then return some kind of error code but this is not enough!

The big, bad experience I had was with a worker task that is designed to delete unused images from our cloud storage. It works pretty simply, every 4 hours it queries the database for images that are referenced and then queries the storage for images that are stored and then any on disk that are not in the database are deleted. Easy right? How many errors could you possibly see and what effect could they have? How can I improve it? Well, there were no sanity checks anywhere so when anything failed, it just threw an exception, pretty much ignored it and then waited another 4 hours. However, one weekend, the database connection failed temporarily and so the query returned 0 entries. The code then decided that 'if not images are referenced in the database, I need to delete everything from the images storage', which it did! Every user's uploaded profile image was deleted - just like that.

It wasn't the end of the world, I had to edit the database and set everyone's profile picture to a placeholder than included text describing what had happened and how they needed to re-upload an image.

The second issue was a raft of errors that were logged to my error logs (and emailed to me!) and which meant that the database connections were being forcibly closed, which SQL Server is known to do when it feels like it. I don't know how this works on SQL Azure which is essentially VM style databases on a shared host but what I do know is that I don't want people to see generic error messages which mean nothing and which make the problem my fault in their eyes.

I realised I needed more. I firstly needed some more checking for conditions external to my system: did the query suceed? How can I tell if it succeeded? How many rows were returned because I would expect a few thousand - certainly not zero. What should I do if this happened? Log the error? Disable the worker? Instead of deleting the images, perhaps I could move them to another folder so they could be restored if required (for perhaps a month?).

The second thing I needed was for the users to know that when a database connection fails, it is not my fault and they can try again soon. I can also automatically retry calls to the database if they fail for a reasonable amount of time until the user would get annoyed and then either display a "sorry, it's not my fault" page or even just a "please wait, the system is attempting to re-connect to the database".

In conclusion, I get annoyed as a user when something on a site crashes - even when it shows you a nice message to say it has gone wrong. I therefore need to realise that my users will also get annoyed if I don't correct prevent and/or recover from errors in the system - whether code bugs or network issues. Logging is also essential so that these errors are quickly tracked down and resolved.