The premise that systems should ‘fail fast’ is pretty well established: the idea has its own Wikipedia page, and any number of books treat it as a fundamental principle. For example, Release It! from the Pragmatic Bookshelf refers to Fail Fast several times in Chapter 5, Stability Patterns.

I first encountered the idea of Fail Fast (though it wasn’t called that then) way back in my university days while completing my Computer Science degree.

It is frustrating, therefore, to come across software that seems to completely ignore the premise.

At the moment, I’m spending way too much time working with product X.

Product X is a tool that is supposed to help with a particular data-transfer problem, connecting two existing systems so that the value of each is increased.

Unfortunately, Product X doesn’t ‘Fail Fast’ when something goes wrong. Instead, it seems to follow a philosophy of ‘dogged determination’: whenever something goes wrong, do your best to keep running, regardless of the consequences.

For example, today I found that one of the database views lacked the correct permissions, so no data was being transferred. The system had logged no error; I noticed the problem purely by chance. Worse, when I brought up the configuration tool to check the details, it said nothing at all about the inaccessible database view. Worst of all, the system appeared to have lost all of the mapping configuration.

Fortunately, once visibility of the database view was restored, I was able to recover the mapping details without starting from scratch.

Based on this experience, I’d like to suggest that systems should not just Fail Fast, they should Fail Loudly.

A critical error in a production system should never be kept quiet, waiting for chance discovery.

There must be a thousand ways to get attention, from the subtle to the ridiculous. Use them. Don’t wait for someone to wander by while their mind is focused on another task: create a log file, write details into the event log, broadcast an event, bring up a stomping great error dialog, post a blog entry, start a siren, set off the sprinklers. Do whatever is necessary to get attention and get the problem solved.
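To make the contrast with ‘dogged determination’ concrete, here is a minimal Python sketch of a transfer step that fails fast and loudly. Everything in it is hypothetical and for illustration only, not how Product X (or any particular product) works: the names `transfer`, `fail_loudly` and `TransferError`, and the idea of passing in a list of "alert channels" (plain callables standing in for an event log, a pager, a siren) are all assumptions. The point is simply that an unreadable source stops the run, is logged at the highest severity, and is pushed to every channel that might reach a human.

```python
import logging
from typing import Callable, Iterable, List, NoReturn

logger = logging.getLogger("transfer")  # hypothetical logger name

class TransferError(RuntimeError):
    """Hypothetical error type: raised instead of quietly carrying on."""

# An alert channel is anything that can get a human's attention
# (event log, email, pager, siren...). All are hypothetical here.
Alert = Callable[[str], None]

def fail_loudly(message: str, alerts: Iterable[Alert]) -> NoReturn:
    """Log at the highest severity, notify every channel, then stop."""
    logger.critical(message)
    for alert in alerts:
        try:
            alert(message)
        except Exception:
            # A broken alert channel must not swallow the original failure.
            logger.exception("alert channel failed")
    raise TransferError(message)

def transfer(fetch_rows: Callable[[], List[dict]],
             deliver: Callable[[List[dict]], None],
             alerts: Iterable[Alert]) -> int:
    """Copy rows from source to target; fail fast and loudly if the source is unreadable."""
    try:
        rows = fetch_rows()
    except PermissionError as exc:
        # Fail fast: don't keep running as if nothing had happened.
        fail_loudly(f"source view unreadable: {exc}", alerts)
    deliver(rows)
    return len(rows)
```

Called with a fetch function that raises `PermissionError` (say, because a database view lost its permissions), `transfer` logs the message at CRITICAL, hands it to each alert channel, and raises `TransferError` instead of silently transferring nothing. Wrapping each alert in its own `try` is a deliberate choice: a failed siren shouldn't hide the original problem.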
