Here’s a puzzle for you all …

Assume you’re working with a complex system that’s highly important to the business, one that is mission critical for some of the staff, who depend on it to do their jobs every day.

Further, let’s assume the system has just been restarted after a major meltdown and you’re investigating the cause of that failure.

When you think you’ve identified the sequence of actions that caused the system to go down, and you have a plausible theory of the crash that matches the known information, what do you do next?

  1. Document your theory and actively solicit feedback from other knowledgeable parties?

  2. Run a controlled test of your theory in a non-critical (non-production!) environment to see what happens?

  3. Try it out on the production system to see if the meltdown recurs?

If you happen to think that option #3 is acceptable, think about what happens if your theory is correct and the production system goes down again. (Bonus points if you can guess which of these was inflicted on a critical system at work today.)

Having a production system go down for reasons outside of your control can be both catastrophic and expensive, with costs in time, direct expenses and lost revenue.

If that downtime occurs for reasons under your control, if you in fact actively caused the failure, then the consequences should also be “career limiting.”

For most systems, test environments are set up and maintained for exactly this reason - to provide a safe place where failure is less significant, less costly, less important.

No matter how good you think you are - in fact, regardless of how good you actually are - playing fast and loose with a production system is not OK.
