DevOps - P2 - Attention to detail case study
Ho ho ho! I have a fireside story for you today. Grab your eggnog if you must, Scotch if you can, and let me tell you the story of when Christmas almost failed.
In Part 2 of our DevOps adventure, I stressed that the non-production environments must be as similar to production as possible, in every way. Well, here is a tale where a seemingly innocuous difference had the power to blow up production!
Our story is almost a holiday tale, as it took place when happy carols fill the airwaves, malls are full of pissed-off people soaking up the cheer, and coffee-driven developers are eager to squeeze in changes before the end of the year.
Some technical mumbo-jumbo to set the stage
The story has to do with SQL Server. SQL Server, for the unacquainted, has both Data and Log files. When files grow beyond the capacity of the drives they are on, all hell breaks loose. Log files are particularly sinister, as they may temporarily swell to many times their regular size, only to shrink again. When the drives are unable to accommodate the swelling, the system simply grinds to a halt. So when you run heavy operations that can cause the Log files to grow, you typically keep a close eye on the size of those Log files.
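For the curious, here is a rough sketch of the kind of headroom check this implies. It is not how any particular monitoring tool does it; the function name `drives_at_risk`, the file tuples, the drive letters, and the growth factor of 3 are all made up for illustration. The idea is simply to look at every drive that holds a Log file and ask whether it could survive the Log swelling.

```python
from collections import defaultdict


def drives_at_risk(db_files, free_bytes, log_growth_factor=3.0):
    """Return the drives that could fill up if every Log file swelled.

    db_files: iterable of (file_name, kind, drive, size_bytes),
              where kind is "data" or "log" (hypothetical inventory format).
    free_bytes: dict mapping drive -> free bytes currently available.
    log_growth_factor: assumed worst-case multiple of the current Log size.
    """
    extra_needed = defaultdict(float)
    for _name, kind, drive, size in db_files:
        if kind == "log":
            # Extra space needed beyond what the Log already occupies.
            extra_needed[drive] += size * (log_growth_factor - 1.0)
    return sorted(
        drive for drive, need in extra_needed.items()
        if need > free_bytes.get(drive, 0)
    )


# Illustrative numbers only (sizes in arbitrary units).
same_drive = [("app.mdf", "data", "D:", 100), ("app.ldf", "log", "D:", 50)]
split_drives = [("app.mdf", "data", "D:", 100), ("app.ldf", "log", "L:", 50)]

print(drives_at_risk(same_drive, {"D:": 1000}))               # []
print(drives_at_risk(split_drives, {"D:": 1000, "L:": 60}))   # ['L:']
```

The point of the second call is that the answer depends on where the files actually live, not on where you assume they live.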
Now to Santa's workshop.
It so happened that one of Santa's most eager helpers was doing exactly that - a critical upgrade on Santa's sleigh - Santa is exclusively on Microsoft tech. Anyway, all of the North Pole's procedures were followed diligently; all proper testing and monitoring and planning and drinking and risking and caroling was done.
There was just one little problem... one little difference between the practice and production sleigh.
You see, Santa's testing sleighs had the Data and Log files on the same drive. Production (unbeknownst to the poor overzealous elf, who was never allowed to touch, let alone sit in, the production sleigh) had them separated on two drives - one drive for Data files, another for Logs.
Oh, but our little elf was so careful in upgrading the production sleighs. Monitored that one drive so well! All was going so well... and if it was not for that one pesky little difference between the production and non-production sleigh, all would have finished well, as well. Too many wells... bad omen.
Alas! The Data files, on the drive being watched, were doing just fine, but the Log, well, it... it swelled! And burst open, as one would expect. That one little difference caused a whole lot of crap! (1)
(Serious) moral of the story
When I stress that non-prod environments, especially Staging, have to be as close to production as possible, I am not just trying to be difficult or overprotective. Such stories are plentiful - everyone has a few of their own. Even a small, seemingly benign difference can cause problems.
If you want reliable DevOps, do not let unnecessary differences creep into your environment setup. Think twice before you introduce a difference to save cost. If you are forced to live with a difference now, plan to fix it later on. Be aware of all the differences you have, and make sure they are well socialized.
If most of your work is done in non-production environments, and that should be the case, the setup and structure of that environment burn into your memory. You move around almost instinctively. Forcing engineers to do a context switch when traveling between prod and pre-prod environments is a recipe for disaster.
Worse, there may be some actions or scripts that will have to be written differently for the different environments. This means that any test performed on staging is not sufficient for production. Technically and practically, you will be running and testing your script for the first time on live production.
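One low-tech way to stay aware of the differences is to write each environment down as plain key/value pairs and compare them. The sketch below is only an illustration: the function `environment_differences`, the setting names, and the drive letters are invented for this example, and a real inventory would come from whatever tooling you already use.

```python
_MISSING = "<missing>"


def environment_differences(prod, non_prod):
    """Return {setting: (prod_value, non_prod_value)} for every setting that differs.

    Both arguments are plain dicts describing an environment. A setting that
    exists in only one of them is reported with the "<missing>" placeholder.
    """
    differences = {}
    for key in sorted(set(prod) | set(non_prod)):
        prod_value = prod.get(key, _MISSING)
        non_prod_value = non_prod.get(key, _MISSING)
        if prod_value != non_prod_value:
            differences[key] = (prod_value, non_prod_value)
    return differences


# Hypothetical descriptions of the two sleighs.
production = {"data_drive": "D:", "log_drive": "L:", "sql_edition": "Enterprise"}
staging = {"data_drive": "D:", "log_drive": "D:", "sql_edition": "Enterprise"}

for setting, (prod_value, staging_value) in environment_differences(production, staging).items():
    print(f"{setting}: prod={prod_value} staging={staging_value}")
```

A list like this is easy to share with the whole team, which is half the battle: the elf cannot watch a drive nobody told him about.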
Take the time to have all your environments configured as similarly as possible.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(1) OK, OK, for dramatic effect, I embellished the story a bit. Nothing really bad happened in the end. The elves fixed the problem before it was too late, but it was so close!