What Really Happens on a Release Weekend
The final walkthrough was complete. The master spreadsheet ('the spreadie') showed 136 rows. I got the sense that everyone was pumped for this release weekend. It's not often that we have releases with so many applications in them. This was definitely a biggie.
No one would say so, but I felt it too. This is what we lived for. This was our NASA launch. The one time when we had some real pressure on us to deliver. From the outside it looked like just another of the IT releases that happen all the time in financial organisations of our size. But we knew. This was a test. A test of endurance. A test of execution. A test of focus. And above everything else, a test of staying calm under pressure. What we didn't know in that walkthrough, one day before the release kicked off, was that we would find ourselves in a real dog fight.
The release got under way as planned at 8pm on the Friday. My role was 'The Commander', head of command and control. My job was to coordinate all the 'pilots', making sure they got in the air at the right time, steering them through their mission and then landing them safely. We had a few rookie pilots for this mission, so I had to be on my game. This first night was all about endurance. The spreadie had the finish time at 4am, but we all knew the spreadie could be blown out of the air at any point.
By about midnight I was starting to get nervous. Nothing had gone wrong, and that was exactly what made me nervous. So far the spreadie was predicting everything within ±15 minutes. It was all a little too perfect.
Pilot 3, you are cleared for landing. Nice job.
I had landed all the pilots safely but one. A rookie pilot. No need to panic; the spreadie had predicted that she would be last to land. She was way out there in India. All I had was online chat contact with her. Was she getting tired? I was. The couch was starting to call me. Stay focused. Just land this last plane and file the report. She landed safely just after 4am. One task remained before a well-earned rest. I just needed to send an email out to everyone letting them know that we were all clear for Saturday activities. All systems were go.
Someone was chatting me. Why is pilot 3 chatting me at 4:15am? Wasn't he well tucked up in bed by now? I was just about to press send on the 'all good' email.
'We've got a problem!'
At first I thought I was hallucinating. We can't have a problem. How does he know we have a problem? What is he doing at 4am?
Turns out he had gone 'off spreadie'. He too thought everything was a little too perfect, so he decided to run some further checks after he landed. What he found surprised him. He didn't know what was happening, but he knew enough to know that something wasn't right.
OK, don't panic. Barely able to keep my eyes open, I had to make a call. Something inside told me that tomorrow (which was already today) was going to need everything we had. Best let everyone rest and tackle this in a few hours.
When I woke up it was a different world. The business had already been at it for an hour and they weren't liking what they saw. 'Big Boy' (application 1) was behaving erratically. Just as pilot 3 had reported only hours earlier, something wasn't right.
Big Boy had been around for years and he wasn't going anywhere soon. Some say he'd taken on a life of his own in the last few years. There were even rumours that mysterious lines of code appeared as if written by Big Boy himself.
No one knew the entirety of Big Boy, he was too big. Well, there was 'The Oracle'. What the Oracle didn't know about Big Boy probably wasn't worth knowing. At times it was hard to tell them apart. Man and machine as one.
A problem with Big Boy was not to be laughed at. What starts out as a seemingly small issue can spread like a forest fire on a windy day. Before you know it, you have entered chaos, where cause and effect are dancing in the shadows.
It was time to call 'The General' and 'King Julian'. They would know what to do.
The thing about The General and King Julian was that they had been here so many times before. If you walked past them in the office you would think nothing of them. Just a couple of mid-level managers going about their business in an extremely large organisation.
But they were old school. They didn't believe the hype about the 'new world order'. They were grounded in reality. Their motto was to always respect 'the machines' but never give them more credit than they deserve. You arm yourself with people capable of fighting them off when they start to misbehave. And that's what they did. They knew that this business was about people. Good people. Capable people. People that wouldn't panic. They knew that if you didn't have pilots that could fly through a hurricane, then you basically didn't have pilots at all. So they went about their business, slowly and methodically. Finding pilots. Weeding out those that panicked in a mere crosswind, until they were left with those you could trust.
Their advice was simple: 'Send the pilots up again.' This time, 'send them in pairs'. Genius!
So that's what we did. We stood the business down and we sent two of our elite back into Big Boy.
What they found was not good. The worst possible case: one definite problem with a known fix, and another problem that needed more time and might or might not be related to the first.
It had started. The ambiguity had set in. Slowly but surely we were going into a spin. The clock was ticking. It was Saturday afternoon and we were now in a dog fight with Big Boy. The wind was picking up.
It was time to call the internal Feds!
Don't get me wrong, no one liked calling the Feds. They were outsiders. Their job was simple. To tell you to stop when they thought you couldn't complete the mission. They weren't interested in pilots or heroics. All they cared about was stability.
My heart rate was up as I called them. Experience had taught me to play it as cool as possible. Just a routine issue with plenty of time to fix it. No need to panic.
The problem was that they must have picked something up in my voice. They were trained to pick up the slightest whiff of panic.
Their decision: They wanted a conference call immediately with the General, the pilots and King Julian. Disaster! Golden rule. Never put the pilots in the same room as the internal Feds. That's a recipe for 'too much information'.
The call was a complete disaster. They reeled one of the pilots into their trap and, before we could say 'shut the f*ck up', the pilot was blabbering about how difficult a back-out would be at this point in time. If there's one impression you don't want to give the internal Feds, it's that a back-out would be very complicated.
As the person who brought the Feds and Pilots together, I was not looking good at this point.
The Feds had laid down the law.
'You have until the end of the night to fix this or else we're shutting you down'.
Then the General uttered the words:
'Call the Oracle!'
Did he just say what I think he said? Hasn't the Oracle been unwell recently? Shouldn't he be resting?
When you're in a dog fight and the clock is ticking...
Apparently King Julian was already in the process of driving to the Oracle's house. The thing with the Oracle is he's only just moved to mobile phones and I think he still keeps his in the fridge over weekends. No one has had the heart to tell him they don't need cooling.
Within a matter of minutes the Oracle was suited up and cleared for take-off. King Julian lived for these moments. I often thought this was the only reason he still did this job.
They were up there for what felt like days. At times they were completely off grid. The winds were too great. I couldn't get through to anyone.
When they came out of the clouds, they'd picked up a hunch. That's all it was, but a hunch from the Oracle was worth 1,000 hours of detailed analysis by a mere mortal. He had seen this behaviour before: a deployment issue that disguised itself as something more serious. Left long enough, it starts to look as if multiple areas of Big Boy are failing. But the root cause is the same.
We had 2 hours to get this hooked up to a test environment, prove it, bring the business back in to verify it and then drop it into Big Boy tonight.
We'd need 'The Engineer' and 'The Database Guru'. Both of them were on board, no questions asked. Both of them had been working with Big Boy for years.
Six highly trained veterans were now in a dog fight with Big Boy, and we were starting to get the upper hand. By 6pm we had built Big Boy 2, a perfect replica. We deployed the hunch, tested it and verified it with the business.
Now for approval from the Feds to deploy the hunch to the real Big Boy. They consented.
At 8pm we flew the hunch right into the heart of Big Boy. He fought it, but he knew he was defeated.
He settled down overnight.
By morning he was behaving as normal. The business gave their seal of approval.
At 12pm Sunday, just as the spreadie stated, I sent the 'all finished' email out to anyone who cared to read it.
Those who were there knew. The machines' time would come, but for the moment this was a world ruled by humans.