The Simian Army, Principles of Chaos Engineering & building resilient construction projects
Deepak Mistry.
Risk Director at HKA | Infrastructure & Capital Projects Advisory | International
Chaos Engineering Part 3
This is Part 3 of a series of articles where I continue to explore what the discipline of risk management can learn from other industries to help us better manage risk, deal with blind spots and build resilience on construction projects.
The term Chaos Engineering may conjure up a sense of randomness and disorder but the discipline is far from this. Through the process of planned and controlled experimentation we can observe and learn about the behaviour of a system (insert “project”) in order to improve performance and risk mitigation efforts but also help design our project plans with resilience in mind.
As a recap, in Part 1 I introduced Chaos Engineering which is “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions”. I described some similarities to risk management.
In Part 2 I described the nature of “distributed systems” and related this to construction projects, listed several benefits of Chaos Engineering and explored the idea of “injecting failure” into a system by way of experiments to observe the impacts. My aim was to establish some common ground between both disciplines.
In this more in-depth article I delve into the advanced principles of Chaos Engineering, focusing on what IBM considers to be best practice. Before I do, I’d like to pay homage to Netflix, which has a long history with Chaos Engineering.
The Simian Army
My entire article series was inspired by Netflix, which is one of the pioneers of Chaos Engineering. The company began its move to the cloud in 2008 and, around 2010, introduced a tool called “Chaos Monkey” that randomly disabled instances in the live production environment in a controlled manner to help identify weak points to fix and make the system more resilient. Inspired by this, they went on to develop a "Simian Army" (with novel names!) of tools that induced various kinds of failures, or detected abnormal conditions, to test their ability to survive them.
“A virtual Simian Army to keep our cloud safe, secure, and highly available” Netflix TechBlog
Here’s a selection of some of the tools deployed:
Chaos Monkey – a tool which terminated virtual machine instances running in production (helped Netflix identify and fix issues with its auto-scaling, redundancy, and monitoring systems)
Latency Monkey – a tool which simulated network latency in order to test the resilience of the system to slow network connections (helped Netflix identify and fix issues with its timeouts and retry logic)
Chaos Kong – a tool which tested the resilience of Netflix’s data storage system by randomly killing entire data center regions (helped Netflix identify and fix issues with its data replication and recovery processes)
This caught my imagination as I could see similarities with some of the principles of risk management. It got me thinking about what, if anything, we could learn from Chaos Engineering and apply to how we manage risk on construction projects to build more resilience, especially for those risks which remain hidden and create blind spots. I liked the concept of experimentation, injecting failure and some of the random aspects of this, and wondered if it could help.
It also got me thinking about the equivalent of a Simian Army deployed in the risk management space. Interestingly, AI assistants have hit our working worlds so there's huge potential.
Advanced Principles of Chaos Engineering
The following list represents the advanced principles of Chaos Engineering and best practice approach by IBM. I intend to dive into each one in turn and apply this to risk management on construction projects:
Feel free to swap out the term “injecting failure” to “injecting risk” and consider this as running “risk experiments” instead of “chaos experiments”. I’ll attempt to relate each heading to risk management and explore concepts, leave questions for you to ponder over (I don’t have all the answers!) and see if it offers any inspiration for improvement or innovation.
Warning, it’s longer than previous articles so you might want to grab a brew first, skip to parts which interest you or exit now if you feel so inclined (no offence taken-ish!). Personally, it’s been a really useful exercise for me as it’s got out of my head the things I wanted to explore and has left me with several ideas to experiment with and explore.
Understand the System and establish the steady-state behaviour
“Defining a steady state hypothesis is a crucial step in the chaos engineering process, as it sets the foundation for all subsequent experiments.” How Netflix embraced Chaos. As distributed systems have become more… | by Haasita Pinnepu | Medium
This should describe what the system should be doing under “normal conditions” and for construction projects a proxy for this could be what has been planned or is expected. The construction schedule is one representation of this as it provides a project plan for all of the key activities and tasks that need to be completed to deliver the project’s outputs and outcomes. The plan represents a “deterministic” virtual model of how the project is expected to be delivered and how the system should behave, all other things being constant and given the assumptions made.
Understanding the system also means having an appreciation of the entire project, not just one element of it. The project plan helps here because of how it’s organised (Work Breakdown Structure), its characteristics (estimated activity durations/resources), its properties (constraints/dependencies) and the logical rules it follows. This project plan enables us to see the bigger picture and the more granular detail, as well as the impact of any changes on the rest of the project. Plans also offer up quantitative metrics which can be observed when change occurs.
I think the project plan fits well as a comparable “system” for applying Chaos Engineering principles in a practical way. It can provide a measurable baseline against which to observe changes in behaviour during chaos or risk experiments. The good news is that we’re already adopting this approach on projects through Project Controls reporting and the Planning teams, who understand the finer details, capturing and tracking the progress of activities. However, it often seems retrospective and reactive.
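To make the idea of a measurable baseline concrete, here is a minimal sketch of a "steady state" calculation: a forward pass through a tiny plan that returns the deterministic completion day. The four activities, their durations and the function name `finish_days` are all hypothetical and chosen purely for illustration; a real schedule would come from planning software.

```python
# Hypothetical four-activity plan: name -> (duration in days, predecessors).
# Activities are listed so that predecessors appear before their successors.
PLAN = {
    "design": (20, []),
    "foundations": (30, ["design"]),
    "procure_steel": (45, ["design"]),
    "erect_frame": (25, ["foundations", "procure_steel"]),
}


def finish_days(plan, overrides=None):
    """Forward pass: earliest finish of every activity, in days from start.

    `overrides` optionally replaces the planned duration of named activities,
    which is how a later risk experiment could perturb the baseline.
    """
    overrides = overrides or {}
    finish = {}
    for name, (duration, preds) in plan.items():
        # An activity starts once all of its predecessors have finished.
        start = max((finish[p] for p in preds), default=0)
        finish[name] = start + overrides.get(name, duration)
    return finish


baseline = finish_days(PLAN)
project_finish = max(baseline.values())
print(f"Steady-state completion: day {project_finish}")
```

Any experiment can then be expressed as a difference against `project_finish`, which gives the "observable metric" the steady-state hypothesis needs.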
The risk function tackles the forward looking aspect by injecting uncertainty and risk against the deterministic plan but the question I have is whether we’ve really challenged the status quo and whether we could squeeze more insight and value out from what we’re measuring and observing when we perform risk experiments in the form of risk analysis and scenarios.
The following additional questions are floating about in my head:
It’s important to understand the nature of the project system but also to decide what we want to measure when observing change and performing risk analysis. Establishing these measures upfront, before projects begin, is vital because they guide and inform decision making from day one.
Embrace failure
“Disruptions will always occur in IT services and it's better to experience them in a controlled environment to identify the solution pre-emptively”. What is Chaos Engineering? | IBM
As the saying goes “change is the one constant in life” and life on construction projects is a brilliant example of this. Whilst projects are the vehicle of change in our society, people working on them experience it too and plenty of it! Some of this is planned and controlled but much is driven by uncertainty and risk leading to poor performance and being caught on the back foot.
What’s not brilliant is the well publicised fact that the construction industry is rubbish at learning lessons from the past. It’s disappointing to continually observe that even with the wealth of information and experience we have at our disposal we’re failing to address known repeat offending risks or causes of poor project performance. We’re not embracing failure, we’re continuing to invite it in! The situation is compounded further because we’re also not great at imagining risks we’ve not previously encountered before either.
“We’re not embracing failure, we’re continuing to invite it in...we’re also not great at imagining risks we’ve not previously encountered before”
What I’m hopeful of is that Chaos Engineering will help inspire us to think of new ways to address some of these blind spots or at least minimise the impacts when they cause us pain. I’m not calling for a revolutionary new approach to risk management (there are plenty of people doing that already!) but more of an exploration of incremental improvements or opportunities to innovate, building on what we currently know or have at our disposal.
On a positive note we’re actually embracing the principle of failure already when we run quantitative schedule risk analysis (QSRA), which is a Monte Carlo simulation modelling technique. This is where we take the construction schedule, apply uncertainty ranges to activity durations and pin discrete risks to activities to observe how they impact performance, the delivery of key milestones and the project overall.
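As a rough sketch of what a QSRA does under the bonnet, the example below samples each activity duration from a triangular distribution, adds a discrete risk delay when that risk "fires", and reports P50 and P80 completion days. The plan, the three-point estimates, the risk probabilities and all function names are hypothetical assumptions for illustration; commercial tools use richer distributions, correlations and calendars.

```python
import random

# Hypothetical plan: name -> ((optimistic, most likely, pessimistic) days, predecessors)
PLAN = {
    "design": ((18, 20, 30), []),
    "foundations": ((25, 30, 45), ["design"]),
    "procure_steel": ((40, 45, 70), ["design"]),
    "erect_frame": ((22, 25, 35), ["foundations", "procure_steel"]),
}

# Hypothetical discrete risks: activity -> (probability, delay in days if it occurs)
RISKS = {"foundations": (0.3, 15), "procure_steel": (0.2, 20)}


def simulate_once(rng):
    """One Monte Carlo iteration: sample durations and risks, return finish day."""
    finish = {}
    for name, ((low, mode, high), preds) in PLAN.items():
        duration = rng.triangular(low, high, mode)
        prob, delay = RISKS.get(name, (0.0, 0))
        if rng.random() < prob:  # the discrete risk occurs in this iteration
            duration += delay
        start = max((finish[p] for p in preds), default=0.0)
        finish[name] = start + duration
    return max(finish.values())


def run_qsra(iterations=5000, seed=1):
    """Return the sorted completion days across all iterations."""
    rng = random.Random(seed)
    return sorted(simulate_once(rng) for _ in range(iterations))


def percentile(sorted_results, p):
    """Simple percentile lookup on an already sorted list."""
    index = min(len(sorted_results) - 1, int(p / 100 * len(sorted_results)))
    return sorted_results[index]


results = run_qsra()
p50, p80 = percentile(results, 50), percentile(results, 80)
print(f"P50 finish: day {p50:.0f}, P80 finish: day {p80:.0f}")
```

Note that every input here is something we chose to put in, which is exactly why the output can feel predictable: the model can only push the schedule to the right in the places where we told it risk exists.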
The effectiveness of QSRA has been well debated over the years and it’s common knowledge that it can suffer from human biases and a whole host of other factors but I think it nevertheless does offer value by going through the process itself. It’s a good learning opportunity as you engage subject matter experts to test the credibility of identified risks and socialise appreciation of the potential impacts.
“At times using QSRA and embracing this type of failure feels a bit predictable, linear and prescriptive”
This may sound strange given that the methodology is underpinned by probability theory and is all to do with uncertainty, but at times using QSRA and embracing this type of failure feels a bit predictable, linear and prescriptive. Seasoned risk professionals can almost anticipate what the outcome of a QSRA will be before they’ve even run the model. They’re not like Neo from the Matrix but most of it is common sense. Add an identified risk or scenario to the model and it will probably extend activities to the right of the schedule, potentially leading to delay. What about those risks or scenarios which remain hidden or have not been imagined?
There is certainly scope to improve how we could perform risk analysis more effectively than how we’re doing it today and it’s not all about technology. The main point is that it should afford the project sufficient time to make key decisions and take action to mitigate risk and build resilience to deal with events we may not have considered.
A few questions to consider:
Identify real world incidents
“Chaos engineering experiments should hew as closely as possible to what might happen on a normal day instead of creating unlikely situations.” What is Chaos Engineering? | IBM
This is all about developing hypotheses about potential deviations from the “steady state” mentioned above and introducing realistic failure into the system. However, I’m in two minds about this statement.
Chaos Engineering focuses on exploring events on a live system such as network and infrastructure failures, bad code, power issues and traffic overload. Identifying and simulating these “real world” incidents enables you to test the system’s resilience and identify potential technical weaknesses.
The issue I have is with two parts of the above statement, the first being “normal day”. Risk management focuses on anything but a normal day. In fact, to my mind, a normal day is akin to what the baseline construction schedule represents – the plan and expected performance. Some assumptions about risk and uncertainty may be built in to what we consider to be expected and even costed into the budget (known knowns or known unknowns).
As mentioned above, risk quantification is also prone to and influenced by factors such as biases, politics, different agendas and more, which may dictate what ends up in a risk register. However, perhaps there’s also an issue with the way in which we interpret and apply the definition of what a credible risk or scenario is.
The second part of the statement which is quite interesting to me is “instead of creating unlikely situations”. The start of the process on construction projects is the development of a risk register capturing all of the credible risks and scenarios we may foresee occurring on a project. Sure, some of these may be unlikely with a very low probability but we also have a tendency either intentionally or not to shy away from scenarios which might be considered unrealistic or perhaps unpalatable. There may also be genuine blind spots.
Projects also operate in an environment where information can be imperfect especially at the start yet many push forward regardless given the urgency of delivering the outcomes desired. In order to proceed assumptions are made but these assumptions can represent a significant source of uncertainty/risk yet don’t appear in the risk register. I would argue we’re also not great at monitoring these assumptions during the life of a project or recognising their potential impacts if they don't hold true.
To put it another way, just because we haven’t identified any credible risks or scenarios there’s no guarantee that certain activities or scopes of work on our project won’t ever be impacted by some kind of event or by assumptions not holding true. Maybe if we explore and adopt the random element of Chaos Engineering we might overcome some of this? (i.e. the application of injecting failure is partly randomised, which removes an element of selective application).
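A minimal sketch of what randomised "risk injection" could look like is shown below: instead of only delaying activities that have a risk register entry, each experiment picks an activity at random, applies a random delay and records how far the completion date slips. The plan, the delay range and the function names are hypothetical assumptions, and this is a Chaos Monkey style analogy rather than an established risk analysis method.

```python
import random

# Hypothetical plan: name -> (duration in days, predecessors), in dependency order.
PLAN = {
    "design": (20, []),
    "foundations": (30, ["design"]),
    "procure_steel": (45, ["design"]),
    "erect_frame": (25, ["foundations", "procure_steel"]),
    "fit_out": (40, ["erect_frame"]),
    "landscaping": (15, ["foundations"]),
}


def project_finish(plan, delays):
    """Forward pass returning the completion day, with extra delay per activity."""
    finish = {}
    for name, (duration, preds) in plan.items():
        start = max((finish[p] for p in preds), default=0)
        finish[name] = start + duration + delays.get(name, 0)
    return max(finish.values())


def random_risk_experiments(plan, n_experiments=20, max_delay=30, seed=7):
    """Inject one random delay per experiment; return (target, delay, slip) tuples."""
    rng = random.Random(seed)
    baseline = project_finish(plan, {})
    outcomes = []
    for _ in range(n_experiments):
        target = rng.choice(list(plan))     # no selective application
        delay = rng.randint(1, max_delay)   # days of injected delay
        slip = project_finish(plan, {target: delay}) - baseline
        outcomes.append((target, delay, slip))
    return outcomes


outcomes = random_risk_experiments(PLAN)
for target, delay, slip in outcomes:
    print(f"{target}: +{delay} days injected, completion slips {slip} days")
```

Experiments where the slip is zero show where the plan has float to absorb a shock; experiments where the slip equals the injected delay point to activities with no protection, whether or not anyone has raised a risk against them.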
Questions that come to mind are:
Create a game day
"Expect to be surprised, and not just the first time….Systems change over time, and chaos [engineering] game days keep your knowledge of your systems fresh." TechTarget
In the software industry a “game day” is one where a series of failure experiments are pre-planned and applied to the live operational system. A team of people who would normally operate the system are assembled and this could range from operational staff through to business or client facing roles.
They develop the failure scenarios and then “inject the failure” into the system, observe and react to remedy any issues. The game day aims to test systems, processes and responses. The team should perform their roles as if the unexpected event occurred for real. After the event a review is undertaken, lessons learnt and recommendations made. Interventions are applied to the live system to avoid a “real life” repeat.
We clearly don’t do this on live construction projects. Beyond the safety considerations mentioned previously, risk professionals and project teams simply don’t have the bandwidth to entertain this. Perhaps before a project kicks off there’s some kind of scenario analysis and stress testing of forecasts, but this is limited and basic. Then during the life of a project, if you’re lucky, a QSRA might be run at periodic intervals or on an ad hoc basis, but given these are few and far between they lose the value of informing timely decision making.
It feels, to me at least, like we’re missing an opportunity to regularly test our project system’s resilience in a more meaningful and useful way. Methods and approaches already exist which we could easily adopt and adapt. Some of these can leverage expertise to provide both an inside and an outside view, offering friendly or constructive challenge and review.
Questions I have are:
Use automation
“Organisations of all sizes can use chaos engineering by automating experiments, which would be too labour intensive if companies manually conducted them…Experiment design, failure injection and infrastructure provisioning are all aspects of experimentation that organisations can automate.” What is Chaos Engineering? | IBM
In the software industry chaos experiments are actually undertaken in a live operational environment, which demands the use of automation as impactful timescales are minuscule, but the concept can be applied to the relevant granularity of timescales on projects. What I’m proposing is that we leverage existing risk modelling software and/or newer AI technology to run these risk experiments automatically and virtually ahead of time but also during the life of a project to gain useful insights. Currently it’s a very manual process.
The idea here is to use automation to better facilitate running risk experiments continuously as opposed to running them at the start of a project, across lengthy intervals or in an ad hoc manner. This breaks the burden of the energy intensive project controls reporting cycle, which has always felt retrospective and transactional. I’ve also never quite understood why the monthly reporting cycle exists! Perhaps it’s more to do with the time it takes to pull information together than anything more meaningful?
Automation can help here because it could lead to more timely insights to support decision making early enough (proactively) to make an impact. For clarification, I believe human intervention is still required to make sense of the insights and help translate this for stakeholders but the labour intensive activities can be minimised to free up more time to do the value adding work.
This continuous approach also enables you to capture data and feedback on the project system’s behaviour over time which can be used to refine and improve those risk experiments.
With recent developments in AI risk software, and one software vendor in particular (nPlan) even having access to hundreds of thousands of past construction schedules, this may help offer insights from the experiences of other projects that we may not have personally observed before (in our working lives). The ChatGPT element to this also offers access to quicker insights or recommendations, which of course need to be scrutinised carefully but give us a head start.
However, tools like this are primarily driven by the data they’re trained on, so there are potentially more blind spots yet to uncover and I’m uncertain whether they push the extremes out beyond this. Also, not everyone is fortunate enough to have access to these tools, so in the meantime could we do more with what we already have?
Questions which come to mind are:
Risk Army
In more recent times we've seen the emergence of AI Assistants and their potential future use cases. What if we could develop a "Risk Army" equivalent to the Simian Army but to help manage risk and build more resilient projects? I envisage this as a combination of automation overseen by humans, who are freed up to offer more value adding contributions.
I think it's only a matter of time because traditional software developers no longer have a "stranglehold" on developing tools; it's now happening en masse, not only among industry professionals but among the public around the world. I feel the risk software landscape is being disrupted so existing vendors will need to up their game if they're going to offer value.
I also think use cases should focus less on prediction and more on producing insights to help build more resilient projects and facilitate better use of scenarios and stress testing. We can't identify all risks but we can think more deeply about understanding the nature of systemic risks, their characteristics and apply these principles across a project to test resilience early and continue to iterate and feedback.
AI offers an opportunity to automate much and getting the Risk Army to continually do this and test potential blind spots feels like a worthwhile endeavour.
Be mindful of the blast radius
“This principle emphasizes the importance of minimizing the impact of chaos experiments on the production environment and end-users. In other words, you should ensure that the experiments are isolated and do not impact any critical systems or services.” How Netflix embraced Chaos. As distributed systems have become more… | by Haasita Pinnepu | Medium
Okay, this quote again demonstrates running chaos experiments in a live operational environment which I’m not advocating. The “blast radius” refers to the consequential impacts of the failure being injected into the system (i.e. what goes wrong, how does it manifest, what other activities does it impact?).
Chaos Engineering seeks to minimise the “blast radius” by:
The one thing in risk experiments we don’t want to do is to limit the blast radius as that defeats the object of what we’re trying to understand, assess and measure. There is a point after our risk experiments have been executed where we do in fact want to limit the blast radius but that’s when we’ve revealed the potential weaknesses in the project we’ll seek to eliminate or strengthen.
It’s not that we’re trying to maximise the blast radius but when we inject failure or risk into the project system we want to understand all of the possible consequential impacts. On that basis I’ll try to flip the meaning about a bit to generate some relevant application to construction projects. I’ll cover off each of the above points in turn.
On construction projects we could consider targeting a subset of activities or tasks to test by applying risks to them. We already do this in QSRAs but we could go further such as simulating the failure of controls or parts of the plan which are based on a set of assumptions holding true. Often these are excluded from risk analyses but their impacts could be significant. Some of these are difficult or uncomfortable for project teams to think about so we need to overcome this. Perhaps illustrating the potential impacts might be sufficient to convince others to take action to protect against adverse impacts.
It makes sense that specific experiments are time or event bounded. What I mean by this is that if the risk experiment fails to induce any kind of impact it makes sense to terminate it and move on to the next one. However, one thing to note is that just because the experiment failed to produce an impact on this occasion doesn’t mean it won’t the next time the project schedule has been updated! Worth keeping in mind.
Sometimes risks are like buses. You don’t see any for ages but then they all arrive together at the same time. So, injecting multiple risks at “peak traffic” project times might be worth considering too. How you define peak traffic on a project is up for grabs but there should be some rationale to it. Perhaps picking points when many activities are predecessors and converge? Or perhaps even shifting this around the schedule to see how sensitive the project is to them?
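One possible way to give "peak traffic" a rationale is to look for merge points, meaning activities with several predecessors, and then delay all of those predecessors at once. The sketch below does this on a small plan and compares the result with a single delay in isolation. The plan, the threshold of two predecessors and the function names are hypothetical assumptions used only to illustrate the idea.

```python
# Hypothetical plan: name -> (duration in days, predecessors), in dependency order.
PLAN = {
    "design": (20, []),
    "foundations": (30, ["design"]),
    "procure_steel": (45, ["design"]),
    "erect_frame": (25, ["foundations", "procure_steel"]),
    "services": (35, ["erect_frame"]),
    "fit_out": (40, ["erect_frame"]),
    "handover": (10, ["services", "fit_out"]),
}


def project_finish(plan, delays):
    """Forward pass returning the completion day, with extra delay per activity."""
    finish = {}
    for name, (duration, preds) in plan.items():
        start = max((finish[p] for p in preds), default=0)
        finish[name] = start + duration + delays.get(name, 0)
    return max(finish.values())


def merge_points(plan, min_preds=2):
    """Activities where several paths converge (candidate 'peak traffic' points)."""
    return [name for name, (_, preds) in plan.items() if len(preds) >= min_preds]


def stress_merge_point(plan, target, delay_each):
    """Delay every predecessor of `target` at once and return the completion slip."""
    _, preds = plan[target]
    baseline = project_finish(plan, {})
    stressed = project_finish(plan, {p: delay_each for p in preds})
    return stressed - baseline


baseline = project_finish(PLAN, {})
single_slip = project_finish(PLAN, {"foundations": 10}) - baseline
for point in merge_points(PLAN):
    slip = stress_merge_point(PLAN, point, delay_each=10)
    print(f"All predecessors of {point} delayed 10 days: completion slips {slip} days")
print(f"Only foundations delayed 10 days: completion slips {single_slip} days")
```

In this toy plan a 10 day delay to foundations alone is absorbed by float, whereas the same delay arriving on every path into a merge point passes straight through to completion, which is the "buses arriving together" effect.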
On software projects, the argument against running Chaos experiments in a development environment (like we do when we perform Monte Carlo simulation) is that the conditions will differ from the live operational environment, potentially leading to a false picture of what might happen in reality. On construction projects we are reliant on using models of reality (plans/schedules) so all we can do is ensure that the inputs and logic are as accurate or realistic as possible.
Given resource constraints, it's not feasible to experiment on every component of the project system so this comes back to my earlier point about defining and agreeing upfront what is important to measure when we observe changes caused by our risk experiments (i.e. whatever metrics are appropriate).
Closing Remarks
If you’ve made it this far and followed this article series hopefully you’ll have seen there are many similarities or parallels between Chaos Engineering and Risk Management. There are opportunities to learn and apply some of the key principles which to a certain degree are universal.
Embracing failure in this way has benefits which include being able to identify hidden dependencies, developing more of a scenario based approach to testing for resilience, helping projects to prepare better to deal with unplanned change and challenges and leveraging automation to free up more of our time so we can perform more value adding work.
I’ve personally been inspired to adopt more of a continuous and experimental approach to risk management and the writing process has helped me to capture more questions which triggered off many innovative and creative ideas for me to explore too which I’m really excited about!
There's plenty of research out there on how we might overcome some of our blind spots but I think there's some potential with the random injection of failure approach Chaos Engineering adopts. This is something I'm keen to explore further too. Automation via the "Risk Army" of AI Assistants is something that really excites me too.
I was initially keen to learn more technical skills such as Python as I felt I was behind the tech curve, but I've come to realise it's actually our ability to develop and interact with these AI Assistants which will be key, so I'm doubling down on playing with tools like ChatGPT and Copilot.
Hope you found it interesting, inspiring or useful in some way. As always, I welcome and invite any comments. I've already got a more "entertaining" idea for my next article so watch this space...
Software Engineer @ Rightcharge | Python, JavaScript, Node.js, React.js, AWS | Innovating EV Charging
9 months ago: Deepak, your post got me thinking about how chaos engineering could apply beyond just tech. Imagine applying these principles to other areas like business processes or even personal development strategies! It's all about embracing uncertainty and building resilience. Thanks for sparking these thoughts!
Risk Director at HKA | Infrastructure & Capital Projects Advisory | International
9 months ago: As a follow up, how resilient are we today and might be tomorrow? With increasing reliance on AI/cloud based technology I wonder if we've really given this careful consideration in the construction industry. Outages appear to be the norm. What if we experience 1 hour, a few hours, a day, a few days? Credible? Realistic? Plan B? Food for thought, check out some headlines below:
Microsoft Copilot fixed worldwide after 24 hour outage
Microsoft Teams hit by second outage in three days
AT&T network outage draws government discussions
Barclays bank payments restored after app went down in outage
Sainsbury’s and Tesco resolve technical issues that disrupted deliveries
McDonald's blames global outage on third party
Facebook outage: what went wrong and why did it take so long to fix after social platform went down?
NatWest banking app is back online following a three-hour outage that left thousands of frustrated customers across the UK without access to funds
Cloud providers suffered nearly 500 critical outages in 2022
TSB banking app suffered major outage as customers couldn't log into accounts
Three apologises after network outages affect 10,000 customers across UK
Head of Commercial at Nodes & Links | Driving revenue growth
9 months ago: Deepak Mistry. This series has been insightful. Thank you for sharing