Chaos Engineering on Guidewire Cloud: An approach for P&C carriers to consider

Introduction

P&C Insurance is going through a tectonic shift. Cloud is changing how applications are managed and how products are rolled out in the dynamic world of P&C Insurance. With Guidewire at the heart of digital transformation in P&C Insurance, Guidewire’s pivot to cloud also requires carriers to proactively put guardrails around performance, resiliency, and security. Carriers need to take a forward-looking view of the threats to cloud stability: geo-political events and climate hazards can disrupt availability zones, and attackers are constantly looking for ways to intrude into IT systems. Using tools like Gatling for performance testing on Guidewire platforms, or setting up security test cases as part of Continuous Integration/Continuous Delivery (CI/CD) pipelines, is a great first step towards instilling the importance of performance and security. However, carriers need to think further and smarter when it comes to ensuring resiliency and security, and thereby building reliable business operations. Chaos Engineering is a great approach to consider for fortifying a P&C business operating on Guidewire Cloud. This article presents a perspective on how Chaos Engineering can be introduced in a thoughtful way to support P&C carriers’ strategic vision of growing their business on Guidewire Cloud.


Chaos Engineering: from Netflix to the world of Insurance


When engineers at Netflix came up with the concept of Chaos Engineering, it was as radical as it was necessary, considering the nature of the business in question. Financial services, including Insurance, cater to a different market and operate very differently from Netflix. However, with P&C Insurance moving progressively to public cloud, it becomes critical to put up guardrails against outages and to secure protected data on cloud. Chaos Engineering can therefore become an important tool to address these factors and to support compliance with regulations relevant to the Insurance industry.

There are subtle differences, though, in how Chaos Engineering is interpreted in the context of P&C Insurance as opposed to Netflix. Chaos Engineering was initially promoted as a disruptive measure: check whether breaking one part of a process brings the entire process down, and use the findings to build in more fault tolerance. With the P&C Insurance industry moving significantly to cloud, and thereby embracing microservices and increasingly distributed architectures, failures become both more likely and harder to predict.

Outages are costly. Some figures worth looking at: per Gartner, IT downtime can cost up to $540,000 per hour; larger businesses can suffer losses of up to $9,000 per minute, and P&C Insurance, as a subset of the Finance industry, qualifies as a high-risk industry liable to suffer losses of that magnitude. Traditional industries are catching up with the value of Chaos Engineering; for example, National Australia Bank deployed Chaos Monkey to strengthen the resiliency of both its IT systems and its staff. P&C carriers on Guidewire Cloud can therefore deploy Chaos Engineering to address multiple mandates of cloud migration:

· Resiliency: Build resiliency into the applications and mitigate system entropy.

· Data Security: Continuous monitoring across attack areas to ensure financial data on cloud is secure.

· Cost efficiency: Effective design of policies to mitigate the cost of cloud sprawl.

· Conformance: With Guidewire Cloud enabling more rapid rollout of P&C products in an Agile/DevOps model, there is a greater need for multiple teams to conform to consistent operating principles on cloud platforms.

· Observability: Ability to meaningfully leverage insights built from behavior intrinsic to the platform.


A thoughtful approach to deploying Chaos Engineering is needed in order to attend to these five key areas. In the words of Nora Jones, one of the pioneers of Chaos Engineering at Netflix: “Chaos Engineering is about building a culture of resilience in the presence of unexpected system outcomes. If tools can help meet this goal, then that is excellent, but they are only one possible means to an end.”


Chaos Engineering for safeguarding P&C Insurance business on Guidewire Cloud: An approach to consider


The implementation approach/framework to consider consists of:


A. Chaos Engineering strategy

B. Compatibility with Guidewire Cloud

C. Steps to implement

D. Measure Outcomes and Refine the Chaos Engineering framework


Let us dive into each of these.


A. Chaos Engineering strategy

Chaos Engineering started with the introduction of Chaos Monkey. Put simply, this was a script purposely deployed to shut server instances down at random, thereby introducing chaos into the system. The objective was to push developers to guard against uncertainty by building more fault-tolerant applications. Subsequently, the Simian Army was introduced as a suite of failure-inducing capabilities beyond the ambit of Chaos Monkey. The idea behind the Simian Army was to cover failure points other than random instance termination, e.g. blocking network traffic, EBS volume detachment, CPU and IO burning (heavy utilization), disk filling, process killing etc. While most of the original Simian Army tools are now retired (e.g. Chaos Gorilla, Chaos Kong, Latency Monkey), some are still active in a repurposed form: Janitor Monkey (now Swabbie), used to identify and dispose of unused cloud resources; Conformity Monkey (now part of Spinnaker), used to identify instances that do not conform to pre-determined rules and shut them down; and Security Monkey (now spun off into an open-source project), used to locate security vulnerabilities. The key objective of a sound Chaos Engineering strategy for ensuring the stability of P&C Insurance business on Guidewire Cloud should be to think less about the tools and more about building a proactive culture of resilience and security that guards against unknowns.
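
To make the Chaos Monkey idea concrete, here is a minimal sketch of random instance termination against an in-memory fleet. It is not Netflix's implementation: the Instance class, the group names, the opt-out list and the per-group probability are all assumptions made for illustration, and no cloud API is called.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Stand-in for a server instance (hypothetical, not a real cloud resource)."""
    instance_id: str
    group: str
    running: bool = True


def pick_victims(fleet, probability, rng, opted_out_groups=()):
    """Pick at most one running instance per group, each group with the given probability."""
    by_group = {}
    for inst in fleet:
        if inst.running and inst.group not in opted_out_groups:
            by_group.setdefault(inst.group, []).append(inst)
    victims = []
    for group in sorted(by_group):  # sorted for a reproducible order
        if rng.random() < probability:
            victims.append(rng.choice(by_group[group]))
    return victims


def unleash(fleet, probability, rng, opted_out_groups=()):
    """Mark the chosen instances as stopped and return their ids."""
    victims = pick_victims(fleet, probability, rng, opted_out_groups)
    for inst in victims:
        inst.running = False  # a real tool would call the cloud provider here
    return [inst.instance_id for inst in victims]


if __name__ == "__main__":
    demo_fleet = [
        Instance("i-claims-1", "claims"),
        Instance("i-claims-2", "claims"),
        Instance("i-policy-1", "policy"),
        Instance("i-billing-1", "billing"),
    ]
    # Billing is opted out, so it is never touched in this run.
    print(unleash(demo_fleet, 0.5, random.Random(), opted_out_groups=("billing",)))
```

The opt-out list reflects the point made above: the script matters less than the agreement between teams about what may be broken, when, and why.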


B. Compatibility with Guidewire Cloud

Guidewire Cloud is powered by AWS. The strategic partnership is aimed at giving P&C carriers the ability to launch products faster and serve customers uniquely, while facilitating easier customization and configuration. The model provides isolated, single-tenant core systems surrounded by multi-tenant cloud-native services, enabling carriers to launch cloud-native products across the breadth and depth of P&C business. Guidewire Cloud Platform (GWCP) was designed to keep Guidewire’s InsuranceSuite applications at the core and build containerized workloads around them, while unifying services and capabilities through cloud-native libraries. Architecturally, GWCP is built as a PaaS layer (called ATMOS) on top of AWS infrastructure, and the SaaS product build-out sits on top of the PaaS layer. While this PaaS-within-SaaS model enables standardization, orchestration and flexibility in product configuration, the Chaos Engineering strategy needs to align very closely with the model to be successful.


C. Steps to implement

Under the shared responsibility model, AWS ensures the resiliency of the cloud infrastructure, including the hardware, availability zones, edge locations etc. P&C carriers on Guidewire Cloud have to ensure resiliency in the cloud, which covers application resiliency, operational policies for cost monitoring and optimization, observability for failure management, conformance through continuous testing of infrastructure as part of continuous integration and continuous deployment, and data security. Chaos Engineering provides a structured approach to address these by going above and beyond traditional Disaster Recovery principles. Here is a suggested set of steps to consider for setting up Chaos Engineering on the Guidewire Cloud platform:


1. Design Hypotheses:

Focus on disruptive scenarios that could impact critical business processes and customer experience. Consider conducting Chaos experiments on Guidewire Cloud to:

· Measure the impact on claims processing time by simulating ClaimCenter system failures or performance degradation use cases.

· Measure the impact on policy creation/change/renewal by simulating unplanned PolicyCenter downtime.

· Measure the impact on quote/bind/rating processes by impacting the provisioning of data for underwriters.
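
A hypothesis of this kind can be written down as data before any fault is injected, so that the pass/fail criterion is agreed in advance. The sketch below is one possible shape; the field names, the 120-second baseline and the 25% tolerance are hypothetical values chosen for illustration and do not come from any Guidewire product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A steady-state hypothesis for a single chaos experiment (illustrative only)."""
    name: str
    fault: str                  # what will be broken
    metric: str                 # the business metric that is watched
    baseline: float             # steady-state value measured before the experiment
    max_degradation_pct: float  # how much worse the metric may get

    def threshold(self) -> float:
        """Largest metric value that still counts as 'steady state held'."""
        return self.baseline * (1 + self.max_degradation_pct / 100)

    def holds(self, observed: float) -> bool:
        """True if the observed value stays within the agreed tolerance."""
        return observed <= self.threshold()


if __name__ == "__main__":
    claims = Hypothesis(
        name="ClaimCenter degradation",
        fault="performance degradation of one ClaimCenter node",
        metric="claims processing time (seconds)",
        baseline=120.0,            # hypothetical baseline
        max_degradation_pct=25.0,  # hypothetical tolerance
    )
    print(claims.threshold(), claims.holds(140.0), claims.holds(160.0))
```

Writing the tolerance down up front keeps the experiment honest: the result is judged against a number agreed with the business, not against whatever the dashboard happens to show afterwards.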


2. Technical approach:

Establish a framework that uses the best of Guidewire and open-source tooling. Keep the objective in focus and set up the architecture that best meets it.


· Outline the key criteria to evaluate the technical RoI against:

    o Resiliency: Pay attention to the PaaS-within-SaaS model. Focus on the individual microservices behind the critical transactions. Also simulate critical failures as well as steep spikes in traffic, e.g. higher-than-normal demand on PolicyCenter and ClaimCenter screens caused by seasonal hazards.

    o Data security: Design the experiments for specific threats such as DDoS.

    o Observability: GWCP enables all three key facets of observability: log collection, metrics collection and analysis, and tracing. GWCP also enables persona-based logging and log security. From the Garmisch release onwards, Guidewire has significantly expanded self-service for customizing observability and monitoring of GWCP applications. P&C carriers on Guidewire Cloud need to consider these GWCP features as a key component of the design.

· Establish the tools framework:

    o With Guidewire Cloud being powered by AWS, the framework needs to be set up on AWS.

        § AWS FIS (Fault Injection Service, formerly Fault Injection Simulator) is a fully managed AWS service that helps simulate real-world infrastructure faults.

        § FIS can induce faults of varying types:

            · Forcing undersized EC2 instances into failover

            · Node failures

            · High latency

            · Over-stressing of CPU/memory

            · RDS database instance failure

        § FIS is also integrated with Amazon CloudWatch alarms; this integration can be used to set up guardrails and to enable cost monitoring.

        § FIS is adding newer scenarios like ‘AZ Availability: Power Interruption’ and ‘Cross-Region: Connectivity’ to diversify fault situations.

        § AWS Lambda can be used to initiate FIS experiments.

    o Solutions/tools compatible with AWS can also be introduced into the eco-system to make it more resilient.

        § Chaos Toolkit, an open-source solution, can help customize the Chaos Engineering framework on AWS through the eco-system of extensions it offers.

        § Chaos Mesh, another open-source solution, can help orchestrate many fault scenarios at the Kubernetes level, which will further fortify the PaaS layer of GWCP.

        § Gremlin offers a Reliability Management Platform that can help identify and fix reliability risks at scale on AWS. Gremlin also offers Failure Flags that can run reliability tests on AWS Lambda.

        § Litmus/Harness offer libraries of faults that can be considered for strengthening the Chaos Engineering strategy for AWS and, by extension, GWCP.

    o Chaos Engineering architects at the P&C carriers need to lay out a robust architecture that combines the AWS FIS offering with complementary open-source solutions/tools, in order to address the PaaS-within-SaaS architecture of GWCP.
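
As an illustration of the FIS points above, the sketch below builds an experiment template that stops one tagged EC2 instance, uses a CloudWatch alarm as a stop condition (the guardrail), and starts the experiment from an AWS Lambda handler. It assumes an AWS account in which the carrier is permitted to run FIS experiments. The role ARN, alarm ARN, tag names and the templateId event field are placeholders, and the FIS client is passed in so that the logic can be exercised without AWS access.

```python
import uuid


def build_experiment_template(role_arn, alarm_arn, tag_key, tag_value):
    """Return keyword arguments for the FIS create_experiment_template call."""
    return {
        "clientToken": str(uuid.uuid4()),
        "description": "Stop one tagged EC2 instance for five minutes",
        "roleArn": role_arn,  # IAM role that FIS assumes (placeholder)
        # Guardrail: FIS halts the experiment if this CloudWatch alarm fires.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "targets": {
            "oneTaggedInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # keep the blast radius to one instance
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneTaggedInstance"},
            }
        },
        "tags": {"purpose": "chaos-experiment"},
    }


def start_experiment(fis_client, template_id):
    """Start an experiment from an existing template and return the experiment id."""
    response = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )
    return response["experiment"]["id"]


def lambda_handler(event, context):
    """AWS Lambda entry point; 'templateId' in the event is an assumed input field."""
    import boto3  # imported here so the rest of the module runs without boto3

    fis_client = boto3.client("fis")
    return {"experimentId": start_experiment(fis_client, event["templateId"])}
```

The template itself would be registered once with fis_client.create_experiment_template(**build_experiment_template(...)). Passing the client in as a parameter is a design choice for testability and is not something FIS requires.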


3. Identify Metrics Monitoring framework

Approach the objective at two layers: P&C business metrics and the Guidewire technical stack.

· Sample business metrics to monitor after faults are introduced. These can be tracked with the dashboard features offered by Guidewire Cloud releases starting with Garmisch.

    o Claims processing times

    o Policy creation/update/renewal times

    o Underwriters’ throughput

· Sample technical metrics aligning with the GWCP stack.

    o Guidewire’s Response Time Analysis Toolkit includes:

        § Response Time Monitor, which can be used to determine fault-inducing parameters.

        § Swing UI Profiler, which breaks overall response time down into components such as Rules Engine, Web UI render, database etc.

        § jHiccup, which enables analysis at the JVM level.

    o Other standard Chaos Engineering metrics offered by AWS, as well as by open-source and third-party solutions such as Gremlin, can also be considered to supplement the GWCP-specific metrics.
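
Whichever tool collects the numbers, the comparison itself is simple: measure a metric before the fault and during the fault, and compare a high percentile. The sketch below does this for response-time samples using only the Python standard library. The function names and the tolerance parameter are assumptions for illustration and are not part of any Guidewire or AWS toolkit.

```python
import statistics


def p95(samples):
    """95th percentile of a list of samples (needs at least two samples)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(samples, n=20)[-1]


def degradation_pct(baseline, during_fault):
    """Percentage change of the p95 during the fault, relative to the baseline p95."""
    base = p95(baseline)
    return (p95(during_fault) - base) / base * 100


def within_tolerance(baseline, during_fault, max_degradation_pct):
    """True if the p95 got worse by no more than the agreed tolerance."""
    return degradation_pct(baseline, during_fault) <= max_degradation_pct


if __name__ == "__main__":
    # Hypothetical response times in milliseconds.
    before = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
    during = [130, 150, 155, 170, 190, 200, 230, 250, 260, 300]
    print(round(degradation_pct(before, during), 1), within_tolerance(before, during, 25.0))
```

A percentile is used instead of the mean because chaos experiments tend to hurt the slowest requests first, and an average can hide that.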


D. Measure Outcomes and Refine the Chaos Engineering framework

· Identify the blast radius and continue to refine as scale/complexity increases.

· Start with simpler experiments such as shutting down non-critical services during off-peak hours.

· Increase complexity in a gradual manner, both in terms of breadth of layers impacted and the intensity of fault induced.

· Monitor and analyze metrics.

· Iterate and leverage the insights to fortify soft spots.

· Collaborate across stakeholder teams and establish governance.
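
The gradual escalation described above can be sketched as a staged plan in which a wider blast radius is attempted only after the narrower stage has passed. The Stage fields and the run_stage callback are hypothetical; in practice the callback would launch an experiment and evaluate the monitored metrics.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Stage:
    """One step of a chaos rollout plan (illustrative only)."""
    name: str
    blast_radius_pct: int  # share of targets exposed to the fault
    intensity: str         # free-text label such as "low", "medium" or "high"


def run_plan(
    stages: List[Stage],
    run_stage: Callable[[Stage], bool],
) -> Tuple[List[str], Optional[Stage]]:
    """Run stages from smallest to largest blast radius; stop at the first failure.

    Returns the names of the stages that passed and the stage that failed (or None).
    """
    completed: List[str] = []
    for stage in sorted(stages, key=lambda s: s.blast_radius_pct):
        if not run_stage(stage):
            return completed, stage  # fix the soft spot before widening further
        completed.append(stage.name)
    return completed, None


if __name__ == "__main__":
    plan = [
        Stage("one availability zone", 50, "high"),
        Stage("non-critical service, off-peak", 5, "low"),
        Stage("single microservice", 20, "medium"),
    ]
    # Stand-in outcome: pretend every stage under 50% passes.
    passed, failed = run_plan(plan, lambda s: s.blast_radius_pct < 50)
    print(passed, failed.name if failed else None)
```

Stopping at the first failing stage mirrors the iterate-and-fortify loop: the failed stage becomes the next item for the responsible team, and the plan resumes from there.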


Conclusion

P&C Insurance is at an inflection point and needs to fortify its existing business operations even as the industry looks to grow aggressively. Resiliency has to be a mandatory aspect of a secure P&C business on Guidewire Cloud. Chaos Engineering is a discipline that P&C carriers need to embrace as they move to Guidewire Cloud, so that they can grow their business using the power and flexibility of the platform while keeping that business secure from unforeseen hazards and ever-increasing security risks.

A robust and thoughtful Chaos Engineering framework for Guidewire Cloud needs to leverage the best of GWCP and AWS, while considering complementary open-source options. The framework needs to be well planned and built on a continuous feedback loop, and its ultimate objective should be to keep P&C business metrics healthy.








Articles/posts referred to:

· https://www.geektime.com/how-much-does-it-downtime-cost/

· https://www.pingdom.com/outages/average-cost-of-downtime-per-industry/

· https://www.itnews.com.au/news/nab-deploys-chaos-monkey-to-kill-servers-24-7-382285

· https://www.guidewire.com/sites/default/files/media/pdfs/Guidewire_Cloud_data_sheet_en.pdf

· https://medium.com/guidewire-engineering-blog/guidewire-cloud-why-hybrid-tenancy-is-the-right-choice-56a0ff176032

· https://www.techtarget.com/searchcloudcomputing/definition/cloud-sprawl#:~:text=Cloud%20sprawl%20is%20the%20uncontrolled,over%20its%20cloud%20computing%20resources

· https://medium.com/guidewire-engineering-blog/log-management-and-guidewire-cloud-platform-observability-73a033a34e9a

· https://medium.com/guidewire-engineering-blog/guidewire-cloud-why-hybrid-tenancy-is-the-right-choice-part-2-of-2-ba22c9888bb8

· https://marketplace.guidewire.com/s/product/response-time-analysis-tool-for-insurancesuite-100x/01t3n00000GfL6AAAV?language=en_US

· https://www.guidewire.com/blog/technology/expanding-your-companys-cloud-capabilities-with-garmisch/

· https://documentation.solarwinds.com/en/success_center/observability/content/configure/services/java/guidewire-support.htm

· https://aws.amazon.com/blogs/architecture/chaos-engineering-in-the-cloud/

· https://aws.amazon.com/blogs/mt/chaos-engineering-leveraging-aws-fault-injection-simulator-in-a-multi-account-aws-environment/

· https://aws.amazon.com/blogs/apn/improving-system-resilience-and-observability-chaos-engineering-with-aws-fis-and-aws-dlt/#:~:text=You%20can%20monitor%20performance%20metrics,be%20affected%20during%20chaos%20testing

· https://aws.amazon.com/blogs/architecture/chaos-engineering-in-the-cloud/#:~:text=As%20Chaos%20Engineering%20should%20provide,be%20injected%20to%20your%20workload

· https://aws.amazon.com/about-aws/whats-new/2023/11/aws-fault-injection-service-two-requested-scenarios/

· https://docs.aws.amazon.com/fis/latest/userguide/what-is.html

· https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/resiliency-and-the-components-of-reliability.html

· https://aws.amazon.com/blogs/devops/chaos-experiments-on-amazon-rds-using-aws-fault-injection-simulator/

· https://www.gremlin.com/community/tutorials/chaos-engineering-tools-comparison/

· https://www.gremlin.com/aws/

· https://chaos-mesh.org/

· https://chaostoolkit.org/





