Game Day Strategy Implementation (by AWS)
Sherwin Arjnan
Leadership | Strategy | Business Development | Digital Marketing | HR | Believer | Songwriter
A game day simulates a failure or event to test systems, processes, and team responses. The purpose is to perform the actions the team would perform if an exceptional event actually happened.
It covers the areas of:
· Operations,
· Security,
· Reliability,
· Performance, and
· Cost.
Operations
The ability to run and monitor systems to deliver business value, and to continually improve supporting processes and procedures.
Key aspects
Organization – the organization's structure and priorities.
Preparedness – the current design of the systems, people, and processes needed and available to perform important functions.
Operability – How would workloads be managed? What is our workload health? How prepared are we to respond to events?
Evolve – What have we learned, and how will we improve?
Power failures – backup energy and backup systems, e.g. cloud.
Security
Protection of information, systems, and assets. Our current risk assessments and mitigation strategies.
- Access management – who has access, who will be given access, and at what level. The principle of “least access”.
- Access management policy – Multi-Factor Authentication (MFA) and sign-in mechanisms (e.g. rules on commonly used passwords).
What types of security are required, and what is in place?
- What are our layers of security?
- Detection – security events: what was detected, what needs to be detected, and how will we do it? The use of automated alert systems.
- Infrastructure protection – current system protection from breaches, e.g. cyber-attacks. Use an appropriate Content Delivery Network and network firewalls. Distribute layers into subnets (smaller networks).
Establish: trust boundaries, system security configuration, operating system, policy enforcement.
- Data protection – data breaches, data backup, and data classification (the level of security required).
Create dashboards to view data instead of giving access to the main database. Data encryption (at rest and in transit).
- Incident response – who, what, the timeframe, and access requirements.
- Information storage and backup for recovery, e.g. on-site or cloud storage.
Reliability
The current state of how we recover from failure and how we would measure it. Systems/organization health.
Information storage and backup for recovery.
Recover from infrastructure or service disruptions – implementation of Service-Oriented Architecture and microservices.
Network bandwidth – current usage of network bandwidth; one or multiple ISPs; reliability of the ISP.
Dynamically acquire resources on demand.
Scaling operations to match demand. People processes should be streamlined, and systems should be automatic or as close to automatic as possible.
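As a rough illustration of matching resources to demand, the sketch below uses a simple proportional rule: grow or shrink capacity by the ratio of observed load to the load we want each unit to carry. The function name, parameters, and limits are assumptions made up for this example and do not come from AWS.

```python
import math


def desired_capacity(current_units, observed_load, target_load,
                     min_units=1, max_units=10):
    """Return how many units to run so load per unit moves toward target_load.

    Hypothetical helper: observed_load and target_load are per-unit averages
    in the same unit of measure (for example, percent CPU).
    """
    if current_units < 1 or target_load <= 0:
        raise ValueError("current_units must be >= 1 and target_load > 0")
    raw = math.ceil(current_units * observed_load / target_load)
    # Clamp to the allowed range so a metric spike cannot scale without limit.
    return max(min_units, min(max_units, raw))


# 4 units running at 90% against a 60% target -> scale out to 6 units.
print(desired_capacity(4, observed_load=90, target_load=60))  # 6
```

A game day can exercise this kind of rule by generating artificial load and checking that capacity follows it in both directions.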
Mitigate disruptions
Change management – monitoring metrics of people (employee satisfaction surveys, 360s) and of IT systems via monitoring dashboards. How well do people respond to agile movements? People resources required and their duration (external).
Failure management – conduct simulations (tests) in the current environment (failure drills), with possible live environments (cloud-based or on other sites). Distribution of workloads to other geographic zones. Backup and disaster recovery (DR strategy).
Common KPIs
- Recovery Time Objective (RTO)
- Recovery Point Objective (RPO)
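To make these two KPIs concrete, here is a minimal sketch of how one failure drill could be scored against them. Downtime is compared with the RTO, and the gap between the last backup and the failure is compared with the RPO. The function name and the timestamps are illustrative assumptions, not real measurements.

```python
from datetime import datetime, timedelta


def evaluate_drill(failure_at, last_backup_at, restored_at, rto, rpo):
    """Compare one drill's measured recovery against its RTO and RPO targets."""
    downtime = restored_at - failure_at             # how long the service was down
    data_loss_window = failure_at - last_backup_at  # changes since the last backup are lost
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical drill: failure at 09:00, last backup at 08:30, restored at 10:15.
result = evaluate_drill(
    failure_at=datetime(2024, 1, 10, 9, 0),
    last_backup_at=datetime(2024, 1, 10, 8, 30),
    restored_at=datetime(2024, 1, 10, 10, 15),
    rto=timedelta(hours=1),
    rpo=timedelta(hours=1),
)
print(result["rto_met"], result["rpo_met"])  # False True
```

In this made-up drill the team lost only 30 minutes of data (RPO met) but took 75 minutes to restore service (RTO missed), which is the kind of finding a game day is meant to surface.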
Make responses/actions idempotent, so that repeating an action has the same effect as performing it once.
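A minimal sketch of an idempotent recovery action, assuming each request carries an ID: if the same request is retried during an incident, the action is not applied a second time. The class, method, and region names are hypothetical, and state is kept in memory only for illustration.

```python
class FailoverActions:
    """Hypothetical runbook actions; a real system would persist this state."""

    def __init__(self):
        self.active_region = "primary"
        self.completed = set()   # IDs of requests that were already applied
        self.switch_count = 0    # how many times traffic was actually moved

    def fail_over(self, request_id, target_region):
        # A repeated request with the same ID is a no-op.
        if request_id in self.completed:
            return self.active_region
        # Only switch if we are not already in the target region.
        if self.active_region != target_region:
            self.active_region = target_region
            self.switch_count += 1
        self.completed.add(request_id)
        return self.active_region


actions = FailoverActions()
actions.fail_over("incident-42", "secondary")
actions.fail_over("incident-42", "secondary")  # retry after a timeout
print(actions.active_region, actions.switch_count)  # secondary 1
```

Because retries are safe, responders and automation can repeat a step when they are unsure whether it completed.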
Design a playbook
Use chaos engineering (create failures).
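One small-scale way to create failures on purpose is to wrap a dependency so that a chosen share of calls fail, then check that the caller falls back gracefully. The sketch below is an assumption-laden illustration: the wrapper, the fallback helper, and the price lookup functions are all invented for this example and are not part of any AWS tool.

```python
import random


class FlakyDependency:
    """Wraps a callable and raises an injected error for a share of calls."""

    def __init__(self, func, failure_rate, seed=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so a drill can be repeated

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure (game day drill)")
        return self.func(*args, **kwargs)


def call_with_fallback(primary, fallback, *args):
    """Try the primary dependency; use the fallback if it is unreachable."""
    try:
        return primary(*args)
    except ConnectionError:
        return fallback(*args)


def lookup_price(item):   # hypothetical live service
    return {"source": "live", "item": item}


def cached_price(item):   # hypothetical degraded-mode answer
    return {"source": "cache", "item": item}


# Fail every call to confirm that the fallback path actually works.
broken = FlakyDependency(lookup_price, failure_rate=1.0)
print(call_with_fallback(broken, cached_price, "widget")["source"])  # cache
```

Starting with a failure rate of 1.0 in a test environment checks the fallback path end to end; lower rates can then mimic intermittent faults.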
Performance
Are our people, systems (IT), and processes utilized efficiently?
Cost Optimization
Who is monitoring our costs? It can be the Finance Department, but they cannot advise “how” to reduce costs.
Develop a Cost Optimization plan.
Source: Amazon Web Services