Game Day Strategy Implementation (by AWS)


A game day simulates a failure or event to test systems, processes, and team responses. The purpose is to practice the actions the team would take if an exceptional event actually happened.

It covers the areas of:

· Operations,

· Security,

· Reliability,

· Performance, and

· Cost.

Operations

The ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures.

Key aspects

Organization – the organization's structure and priorities.

Preparedness – the current design of the systems, people, and processes that are needed and available to perform important functions.

Operability – How would workloads be managed? What is our workload health? How prepared are we to respond to events?

Evolve – What have we learned and how will we improve?

Power failures – backup energy and backup systems, e.g. cloud.

Security

Protection of information, systems, and assets. Our current risk assessments and mitigation strategies.

- Access management – who has access, who will be given access, and at what level. The principle of least privilege ("least access").

- Access management policy – multi-factor authentication (MFA) and sign-in mechanisms, e.g. rejecting commonly used passwords.

What types of security are required and what is in place?

- What are our layers of security?

- Detection – security events: what was detected, what needs to be detected, and how we will do it. The use of automated alert systems.

- Infrastructure protection – how current systems are protected from a breach, e.g. a cyber-attack. Use an appropriate content delivery network and network firewalls. Distribute layers into subnets (smaller networks).

Establish trust boundaries, system security configuration, operating system controls, and policy enforcement.

- Data protection – data breach, data backup, data classification (level of security required).

Create dashboards to view data instead of granting access to the main database. Data encryption (at rest and in transit).

- Incident response – who, what, timeframe, and access requirements.

- Information storage and backup for recovery, e.g. on site or cloud storage.

Reliability

The current state of how we recover from failure and how we would measure it. Systems and organization health.

Information storage and backup for recovery

Recover from infrastructure or service disruptions – implementation of service-oriented architecture (SOA) and microservices.

Network bandwidth – current usage of network bandwidth, whether through one ISP or several. Reliability of the ISP.

Dynamically acquire resources on demand

Scale operations to match demand. People processes need to be streamlined, and systems need to be automatic or as close to automatic as possible.

Mitigate disruptions

Change management – monitor metrics for people (employee satisfaction surveys, 360-degree reviews) and for IT systems (via monitoring dashboards). How well do people respond to agile changes? People resources required and their duration (external).

Failure management – conduct simulations (failure drills) in the current environment, possibly alongside live environments (cloud-based or on other sites). Distribute workloads to other geographic zones. Backup and disaster recovery (DR strategy).

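Common KPIs

- Recovery Time Objective (RTO) – the maximum acceptable time to restore service after a failure.

- Recovery Point Objective (RPO) – the maximum acceptable amount of data loss, measured as time since the last recoverable copy.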
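
The two KPIs above can be measured directly from the timestamps recorded during a failure drill. The following is a minimal sketch, not an AWS tool: the DrillResult class, its field names, and the example timestamps and targets are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Timestamps recorded during one failure drill (hypothetical field names)."""
    last_good_backup: datetime   # most recent recoverable copy of the data
    failure_started: datetime    # when the simulated failure was injected
    service_restored: datetime   # when the workload was serving traffic again


def recovery_time(drill: DrillResult) -> timedelta:
    """Observed downtime, to be compared against the Recovery Time Objective."""
    return drill.service_restored - drill.failure_started


def data_loss_window(drill: DrillResult) -> timedelta:
    """Observed data-loss window, to be compared against the Recovery Point Objective."""
    return drill.failure_started - drill.last_good_backup


def meets_objectives(drill: DrillResult, rto: timedelta, rpo: timedelta) -> bool:
    """True only if both the observed downtime and data loss are within target."""
    return recovery_time(drill) <= rto and data_loss_window(drill) <= rpo


# Example drill with made-up timestamps: backup at 09:00, failure at 09:45,
# service restored at 10:30.
drill = DrillResult(
    last_good_backup=datetime(2024, 1, 1, 9, 0),
    failure_started=datetime(2024, 1, 1, 9, 45),
    service_restored=datetime(2024, 1, 1, 10, 30),
)
within_targets = meets_objectives(drill, rto=timedelta(hours=1), rpo=timedelta(hours=1))
```

Recording these three timestamps in every drill gives a trend over time, which is more useful than a single pass or fail result.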

Make responses/actions idempotent, so that repeating an action has the same effect as performing it once.
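
As an illustration of an idempotent recovery action, here is a minimal sketch. The function name, the shape of the state dictionary, and the node names are hypothetical and chosen only for this example. The action checks the current state first, so running it a second time (for instance when a responder retries after a timeout) changes nothing.

```python
def ensure_standby_promoted(state: dict, standby: str) -> dict:
    """Promote `standby` to primary. Safe to run any number of times."""
    if state.get("primary") == standby:
        # The desired end state already holds, so there is nothing to do.
        return state
    new_state = dict(state)  # copy so the caller's state is not mutated
    # Remember the old primary as failed so it is not reused by accident.
    new_state["failed"] = state.get("failed", []) + [state["primary"]]
    new_state["primary"] = standby
    new_state["standbys"] = [s for s in state.get("standbys", []) if s != standby]
    return new_state


# Example with made-up node names: running the action twice gives the same result.
before = {"primary": "db-a", "standbys": ["db-b", "db-c"]}
once = ensure_standby_promoted(before, "db-b")
twice = ensure_standby_promoted(once, "db-b")
```

Describing the action as "ensure the end state" rather than "perform the step" is what makes it safe to put into a playbook that several people may execute under pressure.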

Design a playbook

Use chaos engineering (create failures deliberately).
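
A very small form of failure injection can be sketched in plain Python. This is an illustrative assumption, not a description of any AWS service: the with_chaos wrapper, the InjectedFailure exception, and the fetch_price function are hypothetical names. The wrapper makes a chosen fraction of calls fail on purpose so the team can check that the fallback path really works.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during a drill."""


def with_chaos(func, failure_rate: float, rng: random.Random):
    """Wrap `func` so that roughly `failure_rate` of its calls fail on purpose."""
    def wrapper(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so a rate of 1.0 always fails
        # and a rate of 0.0 never fails.
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_fallback(primary, fallback):
    """Try the primary path and use the fallback if a failure was injected."""
    try:
        return primary()
    except InjectedFailure:
        return fallback()


def fetch_price():
    """Stand-in for a call to a real dependency."""
    return "live"


rng = random.Random(0)  # seeded so that a drill can be repeated exactly
always_failing = with_chaos(fetch_price, failure_rate=1.0, rng=rng)
never_failing = with_chaos(fetch_price, failure_rate=0.0, rng=rng)
degraded = call_with_fallback(always_failing, lambda: "cached")
healthy = call_with_fallback(never_failing, lambda: "cached")
```

In a real game day the injected failure would target infrastructure rather than a function call, but the principle is the same: create the failure on purpose, in a controlled way, and observe whether the response matches the playbook.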

Performance

Are our people, systems (IT), and processes utilized efficiently?

Cost Optimization

Who is monitoring our costs? It can be the Finance Department, but they cannot advise on how to reduce costs.

Develop a Cost Optimization plan.


Source: Amazon Web Services
