PILLARS of the AWS Well-Architected Framework

SECURITY

Protecting data at rest:

Enforce encryption at rest: Enforce your defined encryption requirements based on the latest standards and best practices to help protect your data at rest.
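
As an illustrative, non-authoritative sketch of this practice, the snippet below enables default SSE-KMS encryption on an S3 bucket with boto3; the bucket name and KMS key ARN are hypothetical placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and KMS key; replace with your own resources.
BUCKET = "my-example-bucket"
KMS_KEY_ID = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"

# Require SSE-KMS as the default encryption for every new object in the bucket.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": KMS_KEY_ID,
                }
            }
        ]
    },
)
```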

Protecting data in transit:

  1. Define data protection in transit requirements, such as encryption standards, based on data classification to meet your organizational, legal, and compliance requirements.
  2. Best practices are to encrypt and authenticate all traffic, and to enforce the latest standards and ciphers.
  3. Implement secure key and certificate management: Store encryption keys and certificates securely and rotate them with strict access control; for example, by using a certificate management service such as AWS Certificate Manager or Sectigo.
  4. Enforce encryption in transit: Enforce your defined encryption requirements based on the latest standards and best practices to help you meet your organizational, legal, and compliance requirements (see the sketch after this list).
  5. Automate detection of data leak: Use a tool or detection mechanism to automatically detect attempts to move data outside of defined boundaries; for example, to detect a database system that is copying data to an unknown host.
  6. Authenticate network communications: Verify the identity of communications by using protocols, such as Transport Layer Security (TLS) or IPsec, to reduce the risk of data tampering or loss.
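
As a hedged sketch of point 4 (not an official AWS template), the following bucket policy denies any S3 request that does not arrive over TLS; the bucket name is a hypothetical placeholder.

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-example-bucket"  # hypothetical bucket name

# Deny every request to the bucket that is not made over HTTPS/TLS.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```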

Responding to an incident:

  1. Pre-provision access: Ensure that security personnel have the correct access pre-provisioned into AWS so that an appropriate response can be made to an incident.
  2. Pre-deploy tools: Ensure that security personnel have the right tools pre-deployed into AWS so that an appropriate response can be made to an incident.
  3. Run game days: Practice incident response game days (simulations) regularly, incorporate lessons learned into plans, and continuously improve responses and plans.

RELIABILITY

Manage service limits:

  1. Monitor and manage limits (see the sketch after this list).
  2. Accommodate fixed service limits through architecture.
  3. Ensure a sufficient gap between the current service limit and the maximum usage to accommodate failover.
  4. Manage service limits across all relevant accounts and Regions.
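
As a rough sketch of point 1 (the quota code and the 80 percent threshold are assumptions chosen for illustration), you could compare current usage against a Service Quotas value and flag when the remaining headroom is too small for failover:

```python
import boto3

quotas = boto3.client("service-quotas")
ec2 = boto3.client("ec2")

# Assumed quota code for "Running On-Demand Standard instances" (measured in vCPUs);
# verify the code and unit for your account before relying on it.
QUOTA_CODE = "L-1216C47A"

limit = quotas.get_service_quota(ServiceCode="ec2", QuotaCode=QUOTA_CODE)["Quota"]["Value"]

# Sum the vCPUs of all currently running instances.
paginator = ec2.get_paginator("describe_instances")
running_vcpus = sum(
    inst["CpuOptions"]["CoreCount"] * inst["CpuOptions"]["ThreadsPerCore"]
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for reservation in page["Reservations"]
    for inst in reservation["Instances"]
)

# Warn when usage leaves less than 20% headroom for failover (arbitrary threshold).
if running_vcpus > 0.8 * limit:
    print(f"WARNING: {running_vcpus} vCPUs in use out of a limit of {limit}")
```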

Manage your Network Topology:

  1. Use highly available connectivity between private addresses in public clouds and on-premises environments.
  2. Enforce non-overlapping private IP address ranges in multiple private address spaces where they are connected (see the sketch after this list).
  3. Ensure IP subnet allocation accounts for expansion and availability.
  4. Use highly available network connectivity for the users of the workload.
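
As a small, cloud-agnostic sketch of point 2, the Python ipaddress module can check that connected private address spaces do not overlap before you peer them; the CIDR blocks below are made-up examples.

```python
import ipaddress
from itertools import combinations

# Hypothetical address spaces: two VPCs and an on-premises range.
cidrs = {
    "vpc-prod": "10.0.0.0/16",
    "vpc-dev": "10.1.0.0/16",
    "on-premises": "192.168.0.0/16",
}

networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}

# Report any pair of connected ranges that overlap.
for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2):
    if net_a.overlaps(net_b):
        print(f"Overlap detected: {name_a} ({net_a}) and {name_b} ({net_b})")
```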

The system adapts to changes in demand:

  1. Procure resources upon detection of insufficient capacity within a workload.
  2. Procure resources manually upon detection that more resources may be needed soon for a workload.
  3. Load test the workload.
  4. Procure resources automatically when scaling a workload up or down (see the sketch below).
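
As an illustrative sketch of point 4 (the Auto Scaling group name and target value are assumptions), a target tracking scaling policy lets the group add and remove instances automatically based on average CPU utilization:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep the group's average CPU around 50% by scaling out and in automatically.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-example-asg",  # hypothetical group name
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```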

Monitor your resources:

  1. Monitor the workload in all tiers.
  2. Send notifications based on the monitoring (see the sketch after this list).
  3. Perform automated responses on events.
  4. Conduct reviews regularly.
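
As a hedged sketch of point 2 (the instance ID and SNS topic ARN are placeholders), a CloudWatch alarm can publish a notification when a monitored metric breaches a threshold:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Notify an SNS topic when average CPU stays above 80% for two 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-example",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],  # placeholder topic
)
```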

 Implement change:

Deploy changes with automation

Back up data

  1. Perform periodic recovery of the data to verify backup integrity and processes: Validate that your backup process implementation meets Recovery Time Objective and Recovery Point Objective through a recovery test.
  2. Secure and encrypt backups, or ensure the data is available from a secure source for reproduction: control access using authentication and authorization, such as AWS IAM, and detect data integrity compromise by using encryption (see the sketch after this list).
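
As one hedged example of point 2 (the snapshot ID, Region, and KMS key alias are placeholders), an unencrypted EBS snapshot can be copied into an encrypted one so that the backup itself is protected at rest:

```python
import boto3

# The client's Region is the destination Region for the encrypted copy.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.copy_snapshot(
    SourceRegion="us-east-1",                   # Region of the source snapshot
    SourceSnapshotId="snap-0123456789abcdef0",  # placeholder snapshot ID
    Encrypted=True,                             # force encryption on the copy
    KmsKeyId="alias/backup-key",                # placeholder KMS key alias
    Description="Encrypted copy for backup retention",
)
print("Encrypted snapshot:", response["SnapshotId"])
```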

The system withstands component failures

  1. Implement graceful degradation to transform applicable hard dependencies into soft dependencies: when a component's dependencies are unhealthy, the component itself does not report as unhealthy and can continue to serve requests in a degraded manner (see the sketch after this list).
  2. Automate complete recovery when technology constraints require a single location: if elements of the workload can only run in one Availability Zone or one data center, implement an automated complete rebuild of the workload with defined recovery objectives.
  3. Deploy the workload to multiple locations: distribute the workload across multiple Availability Zones and AWS Regions (for example, using DNS, ELB, Application Load Balancer, and API Gateway). These locations can be as diverse as needed.
  4. Automate healing on all layers: upon detection of a failure, use automated capabilities to perform actions that remediate it.
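
As a language-agnostic idea shown in Python (the recommendation service and the cached defaults are invented for illustration), graceful degradation means catching a failing soft dependency and returning a reduced but still useful response instead of failing the whole request:

```python
import logging

logger = logging.getLogger("storefront")

FALLBACK_RECOMMENDATIONS = ["bestseller-1", "bestseller-2"]  # static, pre-computed defaults


def fetch_recommendations(user_id: str) -> list[str]:
    """Call the (hypothetical) recommendation dependency; it may fail or time out."""
    raise TimeoutError("recommendation service unavailable")  # simulate an unhealthy dependency


def render_product_page(user_id: str) -> dict:
    # The recommendation service is a soft dependency: if it fails,
    # serve the page anyway with generic recommendations.
    try:
        recommendations = fetch_recommendations(user_id)
    except Exception as exc:
        logger.warning("Degraded mode, recommendations unavailable: %s", exc)
        recommendations = FALLBACK_RECOMMENDATIONS

    return {"user": user_id, "recommendations": recommendations, "status": "ok"}


print(render_product_page("user-42"))
```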

Test Resilience

  1.  Use playbooks for unanticipated failures: You have playbooks for failure scenarios that have not been anticipated to identify root causes and assist in strategies for prevention or mitigation.
  2. Inject failures to test resiliency: Test failures regularly, ensuring coverage of failure pathways (see the sketch after this list).
  3. Conduct game days regularly: Use game days to regularly exercise your failure procedures with the people who will be involved in actual failure scenarios.
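
As a hedged sketch of point 2 (the Auto Scaling group name is a placeholder, and this is a deliberately blunt form of fault injection rather than a full chaos-engineering framework), you could terminate a random instance in a group and then watch whether the workload recovers within its objectives:

```python
import random

import boto3

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")

GROUP_NAME = "my-example-asg"  # placeholder Auto Scaling group

groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[GROUP_NAME])
instances = groups["AutoScalingGroups"][0]["Instances"]

# Pick one instance at random and terminate it to simulate a failure.
victim = random.choice(instances)["InstanceId"]
print("Injecting failure: terminating", victim)
ec2.terminate_instances(InstanceIds=[victim])
# The group should replace the instance; verify that alarms fire and that
# recovery completes within the workload's defined objectives.
```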

 Plan for Disaster Recovery

  1. Define recovery objectives for downtime and data loss: The workload has a recovery time objective (RTO) and recovery point objective (RPO).
  2. Use defined recovery strategies to meet the recovery objectives: A disaster recovery (DR) strategy has been defined to meet objectives.
  3. Test disaster recovery implementation to validate the implementation: Regularly test failover to DR to ensure that RTO and RPO are met.
  4. Manage configuration drift on all changes: Ensure that AMIs and the system configuration state are up-to-date at the DR site or region, as well as the limits on AWS services.
  5. Automate recovery: Use AWS or third-party tools to automate system recovery.

PERFORMANCE EFFICIENCY

Select the best performing architecture:

  1. Understand the available services and resources.
  2. Define a process for architectural choices.
  3. Factor cost or budget into decisions.
  4. Use policies or reference architectures.
  5. Use the guidance from AWS or an APN Partner.
  6. Benchmark existing workloads.
  7. Load test your workload.

 Select your compute solution:

  1. Evaluate the available compute options.
  2. Understand the available compute configuration options.
  3. Collect compute-related metrics (see the sketch after this list).
  4. Determine the required configuration by right-sizing.
  5. Re-evaluate compute needs based on metrics.
  6. Use the available elasticity of resources.
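
As an illustrative sketch of points 3 and 4 (the instance ID and the thresholds are assumptions), pulling two weeks of CPU utilization from CloudWatch gives the data needed to decide whether an instance is over- or under-sized:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder instance

end = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    StartTime=end - timedelta(days=14),
    EndTime=end,
    Period=3600,                     # one data point per hour
    Statistics=["Average", "Maximum"],
)

datapoints = stats["Datapoints"]
if datapoints:
    avg = sum(dp["Average"] for dp in datapoints) / len(datapoints)
    peak = max(dp["Maximum"] for dp in datapoints)
    print(f"14-day average CPU: {avg:.1f}%, peak: {peak:.1f}%")
    # Arbitrary illustrative rule: consistently low usage suggests downsizing.
    if avg < 20 and peak < 40:
        print("Candidate for a smaller instance size.")
```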

Select your storage solution:

  1. Understand storage characteristics and requirements.
  2. Evaluate available configuration options.
  3. Make decisions based on access patterns and metrics.

 Select your database solution:

  1. Understand data characteristics.
  2. Evaluate the available options.
  3. Collect and record database performance metrics.
  4. Choose data storage based on access patterns.
  5. Optimize data storage based on access patterns and metrics.

 Configure your networking solution:

  1. Understand how networking impacts performance.
  2. Understand available product options.
  3. Evaluate available networking features.
  4. Choose a location based on network requirements.
  5. Optimize network configuration based on metrics.
  6. Use minimal network ACLs.
  7. Leverage encryption offloading and load-balancing.
  8. Choose network protocols to improve performance.

Evolve your workload to take advantage of new releases:

  1. Keep up-to-date on new resources and services.
  2. Evolve workload performance over time.
  3. Define a process to improve workload performance.

Monitor your resources to ensure they are performing as expected:

  1. Record performance-related metrics.
  2. Analyze metrics when events or incidents occur.
  3. Establish KPIs to measure workload performance.
  4. Use monitoring to generate alarm-based notifications.
  5. Review metrics at regular intervals.
  6. Monitor and alarm proactively: Use KPIs, combined with monitoring and alerting systems, to proactively address performance-related issues. Use alarms to trigger automated actions to remediate issues where possible; escalate the alarm to those able to respond if the automated response is not possible. For example, a system that can predict expected KPI values and alarm when they breach certain thresholds or a tool that can automatically halt or roll back deployments if KPIs are outside of expected values.

Use tradeoffs to improve performance:

  1.  Understand the areas where performance is most critical.
  2. Learn about design patterns and services.
  3. Identify how tradeoffs impact customers and efficiency.
  4. Measure the impact of performance improvements.
  5. Use various performance-related strategies.

COST OPTIMIZATION

Govern usage:

  1. Implement an account structure.
  2. Implement groups and roles.
  3. Implement cost controls.
  4. Track project lifecycle.
  5. Develop policies based on your organization's requirements: Develop policies that define how resources are managed by your organization. Policies should cover cost aspects of resources and workloads, including creation, modification, and decommission over the resource lifetime. Also, develop cost targets and goals for workloads.

Monitor usage and cost:

  1. Configure AWS Cost and Usage Report.
  2. Define and implement tagging.
  3. Configure billing and cost management tools.
  4. Identify cost attribution categories: Identify organization categories that could be used to allocate cost within your organization.
  5. Establish organization metrics: Establish the organization metrics that are required for this workload. Example metrics of a workload are customer reports produced or web pages served to customers.
  6. Report and notify on cost optimization: Configure AWS Budgets to provide notifications on cost and usage against targets (see the sketch after this list). Have regular meetings to analyze this workload's cost efficiency and to promote a cost-aware culture.
  7. Monitor cost proactively: Implement tooling and dashboards to monitor cost proactively for this workload; do not just look at costs and categories when you receive notifications. This helps to identify positive trends and promote them throughout your organization.
  8. Allocate costs based on workload metrics: Allocate this workload's costs by metrics or business outcomes to measure workload cost efficiency. Implement a process to analyze the AWS Cost and Usage Report with Amazon Athena, which can provide insight and chargeback capability.
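
As a hedged sketch of point 6 (the account ID, budget amount, and email address are placeholders), a monthly cost budget with an 80 percent actual-spend notification can be created with boto3:

```python
import boto3

budgets = boto3.client("budgets")
ACCOUNT_ID = "111122223333"  # placeholder account ID

budgets.create_budget(
    AccountId=ACCOUNT_ID,
    Budget={
        "BudgetName": "monthly-workload-budget",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},  # placeholder target
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,          # percent of the budgeted amount
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops@example.com"}
            ],
        }
    ],
)
```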

Decommission resources:

  1. Track resources over their lifetime.
  2. Implement a decommissioning process.
  3. Decommission resources in an unplanned manner: Decommission resources on an unplanned basis, typically triggered by events such as periodic audits, and usually performed manually.
  4. Decommission resources automatically: Design your workload to gracefully handle resource termination as you identify and decommission non-critical resources, resources that are not required, or resources with low utilization (see the sketch after this list).
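
As an illustrative sketch of point 4 (the CPU threshold is an assumption, and real decommissioning would involve tagging, approvals, and graceful shutdown rather than an immediate stop), you could flag running instances whose average CPU over the last week is very low:

```python
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
low_utilization = []

paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            instance_id = instance["InstanceId"]
            datapoints = cloudwatch.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
                StartTime=end - timedelta(days=7),
                EndTime=end,
                Period=86400,               # one data point per day
                Statistics=["Average"],
            )["Datapoints"]
            if datapoints and max(dp["Average"] for dp in datapoints) < 5:  # arbitrary 5%
                low_utilization.append(instance_id)

# Report candidates; an automated pipeline could stop or terminate them after review.
print("Low-utilization candidates:", low_utilization)
```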

 Meet cost targets when you select resource type and size:

  1. Select resource type and size based on estimates.
  2. Select resource type and size based on metrics.
  3. Perform cost modeling: Identify organization requirements and perform cost modeling of the workload and each of its components. Perform benchmark activities for the workload under different predicted loads and compare the costs. The modeling effort should reflect potential benefits, for example, time spent is proportional to component cost.

Use pricing models to reduce cost:

  1. Perform pricing model analysis: Perform an analysis on the workload using the Reserved Instance and Savings Plans recommendation features in AWS Cost Explorer (see the sketch after this list).
  2. Implement different pricing models, with low coverage: Implement reserved capacity, Spot Instances, Spot Blocks or Spot Fleet, in the workload but with low coverage, at less than 80 percent of overall recommendations.
  3. Implement pricing models for all components of this workload: Permanently running resources have high coverage with reserved capacity, with at least 80 percent of recommendations implemented. Short term capacity is configured to use Spot Instances, Spot Blocks or Spot Fleet.
  4. On-demand is only used for short-term workloads that cannot be interrupted, and do not run long enough for reserved capacity: typically 25 to 75 percent of the year, depending on the resource type.
  5. Implement regions based on cost: Resource pricing can be different in each region. Factoring in region cost ensures you pay the lowest overall price for this workload.
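
As an illustrative sketch of point 1 (the term, payment option, and lookback period are assumptions to adjust to your own policy), Cost Explorer can return Savings Plans purchase recommendations programmatically:

```python
import boto3

ce = boto3.client("ce")  # AWS Cost Explorer

response = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",       # Compute Savings Plans
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="THIRTY_DAYS",
)

summary = response["SavingsPlansPurchaseRecommendation"].get(
    "SavingsPlansPurchaseRecommendationSummary", {}
)
print("Estimated monthly savings:", summary.get("EstimatedMonthlySavingsAmount"))
```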

Plan for data transfer charges:

  1. Perform data transfer modeling.
  2. Select components to optimize data transfer cost.
  3. Implement services to reduce data transfer costs.

Match supply of resources with demand:

  1. Perform an analysis on the workload demand.
  2. Provision resources reactively or unplanned.
  3. Provision resources dynamically: Resources are provisioned in a planned manner. This can be demand-based, such as through automatic scaling; buffer-based, where demand is spread over time with lower overall resourcing used; or time-based, where demand is predictable and resources are provided based on time. These methods result in the least amount of over or under-provisioning.

 Evaluate new services:

  1. Review and implement services in an unplanned way.
  2. Keep up to date with new service releases.
  3. Establish a cost optimization function. Create a team that regularly reviews cost and usage across the organization.
  4. Develop a workload review process: Develop a process that defines the criteria and process for workload review. The review effort should reflect potential benefits, for example, core workloads or workloads with a value of over 10% of the bill are reviewed quarterly, while workloads below 10% are reviewed annually.
  5. Review and analyze this workload regularly: Existing workloads are regularly reviewed as per defined processes.

OPERATIONAL EXCELLENCE

 Determine what your priorities are:

  1. Evaluate external customer needs.
  2. Evaluate internal customer needs.
  3. Evaluate compliance requirements.
  4. Evaluate threat landscape.
  5. Evaluate the impact of trade-offs between competing interests, to help make informed decisions when determining where to focus operations efforts. For example, accelerating speed to market for new features may be emphasized over cost optimization.
  6. Manage benefits and risks to make informed decisions when determining where to focus operations efforts. For example, it may be beneficial to deploy a system with unresolved issues so that significant new features can be made available to customers.

Design your workload so that you can understand its state:

  1. Implement application telemetry.
  2. Implement and configure workload telemetry.
  3. Implement dependency telemetry.
  4. Implement user activity telemetry: Instrument your application code to emit information about user activity. For example, clickstreams, or started, abandoned, and completed transactions. Use this information to help understand how the application is used, patterns of usage, and to determine when a response is required (see the sketch after this list).
  5. Implement transaction traceability: Implement your application code and configure your workload components to emit information about the flow of transactions across the workload. Use this information to determine when a response is required and to assist in identifying the root cause of issues.
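
As one hedged way to implement point 4 (the namespace, dimensions, and event names are invented for this example), application code can publish user activity counters as custom CloudWatch metrics:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")


def record_user_event(event_name: str, page: str) -> None:
    """Emit a custom metric for a user activity event (e.g., checkout started or abandoned)."""
    cloudwatch.put_metric_data(
        Namespace="MyApp/UserActivity",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": event_name,
                "Dimensions": [{"Name": "Page", "Value": page}],
                "Value": 1,
                "Unit": "Count",
            }
        ],
    )


# Example usage inside request handlers:
record_user_event("CheckoutStarted", "cart")
record_user_event("CheckoutAbandoned", "payment")
```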

 Reduce defects, ease remediation, and improve flow into production:

  1. Use version control.
  2. Test and validate changes.
  3. Perform patch management.
  4. Share design standards.
  5. Implement practices to improve code quality.
  6. Use multiple environments.
  7. Make frequent, small, reversible changes.
  8. Use configuration management systems.
  9. Use build and deployment management systems.

 Mitigate deployment risks:

  1. Plan for unsuccessful changes.
  2. Test and validate changes.
  3. Test using limited deployments.
  4. Deploy frequent, small, reversible changes.
  5. Use deployment management systems.
  6. Deploy using parallel environments.
  7. Fully automated integration and deployment: Automate build, deployment, and testing of the workload. This reduces errors caused by manual processes and reduces the effort to deploy changes.
  8. Automate testing and rollback: Automate testing of deployed environments to confirm desired outcomes. Automate rollback to previous known good state when outcomes are not achieved to minimize recovery time and reduce errors caused by manual processes.

 Know that you are ready to support a workload:

  1.  Ensure personnel capability.
  2. Use runbooks to perform procedures.
  3. Make informed decisions to deploy systems and changes.
  4. Ensure consistent review of operational readiness: Ensure you have a consistent review of your readiness to operate a workload. The review must include at a minimum the operational readiness of the teams and the workload, and security considerations. Implement review activities in code and trigger automated review in response to events where appropriate, to ensure consistency, speed of execution, and reduce errors caused by manual processes.
  5. Use playbooks to identify issues: Playbooks are documented processes to investigate issues. Enable consistent and prompt responses to failure scenarios by documenting investigation processes in playbooks. Implement playbooks as code and trigger playbook execution in response to events where appropriate, to ensure consistency, speed responses, and reduce errors caused by manual processes.

Understand the health of your workload:

  1.  Identify key performance indicators.
  2. Define workload metrics.
  3. Collect and analyze workload metrics.
  4. Establish workload metrics baselines.
  5. Learn expected patterns of activity for the workload.
  6. Alert when workload outcomes are at risk.
  7. Alert when workload anomalies are detected.
  8. Validate the achievement of outcomes and the effectiveness of KPIs and metrics.

Manage workload and operations events:

  1. Use processes for event, incident, and problem management.
  2. Use a process for root cause analysis, and have a process per alert.
  3. Prioritize operational events based on business impact.
  4. Define escalation paths.
  5. Enable push notifications.
  6. Communicate status through dashboards: Provide dashboards tailored to their target audiences (for example, internal technical teams, leadership, and customers) to communicate the current operating status of the business and provide metrics of interest.
  7. Automate responses to events: Automate responses to events to reduce errors caused by manual processes, and to ensure prompt and consistent responses.
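
As a hedged sketch of point 7 (the Lambda function ARN is a placeholder, and a GuardDuty finding is just one possible trigger), an EventBridge rule can route events to a remediation function automatically:

```python
import json

import boto3

events = boto3.client("events")

# Trigger on GuardDuty findings (one possible event source for automated response).
events.put_rule(
    Name="guardduty-findings-to-remediation",
    EventPattern=json.dumps(
        {"source": ["aws.guardduty"], "detail-type": ["GuardDuty Finding"]}
    ),
    State="ENABLED",
)

# Send matching events to a (hypothetical) remediation Lambda function.
# Note: the function also needs a resource policy allowing events.amazonaws.com to invoke it.
events.put_targets(
    Rule="guardduty-findings-to-remediation",
    Targets=[
        {
            "Id": "remediation-lambda",
            "Arn": "arn:aws:lambda:us-east-1:111122223333:function:auto-remediate",
        }
    ],
)
```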

 Evolve operations:

  1. Have a process for continuous improvement.
  2. Define drivers for improvement.
  3. Document and share lessons learned.
  4. Implement feedback loops: Include feedback loops in your procedures and workloads to help you identify issues and areas that need improvement.
  5. Validate insights: Review your analysis results and responses with cross-functional teams and business owners. Use these reviews to establish a common understanding, identify additional impacts, and determine courses of action. Adjust responses as appropriate.
  6. Perform operations metrics reviews: Regularly perform a retrospective analysis of operations metrics with cross-team participants from different areas of the business. Use these reviews to identify opportunities for improvement, potential courses of action, and to share lessons learned.
  7. Allocate time to make improvements: Dedicate time and resources within your processes to make continuous incremental improvements possible.

The points mentioned above describe the recommended and ideal way to architect applications and workloads on AWS. Please feel free to comment and suggest more good practices that you have implemented in your organization.

Thank you and Best Regards,

Ashutosh Upadhyay


Eka Ponkratova

Data Engineering || Data Solutions Consultant || Specialist in SMEs || AWS Community Builder

4y

Super great, Ashutosh, thank you for sharing your experience! If I were you, I would add an intro paragraph introducing the topic, for example why these five pillars; it is clear that they are advised by AWS, but still. And why don't you mention the AWS services to perform the analysis with?
