Edition 5c: AWS Well-Architected Framework - Reliability Pillar
Sitaram Choudary Yarlagadda
Data Technology Architect and Engineer Capable of utilizing the People, Process, and Technology framework as well as the DAMA-DMBOK concepts to effectively create and manage mission-critical enterprise data platforms.
Reliability Pillar
The Reliability pillar concerns the ability of a workload to perform its intended function correctly and consistently when it is expected to. This includes the ability to operate and test the workload throughout its entire lifecycle.
· Automatically recover from failure: By monitoring a workload for key performance indicators (KPIs), you can trigger automation as soon as a threshold is breached. These KPIs should measure business value rather than the technical aspects of how the service operates. This allows failures to be detected and tracked automatically, and automated recovery processes to work around or repair the fault. With more sophisticated automation, it becomes possible to anticipate and remediate failures before they occur.
· Test recovery procedures: In an on-premises environment, testing is often performed to prove that a workload functions in a particular scenario. Testing is seldom used to validate recovery procedures. In the cloud, you can test how your workload fails and verify that your recovery procedures work. You can use automation to simulate different failures or to recreate scenarios that led to failures before. This approach exposes failure pathways that you can examine and fix before a real failure occurs, which reduces risk.
· Scale horizontally to increase aggregate workload availability: Replacing one large resource with multiple smaller resources reduces the impact of a single failure on the overall workload. Distribute requests across multiple smaller resources so that they do not share a common point of failure.
· Stop guessing capacity: Resource saturation is a common cause of failure in on-premises workloads. It occurs when the demands placed on a workload exceed its capacity, which is often the aim of denial-of-service attacks. In the cloud, you can monitor demand and workload utilization, and automate the addition or removal of resources to maintain the level that satisfies demand without over- or under-provisioning. Limits still exist, but some quotas can be controlled and others can be managed. For more information, refer to the documentation on managing service quotas and constraints.
· Manage change through automation: Changes to your infrastructure should be made using automation. The changes that then need to be managed include changes to the automation itself, which can be tracked and reviewed.
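The horizontal-scaling principle above can be illustrated with a little arithmetic. The following is a minimal sketch, assuming a hypothetical fleet of identical resources that share load evenly; the function name and the fleet sizes are illustrative and do not come from AWS documentation.

```python
def capacity_lost_fraction(total_resources: int, failed_resources: int = 1) -> float:
    """Fraction of aggregate capacity lost when some resources fail.

    Assumes (for illustration only) identical resources sharing load evenly.
    """
    if total_resources <= 0:
        raise ValueError("total_resources must be positive")
    if not 0 <= failed_resources <= total_resources:
        raise ValueError("failed_resources must be between 0 and total_resources")
    return failed_resources / total_resources


for fleet_size in (1, 4, 10):
    lost = capacity_lost_fraction(fleet_size)
    print(f"{fleet_size} resource(s): one failure removes {lost:.0%} of capacity")
```

With a single large resource, one failure removes all capacity; spread across ten smaller ones, the same failure removes a tenth.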
Best Practices
The Reliability pillar covers four areas of best practice:
· Foundations
· Workload architecture
· Change management
· Failure management
Foundations
Foundational requirements are those whose scope extends beyond a single workload or project. Before architecting any system, the foundational requirements that influence reliability should be in place.
Service Quota and Constraints Management
Cloud-based workload architectures are subject to service quotas, also referred to as service limits. These quotas exist to prevent accidentally provisioning more resources than you need and to limit request rates on API operations in order to protect services from abuse. There are also resource constraints, such as the rate at which data can be pushed down a fiber-optic cable or the amount of storage on a physical disk.
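One way to act on quotas before they cause failures is to compare current usage with each limit and flag the ones that are running out of headroom. The sketch below is an illustration only: the quota names, the numbers, and the 80% warning ratio are assumptions, and in practice the values would come from your provider's quota and monitoring tooling.

```python
from dataclasses import dataclass


@dataclass
class QuotaUsage:
    name: str     # hypothetical quota label
    limit: float  # the quota value
    used: float   # current consumption


def quotas_near_limit(usages, warn_ratio=0.8):
    """Return the names of quotas whose usage is at or above warn_ratio of the limit."""
    return [u.name for u in usages if u.limit > 0 and u.used / u.limit >= warn_ratio]


usages = [
    QuotaUsage("instances-per-region", limit=100, used=85),
    QuotaUsage("api-requests-per-second", limit=50, used=20),
]
print(quotas_near_limit(usages))  # ['instances-per-region']
```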
Network Topology Plan
Workloads often exist in multiple environments. These include multiple cloud environments, both publicly accessible and private, and possibly your existing data center infrastructure. Plans must include network considerations such as intra- and inter-system connectivity, public IP address management, private IP address management, and domain name resolution.
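Private IP address management largely comes down to making sure the ranges assigned to connected environments do not collide. As a minimal sketch, the check below uses Python's standard `ipaddress` module; the environment names and CIDR ranges are hypothetical.

```python
import ipaddress
from itertools import combinations


def find_overlaps(named_ranges):
    """Return pairs of environment names whose private CIDR ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in named_ranges.items()}
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]


planned = {
    "on-premises": "10.0.0.0/16",
    "cloud-prod": "10.1.0.0/16",
    "cloud-dev": "10.0.128.0/20",  # deliberately collides with on-premises
}
print(find_overlaps(planned))  # [('on-premises', 'cloud-dev')]
```

Running a check like this before connecting networks surfaces conflicts while they are still cheap to fix.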
Workload Architecture
A reliable workload starts with upfront design decisions for both software and infrastructure. Your architecture choices will affect how your workload behaves across all of the Well-Architected pillars.
Workload Service Architecture
Build highly scalable and reliable workloads using either a service-oriented architecture (SOA) or a microservices architecture. SOA is the practice of making software components reusable through service interfaces. A microservices architecture goes further and makes components smaller and simpler.
Design Interactions in a Distributed System to Prevent Failures
Distributed systems rely on communication networks to interconnect components, such as servers or services. Your workload must operate reliably despite data loss or latency in these networks. Components of the distributed system must operate in a way that does not negatively affect other components or the workload. These best practices prevent failures and improve the mean time between failures (MTBF).
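One commonly used technique for keeping lost or delayed messages from causing incorrect results is to make operations idempotent, so that a request delivered twice has the same effect as a request delivered once. The sketch below is illustrative only: `PaymentService`, its `deposit` method, and the in-memory key store are hypothetical stand-ins for a real service and a durable store.

```python
class PaymentService:
    """Hypothetical service that applies each request at most once."""

    def __init__(self):
        self.balance = 0
        self._processed = {}  # idempotency key -> result already returned

    def deposit(self, idempotency_key: str, amount: int) -> int:
        # A repeated key means the caller is retrying: return the stored
        # result instead of applying the change a second time.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        self.balance += amount
        self._processed[idempotency_key] = self.balance
        return self.balance


service = PaymentService()
service.deposit("req-001", 50)
service.deposit("req-001", 50)  # duplicate delivery of the same request
print(service.balance)  # 50
```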
Design Interactions in a Distributed System to Mitigate Failures
Distributed systems rely on communication networks to interconnect components, such as servers or services. Your workload must operate reliably despite data loss or latency in these networks. Components of the distributed system must operate in a way that does not negatively affect other components or the workload. These best practices enable workloads to withstand stresses or failures, recover from them more quickly, and reduce the impact of such impairments. The result is an improved mean time to recovery (MTTR).
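A typical way to tolerate transient network faults is to retry a failed call a limited number of times, waiting longer between attempts and adding randomness so that many clients do not retry in lockstep. The following is a minimal sketch under stated assumptions: the function names, the attempt limit, and the delay values are illustrative, and `flaky_operation` is a hypothetical dependency that fails twice before succeeding.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the caller
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


attempts = {"count": 0}


def flaky_operation():
    """Hypothetical dependency that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network fault")
    return "ok"


# The sleep function is replaced here so the example runs instantly.
print(call_with_retries(flaky_operation, sleep=lambda seconds: None))  # ok
```

Capping both the number of attempts and the delay keeps retries from piling further load onto a dependency that is already struggling.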
Change Management
To operate your workload reliably, you must anticipate and accommodate changes to the workload or its environment. Changes include those imposed on your workload, such as spikes in demand, as well as those from within, such as feature deployments and security patches.
Monitor Workload Resources
Logs and metrics are powerful tools for gaining insight into the health of your workload. You can configure your workload to monitor logs and metrics and to send notifications when thresholds are crossed or significant events occur. Monitoring enables your workload to recognize when low-performance thresholds are crossed or failures occur, so that it can respond and recover automatically.
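The threshold idea can be sketched as a small alarm evaluator. To avoid reacting to a single noisy reading, this illustration raises an alarm only when several consecutive datapoints exceed the threshold. The function name, the metric values, and the three-period rule are assumptions made for the example, not the behavior of any specific monitoring service.

```python
def alarm_state(datapoints, threshold, periods=3):
    """Return "ALARM" if the last `periods` datapoints all exceed threshold, else "OK"."""
    recent = datapoints[-periods:]
    if len(recent) == periods and all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


error_rates = [0.5, 0.7, 2.4, 3.1, 2.8]  # hypothetical percent of failed requests
print(alarm_state(error_rates, threshold=2.0))  # ALARM
```

The returned state could then trigger a notification or an automated recovery action.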
Adapt to Changes in Demand
A scalable workload provides the elasticity to add or remove resources automatically so that they closely match the current demand at any given point in time.
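As an illustration of matching resources to demand, the sketch below computes how many resources would bring a per-resource metric (for example, average CPU utilization) back to a target value. It is a simplified proportional rule with assumed names and bounds, not the exact algorithm of any AWS scaling service.

```python
import math


def desired_capacity(current_capacity, metric_value, target_value,
                     min_capacity=1, max_capacity=20):
    """Capacity that would bring the per-resource metric back to the target."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_capacity * metric_value / target_value)
    # Clamp to the configured bounds so scaling never runs away in either direction.
    return max(min_capacity, min(max_capacity, raw))


print(desired_capacity(4, metric_value=90, target_value=60))  # 6 (scale out)
print(desired_capacity(4, metric_value=20, target_value=60))  # 2 (scale in)
```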
Change Implementation
Controlled changes are necessary to deploy new functionality and to ensure that workloads and the operating environment are running known software and can be patched or replaced in a predictable manner. If changes are uncontrolled, it is difficult to predict their effect or to address the issues that arise from them.
Failure Management
Failures are expected in any system of reasonable complexity. Reliability requires that your workload be aware of failures as they occur and take action to avoid an impact on availability. Workloads must be able to both withstand failures and automatically repair issues.
Backing Up Data
Back up data, applications, and configuration to meet your requirements for recovery time objectives (RTO) and recovery point objectives (RPO).
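Because the RPO bounds how much recent data you can afford to lose, a simple check is whether the time since the last successful backup is still within it. The sketch below assumes hypothetical timestamps and RPO values; the function name is illustrative.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup, now, rpo):
    """True if the data at risk (time since the last backup) is within the RPO."""
    return now - last_backup <= rpo


now = datetime(2024, 6, 1, 12, 0)
last_backup = datetime(2024, 6, 1, 8, 30)  # 3.5 hours before `now`
print(meets_rpo(last_backup, now, rpo=timedelta(hours=4)))  # True
print(meets_rpo(last_backup, now, rpo=timedelta(hours=1)))  # False
```

A similar comparison between measured restore duration and the RTO would cover the recovery-time side.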
Fault Isolation
Fault isolation boundaries limit the effect of a failure within a workload to a limited number of components. Components outside the boundary are unaffected by the failure. By using multiple fault-isolated boundaries, you can limit the impact on your workload.
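One way to picture a fault isolation boundary is to split a workload into independent cells and pin each customer to exactly one of them, so that a failed cell affects only its own customers. The sketch below is a hypothetical illustration: the cell count, the customer identifiers, and the hash-based assignment are assumptions chosen for the example.

```python
import hashlib


def assign_cell(customer_id: str, cell_count: int) -> int:
    """Map a customer to one cell deterministically using a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


customers = [f"customer-{n}" for n in range(1000)]
cells = 4
failed_cell = 2

# Only customers assigned to the failed cell are inside the blast radius.
affected = [c for c in customers if assign_cell(c, cells) == failed_cell]
print(f"{len(affected)} of {len(customers)} customers are in the failed cell")
```

With four cells, roughly a quarter of customers share any one cell; adding cells shrinks the share affected by a single failure.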
Withstand Component Failure
Workloads that require high availability and a low mean time to recovery (MTTR) must be architected for resiliency.
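The link between MTTR and availability can be shown with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below uses illustrative figures only; the 1,000-hour MTBF and the two MTTR values are assumptions, not measurements.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recovery."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


for mttr in (4.0, 0.5):
    value = availability(1000.0, mttr)
    print(f"MTBF 1000 h, MTTR {mttr} h -> availability {value:.4%}")
```

For the same failure rate, shortening recovery time raises availability, which is why resilient designs emphasize fast, automated recovery.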