Huawei Cloud CCE Agile Best Practices — Finance HA Solution

1 Background

1.1 Introduction to CCE Agile

Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructures, and declarative APIs exemplify this approach. These techniques make loosely coupled systems resilient, manageable, and observable. Combined with robust automation, cloud native technologies allow engineers to make high-impact changes frequently and predictably with minimal toil.

Cloud Container Engine Agile (CCE Agile) is a next-generation cloud native management platform driven by a large number of highly reliable cloud services and high-performance financial applications. It provides a new infrastructure for enterprises to develop, integrate, manage, maintain, and operate software. CCE Agile extends Huawei's hybrid cloud container solution to enterprises' premises and is compatible with the Kubernetes and Docker container ecosystems. CCE Agile provides high-performance, scalable container services that allow you to create highly reliable container clusters and easily create and manage diverse workloads. CCE Agile also provides efficient O&M capabilities, such as self-healing, log collection, and auto scaling.

1.2 Why Is Geo-Redundancy Required?

To keep up with different technological advancements, enterprises need cloud platforms to enhance their IT capabilities. To ensure the high availability (HA) and reliability of a cloud platform, multi-layer redundancy is crucial. For example, at the application layer, multiple copies of your data need to be stored. For region-level faults, geo-redundancy is a common practice.

Geo-redundancy refers to the practice of deploying a system in three different data centers distributed across two geographical locations to ensure HA and fault tolerance. It is typically used in industries such as finance, telecoms, and healthcare to eliminate single points of failure (SPOFs) and to ensure that a standby site can always take over if the primary site goes offline.

1.3 Application Scenarios

Huawei Cloud CCE Agile can be used to design an HA solution for microservice applications.

2 HA Solution

2.1 Overview

The following shows the architecture of an HA solution.

Description:

· In this solution, multiple clusters need to be managed. In this example, Region 1 is used as the primary site to provide intra-city active-active services, and Region 2 is used as the standby site. There is a cluster in each data center. For mission-critical applications, two clusters can be deployed in a data center for higher HA.

· A large cluster is split into multiple smaller clusters to reduce the blast radius of any potential faults. For a single cluster, up to 200 nodes are recommended. No workloads run at the standby site. To reduce costs, the number of nodes in a cluster can be reduced. However, data checks and node scale-out must be complete before traffic switchover. If a scale-out is not feasible, you are advised to prepare the environment at the standby site based on the peak traffic requirements of the production environment at the primary site.

· The traffic of each data center is transmitted to Elastic Load Balance (ELB) on the cloud platform, where cross-cloud load balancing is performed. The traffic is then forwarded to a service gateway in each cluster. The service gateway routes the traffic to backend applications based on the application distribution and registration details. Domain Name Service (DNS) generally directs all traffic to the primary site but switches the traffic to the standby site during DR.

· The gateway routing can be implemented within and across clusters by phase, which requires microservice framework reconstruction. Health checks and traffic monitoring at the service layer allow for global management and distribution of application routes, so services remain available even if a single cluster, AZ, or data center crashes.

· Persistent application data is stored in a database, and applications are stateless, which makes scaling and migration easier during DR. To ensure data consistency, the databases used by applications in both data centers at the primary site use unilateral (single-side) read/write. At the same time, data is synchronized between the two databases at the primary site in real time. In case of a disaster, the standby database takes over and becomes the primary database. The data of mission-critical applications can be synchronized across regions, which makes remote disaster recovery possible when the primary site is faulty.

· Applications can be reconstructed on an application-by-application basis to minimize the risk of faults in a single application. Individual applications can use a multi-active architecture, such as dual clusters, for DR. In addition, data synchronization and disaster recovery can be implemented at the application level.

2.2 Key Points

You can design the HA architecture at the access layer, application layer, data layer, and infrastructure layer.

2.2.1 Access Layer HA

Key points: DNS and load balancing for traffic switchover upon regional and single-DC faults

Key technology 1: Route management using DNS

· Weighted routing through DNS A record sets

· Health checks and automatic failover. If an IP address in a region is faulty, services are automatically switched to another available IP address.
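
As a minimal sketch of how weighted A-record resolution with health checks could behave, the following Python snippet picks an IP address by weight and automatically falls back to any remaining healthy address. The record set, IP addresses, and the /healthz probe path are illustrative assumptions, not CCE Agile or Huawei Cloud DNS APIs.

```python
import random
import urllib.request

# Hypothetical weighted A record set: the primary site carries the traffic.
RECORD_SET = [
    {"ip": "198.51.100.10", "weight": 100, "site": "region1-primary"},
    {"ip": "203.0.113.20",  "weight": 0,   "site": "region2-standby"},
]

def is_healthy(ip: str, timeout: float = 2.0) -> bool:
    """Probe a site-level health endpoint (the path is an assumption)."""
    try:
        with urllib.request.urlopen(f"http://{ip}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def resolve() -> str:
    """Pick an IP address by weight, skipping unhealthy entries (automatic failover)."""
    candidates = [r for r in RECORD_SET if r["weight"] > 0 and is_healthy(r["ip"])]
    if not candidates:
        # All weighted entries failed: fall back to any healthy standby address.
        candidates = [r for r in RECORD_SET if is_healthy(r["ip"])]
    if not candidates:
        raise RuntimeError("no healthy site available")
    total = sum(r["weight"] or 1 for r in candidates)
    pick = random.uniform(0, total)
    for r in candidates:
        pick -= r["weight"] or 1
        if pick <= 0:
            return r["ip"]
    return candidates[-1]["ip"]
```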

Key technology 2: Single-region traffic weight management through load balancing

· Intra-city cross-DC traffic scheduling

· Health checks and dynamic adjustment of cluster traffic weights based on the backend application instance status
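
The dynamic weight adjustment can be pictured with a short sketch that derives per-cluster traffic weights from health-check results. The Backend structure and the 0-100 weight scale below are assumptions for illustration rather than the actual ELB interface.

```python
from dataclasses import dataclass

@dataclass
class Backend:
    cluster: str
    healthy_instances: int
    total_instances: int

def cluster_weights(backends: list[Backend]) -> dict[str, int]:
    """Derive per-cluster traffic weights from health-check results.
    A cluster with no healthy instances gets weight 0 and receives no traffic."""
    weights = {}
    for b in backends:
        weights[b.cluster] = 0 if b.total_instances == 0 else round(
            100 * b.healthy_instances / b.total_instances)
    return weights

# Example: the DC1 cluster is degraded, the DC2 cluster is fully healthy.
print(cluster_weights([Backend("dc1-cluster", 2, 4), Backend("dc2-cluster", 4, 4)]))
# -> {'dc1-cluster': 50, 'dc2-cluster': 100}
```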


As shown in the following figure, when a fault is detected, public network traffic is switched to the standby site.

2.2.2 Application Layer HA

Key points: Multi-cluster and multi-copy deployment and gateway routing (mainly at the primary site)

Key technology 1: Multi-cluster and multi-copy deployment

Multiple clusters are deployed so that a fault in a single cluster does not affect all services.
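
A hedged sketch of such a deployment is shown below: the same Deployment manifest, with several replicas spread across nodes, is applied to every cluster context with kubectl. The context names, image, and application name are placeholders.

```python
import json
import subprocess

# Hypothetical kubeconfig contexts, one per cluster/data center.
CLUSTER_CONTEXTS = ["dc1-cluster", "dc2-cluster"]

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "payment-service", "labels": {"app": "payment-service"}},
    "spec": {
        "replicas": 3,  # multiple replicas (copies) per cluster
        "selector": {"matchLabels": {"app": "payment-service"}},
        "template": {
            "metadata": {"labels": {"app": "payment-service"}},
            "spec": {
                # Spread replicas across nodes so that one node failure does not
                # take out the whole workload inside a cluster.
                "topologySpreadConstraints": [{
                    "maxSkew": 1,
                    "topologyKey": "kubernetes.io/hostname",
                    "whenUnsatisfiable": "ScheduleAnyway",
                    "labelSelector": {"matchLabels": {"app": "payment-service"}},
                }],
                "containers": [{"name": "payment-service",
                                "image": "registry.example.com/payment-service:1.0"}],
            },
        },
    },
}

# Apply the same manifest to every cluster so a single-cluster fault never
# removes the only copy of the application.
for ctx in CLUSTER_CONTEXTS:
    subprocess.run(["kubectl", "--context", ctx, "apply", "-f", "-"],
                   input=json.dumps(deployment).encode(), check=True)
```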

Key technology 2: Application route management

· In the image below, a gateway configured for each cluster routes traffic based on a microservice framework.

· Kubernetes Ingresses and Services of the NodePort type route traffic based on Kubernetes service discovery.

· To reduce cross-cluster dependency, services are only accessed within a single cluster.
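
For reference, a minimal sketch of the Kubernetes objects behind this routing is shown below: a NodePort Service that selects the application Pods and an Ingress that routes a path to that Service within the same cluster. The names, host, and ports are illustrative; the manifests can be applied per cluster in the same way as the previous sketch.

```python
# Hypothetical NodePort Service exposing the payment-service Pods.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "payment-service"},
    "spec": {
        "type": "NodePort",
        "selector": {"app": "payment-service"},
        "ports": [{"port": 80, "targetPort": 8080, "nodePort": 30080}],
    },
}

# Hypothetical Ingress routing /payment to that Service inside the same cluster.
ingress = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "Ingress",
    "metadata": {"name": "payment-ingress"},
    "spec": {
        "rules": [{
            "host": "pay.example.com",
            "http": {"paths": [{
                "path": "/payment",
                "pathType": "Prefix",
                "backend": {"service": {"name": "payment-service",
                                        "port": {"number": 80}}},
            }]},
        }],
    },
}
```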

Key technology 3: Cross-cluster routing (gradual evolution)

In the image below, the registration center's dual-write mechanism ensures that application routes are available in both clusters. This allows traffic to be routed to the other cluster if all applications in one cluster are faulty. Applications read these routes before accessing each other, and inter-service API invoking and traffic forwarding are performed based on policies such as nearest access.

· During route registration in the registration center, short paths in a cluster and long paths across clusters are automatically registered and distinguished.

· Intra- and inter-cluster API invoking and traffic forwarding

    · Intra-cluster: lowest latency

    · Inter-cluster:

        · The gateway in each cluster provides services for external systems. Cross-cluster traffic needs to traverse the service gateways, which increases the latency.

        · ELB needs to provide health checks for the Kubernetes Ingresses and NodePort Services that are exposed at an external IP address, so that traffic can be diverted to a healthy cluster if necessary.
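
The dual-write and nearest-access behavior can be illustrated with a toy registry sketch; the Registry and Route types below are assumptions for illustration, not the API of any specific registration center.

```python
from dataclasses import dataclass

@dataclass
class Route:
    service: str
    cluster: str
    address: str   # gateway or instance address

class Registry:
    """Toy registration center; routes are dual-written to peer registries."""
    def __init__(self):
        self.routes: list[Route] = []

    def register(self, route: Route, peers: tuple["Registry", ...] = ()):
        self.routes.append(route)
        for peer in peers:            # dual-write: keep the peer registry in sync
            peer.routes.append(route)

    def lookup(self, service: str, local_cluster: str) -> Route:
        candidates = [r for r in self.routes if r.service == service]
        # Nearest access: prefer the short (intra-cluster) path; fall back to the
        # long (cross-cluster) path through the peer cluster's service gateway.
        local = [r for r in candidates if r.cluster == local_cluster]
        if local:
            return local[0]
        if candidates:
            return candidates[0]
        raise LookupError(f"no route for {service}")

# Usage: a route registered in cluster A is also visible to callers in cluster B.
reg_a, reg_b = Registry(), Registry()
reg_a.register(Route("order-service", "cluster-a", "10.0.1.15:8080"), peers=(reg_b,))
print(reg_b.lookup("order-service", local_cluster="cluster-b").cluster)  # cluster-a
```

Intra-cluster (short-path) routes are preferred, and the cross-cluster (long-path) route through the peer gateway is used only when no local instance is available.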

2.2.3 Database HA

Key points: Multi-DC database deployment and cross-DC data synchronization

Key technology 1: Cross-AZ deployment for failover within seconds

Cross-AZ deployment is supported. If a primary database deployed in a production environment becomes faulty, a standby database in a standby environment can take over within seconds.

Key technology 2: Data synchronization

Both semi-synchronous and asynchronous replication are supported. To ensure data security, semi-synchronous replication is recommended. To improve performance, asynchronous replication is recommended.
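
The difference between the two models can be sketched as follows: a semi-synchronous commit waits for at least one replica acknowledgement before returning, while an asynchronous commit returns immediately. This is a simplified illustration of the concept, not a database engine's actual replication code.

```python
import queue
import threading

class Replica:
    def __init__(self):
        self.log = []

    def apply(self, txn) -> bool:
        self.log.append(txn)   # persist the change, then acknowledge
        return True

def commit(txn, replicas, semi_sync: bool = True) -> None:
    """Semi-synchronous: return only after at least one replica has acknowledged
    the transaction (raises queue.Empty if no replica answers in time).
    Asynchronous: return immediately and let replicas catch up in the background,
    which is faster but may lose the most recent transactions on failover."""
    acks = queue.Queue()
    for r in replicas:
        threading.Thread(target=lambda r=r: acks.put(r.apply(txn)),
                         daemon=True).start()
    if semi_sync:
        acks.get(timeout=5)   # wait for the first acknowledgement
```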

Key technology 3: Read/write isolation and multiple read replicas

Multiple read replicas can be used to share read traffic and improve the service query throughput.
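
A minimal read/write splitting sketch is shown below; it assumes DB-API-style connection objects and routes statements by their first keyword. Real deployments usually do this in a database proxy or driver middleware.

```python
class RoutingCursor:
    """Minimal read/write splitting: writes go to the primary, while reads are
    spread across read replicas to raise query throughput."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self._next = 0

    def execute(self, sql: str, params=()):
        if sql.lstrip().upper().startswith(("SELECT", "SHOW")):
            conn = self.replicas[self._next % len(self.replicas)]  # round robin
            self._next += 1
        else:
            conn = self.primary
        return conn.execute(sql, params)
```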

Key technology 4: Primary/standby switchover transparent to applications

A pair of primary/standby databases, deployed across AZs, uses the same IP address for external access. That way, if the primary database becomes faulty, the IP address remains unchanged. Even if a natural disaster in the region hosting the primary database makes both the primary and standby databases unreachable by applications, a standby database in another region can be promoted to primary and take over. Then, you can connect your applications to the new primary database for service recovery.
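
From the application side, the switchover can be handled with a simple reconnect loop against the unchanged (floating) address, as sketched below; the address and retry policy are illustrative assumptions.

```python
import socket
import time

FLOATING_DB_ADDR = ("10.0.0.100", 3306)   # stays the same after a switchover

def connect_with_retry(retries: int = 10, backoff: float = 1.0) -> socket.socket:
    """Reconnect to the floating address; after the standby is promoted it answers
    on the same IP address, so the application needs no configuration change."""
    for attempt in range(retries):
        try:
            return socket.create_connection(FLOATING_DB_ADDR, timeout=3)
        except OSError:
            time.sleep(backoff * (attempt + 1))   # simple linear backoff
    raise ConnectionError("database unavailable after failover window")
```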

2.2.4 Infrastructure HA

Infrastructure mainly includes basic resources such as IaaS, containers, storage, and networks.

A container platform can be deployed in each data center at both sites to ensure that the container platforms can still be managed even if a private line linking data centers becomes faulty.

For other resources such as IaaS and storage, especially those for storing persistent data, DR across equipment rooms and data centers must be provided.

Key technology 1: Active-active data reliability

In the example shown here, persistent data for applications running in equipment room A is immediately written to the NAS storage in equipment room B. If the storage devices in equipment room A crash, the latest data is still readily available in equipment room B. This accelerates service takeover and performance restoration.
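
The idea can be illustrated at the application level with a dual-write sketch; in practice this replication is usually performed by the storage layer itself, and the mount paths below are assumptions.

```python
import shutil
from pathlib import Path

# Hypothetical mount points for the NAS storage in each equipment room.
PRIMARY_NAS = Path("/mnt/nas-room-a")
MIRROR_NAS = Path("/mnt/nas-room-b")

def persist(relative_path: str, data: bytes) -> None:
    """Write application data to room A and immediately mirror it to room B,
    so the latest copy survives a storage failure in room A."""
    src = PRIMARY_NAS / relative_path
    src.parent.mkdir(parents=True, exist_ok=True)
    src.write_bytes(data)

    dst = MIRROR_NAS / relative_path
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
```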

Key technology 2: Cross-AZ HA of IaaS

Multiple AZs are created in a data center for VM and network HA. If AZs need to span data centers at the same site, the corresponding technical requirements must be met.

2.2.5 Application Unit Reconstruction

Key points: Application reconstruction on a unit-by-unit basis ensures that each application unit can handle all required services. Each unit includes both applications and data.

· Advantages: Faults in individual units are isolated from other units, so the blast radius is small. Superb performance and linear expansion are both supported. More units can be added, as needed, for horizontal expansion.

· Dependencies: A clear service model is required. Tightly coupled applications and databases must be sliced by service to provide global routes. APIs that allow cross-unit access require application data integration. Because reconstruction costs are high, coordinated planning and design are required for both applications and data.
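
As an illustration of unit routing, the sketch below pins each user to one unit by hashing a user ID; the unit list and hashing scheme are assumptions. Production systems typically use a routing table or consistent hashing so that adding units does not reshuffle existing users.

```python
import hashlib

UNITS = ["unit-01", "unit-02", "unit-03"]   # each unit holds its own apps and data

def route_unit(user_id: str) -> str:
    """Pin each user to one unit so a fault in a single unit only affects the
    users it serves; adding entries to UNITS scales the system horizontally."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return UNITS[int(digest, 16) % len(UNITS)]

print(route_unit("customer-8842"))
```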

2.2.6 O&M

Key points: Layered O&M provides unified O&M management for applications.

Key technology 1: Unified login SSO

Multiple platforms, such as IaaS, container, storage, and database platforms, need to be connected to the internal single sign-on (SSO) system for unified login and management. The same credentials can be used to log in to platforms in different regions.

Key technology 2: Unified log monitoring and management

The O&M platform integrates application logging, monitoring, and tracing for multiple clusters to provide capabilities such as a monitoring dashboard, full-trace analysis, and operation reports.

3 HA Planning

Once infrastructure resources are ready, you can plan CCE Agile in each data center based on the HA solution.

3.1 Cluster HA Planning

There is a management cluster and more than one workload cluster in each data center, and each data center has an independent control plane to avoid dependencies between data centers.

Table 3-1 Management cluster

Table 3-2 Workload cluster

3.2 HA Evolution Strategy

There are two phases in developing the HA infrastructure for the middle platform: the intra-city active-active architecture and the geo-redundancy architecture.

An intra-city active-active architecture is built from the bottom up. You can start with the infrastructure: the equipment room and IaaS for the second data center. Then, complete the dual-DC HA evolution of databases and middleware, build a new cluster, and release services to the new cluster. Once applications are released to the gateway layer and verified, the dual-DC active-active architecture is complete. The process of building the remote DR site for geo-redundancy is similar.

