Understand and Build a High Availability AWS CloudHSM

Shaoyi Li

Lead Cloud Engineer | AWS Hero | AWS re:Invent and Summit Speaker | 82x AWS Badges | TikTok-CN(100K) | MSE-DS@Upenn

Published: June 17, 2022

Hi everyone, I recently designed a High Availability and Disaster Recovery architecture for our company's FIPS 140-2 Level 3 hardware security module (HSM) in AWS. I am delighted to post the design here and discuss it with you to see whether any part of it can be improved.

Keywords: High Availability, Hardware security module, System monitoring, Data encryption at rest, Robustness, PCI-DSS, RPO and Data retention.

AWS uses a container called a cluster to manage the HSM instances. All the data (keys, policy files, user identity policies) in the HSM instances inside a cluster is kept in sync. AWS provides two ways to sync the data: server-side and client-side key synchronization. Server-side sync runs through the underlying AWS system. Client-side sync is managed by the CloudHSM client, which creates the same token key on several HSMs simultaneously on your behalf. AWS fully manages both processes, and each key's key check value (KCV) is validated as an integrity check. This design lets us add data redundancy without trading away performance.

1) Architecture of High Availability

To add data redundancy to my CloudHSM, I place the HSMs across two availability zones inside one region. This gives us about 99.99% availability per the AWS customer SLA. AWS doesn't recommend using a single HSM, because data is lost if that one HSM fails or is tampered with (a single point of failure). From Best practices for AWS CloudHSM: "For production clusters, you should have at least two HSM instances spread across two availability zones in a region." This increases availability when an HSM incident happens. Placing a single HSM in the Staging/User Testing environment is a good way to reduce cost, because CloudHSM is expensive at roughly $1,000 per month. As a best practice, AWS recommends going further and provisioning at least 3 HSMs across two availability zones.
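The placement described above can be sketched in a few lines of Python. This is a minimal illustration, not the exact provisioning code behind this design: the availability zone names are placeholders, the $1,000 figure is treated as a per-HSM price (an assumption), and `create_hsms` expects a boto3 `cloudhsmv2` client to be passed in so the planning logic can be tried without AWS credentials.

```python
from itertools import cycle, islice


def plan_hsm_placement(availability_zones, hsm_count):
    """Spread hsm_count HSMs across the given AZs in round-robin order."""
    if hsm_count < 1 or not availability_zones:
        raise ValueError("need at least one HSM and one availability zone")
    return list(islice(cycle(availability_zones), hsm_count))


def create_hsms(client, cluster_id, placement):
    """Request one HSM per planned AZ.

    `client` is expected to be a boto3 "cloudhsmv2" client; the cluster
    must already have a subnet in every AZ listed in `placement`.
    """
    for az in placement:
        client.create_hsm(ClusterId=cluster_id, AvailabilityZone=az)


def monthly_cost(hsm_count, price_per_hsm=1000):
    """Rough monthly cost, assuming about $1,000 per HSM (an assumption)."""
    return hsm_count * price_per_hsm


# Placeholder AZ names: three HSMs over two AZs, as in the production setup.
placement = plan_hsm_placement(["us-east-1a", "us-east-1b"], 3)
print(placement)
print(monthly_cost(len(placement)))
```

With boto3 installed and credentials configured, the plan could be applied with `create_hsms(boto3.client("cloudhsmv2"), cluster_id, placement)`, where `cluster_id` is the ID of your own cluster.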

2) Use load balancing to improve high availability

AWS uses client software on the backend EC2 instance to connect to CloudHSM (client-server pattern) and deliver users' decryption requests. Although I configure the client to communicate with the HSM in the corresponding availability zone (Zone A client <-> Zone A HSM), AWS still provides round-robin load balancing of user requests across the HSM instances automatically. If the client loses the connection to its mapped HSM, the CloudHSM client SDK automatically handles the HSM failure and load balances across the remaining HSMs. I set an event alarm on the HsmUnhealthy metric in AWS CloudWatch to detect HSM tampering and manage our HSMs better. Under the AWS Shared Responsibility Model, remediating the failed HSM is AWS's responsibility, so we don't need to worry about it. The request flow is: User request -> Backend EC2 server (where the HSM client is installed) -> Load balancer -> HSM instances.
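As a rough sketch of the HsmUnhealthy alarm, the helper below builds the keyword arguments for CloudWatch's `put_metric_alarm` call. The namespace `AWS/CloudHSM` and the dimension names (`Region`, `ClusterId`, `HsmId`) are my assumptions about how the metric is published and should be checked against the metrics visible in your own account. The alarm name, period, threshold and SNS topic are illustrative choices, not the values used in this design.

```python
def hsm_unhealthy_alarm(cluster_id, hsm_id, region, sns_topic_arn):
    """Build put_metric_alarm arguments for one HSM's HsmUnhealthy metric."""
    return {
        "AlarmName": f"cloudhsm-unhealthy-{hsm_id}",  # hypothetical naming scheme
        "Namespace": "AWS/CloudHSM",                  # assumed namespace
        "MetricName": "HsmUnhealthy",
        "Dimensions": [                               # assumed dimension names
            {"Name": "Region", "Value": region},
            {"Name": "ClusterId", "Value": cluster_id},
            {"Name": "HsmId", "Value": hsm_id},
        ],
        "Statistic": "Maximum",
        "Period": 60,                                 # evaluate every minute
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",           # no data is not an alarm
        "AlarmActions": [sns_topic_arn],              # notify on-call via SNS
    }


# Placeholder identifiers for illustration only.
alarm = hsm_unhealthy_alarm(
    cluster_id="cluster-example",
    hsm_id="hsm-example",
    region="us-east-1",
    sns_topic_arn="arn:aws:sns:us-east-1:111122223333:hsm-alerts",
)
print(alarm["AlarmName"])
```

With boto3 available, the alarm would be created with `boto3.client("cloudwatch").put_metric_alarm(**alarm)`, once per HSM in the cluster.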

3) CloudHSM backup with encryption at rest

To enhance the data resilience of our HSM, I also set up a periodic backup (1-day RPO) with a retention period of 90 days for our CloudHSM cluster. The snapshots are stored in AWS S3 with 99.999999999% durability. The HSM first encrypts the snapshots through envelope encryption with the AES-256 algorithm. AWS S3 then adds an extra security layer by encrypting the snapshots using AWS KMS. If we encounter data loss, or if AWS's automatic CloudHSM recovery fails, we can use the backup to restore our data.
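To make the two numbers concrete, here is a small sketch of how the 1-day RPO and the 90-day retention period can be checked against a list of backup timestamps. It is plain Python with made-up timestamps. It does not call AWS, and the function names are hypothetical helpers for illustration.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # retention period from the design
RPO = timedelta(days=1)         # recovery point objective from the design


def expired_backups(backup_times, now, retention=RETENTION):
    """Return the backups that are older than the retention period."""
    return [t for t in backup_times if now - t > retention]


def rpo_met(backup_times, now, rpo=RPO):
    """True if the most recent backup is no older than the RPO."""
    return bool(backup_times) and now - max(backup_times) <= rpo


# Made-up backup ages in days, relative to a fixed "now".
now = datetime(2022, 6, 17, tzinfo=timezone.utc)
backups = [now - timedelta(days=d) for d in (0.5, 30, 89, 91, 120)]

print(len(expired_backups(backups, now)))  # 2 (the 91- and 120-day-old backups)
print(rpo_met(backups, now))               # True (latest backup is 12 hours old)
```

A check like `rpo_met` is a natural candidate for a scheduled job that alerts when the newest backup is older than expected.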

4) Firewall and Event Monitoring

We use a security group as the network firewall to protect CloudHSM. The inbound rules allow only TCP ports 2223-2225 from the backend EC2 instance as the source, which is the dedicated connection required between EC2 and CloudHSM. We can monitor and troubleshoot CloudHSM issues with the CloudHSM audit log in the CloudWatch log group and the client SDK logs. I use AWS Lambda to sync the CloudHSM audit log, and the CloudWatch agent to sync the client SDK logs from the backend EC2 server, to S3 buckets in an independent centralized account for SIEM per PCI 10.5.3.
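The inbound rule can be expressed as the argument structure that EC2's `authorize_security_group_ingress` call accepts. This is a sketch under assumptions: both security group IDs are placeholders, and referencing the backend's security group as the source (rather than an IP range) is one reasonable way to say "from the backend EC2 only".

```python
def cloudhsm_ingress_rule(hsm_sg_id, backend_sg_id):
    """Build authorize_security_group_ingress arguments for the HSM group.

    Allows TCP 2223-2225 only from members of the backend security group.
    """
    return {
        "GroupId": hsm_sg_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": 2223,
                "ToPort": 2225,
                "UserIdGroupPairs": [
                    {
                        "GroupId": backend_sg_id,
                        "Description": "CloudHSM client on backend EC2",
                    }
                ],
            }
        ],
    }


# Placeholder security group IDs for illustration only.
rule = cloudhsm_ingress_rule("sg-0hsmexample", "sg-0backendexample")
print(rule["IpPermissions"][0]["FromPort"], rule["IpPermissions"][0]["ToPort"])
```

With boto3 available, the rule would be applied with `boto3.client("ec2").authorize_security_group_ingress(**rule)`. No other inbound rule is needed on the HSM security group for client traffic.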

In the next article, I will introduce how to build an automation workflow and a Business Continuity/Disaster Recovery plan for AWS CloudHSM using CloudHSM cross-region replication, AWS Health, and AWS Lambda. Please feel free to leave your questions or suggestions for my HA CloudHSM design.

Ketankumar Prajapati

Principal Software Engineer @ Mastercard

1 year ago

I also worked with most of the same design requirements on AWS CloudHSM and came across the questions/challenges below, which didn't have an easy solution or workaround a few months ago. Pasting those points here in case you faced any of them and have thoughts to share. 1. How do you achieve client-side load balancing and an uninterrupted HSM connection when the HSM instance behind the HSM_IP/ENI_IP goes down and a new client is not able to connect to the HSM cluster? 2. CloudHSM cluster initialization process at https://docs.aws.amazon.com/cloudhsm/latest/userguide/initialize-cluster.html: 2.a: AWS mandates that the certificate used to sign the cluster CSR be a long-lived certificate (10 years). This is because if the certificate expires or the private key gets lost, we will lose the cluster entirely, as it will be inaccessible. 2.b: AWS CloudHSM has a technical limitation where the cluster CSR can only be signed using a self-signed certificate, not a certificate issued through an intermediate CA. Interested to know whether you created a separate CA for this process. Let me know your thoughts! Happy to chat.

1 reply

Ketankumar Prajapati

Principal Software Engineer @ Mastercard

1 year ago

Good article! Just checking whether you wrote the follow-up article on automating cross-region replication with AWS CloudHSM that you mentioned here. Let me know if you did. I will be happy to read it.
