Understand and Build a High Availability AWS CloudHSM
Hi everyone, I recently designed a High Availability and Disaster Recovery architecture for our company's FIPS 140-2 Level 3 hardware security module (HSM) in AWS. I am delighted to post the design here and discuss it with you to see whether any part of it can be improved.
Keywords: High Availability, Hardware security module, System monitoring, Data encryption at rest, Robustness, PCI-DSS, RPO and Data retention.
AWS uses a container called a cluster to manage HSM instances. All the data (keys, policy files, user identity policies) on the HSM instances inside a cluster is kept in sync. AWS provides two ways to sync the data: server-side and client-side key synchronization. Server-side sync is handled by the underlying AWS system, while the CloudHSM client manages client-side sync: the client generates keys with the same token on several HSMs simultaneously on your behalf. AWS fully manages both processes, and the key check value (KCV) is validated as an integrity check. This design lets us avoid the trade-off between data redundancy and performance.
1) Architecture of High Availability
To add data redundancy to my CloudHSM, I place the HSMs across two availability zones inside one region. This gives us about 99.99% availability per the AWS customer SLA. AWS does not recommend using a single HSM because data will be lost if that HSM fails or is tampered with (single point of failure). From Best practices for AWS CloudHSM: "For production clusters, you should have at least two HSM instances spread across two availability zones in a region." This increases availability when an incident affects an HSM. Placing a single HSM in the Staging/User Testing environment is a good way to reduce cost, because CloudHSM is costly, at $1000 per month. For production, the best practice AWS recommends is to provision at least 3 HSMs across two availability zones.
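The placement described above can be sketched in Python. This is a minimal illustration, not the exact provisioning code behind this design: `plan_hsm_placement` and `create_hsms` are hypothetical helper names, the zone names are placeholders, and `client` is assumed to be a boto3 `cloudhsmv2` client whose `create_hsm` call takes `ClusterId` and `AvailabilityZone`.

```python
from itertools import cycle


def plan_hsm_placement(availability_zones, hsm_count):
    """Spread hsm_count HSMs round-robin over the given availability zones."""
    if hsm_count < 1 or not availability_zones:
        raise ValueError("need at least one HSM and one availability zone")
    zones = cycle(availability_zones)
    return [next(zones) for _ in range(hsm_count)]


def create_hsms(client, cluster_id, placement):
    """Request one HSM per entry in placement.

    client is assumed to be boto3.client("cloudhsmv2"). In practice the
    requests may have to be spaced out until the previous HSM is active;
    that waiting logic is left out of this sketch.
    """
    return [
        client.create_hsm(ClusterId=cluster_id, AvailabilityZone=zone)
        for zone in placement
    ]


if __name__ == "__main__":
    # Placeholder zone names: three HSMs over two zones.
    print(plan_hsm_placement(["us-east-1a", "us-east-1b"], 3))
```

With three HSMs over two zones, one zone holds two HSMs and the other holds one, so losing either zone still leaves at least one HSM serving requests.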
2) Use load balancing to improve high availability
AWS uses client software on the backend EC2 instance to connect to CloudHSM (client-server pattern) and deliver users' decryption requests. Although I configure the client to communicate with the HSM in the corresponding availability zone (Zone A client <-> Zone A HSM), AWS still provides round-robin load balancing of user requests across the HSM instances automatically. If the client loses the connection to its mapped HSM, the CloudHSM client SDK automatically handles the HSM failure and load balances across the remaining HSMs. I set an event alarm on the HsmUnhealthy metric in AWS CloudWatch to detect HSM tampering and administer our HSMs better. Under the AWS Shared Responsibility Model, remediating a failed HSM is the responsibility of AWS, so we don't need to worry about it. The request flow is: User request -> Backend EC2 server (where the HSM client is installed) -> Load balancer -> HSM instances.
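As a rough sketch of the HsmUnhealthy alarm mentioned above, the function below builds the arguments for CloudWatch's `put_metric_alarm`. The helper name `hsm_unhealthy_alarm`, the alarm name, the 60-second period, the threshold, and the SNS topic are illustrative assumptions; the dimension names (`Region`, `ClusterId`, `HsmId`) are also an assumption and should be checked against the metrics visible in your own account.

```python
def hsm_unhealthy_alarm(region, cluster_id, hsm_id, sns_topic_arn):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires as soon as HsmUnhealthy reports 1 or more for one HSM.
    Dimension names below are an assumption, not confirmed by the article.
    """
    return {
        "AlarmName": f"cloudhsm-{cluster_id}-{hsm_id}-unhealthy",
        "Namespace": "AWS/CloudHSM",
        "MetricName": "HsmUnhealthy",
        "Dimensions": [
            {"Name": "Region", "Value": region},
            {"Name": "ClusterId", "Value": cluster_id},
            {"Name": "HsmId", "Value": hsm_id},
        ],
        "Statistic": "Maximum",
        "Period": 60,                 # seconds per data point (assumed)
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(cloudwatch_client, **alarm_kwargs):
    """cloudwatch_client is assumed to be boto3.client("cloudwatch")."""
    return cloudwatch_client.put_metric_alarm(**alarm_kwargs)
```

Building the arguments separately from the API call keeps the alarm definition easy to review and lets one loop create an alarm for every HSM in the cluster.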
3) CloudHSM backup with encryption at rest
To enhance the data resilience of our HSM, I also set a periodic backup (1 day RPO) with a retention period of 90 days for our CloudHSM cluster. The snapshots are stored in AWS S3 with a 99.999999999% durability level. The HSM first encrypts the snapshots through envelope encryption with the AES-256 algorithm. Next, AWS S3 adds an extra security layer by encrypting the snapshots using AWS KMS. If we encounter any data loss, or if the automatic CloudHSM recovery by AWS fails, we can use the backup to restore our data.
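The 90-day retention period above could be applied with a small helper like the one below. It is a sketch under assumptions: `retention_policy` and `apply_retention` are hypothetical names, `client` is assumed to be a boto3 `cloudhsmv2` client, and the 7 to 379 day bounds reflect my reading of the CloudHSM API reference rather than anything stated in this article.

```python
def retention_policy(days):
    """Build a BackupRetentionPolicy structure for a CloudHSM cluster.

    The 7..379 day range is an assumption taken from the API reference;
    the API expects the value as a string.
    """
    if not 7 <= days <= 379:
        raise ValueError("retention must be between 7 and 379 days")
    return {"Type": "DAYS", "Value": str(days)}


def apply_retention(client, cluster_id, days=90):
    """client is assumed to be boto3.client("cloudhsmv2")."""
    return client.modify_backup_retention_policy(
        ClusterId=cluster_id,
        BackupRetentionPolicy=retention_policy(days),
    )


if __name__ == "__main__":
    print(retention_policy(90))
```

Validating the number of days before calling the API turns a typo in the retention value into an immediate local error instead of a rejected request.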
4) Firewall and Event Monitoring
We use a security group as the network firewall to protect the CloudHSM. The principle for the inbound rules is to allow only TCP ports 2223-2225, with the backend EC2 as the source, for the dedicated and necessary connection between EC2 and CloudHSM. We can monitor and troubleshoot CloudHSM issues with the CloudHSM audit log in the CloudWatch log group and the client SDK logs. I use AWS Lambda to sync the CloudHSM audit log, and the CloudWatch agent to sync the client SDK logs from the backend EC2 server, to S3 buckets in an independent centralized account for SIEM per PCI 10.5.3.
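The inbound rule described above can be expressed as a single EC2 ingress permission. The sketch below is illustrative only: `hsm_ingress_rule` and `allow_backend` are hypothetical names, the security group IDs are placeholders supplied by the caller, and `ec2_client` is assumed to be a boto3 `ec2` client.

```python
def hsm_ingress_rule(backend_sg_id):
    """Allow TCP 2223-2225 only from the backend EC2 security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 2223,
        "ToPort": 2225,
        "UserIdGroupPairs": [
            {
                "GroupId": backend_sg_id,
                "Description": "CloudHSM client on backend EC2",
            }
        ],
    }


def allow_backend(ec2_client, hsm_sg_id, backend_sg_id):
    """ec2_client is assumed to be boto3.client("ec2")."""
    return ec2_client.authorize_security_group_ingress(
        GroupId=hsm_sg_id,
        IpPermissions=[hsm_ingress_rule(backend_sg_id)],
    )
```

Referencing the backend security group as the source, rather than a CIDR range, keeps the rule valid when backend instances are replaced and their private IP addresses change.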
In the next article, I will introduce how to build an automation workflow and a Business Continuity/Disaster Recovery plan for AWS CloudHSM using CloudHSM cross-region replication, AWS Health and AWS Lambda. Please feel free to leave your questions or suggestions for my HA CloudHSM design.
Principal Software Engineer @ Mastercard
1 year ago
I also worked with most of the same design requirements for AWS CloudHSM and came across the questions/challenges below, which had no easy solution or workaround a few months ago. I am pasting them here in case you faced any of them and have thoughts to share.
1. How do you achieve client-side load balancing and an uninterrupted HSM connection when the HSM instance attached to the HSM_IP/ENI_IP goes down and a new client is not able to connect to the HSM cluster?
2. CloudHSM cluster initialization process at https://docs.aws.amazon.com/cloudhsm/latest/userguide/initialize-cluster.html:
2.a: AWS mandates that the certificate used to sign the cluster CSR be a long-lived certificate (10 years). This is because, if the certificate expires or its private key is lost, we lose the cluster entirely, as it becomes inaccessible.
2.b: AWS CloudHSM has a technical limitation where the cluster signing certificate can only be a self-signed certificate, as opposed to a certificate that has an intermediate CA.
I am interested to know whether you created a separate CA for this process. Let me know your thoughts! Happy to chat.
Principal Software Engineer @ Mastercard
1 year ago
Good article! Just checking whether you wrote another article outlining the automation of cross-region replication with AWS CloudHSM, as you mentioned in this one. Let me know if you did. I will be happy to read it.