Cloudera Big Data Platform Advantages on IBM Power Systems and ESS
Last year, I created and published a blueprint for running Cloudera Data Platform on IBM Power and ESS: Cloudera Data Platform (CDP) Private Cloud Base on IBM Power and IBM Elastic Storage System (ESS).
Figure 1: Cloudera CDP DC 7.1.7 SP1 running on IBM POWER and ESS
Cloudera has also published a comprehensive Architectural Blueprint for running CDP:
Cloudera Data Platform - Data Center (CDP-DC) Reference Architecture
What supporting statements are essential when building a Power and ESS based solution?
At first glance you might believe there is “missing memory” or “not enough compute” in an IBM proposal. This is because we design our proposals around Cloudera's strategic direction of separating compute and storage requirements, which allows us to scale each independently and right-size the setup to your particular need.
Figure 2: Cloudera Modernization Strategy, Segregated Compute and Storage
Compute
This is what Cloudera has to say about the compute requirements:
This means that for typical Spark jobs the POWER9 processor, with 8 hardware threads per core, is ideally suited to Big Data workloads.
Evidence: Cloudera treats 10 POWER cores as 80 logical cores/threads.
Figure 3: 10 Core POWER virtual machines having 80 logical processors.
In a typical distributed x86 environment, a large number of data nodes is required to hold three copies of the data. With three replicas, only the three nodes holding a copy have direct access to that data. For job-scheduling purposes, 3x 20-core x86 nodes are therefore comparable to a single 15-core POWER9 partition, which can run as many simultaneous jobs.
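The thread arithmetic behind this comparison can be sketched as follows. This is a minimal illustration, not a sizing tool: the 8 threads per POWER9 core comes from the 10 cores / 80 logical processors figure above, while the 2 threads per x86 core is an assumption (typical two-way hyper-threading) that the article does not state. The function name `logical_threads` is hypothetical.

```python
def logical_threads(nodes: int, cores_per_node: int, threads_per_core: int) -> int:
    """Return the number of logical processors the scheduler can see."""
    return nodes * cores_per_node * threads_per_core


# From the article: 10 POWER cores are treated as 80 logical cores/threads.
POWER9_THREADS_PER_CORE = 8
# Assumption (not stated in the article): 2 hardware threads per x86 core.
X86_THREADS_PER_CORE = 2

power_vm_threads = logical_threads(1, 10, POWER9_THREADS_PER_CORE)
x86_threads = logical_threads(3, 20, X86_THREADS_PER_CORE)
power_partition_threads = logical_threads(1, 15, POWER9_THREADS_PER_CORE)

print("10-core POWER VM:", power_vm_threads, "logical processors")
print("3x 20-core x86 nodes:", x86_threads, "logical processors")
print("15-core POWER9 partition:", power_partition_threads, "logical processors")
```

Under these assumptions both sides expose the same number of logical processors, which is the sense in which the two configurations are comparable for job scheduling.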
But wait, it gets better. Because we use a shared-everything architecture, the artificial limitation caused by data locality, which restricts performance dramatically on x86 architectures, does not apply. In contrast, on the POWER and ESS solution the full compute capacity of the cluster can access the same byte of data. This prevents data from becoming imbalanced, removes the need to shut down the cluster for periodic rebalancing, and provides better scalability.
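The difference in data reach can be illustrated with a small sketch. It follows the article's simplified premise that, with replicated local storage, only the nodes holding a replica work on a given block, whereas with shared storage every compute node can. The cluster size of 12 nodes and the function name `nodes_with_direct_access` are hypothetical, chosen only for illustration.

```python
def nodes_with_direct_access(cluster_nodes: int, replicas: int, shared_storage: bool) -> int:
    """Count the compute nodes that can reach a given block directly."""
    if shared_storage:
        # Shared-everything: every compute node sees every byte.
        return cluster_nodes
    # Replicated local storage: only the nodes holding a copy.
    return min(replicas, cluster_nodes)


CLUSTER_NODES = 12  # hypothetical cluster size, not from the article
REPLICAS = 3        # three copies of the data, as described in the article

local = nodes_with_direct_access(CLUSTER_NODES, REPLICAS, shared_storage=False)
shared = nodes_with_direct_access(CLUSTER_NODES, REPLICAS, shared_storage=True)

print(f"Replicated local storage: {local} of {CLUSTER_NODES} nodes")
print(f"Shared storage: {shared} of {CLUSTER_NODES} nodes")
```

In the replicated case the count stays at three no matter how large the cluster grows, while in the shared case it grows with the cluster.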
Memory
Separating compute and storage, as per Cloudera's strategy, allows each IBM POWER node to access all the data. With 3 x86 data nodes, each holding 384 GB of memory, a compute block of 40 TB of data addresses 768 GB RAM. With our disaggregated solution, the memory of all nodes is aggregated across the whole Spark cluster. For typical cluster sizes, this provides 3x more available memory for the IBM solution and reduces the separate investment in memory and storage.
Availability
The reference architecture documents also support the thesis that an OS/log data size of 500GB+ should be sufficient. There is no practical value in reserving large areas for OS and kernel dumps on data nodes, as they are expected to fail without affecting the overall solution. Since hardware and OS failures will not impact the solution architecture, only application errors would cause problems that need further investigation. These are user-space application errors and would not be captured in a system dump.
What's more, the POWER and ESS solution can be extended on the storage side to provide data replication (synchronous or asynchronous, depending on distance) for both Active/Active and Production/DR scenarios. In such cases, care must be taken to replicate the HDFS (Hive) metadata along with the solution and to keep a consistent namespace in order to avoid issues when failing over.
Figure 4: Potential Multi Site CDP design with POWER and ESS.
Hence the IBM Power and ESS solution is ideally suited to an efficient Cloudera CDP DC workload implementation on a single site or across multiple locations.