Compute.AI's Vertically Integrated Platform vs Distributed Shared Memory Spark

Introduction, Nomenclature & Jargon

Elastic clusters built on Distributed Shared Memory (DSM) have for the most part become the de facto industry standard for databases and warehouses. With in-memory processing coming into vogue over the last decade and a half, it became possible to rapidly build data processing engines that provide high performance (processing data in memory is faster than fetching it from disk/SSD), scalability, and protection from partial failure (a node failure does not nuke the job).

Note: Everything done at Compute.AI is on commodity infra on the major clouds. There is no specialized hardware support (CXL Memory, RDMA, etc.).

Now let's coin the term Vertically Integrated Cluster (VIC), as no term appears to exist for what we are about to describe. Vertical integration refers to using the memory hierarchy within a node (typically DRAM and SSD, but in some cases it can also include SCM). A VIC cluster is made up of any number of nodes, where a SQL statement (which we will call a job going forward) runs on a single node only. A VIC cluster can run thousands of jobs scheduled across the nodes of the cluster, but a single job never spans more than a single node. Vertical integration aims to provide the oomph of a cluster within a single node by beefing up a second tier of memory (for our purposes here, NVMe-based local SSD) to provide a different price/performance matrix for certain workloads/use cases. The ComputeAI Performance section below has some numbers.
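To make the idea concrete, here is a minimal sketch (illustrative only; the capacity and latency figures are placeholder assumptions, not measured numbers) of how a vertically integrated node can be modeled as a small hierarchy of tiers with very different capacity and latency characteristics:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Illustrative only: rough, made-up capacity/latency figures for one node.
struct MemoryTier {
    std::string name;
    uint64_t    capacity_gb;   // usable capacity in this tier
    double      latency_us;    // approximate access latency in microseconds
};

int main() {
    // A VIC node treats DRAM as a cache over a much larger NVMe tier,
    // with S3/EFS acting as the shared, durable table store.
    std::vector<MemoryTier> node = {
        {"DRAM",            64,      0.1},
        {"Local NVMe SSD",  2000,  100.0},
        {"S3 / EFS tables", 100000, 10000.0},
    };
    for (const auto& t : node)
        std::printf("%-16s %8llu GB  ~%.1f us\n",
                    t.name.c_str(),
                    static_cast<unsigned long long>(t.capacity_gb),
                    t.latency_us);
}
```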

For technical reasons we will disambiguate a DSM cluster from an elastic cluster, as in this article they mean slightly different things. DSM allows a single job to run across the whole cluster because the cluster presents a unified memory model to the job. Elasticity refers to the ability to increase/decrease the number of nodes based on load.

With cloud infra now coming of age, we can assume that the building blocks critical for designing today's solutions are generally available on the major clouds. Using AWS infra, let's build our system:

  • EC2 as the compute tier
  • S3 as an inexpensive storage tier
  • EBS, fast NVMe-based SSD storage attached to the EC2 compute instances, with configurable performance/SLAs, and
  • EFS, a more expensive but higher-performance file system layer with configurable SLAs

We will use the above AWS infra components as a template for our discussion.

Some DSM Related Challenges for Relational Compute

While there are undeniable benefits (mainly performance, scalability, and protection against partial failure) when running DSM systems, we see some practical challenges when doing relational/database compute:

  • Low CPU utilization in DSM clusters; CPU is barely at 30% in 5-8 node clusters and often goes lower as you increase the nodes
  • Even though most cloud service providers have been able to amortize the cost of memory regardless of horizontal or vertical scaling, the preferred model is to scale in-memory compute systems horizontally by adding more nodes (even though CPU utilization suffers)
  • Vertical scaling is also available (canonically called demand paging; the modern term is spill-to-disk); however, it is inefficient and results in large disk I/O waits
  • Starting a node often takes many minutes (the elasticity is less than near real-time)

Low CPU utilization with increasing nodes can be primarily attributed to shuffling data for certain relational operations like JOIN and GROUP BY. Think of the data movement as causing network I/O waits, much like spill-to-disk causes disk I/O waits. It is the combination of network and disk I/O that results in low CPU utilization (unless other workloads run in parallel to mask the I/O waits). Data skew often exacerbates shuffling and thus further hurts CPU utilization.

Given that most cloud compute is billed by time, low utilization, say 30%, is like leaving 70 cents per dollar on the table. Hence, maximizing CPU utilization is key to lowering the cost of compute. We have been calling this—making compute abundant.

To further add to the cost picture, we should look at memory, which is the dominant cost of compute infra. Scaling out is done because in-memory compute requires most of the working set to fit into memory for the best performance. If an adequate amount of memory is not present, the job runs out of memory and is terminated. This type of failure is called an out-of-memory failure (or colloquially, an OOM-kill).

In general, we are buying more compute nodes for their memory but are not able to utilize the CPU efficiently. This is the problem we have addressed at Compute.AI, technically and architecturally, using VIC. From a business perspective it is possible that many use cases/workloads (speculatively, over 80%) may be suitable for VIC.

An OOM-kill hurts reliability/SLAs for production workloads (especially when real-world data causes working set explosions). It also affects developer productivity, because an engineer has to empirically deduce the cluster size for a given job and then rerun it. However, OOM-kills are algorithmic failures, not bugs: a job runs when given an adequate amount of memory and must fail when the memory provided is insufficient. Aside from OOM-kills, today's relational compute platforms are generally robust for enterprise workloads. We may have gotten used to the idea of jobs getting OOM-killed, and rightly so, as there exists no trivial solution to the issue.

The Compute.AI Solution

In a line, ComputeAI is another version of Spark for SQL workloads that implements a tight vertical memory hierarchy using our VIC architecture, for a dramatic reduction in memory (and thus infra) without sacrificing performance. ComputeAI runs SQL directly on Parquet/Iceberg files. The product runs on AWS, Azure, GCP, and hybrid. It runs on laptops and even on a Raspberry Pi as an embedded system. However, the focus in this article is on server-side performance.

We have implemented vertical memory management with a memory hierarchy. Our spill-to-disk is highly optimized using AI-based fine-grained paging of data from memory to NVMe-based SSDs (AWS EBS), along with reads/writes to shared tables that reside on a high-performance distributed file system (AWS EFS or S3).
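As a rough illustration of the general idea only (Compute.AI's engine avoids the STL and uses AI-driven, variable-size paging; this toy sketch uses the STL for brevity, a plain LRU policy, and hypothetical names), a buffer pool that spills cold pages to a local NVMe-backed file instead of OOM-killing might look like this:

```cpp
#include <cstddef>
#include <cstdio>
#include <list>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: an LRU buffer pool that evicts cold pages to a local
// spill file and demand-pages them back in, instead of failing with an OOM.
class SpillingBufferPool {
public:
    SpillingBufferPool(std::size_t budget_pages, std::FILE* spill_file)
        : budget_(budget_pages), spill_(spill_file) {}

    // Returns the page, reading it back from the spill file if it was evicted.
    std::vector<char>& get(std::size_t page_id, std::size_t page_bytes) {
        auto it = resident_.find(page_id);
        if (it != resident_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second.lru_pos);  // mark hot
            return it->second.data;
        }
        if (resident_.size() >= budget_) evict_coldest();
        std::vector<char> data(page_bytes);
        // Demand-page from the spill file (offset scheme is illustrative; a
        // brand-new page that was never spilled simply comes back zeroed).
        std::fseek(spill_, static_cast<long>(page_id * page_bytes), SEEK_SET);
        std::fread(data.data(), 1, page_bytes, spill_);
        lru_.push_front(page_id);
        Slot& slot = resident_[page_id];
        slot.data = std::move(data);
        slot.lru_pos = lru_.begin();
        return slot.data;
    }

private:
    struct Slot {
        std::vector<char> data;
        std::list<std::size_t>::iterator lru_pos;
    };

    void evict_coldest() {
        std::size_t victim = lru_.back();
        lru_.pop_back();
        Slot& slot = resident_[victim];
        // Write the victim page back to the spill file before dropping it.
        std::fseek(spill_, static_cast<long>(victim * slot.data.size()), SEEK_SET);
        std::fwrite(slot.data.data(), 1, slot.data.size(), spill_);
        resident_.erase(victim);
    }

    std::size_t budget_;
    std::FILE* spill_;
    std::unordered_map<std::size_t, Slot> resident_;
    std::list<std::size_t> lru_;
};
```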

ComputeAI's AWS Architecture for a Vertically Integrated Elastic Cluster

In the architecture depicted above, memory is paged to EBS, which is local to the node. Shared tables can be written back to S3 or EFS, and they can be accessed concurrently by adding EC2 instances/nodes. This allows thousands of jobs to run concurrently on shared tables. These are primarily read-only workloads; we handle DDL & DML using Iceberg (refer to this article for details).

Here's a way to conceptually view this architecture...

When ComputeAI sees an EC2 cluster it sees CPU/cores, whereas a typical in-memory engine sees it primarily as a source of memory. So even though a single job runs on only a single node, one can have concurrency that is limited only by S3/EBS/EFS latency and bandwidth.
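Scheduling in this model reduces to placing whole jobs onto nodes rather than sharding one job across nodes. The toy sketch below (hypothetical names and numbers, not the product's scheduler) captures the placement rule:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Toy VIC-style placement: every job runs entirely on one node; the cluster
// only provides concurrency across many independent jobs.
struct Node { std::string id; int free_cores; };
struct Job  { std::string sql_id; int cores_needed; };

int main() {
    std::vector<Node> cluster = {{"ec2-a", 16}, {"ec2-b", 16}};
    std::vector<Job>  jobs    = {{"q1", 8}, {"q2", 8}, {"q3", 16}, {"q4", 4}};

    for (const Job& job : jobs) {
        bool placed = false;
        for (Node& node : cluster) {
            if (node.free_cores >= job.cores_needed) {
                node.free_cores -= job.cores_needed;   // the job never spans nodes
                std::printf("%s -> %s\n", job.sql_id.c_str(), node.id.c_str());
                placed = true;
                break;
            }
        }
        if (!placed)
            std::printf("%s queued (wait for a node to free up, or add one)\n",
                        job.sql_id.c_str());
    }
}
```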

We will do a deep dive into how we designed and implemented this system in a separate article. In short, what you get is extremely high CPU utilization, very low system time (that is, most compute, aka User CPU in Linux, is given to the SQL job), and the ability to run complex workloads without having to provision memory for the worst case to prevent an OOM-kill.

To get into massive memory overcommitment in some more detail, achieving ~300x memory overcommit is very realistic with ComputeAI. At 10x memory overcommit we barely see any loss in performance for most complex compute. In other words, with an increasing number of JOIN and GROUP BY operations, coupled with a large data/working set, and finally some skew, our system starts to shine. Why?

Conceptually, we are able to better utilize a vertical memory hierarchy than many DSM systems that use horizontal memory for scaling. Without getting into the specifics of 10 Gb, 40 Gb, or higher-speed networks, we are able to better utilize the system with a small amount of memory, many cores, and a large amount of NVMe-based SSD.

While it is a no-brainer to see how single beefy nodes can outperform a cluster for the aforementioned reasons, the differentiation here is stock cloud infra being used for massive memory overcommitment with no appreciable loss in performance. That is, doing the work with 20% of the memory, running just as fast, and without OOM-kills for relational compute. The SSDs used to virtualize memory are 3 orders of magnitude slower than memory, which makes this technically challenging (and explains why most systems today are in-memory systems). Instead of using 20% of the memory we can run with 1% or even 0.1% of the memory that another database system requires, and the jobs will still run. There is no theoretical limit to memory overcommitment as long as the virtualized memory does not result in thrashing (an application- and working-set-dependent behavior where the data in memory depends on data on disk and we end up in an endless paging cycle). Detecting such cycles is a roadmap item for us.
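For flavor, here is one way a thrashing signal could be approximated (a hypothetical sketch only; as noted above, cycle detection is a roadmap item and this is not what the product does): track how often a page faults back in shortly after being evicted.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical heuristic: if pages keep faulting back in shortly after being
// evicted, the working set no longer fits and paging is likely cycling.
class RefaultMonitor {
public:
    explicit RefaultMonitor(std::size_t window) : window_(window) {}

    // Call on every page fault; `was_recently_evicted` comes from the pager.
    void record_fault(bool was_recently_evicted) {
        history_.push_back(was_recently_evicted);
        if (history_.size() > window_) history_.pop_front();
    }

    // A high re-fault ratio over the window suggests a paging cycle.
    bool thrashing(double threshold = 0.5) const {
        if (history_.size() < window_) return false;
        std::size_t refaults = 0;
        for (bool r : history_) refaults += r ? 1 : 0;
        return static_cast<double>(refaults) / history_.size() > threshold;
    }

private:
    std::size_t window_;
    std::deque<bool> history_;
};
```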

Compute.AI's spill-to-disk uses fine-grained demand paging and is driven by AI algorithms. We characterize the machine as the system comes up, within a millisecond; this is our supervised AI model that knows about all the infra: memory, cores, latency/bandwidth to different endpoints, etc. Later we use the power of a declarative language, SQL, and read the blueprint of the plan along with its metadata (all the way from Parquet/Iceberg to indexing/metadata changes during plan execution) to train the model during its unsupervised execution. That, along with a powerful microkernel that uses variable page sizes (as a result of compression/decompression), is incredibly efficient at managing data within the memory hierarchy for relational operators to work on. Paging of data to local SSDs (which serve as a cache or swap) is done using rows or columns (mixed), depending on the plan. There is much here to call for a separate discussion :). For now, we just focus on the high-level results.
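To give a flavor of what characterizing the machine at startup could involve (a hypothetical sketch, not the actual supervised model; the probe file path is made up), one might time small reads against the spill device and feed the result to the paging policy as a feature:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical startup probe: time small reads from a spill file to get a
// rough latency estimate that a paging policy could use as a model feature.
static double probe_read_latency_us(std::FILE* f, std::size_t bytes, int samples) {
    std::vector<char> buf(bytes);
    double total_us = 0.0;
    for (int i = 0; i < samples; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        std::fseek(f, 0, SEEK_SET);
        std::fread(buf.data(), 1, bytes, f);
        auto t1 = std::chrono::steady_clock::now();
        total_us += std::chrono::duration<double, std::micro>(t1 - t0).count();
    }
    return total_us / samples;
}

int main() {
    // "/tmp/spill.probe" is an illustrative path, not a real product artifact.
    if (std::FILE* f = std::fopen("/tmp/spill.probe", "rb")) {
        std::printf("approx read latency: %.1f us\n",
                    probe_read_latency_us(f, 4096, 32));
        std::fclose(f);
    }
}
```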

Benefits of ComputeAI's VIC Implementation

At a quick glance here is what we get:

  • Smaller memory footprint
  • Highly optimal CPU usage
  • 10x memory overcommit without an appreciable loss of performance
  • No OOM-kills
  • A no-touch system that elastically autoscales just like a cloud data warehouse

ComputeAI Performance

We are often able to reduce memory by ~5x while providing a 10x improvement in performance for complex SQL (a combined ~50x gain that seems hard to believe). However, our focus has not been on performance. For us it is about reliability, ease of use, and making optimal use of infra. Products like Trino are much faster than ComputeAI when given an adequate amount of memory. What we bring to the table is protection against OOM-kills regardless of how much memory is on the system. Of course, with more memory we run faster.

Note: Better CPU utilization is not necessarily a good thing :). If there is work that needs to be done and the system cannot utilize the idle CPU, then there is an issue. But if the work can be done with low CPU utilization, that is efficient. In ComputeAI's throughput-based computing model, the best measure of whether the system is doing better is throughput. TPC-H and TPC-DS benchmarks are one such measure of throughput.

Here are some numbers to give a sense for where this product currently stands:

TPC-H

  • ~2x faster than some* Spark implementations for TPC-H (8.5x faster on 5 most complex queries, and 3.5x on 10 most complex queries)
  • Queries 3 & 10 run into OOM-kills (on the Spark side) and require memory over-provisioning to run

TPC-DS

As rated by query complexity we have the following results:

  • 8% of the queries are ~6x faster on ComputeAI
  • 15% of the queries are >4x faster on ComputeAI
  • 40% of the queries are >2x faster on ComputeAI
  • Overall: 75% of ComputeAI queries are faster than AWS EMR Spark
  • 25% of the queries run slower on ComputeAI than AWS EMR Spark (we are working on the necessary CBO/optimizations to get into the >2x speedup that we minimally achieve elsewhere)

[*We benchmarked AWS EMR Spark vs ComputeAI for the above numbers for SF1 to SF1000. Apache Spark does worse than AWS EMR Spark.]

Real-World

  • ~10x faster when running real-world workloads in memory constrained environments
  • We are able to handle >300x memory overcommitment with no appreciable performance drop at 10x overcommitment

Feel free to try our free GitHub download. We can assist you on our Discourse if you need help with the product. The exact machine configs, etc., that were used for the benchmarks are well documented and can be shared if needed. We are happy to amend this article in the event of any error. In fact, we do not believe that benchmark numbers are representative of real-world workloads and dissuade folks from attributing much importance to them. We believe these numbers were benchmarked using comparable cloud infra, though it should be noted that increasing EBS bandwidth and IOPS severely impacted EMR Spark's performance. The choice of EMR Spark as the comparison is not a comment on that product; it was readily available for us to use with a few clicks and offers a useful perspective.

What's Novel

Making spill-to-disk highly efficient at such massive levels of overcommit requires new algorithms that are highly specialized to run an "ultra clean" system, as we describe below. Fetching a page from SSD into memory even a millisecond too soon can result in thrashing, given how contentious resources become under memory overcommitment. Standard prefetch algorithms do not work. Memory is just a buffer cache of table data and metadata; the AI-driven paging algorithms push the spatial and temporal limits of caching to the extreme. The fact that SQL is a declarative language is leveraged by our design: a SQL plan is a DAG with the data layout etched onto it. Our model is not suitable for imperative languages (e.g., C, C++, Java, Python, etc.) that have loops and conditionals and hence do not compile into a DAG.
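To see why a declarative plan helps, here is a highly simplified sketch (hypothetical structures, not the product's plan representation) of how a DAG-shaped plan lets the pager compute the order in which pages will be needed:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical: a plan node knows its children and the pages its inputs live
// on, so the pager can schedule reads just in time, in plan order.
struct PlanNode {
    std::string op;                    // e.g. "SCAN", "JOIN", "GROUP BY"
    std::vector<std::size_t> pages;    // pages this operator will touch
    std::vector<PlanNode*> children;
};

// Post-order walk: producers (children) are visited before consumers
// (parents), which is roughly the order their pages will be needed.
void schedule_prefetch(const PlanNode& node, std::vector<std::size_t>& order) {
    for (const PlanNode* child : node.children)
        schedule_prefetch(*child, order);
    order.insert(order.end(), node.pages.begin(), node.pages.end());
}
```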

A new two-level threading runtime does ultra-fast context switches, using Thread Local Storage for managing thread stacks. A huge cost of a context switch is actually the thread stack context, not the pointer switch between threads of execution. Here, we have paid careful attention to job scheduling and dependency management to provide data affinity to processor cores.
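A very rough sketch of the two-level idea follows (illustrative only and far simpler than the real runtime; it sidesteps suspension entirely by running lightweight tasks to completion on a small pool of OS worker threads, each with thread-local scratch instead of per-task stacks):

```cpp
#include <atomic>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// Level 1: a few OS worker threads. Level 2: many lightweight tasks that run
// to completion, so "switching tasks" is just calling the next function
// rather than saving/restoring a full thread context.
thread_local std::vector<char> scratch(1 << 20);  // per-worker scratch buffer

int main() {
    std::vector<std::function<void()>> tasks;
    for (int i = 0; i < 64; ++i)
        tasks.push_back([i] {
            std::printf("task %d using %zu bytes of worker scratch\n",
                        i, scratch.size());
        });

    std::atomic<std::size_t> next{0};
    auto worker = [&] {
        for (std::size_t t; (t = next.fetch_add(1)) < tasks.size(); )
            tasks[t]();   // run-to-completion: no per-task stack to swap
    };

    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```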

Since the system is designed for analytics workloads, a throughput-based style of computing is suitable. In addition, we avoid or minimize the use of pthread_mutex()-style primitives to prevent sleeps in the kernel or processor pipeline stalls due to synchronization across a large number of cursors. Logical clocks and ordering-based algorithms are used instead. If we must sleep, we do so in userland.
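To illustrate the general pattern (not Compute.AI's actual algorithms), here is a minimal example where many cursors claim distinct, totally ordered output slots with a single atomic counter instead of sleeping on a mutex:

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative lock-free pattern: cursors order themselves with an atomic
// counter instead of sleeping on a mutex in the kernel.
int main() {
    constexpr std::size_t kRows = 100000;
    std::vector<long> out(kRows, 0);
    std::atomic<std::size_t> next_slot{0};

    auto cursor = [&](long tag) {
        // Each fetch_add hands this cursor a unique, totally ordered slot.
        for (std::size_t s;
             (s = next_slot.fetch_add(1, std::memory_order_relaxed)) < kRows; )
            out[s] = tag;   // no lock: slot ownership is the ordering
    };

    std::vector<std::thread> cursors;
    for (long t = 1; t <= 8; ++t) cursors.emplace_back(cursor, t);
    for (auto& th : cursors) th.join();
}
```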

Even though the workload is primarily read-only from a SQL perspective, there are many synchronization points for the threads as a result of data partitioning, fragmentation, scatter-gather, and the management of hash and tree structures by multiple cursors (threads of execution that each pick up a part of the SQL plan). Multiple SQL plans (jobs) can run concurrently, and hence memory management and scheduling decisions are made at plan operator boundaries. AI models make some of the policy decisions used for paging data in/out as well as for the eviction of clean pages.

We have incorporated a built-in, software-only emulation of an MMU for paging data of arbitrary sizes to their respective locations on block storage (say, EBS). Beyond managing page mappings, it emulates the MMU Referenced & Modified bits to give the AI-based paging additional hints. All paging needs to be just-in-time, which is one reason the traditional paging algorithms we grew up with don't quite perform :).
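For context, a hardware MMU maintains a Referenced and a Modified bit per page; a software emulation keeps the same bits per logical page and can use them to drive a clock-style (second-chance) eviction policy. A minimal sketch with hypothetical structure names, not the actual implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical software page table entry: the Referenced/Modified bits a
// hardware MMU would maintain, tracked in software per logical page.
struct SoftPte {
    uint64_t block_offset = 0;   // where the page lives on block storage
    bool     resident   = false;
    bool     referenced = false; // set on every access
    bool     modified   = false; // set on writes; dirty pages must be flushed
};

// Clock (second-chance) eviction: skip pages whose Referenced bit is set,
// clearing it as we go, and evict the first page found with it clear.
// Assumes at least one resident page exists.
std::size_t pick_victim(std::vector<SoftPte>& table, std::size_t& hand) {
    for (;;) {
        SoftPte& pte = table[hand];
        std::size_t current = hand;
        hand = (hand + 1) % table.size();
        if (!pte.resident) continue;
        if (pte.referenced) { pte.referenced = false; continue; }
        return current;  // caller flushes if pte.modified, then reuses the frame
    }
}
```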

A system like ComputeAI is not a fair-weather system. It was designed to run in inclement weather while not sacrificing performance, even at ~10x memory overcommitment. Whatever the level of memory overcommitment, we seek to provide good performance. Also, we protect the "slow path" of the code from running into out-of-memory issues (still a best effort). The ability to roll back the work done by any SQL plan operator (many plans can execute concurrently) provides greater insurance when managing memory. Better memory management results in better processor utilization (this article discusses how memory and CPU are two sides of the same coin).

There is no use of the STL either, as the ComputeAI microkernel controls all resources very tightly (STL data structures and their locking can present overhead when working at this level). The code is written in C++, but the choice is primarily driven by the need for module isolation and strong interfaces (not objects, inheritance, generics, STL, etc.).

The key is to move the data between the memory tiers while masking the I/O latencies with computation. Having the utmost control over the microkernel of the relational compute platform results in a super clean system where Linux System Time usage is very low (1-2%), there are almost no I/O waits, and the User CPU usage is maxed out when there is work that needs to be done.

With the core-to-memory ratios of processors worsening, and with normal design patterns (POSIX threads, mutexes, condvars, STL, etc.), the likelihood of memory stalls is higher. It is for these reasons that we avoid those design patterns. Memory stalls look like busy CPU, so even the <30% CPU utilization that we see in elastic clusters potentially includes memory stalls. We do not have a measure of how much we might have helped on that front, but all indications are that we have made a dent :)

The operator and CBO code is written in a single-threaded manner and without any locks. It is naturally parallelized by the underlying microkernel, which is solely responsible for all the tradeoffs between consistency and concurrency. Building the system using this paradigm allowed our team to rapidly implement all the relational operators and ~600 built-in functions. The system was robust even as it came out of the oven, and with over 100K lines of ChatGPT-generated SQL for QA we hope to have hardened it enough to put an "enterprise grade" stamp on it.
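The contract can be pictured roughly as follows (a hypothetical interface, not the actual microkernel API): the operator author writes a plain single-threaded function over one partition, and the runtime fans it out across partitions.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical: operator authors write single-threaded, lock-free code over
// one partition; the runtime decides how many partitions run in parallel.
struct Partition { std::vector<long> values; long sum = 0; };

// Single-threaded operator body: no locks, no awareness of other partitions.
void sum_operator(Partition& p) {
    for (long v : p.values) p.sum += v;
}

// The "microkernel" side: fan the same body out across partitions.
void run_parallel(std::vector<Partition>& parts) {
    std::vector<std::thread> workers;
    for (auto& p : parts) workers.emplace_back(sum_operator, std::ref(p));
    for (auto& w : workers) w.join();
}
```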

Between 4 engineers and a part-time contractor for DevOps and CloudOps, in ~2.5 years we built the relational compute platform, made it work with Spark/Presto/Trino SQL, provided support for most toolchains that come with the open source ecosystems, ported it to all 3 major clouds and hybrid, did many integrations including MinIO, and created an elastic cluster product that can be deployed using Kubernetes, AWS EC2 via CloudFormation, and AWS EMR orchestration. That time included doing some PoCs, getting a Wall St. bank into pre-production, and unfortunately ~5 months of wasted work for another bank (a fully managed service product built to spec) that did not jump in due to management changes.

Conclusion

Having built DSM systems and been in love with them forever (Solaris at Sun Microsystems, the Spring Microkernel at SunLabs, the MediaBase Video Server at SGI, Exadata at Oracle, and Xcalar at my previous startup), I have learnt a thing or two about them :). In sharing the VIC architecture we built at Compute.AI, I hope to represent our small but talented team of 4 engineers (including me) and share our thoughts. We are exploring use cases and see some good opportunities ahead of us.

Given enough time, our preference would be to do both horizontal (DSM) and vertical (VIC) integration to allow the best of both worlds. It is just a time/money equation that every startup wanting to innovate struggles through :). So think of VIC as Phase 1. DSM will be Phase 2 (I hope to post a future article on how we plan to tackle that using plan federation).

Another area where we make a dent is trading DRAM for Flash/SSD while holding up good performance. This means fewer watts dissipated (DRAM burns much hotter than SSD). This is in line with Compute.AI's vision of reducing the carbon footprint of compute.

Introducing new technology is always challenging for an early stage startup and hence we are working with partners where we can provide value to their platforms. In the meantime, we hope these ideas offer some freshness as we reimagine distributed systems. There is a beautiful world of compute out there with many great products to choose from. We hope to have created a small but significant value for that world.

Related Articles

Why Was Compute.AI Founded

Harnessing the Power of Iceberg

Do We Need Another Version of Spark

An Approach to Database Fine-Grained Access Controls
