Deep Dive: Caching in the New Planning Analytics Engine
Planning Analytics Engine (PAE) is the next-generation IBM Planning Analytics (PA) database, also known as TM1 v12. PAE is available on IBM Cloud Pak for Data and IBM Planning Analytics as a Service. PAE is a step-change for the Planning Analytics product, introducing many new features, including horizontal scalability and a new memory allocator (more on this later). Horizontal scalability is achieved via a leader-follower paradigm and has the potential to greatly improve PAE's performance under load. As you might expect, this new architecture has an impact on how caching works.
Caching in Planning Analytics
Caching in Planning Analytics improves performance by persisting frequently accessed data (including aggregations). The primary caching construct in PAE is the Stargate view: a calculated subsection of a cube. A Stargate view contains only the data for a section of the larger cube, without information such as formatting or user selections.
PAE creates and stores a Stargate view when a query takes longer to retrieve than the time threshold defined by the View Maximum Time (VMT) property. A Stargate view persists only as long as the cache remains valid; data or hierarchy changes to dependent objects invalidate the cache and remove the Stargate view from storage. The amount of memory allocated to store Stargate views is controlled by the View Memory Maximum (VMM) property.
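As a rough illustration, these properties can be adjusted per cube through the }CubeProperties control cube, for example with TM1py. This is a minimal sketch: the connection details, cube name, and threshold values below are hypothetical, not recommendations.

```python
# Sketch: setting VMT/VMM per cube via the }CubeProperties control cube.
# The base_url, credentials, cube name "Sales", and values are placeholders.
from TM1py import TM1Service

with TM1Service(base_url="https://pa-instance.example.com/api/v1",
                user="admin", password="secret", ssl=True) as tm1:
    # VMT: seconds a query must exceed before its Stargate view is stored
    tm1.cells.write_value("5", "}CubeProperties", ("Sales", "VMT"))
    # VMM: memory (in KB) reserved for Stargate views on this cube
    tm1.cells.write_value("131072", "}CubeProperties", ("Sales", "VMM"))
```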
This Wasn't What I Thought I'd Write About
I originally planned this blog to be about how read replicas and the new memory allocator could be used to speed up query responses and control RAM consumption in PAE. However, the testing led me on an interesting journey, resulting in an official IBM tech-note and a much better understanding of why caching continues to be such an important part of PA model construction, even with the introduction of horizontal scalability.
This entire adventure started with help from Dat Nguyen and Morgan Ma when we set up a new CP4D instance (not for the faint of heart). With PAE up and running, I really wanted to run some performance tests to see how it might compare to prior versions that lacked horizontal scaling.
To start testing I needed a large but generic cube. Remembering a blog I had read recently, I based the test on the cube described in Yuri Kudryavcev's blog "Processing a billion+ cells in Planning Analytics". Next, using TM1py, I put together a script to run queries in parallel batches to see what impact adding replicas, aka followers, would have. When I started the script, the parallel query count climbed to 40... and the service crashed, over and over.
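For readers who want to try something similar, here is a sketch of the kind of parallel query harness involved. The MDX, cube name, and connection details are placeholders, not the actual test configuration.

```python
# Sketch: ramp up parallel MDX queries against PAE and time each batch.
# Each worker opens its own TM1py session, mirroring independent clients.
import time
from concurrent.futures import ThreadPoolExecutor
from TM1py import TM1Service

MDX = """SELECT {[Measure].[Amount]} ON COLUMNS,
         {TM1SubsetAll([Customer])} ON ROWS FROM [BigCube]"""

def run_query(_):
    with TM1Service(base_url="https://pa-instance.example.com/api/v1",
                    user="load_test_user", password="secret", ssl=True) as tm1:
        start = time.perf_counter()
        tm1.cells.execute_mdx(MDX)
        return time.perf_counter() - start

# Increase the number of parallel queries batch by batch
for parallel in (1, 5, 10, 20, 40):
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        timings = list(pool.map(run_query, range(parallel)))
    print(f"{parallel:>3} parallel queries: "
          f"avg {sum(timings) / len(timings):.2f}s, max {max(timings):.2f}s")
```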
Containers Out of Memory and a Tech Note
While the tests ran I monitored the progress from the PA Admin Console. As the query load rose, so did the RAM consumption. This was generally expected, but the new memory allocator was supposed to release unused memory back to the host. Instead, memory consumption continued to grow, and grow, until there was none left. Nothing seemed to prevent PAE from consuming all the RAM available to the container.
This left me a bit confused: what about the new allocator, and what was actually happening here? Enter the experts at IBM support. After a few working sessions we figured it out: RAM consumption, and its subsequent release back to the host, is user/session dependent. I had been using the same user account both to run the tests and to monitor their progress, so simply watching the tests run prevented PAE from returning memory to the host.
After moving the tests to a unique user, the new memory allocator worked as I expected.
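To make that takeaway concrete, here is a minimal sketch of the separation, assuming hypothetical user accounts and connection details (TM1py shown; the same idea applies to any client):

```python
# Sketch: keep the load-generating session and the monitoring session on
# separate user accounts, so that ending the test user's sessions lets
# PAE release that user's memory back to the host. Names are illustrative.
from TM1py import TM1Service

load_session = TM1Service(base_url="https://pa-instance.example.com/api/v1",
                          user="load_test_user", password="secret", ssl=True)
monitor_session = TM1Service(base_url="https://pa-instance.example.com/api/v1",
                             user="monitor_user", password="secret", ssl=True)
try:
    # ... run queries with load_session, watch progress with monitor_session ...
    pass
finally:
    # Logging the test user out is what allows its memory to be reclaimed
    load_session.logout()
    monitor_session.logout()
```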
Replica Performance and the Connection to Caching
Now that the tests were actually working, I wanted to know: do more replicas mean better query performance?
Yes, in most of my tests more replicas resulted in better aggregate performance. In the graph below you can see that as the number of parallel queries increases, total response times improve with more replicas.
If we look at this data slightly differently, we see something interesting: aggregate performance is better, but that's not the whole story. Zooming in on how responses change over time tells a different story. When a query is "small" and the total time to complete is always less than VMT, replicas increase performance considerably. The graph below shows that PAE sends queries to the replicas gradually as traffic on the leader increases, and that for uncached workloads more read replicas mean faster queries.
When the size and complexity of the query increases, the behavior changes. In this test the query always takes longer than 5 seconds to execute (i.e., longer than VMT). This is when things get interesting. Once the query load rose above what the leader could handle, queries were sent to the read replicas as expected. However, just like querying the leader for the first time, the first query sent to a replica needs to establish the local Stargate cache.
The chart below shows that as the number of parallel queries increases beyond the leader's capacity, the load is shared with a replica. The first queries to arrive at a replica must establish the local Stargate cache. More interestingly, once a query result can be retrieved from cache, the number of replicas has a negligible impact on performance.
Each node, leader and replica alike, manages its own local version of the cache.
In a similar test, replicas were added after the cache had been established on the leader node. In this case, assigning queries to the replicas actually degraded performance because the replicas did not have a valid cache. The chart below shows that once a cache has been established on a replica, query performance returns to baseline.
It is important to note that you can't control the cache on replicas: cache events, including ViewConstruct, are not shared with other nodes.
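Because cache events are not propagated between nodes, one pragmatic (if imprecise) warm-up approach is to replay representative heavy queries with enough parallelism that the router spreads them across the nodes, letting each replica build its own Stargate views. A sketch, with placeholder queries and credentials:

```python
# Sketch: warm Stargate caches after a data load by replaying heavy queries
# in parallel. Because requests are routed by load, enough concurrent
# repetitions should eventually hit every node; there is no documented way
# to target a specific replica directly.
from concurrent.futures import ThreadPoolExecutor
from TM1py import TM1Service

# Representative queries (placeholders) that run longer than VMT
HEAVY_MDX = [
    "SELECT {[Measure].[Amount]} ON 0, {TM1SubsetAll([Customer])} ON 1 FROM [BigCube]",
]

def warm(mdx):
    with TM1Service(base_url="https://pa-instance.example.com/api/v1",
                    user="warmup_user", password="secret", ssl=True) as tm1:
        tm1.cells.execute_mdx(mdx)

# Fire each query several times in parallel so some land on the replicas,
# which then build their own local caches.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(warm, HEAVY_MDX * 8))
```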
Setting Expectations
Like most tests, the situations above are not very representative of real system usage patterns, but they help explain how a system works. PA environments tend to be very dynamic, with constant data changes and a high rate of cache misses, so the performance benefit provided by replicas may be greater in real-life use. I recommend reviewing cache settings and tuning them the same way you would in legacy versions of PA. Note that simply adding replicas won't make all queries faster or improve performance for every user; in some cases it might not help at all.