Deep Dive: Caching in the New Planning Analytics Engine
Planning Analytics Engine (PAE) is the next-generation IBM Planning Analytics (PA) database, also known as TM1 v12. PAE is available on IBM Cloud Pak for Data and IBM Planning Analytics as a Service. PAE is a step-change for the Planning Analytics product, introducing many new features, including horizontal scalability and a new memory allocator (more on this later). Horizontal scalability is achieved via a leader-follower paradigm and has the potential to greatly improve PAE's performance under load. As you might expect, this new architecture has an impact on how caching works.
Caching in Planning Analytics
Caching in Planning Analytics improves performance by persisting frequently accessed data (including aggregations). The primary caching construct in PAE is the Stargate view: a calculated subsection of a cube. A Stargate view contains only the data for a section of the larger cube, without information such as formatting or user selections.
PAE creates and stores a Stargate view when a query takes longer to retrieve than the time threshold defined by the View Maximum Time (VMT) property. A Stargate view persists only as long as the cache remains valid; data or hierarchy changes to dependent objects invalidate the cache and remove the Stargate view from storage. The amount of memory allocated to store Stargate views is controlled by the View Memory Maximum (VMM) property.
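As a rough illustration, these properties can be adjusted per cube through the }CubeProperties control cube, for example with TM1py. This is a minimal sketch: the connection details, cube name, and threshold values below are hypothetical, not recommendations.

```python
# Sketch: setting VMT/VMM per cube via the }CubeProperties control cube.
# The base_url, credentials, cube name "Sales", and values are placeholders.
from TM1py import TM1Service

with TM1Service(base_url="https://pa-instance.example.com/api/v1",
                user="admin", password="secret", ssl=True) as tm1:
    # VMT: seconds a query must exceed before its Stargate view is stored
    tm1.cells.write_value("5", "}CubeProperties", ("Sales", "VMT"))
    # VMM: memory (in KB) reserved for Stargate views on this cube
    tm1.cells.write_value("131072", "}CubeProperties", ("Sales", "VMM"))
```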
This Wasn't What I Thought I'd Write About
I originally planned this blog to be about how read replicas and the new memory allocator could be used to speed up query responses and control RAM consumption in PAE. However, the testing led me on an interesting journey, resulting in an official IBM tech-note and a much better understanding of why caching continues to be such an important part of PA model construction, even with the introduction of horizontal scalability.
This entire adventure started with help from Dat Nguyen and Morgan Ma when we set up a new CP4D instance (not for the faint of heart). With PAE up and running, I really wanted to run some performance tests to see how it might compare to prior versions that lacked horizontal scaling.
To start testing I needed a large but generic cube. Remembering a blog I had read recently, I based the test on the cube described in Yuri Kudryavcev's blog "Processing a billion+ cells in Planning Analytics". Next, using TM1py, I put together a script to run queries in parallel batches to see what impact adding replicas, aka followers, would have. When I started the script, the parallel query count climbed to 40... and the service crashed, over and over.
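For readers who want to try something similar, here is a sketch of the kind of parallel query harness involved. The MDX, cube name, and connection details are placeholders, not the actual test configuration.

```python
# Sketch: ramp up parallel MDX queries against PAE and time each batch.
# Each worker opens its own TM1py session, mirroring independent clients.
import time
from concurrent.futures import ThreadPoolExecutor
from TM1py import TM1Service

MDX = """SELECT {[Measure].[Amount]} ON COLUMNS,
         {TM1SubsetAll([Customer])} ON ROWS FROM [BigCube]"""

def run_query(_):
    with TM1Service(base_url="https://pa-instance.example.com/api/v1",
                    user="load_test_user", password="secret", ssl=True) as tm1:
        start = time.perf_counter()
        tm1.cells.execute_mdx(MDX)
        return time.perf_counter() - start

# Increase the number of parallel queries batch by batch
for parallel in (1, 5, 10, 20, 40):
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        timings = list(pool.map(run_query, range(parallel)))
    print(f"{parallel:>3} parallel queries: "
          f"avg {sum(timings) / len(timings):.2f}s, max {max(timings):.2f}s")
```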
Containers Out of Memory and a Tech Note
While the tests ran I monitored the progress from the PA Admin Console. As the query load rose, so did the RAM consumption. This was generally expected, but the new memory allocator was supposed to release unused memory back to the host. Instead, memory consumption continued to grow, and grow, until there was none left. Nothing seemed to prevent PAE from consuming all the RAM available to the container.
This left me a bit confused: what about the new allocator, and what was actually happening here? Enter the experts at IBM support. After a few working sessions we figured it out: RAM consumption, and its subsequent release back to the host, is user/session dependent. I had been using the same user account both to run the tests and to monitor their progress, so simply watching the tests run prevented PAE from returning memory to the host.
After moving the tests to a unique user, the new memory allocator worked as I expected.
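To make that takeaway concrete, here is a minimal sketch of the separation, assuming hypothetical user accounts and connection details (TM1py shown; the same idea applies to any client):

```python
# Sketch: keep the load-generating session and the monitoring session on
# separate user accounts, so that ending the test user's sessions lets
# PAE release that user's memory back to the host. Names are illustrative.
from TM1py import TM1Service

load_session = TM1Service(base_url="https://pa-instance.example.com/api/v1",
                          user="load_test_user", password="secret", ssl=True)
monitor_session = TM1Service(base_url="https://pa-instance.example.com/api/v1",
                             user="monitor_user", password="secret", ssl=True)
try:
    # ... run queries with load_session, watch progress with monitor_session ...
    pass
finally:
    # Logging the test user out is what allows its memory to be reclaimed
    load_session.logout()
    monitor_session.logout()
```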
Replica Performance and the Connection to Caching
Now that the tests were actually working, I wanted to know: do more replicas mean better query performance?
Yes, in most of my tests more replicas resulted in better aggregate performance. In the graph below you can see that as the number of parallel queries increases, total response times improve with more replicas.
If we look at this data slightly differently, we see something interesting: aggregate performance is better, but that's not the whole story. Zooming in on how responses change over time tells a different story. When a query is "small" and the total time to complete is always less than VMT, replicas increase performance considerably. The graph below shows that PAE sends queries to the replicas gradually as traffic on the leader increases, and that for uncached workloads more read replicas mean faster queries.
When the size and complexity of the query increases, the behavior changes. In this test the query always takes longer than 5 seconds to execute (i.e., longer than VMT). This is when things get interesting. Once the query load rose above what the leader could handle, queries were sent to the read replicas as expected. However, just like querying the leader for the first time, the first query sent to a replica needs to establish the local Stargate cache.
The chart below shows that as the number of parallel queries increases beyond the leader's capacity, the load is shared with a replica. The first queries to arrive at a replica must establish the local Stargate cache. More interestingly, once a query result can be retrieved from cache, the number of replicas has a negligible impact on performance.
Each node, leader and replica alike, manages its own local version of the cache.
In a similar test, replicas were added after the cache had been established on the leader node. In this case, assigning queries to the replicas actually degraded performance because the replicas did not have a valid cache. The chart below shows that once a cache has been established on a replica, query performance returns to baseline.
It is important to note that you can't control the cache on replicas: cache events, including ViewConstruct, are not shared with other nodes.
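Because cache events are not propagated between nodes, one pragmatic (if imprecise) warm-up approach is to replay representative heavy queries with enough parallelism that the router spreads them across the nodes, letting each replica build its own Stargate views. A sketch, with placeholder queries and credentials:

```python
# Sketch: warm Stargate caches after a data load by replaying heavy queries
# in parallel. Because requests are routed by load, enough concurrent
# repetitions should eventually hit every node; there is no documented way
# to target a specific replica directly.
from concurrent.futures import ThreadPoolExecutor
from TM1py import TM1Service

# Representative queries (placeholders) that run longer than VMT
HEAVY_MDX = [
    "SELECT {[Measure].[Amount]} ON 0, {TM1SubsetAll([Customer])} ON 1 FROM [BigCube]",
]

def warm(mdx):
    with TM1Service(base_url="https://pa-instance.example.com/api/v1",
                    user="warmup_user", password="secret", ssl=True) as tm1:
        tm1.cells.execute_mdx(mdx)

# Fire each query several times in parallel so some land on the replicas,
# which then build their own local caches.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(warm, HEAVY_MDX * 8))
```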
Setting Expectations
Like most tests, the situations above are not very representative of real system usage patterns, but they help explain how a system works. PA environments tend to be very dynamic, with constant data changes and a high rate of cache misses, so the performance benefit provided by replicas may be greater in real-life use. I recommend reviewing cache settings and tuning them the same way you would in legacy versions of PA. Note that simply adding replicas won't make all queries faster or improve performance for every user; in some cases it might not help at all.