Migrating an event-sourced Akka application from CockroachDB to AlloyDB
Introduction
One of our customers requested our assistance in further scaling their large, industrial, event-sourced project, which heavily leverages Akka and its powerful tools in that domain. The core issue was that a design decision (by yours truly) made in the greenfield phase of the project came back with a vengeance. The project was initially set up with CockroachDB from Cockroach Labs, a NewSQL, fully-ACID-compliant database that promises easy horizontal scaling. This choice was made with future growth in mind, but in the early rush to set up the walking skeleton and build a proof of concept, some shortcuts were taken. One of them was the choice of a classic Akka Persistence journal and snapshot plugin - akka-persistence-jdbc. While this approach would be fine with any traditional RDBMS like Postgres or Oracle, a particular property of the journal's table schema has proven quite limiting when used with a distributed database like Cockroach. The property in question is the ordering column, usually implemented using a database-backed sequence. While such a feature usually comes with negligible overhead, in a distributed database it necessitates negotiation between the nodes, with all the networking traffic required to achieve it. This makes it quite slow, and it is quite explicitly called out as a big no-no in Cockroach's documentation.

The customer requested that we look at possible alternatives - YugabyteDB, GCP's Cloud SQL for Postgres and the newly globally available Google Cloud product - AlloyDB. We were also asked to take the new akka-persistence-r2dbc plugin that arrived with Akka 2.8 for a spin and provide feedback on production readiness, performance and the possible benefits of migrating to it. We quickly decided that Yugabyte would face the same issues as Cockroach without a plugin migration (it is natively supported by akka-persistence-r2dbc), and it had also been caught performing troubling practices, as reported in reliable industry press again and again. Not wanting to introduce more risks, we focused on Google's managed database products - Cloud SQL for Postgres and AlloyDB. A quick skim through both the technical and marketing docs of Alloy made the choice quite simple - from a scalability perspective, Alloy looked like a vastly better choice, with higher reported performance, automatic failover, automatic cross-zone replication of data and the availability of cross-region replication. It looked very promising, if one squints hard enough to overlook the problem of having a single point of failure. We discussed these trade-offs with our customer, and the decision was made to take a closer look at Google's Alloy.
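To make the ordering-column problem described above concrete, here is an abridged sketch of the relevant part of the classic akka-persistence-jdbc journal table on Postgres (several columns are omitted; the authoritative DDL ships with the plugin):

-- Abridged sketch of the classic journal table (not the full DDL).
CREATE TABLE IF NOT EXISTS event_journal (
  ordering        BIGSERIAL,             -- global ordering backed by a database sequence
  persistence_id  VARCHAR(255) NOT NULL,
  sequence_number BIGINT NOT NULL,
  deleted         BOOLEAN NOT NULL DEFAULT FALSE,
  event_payload   BYTEA NOT NULL,        -- the serialized event blob
  PRIMARY KEY (persistence_id, sequence_number)
);
-- The ordering column feeds ordered reads on the query side; the sequence behind it
-- is exactly the feature that forces cross-node coordination on a distributed database.
CREATE UNIQUE INDEX IF NOT EXISTS event_journal_ordering_idx ON event_journal (ordering);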
What questions did we need answered?
Our customer's primary concern was verifying whether AlloyDB's maximum write throughput would pose a problem in the future. This was, in fact, the original concern that drove the decision away from self-hosted Postgres or Cloud SQL for Postgres - that even with really massive instance sizes, with a single leader performing all the writes, the throughput ceiling would remain a looming threat. Google's advertised numbers put AlloyDB's write throughput comfortably on the right side of what felt safe for our use case, so the main objective became verifying whether those numbers would check out with the database schemas of Akka Persistence backends.
The secondary objective was to seize the opportunity and verify how different akka-persistence plugins would fare in our circumstances, so that we would know whether we could later migrate towards something better should the need arise. To that end, we decided to compare the schemas of three plugins: the classic akka-persistence-jdbc, SwissBorg's akka-persistence-postgres and the newest on the block, Akka's (formerly Lightbend's) akka-persistence-r2dbc. The scope of this research did not include the performance of the actual plugins, just the impact of their database schemas on maximum write throughput. An exciting contender here is akka-persistence-r2dbc, designed with another NewSQL database in mind - YugabyteDB. Its schema therefore avoids the problematic global sequence column and should be much friendlier towards CockroachDB, too. It's also worth noting that akka-persistence-postgres has three different schema variants tuned for different kinds of event-sourced workloads:
- plain - a single, flat journal table;
- partitioned - the journal table is split into partitions created periodically as the data grows;
- nested - a separate partition per persistence id, further partitioned by sequence number.
Out of these three variants, nested was known from the start to be a wrong fit for our customer's use case, as it targets situations where a relatively small number of event-sourced entities each have a huge journal (hence the separate partitions per persistence id, partitioned by sequence number!). It was therefore expected from the start that this variant would strongly underperform. This brought the total number of schema variants under test to five.
The final objective, a kind of sanity check, was to verify that AlloyDB actually yields better performance than the currently misused CockroachDB. Should the numbers come out relatively comparable for CockroachDB, it would be better to stick with it and migrate to akka-persistence-r2dbc once its support for CockroachDB is ready.
Methodology
Load testing using a custom-built tool was selected as the best way to answer these questions. Given that the scope was limited to the schemas of akka-persistence plugins, dynamically generating data for the load tests was relatively simple, because journal tables are essentially rows of metadata with a single blob column containing a serialized event. To find the maximum write throughput threshold in a realistic scenario, the current distribution of event blob sizes was taken from a copy of the customer's production database using an SQL query that returned the following table:
size | occurrences | smallest | largest | average
----------------+-------------+----------+---------+----------
> 100000 bytes | 25304 | 103796 | 110262 | 107340
> 90000 bytes | 4297 | 96588 | 98604 | 97873
> 50000 bytes | 5295 | 50606 | 57030 | 53678
> 40000 bytes | 25284 | 40884 | 49776 | 48023
> 30000 bytes | 85988 | 30117 | 39636 | 32381
> 20000 bytes | 377780 | 20055 | 29964 | 22091
> 10000 bytes | 201750 | 10005 | 19981 | 13795
< 10000 bytes | 151129 | 211 | 9923 | 3038
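A roughly equivalent bucketing query is sketched below for illustration only; the table and column names are assumptions borrowed from the journal sketch earlier, not the query actually used:

-- Illustrative only: bucket payload sizes into 10 kB ranges, largest buckets first.
SELECT (octet_length(event_payload) / 10000) * 10000 AS bucket_bytes,
       count(*)                                      AS occurrences,
       min(octet_length(event_payload))              AS smallest,
       max(octet_length(event_payload))              AS largest,
       round(avg(octet_length(event_payload)))       AS average
FROM event_journal
GROUP BY bucket_bytes
ORDER BY bucket_bytes DESC;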
This distribution was then fed to purpose-written code that generates realistic-looking rows of journal data to be inserted into a given table. Different rows are generated for the different schemas, but the event's binary blob size always follows the production distribution (this is verified statistically, with p = .99, given a large enough sample).
Load testing of databases, both distributed and not, has several constraints that have to be taken into account. In the case of single-writer databases like Postgres, the total number of available connections is one of them (for PostgreSQL, that's just 100 connections by default!). This is, however, not a problem for AlloyDB, which defaults to 1000 connections and can be tuned up to 240 000 (sic!).
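For reference, checking and raising that limit on vanilla PostgreSQL looks like this (AlloyDB exposes the equivalent setting as a tunable database flag):

-- Check the current limit (100 by default on a stock PostgreSQL install).
SHOW max_connections;
-- Raise it; this requires superuser rights and a server restart to take effect.
ALTER SYSTEM SET max_connections = 1000;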
A load testing application was built using the ZIO library for ease of parallelisation and granular control in concurrent use cases. Database integration was done with the ZIO-Quill library, which integrates straightforwardly with ZIO and generates queries at compile time, allowing easy use of prepared statements. To allow easy reruns of the load tests in the Google Cloud environment, Pulumi was used to automate the creation of both the managed databases and the Kubernetes clusters onto which the load-testing application instances were deployed.
Load testing itself follows an incremental connection count scenario: the database schema is migrated first, then the load test for that schema starts with S (step) connections and increases the connection count by another S every 10 seconds until the M (maximum connections) value is reached. At that point the test is complete, results are serialised to JSON and uploaded to a GCS bucket, and the migration for the next schema is executed. The load tester is built to run in parallel on a GKE cluster. Synchronisation between parallel processes is provided by a simple timing mechanism combined with looped table-existence checks used for locking: the leader process waits for the other processes (pods) to start and to finish their load testing before applying migrations, while follower workers wait for the tables to be dropped and migrated again before beginning the next phase. This guarantees that the 10-second phases of the same load intensity overlap between parallel processes, with only minor sub-second discrepancies.
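A minimal sketch of the kind of existence check used for this locking (the table name is an assumption; the actual coordination logic lives in the load tester):

-- Follower pods poll a check along these lines until the leader has dropped and
-- re-created the journal table for the next schema under test.
SELECT EXISTS (
  SELECT 1
  FROM information_schema.tables
  WHERE table_schema = 'public'
    AND table_name = 'event_journal'
);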
The minimum and maximum load (the number of parallel looped insert queries) in the initial and final phases, respectively, can therefore be calculated using the following formulas:

n - number of pods (processes)
S - step size
M - maximum connections per process
p - phase index (starting with 1)
L - load

L(p) = n * p * S
Lmin = L(1) = n * S
Lmax = n * M
The actual number of queries executed (throughput, in other words) is dependent on myriad variables and is what we are after in this research.
AlloyDB can be set up in different variants, configured by the desired number of CPUs on the primary node; the runs described below used the 8, 16 and 64 CPU variants.
CockroachDB was set up using the CockroachDB Dedicated offering, which comes in five variants with vCPU count as the main differentiator; the runs below used the 8, 16 and 32 vCPU variants (32 being the largest instance size available).
Due to the rather exploratory nature of this research, not all variants were tested, and the configuration of the load generators was guided more by approximation based on previous runs than by any analytical consideration. Having said that, the load was increased significantly on each CPU bump, in line with initial findings about latency increases observed on a local PostgreSQL instance used to test the load generator itself.
An additional consideration was that tests of the schemas belonging to akka-persistence-postgres were dropped from the CockroachDB runs because they all require a sequence-based column; the akka-persistence-jdbc schema was retained for baseline comparison. The real point of interest for CockroachDB was the performance of the akka-persistence-r2dbc schema, which had to be tweaked slightly to translate the hash-based primary key. The original schema, built for YugabyteDB, defines the primary key as:
PRIMARY KEY(persistence_id HASH, seq_nr ASC)
but CockroachDB does not support this syntax. After some experimentation, two candidate translations (referred to below as the first and the second variant) were found.
Tests were executed against both of these variants, as it wasn't clear which one would have the same semantics as in YugabyteDB. Only later was it confirmed by a CockroachDB representative (thank you, Tofara!) that the first one was correct.
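Purely as an illustration of the kind of translation involved (an assumption on my part, not necessarily the exact variants that were tested), two ways to express such a key on CockroachDB are:

-- Illustration only - not necessarily the two variants used in the tests.
-- Option A: a hash-sharded primary key (CockroachDB computes a hidden shard column):
PRIMARY KEY (persistence_id, seq_nr) USING HASH
-- Option B: a plain composite key, range-ordered on both columns:
PRIMARY KEY (persistence_id, seq_nr)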
Load tests for AlloyDB were performed in several configurations, varying the instance size, the number of parallel load tester pods and the maximum connection count per pod; the exact combinations are described run by run in the findings below.
For CockroachDB the same pattern of incremental load tests was executed with the vCPU variants equal to CPU variants of AlloyDB with the exception of the largest run (so 8 for 8, 16 for 16 but 32 for 64 because that's the largest instance size available for CockroachDB Dedicated). It's important to note that CockroachDB is a distributed database but wasn't used as such - the cluster we used for testing had a single instance to make this comparison more fair.
Results and findings
Results consist of throughput and latency data for all phases per process (a Kubernetes pod running the load testing program on a separate virtual machine). Additionally, database metrics were captured from the Google Cloud dashboard and, for CockroachDB, from the Datadog integration, to validate what could be seen at the database client level.
The charts representing these results are included at the end of this post; for brevity, only a summary is presented below. All charts representing the whole dataset obtained from this research are available in Google Drive.
Data charts are also available in the repo.
Notice
Cut-offs on the charts for the nested variant of akka-persistence-postgres are caused by queries crossing the 500ms average latency threshold, which stops the test. For most scenarios, nested only makes it through the first step. This is completely understandable given that each generated event has a unique, random persistence id, while the basic idea of this schema is that each event-sourcing aggregate gets its own partition. This means that for each insert into the nested schema, two DDL statements were executed: first to create a partition in the journal table and then to create the table for said partition.
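As a sketch of what that implies (illustrative DDL only, with made-up names - not the plugin's exact statements), inserting the first event for a brand-new persistence id requires something along these lines before the actual INSERT can run:

-- Sketch: partition per persistence id, sub-partitioned by sequence number.
CREATE TABLE journal_aggregate_42 PARTITION OF journal
  FOR VALUES IN ('aggregate-42')           -- hypothetical persistence id
  PARTITION BY RANGE (sequence_number);
CREATE TABLE journal_aggregate_42_seq_1 PARTITION OF journal_aggregate_42
  FOR VALUES FROM (1) TO (10001);          -- first range of sequence numbers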
Initial runs against 8 CPUs for both AlloyDB and CockroachDB, with a single instance of the load testing program, showed the baseline performance of both systems along with some interesting tendencies, both expected and unexpected. For Alloy, the poor fit of the nested schema of akka-persistence-postgres became obvious immediately, with tests stopping at 30 concurrent connections and abysmal throughput (around 100 TPS at 10 connections) throughout. The performance penalty of the partitioned variant of the same plugin also made itself known, with results between 600 and 1500 TPS lower than the leader's. The other schemas performed similarly, with the plain variant of akka-persistence-postgres slightly leading the pack (fig. 1). Interestingly, the plain variant also stood out negatively on the latency front, with the highest p99 apart from the partitioned variant (which was bound to have high latencies for some queries due to the periodic execution of DDL to create a new partition table) (fig. 2). This trend for akka-persistence-postgres plain remained constant through all the runs without a clear reason for it - one could assume it's due to the sequence-based column used for the global ordering of events, but the same is also true of the basic akka-persistence-jdbc schema. The peak performance registered was 4500 TPS at 90 concurrent connections (fig. 1).

On CockroachDB's side of things, one base assumption was corroborated immediately, too - the akka-persistence-jdbc variant using a sequence-based column was not performing well and stayed between 500 and 600 TPS no matter how much traffic went its way. The two variants of the akka-persistence-r2dbc schema performed better, with the first variant (the correct one) reaching 1800 TPS at 20 connections (fig. 3). In this run, the second variant surprisingly won at higher connection counts (70 and 80), where it became faster than the first.
The next run against the 8 CPU databases was only meant to find the limits of this setup with a higher connection count and a single process. Maximum connections were bumped to 200, and the results were nearly identical to the previous run. For Alloy, akka-persistence-postgres plain still led the pack, clocking 5100 TPS at 180 concurrent connections (though with p99 latencies crossing the 100ms threshold; the third-best result was 4800 TPS at 150 connections with p99 latencies around 60ms for the same schema) (figs. 5 and 6). For Cockroach, variant 1 of the r2dbc schema performed in a very stable fashion, around 1500 TPS between 20 and 130 connections, with a spike up to 1800 TPS at 180 connections and a sharp dip above this threshold (fig. 7).
A subsequent run was also executed against the 8 CPU databases (exploration!) but with two parallel load tester instances, with the maximum connections threshold again set to 200. On Alloy's side, results were again not surprising. With two processes, the 5000 TPS threshold was met and briefly crossed by all variants besides the nested and partitioned akka-persistence-postgres options. This also made it clear that around 5000 transactions per second was the hard limit of the database instance at its current size (fig. 9). Runs against Cockroach didn't surprise either - the first variant of the r2dbc schema held throughput relatively steady around 1800 TPS and then dipped at higher connection counts (mirrored in the p99 latency chart, where it went through the roof at 2x180 connections), where the invalid second variant won again. The akka-persistence-jdbc schema also crossed the 500ms average latency threshold at 120 connections, and its test was stopped (figs. 11 and 12).
The fourth test run was the first where the database instance size was bumped to 16 CPUs for both databases. There were still two pods with load-testing apps, and the connection limit was bumped to 300, leading to 600 concurrent connections at the peak. On Alloy's side, a clear tendency showed up, with akka-persistence-postgres plain leading and clocking 8650 TPS at 150 connections, and the akka-persistence-r2dbc schema playing catch-up and tying for first place in some phases (fig. 13). A similar increase in throughput could not be seen on Cockroach's side, however, with the first variant of the r2dbc schema still unable to break through the 2000 TPS threshold (fig. 15). Moreover, additional issues showed up along the way. In this configuration, some test runs against Cockroach started failing to write events to the database with the ABORT_REASON_NEW_LEASE_PREVENTS_TXN error code. Documentation explained that this error is triggered when the leaseholder is shuffled from underneath a pending transaction, and the transaction cannot continue. After a small change in code that introduced retries for this particular error, test runs were able to complete successfully. However, this problem occurred often enough to raise average latencies and, therefore, cross the 500ms threshold, after which tests are stopped (fig. 15).
In the next run, the same hardware configuration was retained, but the pod count was bumped to 3 while the connection limit stayed at 300, meaning 900 concurrent connections at the peak. There weren't any surprises for Alloy again: akka-persistence-postgres plain led akka-persistence-r2dbc by a thin margin, often tying for victory. The highest throughput observed was recorded for the plain variant, around 9800 TPS at 225 connections per pod (so 675 in total!) (fig. 17). Latencies suffered heavily, however, in phases with more than 100 connections (fig. 18). On Cockroach's side, things stayed the same: the first variant of akka-persistence-r2dbc still performed best but still didn't break the 2000 TPS barrier, with latencies steadily growing in the general direction of the moon (figs. 19 and 20).
The final run, against the biggest database instances - 64 CPUs with 512 GB of RAM for AlloyDB and 32 vCPUs with 128 GB of RAM for CockroachDB - proved that Alloy is capable of scaling writes to a single table further still, with the high watermark being 16300 inserts per second with 8 pods pumping writes through 75 connections each (600 concurrent connections!) and p99 latencies kept under 100ms (figs. 21 and 22). Interestingly, the best-performing schema variant in this test run was finally akka-persistence-r2dbc. Writes to CockroachDB did not scale with the vertical scaling of the hardware and again never went over the 2000 TPS ceiling (fig. 23). One interesting observation is that two other schemas - the invalid r2dbc translation and the classic akka-persistence-jdbc - crossed the 500ms average latency threshold immediately with 8 processes of 25 connections each. The correct r2dbc translation (the first variant) was able to complete the whole run, but p99 latencies became untenable with more than 25 connections (fig. 24). In general, CockroachDB can be overwhelmed by large numbers of open connections, and in each test the best result was recorded in the first or second phase, when connection counts per load tester app were the lowest.
Conclusions
The results we gathered allowed us to recommend AlloyDB to the customer who ordered this consultation with a high degree of confidence. Our only reservation after the test runs concluded regarded the overall architecture of the solution - AlloyDB is still a leader/replicas setup, and even with the built-in hot standby ready to take over at all times, the single point of failure is there. On the other hand, its scalability (support for thousands of open connections!), allowing it to comfortably reach 10000 inserts per second with each insert carrying a blob of 23.4 kilobytes on average (some reaching 100 kB), meant that it was a safe bet for the foreseeable future. The best results for Alloy per hardware variant recorded with p99 latencies under 100ms (a reasonable value for responsiveness under heavy load) are the ones quoted in the run descriptions above, with the overall high watermark being 16300 inserts per second on the 64 CPU instance.
The late win of the r2dbc-based schema is an interesting data point that can possibly be explained by the (small) overhead of the sequence-based ordering column, which is not present in this variant.
CockroachDB's results, on the other hand, are surprising and a bit worrying. The fact that no amount of vertical scaling of the instances allowed the load test to cross the 2000 transactions per second threshold suggests either a configuration problem or something more fundamental. No configuration tweaks were applied (with the exception of max connections being bumped over 1000 for the final run against AlloyDB) in order to maintain fairness, so this behavior is what happens with out-of-the-box CockroachDB instances. To verify it was not a networking issue, the same test was executed locally on an M1 MacBook Pro against Cockroach running the docker image cockroachdb/cockroach:v23.1.13, and the problem persisted - the highest observed writes-per-second number was 2130 at 50 concurrent connections for the first variant of the r2dbc schema. This is unexpected because this schema variant doesn't even declare a sequence-based column, and it does perform about 1.8x better than the akka-persistence-jdbc variant, which does declare one.
When comparing the impact of the journal table schemas of different akka-persistence plugins on throughput against a leader/replicas type of database - given the additional safety guarantees gained from the explicit ordering of stored events - the clear winner is SwissBorg's akka-persistence-postgres in its plain variant. The only time akka-persistence-r2dbc was able to provide higher throughput was with extreme numbers of concurrent writes. The r2dbc approach is, in turn, paid for with higher application complexity related to how Akka's Persistence Query and Projections modules operate. When using a distributed, NewSQL type of database like Yugabyte or Cockroach, the r2dbc schema is definitely the choice to make, as it fully leverages the internal model of such databases and allows the use of their horizontal scaling capabilities.
Our findings allowed the interested party to make an informed decision and migrate from CockroachDB to AlloyDB, with positive results in terms of lower observed latencies (a sequence column is cheaper in a leader/replicas architecture) and without any issues related to the reliability of the persistence stack. The verified support for very high active connection counts is very promising for Akka-based event-sourcing applications, as it means it's relatively safe to scale up the cluster without applying any changes to the database's configuration.
Addendum I - 25.01.2024:
Denis Magda of YugabyteDB reached out to clarify the situation around the research published at aphyr.com:
Yugabyte sponsored the initial Jepsen testing because it needed Kyle (the person who knows the Jepsen framework best) to create a proper test suite for YugabyteDB's distributed architecture. Since then, the Yugabyte engineering team has addressed all the issues and runs the Jepsen tests daily. The suite continues to be instrumental in discovering regressions, which are immediately fixed before the next database version gets released: https://github.com/yugabyte/yugabyte-db/issues/10052
There's a possibility that this research will be expanded to include YugabyteDB, another NewSQL offering, which akka-persistence-r2dbc supports out of the box.
Comments

Head of Product Marketing at Volt Active Data, Inc. (9 months ago):
This one is a bit of a headscratcher from my viewpoint. I can't help wondering if something went badly wrong somewhere... *all* of these numbers are really, really low...

Sr. Director of Engineering at Cockroach Labs (10 months ago):
This benchmark appears to use a monotonic integer in its PRIMARY KEY, which is pathologically bad for CockroachDB as it creates a hot spot. If someone is interested in rerunning this with a more even data distribution, please reach out (`seanc - at - cockroachlabs dot com`), but at first pass, this test and conclusions are not particularly fair or representative of CockroachDB's ability to scale (I've personally pushed CockroachDB to millions of TPS in single production and test clusters and have questions, and am hoping to find someone to work with to reconcile these results). Cheers.

Developer Enablement and Relations, MLP Launchpad (10 months ago):
Hey, thanks for sharing your story. I'm glad everything worked out for the customer during the transition.
>> Yugabyte is going to face the same issues as Cockroach without a plugin migration (it's natively supported by akka-persistence-r2dbc)
Do you have a specific list of issues? If you came across anything during testing, I would appreciate it if you could report your findings on GitHub. Btw, sequences scale easily with YugabyteDB. The primary copy of a sequence number is stored on a single node, but the number of writes to that node is dramatically reduced with the 'CACHE' parameter of the 'CREATE SEQUENCE' command. The other nodes cache a range of IDs on their end and go to the primary node only occasionally to lock in the next range.
>> and also has been caught performing troubling practices in reliable industry press again and again.
With all due respect, those articles are outdated (from 2019). It's been over 5 years since Kyle tested the earlier versions of YugabyteDB. The team collaborated with him to address those findings. Anyway, it's 2024 today, and YugabyteDB is already recognized as one of the top 20 cloud databases (Gartner's Magic Quadrant). This recognition was made possible thanks to numerous customers.