System modernization. An approach: make it geo-distributed, optimize, leverage AWS managed services.

This story is about how I achieved a 2.5-3x application performance gain by making an application globally distributed, without changing a single bit of the app itself. The approach is generic enough to be worth considering for real-life business applications. This demo deals with Odoo’s CRM application.

The Odoo app suite is absolutely awesome, though it demonstrates all the challenges you typically run into when dealing with elasticity and global accessibility of a legacy application. Even though the solution does not run in production, the approach it demonstrates is quite practical. Scroll to the very last part of the article to find the link to the source code.

Important! For the sake of simplicity the solution lacks many security features and is not ready to run in production without tuning.

I’d appreciate you sharing any thoughts and ideas, and contributing to the solution.

Imagine our enterprise has a setup with its central server in Europe, let’s say in the eu-central-1 (Frankfurt) AWS region. We are tasked with letting users in Japan access our CRM system. Here the problem arises: every API call the application performs experiences a 300-millisecond delay on average.

A geo-distributed application. The branch region reaches the main one.

(Map by Free Vector Maps.)

Can we make it better? Yes, we can. What we can achieve is the data points of the green-colored graph on the screenshot above: the median response time ranges from 90 to 150 milliseconds, which is 2.5-3 times better.

A geo-distributed application. The main database replicates the data to the branch region.

But it comes at the cost of some additional complexity. Thanks to AWS, that complexity is affordable, since most of it is managed for us. Let’s dive into the details.

Solid base

The foundation of the solution is a pretty straightforward master-slave geo-distributed setup. In my experiments I ran its “master” node in the eu-central-1 (Frankfurt) region. Odoo requires a PostgreSQL database at its disposal, so the solution makes use of Amazon Aurora PostgreSQL and leverages its Global Database feature to establish the read-only “slave” part in the ap-northeast-1 (Tokyo) region.
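For illustration, here is a minimal boto3 sketch (not the repository’s templates, which set this up declaratively) of how an existing Aurora PostgreSQL cluster in Frankfurt could be joined into a Global Database with a read-only secondary in Tokyo. All identifiers and the ARN are hypothetical placeholders, and networking, engine-version, and credential settings are omitted.

```python
import boto3

rds_eu = boto3.client("rds", region_name="eu-central-1")
rds_ap = boto3.client("rds", region_name="ap-northeast-1")

# Promote the existing "master" cluster into a global cluster.
rds_eu.create_global_cluster(
    GlobalClusterIdentifier="odoo-global",
    # ARN of the existing primary Aurora PostgreSQL cluster (placeholder).
    SourceDBClusterIdentifier="arn:aws:rds:eu-central-1:123456789012:cluster:odoo-eu",
)

# Attach a secondary cluster in Tokyo; Aurora replicates to it,
# and it accepts read-only connections.
rds_ap.create_db_cluster(
    DBClusterIdentifier="odoo-ap",
    Engine="aurora-postgresql",
    GlobalClusterIdentifier="odoo-global",
)
rds_ap.create_db_instance(
    DBInstanceIdentifier="odoo-ap-reader",
    DBClusterIdentifier="odoo-ap",
    DBInstanceClass="db.r5.large",
    Engine="aurora-postgresql",
)
```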

This is a top-level overview of the infrastructure.

As you may have noticed, it runs the Odoo backend as Docker containers configured for horizontal scalability with the help of Amazon ECS, AWS Fargate, and ECS Service Auto Scaling. The templates you will find at the link below set it up this way for the sake of simplicity; otherwise, you may choose any approach to running containers that you know and love.
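As a rough sketch of the kind of scaling configuration involved, the snippet below registers an ECS service for target-tracking auto scaling on CPU utilization with boto3. The cluster and service names, capacity bounds, and the 60% target are hypothetical placeholders, not the values from the templates.

```python
import boto3

autoscaling = boto3.client("application-autoscaling", region_name="eu-central-1")

# Hypothetical cluster and service names.
resource_id = "service/odoo-cluster/odoo-backend"

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=12,
)

autoscaling.put_scaling_policy(
    PolicyName="odoo-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,  # scale out above ~60% average CPU
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
    },
)
```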

In each region, an AWS Application Load Balancer fronts the Odoo backend, and it is the endpoint the Odoo client web application accesses. So, we have two endpoint addresses in two locations on Earth that would be separated by 9,000+ kilometers even if we had an absolutely straight fiber-optic channel between them, which of course is not the case.

The bad news is that the setup is not going to work yet.

First, Odoo requires a shared file system.

Make a globally-distributed file system

What Odoo stores on the file system are the application's static assets, such as JavaScript and CSS files, and session context files.

Here I’ll show you how the solution makes those static files available in the “slave” region. This is the easier case, since the files require copying only once. (To be absolutely correct, we need to copy the assets once after we have installed or updated an Odoo app.)

Replicating session files, which pop up once a user signs in, is trickier and requires sub-second latency. I’ll talk about cracking that nut later on.

So, for the static files we create an Amazon Elastic File System (EFS) file system in each of the regions and leverage AWS DataSync, a managed data transfer service that integrates with EFS out of the box.
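Assuming a DataSync task already exists with the Frankfurt EFS as its source location and the Tokyo EFS as its destination, kicking off the copy is a single API call. The task ARN below is a placeholder; a minimal boto3 sketch:

```python
import boto3

datasync = boto3.client("datasync", region_name="eu-central-1")

# Run once after installing or updating an Odoo app so the freshly
# generated static assets land on the "slave" region's EFS.
datasync.start_task_execution(
    TaskArn="arn:aws:datasync:eu-central-1:123456789012:task/task-0123456789abcdef0"
)
```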

What we have by this point promises to work pretty smoothly, provided we send all requests that write to the database to the “master” region. If we don’t, an operation will try to run an INSERT, UPDATE, or DELETE SQL query against the “slave” database instance, which is in read-only mode and will reject the query.
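A quick way to see that failure mode is to fire a write at the Aurora reader endpoint directly. A minimal psycopg2 sketch, with a placeholder endpoint, credentials, and table:

```python
import psycopg2

conn = psycopg2.connect(
    host="odoo.cluster-ro-xxxxxxxx.ap-northeast-1.rds.amazonaws.com",  # reader endpoint
    dbname="odoo",
    user="odoo",
    password="...",
)
with conn, conn.cursor() as cur:
    # The reader runs in read-only mode, so this raises
    # psycopg2.errors.ReadOnlySqlTransaction:
    #   cannot execute UPDATE in a read-only transaction
    cur.execute("UPDATE crm_lead SET name = 'renamed' WHERE id = 1")
```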

Add a CDN and let it orchestrate cross-region traffic

The diagram illustrates the idea: we discriminate read-write operations from read-only ones and route them to the two load balancers accordingly.

To implement this approach we configure the CloudFront distribution with two origins and multiple behaviors. Read this FAQ which briefly describes how it works and provides some pointers.

The diagram denotes the two classes of traffic as GET vs. POST. That is a simplification of how the actual configuration looks and works, as you will see from the templates provided. Odoo uses JSON-RPC, a generic protocol on top of HTTP: it sends everything as POST requests whose payload includes the method name. So, in reality, both read and write operations are POST requests, and the setup we are talking about distinguishes reads from writes by the path in the request’s URI.

These segregation rules are strictly application-dependent and not necessarily obvious, so be careful when planning traffic routing rules for your own application. Generally speaking, you are on the safe side if you route requests to the read-only cluster only when you are 100% confident that every request matching the rule does nothing but read data under every possible circumstance. Make the “master” read-write cluster the default destination for all requests that either a) definitely write data, b) write data under specific conditions, c) leave you in doubt, or d) cannot be discriminated with the means CloudFront gives you. With that rule of thumb you’ll sacrifice some performance, but you won’t spoil the whole idea.
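To make the rule of thumb concrete, here is a toy Python illustration of the decision. In the real setup the decision is made by CloudFront cache behaviors (ordered path patterns pointing at the two origins), not by application code, and the patterns below are hypothetical examples rather than Odoo’s actual route list.

```python
# Hypothetical path prefixes we are 100% sure never write data.
READ_ONLY_PREFIXES = (
    "/web/dataset/search_read",
    "/web/image/",
)


def origin_for(path: str) -> str:
    """Decide which origin (ALB) should serve a request with this URI path."""
    # Route to the read-only "slave" ALB only when the request can never
    # write under any circumstances.
    if any(path.startswith(prefix) for prefix in READ_ONLY_PREFIXES):
        return "slave-read-only-alb"
    # Writes, conditional writes, and anything in doubt go to the "master".
    return "master-read-write-alb"
```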

Make use of Lambda@Edge to replicate session files

We have done a lot so far, though it still does not let users sign in. The problem is that during the login workflow the backend creates a session context file. All subsequent requests must include the session id (that is what the Odoo client web application does), and the backend matches the session id it receives in the request against what it reads from … Ouch! If a request comes to the read-only backend, that session file is not on the EFS in the read-only region. This is because we have the CloudFront distribution configured to forward /login requests to the “master” backend, since that operation updates some records in the database, and so the backend creates the session files in the “master” region, of course. Unfortunately, it is practically impossible to leverage DataSync for replicating the session files: a DataSync task execution is a heavy and slow thing to spin up, and the task has zero knowledge about which files are new and need copying, so each execution performs a painstaking comparison of the destination against the source. The tool is simply not a good match when we need to operate at the single-file level and strive for millisecond-scale latency.

So, the last component we add to it is a Lambda@Edge function. (Don’t confuse it with CloudFront Functions that emerged recently.) The function is wired to the origin-response event of the /login path of the distribution.

CloudFront triggers this function when the user sign-in workflow has completed and a session file has been created on the EFS file system in the “master” region.

Lambda@Edge functions have no direct access to EFS, because they actually run in the region the CloudFront point-of-presence server belongs to, not in the “master” region. So what OnUserSignIn does is first invoke the ReadSessionFile Lambda function (a “normal” Lambda function) in the “master” region and receive the session file’s content, which is several dozen bytes long. Then it passes that content to the WriteSessionFile Lambda function in the “slave” region. That’s it. Once both calls are done, OnUserSignIn returns, releasing CloudFront to return the response to the browser with some delay. We assume that users sign in only once and that the whole sign-in workflow is quite lengthy anyway, so adding some delay to it is still bearable for our precious users.
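Below is a minimal sketch of what such an OnUserSignIn handler could look like. The event shape is CloudFront’s standard origin-response event; the session cookie name, the Lambda payload format, and the regions follow the description above but are assumptions rather than the repository’s exact code.

```python
import json

import boto3

lambda_master = boto3.client("lambda", region_name="eu-central-1")
lambda_slave = boto3.client("lambda", region_name="ap-northeast-1")


def handler(event, context):
    response = event["Records"][0]["cf"]["response"]

    # Extract the freshly issued session id from the Set-Cookie header.
    set_cookie = response.get("headers", {}).get("set-cookie", [])
    session_id = _session_id_from_cookies(set_cookie)
    if session_id:
        # 1. Read the session file's content in the "master" region.
        read = lambda_master.invoke(
            FunctionName="ReadSessionFile",
            Payload=json.dumps({"session_id": session_id}).encode(),
        )
        content = json.loads(read["Payload"].read())
        # 2. Write the same content to EFS in the "slave" region.
        lambda_slave.invoke(
            FunctionName="WriteSessionFile",
            Payload=json.dumps({"session_id": session_id, "content": content}).encode(),
        )
    # Only after both calls return does CloudFront release the response.
    return response


def _session_id_from_cookies(set_cookie_headers):
    # Assumes the session cookie is named "session_id".
    for header in set_cookie_headers:
        value = header.get("value", "")
        if value.startswith("session_id="):
            return value.split(";", 1)[0].split("=", 1)[1]
    return None
```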

How I tested performance of the solution

I used the Locust load testing tool for running the tests.

I performed two sets of tests.

During the first one, I was running one Locust container in Europe and another in Japan, both simulating 450 users, and both hitting the single central application cluster in Europe.

Then I load-tested the globally distributed setup by pointing the Locust running in Japan to query the “slave” Odoo cluster in Japan.
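For a flavor of what such a test looks like, here is a hedged Locust sketch. The endpoints, credentials, and payloads are illustrative stand-ins for Odoo’s JSON-RPC calls, not the exact requests the scripts in the repository send.

```python
from locust import HttpUser, task, between


class OdooUser(HttpUser):
    wait_time = between(1, 3)

    def on_start(self):
        # Sign in once per simulated user; CloudFront routes this to the
        # "master" region and Lambda@Edge replicates the session file.
        self.client.post(
            "/web/session/authenticate",
            json={
                "jsonrpc": "2.0",
                "params": {"db": "odoo", "login": "user@example.com", "password": "secret"},
            },
        )

    @task
    def read_crm_leads(self):
        # A read-only JSON-RPC call; in the geo-distributed setup it is
        # served by the "slave" region close to the user.
        self.client.post(
            "/web/dataset/call_kw",
            json={
                "jsonrpc": "2.0",
                "method": "call",
                "params": {
                    "model": "crm.lead",
                    "method": "search_read",
                    "args": [[], ["name", "email_from"]],
                    "kwargs": {"limit": 20},
                },
            },
        )
```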

Here I copy the diagram presenting what Locust measured. Both the green and the yellow lines correspond to median response times reported by the Locust instance in Osaka, while the similar requests coming from Paris to Frankfurt obviously did not suffer any cross-region latency and demonstrated the best possible response time.

You will find the Python Locust scripts I used for load testing, along with the source code, at the link below.

Cost implications

Let’s consider this geo-distributed setup from the point of view of the additional costs implied by the extra moving parts we just talked about.

First, leveraging Amazon Aurora Global Database shrinks the choice of database instance types and sizes. The cheapest ones supporting Global Database are db.r5.large and db.r6g.large, at $0.26-0.35/hr depending on the region and type.

The good news is the global database showed nearly linear horizontal scalability. The “centralized” configuration was load tested with a single db.r5.xlarge instance bearing 900 users, while the geo-distributed setup consisted of two db.r5.large instances with 450 users each. The overall cost per transaction stayed the same. (I measured only “read” operations. “Write” operations were tested to prove that the geo-distributed configuration works all right, but for the sake of simplicity I do not present what I measured when load testing a mixture of read and write operations. Long story short, write operations do not hugely affect read operations in either region, which perfectly meets the goal of this design.)

Note, though, that Amazon Aurora Global Database additionally charges for the replicated write I/O operations it performs.

Odoo itself showed, unsurprisingly, good horizontal scalability. The centralized configuration stabilized at 11 ECS Fargate container instances (ECS tasks) of 1 vCPU x 2 GB RAM each. The geo-distributed setup required 5 plus 6 containers.

The cost of running the two EFS file systems is negligible.

The cost of the two CloudFront distributions is a small fraction of the overall cost. In fact, because the total request volume stays the same across the two configurations, and because the geo-distributed configuration produces far fewer cross-region requests, we actually expect to save some money here.

What about running two Application Load Balancers instead of a single central one? The ALB cost is the sum of a fixed hourly rate and the cost of LCUs (Load Balancer Capacity Units), which is proportional to throughput. In both configurations roughly the same amount of requests and data comes through the ALB(s), so the variable LCU part of the equation stays the same. The only additional cost is ~$20/month (before taxes) for the additional “regional” ALB “instance”. Do you agree that this extra is not a barrier at all when you are going to go really global?

Some clouds in (what was supposed to be) the blue skies

The not-so-good news is that so far we have been considering a rather optimistic scenario. In a realistic one, the gain you achieve might not be worth complicating things with such a geo-distributed setup.

What our load test does is execute four HTTP requests against the Odoo API. What the graph presents is the median response time of the blend. (I'm repeating the same graph here.)

In the blend, one of the requests stands out significantly because of a quite heavy SQL query it performs. If we consider the 95th percentile of the metric, we see that the gain is not that obvious.

I’m going to give you one more example of a case where inter-region latency is a less important issue. It turned out that Odoo CRM does not scale well with respect to the size of the data it stores in the database. The results below were captured when the crm_leads table had 10k records. Given the size of the database instances, I was able to measure only 100+100 virtual users; that load already pushed the instances to their CPU capacity limit. Neither the 95th percentile nor the median response time was impressive, and the gain expected from the geo-distributed configuration was spoiled by the poor performance of that heavy request.

As this journey is really about investigating approaches to modernizing applications in general, I’m planning to tackle the data scalability problems as well. I’m going to demonstrate how the modern state of cloud technologies helps us scale virtually any application, even those that were not designed with elasticity and scalability in mind. The journey continues. If you feel thrilled and want to jump in to help, please …?

Source code

Find all the templates and load test scripts in this repo on GitHub.

Bonus stuff: Why database proxies I played with didn’t work out

As a part of this journey, I also tried to leverage a database proxy to discriminate database-updating SQL requests and route them to the master database. I tried Pgpool-II and Heimdall Proxy. Both are absolutely amazing products that allow some really advanced scenarios. However, in the case of Odoo neither worked out well, because of how Odoo’s ORM composes transactions to guarantee ACID properties, which is a natural spoiler of good performance in a distributed world. That story is worth a dedicated article, so stay tuned.

Mahbubur Rahman

Software Engineer at ruhr.agency (ruhrdot GmbH)

1y

Thanks a lot to Maksim Aniskov for such an elaborate and resourceful article. I have only one question though: can you tell us what problem you actually faced and observed with Heimdall Proxy last time?
