AWS - optimizing Lambda usage through DynamoDB with CloudWatch Rules
Ivan Vokhmin
Lead Engineer Frontend @ moebel.de Einrichten & Wohnen GmbH | AWS, Team Leadership, Software Architecture, AI
At moebel.de we use AWS Lambda for many projects. From event processing to page rendering, Lambda absorbs a lot of shifting load. However, there are cases where triggering Lambda blindly with high concurrency causes congestion issues and high costs. Below are two such cases, and the solution I was able to architect to mitigate these issues.
Cases
We encountered two different cases where too many concurrent invocations caused issues.
Case 1: CDN purging - connecting to rate-limited APIs
Our site heavily depends on CloudFlare to cache many pre-generated pages and resources. Cache times can be long for optimization purposes, so we need to actively tell CloudFlare to refresh the cache for specific URLs. For this, we use the CloudFlare API.
Our usage of the CloudFlare purge API means that we send many purge requests for different URLs from many event producers (through an SQS queue or direct Lambda invocation). When too many purge events were generated, we could hit the requests-per-minute limit of the CloudFlare API. Because of this, some purge requests failed and congestion built up (hundreds or thousands of requests that eventually failed due to processing timeouts). Worst of all, many of those requests were redundant (the same URL purged multiple times).
Case 2: Data transformation - excessive (avoidable) invocation costs
Updating the data model of our portal requires constant synchronization between a slow backend database and a fast cache layer via a data transformer service (a Lambda). The backend team that operates the database can make hundreds or thousands of changes, and every change creates an SQS queue event. Even with some event bundling, we still get a lot of redundant events (e.g. dozens of properties of the same category changed => the transformer must update the cache layer multiple times, in parallel). Every time an SQS event arrives, the Lambda must read the remote database, do a lot of computational work on the object ID from the event, and store the processed data in the fast cache layer. This led to a significant cost increase as changes became more and more frequent. Another factor is that the computational load and memory requirements of the transforming Lambda are quite large (so every invocation is expensive).
What do these cases have in common?
In both of these situations there are some common traits that warrant a template solution: events arrive in bursts, many of them are redundant (the same URL or object ID repeated), the actual processing is expensive or rate-limited, and the results do not need to be produced in real time.
Template solution
The single Lambda's logic was split into two Lambdas. One of them (a very small one, the consumer Lambda) accepts all possible events and invocations and writes them to DynamoDB in a de-duplicated way. The other one (the worker Lambda), heavier on logic, memory and CPU, is executed periodically on a schedule by a CloudWatch Rule (e.g. every 5 or 30 minutes).
To ensure de-duplication, a unique property should become the HASH key (partition key), so no duplicate events can be inserted for later processing. Here is a CloudFormation example that we use to store unique URLs for later CloudFlare API purging.
DynamoDBForUrls:
  Type: AWS::DynamoDB::Table
  Properties:
    AttributeDefinitions:
      - AttributeName: url
        AttributeType: S
    # PAY_PER_REQUEST is recommended for fluctuating workloads
    BillingMode: PAY_PER_REQUEST
    KeySchema:
      # Forces unique urls
      - AttributeName: url
        KeyType: HASH
    TableName: url-storage-table-name
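For illustration, here is a minimal sketch of what the consumer Lambda could look like in TypeScript (AWS SDK v3). It assumes the table above, an SQS payload that contains a url field, and a modifiedAt attribute used later by the worker; the payload shape and attribute name are assumptions for the sketch, not our exact production code.

// Consumer Lambda (sketch): accepts SQS events and writes each URL into DynamoDB.
// Because "url" is the partition key, repeated writes for the same URL overwrite
// the same item, so duplicates collapse into a single pending entry.
import { SQSEvent } from "aws-lambda";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE_NAME = "url-storage-table-name";

export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    // Assumed payload shape: { "url": "https://..." }
    const { url } = JSON.parse(record.body) as { url: string };

    await docClient.send(
      new PutCommand({
        TableName: TABLE_NAME,
        Item: {
          url,
          modifiedAt: Date.now(), // used by the worker for the conditional delete
        },
      })
    );
  }
};

The consumer does nothing else: it writes and exits, which is why it can run with high concurrency at negligible cost.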
The worker Lambda's workflow starts by reading many entries at once from DynamoDB, doing the work on them (e.g. contacting the related database for exact category data and transforming it, or connecting to the CloudFlare API to purge a batch of URLs), and then deleting the processed entries from DynamoDB. To prevent a possible "update while processing", some critical events carry a modification date in addition to the partition key (DynamoDB is schemaless, so extra fields can be added at will). Deletion only happens if the modification date matches. If the worker Lambda encounters an empty table on its run, it exits immediately to keep costs low.
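Below is a minimal sketch of such a worker Lambda in TypeScript for the CDN-purge case, under the same assumptions as the consumer sketch above. The purgeUrls function is only a placeholder for the real CloudFlare API call, and the batch size is an example value (check CloudFlare's documentation for the actual per-request URL limit).

// Worker Lambda (sketch): runs on a CloudWatch Rule schedule (e.g. every 5 minutes),
// reads pending URLs from DynamoDB, purges them in batches, and deletes only the
// entries that were not re-written while we were processing them.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import {
  DynamoDBDocumentClient,
  ScanCommand,
  DeleteCommand,
} from "@aws-sdk/lib-dynamodb";

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE_NAME = "url-storage-table-name";
const BATCH_SIZE = 30; // example value; keep below CloudFlare's per-request URL limit

// Placeholder for the real CloudFlare purge call
// (POST /zones/{zone_id}/purge_cache with the collected URLs).
async function purgeUrls(urls: string[]): Promise<void> {
  console.log(`purging ${urls.length} URLs`);
}

export const handler = async (): Promise<void> => {
  // Pagination via LastEvaluatedKey is omitted for brevity.
  const result = await docClient.send(new ScanCommand({ TableName: TABLE_NAME }));
  const items = (result.Items ?? []) as { url: string; modifiedAt: number }[];

  // Empty table => nothing to do, exit immediately to keep costs low.
  if (items.length === 0) return;

  for (let i = 0; i < items.length; i += BATCH_SIZE) {
    const batch = items.slice(i, i + BATCH_SIZE);
    await purgeUrls(batch.map((item) => item.url));

    // Delete each processed entry only if it was not updated in the meantime.
    await Promise.all(
      batch.map((item) =>
        docClient
          .send(
            new DeleteCommand({
              TableName: TABLE_NAME,
              Key: { url: item.url },
              ConditionExpression: "modifiedAt = :seen",
              ExpressionAttributeValues: { ":seen": item.modifiedAt },
            })
          )
          .catch((err: Error) => {
            // Condition failed => the URL was re-queued while we were processing;
            // leave it for the next scheduled run.
            if (err.name !== "ConditionalCheckFailedException") throw err;
          })
      )
    );
  }
};

The conditional delete implements the "deletion only happens if the modification date matches" rule: if the consumer re-wrote a URL while we were purging, the delete fails and the entry is simply picked up on the next scheduled run.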
Important note: while the consumer Lambda can scale and run concurrently, this is fine because it only writes data to DynamoDB and exits almost instantly. The heavier worker Lambda is executed without parallelization (the longer the delay between invocations, the lower the costs thanks to de-duplication and fewer dry runs, but also the longer the overall event processing time).
DynamoDB perfectly matches our use cases: the partition key enforces uniqueness, giving us de-duplication on write for free; the table is schemaless, so extra fields such as a modification date can be added at will; and PAY_PER_REQUEST billing fits our fluctuating workloads.
Note: you may consider reserving DynamoDB capacity by using the PROVISIONED billing mode for constant workloads.
Outcomes
After the solution was implemented, we spent some time analyzing the outcomes.
Rate-limited API case
Our event processing time increased (sometimes up to an hour in very high-demand cases). However, we never ran into the CloudFlare API rate limit or lost an event (a URL to purge) again.
Cost decrease for transformer case
After the solution was rolled out (red arrow on the next two graphs), we could process significantly more SQS events at a greatly reduced price, because all events were properly bundled together (and we saved a lot of shared execution time). Most events require a lot of common information from the backend database, and reading it took a significant share of the Lambda execution time. Now, with one run for all events, we spared a lot of avoidable Lambda costs.
Conclusion
While Lambda is very good at handling spontaneous loads, in some cases it is beneficial to put a "middleman" like DynamoDB between SQS event consumption and the real processing logic. It can reduce costs or help adhere to external rate limits, at the cost of increased event processing time.