AWS - optimizing Lambda usage through DynamoDB with CloudWatch Rules
Moebel.de de-duplicated event processor by Ivan Vokhmin

At moebel.de we use AWS Lambda for many projects. From event processing to page rendering, lambda absorbs a lot of our shifting load. However, there are cases where triggering lambda blindly with high concurrency causes congestion issues and high costs. Below are two such cases, and the solution I was able to architect to mitigate these issues.

Cases

We encountered two different cases where too many concurrent invocations caused issues.

Case 1: CDN purging - connecting to rate-limited APIs

Our page relies heavily on CloudFlare to cache many pre-generated pages and resources. Cache lifetimes may be long for optimization purposes, so we need to actively tell CloudFlare to refresh the cache for specific URLs. For this, we use the CloudFlare API.

Our usage of the CloudFlare purge API means that we send many purge requests for different URLs from many event producers (through an SQS queue or direct lambda invocation). When too many purge events were generated, we could hit the requests-per-minute limit of the CloudFlare API. Because of this, some purge requests failed and congestion built up (hundreds or thousands of requests that eventually failed due to processing timeouts). Worst of all, many of those requests were redundant (the same URL purged multiple times).

Case 2: Data transformation - excessive (avoidable) invocation costs

Updating the data model of our portal requires constant synchronization between a slow backend database and a fast cache layer via a data transformer service (lambda). The backend team that operates the database can make hundreds or thousands of changes, and every change creates an SQS queue event. Even with some event bundling, we still had a lot of redundant events (for example, dozens of properties of the same category changed, so the transformer had to update the cache layer multiple times, in parallel). Every time an SQS event arrives, the lambda must read the remote database, do a lot of computational work on the object id from the event, and store the processed data in the fast cache layer. This led to a significant cost increase as changes became more and more frequent. Another factor is that the computational load and memory requirements of the transforming lambda are quite big, so every invocation is expensive.

What do these cases have in common?

In both of those situations, there are some common traits that warrant a template solution:

  • Redundant invocations (invocations with the same event payload) that can be processed in a debounced way
  • Not very time-sensitive data (most stakeholders are prepared to wait some minutes or hours before the changes are processed)
  • Sporadic event generation - events usually come in big batches, but most of the time the lambda is idle
  • We cannot afford to miss events - parts of our website would become stale otherwise

Template solution

The single lambda's logic was split into two lambdas. One of them (a very small one, the consumer lambda) accepts all possible events and invocations and writes them to DynamoDB in a de-duplicated way. The other one (the worker lambda), heavier on logic, memory and CPU, is executed periodically on a schedule by a CloudWatch Rule (e.g. every 5 or 30 minutes).


[Image: Solution overview]

To ensure de-duplication, a unique property should become the HASH key (partition key), so no duplicate events can be inserted for later processing. Here is a CloudFormation example that we use to store unique URLs for later purging via the CloudFlare API.

  DynamoDBForUrls:
    Type: AWS::DynamoDB::Table
    Properties:
      AttributeDefinitions:
        - AttributeName: url
          AttributeType: S
      # PAY_PER_REQUEST is recommended for fluctuating workloads
      BillingMode: PAY_PER_REQUEST
      KeySchema:
        # Forces unique urls
        - AttributeName: url
          KeyType: HASH
      TableName: url-storage-table-name        
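
The consumer lambda only has to upsert each incoming URL into this table. Below is a minimal sketch in Python using boto3, assuming the SQS message body carries the URL in a "url" field; the handler name and the modifiedAt attribute are illustrative assumptions, not our exact implementation.

  import json
  import time

  import boto3

  dynamodb = boto3.resource("dynamodb")
  table = dynamodb.Table("url-storage-table-name")

  def handler(event, context):
      # Each SQS record is assumed to carry one URL to purge. Writing the
      # same URL twice simply overwrites the item, so duplicates collapse
      # into a single row thanks to the HASH key.
      for record in event["Records"]:
          body = json.loads(record["body"])
          table.put_item(
              Item={
                  "url": body["url"],
                  # Modification timestamp lets the worker detect
                  # "updated while processing" before deleting the entry
                  "modifiedAt": int(time.time() * 1000),
              }
          )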

The worker lambda's run starts by reading many entries at once from DynamoDB, doing the work on them (e.g. contacting the related database for exact category data and transforming it, or connecting to the CloudFlare API to purge a batch of URLs), and then deleting the processed entries from DynamoDB. To prevent a possible "update while processing", some critical events carry a modification date alongside the partition key (DynamoDB is schemaless, so extra fields can be added at will). Deletion only happens if the modification date matches. If a worker lambda encounters an empty table on its run, it exits immediately to keep costs low.
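
Below is a minimal sketch of that worker run, assuming the URL-purging use case. The purge_urls helper standing in for the CloudFlare API call and the batch size are illustrative assumptions; the conditional delete implements the "modification date must match" rule.

  import boto3
  from botocore.exceptions import ClientError

  dynamodb = boto3.resource("dynamodb")
  table = dynamodb.Table("url-storage-table-name")

  def handler(event, context):
      # Read a batch of pending entries; if the table is empty, exit
      # immediately so the scheduled run costs almost nothing
      items = table.scan(Limit=30).get("Items", [])
      if not items:
          return

      # Do the actual work on the whole batch, e.g. one CloudFlare purge call
      purge_urls([item["url"] for item in items])  # hypothetical helper

      for item in items:
          try:
              # Delete only if the entry was not updated while we were
              # processing it; otherwise it stays for the next scheduled run
              table.delete_item(
                  Key={"url": item["url"]},
                  ConditionExpression="modifiedAt = :m",
                  ExpressionAttributeValues={":m": item["modifiedAt"]},
              )
          except ClientError as e:
              if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                  raise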

Important note: the consumer lambda can scale and run concurrently; this is fine because it only writes data to DynamoDB and exits instantly. The heavier worker lambda is executed without parallelization (the longer the delay between invocations, the lower the costs thanks to de-duplication and fewer dry runs, but also the longer the overall event processing time).
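
For completeness, the scheduled trigger looks roughly like this in CloudFormation. The rate and the WorkerLambda resource name are assumptions for illustration; pick an interval that matches your latency/cost trade-off.

  WorkerScheduleRule:
    Type: AWS::Events::Rule
    Properties:
      # Run the worker lambda on a fixed schedule, e.g. every 15 minutes
      ScheduleExpression: rate(15 minutes)
      State: ENABLED
      Targets:
        - Arn: !GetAtt WorkerLambda.Arn
          Id: worker-lambda-target

  # Allow CloudWatch Events to invoke the worker lambda
  WorkerSchedulePermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref WorkerLambda
      Principal: events.amazonaws.com
      SourceArn: !GetAtt WorkerScheduleRule.Arn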

DynamoDB perfectly matches our use cases, because:

  • It is very easy to set up and use (and it scales as you go)
  • We don't need to keep data for a long time - most of the time the table is empty. With the PAY_PER_REQUEST model we keep database costs very low (a few USD per month).

Note: you may consider reserving DynamoDB capacity by using the PROVISIONED billing mode for constant workloads.

Outcomes

After the solution was implemented, we spent some time analyzing the outcomes.

Rate-limited API case

Our event processing time increased (sometimes up to an hour under very high demand). However, we never ran into the CloudFlare API rate limit or lost an event (a URL to purge) again.

Cost decrease for transformer case

After the solution was rolled out (red arrow on the next two graphs), we could process significantly more SQS events at a greatly reduced price, because all events were properly bundled together (and we saved a lot of common execution time). This happened because most events require a lot of common information from the backend database, and reading it took a significant share of the lambda execution time. Now, with one run for all events, we avoided a lot of unnecessary lambda costs.


[Image: Processed SQS events before and after solution deployment]
[Image: Processing cost before and after solution deployment]

Conclusion

While lambda is very good at handling spontaneous loads, in some cases it is beneficial to put a "middleman" like DynamoDB in place to split SQS event consumption away from the real processing logic. It can reduce costs or help respect external rate limits, at the cost of increased event processing time.


#aws #dynamodb #lambda #sqs #cloudwatch #cloudflare
