Scaling Document Processing with Serverless

Standard Metrics' mission is to accelerate innovation in the private markets. We're building a platform that aims to become a financial "lingua franca" between startups and venture firms.

We help startups share financial metrics with their investors effortlessly. One of our core services is analyzing, validating, and capturing metrics from the financial documents they share.

Our Data Solutions team is in charge of this process. They offer strong SLAs on the time from when a document is uploaded until it's analyzed and its metrics are available on the platform. This turnaround time is part of their quarterly OKRs, so they track it religiously.

The Problem

Excel files are used ubiquitously for financial data. They represent 60% of all the documents on the platform. They're so central to our workflow that we built a dedicated tool where the Data Solutions team analyzes, validates, and maps each value to its corresponding financial metric.

A JSON representation of the Excel file powers the tool's UI. Documents uploaded to our platform are stored in S3 and scanned for viruses. Later, a Celery task within our Kubernetes cluster transforms the document into its JSON representation and uploads it back to S3 for the UI to consume.
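Conceptually, that JSON is a serialized workbook: sheets of rows of cells that the UI can render and the Data Solutions team can map to metrics. A minimal sketch of what such a representation might look like; the type and field names below are illustrative assumptions, not our actual schema:

```typescript
// Illustrative sketch only — the real schema is internal.
// A workbook serialized as sheets of rows of typed cells, so the UI can
// render the spreadsheet and let reviewers map cells to financial metrics.
interface CellValue {
  address: string;                      // e.g. "B12"
  raw: string;                          // value exactly as it appears in the Excel file
  parsed?: number | string | null;      // normalized value, when it can be parsed
}

interface SheetJson {
  name: string;
  rows: CellValue[][];
}

interface WorkbookJson {
  documentId: string;                   // hypothetical identifier of the uploaded file
  sheets: SheetJson[];
}
```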


High-level architecture of Document Processing v1

The original implementation of the document processing queue had several issues. Failed documents would get stuck in the queue, causing bottlenecks, and the lack of visibility into what was going wrong forced us to be reactive, putting out fires as they popped up.

Investigating these issues was a time sink for engineering and negatively impacted the Data Solutions team's SLAs and OKRs.

Step Zero

Understanding the problem with instrumentation

The initial theory was that Calamine, the library we used to read the Excel files, was consuming too much memory for big files or files with images and getting killed by the Kubernetes controller, blocking other tasks in the same Pod when it was rescheduled.

Our highest priority was instrumenting the Celery task to gain visibility into the failures. You can't fix what you don't understand.

We improved logs and metrics on the task to track "Time in Queue" and the number of "Items in the Queue" (sketched in code after the list of alarms below).

We added two alarms:

  • When the time in queue exceeded 30 minutes.
  • When there were more than 25 files in the queue for more than 15 minutes.
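As a rough illustration of that instrumentation, here is how those two metrics could be emitted. This is a sketch only, written in TypeScript for consistency with the rest of this post even though the original task runs as Python/Celery; the metric names and the StatsD client are assumptions:

```typescript
// Sketch: emitting "Time in Queue" and "Items in the Queue" as custom metrics.
import { StatsD } from "hot-shots";

const metrics = new StatsD({ prefix: "doc_processing." });

export function recordQueueMetrics(enqueuedAt: Date, itemsInQueue: number): void {
  const timeInQueueMs = Date.now() - enqueuedAt.getTime();

  // Alarm #1 watches this timing for values above 30 minutes.
  metrics.timing("queue.time_in_queue", timeInQueueMs);

  // Alarm #2 fires when this gauge stays above 25 for more than 15 minutes.
  metrics.gauge("queue.items", itemsInQueue);
}
```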

Dashboards and alarms (v1) — Time in queue clearly increases with load, up to a point where the queue halts processing.

The turning point

Things improved significantly. Instead of finding issues days later, engineers got alerts within hours. On the flip side, this alarm was triggering as often as three times per week. It was a step in the right direction, but things still needed to be better to meet the Data Solutions team's SLAs.

We also noticed a new pattern: during reporting season (when startups share more data with their investors), time in queue went up, with individual tasks taking up to 20 minutes to run.

With all this new information, and knowing this was happening more often than initially thought, we carved out space in our roadmap to find a solution.


Doc Processing v2

With a better understanding of how things needed to work, the team proposed a new architecture for Doc Processing where queueing and JSON generation would happen outside our core backend, using AWS Lambda and SQS instead of Celery tasks running inside our Kubernetes cluster.

Our goals for the project were:

  • Better isolation. Processing should not affect any other tasks on the platform, and when a task fails, it should be retried automatically without blocking other operations.
  • Build with good bones. Good instrumentation makes understanding problems easier. We needed better guard rails against failures, with strong retry guarantees from the infrastructure and the ability to “replay” failures (see the sketch after this list). When there are issues, we want to understand why they happen and make sure they won't happen again.
  • Dogfooding and integrating with the core service. The plan was to leverage our existing backend codebase as much as possible.
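The “replay” ability above is essentially the power to move messages from the dead-letter queue back onto the main queue once the underlying problem is fixed. A minimal sketch of what that could look like with the AWS SDK for JavaScript v3; the queue URLs, environment variables, and batch size are assumptions, not our actual configuration:

```typescript
// Sketch: re-drive messages from the DLQ back to the main processing queue.
import {
  SQSClient,
  ReceiveMessageCommand,
  SendMessageCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const DLQ_URL = process.env.DLQ_URL!;               // hypothetical env vars
const MAIN_QUEUE_URL = process.env.MAIN_QUEUE_URL!;

export async function replayDlq(): Promise<void> {
  while (true) {
    const { Messages } = await sqs.send(
      new ReceiveMessageCommand({ QueueUrl: DLQ_URL, MaxNumberOfMessages: 10 })
    );
    if (!Messages || Messages.length === 0) break;

    for (const message of Messages) {
      // Put the original payload back on the main queue…
      await sqs.send(
        new SendMessageCommand({ QueueUrl: MAIN_QUEUE_URL, MessageBody: message.Body! })
      );
      // …then remove it from the DLQ so it isn't replayed twice.
      await sqs.send(
        new DeleteMessageCommand({ QueueUrl: DLQ_URL, ReceiptHandle: message.ReceiptHandle! })
      );
    }
  }
}
```

SQS also offers a built-in dead-letter queue redrive from the AWS console, which covers the common case without any custom code.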


High-level implementation

We made the strategic decision to go serverless, integrating directly with events from S3 via EventBridge.

A Lambda function picks up documents when the antivirus marks them as clean and queues them for processing. When a document is processed, its JSON representation is uploaded back to S3, and our core backend is notified of the status via a webhook. Thanks to the SQS integration, files are automatically retried three times in case of failure and sent to a dead-letter queue (DLQ) when they're unprocessable.

We limit the volume of requests to the backend with Lambda’s reserved concurrency settings. Any spikes are processed slowly, minimizing contention for resources with other tasks running in the backend.
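A rough sketch of how this wiring could be expressed in AWS CDK (TypeScript). Everything here is illustrative: the three retries and the reserved-concurrency idea come from the description above, but the construct names, bucket name, handler paths, timeouts, and the concrete concurrency value are assumptions:

```typescript
// Sketch: S3 events (via EventBridge) → intake Lambda → SQS → processor Lambda,
// with a dead-letter queue after three failed attempts.
import * as cdk from "aws-cdk-lib";
import * as sqs from "aws-cdk-lib/aws-sqs";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as events from "aws-cdk-lib/aws-events";
import * as targets from "aws-cdk-lib/aws-events-targets";
import { SqsEventSource } from "aws-cdk-lib/aws-lambda-event-sources";

export class DocProcessingStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    // Unprocessable documents end up here.
    const dlq = new sqs.Queue(this, "DocProcessingDlq");

    const queue = new sqs.Queue(this, "DocProcessingQueue", {
      visibilityTimeout: cdk.Duration.minutes(15),
      deadLetterQueue: { queue: dlq, maxReceiveCount: 3 }, // three automatic retries
    });

    // Picks up documents once the antivirus marks them clean and queues them.
    const intake = new lambda.Function(this, "DocIntake", {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: "intake.handler",
      code: lambda.Code.fromAsset("dist"),
      environment: { QUEUE_URL: queue.queueUrl },
    });
    queue.grantSendMessages(intake);

    // EventBridge rule for "object created" events on the documents bucket
    // (bucket name and event pattern are hypothetical).
    new events.Rule(this, "CleanDocumentUploaded", {
      eventPattern: {
        source: ["aws.s3"],
        detailType: ["Object Created"],
        detail: { bucket: { name: ["documents-bucket"] } },
      },
      targets: [new targets.LambdaFunction(intake)],
    });

    // Reserved concurrency caps how hard we can hit the core backend's webhook.
    const processor = new lambda.Function(this, "DocProcessor", {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: "processor.handler",
      code: lambda.Code.fromAsset("dist"),
      reservedConcurrentExecutions: 5, // illustrative value
      timeout: cdk.Duration.minutes(5),
    });
    processor.addEventSource(new SqsEventSource(queue, { batchSize: 1 }));
  }
}
```

Reserved concurrency is the knob that turns a traffic spike into a slow, steady drain on the queue instead of a thundering herd against the backend.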


High-level architectural diagram of the new version of Doc Processing.


Observability and Metrics

Observability is a first-class citizen this time around. We use DataDog to trace the process end to end. Logs from all our Lambda functions and the backend are centralized in DataDog too, making it much simpler to find the root cause of any issue.

We implemented alarms for processing time, queue length, and documents in the DLQ. We can trace by document name, document ID, request ID, or execution ID and follow the flow from when a file is uploaded to when it's queued, picked up for processing, and all the way to the webhook in the backend.
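As an illustration of how that correlation can work, tagging the active span with document identifiers makes every trace searchable by them in DataDog. A sketch using the dd-trace Node.js client; the function and tag names are assumptions:

```typescript
// Sketch: tag the current trace so it can be found by document name or ID.
import tracer from "dd-trace";

tracer.init(); // normally done once, at Lambda cold start

export function tagDocumentTrace(documentId: string, documentName: string): void {
  const span = tracer.scope().active();
  if (span) {
    span.setTag("document.id", documentId);     // illustrative tag names
    span.setTag("document.name", documentName);
  }
}
```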

Processing files is much faster. Our average processing time is now 8 seconds, down from 20 minutes, and the maximum processing time went from 4 days down to 3 minutes. Because of that, we also alarm much earlier now: when a document has been queued for 10 minutes (down from 30 minutes) or when there are any files in the DLQ.


Doc Processing v2 — Dashboards and Alarms.

Scaling with spikes

In our previous implementation, heavier traffic was a major pain point. It usually meant the queue slowed down to a crawl. When tasks spiked to 3x to 4x our regular traffic, resource contention started bleeding into other endpoints and Celery tasks.

We handle spikier loads gracefully now. Our most recent spike was 35x our average traffic. It was very noticeable on the dashboards compared with our average weekly traffic; however, its impact was negligible.


Traffic spikes — Big spike (blue rectangle) vs normal weekly traffic (green rectangles).


Learning Lessons

This project was a success, and we learned a lot while building it. We ran the two versions side by side for a week and identified all the failure modes in processing that had been hidden before. We migrated all traffic in about three weeks, including a few extra features, with no major issues.

Our big wins were:

  • We reduced the time in queue and improved the Data Solutions team's experience with extra features and a better integration with the antivirus workflow.
  • We improved error handling, processing speed, traffic spike handling, instrumentation, and time to recovery, and made our runbooks a lot simpler.

Some trade-offs:

  • This project adds extra complexity as we are adding more infrastructure and a new pattern with serverless. All engineers will need time to understand and work with this new stack. Testing locally is also more difficult.
  • We had to choose between TypeScript and Python for Lambda. We went with TypeScript because Lambda makes it much easier to work with Node (both infra and logic use the same codebase).

Overall, we are very happy with this new implementation. Not having to think about “the queue” anymore was a win for both the Data Solutions and Engineering teams.

For the Data Solutions team specifically, it was a huge quality-of-life improvement: they now focus on the parts of their day-to-day that matter (and not on how they are implemented). They also had a huge win on their SLAs last quarter!

It's nice to think the engineering team helped them achieve it.



We are hiring!

Standard Metrics is working with the top venture capital firms in the world, building the "lingua franca" of the innovation economy. We are a team of builders with a strong focus on customer impact. We value velocity and momentum, strong engineering practices, and learning from issues. We build with good bones! ;)

If this resonates with you and you want to explore how we build and what we're building, let's talk!

https://standardmetrics.io/careers/


Thanks to Samuel Joli, Viru Kanjilal, and Mandar Gaitonde for helping with this article.

This article was originally published on my personal blog — Coffee Engineering.
