Scaling Document Processing with Serverless

Standard Metrics' mission is to accelerate innovation in the private markets. We're building a platform that aims to become a financial "lingua franca" between startups and venture firms.

We help startups share financial metrics with their investors effortlessly. One of our core services is analyzing, validating, and capturing metrics from the financial documents they share.

Our Data Solutions team is in charge of this process. They offer strong SLAs on the time from when a document is uploaded until it's analyzed and its metrics are available on the platform. This turnaround time is part of their quarterly OKRs, so they track it religiously.

The Problem

Excel files are used ubiquitously for financial data. They represent 60% of all the documents on the platform. They're so central to our workflow that we built a dedicated tool where the Data Solutions team analyzes, validates, and maps each value to its corresponding financial metric.

A JSON representation of the Excel file powers the tool's UI. Documents uploaded to our platform are stored in S3 and scanned for viruses. Later, a Celery task within our Kubernetes cluster transforms the document into its JSON representation and uploads it back to S3 for the UI to consume.
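Conceptually, that JSON is a serialized workbook: sheets of rows of cells that the UI can render and the Data Solutions team can map to metrics. A minimal sketch of what such a representation might look like; the type and field names below are illustrative assumptions, not our actual schema:

```typescript
// Illustrative sketch only — the real schema is internal.
// A workbook serialized as sheets of rows of typed cells, so the UI can
// render the spreadsheet and let reviewers map cells to financial metrics.
interface CellValue {
  address: string;                      // e.g. "B12"
  raw: string;                          // value exactly as it appears in the Excel file
  parsed?: number | string | null;      // normalized value, when it can be parsed
}

interface SheetJson {
  name: string;
  rows: CellValue[][];
}

interface WorkbookJson {
  documentId: string;                   // hypothetical identifier of the uploaded file
  sheets: SheetJson[];
}
```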


High-level architecture of Document Processing v1

The original implementation of the document processing queue had several issues. Failed documents would get stuck in the queue, causing bottlenecks, and the lack of visibility into what was going wrong forced us to be reactive, putting out fires as they popped up.

Investigating these issues was a time sink for engineering and negatively impacted the Data Solutions team's SLAs and OKRs.

Step Zero

Understanding the problem with instrumentation

The initial theory was that Calamine, the library we used to read the Excel files, was consuming too much memory for big files or files with images and getting killed by the Kubernetes controller, blocking other tasks in the same Pod when it was rescheduled.

Our highest priority was instrumenting the Celery task to gain visibility into the failures. You can't fix what you don't understand.

We improved logs and metrics on the task to track "Time in Queue" and the number of "Items in the Queue" (sketched in code after the list of alarms below).

We added two alarms:

  • When the time in queue exceeded 30 minutes.
  • When there were more than 25 files in the queue for more than 15 minutes.
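As a rough illustration of that instrumentation, here is how those two metrics could be emitted. This is a sketch only, written in TypeScript for consistency with the rest of this post even though the original task runs as Python/Celery; the metric names and the StatsD client are assumptions:

```typescript
// Sketch: emitting "Time in Queue" and "Items in the Queue" as custom metrics.
import { StatsD } from "hot-shots";

const metrics = new StatsD({ prefix: "doc_processing." });

export function recordQueueMetrics(enqueuedAt: Date, itemsInQueue: number): void {
  const timeInQueueMs = Date.now() - enqueuedAt.getTime();

  // Alarm #1 watches this timing for values above 30 minutes.
  metrics.timing("queue.time_in_queue", timeInQueueMs);

  // Alarm #2 fires when this gauge stays above 25 for more than 15 minutes.
  metrics.gauge("queue.items", itemsInQueue);
}
```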

Dashboards and alarms (v1) — Time in queue clearly increases with load, up to a point where the queue halts processing.

The turning point

Things improved significantly. Instead of finding issues days later, engineers got alerts within hours. On the flip side, this alarm was triggering as often as three times per week. It was a step in the right direction, but things still needed to be better to meet the Data Solutions team's SLAs.

We also noticed a new pattern: during reporting season (when startups share more data with their investors), time in queue went up, with individual tasks taking up to 20 minutes to run.

With all this new information, and knowing this was happening more often than initially thought, we carved out space in our roadmap to find a solution.


Doc Processing v2

With a better understanding of how things needed to work, the team proposed a new architecture for Doc Processing where queueing and JSON generation would happen outside our core backend, using AWS Lambda and SQS instead of Celery tasks running inside our Kubernetes cluster.

Our goals for the project were:

  • Better isolation. Processing should not affect any other tasks on the platform, and when a task fails, it should be retried automatically without blocking other operations.
  • Build with good bones. Good instrumentation makes understanding problems easier. We needed better guard rails against failures, with strong retry guarantees from the infrastructure and the ability to “replay” failures (see the sketch after this list). When there are issues, we want to understand why they happen and make sure they won't happen again.
  • Dogfooding and integrating with the core service. The plan was to leverage our existing backend codebase as much as possible.
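The “replay” ability above is essentially the power to move messages from the dead-letter queue back onto the main queue once the underlying problem is fixed. A minimal sketch of what that could look like with the AWS SDK for JavaScript v3; the queue URLs, environment variables, and batch size are assumptions, not our actual configuration:

```typescript
// Sketch: re-drive messages from the DLQ back to the main processing queue.
import {
  SQSClient,
  ReceiveMessageCommand,
  SendMessageCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const DLQ_URL = process.env.DLQ_URL!;               // hypothetical env vars
const MAIN_QUEUE_URL = process.env.MAIN_QUEUE_URL!;

export async function replayDlq(): Promise<void> {
  while (true) {
    const { Messages } = await sqs.send(
      new ReceiveMessageCommand({ QueueUrl: DLQ_URL, MaxNumberOfMessages: 10 })
    );
    if (!Messages || Messages.length === 0) break;

    for (const message of Messages) {
      // Put the original payload back on the main queue…
      await sqs.send(
        new SendMessageCommand({ QueueUrl: MAIN_QUEUE_URL, MessageBody: message.Body! })
      );
      // …then remove it from the DLQ so it isn't replayed twice.
      await sqs.send(
        new DeleteMessageCommand({ QueueUrl: DLQ_URL, ReceiptHandle: message.ReceiptHandle! })
      );
    }
  }
}
```

SQS also offers a built-in dead-letter queue redrive from the AWS console, which covers the common case without any custom code.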


High-level implementation

We made the strategic decision to go serverless, integrating directly with events from S3 via EventBridge.

A Lambda function picks up documents when the antivirus marks them as clean and queues them for processing. When a document is processed, its JSON representation is uploaded back to S3, and our core backend is notified of the status via a webhook. Thanks to the SQS integration, files are automatically retried three times in case of failure and sent to a dead-letter queue (DLQ) when they're unprocessable.

We limit the volume of requests to the backend with Lambda’s reserved concurrency settings. Any spikes are processed slowly, minimizing contention for resources with other tasks running in the backend.
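A rough sketch of how this wiring could be expressed in AWS CDK (TypeScript). Everything here is illustrative: the three retries and the reserved-concurrency idea come from the description above, but the construct names, bucket name, handler paths, timeouts, and the concrete concurrency value are assumptions:

```typescript
// Sketch: S3 events (via EventBridge) → intake Lambda → SQS → processor Lambda,
// with a dead-letter queue after three failed attempts.
import * as cdk from "aws-cdk-lib";
import * as sqs from "aws-cdk-lib/aws-sqs";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as events from "aws-cdk-lib/aws-events";
import * as targets from "aws-cdk-lib/aws-events-targets";
import { SqsEventSource } from "aws-cdk-lib/aws-lambda-event-sources";

export class DocProcessingStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    // Unprocessable documents end up here.
    const dlq = new sqs.Queue(this, "DocProcessingDlq");

    const queue = new sqs.Queue(this, "DocProcessingQueue", {
      visibilityTimeout: cdk.Duration.minutes(15),
      deadLetterQueue: { queue: dlq, maxReceiveCount: 3 }, // three automatic retries
    });

    // Picks up documents once the antivirus marks them clean and queues them.
    const intake = new lambda.Function(this, "DocIntake", {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: "intake.handler",
      code: lambda.Code.fromAsset("dist"),
      environment: { QUEUE_URL: queue.queueUrl },
    });
    queue.grantSendMessages(intake);

    // EventBridge rule for "object created" events on the documents bucket
    // (bucket name and event pattern are hypothetical).
    new events.Rule(this, "CleanDocumentUploaded", {
      eventPattern: {
        source: ["aws.s3"],
        detailType: ["Object Created"],
        detail: { bucket: { name: ["documents-bucket"] } },
      },
      targets: [new targets.LambdaFunction(intake)],
    });

    // Reserved concurrency caps how hard we can hit the core backend's webhook.
    const processor = new lambda.Function(this, "DocProcessor", {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: "processor.handler",
      code: lambda.Code.fromAsset("dist"),
      reservedConcurrentExecutions: 5, // illustrative value
      timeout: cdk.Duration.minutes(5),
    });
    processor.addEventSource(new SqsEventSource(queue, { batchSize: 1 }));
  }
}
```

Reserved concurrency is the knob that turns a traffic spike into a slow, steady drain on the queue instead of a thundering herd against the backend.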


High-level architectural diagram of the new version of Doc Processing.


Observability and Metrics

Observability is a first-class citizen this time around. We use DataDog to trace the process end to end. Logs from all our Lambda functions and the backend are centralized in DataDog too, making it much simpler to find the root cause of any issue.

We implemented alarms for processing time, queue length, and documents in the DLQ. We can trace by document name, document ID, request ID, or execution ID and follow the flow from when a file is uploaded to when it's queued, picked up for processing, and all the way to the webhook in the backend.
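As an illustration of how that correlation can work, tagging the active span with document identifiers makes every trace searchable by them in DataDog. A sketch using the dd-trace Node.js client; the function and tag names are assumptions:

```typescript
// Sketch: tag the current trace so it can be found by document name or ID.
import tracer from "dd-trace";

tracer.init(); // normally done once, at Lambda cold start

export function tagDocumentTrace(documentId: string, documentName: string): void {
  const span = tracer.scope().active();
  if (span) {
    span.setTag("document.id", documentId);     // illustrative tag names
    span.setTag("document.name", documentName);
  }
}
```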

Processing files is much faster. Our average processing time is now 8 seconds, down from 20 minutes, and the maximum processing time went from 4 days down to 3 minutes. Because of that, we also alarm much earlier now: when a document has been queued for 10 minutes (down from 30 minutes) or when there are any files in the DLQ.


Doc Processing v2 — Dashboards and Alarms.

Scaling with spikes

In our previous implementation, heavier traffic was a major pain point. It usually meant the queue slowed down to a crawl. When tasks spiked to 3x to 4x our regular traffic, resource contention started bleeding into other endpoints and Celery tasks.

We handle spikier loads gracefully now. Our most recent spike was 35x our average traffic. It was very noticeable on the dashboards compared with our average weekly traffic; however, its impact was negligible.


Traffic spikes — Big spike (blue rectangle) vs normal weekly traffic (green rectangles).


Learning Lessons

This project was a success, and we learned a lot while building it. We ran the two versions side by side for a week and identified all the failure modes in processing that had been hidden before. We migrated all traffic in about three weeks, including a few extra features, with no major issues.

Our big wins were:

  • We reduced the time in queue and improved the Data Solutions team's experience with extra features and a better integration with the antivirus workflow.
  • We improved error handling, processing speed, traffic spike handling, instrumentation, and time to recovery, and made our runbooks a lot simpler.

Some trade-offs:

  • This project adds extra complexity as we are adding more infrastructure and a new pattern with serverless. All engineers will need time to understand and work with this new stack. Testing locally is also more difficult.
  • We had to choose between TypeScript and Python for Lambda. We went with TypeScript because Lambda makes it much easier to work with Node (both infra and logic use the same codebase).

Overall, we are very happy with this new implementation. Not having to think about “the queue” anymore was a win for both the Data Solutions and Engineering teams.

For the Data Solutions team specifically, it was a huge quality-of-life improvement: they now focus on the parts of their day-to-day that matter (and not on how they are implemented). They also had a huge win on their SLAs last quarter!

It's nice to think the engineering team helped them achieve it.



We are hiring!

Standard Metrics is working with the top venture capital firms in the world, building the "lingua franca" of the innovation economy. We are a team of builders with a strong focus on customer impact. We value velocity and momentum, strong engineering practices, and learning from issues. We build with good bones! ;)

If this resonates with you and you want to explore how we build and what we're building, let's talk!

https://standardmetrics.io/careers/


Thanks to Samuel Joli, Viru Kanjilal, and Mandar Gaitonde for helping with this article.

This article was originally published on my personal blog — Coffee Engineering.
