Streaming an avalanche of data with Go’s io package
In Go, using the io package (io.Pipe(), io.Reader/io.WriteCloser) to stream data lets you avoid reading all the data into memory.
If you prefer reading code to my babbling, here's a gist with revisions, and the playground links before and after the refactor (but the last part, The Final Fine-tuning, is not covered).
One fine day I started working on a task to extract a bunch of data based on a user-specified filter, transform it into CSV format, and upload it to a network location.
“That’s rather typical,” thought I naively. So here I went:
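It went roughly like this. The sketch below is a reconstruction for illustration, not the actual production code: the item fields, the filter type, and the upload stub are all placeholders of mine.

```go
package report

import (
	"bytes"
	"context"
	"encoding/csv"
	"fmt"
	"strconv"
)

// item and filter are illustrative stand-ins for the real domain types.
type item struct {
	ID     string
	Amount int64
}

type filter struct {
	Status string
}

// getData loads every matching record into memory at once.
func getData(ctx context.Context, f filter) ([]item, error) {
	// In the real code this queries the database; here it just fakes two rows.
	return []item{{ID: "a-1", Amount: 100}, {ID: "a-2", Amount: 250}}, nil
}

// toCSV turns the whole slice into one in-memory CSV payload.
func toCSV(items []item) ([]byte, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	for _, it := range items {
		if err := w.Write([]string{it.ID, strconv.FormatInt(it.Amount, 10)}); err != nil {
			return nil, err
		}
	}
	w.Flush()
	return buf.Bytes(), w.Error()
}

// upload stands in for the SDK call that ships the payload to the network location.
func upload(ctx context.Context, data []byte) error {
	fmt.Printf("uploading %d bytes\n", len(data))
	return nil
}

// Export wires the three steps together: everything lives in RAM between steps.
func Export(ctx context.Context, f filter) error {
	items, err := getData(ctx, f)
	if err != nil {
		return err
	}
	data, err := toCSV(items)
	if err != nil {
		return err
	}
	return upload(ctx, data)
}
```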
Nice, modular, and testable. (See the Playground link for a demo.) What could go wrong?
Enter load testing…
This was just a few million records, and it's Xendit we're talking about here: a single request could easily hit that number. This wouldn't work.
Solutioning
Let’s identify the root cause of this RAM spike.
Apparently, with the interface getData(context.Context, filter) ([]item, error), I was storing all the items in RAM. To reduce RAM usage, I needed to find a way to stream the data from the DB directly to the network location.
io Package to the Rescue
Taking a closer look, the upload SDK takes an io.Reader as input, like most other Go libraries involving input/output. Therefore, instead of building the whole CSV payload in memory and handing it over as a byte slice, my code should be refactored to hand the uploader an io.Reader.
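Roughly, the boundary shifts like this (placeholder names of my own; the post doesn't show the exact signatures):

```go
package report

import (
	"context"
	"io"
)

// Before: the caller must materialize the whole CSV payload first.
func uploadAll(ctx context.Context, data []byte) error {
	// ... hand the byte slice to the upload SDK ...
	return nil
}

// After: the caller hands over a stream, and the SDK pulls bytes as it uploads.
func uploadStream(ctx context.Context, r io.Reader) error {
	// ... pass r straight through to the SDK ...
	return nil
}
```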
Now I have a Reader; how do I write to it? Go's io package has a function Pipe() that returns a connected pair of Reader (PipeReader) and WriteCloser (PipeWriter).
Hence, I should create the pair, pass the reader to the upload interface, and use another goroutine to perform the writing (if I don’t use a goroutine, the reader and writer will block each other).
To perform the writing, I need to poll the database results and write them to the pipe. So the main logic becomes:
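Here is a self-contained sketch of that shape. The table, column names, SQL placeholder style, and the upload stub are assumptions of mine; the real code is in the linked playground.

```go
package report

import (
	"context"
	"database/sql"
	"encoding/csv"
	"fmt"
	"io"
	"strconv"
)

type filter struct{ Status string }

// getData now hands back the live result set instead of a fully loaded slice.
func getData(ctx context.Context, db *sql.DB, f filter) (*sql.Rows, error) {
	return db.QueryContext(ctx,
		"SELECT id, amount FROM items WHERE status = $1", f.Status)
}

// upload stands in for the SDK call, which accepts an io.Reader.
func upload(ctx context.Context, r io.Reader) error {
	n, err := io.Copy(io.Discard, r) // the real SDK streams this to the network
	fmt.Printf("uploaded %d bytes\n", n)
	return err
}

// Export connects the DB rows to the uploader through io.Pipe, so only a row
// or two is ever held in memory at a time.
func Export(ctx context.Context, db *sql.DB, f filter) error {
	pr, pw := io.Pipe()

	// The writing happens in its own goroutine: a write to the pipe blocks
	// until the reader side consumes it, so doing both in one goroutine
	// would simply deadlock.
	go func() {
		rows, err := getData(ctx, db, f)
		if err != nil {
			pw.CloseWithError(err)
			return
		}
		defer rows.Close()

		w := csv.NewWriter(pw)
		for rows.Next() {
			var id string
			var amount int64
			if err := rows.Scan(&id, &amount); err != nil {
				pw.CloseWithError(err)
				return
			}
			if err := w.Write([]string{id, strconv.FormatInt(amount, 10)}); err != nil {
				pw.CloseWithError(err)
				return
			}
		}
		w.Flush()
		if err := rows.Err(); err != nil {
			pw.CloseWithError(err)
			return
		}
		// Closing the writer (with a nil error) is what gives the reader EOF.
		pw.CloseWithError(w.Error())
	}()

	return upload(ctx, pr)
}
```

The key detail is pw.CloseWithError: it is how the writing goroutine tells the reading side that the stream ended, whether cleanly or with an error.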
The datastore interface would now return the rows:
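Something along these lines, with illustrative names of my own (the post mentions sqlx, hence *sqlx.Rows):

```go
package datastore

import (
	"context"

	"github.com/jmoiron/sqlx"
)

type filter struct{ Status string }

// Datastore now exposes the live result set instead of a materialized slice;
// the caller is responsible for iterating and for calling rows.Close().
type Datastore interface {
	GetData(ctx context.Context, f filter) (*sqlx.Rows, error)
}
```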
It worked! (See the Playground link for an updated demo.)
Most of the allocations came from the upload/reader part. To put a cherry on top, the time consumption was cut in half because the read and write were performed concurrently, albeit still being O(n).
Package Boundary
That's not the end of the story! In reality, I had separated the DB access into another package in the hope of reusing it. This meant the package would expose *sql.Rows (actually I'm using sqlx, so *sqlx.Rows), making it tightly coupled to the database, or even to a specific DB library.
To address this, I decided to communicate with a channel of items instead. So the DB interface changed from exposing the rows directly
to handing back a channel of items:
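One plausible shape for the new interface; the separate error channel is my assumption about how failures get reported back, since the post doesn't show the exact signature:

```go
package datastore

import "context"

type item struct {
	ID     string
	Amount int64
}

type filter struct{ Status string }

// Datastore no longer leaks *sqlx.Rows; callers only ever see domain items.
// The implementation is expected to close both channels once the query is
// exhausted (or has failed), so a plain `range` over items terminates.
type Datastore interface {
	GetData(ctx context.Context, f filter) (<-chan item, <-chan error)
}
```

Whether errors travel on a second channel, inside a wrapper struct, or as a final sentinel item is a design choice; the point is simply that the package boundary now speaks in domain items rather than DB rows.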
And the CSV writer part was updated accordingly; see the Playground link for the updated demo.
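In sketch form, the writer now drains the channel instead of iterating over rows (again assuming the channel shape above, not the exact playground code):

```go
package report

import (
	"encoding/csv"
	"io"
	"strconv"
)

type item struct {
	ID     string
	Amount int64
}

// writeCSV drains the item channel into the pipe writer, then closes the pipe
// so the uploading side sees either EOF or the error that interrupted the query.
func writeCSV(pw *io.PipeWriter, items <-chan item, errs <-chan error) {
	w := csv.NewWriter(pw)
	for it := range items {
		if err := w.Write([]string{it.ID, strconv.FormatInt(it.Amount, 10)}); err != nil {
			pw.CloseWithError(err)
			return
		}
	}
	w.Flush()
	if err := <-errs; err != nil { // nil if the error channel was closed cleanly
		pw.CloseWithError(err)
		return
	}
	pw.CloseWithError(w.Error())
}
```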
Let’s check the performance again!
Yup, almost overlapping red and yellow lines indicated that the refactor didn’t burden the memory much. Time shouldn’t be impacted as I didn’t…
Wait… What? The refactor (in yellow) slowed it down even worse than the original solution. We’ve got a new problem.
The Final Fine-tuning
Why was it slower? Time to pull out the big gun: pprof.
I added the following lines to capture the CPU profile:
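Capturing a CPU profile in a one-off program like this is typically done with runtime/pprof, along these lines (the output file name is an arbitrary choice of mine):

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Write the CPU profile to a file that `go tool pprof` can open later.
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// ... run the export being measured here ...
}
```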
(See the gist for the final code.) Before the refactor (where *rows is returned), the visualization of pprof partly looked like this:
After that (where a channel is used), it looked like this:
There's a significant amount of time spent on waiting. Per this great article:
Each channel internally holds a mutex lock which is used to avoid data races in all kinds of operations.
I thought buffering the channel might help, and it did.
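The change itself is tiny: give the channel some capacity so the producer doesn't have to synchronize with the consumer on every single item. A sketch (1024 is an arbitrary size of mine):

```go
package report

type item struct {
	ID     string
	Amount int64
}

// newItemChannel is the whole "fix": with a buffer, the DB goroutine can run
// ahead of the CSV writer instead of synchronizing (and contending on the
// channel's internal lock) for every single row.
func newItemChannel() chan item {
	return make(chan item, 1024) // the unbuffered version was make(chan item)
}
```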
Buffered channels (green/orange line) consumed less time than the original solution (blue line), but still more than the solution where *rows was used (red line), and the size of the buffer hardly mattered beyond a certain threshold, because the upload part was able to drain the items.
The memory consumption shouldn’t be impacted much as I didn’t…
No more surprises. All good now.
As a last thought, I think the use of a channel is not well justified: the only reason for it was to separate the DB access part, which, on second thought, might not be reusable anyway because the filter is quite specific to this use case. Hence, I decided to merge the DB access with the main logic and use the *rows directly.
In summary, when a large amount of data is flowing out of your Go system, pass an io.Reader around instead of a fully materialized slice, connect the producing side and the consuming side with io.Pipe(), and do the writing in its own goroutine. This way you won't need to hold all the data in RAM.
Thank you for reading and I hope this helps you!