Processing enormous streams of stuff
(Source: https://www.youtube.com/watch?v=FNfWfBV-3y4 (MAX Frank Group))

What do you remember from the recently concluded Paris Olympics? Manu Bhaker's medals might come to mind. The Indian hockey team's triumph. Neeraj Chopra winning silver. For the number nerds, there was the nearly 9 billion euros spent to host the Games, the sustainable-meat meals served in the Olympic village, and the tonnes of CO2 saved.

Few will remember the controversy around the triathlon swim event. When the event was finally completed, many of the women and men who organized the Olympics must have breathed a huge sigh of relief. For them, it marked the successful conclusion of a three-year-long project costing close to 90 million euros.

Before the Olympics, deputy Paris mayor Antoine Guillou said:

"I like to say that we're building two cathedrals. There's the one above ground that everyone knows — Notre Dame. And then there's the one underground."

What is he talking about? Le bassin d'Austerlitz, or the Austerlitz basin[2]: an underground reservoir holding 50,000 cubic meters of wastewater from Paris' sewerage system. This is the reservoir standing between Paris' wastewater and the Seine. On days of heavy rain, the untreated wastewater used to make its way into the river, polluting it and rendering it unfit for swimming. This audacious project was meant to restore the dream of a Seine in which Parisians could take a dip.

In computer science terms, the Austerlitz basin is a giant buffer, the equivalent of 20 Olympic-sized swimming pools. It's one part of an enormous wastewater stream processing system. With the advent of big data, we, the software folk, build similarly enormous data stream processing systems in the digital world. Sure, there's less E. coli in the data streams than in what the wastewater system deals with! No need to get your hands dirty, even if you're a hands-on data engineer!

Processing enormous data streams brings engineering challenges just as big as processing wastewater for a city the size of Paris. All the usual rules and tools come to naught when dealing with infinite data streams. Simple operations that help make sense of data become impossible.

That's where Probabilistic Stream Processing Algorithms come to our rescue. When trying to answer basic questions about the data, such as -

  • How many unique elements are present in the data seen so far?
  • What is the frequency of elements present in the data seen so far?

regular approaches of maintaining counters and frequency tables break down. There isn't enough memory to store this volume of data, nor enough time to aggregate the statistics across the thousands of nodes through which the data might pass.
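To make the problem concrete, here is a minimal sketch in Python (the names are my own, purely for illustration) of the exact approach that stops scaling: every distinct element and every counter has to live in memory somewhere.

```python
# Exact counting: keep every distinct element and a full frequency table.
# Both structures grow with the number of distinct elements in the stream,
# which is exactly what breaks down on an unbounded stream.
from collections import Counter

seen = set()             # exact set of unique elements - grows without bound
frequencies = Counter()  # exact per-element counts - same problem

def consume(element):
    seen.add(element)
    frequencies[element] += 1

# len(seen) gives the exact unique count and frequencies[x] the exact
# frequency - fine for millions of elements on one machine, hopeless for an
# infinite stream spread across thousands of nodes.
```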

Enter algorithms such as HyperLogLog (easily the winner of The Coolest Name for an Algorithm), Count-Min Sketch, Leaky Counters and others. Together they restore the tools at the data engineer's disposal, and we can once again answer questions about unique element counts, frequencies and more.

Ah, but there's a catch. If it's precision you're after, you're barking up the wrong tree. These algorithms are probabilistic in nature, so the results are approximate, but delivered fast and with a constant-sized memory footprint.

What's the magic behind these seemingly impossible algorithms?

Two key concepts, which Ted Dunning so elegantly explains in his talk[1].

Hashing - We know what this is: a one-way mathematical function that yields the same, effectively random-looking output every time the same input is passed in.
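A quick illustration in Python (the helper name hash64 is my own): the same input always maps to the same 64-bit value, and those bits look uniformly random, which is the property the sketches below exploit.

```python
# Hashing: deterministic, but the output bits look uniformly random.
import hashlib

def hash64(value: str) -> int:
    """Map a string to a deterministic 64-bit integer."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

assert hash64("user-42") == hash64("user-42")  # same input, same output
print(f"{hash64('user-42'):064b}")             # 64 random-looking bits
```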

Sketches - Compact data structures holding an approximation of the data in a small, fixed memory footprint.

Putting these two ideas together, HyperLogLog helps answer how many unique elements have been seen so far (formally, the cardinality of the data set), while Count-Min Sketch and Leaky Counters help answer how many times an element has been seen (formally, the frequency of elements). There are other algorithms in this class, such as t-digest and streaming k-means.
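To show how the two ingredients combine, here is a toy Count-Min Sketch in Python. It is an illustrative sketch under my own naming and sizing assumptions, not a production implementation; real libraries derive the width and depth from target error bounds.

```python
# A toy Count-Min Sketch: a fixed grid of counters plus one hash per row.
# Memory stays constant however long the stream runs; collisions can only
# inflate counters, so estimates are upper bounds on the true frequency.
import hashlib

class CountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item: str):
        # One bucket per row, derived from a row-salted hash of the item.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str):
        for row, col in self._buckets(item):
            self.table[row][col] += 1

    def estimate(self, item: str) -> int:
        # The least-inflated row gives the tightest estimate.
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for event in ["login", "login", "click", "login", "scroll"]:
    cms.add(event)
print(cms.estimate("login"))  # 3 here; approximate (never too low) at scale
```

HyperLogLog plays a similar trick for cardinality: it hashes every element and, per bucket, remembers only the longest run of leading zero bits it has seen, which turns out to be enough to estimate how many distinct elements have passed by.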

The Paris Olympics ended as a success, and the Seine is cleaner than it has been in years. And you're hopefully better off after reading this article, with new tools in your toolbox for wrestling with infinite data streams! For a real deep dive into these algorithms, start with Ted Dunning's video[1] and then go as far down the rabbit hole as you wish. Don't burrow into the sewage system, though.

References:

[1] Some Important Streaming Algorithms You Should Know About - Ted Dunning (MapR)

https://www.youtube.com/watch?v=lzHC8NQD3jQ

[2] The Austerlitz basin, an underground cathedral to clean up the Seine

https://www.paris.fr/pages/assainissement-de-la-seine-les-travaux-du-tunnelier-d-austerlitz-vont-bientot-demarrer-21535


