Processing enormous streams of stuff
(Source: https://www.youtube.com/watch?v=FNfWfBV-3y4 (MAX Frank Group))

What do you remember from the recently concluded Paris Olympics? Manu Bhaker's medals might come to mind. The Indian hockey team's triumph. Neeraj Chopra winning silver. For the number nerds, there was the nearly 9 billion euros spent to host the Games, the sustainable-meat meals served in the Olympic village, and the tonnes of CO2 saved.

Few will remember the controversy around the triathlon swim event. When the event was finally completed, many of the women and men who organized the Olympics must have breathed a huge sigh of relief. For them, it marked the successful conclusion of a three-year-long project costing close to 90 million euros.

Before the Olympics, deputy Paris mayor Antoine Guillou said:

"I like to say that we're building two cathedrals. There's the one above ground that everyone knows — Notre Dame. And then there's the one underground."

What is he talking about? Le bassin d'Austerlitz, or the Austerlitz basin[2]: an underground reservoir holding 50,000 cubic meters of wastewater from Paris' sewerage system. This is the reservoir standing between Paris' wastewater and the Seine. On days of heavy rain, the untreated wastewater used to make its way into the river, polluting it and rendering it unfit for swimming. This audacious project was meant to restore the dream of a Seine in which Parisians could take a dip.

In computer science terms, the Austerlitz basin is a giant buffer, the equivalent of 20 Olympic-sized swimming pools. It's one part of an enormous wastewater stream processing system. With the advent of big data, we, the software folk, build similarly enormous data stream processing systems in the digital world. Sure, there's less E. coli in the data streams than in what the wastewater system deals with! No need to get your hands dirty, even if you're a hands-on data engineer!

Processing enormous data streams brings engineering challenges just as big as processing wastewater for a city the size of Paris. All the usual rules and tools come to naught when dealing with infinite data streams. Simple operations that help make sense of data become impossible.

That's where Probabilistic Stream Processing Algorithms come to our rescue. When trying to answer basic questions about the data, such as -

  • How many unique elements are present in the data seen so far?
  • What is the frequency of elements present in the data seen so far?

regular approaches of maintaining counters and frequency tables break down. There isn't enough memory to store this volume of data, nor enough time to aggregate the statistics across the thousands of nodes through which the data might pass.
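To make the problem concrete, here is a minimal sketch in Python (the names are my own, purely for illustration) of the exact approach that stops scaling: every distinct element and every counter has to live in memory somewhere.

```python
# Exact counting: keep every distinct element and a full frequency table.
# Both structures grow with the number of distinct elements in the stream,
# which is exactly what breaks down on an unbounded stream.
from collections import Counter

seen = set()             # exact set of unique elements - grows without bound
frequencies = Counter()  # exact per-element counts - same problem

def consume(element):
    seen.add(element)
    frequencies[element] += 1

# len(seen) gives the exact unique count and frequencies[x] the exact
# frequency - fine for millions of elements on one machine, hopeless for an
# infinite stream spread across thousands of nodes.
```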

Enter algorithms such as HyperLogLog (easily the winner of The Coolest Name for an Algorithm), Count-Min Sketch, Leaky Counters and others. Together they restore the tools at the data engineer's disposal, and we can once again answer questions about unique element counts, frequencies and more.

Ah, but there's a catch. If it's precision you're after, you're barking up the wrong tree. These algorithms are probabilistic in nature, so the results are approximate, but delivered fast and with a constant-sized memory footprint.

What's the magic behind these seemingly impossible algorithms?

Two key concepts, which Ted Dunning so elegantly explains in his talk[1].

Hashing - We know what this is: a one-way mathematical function that yields the same, effectively random-looking output every time the same input is passed in.
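A quick illustration in Python (the helper name hash64 is my own): the same input always maps to the same 64-bit value, and those bits look uniformly random, which is the property the sketches below exploit.

```python
# Hashing: deterministic, but the output bits look uniformly random.
import hashlib

def hash64(value: str) -> int:
    """Map a string to a deterministic 64-bit integer."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

assert hash64("user-42") == hash64("user-42")  # same input, same output
print(f"{hash64('user-42'):064b}")             # 64 random-looking bits
```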

Sketches - Compact data structures holding an approximation of the data in a small, fixed memory footprint.

Putting these two ideas together, HyperLogLog helps answer how many unique elements have been seen so far (formally, the cardinality of the data set), while Count-Min Sketch and Leaky Counters help answer how many times an element has been seen (formally, the frequency of elements). There are other algorithms in this class, such as t-digest and streaming k-means.
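To show how the two ingredients combine, here is a toy Count-Min Sketch in Python. It is an illustrative sketch under my own naming and sizing assumptions, not a production implementation; real libraries derive the width and depth from target error bounds.

```python
# A toy Count-Min Sketch: a fixed grid of counters plus one hash per row.
# Memory stays constant however long the stream runs; collisions can only
# inflate counters, so estimates are upper bounds on the true frequency.
import hashlib

class CountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item: str):
        # One bucket per row, derived from a row-salted hash of the item.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str):
        for row, col in self._buckets(item):
            self.table[row][col] += 1

    def estimate(self, item: str) -> int:
        # The least-inflated row gives the tightest estimate.
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for event in ["login", "login", "click", "login", "scroll"]:
    cms.add(event)
print(cms.estimate("login"))  # 3 here; approximate (never too low) at scale
```

HyperLogLog plays a similar trick for cardinality: it hashes every element and, per bucket, remembers only the longest run of leading zero bits it has seen, which turns out to be enough to estimate how many distinct elements have passed by.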

The Paris Olympics ended as a success, and the Seine is cleaner than it has been in years. And you're hopefully better off after reading this article, with new tools in your toolbox for wrestling with infinite data streams! For a real deep dive into these algorithms, start with Ted Dunning's video[1] and then go as far down the rabbit hole as you wish. Don't burrow into the sewage system, though.

References:

[1] Some Important Streaming Algorithms You Should Know About - Ted Dunning (MapR)

https://www.youtube.com/watch?v=lzHC8NQD3jQ

[2] The Austerlitz basin, an underground cathedral to clean up the Seine

https://www.paris.fr/pages/assainissement-de-la-seine-les-travaux-du-tunnelier-d-austerlitz-vont-bientot-demarrer-21535


