Applying Some 'Unconventional Wisdom' to Improve Model Scoring wrt. High-Velocity Streaming Data
Conventional and alternative architectures for high-velocity / high-volume streaming data processing

Some data streams in the telecom industry can exceed 50,000 messages / second. And at about 300KB per message (6TB / day), the size is not tiny either. How does one handle such high-velocity / high-volume data for real-time scoring of ML models? Well, the conventional (or dominant) architecture for such problems is Kafka in conjunction with Spark Streaming. At that rate and size, however, the Spark infrastructure requirements become substantial - approaching 100 cores / 100 GB of compute. In fact, for a similar problem the cloud infrastructure costs exceeded $50K / month - and that was for batch-mode processing! [Sample code: fwaris/OrleansKafkaSample (github.com)]

The Unconventional Approach

Can one do better - i.e., use a more efficient architecture? To answer this question, I experimented with an alternative approach. Instead of Spark, I used Orleans Streaming -- deployed to a modest on-prem Kubernetes cluster -- together with the F# programming language and 'AsyncSeq', a library for asynchronous stream processing. Let's just say this wasn't exactly a walk in the park for me: I am well-versed in F#, but Kafka, Kubernetes, and Orleans Streaming were all relative unknowns.
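For illustration, below is a minimal sketch of the ingestion side of this idea: the Confluent.Kafka (librdkafka-based) consumer surfaced as an F# AsyncSeq. The broker address, topic, and consumer group are placeholders, and the actual project (linked above) uses Orleans stream providers and is more involved - this only shows the shape.

```fsharp
open System
open Confluent.Kafka
open FSharp.Control   // FSharp.Control.AsyncSeq package

// Illustrative only: broker, topic and consumer-group names are placeholders.
let kafkaMessages (topic: string) : AsyncSeq<byte[]> =
    asyncSeq {
        let config = ConsumerConfig(BootstrapServers = "localhost:9092", GroupId = "scoring-demo")
        use consumer = ConsumerBuilder<Ignore, byte[]>(config).Build()
        consumer.Subscribe topic
        while true do
            // Consume blocks for at most the given timeout; a short timeout keeps the loop responsive.
            let result = consumer.Consume(TimeSpan.FromMilliseconds 100.)
            if not (isNull result) then
                yield result.Message.Value   // raw payload bytes for downstream parsing/scoring
    }
```

Exposing the consumer as an AsyncSeq lets the downstream parsing and scoring stages be composed as ordinary asynchronous stream transformations.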

[Screenshot from the 'Octant' tool showing Kubernetes resource consumption: memory (10 GB) and CPU (11.8 cores) of the Orleans-F# app]

It took me a few weeks to get all the pieces together, but the final result was shockingly good. To process 30,000 messages / second (each ~300KB in size), the compute needed was only 12 cores / 10 GB!

This is an amazing result. If this architecture is deployed widely, it should yield substantial savings for my company compared to the cost of cloud - and even on-prem - Spark infrastructure.

But Why?

So why is this? The reasons are many -- ranging from software (in)efficiency to fundamentals of computer science. While it is hard to prove anything without rigorous experimentation, here are my educated guesses:

  • Spark was developed primarily for batch-mode processing. Streaming was added later, so Spark is not the most efficient framework for stream processing.
  • The Orleans/Dotnet Kafka and JSON processing libraries are much more efficient than what is packaged with Spark. The underlying Kafka library for Dotnet is based on librdkafka - a highly tuned Kafka client written in C/C++. Similarly, Dotnet's System.Text.Json is very efficient for JSON parsing because it relies on the Span<'T>/Memory<'T> constructs, which reduce unnecessary memory allocation and allow data to be parsed efficiently as it's being read 'off the wire' (a short parsing sketch follows this list).
  • Both Orleans and F# AsyncSeq make heavy use of Dotnet's asynchronous processing features. Asynchronous processing is crucial for extracting the most performance from a constrained compute environment when running an I/O-heavy workload. Each Kubernetes 'pod' is given only 2 virtual cores; if a core is blocked waiting for data to be read or written, it is not available for any other useful work - disastrous for performance. Fortunately, the AsyncSeq F# library provides the right abstractions for easily expressing asynchronous (non-blocking) stream processing logic. The library leverages F# computation expressions (which are a realization of monads from computer science). A short AsyncSeq sketch also follows below.
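
To illustrate the parsing point, here is a minimal sketch using System.Text.Json's Utf8JsonReader, which works directly over the UTF-8 payload bytes with no intermediate string or DOM for the parts of the message we ignore. The message shape and the 'timestamp' field are made up for the example - they are not the actual schema.

```fsharp
open System
open System.Text
open System.Text.Json

// Pull a single field straight out of the UTF-8 payload bytes.
let tryReadTimestamp (payload: ReadOnlySpan<byte>) : DateTimeOffset option =
    let mutable reader = Utf8JsonReader(payload)
    let mutable result = None
    while reader.Read() do
        if reader.TokenType = JsonTokenType.PropertyName && reader.ValueTextEquals "timestamp" then
            reader.Read() |> ignore
            result <- Some (reader.GetDateTimeOffset())
    result

// Example usage with a made-up message:
let demo () =
    let sample = Encoding.UTF8.GetBytes """{"id":"abc","timestamp":"2021-06-01T12:00:00Z","value":42}"""
    tryReadTimestamp (ReadOnlySpan<byte>(sample))   // Some (2021-06-01 12:00 +00:00)
```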

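And to make the last point concrete, here is a minimal sketch of the kind of non-blocking pipeline AsyncSeq enables. 'scoreAsync' is a hypothetical stand-in for model inference, and the combinator choices (parallel map, count/time batching) are illustrative rather than the project's actual code.

```fsharp
open FSharp.Control   // FSharp.Control.AsyncSeq package

// Hypothetical stand-in for model inference (e.g. an async call to a model server).
let scoreAsync (payload: byte[]) : Async<float> = async {
    do! Async.Sleep 5                 // simulated I/O wait - no thread is blocked here
    return float payload.Length }

// Compose the pipeline with AsyncSeq combinators. While a message is waiting on I/O,
// the underlying threads are free to do other useful work - important on a 2-core pod.
let runPipeline (messages: AsyncSeq<byte[]>) : Async<unit> =
    messages
    |> AsyncSeq.mapAsyncParallel scoreAsync       // score messages concurrently
    |> AsyncSeq.bufferByCountAndTime 1000 500     // emit batches of up to 1000 items or every 500 ms
    |> AsyncSeq.iterAsync (fun scores -> async {
        printfn "scored a batch of %d messages" scores.Length })
```

Feeding the 'kafkaMessages' sketch from earlier in this post into 'runPipeline' and running the resulting Async ties the two halves together.
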
Conclusion

Unsatisfied with the resource requirements of conventional stream processing architectures, I applied some unconventional wisdom to create an alternative architecture that yielded a phenomenal increase in efficiency.

Frankly, the ability to efficiently process high-velocity / high-volume streaming data was a huge relief to me. It means that the business case for supporting a particular use case is now much easier to justify, and as such, we (our group) can support more of the needs of our internal customers.

Can you share the source code of a small example?

Evan Howlett

Sr. Fullstack Software Engineer | SQL, F#, C#

2y

This is a great write-up! I've been wanting to try out the Orleans framework for something but never found a good application for it. Was it a headache to deal with all the C#-isms? Are your price estimates coming from Azure or another hosting provider?

Ramón Soto Mathiesen

"Datalogist" (*) @ SPISE MISU ApS

2y

Coming from F# Online. Maybe I'm misunderstanding something, but reading "Kafka and JSON" in the same sentence sounds a bit odd. I have experience with Kafka, also in combination with `dotnet`. What you would "normally" do is read `raw bytes` and then transform them to either some `specific` or `generic` types:

- https://avro.apache.org/docs/1.8.2/api/csharp/html/namespaceAvro_1_1Specific.html
- https://avro.apache.org/docs/1.8.2/api/csharp/html/namespaceAvro_1_1Generic.html

The important part is to avoid using "slow" (text) data containers, such as JSON/XML. But again, I might be misunderstanding your setup.

Btw, speaking of JSON and F#: I've realized that since LTS 6.0, if you don't tag your (record) types with:

```fsharp
open System.Runtime.Serialization

[<DataContract>]
type t =
    { [<field: DataMember(Name="foo")>] foo: string
      [<field: DataMember(Name="bar")>] bar: string }
```

then when serializing you end up with:

```json
{
  "foo@": "value0",
  "bar@": "value1",
  "foo": "value0",
  "bar": "value1"
}
```

which results in sending 2x the size of a given data payload. I wonder who were the "geniuses" behind this decision?
