Applying Some 'Unconventional Wisdom' to Improve Model Scoring on High-Velocity Streaming Data
Some data streams in the telecom industry can exceed 50,000 messages / second. And at about 300KB per message (6TB / day), the size is not tiny either. How does one handle such high-velocity / high-volume data for real-time scoring of ML models? Well, the conventional (or dominant) architecture for such problems is Kafka in conjunction with Spark Streaming. At that rate and size, however, the Spark infrastructure requirements become substantial - approaching 100 cores / 100 GB of compute. In fact, for a similar problem the cloud infrastructure costs exceeded $50K / month - and that was for batch-mode processing! [Sample code: fwaris/OrleansKafkaSample (github.com)]
The Unconventional Approach
Can one do better - i.e., use a more efficient architecture? To answer this question, I experimented with an alternative approach. Instead of Spark, I used Orleans Streaming -- deployed to a modest on-prem Kubernetes cluster -- together with the F# programming language and 'AsyncSeq', a library for asynchronous stream processing. Let's just say this wasn't exactly a walk in the park: I am well-versed in F#, but Kafka, Kubernetes, and Orleans Streaming were all relative unknowns to me.
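To make the shape of the pipeline concrete, here is a minimal sketch of consuming a Kafka topic as an F# 'AsyncSeq' and scoring each message. This is not the production code (the linked OrleansKafkaSample repo is closer to that); the broker address, topic name, consumer group, and the scoreModel function are hypothetical placeholders, and the Orleans Streaming layer that spreads this work across the Kubernetes cluster is omitted.

```fsharp
// Minimal sketch: expose a Kafka topic as an AsyncSeq and score each message.
// Uses Confluent.Kafka and FSharp.Control.AsyncSeq; all names are illustrative.
open System.Threading
open Confluent.Kafka
open FSharp.Control

let consumerConfig =
    ConsumerConfig(
        BootstrapServers = "kafka:9092",   // assumed broker address
        GroupId = "scoring-group")         // hypothetical consumer group

/// Hypothetical stand-in for the real model-scoring call.
let scoreModel (payload: byte[]) : Async<unit> =
    async { return () }

/// Pull raw message payloads from a topic as an asynchronous sequence.
let kafkaStream (topic: string) : AsyncSeq<byte[]> =
    asyncSeq {
        use consumer = ConsumerBuilder<Ignore, byte[]>(consumerConfig).Build()
        consumer.Subscribe(topic)
        while true do
            let result = consumer.Consume(CancellationToken.None)
            yield result.Message.Value
    }

// The sequence is pulled lazily, one message at a time, so memory stays flat
// even at high message rates.
kafkaStream "telemetry"                    // hypothetical topic name
|> AsyncSeq.iterAsync scoreModel
|> Async.RunSynchronously
```

The design point of the sketch is simply that messages are pulled and scored as a lazy asynchronous sequence, without an intermediate batching layer between Kafka and the model.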
It took me a few weeks to get all the pieces working together, but the final result was shockingly good: to process 30,000 messages / second (each 300KB in size), the compute needed was only 12 cores / 10 GB!
This is an amazing result, and if this architecture is deployed widely it should yield substantial savings for my company compared to cloud and even on-prem Spark infrastructure costs.
But Why?
The question is: why is this so? The reasons are many -- ranging from software (in)efficiency to something rooted in fundamental computer science. While it is hard to prove anything without rigorous scientific experimentation, here are my educated guesses:
Conclusion
Unsatisfied with the resource requirements of conventional stream processing architectures, I applied some unconventional wisdom to create an alternative architecture that yielded a phenomenal increase in efficiency.
Frankly, the ability to efficiently process high-velocity / high-volume streaming data came as a huge relief to me. It means that the business case for supporting a particular use case is now much easier to justify. And as such, we (our group) can support more of the needs of our internal customers.
Can you share the source code of a small example?
Sr. Fullstack Software Engineer | SQL, F#, C#
This is a great write-up! I've been wanting to try out the Orleans framework for something but never found a good application for it. Was it a headache to deal w/ all the C#-isms? Are your price estimates coming from Azure or another host provider?
"Datalogist" (*) @ SPISE MISU ApS
Coming from F# Online. Maybe I'm misunderstanding something, but reading "Kafka and JSON" in the same sentence sounds a bit odd. I have experience with Kafka, also in combination with `dotnet`. What you would "normally" do is read `raw bytes` and then transform them into either `specific` or `generic` types:

- https://avro.apache.org/docs/1.8.2/api/csharp/html/namespaceAvro_1_1Specific.html
- https://avro.apache.org/docs/1.8.2/api/csharp/html/namespaceAvro_1_1Generic.html

The important part is to avoid "slow" (text) data containers such as JSON/XML. But again, I might be misunderstanding your setup.

Btw, speaking of JSON and F#: I've realized that since LTS 6.0, if you don't tag your (record) types with:

```fsharp
open System.Runtime.Serialization

[<DataContract>]
type t =
    { [<field: DataMember(Name = "foo")>] foo: string
      [<field: DataMember(Name = "bar")>] bar: string }
```

then when serializing you end up with:

```json
{
  "foo@": "value0",
  "bar@": "value1",
  "foo": "value0",
  "bar": "value1"
}
```

which results in sending 2x the size of a given data payload. I wonder who the "geniuses" behind this decision were?
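To illustrate the generic-record approach the comment above describes, here is a rough sketch (not from the article) of decoding a raw Avro payload with the Apache.Avro package in F#. It assumes the payload is plain Avro binary without Confluent schema-registry framing; the schema and field name are made-up examples.

```fsharp
// Rough sketch: decode a raw Avro binary payload into a GenericRecord.
// The schema and the "value" field below are illustrative, not from the article.
open System.IO
open Avro
open Avro.Generic
open Avro.IO

// A made-up record schema for demonstration purposes.
let schema =
    Schema.Parse """
      { "type": "record", "name": "Reading",
        "fields": [ { "name": "value", "type": "double" } ] }
    """

/// Decode one Avro-encoded message payload (raw bytes) into a GenericRecord.
let decode (payload: byte[]) : GenericRecord =
    use stream = new MemoryStream(payload)
    let reader = GenericDatumReader<GenericRecord>(schema, schema)
    reader.Read(Unchecked.defaultof<GenericRecord>, BinaryDecoder(stream))

// Fields can then be read by name from the decoded record.
let readValue (record: GenericRecord) = record.["value"] :?> double
```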