Why Most Companies Are Using Apache Kafka "Incorrectly" For Real-Time Analytics
Source: InfoQ


It's been a while since my last blog; life has been extremely busy. So, I thought, why not relax and write a blog? Also, is it just me, or did summer actually happen this year in Canada? It seems we had spring, then skipped summer, and now it's fall... ah, time. It relentlessly keeps moving forward.

In any case, I wanted to talk about real-time data and how companies are using Kafka incorrectly. Yes, that is a bold statement, but based on what I have been seeing online, and on my discussions with major organizations around the world, I am convinced that Kafka is being used incorrectly for real-time analytics and that companies are wasting money.

Here are the major mistakes I see companies making when it comes to real-time analytics:

  1. Streaming data into Kafka and then, using some other technology, storing it in a real-time database like BigQuery, Redshift, SQL Server, SingleStore, etc. is a waste of money. The issue is that you are duplicating your data, which not only increases storage costs but also creates security issues, because you have increased the attack surface for your data. DATA STREAMING INTO KAFKA DOES NOT NEED TO BE MOVED OUT OF KAFKA. (A minimal sketch of points 1 and 2 follows this list.)
  2. Real-time data processing using SQL queries is a waste of money. Using SQL queries to process real-time data can get very expensive, especially if you are running millions or billions of queries per day. Processing data with SQL queries also creates overhead in maintaining the code and managing changes to the queries, and it causes lots of data movement, which increases network costs. WITH TRANSACTIONAL MACHINE LEARNING (TML), SQL QUERIES ARE ELIMINATED AND ALL DATA PROCESSING IS DONE IN-MEMORY, WITH JSON PROCESSING AT THE ENTITY LEVEL.
  3. Real-time machine learning using third-party technologies like MLlib, Google ML, etc. is not needed, and companies are wasting money on it. Machine learning on real-time data needs to be done using sliding time windows, because a key characteristic of data streams is "temporal locality": data moves forward in time, and capturing segments of it to analyse requires sliding time windows. WITH TRANSACTIONAL MACHINE LEARNING (TML), MACHINE LEARNING DOES NOT REQUIRE THIRD-PARTY ML LIBRARIES: TML PERFORMS AUTO MACHINE LEARNING IN-MEMORY AT THE ENTITY LEVEL, WHICH DRASTICALLY REDUCES COMPUTE, STORAGE AND NETWORK COSTS. (A sliding-window sketch covering points 3 and 5 appears below the figure.)
  4. Real-time visualization using third-party visualization tools is a waste of money. Almost no third-party visualization tool has a direct connector to Kafka, so companies waste money moving data out of Kafka into a database and then pointing the visualization tool at that database. WITH TRANSACTIONAL MACHINE LEARNING (TML), YOU CAN CONNECT DIRECTLY TO A KAFKA TOPIC AND STREAM THE DATA OVER WEBSOCKETS TO ANY BROWSER, BYPASSING THE DATABASE AND THIRD-PARTY TOOLS AND LICENSES. (A WebSocket sketch appears further below.)
  5. Companies do not understand the advantages of entity-level processing and machine learning for real-time data. What is entity-level processing? An entity is an individual object: an IoT device, a human, or anything that individually produces data. For example, if you want to analyse 10M IoT devices, each device produces data in real time in its own environment, and that environment influences the data it produces. Companies make the mistake of using ONE (1) machine learning model to analyse all of these devices, but this may not take into account the individual behaviour of each device. TML USES IN-MEMORY, ENTITY-LEVEL PROCESSING AND AUTO MACHINE LEARNING; IT CAN PROCESS EACH DEVICE WITH ITS OWN DATA AND MACHINE LEARNING ALGORITHM. THIS MEANS IT CAN CREATE 10M MACHINE LEARNING MODELS FOR 10M DEVICES AND GET A MUCH MORE GRANULAR AND ACCURATE PREDICTION OF EACH DEVICE'S FUTURE BEHAVIOUR. See TML Processing and Machine Learning below.
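To make points 1 and 2 concrete, here is a minimal sketch of keeping the work inside Kafka: a plain Python consumer reads JSON messages straight off a topic and maintains per-entity aggregates in memory, with no database sink and no SQL. This is only an illustration of the idea, not TML itself; the broker address, topic name, and JSON field names are placeholder assumptions.

```python
import json
from collections import defaultdict

from confluent_kafka import Consumer   # pip install confluent-kafka

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumption: local broker
    "group.id": "entity-level-processing",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["iot-readings"])          # hypothetical topic name

# Per-entity aggregates live entirely in memory: no warehouse copy, no SQL.
counts = defaultdict(int)
sums = defaultdict(float)

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())       # JSON processing in memory
        entity = event["device_id"]           # hypothetical entity key
        counts[entity] += 1
        sums[entity] += event["temperature"]  # hypothetical metric field
        mean = sums[entity] / counts[entity]
        print(f"{entity}: running mean {mean:.2f} over {counts[entity]} readings")
finally:
    consumer.close()
```

Notice that the data is read and analysed where it already lives: nothing is copied into a second system, so there is no second storage bill and no second attack surface.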

[Figure: TML Processing and Machine Learning. Source: Author]
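And to make points 3 and 5 concrete, here is a minimal sketch of sliding time windows with one tiny model per entity. The 60-second window and the least-squares trend "model" are illustrative assumptions, not TML's internals; the point is the shape of the computation: every entity gets its own window and its own fit.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60.0

# One sliding window (a deque of (timestamp, value) pairs) per entity.
windows = defaultdict(deque)

def update(entity_id, timestamp, value):
    """Add a reading, evict anything older than the window, refit the model."""
    w = windows[entity_id]
    w.append((timestamp, value))
    while w and w[0][0] < timestamp - WINDOW_SECONDS:  # temporal locality:
        w.popleft()                                    # old data ages out
    return fit_trend(w)

def fit_trend(w):
    """Least-squares slope over the window: a tiny per-entity 'model'."""
    n = len(w)
    if n < 2:
        return 0.0
    ts = [t for t, _ in w]
    vs = [v for _, v in w]
    mt, mv = sum(ts) / n, sum(vs) / n
    denom = sum((t - mt) ** 2 for t in ts)
    num = sum((t - mt) * (v - mv) for t, v in zip(ts, vs))
    return num / denom if denom else 0.0

# Example: two devices get two completely independent models.
now = time.time()
print(update("device-1", now, 20.0))       # not enough data yet -> 0.0
print(update("device-1", now + 5, 21.0))   # rising trend for device-1
print(update("device-2", now, 80.0))       # device-2 has its own window
```

Because each entity owns its own window and model, 10M devices simply means 10M small, independent fits; no single global model has to average away their individual behaviours.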

The above is not a complete list, but these points address very important issues with real-time data processing and machine learning. As your data gets faster and bigger, the costs of processing and machine learning will increase dramatically, and that can keep your solution from ever being used.
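Finally, to make point 4 concrete, here is a minimal sketch of streaming a Kafka topic straight to a browser over WebSockets, with no database in between. It uses the third-party websockets and confluent-kafka Python packages; the topic name and port are placeholder assumptions.

```python
import asyncio

import websockets                      # pip install websockets
from confluent_kafka import Consumer   # pip install confluent-kafka

async def stream_topic(ws):
    # Recent versions of the websockets package pass only the connection
    # object to the handler. One consumer (and consumer group) per client.
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",   # assumption: local broker
        "group.id": f"ws-{id(ws)}",
        "auto.offset.reset": "latest",
    })
    consumer.subscribe(["iot-readings"])         # hypothetical topic
    try:
        while True:
            # poll() is a blocking call; fine for a sketch, but a real
            # service would run it in an executor or a separate thread.
            msg = consumer.poll(timeout=0.1)
            if msg is None or msg.error():
                await asyncio.sleep(0)           # yield to the event loop
                continue
            await ws.send(msg.value().decode("utf-8"))
    finally:
        consumer.close()

async def main():
    async with websockets.serve(stream_topic, "localhost", 8765):
        await asyncio.Future()                   # serve forever

if __name__ == "__main__":
    asyncio.run(main())
```

A browser dashboard can then render the stream with nothing more than new WebSocket("ws://localhost:8765") and an onmessage handler: no database licence, no visualization-tool licence.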

Companies need to think carefully about better and cheaper ways to process real-time data with Kafka: ways that do not line the pockets of the cloud vendors, but instead create higher value for the company.

Here is a summary table of the above issues:

[Table: summary of the issues above. Source: Author]

Till next time.

