Why Most Companies Are Using Apache Kafka "Incorrectly" For Real-Time Analytics
Source: InfoQ


It's been a while since my last blog; life has been extremely busy. So, I thought, why not relax and write a blog? Also, is it just me, or did summer actually happen this year in Canada? It seems we had spring, then skipped summer, and now it's fall... ah, time. It relentlessly keeps moving forward.

In any case, I wanted to talk about real-time data and how companies are using Kafka incorrectly. Yes, that is a bold statement, but based on what I have been seeing online, and on my discussions with major organizations around the world, I am convinced that Kafka is being used incorrectly for real-time analytics and that companies are wasting money.

Here are the major mistakes I see companies making when it comes to real-time analytics:

  1. Streaming data into Kafka and then, using some other technology, storing it in a real-time database like BigQuery, Redshift, SQL Server, SingleStore, etc. is a waste of money. The issue is that you are duplicating your data, which not only increases storage costs but also creates security issues, because you have increased the attack surface for your data. DATA STREAMING INTO KAFKA DOES NOT NEED TO BE MOVED OUT OF KAFKA. (A minimal sketch of points 1 and 2 follows this list.)
  2. Real-time data processing using SQL queries is a waste of money. Using SQL queries to process real-time data can get very expensive, especially if you are running millions or billions of queries per day. Processing data with SQL queries also creates overhead in maintaining the code and managing changes to the queries, and it causes lots of data movement, which increases network costs. WITH TRANSACTIONAL MACHINE LEARNING (TML), SQL QUERIES ARE ELIMINATED AND ALL DATA PROCESSING IS DONE IN-MEMORY, WITH JSON PROCESSING AT THE ENTITY LEVEL.
  3. Real-time machine learning using third-party technologies like MLlib, Google ML, etc. is not needed, and companies are wasting money on it. Machine learning on real-time data needs to be done using sliding time windows, because a key characteristic of data streams is "temporal locality": data moves forward in time, and capturing segments of it to analyse requires sliding time windows. WITH TRANSACTIONAL MACHINE LEARNING (TML), MACHINE LEARNING DOES NOT REQUIRE THIRD-PARTY ML LIBRARIES: TML PERFORMS AUTO MACHINE LEARNING IN-MEMORY AT THE ENTITY LEVEL, WHICH DRASTICALLY REDUCES COMPUTE, STORAGE AND NETWORK COSTS. (A sliding-window sketch covering points 3 and 5 appears below the figure.)
  4. Real-time visualization using third-party visualization tools is a waste of money. Almost no third-party visualization tool has a direct connector to Kafka, so companies waste money moving data out of Kafka into a database and then pointing the visualization tool at that database. WITH TRANSACTIONAL MACHINE LEARNING (TML), YOU CAN CONNECT DIRECTLY TO A KAFKA TOPIC AND STREAM THE DATA OVER WEBSOCKETS TO ANY BROWSER, BYPASSING THE DATABASE AND THIRD-PARTY TOOLS AND LICENSES. (A WebSocket sketch appears further below.)
  5. Companies do not understand the advantages of entity-level processing and machine learning for real-time data. What is entity-level processing? An entity is an individual object: an IoT device, a human, or anything that individually produces data. For example, if you want to analyse 10M IoT devices, each device produces data in real time in its own environment, and that environment influences the data it produces. Companies make the mistake of using ONE (1) machine learning model to analyse all of these devices, but this may not take into account the individual behaviour of each device. TML USES IN-MEMORY, ENTITY-LEVEL PROCESSING AND AUTO MACHINE LEARNING; IT CAN PROCESS EACH DEVICE WITH ITS OWN DATA AND MACHINE LEARNING ALGORITHM. THIS MEANS IT CAN CREATE 10M MACHINE LEARNING MODELS FOR 10M DEVICES AND GET A MUCH MORE GRANULAR AND ACCURATE PREDICTION OF EACH DEVICE'S FUTURE BEHAVIOUR. See TML Processing and Machine Learning below.
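To make points 1 and 2 concrete, here is a minimal sketch of keeping the work inside Kafka: a plain Python consumer reads JSON messages straight off a topic and maintains per-entity aggregates in memory, with no database sink and no SQL. This is only an illustration of the idea, not TML itself; the broker address, topic name, and JSON field names are placeholder assumptions.

```python
import json
from collections import defaultdict

from confluent_kafka import Consumer   # pip install confluent-kafka

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumption: local broker
    "group.id": "entity-level-processing",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["iot-readings"])          # hypothetical topic name

# Per-entity aggregates live entirely in memory: no warehouse copy, no SQL.
counts = defaultdict(int)
sums = defaultdict(float)

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())       # JSON processing in memory
        entity = event["device_id"]           # hypothetical entity key
        counts[entity] += 1
        sums[entity] += event["temperature"]  # hypothetical metric field
        mean = sums[entity] / counts[entity]
        print(f"{entity}: running mean {mean:.2f} over {counts[entity]} readings")
finally:
    consumer.close()
```

Notice that the data is read and analysed where it already lives: nothing is copied into a second system, so there is no second storage bill and no second attack surface.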

[Figure: TML Processing and Machine Learning. Source: Author]
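And to make points 3 and 5 concrete, here is a minimal sketch of sliding time windows with one tiny model per entity. The 60-second window and the least-squares trend "model" are illustrative assumptions, not TML's internals; the point is the shape of the computation: every entity gets its own window and its own fit.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60.0

# One sliding window (a deque of (timestamp, value) pairs) per entity.
windows = defaultdict(deque)

def update(entity_id, timestamp, value):
    """Add a reading, evict anything older than the window, refit the model."""
    w = windows[entity_id]
    w.append((timestamp, value))
    while w and w[0][0] < timestamp - WINDOW_SECONDS:  # temporal locality:
        w.popleft()                                    # old data ages out
    return fit_trend(w)

def fit_trend(w):
    """Least-squares slope over the window: a tiny per-entity 'model'."""
    n = len(w)
    if n < 2:
        return 0.0
    ts = [t for t, _ in w]
    vs = [v for _, v in w]
    mt, mv = sum(ts) / n, sum(vs) / n
    denom = sum((t - mt) ** 2 for t in ts)
    num = sum((t - mt) * (v - mv) for t, v in zip(ts, vs))
    return num / denom if denom else 0.0

# Example: two devices get two completely independent models.
now = time.time()
print(update("device-1", now, 20.0))       # not enough data yet -> 0.0
print(update("device-1", now + 5, 21.0))   # rising trend for device-1
print(update("device-2", now, 80.0))       # device-2 has its own window
```

Because each entity owns its own window and model, 10M devices simply means 10M small, independent fits; no single global model has to average away their individual behaviours.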

The above is not a complete list, but these points address very important issues with real-time data processing and machine learning. As your data gets faster and bigger, the costs of processing and machine learning will increase dramatically, and that can keep your solution from ever being used.
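Finally, to make point 4 concrete, here is a minimal sketch of streaming a Kafka topic straight to a browser over WebSockets, with no database in between. It uses the third-party websockets and confluent-kafka Python packages; the topic name and port are placeholder assumptions.

```python
import asyncio

import websockets                      # pip install websockets
from confluent_kafka import Consumer   # pip install confluent-kafka

async def stream_topic(ws):
    # Recent versions of the websockets package pass only the connection
    # object to the handler. One consumer (and consumer group) per client.
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",   # assumption: local broker
        "group.id": f"ws-{id(ws)}",
        "auto.offset.reset": "latest",
    })
    consumer.subscribe(["iot-readings"])         # hypothetical topic
    try:
        while True:
            # poll() is a blocking call; fine for a sketch, but a real
            # service would run it in an executor or a separate thread.
            msg = consumer.poll(timeout=0.1)
            if msg is None or msg.error():
                await asyncio.sleep(0)           # yield to the event loop
                continue
            await ws.send(msg.value().decode("utf-8"))
    finally:
        consumer.close()

async def main():
    async with websockets.serve(stream_topic, "localhost", 8765):
        await asyncio.Future()                   # serve forever

if __name__ == "__main__":
    asyncio.run(main())
```

A browser dashboard can then render the stream with nothing more than new WebSocket("ws://localhost:8765") and an onmessage handler: no database licence, no visualization-tool licence.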

Companies need to think carefully about better and cheaper ways to process real-time data with Kafka: ways that do not line the pockets of the cloud vendors, but instead create higher value for the company.

Here is a summary table of the above issues:

[Table: summary of the issues above. Source: Author]

Till next time.

