Kafka and Spark: A perfect Data Science "Match"

Harvinder Singh Saluja

Head of Software Engineering | AI | AI Agents & ML Innovator | AWS LLMs /Spring AI Specialist | Cloud Data Lakes, Delta Lakes Leader

Published: January 9, 2018

Kafka is a publish-subscribe messaging system that provides a reliable source for Spark Streaming. The Kafka project introduced a new consumer API between versions 0.8 and 0.10, so there are two separate corresponding Spark Streaming packages available. The API provides a one-to-one mapping between Kafka partitions and the partitions of the RDDs generated by the DStream, along with access to metadata and offsets.

The following diagram shows end-to-end integration with Kafka, consuming messages from it, doing simple to complex windowing ETL, and pushing the desired output to various sinks such as memory, console, file, databases, and back to Kafka itself.

The following set of properties needs to be added to the Spark Streaming API to integrate Kafka with Spark as a source:

bootstrap.servers: This describes the host and port of Kafka server(s) separated by a comma.

key.deserializer: This is the name of the class to deserialize the key of the messages from Kafka.

value.deserializer: This refers to the class that deserializes the value of the message.

group.id: This uniquely identifies the consumer group.

auto.offset.reset: This applies when messages are consumed from a topic but there is no initial offset in Kafka, or when the current offset no longer exists on the server. In that case, one of the following options decides where consumption starts.
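The properties above can be collected into a plain map before they are handed to the Kafka integration. The following is a minimal sketch, not the author's code: the class name KafkaSourceParams, the broker addresses, and the group name are hypothetical placeholders, the deserializers are assumed to be Kafka's StringDeserializer (given by class name), and "latest" is picked for auto.offset.reset purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, used only to illustrate the five properties listed above.
public class KafkaSourceParams {

    // Collects the consumer properties into a map keyed by property name.
    public static Map<String, Object> build(String bootstrapServers, String groupId) {
        Map<String, Object> params = new HashMap<>();
        // Comma-separated host:port pairs of the Kafka server(s)
        params.put("bootstrap.servers", bootstrapServers);
        // Assumption: keys and values are plain strings, so Kafka's StringDeserializer is named here
        params.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        params.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Name of the consumer group
        params.put("group.id", groupId);
        // Assumption: "latest" is chosen only for illustration
        params.put("auto.offset.reset", "latest");
        return params;
    }

    public static void main(String[] args) {
        // Placeholder broker addresses and group name
        Map<String, Object> params = build("broker1:9092,broker2:9092", "example-group");
        params.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

With the Kafka 0.10 integration package on the classpath, a map like this would typically be passed to ConsumerStrategies.Subscribe together with a list of topics, and the resulting strategy to KafkaUtils.createDirectStream. That wiring is left out here so the sketch runs without Spark or Kafka dependencies.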

Rameshwar Balanagu

Growth Focused IT Executive & Digital Transformation Leader | Driving Business Growth through Innovative Tech Strategies | Connecting Vedas 2 AI for a better& brighter civilization | Startup Advisor

7 years ago

Thank you for sharing the article. Passion for technology is one thing and reality is another. The benefits of using Spark are beyond doubt, but how was a business case made for switching to Spark (cost, etc.)? What was the learning curve? Given your passion for security, how did you deal with SMT? Thanks in advance.
