
Real-Time Streaming Data Pipelines with Apache Kafka, Spark Streaming, and HBase: The BP Gulf Coast Oil Use Case

Steven Murhula

ML Engineer l Data Engineer l Scala l Python l Data Analysis l Big Data Development l SQL I AWS l ETL I GCP I Azure I Microservices l Data Science I Data Engineer I AI Engineer I Architect I Databricks I Java I Sql

Published: March 23, 2018

Introduction

In the data industry, many of the systems we monitor, extract data from, or build intelligence on produce a stream of events. Examples include event data from web or mobile applications, sensors, or medical devices.

Common real-world streaming use cases include:

Website monitoring, network monitoring

Fraud detection

Web clicks

Advertising

Internet of Things: sensors

Business Activities

With batch processing we can get insight into activities or events that happened in the past, but batch processing cannot answer the question "What is happening right now?"

Because batch processing can only look at data after the fact, it is becoming more and more necessary to process events and activities in real time so that the business can act and react as they happen, and doing this requires high performance at scale. That's where Spark Streaming comes in: we integrate Apache Spark Streaming, MapR-DB, and MapR Streams for fast, event-driven applications.

BP Oil Use Case

The example use case we will look at here is an application that monitors oil wells. Sensors in oil rigs generate streaming data, which is processed by Spark and stored in HBase for use by various analytical and reporting tools. We want to store every single event in HBase as it streams in. We also want to filter for alarms and store them. Daily Spark processing will store aggregated summary statistics.

What do we need to do? And how do we do this with high performance at scale?

We need to collect the data, process the data, store the data, and finally serve the data for analysis, machine learning, and dashboards.

Streaming Data Ingestion

Spark Streaming supports data sources such as HDFS directories, TCP sockets, Kafka, Flume, Twitter, etc. In our example, we will use MapR Streams, a new distributed messaging system for streaming event data at scale. MapR Streams enables producers and consumers to exchange events in real time via the Apache Kafka 0.9 API. MapR Streams integrates with Spark Streaming via the Kafka direct approach.

MapR Streams (or Kafka) topics are logical collections of messages. Topics organize events into categories. Topics decouple producers, which are the sources of data, from consumers, which are the applications that process, analyze, and share data.

Topics are partitioned for throughput and scalability. Partitions make topics scalable by spreading the load for a topic across multiple servers. Producers are load balanced between partitions and consumers can be grouped to read in parallel from multiple partitions within a topic for faster performance. Partitioned parallel messaging is a key to high performance at scale.
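As a rough illustration of how keyed messages can spread across partitions, the sketch below hashes a message key and takes the result modulo the partition count. This is a simplified stand-in, not the partitioner that Kafka or MapR Streams actually uses; the function names and the well IDs are hypothetical.

```python
import zlib
from collections import defaultdict


def assign_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition with a stable hash (illustrative only)."""
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def spread(keys, num_partitions):
    """Group keys by the partition they would land on."""
    partitions = defaultdict(list)
    for key in keys:
        partitions[assign_partition(key, num_partitions)].append(key)
    return dict(partitions)


if __name__ == "__main__":
    well_ids = [f"well-{i}" for i in range(8)]  # hypothetical sensor keys
    for partition, keys in sorted(spread(well_ids, 3).items()):
        print(partition, keys)
```

Because the same key always maps to the same partition, events from one well stay in order relative to each other while different wells can be consumed in parallel.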

Another key to high performance at scale is minimizing time spent on disk reads and writes. Compared with older messaging systems, Kafka and MapR Streams eliminate the need to track message acknowledgements on a per-message, per-listener basis. Messages are persisted sequentially as produced, and read sequentially when consumed. These design decisions mean that nonsequential reading or writing is rare, and they allow messages to be handled at very high speeds. MapR Streams performance scales linearly as servers are added within a cluster, with each server handling more than 1 million messages per second.

Real-time Data Processing Using Spark Streaming

Spark Streaming brings Spark's APIs to stream processing, letting you use the same APIs for streaming and batch processing. Data streams can be processed with Spark’s core APIs, DataFrames, GraphX, or machine learning APIs, and can be persisted to a file system, HDFS, MapR-FS, MapR-DB, HBase, or any data source offering a Hadoop OutputFormat or Spark connector.

Spark Streaming divides the data stream into batches of X seconds. The resulting stream is called a DStream, which internally is a sequence of RDDs, one for each batch interval. Each RDD contains the records received during that batch interval.
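The batching idea can be sketched in plain Python, without Spark. The function below groups timestamped records into fixed-width intervals, so each group plays the role of one RDD in the DStream. Note the assumptions: Spark Streaming batches records by arrival time, whereas this sketch uses an explicit timestamp on each record, and the name `micro_batches` is hypothetical.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, record) pairs into fixed-width batches.

    Returns a list of (batch_start, records) ordered by time, mimicking the
    way a DStream is a sequence of per-interval RDDs.
    """
    buckets = defaultdict(list)
    for timestamp, record in events:
        # Integer division finds the interval the event falls into.
        batch_start = int(timestamp // batch_interval) * batch_interval
        buckets[batch_start].append(record)
    return sorted(buckets.items())


if __name__ == "__main__":
    events = [(0.5, "a"), (1.2, "b"), (2.1, "c"), (2.9, "d"), (4.0, "e")]
    for start, records in micro_batches(events, batch_interval=2):
        print(start, records)
```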

Resilient distributed datasets, or RDDs, are the primary abstraction in Spark. An RDD is a distributed collection of elements, like a Java Collection, except that it’s spread out across multiple nodes in the cluster. The data contained in RDDs is partitioned and operations are performed in parallel on the data cached in memory. Spark caches RDDs in memory, whereas MapReduce involves more reading and writing from disk. Here again the key to high performance at scale is partitioning and minimizing disk I/O.

There are two types of operations on DStreams: transformations and output operations.

Your Spark application processes the DStream RDDs using Spark transformations like map, reduce, and join, which create new RDDs. Any operation applied on a DStream translates to operations on the underlying RDDs, which in turn apply the transformation to the elements of each RDD.

Output operations write data to an external system, producing output in batches. Examples of output operations are saveAsHadoopFiles, which saves to a Hadoop-compatible file system, and saveAsHadoopDataset, which saves to any Hadoop-supported storage system.
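The split between the two kinds of operations can be illustrated with a small plain-Python sketch. It is not the Spark API: `transform` and `output` are hypothetical helpers, and a list of lists stands in for a DStream of batches. The transformation is a lazy generator, and nothing is computed until the output operation pulls batches through it, which mirrors how DStream transformations only run once an output operation is attached.

```python
def transform(batches, fn):
    """Transformation: apply fn to every record of every batch, lazily."""
    for batch in batches:
        yield [fn(record) for record in batch]


def output(batches, sink):
    """Output operation: push each batch to an external sink, one batch at a time."""
    written = 0
    for batch in batches:
        sink(batch)
        written += len(batch)
    return written


if __name__ == "__main__":
    stored = []  # stands in for HDFS, HBase, or another external system
    stream = [[1, 2, 3], [4, 5]]  # two batch intervals of raw records
    doubled = transform(stream, lambda x: x * 2)
    # Nothing has run yet; the output operation is what drives the computation.
    count = output(doubled, stored.append)
    print(count, stored)
```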

Storing Streaming Data Using HBase

For storing lots of streaming data, we need a data store that supports fast writes and scales.

With MapR-DB (HBase API), a table is automatically partitioned across a cluster by key range, and each server is the source for a subset of a table. Grouping the data by key range provides for really fast reads and writes by row key.
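Key-range partitioning can be sketched with a sorted list of split keys and a binary search. The class below is an in-memory illustration only, not the MapR-DB or HBase client; `KeyRangeTable`, its methods, and the sample keys are hypothetical.

```python
import bisect


class KeyRangeTable:
    """Route row keys to regions by sorted split points (illustrative only)."""

    def __init__(self, split_keys):
        # N split keys define N + 1 contiguous key ranges (regions).
        self.split_keys = sorted(split_keys)
        self.regions = [dict() for _ in range(len(self.split_keys) + 1)]

    def region_for(self, row_key):
        # bisect_right returns the index of the range that contains row_key;
        # a key equal to a split point belongs to the range starting there.
        return bisect.bisect_right(self.split_keys, row_key)

    def put(self, row_key, value):
        self.regions[self.region_for(row_key)][row_key] = value

    def get(self, row_key):
        return self.regions[self.region_for(row_key)].get(row_key)


if __name__ == "__main__":
    table = KeyRangeTable(["g", "p"])
    table.put("well-7", {"psi": 80})
    print(table.region_for("well-7"), table.get("well-7"))
```

A lookup by row key touches exactly one region, which is why reads and writes by row key stay fast as the table grows.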

Also, with MapR-DB each partitioned subset or region of a table has a write and read cache. Writes are sorted in cache and appended to a WAL; writes and reads to disk are always sequential; recently read or written data and cached column families are available in memory. All of this provides for really fast reads and writes.

With a relational database and a normalized schema, query joins cause bottlenecks with lots of data. MapR-DB and a de-normalized schema scale because data that is read together is stored together.

So how do we collect, process, and store real-time events with high performance at scale? The key is partitioning, caching, and minimizing time spent on disk reads and writes for:
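The write path described above can be approximated with a small in-memory sketch: every write is appended to a log, held in a write cache, and flushed as a sorted segment once the cache fills. This is a heavy simplification and not how MapR-DB or HBase is implemented internally; the `Region` class, the `flush_threshold` parameter, and the list-based "disk" are all assumptions made for illustration.

```python
class Region:
    """Toy region with a write-ahead log, a write cache, and sorted segments."""

    def __init__(self, flush_threshold=3):
        self.wal = []            # append-only log of every write
        self.memstore = {}       # in-memory write cache
        self.flushed = []        # sorted, immutable segments (stand-in for disk)
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # sequential append, never rewritten
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Sorting before the flush means the segment is written in key order.
        self.flushed.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, key):
        if key in self.memstore:        # the newest data lives in the cache
            return self.memstore[key]
        for segment in reversed(self.flushed):  # then newest segment first
            for k, v in segment:
                if k == key:
                    return v
        return None
```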

Messaging with MapR Streams
Processing with Spark Streaming
Storage with MapR-DB

Serving the Data

End applications like dashboards, business intelligence tools, and other applications use the processed event data. The processing output can also be stored back in MapR-DB, in another Column Family or Table, for further processing later.

Example Use Case Code

Now we will step through the code for a MapR Streams producer sending messages, and for Spark Streaming processing the events and storing data in MapR-DB.

MapR Streams Producer Code

The steps for a producer sending messages are:

Set producer properties

The first step is to set the KafkaProducer configuration properties, which will be used later to instantiate a KafkaProducer for publishing messages to topics.

Create a KafkaProducer

You instantiate a KafkaProducer by providing the set of key-value pair configuration properties which you set up in the first step. Note that KafkaProducer<K,V> is a Java generic class. You need to specify the type parameters as the types of the key and value of the messages that the producer will send.

Build the ProducerRecord message

The ProducerRecord is a key-value pair to be sent to Kafka. It consists of a topic name to which the record is being sent, an optional partition number, and an optional key and a message value. The ProducerRecord is also a Java generic class, whose type parameters should match the serialization properties set before. In this example, we instantiate the ProducerRecord with a topic name and message text as the value, which will create a record with no key.

Send the message

Call the send method on the KafkaProducer passing the ProducerRecord, which will asynchronously send a record to the specified topic. This method returns a Java Future object, which will eventually contain the response information. The asynchronous send() method adds the record to a buffer of pending records to send, and immediately returns. This allows sending records in parallel without waiting for the responses, and allows the records to be batched for efficiency.

Finally, call the close method on the producer to release resources. This method blocks until all requests are complete.
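The four producer steps can be mimicked with a toy, in-memory stand-in written in plain Python. It is not the Kafka or MapR Streams client: the `ToyProducer` class, the `toy.batch.records` setting, and the topic name are hypothetical and exist only to show the shape of the flow, in which send() buffers a record and returns a future immediately, records go out in batches, and close() blocks until everything pending is delivered.

```python
from concurrent.futures import Future


class ToyProducer:
    """In-memory stand-in for a producer with an asynchronous send()."""

    def __init__(self, config):
        self.config = dict(config)   # step 1: configuration properties
        self.pending = []            # buffer of (record, future) pairs
        self.log = {}                # topic -> list of delivered values
        self.closed = False

    def send(self, topic, value, key=None):
        """Buffer the record and return immediately with a Future."""
        if self.closed:
            raise RuntimeError("producer is closed")
        future = Future()
        self.pending.append(((topic, key, value), future))
        if len(self.pending) >= self.config.get("toy.batch.records", 2):
            self.flush()
        return future

    def flush(self):
        """Deliver every buffered record as one batch and resolve its Future."""
        batch, self.pending = self.pending, []
        for (topic, _key, value), future in batch:
            messages = self.log.setdefault(topic, [])
            messages.append(value)
            future.set_result((topic, len(messages) - 1))  # (topic, offset)

    def close(self):
        """Deliver anything still buffered, then refuse further sends."""
        self.flush()
        self.closed = True


if __name__ == "__main__":
    producer = ToyProducer({"toy.batch.records": 2})   # steps 1 and 2
    futures = [producer.send("sensor-topic", f"msg-{i}") for i in range(3)]  # steps 3 and 4
    producer.close()
    print([f.result() for f in futures])
```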

Apply transformations (which create new DStreams)

We parse the message values into Sensor objects with the map operation on the dStream. Any operation applied on a DStream translates to operations on the underlying RDDs, so the map operation applies the Sensor.parseSensor function to each RDD in the dStream, producing the sensorDStream RDDs of Sensor objects.
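A plain-Python sketch of this parsing step follows. The article does not list the sensor message schema, so the four comma-separated fields used here (well ID, date, time, pressure) are assumptions, as are the names `parse_sensor` and `map_batches`; a list of lists stands in for the DStream and its RDDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sensor:
    # Hypothetical fields; the real schema is not given in the article.
    well_id: str
    date: str
    time: str
    psi: float


def parse_sensor(line: str) -> Sensor:
    """Parse one comma-separated message value into a Sensor."""
    well_id, date, time, psi = line.strip().split(",")
    return Sensor(well_id, date, time, float(psi))


def map_batches(batches, fn):
    """Apply fn to each record of each batch, like a map over a DStream's RDDs."""
    return [[fn(record) for record in batch] for batch in batches]


if __name__ == "__main__":
    batches = [["W1,3/10/14,1:01,85.5"], ["W2,3/10/14,1:02,4.0"]]
    for batch in map_batches(batches, parse_sensor):
        print(batch)
```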
