Evolution of data management architectures: Is Structured Streaming the next big thing?
The rapid growth of data volumes and ingestion velocity has made real-time data processing a necessity. Over the years, data management architects have introduced a variety of design patterns, each addressing specific challenges in real-time data processing.
Lambda Architecture
In his 2011 article “How to beat the CAP theorem,” Nathan Marz described the principles of the Lambda architecture: a hybrid architecture that combines batch processing and stream processing. He classified its components into three layers: the Batch Layer, the Speed Layer, and the Serving Layer.
The Batch Layer uses a “cold path” to store and process the entire data set: it precomputes aggregations and summaries from the raw data and stores them persistently in a batch-layer serving database.
The Speed Layer provides a “hot path” for processing events and streams of data in real time, generating incremental updates that augment the batch-computed summary data.
Finally, the Serving Layer combines the batch and real-time views to provide a unified view of the data.
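To make the three layers concrete, here is a minimal PySpark sketch of a Lambda-style pipeline. The paths, Kafka topic, and column names (events, event_type) are hypothetical, and the Kafka source assumes the spark-sql-kafka package is available; treat this as an illustration of the pattern rather than a production recipe.

```python
# A minimal Lambda-style sketch. Paths, topic, and column names are
# hypothetical; the Kafka source needs the spark-sql-kafka package.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Batch Layer ("cold path"): precompute aggregates over the full history.
batch_view = (spark.read.parquet("/data/events")
              .groupBy("event_type")
              .agg(F.count("*").alias("total")))
batch_view.write.mode("overwrite").parquet("/serving/batch_view")

# Speed Layer ("hot path"): incremental counts over the live stream.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS event_type"))
speed_view = stream.groupBy("event_type").count()

# Serving Layer: expose the speed view so queries can merge it with the
# precomputed batch view at read time.
(speed_view.writeStream
           .outputMode("complete")
           .format("memory")
           .queryName("speed_view")
           .start())
```

The duplication is visible even in this sketch: the batch and speed paths compute the same aggregate twice, in two code paths that must be kept consistent, which is exactly the maintenance burden described above.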
The benefit of the Lambda architecture is that it provides fault tolerance, scalability, and consistency throughout the data pipeline. Its shortfall is that it is not easy to implement: it requires considerable effort to develop, test, and maintain, and the skill set needed to design platforms based on the Lambda pattern is broader in spectrum than for the alternatives.
The Lambda architecture is best suited for scenarios that require the flexibility and scalability to process high-volume data in both offline and real-time modes.
Delta Lake is a technology that provides ACID transactions, schema enforcement, and other reliability guarantees on top of cloud-based data lakes, typically accessed through engines such as Apache Spark. Delta Lake provides the reliability and correctness required for the Batch Layer of the Lambda architecture, which stores and processes the entire data set.
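As a rough sketch, the batch view from the earlier example could be recomputed and stored as a Delta table. This assumes the delta-spark package is installed; the paths and column names are hypothetical.

```python
# A minimal sketch of Delta Lake as the Batch Layer store. Assumes the
# delta-spark package is installed; paths and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Recompute the batch view over the entire raw data set. The overwrite
# is a single ACID transaction, and the table's schema is enforced on
# all later writes.
raw = spark.read.parquet("/lake/raw/events")
(raw.groupBy("event_type")
    .agg(F.count("*").alias("total"))
    .write.format("delta")
    .mode("overwrite")
    .save("/lake/serving/batch_view"))
```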
Kappa Architecture
Three years after the introduction of the Lambda pattern's principles, an alternative approach known as the Kappa architecture was introduced by Jay Kreps in his 2014 post “Questioning the Lambda Architecture.” The Kappa architecture is a pure stream processing architecture that uses a single processing layer.
There is no batch layer in the Kappa architecture; a single unified processing layer handles both batch and real-time workloads. It ingests data from sources of all types, processes it, and writes the results to the sink side of the real-time processing system. The architecture is fault-tolerant, scalable, and provides low-latency processing capabilities.
Provided that the sources can deliver data as real-time streams or micro-batches, the Kappa architecture is simpler to develop and maintain than the Lambda architecture. It is best suited for scenarios where low-latency processing is required and the data volume is not significantly high. A minimal sketch of the idea follows.
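In this PySpark sketch, one streaming job serves both historical and live processing, because reprocessing is just replaying the log from the earliest offset. The Kafka topic, broker address, and paths are hypothetical.

```python
# A minimal Kappa-style sketch: one streaming job serves both historical
# and live processing by replaying the log. Topic and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kappa-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .option("startingOffsets", "earliest")  # reprocessing = replay
          .load()
          .selectExpr("CAST(value AS STRING) AS event_type", "timestamp"))

# The same code path computes past and present aggregates.
counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "5 minutes"), "event_type")
          .count())

(counts.writeStream
       .outputMode("append")
       .format("parquet")
       .option("path", "/lake/kappa/counts")
       .option("checkpointLocation", "/lake/kappa/_checkpoints")
       .start())
```

Note that there is only one aggregation to maintain; to recompute history after a logic change, you redeploy the job against the replayed log rather than maintaining a separate batch pipeline.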
Structured Streaming
Structured Streaming is a stream processing framework introduced in Apache Spark 2.0 in 2016, two years after the first demonstration of the Kappa design principles. Data is perceived and processed as a sequence of structured data streams, where each stream represents an unbounded table. This lets developers work with streaming data in a structured manner, similar to working with data in a traditional database.
The Structured Streaming paradigm proposes a “continuous processing model” in which data is aggregated and analyzed continuously over time, enabling consumers of the data to make decisions based on up-to-date information. Data is accessed through high-level APIs that let developers apply structured operations, SQL-like queries, and machine learning algorithms to these unbounded tables as if they were traditional tables, without worrying about how the underlying data is assembled from the source streams and micro-batches.
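The “unbounded table” model is easy to demonstrate with Spark's built-in rate source, which generates (timestamp, value) rows and needs no external system. The sketch below registers the stream as a view and queries it with ordinary SQL.

```python
# The stream as an unbounded table: the built-in "rate" source emits
# (timestamp, value) rows, so no external system is needed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unbounded-table").getOrCreate()

stream = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Register the stream as a view and query it like an ordinary table;
# Spark incrementalizes the SQL behind the scenes.
stream.createOrReplaceTempView("events")
parity_counts = spark.sql("""
    SELECT value % 2 AS parity, COUNT(*) AS n
    FROM events
    GROUP BY value % 2
""")

(parity_counts.writeStream
              .outputMode("complete")
              .format("console")
              .start()
              .awaitTermination())
```

Because the query is expressed against the unbounded table, the same SQL would work unchanged against a static snapshot of the data; that is the sense in which streams are treated like traditional tables.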
Structured Streaming is best suited for scenarios where real-time processing of continuous data streams is required.
The rise of delta lakes and data lakehouses
Among the factors that shifted focus toward Structured Streaming was the introduction of Delta Lake and the data lakehouse. Structured Streaming plays a critical role in enabling the real-time data processing that supports these concepts in the world of big data.
Both Delta Lake and the data lakehouse can employ Structured Streaming to process real-time data streams, but there is a slight difference between the two in their approach to data processing.
The data lakehouse provides a unified platform for managing and analyzing structured and unstructured data using SQL and other traditional data warehouse tools. With Structured Streaming, a lakehouse can process real-time data streams and write the results to various sinks.
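As a sketch of that flow, the snippet below streams data into a table in the lake (parquet here; a Delta-based lakehouse would use the delta format) while the same table remains queryable with ordinary SQL. The paths are hypothetical.

```python
# A lakehouse-style sketch: a stream lands in a table in the lake while
# the same table is queried with plain SQL. Paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

stream = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 5)
          .load()
          .withColumn("bucket", F.col("value") % 10))

# Continuously append the stream into the lake (use format "delta" in a
# Delta-based lakehouse).
(stream.writeStream
       .format("parquet")
       .option("path", "/lakehouse/metrics")
       .option("checkpointLocation", "/lakehouse/_chk/metrics")
       .start())

# Analysts query the same table with ordinary batch SQL.
spark.read.parquet("/lakehouse/metrics").createOrReplaceTempView("metrics")
spark.sql("SELECT bucket, COUNT(*) AS n FROM metrics GROUP BY bucket").show()
```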
Delta Lake, on the other hand, is a technology built on top of data lakes that provides ACID transaction support, schema enforcement, and other governance and reliability guarantees for big data processing. Delta Lake can also use Structured Streaming to ingest data from various sources, process it, and write the results to a sink in real time.
Another aspect of Delta Lake is its history features, such as versioning, time travel, and data retention, which make it easier to manage and maintain data in a data lake environment. With Structured Streaming, Delta Lake can process large volumes of data in real time while ensuring data correctness and reliability.
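A short sketch of both points, assuming the delta-spark package is installed and using hypothetical paths: each streaming micro-batch commits a new version of the table, which time travel can later read back.

```python
# Sketch: streaming ingestion into Delta plus time travel. Assumes the
# delta-spark package is installed; paths are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Each micro-batch commit is an ACID transaction that creates a new
# version of the Delta table.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
(stream.writeStream
       .format("delta")
       .option("checkpointLocation", "/lake/_chk/events")
       .start("/lake/events"))

# Time travel: read the table as of an earlier version, and inspect the
# commit history that underpins versioning and retention.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/lake/events")
spark.sql("DESCRIBE HISTORY delta.`/lake/events`").show()
```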