Evolution of data management architectures: Is Structured Streaming the next big thing?
The rapid growth of data volumes and ingestion velocity has made real-time data processing a necessity. Over the years, data management architects have introduced a variety of design patterns, each addressing specific challenges in real-time data processing.
Lambda Architecture
In his 2011 article “How to beat the CAP theorem,” Nathan Marz described the principles of the Lambda architecture: a hybrid architecture that combines batch processing and stream processing. He classified its components into three layers: the Batch Layer, the Speed Layer, and the Serving Layer.
The Batch Layer uses a “cold path” to store and process the entire data set: it precomputes aggregations and summaries from the raw data and stores them persistently in a batch-layer serving database.
The Speed Layer provides a “hot path” for processing events and streams of data in real time, generating incremental updates that augment the batch-computed summary data.
Finally, the Serving Layer combines the batch and real-time views to provide a unified view of the data.
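To make the three layers concrete, here is a minimal PySpark sketch of a Lambda-style pipeline. The paths, Kafka topic, and column names (events, event_type) are hypothetical, and the Kafka source assumes the spark-sql-kafka package is available; treat this as an illustration of the pattern rather than a production recipe.

```python
# A minimal Lambda-style sketch. Paths, topic, and column names are
# hypothetical; the Kafka source needs the spark-sql-kafka package.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Batch Layer ("cold path"): precompute aggregates over the full history.
batch_view = (spark.read.parquet("/data/events")
              .groupBy("event_type")
              .agg(F.count("*").alias("total")))
batch_view.write.mode("overwrite").parquet("/serving/batch_view")

# Speed Layer ("hot path"): incremental counts over the live stream.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS event_type"))
speed_view = stream.groupBy("event_type").count()

# Serving Layer: expose the speed view so queries can merge it with the
# precomputed batch view at read time.
(speed_view.writeStream
           .outputMode("complete")
           .format("memory")
           .queryName("speed_view")
           .start())
```

The duplication is visible even in this sketch: the batch and speed paths compute the same aggregate twice, in two code paths that must be kept consistent, which is exactly the maintenance burden described above.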
The benefit of the Lambda architecture is that it provides fault tolerance, scalability, and consistency throughout the data pipeline. Its shortfall is that it is not easy to implement: it requires considerable effort to develop, test, and maintain, and the skill set needed to design platforms based on the Lambda pattern is broader in spectrum than for the alternatives.
The Lambda architecture is best suited for scenarios that require the flexibility and scalability to process high-volume data in both offline and real-time modes.
Delta Lake is a technology that provides ACID transactions, schema enforcement, and other reliability guarantees on top of cloud-based data lakes, typically accessed through engines such as Apache Spark. Delta Lake provides the reliability and correctness required for the Batch Layer of the Lambda architecture, which stores and processes the entire data set.
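As a rough sketch, the batch view from the earlier example could be recomputed and stored as a Delta table. This assumes the delta-spark package is installed; the paths and column names are hypothetical.

```python
# A minimal sketch of Delta Lake as the Batch Layer store. Assumes the
# delta-spark package is installed; paths and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Recompute the batch view over the entire raw data set. The overwrite
# is a single ACID transaction, and the table's schema is enforced on
# all later writes.
raw = spark.read.parquet("/lake/raw/events")
(raw.groupBy("event_type")
    .agg(F.count("*").alias("total"))
    .write.format("delta")
    .mode("overwrite")
    .save("/lake/serving/batch_view"))
```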
Kappa Architecture
Three years after the introduction of the Lambda pattern's principles, an alternative approach known as the Kappa architecture was introduced by Jay Kreps in his 2014 post “Questioning the Lambda Architecture.” The Kappa architecture is a pure stream processing architecture that uses a single processing layer.
There is no batch layer in the Kappa architecture; a single unified processing layer handles both batch and real-time workloads. It ingests data from sources of all types, processes it, and writes the results to the sink side of the real-time processing system. The architecture is fault-tolerant, scalable, and provides low-latency processing capabilities.
Provided that the sources can deliver data as real-time streams or micro-batches, the Kappa architecture is simpler to develop and maintain than the Lambda architecture. It is best suited for scenarios where low-latency processing is required and the data volume is not significantly high. A minimal sketch of the idea follows.
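In this PySpark sketch, one streaming job serves both historical and live processing, because reprocessing is just replaying the log from the earliest offset. The Kafka topic, broker address, and paths are hypothetical.

```python
# A minimal Kappa-style sketch: one streaming job serves both historical
# and live processing by replaying the log. Topic and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kappa-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .option("startingOffsets", "earliest")  # reprocessing = replay
          .load()
          .selectExpr("CAST(value AS STRING) AS event_type", "timestamp"))

# The same code path computes past and present aggregates.
counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "5 minutes"), "event_type")
          .count())

(counts.writeStream
       .outputMode("append")
       .format("parquet")
       .option("path", "/lake/kappa/counts")
       .option("checkpointLocation", "/lake/kappa/_checkpoints")
       .start())
```

Note that there is only one aggregation to maintain; to recompute history after a logic change, you redeploy the job against the replayed log rather than maintaining a separate batch pipeline.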
Structured Streaming
Structured Streaming is a stream processing framework introduced in Apache Spark 2.0 in 2016, two years after the first demonstration of the Kappa design principles. Data is perceived and processed as a sequence of structured data streams, where each stream represents an unbounded table. This lets developers work with streaming data in a structured manner, similar to working with data in a traditional database.
The Structured Streaming paradigm proposes a “continuous processing model” in which data is aggregated and analyzed continuously over time, enabling consumers of the data to make decisions based on up-to-date information. Data is accessed through high-level APIs that let developers apply structured operations, SQL-like queries, and machine learning algorithms to these unbounded tables as if they were traditional tables, without worrying about how the underlying data is assembled from the source streams and micro-batches.
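The “unbounded table” model is easy to demonstrate with Spark's built-in rate source, which generates (timestamp, value) rows and needs no external system. The sketch below registers the stream as a view and queries it with ordinary SQL.

```python
# The stream as an unbounded table: the built-in "rate" source emits
# (timestamp, value) rows, so no external system is needed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unbounded-table").getOrCreate()

stream = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Register the stream as a view and query it like an ordinary table;
# Spark incrementalizes the SQL behind the scenes.
stream.createOrReplaceTempView("events")
parity_counts = spark.sql("""
    SELECT value % 2 AS parity, COUNT(*) AS n
    FROM events
    GROUP BY value % 2
""")

(parity_counts.writeStream
              .outputMode("complete")
              .format("console")
              .start()
              .awaitTermination())
```

Because the query is expressed against the unbounded table, the same SQL would work unchanged against a static snapshot of the data; that is the sense in which streams are treated like traditional tables.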
Structured Streaming is best suited for scenarios where real-time processing of continuous data streams is required.
The rise of delta lakes and data lakehouses
Among the factors that shifted focus toward Structured Streaming was the introduction of Delta Lake and the data lakehouse. Structured Streaming plays a critical role in enabling the real-time data processing that supports these concepts in the world of big data.
Both Delta Lake and the data lakehouse can employ Structured Streaming to process real-time data streams, but there is a slight difference between the two in their approach to data processing.
The data lakehouse provides a unified platform for managing and analyzing structured and unstructured data using SQL and other traditional data warehouse tools. With Structured Streaming, a lakehouse can process real-time data streams and write the results to various sinks.
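As a sketch of that flow, the snippet below streams data into a table in the lake (parquet here; a Delta-based lakehouse would use the delta format) while the same table remains queryable with ordinary SQL. The paths are hypothetical.

```python
# A lakehouse-style sketch: a stream lands in a table in the lake while
# the same table is queried with plain SQL. Paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

stream = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 5)
          .load()
          .withColumn("bucket", F.col("value") % 10))

# Continuously append the stream into the lake (use format "delta" in a
# Delta-based lakehouse).
(stream.writeStream
       .format("parquet")
       .option("path", "/lakehouse/metrics")
       .option("checkpointLocation", "/lakehouse/_chk/metrics")
       .start())

# Analysts query the same table with ordinary batch SQL.
spark.read.parquet("/lakehouse/metrics").createOrReplaceTempView("metrics")
spark.sql("SELECT bucket, COUNT(*) AS n FROM metrics GROUP BY bucket").show()
```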
Delta Lake, on the other hand, is a technology built on top of data lakes that provides ACID transaction support, schema enforcement, and other governance and reliability guarantees for big data processing. Delta Lake can also use Structured Streaming to ingest data from various sources, process it, and write the results to a sink in real time.
Another aspect of Delta Lake is its history features, such as versioning, time travel, and data retention, which make it easier to manage and maintain data in a data lake environment. With Structured Streaming, Delta Lake can process large volumes of data in real time while ensuring data correctness and reliability.
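A short sketch of both points, assuming the delta-spark package is installed and using hypothetical paths: each streaming micro-batch commits a new version of the table, which time travel can later read back.

```python
# Sketch: streaming ingestion into Delta plus time travel. Assumes the
# delta-spark package is installed; paths are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Each micro-batch commit is an ACID transaction that creates a new
# version of the Delta table.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
(stream.writeStream
       .format("delta")
       .option("checkpointLocation", "/lake/_chk/events")
       .start("/lake/events"))

# Time travel: read the table as of an earlier version, and inspect the
# commit history that underpins versioning and retention.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/lake/events")
spark.sql("DESCRIBE HISTORY delta.`/lake/events`").show()
```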