Data Technology Trend #8: Data Next
Lakshmi Venkatesh
Head of Enterprise Data, Analytics & Transformation Architecture at GIC | SG Digital Leader 2024 | Accredited Board Director (SID) | MBA, MTech NUS, MCA, MCOM | Blogger & International Keynote Speaker
Trend 8.1 Unified and Enriched Big Data and AI – Streaming & Delta Lake
Delta Lake:
There have been several massive shifts in data technologies, and especially in recent times the data world has been upended: the importance of data management, governance, and strategy has taken a front seat.
Note: Unlike the other trends in this series, for Delta Lake and Kafka streaming I am taking a bit of a deep dive, as I think they are the future of distributed file processing and data management. This article is "too long, do read!" :) I will try to make it worth your time.
The key shift in Data trends:
1. ETL -> ELT
2. Big Data Map Reduce -> Apache Spark
3. On-Premise -> Modern Cloud Data Platforms
4. Hadoop & Data Warehousing -> Data Lakes
5. Siloed databases -> Converged and unified OLTP, OLAP, and analytics platforms
6. Data streaming: from an alternative to an essential approach, and more…
I genuinely think Delta Lake will be adopted by more and more organizations and that it is the future, which is why I wanted to highlight it as part of the final trend, "Data Next"!
What is Delta Lake:
Delta Lake is an open-source storage layer that enables you to build a "Lakehouse" architecture on top of existing systems such as S3, ADLS, GCS, and HDFS.
We have seen it all, from small data to data warehouses to data lakes. For application servers in general, across the various cloud offerings, the trend has been to separate the compute layer from the storage layer so that each can scale independently. In the same vein, database servers, which traditionally coupled storage and compute (processing) tightly, have been de-coupled – EMR, data lakes, Snowflake, etc., are good examples of this. Delta Lake is also file-based (this could be your existing data lake, yes) – you have a query layer (or processing with Apache Spark) on top of a massive file system, with the ability to provide ACID consistency (which means you can practically insert, update, and delete data – not a practical possibility for many of the file-based modern cloud data platforms), versioned data, and inherent scalability to petabytes or exabytes – how cool is that! ACID consistency is ensured by the transaction log (like the redo logs in Oracle or the WAL of PostgreSQL).
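To make this concrete, here is a minimal sketch of Delta Lake's ACID and time-travel behaviour, assuming the open-source `delta-spark` package and a local path for illustration (in practice you would point at an s3://, abfss:// or gs:// location). Table name, columns, and path are hypothetical.

```python
# Minimal Delta Lake sketch: write, ACID update, time travel.
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/trades_delta"  # hypothetical table location

# Write version 0 of the table.
spark.createDataFrame(
    [(1, "SGD", 100.0), (2, "USD", 250.0)], ["trade_id", "ccy", "amount"]
).write.format("delta").mode("overwrite").save(path)

# ACID update in place -- not practical on a plain Parquet data lake.
DeltaTable.forPath(spark, path).update(
    condition="trade_id = 2", set={"amount": "275.0"}
)

# Time travel: read the table as it was before the update.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```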
The problem Delta Lake is trying to solve
Apache Spark brought Big Data back into the limelight and solved the problem of building massive and performance-rich data sets for data pipelining and processing.
The Data Lake solved the problem of unifying data into a single repository, enabling you to build a data platform and, by leveraging that platform, build data applications such as data warehousing, efficient and performant data queries, and analytics on top of it. The single version of truth shifted from paper to reality.
When effective technologies such as Apache Spark and the Data Lake are already in place and doing pretty well, why adopt yet another new technology? Because challenges still exist in the data lake space, from data reliability and security to performance – in their own words.
Spark or a Data Lake on its own still requires a rich toolset and substantial building effort. Delta Lake puts a layer on top of these two amazing technologies and provides the tools organizations need to achieve unified data solutioning. So Delta Lake is not a replacement for Apache Spark and/or the Data Lake; rather, it is an additional layer enabling organizations to build performant and unified data platforms. Spark does not have ACID consistency on its own; Delta Lake brings ACID consistency on top of Spark, similar to transactional databases, while handling massive volumes of data and unifying streaming and batch processing (see the sketch below). Delta Lake is built by the creators of Spark.
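As a small illustration of the batch/streaming unification, the same hypothetical Delta table from the previous sketch (and the same `spark` session) can be read both as a batch DataFrame and as a streaming source.

```python
# One Delta table, two access modes.
# Batch query:
batch_df = spark.read.format("delta").load("/tmp/trades_delta")
print(batch_df.count())

# Streaming query over the very same table: new commits arrive as micro-batches,
# and the transaction log keeps readers consistent with concurrent writers.
stream = (
    spark.readStream.format("delta")
    .load("/tmp/trades_delta")
    .writeStream.format("console")
    .option("checkpointLocation", "/tmp/trades_delta_chk")  # hypothetical checkpoint path
    .start()
)
```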
Data Lake Vs. Delta Lakehouse Vs. Data Warehouse:
Reference: Source
Why Data Warehouses fail:
The main purpose of building a Data Warehouse is to be able to generate analytics from a single place for business decision-making and improved performance. However, many Data Warehousing applications fail for two key reasons:
(1) Upstream-to-downstream data curation and normalization: Data Warehouses store curated data from upstream systems, as Data Warehouses are immutable. Will all the upstream systems be willing to do this, and will they have the bandwidth to do it even if it is a management mandate? Not really. There will be data duplication across the organization and process overlaps built up over several years. Identifying the redundant processes and building a unified system in which upstream sends all the relevant information is not a cakewalk (who said walking in cake is an easy task.. anyway). Building a Data Warehouse has so many precursors that need to be satisfied, and they are often overlooked or not addressed before the Data Warehouse gets started. So the Data Warehouse comes back to the drawing board more often than not. You will have seen 150+ Data Warehouses in a single firm. If the Data Warehouse is the single version of the truth and is "central" to the firm, then why are there so many central Data Warehouses? To implement a Data Warehousing solution in a firm, the business faces a constant trade-off between urgency and importance. Data Warehousing is not a low-hanging fruit, and its benefit is realized over time. Any new feature-rich project will always be given importance by the business over Data Warehousing.
(2) The factor of time: Building a Data Warehouse takes time. While answering the essential questions of why Data Warehousing? and what approach are we taking to build it (Inmon vs. Kimball, or a combination based on organizational needs, etc.), a clear message should be delivered that building Data Warehouses takes substantial time and effort and that the benefits will be seen only in the long run. Many firms set this deadline at 1-2 years, mark the Data Warehouse as a failure, and move on to building a new one within a few years. Data Warehousing is a function of resources, time, quality, cost, technology, and, most importantly, data. Reading, understanding, curating, and normalizing thousands upon thousands of data points across several systems into a single warehouse is not an easy task!
The problems that exist in Data Warehousing do not vanish just by introducing a Data Lake or Delta Lake. However, the unification of data into a single place, without the need to curate thousands of data points first, becomes quicker and easier. Also, unlike Data Warehousing, the unification of data sees the light of day sooner, because all the data, in its original form, sits in a single place without any fancy modifications (with the ability to process structured and unstructured data, and without the need to stick to proprietary file types). Building a single version of the truth from this massive, unified data set by talking to one business at a time and bringing in only the essential fields becomes easier. Building data services for quick business use on top of this massive staging area becomes a lot easier.
A few impressive features of Delta Lake:
The Data and AI Summit of 2021 announced several key features. I have listed a few impressive ones, plus those introduced recently. Beyond the other main features such as metadata handling, time travel, and support for CRUD, the notable and impressive features are below.
1. The Lakehouse architecture -> Delta.io
A Lakehouse architecture consolidates/integrates the entire data journey, from sourcing to consumption. Delta Lake is an open-source project with which you can build a Lakehouse architecture. Delta Lake is a unification of Data Warehousing, the Data Lake, and analytics, and it can be built using the tools supplied by any of the modern cloud providers such as AWS or Azure, built with Databricks' tooling, or built using other modern technologies.
The Lakehouse architecture as a concept is not new. Ever since we started ingesting and processing data, the ultimate aim of the data store has been to gain meaningful, actionable insights from the data and to use it for business growth and performance. While this happens in silos across the organization in a multi-tier mode, the introduction of Delta Lake puts a structure around it and makes data sourcing, storing, and sharing easy. Based on the paper by Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia, the diagram below depicts the evolution of Delta Lake, aka the Lakehouse architecture.
Even though many organizations may start with a two-tier architecture, the good news is that shifting to a Lakehouse architecture, though complicated by massive and disparate data points, is not cumbersome or complex. The Lakehouse architecture is essentially a complete bundle of data management systems covering everything from the sourcing to the consumption of data, and hence it simplifies the overall data management task. Is Delta Lake the first open-source project of its kind? Not really; however, Delta Lake is more popular due to its innate benefits such as ACID, metadata management, simple storage and sharing, etc. Refer to this article for a comparison between Hudi, Iceberg, and Delta Lake.
Refer Source: Delta.io
Technology stack for Delta Lake:
We can also directly use Databricks' interface and rich toolset to build sophisticated platforms. Most of the popular public clouds provide tight integration with Databricks solutions.
2. ACID Transactions
The Lakehouse architecture enables you to do everything from sharing to consuming, and, most importantly, the consumption layer lets users perform BI, SQL analytics, machine learning, and analytics. All of this concerns Data Warehousing and analytics (the OLAP, BI, and analytics layers) – fine. But what about transactional (OLTP) databases? Only if Delta Lake provides the power of OLTP + OLAP + BI & analytics can you call it a unified modern data platform, isn't it? True. With transactional OLTP databases, the challenge is mutability – the ability to append/modify data, apply the archive log to the correct change reference number, and perform real-time operations with consistency, while at the same time addressing performance, scalability/elasticity, keeping cost in check, and maintaining the best possible data quality – which sounds impossible, doesn't it?
How about having ACID transactions on top of Spark? Spark addresses the problems of scale, performance, data pipelines, etc., and ACID solves the problem of consistency. That is the promise of Databricks. Along with ACID consistency, the Lakehouse architecture also supports partitioning, indexing, schema validation, and the handling of large metadata.
Source: Databricks
The single most important transition that has happened with the advent of the Data Lake is that organizations create a massive staging area and operate out of the data lake. This essential shift has created one more layer on top of it, the Lakehouse architecture, which enables an open format and common storage of data at varying quality and processing points – Bronze, Silver, and Gold – and provides a neat, clean way to build BI, streaming analytics, data science, or ML solutions out of it. For the next level of processing, take the data from whichever layer is suitable – Bronze, Silver, or Gold. The thought process has shifted from re-using the streaming pipeline to build applications to using the Bronze, Silver, and Gold data sources.
Even though this topic is about the Lakehouse architecture, we cannot completely ignore the streaming architecture and talk only about the Lakehouse. For designing a distributed file architecture, there is more than one way to go about it:
Method 1: Use of Streaming / Brokers:
Discussed later in the section.
Method 2: Use of Lakehouse architecture:
Yet another way of efficiently pipelining and storing data is the Lakehouse architecture. Similar to Apache Kafka streaming, the Lakehouse architecture can also be divided into three layers: producer, store, and consumer. The storage is split into three tiers: Bronze – raw data; Silver – filtered/cleaned/augmented data; Gold – processed data. Based on the business need, data can be retrieved from any of these buckets and used and re-used, as in the sketch below.
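Here is an illustrative Bronze -> Silver -> Gold flow on Delta tables, continuing with the `spark` session from the earlier sketch. Paths and column names are hypothetical; the point is that each layer is just another Delta table.

```python
from pyspark.sql import functions as F

bronze = "/lake/bronze/orders"        # raw ingested events, as landed
silver = "/lake/silver/orders"        # filtered / cleaned / augmented
gold   = "/lake/gold/daily_revenue"   # business-level aggregates

# Bronze: land the raw feed as-is.
raw = spark.read.json("/landing/orders/*.json")
raw.write.format("delta").mode("append").save(bronze)

# Silver: clean and de-duplicate.
(
    spark.read.format("delta").load(bronze)
    .dropDuplicates(["order_id"])
    .filter(F.col("amount").isNotNull())
    .write.format("delta").mode("overwrite").save(silver)
)

# Gold: aggregate for BI / reporting.
(
    spark.read.format("delta").load(silver)
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
    .write.format("delta").mode("overwrite").save(gold)
)
```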
Scalability, elasticity, and flexibility get a good score. In terms of future growth and the inclusion of new business units, you need not add an entirely new data lake solution the way you would add a queue; it is just a matter of adding a few buckets and provisioning them to the business unit in the current data lake space. The art is to keep the data lake manageable and not let it become a data swamp, which would be disastrous.
If we pull out the one core issue that converts a Data Lake into a data swamp, it is segmentation and the need to provide access rights to different business units based on the underlying data set – Data Governance. If this can be addressed, the data lake can be used optimally with high data quality. Simplify the sharing using the "Unity Catalog" – this is one good concept that can be adopted regardless of whether we use Databricks or not. It is so simple and efficient.
3. Unity Catalog
Databricks Unity Catalog is a brand-new feature introduced in 2021 (at the time of writing there is a waitlist to sign up for the solution). The very reason data lakes become data swamps is the missing governance around the ever-growing data in the lake, the creation of multiple buckets, and the granting of file-based access rights to end users. If this problem can be sorted, data swamps can be avoided and data lakes retained while taking care of data quality. To achieve this, instead of giving permission at the file level, give permission at the query level! Users in any organization are of three types: (1) simple users, (2) power users, and (3) super users. If access to data is restricted at the query/table level for any of these users, the data lake itself need not be disturbed or altered.
Data Lake starts all neat…
Unity Catalog, according to Databricks, is the world's first unified catalog for the lakehouse.
Key features
1. One interface to govern all data assets
2. One security model based on “ANSI SQL”
3. Full integration with existing catalogs
As the data lake stores everything as files, there is a range of users who need different access rights to different files. To separate them, we tend to create multiple buckets and grant permissions at the file level. Databricks saw this as the point where the data lake starts to get murkier and introduced permissions at the ANSI SQL level instead of the file level. If a user has permission to run a query on a table, then they have access to that data set. This is not a new concept; we have been dealing with fine-grained access permissions in databases for several decades. But with the shift to the modern data platform and the use of new technology, the same fine-grained access permissions need to be shifted and re-aligned for performance, ease of use, and keeping the data lake from getting murkier. It addresses four main issues:
1. What if users are interested only in a few columns in a file/table?
2. What if there is a change in the data layout?
3. What if the data lake and the tables go out of sync – which data do you use, and do you need a different governance model for each data technology?
4. What if there is a change in the organization's governance rules?
Solution:
The time-tested solution (see the sketch after this list) of providing:
- Fine-grained permissions on tables, fields, views, and NOT files
- Industry-standard – ANSI SQL grants
- Unified permission model for all data assets
- Centrally audited
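A hedged sketch of the ANSI SQL grant model described above: permissions are expressed on tables and views rather than on files. This assumes a Databricks cluster with Unity Catalog enabled; the catalog, schema, and principal names are hypothetical.

```python
# Grant access at the catalog/table level instead of the file level.
spark.sql("GRANT USE CATALOG ON CATALOG lakehouse TO `analysts`")
spark.sql("GRANT SELECT ON TABLE lakehouse.sales.orders TO `analysts`")

# A view can expose only the columns a group is allowed to see,
# without touching the underlying files in the data lake.
spark.sql("""
    CREATE VIEW lakehouse.sales.orders_public AS
    SELECT order_id, order_date, amount FROM lakehouse.sales.orders
""")
# Views are granted like tables.
spark.sql("GRANT SELECT ON TABLE lakehouse.sales.orders_public TO `external_partners`")
```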
Source: Databricks Data + AI Summit 2021
4. Delta Sharing
As an organization produces more and more data, and with the organization's regional spread and the external accessibility of information, data needs to flow across borders within the same organization and outside it. Each country's regulations and governance frameworks vary, which often does not technically allow data to be shared even within the same cloud provider or framework. The data has to cross borders and different public clouds and yet remain safe and secure. There has to be a robust "sharing" framework to support this complexity. To solve this, Databricks has introduced the industry's first open protocol for secure data sharing, called "Delta Sharing".
What is Delta Sharing:
Sharing across borders, within or outside the organization, is increasingly complicated due to multi-country regulations and the different governance structures adopted in the cloud. Delta Sharing has the following goals (per Databricks):
1. Share existing, live data in data lakes/lakehouses (no need to copy it out)
2. Support a wide range of clients by using existing, open data formats
3. Strong security, auditing, and governance
4. Efficiently scale to massive datasets
Key features:
- Fully open, without proprietary vendor lock-in. Vendor-neutral OSS governance model.
- Not restricted to SQL, full support for Data Science.
- Easily managed privacy, security, and compliance.
- A vibrant ecosystem that integrates across all clouds.
How does it work:
Delta Sharing is another efficient and simple design that can be used even if an organization is not using Databricks.
Say a user from an internal or external organization requests a data set: the data provider checks the related access permissions and returns a "temporary short-lived URL" to the actual data, ring-fenced with the security protocols provisioned for, say, S3.
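A small consumer-side sketch using the open-source `delta-sharing` Python client illustrates this flow. The profile file and share/schema/table names are hypothetical; the provider hands the recipient a profile containing the sharing server endpoint and a bearer token, and the server responds with short-lived, pre-signed URLs to the underlying Parquet files.

```python
import delta_sharing

profile = "/path/to/provider.share"          # credential file from the data provider
table_url = profile + "#sales_share.curated.orders"

# List what the provider has shared with us.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load a shared table directly into pandas (or use load_as_spark on a cluster).
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```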
Source: Databricks Data + AI Summit 2021
What problem it solves:
It does not distinguish by region, by individual or cross-firm access, or by the quantity of data – if the user has permission to access the underlying data, it simply sends a short-lived URL that the recipient acts upon. This helps in multiple ways beyond the sharing of data: individual organizations or regions need not build separate Data Warehouses if they are using the data as-is, and if they process the received data set to generate ML/analytics from it, it is simply an extension of the data lake – they, in turn, can use the same Delta Sharing principles and server to share the final data with their downstream consumers. Is it good or is it good!
- As a concept can be adopted with or without Data Bricks.
- Provider can share a “single version of the truth” of the underlying table/file/partition
- Live sharing of the table and can have ACID consistency!
- Any client that reads Parquet can support Delta Sharing
- Better, faster, cheaper, smarter, more reliable, with parallelism possible using any modern cloud file system
We shall discuss how to share a table live under "Delta Live Tables" later in this section.
Delta Sharing Ecosystem:
Source: Databricks Data + AI Summit 2021
5. Delta tables to Delta Live Tables
As we saw earlier, the foundation of the Lakehouse architecture is having Bronze – raw data; Silver – filtered, cleaned, augmented data; and Gold – business-level aggregates. This is the simplest form. But in reality, as producers and consumers increase, and if we do not adopt modern features such as Unity Catalog, we may end up with multiple Bronze, Silver, and Gold buckets. This makes it difficult to maintain a reliable version of the data, and the data lake will soon end up being a data swamp. In order to preserve the single version of the truth and the reliability of data, Databricks announced "Delta Live Tables", reliable ETL made easy with Delta Lake.
What is Delta Live Table:
Delta Live Tables, as the name suggests, keep tables live, updating them as and when changes happen to the underlying data set.
Key Features:
1. Delta live tables understand your data pipeline
2. Live table understands your dependencies
3. It does automatic monitoring and recovery
4. Enables automatic environment independent data management
a. Different copies of data can be isolated and updated using the same code base.
5. Treat your data as code
a. Enables automatic testing
b. Single source of truth more than just transformation logic
6. Provides live updates.
How does it work:
It creates live tables directly on top of the underlying file/table. As data modifications happen to the underlying data source, the same changes are reflected in the live tables. For sharing, a Delta Live Table can be shared via Delta Sharing, which ensures that live data is served at the moment the request is made. A minimal sketch follows.
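This is a hedged Delta Live Tables sketch; the `dlt` Python API only runs inside a Databricks DLT pipeline, not on a plain Spark cluster, and the table and path names are hypothetical. DLT infers the dependency graph from the table definitions and keeps the downstream tables refreshed as the upstream data changes.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders landed from cloud storage")
def orders_bronze():
    # Auto Loader ingestion from a hypothetical landing zone.
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/landing/orders"))

@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_amount", "amount IS NOT NULL")  # declarative data quality rule
def orders_silver():
    return dlt.read_stream("orders_bronze").dropDuplicates(["order_id"])

@dlt.table(comment="Daily revenue for BI")
def daily_revenue_gold():
    return (dlt.read("orders_silver")
            .groupBy("order_date")
            .agg(F.sum("amount").alias("revenue")))
```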
What problem does it solve:
Going from query to production, even though Delta Sharing and Unity Catalog make it a simple and easy job, sharing data and gathering analytics on terabytes/exabytes of data that must reflect live updates every time the underlying data changes (something like materialized views over huge volumes) is not an easy task – it brings many operational challenges and can lead to performance bottlenecks.
Source: Databricks Data + AI Summit 2021
Key challenges it solves:
1. Enables data teams to innovate rapidly
2. Ensures useful and accurate analytics and BI with top-notch data quality
3. Ensures a single version of the truth
4. Adapts to organizational growth and the addition of new data
6. Databricks Machine Learning
MLFlow:
Databricks Machine Learning has put together a stack called "Managed MLflow" and has introduced a couple of new features, "AutoML" and the "Feature Store". MLflow is an open machine learning platform introduced back in 2018-19.
- AutoML: Automates machine learning by quickly enabling you to deploy models. A large part of preprocessing, feature engineering, and training is automated, which saves time, keeps the focus on the quality of the outcome, and gives resources bandwidth to focus on solutions such as explainable AI. Auto-logging, tracking, integration with the PyCaret ML library, and additional deployment backends (alongside Kubernetes, Docker, Spark, Python, Redis, etc., they have added Ray, Algorithmia, and PyTorch) are new in MLflow. Databricks has a promising roadmap for managed MLflow.
- Feature Store: Databricks has introduced an online Feature Store. As part of MLflow, if a model is trained against the Feature Store, the model itself will look up features from the feature store. A minimal tracking sketch follows.
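Below is a minimal tracking sketch using the open-source `mlflow` package; the dataset and model are illustrative, not part of the Databricks announcement. `autolog()` captures parameters, metrics, and the fitted model without explicit logging calls.

```python
import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

mlflow.autolog()  # auto-logging, one of the features highlighted above

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    model = RandomForestRegressor(n_estimators=100, max_depth=6)
    model.fit(X_train, y_train)
    mlflow.log_metric("r2_test", model.score(X_test, y_test))
```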
Source: Databricks Data + AI Summit 2021
MLOps:
MLOps using Databricks' MLflow becomes simple and efficient.
MLOps = DataOps + DevOps + ModelOps (Delta Lake + Git repos + MLflow).
7. Databricks integration in public cloud
Databricks on Azure: Accelerate Data-driven innovation with Azure Databricks.
Databricks on AWS: Simple unified platform seamlessly integrated with the AWS Services.
Databricks on GCP: Databricks' open lakehouse platform is fully integrated with GCP's data services.
---------------------------------------------------------------------------------------------------------------
Streaming:
Already discussed in the Accelerated trend; more details in this section.
A streaming or messaging system transfers data from producer to consumer. There are two key types of messaging systems: (1) point-to-point and (2) publish-subscribe. As with the modern data platform, modern data streaming for both real-time and non-real-time use cases is gaining traction.
Apache Kafka:
What is Apache Kafka:
The Kafka message broker architecture, or the general Kafka architecture, has only three key elements: the producer, the topic, and the consumer. Consumers consume whatever topics they are interested in, and producers generate and publish data to the topics. All applications – ML / AI / BI, any SQL / NoSQL databases, etc. – can consume directly from Apache Kafka.
Scalability, elasticity, and flexibility get a good score. If not designed carefully, however, it can get completely convoluted with multiple queues in place and end up as complex as a first-generation architecture rather than a modern one.
The Kafka Ecosystem and Cluster architecture
Kafka Ecosystem: The Kafka ecosystem consists of the Kafka cluster, producers, consumers, and ZooKeeper.
How does it work:
The producer publishes messages to a topic partition on a broker, and consumers consume the data. ZooKeeper elects the leader and handles scaling out and in. A bare-bones sketch follows.
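Here is a bare-bones producer/consumer sketch using the `kafka-python` client; the broker address and topic name are hypothetical.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# The producer publishes to a topic; the broker appends the record to one partition.
producer.send("trades", {"trade_id": 1, "ccy": "SGD", "amount": 100.0})
producer.flush()

# An independent consumer group reads the same topic at its own pace.
consumer = KafkaConsumer(
    "trades",
    bootstrap_servers="localhost:9092",
    group_id="risk-engine",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.partition, record.offset, record.value)
```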
As part of how Kafka works, I would like to discuss three key points:
- Publishing and Subscribing using Workflows
- Partition, Leadership and Replication
- Event-driven architecture using Quorum controller (KRaft)
- Event-driven architecture using the Quorum controller (KRaft):
Event-driven Raft consensus is Kafka without ZooKeeper. The Quorum controller is an event-driven consensus implementation.
Refer source.
Further reading: KIP-595 A Raft protocol for the Metadata Quorum
According to the Kafka team, the Quorum controller takes much less time to start up and shut down with millions of partitions, and it is more performant. The Quorum controller stores its state using an event-sourced storage model.
The result: simpler operations, simpler deployment, tighter security, support for 2+ million partitions, and single-process execution.
What problem does it solve:
Data streaming is not a new solution. Be it IBM MQ or any publisher/subscriber messaging, data streaming has been around since time immemorial. The new breed of streaming not only transfers messages from point A to point B but is slowly becoming the backbone of distributed system architecture, alongside other reference architectures for distributed systems such as (1) shards, (2) streams, and (3) databases.
Key Features:
Kafka Connect
To connect data sources (producers) or data sinks with Kafka, we use Kafka Connect.
Kafka Connect is a tool for scalability and reliability: it reliably streams data between source and target systems. Where massive data volumes are involved, Kafka Connect can ingest the entire data set and enable processing. A hedged sketch of registering a connector follows.
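This sketch registers a source connector through the Kafka Connect REST API (default port 8083). The connector class shown is the FileStreamSource that ships with Apache Kafka; the connector name, file path, and topic are hypothetical.

```python
import requests

connector = {
    "name": "orders-file-source",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/var/data/orders.txt",   # source file to ingest
        "topic": "orders_raw",            # Kafka topic to write into
    },
}

# Register the connector with the Connect worker.
resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
print(resp.json())
```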
Event Streams:
Data management = Storage + Flow.
Source: Kafka summit 2021
Kafka acts as the central nervous system in the entire enterprise architecture. Apps, SaaS applications, databases, data warehouses, etc., connect to Kafka in a producer or consumer capacity.
ksqlDB: Data in Motion + Data at Rest
ksqlDB enables you to build modern, real-time applications with the same ease of querying as traditional databases. It is an event streaming database that helps you create stream processing applications on top of Apache Kafka.
You can build a streaming app with just a few steps (sketched after this list):
1. Capture events
2. Perform continuous transformations
3. Create materialized views
4. Serve lookups against materialized views
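A hedged sketch of these four steps against ksqlDB's REST endpoint (default port 8088); the stream, table, and topic names are hypothetical.

```python
import requests

KSQL = "http://localhost:8088/ksql"

statements = """
    CREATE STREAM trades (trade_id INT, ccy VARCHAR, amount DOUBLE)
        WITH (KAFKA_TOPIC='trades', VALUE_FORMAT='JSON');          -- 1. capture events
    CREATE STREAM big_trades AS
        SELECT * FROM trades WHERE amount > 1000 EMIT CHANGES;     -- 2. continuous transformation
    CREATE TABLE volume_by_ccy AS
        SELECT ccy, SUM(amount) AS total
        FROM trades GROUP BY ccy EMIT CHANGES;                     -- 3. materialized view
"""
resp = requests.post(KSQL, json={"ksql": statements, "streamsProperties": {}}, timeout=10)
resp.raise_for_status()

# 4. Serve lookups (pull queries) against the materialized view.
query = {"ksql": "SELECT total FROM volume_by_ccy WHERE ccy = 'SGD';"}
print(requests.post("http://localhost:8088/query", json=query, timeout=10).text)
```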
Disaster Recovery for multi-region and multi Data-Centers
If one data center goes down, Kafka should keep working reliably and auto-heal by bringing the next data center into play. While setting this up manually can be difficult, fully managed Kafka services on AWS / Azure / GCP (such as Amazon MSK) are automatically backed up, self-healing if a data center goes down, highly available, and highly reliable. A managed Kafka service in the public cloud lets you migrate and run existing Apache Kafka applications without major changes, and it is fully managed in the sense that all upgrades are taken care of – no operational overhead. To further process data streams, we can also use fully managed Apache Flink applications on many public clouds (e.g., AWS).
Refer: Source
Managed Kafka Services (MSK) - AWS MSK
Refer: AWS Source
Kafka Security
Data governance is the part of security that meets data discovery:
- Data Quality
- Data Catalog
- Security
- Data Lineage
- Self Service platform
- Data Policies
Kafka Monitoring
One way to monitor Kafka is to measure specific metrics. All the metrics published by Kafka are accessible via the Java Management Extensions (JMX) interface. Managed Kafka services in the public cloud provide views of these metrics. There are several metrics sources, such as application metrics, logs, infrastructure metrics, synthetic clients, and client metrics.
Kafka use cases:
1. Machine Learning
Kafka drives cutting-edge machine learning technology. Apache Kafka and machine learning are a great combination for building data and analytics pipelines.
Refer Source.
2. Microservices
Kafka microservices enable event-driven systems using Kafka and ksqlDB.
Refer Source Kafka. Second ref.
The Kafka Streams API enables you to chain a collection of asynchronous services together, connected via events. Using Kafka streaming for an event-based microservices design enables an efficient workflow for the design pattern.
3. Internet of Things
Real-time IoT data solutions with Confluent enable data to pass between IoT and analytics systems. Confluent simplifies IoT event stream processing. Source.
4. Cloud
Kafka is available as a managed service on AWS, Azure, and GCP.
5. Data in Motion
Kafka enables the movement of massive amounts of data in motion, combined with data at rest.
6. Containers – Kubernetes
Confluent for Kubernetes provides a complete, declarative API to build your own private cloud Kafka service. It automates the deployment of Confluent Platform and leverages Kubernetes to enhance the platform's elasticity, ease of operations, and resiliency for enterprises operating at any scale. Mobile applications can also leverage the scalability and performance of Apache Kafka for client-server communication.
Other modern data streaming options include Kafka and SQS on AWS, data streaming solutions and Event Hubs on Azure, and real-time streaming data pipelines and Pub/Sub on GCP.
---------------------------------------------------------------------------------------------------------------
Conclusion
Data and Analytics trends 2021.
Refer: Gartner source.
With the data world upended and modern cloud data platforms in place, better, simpler, faster, smarter, and more efficient data platforms can be built at the organization level. The main idea of this article is to highlight the trends so that organizations of any size can see the opportunities available to build better, exceptional data platforms.
Simplicity is the ultimate sophistication – Leonardo da Vinci.