Learn How to Use ClickHouse Materialized Views to Move Data from Kafka Topics into ClickHouse Tables in Real Time: A Beginner's Guide with Hands-On Labs

In this blog post, we'll explore how to use ClickHouse materialized views to move data efficiently from Kafka topics into aggregate tables. We'll walk through a complete setup, explain the key concepts, and provide hands-on labs and exercises to solidify your understanding.


What is ClickHouse?


ClickHouse is a columnar database management system (DBMS) designed for online analytical processing (OLAP). It is known for its high performance on complex queries, especially with large datasets. ClickHouse allows users to run fast queries on massive volumes of data, making it an excellent choice for data analytics.

Key Features of Materialized Views:

A materialized view in ClickHouse acts like an insert trigger: whenever a block of rows is inserted into its source table, the view's SELECT query runs over that block and writes the result into a target table. Two features make this especially useful for Kafka ingestion:

  • Data Transformation: They allow you to transform data from one format to another as it is ingested.
  • Automatic Updates: Materialized views automatically update when new data arrives, ensuring the target table is always in sync.


ClickHouse Engines: ReplicatedReplacingMergeTree vs. MergeTree

ClickHouse offers several storage engines, with MergeTree and ReplicatedReplacingMergeTree being two of the most commonly used.

  • MergeTree: This is the foundational engine for storing data in ClickHouse. It supports partitioning, primary-key indexing, and efficient queries over large datasets. However, it does not provide built-in replication.
  • ReplicatedReplacingMergeTree: This engine builds on MergeTree in two ways: it deduplicates rows that share the same sorting key during background merges (the "Replacing" part), and it replicates data across multiple nodes for fault tolerance and high availability (the "Replicated" part). This is especially important in distributed environments.

When to Use Each Engine:

  • Use MergeTree when you need high-performance querying on a single node and do not need replication.
  • Use ReplicatedReplacingMergeTree when data must be safely replicated across multiple nodes for resilience against node failures, and when duplicate rows (such as updates re-sent through Kafka) should collapse into the latest version.

In our example, we will use ReplicatedReplacingMergeTree to replicate data across nodes.


HANDS ON LABS



Setting Up the Environment: Docker Compose Configuration

We will set up a ClickHouse cluster and a Kafka environment using Docker. Below is the configuration for your docker-compose.yml file:
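The exact compose file used in the labs is in the GitHub repository linked under Exercise Files; the sketch below is a minimal stand-in, assuming a single Kafka broker with ZooKeeper, the provectuslabs Kafka UI, and two ClickHouse nodes whose cluster, shard, and replica macros are supplied through mounted config files (not shown here):

```yaml
version: "3.8"

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on: [zookeeper]
    ports:
      - "9092:9092"                       # host access for the data-generation script
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: "PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT"
      KAFKA_ADVERTISED_LISTENERS: "PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092"
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"

  kafka-ui:
    image: provectuslabs/kafka-ui:latest
    depends_on: [kafka]
    ports:
      - "8080:8080"                       # browse topics and push test messages here
    environment:
      KAFKA_CLUSTERS_0_NAME: local
      KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka:29092

  clickhouse-01:
    image: clickhouse/clickhouse-server:latest
    ports:
      - "8123:8123"                       # HTTP interface
      - "9000:9000"                       # native TCP interface
    volumes:
      - ./config/clickhouse-01:/etc/clickhouse-server/config.d   # cluster, macros, keeper settings

  clickhouse-02:
    image: clickhouse/clickhouse-server:latest
    volumes:
      - ./config/clickhouse-02:/etc/clickhouse-server/config.d
```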

Once the containers are up, exec into the Kafka container and run the data-generation script below to push sample messages into the Kafka topic.



Data Generation Script

Start a Python shell and use the following code to simulate customer data generation and push it to the Kafka topic:
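A minimal sketch of such a script is shown below. It assumes the kafka-python package, a broker reachable at localhost:9092, and a topic named customers whose fields (customer_id, name, state, created_at) match the tables created in the next section; adjust these to your setup.

```python
# pip install kafka-python
import json
import random
import time

from kafka import KafkaProducer

TOPIC = "customers"  # assumed topic name; must match the Kafka engine table below

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",              # broker address exposed by docker-compose
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

STATES = ["NY", "CA", "TX", "FL", "WA", "IL", "NJ"]
NAMES = ["Alice", "Bob", "Carol", "David", "Eve"]

def make_customer(customer_id: int) -> dict:
    """Build one fake customer record matching the ClickHouse table schema."""
    return {
        "customer_id": customer_id,
        "name": random.choice(NAMES),
        "state": random.choice(STATES),
        "created_at": int(time.time()),
    }

for customer_id in range(1, 101):                    # push 100 sample messages
    record = make_customer(customer_id)
    producer.send(TOPIC, record)
    print("sent:", record)
    time.sleep(0.1)

producer.flush()                                     # make sure everything is delivered
```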

Creating ClickHouse Tables and Materialized Views

Step 1: Create Databases

In ClickHouse, create separate databases for Kafka-related data and aggregated data:
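For example (kafka and shard are the database names assumed throughout this post, and cluster_1 is a placeholder for the cluster name defined in your ClickHouse configuration):

```sql
-- Run on any node; ON CLUSTER propagates the statement to every node in the cluster.
CREATE DATABASE IF NOT EXISTS kafka ON CLUSTER 'cluster_1';
CREATE DATABASE IF NOT EXISTS shard ON CLUSTER 'cluster_1';
```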


Step 2: Create Kafka Consumer Table

Create a Kafka consumer table to capture raw data from the Kafka topic:
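A sketch of that table is below; the column list, topic name, broker address, consumer group, and JSONEachRow format are assumptions that must match the data-generation script above:

```sql
CREATE TABLE kafka.kafka_customers ON CLUSTER 'cluster_1'
(
    customer_id UInt32,
    name        String,
    state       String,
    created_at  UInt64
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka:29092',                    -- broker address inside the Docker network
    kafka_topic_list  = 'customers',                      -- topic written to by the script
    kafka_group_name  = 'clickhouse_customers_consumer',  -- consumer group for offset tracking
    kafka_format      = 'JSONEachRow';                    -- one JSON object per message
```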


Step 3: Create Parsed Customers Table

Switch to the shard database and create a table to store parsed customer data:
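A sketch of the table is below. The ZooKeeper path and the {shard}/{replica} macros are assumptions tied to your cluster configuration; created_at is used as the version column, so when duplicate rows with the same customer_id are merged, the newest version wins:

```sql
CREATE TABLE shard.customers ON CLUSTER 'cluster_1'
(
    customer_id UInt32,
    name        String,
    state       String,
    created_at  UInt64
)
ENGINE = ReplicatedReplacingMergeTree(
    '/clickhouse/tables/{shard}/customers',  -- ZooKeeper/Keeper path (assumed layout)
    '{replica}',                             -- replica name macro
    created_at                               -- version column: latest row survives merges
)
ORDER BY customer_id;                        -- rows sharing this key are deduplicated
```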


Step 4: Create Materialized View to Transform Data

Create a materialized view that moves data from the Kafka consumer table to the customers table:
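A sketch of the view, using the table names from the previous steps (the view name itself is illustrative):

```sql
-- The materialized view acts as the consumer loop: every block read from the
-- Kafka engine table is selected and written into shard.customers.
CREATE MATERIALIZED VIEW kafka.customers_mv ON CLUSTER 'cluster_1'
TO shard.customers
AS
SELECT
    customer_id,
    name,
    state,
    created_at
FROM kafka.kafka_customers;
```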


Query the data
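For example, a quick sanity check (FINAL collapses duplicate versions kept by ReplacingMergeTree so you see one row per customer):

```sql
SELECT *
FROM shard.customers FINAL
ORDER BY customer_id
LIMIT 10;
```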


Output

Push an Update Through the Kafka UI

Query ClickHouse to See the Updates
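For example, assuming you re-sent a message for customer_id = 1 with changed fields through the Kafka UI, the updated version should come back:

```sql
-- FINAL merges the old and new versions at query time; after background merges
-- only the newest row (highest created_at) remains on disk anyway.
SELECT *
FROM shard.customers FINAL
WHERE customer_id = 1;
```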


Step 5: Create Aggregated Data Table

Aggregates in ClickHouse are designed for high-performance analytics, particularly when working with large datasets. They enable efficient data summarization and analysis, which is critical for reporting and business intelligence applications. Here's a detailed look at how aggregates work in ClickHouse, particularly with the AggregatingMergeTree engine.

How Aggregates Work in ClickHouse

  1. Creating an Aggregating Table: When you create an AggregatingMergeTree table, you define the structure of your data and specify aggregate functions. In our example, we create a table for counting customers by state:
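A sketch of that table is below (use the Replicated variant of the engine if the aggregates should be replicated across nodes like the other tables):

```sql
CREATE TABLE shard.customer_count_by_state
(
    state                LowCardinality(String),
    customer_count_state AggregateFunction(count, UInt32)  -- stores a partial count state, not a plain number
)
ENGINE = AggregatingMergeTree()
ORDER BY state;
```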

Here, state is defined as a LowCardinality(String), which is memory-efficient for columns with a limited number of unique values. The customer_count_state column uses AggregateFunction(count, UInt32), which stores a partial aggregation state rather than a final number, so counts can be merged efficiently at query time.

  2. Defining a Materialized View: Next, we create a materialized view that continuously updates the customer_count_by_state table with new data from the customers table. This ensures that any incoming data is automatically aggregated and stored:
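A sketch of the view (its name is illustrative); countState(customer_id) produces the partial aggregation state that matches the AggregateFunction(count, UInt32) column:

```sql
CREATE MATERIALIZED VIEW shard.customer_count_by_state_mv
TO shard.customer_count_by_state
AS
SELECT
    state,
    countState(customer_id) AS customer_count_state  -- partial count per state for each insert block
FROM shard.customers
GROUP BY state;
```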


This materialized view captures any new data added to the customers table and performs the aggregation on-the-fly, inserting the results into the customer_count_by_state table.

  3. Inserting Historical Data: For historical data, we need to perform an initial insertion into the aggregating table. This ensures that existing records in the customers table are accounted for in our aggregated counts:
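For example, a one-off backfill of the rows that were already in the customers table before the view existed:

```sql
INSERT INTO shard.customer_count_by_state
SELECT
    state,
    countState(customer_id) AS customer_count_state
FROM shard.customers
GROUP BY state;
```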


This step aggregates the existing data, allowing us to establish a baseline for the counts by state.

  4. Querying Aggregated Data: To retrieve the aggregated customer counts, we can query the customer_count_by_state table using the countMerge() function:
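For example:

```sql
-- countMerge combines the stored partial states into the final count per state.
SELECT
    state,
    countMerge(customer_count_state) AS customer_count
FROM shard.customer_count_by_state
GROUP BY state
ORDER BY customer_count DESC;
```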


Continuous Aggregation Moving Forward

After the initial setup, any new customer records added to the customers table will automatically trigger the materialized view, updating the customer_count_by_state table without requiring manual inserts. This architecture allows for seamless integration of historical data with ongoing data inflow, making the aggregation process both efficient and real-time.

By utilizing the power of AggregatingMergeTree and materialized views, ClickHouse enables fast and scalable data analytics, ensuring you can derive insights from your data as it evolves.


Exercise Files

https://github.com/soumilshah1995/ClickHouse-Kafka-Integration/tree/main

Conclusion

In this blog, we've explored the powerful capabilities of ClickHouse for aggregating data efficiently, particularly through the use of materialized views and the AggregatingMergeTree engine. We learned how to set up a robust architecture for moving data from Kafka topics into aggregate tables, enabling real-time analytics that can scale with your data needs.

By defining an aggregating table and creating a materialized view, we not only captured historical data but also ensured that new incoming records were seamlessly integrated into our aggregates. This approach not only improves query performance but also simplifies the data processing pipeline, allowing for faster insights and better decision-making.

As data volumes continue to grow, leveraging ClickHouse's unique features can provide significant advantages in terms of speed, efficiency, and scalability. Whether you are building applications that require real-time analytics or managing large datasets for reporting, ClickHouse offers a powerful solution to meet your needs.

With the hands-on labs and exercises provided, you're now equipped to implement this architecture in your own projects. Embrace the potential of ClickHouse and discover how it can transform your data analytics journey. Happy querying!
