Developing a Real-Time Data Warehouse

Many data engineers coming from traditional batch processing frameworks have questions about real-time data processing systems, such as:

“What kind of data model did you implement for real-time processing?”

“Trying to figure out how people build a real-time data warehouse solution.”

This post is a review of what real-time data processing means and how it is used in most companies.

What is real-time data processing?

First, we need to define what real-time data processing is. When someone says real-time, we tend to think microseconds, but that is not always the case. There are different levels of real-time systems:

1. Microsecond systems (e.g., spaceships that require custom hardware)

2. Millisecond systems (e.g., real-time auctions)

3. Second or minute systems (e.g., real-time data processing pipelines)

We are going to be talking about the real-time data processing pipeline, and for convenience let’s say the time taken to process the data is at most 5 minutes.

1. Decide what needs to be real-time

The first step is deciding what data needs to be made available in real time. This is usually driven by business requirements. Most real-time data consumed by analysts or data scientists is fine with a delay of about one hour. When the consumer of your real-time data is another service or application, a lower delay is ideal, since the data may be consumed automatically (e.g., for anomaly detection). In most companies, the data processed in real time is a very small subset of the incoming data. A trade-off between business requirements and software development costs needs to be made here, and it will depend on the individual business requirement.

2. Modeling Technique

Most companies populate their data warehouses using a combination of real-time and batch processing; this is called lambda architecture. The consumer of the data will need to know when and how to use the real-time data versus the batch data. When processing data in real time, most processing happens at the row level; if you want to compute aggregates (e.g., the number of clicks in the last minute), you will need to use windowing (also called mini-batching), as sketched below.
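To make windowing concrete, here is a minimal pure-Python sketch of a one-minute tumbling window; the event shape (timestamp, user_id) and all names here are illustrative assumptions, not tied to any framework:

from collections import defaultdict

def window_start(ts):
    # Align a timestamp to the start of its one-minute tumbling window
    return ts.replace(second=0, microsecond=0)

def count_clicks(events):
    # events: iterable of (timestamp, user_id) pairs (hypothetical shape)
    # Returns click counts keyed by (window start, user)
    counts = defaultdict(int)
    for ts, user_id in events:
        counts[(window_start(ts), user_id)] += 1
    return counts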

3. Always write time-modular data processors

This concept applies not only to real-time data processing but to all data processing systems. Write your data processing logic in such a way that it can scale to any arbitrary time frame. This is important because you may be doing the same processing over a large batch in batch processing and over a very small (mini) batch in your real-time processing. In this scenario, real-time processing will give you the most recent data but may miss late-arriving or out-of-order events; the batch process can be designed to catch them (i.e., handle late-arriving events).

For example, if you are writing a SQL query to aggregate the errors in your log data:

# DO NOT DO THIS

def _get_data_proc_query():
    # Hard-coded window: can only ever process "the last 5 minutes as of now"
    return ("select log_id, count(errors) as err_count from log_data "
            "where datetime_field > now() - interval '5 minutes' "
            "group by log_id")

# DO THIS

def _get_data_proc_query(start_time, end_time):
    # Parameterized window: works for a 5-minute mini-batch or a 1-day backfill
    return (f"select log_id, count(errors) as err_count from log_data "
            f"where datetime_field >= '{start_time}' and datetime_field < '{end_time}' "
            f"group by log_id")

Note that, in the above, the recommended function can:

1. Be executed over any input time interval, unlike the non-recommended one, which is not modular on time.

2. Prevent late-arriving messages from corrupting the error count of another time frame.
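As a usage sketch, the same query builder can then drive both the real-time and the batch path; the window sizes here are illustrative assumptions:

from datetime import datetime, timedelta

now = datetime.utcnow()

# Real-time path: the most recent 5-minute mini-batch
rt_query = _get_data_proc_query(now - timedelta(minutes=5), now)

# Batch path: reprocess a full day to catch late-arriving events
batch_query = _get_data_proc_query(now - timedelta(days=1), now)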

4. Tools

There are many open-source stream processing tools that do real-time data processing; deciding which one to use depends on your current and future business requirements. Given below are some of the most popular options:

Apache Spark: One of the most popular data processing frameworks, if not the most popular one. It has a huge community and is very actively developed. Apache Spark started as a batch processing framework and later incorporated streaming by running batches on a much smaller scale (mini-batches); more recently it has also released a pure streaming implementation. A windowed count looks like the sketch below.
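For illustration, a minimal PySpark Structured Streaming sketch of a windowed count; the built-in rate source stands in for a real event stream such as Kafka, and the app name and options are assumptions for this sketch:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("click_counts").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows; a real
# pipeline would read from Kafka or a similar log
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Count events per one-minute tumbling window (mini-batching)
counts = events.groupBy(window(events.timestamp, "1 minute")).count()

# Print each updated window count to the console
(counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
    .awaitTermination())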

Apache Flink: This was designed as a streaming-first framework and as such provides advanced stream-based processing mechanisms; it is usually faster than Spark when performing repetitive stream-based processing (e.g., ML optimizations).

Akka: This is a library available in Java and Scala, based on the actor model. It is generally used when the requirements are more domain-specific than what the Apache Spark or Apache Flink data processing model can express, for example, streaming in data and, based on very complex domain requirements, using the incoming data to build a knowledge graph.
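Since Akka itself is JVM-only, here is a toy actor-style sketch in plain Python (not Akka) to illustrate the model: each actor owns a mailbox and processes one message at a time. All names here are hypothetical:

import queue
import threading

class GraphBuilderActor:
    # Toy actor: a mailbox plus a single worker thread draining it
    def __init__(self):
        self.mailbox = queue.Queue()
        self.edges = set()
        threading.Thread(target=self._run, daemon=True).start()

    def tell(self, message):
        # Fire-and-forget send, as in Akka's tell()
        self.mailbox.put(message)

    def _run(self):
        while True:
            src, dst = self.mailbox.get()
            # Domain-specific logic would go here; we just record a graph edge
            self.edges.add((src, dst))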

Conclusion

Hope this gives you a good overview of real-time data processing and an idea of how to get started with it.
