Managing Design Trade Offs!

Parijat Bose

Data | Cloud | GenAI

Published: February 12, 2024

Problem statement:

Design a data warehousing job that loads the execution-date partition of a target table X. For execution date n, the load depends on the nth partition of table Y and the (n+2)th partition of table Z.
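As a concrete reading of this dependency, here is a minimal sketch that lists the partitions a run needs and the earliest date on which it could start. It assumes daily partitions keyed by calendar date, and that a partition dated d cannot exist before day d. The table names Y and Z come from the problem statement; the function names are illustrative only.

```python
from datetime import date, timedelta
from typing import Dict

# Offset in days of each upstream partition relative to execution date n.
# Daily partitioning is an assumption, not something the problem states.
UPSTREAM_OFFSETS: Dict[str, int] = {"Y": 0, "Z": 2}


def required_partitions(execution_date: date) -> Dict[str, date]:
    """Return the partition date needed from each upstream table."""
    return {
        table: execution_date + timedelta(days=offset)
        for table, offset in UPSTREAM_OFFSETS.items()
    }


def earliest_start(execution_date: date) -> date:
    """A run cannot start before the date of its latest required partition."""
    return max(required_partitions(execution_date).values())
```

For execution date 2024-02-12, the run needs Y for 2024-02-12 and Z for 2024-02-14, so under these assumptions it cannot start before 2024-02-14.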

Complexity:

Orchestrating a job that depends on execution dates n and n+2 is challenging because partition n+2 is future data: it does not yet exist when the run for execution date n would normally start. Here are a few design trade-offs to consider:

Tradeoffs:

Latency vs Consistency: If the job waits for the (n+2)th partition of table Z to be ready, the load is consistent, because it always reads the partition it is defined against. The cost is latency: the job cannot start until that partition is ready. If the job instead runs as soon as the nth partition of table Y is ready, latency drops, but the load may be built on missing or stale data from Z.
Complexity vs Performance: A more elaborate system for handling these dependencies can improve performance, for example by processing the nth and (n+2)th partitions in parallel. However, the added complexity makes the system harder to maintain and more prone to errors.
Resource Utilization vs Timeliness: If the job starts processing the nth partition while waiting for the (n+2)th partition, resources are used more efficiently, but the job may still finish late, especially if the (n+2)th partition takes longer than expected to become ready.
Real-Time vs Batch Processing: Real-time processing could provide more up-to-date data but may add a lot of performance overhead and need more resources. Batch processing, on the other hand, might be less resource-intensive but result in less fresh data.
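The latency versus consistency choice can be written down as an explicit policy. The sketch below is one hypothetical way to do it, not a prescribed design: wait for the (n+2)th partition of Z up to a grace period, then fall back to the newest older partition and mark the run as stale. The `Decision` type, the action names, and the `max_wait_days` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Set


@dataclass(frozen=True)
class Decision:
    action: str                   # "run", "wait", or "run_stale"
    z_partition: Optional[date]   # partition of Z to read, if any


def decide(execution_date: date, z_available: Set[date], today: date,
           max_wait_days: int) -> Decision:
    """Pick an action for the run of `execution_date` (hypothetical policy)."""
    target = execution_date + timedelta(days=2)
    if target in z_available:
        # Consistent: the exact (n+2)th partition is present.
        return Decision("run", target)
    if today <= target + timedelta(days=max_wait_days):
        # Still inside the grace period: accept more latency.
        return Decision("wait", None)
    # Grace period over: trade consistency for latency, if anything older exists.
    older = [d for d in z_available if d < target]
    if older:
        return Decision("run_stale", max(older))
    return Decision("wait", None)
```

A larger `max_wait_days` favours consistency, while a value of 0 favours latency. Whether a stale run is acceptable at all depends on how table X is consumed downstream.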

Strategies:

Dependency Management: Use a job scheduling or orchestration tool that can handle dependencies, such as Apache Airflow or Luigi. This can help ensure that the job is executed only when all the necessary data is ready.
Buffering: Implement a buffer mechanism that holds two days of data for table Z. In practice this means the load for execution date n runs with a two-day lag, so that by the time it processes the nth partition of table Y, the (n+2)th partition of table Z is already available.
Data Versioning: Keep versions of your data in your warehouse so that you can always fetch the correct version depending on the job execution date.
Error Handling and Retry Mechanisms: Implement robust error handling and retry mechanisms to handle any failures during the job execution.
Monitoring and Alerting: Monitor the system and set up alerts for any delays or failures in the readiness of the data partitions.
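The dependency check and retry strategies above can be combined in a small gate that runs before the load. The sketch below is plain Python and not tied to any orchestrator; in a tool such as Apache Airflow, sensor tasks typically play this role. The `partition_exists` callable is a hypothetical hook that you would back with a metastore or file-system lookup, and the retry settings are placeholders.

```python
import time
from datetime import date, timedelta
from typing import Callable, List, Tuple

# Hypothetical hook: returns True if the given table partition is ready.
PartitionCheck = Callable[[str, date], bool]


def missing_partitions(execution_date: date,
                       partition_exists: PartitionCheck) -> List[Tuple[str, date]]:
    """List the upstream partitions that are not ready yet."""
    required = [
        ("Y", execution_date),                      # nth partition of Y
        ("Z", execution_date + timedelta(days=2)),  # (n+2)th partition of Z
    ]
    return [(t, d) for t, d in required if not partition_exists(t, d)]


def wait_until_ready(execution_date: date,
                     partition_exists: PartitionCheck,
                     max_attempts: int = 5,
                     delay_seconds: float = 60.0,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll until all partitions are ready; return the number of attempts used.

    Raises TimeoutError naming the missing partitions, which a caller can
    turn into an alert.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        missing = missing_partitions(execution_date, partition_exists)
        if not missing:
            return attempt
        if attempt < max_attempts:
            sleep(delay_seconds)
    raise TimeoutError(
        f"partitions not ready for {execution_date}: {missing}"
    )
```

Passing `sleep` as a parameter keeps the gate easy to test without real delays. The load itself would run only after `wait_until_ready` returns.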
