Any design is a trade-off

Bipin Patwardhan

Solution Architect, Solution Creator, Cloud, Big Data, TOGAF 9

Published: February 3, 2025

In any field, software or otherwise, every design is a trade-off. No single design can be the 'be-all and end-all'.

A few years ago, I was part of a team that developed a set of data engineering tools. The intent was to build modular components that would speed up the deployment of common tasks like ingestion, data validation and machine learning.

My team and I had three modules - data quality checks, data profiling and data reconciliation. We implemented these modules using PySpark. We also developed the modules as micro-services.

For a customer implementation, the team came back to me stating that the pipelines being orchestrated from Airflow were timing out. We dug into the details and realized that the timeouts were caused by a slow network, and that the problem was related to one of our design decisions.

As we were using a metadata-based approach, we decided to use MongoDB to store the application metadata. For example, for the reconciliation module, we stored source and target information: data store type, table name, path to file (if the data store is a file) and more. We also stored execution information - things like date of execution, start time, end time, result of execution and more.
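
To make the metadata concrete, here is a minimal sketch of how such records could be modelled in Python. The class and field names (`DataStoreRef`, `ReconciliationMetadata`, `ExecutionRecord`, `store_type` and so on) are my assumptions for illustration; they are not the schema we actually used.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional


@dataclass
class DataStoreRef:
    """Describes one side (source or target) of a reconciliation. Hypothetical fields."""
    store_type: str                   # e.g. "table" or "file"
    table_name: Optional[str] = None  # set when the store is a table
    file_path: Optional[str] = None   # set when the store is a file


@dataclass
class ReconciliationMetadata:
    """Source and target information for the reconciliation module."""
    source: DataStoreRef
    target: DataStoreRef


@dataclass
class ExecutionRecord:
    """Execution information stored after each run."""
    module: str
    start_time: datetime
    end_time: datetime
    result: str

    @property
    def duration_seconds(self) -> float:
        # Elapsed time of the run, derived from start and end time.
        return (self.end_time - self.start_time).total_seconds()


meta = ReconciliationMetadata(
    source=DataStoreRef(store_type="file", file_path="/data/in/orders.csv"),
    target=DataStoreRef(store_type="table", table_name="orders"),
)
# asdict() turns the nested dataclasses into plain dictionaries,
# which is the shape a document database would store.
document = asdict(meta)
```

The dictionary produced by `asdict` is what a component would hand to the metadata store, whichever database sits behind it.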

As mentioned, we had three components - data quality, data profiling and data reconciliation. Each was developed as a micro-service. Each micro-service had to interact with the MongoDB database to fetch metadata and to store execution information.

This is pretty straightforward. Where is the design element? As we have three services that interact with the metadata database, the relevant question is: how does each component interact with the database?

For this seemingly simple ask, we can have two approaches.

The first approach is to have each component interact with the database directly. The second approach is to have an intermediate component handle the interaction with the database and have each component connect to that intermediate component. In other words, in the second approach, we define another micro-service to handle the interactions with the database.

While the first approach is simple, the drawback is that each component is directly coupled with the backend database. If we decide to change the database from MongoDB to, say, PostgreSQL, then each component has to be updated to work with the new database.

In the second approach, the connection to the database is handled by a dedicated micro-service. If we change the database, we can replace the MongoDB micro-service implementation with a PostgreSQL micro-service implementation, while ensuring that the interface remains the same. If the interface is not changed, the three components will not be aware of the database change.
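
The 'same interface, different implementation' idea can be sketched in a few lines of Python. This is only an illustration: `MetadataStore`, `DocumentBackedStore`, `RowBackedStore` and `run_reconciliation` are hypothetical names, the two stores are in-memory stand-ins rather than real MongoDB or PostgreSQL clients, and the REST layer that a real micro-service would put in front of them is left out.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class MetadataStore(ABC):
    """Hypothetical interface that every component codes against."""

    @abstractmethod
    def get_metadata(self, module: str) -> Dict[str, str]:
        """Return the metadata for the given module."""

    @abstractmethod
    def save_execution(self, module: str, result: str) -> None:
        """Record the result of one execution."""

    @abstractmethod
    def executions(self, module: str) -> List[str]:
        """Return the recorded results for the given module."""


class DocumentBackedStore(MetadataStore):
    """In-memory stand-in for a document database such as MongoDB."""

    def __init__(self, metadata: Dict[str, Dict[str, str]]) -> None:
        self._metadata = metadata
        self._runs: List[Dict[str, str]] = []

    def get_metadata(self, module: str) -> Dict[str, str]:
        return dict(self._metadata[module])

    def save_execution(self, module: str, result: str) -> None:
        self._runs.append({"module": module, "result": result})

    def executions(self, module: str) -> List[str]:
        return [run["result"] for run in self._runs if run["module"] == module]


class RowBackedStore(MetadataStore):
    """In-memory stand-in for a relational database such as PostgreSQL."""

    def __init__(self, rows: List[Tuple[str, str, str]]) -> None:
        self._rows = list(rows)  # each row is (module, key, value)
        self._runs: List[Tuple[str, str]] = []

    def get_metadata(self, module: str) -> Dict[str, str]:
        return {key: value for mod, key, value in self._rows if mod == module}

    def save_execution(self, module: str, result: str) -> None:
        self._runs.append((module, result))

    def executions(self, module: str) -> List[str]:
        return [result for mod, result in self._runs if mod == module]


def run_reconciliation(store: MetadataStore) -> str:
    """A component that only knows the interface, not the database behind it."""
    meta = store.get_metadata("reconciliation")
    result = f"reconciled {meta['source']} -> {meta['target']}"
    store.save_execution("reconciliation", result)
    return result


doc_store = DocumentBackedStore(
    {"reconciliation": {"source": "orders.csv", "target": "orders"}}
)
row_store = RowBackedStore(
    [("reconciliation", "source", "orders.csv"),
     ("reconciliation", "target", "orders")]
)
# The component code is identical for both backends.
same_result = run_reconciliation(doc_store) == run_reconciliation(row_store)
```

The component function never changes when the store is swapped, which is exactly the property the dedicated micro-service gives us across the network.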

But now we have one more micro-service in the picture. Even if this micro-service is deployed in the same data center as the component micro-services, micro-services communicate over the network (we were using REST APIs), so every call adds a small communication overhead as compared to the execution of a function. If the network is not performant, this communication becomes the bottleneck. Additionally, each network call goes through authentication and authorization, which adds to the time taken for execution.
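
A back-of-the-envelope calculation shows how quickly this overhead adds up. The sketch below assumes the calls are made one after another and that each call pays one round trip plus one authentication check; the function name and all the numbers are made up for illustration and are not measurements from our project.

```python
def added_overhead_seconds(calls: int, round_trip_ms: float, auth_ms: float) -> float:
    """Extra time spent on the network for `calls` sequential service calls."""
    return calls * (round_trip_ms + auth_ms) / 1000.0


# Made-up figures for illustration only; not measurements from the project.
fast_network = added_overhead_seconds(calls=200, round_trip_ms=2, auth_ms=3)
slow_network = added_overhead_seconds(calls=200, round_trip_ms=150, auth_ms=50)

print(f"fast network: {fast_network:.1f} s, slow network: {slow_network:.1f} s")
```

The per-call cost is small on a fast network, but it is multiplied by every metadata fetch and every execution record written, so a slow network can push a pipeline past its timeout.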

This example shows that we have a dilemma. On the one hand, we have components that are coupled to the database and in which the database access code is duplicated. On the other hand, we have components that are flexible, but that spend some additional time on every database interaction.

As for code duplication, you will correctly suggest that we should implement the common code as shared functions / methods. While this is one approach, it does not take away the fact that we no longer have 'seamless replacement' of components. Each time we change the database code, we have to update, re-test and re-deploy each component.

If you are not convinced, you can watch a few YouTube videos about aircraft from the Second World War (WW2), of all things. The design of each aircraft was all about trade-offs. The Supermarine Spitfire - one of the well-known aircraft of WW2 - could not accommodate cannons for a long time because of its wing. While its wing design was superb, the wing was much thinner than that of the Hawker Hurricane. Hence it was easier to fit cannons in the Hurricane than in the Spitfire.

#design #trade_off #micro_service #microservice
