Helpful Extract & Load Practices for High-Quality Raw Data: #2/5
ELT is becoming the default choice for data architectures, and yet many best practices focus primarily on the “T”: the transformations.
But the extract and load phase is where data quality is determined for transformation and beyond. As the saying goes, “garbage in, garbage out.”
Robust EL pipelines provide the foundation for delivering accurate, timely, and error-free data.
Here are the top practices used and loved by industry experts that will drive up quality for all your data sets, no matter what tool you use.
Setting the Stage: We need E&L practices because “copying raw data” is more complex than it sounds.
The idea of ELT as a pattern sounds easy: “just copy over the source data first, then run your transformations over the raw data inside your own space.” However, both “copying” and “raw data” come with hidden obstacles.
“Copying” sounds easy. But source data changes, and unless you know what changed, “copying” turns out to be more complicated than you think. Imagine a production table of 150 million “orders” that comes with a “timestamp” but not with a “modified” date. And yes, these exist, all over the place. How do you know that orders got modified, and if so, which ones? For instance, how would you know which orders got “canceled”, an operation that usually takes place in the same data record and just “modifies” it in place?
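One way to spot such in-place modifications is to compare a content hash of each row against the hash recorded on the previous run. The sketch below is a minimal illustration of that idea; the fetch function, the `order_id` key, and the way hashes are persisted between runs are all assumptions, not a prescribed implementation.

```python
# Minimal sketch: detect rows that changed in place (e.g. cancellations)
# when the source table has no "modified" column, by comparing row hashes
# against the hashes stored from the previous extraction run.
import hashlib
import json


def row_hash(row: dict) -> str:
    """Stable hash of a row's content, independent of key order."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_changes(source_rows, previous_hashes: dict) -> list:
    """Return rows that are new or were modified since the last run.

    previous_hashes maps order_id -> row hash from the previous extraction
    (hypothetical key name; persisted however your pipeline stores state).
    """
    changed = []
    for row in source_rows:
        key = str(row["order_id"])
        current = row_hash(row)
        if previous_hashes.get(key) != current:
            changed.append(row)  # new row, or silently modified in place
        previous_hashes[key] = current
    return changed
```

The trade-off is that you still have to read every row from the source to compute the hashes; it only saves you from re-loading unchanged rows downstream.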
“Raw data” sounds clear. And yet extracting & loading usually means copying between two different technical systems A and B, where you need to adjust the data to match system B. You ingest from REST APIs into Snowflake, or from an Oracle database into Redshift. Every time you change systems, you need to modify the “raw data” to adhere to the rules of the new system: you need to do type casting, decide whether you want to “flatten JSONs”, and decide whether you want to add additional metadata to your data.
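To make that concrete, here is a small sketch of the kind of adjustments a row might need before landing in the target system: flattening nested JSON, casting a type, and attaching load metadata. The field names (`order_amount`, `_source`, `_loaded_at`) are illustrative assumptions, not a fixed schema.

```python
# Minimal sketch: adapt a "raw" source record to a target system by
# flattening nested JSON, casting types, and adding ingestion metadata.
from datetime import datetime, timezone


def flatten(record: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Flatten nested dicts: {"customer": {"id": 1}} -> {"customer_id": 1}."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items


def prepare_for_load(record: dict, source_name: str) -> dict:
    flat = flatten(record)
    # Cast types the target expects (e.g. string amounts to float).
    if "order_amount" in flat:
        flat["order_amount"] = float(flat["order_amount"])
    # Attach metadata so every loaded row is traceable to its source and run.
    flat["_source"] = source_name
    flat["_loaded_at"] = datetime.now(timezone.utc).isoformat()
    return flat
```

Even if your ingestion tool handles this for you, it is making these decisions somewhere, and it pays to know which ones.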
Simply “copying raw data” will bring up new questions every time you add a new data source or data target to your list, even if it is just a new table from the same production database you’ve always been copying from.
These practices will be your guide whenever you take on a new source for ingesting data into your data system.
Tip #2: Deduplicate data at a level beyond the raw level.
There are usually three cases of duplicate data hitting your data systems that you will want to “deduplicate”. But no matter the case, don’t do it at the raw/landing level!
The first case is intentional duplicate data, where a source system contains something you or your end-users consider to be duplicates. For instance, your CRM system might have two entries for a customer who canceled and signed up again. Deduplicating at the raw level means either merging the two or deleting one, both of which remove data that is present in the source system.
The second case is unintentional duplicate data, where the source system either deleted a record you still have in your data warehouse, or unintentionally produces duplicate data it will likely delete in the future. Even though this is an “error”, I don’t recommend deleting this data in your raw ingestion area; rather, filter it out further down the line, for instance in the next stage of your modelling. Otherwise you end up adding logic to your ingestion that is hard to follow later.
The third case is duplication due to technical restrictions. Your ingestion tooling might prefer an “at least once delivery” strategy, or there might even be a bug in an ingestion process. With “at least once delivery” incremental load strategies, you are guaranteed to get every data row, but you might get some of them more than once. Again, we recommend keeping the duplicate data at the raw level and filtering it out at a later stage.
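A minimal sketch of what that downstream filtering can look like: the raw rows stay exactly as extracted, and the staging step keeps only the latest version of each business key based on the load timestamp. The key and timestamp field names (`order_id`, `_loaded_at`) are assumptions for illustration.

```python
# Minimal sketch: deduplicate in a staging step, not in the raw layer.
# Raw keeps every extracted row (including at-least-once redeliveries);
# staging keeps one row per business key, preferring the latest load.
def deduplicate_staging(raw_rows: list, key: str = "order_id",
                        loaded_at: str = "_loaded_at") -> list:
    """Return one row per business key, keeping the most recently loaded one."""
    latest = {}
    for row in raw_rows:
        k = row[key]
        if k not in latest or row[loaded_at] > latest[k][loaded_at]:
            latest[k] = row
    return list(latest.values())


# Usage: raw stays untouched; only the staging output is deduplicated.
raw = [
    {"order_id": 1, "status": "open",     "_loaded_at": "2024-01-01T00:00:00Z"},
    {"order_id": 1, "status": "canceled", "_loaded_at": "2024-01-02T00:00:00Z"},  # redelivered row
]
staging = deduplicate_staging(raw)  # -> only the "canceled" version of order 1
```

Because the raw layer is left intact, you can always change the deduplication rule later and rebuild staging, which is exactly the flexibility you lose if you deduplicate during ingestion.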