Helpful Extract & Load Practices for High-Quality Raw Data : #1/5

ELT is becoming the default choice for data architectures, and yet many best practices focus primarily on the “T”: the transformations.

But the extract and load phase is where data quality is determined for transformation and beyond. As the saying goes, “garbage in, garbage out.”

Robust EL pipelines provide the foundation for delivering accurate, timely, and error-free data.

Here are top practices used and loved by industry experts that will drive up quality for all your data sets, no matter what tool you use.

Setting the Stage: We need E&L practices because “copying raw data” is more complex than it sounds.

The idea of ELT as a pattern sounds easy: “just copy over the source data first and then run your transformations over the raw data inside your own space.” However, both “copying” and “raw data” come with hidden obstacles.

“Copying” sounds easy. But source data changes, and unless you know what changed, copying turns out to be more complicated than you think. Imagine a production table of 150 million “orders” that comes with a “timestamp” but no “modified” date. And yes, these exist all over the place. How do you know that orders got modified, and if so, which ones? For instance, how would you know which orders got “canceled”, an operation that usually takes place in the same data record and just modifies it in place?
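One common workaround when the source gives you no reliable “modified” column is to fingerprint each row and compare it against the fingerprint stored during the previous run. Here is a minimal Python sketch of that idea; the row shape and the order_id primary key are assumptions for illustration, not a prescription:

```python
import hashlib
import json

def row_fingerprint(row: dict) -> str:
    """Build a stable hash over all columns of a row."""
    # Sort keys so the same content always produces the same hash.
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def detect_changed_rows(source_rows: list[dict], previous_hashes: dict[str, str]) -> list[dict]:
    """Return rows that are new or whose content changed since the last run.

    previous_hashes maps the primary key (assumed here: "order_id") to the
    fingerprint stored during the previous EL run.
    """
    changed = []
    for row in source_rows:
        key = str(row["order_id"])            # assumed primary-key column
        if previous_hashes.get(key) != row_fingerprint(row):
            changed.append(row)               # new order, or modified in place (e.g. canceled)
    return changed

# Example: the second order was canceled in place since the last run.
previous = {"1": row_fingerprint({"order_id": 1, "status": "shipped"}),
            "2": row_fingerprint({"order_id": 2, "status": "open"})}
current = [{"order_id": 1, "status": "shipped"},
           {"order_id": 2, "status": "canceled"}]
print(detect_changed_rows(current, previous))   # -> only order 2
```

This catches in-place changes such as cancellations that a creation timestamp alone would never reveal, at the cost of reading the full source table (or a bounded window of it) on every run.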

“Raw data” sounds clear. And yet extracting & loading implicitly means that you usually copy between two different technical systems A and B, where you need to adjust the data to match system B. You ingest from REST APIs into Snowflake, or from an Oracle database into Redshift. Every time you change systems, you will need to modify the “raw data” to adhere to the rules of the new system: you need to do type casting, decide whether you want to flatten JSONs, and decide whether you want to add additional metadata to your data.
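To make that “adjusting” concrete, here is a small, hypothetical Python sketch of the kind of shaping that typically happens between a REST API and a warehouse: flattening nested JSON and casting string fields to proper types. The field names (amount, created_at, customer) are purely illustrative:

```python
from datetime import datetime, timezone

def flatten(record: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Flatten nested JSON, e.g. {"customer": {"city": "Berlin"}} -> {"customer_city": "Berlin"}."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

def to_warehouse_row(api_record: dict) -> dict:
    """Adapt one raw API record to the rules of the target system."""
    row = flatten(api_record)
    # Explicit type casting: many APIs return numbers and timestamps as strings.
    if "amount" in row:
        row["amount"] = float(row["amount"])
    if "created_at" in row:
        # Assumption: the API sends naive ISO timestamps that are meant as UTC.
        row["created_at"] = datetime.fromisoformat(row["created_at"]).replace(tzinfo=timezone.utc)
    return row

print(to_warehouse_row({"amount": "19.90",
                        "created_at": "2024-05-01T12:00:00",
                        "customer": {"city": "Berlin"}}))
```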

Simply “copying raw data” will bring up new questions every time you add a new data source or data target to your list, even if it is just a new table from the same production database you’ve always been copying from.

These practices will be your guide whenever you take on a new source for ingesting data into your data system.

Tip #1: Make each EL run uniquely identifiable – timestamp everything.

We start with arguably the most important best practice: Make every bit of data you load into your data system identifiable and traceable back to the process that got it there.

A typical way of doing this is to include metadata values that capture:

  • Ingestion time: the timestamp indicating when the load process started.
  • Ingestion process: a unique identifier representing the load process and its instance.
  • Source system: metadata about where the data was extracted from.

Add any or all of this metadata to each row/entry of the data you ingest. We recommend using the starting time of your ingestion run as the “ingestion time”, as it simplifies the process. The identifier of your ingestion instance should also be unambiguous: don’t just provide “Airflow-MongoDB-Ingester” as the process, but “Airflow-MongoDB-Ingester-AID1234”, where AID1234 clearly identifies the specific run of the ingestion.
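As a concrete illustration, here is a hedged Python sketch of stamping each row with these three metadata values before loading. The column names and the run-id format are assumptions, not a standard:

```python
import uuid
from datetime import datetime, timezone

# Captured once, when the load process starts, and reused for every row of this run.
RUN_STARTED_AT = datetime.now(timezone.utc)
RUN_ID = f"Airflow-MongoDB-Ingester-{uuid.uuid4().hex[:8]}"   # run-specific suffix, like "AID1234"

def stamp(row: dict, source_system: str) -> dict:
    """Attach EL metadata so every loaded row is traceable back to its run."""
    return {
        **row,
        "_ingestion_time": RUN_STARTED_AT,   # when the load process started
        "_ingestion_run_id": RUN_ID,         # unique identifier of this specific run
        "_source_system": source_system,     # where the data was extracted from
    }

extracted_rows = [{"order_id": 1, "status": "shipped"}]
loaded_rows = [stamp(r, source_system="mongodb.orders") for r in extracted_rows]
```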

One nice benefit of having the source system as metadata is that you’re able to quickly debug problems in downstream dashboards and identify their source. This metadata is also useful for other use cases.

If you have a legacy and a new customer registration component, you can provide the source as a filter option inside dashboards, allowing users to filter for customers from just one system.
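Under the hood, that filter is just a predicate on the source-system column you stamped during ingestion; a tiny, assumed example:

```python
# Assumed: customers were stamped with "_source_system" during ingestion.
customers = [
    {"customer_id": 1, "_source_system": "legacy_registration"},
    {"customer_id": 2, "_source_system": "new_registration"},
]

def from_source(rows: list[dict], source_system: str) -> list[dict]:
    """Keep only rows ingested from the given source system."""
    return [r for r in rows if r["_source_system"] == source_system]

legacy_customers = from_source(customers, "legacy_registration")
```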
