MODERN DATA STACK
Manifest #1 for Extract & Load data operations - Best Practices & Specifications for Low-code Editors - Outside Python notebooks


Extract & Load tools/platforms - batch-ingest data from various sources (ERP, Salesforce, REST APIs, flat files, etc.)


  • Feeding Raw, Staging, or Bronze layers
  • For EDITORS' tools/services - prepare your specifications with your prerequisites
  • Minimum features required to guarantee robust and efficient projects (reduced time to market for data-mesh IT projects, quick ingestion durations, etc.)

1) Connections to "lots of" sources (ERP, APIs, files, SQL, ...) & "lots of" modern-stack targets (Lakehouses & DWHs)

• Manage key-vault providers for service accounts & secrets

• Deal with potentially managed access (the E.L service is well-known by the SQL source & the SQL target)
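
A minimal sketch of the key-vault point, assuming Azure Key Vault with the azure-identity and azure-keyvault-secrets packages; the vault URL and secret name are placeholders:

```python
# Resolve a service-account secret from the key vault at pipeline start,
# so no credential is ever hard-coded in the E.L pipeline itself.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()  # managed identity, env vars, or CLI login
client = SecretClient(
    vault_url="https://my-el-vault.vault.azure.net",  # placeholder vault
    credential=credential,
)
source_password = client.get_secret("erp-sql-password").value  # placeholder name
```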

2) Manage interfaces to deal with source API setups

• URL parameters (hard-coded, expressions, caching loop)

• Body parameters (hard-coded, expressions, caching loop)

• Unnest/flatten the JSON answer, yes or no? (No = SQL (via DBT) will do the job based on array/record-type columns)

• Continuation tokens (see the pagination sketch after this list)

• Chunking into blocks

?? Offer "ready to use" interfaces according to grab API data (parent & children loop) from main cases (Sales force , Azure Graph , O365 , PBI logs etc ..)

• Offer the community an SDK (YAML, Python) to create your own connector interfaces (sources & targets); these will need to be endorsed
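
A minimal sketch of the continuation-token and chunking items above; the endpoint, the "pageToken"/"pageSize" parameter names, and the payload shape are assumptions, not any specific vendor's API:

```python
# Extraction loop driven by a continuation token, yielding one chunk of
# records per page until the API stops returning a next-page token.
import requests

def extract_pages(url: str, page_size: int = 1000):
    token = None
    while True:
        params = {"pageSize": page_size}
        if token:
            params["pageToken"] = token
        resp = requests.get(url, params=params, timeout=60)
        resp.raise_for_status()
        payload = resp.json()
        yield payload["records"]            # one chunk/block of rows
        token = payload.get("nextPageToken")
        if not token:                       # no continuation token => last page
            break
```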

2bis) Manage interfaces to deal with flat files in blob/bucket folders (a minimal listing sketch follows)
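
A minimal sketch assuming the fsspec package with an installed backend (s3fs, adlfs, gcsfs); the bucket path and the watermark value are placeholders:

```python
# List flat files in a bucket folder and keep only those modified since the
# last run, using the storage system's file metadata.
from datetime import datetime, timezone
import fsspec

fs = fsspec.filesystem("s3")  # or "abfs", "gcs", depending on the store
last_run = datetime(2024, 1, 1, tzinfo=timezone.utc)  # placeholder watermark

new_files = [
    path for path in fs.ls("my-bucket/landing/erp/")  # placeholder folder
    if fs.modified(path) > last_run                   # file-metadata filter
]
```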

3) Copy / Connect / Synchronize SOURCE --> TARGET

• Choose the columns to synchronize

• Automatic target-table creation on the first run? Yes/no

• Manage schema-change detection =>

-- Schema-less?

-- Or manage an explicit schema?

-- If schema-less, do you want an automatic schema ALTER on the target SQL table? (see the sketch below)
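
A minimal sketch of automatic schema-drift handling; the column/type inputs are assumptions, and a real tool would also need policies for type widening and removed columns:

```python
# Compare incoming columns with the target table and emit
# ALTER TABLE ... ADD COLUMN statements for anything new.
def schema_alter_statements(table: str, incoming: dict[str, str],
                            existing: set[str]) -> list[str]:
    """incoming maps column name -> SQL type; existing is the target's columns."""
    return [
        f"ALTER TABLE {table} ADD COLUMN {col} {sql_type}"
        for col, sql_type in incoming.items()
        if col not in existing
    ]

# Example: a new "discount_rate" column appeared in the source feed.
stmts = schema_alter_statements(
    "raw.orders",
    incoming={"order_id": "BIGINT", "discount_rate": "DOUBLE"},
    existing={"order_id", "amount"},
)
# -> ["ALTER TABLE raw.orders ADD COLUMN discount_rate DOUBLE"]
```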

• Offer several ingestion modes natively =>

-- Full refresh (select all source rows) or Incremental (select only "new" source rows)

-- Full refresh: Overwrite, or Append new rows (no worries, DBT can deduplicate afterwards ==> gold layer)

-- Incremental on a SQL source: Append new rows based on a pivot column (most of the time a date or an id)

-- Incremental on a flat-file source: Append new rows based on source-file metadata, the last-update system datetime (add this column to the target schema)

-- If incremental is YES, do you want automatic deduplication, based on primary keys for the merge (one row per PK, with the values at MAX(date or id))? <== a kind of PK merge keeping the latest values

-- If incremental is YES and automatic deduplication is NO, then no worries, DBT can deduplicate afterwards ==> gold layer (see the watermark sketch below)
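
A minimal sketch of the incremental + deduplication behaviour, using pandas purely for illustration; in a real E.L service this becomes a MERGE/UPSERT on the target warehouse, and the first run (empty target) would be handled separately:

```python
# Incremental load with a pivot-column watermark and a "last value wins"
# deduplication on the primary key.
import pandas as pd

def incremental_merge(target: pd.DataFrame, source: pd.DataFrame,
                      pk: str, pivot: str) -> pd.DataFrame:
    watermark = target[pivot].max()                # last loaded pivot value
    new_rows = source[source[pivot] > watermark]   # select only "new" rows
    merged = pd.concat([target, new_rows])
    # one row per PK, keeping the row with the MAX pivot value (latest version)
    return (merged.sort_values(pivot)
                  .drop_duplicates(subset=[pk], keep="last"))
```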

4) Audit logs

• Automatically add the ingestion datetime & pipeline session id to the target SQL table (a minimal sketch follows)
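
A minimal sketch of those audit columns; the column names are placeholders:

```python
# Stamp every ingested batch with audit columns before it lands in the target.
import uuid
from datetime import datetime, timezone
import pandas as pd

def add_audit_columns(df: pd.DataFrame, session_id: str | None = None) -> pd.DataFrame:
    df = df.copy()
    df["_ingested_at"] = datetime.now(timezone.utc)               # ingestion datetime
    df["_pipeline_session_id"] = session_id or str(uuid.uuid4())  # run/session id
    return df
```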

5) Monitoring

• Store and expose logs & information

• Alert in case of success, issues, or warnings (e.g. schema changes)

• Offer logs via an API (copy histories: datetimes, row counts, issues, ...) - see the record sketch below
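
A minimal sketch of the copy-history record such a logs API could return; the field names are assumptions for illustration:

```python
# One record per copy run, queryable via the monitoring API.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class CopyHistory:
    pipeline: str
    session_id: str
    started_at: datetime
    finished_at: datetime
    rows_read: int
    rows_written: int
    status: str                                      # "success" | "warning" | "error"
    issues: list[str] = field(default_factory=list)  # e.g. schema-change warnings
```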

6) Orchestration service (Airflow, Microsoft Data Factory, etc.)

• On SaaS (one editor, one data platform) - normally it's integrated

• Not on SaaS (several editors or data platforms)

Example of a chain: Airbyte E.L --> DBT (transformation for gold data) --> refresh a Power BI semantic model --> send a notification (an Airflow sketch follows this list)

• Orchestration needs to use schedules & source-system triggers (messages, file changes, etc.)

• The E.L tool/service can provide messages and fill in variables for the orchestrator
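
A minimal Airflow (2.4+) sketch of the chain above, assuming the apache-airflow-providers-airbyte package; the Airbyte connection id, the dbt project path, and the refresh/notify helpers are placeholders, not a turnkey setup:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

def refresh_pbi_model():  # placeholder: call the Power BI REST refresh API
    ...

def send_notification():  # placeholder: Teams/Slack/mail webhook
    ...

with DAG("el_to_gold", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    extract_load = AirbyteTriggerSyncOperator(
        task_id="airbyte_el",
        airbyte_conn_id="airbyte_default",        # placeholder Airflow connection
        connection_id="<airbyte-connection-uuid>",
    )
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt",  # placeholder path
    )
    refresh = PythonOperator(task_id="refresh_pbi", python_callable=refresh_pbi_model)
    notify = PythonOperator(task_id="notify", python_callable=send_notification)

    extract_load >> transform >> refresh >> notify
```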

7) Embedded & API

• Run jobs via an API (a minimal sketch follows)
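
A minimal sketch of triggering an E.L job from outside the tool's UI, using the Airbyte OSS API's connections/sync endpoint as one concrete example; the host and connection id are placeholders:

```python
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/connections/sync",   # placeholder host
    json={"connectionId": "<airbyte-connection-uuid>"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["job"]["id"])  # job id to poll for completion
```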

8) DevOps

• The E.L pipelines need to be CI/CD compliant (automatic deployments with DEV/TEST/PROD environment parameters)

• Nothing linked to environments is hard-coded (source, target, and credentials are managed via environment variables) - see the sketch below
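
A minimal sketch of that rule; the variable names are placeholders set per environment by the CI/CD pipeline:

```python
# Resolve all environment-specific settings from environment variables so the
# same pipeline code deploys to DEV/TEST/PROD unchanged.
import os

config = {
    "source_dsn": os.environ["EL_SOURCE_DSN"],
    "target_dsn": os.environ["EL_TARGET_DSN"],
    "key_vault_url": os.environ["EL_KEY_VAULT_URL"],
    "environment": os.environ.get("EL_ENV", "DEV"),
}
```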

Artur König

POWER BI or DIE Podcast & Streams | Microsoft Data Plattform @ BI or DIE | Self-Service, Agile BI and Analytics | Power BI Fanboy | Sketchnotes | YouTube | TDWI Expert & Author

8 months ago

Great summary! What do you think about adding an "increment staging layer" before bronze when using sources with large data? I now tend to do this in Fabric to detect errors before appending.
