MODERN DATA STACK Manifesto #1 for Extract & Load data operations - Best Practices & Specifications for Low-Code Editors - Outside Python Notebooks
Christophe Hervouet
DATA Strategy & Consulting (modern BI data platforms / organization / governance / architectures) -------- Modern BI Data Platforms Advisor (Organization / Governance and Architectures)
Extract & Load tools/platforms - batch ingestion: ingest data in batches from various sources (ERP, Salesforce, REST APIs, flat files, etc.)
1) Connections to "a lot of" sources (ERP, API, files, SQL...) & "a lot of" modern-stack targets (lakehouses & DWH)
- Manage key vault providers for service accounts & secrets (see the sketch after this list)
- Deal with managed access where possible (the E.L service identity is already known and trusted by both the SQL source & the SQL target)
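For illustration, a minimal sketch of the key-vault point above, assuming Azure Key Vault with the azure-identity and azure-keyvault-secrets Python packages; the vault URL and secret names are hypothetical:

# Minimal sketch: resolve service-account secrets from a key vault at pipeline start,
# so the E.L tool never stores credentials itself. Names below are hypothetical.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()  # managed identity or environment-based auth
vault = SecretClient(
    vault_url="https://my-el-vault.vault.azure.net",  # hypothetical vault
    credential=credential,
)
sql_source_password = vault.get_secret("sql-source-password").value      # hypothetical secret
salesforce_client_secret = vault.get_secret("sfdc-client-secret").value  # hypothetical secret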
2) Manage interfaces to deal with source API setups
- URL parameters (hard-coded values, expressions, caching loop)
- Body parameters (hard-coded values, expressions, caching loop)
- Unnest / flatten the JSON answer: yes or no? (No = SQL (via DBT) will do the job based on array/record-type columns)
- Continuation token (see the pagination sketch after this list)
- Chunking into blocks
- Offer "ready to use" interfaces to grab API data (parent & children loops) for the main cases (Salesforce, Azure Graph, O365, PBI logs, etc.)
- Offer a community SDK (YAML, Python) to create your own connector interfaces (sources & targets); these will need to be endorsed
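To make the continuation-token and flatten points concrete, a minimal sketch of a paginated REST extraction in Python; the endpoint, parameter names and payload fields are hypothetical and depend on the actual source API:

# Minimal sketch: loop over a paginated REST source using a continuation token,
# then decide whether to flatten the nested JSON here or leave it to SQL/DBT.
import requests
import pandas as pd

BASE_URL = "https://api.example.com/v1/orders"                 # hypothetical source API
HEADERS = {"Authorization": "Bearer <token-from-key-vault>"}

rows, token = [], None
while True:
    params = {"pageSize": 1000}
    if token:
        params["continuationToken"] = token                    # hypothetical pagination parameter
    resp = requests.get(BASE_URL, headers=HEADERS, params=params, timeout=60)
    resp.raise_for_status()
    payload = resp.json()
    rows.extend(payload["value"])                               # hypothetical field holding the records
    token = payload.get("continuationToken")                    # hypothetical next-page token field
    if not token:
        break

# Flatten = YES: unnest here. Flatten = NO: land the raw JSON and let DBT work on array/record columns.
flat = pd.json_normalize(rows)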
2bis) Manage interfaces to deal with flat files in blob bucket folders
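A minimal sketch of this file-based case, assuming Azure Blob Storage as one example backend (the azure-storage-blob package); container, folder and variable names are hypothetical:

# Minimal sketch: list flat files in a blob "folder" that changed since the last run,
# to feed an incremental file ingestion (see the incremental options in section 3).
import os
from datetime import datetime, timezone
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str=os.environ["BLOB_CONNECTION_STRING"],     # never hard-coded, see section 8
    container_name="landing",                          # hypothetical container
)
last_run = datetime(2024, 1, 1, tzinfo=timezone.utc)   # would come from the tool's watermark store
new_files = [
    blob.name
    for blob in container.list_blobs(name_starts_with="erp/exports/")  # hypothetical folder
    if blob.last_modified > last_run
]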
3) Copy / connect / synchronize SOURCE --> TARGET
- Choose the columns to synchronize
- Automatic target table creation on the first run: yes/no?
- Manage schema change detection =>
-- Schema-less?
-- Or manage a schema?
-- If schema-less, do you want an automatic schema alter on the target SQL table?
- Natively offer several ingestion options =>
-- Full refresh (select all source rows) or incremental (select only "new" source rows)
-- Full refresh: overwrite or append new rows (no worries, DBT can deduplicate afterwards ==> gold layer)
-- Incremental on a SQL source: append new rows based on a pivot column (most of the time a date or an id)
-- Incremental on a flat-file source: append new rows based on source file metadata, the last-update system datetime (add this column to the target schema)
-- If incremental is YES, do you want automatic deduplication based on the primary keys for the merge (one row per PK, keeping the values for MAX(date or id)) <== a kind of PK merge with the latest values (see the sketch after this list)
-- If incremental is YES and automatic deduplication is NO, then no worries, DBT can deduplicate afterwards ==> gold layer
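A minimal sketch of the incremental and deduplication options above, expressed as generated SQL; table, pivot-column and primary-key names are hypothetical, and the E.L tool would run the statements on the target:

# Minimal sketch: incremental select on a pivot column + PK deduplication keeping the latest values.
def incremental_select(source_table, pivot_col, last_value):
    # Incremental: only the "new" source rows beyond the stored watermark.
    return f"SELECT * FROM {source_table} WHERE {pivot_col} > '{last_value}'"

def dedup_on_pk(target_table, pk_cols, pivot_col):
    # Automatic deduplication: one row per PK, the one with MAX(date or id).
    pk = ", ".join(pk_cols)
    return f"""
    SELECT * FROM (
        SELECT t.*, ROW_NUMBER() OVER (PARTITION BY {pk} ORDER BY {pivot_col} DESC) AS rn
        FROM {target_table} t
    ) dedup WHERE rn = 1
    """

print(incremental_select("erp.orders", "updated_at", "2024-01-01"))
print(dedup_on_pk("bronze.orders", ["order_id"], "updated_at"))

If automatic deduplication is NO, the same ROW_NUMBER pattern is what a DBT model would apply later in the gold layer.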
4) Audit logs
- Automatically add an ingestion datetime & a pipeline session id to the target SQL table
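A minimal sketch of these audit columns stamped on a DataFrame just before loading; the column names are hypothetical but mirror the two fields above:

# Minimal sketch: stamp every ingested row with the load datetime and the pipeline session id.
import uuid
from datetime import datetime, timezone
import pandas as pd

def add_audit_columns(df, session_id=None):
    out = df.copy()
    out["_ingested_at"] = datetime.now(timezone.utc)                 # ingestion datetime
    out["_pipeline_session_id"] = session_id or str(uuid.uuid4())    # one id per pipeline run
    return out

batch = add_audit_columns(pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.5]}))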
5) Monitoring
- Store and expose logs & information
- Alert in case of success, issues or warnings (schema changes)
- Offer logs via API (copy histories: datetimes, row counts, issues...)
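As an illustration of exposing logs via API, a minimal sketch that pulls copy histories and flags problems; the endpoint and response fields are hypothetical and depend on the editor's actual API:

# Minimal sketch: read the last runs of a job from a (hypothetical) monitoring endpoint.
import requests

resp = requests.get(
    "https://el-tool.example.com/api/v1/jobs/1234/runs",   # hypothetical monitoring endpoint
    headers={"Authorization": "Bearer <api-token>"},
    params={"limit": 10},
    timeout=30,
)
resp.raise_for_status()
for run in resp.json()["runs"]:                            # hypothetical payload shape
    print(run["started_at"], run["rows_copied"], run["status"])
    if run["status"] in ("failed", "warning"):             # e.g. schema-change warnings
        print("ALERT:", run.get("message", "no detail"))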
6) Orchestration service (Airflow, Microsoft Data Factory, etc.)
- On SaaS (one editor, one data platform): normally it is integrated
- Not on SaaS (several editors or data platforms)
-- Example chain: Airbyte E.L --> DBT (transformation for gold data) --> refresh a Power BI semantic model --> send a notification (see the Airflow sketch after this list)
- Orchestration needs to use schedules & source-system triggers (messages, file changes, etc.)
- The E.L tool/service can provide messages and fill in variables for the orchestrator
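A minimal Airflow 2.x sketch of the example chain above; the DAG id, schedule and the wrapper scripts called by each task are hypothetical placeholders, not a real implementation:

# Minimal sketch: Airbyte E.L --> DBT --> Power BI refresh --> notification, orchestrated by Airflow.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="el_dbt_pbi_chain",            # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                    # or a source-system trigger / sensor
    catchup=False,
) as dag:
    airbyte_sync = BashOperator(
        task_id="airbyte_extract_load",
        bash_command="python trigger_airbyte_sync.py",   # hypothetical wrapper around the Airbyte API
    )
    dbt_run = BashOperator(
        task_id="dbt_transform_gold",
        bash_command="dbt run --select gold",            # assumes a dbt project with a 'gold' selector
    )
    pbi_refresh = BashOperator(
        task_id="refresh_pbi_semantic_model",
        bash_command="python refresh_pbi_dataset.py",    # hypothetical wrapper around the Power BI REST API
    )
    notify = BashOperator(
        task_id="send_notification",
        bash_command="python send_notification.py",      # hypothetical Teams/Slack/email notifier
    )
    airbyte_sync >> dbt_run >> pbi_refresh >> notify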
7) Embedded & API
- Run jobs via API
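A minimal sketch of triggering a job from the outside (embedded scenario) via the tool's REST API; the URL, payload and response field are hypothetical:

# Minimal sketch: run an E.L job via a (hypothetical) REST endpoint.
import requests

resp = requests.post(
    "https://el-tool.example.com/api/v1/jobs/1234/run",  # hypothetical "run job" endpoint
    headers={"Authorization": "Bearer <api-token>"},
    json={"full_refresh": False},                        # e.g. incremental vs forced full refresh
    timeout=30,
)
resp.raise_for_status()
print("Run id:", resp.json()["run_id"])                  # hypothetical response field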
8) DevOps
- The E.L pipelines need to be CI/CD compliant (automatic deployments with DEV/TEST/PROD environment parameters)
- Nothing linked to environments is "hard-coded" (source, target and credentials are managed via environment variables)
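A minimal sketch of the "nothing hard-coded" rule: the same pipeline definition reads all environment-specific settings from environment variables, so it deploys unchanged to DEV/TEST/PROD (variable names are hypothetical):

# Minimal sketch: environment-specific configuration resolved at runtime, never committed.
import os

config = {
    "environment": os.environ["EL_ENV"],               # DEV / TEST / PROD
    "source_dsn": os.environ["EL_SOURCE_DSN"],         # e.g. ERP SQL connection string
    "target_schema": os.environ["EL_TARGET_SCHEMA"],   # e.g. bronze_dev vs bronze_prod
    "key_vault_url": os.environ["EL_KEY_VAULT_URL"],   # credentials fetched at runtime (see section 1)
}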
Comment (8 months ago): Great summary! What do you think about adding an "incremental staging layer" before bronze when using sources with large data volumes? I now tend to do this in Fabric to detect errors before appending.