Data Pipeline
Nazir Ahammad Syed
Data Architect | AWS | Snowflake Cloud | Python | DevOps | Data warehouse | Automation Expert
A data pipeline refers to a series of processes that involve ingesting, moving, and transforming raw data from various sources to a designated destination. Typically, the data at this destination is utilized for purposes such as analysis, machine learning, or other business functions.
1. Data Pipeline Architecture
A data pipeline’s architecture consists of three main components:
1.1 Data Providers
1.2 Data Processing (ETL/ELT)
1.2.1 Orchestration Process
1.3 Target/Data Consumers
1.1 Data Providers
Common data sources include:
- On-premises source systems - application databases, APIs, application servers, files from an SFTP server
- Cloud - AWS, Azure, GCP
- SaaS - Salesforce, Workday data
- Streaming/Edge - machine data, connected-device data, logs
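As a rough illustration of ingesting from one of these sources, the sketch below pulls records from a hypothetical REST API and lands them as raw JSON files. The endpoint, token handling, and landing path are assumptions for illustration, not part of any specific source system.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

import requests

# Hypothetical source API and landing folder -- adjust for the real source system.
SOURCE_URL = "https://api.example.com/v1/orders"
LANDING_DIR = Path("landing/orders")


def ingest_raw_orders(api_token: str) -> Path:
    """Pull raw records from the source API and land them as a timestamped JSON file."""
    response = requests.get(
        SOURCE_URL,
        headers={"Authorization": f"Bearer {api_token}"},
        timeout=30,
    )
    response.raise_for_status()

    LANDING_DIR.mkdir(parents=True, exist_ok=True)
    landed_file = LANDING_DIR / f"orders_{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.json"
    landed_file.write_text(json.dumps(response.json()))
    return landed_file
```

Landing the raw payload unchanged keeps the ingestion step simple; cleaning and business rules belong in the processing step that follows.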
1.2 Data Processing (ETL/ELT)
Data processing refers to the transformations applied to data within the pipeline, often driven by Change Data Capture (CDC) to pick up only new or changed records. This typically involves cleaning, filtering, and applying business-specific logic to the data.
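As a minimal sketch of this processing step, the snippet below keeps only changed records (a simple CDC-style comparison against previously loaded keys) and applies basic cleaning and business rules. The column names and thresholds are illustrative assumptions.

```python
import pandas as pd


def transform_batch(incoming: pd.DataFrame, previously_loaded_keys: set) -> pd.DataFrame:
    """Keep only new/changed rows and apply simple cleaning and business rules."""
    # CDC-style filter: drop rows whose business key was already loaded (illustrative).
    changed = incoming[~incoming["order_id"].isin(previously_loaded_keys)].copy()

    # Cleaning: standardize text fields and drop rows missing required values.
    changed["customer_name"] = changed["customer_name"].str.strip().str.title()
    changed = changed.dropna(subset=["order_id", "order_date"])

    # Business-specific logic: flag high-value orders (threshold is an assumption).
    changed["is_high_value"] = changed["order_amount"] > 10_000
    return changed
```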
1.2.1 Orchestration Process
Different orchestration tools are used depending on the application and business needs.
For example:
- For simple pipeline scheduling, a cron schedule can be used.
- For advanced or complex pipelines, workflow-based orchestrators (Airflow, Control-M, …) are more appropriate.
Orchestration can be classified into two main types: batch processing (the most common) and real-time processing.
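For the workflow-based option, a minimal Airflow DAG for a daily batch might look like the sketch below. The DAG name, the daily schedule, and the ingest/transform/load callables are placeholders assumed for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# Placeholder callables -- in a real project these would import the actual pipeline steps.
def ingest(): ...
def transform(): ...
def load(): ...


with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # batch orchestration; a cron expression also works here
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run the steps in order: ingest -> transform -> load.
    ingest_task >> transform_task >> load_task
```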
1.3 Target/Data Consumers
The target is where we deliver the processed data. The most common targets are databases or storage areas designed for analytics, such as a data warehouse or data lake.
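As one possible delivery sketch, the snippet below writes a processed batch to Parquet and uploads it to an S3-based data lake path; the bucket name and prefix are assumptions, and a warehouse such as Snowflake could then load the file with its own COPY mechanism.

```python
from pathlib import Path

import boto3
import pandas as pd

# Hypothetical data lake location -- replace with the real bucket and prefix.
LAKE_BUCKET = "example-analytics-lake"
LAKE_PREFIX = "curated/orders"


def load_to_data_lake(batch: pd.DataFrame, batch_id: str) -> str:
    """Write the processed batch as Parquet and upload it to the data lake."""
    local_file = Path(f"/tmp/orders_{batch_id}.parquet")
    batch.to_parquet(local_file, index=False)  # requires pyarrow or fastparquet

    key = f"{LAKE_PREFIX}/orders_{batch_id}.parquet"
    boto3.client("s3").upload_file(str(local_file), LAKE_BUCKET, key)
    return f"s3://{LAKE_BUCKET}/{key}"
```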