Extract, Transform, Load (ETL)
To analyze a body of data, that data must first be loaded into a data warehouse; that is, it must be copied from one or more systems, converted into a uniform format, and written to the new destination. This process is commonly referred to as extract, transform, and load (ETL). ETL provides the means to combine disparate data from multiple sources into a homogeneous data set that can be analyzed to extract business intelligence.
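To make the three steps concrete, here is a minimal sketch in Python: it extracts order rows from a CSV export, normalizes the date format, and loads the result into a SQLite table standing in for the warehouse. The file name, column names, and table layout are illustrative assumptions, not a prescribed design.

```python
import csv
import sqlite3
from datetime import datetime

# Hypothetical source file and target database, used only for illustration.
SOURCE_CSV = "orders_export.csv"   # data exported from a source system
TARGET_DB = "warehouse.db"         # SQLite standing in for the warehouse

def extract(path):
    """Read raw rows from the source file into memory (staging)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Normalize each row so it matches the warehouse schema."""
    cleaned = []
    for row in rows:
        # Convert MM/DD/YYYY dates to ISO 8601 (YYYY-MM-DD).
        order_date = datetime.strptime(row["order_date"], "%m/%d/%Y").date()
        cleaned.append((row["order_id"], order_date.isoformat(), float(row["amount"])))
    return cleaned

def load(rows, db_path):
    """Write the transformed rows to the target warehouse table."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, order_date TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

if __name__ == "__main__":
    load(transform(extract(SOURCE_CSV)), TARGET_DB)
```

A real pipeline adds error handling, logging, and scheduling around these three calls, but the shape stays the same.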
Extract
During extraction, data is read from one or more sources and held in temporary storage for transformation and loading. An organization may extract data from its own internal systems, such as a transaction processing system that records all order activity, or from external sources, such as data it purchases or obtains for free from other organizations.
Extraction is commonly broken down into two logical extraction methods: full extraction, in which all of the data is pulled from the source every time, and incremental extraction, in which only the data that has changed since the previous extraction is pulled.
Extraction is also broken down into two physical extraction methods: online extraction, in which data is read directly from the live source system, and offline extraction, in which data is first staged outside the source system (for example, in flat files) and extracted from there.
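As a rough illustration of incremental extraction, the sketch below queries only the rows whose last-updated timestamp is newer than the previous run. The orders table, updated_at column, and bookmark file are assumptions made for the example.

```python
import sqlite3
from pathlib import Path

# Hypothetical source database with a last-updated timestamp on each row.
SOURCE_DB = "order_system.db"
BOOKMARK = Path("last_extract.txt")   # remembers where the previous run stopped

def extract_incremental(db_path):
    """Pull only rows changed since the previous extraction run."""
    last_run = BOOKMARK.read_text().strip() if BOOKMARK.exists() else "1970-01-01 00:00:00"
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT order_id, customer_id, order_date, updated_at "
            "FROM orders WHERE updated_at > ? ORDER BY updated_at",
            (last_run,),
        ).fetchall()
    if rows:
        # Save the newest timestamp so the next run picks up where this one ended.
        BOOKMARK.write_text(rows[-1][-1])
    return rows
```

Full extraction is simpler (select everything), but incremental extraction keeps the load on the source system and the volume of staged data much smaller.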
Transform
During the transform stage, data is processed to make it consistent in structure and format so that it conforms to a uniform schema. A schema provides the structure and rules for organizing data in a relational database. The source and target database systems may use different schemas; for example, the source database may store shipping information in a Customer table, whereas the target database stores shipping information in a separate Shipping table. Or, the source table may have dates in the MM/DD/YYYY format, whereas the target uses the DD/MM/YYYY format. To copy data successfully from the source to the target, certain transformations must be made to ensure that the source data arrives in an acceptable format.
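For instance, a small transformation function might handle both examples above: splitting shipping fields out of a source Customer record and rewriting dates from MM/DD/YYYY to DD/MM/YYYY. This is a hedged sketch; the field names are invented for illustration.

```python
from datetime import datetime

def transform_customer(source_row):
    """Split a source Customer record into separate Customer and Shipping
    records, and convert dates from MM/DD/YYYY to DD/MM/YYYY."""
    signup = datetime.strptime(source_row["signup_date"], "%m/%d/%Y")
    customer = {
        "customer_id": source_row["customer_id"],
        "name": source_row["name"],
        "signup_date": signup.strftime("%d/%m/%Y"),
    }
    shipping = {
        "customer_id": source_row["customer_id"],
        "street": source_row["ship_street"],
        "city": source_row["ship_city"],
        "postal_code": source_row["ship_postal_code"],
    }
    return customer, shipping

# Example source record from the (hypothetical) Customer table.
row = {
    "customer_id": "C100", "name": "Ada Li", "signup_date": "03/21/2024",
    "ship_street": "5 Main St", "ship_city": "Leeds", "ship_postal_code": "LS1 1AA",
}
print(transform_customer(row))
```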
Transformations can be handled in two ways:
Load
During the load operation, all newly transformed data is written to the target data warehouse for storage. Various mechanisms can be used to load data into the target warehouse, including the following:
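Whatever mechanism is used, the effect is the same: transformed rows end up in the warehouse's tables. The sketch below shows one simple approach, a batched SQL insert, with SQLite standing in for the target warehouse and an invented fact_orders table; production warehouses typically also provide faster bulk-load utilities.

```python
import sqlite3

# SQLite stands in for the target warehouse; table and column names are illustrative.
TARGET_DB = "warehouse.db"

def load_batch(rows):
    """Append a batch of transformed rows to the warehouse fact table."""
    with sqlite3.connect(TARGET_DB) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS fact_orders (
                   order_id TEXT PRIMARY KEY,
                   order_date TEXT,
                   amount REAL
               )"""
        )
        # INSERT OR REPLACE keeps the load idempotent if a batch is re-run.
        conn.executemany(
            "INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?)", rows
        )

load_batch([("O-1001", "2024-03-21", 49.95), ("O-1002", "2024-03-22", 18.00)])
```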
Variations on the Theme
ETL is commonly described as a three-step process primarily to make it easier to understand. In practice, ETL is not a series of clearly defined steps but more of a single process. As such, the sequence of events may vary. Depending on the approach, ETL may be more like one of the following:
The ETL Bottleneck
Given the increasing volumes of data that organizations must capture and integrate into their data warehouses, ETL often becomes a major bottleneck. Database administrators need to constantly revise their ETL procedures to accommodate variations in the data arriving from different sources. In addition, the volume and velocity of data can overwhelm an organization’s existing data warehouse storage and compute capabilities, leading to delays in producing time-sensitive reports and business intelligence. ETL operations often compete for the same storage and compute resources needed to handle data queries and analytics.
Fortunately, data warehousing technology has evolved to help reduce or eliminate the impact of the ETL bottleneck. For example, cloud data warehousing provides virtually unlimited storage and compute resources, so ETL does not need to compete with queries and analytics for limited resources. In addition, distributed processing frameworks such as Hadoop spread work-intensive tasks such as ETL across multiple servers, using parallel processing to complete jobs faster.
With the right tools and technologies in place, organizations can now stream diverse data from multiple sources into their data warehouses and query and analyze that data in near real time. If you or your team is in charge of procuring a new data warehouse solution for your organization, look for a solution that provides unlimited concurrency, storage, and compute to avoid contention between ETL processes and the people in the organization who need the same system to run queries and conduct analysis. Also look for a system that can ingest live data streams and process structured, semi-structured, and unstructured data quickly and easily without complicated and costly ETL or ELT processes. In most cases, the ideal solution will be a data warehouse built for the cloud.
Frequently Asked Questions
What is an ETL Tool?
An ETL tool is software designed to manage the Extract, Transform, Load (ETL) process, which involves extracting data from various sources, transforming the data into the required format, and loading it into a target data store such as a database, data warehouse, or data lake.
How does the ETL process work?
The ETL process works by first extracting raw data from multiple data sources, then transforming the data to meet specific business or technical needs, and finally loading the processed data into a data store for use in data analysis and reporting.
What is the purpose of ETL in data integration?
The purpose of ETL in data integration is to streamline the process of combining data from various sources into a single, cohesive data repository, thereby enabling better data processing, data analysis, and data quality management.
Why is data transformation important in an ETL pipeline?
Data transformation is important in an ETL pipeline because it ensures that the extracted data is converted into a suitable format for analysis, reporting, and storage. It helps in data cleansing, converting data types, and applying business rules to the data.
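As a small illustration, a cleansing pass might drop duplicates, standardize a text field, convert a numeric type, and enforce a business rule. The record layout and rules below are hypothetical.

```python
def cleanse(records):
    """Apply basic cleansing rules to a list of staged records (dicts)."""
    cleaned, seen = [], set()
    for rec in records:
        key = rec.get("order_id")
        if not key or key in seen:                               # drop blanks and duplicates
            continue
        seen.add(key)
        rec["email"] = rec.get("email", "").strip().lower()      # standardize format
        rec["amount"] = round(float(rec.get("amount", 0)), 2)    # convert data type
        if rec["amount"] < 0:                                    # business rule: no negative amounts
            continue
        cleaned.append(rec)
    return cleaned

print(cleanse([
    {"order_id": "A1", "email": " Pat@Example.com ", "amount": "19.990"},
    {"order_id": "A1", "email": "pat@example.com", "amount": "19.99"},  # duplicate
    {"order_id": "A2", "email": "lee@example.com", "amount": "-5"},     # rejected
]))
```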
What types of data can be handled by ETL tools?
ETL tools can handle various types of data, including structured data (like databases), unstructured data (like text files), semi-structured data (like JSON and XML files), and large data sets often found in big data environments.
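For example, a semi-structured JSON document can be flattened into relational rows before loading; the payload below is invented for illustration.

```python
import json

# Hypothetical semi-structured payload, e.g. from an API response or a JSON file.
payload = json.loads("""
{
  "customer": {"id": "C100", "name": "Ada Li"},
  "orders": [
    {"order_id": "O-1", "amount": 12.50},
    {"order_id": "O-2", "amount": 7.25}
  ]
}
""")

# Flatten the nested structure into rows a relational warehouse can store.
rows = [
    (payload["customer"]["id"], o["order_id"], o["amount"])
    for o in payload["orders"]
]
print(rows)   # [('C100', 'O-1', 12.5), ('C100', 'O-2', 7.25)]
```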
How do ETL tools improve data quality?
ETL tools improve data quality by automating data cleansing processes, identifying and correcting errors, standardizing data formats, and ensuring consistency across different data sources.
What is the difference between a data warehouse and a data lake in the context of ETL?
A data warehouse is a structured environment designed for efficient querying and reporting on historical data, optimized for structured data types. A data lake is a storage repository that can hold vast amounts of raw data in its native format, accommodating both structured and unstructured data, and is typically used for big data processing and advanced analytics.
How does an ETL pipeline support data analytics?
An ETL pipeline supports data analytics by ensuring that data from different sources is accurately and consistently processed, transformed, and loaded into an analytical data store. This allows for comprehensive data analysis and insightful business intelligence.
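Once the data has landed, analysis is typically just a query against the warehouse tables. Continuing with the hypothetical SQLite warehouse and fact_orders table from the earlier load sketch, a simple daily revenue report might look like this.

```python
import sqlite3

# Assumes the fact_orders table created in the earlier load example.
with sqlite3.connect("warehouse.db") as conn:
    for row in conn.execute(
        "SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue "
        "FROM fact_orders GROUP BY order_date ORDER BY order_date"
    ):
        print(row)
```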
What challenges can occur during the ETL process?
Challenges during the ETL process can include handling large data volumes, ensuring data quality and data integrity, managing data transformations and business logic, maintaining data security, and integrating data from disparate sources.
How do modern ETL tools use the cloud?
Modern ETL tools leverage the cloud to enhance scalability, flexibility, and cost-efficiency. They often integrate seamlessly with cloud data warehouses, cloud-based data lakes, and other cloud services, enabling organizations to process data in real-time and handle large and dynamic data workloads effectively.
This is my weekly newsletter that I call The Deep End because I want to go deeper than results you’ll see from searches or AI, incorporating insights from the history of data and data science. Each week I’ll go deep to explain a topic that’s relevant to people who work with technology. I’ll be posting about artificial intelligence, data science, and data ethics.
This newsletter is 100% human written (aside from a quick run through grammar and spell check).