Data warehouse, data lake, and the features of ETL and ELT
Article 1 of 2
Originally published on our blog .
Click here for the second article: The ETL to ELT to EtLT Evolution, and data pipelines
In the dynamic world of data, many professionals are still fixated on traditional patterns of data warehousing and ETL, even while their organizations are migrating to the cloud and adopting cloud-native data services. But the capabilities of the clouds have eclipsed traditional data architectures, and have upset the roles of data acquisition (“Extract”), logical processing (“Transform”), and populating a schema (“Load”).?
Central to this transformation are two shifts. First, the unmatched scalability of cloud databases which maintains classic table structures while separating storage and compute. Secondly, the rise of data lakes that catalyzed the transition from ELT to ELT and paved the way for niche paradigms such as Reverse ETL and Zero-ETL. Still, these methods have been overshadowed by EtLT — the predominant approach reshaping today’s data landscape.
In this article, we assess:
The second article covers:
Understanding this evolution equips data teams with the tools to navigate the ever-shifting data space, amplifying their impact on business value. Let’s take a closer look.
Modern Cloud Data Platforms
The native capabilities of the cloud providers have been joined by third-party services to offload that data into separate less costly systems that are optimized for analysis of that data. These services fall into two broad categories: those designed to serve as data warehouses and those designed like data lakes. Let’s investigate these two traditionally contrasting modes of operation.
The Data Warehouse Pattern
The heart of a data warehouse lies in its schema, capturing intricate details of business operations. This unchanging schema forms the foundation for all queries and business intelligence. Modern platforms like Redshift, Snowflake, and BigQuery have elevated the data warehouse model. By marrying the tried-and-true practices of traditional warehousing with modern scalability, these platforms eliminate the cumbersome tasks associated with on-prem management while enabling the execution of analysis, reporting, and ad-hoc inquiries.
The Data Lake Pattern
Emerging in contrast to the structured world of warehousing, data lakes cater to the dynamic and diverse nature of modern internet-based applications. For example, unlike traditional platforms with set schemas, data lakes adapt to frequently changing data structures at points where the data is loaded, accessed, and used.?
These fluid conditions require unstructured data environments that natively operate with constantly changing formats, data structures, and data semantics. Services like AWS Glue, Databricks, and Dataproc have powerful data lake capabilities, where code-heavy processes and agile workflows can transform data into many different forms.?
This pattern requires new methodical, traceable, and repeatable methods like Directed Acyclic Graphs (DAGs) that give the business confidence in the results.
?
ETL for Traditional Data Warehouse Pattern
As discussed, data warehouses are structured repositories designed for predefined schemas that are optimized for analytical querying and reporting of data. The ETL pattern was developed specifically to load data into these rigorous schemas, with the following characteristics:
These requirements are typically met by ETL tools, like Informatica, that include their own transform engines to "do the work" of cleaning, normalizing, and integrating the data as it is loaded into the data warehouse schema. If you encounter a tool that claims to perform ETL but uses the data warehouse itself to do this work, you can consider it to be ELT — the transformation (“T”) occurs after the data is loaded (“L”).
ELT for the Data Lake Pattern
As discussed earlier, data lakes are highly flexible repositories that can store vast volumes of raw data with very little preprocessing. They can accommodate any type of data, from structured to semi-structured to unstructured, and do not need a predefined schema. The ELT pattern emerged as a more suitable approach to process data in data lakes, due to these reasons: