ETL or ELT
20 years ago, when I used Informatica to do ETL, we had an ETL server. Informatica PowerCenter was installed on a separate Windows server, sometimes on multiple servers. The same was true with Ab Initio. We used the power of that server to do the transformation before loading the data into the target warehouse, for example an Oracle database running on Unix. With Ab Initio you could even use different servers for different jobs, e.g. do the data formatting and matching on an IBM z/OS server and use a Linux server to aggregate the data, like this: (source: Abinitio.com)
In the SQL Server world we had something called DTS (Data Transformation Services, the predecessor to SSIS), whereas in Oracle we had OWB (Oracle Warehouse Builder, the predecessor to ODI). With DTS we had the option of running it on the same server as SQL Server. So if the target warehouse was housed on that SQL Server, we were doing ELT: the data was loaded into staging tables within the target SQL Server, the same server where the data warehouse lived. The same with Oracle: their ODI (Oracle Data Integrator, previously Sunopsis) also pioneered ELT - it loaded the data into a staging area located in the same Oracle server where the warehouse was located.
But that was 15-20 years ago.
Today, data warehouses are in the cloud. They have flexible compute power (memory, processor, disk); you can increase the RAM or CPU within minutes. So the old argument for ETL no longer applies. Everyone from Data Vault to Kimball practitioners, and from Snowflake and Databricks to Microsoft Fabric, Redshift and BigQuery, uses ELT. We load our data into the staging area within the target data warehouse platform in the cloud, then use the power of that warehouse platform to do the transformations.
It can be a series of stored procedures in Synapse transforming data from the staging tables into the transform layer (or silver layer if you use the medallion model). It can be Matillion or dbt utilising Snowflake compute power to load data from staging tables (or from S3 files) into Data Vault hub, satellite and link tables. Or perhaps you use PySpark notebooks to utilise Databricks Spark power to transform data from the staging delta tables into the dimension and fact tables in your Kimball warehouse. Or it could be ADF running on Azure, loading files from ADLS into a staging area in Synapse, and then using Synapse compute power (or Fabric, or Azure SQL) to transform the data.
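To make that PySpark option concrete, here is a minimal sketch of the pattern, assuming a staging delta table called staging.customer and a target dimension dw.dim_customer (both names and columns are hypothetical, and surrogate keys and SCD handling are left out). The point is simply that the transformation runs on the lakehouse's own Spark compute, not on a separate ETL server.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks the SparkSession already exists as `spark`;
# getOrCreate() simply returns it.
spark = SparkSession.builder.getOrCreate()

# Read from a staging delta table (hypothetical name).
stg_customer = spark.read.table("staging.customer")

# Transform: deduplicate, standardise and add audit columns for the dimension.
dim_customer = (
    stg_customer
    .dropDuplicates(["customer_id"])
    .withColumn("customer_name", F.initcap(F.trim(F.col("customer_name"))))
    .withColumn("load_timestamp", F.current_timestamp())
    .select("customer_id", "customer_name", "country_code", "load_timestamp")
)

# Load into the Kimball dimension (full reload for simplicity;
# a real pipeline would use MERGE / SCD2 logic).
dim_customer.write.mode("overwrite").saveAsTable("dw.dim_customer")
```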
Let’s go through some examples. Here is an example by Dominic Colyer (source: link) where they used Fivetran to load data from the source into Snowflake staging tables. They then utilise Snowflake compute power to do the data transformation: data in the staging tables is loaded into permanent history tables using Snowflake tasks.
Here the Kimball marts are just views on the history tables; they are not physicalised as mart tables. So queries from Power BI use Snowflake compute power to transform the history tables into facts and dimensions on the fly.
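Here is a minimal sketch of that Snowflake pattern, using the snowflake-connector-python library from a Python script. The account details, table names and columns are all hypothetical, and a real pipeline would handle incremental loads and change detection properly; this just shows the staging-to-history task and the view-based mart.

```python
import snowflake.connector

# Connection details are placeholders - fill in your own account and credentials.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="TRANSFORM_WH", database="DWH", schema="HISTORY",
)
cur = conn.cursor()

# A scheduled task that appends new staging rows to a history table
# (hypothetical table names; no change detection for brevity).
cur.execute("""
    CREATE OR REPLACE TASK load_customer_history
      WAREHOUSE = TRANSFORM_WH
      SCHEDULE = '60 MINUTE'
    AS
      INSERT INTO history.customer_history
      SELECT *, CURRENT_TIMESTAMP() AS loaded_at
      FROM staging.customer
""")
cur.execute("ALTER TASK load_customer_history RESUME")

# The mart is just a view on the history table, not a physical table.
cur.execute("""
    CREATE OR REPLACE VIEW mart.dim_customer AS
    SELECT customer_id, customer_name, country_code
    FROM history.customer_history
    QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY loaded_at DESC) = 1
""")
```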
Here’s another example where you utilise Databricks compute power to do the transformation within the lakehouse (source: Databricks). You load/ingest data from the source into Parquet or JSON files in the lakehouse. Then you transform several times, for example from the bronze layer to the silver layer to the gold layer, all utilising Databricks Spark compute resources.
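Here is a minimal PySpark sketch of that medallion flow, assuming hypothetical paths and table names (orders data landing as JSON in the bronze area). A real pipeline would load incrementally rather than overwrite each layer.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: raw JSON files exactly as they were ingested (path is hypothetical).
bronze = spark.read.json("/lakehouse/bronze/orders/")
bronze.write.format("delta").mode("overwrite").saveAsTable("bronze.orders")

# Silver: cleansed and typed version of the same data.
silver = (
    spark.read.table("bronze.orders")
    .filter(F.col("order_id").isNotNull())
    .withColumn("order_date", F.to_date("order_date"))
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders")

# Gold: business-level aggregate ready for reporting.
gold = (
    spark.read.table("silver.orders")
    .groupBy("order_date", "customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("gold.daily_sales")
```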
Here is another example by Mazlum Tosun (link) where they extract data from the source into Google Cloud Storage (GCS) as JSON files, then utilise BigQuery compute power to transform the data.
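A minimal sketch of that GCS-to-BigQuery pattern using the google-cloud-bigquery Python client. The bucket, dataset and table names are hypothetical; the load step is the L, and the CREATE TABLE AS SELECT is the T, executed entirely on BigQuery's compute.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses the default GCP project and credentials

# L: load the raw JSON files from GCS into a BigQuery staging table.
load_job = client.load_table_from_uri(
    "gs://my-bucket/raw/orders/*.json",   # hypothetical bucket/path
    "my_project.staging.orders",          # hypothetical staging table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ),
)
load_job.result()  # wait for the load to finish

# T: transform inside BigQuery - the SQL runs on BigQuery's compute, not ours.
client.query("""
    CREATE OR REPLACE TABLE my_project.dwh.fact_orders AS
    SELECT order_id,
           SAFE_CAST(order_date AS DATE) AS order_date,
           customer_id,
           CAST(amount AS NUMERIC) AS amount
    FROM my_project.staging.orders
    WHERE order_id IS NOT NULL
""").result()
```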
Last one: here is an example for Microsoft Fabric by Ben Haeverans (link) where ADF is used to load/ingest data from the sources into the data lake. Synapse Data Engineering then uses Synapse Spark cluster power to model and transform the data from the Bronze to the Silver to the Gold layer.
Today is the world of SaaS. Your data lake and data warehouse are on a cloud platform, be it Fabric, Databricks, Redshift, BigQuery or Snowflake, running on Azure, AWS or Google Cloud. You don't install a database server, you just create a serverless database. Today, when we say ETL, we mean ELT. Most companies do ELT: within the data platform (Snowflake, Databricks, Fabric, BigQuery, Redshift, Azure SQL, etc.) they have staging tables, so they load the data first and then transform it.
So the argument today is not whether to do ETL or ELT. It is (almost) always ELT: you load the data into the target platform first, and then transform it. The argument today is whether you do ELT or ELTLT; in other words, whether you transform once or twice. If you have a Data Vault as an Enterprise Data Warehouse, then you do ELTLT: you load twice, you transform twice. First you transform the data from the staging tables into the Data Vault, then you transform the data from the Data Vault into the Kimball marts. If you don't have an Enterprise DW (many companies don't) and you only have a Kimball mart, then you transform only once: you transform the data in the staging tables into fact and dim tables, and that's it.
Here is an example of Data Vault on Redshift by George Komninos, Devika Singh and Simon Dimaline (link). They transform the data twice: first from the staging tables into the Data Vault, then from the Data Vault into the star schema. All three layers here are in Amazon Redshift, so the transformations utilise Redshift compute power.
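Here is a minimal sketch of those two transformation steps, written as SQL executed from Python via psycopg2 (Redshift speaks the Postgres protocol). The hub/satellite/dimension names and columns are hypothetical, and hash keys, hash diffs and end-dating are heavily simplified; it only illustrates the T-L-T shape: staging to Data Vault, then Data Vault to star schema, all inside the warehouse.

```python
import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="dwh", user="etl_user", password="secret",
)
cur = conn.cursor()

# Transform 1: staging -> Data Vault (hub and satellite).
cur.execute("""
    INSERT INTO dv.hub_customer (customer_hk, customer_id, load_date, record_source)
    SELECT MD5(CAST(s.customer_id AS VARCHAR)), s.customer_id, GETDATE(), 'crm'
    FROM staging.customer s
    WHERE NOT EXISTS (
        SELECT 1 FROM dv.hub_customer h WHERE h.customer_id = s.customer_id
    )
""")
cur.execute("""
    INSERT INTO dv.sat_customer (customer_hk, load_date, customer_name, country_code)
    SELECT MD5(CAST(s.customer_id AS VARCHAR)), GETDATE(), s.customer_name, s.country_code
    FROM staging.customer s
""")

# Transform 2: Data Vault -> star schema (Kimball mart).
cur.execute("DROP TABLE IF EXISTS mart.dim_customer")
cur.execute("""
    CREATE TABLE mart.dim_customer AS
    SELECT h.customer_id, s.customer_name, s.country_code
    FROM dv.hub_customer h
    JOIN dv.sat_customer s ON s.customer_hk = h.customer_hk
""")
conn.commit()
```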
Does this mean that ETL is dead? No. First of all, we still say "ETL tool", not "ELT tool". Even though those tools are doing ELT, we call them ETL tools. I suppose that is habit: we have been calling them ETL tools for 20 years, so we keep doing it.
The other situation where we use ETL instead of ELT is a data migration project (rather than data warehousing or a data lake). In data migration the target system is not a data warehouse or data lake but an operational application, so we create a data transformation area like this:
In the above picture we can see that the data is extracted from the source application into a data transformation area, where we do data transformation, data quality checks and data governance, with the data flowing from the staging area to the transformation area and on to the presentation area. Then the data is loaded from the presentation area into the target application. So in the case of data migration it is definitely not ELT; it is truly still E-T-L.
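To contrast with the ELT examples above, here is a minimal sketch of that migration-style E-T-L in Python, assuming a hypothetical legacy CRM database as the source and a REST API on the target application (endpoint, connection string and mappings are all made up). The transformation and data quality checks happen in the transformation area, outside both the source and the target.

```python
import pandas as pd
import requests
from sqlalchemy import create_engine

# E: extract from the source application's database (connection string is a placeholder).
source = create_engine("postgresql://user:password@source-db:5432/legacy_crm")
customers = pd.read_sql("SELECT customer_id, name, country, email FROM customers", source)

# T: transform in the migration's transformation area, outside the target application:
# cleansing, mapping to the target data model, and simple data quality checks.
customers["name"] = customers["name"].str.strip().str.title()
customers["country"] = customers["country"].str.upper().map(
    {"UK": "GB", "GB": "GB", "NL": "NL"}  # hypothetical country-code mapping
)
dq_failures = customers[customers["email"].isna() | customers["country"].isna()]
clean = customers.drop(dq_failures.index)

# L: load the presented records into the target operational application via its API
# (endpoint and payload shape are hypothetical).
for record in clean.to_dict(orient="records"):
    response = requests.post("https://target-app.example.com/api/customers", json=record)
    response.raise_for_status()
```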
Would appreciate your comments about ETL and ELT. Would love to learn from your experience.
Founder @ Bridge2IT +32 471 26 11 22 | Business Analyst @ Carrefour Finance
3 months ago
ETL and ELT are two data integration processes essential for managing and processing large datasets. ETL is traditionally used for on-premises systems where data is transformed before loading into the target system, ensuring high data quality upfront. ELT, on the other hand, leverages the power of modern, cloud-based platforms by loading data first and transforming it as needed, offering more flexibility and scalability. The choice between ETL and ELT depends on your data needs, infrastructure, and desired processing efficiency.
Database Doctor
3 months ago
This conversation comes up every now and again. It always reminds me of the "should I use an ORM?" conversation... IT professionals will jump through the craziest hoops to avoid learning SQL. Set yourself free and learn SQL, and the answer to these questions becomes obvious.
ETL is certainly not dead. As long as there are accessible tools available that make the job manageable, I don't see it going away anytime soon.
Consultant database, data warehouse, BI, data mart, cube, ETL, SQL, analysis, design, development, documentation, test, management, SQL Server, Access, ADP+, Kimball practitioner. JOIN people ON data.
3 months ago
Mostly ETL, because I love doing the T part to shape and fit data to the data model that L will fill up. E is sometimes missing data profiling of the source system's data model and source data, which can lead to poor data quality. When T is at the end, it can be for advanced report calculations or DAX calculations that depend on business users' search criteria, mixed with the wish to see data as-is (current values) or as-was (registered values). ELT, I think, is for smaller data sets where T can be done in memory.