The Changing Landscape of ETL/ELT Tools
Adam Morton
Empowering businesses to harness the full potential of data | Best-Selling Author | Founder of Mastering Snowflake Program
Thank you for reading my latest article, The Changing Landscape of ETL/ELT Tools.
Here at LinkedIn I regularly write about modern data platforms and technology trends. To read my future articles, simply join my network here or click 'Follow'. Also feel free to connect with me via YouTube.
------------------------------------------------------------------------------------------------------------
When I started my career in data over 20 years ago, the very first ETL tool I was exposed to was a Microsoft product called Data Transformation Services (DTS), which was part of SQL Server 2000.
Today DTS looks primitive compared with the ETL/ELT tools we have, although you would still recognise some of the same fundamental principles of data pipelines in it.
ETL
The world of technology never stands still, and 20 years is a long time in this industry. Back then, ETL tools often demanded dedicated compute resources - typically a dedicated server - for your ETL processes. This was in addition to the compute resources required for your data warehouse, and it existed to remove contention between the two services. You still see traces of this pattern today - think of Snowflake's virtual warehouses. Creating multiple warehouses in Snowflake to separate competing workloads remains standard practice; it's just a lot easier and a lot faster to set up and configure!
Back then you were heavily constrained by the power of the physical infrastructure you had available. This was because ETL tools applied business logic and transformation steps to incoming data by processing it in memory (where possible, for speed) before landing it in your data warehouse. This is where the concept of ETL was born: first you (E)xtracted your data from your source systems using your ETL tool, next you applied your (T)ransformations, before finally (L)oading the data into your data warehouse.
ELT
These constraints led to a variation on this approach called ELT, where you extract and load your data into your data warehouse and then use the processing power of the database to 'push down' your transformation logic to where the data sits. This approach reduces data movement between your ETL tool and the data warehouse, and aims to give data engineers more control over how the transformations are applied.
Many things I read tend to position ELT as something which came about due to the advent of the Cloud, but this isn't strictly true. Sure, the unparalleled scalability of resources in the Cloud helped greatly, but in 2007 I was working in a team applying ELT techniques by letting the power of the database take care of the transformations. In this scenario we used the ETL tool as an orchestration 'wrapper' to call stored procedures which encapsulated the transformation logic within the database. I still come across clients who have yet to make the shift to the cloud who employ this exact technique today!
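In 2007 that wrapper was a GUI ETL tool calling an on-prem database, but the same idea translates to a few lines of Python. Here's a minimal sketch, assuming Snowflake and the snowflake-connector-python package - the connection details and procedure names are placeholders for illustration:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder connection details - in practice these come from config/secrets.
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="***",
    warehouse="ETL_WH",
    database="ANALYTICS",
)
try:
    cur = conn.cursor()
    # The 'wrapper' simply calls stored procedures in sequence; all of the
    # transformation logic lives inside the database, next to the data.
    for statement in (
        "CALL staging.load_orders()",       # hypothetical procedure names
        "CALL marts.build_fact_orders()",
    ):
        cur.execute(statement)
finally:
    conn.close()
```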
The ever-changing nature of data
A few things that have changed since those days are the sheer volume and variety of data. Use cases featuring XML were rare and, as a result, the tools always struggled to 'shred' that data into a relational format. Now we don't just have XML; we have JSON, Avro and Parquet, amongst others.
We also have data arriving not only in daily batches, but also in micro-batches and as near-real-time event streams - from IoT sensors, for example. All of this needs to be handled and processed, ideally with the same tools, on its way to the data warehouse so data consumers can start to analyze the data and generate insights.
It's these developments, and the ever-changing shape and nature of data, accelerated by the advent of the Cloud, that have led to the next generation of ETL/ELT tools. In this post I am going to touch on some products which aim to fulfill these needs. Although these products aren't completely new, they are still maturing and improving all the time. The one thing they all have in common is that they are cloud-native - built from the ground up for the cloud and the new demands of business.
dbt
dbt, or Data Build Tool, covers the 'T' in ELT. It's predominantly SQL-based, meaning that most data professionals don't need to learn a new programming language. In dbt you create SQL files, each of which generally creates an object (a table or view) in your target data platform.
These SQL files are referred to as models and apply the business logic to the data - essentially you're declaratively scripting the final target table (or view) in code. Another neat feature of dbt is that it promotes modularisation: each model is created in isolation, but within each model you reference its upstream dependencies on other models in your dbt project (via dbt's ref() macro). This has several benefits. Because dbt is aware of your dependencies, it knows it needs to run model 'a' before model 'b', and it can surface lineage diagrams as part of its auto-generated documentation feature.
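To picture how that run ordering works, here's a toy Python sketch - emphatically not dbt's internals, and with made-up model names - of deriving an execution order from each model's upstream references:

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Toy dependency graph: each model maps to the upstream models it references.
deps = {
    "stg_orders": set(),        # reads straight from a source table
    "stg_customers": set(),
    "fct_orders": {"stg_orders", "stg_customers"},  # depends on both staging models
}

# A topological sort guarantees every model runs after its dependencies,
# which is conceptually what dbt does with the graph built from ref() calls.
print(list(TopologicalSorter(deps).static_order()))
# e.g. ['stg_orders', 'stg_customers', 'fct_orders']
```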
If dbt only takes care of the 'T', then you'll need a different approach to take care of the Extract and Load. This often leads organizations to use dbt in combination with a data loader tool such as Stitch or Fivetran, which can load data from various disparate sources onto your data platform. Fivetran, for example, has a native integration with dbt, which you can find out more about in this video. Alternatively, some customers who use Snowflake as their data platform may decide to use a Snowflake feature called Snowpipe. They can configure this to automatically ingest data from a cloud storage location, often referred to as a 'data lake' - typically Amazon S3, Google Cloud Storage or Microsoft Azure Blob Storage.
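As a rough sketch of what that Snowpipe setup involves - the pipe, stage and table names here are made up, and a real deployment also needs the cloud storage event notification wired up - the DDL run via the Python connector looks something like this:

```python
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
)

# AUTO_INGEST = TRUE tells Snowpipe to load files as soon as the cloud
# storage location (an external stage over S3, GCS or Azure Blob) notifies
# Snowflake that new files have arrived.
conn.cursor().execute("""
    CREATE PIPE IF NOT EXISTS raw.orders_pipe AUTO_INGEST = TRUE AS
    COPY INTO raw.orders
    FROM @raw.orders_stage
    FILE_FORMAT = (TYPE = 'JSON')
""")
conn.close()
```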
Regardless of the approach you take, because you'll be using a combination of tools alongside dbt, you'll need to consider how to bring these tools and processes together to orchestrate the end-to-end data flow. See the Airflow section towards the end of this article.
To find out more about dbt, check out one of my most-watched videos here.
Matillion
Matillion is a company which offers two separate products: Matillion Data Loader and Matillion ETL. It's a cloud-based SaaS solution which can run on Microsoft Azure, AWS or Google Cloud. Their services are wrapped up in an easy-to-use graphical interface, making it user-friendly and a good choice for customers coming from a tool such as SSIS or Informatica, as it will feel familiar.
Their Data Loader product allows customers to transfer data from a range of data sources into the cloud quickly and easily, and I often see this product used to support data migrations from on-prem systems to cloud warehouses.
The Matillion ETL tool applies transformations by converting them to SQL behind the scenes and executing that SQL directly on the data platform where the data lives. This push-down optimization approach allows Matillion to take advantage of the resources available in the underlying data platform. If you'd like to learn more about Matillion and get hands-on, check out my video.
Interestingly, and sometimes confusingly for customers, Matillion ETL also provides the ability to trigger dbt jobs!
Apache NiFi
Next up is NiFi, which was originally developed by the National Security Agency (NSA) and is now a top-level Apache project under an open-source license. My experience with NiFi has always been in conjunction with big data stacks such as Hortonworks and Apache Spark.
I actually used it with Hortonworks to collect telemetry data from cars as part of a pilot I executed for Admiral Insurance back in 2016. We presented the solution at the Hadoop Summit in Dublin, and if you're struggling to sleep you can watch the recording here.
NiFi is similar to Matillion in the sense that it has a graphical UI: users drag and drop components and link them up to create a data pipeline. Although I don't get involved in many NiFi projects these days, it is still out there. I'd be keen to hear about your use cases for NiFi in the comments below!
Airflow
I wanted to include a range of different tools which can be used for ETL/ELT to demonstrate the breadth of what is out there in the market. Airflow is not a dedicated ETL/ELT tool; it's more of an orchestration tool.
Due to its open-source nature, its range of built-in operators for connecting to modern data platforms such as Snowflake, Redshift and BigQuery, and its out-of-the-box support for Python, it is a very common tool to see either orchestrating data movement - by calling dbt jobs, for example - or executing code directly against the data in the data platform itself.
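As a rough sketch of that pattern - the DAG name, commands and project path are placeholders, and it assumes the dbt CLI is installed on the Airflow workers - a daily pipeline that loads raw data and then triggers a dbt run might look like this:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older 2.x versions use schedule_interval
    catchup=False,
) as dag:
    # Stand-in for whatever kicks off your data loader (Fivetran, Snowpipe, etc.)
    load_raw = BashOperator(
        task_id="load_raw",
        bash_command="echo 'trigger your data loader here'",
    )
    # Run the dbt project once the raw data has landed.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt/my_project",
    )
    load_raw >> dbt_run  # load first, then transform
```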
To stay up to date with the latest business and tech trends in data and analytics, make sure to subscribe to my newsletter, follow me on LinkedIn and YouTube, and, if you're interested in taking a deeper dive into Snowflake, check out my books 'Mastering Snowflake Solutions' and 'SnowPro Core Certification Study Guide'.
-----------------------------------------------------------------------------------------------------------
About Adam Morton
Adam Morton is an experienced data leader and author in the field of data and analytics with a passion for delivering tangible business value. Over the past two decades Adam has accumulated a wealth of valuable, real-world experience designing and implementing enterprise-wide data strategies and advanced data and analytics solutions, as well as building high-performing data teams across the UK, Europe, and Australia.
Adam’s continued commitment to the data and analytics community has seen him formally recognised as an international leader in his field when he was awarded a Global Talent Visa by the Australian Government in 2019.
Today, Adam works in partnership with Intelligen Group, a pure-play Snowflake data and analytics consultancy based in Sydney, Australia. He is dedicated to helping his clients overcome challenges with data while extracting the most value from their data and analytics implementations.
He has also developed a signature training program that includes an intensive online curriculum, weekly live consulting Q&A calls with Adam, and an exclusive mastermind of supportive data and analytics professionals helping you to become an expert in Snowflake. If you’re interested in finding out more, visit www.masteringsnowflake.com.