Tales From the Cloud Data & Analytics Crypt #5 - The shadow moving your ETL to the cloud...
Bartlomiej Graczyk
Chief Technology Architect | Tech Evangelist & Public Speaker | Cloud Data, Analytics and AI | MVP Data Platform | Demystifying complexity of data solutions in the cloud | Helping others to grow in tech roles |
Welcome, my dear crypt keepers, to another tale of the digital underworld where the very fabric of data is woven, spun and transformed into the fabric of our modern lives. On this night, we delve into the realm of Enterprise Extract-Transform-Load (ETL) tools, where two mighty beasts rule supreme: SQL Server Integration Services (SSIS) and Azure Data Factory (ADF). SSIS has been the chosen tool for those who traverse the Microsoft SQL Server landscape. But with the recent addition of ADF to its stable, the landscape has shifted. In this post, I'll unravel the similarities and differences between these two beasts, and explore the strengths and weaknesses of each. I'll also cast light upon the architectures that best suit each of these tools, and reveal some secrets.
For our first part of a tale, let me speak of SSIS, known to many of you, but for the newbies among us, let me offer a quick recap. SSIS is ETL software that comes bundled with the commercial versions of SQL Server. SSIS encapsulates its ETL work into packages that can be executed individually or in groups. As an installed software, SSIS requires a physical or virtual machine to run. However, this also makes it highly customizable through the built-in scripting tools or third-party vendor offerings. The cost to run SSIS packages is nonexistent after installation, and the development environment can be created using SQL Server Developer Edition for free. SSIS is a stable and mature product that has adapted to both local and cloud-based data connections.
Now, for our second part, we shall speak of ADF aka Pipelines in Azure Synapse Analytics(although it is not the same), a "newcomer" to the ETL scene. ADF is designed as a serverless ETL service based on the Microsoft Azure platform. ADF runs entirely in the cloud (besides of Self hosted integration runtime which can eventually land in on-premises environment if needed, to provide connectivity between cloud and on-premises world), with no software to install and no operating system to configure. The cost of ADF is based on usage, and its flexible licensing model allows for pay-as-you-go rather than upfront cost. Scaling up or down is planned to be effortless in ADF, and the cost reflects the level of performance configured. A data factory in ADF contains pipelines, similar to SSIS project and package structures, and activities can be invoked manually or through scheduled triggers. On top or in parallel we have very similar (yes similar is good word) functionality in Azure Synapse Analytics called Pipelines. In theory concept is the same as in ADF, the main difference - there is no feature parity in both which means something which is landing in ADF will not necessary be added to Synapse Pipelines - that's the reality.
Also deep in the heart of the cloud ADF, there lies a hidden gem, a secret exception to the common belief. Azure Data Factory, the powerful data processing tool, possesses a mysterious capability that has long been shrouded in mystery. It whispers of an SSIS runtime, a place where SSIS packages can be stored and executed as if they were in an on-prem instance.
Organizations seeking to embark on a journey towards the cloud, or those who wish to keep a part of their SSIS infrastructure intact, listen closely. This SSIS integration runtime (IR) allows you to move to Azure Data Factory, yet still keep your precious SSIS assets safe and secure. It appears as a conventional SSIS catalog, with a familiar user interface and monitoring tools that have been tried and tested. But there's more, it runs entirely on a serverless service, meaning that there's no need to install SQL Server or maintain a server machine just for running SSIS packages.
So, my dear crypt keepers, join me on this journey as we delve into the world of ETL and compare these two beasts.
As we already know SSIS has been around for nearly two decades, and in that time has built up a large user community with a wealth of online resources, tutorials, and articles. It’s a highly customizable tool, with a wide range of built-in tasks and transformations, as well as the ability to extend the tool’s functionality with third-party components or custom code. This level of customization can be a double-edged sword, however, as it can make it challenging for new users to learn the tool, or for experienced users to troubleshoot when things go wrong.
ADF, on the other hand, is a relatively new player in the ETL space, but has been designed with the cloud in mind from the ground up. Its user interface is highly intuitive and easy to navigate, with a clear separation of data connections, sources, destinations, and transformations. It’s also been optimized for performance and scalability, with the ability to process vast amounts of data in parallel using Azure resources such as Azure Data Lake Storage, Databricks and other. While this level of performance and scalability can be difficult to achieve with SSIS, it comes at a cost of reduced flexibility and customization.
In terms of functionality, SSIS has an edge over Azure Data Factory with a broader library of built-in control flow and data flow operations. With granular row-level error handling and a vast array of native components and add-ins, users can accomplish a multitude of tasks without needing to write any code. While ADF is still evolving and catching up in terms of features, it still has a way to go to match the capabilities of SSIS. However, even if an organization primarily uses ADF, it can still leverage the power of SSIS by creating an SSIS package and deploying it to the ADF SSIS integration runtime, unlocking the full potential of SSIS-specific functionality.
领英推荐
The crypt keeper whispers, "Performance be a tricky one, there be many variables to consider"
With the small file and no transformations, SSIS and ADF were neck and neck. But with a touch of transformations, SSIS raced ahead, leaving ADF in the dust. On the other hand, the large file ADF come out as the victor, with swift copy operations and faster ETL processing. But why was ADF slower on the small file? Well, it seems that when using a mapping data flow in ADF, a Databricks cluster is summoned in the background, slowing down small loads. But fear not, designed as an ELT operation, ADF can be lightning fast.
Optimizing runtimes in both SSIS and ADF can be done with ease, with SSIS having a greater control over resources. But, in the world of ETL, the answer to performance issues is...more resources. And that's where ADF shines, with the ability to adjust performance settings with a few clicks, even just for a single execution. Pay as you go and crank up the performance, that's the way of the cloud. And remember, ETL operations don't always run around the clock, so flexibility and lower cost be a ghostly good deal
Ah, the age-old question of cost comparison between SSIS and ADF.
The answer, my friend, is not as straightforward as you may hope. For, you see, these two tools have vastly different pricing models. SSIS operates on fixed licensing, while ADF follows a pay-as-you-go approach.
But beware, for there are hidden costs lurking in the shadows. Maintaining an on-prem SQL Server, for instance, involves more than just database maintenance. There's the time and effort to install the operating system, keep up with patches, pay for hosting and physical space, and other administrative tasks.
Verily, a wise craftsman knows that the right tool for the job must be chosen.
Let it not be thought that these tools are enemies, but rather, that they are complementary. For those who have invested in SSIS, yet seek to venture into the cloud, a data factory can be created to bridge that gap. And for those who call ADF home, but require the power of C# scripting, an SSIS package can be deployed either locally or within the haunting halls of an ADF SSIS runtime. Choose wisely, for the right tool shall bring success.
Data Solutions Tech Lead at Elitmind, Music Composer & Piano Player at Home ;)
2 年(1/2) I could describe the pros and cons in pages for both products.? But reducing it to the top list - the biggest advantage of ADF in my opinion is the possibility of parameterization, including column mapping, which in the case of SSIS ended up generating the package programmatically. In ADF, mappings don't have to exist and happen on-the-fly. Then "The Power" of Azure IR and Managed VNET + Mapping Data Flows. The biggest problem? Difficult to estimate costs and latency.? We are charged by every orchestration event, every internal or external activity and data movement extra. Once we will generate pipelines with nested pipelines, create a logic that connects to metadata, run the logic in loops, it can be hard to estimate pricings.