Building a Data-Driven Future: Part 2 - Six ELT Challenges Nobody Tells You
This post is part of a five-blog series focused on how businesses can plan and build for a data-driven future. The blogs walk readers through the fundamental dynamics and nuances of modern data management and processing. My first blog described the five approaches to data processing. This blog takes a deep dive into ELT: why it became popular, when it works well, and when it doesn't.
The ELT approach to data pipelining is in vogue and is attractive for simple use cases due to fast time to value and ease of use. But as your data needs grow, you will likely hit a wall if you rely on ELT alone. As the market over-rotates towards ELT, providers are not always telling you about the challenges you might face when you go all-in. You need to be aware of these and think holistically to avoid hitting that wall and regretting investment decisions about your data engineering infrastructure.
To learn about the E, L, T, and the different data processing mechanisms - ELT, ETL, API Integration, API Proxy, and Data-as-a-service, check out the first blog in this series - Five Approaches to Data Processing.
Gradually, then Suddenly: The rise of ELT
Those who have been in the data space know that ELT isn't a new approach. In fact, users of Teradata, Netezza, Exadata, and other on-premises data warehouses have leveraged ELT for years. What has happened in the last three to four years, however, is that ELT has become incredibly popular. ELT's rise was driven by the trifecta of technology, people, and process. Let's take a trip down memory lane to understand how ELT got here.
Gradually
2010-2013
In the early part of the last decade, Data Lakes became an important part of the data framework. With storage becoming cheap, the logical thing to do was to put all data into a central lake from which it could be provisioned to users. Hadoop became a popular Data Lake system given its ability to store and process virtually unlimited amounts of data. In fact, Nexla co-founder Avinash built one of the largest Hadoop infrastructures in the world, one that amassed 300 billion records, or 500TB, of new data every day.
2014-2015
Unfortunately, Hadoop was not just slow; it was also very hard to manage as an infrastructure. Getting data out of Hadoop required complex MapReduce jobs written by engineers, and there weren't enough engineers. Waiting and watching for handoffs was not really a process. Hive did give analysts a SQL interface, but it couldn't overcome the fact that getting data out of Hadoop was too slow for analysts who were used to quick responses from database queries.
Meanwhile, data teams kept getting bigger, particularly on the analytics side, as business demanded more reports and dashboards. Data engineering skills continued to be scarce - that is true even today. So on one hand, the already scarce data engineering expertise was focused on Hadoop, Kafka, and other complex technologies. Simultaneously, the number of data analysts who could query data via SQL kept growing.
Around this same time, in early 2014, cloud Data Warehouses (DWH) such as Redshift started to become popular for use cases where Hadoop infrastructure was overkill. This was a good alternative, as most companies didn't have data volumes that needed Hadoop's scale. Now analysts could write straight-up SQL, and the columnar storage of these data warehouses gave fairly responsive query performance for typical use cases. A better way to work with data at scale was on the horizon.
2016-2020
While all this was happening, Snowflake started to pick up momentum as it overcame the performance bottlenecks of Redshift - e.g. parallel query performance. Meanwhile, Looker gained steam as an easy-to-use, cloud-native solution that leveraged the scalability of cloud data warehouses while empowering analysts who had SQL expertise.
… and then Suddenly
The challenge for analysts now was getting data into the cloud DWH. This is when ELT started to become popular powered by these five factors:
- Cheap Storage: Cheap storage made data replication economically feasible. You could Extract (read) and Load (write) everything without transforming or filtering it first.
- Pre-defined models: The only gotcha was that when writing data to a DWH you needed to define the tables, data-types, and object relationships. To solve this, ELT providers like Fivetran and Stitch created pre-defined models for popular data sources such as SaaS apps - Salesforce, Marketo, etc.
- Ease of Use: No transform on extract meant there was little user input needed when copying data as-is from one system to another. Now any user could provide credentials to, say, Salesforce, and data would get replicated. Minimal user input meant it was easy to use.
- SQL Transforms: The transform stage moved into the DB/DWH. Since analysts already knew SQL, they could craft a query that read data from a table, performed a join, and wrote the transformed result into another table (see the sketch after this list).
- No need for DBAs: The one system-level concern was that traditionally DBAs (Database Administrators) had been very particular about which users could create and modify database tables. After all, decisions such as table indexing, rebalancing, and vacuuming had material performance impact and required a DBA's expertise. Snowflake and other cloud warehouses obviated this concern by automating many of these DBA tasks. Now anyone could create or modify a table with little concern about performance impact.
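To make the SQL Transforms point concrete, here is a minimal sketch of what an in-warehouse transform looks like. All schema, table, and column names below (raw.raw_orders, raw.raw_customers, analytics.customer_revenue) are hypothetical; the pattern is simply a CREATE TABLE ... AS SELECT that joins and aggregates data the EL step has already loaded.

```sql
-- A minimal sketch of an ELT transform: the "T" is just a SQL statement
-- the analyst runs inside the warehouse on tables loaded by an EL tool.
-- Table and column names are hypothetical.
CREATE TABLE analytics.customer_revenue AS
SELECT
    c.customer_id,
    c.customer_name,
    SUM(o.order_total) AS lifetime_revenue,
    COUNT(*)           AS order_count
FROM raw.raw_orders o
JOIN raw.raw_customers c
  ON o.customer_id = c.customer_id
GROUP BY
    c.customer_id,
    c.customer_name;
```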
None of these steps - whether replicating data or writing SQL transforms - needed engineers. The result of all these technology choices was that ELT freed analysts from their dependency on data engineering resources. Fewer handoffs and no waiting in a backlog was a welcome change, and it suddenly drove ELT's popularity.
Did you know?
SQL or Structured Query Language is considered a 4th generation programming language.
Programming languages are organized into generations one through five based on the level of automation and complexity.
First generation programming languages (1GL): Machine code entered as 1s and 0s. No one writes this today but there was a time when people did.
Second generation programming languages (2GL): Assembly programming languages. These days only firmware engineers, BIOS programmers, and kernel developers need to write assembly.
Third generation programming languages (3GL): Most programming languages fall within this classification. C, C++, Java, Python etc.
Fourth generation programming languages (4GL): SQL is a 4GL, an advancement over 3GL that gives a more user-friendly, near human-like interface, but with some structure. Databases proved to be a great candidate for 4GL because data is structured in tables, and operations like selects, joins, and aggregates are well defined. 4GL doesn't work well when operations can be highly varied, and we resort to 3GL in that case. That's why transforms coded in SQL are easy to create, but have limitations in the complexity of logic that can be applied.
Fifth generation programming languages (5GL): This is a future ideal where nearly human language can provide a computer with instructions and constraints. That hasn’t happened yet for programming computers, although we can invoke existing programs with voice commands like we do with Alexa, Siri, and Google Assistant.
6 ELT Challenges Nobody Tells You
Technology experts often hear questions like, "What is the best programming language?" The answer is always: "It depends on what problem you are trying to solve." You may use Python for data science work, JavaScript for your web pages, Java for creating web services, and C or C++ for programming devices. While these are common practices, the only rule is that there is no hard and fast rule. For example, frameworks like Node.js have made JavaScript suddenly suitable for creating web services, which would previously not have been a great idea.
The same rule applies for data integration and processing. “Which of the five methods should you be using? Depends on the problem you are solving.”
The benefits of the ELT process around performance and operational resilience are well known, so let's look at the shortcomings that get overlooked amid the hype. This will help you make an informed choice about the ELT approach.
Challenge 1: Not all data will end up in a Warehouse
ELT is easy, and analysts can set it up themselves. What's the catch then? The catch is that with ELT, the destination system needs to be a DB or a DWH - a system with compute capabilities to execute the SQL transformations. So what happens when data needs to be sent to an application, an API, a SaaS system, a data stream, a spreadsheet, a file, or an email? Now you need to perform the transform step to get data into the right structure before delivery, and you may need to use one of the other methods such as ETL, API Integration, API Proxy, or Data-as-a-service.
Challenge 2: Managing Transforms with ELT can get very complicated
Ask any engineer and they will tell you that they love SQL for simple queries, and for certain actions like joining structured data it is the absolute best. But try doing anything complex, with intricate logic, and you end up with a monstrosity of code. This is exactly the reason engineers invented ORMs (Object-Relational Mappers) as a way to programmatically interact with databases. When using ELT, be aware that SQL gets very complicated and unmaintainable very quickly. While solutions like dbt are making it easier to manage SQL like software code, they can't overcome the basic fact that SQL isn't designed for the complex transformations you may need.
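As a hedged illustration of how quickly this happens, here is a sketch of a transform that started life as "join events to users" and has already picked up deduplication, window functions, and business rules expressed as CASE logic. Every table, column, and rule below is hypothetical; real pipelines accumulate many more layers of exactly this kind.

```sql
-- Hypothetical example: a once-simple join that has grown extra layers.
WITH deduped_events AS (
    -- Deduplicate raw events, keeping the most recently loaded copy
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY loaded_at DESC) AS rn
    FROM raw.web_events
),
latest_events AS (
    SELECT * FROM deduped_events WHERE rn = 1
),
enriched AS (
    -- Business rules creep in as CASE logic
    SELECT e.user_id,
           CASE
               WHEN u.plan = 'enterprise' AND e.event_type = 'login'
                   THEN 'key_account_login'
               WHEN u.plan = 'free' AND e.event_type IN ('upgrade_view', 'pricing_view')
                   THEN 'upsell_signal'
               ELSE 'other'
           END AS event_bucket
    FROM latest_events e
    LEFT JOIN raw.users u ON u.user_id = e.user_id
)
SELECT user_id,
       event_bucket,
       COUNT(*) AS events
FROM enriched
GROUP BY user_id, event_bucket;
```

Each new requirement tends to land as another CTE or another CASE branch, and that is where maintainability starts to suffer.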
Challenge 3: ELT doesn’t work with Real-time Data
It goes without saying, but if you put a database or DWH in the data path, your data flow can't be real-time anymore. For many use cases, this is a non-starter. If you are taking online order information in a store and formatting (transforming) it into an email notification, it wouldn't make sense to use ELT in the data path. Instead, a webhook-triggered real-time event, with an in-stream transform and email as the final destination, will work much better.
Challenge 4: ELT may create lock-in and make you less future-ready
Think you will never migrate to another DWH? Think again.
Ask any of the teams trying to migrate from Netezza or Teradata to a modern cloud DWH. Replicating data into the new system is easy, but they have so much code built on top of these DWHs that migration becomes an expensive nightmare. What is all this code? A lot of SQL and stored procedures. While you may try to stick to standard ANSI SQL, it is almost impossible not to leverage the full capabilities of your DWH. A few years later, all your ELT processes will be tied to the new DWH vendor, and if you ever need to migrate again, the cycle of pain starts all over.
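As a small, hedged illustration of how this creeps in (all table and column names hypothetical), consider a transform written against Snowflake-style extensions: the ':' path syntax for semi-structured data, LATERAL FLATTEN, and QUALIFY. None of these are plain ANSI SQL, so every transform written this way becomes a line item in your future migration project.

```sql
-- Hypothetical query using Snowflake-specific features; porting it to
-- another warehouse means rewriting the semi-structured access and the
-- QUALIFY clause in that vendor's dialect.
SELECT
    o.order_id,
    item.value:sku::string AS sku,
    item.value:qty::number AS qty
FROM raw.orders o,
     LATERAL FLATTEN(input => o.payload:line_items) item
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY o.order_id, item.value:sku::string
    ORDER BY o.loaded_at DESC
) = 1;
```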
Challenge 5: Data Replication and the Copy of a Copy Problem
One of the biggest governance challenges for enterprises is multiple, confusing copies of data, which make it hard to know which copy is needed and which can be deleted. During an audit, one big asset management firm found 17 copies of its client database sprinkled around its infrastructure. Needless to say, the copies were out of sync and there was no single source of truth.
While cheap storage and easy replication via ELT make it easy to create new copies, good governance is essential to avoid this problem. Small and medium businesses may not see this issue much, but enterprises have a real copy management problem. By contrast, streaming and real-time mechanisms don't create persistent copies of data; data is transformed and used at run time.
Challenge 6: Beware of Bigger Bills
Ask any large enterprise and they will tell you that storage is cheap, but compute is the more expensive part of their cloud DWH TCO (Total Cost of Ownership). ELT may not contribute much to the cost initially, when you are small and the transforms are simple enough not to be cost-prohibitive. But beware: as your needs grow, you may start getting hit with bigger bills.
The Nexla take on ELT
Nexla was built with a vision of data flow across organizations that is orthogonal to how other companies approach data processing. We recognize the tremendous importance and value of the ELT process. As a result, our approach to data integration supports all five styles, including ELT.
Some of the key ways in which we make ELT scalable include:
- Collaboration: Being fundamentally collaborative means multiple users can be working together writing code, as well as reusing each other’s work. This means not everyone needs to be a SQL expert and the same code doesn't need to be re-written.
- Versioning: Source control is automated, as Nexla automatically versions all transforms.
- Hybrid Data Flows: This is a super powerful approach that allows mixing of integration methods. For example, you can do an EL process to replicate data. Thereafter, the T can be SQL code doing the joins and aggregates that SQL is ideal for. However, instead of creating a new table to store the results of the T, the output can be fed live into the T and L phases of an ETL process. As a result, you have created a unique flow that mixes the best of ELT and ETL.
Finally, our thinking is that data challenges are complex. Use the best tool, or combination of tools, that gives you an easy, performant, and scalable solution for your team, all the while keeping yourself future-ready!
Good Luck! And feel free to send your questions to [email protected]
_______________
In the next blog, we will cover why data processing is broken for enterprises. After all, this space has existed for 30+ years. What has been the progression of technology and what can we learn to better solve data challenges in the future? After that, we will cover the current state of the data ecosystem, and the future of data with a look into emerging concepts like data fabric and data mesh. Stay tuned.
This article is a cross-post from the Nexla blog: Six ELT Challenges Nobody Tells You.