Who will win the ETL tool race?

I was catching up on Substack this morning and ran across an article, "A Zero ETL Future," that I found provocative and interesting.

ETL, or "Extract, Transform, Load" is the data engineering practice that makes data usable for analysis and reporting. The work product is a data pipeline, an automated set of processes that move and reshape data from formats that are suitable for fast business operations, to formats that enable reasonably fast business analysis.?

The problem has been that constructing data pipelines can be mind-numbingly complex, lengthy, and costly. The resulting ETL code is often inflexible in the face of business change, error-prone, and easily broken.

"Zero ETL" is a wistful concept discussed by data engineers that would allow data to flow from sources to targets with no, or minimal transformation. A few companies in the data ecosphere, including AWS, Snowflake and Databricks have been making noises around limiting the amount of ETL necessary for data analysis, generally by making it easier to query original sources of data, and by making analytical query engines more powerful. Shorter, simpler data pipelines would mean less development time, less maintenance, less cost, and faster access to data.?

Until recently, I worked at Incorta, a data management and analytics platform that has approached the data pipeline concept differently. Incorta touts its ability to analyze business application data with minimal ETL. It does this with a unique capability that pre-computes some of the most common data transformations, so data analysis is essentially done via views on the original data. The benefits of this approach are numerous, including low pipeline development and maintenance costs, the ability to drill from high-level business metrics to transaction-level detail, and improved analytical flexibility.

This experience has led me to believe that analytics against transaction-level, normalized data (the format used by most business applications) will become the norm within a year or two. New analytics engines in the works, such as Databricks' Photon or the open-source StarRocks (https://www.starrocks.io/) database, can replicate some of Incorta's attributes. These engines show that a shallow view layer on top of the original data can deliver good performance in a form flexible enough to serve different analytical uses. If this goal is achieved, the business benefits will be numerous (I've counted nine) and compelling, with business agility at the top of the heap.
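To give a flavor of that view-layer idea, here is a small sketch using DuckDB as a stand-in analytical engine. The engine choice and the schema are my own illustration, not a description of Incorta, Photon, or StarRocks internals; the point is that the analysis-friendly shape is defined as a view over the normalized tables rather than materialized by a pipeline.

```python
# Sketch: analytics as a view over normalized, transaction-level data.
# DuckDB and this schema are illustrative stand-ins only.
import duckdb

con = duckdb.connect()  # in-memory database

# Normalized "operational" tables, as a business application might store them.
con.execute("CREATE TABLE customers (customer_id INT, region VARCHAR)")
con.execute(
    "CREATE TABLE orders (order_id INT, customer_id INT, amount DECIMAL(10,2), order_date DATE)"
)
con.execute("INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC')")
con.execute(
    "INSERT INTO orders VALUES (10, 1, 120.00, DATE '2024-01-15'), (11, 2, 80.00, DATE '2024-01-20')"
)

# A shallow view layer instead of a copied-and-transformed dataset.
con.execute("""
    CREATE VIEW sales_by_region AS
    SELECT c.region,
           date_trunc('month', o.order_date) AS month,
           SUM(o.amount)                     AS total_sales
    FROM orders o
    JOIN customers c USING (customer_id)
    GROUP BY c.region, date_trunc('month', o.order_date)
""")

# Analysts query the view; joins and aggregation happen at read time, so the
# underlying transaction rows stay available for drill-down.
print(con.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall())
```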

However, some amount of data pipelining will always be necessary. As the author of that article points out, truly eliminating big-'T' transformations would require application vendors to make their data available in a more consumable format. The author also points out the need to maintain historical data, as business applications are generally weak at capturing trends over time.

But one other necessary element that the author does not mention is semantics: metadata, or data about the data. Data in its raw form is inaccessible to non-specialists. Even Oracle E-Business Suite, with its relatively modern relational structures, is breathtakingly complex and vast, requiring expert knowledge to know where to tap in and how to interpret what you find. And Oracle EBS is one of the best; sit an unmedicated data engineer in front of an SAP R/3 database for the first time and they will very likely run screaming from the building within a few hours.

To make data usable by even the data literati (analysts, data scientists), they need answers to questions like: "What data is available that is relevant to me?", "What is the nature of this data?", and "What other data should I be paying attention to?" They also need to understand the source, timing, veracity, and reliability (quality) of the data.
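As a rough sketch of what answering those questions takes, here is a hypothetical metadata record for a single dataset. The fields are my own illustration of the kind of "data about the data" involved; they do not come from any particular catalog product.

```python
# Hypothetical shape of a catalog entry answering "what is this data, and can I trust it?"
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DatasetMetadata:
    name: str                                  # e.g. "oracle_ebs.ap_invoices_all"
    description: str                           # plain-language meaning of the dataset
    source_system: str                         # where the data originates
    refreshed_at: datetime                     # timing: how current is it?
    owner: str                                 # who to ask about it
    quality_score: float                       # veracity/reliability, 0.0 to 1.0
    related_datasets: list[str] = field(default_factory=list)  # what else to look at

invoices = DatasetMetadata(
    name="oracle_ebs.ap_invoices_all",
    description="Accounts payable invoice headers",
    source_system="Oracle E-Business Suite",
    refreshed_at=datetime(2024, 1, 31, 6, 0),
    owner="finance-data@example.com",
    quality_score=0.92,
    related_datasets=["oracle_ebs.ap_invoice_lines_all"],
)
```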

So, in addition to easier access to data sources, better tools for (visually) constructing and operating data pipelines, and better analytics engines capable of analyzing raw data, we need better data about the data, integrated directly into those tools.

There are data cataloging products that can provide a view of selected data assets and give users information about the structure and usefulness of the data. But these tools generally: a) do not cover all the data used, from raw sources to a business metric on a dashboard, and b) are not available inside the tools, where those who construct or consume data pipelines need that information most.

We are missing a widely adopted standard for metadata that would allow data catalogs and data pipeline tools to interoperate. I think this is needed because true "No ETL" data pipelines (or at least, the leaner pipelines we will still need to build) will only come about when we can start to reuse and build up a library of semantic data. The terrifying prospect of decoding an SAP database should only need to be experienced once.

An industry-adopted standard for metadata exchange could get us a long way towards a "Zero ETL" future, with pre-built pipelines eventually coalescing into sharable marketplaces of domain-specific use cases. A number of ETL tool vendors are creating visual low-code or no-code tools that simplify data pipeline building, and this would be an ideal integration point for creating and consuming rich metadata, or "data about the data."
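To make that integration point tangible, here is a hypothetical payload that a low-code pipeline tool might publish and a data catalog might consume. The structure and field names are invented for this sketch and do not describe any existing standard or product.

```python
# Hypothetical metadata-exchange payload; the format is illustrative only.
import json

pipeline_metadata = {
    "pipeline": "ebs_ap_to_warehouse",
    "inputs": [
        {"dataset": "oracle_ebs.ap_invoices_all", "semantics": "AP invoice headers"},
    ],
    "outputs": [
        {"dataset": "warehouse.fct_ap_invoices", "grain": "one row per invoice"},
    ],
    "transformations": [
        {"step": "join_suppliers", "description": "Resolve supplier names and sites"},
    ],
    "quality_checks": [
        {"check": "row_count_match", "status": "passed"},
    ],
}

# A serialized form that catalogs, pipeline designers, and BI tools could all read.
print(json.dumps(pipeline_metadata, indent=2))
```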

I think the ETL tool vendor that integrates visual design, low-code, metadata, and observability will be a winner. What do you think?
