The Data Virtualization Performance Question

Having now worked in the #datavirtualization space for nearly 7 years with #denodo, it recently dawned on me that I have more practical experience using #datavirtualization than not only most people in the world (obviously :-), but also most technologists who work in the data space. When I first started working with #denodo, I was quite surprised at the number of people who work in the #datamanagement space every day yet were completely unfamiliar with #datavirtualization technology, which to my mind is a must-have in any modern #dataarchitecture for enterprises with large numbers of disparate data sources. As I traveled the country meeting with prospects and speaking and presenting at conferences and technology events, I found myself becoming something of a #datavirtualization evangelist (kinda like a televangelist minus the slick silk suit, Rolls-Royce, helicopter, and yacht :-). Spreading the gospel of this innovative way of querying data from multiple sources without having to migrate the data was what this #datavirtualization evangelist spent most days doing.

With this technology being a fundamentally new concept to most prospects I've spoken with, I've heard just about every question one can ask about it, from the most skeptical to the genuinely curious. I think it would be helpful for the larger #data and #datamanagement community to better understand this technology, so I will periodically share a bit of what I've learned about it over the years and why it deserves serious consideration in your #dataarchitecture now and forevermore. One of the first areas of skepticism for those with a strong technical background learning about #datavirtualization for the first time is performance. How does it perform given that data has to be pulled from multiple sources in real time? To someone with a strong technical background, that question immediately raises alarms that this approach could be deadly slow.

Following is a response I gave in a #dataandanalytics group to a #datafabric post. I had commented that the best way to implement the #dataintegration component of a #datafabric is to utilize #datavirtualization. The author of the post replied that in his past dealings with #datavirtualization the performance wasn't up to par, so I responded as follows. May this response help put to rest whatever performance questions you may have in your mind about #datavirtualization.

I think those criticisms regarding performance may have been valid say 8-10 years ago. In practice, though, I've been involved with numerous #datavirtualization projects over the past 6+ years with multibillion-dollar companies whose humongous datasets, on the order of hundreds of millions to billions of rows, had to be integrated from different sources. We've achieved a level of performance that is on par with, and within the same order of magnitude as, what you would get if all the data were physically moved to a central location via #ETL jobs. By analyzing the technical metadata and understanding the #dataprofiling characteristics of the relevant data sources, the optimization algorithms employed through #datavirtualization emphasize maximizing query pushdown to the data sources. This minimizes how much data is retrieved from each data source, how much data is transferred across networks, and how much data has to be processed within the virtual layer. When federating data across multiple disparate data sources, much of the total execution time is spent transferring data across networks into the virtual layer. Minimizing that transfer by pushing as much execution as possible down to the data sources is therefore the key to maximizing performance with data virtualization.
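To make the pushdown idea concrete, here is a minimal, runnable sketch of the difference it makes. This is not Denodo's optimizer; it simply stands in a single in-memory SQLite table for a remote source and contrasts pulling every row into the virtual layer against pushing the filter and aggregation down into the source's own SQL. Every table, column, and number here is made up for illustration.

```python
import sqlite3

# Hypothetical stand-in for a remote data source: a single in-memory SQLite
# table of order rows. In a real deployment this would be a warehouse or
# operational database reached over the network.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, f"2024-{(i % 12) + 1:02d}-15", float(i % 500)) for i in range(100_000)],
)

# Naive federation: pull every row into the virtual layer, then filter and
# aggregate locally. All 100,000 rows cross the "network".
all_rows = source.execute("SELECT order_id, order_date, amount FROM orders").fetchall()
naive_total = sum(amount for _, order_date, amount in all_rows if order_date.startswith("2024-03"))
print(f"naive: transferred {len(all_rows):,} rows, March total = {naive_total}")

# Pushdown: rewrite the filter and the aggregation into the source's own SQL,
# so a single summary row is all that crosses into the virtual layer.
(pushdown_total,) = source.execute(
    "SELECT SUM(amount) FROM orders WHERE order_date LIKE '2024-03%'"
).fetchone()
print(f"pushdown: transferred 1 row, March total = {pushdown_total}")
```

Both plans return the same total, but the naive plan ships 100,000 rows across the wire while the pushdown plan ships exactly one summary row, which is the whole point.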

This simple answer helped to address the initial concern, but let me give a concrete example to drive the point home more clearly. Consider a scenario with 1B sales transactions in one source, 5M customers in another source, and 5K products in a third, with the requirement to get total sales by customer, in a particular zip code, for a particular product line, over a particular time period. The traditional solution would be to #ETL and copy data from these 3 sources into a single location that can then be queried to answer the desired question. In many organizations these sorts of processes can take days, weeks, or longer, hindering the organization's ability to make data-driven decisions. Using the #datavirtualization platform from #denodo specifically, the query optimizer automatically rewrites the queries sent to the underlying data sources so that they (1) filter the 1B sales transactions to only those in the relevant time period, (2) filter the 5M customers to only those within the specified zip code, (3) filter the 5K products to only those in the specified product line, and (4) aggregate the filtered sales transactions by customer. The final combination and aggregation of the data coming from the 3 sources is then done within the virtual layer, but pushing these filtering and aggregation queries down to the underlying data sources first drastically reduces the amount of data that has to be transferred across networks, and that is the key to maximizing performance in these federated query scenarios.
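To walk through those four rewrite steps end to end, here is a scaled-down, hypothetical sketch. It again uses in-memory SQLite databases as stand-ins for the three sources, with made-up row counts, zip code, and product line; it is not Denodo's actual rewritten SQL, just an illustration of which work lands at the sources and which lands in the virtual layer.

```python
import sqlite3

# Three hypothetical, independent sources (scaled way down from the 1B / 5M / 5K
# figures above), each modeled as its own in-memory SQLite database.
sales_db, cust_db, prod_db = (sqlite3.connect(":memory:") for _ in range(3))

sales_db.execute("CREATE TABLE sales (cust_id INT, prod_id INT, sale_date TEXT, amount REAL)")
sales_db.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(i % 1000, i % 50, f"2024-{(i % 12) + 1:02d}-01", float(i % 90)) for i in range(50_000)],
)

cust_db.execute("CREATE TABLE customers (cust_id INT, name TEXT, zip TEXT)")
cust_db.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(i, f"cust_{i}", "30305" if i % 10 == 0 else "99999") for i in range(1000)],
)

prod_db.execute("CREATE TABLE products (prod_id INT, product_line TEXT)")
prod_db.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(i, "widgets" if i < 10 else "other") for i in range(50)],
)

# (1) + (4): push the time filter AND the per-customer/per-product aggregation
# down to the sales source, so only pre-aggregated subtotals are transferred.
sales_rows = sales_db.execute(
    "SELECT cust_id, prod_id, SUM(amount) FROM sales "
    "WHERE sale_date BETWEEN '2024-03-01' AND '2024-03-31' "
    "GROUP BY cust_id, prod_id"
).fetchall()

# (2): push the zip-code filter down to the customer source.
matching_customers = dict(
    cust_db.execute("SELECT cust_id, name FROM customers WHERE zip = '30305'").fetchall()
)

# (3): push the product-line filter down to the product source.
matching_products = {
    pid for (pid,) in prod_db.execute("SELECT prod_id FROM products WHERE product_line = 'widgets'")
}

# Final combination in the virtual layer: join only the small, pre-filtered,
# pre-aggregated result sets instead of the raw tables.
totals = {}
for cust_id, prod_id, subtotal in sales_rows:
    if cust_id in matching_customers and prod_id in matching_products:
        name = matching_customers[cust_id]
        totals[name] = totals.get(name, 0.0) + subtotal

print(f"rows transferred into the virtual layer: "
      f"{len(sales_rows) + len(matching_customers) + len(matching_products)}")
print("sample of total sales by customer:", dict(list(totals.items())[:3]))
```

The point to notice is that the only multi-row result crossing the "network" is the already-aggregated sales summary; the customer and product sources each return a small filtered set, so the join in the virtual layer touches a tiny fraction of the original data.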

So now you know the answer to the #datavirtualization performance question. Please share this with anyone who has similar questions.

-CW
