The Data Virtualization Performance Question

Having now worked in the #datavirtualization space for nearly 7 years with #denodo, it recently dawned on me that I have more practical experience using #datavirtualization than not only most people in the world (obviously :-), but also most technologists who work in the data space. When I first started working with #denodo, I was quite surprised at the number of people who work in the #datamanagement space every day yet were completely unfamiliar with #datavirtualization technology, which to my mind is a must-have in any modern #dataarchitecture for enterprises with large numbers of disparate data sources. As I traveled the country meeting with prospects and speaking and presenting at conferences and technology events, I found myself becoming something of a #datavirtualization evangelist (kinda like a televangelist minus the slick silk suit, Rolls-Royce, helicopter, and yacht :-). Spreading the gospel of this innovative way of querying data from multiple sources without having to migrate the data was what this #datavirtualization evangelist spent most days doing.

With this technology being a fundamentally new concept to most prospects I've spoken with, I've heard just about every question one can ask about it, from the most skeptical to the genuinely curious. I think it would be helpful for the larger #data and #datamanagement community to better understand this technology, so I will periodically share a bit of what I've learned about it over the years and why it deserves serious consideration in your #dataarchitecture now and forevermore. One of the first areas of skepticism for those with a strong technical background learning about #datavirtualization for the first time is performance. How does it perform given that data has to be pulled from multiple sources in real time? To someone with a strong technical background, that question immediately raises alarms that this approach could be deadly slow.

Following is a response I gave in a #dataandanalytics group to a #datafabric post. I had commented that the best way to implement the #dataintegration component of a #datafabric is to utilize #datavirtualization. The author of the post replied that in his past dealings with #datavirtualization the performance wasn't up to par, so I responded as follows. May this response help put to rest whatever performance questions you may have in your mind about #datavirtualization.

I think those criticisms regarding performance may have been valid say 8-10 years ago. In practice, though, I've been involved with numerous #datavirtualization projects over the past 6+ years with multibillion-dollar companies whose humongous datasets, on the order of hundreds of millions to billions of rows, had to be integrated from different sources. We've achieved a level of performance that is on par with, and within the same order of magnitude as, what you would get if all the data were physically moved to a central location via #ETL jobs. By analyzing the technical metadata and understanding the #dataprofiling characteristics of the relevant data sources, the optimization algorithms employed through #datavirtualization emphasize maximizing query pushdown to the data sources. This minimizes how much data is retrieved from each data source, how much data is transferred across networks, and how much data has to be processed within the virtual layer. When federating data across multiple disparate data sources, much of the total execution time is spent transferring data across networks into the virtual layer. Minimizing that transfer by pushing as much execution as possible down to the data sources is therefore the key to maximizing performance with data virtualization.
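To make the pushdown idea concrete, here is a minimal, runnable sketch of the difference it makes. This is not Denodo's optimizer; it simply stands in a single in-memory SQLite table for a remote source and contrasts pulling every row into the virtual layer against pushing the filter and aggregation down into the source's own SQL. Every table, column, and number here is made up for illustration.

```python
import sqlite3

# Hypothetical stand-in for a remote data source: a single in-memory SQLite
# table of order rows. In a real deployment this would be a warehouse or
# operational database reached over the network.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, f"2024-{(i % 12) + 1:02d}-15", float(i % 500)) for i in range(100_000)],
)

# Naive federation: pull every row into the virtual layer, then filter and
# aggregate locally. All 100,000 rows cross the "network".
all_rows = source.execute("SELECT order_id, order_date, amount FROM orders").fetchall()
naive_total = sum(amount for _, order_date, amount in all_rows if order_date.startswith("2024-03"))
print(f"naive: transferred {len(all_rows):,} rows, March total = {naive_total}")

# Pushdown: rewrite the filter and the aggregation into the source's own SQL,
# so a single summary row is all that crosses into the virtual layer.
(pushdown_total,) = source.execute(
    "SELECT SUM(amount) FROM orders WHERE order_date LIKE '2024-03%'"
).fetchone()
print(f"pushdown: transferred 1 row, March total = {pushdown_total}")
```

Both plans return the same total, but the naive plan ships 100,000 rows across the wire while the pushdown plan ships exactly one summary row, which is the whole point.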

This simple answer helped to address the initial concern, but let me give a concrete example to drive the point home more clearly. Consider a scenario with 1B sales transactions in one source, 5M customers in another source, and 5K products in a third, with the requirement to get total sales by customer, in a particular zip code, for a particular product line, over a particular time period. The traditional solution would be to #ETL and copy data from these 3 sources into a single location that can then be queried to answer the desired question. In many organizations these sorts of processes can take days, weeks, or longer, hindering the organization's ability to make data-driven decisions. Using the #datavirtualization platform from #denodo specifically, the query optimizer automatically rewrites the queries sent to the underlying data sources so that they (1) filter the 1B sales transactions to only those in the relevant time period, (2) filter the 5M customers to only those within the specified zip code, (3) filter the 5K products to only those in the specified product line, and (4) aggregate the filtered sales transactions by customer. The final combination and aggregation of the data coming from the 3 sources is then done within the virtual layer, but pushing these filtering and aggregation queries down to the underlying data sources first drastically reduces the amount of data that has to be transferred across networks, and that is the key to maximizing performance in these federated query scenarios.
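To walk through those four rewrite steps end to end, here is a scaled-down, hypothetical sketch. It again uses in-memory SQLite databases as stand-ins for the three sources, with made-up row counts, zip code, and product line; it is not Denodo's actual rewritten SQL, just an illustration of which work lands at the sources and which lands in the virtual layer.

```python
import sqlite3

# Three hypothetical, independent sources (scaled way down from the 1B / 5M / 5K
# figures above), each modeled as its own in-memory SQLite database.
sales_db, cust_db, prod_db = (sqlite3.connect(":memory:") for _ in range(3))

sales_db.execute("CREATE TABLE sales (cust_id INT, prod_id INT, sale_date TEXT, amount REAL)")
sales_db.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(i % 1000, i % 50, f"2024-{(i % 12) + 1:02d}-01", float(i % 90)) for i in range(50_000)],
)

cust_db.execute("CREATE TABLE customers (cust_id INT, name TEXT, zip TEXT)")
cust_db.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(i, f"cust_{i}", "30305" if i % 10 == 0 else "99999") for i in range(1000)],
)

prod_db.execute("CREATE TABLE products (prod_id INT, product_line TEXT)")
prod_db.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(i, "widgets" if i < 10 else "other") for i in range(50)],
)

# (1) + (4): push the time filter AND the per-customer/per-product aggregation
# down to the sales source, so only pre-aggregated subtotals are transferred.
sales_rows = sales_db.execute(
    "SELECT cust_id, prod_id, SUM(amount) FROM sales "
    "WHERE sale_date BETWEEN '2024-03-01' AND '2024-03-31' "
    "GROUP BY cust_id, prod_id"
).fetchall()

# (2): push the zip-code filter down to the customer source.
matching_customers = dict(
    cust_db.execute("SELECT cust_id, name FROM customers WHERE zip = '30305'").fetchall()
)

# (3): push the product-line filter down to the product source.
matching_products = {
    pid for (pid,) in prod_db.execute("SELECT prod_id FROM products WHERE product_line = 'widgets'")
}

# Final combination in the virtual layer: join only the small, pre-filtered,
# pre-aggregated result sets instead of the raw tables.
totals = {}
for cust_id, prod_id, subtotal in sales_rows:
    if cust_id in matching_customers and prod_id in matching_products:
        name = matching_customers[cust_id]
        totals[name] = totals.get(name, 0.0) + subtotal

print(f"rows transferred into the virtual layer: "
      f"{len(sales_rows) + len(matching_customers) + len(matching_products)}")
print("sample of total sales by customer:", dict(list(totals.items())[:3]))
```

The point to notice is that the only multi-row result crossing the "network" is the already-aggregated sales summary; the customer and product sources each return a small filtered set, so the join in the virtual layer touches a tiny fraction of the original data.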

So now you know the answer to the #datavirtualization performance question. Please share this with anyone who has similar questions.

-CW
