Data Virtualization 2.0: ETL’s doppelgänger rising again?

In the realm of data management, data virtualization has quietly thrived for over 15 years. Yet, it remains a puzzle to many, often mistaken for storage virtualization or confused with data integration. The reality, however, is both simpler and more intriguing: data virtualization is a vital pattern within the broader landscape of data integration.

Data integration, often abbreviated as DI, constitutes a substantial market. Depending on the source, the global DI market is estimated to range from $4 billion to $14 billion. It is growing at 10-15% per year. The data virtualization market is estimated to be approximately 10% of the total DI market.

The initial iteration of data virtualization, which I'll call “Data Virtualization 1.0,” didn't see significant success. However, the newer, cloud-age “Data Virtualization 2.0” is a different narrative. While it is likely to remain a niche market, it may grow faster than the overall DI market, helped by cloud tailwinds.

Data Virtualization 1.0 as a pattern of Data Integration

Data virtualization provides a specific method for integrating and accessing data from various sources, making it available to users and applications as if it were stored in a single location, without physically moving or copying the data. This is significantly different from other data integration patterns, including ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform).

The following illustrations delineate the distinctions between the ETL/ELT and data virtualization patterns.

Figure 1: ETL/ELT pattern of Data Integration


Figure 2: Data Virtualization pattern of Data Integration

The ETL flavor is favored by traditional data integration vendors. The ELT flavor is favored by data warehouse and data lake vendors, since it advocates first loading all data into a data warehouse or data lake and then transforming/processing it.
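To make the ordering difference concrete, here is a minimal, hypothetical Python sketch of the two flavors; the function names and toy data are illustrative, not taken from any particular product.

```python
# A minimal, hypothetical sketch of the two orderings with toy in-memory data;
# function and table names are illustrative, not from any particular product.

def extract(source):
    """Pull raw rows from a source system (stubbed with static data)."""
    return [{"id": 1, "amount": "100"}, {"id": 2, "amount": "250"}]

def transform(rows):
    """Clean and shape rows; runs before or after loading, depending on the pattern."""
    return [{"id": r["id"], "amount": int(r["amount"])} for r in rows]

def load(rows, target):
    """Write rows into the warehouse or lake (stubbed as a dict of tables)."""
    target.setdefault("orders", []).extend(rows)

# ETL: transform in the integration layer, then load only curated data.
warehouse = {}
load(transform(extract("crm")), warehouse)

# ELT: load raw data first, then transform inside the warehouse or lake.
data_lake = {}
load(extract("crm"), data_lake)
data_lake["orders"] = transform(data_lake["orders"])
```

Data virtualization, by contrast, skips the up-front load entirely and fetches from the sources at query time, which is what the next sections describe.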

Data Virtualization 1.0 offered a unique approach to the issue. Its promise was that you could achieve data integration without writing data pipelines or maintaining a data warehouse. This was a very appealing promise, as it could significantly reduce both CAPEX (data warehouse cost) and OPEX (the cost to write and maintain data pipelines) compared to the ETL/ELT patterns.

Comparing ETL and DV 1.0 Architectures

Comparing the ETL and data virtualization architectures, DV 1.0 consists of the following components:

  • Extract: Connect to all data sources or data producers and extract data. This step, also known as source connectivity, is very similar to the ETL/ELT patterns.
  • Transform: Before the data is served, it needs to be transformed to match the needs of the consuming layer. Again, this step is very similar to the ETL/ELT patterns.
  • Cache: This is where DV 1.0 differs significantly from ETL/ELT. In DV 1.0, data needs to be cached to avoid retrieving it from the data producers every time the consuming layer requires it.
  • Logical Data Model: This is a metadata layer that maps the data required by the consumption layer. Unlike the metadata layer in traditional databases, and akin to its counterpart in federated databases, this layer tracks data across multiple data producers and the cache.
  • Serve: This layer typically includes a SQL parser, an optimizer, and a query execution engine. While in theory any SQL engine could suffice, optimal performance and efficiency require the SQL engine to understand the caching and the logical data model. In essence, DV 1.0 must incorporate its own dedicated SQL layer to ensure peak performance and efficiency. This layer must also be clustered to allow multiple queries to execute in parallel.
  • Storage Management: Since DV 1.0 had to manage potentially large quantities of cached data, it needed a highly efficient, scalable, and manageable storage layer. The storage layer must be clustered so that multiple data-serving nodes can access it in parallel. (A minimal sketch tying these components together follows below.)
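The sketch below ties these components together. It is a deliberately simplified, hypothetical illustration, with made-up source names, tables, and structures, not how any specific DV product is implemented.

```python
# A deliberately simplified, hypothetical illustration of the DV 1.0 building
# blocks (source connectivity, logical data model, cache, serve); source names,
# tables, and structures are made up.

# "Connectors" to two data producers, stubbed with static data.
SOURCES = {
    "crm_db": {"customers": [{"id": 1, "name": "Acme"}]},
    "erp_db": {"invoices":  [{"id": 9, "customer_id": 1, "total": 420}]},
}

# Logical data model: maps the tables consumers see to where the data really lives.
LOGICAL_MODEL = {
    "customers": ("crm_db", "customers"),
    "invoices":  ("erp_db", "invoices"),
}

CACHE = {}  # logical table -> cached rows (caching policy is discussed below)

def fetch_from_source(source, table):
    """Extract: pull rows from the owning data producer."""
    return SOURCES[source][table]

def serve(logical_table):
    """Serve: resolve the logical table via the model, preferring cached data."""
    if logical_table in CACHE:
        return CACHE[logical_table]
    source, table = LOGICAL_MODEL[logical_table]
    rows = fetch_from_source(source, table)
    CACHE[logical_table] = rows   # cache so repeat requests skip the producer
    return rows

# A "federated" request touching both producers without copying data up front.
report = [(c["name"], i["total"])
          for c in serve("customers")
          for i in serve("invoices") if i["customer_id"] == c["id"]]
```

The consuming layer only ever sees logical tables; where the rows actually live, and whether they came from the cache or the producer, stays hidden behind the logical data model and the serve step.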

Why did Data Virtualization 1.0 remain a niche DI pattern?

The following things are evident from the above discussion:

  • The architectural components needed for data virtualization were a combination of ETL and data warehousing, encompassing a broader set of features and functionalities.
  • Without clustered serving and storage layers, the performance and availability of DV 1.0 would suffer significantly. Getting these things right in DV 1.0 before containers and Kubernetes was a significant challenge. Even with good clustering support for the storage and serving layers, DV 1.0 could become a logical bottleneck, causing management and upgrade challenges, since all analytics workloads must go through it.
  • In DV 1.0, establishing an efficient caching layer was paramount. Without a finely tuned caching mechanism, it would inadvertently send redundant queries to the underlying data producers, leading to performance issues for both the data producers and consumers. It had to get the following right for an efficient caching layer: what to cache, when to cache, how long to cache, and where to cache (a minimal sketch of such a cache follows this list).
  • DV 1.0 was not a good fit for ML workloads, because data scientists typically want to explore all data before deciding what to focus on. This is the same challenge that data warehouses faced. Warehouse vendors proposed ELT instead of ETL to address it, but not all customers could adopt that because of the prohibitive cost of storing all data in a data warehouse.

  • DV 1.0 exclusively addressed data integration, leaving other crucial components in your data toolchain unattended, such as data quality, data governance, data catalog, and master data management. Achieving these functionalities requires more than DV alone, and using separate tools for each function leads to the complexity of managing multiple tools, resulting in additional operational expenses and management challenges.
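As a rough illustration of the caching decisions mentioned above, here is a minimal, hypothetical TTL-based cache. The table names and TTL values are made up, and a real DV engine would also have to decide where cached data lives (memory, local disk, or a clustered store).

```python
# A minimal, hypothetical TTL cache illustrating the caching decisions above;
# table names and TTLs are made up. "Where to cache" is in-memory here, while a
# real DV engine could also spill to local disk or a clustered store.
import time

CACHE_TTL_SECONDS = {"customers": 3600, "invoices": 60}  # how long: per-table freshness windows
_cache = {}                                               # where: simple in-process dictionary

def cached_fetch(table, fetch_fn):
    """Return cached rows if still fresh, otherwise refresh from the source."""
    entry = _cache.get(table)
    if entry and time.time() - entry["at"] < CACHE_TTL_SECONDS.get(table, 0):
        return entry["rows"]                              # fresh: avoid hitting the data producer
    rows = fetch_fn(table)                                # stale or missing: go back to the source
    if table in CACHE_TTL_SECONDS:                        # what to cache: only tables configured as cacheable
        _cache[table] = {"rows": rows, "at": time.time()} # when: populate lazily, on first read
    return rows

# Illustrative use with a stubbed source fetcher.
rows = cached_fetch("customers", lambda table: [{"id": 1, "name": "Acme"}])
```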

DV 2.0: Reimagining Data Virtualization

DV 1.0 was a rational choice for specific workloads, particularly in an era when on-premises data warehouses were costly and independent scalability of storage and compute was hard to achieve.

A couple of things have changed since the DV 1.0 days:

  • Separation of compute and storage in the cloud: Cloud-based storage is cheap, highly scalable, and elastic. This enables cost-effective data lakes where you can store all your data without having to worry about what data to keep, which works especially well for ML workloads. Cloud storage also provides efficient tiering mechanisms (hot, warm, cold) for different latency needs.
  • More choices for the serving layer: Processing data in the cloud extends beyond SQL, offering various alternatives. For instance, Apache Spark, pandas, and similar solutions provide tailored approaches that align with specific use cases.
  • Serverless compute in the cloud allows customers to bring up compute components only when needed and pay only for what they use (a small illustration follows this list).
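As a small illustration of these shifts, the hedged sketch below reads a Parquet dataset directly from object storage with pandas, on demand. The bucket, path, and column names are made up, and reading s3:// URLs with pandas assumes the pyarrow and s3fs packages are available.

```python
# Illustrative only: reading data on demand straight from cheap object storage,
# instead of keeping a warehouse or DV cache warm. The bucket, path, and column
# names are made up; reading s3:// URLs assumes the pyarrow and s3fs packages.
import pandas as pd

orders = pd.read_parquet("s3://example-data-lake/orders/")    # pay only when you actually read
monthly = orders.groupby("month")["amount"].sum()             # transform with whatever engine fits

# The same lake files could instead be served by a serverless SQL engine or
# Apache Spark; the storage layer does not dictate the compute layer.
```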

With cloud and data lakes, DV can be reimagined:

  • Extraction can be performed at the edge with minimal transformations.
  • The cloud ingestion layer is dedicated to the efficient intake of data into the data lake, leveraging technologies such as Kafka and others.
  • The cloud data lake replaces the storage and caching layers of DV 1.0 with cost-effective, highly efficient, elastic, and scalable storage.
  • While a logical data model remains necessary, cloud-based data catalogs can potentially be augmented to fulfill this role (see the sketch after this list).
  • Serverless SQL engines or other on-demand computing mechanisms, such as Apache Spark, can be employed for data serving purposes.
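A hypothetical sketch of this layering is shown below: a catalog entry stands in for the logical data model, mapping logical names to data-lake locations that the ingestion layer has already populated, and any on-demand engine can read the mapped files. All names and paths are illustrative.

```python
# Hypothetical sketch of the reimagined layering: a catalog entry plays the role
# of the logical data model, mapping logical names to data-lake locations that
# the ingestion layer (e.g., Kafka-based) has already populated. All names and
# paths are illustrative.
import pandas as pd

CATALOG = {
    "sales.orders":   "s3://example-data-lake/ingested/orders/",
    "crm.customers":  "s3://example-data-lake/ingested/customers/",
}

def serve(logical_name):
    """Resolve a logical table through the catalog and read it on demand."""
    # Any on-demand engine works here: pandas, Apache Spark, or a serverless SQL engine.
    return pd.read_parquet(CATALOG[logical_name])

orders = serve("sales.orders")
```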

In contrast to the highly monolithic nature of DV 1.0, DV 2.0 offers customers the flexibility to select different solutions for each component.

Figure 3: DV reimagined

Conclusion

Some might contend that this is not a reimagination of DV but rather a reimagining of ETL or data integration. Indeed, I opted for "DV reimagined" for the sake of simplicity in illustrating how architectural components are broken, bent, and blended in this cloud data-lake based approach. In the design realm, such breaking, bending, and blending are recognized as catalysts for innovation across all creative domains. This phenomenon echoes in the data integration world, ushering in new possibilities, ideas, and innovative implementation approaches.
