Data virtualisation - an overview


Copying data to multiple locations in order to integrate it has long been standard practice in the industry. But is there a way to get a combined view across all the data sources and silos without physically copying anything?

We have built many data warehouses, data marts and data lakes, and we have come to realise that data has been copied to so many places that maintenance, security and data governance have become a challenge. Did we at least remove the silos? No, data silos remain. So what is the solution? Is there an alternative? The answer is yes, and it is called data virtualisation. It is not a magic bullet, but it is well worth exploring. Let’s discuss it in this article.

Data virtualisation combines data from different sources, virtually, into a single, unified view. The data remains in the source systems and is not replicated anywhere else.

How does it work?  

You can achieve data virtualisation by performing the following three actions: Connect, Combine and Serve.

  1. Connect to each data source using an appropriate connector, such as a JDBC driver for databases or an HTTP client URL for JSON files and APIs.
  2. Combine: Once connected to the data sources, you extract data and create a base view for each source, then integrate those base views into a single unified schema. This happens virtually, without physically replicating the source data.
  3. Serve the data to all the consumers, such as data analysts, machine learning engineers, and data scientists. The good thing is that they don’t even need to know where the data comes from or what format it was originally in.
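The three steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not any vendor's API: two in-memory SQLite databases stand in for independent source systems, and the table and system names (a "CRM" and a "billing" store) are invented for the example.

```python
import sqlite3

# Connect: one connection per "source system" (hypothetical names).
# In a real platform these would be remote databases, APIs or files.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.execute("INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben')")

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
billing.execute("INSERT INTO invoices VALUES (1, 120.0), (1, 80.0), (2, 45.5)")

# Combine: define a unified view. Nothing is copied to a new store;
# the sources are queried only when a consumer asks for the view.
def unified_view():
    customers = crm.execute(
        "SELECT id, name FROM customers ORDER BY id"
    ).fetchall()
    totals = dict(
        billing.execute(
            "SELECT customer_id, SUM(amount) FROM invoices GROUP BY customer_id"
        ).fetchall()
    )
    return [(cid, name, totals.get(cid, 0.0)) for cid, name in customers]

# Serve: consumers see one schema and never learn where the data lives.
print(unified_view())  # [(1, 'Asha', 200.0), (2, 'Ben', 45.5)]
```

The key property to notice is that `unified_view()` holds no data of its own; each call federates a fresh query out to both sources.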

Advantages from a data engineer’s perspective:


1. Accelerated delivery: Because you don’t create any physical replica of the data, data virtualisation can deliver a minimum viable product much faster than traditional data warehouse solutions.

2. Abstraction: Data virtualisation uses a service-oriented architecture and decouples storage from processing.

3. Security: Because data virtualisation combines data and serves it for consumption from one place, you can implement all data security and governance controls in that one place.

4. Lineage: You can track the data lineage of the virtual target dataset, as it is combined centrally.

5. Reuse: You can apply the same business logic across all the sources, improving developer productivity.

6. POC: You can use data virtualisation as a proof of concept before committing to building an expensive data warehouse.

7. Respond to source changes: You can add or remove columns much faster than with ETL processes.


8. Data ownership: Because you don’t copy or replicate the data, ownership of the data assets remains with the respective source-side business.

9. Transform and clean: You can perform transformation and data-cleaning activities virtually before serving the data to consumers.
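To make the last point concrete, here is a minimal, hypothetical Python sketch (not a vendor API) in which the cleaning rules run at query time inside the virtual layer, leaving the raw source rows untouched. The table and column names are invented for illustration.

```python
import sqlite3

# Hypothetical raw source with messy data: mixed-case emails with stray
# whitespace, inconsistent country codes, and a missing email.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE leads (email TEXT, country TEXT)")
source.execute(
    "INSERT INTO leads VALUES ('  A@X.COM ', 'uk'), ('b@y.com', 'UK'), (NULL, 'us')"
)

def clean_leads():
    """Virtual 'cleaned' view: trim and lowercase emails, standardise
    country codes, and drop rows with missing emails -- all on the fly.
    The underlying table is never modified or copied."""
    rows = source.execute("SELECT email, country FROM leads").fetchall()
    return [
        (email.strip().lower(), country.upper())
        for email, country in rows
        if email is not None
    ]

print(clean_leads())  # [('a@x.com', 'UK'), ('b@y.com', 'UK')]
```

Because the rules live in the view definition, fixing a cleaning rule changes what every consumer sees immediately, with no reload of any target store.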

Advantages from a data scientist's perspective:

1. Simplicity: Data scientists no longer need to extract data from disparate sources and merge it all on the consumption side, because the data is available in a single place.


2. Single source of truth: Because data virtualisation provides a unified schema aggregating data from all sources, it serves as a ready-made single source of truth.

3. Reflects changes to underlying data: Because the raw data stays in the source, changes to it are reflected at the consumption layer with no additional processing.
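This freshness property follows directly from the fact that a virtual view reads its source at query time. A self-contained sketch, using an in-memory SQLite table as a stand-in source (table and value names are hypothetical):

```python
import sqlite3

# Stand-in source system with one existing record.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (name TEXT)")
source.execute("INSERT INTO events VALUES ('signup')")

def virtual_view():
    # The view materialises nothing: every call re-reads the source.
    return [row[0] for row in source.execute("SELECT name FROM events")]

print(virtual_view())  # ['signup']

# A new row lands in the source system...
source.execute("INSERT INTO events VALUES ('purchase')")

# ...and the consumer sees it on the very next query, with no refresh job.
print(virtual_view())  # ['signup', 'purchase']
```

Contrast this with an ETL pipeline, where the new row would be invisible to consumers until the next scheduled load.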

Points to consider:

1. Data virtualisation is not a replacement for data integration techniques in which you copy the data physically. You may still need to build data warehouses and data lakes where the integration logic is complex or the data volume is very high.


2. Performance tuning is key: every consumer query fans out to the source systems, so successful data virtualisation depends on tuning query performance.

3. As the consumption layer serves from many disparate sources, its uptime is bound to that of the source systems: if a source is down, the virtual views over it are too. Keep availability expectations in sync.

4. As with other data integration techniques, you will need a comprehensive data catalogue that describes the metadata of the disparate data stores: where is the data store, what does it contain, how frequently is it updated, and so on.

5. Data definitions vary from department to department and business to business. However, to create a meaningful data virtualisation solution, you will need a uniform data definition across the organisation.
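A catalogue entry from point 4 can be as simple as structured metadata answering those three questions for each source. A hypothetical sketch follows; the dataset name, JDBC URL and field names are all invented for illustration.

```python
# Minimal data catalogue: one entry per source dataset, recording where
# it lives, what it contains, and how fresh it is (all values hypothetical).
catalogue = {
    "crm.customers": {
        "location": "jdbc:postgresql://crm-host/crm",
        "contains": ["id", "name", "email"],
        "refresh_frequency": "real-time",
        "owner": "sales",
    },
}

def describe(dataset):
    """Return a one-line summary a consumer can read before querying."""
    entry = catalogue[dataset]
    return f"{dataset} @ {entry['location']} (owner: {entry['owner']})"

print(describe("crm.customers"))
# crm.customers @ jdbc:postgresql://crm-host/crm (owner: sales)
```

In practice a catalogue tool would populate such entries automatically, but the fields to capture are the same.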

I hope this gives you a high-level understanding of data virtualisation. I have tried to minimise jargon and focus on concepts. Thanks for reading. If you found this article useful, please like, share and comment.

Views are personal and in no way reflect those of my current or previous organisations and vendor partners.

