DATA VIRTUALIZATION
Mustafa Qizilbash
Data Virtualization is a virtual consumption layer that connects to heterogeneous data stores, i.e., RDBMS, data appliances, NoSQL, web services, SaaS, enterprise applications, etc. Since it connects directly to all the data stores, the data models of the underlying systems remain intact, with no change required.
‘It’s considered one of the solutions to the Data Sovereignty problem statement.’
What data stores?
Here, data stores refer to traditional RDBMS or OLTP systems, OLAP systems, Data Warehouses, Data Lakes, Delta Lake, Data Vault, Business Vault, Kafka, data appliances, NoSQL, web services, SaaS, enterprise applications, etc. Yes, it’s confusing: we all know the data stores above each serve one business requirement or another, so why add another layer on top?
Data Virtualization hosts views that connect directly to all possible data stores. These views can be one-to-one with a source table, or complex SQL views that pull data from multiple source systems, creating a virtualized layer on top for consumers to query.
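To make the pattern concrete, here is a minimal sketch using Python’s built-in sqlite3 module, with two SQLite files standing in for two heterogeneous source systems. The file names, table names, and hub connection are all hypothetical; a real virtualization tool would use dedicated connectors rather than ATTACH, but the idea is the same: views over sources, no data copied.

```python
import sqlite3

# Two independent "source systems" (hypothetical stand-ins, e.g. a CRM and an ERP).
crm = sqlite3.connect("crm.db")
crm.executescript("""
    DROP TABLE IF EXISTS customers;
    CREATE TABLE customers (id INTEGER, name TEXT);
    INSERT INTO customers VALUES (1, 'Acme');
""")
crm.commit()

erp = sqlite3.connect("erp.db")
erp.executescript("""
    DROP TABLE IF EXISTS orders;
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 250.0), (1, 120.0);
""")
erp.commit()

# The "virtualization layer": one hub connection attaches both sources and
# defines a view over them. No data is copied; the join runs at query time.
hub = sqlite3.connect(":memory:")
hub.execute("ATTACH DATABASE 'crm.db' AS crm")
hub.execute("ATTACH DATABASE 'erp.db' AS erp")
hub.execute("""
    CREATE TEMP VIEW customer_revenue AS  -- a multi-source 'complex' view
    SELECT c.name, SUM(o.amount) AS revenue
    FROM crm.customers c
    JOIN erp.orders o ON o.customer_id = c.id
    GROUP BY c.name
""")
print(hub.execute("SELECT * FROM customer_revenue").fetchall())  # [('Acme', 370.0)]
```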
Having said that, any number of new virtualized data models can be created as second, third, or fourth layers to fulfil different business requirements. Together, these layers can be called a virtual Semantic Layer, as the sketch below illustrates.
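Continuing the sketch above, a “second layer” is nothing more than a view defined over the first virtual view instead of over the sources (the view name and revenue threshold are illustrative only):

```python
# A second virtual layer: a semantic view built on the first view, not on the
# sources. Stacking more views like this yields further layers.
hub.execute("""
    CREATE TEMP VIEW top_customers AS
    SELECT name FROM customer_revenue WHERE revenue > 300
""")
print(hub.execute("SELECT * FROM top_customers").fetchall())  # [('Acme',)]
```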
In a nutshell, all data, whether raw from the source or already processed, becomes available in one environment.
By default, data virtualization tools push query processing down to the source (host) systems, but some tools can also pull data into memory and process it there. Tools can also enforce limits, for example aborting any query that runs longer than a set time.
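As a rough illustration of that abort rule, here is a hypothetical guard in Python; this is not any particular tool’s API, and it assumes the sqlite3 connection was opened with check_same_thread=False so a worker thread may use it:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Hypothetical 'abort long-running queries' policy, sketched on sqlite3.
# Assumes conn = sqlite3.connect(..., check_same_thread=False).
def run_with_timeout(conn, sql, timeout_s=30):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(lambda: conn.execute(sql).fetchall())
        try:
            return future.result(timeout=timeout_s)
        except TimeoutError:
            conn.interrupt()  # aborts the query running in the worker thread
            raise RuntimeError(f"query aborted after {timeout_s}s")
```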
Data Virtualization tools can also notify when changes are made in source systems, and those changes can be propagated into the virtualization tool automatically. And if dirty data is found in the virtualization layer, then unlike any other data store (RDBMS or OLTP systems, OLAP systems, Data Warehouse, Data Lake, Delta Lake, Data Vault, Business Vault, etc.), the tool can write the change, or the cleaned data, back into the source systems as well. For example, if a data quality issue has been identified and the affected data resides in, say, 3 source systems and a data mart, then the data in all 4 data stores can be fixed directly from the Data Virtualization layer. There is no need to run separate remediation exercises in each data store.
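On the earlier sqlite sketch, write-back looks like this; because the sources are attached rather than copied, an UPDATE issued in the hub lands directly in the source file (the correction itself is made up):

```python
# Fix a data quality issue once, from the virtualization layer...
hub.execute("UPDATE crm.customers SET name = 'ACME Corp' WHERE id = 1")
hub.commit()
# ...and the source system itself now holds the corrected value.
print(crm.execute("SELECT name FROM customers WHERE id = 1").fetchone())  # ('ACME Corp',)
```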
Another good example: suppose a data pipeline runs Data Lake → Data Warehouse → Data Marts → OLAP/Semantic Layer → BI tool. If a user wants to add one more column to a BI dashboard, the new column must be pipelined through the whole process, which can take weeks or months to deliver to the business. The solution is to virtualize the source systems and surface the column in the BI tool immediately; in parallel, let IT extend the pipeline, and once that is done, replace the virtualized column in the BI tool with the one coming from the pipeline.
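The “new column now” part of that workaround, again on the same sketch: the virtual view is simply redefined to expose a source column the pipeline does not carry yet, with no pipeline change required (here the extra column is the customer id, chosen arbitrarily):

```python
# Expose an extra source column (id) immediately by redefining the view;
# the physical pipeline can catch up later.
hub.executescript("""
    DROP VIEW customer_revenue;
    CREATE TEMP VIEW customer_revenue AS
    SELECT c.id, c.name, SUM(o.amount) AS revenue
    FROM crm.customers c
    JOIN erp.orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name
""")
```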
Advantages
● Single Access Point: Access all data stores from a single environment.
● Write-Back Capability: Fix data directly from the data virtualization environment, i.e., write back into the source systems.
● UAM: User Access Management handled in one place.
● Data Quality Management: Data quality issues can be identified at a single point and fixed from the same tool.
● No Data Movement: No ETL/ELT is required.
● Ease of Use: Users don’t have to write queries in different languages; in Data Virtualization, a user writes in one language and the tool creates wrappers to talk to all the connected systems (see the sketch after this list).
● No Additional Storage: It is just a virtual view over a new unified data model and doesn’t create a separate copy of the data, so no additional storage is required.
● Low Risk: Since it doesn’t host a copy of the data, unlike Data Warehouses or Data Lakes where data is physically moved, there is no chance of losing data in transit.
● Better for Analytics: All the source data models are converted into one unified data model and data is fetched at query time, making source data available in near real-time for Data Analysts and Data Scientists.
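A toy version of that wrapper idea, with every class and method name invented for illustration (this is not any real tool’s API): the user writes one logical request, and each adapter translates it for its own store’s dialect.

```python
import sqlite3

class SqliteAdapter:
    """Adapter for a relational source."""
    def __init__(self, conn):
        self.conn = conn
    def fetch(self, table, columns):
        cols = ", ".join(columns)  # toy code: no quoting/injection handling
        return self.conn.execute(f"SELECT {cols} FROM {table}").fetchall()

class DictAdapter:
    """Stand-in for a NoSQL or SaaS source exposing documents."""
    def __init__(self, data):
        self.data = data  # {table_name: [{column: value, ...}, ...]}
    def fetch(self, table, columns):
        return [tuple(row[c] for c in columns) for row in self.data.get(table, [])]

class VirtualLayer:
    """One logical query; the adapters hide each system's dialect."""
    def __init__(self, **adapters):
        self.adapters = adapters
    def select(self, table, columns):
        # fan the same logical request out to every connected system
        return {name: a.fetch(table, columns) for name, a in self.adapters.items()}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.execute("INSERT INTO customers VALUES (1, 'Acme')")

layer = VirtualLayer(
    rdbms=SqliteAdapter(db),
    saas=DictAdapter({"customers": [{"id": 2, "name": "Globex"}]}),
)
print(layer.select("customers", ["id", "name"]))
# {'rdbms': [(1, 'Acme')], 'saas': [(2, 'Globex')]}
```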
Disadvantages
● Performance Impact: There is some impact on source systems, since some queries run at the source.
● Functional Knowledge: Requires dedicated involvement from Data Stewards and Functional Consultants.
● No Historical Data: Since no copy of the data is hosted, history is not maintained either; Data Virtualization is best suited alongside a Data Warehouse already in place.
For me, the way Data Virtualization tools are maturing, they will become the sole consumption layer for all data stores.
Cheers.