Accelerating your data strategy with Augmented Data Management capabilities provided by Generative AI

According to Gartner, better data management can save the average organization $12.9 million annually. These savings come in many forms, such as enabling more automated processes, reducing the time employees spend searching for the data they need, and informing more accurate business decisions.

This blog post will discuss how data strategy and enterprise data management capabilities can be improved with Generative AI. For the sake of clarity and brevity, I will use “standard” Generative AI chat interfaces from Google Gemini, OpenAI ChatGPT and Anthropic Claude.

Enterprise data strategy as a first step

Before launching proofs of concept, proofs of value or prototypes, you need to define your data strategy.

The data strategy is defined per business domain and across domains, and sets out the main business outcomes to be achieved.

A data strategy can be based on generic goals, such as “I want to reduce the time employees lose on manual, tedious tasks by two hours per week”, or it can be very specific to a business domain, such as “I want to improve the accuracy of my client data”.

Use cases are then created, each with a clearly associated value and a way to measure it systematically and automatically. Use cases should be aligned with the company's data governance (and if no governance exists yet, it should be put in place in an agile way). The data behind each use case should be clearly described, categorized and governed. Finally, you must evaluate the current quality of that data and identify the key data sources involved.

Looking at all use cases, their value and desirability, the data they need and its quality, you will be able to prioritize them. Less mature companies can deliver up to three use cases per year (from design through development to production and deployment in the business domain), while more mature ones can deliver around fifteen.
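To make this systematic, the prioritization can be expressed as a weighted scoring matrix. Below is a minimal sketch in Python; the use cases, criteria and weights are hypothetical illustrations, not a prescribed model.

```python
# Minimal sketch of a weighted scoring matrix for use-case prioritization.
# Use cases, criteria and weights are hypothetical illustrations.
WEIGHTS = {"business_value": 0.4, "desirability": 0.2, "data_quality": 0.25, "feasibility": 0.15}

use_cases = [
    {"name": "Client data accuracy", "business_value": 8, "desirability": 7, "data_quality": 4, "feasibility": 6},
    {"name": "Manual task reduction", "business_value": 6, "desirability": 9, "data_quality": 7, "feasibility": 8},
]

def score(uc):
    # Weighted sum of the criteria scores (each rated 0-10).
    return sum(WEIGHTS[k] * uc[k] for k in WEIGHTS)

for uc in sorted(use_cases, key=score, reverse=True):
    print(f"{uc['name']}: {score(uc):.1f}")
```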

Use-case realization depends on the data team's maturity, size and skills, but also on the platform tools and services available from the data factory (or data office).

Finally, a good data strategy always comes with an acculturation plan and the building of a community of data champions.

From use cases to data products

Another key question is whether you want to adopt a data product approach. A data product is a data-intensive asset directly usable by data consumers. Introduced by Zhamak Dehghani, data mesh advocates treating data as a product, pushing ownership deeper into the producing teams. The five fundamentals of data mesh are: data as a product, domain orientation, self-serve infrastructure, federated governance (each domain is responsible for its own products) and an agile way of working.

The theory behind this is that ownership is better aligned with data creation, which brings more visibility and quality to the data. This “shift left” of responsibilities addresses the data explosion that has overwhelmed central data teams tasked with cleansing, harmonizing, and integrating data for the greater good.

Domain-driven design and data products go hand in hand because both espouse the idea of keeping data the responsibility of the teams that produce it and control its source. Delegating responsibility for specific data sets to data domains means that the enterprise is split into domains, each taking responsibility for building abstractions, serving data, maintaining associated metadata, improving data quality, applying lifecycle management, performing code control and so on.

Data is then a first-class citizen and a product of the domain. A new role, the data product owner, is created to own and manage the data product. The data product owner ensures that developments are aligned with downstream consumers and that business outcomes are met (continuously measuring the product's value).

Each domain is responsible for ingesting, processing, and serving its data products to downstream consumers. Data engineering and software engineering must then be tightly aligned, and ideally part of the same functional team, so they can work with a data product owner to produce high-quality, curated data products.

Data products must be part of the data strategy; they help accelerate its realization by delegating the work to dedicated teams across the company instead of a centralized data factory.

Generative AI accelerates the realization of the data strategy

An enterprise data strategy can benefit from Gen AI at all stages, which could lead to the following major benefits:

  • Gathering and Curating New Data. The capability to identify and leverage new data sets that create competitive advantage and were not used so far (especially unstructured data). Generative AI can also easily create synthetic data for testing purposes (see the sketch after this list).
  • Leveraging External Data. Tap into new data ecosystems and marketplaces to capture external data and derive value from it, while governing it appropriately (especially in terms of cost and intellectual property).
  • Improving Data Availability. In particular, “dark data” (information or content used for a single operational purpose and then often forgotten) can generate new value streams when ingested by generative AI applications. Yet much of this kind of data remains in archives or offline storage.
  • Rethinking the Data Lifecycle. Expand metadata management to cover data provenance, bias, AI-generated content, etc. throughout the data lifecycle.
  • Reimagining Data Quality. Automate data quality assessment and expand the realm of data quality tooling to drastically improve the data supply chain.
  • Simplifying Data Governance. Data governance has proved difficult to implement in large companies. The idea is to augment data product owners, data stewards and data custodians with generative AI to simplify their work. A data steward can leverage generative AI to improve the quality of the data glossary, the dictionary, or ontology generation. It can also help with key data management tasks, such as data quality analysis and improvement. Of course, most data management tools and modern data platforms are being enhanced with generative AI capabilities.
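As an illustration of the synthetic data point above, here is a hedged sketch using the OpenAI Python SDK; the model name and prompt are illustrative assumptions, and any chat-capable LLM would do.

```python
# Hedged sketch: asking an LLM to create synthetic test data.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name and prompt are illustrative choices, not prescriptions.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Generate 10 synthetic CSV rows with columns "
    "client_id,name,country,annual_revenue_eur. "
    "Values must be realistic but entirely fictitious."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model works here
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)  # CSV text to feed into tests
```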

It also requires you to:

  • Upgrade Data Governance. Establish processes and guardrails governing what data Gen AI applications use and how, to ensure compliance with legal and privacy policies under an overall Responsible AI framework (and regulations such as the EU AI Act).
  • Ensure integration and scalability of the solution. Implementing Generative AI in large corporations requires setting up the right infrastructure and platforms, and choosing which Generative AI solutions to use (open source, commercial, etc.).
  • Provide trustworthy results. For generative AI to be valuable, its results must be trustworthy and accurate. The service is there to help employees gain time, not lose time or make mistakes.

Generative AI empowers data management with three new capabilities

While AI-powered contextualization is compelling, the true power of the new Generative AI technologies lies in their ability to codify human expertise by creating new and better data.

The three capabilities are: connect the dots, offer polyglotism and assist with copilot.

Connect the dots

The main objectives are to:

  • Simplify data ingestion and leverage multimodal (image, text, sound) facts and knowledge (internally and externally).
  • Offer holistic views and insights on pre-learned or pre-defined enterprise data concepts (syntax and semantics).
  • Enable multiple ways to interact, using voice or chat instead of the mouse (a richer user experience).

In particular, Generative AI natively offers the capability to improve data quality. Let’s look at what OpenAI ChatGPT and Google Gemini can do.

OpenAI ChatGPT and Google Gemini (formerly Bard) can natively check key data quality indicators

Anthropic Claude is also quite powerful.

Claude can natively check key data quality indicators

Let’s check the quality of a file describing the CO2 emissions of products, without providing any additional information about the data.

Uploaded data set and request to analyze its quality

The analysis completes almost immediately:

Description of the quality of the data set

Claude also proposes ways to improve the data quality of the uploaded data set:

Claude proposed improvements to be made

I can ask Claude to improve the content, generate a new data set and let me know what was changed:

Claude is generating the corrected data set

Of course, Claude details all the changes made, enabling me to track and validate them:

Description of the changes made in the data set

To automate this data quality improvement, I asked Claude to generate the Python code. This code could then be used to automate data quality checks in dedicated data engineering pipelines.

Excerpt of the Claude generated Python
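The excerpt itself is only a screenshot, but generated code of this kind usually boils down to a handful of pandas checks. Here is a minimal sketch, assuming hypothetical column names such as product and co2_kg:

```python
# Minimal sketch of automated quality checks, in the spirit of the code
# Claude generated. Column names ("product", "co2_kg") are assumptions.
import pandas as pd

df = pd.read_csv("co2_emissions.csv")

report = {
    "missing_values": df.isna().sum().to_dict(),          # completeness
    "duplicate_rows": int(df.duplicated().sum()),          # uniqueness
    "negative_co2": int((df["co2_kg"] < 0).sum()),         # validity
    "empty_product_names": int((df["product"].str.strip() == "").sum()),
}
print(report)

# Simple corrections: drop duplicates, clip impossible values.
clean = df.drop_duplicates().copy()
clean["co2_kg"] = clean["co2_kg"].clip(lower=0)
clean.to_csv("co2_emissions_clean.csv", index=False)
```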

Claude also explains how the code is built:


Description of the generated Python code

If you prefer to use the Great Expectations library, Claude can do that too.

Excerpt of the generated Python file using Great Expectations
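Again, the excerpt is a screenshot; a minimal hedged sketch of the same checks, written against the legacy pandas-dataset API of Great Expectations (recent releases restructure this API), could look like this:

```python
# Hedged sketch using the legacy pandas-dataset API of Great Expectations
# (recent releases restructure this API). Column names are assumptions.
import great_expectations as ge
import pandas as pd

df = pd.read_csv("co2_emissions.csv")
gdf = ge.from_pandas(df)

# Declare expectations mirroring the quality checks above.
gdf.expect_column_values_to_not_be_null("product")
gdf.expect_column_values_to_not_be_null("co2_kg")
gdf.expect_column_values_to_be_between("co2_kg", min_value=0)

results = gdf.validate()
print(results["success"])  # True only if every expectation passes
```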

The result is followed by a description of the Python code structure.

Description of the generated Python code using the Great Expectations library

Now let’s see if Claude can generate a JSON file that will let me view my data as a report in Power BI:

Excerpt of the generated Power BI JSON file
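The file itself is shown only as a screenshot, and Power BI's internal report format is complex; a more modest, hedged sketch is exporting the cleaned data as a JSON file that Power BI can ingest via Get Data > JSON:

```python
# Hedged sketch: exporting the cleaned data set as a JSON file that
# Power BI can load via "Get Data > JSON". This is a plain data export,
# not Power BI's internal report-definition format.
import json
import pandas as pd

df = pd.read_csv("co2_emissions_clean.csv")

payload = df.to_dict(orient="records")  # one JSON object per row
with open("co2_report_data.json", "w") as f:
    json.dump(payload, f, indent=2)
```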

Offer Polyglotism

Generative AI natively handles data management concepts and frameworks (or can be trained on them) and can generate (meta)models and views on demand, acting as a translator that lets you move easily between data formats and representations.

Polyglotism in data modelling provides five main advantages (see Tshepiso Mogoswane):

  • Adaptability: Different modelling methodologies excel in different scenarios. Being able to switch methodologies allows you to adapt to changing business requirements, data sources, or analytical needs. It provides the flexibility to choose the most suitable approach for a particular situation without needing to rework the entire data model from scratch.
  • Optimization: Each modelling methodology has its own strengths and optimizations. By being able to switch methodologies, you can optimize your data model based on specific use cases. For example, choose a dimensional model for reporting and analysis and switch to a Data Vault model for improved data integration and traceability. This flexibility enables you to fine-tune your data model to maximize performance and efficiency.
  • Data Governance: Different modelling methodologies have varying levels of built-in data governance features. Switching methodologies allows you to leverage specific governance capabilities that align with your organization’s requirements. For instance, Data Vault provides extensive data lineage and auditing capabilities, while dimensional models offer intuitive hierarchies and user-friendly structures for data exploration. Adapting the methodology based on data governance needs ensures compliance and data management best practices.
  • Skillset and Team Expertise: Modeling methodologies may require different skillsets and expertise. Switching methodologies based on a single logical data model allows you to leverage your team’s existing skills and expertise. It avoids extensive retraining or hiring of new resources, as you can utilize your team’s proficiency in multiple methodologies.
  • Continuous Improvement: The ability to switch methodologies promotes a culture of continuous improvement and learning. By experimenting with different modelling approaches, you can discover new insights, evaluate their effectiveness, and refine your data modelling practices over time. This iterative process contributes to the evolution and maturation of your data management capabilities.

We will take several examples to show the value of Generative AI in data modelling, taxonomy and ontology building, database schema evolution, and database retro-engineering.

Generative AI can help in data modeling

Let’s reuse Tshepiso Mogoswane’s blog post, Exploring Data Modelling with ChatGPT. The data model used is described here.

Employee logical data model

I can now ask ChatGPT to generate a Data Vault 2.0 model from it to be used in my data warehouse.

Description of the key phases to build a Data Vault 2.0 model

The result is shown below.


Generated Data Vault 2.0 model
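The generated model appears only as a screenshot, but the shape of a Data Vault 2.0 output is well defined: hubs hold business keys, links hold relationships between hubs, and satellites hold descriptive attributes. Here is a minimal sketch for the employee example (table and column names are my assumptions, not ChatGPT's exact output):

```python
# Minimal Data Vault 2.0 sketch around the employee example: a hub for
# the business key, and a satellite for descriptive attributes.
# Table and column names are assumptions, not ChatGPT's exact output.
import sqlite3

ddl = """
CREATE TABLE hub_employee (
    employee_hk   TEXT PRIMARY KEY,   -- hash of the business key
    employee_id   TEXT NOT NULL,      -- business key
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);

CREATE TABLE sat_employee_details (
    employee_hk   TEXT NOT NULL REFERENCES hub_employee(employee_hk),
    load_date     TEXT NOT NULL,
    first_name    TEXT,
    last_name     TEXT,
    department    TEXT,
    PRIMARY KEY (employee_hk, load_date)
);
"""

conn = sqlite3.connect("datavault_demo.db")
conn.executescript(ddl)
conn.close()
```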

Retro-engineering a database

One powerful capability of LLMs is the retro-engineering of databases. This section is based on a Towards Data Science blog post, and we will use the data set it proposes. It represents a “MyCompany” HR system export of all employees, which also contains many details about the company, some of them confidential, such as “Salary”, “Age”, or “Annual_Evaluation”.


“MyCompany” HR system export of all employees

Let’s see what ChatGPT can do:

  • Could you identify the categorical columns within this dataset as well as confidential ones?

  • Could you suggest a database schema with different tables (pay attention to creating a separate table for confidential data)?

  • For the tables with categorical data, please provide the SQL script to create them, including their content (Key and Values).

  • For the remaining tables, please provide the script to create their schema.

  • For each column of each table, can you suggest some data quality checks?

  • Can you generate the full new data model with all tables in UML with Mermaid?

We can then view the generated UML diagram in a Mermaid editor.
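The generated scripts and diagram appear only as screenshots; as a hedged indication of what the schema-separation step (prompts two to four) might produce, here is a minimal sketch that isolates the confidential columns in a dedicated table:

```python
# Hedged sketch of the schema-separation idea: confidential columns
# ("Salary", "Age", "Annual_Evaluation") move into their own table so
# access can be restricted independently. Names are assumptions.
import sqlite3

ddl = """
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    department  TEXT
);

CREATE TABLE employee_confidential (
    employee_id       INTEGER PRIMARY KEY REFERENCES employee(employee_id),
    salary            REAL,
    age               INTEGER,
    annual_evaluation TEXT
);
"""

conn = sqlite3.connect("mycompany_hr.db")
conn.executescript(ddl)
conn.close()
```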

Using Generative AI to build a Taxonomy

Generative AI models can assist in creating a taxonomy. The main capabilities, as summarized by ChatGPT, are:

  • Category Generation: Generative AI models, when trained on vast amounts of data, can identify recurring themes, topics, or categories. These can be used as a starting point for creating a taxonomy.
  • Hierarchy Creation: Using contextual clues, a generative AI model can help determine the hierarchical relationships between different categories. For example, in a taxonomy of animals, the model might determine that “canine” should be a subcategory of “mammals” based on its understanding of these terms.
  • Taxonomy Population: Generative AI can assist in populating the taxonomy with relevant entities or subcategories. For instance, it could suggest that “dogs,” “wolves,” and “foxes” should be included under “canines.”
  • Taxonomy Refinement: As new data is processed, a generative AI model can suggest additions or changes to the taxonomy. This could include adding new categories, modifying the hierarchy, or adding new entities to existing categories.
  • Taxonomy Validation: The model could check for inconsistencies or errors in the taxonomy, such as an entity being included in the wrong category or categories that should be subcategories of another.

Let’s take the previous CO2 emissions data set and try to generate a taxonomy from it.
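Screenshots aside, the output of such a request is typically a simple hierarchy. Below is a hypothetical sketch of the kind of taxonomy an LLM might propose for a CO2 emissions data set, expressed as a nested Python structure (the categories are illustrative only):

```python
# Hypothetical sketch of the kind of taxonomy an LLM might propose for
# a CO2 emissions data set; the categories here are illustrative only.
taxonomy = {
    "Emissions": {
        "Scope 1 (direct)": ["On-site combustion", "Company vehicles"],
        "Scope 2 (energy)": ["Purchased electricity", "Purchased heat"],
        "Scope 3 (value chain)": ["Raw materials", "Transport", "Product use"],
    }
}

def print_tree(node, indent=0):
    # Walk the nested dict/list structure and print one level per line.
    items = node.items() if isinstance(node, dict) else [(leaf, None) for leaf in node]
    for name, children in items:
        print("  " * indent + name)
        if children:
            print_tree(children, indent + 1)

print_tree(taxonomy)
```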

Using Generative AI to build an Ontology

Generative AI models can assist in creating ontologies (source: ChatGPT):

  • Concept Generation: Process vast amounts of data and identify common concepts and entities. By identifying these entities and their relationships, these models can help generate the basic structure of an ontology.
  • Relationship Identification: Trained on a large corpus of data, generative AI models can identify relationships between different entities based on context and use them to build the links in an ontology.
  • Hierarchy Creation: Identify hierarchical relationships between concepts based on the data. This can be used to create the hierarchical structure often seen in ontologies.
  • Ontology Population: Populate the ontology with instances of the identified concepts (involve generating synthetic data or extracting relevant information from the data the model was trained on).
  • Ontology Refinement: Refine and update an existing ontology. By processing new data, the model can identify new concepts or relationships, which can be added to the ontology. It can also identify potential errors or inconsistencies in the ontology, which can then be corrected.

Reusing the previous data set, you can see below the result of Claude’s analysis:

and the explanation of the generated file:
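Since those screenshots are not reproduced here, below is a hedged sketch, built with rdflib, of what such an OWL file might contain; the class and property names are assumptions for illustration:

```python
# Hedged sketch of what the generated OWL file might contain, rebuilt
# with rdflib. Class and property names are assumptions for illustration.
from rdflib import Graph, Namespace, OWL, RDF, RDFS

EX = Namespace("http://example.org/co2#")
g = Graph()
g.bind("ex", EX)

# Two classes, a data property for the emission value, and an object
# property linking products to their emission records.
g.add((EX.Product, RDF.type, OWL.Class))
g.add((EX.EmissionRecord, RDF.type, OWL.Class))
g.add((EX.co2Kg, RDF.type, OWL.DatatypeProperty))
g.add((EX.co2Kg, RDFS.domain, EX.EmissionRecord))
g.add((EX.hasEmission, RDF.type, OWL.ObjectProperty))
g.add((EX.hasEmission, RDFS.domain, EX.Product))
g.add((EX.hasEmission, RDFS.range, EX.EmissionRecord))

g.serialize(destination="co2_ontology.owl", format="xml")  # RDF/XML, Protégé-friendly
```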

Let’s see if Claude can generate a graphical representation of this ontology. As you can see below, it cannot.

But Claude proposed solutions:

I downloaded Protégé and imported the OWL file:

Mixing LLMs with ontologies to improve the quality of results

By mixing LLM capabilities with an ontology, you can drastically improve the quality of your results. The ontology is used to refine both the query and the result, ensuring better accuracy.

The best example I found is PoolParty. It lets you compare the results of questions in the ESG (Environmental, Social, and Governance) subject area submitted directly to ChatGPT with those first filtered through an ESG taxonomy and knowledge graph (managed in the PoolParty software), so that the questions are enriched before being sent to ChatGPT. For example, let’s ask: “Which form of renewable energy contributes the most to the world-wide energy mix?”

The results are shown below.

RAG is a good way to add external content to LLM reasoning, but it lacks a clear description of the terms and their relationships; a semantic graph or an ontology can drastically improve query results.
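The enrichment pattern itself is simple: look up related concepts in the knowledge graph and add them to the prompt before calling the LLM. Here is a minimal sketch with rdflib; the graph content and the enrich helper are hypothetical illustrations:

```python
# Hedged sketch of the enrichment pattern: look up related concepts in a
# knowledge graph and add them to the prompt before calling the LLM.
# The graph content and the enrich() helper are hypothetical illustrations.
from rdflib import Graph, Literal, Namespace, RDFS

EX = Namespace("http://example.org/esg#")
g = Graph()
g.add((EX.SolarPower, RDFS.subClassOf, EX.RenewableEnergy))
g.add((EX.SolarPower, RDFS.label, Literal("solar power")))
g.add((EX.HydroPower, RDFS.subClassOf, EX.RenewableEnergy))
g.add((EX.HydroPower, RDFS.label, Literal("hydropower")))

def enrich(question: str) -> str:
    # Collect labels of concepts under RenewableEnergy and append them
    # as explicit context, so the LLM reasons over controlled terms.
    terms = [
        str(g.value(s, RDFS.label))
        for s in g.subjects(RDFS.subClassOf, EX.RenewableEnergy)
    ]
    return f"{question}\nConsider these forms of renewable energy: {', '.join(terms)}."

print(enrich("Which form of renewable energy contributes the most to the world-wide energy mix?"))
```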

Assist With Copilot

Generative AI assists data management professionals in their day-to-day tasks, either through automated agent-based background tasks or through real-time, in-context support within tools.

Globally, you have three options to augment your data management practice and tools:

  1. Embedded. Use the capabilities natively embedded in your existing data management tools.
  2. Standalone. Create dedicated applications solving particular issues (such as data catalogue accuracy or data quality).
  3. Hybrid. Combine the best of both previous options to leverage Generative AI while interacting with or commanding your data management tools.

Embedded Copilot capabilities

Below are some examples of commercial tools that have already started to provide such copilot capabilities:

Standalone Copilot capabilities

You can also develop your own copilot based on your data management processes and standards. At BCG Platinion Paris, we created our own data quality copilot, named Kali, to leverage years of development in data quality and to codify the key tasks in agents.

BCG Platinion Kali Copilot

Hybrid Copilot capabilities

This is normally done by integrating your generative AI solutions with your current data management tools through their APIs.
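For instance, a hybrid copilot might let an LLM draft documentation and push it into the data catalogue through the catalogue's REST API. The sketch below assumes the OpenAI SDK and a hypothetical catalogue endpoint (catalog.example.com); replace both with your actual tools:

```python
# Hedged sketch of the hybrid pattern: an LLM drafts a description, and
# your data catalogue is updated through its REST API. The catalogue URL
# and endpoint below are hypothetical; replace them with your tool's API.
import requests
from openai import OpenAI

client = OpenAI()

def document_dataset(dataset_name: str, columns: list[str]) -> None:
    # 1. Ask the LLM to draft a business-friendly description.
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a short business description of a data set "
                       f"named '{dataset_name}' with columns: {', '.join(columns)}.",
        }],
    ).choices[0].message.content

    # 2. Push the draft into the (hypothetical) data catalogue for review.
    requests.put(
        f"https://catalog.example.com/api/datasets/{dataset_name}",
        json={"description": draft, "status": "pending_steward_review"},
        timeout=30,
    )

document_dataset("co2_emissions", ["product", "co2_kg", "reporting_year"])
```

Whichever option you choose, keeping a data steward or data product owner in the review loop remains essential, which is why the sketch marks the draft as pending review rather than publishing it directly.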
