Accelerating your data strategy with Augmented Data Management capabilities provided by Generative AI
William E.
From IT strategy to digital transformation and implementation at scale (Business Platforms, Data and Generative AI, Enterprise Architecture including governance, integration (API, EDA), technology innovation)
According to Gartner, poor data quality costs organizations an average of $12.9 million per year, a figure that better data management can claw back. These financial improvements are achieved in many ways, such as enabling more automated processes, reducing the time employees spend searching for the data they need, and informing more accurate business decision-making.
This blog post will discuss how data strategy and enterprise data management capabilities can be improved with Generative AI. For the sake of clarity and brevity, I will use the “standard” Generative AI chat interfaces from Google (Gemini), OpenAI (ChatGPT) and Anthropic (Claude).
Enterprise Data strategy as a first step
Before launching proofs of concept, proofs of value or prototypes, you need to define your data strategy.
The data strategy is defined per business domain and across domains, and sets out the main business outcomes to be achieved.
A data strategy can be based on generic goals, such as “I want to reduce the time employees lose on manual, tedious tasks by two hours per week”, or be very specific to a business domain, such as “I want to improve the accuracy of my client data”.
Use cases will then be created, each with a clearly stated value and a way to measure it systematically and automatically. Use cases should be aligned with company data governance (and if governance is not yet in place, it should be established in an agile way). The data behind each use case should be clearly described, categorized and governed. You must then evaluate its current quality and identify the key data sources involved.
Looking at all use cases, their value and desirability, the data they need and its quality, you will be able to prioritize them. Less mature companies can deliver up to three use cases per year (from design through development to production and deployment in the business domain), while more mature ones can deliver around 15.
Use case realization depends on the data team’s maturity, size and skills, but also on the platform tools and services available from the data factory (or data office).
Finally, a good data strategy always comes with an acculturation plan and with building a community of data champions.
From use cases to data products
Another key question to ask is whether you want to adopt a data product approach. A data product is a data-intensive asset directly usable by data consumers. Introduced by Zhamak Dehghani, the data mesh advocates treating data as a product, pushing ownership deeper into the producing teams. The five fundamentals of data mesh are: data as a product, domain orientation, self-serve infrastructure, federated governance (each domain is responsible for its own product) and an agile way of working.
The theory behind this is that ownership is better aligned with data creation, which brings more visibility and quality to the data. This “shift left” of responsibilities addresses the data explosion, which has overwhelmed central data teams tasked with cleansing, harmonizing and integrating data for the larger good.
Domain-driven design and data products go hand in hand because they espouse the idea of keeping data as the responsibility of the teams that produce the data and control its source. Delegating responsibility for specific data sets to data domains means that enterprises are split into domains, each taking responsibility for building abstractions, serving data, maintaining associated metadata, improving data quality, applying life cycle management, performing code control and so on.
Data is then a first-class citizen and a product of the domain. A new role, the data product owner, is created to own and manage the data product. The data product owner ensures that developments are aligned with downstream consumers and that business outcomes are met, measuring the product’s value continuously.
Each domain is responsible for ingesting, processing, and serving its data products to downstream consumers. Data engineering and software engineering must then be tightly aligned, and ideally part of the same functional team, so they can work with a data product owner to produce high-quality, curated data products.
Data products must be part of the data strategy; they help accelerate its realization by delegating the work to dedicated teams across the company instead of a centralized data factory.
Generative AI is an accelerator of the realization of the data strategy
An enterprise data strategy can benefit from Gen AI at all stages, which could lead to the following major benefits:
It also requires you to:
Generative AI empowers data management with 3 new capabilities
While AI-powered contextualization is compelling, the true power of the new Generative AI technologies lies in their ability to codify human expertise by creating new and better data.
The three capabilities are: connect the dots, offer polyglotism and assist with copilot.
Connect the dots
The main objectives are to:
In particular, Generative AI natively offers a capability to improve data quality. Let’s look at what OpenAI ChatGPT and Google Gemini can do.
Anthropic Claude is also quite powerful.
Let’s check the quality of a file describing the CO2 emissions of products, without providing any additional information about the data.
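For illustration, assume the uploaded file looks something like this (columns and values are hypothetical, with a few deliberate defects: a duplicated row, inconsistent casing, a negative emission and a missing value):

```csv
product_id,product_name,co2_kg,country
P001,Laptop 14in,156.2,FR
P002,  smartphone ,55,fr
P002,  smartphone ,55,fr
P003,Desk Chair,-3.1,DE
P004,Monitor,,US
```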
Analysis is done nearly immediately:
Claude also proposes ways to improve the data quality of the uploaded data set:
I can ask Claude to improve the content, generate a new data set and let me know what was changed:
Of course, Claude details all the changes made, enabling me to track and validate them:
In order to automate this data quality improvement, I asked Claude to generate the Python code. This code could then be used to automate data quality through dedicated data engineering pipelines.
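As a hedged illustration, the generated pipeline code might look like the following sketch; the column names (product_name, co2_kg, country) match the hypothetical file above, not Claude’s actual output:

```python
# A minimal sketch of the kind of cleaning code an LLM might generate.
# Column names are hypothetical and must be adapted to the real file.
import pandas as pd

def clean_co2_dataset(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)

    # Drop exact duplicate rows
    df = df.drop_duplicates()

    # Normalize text columns (trim whitespace, consistent casing)
    df["product_name"] = df["product_name"].str.strip().str.title()
    df["country"] = df["country"].str.strip().str.upper()

    # Coerce the emission column to numeric; invalid entries become NaN
    df["co2_kg"] = pd.to_numeric(df["co2_kg"], errors="coerce")

    # Flag impossible values (negative or missing emissions)
    # instead of silently dropping the rows
    df["quality_flag"] = df["co2_kg"].lt(0) | df["co2_kg"].isna()

    return df

if __name__ == "__main__":
    cleaned = clean_co2_dataset("co2_products.csv")
    print(cleaned.head())
```

Flagging suspicious rows rather than deleting them keeps the pipeline auditable, which matters once such code runs unattended.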
Claude will also explain how the code is built …
If you prefer to use the Great Expectations library, Claude can do that too.
The result is followed by a description of the Python code structure.
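For reference, a minimal sketch of the same checks with Great Expectations, assuming the classic pandas-backed API of pre-1.0 releases (the API changed significantly in later versions):

```python
# Classic Great Expectations API: wrap the CSV in a validating DataFrame,
# declare expectations, then run them. Column names are the same
# hypothetical ones as above.
import great_expectations as ge

df = ge.read_csv("co2_products.csv")

df.expect_column_values_to_not_be_null("product_id")
df.expect_column_values_to_be_unique("product_id")
df.expect_column_values_to_be_between("co2_kg", min_value=0)

results = df.validate()
print(results.success)
```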
Now let’s see if Claude can generate a JSON file that will let me view my data as a report in Power BI:
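One hedged interpretation of that step: export the cleaned data as a JSON file that Power BI can load through its Get Data > JSON connector (file names are assumptions):

```python
# Export the cleaned data set as a records-oriented JSON file,
# a format Power BI ingests directly via its JSON connector.
import pandas as pd

df = pd.read_csv("co2_products_cleaned.csv")
df.to_json("co2_report.json", orient="records", indent=2)
```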
Offer Polyglotism
Generative AI natively handles data management concepts and frameworks (or can be trained on them) and can generate (meta)models and views on demand, acting as a translator that makes it easy to move between multiple data formats and representations.
Polyglotism in data modeling provides five main advantages (see: Tshepiso Mogoswane):
We will take several examples to show the value of Generative AI in data modelling, taxonomy and ontology building, database schema evolution, and database retro-engineering.
Generative AI can help in data modeling
Let’s reuse Tshepiso Mogoswane’s blog post, Exploring Data Modelling with ChatGPT. The data model used is described here.
I can now ask ChatGPT to generate a Data Vault 2.0 model from it, with hubs for business keys, links for relationships and satellites for descriptive attributes, to be used in my data warehouse.
And the result is described below.
Retro-engineering a database
One of the powerful capabilities of LLMs is the retro-engineering of databases. This section is based on a Towards Data Science blog post, and we will use the data set proposed there. It represents a “MyCompany” HR system export of all employees, containing many details about the company as well. Note that some of the data, such as “Salary”, “Age” or “Annual_Evaluation”, is confidential.
Let’s see what ChatGPT can do:
Then we can look at the UML diagram in a Mermaid editor.
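If you prefer scripting this step over the chat UI, here is a minimal sketch using the OpenAI Python SDK; the model name, prompt wording and file name are assumptions, and only the schema, not the confidential data, is sent:

```python
# Sketch: ask an LLM to infer a relational model from column names and
# types only, and return it as a Mermaid erDiagram. Never send the
# confidential columns (Salary, Age, ...) themselves to an external API.
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

df = pd.read_csv("mycompany_hr_export.csv")
schema_hint = "\n".join(f"{col}: {dtype}" for col, dtype in df.dtypes.items())

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Infer a normalized relational model from these columns "
                   "and output it as a Mermaid erDiagram:\n" + schema_hint,
    }],
)
print(response.choices[0].message.content)
```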
Using Generative AI to build a Taxonomy
Generative AI models can assist in creating a taxonomy. The main capabilities, as summarized by ChatGPT, are:
Let’s take the previous example with the CO2 emission data set and try to generate a taxonomy from it.
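To make the target concrete, here is a hypothetical fragment of the kind of taxonomy a model might propose for such a data set (all categories are illustrative, not Claude’s actual output):

```python
# A two-level taxonomy expressed as a plain nested structure;
# every label here is an illustrative assumption.
taxonomy = {
    "Emissions": {
        "Scope": ["Scope 1", "Scope 2", "Scope 3"],
        "Unit": ["kg CO2e", "t CO2e"],
    },
    "Product": {
        "Category": ["Electronics", "Food", "Textiles"],
        "Lifecycle stage": ["Manufacturing", "Transport", "Use", "End of life"],
    },
}
```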
Using Generative AI to build an Ontology
Generative AI models can assist in creating ontologies (source: ChatGPT):
Reusing the previous data set, you can see below the result of Claude’s analysis:
and the explanation of the file generated:
Let’s see if Claude can generate a graphical representation of this ontology. As you can see below, it cannot.
But Claude proposed solutions:
I downloaded Protégé and imported the OWL file:
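If you would rather generate such an OWL file programmatically than via chat, a minimal rdflib sketch could look like this (class and property names are illustrative assumptions):

```python
# Build a tiny OWL ontology for the CO2 data set with rdflib and
# serialize it to RDF/XML, which Protégé opens directly.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/co2#")
g = Graph()
g.bind("ex", EX)

# Two classes and the relationship between them
g.add((EX.Product, RDF.type, OWL.Class))
g.add((EX.EmissionRecord, RDF.type, OWL.Class))
g.add((EX.hasEmission, RDF.type, OWL.ObjectProperty))
g.add((EX.hasEmission, RDFS.domain, EX.Product))
g.add((EX.hasEmission, RDFS.range, EX.EmissionRecord))

# A datatype property for the measured value
g.add((EX.co2Kg, RDF.type, OWL.DatatypeProperty))
g.add((EX.co2Kg, RDFS.domain, EX.EmissionRecord))

g.serialize(destination="co2_ontology.owl", format="xml")
```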
Mixing LLM with ontologies to improve the quality of the results
By mixing LLM capabilities with an ontology, you can drastically improve the quality of your results. The ontology is used to refine both the query and the result to ensure better accuracy.
The best example I found is PoolParty. It lets you compare the answers to questions in the subject area of ESG (Environmental, Social, and Governance) that are submitted directly to ChatGPT with those that are first enriched through an ESG taxonomy and knowledge graph (managed in the PoolParty software) before being sent to ChatGPT. For example, let’s ask: “Which form of renewable energy contributes the most to the world-wide energy mix?”
Results are shown below.
RAG is a good solution for adding external content to LLM reasoning, but on its own it lacks a clear description of the terms and their relationships. A semantic graph or an ontology can drastically improve the query results.
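To make the enrichment idea concrete, here is a minimal sketch, not PoolParty’s actual mechanism: question terms are looked up in a local SKOS taxonomy and their definitions are prepended to the prompt (the graph file and its labels are assumptions):

```python
# Enrich a user question with definitions from a local SKOS taxonomy
# before sending it to an LLM. esg_taxonomy.ttl is a hypothetical file.
from rdflib import Graph
from rdflib.namespace import SKOS

g = Graph()
g.parse("esg_taxonomy.ttl", format="turtle")

def enrich(question: str) -> str:
    context = []
    # Collect definitions for every concept whose label appears in the question
    for concept, _, label in g.triples((None, SKOS.prefLabel, None)):
        if str(label).lower() in question.lower():
            for _, _, definition in g.triples((concept, SKOS.definition, None)):
                context.append(f"{label}: {definition}")
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question

print(enrich("Which form of renewable energy contributes the most "
             "to the world-wide energy mix?"))
```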
Assist With Copilot
Generative AI assists data management professionals in their day-to-day tasks, either through automated, agent-based background tasks or through real-time, in-context support within tools.
Broadly, you have three options to augment your data management practice and tools:
Embedded Copilot capabilities
Below, I provide some examples of commercial tools that have already started to offer such copilot capabilities:
Standalone Copilot capabilities
You can also develop your own copilot based on your data management processes and standards. We, BCG Platinion Paris, created our own data quality copilot, named Kali, to leverage years of development in data quality and to codify the key tasks as agents.
Hybrid Copilot capabilities
This is normally done by integrating your generative AI solutions with the APIs of your current data management tools.
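As a hypothetical sketch of such an integration, imagine a script that pulls a table’s metadata from a data catalog REST API, asks an LLM to draft a business description, and writes it back; the catalog endpoints below are invented for illustration, and only the OpenAI client calls reflect a real SDK:

```python
# Hybrid copilot sketch: catalog REST API (hypothetical) + LLM (real SDK).
import requests
from openai import OpenAI

CATALOG = "https://catalog.example.com/api"  # hypothetical endpoint
client = OpenAI()

# Fetch the metadata of a table from the catalog
table = requests.get(f"{CATALOG}/tables/sales_orders").json()

# Ask the LLM to draft a business-friendly description
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Write a one-paragraph business description for a table "
                   f"with these columns: {table['columns']}",
    }],
)

# Write the draft back to the catalog for human review
requests.patch(
    f"{CATALOG}/tables/sales_orders",
    json={"description": response.choices[0].message.content},
)
```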