Accelerating your data strategy with Augmented Data Management capabilities provided by Generative AI
William E.
From IT strategy to digital transformation and implementation at scale (Business Platforms, Data and Generative AI, Enterprise Architecture including governance, integration (API, EDA), technology innovation)
According to Gartner, poor data quality costs organizations an average of $12.9 million per year, a figure that better data management can claw back. These financial improvements are achieved in many ways, such as enabling more automated processes, reducing the time employees spend searching for the data they need, and informing more accurate business decision-making.
This blog post will discuss how data strategy and enterprise data management capabilities can be improved with Generative AI. For the sake of clarity and brevity, I will use the “standard” Generative AI chat interfaces from Google (Gemini), OpenAI (ChatGPT) and Anthropic (Claude).
Enterprise Data strategy as a first step
Before launching proofs of concept, proofs of value or prototypes, you need to define your data strategy.
The data strategy is defined per business domain and across domains, and sets out the main business outcomes to be achieved.
A data strategy can be based on generic goals, such as “I want to reduce the time employees lose on manual, tedious tasks by two hours per week”, or be very specific to a business domain, such as “I want to improve the accuracy of my client data”.
Use cases will then be created, each with a clearly stated value and a way to measure it systematically and automatically. Use cases should be aligned with company data governance (and if governance is not yet in place, it should be established in an agile way). The data behind each use case should be clearly described, categorized and governed. You must then evaluate its current quality and identify the key data sources involved.
Looking at all use cases, their value and desirability, the data they need and its quality, you will be able to prioritize them. Less mature companies can deliver up to three use cases per year (from design through development to production and deployment in the business domain), while more mature ones can deliver around 15.
Use case realization depends on the data team’s maturity, size and skills, but also on the platform tools and services available from the data factory (or data office).
Finally, a good data strategy always comes with an acculturation plan and with building a community of data champions.
From use cases to data products
Another key question to ask is whether you want to adopt a data product approach. A data product is a data-intensive asset directly usable by data consumers. Introduced by Zhamak Dehghani, the data mesh advocates treating data as a product, pushing ownership deeper into the producing teams. The five fundamentals of data mesh are: data as a product, domain orientation, self-serve infrastructure, federated governance (each domain is responsible for its own product) and an agile way of working.
The theory behind this is that ownership is better aligned with data creation, which brings more visibility and quality to the data. This “shift left” of responsibilities addresses the data explosion, which has overwhelmed central data teams tasked with cleansing, harmonizing and integrating data for the larger good.
Domain-driven design and data products go hand in hand because they espouse the idea of keeping data as the responsibility of the teams that produce the data and control its source. Delegating responsibility for specific data sets to data domains means that enterprises are split into domains, each taking responsibility for building abstractions, serving data, maintaining associated metadata, improving data quality, applying life cycle management, performing code control and so on.
Data is then a first-class citizen and a product of the domain. A new role, the data product owner, is created to own and manage the data product. The data product owner ensures that developments are aligned with downstream consumers and that business outcomes are met, measuring the product’s value continuously.
Each domain is responsible for ingesting, processing, and serving its data products to downstream consumers. Data engineering and software engineering must then be tightly aligned, and ideally part of the same functional team, so they can work with a data product owner to produce high-quality, curated data products.
Data products must be part of the data strategy; they help accelerate its realization by delegating the work to dedicated teams across the company instead of a centralized data factory.
Generative AI is an accelerator of the realization of the data strategy
An enterprise data strategy can benefit from Gen AI at all stages, which could lead to the following major benefits:
It also requires you to:
Generative AI empowers data management with 3 new capabilities
While AI-powered contextualization is compelling, the true power of the new Generative AI technologies lies in their ability to codify human expertise by creating new and better data.
The three capabilities are: connect the dots, offer polyglotism and assist with copilot.
Connect the dots
The main objectives are to:
In particular, Generative AI natively offers a capability to improve data quality. Let’s look at what OpenAI ChatGPT and Google Gemini can do.
Anthropic Claude is also quite powerful.
Let’s check the quality of a file describing the CO2 emissions of products, without providing any additional information about the data.
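For illustration, assume the uploaded file looks something like this (columns and values are hypothetical, with a few deliberate defects: a duplicated row, inconsistent casing, a negative emission and a missing value):

```csv
product_id,product_name,co2_kg,country
P001,Laptop 14in,156.2,FR
P002,  smartphone ,55,fr
P002,  smartphone ,55,fr
P003,Desk Chair,-3.1,DE
P004,Monitor,,US
```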
Analysis is done nearly immediately:
Claude also proposes ways to improve the data quality of the uploaded data set:
I can ask Claude to improve the content, generate a new data set and let me know what was changed:
Of course, Claude details all the changes made, enabling me to track and validate them:
In order to automate this data quality improvement, I asked Claude to generate the Python code. This code could then be used to automate data quality through dedicated data engineering pipelines.
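As a hedged illustration, the generated pipeline code might look like the following sketch; the column names (product_name, co2_kg, country) match the hypothetical file above, not Claude’s actual output:

```python
# A minimal sketch of the kind of cleaning code an LLM might generate.
# Column names are hypothetical and must be adapted to the real file.
import pandas as pd

def clean_co2_dataset(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)

    # Drop exact duplicate rows
    df = df.drop_duplicates()

    # Normalize text columns (trim whitespace, consistent casing)
    df["product_name"] = df["product_name"].str.strip().str.title()
    df["country"] = df["country"].str.strip().str.upper()

    # Coerce the emission column to numeric; invalid entries become NaN
    df["co2_kg"] = pd.to_numeric(df["co2_kg"], errors="coerce")

    # Flag impossible values (negative or missing emissions)
    # instead of silently dropping the rows
    df["quality_flag"] = df["co2_kg"].lt(0) | df["co2_kg"].isna()

    return df

if __name__ == "__main__":
    cleaned = clean_co2_dataset("co2_products.csv")
    print(cleaned.head())
```

Flagging suspicious rows rather than deleting them keeps the pipeline auditable, which matters once such code runs unattended.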
Claude will also explain how the code is built …
If you prefer to use the Great Expectations library, Claude can do that too.
The result is followed by a description of the Python code structure.
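For reference, a minimal sketch of the same checks with Great Expectations, assuming the classic pandas-backed API of pre-1.0 releases (the API changed significantly in later versions):

```python
# Classic Great Expectations API: wrap the CSV in a validating DataFrame,
# declare expectations, then run them. Column names are the same
# hypothetical ones as above.
import great_expectations as ge

df = ge.read_csv("co2_products.csv")

df.expect_column_values_to_not_be_null("product_id")
df.expect_column_values_to_be_unique("product_id")
df.expect_column_values_to_be_between("co2_kg", min_value=0)

results = df.validate()
print(results.success)
```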
Now let’s see if Claude can generate a JSON file that will let me view my data as a report in Power BI:
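One hedged interpretation of that step: export the cleaned data as a JSON file that Power BI can load through its Get Data > JSON connector (file names are assumptions):

```python
# Export the cleaned data set as a records-oriented JSON file,
# a format Power BI ingests directly via its JSON connector.
import pandas as pd

df = pd.read_csv("co2_products_cleaned.csv")
df.to_json("co2_report.json", orient="records", indent=2)
```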
Offer Polyglotism
Generative AI natively handles data management concepts and frameworks (or can be trained on them) and can generate (meta)models and views on demand, acting as a translator that makes it easy to move between multiple data formats and representations.
Polyglotism in data modeling provides five main advantages (see: Tshepiso Mogoswane):
We will take several examples to show the value of Generative AI in data modelling, taxonomy and ontology building, database schema evolution, and database retro-engineering.
Generative AI can help in data modeling
Let’s reuse Tshepiso Mogoswane’s blog post, Exploring Data Modelling with ChatGPT. The data model used is described here.
I can now ask ChatGPT to generate a Data Vault 2.0 model from it, with hubs for business keys, links for relationships and satellites for descriptive attributes, to be used in my data warehouse.
And the result is described below.
Retro-engineering a database
One of the powerful capabilities of LLMs is the retro-engineering of databases. This section is based on a Towards Data Science blog post, and we will use the data set proposed there. It represents a “MyCompany” HR system export of all employees, containing many details about the company as well. Note that some of the data, such as “Salary”, “Age” or “Annual_Evaluation”, is confidential.
Let’s see what ChatGPT can do:
Then we can look at the UML diagram in a Mermaid editor.
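If you prefer scripting this step over the chat UI, here is a minimal sketch using the OpenAI Python SDK; the model name, prompt wording and file name are assumptions, and only the schema, not the confidential data, is sent:

```python
# Sketch: ask an LLM to infer a relational model from column names and
# types only, and return it as a Mermaid erDiagram. Never send the
# confidential columns (Salary, Age, ...) themselves to an external API.
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

df = pd.read_csv("mycompany_hr_export.csv")
schema_hint = "\n".join(f"{col}: {dtype}" for col, dtype in df.dtypes.items())

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Infer a normalized relational model from these columns "
                   "and output it as a Mermaid erDiagram:\n" + schema_hint,
    }],
)
print(response.choices[0].message.content)
```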
Using Generative AI to build a Taxonomy
Generative AI models can assist in creating a taxonomy. The main capabilities, as summarized by ChatGPT, are:
Let’s take the previous example with the CO2 emission data set and try to generate a taxonomy from it.
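To make the target concrete, here is a hypothetical fragment of the kind of taxonomy a model might propose for such a data set (all categories are illustrative, not Claude’s actual output):

```python
# A two-level taxonomy expressed as a plain nested structure;
# every label here is an illustrative assumption.
taxonomy = {
    "Emissions": {
        "Scope": ["Scope 1", "Scope 2", "Scope 3"],
        "Unit": ["kg CO2e", "t CO2e"],
    },
    "Product": {
        "Category": ["Electronics", "Food", "Textiles"],
        "Lifecycle stage": ["Manufacturing", "Transport", "Use", "End of life"],
    },
}
```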
Using Generative AI to build an Ontology
Generative AI models can assist in creating ontologies (source: ChatGPT):
Reusing the previous data set, you can see below the result of Claude’s analysis:
and the explanation of the file generated:
Let’s see if Claude can generate a graphical representation of this ontology. As you can see below, it cannot.
But Claude proposed solutions:
I downloaded Protégé and imported the OWL file:
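If you would rather generate such an OWL file programmatically than via chat, a minimal rdflib sketch could look like this (class and property names are illustrative assumptions):

```python
# Build a tiny OWL ontology for the CO2 data set with rdflib and
# serialize it to RDF/XML, which Protégé opens directly.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/co2#")
g = Graph()
g.bind("ex", EX)

# Two classes and the relationship between them
g.add((EX.Product, RDF.type, OWL.Class))
g.add((EX.EmissionRecord, RDF.type, OWL.Class))
g.add((EX.hasEmission, RDF.type, OWL.ObjectProperty))
g.add((EX.hasEmission, RDFS.domain, EX.Product))
g.add((EX.hasEmission, RDFS.range, EX.EmissionRecord))

# A datatype property for the measured value
g.add((EX.co2Kg, RDF.type, OWL.DatatypeProperty))
g.add((EX.co2Kg, RDFS.domain, EX.EmissionRecord))

g.serialize(destination="co2_ontology.owl", format="xml")
```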
Mixing LLM with ontologies to improve the quality of the results
By mixing LLM capabilities with an ontology, you can drastically improve the quality of your results. The ontology is used to refine both the query and the result to ensure better accuracy.
The best example I found is PoolParty. It lets you compare the answers to questions in the subject area of ESG (Environmental, Social, and Governance) that are submitted directly to ChatGPT with those that are first enriched through an ESG taxonomy and knowledge graph (managed in the PoolParty software) before being sent to ChatGPT. For example, let’s ask: “Which form of renewable energy contributes the most to the world-wide energy mix?”
Results are shown below.
RAG is a good solution for adding external content to LLM reasoning, but on its own it lacks a clear description of the terms and their relationships. A semantic graph or an ontology can drastically improve the query results.
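To make the enrichment idea concrete, here is a minimal sketch, not PoolParty’s actual mechanism: question terms are looked up in a local SKOS taxonomy and their definitions are prepended to the prompt (the graph file and its labels are assumptions):

```python
# Enrich a user question with definitions from a local SKOS taxonomy
# before sending it to an LLM. esg_taxonomy.ttl is a hypothetical file.
from rdflib import Graph
from rdflib.namespace import SKOS

g = Graph()
g.parse("esg_taxonomy.ttl", format="turtle")

def enrich(question: str) -> str:
    context = []
    # Collect definitions for every concept whose label appears in the question
    for concept, _, label in g.triples((None, SKOS.prefLabel, None)):
        if str(label).lower() in question.lower():
            for _, _, definition in g.triples((concept, SKOS.definition, None)):
                context.append(f"{label}: {definition}")
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question

print(enrich("Which form of renewable energy contributes the most "
             "to the world-wide energy mix?"))
```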
Assist With Copilot
Generative AI assists data management professionals in their day-to-day tasks, either through automated, agent-based background tasks or through real-time, in-context support within tools.
Broadly, you have three options to augment your data management practice and tools:
Embedded Copilot capabilities
Below, I provide some examples of commercial tools that have already started to offer such copilot capabilities:
Standalone Copilot capabilities
You can also develop your own copilot based on your data management processes and standards. We, BCG Platinion Paris, created our own data quality copilot, named Kali, to leverage years of development in data quality and to codify the key tasks as agents.
Hybrid Copilot capabilities
This is normally done by integrating your generative AI solutions with the APIs of your current data management tools.
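As a hypothetical sketch of such an integration, imagine a script that pulls a table’s metadata from a data catalog REST API, asks an LLM to draft a business description, and writes it back; the catalog endpoints below are invented for illustration, and only the OpenAI client calls reflect a real SDK:

```python
# Hybrid copilot sketch: catalog REST API (hypothetical) + LLM (real SDK).
import requests
from openai import OpenAI

CATALOG = "https://catalog.example.com/api"  # hypothetical endpoint
client = OpenAI()

# Fetch the metadata of a table from the catalog
table = requests.get(f"{CATALOG}/tables/sales_orders").json()

# Ask the LLM to draft a business-friendly description
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Write a one-paragraph business description for a table "
                   f"with these columns: {table['columns']}",
    }],
)

# Write the draft back to the catalog for human review
requests.patch(
    f"{CATALOG}/tables/sales_orders",
    json={"description": response.choices[0].message.content},
)
```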