AI-Driven Data Quality: Solving the Business Need Gap - Part 2

This article continues from Part 1, exploring the generation of data quality (DQ) rules using Generative AI (GenAI). We conducted preliminary tests to assess the capabilities of a leading large language model (LLM) in generating industry-specific DQ rules. Using a sample banking dataset, we examined two use cases. As discussed in the previous article, the generated output includes not only the rule suggestions but also the rationale behind them.

Use Case 1: Prevent Customer Churn

Suggested DQ Rules:

1. Check for missing values in the "Saving accounts" and "Checking account" fields. Customers with missing information for these fields may indicate a lack of engagement with the bank's products and services, which could be a risk factor for churn.

2. Analyse the "Credit amount" field. Customers with high credit amounts may be more likely to churn if they find better rates or terms elsewhere. Conversely, customers with low credit amounts may be less engaged with the bank's lending products, which could also be a risk factor.

3. Examine the "Duration" field, which likely represents the length of the customer's relationship with the bank. Customers with shorter durations may be at higher risk of churn, as they have not yet established a strong loyalty to the bank.

4. Investigate the "Housing" field, which indicates whether the customer owns or rents their home. Customers who rent may be more transient and therefore at higher risk of churn compared to homeowners who are more established in their location.


Image 1: Prevent Customer Churn
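As an illustration, the first and third suggestions above could be prototyped as simple pandas checks. The column names ("Saving accounts", "Checking account", "Duration") follow the sample banking dataset described earlier; the toy records and the 12-month tenure threshold are assumptions for demonstration only, not values from the actual test.

```python
import pandas as pd

# Hypothetical sample mirroring the banking dataset's column names (an assumption).
df = pd.DataFrame({
    "Saving accounts": ["little", None, "rich", None],
    "Checking account": ["moderate", "little", None, "rich"],
    "Duration": [6, 48, 12, 24],
})

# Rule 1: flag customers missing either account field (possible disengagement).
missing_accounts = df["Saving accounts"].isna() | df["Checking account"].isna()

# Rule 3: flag short-tenure customers; the 12-month cutoff is illustrative.
short_duration = df["Duration"] < 12

churn_risk_flags = df.assign(
    missing_accounts=missing_accounts,
    short_duration=short_duration,
)
```

Flagged rows could then feed a churn-risk review queue rather than being rejected outright, since these rules signal risk factors, not hard data errors.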

Use Case 2: Identify High Net-worth Individuals

Suggested DQ Rules:

1. Age: Implement range checks to identify customers in the prime working age group (e.g. 25-50 years old) who are more likely to have higher income potential.

2. Job: Analyze the job codes to identify customers in high-paying professions like doctors, lawyers, executives, etc.

3. Credit amount: Customers with higher approved credit amounts could indicate higher income and net worth.

4. Checking/Savings accounts: Customers with "rich" or "quite rich" levels in these fields are more likely to be affluent.


Image 2: Identify High Net-Worth Individuals
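Rules 1 and 4 above translate naturally into a range check plus a membership check. This is a minimal sketch; the records are invented, and the 25-50 age band and the "rich"/"quite rich" level names are taken from the suggestions above.

```python
import pandas as pd

# Illustrative records; the column names follow the article's dataset (an assumption).
df = pd.DataFrame({
    "Age": [34, 62, 17, 45],
    "Saving accounts": ["rich", "little", "quite rich", "moderate"],
})

# Rule 1: prime working-age band (25-50, the example range from the suggestion).
prime_age = df["Age"].between(25, 50)

# Rule 4: affluent account levels as named in the dataset.
AFFLUENT_LEVELS = {"rich", "quite rich"}
affluent = df["Saving accounts"].isin(AFFLUENT_LEVELS)

# Candidate high net-worth individuals satisfy both conditions.
hnw_candidates = df[prime_age & affluent]
```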

In the images above, text in bold denotes the proposed DQ rules/checks, and text in italics denotes the reasoning.

Above are the actual results from a simple test, run without much supporting functional documentation in the form of a RAG knowledge base, prompt engineering or model fine-tuning, human-in-the-loop feedback and curation, etc. The results clearly show that more work is needed to improve the quality of the recommended rules, both in terms of DQ-style verbiage and in terms of comprehensiveness, but that is for another day.

Below are some pointers on how the proposed rules can be used:

Data Profiling and EDA

  • The suggested rules provide an opportunity to perform advanced data profiling and exploratory data analysis in a manner tailored to the industry use case. E.g. for the use case of preventing customer churn, evaluate how many customers fall in the top and bottom percentile/quartile ranges.
  • This would complement the pure-play data profiling capabilities of different tools, e.g. completeness, uniqueness, etc.
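The quartile analysis mentioned in the first bullet might be sketched as follows. The credit amounts here are toy values; in practice the series would come from the full customer table.

```python
import pandas as pd

# Toy "Credit amount" values standing in for the real customer table.
credit = pd.Series([500, 1200, 3400, 7800, 15000, 2200, 900, 6100])

# Quartile boundaries: EDA that goes beyond plain completeness/uniqueness profiling.
q1, q3 = credit.quantile([0.25, 0.75])

# Customers in the bottom and top quartiles, per the churn use case.
bottom = credit[credit <= q1]
top = credit[credit >= q3]
```

Counts (or percentages) of customers in each band can then be tracked over time as an engagement signal.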

Data Cleansing and Transformation

Leverage insights from profiling and EDA to create rules –

  • Standardize job codes into broader categories. E.g. for the use case of identifying High Net-worth Individuals, titles like Doctor, Physician, Medical Practitioner, Surgeon, and Medic can all be standardized to ‘Doctor’, and likewise for other professions.
  • Implement range checks on Age field.
  • Clearly define “rich” and “quite rich” in the Savings/Checking account fields, and validate against data in other fields
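The job-code standardization in the first bullet is essentially a lookup-table transformation. A minimal sketch, where the synonym list and target labels are assumptions beyond the examples given in the text:

```python
# Illustrative mapping; only the Doctor synonyms come from the article,
# the Lawyer entries are added assumptions.
JOB_STANDARDIZATION = {
    "doctor": "Doctor",
    "physician": "Doctor",
    "medical practitioner": "Doctor",
    "surgeon": "Doctor",
    "medic": "Doctor",
    "lawyer": "Lawyer",
    "attorney": "Lawyer",
}

def standardize_job(raw: str) -> str:
    """Map a free-text job title to a broad category; pass unknown titles through."""
    return JOB_STANDARDIZATION.get(raw.strip().lower(), raw.strip())
```

In a real pipeline the mapping would be curated with business stakeholders, and unmapped titles would be logged for review rather than silently passed through.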

Data Enrichment and Feature Engineering

  • Derive new features, such as income range, net worth ranges or create a composite “affluence score” based on multiple attributes.
  • Incorporate external data, such as census or industry-specific income data, for feature enhancement. This information may not exist in the original data, but onboarding it into the mix might aid the business use case, so its availability and feasibility of integration should be checked.
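A composite “affluence score” of the kind mentioned above could be derived like this. The weights, the level-to-score mapping, and the 10,000 credit ceiling are all illustrative assumptions, not values derived from data:

```python
# Ordinal scores for the account-level labels used in the dataset (an assumption).
LEVEL_SCORE = {"little": 0, "moderate": 1, "quite rich": 2, "rich": 3}

def affluence_score(savings_level: str,
                    checking_level: str,
                    credit_amount: float) -> float:
    """Combine account levels and credit amount into a single 0-1 score."""
    # Account component: sum of the two level scores, normalized by the max (3 + 3).
    account_part = (LEVEL_SCORE.get(savings_level, 0)
                    + LEVEL_SCORE.get(checking_level, 0)) / 6
    # Credit component: capped at an assumed 10,000 ceiling.
    credit_part = min(credit_amount / 10_000, 1.0)
    # Weighted blend; the 0.7/0.3 split is arbitrary and would need tuning.
    return round(0.7 * account_part + 0.3 * credit_part, 3)
```

Such a derived feature could then feed both the high net-worth identification use case and downstream churn models.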

Data Quality Rules and Validation

  • Formalize the suggested (translated and verified) rules as data quality rules
  • Leverage the rules for validation during data ingestion or during model training and analysis
  • Use GenAI further to convert the finalized DQ rules into a technical code base
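One way to formalize the verified rules is a small rule registry evaluated at ingestion time. This is a minimal sketch; the rule names, predicates, and valid-value lists are illustrative assumptions:

```python
import pandas as pd

# A minimal DQ rule registry: each rule returns a boolean "passes" Series.
DQ_RULES = {
    "age_in_range": lambda df: df["Age"].between(18, 100),
    "savings_level_valid": lambda df: df["Saving accounts"].isin(
        ["little", "moderate", "quite rich", "rich"]
    ),
}

def validate(df: pd.DataFrame) -> dict:
    """Run every registered rule and report the count of failing rows,
    e.g. as a gate during data ingestion or before model training."""
    return {name: int((~rule(df)).sum()) for name, rule in DQ_RULES.items()}
```

Keeping rules in a registry like this also gives GenAI a clear target format when converting finalized DQ rules into code.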


Image 3: Methods of leveraging the GenAI suggested DQ rules

This initial exploration of GenAI for industry-specific data quality rule generation shows early promise. While the results are encouraging, further rigorous refinements, testing, validation, applying principles of data safety and security are required to assess the scalability, reliability, and overall effectiveness of this approach.

Thanks to Napkin AI for providing the platform which helped me generate the visuals.
