AI-Driven Data Quality: Solving the Business Need Gap - Part 2
Rohit Bajaj
Solution Architect, Prompt Engineering, Generative AI, MDM 2x, Azure 2x, DG 2x, CDP 1x, Oracle 2x, Alteryx 2x
This article continues from Part 1, exploring the generation of data quality (DQ) rules using Generative AI (GenAI). We conducted preliminary tests to assess the capabilities of a leading large language model (LLM) in generating industry-specific DQ rules. Using a sample banking dataset, we examined two use cases. As discussed in the previous article, the generated output includes not only the rule suggestions but also the rationale behind them.
Use Case 1: Prevent Customer Churn
Suggested DQ Rules:
1. Check for missing values in the "Saving accounts" and "Checking account" fields. Customers with missing information for these fields may indicate a lack of engagement with the bank's products and services, which could be a risk factor for churn.
2. Analyse the "Credit amount" field. Customers with high credit amounts may be more likely to churn if they find better rates or terms elsewhere. Conversely, customers with low credit amounts may be less engaged with the bank's lending products, which could also be a risk factor.
3. Examine the "Duration" field, which likely represents the length of the customer's relationship with the bank. Customers with shorter durations may be at higher risk of churn, as they have not yet established a strong loyalty to the bank.
4. Investigate the "Housing" field, which indicates whether the customer owns or rents their home. Customers who rent may be more transient and therefore at higher risk of churn compared to homeowners who are more established in their location.
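As a rough illustration, rule 1 above could be turned into an executable completeness check along the following lines. This is a minimal sketch, not the approach used in the test: the sample rows, the list of tokens treated as "missing", and the helper names `is_missing`, `missing_rate` and `rows_missing_any` are all assumptions made for the example.

```python
from typing import Any

# Hypothetical sample rows: the column names come from the rules above,
# the values are invented purely for illustration.
ROWS: list[dict[str, Any]] = [
    {"Saving accounts": "little", "Checking account": "moderate"},
    {"Saving accounts": None, "Checking account": "little"},
    {"Saving accounts": "", "Checking account": None},
    {"Saving accounts": "rich", "Checking account": "NA"},
]

# Assumption: these tokens (case-insensitive) all mean "no value recorded".
MISSING_TOKENS = {"", "na", "n/a", "null", "none"}


def is_missing(value: Any) -> bool:
    """Return True when a cell is None or holds a known 'missing' token."""
    if value is None:
        return True
    return str(value).strip().lower() in MISSING_TOKENS


def missing_rate(rows: list[dict[str, Any]], field: str) -> float:
    """Share of rows (0.0 to 1.0) in which `field` is missing."""
    if not rows:
        return 0.0
    return sum(is_missing(row.get(field)) for row in rows) / len(rows)


def rows_missing_any(rows: list[dict[str, Any]], fields: list[str]) -> list[int]:
    """Indexes of rows in which at least one of `fields` is missing."""
    return [
        i for i, row in enumerate(rows)
        if any(is_missing(row.get(f)) for f in fields)
    ]


if __name__ == "__main__":
    for field in ("Saving accounts", "Checking account"):
        print(field, missing_rate(ROWS, field))
    print(rows_missing_any(ROWS, ["Saving accounts", "Checking account"]))
```

A check like this only measures completeness. Whether a missing value really signals low engagement, as the generated rationale suggests, would still need to be confirmed with the business.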
Use Case 2: Identify High Net-worth Individuals
Suggested DQ Rules:
1. Age: Implement range checks to identify customers in the prime working age group (e.g. 25-50 years old) who are more likely to have higher income potential.
2. Job: Analyze the job codes to identify customers in high-paying professions like doctors, lawyers, executives, etc.
3. Credit amount: Customers with higher approved credit amounts could indicate higher income and net worth.
4. Checking/Savings accounts: Customers with "rich" or "quite rich" levels in these fields are more likely to be affluent.
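Rules 1 and 4 above lend themselves to simple range and set-membership checks. The sketch below is one possible way to express them, under stated assumptions: the 25-50 range and the "rich" / "quite rich" levels are taken from the generated rules, while the field name "Age", the sample records and the function names are hypothetical.

```python
from typing import Any

AGE_MIN, AGE_MAX = 25, 50                 # range quoted in rule 1
AFFLUENT_LEVELS = {"rich", "quite rich"}  # levels quoted in rule 4
ACCOUNT_FIELDS = ("Saving accounts", "Checking account")


def age_in_range(row: dict[str, Any]) -> bool:
    """Range check: True when Age is numeric and within [AGE_MIN, AGE_MAX]."""
    age = row.get("Age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(age, bool) or not isinstance(age, (int, float)):
        return False
    return AGE_MIN <= age <= AGE_MAX


def has_affluent_account(row: dict[str, Any]) -> bool:
    """Membership check: True when either account field holds an affluent level."""
    return any(
        str(row.get(field, "")).strip().lower() in AFFLUENT_LEVELS
        for field in ACCOUNT_FIELDS
    )


def evaluate(row: dict[str, Any]) -> dict[str, bool]:
    """Run both checks and report each result separately."""
    return {
        "age_in_range": age_in_range(row),
        "affluent_account": has_affluent_account(row),
    }


if __name__ == "__main__":
    # Hypothetical records, invented for illustration.
    print(evaluate({"Age": 40, "Saving accounts": "rich", "Checking account": "little"}))
    print(evaluate({"Age": 62, "Saving accounts": "little", "Checking account": None}))
```

Reporting each check separately, instead of combining them into a single flag, keeps the rationale behind each rule visible when the results are reviewed.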
In the rules above, text in bold denotes the proposed DQ rule or check, and text in italics denotes the reasoning behind it.
These are the actual results of a simple test in which little functional documentation was supplied, whether through a RAG knowledge base, prompt or model fine-tuning, or human-in-the-loop feedback and curation. The results clearly show that more work is needed to improve the quality of the recommended rules, both in their DQ-style wording and in their comprehensiveness, but that is a topic for another day.
Below are some pointers on how the rules proposed above can be used:
Data Profiling and EDA
Data Cleansing and Transformation
Leverage insights from profiling and EDA to create rules –
Data Enrichment and Feature Engineering
Data Quality Rules and Validation
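To make the step from profiling to a validation rule more concrete, the sketch below profiles a numeric field and derives a threshold from it. It is only an illustration under stated assumptions: the interquartile-range (IQR) fence is a common profiling heuristic chosen here for the example, not a method taken from the test, and the "Credit amount" values and function names are hypothetical.

```python
from statistics import quantiles

# Hypothetical "Credit amount" values, invented for illustration.
CREDIT_AMOUNTS = [1000, 1500, 2000, 2500, 3000, 3500, 4000, 20000]


def iqr_bounds(values: list[float], k: float = 1.5) -> tuple[float, float]:
    """Profile the field: derive lower and upper fences from the quartiles."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outside_bounds(values: list[float], bounds: tuple[float, float]) -> list[float]:
    """Validation rule: return the values that fall outside the profiled fences."""
    low, high = bounds
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    bounds = iqr_bounds(CREDIT_AMOUNTS)
    print("bounds:", bounds)
    print("flagged:", outside_bounds(CREDIT_AMOUNTS, bounds))
```

Values flagged this way are candidates for review, not confirmed errors. An unusually high credit amount may be exactly the signal Use Case 2 is looking for.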
This initial exploration of GenAI for industry-specific data quality rule generation shows early promise. While the results are encouraging, further rigorous refinement, testing, and validation, together with the application of data safety and security principles, are required to assess the scalability, reliability, and overall effectiveness of this approach.
Thanks to Napkin AI for providing the platform which helped me generate the visuals.