AI-Driven Data Quality: Solving the Business Need Gap - Part 2

This article continues from Part 1, exploring the generation of data quality (DQ) rules using Generative AI (GenAI). We conducted preliminary tests to assess the capabilities of a leading large language model (LLM) in generating industry-specific DQ rules. Using a sample banking dataset, we examined two use cases. As discussed in the previous article, the generated output includes not only the rule suggestions but also the rationale behind them.

Use Case 1: Prevent Customer Churn

Suggested DQ Rules:

1. Check for missing values in the "Saving accounts" and "Checking account" fields. Customers with missing information for these fields may indicate a lack of engagement with the bank's products and services, which could be a risk factor for churn.

2. Analyse the "Credit amount" field. Customers with high credit amounts may be more likely to churn if they find better rates or terms elsewhere. Conversely, customers with low credit amounts may be less engaged with the bank's lending products, which could also be a risk factor.

3. Examine the "Duration" field, which likely represents the length of the customer's relationship with the bank. Customers with shorter durations may be at higher risk of churn, as they have not yet established a strong loyalty to the bank.

4. Investigate the "Housing" field, which indicates whether the customer owns or rents their home. Customers who rent may be more transient and therefore at higher risk of churn compared to homeowners who are more established in their location.


Image 1: Prevent Customer Churn
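As an illustration, the first and third suggestions above could be prototyped as simple pandas checks. The column names ("Saving accounts", "Checking account", "Duration") follow the sample banking dataset described earlier; the toy records and the 12-month tenure threshold are assumptions for demonstration only, not values from the actual test.

```python
import pandas as pd

# Hypothetical sample mirroring the banking dataset's column names (an assumption).
df = pd.DataFrame({
    "Saving accounts": ["little", None, "rich", None],
    "Checking account": ["moderate", "little", None, "rich"],
    "Duration": [6, 48, 12, 24],
})

# Rule 1: flag customers missing either account field (possible disengagement).
missing_accounts = df["Saving accounts"].isna() | df["Checking account"].isna()

# Rule 3: flag short-tenure customers; the 12-month cutoff is illustrative.
short_duration = df["Duration"] < 12

churn_risk_flags = df.assign(
    missing_accounts=missing_accounts,
    short_duration=short_duration,
)
```

Flagged rows could then feed a churn-risk review queue rather than being rejected outright, since these rules signal risk factors, not hard data errors.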

Use Case 2: Identify High Net-worth Individuals

Suggested DQ Rules:

1. Age: Implement range checks to identify customers in the prime working age group (e.g. 25-50 years old) who are more likely to have higher income potential.

2. Job: Analyze the job codes to identify customers in high-paying professions like doctors, lawyers, executives, etc.

3. Credit amount: Customers with higher approved credit amounts could indicate higher income and net worth.

4. Checking/Savings accounts: Customers with "rich" or "quite rich" levels in these fields are more likely to be affluent.


Image 2: Identify High Net-Worth Individuals
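Rules 1 and 4 above translate naturally into a range check plus a membership check. This is a minimal sketch; the records are invented, and the 25-50 age band and the "rich"/"quite rich" level names are taken from the suggestions above.

```python
import pandas as pd

# Illustrative records; the column names follow the article's dataset (an assumption).
df = pd.DataFrame({
    "Age": [34, 62, 17, 45],
    "Saving accounts": ["rich", "little", "quite rich", "moderate"],
})

# Rule 1: prime working-age band (25-50, the example range from the suggestion).
prime_age = df["Age"].between(25, 50)

# Rule 4: affluent account levels as named in the dataset.
AFFLUENT_LEVELS = {"rich", "quite rich"}
affluent = df["Saving accounts"].isin(AFFLUENT_LEVELS)

# Candidate high net-worth individuals satisfy both conditions.
hnw_candidates = df[prime_age & affluent]
```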

In the images above, text in bold denotes the proposed DQ rules/checks, and text in italics denotes the reasoning.

Above are the actual results from a simple test, run without much supporting functional documentation in the form of a RAG knowledge base, prompt engineering or model fine-tuning, human-in-the-loop feedback and curation, etc. The results clearly show that more work is needed to improve the quality of the recommended rules, both in terms of DQ-style verbiage and in terms of comprehensiveness, but that is for another day.

Below are some pointers on how the proposed rules can be used:

Data Profiling and EDA

  • The suggested rules provide an opportunity to perform advanced data profiling and exploratory data analysis in a manner tailored to the industry use case. E.g. for the use case of preventing customer churn, evaluate how many customers fall in the top and bottom percentile/quartile ranges.
  • This would complement the pure-play data profiling capabilities of different tools, e.g. completeness, uniqueness, etc.
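The quartile analysis mentioned in the first bullet might be sketched as follows. The credit amounts here are toy values; in practice the series would come from the full customer table.

```python
import pandas as pd

# Toy "Credit amount" values standing in for the real customer table.
credit = pd.Series([500, 1200, 3400, 7800, 15000, 2200, 900, 6100])

# Quartile boundaries: EDA that goes beyond plain completeness/uniqueness profiling.
q1, q3 = credit.quantile([0.25, 0.75])

# Customers in the bottom and top quartiles, per the churn use case.
bottom = credit[credit <= q1]
top = credit[credit >= q3]
```

Counts (or percentages) of customers in each band can then be tracked over time as an engagement signal.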

Data Cleansing and Transformation

Leverage insights from profiling and EDA to create rules –

  • Standardize job codes into broader categories. E.g. for the use case of identifying High Net-worth Individuals, titles like Doctor, Physician, Medical Practitioner, Surgeon, and Medic can all be standardized to ‘Doctor’, and likewise for other professions.
  • Implement range checks on Age field.
  • Clearly define “rich” and “quite rich” in the Savings/Checking account fields, and validate against data in other fields
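The job-code standardization in the first bullet is essentially a lookup-table transformation. A minimal sketch, where the synonym list and target labels are assumptions beyond the examples given in the text:

```python
# Illustrative mapping; only the Doctor synonyms come from the article,
# the Lawyer entries are added assumptions.
JOB_STANDARDIZATION = {
    "doctor": "Doctor",
    "physician": "Doctor",
    "medical practitioner": "Doctor",
    "surgeon": "Doctor",
    "medic": "Doctor",
    "lawyer": "Lawyer",
    "attorney": "Lawyer",
}

def standardize_job(raw: str) -> str:
    """Map a free-text job title to a broad category; pass unknown titles through."""
    return JOB_STANDARDIZATION.get(raw.strip().lower(), raw.strip())
```

In a real pipeline the mapping would be curated with business stakeholders, and unmapped titles would be logged for review rather than silently passed through.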

Data Enrichment and Feature Engineering

  • Derive new features, such as income range, net worth ranges or create a composite “affluence score” based on multiple attributes.
  • Incorporate external data, such as census or industry-specific income data, for feature enhancement. This information may not exist in the original data, but onboarding it into the mix might aid the business use case, so its availability and feasibility of integration should be checked.
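A composite “affluence score” of the kind mentioned above could be derived like this. The weights, the level-to-score mapping, and the 10,000 credit ceiling are all illustrative assumptions, not values derived from data:

```python
# Ordinal scores for the account-level labels used in the dataset (an assumption).
LEVEL_SCORE = {"little": 0, "moderate": 1, "quite rich": 2, "rich": 3}

def affluence_score(savings_level: str,
                    checking_level: str,
                    credit_amount: float) -> float:
    """Combine account levels and credit amount into a single 0-1 score."""
    # Account component: sum of the two level scores, normalized by the max (3 + 3).
    account_part = (LEVEL_SCORE.get(savings_level, 0)
                    + LEVEL_SCORE.get(checking_level, 0)) / 6
    # Credit component: capped at an assumed 10,000 ceiling.
    credit_part = min(credit_amount / 10_000, 1.0)
    # Weighted blend; the 0.7/0.3 split is arbitrary and would need tuning.
    return round(0.7 * account_part + 0.3 * credit_part, 3)
```

Such a derived feature could then feed both the high net-worth identification use case and downstream churn models.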

Data Quality Rules and Validation

  • Formalize the suggested (translated and verified) rules as data quality rules
  • Leverage the rules for validation during data ingestion or during model training and analysis
  • Use GenAI further to convert the finalized DQ rules into a technical code base
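One way to formalize the verified rules is a small rule registry evaluated at ingestion time. This is a minimal sketch; the rule names, predicates, and valid-value lists are illustrative assumptions:

```python
import pandas as pd

# A minimal DQ rule registry: each rule returns a boolean "passes" Series.
DQ_RULES = {
    "age_in_range": lambda df: df["Age"].between(18, 100),
    "savings_level_valid": lambda df: df["Saving accounts"].isin(
        ["little", "moderate", "quite rich", "rich"]
    ),
}

def validate(df: pd.DataFrame) -> dict:
    """Run every registered rule and report the count of failing rows,
    e.g. as a gate during data ingestion or before model training."""
    return {name: int((~rule(df)).sum()) for name, rule in DQ_RULES.items()}
```

Keeping rules in a registry like this also gives GenAI a clear target format when converting finalized DQ rules into code.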


Image 3: Methods of leveraging the GenAI suggested DQ rules

This initial exploration of GenAI for industry-specific data quality rule generation shows early promise. While the results are encouraging, further rigorous refinements, testing, validation, applying principles of data safety and security are required to assess the scalability, reliability, and overall effectiveness of this approach.

Thanks to Napkin AI for providing the platform which helped me generate the visuals.
