Why you don’t need perfect Data Quality to get started with AI

I’m seeing a number of articles and ads about Data Quality which I believe are doing more harm than good (except for the vendors who sell you some new software, of course):

“Certainty over your data quality is essential for success”
“Poor data quality can put AI initiatives at serious risk for businesses”

It does sound logical - after all, bad data leads to bad decisions, right? Well, it depends: while higher data quality usually produces better results, the idea that everything needs to be pristine before you can apply AI and extract value is - in my experience - a myth.

The Reality of Data Quality in AI

AI models - whether analytical models, time-series forecasting, or generative AI - aren’t as fragile as you might think. Many AI systems are designed to work with imperfect, messy, and even incomplete data. Rather than striving for perfection, organisations should focus on:

  • Key data elements - attributes that must be present and accurate (e.g. dates, product IDs, financial amounts).
  • Understanding patterns and consistency - so you know what you’re working with.
  • Being clear on uncertainty levels - so you can define and mitigate risks accordingly.

This approach isn't about being careless; it's about being pragmatic and calculated. If you demand perfect data quality before starting, you'll likely never get started: I've seen the time expended on, and the lack of results generated by, "boiling the ocean" Data Quality projects.

You also shouldn’t just dump anything and everything into AI and expect reliable results! The key is knowing when and where your data is "good enough" and where the risks lie if it’s not quite there.


“OK James”, I hear you say, “how do I know when my data is ‘good enough’?”

Instead of striving for perfection, assess your data against your use cases, your audience, and the type of AI you're applying. Consider both structured and unstructured data, noting that much of it is semi-structured (with elements such as document type, source, and creation date).

For Analytical Modelling

  • Core Requirements and Noise Handling: Key fields, such as dates and company names, must be consistently accurate (say 95% correct). Statistical techniques, and applied domain expertise from the data scientist, can often smooth out minor errors.
  • Example: In financial forecasting, if 5% of transaction dates are slightly off, robust models may still reliably predict trends, provided historical patterns are strong.
  • Quality Checks: Monitor the model build for overfitting to noise and check compliance (especially in regulated sectors). Also add continuous checks for changes in the data, so you catch model drift early - a rough sketch of one such check follows this list.
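
As a rough illustration of that last point, here is a minimal sketch of a drift check using the Population Stability Index (PSI); the sample data, feature choice, and thresholds are assumptions for illustration only, not a prescribed implementation.

import numpy as np

def population_stability_index(reference, current, bins=10):
    # Rough drift score between a reference sample and new data.
    # Rule-of-thumb thresholds: < 0.1 stable, 0.1-0.25 worth a look, > 0.25 likely shift.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    # Convert counts to proportions, avoiding divide-by-zero on empty bins
    ref_pct = np.clip(ref_counts / max(ref_counts.sum(), 1), 1e-6, None)
    cur_pct = np.clip(cur_counts / max(cur_counts.sum(), 1), 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Illustrative usage with made-up transaction amounts
reference = np.random.normal(100, 20, 5_000)   # e.g. last year's amounts
current = np.random.normal(115, 25, 1_000)     # e.g. this week's amounts
psi = population_stability_index(reference, current)
if psi > 0.25:
    print(f"Possible drift detected (PSI = {psi:.2f}) - review before trusting the model")

Checks like this don't need to be sophisticated to be useful; the point is that a change in the shape of incoming data gets flagged before it quietly degrades the model.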

For Generative AI

  • Flexibility with Imperfection and Ambiguity: Generative AI thrives on patterns and probabilities rather than exact figures. It can also extract meaning even from incomplete datasets.
  • Example: A customer service chatbot can provide helpful responses even if some input data is ambiguous, provided the core context is intact.
  • Quality Checks: Be wary of amplified biases or hallucinations—situations where missing data might lead to confidently incorrect outputs. Provide guidance to your users when starting out.


General "good enough" considerations:

  • Key fields must be mostly correct and consistent, e.g. dates populated and in the same format 95% of the time; company names and agreement dates populated in contract documents.
  • Data must be reasonably coherent, e.g. no large sets of purchases without a customer attached, and no truncated meeting transcripts mixed in with full ones.
  • You have reasonably complete source data - no big gaps in supply, and a similar layout over time for the same data.
  • You have manageable noise and duplicated data, e.g. minimal instances of weather readings of 5,000°C or the same content presented 50 times (some simple outlier analysis or basic de-duplication helps - see the sketch after this list).
  • You've considered your audience - that what you're deploying has been through a review of a) any compliance/regulatory minimums; b) appropriate access control to the underlying data; c) the level of data maturity / critical thinking the audience has. This is slightly less about data quality and more about security.
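
A minimal sketch of what the first four checks might look like in practice - the column names, sample data, and thresholds below are assumptions for illustration, not a standard:

import pandas as pd

# Hypothetical orders extract - in reality this would come from your source system
orders = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-01-05", "05/01/2024", None, "2024-01-09", "2024-01-10"],
    "customer_id": [101, 101, None, 104, 105, 106],
    "amount": [250.0, 250.0, 90.0, 120.0, 18000000.0, 75.0],
})

# 1. Key field populated and in a consistent format (aiming for roughly 95%)
parsed_dates = pd.to_datetime(orders["order_date"], format="%Y-%m-%d", errors="coerce")
date_ok_rate = parsed_dates.notna().mean()

# 2. Coherence: purchases should have a customer attached
orphan_rate = orders["customer_id"].isna().mean()

# 3. Noise: flag implausible values with a crude rule of thumb against the median
suspect_amounts = orders["amount"] > 100 * orders["amount"].median()

# 4. Duplication: exact repeats of the same record
dup_rate = orders.duplicated().mean()

print(f"dates populated and in the expected format: {date_ok_rate:.0%}")
print(f"orders without a customer attached: {orphan_rate:.0%}")
print(f"suspiciously large amounts: {int(suspect_amounts.sum())}")
print(f"duplicate rows: {dup_rate:.0%}")

None of this replaces a proper data quality tool; it's simply a quick way to put numbers against "good enough" before you commit to a use case.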


Other things - though these are harder to define as 'good':

  • Sufficient volume to derive patterns – volume matters more for machine learning and patterns matter more for generative AI but, usually, more data is better.
  • Keep as much as is sensible – don’t filter out data just because it has some minor errors – as long as it isn’t completely unreadable!
  • No excessive bias – not just around bias in language (removing all social media posts that have swear words in), but also the filtering: if you’re dropping 10% of your data through poor data quality then have a closer look at what impact that might have.
  • Relevant to the task in hand – not scanning all of SharePoint – being somewhat selective around your use case.


One thing that will make a big difference to the quality of any results is METADATA!

Metadata is ‘data about data’ – such as the aforementioned file creation date, origin, author.

Providing metadata to your audience (or the generative or agentic AI middle-man) will bring increased accuracy and consistency to your results. Even when the audience is a data scientist (or a BI developer or maybe a good ML engineer), providing explanatory metadata will give them a better starting point.

Remember, also, that Generative AI works in language, so – for accessing structured data in a repository – providing additional textual information about the layout and content of tables and columns is going to make it more accurate when it builds and runs SQL from natural-language queries. I’ve put a somewhat technical example at the bottom of the article.

Other valuable metadata we can attach here is anything we’ve discovered about data quality in our “good enough” pre-processing – I'm following the six most commonly used dimensions of data quality.

Dimensions of Data Quality

These would usually be presented with more contextual detail, and attached to specific data, for example (a machine-readable sketch follows this list):

  • Timeliness – status: fail; details: "Customer table data update is delayed due to an ETL pipeline failure"
  • Accuracy – status: warning; details: "5% of ERP orders don't match the ecommerce system"
  • Completeness – status: warning; details: "Customer date of birth only contains the year on 10% of records"
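
If you want these findings to travel with the data (so a human, a pipeline, or an AI can read them), a simple machine-readable form works well. A minimal sketch - the field names are my own for illustration, not a formal standard:

import json

dq_metadata = [
    {"dataset": "customer", "dimension": "timeliness", "status": "fail",
     "details": "Customer table data update is delayed due to an ETL pipeline failure"},
    {"dataset": "orders", "dimension": "accuracy", "status": "warning",
     "details": "5% of ERP orders don't match the ecommerce system"},
    {"dataset": "customer", "dimension": "completeness", "status": "warning",
     "details": "Customer date of birth only contains the year on 10% of records"},
]

# Serialised like this, the same records can sit in a data catalog entry or be
# dropped straight into a generative AI prompt alongside the schema metadata
print(json.dumps(dq_metadata, indent=2))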


Summary of Steps

In getting to, and keeping to, "good enough" there are three key processes that you should add to your business-as-usual (BAU) operations:

  1. Analysing and documenting/annotating historic data – part of the process of deciding ‘good enough’ will generate some insights. Add all the metadata you can around the data you've processed, and any anomalies you've found, and lodge them somewhere that is accessible (to a human or AI). Start small and grow.
  2. Monitoring new data for continuity/accuracy – start with some basic checking and alerting rules (which can extend to data observability). Also add the timeliness statistic – “xyz source was last updated on dd-mon-yyyy” – it even adds value as a piece of information in BI reports. A minimal sketch of such a check follows this list.
  3. Fixing data issues – if you haven't already, start looking at fixing the attribute and source issues that come up as frequent or critical to insights. I'd recommend going back to the source owner / business process to fix the DQ issue, rather than trying to patch it in your data pipeline. Doing the latter can hide broader issues, and you'll spend your days continually fixing up minor errors (which won't be "good enough").
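
As an example of step 2, a timeliness check can be as small as this; the source name, refresh window, and load-log lookup are assumptions for illustration:

from datetime import datetime, timedelta, timezone

# Hypothetical freshness rule: "xyz source" is expected to refresh at least daily
last_updated = datetime(2025, 1, 10, tzinfo=timezone.utc)  # in practice, read from your load log
max_age = timedelta(days=1)

age = datetime.now(timezone.utc) - last_updated
status = "ok" if age <= max_age else "stale"

# The same statistic can be surfaced directly in BI reports
print(f"xyz source was last updated on {last_updated:%d-%b-%Y} ({status})")
if status == "stale":
    print("ALERT: data is older than the agreed refresh window")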


Final thoughts: Just Get Started - but with guardrails

Waiting for perfect data quality before using AI is like waiting for perfect weather before going outside - you'll spend days doing nothing and then find out your mates had a great time. Instead, take a structured but flexible approach:

  1. Ensure key data elements are reliable.
  2. Calculate and understand the risks of imperfections and their potential impact based on your audience and use case. Try some 'out there' questions of the data.
  3. Iterate and improve models and content over time rather than demanding perfection upfront. Inform your users of the known issues, and apply feedback from them on any sub-optimal results.
  4. Track the data – make sure the data continues to be complete, coherent, and representative of real life.

AI doesn’t require perfection—it requires thoughtful execution and pragmatism. The goal is progress, with built-in safeguards along the way.


P.S. I’m thinking of writing a non-technical deeper dive around how to prepare data for generative AI, for things like a contact centre knowledge base - shout if you’d like me to do it!


The Metadata Example

NB: Maybe skim over this if you're not technical - though pop down and look at the Reference Data heading for the Order Status table!

I've gone with a simple construct of someone placing an order. A customer table, a set of products, purchases, and the basket items in each purchase.

Tables/Entities:

We populate some descriptive information on the tables so that the user (or the AI) gets context on what's in there.

Tables

Note that there are multiple codified/lookup attributes in the structure, but I've only documented one of them (OrderStatus) to reduce complexity!


Simple ERD


Attributes

We do similar things with the attribute names – providing plain-English terms to increase understanding of how to use the attributes. This is the start of a business glossary – or just use the business glossary / data catalog tool you already have!

Note that I've suggested a view over the raw attribute names to present more meaningful ones to the AI. The tools I've used thus far prefer snake_case over camelCase - but that won't last long...
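
A minimal sketch of that renaming view, using SQLite purely for illustration – the terse raw column names are made up, and the friendlier names are the ones you'd expose to the AI:

import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw table with legacy column names
conn.execute("CREATE TABLE cust (cst_id INTEGER, cst_nm TEXT, eml TEXT)")

# The view presents meaningful snake_case names without touching the source table
conn.execute("""
    CREATE VIEW customer AS
    SELECT cst_id AS customer_id,
           cst_nm AS customer_name,
           eml    AS email
    FROM cust
""")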

Customer table attributes
Product table attributes
Order table attributes
Order Item attributes
Order Status attributes

Reference Data

Content like the Order Status descriptions could be the most important thing you generate - codified/lookup fields are often less than perfect in explaining what they mean!

Order Status content


Metadata in Code

For many tools – including Generative AI – we usually present this data in an extensible structure, such as JSON (a short sketch of how this might be fed to a model follows the example):

{
    "tables": {
        "customer": {
            "description": "Stores customer details, including contact info and status.",
            "columns": [
                {
                    "name": "customer_id",
                    "type": "INTEGER",
                    "description": "A unique identifier assigned to each customer.",
                    "primary_key": true
                },
                {
                    "name": "customer_name",
                    "type": "VARCHAR",
                    "description": "Full name of the customer."
                },
                {
                    "name": "email",
                    "type": "VARCHAR",
                    "description": "Email address of the customer."
                },
                {
                    "name": "phone_number",
                    "type": "VARCHAR",
                    "description": "Customer's contact phone number."
                },
                {
                    "name": "country_code",
                    "type": "VARCHAR",
                    "description": "ISO country code for the customer's location.",
                    "reference_values": "ISO 3166-1 country codes"
                },
                {
                    "name": "customer_status",
                    "type": "VARCHAR",
                    "description": "Current status of the customer.",
                    "reference_values": [
                        "active",
                        "inactive",
                        "churned"
                    ]
                }
            ]
        },
        "product": {
            "description": "Stores product details, including pricing and stock levels.",
            "columns": [
                {
                    "name": "product_id",
                    "type": "INTEGER",
                    "description": "A unique identifier for each product.",
                    "primary_key": true
                },
                {
                    "name": "product_name",
                    "type": "VARCHAR",
                    "description": "The name of the product."
                },
                {
                    "name": "category",
                    "type": "VARCHAR",
                    "description": "The category of the product.",
                    "reference_values": [
                        "electronics",
                        "apparel",
                        "home",
                        "books"
                    ]
                },
                {
                    "name": "price_aud",
                    "type": "DECIMAL(10,2)",
                    "description": "Price of the product in AUD."
                },
                {
                    "name": "stock_quantity",
                    "type": "INTEGER",
                    "description": "Number of units available in stock."
                }
            ]
        },
        "order": {
            "description": "Tracks customer purchases and order details.",
            "columns": [
                {
                    "name": "order_id",
                    "type": "INTEGER",
                    "description": "A unique identifier for each order.",
                    "primary_key": true
                },
                {
                    "name": "customer_id",
                    "type": "INTEGER",
                    "description": "The customer who placed the order.",
                    "foreign_key": "customer.customer_id"
                },
                {
                    "name": "order_date",
                    "type": "DATE",
                    "description": "The date the order was placed."
                },
                {
                    "name": "total_amount_aud",
                    "type": "DECIMAL(10,2)",
                    "description": "The total order amount in AUD."
                },
                {
                    "name": "order_status_id",
                    "type": "INTEGER",
                    "description": "The current status of the order.",
                    "foreign_key": "order_status.order_status_id"
                }
            ]
        },
        "order_item": {
            "description": "Links products to specific orders, tracking quantities and pricing.",
            "columns": [
                {
                    "name": "order_item_id",
                    "type": "INTEGER",
                    "description": "A unique identifier for each order item.",
                    "primary_key": true
                },
                {
                    "name": "order_id",
                    "type": "INTEGER",
                    "description": "The order associated with this item.",
                    "foreign_key": "order.order_id"
                },
                {
                    "name": "product_id",
                    "type": "INTEGER",
                    "description": "The product included in the order.",
                    "foreign_key": "product.product_id"
                },
                {
                    "name": "quantity",
                    "type": "INTEGER",
                    "description": "Number of units of the product ordered."
                },
                {
                    "name": "price",
                    "type": "DECIMAL(10,2)",
                    "description": "The price of the product at the time of purchase in AUD."
                }
            ]
        },
      "orderStatus": {
            "description": "Stores predefined values for order statuses.",
            "columns": [
                {
                    "name": "orderStatusId",
                    "type": "INTEGER",
                    "description": "Unique identifier for each order status.",
                    "primary_key": true
                },
                {
                    "name": "orderStatus",
                    "type": "VARCHAR",
                    "description": "The status of the order.",
                    "reference_values": [
                        "pending",
                        "shipped",
                        "delivered",
                        "cancelled"
                    ]
                },
                {
                    "name": "description",
                    "type": "TEXT",
                    "description": "Explanation of what each order status means."
                }
            ],
          "values": [
                {
                    "orderStatusId": 1,
                    "orderStatus": "pending",
                    "description": "Order has been placed but not yet processed."
                },
                {
                    "orderStatusId": 2,
                    "orderStatus": "shipped",
                    "description": "Order has been shipped and is in transit."
                },
                {
                    "orderStatusId": 3,
                    "orderStatus": "delivered",
                    "description": "Order has been delivered to the customer."
                },
                {
                    "orderStatusId": 4,
                    "orderStatus": "cancelled",
                    "description": "Order was cancelled before fulfilment."
                }
            ]
        }
    }
}        
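
A rough sketch of how this metadata might then be used: load it, and embed it in the prompt for whatever generative AI tool is writing the SQL. The file name, question, and prompt wording below are illustrative, and the final call to a model is deliberately left out because it depends on your tooling.

import json

# Load the metadata structure shown above (file name is illustrative)
with open("schema_metadata.json") as f:
    schema = json.load(f)

question = "What was the total value of delivered orders last month, by product category?"

# Embedding the descriptions and reference values lets the model translate
# business terms like "delivered" into order_status codes when it writes SQL
prompt = (
    "You are generating SQL for the schema described below.\n"
    f"Schema metadata:\n{json.dumps(schema, indent=2)}\n\n"
    f"Question: {question}\n"
    "Return a single SQL query. Only use tables and columns from the metadata."
)

# `prompt` would then be sent to whichever generative AI tool or agent you use.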