Why you don’t need perfect Data Quality to get started with AI
James Hartwright
Managing Partner & fractional CDO, CTO at Pragmaticians | Certus Solutions | Traffyk.ai
I’m seeing a number of articles and ads about Data Quality which I believe are doing more harm than good (except for the vendors who sell you some new software, of course):
“Certainty over your data quality is essential for success”
“Poor data quality can put AI initiatives at serious risk for businesses”
It does sound logical - after all, bad data leads to bad decisions, right? Well, it depends: higher data quality usually produces better results, but the idea that everything needs to be pristine before you can apply AI and extract value is, in my experience, a myth.
The Reality of Data Quality in AI
AI models - whether analytical models, time-series forecasting, or generative AI - aren’t as fragile as you might think. Many AI systems are designed to work with imperfect, messy, and even incomplete data. Rather than striving for perfection, organisations should focus on:
This approach isn’t about being careless; it’s about being pragmatic and calculated. If you demand perfect data quality before starting, you’ll likely never get started: I’ve seen the time expended on, and the lack of results generated by, “boil the ocean” Data Quality projects.
You also shouldn’t just dump anything and everything into AI and expect reliable results! The key is knowing when and where your data is "good enough" and where the risks lie if it’s not quite there.
“OK James”, I hear you say, “how do I know when my data is ‘good enough’?”
Instead of striving for perfection, assess your data based on your use cases, your audience, and the AI you’re applying. Consider both structured and unstructured data, noting that much of it is semi-structured (with elements such as document type, source, and creation date).
For Analytical Modeling
For Generative AI
General "good enough" considerations:
Other things - though harder to define 'good':
One thing that will make a big difference to the quality of any results is METADATA!
Metadata is ‘data about data’ – such as the aforementioned file creation date, origin, and author.
Providing metadata to your audience (or to the generative or agentic AI middle-man) will bring increased accuracy and consistency to your results. Even when the audience is a data scientist (or a BI developer, or a good ML engineer), providing explanatory metadata gives them a better starting point.
Remember, also, that Generative AI works in language, so – when accessing structured data in a repository - providing additional textual information about the layout and content of tables and columns will make it more accurate when it builds and runs SQL from natural-language queries. I’ve put a somewhat technical example at the bottom of the article.
Other valuable metadata we can attach here is anything we’ve discovered about data quality in our “good enough” pre-processing. I'm following the 6 most commonly used dimensions of data quality: accuracy, completeness, consistency, timeliness, uniqueness, and validity.
These would usually be presented with more contextual detail, and attached to specific data, for example:
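As a minimal sketch of how such quality metadata could be generated during pre-processing (an illustrative helper of my own, not a prescribed tool; the sample records and the naive email check are assumptions), per-column metrics for completeness, uniqueness, and validity can be computed and attached to the data:

```python
# Sketch: compute per-column "good enough" quality metrics
# (completeness, uniqueness, validity) to attach as metadata.
# The sample records and the email validator are illustrative assumptions.
import re

def column_quality(rows, column, validator=None):
    """Return simple data-quality metadata for one column of a record set."""
    values = [r.get(column) for r in rows]
    present = [v for v in values if v not in (None, "")]
    meta = {
        "column": column,
        # Share of records where the value is populated at all
        "completeness": len(present) / len(values) if values else 0.0,
        # Share of populated values that are distinct
        "uniqueness": len(set(present)) / len(present) if present else 0.0,
    }
    if validator:
        # Share of populated values passing a domain-specific check
        meta["validity"] = (
            sum(1 for v in present if validator(v)) / len(present) if present else 0.0
        )
    return meta

customers = [
    {"customer_id": 1, "email": "ana@example.com"},
    {"customer_id": 2, "email": "bob@example"},  # fails the naive email check
    {"customer_id": 3, "email": None},           # missing value
]

is_email = lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None
print(column_quality(customers, "email", is_email))
```

The resulting dictionary can then be stored alongside the table/column metadata, so a consumer (human or AI) knows how far to trust each field.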
Summary of Steps
In getting to, and keeping to, good enough there are 3 key processes that you should add into your BAU:
Final thoughts: Just Get Started - but with guardrails
Waiting for perfect data quality before using AI is like waiting for perfect weather before going outside - you’ll spend days doing nothing, then find out your mates had a great time without you. Instead, take a structured but flexible approach:
AI doesn’t require perfection—it requires thoughtful execution and pragmatism. The goal is progress, with built-in safeguards along the way.
P.S. I’m thinking of writing a non-technical deeper dive around how to prepare data for generative AI, for things like a contact centre knowledge base - shout if you’d like me to do it!
The Metadata Example
NB: Maybe skim over this if you're not technical - though pop down and look at the Reference Data heading for the Order Status table!
I've gone with a simple construct of someone placing an order: a customer table, a set of products, orders, and the items in each order.
Tables/Entities:
We populate some information on the tables so that the user/AI can build or get context on what's in there.
Note that there are multiple codified/lookup attributes in the structure, but I've only documented one of them (OrderStatus) to reduce complexity!
Attributes
We do similar things with the attribute names - supplying plain-English terms to increase understanding of how to use each attribute. This is the start of a business glossary - or just use the business glossary / data catalog tool you already have!
Note that I've suggested a view over the raw attribute names to present more meaningful ones to the AI. The tools I've used thus far prefer snake_case over camelCase - but that won't last long...
Reference Data
Content like the Order Status descriptions could be the most important thing you generate - codified/lookup fields are often less than perfect in explaining what they mean!
Metadata in Code
For many tools, including Generative AI, we usually present this metadata in an extensible structure such as JSON:
{
"tables": {
"customer": {
"description": "Stores customer details, including contact info and status.",
"columns": [
{
"name": "customer_id",
"type": "INTEGER",
"description": "A unique identifier assigned to each customer.",
"primary_key": true
},
{
"name": "customer_name",
"type": "VARCHAR",
"description": "Full name of the customer."
},
{
"name": "email",
"type": "VARCHAR",
"description": "Email address of the customer."
},
{
"name": "phone_number",
"type": "VARCHAR",
"description": "Customer's contact phone number."
},
{
"name": "country_code",
"type": "VARCHAR",
"description": "ISO country code for the customer's location.",
"reference_values": "ISO 3166-1 country codes"
},
{
"name": "customer_status",
"type": "VARCHAR",
"description": "Current status of the customer.",
"reference_values": [
"active",
"inactive",
"churned"
]
}
]
},
"product": {
"description": "Stores product details, including pricing and stock levels.",
"columns": [
{
"name": "product_id",
"type": "INTEGER",
"description": "A unique identifier for each product.",
"primary_key": true
},
{
"name": "product_name",
"type": "VARCHAR",
"description": "The name of the product."
},
{
"name": "category",
"type": "VARCHAR",
"description": "The category of the product.",
"reference_values": [
"electronics",
"apparel",
"home",
"books"
]
},
{
"name": "price_aud",
"type": "DECIMAL(10,2)",
"description": "Price of the product in AUD."
},
{
"name": "stock_quantity",
"type": "INTEGER",
"description": "Number of units available in stock."
}
]
},
"order": {
"description": "Tracks customer purchases and order details.",
"columns": [
{
"name": "order_id",
"type": "INTEGER",
"description": "A unique identifier for each order.",
"primary_key": true
},
{
"name": "customer_id",
"type": "INTEGER",
"description": "The customer who placed the order.",
"foreign_key": "customer.customer_id"
},
{
"name": "order_date",
"type": "DATE",
"description": "The date the order was placed."
},
{
"name": "total_amount_aud",
"type": "DECIMAL(10,2)",
"description": "The total order amount in AUD."
},
{
"name": "order_status_id",
"type": "INTEGER",
"description": "The current status of the order.",
"foreign_key": "order_status.order_status_id"
}
]
},
"order_item": {
"description": "Links products to specific orders, tracking quantities and pricing.",
"columns": [
{
"name": "order_item_id",
"type": "INTEGER",
"description": "A unique identifier for each order item.",
"primary_key": true
},
{
"name": "order_id",
"type": "INTEGER",
"description": "The order associated with this item.",
"foreign_key": "order.order_id"
},
{
"name": "product_id",
"type": "INTEGER",
"description": "The product included in the order.",
"foreign_key": "product.product_id"
},
{
"name": "quantity",
"type": "INTEGER",
"description": "Number of units of the product ordered."
},
{
"name": "price",
"type": "DECIMAL(10,2)",
"description": "The price of the product at the time of purchase in AUD."
}
]
},
"order_status": {
"description": "Stores predefined values for order statuses.",
"columns": [
{
"name": "order_status_id",
"type": "INTEGER",
"description": "Unique identifier for each order status.",
"primary_key": true
},
{
"name": "order_status",
"type": "VARCHAR",
"description": "The status of the order.",
"reference_values": [
"pending",
"shipped",
"delivered",
"cancelled"
]
},
{
"name": "description",
"type": "TEXT",
"description": "Explanation of what each order status means."
}
],
"values": [
{
"order_status_id": 1,
"order_status": "pending",
"description": "Order has been placed but not yet processed."
},
{
"order_status_id": 2,
"order_status": "shipped",
"description": "Order has been shipped and is in transit."
},
{
"order_status_id": 3,
"order_status": "delivered",
"description": "Order has been delivered to the customer."
},
{
"order_status_id": 4,
"order_status": "cancelled",
"description": "Order was cancelled before fulfilment."
}
]
}
}
}
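To show how metadata like this might actually reach a model, here is a minimal sketch (my own assumption about the integration, not a prescribed approach) that serialises a slice of the schema above into a text-to-SQL prompt; the model call itself is deliberately left out:

```python
# Sketch: fold table/column metadata into a text-to-SQL prompt.
# The schema dict mirrors a slice of the JSON example above; the prompt
# wording is an illustrative assumption, and no model is called here.
import json

schema = {
    "order": {
        "description": "Tracks customer purchases and order details.",
        "columns": [
            {"name": "order_id", "type": "INTEGER",
             "description": "A unique identifier for each order."},
            {"name": "order_date", "type": "DATE",
             "description": "The date the order was placed."},
            {"name": "order_status_id", "type": "INTEGER",
             "description": "The current status of the order.",
             "foreign_key": "order_status.order_status_id"},
        ],
    },
}

def build_sql_prompt(question, schema):
    """Embed schema metadata so the model grounds its SQL in real names."""
    return (
        "You are given this database schema as JSON metadata:\n"
        + json.dumps(schema, indent=2)
        + "\n\nUsing only the tables and columns above, write a SQL query for:\n"
        + question
    )

prompt = build_sql_prompt("How many orders were placed last month?", schema)
print(prompt)
```

The point of the design is that the model only ever sees the documented names and descriptions, so the richer the metadata (including the reference-data values above), the less it has to guess.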