The dangers of invented data
The dream of a data scientist is perfectly curated, complete data: the type of data set that is easy to collect, easy to understand and, most importantly, covers the sample you want to apply the final machine learning model to. There's a reason why this is a dream and not a reality: the world is complicated. There are hundreds of reasons why that perfect data set doesn't exist, and a key job for the data scientist is to overcome this.
This assumes you have at least some data, but what if you have nothing? Well, one approach is to generate some using synthetic data. Using expert knowledge of the underlying population you want to apply your model to, you can create a synthetic data set to use in your model building. Making some assumptions about the data distributions and adding an appropriate amount of noise can actually lead you to a better predictive model.
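As a minimal sketch of that idea, the snippet below generates a synthetic sample from expert-assumed distributions. All of the parameters here (the age and income distributions, the coefficients linking them to an outcome) are illustrative assumptions, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 1_000

# Expert-supplied assumptions about the population (illustrative only):
# age roughly normal, income roughly log-normal.
age = rng.normal(loc=45, scale=12, size=n).clip(18, 90)
income = rng.lognormal(mean=10.5, sigma=0.4, size=n)

# A hypothesised relationship to a binary outcome, with noise added
# so a model trained on this data cannot simply memorise the rule.
logit = -8 + 0.05 * age + 0.0001 * income + rng.normal(0, 0.5, size=n)
outcome = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([age, income])  # features for model building
```

The noise term is doing real work here: without it, any model would recover the invented rule exactly and look deceptively strong.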
One example where I have used this approach in the past is an organisation looking to test an entirely new product. In that situation there may be some patchwork data sets available, but by boosting these with some synthesized data I was able to reach a large enough sample to have confidence in a stable model.
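One simple way to "boost" a small real sample, under the assumption that bootstrap resampling with column-scaled jitter is an acceptable stand-in for the (unspecified) synthesis method used in the article, is sketched below. The row count of 24 echoes the anecdote in the comments; everything else is hypothetical:

```python
import numpy as np

def augment_with_noise(X, n_new, noise_scale=0.1, seed=0):
    """Bootstrap-resample real rows, then jitter each column by
    Gaussian noise scaled to that column's standard deviation."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=n_new)
    jitter = rng.normal(0.0, noise_scale, size=(n_new, X.shape[1]))
    return X[idx] + jitter * X.std(axis=0)

# A patchy real sample of 24 rows, padded out to 500 for model fitting.
X_real = np.random.default_rng(1).normal(size=(24, 5))
X_boosted = np.vstack([X_real, augment_with_noise(X_real, 500 - 24)])
```

Scaling the noise to each column's spread keeps the synthetic rows plausible relative to the real ones, but note this inherits every bias already present in the 24 real rows.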
This sounds too good to be true, doesn't it? Creating data to build powerful models? There are definite pitfalls to this approach, and in my mind there are three huge risks:
Synthetic data creation is a powerful way to get a leg up where data coverage is sparse. However, don't fully fall for the hype: make sure you really trust your expert, and don't forget to understand your risks.
Questions to ask:
Technology Entrepreneur | Making Impact through Trustworthy AI innovation
2 years ago: A nice article, Dan Kellett. Cold start is obviously a big problem. A large bank was once given the job of building a complex decisioning model based on 24 rows, and claimed they had 2 years of data to work on (12 months × 2 years). Synthetic data can reduce bias and improve generalisation and fairness, but it can also do exactly the opposite by injecting expert-knowledge bias, as you have mentioned. However, there is a trend often practiced and accepted in the industry: human in the loop. Start with a model based on synthetic data (to avoid cold start), keep a human in the loop to constantly monitor and improve performance on real data (don't depend on it until it reaches a certain threshold with a high confidence interval), add as much explainability as possible, and finally add model monitoring (to capture any drift).