The dangers of invented data

The dream of a data scientist is perfectly curated, complete data: the type of data set that is easy to collect, easy to understand and, most importantly, covers the sample you want to apply the final machine learning model to. There’s a reason why this is a dream and not a reality: the world is complicated. There are hundreds of reasons why that perfect data set doesn’t exist, and a key job for the data scientist is to overcome this.

This assumes you have some data, but what if you have nothing? One approach is to generate synthetic data. Using expert knowledge of the underlying population you want to apply your model to, you can create a synthetic data set to use in your model building. Making some assumptions about the data distributions and adding an appropriate amount of noise can actually lead you to a better predictive model than sparse data alone would allow.
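To make the idea concrete, here is a minimal sketch of generating synthetic rows from assumed distributions plus noise. Everything in it is hypothetical: the field names (`income`, `spend`), the function `make_synthetic_rows` and all the numbers are illustrative assumptions standing in for whatever your expert would actually specify.

```python
import random

# Hypothetical expert assumptions (illustrative values only).
ASSUMED_MEAN_INCOME = 30_000.0  # assumed average income in the target population
ASSUMED_SD_INCOME = 8_000.0     # assumed spread of income
ASSUMED_SLOPE = 0.02            # assumed spend per unit of income
NOISE_SD = 100.0                # noise so the relationship is not perfectly clean


def make_synthetic_rows(n_rows, seed=42):
    """Draw n_rows of (income, spend) from the assumed distributions."""
    rng = random.Random(seed)  # fixed seed so the data set can be recreated
    rows = []
    for _ in range(n_rows):
        # Normal draw for income, floored at zero.
        income = max(0.0, rng.gauss(ASSUMED_MEAN_INCOME, ASSUMED_SD_INCOME))
        # Linear relationship plus Gaussian noise, floored at zero.
        spend = max(0.0, ASSUMED_SLOPE * income + rng.gauss(0.0, NOISE_SD))
        # Tag every row so it is never mistaken for observed data.
        rows.append({"income": income, "spend": spend, "source": "synthetic"})
    return rows


rows = make_synthetic_rows(1000)
print(len(rows), rows[0]["source"])
```

Every constant at the top is a judgement call by the expert, which is exactly where the risks come from.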

One example where I have used this approach in the past is an organisation looking to test an entirely new product. In this situation there may be some patchwork data sets that could be used, but by boosting these with some synthesised data I was able to get to a big enough sample to have some confidence in a stable model.

This sounds too good to be true, doesn’t it? Creating data to build powerful models? There are definite pitfalls with taking this approach, and in my mind there are three huge risks:

  1. You are constrained by the limitations of your thinking and experience. The expert defines the rules within the data. This is definitely a great starting point but what if you’re wrong?
  2. Hubris leads to false confidence in the results - never forget this is synthetic data!
  3. There will always be bias in your data but you may have introduced more without realising it.

Synthetic data creation is a powerful way to get a leg up where data coverage is sparse. However, don’t fully fall for the hype: make sure you really trust your expert, and don’t forget to understand your risks.

Questions to ask:

  • Do you have a clear understanding of which of your models and decisions have been made on synthetic data versus actual data?
  • What assumptions have been made to create synthetic data?
  • What checks and balances have you undertaken to ensure the data reflects the forward-looking population?
  • Can you replicate the synthetic data and who owns it?
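Some of these questions can be partly answered in code. The sketch below is one possible approach, not a prescribed method: it records the generation recipe and owner in a manifest so the data can be replicated, and it compares the synthetic values against whatever real observations exist using a two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distributions). The manifest fields, the owner name and the `observed` values are all hypothetical placeholders.

```python
import bisect
import random


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    max_gap = 0.0
    for x in a + b:
        # Share of each sample that is less than or equal to x.
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def draw_synthetic(n, seed, mean, sd):
    """Recreate the synthetic sample from its recorded recipe."""
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in range(n)]


# Hypothetical record of how the data was made and who owns it.
manifest = {"owner": "model-risk-team", "n": 500, "seed": 7, "mean": 50.0, "sd": 10.0}

synthetic = draw_synthetic(manifest["n"], manifest["seed"], manifest["mean"], manifest["sd"])

# Placeholder values standing in for the few real observations available.
observed = [48.0, 52.5, 61.0, 39.5, 55.0, 47.0, 58.5, 44.0]

gap = ks_statistic(synthetic, observed)
print(f"KS gap between synthetic and observed: {gap:.3f}")
```

A large gap suggests the expert's assumptions have drifted away from the population you actually see. What counts as "large" is itself a judgement call and should be agreed up front.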

Last week: the perfect Sunday roast

Next week: the best thing you can do to ensure AI success

Sourish Banerjee

Technology Entrepreneur | Making Impact through Trustworthy AI innovation

2 years ago

A nice article, Dan Kellett. Cold start is obviously a big problem. A large bank once handed over a job to build a complex decisioning model based on 24 rows, claiming they had 2 years of data (12 months x 2 years) to work on. In the same way, synthetic data can reduce bias and improve generalisation and fairness; however, it can do exactly the opposite by injecting expert knowledge bias, as you have mentioned. That said, we have seen a trend that is often practised and accepted in the industry: human in the loop. Start with a model based on synthetic data (to avoid cold start), keep a human in the loop to monitor and improve performance on real data (don't depend on the model until it reaches a certain threshold with high confidence), add as much explainability as possible, and finally add model monitoring (to capture any drift).
