The dangers of invented data
The dream of a data scientist is perfectly curated, complete data: the type of data set that is easy to collect, easy to understand and, most importantly, covers the sample you want to apply the final machine learning model to. There's a reason why this is a dream and not a reality: the world is complicated. There are hundreds of reasons why that perfect data set doesn't exist, and a key job for the data scientist is to overcome this.
This assumes you have at least some data, but what if you have nothing? Well, one approach is to generate some using synthetic data. Using expert knowledge of the underlying population you want to apply your model to, you can create a synthetic data set to use in your model building. Making some assumptions about the data distributions and adding an appropriate amount of noise can actually lead you to a better predictive model.
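As a minimal sketch of that idea, the snippet below generates a synthetic sample from expert-assumed distributions. All of the parameters here (the age and income distributions, the coefficients linking them to an outcome) are illustrative assumptions, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 1_000

# Expert-supplied assumptions about the population (illustrative only):
# age roughly normal, income roughly log-normal.
age = rng.normal(loc=45, scale=12, size=n).clip(18, 90)
income = rng.lognormal(mean=10.5, sigma=0.4, size=n)

# A hypothesised relationship to a binary outcome, with noise added
# so a model trained on this data cannot simply memorise the rule.
logit = -8 + 0.05 * age + 0.0001 * income + rng.normal(0, 0.5, size=n)
outcome = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([age, income])  # features for model building
```

The noise term is doing real work here: without it, any model would recover the invented rule exactly and look deceptively strong.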
One example where I have used this approach in the past is an organisation looking to test an entirely new product. In that situation there may be some patchwork data sets available, but by boosting these with some synthesized data I was able to reach a large enough sample to have confidence in a stable model.
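One simple way to "boost" a small real sample, under the assumption that bootstrap resampling with column-scaled jitter is an acceptable stand-in for the (unspecified) synthesis method used in the article, is sketched below. The row count of 24 echoes the anecdote in the comments; everything else is hypothetical:

```python
import numpy as np

def augment_with_noise(X, n_new, noise_scale=0.1, seed=0):
    """Bootstrap-resample real rows, then jitter each column by
    Gaussian noise scaled to that column's standard deviation."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=n_new)
    jitter = rng.normal(0.0, noise_scale, size=(n_new, X.shape[1]))
    return X[idx] + jitter * X.std(axis=0)

# A patchy real sample of 24 rows, padded out to 500 for model fitting.
X_real = np.random.default_rng(1).normal(size=(24, 5))
X_boosted = np.vstack([X_real, augment_with_noise(X_real, 500 - 24)])
```

Scaling the noise to each column's spread keeps the synthetic rows plausible relative to the real ones, but note this inherits every bias already present in the 24 real rows.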
This sounds too good to be true, doesn't it? Creating data to build powerful models? There are definite pitfalls to this approach, and in my mind there are three huge risks:
Synthetic data creation is a powerful way to get a leg up where data coverage is sparse. However, don't fully fall for the hype: make sure you really trust your expert, and don't forget to understand your risks.
Questions to ask:
Technology Entrepreneur | Making Impact through Trustworthy AI innovation
2 years ago: A nice article, Dan Kellett. Cold start is obviously a big problem. A large bank was once given the job of building a complex decisioning model based on 24 rows, and claimed they had 2 years of data to work on (12 months × 2 years). Synthetic data can reduce bias and improve generalisation and fairness, but it can also do exactly the opposite by injecting expert-knowledge bias, as you have mentioned. However, there is a trend often practiced and accepted in the industry: human in the loop. Start with a model based on synthetic data (to avoid cold start), keep a human in the loop to constantly monitor and improve performance on real data (don't depend on it until it reaches a certain threshold with a high confidence interval), add as much explainability as possible, and finally add model monitoring (to capture any drift).