Resisting the urge to overfit
I drive back via NH8 from Gurgaon to Delhi on a daily basis. For folks familiar with the drive, the entry to Delhi opposite Ambience Mall is fairly congested. As a commuter, you can take either the service lane or the highway, and you can use Google Maps to decide. The problem with Google Maps is that it is often wrong during congestion and can't be followed blindly. For the first few days of the drive, I did what many CS folks do: took the two paths on alternate days and observed which was faster. Once the exploration phase was over, I settled on sticking to the service lane before Ambience Mall and cutting into the highway just at the entry point. I also make it a point to observe whether the overall behaviour is drifting away from my original hypothesis. Yet there are days when I get this urge to deviate from the "optimal" strategy. It could be a little extra congestion at the entry to the highway or something else: signals that are only weakly correlated with the drive-through time, but correlated nonetheless.
I am generally considered a very logical person, but even I find it very difficult to resist this urge: add one additional attribute to the model; solve that one extra false negative.
In my experience, there are two distinct phases to building any data science model. The first, or lab, phase is where you collect the data to build your model. In this phase, you go out and talk to everyone under the sun, trying to get as representative a data set as possible. Once you have the data, you disappear into a lab and come out only when you have a model and all sorts of metrics in hand: precision, recall, F1, the confusion matrix. You have a smug smile on your face as you throw these metrics at your business stakeholders, eager to get to the real-world "deployment phase".
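For concreteness, here is a minimal sketch of what that lab-phase report usually looks like. The synthetic data, the random-forest classifier and the 75/25 split are placeholders, not the model from this story; the point is only that the metrics come from a held-out split, not from hand-picked samples.

```python
# Lab-phase sketch: train a placeholder classifier and report the usual
# aggregate metrics on a held-out split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the data gathered "from everyone under the sun".
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
preds = model.predict(X_test)

print(confusion_matrix(y_test, preds))       # raw breakdown of errors
print(classification_report(y_test, preds))  # precision, recall, F1
```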
Of course, no one on the business side understands any of these metrics. So they do what they understand: they call someone in the field and ask them to send you 10 data points. You run the data through the model and, voilà, the model is correct 9 out of 10 times. You claim success again; see, the precision is 90%.
The business exec, however, looks glumly at the one sample that you missed. Can we solve for that? Even better, they already have an answer: if you just put a condition on attribute X > Y, you get what you need. That is what you need to change in your algorithm. It does not seem to matter that you were using a deep learning model that already included X (and there is no "algorithm" to change), or a stochastic decision tree whose accuracy is higher at thresholds other than the Y that happens to fix this one false negative. That negative sample is all that matters at this point in the discussion.
This is where it is absolutely important to hold your ground. It is important to bring the discussion back to the data set and the metric you are targeting. If the data set was not representative, it needs to be fixed. If a precision of 80% is not good enough, we need to take it up to 90%. However, under no circumstances can you guarantee that the model will work on that one negative sample. The only goal you can take back is high precision (or whatever metric makes sense for the business) on a real-world test set.
Often enough, I have seen data scientists and engineers come back and tweak the model to somehow fit that particular sample. Sometimes this is done by choosing model parameters that make that sample work; at other times by including the sample in the training set and picking a deeper network that ensures a closer fit to the training data. It is much simpler to change just enough in your model to get that offending sample out of the way (at the cost of overfitting and lower precision in the real world) than to convince others that they need to change the way they think: look at aggregates, not individual samples, when assessing whether a model is working.
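If it helps to make that trade-off concrete, here is a hedged sketch of the comparison worth insisting on: score the patched and unpatched model on the same held-out set, not on the one offending sample. The hard rule below, forcing a positive whenever a chosen attribute exceeds a threshold, stands in for the exec's "attribute X > Y" condition; the column index and threshold are purely illustrative, and the sketch reuses `model`, `X_test` and `y_test` from the lab-phase snippet above.

```python
import numpy as np
from sklearn.metrics import precision_score

# Predictions of the original model on the held-out set.
base_preds = model.predict(X_test)

# The "quick fix": force a positive prediction whenever attribute X exceeds Y.
# X_COLUMN and Y_THRESHOLD are hypothetical values chosen only to mimic the rule.
X_COLUMN, Y_THRESHOLD = 0, 0.5
patched_preds = np.where(X_test[:, X_COLUMN] > Y_THRESHOLD, 1, base_preds)

# The comparison that matters is the aggregate one, on held-out data:
# the patch may rescue one false negative while adding false positives elsewhere.
print("precision without patch:", precision_score(y_test, base_preds))
print("precision with patch:   ", precision_score(y_test, patched_preds))
```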
This urge to overfit is common to us all. But take a deep breath and let that one sample fail. Never forget that a model that always works in training does not learn; it only memorizes.