17 More Must-Know Data Science Interview Questions and Answers, Part 2
Gregory Piatetsky-Shapiro
Part-time philosopher, Retired, Data Scientist, KDD and KDnuggets Founder, was LinkedIn Top Voice on Data Science & Analytics. Currently helping Ukrainian refugees in MA.
See also Part 1 of 17 More Must-Know Data Science Interview Questions and Answers. This is Part 2.
Q7. What is overfitting, and how can you avoid it?
Gregory Piatetsky answers:
Overfitting is when you build a predictive model that fits the training data "too closely", so that it captures the random noise in the data rather than the true patterns. As a result, the model's predictions will be poor when applied to new data.
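A quick way to see this is to fit models of increasing complexity and compare training error with test error. The sketch below is my own illustration, not part of the original answer; the sine-plus-noise data and the polynomial degrees are arbitrary choices made just for the demonstration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)

def true_signal(x):
    return np.sin(2 * np.pi * x).ravel()

# 30 noisy training points and 30 noisy test points from the same process
x_train = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
x_test = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y_train = true_signal(x_train) + rng.normal(scale=0.3, size=30)
y_test = true_signal(x_test) + rng.normal(scale=0.3, size=30)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    # The degree-15 polynomial gives the lowest training error but (typically)
    # the worst test error -- it has memorized the noise.
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

The training error keeps shrinking as the model gets more flexible, but the test error turns around once the model starts fitting noise: that turnaround is overfitting.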
We frequently hear about studies that report unusual results (especially if you listen to Wait Wait Don't Tell Me), or see findings like "an orange used car is least likely to be a lemon", or learn that new studies overturn previously established findings (eggs are no longer bad for you).
Many such studies produce questionable results that cannot be repeated.
This is a big problem, especially in the social sciences and medicine, where researchers frequently commit the cardinal sin of Data Science: overfitting the data.
The researchers test too many hypotheses without proper statistical control until they happen to find something interesting, and then they report it. Not surprisingly, the next time around the effect (which was partly due to chance) is much smaller or absent.
These flaws of research practices were identified and reported by John P. A. Ioannidis in his landmark paper Why Most Published Research Findings Are False (PLoS Medicine, 2005). Ioannidis found that very often either the results were exaggerated or the findings could not be replicated. In his paper, he presented statistical evidence that indeed most claimed research findings are false!
Ioannidis noted that in order for a research finding to be reliable, it should have:
- Large sample size and large effect sizes
- A small number of tested relationships, with substantial preselection of which relationships to test
- Little flexibility in designs, definitions, outcomes, and analytical modes
- Minimal bias due to financial and other factors (including the popularity of the scientific field)
Unfortunately, these rules were violated too often, producing spurious results such as the S&P 500 index being strongly correlated with butter production in Bangladesh, or US spending on science, space and technology correlating with suicides by hanging, strangulation, and suffocation (from https://tylervigen.com/spurious-correlations).
See more strange and spurious findings at Spurious Correlations by Tyler Vigen, or discover them yourself using tools such as Google Correlate.
Several methods can be used to avoid "overfitting" the data:
- Try to find the simplest possible hypothesis
- Regularization (adding a penalty for complexity)
- Randomization Testing (randomize the class variable and try your method on this data; if it finds the same strong results, something is wrong)
- Nested cross-validation (do feature selection on the inner level, then run the entire method in cross-validation on the outer level; see the sketch after this list)
- Adjusting the False Discovery Rate
- Using the reusable holdout method - a breakthrough approach proposed in 2015
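As a rough illustration of the nested cross-validation idea, here is my own sketch (not code from the article); the dataset, model, and parameter grid are arbitrary choices. The inner loop tunes a regularization penalty, and the outer loop scores the tuned model only on folds it never saw during tuning.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: 5-fold search over the regularization strength C
inner_search = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)

# Outer loop: 5-fold CV wrapped around the whole tuning procedure, so each
# outer fold is scored with hyperparameters chosen without ever seeing it.
outer_scores = cross_val_score(inner_search, X, y, cv=5)
print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

The outer score is the honest performance estimate; reporting the inner-loop score instead would be exactly the kind of overfitting this list warns about.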
Good data science is on the leading edge of the scientific understanding of the world, and it is data scientists' responsibility to avoid overfitting the data and to educate the public and the media on the dangers of bad data analysis.
See also:
- 4 Reasons Your Machine Learning Model is Wrong (and How to Fix It)
- When Good Advice Goes Bad
- The Cardinal Sin of Data Mining and Data Science: Overfitting
- Big Idea To Avoid Overfitting: Reusable Holdout to Preserve Validity in Adaptive Data Analysis
- Overcoming Overfitting with the reusable holdout: Preserving validity in adaptive data analysis
- 11 Clever Methods of Overfitting and how to avoid them
Q8. What is the curse of dimensionality?
Prasad Pore answers:
"As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially."
- Charles Isbell, Professor and Senior Associate Dean, School of Interactive Computing, Georgia Tech
Let's take the example below. Fig. 1 (a) shows 10 data points in one dimension, i.e. there is only one feature in the data set. They can be easily represented on a line with only 10 values, x = 1, 2, 3, ..., 10.
But if we add one more feature, the same data is represented in two dimensions (Fig. 1 (b)), increasing the dimension space to 10*10 = 100. If we add a third feature, the dimension space grows to 10*10*10 = 1000. As the number of dimensions grows, the dimension space grows exponentially:
10^1 = 10
10^2 = 100
10^3 = 1000 and so on...
This exponential growth makes the data set highly sparse and unnecessarily increases storage space and processing time for the modelling algorithm. Think of an image recognition problem with high-resolution images: 1280 × 720 = 921,600 pixels, i.e. 921,600 dimensions. OMG. That's why it's called the Curse of Dimensionality: the value added by each additional dimension is much smaller than the overhead it adds to the algorithm.
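The sparsity point can be quantified with a few lines of plain Python (a hypothetical illustration of my own, not from the original answer): hold the 10 data points fixed and count how much of the discretized feature space they can possibly occupy as features are added.

```python
# 10 points in a grid with 10 possible values per axis (x = 1, 2, ..., 10)
n_points = 10
values_per_dim = 10

for n_dims in (1, 2, 3, 6):
    n_cells = values_per_dim ** n_dims
    occupancy = n_points / n_cells
    # Occupancy drops from 100% in 1-D to 1% in 3-D to 0.001% in 6-D.
    print(f"{n_dims} dimension(s): {n_cells:>10,} cells, "
          f"at most {occupancy:.4%} occupied")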
The bottom line is that data which can be represented using 10 units of space along its one true dimension needs 1000 units of space after adding two more dimensions, just because we happened to observe those dimensions during the experiment. The true dimension is the one that accurately generalizes the data; the observed dimensions are whatever other dimensions we include in the data set, which may or may not contribute to generalizing the data accurately.
See the rest of the answers on KDnuggets:
17 More Must-Know Data Science Interview Questions and Answers, Part 2
https://www.kdnuggets.com/2017/02/17-data-science-interview-questions-answers-part-2.html