17 More Must-Know Data Science Interview Questions and Answers, Part 2
Gregory Piatetsky-Shapiro
Part-time philosopher, Retired, Data Scientist, KDD and KDnuggets Founder, was LinkedIn Top Voice on Data Science & Analytics. Currently helping Ukrainian refugees in MA.
See also Part 1 of 17 More Must-Know Data Science Interview Questions and Answers. This is Part 2.
Q7. What is overfitting, and how can you avoid it?
Gregory Piatetsky answers:
Overfitting is when you build a predictive model that fits the training data "too closely", so that it captures the random noise in the data rather than the true patterns. As a result, the model's predictions will be poor when applied to new data.
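A quick way to see this is to fit models of increasing complexity and compare training error with test error. The sketch below is my own illustration, not part of the original answer; the sine-plus-noise data and the polynomial degrees are arbitrary choices made just for the demonstration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)

def true_signal(x):
    return np.sin(2 * np.pi * x).ravel()

# 30 noisy training points and 30 noisy test points from the same process
x_train = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
x_test = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y_train = true_signal(x_train) + rng.normal(scale=0.3, size=30)
y_test = true_signal(x_test) + rng.normal(scale=0.3, size=30)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    # The degree-15 polynomial gives the lowest training error but (typically)
    # the worst test error -- it has memorized the noise.
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

The training error keeps shrinking as the model gets more flexible, but the test error turns around once the model starts fitting noise: that turnaround is overfitting.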
We frequently hear about studies that report unusual results (especially if you listen to Wait Wait Don't Tell Me), or see findings like "an orange used car is least likely to be a lemon", or learn that new studies overturn previously established findings (eggs are no longer bad for you).
Many such studies produce questionable results that cannot be repeated.
This is a big problem, especially in the social sciences and medicine, where researchers frequently commit the cardinal sin of Data Science: overfitting the data.
The researchers test too many hypotheses without proper statistical control until they happen to find something interesting, and then they report it. Not surprisingly, the next time around the effect (which was partly due to chance) is much smaller or absent.
These flaws of research practices were identified and reported by John P. A. Ioannidis in his landmark paper Why Most Published Research Findings Are False (PLoS Medicine, 2005). Ioannidis found that very often either the results were exaggerated or the findings could not be replicated. In his paper, he presented statistical evidence that indeed most claimed research findings are false!
Ioannidis noted that in order for a research finding to be reliable, it should have:
- Large sample size and large effect sizes
- A small number of tested relationships, with substantial preselection of which relationships to test
- Little flexibility in designs, definitions, outcomes, and analytical modes
- Minimal bias due to financial and other factors (including the popularity of the scientific field)
Unfortunately, these rules were violated too often, producing spurious results such as the S&P 500 index being strongly correlated with butter production in Bangladesh, or US spending on science, space and technology correlating with suicides by hanging, strangulation, and suffocation (from https://tylervigen.com/spurious-correlations).
See more strange and spurious findings at Spurious Correlations by Tyler Vigen, or discover them yourself using tools such as Google Correlate.
Several methods can be used to avoid "overfitting" the data:
- Try to find the simplest possible hypothesis
- Regularization (adding a penalty for complexity)
- Randomization Testing (randomize the class variable and try your method on this data; if it finds the same strong results, something is wrong)
- Nested cross-validation (do feature selection on the inner level, then run the entire method in cross-validation on the outer level; see the sketch after this list)
- Adjusting the False Discovery Rate
- Using the reusable holdout method - a breakthrough approach proposed in 2015
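As a rough illustration of the nested cross-validation idea, here is my own sketch (not code from the article); the dataset, model, and parameter grid are arbitrary choices. The inner loop tunes a regularization penalty, and the outer loop scores the tuned model only on folds it never saw during tuning.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: 5-fold search over the regularization strength C
inner_search = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)

# Outer loop: 5-fold CV wrapped around the whole tuning procedure, so each
# outer fold is scored with hyperparameters chosen without ever seeing it.
outer_scores = cross_val_score(inner_search, X, y, cv=5)
print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

The outer score is the honest performance estimate; reporting the inner-loop score instead would be exactly the kind of overfitting this list warns about.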
Good data science is on the leading edge of the scientific understanding of the world, and it is data scientists' responsibility to avoid overfitting the data and to educate the public and the media on the dangers of bad data analysis.
See also:
- 4 Reasons Your Machine Learning Model is Wrong (and How to Fix It)
- When Good Advice Goes Bad
- The Cardinal Sin of Data Mining and Data Science: Overfitting
- Big Idea To Avoid Overfitting: Reusable Holdout to Preserve Validity in Adaptive Data Analysis
- Overcoming Overfitting with the reusable holdout: Preserving validity in adaptive data analysis
- 11 Clever Methods of Overfitting and how to avoid them
Q8. What is the curse of dimensionality?
Prasad Pore answers:
"As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially."
- Charles Isbell, Professor and Senior Associate Dean, School of Interactive Computing, Georgia Tech
Let's take the example below. Fig. 1 (a) shows 10 data points in one dimension, i.e. there is only one feature in the data set. They can be easily represented on a line with only 10 values, x = 1, 2, 3, ..., 10.
But if we add one more feature, the same data is represented in two dimensions (Fig. 1 (b)), increasing the dimension space to 10*10 = 100. If we add a third feature, the dimension space grows to 10*10*10 = 1000. As the number of dimensions grows, the dimension space grows exponentially:
10^1 = 10
10^2 = 100
10^3 = 1000 and so on...
This exponential growth makes the data set highly sparse and unnecessarily increases storage space and processing time for the modelling algorithm. Think of an image recognition problem with high-resolution images: 1280 × 720 = 921,600 pixels, i.e. 921,600 dimensions. OMG. That's why it's called the Curse of Dimensionality: the value added by each additional dimension is much smaller than the overhead it adds to the algorithm.
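The sparsity point can be quantified with a few lines of plain Python (a hypothetical illustration of my own, not from the original answer): hold the 10 data points fixed and count how much of the discretized feature space they can possibly occupy as features are added.

```python
# 10 points in a grid with 10 possible values per axis (x = 1, 2, ..., 10)
n_points = 10
values_per_dim = 10

for n_dims in (1, 2, 3, 6):
    n_cells = values_per_dim ** n_dims
    occupancy = n_points / n_cells
    # Occupancy drops from 100% in 1-D to 1% in 3-D to 0.001% in 6-D.
    print(f"{n_dims} dimension(s): {n_cells:>10,} cells, "
          f"at most {occupancy:.4%} occupied")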
The bottom line is that data which can be represented using 10 units of space along its one true dimension needs 1000 units of space after adding two more dimensions, just because we happened to observe those dimensions during the experiment. The true dimension is the one that accurately generalizes the data; the observed dimensions are whatever other dimensions we include in the data set, which may or may not contribute to generalizing the data accurately.
See the rest of the answers on KDnuggets:
17 More Must-Know Data Science Interview Questions and Answers, Part 2
https://www.kdnuggets.com/2017/02/17-data-science-interview-questions-answers-part-2.html