21 Must-Know Data Science Interview Questions and Answers, part 2
Gregory Piatetsky-Shapiro
Part-time philosopher, Retired, Data Scientist, KDD and KDnuggets Founder, was LinkedIn Top Voice on Data Science & Analytics. Currently helping Ukrainian refugees in MA.
The KDnuggets post 20 Questions to Detect Fake Data Scientists has been very popular - the most viewed post of the month.
However, these questions lacked answers, so the KDnuggets editors got together and wrote them. Here is part 2 of the answers, starting with a "bonus" question that is probably the most important Data Science question in the era of Big Data.
Bonus Question: Explain what overfitting is and how you would control for it
Answer by Gregory Piatetsky.
Overfitting is finding spurious results that are due to chance and cannot be reproduced by subsequent studies.
We frequently see newspaper reports about studies that overturn previous findings, like eggs are no longer bad for your health, or saturated fat is not linked to heart disease. The problem, in our opinion, is that many researchers, especially in social sciences or medicine, too frequently commit the cardinal sin of Data Mining - Overfitting the data.
The researchers test too many hypotheses without proper statistical control, until they happen to find something interesting and report it. Not surprisingly, the next time the effect, which was (at least partly) due to chance, will be much smaller or absent.
These flaws of research practices were identified and reported by John P. A. Ioannidis in his landmark paper (PLoS Medicine, 2005). Ioannidis found that very often either the results were exaggerated or the findings could not be replicated. In his paper, he presented statistical evidence that indeed most claimed research findings are false.
Ioannidis noted that in order for a research finding to be reliable, it should have:
- A large sample size with large effects
- A lesser number of, and greater preselection of, tested relationships
- Lesser flexibility in designs, definitions, outcomes, and analytical modes
- Minimal bias due to financial and other factors (including the popularity of that scientific field)
Unfortunately, these rules were violated too often, producing irreproducible results. For example, the S&P 500 index was found to be strongly related to the production of butter in Bangladesh (from 1981 to 1993) (here is PDF).
See more interesting (and totally spurious) findings, which you can discover yourself using tools such as Google Correlate or Spurious Correlations by Tyler Vigen.
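To see how easy it is to "discover" such relationships by chance, here is a minimal Python sketch using made-up random data: it searches many unrelated random series for the one most correlated with a target, and with enough candidates a seemingly strong correlation almost always appears.

# Sketch: finding a "strong" but entirely spurious correlation by testing many hypotheses.
# All data here is random noise; the only trick is searching over many candidates.
import numpy as np

rng = np.random.default_rng(0)
n_years = 13                       # e.g., 13 annual observations, as in the butter example
target = rng.normal(size=n_years)  # stand-in for S&P 500 returns (pure noise)

n_candidates = 1000                # many unrelated series to "test"
candidates = rng.normal(size=(n_candidates, n_years))

# Correlate every candidate with the target and keep the best one.
correlations = np.array([np.corrcoef(c, target)[0, 1] for c in candidates])
best = np.argmax(np.abs(correlations))
print(f"Best of {n_candidates} random series correlates at r = {correlations[best]:.2f}")
# With 1000 candidates and only 13 observations, |r| > 0.7 is common,
# despite there being zero real relationship.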
Several methods can be used to avoid "overfitting" the data:
- Try to find the simplest possible hypothesis
- Regularization (adding a penalty for complexity)
- Randomization Testing (randomize the class variable, try your method on this data - if it finds the same strong results, something is wrong)
- Nested cross-validation (do feature selection at one level, then run the entire method in cross-validation at the outer level; see the sketch after this list)
- Adjusting the False Discovery Rate
- Using the reusable holdout method - a breakthrough approach proposed in 2015
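As an illustration of the nested cross-validation and regularization items above, here is a minimal scikit-learn sketch on synthetic data; the pipeline and parameter grid are illustrative choices, not a prescription.

# Sketch: nested cross-validation (illustrative, synthetic data).
# Feature selection and model fitting happen inside the inner loop;
# the outer loop then gives an honest estimate of performance.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=50, n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),            # feature selection
    ("model", LogisticRegression(max_iter=1000)),  # L2 regularization = penalty for complexity
])

# Inner loop: choose the number of selected features and the regularization strength.
inner = GridSearchCV(pipe,
                     param_grid={"select__k": [5, 10, 20], "model__C": [0.1, 1, 10]},
                     cv=5)

# Outer loop: evaluate the whole procedure so the test folds never see the selection step.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("Nested CV accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))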
Good data science is on the leading edge of scientific understanding of the world, and it is data scientists' responsibility to avoid overfitting the data and to educate the public and the media on the dangers of bad data analysis.
See also
- The Cardinal Sin of Data Mining and Data Science: Overfitting
- Big Idea To Avoid Overfitting: Reusable Holdout to Preserve Validity in Adaptive Data Analysis
- Overcoming Overfitting with the reusable holdout: Preserving validity in adaptive data analysis
- 11 Clever Methods of Overfitting and how to avoid them
- Tag: Overfitting
Q12. Give an example of how you would use experimental design to answer a question about user behavior.
Answer by Bhavya Geethika.
Step 1: Formulate the Research Question:
What are the effects of page load times on user satisfaction ratings?
Step 2: Identify variables:
We identify the cause and effect. Independent variable: page load time. Dependent variable: user satisfaction rating.
Step 3: Generate Hypothesis:
Lower page load time will increase the user satisfaction rating for a web page. Here the factor we analyze is page load time.
Fig 12: There is a flaw in your experimental design (cartoon from here)
Step 4: Determine Experimental Design.
We consider the experimental complexity, i.e., whether to vary one factor at a time or multiple factors at once, in which case we use a factorial design (2^k design; a small enumeration sketch follows below). A design is also selected based on the type of objective (comparative, screening, response surface) and the number of factors.
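For instance, a full factorial (2^k) design with two factors at two levels each can be enumerated directly; the factor names and levels below are hypothetical examples.

# Sketch: enumerating a full 2^k factorial design (here k = 2, so 4 conditions).
# Factor names and levels are made up for illustration.
from itertools import product

factors = {
    "page_load_time": ["fast", "slow"],
    "buy_button_position": ["left", "right"],
}

conditions = list(product(*factors.values()))
for i, condition in enumerate(conditions, 1):
    settings = ", ".join(f"{name}={level}" for name, level in zip(factors, condition))
    print(f"Condition {i}: {settings}")
# 2 factors x 2 levels each -> 2^2 = 4 experimental conditions.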
Here we also choose among within-participants, between-participants, and mixed designs. For example: there are two versions of a page, one with the Buy button (call to action) on the left, and the other version has this button on the right.
Within-participants design - the same users see both versions.
Between-participants design - one group of users sees version A and the other group sees version B.
Step 5: Develop experimental task & procedure:
A detailed description of the steps involved in the experiment, the tools used to measure user behavior, and the goals and success metrics should be defined. Collect quantitative data about user engagement to allow statistical analysis.
Step 6: Determine Manipulation & Measurements
Manipulation: one level of the factor will be controlled and the other will be manipulated. We also identify the behavioral measures (a small sketch of computing them from an event log follows this list):
- Latency - time between a prompt and the occurrence of the behavior (how long it takes a user to click Buy after being presented with products).
- Frequency - number of times a behavior occurs (the number of times the user clicks on a given page within a time period).
- Duration - length of time a specific behavior lasts (the time taken to add all products).
- Intensity - force with which a behavior occurs (how quickly the user purchased a product).
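A minimal sketch of how such measures might be computed from a click-stream log; the event format and field names here are assumptions, not a real logging schema.

# Sketch: computing latency, frequency, and duration from a hypothetical event log.
# Each event is (user_id, action, timestamp_in_seconds); the schema is made up for illustration.
events = [
    ("u1", "page_shown", 0.0),
    ("u1", "click_buy",  4.2),
    ("u1", "add_item",   6.0),
    ("u1", "add_item",   9.5),
    ("u1", "checkout",  12.0),
]

shown = next(t for _, a, t in events if a == "page_shown")
bought = next(t for _, a, t in events if a == "click_buy")
latency = bought - shown                                     # time from prompt to behavior

frequency = sum(1 for _, a, _ in events if a == "add_item")  # how often a behavior occurs

add_times = [t for _, a, t in events if a == "add_item"]
duration = max(add_times) - min(add_times)                   # how long the behavior lasts

print(f"latency={latency:.1f}s, frequency={frequency}, duration={duration:.1f}s")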
Step 7: Analyze results
Examine the user behavior data and determine whether it supports or contradicts the hypothesis, based on the observations made - e.g., how the majority of users' satisfaction ratings compared with page load times.
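For example, one might compare satisfaction ratings between a fast-load and a slow-load condition with a simple two-sample test; the ratings below are made up for illustration.

# Sketch: testing whether satisfaction ratings differ between load-time conditions.
# Ratings are hypothetical 1-10 scores for two groups of users.
from scipy import stats

fast_load_ratings = [8, 9, 7, 8, 9, 8, 7, 9]   # users who saw the fast-loading page
slow_load_ratings = [6, 5, 7, 6, 4, 6, 5, 6]   # users who saw the slow-loading page

t_stat, p_value = stats.ttest_ind(fast_load_ratings, slow_load_ratings)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value is evidence that load time affects satisfaction, supporting the
# hypothesis; otherwise the data do not support it.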
Read the rest on KDnuggets:
21 Must-Know Data Science Interview Questions and Answers, part 2
https://www.kdnuggets.com/2016/02/21-data-science-interview-questions-answers-part2.html
Innovation Manager - Quantum Computing at Deutsche Bundesbank
Thanks Gregory. Is there a link for part 1?
Data And Technology Strategist & Team Leader / CGEIT / TOGAF Enterprise Architecture Practitioner / FinOps Certified Practitioner / SAFe 5 Agilist
Couldn't agree more with John above: regardless of the specific job description, the most useful interview discussions, in my experience, focus on mistakes. Asking about them tells you a ton about a person's ability to correct a path, to learn lessons, and to apply those lessons. You also learn a lot about their personal temperament.
Data Scientist AI Lecturer| Founder| Investor| Ex Fixed Income trader | Quant | Former Instructor at Columbia University,{among other things} Never Despair, life is short. Climb a mountain! Eat a peach!
Most of the stuff we ask is about what projects they, the interviewee, undertook that "rocked the boat." How many failures you had along the way and why. This is an incisive way of reading how the person thought their way through to the "wrong" answer and what they did to remediate it. Collaboration among individuals and writing skills trump some of the tech stuff. Very few data scientists get to work on their own models unless one is in research, and even then you're mostly there to help someone understand their challenges or physical models. Genetics is a prime example, while quantitative methods in finance may be a counterexample. That said, the lingo of engineering is an increasingly important thing: how to get the most processing out of the hardware you have rather than the hardware you want. Just my thoughts. Thanks for all the work you guys are doing.
While this is a great list of things to know, it points to the general problem with most interview questions - Google a list and study the answers. Better questions are framed around prior projects or activities someone has done, and look for evidence of these things - did they control for sampling bias, etc., and ask the appropriate clarifying follow-up questions? That's why you always need an expert asking the questions. The whole premise of Expert Interview! Check us out!