21 Must-Know Data Science Interview Questions and Answers, part 2
Gregory Piatetsky-Shapiro
Part-time philosopher, Retired, Data Scientist, KDD and KDnuggets Founder, was LinkedIn Top Voice on Data Science & Analytics. Currently helping Ukrainian refugees in MA.
The KDnuggets post 20 Questions to Detect Fake Data Scientists has been very popular - the most viewed post of the month.
However, these questions lacked answers, so the KDnuggets editors got together and wrote them. Here is part 2 of the answers, starting with a "bonus" question that is probably the most important Data Science question in the era of Big Data.
Bonus Question: Explain what overfitting is and how you would control for it
Answer by Gregory Piatetsky.
Overfitting is finding spurious results that are due to chance and cannot be reproduced by subsequent studies.
We frequently see newspaper reports about studies that overturn previous findings, like eggs are no longer bad for your health, or saturated fat is not linked to heart disease. The problem, in our opinion, is that many researchers, especially in social sciences or medicine, too frequently commit the cardinal sin of Data Mining - Overfitting the data.
The researchers test too many hypotheses without proper statistical control, until they happen to find something interesting and report it. Not surprisingly, the next time the effect, which was (at least partly) due to chance, will be much smaller or absent.
These flaws of research practices were identified and reported by John P. A. Ioannidis in his landmark paper (PLoS Medicine, 2005). Ioannidis found that very often either the results were exaggerated or the findings could not be replicated. In his paper, he presented statistical evidence that indeed most claimed research findings are false.
Ioannidis noted that in order for a research finding to be reliable, it should have:
- A large sample size with large effects
- A lesser number of, and greater preselection of, tested relationships
- Lesser flexibility in designs, definitions, outcomes, and analytical modes
- Minimal bias due to financial and other factors (including the popularity of that scientific field)
Unfortunately, these rules were violated too often, producing irreproducible results. For example, the S&P 500 index was found to be strongly related to the production of butter in Bangladesh (from 1981 to 1993) (here is PDF).
See more interesting (and totally spurious) findings, which you can discover yourself using tools such as Google Correlate or Spurious Correlations by Tyler Vigen.
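To see how easy it is to "discover" such relationships by chance, here is a minimal Python sketch using made-up random data: it searches many unrelated random series for the one most correlated with a target, and with enough candidates a seemingly strong correlation almost always appears.

# Sketch: finding a "strong" but entirely spurious correlation by testing many hypotheses.
# All data here is random noise; the only trick is searching over many candidates.
import numpy as np

rng = np.random.default_rng(0)
n_years = 13                       # e.g., 13 annual observations, as in the butter example
target = rng.normal(size=n_years)  # stand-in for S&P 500 returns (pure noise)

n_candidates = 1000                # many unrelated series to "test"
candidates = rng.normal(size=(n_candidates, n_years))

# Correlate every candidate with the target and keep the best one.
correlations = np.array([np.corrcoef(c, target)[0, 1] for c in candidates])
best = np.argmax(np.abs(correlations))
print(f"Best of {n_candidates} random series correlates at r = {correlations[best]:.2f}")
# With 1000 candidates and only 13 observations, |r| > 0.7 is common,
# despite there being zero real relationship.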
Several methods can be used to avoid "overfitting" the data:
- Try to find the simplest possible hypothesis
- Regularization (adding a penalty for complexity)
- Randomization Testing (randomize the class variable, try your method on this data - if it finds the same strong results, something is wrong)
- Nested cross-validation (do feature selection at one level, then run the entire method in cross-validation at the outer level; see the sketch after this list)
- Adjusting the False Discovery Rate
- Using the reusable holdout method - a breakthrough approach proposed in 2015
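As an illustration of the nested cross-validation and regularization items above, here is a minimal scikit-learn sketch on synthetic data; the pipeline and parameter grid are illustrative choices, not a prescription.

# Sketch: nested cross-validation (illustrative, synthetic data).
# Feature selection and model fitting happen inside the inner loop;
# the outer loop then gives an honest estimate of performance.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=50, n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),            # feature selection
    ("model", LogisticRegression(max_iter=1000)),  # L2 regularization = penalty for complexity
])

# Inner loop: choose the number of selected features and the regularization strength.
inner = GridSearchCV(pipe,
                     param_grid={"select__k": [5, 10, 20], "model__C": [0.1, 1, 10]},
                     cv=5)

# Outer loop: evaluate the whole procedure so the test folds never see the selection step.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("Nested CV accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))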
Good data science is on the leading edge of scientific understanding of the world, and it is data scientists' responsibility to avoid overfitting the data and to educate the public and the media on the dangers of bad data analysis.
See also
- The Cardinal Sin of Data Mining and Data Science: Overfitting
- Big Idea To Avoid Overfitting: Reusable Holdout to Preserve Validity in Adaptive Data Analysis
- Overcoming Overfitting with the reusable holdout: Preserving validity in adaptive data analysis
- 11 Clever Methods of Overfitting and how to avoid them
- Tag: Overfitting
Q12. Give an example of how you would use experimental design to answer a question about user behavior.
Answer by Bhavya Geethika.
Step 1: Formulate the Research Question:
What are the effects of page load times on user satisfaction ratings?
Step 2: Identify variables:
We identify the cause and effect. Independent variable: page load time. Dependent variable: user satisfaction rating.
Step 3: Generate Hypothesis:
Lower page load time will increase the user satisfaction rating for a web page. Here the factor we analyze is page load time.
Fig 12: There is a flaw in your experimental design (cartoon from here)
Step 4: Determine Experimental Design.
We consider the experimental complexity, i.e., whether to vary one factor at a time or multiple factors at once, in which case we use a factorial design (2^k design; a small enumeration sketch follows below). A design is also selected based on the type of objective (comparative, screening, response surface) and the number of factors.
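For instance, a full factorial (2^k) design with two factors at two levels each can be enumerated directly; the factor names and levels below are hypothetical examples.

# Sketch: enumerating a full 2^k factorial design (here k = 2, so 4 conditions).
# Factor names and levels are made up for illustration.
from itertools import product

factors = {
    "page_load_time": ["fast", "slow"],
    "buy_button_position": ["left", "right"],
}

conditions = list(product(*factors.values()))
for i, condition in enumerate(conditions, 1):
    settings = ", ".join(f"{name}={level}" for name, level in zip(factors, condition))
    print(f"Condition {i}: {settings}")
# 2 factors x 2 levels each -> 2^2 = 4 experimental conditions.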
Here we also choose among within-participants, between-participants, and mixed designs. For example: there are two versions of a page, one with the Buy button (call to action) on the left, and the other version has this button on the right.
Within-participants design - the same users see both versions.
Between-participants design - one group of users sees version A and the other group sees version B.
Step 5: Develop experimental task & procedure:
A detailed description of the steps involved in the experiment, the tools used to measure user behavior, and the goals and success metrics should be defined. Collect quantitative data about user engagement to allow statistical analysis.
Step 6: Determine Manipulation & Measurements
Manipulation: one level of the factor will be controlled and the other will be manipulated. We also identify the behavioral measures (a small sketch of computing them from an event log follows this list):
- Latency - time between a prompt and the occurrence of the behavior (how long it takes a user to click Buy after being presented with products).
- Frequency - number of times a behavior occurs (the number of times the user clicks on a given page within a time period).
- Duration - length of time a specific behavior lasts (the time taken to add all products).
- Intensity - force with which a behavior occurs (how quickly the user purchased a product).
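A minimal sketch of how such measures might be computed from a click-stream log; the event format and field names here are assumptions, not a real logging schema.

# Sketch: computing latency, frequency, and duration from a hypothetical event log.
# Each event is (user_id, action, timestamp_in_seconds); the schema is made up for illustration.
events = [
    ("u1", "page_shown", 0.0),
    ("u1", "click_buy",  4.2),
    ("u1", "add_item",   6.0),
    ("u1", "add_item",   9.5),
    ("u1", "checkout",  12.0),
]

shown = next(t for _, a, t in events if a == "page_shown")
bought = next(t for _, a, t in events if a == "click_buy")
latency = bought - shown                                     # time from prompt to behavior

frequency = sum(1 for _, a, _ in events if a == "add_item")  # how often a behavior occurs

add_times = [t for _, a, t in events if a == "add_item"]
duration = max(add_times) - min(add_times)                   # how long the behavior lasts

print(f"latency={latency:.1f}s, frequency={frequency}, duration={duration:.1f}s")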
Step 7: Analyze results
Examine the user behavior data and determine whether it supports or contradicts the hypothesis, based on the observations made - e.g., how the majority of users' satisfaction ratings compared with page load times.
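For example, one might compare satisfaction ratings between a fast-load and a slow-load condition with a simple two-sample test; the ratings below are made up for illustration.

# Sketch: testing whether satisfaction ratings differ between load-time conditions.
# Ratings are hypothetical 1-10 scores for two groups of users.
from scipy import stats

fast_load_ratings = [8, 9, 7, 8, 9, 8, 7, 9]   # users who saw the fast-loading page
slow_load_ratings = [6, 5, 7, 6, 4, 6, 5, 6]   # users who saw the slow-loading page

t_stat, p_value = stats.ttest_ind(fast_load_ratings, slow_load_ratings)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value is evidence that load time affects satisfaction, supporting the
# hypothesis; otherwise the data do not support it.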
Read the rest on KDnuggets:
21 Must-Know Data Science Interview Questions and Answers, part 2
https://www.kdnuggets.com/2016/02/21-data-science-interview-questions-answers-part2.html
Innovation Manager - Quantum Computing at Deutsche Bundesbank
Thanks Gregory. Is there a link for part 1?
Data And Technology Strategist & Team Leader / CGEIT / TOGAF Enterprise Architecture Practitioner / FinOps Certified Practitioner / SAFe 5 Agilist
Couldn't agree more with John above: regardless of the specific job description, the most useful interview discussions, in my experience, focus on mistakes. Asking about them tells you a ton about a person's ability to correct a path, to learn lessons, and to apply those lessons. You also learn a lot about their personal temperament.
Data Scientist AI Lecturer| Founder| Investor| Ex Fixed Income trader | Quant | Former Instructor at Columbia University,{among other things} Never Despair, life is short. Climb a mountain! Eat a peach!
Most of the stuff we ask is about what projects they, the interviewee, undertook that "rocked the boat." How many failures you had along the way and why. This is an incisive way of reading how the person thought their way through to the "wrong" answer and what they did to remediate it. Collaboration among individuals and writing skills trump some of the tech stuff. Very few data scientists get to work on their own models unless one is in research, and even then you're mostly there to help someone understand their challenges or physical models. Genetics is a prime example, while quantitative methods in finance may be a counterexample. That said, the lingo of engineering is an increasingly important thing: how to get the most processing out of the hardware you have rather than the hardware you want. Just my thoughts. Thanks for all the work you guys are doing.
While this is a great list of things to know, it points to the general problem with most interview questions - Google a list and study the answers. Better questions are framed around prior projects or activities someone has done, and look for evidence of these things - did they control for sampling bias, etc., and ask the appropriate clarifying follow-up questions? That's why you always need an expert asking the questions. The whole premise of Expert Interview! Check us out!