Top-111 Data Science Interview Questions & Detailed Answers

Mukesh Manral

Data Science Specialist (Consultant) - Generative AI | MLOps | Data & AI Architect | Product Development | Cloud - AI + Education

Published: February 24, 2023

Machine Learning & Mathematics:

What is cross-validation? How to do it right?
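
As a starting point for the cross-validation question, here is a minimal sketch of k-fold splitting in plain Python. The function name `k_fold_indices` and its parameters are assumptions made for illustration; they do not come from the article or from any particular library.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(10, 5):
    # Fit preprocessing (scaling, imputation, feature selection) on the
    # training indices only, then apply it to the test indices.
    print(len(train_idx), len(test_idx))
```

One common reading of "doing it right" is that every sample is held out exactly once and that nothing learned from a held-out fold leaks into training. For time-ordered data a shuffled split like this one is usually not appropriate.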

Statistics:

How do you assess the statistical significance of an insight?
Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?
What is the Central Limit Theorem? Explain it. Why is it important?
What is statistical power?
Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?
Provide a simple example of how an experimental design can help answer a question about behavior. How does experimental data contrast with observational data?
Is mean imputation of missing data acceptable practice? Why or why not?
What is an outlier? Explain how you might screen for outliers and what you would do if you found them in your dataset. Also, explain what an inlier is, how you might screen for inliers, and what you would do if you found them in your dataset.
How do you handle missing data? What imputation techniques do you recommend?
You have data on the durations of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test, even graphically, whether your expectations are borne out?
Explain likely differences between administrative datasets and datasets gathered from experimental studies. What are likely problems encountered with administrative data? How do experimental methods help alleviate these problems? What problem do they bring?
You are compiling a report for user content uploaded every month and notice a spike in uploads in October. In particular, a spike in picture uploads. What might you think is the cause of this, and how would you test it?
You’re about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it’s raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that “Yes” it is raining. What is the probability that it’s actually raining in Seattle?
One box has 12 black and 12 red cards; a second box has 24 black and 24 red cards. If you draw 2 cards at random from one of the 2 boxes, which box gives the higher probability of getting two cards of the same color? Can you explain intuitively why the 2nd box has a higher probability?
What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?
Define: quality assurance, six sigma.
Give examples of data that does not have a Gaussian distribution, nor log-normal.
What is root cause analysis? How do you distinguish a cause from a correlation? Give examples.
Give an example where the median is a better measure than the mean.
Given two fair dice, what is the probability of getting scores that sum to 4? To 8?
What is the Law of Large Numbers?
How do you calculate needed sample size?
When you sample, what bias are you inflicting?
How do you control for biases?
What are confounding variables?
What is A/B testing?
An HIV test has a sensitivity of 99.7% and a specificity of 98.5%. A subject from a population of prevalence 0.1% receives a positive test result. What is the precision of the test (i.e., the probability that the subject is HIV positive)?
Infection rates at a hospital above 1 infection per 100 person-days at risk are considered high. A hospital had 10 infections over the last 1787 person-days at risk. Give the p-value of the correct one-sided test of whether the hospital is below the standard.
You flip a biased coin (p(head)=0.8) five times. What’s the probability of getting three or more heads?
A random variable X is normal with mean 1020 and standard deviation 50. Calculate P(X>1200)
Consider the number of people that show up at a bus station is Poisson with mean 2.5/h. What is the probability that at most three people show up in a four hour period?
You are running for office and your pollster polled one hundred people. Sixty of them claimed they will vote for you. Can you relax?
A Geiger counter records 100 radioactive decays in 5 minutes. Find an approximate 95% interval for the number of decays per hour.
The homicide rate in Scotland fell last year to 99 from 115 the year before. Is this reported change really newsworthy?
Consider influenza epidemics for two parent heterosexual families. Suppose that the probability is 17% that at least one of the parents has contracted the disease. The probability that the father has contracted influenza is 12% while the probability that both the mother and father have contracted the disease is 6%. What is the probability that the mother has contracted influenza?
Suppose that diastolic blood pressures (DBPs) for men aged 35-44 are normally distributed with a mean of 80 (mm Hg) and a standard deviation of 10. About what is the probability that a random 35-44 year old has a DBP less than 70?
In a population of interest, a sample of 9 men yielded a sample average brain volume of 1,100cc and a standard deviation of 30cc. What is a 95% Student’s T confidence interval for the mean brain volume in this new population?
A diet pill is given to 9 subjects over six weeks. The average difference in weight (follow up - baseline) is -2 pounds. What would the standard deviation of the difference in weight have to be for the upper endpoint of the 95% T confidence interval to touch 0?
In a study of emergency room waiting times, investigators consider a new triage system and the standard one. To test the systems, administrators selected 20 nights and randomly assigned the new triage system to be used on 10 nights and the standard system on the remaining 10 nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was 3 hours with a variance of 0.60 while the average MWT for the old system was 5 hours with a variance of 0.68. Consider the 95% confidence interval estimate for the difference of the mean MWT associated with the new system. Assume a constant variance. What is the interval? Subtract in this order (New System - Old System).
To further test the hospital triage system, administrators selected 200 nights and randomly assigned a new triage system to be used on 100 nights and a standard system on the remaining 100 nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was 4 hours with a standard deviation of 0.5 hours while the average MWT for the old system was 6 hours with a standard deviation of 2 hours. Consider the hypothesis of a decrease in the mean MWT associated with the new treatment. What does the 95% independent group confidence interval with unequal variances suggest vis-à-vis this hypothesis? (Because there are so many observations per group, just use the Z quantile instead of the T.)
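
The Seattle umbrella question and the HIV test question above are both applications of Bayes' rule. The sketch below is one way to work them through. The Seattle question does not state a prior probability of rain, so the 50% prior used here is purely an assumption; the function names are also illustrative.

```python
from fractions import Fraction

def posterior_rain(prior, p_truth=Fraction(2, 3), n_yes=3):
    """P(rain | n_yes independent friends all say 'yes')."""
    like_rain = p_truth ** n_yes        # every friend tells the truth
    like_dry = (1 - p_truth) ** n_yes   # every friend lies
    return prior * like_rain / (prior * like_rain + (1 - prior) * like_dry)

def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Assumed prior of 1/2; a different prior gives a different answer.
print(posterior_rain(Fraction(1, 2)))                             # 8/9
print(round(positive_predictive_value(0.997, 0.985, 0.001), 4))   # about 0.062
```

With a general prior p the Seattle posterior simplifies to 8p / (7p + 1). In the HIV example the low prevalence dominates: even with a highly accurate test, only about 6% of positives are true positives.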
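
The two-box card question, the dice question, and the influenza question above reduce to counting and inclusion-exclusion. A small sketch using exact fractions (helper names are illustrative, not from the article):

```python
from fractions import Fraction
from itertools import product
from math import comb

def p_same_color(n_black, n_red):
    """P(two cards drawn without replacement share a color)."""
    total = n_black + n_red
    return Fraction(comb(n_black, 2) + comb(n_red, 2), comb(total, 2))

def p_dice_sum(target):
    """P(two fair six-sided dice sum to target), by enumeration."""
    rolls = list(product(range(1, 7), repeat=2))
    hits = sum(1 for a, b in rolls if a + b == target)
    return Fraction(hits, len(rolls))

print(p_same_color(12, 12), p_same_color(24, 24))   # 11/23 23/47
print(p_dice_sum(4), p_dice_sum(8))                 # 1/12 5/36

# Inclusion-exclusion: P(M) = P(M or F) - P(F) + P(M and F)
p_mother = Fraction(17, 100) - Fraction(12, 100) + Fraction(6, 100)
print(p_mother)                                     # 11/100
```

For the boxes, the intuition is that after the first card is drawn, the chance the second matches is (n - 1) / (2n - 1) for n cards of each color. Removing one card matters less in the bigger box, so the probability is closer to 1/2.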
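
Several of the questions above (the biased coin, the bus station, and the two normal-distribution questions) ask for a probability under a named distribution. The following sketch computes them with the Python standard library only; the helper names are assumptions for illustration.

```python
from math import comb, exp, factorial
from statistics import NormalDist

def binom_tail(n, p, k_min):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

def poisson_cdf(k_max, mean):
    """P(X <= k_max) for X ~ Poisson(mean)."""
    return sum(exp(-mean) * mean**k / factorial(k) for k in range(k_max + 1))

p_heads = binom_tail(5, 0.8, 3)                 # three or more heads in five flips
p_bus = poisson_cdf(3, 2.5 * 4)                 # at most 3 arrivals in 4 hours (mean 10)
p_x = 1 - NormalDist(1020, 50).cdf(1200)        # P(X > 1200), z = 3.6
p_dbp = NormalDist(80, 10).cdf(70)              # P(DBP < 70), z = -1

print(round(p_heads, 5))   # 0.94208
print(round(p_bus, 4))     # about 0.0103
print(round(p_x, 6))       # about 0.000159
print(round(p_dbp, 4))     # about 0.1587
```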
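
The hospital infection question and the Scotland homicide question above both involve count data. One possible approach, sketched below, treats the counts as Poisson. This is a modelling assumption, and the normal approximation used for the homicide comparison is only one of several reasonable tests.

```python
from math import exp, factorial, sqrt
from statistics import NormalDist

def poisson_cdf(k_max, mean):
    """P(X <= k_max) for X ~ Poisson(mean)."""
    return sum(exp(-mean) * mean**k / factorial(k) for k in range(k_max + 1))

# Hospital: under H0 the rate is 1 per 100 person-days, so the expected
# count over 1787 person-days is 17.87. One-sided p-value = P(X <= 10).
p_hospital = poisson_cdf(10, 0.01 * 1787)
print(round(p_hospital, 3))   # about 0.032

# Homicides: compare two independent Poisson counts, 115 vs 99, with a
# normal approximation; the variance of the difference is 115 + 99.
z = (115 - 99) / sqrt(115 + 99)
p_homicide = 2 * (1 - NormalDist().cdf(z))   # two-sided
print(round(z, 2))            # about 1.09
```

Under these assumptions the hospital's count is significantly below the standard at the 5% level, while a drop from 115 to 99 homicides is well within ordinary year-to-year Poisson variation.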
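
For the polling question and the Geiger counter question above, a normal-approximation (Wald) interval is a common first answer. The sketch below assumes that approximation is adequate; the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)   # about 1.96

def wald_interval(successes, n):
    """Approximate 95% interval for a proportion."""
    p_hat = successes / n
    margin = Z95 * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def poisson_rate_interval(count, scale):
    """Approximate 95% interval for a Poisson count, rescaled by `scale`."""
    margin = Z95 * sqrt(count)       # Poisson variance equals the mean
    return (count - margin) * scale, (count + margin) * scale

poll_lo, poll_hi = wald_interval(60, 100)
print(round(poll_lo, 3), round(poll_hi, 3))      # about 0.504 0.696

# 100 decays in 5 minutes; one hour is 12 such periods.
decay_lo, decay_hi = poisson_rate_interval(100, 12)
print(round(decay_lo), round(decay_hi))          # about 965 1435
```

The poll's lower bound sits only just above 50%, so under this approximation the lead is real but narrow, and the result is sensitive to the interval method and to how honest the respondents were.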
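
The brain volume, diet pill, and 20-night triage questions above all use Student's t intervals. The sketch below hardcodes the t quantiles from a standard table (2.306 for 8 degrees of freedom and 2.101 for 18) because the Python standard library has no t distribution; those constants and the variable names are the only inputs not taken from the questions.

```python
from math import sqrt

T975_DF8 = 2.306    # 97.5th percentile of t with 8 df (table value)
T975_DF18 = 2.101   # 97.5th percentile of t with 18 df (table value)

# Brain volume: n = 9, mean 1100 cc, sd 30 cc.
margin = T975_DF8 * 30 / sqrt(9)
brain_ci = (1100 - margin, 1100 + margin)
print(brain_ci)     # about (1076.9, 1123.1)

# Diet pill: solve -2 + t * s / sqrt(9) = 0 for s.
sd_needed = 2 * sqrt(9) / T975_DF8
print(round(sd_needed, 2))   # about 2.6

# Triage, 10 nights per arm, pooled (constant) variance, New - Old.
n_new, n_old = 10, 10
pooled_var = ((n_new - 1) * 0.60 + (n_old - 1) * 0.68) / (n_new + n_old - 2)
se = sqrt(pooled_var * (1 / n_new + 1 / n_old))
diff = 3 - 5
triage_ci = (diff - T975_DF18 * se, diff + T975_DF18 * se)
print(triage_ci)    # about (-2.75, -1.25)
```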
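
For the 200-night triage question just above, the interval with unequal variances and the Z quantile can be sketched as follows (variable names are illustrative):

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)              # about 1.96

mean_new, sd_new, n_new = 4, 0.5, 100
mean_old, sd_old, n_old = 6, 2.0, 100

diff = mean_new - mean_old                   # New - Old
se = sqrt(sd_new**2 / n_new + sd_old**2 / n_old)
ci = (diff - z * se, diff + z * se)
print(ci)                                    # about (-2.40, -1.60)
```

The whole interval lies below zero, which is consistent with the hypothesis that the new system decreases the mean MWT.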

Process & Miscellaneous:

How to optimize algorithms? (parallel processing and/or faster algorithms). Provide examples for both
Examples of NoSQL architecture
Provide examples of machine-to-machine communications
Compare R and Python
Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?
What is star schema? Lookup tables?
What is the life cycle of a data science project?
How to efficiently scrape web data, or collect tons of tweets?
How to clean data?
How frequently must an algorithm be updated?
What is POC (proof of concept)?
Explain Tufte’s concept of “chart junk”
How would you come up with a solution to identify plagiarism?
How to detect individual paid accounts shared by multiple users?
Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy? Depends on the context?
What is your definition of big data?
Explain the difference between “long” and “wide” format data. Why would you use one or the other?
Do you know a few “rules of thumb” used in statistics or computer science? Or in business analytics?
Name a few famous APIs (for instance, Google Search).
Give examples of bad and good visualizations.
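
To make the “long” versus “wide” question above concrete, here is a minimal sketch that reshapes wide rows into long rows in plain Python. The data, the column names, and the `wide_to_long` helper are all made up for illustration.

```python
def wide_to_long(rows, id_key, value_keys, var_name="variable", value_name="value"):
    """Turn one row per subject into one row per (subject, variable) pair."""
    long_rows = []
    for row in rows:
        for key in value_keys:
            long_rows.append(
                {id_key: row[id_key], var_name: key, value_name: row[key]}
            )
    return long_rows

# Wide: one row per patient, one column per measurement occasion.
wide = [
    {"patient": "A", "week1": 80, "week2": 78},
    {"patient": "B", "week1": 91, "week2": 88},
]

# Long: one row per patient per occasion.
long_rows = wide_to_long(wide, "patient", ["week1", "week2"])
for row in long_rows:
    print(row)
```

Wide format tends to be easier to read and to feed into models that expect one row per subject; long format tends to suit grouped summaries, plotting, and repeated-measures models.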

Answers:
