Data Science Interview Questions Asked in Top Tech Companies
Ashish Patel
Sr AWS AI ML Solution Architect at IBM | Generative AI Expert Strategist | Author Hands-on Time Series Analytics with Python | IBM Quantum ML Certified | 12+ Years in AI | IIMA | 100k+Followers | 6x LinkedIn Top Voice |
Recent data from Glassdoor shows what major tech companies like to ask candidates during their recruitment interviews. The first takeaway is a disappointing one: according to the statistics, almost every company has its own style. Because Glassdoor accepts anonymous submissions, many candidates are willing to share interview questions from major companies such as Facebook, Google, and Microsoft. I have listed some of them here for your reference. If you want to switch careers and become a data scientist, there is also a practical guide (How to change your career to become a data scientist?).
APPLE
1. If you have millions of users, each user has hundreds of transactions, and these transactions exist in dozens of products. How do you segment these users into meaningful categories?
MICROSOFT
1. Describe a project you have worked on and its advantages.
2. How to deal with categorical features with high cardinality?
3. What would you do if you wanted to summarize your Twitter feed?
4. What are the steps to correct and clean the data before applying machine learning algorithms?
5. How to measure the distance between data points?
6. Please define variance.
7. Describe the differences between box plots and histograms, and their use cases.
1. What features would you use to build a recommendation algorithm for users?
UBER
1. Choose any product or application you like and describe how to improve it.
2. How to find anomalies in the distribution?
3. How to check if a trend in the distribution is due to anomalies?
4. How to estimate the impact of Uber on the traffic and driving environment?
5. What metrics would you consider to track the effectiveness of Uber's paid advertising strategy in attracting new users? Then, how do you want to estimate the ideal customer acquisition cost?
LINKEDIN
1. (To Big Data Engineers) please explain what REST is.
1. Why do you use feature selection?
2. If two predictors are highly correlated, what is their effect on the logistic regression coefficients? What is the confidence interval for the coefficients?
3. What is the difference between Gaussian Mixture Model and K-Means?
4. How to pick k in K-Means?
5. How do you know if a Gaussian mixture model is applicable?
6. Assuming the labels of the clustering model are known, how do you evaluate the performance of the model?
MICROSOFT
1. Which of your machine learning projects are you proud of?
2. Pick any machine learning algorithm and describe it.
3. Please explain how Gradient Boosting works.
4. (For data mining engineers) Please explain the decision tree model.
5. (To a data mining engineer) what is a neural network?
6. Please explain the Bias-Variance Tradeoff.
7. How to deal with imbalanced binary classification?
8. What is the difference between L1 and L2 regularization?
UBER
1. What features would you use to predict whether Uber drivers will accept order requests? Which supervised learning algorithm would you use to solve this problem, and how would you compare the results of the algorithms?
1. Identify and describe three different kernel functions. In which cases should each one be used?
2. Pick any machine learning method and explain it.
3. How to deal with sparse data?
IBM
1. How to prevent over-fitting?
2. How to deal with outliers in the data?
3. How do I evaluate the performance of logistic regression and simple linear regression models?
4. What is the difference between supervised and unsupervised learning?
5. What is cross-validation, and why should I use it?
6. What is the name of the matrix used to evaluate the predictive model?
7. What is the relationship between logistic regression coefficients and Odds Ratio?
8. What is the relationship between PCA and linear and quadratic discriminant analysis (LDA and QDA)?
9. If you have a categorical dependent variable and a mix of continuous and categorical independent variables, which algorithm, method, or tool will you use for analysis?
10. (To industry analysts) what is the difference between logistic and linear regression? How to avoid local minima?
SALESFORCE
1. What data and models do you use to measure loss/error? How do I test model performance?
2. Assuming I am a non-technical person, please explain to me a machine learning algorithm.
CAPITAL ONE (AN AMERICAN BANK)
1. How to build a model to predict credit card fraud?
2. How to deal with missing or corrupt data?
3. How to derive new features from existing features?
4. If you try to predict the gender of a customer, but only have 100 data points, what might go wrong?
5. With two years of transaction history, what characteristics can be used to predict credit risk?
6. Please design an artificial intelligence program for playing tic-tac-toe.
ZILLOW
1. Please explain overfitting and how to prevent overfitting.
2. Why does SVM need to maximize the margin between support vectors?
1. How to use Map / Reduce to divide extensive graphs into smaller blocks and calculate their edges in parallel based on the fast/dynamic changes of the data?
2. (To the data engineer) given a list of pairs: (123, 345), (234, 678), (345, 123), ... where the first column is the ID of the follower and the second column is the ID of the user being followed. Find all mutually following pairs (the pair in the example above is 123, 345). How can I use Map / Reduce to solve the problem when the list does not fit in memory?
CAPITAL ONE
1. (To the data engineer) what is Hadoop serialization?
2. Explain a simple Map / Reduce problem.
1. (To a data engineer) please write a Hive UDF that returns sentiment scores. For example, if good = 1, bad = -1, and average = 0, then a restaurant review such as "good food, poor service" may score 1 - 1 = 0.
CAPITAL ONE
1. (To data engineers) how does RDD work in Spark in Scala?
1. Assuming I am a non-technical person, please explain to me cross-validation.
2. Please describe a non-normal probability distribution, and then tell us how to apply it.
MICROSOFT
1. (For data mining) please explain what heteroscedasticity is and how to solve it.
1. Given Twitter user data, how do you measure engagement?
UBER
1. What is the difference between time series and prediction techniques?
2. Explain Principal Component Analysis (PCA) and the equations used by PCA.
3. How to solve Multicollinearity?
4. (To analysts) Please write an equation that optimizes our advertising spend on Twitter and Facebook.
1. What is the probability of drawing two cards in a deck?
IBM
1. What are p-values and confidence intervals?
CAPITAL ONE
1. (To a data analyst) if you have 70 red marbles and the ratio of green to red marbles is 2 to 7, how many green marbles are there?
2. What distribution does New York City's commute data look like?
3. For a fair die, which is most likely: at least one 6 in 6 throws, at least two 6s in 12 throws, or at least 100 6s in 600 throws?
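The dice question above can be checked numerically. The following is a minimal sketch, assuming the question compares three events for a fair six-sided die: at least one 6 in 6 throws, at least two 6s in 12 throws, and at least 100 6s in 600 throws. The function name `prob_at_least` is hypothetical, and exact fractions are used so that the comparison does not depend on floating-point rounding.

```python
from fractions import Fraction
from math import comb


def prob_at_least(m: int, n: int, sides: int = 6) -> Fraction:
    """Exact probability of at least m sixes in n throws of a fair die.

    Assumption: each throw is independent and every face is equally likely.
    """
    total = sides ** n  # number of equally likely outcomes
    # Count the outcomes with fewer than m sixes (binomial counting).
    fewer = sum(comb(n, k) * (sides - 1) ** (n - k) for k in range(m))
    return Fraction(total - fewer, total)


p_one_in_6 = prob_at_least(1, 6)
p_two_in_12 = prob_at_least(2, 12)
p_hundred_in_600 = prob_at_least(100, 600)

for label, p in [("1 in 6", p_one_in_6),
                 ("2 in 12", p_two_in_12),
                 ("100 in 600", p_hundred_in_600)]:
    print(f"at least {label}: {float(p):.4f}")
```

Under these assumptions the probabilities decrease as the number of throws grows, so the first event is the most likely of the three.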
PAYPAL
1. What is the Central Limit Theorem and how do I prove it? What is its application direction?
1. (To the data analyst) please write a program to determine the height of the binary tree.
MICROSOFT
1. Create a function to check if a word has a palindrome structure.
1. Please build a power set.
2. How do I find the median in a huge data set?
UBER
1. (To a data engineer) write a function to calculate the square root of a given number (to 2 decimal places of precision). Follow-up: avoid redundant calculations by using caching to optimize your function.
1. Given two binary strings, write a function that adds them together without using any built-in string-to-int conversion or parsing tools. For example, if you give the function the binary strings 100 and 111, it should return 1011. What is the space and time complexity of your solution?
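One possible approach to the square root question is Newton's method combined with memoization. This is only a sketch under stated assumptions: the function name `sqrt_2dp` is hypothetical, "2 decimal point precision" is read as rounding the result to two decimal places, and `functools.lru_cache` stands in for the caching step.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # repeated calls with the same input reuse the stored result
def sqrt_2dp(x: float) -> float:
    """Square root of x rounded to two decimal places, via Newton's method."""
    if x < 0:
        raise ValueError("square root of a negative number is not real")
    if x == 0:
        return 0.0
    # Start at or above the true root so the iteration decreases toward it.
    guess = x if x >= 1 else 1.0
    while True:
        nxt = 0.5 * (guess + x / guess)
        # Stop once successive estimates agree far beyond two decimals.
        if abs(nxt - guess) <= 1e-9 * max(1.0, nxt):
            return round(nxt, 2)
        guess = nxt


print(sqrt_2dp(2))   # computed
print(sqrt_2dp(2))   # served from the cache
print(sqrt_2dp.cache_info())
```

Bisection would work equally well here; Newton's method is chosen only because it converges in a handful of iterations.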
2. Write a function that takes two sorted lists and returns their union in the sorted list.
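The binary string question gives a concrete example (100 plus 111 should return 1011), which makes it easy to sketch. The version below is one possible answer, not the expected one; the function name `add_binary` is an assumption. It walks both strings from the right, carrying as in schoolbook addition, and never converts a string to an integer.

```python
def add_binary(a: str, b: str) -> str:
    """Add two binary strings digit by digit without int() or parsing helpers."""
    i, j = len(a) - 1, len(b) - 1
    carry = 0
    digits = []  # result digits, least significant first
    while i >= 0 or j >= 0 or carry:
        total = carry
        if i >= 0:
            total += 1 if a[i] == "1" else 0
            i -= 1
        if j >= 0:
            total += 1 if b[j] == "1" else 0
            j -= 1
        digits.append("1" if total % 2 else "0")
        carry = total // 2
    # Two empty inputs produce no digits; treat that case as zero.
    return "".join(reversed(digits)) or "0"


print(add_binary("100", "111"))
```

Each character is visited once, so the running time is O(n) and the extra space is O(n), where n is the length of the longer input.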
1. (To the data engineer) please write some code to determine whether the left and right parentheses in a string are balanced.
2. How to find the second largest element in a binary search tree?
3. Write a function that takes two sorted vectors and returns a sorted vector.
4. If you have a stream of input numbers, how do you find the most frequently occurring numbers during the run?
5. Write a function that raises one number to the power of another, just like the pow() function.
6. Split large strings into valid fields and store them in a dictionary. If the string cannot be split, return false. What is the complexity of your solution?
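For the balanced parentheses question in the list above, a single counter is enough when only round brackets are involved. This is a minimal sketch with a hypothetical function name `is_balanced`; handling several bracket types would need a stack instead.

```python
def is_balanced(text: str) -> bool:
    """Return True if every ')' closes an earlier '(' and none are left open."""
    depth = 0  # number of currently open parentheses
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closing parenthesis with nothing to close
                return False
    return depth == 0


print(is_balanced("(a(b)c)"))
print(is_balanced(")("))
```

The scan is O(n) in time and O(1) in extra space.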
CAPITAL ONE
1. (To a data engineer) how do you "break apart" two series (the reverse of what a JOIN does in SQL)?
2. Create a function that adds two numbers represented as linked lists.
3. Create a function that calculates a matrix.
4. How to use Python to read a huge tab-delimited number file to calculate how often each number appears?
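Since the last question in this list names Python explicitly, here is one hedged sketch of an answer. It assumes the file is plain text with numbers separated by tabs, and it streams the file line by line so that only the running counts are held in memory. The names `count_numbers` and `count_numbers_in_file` are hypothetical, and the numbers are counted as strings, so "1" and "1.0" would be treated as different values.

```python
from collections import Counter
from typing import Iterable


def count_numbers(lines: Iterable[str]) -> Counter:
    """Count how often each tab-separated field appears across the lines."""
    counts = Counter()
    for line in lines:
        fields = (field.strip() for field in line.split("\t"))
        counts.update(field for field in fields if field)  # skip empty fields
    return counts


def count_numbers_in_file(path: str) -> Counter:
    """Stream a file from disk; iterating a file object reads one line at a time."""
    with open(path, encoding="utf-8") as handle:
        return count_numbers(handle)


sample = ["1\t2\t2\n", "3\t1\n"]
print(count_numbers(sample).most_common())
```

If the number of distinct values is itself too large for memory, this approach would have to give way to an external sort or a distributed count.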
PAYPAL
1. Write a function that takes a sentence in O(n) time and prints it backward.
2. Write a function that picks from an array, divides them into two possible arrays, and prints the maximum difference between the two arrays (in O(n) time).
3. Write a program that performs merge sort.
MICROSOFT
1. (For data analysts) Define and explain the differences between clustered and nonclustered indexes.
2. (To a data analyst) what are the different ways to return the row count of a table?
1. (To a data engineer) given a raw data table, how do I perform an ETL (extract, transform, and load) using SQL to get the data in the required format?
2. How do I write a SQL query to calculate the frequency table for an attribute involving two joins? If you want ORDER BY or GROUP BY attributes, what changes do you need to make? How do you interpret NULL?
1. (To a data engineer) how can I improve the throughput of ETL (extract, transform, and load)?
1. Suppose you have 10 packs of pinball, each pack contains 10 pinballs. If one pack weighs differently than the others, but you can only weigh once, what should you do?
1. You are planning to fly to Seattle and want to know if you need to bring an umbrella, so you call three friends in Seattle. Every friend has a 2/3 chance to tell the truth and a 1/3 chance to lie to you. If they all say "it will rain," what is the probability that it will rain in Seattle?
2. If there is an ant on each of the three corners of an equilateral triangle, and each ant randomly chooses a direction and then walks straight along an edge to another corner, what is the probability that the three ants will not meet each other? If there are n ants on the corners of an n-sided polygon, what is the probability?
3. How many trailing zeros are there in the result of 100! (100 factorial)?
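The umbrella question in this list is a Bayes' rule exercise, but it does not state the prior probability of rain, so any numeric answer depends on an assumption. The sketch below makes that explicit: the function name `posterior_rain` and the `prior` argument are hypothetical, and the friends are assumed to answer independently of one another.

```python
from fractions import Fraction


def posterior_rain(prior: Fraction,
                   n_friends: int = 3,
                   p_truth: Fraction = Fraction(2, 3)) -> Fraction:
    """P(rain | all friends say it will rain), by Bayes' rule.

    Assumptions: friends answer independently, and `prior` is the
    probability of rain before calling anyone (not given in the question).
    """
    p_lie = 1 - p_truth
    rain_and_all_say_rain = prior * p_truth ** n_friends
    dry_and_all_say_rain = (1 - prior) * p_lie ** n_friends
    return rain_and_all_say_rain / (rain_and_all_say_rain + dry_and_all_say_rain)


print(posterior_rain(Fraction(1, 2)))  # prior of 1/2 is only an illustration
print(posterior_rain(Fraction(1, 4)))
```

With a prior of 1/2 the three answers push the probability up to 8/9. With a prior of 1/4 it is 8/11, which shows how much the result depends on the assumed prior.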
UBER
1. Imagine you work in a hospital. The frequency of patient visits conforms to a Poisson distribution, while the frequency of doctors' care of patients conforms to a uniform distribution. Please write a function or a piece of code to output the average waiting time of the patient and the doctor's participation on a specific day.
1. You are climbing an n-step staircase, and you can take anywhere from 1 to k steps at a time. How many different ways can you reach the top of the stairs? (This is a modified version of the stairs problem)
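The staircase question can be answered with a short dynamic program. This sketch assumes the question means that each move covers between 1 and k steps; the function name `count_ways` is hypothetical.

```python
def count_ways(n: int, k: int) -> int:
    """Number of ways to climb n steps when each move covers 1 to k steps."""
    if n < 0 or k < 1:
        raise ValueError("n must be non-negative and k must be at least 1")
    ways = [0] * (n + 1)
    ways[0] = 1  # one way to stand at the bottom: take no move
    for step in range(1, n + 1):
        # The last move could have covered 1, 2, ..., up to k steps.
        ways[step] = sum(ways[step - jump] for jump in range(1, min(k, step) + 1))
    return ways[n]


print(count_ways(4, 2))  # → 5
print(count_ways(3, 3))  # → 4
```

This runs in O(n * k) time. Keeping a sliding sum of the last k entries would bring it down to O(n).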