5 Tricky Data Science Interview Questions Asked by Top Companies
Shivam Modi
46K Followers | I help people build their AI & Data Science career | Founder & CEO - Learn Everything AI | IIT Bombay | Click "Follow" to learn AI & Data Science daily
1. You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?
There are two common ways to handle missing values. If the data set is large, we can simply remove the rows that contain missing values; this is the quickest approach, and the remaining rows are used to train the model. For smaller data sets, we can substitute each missing value with the mean of the observed values in the same column, using a pandas DataFrame in Python, for example df.fillna(df.mean()).
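Both options can be sketched in a few lines of pandas. The toy DataFrame and its column names below are hypothetical and only serve to show the calls; real data would need a closer look at why the values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0, np.nan],
    "income": [50.0, 62.0, np.nan, 58.0, 70.0],
})

# Share of missing values in each column (0.4 means 40 percent missing).
missing_share = df.isna().mean()

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: replace each missing value with the mean of its column.
imputed = df.fillna(df.mean())

print(missing_share)
print(dropped)
print(imputed)
```

Dropping rows keeps only fully observed records, while mean imputation keeps every row but shrinks the variance of the imputed columns, so the choice depends on how much data you can afford to lose.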
2. What is dimensionality reduction, and what are its benefits?
Dimensionality reduction refers to the process of converting a data set with a large number of dimensions (fields) into one with fewer dimensions that conveys similar information more concisely. This reduction compresses the data and reduces storage space. It also reduces computation time, since fewer dimensions mean less computing. Finally, it removes redundant features; for example, there is no point in storing the same value in two different units (meters and inches).
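The meters-and-inches example can be made concrete with principal component analysis (PCA). PCA is not named in the answer above; it is used here only as one common dimensionality reduction technique, and the synthetic data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the same height stored in two units, plus an unrelated weight.
meters = rng.uniform(1.0, 2.0, size=100)
inches = meters * 39.3701            # redundant copy of the same quantity
weight = rng.normal(70.0, 10.0, size=100)
X = np.column_stack([meters, inches, weight])

# Standardize each column so that the units do not dominate the result.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition of the standardized data.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)      # share of variance per component

# Project onto the first two principal components: 3 columns become 2.
Z_reduced = Z @ Vt[:2].T

print(explained)
print(Z_reduced.shape)
```

Because the inches column is an exact multiple of the meters column, the third component carries essentially no variance, so dropping it loses no information.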
3. How can you select k for k-means?
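We can use the elbow method to select k for k-means clustering. The idea is to run k-means on the data set for a range of values of k, where k is the number of clusters, and to compute the within-cluster sum of squares (WSS) for each run. WSS is defined as the sum of the squared distances between each member of a cluster and its centroid. When WSS is plotted against k, the curve bends like an elbow at the point where adding more clusters stops reducing WSS substantially, and that value of k is the one to choose.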
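The elbow method can be sketched with a small NumPy implementation of k-means. The function name kmeans_wss, the three synthetic blobs, and the farthest-point initialization are all assumptions made to keep the example short and deterministic; in practice a library implementation would normally be used.

```python
import numpy as np

def kmeans_wss(X, k, n_iter=100):
    """Run a basic k-means and return the within-cluster sum of squares."""
    # Farthest-point initialization (a simplification): start from the first
    # point, then repeatedly add the point farthest from the chosen centers.
    centers = [X[0]]
    for _ in range(1, k):
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its members.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:      # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    # WSS: squared distance from every point to its nearest centroid, summed.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Synthetic data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([c + rng.normal(0.0, 0.5, size=(50, 2)) for c in blob_centers])

wss = {k: kmeans_wss(X, k) for k in range(1, 7)}
for k, value in wss.items():
    print(k, round(value, 1))
```

On data like this, WSS falls sharply up to k = 3 and only slightly afterwards, so the elbow sits at three clusters, matching how the data was generated.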
4. What is the significance of p-value?
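The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. A p-value ≤ 0.05 (the typical cutoff) indicates strong evidence against the null hypothesis, so you reject the null hypothesis. A p-value > 0.05 indicates weak evidence against the null hypothesis, so you fail to reject it; this is not the same as proving the null hypothesis true. A p-value very close to the 0.05 cutoff is considered marginal, meaning it could go either way.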
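As a worked illustration, the sketch below computes an exact two-sided p-value for a coin-toss experiment using only the standard library. The scenario (60 heads in 100 tosses), the function name two_sided_binomial_p, and the null hypothesis of a fair coin are assumptions chosen for the example.

```python
from math import comb

def two_sided_binomial_p(heads, tosses):
    """Exact two-sided p-value for H0: the coin is fair (p = 0.5)."""
    # How far the observed count is from the count expected under H0.
    deviation = abs(heads - tosses / 2)
    # Count all outcomes at least that far from the expected count.
    extreme = sum(
        comb(tosses, k)
        for k in range(tosses + 1)
        if abs(k - tosses / 2) >= deviation
    )
    # Under a fair coin every sequence of tosses has probability 1 / 2**tosses.
    return extreme / 2 ** tosses

alpha = 0.05
p_value = two_sided_binomial_p(60, 100)
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(round(p_value, 4), decision)
```

Here the p-value comes out at about 0.057, just above the cutoff, so the null hypothesis is not rejected even though 60 heads looks suspicious; this is the marginal case described above.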
5. You are given a dataset on cancer detection. You have built a classification model and achieved an accuracy of 96 percent. Why shouldn't you be happy with your model's performance? What can you do about it?
Cancer detection data is typically imbalanced: far more patients are healthy than have cancer. In an imbalanced dataset, accuracy should not be used as the measure of performance, because a model can score highly just by predicting the majority class. It is important to focus on the remaining four percent, which represents the patients who were wrongly diagnosed. Early diagnosis is crucial when it comes to cancer detection, and can greatly improve a patient's prognosis. Hence, to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), and the F measure to determine the class-wise performance of the classifier.
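A small pure-Python sketch shows how a useless model can still reach 96 percent accuracy. The numbers are hypothetical: 1,000 patients of whom 40 have cancer, and a model that predicts "no cancer" for everyone. The function name class_metrics is also made up for this example.

```python
def class_metrics(y_true, y_pred):
    """Return accuracy, sensitivity, specificity and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Ratios are set to 0.0 when their denominator is zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical data: 40 cancer cases (label 1) among 1,000 patients.
y_true = [1] * 40 + [0] * 960
# A trivial model that always predicts "no cancer".
y_pred = [0] * 1000

metrics = class_metrics(y_true, y_pred)
print(metrics)
```

The accuracy is 0.96, yet sensitivity and F1 are both 0.0 because not a single cancer case is detected, which is exactly why accuracy alone is misleading here.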