Machine Learning Interview series - tackling questions on classification algorithms.
Classification algorithms are applied to a wide variety of real-world problems. In interviews, candidates are often given a problem and asked how a classification algorithm could be used to solve it.
Interviewers then often dive deeper into these algorithms: how they are applied, what their pros, cons and limitations are, and how some of them work under the hood.
Review some of these interview questions here: https://machinelearningfaq.com/classification/
Multiple algorithms exist that can help ML practitioners solve different types of problems. These include logistic regression, k-nearest neighbors, Naive Bayes, tree-based algorithms, support vector machines, etc.
Classification problems themselves fall into 4 major categories: A) Binary, the most common, where one wants to predict whether an event will occur or not; B) Multi-class, where each event or input receives exactly one of several categorical labels; C) Multi-label, where an input can belong to more than one class at once, for example a student who minors in both Chemistry and Math; and D) Imbalanced, where examples are unequally distributed across the classes, for example credit card fraud, where 99.99% of transactions are not fraudulent and only the remaining 0.01% are.
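As a rough illustration of how the labels differ across these categories, here is a minimal sketch. All of the data and variable names are made up for this example; the indicator encoding shown for multi-label data is one common choice, not the only one.

```python
from collections import Counter

# Hypothetical toy labels, for illustration only.
binary_labels = [0, 1, 0, 0, 1]                    # event occurs (1) or not (0)
multiclass_labels = ["cat", "dog", "bird", "dog"]  # exactly one class per input

# Multi-label: each input can carry several labels at once, often encoded
# as one 0/1 indicator column per possible label.
subjects = ["Chemistry", "Math", "Physics"]
student_minors = [{"Chemistry", "Math"}, {"Physics"}, set()]
multilabel_indicator = [
    [int(subject in minors) for subject in subjects] for minors in student_minors
]

# Imbalanced: the label type is unchanged, but the class counts are very unequal.
imbalanced_labels = [0] * 9999 + [1]
class_counts = Counter(imbalanced_labels)

print(multilabel_indicator)              # [[1, 1, 0], [0, 0, 1], [0, 0, 0]]
print(class_counts[0], class_counts[1])  # 9999 1
```

The first student's row has two 1s, which is what separates multi-label from multi-class, where every row would contain exactly one.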
Interview questions are often broad, such as "What is a Naive Bayes classifier?" or "Describe KNN for classification. What are its limitations?". They can also begin with a case study, such as "How would you predict whether a particular employee will leave the job or stay within a year?", and sometimes they just test simpler concepts, such as "What are type 1 and type 2 errors?".
One important question that comes up in many interviews is "How do you deal with class imbalance?"
Many more questions can be reviewed at https://machinelearningfaq.com/classification/ while imbalanced classes are described in some detail below.
Imbalanced Classes
Class imbalance means that one class in the dataset has far more samples than the other classes. This is very common in classification applications, e.g. credit card fraud detection. Say, for example, that 99% of credit card transactions are not fraudulent and 1% are. In such situations general classification metrics such as accuracy are not very useful: a predictor that always predicts "not fraud" will have 99% accuracy while catching no fraud at all! Would that be useful? No! Not at all!
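The "always not fraud" argument is easy to check with a few lines of arithmetic. This is a minimal sketch on made-up labels (10,000 transactions, 1% fraud); the variable names are illustrative only.

```python
# Hypothetical data: 10,000 transactions, 1% of them fraudulent (label 1).
n_total = 10_000
n_fraud = 100
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)

# A "model" that always predicts "not fraud" (label 0).
y_pred = [0] * n_total

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

accuracy = correct / n_total        # share of all predictions that are right
recall = true_positives / n_fraud   # share of fraud cases that are caught

print(f"Accuracy = {accuracy:.2f}")  # Accuracy = 0.99
print(f"Recall   = {recall:.2f}")    # Recall   = 0.00
```

Accuracy looks excellent, yet recall on the fraud class is zero, which is why metrics such as precision, recall and F1 are usually reported for imbalanced problems.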
There are a few ways to handle class imbalance.
- Under-sampling the majority class: we keep only as many majority-class samples as there are minority-class samples. The disadvantage is that we drop a lot of data.
- Over-sampling the minority class: we sample the minority-class examples with replacement until both classes are of equal size. The disadvantage is that the model can overfit the duplicated minority-class samples.
- Another popular technique is SMOTE (Synthetic Minority Oversampling Technique): new synthetic minority-class data points are generated by interpolating between an existing minority sample and one of its nearest minority-class neighbors, so that they have similar characteristics to the original data. It is widely used for imbalanced datasets.
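The three ideas above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the imbalanced-learn implementation: the toy data, the function names (`undersample`, `oversample`, `smote_like`) and the choice of `k=3` are all hypothetical.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy data: 20 majority points and 4 minority points in 2-D.
majority = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(20)]
minority = [(rng.gauss(3.0, 0.5), rng.gauss(3.0, 0.5)) for _ in range(4)]


def undersample(majority_points, n_keep):
    """Keep a random subset of the majority class (without replacement)."""
    return rng.sample(majority_points, n_keep)


def oversample(minority_points, n_target):
    """Grow the minority class to n_target by adding duplicates drawn with replacement."""
    extra = [rng.choice(minority_points) for _ in range(n_target - len(minority_points))]
    return minority_points + extra


def smote_like(minority_points, n_new, k=3):
    """Create n_new synthetic points on segments between minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_points)
        others = [p for p in minority_points if p is not base]
        # The k nearest minority-class neighbours of the chosen base point.
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


under = undersample(majority, len(minority))
over = oversample(minority, len(majority))
synthetic = smote_like(minority, len(majority) - len(minority))

print(len(under), len(over), len(minority) + len(synthetic))  # 4 20 20
```

Random over-sampling only repeats existing points, whereas the SMOTE-style step produces new points that lie between existing minority samples, which is the usual argument for why it is less prone to overfitting exact duplicates.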
SMOTE
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
%matplotlib inline
# Create an imbalanced dataset
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.999], flip_y=0,
                           random_state=1)
def plotData(X, y):
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    plt.figure(figsize=(20, 10))
    # Draw one scatter series per class label
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1], alpha=0.8,
                    c=colors[idx], marker=markers[idx], label=cl)
    plt.xlabel('Feature 1', fontsize=20)
    plt.ylabel('Feature 2', fontsize=20)
    plt.legend(fontsize=20)
    plt.title('Imbalanced class', fontsize=20)

plotData(X, y)
Let us apply SMOTE:
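from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (ConfusionMatrixDisplay, accuracy_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Over-sample the minority class until both classes are the same size
smote = SMOTE()
X_smote, y_smote = smote.fit_resample(X, y)

# Count the number of instances per class after resampling
print(f"class counts {np.bincount(y_smote)}")
plotData(X_smote, y_smote)

# pipe_lr is not defined at this point in the post; it is assumed here to be
# a StandardScaler + LogisticRegression pipeline
pipe_lr = make_pipeline(StandardScaler(), LogisticRegression())
pipe_lr.fit(X_smote, y_smote)
print(f"Accuracy = {pipe_lr.score(X_smote, y_smote)}")  # on the resampled data

# Evaluate on the original, imbalanced data
y_smote_pred = pipe_lr.predict(X)
confusion_matrix_test = confusion_matrix(y, y_smote_pred)
print("Confusion Matrix =", confusion_matrix_test)
# plot_confusion_matrix was removed in scikit-learn 1.2; this call replaces it
ConfusionMatrixDisplay.from_estimator(pipe_lr, X, y)
print(f"Accuracy = {accuracy_score(y, y_smote_pred):.5}")
print(f"Precision = {precision_score(y, y_smote_pred):.5}")
print(f"Recall = {recall_score(y, y_smote_pred):.5}")
print(f"F1 Score = {f1_score(y, y_smote_pred):.5}")

Output:
class counts [99900 99900]
Accuracy = 0.9724824824824825
Confusion Matrix = [[99892     8]
 [    7    93]]
Accuracy = 0.99985
Precision = 0.92079
Recall = 0.93
F1 Score = 0.92537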
All of the metrics look good: accuracy, precision, recall, and F1 score.
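The reported scores can be checked by hand from the confusion matrix printed above. The sketch below assumes the scikit-learn layout, where rows are true classes and columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]]; the variable names are illustrative.

```python
# Counts taken from the confusion matrix [[99892, 8], [7, 93]].
tn, fp = 99892, 8   # true class 0: correctly passed / falsely flagged
fn, tp = 7, 93      # true class 1: missed / correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of everything flagged, how much was really class 1
recall = tp / (tp + fn)      # of all class 1 cases, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy = {accuracy:.5}")    # Accuracy = 0.99985
print(f"Precision = {precision:.5}")  # Precision = 0.92079
print(f"Recall = {recall:.5}")        # Recall = 0.93
print(f"F1 Score = {f1:.5}")          # F1 Score = 0.92537
```

Only 15 of the 100,000 predictions are wrong, which is why accuracy sits so close to 1 while precision and recall tell a more informative story about the minority class.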
What would we get had we applied logistic regression to the original data, without SMOTE?
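# Apply logistic regression to the original (imbalanced) data and see the results
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# The next four lines are missing from the listing; they are an assumed
# reconstruction based on the imports and on how pipe_lr and y_pred are used
pipe_lr = make_pipeline(StandardScaler(), LogisticRegression())
pipe_lr.fit(X, y)
y_pred = pipe_lr.predict(X)
print(f"Accuracy = {pipe_lr.score(X, y):.5}")

Accuracy = 0.99983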
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

print(f"Accuracy = {accuracy_score(y, y_pred):.5}")
print(f"Precision = {precision_score(y, y_pred):.5}")
print(f"Recall = {recall_score(y, y_pred):.5}")
print(f"F1 Score = {f1_score(y, y_pred):.5}")

Accuracy = 0.99983
Precision = 1.0
Recall = 0.83
F1 Score = 0.9071
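Note here that while accuracy and precision are high, the recall is noticeably lower, at 0.83, compared with 0.93 when SMOTE was applied.
With SMOTE, recall rises from 0.83 to 0.93 and the F1 score from 0.907 to 0.925, at the cost of precision dropping from 1.0 to 0.92, so SMOTE gives the better trade-off in this case. Keep in mind that both models are scored on the same data they were fit on, so a held-out test set would give a more reliable comparison.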
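To make the trade-off concrete, the two runs can be compared in terms of missed frauds and false alarms. This is a small sketch using the numbers reported in this post. The counts for the plain logistic regression run are not printed anywhere, so they are inferred here (an assumption): precision 1.0 implies no false alarms, and recall 0.83 on the 100 minority samples implies 83 caught.

```python
n_fraud = 100  # minority-class samples among the 100000 generated points

runs = {
    # Inferred from precision 1.0 and recall 0.83 (assumption, see above).
    "logistic regression": {"caught": 83, "false_alarms": 0},
    # Read directly from the confusion matrix of the SMOTE run.
    "SMOTE + logistic regression": {"caught": 93, "false_alarms": 8},
}

f1_scores = {}
for name, run in runs.items():
    missed = n_fraud - run["caught"]
    # F1 expressed with counts: 2*TP / (2*TP + FP + FN)
    f1_scores[name] = 2 * run["caught"] / (2 * run["caught"] + run["false_alarms"] + missed)
    print(f"{name}: missed = {missed}, false alarms = {run['false_alarms']}, "
          f"F1 = {f1_scores[name]:.5}")

extra_caught = runs["SMOTE + logistic regression"]["caught"] - runs["logistic regression"]["caught"]
extra_alarms = (runs["SMOTE + logistic regression"]["false_alarms"]
                - runs["logistic regression"]["false_alarms"])
print(f"SMOTE catches {extra_caught} more frauds for {extra_alarms} extra false alarms")
```

Whether 10 additional catches are worth 8 additional false alarms depends on the relative cost of the two kinds of error, which is usually the follow-up question in an interview.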
Hope this post helps some candidates. Please review more questions at: https://machinelearningfaq.com/classification/
Stay tuned for more!