Machine Learning Interview series - tackling questions on classification algorithms.

Classification algorithms are applied to a wide variety of real-world problems. In interviews, candidates are often asked about problems that can be solved with a classification algorithm.

Interviewers then often deep-dive into these algorithms to find out how they are applied, what their pros and cons are, what their limitations are, and how some of them work under the hood.

Review some of these interview questions here: https://machinelearningfaq.com/classification/

Multiple algorithms can help ML practitioners solve different types of classification problems, including logistic regression, k-nearest neighbors, Naive Bayes, tree-based algorithms, and support vector machines.

The classification problems themselves can be grouped into four major categories:

  A) Binary: one of the most common settings, where one wants to predict whether an event will occur or not.
  B) Multi-class: where one wants to assign exactly one categorical label to the event or input.
  C) Multi-label: where an input can belong to more than one class; for example, a student could have minors in both Chemistry and Math.
  D) Imbalanced: where examples are unequally distributed across classes; for example, credit card fraud detection, where 99.99% of transactions are not fraudulent and the rest are.
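
As a quick sketch of how the targets differ across these categories (the labels and variable names below are hypothetical, made up purely for illustration; scikit-learn is used only for the multi-label encoding):

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Binary: exactly one of two labels per sample (e.g., event occurs vs. not)
y_binary = np.array([0, 1, 0, 0, 1])

# Multi-class: exactly one of several labels per sample
y_multiclass = np.array(["math", "chemistry", "physics", "math"])

# Multi-label: each sample may carry several labels (e.g., minors in Chemistry and Math)
student_minors = [["chemistry", "math"], ["math"], [], ["physics"]]
y_multilabel = MultiLabelBinarizer().fit_transform(student_minors)
print(y_multilabel)  # one indicator column per label; multiple 1s allowed per row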

Interview questions can be broad, such as "What is a Naive Bayes classifier?" or "Describe KNN as a classification algorithm. What are its limitations?"; they can begin with a case study, such as "How would you predict whether a particular employee will leave the job or stay within a year?"; and they can sometimes test simpler concepts, such as "What are Type I and Type II errors?".

One important question asked many times in interviews is "How do you deal with class imbalance?"

While many more questions can be reviewed at https://machinelearningfaq.com/classification/, a somewhat detailed description of imbalanced classes is given below.

Imbalanced Classes

Class imbalance arises when one class in the dataset has far more samples than the other classes. This is very common in many classification applications, e.g., credit card fraud detection. Say, for example, 99% of credit card transactions are not fraudulent, but 1% are. In such situations the usual classification metrics are not very useful: if there is a 99% chance that a transaction is not fraudulent, then a predictor that always outputs "not fraud" will have 99% accuracy! Would that be of any use? No! Not at all!
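
To make this concrete, here is a minimal sketch (assuming scikit-learn is available; the toy dataset and variable names are made up for illustration) of a baseline that always predicts the majority class:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy dataset: roughly 99% "not fraud" (class 0) and 1% "fraud" (class 1)
X_toy, y_toy = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# Baseline that always predicts the most frequent class ("not fraud")
always_not_fraud = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_baseline = always_not_fraud.predict(X_toy)

print(f"Accuracy = {accuracy_score(y_toy, y_baseline):.3f}")  # ~0.99, looks great
print(f"Recall   = {recall_score(y_toy, y_baseline):.3f}")    # 0.0 -- catches no fraud at all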

There are a few ways to handle class imbalance:

  1. Under-sampling the majority class: keep only as many majority-class samples as there are minority-class samples. The disadvantage is that we drop a lot of data.
  2. Over-sampling the minority class: sample with replacement from the minority class until both classes are the same size. The disadvantage is that we can badly overfit the minority-class samples. (A sketch of both random resampling approaches follows this list.)
  3. Another popular technique is SMOTE (Synthetic Minority Oversampling Technique): new synthetic minority-class data points are generated so that they have characteristics similar to the original data. It is widely used for imbalanced datasets and is applied in the code below.
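
Below is a minimal sketch of the first two options, assuming the imbalanced-learn package (which also provides SMOTE) is installed; the toy dataset and variable names are made up for illustration:

import numpy as np
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

# Toy imbalanced dataset: roughly 99% majority class, 1% minority class
X_toy, y_toy = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)
print(np.bincount(y_toy))    # heavily imbalanced class counts

# 1. Under-sample the majority class down to the size of the minority class
X_under, y_under = RandomUnderSampler(random_state=1).fit_resample(X_toy, y_toy)
print(np.bincount(y_under))  # both classes now at the minority-class size

# 2. Over-sample the minority class (with replacement) up to the size of the majority class
X_over, y_over = RandomOverSampler(random_state=1).fit_resample(X_toy, y_toy)
print(np.bincount(y_over))   # both classes now at the majority-class size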

SMOTE

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

%matplotlib inline


#Create an imbalanced dataset
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=1)


def plotData(X, y):
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    plt.figure(figsize=(20,10))
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], 
                    y=X[y == cl, 1],
                    alpha=0.8, 
                    c=colors[idx],
                    marker=markers[idx], 
                    label=cl)
    plt.xlabel('Feature 1', fontsize=20)
    plt.ylabel('Feature 2', fontsize=20)
    plt.legend(fontsize=20)
    plt.title('Imbalanced class', fontsize=20)
plotData(X, y)
[Scatter plot: Feature 1 vs. Feature 2, showing the heavily imbalanced classes]

Let us apply SMOTE:

from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_smote, y_smote = smote.fit_resample(X, y)

# Let's count the number of instances in each class after SMOTE
print(f"class counts {np.bincount(y_smote)}")
plotData(X_smote, y_smote)

# Fit a logistic regression pipeline on the SMOTE-resampled data
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Logistic regression pipeline with feature scaling
pipe_lr = make_pipeline(StandardScaler(), LogisticRegression())
pipe_lr.fit(X_smote, y_smote)
print(f"Accuracy = {pipe_lr.score(X_smote, y_smote)}")

# Predict on the original (imbalanced) data and evaluate
y_smote_pred = pipe_lr.predict(X)

confusion_matrix_test = confusion_matrix(y, y_smote_pred)
print("Confusion Matrix =", confusion_matrix_test)
ConfusionMatrixDisplay.from_estimator(pipe_lr, X, y)

print(f"Accuracy = {accuracy_score(y, y_smote_pred):.5}")
print(f"Precision = {precision_score(y, y_smote_pred):.5}")
print(f"Recall = {recall_score(y, y_smote_pred):.5}")
print(f"F1 Score = {f1_score(y, y_smote_pred):.5}")
class counts [99900 99900]
Accuracy = 0.9724824824824825
Confusion Matrix = [[99892     8]
 [    7    93]]
Accuracy = 0.99985
Precision = 0.92079
Recall = 0.93
F1 Score = 0.92537

All of the metrics look good: accuracy, precision, recall, and F1 score.


Had we applied logistic regression directly on the original imbalanced data, without SMOTE, what would we get?

# Let's apply logistic regression on the imbalanced data and see the results
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Same pipeline as above, but trained on the original imbalanced data (no SMOTE)
pipe_lr = make_pipeline(StandardScaler(), LogisticRegression())
pipe_lr.fit(X, y)
y_pred = pipe_lr.predict(X)

print(f"Accuracy = {accuracy_score(y, y_pred):.5}")
print(f"Precision = {precision_score(y, y_pred):.5}")
print(f"Recall = {recall_score(y, y_pred):.5}")
print(f"F1 Score = {f1_score(y, y_pred):.5}")
Accuracy = 0.99983
Precision = 1.0
Recall = 0.83
F1 Score = 0.9071

Note here that while accuracy and precision are high, the recall is noticeably lower, at 0.83 (versus 0.93 with SMOTE).

This shows that SMOTE works better in this case: it trades a little precision (0.92 vs. 1.0) for a meaningful gain in recall (0.93 vs. 0.83).

Hope this post helps some candidates. Please review more questions at: https://machinelearningfaq.com/classification/

Stay tuned for more!


