
Machine Learning Interview series - tackling questions on classification algorithms.

Rohit Malshe

Principal Research Scientist

Published: January 17, 2021

Classification algorithms are applied to a wide variety of real-world problems. In interviews, candidates are often asked to solve problems where a classification algorithm can be used.

Interviewers then often dive deep into these algorithms to find out how they are applied, what their pros, cons, and limitations are, and how some of them work under the hood.

Review some of these interview questions here: https://machinelearningfaq.com/classification/

Many algorithms exist that help ML practitioners solve different types of problems, including logistic regression, k-nearest neighbors, Naive Bayes, tree-based algorithms, and support vector machines.

Classification problems themselves can be grouped into four major categories: A) Binary, the most common kind, where one wants to predict whether an event will occur or not; B) Multi-class, where one wants to assign exactly one of several categorical labels to the event or input; C) Multi-label, where an input can belong to more than one class, for example a student who has minors in both Chemistry and Math; and D) Imbalanced, where examples are unequally distributed across classes, for example credit card transactions, where 99.99% are legitimate and only the remaining 0.01% are fraudulent.
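To make the difference between multi-class and multi-label targets concrete, here is a minimal sketch in plain Python. The subject vocabulary, the sample students, and the helper `to_indicator` are all made up for illustration and do not come from any library.

```python
# Hypothetical example: encoding multi-class vs multi-label targets.
SUBJECTS = ["Chemistry", "Math", "Physics"]  # assumed label vocabulary

def to_indicator(labels, vocabulary):
    """Return a 0/1 list with one slot per label in the vocabulary."""
    return [1 if name in labels else 0 for name in vocabulary]

# Multi-class: each student has exactly one major, so one label id per row.
majors = ["Math", "Physics", "Chemistry"]
major_ids = [SUBJECTS.index(m) for m in majors]

# Multi-label: each student may have zero, one, or several minors,
# so each row is a vector of independent yes/no flags.
minors = [{"Chemistry", "Math"}, {"Physics"}, set()]
minor_matrix = [to_indicator(m, SUBJECTS) for m in minors]

print(major_ids)     # [1, 2, 0]
print(minor_matrix)  # [[1, 1, 0], [0, 0, 1], [0, 0, 0]]
```

A multi-class model predicts one id per row, while a multi-label model predicts one flag per column.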

Interview questions can be broad, such as "What is a Naive Bayes classifier?" or "Describe KNN for classification. What are its limitations?" They can also begin with a case study, such as "How would you predict whether a particular employee will leave or stay within a year?", and sometimes they simply probe basic concepts, such as "What are type 1 and type 2 errors?"

One important question that comes up in many interviews is "How do you deal with class imbalance?"

While many more questions can be reviewed at https://machinelearningfaq.com/classification/, a somewhat detailed description of imbalanced classes is given below.
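For the type 1 and type 2 error question, a small sketch can help: a type 1 error is a false positive (predicting 1 when the truth is 0) and a type 2 error is a false negative (predicting 0 when the truth is 1). The helper `error_counts` and the toy labels below are made up for illustration.

```python
def error_counts(y_true, y_pred):
    """Count Type I (false positive) and Type II (false negative) errors."""
    type_1 = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    type_2 = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return type_1, type_2

# Toy labels, invented for this example: 1 = event occurred, 0 = it did not.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 0]

type_1, type_2 = error_counts(y_true, y_pred)
print(type_1, type_2)  # 1 2
```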

Imbalanced Classes

Class imbalance means that one class in the dataset has far more samples than the others. This is very common in classification applications such as credit card fraud detection. Say, for example, that 99% of credit card transactions are not fraudulent and 1% are. In such situations general classification metrics like accuracy are not very useful: a predictor that always predicts "not fraud" will have 99% accuracy! Would that be of any use? No! Not at all!
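The accuracy trap described above can be reproduced in a few lines of plain Python. The 1,000 transactions here are invented for illustration, with the same 99% to 1% split.

```python
# 1,000 hypothetical transactions: 990 legitimate (0) and 10 fraudulent (1).
y_true = [0] * 990 + [1] * 10
y_pred = [0] * len(y_true)  # a "model" that always predicts "not fraud"

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: the share of actual fraud cases that the model catches.
true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_pos / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.99
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The accuracy looks excellent, yet the recall shows that not a single fraudulent transaction is detected.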

There are a few ways to handle class imbalance.

Under-sampling of the majority class: In this method we pick only as many samples of the majority class as there are samples of the minority class. The disadvantage is that we drop a lot of data.
Over-sampling of the minority class: Here we sample the minority class examples with replacement until both classes are of equal size. The disadvantage is that the model can badly overfit the duplicated minority class samples.
SMOTE (Synthetic Minority Oversampling Technique): In this technique new synthetic minority class data points are generated by interpolating between existing minority samples and their nearest minority-class neighbors, so that they have characteristics similar to the original data. It's a very popular technique and is widely used for imbalanced datasets.
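The first two options, random under-sampling and random over-sampling, can be sketched with NumPy alone. This is a minimal illustration on made-up toy data (the names `X_toy` and `y_toy` are assumptions of this sketch), not a replacement for a dedicated resampling library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data, invented for this sketch: 95 majority (0) and 5 minority (1) rows.
X_toy = rng.normal(size=(100, 2))
y_toy = np.array([0] * 95 + [1] * 5)

maj_idx = np.flatnonzero(y_toy == 0)
min_idx = np.flatnonzero(y_toy == 1)

# Under-sampling: keep only as many majority rows as there are minority rows.
keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
under_idx = np.concatenate([keep_maj, min_idx])
X_under, y_under = X_toy[under_idx], y_toy[under_idx]

# Over-sampling: draw minority rows with replacement until the classes match.
extra_min = rng.choice(min_idx, size=len(maj_idx), replace=True)
over_idx = np.concatenate([maj_idx, extra_min])
X_over, y_over = X_toy[over_idx], y_toy[over_idx]

print(np.bincount(y_under))  # [5 5]
print(np.bincount(y_over))   # [95 95]
```

Under-sampling throws away 90 of the 95 majority rows, while over-sampling repeats the same 5 minority rows many times, which is where the overfitting risk comes from.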

SMOTE
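Before the full worked example, the interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration and not the imblearn implementation; the function name `smote_sketch` and its parameters `n_new`, `k`, and `seed` are made up for this sketch, and it assumes `k` is smaller than the number of minority points.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all minority points.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest per point

    base = rng.integers(0, len(X_min), size=n_new)  # random minority points
    pick = rng.integers(0, k, size=n_new)           # random neighbour slot
    nn = neighbours[base, pick]
    gap = rng.random((n_new, 1))                    # factor in [0, 1)
    # Each new point lies on the segment between a point and its neighbour.
    return X_min[base] + gap * (X_min[nn] - X_min[base])

# Four hypothetical minority points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_min, n_new=6)
print(X_new.shape)  # (6, 2)
```

Because every synthetic point sits between two real minority points, the new samples stay inside the region the minority class already occupies instead of being exact duplicates.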

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

%matplotlib inline

#Create an imbalanced dataset
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.999], flip_y=0,
                           random_state=1)

def plotData(X, y):
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    plt.figure(figsize=(20,10))
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], 
                    y=X[y == cl, 1],
                    alpha=0.8, 
                    c=colors[idx],
                    marker=markers[idx], 
                    label=cl)
    plt.xlabel('Feature 1', fontsize=20)
    plt.ylabel('Feature 2', fontsize=20)
    plt.legend(fontsize=20)
    plt.title('Imbalanced class', fontsize=20)
plotData(X, y)

Let us apply SMOTE:

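from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (ConfusionMatrixDisplay, accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

smote = SMOTE()
X_smote, y_smote = smote.fit_resample(X, y)

# Let's count the number of instances per class after resampling
print(f"class counts {np.bincount(y_smote)}")
plotData(X_smote, y_smote)

# Assumption: pipe_lr is not defined in the text at this point; a StandardScaler +
# LogisticRegression pipeline is assumed, based on the imports used further down.
pipe_lr = make_pipeline(StandardScaler(), LogisticRegression())

# Fit on the SMOTE-balanced data; this first accuracy is measured on the resampled set
pipe_lr.fit(X_smote, y_smote)
print(f"Accuracy = {pipe_lr.score(X_smote, y_smote)}")

# Evaluate on the original, imbalanced data
y_smote_pred = pipe_lr.predict(X)

confusion_matrix_test = confusion_matrix(y, y_smote_pred)
print("Confusion Matrix =", confusion_matrix_test)
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(pipe_lr, X, y)

print(f"Accuracy = {accuracy_score(y, y_smote_pred):.5}")
print(f"Precision = {precision_score(y, y_smote_pred):.5}")
print(f"Recall = {recall_score(y, y_smote_pred):.5}")
print(f"F1 Score = {f1_score(y, y_smote_pred):.5}")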
class counts [99900 99900]
Accuracy = 0.9724824824824825
Confusion Matrix = [[99892     8]
 [    7    93]]
Accuracy = 0.99985
Precision = 0.92079
Recall = 0.93
F1 Score = 0.92537

All of the metrics look good: accuracy, precision, recall, and F1 score.
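As a sanity check, the four metrics can be recomputed by hand from the confusion matrix printed above, which scikit-learn lays out as [[TN, FP], [FN, TP]]. The helper name `metrics_from_confusion` is made up for this sketch.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of the predicted positives, how many are right
    recall = tp / (tp + fn)      # of the actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts read from the confusion matrix above: [[99892, 8], [7, 93]].
acc, prec, rec, f1 = metrics_from_confusion(tn=99892, fp=8, fn=7, tp=93)
print(f"{acc:.5f} {prec:.5f} {rec:.5f} {f1:.5f}")
# 0.99985 0.92079 0.93000 0.92537
```

The values agree with the printed results: 93 of the 100 minority examples are found (recall), and 93 of the 101 positive predictions are correct (precision).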

Had we applied logistic regression directly to the imbalanced data, without SMOTE, what would we get?

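# Let's apply logistic regression to the original imbalanced data and see the results
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
# Assumption: the fitting lines are missing from the text; the three lines below are
# reconstructed from the imports above and the names pipe_lr and y_pred used elsewhere.
pipe_lr = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
print(f"Accuracy = {pipe_lr.score(X, y):.5}")
y_pred = pipe_lr.predict(X)
Accuracy = 0.99983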

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

print(f"Accuracy = {accuracy_score(y, y_pred):.5}")
print(f"Precision = {precision_score(y, y_pred):.5}")
print(f"Recall = {recall_score(y, y_pred):.5}")
print(f"F1 Score = {f1_score(y, y_pred):.5}")
Accuracy = 0.99983
Precision = 1.0
Recall = 0.83
F1 Score = 0.9071

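Note that while accuracy and precision are high, the recall is noticeably lower, at 0.83: the model misses 17% of the minority class examples.

With SMOTE, recall rises from 0.83 to 0.93 and the F1 score from 0.9071 to 0.92537, at the cost of precision dropping from 1.0 to 0.92079. This shows that SMOTE works better in this case, where catching the minority class matters most.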

Hope this post helps some candidates. Please review more questions at: https://machinelearningfaq.com/classification/

Stay tuned for more!
