KNN Algorithm in Credit Limit Decision-making

In this article, we take a practical look at machine learning by implementing the K-Nearest Neighbors (KNN) algorithm with the scikit-learn library.

Our focus will be on the practical application of KNN on the “Credit Card Limit Classification” dataset, available on Kaggle. This dataset is used for the task of classifying credit limit increase requests. When a client requests a limit increase, the bank consults a third-party credit company that provides a recommendation — either “deny” or “grant” the credit — which the bank then passes on to the client.

Each credit inquiry represents an additional cost to the bank. To manage these costs, the bank has limited new requests for a limit increase to one every three months.

Our objective is to use the KNN algorithm to create a model that can assist the bank in making more efficient and economical decisions about credit limit increase requests. We will explore this dataset, making machine learning accessible and understandable.

Moreover, we will pay special attention to understanding the model evaluation metrics, which are fundamental to assessing its performance and suitability to the problem at hand.

Considerations

We will start by loading the dataset that will serve as the basis to train and test our model. It’s important to note that, depending on the dataset, it may be necessary to preprocess the data.

However, for the purpose of this article, we will assume that the data is already properly processed and ready for use. Our focus will be on the implementation of the KNN algorithm and understanding the evaluation metrics.

Nevertheless, it’s worth mentioning that data preprocessing is a crucial step in any machine learning project. It can include tasks such as cleaning up missing or incorrect data, transforming categorical variables into numeric ones, or normalizing the values so they all are on the same scale. These steps ensure the quality and reliability of the model that will be built.

Import Libraries

First, let’s load the dataset into the notebook and import the main libraries necessary to carry out our customer classification project.

At first glance, we have a dataset with 9500 rows and 17 columns.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score
from sklearn.preprocessing import StandardScaler
 
# Load the dataset
df = pd.read_csv('train.csv')
df.head()
 
# Dataset dimensions
print(df.shape)        

Predicted Variable

Based on this dataset, which we named df, we will try to predict which customers should have their credit card limit increased.

The variable “limite_adicional” (additional_limit) that we will be predicting takes the labels “Negar” (Deny) and “Conceder” (Grant), as we can see below when we apply the unique() method to the respective column.

df.loc[:, 'limite_adicional'].unique()

# Output: array(['Negar', 'Conceder'], dtype=object)        

Label Distribution (Target Variable)

Our database includes 7995 customers classified as “Negar” (Deny) and 1505 as “Conceder” (Grant). When normalizing this count with the normalize parameter, we notice a significant imbalance: 84% of customers have their credit denied, while only 16% receive the grant.

This distribution already indicates that accuracy, a common metric for evaluating model performance, will not be the most suitable one going forward.

Accuracy is dominated by the majority class: because the dataset is so imbalanced, a model can score well simply by predicting “Negar” for almost everyone, which would not reflect its true performance.

round(df.loc[:, 'limite_adicional'].value_counts(normalize=True),2)        
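To see why accuracy is a weak yardstick here, it helps to compute what a trivial model would score. The sketch below uses only the class counts quoted above (7995 and 1505); the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the text above
counts = {"Negar": 7995, "Conceder": 1505}
total = sum(counts.values())

# Share of each class, rounded like value_counts(normalize=True)
shares = {label: round(n / total, 2) for label, n in counts.items()}
print(shares)  # {'Negar': 0.84, 'Conceder': 0.16}

# A model that always answers with the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total
print(majority_label, round(baseline_accuracy, 4))  # Negar 0.8416
```

Any useful model has to beat roughly 84% accuracy, or be judged on a metric that does not reward ignoring the minority class.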

Distance Calculation

When using the KNN algorithm, we need to calculate the distance between data points. There are several ways to do this, and the choice of distance calculation method can affect the results of the algorithm. Let’s look at three types of distance without delving into the mathematics: Euclidean, Manhattan, and Mahalanobis.

Euclidean Distance: This is the “straight line” distance between two points. It’s as if you were flying from one point to another, without having to follow streets or roads. It is the default in scikit-learn’s KNeighborsClassifier and works well for many problems.

Manhattan Distance: This distance is calculated as if you were driving in a city, having to follow streets and avenues: the differences along each feature are added up one by one. It is a more realistic measure for problems that involve physical movement, such as the delivery of a package.

Mahalanobis Distance: This distance takes into account how much each feature varies and how the features are correlated with one another. For example, if income and debt tend to rise together, a customer with both high income and high debt is treated as less unusual than the raw numbers alone would suggest.
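The three distances can be compared on a toy example. This is a minimal sketch with NumPy; the two customers and the small sample used to estimate the covariance are made up for illustration and do not come from the dataset.

```python
import numpy as np

# Two hypothetical customers described by (age, annual income in thousands)
a = np.array([30.0, 50.0])
b = np.array([33.0, 54.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(a - b))          # sum of per-feature differences
print(euclidean, manhattan)  # 5.0 7.0

# Mahalanobis needs the covariance of the data the points come from.
# This small sample is invented purely for illustration.
sample = np.array([[25.0, 40.0], [30.0, 50.0], [35.0, 58.0],
                   [40.0, 71.0], [45.0, 80.0]])
cov_inv = np.linalg.inv(np.cov(sample, rowvar=False))
diff = a - b
mahalanobis = np.sqrt(diff @ cov_inv @ diff)
print(mahalanobis)
```

Note that the Mahalanobis value depends entirely on the sample used for the covariance, which is why it is not a drop-in replacement for the other two.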

KNN Premise

The premise of the algorithm is that there exists a similarity relationship based on distance. To do this, we need to understand which characteristics are most relevant for our classification task.

For example, if we are trying to classify customers, some characteristics may not be useful. The “id_cliente” (customer_id) or the CPF (the Brazilian equivalent of a Social Security Number), for example, are unique to each customer and do not help us find similarities. So, we won’t use these characteristics.

On the other hand, “idade” (age) can be a useful characteristic. People of the same age may have similar behaviors, so this characteristic can help us find similarities.

However, not all useful characteristics are easy to use with KNN. For example, ‘investe_exterior’ (invests abroad) and ‘pessoa_polit_exp’ (politically exposed person) are important characteristics, but they are categories (yes or no) instead of numbers. KNN has difficulty finding similarities with categories, so we will leave these characteristics out for now.
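If these yes/no columns are brought back later, one simple option is to map them to 0 and 1 so that a distance can be computed. The sketch below runs on a tiny made-up frame; the category values 'Sim' and 'Não' are an assumption, since the actual values stored in the dataset are not shown in this article.

```python
import pandas as pd

# Tiny made-up frame; the 'Sim'/'Não' values are assumed, not taken from the dataset
toy = pd.DataFrame({
    "investe_exterior": ["Sim", "Não", "Não"],
    "pessoa_polit_exp": ["Não", "Não", "Sim"],
})

# Map each category to a number so distances can be computed
yes_no = {"Sim": 1, "Não": 0}
encoded = toy.assign(
    investe_exterior=toy["investe_exterior"].map(yes_no),
    pessoa_polit_exp=toy["pessoa_polit_exp"].map(yes_no),
)
print(encoded)
```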

Finally, we will also leave out the response variable ‘limite_adicional’ (additional_limit), because this is the thing we are trying to predict, not a feature we use for making predictions.

df.columns

Index([
'id_cliente', 'idade', 'saldo_atual', 'divida_atual', 'renda_anual',
'valor_em_investimentos', 'taxa_utilizacao_credito', 'num_emprestimos',
'num_contas_bancarias', 'num_cartoes_credito', 'dias_atraso_dt_venc',
'num_pgtos_atrasados', 'num_consultas_credito', 'taxa_juros',
'investe_exterior', 'pessoa_polit_exp', 'limite_adicional'],
 dtype='object')        

Feature Selection

In this feature selection step, we also split the dataset. We start by assigning the predictor columns (features) to X and the target variable (label) to y, and then divide both into training and test sets.

# Define features and label
features = [
'idade', 'saldo_atual', 'divida_atual', 'renda_anual',
'valor_em_investimentos', 'taxa_utilizacao_credito',
'num_emprestimos', 'num_contas_bancarias', 'num_cartoes_credito',
'dias_atraso_dt_venc', 'num_pgtos_atrasados', 'num_consultas_credito',
'taxa_juros']

label = 'limite_adicional'

# Split the dataset into predictors and target
X = df[features]
y = df[label]

# Split the set into training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

Train Test Split

The train_test_split function divides our dataset into two parts: training and testing. We use the training set to teach the model and the test set to check how well it has learned.

Before this, we separate our dataset into predictor variables X (information used to make predictions) and target variable y (what we want to predict).

In our case, we want to predict whether a customer’s request for a credit limit increase should be granted: the predictor variables are the ones we defined in the features list, and the target variable is ‘limite_adicional’ (additional_limit).

In our split, we use 80% of the data for training and the remainder for testing. This helps us check whether the model can make accurate predictions on new data, not just on the ones it has already seen.
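Because the classes are imbalanced, it can be worth asking train_test_split to keep the same class proportions in both parts through its stratify parameter. The article’s own split does not do this; the sketch below is an optional variation shown on synthetic data that mimics the 84/16 ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 84% "Negar", 16% "Conceder"
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array(["Negar"] * 840 + ["Conceder"] * 160)

# stratify=y_demo keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

train_share = np.mean(y_tr == "Conceder")
test_share = np.mean(y_te == "Conceder")
print(train_share, test_share)
```

Without stratification the minority share in the test set can drift by chance, which makes metrics for “Conceder” noisier.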

Training

Having selected the features, we move on to training the algorithm using the scikit-learn library, through the KNeighborsClassifier class, where one of its key parameters is the number of neighbors, that is, ‘n_neighbors’.

For now, we arbitrarily use 7 neighbors and store the configured classifier in knn_classifier.

# Initialize the KNN classifier with parameter k=7
knn_classifier = KNeighborsClassifier(n_neighbors=7)

# Train the KNN algorithm model
knn_classifier.fit(X_train, y_train)        

The KNN algorithm works by measuring the distance between a data point and all other points. It then selects the “k” closest points — in our case, we chose 7. After that, it looks at the class (or category) of those closest points.

For instance, if we’re trying to classify a customer as “grant” or “deny” credit, the KNN algorithm looks at the 7 most similar customers we already know. If the majority of those customers had their credit granted, then the algorithm will predict that this new customer should also have their credit granted.

To do this, we need to “train” the algorithm with data we already know. We use the ‘fit’ method for this, providing it with our training data.

These training data include both the features of the customers (such as age, income, credit history) and the class we already know (whether the credit was granted or denied).

Once trained, the algorithm is ready to make predictions about new customers. It already knows how to find the closest “neighbors” and how to use this information to make a prediction.
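The voting idea described above can be sketched in a few lines of plain Python. This is not how scikit-learn implements it internally; it is a simplified illustration on made-up customers, and knn_predict is a hypothetical helper name.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])          # closest first
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up customers: (age, annual income in thousands)
points = [(25, 30), (27, 32), (30, 35), (45, 90), (50, 95), (52, 100), (48, 85)]
labels = ["Negar", "Negar", "Negar", "Conceder", "Conceder", "Conceder", "Conceder"]

prediction = knn_predict(points, labels, query=(47, 88), k=3)
print(prediction)  # Conceder
```

Notice that nothing is “learned” ahead of time: all the work happens when a new point arrives and its distances to the stored points are computed.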

Cross-Validation

At this moment, we continue to assume the arbitrary value of 7. However, to find the ideal value of ‘k’ in the KNN algorithm, we can use an approach called cross-validation.

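Cross-validation is a technique where we divide our training data into a number of subsets, called folds. We then train the model once per fold, each time holding out a different fold for validation and training on the rest. (The number of folds is often also written as ‘k’, but it is unrelated to the number of neighbors.)

For each candidate number of neighbors, we average the accuracy across the folds and, in the end, we choose the value of ‘k’ that provided the highest average accuracy.

Another common approach is simply to test different values of ‘k’ and choose the one that performs best on held-out data. Ideally this is a separate validation set rather than the final test set, so that the test score remains an honest estimate.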

However, it’s important to be careful not to choose a ‘k’ value that’s too low, as this can lead to overfitting, where the model follows individual training points (including noise) so closely that it performs poorly when predicting new data.

Similarly, a ‘k’ value that’s too high can lead to underfitting, where each prediction is averaged over so many neighbors that it drifts toward the majority class and misses local patterns, so the model also performs poorly on new data.

Therefore, finding the ideal value of ‘k’ is a balance between avoiding overfitting and underfitting.
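A quick way to see how ‘k’ changes the model’s flexibility is to compare training and test accuracy at a very small, a moderate, and a very large number of neighbors. The sketch below uses synthetic data from make_classification as a stand-in for the credit dataset, so the exact numbers are not meaningful; only the pattern is.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only to illustrate the effect of k
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

results = {}
for k in (1, 15, 201):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each k
    results[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(k, results[k])
```

With k=1 every training point is its own nearest neighbor, so the training accuracy is perfect even though the test accuracy is not, which is the signature of a model that has memorized its data.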


from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# List to store the mean accuracy for each k
cv_scores = []

# Number of folds
folds = 10

# Creating odd values of k for KNN
neighbors = list(range(1, 50, 2))

# Run cross-validation
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=folds, scoring='accuracy')
    cv_scores.append(scores.mean())

# Identifying the best k
optimal_k = neighbors[cv_scores.index(max(cv_scores))]
print('The optimal number of neighbors is %d' % optimal_k)

In this code, cross-validation is being used to test different values of k (from 1 to 49, in steps of 2). For each value of k, the model is trained and tested using cross-validation, and the average accuracy is stored.

Finally, the code identifies the value of k that produced the highest average accuracy and prints that value. In this case, the optimal number of neighbors was found to be 29. This means that, for this dataset and setup, the KNN algorithm performs best when considering the 29 closest neighbors when making a prediction.

# output: The optimal number of neighbors is 29.        
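Since accuracy is questionable on imbalanced data, the same search can be scored with a metric that weighs both classes equally. The sketch below is a variation, not the article’s original run: it swaps in scoring="f1_macro" and stratified folds, and uses synthetic data in place of X_train and y_train.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data standing in for X_train / y_train
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.84, 0.16], random_state=42)

# Stratified folds keep the class ratio similar in every fold
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
neighbors = list(range(1, 20, 2))

f1_by_k = {}
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_demo, y_demo, cv=folds, scoring="f1_macro")
    f1_by_k[k] = scores.mean()

best_k = max(f1_by_k, key=f1_by_k.get)
print("Best k by macro F1:", best_k)
```

The value of ‘k’ selected this way may differ from the one chosen by accuracy, because large neighborhoods tend to vote for the majority class.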

Prediction from the Trained Algorithm

Now, we can use our ‘knn_classifier’ to make predictions, or better said, to classify. If we look back at our training feature set ‘X_train’, note that it does not include the labeled column ‘limite_adicional’ (additional_limit) on the right side of the table.

X_train.head()        

The score method of the KNN classifier returns the accuracy of the model on the provided dataset.

Accuracy is an evaluation metric that measures the proportion of correct predictions made by the model, but it should be used with caution! It is calculated as the number of correct predictions divided by the total number of predictions. If the dataset is not balanced, a high accuracy can be misleading, because a model that favors the majority class will score well without being useful.

In the case of knn_classifier.score(X_train, y_train), we are calculating the model’s accuracy on the training set. This means we are assessing how well the model fits the training data before later testing it on new data.

# Check the accuracy of the model on the training set
knn_classifier.score(X_train, y_train)

# output: 0.8594736842105263        

The value of approximately 0.859 is the accuracy of the model on the training set.

This means that the model correctly predicted the outcome for about 85.9% of the samples in the training set. However, as we saw earlier in the Label Distribution (Target Variable) section, this dataset is clearly imbalanced, so this number should be read with caution.

Prediction on the test set

The predict method takes the features of the test set, X_test, and returns the model’s predictions for those features. These predictions are stored in the variable y_pred.

# Make predictions on the test set
y_pred = knn_classifier.predict(X_test)

# Calculate the accuracy of the model on the test set
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the model on the test set: ", accuracy)

# Output: Accuracy of the model on the test set: 0.8236842105263158        

Accuracy (considering class imbalance)

The accuracy_score function takes the true labels, y_test, and the model’s predictions, y_pred, and returns the proportion of predictions that match the true labels. In other words, accuracy is the number of correct predictions divided by the total number of predictions.

When the KNN model was used to predict the target variable on the test set, it achieved a prediction accuracy of 82.4%.

This is close to the 85.9% obtained on the training set, which suggests the model is not simply memorizing the training data. However, a model that always predicted “Negar” would score about 83.4% on this test set (1584 of its 1900 samples are denials), so accuracy is reported here only for educational purposes.

Evaluation Metrics

“It’s not the best metric that matters, but the one that best represents our business.”

When dealing with classification algorithms in supervised learning, evaluating the model’s performance is crucial. For this purpose, we rely on various evaluation metrics that allow us to understand the effectiveness of our model.

The Confusion Matrix is one of these fundamental metrics. It provides a clear view of how our machine learning model is classifying the different classes.

Choosing the correct metric greatly depends on the specific problem we’re trying to solve and the cost associated with different types of errors (false positives and false negatives — in some cases, lives depend on it).

Confusion Matrix and Classification Report

The confusion matrix shows the distribution of our model’s predictions. The rows represent the true classes, and the columns represent the predicted classes. Therefore, we have:


from sklearn.metrics import confusion_matrix, classification_report

# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix
plt.figure(figsize=(10,7))
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()        

Our model made the following predictions:

  • 45 true positives (clients who should have been granted credit and the model correctly predicted it)
  • 271 false negatives (clients who should have been granted credit, but the model predicted it as denied)
  • 64 false positives (clients who should not have been granted credit, but the model predicted it as granted)
  • 1520 true negatives (clients who should not have been granted credit and the model correctly predicted it)
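When reading these four numbers off the matrix, the row and column order matters. By default scikit-learn sorts string labels alphabetically, so passing the labels argument explicitly makes the layout unambiguous. A minimal sketch on a handful of made-up labels, treating “Conceder” as the positive class:

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, for illustration only
y_true = ["Conceder", "Conceder", "Negar", "Negar", "Negar", "Conceder"]
y_hat = ["Conceder", "Negar", "Negar", "Conceder", "Negar", "Negar"]

# Listing "Negar" first makes it the negative class (row 0 / column 0)
cm_demo = confusion_matrix(y_true, y_hat, labels=["Negar", "Conceder"])
tn, fp, fn, tp = cm_demo.ravel()
print(tn, fp, fn, tp)  # 2 1 2 1
```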

Classification Report

The classification_report is a function that generates a report with key evaluation metrics for a classification model.

These metrics include precision (of the samples the model assigned to a class, how many truly belong to it), recall (of the samples that truly belong to a class, how many the model found), and the F1 score (the harmonic mean of precision and recall, which seeks a balance between both).

In simpler terms, the classification_report gives us an overview of how well our model is performing in correctly classifying the data.

# Generate the classification report
class_report = classification_report(y_test, y_pred, digits=5)
print('\nClassification Report:\n', class_report)        

Classification Report Metrics

This first group of metrics (Precision, Recall, F1-Score) is more specific and provides a more detailed view of the model’s performance in each class:

Precision: It’s like a basketball player making a basket. If the player attempted 10 shots and made 5, the precision is 50%. In our case, when the model predicts “Grant” it is right 41.28% of the time, and when it predicts “Deny” it is right 84.87% of the time.

Recall (Sensitivity): Imagine we have 100 apples, and 30 of them are rotten. If we correctly identify 20 rotten apples, our recall is 20/30 = 66.7%. In our case, the model correctly identifies 14.24% of clients who should be granted credit and 95.96% of clients who should not be granted.

F1-Score: It is the harmonic mean of precision and recall, a “middle ground” between the two metrics that is pulled toward the lower of the two. In our case, the F1-Score is 0.21176 for “Grant” and 0.90074 for “Deny”.
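These figures can be reproduced by hand from the four confusion-matrix counts listed earlier, which is a useful sanity check on how each metric is defined. The sketch below uses plain Python; the variable names are illustrative.

```python
# Counts from the confusion matrix, with "Conceder" (Grant) as the positive class
tp, fn, fp, tn = 45, 271, 64, 1520

precision = tp / (tp + fp)   # of predicted grants, how many were right
recall = tp / (tp + fn)      # of true grants, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.5f} recall={recall:.5f} f1={f1:.5f} accuracy={accuracy:.5f}")
# precision=0.41284 recall=0.14241 f1=0.21176 accuracy=0.82368

# The same formulas seen from the "Negar" (Deny) side
precision_deny = tn / (tn + fn)
recall_deny = tn / (tn + fp)
print(f"precision_deny={precision_deny:.5f} recall_deny={recall_deny:.5f}")
# precision_deny=0.84869 recall_deny=0.95960
```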

Broad Overview

This other group of metrics (Support, Accuracy, Macro Avg, Weighted Avg) is more general and provides a broad view of the model’s performance:

Support: It simply tells us how many examples we have for each class. In our case, we have 316 examples of “Grant” and 1584 of “Deny”.

Accuracy: It is the percentage of correct predictions made by the model. In our case, the model achieves an accuracy of 82.37%.

Macro Avg: It is the simple average of the metrics for each class.

Weighted Avg: It is the average of the metrics for each class, but giving more weight to the classes with more examples.
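The difference between the two averages is easiest to see with the F1-Score. The sketch below recomputes the per-class F1 from the confusion-matrix counts and the support values quoted in this article, then averages them both ways; the resulting averages are derived here, not read from the original report.

```python
# Support and per-class F1, derived from the counts reported in the article
support = {"Conceder": 316, "Negar": 1584}
f1 = {
    "Conceder": 2 * 45 / (2 * 45 + 64 + 271),
    "Negar": 2 * 1520 / (2 * 1520 + 271 + 64),
}

# Macro: every class counts the same
macro_f1 = sum(f1.values()) / len(f1)

# Weighted: each class counts in proportion to its number of examples
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(round(macro_f1, 4), round(weighted_f1, 4))  # 0.5563 0.7862
```

The macro average is much lower because it does not let the large “Deny” class hide the weak performance on “Grant”.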

Precision or Recall (depends on the context)

Now, which metric is more important depends on the context and the business objective.

If the bank wants to minimize the risk of granting credit limit increases to clients who should not receive them (to avoid defaults, for example), then Precision would be more important. In this case, the bank prefers to err on the side of denying credit increases to some clients who could pay (false negatives), rather than granting increases to clients who will not be able to pay (false positives).

On the other hand, if the bank wants to ensure that all eligible clients receive credit limit increases (to increase customer satisfaction or credit card utilization, for example), then Recall would be more important. In this case, the bank prefers to err on the side of granting credit increases to some clients who will not be able to pay (false positives), rather than denying increases to clients who could pay (false negatives).
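With KNN, one practical lever for trading precision against recall is the share of neighbors required before predicting the positive class. The sketch below is an assumption about how this could be done, not part of the original analysis: it uses predict_proba on synthetic data, where class 1 plays the role of “Grant”.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data; class 1 is the minority ("Grant") class
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.84, 0.16], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)

knn = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)
# Share of the 7 neighbors that belong to the positive class
positive_share = knn.predict_proba(X_te)[:, 1]

recalls = {}
for threshold in (0.5, 0.3):
    y_hat = (positive_share >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    precision = precision_score(y_te, y_hat, zero_division=0)
    print(threshold, round(precision, 3), round(recalls[threshold], 3))
```

Lowering the threshold can only add positive predictions, so recall never decreases, while precision usually drops; which trade-off is acceptable is a business decision.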

Examples of KNN Algorithm Usage

The KNN algorithm is a powerful tool that is used in many different situations:

Product recommendation: KNN can be used to find similar products based on customer preferences. For example, if a customer likes a certain book, KNN can find other books that similar customers also enjoyed.

Image recognition: Classifying images based on their visual features. For example, it can be used to recognize handwritten digits or identify faces in photos.

Anomaly detection: Detecting data points that are significantly different from others. For example, it can be used to detect credit card fraud or identify faults in machine systems.

News classification: Categorizing news articles based on their content. For example, it can be used to identify whether an article is about politics, sports, technology, etc.

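Customer similarity: Finding the customers most similar to a given one in behavior or characteristics. This can be useful for market segmentation or for personalizing offers and recommendations. (Grouping customers without any labels is a clustering task, usually handled by algorithms such as k-means.)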

Image classification: KNN can be used to classify images based on their visual features, for example to identify objects or people. The image is represented as a matrix of numbers in which each row-column pair corresponds to a pixel and each value encodes that pixel’s intensity.

Search system: Creating a search system that returns similar items based on a query. For example, it can be used to find similar images in an image database or to find similar documents in a set of texts.

Advantages and Disadvantages

KNN has several advantages and disadvantages:

Advantages:

  1. Easy to understand and explain: KNN is a straightforward algorithm that is easy to grasp and explain to others.
  2. No separate model-building step: KNN is a “lazy” learner. Fitting it essentially stores the training data, and the real work happens at prediction time, when distances are computed.
  3. Versatility: KNN can be used for classification, regression, and search problems.

Disadvantages:

  1. Computationally expensive: As the number of data points in the dataset increases, the computational cost of KNN can become significant, as it requires calculating distances between points.
  2. Sensitivity to feature scaling: KNN is sensitive to the scale of features, so it is important to normalize or scale the data before applying the algorithm.
  3. Impact of irrelevant features: In datasets with irrelevant features, the performance of KNN can be negatively affected, as the algorithm considers all features equally.
  4. Distorted distance: The distance between points can be distorted when dealing with datasets with varying feature scales or different distributions.
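The scaling issue is usually addressed by standardizing the features before the distances are computed. A minimal sketch, assuming a scikit-learn pipeline with StandardScaler; it runs on synthetic data in which one feature is deliberately inflated, and it is not the setup used for the results reported in this article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one feature on a much larger scale than the others
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_demo[:, 0] *= 10_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# The scaler is fitted on the training data only, then applied before KNN
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
scaled_knn.fit(X_tr, y_tr)
test_accuracy = scaled_knn.score(X_te, y_te)

# After scaling, every training feature has unit standard deviation
scaled_std = scaled_knn.named_steps["standardscaler"].transform(X_tr).std(axis=0)
print(test_accuracy, scaled_std)
```

Wrapping both steps in a pipeline also keeps the scaler from seeing the test data, which matters when the pipeline is passed to cross_val_score.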

Overall, while KNN is a simple and versatile algorithm, its performance and efficiency can be affected by the size of the dataset and the characteristics of the features. Careful consideration should be given to these factors when applying KNN to a problem.

Conclusion

In conclusion, we have explored the practical implementation of the KNN algorithm and the importance of evaluation metrics in machine learning. Using the “Credit Card Limit Classification” dataset, we have demonstrated how KNN can be applied to assist business decisions, specifically in optimizing the credit limit increase process.

Furthermore, we have highlighted the relevance of understanding evaluation metrics such as precision, recall, F1-score, and accuracy, which are crucial for assessing the model’s performance and its suitability for the given problem. Each metric offers a unique perspective on the model’s effectiveness, and the choice of the most appropriate metric depends on the context and business objectives.

We hope that this article has provided an accessible overview of the implementation of the algorithm and its evaluation metrics in ML.

Thank you for your time.
