Results obtained building a predictive model for credit risk analysis

Credit risk analysis is a key component in maintaining the health of a financial institution's balance sheet. Keeping the default rate low ensures that the loans being made are profitable. To this end, the use of machine learning to build models capable of identifying patterns and predicting whether a customer is likely to default has intensified.


Click here to read this article in Portuguese.


* Note

This is a summarized article that shows the main results.

To see the full study, including the code and methodology used, click here.


The Study

This project aimed to create a machine-learning model that predicts whether a new customer is likely to default.

Initial considerations

The dataset used in this project was originally made available by Nubank. It contains 45,000 records and 43 attributes.

Some issues were identified in this dataset, the most detrimental being the imbalance in the default variable, as the majority of records were non-default. However, this was expected and was duly addressed when constructing the predictive model.
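
As a rough, hedged illustration of how such an imbalance can be treated (the exact strategy used is documented in the full notebook), the sketch below rebalances a synthetic stand-in for the data by undersampling the majority class with the imbalanced-learn library:

```python
# Illustrative sketch only: a synthetic stand-in for the dataset is
# rebalanced by undersampling the majority (non-default) class.
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Assumed class split (~16% defaults), chosen purely for illustration.
X, y = make_classification(n_samples=45_000, n_features=43,
                           weights=[0.84], random_state=42)

rus = RandomUnderSampler(random_state=42)
X_bal, y_bal = rus.fit_resample(X, y)
print(f"positives before: {y.mean():.2%} | after: {y_bal.mean():.2%}")
```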


Model Decision

The algorithm chosen to create the prediction model, XGBoost, comes from the family of supervised classifiers of the decision-tree type. Its name stands for Extreme Gradient Boosting, and it has been widely used by professionals in the field due to the high degree of precision and accuracy of the models it produces.

This is partly due to the large number of hyperparameters that can be adjusted, significantly improving the model's performance. It can also be applied to various types of problems across a wide range of sectors.
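
As a minimal sketch (not the study's configuration), fitting an XGBoost classifier looks like the following; the hyperparameter values are illustrative placeholders, and synthetic data stands in for the real features:

```python
# Minimal, illustrative XGBoost fit; the hyperparameter values are
# placeholders, not the tuned configuration from the study.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5_000, n_features=43, random_state=42)

model = XGBClassifier(
    n_estimators=200,     # number of boosted trees
    learning_rate=0.1,    # shrinkage applied to each tree's contribution
    max_depth=4,          # depth of each individual tree
    eval_metric="logloss",
    random_state=42,
)
model.fit(X, y)
```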


Performance Assessment Metric

Among the metrics used to evaluate the model's performance, the main one was Recall, which best fits the problem under study. The reason is that, in the case of defaults, False Negatives are more harmful to a company than False Positives. In other words, it is better for the model to err by flagging a customer as a defaulter when in reality they are not; indicating that a customer will not default when they actually will leads to direct losses for the business. With this in mind, the higher the Recall value, the better the model's performance.
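
To make the computation concrete, here is a toy illustration with invented labels (1 = defaulter, 0 = non-defaulter) of how Recall penalizes False Negatives:

```python
# Toy example: Recall = TP / (TP + FN), so every missed defaulter
# (False Negative) pulls the score down.
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]  # two defaulters missed (False Negatives)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))                # 0.5
print(recall_score(y_true, y_pred))  # same value via scikit-learn
```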


Study Development

First, a base model was created using Logistic Regression. This provided a benchmark of what an algorithm without further adjustments could achieve, resulting in a Recall of 0.0290.
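
A sketch of this kind of baseline, with synthetic data standing in for the real dataset, might look like this:

```python
# Illustrative baseline: a plain Logistic Regression with no scaling,
# balancing, or tuning, evaluated by Recall on a held-out split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the credit data.
X, y = make_classification(n_samples=45_000, n_features=43,
                           weights=[0.84], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(recall_score(y_te, baseline.predict(X_te)))
```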

Then, after standardizing and balancing the data, 7 other models were created, also with the aim of comparing Recall values. The best model at this stage was the LGBM Classifier with a Recall of 0.6562, while XGBoost was in second place with 0.6483. However, after optimizing the hyperparameters of XGBoost, this value increased to 0.6640.
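
A hedged sketch of such a tuning step, optimizing directly for Recall with a randomized search (the parameter distributions below are assumptions, not the search space actually used):

```python
# Illustrative hyperparameter search targeting Recall; the parameter
# distributions are assumptions, not the study's actual search space.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5_000, n_features=43, random_state=42)

param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}
search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=42),
    param_distributions=param_dist,
    n_iter=10,
    scoring="recall",  # optimise for the study's main metric
    cv=5,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```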

In an effort to further improve the model, feature engineering was performed, creating 4 new variables. Once again, the base model was run (Recall of 0.0513), and after standardizing and balancing the data, the 7 models were recreated. This time, half of them performed better than before, and the improvements were larger than the deteriorations.
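
The four variables actually created are described in the full study; purely to illustrate the pattern, the sketch below derives hypothetical ratio features from invented columns with pandas:

```python
# Hypothetical illustration of feature engineering: the column names and
# ratios below are invented, not the study's actual four variables.
import pandas as pd

df = pd.DataFrame({
    "income": [3000, 5500, 1200],
    "credit_limit": [1500, 8000, 600],
    "n_loans": [1, 3, 2],
})
# New features derived by combining existing columns:
df["limit_to_income"] = df["credit_limit"] / df["income"]
df["income_per_loan"] = df["income"] / df["n_loans"]
print(df)
```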

Once again, the hyperparameters for XGBoost were optimized, reaching a Recall value of 0.6663, the best value found so far.

When the test data were run through the XGBoost models created with and without feature engineering, the Recall values obtained were 0.6872 and 0.6547, respectively. This means the model with feature engineering was 3.25 percentage points better than the model without.
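
A quick back-of-the-envelope check of that gap:

```python
# Difference between the two reported test Recall values, in percentage points.
recall_with_fe, recall_without_fe = 0.6872, 0.6547
print(f"{(recall_with_fe - recall_without_fe) * 100:.2f} pp")  # 3.25 pp
```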

To confirm the superiority of one model over the other, a hypothesis test (z-test) was conducted, yielding a p-value of 4.41e-08. This statistically confirms that the model with feature engineering indeed performs better than the one without.
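
One plausible way to run such a test (statsmodels' two-proportion z-test is assumed here, and the number of actual defaulters n in the test set is a made-up figure, so the printed p-value will differ from the study's):

```python
# Sketch of a one-sided two-proportion z-test comparing the two Recall
# values; n is an assumed count of actual defaulters in the test set,
# so the printed p-value will not match the study's 4.41e-08.
from statsmodels.stats.proportion import proportions_ztest

n = 3000
successes = [int(0.6872 * n), int(0.6547 * n)]  # with / without feature eng.
stat, p_value = proportions_ztest(successes, [n, n], alternative="larger")
print(f"z = {stat:.2f}, p-value = {p_value:.2e}")
```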


Conclusion

After optimizing the hyperparameters of the XGBoost algorithm and applying feature engineering, a model was achieved with a Recall value of 0.6872 on the test data, the best value among the 18 models created in this study. Moreover, this improved XGBoost's own test Recall by 0.0325 points (3.25 percentage points), as confirmed by a statistical hypothesis test.

This underscores the importance and influence of both hyperparameter optimization and the execution of feature engineering in enhancing machine learning models.


Get to know more about this study

This study is available on Google Colab and on GitHub. Just click on the images below to be redirected.


[LoffredoDS] Credit Risk Analysis.ipynb


raffaloffredo/credit_risk_analysis



Let's Connect!
