Results obtained with machine learning models to predict credit card fraud

Raffaela Loffredo

Published: June 20, 2023

Click here to read this article in Portuguese.

* Note

This is a summarized article that shows the main results.

For the full study, including the code and methodology used, click here.

Introduction

With the advent of the internet, mobile phones, 4G, and technology-driven banks known as fintechs, the volume of digital transactions is growing rapidly. In 2022, the major credit card companies together processed over 580 billion transactions worldwide (Statista). In Brazil, there were nearly 40 billion transactions, amounting to over 3 trillion Brazilian reais. That’s almost 19,000 transactions per second (Valor Investe). Growth is forecast to continue at around 15% per year, but in the first quarter of 2023 it had already reached 17% compared with the same quarter of the previous year (ABECS).

Unfortunately, this presents a sea of opportunities for malicious individuals. It is estimated that approximately 20 billion dollars are lost each year due to online payment fraud (Ravelin). Furthermore, from 2021 to 2022, this type of fraud saw a 40% increase (Sumsub), indicating that fraudsters are becoming increasingly creative in their attempts to deceive banks.

A KPMG study revealed that instant and/or online payments are the second-largest concern of financial institutions among the most significant risks in the Americas. The reason is the heavy damage that, as we have seen, this type of crime causes both to financial institutions and to their clients.

Among the various measures to prevent such situations, there is the monitoring of transactions through the development of technologies that can assess risks and make real-time decisions to determine whether a particular payment is fraudulent or not. This is made possible thanks to advancements in artificial intelligence.

Evidence of the effectiveness of machine learning algorithms is that 7 out of 10 banks invest in this type of technology (KPMG). However, despite the significant advances in this area, there is still room for improvement. Half of the banks that perform these analyses reported a significant increase in False Positives, meaning that the model flags a transaction as fraudulent when it is not. You may have experienced this yourself: when trying to make a purchase, your card was preemptively blocked by the bank. Because this causes embarrassment and problems for customers, banks aim to reduce these occurrences so that fraud detection becomes more effective.

With the evolution of digital channels, an increasing number of transactions are conducted in this manner. Consequently, the volume of historical data grows, providing more information about customer behavior. All of this makes potential fraud easier to identify. Ongoing studies of this kind are therefore necessary for financial institutions, as even a minimal improvement in such algorithms can result in millions in savings.

The Research

Given the above, the general objective of this project was to analyze data from over 280,000 credit card transactions provided by European operators and to use machine learning methods to improve the detection of fraud carried out with this payment method.

Initial Considerations

The dataset is imbalanced. Out of the more than 280,000 transactions, only 492 were identified as fraud, representing just 0.17% of the total. Hence the term "imbalanced": there are far more legitimate transactions than fraudulent ones. This requires specific data treatment before creating prediction models. Without it, the resulting model would be very good at predicting legitimate transactions but very poor at predicting fraudulent ones, which goes against the objective of this study: identifying fraudulent transactions.
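To make the risk of ignoring the imbalance concrete, here is a minimal sketch of a hypothetical baseline that labels every transaction as legitimate. The function name is illustrative, and the total of 280,000 is the rounded figure quoted above rather than the exact dataset size, so the result is approximate.

```python
def always_legit_baseline(total: int, frauds: int) -> tuple[float, float]:
    """Score a 'model' that predicts 'legitimate' for every transaction."""
    true_negatives = total - frauds   # every legitimate transaction is right
    false_negatives = frauds          # every fraud is missed
    true_positives = 0                # no fraud is ever flagged
    accuracy = (true_positives + true_negatives) / total
    recall = true_positives / (true_positives + false_negatives)
    return accuracy, recall


# Rounded total from the article ("over 280,000"); 492 frauds as reported.
accuracy, recall = always_legit_baseline(total=280_000, frauds=492)
print(f"accuracy = {accuracy:.2%}, recall = {recall:.2%}")
```

Accuracy looks excellent even though not a single fraud is caught, which is why an imbalanced dataset needs treatment and why accuracy alone is a misleading measure here.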

Model Selection

The machine learning models were chosen based on the problem under study and the intended objective. Since we want to predict credit card fraud, we need an answer to the question "Is this transaction a fraud?", which narrows our options to supervised models. And since the answer is binary, "Yes" or "No," we look at classification algorithms.

Therefore, the chosen models are Logistic Regression and Decision Tree.

Logistic Regression

Logistic Regression is a classification algorithm that assigns observations to classes based on their probability of belonging to each class. Therefore, it is a good model to use when the dependent variable is categorical.

In our study, this means that once the model is trained, it will receive information about a new credit card transaction and, according to the algorithm, determine the probability of it being a fraud. The class with the highest probability is what the algorithm will indicate as the potential class of that transaction.

To do this, logistic regression transforms the output generated by the model using the sigmoid function (which is a logistic function) to return a probability value and then determine the class to which the observation belongs.
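As a minimal sketch of this step, the code below applies the sigmoid to a weighted sum of features and then thresholds the resulting probability. The weights, bias, feature values, and the 0.5 threshold are hypothetical values chosen for illustration; they are not taken from the study.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict(features, weights, bias, threshold=0.5):
    """Return (probability of fraud, predicted class) for one transaction."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical weights and features, for illustration only.
probability, label = predict(features=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(probability, label)
```

In a trained model the weights and bias are learned from the data; the threshold is a separate choice that trades False Positives against False Negatives.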

Decision Tree

A Decision Tree is a supervised learning algorithm used for classification and regression problems. It works by finding boundaries in the data and then splitting the data into subsets along those boundaries.
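One way to picture "finding a boundary" is the search for the best threshold on a single feature. The sketch below scores each candidate threshold by weighted Gini impurity, a common splitting criterion; the article does not say which criterion the study used, so treat that choice, the function names, and the toy amounts as assumptions.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 = legitimate, 1 = fraud)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy transaction amounts and labels, invented for illustration.
threshold, impurity = best_split([10, 20, 30, 200, 250], [0, 0, 0, 1, 1])
print(threshold, impurity)
```

A full tree repeats this search over every feature and then recursively inside each resulting subset.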

Evaluation Metrics

Among the metrics for evaluating the performance of a classification algorithm, Recall provides the best measure for the specific problem under study. This is because, in the case of fraud, False Negatives are more harmful to a company than False Positives. In other words, it is better for the model to make mistakes by classifying a transaction as a fraud when it is not, rather than mistakenly classifying a fraudulent transaction as legitimate, which would result in financial losses for the business.

Therefore, we look for a high Recall rate.

Considering the purpose of this study, another metric that will be used is AUC (Area Under the Curve), which indicates how well the model can separate two classes. In our case, it measures the ability to distinguish a legitimate transaction from a fraudulent one.
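AUC can also be read as the probability that a randomly chosen fraud receives a higher score than a randomly chosen legitimate transaction. The sketch below computes it directly from that pairwise definition; the scores are made up for illustration, and the quadratic loop is meant for clarity rather than for large datasets.

```python
def auc(fraud_scores, legit_scores):
    """Share of (fraud, legitimate) pairs ranked correctly; ties count half."""
    wins = 0.0
    for fraud_score in fraud_scores:
        for legit_score in legit_scores:
            if fraud_score > legit_score:
                wins += 1.0
            elif fraud_score == legit_score:
                wins += 0.5
    return wins / (len(fraud_scores) * len(legit_scores))


# Hypothetical model scores: one fraud (0.4) is ranked below one legitimate (0.5).
example_auc = auc(fraud_scores=[0.9, 0.4], legit_scores=[0.5, 0.1])
print(example_auc)
```

A value of 1.0 means every fraud outranks every legitimate transaction, while 0.5 is what random scoring would achieve.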

Finally, the confusion matrix compares the predicted values with the actual values, showing the model's errors and correct predictions.
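The relationship between the confusion matrix and Recall can be sketched in a few lines. The labels below are toy data invented for the example (1 = fraud, 0 = legitimate), and the function names are illustrative.

```python
def confusion_counts(actual, predicted):
    """Count the four cells of a binary confusion matrix."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1   # fraud correctly flagged
        elif a == 0 and p == 1:
            counts["FP"] += 1   # legitimate transaction wrongly flagged
        elif a == 1 and p == 0:
            counts["FN"] += 1   # fraud missed
        else:
            counts["TN"] += 1   # legitimate transaction correctly passed
    return counts


def recall(counts):
    """Share of actual frauds that the model caught: TP / (TP + FN)."""
    return counts["TP"] / (counts["TP"] + counts["FN"])


actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(actual, predicted)
print(counts, round(recall(counts), 4))
```

Recall only looks at the fraud row of the matrix, so a model can raise it by flagging more transactions, at the cost of more False Positives.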

Comparison between Logistic Regression and Decision Tree Results

With two classification models created, we can compare the metrics obtained from them and determine which one better suits our problem of identifying credit card fraud.

Confusion Matrix

Out of 85,443 tested transactions, Logistic Regression produced 2,800 False Positives, compared to 7,242 for the Decision Tree. This is a considerable difference in the number of dissatisfied customers. As for correctly identifying fraud, Logistic Regression got 3 more cases right and therefore produced fewer False Negatives.

Therefore, analyzing the results obtained from both models, the Logistic Regression algorithm is superior in all aspects compared to the Decision Tree model.
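To put the False Positive counts in proportion, the short calculation below expresses them as a share of the 85,443 tested transactions. It uses only the figures quoted above; the variable names are illustrative.

```python
tested_transactions = 85_443
false_positives = {
    "Logistic Regression": 2_800,
    "Decision Tree": 7_242,
}

# Share of all tested transactions that each model wrongly flagged as fraud.
false_positive_share = {
    model: count / tested_transactions
    for model, count in false_positives.items()
}

for model, share in false_positive_share.items():
    print(f"{model}: {share:.2%} of tested transactions wrongly flagged")

ratio = false_positives["Decision Tree"] / false_positives["Logistic Regression"]
print(f"Decision Tree flags about {ratio:.1f}x as many legitimate transactions")
```

Roughly 3.3% of tested transactions are wrongly flagged by Logistic Regression versus about 8.5% by the Decision Tree.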

Recall

Recall, as a metric, provides the best measure for our specific problem. The higher the Recall value, the better the model will be at identifying fraud.

The Logistic Regression model has a higher Recall value of 91.22%, compared to 89.19% for the Decision Tree model.

AUC

The ROC curves for both models were plotted side by side, together with the resulting AUC values, for easy comparison of this metric.

The AUC value for Logistic Regression, 93.97%, is higher than the 90.35% obtained by the Decision Tree model.

Conclusion

When evaluating the performance of both models to identify the best one for predicting credit card fraud, considering Recall, AUC, and the confusion matrix, it is concluded that the Logistic Regression algorithm is superior to the Decision Tree.

Finally, it is worth noting that, despite the good results obtained, there is always room for improvement in the models. Other classification models can be used for further performance comparisons, and parameter optimization specific to each algorithm can be applied.

But... What does this mean in practice?

To understand the real-world impact this model could have, let's consider the figures for credit card transactions in Brazil in 2022. There were 18.2 billion transactions totaling 2.1 trillion Brazilian reais. Dividing the total value by the number of transactions gives an average of 115.38 reais per transaction.

Taking the best model created above, namely Logistic Regression, and applying the results found on the test data to the 18.2 billion transactions conducted in Brazil in 2022, we would have detected:

17,572,100,000 legitimate transactions.
29,120,000 frauds, which would correspond to approximately 3,359,865,600 Brazilian reais (considering the average transaction amount of R$ 115.38). At the exchange rate of R$ 4.87 per US$ 1 on June 10th, 2023, this is almost 700 million dollars.

In other words, it would have been possible to save over 3 billion reais by preventing fraudulent transactions from occurring, simply by using machine learning.
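The arithmetic behind these figures can be reproduced as a quick check. The sketch below uses only the numbers quoted above and `Decimal` to avoid floating-point rounding; the variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

transactions = Decimal("18200000000")        # 18.2 billion transactions
total_value_brl = Decimal("2100000000000")   # 2.1 trillion reais
detected_frauds = Decimal("29120000")        # frauds detected in the extrapolation
brl_per_usd = Decimal("4.87")                # rate quoted for June 10th, 2023

# Average transaction amount, rounded to centavos.
average_ticket = (total_value_brl / transactions).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP
)
fraud_value_brl = detected_frauds * average_ticket
fraud_value_usd = fraud_value_brl / brl_per_usd
fraud_share = detected_frauds / transactions

print(f"average ticket: R$ {average_ticket}")
print(f"fraud value:    R$ {fraud_value_brl:,.0f}")
print(f"fraud value:    US$ {fraud_value_usd:,.0f}")
print(f"detected frauds as share of all transactions: {fraud_share:.2%}")
```

The estimate assumes that every detected fraud has the average transaction value, so it is an order-of-magnitude figure rather than a precise saving.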

Get to know more about this study

This study is available on Google Colab and on GitHub. Just click on the images below to be redirected.

Let's Connect!
