Fraud Detection using XGBoost: A Machine Learning Approach

The core of the project lies in handling the significant class imbalance typical in fraud detection datasets and optimising model performance to effectively distinguish between legitimate and fraudulent transactions.

The dataset used for this project was sourced from Kaggle and consists of synthetic data designed to simulate real-world financial transactions, including both legitimate and fraudulent cases. This synthetic dataset was selected because it provides a safe and accessible way to explore fraud detection techniques without dealing with sensitive information. You can access the dataset here.

Here's a breakdown of the key steps involved:

1. Data Preparation:

The dataset used contains details of financial transactions, with 'isFraud' being the target variable indicating whether a transaction is fraudulent or not. The data is pre-processed by:

  • Dropping unnecessary columns: Features such as nameOrig and nameDest are removed since they are identifiers that don’t contribute meaningfully to fraud detection.
  • One-hot encoding categorical features: The type column, which categorises the type of transaction, is transformed into numerical format via one-hot encoding.
  • Scaling numerical features: Transaction amounts and balances are standardised using the StandardScaler to ensure that all numerical values are on the same scale, which is important for the model’s performance.

2. Splitting the Dataset:

The cleaned dataset is divided into training and test sets using an 80/20 split. This ensures that the model is trained on a portion of the data and evaluated on unseen data, giving a reliable measure of performance.

3. Addressing Class Imbalance:

Fraud detection datasets are notoriously imbalanced, meaning that fraudulent transactions are much rarer than legitimate ones. To address this, a scale_pos_weight parameter is calculated, which adjusts the weight of the fraud class in the model to balance the bias towards the majority class.

4. XGBoost Model Training:

XGBoost, a powerful gradient boosting algorithm, is used as the main model due to its strength in handling imbalanced datasets and capturing complex patterns in the data. The initial model is trained with the calculated scale_pos_weight, and predictions are made on the test set. Standard metrics like accuracy and a classification report (including precision, recall, and F1-score) are generated to evaluate the model’s performance.

5. Hyperparameter Optimisation:

To further improve the model, a RandomizedSearchCV is performed to find the best combination of hyperparameters. This step involves tuning parameters such as max_depth, learning_rate, and n_estimators to enhance the model’s ability to accurately identify fraudulent transactions. The search results in a more optimised version of the XGBoost model, which is then retrained and re-evaluated on the test set.

6. Threshold Tuning:

Given that fraud detection often prioritises precision (to reduce false positives), different decision thresholds are explored. Typically, a model will predict a transaction as fraudulent if its probability exceeds 0.5, but this threshold can be adjusted. The project tests various thresholds between 0.1 and 0.95 to find an optimal balance between precision and recall, depending on the business’s risk tolerance.

7. Results:

The project concludes by examining the performance of the optimised model both at the standard 0.5 threshold and with a higher threshold (e.g., 0.9), which increases precision but might sacrifice some recall. Metrics like accuracy, precision, recall, and F1-score are compared across different thresholds to determine the best trade-off for identifying fraud while minimising false positives.

Final Outcome:

This project demonstrates how a combination of robust data preprocessing, thoughtful handling of class imbalance, hyperparameter tuning, and threshold adjustment can result in a highly effective model for fraud detection. The approach leverages the power of XGBoost and precision-tuning techniques to ensure the model balances identifying fraudulent transactions with minimising false alarms.

Please Note: My full project write-up is highly detailed, covering all steps, decisions, and methods used to develop this model. If you're interested in a deeper dive, please continue reading. Alternatively, for a quick overview, check out the TL;DR version below:

TL;DR In this project, I developed a machine learning model using XGBoost to detect fraudulent transactions. Key steps included data preparation, handling class imbalance, hyperparameter tuning, and threshold optimisation. The model achieved an accuracy of 99.96%, with a precision of 0.74 and a recall of 0.96 at the selected threshold of 0.9, balancing fraud detection with minimising false positives.

In this fraud detection project, my goal was to build a machine learning model that accurately identifies fraudulent transactions. To achieve this, I needed to import several key Python libraries, each playing a vital role in handling data, training the model, and evaluating its performance.

First, I imported pandas to load and process the dataset. Fraud detection datasets typically contain both numerical and categorical features, along with irrelevant columns. Using pandas, I was able to efficiently clean the data by dropping unnecessary columns, encoding categorical variables, and preparing the dataset for machine learning.

Next, I chose XGBoost as the core model for this project. XGBoost is well-suited for fraud detection, especially in cases where the dataset is imbalanced, meaning fraudulent transactions are far less common than legitimate ones. By using XGBoost’s boosting mechanism, I could handle this imbalance and ensure that the model was focused on identifying fraud while not being overwhelmed by legitimate transactions.

To ensure the model generalised well to new data, I used train_test_split from sklearn.model_selection to split the dataset into training and testing sets. This was crucial in preventing overfitting, as I needed to evaluate the model’s performance on data it hadn’t seen before, simulating real-world fraud detection scenarios.

Finally, I evaluated the model using accuracy_score and classification_report from sklearn.metrics. While accuracy gives an overall sense of correctness, I focused on more detailed metrics like precision and recall, which were necessary to fine-tune the model. Precision helped me measure how many detected fraud cases were actually fraudulent, and recall told me how many real fraud cases the model successfully caught. These metrics were key to ensuring the model performed well in identifying fraud while minimising false positives.

In summary, I combined efficient data manipulation, a robust machine learning algorithm with XGBoost, and detailed evaluation metrics to create a model that balances precision and recall, providing an effective solution for detecting fraudulent transactions.

In this step, I loaded the dataset using pandas with the read_csv function. The dataset, stored in a CSV file named transactions.csv, contains various transaction records that will be analysed to detect fraudulent activity.

Loading the dataset is the first crucial step in any data science project because it allows me to work with the data directly in a structured format. With pandas, the data is easily accessible as a DataFrame, which makes it straightforward to perform operations such as cleaning, transforming, and preparing the data for machine learning. After loading the data, I could begin inspecting and manipulating the transactions, which include fields like transaction amounts, balances, and the fraud indicator (isFraud), all of which are vital for building the fraud detection model.
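
A minimal sketch of this loading step (assuming the CSV sits in the working directory) might look like the following:

```python
import pandas as pd

# Load the synthetic transactions dataset into a DataFrame
df = pd.read_csv("transactions.csv")
```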

In this step, I performed Exploratory Data Analysis (EDA) to gain an initial understanding of the dataset, which is crucial before proceeding with data cleaning, feature engineering, or model building. EDA allows me to get a sense of the structure of the data, inspect the types of features available, and identify any potential issues such as missing values or irrelevant columns.

Understanding the Dataset:

  1. Inspecting the First Few Rows: By using df.head(), I printed the first few rows of the dataset. This provided me with a snapshot of the data, allowing me to visually inspect the features and understand what type of information is being captured in each column. This initial view is helpful in spotting obvious issues like irrelevant columns or unusual data formats.

Getting an Overview of the Dataset:

  • Next, by using df.info(), I inspected the data types of each column, checked for missing values, and reviewed the overall structure of the dataset. This command provided an overview of the number of rows, columns, and data types, helping me determine how to handle categorical and numerical data and whether any further cleaning steps were necessary.
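
The inspection itself reduces to two calls, roughly as sketched below:

```python
# Peek at the first five rows to see the features and example values
print(df.head())

# Summarise column dtypes, non-null counts, and overall structure
df.info()
```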

Key Insights from EDA:

  • Unique Identifiers: The columns nameOrig and nameDest are likely unique identifiers for the origin and destination accounts involved in each transaction. Since they are strings and unique to each transaction, they would not provide meaningful patterns for the model. Therefore, I could conclude that these columns may need to be dropped later to avoid unnecessary complexity in the model.
  • Feature Types: The type column is categorical and would need to be one-hot encoded to transform it into a numerical format for machine learning. Additionally, numerical columns such as amount, oldbalanceOrg, and newbalanceOrig could benefit from scaling to ensure all features are on a comparable scale.
  • No Immediate Missing Data: At first glance, no columns appear to have missing values. However, more detailed checks might be needed to ensure there are no hidden missing values (e.g., zeros that might need to be treated as missing in certain contexts).

Conclusion:

By performing this initial exploratory data analysis, I was able to understand the structure of the dataset and identify potential issues. Based on this analysis, I confirmed that columns like nameOrig and nameDest are likely unnecessary for the model, and they could be dropped in subsequent steps. Additionally, I recognised the need to encode categorical variables and scale the numerical features to prepare the dataset for model training. This EDA step provided me with the necessary information to proceed confidently with data cleaning and preprocessing.

From the Exploratory Data Analysis (EDA) output, I was able to gather important insights into the dataset that informed the next steps in data preprocessing and model development. Here’s a breakdown of the key findings:

Output Overview:

  1. First Few Rows of the Dataset (df.head()): The df.head() command provides a snapshot of the first five rows of the dataset. It shows the features and some example values for each column:
  • step: Represents the time step of the transaction.
  • type: The type of transaction (e.g., PAYMENT, TRANSFER, CASH_OUT).
  • amount: The amount involved in the transaction.
  • nameOrig and nameDest: The origin and destination account names/IDs, which appear to be unique identifiers.
  • oldbalanceOrg and newbalanceOrig: The balance of the origin account before and after the transaction.
  • oldbalanceDest and newbalanceDest: The balance of the destination account before and after the transaction.
  • isFraud: A binary indicator of whether the transaction is fraudulent (1 for fraud, 0 for non-fraud).
  • isFlaggedFraud: Indicates whether the transaction was flagged as fraudulent by an external system (also binary).

From the output, it’s clear that columns like nameOrig and nameDest are unique identifiers and are likely irrelevant for model training because they do not contain generalisable patterns for detecting fraud. These would be dropped in later steps to simplify the dataset.

  2. Dataset Structure (df.info()): The df.info() output provides a summary of the dataset, including the data types and the number of entries in each column. Key points from this output:
  • The dataset contains 6,362,620 rows and 11 columns.
  • There are no missing values in any of the columns, as the number of non-null entries equals the number of rows.
  • The dataset contains a mix of float64 (for numerical columns), int64 (for categorical binary indicators like isFraud and isFlaggedFraud), and object types (for string/categorical values like type, nameOrig, and nameDest).

Key Insights from the EDA Output:

  1. Unique Identifiers: As seen in the first few rows, the columns nameOrig and nameDest are likely unique to each transaction or account. Since these columns don't hold meaningful patterns for fraud detection, they would need to be removed to avoid unnecessary complexity and overfitting.
  2. Categorical Variables: The type column is categorical and represents the type of transaction. Before training the model, this column would need to be one-hot encoded to transform it into a numerical format that the model can process effectively.
  3. Fraud Indicators: The isFraud column is the target variable for this project, with 1 indicating fraudulent transactions and 0 representing legitimate transactions. The isFlaggedFraud column might serve as a secondary indicator to identify transactions that were flagged by an external system, but further analysis would determine its usefulness.
  4. No Missing Data: The dataset does not contain any missing values, so no imputation is required. This allows me to proceed directly with data preprocessing, such as scaling and encoding, without worrying about filling in gaps.
  5. Numerical Features: The numerical columns, such as amount, oldbalanceOrg, and newbalanceOrig, would need to be scaled before training the model, especially because they cover a wide range of values. This step ensures that features like transaction amounts are on a comparable scale, which helps the model converge more effectively during training.

Conclusion:

The EDA output provided valuable insights into the dataset’s structure, revealing that certain columns (e.g., nameOrig and nameDest) are unnecessary for model training and should be dropped. Additionally, the categorical type column needs to be transformed via one-hot encoding, and the numerical columns will require scaling. Importantly, the dataset does not contain missing values, which simplifies the data preparation process. These findings paved the way for efficient preprocessing and ultimately guided the development of a high-performing fraud detection model.

At this stage, I imported StandardScaler from sklearn.preprocessing, which I would later use to scale the numerical features of the dataset. Scaling ensures that features like transaction amounts and balances are on a similar scale, preventing any one feature from dominating the model’s learning process.

Next, I cleaned the dataset by dropping unnecessary columns, specifically nameOrig and nameDest. These columns represent the origin and destination of each transaction, which are just identifiers with no direct relevance to detecting fraud. Including them could introduce noise into the model without contributing useful predictive information.

By dropping these columns, I simplified the dataset and focused on features that are more likely to help the model detect patterns indicative of fraud. At this point, I was working towards shaping the data into a cleaner format, preparing it for scaling and further processing.
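
A short sketch of this cleaning step (the df_clean name is carried through the rest of the write-up):

```python
# Drop the identifier columns, which carry no generalisable fraud signal
df_clean = df.drop(columns=["nameOrig", "nameDest"])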

In this step, I used pandas' get_dummies function to one-hot encode the categorical column type, which represents the type of transaction (such as transfer, payment, etc.). One-hot encoding is a technique that transforms categorical variables into a format suitable for machine learning algorithms, which generally expect numerical inputs.

By using one-hot encoding, I created new binary columns for each category in the type column. For example, if the transaction type is "transfer," this category would be represented as a new column with values of 0 or 1, depending on whether the transaction matches that type. I also used the drop_first=True parameter, which drops one of the categories to avoid multicollinearity, ensuring that the model doesn't interpret the original and encoded categories as separate, unrelated features.

This step was important because categorical variables like transaction type can provide meaningful information for detecting fraud. By encoding them numerically, I allowed the model to consider these transaction types in a way it understands, improving its ability to learn from the data.
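
In code, this is a single get_dummies call, roughly:

```python
# One-hot encode the transaction type, dropping the first category
# to avoid multicollinearity between the dummy columns
df_clean = pd.get_dummies(df_clean, columns=["type"], drop_first=True)
```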

At this point, I applied StandardScaler to scale the numerical columns in the dataset. The columns I selected for scaling were amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, and newbalanceDest. These columns represent the transaction amount and the balances before and after the transaction for both the origin and destination accounts.

Scaling is important because machine learning algorithms like XGBoost perform better when the numerical features are on a similar scale. Without scaling, features with larger values, such as transaction amounts, could disproportionately influence the model compared to features with smaller values, like balances.

By using StandardScaler, I transformed these numerical columns so that they have a mean of 0 and a standard deviation of 1, standardising the data. This made it easier for the model to learn from the data and treat each feature equally, preventing any large value from dominating the learning process. This step was key in ensuring that the dataset was now fully prepared for model training.
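
A sketch of the scaling step, assuming the numerical column names listed above:

```python
from sklearn.preprocessing import StandardScaler

numeric_cols = ["amount", "oldbalanceOrg", "newbalanceOrig",
                "oldbalanceDest", "newbalanceDest"]

# Standardise the numerical columns to zero mean and unit variance
scaler = StandardScaler()
df_clean[numeric_cols] = scaler.fit_transform(df_clean[numeric_cols])
```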

At this stage, I used df_clean.head() to display the first few rows of the cleaned and preprocessed dataset. This allowed me to visually inspect the data and ensure that the transformations—such as dropping unnecessary columns, one-hot encoding the type column, and scaling the numerical values—were applied correctly.

This step is essential because it gives a quick snapshot of the dataset's structure, helping me confirm that all features are in the right format for the next stage of the project: model training. By checking the output, I could verify that the data was ready, with the isFraud column still intact as the target variable, and the other features properly encoded and scaled. This final check helps avoid potential issues during the modelling phase and ensures that I’m working with clean and consistent data.


After displaying the cleaned dataset, I inspected the output, which showed the first few rows of the data. The table confirmed that the dataset had been properly transformed, with the following key columns now prepared for machine learning:

  • step: Likely indicating the time step or sequence of transactions.
  • amount: The scaled transaction amount, ensuring all values are on a comparable scale.
  • oldbalanceOrg: The scaled balance of the origin account before the transaction.
  • newbalanceOrig: The scaled balance of the origin account after the transaction.
  • oldbalanceDest: The scaled balance of the destination account before the transaction.
  • newbalanceDest: The scaled balance of the destination account after the transaction.
  • isFraud: The target variable, where 1 indicates a fraudulent transaction, and 0 indicates a legitimate one.
  • isFlaggedFraud: This column remains mostly zeros and might indicate whether a transaction was flagged for review, though it’s not the primary focus here.
  • One-hot encoded columns: The transaction types (CASH_OUT, DEBIT, PAYMENT, and TRANSFER) have been successfully encoded into binary columns (True or False), allowing the model to consider transaction type as a relevant feature.

This step was critical in ensuring that all features, including both numerical and categorical data, are properly prepared for the upcoming modelling phase. The scaled numerical columns and the one-hot encoded categorical variables are now in the right format for training the model.

In this step, I defined the features (X) and the target variable (y) for the machine learning model. This is a crucial step in preparing the data for model training.

Features (X): I selected all columns from the cleaned dataset except the target column (isFraud). This means the features include transaction-related data, such as amount, oldbalanceOrg, newbalanceOrig, and the one-hot encoded transaction types. These features will be used by the model to learn patterns and relationships in the data that may help identify fraudulent transactions.

Target (y): The target variable is the isFraud column, which indicates whether a transaction is fraudulent (1) or legitimate (0). The model will use this column during training to learn what characteristics are associated with fraudulent transactions.

By separating the features from the target, I prepared the dataset for the next phase, which is splitting it into training and test sets. This ensures that the model can be trained to detect fraud based on the available features and evaluated on its ability to predict fraud on unseen data.
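
Expressed in code, the separation of features and target is simply:

```python
# Features: every column except the target
X = df_clean.drop(columns=["isFraud"])

# Target: the fraud indicator
y = df_clean["isFraud"]
```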

In this step, I split the dataset into training and testing sets using train_test_split from sklearn.model_selection. This is a crucial step in machine learning because it ensures that the model is trained on one portion of the data and then evaluated on a separate portion, helping prevent overfitting and ensuring that the model generalises well to unseen data.

  • Training set (80%): I assigned 80% of the data to the training set (X_train and y_train). This is where the model will learn the patterns in the data and the relationship between the features and the target (fraud or not fraud). Training on a larger portion of the data allows the model to capture more patterns, leading to better performance.

  • Testing set (20%): The remaining 20% of the data was set aside as the testing set (X_test and y_test). After the model is trained, it will be evaluated on this unseen data, which gives a realistic measure of how well the model is likely to perform in real-world scenarios.

I also set random_state=42 to ensure that the split is reproducible. By doing this, I ensure that the model will always receive the same training and testing data every time I run the code, making the results consistent and comparable. This step prepared the dataset for model training and evaluation.
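
A sketch of the split described above:

```python
from sklearn.model_selection import train_test_split

# 80/20 split with a fixed seed so the results are reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```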

In this step, I addressed the issue of class imbalance in the dataset by calculating the scale_pos_weight parameter. Class imbalance is common in fraud detection, where fraudulent transactions (the positive class) are much rarer than legitimate ones (the negative class). If the imbalance is not handled properly, the model may focus too heavily on predicting legitimate transactions, failing to detect fraud effectively.

To address this, I calculated the scale_pos_weight by dividing the number of non-fraudulent transactions by the number of fraudulent ones in the training set. This adjustment ensures that the model pays more attention to the minority class (fraudulent transactions) during training.

By using this weight, the model is better equipped to handle the imbalance, improving its ability to identify fraud despite the skewed distribution in the data. This step is critical for ensuring that the model performs well in detecting the rarer fraudulent transactions.
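
The weight is a single ratio, computed roughly as follows:

```python
# Ratio of legitimate to fraudulent transactions in the training set,
# used to up-weight the minority (fraud) class during training
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
```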

At this stage, I initialized the XGBoost classifier, which is the core machine learning model for the project. I specifically configured it to handle the class imbalance by setting the scale_pos_weight parameter, which I calculated earlier. Here’s why each component of the classifier is important:

  • objective='binary:logistic': Since this is a binary classification problem (fraud vs. non-fraud), I set the objective to 'binary:logistic'. This tells the XGBoost model that we are dealing with a classification task with two possible outcomes.

  • scale_pos_weight: This parameter is crucial in handling the class imbalance. By setting the weight to the ratio I calculated earlier, I instructed the model to give more importance to correctly identifying fraudulent transactions, which are the minority class. This helps in improving the model's sensitivity towards fraud detection.
  • random_state=42: Setting a random state ensures the results are reproducible. Every time I run the model, the data will be split and processed the same way, which is important for consistency when tuning or evaluating the model’s performance.

By initializing the XGBoost classifier with these settings, I ensured that the model is optimised to detect fraud effectively, balancing the need for precision and recall even in the presence of class imbalance. Now, the classifier is ready to be trained on the dataset.
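
A minimal sketch of this initialisation (exact keyword defaults may differ between XGBoost versions):

```python
from xgboost import XGBClassifier

# Baseline XGBoost classifier, weighted towards the fraud class
xgb_model = XGBClassifier(
    objective="binary:logistic",
    scale_pos_weight=scale_pos_weight,
    random_state=42,
)
```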

In this step, I trained the XGBoost model using the fit method, which involved feeding the model the training data (X_train and y_train). During this process, the model learned patterns in the data that distinguish fraudulent transactions from legitimate ones.

  • X_train: This contains the features (such as transaction amount, balances, and transaction type) from the training set. The model uses these features to understand which characteristics are common in both fraudulent and non-fraudulent transactions.

  • y_train: This is the target variable (whether a transaction is fraudulent or not) for the corresponding training examples. By learning the relationship between the features in X_train and the target in y_train, the model builds a set of decision rules for predicting fraud.

The model fitting process involves XGBoost's gradient boosting technique, where it builds a series of decision trees, each one improving on the errors made by the previous ones. Since I also included the scale_pos_weight parameter, the model gave more weight to the minority class (fraudulent transactions), ensuring it doesn't overlook them during training.

At the end of this step, the model had been trained and was now ready to be tested on the unseen data to evaluate its effectiveness in detecting fraud.
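
In code, the training call is a single line:

```python
# Fit the boosted trees on the training portion of the data
xgb_model.fit(X_train, y_train)
```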

In this step, I used the trained XGBoost model to make predictions on the test set (X_test). This is the critical phase where the model is evaluated on data it has never seen before, providing a realistic measure of how well it performs in identifying fraudulent transactions.

  • X_test: This is the set of features from the 20% of the data that was held back during the training process. The model uses these features to make predictions about whether each transaction is fraudulent or legitimate.

  • y_pred_xgb: The predictions made by the model are stored in y_pred_xgb. For each transaction in the test set, the model predicts either 1 (fraud) or 0 (non-fraud). These predictions will be compared to the actual values in the test set (y_test) to assess the model's performance.

At this point, the model has made its predictions, and the next step involves evaluating how accurate and effective these predictions are in detecting fraud, especially when dealing with the imbalanced nature of the data.

In this step, I evaluated the performance of the XGBoost model by comparing its predictions (y_pred_xgb) to the actual outcomes in the test set (y_test). To do this, I used two key metrics: accuracy and the classification report.

  • Accuracy: I calculated the accuracy of the model using accuracy_score. This metric gives the percentage of correct predictions the model made out of the total predictions. While accuracy is useful as an initial measure, it can be misleading in cases of class imbalance, as a model could achieve high accuracy simply by predicting the majority class (non-fraud) most of the time. Therefore, accuracy alone is not sufficient for evaluating a fraud detection model.
  • Classification Report: The classification_report provides a much deeper insight into the model’s performance. It includes:
    • Precision: The percentage of transactions the model flagged as fraud that were actually fraudulent. High precision means fewer false positives.
    • Recall: The percentage of actual fraudulent transactions that the model successfully identified. High recall means fewer false negatives.
    • F1-score: A balanced measure that considers both precision and recall, giving a single score to assess the model’s effectiveness.

After calculating these metrics, I printed out the results:

  • Accuracy with XGBoost: This gives a quick idea of how well the model performed overall.
  • Classification Report: This gives a detailed view of how well the model handled both the minority class (fraud) and the majority class (non-fraud).

At this point, I had a clearer understanding of how well the XGBoost model was performing, especially in terms of its ability to detect fraud while balancing false positives and false negatives. The next steps could involve fine-tuning the model for even better performance based on these metrics.
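
Putting the prediction and evaluation steps together, a sketch could read:

```python
from sklearn.metrics import accuracy_score, classification_report

# Predict fraud / non-fraud labels for the held-out test set
y_pred_xgb = xgb_model.predict(X_test)

print("Accuracy with XGBoost:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))
```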


Upon reviewing the model's evaluation results, I found that the XGBoost model achieved an overall accuracy of 99.85%. While this high accuracy is encouraging, it’s crucial to dive deeper into the classification report to get a better understanding of how well the model is performing, especially in detecting fraudulent transactions.

  • Class 0 (Non-fraudulent transactions):
    • Precision: 1.00 – Virtually every transaction the model predicted as non-fraudulent was indeed legitimate.
    • Recall: 1.00 – The model correctly captured essentially all non-fraudulent transactions.
    • F1-score: 1.00 – A perfect score here indicates a strong balance between precision and recall for legitimate transactions.
  • Class 1 (Fraudulent transactions):
    • Precision: 0.46 – Only 46% of the transactions flagged as fraudulent were actually fraud, meaning there is a substantial number of false positives (legitimate transactions incorrectly flagged as fraud).
    • Recall: 0.99 – The model successfully detected 99% of actual fraudulent transactions, which is excellent for reducing false negatives (fraud cases that go undetected).
    • F1-score: 0.63 – The F1-score reflects the balance between the relatively low precision and the high recall for fraudulent transactions.

The overall macro average and weighted average indicate strong performance, but these metrics highlight an important detail: while the model is excellent at identifying legitimate transactions (class 0), its precision for detecting fraud (class 1) needs improvement. The high recall for fraud (0.99) shows that the model captures almost all actual fraud cases, but the lower precision (0.46) means a significant portion of flagged transactions are false positives.

Given these results, the next step would be to focus on fine-tuning the model to improve its precision in detecting fraudulent transactions. This could involve adjusting decision thresholds or exploring hyperparameter tuning to find a better balance. Improving precision is essential to reduce the number of legitimate transactions incorrectly flagged as fraud, without sacrificing the model’s ability to catch fraudulent cases.

In summary, the model is performing well overall, especially in terms of recall for fraud detection. However, the focus should now shift to refining precision to ensure the model strikes the right balance between identifying fraudulent transactions and minimising false positives.

In this step, I introduced hyperparameter tuning using RandomizedSearchCV from sklearn.model_selection. This is a critical step to further optimise the XGBoost model's performance, as finding the best combination of hyperparameters can significantly improve how well the model detects fraudulent transactions.

I defined a parameter grid to sample from during the randomised search. The parameters I chose to tune are essential for controlling the model’s complexity, learning rate, and its ability to handle the imbalanced dataset. Here’s an overview of the parameters:

  • max_depth: This parameter controls the maximum depth of each tree in the XGBoost model. By tuning this, I can control how deep the model’s decision trees can grow. Deeper trees can capture more complex patterns, but they may also overfit the training data. I set a range of values (4, 6, and 8) to explore different levels of tree complexity.

  • learning_rate: This parameter determines the step size at each boosting iteration. A lower learning rate means the model takes smaller steps, which can improve generalisation, but may require more boosting rounds. I set values of 0.01, 0.1, and 0.2 to test both conservative and more aggressive learning rates.

  • n_estimators: This refers to the number of trees (boosting rounds) in the model. A higher number of estimators can help the model capture more patterns but may also lead to overfitting if not tuned carefully. I tested values of 50, 100, and 200 to balance between a lightweight model and one with enough complexity to perform well.

  • scale_pos_weight: I included the scale_pos_weight parameter to continue addressing the class imbalance. This value was calculated earlier to ensure that the minority class (fraud) receives the necessary weight during training.

By defining this parameter grid, I set the stage for the next phase of hyperparameter tuning, where RandomizedSearchCV will test different combinations of these values to identify the best-performing configuration for the XGBoost model. The goal of this process is to improve the model's precision and recall, particularly for the minority class (fraud), while maintaining strong overall performance.
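
The grid described above can be expressed as a plain dictionary, for example (the variable name param_grid is mine):

```python
# Candidate values sampled by the randomised search;
# scale_pos_weight was computed earlier from the class ratio
param_grid = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.01, 0.1, 0.2],
    "n_estimators": [50, 100, 200],
    "scale_pos_weight": [scale_pos_weight],
}
```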

In this step, I set up RandomizedSearchCV to begin the hyperparameter tuning process. RandomizedSearchCV is a powerful technique for finding the best combination of hyperparameters by sampling from a given parameter grid. Here's how I configured it for the XGBoost model:

  • Estimator (xgb_model): I used the previously trained XGBoost model as the base estimator. This means RandomizedSearchCV will experiment with different hyperparameter combinations on this model to find the best settings.

  • param_distributions: This refers to the parameter grid I defined earlier, which includes values for max_depth, learning_rate, n_estimators, and scale_pos_weight. RandomizedSearchCV will randomly sample from these parameter combinations during the tuning process.

  • n_iter=10: I set the number of iterations to 10, meaning RandomizedSearchCV will test 10 different combinations of hyperparameters. This is a faster alternative to an exhaustive search but still gives a good range of options to explore.

  • scoring='precision': I chose to score the model based on precision. Since detecting fraudulent transactions accurately is crucial, I wanted to focus on improving precision, which measures how many of the transactions flagged as fraud are actually fraudulent. This is important for reducing false positives and ensuring the model doesn't unnecessarily flag legitimate transactions as fraud.

  • cv=3: I used 3-fold cross-validation to evaluate each set of hyperparameters. Cross-validation helps ensure that the model generalises well by training and testing on different subsets of the data multiple times. This adds robustness to the results, preventing the model from overfitting to any particular subset.

  • verbose=1: This parameter provides detailed output during the search process, allowing me to track the progress and see which combinations are being tested.

  • n_jobs=-1: I set this to use all available CPU cores, which speeds up the tuning process by running multiple experiments in parallel.

  • random_state=42: By setting a random state, I ensured that the results of the search would be reproducible, meaning the same combinations of hyperparameters would be tested each time.

With this configuration, RandomizedSearchCV was ready to begin exploring different hyperparameter combinations. The goal of this process is to find the set of hyperparameters that maximises precision while maintaining the model’s ability to detect fraudulent transactions effectively. This step is key to fine-tuning the model and improving its performance, especially in terms of reducing false positives.
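
Configured as described, the search object looks roughly like this:

```python
from sklearn.model_selection import RandomizedSearchCV

# Randomised search over the grid, scored on precision with 3-fold CV
random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_grid,
    n_iter=10,
    scoring="precision",
    cv=3,
    verbose=1,
    n_jobs=-1,
    random_state=42,
)
```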

In this step, I executed the random search by calling the fit method on the random_search object, which began the process of hyperparameter tuning. This is where RandomizedSearchCV tested various combinations of hyperparameters using the training data (X_train and y_train), evaluating each combination based on the precision score.

Here's what happened during this step:

  • Training and Testing: For each hyperparameter combination, the model was trained on different subsets of the training data and evaluated using cross-validation. This means that the training data was split into three parts (due to cv=3), where the model was trained on two parts and tested on the remaining part. This process was repeated for all three parts, and the average precision score was calculated for each hyperparameter combination.

  • Precision Focus: Since I set scoring='precision', RandomizedSearchCV prioritised precision in its evaluation. It aimed to find the combination of hyperparameters that maximised the precision of the model, ensuring that the model's predictions of fraud cases were as accurate as possible, minimising false positives.

  • Efficient Search: Unlike GridSearchCV, which tests all possible combinations of hyperparameters, RandomizedSearchCV randomly selects combinations from the defined grid. This is more efficient, allowing the search to explore a wide range of options without needing to test every possible combination.

After this step, RandomizedSearchCV would have evaluated 10 different sets of hyperparameters, using the precision score as the key metric. The next step would be to extract the best-performing hyperparameters and use them to build an improved version of the XGBoost model, further enhancing its ability to detect fraud while reducing false alarms.

In this step, I retrieved the best hyperparameters identified by RandomizedSearchCV after testing various combinations. By calling random_search.best_params_, I was able to extract the optimal values for the model's key parameters based on the precision score.

  • best_params: This object contains the best combination of hyperparameters that resulted in the highest precision during the random search process. These are the values that will help the model perform better at identifying fraudulent transactions while minimising false positives.

After printing the best hyperparameters, I now had a clear understanding of which parameter settings would yield the most effective model. These parameters could be applied to create an improved version of the XGBoost classifier, ensuring that it achieves a better balance between detecting fraud and reducing false positives.

At this stage, I was ready to initialise a new XGBoost model with the optimised parameters and retrain it to observe the impact of these changes on its performance.
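
The search run and parameter retrieval reduce to a couple of calls:

```python
# Run the search on the training data and inspect the winning combination
random_search.fit(X_train, y_train)

best_params = random_search.best_params_
print(best_params)
```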

The best hyperparameters identified through RandomizedSearchCV, as shown in the output, were:

  • scale_pos_weight: 771.05
  • n_estimators: 200
  • max_depth: 8
  • learning_rate: 0.1

These values were selected based on the precision score, which means they are geared towards improving the model’s ability to detect fraudulent transactions while reducing the number of false positives. Here’s a brief breakdown of each parameter:

  • scale_pos_weight: This value adjusts the weight for the minority class (fraudulent transactions), helping the model focus more on detecting fraud while dealing with the class imbalance. The large value reflects the significant imbalance in the dataset.
  • n_estimators (200): The model will build 200 decision trees during the boosting process, allowing it to capture more complex patterns in the data. Increasing the number of estimators generally enhances the model’s accuracy, though it can also make the model more prone to overfitting if not tuned properly.
  • max_depth (8): Setting the maximum depth of the trees to 8 allows the model to learn deeper, more detailed decision rules. This helps in capturing complex relationships in the data but also ensures the model doesn't overfit by keeping the trees moderately deep.
  • learning_rate (0.1): This learning rate controls how much the model updates with each boosting iteration. A learning rate of 0.1 is a balanced choice, allowing the model to learn effectively without making large, potentially destabilising updates during training.

With these hyperparameters, I was ready to initialise a new XGBoost model and retrain it using these optimal settings. The next step involved observing how the model's performance improved, particularly in terms of precision, recall, and the overall balance between detecting fraud and reducing false positives.

At this stage, I used the best hyperparameters identified by RandomizedSearchCV to initialise a new XGBoost model, ensuring it was configured to perform optimally for fraud detection.

Here’s how the model was set up:

  • objective='binary:logistic': This specifies that the model is a binary classification model, suited for predicting two possible outcomes: fraud (1) or non-fraud (0).
  • scale_pos_weight=771.05: This value, derived from the RandomizedSearchCV process, ensures the model is well-balanced to address the significant class imbalance in the dataset. It tells the model to place more emphasis on correctly identifying fraudulent transactions.
  • n_estimators=200: The model will use 200 decision trees in the boosting process, allowing it to capture complex patterns in the data.
  • max_depth=8: With a max depth of 8, the model can build moderately deep trees to capture important decision rules without overfitting.
  • learning_rate=0.1: A learning rate of 0.1 helps the model learn at a steady, controlled pace, balancing the speed of learning with accuracy.
  • random_state=42: This ensures that the model’s behaviour is reproducible, allowing me to consistently replicate the results.

By initialising the XGBoost model with these hyperparameters, I ensured that the model would be better equipped to handle the complexities of the fraud detection task, particularly by improving precision and reducing false positives. The next step was to train this optimised model and then evaluate its performance on the test data to see how well the improvements translated into results.
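
A sketch of the re-initialisation with the tuned values (the variable name best_xgb is mine):

```python
# Re-initialise XGBoost with the parameters found by the search
best_xgb = XGBClassifier(
    objective="binary:logistic",
    scale_pos_weight=771.05,
    n_estimators=200,
    max_depth=8,
    learning_rate=0.1,
    random_state=42,
)
```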

In this step, I trained the XGBoost model that was initialised with the best hyperparameters obtained from RandomizedSearchCV. Using the fit method, I trained the model on the training data (X_train and y_train), allowing it to learn the patterns that differentiate fraudulent transactions from legitimate ones.

  • X_train: This contains the features from the training set, which includes transaction details such as amounts, balances, and one-hot encoded transaction types. These features help the model understand what factors contribute to fraud.

  • y_train: This is the target variable, indicating whether each transaction in the training set is fraudulent (1) or legitimate (0). The model uses this information to build a decision-making process that predicts fraud.

The training process involved XGBoost’s boosting technique, where multiple decision trees are built iteratively. Each new tree focuses on correcting the errors made by the previous trees. With the tuned hyperparameters, including scale_pos_weight, the model was better suited to handling the imbalanced nature of the dataset, ensuring that it pays adequate attention to the minority class (fraud).

Once the model was trained, it was ready to make predictions on the test set to evaluate how well it generalises to unseen data. This step is crucial in understanding how the optimised model performs in real-world scenarios and whether the improvements in hyperparameter tuning translate into better fraud detection.

In this step, I focused on understanding which features had the most impact on the model’s decision-making process by extracting the feature importance from the optimised XGBoost model. This is an important step in model interpretation, as it helps me identify which features are contributing the most to the detection of fraudulent transactions.

Here’s a breakdown of the process:

  1. Extracting Feature Importance: I used the feature_importances_ attribute from the trained XGBoost model to get the importance scores for each feature in the dataset. These scores represent how much each feature contributes to the model’s decision-making process. Features with higher importance scores have a greater impact on the model’s predictions.

  2. Creating a DataFrame for Visualisation: To make the feature importance more understandable, I created a DataFrame (importance_df) that combines the feature names and their respective importance scores. I then sorted the DataFrame by importance in descending order, so the most influential features appear at the top.

  3. Visualising Feature Importance with Plotly: I used Plotly Express to create an interactive bar chart that visualises the feature importance. The chart displays the features on the y-axis and their importance scores on the x-axis. This allows me to quickly see which features are the most important in predicting fraudulent transactions. The interactive chart also includes hover information, where users can see the importance score with precision.

  4. Customising the Plot: I customised the layout of the plot by ensuring that the y-axis was sorted in ascending order of total importance and adjusted the size of the chart for better visibility. This makes the chart easier to interpret and visually appealing.

  5. Saving the Plot: I saved the interactive plot as an HTML file (feature_importance_plot.html). This file can be opened in any web browser, allowing stakeholders or colleagues to explore the feature importance interactively.

This visualisation provides a clear understanding of which features the model relies on most to detect fraud. It also serves as a useful tool for communicating model behaviour to others. By identifying the most important features, I can gain insights into which aspects of the transactions (such as amounts, balances, or transaction types) are most indicative of fraud, helping to further refine the model or even adjust business rules around these insights.
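
A condensed version of this workflow, assuming the tuned model is named best_xgb (the exact chart styling here is illustrative):

```python
import pandas as pd
import plotly.express as px

# Pair each feature with its importance score and sort descending
importance_df = pd.DataFrame({
    "feature": X_train.columns,
    "importance": best_xgb.feature_importances_,
}).sort_values("importance", ascending=False)

# Interactive horizontal bar chart of feature importance
fig = px.bar(importance_df, x="importance", y="feature", orientation="h")
fig.update_layout(yaxis={"categoryorder": "total ascending"},
                  width=900, height=600)
fig.write_html("feature_importance_plot.html")
```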

The feature importance plot, as visualised above, highlights the key features that contributed most to the XGBoost model’s decision-making when predicting fraudulent transactions. Here's a detailed breakdown of the most important features:

  1. newbalanceOrig: This feature stands out with the highest importance score by a significant margin. It indicates that the balance of the origin account after the transaction is a critical factor in determining fraud. This makes intuitive sense because unusual changes in an account's balance post-transaction could signal suspicious activity, which the model has learned to identify.

  2. type_PAYMENT: The second most important feature is the transaction type, specifically PAYMENT. This shows that certain transaction types are more predictive of fraud than others. For instance, payments may be more closely associated with fraudulent patterns, which the model has picked up on.

  3. oldbalanceOrg: The original balance of the origin account before the transaction is also a key feature. This suggests that significant patterns in account balances before transactions occur can be strong indicators of potential fraud.

  4. type_TRANSFER: Similar to PAYMENT, the transaction type TRANSFER also plays a role in the model’s decisions, though its importance is lower. This may reflect that fraud often occurs through transfers between accounts, which the model has recognised.

  5. amount: Although the transaction amount has a smaller importance score compared to balance-related features, it still contributes to the model's ability to identify fraud, especially when considered alongside other features like balances and transaction types.

Other features such as newbalanceDest, isFlaggedFraud, and step were found to have less influence on the model’s predictions, with lower importance scores.

Understanding feature importance is crucial for interpreting how the model makes decisions and provides insights into which aspects of a transaction are most predictive of fraud. This information can be used not only to refine the model further but also to inform business rules around fraud detection. For instance, the high importance of balance-related features suggests that monitoring drastic balance changes could be key in flagging potentially fraudulent transactions in real-time.

In this step, I used the optimised XGBoost model to make predictions on the test data (X_test). After training the model on the training set with the best hyperparameters, this step evaluates how well the model generalises to unseen data.

  • X_test: This contains the features from the test set, representing 20% of the data that the model has not seen during training. The model uses these features (e.g., transaction amounts, balances, and types) to predict whether each transaction is fraudulent (1) or legitimate (0).

  • y_pred_xgb_best: The predictions made by the model are stored in this variable. These predicted values will be compared to the actual outcomes (y_test) in the next step to evaluate the model's performance.

By making predictions on the test data, I was now ready to assess how well the optimised model performs in a real-world context. This step is crucial to ensure that the improvements from hyperparameter tuning translate into better results in identifying fraudulent transactions, especially in terms of precision and recall.

In this step, I evaluated the performance of the XGBoost model that was trained using the best hyperparameters. This evaluation allows me to see how well the model performs on the test data by comparing its predictions (y_pred_xgb_best) to the actual outcomes (y_test).

I used two key metrics for this evaluation:

  • Accuracy: I calculated the overall accuracy of the model using accuracy_score. This gives the percentage of correct predictions the model made out of all test cases. While accuracy is a good general measure, it’s particularly important to look at other metrics, especially in fraud detection, where class imbalance can make accuracy misleading.
  • Classification Report: The classification_report provides a more detailed breakdown of the model's performance. It includes:
    • Precision: The proportion of transactions flagged as fraud that are actually fraudulent. This is critical for reducing false positives, ensuring legitimate transactions are not wrongly flagged.
    • Recall: The proportion of actual fraudulent transactions that the model successfully identified. High recall helps minimise the number of fraudulent transactions that go undetected.
    • F1-score: A balance between precision and recall, giving a single metric that considers both aspects.

By generating these metrics, I could assess the impact of the hyperparameter tuning and whether the optimised model has improved in identifying fraudulent transactions while maintaining an acceptable level of false positives. The next step involves reviewing the results of these evaluations to determine the model's overall effectiveness in real-world fraud detection scenarios.

In this step, I visualised the confusion matrix to better understand how well the optimised XGBoost model performed in predicting fraudulent transactions. The confusion matrix provides a breakdown of the model’s predictions into four key categories, helping to identify how well it handles both fraudulent and non-fraudulent transactions.

Here’s a detailed breakdown of the process:

  1. Confusion Matrix: The confusion matrix (cm) was generated using confusion_matrix(y_test, y_pred_xgb_best). This matrix shows the counts of true negatives, false positives, false negatives, and true positives, i.e. how many legitimate and fraudulent transactions were classified correctly or incorrectly.
  2. Creating the Heatmap: I used Plotly to create an annotated heatmap to visualise the confusion matrix. The heatmap provides a clear, visual representation of how the model performed, with the Blues color scale indicating the intensity of each value. The darker the color, the higher the number of predictions in that category.
  3. Highlighting True Positives: I added an annotation to highlight the True Positives, which are crucial in fraud detection. This helps emphasise the model's success in identifying fraudulent transactions, which is the key focus of this project.
  4. Labels and Layout: I labelled the x-axis as Predicted (what the model predicted) and the y-axis as Actual (the true labels). The matrix is split into two categories—Fraud and Not Fraud—which makes it easy to interpret how well the model performed in distinguishing between fraudulent and legitimate transactions.
  5. Saving the Confusion Matrix Plot: After creating the heatmap, I saved it as an interactive HTML file using fig.write_html("confusion_matrix.html"). This allows the plot to be easily shared and viewed in any web browser, making it accessible for stakeholders or colleagues who want to explore the model's performance interactively.

By visualising the confusion matrix, I can assess the balance between True Positives and False Positives, which directly relates to the precision of the model. This step provides deeper insights into how the model handles fraud detection and where it may need further refinement, especially in reducing false positives or catching more fraudulent cases.
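
A compact sketch of the heatmap construction (Plotly's figure_factory is one way to build an annotated heatmap; the styling details here are illustrative):

```python
import plotly.figure_factory as ff
from sklearn.metrics import confusion_matrix

# Counts of correct and incorrect predictions on the test set
cm = confusion_matrix(y_test, y_pred_xgb_best)

# Annotated heatmap: rows are actual labels, columns are predictions
labels = ["Not Fraud", "Fraud"]
fig = ff.create_annotated_heatmap(
    z=cm, x=labels, y=labels, colorscale="Blues"
)
fig.update_layout(xaxis_title="Predicted", yaxis_title="Actual")
fig.write_html("confusion_matrix.html")
```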

The confusion matrix, as visualised above, provides a detailed look into how the optimised XGBoost model performed on the test set. Here's a breakdown of the results:

  • True Negatives (1269274): These are the legitimate transactions that the model correctly identified as not fraudulent. This large number shows that the model excels in correctly identifying non-fraudulent transactions, which is expected given the significant imbalance in the dataset.

  • False Positives (1630): These are legitimate transactions that the model incorrectly flagged as fraudulent. While this number is relatively small compared to the total number of transactions, reducing these false positives is crucial to avoid unnecessarily blocking legitimate transactions.

  • False Negatives (23): These represent fraudulent transactions that the model failed to detect, mistakenly classifying them as non-fraudulent. Minimising false negatives is vital because undetected fraud can result in financial losses, and this small count is consistent with the 0.99 recall reported for the fraud class.

  • True Positives (1597): These are the fraudulent transactions that the model correctly identified. Fraudulent transactions are rare but critical to catch, and this count, alongside the 1630 false positives, matches the 0.49 precision reported for the fraud class.

This confusion matrix highlights the model's strong ability to identify legitimate transactions and to catch nearly all fraudulent ones, but it also shows that there is room for improvement in precision: roughly half of the transactions flagged as fraud are in fact legitimate. The challenge remains to balance the precision and recall of the model, reducing false positives without missing too many fraud cases. In the next steps, I would explore potential avenues for further refining the model, such as adjusting the decision threshold, to reduce false positives while maintaining a high level of recall.

In this step, I printed the final evaluation results for the XGBoost model, which was trained with the best hyperparameters. This step provides a summary of the model's overall accuracy and the detailed classification report, giving insight into how well the model performed after hyperparameter tuning.

  • Accuracy: The overall accuracy of the model is calculated as the percentage of correct predictions made on the test set. While this is a useful metric, it doesn't provide a complete picture, especially in imbalanced datasets like fraud detection.
  • Classification Report: This report includes key metrics such as:
    • Precision: Measures the proportion of transactions predicted as fraud that are actually fraudulent. High precision is important for reducing false positives (legitimate transactions wrongly flagged as fraud).
    • Recall: Indicates how many actual fraudulent transactions the model successfully identified. High recall ensures that fewer fraud cases go undetected.
    • F1-score: A balance between precision and recall, providing a single score to summarise the model's performance in detecting fraud.

By printing these evaluation metrics, I was able to confirm how well the model's performance improved with the best hyperparameters. This final evaluation helps determine whether the model is effective enough for real-world deployment, striking a balance between catching fraudulent transactions and minimising false positives.

The classification report provides a more detailed breakdown of these metrics for both the fraud and non-fraud classes, allowing me to assess the trade-offs between precision and recall and identify areas where further fine-tuning may be needed.
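As a rough sketch (variable names such as xgb_best_model and X_test follow the conventions used elsewhere in this write-up), the evaluation print-out can be produced like this:

```python
# Hedged sketch: final evaluation of the tuned XGBoost model on the test set.
from sklearn.metrics import accuracy_score, classification_report

y_pred_best = xgb_best_model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred_best):.4f}")
print(classification_report(y_test, y_pred_best, digits=2))
```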

The final evaluation results, as shown in the screenshot, reflect the performance of the XGBoost model after applying the best hyperparameters:

  • Overall Accuracy: 99.87% – This high accuracy suggests that the model is making correct predictions for the vast majority of transactions. However, as previously noted, accuracy alone is not sufficient in fraud detection due to class imbalance.
  • Class 0 (Non-fraudulent transactions): Precision: 1.00 – Virtually every transaction predicted as non-fraudulent was indeed legitimate. Recall: 1.00 – Virtually all legitimate transactions were correctly classified; the comparatively small number of false positives is negligible against the size of this class, so the metric rounds to 1.00. F1-score: 1.00 – The model handles the non-fraudulent class almost perfectly, with precision and recall both rounding to 100%.
  • Class 1 (Fraudulent transactions): Precision: 0.49 – Only 49% of the transactions flagged as fraudulent were actually fraud. This highlights some room for improvement in reducing false positives. Recall: 0.99 – The model successfully identified 99% of the actual fraudulent transactions, demonstrating excellent performance in detecting fraud. F1-score: 0.66 – The F1-score, which balances precision and recall, shows a solid performance for the fraud class, though the precision could be further improved.
  • Macro and Weighted Averages: The macro average of 0.75 for precision and the weighted average of 1.00 reflect the overall balance of the model across both classes.

In summary, the model shows strong recall for detecting fraud, capturing nearly all fraudulent transactions. However, the precision for fraud detection could still be improved, as around half of the flagged fraudulent transactions were false positives. This trade-off between precision and recall is common in fraud detection, where it’s crucial to minimise both missed fraud cases and unnecessary disruptions to legitimate customers. Given these results, I have decided to fine-tune the model further, focusing specifically on improving the precision without sacrificing the model’s strong recall.

In this step, I decided to experiment with different decision thresholds to improve the balance between precision and recall for fraud detection. By default, the XGBoost model classifies a transaction as fraud if its predicted probability is 0.5 or higher. However, adjusting this threshold can change how the model balances false positives and false negatives.

Here’s the process I followed:

Testing Different Thresholds:

I iterated through a range of thresholds from 0.5 to 0.95. For each threshold, I adjusted the predicted probabilities (y_pred_prob_xgb_best) and converted them into binary predictions (fraud or non-fraud). This allowed me to explore how varying the threshold affects the model’s predictions.

Reevaluating the Model:

For each threshold, I recalculated the classification report. This report includes precision, recall, and the F1-score, allowing me to see how the model’s performance shifts as the threshold changes:

  • Lower thresholds (e.g., 0.5) tend to increase recall, catching more fraudulent cases but at the cost of precision, leading to more false positives.
  • Higher thresholds (e.g., 0.9) tend to increase precision, flagging fewer transactions as fraud and reducing false positives, but this may reduce recall, leading to more missed fraud cases.

Evaluating the Trade-Off:

By adjusting the threshold, I aimed to find the optimal balance between precision and recall, particularly in improving precision without sacrificing too much recall. For each threshold, the classification report allowed me to compare these metrics and choose a threshold that best aligns with my project goals.

This process was essential for fine-tuning the model, as it gave me more control over the trade-offs between precision and recall, allowing me to customise the model’s behaviour based on business needs. By testing a range of thresholds, I could determine which one provides the best balance for detecting fraud while minimising false positives.
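A minimal sketch of this threshold sweep, assuming the fraud probabilities are already stored in y_pred_prob_xgb_best, might look like the following:

```python
# Hedged sketch: re-evaluate the model at thresholds from 0.5 to 0.95 in steps of 0.05.
import numpy as np
from sklearn.metrics import classification_report

for threshold in np.arange(0.5, 1.0, 0.05):
    # Convert fraud probabilities into binary predictions at this threshold
    y_pred_at_t = (y_pred_prob_xgb_best >= threshold).astype(int)
    print(f"\n--- Threshold: {threshold:.2f} ---")
    print(classification_report(y_test, y_pred_at_t, digits=2))
```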

Based on the detailed classification reports for different thresholds shown in the screenshots, I was able to evaluate the trade-offs between precision and recall as the threshold increased from 0.5 to 0.95.

Key Observations:

  1. Lower Thresholds (0.5 to 0.6): Recall is very high across all thresholds, maintaining around 98-99%. This means the model is still detecting almost all fraudulent transactions at these thresholds. Precision gradually increases from 0.49 (at 0.5 threshold) to 0.55 (at 0.6 threshold), showing fewer false positives as the threshold increases. F1-score also improves slightly as precision increases, reaching 0.70 at the 0.6 threshold.
  2. Moderate Thresholds (0.65 to 0.75): Precision continues to improve significantly, rising from 0.57 at 0.65 to 0.62 at 0.75. This shows that the model is flagging fewer legitimate transactions as fraud while still maintaining good recall. Recall remains stable, though there’s a slight drop from 98% to 97-98% as the threshold increases, indicating a minor reduction in the model's ability to detect all fraudulent cases. The F1-score improves further, reaching 0.76 at the 0.75 threshold.
  3. Higher Thresholds (0.8 to 0.95): Precision sees the most improvement at higher thresholds, reaching 0.81 at the 0.95 threshold. This is ideal for reducing false positives, making the model more selective when flagging fraud. Recall begins to decrease more noticeably at higher thresholds, especially at 0.9 and 0.95, dropping to around 93-96%. This indicates that the model is now missing some fraudulent transactions as the threshold increases. F1-score peaks at 0.87 at the 0.95 threshold, showing that while precision has improved, there is a slight compromise in recall.

Conclusion:

By adjusting the threshold, I can tailor the model’s behaviour to either focus more on recall (catching more fraud) or precision (reducing false positives). For example:

  • Thresholds between 0.6 and 0.75 seem to offer a good balance, where the model maintains high recall while gradually improving precision.
  • Higher thresholds (0.8 to 0.95) significantly improve precision, but at the cost of recall, meaning the model may miss some fraudulent cases.

Now, I need to determine the optimal threshold that strikes the right balance between precision and recall for my specific fraud detection objectives. Finding this optimal threshold will allow the model to perform most effectively in identifying fraud while controlling false positives.

In this step, I evaluated the precision, recall, and F1-score of the model across various thresholds to help identify the optimal threshold for fraud detection. Adjusting the threshold directly impacts how the model balances between detecting fraudulent transactions and minimising false positives.

Process:

  1. Testing Different Thresholds: I tested thresholds ranging from 0.5 to 0.95, adjusting the predicted probabilities to binary outcomes (fraud or non-fraud) based on each threshold. This allowed me to observe how precision, recall, and the F1-score changed with each threshold.
  2. Recording Performance Metrics: For each threshold, I calculated the precision, recall, and F1-score for the fraud class.
  3. Visualising the Results: I plotted the results using Plotly, adding lines for precision, recall, and the F1-score. The chart helps visualise how each metric changes as the threshold increases.
  4. Highlighting the Chosen Threshold: I annotated the chart to highlight 0.9 as the chosen threshold, based on its performance in balancing precision and recall. This threshold seemed to provide a good compromise between catching most fraud cases and minimising false positives.
  5. Customising the Visualisation: I customised the plot’s layout, including a white background for the paper and a light grey plot background to enhance readability. The font and dimensions were also adjusted to make the chart clear and easy to interpret.
  6. Saving the Threshold Plot: After creating the visualisation, I saved it as an interactive HTML file using fig.write_html("thresholds_precision_recall_f1.html"). This allows the plot to be easily shared and explored interactively in any web browser, making it accessible for detailed analysis by stakeholders.

Conclusion: By visualising how precision, recall, and F1-score change at different thresholds, I was able to determine the best threshold for the model. Based on this analysis, I selected a threshold of 0.9 as the optimal point that strikes a good balance between minimising false positives (high precision) while still catching most fraudulent cases (good recall). This visualisation provides an intuitive way to understand how model performance shifts with different thresholds, aiding in making an informed decision about the final threshold to use.
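The plotting code isn't shown in full in the article, but a simplified sketch of how the three metric curves and the 0.9 marker could be assembled with Plotly is below. The per-threshold metrics are recomputed with scikit-learn; variable names are assumptions consistent with the rest of this write-up.

```python
# Hedged sketch: precision, recall and F1-score plotted against the decision threshold.
import numpy as np
import plotly.graph_objects as go
from sklearn.metrics import precision_score, recall_score, f1_score

thresholds = np.arange(0.5, 1.0, 0.05)
precisions, recalls, f1s = [], [], []

for t in thresholds:
    preds = (y_pred_prob_xgb_best >= t).astype(int)
    precisions.append(precision_score(y_test, preds))
    recalls.append(recall_score(y_test, preds))
    f1s.append(f1_score(y_test, preds))

fig = go.Figure()
fig.add_trace(go.Scatter(x=thresholds, y=precisions, name="Precision"))
fig.add_trace(go.Scatter(x=thresholds, y=recalls, name="Recall"))
fig.add_trace(go.Scatter(x=thresholds, y=f1s, name="F1-score"))

# Mark the chosen threshold of 0.9
fig.add_vline(x=0.9, line_dash="dash", annotation_text="Chosen threshold (0.9)")

fig.update_layout(
    title="Precision, Recall and F1-score vs Decision Threshold",
    xaxis_title="Threshold",
    yaxis_title="Score",
    paper_bgcolor="white",
    plot_bgcolor="lightgrey",
)
fig.write_html("thresholds_precision_recall_f1.html")
```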

The visualisation above presents the Precision, Recall, and F1-score across different decision thresholds for the XGBoost model, ranging from 0.5 to 0.95. The goal of this analysis is to assess how these key performance metrics change as the threshold is adjusted and to determine the best threshold for fraud detection.

Key Observations:

  1. Precision (Blue Line): As the threshold increases, precision improves steadily. This is expected, as raising the threshold makes the model more conservative in flagging transactions as fraud. At the chosen threshold of 0.9, precision reaches approximately 0.81, meaning 81% of the transactions flagged as fraudulent are indeed fraudulent. This reduction in false positives is important for minimising disruption to legitimate transactions.
  2. Recall (Green Line): Recall remains high across all thresholds but begins to slightly decline as the threshold increases. This is because the model becomes less aggressive in identifying fraudulent transactions at higher thresholds, leading to more fraud cases going undetected. At the chosen threshold of 0.9, recall remains robust at around 0.93, indicating that the model still captures 93% of actual fraud cases, which is an impressive trade-off given the increased precision.
  3. F1-Score (Orange Line): The F1-score, which balances precision and recall, rises steadily and peaks near the higher thresholds. At the chosen threshold of 0.9, the F1-score reaches approximately 0.87, indicating a strong balance between precision and recall. This suggests that the model is performing well in identifying fraud while keeping false positives at a manageable level.

Analysis:

The visualisation highlights the typical trade-off between precision and recall in fraud detection. By increasing the threshold, I improved the precision, making the model more selective in identifying fraudulent transactions. However, the cost of increasing the threshold is a slight reduction in recall, meaning that while fewer false positives occur, a few more fraud cases might go undetected.

At the chosen threshold of 0.9, the model achieves a strong balance between precision (0.81) and recall (0.93), resulting in an F1-score of 0.87. This threshold represents an effective trade-off, where the model captures the majority of fraud cases while reducing false alarms.

Conclusion:

The chosen threshold of 0.9 strikes the best balance between detecting fraudulent transactions (high recall) and minimising false positives (improved precision). This visualisation effectively demonstrates how threshold adjustment can be used to fine-tune the model for specific objectives, depending on whether the priority is to catch more fraud cases or to reduce disruptions for legitimate transactions.

This analysis provides clarity on the model’s behaviour across various thresholds and confirms that the selected threshold of 0.9 is optimal for achieving the desired trade-off between precision and recall.

In this step, I obtained the predicted probabilities for each transaction in the test set. Instead of simply predicting whether a transaction is fraudulent or not, the XGBoost model provides a probability score indicating how likely it is that each transaction is fraud.

Here’s what I did:

  1. predict_proba method: I used the predict_proba() method from the trained XGBoost model (xgb_best_model). This method returns two probabilities for each transaction: The probability that the transaction is not fraud (class 0). The probability that the transaction is fraud (class 1).

I focused on the second value ([:, 1]), which gives the probability that a transaction is fraudulent.

  2. Storing the Fraud Probabilities: I saved these fraud probabilities in the variable y_pred_prob_xgb_best. These values range from 0 to 1, where higher values indicate a higher likelihood of the transaction being fraudulent.

By obtaining the predicted probabilities, I was able to gain a more nuanced understanding of the model's confidence in its predictions. This information can be used to adjust decision thresholds, allowing the model to classify transactions as fraudulent or legitimate based on specific probability cut-offs. This is key for fine-tuning the model's performance, as I can set thresholds to control the balance between false positives and false negatives.
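This step amounts to a single line; a sketch, assuming xgb_best_model and X_test as used earlier:

```python
# Probability of class 1 (fraud) for every transaction in the test set
y_pred_prob_xgb_best = xgb_best_model.predict_proba(X_test)[:, 1]
```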

In this step, I generated the Precision-Recall Curve for the XGBoost model using the predicted probabilities for the test set. The Precision-Recall Curve is a useful tool in evaluating how well the model performs at different thresholds, particularly in cases where the classes are imbalanced, like fraud detection.

Obtaining Precision-Recall Data: I used the precision_recall_curve() function to calculate the precision, recall, and corresponding thresholds from the model's predicted probabilities (y_pred_prob_xgb_best) and the true labels (y_test). This function computes precision and recall for every possible threshold, allowing me to plot their relationship.

Creating the Precision-Recall Curve: To visualise the relationship between precision and recall at different thresholds, I plotted Recall on the x-axis and Precision on the y-axis. This curve helps identify the point where the model maintains a good balance between correctly identifying fraud (high recall) and minimising false positives (high precision).

Annotating the Optimal Threshold (0.9): I previously selected 0.9 as the chosen threshold, so I highlighted this specific point on the curve. Using optimal_threshold_index, I pinpointed the location on the curve that corresponds to the 0.9 threshold and added an annotation showing the performance metrics at this threshold—precision close to 0.81 and recall around 0.93.

Customising the Visualisation: I adjusted the layout to enhance clarity, including resizing the plot and adding clear axis labels. Additionally, I carefully positioned the annotation to avoid overlapping with the curve, making the visualisation easy to interpret.

Saving the Precision-Recall Curve: To preserve the visualisation, I saved it as an interactive HTML file using fig.write_html("precision_recall_curve.html"). This ensures that the plot can be shared and viewed interactively, allowing for deeper analysis in any web browser.
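A hedged sketch of how this curve and the 0.9 annotation could be put together is shown below; the way optimal_threshold_index is located here is an assumption, not necessarily the exact approach used in the notebook.

```python
# Hedged sketch: Precision-Recall curve with the chosen 0.9 threshold annotated.
import numpy as np
import plotly.graph_objects as go
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob_xgb_best)

# First threshold at or above the chosen 0.9 cut-off (assumed approach)
optimal_threshold_index = np.argmax(thresholds >= 0.9)

fig = go.Figure()
fig.add_trace(go.Scatter(x=recall, y=precision, mode="lines", name="Precision-Recall"))
fig.add_annotation(
    x=recall[optimal_threshold_index],
    y=precision[optimal_threshold_index],
    text="Threshold = 0.9",
    showarrow=True,
)
fig.update_layout(
    title="Precision-Recall Curve - XGBoost",
    xaxis_title="Recall",
    yaxis_title="Precision",
    width=900,
    height=600,
)
fig.write_html("precision_recall_curve.html")
```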

Analysis: The Precision-Recall Curve provides a clearer view of the model’s behaviour at different thresholds, highlighting the trade-off between precision and recall. At the chosen threshold of 0.9, the model maintains a strong balance, with recall at approximately 0.93 and precision at around 0.81. This ensures that most fraudulent transactions are detected while keeping false positives low.

Conclusion: The Precision-Recall Curve supports the choice of a 0.9 threshold, confirming that it provides an effective balance between detecting fraud and minimising false alarms. This visualisation validates the decision to use the 0.9 threshold for optimising the model’s performance in real-world fraud detection scenarios.

With the Precision-Recall Curve clearly showing the optimal threshold at 0.9, the model has been successfully fine-tuned to strike the right balance between detecting fraudulent transactions and reducing false positives. The analysis and visualisation confirm that the model’s performance is effective at this threshold.

In this step, I generated and plotted the Receiver Operating Characteristic (ROC) Curve to evaluate the model's overall ability to distinguish between fraudulent and non-fraudulent transactions across a range of thresholds. The ROC Curve is a widely used visualisation to assess model performance in binary classification tasks.

Calculating ROC Curve Data: I used the roc_curve() function to compute the false positive rate (FPR) and true positive rate (TPR) for various threshold values, based on the predicted probabilities (y_pred_prob_xgb_best) and the true labels (y_test). Additionally, I calculated the Area Under the Curve (AUC), a summary metric that reflects the model's overall performance across all thresholds.

Plotting the ROC Curve: The ROC curve illustrates the trade-off between the true positive rate (recall) and the false positive rate as the decision threshold changes. In the plot, the orange line represents the model’s performance, while the dashed navy line serves as a baseline, indicating the performance of a random classifier.

AUC Score: The AUC value for the ROC curve is 0.98, indicating that the model excels in distinguishing between fraudulent and legitimate transactions. An AUC closer to 1 suggests that the model performs exceptionally well across different thresholds, not limited to the chosen one.

Customisation: I enhanced the visualisation by adding titles, axis labels, and a comparison baseline to better illustrate the model's performance. This makes it easier to interpret how well the model discriminates between classes.

Saving the ROC Curve: To preserve this analysis, I saved the ROC Curve as an interactive HTML file using fig.write_html("roc_curve.html"). This allows for interactive exploration of the visualisation in any web browser.
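A simplified sketch of the ROC computation and plot, using the same assumed variable names as above:

```python
# Hedged sketch: ROC curve with AUC, plus a random-classifier baseline.
import plotly.graph_objects as go
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, roc_thresholds = roc_curve(y_test, y_pred_prob_xgb_best)
auc_score = roc_auc_score(y_test, y_pred_prob_xgb_best)

fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode="lines",
                         name=f"XGBoost (AUC = {auc_score:.2f})",
                         line=dict(color="orange")))
# Dashed diagonal: performance of a random classifier
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode="lines",
                         name="Random classifier",
                         line=dict(color="navy", dash="dash")))
fig.update_layout(
    title="ROC Curve - XGBoost",
    xaxis_title="False Positive Rate",
    yaxis_title="True Positive Rate",
)
fig.write_html("roc_curve.html")
```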

Analysis: The AUC score of 0.98 indicates that the model maintains excellent performance across a wide range of thresholds, successfully separating fraudulent from legitimate transactions. However, the chosen threshold of 0.9, as determined through a focused precision-recall analysis, remains the most effective balance between precision and recall for this specific task. The ROC Curve provides a comprehensive view of the model's capabilities, while the Precision-Recall Curve gives a more specific insight into threshold decision-making.

Conclusion: The AUC score of 0.98 confirms the model’s strong discriminative power between the two classes. Nevertheless, my decision to use a threshold of 0.9 stems from a targeted analysis of precision and recall trade-offs, as visualised in the precision-recall curve. This threshold ensures that the model's predictions are optimally balanced for effective fraud detection.

The ROC curve shown above illustrates excellent performance, with an AUC of 0.98, indicating near-perfect discrimination between fraudulent and non-fraudulent transactions across all thresholds in the test set. While this result is very promising, it's important to note that the AUC reflects overall model capability, not the behaviour at the chosen threshold of 0.9.

In this step, I applied the chosen threshold of 0.9 to the predicted probabilities in order to fine-tune the balance between precision and recall for fraud detection. By setting this higher threshold, I aimed to improve precision, reducing the number of legitimate transactions incorrectly flagged as fraud.

Process:

  1. Setting the Threshold: I set the decision threshold to 0.9 using the predicted probabilities (y_pred_prob_xgb_best). This means that only transactions with a fraud probability of 0.9 or higher are classified as fraud, making the model more selective.
  2. Generating Adjusted Predictions: By applying the threshold, I converted the predicted probabilities into binary predictions (1 for fraud, 0 for non-fraud). These adjusted predictions (y_pred_adjusted_xgb_best) reflect the model's behaviour at the chosen threshold.
  3. Evaluating the Adjusted Model: I calculated the accuracy of the model with this adjusted threshold, which represents the overall percentage of correct predictions. I also generated the classification report, which provides a detailed breakdown of the model's precision, recall, and F1-score for both classes (fraud and non-fraud). A minimal sketch of these three steps is shown after this list.
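The sketch below assumes the fraud probabilities and labels from earlier (y_pred_prob_xgb_best, y_test); y_pred_adjusted_xgb_best matches the name referenced above.

```python
# Hedged sketch: apply the chosen 0.9 threshold and re-evaluate the model.
from sklearn.metrics import accuracy_score, classification_report

# 1. Set the decision threshold
threshold = 0.9

# 2. Convert fraud probabilities into binary predictions at that threshold
y_pred_adjusted_xgb_best = (y_pred_prob_xgb_best >= threshold).astype(int)

# 3. Evaluate the adjusted predictions
print(f"Accuracy with adjusted threshold: {accuracy_score(y_test, y_pred_adjusted_xgb_best):.4f}")
print(classification_report(y_test, y_pred_adjusted_xgb_best, digits=2))
```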

Results:

  • Accuracy with Adjusted Threshold: The accuracy score provides an overall view of the model's performance with the higher threshold, but it's important to focus on precision and recall for fraud detection, where class imbalance can make accuracy less meaningful.
  • Classification Report: The classification report provides a more granular view, including: Precision: The percentage of transactions flagged as fraud that were actually fraudulent. This should improve with a higher threshold. Recall: The percentage of actual fraudulent transactions that the model successfully flagged. This may slightly decrease with a higher threshold, as fewer fraud cases might be flagged to increase precision. F1-score: A balanced measure that considers both precision and recall.

Conclusion:

By adjusting the threshold to 0.9, I focused on improving the precision of the model, reducing the number of false positives. The trade-off, as expected, is a slight decrease in recall, but the overall model performance should now be more aligned with the goal of minimising false fraud alerts while still detecting the majority of fraudulent transactions.

The results after applying the adjusted threshold of 0.9 show a strong improvement in the balance between precision and recall:

Evaluation of Results:

  • Accuracy: The model achieved an overall accuracy of 99.96%, which is impressive, though, in fraud detection, accuracy is less informative due to class imbalance. It's important to focus on precision and recall to better understand the model's behaviour.
  • Class 0 (Non-fraudulent transactions): Precision: 1.00 – Every transaction that was predicted as non-fraudulent was indeed legitimate, meaning there are no false positives for non-fraud transactions. Recall: 1.00 – The model correctly captured all non-fraudulent transactions, which is expected due to the model's higher precision focus. F1-score: 1.00 – A perfect F1-score reflects the model's excellent handling of non-fraudulent transactions.
  • Class 1 (Fraudulent transactions): Precision: 0.74 – The model identified 74% of the transactions it flagged as fraudulent correctly, indicating a strong reduction in false positives compared to previous results. Recall: 0.96 – Despite the higher threshold, the model still captures 96% of all fraudulent transactions, meaning only 4% of actual fraud cases were missed. F1-score: 0.83 – The F1-score for fraud remains strong, indicating a good balance between precision and recall.
  • Macro and Weighted Averages: The macro average of 0.87 for precision and 0.98 for recall reflects a good overall balance across both classes. The weighted average of 1.00 shows that the model is highly effective overall, particularly given the imbalance between the two classes.

Conclusion:

The classification report confirms that setting the threshold to 0.9 has successfully increased precision while maintaining a high level of recall. The model now catches 96% of all fraud cases, while reducing the number of false positives, as indicated by a precision of 0.74. This trade-off is effective for the goal of fraud detection, as it minimises false alerts while still capturing the majority of fraudulent transactions.

Overall, this result represents a well-optimised model for fraud detection, where false positives are reduced and fraudulent transactions are still being detected at a high rate.


Summary of Findings

In this project, I set out to build an effective fraud detection model using the XGBoost classifier, addressing the challenges of class imbalance and optimising model performance through careful tuning and evaluation.

The key steps involved:

  1. Data Preparation: I began by cleaning and transforming the dataset, one-hot encoding categorical variables and scaling the numerical features to ensure they were appropriately standardised.
  2. Handling Class Imbalance: To address the class imbalance in the dataset, I calculated and applied a scale_pos_weight in the XGBoost model to give more importance to the minority class (fraudulent transactions), ensuring the model focused sufficiently on detecting fraud.
  3. Hyperparameter Tuning: Using RandomizedSearchCV, I optimised key parameters such as the learning rate, number of estimators, and tree depth. This fine-tuning resulted in the best set of hyperparameters that significantly improved model performance.
  4. Threshold Optimisation: I explored a range of thresholds to balance precision (reducing false positives) and recall (catching more fraudulent transactions). After testing several thresholds, I determined that 0.9 provided the best trade-off between precision and recall.
  5. Final Model Evaluation: The model, with the chosen threshold of 0.9, achieved a precision of 0.74 and a recall of 0.96, resulting in an F1-score of 0.83 for fraud detection. This threshold successfully reduced false positives while still capturing a high percentage of fraudulent transactions. The overall accuracy of the model was 99.96%.
  6. Visual Analysis: The Precision-Recall Curve, ROC Curve, and feature importance plots provided clear visual insights into the model’s performance and the key drivers of fraud detection.

Final Conclusion

In conclusion, the project successfully achieved its goal of developing an effective fraud detection model that balances precision and recall. The model's strong recall ensures that the majority of fraudulent transactions are caught, while the increase in precision at the 0.9 threshold reduces false positives, thereby minimising unnecessary disruption to legitimate transactions.

The chosen threshold of 0.9 reflects a careful balance, resulting in an F1-score of 0.83, which signifies that the model performs well in real-world applications where false positives can lead to unnecessary interventions but missing fraud could result in significant financial losses.

This project demonstrates that by carefully handling class imbalance, optimising hyperparameters, and fine-tuning decision thresholds, I was able to create a robust fraud detection model that meets both performance and practical business needs. Going forward, the model could be periodically retrained and enhanced with additional data to ensure it continues to detect emerging fraud patterns effectively. Overall, I am satisfied with the outcome, and the model is ready for potential deployment.


This project has been a fascinating journey into the complexities of fraud detection. I’ve gained deeper insights into how machine learning models operate in real-world scenarios, especially in handling imbalanced datasets and fine-tuning models for optimal performance. The process of exploring the nuances of XGBoost and finding the right balance between precision and recall has been both challenging and rewarding. I truly enjoyed working on this project, and I’m genuinely proud of the results, as the final model achieved a strong balance, significantly reducing false positives while capturing the majority of fraudulent transactions. It’s satisfying to see the hours of data preparation, tuning, and validation come together in a solution that’s both effective and practical.

This project has not only deepened my technical skills in data science and machine learning but also sharpened my ability to interpret results, make informed decisions, and communicate complex findings clearly. It’s been a transformative learning experience, enhancing my understanding of what it takes to build and optimise a model that could potentially be used in a high-stakes real-world context. I’m excited to take these learnings forward and tackle even more complex data challenges. Fraud detection is an ever-evolving field, and I look forward to applying these skills to new projects, learning from each step, and pushing the boundaries of what’s possible with machine learning. Look out for more of my machine learning projects in the future and if you connect then you will see them first!
