Anomaly | Fraud Detection

Fraudulent activities in various domains have become increasingly sophisticated, making it imperative for organizations to deploy robust fraud detection mechanisms. The advent of machine learning has brought about significant advancements in this field, allowing for more accurate and efficient detection of anomalies and fraudulent behavior. In this article, we will delve into the application of both supervised and unsupervised machine learning techniques, specifically the Isolation Forest and XGBoost Classifier, to bolster fraud detection efforts.

Anomalies are transactions that deviate significantly from the normal spending behavior of a customer. By identifying anomalous transactions, banks can flag them for further investigation and prevent fraudulent transactions from being completed.

The full source code is available at https://www.kaggle.com/code/ihtishammehmood/unsupervised-supervised

Unsupervised Machine Learning with Isolation Forest

Unsupervised machine learning comes in handy when the data is not labeled: groups are formed based on shared features rather than known classes. Unlike supervised methods, Isolation Forest does not require labeled data, making it well-suited for identifying fraudulent activities when there is limited prior knowledge about them.
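As a quick illustration of the idea, here is a minimal sketch on synthetic, unlabeled data (not the dataset used in this article): Isolation Forest is fit directly on a feature matrix with no labels at all.

import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic, unlabeled data: 500 "normal" points plus a few injected extremes
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),
               rng.uniform(6, 8, size=(5, 2))])

# Fit without any labels; predict() returns -1 for anomalies and 1 for normal points
iso = IsolationForest(contamination='auto', random_state=42)
labels = iso.fit_predict(X)
print('Flagged as anomalies:', (labels == -1).sum())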

A Little Bit of EDA

EDA is conducted to get a clear picture of the data.

  • Outliers: Outliers can be very problematic during the ML process. Dealing with them carefully is very helpful when detecting anomalies in the transactions.

import matplotlib.pyplot as plt

# For every numeric column: print skewness, kurtosis and summary stats, then draw a boxplot
for i in df.select_dtypes(include='number').columns.values:
    print(df[i].skew())
    print(df[i].kurtosis())
    print(df[i].describe())
    plt.boxplot(df[i])
    plt.show()

This code snippet loops through the numeric columns of the dataframe, prints their skewness, kurtosis, and summary statistics, and creates a boxplot for each to check for outliers.

https://www.kaggle.com/code/ihtishammehmood/unsupervised-supervised

Well, there are a couple of outliers in the data frame. To deal with these first, the following code snippet will come in handy:

import numpy as np

def find_outliers_iqr(series):
    # Flag values that fall outside 1.5 * IQR of the first and third quartiles
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = (series < lower_bound) | (series > upper_bound)
    return outliers

def cap_outliers_in_dataframe(df):
    capped_df = df.copy()

    for column in capped_df.columns:
        if np.issubdtype(capped_df[column].dtype, np.number): 
            outliers = find_outliers_iqr(capped_df[column])

            
            Q1 = capped_df[column].quantile(0.25)
            Q3 = capped_df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR

            capped_df[column] = np.where(outliers, np.clip(capped_df[column], lower_bound, upper_bound), capped_df[column])

    return capped_df

capped_df = cap_outliers_in_dataframe(df)        

This code caps the outliers at the upper and lower bounds. The main point of capping is to avoid trimming (dropping) data: the IQR technique simply clips each value so it stays within the lower and upper bounds.
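As a quick sanity check, a small sketch reusing the find_outliers_iqr helper defined above confirms that no values fall outside the IQR bounds after capping:

# Count remaining IQR outliers per numeric column; all counts should be 0 after capping
remaining = capped_df.select_dtypes(include='number').apply(find_outliers_iqr).sum()
print(remaining)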

  • Normal or Exponential? It is important to understand whether or not the data is normally distributed.

import matplotlib.pyplot as plt
import statsmodels.api as sm

def make_univariate_plots(df, factors, title, plot_type):
    n = len(factors)
    ncols = 3  # You can adjust the number of columns as needed
    nrows = (n + ncols - 1) // ncols

    # squeeze=False keeps axes 2-D even when there is only a single row of subplots
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15, 4 * nrows), squeeze=False)
    fig.suptitle(title, fontsize=16)

    for i, factor in enumerate(factors):
        ax = axes[i // ncols, i % ncols]
        if plot_type == 'qq-plot':
            sm.qqplot(df[factor], line='s', ax=ax)
            ax.set_title(f'Q-Q plot for {factor}')
        else:  # 'distribution'
            ax.hist(df[factor], bins=30)
            ax.set_title(f'Distribution of {factor}')

    # Hide any unused subplots in the last row
    for j in range(n, nrows * ncols):
        axes[j // ncols, j % ncols].set_visible(False)

    plt.tight_layout()
    plt.show()

The code defines a function called make_univariate_plots(), which generates univariate plots of the specified features in the given DataFrame. It takes four arguments: the DataFrame, a list of feature names, a title, and a plot type. The function supports two plot types: qq-plot and distribution. A Q-Q plot compares the quantiles of a data set to the quantiles of a normal distribution, while a distribution plot shows a histogram of the feature's values.

make_univariate_plots(
    df=numeric_df,
    factors=feature_names,
    title='Normal Q-Q plots of features',
    plot_type='qq-plot',
)        

This would generate a figure with three columns and as many rows as needed to accommodate all of the Q-Q plots. Each plot would have the title Q-Q plot for {factor}, where {factor} is the name of the feature being plotted.

https://www.kaggle.com/code/ihtishammehmood/unsupervised-supervised


It does seem that the data is roughly normally distributed, even though no standardization has been applied to it yet.
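To complement the visual Q-Q plot check, a formal normality test can be run per feature. Here is a minimal sketch using scipy's D'Agostino-Pearson test (scipy is an assumption here; it is not imported elsewhere in this notebook):

from scipy import stats

# Low p-values (e.g. < 0.05) indicate the feature deviates significantly from a normal distribution
for col in numeric_df.columns:
    stat, p_value = stats.normaltest(numeric_df[col])
    print(f'{col}: p-value = {p_value:.4f}')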

  • Feature Engineering: Feature engineering plays a crucial role in building an effective fraud detection model. It involves selecting relevant features, creating new ones, and transforming the data. Common techniques include one-hot encoding, scaling, and handling missing values (a small sketch of these steps follows the correlation matrix below). In the following code snippet, we look at the correlation among different features.

import plotly.express as px
correlation_matrix = capped_df.select_dtypes(include='number').corr()

fig_corr_matrix = px.imshow(correlation_matrix,
                            x=correlation_matrix.columns,
                            y=correlation_matrix.columns,
                            color_continuous_scale='Greens',  
                            title='Interactive Correlation Matrix')

# Customize the layout
fig_corr_matrix.update_layout(width=1200, height=900)
fig_corr_matrix.show()        

This code calculates the correlation among the numeric features and displays it in the form of an interactive correlation matrix.

https://www.kaggle.com/code/ihtishammehmood/unsupervised-supervised

In this visualization, the deeper the color, the more positively correlated the two features are.
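The feature engineering bullet above also mentions one-hot encoding, scaling, and handling missing values. A minimal, generic sketch of those steps with scikit-learn might look like this; the column names are hypothetical placeholders, not columns from this dataset:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column groups; replace with the actual columns of your dataframe
numeric_cols = ['amount', 'transaction_hour']
categorical_cols = ['merchant_category']

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical_cols),
])

# X_features = preprocess.fit_transform(df[numeric_cols + categorical_cols])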

Model Training

The Isolation Forest algorithm builds an ensemble of isolation trees. These trees work by randomly selecting features and split values; anomalies get isolated after only a few splits (short paths), while normal data points need many more splits and therefore end up deeper in the trees. The anomaly score of each data point is then derived from its average path length across the trees.

## create a pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('isolation_forest', IsolationForest(n_estimators=100, max_samples='auto', contamination='auto',
                                         max_features=1.0, bootstrap=False, n_jobs=-1, random_state=42))
])


This code defines a machine learning pipeline using scikit-learn. The pipeline consists of three sequential steps:

  1. StandardScaler: The first step in the pipeline applies standardization to the data using the StandardScaler transformer. Standardization ensures that the data features have a mean of zero and a standard deviation of one, making them suitable for various machine learning algorithms.
  2. PCA (Principal Component Analysis): The second step employs Principal Component Analysis, represented by PCA, to reduce the dimensionality of the data down to 2 components. PCA is a technique that identifies the most significant patterns in the data, helping to simplify and speed up subsequent analysis.
  3. Isolation Forest: The final step involves the IsolationForest anomaly detection model. It is configured with specific hyperparameters, including the number of estimators, sampling strategy, and randomness control, for effectively identifying anomalies or outliers in the data. This pipeline can be used to process and detect anomalies in a dataset efficiently.

pipeline.fit(df_2)
prediction = pipeline.predict(df_2)
anomaly  = df_2[prediction==-1].to_numpy()
normal = df_2[prediction==1].to_numpy()
df2_index = np.where(prediction<0)
df2_index        

This code applies a machine learning pipeline to detect anomalies in the dataset df_2. After training the pipeline on the data, it generates predictions, with -1 indicating anomalies and 1 for normal data. It then extracts and stores anomalies in the anomaly array and normal data in the normal array. Additionally, it identifies and saves the indices of anomalies in the df2_index variable. This allows for precise analysis of the unusual data points in the original dataset.
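Beyond the -1/1 labels, the anomaly score mentioned earlier can be inspected directly. Here is a small sketch, assuming the fitted pipeline above, using decision_function, where lower (negative) scores correspond to more anomalous points:

# Anomaly scores from the fitted pipeline: negative scores correspond to predicted anomalies
scores = pipeline.decision_function(df_2)
print('Most anomalous score:', scores.min())
print('Number of points with negative scores:', (scores < 0).sum())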

# Create a scatter plot
plt.figure(figsize=(16, 10))
plt.scatter(normal[:, 0], normal[:, 1], c='b', marker='o', label='Normal')
plt.scatter(anomaly[:, 0], anomaly[:, 1], c='r', marker='x', label='Anomaly')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Anomaly Detection with Isolation Forest')
plt.legend(loc='best')

plt.show()        

In this code, we're creating a scatter plot to visualize the results of our anomaly detection. The plt.figure(figsize=(16, 10)) line sets the size of the plot. We use two different markers and colors to distinguish normal data points (blue circles) and anomalies (red crosses) based on the arrays we generated earlier. The xlabel and ylabel functions label the axes, and the title function gives the plot a title. Lastly, we add a legend to clarify the data points' meanings, and plt.show() displays the plot. This visualization helps us easily spot and understand anomalies in our data, which is valuable for tasks like fraud detection or quality control.

https://www.kaggle.com/code/ihtishammehmood/unsupervised-supervised

The scatter plot presented here serves as a visual representation of our anomaly detection results. In this plot, the choice of colors and markers is deliberate, making it a powerful tool for understanding our data. The blue circles represent normal data points, while the red crosses denote anomalies. This color-coding immediately distinguishes between the two, allowing for quick and intuitive identification of unusual data patterns. This kind of visualization is essential for gaining insights into the distribution and location of anomalies within the dataset, making it particularly useful in fraud detection and other applications where the identification of outliers is crucial.

Remember, this result was produced after applying the IQR (interquartile range) capping described above.

Supervised Machine Learning with XGBoost Classifier

XGBoost, an optimized gradient boosting algorithm, has gained immense popularity for its robustness and efficiency in solving classification problems. In a supervised setting, XGBoost can be a potent tool for fraud detection.

Model Training

With the data prepared and features engineered, we can proceed to train the XGBoost Classifier. XGBoost builds an ensemble of decision trees, optimizing their structure to minimize classification errors. The algorithm is highly customizable, allowing you to fine-tune various hyperparameters for optimal performance.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
# Fit the scaler on the training data only, then apply the same transformation to the test data
x_test = scaler.transform(x_test)

In this code, we prepare the dataset for machine learning by splitting it into training and testing sets and applying feature scaling. We use train_test_split from scikit-learn to divide the dataset into training (x_train, y_train) and testing (x_test, y_test) sets, with a 70-30 split ratio and a fixed random seed (42) for reproducibility. Next, we apply feature scaling with MinMaxScaler, which rescales the features so that they all fall within the same range (0 to 1 by default), ensuring that no feature dominates the others purely because of its scale. The scaler is fit on the training data only, and the same transformation is then applied to the test data, so no information from the test set leaks into training. Both sets are now ready for model training and evaluation.

from xgboost import XGBClassifier

clf = XGBClassifier(max_depth=6,
                    learning_rate=0.05,
                    n_estimators=200,
                    min_child_weight=2,
                    scale_pos_weight=0.5,
                    subsample=0.9,
                    colsample_bytree=0.5,
                    colsample_bylevel=0.8,
                    reg_alpha=0.05,
                    reg_lambda=0.1,
                    max_delta_step=2,
                    gamma=0.1,
                    random_state=0)

This code imports the XGBoost classifier from the XGBoost library and initializes it with a set of hyperparameters. In this specific instance, the hyperparameters are configured as follows: max_depth sets the maximum depth of the decision trees, learning_rate controls the step size during optimization, n_estimators specifies the number of boosting rounds, min_child_weight imposes a minimum sum of instance weights needed in a child, scale_pos_weight rebalances the weight of the positive class, subsample defines the fraction of training data to use during each boosting round, colsample_bytree and colsample_bylevel control column subsampling, reg_alpha and reg_lambda introduce L1 and L2 regularization to prevent overfitting, max_delta_step caps the change allowed in each tree's weight estimates, and gamma sets the minimum loss reduction required to make a further split (another form of regularization). The random_state ensures reproducibility of results. These parameter settings are intended to fine-tune the XGBoost model for this specific task.
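Since these values usually need tuning for a given dataset, a small grid search over a couple of them is a common next step. This is a minimal sketch; the parameter grid is illustrative, not the settings used in this article:

from sklearn.model_selection import GridSearchCV

# Illustrative grid over two hyperparameters; F1 scoring is a sensible choice for imbalanced fraud data
param_grid = {'max_depth': [4, 6, 8], 'learning_rate': [0.01, 0.05, 0.1]}
search = GridSearchCV(XGBClassifier(random_state=0), param_grid, scoring='f1', cv=3, n_jobs=-1)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)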

clf.fit(x_train , y_train)

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

y_pred = clf.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(accuracy)
print(report)
        

Now the model is trained on the training dataset. After training, it is important to evaluate the model's performance, which is where the accuracy score and the classification report come in handy.

https://www.kaggle.com/code/ihtishammehmood/unsupervised-supervised

In this summary, model result suggests that the classifier is highly accurate and excels in correctly classifying both class 0 and class 1. The dataset seems to be well-balanced, and the model's precision, recall, and F1-scores are all at their maximum values, indicating outstanding classification performance. However, it's essential to consider the context of the data and the potential presence of any data leakage or overfitting.
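One simple way to probe for overfitting, as a rough sketch, is to cross-validate the same classifier on the training data and compare the scores with the test-set result above:

from sklearn.model_selection import cross_val_score

# If cross-validated scores differ sharply from the test-set accuracy, investigate leakage or overfitting
cv_scores = cross_val_score(clf, x_train, y_train, cv=5, scoring='accuracy', n_jobs=-1)
print('CV accuracy: %.4f +/- %.4f' % (cv_scores.mean(), cv_scores.std()))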

from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(9, 9))
# annot=True writes the counts into each cell; fmt='d' formats them as integers
sns.heatmap(cm, annot=True, fmt='d', linewidths=.5, square=True, cmap='Spectral')
plt.ylabel('Actual Output')
plt.xlabel('Predicted Output')
score = accuracy_score(y_test, y_pred)
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size=15)
plt.show()

This code gives a visual presentation of the confusion matrix as a heatmap.

https://www.kaggle.com/code/ihtishammehmood/unsupervised-supervised

It shows how the model's predictions compare to the actual class labels. Let's interpret the matrix:

  • True Negatives (TN): In the top-left corner, we have 85,027 instances where the model correctly predicted class 0.
  • False Positives (FP): In the top-right corner, we have 122 instances where the model incorrectly predicted class 1 when the actual class was 0.
  • False Negatives (FN): In the bottom-left corner, we have 258 instances where the model incorrectly predicted class 0 when the actual class was 1.
  • True Positives (TP): In the bottom-right corner, we have 85,182 instances where the model correctly predicted class 1.

Interpretation:

  • The model correctly identified 85,027 instances as class 0, and it correctly identified 85,182 instances as class 1, which is excellent.
  • The model made 122 false positive errors, meaning it predicted class 1 when it was actually class 0.
  • The model made 258 false negative errors, meaning it predicted class 0 when it was actually class 1.

In summary, the model has a high number of true positives and true negatives, which indicates good overall performance. The relatively low number of false positives and false negatives suggests that the model's accuracy is high, but it does make some errors in classifying the two classes.
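Treating class 1 as the positive class, the same metrics can be derived directly from the four confusion-matrix cells listed above as a quick arithmetic check:

# Counts taken from the confusion matrix above (class 1 = positive)
TN, FP, FN, TP = 85027, 122, 258, 85182

precision = TP / (TP + FP)                   # ~0.9986
recall = TP / (TP + FN)                      # ~0.9970
accuracy = (TP + TN) / (TP + TN + FP + FN)   # ~0.9978
print(precision, recall, accuracy)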

Conclusion

Anomaly detection with machine learning is a powerful tool that can be used to combat credit card fraud. By identifying anomalous transactions, banks and other financial institutions can flag them for further investigation and prevent fraudulent transactions from being completed.

Professional Note

The author of this article is a highly skilled data analytics professional and machine learning engineer with a deep understanding of ML techniques, and has successfully completed several data analytics and machine learning projects.


