Anomaly | Fraud Detection
Ihtisham Mehmood
Co-Founder @ DMC | Data Scientist | Generative AI | Agentic AI | MLOps | Data Analyst | MBA | BBA
Fraudulent activities in various domains have become increasingly sophisticated, making it imperative for organizations to deploy robust fraud detection mechanisms. The advent of machine learning has brought about significant advancements in this field, allowing for more accurate and efficient detection of anomalies and fraudulent behavior. In this article, we will delve into the application of both supervised and unsupervised machine learning techniques, specifically the Isolation Forest and XGBoost Classifier, to bolster fraud detection efforts.
Anomalies are transactions that deviate significantly from the normal spending behavior of a customer. By identifying anomalous transactions, banks can flag them for further investigation and prevent fraudulent transactions from being completed.
Unsupervised Machine Learning with Isolation Forest
Unsupervised machine learning comes in handy when the data is not labeled: observations are grouped based on the same set of features. Unlike supervised methods, Isolation Forest does not require labeled data, making it well-suited for identifying fraudulent activities when there is limited prior knowledge about them.
A Little Bit of EDA
Exploratory data analysis (EDA) is conducted to understand the shape and spread of each feature before any modeling.
import matplotlib.pyplot as plt

# Inspect skewness, kurtosis, summary statistics and a boxplot for every numeric column
for i in df.select_dtypes(include='number').columns.values:
    print(df[i].skew())
    print(df[i].kurtosis())
    print(df[i].describe())
    plt.boxplot(df[i])
    plt.show()
This code snippet loops through the numeric columns of the dataframe and creates a boxplot for each one to check for outliers.
Well, there are a couple of outliers in the data frame. To deal with them, the following code snippet will come in handy:
import numpy as np

def find_outliers_iqr(series):
    # Flag values that fall outside 1.5 * IQR of the first and third quartiles
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = (series < lower_bound) | (series > upper_bound)
    return outliers

def cap_outliers_in_dataframe(df):
    # Cap (rather than drop) outliers in every numeric column
    capped_df = df.copy()
    for column in capped_df.columns:
        if np.issubdtype(capped_df[column].dtype, np.number):
            outliers = find_outliers_iqr(capped_df[column])
            Q1 = capped_df[column].quantile(0.25)
            Q3 = capped_df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            capped_df[column] = np.where(outliers,
                                         np.clip(capped_df[column], lower_bound, upper_bound),
                                         capped_df[column])
    return capped_df

capped_df = cap_outliers_in_dataframe(df)
This code caps the outliers at the upper and lower bounds instead of removing them. The main point of capping is to avoid trimming the data: the IQR technique keeps every observation but pulls extreme values back inside the bounds.
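As a quick sanity check (a minimal sketch, reusing the df, capped_df and find_outliers_iqr defined above), you can re-run the IQR test on the capped data; the counts should typically come back as zero.

# Count remaining IQR outliers per numeric column after capping
numeric_cols = capped_df.select_dtypes(include='number').columns
for col in numeric_cols:
    n_outliers = find_outliers_iqr(capped_df[col]).sum()
    print(f'{col}: {n_outliers} outliers remaining')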
import matplotlib.pyplot as plt
import statsmodels.api as sm

def make_univariate_plots(df, factors, title, plot_type):
    n = len(factors)
    ncols = 3  # You can adjust the number of columns as needed
    nrows = (n + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15, 4 * nrows))
    fig.suptitle(title, fontsize=16)
    axes = axes.flatten()  # works whether there is one row or several
    for i, factor in enumerate(factors):
        ax = axes[i]
        if plot_type == 'qq-plot':
            sm.qqplot(df[factor], line='s', ax=ax)
            ax.set_title(f'Q-Q plot for {factor}')
        else:
            df[factor].plot(kind='density', ax=ax)
            ax.set_title(f'Distribution of {factor}')
    plt.tight_layout()
    plt.show()
The code defines a function called make_univariate_plots(), which generates univariate plots of the specified features in the given DataFrame. It takes four arguments: the DataFrame, a list of feature names, a title, and a plot type. The function supports two plot types: qq-plot and distribution. A Q-Q plot compares the quantiles of a data set to the quantiles of a normal distribution, while a distribution plot shows the distribution of a data set.
make_univariate_plots(
df=numeric_df,
factors=feature_names,
title='Normal Q-Q plots of features',
plot_type='qq-plot',
)
This would generate a figure with three columns and as many rows as needed to accommodate all of the Q-Q plots. Each plot would have the title Q-Q plot for {factor}, where {factor} is the name of the feature being plotted.
The Q-Q plots suggest that the features are not normally distributed; note that the data has not been standardized at this stage.
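To back up the visual impression with a number, a quick normality test can be run on each feature. This is a minimal sketch using scipy's D'Agostino-Pearson test; numeric_df and feature_names are the same objects passed to make_univariate_plots above.

from scipy import stats

# p-values well below 0.05 indicate a significant departure from normality
for factor in feature_names:
    stat, p_value = stats.normaltest(numeric_df[factor])
    print(f'{factor}: statistic={stat:.2f}, p-value={p_value:.4f}')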
import plotly.express as px
correlation_matrix = capped_df.select_dtypes(include='number').corr()
fig_corr_matrix = px.imshow(correlation_matrix,
x=correlation_matrix.columns,
y=correlation_matrix.columns,
color_continuous_scale='Greens',
title='Interactive Correlation Matrix')
# Customize the layout
fig_corr_matrix.update_layout(width=1200, height=900)
fig_corr_matrix.show()
This code calculates the pairwise correlation among the numeric features and displays it as an interactive correlation matrix.
In this visualization, the deeper the green, the more positively correlated the pair of features.
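If you prefer the strongly correlated pairs as a table rather than reading them off the heatmap, a short helper can extract them from the same correlation_matrix. This is only a sketch: the 0.8 threshold is an arbitrary choice, and it reuses the numpy import from earlier.

# List feature pairs whose absolute correlation exceeds a chosen threshold
threshold = 0.8
high_corr_pairs = (
    correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
    .stack()
    .loc[lambda s: s.abs() > threshold]
    .sort_values(ascending=False)
)
print(high_corr_pairs)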
Model Training
The Isolation Forest algorithm builds an ensemble of isolation trees. These trees work by randomly selecting features and split values until each point is isolated; anomalies tend to be isolated after only a few splits (short paths), while normal data points end up deeper in the trees. The anomaly score of each data point is then derived from its average path length across the ensemble.
## create a pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('isolation_forest', IsolationForest(n_estimators=100, max_samples='auto', contamination='auto',
                                         max_features=1.0, bootstrap=False, n_jobs=-1, random_state=42))
])
This code defines a machine learning pipeline using scikit-learn. The pipeline consists of three sequential steps: standardizing the features with StandardScaler, reducing them to two principal components with PCA, and fitting an IsolationForest that labels each observation as normal or anomalous.
pipeline.fit(df_2)
prediction = pipeline.predict(df_2)
anomaly = df_2[prediction==-1].to_numpy()
normal = df_2[prediction==1].to_numpy()
df2_index = np.where(prediction<0)
df2_index
This code applies a machine learning pipeline to detect anomalies in the dataset df_2. After training the pipeline on the data, it generates predictions, with -1 indicating anomalies and 1 for normal data. It then extracts and stores anomalies in the anomaly array and normal data in the normal array. Additionally, it identifies and saves the indices of anomalies in the df2_index variable. This allows for precise analysis of the unusual data points in the original dataset.
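Beyond the binary -1/1 labels, the fitted pipeline can also expose the underlying anomaly scores, which is useful for ranking transactions by how suspicious they look. A minimal sketch (reusing the pipeline, df_2 and the numpy import from earlier); in scikit-learn's convention, lower scores mean more anomalous.

# Anomaly scores from the Isolation Forest at the end of the pipeline
scores = pipeline.decision_function(df_2)   # negative values correspond to predicted anomalies
most_suspicious = np.argsort(scores)[:10]   # indices of the 10 lowest-scoring (most anomalous) rows
print(df_2.iloc[most_suspicious])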
# Create a scatter plot
plt.figure(figsize=(16, 10))
plt.scatter(normal[:, 0], normal[:, 1], c='b', marker='o', label='Normal')
plt.scatter(anomaly[:, 0], anomaly[:, 1], c='r', marker='x', label='Anomaly')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Anomaly Detection with Isolation Forest')
plt.legend(loc='best')
plt.show()
In this code, we're creating a scatter plot to visualize the results of our anomaly detection. The plt.figure(figsize=(16, 10)) line sets the size of the plot. We use two different markers and colors to distinguish normal data points (blue circles) and anomalies (red crosses) based on the arrays we generated earlier. The xlabel and ylabel functions label the axes, and the title function gives the plot a title. Lastly, we add a legend to clarify the data points' meanings, and plt.show() displays the plot. This visualization helps us easily spot and understand anomalies in our data, which is valuable for tasks like fraud detection or quality control.
The scatter plot presented here serves as a visual representation of our anomaly detection results. In this plot, the choice of colors and markers is deliberate, making it a powerful tool for understanding our data. The blue circles represent normal data points, while the red crosses denote anomalies. This color-coding immediately distinguishes between the two, allowing for quick and intuitive identification of unusual data patterns. This kind of visualization is essential for gaining insights into the distribution and location of anomalies within the dataset, making it particularly useful in fraud detection and other applications where the identification of outliers is crucial.
Remember, this data was processed after applying the IQR (interquartile range) capping described earlier.
Supervised Machine Learning with XGBoost Classifier
XGBoost, an optimized gradient boosting algorithm, has gained immense popularity for its robustness and efficiency in solving classification problems. In a supervised setting, XGBoost can be a potent tool for fraud detection.
Model Training
With the data prepared and features engineered, we can proceed to train the XGBoost Classifier. XGBoost builds an ensemble of decision trees, optimizing their structure to minimize classification errors. The algorithm is highly customizable, allowing you to fine-tune various hyperparameters for optimal performance.
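Because the algorithm is so customizable, hyperparameters are usually tuned with a search rather than by hand. The following is a minimal sketch of such a search with scikit-learn's RandomizedSearchCV; the parameter grid is purely illustrative, not the one used in this article, and x_train / y_train are created in the next snippet.

from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Illustrative search space; adjust ranges to your dataset
param_distributions = {
    'max_depth': [3, 4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 400],
    'subsample': [0.7, 0.9, 1.0],
}

search = RandomizedSearchCV(
    XGBClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,
    scoring='f1',          # F1 is a reasonable metric for imbalanced fraud data
    cv=3,
    random_state=0,
    n_jobs=-1,
)
# search.fit(x_train, y_train)   # run once the train split below is available
# print(search.best_params_)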
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)  # use the scaler fitted on the training data only
In this code, we are preparing the dataset for machine learning by splitting it into training and testing sets and applying feature scaling. We use train_test_split from scikit-learn to divide the dataset into training (x_train, y_train) and testing (x_test, y_test) sets, with a 70-30 split ratio and a fixed random seed (42) for reproducibility. Next, we apply feature scaling using MinMaxScaler, which rescales the features so that they all fall within the same range (usually 0 to 1). This ensures that each feature contributes equally to the machine learning model, preventing any feature from dominating the others due to differences in scale. Note that the scaler is fitted on the training data only and then applied to the test data, so no information from the test set leaks into the preprocessing.
from xgboost import XGBClassifier

clf = XGBClassifier(max_depth=6,
                    learning_rate=0.05,
                    n_estimators=200,
                    min_child_weight=2,
                    scale_pos_weight=0.5,
                    subsample=0.9,
                    colsample_bytree=0.5,
                    colsample_bylevel=0.8,
                    reg_alpha=0.05,
                    reg_lambda=0.1,
                    max_delta_step=2,
                    gamma=0.1,
                    random_state=0)
This code imports the XGBoost classifier from the XGBoost library and initializes it with a set of hyperparameters. In this specific instance, the hyperparameters are configured as follows: max_depth sets the maximum depth of the decision trees, learning_rate controls the step size during optimization, n_estimators specifies the number of boosting rounds, min_child_weight imposes a minimum sum of instance weight needed in a child, scale_pos_weight addresses class imbalance, subsample defines the fraction of training data to use during each boosting round, colsample_bytree and colsample_bylevel control column subsampling, reg_alpha and reg_lambda introduce L1 and L2 regularization to prevent overfitting, max_delta_step caps each tree's weight update (which can help with heavily imbalanced classes), and gamma sets the minimum loss reduction required to make a further split. The random_state ensures reproducibility of results. These parameter settings are designed to fine-tune the XGBoost model for optimal performance on this specific task.
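A hedged note on scale_pos_weight: the XGBoost documentation suggests setting it to roughly the ratio of negative to positive samples for imbalanced data; the value 0.5 used above was presumably chosen for this particular dataset. The suggested value can be computed directly from the training labels, as in this sketch (assuming class 1 is the fraud class):

# Rule of thumb from the XGBoost docs: scale_pos_weight ~ count(negative) / count(positive)
neg = (y_train == 0).sum()
pos = (y_train == 1).sum()
print('suggested scale_pos_weight:', neg / pos)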
clf.fit(x_train, y_train)
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
y_pred = clf.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(accuracy)
print(report)
Now the model has been trained on the training dataset. After training, it is important to evaluate the performance of the model, which is why the accuracy score and classification report come in handy.
In this summary, model result suggests that the classifier is highly accurate and excels in correctly classifying both class 0 and class 1. The dataset seems to be well-balanced, and the model's precision, recall, and F1-scores are all at their maximum values, indicating outstanding classification performance. However, it's essential to consider the context of the data and the potential presence of any data leakage or overfitting.
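One way to check for the overfitting mentioned above is to cross-validate the same classifier on the training data and see whether the scores stay consistently high across folds. A minimal sketch, assuming the clf, x_train and y_train defined earlier:

from sklearn.model_selection import StratifiedKFold, cross_val_score

# 5-fold stratified cross-validation preserves the class ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, x_train, y_train, cv=cv, scoring='f1')
print('F1 per fold:', scores)
print('mean F1: {:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))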
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(9, 9))
sns.heatmap(cm, annot=True, fmt='.3f', linewidths=.5, square=True, cmap='Spectral')
plt.ylabel('Actual Output')
plt.xlabel('Predicted Output')
score = accuracy_score(y_test, y_pred)
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size=15)
plt.show()
This code gives a visual presentation of the confusion matrix. It shows how the model's predictions compare to the actual class labels. Let's interpret the matrix:
In summary, the model has a high number of true positives and true negatives, which indicates good overall performance. The relatively low number of false positives and false negatives suggests that the model's accuracy is high, but it does make some errors in classifying the two classes.
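To turn that interpretation into concrete numbers, the four cells of the confusion matrix can be unpacked directly and used to recompute precision and recall by hand. A minimal sketch, reusing the cm computed above and assuming class 1 is the fraud class:

# For a binary confusion matrix, ravel() returns (tn, fp, fn, tp)
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f'True negatives: {tn}, False positives: {fp}')
print(f'False negatives: {fn}, True positives: {tp}')
print(f'Precision: {precision:.3f}, Recall: {recall:.3f}')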
Conclusion
Anomaly detection with machine learning is a powerful tool that can be used to combat credit card fraud. By identifying anomalous transactions, banks and other financial institutions can flag them for further investigation and prevent fraudulent transactions from being completed.
Professional Note
The author of this article is a data analytics professional and machine learning engineer with a deep understanding of ML techniques, and has successfully completed a number of data analytics and machine learning projects.