Data Analysis with Python: Fraud Detection using Benford’s Law – Part 1
I am a true believer that your curiosity to learn is what pushes your knowledge further in almost any career you are in. It was my curiosity that accelerated my professional and personal growth.
?Instead of solely relying on traditional fraud detection techniques like rules-based systems and manual review processes, my curiosity drove me to learn more sophisticated methodologies to identify fraudulent patterns and data anomalies.
Whether you are in the data analytics field or a small business owner, fraudulent transactions permeate almost every sector of industries.
According to the Association of Certified Fraud Examiners (ACFE), "the typical organization loses 5% of revenue to fraud each year.”? I also worked with different types of large datasets containing financial data anomalies when I previously worked as a business analyst.?
?Benford’s Law analysis is one of the widely used techniques for potential fraud detection due to its simplicity and effectiveness.
?Did you know that the IRS uses Benford’s Law to detect fraudulent corporate tax returns?
Where can it be applied?
Benford’s Law can be applied to any real-world data set such as the number of likes on social media, stock prices, expenses, invoice amounts, outgoing check amounts, accounts payable, accounts receivable, cash flow statements, tax returns, and so on.
?
How can everyday professionals apply Benford’s Law?
Many LinkedIn articles explain the theoretical concepts about Benford’s Law but, not the practical, real-life applications of it. For this reason, I thought it would be a good opportunity to highlight a practical example of fraud detection using Benford’s Law from one of my freelance works.
?For this data analysis, I specifically used Python but, I will also explain step by step how you can perform the same task using Excel so you can easily apply the analysis in your daily work involving data anomalies.
Let’s dive in.
After reading this series of articles, you will be able to:
?·?????Understand Benford’s Law analysis (Part 1)
·????? Analyze a practical application using Python (Part 1) and Excel (Part 2)
·????? Run insightful SQL queries to answer business questions (Part 2)
?
?Understanding Benford’s Law analysis
“whenever there is an external influence over people’s behavior, the possibility arises of a deviation from Benford’s law.” - MIT Technology Review
?Benford’s Law states that naturally occurring numbers follow logarithmic progression in a uniform distribution. More precisely, Benford’s Law holds when the expected distribution of the first digit from 1 through 9 occurs the most (approx. 30%) and the last digit 9 occurs the least (approx. 5%) in unmanipulated datasets containing naturally occurring numbers.
?In short, statistical deviation shown from Benford’s Law could theoretically justify further examination of the datasets where fraudulent data manipulation is suspected.
Below is the Benford’s Law distribution curve showing leading digits frequency generated from using Excel log 10 function.
Steps to analyze Benford’s Law using Python:
?To run a visualization in Python, you need to install matplotlib which is a visualization utility in Python.
To import a CSV file to Python, you can easily do this by installing Pandas (a Python library) which is widely used for data manipulation and analysis. Pandas is built on top of Numpy, which is used for mathematical operations and arrays, so you don’t have to separately install it.
?Or you can use below Python code to directly import your CSV file. Just make sure your CSV file is stored in your local folder first before running the code.
import csv
def load_data(yourfilename):
yourdata = []
with open(yourfilename) as your_CSV_file_name:
your_CSV_file_name = csv.reader(your_CSV_file_name, delimiter = ',')
for row in your_CSV_file_name:
yourdata.append(row)
return yourdata
#type in your file name with .csv
new_data = load_data('Benfordlaw_rawdata2.csv')
for row in new_data:
print(row)
Steps of Data Analysis with Python: Fraud Detection using Benford’s Law:
?
Step 1: import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
After the installation, you might get an error message like this “Import “matplotlib” could not be resolved from source Pylance”.
In such case, press CTRL + Shift + P (Window) or Command + Shift + P (macOS) to open the command palette and then select the correct interpreter. Then type “Python: select interpreter” in the search field and select the right interpreter.
?Or click on the bottom left corner where it says your “virtual environment” name. Once you choose the correct interpreter, the error squiggle line will be removed.
Step 2: Access data from CSV file
领英推荐
df = pd.read_csv('Benfordlaw_rawdata2.csv')
Step 3: It is always a good practice to look at your data types before analyzing data and correct them if necessary.
df.info()
Step 4: If the data type of the column contains a dollar sign or any other symbols, you can replace it into float.
df['First_digit'] = df['First_digit'].str.replace("$","",regex = True).replace(",","",regex = True).astype(float)
Step 5: Extract the first digit from the invoice amounts
df = df.reset_index()
df['extract_digit'] = df['Invoice_amount_received'].astype(str).str[:1]
print(df.set_index(['Invoice_amount_received', 'extract_digit']))
Step 5 output: I extracted the first digit from each invoice amount. The output matches with the numbers (First_digit column) which already have been calculated in Excel.
?
Step 6: frequency by first_digits in invoice amounts
first_digit_count = df.pivot_table(values = ['Invoice_amount_received'],
index = 'First_digit', aggfunc = len, fill_value = '')
Step 7: the sum total of invoice amounts by the first digit. I used pivot_table.
total_invoice_amt = df.pivot_table(values = ['Invoice_amount_received'],
index = 'First_digit', aggfunc = sum, fill_value = '')
Step 8: Merge the above two tables
df = total_invoice_amt.merge(first_digit_count, how = "left", left_on = 'First_digit',
right_index = True, suffixes = ['','_frequency'])
Step 9: Calculate the weighted percentage occurrence distribution for the subject data.
df['subjectData_distribution'] = df['Invoice_amount_received_frequency']/ df['Invoice_amount_received_frequency'].sum()
?Step 10: Calculate Benford's Law distribution log10(1+1/n)
range = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
df['Benfordlaw_distribution'] = np.log10(1 / range + 1)
df['leading_digit'] = range
print(df)
Step 10 output:
Step 11: Below codes display visualization (clustered columns and line chart)
fig, ax = plt.subplots(figsize = (8, 5))
df.reset_index()['Benfordlaw_distribution'].plot(kind = "line", ax = ax, secondary_y = False, color = 'red')
ax.legend(["Benford curve"])
df.plot(kind = 'bar', x = 'leading_digit', y = ['subjectData_distribution','Benfordlaw_distribution'],
ax = ax, color = ['salmon','dodgerblue'], rot = 0)
plt.show()
?Step 11 visualization output:?
As you can see, the statistical distributions of the fifth and sixth digits from the subject datasets do not conform to Benford’s Law.
In Part 2, I will explain step by step how you can accomplish the same task in Excel, and show you practical examples of SQL queries to answer insightful business questions and draw a conclusion.
References: Organizations Worldwide Lose Trillions of Dollars to Occupational?Fraud - ACFE (press release, Mar 30, 2022)???
How Benford’s Law Reveals Suspicious Activity on Twitter - MIT Technology Review (article, Apr 21, 2015)
#ACFE #certifiedfraudexaminer #CAMS #python #excel #SQL #dataanalytics #dataanalyst #fraud #benfordlaw