登录查看更多内容

Data Analysis with Python: Fraud Detection using Benford’s Law – Part 1

Jaeyun Ju

Former Lead Business Analyst | SQL | Python | Excel VBA | Google Advanced Data Analytics Certified

发布日期: 2024年7月25日

I am a true believer that your curiosity to learn is what pushes your knowledge further in almost any career you are in. It was my curiosity that accelerated my professional and personal growth.

?Instead of solely relying on traditional fraud detection techniques like rules-based systems and manual review processes, my curiosity drove me to learn more sophisticated methodologies to identify fraudulent patterns and data anomalies.

Whether you are in the data analytics field or a small business owner, fraudulent transactions permeate almost every sector of industries.

According to the Association of Certified Fraud Examiners (ACFE), "the typical organization loses 5% of revenue to fraud each year.”? I also worked with different types of large datasets containing financial data anomalies when I previously worked as a business analyst.?

?Benford’s Law analysis is one of the widely used techniques for potential fraud detection due to its simplicity and effectiveness.

?Did you know that the IRS uses Benford’s Law to detect fraudulent corporate tax returns?

Where can it be applied?

Benford’s Law can be applied to any real-world data set such as the number of likes on social media, stock prices, expenses, invoice amounts, outgoing check amounts, accounts payable, accounts receivable, cash flow statements, tax returns, and so on.

How can everyday professionals apply Benford’s Law?

Many LinkedIn articles explain the theoretical concepts about Benford’s Law but, not the practical, real-life applications of it. For this reason, I thought it would be a good opportunity to highlight a practical example of fraud detection using Benford’s Law from one of my freelance works.

?For this data analysis, I specifically used Python but, I will also explain step by step how you can perform the same task using Excel so you can easily apply the analysis in your daily work involving data anomalies.

Let’s dive in.

After reading this series of articles, you will be able to:

?·?????Understand Benford’s Law analysis (Part 1)

·????? Analyze a practical application using Python (Part 1) and Excel (Part 2)

·????? Run insightful SQL queries to answer business questions (Part 2)

?Understanding Benford’s Law analysis

“whenever there is an external influence over people’s behavior, the possibility arises of a deviation from Benford’s law.” - MIT Technology Review

?Benford’s Law states that naturally occurring numbers follow logarithmic progression in a uniform distribution. More precisely, Benford’s Law holds when the expected distribution of the first digit from 1 through 9 occurs the most (approx. 30%) and the last digit 9 occurs the least (approx. 5%) in unmanipulated datasets containing naturally occurring numbers.

?In short, statistical deviation shown from Benford’s Law could theoretically justify further examination of the datasets where fraudulent data manipulation is suspected.

Below is the Benford’s Law distribution curve showing leading digits frequency generated from using Excel log 10 function.

Steps to analyze Benford’s Law using Python:

?To run a visualization in Python, you need to install matplotlib which is a visualization utility in Python.

To import a CSV file to Python, you can easily do this by installing Pandas (a Python library) which is widely used for data manipulation and analysis. Pandas is built on top of Numpy, which is used for mathematical operations and arrays, so you don’t have to separately install it.

?Or you can use below Python code to directly import your CSV file. Just make sure your CSV file is stored in your local folder first before running the code.

import csv
def load_data(yourfilename):
    yourdata = []
    with open(yourfilename) as your_CSV_file_name:
        your_CSV_file_name = csv.reader(your_CSV_file_name, delimiter = ',')
        for row in your_CSV_file_name:
            yourdata.append(row)
        return yourdata

#type in your file name with .csv
new_data = load_data('Benfordlaw_rawdata2.csv') 
for row in new_data:
    print(row)

Steps of Data Analysis with Python: Fraud Detection using Benford’s Law:

Step 1: import libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

After the installation, you might get an error message like this “Import “matplotlib” could not be resolved from source Pylance”.

In such case, press CTRL + Shift + P (Window) or Command + Shift + P (macOS) to open the command palette and then select the correct interpreter. Then type “Python: select interpreter” in the search field and select the right interpreter.

?Or click on the bottom left corner where it says your “virtual environment” name. Once you choose the correct interpreter, the error squiggle line will be removed.

Step 2: Access data from CSV file

领英推荐

Beautifiers

360DigiTMG 1 年前

Navigating Legal Landscapes in Scraping, Parsing URLs…

Oxylabs.cn 10 个月前

Coding Resilience: How Python Empowers Cybersecurity

Django Stars 10 个月前

df = pd.read_csv('Benfordlaw_rawdata2.csv')

Step 3: It is always a good practice to look at your data types before analyzing data and correct them if necessary.

df.info()

Step 4: If the data type of the column contains a dollar sign or any other symbols, you can replace it into float.

df['First_digit'] = df['First_digit'].str.replace("$","",regex = True).replace(",","",regex = True).astype(float)

Step 5: Extract the first digit from the invoice amounts

df = df.reset_index()
df['extract_digit'] = df['Invoice_amount_received'].astype(str).str[:1]
print(df.set_index(['Invoice_amount_received', 'extract_digit']))

Step 5 output: I extracted the first digit from each invoice amount. The output matches with the numbers (First_digit column) which already have been calculated in Excel.

Step 6: frequency by first_digits in invoice amounts

first_digit_count = df.pivot_table(values = ['Invoice_amount_received'], 
                                                           index = 'First_digit', aggfunc = len, fill_value = '')

Step 7: the sum total of invoice amounts by the first digit. I used pivot_table.

total_invoice_amt = df.pivot_table(values = ['Invoice_amount_received'], 
                                                             index = 'First_digit', aggfunc = sum, fill_value = '')

Step 8: Merge the above two tables

df = total_invoice_amt.merge(first_digit_count, how = "left", left_on = 'First_digit', 
                                                     right_index = True, suffixes = ['','_frequency'])

Step 9: Calculate the weighted percentage occurrence distribution for the subject data.

df['subjectData_distribution'] = df['Invoice_amount_received_frequency']/ df['Invoice_amount_received_frequency'].sum()

?Step 10: Calculate Benford's Law distribution log10(1+1/n)

range = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
df['Benfordlaw_distribution'] = np.log10(1 / range + 1)
df['leading_digit'] = range
print(df)

Step 10 output:

Step 11: Below codes display visualization (clustered columns and line chart)

fig, ax = plt.subplots(figsize = (8, 5))
df.reset_index()['Benfordlaw_distribution'].plot(kind = "line", ax = ax, secondary_y = False, color =  'red')
ax.legend(["Benford curve"])
df.plot(kind = 'bar', x = 'leading_digit', y = ['subjectData_distribution','Benfordlaw_distribution'], 
        ax = ax, color = ['salmon','dodgerblue'], rot = 0)

plt.show()

?Step 11 visualization output:?

As you can see, the statistical distributions of the fifth and sixth digits from the subject datasets do not conform to Benford’s Law.

In Part 2, I will explain step by step how you can accomplish the same task in Excel, and show you practical examples of SQL queries to answer insightful business questions and draw a conclusion.

References: Organizations Worldwide Lose Trillions of Dollars to Occupational?Fraud - ACFE (press release, Mar 30, 2022)???

How Benford’s Law Reveals Suspicious Activity on Twitter - MIT Technology Review (article, Apr 21, 2015)

#ACFE #certifiedfraudexaminer #CAMS #python #excel #SQL #dataanalytics #dataanalyst #fraud #benfordlaw

要查看或添加评论，请登录

Jaeyun Ju的更多文章

Automated Regression Statistics and ANOVA Interpretation in Excel using Python

2025年3月6日

Automated Regression Statistics and ANOVA Interpretation in Excel using Python

??I built an Automated Regression Statistics Interpretation Tool In Excel using Python..
Automating Financial Analysis with Excel VBA: The Secret to Cutting Hours of Repetitive Tasks using Excel

2024年12月3日

Automating Financial Analysis with Excel VBA: The Secret to Cutting Hours of Repetitive Tasks using Excel

In a fast-paced financial landscape, time is the most valuable resource. That's why I built an Excel tool using macros…
SQL: Statistical Approach to Detecting Anomalies and Fraud using SQL

2024年10月10日

SQL: Statistical Approach to Detecting Anomalies and Fraud using SQL

SQL can serve as a powerful tool for conducting statistical data analysis, playing a crucial role in uncovering…

5 条评论
Honest Review: Google Advanced Data Analytics Certificate

2024年9月27日

Honest Review: Google Advanced Data Analytics Certificate

I recently completed the Google Advanced Data Analytics program from Google! I want to share what I've learned and…

5 条评论
Data Analysis with Excel and SQL: Fraud Detection using Benford’s Law – Part 2

2024年8月7日

Data Analysis with Excel and SQL: Fraud Detection using Benford’s Law – Part 2

In Part 1, we explored how Benford’s Law can be applied to fraud detection. We also analyzed a potential fraud case…
LEARN SQL: Free Quality Resources to Practice SQL in Your Browser

2024年7月17日

LEARN SQL: Free Quality Resources to Practice SQL in Your Browser

The best way to learn SQL is through hands-on practice. SQL is one of the high-demand technical skills not just in data…

See all articles

Data Analysis with Python: Fraud Detection using Benford’s Law – Part 1

Jaeyun Ju

Former Lead Business Analyst | SQL | Python | Excel VBA | Google Advanced Data Analytics Certified

Where can it be applied?

How can everyday professionals apply Benford’s Law?

After reading this series of articles, you will be able to:

?Understanding Benford’s Law analysis

Steps to analyze Benford’s Law using Python:

Steps of Data Analysis with Python: Fraud Detection using Benford’s Law:

领英推荐

Jaeyun Ju的更多文章

社区洞察

其他会员也浏览了

Python's Auditing Revolution: From Good to Great

A Data Analyst's Guide to Web Scraping With Python

Python for Dark Web OSINT: Automate Threat Monitoring

Independent Dark Web Data Breach Query with Python

Want to Learn a Useful Python Skill for Network Engineers? Extract Data from Unstructured Files!

Menu-Based System Health Check Script with Email Reports: Automating using Python and Bash

Extract Large Datasets from Salesforce using Python

Fidel Vetino Comprehensive Data Pipeline Implementation with ML, LLM, Deep Dive, Python Design Patterns, Airflow, Security Measures, RAG Integration..

Initial EDA for Fraud Detection in Python

How to wrangle the Data with Python?

Where can it be applied?

How can everyday professionals apply Benford’s Law?

After reading this series of articles, you will be able to:

?Understanding Benford’s Law analysis

Steps to analyze Benford’s Law using Python:

Steps of Data Analysis with Python: Fraud Detection using Benford’s Law:

领英推荐

Jaeyun Ju的更多文章

Automated Regression Statistics and ANOVA Interpretation in Excel using Python

Automating Financial Analysis with Excel VBA: The Secret to Cutting Hours of Repetitive Tasks using Excel

SQL: Statistical Approach to Detecting Anomalies and Fraud using SQL

Honest Review: Google Advanced Data Analytics Certificate

Data Analysis with Excel and SQL: Fraud Detection using Benford’s Law – Part 2

LEARN SQL: Free Quality Resources to Practice SQL in Your Browser

社区洞察

其他会员也浏览了

Python's Auditing Revolution: From Good to Great

A Data Analyst's Guide to Web Scraping With Python

Python for Dark Web OSINT: Automate Threat Monitoring

Independent Dark Web Data Breach Query with Python

Want to Learn a Useful Python Skill for Network Engineers? Extract Data from Unstructured Files!

Menu-Based System Health Check Script with Email Reports: Automating using Python and Bash

Extract Large Datasets from Salesforce using Python

Fidel Vetino Comprehensive Data Pipeline Implementation with ML, LLM, Deep Dive, Python Design Patterns, Airflow, Security Measures, RAG Integration..

Initial EDA for Fraud Detection in Python

How to wrangle the Data with Python?