Data Analysis with Excel and SQL: Fraud Detection using Benford’s Law – Part 2
In Part 1, we explored how Benford’s Law can be applied to fraud detection. We also analyzed a potential fraud case that did not conform to Benford’s Law using Python.
?This article offers a step-by-step guide to completing the same task using Excel. At the end of the tutorial, we will explore answering a few insightful business questions using SQL.
If you are interested in learning more about Benford’s Law and its application in Python, please refer to my last article "Data Analysis with Python: Fraud Detection using Benford’s Law – Part 1".
After reading this article, you will be able to:
·????? Analyze a practical application of Benford’s Law using Excel
·????? Execute insightful SQL queries to address business questions
·????? Draw a conclusion on the suitability of Benford’s Law for fraud detection
?
Here are the steps to analyze a case using Benford’s Law in Excel:
Step 1: Find the first digit using the LEFT function. (I used VALUE and LEFT functions).
Step 2: Count the number of occurrences of the first digit using the COUNTIF function.
Step 3: Calculate the weighted % of the occurrence of each leading digit.
Step 4: Calculate Benford’s Law distribution % using the LOG10 function.
Step 5: Insert visual charts and compare. (Highlight L1:M10 -> go to Insert -> click Combo -> select Clustered Column - Line)
Now, let’s compare the distribution of the subject data against the distribution of Benford’s Law.
As we have observed from the previous article, the statistical distributions of the leading digits of 5 and 6 in the subject datasets do not conform to Benford’s Law.
?Now that we have detected two anomalies from Benford’s Law comparison, we can find out more information associated with those two digits.
Let’s dive into some insightful business questions and use SQL to answer them.
As data professionals, sometimes we need to think like a “data detective.” Using SQL, we can ask questions, identify patterns, and establish relationships.?
For the following tutorials, we will use Common Table Expressions (CTEs) to break down complex queries into more manageable codes, making them more readable.
But first, let’s consider a hypothetical scenario involving a conflict of interest related to a procurement fraud scheme, such as employee-vendor collusion or bribery. This scenario involves an employee exhibiting favoritism towards one vendor over another by awarding it a contract above the market price.
We are specifically interested in investigating how frequently each employee, whose leading digits of transactions are 5 and 6, uses each vendor that has visited the office based on the visitors log.
?To ensure ethical handling of data, we will be using the following datasets, which contain made-up information: Below are two tables depicting Orders and Visitors data.
?
领英推荐
Orders Table:
Visitors Table:
In a database, you can utilize SQL queries to ask questions such as:?
Question 1: Rank the employees by the highest number of transactions that fall under the leading digits 5 and 6.
?
Code:
--Rank employees by the highest number of transactions that fall under the digits 5 and 6
WITH ranked_employee_cte AS (
SELECT
account_holder_name,
account_holder_id,
COUNT(first_digit) AS count_leading_digit,
RANK() OVER (
PARTITION BY account_holder_id
ORDER BY COUNT(first_digit) DESC
) AS most_frequent_digit
FROM orders
WHERE first_digit BETWEEN 5 AND 6
GROUP BY account_holder_name, account_holder_id
)
SELECT
DISTINCT account_holder_name,
account_holder_id,
count_leading_digit
FROM ranked_employee_cte AS ranked
LEFT JOIN visitors AS vendors
ON ranked.account_holder_id = vendors.authorized_personnel_id
WHERE most_frequent_digit = 1
ORDER BY count_leading_digit DESC;
Output:
Question 2: List the vendors used by each employee for each transaction in order of frequency among visitors.
?
Code:
--List the vendors among the visitors used by each employee in the highest frequency order.
WITH ranked_vendor_cte AS (
SELECT
account_holder_name,
vendor_name,
account_holder_id,
COUNT(vendor_name) AS frequency_vendor,
RANK() OVER (
PARTITION BY account_holder_id
ORDER BY COUNT(vendor_name) DESC
) AS most_frequent_vendor
FROM orders
WHERE first_digit BETWEEN 5 AND 6
GROUP BY account_holder_name, vendor_name, account_holder_id
)
SELECT
DISTINCT account_holder_name,
vendor_name,
account_holder_id,
frequency_vendor
FROM ranked_vendor_cte AS ranked_vendor
LEFT JOIN visitors AS vendors
ON ranked_vendor.account_holder_id = vendors.authorized_personnel_id
WHERE most_frequent_vendor = 1
AND vendors.vendor_name = ranked_vendor.vendor_name
ORDER BY frequency_vendor DESC;
Output:
From the two outputs, we can observe a clear association between employee ID 1138I and VendorID12. In fact, the employee predominantly used VendorID_12 for most of his or her orders.
As you can see, many analytical questions can be answered using SQL.?
Furthermore, we could use geographic longitude and latitude data to infer the distance proximity relationship between the employee and the vendor and
the time proximity relationship between the date when the vendor visited the "alleged" employee in the office and the date when the employee placed an order using that particular vendor’s service.
We shall examine these advanced SQL questions more in future articles.
In reality, establishing such a connection would require more than just a few SQL queries, but we can certainly connect the dots to lay the groundwork for an initial investigation.
?
?Conclusions:
As fraud detection tools and techniques become more advanced, fraudulent activities also become more complex. The traditional rule-based approach to fraud detection, which relies on predetermined attributes, has limitations in fast-evolving industries.
Benford’s Law is widely used as a tool for fraud detection. However, as noted by many academic researchers, a dataset deviating from Benford’s law does not necessarily indicate fraud. Instead, it serves as a starting point for further investigation to confirm fraud.
Some professionals argue that a better approach, while not entirely foolproof, is to combine traditional fraud detection methods with data-driven statistical analysis and advanced models like machine learning.
Although fraud detection tools are powerful, detecting fraud still requires human judgment and finesse, such as that of a seasoned fraud examiner. A fraud examiner's inquisitive judgment, particularly his or her analytical mindset, is essential for robust fraud detection and investigation.
In the near future, we will explore more practical ways to learn fraud detection techniques using Python, SQL, and Excel.
Thank you for taking the time to read this article during your busy schedule.
?
#ACFE #certifiedfraudexaminer #CAMS #python #excel #SQL #compliance #dataanalytics #fraud #benfordlaw #audit