Outlier Detection and Removal in Performance Marketing Data with Scikit-Learn’s IsolationForest
Björn Thomsen
Marketing Lead at meshcloud.io | Driving B2B Market Growth for Platform Engineering Company | Performance Marketing, Data Analytics, Marketing Strategy
In online marketing, data is everything. Whether you’re evaluating your TikTok ads performance, segmenting audiences, or optimizing Google campaign budgets, accurate data analysis is crucial. But there’s a common challenge: outliers. These data points deviate so much from the norm that they can distort averages, correlations, and other metrics.
In this guide, I want to walk you through a practical approach to detecting and visualizing outliers in Python, using scikit-learn for machine learning-based outlier detection and matplotlib for visualization. We’ll break the process down into simple steps.
By the way: This article is designed for Performance Marketers and Web Analysts with very basic Python skills.
Python Libraries You’ll Need to Install
Install the necessary libraries using your command-line interface:
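Assuming you use pip, the four libraries this tutorial relies on can be installed in one line:

```shell
pip install numpy pandas scikit-learn matplotlib
```

If you work in a Jupyter notebook, prefix the command with `%` to run it in a cell.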
Why Marketing Data Isn’t Always Straightforward
Before we start: Marketing data is rarely perfect. Despite the abundance of analytics tools and tracking systems, the datasets we rely on are often flawed: tracking gaps, bot traffic and click fraud, manual entry mistakes, and attribution quirks all introduce noise.
Outliers exacerbate these problems. They may inflate averages, distort correlations, or falsely trigger alarms in A/B testing. While some outliers are noise, others—like a spike in high-value customers—are worth investigating.
Understanding Outliers
Outliers are data points that deviate significantly from the majority. For example, a customer spending $10,000 in one transaction might be an outlier in a dataset where the average spend is $50. Outliers can stem from various causes: data entry or tracking errors, fraudulent activity such as click fraud, or genuine but rare events like an unusually high-value customer.
Statistical Methods for Outlier Detection
Statistical methods like the z-score and the interquartile range (IQR) can help define outliers: a common rule flags points more than three standard deviations from the mean (|z| > 3), or points more than 1.5 × IQR below the first quartile or above the third quartile.
Important: While these methods are robust for small datasets, modern machine learning techniques like Isolation Forest excel with complex, multidimensional data, which is often the case in marketing. We will use the scikit-learn library for outlier detection.
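Before reaching for machine learning, it helps to see the classic rules in action. Here is a minimal sketch on a toy spend series (the numbers are illustrative, not from the tutorial's dataset):

```python
import numpy as np

# Toy spend series with one obvious anomaly at the end
spend = np.array([48.0, 50.0, 52.0, 49.0, 51.0, 50.0, 47.0, 53.0, 200.0])

# Z-score rule: flag points more than 3 standard deviations from the mean
z_scores = (spend - spend.mean()) / spend.std()
z_outliers = spend[np.abs(z_scores) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(spend, [25, 75])
iqr = q3 - q1
iqr_outliers = spend[(spend < q1 - 1.5 * iqr) | (spend > q3 + 1.5 * iqr)]

print(z_outliers)    # empty: the outlier inflates the std, masking itself
print(iqr_outliers)  # [200.]: the IQR rule catches it
```

Note the instructive failure: the extreme value inflates the standard deviation so much that its own z-score stays below 3 (a "masking" effect), while the quartile-based IQR rule still flags it. This is one reason robust or model-based methods are preferred on messy marketing data.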
Step 1: Generating the Dataset
To demonstrate outlier detection, we first need a dataset. For this tutorial, we simulate data representing daily Google Ads spend and clicks—two common metrics in online marketing. We also inject outliers to mimic real-world anomalies, such as click fraud or high-value customers.
In the code below: Spend and Clicks are generated with a normal distribution, simulating typical user behavior. Outliers are added manually to reflect extreme cases like suspiciously high clicks or unusually high spending.
import numpy as np
import pandas as pd

# Step 1: Generate fake marketing data
def generate_dataset():
    # Generate normal spend and clicks
    spend = np.random.normal(50, 5, 95)   # Average spend = $50, std deviation = 5
    clicks = np.random.normal(35, 5, 95)  # Average clicks = 35, std deviation = 5
    # Add outliers to simulate anomalies
    spend_outliers = np.random.uniform(90, 120, 5)    # unusually high spend
    clicks_outliers = np.random.uniform(100, 150, 5)  # suspiciously high clicks
    spend = np.concatenate([spend, spend_outliers])
    clicks = np.concatenate([clicks, clicks_outliers])
    # Combine into a DataFrame
    data = pd.DataFrame({'Spend': spend, 'Clicks': clicks})
    return data

# Generate the dataset
data = generate_dataset()
print(data.head())  # View the first few rows
Explanation: We generated a synthetic dataset with two features, "Spend" and "Clicks," using the np.random.normal function to create data points with specified means and standard deviations, simulating realistic user behavior. Outliers were then introduced with the np.random.uniform function, which draws uniformly distributed random values over specified ranges to simulate anomalous data points.
The np.concatenate method is then used to merge the normal data with the outliers for both features, and the pd.DataFrame constructor organizes the resulting arrays into a structured table format for further analysis.
Step 2: Detecting Outliers with Isolation Forest
Outliers in multidimensional data are tricky to identify manually. Here, we use the Isolation Forest algorithm from scikit-learn. It works by isolating anomalies based on how easily they can be separated from the rest of the data. The algorithm assigns each data point a score, flagging outliers with extreme scores.
Isolation Forest uses random partitions of the data to measure how easily each point can be isolated; points that are isolated quickly are likely to be outliers. The contamination parameter specifies the expected proportion of outliers in the dataset. In this example, we assume 5% of the data points are anomalies.
from sklearn.ensemble import IsolationForest

# Step 2: Detect outliers
def detect_outliers(data):
    model = IsolationForest(contamination=0.05)
    # Fit the model and predict outliers
    data['Outlier'] = model.fit_predict(data[['Spend', 'Clicks']])
    # Label the points
    data['Outlier'] = data['Outlier'].apply(lambda x: 'Outlier' if x == -1 else 'Normal')
    return data

# Detect outliers
data_with_outliers = detect_outliers(data)

# Display the outliers
outliers = data_with_outliers[data_with_outliers['Outlier'] == 'Outlier']
print("Outliers Detected:")
print(outliers)
Explanation: This code uses IsolationForest from the sklearn.ensemble module to detect anomalies in the dataset. The IsolationForest model is initialized with a contamination parameter of 0.05, indicating that 5% of the data points are expected to be outliers.
The model is trained on the "Spend" and "Clicks" columns using the fit_predict method, which assigns a label of -1 to outliers and 1 to normal points. A lambda function is applied to relabel the results as 'Outlier' or 'Normal' for readability, and the final dataset is filtered to display only the rows flagged as outliers.
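If you want more nuance than the hard -1/1 labels, IsolationForest also exposes a continuous anomaly score via decision_function; the points labelled -1 are exactly those with negative scores. A small standalone sketch, using its own toy data rather than the tutorial's DataFrame:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data in the same shape as the tutorial: 95 typical rows, 5 anomalies
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(50, 5, size=(95, 2)),    # typical spend/clicks
    rng.uniform(90, 120, size=(5, 2)),  # injected anomalies
])

model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(X)        # -1 = outlier, 1 = normal
scores = model.decision_function(X)  # continuous score, lower = more anomalous

# The hard labels are just the sign of the scores
print((labels == -1).sum(), "points flagged as outliers")
```

Sorting by these scores lets you review the "most anomalous" rows first instead of treating all flagged points equally, which is handy when a human has to decide what is fraud and what is a genuinely great customer.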
Step 3: Visualizing Outliers
Visualizing our data helps to understand patterns and anomalies at a glance. Here, we use Matplotlib for a basic scatter plot to show normal points (blue) and outliers (red).
import matplotlib.pyplot as plt

# Step 3: Visualize the data with a dark color theme
def visualize_data(data):
    plt.style.use('dark_background')  # Apply a dark background style
    plt.figure(figsize=(10, 7))  # Larger figure size for better clarity
    colors = {'Normal': '#1f77b4', 'Outlier': '#d62728'}  # Cool blue for normal, striking red for outliers
    markers = {'Normal': 'o', 'Outlier': 'x'}  # Different markers for outliers
    for label in colors:
        subset = data[data['Outlier'] == label]
        plt.scatter(
            subset['Spend'],
            subset['Clicks'],
            label=label,
            c=colors[label],
            alpha=0.9,
            s=90,  # Uniform marker size
            marker=markers[label]
        )
    plt.title('Outlier Detection in Marketing Data', fontsize=16, weight='bold', color='white')
    plt.xlabel('Spend ($)', fontsize=12, color='white')
    plt.ylabel('Clicks', fontsize=12, color='white')
    plt.legend(title="Data Type", fontsize=10, title_fontsize=12, loc='upper left', facecolor='#2b2b2b')
    plt.grid(True, linestyle='--', alpha=0.3, color='white')
    plt.tight_layout()  # Ensure labels fit within the figure
    plt.show()

visualize_data(data_with_outliers)
Explanation: Here we use matplotlib.pyplot to create a scatter plot that visualizes the detected outliers and normal data points in a dark-themed chart. The plt.style.use('dark_background') applies a modern dark background, and plt.figure(figsize=(10, 7)) sets a large canvas size for better visibility.
Each label ('Normal' or 'Outlier') is filtered and plotted separately using plt.scatter, with distinct markers (o for normal, x for outliers) and custom colors.
Step 4: Removing Outliers
We have detected outliers. But should we remove them? It depends! Outlier removal in marketing data is justified when anomalies stem from errors or irrelevant noise, or when they disrupt analytical assumptions like normality in statistical models. In our case, we assume the dataset includes instances of click fraud in Google Ads, which we aim to identify and exclude.
# Remove outliers from the dataset
def remove_outliers(data):
    # Filter out rows labeled as 'Outlier'
    clean_data = data[data['Outlier'] == 'Normal'].drop(columns=['Outlier'])
    return clean_data

# Remove outliers
clean_data = remove_outliers(data_with_outliers)

# Display the cleaned dataset
print("Cleaned Dataset (Outliers Removed):")
print(clean_data)

# Optionally, check the shape of the dataset
print(f"Original Dataset Shape: {data_with_outliers.shape}")
print(f"Cleaned Dataset Shape: {clean_data.shape}")
Explanation: Finally, it’s time to kick those pesky outliers out of your dataset. The code above filters the dataset to retain only rows labeled as Normal, effectively removing all identified outliers. It also drops the Outlier column for a cleaner output, ensuring the dataset is ready for further analysis.
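To see concretely why those few rows matter, here is a tiny standalone illustration (toy numbers, not the tutorial's generated data) of how much a handful of outliers can pull up the average spend:

```python
import numpy as np
import pandas as pd

# 95 typical customers at $50, plus 5 outliers at $120
spend = np.concatenate([np.full(95, 50.0), np.full(5, 120.0)])
df = pd.DataFrame({"Spend": spend})

mean_with = df["Spend"].mean()                        # pulled up by the 5 outliers
mean_without = df["Spend"][df["Spend"] < 100].mean()  # outliers filtered out

print(mean_with, mean_without)  # 53.5 vs 50.0
```

Five rows out of a hundred shift the average by 7%; in a real budget report, that is the difference between a campaign looking healthy and looking overspent.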
Conclusion
Outlier detection in online marketing is about more than just cleaning data. It’s about uncovering hidden patterns and opportunities. By understanding how outliers distort averages and correlations, and by using robust methods like Isolation Forest, marketers can improve their analyses and make more informed decisions.
The Python example provided is a starting point, simplified to show how you can implement outlier detection in your marketing workflow. In real-world applications, the dataset might be larger, the features more complex, and the stakes much higher.