Outlier Detection and Removal in Performance Marketing Data with Scikit-Learn’s IsolationForest
Björn Thomsen
Marketing Lead at meshcloud.io | Driving B2B Market Growth for Platform Engineering Company | Performance Marketing, Data Analytics, Marketing Strategy
In online marketing, data is everything. Whether you’re evaluating your TikTok ads performance, segmenting audiences, or optimizing Google campaign budgets, accurate data analysis is crucial. But there’s a common challenge: outliers. These data points deviate so much from the norm that they can distort averages, correlations, and other metrics.
In this guide, I want to walk you through a practical approach to detecting and visualizing outliers in Python, using scikit-learn for machine learning-based outlier detection and matplotlib for visualization. We’ll break the process down into simple steps.
By the way: This article is designed for Performance Marketers and Web Analysts with very basic Python skills.
Python Libraries You’ll Need to Install
Install the necessary libraries using your command-line interface:
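Assuming you use pip, the four libraries this tutorial relies on can be installed in one line:

```shell
pip install numpy pandas scikit-learn matplotlib
```

If you work in a Jupyter notebook, prefix the command with `%` to run it in a cell.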
Why Marketing Data Isn’t Always Straightforward
Before we start: Marketing data is rarely perfect. Despite the abundance of analytics tools and tracking systems, the datasets we rely on are often flawed: tracking gaps, bot traffic and click fraud, manual entry mistakes, and attribution quirks all introduce noise.
Outliers exacerbate these problems. They may inflate averages, distort correlations, or falsely trigger alarms in A/B testing. While some outliers are noise, others—like a spike in high-value customers—are worth investigating.
Understanding Outliers
Outliers are data points that deviate significantly from the majority. For example, a customer spending $10,000 in one transaction might be an outlier in a dataset where the average spend is $50. Outliers can stem from various causes: data entry or tracking errors, fraudulent activity such as click fraud, or genuine but rare events like an unusually high-value customer.
Statistical Methods for Outlier Detection
Statistical methods like the z-score and the interquartile range (IQR) can help define outliers: a common rule flags points more than three standard deviations from the mean (|z| > 3), or points more than 1.5 × IQR below the first quartile or above the third quartile.
Important: While these methods are robust for small datasets, modern machine learning techniques like Isolation Forest excel with complex, multidimensional data, which is often the case in marketing. We will use the scikit-learn library for outlier detection.
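Before reaching for machine learning, it helps to see the classic rules in action. Here is a minimal sketch on a toy spend series (the numbers are illustrative, not from the tutorial's dataset):

```python
import numpy as np

# Toy spend series with one obvious anomaly at the end
spend = np.array([48.0, 50.0, 52.0, 49.0, 51.0, 50.0, 47.0, 53.0, 200.0])

# Z-score rule: flag points more than 3 standard deviations from the mean
z_scores = (spend - spend.mean()) / spend.std()
z_outliers = spend[np.abs(z_scores) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(spend, [25, 75])
iqr = q3 - q1
iqr_outliers = spend[(spend < q1 - 1.5 * iqr) | (spend > q3 + 1.5 * iqr)]

print(z_outliers)    # empty: the outlier inflates the std, masking itself
print(iqr_outliers)  # [200.]: the IQR rule catches it
```

Note the instructive failure: the extreme value inflates the standard deviation so much that its own z-score stays below 3 (a "masking" effect), while the quartile-based IQR rule still flags it. This is one reason robust or model-based methods are preferred on messy marketing data.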
Step 1: Generating the Dataset
To demonstrate outlier detection, we first need a dataset. For this tutorial, we simulate data representing daily Google Ads spend and clicks—two common metrics in online marketing. We also inject outliers to mimic real-world anomalies, such as click fraud or high-value customers.
In the code below: Spend and Clicks are generated with a normal distribution, simulating typical user behavior. Outliers are added manually to reflect extreme cases like suspiciously high clicks or unusually high spending.
import numpy as np
import pandas as pd

# Step 1: Generate fake marketing data
def generate_dataset():
    # Generate normal spend and clicks
    spend = np.random.normal(50, 5, 95)   # Average spend = $50, std deviation = 5
    clicks = np.random.normal(35, 5, 95)  # Average clicks = 35, std deviation = 5
    # Add outliers to simulate anomalies
    spend_outliers = np.random.uniform(90, 120, 5)    # unusually high spend
    clicks_outliers = np.random.uniform(100, 150, 5)  # suspiciously high clicks
    spend = np.concatenate([spend, spend_outliers])
    clicks = np.concatenate([clicks, clicks_outliers])
    # Combine into a DataFrame
    data = pd.DataFrame({'Spend': spend, 'Clicks': clicks})
    return data

# Generate the dataset
data = generate_dataset()
print(data.head())  # View the first few rows
Explanation: We generated a synthetic dataset with two features, "Spend" and "Clicks," using the np.random.normal function to create data points with specified means and standard deviations, simulating realistic user behavior. Outliers were then introduced with the np.random.uniform function, which draws uniformly distributed random values over specified ranges to simulate anomalous data points.
The np.concatenate method is then used to merge the normal data with the outliers for both features, and the pd.DataFrame constructor organizes the resulting arrays into a structured table format for further analysis.
Step 2: Detecting Outliers with Isolation Forest
Outliers in multidimensional data are tricky to identify manually. Here, we use the Isolation Forest algorithm from scikit-learn. It works by isolating anomalies based on how easily they can be separated from the rest of the data. The algorithm assigns each data point a score, flagging outliers with extreme scores.
Isolation Forest uses random partitions of the data to measure how easily each point can be isolated; points that are isolated quickly are likely to be outliers. The contamination parameter specifies the expected proportion of outliers in the dataset. In this example, we assume 5% of the data points are anomalies.
from sklearn.ensemble import IsolationForest

# Step 2: Detect outliers
def detect_outliers(data):
    model = IsolationForest(contamination=0.05)
    # Fit the model and predict outliers
    data['Outlier'] = model.fit_predict(data[['Spend', 'Clicks']])
    # Label the points
    data['Outlier'] = data['Outlier'].apply(lambda x: 'Outlier' if x == -1 else 'Normal')
    return data

# Detect outliers
data_with_outliers = detect_outliers(data)

# Display the outliers
outliers = data_with_outliers[data_with_outliers['Outlier'] == 'Outlier']
print("Outliers Detected:")
print(outliers)
Explanation: This code uses IsolationForest from the sklearn.ensemble module to detect anomalies in the dataset. The IsolationForest model is initialized with a contamination parameter of 0.05, indicating that 5% of the data points are expected to be outliers.
The model is trained on the "Spend" and "Clicks" columns using the fit_predict method, which assigns a label of -1 to outliers and 1 to normal points. A lambda function is applied to relabel the results as 'Outlier' or 'Normal' for readability, and the final dataset is filtered to display only the rows flagged as outliers.
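If you want more nuance than the hard -1/1 labels, IsolationForest also exposes a continuous anomaly score via decision_function; the points labelled -1 are exactly those with negative scores. A small standalone sketch, using its own toy data rather than the tutorial's DataFrame:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data in the same shape as the tutorial: 95 typical rows, 5 anomalies
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(50, 5, size=(95, 2)),    # typical spend/clicks
    rng.uniform(90, 120, size=(5, 2)),  # injected anomalies
])

model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(X)        # -1 = outlier, 1 = normal
scores = model.decision_function(X)  # continuous score, lower = more anomalous

# The hard labels are just the sign of the scores
print((labels == -1).sum(), "points flagged as outliers")
```

Sorting by these scores lets you review the "most anomalous" rows first instead of treating all flagged points equally, which is handy when a human has to decide what is fraud and what is a genuinely great customer.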
Step 3: Visualizing Outliers
Visualizing our data helps to understand patterns and anomalies at a glance. Here, we use Matplotlib for a basic scatter plot to show normal points (blue) and outliers (red).
import matplotlib.pyplot as plt

# Step 3: Visualize the data with a dark color theme
def visualize_data(data):
    plt.style.use('dark_background')  # Apply a dark background style
    plt.figure(figsize=(10, 7))  # Larger figure size for better clarity
    colors = {'Normal': '#1f77b4', 'Outlier': '#d62728'}  # Cool blue for normal, striking red for outliers
    markers = {'Normal': 'o', 'Outlier': 'x'}  # Different markers for outliers
    for label in colors:
        subset = data[data['Outlier'] == label]
        plt.scatter(
            subset['Spend'],
            subset['Clicks'],
            label=label,
            c=colors[label],
            alpha=0.9,
            s=90,  # Uniform marker size
            marker=markers[label]
        )
    plt.title('Outlier Detection in Marketing Data', fontsize=16, weight='bold', color='white')
    plt.xlabel('Spend ($)', fontsize=12, color='white')
    plt.ylabel('Clicks', fontsize=12, color='white')
    plt.legend(title="Data Type", fontsize=10, title_fontsize=12, loc='upper left', facecolor='#2b2b2b')
    plt.grid(True, linestyle='--', alpha=0.3, color='white')
    plt.tight_layout()  # Ensure labels fit within the figure
    plt.show()

visualize_data(data_with_outliers)
Explanation: Here we use matplotlib.pyplot to create a scatter plot that visualizes the detected outliers and normal data points in a dark-themed chart. The plt.style.use('dark_background') applies a modern dark background, and plt.figure(figsize=(10, 7)) sets a large canvas size for better visibility.
Each label ('Normal' or 'Outlier') is filtered and plotted separately using plt.scatter, with distinct markers (o for normal, x for outliers) and custom colors.
Step 4: Removing Outliers
We have detected outliers. But should we remove them? It depends! Outlier removal in marketing data is justified when anomalies stem from errors or irrelevant noise, or when they disrupt analytical assumptions like normality in statistical models. In our case, we assume the dataset includes instances of click fraud in Google Ads, which we aim to identify and exclude.
# Remove outliers from the dataset
def remove_outliers(data):
    # Filter out rows labeled as 'Outlier'
    clean_data = data[data['Outlier'] == 'Normal'].drop(columns=['Outlier'])
    return clean_data

# Remove outliers
clean_data = remove_outliers(data_with_outliers)

# Display the cleaned dataset
print("Cleaned Dataset (Outliers Removed):")
print(clean_data)

# Optionally, check the shape of the dataset
print(f"Original Dataset Shape: {data_with_outliers.shape}")
print(f"Cleaned Dataset Shape: {clean_data.shape}")
Explanation: Finally, it’s time to kick those pesky outliers out of your dataset. The code above filters the dataset to retain only rows labeled as Normal, effectively removing all identified outliers. It also drops the Outlier column for a cleaner output, ensuring the dataset is ready for further analysis.
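To see concretely why those few rows matter, here is a tiny standalone illustration (toy numbers, not the tutorial's generated data) of how much a handful of outliers can pull up the average spend:

```python
import numpy as np
import pandas as pd

# 95 typical customers at $50, plus 5 outliers at $120
spend = np.concatenate([np.full(95, 50.0), np.full(5, 120.0)])
df = pd.DataFrame({"Spend": spend})

mean_with = df["Spend"].mean()                        # pulled up by the 5 outliers
mean_without = df["Spend"][df["Spend"] < 100].mean()  # outliers filtered out

print(mean_with, mean_without)  # 53.5 vs 50.0
```

Five rows out of a hundred shift the average by 7%; in a real budget report, that is the difference between a campaign looking healthy and looking overspent.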
Conclusion
Outlier detection in online marketing is about more than just cleaning data. It’s about uncovering hidden patterns and opportunities. By understanding how outliers distort averages and correlations, and by using robust methods like Isolation Forest, marketers can improve their analyses and make more informed decisions.
The Python example provided is a starting point, simplified to show how you can implement outlier detection in your marketing workflow. In real-world applications, the dataset might be larger, the features more complex, and the stakes much higher.