Visualizing Nullity: How to Uncover Gaps in Your Marketing Data With Python

Visualizing Nullity: How to Uncover Gaps in Your Marketing Data With Python

Nullity is a significant issue in the analysis of marketing data, particularly when integrating performance metrics from various marketing channels and ad networks. This often results in gaps or missing values, which can distort clean attribution of the contribution of a marketing channel or ad network, or even the entire statistical analysis.

Dealing with these gaps in the context of data wrangling involves various methods as per statistical theory. Common approaches include dropping rows with missing metrics, setting them to "null," or imputing them with a measure such as the mean, median, or arithmetic/logarithmic mean, depending on the context. This topic delves into its own field of study, which we won't explore further in this article.

Our focus in this article is solely on visualizing nullity to easily identify the extent of missing data and any resulting interdependencies. For this purpose, we utilize the Python library "missingno" available at: https://github.com/ResidentMario/missingno. This tool allows marketers, even those without in-depth knowledge of data analysis, to gain a visual understanding of the nullity within their data.

By the way: Not only 0, null, "", and NaN pose problems in data interpretation. Outliers such as min and max values often distort the overall picture as well. A single massively underperforming ad with an extremely high CPA can skew the entire performance review. Therefore, always be on the lookout for outliers and gaps.

Installing missingno

The first step is to locally open a new Jupyter Notebook. You can find how to install and run a Jupyter Notebook here: https://jupyter.org/install

We also need the missingno library for Python. We can install it using the following command line in our console:

pip install missingno        

Now we open a new Jupyter Notebook and import our marketing performance data, for which I will use dummy values in CSV format for Google Ads, TikTok Ads, and LinkedIn Ads, as a DataFrame. You can replace the file path with your own CSV file. By the way, missingno makes use of Matplotlib, as we can see in the code.

import pandas as pd 
import missingno as msno 
df = pd.read_csv(r"C:\Users\Bj?rnThomsen\Desktop\your-csv-file.csv", delimiter=',', skiprows=0, low_memory=False) 
%matplotlib inline        

The Nullity Matrix

Using the matrix() method, we produce our initial visualization, portraying the magnitude of nullity within the dataset. White lacunae signify absent dimensions. The sparkline on the right encapsulates the overall profile of data completeness, identifying rows with the maximum and minimum nullity in the dataset.

msno.matrix(df.sample(35))        
White lacunae signify absent dimensions in the missingno nullity matrix.

The Nullity Dendrogram

Next, we generate a dendrogram with the dendrogram() method to illustrate the hierarchical clustering of nullity correlations between values. While this may sound complex, it simply means that we represent pairs of metrics or clusters in a hierarchical tree, whose values or nullity are particularly strongly correlated.

Due to the lack of performance data in our Linkedin Remarketing Campaigns in Asia, as well as missing region codes for our Google Ads Campaigns, comparability is obviously compromised.

msno.dendrogram(df)        
The dendrogram method in missingno creates a correlation tree for nullity.

The Nullity Correlation Heatmap

Lastly, I would like to create a nullity correlation heatmap using the heatmap() method. This is not a conventional heatmap that simply calculates the Pearson product-moment correlation coefficient of the values contained within. Instead, it focuses solely on the correlation of presence or absence of nullity.

This operates as follows: Nullity correlation ranges from -1 (indicating that if one variable appears, the other definitely does not) to 0 (suggesting that the appearance or absence of variables has no effect on each other) to 1 (indicating that if one variable appears, the other definitely does as well).

msno.heatmap(df)        
Not the typical correlation heatmap you know: This one focuses solely on the correlation of nullity.

Conclusion

From a personal perspective, entries with missing data should be regarded with significant skepticism in the overall analysis. Addressing nullity through various statistical imputation techniques is essential for maintaining data integrity. For an initial visual assessment of the extent of nullity, especially with limited data analysis resources at hand, missingno in Python stands out as the preferred choice for an initial assessment

Merged marketing data in particular, such as performance values from various marketing channels and ad networks, tend to have gaps. This is due to the nature of the matter: Video performance produces different metrics than email marketing or SEO.

By the way, rarely have I found an article to come together so effortlessly. With missingno, visualizing nullity is easy as pie. ??


要查看或添加评论,请登录

社区洞察