Visualizing Nullity: How to Uncover Gaps in Your Marketing Data With Python
Bj?rn Thomsen
Marketing Lead at meshcloud.io | Accelerating B2B Market Growth | Professional in Performance Marketing & Web Analytics
Nullity is a significant issue in the analysis of marketing data, particularly when integrating performance metrics from various marketing channels and ad networks. This often results in gaps or missing values, which can distort clean attribution of the contribution of a marketing channel or ad network, or even the entire statistical analysis.
Dealing with these gaps in the context of data wrangling involves various methods as per statistical theory. Common approaches include dropping rows with missing metrics, setting them to "null," or imputing them with a measure such as the mean, median, or arithmetic/logarithmic mean, depending on the context. This topic delves into its own field of study, which we won't explore further in this article.
Our focus in this article is solely on visualizing nullity to easily identify the extent of missing data and any resulting interdependencies. For this purpose, we utilize the Python library "missingno" available at: https://github.com/ResidentMario/missingno . This tool allows marketers, even those without in-depth knowledge of data analysis, to gain a visual understanding of the nullity within their data.
By the way: Not only 0, null, "", and NaN pose problems in data interpretation. Outliers such as min and max values often distort the overall picture as well. A single massively underperforming ad with an extremely high CPA can skew the entire performance review. Therefore, always be on the lookout for outliers and gaps.
Installing missingno
The first step is to locally open a new Jupyter Notebook. You can find how to install and run a Jupyter Notebook here: https://jupyter.org/install
We also need the missingno library for Python. We can install it using the following command line in our console:
pip install missingno
Now we open a new Jupyter Notebook and import our marketing performance data, for which I will use dummy values in CSV format for Google Ads, TikTok Ads, and LinkedIn Ads, as a DataFrame. You can replace the file path with your own CSV file. By the way, missingno makes use of Matplotlib, as we can see in the code.
import pandas as pd
import missingno as msno
df = pd.read_csv(r"C:\Users\Bj?rnThomsen\Desktop\your-csv-file.csv", delimiter=',', skiprows=0, low_memory=False)
%matplotlib inline
The Nullity Matrix
Using the matrix() method, we produce our initial visualization, portraying the magnitude of nullity within the dataset. White lacunae signify absent dimensions. The sparkline on the right encapsulates the overall profile of data completeness, identifying rows with the maximum and minimum nullity in the dataset.
msno.matrix(df.sample(35))
The Nullity Dendrogram
Next, we generate a dendrogram with the dendrogram() method to illustrate the hierarchical clustering of nullity correlations between values. While this may sound complex, it simply means that we represent pairs of metrics or clusters in a hierarchical tree, whose values or nullity are particularly strongly correlated.
Due to the lack of performance data in our Linkedin Remarketing Campaigns in Asia, as well as missing region codes for our Google Ads Campaigns, comparability is obviously compromised.
msno.dendrogram(df)
The Nullity Correlation Heatmap
Lastly, I would like to create a nullity correlation heatmap using the heatmap() method. This is not a conventional heatmap that simply calculates the Pearson product-moment correlation coefficient of the values contained within. Instead, it focuses solely on the correlation of presence or absence of nullity.
This operates as follows: Nullity correlation ranges from -1 (indicating that if one variable appears, the other definitely does not) to 0 (suggesting that the appearance or absence of variables has no effect on each other) to 1 (indicating that if one variable appears, the other definitely does as well).
msno.heatmap(df)
Conclusion
From a personal perspective, entries with missing data should be regarded with significant skepticism in the overall analysis. Addressing nullity through various statistical imputation techniques is essential for maintaining data integrity. For an initial visual assessment of the extent of nullity, especially with limited data analysis resources at hand, missingno in Python stands out as the preferred choice for an initial assessment
Merged marketing data in particular, such as performance values from various marketing channels and ad networks, tend to have gaps. This is due to the nature of the matter: Video performance produces different metrics than email marketing or SEO.
By the way, rarely have I found an article to come together so effortlessly. With missingno, visualizing nullity is easy as pie. ??