The Mystery of NULL Values: Why They Matter and How to Tackle Them
Yugandhara Saste
Data Engineer @ Bristlecone | Azure Data Factory | Databricks | Pyspark | AWS (S3, EMR, Athena) | Python | Snowflake | MYSQL | Airflow
what challenge is greater than cleaning a dataset? Regardless of whether you choose SQL or Python as your weapon, you have to deal with messy data and interact with what we know as NULL values.
We have to be aware that dealing with NULL values can be tricky, but it’s essential for accurate data analysis.
What are these NULL values?
When I first encountered the concept of NULL, my first thought was: “Why do we need NULLs when we have empty strings?”. At that time I didn’t even understand the difference between NULL and any other way to indicate that there is nothing.
These small intruders ( NULL values) within our data represent missing or unknown data. It’s a special way to indicate that the field has no value assigned to it.
Often the concept of NULL values is mistaken for an empty string or a 0 value, but that’s nothing more wrong than that.
And I'm going to show you why ?? Ready?
Let’s assume that we have a table named space_usagewith 4 columns, like the one below:
In each column, I've added the specific values: NULL , empty value, and 0. Now if we want to see the length in bytes for each we run the following script:
SELECT id, DATALENGTH(val_null) AS size_null, DATALENGTH(val_empty) AS size_empty, DATALENGTH(val_zero) AS size_zeroFROM space_usage;
The result will be:
Now, the differences between these 3 are pretty obvious, right???
Why do NULLs exist?
When you receive a dataset and start working with it, you might wonder why these small enemies exist.
It’s a valid question, and handling and resolving these values can be frustrating because it’s often unclear where they come from.
But…
Something always happens for a reason, and so does the existence of these NULL values; some of those are listed below:
Let’s deal with these little traitors
Dealing with these tiny invaders is vital for anyone who works with data and loves clean data. Fortunately, several ways exist to deal with these small but exhausting fights.
First thing first, you need to identify them and challenge them to battle by using the IS NULL condition:
SELECT Account_id, Transaction_Date, Amount, Fee
FROM Transactions
WHERE Fee IS NULL;
Using this SELECT statement we identify all the NULL values that make their way into our Fee column.
A few ways to deal with these are:
SELECT account_id, transaction_date,fee,
CASE
WHEN fee IS NULL then 0
ELSE fee
END AS not_null_fee
FROM Transactions;
Result:
2. ISNULL() — want to check something, but you don’t want to use a CASE statement? That’s the method:
SELECT account_id, transaction_date,
amount, ISNULL(amount, 0) AS not_null_amount
FROM Transactions
Result:
3. COALESCE() — another fancy way is to use the COALESCE function
领英推荐
SELECT account_id, transaction_date, fee, COALESCE(fee, 1) AS not_null_fee
FROM Transactions
You were waiting for an image showing the results, right? Sorry to disappoint you, but this result is the same as the one returned by using the ISNULL() function, so there is no need for more images ??.
4. NVL() — this function is PL/SQL characteristic, and it does the same thing as the ones above
SELECT account_id, transaction_date, fee, NVL(fee, 1) AS not_null_fee
FROM Transactions
Now that we have new tools in our toolkit, let’s begin the adventures and do some practice ????
Please use what we learn
Let’s do a short exercise of imagination…Let’s say that you’re one of the data analysts at one of the largest banks, and top management needs you to generate a financial report based on the transactions table with the following key tasks:
Good, now it’s time to take a look at our Transactions table??
?? OMG, what are we having here?
Yes, you’re right… NULLvalues…lots of them. But you’re a data guru who knows how to deal with them.
Let’s take small steps and handle all the tasks with experience:
Identify transactions with missing critical information
In the banking niche, the columns that are related to accounts are vital, so in our case, columns such as account_id, transaction_date, or amount play the role of critical information, and to identify these, you can use the following query:
SELECT
transaction_id,
account_id,
transaction_date,
amount,
description,
fee
FROM
Transactions
WHERE
account_id IS NULL
OR transaction_date IS NULL
OR amount IS NULL;
Result:
The transactions with the IDs 3,5,7,8 contains NULL values, and now it’s time for some debugging in the system based on your discoveries. ???
Calculate the total transaction amount per account
To calculate this we’ll use the SUM function on our cleaned column amount :
SELECT
account_id,
SUM(COALESCE(amount, 0)) AS total_transaction_amount
FROM
transactions
GROUP BY
account_id
Result:
We dealt with the NULLs from our amount column using the COALESCE() function, so now the director can see the amount for each account_id and smiles happily.
But… (the never-ending But ??)
We still have a NULL value, but it’s in the account_id column, where we know we don’t have much to do.
Detailed report for a specific month
Sometimes, the bank needs to see its profitability and identify trends by running a report for a particular month or period. You can mark as done this task by using the script below:
SELECT
account_id,
COUNT(*) AS number_of_transactions,
SUM(COALESCE(amount, 0)) AS total_transaction_amount,
SUM(COALESCE(fee, 0)) AS total_fees_collected
FROM
transactions
WHERE
transaction_date BETWEEN '2023-08-01' AND '2023-08-31'
GROUP BY
account_id;
??Explanation:
Conclusion
The almost invisible enemies of our data, NULL values, teach us one of the most valuable lessons in data work: the critical importance of data quality and integrity. By incorporating the techniques discussed in this article, you can handle NULL values more effectively, leading to more reliable results.
Remember that NULLs represent missing or unknown data. Their presence in your data is not an obstacle but an opportunity to apply your skills and tools to ensure data quality. As you continue to work with data, keep refining your approaches to handling NULLs, and you’ll find that your analyses become more robust and insightful.
Until next time, happy querying! ?? ????
AI Training Data | NLP | Prompt Engineering | Multilingual Speech-to-Text Transcription | Chatbot | Conversational AI | Machine translation | Human in the loop AI integration
1 个月Yugandhara, Very interesting, thanks for sharing!