The Mystery of NULL Values: Why They Matter and How to Tackle Them

What challenge is greater than cleaning a dataset? Whether you choose SQL or Python as your weapon, you have to deal with messy data and, sooner or later, with what we know as NULL values.

Dealing with NULL values can be tricky, but it’s essential for accurate data analysis.

What are these NULL values?

When I first encountered the concept of NULL, my first thought was: “Why do we need NULLs when we have empty strings?” At the time, I didn’t even understand the difference between NULL and any other way of indicating that there is nothing.

These small intruders (NULL values) represent missing or unknown data. NULL is a special way to indicate that a field has no value assigned to it.

The concept of a NULL value is often mistaken for an empty string or a 0 value, but nothing could be more wrong.

And I'm going to show you why. Ready?

Let’s assume that we have a table named space_usage with 4 columns, like the one below:
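The original table image isn’t reproduced here, so below is a minimal sketch of how it might be set up, assuming SQL Server (the article uses DATALENGTH); the column names come from the query that follows, while the data types are my assumption.

-- Hypothetical setup for the space_usage example; types are assumed.
-- val_null stays NULL, val_empty holds an empty string, val_zero holds 0.
CREATE TABLE space_usage (
    id        INT PRIMARY KEY,
    val_null  VARCHAR(10),
    val_empty VARCHAR(10),
    val_zero  INT
);

INSERT INTO space_usage (id, val_null, val_empty, val_zero)
VALUES (1, NULL, '', 0);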

In each column, I've added a specific value: NULL, an empty string, and 0. Now, if we want to see the length in bytes for each, we run the following script:

SELECT id,
       DATALENGTH(val_null)  AS size_null,
       DATALENGTH(val_empty) AS size_empty,
       DATALENGTH(val_zero)  AS size_zero
FROM space_usage;

The result will be:

  • size_null: NULL (there is no value at all, so there is nothing to measure)
  • size_empty: 0 bytes for a VARCHAR column (the empty string is a real value, just one with no characters)
  • size_zero: 4 bytes (the standard integer representation)

Now the differences between these three are pretty clear, right?
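To see the difference in behavior as well as in storage, here is a small sketch against the same hypothetical space_usage table from above: an ordinary comparison never matches NULL, while the empty string and 0 behave like any other value.

-- '=' never matches NULL, so only IS NULL finds it;
-- '' and 0 match ordinary comparisons just fine.
SELECT id FROM space_usage WHERE val_null = '';     -- no rows
SELECT id FROM space_usage WHERE val_null IS NULL;  -- id = 1
SELECT id FROM space_usage WHERE val_empty = '';    -- id = 1
SELECT id FROM space_usage WHERE val_zero = 0;      -- id = 1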

Why do NULLs exist?

When you receive a dataset and start working with it, you might wonder why these small enemies exist.

It’s a valid question, and handling and resolving these values can be frustrating because it’s often unclear where they come from.

But…

Everything happens for a reason, and the existence of NULL values is no exception; some of those reasons are listed below:

  • Incomplete data entry: how many times have you skipped over a field on a form because you didn’t know what to put there? Sounds familiar, even to me. This is where a NULL value comes in, standing in for the data that is not available.
  • Data collection processes: smart homes are no longer a new concept. However, some data might go missing due to non-response, malfunction, or other reasons specific to the data collection method. Again, the NULLs come in.
  • Merging data from multiple sources: one of the most common tasks for a data engineer is collecting data from multiple sources, and it’s almost impossible not to encounter fields that are missing in one source but present in another. This situation leads to NULL values in the merged dataset (see the sketch after this list).
  • Data migration: moving to a new system can give us a few headaches, especially when there are differences between the brand-new system and the old one. Given these differences, some fields may not have corresponding data in the new system, and the emergence of NULL values is inevitable.
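To make the merging case concrete, here is a minimal sketch (the table and column names are invented for illustration) of how a LEFT JOIN surfaces NULLs when one source has no matching row:

-- Hypothetical sources: customers exist in the CRM export but some have no
-- row yet in the billing export, so the billing columns come back as NULL.
SELECT c.customer_id,
       c.customer_name,
       b.billing_plan   -- NULL for customers missing from billing_accounts
FROM crm_customers AS c
LEFT JOIN billing_accounts AS b
    ON b.customer_id = c.customer_id;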

Let’s deal with these little traitors

Dealing with these tiny invaders is vital for anyone who works with data and loves it clean. Fortunately, several ways exist to win these small but exhausting fights.

First things first: you need to identify them and challenge them to battle by using the IS NULL condition:

SELECT Account_id, Transaction_Date, Amount, Fee
FROM Transactions
WHERE Fee IS NULL;        

Using this SELECT statement, we identify all the NULL values that made their way into our Fee column.
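Before fixing anything, it can also help to quantify the problem; here is a quick sketch that counts how many fees are missing in the same table:

-- Count rows with a missing fee versus the total number of rows.
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN Fee IS NULL THEN 1 ELSE 0 END) AS missing_fees
FROM Transactions;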

A few ways to deal with these are:

  1. CASE expression — the long way to handle a NULL value, but it allows multiple scenarios for replacing a value, which is pretty useful when you have strings (see the sketch after this list):


SELECT account_id, transaction_date, fee,
  CASE
    WHEN fee IS NULL THEN 0
    ELSE fee
  END AS not_null_fee
FROM Transactions;

Result:

2. ISNULL() — want to replace a NULL without writing a whole CASE expression? This SQL Server function is the shortcut:

SELECT account_id, transaction_date, 
       amount, ISNULL(amount, 0) AS not_null_amount
FROM Transactions        

Result:

3. COALESCE() — another handy option is the standard-SQL COALESCE() function which, unlike ISNULL(), accepts more than two arguments and returns the first non-NULL one (more on this after the list):

SELECT account_id, transaction_date, fee, COALESCE(fee, 1) AS not_null_fee
FROM Transactions        

You were waiting for an image showing the results, right? Sorry to disappoint you, but the result is the same as the one returned by the ISNULL() function, so there is no need for more images.

4. NVL() — this function is specific to Oracle (you’ll see it in Oracle SQL and PL/SQL), and it does the same thing as the ones above:

SELECT account_id, transaction_date, fee, NVL(fee, 1) AS not_null_fee
FROM Transactions        
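As promised above, here is a short sketch of where CASE and COALESCE() go beyond the basics. The placeholder strings and the 1% fallback fee are invented for illustration; description, fee, and amount are the columns already used in this article.

-- CASE handles several string scenarios with different replacements;
-- COALESCE returns the first non-NULL value from any number of arguments,
-- here: the fee itself, then a made-up 1% estimate, then 0.
SELECT transaction_id,
       CASE
         WHEN description IS NULL THEN 'no description provided'
         WHEN description = ''    THEN 'empty description'
         ELSE description
       END AS clean_description,
       COALESCE(fee, amount * 0.01, 0) AS estimated_fee
FROM Transactions;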

Now that we have new tools in our toolkit, let’s begin the adventure and do some practice.

Let’s put what we learned to use

Let’s do a short exercise in imagination. Say you’re a data analyst at one of the largest banks, and top management needs you to generate a financial report based on the Transactions table, with the following key tasks:

  • Calculate the total transaction amount per account
  • Identify transactions missing critical information
  • Generate a report of fees collected

Good, now it’s time to take a look at our Transactions table.
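The screenshot of the table isn’t reproduced here, so as a stand-in, here is a sketch of the shape it might have; the rows are invented for illustration, but they mirror the findings later in the article (NULLs in the critical columns of transactions 3, 5, 7, and 8).

-- Hypothetical Transactions table; values are invented for illustration.
CREATE TABLE Transactions (
    transaction_id   INT PRIMARY KEY,
    account_id       INT,
    transaction_date DATE,
    amount           DECIMAL(10, 2),
    description      VARCHAR(100),
    fee              DECIMAL(10, 2)
);

INSERT INTO Transactions VALUES
    (1, 101, '2023-08-01', 250.00, 'salary',        NULL),
    (2, 101, '2023-08-03',  45.90, 'groceries',      0.50),
    (3, 102, NULL,          80.50, 'online order',   1.20),
    (4, 102, '2023-08-10', 120.00, 'insurance',      1.00),
    (5, NULL, '2023-08-12', 310.00, NULL,            2.00),
    (6, 103, '2023-08-15',  60.00, 'restaurant',     0.75),
    (7, 103, '2023-08-19',   NULL, 'wire transfer',  5.00),
    (8, NULL, '2023-08-25',  90.00, 'utilities',     NULL);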

OMG, what do we have here?

Yes, you’re right… NULL values… lots of them. But you’re a data guru who knows how to deal with them.

Let’s take small steps and handle the tasks one by one:

Identify transactions with missing critical information

In banking, the columns related to accounts are vital. In our case, account_id, transaction_date, and amount play the role of critical information, and to identify the rows where any of them is missing, you can use the following query:


SELECT 
  transaction_id,
  account_id,
  transaction_date,
  amount,
  description,
  fee
FROM 
  Transactions
WHERE 
  account_id IS NULL 
  OR transaction_date IS NULL 
  OR amount IS NULL;        

Result:

The transactions with IDs 3, 5, 7, and 8 contain NULL values, and now it’s time for some debugging in the source system based on your discoveries.

Calculate the total transaction amount per account

To calculate this, we’ll use the SUM function on our cleaned amount column:

SELECT 
 account_id,
 SUM(COALESCE(amount, 0)) AS total_transaction_amount
FROM 
 transactions
GROUP BY 
 account_id        

Result:

We dealt with the NULLs in our amount column using the COALESCE() function, so now the director can see the total for each account_id and smile happily.

But… (the never-ending but)

We still have a NULL, but it’s in the account_id column itself, where there isn’t much we can do; GROUP BY simply collects all of those rows into one NULL group.
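If the report still has to show that leftover group clearly, one option (the label text is just a suggestion) is to give the NULL account a readable name while keeping the same grouping:

-- Label the NULL account_id group so it shows up clearly in the report.
SELECT COALESCE(CAST(account_id AS VARCHAR(20)), 'unknown account') AS account_label,
       SUM(COALESCE(amount, 0)) AS total_transaction_amount
FROM transactions
GROUP BY account_id;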

Detailed report for a specific month

Sometimes, the bank needs to see its profitability and identify trends by running a report for a particular month or period. You can mark this task as done by using the script below:

SELECT 
    account_id,
    COUNT(*) AS number_of_transactions,
    SUM(COALESCE(amount, 0)) AS total_transaction_amount,
    SUM(COALESCE(fee, 0)) AS total_fees_collected
FROM 
    transactions
WHERE 
    transaction_date BETWEEN '2023-08-01' AND '2023-08-31'
GROUP BY 
    account_id;        

Explanation:

  • Handling NULL values gracefully: we used the COALESCE function to replace NULL values with a default value (e.g., 0) so that the calculations stay accurate.
  • Group and aggregate: as you saw, we used the GROUP BY clause to group data by account and aggregate functions like SUM to calculate the total_transaction_amount and total_fees_collected columns.
  • Filtering: the WHERE clause was used to keep only the transactions that fall within our specific period.
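One extra, hedged note: the article doesn’t say what data type transaction_date is. If it happens to be a datetime rather than a plain date, BETWEEN '2023-08-01' AND '2023-08-31' would drop rows with a time-of-day on August 31, and a half-open range is the safer filter:

-- Half-open range: covers every moment of August 2023 whether or not
-- transaction_date stores a time component.
SELECT account_id,
       COUNT(*) AS number_of_transactions,
       SUM(COALESCE(amount, 0)) AS total_transaction_amount,
       SUM(COALESCE(fee, 0)) AS total_fees_collected
FROM transactions
WHERE transaction_date >= '2023-08-01'
  AND transaction_date <  '2023-09-01'
GROUP BY account_id;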

Conclusion

The almost invisible enemies of our data, NULL values, teach us one of the most valuable lessons in data work: the critical importance of data quality and integrity. By incorporating the techniques discussed in this article, you can handle NULL values more effectively, leading to more reliable results.

Remember that NULLs represent missing or unknown data. Their presence in your data is not an obstacle but an opportunity to apply your skills and tools to ensure data quality. As you continue to work with data, keep refining your approaches to handling NULLs, and you’ll find that your analyses become more robust and insightful.

Until next time, happy querying!
