Pandas Vs. SQL: String Formatting and Preprocessing Data
Announcement
I have just added a set of new videos on Python strings to the Master Python Fundamentals course: Join the course to learn Python the easy way: Master Python Fundamentals: Classes and Functions
Can you answer this challenge? Check challenge here
Introduction
As someone who uses both Python and SQL, I cannot overstate how important these two tools are for data analysis. Both Python and SQL are widely used for data analysis, making them must-learn tools for anyone dancing with data. In this article, I want to compare string processing functions in Python and SQL. We are going to explore Pandas and SQL functions for finding and removing leading and trailing spaces, extracting substrings, and replacing specific substrings in columns. Now, please note that this is not a competition to determine which approach is superior (well, we may do that some other time). We are simply trying to explore the functionality available in SQL and Pandas for string formatting and preprocessing tasks.
Checking for Leading and Trailing spaces
Here is the data that we are going to use for this article. This data will be loaded in a pandas DataFrame and SQL database:
One of the most common problems found in structured data is the presence of leading and trailing whitespaces. These extra spaces can disrupt your analysis and lead to inaccurate interpretations. For instance, strings with unwanted spaces may fail to match expected values during comparisons or calculations, leading to incorrect results. Additionally, retrieving records with such strings could return no matches, even though the data exists. As part of the data cleaning process, it’s important to remove these whitespaces before proceeding with any analysis. Let's look at two methods for handling leading and trailing spaces—one using Pandas and the other using SQL. But before removing them, we need to first confirm their presence. Let’s start by checking for these spaces in a pandas DataFrame.
Here, we are checking if the sales_person column has either leading or trailing spaces. The function is using the startswith() and str.endswith() methods to check if the strings start with or end with some spaces. This creates two boolean series: one that identifies rows with leading spaces and one that identifies rows with trailing spaces. The logical OR operator ( | ) combines these two conditions, and the any() method checks if at least one row in the column satisfies either condition. If any rows have leading or trailing spaces, the function returns true; otherwise, it returns false. So we can confirm that we have 3 rows with leading spaces in the column. We do not have any rows with trailing.
Now, if you are using SQL, here is how you would check for leading and trailing spaces in a column:
This SQL query checks for rows in the sales_person column of the sales table that contain either leading or trailing spaces. It uses the LENGTH() function to compare the length of each string before and after applying the TRIM() function, which removes leading and trailing spaces. If the length of the trimmed string is different from the original, it indicates that the string had spaces. The CASE statement assigns a value of 1 if this condition is met and 0 otherwise. The SUM() function then adds up the number of rows where leading or trailing spaces were found. The result is that we have 3 rows with leading and trailing spaces.
Build the Confidence to Tackle Data Analysis Projects in 2024
Ready to go in and do some real data analysis? The main purpose of this book is to ensure that you develop data analysis skills with Python by tackling challenges. By the end, you should be confident enough to take on any data analysis project with Python. Start your journey with "50 Days of Data Analysis with Python: The Ultimate Challenge Book for Beginners."
Other Resources
Challenge yourself with Python challenges. Check out 50 Days of Python: A Challenge a Day.
100 Python Tips and Tricks, Python Tips and Tricks: A Collection of 100 Basic & Intermediate Tips & Tricks.
Removing Leading and Trailing Spaces
When you find some leading and trailing spaces, you can use the Pandas strip() method to remove the spaces. Now, to demonstrate the flexibility of pandas and Python in general, we will combine the process of checking for leading and trailing spaces with the process of removing them in one function:
领英推荐
In this code, once we have confirmed that we have leading and trailing spaces in the column, we use the strip() method to remove them. Once we have removed the spaces, we check the column again to confirm if the spaces have been removed. If the spaces have been removed, the code will return True. In practice, though, str.strip() should always remove all leading and trailing spaces, so the second check is more of a safety measure and may not be strictly necessary.
To remove the leading and trailing spaces with SQL, we use the TRIM() function. This is a simple procedure of simply passing the column name (sales_person) to the TRIM() function. To update the table in the database with the trimmed column, we use the UPDATE statement. The UPDATE statement applies this transformation to all rows in the sales table:
Here, the UPDATE statement modifies the salesperson column by applying the TRIM() function. The TRIM() function removes any leading (spaces at the beginning) or trailing spaces (spaces at the end) from the values in the column. This is applied to all the rows in the table. After updating the table, the second part of the code checks if there are any rows that still have leading or trailing spaces. This is the same query that we used above to confirm if the column had leading or trailing spaces. Now that we have removed the spaces, the expectation is that this check will return zero (0) as the number of rows with leading and trailing spaces. You can see in the output that the expectation has been met.
Extracting Substrings (First Name)
In Pandas, to extract part of a string, we can use the str.split() method. If we want to extract first names from the "sales_person" column, here is how we can use this method:
In this code, the split() method splits the full name in the sales_person column by spaces into a list for each row (e.g., ['Giovanni', 'Rossi']). In the code, str[0] extracts the first element (the first name) from the split list for each row. The extracted first name is stored in a column called first_names. This process adds a new column to the DataFrame.
Here is how we can extract the first name using SQL:
In this query, we are using the SUBSTR function with INSTR to extract a substring within the sales_person column. The first part of the code SUBSTR(sales_person, 1) stipulates that we are extracting a string from the sales_person column, starting from position 1 (index 1). The second part of the code, INSTR(sales_person, ' ') - 1) stipulates where the extraction will end. In this case, it will end at the first occurrence of space. This effectively extracts the first name from the column. If we want to update the table with the first_names column, then we would have to use the UPDATE statement. But we won't do that now.
Replacing Values
In the data, the name 'JAAArgen' seems to have a typo. This name has three uppercase letters, which is inconsistent with the other names in the column. We want to replace 'AAA' with 'a'. Here is how we can do it with Pandas using the str.replace() method:
Look at that name in row number 6; it has been cleaned. In this code, the replace() method searches through each row in the column and substitutes every instance of the string 'AAA' with 'a'. The operation is applied across the entire column, and the result is assigned back to data['sales_person'].
Since we already know the name in the sales_person column that we want to clean, if we want, we can apply this code to the specific column. First, we have to use a condition to locate the specific row where the name matches, and then apply the replacement only to that row. Here is the code below:
You can see in row 6 that the name has been modified. The advantage of applying an operation only to the rows that need it can improve performance, especially in large datasets. This is because replacing values across the entire column involves processing every row, even those that don't need changes.
To replace the values using SQL, we can use the REPLACE() function. We are going to use the UPDATE statement to update the table:
This SQL query uses the REPLACE() function to search for the substring 'AAA' within each entry of the sales_person column and replace it with the lowercase letter 'a'. The UPDATE statement modifies the "salesperson" column in the table. Once we have made the change, we run the SELECT statement to verify whether the update was successful. You can see in the output that the change was successful.
Final Thoughts
You can see how Python compares to SQL in cleaning and preprocessing strings. The big takeaway from this is that string formatting and preprocessing are fundamental tasks when dealing with data, whether in pandas or SQL. Both approaches have their strengths, but we are not here to debate that. Ultimately, you should strive to have solid knowledge in both tools. Having a solid grasp of both tools will give you flexibility in preprocessing and cleaning your data effectively, no matter where it resides. Thanks for reading.
Newsletter Sponsorship
You can reach a highly engaged audience of over 320,000 tech-savvy subscribers and grow your brand with a newsletter sponsorship. Contact me at [email protected] today to learn more about the sponsorship opportunities.
Work from home at F.L.P
3 个月???? URGENT HIRING ?????? ● URGENT HIRING FOR THAILAND ?? ??Position for Promotion Customer relationship management System and Customer Finding ?? 11K To 20K?? As Per Your Interview And Performance?? 《 FRESHER & ,[STUDENTS] [ EXPERIENCED,] [ HOUSEWIFE ]? CAN APPLY 》 Any Details Send Me Your WhatsApp Message. 94052 02717. Only limited Seats Available...??????
Fintech | Business Intelligence
4 个月I believe that pandas, and Python in general, offer a more versatile toolkit for cleaning and analyzing data. However, learning SQL remains essential for any data analyst.
Freelance Data Analyst | Business Intelligence Specialist | Marketing Data Analyst | | SQL | Data Analyst | Business Analyst
4 个月Genio !
OK Bo?tjan Dolin?ek
Marketing Director @ ScoutDI | MSc. Electrical Engineering: Driving brand recognition and generating sales leads through Marketing Strategy, Content, Process Improvement and Communication.
4 个月Hmm, is there a data file?