Lesser-known facts about .loc and value_counts() in Pandas for Python
MANIDIPA CHAKRAVARTI
Senior Manager | Procurement Professional | SAP Ariba | Data Science Enthusiast
Professional data scientists usually spend a very large portion of their time on data cleaning. Pandas is the definitive library for performing data analysis with Python. In this article we will discuss some tips for doing data cleaning quickly and efficiently.
We will see the detailed usage of:
- DataFrame.loc
- DataFrame.value_counts()
- DataFrame.query() and DataFrame.memory_usage()
# Loading the dataframe
df.head()
1. Select subsets of rows and columns using .loc
# Select a subset of columns and all rows
df.loc[:, 'symboling':'body-style']

# Select a subset of rows and columns
df.loc[2:10, 'symboling':'body-style']

# Select particular rows and a slice of columns
df.loc[[2, 10, 12], 'symboling':'body-style']
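A detail worth calling out, shown here on a small toy DataFrame (the column names mirror the article's cars dataset, but the values are made up for illustration): unlike ordinary Python slicing, .loc label slices include both endpoints.

```python
import pandas as pd

# Toy stand-in for the article's cars DataFrame (values are assumptions)
df = pd.DataFrame({
    'symboling': [3, 1, 2, 0],
    'make': ['alfa-romero', 'audi', 'bmw', 'audi'],
    'body-style': ['convertible', 'sedan', 'sedan', 'wagon'],
    'price': [13495, 16430, 30760, 18920],
})

# .loc label slices include BOTH endpoints: rows 1 and 2,
# and every column from 'symboling' through 'body-style'
subset = df.loc[1:2, 'symboling':'body-style']
print(subset.shape)  # (2, 3)
```

This is why `df.loc[2:10, ...]` in the snippet above returns nine rows, not eight.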
2. Inclusion - Conditional filtering of DataFrames by category using .loc
# For a single category, use the == operator:
df[df.make == 'alfa-romero']

# For multiple categories, use the isin method:
df[df.make.isin(['alfa-romero', 'audi'])]
3. Exclusion - Conditional filtering of DataFrames by category using .loc
# Use the ~ (tilde) sign, which is the complement operator:
df[~df.make.isin(['alfa-romero', 'audi'])]
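A minimal end-to-end sketch of both filters, using a toy frame (an assumption; the article's df comes from an automobile dataset):

```python
import pandas as pd

# Toy frame standing in for the article's cars DataFrame
df = pd.DataFrame({'make': ['alfa-romero', 'audi', 'bmw', 'audi'],
                   'price': [13495, 16430, 30760, 18920]})

# isin keeps rows whose make is in the list;
# ~ negates the boolean mask, keeping everything else
kept = df[df.make.isin(['alfa-romero', 'audi'])]
dropped = df[~df.make.isin(['alfa-romero', 'audi'])]
print(kept.make.tolist())     # ['alfa-romero', 'audi', 'audi']
print(dropped.make.tolist())  # ['bmw']
```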
4. Reverse row order of data using .loc
df.loc[::-1]
5. Reverse column order of data using .loc
df.loc[:, ::-1]
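Both reversals together on a toy frame (an assumption for illustration). Note that `df.loc[::-1]` keeps the original index labels; chain `reset_index(drop=True)` if you want a fresh 0-based index.

```python
import pandas as pd

# Toy frame for illustration (the article uses a cars dataset)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

rev_rows = df.loc[::-1]                      # row order reversed, index labels kept
rev_cols = df.loc[:, ::-1]                   # column order reversed
fresh = df.loc[::-1].reset_index(drop=True)  # reversed rows with a clean 0..n-1 index

print(rev_rows.index.tolist())    # [2, 1, 0]
print(rev_cols.columns.tolist())  # ['b', 'a']
print(fresh.index.tolist())       # [0, 1, 2]
```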
6. value_counts() Default usage
df['body-style'].value_counts()
7. value_counts() in ascending order using ascending parameter
# The Series returned by value_counts() is in descending order by default.
# We can reverse this by setting the ascending parameter to True.
df['body-style'].value_counts(ascending=True)
8. value_counts() using normalize parameter
# Return proportions rather than frequencies (raw counts of occurrences)
df['body-style'].value_counts(normalize=True)
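A quick sketch on toy data (an assumption; the article's column comes from a cars dataset): with normalize=True each count is divided by the total, so the results sum to 1.

```python
import pandas as pd

# Toy body-style data for illustration
s = pd.Series(['sedan', 'sedan', 'hatchback', 'wagon'])

props = s.value_counts(normalize=True)
print(props['sedan'])  # 0.5 — two of the four rows are sedans
print(props.sum())     # 1.0
```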
9. value_counts() using bins parameter
# Bin a continuous numerical variable into discrete intervals
df['price'].value_counts(bins=10)
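A small sketch on toy prices (an assumption): bins=3 splits the value range into three equal-width intervals and counts how many values fall in each. The sort=False argument, not used in the article, keeps the intervals in numerical order instead of ordering by count.

```python
import pandas as pd

# Toy price data for illustration
prices = pd.Series([5000, 7000, 11000, 16000, 30000, 41000])

# Three equal-width bins spanning 5000..41000 (width 12000 each)
binned = prices.value_counts(bins=3, sort=False)
print(binned)  # counts per interval: 4, 0, 2
```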
10. Query specific columns and data using the .query method
For those of us who mostly use SQL in our daily work, this method offers a quick, readable way to pull out the subsections of a DataFrame you need.
df.query("price > 35000")

# Reference a Python variable inside the query string with the @ prefix
query_car_prices = [34184, 16430, 11694]
df.query("price in @query_car_prices")
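Both query patterns on a self-contained toy frame (the makes and prices here are assumptions, not values from the article's dataset):

```python
import pandas as pd

# Toy frame standing in for the article's cars DataFrame
df = pd.DataFrame({'make': ['audi', 'bmw', 'jaguar'],
                   'price': [16430, 30760, 35550]})

# Plain comparison inside the query string
over_35k = df.query("price > 35000")
print(over_35k.make.tolist())  # ['jaguar']

# Names prefixed with @ refer to Python variables in the surrounding scope
query_car_prices = [16430, 30760]
matched = df.query("price in @query_car_prices")
print(matched.make.tolist())  # ['audi', 'bmw']
```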
11. Monitor the memory used by each column of your DataFrame using the .memory_usage method
To limit the runtimes of your programs, you must be vigilant about both the size and the number of objects your program creates. memory_usage() returns a Series with the number of bytes each column (and the index) of the DataFrame currently occupies in memory.
df.memory_usage()
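One caveat worth sketching on a toy frame (an assumption): by default, object-dtype columns (such as strings) are counted only by their pointer size. Passing deep=True, a parameter not shown in the article, also counts the Python objects those pointers reference, giving a truer total.

```python
import pandas as pd

# Toy frame with a string (object-dtype) column
df = pd.DataFrame({'make': ['audi', 'bmw'], 'price': [16430, 30760]})

shallow = df.memory_usage()           # object columns counted as pointers only
deep = df.memory_usage(deep=True)     # also counts the referenced string objects

print(shallow.sum(), deep.sum())      # deep total is at least as large
```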
For step-by-step code with outputs, please refer to the Jupyter notebook in the GitHub repository.