Lesser known facts about .loc and value_counts in Pandas for python

Professional data scientists typically spend a large portion of their time on data cleaning. Pandas is the definitive library for performing data analysis with Python. In this article we will discuss some tips for doing data cleaning in a quick and efficient manner.

We will see the detailed usage of:

  1. DataFrame.loc
  2. Series.value_counts
  3. DataFrame.query and DataFrame.memory_usage
#Loading the DataFrame (the file name here is assumed; the column names
#used below suggest the UCI automobile dataset)
import pandas as pd

df = pd.read_csv('automobile.csv')
df.head()

1. Select subsets of rows and columns using .loc

#select subsets of columns and all rows
df.loc[:, 'symboling':'body-style']

#select subsets of rows and columns
df.loc[2:10, 'symboling':'body-style']

#select particular rows and columns
df.loc[[2,10,12], 'symboling':'body-style']

2. Inclusion - Conditional filtering of DataFrames by category using .loc

#For a single category, we are able to do this by using the == operator.
df[df.make == 'alfa-romero']

#For multiple categories, we have to make use of the isin function:

df[df.make.isin(['alfa-romero', 'audi'])]

3. Exclusion - Conditional filtering of DataFrames by category using .loc

#Use of the ~ (tilde) sign, which is the complement operator.
df[~df.make.isin(['alfa-romero', 'audi'])]
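The two filters above use bare boolean indexing; since the section headings mention .loc, here is the same inclusion and exclusion written through .loc, which additionally lets you select columns in the same step (a sketch on a small invented frame):

```python
import pandas as pd

df = pd.DataFrame({
    'make': ['alfa-romero', 'audi', 'bmw'],
    'price': [13495, 13950, 16430],
})

# Inclusion and exclusion via .loc, picking columns at the same time
kept = df.loc[df['make'].isin(['alfa-romero', 'audi']), ['make', 'price']]
dropped = df.loc[~df['make'].isin(['alfa-romero', 'audi']), 'make']

print(kept['make'].tolist())   # ['alfa-romero', 'audi']
print(dropped.tolist())        # ['bmw']
```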

4. Reverse row order of data using .loc

df.loc[::-1]

5. Reverse column order of data using .loc

df.loc[:, ::-1]
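One detail worth noting: reversing rows keeps the original index labels. If you want a fresh 0-based index after reversing, chain reset_index(drop=True), as in this sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

rev = df.loc[::-1]
print(rev.index.tolist())       # [2, 1, 0] - original labels are kept

# reset_index(drop=True) renumbers the rows from 0
rev_clean = df.loc[::-1].reset_index(drop=True)
print(rev_clean['a'].tolist())  # [3, 2, 1]
print(rev_clean.index.tolist()) # [0, 1, 2]
```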

6. value_counts() Default usage

df['body-style'].value_counts()

7. value_counts() in ascending order using ascending parameter

#The Series returned by value_counts() is in descending order by default. We can reverse the order by setting the ascending parameter to True.

df['body-style'].value_counts(ascending=True)

8. value_counts() using normalize parameter

#Return proportions rather than raw counts of occurrences
df['body-style'].value_counts(normalize=True)

9. value_counts() using the bins parameter

#Bin a continuous numerical variable into discrete intervals and count each
df['price'].value_counts(bins=10)
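With bins=n, value_counts splits the observed value range into n equal-width intervals and counts how many values fall into each, much like running pd.cut first and then counting. A sketch with invented prices:

```python
import pandas as pd

prices = pd.Series([5000, 7000, 9000, 15000, 22000, 40000])

# bins=3 carves the range 5000..40000 into 3 equal-width intervals
binned = prices.value_counts(bins=3)
print(binned.sort_index())   # every value lands in exactly one interval

# Roughly equivalent to cutting manually and counting:
manual = pd.cut(prices, bins=3).value_counts().sort_index()
print(manual.tolist())       # same counts per interval
```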

10. Query specific rows and columns using the .query method

For those of us who mostly use SQL in our daily work, this method allows quick, readable access to the subsets of a DataFrame that you want.

df.query("price > 35000")

query_car_prices = [34184, 16430, 11694]
#the @ prefix references a Python variable from the surrounding scope
df.query("price in @query_car_prices")
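One wrinkle specific to this dataset: column names that are not valid Python identifiers, such as 'body-style' with its hyphen, must be wrapped in backticks inside .query (supported since pandas 0.25). A sketch with invented values:

```python
import pandas as pd

df = pd.DataFrame({
    'body-style': ['sedan', 'wagon', 'sedan'],
    'price': [13950, 17710, 16430],
})

# Backticks let .query reference the hyphenated column name
sedans = df.query("`body-style` == 'sedan' and price > 14000")
print(len(sedans))                    # 1
print(sedans['price'].tolist())       # [16430]
```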

11. Monitor the memory used by each column of your DataFrame using the .memory_usage method

To limit the runtimes of your programs you must be vigilant about the size and number of objects your program creates. This method returns a Series giving the number of bytes each column of the DataFrame currently occupies in memory.

df.memory_usage()
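For object-dtype columns (such as the string columns in this dataset), the default call only counts the column's internal pointers, not the strings themselves; pass deep=True for a full accounting. A sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'make': ['alfa-romero', 'audi', 'bmw'],
    'price': [13495, 13950, 16430],
})

shallow = df.memory_usage()           # object columns: pointer sizes only
deep = df.memory_usage(deep=True)     # deep=True also counts the string data

# The deep figure for a string column is never smaller than the shallow one
print(deep['make'] >= shallow['make'])  # True
```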

For step-by-step code with outputs, please refer to the Jupyter notebook in the GitHub repository.


MANIDIPA CHAKRAVARTI

Senior Manager | Procurement Professional | SAP Ariba | Data Science Enthusiast
