Lesser-known facts about .loc and value_counts() in Pandas for Python
MANIDIPA CHAKRAVARTI
Senior Manager | Procurement Professional | SAP Ariba | Data Science Enthusiast
Professional data scientists usually spend a very large portion of their time on data cleaning. Pandas is the definitive library for performing data analysis with Python. In this article we will discuss some tips for doing data cleaning quickly and efficiently.
We will see the detailed usage of:
- DataFrame.loc
- DataFrame.value_counts()
- DataFrame.query() and DataFrame.memory_usage()
# Loading the dataframe
df.head()
1. Select subsets of rows and columns using .loc
# Select a subset of columns and all rows
df.loc[:, 'symboling':'body-style']

# Select a subset of rows and columns
df.loc[2:10, 'symboling':'body-style']

# Select particular rows and a slice of columns
df.loc[[2, 10, 12], 'symboling':'body-style']
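A detail worth calling out, shown here on a small toy DataFrame (the column names mirror the article's cars dataset, but the values are made up for illustration): unlike ordinary Python slicing, .loc label slices include both endpoints.

```python
import pandas as pd

# Toy stand-in for the article's cars DataFrame (values are assumptions)
df = pd.DataFrame({
    'symboling': [3, 1, 2, 0],
    'make': ['alfa-romero', 'audi', 'bmw', 'audi'],
    'body-style': ['convertible', 'sedan', 'sedan', 'wagon'],
    'price': [13495, 16430, 30760, 18920],
})

# .loc label slices include BOTH endpoints: rows 1 and 2,
# and every column from 'symboling' through 'body-style'
subset = df.loc[1:2, 'symboling':'body-style']
print(subset.shape)  # (2, 3)
```

This is why `df.loc[2:10, ...]` in the snippet above returns nine rows, not eight.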
2. Inclusion - Conditional filtering of DataFrames by category using .loc
# For a single category, use the == operator:
df[df.make == 'alfa-romero']

# For multiple categories, use the isin method:
df[df.make.isin(['alfa-romero', 'audi'])]
3. Exclusion - Conditional filtering of DataFrames by category using .loc
# Use the ~ (tilde) sign, which is the complement operator:
df[~df.make.isin(['alfa-romero', 'audi'])]
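A minimal end-to-end sketch of both filters, using a toy frame (an assumption; the article's df comes from an automobile dataset):

```python
import pandas as pd

# Toy frame standing in for the article's cars DataFrame
df = pd.DataFrame({'make': ['alfa-romero', 'audi', 'bmw', 'audi'],
                   'price': [13495, 16430, 30760, 18920]})

# isin keeps rows whose make is in the list;
# ~ negates the boolean mask, keeping everything else
kept = df[df.make.isin(['alfa-romero', 'audi'])]
dropped = df[~df.make.isin(['alfa-romero', 'audi'])]
print(kept.make.tolist())     # ['alfa-romero', 'audi', 'audi']
print(dropped.make.tolist())  # ['bmw']
```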
4. Reverse row order of data using .loc
df.loc[::-1]
5. Reverse column order of data using .loc
df.loc[:, ::-1]
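Both reversals together on a toy frame (an assumption for illustration). Note that `df.loc[::-1]` keeps the original index labels; chain `reset_index(drop=True)` if you want a fresh 0-based index.

```python
import pandas as pd

# Toy frame for illustration (the article uses a cars dataset)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

rev_rows = df.loc[::-1]                      # row order reversed, index labels kept
rev_cols = df.loc[:, ::-1]                   # column order reversed
fresh = df.loc[::-1].reset_index(drop=True)  # reversed rows with a clean 0..n-1 index

print(rev_rows.index.tolist())    # [2, 1, 0]
print(rev_cols.columns.tolist())  # ['b', 'a']
print(fresh.index.tolist())       # [0, 1, 2]
```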
6. value_counts() Default usage
df['body-style'].value_counts()
7. value_counts() in ascending order using ascending parameter
# The Series returned by value_counts() is in descending order by default.
# We can reverse this by setting the ascending parameter to True.
df['body-style'].value_counts(ascending=True)
8. value_counts() using normalize parameter
# Return proportions rather than frequencies (raw counts of occurrences)
df['body-style'].value_counts(normalize=True)
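A quick sketch on toy data (an assumption; the article's column comes from a cars dataset): with normalize=True each count is divided by the total, so the results sum to 1.

```python
import pandas as pd

# Toy body-style data for illustration
s = pd.Series(['sedan', 'sedan', 'hatchback', 'wagon'])

props = s.value_counts(normalize=True)
print(props['sedan'])  # 0.5 — two of the four rows are sedans
print(props.sum())     # 1.0
```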
9. value_counts() using bins parameter
# Bin a continuous numerical variable into discrete intervals
df['price'].value_counts(bins=10)
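A small sketch on toy prices (an assumption): bins=3 splits the value range into three equal-width intervals and counts how many values fall in each. The sort=False argument, not used in the article, keeps the intervals in numerical order instead of ordering by count.

```python
import pandas as pd

# Toy price data for illustration
prices = pd.Series([5000, 7000, 11000, 16000, 30000, 41000])

# Three equal-width bins spanning 5000..41000 (width 12000 each)
binned = prices.value_counts(bins=3, sort=False)
print(binned)  # counts per interval: 4, 0, 2
```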
10. Query specific columns and data using the .query method
For those of us who mostly use SQL in our daily work, this method offers a quick, readable way to pull out the subsections of a DataFrame you need.
df.query("price > 35000")

# Reference a Python variable inside the query string with the @ prefix
query_car_prices = [34184, 16430, 11694]
df.query("price in @query_car_prices")
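Both query patterns on a self-contained toy frame (the makes and prices here are assumptions, not values from the article's dataset):

```python
import pandas as pd

# Toy frame standing in for the article's cars DataFrame
df = pd.DataFrame({'make': ['audi', 'bmw', 'jaguar'],
                   'price': [16430, 30760, 35550]})

# Plain comparison inside the query string
over_35k = df.query("price > 35000")
print(over_35k.make.tolist())  # ['jaguar']

# Names prefixed with @ refer to Python variables in the surrounding scope
query_car_prices = [16430, 30760]
matched = df.query("price in @query_car_prices")
print(matched.make.tolist())  # ['audi', 'bmw']
```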
11. Monitor the memory used by each column of your DataFrame using the .memory_usage method
To limit the runtimes of your programs, you must be vigilant about both the size and the number of objects your program creates. memory_usage() returns a Series with the number of bytes each column (and the index) of the DataFrame currently occupies in memory.
df.memory_usage()
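One caveat worth sketching on a toy frame (an assumption): by default, object-dtype columns (such as strings) are counted only by their pointer size. Passing deep=True, a parameter not shown in the article, also counts the Python objects those pointers reference, giving a truer total.

```python
import pandas as pd

# Toy frame with a string (object-dtype) column
df = pd.DataFrame({'make': ['audi', 'bmw'], 'price': [16430, 30760]})

shallow = df.memory_usage()           # object columns counted as pointers only
deep = df.memory_usage(deep=True)     # also counts the referenced string objects

print(shallow.sum(), deep.sum())      # deep total is at least as large
```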
For step-by-step code with outputs, please refer to the Jupyter notebook in the GitHub repository.