A Few Essential Pandas Functions
Image from https://pypi.org/project/pandas/

This summer, apart from volunteering in research work with Professor Julio, I also worked on a few Kaggle datasets that involved a lot of data crunching. This article discusses a few essential Pandas functions that came in handy for understanding the datasets and performing certain manipulations for the analysis.

The following topics are discussed in this article: Index, Range-Index, Pivot Table, Value_counts, and To_datetime.

Index: An index is simply a set of labels for the rows, just as the column names are labels for the columns.

No alt text provided for this image

Note that the index is not counted as one of the dataframe's columns.

E.g., the shape of the dataframe above is (4, 4): four rows and four columns, with the index left out of the column count.
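Since the original image is not reproduced here, the following is a minimal sketch of the same point. The dataframe, its column names, and the row labels are made up for illustration.

```python
import pandas as pd

# Hypothetical 4x4 dataframe with custom row labels as its index
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cara", "Dev"],
        "age": [23, 35, 41, 29],
        "city": ["Austin", "Boston", "Chicago", "Denver"],
        "score": [88, 92, 79, 85],
    },
    index=["r1", "r2", "r3", "r4"],
)

print(df.shape)          # four rows, four columns; the index is not counted
print(list(df.columns))  # the row labels do not appear among the columns
print(list(df.index))    # the row labels live in the index instead
```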

The main functions of an index are identification and selection:

  1. Identification: An index label is a pointer to a data location.
  2. Selection: A value is selected by its row label (index) and its column name.
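Both roles can be sketched with label-based access through loc. The data and labels below are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe indexed by made-up row labels
df = pd.DataFrame(
    {"age": [23, 35, 41], "city": ["Austin", "Boston", "Chicago"]},
    index=["ann", "bob", "cara"],
)

row = df.loc["bob"]            # identification: the label points to one row
value = df.loc["bob", "city"]  # selection: row label plus column name gives one value

print(row)
print(value)
```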

FYI, use set_index to set a column as the index if required, and use reset_index to move it back into a regular column and restore the default index.
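A small round trip, assuming a made-up dataframe with a "name" column, might look like this. Both methods return a new dataframe by default rather than modifying the original.

```python
import pandas as pd

# Hypothetical data; "name" is the column promoted to the index
df = pd.DataFrame({"name": ["ann", "bob", "cara"], "age": [23, 35, 41]})

indexed = df.set_index("name")     # "name" becomes the index, one column remains
restored = indexed.reset_index()   # "name" returns as a regular column

print(indexed.shape)
print(restored.equals(df))
```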

Range-Index: This concept is rarely discussed but is one of those good-to-know topics. RangeIndex is an immutable index that represents a monotonic integer range by its start, stop, and step values rather than by storing every label. It is the default index type used by dataframes when no index is explicitly provided.

  • It is one of those “memory saving” techniques because the labels are described by a monotonic range instead of being stored one by one
  • It can also improve computing speed in some cases
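The memory point can be checked with a rough sketch like the one below. The row count is arbitrary, and the explicit integer index is built only for comparison.

```python
import pandas as pd

# No index is passed, so pandas falls back to a RangeIndex
df = pd.DataFrame({"x": range(100_000)})

print(type(df.index))
print(df.index.start, df.index.stop, df.index.step)

# Materialize the same labels as an explicit integer index for comparison
explicit = pd.Index(df.index.to_numpy())

print(df.index.memory_usage())   # small: only the range description is kept
print(explicit.memory_usage())   # grows with the number of rows
```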

Pivot Table: “pandas.pivot_table” offers functionality similar to the “pivot table” feature in Excel.

“pivot_table” can be used to group and aggregate data. Any column can be used as the index, and aggregated values are then retrieved for each of its groups.

The following example uses “pivot_table” on the “Titanic” dataset:

No alt text provided for this image

The screengrab above shows that the column ‘Survived’ has been set as the index to count the number of people who survived and those who did not, across both genders.
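Because the screengrab is not available here, the following is a rough sketch of a comparable call. The rows are a tiny made-up sample, and the column names "PassengerId", "Survived", and "Sex" are assumed to follow the Kaggle Titanic dataset. The author's exact arguments may have differed.

```python
import pandas as pd

# Tiny made-up sample using Titanic-style column names (assumed)
titanic = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4, 5, 6, 7, 8],
        "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
        "Sex": ["male", "female", "female", "male",
                "male", "female", "female", "male"],
    }
)

# Rows are grouped by "Survived", columns by "Sex"; each cell counts passengers
counts = pd.pivot_table(
    titanic,
    index="Survived",
    columns="Sex",
    values="PassengerId",
    aggfunc="count",
)
print(counts)
```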

Value_counts: This function is used almost like a ritual on every dataset, along with functions like info() and describe(). value_counts() can be used to find the unique values in each column and how often each one occurs.

No alt text provided for this image

In the screenshot above, the ‘for’ loop prints the count of each unique value in every column. NA values are excluded by default.
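A loop of that kind could look like the sketch below. The dataframe is hypothetical, with one missing value per column to show the default behaviour.

```python
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", None, "male", "female"],
        "Embarked": ["S", "C", "S", "S", None, "Q"],
    }
)

# Count each unique value per column; NA values are dropped by default
for col in df.columns:
    print(df[col].value_counts())

# Pass dropna=False to include the missing values in the counts
with_na = df["Sex"].value_counts(dropna=False)
print(with_na)
```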

To_datetime: Oftentimes, research-oriented surveys have datasets with the year, month, day, hour, and minutes in separate columns. To combine these columns into a single date in ‘YYYY-MM-DD’ format, pd.to_datetime(df[['year', 'month', 'day']]) can be used. Note that to_datetime recognizes the parts by column name, so the columns must be named 'year', 'month', and 'day' (rename them first if they are called something else).

The following screenshots illustrate this:

No alt text provided for this image
No alt text provided for this image

Output:

No alt text provided for this image
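As a stand-in for the screenshots, here is a minimal sketch of the same idea. The survey data and the "_col" column names are invented, and the rename step is there because to_datetime expects names such as 'year', 'month', and 'day'.

```python
import pandas as pd

# Hypothetical survey data with the date parts in separate columns
survey = pd.DataFrame(
    {
        "year_col": [2020, 2021],
        "month_col": [7, 12],
        "date_col": [4, 25],
        "hour_col": [9, 18],
        "minute_col": [30, 5],
    }
)

# to_datetime identifies the parts by name, so map them to the expected names
parts = survey.rename(
    columns={
        "year_col": "year",
        "month_col": "month",
        "date_col": "day",
        "hour_col": "hour",
        "minute_col": "minute",
    }
)

survey["date"] = pd.to_datetime(parts[["year", "month", "day"]])
survey["timestamp"] = pd.to_datetime(
    parts[["year", "month", "day", "hour", "minute"]]
)
print(survey[["date", "timestamp"]])
```

Including the hour and minute columns is optional. Only year, month, and day are required.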