A Few Essential Pandas Functions
This summer, apart from volunteering on research work with Professor Julio, I also worked on a few Kaggle datasets that involved a lot of data crunching. This article discusses a few essential Pandas functions that came in handy for understanding those datasets and manipulating them for analysis.
The following topics are discussed in this article: Index, Range-Index, Pivot Table, Value_counts, and To_datetime.
Index: An index is simply a set of labels for the rows, just as the column names are labels for the columns.
Note that the index is not part of the dataframe's data, i.e., it is not counted as a column.
E.g., the shape of the above dataframe is (4, 4).
The main uses of an index are identification and selection: it labels each row and lets rows be looked up by label.
FYI, use set_index to set a column as the index if required, and use reset_index to go back to the original dataframe.
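As a minimal sketch of these points, the snippet below builds a small dataframe and shows how the shape changes when a column becomes the index. The data and the column names (name, age, city, score) are made up for illustration and do not come from the original datasets.

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen", "Dina"],
        "age": [31, 25, 40, 28],
        "city": ["Pune", "Austin", "Taipei", "Cairo"],
        "score": [88, 92, 79, 85],
    }
)

# The default index labels the rows but is not counted as a column.
print(df.shape)  # (4, 4)

# Promote a column to the index; it then stops counting as a column.
indexed = df.set_index("name")
print(indexed.shape)  # (4, 3)

# Identification and selection: look up a value by its row label.
print(indexed.loc["Chen", "age"])  # 40

# reset_index moves the index back into an ordinary column.
restored = indexed.reset_index()
print(restored.equals(df))  # True
```

Both set_index and reset_index return a new dataframe by default, so the original df is left unchanged.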
Range-Index: This concept is rarely discussed but is one of those good-to-know topics. RangeIndex is an immutable index that represents a monotonic integer range by storing only its start, stop, and step, which saves memory compared with storing every label. It is the default index type used by dataframes when no index is explicitly provided.
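A short sketch of this behaviour, using a made-up three-row dataframe (the column name value is an assumption for illustration):

```python
import pandas as pd

# Hypothetical data; no index is passed, so pandas picks the default.
df = pd.DataFrame({"value": [10, 20, 30]})

# The default index is a RangeIndex.
print(type(df.index).__name__)  # RangeIndex

# It is described by start, stop and step rather than by every label.
print(df.index.start, df.index.stop, df.index.step)  # 0 3 1

# Indexes are immutable: assigning to an element raises TypeError.
try:
    df.index[0] = 99
    mutated = True
except TypeError:
    mutated = False
print(mutated)  # False
```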
Pivot Table: “pandas.pivot_table” offers functionality similar to that of the “pivot table” feature in Excel.
“pivot_table” can be used to group and aggregate data. Any column can be used as the index, and the values are then aggregated for each of its groups.
The following example uses “pivot_table” on the “Titanic” dataset:
The screengrab above shows that the column ‘Survived’ has been set as the index to count the number of people who survived and those who did not, broken down by gender.
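A table of this kind can be sketched as follows. The eight rows below are invented for illustration and are not the real Titanic data; the column names PassengerId, Survived and Sex are assumed to match the Kaggle dataset.

```python
import pandas as pd

# Tiny made-up sample with Titanic-style columns (not the real data).
titanic = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4, 5, 6, 7, 8],
        "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
        "Sex": ["male", "female", "female", "female",
                "male", "male", "male", "female"],
    }
)

# Rows are grouped by 'Survived', columns by 'Sex', and the
# passengers in each cell are counted.
counts = pd.pivot_table(
    titanic,
    index="Survived",
    columns="Sex",
    values="PassengerId",
    aggfunc="count",
    fill_value=0,
)
print(counts)

# Individual cells can be read back by label.
print(counts.loc[0, "male"])    # 3
print(counts.loc[1, "female"])  # 3
```

Here fill_value=0 only matters when some combination has no rows; without it, such a cell would show NaN.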
Value_counts: This function is used almost like a ritual on every dataset, along with functions like info() and describe(). value_counts() can be used to find the unique values in a column and how often each one occurs.
In the screenshot above, the ‘for’ loop gives the count of each unique value in every column, excluding NA values by default.
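A loop of that kind might look like the sketch below. The dataframe is a made-up example with one missing value, and the column names Sex and Pclass are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with one missing value in 'Sex'.
df = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", None],
        "Pclass": [3, 1, 3, 3],
    }
)

# Print the unique values and their counts for every column.
for column in df.columns:
    print(df[column].value_counts())

# NA values are left out by default ...
sex_counts = df["Sex"].value_counts()
print(sex_counts["female"], sex_counts["male"])  # 2 1

# ... and are counted only when dropna=False is passed.
sex_with_na = df["Sex"].value_counts(dropna=False)
print(len(sex_counts), len(sex_with_na))  # 2 3
```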
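To_datetime: Research-oriented survey datasets often store the year, month, day, hour, and minutes in separate columns. To combine these columns into a single datetime column (shown in ‘YYYY-MM-DD’ format when only the date parts are given), pd.to_datetime(df[['year', 'month', 'day']]) can be used. Note that pd.to_datetime recognizes the columns by name, so they must be called year, month, and day (optionally also hour, minute, and second); columns with other names, such as ‘year_col’, need to be renamed first.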
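A minimal sketch of this, assuming a hypothetical survey table whose columns are called year_col, month_col and date_col. Since pd.to_datetime expects the names year, month and day when it assembles dates from a dataframe, the columns are renamed before the call.

```python
import pandas as pd

# Hypothetical survey data with the date parts in separate columns.
survey = pd.DataFrame(
    {
        "year_col": [2021, 2022],
        "month_col": [7, 12],
        "date_col": [4, 25],
    }
)

# Rename to the names pd.to_datetime looks for.
parts = survey.rename(
    columns={"year_col": "year", "month_col": "month", "date_col": "day"}
)

# Assemble one datetime column from the three parts.
survey["date"] = pd.to_datetime(parts[["year", "month", "day"]])

print(survey["date"].dt.strftime("%Y-%m-%d").tolist())
# ['2021-07-04', '2022-12-25']
```

The rename is applied to a copy, so the original three columns stay in survey alongside the new date column.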
The following screenshots explain the same:
Output: