
Exploratory Data Analysis Using Pandas Profiling

Amit Jain

Actively looking for new job | 6.10+ YoE as a Data Scientist

Published: November 10, 2021

Pandas profiling is an open-source Python module for quick exploratory data analysis. It also generates interactive reports in web (HTML) format.

Pandas profiling helps in visualizing and understanding the distribution of each variable. It generates a report with all the information easily available.

It offers report generation for the dataset, with many features and customization options for the generated report.

To start profiling a dataframe, there are two ways:

1. We can call the ‘.profile_report()’ method on a pandas dataframe.

import pandas_profiling  # importing the package adds .profile_report() to dataframes
dataset.profile_report()

2. We can pass the dataframe to the ProfileReport class and then evaluate the resulting object (for example, as the last line of a notebook cell) to start generating the profile.

from pandas_profiling import ProfileReport

profile = ProfileReport(dataset, title='Regression Pandas Profiling Report', explorative=True)

profile
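Newer releases of this library are published under the name ydata-profiling, so the import used above may fail on a recent installation. The sketch below shows one way to cope with both names. The helper names `profiling_module_name` and `build_report` are hypothetical and are not part of either package; only `ProfileReport` and its `title` and `explorative` arguments come from the library.

```python
import importlib
import importlib.util

# Package names to try, newest first. ydata_profiling is the renamed successor
# of pandas_profiling; which one is installed depends on your environment.
CANDIDATES = ("ydata_profiling", "pandas_profiling")


def profiling_module_name():
    """Return the name of the first installed profiling package, or None."""
    for name in CANDIDATES:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def build_report(df, title="Profiling Report"):
    """Build a ProfileReport for df using whichever package is available."""
    name = profiling_module_name()
    if name is None:
        raise ImportError("Install ydata-profiling (or the older pandas-profiling).")
    module = importlib.import_module(name)
    return module.ProfileReport(df, title=title, explorative=True)
```

With either package installed, `build_report(dataset, title='Regression Pandas Profiling Report')` should return the same kind of object as the `profile` variable above.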

Sections of the Report:

1. Overview

This section consists of three tabs: Overview, Warnings, and Reproduction.

The Overview tab contains overall statistics: the number of variables (features or columns of the dataframe), the number of observations (rows of the dataframe), missing cells, percentage of missing cells, duplicate rows, percentage of duplicate rows, and total size in memory.
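To make these numbers concrete, here is a minimal sketch of how the same overview figures could be reproduced by hand with plain pandas. This is an illustration only, not how pandas profiling computes them internally, and the toy dataframe is made up.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
dataset = pd.DataFrame({
    "age": [25, 32, None, 32],
    "city": ["Pune", "Delhi", "Delhi", "Delhi"],
})

n_vars = dataset.shape[1]                          # number of variables (columns)
n_obs = dataset.shape[0]                           # number of observations (rows)
missing_cells = int(dataset.isna().sum().sum())    # missing cells over the whole frame
missing_pct = missing_cells / dataset.size * 100   # share of all cells that are missing
duplicate_rows = int(dataset.duplicated().sum())   # rows that repeat an earlier row
duplicate_pct = duplicate_rows / n_obs * 100
memory_bytes = int(dataset.memory_usage(deep=True).sum())  # includes the index

print(n_vars, n_obs, missing_cells, missing_pct, duplicate_rows, duplicate_pct)
```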

The Warnings tab lists warnings related to cardinality, correlation with other variables, missing values, zeros, skewness of the variables, and more.

The Reproduction tab displays information about the report generation: the start and end time of the analysis, the time taken to generate the report, the pandas profiling version, and a configuration download option.

2. Variables

This section of the report gives a detailed analysis of all the variables/columns/features of the dataset. The information presented varies depending upon the data type of variable.

Numeric Variables

For numeric features, we get the distinct values, missing values, minimum and maximum, mean, and the count of negative values. We also get a small histogram of the values.

The toggle button expands to show the Statistics, Histogram, Common values, and Extreme values tabs.

The statistics tab includes:

Quantile statistics: minimum, maximum, percentiles, median, range, and IQR (interquartile range)
Descriptive statistics: standard deviation, coefficient of variation, kurtosis, mean, skewness, variance, and monotonicity

The Histogram tab displays the distribution of the numeric data. The Common values tab is essentially the value_counts of the variable, presented as both counts and percentage frequencies.
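As a rough illustration of what these tabs summarize, the statistics can be approximated with plain pandas. This is a sketch on a made-up column. The report's own figures may differ slightly depending on its settings.

```python
import pandas as pd

# Hypothetical numeric column, used only for illustration.
values = pd.Series([1, 2, 2, 3, 4, 4, 4, 10], name="x")

# Quantile statistics
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
value_range = values.max() - values.min()

# Descriptive statistics
mean = values.mean()
std = values.std()
cv = std / mean                   # coefficient of variation
skewness = values.skew()
kurtosis = values.kurt()          # excess kurtosis as defined by pandas
monotonic = values.is_monotonic_increasing

# Common values: counts and percentage frequency side by side
common = pd.DataFrame({
    "count": values.value_counts(),
    "pct": values.value_counts(normalize=True) * 100,
})

# Extreme values: the smallest and largest observations
smallest = values.nsmallest(3)
largest = values.nlargest(3)
```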

String Variables

For string variables, we get the distinct (unique) values, distinct percentage, missing count, missing percentage, memory size, and a horizontal bar chart of the unique values with their counts.

It also reports any warnings associated with the variable irrespective of its data type.

The toggle button expands to show the Overview, Categories, Words, and Characters tabs.

For string values, the Overview tab displays the maximum, minimum, median, and mean length, total characters, distinct characters, distinct categories, unique values, and a sample from the dataset.

The Categories tab displays a histogram, and sometimes a pie chart, of the value counts of the feature. The table contains the value, count, and percentage frequency.

The Words and Characters tabs present data in the same tabular and histogram format as the Categories tab, but they go deeper, also counting categories such as lowercase, uppercase, punctuation, and special characters.
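The string summaries described above can be sketched with pandas and the standard library. The column below is made up, and the character classes shown (lowercase, uppercase, punctuation) are a simplified stand-in for the finer categories the report uses.

```python
import string

import pandas as pd

# Hypothetical string column, used only for illustration.
colours = pd.Series(["Red", "blue", "Red", "green!", None], name="colour")

distinct = colours.nunique()               # distinct values, missing excluded
missing = int(colours.isna().sum())        # missing values
category_counts = colours.value_counts()   # what the Categories tab tabulates

# Length statistics shown in the Overview tab
lengths = colours.dropna().str.len()

# Character-level counts, similar in spirit to the Characters tab
all_text = "".join(colours.dropna())
total_chars = len(all_text)
distinct_chars = len(set(all_text))
char_classes = {
    "lowercase": sum(ch.islower() for ch in all_text),
    "uppercase": sum(ch.isupper() for ch in all_text),
    "punctuation": sum(ch in string.punctuation for ch in all_text),
}
```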

3. Correlations

Correlation describes the degree to which two variables move together. In the pandas profiling report, we can access 5 types of correlation coefficients:

Pearson’s r
Spearman’s ρ
Kendall’s τ
Phik (φk)
Cramér’s V (φc)

We can also click on the toggle button to get details about the various correlation coefficients.
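For a quick feel of how two of these coefficients differ, here is a small sketch using pandas' own `DataFrame.corr` on made-up data. Pearson measures linear association, while Spearman works on ranks and so also captures monotonic but non-linear relationships. Kendall's τ is available the same way with method="kendall" (pandas relies on SciPy for it). Phik and Cramér's V have no built-in pandas equivalent.

```python
import pandas as pd

# Hypothetical data: y grows with x but not linearly, z falls linearly with x.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 25],
    "z": [5, 4, 3, 2, 1],
})

pearson = df.corr(method="pearson")     # linear association
spearman = df.corr(method="spearman")   # rank-based (monotonic) association

print(pearson.round(3))
print(spearman.round(3))
```

Here the Spearman coefficient between x and y is exactly 1 because the relationship is perfectly monotonic, while the Pearson coefficient is slightly below 1 because it is not perfectly linear.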

4. Missing values

The report also contains visualizations of the missing values in the dataset. We get 3 types of plots: count, matrix, and dendrogram. The count plot is a basic bar plot with the column names on the x-axis, where the length of each bar represents the number of non-null values in that column. The matrix shows where values are present or missing row by row, and the dendrogram groups columns whose missing values tend to occur together.
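The data behind the count and matrix plots can be sketched with plain pandas. The dataframe is made up, and this only computes the numbers the plots would draw, not the plots themselves.

```python
import pandas as pd

# Hypothetical data with some missing values, used only for illustration.
df = pd.DataFrame({
    "a": [1, None, 3],
    "b": [None, None, 6],
    "c": [7, 8, 9],
})

# Count plot: one bar per column, height = number of non-null values.
present = df.notna().sum()

# Matrix plot: one cell per value, True where present, False where missing.
nullity = df.notna()

print(present)
print(nullity)
```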

5. Sample

This section displays the first and last 10 rows of the dataset.
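The pandas equivalent of this section is simply `head` and `tail`. The sketch below uses a made-up 25-row dataframe.

```python
import pandas as pd

# Hypothetical dataset with 25 rows, used only for illustration.
dataset = pd.DataFrame({"id": range(25)})

first_rows = dataset.head(10)   # first 10 rows
last_rows = dataset.tail(10)    # last 10 rows
```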

6. Interactions

This section generates a 2D scatter plot (or hexagonal binned plot) for every pair of continuous variables.

How to save the report?

We can save this report in two formats:

1. HTML format

2. JSON format

The save function is the same for both formats; just change the file extension. To save the report, call the “.to_file()” function on the profile object:

profile.to_file("eda_html_report_pandas_profiling.html")

profile.to_file("eda_html_report_pandas_profiling.json")

Widget in Jupyter notebook

When running pandas profiling in a Jupyter notebook, the HTML report is rendered inside the code cell's output. We can instead display it as a widget, which is easily accessible and offers a compact view. To do this, simply call “.to_widgets()” on the profile object:

profile.to_widgets()

You can check my GitHub profile for code.
