Exploratory Data Analysis Using Pandas Profiling

Pandas profiling is an open-source Python module for quick exploratory data analysis. It generates interactive reports in web (HTML) format.

Pandas profiling helps in visualizing and understanding the distribution of each variable, and it collects all of this information in a single report.

It also offers many features and customization options for the generated report.

To start profiling a dataframe, there are two ways:

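1. We can call the ‘.profile_report()’ function on a pandas dataframe. Importing pandas_profiling makes this function available on dataframes.

import pandas_profiling  # adds .profile_report() to pandas dataframes

dataset.profile_report()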

2. We can pass the dataframe object to ProfileReport and then display the resulting object to start the generation of the profile.

from pandas_profiling import ProfileReport

profile = ProfileReport(dataset, title='Regression Pandas Profiling Report', explorative=True)

profile

Sections of the Report:

1. Overview

This section consists of three tabs: Overview, Warnings, and Reproduction.

The Overview tab shows overall statistics: the number of variables (features or columns of the dataframe), the number of observations (rows of the dataframe), missing cells, the percentage of missing cells, duplicate rows, the percentage of duplicate rows, and the total size in memory.
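
To make these numbers concrete, here is a minimal sketch that computes similar figures with plain pandas. The toy dataframe and its column names are made up for illustration, and the report may count duplicates or memory slightly differently than this sketch does.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
dataset = pd.DataFrame({
    "age": [25, 32, None, 32, 25],
    "city": ["Pune", "Delhi", "Delhi", "Delhi", "Pune"],
})

n_obs, n_vars = dataset.shape
missing_cells = int(dataset.isna().sum().sum())
# Rows that repeat an earlier row (the first occurrence is not counted).
duplicate_rows = int(dataset.duplicated().sum())

overview = {
    "Number of variables": n_vars,
    "Number of observations": n_obs,
    "Missing cells": missing_cells,
    "Missing cells (%)": 100 * missing_cells / (n_obs * n_vars),
    "Duplicate rows": duplicate_rows,
    "Duplicate rows (%)": 100 * duplicate_rows / n_obs,
    "Total size in memory (bytes)": int(dataset.memory_usage(deep=True).sum()),
}

for name, value in overview.items():
    print(f"{name}: {value}")
```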

The Warnings tab lists warnings about cardinality, correlation with other variables, missing values, zeros, skewness of the variables, and many other issues.
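
As a rough illustration of the kind of checks behind such warnings, the sketch below flags a few of them with plain pandas. The thresholds and the warning labels are assumptions chosen for this example, not the values the library itself uses.

```python
import pandas as pd

# Hypothetical toy data for illustration.
dataset = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "income": [0, 0, 0, 0, 10, 500],
    "score": [1.0, None, None, None, 2.0, 3.0],
})

# Assumed thresholds, chosen only for this illustration.
MISSING_THRESHOLD = 0.4
ZEROS_THRESHOLD = 0.5
SKEW_THRESHOLD = 2.0

warnings_found = []
for column in dataset.columns:
    series = dataset[column]
    is_numeric = pd.api.types.is_numeric_dtype(series)
    # Share of missing values in the column.
    if series.isna().mean() > MISSING_THRESHOLD:
        warnings_found.append((column, "MISSING"))
    # Non-numeric column in which every present value is different.
    if not is_numeric and series.nunique() == series.count():
        warnings_found.append((column, "UNIQUE"))
    if is_numeric:
        # Share of zeros in the column.
        if (series == 0).mean() > ZEROS_THRESHOLD:
            warnings_found.append((column, "ZEROS"))
        # Strongly asymmetric distribution.
        if abs(series.skew()) > SKEW_THRESHOLD:
            warnings_found.append((column, "SKEWED"))

for column, label in warnings_found:
    print(f"{column}: {label}")
```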

The Reproduction tab simply displays information related to the report generation. It shows the start and end time of the analysis, the time taken to generate the report, the software version of pandas profiling, and a configuration download option.

2. Variables

This section of the report gives a detailed analysis of all the variables (columns or features) of the dataset. The information presented depends on the data type of the variable.

Numeric Variables

For numeric features, we get the number of distinct values, the number of missing values, the minimum and maximum, the mean, and the count of negative values. We also get a small histogram of the values.
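
The same summary can be sketched with plain pandas. The series below is made-up data, and the dictionary keys simply mirror the labels mentioned above.

```python
import pandas as pd

# Hypothetical numeric column with one missing value.
values = pd.Series([3.5, -1.0, 0.0, 3.5, None, 7.25], name="balance")

summary = {
    "Distinct": values.nunique(),            # missing values are not counted
    "Missing": int(values.isna().sum()),
    "Minimum": values.min(),
    "Maximum": values.max(),
    "Mean": values.mean(),
    "Negative": int((values < 0).sum()),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```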

The toggle button expands the view into four tabs: Statistics, Histogram, Common values, and Extreme values.

The statistics tab includes:

  1. Quantile statistics: Min-Max, percentiles, median, range, and IQR (Inter Quartile range)
  2. Descriptive statistics: Standard deviation, coefficient of variation, kurtosis, mean, skewness, variance, and monotonicity.
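
For reference, the statistics listed above can be approximated with plain pandas as in the sketch below. The data is made up, and pandas defaults (linear quantile interpolation, sample standard deviation) are assumed, so the report may differ slightly in the last digits.

```python
import pandas as pd

# Hypothetical numeric column, already sorted for readability.
values = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0, 14.0])

q1, median, q3 = values.quantile([0.25, 0.5, 0.75])

quantile_stats = {
    "Minimum": values.min(),
    "5th percentile": values.quantile(0.05),
    "Q1": q1,
    "Median": median,
    "Q3": q3,
    "95th percentile": values.quantile(0.95),
    "Maximum": values.max(),
    "Range": values.max() - values.min(),
    "IQR": q3 - q1,
}

descriptive_stats = {
    "Standard deviation": values.std(),
    "Coefficient of variation": values.std() / values.mean(),
    "Kurtosis": values.kurt(),
    "Mean": values.mean(),
    "Skewness": values.skew(),
    "Variance": values.var(),
    "Monotonic increasing": values.is_monotonic_increasing,
}

for name, value in {**quantile_stats, **descriptive_stats}.items():
    print(f"{name}: {value}")
```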

The Histogram tab displays the distribution of the numeric variable, that is, how often its values fall into each bin. The Common values tab is basically the value_counts of the variable, presented as both counts and percentage frequency.
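
A minimal sketch of such a common-values table, built with value_counts on made-up data, could look like this. The column labels are assumptions for the example.

```python
import pandas as pd

# Hypothetical numeric column with one missing value.
values = pd.Series([1, 2, 2, 3, 3, 3, None])

# Count every value, including missing ones, most frequent first.
counts = values.value_counts(dropna=False)

common_values = pd.DataFrame({
    "Count": counts,
    "Frequency (%)": 100 * counts / len(values),
})

print(common_values)
```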

String Variables

For string variables, we get the number of distinct (unique) values, the distinct percentage, the number of missing values, the missing percentage, the memory size, and a horizontal bar chart of the unique values with their counts.

It also reports any warnings associated with the variable irrespective of its data type.

The toggle button expands the view into four tabs: Overview, Categories, Words, and Characters.

For string values, the Overview tab displays the maximum, minimum, median, and mean length, the total number of characters, the number of distinct characters, the number of distinct categories, the number of unique values, and a sample from the dataset.
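
These length statistics are easy to reproduce with pandas string methods. The sketch below uses made-up data and assumes that "unique" means values occurring exactly once.

```python
import pandas as pd

# Hypothetical string column with one missing value.
names = pd.Series(["red", "green", "blue", "green", None])

present = names.dropna()
lengths = present.str.len()

overview = {
    "Max length": int(lengths.max()),
    "Median length": float(lengths.median()),
    "Mean length": float(lengths.mean()),
    "Min length": int(lengths.min()),
    "Total characters": int(lengths.sum()),
    "Distinct characters": len(set("".join(present))),
    "Distinct categories": names.nunique(),
    # Assumption: "unique" counts values that appear exactly once.
    "Unique": int((names.value_counts() == 1).sum()),
}

for name, value in overview.items():
    print(f"{name}: {value}")
```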

The categories tab displays a histogram and sometimes a pie chart of the value counts of the feature. The table contains the value, count, and percentage frequency.

The Words and Characters tabs present their data in the same tabular and histogram format as the Categories tab, but they go deeper: they also count words and characters, including breakdowns by lower case, upper case, punctuation, and special characters.
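
To show what such word and character counts involve, here is a simplified sketch using only the standard library and pandas. The category rules below are an assumption for illustration and are much coarser than a full Unicode-based classification.

```python
from collections import Counter
import string

import pandas as pd

# Hypothetical text column.
texts = pd.Series(["Hello, World!", "hello again", "Good-bye"])

# Words: lower-cased, split on whitespace, surrounding punctuation stripped.
word_counts = Counter(
    word.strip(string.punctuation)
    for text in texts
    for word in text.lower().split()
)


def char_category(ch: str) -> str:
    """Assign a character to a coarse, illustrative category."""
    if ch.islower():
        return "lowercase"
    if ch.isupper():
        return "uppercase"
    if ch in string.punctuation:
        return "punctuation"
    if ch.isspace():
        return "space"
    return "other"


category_counts = Counter(char_category(ch) for text in texts for ch in text)

print(word_counts.most_common(3))
print(dict(category_counts))
```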

3. Correlations

Correlation describes the degree to which two variables move in coordination with one another. In the pandas profiling report, we can access 5 types of correlation coefficients:

  • Pearson’s r
  • Spearman’s ρ
  • Kendall’s τ
  • Phik (φk)
  • Cramér’s V (φc)
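
As a small illustration of how two of these coefficients differ, the sketch below computes Pearson's r and Spearman's ρ with the pandas corr function on made-up data. Kendall's τ is available in pandas through method="kendall" but needs SciPy installed, and Phik is not built into pandas, so both are left out here.

```python
import pandas as pd

# Hypothetical data: x_cubed grows with x, but not linearly.
dataset = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
dataset["x_cubed"] = dataset["x"] ** 3
dataset["noise"] = [4, 1, 6, 2, 5, 3]

pearson = dataset.corr(method="pearson")    # linear association
spearman = dataset.corr(method="spearman")  # rank (monotonic) association

print(pearson.round(3))
print(spearman.round(3))
```

Because x_cubed is a monotonic function of x, the Spearman coefficient between them is 1, while the Pearson coefficient stays below 1 since the relationship is not linear.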

We can also click on the toggle button to get details about the various correlation coefficients.

4. Missing values

The generated report also contains visualizations of the missing values present in the dataset. We get 3 types of plots: count, matrix, and dendrogram. The count plot is a basic bar plot with the column names on the x-axis, where the length of each bar represents the number of values present (excluding nulls). The matrix shows, row by row, where values are present or missing in each column, and the dendrogram groups columns whose missing values tend to occur together.
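
The numbers behind the count plot and the matrix can be sketched with plain pandas, as below. The toy dataframe is made up, and the text rendering of the matrix is only a stand-in for the actual plot.

```python
import pandas as pd

# Hypothetical toy data with missing values in columns a and b.
dataset = pd.DataFrame({
    "a": [1, None, 3, 4],
    "b": [None, None, 3, 4],
    "c": [1, 2, 3, 4],
})

# Bar heights of the count plot: values present per column.
present_counts = dataset.notna().sum()

# Matrix: True where a value is present, False where it is missing.
nullity_matrix = dataset.notna()

print(present_counts)
for _, row in nullity_matrix.iterrows():
    # '#' marks a present value, '.' marks a missing one.
    print("".join("#" if is_present else "." for is_present in row))
```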

5. Sample

This section displays the first and last 10 rows of the dataset.
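
In plain pandas, the same view corresponds to the head and tail functions. The dataframe here is made up for illustration.

```python
import pandas as pd

# Hypothetical dataframe with 25 rows.
dataset = pd.DataFrame({"value": range(25)})

first_rows = dataset.head(10)  # first 10 rows
last_rows = dataset.tail(10)   # last 10 rows

print(first_rows)
print(last_rows)
```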

6. Interactions

This section generates a 2D scatter plot (or hexagonal binned plot) for every pair of continuous variables.
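
The sketch below shows the idea without drawing anything: it enumerates every pair of numeric columns and bins each pair into a small 2D grid. Note that it uses rectangular bins from NumPy's histogram2d as a stand-in for hexagonal binning, and that the data and column names are made up.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical toy data; "group" is not numeric and is skipped.
dataset = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 58.0, 69.0, 80.0],
    "age": [41.0, 23.0, 35.0, 29.0],
    "group": ["a", "b", "a", "b"],
})

continuous = dataset.select_dtypes(include="number").columns
pairs = list(combinations(continuous, 2))

binned = {}
for x, y in pairs:
    # 2 x 2 grid of point counts for this pair of variables.
    counts, _, _ = np.histogram2d(
        dataset[x].to_numpy(), dataset[y].to_numpy(), bins=2
    )
    binned[(x, y)] = counts

for pair, counts in binned.items():
    print(pair, counts.tolist())
```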

How to save the report?

We can save this report in:

1. HTML format

2. JSON format

The save function is the same for both formats; only the file extension changes. To save the report, call the “.to_file()” function on the profile object:

profile.to_file("eda_html_report_pandas_profiling.html")

profile.to_file("eda_html_report_pandas_profiling.json")        

Widget in Jupyter notebook

When running pandas profiling in a Jupyter notebook, the HTML report is rendered directly in the output of the code cell. We can instead make it act like a widget that is easily accessible and offers a compact view. To do this, simply call “.to_widgets()” on the profile object:

profile.to_widgets()        

You can check my GitHub profile for code.
