Exploratory Data Analysis Using Pandas Profiling
Pandas profiling is an open-source Python module for quick exploratory data analysis. It also generates interactive reports in web (HTML) format.
Pandas profiling helps in visualizing and understanding the distribution of each variable, and it collects this information in a single report.
The generated report comes with many features and customization options.
To start profiling a dataframe, there are two ways:
1. We can call the ‘.profile_report()’ method on a pandas dataframe.
import pandas_profiling  # importing the package adds .profile_report() to dataframes
dataset.profile_report()
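2. We can pass the dataframe to the ‘ProfileReport’ class and then display the resulting object to start generating the profile.
from pandas_profiling import ProfileReport  # newer releases are published as ydata-profiling (import ydata_profiling)
profile = ProfileReport(dataset, title='Regression Pandas Profiling Report', explorative=True)
profile  # displaying the object in a notebook cell renders the report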
Sections of the Report:
1. Overview
This section consists of three tabs: Overview, Warnings, and Reproduction.
The Overview tab shows overall statistics: the number of variables (features or columns of the dataframe), the number of observations (rows of the dataframe), missing cells, the percentage of missing cells, duplicate rows, the percentage of duplicate rows, and the total size in memory.
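As a rough cross-check, the same overall statistics can be approximated with plain pandas. This is a minimal sketch on a small made-up dataframe (the column names and values are illustrative), and it does not claim to reproduce how pandas profiling computes these numbers internally.

```python
import pandas as pd

# Small made-up dataframe standing in for `dataset`
dataset = pd.DataFrame({
    "age": [25, 32, None, 32, 41],
    "city": ["Pune", "Delhi", "Delhi", "Delhi", None],
})

n_rows, n_cols = dataset.shape
missing_cells = int(dataset.isna().sum().sum())
duplicate_rows = int(dataset.duplicated().sum())

overview = {
    "number_of_variables": n_cols,
    "number_of_observations": n_rows,
    "missing_cells": missing_cells,
    "missing_cells_pct": 100 * missing_cells / (n_rows * n_cols),
    "duplicate_rows": duplicate_rows,
    "duplicate_rows_pct": 100 * duplicate_rows / n_rows,
    "memory_bytes": int(dataset.memory_usage(deep=True).sum()),
}
print(overview)
```

Comparing these values against the Overview tab is a quick sanity check that the report was built from the dataframe you expected.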
The Warnings tab lists warnings related to cardinality, correlation with other variables, missing values, zeros, skewness of the variables, and more.
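To get a feel for what such warnings mean, here is a minimal sketch of three simple per-column checks in plain pandas. The thresholds, the function name simple_warnings, and the sample data are all assumptions made for this illustration; they are not the library's defaults or its actual logic.

```python
import pandas as pd

# Illustrative thresholds chosen for this sketch, not the library's defaults
MISSING_SHARE = 0.3
ZERO_SHARE = 0.5
SKEW_LIMIT = 1.0

# Made-up data for illustration
dataset = pd.DataFrame({
    "visits": [0, 0, 0, 0, 1, 2],
    "income": [1, 1, 1, 1, 1, 100],
    "score": [1.0, None, None, None, 5.0, 6.0],
})

def simple_warnings(df):
    """Return {column: [flags]} using simple checks on numeric columns."""
    flags = {}
    for name in df.select_dtypes("number").columns:
        col = df[name]
        found = []
        if col.isna().mean() > MISSING_SHARE:
            found.append("missing")
        if (col == 0).mean() > ZERO_SHARE:
            found.append("zeros")
        if abs(col.skew()) > SKEW_LIMIT:
            found.append("skewed")
        flags[name] = found
    return flags

warnings_found = simple_warnings(dataset)
print(warnings_found)
```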
The Reproduction tab displays information about the report generation itself. It shows the start and end time of the analysis, the time taken to generate the report, the software version of pandas profiling, and a configuration download option.
2. Variables
This section of the report gives a detailed analysis of all the variables/columns/features of the dataset. The information presented varies depending on the data type of the variable.
Numeric Variables
For numeric features, we get the number of distinct values, missing values, the minimum and maximum, the mean, and the count of negative values. We also get a small histogram of the values.
The toggle button expands to the Statistics, Histogram, Common values, and Extreme values tabs.
The Statistics tab includes quantile statistics (such as the minimum, median, and maximum) and descriptive statistics (such as the standard deviation, skewness, and kurtosis).
The Histogram tab displays the distribution of the numeric data. The Common values tab is essentially the value_counts of the variable, presented as both counts and percentage frequency.
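The summary numbers for a numeric variable, and the value_counts view behind the common values, can be sketched in plain pandas as follows. The series and its name are made up for illustration, and the report's own figures may be computed differently.

```python
import pandas as pd

# Made-up numeric column for illustration
values = pd.Series([3, -1, 3, 0, 7, None, 3, -1], name="delta")

summary = {
    "distinct": int(values.nunique()),   # missing values are not counted
    "missing": int(values.isna().sum()),
    "minimum": float(values.min()),
    "maximum": float(values.max()),
    "mean": float(values.mean()),
    "negative": int((values < 0).sum()),
}

# Common values: counts and percentage frequency side by side
counts = values.value_counts()
common_values = pd.DataFrame({
    "count": counts,
    "frequency_pct": 100 * counts / len(values),
})
print(summary)
print(common_values)
```

Here the percentage is taken over all rows, including the missing one; dividing by the number of non-missing rows instead is an equally reasonable choice.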
String Variables
For string variables, we get the distinct (unique) values, distinct percentage, missing values, missing percentage, memory size, and a horizontal bar chart of the unique values with their counts.
The report also shows any warnings associated with a variable, irrespective of its data type.
The toggle button expands to the Overview, Categories, Words, and Characters tabs.
For string variables, the Overview tab displays the maximum, minimum, median, and mean length, total characters, distinct characters, distinct categories, unique values, and a sample from the dataset.
The categories tab displays a histogram and sometimes a pie chart of the value counts of the feature. The table contains the value, count, and percentage frequency.
The Words and Characters tabs present their data in the same tabular and histogram format as the Categories tab, but they go deeper, also reporting counts for categories such as lower case, upper case, punctuation, and special characters.
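The length statistics and the character and word breakdowns described above can be approximated with pandas string methods. This is only a sketch on a made-up column; the category names used here (lowercase, uppercase, space, other) are simplifications chosen for the example, not the report's exact groupings.

```python
import pandas as pd

# Made-up string column for illustration
names = pd.Series(["Pune", "Delhi", "New Delhi", "Pune", "Goa!"], name="city")

lengths = names.str.len()
all_text = "".join(names)

length_stats = {
    "max_length": int(lengths.max()),
    "min_length": int(lengths.min()),
    "median_length": float(lengths.median()),
    "mean_length": float(lengths.mean()),
    "total_characters": int(lengths.sum()),
    "distinct_characters": len(set(all_text)),
}

# Character-level breakdown across the whole column
char_categories = {
    "lowercase": sum(ch.islower() for ch in all_text),
    "uppercase": sum(ch.isupper() for ch in all_text),
    "space": sum(ch.isspace() for ch in all_text),
    "other": sum(not (ch.isalpha() or ch.isspace()) for ch in all_text),
}

# Word-level counts, similar in spirit to the Words tab
word_counts = names.str.split().explode().value_counts()

print(length_stats)
print(char_categories)
print(word_counts)
```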
3. Correlations
Correlation is used to describe the degree to which two variables move in coordination with one another. In the pandas profiling report, we can access 5 types of correlation coefficients: Pearson's r, Spearman's ρ, Kendall's τ, Phik (φk), and Cramér's V.
We can also click on the toggle button to get details about the various correlation coefficients.
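Two widely used coefficients, Pearson and Spearman, can be computed directly in pandas, which makes the difference between them easy to see. The sketch below uses made-up data where y increases with x but not linearly; whether the values match the report's numbers exactly is an assumption.

```python
import pandas as pd

# Made-up data: y grows monotonically with x, but not linearly
data = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
data["y"] = data["x"] ** 3

pearson = data.corr(method="pearson")    # measures linear association
spearman = data.corr(method="spearman")  # rank-based, measures monotonic association

print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
```

Because the relationship is perfectly monotonic, the Spearman coefficient reaches 1, while the Pearson coefficient stays below 1 since the relationship is not a straight line.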
4. Missing values
The generated report also contains visualizations of the missing values present in the dataset. We get 3 types of plots: count, matrix, and dendrogram. The count plot is a basic bar plot with the column names on the x-axis; the length of each bar represents the number of non-null values in that column. The matrix shows where in the dataset the missing values occur, and the dendrogram groups columns whose missing values tend to occur together.
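The numbers behind the count plot and the matrix plot are easy to reproduce in pandas. This is a minimal sketch on a made-up dataframe; it shows the underlying data only and does not draw the plots themselves.

```python
import pandas as pd

# Made-up dataframe with some missing values
dataset = pd.DataFrame({
    "age": [25, None, 31, None],
    "city": ["Pune", "Delhi", None, "Goa"],
    "score": [1.0, 2.0, 3.0, 4.0],
})

# Values present per column: the bar lengths of a count plot
present_counts = dataset.notna().sum()

# Boolean grid (True = present): the raw data behind a matrix-style plot
presence_matrix = dataset.notna()

print(present_counts)
print(presence_matrix)
```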
5. Sample
This section displays the first and last 10 rows of the dataset.
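The same view is available directly in pandas through head and tail. A small sketch, assuming a made-up dataframe of 25 rows:

```python
import pandas as pd

dataset = pd.DataFrame({"row_id": range(1, 26)})  # 25 made-up rows

first_rows = dataset.head(10)  # first 10 rows
last_rows = dataset.tail(10)   # last 10 rows

print(first_rows)
print(last_rows)
```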
6. Interactions
This section generates a 2D scatter plot (or hexagonal binned plot) for all pairs of continuous variables.
How to save the report?
We can save this report in two formats:
1. HTML format
2. JSON format
The save function is the same for both formats; only the file extension changes. To save the report, call the “.to_file()” function on the profile object:
profile.to_file("eda_html_report_pandas_profiling.html")
profile.to_file("eda_html_report_pandas_profiling.json")
Widget in Jupyter notebook
When running pandas profiling in a Jupyter notebook, the HTML report is rendered inside the output of the code cell. We can instead make it act like a widget that is easily accessible and offers a compact view. To do this, simply call “.to_widgets()” on the profile object:
profile.to_widgets()
You can check my GitHub profile for code.