Exploratory Data Analysis: Four Must-Know Techniques
Introduction
Exploratory data analysis
Univariate Analysis
A DataFrame is made up of different variables, which are represented by columns. Univariate analysis examines the distribution and statistical properties of each variable (column) in isolation. This is useful when you want a deep understanding of the values in a single column. Univariate analysis includes summary statistics, visualizations, and outlier detection.
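As a minimal sketch of the summary-statistics step, assume a small made-up DataFrame with a "Height" column in centimeters (the values are illustrative and are not the article's dataset). The pandas `describe()` method reports the count, mean, standard deviation, minimum, quartiles, and maximum of a column:

```python
import pandas as pd

# Hypothetical heights in cm; illustrative values, not the article's data.
df = pd.DataFrame({
    "Height": [140, 145, 160, 162, 165, 167, 168, 170, 170,
               171, 172, 173, 175, 176, 178, 180, 200, 210],
})

# Summary statistics for a single column (univariate analysis).
summary = df["Height"].describe()
print(summary)

# Individual statistics are also available directly.
print("Median:", df["Height"].median())
print("Skewness:", df["Height"].skew())
```

Comparing the mean with the median, and looking at how far the minimum and maximum sit from the quartiles, gives a first hint about skew and possible outliers before any plot is drawn.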
In this example, the box (the orange box) represents the middle 50% of the data in the "Height" column, and the line inside the box represents the median. The whiskers extend from the box to the smallest and largest values that lie within 1.5 times the interquartile range (IQR) of the box edges, that is, of the first and third quartiles. Any value beyond the whiskers is considered an outlier. Here, 200 and 210 are upper outliers and 145 and 140 are lower outliers, because they fall outside the whiskers.
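The 1.5 times IQR rule behind the whiskers can also be applied numerically. The sketch below assumes a made-up set of heights that happens to include 140, 145, 200, and 210; the values are illustrative and are not the article's dataset.

```python
import pandas as pd

# Hypothetical heights in cm; illustrative values, not the article's data.
heights = pd.Series(
    [140, 145, 160, 162, 165, 167, 168, 170, 170,
     171, 172, 173, 175, 176, 178, 180, 200, 210],
    name="Height",
)

# First and third quartiles (the box edges) and the interquartile range.
q1 = heights.quantile(0.25)
q3 = heights.quantile(0.75)
iqr = q3 - q1

# Fences: values more than 1.5 * IQR beyond the box edges count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = heights[(heights < lower_fence) | (heights > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print("Outliers:", outliers.tolist())
```

With these made-up values, the flagged heights are 140, 145, 200, and 210. Matplotlib's `boxplot` applies the same 1.5 times IQR rule by default (`whis=1.5`), so the points drawn beyond the whiskers correspond to this calculation.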
Bivariate Analysis
Once you have analyzed individual columns, the next step is to explore the relationship between two variables. This is known as bivariate analysis. Its purpose is to reveal correlations, associations, or differences between pairs of variables. For example, you may want to know how a person's height relates to their weight, which means analyzing the "Height" and "Weight" columns together. A scatter plot is a natural way to check whether the two variables are correlated.
You can see that most of the data points are concentrated in the middle of the plot, indicating that a large number of individuals in the dataset have heights around 170 cm and weights around 70 kg. The points show no clear upward or downward trend, so the scatter plot suggests a weak or non-existent linear correlation between height and weight in this dataset.
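The visual impression from a scatter plot can be backed up with a number. The sketch below computes the Pearson correlation coefficient with pandas, assuming a small made-up sample of heights (cm) and weights (kg); the values are illustrative and are not the article's dataset.

```python
import pandas as pd

# Hypothetical sample; illustrative values, not the article's data.
df = pd.DataFrame({
    "Height": [150, 158, 163, 167, 170, 172, 175, 180, 186, 192],  # cm
    "Weight": [72, 65, 80, 58, 70, 77, 62, 69, 74, 66],            # kg
})

# Pearson correlation coefficient: +1 or -1 means a perfect linear
# relationship; values near 0 indicate little linear association.
r = df["Height"].corr(df["Weight"])
print(f"Pearson r between Height and Weight: {r:.2f}")
```

A coefficient close to 0 is consistent with a scatter plot that shows no trend. Pearson's r only measures linear association, so it complements the plot rather than replacing it.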
Multivariate Analysis
Beyond pairs of variables, you can also analyze relationships among three or more variables at once. This is known as multivariate analysis. A good starting point is a pairplot, which tackles the challenge of visualizing relationships between multiple variables simultaneously. Each panel of a pairplot is itself a bivariate view of two variables, but the grid as a whole is a tool for multivariate analysis because it shows all pairwise relationships in a single visualization. Here is a pairplot of the three variables in the dataset.
You can see that a pairplot provides both a comprehensive view of the relationships between multiple variables (bivariate analysis) and a univariate view of each variable. The diagonal elements of the plot (histograms) show the distribution of the data for each variable independently. You can use the graphs to assess the shape of the distribution (e.g., normal, skewed, or uniform). The scatter plots in the pairplot help you visualize the relationship between two variables, revealing patterns, correlations, and potential outliers. This pairplot does not show strong relationships between height, weight, and age in this dataset. This is evident from the scatter plots, which do not show any discernible patterns or trends.
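As a numeric companion to the pairplot, a correlation matrix summarizes every pairwise relationship in one table. This sketch assumes a small made-up DataFrame with "Height", "Weight", and "Age" columns; the values are illustrative and are not the article's dataset.

```python
import pandas as pd

# Hypothetical sample; illustrative values, not the article's data.
df = pd.DataFrame({
    "Height": [150, 158, 163, 167, 170, 172, 175, 180, 186, 192],  # cm
    "Weight": [72, 65, 80, 58, 70, 77, 62, 69, 74, 66],            # kg
    "Age":    [34, 22, 45, 29, 51, 38, 26, 60, 31, 42],            # years
})

# Pairwise Pearson correlations between all numeric columns.
# The diagonal is 1.0 because each column correlates perfectly with itself.
corr_matrix = df.corr()
print(corr_matrix.round(2))
```

Off-diagonal values near 0 match scatter panels without a visible trend. If seaborn is installed, `sns.pairplot(df)` draws the corresponding grid of scatter plots and histograms for the same DataFrame.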
Feature Engineering
Feature engineering can also be used during EDA to gain deeper insights into the data. It transforms data into a more informative and usable format by adding new variables to the dataset or transforming existing ones. Let's perform feature engineering on our dataset by adding a body mass index (BMI) column.
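A minimal sketch of this step, assuming "Height" is recorded in centimeters and "Weight" in kilograms, and using a made-up three-row DataFrame rather than the article's dataset. BMI is weight in kilograms divided by the square of height in meters:

```python
import pandas as pd

# Hypothetical sample; illustrative values, not the article's data.
df = pd.DataFrame({
    "Height": [160, 170, 180],  # assumed to be in centimeters
    "Weight": [55, 70, 90],     # assumed to be in kilograms
})

# BMI = weight (kg) / height (m) squared, so convert cm to m first.
height_m = df["Height"] / 100
df["BMI"] = (df["Weight"] / height_m ** 2).round(1)
print(df)
```

The new column is derived entirely from existing ones, so it adds no new measurements. It re-expresses height and weight as a single value that may be easier to compare across individuals.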
The column "BMI" has been added to the DataFrame. So, by creating new features or transforming existing ones, you might uncover features that are more relevant to your analysis and remove redundant or irrelevant ones.
Conclusion
These are just a few examples of the various types of EDA techniques used in data analysis. Depending on the nature of the dataset and the specific goals of the analysis, you may use different combinations of these techniques to gain a comprehensive understanding of the data. The book "50 Days of Data Analysis with Python: The Ultimate Challenge Book for Beginners" provides a comprehensive set of challenges to help you learn various types of EDA. Join this LinkedIn group for Python students and professionals to learn more about Python-related topics.