Data Pre-processing

Introduction

Over the years, we have seen decision making and analytics in many industries become increasingly driven by data and models. Many organizations are starting to embrace machine learning as part of business optimization. However, machine learning models are only as good as the quality of the data used to train them. Consider how financial institutions use client information such as transaction amounts or locations to detect credit card fraud: unclean observations can easily throw off a machine learning model. Once data is collected from various sources, the first task is to ensure that the data is cleaned appropriately before being fed into a model.

Data pre-processing is never an easy step, and many data scientists face several challenges when it comes to it. Aside from the challenge of understanding the business problem and coming up with the best approach to solving it, data pre-processing also requires a great deal of time and judgement. In this article, I used the scientific Python ecosystem and the geospatial data visualization and analysis capabilities of Folium and the ArcGIS platform to highlight some of the key stages of data pre-processing, namely: Data Cleaning, Exploratory Data Analysis, Variable Transformation, and Feature Engineering.

I used the California Housing data obtained from GitHub. The data set contains 20,640 entries, each representing one district. There are 10 attributes: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity. The dataset is based on the 1990 California census. It is not exactly recent, but it has many qualities for learning, so we will pretend it is recent data.

Note: The Python notebook used for this article can be found here.

Data Collection

In a typical real-world setting, data is usually spread across multiple tables, documents, or files from different sources, and to access it you would first need credentials and access authorizations. For the dataset used here, things were a lot simpler: I downloaded a single compressed file containing a CSV file with all the data.

To download the file, I used a small function. When called, it creates a dataset/housing directory, downloads housing.tgz, and extracts housing.csv into that directory. This approach is particularly useful if changes are made to the original file, since the latest version is fetched whenever the code is run. We can then take a quick look at a snippet of the data frame.
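A minimal sketch of such a download helper is shown below. The URL, directory layout, and function names are assumptions for illustration and may differ from the original notebook.

```python
import os
import tarfile
import urllib.request

import pandas as pd

# Assumed location of the compressed data set (hypothetical URL for illustration).
HOUSING_URL = ("https://raw.githubusercontent.com/ageron/handson-ml2/"
               "master/datasets/housing/housing.tgz")
HOUSING_PATH = os.path.join("dataset", "housing")


def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz and extract housing.csv into housing_path."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)


def load_housing_data(housing_path=HOUSING_PATH):
    """Read the extracted CSV into a pandas data frame."""
    return pd.read_csv(os.path.join(housing_path, "housing.csv"))


fetch_housing_data()
housing = load_housing_data()
print(housing.head())  # quick look at the first few rows
```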


Using pandas' info() method, we get a quick description of the data: the total number of rows, each attribute's type, and the number of non-null values.
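Assuming the data frame is named housing as above, the call is simply:

```python
# Quick structural summary: row count, column dtypes, and non-null counts.
housing.info()
```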


There are 10 attributes in total, 9 of which are numerical (float64); ocean_proximity is the only non-numeric attribute, with the object type.

Data Cleaning

Missing data

Rarely does a data scientist avoid the problem of missing data. Once missing data is identified, the challenge is to address the issues it raises. To do so, the data scientist's primary concern is to look for patterns and relationships underlying the missing data, so that the original distribution of values is maintained as closely as possible when any remedy is applied. According to Hair et al. (2013), chapter 'Examining your data', missing data under 10% for an individual case or observation can generally be ignored. Once this decision is made, the data scientist applies specialized techniques for ignorable missing data. Pandas makes it extremely easy to sanitize tabular data.


A summary of the missing values in the data frame showed that 207 entries in the total_bedrooms column were recorded as NA, indicating the absence of the described attribute. These missing values were remedied using the mean as the measure of centrality.
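A minimal sketch of the check and the mean imputation, assuming the data frame is named housing:

```python
# Count missing values per column; total_bedrooms shows 207 nulls.
print(housing.isnull().sum())

# Impute missing total_bedrooms values with the column mean.
mean_bedrooms = housing["total_bedrooms"].mean()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(mean_bedrooms)
```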

Outliers

Outliers in real estate data can arise for several reasons, such as typographical errors during data entry. Plotting the distribution of the numeric columns can give a sense of the outliers in the data set. I used two different kinds of plots to check for the presence of outliers: a histogram plot and a box plot.

Histogram Plots of variables

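A sketch of how the histograms can be produced with pandas and Matplotlib (the bin count and figure size are arbitrary choices here):

```python
import matplotlib.pyplot as plt

# One histogram per numeric column.
housing.hist(bins=50, figsize=(16, 10))
plt.tight_layout()
plt.show()
```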

Boxplots of Variables

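And a sketch of the boxplots, drawing each numeric column on its own axis since the value ranges differ widely:

```python
import matplotlib.pyplot as plt

numeric_cols = housing.select_dtypes(include="number").columns
fig, axes = plt.subplots(1, len(numeric_cols), figsize=(18, 4))
for ax, col in zip(axes, numeric_cols):
    housing.boxplot(column=col, ax=ax)  # one boxplot per numeric column
plt.tight_layout()
plt.show()
```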

From the boxplots above, it appears that there are several outliers to be concerned about. total_rooms, total_bedrooms, population, households, and median_income appear to have the largest number of outliers, whereas median_house_value does not seem to have that many, though we cannot know for sure. From the histograms, we see heavy skewing in the distributions, which could also be a result of outliers. A few different approaches exist for filtering outliers. A popular technique is the 6 sigma filter, which removes values that are more than 3 standard deviations from the mean. This filter assumes that the data follows a normal distribution and uses the mean as the measure of centrality. However, when data suffers heavily from outliers, as in this case, the mean can be distorted. An interquartile range (IQR) filter, which is based on the median and quartiles, more robust measures of centrality and spread, can filter out outliers in a more reliable fashion.

6 Sigma Filter

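A sketch of a 6 sigma filter, keeping only rows that lie within three standard deviations of the mean for every numeric column (the exact column list is an assumption):

```python
import numpy as np

numeric_cols = ["total_rooms", "total_bedrooms", "population",
                "households", "median_income", "median_house_value"]

mean = housing[numeric_cols].mean()
std = housing[numeric_cols].std()

# Keep rows where every numeric value lies within mean ± 3 standard deviations.
within_3_sigma = (np.abs(housing[numeric_cols] - mean) <= 3 * std).all(axis=1)
housing_sigma = housing[within_3_sigma]
```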

IQR Filter

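And a sketch of the IQR filter, using the conventional 1.5 × IQR fences around the first and third quartiles:

```python
numeric_cols = ["total_rooms", "total_bedrooms", "population",
                "households", "median_income", "median_house_value"]

q1 = housing[numeric_cols].quantile(0.25)
q3 = housing[numeric_cols].quantile(0.75)
iqr = q3 - q1

# Keep rows where every numeric value lies inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
within_fences = ((housing[numeric_cols] >= q1 - 1.5 * iqr) &
                 (housing[numeric_cols] <= q3 + 1.5 * iqr)).all(axis=1)
housing = housing[within_fences]
```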

After removing outliers using the IQR filter, the distribution of numeric columns looks much healthier.

Data Transformation

Models used in machine learning workflows often make assumptions about the data. An example is the logistic regression model, which assumes a linear relationship between the independent variables and the log odds.

In most datasets, feature and target variables are often skewed. The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right. By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail.


This issue can be addressed with data transformation. There are many approaches, such as the log transformation and the square root transformation. In this article, a square root transformation is used to transform the skewed variables.
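A sketch of the square root transformation applied to the skewed columns (the set of columns transformed here is an assumption):

```python
import numpy as np

skewed_cols = ["total_rooms", "total_bedrooms", "population",
               "households", "median_income"]

print(housing[skewed_cols].skew())   # skewness before the transformation
housing[skewed_cols] = np.sqrt(housing[skewed_cols])
print(housing[skewed_cols].skew())   # skewness after the transformation
```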


It appears that most of the variables are more normally distributed after the transformation.

Exploratory Data Analysis

When it comes to exploratory data analysis, the goal is mainly to go in-depth and gain a deeper understanding of the data. At this stage of the pre-processing exercise, two main kinds of techniques are used: statistical techniques and visual techniques.

Given that this dataset contains longitudes and latitudes, one visualization approach is to use a scatterplot to view all the districts. Since the data covers districts in the state of California, the resulting scatterplot resembles the map of California.


Making some modifications to the plot made it easier to visualize the places where there is a high density of data points, as sketched below.
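A sketch of the modified scatterplot, with a low alpha to reveal density, marker size proportional to population, and colour encoding median house value (these styling choices are assumptions):

```python
import matplotlib.pyplot as plt

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1,
             s=housing["population"] / 100, label="population",
             c="median_house_value", cmap="jet", colorbar=True,
             figsize=(10, 7))
plt.legend()
plt.show()
```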


You can see the high-density areas, namely the Bay Area and the areas around Los Angeles and San Diego, plus a long line of fairly high density in the Central Valley, in particular around Sacramento and Fresno. However, this is only apparent if one knows the layout of California or has a map at hand.

Pandas provides an efficient API to explore the statistical distribution of the numeric columns. To explore the spatial distribution of this data set, folium for Python is used.

Similar to plotting a statistical chart from a data frame object, a spatial plot can be drawn on an interactive map widget, as shown below.
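A minimal sketch of such a map with Folium, drawing a small circle marker per district (a sample is taken to keep the widget responsive; the parameters here are assumptions):

```python
import folium

# Centre the map roughly on California.
m = folium.Map(location=[37.0, -119.5], zoom_start=6)

for _, row in housing.sample(1000, random_state=42).iterrows():
    folium.CircleMarker(
        location=[row["latitude"], row["longitude"]],
        radius=2,
        fill=True,
        popup=f"Median house value: {row['median_house_value']:.0f}",
    ).add_to(m)

m  # renders the interactive widget in a notebook
```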


Using a clustering algorithm to detect the main clusters, I was able to create a map showing the different clusters of districts. It shows a similar clustering distribution to the one noted in the scatterplot above.
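The article does not name the clustering algorithm, so the sketch below uses K-Means on the coordinates purely as an illustration; the cluster labels can then be used to colour the Folium markers.

```python
from sklearn.cluster import KMeans

coords = housing[["latitude", "longitude"]]
kmeans = KMeans(n_clusters=6, random_state=42, n_init=10)
housing["cluster"] = kmeans.fit_predict(coords)  # cluster label per district
```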


We could go a step further and cluster the observations based on the different counties in California. To do this, we need to add a new column to the dataset containing the county in which each district is found. This was done using reverse geocoding, with both Folium and the ArcGIS API for Python.
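A rough sketch of the ArcGIS reverse geocoding step is shown below. It assumes an authenticated (or anonymous) GIS connection and that the geocoder returns the county under the 'Subregion' address field; both are assumptions, and batch geocoding may require credentials.

```python
from arcgis.gis import GIS
from arcgis.geocoding import reverse_geocode

gis = GIS()  # anonymous connection; a licensed account may be required for batch use


def county_for(lon, lat):
    # reverse_geocode expects an [x, y] location, i.e. [longitude, latitude]
    result = reverse_geocode([lon, lat])
    return result["address"].get("Subregion")  # assumed county field


housing["county"] = [county_for(lon, lat)
                     for lon, lat in zip(housing["longitude"], housing["latitude"])]
```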


Based on the map generated, we see that the Los Angeles county area has the highest population.

Correlations

In pandas, you can easily compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes using the corr() method.


Correlation coefficients range from -1 to 1. A coefficient close to 1 indicates a strong positive linear relationship; for example, the correlation coefficient of 0.68 between median_income and median_house_value means that as median_income increases, so does median_house_value. A coefficient close to -1 indicates a strong negative relationship, and a coefficient close to zero means there is no linear correlation.

We can also check the correlations between attributes using the scatter matrix function, which plots every numerical attribute against every other attribute, with a histogram of each attribute along the main diagonal. Again, we see that the most promising attribute for predicting the median house value is the median income.
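A sketch of both checks with pandas (the attributes chosen for the scatter matrix are an assumption):

```python
from pandas.plotting import scatter_matrix

# Pairwise Pearson correlations, sorted against the target.
corr_matrix = housing.select_dtypes("number").corr()
print(corr_matrix["median_house_value"].sort_values(ascending=False))

# Scatterplot matrix for a few promising attributes.
attributes = ["median_house_value", "median_income",
              "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
```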


Categorical Variables

These are variables that describe a characteristic of a data unit and take values from a small set of categories. In pandas, a categorical variable can have the type object or int64. A good way to visualize the relationship between a categorical variable and a numeric variable is with boxplots. Below is a boxplot generated to investigate the relationship between ocean_proximity and median_house_value.
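A sketch of the boxplot, here using seaborn (the original notebook may have used a different plotting library):

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 5))
sns.boxplot(x="ocean_proximity", y="median_house_value", data=housing)
plt.show()
```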


Here we see that the distributions of median house values across the five ocean proximity categories are distinct enough to consider ocean_proximity a potentially good predictor of median_house_value.

ANOVA: Analysis of Variance

The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:

F-test score: ANOVA assumes that the means of all groups are equal, calculates how much the actual means deviate from that assumption, and reports the result as the F-test score. A larger score means there is a larger difference between the means.

P-value: the p-value tells us how statistically significant our calculated F-test score is.

If one variable is strongly correlated with the variable we are analyzing, expect ANOVA to return a sizeable F-test score and a small p-value.

We could use this to check the extent of the relationship between ocean proximity and median house value. For this analysis, we conjecture that median_house_value differs significantly across ocean_proximity categories; the aim is to gather sufficient evidence to reject the null hypothesis.

  • H_0: median_house_value of houses with different ocean_proximities are equal
  • H_1: median_house_value of houses with different ocean_proximities are not equal

We set α to 0.05, which corresponds to a 95% confidence level. The test can be run with SciPy, as sketched below.
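A sketch of the one-way ANOVA with SciPy's f_oneway, grouping median house values by ocean proximity:

```python
from scipy import stats

groups = [group["median_house_value"].values
          for _, group in housing.groupby("ocean_proximity")]

f_score, p_value = stats.f_oneway(*groups)
print(f"F-test score: {f_score:.2f}, p-value: {p_value:.4f}")
```

The same call, grouping by the county column instead of ocean_proximity, covers the second test further down.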


This is a great result: the large F-test score shows a strong relationship, and a p-value of almost 0 implies near-certain statistical significance. At the 95% confidence level, we therefore reject the null hypothesis that the median house values of houses with different ocean proximities are equal.

Next, we could also check the extent of the relationship between the county and the median house value. For this analysis, we conjecture that median_house_value differs significantly across counties; again, the aim is to gather sufficient evidence to reject the null hypothesis.

  • H_0: median_house_value of houses with different counties are equal
  • H_1: median_house_value of houses with different counties are not equal

We set α to 0.05, which corresponds to a 95% confidence level.


Again, this is a great result: the large F-test score shows a strong relationship, and a p-value of almost 0 implies near-certain statistical significance. At the 95% confidence level, we therefore reject the null hypothesis that the median house values of houses in different counties are equal.

Feature Engineering

Feature encoding.

Most machine learning algorithms work on numerical data only and cannot process categorical data directly, so we need to transform categorical data into numerical form without losing information. Since the ocean_proximity variable has no ordinal (ranking) properties, I used one-hot encoding.
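A sketch of one-hot encoding with pandas' get_dummies (scikit-learn's OneHotEncoder would work equally well):

```python
import pandas as pd

# Replace the single categorical column with one binary column per category.
housing = pd.get_dummies(housing, columns=["ocean_proximity"])
```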


Feature Scaling.

Feature scaling is one of the most important prerequisites for most machine learning algorithms. Different features of a dataset can have ranges of values that differ widely from one another. For instance, attributes such as median_house_value tend to be very large numbers compared to, say, the number of bedrooms; when used without scaling, they tend to dominate the scores beyond their allotted weight.

To rectify this, all numerical columns were scaled to a uniform range of 0 to 1 using MinMaxScaler from the scikit-learn library.
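A sketch of the scaling step (in a real pipeline the scaler should be fit on the training split only):

```python
from sklearn.preprocessing import MinMaxScaler

numeric_cols = housing.select_dtypes(include="number").columns
scaler = MinMaxScaler()  # maps each column to the range [0, 1]
housing[numeric_cols] = scaler.fit_transform(housing[numeric_cols])
```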


Conclusion

This article demonstrates how data science and machine learning can be employed in one aspect of the real estate industry. As shown in this study, Python libraries such as Pandas can be used for visualization and statistical analysis, and libraries such as Folium and the ArcGIS API for Python for spatial analysis.

I sought to give a simple illustration of some of the most common and important steps involved in data pre-processing. The depth to which one goes in exploring the data will usually depend on the kind of relationships one seeks to uncover. In most cases, the process will not be as simple as illustrated above: starting from the data retrieval stage, data is generally not easily accessible. The data cleaning stage will also most likely be iterative, as you may have to run different tests over and over to ensure that the data is clean enough for the model.

Another thing to note is that the input data was spatially enriched with information about the counties in which the observed districts are found. This could be extended with further enrichment using various APIs, depending on the kind of model one intends to build and the business problem one seeks to answer.

