Comprehensive Machine Learning Solution

Comprehensive Machine Learning Solution

All programming by John Akwei, ECMp ERMp Data Scientist

May 18, 2021


Table of Contents

Section 1 - Problem Definition

Section 1.1 - Project Summary

Section 1.2 - Modeling Process Presentation

Section 2 - Data Preparation

Section 2.1 - Working Directory Specification

Section 2.2 - Required R Language Packages

Section 2.3 - Session Information

Section 2.4 - Data Importing

Section 3 - Data Discovery

Section 3.1 - Exploratory Data Analysis

Section 3.2 - Statistical Analysis of the Dataset

Section 3.3 - Data Optimization

Section 4 - Feature Engineering

Section 4.1 - Outliers, Missing Values, Conversion

Section 4.2 - Feature Selection

Section 4.3 - Cross-Validation

Section 4.4 - Feature Scaling (Data Normalization)

Section 5 - Model Development

Section 5.1 - Linear Regression Modeling

Section 5.2 - Evaluation of Model Assumptions

Section 5.3 - Predictive Analytics

Section 6 - Model Validation

Section 6.1 - Accuracy of the Predictions

Section 7 - Insights & Inferences

Section 7.1 - Conclusions

Section 8 - References


Section 1 - Problem Definition

Section 1.1 - Project Summary

This document is a Machine Learning End-to-End Solution that encompasses a wide variety of Machine Learning techniques for gaining insight from many kinds of datasets. The overall objective of this document is to autonomously train several Predictive Analytics models on cross-validated data (separated into training and testing datasets) in order to predict a continuous outcome. The following Machine Learning pipeline, coded in the R programming language, can be applied to generic Data Science classification and regression problems. The technical documentation of this Machine Learning Solution explains each Machine Learning step in depth, and guides Data Scientists in applying the steps to new data. This Comprehensive Machine Learning Solution is programmed in the R programming language, encoded in the RMarkdown format, and includes technical descriptions, statistical formulas, functional code, graphs, and tables.



Machine Learning

Machine learning is a branch of artificial intelligence that automates analytical model building for data analysis. Machine Learning algorithms are configured to learn from data, identify patterns, and make decisions with minimal human intervention. Machine Learning Theory asserts that, via pattern recognition, computers can learn without being programmed to perform specific tasks. In order to accomplish this, Machine learning algorithms rely on Statistical Inference to find generalizable predictive patterns.

Machine Learning achieves the objectives of Artificial Intelligence by utilizing iteration to independently adapt as models are exposed to new data. Machine Learning algorithms learn from previous data analytics to produce reliable, repeatable decisions and results. Performance is limited by the probabilistic bounds of the training datasets used for learning.

Interest in machine learning is due to the volumes and varieties of data available on the internet, computational processing that is cheaper and more powerful, and affordable data storage. All of these factors make it possible to quickly and automatically produce models that can analyze more complex data and deliver faster, more accurate results, even on a very large scale. By building more precise models, an organization has a better chance of identifying profitable opportunities, avoiding unknown risks, or discovering new information.

Machine Learning Theory

Machine learning algorithms are described as learning a target function (f) that best maps input variables (X) to an output variable (Y).

Y = f(X)

Predictions of future values (Y) are made from new examples of input variables (X). The function (f) represents the Machine Learning algorithm that processes new data via regression and Predictive Analytics. Machine Learning has three general algorithmic methods: Supervised, Unsupervised, and Reinforcement Learning. Supervised learning requires labeled example inputs/outputs in order to learn the mapping of inputs to outputs. Unsupervised learning examines data without labels to infer the relationships between dataset variables. Unsupervised learning can assist in creating new labels for data categories from hidden patterns in the data, thereby determining the significance of dataset features that were previously unknown, or organizing datasets into related sub-populations not previously specified in the dataset. Reinforcement learning interacts with a dynamic environment to perform a specific task; reinforcement learning algorithms maximize the feedback that corresponds with the objectives of the Machine Learning task.

Data Science

Data Science is the application of scientific methods and processes to the analysis of datasets held in modern, voluminous data storage that is accessible via the internet. The objective of analyzing business datasets with Data Science methods is to extract knowledge that increases the competitiveness of businesses, or to provide insights that can lead to increased efficiency in business processes. Businesses can benefit from the Data Science analysis of proprietary datasets, or of datasets published on the internet by other businesses, governments, or other institutions. Internet datasets are usually available in formats that are readable by modern Data Science programming languages.

The primary categories of data found within internet datasets are “structured data” and “unstructured data”. Structured data is characterized by an organized, easily processable format. Unstructured data is characterized by a difficult-to-interpret arrangement of observations and data samples, and usually requires re-organization of data cells for easier algorithmic processing.

This Data Science document is programmed in the popular Data Science programming language, R. The R programming language is derived from the statistical programming language “S”, developed at Bell Laboratories. R emerged alongside the growth of Data Science, and is uniquely capable of handling the processes required by Data Science. The R programming language allows for convenient dataset access, efficient algorithmic manipulation of datasets (for example, the ability to apply functions across datasets without FOR loops), and efficient statistical processing of dataset records and observations. The R programming language has a vast collection of dataset processing packages that encompass a wide variety of modern statistical and scientific methods. R also provides convenient graphical processing of data contained within internet datasets.


Section 1.2 - Modeling Process Presentation


This document utilizes a Machine Learning pipeline to predict the price of new Cars appended to the US Cars data found at AUCTION EXPORT.com. This dataset includes Information about 28 brands of clean and used vehicles for sale in the USA. Twelve features were assembled for each car in the dataset. The objective of this document’s Machine Learning pipeline is to determine the most effective method of predicting car prices from the available Machine Learning techniques.

The Data Science techniques sequentially applied within this document include Data Importation from the internet, Data Preparation via Exploratory Data Analysis, Statistical Analysis of the dataset, and Optimization of the data. Feature Engineering then takes place via Feature Selection, Cross-Validation, and Feature Scaling for Data Normalization. The relationships between the dataset’s variables are then modelled via Linear Regression, Evaluation of Regression Techniques, and Predictive Analytics. Finally, the predictive models are validated via the accuracy measurement techniques of R2, Root Mean Square Error, and Mean Absolute Error.

The dataset variables within the US Cars dataset that have the most effect on car price are determined via Correlation Analysis of the dataset’s explanatory variables. Correlation testing provides the ability to select a final set of data categories in the US Cars data that will allow for the greatest effectiveness, accuracy, and precision in Machine Learning. The most effective Machine Learning techniques determined by the analysis will process future US Cars data for car price prediction.

Evaluation of linear models provides the status of the hypothesis of data correlation (whether it is possible to say the data are not correlated at all), the p-value for the estimated effect of a change in one data category on the second examined data category, and the 95% Confidence Interval of the estimate. After correlation testing, the data variables within the US Cars dataset that indicate the greatest correlation to “price” are examined with several Predictive Analytics methods to determine the algorithm to use for Cross-Validated Machine Learning.

Predictive Analytics is a form of Data Science in which Probability Theory (derived from the mathematical discoveries of Bayes, Kolmogorov, and Bernoulli) is applied to data categories to determine causality between data variables. Given the probability of data categories having an effect on other data categories, it is possible to deduce previously unknown information in datasets. The information that was hidden before Predictive Analytics is usable in a wide variety of business and government applications. Predictive Analytics examines data variables as Explanatory (Independent) variables that possibly have an effect on Response (Dependent) variables. Usually several Explanatory variables are processed via Predictive Analytics to discern the unknown state of one Response variable; however, overfitting on Explanatory variables is possible, leading to inaccurate prediction of the unknown state of the examined Response variable.

The choice of a Predictive Analytics algorithm usually depends on the bias and variance characteristics of the data. High bias/low variance classifiers such as Naïve Bayes, Linear Regression, Linear Discriminant Analysis, and Logistic Regression are used for datasets where the independent variables’ values stay within a limited range, yet have a high amount of influence on the dependent variable’s value. Low variance suggests small changes to the target as the training dataset changes. Low bias presumes fewer assumptions about the target value. High variance presumes the target function will change greatly with slight changes in the training data. Examples of low bias/high variance classifiers include Decision Trees, K-Nearest Neighbors, Support Vector Machines, and Random Forest.


Section 2 - Data Preparation

Section 2.1 - Working Directory Specification

The interactive programming of this Machine Learning Solution was accomplished with the RStudio Interactive Development Environment, (IDE). In addition to the basic capabilities of the R programming language, several R language packages of pre-programmed functions are used for the algorithms of this Machine Learning Solution.

The “Set Working Directory” R language basic function is used to set the working directory to the directory with the source files. The function, “setwd()”, sets the filepath as the current working directory of the R environment. The permanence of the filepath varies with different operating systems, and the status of the R language Integrated Development Environment. Then, the “Get Working Directory” function is used to verify that the working directory has been set to the right location.

# setwd("C:/Users/johna/Dropbox/Programming/MachineLearning")

# getwd()


Section 2.2 - Required R Language Packages


The required R programming language packages are installed in this section, and included in the package library. The function, “install.packages()”, downloads and installs R programming language packages from CRAN-like repositories or from local files. The function, “library()”, loads the R language packages into the session library of packages, in order to run the functions within the packages. The R packages included are packages for Linear Regression, Predictive Analytics, Cross-Validation, Model Validation, Machine Learning, data manipulation, and plotting.

The “tidyverse” package performs data manipulation and visualization. The “caret” package computes cross-validation methods. The “kernlab” package provides functions for Support Vector Machines machine learning. The “randomForest” package performs Classification and regression based on a forest of decision trees using random inputs, based on Breiman (2001). The “repr” package creates string and binary representations of objects.

The “e1071” package has functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier, and generalized k-nearest neighbour.

The “pacman” package provides tools to more conveniently perform tasks associated with add-on packages. “pacman” conveniently wraps library and package related functions and names them in an intuitive and consistent fashion. It seeks to combine functionality from lower level functions which can speed up workflow.

The “mgcv” package contains functions for Generalized additive (mixed) models, some of their extensions and other generalized ridge regression with multiple smoothing parameter estimation by (Restricted) Marginal Likelihood, Generalized Cross Validation and similar, or using iterated nested Laplace approximation for fully Bayesian inference. Includes a gam() function, a wide variety of smoothers, ‘JAGS’ support and distributions beyond the exponential family.

The “car” package facilitates the application and interpretation of regression analysis. The “nortest” package contains five omnibus tests for testing the composite hypothesis of normality. The “lmtest” package is a collection of tests, data sets, and examples for diagnostic checking in linear regression models. Furthermore, some generic tools for inference in parametric models are provided.

The “ggplot2” package is an open-source data visualization package for the statistical programming language R. “ggplot2” is an implementation of Grammar of Graphics — a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers.

The “GGally” package extends ‘ggplot2’ by adding several functions to reduce the complexity of combining geometric objects with transformed data. Some of these functions include a pairwise plot matrix, a two group pairwise plot matrix, a parallel coordinates plot, a survival plot, and several functions to plot networks.

The “plotrix” package is intended to provide a method for getting many sorts of specialized plots quickly, yet allow easy customization of those plots without learning a great deal of specialized syntax.

The RMarkdown formatting “knitr” package is a general-purpose literate programming engine, with lightweight APIs designed to give users full control of the output without heavy coding work. It combines many features into one package with slight tweaks motivated by everyday use of Sweave. The “knitr” table function, “kable()”, is used for formatting the document’s tables.
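
As a convenience, the packages described above can be installed (if missing) and loaded in a single step with “pacman”. The following is a minimal sketch; the package list is taken from the descriptions in this section.

# Install pacman if it is not yet available, then install/load the remaining packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, caret, kernlab, randomForest, repr, e1071, mgcv,
               car, nortest, lmtest, ggplot2, GGally, plotrix, knitr)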


Section 2.3 - Session Information

Session information is provided for reproducible Data Science research with the RStudio IDE. The session information below is for reference when running the required packages and R language code.
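
The session details below are typically generated with the base R call shown here.

sessionInfo()   # reports the R version, operating system, and loaded packages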



Section 2.4 - Data Importing

This section is a guide to importing data for Machine Learning projects. The dataset used for this demonstration of Machine Learning techniques is found at: https://www.kaggle.com/doaaalsenani/usa-cers-dataset.

Below are important considerations before importing data for a Machine Learning project, or other forms of Data Science projects.

Selecting Data for Machine Learning

Knowing what you want to algorithmically predict will help you decide which data may be more valuable to collect. When formulating the problem, conduct data exploration in consideration of the following categories: Classification, Clustering, and Regression.

Classification algorithms are configured to answer binary yes or no questions or classify groups of multiple objects, and are supervised with labeling of example right answers.

Clustering algorithms perform a task similar to Classification algorithms. However, Clustering has the ability to group similar data together without supervision by a Data Scientist, in the form of labels, target variables, or target data classes. Thereby, Clustering can reveal new and/or unanticipated groups, relationships, and classifications for inputted data. Clustering has the ability to reveal unknown qualities within data, that is then available for arbitrary classification by the Data Scientist.

Regression algorithms use the relationships between data categories to yield insight into new data and explanations of present or future occurrences. Regression analysis usually relies on numeric data, and is a pre-processing step for Predictive Analytics.


Importing the Dataset

In order to begin processing the dataset that was selected to demonstrate this Machine Learning Solution, “USA_cars_datasets.csv” is imported into the R programming environment with the “read.csv()” function. The “read.csv()” function uses R’s built in functionality to read and manipulate spreadsheets in comma-separated values (CSV) files.

The following code imports the source dataset to be analyzed.

projectData <- read.csv("USA_cars_datasets.csv")


Section 3 - Data Discovery

Section 3.1 - Exploratory Data Analysis

Exploratory Data Analysis is an approach for analyzing datasets to summarize their main characteristics, in order to decide on subsequent Predictive Analytics methods. The quality of the dataset should be examined to determine the usefulness of the available data. Regardless of sophistication, a Machine Learning algorithm is limited by the probabilistic capabilities of the data. If the data you are working with is collected or labeled by humans, reviewing a subset of the data will help with estimating possible mistakes introduced by human error.

The data should also be reviewed for possible omitted values. Usually, omitted values are replaceable with the median value of the entire dataset column. However, the more omitted values there are within the dataset, the less accurate the results of the Machine Learning are expected to be. The dataset chosen for a Machine Learning analysis should be the right type of data for the insights that are needed. If your company is selling electronics in the US and is planning on expanding into Europe, you should try to gather data that can aid in Machine Learning analysis of both markets. The data used for Machine Learning should also have a balance of examples of the needed outputs from the regression and Predictive Analytics algorithms.


Characteristics of the Data

The US Cars dataset examined in this Machine Learning Solution was scraped from AUCTION EXPORT.com, and includes information about 28 brands of clean and used vehicles for sale in the USA. Twelve features, plus a row-index column, were assembled for each car in the dataset:

“X” / Integer / The dataset row numbers

“price” / Integer / The sale price of the vehicle in the ad

“brand” / String / The brand of the car

“model” / String / The model of the vehicle

“year” / Integer / The vehicle registration year

“title_status” / String / A binary classification: clean title vehicles and salvage insurance

“mileage” / Float / Miles traveled by the vehicle

“color” / String / The color of the vehicle

“vin” / String / The vehicle identification number, a collection of 17 characters (digits and capital letters)

“lot” / Integer / A lot number is an identification number assigned to a particular quantity or lot of material from a single manufacturer. For cars, a lot number is combined with a serial number to form the Vehicle Identification Number.

“state” / String / The US state in which the car is available for purchase

“country” / String / The country in which the car is available for purchase

“condition” / String / The time remaining in the listing (e.g., “10 days left”)


The Row and Column Dimensions of the Dataset

In order to estimate the extent of the data available for Machine Learning, the dimensions of the dataset are examined by the dim() function in R. This function returns the total number of rows and columns within the dataframe created from the source .csv file.

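A minimal sketch of the call; the reported dimensions correspond to the dataset structure shown later in this section.

dim(projectData)   # returns the row and column counts, e.g. 2499 13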

The Categories of the Datasets

Examining a list of the dataset’s variable (column) names allows for estimation of the variety of data gathered by the source dataset’s creators. This gives the Data Scientist an idea of which categories are possibly of greater importance than others. It also makes it possible to determine whether additional data categories are required for more insightful analysis.


Sample of Records Imported For Machine Learning

Before beginning the Data Science exploration of a dataset, it is important to view the data within the dataset’s rows and columns. This will quickly show whether the dataset has a lot of missing information, or whether there is flawed formatting within the dataset. Technical problems when transferring data, for example misreading of the dataset during internet transfer, are easy to find by viewing the dataset contents. The table below shows the categories of data in the dataset and the first five rows.


The Structure of the projectData Dataset

The structure of the dataset is examined with the str() function in R, in order to quickly view important basic characteristics of the data, including formatting, (the first line of the below table), data types, (integer, character, numeric, etc.), and examples of the contents of the observations.

## Table 6. Structure of the Dataset
## 'data.frame':    2499 obs. of  13 variables:
##  $ X           : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ price       : int  6300 2899 5350 25000 27700 5700 7300 13350 14600 5250 ...
##  $ brand       : chr  "toyota" "ford" "dodge" "ford" ...
##  $ model       : chr  "cruiser" "se" "mpv" "door" ...
##  $ year        : int  2008 2011 2018 2014 2018 2018 2010 2017 2018 2017 ...
##  $ title_status: chr  "clean vehicle" "clean vehicle" "clean vehicle" "clean vehicle" ...
##  $ mileage     : num  274117 190552 39590 64146 6654 ...
##  $ color       : chr  "black" "silver" "silver" "blue" ...
##  $ vin         : chr  "  jtezu11f88k007763" "  2fmdk3gc4bbb02217" "  3c4pdcgg5jt346413" "  1ftfw1et4efc23745" ...
##  $ lot         : int  159348797 166951262 167655728 167753855 167763266 167655771 167753872 167692494 167763267 167656121 ...
##  $ state       : chr  "new jersey" "tennessee" "georgia" "virginia" ...
##  $ country     : chr  " usa" " usa" " usa" " usa" ...
##  $ condition   : chr  "10 days left" "6 days left" "2 days left" "22 hours left" ...


Statistical Summarization of the dataset

The summary() function in R provides vital statistics on the observations within the dataset. In the case of the US Cars data, statistics on the 12 variables are provided. The range, max, min, and quartiles of numeric variables, the number of entries for character variables, and the class of non-numeric data categories are analyzed for later determination of which regression modeling algorithms to apply.
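
A minimal sketch of the summarization call:

summary(projectData)   # quartiles, mean, min, and max for numeric columns; counts/classes otherwise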



Section 3.2 - Statistical Analysis of the Dataset

In this section of the Exploratory Data Analysis, a precise examination of the dataset’s statistical content is made to determine a strategy to gain the insights needed from the dataset. A set of basic operations is performed to quantitatively describe the data. Univariate examination of single variables, and multivariate examination of groups of variables within the dataset, are performed. Statistical Data Analysis examines continuous data, which is measurable on a scale but not precisely countable, and discrete data, where precise counting of quantities is possible.

Continuous data is distributed under a continuous distribution function, also called a probability density function. Discrete data is distributed under a discrete distribution function, also called a probability mass function. The Poisson distribution is a common probability mass function, and the normal distribution is a common probability density function.


The Standard Deviation of the Target Variable

The sd() function displays the standard deviation of the target (dependent) variable, “price”. This is useful for judging the effectiveness of the resulting predictions from this analysis.
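
A minimal sketch of the call:

sd(projectData$price)   # standard deviation of the target variable, price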


Brand Price Distribution

The following plot displays the distribution of the target variable, “price”, for selection of regression techniques that are the most effective with the type of distribution of the prediction target.



Vehicle Brand Popularity

The distribution of popularity of the different vehicle brands within the dataset can give the Data Science programmer insight into the effectiveness of the Data Science analysis.


Price Difference of New vs Used Vehicles

The effect of the price difference of new vs used vehicles can be judged by visualization of the differing ranges.


Price Distribution

The distribution of the dependent variable, “price”, is displayed to determine a Data Science strategy for regression in relationship to the independent variables.


Quantity Per Model Year

The number of cars for sale per model year allows for determining the effect of model year on predictions of future prices.


Price Per Model Year

The effect of vehicle age on price is displayed to gain insights useful for evaluation of the Machine Learning results.


Age Range of Vehicles

The age range of the vehicles within the dataset is examined with the range() function, because of the high likelihood of its significance in Machine Learning predictions of future prices of new vehicles. It is possible that sub-ranges of vehicle ages could have a detrimental effect on the regression analysis of the data, leading to inaccurate predictions.


Mean Age of Vehicles

The mean age of the vehicles will give the Data Scientist an idea of the age of a significant number of the vehicles in the dataset.


Median Age of Vehicles

The median age of the vehicles will give the Data Scientist an idea of where the statistical center of the dataset exists.


Price Range of Vehicles

Statistical analysis of the price range of the vehicles in the dataset allows for determining algorithmic strategies for predicting future prices, and for estimating the reliability of the output of Machine Learning strategies.


Mean Price of Vehicles

The mean price of the vehicles will give the Data Scientist an idea of the price strata of a significant number of the vehicles in the dataset.


Median Price of Vehicles

The median price of the vehicles will give the Data Scientist an idea of where the statistical center of the target variable exists.

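The statistics reported in the preceding subsections can be reproduced with base R one-liners; the following is a minimal sketch, using the registration year as a proxy for vehicle age.

range(projectData$year)     # age range of vehicles (via model year)
mean(projectData$year)      # mean model year
median(projectData$year)    # median model year
range(projectData$price)    # price range of vehicles
mean(projectData$price)     # mean vehicle price
median(projectData$price)   # median vehicle price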

Table of Title Status per Brand

The effect of the independent variables “title_status” and “brand” can be surmised from the following table of title statuses and brands.
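
One way such a cross-tabulation might be built, using base R’s table() with the kable() formatting used elsewhere in this document (the caption text is illustrative):

kable(table(projectData$brand, projectData$title_status),
      caption = "Title Status per Brand")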


Table of the Total Cars per Country by Brand

The following table demonstrates the lack of data about vehicles in Canada, and demonstrates the relative popularity of vehicle brands in the USA.


Price vs Mileage

Visual examination of the effect of vehicle mileage on the response variable, “price”, is useful for determining the general relationship between these dataset variables.
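
A minimal sketch of such a scatterplot with ggplot2; the title and point transparency are illustrative choices.

ggplot(projectData, aes(x = mileage, y = price)) +
  geom_point(alpha = 0.3) +
  labs(title = "Price vs Mileage", x = "Mileage", y = "Price")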


Plot of Vehicle Colors

The range and quantities of individual colors of vehicles allows for estimation of the significance of individual colors on the target variable, “price”. The below knitr table code demonstrates that 90% of the dataset’s cars have only 6 of the 49 available colors.

# Frequency table of vehicle colors
df <- data.frame(table(projectData$color))
# Sort the colors by descending frequency
df2 <- df[rev(order(df$Freq)),]
# Share of all cars accounted for by the six most frequent colors
popular_color_percentage <- sum(df2$Freq[1:6])/sum(df$Freq)
kable(popular_color_percentage, caption = "Table 17. 90% of Cars have 6 of 49 colors total.")

The univariate discrete-values plot below visualizes the popularity of colors within the dataset.


Test for Normal Distribution of Data

If your distribution is not normal (in other words, if the skewness deviates substantially from 0 or the kurtosis deviates substantially from 3), you should use a non-parametric test such as the chi-square test. You can test whether your data are normally distributed visually (with Q-Q plots and histograms) or statistically (with tests such as D’Agostino-Pearson and Kolmogorov-Smirnov). For regression models, it is the residuals, the deviations between the model predictions and the observed data, that need to be normally distributed.

Shapiro-Wilk Test

The Shapiro-Wilk test is a way to tell if a random sample comes from a normal distribution. The test gives you a W statistic; small values indicate your sample is not normally distributed (you can reject the null hypothesis that your population is normally distributed if the p-value is under a certain threshold). The W statistic is computed as W = (Σ aᵢ x₍ᵢ₎)² / Σ (xᵢ − x̄)², where the x₍ᵢ₎ are the ordered sample values and the aᵢ are constants generated from the means, variances, and covariances of the order statistics of a normally distributed sample of size n. The test has limitations, most importantly a bias by sample size: the larger the sample, the more likely you are to get a statistically significant result. H₀: the data are normally distributed. Small p-values reject H₀.
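
A minimal sketch of how the p-values reported below might be obtained with base R’s shapiro.test():

shapiro.test(projectData$price)$p.value     # p-value for the price distribution
shapiro.test(projectData$mileage)$p.value   # p-value for the mileage distribution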

## Table 19. Shapiro-Wilk Test of Distribution Normality
## Ho: Data is Normally Distributed.
## Shapiro test results for Price distribution =  3.163541e-28 
##     < 0.05 = not normally distributed
## Shapiro test results for Mileage distribution =  3.041625e-61 
##     < 0.05 = not normally distributed

Interpretation of the Shapiro-Wilk Test

The null-hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level, then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed. On the other hand, if the p-value is greater than the chosen alpha level, then the null hypothesis (that the data came from a normally distributed population) can not be rejected (e.g., for an alpha level of .05, a data set with a p-value of less than .05 rejects the null hypothesis that the data are from a normally distributed population). Like most statistical significance tests, if the sample size is sufficiently large this test may detect even trivial departures from the null hypothesis (i.e., although there may be some statistically significant effect, it may be too small to be of any practical significance); thus, additional investigation of the effect size is typically advisable, e.g., a Q–Q plot in this case.

Skew of the Distributions

The skewness of a data population is defined by the following formula, where μ2 and μ3 are the second and third central moments.

γ1 = μ3 / μ2^(3/2)

Intuitively, the skewness is a measure of the asymmetry of a distribution. As a rule, negative skewness indicates that the mean of the data values is less than the median, and the data distribution is left-skewed. Positive skewness indicates that the mean of the data values is larger than the median, and the data distribution is right-skewed. As a rule of thumb, an absolute skewness greater than roughly 1.0 indicates a highly skewed distribution.
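
A minimal sketch of how the skewness values reported below might be computed, assuming the skewness() function from the already-loaded “e1071” package:

skewness(projectData$price)     # a positive value indicates a right-skewed price distribution
skewness(projectData$mileage)   # a much larger value indicates heavy right skew in mileage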

## Skew of the price distribution =  0.922 is skewed to the right.
## Skew of the mileage distribution =  7.071 is highly skewed to the right.

Q-Q Plots of Normality - Checking Whether the Residuals are Normally Distributed

The Q-Q plot, or quantile-quantile plot, is a graphical tool to help assess if a set of data plausibly came from some theoretical distribution such as a Normal or Exponential. For example, if we run a statistical analysis that assumes our dependent variable is Normally distributed, we can use a Normal Q-Q plot to check that assumption. It’s just a visual check, and somewhat subjective. Q-Q Plots show at-a-glance if Normality is plausible, how the assumption is violated, and what data points contribute to the violation. A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight.
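
A minimal sketch of a base R Normal Q-Q plot for the target variable; the title and reference-line color are illustrative.

qqnorm(projectData$price, main = "Normal Q-Q Plot of price")
qqline(projectData$price, col = "red")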


Heteroskedasticity

Heteroskedasticity refers to the variability of data points being unequal across the range of values of a second variable that predicts it. In case of homoskedasticity, the data points are equally scattered while in case of heteroskedasticity the data points are not equally scattered. The existence of heteroskedasticity is a major concern in regression analysis and the analysis of variance, as it invalidates statistical tests of significance that assume that the modelling errors all have the same variance. While the ordinary least squares estimator is still unbiased in the presence of heteroskedasticity, it is inefficient and generalized least squares should be used instead.

Because heteroskedasticity concerns expectations of the second moment of the errors, its presence is referred to as misspecification of the second order. The econometrician Robert Engle won the 2003 Nobel Memorial Prize for Economics for his studies on regression analysis in the presence of heteroskedasticity, which led to his formulation of the autoregressive conditional heteroskedasticity (ARCH) modeling technique.

Linear Model to Test for Heteroskedasticity

## Figures 13 - 16. Linear Model to Test for Heteroskedasticity

The plots we are interested in are at the top-left and bottom-left. The top-left is the chart of residuals vs fitted values, while the bottom-left one has standardized residuals on the Y axis. If there is absolutely no heteroscedasticity, you should see a completely random, equal distribution of points throughout the range of the X axis and a flat red line.
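
A minimal sketch of how the diagnostic plots and the score test below might be produced, assuming a simple linear model of price on mileage and the ncvTest() function from the already-loaded “car” package:

lm_het <- lm(price ~ mileage, data = projectData)   # hypothetical model for the diagnostics
par(mfrow = c(2, 2))
plot(lm_het)      # residuals vs fitted, Q-Q, scale-location, and leverage plots
ncvTest(lm_het)   # score test of non-constant error variance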

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 2995.627, Df = 1, p = < 2.22e-16
## p = < 2.22e-16 < 0.05 = Reject the Null Hypothesis of Homoskedasticity = Heteroskedasticity is present



Section 4 - Feature Engineering

Section 4.1 - Outliers, Missing Values, Conversion

This section examines the dependent variable for outliers, and identifies, visualizes, and removes outliers from the dataset. Statisticians often come across outliers when working with datasets, and it is important to deal with them because of how significantly they can distort a statistical model. Your dataset may have values that are distinguishably different from most other values; these are referred to as outliers. Usually, an outlier is an anomaly that occurs due to measurement errors, but in other cases it can occur because the experiment being observed experiences momentary but drastic turbulence. In either case, it is important to deal with outliers because they can adversely impact the accuracy of your results, especially in regression models.

Most statistical parameters such as mean, standard deviation and correlation are highly sensitive to outliers. Consequently, any statistical calculation based on these parameters is affected by the presence of outliers. Whether it is good or bad to remove outliers from your dataset depends on whether they affect your model positively or negatively. Remember that outliers aren’t always the result of badly recorded observations or poorly conducted experiments. They may also occur due to natural fluctuations in the experiment and might even represent an important finding of the experiment. Whether you’re going to drop or keep the outliers requires some amount of investigation. However, it is not recommended to drop an observation simply because it appears to be an outlier. Statisticians have devised several ways to locate the outliers in a dataset. The most common methods include the Z-score method and the Interquartile Range (IQR) method. The IQR method does not depend on the mean and standard deviation of a dataset.

The interquartile range is the central 50% or the area between the 75th and the 25th percentile of a distribution. A point is an outlier if it is above the 75th or below the 25th percentile by a factor of 1.5 times the IQR.

For example, if Q1 = 25th percentile, Q3 = 75th percentile, then,

IQR = Q3 – Q1

And an outlier would be a point below

Q1 − (1.5)IQR

or above

Q3 + (1.5)IQR
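
A minimal sketch of this rule in base R; the variable names are illustrative.

Q1 <- quantile(projectData$price, 0.25)
Q3 <- quantile(projectData$price, 0.75)
iqr <- Q3 - Q1
price_outliers <- projectData$price < (Q1 - 1.5 * iqr) |
                  projectData$price > (Q3 + 1.5 * iqr)
sum(price_outliers)   # number of price values flagged as outliers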


Visualizing Outliers

One of the easiest ways to identify outliers is by visualizing them in boxplots. Boxplots typically show the median of a dataset along with the first and third quartiles. They also show the limits beyond which all data values are considered as outliers. It is interesting to note that the primary purpose of a boxplot, given the information it displays, is to help you visualize the outliers in a dataset. Your dataset may have thousands or even more observations and it is important to have a numerical cut-off that differentiates an outlier from a non-outlier. This allows you to work with any dataset regardless of how big it may be.

Create a boxplot to identify outliers:
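
A minimal sketch of such a boxplot; the title is illustrative.

boxplot(projectData$price,
        main = "Boxplot of the Target Variable, price")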


Eliminating Outliers

The following code saves the outliers in a vector; then a boxplot without outliers can be visualized.

# Save the outlier values identified by the boxplot statistics
outliers <- boxplot(projectData$price, plot=FALSE)$out

# Remove the rows whose price is among the outlier values
projectData <- projectData[-which(projectData$price %in% outliers),]

boxplot(projectData$price,
        main = "Figure 18. Boxplot of Removed Target Variable Outliers")

Examination of the Dataset for Missing Values

cat('It is ', any(is.na(projectData)),
    ' that projectData has Missing Values', '.', sep='')
## It is FALSE that projectData has Missing Values.

If missing values exist, then find the observations with missing values

# NAcol <- which(colSums(is.na(projectData)) > 0)
# cat('There are ',length(NAcol), ' columns with missing values', '.',
# sep='')

The numerical missing values would then be replaced with imputed values, for example the mean (as in the commented code below) or the median of the data column

# projectData <- impute_mean_if(projectData, is.numeric)

Verify that Missing Values have been removed

# cat('It is ', any(is.na(projectData)),
#     ' that projectData has missing values', '.', sep='')


Data Conversion

Data types within datasets occasionally require transformation for statistical processing. It is possible to order or unorder categorical data via conversion into factors. Factors can be ordered or unordered, and are an important class for statistical analysis and for plotting. Factors are stored as integers, and have labels associated with these unique integers. Factors look like character vectors; however, they are processed as integers, and caution is needed when treating them like strings. Factors can only contain a pre-defined set of values, known as levels. The following code converts the categorical dataset columns from the character class to the factor class.

projectData$brand <- factor(projectData$brand)
projectData$model <- factor(projectData$model)
projectData$title_status <- factor(projectData$title_status)
projectData$color <- factor(projectData$color)
projectData$vin <- factor(projectData$vin)
projectData$state <- factor(projectData$state)
projectData$country <- factor(projectData$country)
projectData$condition <- factor(projectData$condition)


Section 4.2 - Feature Selection

Feature Selection is the process of selecting the most significant features from a given dataset. The most significant features are independent (or “explanatory”) variables that have a high correlation with the dependent (or “response”) variable. Feature Selection allows for more accurate Linear Modeling, Predictive Analytics, and Machine Learning. The table below has the output of high correlation variables from the US Cars dataset.

# 1) The numeric variables are subsetted from the dataset.
numericVars <- which(sapply(projectData, is.numeric)) 

# 2) A dataframe of numeric columns is created.
all_numVar <- projectData[, numericVars]

# 3) The numeric variables are correlated.
cor_numVar <- cor(all_numVar, use="pairwise.complete.obs")

# 4) The correlations with price are sorted in decreasing order.
cor_sorted <- as.matrix(sort(cor_numVar[,'price'], decreasing=TRUE))

# 5) The numeric variables with high correlation are subsetted.
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.1)))

kable(CorHigh, caption = "Table 20. High Correlation Variables for Linear Modeling")

Correlation Plotting Visualizes the Degrees of Independent Variable Correlation
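
One way such a correlation plot might be drawn, assuming the ggcorr() function from the already-loaded “GGally” package and the all_numVar data frame created above:

ggcorr(all_numVar, label = TRUE, label_round = 2)   # pairwise correlation heatmap with labels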


Negative Correlation

## The Independent Variable, Year, has a high negative correlation with the Independent Variable, Mileage, of  -0.593364

Positive Correlation

## The Dependent Variable, Price, has a positive correlation with the Independent Variable, Year, of  0.4450838

Stepwise Regression

Stepwise Regression failed to select high correlation variables from the US Cars dataset because of an “essentially perfect fit” of the dataset’s variables.

# reduced.model <- step(projectData, direction = "backward")  
# 
# Warning messages:
# 1: attempting model selection on an essentially perfect fit is nonsense  
# 
# summary(reduced.model)  
# plot(reduced.model)  
# ModelComparison_AIC <- AIC(lmMod, reduced.model)  
# print(ModelComparison_AIC)  
# ModelComparison_BIC <- BIC(lmMod, reduced.model)  
# print(ModelComparison_BIC)  


Section 4.3 - Cross-Validation

Cross-Validation of Machine Learning Models

Machine Learning uses linear regression to identify patterns in data and predict future values. Cross-Validation divides the dataset into data for the Machine Learning algorithm to learn from (training data) and data that the Machine Learning algorithm is required to predict (testing data). This section prepares an independently sampled training dataset and test dataset. Using the same data to train and test a machine learning model creates information leakage between the training and test processes, causing the model to not generalize well. A model that generalizes well produces consistent results when presented with new cases. The random sub-samples of the data can be created using a process called Bernoulli sampling, which accepts a case into a subset with probability p. In this case, the probability that a given case is in the training dataset is p, and the probability that a case is in the test dataset is 1 − p. The code below uses the caret function createDataPartition() to create a set of indices for splitting the dataset; the index and the negative of the index are then used to subset the original data frame into training and test rows.

Setting a seed number to generate a reproducible random sampling

set.seed(123)

Creating training data as 80% of the dataset

dataSubset <- createDataPartition(projectData$price,p=0.8,list=F)

Select the training rows

train  <- projectData[dataSubset, ] 

Select the test rows

test <- projectData[-dataSubset, ] 

Dimensions of the Training and Testing Sets

## The dimensions of the Training Set are: 1951 13.
## The dimensions of the Testing Set are: 484 13.

Verify that the Cross-Validation Train and Test datasets are an accurate partition of the source dataset

## It is TRUE that the crossValidated Train and Test datasets 
##     are an accurate partition of the source dataset.

First Five Rows of Train and Test datasets



Section 4.4 - Feature Scaling (Data Normalization)

Feature Scaling is a data preprocessing step where we adjust the scales of the features to have a standard scale of measure. Some machine learning algorithms are sensitive to feature scaling, while others are virtually invariant to it. Feature scaling is one of the most critical steps during the pre-processing of data before creating a machine learning model, and can make the difference between a weak machine learning model and a better one. Machine learning algorithms such as linear regression, logistic regression, and neural networks that use gradient descent as an optimization technique require data to be scaled.

The difference in ranges of features will cause different gradient descent step sizes for each feature. Feature Scaling ensures that the gradient descent moves smoothly towards the minima and that the steps for gradient descent are updated at the same rate for all the features. Distance algorithms like K-NN, K-means, and SVM are most affected by the range of features. This is because behind the scenes they are using distances between data points to determine their similarity. Feature Scaling ensures all the features contribute equally to the result of Euclidean distance-based processing. Tree-based algorithms are fairly insensitive to the scale of the features. Decision Trees branch off nodes based on a single feature to increase the homogeneity of the node. This split on a feature is not influenced by remaining features.


Min-Max Scaling, Min-Max Normalization

In Min-Max normalization, data is scaled to lie in the range [0, 1]:

x′i = (xi − Min(X)) / (Max(X) − Min(X))

where, xi is the ith sample value, Min(X) is the minimum value of all samples, Max(X) is the maximum value of all samples. Min-Max normalization is a good choice for cases where the value being scaled has a complex distribution. A variable with a distribution with multiple modes might be a good candidate for Min-Max normalization. However, the presence of a few outliers can distort the result by giving unrepresentative values of Min(X) or Max(X).
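
A minimal sketch of Min-Max normalization in base R; the helper function name is illustrative.

min_max_scale <- function(x) (x - min(x)) / (max(x) - min(x))
mileage_01 <- min_max_scale(projectData$mileage)   # rescaled copy of the mileage column
range(mileage_01)                                  # 0 and 1 after scaling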

Mean normalization

x′ = (x − mean(X)) / (Max(X) − Min(X))

where x is an original value and x′ is the normalized value. Another form of mean normalization divides by the standard deviation instead of the range; this is also called standardization.

Z-Score Normalization

Z-Score normalization transforms a variable so that it has zero mean and unit standard deviation (or variance). Z-Score normalization is performed using the following formula:

z = (x − μ) / σ

where μ is the mean of the variable x, and σ is the standard deviation of the variable x.

Normalize or Standardize?

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling. Max(X) and Min(X) are the maximum and the minimum values of a feature. When the value of x is the minimum value in the column, the numerator will be 0, and hence x′ is 0. When the value of x is the maximum value in the column, the numerator is equal to the denominator and thus the value of x′ is 1. If the value of x is between the minimum and the maximum value, then the value of x′ is between 0 and 1.

Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. The mean of the attribute becomes zero and the resultant distribution has a unit standard deviation. μ is the mean of the feature values and σ is the standard deviation of the feature values.

Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution. This can be useful in algorithms that do not assume any distribution of the data like K-Nearest Neighbors and Neural Networks. Standardization is generally good to use when the data follows a Gaussian distribution. However, if you have outliers in your data, they will not be affected by standardization. The choice of using normalization or standardization depends on your problem and the machine learning algorithm you are using. Fitting your model to raw, normalized and standardized data allows for performance comparison for best results. It is a good practice to fit the scaler on the training data and then use it to transform the testing data. This would avoid any data leakage during the model testing process. The scaling of target values is generally not required. Bypassing Feature Scaling leads to bias that influences the prediction accuracy.

The following code computes the center (mean) and scale (standard deviation) of the feature. These scale parameters are then applied to the feature in the train and test data frame.

First, subset Train and Test datasets into high correlation features

# Keep price (col 2), year (col 5), mileage (col 7), and lot (col 10)
train <- train[,c(2,5,7,10)]
test <- test[,c(2,5,7,10)]

Print first row of subsetted Train and Test datasets


Perform Feature Scaling to Normalize the Train/Test datasets

# Scale (z-score) the training predictors, leaving the price column unscaled
train_scale_dependent <- data.frame(scale(train[,-1]))
train <- cbind(train$price, train_scale_dependent)
names(train) <- c("price", "year", "mileage", "lot")

# Scale the test predictors the same way (for strict leakage avoidance, the training
# means and standard deviations could instead be applied to the test set)
test_scale_dependent <- data.frame(scale(test[,-1]))
test <- cbind(test$price, test_scale_dependent)
names(test) <- c("price", "year", "mileage", "lot")

Print the first five rows of the normalized Train and Test datasets

kable(head(train, 5), caption = "Table 25. First Five Rows of Train Dataset")
kable(head(test, 5), caption = "Table 26. First Five Rows of Test Dataset")


Section 5 - Model Development

Section 5.1 - Linear Regression Modeling

Overview of regression

The method of regression is one of the oldest and most widely used analytics methods. The goal of regression is to produce a model that represents the “best fit” to some observed data. Typically the model is a function describing some type of curve (lines, parabolas, etc.) that is determined by a set of parameters (e.g., slope and intercept). “Best fit” means that there is an optimal set of parameters according to a chosen evaluation criterion. A regression model attempts to predict the value of one variable, known as the dependent variable, response variable, or label, using the values of other variables, known as independent variables, explanatory variables, or features. Simple regression uses one feature to predict one label. Multiple regression uses two or more features to predict the label. In mathematical form, the goal of regression is to find a function of some features X which predicts the label value y. This function can be written as follows:

ŷ = f(X)

The challenge in regression is to learn the function f(X) so that the predictions ŷ are accurate. In other words, we train the model to minimize the difference between our predicted ŷ and the known label values y. In fact, the entire field of supervised learning has this goal. Many machine learning models, including some of the latest deep learning methods, are a form of regression. These methods often suffer from the same problems, including over-fitting and mathematically unstable fitting methods.

Overview of linear regression

Linear regression is a foundational form of regression. The simplest case of linear regression is known as simple regression, since there is a single feature. The function f(X) is linear in the model coefficients. For a single feature vector x, the linear regression equation is written as follows:

ŷ = a·x + b

The model coefficients are a, which we call the slope, and b, which we call the intercept. Notice that this is just the equation of a straight line for one variable. But what are the best values of a and b? In linear regression, a and b are chosen to minimize the squared error between the predictions and the known labels. This quantity is known as the sum of squared errors, or SSE. For n training cases, the SSE is computed as follows:

SSE = Σᵢ (yᵢ − ŷᵢ)², with the sum taken over the n training cases

The approach to regression that minimizes the SSE is known as the method of least squares.


Train the regression model

In R, models are defined by a formula using the ~ (tilde) symbol to mean “modeled by”. In summary, the variable to be modeled is always on the left, and the features are listed on the right. This basic scheme can be written as shown here.

label ~ features

For example, if the dependent variable (dv) is modeled by two features (f1 and f2), with no interaction, the formula would be: dv ~ f1 + f2

In this analysis, the label price is modeled by the remaining features in the training data, so the formula price ~ . in the code below represents this model.


Fit the Linear Regression Models

Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (or outcome variable) and one or more independent variables (predictors, covariates, or features). Linear regression is the most common form of regression analysis, in which one finds a linear combination that most closely fits the data. This allows the researcher to estimate the conditional expectation, (population average value), of the dependent variable when the independent variables take on a given set of values.

A linear regression produces a Coefficients table, where the first row gives the estimate of the y-intercept and the following rows give the regression coefficients of the model. Row 1 of the table is labeled “(Intercept)”; this is the y-intercept of the regression equation. The Estimate column is the estimated effect, also called the regression coefficient. The Std. Error column displays the standard error of the estimate. The t value column displays the test statistic. Unless you specify otherwise, the test statistic used in linear regression is the t-value from a two-sided t-test. The larger the test statistic, the less likely it is that the results occurred by chance. The Pr(>|t|) column shows the p-value. This number tells us how likely we are to see the estimated effect of an independent variable on the dependent variable if the null hypothesis of no effect were true.

Set the Seed for reproducibility of results

“set.seed()” sets an initial seed for R’s random number generator, thereby assuring reproducibility of regression algorithm results.

set.seed(5678)

Fit the Basic Linear Regression Model

Simple linear regression is a parametric test used to estimate the relationship between two quantitative variables, and assumes Homogeneity of variance (or homoscedasticity): the size of the error in our prediction doesn’t change significantly across the values of the independent variable, no hidden relationships among observations, and the data follows a normal distribution.

model_LN <- lm(price ~ ., data = train)

kable(summary(model_LN)$coefficients, caption = "Table 27. Basic Linear Modeling Coefficients")
kable(confint(model_LN, level=0.95), caption = "Table 28. 95% Confidence Interval of Basic Linear Model")
cat("Figures 22 - 25. Basic Linear Modeling of US Cars Data")
## Figures 22 - 25. Basic Linear Modeling of US Cars Data
par(mfrow=c(2,2))
plot(model_LN)

Fit the Naïve Bayes Model

Naïve Bayes modelling utilizes prior knowledge about a series of probabilities, versus a Gaussian approach that applies the same probability for each iteration of a probability series. Bayes’ Theorem is stated as:

P(h|d) = (P(d|h) × P(h)) / P(d)

Where P(h|d) is the probability of hypothesis h given the data d. This is called the posterior probability. P(d|h) is the probability of data d given that the hypothesis h was true. P(h) is the probability of hypothesis h being true (regardless of the data). This is called the prior probability of h. P(d) is the probability of the data (regardless of the hypothesis). After calculating the posterior probability for a number of different hypotheses, you can select the hypothesis with the highest probability. This is the maximum probable hypothesis and may formally be called the maximum a posteriori (MAP) hypothesis. This can be written as:

MAP(h) = max(P(h|d))

or

MAP(h) = max((P(d|h) × P(h)) / P(d))

or

MAP(h) = max(P(d|h) × P(h))

Naïve Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values. It is called Naïve Bayes because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of the attribute values, P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, so the probability is calculated as P(d1|h) × P(d2|h) and so on. No coefficients need to be fitted by optimization procedures. The class probabilities are simply the frequency of instances that belong to each class divided by the total number of instances. In the simplest case, each class would have a probability of 0.5, or 50%, for a binary classification problem with the same number of instances in each class.
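As a minimal sketch of the MAP rule with made-up numbers (the priors and likelihoods below are purely hypothetical), the posterior for each hypothesis can be computed directly in R:

# Two hypotheses with assumed prior probabilities P(h)
prior      <- c(h1 = 0.6, h2 = 0.4)
# Assumed likelihoods P(d|h) of the observed data under each hypothesis
likelihood <- c(h1 = 0.2, h2 = 0.5)
# The posterior is proportional to P(d|h) * P(h); dividing by P(d) normalizes it
posterior  <- (likelihood * prior) / sum(likelihood * prior)
# The MAP hypothesis is the one with the highest posterior probability
names(which.max(posterior))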

model_NB <- naiveBayes(price ~ . , data = train)
summary(model_NB)
##           Length Class  Mode     
## apriori   676    table  numeric  
## tables      3    -none- list     
## levels    676    -none- character
## isnumeric   3    -none- logical  
## call        4    -none- call


Fit the GAM Model

A Generalized Additive Model (GAM) is a generalized linear model in which the response variable depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions. Despite its lack of popularity in the data science community, GAM is a powerful yet simple technique. GAM’s flexible predictor functions can uncover hidden patterns in the data, and regularization of the predictor functions helps avoid overfitting.

In general, GAM has the interpretability advantages of GLMs, where the contribution of each independent variable to the prediction is clearly encoded. However, it is substantially more flexible because the relationships between the independent and dependent variables are not assumed to be linear. In fact, we do not have to know a priori what type of predictive functions we will eventually need. From an estimation standpoint, the use of regularized, nonparametric functions avoids the pitfalls of dealing with higher-order polynomial terms in linear models. From an accuracy standpoint, GAMs are competitive with popular learning techniques.
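The GAM fit below uses linear terms only. As a hedged variant (assuming the mgcv implementation of gam() loaded earlier, and that year has enough distinct values for the chosen basis dimension), penalized smooth terms can be requested with s():

# Sketch only: smooth terms let the effects of year and mileage be nonlinear;
# k caps the basis dimension for year, which has relatively few unique values.
model_GAM_smooth <- gam(price ~ s(year, k = 5) + s(mileage) + lot, data = train)
summary(model_GAM_smooth)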

model_GAM <- gam(price ~ year +  mileage + lot, data = train)

summary(model_GAM)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## price ~ year + mileage + lot
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17773.4      210.6  84.375  < 2e-16 ***
## year          2992.7      266.2  11.241  < 2e-16 ***
## mileage      -2498.1      264.9  -9.431  < 2e-16 ***
## lot            781.7      213.7   3.657 0.000262 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## R-sq.(adj) =  0.231   Deviance explained = 23.2%
## GCV = 8.6748e+07  Scale est. = 8.6571e+07  n = 1951
summary(model_GAM)$p.coeff
## (Intercept)        year     mileage         lot 
##  17773.4070   2992.7305  -2498.1124    781.6569


Fit the K-Nearest Neighbor Model

K-Nearest Neighbor is one of the simplest Machine Learning algorithms based on the Supervised Learning technique. The K-NN algorithm assumes similarity between the new case and the available cases, and puts the new case into the category that is most similar to the available categories. Nearest-neighbor-based classifiers use some or all of the patterns available in the training set to classify a test pattern. K-Nearest Neighbor is a non-parametric method used for both classification and regression. In both cases, the input consists of the k closest training examples in the dataset. The output depends on whether K-NN is used for classification or regression.

In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k=1, then the object is simply assigned to the class of that single nearest neighbor.

In k-NN regression, the output is the property value for the object. This value is the average of the values of the k nearest neighbors. k-NN is a type of instance-based (lazy) learning, in which the function is only approximated locally and all computation is deferred until function evaluation. Since this algorithm relies on distance, if the features represent different physical units or come in vastly different scales, then normalizing the training data can improve its accuracy dramatically.

Both for classification and regression, a useful technique can be to assign weights to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. K-NN algorithms are sensitive to the local structure of the data.
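As a hedged sketch (using the same caret train() interface as the fit below; the grid and fold count are assumptions), the number of neighbors k can be tuned over an explicit grid with resampling instead of accepting the default search:

library(caret)

# Sketch only: 5-fold cross-validation over odd values of k from 3 to 25
knn_grid <- data.frame(k = seq(3, 25, by = 2))
model_KNN_tuned <- train(price ~ ., data = train, method = "knn",
                         tuneGrid = knn_grid,
                         trControl = trainControl(method = "cv", number = 5))
model_KNN_tuned$bestTune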

model_KNN <- train(price ~ ., data = train, method = "knn")

summary(model_KNN)
##             Length Class      Mode     
## learn       2      -none-     list     
## k           1      -none-     numeric  
## theDots     0      -none-     list     
## xNames      3      -none-     character
## problemType 1      -none-     character
## tuneValue   1      data.frame list     
## obsLevels   1      -none-     logical  
## param       0      -none-     list
plot(model_KNN, main = "Figure 26. Summary of KNN Model")
No alt text provided for this image


Fit the Support Vector Machines Model

Support Vector Machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. SVMs are based on statistical learning frameworks and are among the most robust prediction methods. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. Support Vector Machines usually demonstrate high accuracy, resistance to overfitting, and good performance on high-dimensional datasets.

A support-vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks such as outlier detection. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.
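As a hedged sketch (using the kernlab ksvm() function called below; the kernel and cost value are assumptions), the kernel and the cost parameter C, which governs the margin/error trade-off, can be set explicitly:

library(kernlab)

# Sketch only: radial basis function kernel with an assumed cost parameter C = 10
model_SVM_rbf <- ksvm(price ~ ., data = train, kernel = "rbfdot", C = 10)
model_SVM_rbf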

# Support Vector Machine model formula
form <- as.formula(price ~ .)

model_SVM <- ksvm(form, train)

# Centering and scaling values applied to the response variable
model_SVM@scaling$y.scale
## $`scaled:center`
## [1] 17773.41
## 
## $`scaled:scale`
## [1] 10609.7


Fit the Decision Trees Model

A Decision Tree is a supervised learning predictive model that uses a set of binary rules to calculate a target value. It is used for either classification (categorical target variable) or regression (continuous target variable), and is therefore also known as CART (Classification & Regression Trees). The decision tree algorithm works by repeatedly partitioning the data into multiple sub-spaces so that the outcomes in each final sub-space are as homogeneous as possible. This approach is technically called recursive partitioning. The result is a set of rules used for predicting the outcome variable, which can be either a continuous variable (regression trees) or a categorical variable (classification trees). The decision rules generated by the CART predictive model are generally visualized as a binary tree.

Advantages of Decision Trees:

It is quite interpretable and easy to understand.

It can also be used to identify the most significant variables in your dataset.

Disadvantages of Decision Trees:

Decision Trees are limited by an inability to continue learning after the initial fit, and by possible overfitting of the data, which makes them poorly adaptable to new data.
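One common way to limit the overfitting noted above is to control the complexity parameter and prune the tree after fitting. A hedged sketch (the cp values are assumptions):

library(rpart)

# Sketch only: grow a deliberately deep tree, inspect the cross-validated
# error for each complexity parameter value, then prune back.
tree_full   <- rpart(price ~ ., data = train,
                     control = rpart.control(cp = 0.001))
printcp(tree_full)                         # cross-validated error by cp
tree_pruned <- prune(tree_full, cp = 0.01) # assumed cp chosen from the table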

model_DT <- rpart(price ~ ., data = train)

prp(model_DT, main = "Figure 27. Summary of Decision Trees Model")


Fit the Random Forest Model

A Random Forest averages the predictions of an ensemble of decision trees that regress the independent variables against the dependent variable. For classification, each individual tree in the random forest produces a class prediction and the class with the most votes becomes the model’s prediction; for regression, the individual tree predictions are averaged. The Random Forest algorithm works well with a variety of classes, numeric and categorical, within datasets. Random Forest is also probability-based, thereby compensating for the distance-based scheme of Support Vector Machines.

model_RF <- randomForest(price ~ ., data = train, ntree = 100)

summary(model_RF)
##                 Length Class  Mode     
## call               4   -none- call     
## type               1   -none- character
## predicted       1951   -none- numeric  
## mse              100   -none- numeric  
## rsq              100   -none- numeric  
## oob.times       1951   -none- numeric  
## importance         3   -none- numeric  
## importanceSD       0   -none- NULL     
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            11   -none- list     
## coefs              0   -none- NULL     
## y               1951   -none- numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call
par(mfrow = c(1,1))
plot(model_RF, main = "Figure 28. Summary of Random Forest Model")
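As a hedged follow-up (not part of the original output), the contribution of each predictor to the fitted forest can be inspected with the randomForest package’s importance utilities:

# Sketch only: variable importance of the fitted random forest
importance(model_RF)   # increase in node purity attributable to each variable
varImpPlot(model_RF)   # the same information as a dot chart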

Fit the Gradient Boosting Model

Gradient Boosting builds an ensemble of weak prediction models, typically shallow decision trees, to boost the predictive accuracy of a single model. It builds the model in a stage-wise fashion and generalizes by allowing optimization of an arbitrary differentiable loss function. Gradient Boosting optimizes a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction.

Advantages of Gradient Boosting:

Often achieves higher predictive accuracy than other regression techniques. Allows optimization of different loss functions.

Improved fitting with hyperparameter tuning options.

Fits data with imputation of missing values.

Disadvantages of Gradient Boosting:

Possible overfitting from repeated iteration of the boosting algorithm.

Computation- and memory-intensive due to the repeated fitting of weak learners.
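A hedged way to guard against the overfitting noted above (using the same gbm call as the fit that follows; the fold count is an assumption) is to use gbm’s built-in cross-validation and let gbm.perf() estimate the best number of trees:

# Sketch only: 5-fold cross-validation inside gbm, then select the iteration
# count that minimizes the cross-validated error.
gbm_cv     <- gbm(price ~ ., data = train, distribution = "tdist",
                  n.trees = 1000, interaction.depth = 11, shrinkage = 0.02,
                  cv.folds = 5)
best_trees <- gbm.perf(gbm_cv, method = "cv")
best_trees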

model_GBM <- gbm(price ~ ., data = train, distribution = "tdist",
                 n.trees = 1000, interaction.depth = 11, shrinkage = 0.02)

summary(model_GBM, main ="Figure 29. Summary of Gradient Boosting Model")
##             var   rel.inf
## lot         lot 60.140946
## mileage mileage 34.190574
## year       year  5.668481



Section 5.2 - Evaluation of Model Assumptions

This section contains evaluations of the regression model assumptions, and strategies on what to do in the event of violations of those assumptions.

Anderson-Darling Test of Goodness of Fit

The Anderson-Darling Goodness of Fit Test (AD-Test) is a measure of how well your data fits a specified distribution. It is commonly used as a test for normality. In this section of the Machine Learning pipeline, the Anderson-Darling test is used to check whether the fitted values of the Regression Models follow the same distribution as the US Cars data.

The hypotheses for the AD-test are:

H0: The data comes from a specified distribution.

H1: The data does not come from a specified distribution.

The general steps to calculate the Anderson-Darling Statistic are:

Step 1: Calculate the AD Statistic for each distribution:

AD = -n - (1/n) ∑ (2i - 1) [ln F(x_i) + ln(1 - F(x_(n+1-i)))]

where:

n = the sample size,

F(x) = CDF for the specified distribution,

i = the ith sample, calculated when the data is sorted in ascending order.

Step 2: Find the statistic’s p-value (probability value). The formula for the p-value depends on the value of the AD statistic from Step 1. The following formulas are taken from D’Agostino and Stephens’ Goodness-of-Fit Techniques.

AD statistic P-Value Formula:

If AD ≥ 0.60, then p = exp(1.2937 - 5.709(AD) + 0.0186(AD)²)

If 0.34 < AD < 0.60, then p = exp(0.9177 - 4.279(AD) - 1.38(AD)²)

If 0.20 < AD ≤ 0.34, then p = 1 - exp(-8.318 + 42.796(AD) - 59.938(AD)²)

If AD ≤ 0.20, then p = 1 - exp(-13.436 + 101.14(AD) - 223.73(AD)²)

Small p-values (less than your chosen alpha level) mean that you can reject the null hypothesis; in other words, the data does not come from the named distribution. If you are comparing several distributions, choose the one that gives the largest p-value; this is the closest match to your data.
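As a hedged illustration (the nortest package and the choice of residuals are assumptions; this is not necessarily the call used to produce the results below), a one-sample Anderson-Darling test of normality can be applied to a model’s residuals:

library(nortest)

# Sketch only: test whether the basic linear model's residuals are plausibly
# normal; a small p-value rejects the normality hypothesis.
ad.test(residuals(model_LN))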

[Anderson-Darling test results for the fitted models were shown here as an image; not recovered.]


Section 5.3 - Predictive Analytics

Predictive Analytics is a group of algorithmic statistical methods that create predictions of future data values from regressions of historical data.

Basic Linear Modeling Predictions

pred_LN <- predict(model_LN, newdata = test)  

Naïve Bayes Predictions

pred_NB <- predict(model_NB, newdata = test)

GAM Predictions

pred_GAM <- predict(model_GAM, newdata = test)

K-Nearest Neighbor Predictions

pred_KNN <- predict(model_KNN, newdata = test)

Support Vector Machines Predictions

pred_SVM <- predict(model_SVM, newdata = test)

Decision Trees Predictions

pred_DT <- predict(model_DT, newdata = test)

Random Forest Predictions

pred_RF <- predict(model_RF, newdata = test)

Gradient Boosting Predictions

pred_GBM <- predict(model_GBM, newdata = test)


Section 6 - Model Validation

Section 6.1 - Accuracy of the Predictions

There are many possible metrics used for the evaluation of regression models. Generally, these metrics are functions of the residual value, or difference between the predicted value or score and actual label value:

r_i = f(x_i) - y_i = ŷ_i - y_i

The Performance Measurement Metrics used for this Machine Learning Solution are R-Squared, Root Mean Square Error, and Mean Absolute Error.

R-squared (R2), also known as the coefficient of determination, is defined as R2 = 1 - SSres / SStot, where SSres = ∑ r_i², the sum of the squared residuals, and SStot = ∑ y_i², the sum of the squared label values. R2 measures the reduction in the sum of squares between the raw label values and the residuals. If the model has not reduced the sum of squares of the labels, R2 = 0 and the model is incapable of accurate predictions. If all r_i = 0, the model fits the data perfectly and R2 = 1.

Adjusted R-squared (R2adj) is R2 adjusted for the degrees of freedom in the model: R2adj = 1 - var(r) / var(y) = 1 - [SSres / (n - p - 1)] / [SStot / (n - 1)], where var(r) is the variance of the residuals, var(y) is the variance of the labels, n is the number of samples or cases, and p is the number of model parameters. The interpretation of R2adj is usually the same as that of R2. If the number of parameters is significant with respect to the number of cases, R2 will give an overly optimistic measure of model performance. In general, the difference between R2adj and R2 becomes less significant as the number of cases n grows, although the difference can be significant when there are a large number of model parameters.

With only infrequent exceptions, it is not ordinarily possible to get values of R2 outside the range [0, 1]. R2 can only be greater than 1 in degenerate cases where all label values are the same and a prediction algorithm is unnecessary. If your model gives an R2 less than 0, it almost invariably means that there is a bug in your code, since it implies that the residuals of your model have greater dispersion than the original labels.

Root Mean Square Error is the square root of the mean squared error. Mean square error is identical to the variance of the residuals, with a slight bias. This metric is the one linear regression minimizes. Mean square error is in units of the square of the label values.

Mean squared error or MSE,

MSE = (1/N) ∑ r_i² = (1/N) ∑ (ŷ_i - y_i)²

Therefore, the root mean squared error is identical to the standard deviation of the residuals, with a slight bias. Root mean square error is in the same units as the label values.

Root mean squared error or RMSE,

RMSE = √[ (1/N) ∑ (ŷ_i - y_i)² ]

Mean Absolute Error is analogous to Mean Square Error, except that the absolute values of the residuals are averaged rather than their squares. This performance measure is more intuitive, since it represents the average magnitude of the residuals (the distance between the data points and the regression line).

Mean absolute error or MAE,

MAE = (1/N) ∑ |ŷ_i - y_i|

where | · | is the absolute value operator. The median absolute error locates the center of the absolute residuals. A significant difference between the median absolute error and the mean absolute error indicates the presence of outliers in the residuals.

Median absolute error,

Median absolute error = median( |ŷ_i - y_i| )
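As a minimal sketch of these metrics in R (assuming pred_LN and test$price as the example prediction/label pair; note that this sketch uses the conventional mean-centered total sum of squares for R2, which may differ slightly from the definition given above):

# Sketch only: residual-based performance metrics for one model
pred  <- pred_LN        # example predictions (assumed available from Section 5.3)
obs   <- test$price     # matching labels from the testing dataset
res   <- pred - obs
R2    <- 1 - sum(res^2) / sum((obs - mean(obs))^2)  # coefficient of determination
RMSE  <- sqrt(mean(res^2))                          # same units as the labels
MAE   <- mean(abs(res))                             # mean residual magnitude
MedAE <- median(abs(res))                           # robust to outlying residuals
c(R2 = R2, RMSE = RMSE, MAE = MAE, MedAE = MedAE)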


Accuracy of the Basic Linear Modeling Predictions

## Table 43. Accuracy of the Basic Linear Modeling Predictions
##          R2     RMSE      MAE
## 1 0.2882012 8982.904 7246.457

Accuracy of the Naïve Bayes Predictions

## Table 44. Accuracy of the Naive Bayes Predictions
##          R2     RMSE     MAE
## 1 0.2706309 20465.15 17554.6

Accuracy of the Generalized Additive Models Predictions

## Table 45. Accuracy of the Generalized Additive Models Predictions
##          R2     RMSE      MAE
## 1 0.2882012 8982.904 7246.457

Accuracy of the K-Nearest Neighbors Predictions

## Table 46. Accuracy of the K-Nearest Neighbors Predictions
##          R2    RMSE      MAE
## 1 0.3649415 8511.56 6489.852

Accuracy of the Support Vector Machines Predictions

## Table 47. Accuracy of the Support Vector Machines Predictions
##           R2     RMSE     MAE
## 1 0.05954126 13277.91 9848.73

Accuracy of the Decision Trees Predictions

## Table 48. Accuracy of the Decision Trees Predictions
##          R2     RMSE      MAE
## 1 0.3419149 8657.044 6637.835

Accuracy of the Random Forest Predictions

## Table 49. Accuracy of the Random Forest Predictions
##          R2     RMSE      MAE
## 1 0.3606071 8557.925 6581.667

Accuracy of the Gradient Boosting Predictions

##          R2     RMSE      MAE
## 1 0.3921105 8759.558 6587.375

Figure 30. Chart of Predictive Model Accuracy



Section 7 - Insights & Inferences

Section 7.1 - Conclusions

This document has implemented Data Visualization, Cross-Tabulation, K-means Data Clustering, Statistical Analysis, Data Correlation, Predictive Analytics, Cross-Validation, and Machine Learning in order to verify that it is possible to predict the “price” of future additions to the US Cars dataset. The general structure of the analysis followed the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology for Data Science analysis. The general process steps were Problem Definition (what are the requirements for the project?), Data Preparation (acquiring and organizing the data for the project), Data Discovery (meaningful exploratory analysis of the data), Feature Engineering (optimizing the data for Machine Learning), Model Development (regression analysis of the data for future predictions, evaluation of regression models, and predictive analytics), and finally Model Validation (selection of the Machine Learning model that has the highest quality for deployment).

A flowchart of this document’s Machine Learning procedures is presented at the beginning of the document. The beginning sections of the document should familiarize users of this Comprehensive Machine Learning Solution with the general theories of Machine Learning, and with the methods of initiating this project. After initial data importation, the data is thoroughly examined for relevant characteristics, then statistically examined to inform subsequent optimal processing. A set of basic charts and tables is created to explore the data statistically. When the exploration is complete, the data is configured for processing via outlier detection, missing value imputation, and data type conversion.

With the dataset optimized for processing, the dataset features that have the best ability to predict the target variable are deduced via Correlation Analysis, and various forms of Correlation are demonstrated in this section. Stepwise Regression was not selected as a correlation method because the project dataset’s independent variables already produced an essentially perfect fit; as a result, Stepwise Regression was unable to reduce the dataset’s independent variables. Correlation Analysis demonstrated that the dataset variables that have the most correlation with the target variable “price” are “lot”, “mileage”, and “year”.

At this point in the analysis, the dataset was ready to be divided into the training and testing datasets that are a requirement for Machine Learning. This process is called Cross-Validation, and it is vital for the usability of the deployed Machine Learning algorithm. Next, any irregularity in the ranges of the dataset variables is addressed by scaling the independent variables to a range suited to the numeric range of the dependent variable. This process is referred to as Feature Scaling, and the complexities involved are explained in the text of the Feature Scaling section. At this point in the analysis, the project data is ready for artificial-intelligence learning of each observation’s effect on the target variable.

The first step of the automated learning process is regression of the training dataset’s series of observation-based records. The regression methods chosen for this Comprehensive Machine Learning Pipeline were: Basic Linear Regression, Naïve Bayes Regression, Generalized Additive Models, K-Nearest Neighbor, Support Vector Machines, Decision Trees, Random Forest, and Gradient Boosting. It was presumed at the beginning of the programming process that Gradient Boosting would have the best results, considering that this algorithm processes the data intensively through several regression iterations. Neural Network regression was not selected, after non-convergence in 1 of 1 repetitions within the stepmax made it incompatible with this pipeline.

The linear regressions are evaluated by the Anderson-Darling Goodness of Fit method. The AD-testing stage verified that Feature Scaling had normalized the data to such a degree that the linear regression methods were essentially equalized, thereby showing minimal difference in predictive capability at this stage of the processing. The AD-testing algorithm was not compatible with the K-NN regression algorithm. AD-testing did indicate that Decision Trees should have the greatest predictive abilities. However, this was very likely because Decision Trees are usually not sensitive to Feature Scaling.

After processing of the eight regression methods, Predictive Analytics is then used on the training dataset in order to predict the contents of the testing dataset. The most accurate method is selected for final Machine Learning deployment. The methods of evaluating the prediction models are R2, RMSE, and MAE.

The Predictive Analytics trials demonstrated that the Gradient Boosting algorithm produces the most accurate predictions. The accuracy of the predictions is verified with an R2 value of 0.3921. Random Forest, K-NN, and Decision Trees had the next highest prediction accuracy. As expected, Naïve Bayes, Basic Linear Modeling, and GAM had the lowest predictive accuracy. Unexpectedly, Support Vector Machines had an unusually low prediction accuracy of 0.0595. This is very likely because the SVM algorithm requires a higher level of parameterization.

The overall accuracy levels of the various Machine Learning methods in this document are improvable via parameter optimization throughout the Machine Learning process. This Comprehensive Machine Learning Solution presents only the primary parameterizations, in order to allow for optimal usability across a wide variety of Machine Learning projects. The US Cars dataset was very likely chosen as a Kaggle competition because of the difficulty of predicting the target variable, “price”. Nevertheless, the above document gives a basic framework for subsequent parameter optimization. The second document in the References section below demonstrates the complex parameter optimization of the data and algorithms that is required to improve the accuracy of the predictive models used in this Comprehensive Machine Learning Solution, and in future Machine Learning projects.



Section 8 - References

The project is derived from https://github.com/MicrosoftLearning/Principles-of-Machine-Learning-R/blob/master/Module4/IntroductionToRegression.ipynb, and includes refinements specified by the client.

https://github.com/MicrosoftLearning/Principles-of-Machine-Learning-R/blob/master/Module4/IntroductionToRegression.ipynb

https://rstudio-pubs-static.s3.amazonaws.com/248952_706edc85cfa84a369dfe401a763d32fc.html

https://m-clark.github.io/generalized-additive-models/application.html

https://multithreaded.stitchfix.com/blog/2015/07/30/gam/

https://www.kaggle.com/zeynepkockar/analyze-with-r
