Linear Regression. Making Sense Of The Future Based On The Past.
Using your vehicle's fuel consumption to predict how far you can travel once the empty light comes on is a natural part of our lives. In fact, we probably assume that a vehicle should have this feature, no questions asked. But what is actually happening behind the scenes? How does the car decide what to show you?
Based on the last trip? Well, that could be biased. You could have been driving at 120 mph, hauling a caravan, or driving backwards for 200 miles. Whatever you were up to, a single trip is not enough.
Using linear regression, the computer can look at your last X trips, incorporate some basic attributes of each journey, and fit a predictive model that estimates the distance you can travel based on your previous behavior. If you have mostly been speeding, your estimated consumption will be higher, but it should average back out once you start adhering to the speed limits.
But how does this work? Let's give it a go.
What is Linear Regression?
Linear regression is a statistical method for modeling the relationship between one or more independent variables X and a dependent variable Y as a linear function (Preacher, et al., 2006). The outcome is a formula that, given an input or a set of input attributes, predicts an output value on the fitted line. It does not, however, guarantee that every training point will fall on that line, so it is important to measure the error between the observed and predicted values when fine-tuning the model (Hyndman & Koehler, 2006).
(Chakure, 2019)
Mathematically, what does it look like?
Below is the general form of a linear regression formula with one or more attributes and their associated weights.
(SuperDataScience, 2018)
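In case the figure is not available, the standard textbook form can be sketched as follows. The symbols are notational assumptions and may differ from the cited source: y is the predicted value, b_0 is the intercept, each b_i is the weight of attribute x_i, and \varepsilon is the error term that accounts for points not falling exactly on the line.

```latex
% Simple linear regression: one attribute
y = b_0 + b_1 x_1 + \varepsilon

% Multiple linear regression: n attributes
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + \varepsilon
```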
What does this mean for me?
A simple example is salary expectation based on experience, as in the graph below. The relationship could be linear: you would expect a basic salary when you start, and then a linear increase with the number of years in the field. Alternatively, experience can be represented by a kernel method that takes multiple features, such as years of experience, specific field, geo-location, training and skill sets, and produces a single experience number, which is then plotted against salary expectations.
(SuperDataScience, 2018)
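To make the salary example concrete, here is a minimal sketch of fitting a straight line with ordinary least squares in plain Python. The data points and the function names (fit_simple_linear_regression, predict) are made up for illustration and do not come from the cited source.

```python
# Hypothetical data: years of experience and salary (invented for illustration).
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [40000.0, 46000.0, 49000.0, 56000.0, 59000.0]


def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the ordinary least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Evaluate the fitted line at x."""
    return intercept + slope * x


intercept, slope = fit_simple_linear_regression(years, salary)
print(f"starting salary: {intercept:.0f}, increase per year: {slope:.0f}")
print(f"predicted salary at 7 years: {predict(intercept, slope, 7.0):.0f}")
```

The intercept plays the role of the basic starting salary and the slope is the yearly increase. Note that the made-up observations do not sit exactly on the fitted line, which is why error measures matter.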
So many inputs, which to use?
Multiple input variables can greatly increase the accuracy of the prediction and its correlation with the observed value, but they also increase the complexity of the algorithm. Factors to consider are the storage of inputs, calculation complexity and computational time. It is thus important to use feature selection to pick out the attributes that offer the most value in terms of prediction (Ludwig, et al., 2015). Feature selection simplifies visualization and facilitates a better understanding of the data. It also reduces the complexity of the algorithm, can mitigate the curse of dimensionality, and improves prediction times (Guyon & Elisseeff, 2003).
The M5 algorithm takes a related approach: it uses standard deviation reduction to split the data and approximates the values at the leaves with linear regression models. It also improves the predictions by smoothing the process and creating smaller models. The M5 Prime algorithm improves on the original M5 by handling missing values and enumerated features. It has been used in practical areas such as streamflow prediction, modeling sediment yield, approximating breakwater scour depths and predicting the compressive strength of concrete (Díaz, et al., 2017).
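One of the simplest forms of feature selection is to rank each attribute by the strength of its linear correlation with the target and keep only the top few. The sketch below illustrates that filter idea only. It is not the LASSO, random forest or M5 approach from the cited papers, and the trip data and names (rank_features, avg_speed_mph, load_kg, radio_volume) are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target, keep):
    """Score each feature by |correlation| with the target; return the top `keep`."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep], scores


# Hypothetical attributes of five trips and the fuel used on each.
features = {
    "avg_speed_mph": [50, 60, 70, 80, 90],
    "load_kg": [100, 300, 100, 300, 100],
    "radio_volume": [3, 3, 3, 3, 3],
}
fuel_per_100_miles = [3.0, 3.6, 3.9, 4.6, 4.9]

selected, scores = rank_features(features, fuel_per_100_miles, keep=2)
print("selected features:", selected)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

A ranking like this looks at each attribute in isolation, so it can miss attributes that are only useful in combination. It is a cheap first pass rather than a replacement for the methods cited above.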
How do you know which models fit best?
Error metrics are used to assess how well the model's predictions match the observed values. Measures such as the Mean Absolute Error (MAE) and Mean Squared Error (MSE) capture the overall error of the model, whereas the correlation captures the relationship between predictions and observations: zero indicates no linear relationship, and one or minus one indicates a strong relationship.
(Pascual, 2019)
(Pascual, 2019)
(mathsisfun, 2018)
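The three measures mentioned above can be computed in a few lines of plain Python. This is a minimal sketch using the standard definitions. The observed and predicted values are hypothetical, and the function names are chosen here for illustration.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average of the absolute differences between observed and predicted."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mean_squared_error(observed, predicted):
    """Average of the squared differences; large misses are penalized more."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient, between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical observed values and the predictions of some fitted model.
observed = [40000.0, 46000.0, 49000.0, 56000.0, 59000.0]
predicted = [40400.0, 45200.0, 50000.0, 54800.0, 59600.0]

mae = mean_absolute_error(observed, predicted)
mse = mean_squared_error(observed, predicted)
r = pearson_correlation(observed, predicted)
print(f"MAE: {mae:.1f}  MSE: {mse:.1f}  correlation: {r:.4f}")
```

MAE is in the same units as the target, which makes it easy to interpret. MSE is in squared units and reacts more strongly to a few large errors, so the two are often reported together.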
References
Chakure, A., 2019. Types of Linear Regression. [Online] Available at: https://hackernoon.com/types-of-linear-regression-w4o227s5 [Accessed 08 Feb 2020].
Díaz, I. et al., 2017. Machine learning applied to the prediction of citrus production. Spanish Journal of Agricultural Research, 15(2), pp. 1-12.
Guyon, I. & Elisseeff, A., 2003. An introduction to variable and feature selection. The Journal of Machine Learning Research, Volume 3, pp. 1157–1182.
Hyndman, R. J. & Koehler, A. B., 2006. Another look at measures of forecast accuracy. International Journal of Forecasting, Volume 22, pp. 679–688.
Ludwig, N., Feuerriegel, S. & Neumann, D., 2015. Putting Big Data analytics to work: Feature selection for forecasting electricity prices using the LASSO and random forests. Journal of Decision Systems, 24(1), pp. 1-28.
mathsisfun, 2018. Correlation. [Online] Available at: https://www.mathsisfun.com/data/correlation.html [Accessed 08 Feb 2020].
Pascual, C., 2019. Tutorial: Understanding Regression Error Metrics in Python. [Online] Available at: https://www.dataquest.io/blog/understanding-regression-error-metrics/ [Accessed 08 Feb 2020].
Preacher, K. J., Curran, P. J. & Bauer, D. J., 2006. Computational Tools for Probing Interactions in Multiple Linear Regression, Multilevel Modeling, and Latent Curve Analysis. Journal of Educational and Behavioral Statistics, 31(3), pp. 437-448.
SuperDataScience, 2018. Regression & Classification - Logistic Regression. [Online] Available at: https://www.superdatascience.com/blogs/regression-classification-logistic-regression [Accessed 08 Feb 2020].