Linear Regression. Making Sense Of The Future Based On The Past.

Using your vehicle's fuel consumption to predict how far you can travel once the empty sign goes on is a natural part of our lives. In fact, we probably assume that a vehicle should have this feature, no questions asked. But what is actually happening behind the scenes? How does the car decide what to show you?

Based on the last trip? Well, that could be biased. You could have been driving at 120mph, hauling a caravan or driving backwards for 200 miles. Whatever you were up to, a single trip is not enough.

Using linear regression, the computer can look at your last X trips, incorporate some basic attributes of each journey and fit a predictive model that estimates the distance you can travel based on your previous behavior. If you have mostly been speeding, your consumption will be higher and the estimated distance lower, but it should average back out once you start adhering to the speed limits.

But how does this work? Let's give it a go.

What is Linear Regression?

Linear regression is a statistical method for modeling the relationship between one or more independent variables X and a dependent variable Y as a linear function (Preacher, et al., 2006). The outcome is a formula that, given an input or set of input attributes, predicts an output value that lies on the fitted line. It does not, however, guarantee that every training point will fall on that line, so it is important to consider error measures between the observed and predicted values when evaluating and fine-tuning the model (Hyndman & Koehler, 2006).

[Image not available]

(Chakure, 2019)
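
As a rough illustration of the idea above, the sketch below fits a straight line to a handful of points using the standard least-squares formulas and then shows that the observed points do not sit exactly on the fitted line. The trip data, the variable names and the helper fit_line are all hypothetical and chosen only for this example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical trips: average speed (mph) and fuel used (litres per 100 miles).
speeds = [40, 50, 60, 70, 80]
fuel = [9.0, 10.1, 10.9, 12.2, 12.8]

intercept, slope = fit_line(speeds, fuel)

# Predictions on the training inputs and the gap to what was observed.
predictions = [intercept + slope * s for s in speeds]
residuals = [y - p for y, p in zip(fuel, predictions)]

print(f"fuel = {intercept:.2f} + {slope:.3f} * speed")
print("residuals:", [round(r, 2) for r in residuals])
```

The residuals are small but not zero, which is why the error measures discussed later are needed to judge how well the line fits.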

Mathematically, what does it look like?

Below is a sample of what to expect from a linear regression formula with one or more attributes and their associated contributing weights.

[Image not available: linear regression formula]

(SuperDataScience, 2018)
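
For reference, the usual textbook form of the formula can be written as follows, first with a single attribute and then with several. The symbol names (b for the weights, x for the attributes, y for the predicted value) are assumptions and may differ from the notation used in the cited source.

```latex
% Simple linear regression: one attribute x_1
y = b_0 + b_1 x_1

% Multiple linear regression: n attributes, each with its own weight
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n
```

Here b_0 is the intercept (the predicted value when every attribute is zero) and each b_i is the weight that says how much the prediction changes per unit change in attribute x_i.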

What does this mean for me?

A simple example could be salary expectations based on experience, as per the graph below. The relationship could be linear in nature: you would expect a basic salary when you start and then a linear increase with the number of years in the field. Alternatively, experience can be represented by a kernel method that takes multiple features, such as years of experience, specific field, geo-location, training and skillsets, and produces a single experience number, which is then plotted against salary expectations.

[Image not available: graph of salary against experience]

(SuperDataScience, 2018)

So many inputs, which to use?

Multiple input variables can greatly increase the accuracy of the prediction and its correlation with the observed value, but they can also increase the complexity of the algorithm. Factors to consider are storage of inputs, calculation complexity and computational time. It is thus important to use feature selection to pick out those attributes that offer the most value in terms of prediction (Ludwig, et al., 2015). Feature selection simplifies visualization and facilitates a better understanding of the data. It also reduces the complexity of the algorithm, can lessen the curse of dimensionality and improves prediction times (Guyon & Elisseeff, 2003).

The M5 algorithm builds a tree using standard deviation reduction to choose its splits and approximates the values at the leaves with linear regression models. It also improves the predictions by smoothing and by creating smaller models. The M5 Prime algorithm improves on the original M5 by handling missing values and enumerated features. It has been used in practical areas such as streamflow prediction, modeling sediment yield, approximating breakwater scour depths and predicting the compressive strength of concrete (Díaz, et al., 2017).
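
To make the idea of feature selection concrete, here is a minimal sketch of one of the simplest approaches: rank each candidate attribute by the absolute value of its correlation with the target and keep the top few. This is a basic filter method chosen for illustration. It is not the LASSO, random forest or M5 approach from the cited papers, and the trip data and attribute names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def rank_features(features, target):
    """Sort feature names by |correlation with target|, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-trip attributes and the fuel used on each trip.
trips = {
    "avg_speed_mph": [40, 50, 60, 70, 80],
    "load_kg": [100, 300, 150, 400, 250],
    "radio_volume": [3, 7, 5, 4, 6],
}
fuel_used = [9.0, 10.1, 10.9, 12.2, 12.8]

ranking = rank_features(trips, fuel_used)
selected = [name for name, _ in ranking[:2]]  # keep the two strongest attributes

for name, score in ranking:
    print(f"{name}: |r| = {score:.2f}")
print("selected:", selected)
```

A ranking like this looks at each attribute in isolation, so it can miss attributes that are only useful in combination. That limitation is part of the motivation for the more sophisticated methods cited above.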

How do you know which models fit best?

Error metrics are used to assess the model's predictions against the observed values. Measures such as the Mean Absolute Error and Mean Squared Error capture the overall error of the model, whereas the correlation describes the relationship between the predictions and observations, with values near zero indicating no linear relationship and values near one or minus one a strong one.
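
In their usual textbook form, these three measures can be written as follows. The notation is an assumption made for this sketch: y_i is an observed value, \hat{y}_i the corresponding prediction, n the number of observations, and a bar denotes a mean.

```latex
% Mean Absolute Error: average size of the errors, ignoring their sign
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|

% Mean Squared Error: average squared error, penalizing large misses more heavily
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2

% Pearson correlation between observations and predictions
r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}
         {\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2} \, \sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}}
```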

[Image not available]

(Pascual, 2019)

[Image not available]

(Pascual, 2019)

[Image not available: correlation]

(mathsisfun, 2018)

References

Chakure, A., 2019. Types of Linear Regression. [Online] Available at: https://hackernoon.com/types-of-linear-regression-w4o227s5 [Accessed 08 Feb 2020].

Díaz, I. et al., 2017. Machine learning applied to the prediction of citrus production. Spanish Journal of Agricultural Research, 15(2), pp. 1-12.

Guyon, I. & Elisseeff, A., 2003. An introduction to variable and feature selection. The Journal of Machine Learning Research, Volume 3, pp. 1157-1182.

Hyndman, R. J. & Koehler, A. B., 2006. Another look at measures of forecast accuracy. International Journal of Forecasting, Volume 22, pp. 679-688.

Ludwig, N., Feuerriegel, S. & Neumann, D., 2015. Putting Big Data analytics to work: Feature selection for forecasting electricity prices using the LASSO and random forests. Journal of Decision Systems, 24(1), pp. 1-28.

mathsisfun, 2018. Correlation. [Online] Available at: https://www.mathsisfun.com/data/correlation.html [Accessed 08 Feb 2020].

Pascual, C., 2019. Tutorial: Understanding Regression Error Metrics in Python. [Online] Available at: https://www.dataquest.io/blog/understanding-regression-error-metrics/ [Accessed 08 Feb 2020].

Preacher, K. J., Curran, P. J. & Bauer, D. J., 2006. Computational Tools for Probing Interactions in Multiple Linear Regression, Multilevel Modeling, and Latent Curve Analysis. Journal of Educational and Behavioral Statistics, 31(3), pp. 437-448.

SuperDataScience, 2018. Regression & Classification - Logistic Regression. [Online] Available at: https://www.superdatascience.com/blogs/regression-classification-logistic-regression [Accessed 08 Feb 2020].


