Maybe you should be using Ordinary Least Squares Regression

Recently, I was curious about a simple question: How does Ordinary Least Squares (OLS) regression hold up against more complex models?

Here's what I tested and what I found.

OLS is sometimes recommended as a starting point, and it’s easy to overlook just how robust and competitive it can be—even when assumptions like linearity and normality don’t fully hold. More importantly, complex models like Random Forests and SVMs need to earn their place by proving they’re significantly better.

To put this to the test, I ran an experiment on datasets of varying types and sizes:

  • Linear vs. Non-linear relationships
  • Continuous vs. Binary outcomes
  • Smaller (1,000 rows) vs. Larger (100,000 rows) datasets

I compared OLS against Logistic Regression, Random Forests, SVMs, and SVR (Support Vector Regression), using appropriate metrics for binary and continuous outcomes.
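The linked notebook contains the full experiment; as a rough sketch of the comparison, the setup below times an OLS fit against a Random Forest and SVR on synthetic linear data (the data-generating process, model settings, and metric are my illustrative assumptions, not the notebook's exact choices):

```python
import time
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic, mostly linear data: 1,000 rows, 5 features, Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for name, model in [
    ("OLS", LinearRegression()),
    ("Random Forest", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("SVR", SVR()),
]:
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    results[name] = (elapsed, r2_score(y_test, model.predict(X_test)))

for name, (elapsed, r2) in results.items():
    print(f"{name:14s} fit time: {elapsed:.4f}s  test R^2: {r2:.3f}")
```

On linear data like this, the pattern in the results below tends to reproduce: OLS fits orders of magnitude faster and matches or beats the more complex models on R².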

The Results

Speed:

  • OLS was consistently the fastest model, even on 100,000 rows.
  • SVMs and Random Forests slowed down significantly as the data scaled.

Performance:

  • On linear data, OLS matched or outperformed more complex models.
  • On non-linear data, Random Forests performed best, but OLS still delivered reasonable results.
  • Even for binary outcomes, OLS produced meaningful coefficients and predictions, often close to logistic regression.
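The binary-outcome case above is essentially a linear probability model: fitting OLS on a 0/1 target and thresholding the fitted values at 0.5. A small sketch (the data-generating process here is my assumption for illustration) shows how closely its predictions can track logistic regression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score

# Binary outcome drawn from a logistic data-generating process.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
p = 1 / (1 + np.exp(-(X @ np.array([1.0, -1.0, 0.5]))))
y = (rng.uniform(size=1000) < p).astype(int)

ols = LinearRegression().fit(X, y)      # linear probability model
logit = LogisticRegression().fit(X, y)

# Threshold OLS's fitted "probabilities" at 0.5 to get class predictions.
ols_pred = (ols.predict(X) >= 0.5).astype(int)
logit_pred = logit.predict(X)

print("prediction agreement:", (ols_pred == logit_pred).mean())
print("OLS accuracy:  ", accuracy_score(y, ols_pred))
print("logit accuracy:", accuracy_score(y, logit_pred))
```

The two models typically agree on the vast majority of predictions, which is why OLS remains a useful sanity check even for classification-style problems.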

Key Takeaways

  • OLS is more than a starting point—it’s a benchmark. Before adding complexity, ask: “Is my model meaningfully better than OLS?”
  • Simplicity often wins. In the spirit of Occam’s razor, if a complex model doesn’t outperform OLS, it may not justify the extra cost.
  • OLS is robust. Even when assumptions don’t fully hold, it delivers interpretable and competitive results.

Why does this matter?

In a world of increasingly complex models, it’s easy to forget that sometimes the simplest approach is also the best. Whether you’re working with millions of rows or imperfect data, OLS remains a reliable, fast, and effective tool. Start simple. Benchmark with OLS. If the complex model can’t beat it, maybe it’s not needed at all.

Check out the notebook here: https://colab.research.google.com/drive/1aLWiKd3g1MqNdF6p8LOuz6oeUTXUriea?usp=sharing

#MachineLearning #DataScience #Regression #Benchmarking #OccamsRazor #Efficiency


Cole Napper

VP Research & Innovation | People Analytics, Workforce Planning, & Talent Intelligence | Directionally Correct - #1 People Analytics Podcast & Substack Newsletter | Prolific Author, Writer, Speaker | HR Tech Advisor

3 months ago

“The more you know, the less you use” is one of my favorite stats quotes about OLS regression.

Manpreet(Manny) Sidhu

Cloud Strategy & AGI Applied Data Science Leader | Author | Speaker | Mentor

3 months ago

Insightful article; it will be interesting to see OLS applied to time series data.
