Linear Regression (Less Linear Than You Might Think)
For a very long time I associated Linear Regression with fitting a straight line (or hyperplane in higher dimensions) to a number of data points as shown below:
Obviously, the line represents a linear relation between x and f(x), so the name "Linear Regression" comes as no surprise. The line is a model of the underlying relationship and is fitted such that it is "close" to the sample data according to the usual squared error. The resulting linear model can be used to predict the value of f(x) for x-values not contained in the sample data set.
However, at some point I learned that linear regression can also look like this
or like this
and even like this:
What is going on here? Apart from the first case, these are definitely non-linear relationships between the underlying variable x and the function f(x) used for fitting the data.
The Explanation
We can solve this apparent paradox by looking at the following characterization (AI-generated but approved by yours truly):
Linear regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables, focusing on minimizing the sum of squared differences between observed and predicted values. Despite the name, linear regression can model both linear and non-linear relationships. The "linear" aspect refers to the model being linear in its coefficients, meaning it can incorporate variables in forms like polynomials f(x) = ax^4 + bx^3 + cx^2 + dx + e while remaining linear with respect to the coefficients a, b, c, d, e. Thus, linear regression applies to a wide range of scenarios, including those with non-linear data relationships, as long as the equation is linear in the parameters.
That sums it up nicely and, luckily, linear regression problems can be solved directly using standard matrix operations (like the inverse and the pseudoinverse), which I describe only verbally below since LinkedIn does not exactly excel at displaying formulas.
Solving Linear Regression Problems
In linear regression, we deal with a system of equations derived from our data, where each data point contributes one equation. Here, n represents the number of data points, and m refers to the number of coefficients (including the intercept) we need to estimate to define our linear model.
When n>m, meaning we have more data points than coefficients, the system is overdetermined. In such cases, it's impossible to find a perfect fit for all data points due to inherent data noise. The objective shifts towards finding coefficients that minimize the error between the observed and predicted values, a process known as least squares minimization.
The solution involves using the pseudoinverse, a method that allows for the estimation of the most suitable coefficients under these conditions. This approach ensures that our linear regression model is as accurate and reliable as possible, capable of making predictions on new data while acknowledging the limitations imposed by having more data points than parameters to estimate.
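To make this concrete, here is a minimal NumPy sketch of solving an overdetermined system with the pseudoinverse. The slope, intercept, and noise level are invented for illustration and are not taken from the article's examples.

```python
import numpy as np

# 100 noisy samples of a line: n = 100 equations, m = 2 coefficients (slope, intercept)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=x.shape)  # noisy "observations"

# Design matrix: one row per data point, one column per coefficient ([x, 1])
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution via the Moore-Penrose pseudoinverse
coeffs = np.linalg.pinv(A) @ y
print(coeffs)  # approximately [2.5, 1.0]
```

Each row of the design matrix corresponds to one data point, and multiplying by the pseudoinverse yields the coefficients that minimize the sum of squared errors.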
Choice of Features
In linear regression, selecting appropriate features is critical for model accuracy and prediction quality. Any feature that combines linearly with coefficients is viable, offering the flexibility to model complex relationships. This includes not just polynomial features, which capture non-linear patterns through powers of variables, but also functions like sine, cosine, exponential, and logarithmic. These can model behaviors like cycles and growth patterns while adhering to the linear regression principle.
Feature selection should be informed by data analysis and domain expertise to ensure model relevance. Properly chosen features can significantly enhance model performance, whether by capturing the essence of the data with polynomial terms or by modeling specific behaviors with functional transformations. The balance is crucial: too simple a model may miss underlying patterns (under-fitting), while an overly complex model risks fitting noise rather than signal (over-fitting).
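As a hypothetical illustration of how such features enter the model, the sketch below stacks a quadratic, a sine, a logarithmic, and a constant column into one design matrix. The particular feature set, the generating function, and all coefficient values are made up purely for demonstration; the point is only that the model stays linear in the coefficients that weight these columns.

```python
import numpy as np

def design_matrix(x):
    # Arbitrary (possibly non-linear) features, one column each.
    return np.column_stack([
        x**2,            # quadratic feature
        np.sin(2 * x),   # periodic feature
        np.log1p(x),     # growth-type feature, log(1 + x), defined for x >= 0
        np.ones_like(x), # constant feature (intercept)
    ])

x = np.linspace(0, 5, 50)
y = 0.7 * x**2 + 3 * np.sin(2 * x) + 1.5  # some assumed underlying relation

# Least squares over the chosen features; still linear in the coefficients
coeffs, *_ = np.linalg.lstsq(design_matrix(x), y, rcond=None)
print(coeffs)  # roughly [0.7, 3.0, 0.0, 1.5]
```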
Revisiting the Examples
To reiterate a key point, linear regression is only required to be linear in the coefficients, not in the so-called "features". For example, if we try to model a data set using the quadratic function
f(x) = ax^2 + bx + c,
we have three coefficients a, b, c. We also have one quadratic feature (x^2), one linear feature (x) and one constant feature (1). As described above, every sample data pair (x, y) gives rise to one equation
y = ax^2 + bx + c.
100 data points (like in the examples shown) thus result in a highly overdetermined equation system with 100 equations and only three coefficients where we can compute a least-squares solution using the pseudoinverse.
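A possible NumPy sketch of exactly this setup, with 100 noisy samples and the three feature columns x^2, x, and 1 (the ground-truth coefficients and the noise level are chosen arbitrarily here, not taken from the plots):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 100)
y = 1.2 * x**2 - 0.5 * x + 2.0 + rng.normal(scale=0.5, size=x.shape)

# 100 equations, 3 unknowns (a, b, c): columns are the features x^2, x, 1
A = np.column_stack([x**2, x, np.ones_like(x)])
a, b, c = np.linalg.pinv(A) @ y
print(a, b, c)  # close to 1.2, -0.5, 2.0
```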
Below, you see the same examples as in the beginning of the article, supplemented with the usually unknown "ground truth" function (green line) which was used to generate the data before Gaussian noise was added. Please note that for each example an "appropriate" model was chosen, which is usually not possible since the relation between x and y is unknown in practice. Unsurprisingly, the results of the linear regression are excellent in all cases. The following three examples are based on polynomials of degree 1, 2, and 3:
The example below is special since the generating function is the trigonometric sine function; in particular, the ground truth function was f(x) = 3sin(2x). Knowing this, we also chose a sine feature and used the correct internal factor 2, i.e. we took sin(2x) as the feature. That led to a nearly perfect result, with the linear regression identifying the coefficient value 3 very closely.
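For this sine case, a sketch along the same lines might look as follows. The ground truth 3sin(2x) is the one stated above, while the noise level is an assumption made for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 2 * np.pi, 100)
y = 3 * np.sin(2 * x) + rng.normal(scale=0.3, size=x.shape)

# Single feature sin(2x): the frequency 2 is assumed to be known in advance
A = np.sin(2 * x).reshape(-1, 1)
coeff = np.linalg.pinv(A) @ y
print(coeff)  # approximately [3.0]
```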
Please note that this is a very contrived example. If we use a different frequency of the sine wave, e.g. sin(10x), the linear regression results in a huge fitting error:
Similar problems occur with a phase shift like sin(2x+pi/2):
Automatically determining parameters like frequency and phase in this example exceeds the capabilities of linear regression, requiring advanced non-linear optimization or search methods that are beyond the scope of this article.
Confusing (but Common) Names for Special Cases of Linear Regression
In many books one can find the term "Quadratic Regression" for cases where a quadratic function like f(x) = ax^2 + bx + c is fitted to the data using linear regression.
Similarly, we can find the term "Cubic Regression" for cases where a cubic polynomial like f(x) = ax^3 + bx^2 + cx + d is fitted to the data using linear regression.
And with "Polynomial Regression" many texts denote the general case that a polynomial function of any degree is used to model the relationship between x and y.
All these cases are still linear regression, since f(x) is a linear combination (with coefficients a, b, ...) of the potentially non-linear features. The often shown linear regression using a polynomial of degree 1 is just a special case, where linear regression is done using a linear model.
Summary
Linear regression computes an error-minimizing linear superposition of arbitrary (linear or non-linear!) functions ("features") to model the relation underlying the given data. The choice of functions determines the goodness of fit with the training data and the ability to predict values for new data. A good choice usually requires knowledge or assumptions of the phenomenon which generated the data.
Note: A good alternative for learning non-linear relations in unknown data are often neural networks, which can model quite different relations in different parts of the data space.