Published: Jan 2025
Linear regression is usually introduced through formulas, normal equations, and matrix algebra. But at its core, linear regression is a geometric problem. Once you see it geometrically, concepts like least squares, residuals, multicollinearity, and overfitting become almost intuitive.
This post explains linear regression not as a statistical recipe, but as a problem of distance, angles, and projections in high-dimensional space.
Suppose we have n observations and p predictors. Each predictor is a vector of length n. Together, these predictors form the design matrix:
$$ X \in \mathbb{R}^{n \times p} $$
Each column of X represents a direction in an n-dimensional space.
These directions span a flat geometric object known as the column space of X.
This column space is not curved, spherical, or nonlinear. It is a flat subspace (a plane, or hyperplane) embedded inside \( \mathbb{R}^n \).
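To make this concrete, here is a tiny made-up design matrix in NumPy (the numbers are arbitrary): each column is a vector in \( \mathbb{R}^5 \), and together the two columns span a flat two-dimensional subspace of \( \mathbb{R}^5 \).

```python
import numpy as np

# A toy design matrix: n = 5 observations, p = 2 predictors.
# Each column is a direction (a vector) in R^5.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.5],
    [3.0, 3.5],
    [4.0, 4.0],
    [5.0, 6.0],
])

# The dimension of the column space is the rank of X.
print(np.linalg.matrix_rank(X))  # 2: the columns span a flat plane inside R^5
```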
The response variable \( y \) is also a vector in \( \mathbb{R}^n \). But in general, y does not lie inside the column space of X.
This mismatch is the entire reason regression exists. If \( y \) already lay in the column space, we could explain it perfectly with a linear combination of predictors.
Instead, linear regression asks a very precise geometric question:
Among all points in the column space of X, which one is closest to y?
The answer comes from the familiar least-squares formula:

$$ \hat{\beta} = (X^T X)^{-1} X^T y $$

This equation does not "solve for coefficients" in an abstract sense. It computes the coordinates of a specific point inside the column space of X.
That point is:
$$ \hat{y} = X\hat{\beta} $$

Geometrically, \( \hat{y} \) is the orthogonal projection of \( y \) onto the column space of \( X \).
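As a numerical sketch (using randomly generated data, so the specific numbers mean nothing), the coefficients and the projected point can be computed with NumPy's least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))    # 50 observations, 3 predictor directions in R^50
y = rng.normal(size=50)         # a response vector that (almost surely) is not in col(X)

# Coordinates of the projection point, expressed in the basis
# formed by the columns of X.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# The orthogonal projection of y onto the column space of X.
y_hat = X @ beta_hat
```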
The residual vector is:
$$ r = y - \hat{y} $$

This residual is not arbitrary. It satisfies a strong geometric property:
$$ X^T r = 0 $$

which means the residual is perpendicular to every column of X.
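A quick check of this orthogonality, again with made-up random data: the entries of \( X^T r \) come out at machine-precision level rather than exactly zero, simply because of floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta_hat            # residual vector

# The residual is perpendicular to every column of X,
# so X^T r is (numerically) the zero vector.
print(X.T @ r)                  # entries near machine precision, e.g. ~1e-14
```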
So when we say "least squares," what we really mean is: find the point in the column space of X whose squared distance \( \lVert y - X\beta \rVert^2 \) to \( y \) is smallest.
This is why ordinary least squares is a projection problem, not an optimization trick.
Each regression coefficient answers a subtle geometric question:
How much do we move along this predictor's direction to reach the projection point?
The coefficients are not properties of \( y \) alone. They depend on the directions of all the predictors together, and on the angles between those directions.
This is why multicollinearity makes coefficients unstable: when predictor directions are nearly parallel, the geometry becomes ill-conditioned.
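Here is a small simulated illustration of that instability; `x2` is deliberately constructed to be almost parallel to `x1`, so the exact estimates will vary from run to run.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # a predictor almost parallel to x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(size=n)           # the "true" relationship uses only x1

# Nearly parallel columns make X^T X close to singular,
# so the coefficient estimates become very sensitive to noise.
print(np.linalg.cond(X))              # a very large condition number
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                       # typically large, offsetting coefficients
```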
Even in 100-dimensional space, the story is unchanged.
No matter how many dimensions we add, linear regression never bends the space. It only projects.
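One way to see the "it only projects" claim numerically is through the projection (hat) matrix \( H = X(X^T X)^{-1} X^T \), a standard construction not spelled out above: it is symmetric and idempotent, so projecting a second time changes nothing, no matter how many predictor directions there are.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 100))   # 100 predictors: still just a flat subspace of R^200

# Projection (hat) matrix onto the column space of X.
H = X @ np.linalg.solve(X.T @ X, X.T)

print(np.allclose(H, H.T))        # True: the projection is symmetric
print(np.allclose(H @ H, H))      # True: projecting twice = projecting once
```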
If you understand this geometry, you are no longer memorizing regression. You are seeing it.