Feb 24

Linear Algebra: Least Squares Solutions

MT
Mindli Team

AI-Generated Content


In engineering and data science, you will constantly face a fundamental problem: what do you do when a system of equations has no exact solution? Real-world data is messy, measurements have error, and physical models are often approximations. The least squares method provides the powerful, definitive answer. It finds the "best" approximate solution to an inconsistent or overdetermined system (where there are more equations than unknowns) by minimizing the sum of the squares of the residuals. This technique is the engine behind curve fitting, regression analysis, satellite positioning, and countless other optimization tasks.

The Core Problem and the Normal Equations

Consider the system $Ax = b$, where $A$ is an $m \times n$ matrix with $m > n$. This system is overdetermined and typically has no solution because $b$ is not in the column space of $A$, $\operatorname{Col}(A)$. The least squares approach redefines the goal. Instead of solving $Ax = b$ exactly, we seek a vector $\hat{x}$ that minimizes the norm of the residual vector $r = b - A\hat{x}$.

The quantity to minimize is the squared Euclidean norm $\|b - Ax\|^2$. This is equivalent to minimizing the sum of the squares of the residual components: $\sum_{i=1}^{m} \left(b_i - (Ax)_i\right)^2$.

The solution is found by projecting $b$ onto $\operatorname{Col}(A)$. The key result is that the error vector $b - A\hat{x}$ must be orthogonal to $\operatorname{Col}(A)$. This orthogonality condition leads directly to the normal equations: $A^T A \hat{x} = A^T b$. To derive this, note that the residual is orthogonal to every vector in $\operatorname{Col}(A)$, meaning it is orthogonal to each column of $A$. Writing this condition as $A^T (b - A\hat{x}) = 0$ immediately yields the normal equations. Provided the columns of $A$ are linearly independent, $A^T A$ is invertible, and the unique least squares solution is $\hat{x} = (A^T A)^{-1} A^T b$.
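
As a sketch, the normal equations can be solved numerically with NumPy; the matrix and right-hand side here are hypothetical illustrative values, not data from the text:

```python
import numpy as np

# Hypothetical overdetermined system: 4 equations, 2 unknowns.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

# Normal equations: solve (A^T A) x_hat = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# The residual b - A x_hat must be orthogonal to every column of A.
residual = b - A @ x_hat
print(np.allclose(A.T @ residual, 0.0))  # True
```

The orthogonality check at the end is exactly the condition $A^T(b - A\hat{x}) = 0$ that defines the normal equations.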

Example: fit a line $y = \beta_0 + \beta_1 x$ to the data points $(0, 6)$, $(1, 0)$, $(2, 0)$. Each point gives one equation, so we have the overdetermined system $\beta_0 = 6$, $\beta_0 + \beta_1 = 0$, $\beta_0 + 2\beta_1 = 0$. In matrix form, $A\beta = b$ with $A = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}$, $b = \begin{pmatrix} 6 \\ 0 \\ 0 \end{pmatrix}$. The normal equations are $\begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix} \hat{\beta} = \begin{pmatrix} 6 \\ 0 \end{pmatrix}$. Solving gives $\hat{\beta}_0 = 5$, $\hat{\beta}_1 = -3$. The best-fit line is $y = 5 - 3x$.
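
A hand computation like this can be cross-checked with `np.linalg.lstsq`, which solves the minimization directly. This sketch uses the sample points $(0, 6)$, $(1, 0)$, $(2, 0)$ as illustrative data:

```python
import numpy as np

# Illustrative data points: (0, 6), (1, 0), (2, 0).
x = np.array([0.0, 1.0, 2.0])
y = np.array([6.0, 0.0, 0.0])

# Design matrix: a column of ones (intercept) and the x values.
A = np.column_stack([np.ones_like(x), x])

# lstsq minimizes ||A beta - y||^2 without forming the normal equations.
beta, residual_ss, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # [ 5. -3.]  -> best-fit line y = 5 - 3x
```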

Geometric Interpretation and the QR Factorization Approach

Geometrically, the least squares solution $\hat{x}$ is the vector whose image $A\hat{x}$ is the orthogonal projection of $b$ onto $\operatorname{Col}(A)$. This projected vector is denoted $\hat{b} = A\hat{x}$. The residual $b - \hat{b}$ is the component of $b$ orthogonal to the column space.

While the normal equations are conceptually clear, they can be numerically unstable if $A$ is ill-conditioned, because the condition number of $A^T A$ is roughly the square of the condition number of $A$. A more robust computational method uses QR factorization.

Here, we factor $A = QR$, where $Q$ is an $m \times n$ matrix with orthonormal columns (a basis for $\operatorname{Col}(A)$) and $R$ is an invertible $n \times n$ upper triangular matrix. Substituting into the normal equations gives $R^T Q^T Q R \hat{x} = R^T Q^T b$. Since $Q^T Q = I$, this simplifies to $R^T R \hat{x} = R^T Q^T b$, and because $R^T$ is invertible, we arrive at the clean system $R\hat{x} = Q^T b$. This system is easy to solve by back substitution. The geometric insight is powerful: $Q^T b$ computes the coordinates of the projection of $b$ in the orthonormal basis given by $Q$'s columns, and solving with $R$ maps those coordinates back to the original $x$-coordinates.
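
The QR route can be sketched in a few lines of NumPy; the random tall matrix here is a hypothetical stand-in for real data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))   # tall matrix with independent columns
b = rng.standard_normal(50)

# Reduced QR: Q is 50x3 with orthonormal columns, R is 3x3 upper triangular.
Q, R = np.linalg.qr(A)

# Solve the triangular system R x_hat = Q^T b (a cheap back substitution).
x_qr = np.linalg.solve(R, Q.T @ b)

# On this well-conditioned A it agrees with the normal-equation solution.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(x_qr, x_ne))  # True
```

The two answers coincide here, but for nearly rank-deficient matrices the QR path loses far less precision.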

Linear Regression and Weighted Least Squares

Linear regression is the most famous application of least squares. In simple linear regression, we fit a line $y = \beta_0 + \beta_1 x$ to data points $(x_i, y_i)$. The design matrix $A$ has a column of ones (for the intercept $\beta_0$) and a column of $x$ values. The normal equations solve for $\hat{\beta}_0$ and $\hat{\beta}_1$. This extends directly to multiple linear regression with many predictor variables.
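
The extension to multiple predictors just means more columns in the design matrix. A minimal sketch on synthetic data (the coefficients 2.0, 0.5, and -1.5 are invented for illustration):

```python
import numpy as np

# Synthetic dataset: y depends on two features u and v, plus small noise.
rng = np.random.default_rng(1)
u = rng.uniform(0, 10, size=100)
v = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * u - 1.5 * v + rng.normal(0, 0.1, size=100)

# Design matrix: intercept column plus one column per predictor.
X = np.column_stack([np.ones_like(u), u, v])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef is approximately [2.0, 0.5, -1.5]
```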

A crucial extension is weighted least squares (WLS). In standard least squares, we assume all measurements in $b$ are equally reliable. What if some data points are known to be more precise than others? WLS incorporates a weight $w_i$ for each observation. If we have a diagonal weight matrix $W$ with positive weights $w_i$, the problem becomes minimizing $\sum_i w_i r_i^2 = \|W^{1/2}(b - Ax)\|^2$. The solution modifies the normal equations to $A^T W A \hat{x} = A^T W b$. Heavier weights force the solution to fit certain data points more closely. This is essential in engineering when combining sensor data of varying quality or in econometrics to handle heteroscedasticity.
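
A short sketch of WLS, with hypothetical weights that mark the third measurement as far more trustworthy than the others:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([0.0, 1.0, 4.0])

# Hypothetical weights: trust the third measurement 100x more.
w = np.array([1.0, 1.0, 100.0])
W = np.diag(w)

# Weighted normal equations: (A^T W A) x_hat = A^T W b.
x_wls = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

# Equivalent trick: scale each row of A and b by sqrt(w_i), then use plain lstsq.
sw = np.sqrt(w)
x_check, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
print(np.allclose(x_wls, x_check))  # True
```

The row-scaling trick works because $\|W^{1/2}(b - Ax)\|^2 = \|(W^{1/2}A)x - W^{1/2}b\|^2$, turning WLS into an ordinary least squares problem.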

Applications in Data Fitting and Engineering Systems

The utility of least squares spans from simple curve fitting to complex system control. In data fitting, you can fit not just lines but polynomials, exponentials (by taking logs), or any model linear in its parameters to a dataset. The matrix $A$ is constructed from the basis functions evaluated at the data points.
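
For instance, fitting a quadratic is still a linear least squares problem, because the model is linear in its coefficients even though it is nonlinear in $x$. A sketch with invented data roughly following $y = 1 + x^2$:

```python
import numpy as np

# Illustrative data, roughly following y = 1 + x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.8, 5.1, 10.2, 17.0])

# Columns of A are the basis functions 1, x, x^2 evaluated at the data points.
A = np.column_stack([np.ones_like(x), x, x**2])

# Fit y = c0 + c1*x + c2*x^2 by least squares.
c, *_ = np.linalg.lstsq(A, y, rcond=None)
```

The same pattern works for any basis (sines, splines, logs): only the columns of $A$ change.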

For overdetermined systems in engineering, think of triangulating a device's location from multiple GPS satellites with slightly inconsistent time signals. Each satellite gives an equation relating distances, and the least squares solution provides the most probable location. In electrical engineering, solving for circuit parameters from noisy measurements is a least squares problem. In mechanical and aerospace engineering, calibrating sensors or estimating forces from strain gauge readings relies on these principles.

The method also underpins the Kalman filter, a recursive algorithm for state estimation that uses a form of sequential least squares to optimally combine predictions with new measurements; this is vital for navigation and control systems.

Common Pitfalls

  1. Solving Normal Equations Directly for Ill-Conditioned Problems: As mentioned, forming $A^T A$ squares the condition number, which can lead to catastrophic loss of precision for nearly rank-deficient matrices. Correction: Use the QR factorization method or, for rank-deficient cases, a Singular Value Decomposition (SVD) approach, which is the most stable numerical algorithm for least squares.
  2. Misinterpreting the Solution When $A^T A$ is Singular: If the columns of $A$ are linearly dependent, $A^T A$ is not invertible and the normal equations have infinitely many solutions. This means the least squares problem itself has infinitely many minimizers. Correction: Recognize that this indicates your model has redundant parameters. You must either reformulate the model to remove the dependency or use the SVD to select the minimum norm least squares solution, which is often the desired unique output.
  3. Applying Least Squares to a Model that is Nonlinear in its Parameters: You cannot directly fit a model like $y = a e^{bx}$ using the linear least squares framework unless you transform it. Taking logs gives $\ln y = \ln a + bx$, which is linear in the parameters $\ln a$ and $b$. Correction: Always check if your model can be transformed into a linear one, or be prepared to use nonlinear least squares algorithms (e.g., Gauss-Newton), which are more complex and iterative.
  4. Ignoring the Assumptions and Diagnostics: Least squares provides the "best" linear unbiased estimator under specific assumptions (errors are uncorrelated, have constant variance, and are normally distributed). Blindly applying it without checking residuals for patterns can lead to poor models. Correction: After fitting, always plot residuals versus fitted values and predictors to check for heteroscedasticity (non-constant variance) or nonlinearity, which may require weighted least squares or a different model form.
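
The rank-deficient case from the second pitfall can be sketched concretely; the matrix below is hypothetical, with its second column deliberately a multiple of the first:

```python
import numpy as np

# Rank-deficient A: the second column is twice the first (redundant parameter).
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
b = np.array([1.0, 2.0, 2.0])

# A^T A is singular here, so the normal equations cannot be solved directly.
print(np.linalg.matrix_rank(A.T @ A))  # 1, not 2

# The SVD-based pseudoinverse picks the minimum norm least squares solution.
x_min = np.linalg.pinv(A) @ b
```

Among the infinitely many minimizers, `pinv` returns the one of smallest Euclidean norm, which is the conventional unique answer in the rank-deficient case.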

Summary

  • The least squares method finds the best approximate solution $\hat{x}$ to an inconsistent system $Ax = b$ by minimizing the sum of squared residuals $\|b - Ax\|^2$.
  • The solution is defined by the normal equations $A^T A \hat{x} = A^T b$ and corresponds geometrically to the orthogonal projection of $b$ onto the column space of $A$.
  • For numerical stability, the QR factorization $A = QR$ is preferred, reducing the problem to solving the triangular system $R\hat{x} = Q^T b$.
  • Linear regression is a direct application, and weighted least squares generalizes the method to account for measurements of differing precision or reliability.
  • The technique is indispensable for solving overdetermined systems in engineering domains like signal processing, navigation, and data fitting, providing a principled way to extract optimal estimates from noisy, redundant data.
