Linear Regression with Gradient Descent
Linear regression is the foundational algorithm for predicting continuous outcomes, but knowing how to compute the model parameters is as crucial as understanding the model itself. While the direct ordinary least squares (OLS) method provides an exact solution, gradient descent offers a flexible, iterative optimization framework that scales to massive datasets and paves the way for understanding complex models like neural networks. Mastering both approaches gives you a complete toolkit, from quick analytical solutions to powerful, scalable optimization techniques essential for modern data science.
From Ordinary Least Squares to Iterative Optimization
Linear regression models the relationship between a dependent variable and one or more independent features by fitting a linear equation: $\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$. Here, $\theta_0$ is the intercept and $\theta_1, \dots, \theta_n$ are the coefficients or weights. The goal is to find the values for these parameters that result in the "best fit" line through your data.
The classical method for finding these parameters is the Ordinary Least Squares (OLS) solution. This approach defines "best fit" as the line that minimizes the sum of the squared differences between the observed values ($y^{(i)}$) and the predicted values ($\hat{y}^{(i)}$). These differences are called residuals. Mathematically, OLS finds the parameters that minimize the Mean Squared Error (MSE) cost function, which for $m$ data points is:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$$
where $\hat{y}^{(i)}$ is the model's prediction for the $i$-th example. The OLS method solves this minimization problem analytically using a closed-form formula: $\hat{\theta} = (X^T X)^{-1} X^T y$. This is efficient and exact for smaller datasets. However, it has significant limitations: it becomes computationally expensive (with a time complexity of roughly $O(n^3)$ for the matrix inversion) for datasets with a very large number of features ($n$), and it cannot be used if the matrix $X^T X$ is non-invertible (e.g., with highly correlated features). This is where iterative, gradient-based methods shine.
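The closed-form solution above can be sketched in a few lines of NumPy. This is an illustrative example with synthetic data; the variable names and the generated coefficients are our own choices, and in practice solving the normal equations via `np.linalg.solve` (or `lstsq`) is preferred over explicitly inverting $X^T X$.

```python
import numpy as np

# Synthetic data: y = 4.0 + 2.0*x1 - 1.0*x2 + 0.5*x3 + small noise.
rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.normal(size=(m, n))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 4.0 + 0.01 * rng.normal(size=m)

# Prepend a column of ones so theta_hat[0] plays the role of the intercept.
Xb = np.hstack([np.ones((m, 1)), X])

# Solve the normal equations (X^T X) theta = X^T y instead of forming the
# inverse explicitly, which is numerically safer.
theta_hat = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(theta_hat)  # close to [4.0, 2.0, -1.0, 0.5]
```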
The Gradient Descent Optimization Framework
Gradient descent is a first-order iterative optimization algorithm used to find a local minimum of a differentiable function. Instead of solving for the parameters directly, it starts with random initial values for $\theta$ and iteratively adjusts them in the direction that most steeply descends the cost function $J(\theta)$. Think of it as being placed on a foggy hillside (the cost landscape) and taking small steps downhill by feeling for the steepest slope around your feet.
The core of the algorithm lies in two components: the gradient and the learning rate. The gradient of the cost function, $\nabla_\theta J(\theta)$, is a vector of partial derivatives. For linear regression with MSE, the partial derivative with respect to a parameter $\theta_j$ is:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{2}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}$$
This gradient points in the direction of steepest ascent. To minimize the cost, we move in the opposite direction. The update rule for each parameter $\theta_j$ in batch gradient descent is:

$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$
Here, $\alpha$ is the learning rate, a hyperparameter that controls the size of each step. This process repeats until a convergence criterion is met, such as the change in the cost function between iterations falling below a very small threshold (e.g., $10^{-6}$) or reaching a preset number of iterations.
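Putting the pieces together, a minimal batch gradient descent loop might look like the following sketch. The function name, the zero initialization, and the default hyperparameters are our own illustrative choices, not a prescribed implementation.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, tol=1e-6, max_iters=10_000):
    """Fit linear regression by batch gradient descent on the MSE cost."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])       # add intercept column
    theta = np.zeros(n + 1)                    # simple (non-random) init
    prev_cost = np.inf
    for _ in range(max_iters):
        residuals = Xb @ theta - y             # predictions minus targets
        cost = (residuals ** 2).mean()         # MSE cost J(theta)
        grad = (2.0 / m) * (Xb.T @ residuals)  # gradient of J w.r.t. theta
        theta -= alpha * grad                  # step against the gradient
        if abs(prev_cost - cost) < tol:        # convergence check
            break
        prev_cost = cost
    return theta

# Noiseless synthetic data: y = 1.0 + 3.0*x1 - 2.0*x2.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 1.0
theta = batch_gradient_descent(X, y)
print(theta)  # approximately [1.0, 3.0, -2.0]
```

Note that the loop both caps the iteration count and checks the change in cost, matching the stopping criteria described above.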
Variants: Batch, Stochastic, and Mini-Batch
The classic form described above is Batch Gradient Descent, which uses the entire training dataset to compute the gradient for a single update. While this provides a stable path to the minimum, it can be prohibitively slow for datasets with millions of examples, as each step requires a full pass through the data.
Stochastic Gradient Descent (SGD) addresses this by using only one randomly selected training example to compute the gradient per iteration. Its update rule, for a randomly chosen example $i$, is: $\theta_j := \theta_j - \alpha \cdot 2 \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}$. This makes each iteration incredibly fast. However, because the gradient is based on a single, potentially noisy example, the path to the minimum is much more erratic. This randomness can help escape shallow local minima in more complex cost functions but requires careful tuning of the learning rate, often using a schedule that decreases over time.
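An SGD loop with a decreasing learning-rate schedule can be sketched as follows. The particular schedule, $\alpha_k = \alpha_0 / (1 + k/\text{decay})$, is just one common choice, and all constants here are illustrative assumptions.

```python
import numpy as np

# Noiseless synthetic data: y = 2.0 + 1.5*x1 - 0.5*x2.
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 2))
y = X @ np.array([1.5, -0.5]) + 2.0
Xb = np.hstack([np.ones((500, 1)), X])
m = Xb.shape[0]

theta = np.zeros(3)
alpha0, decay = 0.1, 200.0
for k in range(20_000):
    i = rng.integers(m)                  # pick one example at random
    residual = Xb[i] @ theta - y[i]
    grad = 2.0 * residual * Xb[i]        # single-example gradient
    alpha = alpha0 / (1.0 + k / decay)   # decreasing step size
    theta -= alpha * grad
print(theta)  # near [2.0, 1.5, -0.5]
```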
Mini-Batch Gradient Descent strikes a practical balance. Instead of the full batch or a single example, it uses a small, random subset (a mini-batch) of the data—typically between 32 and 512 examples—for each gradient update. This variant combines the computational efficiency of SGD (by leveraging vectorized operations on the mini-batch) with the more stable, less noisy convergence of batch gradient descent. It is the default choice for training most deep learning models.
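One epoch of mini-batch updates can be sketched as below. The batch size, learning rate, and function name are illustrative assumptions; note how shuffling once per epoch and slicing into mini-batches lets each update use vectorized NumPy operations.

```python
import numpy as np

def minibatch_epoch(Xb, y, theta, alpha=0.05, batch_size=32, rng=None):
    """Run one epoch of mini-batch gradient descent and return new theta."""
    if rng is None:
        rng = np.random.default_rng()
    m = Xb.shape[0]
    idx = rng.permutation(m)                 # shuffle once per epoch
    for start in range(0, m, batch_size):
        batch = idx[start:start + batch_size]
        residuals = Xb[batch] @ theta - y[batch]
        grad = (2.0 / len(batch)) * (Xb[batch].T @ residuals)
        theta = theta - alpha * grad         # one update per mini-batch
    return theta

# Noiseless synthetic data: y = 0.5 + 2.0*x1 - 1.0*x2.
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2))
y = X @ np.array([2.0, -1.0]) + 0.5
Xb = np.hstack([np.ones((1000, 1)), X])

theta = np.zeros(3)
for epoch in range(50):
    theta = minibatch_epoch(Xb, y, theta, rng=rng)
print(theta)  # approaches [0.5, 2.0, -1.0]
```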
Practical Implementation and Feature Scaling
Successful implementation of gradient descent requires careful attention to preprocessing and hyperparameter tuning. The most critical preprocessing step is feature scaling. Because gradient descent updates parameters in proportion to the gradient, features on different scales (e.g., age in years 0-100 and income in dollars 0-200,000) will cause the algorithm to take inefficient, oscillating steps toward the optimum. Standardization (subtracting the mean and dividing by the standard deviation for each feature) is commonly used to transform all features to a similar scale, typically with a mean of 0 and a standard deviation of 1. This allows the learning rate to work uniformly across all parameters and dramatically speeds up convergence.
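Standardization takes only a few lines. The feature columns below (age, income) and their values are made up for illustration.

```python
import numpy as np

# Two features on very different scales: age in years, income in dollars.
X = np.array([[25.0,  40_000.0],
              [60.0, 120_000.0],
              [35.0,  65_000.0]])

mu = X.mean(axis=0)            # per-feature mean
sigma = X.std(axis=0)          # per-feature standard deviation
X_scaled = (X - mu) / sigma    # each column now has mean 0, std 1

print(X_scaled.mean(axis=0))   # approximately [0, 0]
print(X_scaled.std(axis=0))    # [1, 1]
```

When a held-out test set is used, apply the *training* set's `mu` and `sigma` to it rather than recomputing them, so both sets live on the same scale.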
Selecting the learning rate ($\alpha$) is also paramount. If $\alpha$ is too small, the algorithm will converge very slowly. If it is too large, you may overshoot the minimum, causing the cost function to diverge or oscillate endlessly. A good practice is to try values on a logarithmic scale (e.g., 0.001, 0.01, 0.1, 1) and plot the cost function against the number of iterations. A well-chosen learning rate will show a steady, smooth decrease in cost.
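The logarithmic sweep described above can be sketched as follows: run a fixed number of iterations for each candidate rate and inspect whether the recorded cost trace decreases. The synthetic data and iteration budget are illustrative assumptions; in practice you would plot each trace rather than just print a summary.

```python
import numpy as np

# Synthetic data: y = 3.0 + 1.0*x1 + 2.0*x2 (noiseless).
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, 2.0]) + 3.0
Xb = np.hstack([np.ones((100, 1)), X])
m = Xb.shape[0]

results = {}
for alpha in (0.001, 0.01, 0.1, 1.0):
    theta = np.zeros(3)
    costs = []
    for _ in range(100):
        residuals = Xb @ theta - y
        costs.append((residuals ** 2).mean())
        theta -= alpha * (2.0 / m) * (Xb.T @ residuals)
    results[alpha] = costs
    trend = "decreasing" if costs[-1] < costs[0] else "diverging/flat"
    print(f"alpha={alpha}: final cost {costs[-1]:.4g} ({trend})")
```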
Common Pitfalls
- Using an Inappropriate Learning Rate: A common mistake is using a fixed, poorly tuned learning rate. A rate that is too high causes divergence (the cost increases), while one that is too low wastes computation time. Correction: Always plot the cost function over iterations during initial experiments. Implement a learning rate schedule or use adaptive optimizers that adjust the step size dynamically.
- Neglecting Feature Scaling: Running gradient descent on unscaled features is a primary reason for slow or failed convergence. The algorithm will spend excessive time zigzagging toward the optimum. Correction: Always standardize or normalize your features before applying gradient descent. This does not apply to the OLS analytical method.
- Failing to Check for Convergence: Letting the algorithm run for an arbitrary number of iterations is inefficient. Correction: Implement a sensible convergence criterion, such as stopping when $|J(\theta^{(t)}) - J(\theta^{(t-1)})| < \epsilon$, where $t$ is the iteration number and $\epsilon$ is a small tolerance (e.g., $10^{-6}$). Always include a maximum iteration limit as a safeguard.
- Misunderstanding the Variants: Using batch gradient descent on a dataset with 10 million examples will be impractically slow, while using stochastic gradient descent without a proper learning rate schedule may never converge cleanly. Correction: Choose the variant suited to your data size and problem. Mini-batch gradient descent is the recommended starting point for most large-scale applications.
Summary
- Linear regression can be solved analytically via the Ordinary Least Squares closed-form solution, which is efficient for small-to-medium datasets but becomes computationally prohibitive for very large feature sets.
- Gradient descent is an iterative optimization algorithm that minimizes the Mean Squared Error (MSE) cost function by updating parameters in the direction of the negative gradient, scaled by a learning rate.
- The three main variants—Batch, Stochastic, and Mini-Batch—offer trade-offs between computational cost and stability of convergence, with mini-batch being the most common choice in practice.
- Proper implementation is non-negotiable: feature scaling (like standardization) is essential for performance, and the learning rate must be carefully tuned to ensure timely and stable convergence to the optimal model parameters.