📘 Overview of Least Squares Estimation Methods

1. Core Idea

All least squares methods are founded on a simple, powerful principle: find the model parameters that make the predicted values as close as possible to the observed values. They achieve this by minimizing the sum of squared residuals (the differences between observed and predicted values). Squaring the residuals ensures both positive and negative errors are penalized and emphasizes larger errors.

General Mathematical Form:

\[ \hat{\theta} = \arg \min_{\theta} \sum_{i=1}^n w_i \, (y_i - f(x_i, \theta))^2 \]

Breaking it down:

  • \( y_i \): the observed value of the \( i \)-th of \( n \) observations.
  • \( f(x_i, \theta) \): the model's prediction for inputs \( x_i \) given parameters \( \theta \).
  • \( y_i - f(x_i, \theta) \): the residual for observation \( i \).
  • \( w_i \): an optional per-observation weight (all equal to 1 in the ordinary, unweighted case).
  • \( \hat{\theta} \): the parameter values that minimize the (weighted) sum of squared residuals.
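
As a concrete rendering of this objective, here is a minimal Python sketch; the names (`weighted_ssr`, `f`, `theta`) are placeholders for illustration, not part of any specific library.

```python
import numpy as np

def weighted_ssr(theta, f, x, y, w=None):
    """Weighted sum of squared residuals: sum_i w_i * (y_i - f(x_i, theta))^2."""
    residuals = y - f(x, theta)      # observed minus predicted
    if w is None:
        w = np.ones_like(y)          # unweighted (ordinary) case
    return np.sum(w * residuals**2)
```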

2. Main Variants

1. OLS – Ordinary Least Squares

Model: \( y = X \beta + \epsilon \), where \( \epsilon \) is the error term.

Estimator: \( \hat{\beta} = \arg \min_{\beta} \sum (y_i - x_i' \beta)^2 = (X'X)^{-1} X' y \) (closed-form solution).
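A minimal numpy sketch of this closed-form solution on simulated data (the data-generating values here are made up purely for illustration); solving the normal equations is numerically safer than inverting \( X'X \) explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
y = X @ np.array([2.0, 0.5]) + rng.normal(size=n)      # homoskedastic errors

# Solve the normal equations X'X beta = X'y instead of forming (X'X)^{-1} explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to the true coefficients [2.0, 0.5]
```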

Key Assumptions (The "Classical" Assumptions):

  1. Linearity: The relationship between \( X \) and \( y \) is linear.
  2. Exogeneity: The error term has a mean of zero conditional on the regressors (\( E[\epsilon | X] = 0 \)). This means \( X \) is not correlated with the error.
  3. Homoskedasticity: The error term has constant variance (\( \text{Var}(\epsilon | X) = \sigma^2 I \)).
  4. No Autocorrelation: Errors are uncorrelated with each other.

Properties: Under these assumptions, the Gauss-Markov Theorem holds: OLS is the Best Linear Unbiased Estimator (BLUE). It has the smallest variance among all unbiased linear estimators.

Use Case: The standard starting point for any linear regression analysis.

Example: Predicting house prices (\( y \)) based on square footage and number of bedrooms (\( X \)). We assume the variability in price is roughly the same for small and large houses (homoskedasticity).

2. WLS – Weighted Least Squares

Model: Same as OLS, but errors are heteroskedastic (non-constant variance).

Estimator: \( \hat{\beta} = \arg \min_{\beta} \sum w_i (y_i - x_i' \beta)^2 \). The weights are typically chosen as \( w_i = 1 / \sigma_i^2 \), where \( \sigma_i^2 \) is the variance of the error for the \( i \)-th observation.
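A minimal sketch of the weighted estimator, assuming (purely for illustration) that the error variance grows with \( x_i^2 \) and is known; the weights enter through a diagonal matrix \( W \).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])
sigma2 = x**2                                  # assumed variance structure: Var(e_i) grows with x_i^2
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=np.sqrt(sigma2))

W = np.diag(1.0 / sigma2)                      # w_i = 1 / sigma_i^2
# beta_hat = (X' W X)^{-1} X' W y
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta_wls)  # close to the true coefficients [1.0, 2.0]
```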

Idea: "Down-weight" observations that are known to be noisier (high variance) and "up-weight" observations that are more precise (low variance). This restores efficiency.

Properties: More efficient than OLS when the weights are correctly specified. If weights are wrong, it can be worse than OLS.

Use Case: Data where the reliability of observations varies.

Example: Regressing on group averages, where each observation is a mean over a different number of individuals \( n_i \). Since the variance of an average scales as \( 1/n_i \), a natural choice of weights is \( w_i = n_i \).

3. GLS – Generalized Least Squares

Model: \( y = X \beta + u \), where \( \text{Var}(u) = \sigma^2 \Omega \). \( \Omega \) is a known positive-definite covariance matrix that captures the structure of the heteroskedasticity and autocorrelation.

Estimator: \( \hat{\beta}_{GLS} = (X' \Omega^{-1} X)^{-1} X' \Omega^{-1} y \). This "transforms" the original model to one that satisfies OLS assumptions.
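A minimal sketch of that transformation, assuming \( \Omega \) is known: factor \( \Omega = LL' \) (Cholesky), "whiten" \( X \) and \( y \) by \( L^{-1} \), and run OLS on the transformed model, which is algebraically identical to the GLS formula.

```python
import numpy as np

def gls(X, y, Omega):
    """GLS estimator (X' Omega^{-1} X)^{-1} X' Omega^{-1} y via Cholesky whitening."""
    L = np.linalg.cholesky(Omega)   # Omega = L L'
    Xs = np.linalg.solve(L, X)      # whitened regressors L^{-1} X
    ys = np.linalg.solve(L, y)      # whitened response   L^{-1} y
    # OLS on (Xs, ys): Xs'Xs = X' Omega^{-1} X and Xs'ys = X' Omega^{-1} y
    return np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
```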

Idea: The most general case for handling any violation of the spherical errors assumption (homoskedasticity + no correlation). It simultaneously corrects for both heteroskedasticity and autocorrelation.

Special Cases & Intuition:

  • If \( \Omega = I \) (spherical errors), GLS reduces to OLS.
  • If \( \Omega \) is diagonal (heteroskedasticity but no correlation), GLS reduces to WLS.
  • Intuitively, pre-multiplying the model by \( \Omega^{-1/2} \) "whitens" the errors so they become homoskedastic and uncorrelated, and OLS on the transformed data is then optimal.

Use Case: Time-series regressions, spatial econometrics, panel data models.

Example: Modeling GDP growth over time. Error terms are likely autocorrelated (a shock this year affects next year). \( \Omega \) would have constant variances on its main diagonal and non-zero off-diagonal entries that decay as the time lag grows, representing this correlation structure.
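
For intuition, here is a sketch of that structure under an assumed AR(1) error process with coefficient \( \rho \) (the \( \sigma^2 \) scale factor is omitted, since it cancels in the GLS formula):

```python
import numpy as np

def ar1_omega(n, rho):
    """AR(1) error structure: Omega[i, j] = rho**|i - j| (correlation matrix)."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

print(ar1_omega(4, 0.8))
# [[1.    0.8   0.64  0.512]
#  [0.8   1.    0.8   0.64 ]
#  [0.64  0.8   1.    0.8  ]
#  [0.512 0.64  0.8   1.   ]]
```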

4. NLS – Nonlinear Least Squares

Model: \( y_i = f(x_i, \theta) + \epsilon_i \), where \( f(\cdot) \) is a nonlinear function of the parameters \( \theta \) (e.g., \( \theta_1 e^{\theta_2 x} \)).

Estimator: \( \hat{\theta} = \arg \min_{\theta} \sum (y_i - f(x_i, \theta))^2 \).

Key Difference: There is no closed-form solution like \( (X'X)^{-1} X' y \). Estimation requires iterative numerical optimization algorithms (e.g., Gradient Descent, Gauss-Newton).
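A minimal sketch using scipy's `curve_fit` (which wraps such iterative optimizers) to fit the exponential model \( \theta_1 e^{\theta_2 x} \) mentioned above; the data are simulated, and `p0` supplies the starting values, whose choice matters precisely because of the local-minimum risk noted below.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, theta1, theta2):
    return theta1 * np.exp(theta2 * x)   # nonlinear in theta2

rng = np.random.default_rng(2)
x = np.linspace(0, 2, 50)
y = model(x, 3.0, 1.5) + rng.normal(scale=0.5, size=x.size)

# Iterative optimization; p0 gives starting values for (theta1, theta2)
theta_hat, _ = curve_fit(model, x, y, p0=[1.0, 1.0])
print(theta_hat)  # close to the true parameters [3.0, 1.5]
```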

Properties: Under standard regularity conditions, the estimator is consistent and asymptotically normal. However, it can be sensitive to starting values and may converge to a local (not global) minimum.

Use Case: Any context where the underlying data-generating process is known to be nonlinear.

Examples:

  • Exponential growth or decay: \( y = \theta_1 e^{\theta_2 x} \).
  • Logistic growth curves: \( y = \theta_1 / (1 + e^{-\theta_2 (x - \theta_3)}) \).
  • Michaelis-Menten enzyme kinetics: \( y = \theta_1 x / (\theta_2 + x) \).

3. Relationships and a Practical Challenge

Theoretical Hierarchy: OLS ⊂ WLS ⊂ GLS. OLS is the special case of WLS with equal weights, and WLS is the special case of GLS with a diagonal \( \Omega \); GLS is the most general form for linear models with generalized error structures. NLS is a separate branch for models that are nonlinear in their parameters.

The Practical Problem: In practice, the true error covariance matrix \( \Omega \) for GLS is almost never known.

The Solution: Feasible GLS (FGLS):

  1. Run an OLS regression and use the residuals \( \hat{u} \).
  2. Estimate the structure of \( \Omega \) from these residuals (e.g., model the heteroskedasticity or the autocorrelation).
  3. Use this estimated matrix \( \hat{\Omega} \) in the GLS formula, as sketched below. This two-step estimator is called Feasible GLS and is the version used in applied work.
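
A minimal FGLS sketch, assuming an AR(1) error structure and reusing the hypothetical `gls()` and `ar1_omega()` helpers sketched above:

```python
import numpy as np

def fgls_ar1(X, y):
    """Two-step FGLS under an assumed AR(1) error process."""
    # Step 1: OLS residuals
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    u = y - X @ beta_ols
    # Step 2: estimate rho by regressing u_t on u_{t-1}
    rho_hat = (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])
    # Step 3: plug the estimated covariance structure into the GLS formula
    return gls(X, y, ar1_omega(len(y), rho_hat))
```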

4. When to Use: A Decision Guide

| Method | Primary Use Case | Key Assumption |
| --- | --- | --- |
| OLS | Baseline modeling; standard linear relationships. | Spherical errors (homoskedastic, uncorrelated). |
| WLS | Heteroskedastic data with known or estimable variances. | The chosen weights are inversely proportional to the error variances. |
| GLS/FGLS | Correlated errors (time series, panels) or complex heteroskedasticity. | The structure of the error covariance \( \Omega \) can be correctly specified/estimated. |
| NLS | Theoretical model is inherently nonlinear in its parameters. | The functional form \( f(x_i, \theta) \) is correctly specified. |

Practical Workflow:

  1. Start with OLS and test its assumptions (e.g., the Breusch-Pagan test for heteroskedasticity, the Durbin-Watson test for autocorrelation).
  2. If assumptions are violated, use diagnostic tests to identify the nature of the problem:
    • If heteroskedasticity is found, use WLS or FGLS.
    • If autocorrelation is found (in time series), use FGLS or models like Cochrane-Orcutt.
  3. If theory or scatter plots suggest a nonlinear relationship, specify the correct functional form and use NLS.
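
One way to run the diagnostic step of this workflow, sketched with statsmodels (the simulated data are for illustration only):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
x = rng.normal(size=100)
X = sm.add_constant(x)
y = 1.0 + 2.0 * x + rng.normal(size=100)

ols_fit = sm.OLS(y, X).fit()

# Breusch-Pagan: H0 is homoskedasticity (a small p-value suggests WLS/FGLS)
_, bp_pvalue, _, _ = het_breuschpagan(ols_fit.resid, X)
# Durbin-Watson: values near 2 indicate no first-order autocorrelation
dw = durbin_watson(ols_fit.resid)
print(bp_pvalue, dw)
```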