📘 Likelihood-Based Estimation Methods: An Enhanced Guide
Core Idea: The Engine of Likelihood
At its heart, a likelihood method asks a simple question: "Given the data I observed, which parameter values for my model make this data most probable?"
The Likelihood Function (\( L(\theta | \text{data}) \)): This is the answer to that question. It is the joint probability (density) of the observed data, viewed as a function of the parameters (\( \theta \)) with the data held fixed. For independent data, it's the product of the probability (density) functions for each data point.
The Log-Likelihood (\( \ell(\theta) \)): We almost always work with the log-likelihood because it turns pesky products into manageable sums. Since the logarithm is a monotonic function, maximizing the log-likelihood gives the same answer as maximizing the likelihood.
\[ \ell(\theta) = \sum_{i=1}^{n} \log f(y_i \mid x_i, \theta) \]
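To make this concrete, here is a minimal numerical sketch of treating the log-likelihood as an ordinary optimization target with scipy. The exponential model and the data values are purely illustrative:

```python
import numpy as np
from scipy import optimize, stats

# Illustrative data: waiting times modeled as Exponential(rate).
y = np.array([0.8, 1.3, 0.4, 2.1, 0.9, 1.7])

def neg_log_likelihood(log_rate):
    # Optimize on the log scale so the rate stays positive.
    rate = np.exp(log_rate)
    # ell(theta) = sum_i log f(y_i | theta); we minimize its negative.
    return -np.sum(stats.expon.logpdf(y, scale=1.0 / rate))

result = optimize.minimize(neg_log_likelihood, x0=np.array([0.0]))
rate_hat = np.exp(result.x[0])
print(rate_hat, 1.0 / y.mean())  # numerical MLE vs. the closed-form answer 1 / ybar
```

The numerical maximizer agrees with the closed-form exponential MLE (the reciprocal of the sample mean), which is a useful sanity check for this kind of optimization.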
1. Full Likelihood (Parametric MLE) - "The Purist"
Philosophy: "I am willing to assume a specific, full probability distribution for my data (e.g., Normal, Binomial, Poisson). I will find the parameters that make this chosen distribution best fit the data."
Key Properties:
- Efficient: If your model is correct, MLE gives the most precise (lowest-variance) estimates possible (asymptotically).
- Asymptotically Normal: With large sample sizes, the estimates follow a normal distribution, making inference (confidence intervals, hypothesis tests) straightforward.
Methods:
- Maximum Likelihood Estimation (MLE): The standard approach. E.g., Estimating the mean (\( \mu \)) and variance (\( \sigma^2 \)) of a Normal distribution.
- Restricted / Constrained MLE: Maximizing the likelihood subject to constraints. E.g., Estimating a variance, which must be \( \sigma^2 > 0 \).
- EM Algorithm: A brilliant iterative technique for problems with missing data or latent variables. It alternates between an Expectation (E-step) and a Maximization (M-step).
- Example: In a Gaussian Mixture Model (a type of clustering), you don't know which cluster each point belongs to (missing data). The E-step calculates the probability of each point belonging to each cluster. The M-step then updates the cluster means and variances using those probabilities.
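A compact sketch of that E-step/M-step loop for a two-component, one-dimensional Gaussian mixture (the data are simulated and all names are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic data drawn from two clusters (illustrative only).
y = np.concatenate([rng.normal(-2, 1.0, 150), rng.normal(3, 1.5, 100)])

# Initial guesses for the mixing weights, means, and standard deviations.
pi, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibility of each cluster for each point.
    dens = pi * norm.pdf(y[:, None], loc=mu, scale=sd)   # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update weights, means, and variances using the responsibilities.
    nk = resp.sum(axis=0)
    pi = nk / len(y)
    mu = (resp * y[:, None]).sum(axis=0) / nk
    sd = np.sqrt((resp * (y[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sd)
```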
Example: You assume your data \( y \) is Normally distributed. You use MLE to find \( \mu \) and \( \sigma^2 \). Your log-likelihood function is:
\[ \ell(\mu, \sigma^2) = - \frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum (y_i - \mu)^2 \]
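Setting the derivatives of this log-likelihood to zero gives the familiar closed-form estimates \( \hat{\mu} = \bar{y} \) and \( \hat{\sigma}^2 = \frac{1}{n} \sum (y_i - \bar{y})^2 \) (note the divisor \( n \), not \( n - 1 \)). A quick check with made-up numbers:

```python
import numpy as np

y = np.array([4.1, 5.0, 3.7, 6.2, 5.4, 4.8])  # illustrative data
mu_hat = y.mean()                       # MLE of mu is the sample mean
sigma2_hat = np.mean((y - mu_hat)**2)   # MLE of sigma^2 uses divisor n, not n-1
print(mu_hat, sigma2_hat)
```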
2. Quasi-Likelihood / Pseudo-Likelihood - "The Pragmatist"
Philosophy: "I don't want to assume the full distribution. I only want to correctly specify the relationship for the mean (and maybe the variance). I'll use a 'likelihood-like' function that gives me good, robust estimates anyway."
Key Properties:
- Robust: Provides consistent estimates even if the probability distribution is wrong, as long as the mean model is correct.
- Solves Dispersion: Excellent for handling overdispersion or underdispersion (e.g., when the variance of your count data is larger or smaller than the mean, violating the Poisson assumption).
Methods:
- Quasi-MLE (QMLE): The workhorse. You maximize what's called a "quasi-likelihood" function. The point estimates typically coincide with those from the corresponding MLE (e.g., quasi-Poisson gives the same coefficients as Poisson), but the standard errors are adjusted to reflect the actual variance in the data.
- GLM Quasi-Likelihoods: This is where it shines. In a Generalized Linear Model (GLM), you specify a link function (e.g., the log link for Poisson-type counts) and a variance function (e.g., \( V(\mu) = \mu \) for Poisson). You don't specify the full distribution.
Example: You model count data with a Poisson regression (which assumes mean = variance). Your data are counts of insects on leaves, but the counts are more variable than expected. Instead of a complex model, you use quasi-Poisson regression. You still model \( \log(\text{mean}) = \beta_0 + \beta_1 x \), but the model estimates a dispersion parameter to inflate the standard errors, making your inference reliable.
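One way to see the quasi-Poisson adjustment in code (a sketch assuming statsmodels; the overdispersed counts are simulated): fit an ordinary Poisson GLM, estimate the dispersion from the Pearson residuals, and inflate the standard errors by its square root.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 2, 200)
# Simulated overdispersed counts: negative binomial data, Poisson-style mean model.
mu = np.exp(0.3 + 0.8 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))

X = sm.add_constant(x)
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Dispersion estimate: Pearson chi-square divided by residual degrees of freedom.
phi = poisson_fit.pearson_chi2 / poisson_fit.df_resid
quasi_se = poisson_fit.bse * np.sqrt(phi)   # quasi-Poisson standard errors
print(poisson_fit.params, poisson_fit.bse, quasi_se)
```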
3. Semiparametric Likelihood - "The Balanced Approach"
Philosophy: "I will carefully model the part of the system I care about (usually the effect of covariates), but I will leave the annoying nuisance parts (like the baseline hazard or error distribution) completely unspecified to avoid making bad assumptions."
Key Properties:
- Flexible and Robust: More robust than full parametric models but often more efficient (powerful) than fully nonparametric methods.
- Nuisance Parameters: Excels at dealing with infinite-dimensional nuisance parameters.
Methods:
- Partial Likelihood (Cox Proportional Hazards Model): The superstar example. In survival analysis, you want to know how covariates (e.g., age, treatment) affect the risk of an event (e.g., death). The Cox model allows you to estimate these hazard ratios without having to specify the underlying baseline hazard function at all. It's incredibly powerful and widely used in medicine.
- Profile Likelihood: A technique to handle nuisance parameters. You "profile out" the nuisance parameter by maximizing the likelihood over it for each fixed value of the parameter of interest. This reduces the problem to a function of just the parameter you care about.
- Example: You have data from two Normal distributions: \( X \sim N(\mu_1, \sigma^2) \) and \( Y \sim N(\mu_2, \sigma^2) \). You care about the difference \( \delta = \mu_1 - \mu_2 \), but the remaining parameters (the overall location and the common variance \( \sigma^2 \)) are nuisances. For each possible value of \( \delta \), you maximize the likelihood over those nuisance parameters. This creates the profile likelihood for \( \delta \), which you then maximize.
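A short sketch of that profile likelihood on simulated data; for clarity the inner maximization over the nuisance parameters is done numerically rather than in closed form:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, 40)   # X ~ N(mu1, sigma^2)
y = rng.normal(0.2, 1.0, 50)   # Y ~ N(mu2, sigma^2)

def profile_neg_loglik(delta):
    # For a fixed delta, maximize the likelihood over the nuisance
    # parameters mu2 and log(sigma), with mu1 = mu2 + delta.
    def inner(params):
        mu2, log_sigma = params
        sigma = np.exp(log_sigma)
        return -(stats.norm.logpdf(x, mu2 + delta, sigma).sum()
                 + stats.norm.logpdf(y, mu2, sigma).sum())
    return optimize.minimize(inner, x0=np.array([0.0, 0.0])).fun

# Maximize the profile likelihood over delta (i.e., minimize its negative).
delta_hat = optimize.minimize_scalar(profile_neg_loglik, bounds=(-3, 3),
                                     method="bounded").x
print(delta_hat, x.mean() - y.mean())  # should be close to the plug-in estimate
```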
Example: Studying the effect of a new drug on patient survival time. You use a Cox model. The model tells you that the drug reduces the hazard of death by 50% (a precise, interpretable effect size), without you ever having to model the complex pattern of survival times for all patients.
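As a minimal sketch of fitting a Cox model in practice (this assumes the lifelines package; the tiny dataset below is invented purely for illustration):

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical survival data: follow-up time, event indicator, and covariates.
df = pd.DataFrame({
    "time":  [5, 8, 12, 3, 9, 15, 7, 11],
    "event": [1, 1, 0, 1, 1, 0, 1, 1],   # 1 = death observed, 0 = censored
    "drug":  [1, 0, 1, 0, 1, 1, 0, 0],
    "age":   [62, 70, 55, 68, 59, 60, 72, 65],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()   # hazard ratio = exp(coef); an HR near 0.5 means ~50% lower hazard
```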
4. Likelihood-Based Inference - "Making Decisions"
Once you have estimates from maximizing a (log-)likelihood, you need to perform inference.
- Likelihood Ratio Test (LRT): The gold standard for comparing nested models. It compares the log-likelihood of the full model (\( \ell_{\text{full}} \)) to that of a reduced model (\( \ell_{\text{reduced}} \)). The test statistic is \( 2 (\ell_{\text{full}} - \ell_{\text{reduced}}) \), which under the null hypothesis approximately follows a Chi-squared distribution with degrees of freedom equal to the number of extra parameters in the full model (a short computation follows this list).
- Use it for: "Does adding these three new variables to my regression model significantly improve the fit?"
- Wald Test: Uses the curvature of the log-likelihood (the observed information matrix) to form a test. The test statistic is \( (\text{estimate} / \text{SE}(\text{estimate}))^2 \), which is also Chi-squared.
- Use it for: "Is this single coefficient \( \beta \) significantly different from zero?" (This is what standard regression output shows you).
- Score Test (Lagrange Multiplier Test): Uses the slope (gradient) of the log-likelihood evaluated at the null value. It only requires fitting the reduced model, which makes it useful when the full model is difficult or expensive to fit.
- Model Selection: AIC & BIC
- Akaike Information Criterion (AIC): \( \text{AIC} = -2 \ell + 2 k \). Penalizes model complexity (\( k \) = number of parameters). Designed for prediction accuracy. Choose the model with the lowest AIC.
- Bayesian Information Criterion (BIC): \( \text{BIC} = -2 \ell + k \log(n) \). Penalizes complexity more severely than AIC, especially in large samples (\( n \) = sample size). Designed to recover the true model when it is among the candidates. Choose the model with the lowest BIC.
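Here is a small sketch of these calculations, assuming you already have the maximized log-likelihoods of two nested models (the numbers are illustrative):

```python
import numpy as np
from scipy import stats

# Maximized log-likelihoods (illustrative numbers) and parameter counts.
ll_reduced, k_reduced = -540.2, 3
ll_full,    k_full    = -534.7, 6
n = 200  # sample size

# Likelihood ratio test: chi-squared with df = number of extra parameters.
lrt = 2 * (ll_full - ll_reduced)
p_value = stats.chi2.sf(lrt, df=k_full - k_reduced)

# Information criteria (lower is better).
aic_full = -2 * ll_full + 2 * k_full
bic_full = -2 * ll_full + k_full * np.log(n)
print(lrt, p_value, aic_full, bic_full)
```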
📌 Hierarchy Mnemonic & Practical Guide
| Method | When to Use It | Key Question to Ask | Real-World Analogy |
| --- | --- | --- | --- |
| Full MLE | You are confident in the data's distribution. Efficiency is key. | "Am I willing to bet that the errors are exactly Normal?" | Using a precise recipe from a renowned chef. Best results if followed exactly. |
| Quasi-MLE | Your focus is on the mean trend. The full distribution is messy. | "Is my data overdispersed? Or do I just care about getting the trend right?" | Following the main steps of a recipe but tweaking spices to your taste. Still makes a great dish. |
| Semiparametric | You care about specific effects (e.g., treatment) but not underlying shapes. | "Do I want to avoid assuming a shape for the baseline hazard or error distribution?" | Buying a perfectly tailored jacket (the covariate effect) without worrying about how the base fabric was woven (the nuisance parameter). |