Regression Models

A comprehensive overview of statistical modeling approaches

What Are Regression Models?

Regression models are statistical tools that examine how one or more input variables relate to an outcome variable. In simple terms, a regression model tries to find a relationship between independent variables (the inputs or features) and a dependent variable (the output or target). The goal is often to predict the value of the dependent variable based on the inputs or to understand how changes in the inputs affect the output.

Essentially, the model looks for a trend in the data and fits a line or curve that best represents that trend. This "best-fit" line helps us see the pattern: when the independent variables change, how does the dependent variable change on average? Think of plotting data points on a graph (for example, each point could represent a student's study hours vs. their exam score). A regression model will try to draw a line through these points that best follows their overall direction, creating a simple model that we can use for understanding and prediction.
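The study-hours example can be made concrete with a few lines of NumPy; the data points below are invented purely for illustration:

```python
import numpy as np

# Toy best-fit line: study hours vs. exam scores. All data points are made up.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
scores = np.array([52, 55, 61, 64, 68, 71, 75, 78, 84, 88], dtype=float)

# Degree-1 polyfit returns the slope and intercept of the least-squares line.
slope, intercept = np.polyfit(hours, scores, 1)

# The fitted line predicts the average score for any amount of study time.
predicted_12h = slope * 12 + intercept
```

The slope answers the "how does the outcome change on average?" question directly: it is the expected change in exam score per additional hour of study.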

Why Are Regression Models Used?

Regression models are extremely popular because they serve two main purposes: insight and prediction. First, they provide insight into the relationship between variables. A regression analysis can tell us whether and how strongly variables are related – for instance, it might reveal that there is a positive relationship between hours studied and exam score, meaning more study tends to lead to higher scores.

Second, regression models are used for predicting future or unknown values, making them fundamental in many fields. Because they uncover patterns in historical data, we can apply those patterns to forecast what might happen next. Businesses use regression for tasks like sales forecasting, risk analysis, and financial modeling. Scientists might use regression to predict outcomes like growth trends or climate measurements. In essence, a regression model lets us take information we already have and make an educated guess about something we don't know yet.

What Types of Problems Do They Solve?

Regression models are suited for supervised learning problems where we have historical data (with known outcomes) and want to predict or understand an outcome for new data. There are two main types of tasks they handle:

Predicting continuous numeric values: This is the classic use of regression. If the question asks "how much?" or "how many?", a regression model is likely the tool for the job. For example, predicting house prices given features like size, location, and age, or forecasting stock prices over time.

Predicting categories or yes/no outcomes (classification): A special type of regression model can also handle problems where the goal is to predict a categorical outcome. The most common example is logistic regression, which outputs a probability that an example belongs to a certain class. For instance, determining whether an email is "spam" or "not spam", or predicting whether a customer will make a purchase.
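A minimal sketch of the logistic-regression idea, fitted by gradient descent on synthetic data (the feature values, labels, and learning rate are all illustrative assumptions, not a production recipe):

```python
import numpy as np

# A toy logistic regression fitted by gradient descent on synthetic data.
# Feature values, labels, and the learning rate are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # two made-up features
true_w = np.array([2.0, -1.0])                # weights used only to generate labels
p_true = 1 / (1 + np.exp(-(X @ true_w)))
y = (rng.uniform(size=200) < p_true).astype(float)

# Fit: minimize the log-loss by gradient descent.
w = np.zeros(2)
for _ in range(2000):
    p_hat = 1 / (1 + np.exp(-(X @ w)))        # predicted probability of class 1
    w -= 0.1 * X.T @ (p_hat - y) / len(y)     # average gradient of the log-loss

# Predict the probability that a new example belongs to class 1.
new_example = np.array([1.0, 0.0])
prob = 1 / (1 + np.exp(-(new_example @ w)))
```

The key point for classification is the output: a probability between 0 and 1, which can be thresholded (e.g., at 0.5) to decide "spam" vs. "not spam".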

Regression Models Taxonomy

The following taxonomy illustrates the hierarchical structure of various regression techniques, from classical linear models to advanced approaches, helping researchers and analysts choose the most appropriate method for their specific data and research questions.

flowchart TD A["πŸ” Regression Models
Predicting an outcome based on variables"] %% Main branches A --> B["πŸ“ˆ Parametric Models"] A --> C["πŸ“Š Semiparametric Models"] A --> D["πŸ“Š Nonparametric Models"] B --> LM["πŸ“‹ Linear Models
(See the map below for details)"] B --> NLM["πŸ“‹ Nonlinear Models"] %% Semiparametric Models branch C --> COX["πŸ“‹ Cox Proportional Hazards Model"] C --> QRM["πŸ“‹ Quantitle Regression Models"] C --> GAM["πŸ“‹ Generalized Additive Models(GAMs)"] %% Nonparametric Models branch D --> DT["πŸ“‹ Decision Trees"] D --> KR["πŸ“‹ Kernel Regression"] D --> SPL["πŸ“‹ Splines"] GLMs["βš™οΈ Generalized Linear Models (GLMs)
Extend GLM to non-Normal data
via link functions & distributions"] %% Special case note GAM -. "Special case:
If each 𝑓𝑗(π‘₯𝑗) is just a straight line (𝛽𝑗π‘₯𝑗), then GAM reduces to an ordinary GLM." .-> GLMs click COX "/dedao/wuli/modeling/modeling_coxph" "Go to Cox Model Details Page" click GAM "/dedao/wuli/modeling/modeling_GAMs" "Go to GAMs Details Page" click D "/dedao/wuli/modeling/modeling_nonparametric" "Go to Nonparametric Models Details Page" click GLMs "/dedao/wuli/modeling/modeling_GLMs" "Go to GLMs Details Page" click QRM "/dedao/wuli/modeling/modeling_quantile" "Go to Quantile Regression Details Page"
flowchart TD A["πŸ” Parametric Models"] %% Main branches A --> B["πŸ“ˆ Linear Models"] A --> C["πŸ“Š Nonlinear Models"] %% General Linear Model branch B --> GLM["πŸ“‹ General Linear Model
(Matrix Form: Y = XΞ² + Ξ΅)"] GLM --> CLRM["βœ… Classical Linear Regression Model (CLRM)
Gauss–Markov assumptions:
β€’ Linearity in parameters
β€’ Random sampling from the population.
β€’ No perfect multicollinearity.
β€’ Zero conditional mean of errors (E(Ξ΅|X) = 0).
β€’ Homoscedasticity (constant variance of errors).
β€’ No autocorrelation (errors are uncorrelated with each other)."] CLRM --> SLR["1️⃣ Simple Linear Regression
(1 independent variable)"] CLRM --> MLR["πŸ”’ Multiple Linear Regression
(2+ independent variables)"] GLM --> ANOVA["πŸ“Š ANOVA
(Categorical predictors)"] GLM --> ANCOVA["πŸ”— ANCOVA
(Mixed continuous & categorical)"] %% Generalized Linear Models branch B --> GLMs["βš™οΈ Generalized Linear Models (GLMs)
Extend GLM to non-Normal data
via link functions & distributions"] GLMs --> Logistic["🎯 Logistic Regression
(Binary outcomes)"] GLMs --> MLogistic["πŸ“ˆ Multinomial / Ordered
Logistic Model
(Multi-Category outcomes)"] GLMs --> Beta["πŸ“ˆ Gamma/Beta/
Fractional Response Models
(Limited outcomes)"] GLMs --> PSM["πŸ”§ Parametric Survival Models
(Exponential, Weibull, etc.)"] B --> GLMEs["βš™οΈ Generalized Linear Model Extensions
Extend GLMs to non-Exponential family distributions"] GLMEs --> MProbit["πŸ“ˆ Multinomial / Ordered
Probit Model
(Multi-Category outcomes)"] GLMEs --> AFT["πŸ“ˆ Accelerated Failure Time Model
(Time-to-event outcomes)"] %% Special case note GLMs -. "Special case:
Normal distribution + Identity Link" .-> GLM %% Add internal application link to CLRM node click CLRM "/dedao/wuli/modeling/modeling_CLRM" "Go to CLRM Details Page" click Logistic "/dedao/wuli/modeling/modeling_logistic" "Go to Logistic Details Page" click GLMs "/dedao/wuli/modeling/modeling_GLMs" "Go to GLMs Details Page" click ANOVA "/dedao/wuli/modeling/modeling_anova" "Go to ANOVA Details Page" click MLogistic "/dedao/wuli/modeling/modeling_multicat" "Go to Multinomial / Ordered Logistic Model Details Page" click Beta "/dedao/wuli/modeling/modeling_beta" "Go to Gamma/Beta/Fractional Response Model Details Page"

🎯 Estimation Techniques for Regression

πŸ“‹ Category πŸ”§ Methods πŸ’‘ When to Use
πŸ“ Least Squares Methods β€’ OLS - Ordinary Least Squares
β€’ WLS - Weighted Least Squares
β€’ GLS - Generalized Least Squares
β€’ NLS - Nonlinear Least Squares
When assumptions hold, small/medium data, need interpretable results
πŸ“Š Likelihood-Based Methods β€’ MLE - Maximum Likelihood Estimation
β€’ REML - Restricted Maximum Likelihood
β€’ EM - Expectation-Maximization
When distribution is fully specified, need efficient estimates
🎯 Regularization Methods β€’ Ridge - L2 Regularization
β€’ Lasso - L1 Regularization
β€’ Elastic Net - L1 + L2 Combined
Many predictors, multicollinearity, prevent overfitting
πŸ”§ Moment-based Methods β€’ IV - Instrumental Variables
β€’ GMM - Generalized Method of Moments
β€’ GEE - Generalized Estimating Equations (GEE)
Endogeneity issues, causal inference, latent variables, correlated data (like longitudinal/repeated measures)
πŸ›‘οΈ Robust Methods β€’ Huber - M-estimators
β€’ Quantile - Beyond mean estimation
β€’ LAD - Least Absolute Deviation
Outliers present, non-normal errors, robust inference needed
🎲 Bayesian Methods β€’ MAP - Maximum A Posteriori
β€’ MCMC - Markov Chain Monte Carlo
β€’ VB - Variational Bayes
When prior information available, uncertainty quantification needed
πŸ€– Algorithm-Specific Optimization
(Mainly ML models)
β€’ Convex optimization - Support Vector Regression
β€’ Greedy splitting - Trees, Boosting
β€’ Gradient-based optimization - Neural Networks
Predictive accuracy priority, complex patterns, large datasets
πŸ“ˆ Assessment Methods β€’ Bootstrap - Empirical distribution
β€’ Cross-Validation - Model selection
β€’ Jackknife - Leave-one-out
Error estimation, model validation, stability assessment
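To make the regularization row concrete, here is a hedged sketch contrasting the OLS and ridge closed forms on deliberately collinear synthetic data; the penalty value `lam` is an arbitrary illustration:

```python
import numpy as np

# Sketch: OLS vs. ridge (L2) closed-form estimates on nearly collinear
# predictors. The data are synthetic and lam is an illustrative penalty choice.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)   # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=100)      # true combined effect is 2

# OLS: beta = (X'X)^{-1} X'y -- individual coefficients are unstable
# when X'X is near-singular, even though their sum is well estimated.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: beta = (X'X + lam*I)^{-1} X'y -- the penalty stabilizes the solve.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
```

Under multicollinearity the ridge coefficients land near (1, 1), splitting the shared effect, while the OLS pair can swing to large offsetting values.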

Linear Models

Models that assume linearity in parameters. These form the foundation of regression analysis and include multiple regression, ANOVA, and ANCOVA.

Generalized Linear Models

Extensions of linear models that can handle non-normal distributions through link functions. Examples include logistic regression for binary outcomes and Poisson regression for count data.

Gauss-Markov Assumptions

The foundational assumptions for classical linear regression:
β€’ Linearity in parameters
β€’ Random sampling from the population.
β€’ No perfect multicollinearity.
β€’ Zero conditional mean of errors (E(Ξ΅|X) = 0).
β€’ Homoscedasticity (constant variance of errors).
β€’ No autocorrelation (errors are uncorrelated with each other).

Nonlinear Models

Models that are nonlinear in their parameters, including exponential, logistic-growth, and other functional forms that cannot be transformed into a linear model. (Polynomial regression, despite fitting a curve, is still linear in its parameters and so belongs with the linear models.)
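As a sketch of how such models are fit (nonlinear least squares, the NLS entry in the table above), here is a small Gauss-Newton loop for the exponential model y = a·exp(b·x); the data, noise level, and starting-value trick are illustrative assumptions:

```python
import numpy as np

# Nonlinear least squares by Gauss-Newton for the model y = a * exp(b * x),
# which is nonlinear in b. Data and starting values are illustrative.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 2.0, 50)
y = 3.0 * np.exp(0.8 * x) + 0.05 * rng.normal(size=50)

# Starting values from a log-linear fit (a common trick when y > 0).
b, log_a = np.polyfit(x, np.log(y), 1)
a = np.exp(log_a)

for _ in range(20):
    f = a * np.exp(b * x)                       # current model predictions
    residuals = y - f
    # Jacobian of f with respect to (a, b), one row per observation.
    J = np.column_stack([np.exp(b * x), a * x * np.exp(b * x)])
    step = np.linalg.solve(J.T @ J, J.T @ residuals)
    a, b = a + step[0], b + step[1]
```

Unlike the linear case, there is no closed form, so the fit iterates: linearize around the current parameters, solve a small least-squares step, repeat.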


Understanding the Hierarchy: General vs. Generalized Linear Models

A Common Question: "I think general linear regression model is a special case of the generalized linear regression model. Why not put 'General Linear Regression Model' under 'Generalized Linear Regression Model'?"

This is an absolutely fantastic question that gets to the heart of why statistical terminology can be confusing. Your intuition is mathematically sharp and completely correct: a General Linear Model IS indeed a special case of a Generalized Linear Model.


The Mathematical Truth

A Generalized Linear Model has three components:

  1. Random Component - the probability distribution of Y from the exponential family,
  2. Systematic Component - the linear predictor (Ξ· = Ξ²β‚€ + β₁X₁ + ...), and
  3. Link Function - connects the mean of Y to the linear predictor.

When we choose the Normal distribution and the Identity link function (g(E(Y)) = E(Y)), the GLM becomes: E(Y) = Ξ²β‚€ + β₁X₁ + ... which is exactly the General Linear Model.
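This reduction can be checked numerically: on synthetic data, maximizing the Normal log-likelihood (here by simple gradient ascent, an illustrative choice) recovers the same coefficients as ordinary least squares:

```python
import numpy as np

# Numerical check on synthetic data: with the Normal distribution and the
# identity link, the maximum-likelihood fit coincides with ordinary least
# squares. Gradient ascent is just an illustrative way to find the MLE.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(200), rng.normal(size=200)])  # intercept + 1 predictor
y = X @ np.array([1.0, 2.0]) + rng.normal(size=200)

# OLS coefficients (closed form).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gaussian MLE by gradient ascent; the score with respect to beta is
# proportional to X'(y - X beta), which vanishes at the OLS solution.
beta_mle = np.zeros(2)
for _ in range(5000):
    beta_mle += 0.001 * X.T @ (y - X @ beta_mle)
```

Both routes solve the same normal equations, which is the computational face of the "Normal + identity link" special case.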


Why the Taxonomy Is Structured This Way

The reason we don't put "General Linear Model" under "Generalized Linear Model" in this taxonomy is less about mathematical hierarchy and more about historical development and conceptual clarity:

Historical Context: The General Linear Model came first, unifying classic techniques (linear regression, ANOVA, ANCOVA) for continuous, normally distributed data. The Generalized Linear Model was developed later (in 1972, by Nelder and Wedderburn) to extend this framework to non-normal distributions.

Conceptual Utility: For learning and practical application, it's more helpful to first separate models by their primary use case and data type, then show extensions. This avoids the initial confusion of similar names and presents the General Linear Model as the foundational starting point, with Generalized Linear Models as powerful extensions.


Alternative Mathematical Hierarchy

While our taxonomy prioritizes conceptual clarity, a mathematically pure hierarchy would indeed place General Linear Models as a special case under Generalized Linear Models. Both perspectives are valid; the choice depends on whether you're emphasizing mathematical relationships or practical learning progression.

Classical Linear Model vs General Linear Model: Subtle but Important Differences

While these terms are often used interchangeably, there are subtle differences in context and scope that are worth understanding. The mathematical core is identical, but they emphasize different aspects of linear modeling:

| Feature | Classical Linear Model (CLM) | General Linear Model (GLM) |
|---|---|---|
| Primary Emphasis | Often used in econometrics and theoretical contexts to emphasize the set of assumptions required for inference (e.g., the Gauss-Markov theorem). | Often used in experimental research (e.g., psychology, biology) and statistical software (like SPSS) as a unifying framework. |
| Typical Scope | Tends to refer specifically to multiple linear regression with continuous predictors. | Explicitly includes both continuous and categorical predictors (via dummy coding), so techniques like ANOVA and ANCOVA are considered special cases. |
| Connotation | "Classical" implies a focus on the traditional, foundationally important form of regression. | "General" implies a broader framework that encompasses several specific techniques. |

You can think of it this way:

  • The Classical Linear Model is the theoretical foundationβ€”the equation and its strict assumptions.
  • The General Linear Model is the application of that foundation to a wider array of statistical procedures (Regression, ANOVA, ANCOVA). It's a "general" framework for handling linear relationships.

In everyday conversation, especially when talking about multiple regression, the terms are used interchangeably. The distinction becomes more important when you see "General Linear Model" in software menus where it includes ANOVA procedures, reminding you that they are all part of the same linear family.


The Mathematical Relationship: CLRM as GLM + Assumptions

The Classical Linear Regression Model (CLRM) is the application of the General Linear Model (GLM) framework under strict classical assumptions:

CLRM = GLM + Gauss-Markov Assumptions + Normality Assumption

GLM (The Algebraic Structure)

This provides the basic mathematical form: Y = Ξ²β‚€ + β₁X₁ + ... + Ξ²β‚–Xβ‚– + Ξ΅
This is the "general" part. It doesn't impose strict rules on the error term yet.

+ Gauss-Markov Assumptions

This is where we define the ideal conditions for the error term (Ξ΅):

  • Linearity: The model is linear in parameters
  • Random Sampling: The data is a random sample from the population
  • No Perfect Collinearity: The independent variables are not perfectly correlated
  • Exogeneity: The conditional mean of the error term is zero: E(Ξ΅|X₁, Xβ‚‚, ...) = 0 (most important assumption)
  • Homoscedasticity: The error term has constant variance
  • No Autocorrelation: Errors are uncorrelated with each other

Meeting these assumptions guarantees that your OLS estimators are the Best Linear Unbiased Estimators (BLUE).
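The unbiasedness part of that claim can be sketched with a quick Monte Carlo simulation; the design matrix, true coefficients, and replication count below are illustrative choices:

```python
import numpy as np

# Simulation sketch of unbiasedness: under the assumptions above (fixed design,
# zero-mean, homoscedastic, uncorrelated errors), OLS estimates average out to
# the true coefficients across repeated samples.
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # fixed design matrix
true_beta = np.array([1.0, 2.0])

estimates = []
for _ in range(2000):
    y = X @ true_beta + rng.normal(size=50)   # fresh errors each replication
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates.append(b)

mean_beta = np.mean(estimates, axis=0)        # should sit close to true_beta
```

Any single estimate wobbles with the noise, but the average across replications settles on the true values, which is exactly what "unbiased" means.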

+ Normality Assumption

For the purpose of hypothesis testing and constructing confidence intervals, we often add:
Normality: The error term is normally distributed: Ξ΅ ~ N(0, σ²)
This assumption allows us to use t-tests and F-tests, which are derived from the normal distribution.
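A minimal sketch of the t-test computation using the classical variance estimate; the data and true coefficients are synthetic illustrations:

```python
import numpy as np

# Sketch of a regression t-test on synthetic data: the slope estimate divided
# by its standard error is compared against a t distribution with n - k
# degrees of freedom.
rng = np.random.default_rng(5)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
sigma2 = residuals @ residuals / (n - k)        # unbiased estimate of sigma^2
cov_beta = sigma2 * np.linalg.inv(X.T @ X)      # estimated Var(beta_hat)
se = np.sqrt(np.diag(cov_beta))
t_stat = beta[1] / se[1]                        # tests H0: slope = 0
```

Without the normality assumption, this ratio would not have an exact t distribution in small samples, which is why the assumption is added for inference.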

The CLRM is the specific, "ideal" or "classical" case of the more general framework, distinguished by its strict assumptions.