📘 The Complete Guide to Hypothesis Testing in Financial & Econometric Models

1. The "Why": Purpose and Intuition

Hypothesis testing is the statistical machinery that separates signal from noise. In finance and econometrics, where data is often messy and correlations can be spurious, it's not enough to just see a pattern; we must ask, "Is this pattern real?"

2. Core Framework: The Logic of Testing

Every hypothesis test follows the same logical process:

  1. Formulate Hypotheses:
    • Null Hypothesis (\( H_0 \)): The default, "skeptical" hypothesis of no effect or no relationship. (e.g., \( \beta_j = 0 \)).
    • Alternative Hypothesis (\( H_1 \) or \( H_a \)): The hypothesis you want to prove. (e.g., \( \beta_j \neq 0 \)).
  2. Construct a Test Statistic: Calculate a number from your sample data that measures the discrepancy from the null hypothesis. (e.g., a t-statistic).
  3. Determine the Reference Distribution: Under the assumption that \( H_0 \) is true, what is the theoretical distribution of your test statistic? (e.g., t-distribution, F-distribution, \( \chi^2 \)-distribution).
  4. Make a Decision: Compare your test statistic to the critical values of the reference distribution.
    • p-value: The probability of observing a test statistic at least as extreme as the one you computed, if the null hypothesis were true.
    • Significance Level (\( \alpha \)): A pre-defined threshold (e.g., 0.05, 0.01) for rejecting \( H_0 \). If p-value < \( \alpha \), we reject \( H_0 \) in favor of \( H_1 \). If not, we fail to reject \( H_0 \).

The Decision Matrix:

| | Reject \( H_0 \) | Fail to Reject \( H_0 \) |
|---|---|---|
| \( H_0 \) is True | Type I Error (False Positive) | Correct Decision |
| \( H_0 \) is False | Correct Decision (Power) | Type II Error (False Negative) |
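To make steps 2–4 concrete, here is a minimal sketch of the decision rule; the t-statistic, degrees of freedom, and \( \alpha \) below are hypothetical numbers, not taken from any model in this guide.

Python:

from scipy import stats

# Hypothetical inputs: a t-statistic, its degrees of freedom, and the significance level
t_stat, dof, alpha = 2.40, 120, 0.05

# Two-sided p-value: P(|T| >= |t_stat|) under H0
p_value = 2 * stats.t.sf(abs(t_stat), dof)

print(f"p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")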

3. Standard Tests in Regression

3.1. Individual coefficient test (t-test / Wald test)

Setup:

Linear regression

\[ y_t = \beta_0 + \beta_1 x_{1t} + \dots + \beta_k x_{kt} + \varepsilon_t \]

We want to test

\[ H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0 \]

Test statistic (OLS):

\[ t = \frac{\hat{\beta}_j - 0}{\text{SE}(\hat{\beta}_j)} \]

where

\[ \text{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 (X'X)^{-1}_{jj}} \]

and

\[ \hat{\sigma}^2 = \frac{\sum \hat{\varepsilon}_t^2}{n-k-1}. \]

Under \( H_0 \): \( t \sim t_{n-k-1} \).

Finance example:

In CAPM,

\[ R_{it} - R_{ft} = \alpha_i + \beta_i (R_{mt} - R_{ft}) + \varepsilon_t \]

Testing \( H_0: \alpha_i = 0 \). If rejected → evidence of abnormal return ("alpha").
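A minimal sketch of this alpha test with statsmodels OLS on simulated excess returns (all data below is synthetic, with true alpha set to zero, so the test should usually fail to reject):

Python:

import numpy as np
import statsmodels.api as sm

# Simulated excess returns with true alpha = 0
rng = np.random.default_rng(0)
mkt_excess = rng.normal(0.005, 0.04, 250)
asset_excess = 0.9 * mkt_excess + rng.normal(0.0, 0.02, 250)

capm = sm.OLS(asset_excess, sm.add_constant(mkt_excess)).fit()
print(capm.summary())  # the 'const' row is the t-test of H0: alpha = 0
print(f"t(alpha) = {capm.tvalues[0]:.3f}, p-value = {capm.pvalues[0]:.4f}")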

3.2. Joint significance test (F-test / Wald test for multiple restrictions)

Suppose we test

\[ H_0: \beta_j = \beta_k = 0 \]

(two coefficients jointly 0).

F-test (nested OLS models):

\[ F = \frac{(RSS_r - RSS_u)/q}{RSS_u / (n-k-1)} \]

where \( RSS_r \) and \( RSS_u \) are the residual sums of squares of the restricted and unrestricted models and \( q \) is the number of restrictions.

Under \( H_0 \): \( F \sim F(q, n-k-1) \).

Finance example:

In a credit risk model: test if liquidity ratios (\( \beta_{\text{liq}} \)) add anything beyond leverage (\( \beta_{\text{lev}} \)).
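As a sketch, the nested-model F-test can be run with statsmodels' compare_f_test. The data below is synthetic and the variable names (leverage, liquidity ratios) are illustrative; liquidity is generated as irrelevant, so the test should usually fail to reject.

Python:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
leverage = rng.normal(size=n)
liq1, liq2 = rng.normal(size=n), rng.normal(size=n)      # candidate liquidity ratios
default_score = 1.0 + 0.8 * leverage + rng.normal(size=n)  # liquidity truly irrelevant here

fit_u = sm.OLS(default_score, sm.add_constant(np.column_stack([leverage, liq1, liq2]))).fit()
fit_r = sm.OLS(default_score, sm.add_constant(leverage)).fit()

F, p, q = fit_u.compare_f_test(fit_r)   # H0: both liquidity coefficients are 0
print(f"F = {F:.3f}, p-value = {p:.4f}, restrictions = {q}")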

4. Likelihood-Based Tests (Logit/Probit, Survival Models)

4.1. Likelihood Ratio Test (Logit/Probit)

We are testing whether a subset of coefficients is equal to zero (or satisfies some restriction).

Statistic:

\[ LR = -2(\ell_r - \ell_u) \]

Distribution:

\[ LR \sim \chi^2_q \quad \text{under } H_0 \]

where \( q \) = number of restrictions.

Finance example:

Default probability (logit model):

\[ \Pr(\text{default}) = \frac{\exp(\beta_0 + \beta_1 \text{Leverage} + \beta_2 \text{Liquidity})}{1+\exp(\beta_0 + \beta_1 \text{Leverage} + \beta_2 \text{Liquidity})} \]
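A minimal sketch of the LR test for this logit on synthetic data; llf is the maximized log-likelihood of each fitted model.

Python:

import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(2)
n = 2000
lev = rng.normal(size=n)
liq = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-2.0 + 1.0 * lev - 0.5 * liq)))
default = rng.binomial(1, p)

fit_u = sm.Logit(default, sm.add_constant(np.column_stack([lev, liq]))).fit(disp=False)
fit_r = sm.Logit(default, sm.add_constant(lev)).fit(disp=False)

LR = -2 * (fit_r.llf - fit_u.llf)        # q = 1 restriction (beta_liquidity = 0)
print(f"LR = {LR:.2f}, p-value = {chi2.sf(LR, 1):.4f}")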

4.2. Wald test (general linear restrictions)

For a restriction of the form

\[ H_0: R\beta = r \]

where \( R \) is a \( q \times k \) restriction matrix and \( r \) is a \( q \times 1 \) vector.

Statistic:

\[ W = (R\hat{\beta} - r)' \big( R \, \widehat{\text{Var}}(\hat{\beta}) \, R' \big)^{-1} (R\hat{\beta} - r) \]

Under \( H_0 \): \( W \sim \chi^2_q \).

Finance example:

Suppose in a multifactor model you want to test if

\[ \beta_{SMB} + \beta_{HML} = 1 \]

(not just zero). Then \( R = [0, 1, 1, 0, \dots] \), \( r = 1 \).
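In statsmodels (0.12+), such a linear restriction can be passed to wald_test as a constraint string. A sketch on synthetic factor data; the factor names and coefficients are illustrative.

Python:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
factors = pd.DataFrame({'MKT': rng.normal(size=n),
                        'SMB': rng.normal(size=n),
                        'HML': rng.normal(size=n)})
ret = 0.5 * factors['SMB'] + 0.5 * factors['HML'] + rng.normal(scale=0.5, size=n)

fit = sm.OLS(ret, sm.add_constant(factors)).fit()

# H0: beta_SMB + beta_HML = 1 (one linear restriction, q = 1)
wtest = fit.wald_test('SMB + HML = 1', scalar=True)
print(wtest)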

4.3. Lagrange Multiplier (Score) Test

This test is useful when we don't want to estimate the full unrestricted model.

Statistic:

\[ LM = s(\hat{\beta}_r)' \, I(\hat{\beta}_r)^{-1} \, s(\hat{\beta}_r) \]

Distribution:

\[ LM \sim \chi^2_q \quad \text{under } H_0 \]

Finance example:

Credit risk model with logit/probit: test whether an additional driver (e.g., a lagged CDS spread or a liquidity ratio) improves the model, using only the score and information matrix evaluated at the restricted estimates; the unrestricted model never has to be estimated. A sketch follows below.
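A minimal hand-rolled sketch, assuming a logit likelihood, where the score is \( X'(y - p) \) and the information matrix is \( X' \text{diag}(p(1-p)) X \), both evaluated at the restricted estimates (synthetic data; only the restricted model is fitted):

Python:

import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

# Synthetic default data: Leverage matters, Liquidity is the candidate variable
rng = np.random.default_rng(4)
n = 2000
lev = rng.normal(size=n)
liq = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(-2.0 + 1.0 * lev - 0.5 * liq)))
y = rng.binomial(1, p_true)

# Estimate only the restricted model (Liquidity excluded)
fit_r = sm.Logit(y, sm.add_constant(lev)).fit(disp=False)

# Evaluate score and information of the FULL model at the restricted estimates
X_u = sm.add_constant(np.column_stack([lev, liq]))
beta_r = np.append(fit_r.params, 0.0)                  # beta_liq fixed at 0
p_hat = 1 / (1 + np.exp(-X_u @ beta_r))

score = X_u.T @ (y - p_hat)                            # logit score: X'(y - p)
info = X_u.T @ (X_u * (p_hat * (1 - p_hat))[:, None])  # X' diag(p(1-p)) X

LM = score @ np.linalg.solve(info, score)              # ~ chi2 with q = 1 under H0
print(f"LM = {LM:.2f}, p-value = {chi2.sf(LM, 1):.4f}")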

5. The "Trinity" of Econometric Tests (LR, Wald, LM)

| Test | Hypothesis (\( H_0 \) vs \( H_1 \)) | Test Statistic | Distribution (under \( H_0 \)) | Typical Finance Application |
|---|---|---|---|---|
| Likelihood Ratio (LR) | \( H_0: R\beta = r \) (e.g., \( \beta_j = 0 \)) vs \( H_1: R\beta \neq r \) | \( LR = -2(\ell_r - \ell_u) \), where \( \ell_r \) and \( \ell_u \) are the restricted and unrestricted log-likelihoods | \( \chi^2_q \), with \( q \) restrictions | Logistic regression of default: test if adding liquidity improves fit beyond leverage |
| Wald | \( H_0: R\beta = r \) vs \( H_1: R\beta \neq r \) (test on estimated coefficients) | \( W = (R\hat{\beta} - r)' \big( R \,\widehat{\text{Var}}(\hat{\beta})\, R' \big)^{-1} (R\hat{\beta} - r) \) | \( \chi^2_q \) | Fama-French 3-factor model: test if \( \beta_{SMB} + \beta_{HML} = 1 \) |
| Lagrange Multiplier (Score, LM) | \( H_0: R\beta = r \) vs \( H_1: R\beta \neq r \) | \( LM = s(\hat{\beta}_r)' I(\hat{\beta}_r)^{-1} s(\hat{\beta}_r) \), where \( s(\hat{\beta}_r) \) is the score and \( I(\hat{\beta}_r) \) the information matrix at the restricted estimates | \( \chi^2_q \) | Time-series credit risk: test if adding a lagged CDS spread improves fit without estimating the unrestricted model |

Key differences (intuitively):

  • LR compares the maximized log-likelihoods of the two models, so both the restricted and unrestricted models must be estimated.
  • Wald needs only the unrestricted model: it measures how far the unrestricted estimates are from satisfying the restriction.
  • LM needs only the restricted model: it measures how strongly the score (the gradient of the likelihood) pushes against the restriction at the restricted estimates.

They are asymptotically equivalent (large samples → same decision), but may differ in small samples.

6. Worked Numeric Example (Logit Default Model)

Let's test whether Liquidity adds explanatory power to a default prediction model.

Model:

\[ \Pr(\text{default}_i=1)=\text{logit}^{-1}\big(\beta_0+\beta_1\text{Leverage}_i+\beta_2\text{Liquidity}_i\big) \]

Null hypothesis (for all three tests): \( H_0: \beta_2 = 0 \), i.e., the Liquidity coefficient \( \beta_{\text{liq}} \) is zero (one restriction, \( q=1 \)).

Data from model estimation (used in all three tests below):

  • Log-likelihoods: \( \ell_u = -120.35 \) (unrestricted), \( \ell_r = -123.10 \) (restricted)
  • \( \hat{\beta}_{\text{liq}} = 0.45 \) with \( \text{SE}(\hat{\beta}_{\text{liq}}) = 0.21 \)
  • Score at the restricted estimates: \( s(\hat{\beta}_r) = 2.35 \), with \( I(\hat{\beta}_r)^{-1} = 0.90 \)

1) Likelihood Ratio (LR) test

Statistic

\[ LR=-2(\ell_r-\ell_u)=-2\big((-123.10)-(-120.35)\big)=5.50 \]

Reference distribution under \( H_0 \): \( \chi^2_1 \)

p-value \( \approx 0.019 \)

Decision (5% level): Reject \( H_0 \). Liquidity improves the model.

2) Wald test

Statistic

\[ z=\frac{\hat{\beta}_{\text{liq}}-0}{\text{SE}(\hat{\beta}_{\text{liq}})}=\frac{0.45}{0.21}\approx 2.143,\quad W=z^2\approx 4.59 \]

Reference distribution under \( H_0 \): \( \chi^2_1 \)

p-value \( \approx 0.032 \)

Decision (5% level): Reject \( H_0 \).

3) Lagrange Multiplier (Score, LM) test

Statistic

\[ LM=s(\hat{\beta}_r)^\top I(\hat{\beta}_r)^{-1}s(\hat{\beta}_r)= (2.35)^2 \times 0.90 \approx 4.97 \]

Reference distribution under \( H_0 \): \( \chi^2_1 \)

p-value \( \approx 0.026 \)

Decision (5% level): Reject \( H_0 \).
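As a quick check, the three p-values can be reproduced directly from the \( \chi^2_1 \) tail:

Python:

from scipy.stats import chi2

# Reproduce the three p-values from the chi-square(1) survival function
for name, stat in [('LR', 5.50), ('Wald', 4.59), ('LM', 4.97)]:
    print(f"{name}: statistic = {stat:.2f}, p-value = {chi2.sf(stat, 1):.3f}")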

Takeaways for practice: all three tests reject \( H_0 \) at the 5% level, consistent with their asymptotic equivalence. The statistics differ slightly (LR 5.50, LM 4.97, Wald 4.59) because the sample is finite; if such small-sample differences ever flip a decision, report all three and investigate the model.

7. Econometric Complications in Finance

8. Credit-Risk Logit Model Diagnostics with Python & SAS Code

A) Specification & Link Tests (is the logit form right?)

1. Pregibon Link Test (misspecification check)

Fit your model → get the linear predictor \( \hat{\eta} = x'\hat{\beta} \). Refit: default ~ \( \hat{\eta} \) and \( \hat{\eta}^2 \). A significant \( \hat{\eta}^2 \) coefficient signals that the logit link or functional form is misspecified.

Python (statsmodels):

import statsmodels.api as sm
import pandas as pd

# Fit the base model (assumes df holds the data and `features` lists the regressors)
X = sm.add_constant(df[features])
m1 = sm.Logit(df['default'], X).fit()
phat = m1.predict(X)  # predicted PDs, reused in the calibration checks below

# Linear predictor and its square for the link test
eta = m1.predict(X, linear=True)
Z = sm.add_constant(pd.DataFrame({'eta': eta, 'eta2': eta**2}))

# Refit: a significant eta2 coefficient signals misspecification
link_test = sm.Logit(df['default'], Z).fit()
print(link_test.summary())

SAS:

proc logistic data=cred;
  model default(event='1') = x1 x2 x3 / link=logit;
  output out=pred p=phat xbeta=eta;
run;

data pred;
  set pred;
  eta2 = eta*eta;
run;

proc logistic data=pred;
  model default(event='1') = eta eta2;
run;

2. RESET-style tests for logit

Augment with powers of the linear predictor \( \eta=\hat{\beta}^\top x \) (e.g., \( \eta^2,\eta^3 \)). Significance ⇒ missing nonlinearity.

Python:

# Get linear predictor
eta = m1.predict(X, linear=True)

# Create polynomial terms
reset_test_data = pd.DataFrame({
    'eta': eta,
    'eta2': eta**2,
    'eta3': eta**3
})
Z_reset = sm.add_constant(reset_test_data)

# Fit RESET test model
reset_test = sm.Logit(df['default'], Z_reset).fit()
print(reset_test.summary())  # Check significance of eta2, eta3

B) Calibration (are predicted PDs numerically right?)

B1. Global calibration

Calibration-in-the-large (CIL) asks whether the overall level of predicted PDs matches the observed default rate (offset-logit intercept ≈ 0); the calibration slope (CS) asks whether predictions are appropriately spread out (slope ≈ 1 when regressing default on \( \text{logit}(\hat{p}) \)).

Python:

import numpy as np

# Logit of predicted probabilities (as a plain array)
logit_phat = np.asarray(np.log(phat/(1-phat)))

# Calibration-in-the-large: intercept-only logit with logit(phat) as an offset
cil_model = sm.Logit(df['default'], np.ones(len(df)), offset=logit_phat)
cil_res = cil_model.fit(disp=False)
cil_intercept = cil_res.params[0]  # Should be close to 0

# Calibration slope: regress default on logit(phat)
cs_model = sm.Logit(df['default'], sm.add_constant(logit_phat))
cs_res = cs_model.fit(disp=False)
cal_slope = cs_res.params[1]  # Should be close to 1

print(f"CIL Intercept: {cil_intercept:.4f}, Calibration Slope: {cal_slope:.4f}")

SAS:

data with_off;
  set pred;
  logitp = log(phat/(1-phat));
  const = 1;
run;

/* Calibration-in-the-large */
proc logistic data=with_off;
  model default(event='1') = / noint;
  offset logitp;
run;

/* Calibration slope */
proc logistic data=with_off;
  model default(event='1') = logitp;
run;

B2. Bin-level calibration (portfolio view)

Python (bins + plot):

import matplotlib.pyplot as plt

# Attach predictions and create deciles of predicted PD
df['phat'] = phat
K = 10
df['bin'] = pd.qcut(df['phat'], K, labels=False, duplicates='drop')

# Calculate calibration statistics
cal = df.groupby('bin').agg(
    avg_pd=('phat', 'mean'),
    odr=('default', 'mean'),
    n=('default', 'size')
).reset_index()
cal['diff'] = cal['odr'] - cal['avg_pd']

# Plot calibration curve
plt.figure(figsize=(8, 6))
plt.plot(cal['avg_pd'], cal['odr'], 'o-', label='Model')
plt.plot([0, max(cal['avg_pd'])], [0, max(cal['avg_pd'])], 'k--', label='Perfect calibration')
plt.xlabel('Average Predicted Probability')
plt.ylabel('Observed Default Rate')
plt.title('Calibration Curve')
plt.legend()
plt.grid(True)
plt.show()

# Hosmer-Lemeshow test (approximate; df = number of bins - 2, in case deciles were merged)
from scipy.stats import chi2
hl_stat = np.sum((cal['odr'] - cal['avg_pd'])**2 * cal['n'] / (cal['avg_pd'] * (1 - cal['avg_pd'])))
hl_pvalue = chi2.sf(hl_stat, len(cal) - 2)
print(f"Hosmer-Lemeshow statistic: {hl_stat:.4f}, p-value: {hl_pvalue:.4f}")

B3. Proper scoring rules & calibration curves

Python:

from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Brier score
brier_score = brier_score_loss(df['default'], phat)
print(f"Brier Score: {brier_score:.4f}")

# Calibration curve with sklearn
prob_true, prob_pred = calibration_curve(df['default'], phat, n_bins=10, strategy='quantile')

plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, 's-', label='Model')
plt.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
plt.xlabel('Predicted Probability')
plt.ylabel('Observed Frequency')
plt.title('Calibration Curve')
plt.legend()
plt.grid(True)
plt.show()

C) Stability & Shift (will calibration hold in production?)

C1. Population & feature stability

Python (PSI function):

def calculate_psi(expected, actual, bins=10):
    """Calculate Population Stability Index"""
    # Create bins based on the expected (development) distribution
    breakpoints = np.percentile(expected, np.linspace(0, 100, bins + 1))
    breakpoints = np.unique(breakpoints)  # Remove duplicate edges
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf  # Catch actual values outside the dev range
    
    # Bin both distributions
    expected_binned = pd.cut(expected, breakpoints, include_lowest=True)
    actual_binned = pd.cut(actual, breakpoints, include_lowest=True)
    
    # Calculate percentages
    expected_pct = expected_binned.value_counts(normalize=True, sort=False).sort_index()
    actual_pct = actual_binned.value_counts(normalize=True, sort=False).sort_index()
    
    # Calculate PSI
    psi = np.sum((actual_pct - expected_pct) * np.log((actual_pct + 1e-12) / (expected_pct + 1e-12)))
    return psi

# Example usage with development and validation data
psi_value = calculate_psi(dev_data['phat'], val_data['phat'])
print(f"PSI: {psi_value:.4f}")

# Interpret PSI
if psi_value < 0.1:
    print("No significant population shift")
elif psi_value < 0.25:
    print("Moderate population shift")
else:
    print("Significant population shift - investigation needed")

C2. PD backtesting (Basel style)

Python (binomial test):

from statsmodels.stats.proportion import proportions_ztest

# Portfolio-level binomial test
obs_defaults = df['default'].sum()
exp_defaults = df['phat'].sum()
n_obs = len(df)

# z-test for proportion
stat, pval = proportions_ztest(obs_defaults, n_obs, exp_defaults/n_obs)
print(f"Binomial test: z-statistic = {stat:.4f}, p-value = {pval:.4f}")

# Traffic light approach
if pval > 0.1:
    print("Green zone - model is well calibrated")
elif pval > 0.05:
    print("Yellow zone - monitor closely")
elif pval > 0.01:
    print("Amber zone - investigate calibration")
else:
    print("Red zone - significant miscalibration, recalibration needed")

9. Practical Workflow (Credit PD Model)

  1. Hypothesis testing: t-tests/Wald for coefficients, LR for nested models.
  2. Spec check: link test, RESET, residuals.
  3. Discrimination: AUC, KS (see the sketch after this list).
  4. Calibration: CIL, CS, decile plots, recalibrate if needed.
  5. Stability: PSI, binomial backtests, challenger-champion comparison.
  6. Governance pack: summary charts + traffic-light signals.
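For step 3, a minimal sketch of the discrimination metrics with scikit-learn, assuming the df['default'] labels and df['phat'] predictions from Section 8:

Python:

from sklearn.metrics import roc_auc_score, roc_curve

# Discrimination metrics for the fitted PD model
auc = roc_auc_score(df['default'], df['phat'])
fpr, tpr, _ = roc_curve(df['default'], df['phat'])
ks = (tpr - fpr).max()  # KS statistic: maximum separation between the good/bad score CDFs

print(f"AUC = {auc:.3f}, KS = {ks:.3f}")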

This comprehensive guide covers the full spectrum of hypothesis testing in financial and econometric models, from fundamental concepts to advanced diagnostics with practical code implementation. The Python and SAS code snippets provide ready-to-use tools for model validation in credit risk and other financial applications.