📘 The Complete Guide to Hypothesis Testing in Financial & Econometric Models

1. The "Why": Purpose and Intuition

Hypothesis testing is the statistical machinery that separates signal from noise. In finance and econometrics, where data is often messy and correlations can be spurious, it's not enough to just see a pattern; we must ask, "Is this pattern real?"

2. Core Framework: The Logic of Testing

Every hypothesis test follows the same logical process:

  1. Formulate Hypotheses:
    • Null Hypothesis (\( H_0 \)): The default, "skeptical" hypothesis of no effect or no relationship. (e.g., \( \beta_j = 0 \)).
    • Alternative Hypothesis (\( H_1 \) or \( H_a \)): The hypothesis you want to prove. (e.g., \( \beta_j \neq 0 \)).
  2. Construct a Test Statistic: Calculate a number from your sample data that measures the discrepancy from the null hypothesis. (e.g., a t-statistic).
  3. Determine the Reference Distribution: Under the assumption that \( H_0 \) is true, what is the theoretical distribution of your test statistic? (e.g., t-distribution, F-distribution, \( \chi^2 \)-distribution).
  4. Make a Decision: Compare your test statistic to the critical values of the reference distribution.
    • p-value: The probability of observing a test statistic at least as extreme as the one you computed, if the null hypothesis were true.
    • Significance Level (\( \alpha \)): A pre-defined threshold (e.g., 0.05, 0.01) for rejecting \( H_0 \). If p-value < \( \alpha \), we reject \( H_0 \) in favor of \( H_1 \). If not, we fail to reject \( H_0 \).

The Decision Matrix:

| | Reject \( H_0 \) | Fail to Reject \( H_0 \) |
|---|---|---|
| \( H_0 \) is True | Type I Error (False Positive) | Correct Decision |
| \( H_0 \) is False | Correct Decision (Power) | Type II Error (False Negative) |
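To make steps 2–4 concrete, here is a minimal sketch of the decision rule; the t-statistic, degrees of freedom, and \( \alpha \) below are hypothetical numbers, not taken from any model in this guide.

Python:

from scipy import stats

# Hypothetical inputs: a t-statistic, its degrees of freedom, and the significance level
t_stat, dof, alpha = 2.40, 120, 0.05

# Two-sided p-value: P(|T| >= |t_stat|) under H0
p_value = 2 * stats.t.sf(abs(t_stat), dof)

print(f"p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")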

3. Standard Tests in Regression

3.1. Individual coefficient test (t-test / Wald test)

Setup:

Linear regression

\[ y_t = \beta_0 + \beta_1 x_{1t} + \dots + \beta_k x_{kt} + \varepsilon_t \]

We want to test

\[ H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0 \]

Test statistic (OLS):

\[ t = \frac{\hat{\beta}_j - 0}{\text{SE}(\hat{\beta}_j)} \]

where

\[ \text{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 (X'X)^{-1}_{jj}} \]

and

\[ \hat{\sigma}^2 = \frac{\sum \hat{\varepsilon}_t^2}{n-k-1}. \]

Under \( H_0 \): \( t \sim t_{n-k-1} \).

Finance example:

In CAPM,

\[ R_{it} - R_{ft} = \alpha_i + \beta_i (R_{mt} - R_{ft}) + \varepsilon_t \]

Testing \( H_0: \alpha_i = 0 \). If rejected → evidence of abnormal return ("alpha").
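A minimal sketch of this alpha test with statsmodels OLS on simulated excess returns (all data below is synthetic, with true alpha set to zero, so the test should usually fail to reject):

Python:

import numpy as np
import statsmodels.api as sm

# Simulated excess returns with true alpha = 0
rng = np.random.default_rng(0)
mkt_excess = rng.normal(0.005, 0.04, 250)
asset_excess = 0.9 * mkt_excess + rng.normal(0.0, 0.02, 250)

capm = sm.OLS(asset_excess, sm.add_constant(mkt_excess)).fit()
print(capm.summary())  # the 'const' row is the t-test of H0: alpha = 0
print(f"t(alpha) = {capm.tvalues[0]:.3f}, p-value = {capm.pvalues[0]:.4f}")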

3.2. Joint significance test (F-test / Wald test for multiple restrictions)

Suppose we test

\[ H_0: \beta_j = \beta_k = 0 \]

(two coefficients jointly 0).

F-test (nested OLS models):

\[ F = \frac{(RSS_r - RSS_u)/q}{RSS_u / (n-k-1)} \]

where \( RSS_r \) and \( RSS_u \) are the residual sums of squares of the restricted and unrestricted models and \( q \) is the number of restrictions.

Under \( H_0 \): \( F \sim F(q, n-k-1) \).

Finance example:

In a credit risk model: test if liquidity ratios (\( \beta_{\text{liq}} \)) add anything beyond leverage (\( \beta_{\text{lev}} \)).
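As a sketch, the nested-model F-test can be run with statsmodels' compare_f_test. The data below is synthetic and the variable names (leverage, liquidity ratios) are illustrative; liquidity is generated as irrelevant, so the test should usually fail to reject.

Python:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
leverage = rng.normal(size=n)
liq1, liq2 = rng.normal(size=n), rng.normal(size=n)      # candidate liquidity ratios
default_score = 1.0 + 0.8 * leverage + rng.normal(size=n)  # liquidity truly irrelevant here

fit_u = sm.OLS(default_score, sm.add_constant(np.column_stack([leverage, liq1, liq2]))).fit()
fit_r = sm.OLS(default_score, sm.add_constant(leverage)).fit()

F, p, q = fit_u.compare_f_test(fit_r)   # H0: both liquidity coefficients are 0
print(f"F = {F:.3f}, p-value = {p:.4f}, restrictions = {q}")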

4. Likelihood-Based Tests (Logit/Probit, Survival Models)

4.1. Likelihood Ratio Test (Logit/Probit)

We are testing whether a subset of coefficients is equal to zero (or satisfies some restriction).

Statistic:

\[ LR = -2(\ell_r - \ell_u) \]

Distribution:

\[ LR \sim \chi^2_q \quad \text{under } H_0 \]

where \( q \) = number of restrictions.

Finance example:

Default probability (logit model):

\[ \Pr(\text{default}) = \frac{\exp(\beta_0 + \beta_1 \text{Leverage} + \beta_2 \text{Liquidity})}{1+\exp(\beta_0 + \beta_1 \text{Leverage} + \beta_2 \text{Liquidity})} \]
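A minimal sketch of the LR test for this logit on synthetic data; llf is the maximized log-likelihood of each fitted model.

Python:

import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(2)
n = 2000
lev = rng.normal(size=n)
liq = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-2.0 + 1.0 * lev - 0.5 * liq)))
default = rng.binomial(1, p)

fit_u = sm.Logit(default, sm.add_constant(np.column_stack([lev, liq]))).fit(disp=False)
fit_r = sm.Logit(default, sm.add_constant(lev)).fit(disp=False)

LR = -2 * (fit_r.llf - fit_u.llf)        # q = 1 restriction (beta_liquidity = 0)
print(f"LR = {LR:.2f}, p-value = {chi2.sf(LR, 1):.4f}")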

4.2. Wald test (general linear restrictions)

For a restriction of the form

\[ H_0: R\beta = r \]

where \( R \) is a \( q \times k \) restriction matrix and \( r \) is a \( q \times 1 \) vector.

Statistic:

\[ W = (R\hat{\beta} - r)' \big( R \, \widehat{\text{Var}}(\hat{\beta}) \, R' \big)^{-1} (R\hat{\beta} - r) \]

Under \( H_0 \): \( W \sim \chi^2_q \).

Finance example:

Suppose in a multifactor model you want to test if

\[ \beta_{SMB} + \beta_{HML} = 1 \]

(not just zero). Then \( R = [0, 1, 1, 0, \dots] \), \( r = 1 \).
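In statsmodels (0.12+), such a linear restriction can be passed to wald_test as a constraint string. A sketch on synthetic factor data; the factor names and coefficients are illustrative.

Python:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
factors = pd.DataFrame({'MKT': rng.normal(size=n),
                        'SMB': rng.normal(size=n),
                        'HML': rng.normal(size=n)})
ret = 0.5 * factors['SMB'] + 0.5 * factors['HML'] + rng.normal(scale=0.5, size=n)

fit = sm.OLS(ret, sm.add_constant(factors)).fit()

# H0: beta_SMB + beta_HML = 1 (one linear restriction, q = 1)
wtest = fit.wald_test('SMB + HML = 1', scalar=True)
print(wtest)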

4.3. Lagrange Multiplier (Score) Test

This test is useful when we don't want to estimate the full unrestricted model.

Statistic:

\[ LM = s(\hat{\beta}_r)' \, I(\hat{\beta}_r)^{-1} \, s(\hat{\beta}_r) \]

Distribution:

\[ LM \sim \chi^2_q \quad \text{under } H_0 \]

Finance example:

Credit risk model with logit/probit: test whether an additional driver (e.g., a lagged CDS spread or a liquidity ratio) improves the model, using only the score and information matrix evaluated at the restricted estimates; the unrestricted model never has to be estimated. A sketch follows below.
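A minimal hand-rolled sketch, assuming a logit likelihood, where the score is \( X'(y - p) \) and the information matrix is \( X' \text{diag}(p(1-p)) X \), both evaluated at the restricted estimates (synthetic data; only the restricted model is fitted):

Python:

import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

# Synthetic default data: Leverage matters, Liquidity is the candidate variable
rng = np.random.default_rng(4)
n = 2000
lev = rng.normal(size=n)
liq = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(-2.0 + 1.0 * lev - 0.5 * liq)))
y = rng.binomial(1, p_true)

# Estimate only the restricted model (Liquidity excluded)
fit_r = sm.Logit(y, sm.add_constant(lev)).fit(disp=False)

# Evaluate score and information of the FULL model at the restricted estimates
X_u = sm.add_constant(np.column_stack([lev, liq]))
beta_r = np.append(fit_r.params, 0.0)                  # beta_liq fixed at 0
p_hat = 1 / (1 + np.exp(-X_u @ beta_r))

score = X_u.T @ (y - p_hat)                            # logit score: X'(y - p)
info = X_u.T @ (X_u * (p_hat * (1 - p_hat))[:, None])  # X' diag(p(1-p)) X

LM = score @ np.linalg.solve(info, score)              # ~ chi2 with q = 1 under H0
print(f"LM = {LM:.2f}, p-value = {chi2.sf(LM, 1):.4f}")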

5. The "Trinity" of Econometric Tests (LR, Wald, LM)

| Test | Hypothesis (\( H_0 \) vs \( H_1 \)) | Test Statistic | Distribution (under \( H_0 \)) | Typical Finance Application |
|---|---|---|---|---|
| Likelihood Ratio (LR) | \( H_0: R\beta = r \) (e.g., \( \beta_j = 0 \)) vs \( H_1: R\beta \neq r \) | \( LR = -2(\ell_r - \ell_u) \), where \( \ell_r \) and \( \ell_u \) are the restricted and unrestricted log-likelihoods | \( \chi^2_q \), with \( q \) restrictions | Logistic regression of default: test if adding liquidity improves fit beyond leverage |
| Wald | \( H_0: R\beta = r \) vs \( H_1: R\beta \neq r \) (test on estimated coefficients) | \( W = (R\hat{\beta} - r)' \big( R \,\widehat{\text{Var}}(\hat{\beta})\, R' \big)^{-1} (R\hat{\beta} - r) \) | \( \chi^2_q \) | Fama-French 3-factor model: test if \( \beta_{SMB} + \beta_{HML} = 1 \) |
| Lagrange Multiplier (Score, LM) | \( H_0: R\beta = r \) vs \( H_1: R\beta \neq r \) | \( LM = s(\hat{\beta}_r)' I(\hat{\beta}_r)^{-1} s(\hat{\beta}_r) \), where \( s(\hat{\beta}_r) \) is the score and \( I(\hat{\beta}_r) \) the information matrix at the restricted estimates | \( \chi^2_q \) | Time-series credit risk: test if adding a lagged CDS spread improves fit without estimating the unrestricted model |

Key differences (intuitively):

  • LR compares the maximized log-likelihoods of the two models, so both the restricted and unrestricted models must be estimated.
  • Wald needs only the unrestricted model: it measures how far the unrestricted estimates are from satisfying the restriction.
  • LM needs only the restricted model: it measures how strongly the score (the gradient of the likelihood) pushes against the restriction at the restricted estimates.

They are asymptotically equivalent (large samples → same decision), but may differ in small samples.

6. Worked Numeric Example (Logit Default Model)

Let's test whether Liquidity adds explanatory power to a default prediction model.

Model:

\[ \Pr(\text{default}_i=1)=\text{logit}^{-1}\big(\beta_0+\beta_1\text{Leverage}_i+\beta_2\text{Liquidity}_i\big) \]

Null hypothesis (for all three tests): \( H_0: \beta_2 = 0 \), i.e., the Liquidity coefficient \( \beta_{\text{liq}} \) is zero (one restriction, \( q=1 \)).

Data from model estimation (used in all three tests below):

  • Log-likelihoods: \( \ell_u = -120.35 \) (unrestricted), \( \ell_r = -123.10 \) (restricted)
  • \( \hat{\beta}_{\text{liq}} = 0.45 \) with \( \text{SE}(\hat{\beta}_{\text{liq}}) = 0.21 \)
  • Score at the restricted estimates: \( s(\hat{\beta}_r) = 2.35 \), with \( I(\hat{\beta}_r)^{-1} = 0.90 \)

1) Likelihood Ratio (LR) test

Statistic

\[ LR=-2(\ell_r-\ell_u)=-2\big((-123.10)-(-120.35)\big)=5.50 \]

Reference distribution under \( H_0 \): \( \chi^2_1 \)

p-value \( \approx 0.019 \)

Decision (5% level): Reject \( H_0 \). Liquidity improves the model.

2) Wald test

Statistic

\[ z=\frac{\hat{\beta}_{\text{liq}}-0}{\text{SE}(\hat{\beta}_{\text{liq}})}=\frac{0.45}{0.21}\approx 2.143,\quad W=z^2\approx 4.59 \]

Reference distribution under \( H_0 \): \( \chi^2_1 \)

p-value \( \approx 0.032 \)

Decision (5% level): Reject \( H_0 \).

3) Lagrange Multiplier (Score, LM) test

Statistic

\[ LM=s(\hat{\beta}_r)^\top I(\hat{\beta}_r)^{-1}s(\hat{\beta}_r)= (2.35)^2 \times 0.90 \approx 4.97 \]

Reference distribution under \( H_0 \): \( \chi^2_1 \)

p-value \( \approx 0.026 \)

Decision (5% level): Reject \( H_0 \).
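As a quick check, the three p-values can be reproduced directly from the \( \chi^2_1 \) tail:

Python:

from scipy.stats import chi2

# Reproduce the three p-values from the chi-square(1) survival function
for name, stat in [('LR', 5.50), ('Wald', 4.59), ('LM', 4.97)]:
    print(f"{name}: statistic = {stat:.2f}, p-value = {chi2.sf(stat, 1):.3f}")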

Takeaways for practice: all three tests reject \( H_0 \) at the 5% level, consistent with their asymptotic equivalence. The statistics differ slightly (LR 5.50, LM 4.97, Wald 4.59) because the sample is finite; if such small-sample differences ever flip a decision, report all three and investigate the model.

7. Econometric Complications in Finance

8. Credit-Risk Logit Model Diagnostics with Python & SAS Code

A) Specification & Link Tests (is the logit form right?)

1. Pregibon Link Test (misspecification check)

Fit your model → get the linear predictor \( \hat{\eta} = x'\hat{\beta} \). Refit: default ~ \( \hat{\eta} \) and \( \hat{\eta}^2 \). A significant \( \hat{\eta}^2 \) coefficient signals that the logit link or functional form is misspecified.

Python (statsmodels):

import statsmodels.api as sm
import pandas as pd

# Fit the base model (assumes df holds the data and `features` lists the regressors)
X = sm.add_constant(df[features])
m1 = sm.Logit(df['default'], X).fit()
phat = m1.predict(X)  # predicted PDs, reused in the calibration checks below

# Linear predictor and its square for the link test
eta = m1.predict(X, linear=True)
Z = sm.add_constant(pd.DataFrame({'eta': eta, 'eta2': eta**2}))

# Refit: a significant eta2 coefficient signals misspecification
link_test = sm.Logit(df['default'], Z).fit()
print(link_test.summary())

SAS:

proc logistic data=cred;
  model default(event='1') = x1 x2 x3 / link=logit;
  output out=pred p=phat xbeta=eta;
run;

data pred;
  set pred;
  eta2 = eta*eta;
run;

proc logistic data=pred;
  model default(event='1') = eta eta2;
run;

2. RESET-style tests for logit

Augment with powers of the linear predictor \( \eta=\hat{\beta}^\top x \) (e.g., \( \eta^2,\eta^3 \)). Significance ⇒ missing nonlinearity.

Python:

# Get linear predictor
eta = m1.predict(X, linear=True)

# Create polynomial terms
reset_test_data = pd.DataFrame({
    'eta': eta,
    'eta2': eta**2,
    'eta3': eta**3
})
Z_reset = sm.add_constant(reset_test_data)

# Fit RESET test model
reset_test = sm.Logit(df['default'], Z_reset).fit()
print(reset_test.summary())  # Check significance of eta2, eta3

B) Calibration (are predicted PDs numerically right?)

B1. Global calibration

Calibration-in-the-large (CIL) asks whether the overall level of predicted PDs matches the observed default rate (offset-logit intercept ≈ 0); the calibration slope (CS) asks whether predictions are appropriately spread out (slope ≈ 1 when regressing default on \( \text{logit}(\hat{p}) \)).

Python:

import numpy as np

# Logit of predicted probabilities (as a plain array)
logit_phat = np.asarray(np.log(phat/(1-phat)))

# Calibration-in-the-large: intercept-only logit with logit(phat) as an offset
cil_model = sm.Logit(df['default'], np.ones(len(df)), offset=logit_phat)
cil_res = cil_model.fit(disp=False)
cil_intercept = cil_res.params[0]  # Should be close to 0

# Calibration slope: regress default on logit(phat)
cs_model = sm.Logit(df['default'], sm.add_constant(logit_phat))
cs_res = cs_model.fit(disp=False)
cal_slope = cs_res.params[1]  # Should be close to 1

print(f"CIL Intercept: {cil_intercept:.4f}, Calibration Slope: {cal_slope:.4f}")

SAS:

data with_off;
  set pred;
  logitp = log(phat/(1-phat));
  const = 1;
run;

/* Calibration-in-the-large */
proc logistic data=with_off;
  model default(event='1') = / noint;
  offset logitp;
run;

/* Calibration slope */
proc logistic data=with_off;
  model default(event='1') = logitp;
run;

B2. Bin-level calibration (portfolio view)

Python (bins + plot):

import matplotlib.pyplot as plt

# Attach predictions and create deciles of predicted PD
df['phat'] = phat
K = 10
df['bin'] = pd.qcut(df['phat'], K, labels=False, duplicates='drop')

# Calculate calibration statistics
cal = df.groupby('bin').agg(
    avg_pd=('phat', 'mean'),
    odr=('default', 'mean'),
    n=('default', 'size')
).reset_index()
cal['diff'] = cal['odr'] - cal['avg_pd']

# Plot calibration curve
plt.figure(figsize=(8, 6))
plt.plot(cal['avg_pd'], cal['odr'], 'o-', label='Model')
plt.plot([0, max(cal['avg_pd'])], [0, max(cal['avg_pd'])], 'k--', label='Perfect calibration')
plt.xlabel('Average Predicted Probability')
plt.ylabel('Observed Default Rate')
plt.title('Calibration Curve')
plt.legend()
plt.grid(True)
plt.show()

# Hosmer-Lemeshow test (approximate; df = number of bins - 2, in case deciles were merged)
from scipy.stats import chi2
hl_stat = np.sum((cal['odr'] - cal['avg_pd'])**2 * cal['n'] / (cal['avg_pd'] * (1 - cal['avg_pd'])))
hl_pvalue = chi2.sf(hl_stat, len(cal) - 2)
print(f"Hosmer-Lemeshow statistic: {hl_stat:.4f}, p-value: {hl_pvalue:.4f}")

B3. Proper scoring rules & calibration curves

Python:

from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Brier score
brier_score = brier_score_loss(df['default'], phat)
print(f"Brier Score: {brier_score:.4f}")

# Calibration curve with sklearn
prob_true, prob_pred = calibration_curve(df['default'], phat, n_bins=10, strategy='quantile')

plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, 's-', label='Model')
plt.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
plt.xlabel('Predicted Probability')
plt.ylabel('Observed Frequency')
plt.title('Calibration Curve')
plt.legend()
plt.grid(True)
plt.show()

C) Stability & Shift (will calibration hold in production?)

C1. Population & feature stability

Python (PSI function):

def calculate_psi(expected, actual, bins=10):
    """Calculate Population Stability Index"""
    # Create bins based on the expected (development) distribution
    breakpoints = np.percentile(expected, np.linspace(0, 100, bins + 1))
    breakpoints = np.unique(breakpoints)  # Remove duplicate edges
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf  # Catch actual values outside the dev range
    
    # Bin both distributions
    expected_binned = pd.cut(expected, breakpoints, include_lowest=True)
    actual_binned = pd.cut(actual, breakpoints, include_lowest=True)
    
    # Calculate percentages
    expected_pct = expected_binned.value_counts(normalize=True, sort=False).sort_index()
    actual_pct = actual_binned.value_counts(normalize=True, sort=False).sort_index()
    
    # Calculate PSI
    psi = np.sum((actual_pct - expected_pct) * np.log((actual_pct + 1e-12) / (expected_pct + 1e-12)))
    return psi

# Example usage with development and validation data
psi_value = calculate_psi(dev_data['phat'], val_data['phat'])
print(f"PSI: {psi_value:.4f}")

# Interpret PSI
if psi_value < 0.1:
    print("No significant population shift")
elif psi_value < 0.25:
    print("Moderate population shift")
else:
    print("Significant population shift - investigation needed")

C2. PD backtesting (Basel style)

Python (binomial test):

from statsmodels.stats.proportion import proportions_ztest

# Portfolio-level binomial test
obs_defaults = df['default'].sum()
exp_defaults = df['phat'].sum()
n_obs = len(df)

# z-test for proportion
stat, pval = proportions_ztest(obs_defaults, n_obs, exp_defaults/n_obs)
print(f"Binomial test: z-statistic = {stat:.4f}, p-value = {pval:.4f}")

# Traffic light approach
if pval > 0.1:
    print("Green zone - model is well calibrated")
elif pval > 0.05:
    print("Yellow zone - monitor closely")
elif pval > 0.01:
    print("Amber zone - investigate calibration")
else:
    print("Red zone - significant miscalibration, recalibration needed")

9. Practical Workflow (Credit PD Model)

  1. Hypothesis testing: t-tests/Wald for coefficients, LR for nested models.
  2. Spec check: link test, RESET, residuals.
  3. Discrimination: AUC, KS (see the sketch after this list).
  4. Calibration: CIL, CS, decile plots, recalibrate if needed.
  5. Stability: PSI, binomial backtests, challenger-champion comparison.
  6. Governance pack: summary charts + traffic-light signals.
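For step 3, a minimal sketch of the discrimination metrics with scikit-learn, assuming the df['default'] labels and df['phat'] predictions from Section 8:

Python:

from sklearn.metrics import roc_auc_score, roc_curve

# Discrimination metrics for the fitted PD model
auc = roc_auc_score(df['default'], df['phat'])
fpr, tpr, _ = roc_curve(df['default'], df['phat'])
ks = (tpr - fpr).max()  # KS statistic: maximum separation between the good/bad score CDFs

print(f"AUC = {auc:.3f}, KS = {ks:.3f}")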

This comprehensive guide covers the full spectrum of hypothesis testing in financial and econometric models, from fundamental concepts to advanced diagnostics with practical code implementation. The Python and SAS code snippets provide ready-to-use tools for model validation in credit risk and other financial applications.