Hypothesis testing is the statistical machinery that separates signal from noise. In finance and econometrics, where data is often messy and correlations can be spurious, it's not enough to just see a pattern; we must ask, "Is this pattern real?"
Every hypothesis test follows the same logical process:
1. State the null hypothesis \( H_0 \) and the alternative \( H_1 \).
2. Choose a significance level \( \alpha \) (e.g., 5%).
3. Compute the test statistic and its p-value under \( H_0 \).
4. If the p-value \( < \alpha \), we reject \( H_0 \) in favor of \( H_1 \). If not, we fail to reject \( H_0 \) (a minimal numeric sketch follows the decision matrix below).

The Decision Matrix:
| | Reject \( H_0 \) | Fail to Reject \( H_0 \) |
|---|---|---|
| \( H_0 \) is True | Type I Error (False Positive) | Correct Decision |
| \( H_0 \) is False | Correct Decision (Power) | Type II Error (False Negative) |
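As a minimal numeric sketch of the decision rule (using simulated daily excess returns purely for illustration), a one-sample t-test of \( H_0 \): mean excess return is zero:

Python:

import numpy as np
from scipy import stats

# Simulated daily excess returns, for illustration only
rng = np.random.default_rng(42)
excess_returns = rng.normal(loc=0.0005, scale=0.01, size=250)

alpha = 0.05  # significance level
t_stat, p_value = stats.ttest_1samp(excess_returns, popmean=0.0)  # H0: mean = 0

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")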
Setup:
Linear regression:

\[ y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i \]

We want to test

\[ H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0. \]

Test statistic (OLS):

\[ t = \frac{\hat{\beta}_j}{\text{se}(\hat{\beta}_j)}, \]

where

\[ \text{se}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[(X'X)^{-1}\right]_{jj}} \]

and

\[ \hat{\sigma}^2 = \frac{1}{n-k-1} \sum_{i=1}^{n} \hat{\varepsilon}_i^2. \]
Under \( H_0 \): \( t \sim t_{n-k-1} \).
Finance example:
In CAPM, the asset's excess return is regressed on the market excess return:

\[ R_{it} - R_{ft} = \alpha_i + \beta_i (R_{mt} - R_{ft}) + \varepsilon_{it}. \]
Testing \( H_0: \alpha_i = 0 \). If rejected → evidence of abnormal return ("alpha").
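A minimal sketch of this alpha test with statsmodels OLS; the DataFrame `capm` and its columns `ret`, `rf`, `mkt` are assumptions for illustration:

Python:

import statsmodels.api as sm

# Asset and market excess returns (column names are hypothetical)
y = capm['ret'] - capm['rf']
x = sm.add_constant((capm['mkt'] - capm['rf']).rename('mkt_excess'))

capm_res = sm.OLS(y, x).fit()
# t-test of H0: alpha_i = 0 (the intercept)
print(f"alpha = {capm_res.params['const']:.4f}, "
      f"t = {capm_res.tvalues['const']:.3f}, p-value = {capm_res.pvalues['const']:.4f}")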
Suppose we test

\[ H_0: \beta_1 = \beta_2 = 0 \quad \text{vs} \quad H_1: \text{at least one of them} \neq 0 \]

(two coefficients jointly 0).

F-test (nested OLS models):

\[ F = \frac{(SSR_r - SSR_u)/q}{SSR_u/(n-k-1)}, \]

where \( SSR_r \) and \( SSR_u \) are the residual sums of squares of the restricted and unrestricted models and \( q \) is the number of restrictions.
Under \( H_0 \): \( F \sim F(q, n-k-1) \).
Finance example:
In a credit risk model: test whether liquidity ratios (\( \beta_{\text{liq}} \)) add explanatory power beyond leverage (\( \beta_{\text{lev}} \)).
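A sketch of this nested-model F-test with statsmodels; the DataFrame `credit` and the columns `spread`, `lev`, `liq1`, `liq2` are assumptions for illustration:

Python:

import statsmodels.api as sm

# Unrestricted model: leverage + two liquidity ratios; restricted model: leverage only
# (all column names are hypothetical)
res_u = sm.OLS(credit['spread'], sm.add_constant(credit[['lev', 'liq1', 'liq2']])).fit()
res_r = sm.OLS(credit['spread'], sm.add_constant(credit[['lev']])).fit()

# F-test that the q = 2 excluded coefficients are jointly zero
f_stat, f_pval, q = res_u.compare_f_test(res_r)
print(f"F = {f_stat:.3f}, p-value = {f_pval:.4f}, restrictions = {q:.0f}")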
We are testing whether a subset of coefficients is equal to zero (or satisfies some restriction).
Statistic:

\[ LR = -2\,(\ell_r - \ell_u), \]

where \( \ell_r \) and \( \ell_u \) are the maximized log-likelihoods of the restricted and unrestricted models.

Distribution:

\[ LR \sim \chi^2_q, \]

where \( q \) = number of restrictions.
Finance example:
Default probability (logit model):

\[ \Pr(\text{default}_i = 1 \mid x_i) = \Lambda\!\left(\beta_0 + \beta_1\,\text{lev}_i + \beta_2\,\text{liq}_i\right), \]

where \( \Lambda(\cdot) \) is the logistic CDF. The LR test compares the fit with and without the liquidity term.
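A minimal sketch of the LR test for this model; the DataFrame `credit` and columns `default`, `lev`, `liq` are assumptions for illustration:

Python:

import statsmodels.api as sm
from scipy.stats import chi2

# Unrestricted (leverage + liquidity) vs restricted (leverage only) logit
res_u = sm.Logit(credit['default'], sm.add_constant(credit[['lev', 'liq']])).fit(disp=False)
res_r = sm.Logit(credit['default'], sm.add_constant(credit[['lev']])).fit(disp=False)

lr_stat = -2 * (res_r.llf - res_u.llf)   # LR = -2(l_r - l_u)
p_value = chi2.sf(lr_stat, df=1)         # one restriction: beta_liq = 0
print(f"LR = {lr_stat:.3f}, p-value = {p_value:.4f}")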
For a restriction of the form

\[ H_0: R\beta = r, \]

where \( R \) is a \( q \times k \) matrix and \( r \) a \( q \times 1 \) vector.

Statistic:

\[ W = (R\hat{\beta} - r)' \left( R\,\widehat{\text{Var}}(\hat{\beta})\,R' \right)^{-1} (R\hat{\beta} - r). \]
Under \( H_0 \): \( W \sim \chi^2_q \).
Finance example:
Suppose in a multifactor model you want to test whether two factor loadings sum to one,

\[ H_0: \beta_1 + \beta_2 = 1 \]

(not just zero). Then \( R = [0, 1, 1, 0, \dots] \), \( r = 1 \).
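A sketch of such a Wald test via statsmodels' `wald_test`, using the Fama-French application (test \( \beta_{SMB} + \beta_{HML} = 1 \)); the DataFrame `ff` and columns `ex_ret`, `mkt`, `SMB`, `HML` are assumptions for illustration:

Python:

import numpy as np
import statsmodels.api as sm

# Hypothetical Fama-French regression: excess return on market, SMB, HML
X = sm.add_constant(ff[['mkt', 'SMB', 'HML']])
ff_res = sm.OLS(ff['ex_ret'], X).fit()

# H0: beta_SMB + beta_HML = 1, written as R beta = r
R = np.array([[0.0, 0.0, 1.0, 1.0]])   # column order: const, mkt, SMB, HML
r = np.array([1.0])
print(ff_res.wald_test((R, r), use_f=False))  # chi2_1 statistic and p-value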
This test is useful when we don't want to estimate the full unrestricted model.
Statistic:

\[ LM = s(\hat{\beta}_r)'\, I(\hat{\beta}_r)^{-1}\, s(\hat{\beta}_r), \]

where \( s(\hat{\beta}_r) \) is the score (gradient of the log-likelihood) and \( I(\hat{\beta}_r) \) the information matrix, both evaluated at the restricted estimates.

Distribution:

Under \( H_0 \): \( LM \sim \chi^2_q \).
Finance example:
Credit risk model with logit/probit: test whether adding a lagged CDS spread (or another candidate variable) improves fit, using only the already-estimated restricted model.
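A hand-rolled sketch of the score (LM) test for adding one candidate regressor to a logit default model, computing the score and information matrix at the restricted estimates; the DataFrame `credit` and columns `default`, `lev`, `cds_lag` are assumptions for illustration:

Python:

import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

# Restricted logit: default on leverage only (candidate regressor excluded)
X_r = sm.add_constant(credit[['lev']])
res_r = sm.Logit(credit['default'], X_r).fit(disp=False)

# Full design including the candidate regressor, with its coefficient fixed at 0
X_u = sm.add_constant(credit[['lev', 'cds_lag']]).to_numpy()
beta_r = np.append(res_r.params.to_numpy(), 0.0)

p = 1.0 / (1.0 + np.exp(-X_u @ beta_r))              # fitted probabilities under H0
score = X_u.T @ (credit['default'].to_numpy() - p)   # score s(beta_r) = X'(y - p)
info = X_u.T @ (X_u * (p * (1.0 - p))[:, None])      # information I(beta_r) = X'WX

lm_stat = score @ np.linalg.solve(info, score)
p_value = chi2.sf(lm_stat, df=1)                     # q = 1 restriction
print(f"LM = {lm_stat:.3f}, p-value = {p_value:.4f}")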
| Test | Hypothesis (\( H_0 \) vs \( H_1 \)) | Test Statistic | Distribution (under \( H_0 \)) | Typical Finance Application |
|---|---|---|---|---|
| Likelihood Ratio (LR) | \( H_0: R\beta = r \) (e.g., \( \beta_j = 0 \)) vs \( H_1: R\beta \neq r \) | \( LR = -2(\ell_r - \ell_u) \), where \( \ell_r \) = log-likelihood of restricted model, \( \ell_u \) = unrestricted | \( \chi^2_q \) with \( q \) restrictions | Logistic regression of default: test if adding liquidity improves fit beyond leverage |
| Wald | \( H_0: R\beta = r \) (test on estimated coefficients) vs \( H_1: R\beta \neq r \) | \( W = (R\hat{\beta} - r)' \left( R \,\widehat{\text{Var}}(\hat{\beta})\, R' \right)^{-1} (R\hat{\beta} - r) \) | \( \chi^2_q \) | Fama-French 3-factor model: test if \( \beta_{SMB} + \beta_{HML} = 1 \) |
| Lagrange Multiplier (Score, LM) | \( H_0: R\beta = r \) (same form) vs \( H_1: R\beta \neq r \) | \( LM = s(\hat{\beta}_r)' I(\hat{\beta}_r)^{-1} s(\hat{\beta}_r) \), where \( s(\hat{\beta}_r) \) = score at restricted estimates, \( I(\hat{\beta}_r) \) = information matrix | \( \chi^2_q \) | Time-series credit risk: test if adding lagged CDS spread improves model fit without re-estimating the unrestricted model |
They are asymptotically equivalent (large samples → same decision), but may differ in small samples.
Let's test whether Liquidity adds explanatory power to a default prediction model.
Model:

\[ \text{logit}\,\Pr(\text{default}_i = 1) = \beta_0 + \beta_1\,\text{lev}_i + \beta_2\,\text{liq}_i \]
Null hypothesis (for all three tests): \( H_0:\beta_{\text{liq}}=\beta_2=0 \) (one restriction \( q=1 \)).
Data from model estimation:
Likelihood Ratio (LR) test. Statistic: \( LR = -2(\ell_r - \ell_u) \)
Reference distribution under \( H_0 \): \( \chi^2_1 \)
p-value \( \approx 0.019 \)
Decision (5% level): Reject \( H_0 \). Liquidity improves the model.
Wald test. Statistic: \( W = \big(\hat{\beta}_2 / \text{se}(\hat{\beta}_2)\big)^2 \) (the squared t-ratio for the single restriction)
Reference distribution under \( H_0 \): \( \chi^2_1 \)
p-value \( \approx 0.032 \)
Decision (5% level): Reject \( H_0 \).
Lagrange Multiplier (LM) test. Statistic: \( LM = s(\hat{\beta}_r)'\, I(\hat{\beta}_r)^{-1}\, s(\hat{\beta}_r) \), evaluated at the leverage-only estimates
Reference distribution under \( H_0 \): \( \chi^2_1 \)
p-value \( \approx 0.026 \)
Decision (5% level): Reject \( H_0 \).
Takeaways for practice:
Link test: fit your model → get \( \hat{p} \). Refit: default ~ \( \hat{p} \) and \( \hat{p}^2 \). A significant \( \hat{p}^2 \) coefficient signals a misspecified functional form.
Python (statsmodels):
import statsmodels.api as sm
import pandas as pd
# Fit the baseline default model; `features` is the list of regressor column names
X = sm.add_constant(df[features])
m1 = sm.Logit(df['default'], X).fit()
phat = m1.predict(X)  # in-sample predicted default probabilities
# Create dataframe with phat and phat squared
link_test_data = pd.DataFrame({'phat': phat, 'phat2': phat**2})
Z = sm.add_constant(link_test_data)
# Fit link test model
link_test = sm.Logit(df['default'], Z).fit()
print(link_test.summary()) # Check if phat2 is significant
SAS:
proc logistic data=cred;
model default(event='1') = x1 x2 x3 / link=logit;
output out=pred p=phat;
run;
data pred;
set pred;
phat2 = phat*phat;
run;
proc logistic data=pred;
model default(event='1') = phat phat2;
run;
Augment with powers of the linear predictor \( \eta=\hat{\beta}^\top x \) (e.g., \( \eta^2,\eta^3 \)). Significance ⇒ missing nonlinearity.
Python:
# Get the linear predictor eta = X*beta_hat (for discrete models, fittedvalues is X*beta)
eta = m1.fittedvalues
# Create polynomial terms
reset_test_data = pd.DataFrame({
'eta': eta,
'eta2': eta**2,
'eta3': eta**3
})
Z_reset = sm.add_constant(reset_test_data)
# Fit RESET test model
reset_test = sm.Logit(df['default'], Z_reset).fit()
print(reset_test.summary()) # Check significance of eta2, eta3
Calibration-in-the-large (CIL): refit default ~ 1 with logit offset = logit(\( \hat{p} \)); the intercept should be 0.

Calibration slope: refit default ~ logit(\( \hat{p} \)); the slope should be 1.

Python:
import numpy as np
# Calculate logit of predicted probabilities
logit_phat = np.asarray(np.log(phat/(1-phat)))
# CIL (offset fit): intercept-only model, with the offset passed to the model (not to fit)
cil_model = sm.Logit(df['default'], np.ones(len(df)), offset=logit_phat)
cil_res = cil_model.fit(disp=False)
cil_intercept = cil_res.params[0]  # Should be close to 0
# CS (slope fit): regress default on logit(phat)
cs_model = sm.Logit(df['default'], sm.add_constant(logit_phat))
cs_res = cs_model.fit(disp=False)
cal_slope = cs_res.params[1]  # Should be close to 1
print(f"CIL Intercept: {cil_intercept:.4f}, Calibration Slope: {cal_slope:.4f}")
SAS:
data with_off;
set pred;
logitp = log(phat/(1-phat));
const = 1;
run;
/* Calibration-in-the-large: intercept-only model with logit(phat) as offset */
proc logistic data=with_off;
model default(event='1') = / offset=logitp;
run;
/* Calibration slope */
proc logistic data=with_off;
model default(event='1') = logitp;
run;
Python (bins + plot):
import matplotlib.pyplot as plt
# Create deciles
K = 10
df['phat'] = phat  # attach predictions so they can be grouped with observed defaults
df['bin'] = pd.qcut(df['phat'], K, labels=False, duplicates='drop')
# Calculate calibration statistics per bin
cal = df.groupby('bin').agg(
    avg_pd=('phat', 'mean'),
    odr=('default', 'mean'),
    n=('default', 'size')
).reset_index()
cal['diff'] = cal['odr'] - cal['avg_pd']
# Plot calibration curve
plt.figure(figsize=(8, 6))
plt.plot(cal['avg_pd'], cal['odr'], 'o-', label='Model')
plt.plot([0, max(cal['avg_pd'])], [0, max(cal['avg_pd'])], 'k--', label='Perfect calibration')
plt.xlabel('Average Predicted Probability')
plt.ylabel('Observed Default Rate')
plt.title('Calibration Curve')
plt.legend()
plt.grid(True)
plt.show()
# Hosmer-Lemeshow test (approximate)
G = len(cal)  # number of groups actually formed (may be < K if bins were merged)
hl_stat = np.sum((cal['odr'] - cal['avg_pd'])**2 * cal['n'] / (cal['avg_pd'] * (1 - cal['avg_pd'])))
from scipy.stats import chi2
hl_pvalue = 1 - chi2.cdf(hl_stat, G - 2)
print(f"Hosmer-Lemeshow statistic: {hl_stat:.4f}, p-value: {hl_pvalue:.4f}")
Python:
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss
# Brier score
brier_score = brier_score_loss(df['default'], phat)
print(f"Brier Score: {brier_score:.4f}")
# Calibration curve with sklearn
prob_true, prob_pred = calibration_curve(df['default'], phat, n_bins=10, strategy='quantile')
plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, 's-', label='Model')
plt.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
plt.xlabel('Predicted Probability')
plt.ylabel('Observed Frequency')
plt.title('Calibration Curve')
plt.legend()
plt.grid(True)
plt.show()
Python (PSI function):
def calculate_psi(expected, actual, bins=10):
"""Calculate Population Stability Index"""
# Create bins based on expected distribution
breakpoints = np.percentile(expected, np.linspace(0, 100, bins + 1))
breakpoints = np.unique(breakpoints) # Remove duplicates
# Bin both distributions
expected_binned = pd.cut(expected, breakpoints, include_lowest=True)
actual_binned = pd.cut(actual, breakpoints, include_lowest=True)
# Calculate percentages
expected_pct = expected_binned.value_counts(normalize=True, sort=False).sort_index()
actual_pct = actual_binned.value_counts(normalize=True, sort=False).sort_index()
# Calculate PSI
psi = np.sum((actual_pct - expected_pct) * np.log((actual_pct + 1e-12) / (expected_pct + 1e-12)))
return psi
# Example usage with development and validation data
psi_value = calculate_psi(dev_data['phat'], val_data['phat'])
print(f"PSI: {psi_value:.4f}")
# Interpret PSI
if psi_value < 0.1:
print("No significant population shift")
elif psi_value < 0.25:
print("Moderate population shift")
else:
print("Significant population shift - investigation needed")
Python (binomial test):
from statsmodels.stats.proportion import proportions_ztest
# Portfolio-level binomial test
obs_defaults = df['default'].sum()
exp_defaults = df['phat'].sum()
n_obs = len(df)
# z-test for proportion
stat, pval = proportions_ztest(obs_defaults, n_obs, exp_defaults/n_obs)
print(f"Binomial test: z-statistic = {stat:.4f}, p-value = {pval:.4f}")
# Traffic light approach
if pval > 0.1:
print("Green zone - model is well calibrated")
elif pval > 0.05:
print("Yellow zone - monitor closely")
elif pval > 0.01:
print("Amber zone - investigate calibration")
else:
print("Red zone - significant miscalibration, recalibration needed")
This comprehensive guide covers the full spectrum of hypothesis testing in financial and econometric models, from fundamental concepts to advanced diagnostics, with practical code implementations. The Python and SAS snippets provide ready-to-use tools for model validation in credit risk and other financial applications.