Credit Risk Model Evaluation

A Comprehensive Guide to PD Model Performance Assessment

1. What is Model Performance Evaluation?

In credit risk, a PD model is not an academic exercise; it's a critical tool that directly impacts a bank's profitability, regulatory capital requirements, and strategic decisions. Therefore, evaluation is a rigorous process to answer these questions:

  • Statistical Soundness: Does the model have a solid mathematical foundation? Are its predictions reliable and not just lucky?
  • Predictive Power: Can it effectively rank borrowers from safest to riskiest? Does it produce accurate default probabilities?
  • Business Utility: Can we explain why it gives a certain score? Does it align with economic intuition? Is it practical to implement and monitor?
  • Regulatory Compliance: Does it meet the standards set by regulators (e.g., Basel Accords for IRB models)?

2. Dimensions of Model Evaluation

(a) Discriminatory Power (Who is Risky?)

This is about ranking, not necessarily the exact probability.

AUC-ROC Curve (Area Under the Receiver Operating Characteristic Curve)

What it is: The probability that the model will rank a randomly chosen defaulted loan (bad) higher than a randomly chosen non-defaulted loan (good). An AUC of 1.0 is perfect; 0.5 is no better than a coin flip.

Example: Model A has an AUC of 0.82. This means there is an 82% chance that a randomly chosen defaulted loan will have a higher (worse) PD score than a randomly chosen non-defaulted one. This is generally considered good.

Gini Coefficient / Accuracy Ratio

Gini = 2 × AUC - 1

An AUC of 0.82 equals a Gini of 0.64. Interpretation: 0% = no power, 100% = perfect power.
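
As a minimal sketch, both numbers can be computed in a few lines of Python with scikit-learn; the outcome labels and PD scores below are made-up illustrations, not real portfolio data.

  import numpy as np
  from sklearn.metrics import roc_auc_score

  # Illustrative data: 1 = defaulted ("bad"), 0 = non-defaulted ("good"),
  # plus the model's predicted PDs for the same loans.
  y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
  pd_scores = np.array([0.02, 0.05, 0.30, 0.08, 0.45, 0.03, 0.12, 0.25, 0.07, 0.04])

  auc = roc_auc_score(y_true, pd_scores)   # chance a random bad outranks a random good
  gini = 2 * auc - 1                       # Accuracy Ratio
  print(f"AUC  = {auc:.3f}")
  print(f"Gini = {gini:.3f}")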

Kolmogorov-Smirnov (KS) Statistic

What it is: The maximum difference between the cumulative distribution of the "good" population and the "bad" population. A higher KS indicates better separation.

Example: The maximum vertical distance between the two cumulative distribution lines is 0.45 (or 45%). A KS above 40% is typically very strong, but it can be sensitive to sample size.
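
The same idea in Python, using scipy's two-sample KS routine on synthetic scores (the data-generating assumptions here are purely illustrative):

  import numpy as np
  from scipy.stats import ks_2samp

  # Synthetic portfolio: roughly 5% default rate, with bads tending to get higher PDs.
  rng = np.random.default_rng(0)
  y_true = rng.binomial(1, 0.05, 10_000)
  pd_scores = np.clip(0.05 + 0.10 * y_true + rng.normal(0, 0.03, 10_000), 0, 1)

  # Maximum gap between the cumulative score distributions of bads and goods.
  res = ks_2samp(pd_scores[y_true == 1], pd_scores[y_true == 0])
  print(f"KS = {res.statistic:.2%}")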

(b) Calibration (How Risky are They, Exactly?)

A model can have great ranking but terrible probabilities. Calibration is crucial for pricing loans and calculating expected loss (EL = PD × LGD × EAD).
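
For example, a facility with a PD of 2%, an LGD of 45%, and an EAD of £100,000 (purely illustrative figures) carries an expected loss of 0.02 × 0.45 × £100,000 = £900.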

Calibration Plot / Reliability Diagram

How it works:
  1. Split the test loans into 10-20 bins based on their predicted PD (e.g., 0-2%, 2-4%, ..., 18-20%)
  2. For each bin, calculate the average predicted PD
  3. For each bin, calculate the actual observed default rate
  4. Plot the predicted PD (x-axis) vs. the actual default rate (y-axis)

Interpretation: Points falling on the 45-degree line indicate perfect calibration. A pattern above the line means the model is too optimistic (under-predicting risk); below the line means it's too pessimistic (over-predicting risk).
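
The four steps translate almost directly into code. A minimal sketch with pandas and matplotlib, using synthetic predictions and outcomes and ten equal-width bins (all inputs are illustrative):

  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt

  # Synthetic test set: outcomes are drawn at the predicted rate, so this toy
  # example should land close to the 45-degree line.
  rng = np.random.default_rng(42)
  pred_pd = rng.uniform(0.0, 0.20, 5_000)
  default = rng.binomial(1, pred_pd)

  df = pd.DataFrame({"pred_pd": pred_pd, "default": default})
  df["bin"] = pd.cut(df["pred_pd"], bins=np.linspace(0.0, 0.20, 11))   # step 1: 0-2%, 2-4%, ...

  calib = df.groupby("bin", observed=True).agg(
      avg_pred=("pred_pd", "mean"),    # step 2: average predicted PD per bin
      obs_rate=("default", "mean"),    # step 3: observed default rate per bin
  )

  plt.plot(calib["avg_pred"], calib["obs_rate"], "o-", label="model")   # step 4
  plt.plot([0, 0.20], [0, 0.20], "--", label="perfect calibration")
  plt.xlabel("Predicted PD")
  plt.ylabel("Observed default rate")
  plt.legend()
  plt.show()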

Hosmer-Lemeshow (HL) Test

What it is: A formal statistical test for calibration. A low p-value (e.g., <0.05) indicates a significant difference between predicted and actual defaults, meaning you reject the hypothesis that the model is well-calibrated. You want a high p-value here.
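
There is no standard one-line HL routine in the common Python statistics libraries, so it is usually hand-rolled. A sketch, assuming quantile groups of the predicted PD (the helper function and simulated data are illustrative):

  import numpy as np
  import pandas as pd
  from scipy.stats import chi2

  def hosmer_lemeshow(pred_pd, default, n_groups=10):
      """Hosmer-Lemeshow chi-square statistic and p-value on quantile groups of predicted PD."""
      data = pd.DataFrame({"p": pred_pd, "y": default})
      data["group"] = pd.qcut(data["p"], q=n_groups, duplicates="drop")
      grp = data.groupby("group", observed=True)
      n = grp["y"].count()          # loans per group
      obs_def = grp["y"].sum()      # observed defaults
      exp_def = grp["p"].sum()      # expected defaults = sum of predicted PDs
      hl = (((obs_def - exp_def) ** 2) / (exp_def * (1 - exp_def / n))).sum()
      p_value = chi2.sf(hl, df=len(n) - 2)
      return hl, p_value

  # Toy check: outcomes simulated at the predicted rates, so calibration should pass.
  rng = np.random.default_rng(0)
  pred = rng.uniform(0.01, 0.20, 5_000)
  y = rng.binomial(1, pred)
  hl, p = hosmer_lemeshow(pred, y)
  print(f"HL statistic = {hl:.2f}, p-value = {p:.3f}")   # a high p-value is what we want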

(c) Stability & Robustness (Does it Work Tomorrow?)

Models can "decay" as economic conditions and borrower profiles change.

Population Stability Index (PSI)

What it is: Measures how the distribution of model scores has shifted between a baseline dataset (e.g., training) and a current dataset (e.g., recent applicants).
Interpretation:
  • PSI < 0.1: No significant change. The population is stable.
  • PSI 0.1 - 0.25: Moderate change. Investigate the shift.
  • PSI > 0.25: Significant change. The model may need redevelopment or recalibration.
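
A minimal PSI sketch in Python, assuming decile bins taken from the baseline scores; the synthetic "training" and "recent" score distributions are illustrative:

  import numpy as np

  def psi(baseline, current, n_bins=10):
      """Population Stability Index between two score distributions (illustrative sketch)."""
      # Bin edges from the baseline deciles, reused for the current population.
      edges = np.quantile(baseline, np.linspace(0, 1, n_bins + 1))
      base_bins = np.clip(np.searchsorted(edges, baseline, side="right") - 1, 0, n_bins - 1)
      curr_bins = np.clip(np.searchsorted(edges, current, side="right") - 1, 0, n_bins - 1)
      base_pct = np.bincount(base_bins, minlength=n_bins) / len(baseline)
      curr_pct = np.bincount(curr_bins, minlength=n_bins) / len(current)
      # Small floor avoids log(0) in empty bins.
      base_pct = np.clip(base_pct, 1e-6, None)
      curr_pct = np.clip(curr_pct, 1e-6, None)
      return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

  rng = np.random.default_rng(1)
  train_scores = rng.beta(2, 20, 10_000)    # baseline (training) PDs
  recent_scores = rng.beta(2, 16, 10_000)   # recent applicants, shifted slightly riskier
  print(f"PSI = {psi(train_scores, recent_scores):.3f}")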

(d) Business Use & Interpretability (Can We Trust and Use It?)

Sign and Significance of Variables

Do the model coefficients make sense?

  • Good Sign: Debt-to-Income Ratio has a strong positive coefficient (higher debt → higher PD)
  • Bad Sign: Years in Business has a positive coefficient (longer-established firms are riskier?)
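
One quick way to run this sanity check is to fit the scorecard regression with statsmodels and read the coefficient signs and p-values off the summary. A sketch on synthetic data, with illustrative variable names:

  import numpy as np
  import pandas as pd
  import statsmodels.api as sm

  # Synthetic borrowers: higher debt-to-income raises PD, more years in business lowers it.
  rng = np.random.default_rng(7)
  n = 5_000
  X = pd.DataFrame({
      "debt_to_income": rng.uniform(0.0, 0.8, n),
      "years_in_business": rng.integers(0, 30, n),
  })
  true_logit = -3 + 4 * X["debt_to_income"] - 0.05 * X["years_in_business"]
  y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

  model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
  print(model.summary())   # expect a positive, significant coefficient on debt_to_income
                           # and a negative one on years_in_business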

3. Model Selection: A Practical Scenario

Task: Choose a PD model for a UK SME (small and medium-sized enterprise) portfolio.

Candidates: Logistic Regression (LR), Random Forest (RF), XGBoost (XGB)

Metric             Logistic Regression     Random Forest              XGBoost                      Ideal
AUC                0.79                    0.85                       0.87                         Higher is better
KS                 38%                     48%                        52%                          Higher is better
Calibration        Excellent               Poor (too conservative)    Fair (slightly optimistic)   On the 45-degree line
PSI (OOT)          0.08                    0.22                       0.19                         < 0.10
Interpretability   High (clear reasons)    Low ("black box")          Medium (can use SHAP)        High
Regulatory Fit     Excellent               Needs justification        Needs justification          Explainable & defensible

Analysis & Decision:

  1. XGBoost has the best raw predictive power (AUC, KS). However, its calibration is off, and it's less stable over time (higher PSI).
  2. Random Forest has similar issues but is even less calibrated and less stable.
  3. Logistic Regression has very good (but not the best) predictive power. Its key advantages are excellent calibration, high stability, and full transparency.

Verdict: For a regulated banking environment where explainability to regulators and customers is paramount and stability is prized, Logistic Regression is often the best choice.

4. Ensuring Ongoing Fitness-for-Purpose: The Validation Lifecycle

Model evaluation isn't a one-time event. It's a cycle.

  1. Development Validation: The initial deep-dive we described above.
  2. Ongoing Monitoring (Post-Implementation):
    • Backtesting: Every quarter, compare predicted vs. actual default rates (a minimal sketch follows this list)
    • Benchmarking: Compare to simple alternatives
    • Threshold Monitoring: Track score distributions and decision volumes
  3. Periodic Revalidation: Typically annual, full re-run of validation process
  4. Triggers for Re-development: Significant AUC drop (>5%), consistently high PSI, or major regulatory/business changes
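
For the quarterly backtesting step above, a minimal sketch is a one-sided binomial check of the realised default count against the predicted PD for a single rating grade (all figures are illustrative):

  from scipy.stats import binomtest

  # One grade, one quarter: predicted PD, number of loans, observed defaults.
  predicted_pd = 0.03
  n_loans = 2_000
  observed_defaults = 78

  result = binomtest(observed_defaults, n_loans, predicted_pd, alternative="greater")
  print(f"Observed rate {observed_defaults / n_loans:.2%} vs predicted {predicted_pd:.2%}")
  print(f"p-value = {result.pvalue:.4f}")   # a small p-value flags defaults running above prediction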

✅ Final Summary: The PD Model Evaluation Checklist

When your boss asks, "How do we know this model is good?", you can walk them through this:

Discrimination: Does it separate good and bad? (AUC > 0.75, ideally higher)
Calibration: Are the probabilities accurate? (Check the calibration plot)
Stability: Has the world changed since we built it? (PSI < 0.1)
Interpretability: Can we explain its decisions? (Signs make sense, reason codes available)
Business Impact: Does it improve our decisions vs. the old system? (Show a profit/loss simulation)
Monitoring Plan: How will we know if it breaks? (Define backtesting and benchmark metrics)

This comprehensive, multi-faceted approach ensures the model is not just a statistical toy but a reliable, trusted, and valuable business asset.