1. What is Model Performance Evaluation?
In credit risk, a PD model is not an academic exercise; it's a critical tool that directly impacts a bank's profitability, regulatory capital requirements, and strategic decisions. Therefore, evaluation is a rigorous process to answer these questions:
- Statistical Soundness: Does the model have a solid mathematical foundation? Are its predictions reliable and not just lucky?
- Predictive Power: Can it effectively rank borrowers from safest to riskiest? Does it produce accurate default probabilities?
- Business Utility: Can we explain why it gives a certain score? Does it align with economic intuition? Is it practical to implement and monitor?
- Regulatory Compliance: Does it meet the standards set by regulators (e.g., Basel Accords for IRB models)?
2. Dimensions of Model Evaluation
(a) Discriminatory Power (Who is Risky?)
This is about ranking, not necessarily the exact probability.
AUC-ROC Curve (Area Under the Receiver Operating Characteristic Curve)
Example: Model A has an AUC of 0.82. This means there's an 82% chance that a randomly chosen defaulted loan will have a higher (worse) PD score than a randomly chosen non-defaulted one. This is generally considered good.
Gini Coefficient / Accuracy Ratio
The Gini coefficient is a linear rescaling of AUC: Gini = 2 × AUC − 1, so an AUC of 0.82 equals a Gini of 0.64. Interpretation: 0% = no power, 100% = perfect power.
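A minimal sketch of how both numbers could be computed with scikit-learn; `pd_scores` (predicted PDs) and `y_true` (observed default flags) are hypothetical arrays generated purely for illustration:

```python
# Sketch: AUC and Gini for a PD model, computed with scikit-learn on toy data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
pd_scores = rng.beta(2, 18, size=5_000)   # hypothetical predicted PDs (mean ~10%)
y_true = rng.binomial(1, pd_scores)       # hypothetical observed default flags

auc = roc_auc_score(y_true, pd_scores)    # chance a defaulter outranks a non-defaulter
gini = 2 * auc - 1                        # Gini / Accuracy Ratio = 2 * AUC - 1
print(f"AUC = {auc:.3f}, Gini = {gini:.3f}")
```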
Kolmogorov-Smirnov (KS) Statistic
Example: The maximum vertical distance between the two cumulative distribution lines is 0.45 (or 45%). A KS above 40% is typically very strong, but it can be sensitive to sample size.
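One convenient way to compute KS is via the ROC curve, since KS = max(TPR − FPR). A minimal sketch, reusing the hypothetical `y_true` and `pd_scores` arrays from the AUC example above:

```python
# Sketch: KS statistic via the ROC curve (maximum gap between the two CDFs).
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, _ = roc_curve(y_true, pd_scores)
ks = np.max(tpr - fpr)
print(f"KS = {ks:.1%}")
```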
(b) Calibration (How Risky are They, Exactly?)
A model can have great ranking but terrible probabilities. Calibration is crucial for pricing loans and calculating expected loss (EL = PD × LGD × EAD); for example, PD = 2%, LGD = 45%, and EAD = £100,000 give an EL of £900, so any bias in PD flows straight through to EL.
Calibration Plot / Reliability Diagram
- Split your test data into 10-20 bins based on their predicted PD (e.g., 0-2%, 2-4%, ..., 18-20%)
- For each bin, calculate the average predicted PD
- For each bin, calculate the actual observed default rate
- Plot the predicted PD (x-axis) vs. the actual default rate (y-axis)
Interpretation: Points falling on the 45-degree line indicate perfect calibration. A pattern above the line means the model is too optimistic (under-predicting risk); below the line means it's too pessimistic (over-predicting risk).
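A minimal sketch of this recipe using scikit-learn's `calibration_curve` (which handles the binning and averaging) and matplotlib, again on the hypothetical `y_true` / `pd_scores` arrays:

```python
# Sketch: calibration plot - bin by predicted-PD quantile, then compare each
# bin's mean predicted PD with its observed default rate.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

prob_true, prob_pred = calibration_curve(y_true, pd_scores, n_bins=10, strategy="quantile")

plt.plot(prob_pred, prob_true, marker="o", label="model")
plt.plot([0, prob_pred.max()], [0, prob_pred.max()], linestyle="--", label="perfect calibration")
plt.xlabel("Mean predicted PD per bin")
plt.ylabel("Observed default rate per bin")
plt.legend()
plt.show()
```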
Hosmer-Lemeshow (HL) Test
The HL test formalises the binning recipe above: observations are grouped (typically into deciles of predicted PD), observed and expected defaults are compared in each group, and a chi-square statistic tests whether the deviations are larger than chance; a low p-value signals miscalibration.
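A minimal sketch of the HL statistic computed by hand on the same hypothetical `y_true` / `pd_scores` arrays (decile bins; statistic referred to a chi-square with g − 2 degrees of freedom):

```python
# Sketch: Hosmer-Lemeshow test on toy data.
import pandas as pd
from scipy.stats import chi2

data = pd.DataFrame({"y": y_true, "pd": pd_scores})
data["bin"] = pd.qcut(data["pd"], q=10, duplicates="drop")   # decile bins by predicted PD

grouped = data.groupby("bin", observed=True)
obs = grouped["y"].sum()       # observed defaults per bin
exp = grouped["pd"].sum()      # expected defaults per bin
n = grouped["y"].count()
pi = exp / n                   # mean predicted PD per bin

hl_stat = (((obs - exp) ** 2) / (n * pi * (1 - pi))).sum()
p_value = chi2.sf(hl_stat, df=len(obs) - 2)   # chi-square with g - 2 degrees of freedom
print(f"HL = {hl_stat:.2f}, p-value = {p_value:.3f}")
```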
(c) Stability & Robustness (Does it Work Tomorrow?)
Models can "decay" as economic conditions and borrower profiles change.
Population Stability Index (PSI)
PSI compares the distribution of model scores (or key input variables) in the current population against the distribution at model development, summed across bins; a computation sketch follows the thresholds below.
- PSI < 0.1: No significant change. The population is stable.
- PSI 0.1 - 0.25: Moderate change. Investigate the shift.
- PSI > 0.25: Significant change. The model may need redevelopment or recalibration.
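A minimal sketch of a PSI calculation, using the standard formula PSI = Σ over bins of (actual% − expected%) × ln(actual% / expected%); the development and current score samples are hypothetical:

```python
# Sketch: Population Stability Index between a development ("expected") sample
# and a current ("actual") sample, with bin edges fixed on the development data.
import numpy as np

def psi(expected_scores, actual_scores, n_bins=10, eps=1e-6):
    edges = np.quantile(expected_scores, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    exp_pct = np.histogram(expected_scores, bins=edges)[0] / len(expected_scores)
    act_pct = np.histogram(actual_scores, bins=edges)[0] / len(actual_scores)
    exp_pct = np.clip(exp_pct, eps, None)   # avoid log(0) in sparse bins
    act_pct = np.clip(act_pct, eps, None)
    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))

# Hypothetical development-time and current score samples.
rng = np.random.default_rng(0)
dev_scores = rng.beta(2, 8, size=5_000)
cur_scores = rng.beta(2.2, 7.5, size=5_000)
print(f"PSI = {psi(dev_scores, cur_scores):.3f}")
```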
(d) Business Use & Interpretability (Can We Trust and Use It?)
Sign and Significance of Variables
Do the model coefficients make sense?
- Good Sign: Debt-to-Income Ratio has a strong positive coefficient (higher debt → higher PD)
- Bad Sign: Years in Business has a positive coefficient (implying older firms are riskier, which contradicts intuition)
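A minimal sketch of this check using statsmodels on simulated data; the variable names and the fitted relationship are purely illustrative:

```python
# Sketch: inspect coefficient signs and p-values of a logistic PD model.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2_000
X = pd.DataFrame({
    "debt_to_income": rng.uniform(0, 1, n),
    "years_in_business": rng.integers(1, 40, n),
})
# Hypothetical true relationship: more debt -> riskier, older firm -> safer.
logit_pd = -2 + 3 * X["debt_to_income"] - 0.05 * X["years_in_business"]
y = rng.binomial(1, 1 / (1 + np.exp(-logit_pd)))

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(model.summary())   # check each coefficient's sign and significance
```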
3. Model Selection: A Practical Scenario
Task: Choose a PD model for a UK SME (Small-Medium Enterprise) portfolio.
Candidates: Logistic Regression (LR), Random Forest (RF), XGBoost (XGB)
| Metric | Logistic Regression | Random Forest | XGBoost | Ideal |
|---|---|---|---|---|
| AUC | 0.79 | 0.85 | 0.87 | Higher is better |
| KS | 38% | 48% | 52% | Higher is better |
| Calibration | Excellent | Poor (too conservative) | Fair (slightly optimistic) | On the 45-degree line |
| PSI (out-of-time) | 0.08 | 0.22 | 0.19 | < 0.10 |
| Interpretability | High (clear reasons) | Low ("black box") | Medium (can use SHAP) | High |
| Regulatory Fit | Excellent | Needs justification | Needs justification | Explainable & defensible |
Analysis & Decision:
- XGBoost has the best raw predictive power (AUC, KS). However, its calibration is off, and it's less stable over time (higher PSI).
- Random Forest has similar issues but is even less calibrated and less stable.
- Logistic Regression has very good (but not the best) predictive power. Its key advantages are excellent calibration, high stability, and complete transparency.
- Decision: For a regulated UK SME portfolio, Logistic Regression is the most defensible choice; the marginal AUC/KS gains from XGBoost do not outweigh its weaker calibration, lower stability, and heavier explainability burden.
4. Ensuring Ongoing Fitness-for-Purpose: The Validation Lifecycle
Model evaluation isn't a one-time event. It's a cycle.
- Development Validation: The initial deep-dive we described above.
- Ongoing Monitoring (Post-Implementation):
- Backtesting: Every quarter, compare predicted vs. actual default rates (a simple sketch follows this list)
- Benchmarking: Compare to simple alternatives
- Threshold Monitoring: Track score distributions and decision volumes
- Periodic Revalidation: Typically annual, full re-run of validation process
- Triggers for Re-development: Significant AUC drop (>5%), consistently high PSI, or major regulatory/business changes
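One common backtesting sketch is a binomial test per rating grade: given the grade's assigned PD and the number of obligors in the quarter, is the observed default count plausible? The figures below are hypothetical:

```python
# Sketch: quarterly backtest of one rating grade via a one-sided binomial test.
from scipy.stats import binomtest

assigned_pd = 0.02      # PD assigned to the grade
n_obligors = 1_500      # obligors in the grade this quarter
n_defaults = 41         # observed defaults

result = binomtest(n_defaults, n_obligors, assigned_pd, alternative="greater")
print(f"Observed rate = {n_defaults / n_obligors:.2%}, "
      f"p-value (defaults exceed prediction) = {result.pvalue:.4f}")
```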
✅ Final Summary: The PD Model Evaluation Checklist
When your boss asks, "How do we know this model is good?", you can walk them through this checklist:
- Discriminatory power: AUC/Gini and KS confirm the model separates risky from safe borrowers.
- Calibration: predicted PDs line up with observed default rates (calibration plot, HL test).
- Stability: PSI shows the population and score distributions have not drifted.
- Interpretability: coefficient signs and key drivers align with economic intuition.
- Regulatory fit: the model is explainable, documented, and defensible.
- Ongoing monitoring: backtesting, benchmarking, and periodic revalidation keep it fit for purpose.
This comprehensive, multi-faceted approach ensures the model is not just a statistical toy but a reliable, trusted, and valuable business asset.