Feature engineering is the art and science of transforming raw data into powerful, predictive signals that a machine learning model can understand. It's not just a preprocessing step; it's where domain expertise is mathematically encoded.
👉 In credit risk, the predictive power of a Probability of Default (PD) model is often 80% determined by the quality of the features and 20% by the model algorithm itself. A complex model with poor features will fail, while a simple model with brilliantly engineered features will excel.
This is about crafting variables that encapsulate financial health, stability, and risk.
- **Current ratio:** `current_assets / current_liabilities`. The standard liquidity measure.
- **Quick ratio:** `(current_assets - inventory) / current_liabilities`. A stricter measure, as inventory can be hard to liquidate quickly.
- **Debt-to-equity:** `total_liabilities / total_shareholder_equity`. Measures a company's financial leverage.
- **Debt-to-EBITDA:** `total_debt / EBITDA`. A key covenant in loan agreements; indicates roughly how many years of earnings are needed to pay off debt.
- **Interest coverage ratio (ICR):** `EBIT / interest_expense`. Crucial for assessing a borrower's ability to service debt. A value below 1.5 is often a major red flag.
- **Return on assets (ROA):** `net_income / total_assets`. Measures how efficiently assets are used to generate profit.
- **Sector growth relative to GDP:** `sector_growth_rate - GDP_growth_rate`.
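
The ratios above translate almost directly into pandas. The following is a minimal sketch, assuming one row per firm and column names that match the formulas in the text; the helper names `safe_divide` and `add_financial_ratios` and the sample figures are hypothetical. Zero denominators are mapped to `NaN` so they can be handled explicitly later rather than producing infinities.

```python
import numpy as np
import pandas as pd


def safe_divide(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two series, returning NaN wherever the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)


def add_financial_ratios(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the liquidity, leverage, coverage and profitability ratios added."""
    out = df.copy()
    out["current_ratio"] = safe_divide(df["current_assets"], df["current_liabilities"])
    out["quick_ratio"] = safe_divide(
        df["current_assets"] - df["inventory"], df["current_liabilities"]
    )
    out["debt_to_equity"] = safe_divide(df["total_liabilities"], df["total_shareholder_equity"])
    out["debt_to_ebitda"] = safe_divide(df["total_debt"], df["EBITDA"])
    out["interest_coverage_ratio"] = safe_divide(df["EBIT"], df["interest_expense"])
    out["return_on_assets"] = safe_divide(df["net_income"], df["total_assets"])
    return out


# Illustrative figures only; the second firm has no interest expense.
firms = pd.DataFrame(
    {
        "current_assets": [200.0, 80.0],
        "current_liabilities": [100.0, 100.0],
        "inventory": [50.0, 30.0],
        "total_liabilities": [300.0, 500.0],
        "total_shareholder_equity": [150.0, 50.0],
        "total_debt": [240.0, 450.0],
        "EBITDA": [60.0, 50.0],
        "EBIT": [45.0, 20.0],
        "interest_expense": [30.0, 0.0],
        "net_income": [20.0, 5.0],
        "total_assets": [400.0, 550.0],
    }
)
ratios = add_financial_ratios(firms)
print(ratios[["current_ratio", "quick_ratio", "interest_coverage_ratio"]])
```

Whether a zero denominator should become `NaN`, a capped value, or a separate flag is a modelling choice; `NaN` is used here only because it keeps the decision visible.
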
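- **Log transform:** `np.log1p(x)` is used to handle zeros (it computes `log(1+x)`).
- **Non-linear relationships:** The relationship between `age_of_firm` and `PD` might not be linear. A young firm and a very old firm might both be riskier than a middle-aged one.
- **One-hot encoding:** Use `drop='first'` to remove one category.
- **High-cardinality features:** For features with many categories (e.g., `SIC_code`), one-hot encoding creates a sparse, high-dimensional dataset. This is inefficient and can lead to overfitting.
- **Target encoding:** Each category is replaced with the mean of the target for that category. For example, `Industry='Construction'` is replaced with the historical default rate for all construction firms in the training data.
- **Smoothed target encoding:** `smoothed_encoding = (n * category_mean + α * global_mean) / (n + α)`, where `n` is the number of observations in the category and `α` controls how strongly small categories are pulled toward the global mean.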
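
The smoothing formula above can be sketched in a few lines of pandas. This is an illustration under assumptions rather than a reference implementation: the function name `smoothed_target_encode`, the column names, the toy data, and the choice `alpha=2.0` are all hypothetical, and `n` is taken to be the number of training rows in each category.

```python
import pandas as pd


def smoothed_target_encode(
    train: pd.DataFrame, column: str, target: str, alpha: float = 10.0
) -> pd.Series:
    """Return one smoothed target mean per category, computed on training data only."""
    global_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(["mean", "count"])
    # (n * category_mean + alpha * global_mean) / (n + alpha)
    return (stats["count"] * stats["mean"] + alpha * global_mean) / (stats["count"] + alpha)


# Toy training set: 4 construction firms (2 defaults) and 1 retail firm (0 defaults).
train = pd.DataFrame(
    {
        "industry": ["Construction"] * 4 + ["Retail"],
        "default": [1, 1, 0, 0, 0],
    }
)
encoding = smoothed_target_encode(train, "industry", "default", alpha=2.0)
global_rate = train["default"].mean()

train["industry_risk_score"] = train["industry"].map(encoding)

# Categories never seen in training fall back to the global default rate.
new_firms = pd.Series(["Retail", "Mining"])
new_scores = new_firms.map(encoding).fillna(global_rate)
print(encoding)
print(new_scores)
```

The single retail firm is pulled most of the way toward the global rate, which is the point of the smoothing. To limit leakage, the encoding is usually fitted on training folds only and then applied to held-out rows.
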
- **Frequency encoding:** Each category is replaced with its count in the dataset. For `industry`, it would be "how many construction firms are in the dataset?" This can be a useful signal without the leakage risk of target encoding.

Missingness is often Not Missing At Random (NMAR) and is a feature in itself.
- **Missing-value flags:** Create a binary flag `is_[feature]_missing`. This is often a powerful predictor (e.g., a missing `cash_flow_statement` is a negative signal).
- **Model-based imputation:** Use an imputer (e.g., `KNNImputer` or `IterativeImputer` in sklearn), which estimates missing values based on other features.
- **Capping (winsorizing):** `X_capped = np.clip(X, a_min=np.percentile(X, 1), a_max=np.percentile(X, 99))`
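
The three steps above (flag, impute, cap) can be chained. The sketch below is one possible ordering, assuming numeric columns; the function name `flag_impute_and_cap`, the column names, and the sample values are hypothetical. The flags are created before imputation so that the information about what was missing is not lost.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def flag_impute_and_cap(df: pd.DataFrame, columns: list, n_neighbors: int = 2) -> pd.DataFrame:
    """Add missingness flags, impute with KNN, then cap each column at its 1st/99th percentile."""
    out = df.copy()

    # 1. Record missingness before it is overwritten by imputation.
    for col in columns:
        out[f"is_{col}_missing"] = df[col].isna().astype(int)

    # 2. Estimate missing values from the other features.
    imputer = KNNImputer(n_neighbors=n_neighbors)
    out[columns] = imputer.fit_transform(df[columns])

    # 3. Cap extreme values at the 1st and 99th percentiles.
    for col in columns:
        lower, upper = np.percentile(out[col], [1, 99])
        out[col] = np.clip(out[col], lower, upper)
    return out


# Illustrative data with one extreme firm and two missing values.
raw = pd.DataFrame(
    {
        "cash_flow": [10.0, 12.0, np.nan, 11.0, 500.0],
        "leverage_ratio": [2.0, 2.5, 3.0, np.nan, 2.2],
        "total_assets": [100.0, 120.0, 110.0, 105.0, 900.0],
    }
)
numeric_columns = ["cash_flow", "leverage_ratio", "total_assets"]
clean = flag_impute_and_cap(raw, numeric_columns)
print(clean)
```

In a real pipeline the imputer and the percentile bounds would be fitted on the training set and reused on validation and production data; fitting them on the full dataset, as this short example does, leaks information.
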
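- **Outlier flags:** Create a binary flag `is_outlier` based on statistical tests (e.g., Z-score > 3) or business rules (e.g., `leverage_ratio > 10`). This allows the model to learn a specific coefficient for these rare but important events.
- Fitted tree-based models in sklearn expose feature importances through the `feature_importances_` attribute.
- **Interaction terms**, for example:
  - `high_leverage * volatile_industry`
  - `low_liquidity * short_loan_tenor`
  - `high_DTI_ratio * low_collateral_coverage`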
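
The outlier flag and the interaction terms above can both be built with simple column arithmetic. This sketch assumes that `high_leverage`, `volatile_industry`, and the other interaction inputs already exist as 0/1 indicator columns, which the text does not specify; the function names and the sample data are hypothetical.

```python
import pandas as pd


def outlier_flag(
    values: pd.Series, z_threshold: float = 3.0, business_cap: float = 10.0
) -> pd.Series:
    """Flag rows that breach either the Z-score test or the business rule."""
    z_scores = (values - values.mean()) / values.std()
    return ((z_scores.abs() > z_threshold) | (values > business_cap)).astype(int)


def add_interactions(df: pd.DataFrame) -> pd.DataFrame:
    """Multiply pairs of 0/1 indicators so the product is 1 only when both risks are present."""
    out = df.copy()
    out["high_leverage_x_volatile_industry"] = df["high_leverage"] * df["volatile_industry"]
    out["low_liquidity_x_short_loan_tenor"] = df["low_liquidity"] * df["short_loan_tenor"]
    return out


# Illustrative data: the third firm breaches the leverage_ratio > 10 business rule.
loans = pd.DataFrame(
    {
        "leverage_ratio": [1.5, 2.0, 12.0, 3.0],
        "high_leverage": [1, 0, 1, 0],
        "volatile_industry": [1, 1, 0, 0],
        "low_liquidity": [0, 1, 1, 0],
        "short_loan_tenor": [0, 1, 0, 1],
    }
)
loans["is_outlier"] = outlier_flag(loans["leverage_ratio"])
features = add_interactions(loans)
print(features)
```

In this tiny sample the Z-score test alone would not flag the third firm, because the outlier itself inflates the standard deviation; the business rule catches it, which is one reason to keep both.
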
- `PolynomialFeatures` can generate all polynomial and interaction features (e.g., `a, b, a^2, b^2, a*b`), but this can lead to a combinatorial explosion of features. Use this with heavy feature selection afterward.

Temporal features are critical for differentiating a temporary hiccup from a terminal decline.
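
One way to separate a temporary dip from a sustained decline is to summarize each firm's history with a trend slope and a volatility measure. The sketch below assumes a long-format table with one row per firm per period, already sorted by time; the column names `firm_id`, `revenue`, and `operating_cf`, the helper names, and the four-period toy history are hypothetical.

```python
import numpy as np
import pandas as pd


def trend_slope(values) -> float:
    """Slope of a least-squares line fitted through equally spaced observations."""
    y = np.asarray(values, dtype=float)
    x = np.arange(len(y))
    slope, _intercept = np.polyfit(x, y, deg=1)
    return float(slope)


def temporal_features(history: pd.DataFrame) -> pd.DataFrame:
    """Aggregate each firm's time series into one row of trend and volatility features."""
    grouped = history.groupby("firm_id")
    return pd.DataFrame(
        {
            "revenue_trend_slope": grouped["revenue"].apply(trend_slope),
            "cash_flow_volatility": grouped["operating_cf"].std(),
        }
    )


# Firm A grows steadily with stable cash flow; firm B shrinks with erratic cash flow.
history = pd.DataFrame(
    {
        "firm_id": ["A"] * 4 + ["B"] * 4,
        "revenue": [100.0, 110.0, 120.0, 130.0, 100.0, 90.0, 80.0, 70.0],
        "operating_cf": [10.0, 10.0, 10.0, 10.0, 10.0, -5.0, 12.0, -8.0],
    }
)
temporal = temporal_features(history)
print(temporal)
```

The window length (for example a fixed number of quarters or months) matters in practice; the toy history is kept short only for readability.
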
- **Trends over time** in `revenue`, `EBITDA`, `cash_balance`.
- **`days_sales_outstanding` (DSO) trend.** An increasing DSO suggests customers are taking longer to pay, indicating stress.
- **Combined group default rates:** Rather than encoding `industry` alone, create `default_rate_in_industry_and_region`. This creates a stronger, more specific signal for the rare default event.
- **Class weights:** Many algorithms (e.g., via `class_weight='balanced'` in sklearn) allow you to assign a higher weight to the minority class (defaults) during training. This is often cleaner than manipulating the data itself.

The best features are grounded in finance. A PD model is essentially a quantitative credit analyst.
- **Altman Z-score:** `Z = 1.2A + 1.4B + 3.3C + 0.6D + 1.0E`, where A = Working Capital / Total Assets, B = Retained Earnings / Total Assets, C = EBIT / Total Assets, D = Market Value of Equity / Total Liabilities, E = Sales / Total Assets.
- **Debt Service Coverage Ratio (DSCR):** `Net Operating Income / Total Debt Service`. A cornerstone of commercial lending. A DSCR < 1.0 means the company does not generate enough cash to cover its debt service.

Feature | Type | Transformation & Notes |
---|---|---|
`log(total_assets)` | Created/Transformed | Log transform to reduce skew. |
`debt_to_ebitda` | Created | Key leverage covenant. Cap at 20. |
`interest_coverage_ratio` | Created | ICR < 1.5 is a red flag. |
`current_ratio` | Created | Standard liquidity measure. |
`cash_flow_volatility` | Temporal | Standard deviation of the last 12 months of operating cash flow. |
`revenue_trend_slope` | Temporal | Slope of revenue over the last 8 quarters. |
`industry_risk_score` | Domain/Encoded | Target-encoded (with smoothing) industry. |
`is_financials_missing` | Missingness | Binary flag. |
`has_high_leverage_outlier` | Outlier | Flag for Debt-to-Equity > 10. |
`region_x_industry_interaction` | Interaction | One-hot encoded interaction. |
`days_past_due_90_max` | Behavioral | Worst delinquency event in the last 24 months. |
`borrower_concentration_ratio` | Domain | % of bank's exposure to this borrower. |
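
The two domain features described before the table, the Z-score built from components A to E and the DSCR, are simple enough to compute directly. The sketch below is illustrative: the function names and input figures are hypothetical, and the Z-score weights (1.2, 1.4, 3.3, 0.6, 1.0) are the commonly cited coefficients of the original Altman model, stated here as an assumption about which variant is intended.

```python
def altman_z_score(
    working_capital: float,
    retained_earnings: float,
    ebit: float,
    market_value_equity: float,
    sales: float,
    total_assets: float,
    total_liabilities: float,
) -> float:
    """Weighted sum of the five components A to E (weights assumed, see lead-in)."""
    a = working_capital / total_assets
    b = retained_earnings / total_assets
    c = ebit / total_assets
    d = market_value_equity / total_liabilities
    e = sales / total_assets
    return 1.2 * a + 1.4 * b + 3.3 * c + 0.6 * d + 1.0 * e


def debt_service_coverage_ratio(net_operating_income: float, total_debt_service: float) -> float:
    """Net operating income divided by total debt service; below 1.0 signals a shortfall."""
    return net_operating_income / total_debt_service


# Illustrative figures only.
z_score = altman_z_score(
    working_capital=50.0,
    retained_earnings=100.0,
    ebit=40.0,
    market_value_equity=300.0,
    sales=500.0,
    total_assets=400.0,
    total_liabilities=200.0,
)
dscr = debt_service_coverage_ratio(net_operating_income=90.0, total_debt_service=100.0)
print(f"Z-score: {z_score:.2f}, DSCR: {dscr:.2f}")
```

With these inputs the DSCR comes out below 1.0, the situation the text describes as not generating enough cash to cover debt service.
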