Explain the difference between a MERGE and a SET statement in a DATA step. Provide an example of when you would use each in the context of preparing data for a probability of default (PD) model, such as combining customer credit history with loan performance data.
Example:
/* SET: Append loan performance data from 2023 and 2024 */
data loan_performance;
   set loan_data_2023 loan_data_2024;
run;

/* MERGE: Combine credit history with default flags by customer ID */
data pd_data;
   merge credit_history (in=a) loan_defaults (in=b);
   by customer_id;
   if a and b; /* Keep only matched records */
run;
In PD modeling, SET stacks observations from one or more datasets (useful for building a longitudinal loan performance dataset across years), while MERGE joins variables from multiple datasets side by side on a BY key (ideal for enriching borrower profiles with default outcomes). Note that MERGE requires both inputs to be sorted by the BY variable.
Write a PROC SQL query to join two datasets: one containing borrower financial ratios and another with default flags. Include a condition to filter for commercial loans originated after 2020, and calculate the average debt-to-income ratio for defaulted vs. non-defaulted borrowers.
proc sql;
   create table summary as
   select d.default_flag,
          mean(f.debt_to_income) as avg_dti
   from financial_ratios f
   inner join loan_defaults d
      on f.customer_id = d.customer_id
   where f.loan_type = 'COMMERCIAL'
     and f.origination_date >= '01JAN2021'd /* originated after 2020 */
   group by d.default_flag;
quit;
This query joins the datasets on customer_id, filters for commercial loans originated after 2020, and calculates the average debt-to-income ratio by default status. This is critical for understanding financial risk factors in PD models.
Describe how you would identify and impute missing values in a dataset of commercial loan attributes using SAS procedures. What considerations would you have for imputation methods in a PD modeling context to avoid biasing the model?
Use PROC MEANS or PROC FREQ to summarize missing values, then PROC STDIZE for mean/median imputation or PROC MI for multiple imputation to preserve variability. Example:
/* Identify missing values */
proc means data=loan_attributes n nmiss;
   var credit_score debt_to_income loan_amount;
run;

/* Impute missing values with the median; REPONLY replaces
   missing values only, without standardizing the data */
proc stdize data=loan_attributes out=loan_attributes_imputed
            reponly method=median;
   var credit_score debt_to_income loan_amount;
run;
Choose imputation methods that align with the variable's distribution and model assumptions to maintain PD model accuracy. In a PD context, also consider whether missingness is itself informative (e.g., a missing credit score may flag thin-file borrowers); imputing over an informative pattern without a missing-value indicator can bias the model.
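As a minimal sketch of the multiple-imputation route (dataset and variable names carried over from the example above; the imputation count and seed are illustrative):

/* Generate 5 imputed datasets, preserving variability across imputations */
proc mi data=loan_attributes out=loan_attributes_mi nimpute=5 seed=12345;
   var credit_score debt_to_income loan_amount;
run;
/* Downstream: fit the PD model to each imputation (BY _Imputation_)
   and combine the estimates with PROC MIANALYZE */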
Write a simple SAS macro that takes a dataset name and a variable (e.g., loan amount) as parameters, and generates summary statistics (mean, median, min, max) for that variable, grouped by industry sector. Explain how this macro could be useful in exploratory data analysis for PD models.
%macro summary_stats(dataset, variable);
   proc means data=&dataset n mean median min max;
      class industry_sector;
      var &variable;
      /* Name the statistics explicitly so the median is included
         in the output dataset; AUTONAME appends statistic suffixes */
      output out=stats_&variable
         n= mean= median= min= max= / autoname;
   run;
%mend summary_stats;

/* Call macro */
%summary_stats(loan_attributes, loan_amount);
This macro automates the generation of summary statistics for key variables across industry sectors, enabling quick identification of risk patterns (e.g., higher loan amounts in certain sectors may correlate with higher defaults). It streamlines exploratory data analysis for PD model development.
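As a hedged extension (the wrapper macro and variable list below are illustrative, not part of the original answer), the macro can be looped over a list of candidate predictors in one pass:

%macro profile_vars(dataset, varlist);
   %local i var;
   %let i = 1;
   /* Scan the space-delimited list until it is exhausted */
   %do %while (%scan(&varlist, &i) ne );
      %let var = %scan(&varlist, &i);
      %summary_stats(&dataset, &var);
      %let i = %eval(&i + 1);
   %end;
%mend profile_vars;

%profile_vars(loan_attributes, loan_amount debt_to_income credit_score);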
Outline the steps to build a basic logistic regression model using PROC LOGISTIC for predicting probability of default, including variable selection options like stepwise and how to interpret the output odds ratios in a banking risk context.
Use PROC LOGISTIC with stepwise selection to identify significant predictors. Example:
proc logistic data=pd_data;
   class industry_sector (ref='RETAIL');
   model default_flag (event='1') = credit_score debt_to_income
         loan_amount industry_sector
         / selection=stepwise slentry=0.05 slstay=0.05;
   output out=pd_predictions pred=prob_default;
run;
An odds ratio of 1.5 for a predictor such as debt_to_income means a one-unit increase in that variable raises the odds of default by 50%, holding other variables constant. This helps quantify risk factors for commercial loans.
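To make the arithmetic explicit, the odds ratio is exp(coefficient). A sketch (dataset and variable names assume the PROC LOGISTIC example above) that recovers odds ratios from the parameter estimates:

/* Capture parameter estimates, then compute odds ratios as exp(beta) */
ods output ParameterEstimates=pe;
proc logistic data=pd_data;
   model default_flag (event='1') = credit_score debt_to_income loan_amount;
run;

data odds_ratios;
   set pe;
   if Variable ne 'Intercept';
   odds_ratio = exp(Estimate); /* e.g., exp(0.405) ~ 1.5 */
run;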
Demonstrate how to use arrays in a DATA step to create multiple lagged variables (e.g., payment history over the last 6 months) from a time-series dataset of commercial account payments. Why might this be important for feature engineering in PD models?
data payment_lags;
   set payment_history;
   by customer_id;
   array payments{6} payment_1 - payment_6;
   retain payment_1 - payment_6;
   /* Reset the lag window at the start of each customer's history,
      before shifting, so the first payment is not wiped out */
   if first.customer_id then
      do i = 1 to 6;
         payments{i} = .;
      end;
   /* Shift prior payments down one position to create lags */
   do i = 6 to 2 by -1;
      payments{i} = payments{i-1};
   end;
   payments{1} = payment_amount; /* Most recent payment */
   drop i;
run;
Lagged variables capture temporal patterns (e.g., missed payments over time), which are critical for predicting default risk in PD models, as recent payment behavior often indicates financial distress.
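Building on the lag array, a sketch of turning the window into model-ready features (treating a payment of 0 as a missed payment is an assumption for illustration):

data payment_features;
   set payment_lags;
   array payments{6} payment_1 - payment_6;
   /* Count missed payments (assumed coded as 0) over the 6-month window */
   missed_6m = 0;
   do i = 1 to 6;
      if payments{i} = 0 then missed_6m + 1;
   end;
   /* Average payment over non-missing months */
   avg_payment_6m = mean(of payment_1 - payment_6);
   drop i;
run;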
How would you optimize SAS code to handle a very large dataset (e.g., millions of commercial loan records) efficiently? Discuss techniques like indexing, using WHERE clauses, or BY-group processing, and provide an example related to filtering high-risk portfolios.
Create indexes on key join variables (e.g., customer_id) to speed up joins, and use WHERE clauses to filter records as they are read rather than after loading the full dataset. Example:
/* Create index */
proc datasets library=work;
   modify loan_attributes;
   index create customer_id;
run;
quit;

/* Filter high-risk portfolios using WHERE */
data high_risk;
   set loan_attributes;
   where credit_score < 600 and loan_type = 'COMMERCIAL';
run;
These techniques reduce processing time for large datasets, enabling efficient analysis of high-risk commercial portfolios in PD modeling.
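A BY-group sketch (dataset and variable names assumed from the examples above) that aggregates exposure one customer at a time rather than holding everything in memory:

/* Sort once, then summarize each customer with BY-group logic */
proc sort data=loan_attributes out=loans_sorted;
   by customer_id;
run;

data customer_exposure;
   set loans_sorted;
   by customer_id;
   if first.customer_id then total_exposure = 0;
   total_exposure + loan_amount;             /* Sum statement retains */
   if last.customer_id then output;          /* One row per customer */
   keep customer_id total_exposure;
run;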
Explain how to use ODS to export the results of a PROC FREQ analysis (e.g., frequency of defaults by credit rating) into an Excel file with customized formatting. Include sample code and discuss its utility in reporting PD model insights to stakeholders.
ods excel file='default_by_rating.xlsx'
    options(sheet_name='Default_Analysis' embedded_titles='yes');
proc freq data=pd_data;
   tables credit_rating * default_flag / nocol norow nopercent;
   title 'Default Frequency by Credit Rating';
run;
ods excel close;
ODS Excel allows formatted output for stakeholder reports, making it easy to share PD model insights (e.g., default rates by credit rating) in a professional, accessible format.
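For heavier customization, ODS EXCEL accepts a built-in style and further suboptions; a sketch (the style name and suboptions here are illustrative):

/* Built-in style plus a frozen header row and column autofilters */
ods excel file='default_by_rating.xlsx' style=HTMLBlue
    options(sheet_name='Default_Analysis' embedded_titles='yes'
            frozen_headers='on' autofilter='all');
proc freq data=pd_data;
   tables credit_rating * default_flag / nocol norow nopercent;
   title 'Default Frequency by Credit Rating';
run;
ods excel close;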
Create a custom format in SAS to categorize commercial borrowers into risk bands (e.g., low, medium, high) based on their credit scores. Show how to apply this format in a DATA step and explain its role in simplifying data visualization for PD model validation.
proc format;
   value riskband
      low - 600  = 'High Risk'
      601 - 700  = 'Medium Risk'
      701 - high = 'Low Risk';
run;

data pd_data_formatted;
   set pd_data;
   format credit_score riskband.;
run;
Custom formats simplify visualization (e.g., in PROC FREQ tables or plots) by grouping continuous variables like credit scores into interpretable categories, aiding PD model validation and stakeholder communication.
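For instance, applying the format inside PROC FREQ groups the table by risk band instead of raw score (names carried over from the example above):

/* Default counts by risk band rather than by individual credit score */
proc freq data=pd_data;
   tables credit_score * default_flag / nocol nopercent;
   format credit_score riskband.;
run;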
Describe common SAS errors you might encounter when building a PD model (e.g., during data merging or model fitting) and how to debug them using options like MPRINT or SYMBOLGEN. Provide an example of troubleshooting a logistic model that fails due to multicollinearity.
Common errors include merge mismatches from unsorted BY variables, unresolved macro variables, and convergence or separation warnings in PROC LOGISTIC. Use MPRINT to trace macro execution, SYMBOLGEN to inspect macro variable resolution, and check the log for warnings about collinearity. Example:
options mprint symbolgen;

proc logistic data=pd_data;
   model default_flag (event='1') = credit_score debt_to_income loan_amount;
run;
If the model fails due to multicollinearity (e.g., high correlation between debt_to_income and loan_amount), check the correlation matrix with PROC CORR and remove or combine correlated variables to stabilize the PD model.
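A diagnostic sketch (variable names assumed from the example): PROC CORR for pairwise correlations, plus variance inflation factors via PROC REG, used here purely as a collinearity check since VIF depends only on the predictors:

/* Pairwise correlations among candidate predictors */
proc corr data=pd_data;
   var credit_score debt_to_income loan_amount;
run;

/* Variance inflation factors (VIF > 10 is a common multicollinearity flag) */
proc reg data=pd_data;
   model default_flag = credit_score debt_to_income loan_amount / vif;
run;
quit;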