SAS Interview Guide

Probability of Default Modeling Analyst Position

Topics: Data Manipulation, Statistical Modeling, SAS Programming, Risk Management
Question 1
Data Manipulation and Preparation
You have a dataset with customer loan information spanning 5 years. Some customers have multiple loans. How would you create a dataset that shows each customer's total exposure, number of active loans, and earliest loan origination date? Walk me through the SAS code structure you'd use.
Answer
Method 1: Using PROC SQL
proc sql;
   create table customer_summary as
   select customer_id,
          sum(loan_amount)      as total_exposure,
          count(*)              as num_active_loans,
          min(origination_date) as earliest_loan_date
   from loan_data
   where loan_status = 'Active'
   group by customer_id;
quit;
Method 2: Using DATA Step
proc sort data=loan_data;
   by customer_id;
run;

data customer_summary;
   set loan_data;
   by customer_id;
   retain total_exposure num_loans earliest_date;
   /* Reset the accumulators at the start of each customer's group */
   if first.customer_id then do;
      total_exposure = 0;
      num_loans = 0;
      earliest_date = .;
   end;
   if loan_status = 'Active' then do;
      total_exposure + loan_amount;   /* sum statements retain automatically */
      num_loans + 1;
      if earliest_date = . or origination_date < earliest_date then
         earliest_date = origination_date;
   end;
   /* One summary row per customer; unlike the SQL version, customers with
      no active loans are still output, with zero exposure */
   if last.customer_id then output;
   keep customer_id total_exposure num_loans earliest_date;
run;
Duplicate Handling
proc sort data=loan_data nodupkey dupout=duplicates;
   by customer_id loan_id;
run;

proc freq data=loan_data;
   tables customer_id / noprint out=dup_check;
run;
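The frequency output by itself doesn't isolate anything; filtering on its COUNT column (created automatically by the OUT= option) lists the customers that appear on more than one row, which here may be legitimate multi-loan customers or true duplicates worth inspecting:

proc print data=dup_check;
   where count > 1;   /* customers with more than one record */
run;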
Question 2
Missing Data Handling
In our commercial portfolio, financial ratios often have missing values due to incomplete financial statements. Describe 3 different approaches you could use in SAS to handle missing values for a debt-to-equity ratio variable, and explain when you'd use each approach.
Answer
Three approaches:
  • Complete Case Analysis: Remove observations with missing values - use when data is MCAR (missing completely at random) and the sample size is large
  • Mean/Median Imputation: Replace with a measure of central tendency - use for stable variables with few missing values
  • Multiple Imputation: Create multiple plausible values - use when data is MAR (missing at random) and preserving uncertainty is important
Method 1: Complete Case Analysis
data analysis_set;
   set commercial_data;
   if not missing(debt_to_equity_ratio);
run;
Method 2: Mean Imputation
proc means data=commercial_data mean noprint;
   var debt_to_equity_ratio;
   output out=means mean=de_ratio_mean;
run;

data commercial_imputed;
   if _n_ = 1 then set means;   /* load the mean once; it is retained across iterations */
   set commercial_data;
   if missing(debt_to_equity_ratio) then debt_to_equity_ratio = de_ratio_mean;
run;
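The approach list above also mentions median imputation, which is preferable when the ratio is skewed, as leverage ratios typically are. A one-step sketch using PROC STDIZE, whose REPONLY option replaces only the missing values (the output dataset name is illustrative):

proc stdize data=commercial_data out=commercial_median_imputed
            method=median reponly;
   var debt_to_equity_ratio;   /* missing values get the median; observed values are untouched */
run;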
Method 3: Multiple Imputation
proc mi data=commercial_data out=mi_data nimpute=5 seed=12345;
   class industry_code;
   var debt_to_equity_ratio revenue total_assets industry_code;
   fcs reg(debt_to_equity_ratio = revenue total_assets industry_code);
run;

/* Fit the model separately on each of the 5 completed datasets;
   EVENT='1' makes PROC LOGISTIC model the probability of default
   rather than its default of the lowest ordered response level */
proc logistic data=mi_data;
   by _imputation_;
   class industry_code;
   model default_flag(event='1') = debt_to_equity_ratio revenue
         total_assets industry_code;
   ods output ParameterEstimates=estimates;
run;

/* Pool the estimates across imputations; CLASSVAR=CLASSVAL is needed
   because the CLASS effect levels sit in the ClassVal0 column */
proc mianalyze parms(classvar=classval)=estimates;
   class industry_code;
   modeleffects debt_to_equity_ratio revenue total_assets industry_code;
run;
Question 3
Variable Transformation and Binning
For PD modeling, we often need to transform continuous variables. How would you use SAS to create optimal binning for a continuous variable like "annual revenue" while ensuring monotonic relationship with default rates?
Answer
Method 1: Using PROC HPBIN for Initial Binning
proc hpbin data=commercial_data numbin=10;
   input annual_revenue;
   target default_flag / level=nominal;
   output out=binned_data;
run;
Method 2: Manual Binning with Monotonicity Check
proc rank data=commercial_data out=ranked groups=10;
   var annual_revenue;
   ranks revenue_decile;
run;

proc sql;
   create table bin_analysis as
   select revenue_decile,
          count(*)            as n_obs,
          sum(default_flag)   as n_defaults,
          mean(default_flag)  as default_rate,
          min(annual_revenue) as min_revenue,
          max(annual_revenue) as max_revenue
   from ranked
   group by revenue_decile
   order by revenue_decile;
quit;

/* Check monotonicity: default rates should fall as revenue deciles rise */
data monotonic_check;
   set bin_analysis;
   lag_default_rate = lag(default_rate);
   if _n_ > 1 then do;
      if default_rate <= lag_default_rate then monotonic_flag = 1;   /* trend holds */
      else monotonic_flag = 0;   /* violation: merge with an adjacent bin and recheck */
   end;
run;
Statistical Significance Testing
/* Test statistical significance of the binned variable */
proc freq data=final_bins;
   tables revenue_bin*default_flag / chisq expected;
run;

/* Cochran-Armitage trend test for a monotone pattern across bins */
proc freq data=final_bins;
   tables revenue_bin*default_flag / trend;
run;
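Monotonicity in PD binning is usually judged on the weight-of-evidence scale, so a WOE and information-value calculation built on the bin_analysis table from Method 2 is a natural companion. A minimal sketch (assuming default_flag = 1 marks a default; bins with zero defaults or zero non-defaults would need smoothing before taking logs):

proc sql;
   create table woe_iv as
   select revenue_decile,
          n_defaults / (select sum(n_defaults) from bin_analysis)
             as pct_bad,
          (n_obs - n_defaults) / (select sum(n_obs - n_defaults) from bin_analysis)
             as pct_good,
          log(calculated pct_good / calculated pct_bad) as woe,
          (calculated pct_good - calculated pct_bad) * calculated woe
             as iv_contrib
   from bin_analysis;

   /* Total information value; above 0.1 is conventionally read as medium strength */
   select sum(iv_contrib) as information_value from woe_iv;
quit;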
Question 5
Model Development - Logistic Regression
You're building a logistic regression model for probability of default. Walk me through how you'd use SAS to:
  • Perform variable selection
  • Check for multicollinearity
  • Validate model assumptions
  • Generate model performance statistics
Answer
Variable Selection
/* Method 1: Stepwise selection; EVENT='1' models the probability of default */
proc logistic data=modeling_data plots=all;
   class industry_code risk_rating;
   model default_flag(event='1') = debt_ratio revenue_growth total_assets
         industry_code risk_rating age_years
         / selection=stepwise sle=0.05 sls=0.10;
   output out=pred_data predicted=pred_prob;
run;

/* Method 2: LASSO selection */
proc hpgenselect data=modeling_data;
   class industry_code risk_rating;
   model default_flag(event='1') = debt_ratio revenue_growth total_assets
         industry_code risk_rating age_years
         / distribution=binomial link=logit;
   selection method=lasso;
run;
Multicollinearity Check
/* VIF and collinearity diagnostics: the response here is only a placeholder,
   since VIF depends solely on the relationships among the predictors */
proc reg data=modeling_data;
   model default_flag = debt_ratio revenue_growth total_assets age_years
         / vif collin;
run;

proc corr data=modeling_data;
   var debt_ratio revenue_growth total_assets age_years;
run;
Final Model with Diagnostics
proc logistic data=modeling_data plots=(roc influence dfbetas);
   class industry_code;
   model default_flag(event='1') = debt_ratio revenue_growth industry_code
         / lackfit rsq;   /* LACKFIT requests the Hosmer-Lemeshow goodness-of-fit test */
   /* One OUTPUT statement collects predictions and influence diagnostics;
      H= is the leverage keyword and DFBETAS=_ALL_ covers every parameter.
      The ROC curve itself is produced by the PLOTS= request above. */
   output out=diagnostics predicted=pred_prob predprobs=individual
          reschi=pearson_res resdev=deviance_res h=leverage dfbetas=_all_;
run;
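For the performance statistics the question asks about, two standard credit-risk metrics need little extra code: the KS statistic, via PROC NPAR1WAY on the scored diagnostics dataset, and the Gini coefficient, derived from the c-statistic (AUC) in the Association table of PROC LOGISTIC using Gini = 2*AUC - 1. A sketch (refitting the model here only to keep it self-contained):

/* KS statistic: the two-sample Kolmogorov-Smirnov D is the maximum
   separation between the score distributions of goods and bads */
proc npar1way data=diagnostics edf;
   class default_flag;
   var pred_prob;
run;

/* Gini from the c row of the Association table */
ods output Association=assoc;
proc logistic data=modeling_data;
   class industry_code;
   model default_flag(event='1') = debt_ratio revenue_growth industry_code;
run;

data gini;
   set assoc;
   where label2 = 'c';        /* the c row holds the AUC */
   gini = 2*nvalue2 - 1;
   keep gini;
run;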
Question 7
Macro Programming
You need to run the same PD model on 10 different industry segments. How would you create a SAS macro to automate this process? What parameters would your macro accept, and what outputs would it generate?
Answer
Basic Industry Segment Macro
%macro industry_pd_model(
   input_data=,              /* Input dataset name */
   industry_var=,            /* Industry grouping variable */
   industry_value=,          /* Specific industry to model */
   target_var=default_flag,  /* Target variable */
   model_vars=,              /* Space-separated list of model variables */
   output_prefix=model_      /* Prefix for output datasets */
);

   /* Filter data for the requested industry */
   data industry_data;
      set &input_data;
      where &industry_var = "&industry_value";
   run;

   /* Check sample size before modeling */
   proc sql noprint;
      select count(*) into :nobs trimmed
      from industry_data;
   quit;

   %if &nobs < 100 %then %do;
      %put WARNING: Sample size (&nobs) too small for industry &industry_value;
      %goto exit;
   %end;

   /* Fit the PD model; OUTEST= belongs on the PROC statement, not MODEL */
   proc logistic data=industry_data
                 outest=&output_prefix.params_&industry_value;
      model &target_var(event='1') = &model_vars;
      output out=&output_prefix.scored_&industry_value predicted=pred_prob;
      store &output_prefix.model_&industry_value;
   run;

%exit:
%mend industry_pd_model;
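To cover all ten segments hands-off, a small driver macro can pull the distinct industry values and call %industry_pd_model in a loop. A sketch, assuming the industry codes are character values that are also valid SAS name tokens (so they can be embedded in dataset names):

%macro run_all_industries(input_data=, industry_var=, model_vars=);
   /* Collect the distinct industry values into numbered macro variables */
   proc sql noprint;
      select distinct &industry_var
         into :ind1- :ind999
         from &input_data;
      %let n_ind = &sqlobs;
   quit;

   %do i = 1 %to &n_ind;
      %industry_pd_model(
         input_data=&input_data,
         industry_var=&industry_var,
         industry_value=&&ind&i,
         model_vars=&model_vars
      );
   %end;
%mend run_all_industries;

%run_all_industries(input_data=modeling_data, industry_var=industry_code,
                    model_vars=debt_ratio revenue_growth total_assets);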
Key Features:
  • Automated sample-size check that writes a WARNING to the log and skips the segment
  • Standardized output naming through the output_prefix parameter
  • Model storage (STORE statement) for later scoring and deployment - see the PROC PLM sketch below
  • Flexible parameterization for different targets, variable lists, and segments
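Because each fitted model is saved as an item store, a new portfolio can be scored later without refitting. A minimal sketch with PROC PLM, assuming the macro's default output_prefix and a Technology segment (the item store model_model_Technology and the new_loans dataset are illustrative names):

proc plm restore=model_model_Technology;
   /* ILINK applies the inverse logit, returning a probability of default */
   score data=new_loans out=new_loans_scored predicted=pd / ilink;
run;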

Evaluation Criteria

Strong Candidates Should Demonstrate:

  • Proficiency with PROC SQL, DATA steps, and key analytical procedures (PROC LOGISTIC, PROC HPBIN, etc.)
  • Deep understanding of statistical concepts relevant to credit modeling (weight of evidence, information value, AUC, etc.)
  • Advanced SAS macro programming skills for automation and efficiency
  • Comprehensive awareness of data quality issues and model validation requirements
  • Experience with performance optimization techniques for large datasets
  • Thorough understanding of regulatory requirements in banking (Basel, IFRS 9)

Red Flags:

  • Cannot explain basic DATA step processing or BY-group processing
  • Unfamiliar with key modeling procedures like PROC LOGISTIC or validation techniques
  • No experience with SAS macros or automation approaches
  • Cannot discuss sophisticated missing data handling strategies beyond simple imputation
  • Lacks understanding of advanced model performance metrics (Gini, Kolmogorov-Smirnov, Population Stability Index)
  • No knowledge of regulatory model requirements or validation standards