SAS Interview Guide

Probability of Default Modeling Analyst Position

Topics: Data Manipulation, Statistical Modeling, SAS Programming, Risk Management
Question 1
Data Manipulation and Preparation
You have a dataset with customer loan information spanning 5 years. Some customers have multiple loans. How would you create a dataset that shows each customer's total exposure, number of active loans, and earliest loan origination date? Walk me through the SAS code structure you'd use.
Answer
Method 1: Using PROC SQL
proc sql;
   create table customer_summary as
   select customer_id,
          sum(loan_amount)      as total_exposure,
          count(*)              as num_active_loans,
          min(origination_date) as earliest_loan_date
   from loan_data
   where loan_status = 'Active'
   group by customer_id;
quit;
Method 2: Using DATA Step
proc sort data=loan_data;
   by customer_id;
run;

data customer_summary;
   set loan_data;
   by customer_id;
   retain total_exposure num_loans earliest_date;
   /* Reset the accumulators at the start of each customer's group */
   if first.customer_id then do;
      total_exposure = 0;
      num_loans = 0;
      earliest_date = .;
   end;
   if loan_status = 'Active' then do;
      total_exposure + loan_amount;   /* sum statements retain automatically */
      num_loans + 1;
      if earliest_date = . or origination_date < earliest_date then
         earliest_date = origination_date;
   end;
   /* One summary row per customer; unlike the SQL version, customers with
      no active loans are still output, with zero exposure */
   if last.customer_id then output;
   keep customer_id total_exposure num_loans earliest_date;
run;
Duplicate Handling
proc sort data=loan_data nodupkey dupout=duplicates;
   by customer_id loan_id;
run;

proc freq data=loan_data;
   tables customer_id / noprint out=dup_check;
run;
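The frequency output by itself doesn't isolate anything; filtering on its COUNT column (created automatically by the OUT= option) lists the customers that appear on more than one row, which here may be legitimate multi-loan customers or true duplicates worth inspecting:

proc print data=dup_check;
   where count > 1;   /* customers with more than one record */
run;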
Question 2
Missing Data Handling
In our commercial portfolio, financial ratios often have missing values due to incomplete financial statements. Describe 3 different approaches you could use in SAS to handle missing values for a debt-to-equity ratio variable, and explain when you'd use each approach.
Answer
Three approaches:
  • Complete Case Analysis: Remove observations with missing values - use when data is MCAR (missing completely at random) and the sample size is large
  • Mean/Median Imputation: Replace with a measure of central tendency - use for stable variables with few missing values
  • Multiple Imputation: Create multiple plausible values - use when data is MAR (missing at random) and preserving uncertainty is important
Method 1: Complete Case Analysis
data analysis_set;
   set commercial_data;
   if not missing(debt_to_equity_ratio);
run;
Method 2: Mean Imputation
proc means data=commercial_data mean noprint;
   var debt_to_equity_ratio;
   output out=means mean=de_ratio_mean;
run;

data commercial_imputed;
   if _n_ = 1 then set means;   /* load the mean once; it is retained across iterations */
   set commercial_data;
   if missing(debt_to_equity_ratio) then debt_to_equity_ratio = de_ratio_mean;
run;
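The approach list above also mentions median imputation, which is preferable when the ratio is skewed, as leverage ratios typically are. A one-step sketch using PROC STDIZE, whose REPONLY option replaces only the missing values (the output dataset name is illustrative):

proc stdize data=commercial_data out=commercial_median_imputed
            method=median reponly;
   var debt_to_equity_ratio;   /* missing values get the median; observed values are untouched */
run;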
Method 3: Multiple Imputation
proc mi data=commercial_data out=mi_data nimpute=5 seed=12345;
   class industry_code;
   var debt_to_equity_ratio revenue total_assets industry_code;
   fcs reg(debt_to_equity_ratio = revenue total_assets industry_code);
run;

/* Fit the model separately on each of the 5 completed datasets;
   EVENT='1' makes PROC LOGISTIC model the probability of default
   rather than its default of the lowest ordered response level */
proc logistic data=mi_data;
   by _imputation_;
   class industry_code;
   model default_flag(event='1') = debt_to_equity_ratio revenue
         total_assets industry_code;
   ods output ParameterEstimates=estimates;
run;

/* Pool the estimates across imputations; CLASSVAR=CLASSVAL is needed
   because the CLASS effect levels sit in the ClassVal0 column */
proc mianalyze parms(classvar=classval)=estimates;
   class industry_code;
   modeleffects debt_to_equity_ratio revenue total_assets industry_code;
run;
Question 3
Variable Transformation and Binning
For PD modeling, we often need to transform continuous variables. How would you use SAS to create optimal binning for a continuous variable like "annual revenue" while ensuring monotonic relationship with default rates?
Answer
Method 1: Using PROC HPBIN for Initial Binning
proc hpbin data=commercial_data numbin=10;
   input annual_revenue;
   target default_flag / level=nominal;
   output out=binned_data;
run;
Method 2: Manual Binning with Monotonicity Check
proc rank data=commercial_data out=ranked groups=10;
   var annual_revenue;
   ranks revenue_decile;
run;

proc sql;
   create table bin_analysis as
   select revenue_decile,
          count(*)            as n_obs,
          sum(default_flag)   as n_defaults,
          mean(default_flag)  as default_rate,
          min(annual_revenue) as min_revenue,
          max(annual_revenue) as max_revenue
   from ranked
   group by revenue_decile
   order by revenue_decile;
quit;

/* Check monotonicity: default rates should fall as revenue deciles rise */
data monotonic_check;
   set bin_analysis;
   lag_default_rate = lag(default_rate);
   if _n_ > 1 then do;
      if default_rate <= lag_default_rate then monotonic_flag = 1;   /* trend holds */
      else monotonic_flag = 0;   /* violation: merge with an adjacent bin and recheck */
   end;
run;
Statistical Significance Testing
/* Test statistical significance of the binned variable */
proc freq data=final_bins;
   tables revenue_bin*default_flag / chisq expected;
run;

/* Cochran-Armitage trend test for a monotone pattern across bins */
proc freq data=final_bins;
   tables revenue_bin*default_flag / trend;
run;
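Monotonicity in PD binning is usually judged on the weight-of-evidence scale, so a WOE and information-value calculation built on the bin_analysis table from Method 2 is a natural companion. A minimal sketch (assuming default_flag = 1 marks a default; bins with zero defaults or zero non-defaults would need smoothing before taking logs):

proc sql;
   create table woe_iv as
   select revenue_decile,
          n_defaults / (select sum(n_defaults) from bin_analysis)
             as pct_bad,
          (n_obs - n_defaults) / (select sum(n_obs - n_defaults) from bin_analysis)
             as pct_good,
          log(calculated pct_good / calculated pct_bad) as woe,
          (calculated pct_good - calculated pct_bad) * calculated woe
             as iv_contrib
   from bin_analysis;

   /* Total information value; above 0.1 is conventionally read as medium strength */
   select sum(iv_contrib) as information_value from woe_iv;
quit;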
Question 5
Model Development - Logistic Regression
You're building a logistic regression model for probability of default. Walk me through how you'd use SAS to:
  • Perform variable selection
  • Check for multicollinearity
  • Validate model assumptions
  • Generate model performance statistics
Answer
Variable Selection
/* Method 1: Stepwise selection; EVENT='1' models the probability of default */
proc logistic data=modeling_data plots=all;
   class industry_code risk_rating;
   model default_flag(event='1') = debt_ratio revenue_growth total_assets
         industry_code risk_rating age_years
         / selection=stepwise sle=0.05 sls=0.10;
   output out=pred_data predicted=pred_prob;
run;

/* Method 2: LASSO selection */
proc hpgenselect data=modeling_data;
   class industry_code risk_rating;
   model default_flag(event='1') = debt_ratio revenue_growth total_assets
         industry_code risk_rating age_years
         / distribution=binomial link=logit;
   selection method=lasso;
run;
Multicollinearity Check
/* VIF and collinearity diagnostics: the response here is only a placeholder,
   since VIF depends solely on the relationships among the predictors */
proc reg data=modeling_data;
   model default_flag = debt_ratio revenue_growth total_assets age_years
         / vif collin;
run;

proc corr data=modeling_data;
   var debt_ratio revenue_growth total_assets age_years;
run;
Final Model with Diagnostics
proc logistic data=modeling_data plots=(roc influence dfbetas);
   class industry_code;
   model default_flag(event='1') = debt_ratio revenue_growth industry_code
         / lackfit rsq;   /* LACKFIT requests the Hosmer-Lemeshow goodness-of-fit test */
   /* One OUTPUT statement collects predictions and influence diagnostics;
      H= is the leverage keyword and DFBETAS=_ALL_ covers every parameter.
      The ROC curve itself is produced by the PLOTS= request above. */
   output out=diagnostics predicted=pred_prob predprobs=individual
          reschi=pearson_res resdev=deviance_res h=leverage dfbetas=_all_;
run;
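For the performance statistics the question asks about, two standard credit-risk metrics need little extra code: the KS statistic, via PROC NPAR1WAY on the scored diagnostics dataset, and the Gini coefficient, derived from the c-statistic (AUC) in the Association table of PROC LOGISTIC using Gini = 2*AUC - 1. A sketch (refitting the model here only to keep it self-contained):

/* KS statistic: the two-sample Kolmogorov-Smirnov D is the maximum
   separation between the score distributions of goods and bads */
proc npar1way data=diagnostics edf;
   class default_flag;
   var pred_prob;
run;

/* Gini from the c row of the Association table */
ods output Association=assoc;
proc logistic data=modeling_data;
   class industry_code;
   model default_flag(event='1') = debt_ratio revenue_growth industry_code;
run;

data gini;
   set assoc;
   where label2 = 'c';        /* the c row holds the AUC */
   gini = 2*nvalue2 - 1;
   keep gini;
run;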
Question 7
Macro Programming
You need to run the same PD model on 10 different industry segments. How would you create a SAS macro to automate this process? What parameters would your macro accept, and what outputs would it generate?
Answer
Basic Industry Segment Macro
%macro industry_pd_model(
   input_data=,              /* Input dataset name */
   industry_var=,            /* Industry grouping variable */
   industry_value=,          /* Specific industry to model */
   target_var=default_flag,  /* Target variable */
   model_vars=,              /* Space-separated list of model variables */
   output_prefix=model_      /* Prefix for output datasets */
);

   /* Filter data for the requested industry */
   data industry_data;
      set &input_data;
      where &industry_var = "&industry_value";
   run;

   /* Check sample size before modeling */
   proc sql noprint;
      select count(*) into :nobs trimmed
      from industry_data;
   quit;

   %if &nobs < 100 %then %do;
      %put WARNING: Sample size (&nobs) too small for industry &industry_value;
      %goto exit;
   %end;

   /* Fit the PD model; OUTEST= belongs on the PROC statement, not MODEL */
   proc logistic data=industry_data
                 outest=&output_prefix.params_&industry_value;
      model &target_var(event='1') = &model_vars;
      output out=&output_prefix.scored_&industry_value predicted=pred_prob;
      store &output_prefix.model_&industry_value;
   run;

%exit:
%mend industry_pd_model;
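To cover all ten segments hands-off, a small driver macro can pull the distinct industry values and call %industry_pd_model in a loop. A sketch, assuming the industry codes are character values that are also valid SAS name tokens (so they can be embedded in dataset names):

%macro run_all_industries(input_data=, industry_var=, model_vars=);
   /* Collect the distinct industry values into numbered macro variables */
   proc sql noprint;
      select distinct &industry_var
         into :ind1- :ind999
         from &input_data;
      %let n_ind = &sqlobs;
   quit;

   %do i = 1 %to &n_ind;
      %industry_pd_model(
         input_data=&input_data,
         industry_var=&industry_var,
         industry_value=&&ind&i,
         model_vars=&model_vars
      );
   %end;
%mend run_all_industries;

%run_all_industries(input_data=modeling_data, industry_var=industry_code,
                    model_vars=debt_ratio revenue_growth total_assets);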
Key Features:
  • Automated sample-size check that writes a WARNING to the log and skips the segment
  • Standardized output naming through the output_prefix parameter
  • Model storage (STORE statement) for later scoring and deployment - see the PROC PLM sketch below
  • Flexible parameterization for different targets, variable lists, and segments
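Because each fitted model is saved as an item store, a new portfolio can be scored later without refitting. A minimal sketch with PROC PLM, assuming the macro's default output_prefix and a Technology segment (the item store model_model_Technology and the new_loans dataset are illustrative names):

proc plm restore=model_model_Technology;
   /* ILINK applies the inverse logit, returning a probability of default */
   score data=new_loans out=new_loans_scored predicted=pd / ilink;
run;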

Evaluation Criteria

Strong Candidates Should Demonstrate:

  • Proficiency with PROC SQL, DATA steps, and key analytical procedures (PROC LOGISTIC, PROC HPBIN, etc.)
  • Deep understanding of statistical concepts relevant to credit modeling (weight of evidence, information value, AUC, etc.)
  • Advanced SAS macro programming skills for automation and efficiency
  • Comprehensive awareness of data quality issues and model validation requirements
  • Experience with performance optimization techniques for large datasets
  • Thorough understanding of regulatory requirements in banking (Basel, IFRS 9)

Red Flags:

  • Cannot explain basic DATA step processing or BY-group processing
  • Unfamiliar with key modeling procedures like PROC LOGISTIC or validation techniques
  • No experience with SAS macros or automation approaches
  • Cannot discuss sophisticated missing data handling strategies beyond simple imputation
  • Lacks understanding of advanced model performance metrics (Gini, Kolmogorov-Smirnov, Population Stability Index)
  • No knowledge of regulatory model requirements or validation standards