SAS Interview Questions for PD Modeling Analyst Role

1. Data Step Basics

Question

Explain the difference between a MERGE and a SET statement in a DATA step. Provide an example of when you would use each in the context of preparing data for a probability of default (PD) model, such as combining customer credit history with loan performance data.

Answer

  • SET Statement: Concatenates datasets vertically, stacking observations from multiple datasets with similar structures. It is used when combining data with the same variables, such as appending historical loan performance data from different time periods.
  • MERGE Statement: Combines datasets horizontally, matching observations based on a common key (e.g., customer ID). It is used to join datasets with different variables, such as merging credit history with loan performance.

Example:

/* SET: Append loan performance data from 2023 and 2024 */
data loan_performance;
   set loan_data_2023 loan_data_2024;
run;

/* MERGE: Combine credit history with default flags by customer ID */
data pd_data;
   merge credit_history (in=a) loan_defaults (in=b);
   by customer_id;
   if a and b; /* Keep only matched records */
run;

In PD modeling, SET is useful for creating a longitudinal dataset of loan performance, while MERGE is ideal for enriching borrower profiles with default outcomes.

2. PROC SQL Usage

Question

Write a PROC SQL query to join two datasets: one containing borrower financial ratios and another with default flags. Include a condition to filter for commercial loans originated after 2020, and calculate the average debt-to-income ratio for defaulted vs. non-defaulted borrowers.

Answer

proc sql;
   create table summary as
   select d.default_flag,
          mean(f.debt_to_income) as avg_dti
   from financial_ratios f
   inner join loan_defaults d
      on f.customer_id = d.customer_id
   where f.loan_type = 'COMMERCIAL'
     and f.origination_date >= '01JAN2021'd /* originated after 2020 */
   group by d.default_flag;
quit;

This query joins the datasets on customer_id, filters for commercial loans post-2020, and calculates the average debt-to-income ratio by default status. This is critical for understanding financial risk factors in PD models.

3. Handling Missing Values

Question

Describe how you would identify and impute missing values in a dataset of commercial loan attributes using SAS procedures. What considerations would you have for imputation methods in a PD modeling context to avoid biasing the model?

Answer

  • Identification: Use PROC MEANS or PROC FREQ to summarize missing values.
  • Imputation: Use PROC STDIZE with the REPONLY option for mean/median imputation (so only missing values are replaced, rather than standardizing every value), or PROC MI for multiple imputation to preserve variability.
  • Considerations: In PD modeling, imputation should avoid introducing bias. For example, imputing missing credit scores with the mean might underestimate risk for high-risk borrowers. Multiple imputation or model-based methods are preferred to account for uncertainty.

Example:

/* Identify missing values */
proc means data=loan_attributes n nmiss;
   var credit_score debt_to_income loan_amount;
run;

/* Impute missing values with the median; REPONLY replaces
   only missing values instead of standardizing every value */
proc stdize data=loan_attributes out=loan_attributes_imputed
   method=median reponly;
   var credit_score debt_to_income loan_amount;
run;

Choose imputation methods that align with the variable’s distribution and model assumptions to maintain PD model accuracy.
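
For the PROC MI approach mentioned above, a minimal sketch follows; the NIMPUTE and SEED values are illustrative assumptions, not recommendations.

/* Multiple imputation sketch; NIMPUTE and SEED are illustrative choices */
proc mi data=loan_attributes out=loan_attributes_mi nimpute=5 seed=12345;
   var credit_score debt_to_income loan_amount;
run;

The output stacks the five completed datasets with an _Imputation_ index variable; downstream model estimates are typically combined with PROC MIANALYZE.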

4. Macros in SAS

Question

Write a simple SAS macro that takes a dataset name and a variable (e.g., loan amount) as parameters, and generates summary statistics (mean, median, min, max) for that variable, grouped by industry sector. Explain how this macro could be useful in exploratory data analysis for PD models.

Answer

%macro summary_stats(dataset, variable);
   proc means data=&dataset n mean median min max;
      class industry_sector;
      var &variable;
      output out=stats_&variable;
   run;
%mend summary_stats;

/* Call macro */
%summary_stats(loan_attributes, loan_amount);

This macro automates the generation of summary statistics for key variables across industry sectors, enabling quick identification of risk patterns (e.g., higher loan amounts in certain sectors may correlate with higher defaults). It streamlines exploratory data analysis for PD model development.

5. PROC LOGISTIC for Modeling

Question

Outline the steps to build a basic logistic regression model using PROC LOGISTIC for predicting probability of default, including variable selection options like stepwise and how to interpret the output odds ratios in a banking risk context.

Answer

  • Steps:
    1. Prepare data: Ensure variables are clean and relevant.
    2. Run PROC LOGISTIC with stepwise selection to identify significant predictors.
    3. Interpret output: Odds ratios indicate the change in odds of default for a unit increase in a predictor, holding others constant.

Example:

proc logistic data=pd_data;
   class industry_sector (ref='RETAIL') / param=ref; /* reference-cell coding */
   model default_flag (event='1') = credit_score debt_to_income loan_amount industry_sector
      / selection=stepwise slentry=0.05 slstay=0.05;
   output out=pd_predictions pred=prob_default;
run;

An odds ratio of 1.5 for debt_to_income means a one-unit increase in the ratio multiplies the odds of default by 1.5 (a 50% increase), holding other variables constant; a protective predictor such as credit_score would typically show an odds ratio below 1. This helps quantify risk factors for commercial loans.
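
To carry the odds ratios into a reporting dataset, one option is to capture the OddsRatios ODS table; the dataset name or_estimates is an illustrative choice.

/* Capture the odds-ratio table as a dataset; or_estimates is an illustrative name */
ods output OddsRatios=or_estimates;
proc logistic data=pd_data;
   class industry_sector (ref='RETAIL') / param=ref;
   model default_flag (event='1') = credit_score debt_to_income loan_amount industry_sector;
run;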

6. Arrays and Looping

Question

Demonstrate how to use arrays in a DATA step to create multiple lagged variables (e.g., payment history over the last 6 months) from a time-series dataset of commercial account payments. Why might this be important for feature engineering in PD models?

Answer

data payment_lags;
   set payment_history; /* assumed sorted by customer_id and payment date */
   by customer_id;
   array payments{6} payment_1 - payment_6;
   retain payment_1 - payment_6;
   /* Reset the lag window at the start of each customer,
      before shifting, so the current payment is preserved */
   if first.customer_id then do i = 1 to 6;
      payments{i} = .;
   end;
   /* Shift prior payments back one slot to create lags */
   do i = 6 to 2 by -1;
      payments{i} = payments{i-1};
   end;
   payments{1} = payment_amount;
   drop i;
run;

Lagged variables capture temporal patterns (e.g., missed payments over time), which are critical for predicting default risk in PD models, as recent payment behavior often indicates financial distress.

7. Efficiency with Large Datasets

Question

How would you optimize SAS code to handle a very large dataset (e.g., millions of commercial loan records) efficiently? Discuss techniques like indexing, using WHERE clauses, or BY-group processing, and provide an example related to filtering high-risk portfolios.

Answer

  • Techniques:
    • Indexing: Create indexes on key variables (e.g., customer_id) to speed up joins.
    • WHERE Clauses: Filter data early to reduce processing.
    • BY-Group Processing: Use sorted datasets to process groups efficiently (see the sketch after the example below).

Example:

/* Create index; QUIT ends the interactive procedure */
proc datasets library=work nolist;
   modify loan_attributes;
   index create customer_id;
run;
quit;

/* Filter high-risk portfolios using WHERE */
data high_risk;
   set loan_attributes;
   where credit_score < 600 and loan_type = 'COMMERCIAL';
run;
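For the BY-group technique referenced above, a minimal sketch follows; report_date is an assumed snapshot-date column, not a name from the original datasets.

/* BY-group processing: keep the most recent record per customer;
   report_date is an assumed column name */
proc sort data=loan_attributes out=loan_sorted;
   by customer_id report_date;
run;

data latest_loans;
   set loan_sorted;
   by customer_id;
   if last.customer_id; /* one row per customer */
run;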

These techniques reduce processing time for large datasets, enabling efficient analysis of high-risk commercial portfolios in PD modeling.

8. Output Delivery System (ODS)

Question

Explain how to use ODS to export the results of a PROC FREQ analysis (e.g., frequency of defaults by credit rating) into an Excel file with customized formatting. Include sample code and discuss its utility in reporting PD model insights to stakeholders.

Answer

ods excel file='default_by_rating.xlsx' options(sheet_name='Default_Analysis' embedded_titles='yes');
proc freq data=pd_data;
   tables credit_rating * default_flag / nocol norow nopercent;
   title 'Default Frequency by Credit Rating';
run;
ods excel close;

ODS Excel allows formatted output for stakeholder reports, making it easy to share PD model insights (e.g., default rates by credit rating) in a professional, accessible format.

9. Formats and Informats

Question

Create a custom format in SAS to categorize commercial borrowers into risk bands (e.g., low, medium, high) based on their credit scores. Show how to apply this format in a DATA step and explain its role in simplifying data visualization for PD model validation.

Answer

proc format;
   value riskband
      low - 600 = 'High Risk'
      601 - 700 = 'Medium Risk'
      701 - high = 'Low Risk';
run;

data pd_data_formatted;
   set pd_data;
   format credit_score riskband.;
run;

Custom formats simplify visualization (e.g., in PROC FREQ tables or plots) by grouping continuous variables like credit scores into interpretable categories, aiding PD model validation and stakeholder communication.
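
Applied in a reporting step, the attached format groups the table automatically, since PROC FREQ classifies by formatted values; a minimal sketch:

/* Default rates by risk band; PROC FREQ groups by formatted values */
proc freq data=pd_data_formatted;
   tables credit_score * default_flag / nocol nopercent;
run;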

10. Debugging and Error Handling

Question

Describe common SAS errors you might encounter when building a PD model (e.g., during data merging or model fitting) and how to debug them using options like MPRINT or SYMBOLGEN. Provide an example of troubleshooting a logistic model that fails due to multicollinearity.

Answer

  • Common Errors:
    • Merging: Mismatched keys causing missing data.
    • Model Fitting: Multicollinearity causing unstable estimates in PROC LOGISTIC.
  • Debugging: Use MPRINT to trace the code a macro generates, SYMBOLGEN to inspect macro variable resolution, and check the log for notes and warnings about model fitting (e.g., convergence problems, which can signal collinearity).

Example:

options mprint symbolgen;
proc logistic data=pd_data;
   model default_flag (event='1') = credit_score debt_to_income loan_amount;
run;

If the model fails due to multicollinearity (e.g., high correlation between debt_to_income and loan_amount), check the correlation matrix with PROC CORR and remove or combine correlated variables to stabilize the PD model.
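
A minimal diagnostic sketch along those lines is shown below; running PROC REG on a binary outcome is used here only as a rough VIF screen, not as a formal logistic diagnostic.

/* Pairwise correlations among candidate predictors */
proc corr data=pd_data;
   var credit_score debt_to_income loan_amount;
run;

/* Rough multicollinearity screen via variance inflation factors */
proc reg data=pd_data;
   model default_flag = credit_score debt_to_income loan_amount / vif;
run;
quit;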