SAS Interview Questions

Essential questions for a Commercial Bank Analyst role

1. Data Loading and Cleaning

Question: Describe the SAS code you'd use to read a large dataset from a comma-separated values (CSV) file. How would you handle missing values for key financial variables like revenue and debt-to-equity ratio? Please provide a code snippet for handling missing values.

Answer:

A candidate should mention using PROC IMPORT or a DATA step with an INFILE statement to read a CSV file. For a very large dataset, the DATA step is often preferred for the control it offers over informats, lengths, and which records are read.

To handle missing values, a common approach is imputation. For financial variables, one might impute missing values with the mean, the median, or a specific business-driven value (such as 0, where that makes sense). The snippet below reads the file with a DATA step and then uses PROC STDIZE to replace missing revenue values with the median.

/* Using a DATA step to read the CSV */
DATA new_data;
    INFILE 'path/to/your/data.csv' DLM=',' DSD FIRSTOBS=2; /* DSD treats consecutive commas as missing values */
    INPUT customer_id $ revenue debt_to_equity;
RUN;

/* Impute missing revenue with the median; REPONLY replaces missing values only */
PROC STDIZE DATA=new_data OUT=new_data METHOD=MEDIAN REPONLY;
    VAR revenue;
RUN;
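
If the file has a clean header row, the same load can also be done with PROC IMPORT; a minimal sketch using the same hypothetical path:

/* Alternative: let PROC IMPORT infer variable names and types from the CSV */
PROC IMPORT DATAFILE='path/to/your/data.csv'
            OUT=new_data
            DBMS=CSV
            REPLACE;
    GETNAMES=YES;
RUN;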
2. Data Merging

Question: You have two datasets: one with customer demographic information (demographics) and another with their historical loan performance (loan_history). Both datasets share a common identifier, customer_id. Write the SAS code to merge these two datasets to create a single table. What is the difference between a LEFT JOIN and a RIGHT JOIN in SAS, and when would you use each?

Answer:

Datasets can be merged with either PROC SQL or a DATA step MERGE. PROC SQL is often more intuitive for those familiar with SQL.

LEFT JOIN returns all rows from the left table (demographics) and the matched rows from the right table (loan_history). If there's no match, the columns from the right table will have missing values. This is useful when you want to keep all customer records, even if they don't have a loan history.

RIGHT JOIN returns all rows from the right table and the matched rows from the left. This is useful if you want to analyze all loan history records, even those without a corresponding customer demographic profile.

/* Using PROC SQL to perform a LEFT JOIN */
PROC SQL;
    CREATE TABLE merged_data AS
    SELECT 
        A.customer_id,
        A.demographic_var1,
        B.loan_amount,
        B.loan_status
    FROM 
        demographics AS A
    LEFT JOIN 
        loan_history AS B 
    ON 
        A.customer_id = B.customer_id;
QUIT;
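
The answer also mentions the DATA step; a MERGE-based sketch of the same left join (assuming both datasets are first sorted by customer_id) could look like this:

/* DATA step equivalent: both inputs must be sorted by customer_id first */
PROC SORT DATA=demographics; BY customer_id; RUN;
PROC SORT DATA=loan_history; BY customer_id; RUN;

DATA merged_data;
    MERGE demographics (IN=in_demo)
          loan_history (IN=in_loan);
    BY customer_id;
    IF in_demo; /* keep every customer, with or without loan history */
RUN;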
3. Data Aggregation

Question: You need to create a summary table that shows the total number of loans and the average loan amount for each loan product type. How would you accomplish this using SAS, and what specific procedure would you use? Provide sample code.

Answer:

The candidate should mention using PROC SUMMARY or PROC MEANS for data aggregation. These procedures are highly efficient for summarizing large datasets.

/* Using PROC SUMMARY to aggregate data */
PROC SUMMARY DATA=loan_history NWAY; /* NWAY keeps only the per-product rows (drops the overall total) */
    CLASS loan_product_type;
    VAR loan_amount;
    OUTPUT OUT=summary_table
        N=total_loans
        MEAN=avg_loan_amount;
RUN;
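
The same summary can also be produced with PROC SQL; a sketch using the same dataset and variable names, where COUNT(loan_amount) mirrors the N statistic above by counting non-missing loan amounts:

/* PROC SQL equivalent of the PROC SUMMARY call above */
PROC SQL;
    CREATE TABLE summary_table_sql AS
    SELECT loan_product_type,
           COUNT(loan_amount) AS total_loans,
           MEAN(loan_amount)  AS avg_loan_amount
    FROM loan_history
    GROUP BY loan_product_type;
QUIT;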
4. Logistic Regression

Question: When building a probability of default model, logistic regression is a common technique. How would you run a logistic regression in SAS using the PROC LOGISTIC procedure? What key outputs would you look at to assess the model's performance and the significance of the independent variables?

Answer:

The candidate should describe using PROC LOGISTIC. The MODEL statement defines the dependent and independent variables.
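
A minimal call might look like the sketch below; the dataset and predictor names are hypothetical, and EVENT='1' tells the procedure to model the probability that default = 1.

/* Sketch of a PD model fit; model_data and the predictors are hypothetical */
PROC LOGISTIC DATA=model_data;
    MODEL default(EVENT='1') = debt_to_equity revenue time_in_business;
RUN;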

Key outputs to assess performance include:

  • Model Fit Statistics table: Provides metrics such as AIC (Akaike Information Criterion) and SC (Schwarz Criterion), which help compare the fit of different models. Lower values indicate a better fit.
  • Odds Ratio Estimates: Shows the multiplicative change in the odds of the event (default) for a one-unit increase in the independent variable.
  • P-values: Found in the Analysis of Maximum Likelihood Estimates table. A low p-value (typically < 0.05) indicates the variable is statistically significant.
  • Concordance: The concordance statistic tells you the percentage of pairs of observations (one defaulted, one not) where the model correctly predicts the defaulted one has a higher probability of default. A value above 70% is generally good.
5. Model Validation and Performance Metrics

Question: After building your model, how would you validate its performance? Describe at least two key metrics you would use to evaluate a PD model's predictive power. Specifically, how would you interpret the Gini coefficient and the Kolmogorov-Smirnov (KS) statistic?

Answer:

Model validation is typically done on a hold-out sample that was not used to train the model.

Two key metrics are listed below (a short SAS sketch for obtaining both follows the list):

  • Gini Coefficient (or Area Under the ROC Curve - AUC): The Gini coefficient measures the model's discriminatory power. A value of 0 indicates no discriminatory power (random chance), while a value of 1.0 indicates perfect separation of good and bad customers. In credit risk, a Gini in the 0.40 to 0.60 range is often considered to indicate a good model.

  • Note: The Gini coefficient and the AUC measure the same thing on different scales. AUC ranges from 0.5 (random chance) to 1.0 (perfect prediction): an AUC of 0.5 means the model is no better than flipping a coin, while 1.0 means it perfectly separates defaulters from non-defaulters. The Gini coefficient is simply a rescaled AUC, running from 0 (random chance) to 1 (perfect prediction), which many people find more intuitive because 0 corresponds directly to a useless model. The two are linked by a simple formula: Gini = (2 * AUC) - 1, so knowing one gives you the other. For example, a model with an AUC of 0.75 has a Gini coefficient of (2 * 0.75) - 1 = 0.50. Both metrics tell the same story about predictive power; in credit risk the Gini is often preferred because its scale is more straightforward for business stakeholders.

  • Kolmogorov-Smirnov (KS) Statistic: The KS statistic measures the maximum difference between the cumulative distributions of the good and bad customers. A higher KS value indicates a better separation between the two groups. It helps to identify the optimal cutoff point for classifying customers.
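
One way to obtain both numbers in SAS is sketched below, assuming a scored hold-out dataset holdout that contains the observed outcome default and the model score pd_score (both names are hypothetical). Refitting a logistic regression on the score alone is a common shortcut for reading off the c statistic (AUC), which converts to Gini via Gini = (2 * AUC) - 1; PROC NPAR1WAY with the EDF option reports the two-sample KS statistic.

/* c statistic (AUC) of the score on the hold-out sample; Gini = (2 * AUC) - 1 */
PROC LOGISTIC DATA=holdout;
    MODEL default(EVENT='1') = pd_score;
RUN;

/* Two-sample KS: maximum gap between the score distributions of goods and bads */
PROC NPAR1WAY DATA=holdout EDF;
    CLASS default;
    VAR pd_score;
RUN;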
6. Out-of-Time Validation

Question: Explain the concept of out-of-time (OOT) validation. Why is OOT validation particularly important for credit risk models, and how would you implement it in SAS?

Answer:

Out-of-Time (OOT) validation involves testing the model's performance on data from a future time period that was not included in the model development. For example, if the model was trained on data from 2018-2022, OOT validation would use data from 2023.

OOT validation is crucial for credit risk models because it tests for model stability and predictive power over time. A model that performs well on historical data may fail to predict future defaults due to changes in economic conditions, regulations, or business practices. Implementing it in SAS simply involves partitioning your dataset into a training set and a separate, later-date validation set before running your model.
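
A minimal sketch of that partitioning and scoring is shown below, assuming a hypothetical snapshot_date variable and the 2018-2022 development / 2023 OOT split described above (the predictor names mirror the macro example later in this document).

/* Split into development (2018-2022) and out-of-time (2023) samples */
DATA train oot;
    SET all_data;
    IF 2018 <= YEAR(snapshot_date) <= 2022 THEN OUTPUT train;
    ELSE IF YEAR(snapshot_date) = 2023 THEN OUTPUT oot;
RUN;

/* Fit on the development sample and store the fitted model */
PROC LOGISTIC DATA=train OUTMODEL=pd_model;
    MODEL default(EVENT='1') = variable1 variable2;
RUN;

/* Score the OOT sample with the stored model for later performance checks */
PROC LOGISTIC INMODEL=pd_model;
    SCORE DATA=oot OUT=oot_scored;
RUN;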

7. Macros

Question: You need to run the same analysis for multiple loan portfolios (e.g., small business, large corporate, etc.). How would you use a SAS macro to automate this process, making your code more efficient and reusable? Provide a simple example of a macro definition and how to call it.

Answer:

SAS macros are a powerful way to automate repetitive tasks and make code more dynamic. They are essentially a block of code stored under a specific name that can be executed repeatedly with different parameters.

/* Macro Definition */
%MACRO run_model(portfolio_type);

    /* Filter data for the specific portfolio */
    DATA specific_portfolio;
        SET all_data;
        WHERE loan_product_type = "&portfolio_type";
    RUN;

    /* Run logistic regression on the filtered data */
    PROC LOGISTIC DATA=specific_portfolio;
        MODEL default = variable1 variable2;
    RUN;

%MEND;

/* Macro Calls */
%run_model(small_business);
%run_model(large_corporate);
8. Arrays

Question: You have 100 variables, all named var1 to var100, and you need to calculate the average of each variable for every customer. How can you use a SAS array to do this efficiently without writing 100 separate lines of code?

Answer:

An array allows you to refer to a group of variables by a single name and an index. This is extremely useful for repetitive tasks, such as calculating the average of many variables without writing a separate line of code for each.

DATA want;
    SET have;

    /* Define an array for the 100 variables */
    ARRAY my_vars(*) var1-var100;

    /* Loop through the array, summing the non-missing values */
    total_sum = 0;
    n_nonmiss = 0;
    DO i = 1 TO DIM(my_vars);
        IF my_vars(i) NE . THEN DO;
            total_sum = total_sum + my_vars(i);
            n_nonmiss = n_nonmiss + 1;
        END;
    END;

    /* Average across the non-missing variables for each customer */
    IF n_nonmiss > 0 THEN avg_of_vars = total_sum / n_nonmiss;

    DROP i total_sum n_nonmiss;
RUN;
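
A strong candidate might also note that for a simple row-wise mean, the MEAN function with an OF list does the same job in one line and skips missing values automatically: avg_of_vars = MEAN(OF var1-var100). The array-and-loop pattern becomes essential when the per-variable logic is more involved (capping, recoding, flagging, and so on).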
9. Stored Processes and Enterprise Guide

Question: Your team needs to automate the monthly model run and generate a standard report. How would you use SAS Enterprise Guide or SAS Stored Processes to achieve this? What are the advantages of using these tools over running code manually?

Answer:

SAS Enterprise Guide (EG) is a point-and-click interface that generates SAS code behind the scenes. It's great for building and scheduling processes without extensive coding. A candidate would use it to create a project flow that automates the data pull, model scoring, and report generation.

A SAS Stored Process is a SAS program that is published and executed from a centralized server. It allows for the automation of a specific task (like running a model) with parameters.

The advantage of these tools is automation, standardization, and governance. They ensure the model is run consistently on schedule, using the same code every time, and the results can be easily shared across the organization. This reduces manual errors and ensures auditability.

10. Final Model Interpretation

Question: Let's say your final model has a variable called time_since_last_delinquency with an odds ratio of 1.15. How would you explain this result to a business stakeholder who is not a statistician? What does this value imply about the relationship between the variable and the probability of default?

Answer:

An odds ratio of 1.15 for time_since_last_delinquency means that for every one-unit increase in the variable, the odds of default increase by 15%.
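
For example, using hypothetical numbers: if a customer's current odds of default are 0.10 (roughly a 9% probability of default), a one-unit increase in time_since_last_delinquency multiplies those odds by 1.15, giving 0.115 (roughly a 10% probability), with all other factors held constant.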

To a business stakeholder: holding all other factors constant, each additional unit of time (for example, one more year) since a customer's last delinquency is associated with 15% higher odds of default. Because this direction runs against the usual intuition that more recent delinquencies signal higher risk, a strong candidate would also note that the sign should be sanity-checked (variable coding, correlations with other predictors, data quality) before the result is used to inform loan approval decisions or pricing.