Goal: Analyze each variable in isolation to understand its distribution, central tendency, and spread.
For Numerical Variables (e.g., age, income, loan_amount):
Visualization: Histograms, Kernel Density Plots (KDE), and Boxplots.
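import matplotlib.pyplot as plt

# Create a 1x3 grid of plots (df is assumed to be an already-loaded pandas DataFrame)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
df['income'].plot.hist(bins=30, ax=axes[0], title='Histogram of Income')
df['income'].plot.kde(ax=axes[1], title='KDE of Income')  # KDE plotting requires SciPy
df['income'].plot.box(ax=axes[2], title='Boxplot of Income')
plt.tight_layout()  # keep titles and axis labels from overlapping
plt.show()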
What to Look For:
- Shape: Normal, skewed left/right, bimodal?
- Central Tendency: Mean vs. median (a large gap between them suggests skew or outliers)
- Spread: Standard deviation, IQR
- Example: income is right-skewed. Most values cluster around $50k, with a long tail of very high incomes.
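The items above can also be checked numerically. The following is a minimal sketch, assuming a small made-up income series in place of df['income']; the values and the 1.5 x IQR outlier rule are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for df['income'] (values are made up)
income = pd.Series(
    [35_000, 42_000, 48_000, 50_000, 52_000, 55_000, 61_000, 250_000],
    name="income",
)

# Central tendency: a mean well above the median hints at right skew
mean_income = income.mean()
median_income = income.median()

# Spread: standard deviation and interquartile range
std_income = income.std()
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1

# Shape: positive skewness means a longer right tail
skewness = income.skew()

# Flag outliers with the common 1.5 * IQR rule (an assumed convention)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = income[(income < lower_fence) | (income > upper_fence)]

print(f"mean={mean_income:.0f}, median={median_income:.0f}")
print(f"std={std_income:.0f}, IQR={iqr:.0f}, skew={skewness:.2f}")
print("outliers:", outliers.tolist())
```

On real data, replace the toy series with df['income'] and drop or impute missing values first, since they affect which rows are counted.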
For Categorical Variables (e.g., marital_status, loan_purpose):
Visualization: Bar charts (count of each category).
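import matplotlib.pyplot as plt
import seaborn as sns

# Count rows per category, then plot the counts as bars
status_counts = df['loan_status'].value_counts()
plt.figure(figsize=(6, 4))
sns.barplot(x=status_counts.index, y=status_counts.values)
plt.title('Count of Loan Status')
plt.show()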
What to Look For:
- Cardinality: Number of unique categories
- Class Imbalance: Are some categories far more frequent than others (e.g., 8,500 "Paid" vs. 1,500 "Defaulted")?
- Rare Categories: Categories with few samples
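These three checks can be expressed in a few lines of pandas. This is a rough sketch under stated assumptions: the loan_status series reproduces the 8,500 / 1,500 split mentioned above, while the loan_purpose values and the 2% rarity threshold are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy loan_status column reproducing the 8,500 / 1,500 split from the text
status = pd.Series(["Paid"] * 8500 + ["Defaulted"] * 1500, name="loan_status")

# Hypothetical loan_purpose column (category names and counts are made up)
purpose = pd.Series(
    ["debt_consolidation"] * 600 + ["car"] * 390 + ["wedding"] * 10,
    name="loan_purpose",
)

# Cardinality: number of unique categories
n_status = status.nunique()
n_purpose = purpose.nunique()

# Class imbalance: share of rows per category
status_shares = status.value_counts(normalize=True)

# Rare categories: anything below an assumed 2% share of rows
RARE_THRESHOLD = 0.02
purpose_shares = purpose.value_counts(normalize=True)
rare_categories = purpose_shares[purpose_shares < RARE_THRESHOLD].index.tolist()

print("unique statuses:", n_status, "| unique purposes:", n_purpose)
print(status_shares)
print("rare purposes:", rare_categories)
```

The threshold is a judgment call. A suitable cutoff depends on the dataset size and on how the categories will be used later.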
Check:
- Distribution of each variable understood?
- Skewness and outliers noted?
- Target variable imbalance identified?