Bank Churn Prediction Analysis

Decision Tree & Random Forest Implementation for Customer Retention

Project Overview & Process Flow

1. Preprocessing
Data import, exploration, and quality assessment
Key Activities:
  • Loaded bank churn dataset with 10,000+ records
  • Identified class imbalance: 79.18% no churn vs 20.82% churn
  • Handled missing values and placeholder characters ('?' replaced with NaN)
  • Performed correlation analysis, revealing Age (0.33) as the strongest single correlate of churn (see the sketch below)
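
A minimal sketch of this step, assuming the data ships as a CSV named bank_churn.csv with a binary Churn column (both names are hypothetical; substitute the actual file and column names):

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names; adjust to the actual dataset.
df = pd.read_csv("bank_churn.csv")

# Replace the '?' placeholder characters with proper NaN markers.
df = df.replace("?", np.nan)

# Quantify the class imbalance (roughly 79.18% no churn vs 20.82% churn here).
print(df["Churn"].value_counts(normalize=True))

# Correlate numeric features with the target; in this analysis Age (~0.33)
# was the strongest single correlate of churn.
numeric = df.select_dtypes(include="number")
print(numeric.corr()["Churn"].sort_values(ascending=False))
```
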
2. Data Transformations
Feature engineering and preprocessing pipeline
Transformation Pipeline:
  • Missing value imputation using median for numerical features
  • One-hot encoding for categorical variables (Location, NumOfProducts)
  • Standard scaling of numerical features to zero mean and unit variance
  • Train-test split (80/20) with stratification to preserve the 79:21 class balance (see the sketch below)
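
A sketch of the pipeline under the same assumptions as above; the column groupings are inferred from the feature names mentioned in this report:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("bank_churn.csv")  # hypothetical file name
X, y = df.drop(columns=["Churn"]), df["Churn"]

# Column groups are assumptions based on the features named in this report.
categorical = ["Location", "NumOfProducts"]
numerical = [c for c in X.columns if c not in categorical]

preprocess = ColumnTransformer([
    # Median imputation, then standardisation to zero mean / unit variance.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numerical),
    # One-hot encoding; unknown categories at test time are ignored.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# 80/20 split, stratified to preserve the 79:21 class ratio,
# with the transformer fitted on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train = preprocess.fit_transform(X_train)
X_test = preprocess.transform(X_test)
```
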
3. Simple Decision Trees
Basic tree models with Gini and Entropy criteria
Model Configurations:
  • Gini criterion: 78.7% accuracy, 0.678 AUC
  • Entropy criterion: 79.4% accuracy, 0.705 AUC
  • Both models performed poorly on the minority class (0.49 F1-score)
  • Applied class_weight='balanced' to counteract the imbalance (see the sketch below)
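
A sketch of the two baseline trees, reusing the train/test arrays from the preprocessing sketch above:

```python
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# One unpruned tree per splitting criterion, with balanced class weights
# to counteract the 79:21 imbalance.
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion,
                                  class_weight="balanced",
                                  random_state=42)
    tree.fit(X_train, y_train)
    acc = accuracy_score(y_test, tree.predict(X_test))
    auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
    print(f"{criterion}: accuracy={acc:.3f}, AUC={auc:.3f}")
```
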
4. Hyperparameter Tuning
Optimising max_depth and min_samples_leaf parameters
Tuning Results:
  • Grid search across max_depth values 1-10 and min_samples_leaf values 1-10
  • Best configuration: max_depth=6, min_samples_leaf=4
  • Achieved 85.7% accuracy with improved generalisation
  • Reduced overfitting whilst maintaining predictive power (see the sketch below)
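
A sketch of the grid search; the AUC scoring objective and 5-fold cross-validation are assumptions, since the report states the grid but not the scoring rule:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": list(range(1, 11)),         # depth values 1-10
    "min_samples_leaf": list(range(1, 11)),  # leaf sizes 1-10
}
search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="roc_auc",  # assumed objective
    cv=5,               # assumed fold count
)
search.fit(X_train, y_train)
print(search.best_params_)  # the report found max_depth=6, min_samples_leaf=4
best_tree = search.best_estimator_
```
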
5. Random Forest
Ensemble method implementation and evaluation
Ensemble Configuration:
  • 100 decision trees with bootstrap sampling
  • Random feature subsets at each split to decorrelate trees and reduce overfitting
  • Achieved the best overall performance: 86.1% accuracy
  • Highest precision (73.6%) and F1-score (60.9%) of all models (see the sketch below)
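
A sketch of the ensemble; n_estimators and bootstrap sampling come from the report, while max_features and class_weight are assumed:

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,         # 100 trees, as in the report
    bootstrap=True,           # each tree sees a bootstrap sample of rows
    max_features="sqrt",      # random feature subset at every split
    class_weight="balanced",  # assumption: same imbalance handling as the trees
    random_state=42,
)
forest.fit(X_train, y_train)
print(f"test accuracy: {forest.score(X_test, y_test):.3f}")  # ~0.861 reported
```
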
6. Model Comparison
Comprehensive evaluation and selection
Final Assessment:
  • Evaluated models across five key metrics: accuracy, ROC AUC, F1 score, precision, and recall
  • Random Forest emerged as the best all-round performer
  • Best Tuned Tree showed the highest recall (69.5%)
  • Business context determines the optimal model choice (see the comparison sketch below)
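
A sketch of the comparison step, computing the five metrics in the table below for the models fitted in the sketches above:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

for name, model in {"Random Forest": forest, "Best Tuned Tree": best_tree}.items():
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: "
          f"accuracy={accuracy_score(y_test, pred):.3f} "
          f"AUC={roc_auc_score(y_test, proba):.3f} "
          f"F1={f1_score(y_test, pred):.3f} "
          f"precision={precision_score(y_test, pred):.3f} "
          f"recall={recall_score(y_test, pred):.3f}")
```
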

Key Performance Metrics

  • Dataset Size: 10K+ customer records analysed
  • Class Imbalance: 79:21 (no churn vs churn)
  • Best Accuracy: 86.1% (Random Forest model)
  • Top AUC Score: 0.760 (Best Tuned Tree)

Model Comparison Results

Model                Accuracy   ROC AUC   F1 Score   Precision   Recall
Random Forest        0.861      0.740     0.609      0.736       0.530
Best Tuned Tree      0.846      0.760     0.589      0.694       0.695
Simple Tree (Gini)   0.787      0.678     0.490      0.490       0.490

Key Findings & Insights

Class Imbalance Impact

The 79:21 ratio significantly affected model performance, particularly for identifying churners. Balanced class weighting improved minority class detection.

Feature Importance

NumOfProducts_2.0 emerged as the most critical predictor (35% importance), followed by Age (25%) and Balance (10%).
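
A sketch of how these importances can be read off the fitted forest and transformer from the earlier sketches (feature names will carry 'num__'/'cat__' prefixes from the ColumnTransformer):

```python
import pandas as pd

# Map encoded feature names back to the forest's impurity-based importances.
names = preprocess.get_feature_names_out()
importances = pd.Series(forest.feature_importances_, index=names)
print(importances.sort_values(ascending=False).head())
# Reported top three: NumOfProducts_2.0 (~0.35), Age (~0.25), Balance (~0.10).
```
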

Model Performance Trade-offs

Random Forest excelled in precision and overall accuracy, whilst Best Tuned Tree showed superior recall for identifying actual churners.

Hyperparameter Impact

An optimal max_depth of 6 and min_samples_leaf of 4 balanced model complexity against generalisation capability.

  • Age Correlation: Older customers show higher churn probability (0.33 correlation), indicating potential dissatisfaction with digital services or different banking needs.
  • Product Portfolio Effect: Customers with 2 products show distinct behaviour patterns, suggesting this as an optimal engagement level for retention.
  • Activity Status: Active members demonstrate 19% lower churn rates, highlighting the importance of customer engagement initiatives.
  • Model Ensemble Benefits: Random Forest's ensemble approach provided better generalisation and reduced overfitting compared to single decision trees.

Business Implications & Strategic Recommendations

Strategic Implementation Framework

1. Model Deployment Strategy

Recommendation: Deploy Random Forest model for general churn prediction due to its superior precision (73.6%) and balanced performance across metrics.

Alternative: Use Best Tuned Tree for high-risk customer identification where maximising recall (69.5%) is critical to capture more potential churners.

2. Customer Segmentation Focus

Age-Based Targeting: Develop specialised retention programmes for customers aged 45+ who show highest churn propensity.

Product Portfolio Optimisation: Encourage customers towards the 2-product sweet spot through targeted cross-selling campaigns.

3. Operational Improvements

Data Quality Enhancement: Implement robust data collection processes to reduce missing values and improve model accuracy.

Real-Time Monitoring: Deploy models in production with continuous monitoring and monthly retraining schedules.

4. Advanced Analytics Roadmap

Ensemble Enhancement: Explore gradient boosting methods (XGBoost, LightGBM) for potentially improved performance on imbalanced datasets.

Feature Engineering: Develop interaction features between Age and product holdings to capture more nuanced customer behaviour patterns (see the sketch below).
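
A sketch of the proposed interaction features; the column names and the 45+ cut-off (taken from the segmentation recommendation above) are illustrative:

```python
import pandas as pd

df = pd.read_csv("bank_churn.csv")  # hypothetical file name

# Multiplicative interaction between age and product holdings.
df["Age_x_NumOfProducts"] = df["Age"] * df["NumOfProducts"]

# Flag crossing the 45+ age segment with the two-product sweet spot.
df["Senior_TwoProducts"] = ((df["Age"] >= 45) &
                            (df["NumOfProducts"] == 2)).astype(int)
```
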

5. Business Impact Measurement

ROI Calculation: Estimate that reducing false negatives by 20% could retain an additional 100+ customers annually, worth approximately £2.5M in lifetime value (100 × the £25,000 average).

Cost-Benefit Analysis: Balance precision vs recall based on cost of retention campaigns (£50-200 per customer) versus lifetime customer value (£25,000 average).
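
A worked version of the arithmetic behind these estimates, using the figures quoted above (the campaign cost is taken at its £200 upper bound):

```python
lifetime_value = 25_000   # average customer lifetime value (£)
campaign_cost = 200       # upper-bound retention cost per customer (£)
retained = 100            # extra customers retained per year

gross = retained * lifetime_value                   # £2,500,000, as quoted
# Simplification: assumes only the retained customers are contacted.
net = retained * (lifetime_value - campaign_cost)   # £2,480,000
print(f"gross: £{gross:,}  net: £{net:,}")
```
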

Implementation Priority Matrix

High Priority (0-3 months): Deploy Random Forest model, implement age-based customer segmentation

Medium Priority (3-6 months): Enhance data collection processes, develop product portfolio strategies

Long-term (6-12 months): Advanced ensemble methods, comprehensive feature engineering, automated retraining pipeline

Technical Summary

Data Characteristics

  • 10,000+ customer records
  • 10 predictive features
  • Binary classification target
  • Significant class imbalance addressed

Model Architecture

  • Decision Trees with Gini/Entropy
  • Random Forest (100 estimators)
  • Hyperparameter grid search
  • Cross-validation approach

Performance Achievements

  • 86.1% peak accuracy
  • 0.760 maximum AUC score
  • Balanced precision-recall trade-off
  • Robust model generalisation