Bank Churn Prediction Analysis

Decision Tree & Random Forest Implementation for Customer Retention

Project Overview & Process Flow

1. Preprocessing
Data import, exploration, and quality assessment
Key Activities:
  • Loaded bank churn dataset with 10,000+ records
  • Identified class imbalance: 79.18% no churn vs 20.82% churn
  • Handled missing values and placeholder characters ('?' replaced with NaN)
  • Performed correlation analysis, revealing Age (0.33) as the strongest single correlate of churn (see the sketch below)
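
A minimal sketch of this step, assuming the data ships as a CSV named bank_churn.csv with a binary Churn column (both names are hypothetical; substitute the actual file and column names):

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names; adjust to the actual dataset.
df = pd.read_csv("bank_churn.csv")

# Replace the '?' placeholder characters with proper NaN markers.
df = df.replace("?", np.nan)

# Quantify the class imbalance (roughly 79.18% no churn vs 20.82% churn here).
print(df["Churn"].value_counts(normalize=True))

# Correlate numeric features with the target; in this analysis Age (~0.33)
# was the strongest single correlate of churn.
numeric = df.select_dtypes(include="number")
print(numeric.corr()["Churn"].sort_values(ascending=False))
```
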
2. Data Transformations
Feature engineering and preprocessing pipeline
Transformation Pipeline:
  • Missing value imputation using median for numerical features
  • One-hot encoding for categorical variables (Location, NumOfProducts)
  • Standard scaling of numerical features to zero mean and unit variance
  • Train-test split (80/20) with stratification to preserve the 79:21 class balance (see the sketch below)
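
A sketch of the pipeline under the same assumptions as above; the column groupings are inferred from the feature names mentioned in this report:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("bank_churn.csv")  # hypothetical file name
X, y = df.drop(columns=["Churn"]), df["Churn"]

# Column groups are assumptions based on the features named in this report.
categorical = ["Location", "NumOfProducts"]
numerical = [c for c in X.columns if c not in categorical]

preprocess = ColumnTransformer([
    # Median imputation, then standardisation to zero mean / unit variance.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numerical),
    # One-hot encoding; unknown categories at test time are ignored.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# 80/20 split, stratified to preserve the 79:21 class ratio,
# with the transformer fitted on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train = preprocess.fit_transform(X_train)
X_test = preprocess.transform(X_test)
```
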
3. Simple Decision Trees
Basic tree models with Gini and Entropy criteria
Model Configurations:
  • Gini criterion: 78.7% accuracy, 0.678 AUC
  • Entropy criterion: 79.4% accuracy, 0.705 AUC
  • Both models performed poorly on the minority class (0.49 F1-score)
  • Applied class_weight='balanced' to counteract the imbalance (see the sketch below)
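
A sketch of the two baseline trees, reusing the train/test arrays from the preprocessing sketch above:

```python
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# One unpruned tree per splitting criterion, with balanced class weights
# to counteract the 79:21 imbalance.
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion,
                                  class_weight="balanced",
                                  random_state=42)
    tree.fit(X_train, y_train)
    acc = accuracy_score(y_test, tree.predict(X_test))
    auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
    print(f"{criterion}: accuracy={acc:.3f}, AUC={auc:.3f}")
```
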
4. Hyperparameter Tuning
Optimising max_depth and min_samples_leaf parameters
Tuning Results:
  • Grid search across max_depth values 1-10 and min_samples_leaf values 1-10
  • Best configuration: max_depth=6, min_samples_leaf=4
  • Achieved 85.7% accuracy with improved generalisation
  • Reduced overfitting whilst maintaining predictive power (see the sketch below)
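
A sketch of the grid search; the AUC scoring objective and 5-fold cross-validation are assumptions, since the report states the grid but not the scoring rule:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": list(range(1, 11)),         # depth values 1-10
    "min_samples_leaf": list(range(1, 11)),  # leaf sizes 1-10
}
search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="roc_auc",  # assumed objective
    cv=5,               # assumed fold count
)
search.fit(X_train, y_train)
print(search.best_params_)  # the report found max_depth=6, min_samples_leaf=4
best_tree = search.best_estimator_
```
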
5. Random Forest
Ensemble method implementation and evaluation
Ensemble Configuration:
  • 100 decision trees with bootstrap sampling
  • Random feature subsets at each split to decorrelate trees and reduce overfitting
  • Achieved the best overall performance: 86.1% accuracy
  • Highest precision (73.6%) and F1-score (60.9%) of all models (see the sketch below)
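
A sketch of the ensemble; n_estimators and bootstrap sampling come from the report, while max_features and class_weight are assumed:

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,         # 100 trees, as in the report
    bootstrap=True,           # each tree sees a bootstrap sample of rows
    max_features="sqrt",      # random feature subset at every split
    class_weight="balanced",  # assumption: same imbalance handling as the trees
    random_state=42,
)
forest.fit(X_train, y_train)
print(f"test accuracy: {forest.score(X_test, y_test):.3f}")  # ~0.861 reported
```
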
6. Model Comparison
Comprehensive evaluation and selection
Final Assessment:
  • Evaluated models across five key metrics: accuracy, ROC AUC, F1 score, precision, and recall
  • Random Forest emerged as the best all-round performer
  • Best Tuned Tree showed the highest recall (69.5%)
  • Business context determines the optimal model choice (see the comparison sketch below)
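
A sketch of the comparison step, computing the five metrics in the table below for the models fitted in the sketches above:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

for name, model in {"Random Forest": forest, "Best Tuned Tree": best_tree}.items():
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: "
          f"accuracy={accuracy_score(y_test, pred):.3f} "
          f"AUC={roc_auc_score(y_test, proba):.3f} "
          f"F1={f1_score(y_test, pred):.3f} "
          f"precision={precision_score(y_test, pred):.3f} "
          f"recall={recall_score(y_test, pred):.3f}")
```
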

Key Performance Metrics

  • Dataset Size: 10K+ customer records analysed
  • Class Imbalance: 79:21 (no churn vs churn)
  • Best Accuracy: 86.1% (Random Forest model)
  • Top AUC Score: 0.760 (Best Tuned Tree)

Model Comparison Results

Model                Accuracy   ROC AUC   F1 Score   Precision   Recall
Random Forest        0.861      0.740     0.609      0.736       0.530
Best Tuned Tree      0.846      0.760     0.589      0.694       0.695
Simple Tree (Gini)   0.787      0.678     0.490      0.490       0.490

Key Findings & Insights

Class Imbalance Impact

The 79:21 ratio significantly affected model performance, particularly for identifying churners. Balanced class weighting improved minority class detection.

Feature Importance

NumOfProducts_2.0 emerged as the most critical predictor (35% importance), followed by Age (25%) and Balance (10%).
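
A sketch of how these importances can be read off the fitted forest and transformer from the earlier sketches (feature names will carry 'num__'/'cat__' prefixes from the ColumnTransformer):

```python
import pandas as pd

# Map encoded feature names back to the forest's impurity-based importances.
names = preprocess.get_feature_names_out()
importances = pd.Series(forest.feature_importances_, index=names)
print(importances.sort_values(ascending=False).head())
# Reported top three: NumOfProducts_2.0 (~0.35), Age (~0.25), Balance (~0.10).
```
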

Model Performance Trade-offs

Random Forest excelled in precision and overall accuracy, whilst Best Tuned Tree showed superior recall for identifying actual churners.

Hyperparameter Impact

An optimal max_depth of 6 and min_samples_leaf of 4 balanced model complexity against generalisation capability.

  • Age Correlation: Older customers show higher churn probability (0.33 correlation), indicating potential dissatisfaction with digital services or different banking needs.
  • Product Portfolio Effect: Customers with 2 products show distinct behaviour patterns, suggesting this as an optimal engagement level for retention.
  • Activity Status: Active members demonstrate 19% lower churn rates, highlighting the importance of customer engagement initiatives.
  • Model Ensemble Benefits: Random Forest's ensemble approach provided better generalisation and reduced overfitting compared to single decision trees.

Business Implications & Strategic Recommendations

Strategic Implementation Framework

1. Model Deployment Strategy

Recommendation: Deploy Random Forest model for general churn prediction due to its superior precision (73.6%) and balanced performance across metrics.

Alternative: Use Best Tuned Tree for high-risk customer identification where maximising recall (69.5%) is critical to capture more potential churners.

2. Customer Segmentation Focus

Age-Based Targeting: Develop specialised retention programmes for customers aged 45+ who show highest churn propensity.

Product Portfolio Optimisation: Encourage customers towards the 2-product sweet spot through targeted cross-selling campaigns.

3. Operational Improvements

Data Quality Enhancement: Implement robust data collection processes to reduce missing values and improve model accuracy.

Real-Time Monitoring: Deploy models in production with continuous monitoring and monthly retraining schedules.

4. Advanced Analytics Roadmap

Ensemble Enhancement: Explore gradient boosting methods (XGBoost, LightGBM) for potentially improved performance on imbalanced datasets.

Feature Engineering: Develop interaction features between Age and product holdings to capture more nuanced customer behaviour patterns (see the sketch below).
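
A sketch of the proposed interaction features; the column names and the 45+ cut-off (taken from the segmentation recommendation above) are illustrative:

```python
import pandas as pd

df = pd.read_csv("bank_churn.csv")  # hypothetical file name

# Multiplicative interaction between age and product holdings.
df["Age_x_NumOfProducts"] = df["Age"] * df["NumOfProducts"]

# Flag crossing the 45+ age segment with the two-product sweet spot.
df["Senior_TwoProducts"] = ((df["Age"] >= 45) &
                            (df["NumOfProducts"] == 2)).astype(int)
```
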

5. Business Impact Measurement

ROI Calculation: Estimate that reducing false negatives by 20% could retain an additional 100+ customers annually, worth approximately £2.5M in lifetime value (100 × the £25,000 average).

Cost-Benefit Analysis: Balance precision vs recall based on cost of retention campaigns (£50-200 per customer) versus lifetime customer value (£25,000 average).
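
A worked version of the arithmetic behind these estimates, using the figures quoted above (the campaign cost is taken at its £200 upper bound):

```python
lifetime_value = 25_000   # average customer lifetime value (£)
campaign_cost = 200       # upper-bound retention cost per customer (£)
retained = 100            # extra customers retained per year

gross = retained * lifetime_value                   # £2,500,000, as quoted
# Simplification: assumes only the retained customers are contacted.
net = retained * (lifetime_value - campaign_cost)   # £2,480,000
print(f"gross: £{gross:,}  net: £{net:,}")
```
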

Implementation Priority Matrix

High Priority (0-3 months): Deploy Random Forest model, implement age-based customer segmentation

Medium Priority (3-6 months): Enhance data collection processes, develop product portfolio strategies

Long-term (6-12 months): Advanced ensemble methods, comprehensive feature engineering, automated retraining pipeline

Technical Summary

Data Characteristics

  • 10,000+ customer records
  • 10 predictive features
  • Binary classification target
  • Significant class imbalance addressed

Model Architecture

  • Decision Trees with Gini/Entropy
  • Random Forest (100 estimators)
  • Hyperparameter grid search
  • Cross-validation approach

Performance Achievements

  • 86.1% peak accuracy
  • 0.760 maximum AUC score
  • Balanced precision-recall trade-off
  • Robust model generalisation