Decision Tree ML Project - Visual Guide

Decision Tree Machine Learning Project

Exploring Tree-Based Models with SHAP Interpretation on Bank Marketing Data

Project Process Overview

1. Data Exploration

Load the UCI Bank Marketing dataset, analyse its structure, visualise the target distribution, and perform correlation analysis.

2. Data Transformations

Encode categorical variables using label encoding. Standardisation is not needed because tree-based models split on feature thresholds and are insensitive to feature scale.

3. Model Comparison

Train Decision Tree, Random Forest, AdaBoost, Gradient Boosting, and XGBoost models and evaluate them on the same data.

4. Hyperparameter Tuning

Optimise the Decision Tree using grid search combined with pre-pruning and post-pruning techniques.

5. SHAP Interpretation

Use SHAP values to interpret model predictions and understand feature importance.
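
The label encoding mentioned in the Data Transformations step can be sketched in plain Python. This is a minimal illustration, not the project's code: the helper name `label_encode` and the sample column values are hypothetical. Assigning integer codes to the sorted categories mirrors the behaviour of scikit-learn's `LabelEncoder`.

```python
def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    categories = sorted(set(values))
    mapping = {cat: code for code, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping


# Hypothetical sample of a categorical column (not taken from the project data).
marital = ["married", "single", "divorced", "married", "single"]
codes, mapping = label_encode(marital)

print(mapping)
print(codes)
```

The integer codes impose an arbitrary order on the categories. Tree-based models can still isolate a category with a sequence of threshold splits, which is why this simple encoding is often considered acceptable for them.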

Key Findings & Model Performance

Dataset Characteristics

Class Distribution (majority : minority): 88.6% : 11.4%
Imbalance Ratio (minority / majority): 0.13
Feature Types: Mixed numerical and categorical
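
The imbalance ratio follows directly from the class shares. The short calculation below reproduces it and also shows the accuracy a trivial "always predict the majority class" model would reach, which is one reason accuracy alone is a weak metric for this data. The variable names are illustrative.

```python
majority_share = 88.6  # percent of records in the majority class (from the summary)
minority_share = 11.4  # percent of records in the minority class (from the summary)

# Ratio of minority to majority class size.
imbalance_ratio = minority_share / majority_share

# Accuracy of a model that always predicts the majority class.
majority_baseline_accuracy = majority_share / 100

print(f"Imbalance ratio: {imbalance_ratio:.2f}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
```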

🏆 Best Performing Model

Model: Gradient Boosting
ROC-AUC: 0.9485
F1-Score (Class 1): 0.60
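
For readers unfamiliar with the two metrics reported here, the following is a minimal pure-Python sketch of how each can be computed. The labels and scores are made-up toy values, not project results, and the function names are hypothetical.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_positive(y_true, y_pred):
    """F1-score for class 1: harmonic mean of precision and recall."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: six negatives, two positives (hypothetical values).
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.40, 0.90]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # fixed 0.5 decision threshold

auc = roc_auc(y_true, scores)
f1 = f1_positive(y_true, y_pred)
print(f"ROC-AUC: {auc:.3f}, F1 (class 1): {f1:.2f}")
```

ROC-AUC depends only on how the scores rank positives against negatives, whereas F1 depends on the chosen decision threshold. That difference is one plausible reason a model can pair a high ROC-AUC with a more moderate class-1 F1-score on imbalanced data.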

Hyperparameter Tuning Impact

Baseline DT ROC-AUC: 0.738
Tuned DT ROC-AUC: 0.921
Relative Improvement: +24.8%
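
A tuning run of this kind might look like the sketch below, which uses scikit-learn's `GridSearchCV`. It is an illustration under stated assumptions rather than the project's actual configuration: the data is a synthetic stand-in generated with `make_classification`, and the parameter grid values are hypothetical. `max_depth` and `min_samples_leaf` act as pre-pruning controls, while `ccp_alpha` applies cost-complexity post-pruning.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded bank data (assumption), with a similar class skew.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.886, 0.114], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Baseline: an unconstrained tree with default settings.
baseline = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])

# Pre-pruning limits growth (max_depth, min_samples_leaf);
# post-pruning removes weak branches after growth (ccp_alpha).
param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 10, 30],
    "ccp_alpha": [0.0, 0.001, 0.005],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid, scoring="roc_auc", cv=5,
)
search.fit(X_train, y_train)
tuned_auc = roc_auc_score(y_test, search.best_estimator_.predict_proba(X_test)[:, 1])

print(search.best_params_)
print(f"baseline={baseline_auc:.3f} tuned={tuned_auc:.3f}")

# The reported improvement is relative to the baseline score.
reported_gain = (0.921 - 0.738) / 0.738
print(f"Reported relative improvement: {reported_gain:+.1%}")
```

The scores printed for the synthetic data will differ from the project's figures; only the final calculation uses the reported values.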

ROC-AUC Model Comparison

Gradient Boosting: 0.9485
Tuned Decision Tree: 0.9210
XGBoost: 0.9200
Random Forest: 0.9100
AdaBoost: 0.8800
Baseline Decision Tree: 0.7380
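
A comparison like this can be produced with a simple loop over candidate models. The sketch below is a hedged illustration, not the project's code: it uses synthetic data from `make_classification` as a stand-in for the bank dataset, default hyperparameters, and only the scikit-learn models (XGBoost lives in a separate package and is omitted here).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded bank data (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.886, 0.114], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model and score it on the held-out split using class-1 probabilities.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Report from best to worst ROC-AUC.
for name, auc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {auc:.4f}")
```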

SHAP Analysis: Feature Importance & Impact

Understanding which features drive model predictions and their directional impact

📞 Duration (Most Important)
Call duration has the strongest positive association with subscription success. Longer conversations markedly increase the predicted conversion likelihood.
📊 Employment Variation Rate
Economic indicator with strong predictive power. Higher rates typically correlate with lower subscription rates, indicating economic sensitivity.
👥 Number Employed
Labour market indicator affecting customer behaviour. Higher employment numbers generally reduce subscription likelihood.
📧 Campaign Contacts
The number of contacts made during the campaign shows moderate influence. Too many contacts can decrease conversion rates.
👤 Customer Age
Age demographics influence subscription patterns, with certain age groups showing higher conversion rates.
📅 Previous Contact Days
Time since last contact affects receptiveness. Optimal timing windows exist for follow-up communications.
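
To make the idea of a per-feature contribution concrete, the sketch below computes exact Shapley values by brute force for a tiny toy scoring function. It is not the `shap` library API and not the project's model: the function `toy_score`, its coefficients, and the customer and baseline values are all hypothetical, chosen only so that the signs echo the directions described above.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    features = list(instance)
    contributions = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)       # start from the reference customer
        previous = predict(current)
        for f in order:
            current[f] = instance[f]   # switch one feature to the actual value
            value = predict(current)
            contributions[f] += value - previous
            previous = value
    return {f: total / len(orderings) for f, total in contributions.items()}


def toy_score(x):
    """Hypothetical scoring function standing in for a trained model."""
    return 0.002 * x["duration"] - 0.05 * x["emp_var_rate"] - 0.01 * x["campaign"]


baseline = {"duration": 200, "emp_var_rate": 1.0, "campaign": 2}   # reference point
customer = {"duration": 500, "emp_var_rate": -1.0, "campaign": 5}  # instance to explain

phi = shapley_values(toy_score, customer, baseline)
for feature, value in phi.items():
    print(f"{feature}: {value:+.3f}")
```

The contributions sum to the difference between the customer's score and the baseline score, which is the additivity property that SHAP plots rely on. Brute-force enumeration grows factorially with the number of features, so practical tools use faster algorithms for tree models.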

Business Implications & Strategic Recommendations

🎯 Call Quality Over Quantity
Focus on extending meaningful conversations rather than increasing call volume. Train agents to engage customers in longer, value-driven discussions.
📈 Economic Timing Strategy
Monitor employment indicators to time campaigns optimally. Launch intensive campaigns during favourable economic conditions.
👥 Demographic Targeting
Develop age-specific marketing strategies based on conversion patterns identified through model analysis.
📞 Contact Optimisation
Implement contact frequency caps and develop optimal timing algorithms to avoid customer fatigue whilst maximising engagement.
🤖 Model Implementation
Deploy the Gradient Boosting model for real-time customer scoring to prioritise high-potential leads and personalise approaches.
📊 Continuous Monitoring
Establish model performance monitoring with regular retraining cycles to maintain prediction accuracy as market conditions evolve.

Project Conclusions

Key Success Metrics: Achieved a ROC-AUC of 0.9485 with Gradient Boosting, indicating strong discriminative performance on imbalanced marketing data. Hyperparameter tuning improved the baseline Decision Tree's ROC-AUC by 24.8% in relative terms (0.738 to 0.921).

Technical Achievements: Implemented a comprehensive machine learning pipeline covering class imbalance handling, feature encoding, and model interpretation. SHAP analysis revealed actionable insights about customer behaviour patterns.

Business Value: The model provides clear direction for marketing strategy optimisation, with call duration and economic indicators as primary conversion drivers. Implementation of these insights could significantly improve campaign effectiveness and ROI.

Next Steps: Deploy the Gradient Boosting model in a production environment, implement an A/B testing framework for validation, and establish monitoring systems for model drift detection and performance tracking.
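
One possible starting point for drift detection is the Population Stability Index (PSI), which compares the distribution of model scores at training time with the distribution seen in production. The sketch below is an assumption about how such a check could be written, not part of the project: the function name, the bin count, and the sample scores are hypothetical.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between two samples, using equal-width bins over the expected range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all expected values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical score samples: training-time scores and two production scenarios.
training_scores = [i / 100 for i in range(100)]
stable_scores = list(training_scores)
shifted_scores = [min(s + 0.3, 0.99) for s in training_scores]

psi_stable = population_stability_index(training_scores, stable_scores)
psi_shifted = population_stability_index(training_scores, shifted_scores)
print(f"PSI (stable): {psi_stable:.3f}")
print(f"PSI (shifted): {psi_shifted:.3f}")
```

A PSI of zero means the two distributions match bin for bin, and larger values indicate a bigger shift. The alert threshold that should trigger retraining is a choice for the team operating the model.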