Customer Segmentation with Clustering

Overview

Problem Statement: Understanding and serving customers are key retail marketing goals. This project applies clustering to segment customers from an e-commerce dataset (951,668 orders, 2012-2016) across five continents, using demographics, behavior, and purchase data to enhance marketing efficiency, customer retention, and resource allocation.

Approach: The process involves data exploration, preprocessing, feature engineering (e.g., Frequency, Recency, CLV), EDA with visualizations, clustering via K-means and hierarchical methods, and dimensionality reduction with PCA and t-SNE. A report summarizes insights for stakeholders.

Dataset Size: 951,668 orders
Time Period: 4 years (2012-2016)
Geographic Reach: 5 continents

Project Stages

1. Data Exploration

Imported the data, identified missing values (e.g., City, Postal Code), removed duplicates, and handled outliers (about 5% of records, flagged with Isolation Forest). Aggregated the orders to one row per customer.

2. Feature Engineering

Created Frequency (order count), Recency (days since last order), CLV (revenue - cost), Average Unit Cost, and Customer Age (from birthdates). Scaled the features before clustering.

3. Clustering & Reduction

Performed EDA, ran K-means (k=2 and k=5) and hierarchical clustering, and reduced dimensions with PCA and t-SNE to visualize the customer segments.
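
The data exploration stage above (missing values, duplicates, Isolation Forest outliers, per-customer aggregation) could be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code: the column names, the choice of features passed to Isolation Forest, the decision to drop flagged rows, and the `contamination=0.05` setting are all assumptions made to mirror the "5%" figure.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic order-level data; every column name here is hypothetical.
rng = np.random.default_rng(0)
n = 500
orders = pd.DataFrame({
    "customer_id": rng.integers(1, 60, size=n),
    "quantity": rng.integers(1, 5, size=n),
    "revenue": rng.gamma(2.0, 50.0, size=n).round(2),
    "city": rng.choice(np.array(["Leeds", "Lyon", None], dtype=object), size=n),
})
# Inject 20 exact duplicates so the de-duplication step has work to do.
orders = pd.concat([orders, orders.iloc[:20]], ignore_index=True)

# 1. Count missing values per column.
missing = orders.isna().sum()

# 2. Remove exact duplicate rows.
deduped = orders.drop_duplicates()

# 3. Flag roughly 5% of rows as outliers (assumed contamination rate).
iso = IsolationForest(contamination=0.05, random_state=0)
flags = iso.fit_predict(deduped[["quantity", "revenue"]])  # -1 = outlier, 1 = inlier
cleaned = deduped[flags == 1]  # assumption: flagged rows are dropped

# 4. Aggregate to one row per customer.
customers = (
    cleaned.groupby("customer_id")
    .agg(n_orders=("revenue", "size"), total_revenue=("revenue", "sum"))
    .reset_index()
)
```

Whether flagged rows should be dropped, capped, or only reviewed depends on the data; the source says only that outliers were "handled".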

Feature Engineering Process

Recency: Days since last purchase
Frequency: Number of orders
CLV: Revenue - Cost
Unit Cost: Average cost per item
Age: Customer age from birthdate
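
As a rough illustration of the five features listed above, the sketch below derives them from a tiny hand-made orders table with pandas. The column names, the snapshot date (one day after the last order), the definition of unit cost as total cost divided by total quantity, and the approximate age calculation (days divided by 365) are assumptions, not details taken from the project.

```python
import pandas as pd

# Tiny hypothetical orders table; all column names are illustrative.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "order_date": pd.to_datetime([
        "2016-01-05", "2016-03-01", "2015-06-10",
        "2015-07-01", "2016-02-20", "2014-11-30",
    ]),
    "quantity": [2, 1, 1, 3, 2, 4],
    "revenue": [40.0, 25.0, 10.0, 60.0, 30.0, 80.0],
    "cost": [30.0, 15.0, 8.0, 45.0, 20.0, 50.0],
    "birth_date": pd.to_datetime([
        "1980-03-02", "1980-03-02", "1992-08-15",
        "1992-08-15", "1992-08-15", "1965-01-20",
    ]),
})

# Reference date for Recency and Age (assumed: day after the last order).
snapshot = orders["order_date"].max() + pd.Timedelta(days=1)

features = orders.groupby("customer_id").agg(
    frequency=("order_date", "count"),   # number of orders
    last_order=("order_date", "max"),
    revenue=("revenue", "sum"),
    cost=("cost", "sum"),
    quantity=("quantity", "sum"),
    birth_date=("birth_date", "first"),
)
features["recency"] = (snapshot - features["last_order"]).dt.days
features["clv"] = features["revenue"] - features["cost"]            # revenue - cost
features["avg_unit_cost"] = features["cost"] / features["quantity"]  # cost per item
features["age"] = (snapshot - features["birth_date"]).dt.days // 365  # approximate years
features = features[["frequency", "recency", "clv", "avg_unit_cost", "age"]]
```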

Workflow

Start: Customer Segmentation
Part I: Initial Data Exploration
Import Libraries and Data
Preprocess Data: Check Missing, Duplicates, Outliers, Aggregate
Feature Engineering: Frequency, Recency, CLV, Unit Cost, Age
Part II: Clustering with ML Models
EDA and Visualisations
Column Transformer for Efficiency
Determine k: Elbow and Silhouette Methods
K-Means Clustering with Optimal k
Hierarchical Clustering and Dendrogram
Part III: Customer Segments
Assign Cluster to Customer ID
Boxplots: Frequency, Recency, CLV, Unit Cost, Age
Part IV: Dimensionality Reduction
PCA and t-SNE to 2D
2D Visualisation of Clusters
Document Approach and Insights
End: Generate Report & Submit
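
The "Column Transformer" step in the workflow above is not described in detail. One plausible reading, sketched here, is a scikit-learn `ColumnTransformer` that standardizes the numeric features and drops everything else before clustering. The use of `StandardScaler`, the column names, and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Synthetic per-customer table with hypothetical column names.
rng = np.random.default_rng(42)
customers = pd.DataFrame({
    "customer_id": np.arange(200),
    "frequency": rng.poisson(5, 200) + 1,
    "recency": rng.integers(1, 700, 200),
    "clv": rng.gamma(2.0, 150.0, 200),
    "avg_unit_cost": rng.gamma(3.0, 20.0, 200),
    "age": rng.integers(18, 75, 200),
})

numeric_cols = ["frequency", "recency", "clv", "avg_unit_cost", "age"]

# Scale the numeric features; the ID column is dropped, not scaled.
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), numeric_cols)],
    remainder="drop",
)
X = preprocess.fit_transform(customers)
```

Scaling matters here because K-means relies on Euclidean distance, so an unscaled feature with a large range (such as CLV) would otherwise dominate the clusters.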

Clustering Methods

Elbow & Silhouette

The elbow method suggested k=5, after which SSE declined only gradually. Silhouette scores peaked at k=2 (0.42), with a second, local peak at k=5 (0.35); values in this range are usually read as moderate rather than strong cluster separation.

Hierarchical Clustering

Applied to a 10,000-row sample and inspected with dendrograms; cutting at k=5 showed distinct splits (e.g., high CLV vs. low recency), supporting the five-segment solution.

K-Means Clustering

Tested k=2 (high vs. low CLV) and k=5 (varied CLV, age, recency); k=5 provided more nuanced segments for targeted marketing.
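
The hierarchical clustering step described above (a random sample, a dendrogram, a cut into five clusters) might look something like the sketch below. It uses SciPy with Ward linkage on synthetic data; the linkage method, the sample size of 500, and the data itself are assumptions, since the source does not state them.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic scaled features: 5 groups in 5 dimensions (illustrative only).
rng = np.random.default_rng(1)
centers = rng.normal(0, 6, size=(5, 5))
X = np.vstack([c + rng.normal(0, 1, size=(400, 5)) for c in centers])

# Hierarchical clustering scales poorly with n, so work on a random sample.
sample_idx = rng.choice(len(X), size=500, replace=False)
X_sample = X[sample_idx]

# Ward linkage is an assumption; the source does not name the method.
Z = linkage(X_sample, method="ward")

# Cut the tree into at most 5 flat clusters (labels start at 1).
labels = fcluster(Z, t=5, criterion="maxclust")

# Dendrogram structure, truncated to the last 5 merged groups.
# no_plot=True returns the layout without drawing; omit it to plot with matplotlib.
tree = dendrogram(Z, no_plot=True, truncate_mode="lastp", p=5)
```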

Silhouette Score Comparison

k=2: 0.42
k=3: 0.30
k=4: 0.32
k=5: 0.35
k=6: 0.29
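
A comparison like the one above can be produced by fitting K-means over a range of k and recording both the SSE (inertia, for the elbow plot) and the silhouette score. The sketch below does this on synthetic blobs, so its numbers will not match the project's; the range of k, `n_init=10`, and the data are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=600, centers=4, n_features=5,
                  cluster_std=1.0, random_state=0)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = {
        "sse": km.inertia_,                             # elbow method input
        "silhouette": silhouette_score(X, km.labels_),  # in [-1, 1]
    }

# k with the highest silhouette score on this synthetic data.
best_k = max(results, key=lambda k: results[k]["silhouette"])

for k, r in results.items():
    print(f"k={k}: SSE={r['sse']:.1f}, silhouette={r['silhouette']:.2f}")
```

The highest silhouette score need not be the final choice: in this project k=2 scored highest, but k=5 was preferred because it gave more actionable segments.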

Customer Segments (k=5)

Cluster 0: Mature Budget
Older, low CLV, variable unit cost

Cluster 1: Recent Explorers
Most recent purchases, low CLV

Cluster 2: Mid-Value Regulars
Moderate CLV, mixed age demographics

Cluster 3: Young VIPs
Young, high CLV, frequent purchases

Cluster 4: Dormant Midrange
Least recent, moderate unit cost

Figure: Frequency Distribution by Cluster (Clusters 0-4)
Figure: CLV Distribution by Cluster (Clusters 0-4)
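
Segment descriptions such as those above are typically derived by attaching the K-means label to each customer and summarizing the features per cluster. The following is a minimal sketch on synthetic data; the column names, the use of medians, and the random data are assumptions, so the resulting clusters will not correspond to the five segments named here.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic per-customer features with hypothetical column names.
rng = np.random.default_rng(7)
n = 300
customers = pd.DataFrame({
    "customer_id": np.arange(n),
    "frequency": rng.poisson(5, n) + 1,
    "recency": rng.integers(1, 700, n),
    "clv": rng.gamma(2.0, 150.0, n),
    "avg_unit_cost": rng.gamma(3.0, 20.0, n),
    "age": rng.integers(18, 75, n),
})
feature_cols = ["frequency", "recency", "clv", "avg_unit_cost", "age"]

# Scale, cluster with k=5, and attach the label to each customer ID.
X = StandardScaler().fit_transform(customers[feature_cols])
customers["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Per-cluster medians (the numbers behind a boxplot's center line) and sizes.
profile = customers.groupby("cluster")[feature_cols].median()
sizes = customers["cluster"].value_counts().sort_index()
```

With matplotlib installed, `customers.boxplot(column="clv", by="cluster")` would draw the corresponding boxplot for one feature.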

Dimensionality Reduction

PCA Visualization

The first two components explained 60.99% of the variance (PC1 37.84%, PC2 23.15%). PC1 was driven by Frequency and CLV, PC2 by Unit Cost and Age. Clusters separated but overlapped (e.g., 0 vs. 3).

Figure: PCA scatter plot. Axes: Principal Component 1 (37.84%), Principal Component 2 (23.15%).

t-SNE Visualization

Revealed non-linear structure and gave clearer separation than PCA (e.g., Cluster 3: high CLV, young), making the segments easier to see.

Figure: t-SNE scatter plot. Axes: t-SNE Component 1, t-SNE Component 2. Legend: Cluster 0 & 3, Cluster 1 & 4, Cluster 2.
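
The two projections discussed in this section can be reproduced in outline with scikit-learn. The sketch below reduces synthetic five-feature data to 2D with PCA and t-SNE; the data, `perplexity=30`, and `init="pca"` are assumptions, and the explained-variance figures it prints will differ from the 37.84% and 23.15% reported above.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five scaled customer features.
X_raw, labels = make_blobs(n_samples=300, centers=5, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X_raw)

# PCA: linear projection onto the two directions of highest variance.
pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X)
explained = pca.explained_variance_ratio_  # share of variance per component
print(f"PC1 {explained[0]:.2%}, PC2 {explained[1]:.2%}, total {explained.sum():.2%}")

# t-SNE: non-linear embedding that preserves local neighborhoods.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_tsne = tsne.fit_transform(X)

# X_pca and X_tsne can each be scatter-plotted, colored by cluster label.
```

Unlike PCA, t-SNE axes have no direct interpretation and distances between far-apart groups are not reliable, so it is best treated as a visualization aid for clusters found by another method.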

Conclusion

t-SNE gave a clearer 2D view of the k=5 clusters than PCA, showing segments that differ by CLV, age, and recency. Cluster 3 (young, high CLV, frequent) is the priority for retention; Cluster 0 (older, low CLV) needs re-engagement strategies. Future work should refine outlier handling and explore seasonal trends.

Marketing Recommendations

Cluster 0: Mature Budget

Offer loyalty incentives and personalized product recommendations to rebuild engagement and increase CLV.

Cluster 1: Recent Explorers

Capitalize on recency with follow-up communications and cross-sell opportunities to increase order values.

Cluster 3: Young VIPs

Implement premium service offerings and exclusive benefits to maintain high engagement and maximize lifetime value.

Effectiveness of Methods

K-means with k=5 balanced granularity and actionability, and hierarchical clustering validated the splits. For visualization, t-SNE's non-linear embedding was more useful for marketing purposes than PCA's linear projection, revealing structure that PCA could not show.