
Student Retention Prediction System

A supervised learning system that identifies students at risk of dropping out by combining academic performance data with demographic and engagement indicators, enabling institutions to intervene before it is too late.

Type: Supervised Learning
Domain: Higher Education
Methods: XGBoost, Neural Network
Status: Completed
[Project graphic: student retention prediction with XGBoost + SHAP, classifying students as "Retained" or "At Risk"]

The Challenge

Universities lose talented students every year because at-risk cases are not identified early enough for intervention. By the time a student formally withdraws, the decision has usually been building for months, signalled by patterns in attendance, grades, engagement, and personal circumstances.

The data to predict these outcomes often exists across fragmented systems but is rarely synthesised into a unified early-warning signal that student support teams can act on proactively.

Approach

01. Data Integration and EDA
Conducted phased data exploration across academic records, demographic data, and engagement metrics. Identified key features correlated with dropout behaviour through statistical analysis.
02. Feature Engineering
Built predictive features from raw data, including academic trajectory indicators, engagement decay rates, and socioeconomic risk factors.
03. Model Development
Built and compared XGBoost (for interpretable, gradient-boosted predictions) and a neural network (for capturing non-linear feature interactions) on the same feature set.
04. Evaluation and Interpretability
Compared models on accuracy, precision, and recall (prioritising recall to minimise missed at-risk students), and used SHAP values to explain which factors drive risk predictions.
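The model-comparison step above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: a synthetic imbalanced dataset stands in for the engineered features, and scikit-learn's GradientBoostingClassifier and MLPClassifier stand in for the XGBoost and TensorFlow models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Synthetic stand-in for the engineered feature set (~10% dropouts, class 1).
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

# Two candidate models trained on the same feature set.
models = {
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "neural_network": MLPClassifier(hidden_layer_sizes=(32, 16),
                                    max_iter=500, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Recall on the dropout class is the headline metric here.
    print(name,
          f"precision={precision_score(y_test, pred, zero_division=0):.2f}",
          f"recall={recall_score(y_test, pred):.2f}",
          f"f1={f1_score(y_test, pred):.2f}")
```

Comparing both models on identical splits and metrics is what makes the later "XGBoost is the recommended production model" conclusion defensible.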

Results

80.89% recall on the dropout class - correctly identifying four in five at-risk students
0.21 F1 score, up from a 0.12 baseline through iterative feature engineering and threshold tuning
SHAP-based explainable predictions, letting student support teams target interventions effectively

The core challenge in this dataset was severe class imbalance - dropout cases formed a small minority of total records, making them inherently difficult to detect. Initial models achieved recall of just 0.44 on the dropout class, meaning more than half of at-risk students were being missed entirely.

Through iterative feature engineering, resampling strategies, and threshold optimisation, XGBoost improved dropout recall from 0.44 to 0.81 (80.89%), correctly flagging four in five students who would ultimately withdraw. The F1 score rose from 0.12 to 0.21, reflecting the difficulty of simultaneously improving precision on a heavily imbalanced target.
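The threshold-optimisation part of that improvement can be sketched as below: instead of the default 0.5 cut-off, sweep candidate thresholds over the predicted dropout probabilities and keep the one that maximises recall subject to a minimum precision floor. The dataset, model, and the 0.10 precision floor are illustrative assumptions, not the project's actual values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Illustrative imbalanced dataset (class 1 = dropout).
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y,
                                                  random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]  # P(dropout) per student

best_t, best_recall = 0.5, 0.0
for t in np.arange(0.05, 0.95, 0.05):
    pred = (proba >= t).astype(int)
    if pred.sum() == 0:
        continue  # no students flagged at this threshold
    # Require a minimum precision so the alert list stays actionable.
    if precision_score(y_val, pred) >= 0.10:
        r = recall_score(y_val, pred)
        if r > best_recall:
            best_t, best_recall = t, r
print(f"chosen threshold={best_t:.2f}, recall={best_recall:.2f}")
```

Lowering the threshold flags more students, trading precision for recall; the precision floor keeps the flagged list small enough for support teams to work through.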

SHAP values provided clear, per-student explanations of which factors were driving each risk prediction, giving support teams actionable insight rather than opaque scores. The neural network captured additional non-linear patterns but at the cost of reduced interpretability, making XGBoost the recommended production model for institutional deployment.
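To illustrate the per-student explanations: for a linear model, SHAP values reduce to a closed form, coefficient times the feature's deviation from its mean. The sketch below computes these attributions directly for a logistic regression stand-in (the project itself applied the `shap` library to the XGBoost model; the dataset and feature names here are hypothetical).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative dataset standing in for the student feature matrix.
X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# For a linear model, the SHAP value of feature j for student i
# (in log-odds space) is coef_j * (x_ij - mean(x_j)):
# each feature's signed push away from the average student.
shap_values = model.coef_[0] * (X - X.mean(axis=0))

# Top risk drivers for one student, ranked by absolute contribution.
student = shap_values[0]
for j in np.argsort(-np.abs(student))[:3]:
    print(f"feature_{j}: contribution {student[j]:+.3f}")
```

A useful sanity check on this formulation: each student's attributions sum exactly to their log-odds score minus the average log-odds score, so the explanation fully accounts for the prediction rather than being a loose importance ranking.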

Technology Stack

Python, XGBoost, TensorFlow, SHAP, Pandas, Scikit-learn, Matplotlib