A supervised learning system that identifies students at risk of dropping out by combining academic performance data with demographic and engagement indicators, enabling institutions to intervene before it is too late.
Type
Supervised Learning
Domain
Higher Education
Methods
XGBoost, Neural Network
Status
Completed
The Challenge
Universities lose talented students every year because at-risk cases are not identified early enough for intervention. By the time a student formally withdraws, the decision has usually been building for months, signalled by patterns in attendance, grades, engagement, and personal circumstances.
The data to predict these outcomes often exists across fragmented systems but is rarely synthesised into a unified early-warning signal that student support teams can act on proactively.
Approach
01
Data Integration and EDA
Conducted phased data exploration across academic records, demographic data, and engagement metrics. Identified key features correlated with dropout behaviour through statistical analysis.
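The correlation screening described above can be sketched in a few lines. With a binary dropped-out label, the Pearson coefficient reduces to the point-biserial correlation; the attendance figures below are hypothetical values for illustration, not project data.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    # Pearson correlation; with a 0/1 label this is the point-biserial coefficient
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical records: attendance rate per student, and whether they dropped out
attendance = [0.95, 0.91, 0.88, 0.62, 0.55, 0.40]
dropped    = [0,    0,    0,    1,    1,    1]

print(round(pearson(attendance, dropped), 2))  # → -0.94
```

A strong negative coefficient like this is what flags a feature as a candidate dropout signal worth carrying into feature engineering.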
02
Feature Engineering
Built predictive features from raw data including academic trajectory indicators, engagement decay rates, and socioeconomic risk factors.
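One of the engineered features above, the engagement decay rate, can be sketched as the least-squares slope of a student's weekly activity. The exact formulation the project used is not specified, so treat this as one plausible construction:

```python
def engagement_decay(weekly_logins):
    """Least-squares slope of weekly activity; negative means engagement is decaying.

    Illustrative stand-in for the 'engagement decay rate' feature -- the
    project's actual definition may differ.
    """
    n = len(weekly_logins)
    xs = range(n)
    mx = (n - 1) / 2                      # mean of 0..n-1
    my = sum(weekly_logins) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, weekly_logins))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# A student logging in less and less each week gets a steep negative slope
print(engagement_decay([12, 10, 7, 5, 2]))  # → -2.5
```

The slope compresses a whole term of activity into one number the models can consume, and its sign gives support teams an immediately readable direction of travel.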
03
Model Development
Built and compared XGBoost (for interpretable, gradient-boosted predictions) and a neural network (for capturing non-linear feature interactions) on the same feature set.
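A minimal sketch of that same-feature-set comparison is below. It substitutes scikit-learn's GradientBoostingClassifier and MLPClassifier for the project's XGBoost and neural network, and generates synthetic features, so both the models and the data are stand-ins, not the project's pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the engineered feature set:
# columns = [gpa_trend, engagement_decay, attendance_rate] (hypothetical)
X = rng.normal(size=(600, 3))
# Dropout label driven by a non-linear mix of features (illustrative only)
y = ((X[:, 0] + X[:, 1] * X[:, 2]) < -0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Both models see exactly the same features, as in the project
models = {
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "neural_network": MLPClassifier(hidden_layer_sizes=(32, 16),
                                    max_iter=2000, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Recall on the held-out split -- the metric the project prioritised
    print(name, round(recall_score(y_te, model.predict(X_te)), 2))
```

Holding the feature set fixed is what makes the comparison fair: any gap in recall then reflects the model class, not the inputs.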
04
Evaluation and Interpretability
Compared models on accuracy, precision, and recall (prioritising recall to minimise missed at-risk students), and used SHAP values to explain which factors drive each risk prediction.
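The recall-first framing above comes down to how false negatives are counted: every missed at-risk student is a false negative, which lowers recall but leaves precision untouched. A minimal sketch (the SHAP side is not shown here):

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (at-risk = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative labels: one at-risk student is missed (a false negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(precision_recall(y_true, y_pred))  # → (0.75, 0.75)
```

Lowering the decision threshold trades precision for recall, which is the right trade when a missed student costs more than an unnecessary outreach call.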
[Diagram: student retention prediction with XGBoost + SHAP, classifying students as Retained or At Risk]
Results
High Recall
Minimised missed at-risk students
XGBoost
Best overall performance with interpretability
SHAP
Explainable predictions for support teams
XGBoost delivered the best balance of accuracy and interpretability, with SHAP values providing clear explanations for each prediction that student support teams could understand and act on. The neural network captured additional non-linear patterns but at the cost of reduced interpretability, making XGBoost the recommended production model.