
Student Retention Prediction System

A supervised learning system that identifies students at risk of dropping out by combining academic performance data with demographic and engagement indicators, enabling institutions to intervene before it is too late.

Type: Supervised Learning
Domain: Higher Education
Methods: XGBoost, Neural Network
Status: Completed
[Project graphic: student retention prediction with XGBoost + SHAP, classifying students as "Retained" or "At Risk"]

The Challenge

Universities lose talented students every year because at-risk cases are not identified early enough for intervention. By the time a student formally withdraws, the decision has usually been building for months, signalled by patterns in attendance, grades, engagement, and personal circumstances.

The data to predict these outcomes often exists across fragmented systems but is rarely synthesised into a unified early-warning signal that student support teams can act on proactively.

Approach

01. Data Integration and EDA
Conducted phased data exploration across academic records, demographic data, and engagement metrics. Identified key features correlated with dropout behaviour through statistical analysis.
02. Feature Engineering
Built predictive features from raw data, including academic trajectory indicators, engagement decay rates, and socioeconomic risk factors.
03. Model Development
Built and compared XGBoost (for interpretable, gradient-boosted predictions) and a neural network (for capturing non-linear feature interactions) on the same feature set.
04. Evaluation and Interpretability
Compared models on accuracy, precision, and recall (prioritising recall to minimise missed at-risk students), and used SHAP values to explain which factors drive risk predictions.
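The model-comparison step above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: a synthetic imbalanced dataset stands in for the engineered features, and scikit-learn's GradientBoostingClassifier and MLPClassifier stand in for the XGBoost and TensorFlow models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Synthetic stand-in for the engineered feature set (~10% dropouts, class 1).
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

# Two candidate models trained on the same feature set.
models = {
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "neural_network": MLPClassifier(hidden_layer_sizes=(32, 16),
                                    max_iter=500, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Recall on the dropout class is the headline metric here.
    print(name,
          f"precision={precision_score(y_test, pred, zero_division=0):.2f}",
          f"recall={recall_score(y_test, pred):.2f}",
          f"f1={f1_score(y_test, pred):.2f}")
```

Comparing both models on identical splits and metrics is what makes the later "XGBoost is the recommended production model" conclusion defensible.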

Results

80.89% recall on the dropout class - correctly identifying four in five at-risk students
0.21 F1 score, up from a 0.12 baseline through iterative feature engineering and threshold tuning
SHAP-based explainable predictions, letting student support teams target interventions effectively

The core challenge in this dataset was severe class imbalance - dropout cases formed a small minority of total records, making them inherently difficult to detect. Initial models achieved recall of just 0.44 on the dropout class, meaning more than half of at-risk students were being missed entirely.

Through iterative feature engineering, resampling strategies, and threshold optimisation, XGBoost improved dropout recall from 0.44 to 0.81 (80.89%), correctly flagging four in five students who would ultimately withdraw. The F1 score rose from 0.12 to 0.21, reflecting the difficulty of simultaneously improving precision on a heavily imbalanced target.
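The threshold-optimisation part of that improvement can be sketched as below: instead of the default 0.5 cut-off, sweep candidate thresholds over the predicted dropout probabilities and keep the one that maximises recall subject to a minimum precision floor. The dataset, model, and the 0.10 precision floor are illustrative assumptions, not the project's actual values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Illustrative imbalanced dataset (class 1 = dropout).
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y,
                                                  random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]  # P(dropout) per student

best_t, best_recall = 0.5, 0.0
for t in np.arange(0.05, 0.95, 0.05):
    pred = (proba >= t).astype(int)
    if pred.sum() == 0:
        continue  # no students flagged at this threshold
    # Require a minimum precision so the alert list stays actionable.
    if precision_score(y_val, pred) >= 0.10:
        r = recall_score(y_val, pred)
        if r > best_recall:
            best_t, best_recall = t, r
print(f"chosen threshold={best_t:.2f}, recall={best_recall:.2f}")
```

Lowering the threshold flags more students, trading precision for recall; the precision floor keeps the flagged list small enough for support teams to work through.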

SHAP values provided clear, per-student explanations of which factors were driving each risk prediction, giving support teams actionable insight rather than opaque scores. The neural network captured additional non-linear patterns but at the cost of reduced interpretability, making XGBoost the recommended production model for institutional deployment.
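To illustrate the per-student explanations: for a linear model, SHAP values reduce to a closed form, coefficient times the feature's deviation from its mean. The sketch below computes these attributions directly for a logistic regression stand-in (the project itself applied the `shap` library to the XGBoost model; the dataset and feature names here are hypothetical).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative dataset standing in for the student feature matrix.
X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# For a linear model, the SHAP value of feature j for student i
# (in log-odds space) is coef_j * (x_ij - mean(x_j)):
# each feature's signed push away from the average student.
shap_values = model.coef_[0] * (X - X.mean(axis=0))

# Top risk drivers for one student, ranked by absolute contribution.
student = shap_values[0]
for j in np.argsort(-np.abs(student))[:3]:
    print(f"feature_{j}: contribution {student[j]:+.3f}")
```

A useful sanity check on this formulation: each student's attributions sum exactly to their log-odds score minus the average log-odds score, so the explanation fully accounts for the prediction rather than being a loose importance ranking.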

Technology Stack

Python, XGBoost, TensorFlow, SHAP, Pandas, Scikit-learn, Matplotlib