Problem Statement: Student retention is vital for educational stability and success. This project develops a predictive model for student dropout using machine learning across three stages: application (Stage 1), engagement (Stage 2), and academic performance (Stage 3). Aims include predicting dropout risk with high accuracy using XGBoost and Neural Networks, evaluating model effectiveness across stages, addressing class imbalance, and comparing performance to inform timely interventions for Study Group.
Approach: Supervised learning with XGBoost and Neural Networks was applied to three datasets. Preprocessing involved cleaning, encoding, and feature engineering (e.g., age from date of birth). Models were trained on an 80-20 split with stratification, evaluated via accuracy, precision, recall, F1, AUC, and confusion matrices, and tuned using RandomizedSearchCV.
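The evaluation pipeline described above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual code: `GradientBoostingClassifier` stands in for XGBoost, and the parameter grid is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import f1_score, recall_score

# Synthetic stand-in for the student dataset (~18% minority class)
X, y = make_classification(n_samples=500, n_features=10, weights=[0.82],
                           random_state=42)

# 80-20 split with stratification preserves the dropout ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Randomized hyperparameter search (grid values are illustrative)
param_dist = {"n_estimators": [100, 200], "max_depth": [2, 3, 4],
              "learning_rate": [0.05, 0.1]}
search = RandomizedSearchCV(GradientBoostingClassifier(random_state=42),
                            param_dist, n_iter=5, cv=3,
                            scoring="f1", random_state=42)
search.fit(X_train, y_train)

pred = search.predict(X_test)
f1 = f1_score(y_test, pred)
rec = recall_score(y_test, pred)
```

Scoring the search on F1 rather than accuracy matters here: with an imbalanced dropout class, accuracy alone rewards models that never predict dropout.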
Stage 1: Loaded applicant data, removed high-cardinality columns (>200 unique values) and columns with >50% missing values, engineered an 'Age' feature from date of birth, encoded categorical features, and trained both models.
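The Stage 1 preprocessing steps can be sketched with pandas on a toy frame; the column names, reference date, and thresholds shown here follow the description above, but the data itself is fabricated for illustration.

```python
import pandas as pd

# Toy applicant frame; column names and values are illustrative assumptions
df = pd.DataFrame({
    "StudentId": [f"S{i}" for i in range(300)],          # >200 unique values
    "DateOfBirth": pd.to_datetime(["2000-05-01"] * 300),
    "Nationality": ["Indian", "British", "Chinese"] * 100,
    "Sparse": [None] * 200 + [1.0] * 100,                # >50% missing
})

# Drop columns with >200 unique values or >50% missing values
drop_cols = [c for c in df.columns
             if df[c].nunique() > 200 or df[c].isna().mean() > 0.5]
df = df.drop(columns=drop_cols)

# Engineer 'Age' from date of birth (reference date is an assumption)
ref = pd.Timestamp("2024-09-01")
df["Age"] = (ref - df["DateOfBirth"]).dt.days // 365
df = df.drop(columns=["DateOfBirth"])

# One-hot encode remaining categorical features
df = pd.get_dummies(df, columns=["Nationality"])
```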
Stage 2: Added absence counts and repeated the preprocessing. A tuned NN using the top 10 features (e.g., UnauthorisedAbsenceCount) excelled, outperforming Stage 1.
Stage 3: Included module performance (PassedModules/FailedModules) and preprocessed similarly. Both models struggled due to severe class imbalance and data complexity.
Stage 1: Tuned XGBoost reached F1 0.18 and recall 0.66; the tuned NN reached F1 0.13 and recall 0.46. Nationality (e.g., Indian: importance 0.07) and gender were key features. A moderate class imbalance (minority ratio 0.18) was managed with stratification, demonstrating basic risk identification.
Stage 2: The tuned NN (top 10 features) achieved F1 0.24 and recall 0.95 by leveraging absence counts. The improved recall over Stage 1 highlights engagement data's value for mid-course interventions.
Stage 3: Both models' performance dropped sharply (F1 ~0.02, recall <0.2, AUC ~0.004) despite PassedModules' high importance (0.32). The severe imbalance (minority ratio 0.07) and noise in the academic data suggest the need for refined strategies.
| Stage | Model | F1 Score | Recall | Key Features | Imbalance |
|---|---|---|---|---|---|
| Stage 1 | XGBoost | 0.18 | 0.66 | Nationality, Gender | 0.18 |
| Stage 1 | Neural Network | 0.13 | 0.46 | Nationality, Gender | 0.18 |
| Stage 2 | Tuned NN (Top 10) | 0.24 | 0.95 | UnauthorisedAbsenceCount | 0.17 |
| Stage 3 | Both Models | ~0.02 | <0.2 | PassedModules (0.32) | 0.07 |
A bar chart shows the Stage 2 class imbalance: 85% of students completed vs. 15% who dropped out, which impacts model performance.
Horizontal bars highlight UnauthorisedAbsenceCount as a top predictor in Stage 2.
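The ranking behind such a chart can be sketched by reading a tree model's feature importances. The data, feature names, and model here are illustrative stand-ins, not the project's actual Stage 2 inputs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative feature names; UnauthorisedAbsenceCount mirrors the Stage 2 chart
names = ["UnauthorisedAbsenceCount", "AuthorisedAbsenceCount", "Age", "Gender"]
X, y = make_classification(n_samples=400, n_features=4, n_informative=2,
                           random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features by importance, as plotted in the horizontal bar chart
ranked = sorted(zip(names, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
```

Plotting `ranked` with `matplotlib.pyplot.barh` reproduces the horizontal layout described above.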
The confusion matrix for the tuned NN (e.g., TN 685, FP 37) shows few false negatives, consistent with its high recall.
Stage 2's Tuned Neural Network (Top 10 features) excelled with F1: 0.2425, Recall: 0.9488, leveraging engagement data for mid-course dropout prediction. Stage 1 showed moderate success, but Stage 3's academic complexity led to poor performance (F1 ~0.02), highlighting class imbalance and data integration challenges. Future work should refine feature selection and imbalance handling for robust late-stage predictions.
XGBoost: Performed consistently in Stage 1 but faltered in Stage 3 due to severe imbalance. Its feature-importance output remains valuable for identifying key predictors.
Neural Network: Performed best in Stage 2 after tuning with the top 10 features. Engagement metrics significantly improved prediction capability for timely interventions.
Implement alerts based on Stage 2 model (engagement metrics) for highest impact on student retention. Focus on unauthorized absences as key trigger.
Address severe class imbalance in Stage 3 through advanced sampling techniques and consider specialized models for each academic program.