Student Dropout Prediction Analysis

Overview

Problem Statement: Student retention is vital for educational stability and success. This project develops a predictive model for student dropout using machine learning across three stages: application (Stage 1), engagement (Stage 2), and academic performance (Stage 3). Aims include predicting dropout risk with high accuracy using XGBoost and Neural Networks, evaluating model effectiveness across stages, addressing class imbalance, and comparing performance to inform timely interventions for Study Group.

Approach: Supervised learning with XGBoost and Neural Networks was applied to three datasets. Preprocessing involved cleaning, encoding, and feature engineering (e.g., age from date of birth). Models were trained on an 80-20 split with stratification, evaluated via accuracy, precision, recall, F1, AUC, and confusion matrices, and tuned using RandomizedSearchCV.
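A minimal sketch of this pipeline, using synthetic data in place of the applicant datasets and scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (all names and parameter ranges here are illustrative, not the project's actual configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one stage's dataset (~15% dropout, like Stage 2).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.85, 0.15], random_state=42)

# 80-20 split, stratified to preserve the dropout ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale features (needed for the Neural Network; harmless for tree models).
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Randomized hyperparameter search, scored on F1 to respect the imbalance.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions={"n_estimators": [100, 200],
                         "max_depth": [2, 3, 4],
                         "learning_rate": [0.05, 0.1]},
    n_iter=5, scoring="f1", cv=3, random_state=42)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
print(f"F1: {f1_score(y_test, y_pred):.2f}, "
      f"Recall: {recall_score(y_test, y_pred):.2f}, "
      f"AUC: {roc_auc_score(y_test, search.predict_proba(X_test)[:, 1]):.2f}")
```

The same split/scale/search skeleton applies at every stage; only the input dataset and the feature set change.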

Project Overview by Stage

Stage 1 (Application): Demographics & Course Info
Stage 2 (Engagement): Attendance & Participation
Stage 3 (Performance): Academic Results & Modules

Project Stages

1. Applicant & Course Insights

Loaded applicant data, removed columns with more than 200 unique values or more than 50% missing data, engineered an 'Age' feature from date of birth, encoded categorical features, and trained both models.

XGBoost: F1 0.18, Recall 0.66
Neural Network: F1 0.13, Recall 0.46
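The Stage 1 cleaning rules above (drop columns with more than 200 unique values or more than 50% missing data, derive 'Age' from date of birth, then encode) can be sketched with pandas; the column names and reference year are illustrative, not from the actual dataset:

```python
import pandas as pd

def preprocess_stage1(df: pd.DataFrame, reference_year: int = 2024) -> pd.DataFrame:
    """Apply the Stage 1 cleaning rules to an applicant DataFrame."""
    df = df.copy()
    # Engineer 'Age' from date of birth before any columns are dropped.
    if "DateOfBirth" in df.columns:
        df["Age"] = reference_year - pd.to_datetime(df["DateOfBirth"]).dt.year
        df = df.drop(columns=["DateOfBirth"])
    # Drop high-cardinality columns (>200 unique values), e.g. free-text IDs.
    high_card = [c for c in df.columns if df[c].nunique() > 200]
    # Drop columns with more than 50% missing values.
    sparse = [c for c in df.columns if df[c].isna().mean() > 0.5]
    df = df.drop(columns=set(high_card + sparse))
    # One-hot encode the remaining categorical features.
    return pd.get_dummies(df, drop_first=True)

# Tiny illustrative example.
raw = pd.DataFrame({
    "DateOfBirth": ["2000-05-01", "1998-11-12", "2001-02-20"],
    "Gender": ["F", "M", "F"],
    "Sponsor": [None, None, "Agency"],   # 67% missing -> dropped
})
clean = preprocess_stage1(raw)
print(clean.columns.tolist())
```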
2. Engagement Dynamics

Added absence counts and repeated the preprocessing steps. A tuned Neural Network using the top 10 features (e.g., UnauthorisedAbsenceCount) excelled, outperforming Stage 1.

Tuned NN: F1 0.24, Recall 0.95
Top feature: UnauthorisedAbsenceCount; class imbalance: 0.17
3. Academic Performance

Included module performance features (PassedModules, FailedModules) with similar preprocessing. Both models struggled due to severe class imbalance and data complexity.

Best F1: ~0.02; best Recall: <0.2
Top feature: PassedModules; class imbalance: 0.07

Workflow

1. Start Mini-Project: Predict Student Dropout
2. Load Datasets: Stages 1, 2, 3
3. Stage Loop: for each of Stages 1, 2, 3
4. Preprocess Data
5. Explore Data
6. Split Data (80-20, stratified)
7. Scale Features
8. Build and Train Models
9. Tune Hyperparameters
10. Evaluate Models
11. Evaluate Top 10 Features
12. Last Stage? If not, return to step 3
13. Compare Across Stages
14. Generate Report
15. Submit Deliverables
16. End Mini-Project

Key Findings

Stage 1

Initial Capability

Tuned XGBoost reached F1 0.18 and Recall 0.66; the tuned NN reached F1 0.13 and Recall 0.46. Nationality (e.g., the Indian indicator, importance 0.07) and gender were key features. Moderate class imbalance (0.18) was managed with stratification, showing a basic ability to identify at-risk students.

Stage 2

Engagement Peak

The tuned NN (top 10 features) achieved F1 0.24 and Recall 0.95, leveraging absence counts. The improved recall over Stage 1 highlights the value of engagement data for mid-course interventions.

Stage 3

Complexity Challenge

Performance of both models collapsed (F1 ~0.02, Recall <0.2, AUC ~0.004) even though PassedModules was the top feature (importance 0.32). Severe imbalance (0.07) and noise in the academic data point to the need for refined strategies.

| Stage | Model | F1 Score | Recall | Key Features | Imbalance |
|-------|-------|----------|--------|--------------|-----------|
| Stage 1 | XGBoost | 0.18 | 0.66 | Nationality, Gender | 0.18 |
| Stage 1 | Neural Network | 0.13 | 0.46 | Nationality, Gender | 0.18 |
| Stage 2 | Tuned NN (Top 10) | 0.24 | 0.95 | UnauthorisedAbsenceCount | 0.17 |
| Stage 3 | Both Models | ~0.02 | <0.2 | PassedModules (0.32) | 0.07 |

Visual Insights (Stage 2)

Class Distribution: bar chart showing the Stage 2 imbalance, 85% completed vs. 15% dropped, which weighs on model performance.

Feature Importance: horizontal bars ranking UnauthorisedAbsenceCount, Nationality, Gender, Age, and CourseType, with UnauthorisedAbsenceCount the top Stage 2 predictor.

Confusion Matrix: tuned NN counts of TN 685, FP 37, FN 2, TP 37, reflecting high recall with few false negatives.
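As a quick check, the reported recall follows directly from the matrix counts (with dropout as the positive class):

```python
# Confusion matrix counts for the Stage 2 tuned Neural Network.
TN, FP, FN, TP = 685, 37, 2, 37

# Recall = TP / (TP + FN): of 39 actual dropouts, 37 were caught.
recall = TP / (TP + FN)
print(f"Recall: {recall:.3f}")
```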

Conclusion

Stage 2's Tuned Neural Network (Top 10 features) excelled with F1: 0.2425, Recall: 0.9488, leveraging engagement data for mid-course dropout prediction. Stage 1 showed moderate success, but Stage 3's academic complexity led to poor performance (F1 ~0.02), highlighting class imbalance and data integration challenges. Future work should refine feature selection and imbalance handling for robust late-stage predictions.

Model Effectiveness

XGBoost Performance

XGBoost was consistent in Stage 1 but faltered in Stage 3 under severe imbalance. Its feature importance scores remained valuable for identifying key predictors.

Neural Network Adaptability

The Neural Network performed best in Stage 2 after tuning with the top 10 features. Engagement metrics significantly improved its predictions, supporting timely interventions.

Recommendations for Implementation

Early Intervention Strategy

Implement alerts based on the Stage 2 model (engagement metrics) for the highest impact on student retention, with unauthorised absences as the key trigger.
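One shape such an alert could take in practice; the threshold, field names, and roster below are hypothetical, not values from the project, and any real cut-off would be calibrated against the Stage 2 model:

```python
# Hypothetical absence-based alert rule built on the Stage 2 signal.
ABSENCE_ALERT_THRESHOLD = 3  # illustrative trigger, to be calibrated locally

def flag_at_risk(students: list[dict]) -> list[str]:
    """Return IDs of students at or above the unauthorised-absence threshold."""
    return [s["id"] for s in students
            if s["unauthorised_absences"] >= ABSENCE_ALERT_THRESHOLD]

roster = [
    {"id": "S001", "unauthorised_absences": 0},
    {"id": "S002", "unauthorised_absences": 5},
    {"id": "S003", "unauthorised_absences": 3},
]
print(flag_at_risk(roster))  # S002 and S003 trigger an alert
```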

Future Model Improvements

Address severe class imbalance in Stage 3 through advanced sampling techniques and consider specialized models for each academic program.
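Alongside sampling, one standard lever for an imbalance like Stage 3's (~7% dropout) is reweighting the minority class; a sketch computing an XGBoost-style `scale_pos_weight` and sklearn-style "balanced" class weights from label counts (the counts below are illustrative, chosen to match the ~0.07 ratio):

```python
from collections import Counter

# Illustrative label counts at roughly the Stage 3 ratio (~0.07 dropout).
labels = [0] * 930 + [1] * 70
counts = Counter(labels)

# XGBoost-style weight for the positive (dropout) class: n_negative / n_positive.
scale_pos_weight = counts[0] / counts[1]

# sklearn-style 'balanced' weights: n_samples / (n_classes * count_of_class).
n = len(labels)
class_weight = {c: n / (2 * counts[c]) for c in counts}

print(f"scale_pos_weight: {scale_pos_weight:.2f}")
print(f"class weights: {class_weight}")
```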