Problem Statement: Student retention is vital for educational stability and success. This project develops a predictive model for student dropout using machine learning across three stages: application (Stage 1), engagement (Stage 2), and academic performance (Stage 3). Aims include predicting dropout risk with high accuracy using XGBoost and Neural Networks, evaluating model effectiveness across stages, addressing class imbalance, and comparing performance to inform timely interventions for Study Group.
Approach: Supervised learning with XGBoost and Neural Networks was applied to three datasets. Preprocessing involved cleaning, encoding, and feature engineering (e.g., age from date of birth). Models were trained on an 80-20 split with stratification, evaluated via accuracy, precision, recall, F1, AUC, and confusion matrices, and tuned using RandomizedSearchCV.
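The evaluation pipeline described above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual code: `GradientBoostingClassifier` stands in for XGBoost, and the parameter grid is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import f1_score, recall_score

# Synthetic stand-in for the student dataset (~18% minority class)
X, y = make_classification(n_samples=500, n_features=10, weights=[0.82],
                           random_state=42)

# 80-20 split with stratification preserves the dropout ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Randomized hyperparameter search (grid values are illustrative)
param_dist = {"n_estimators": [100, 200], "max_depth": [2, 3, 4],
              "learning_rate": [0.05, 0.1]}
search = RandomizedSearchCV(GradientBoostingClassifier(random_state=42),
                            param_dist, n_iter=5, cv=3,
                            scoring="f1", random_state=42)
search.fit(X_train, y_train)

pred = search.predict(X_test)
f1 = f1_score(y_test, pred)
rec = recall_score(y_test, pred)
```

Scoring the search on F1 rather than accuracy matters here: with an imbalanced dropout class, accuracy alone rewards models that never predict dropout.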
Stage 1: Loaded applicant data, removed high-cardinality columns (>200 unique values) and columns with >50% missing values, engineered an 'Age' feature from date of birth, encoded categorical features, and trained both models.
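The Stage 1 preprocessing steps can be sketched with pandas on a toy frame; the column names, reference date, and thresholds shown here follow the description above, but the data itself is fabricated for illustration.

```python
import pandas as pd

# Toy applicant frame; column names and values are illustrative assumptions
df = pd.DataFrame({
    "StudentId": [f"S{i}" for i in range(300)],          # >200 unique values
    "DateOfBirth": pd.to_datetime(["2000-05-01"] * 300),
    "Nationality": ["Indian", "British", "Chinese"] * 100,
    "Sparse": [None] * 200 + [1.0] * 100,                # >50% missing
})

# Drop columns with >200 unique values or >50% missing values
drop_cols = [c for c in df.columns
             if df[c].nunique() > 200 or df[c].isna().mean() > 0.5]
df = df.drop(columns=drop_cols)

# Engineer 'Age' from date of birth (reference date is an assumption)
ref = pd.Timestamp("2024-09-01")
df["Age"] = (ref - df["DateOfBirth"]).dt.days // 365
df = df.drop(columns=["DateOfBirth"])

# One-hot encode remaining categorical features
df = pd.get_dummies(df, columns=["Nationality"])
```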
Stage 2: Added absence counts and repeated the preprocessing. A tuned NN using the top 10 features (e.g., UnauthorisedAbsenceCount) excelled, outperforming Stage 1.
Stage 3: Included module performance (PassedModules/FailedModules) and preprocessed similarly. Both models struggled due to severe class imbalance and data complexity.
Stage 1: Tuned XGBoost reached F1 0.18 and recall 0.66; the tuned NN reached F1 0.13 and recall 0.46. Nationality (e.g., Indian: importance 0.07) and gender were key features. A moderate class imbalance (minority ratio 0.18) was managed with stratification, demonstrating basic risk identification.
Stage 2: The tuned NN (top 10 features) achieved F1 0.24 and recall 0.95 by leveraging absence counts. The improved recall over Stage 1 highlights engagement data's value for mid-course interventions.
Stage 3: Both models' performance dropped sharply (F1 ~0.02, recall <0.2, AUC ~0.004) despite PassedModules' high importance (0.32). The severe imbalance (minority ratio 0.07) and noise in the academic data suggest the need for refined strategies.
| Stage | Model | F1 Score | Recall | Key Features | Imbalance |
|---|---|---|---|---|---|
| Stage 1 | XGBoost | 0.18 | 0.66 | Nationality, Gender | 0.18 |
| Stage 1 | Neural Network | 0.13 | 0.46 | Nationality, Gender | 0.18 |
| Stage 2 | Tuned NN (Top 10) | 0.24 | 0.95 | UnauthorisedAbsenceCount | 0.17 |
| Stage 3 | Both Models | ~0.02 | <0.2 | PassedModules (0.32) | 0.07 |
A bar chart shows the Stage 2 class imbalance: 85% of students completed vs. 15% who dropped out, which impacts model performance.
Horizontal bars highlight UnauthorisedAbsenceCount as a top predictor in Stage 2.
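The ranking behind such a chart can be sketched by reading a tree model's feature importances. The data, feature names, and model here are illustrative stand-ins, not the project's actual Stage 2 inputs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative feature names; UnauthorisedAbsenceCount mirrors the Stage 2 chart
names = ["UnauthorisedAbsenceCount", "AuthorisedAbsenceCount", "Age", "Gender"]
X, y = make_classification(n_samples=400, n_features=4, n_informative=2,
                           random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features by importance, as plotted in the horizontal bar chart
ranked = sorted(zip(names, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
```

Plotting `ranked` with `matplotlib.pyplot.barh` reproduces the horizontal layout described above.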
The confusion matrix for the tuned NN (e.g., TN 685, FP 37) shows few false negatives, consistent with its high recall.
Stage 2's Tuned Neural Network (Top 10 features) excelled with F1: 0.2425, Recall: 0.9488, leveraging engagement data for mid-course dropout prediction. Stage 1 showed moderate success, but Stage 3's academic complexity led to poor performance (F1 ~0.02), highlighting class imbalance and data integration challenges. Future work should refine feature selection and imbalance handling for robust late-stage predictions.
XGBoost: Performed consistently in Stage 1 but faltered in Stage 3 due to severe imbalance. Its feature-importance output remains valuable for identifying key predictors.
Neural Network: Performed best in Stage 2 after tuning with the top 10 features. Engagement metrics significantly improved prediction capability for timely interventions.
Implement alerts based on Stage 2 model (engagement metrics) for highest impact on student retention. Focus on unauthorized absences as key trigger.
Address severe class imbalance in Stage 3 through advanced sampling techniques and consider specialized models for each academic program.