Exploring Evaluation Metrics

Overview

Problem Statement: A company seeks to refine a spam detection neural network trained on the Spambase dataset (4,601 emails, 57 features) and to evaluate it with a comprehensive set of metrics.

Approach: I selected the best-tuned model, computed predictions on the test set, evaluated accuracy, precision, recall, and F1, and analysed performance using a confusion matrix.

Spam Detection Metrics Overview

  • Email samples: 4,601
  • Features: 57
  • Accuracy: 92.5%
  • F1 Score: 90.9%

Data Preparation

1. Loading & Splitting

Loaded 4,601 emails with 57 features and binary spam/non-spam labels. Split the dataset into 80% training (with 10% of the training portion used for validation) and 20% test sets to ensure proper model evaluation, as sketched after the list below.

  • X: 57 numerical features (word frequencies, character frequencies)
  • y: Binary spam label (1 = spam, 0 = non-spam)
  • Train: 3,680 samples
  • Test: 921 samples
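A minimal sketch of the loading and splitting step, assuming the raw spambase.data file from the UCI repository is available locally (the path and random seed are illustrative):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Spambase: 57 numerical features followed by a binary spam label
    data = pd.read_csv("spambase.data", header=None)
    X = data.iloc[:, :57].values   # word/character frequency features
    y = data.iloc[:, 57].values    # 1 = spam, 0 = non-spam

    # 80/20 split; 10% of the training portion is later held out for validation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print(X_train.shape, X_test.shape)  # (3680, 57) (921, 57)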
2. Standardisation

Applied StandardScaler to normalise features to mean=0 and standard deviation=1. This ensures consistent scaling across features for optimal neural network training.

The scaler was fitted on the training data only, and the same fitted transformation was then applied to the test data to prevent data leakage (see the sketch after the list below).

Why Standardise?
  • Ensures gradient descent converges more quickly
  • Prevents features with larger scales from dominating the model
  • Improves numerical stability during training
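A minimal sketch of the standardisation step, fitting the scaler on the training data only and reusing it on the test data:

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X_train_std = scaler.fit_transform(X_train)  # learn mean/std from training data
    X_test_std = scaler.transform(X_test)        # apply the same transformation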

Workflow

1. Start: Metrics Exploration
Begin the project to explore comprehensive evaluation metrics for spam detection.

2. Import Libraries & Data
Load TensorFlow, Keras, scikit-learn, numpy, pandas, and the Spambase dataset.

3. Prepare Data
Split into training and test sets, apply standardisation to normalise features.

4. Select Best Model
Choose the highest-performing model architecture from previous grid search experiments.

5. Train & Save Model
Train the selected model architecture with early stopping and save to disk for reuse.

6. Compute Predictions & Metrics
Generate predictions on test data and calculate comprehensive evaluation metrics.

7. Analyse Performance
Interpret metrics and confusion matrix to understand model strengths and weaknesses.

8. End Activity
Conclude with recommendations based on performance analysis.

Model Selection

Best Model Architecture

Selected the highest-performing model from previous grid search experiments with the following characteristics:

  • Architecture: 64-32-16x4-1 neurons
  • Training: 14 epochs
  • Batch Size: 16
  • Optimiser: Adam
  • Cross-validation Accuracy: 0.945
Network diagram: input layer (57 features) → hidden layer 1 (64 neurons) → hidden layer 2 (32 neurons) → hidden layers 3-6 (16 neurons each) → output layer (1 neuron).
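A Keras sketch of this architecture follows. The activation functions are assumptions (ReLU in the hidden layers, sigmoid on the binary output), as they are not stated above:

    from tensorflow import keras
    from tensorflow.keras import layers

    # 64-32-16x4-1: two wide hidden layers, four narrow ones, single output
    model = keras.Sequential([
        keras.Input(shape=(57,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        *[layers.Dense(16, activation="relu") for _ in range(4)],
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])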
Training & Saving

Implemented early stopping to prevent overfitting and saved the best model to enable reuse and deployment.

  • Early Stopping: Monitors validation loss with patience=5
  • Callbacks: ModelCheckpoint to save best weights during training
  • Output Format: HDF5 file format (.h5)
  • Storage Location: Google Drive for persistent access
Training Process

1. Initialise model with 64-32-16x4-1 architecture

2. Compile with binary cross-entropy loss and accuracy metric

3. Configure early stopping and model checkpoint callbacks

4. Train with batch_size=16 for up to 100 epochs (early stopping typically activates around epoch 14)

5. Save best model as 'best_model.h5'
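A sketch of the training step using the callback settings listed above; variable names carry over from the earlier snippets, and the file path is illustrative:

    from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

    callbacks = [
        EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
        ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
    ]
    history = model.fit(
        X_train_std, y_train,
        validation_split=0.1,   # 10% of training data held out for validation
        epochs=100,             # early stopping typically halts around epoch 14
        batch_size=16,
        callbacks=callbacks,
    )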

Metrics Computation

Prediction Generation

Applied the trained model to the test set to generate probability predictions, which were then converted to binary classifications using a threshold of 0.5.

Prediction Process (sketched in code after the list below):
  1. Load best trained model from saved file
  2. Generate probability predictions on standardised test data
  3. Apply threshold (0.5) to convert probabilities to binary predictions
  4. Compare predictions with actual test labels
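A sketch of the prediction step, assuming the variable names from the earlier snippets:

    from tensorflow.keras.models import load_model

    best_model = load_model("best_model.h5")
    y_prob = best_model.predict(X_test_std).ravel()  # probabilities in [0, 1]
    y_pred = (y_prob >= 0.5).astype(int)             # apply the 0.5 threshold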

Evaluation Metrics

Computed comprehensive metrics over 10 runs to ensure statistical robustness of the evaluation results.

  • Accuracy: 0.9254
  • Precision: 0.9416
  • Recall: 0.8787
  • F1 Score: 0.9090
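These metrics can be computed with scikit-learn; the sketch below shows a single run, whereas the figures above are averages over 10 runs:

    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score)

    print("Accuracy :", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall   :", recall_score(y_test, y_pred))
    print("F1 score :", f1_score(y_test, y_pred))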

Metric Definitions

  • Accuracy: proportion of correctly classified emails (both spam and non-spam). Formula: (TP + TN) / (TP + TN + FP + FN)
  • Precision: of all emails classified as spam, the proportion that were actually spam. Formula: TP / (TP + FP)
  • Recall: of all actual spam emails, the proportion that were correctly identified. Formula: TP / (TP + FN)
  • F1 Score: harmonic mean of precision and recall, balancing both concerns. Formula: 2 * (Precision * Recall) / (Precision + Recall)

Performance Insights

Confusion matrix on the test set:

                       Predicted non-spam     Predicted spam
    Actual non-spam    TN: 511 (55.5%)        FP: 19 (2.1%)
    Actual spam        FN: 43 (4.7%)          TP: 347 (37.7%)

Confusion Matrix Analysis

The confusion matrix reveals the distribution of predictions across classes:

  • True Negatives (511, 55.5%): Non-spam emails correctly classified as non-spam
  • False Positives (19, 2.1%): Non-spam emails incorrectly classified as spam
  • False Negatives (43, 4.7%): Spam emails incorrectly classified as non-spam
  • True Positives (347, 37.7%): Spam emails correctly classified as spam
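As a sanity check, the metric formulas above can be applied directly to these counts; because the headline figures are averaged over 10 runs, these single-run values differ slightly from the reported averages:

    TN, FP, FN, TP = 511, 19, 43, 347
    accuracy  = (TP + TN) / (TP + TN + FP + FN)         # 858/920 ≈ 0.933
    precision = TP / (TP + FP)                          # 347/366 ≈ 0.948
    recall    = TP / (TP + FN)                          # 347/390 ≈ 0.890
    f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.918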

The low false positive rate indicates the model is conservative when flagging emails as spam, which is important to avoid filtering out legitimate emails.

Metric Interpretation

High Precision (0.9416): When the model identifies an email as spam, it's right about 94% of the time. This minimises the risk of important emails being incorrectly filtered.

Moderate Recall (0.8787): The model correctly identifies about 88% of all actual spam emails. Some spam may still reach the inbox.

Strong F1 Score (0.9090): Indicates a good balance between precision and recall, though slightly favouring precision over recall.

Solid Accuracy (0.9254): Overall, the model correctly classifies 92.5% of all emails.

Conclusion

The selected model (64-32-16x4-1 architecture, trained for 14 epochs with batch size 16) performs well on spam detection, achieving an accuracy of 0.9254 and F1 score of 0.9090. The high precision (0.9416) ensures minimal loss of legitimate emails, although the moderate recall (0.8787) indicates some spam may still reach the inbox.

Business Implications

Strengths

  • High Precision: Very low false positive rate (2.1%) means business-critical emails are unlikely to be lost
  • Good Overall Performance: F1 score above 0.90 indicates a balanced and effective spam filter
  • Reliable Architecture: Model design provides consistent results across multiple runs

Areas for Improvement

  • Recall Enhancement: Could explore techniques to improve recall without sacrificing precision
  • False Negative Reduction: 4.7% of test emails were spam that still reached the inbox
  • Confidence Thresholds: Consider adjusting the classification threshold to balance precision and recall based on business priorities (see the sketch below)
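A sketch of how the threshold could be chosen from the precision-recall curve; the 93% recall target is purely illustrative and should come from business priorities:

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)
    # recall decreases as the threshold rises, so take the largest
    # threshold that still meets the recall target
    meets_target = recalls[:-1] >= 0.93
    if meets_target.any():
        i = np.where(meets_target)[0][-1]
        print(f"threshold={thresholds[i]:.2f}  "
              f"precision={precisions[i]:.3f}  recall={recalls[i]:.3f}")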

Final Assessment

The model excels at conservative spam detection, prioritising the preservation of legitimate emails over catching every spam message. This approach aligns well with business needs where missing important communications would be more costly than receiving occasional spam. Further tuning could enhance recall for more comprehensive filtering whilst maintaining the model's strong precision.