Exploring Evaluation Metrics

Overview

Problem Statement: A company seeks to refine a spam detection neural network trained on the Spambase dataset (4,601 emails, 57 features) and to evaluate it with a comprehensive set of metrics.

Approach: I selected the best-tuned model, computed predictions on the test set, evaluated accuracy, precision, recall, and F1, and analysed performance using a confusion matrix.

Spam Detection Metrics Overview

  • Email samples: 4,601
  • Features: 57
  • Accuracy: 92.5%
  • F1 Score: 90.9%

Data Preparation

1. Loading & Splitting

Loaded 4,601 emails with 57 features and binary spam/non-spam labels. Split the dataset into 80% training (with 10% of the training portion used for validation) and 20% test sets to ensure proper model evaluation, as sketched after the list below.

  • X: 57 numerical features (word frequencies, character frequencies)
  • y: Binary spam label (1 = spam, 0 = non-spam)
  • Train: 3,680 samples
  • Test: 921 samples
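A minimal sketch of the loading and splitting step, assuming the raw spambase.data file from the UCI repository is available locally (the path and random seed are illustrative):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Spambase: 57 numerical features followed by a binary spam label
    data = pd.read_csv("spambase.data", header=None)
    X = data.iloc[:, :57].values   # word/character frequency features
    y = data.iloc[:, 57].values    # 1 = spam, 0 = non-spam

    # 80/20 split; 10% of the training portion is later held out for validation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print(X_train.shape, X_test.shape)  # (3680, 57) (921, 57)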
2. Standardisation

Applied StandardScaler to normalise features to mean=0 and standard deviation=1. This ensures consistent scaling across features for optimal neural network training.

The scaler was fitted on the training data only, and the same fitted transformation was then applied to the test data to prevent data leakage (see the sketch after the list below).

Why Standardise?
  • Ensures gradient descent converges more quickly
  • Prevents features with larger scales from dominating the model
  • Improves numerical stability during training
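A minimal sketch of the standardisation step, fitting the scaler on the training data only and reusing it on the test data:

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X_train_std = scaler.fit_transform(X_train)  # learn mean/std from training data
    X_test_std = scaler.transform(X_test)        # apply the same transformation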

Workflow

1. Start: Metrics Exploration
Begin the project to explore comprehensive evaluation metrics for spam detection.

2. Import Libraries & Data
Load TensorFlow, Keras, scikit-learn, numpy, pandas, and the Spambase dataset.

3. Prepare Data
Split into training and test sets, apply standardisation to normalise features.

4. Select Best Model
Choose the highest-performing model architecture from previous grid search experiments.

5. Train & Save Model
Train the selected model architecture with early stopping and save to disk for reuse.

6. Compute Predictions & Metrics
Generate predictions on test data and calculate comprehensive evaluation metrics.

7. Analyse Performance
Interpret metrics and confusion matrix to understand model strengths and weaknesses.

8. End Activity
Conclude with recommendations based on performance analysis.

Model Selection

Best Model Architecture

Selected the highest-performing model from previous grid search experiments with the following characteristics:

  • Architecture: 64-32-16x4-1 neurons
  • Training: 14 epochs
  • Batch Size: 16
  • Optimiser: Adam
  • Cross-validation Accuracy: 0.945
Network diagram: input layer (57 features) → hidden layer 1 (64 neurons) → hidden layer 2 (32 neurons) → hidden layers 3-6 (16 neurons each) → output layer (1 neuron).
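A Keras sketch of this architecture follows. The activation functions are assumptions (ReLU in the hidden layers, sigmoid on the binary output), as they are not stated above:

    from tensorflow import keras
    from tensorflow.keras import layers

    # 64-32-16x4-1: two wide hidden layers, four narrow ones, single output
    model = keras.Sequential([
        keras.Input(shape=(57,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        *[layers.Dense(16, activation="relu") for _ in range(4)],
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])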
Training & Saving

Implemented early stopping to prevent overfitting and saved the best model to enable reuse and deployment.

  • Early Stopping: Monitors validation loss with patience=5
  • Callbacks: ModelCheckpoint to save best weights during training
  • Output Format: HDF5 file format (.h5)
  • Storage Location: Google Drive for persistent access
Training Process

1. Initialise model with 64-32-16x4-1 architecture

2. Compile with binary cross-entropy loss and accuracy metric

3. Configure early stopping and model checkpoint callbacks

4. Train with batch_size=16 for up to 100 epochs (early stopping typically activates around epoch 14)

5. Save best model as 'best_model.h5'
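A sketch of the training step using the callback settings listed above; variable names carry over from the earlier snippets, and the file path is illustrative:

    from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

    callbacks = [
        EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
        ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
    ]
    history = model.fit(
        X_train_std, y_train,
        validation_split=0.1,   # 10% of training data held out for validation
        epochs=100,             # early stopping typically halts around epoch 14
        batch_size=16,
        callbacks=callbacks,
    )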

Metrics Computation

Prediction Generation

Applied the trained model to the test set to generate probability predictions, which were then converted to binary classifications using a threshold of 0.5.

Prediction Process (sketched in code after the list below):
  1. Load best trained model from saved file
  2. Generate probability predictions on standardised test data
  3. Apply threshold (0.5) to convert probabilities to binary predictions
  4. Compare predictions with actual test labels
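A sketch of the prediction step, assuming the variable names from the earlier snippets:

    from tensorflow.keras.models import load_model

    best_model = load_model("best_model.h5")
    y_prob = best_model.predict(X_test_std).ravel()  # probabilities in [0, 1]
    y_pred = (y_prob >= 0.5).astype(int)             # apply the 0.5 threshold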

Evaluation Metrics

Computed comprehensive metrics over 10 runs to ensure statistical robustness of the evaluation results.

  • Accuracy: 0.9254
  • Precision: 0.9416
  • Recall: 0.8787
  • F1 Score: 0.9090
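These metrics can be computed with scikit-learn; the sketch below shows a single run, whereas the figures above are averages over 10 runs:

    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score)

    print("Accuracy :", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall   :", recall_score(y_test, y_pred))
    print("F1 score :", f1_score(y_test, y_pred))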

Metric Definitions

  • Accuracy: proportion of correctly classified emails (both spam and non-spam). Formula: (TP + TN) / (TP + TN + FP + FN)
  • Precision: of all emails classified as spam, the proportion that were actually spam. Formula: TP / (TP + FP)
  • Recall: of all actual spam emails, the proportion that were correctly identified. Formula: TP / (TP + FN)
  • F1 Score: harmonic mean of precision and recall, balancing both concerns. Formula: 2 * (Precision * Recall) / (Precision + Recall)

Performance Insights

Confusion matrix on the test set:

                       Predicted non-spam     Predicted spam
    Actual non-spam    TN: 511 (55.5%)        FP: 19 (2.1%)
    Actual spam        FN: 43 (4.7%)          TP: 347 (37.7%)

Confusion Matrix Analysis

The confusion matrix reveals the distribution of predictions across classes:

  • True Negatives (511, 55.5%): Non-spam emails correctly classified as non-spam
  • False Positives (19, 2.1%): Non-spam emails incorrectly classified as spam
  • False Negatives (43, 4.7%): Spam emails incorrectly classified as non-spam
  • True Positives (347, 37.7%): Spam emails correctly classified as spam
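As a sanity check, the metric formulas above can be applied directly to these counts; because the headline figures are averaged over 10 runs, these single-run values differ slightly from the reported averages:

    TN, FP, FN, TP = 511, 19, 43, 347
    accuracy  = (TP + TN) / (TP + TN + FP + FN)         # 858/920 ≈ 0.933
    precision = TP / (TP + FP)                          # 347/366 ≈ 0.948
    recall    = TP / (TP + FN)                          # 347/390 ≈ 0.890
    f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.918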

The low false positive rate indicates the model is conservative when flagging emails as spam, which is important to avoid filtering out legitimate emails.

Metric Interpretation

High Precision (0.9416): When the model identifies an email as spam, it's right about 94% of the time. This minimises the risk of important emails being incorrectly filtered.

Moderate Recall (0.8787): The model correctly identifies about 88% of all actual spam emails. Some spam may still reach the inbox.

Strong F1 Score (0.9090): Indicates a good balance between precision and recall, though slightly favouring precision over recall.

Solid Accuracy (0.9254): Overall, the model correctly classifies 92.5% of all emails.

Conclusion

The selected model (64-32-16x4-1 architecture, trained for 14 epochs with batch size 16) performs well on spam detection, achieving an accuracy of 0.9254 and F1 score of 0.9090. The high precision (0.9416) ensures minimal loss of legitimate emails, although the moderate recall (0.8787) indicates some spam may still reach the inbox.

Business Implications

Strengths

  • High Precision: Very low false positive rate (2.1%) means business-critical emails are unlikely to be lost
  • Good Overall Performance: F1 score above 0.90 indicates a balanced and effective spam filter
  • Reliable Architecture: Model design provides consistent results across multiple runs

Areas for Improvement

  • Recall Enhancement: Could explore techniques to improve recall without sacrificing precision
  • False Negative Reduction: 4.7% of test emails were spam that still reached the inbox
  • Confidence Thresholds: Consider adjusting the classification threshold to balance precision and recall based on business priorities (see the sketch below)
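A sketch of how the threshold could be chosen from the precision-recall curve; the 93% recall target is purely illustrative and should come from business priorities:

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)
    # recall decreases as the threshold rises, so take the largest
    # threshold that still meets the recall target
    meets_target = recalls[:-1] >= 0.93
    if meets_target.any():
        i = np.where(meets_target)[0][-1]
        print(f"threshold={thresholds[i]:.2f}  "
              f"precision={precisions[i]:.3f}  recall={recalls[i]:.3f}")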

Final Assessment

The model excels at conservative spam detection, prioritising the preservation of legitimate emails over catching every spam message. This approach aligns well with business needs where missing important communications would be more costly than receiving occasional spam. Further tuning could enhance recall for more comprehensive filtering whilst maintaining the model's strong precision.