Sentiment Analysis of Movie Reviews

Overview

Problem Statement: A film studio planning a new science-fiction film needs to research customer feedback on related films to investigate potential relationships between positive sentiment and production budgets.

Approach: I constructed a natural language processing (NLP) pipeline for sentiment classification of movie reviews, including preprocessing, vectorisation, model training, and performance evaluation.

Project Goals

  • 🎬 Review Classification: Positive vs. negative
  • 📈 High Accuracy: Reliable sentiment prediction
  • ⚙️ NLP Pipeline: Effective text preprocessing
  • 📝 Actionable Insights: Business-focused findings

Data Acquisition & Exploration

1. Dataset Acquisition

Loaded the SST-2 (Stanford Sentiment Treebank) dataset from Hugging Face, containing movie-review sentences with binary sentiment labels; a loading sketch follows the list below.

  • Dataset Source: Hugging Face's datasets library
  • Training Set: 67,349 labelled movie review sentences
  • Validation Set: 872 labelled review sentences
  • Labels: Binary (0 = negative, 1 = positive)
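
A minimal loading sketch, assuming the Hugging Face datasets library and the "sst2" dataset identifier:

```python
from datasets import load_dataset

# Load SST-2; each example carries a "sentence" and a binary "label"
dataset = load_dataset("sst2")

print(dataset["train"].num_rows)       # 67349
print(dataset["validation"].num_rows)  # 872
```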
2. Text Analysis & Similarity

Explored the dataset with cosine similarity to understand the relationships between different movie reviews in the corpus; a similarity sketch follows the list below.

  • Methodology: Used spaCy to calculate semantic similarity between review pairs
  • Comparison Pairs: 5th-100th, 5th-15,000th, and 5th-50,000th sentences
  • Similarity Range: 0 (completely different) to 1 (identical)
  • Insights: Revealed semantic patterns across the dataset
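
A sketch of one pairwise comparison, assuming the en_core_web_lg model (its word vectors back spaCy's .similarity() score) and the dataset object loaded earlier:

```python
import spacy

nlp = spacy.load("en_core_web_lg")  # large English model with word vectors

doc_5 = nlp(dataset["train"][4]["sentence"])     # 5th sentence (0-indexed)
doc_100 = nlp(dataset["train"][99]["sentence"])  # 100th sentence

print(f"Similarity (5th vs 100th): {doc_5.similarity(doc_100):.2f}")
```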

Step-by-Step Process

1
Install Necessary Packages
datasets, nltk, spacy, beautifulsoup4, plus the en_core_web_lg spaCy model
↓
2
Load SST2 Dataset from Hugging Face
Stanford Sentiment Treebank with binary labels
↓
3
Create Train and Validation Dataframes
Train (67,349 examples) and Validation (872 examples)
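
A one-step conversion sketch, assuming the dataset object loaded in step 2 (Dataset.to_pandas() is part of the datasets API):

```python
# Convert the Hugging Face splits into pandas DataFrames
train_df = dataset["train"].to_pandas()
val_df = dataset["validation"].to_pandas()

print(train_df.shape, val_df.shape)  # (67349, 3) and (872, 3)
```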
↓
4-6
Cosine Similarity Calculations
Step 4: 5th vs 100th sentence → 0.69
Step 5: 5th vs 15,000th sentence → 0.11
Step 6: 5th vs 50,000th sentence → 0.17
↓
7
Comment on Cosine Similarity Scores
Higher similarity between nearby sentences (0.69)
Lower similarity between distant sentences (0.11-0.17)
↓
8
Create Preprocessing Function
Text Preprocessing Pipeline

1. Remove HTML tags & punctuation
2. Tokenise the text
3. Remove stop words
4. Apply lemmatisation or stemming
↓
9
Obtain Bag-of-Words & TF-IDF Representations
Bag-of-Words
CountVectorizer(max_features=3000)
TF-IDF
TfidfVectorizer(max_features=3000)
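
A sketch of both representations, assuming the preprocessed text lives in a hypothetical clean_sentence column:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Bag-of-Words: raw term counts over the 3,000 most frequent tokens
bow = CountVectorizer(max_features=3000)
X_train_bow = bow.fit_transform(train_df["clean_sentence"])

# TF-IDF: the same vocabulary cap, with counts reweighted by rarity
tfidf = TfidfVectorizer(max_features=3000)
X_train_tfidf = tfidf.fit_transform(train_df["clean_sentence"])
```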
↓
10
Train Logistic Regression Models
Model Performance
BoW: 78% accuracy
TF-IDF: 78% accuracy
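
A training-and-scoring sketch using the BoW features above (the TF-IDF model is trained identically); val_df and clean_sentence are the assumed names from the earlier sketches:

```python
from sklearn.linear_model import LogisticRegression

X_val_bow = bow.transform(val_df["clean_sentence"])  # reuse the fitted vocabulary

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_bow, train_df["label"])

print(f"BoW validation accuracy: {clf.score(X_val_bow, val_df['label']):.2f}")
```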
↓
11
Analyse Impact of Not Removing Stop Words
Including stop words improved accuracy to 79%
Stop words may contain important sentiment signals
↓
12
Compare Lemmatisation vs Stemming
Both techniques show similar performance
Choice between them has minimal impact on sentiment classification
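
A minimal sketch comparing the two normalisation techniques token by token, assuming NLTK's PorterStemmer and WordNetLemmatizer (the wordnet corpus must be downloaded once); exact outputs may differ slightly from the examples shown later, depending on POS handling:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time setup: nltk.download("wordnet")
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["pacing", "acting", "terrible", "superb"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))
```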

Text Preprocessing Details

🔍 Raw Text: original movie reviews →
🧹 Cleaning: remove HTML & punctuation →
✂️ Tokenisation: split into words →
🚫 Stop Words: remove common words →
🔄 Lemmatisation: convert to base forms

Preprocessing Function

Created a flexible text preprocessing pipeline with multiple configuration options to experiment with different approaches.

Text Preprocessing Flow Diagram

[Flow diagram] Required steps: remove HTML tags (BeautifulSoup) → remove punctuation (string.punctuation) → tokenise (word_tokenize()). Optional steps, gated by flags: remove stop words (remove_stopwords=True), lemmatise (use_lemmatization=True), or stem (use_stemming=True). Worked example: "<p>Hello, world!</p>" → "Hello, world!" (HTML removed) → "Hello world" (punctuation removed) → ["hello", "world"] (tokenised; unchanged by stop-word removal) → "hello world" (final output).
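
A minimal sketch of such a function, assuming NLTK tokenisation and stop words together with the flag names shown in the diagram:

```python
import string

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time setup: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text, remove_stopwords=True, use_lemmatization=True, use_stemming=False):
    """Clean one review and return a space-joined string of processed tokens."""
    text = BeautifulSoup(text, "html.parser").get_text()               # strip HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))   # drop punctuation
    tokens = word_tokenize(text.lower())                               # tokenise
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if use_lemmatization:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    elif use_stemming:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(t) for t in tokens]
    return " ".join(tokens)

print(preprocess("<p>Hello, world!</p>"))  # -> "hello world"
```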
Preprocessing Examples

Applying different preprocessing techniques to a sample movie review shows how each step transforms the text.

Sample transformations:

  • Original: "The film's pacing is terrible, but the acting is superb!"
  • Clean + tokenise: "the film pacing is terrible but the acting is superb"
  • Remove stopwords: "film pacing terrible acting superb"
  • Lemmatisation: "film pace terrible act superb"
  • Stemming: "film pac terribl act superb"
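
A sketch of how these rows might be produced with the preprocess function above; exact outputs depend on tokeniser and lemmatiser details (e.g. POS handling):

```python
review = "The film's pacing is terrible, but the acting is superb!"

# Clean + tokenise only
print(preprocess(review, remove_stopwords=False, use_lemmatization=False))
# Plus stop-word removal
print(preprocess(review, use_lemmatization=False))
# Plus lemmatisation
print(preprocess(review))
# Stemming instead of lemmatisation
print(preprocess(review, use_lemmatization=False, use_stemming=True))
```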

Similarity Analysis

All three comparisons anchor on Sentence #5: "The film makes a strong case for Maori traditions and argues that the drive for modernity threatens indigenous cultures worldwide."

  • vs. Sentence #100 (similarity 0.69): "A thoughtful, provocative, insistently humanising film."
  • vs. Sentence #15,000 (similarity 0.11): "The action sequences are poorly shot and edited, making it difficult to tell what's happening."
  • vs. Sentence #50,000 (similarity 0.17): "The special effects are amazing but the plot is completely nonsensical and the dialogue is laughable."

Analysis: Similarity falls as the positional gap between reviews grows. The 5th and 100th sentences share thematic and semantic content (0.69), whilst the distant pairs diverge sharply (0.11 and 0.17), indicating that the corpus spans diverse films, genres, and opinions.
Key Findings & Conclusions

  • Similarity Analysis: Position in the dataset correlates with semantic similarity
  • Model Performance: BoW and TF-IDF perform comparably (78% accuracy each); retaining stop words lifts accuracy to 79%
  • Preprocessing Impact: Stop words carry sentiment signal; lemmatisation ≈ stemming

Business Implications & Recommendations

Practical Insights

  • Sentiment Trends: At roughly 79% accuracy, the model can categorise thousands of reviews to surface overall sentiment patterns
  • Production Budget Analysis: Enables comparison of positive-sentiment percentages across films at different budget levels (a hypothetical aggregation sketch follows this list)
  • Genre-Specific Patterns: Can identify which elements of science-fiction films correlate with positive audience reactions
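
Sentiment-vs-budget aggregation is not implemented in this project; a hypothetical sketch, assuming a reviews DataFrame holding per-review predictions and a films table holding budgets:

```python
import pandas as pd

# Hypothetical inputs:
#   reviews: columns ["film", "pred_label"] (1 = positive, 0 = negative)
#   films:   columns ["film", "budget_usd"]
merged = reviews.merge(films, on="film")
merged["budget_band"] = pd.cut(
    merged["budget_usd"],
    bins=[0, 50e6, 150e6, float("inf")],
    labels=["low", "mid", "high"],
)

# Share of positive reviews per budget band
print(merged.groupby("budget_band", observed=True)["pred_label"].mean())
```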

Next Steps

  • Model Refinement: Further tuning could improve accuracy beyond 79%, particularly with domain adaptation
  • Include Stop Words: Retain common words in preprocessing to preserve sentiment signals
  • Enhanced Features: Incorporate word embeddings (Word2Vec/GloVe) for better semantic understanding

The sentiment analysis pipeline provides a robust foundation for the film studio's market research. By analysing patterns in customer feedback across different science fiction films, the studio can identify key elements that resonate positively with audiences and make data-driven decisions about their new film's direction, marketing approach, and budget allocation.

The finding that retaining stop words improves model performance challenges common NLP practices and highlights the importance of testing different preprocessing approaches for specific domains like movie reviews, where sentiment is often expressed through function words and modifiers.