Sentiment Analysis of Movie Reviews

Overview

Problem Statement: A film studio planning a new science-fiction film needs to research customer feedback on related films to investigate potential relationships between positive sentiment and production budgets.

Approach: I constructed a natural language processing (NLP) pipeline for sentiment classification of movie reviews, including preprocessing, vectorisation, model training, and performance evaluation.

Project Goals

  • 🎬 Review Classification: Positive vs. negative
  • 📈 High Accuracy: Reliable sentiment prediction
  • ⚙️ NLP Pipeline: Effective text preprocessing
  • 📝 Actionable Insights: Business-focused findings

Data Acquisition & Exploration

1. Dataset Acquisition

Loaded the SST-2 (Stanford Sentiment Treebank) dataset from Hugging Face, containing movie-review sentences with binary sentiment labels; a loading sketch follows the list below.

  • Dataset Source: Hugging Face's datasets library
  • Training Set: 67,349 labelled movie review sentences
  • Validation Set: 872 labelled review sentences
  • Labels: Binary (0 = negative, 1 = positive)
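
A minimal loading sketch, assuming the Hugging Face datasets library and the "sst2" dataset identifier:

```python
from datasets import load_dataset

# Load SST-2; each example carries a "sentence" and a binary "label"
dataset = load_dataset("sst2")

print(dataset["train"].num_rows)       # 67349
print(dataset["validation"].num_rows)  # 872
```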
2. Text Analysis & Similarity

Explored the dataset with cosine similarity to understand the relationships between different movie reviews in the corpus; a similarity sketch follows the list below.

  • Methodology: Used spaCy to calculate semantic similarity between review pairs
  • Comparison Pairs: 5th-100th, 5th-15,000th, and 5th-50,000th sentences
  • Similarity Range: 0 (completely different) to 1 (identical)
  • Insights: Revealed semantic patterns across the dataset
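
A sketch of one pairwise comparison, assuming the en_core_web_lg model (its word vectors back spaCy's .similarity() score) and the dataset object loaded earlier:

```python
import spacy

nlp = spacy.load("en_core_web_lg")  # large English model with word vectors

doc_5 = nlp(dataset["train"][4]["sentence"])     # 5th sentence (0-indexed)
doc_100 = nlp(dataset["train"][99]["sentence"])  # 100th sentence

print(f"Similarity (5th vs 100th): {doc_5.similarity(doc_100):.2f}")
```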

Step-by-Step Process

1
Install Necessary Packages
datasets, nltk, spacy, beautifulsoup4, plus the en_core_web_lg spaCy model
↓
2
Load SST2 Dataset from Hugging Face
Stanford Sentiment Treebank with binary labels
↓
3
Create Train and Validation Dataframes
Train (67,349 examples) and Validation (872 examples)
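
A one-step conversion sketch, assuming the dataset object loaded in step 2 (Dataset.to_pandas() is part of the datasets API):

```python
# Convert the Hugging Face splits into pandas DataFrames
train_df = dataset["train"].to_pandas()
val_df = dataset["validation"].to_pandas()

print(train_df.shape, val_df.shape)  # (67349, 3) and (872, 3)
```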
↓
4-6
Cosine Similarity Calculations
Step 4: 5th vs 100th sentence → 0.69
Step 5: 5th vs 15,000th sentence → 0.11
Step 6: 5th vs 50,000th sentence → 0.17
↓
7
Comment on Cosine Similarity Scores
Higher similarity between nearby sentences (0.69)
Lower similarity between distant sentences (0.11-0.17)
↓
8
Create Preprocessing Function
Text Preprocessing Pipeline

1. Remove HTML tags & punctuation
2. Tokenise the text
3. Remove stop words
4. Apply lemmatisation or stemming
↓
9
Obtain Bag-of-Words & TF-IDF Representations
Bag-of-Words
CountVectorizer(max_features=3000)
TF-IDF
TfidfVectorizer(max_features=3000)
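
A sketch of both representations, assuming the preprocessed text lives in a hypothetical clean_sentence column:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Bag-of-Words: raw term counts over the 3,000 most frequent tokens
bow = CountVectorizer(max_features=3000)
X_train_bow = bow.fit_transform(train_df["clean_sentence"])

# TF-IDF: the same vocabulary cap, with counts reweighted by rarity
tfidf = TfidfVectorizer(max_features=3000)
X_train_tfidf = tfidf.fit_transform(train_df["clean_sentence"])
```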
↓
10
Train Logistic Regression Models
Model Performance
BoW: 78% accuracy
TF-IDF: 78% accuracy
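
A training-and-scoring sketch using the BoW features above (the TF-IDF model is trained identically); val_df and clean_sentence are the assumed names from the earlier sketches:

```python
from sklearn.linear_model import LogisticRegression

X_val_bow = bow.transform(val_df["clean_sentence"])  # reuse the fitted vocabulary

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_bow, train_df["label"])

print(f"BoW validation accuracy: {clf.score(X_val_bow, val_df['label']):.2f}")
```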
↓
11
Analyse Impact of Not Removing Stop Words
Including stop words improved accuracy to 79%
Stop words may contain important sentiment signals
↓
12
Compare Lemmatisation vs Stemming
Both techniques show similar performance
Choice between them has minimal impact on sentiment classification
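
A minimal sketch comparing the two normalisation techniques token by token, assuming NLTK's PorterStemmer and WordNetLemmatizer (the wordnet corpus must be downloaded once); exact outputs may differ slightly from the examples shown later, depending on POS handling:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time setup: nltk.download("wordnet")
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["pacing", "acting", "terrible", "superb"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))
```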

Text Preprocessing Details

🔍 Raw Text: original movie reviews →
🧹 Cleaning: remove HTML & punctuation →
✂️ Tokenisation: split into words →
🚫 Stop Words: remove common words →
🔄 Lemmatisation: convert to base forms

Preprocessing Function

Created a flexible text preprocessing pipeline with multiple configuration options to experiment with different approaches.

Text Preprocessing Flow Diagram

[Flow diagram] Required steps: remove HTML tags (BeautifulSoup) → remove punctuation (string.punctuation) → tokenise (word_tokenize()). Optional steps, gated by flags: remove stop words (remove_stopwords=True), lemmatise (use_lemmatization=True), or stem (use_stemming=True). Worked example: "<p>Hello, world!</p>" → "Hello, world!" (HTML removed) → "Hello world" (punctuation removed) → ["hello", "world"] (tokenised; unchanged by stop-word removal) → "hello world" (final output).
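
A minimal sketch of such a function, assuming NLTK tokenisation and stop words together with the flag names shown in the diagram:

```python
import string

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time setup: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text, remove_stopwords=True, use_lemmatization=True, use_stemming=False):
    """Clean one review and return a space-joined string of processed tokens."""
    text = BeautifulSoup(text, "html.parser").get_text()               # strip HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))   # drop punctuation
    tokens = word_tokenize(text.lower())                               # tokenise
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if use_lemmatization:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    elif use_stemming:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(t) for t in tokens]
    return " ".join(tokens)

print(preprocess("<p>Hello, world!</p>"))  # -> "hello world"
```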
Preprocessing Examples

Applying different preprocessing techniques to a sample movie review shows how each step transforms the text.

Sample transformations:

  • Original: "The film's pacing is terrible, but the acting is superb!"
  • Clean + tokenise: "the film pacing is terrible but the acting is superb"
  • Remove stopwords: "film pacing terrible acting superb"
  • Lemmatisation: "film pace terrible act superb"
  • Stemming: "film pac terribl act superb"
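
A sketch of how these rows might be produced with the preprocess function above; exact outputs depend on tokeniser and lemmatiser details (e.g. POS handling):

```python
review = "The film's pacing is terrible, but the acting is superb!"

# Clean + tokenise only
print(preprocess(review, remove_stopwords=False, use_lemmatization=False))
# Plus stop-word removal
print(preprocess(review, use_lemmatization=False))
# Plus lemmatisation
print(preprocess(review))
# Stemming instead of lemmatisation
print(preprocess(review, use_lemmatization=False, use_stemming=True))
```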

Similarity Analysis

All three comparisons anchor on Sentence #5: "The film makes a strong case for Maori traditions and argues that the drive for modernity threatens indigenous cultures worldwide."

  • vs. Sentence #100 (similarity 0.69): "A thoughtful, provocative, insistently humanising film."
  • vs. Sentence #15,000 (similarity 0.11): "The action sequences are poorly shot and edited, making it difficult to tell what's happening."
  • vs. Sentence #50,000 (similarity 0.17): "The special effects are amazing but the plot is completely nonsensical and the dialogue is laughable."

Analysis: Similarity falls as the positional gap between reviews grows. The 5th and 100th sentences share thematic and semantic content (0.69), whilst the distant pairs diverge sharply (0.11 and 0.17), indicating that the corpus spans diverse films, genres, and opinions.
Key Findings & Conclusions

  • Similarity Analysis: Position in the dataset correlates with semantic similarity
  • Model Performance: BoW and TF-IDF perform comparably (78% accuracy each); retaining stop words lifts accuracy to 79%
  • Preprocessing Impact: Stop words carry sentiment signal; lemmatisation ≈ stemming

Business Implications & Recommendations

Practical Insights

  • Sentiment Trends: At roughly 79% accuracy, the model can categorise thousands of reviews to surface overall sentiment patterns
  • Production Budget Analysis: Enables comparison of positive-sentiment percentages across films at different budget levels (a hypothetical aggregation sketch follows this list)
  • Genre-Specific Patterns: Can identify which elements of science-fiction films correlate with positive audience reactions
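
Sentiment-vs-budget aggregation is not implemented in this project; a hypothetical sketch, assuming a reviews DataFrame holding per-review predictions and a films table holding budgets:

```python
import pandas as pd

# Hypothetical inputs:
#   reviews: columns ["film", "pred_label"] (1 = positive, 0 = negative)
#   films:   columns ["film", "budget_usd"]
merged = reviews.merge(films, on="film")
merged["budget_band"] = pd.cut(
    merged["budget_usd"],
    bins=[0, 50e6, 150e6, float("inf")],
    labels=["low", "mid", "high"],
)

# Share of positive reviews per budget band
print(merged.groupby("budget_band", observed=True)["pred_label"].mean())
```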

Next Steps

  • Model Refinement: Further tuning could improve accuracy beyond 79%, particularly with domain adaptation
  • Include Stop Words: Retain common words in preprocessing to preserve sentiment signals
  • Enhanced Features: Incorporate word embeddings (Word2Vec/GloVe) for better semantic understanding

The sentiment analysis pipeline provides a robust foundation for the film studio's market research. By analysing patterns in customer feedback across different science fiction films, the studio can identify key elements that resonate positively with audiences and make data-driven decisions about their new film's direction, marketing approach, and budget allocation.

The finding that retaining stop words improves model performance challenges common NLP practices and highlights the importance of testing different preprocessing approaches for specific domains like movie reviews, where sentiment is often expressed through function words and modifiers.