Problem Statement: A film studio planning a new science-fiction film wants to analyse customer feedback on related films and investigate whether positive sentiment is related to production budget.
Approach: I constructed a natural language processing (NLP) pipeline for sentiment classification of movie reviews, including preprocessing, vectorisation, model training, and performance evaluation.
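The vectorisation, training, and evaluation stages can be sketched as follows. This is a minimal illustration under stated assumptions: the write-up does not name the vectoriser or classifier, so bag-of-words counts and a multinomial naive Bayes model with add-one smoothing stand in here. The sentences are made-up toy data, not SST-2, and the accuracy is not a result from this project.

```python
import math
from collections import Counter, defaultdict


def tokenise(text):
    """Lowercase and split on whitespace (a deliberately simple tokeniser)."""
    return text.lower().split()


def train_nb(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for tok in tokenise(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior under naive Bayes."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenise(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)


# Hypothetical toy data: 1 = positive, 0 = negative.
train = [
    ("a superb and moving film", 1),
    ("wonderful acting and a great story", 1),
    ("terrible pacing and a dull plot", 0),
    ("a boring and awful film", 0),
]
held_out = [("superb story", 1), ("dull and boring", 0)]

model = train_nb(train)
held_out_accuracy = accuracy(model, held_out)
print(held_out_accuracy)
```

In practice a library implementation would probably replace the hand-written model, but the three stages (vectorise, fit, score on held-out data) keep the same shape.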
Loaded the SST-2 (Stanford Sentiment Treebank) dataset from Hugging Face, which contains sentences from movie reviews, each labelled as positive or negative.
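One way the loading step might look, assuming the Hugging Face `datasets` package and the GLUE copy of SST-2 (the write-up does not say which dataset ID was used). The helper names `load_sst2` and `describe` are hypothetical and exist only for this illustration.

```python
# In the GLUE copy of SST-2, label 0 is negative and label 1 is positive.
LABEL_NAMES = {0: "negative", 1: "positive"}


def load_sst2(split: str = "train"):
    """Load one split of SST-2 (assumed source: the GLUE benchmark copy)."""
    # Imported lazily so the rest of this sketch runs without the package.
    from datasets import load_dataset

    return load_dataset("glue", "sst2", split=split)


def describe(example: dict) -> str:
    """Format one record, using the 'sentence' and 'label' fields."""
    return f'{LABEL_NAMES[example["label"]]}: {example["sentence"].strip()}'


# Usage (requires network access and the `datasets` package):
#   train = load_sst2("train")
#   print(describe(train[0]))
```

The load call is left in a comment because it downloads data. The `describe` helper works on any record with `sentence` and `label` fields.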
Explored the dataset using cosine similarity to understand relationships between different movie reviews in the corpus.
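A small sketch of the cosine-similarity comparison, assuming simple bag-of-words count vectors (the write-up does not state which vector representation was used). The three reviews are invented examples, not taken from SST-2.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a review to a sparse vector of lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty review is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical reviews for illustration.
reviews = [
    "a gripping and clever space thriller",
    "a clever and gripping thriller set in space",
    "dull plot and wooden acting",
]
vectors = [bag_of_words(r) for r in reviews]

sim_close = cosine_similarity(vectors[0], vectors[1])  # heavy word overlap
sim_far = cosine_similarity(vectors[0], vectors[2])    # shares only "and"
print(round(sim_close, 3), round(sim_far, 3))
```

The first two reviews share most of their vocabulary and score much higher than the unrelated pair. That the only word linking the third review to the first is "and" also shows how function words can inflate similarity between raw count vectors.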
Created a flexible text preprocessing pipeline with multiple configuration options to experiment with different approaches.
Applying different preprocessing techniques to a sample movie review shows how each step transforms the text.
| Preprocessing step | Result |
|---|---|
| Original | "The film's pacing is terrible, but the acting is superb!" |
| Clean + Tokenise | "the film pacing is terrible but the acting is superb" |
| Remove Stopwords | "film pacing terrible acting superb" |
| Lemmatisation | "film pace terrible act superb" |
| Stemming | "film pac terribl act superb" |
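A configurable pipeline of this kind could be sketched as below. The names `PreprocessConfig` and `preprocess` are hypothetical, and the stop-word list is a short illustrative subset, not the list used in the project. Lemmatisation and stemming are left as a pluggable `normalise` hook (for example NLTK's `PorterStemmer().stem`), so their exact outputs depend on the library chosen and may differ from the rows above.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Short illustrative stop-word list (an assumption, not the project's list).
STOPWORDS = frozenset({"a", "an", "and", "but", "is", "it", "of", "the", "to"})


@dataclass
class PreprocessConfig:
    lowercase: bool = True
    remove_stopwords: bool = False
    # Optional per-token normaliser, e.g. a stemmer or lemmatiser function.
    normalise: Optional[Callable[[str], str]] = None


def preprocess(text: str, config: PreprocessConfig) -> List[str]:
    """Clean and tokenise text, then apply the optional steps in order."""
    if config.lowercase:
        text = text.lower()
    text = re.sub(r"['’]s\b", "", text)      # drop possessive 's
    tokens = re.findall(r"[A-Za-z]+", text)  # keep alphabetic runs only
    if config.remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if config.normalise is not None:
        tokens = [config.normalise(t) for t in tokens]
    return tokens


sample = "The film's pacing is terrible, but the acting is superb!"
cleaned = " ".join(preprocess(sample, PreprocessConfig()))
no_stop = " ".join(preprocess(sample, PreprocessConfig(remove_stopwords=True)))
print(cleaned)
print(no_stop)
```

Keeping each step behind a flag makes it straightforward to train the same model on differently preprocessed text and compare the results.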
The sentiment analysis pipeline gives the film studio a foundation for its market research. By analysing patterns in customer feedback across different science-fiction films, the studio can identify the elements that resonate positively with audiences and make data-driven decisions about the new film's direction, marketing approach, and budget allocation.
The finding that retaining stop words improves model performance runs counter to the common NLP practice of removing them. It highlights the importance of testing different preprocessing approaches for specific domains like movie reviews, where sentiment is often expressed through function words and modifiers such as "not", "but", and "very".
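A toy illustration of why this can happen. It assumes a stop-word list that, like many standard English lists (NLTK's, for instance), includes negations such as "not". The two sentences are invented and the list here is a short illustrative subset.

```python
# Illustrative subset of a typical English stop-word list (an assumption).
STOPWORDS = frozenset(
    {"a", "the", "is", "was", "this", "not", "no", "but", "very", "too"}
)


def tokens(text, keep_stopwords):
    """Whitespace-tokenise, optionally filtering out stop words."""
    words = text.lower().split()
    if keep_stopwords:
        return words
    return [w for w in words if w not in STOPWORDS]


positive = "this film is good"
negative = "this film is not good"

# With stop words kept, the two reviews remain distinguishable.
kept_distinct = tokens(positive, True) != tokens(negative, True)
# With stop words removed, both collapse to the same tokens.
removed_distinct = tokens(positive, False) != tokens(negative, False)

print(tokens(negative, False))
print(kept_distinct, removed_distinct)
```

After filtering, a classifier sees identical input for both reviews and cannot separate them, which is one plausible mechanism behind the result reported above.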