Nielsen BookScan Time Series Analysis Guide

Nielsen BookScan Sales Forecasting Project

Using Time Series Analysis for Sales and Demand Forecasting in Book Retail

Business Context

Nielsen BookScan tracks book sales data across global markets, covering approximately 90% of all retail print book purchases in the UK.

Objective

To identify sales patterns with seasonal trends to inform reordering, restocking, and reprinting decisions for various book titles.

Key Focus

Predicting sales for two titles: "The Alchemist" and "The Very Hungry Caterpillar" using multiple forecasting approaches.

Project Process Flowchart

Comprehensive steps undertaken to analyse sales data and build forecasting models

Data Collection

Import and merge data from Nielsen BookScan's four main categories

Data Cleaning & Preparation

Handle missing values, convert formats, and resample data to consistent weekly intervals

Initial Data Investigation

Explore general patterns, compare historical periods, and identify potential seasonality

Classical Techniques

Apply decomposition, stationarity tests, ARIMA modelling, and 32-week forecasting

ML & DL Techniques

Implement XGBoost and LSTM models with hyperparameter tuning for 32-week forecasting

Hybrid Models

Combine SARIMA and LSTM in sequential and parallel architectures with weight optimisation

Monthly Prediction

Aggregate to monthly data and forecast 8 months with XGBoost and SARIMA models

Model Evaluation & Comparison

Calculate MAE, MAPE, RMSE metrics and compare performance across all approaches

Comprehensive Visual Guide

Detailed explanation of each phase with visual representations and key findings

Data Collection & Preparation

File 1: BookScan Data
4 sheets, 13 columns
227,224 rows after preprocessing
File 2: ISBN Data
4 sheets, 10+ columns
Sales history since 2001
Key Processing Steps:
  • ✓ Convert ISBNs to string format
  • ✓ Convert dates to datetime objects
  • ✓ Resample to weekly frequency
  • ✓ Fill missing weeks with zero sales
  • ✓ Handle non-ASCII characters
Final Dataset: 227,224 rows with 14 columns including ISBN, Title, Author, Date, Volume (sales), Value, etc.

Data Cleaning Process

The project began with importing and merging data from four major book categories: Adult Fiction, Adult Non-Fiction Specialist, Adult Non-Fiction Trade, and Children's, YA & Educational from Nielsen BookScan's database.

Key preprocessing steps included:

  • Standardisation: Converting ISBNs to string format and dates to datetime objects
  • Handling missing data: Resampling the weekly data and filling missing weeks with zeros to create consistent time intervals
  • Data cleaning: Addressing non-ASCII characters in titles and authors, and ensuring mathematical consistency between sales volume and value fields
  • Filtering: Identifying books with sales data beyond 2024-07-01 for in-depth analysis

The final dataset contained complete weekly sales histories for all titles, focusing specifically on "The Alchemist" (655 weekly records) and "The Very Hungry Caterpillar" (886 weekly records) from January 2012 to July 2024.

Initial Data Investigation

2000-2012 Period
High Sales Volume
Mean: 161,205.7 units
Max: 2,058,701 units
2012-2024 Period
Declining Print Sales
Mean: 12,735 units
Max: 876,479 units

Observations from Initial Sales Patterns

The initial data investigation revealed stark differences between the two time periods:

  • 2000-2012 Period: High sales volumes were observed, with a mean of 161,205.7 units (median 100,558, maximum 2,058,701) across 500 ISBNs, reflecting a strong pre-digital book market.
  • 2012-2024 Period: A dramatic decline in print sales is evident, with a mean of just 12,735 units (median 24.5, maximum 876,479) across 336 ISBNs, indicating a polarised market post-2012.

For the two focus books post-2012:

  • The Alchemist: Shows steadier, more predictable sales with gradual decline, suggesting sustained interest but competition from digital formats.
  • The Very Hungry Caterpillar: Exhibits more volatile sales with higher peaks (up to ~3,500 units), potentially due to educational use, seasonal purchasing patterns, and being less affected by e-book competition.

The sharp decline in overall sales volumes post-2012 likely reflects the growing impact of e-books and digital reading platforms on the print book market.

Classical Statistical Techniques

Time Series Decomposition

The Alchemist
Trend Slope: 8.27 units/year
Seasonal Amplitude: 813.95 units
Residual Std: 136.89
The Very Hungry Caterpillar
Trend Slope: 128.03 units/year
Seasonal Amplitude: 775.22 units
Residual Std: 526.82

SARIMA Models & Performance

Book ARIMA Model MAE RMSE MAPE (%)
The Alchemist (1,1,2)(1,0,0)[52] 169.33 252.54 31.89
The Very Hungry Caterpillar (2,1,2)(1,0,0)[52] 382.28 467.09 19.60

Classical Forecasting Approach

Classical time series analysis techniques were applied to understand the underlying patterns and forecast sales for the final 32 weeks:

  1. Time Series Decomposition: Both books showed clear seasonal patterns, with "The Very Hungry Caterpillar" exhibiting a stronger upward trend (128.03 units/year vs. 8.27 units/year for "The Alchemist").
  2. Stationarity Testing: Both series required first-order differencing to achieve stationarity, indicating the presence of trends that needed to be addressed in modelling.
  3. SARIMA Modelling: Auto ARIMA identified optimal models for each book:
    • The Alchemist: ARIMA(1,1,2)(1,0,0)[52] - capturing yearly seasonality with moderate trend components
    • The Very Hungry Caterpillar: ARIMA(2,1,2)(1,0,0)[52] - requiring more complex autoregressive and moving average terms
  4. Forecast Evaluation: The models performed impressively, with:
    • The Alchemist: MAE of 169.33, RMSE of 252.54, MAPE of 31.89%
    • The Very Hungry Caterpillar: MAE of 382.28, RMSE of 467.09, MAPE of 19.60%

The classical SARIMA approach proved highly effective, particularly for "The Very Hungry Caterpillar" with its lower percentage error (19.60% MAPE), despite higher absolute errors due to its larger sales volume. This establishes SARIMA as a strong baseline for comparison with more complex models.

Machine Learning & Deep Learning Models

Model Architecture Comparison

XGBoost Architecture
  • • Features: week_of_year, month, lag_1, lag_2, lag_3, roll_mean_4, roll_mean_8
  • • Parameters: n_estimators=100, max_depth=5, learning_rate=0.1
  • • Windows: [4, 12] for rolling features
LSTM Architecture
  • • Lookback Period: 10 weeks for The Alchemist, 20 weeks for The Very Hungry Caterpillar
  • • Features: Scaled sales, week_of_year, month
  • • Hidden Units: 64 with dropout rate 0.3

ML/DL Performance Metrics

Book Model MAE MAPE (%) RMSE
The Alchemist LSTM 166.96 24.99 284.68
The Alchemist XGBoost 607.91 96.81 696.07
The Very Hungry Caterpillar LSTM 706.73 32.52 807.26
The Very Hungry Caterpillar XGBoost 2167.09 99.02 2244.64

Machine Learning & Deep Learning Approach

More advanced machine learning and deep learning techniques were implemented to capture complex patterns in the sales data:

  1. Feature Engineering: Time-based features (week, month), lagged values, and rolling statistics were created to provide context for the models.
  2. XGBoost Implementation: Gradient boosting models were trained with cross-validation and hyperparameter tuning, including adjustments to tree depth, learning rate, and window length.
  3. LSTM Neural Networks: Sequence-based models were developed with:
    • Carefully tuned lookback periods (10 for The Alchemist, 20 for The Very Hungry Caterpillar)
    • Input features including scaled sales, week of year, and month
    • Hyperparameter optimisation via KerasTuner
  4. Comparative Performance:
    • LSTM significantly outperformed XGBoost, achieving MAE values close to classical SARIMA models
    • XGBoost struggled with the time series data, producing high error rates (MAPE near 100%)
    • Confidence interval coverage was poor (46.88% for LSTM, 0% for XGBoost), indicating uncertainty estimation issues

The superior performance of LSTM models compared to XGBoost highlights the importance of sequence modeling for this time series data. While LSTMs approached the accuracy of SARIMA models, they required more complex implementation and tuning, and their confidence intervals were less reliable.

Hybrid Forecasting Models

Hybrid Model Architectures

Sequential Hybrid
SARIMA Model
Residuals
LSTM Model

LSTM trained on SARIMA residuals, with final prediction as sum of both models

Parallel Hybrid
SARIMA Model
LSTM Model
Weighted Average

SARIMA and LSTM trained independently, with optimised weighting for final prediction

Hybrid Model Performance

Book Model MAE MAPE (%) Weight
The Alchemist Sequential Hybrid 144.99 30.86 N/A
The Alchemist Parallel Hybrid 245.70 44.52 0.5
The Very Hungry Caterpillar Sequential Hybrid 1243.30 66.41 N/A
The Very Hungry Caterpillar Parallel Hybrid 800.53 38.01 0.5

Hybrid Modelling Approach

To leverage the strengths of both classical and deep learning approaches, hybrid models were implemented combining SARIMA and LSTM in two architectures:

  1. Sequential Hybrid: In this approach:
    • SARIMA was first fitted to the original time series to model trend and seasonality
    • LSTM was then trained on the SARIMA residuals to capture complex patterns missed by SARIMA
    • Final predictions combined SARIMA forecasts with LSTM residual predictions
  2. Parallel Hybrid: This model used:
    • Independent SARIMA and LSTM models trained on the original time series
    • Weighted averaging to combine predictions (initially with w=0.5)
    • Weight optimisation to minimise error metrics
  3. Performance Results:
    • For "The Alchemist", the Sequential Hybrid achieved the best overall performance with an MAE of 144.99, outperforming both standalone models
    • For "The Very Hungry Caterpillar", the Parallel Hybrid performed better (MAE: 800.53) than the Sequential approach, but neither improved on the classical SARIMA model
    • Weight optimisation in the Parallel Hybrid found that LSTM-only (weight=0) was optimal for most scenarios

The Sequential Hybrid demonstrated particular promise for "The Alchemist", achieving the lowest MAE across all models (144.99), suggesting that combining statistical modelling with deep learning can capture complementary aspects of the time series for more stable data patterns.

Monthly Aggregation & Forecasting

Monthly Data Transformation

Weekly to Monthly Aggregation Process
Week 1
Week 2
Week 3
Week 4
Monthly Sum
Forecast Horizon Adjustment
Weekly: 32 weeks
Monthly: 8 months

Monthly Model Performance

Book Model MAE MAPE (%) RMSE
The Alchemist Monthly XGBoost 1272.90 48.32 1607.55
The Alchemist Monthly SARIMA NaN NaN NaN
The Very Hungry Caterpillar Monthly XGBoost 2555.47 27.36 2920.21
The Very Hungry Caterpillar Monthly SARIMA NaN NaN NaN

Monthly Forecasting Approach

To evaluate whether monthly aggregation could improve forecasting performance, the weekly data was resampled to monthly frequency and new models were trained:

  1. Data Transformation:
    • Weekly sales data was aggregated by summing values by month
    • Forecast horizon was adjusted from 32 weeks to 8 months for equivalent timeline
    • Features were recalculated at the monthly level, including lag and seasonality indicators
  2. XGBoost Implementation:
    • Monthly XGBoost models were trained with cross-validation and hyperparameter tuning
    • Feature engineering included month indicators, yearly patterns, and rolling statistics
    • Performance was significantly worse than weekly models, with high MAE values
  3. SARIMA Implementation:
    • Monthly SARIMA models encountered NaN errors due to data quality issues
    • Models were specified as ARIMA(2,1,0)(1,0,1)[12] for The Alchemist and ARIMA(0,1,1)(1,0,1)[12] for The Very Hungry Caterpillar
    • Despite good training fits, forecasting failed due to data preprocessing challenges
  4. Weekly vs. Monthly Comparison:
    • Weekly models significantly outperformed monthly aggregations across all metrics
    • Monthly XGBoost showed very high errors (MAE: 1272.90-2555.47) compared to weekly SARIMA (MAE: 169.33-382.28)
    • Monthly aggregation smoothed important weekly patterns, losing valuable signal for prediction

The poor performance of monthly models demonstrates that the weekly sales patterns contain critical information for accurate forecasting. The data loss from aggregation was not offset by any reduction in noise, making weekly modelling clearly superior for these book sales forecasts.

Comprehensive Model Comparison

Performance Metrics Comparison

Model The Alchemist MAE The Alchemist MAPE The Very Hungry Caterpillar MAE The Very Hungry Caterpillar MAPE
SARIMA 169.33 31.89% 415.45 19.60%
LSTM 166.96 24.99% 706.73 32.52%
XGBoost 607.91 96.81% 2167.09 99.02%
Sequential Hybrid 144.99 30.86% 1243.30 66.41%
Parallel Hybrid 245.70 44.52% 800.53 38.01%
Monthly XGBoost 1272.90 48.32% 2555.47 27.36%

Recommended Models by Book

The Alchemist

Sequential Hybrid
  • ✓ Lowest MAE (144.99)
  • ✓ Good MAPE (30.86%)
  • ✓ Captures trend and seasonal patterns
  • ✓ Outperforms standalone models

The Very Hungry Caterpillar

SARIMA
  • ✓ Lowest MAE (415.45)
  • ✓ Best MAPE (19.60%)
  • ✓ Handles volatility effectively
  • ✓ Simpler implementation than alternatives

Model Evaluation & Comparison

After implementing multiple forecasting approaches, a comprehensive comparison revealed clear patterns in model performance:

  1. Overall Performance by Book:
    • For "The Alchemist", the Sequential Hybrid achieved the best results (MAE: 144.99), followed closely by LSTM (MAE: 166.96) and SARIMA (MAE: 169.33)
    • For "The Very Hungry Caterpillar", SARIMA was the clear winner (MAE: 415.45), significantly outperforming all other approaches
  2. Model-Specific Insights:
    • Classical SARIMA models performed consistently well, particularly for volatile data
    • LSTM models showed promise but required complex implementation and tuning
    • XGBoost consistently underperformed for time series forecasting despite parameter tuning
    • Hybrid models showed mixed results, with Sequential Hybrid excelling for stable patterns
    • Monthly models underperformed weekly models across the board, highlighting the importance of granular data
  3. Error Metrics & Interpretation:
    • MAE values reflect absolute prediction errors, with "The Very Hungry Caterpillar" showing higher values due to its larger sales volume
    • MAPE provides percentage errors, revealing that "The Very Hungry Caterpillar" had proportionally smaller errors (19.60% vs. 30.86% for best models)
    • Confidence interval coverage was generally poor across models, indicating challenges in uncertainty estimation

The comparative analysis demonstrates that different books benefit from different modelling approaches based on their unique sales patterns. The stable patterns of "The Alchemist" benefitted from the combined strengths of the Sequential Hybrid, while the more volatile patterns of "The Very Hungry Caterpillar" were best captured by the robust SARIMA approach.

Key Findings & Conclusions

Critical insights drawn from the time series analysis and forecasting process

Book-Specific Optimal Forecasting Approaches

Different books require different forecasting approaches based on their sales patterns. "The Alchemist" benefitted most from a Sequential Hybrid model (MAE: 144.99), combining SARIMA's trend and seasonality modelling with LSTM's ability to capture residual patterns. In contrast, "The Very Hungry Caterpillar" was best forecast using classical SARIMA methods (MAE: 415.45, MAPE: 19.60%), which robustly handled its higher volatility.

Temporal Granularity Impacts Forecast Accuracy

Weekly data significantly outperformed monthly aggregation for forecasting accuracy. Monthly models lost critical information about weekly sales patterns, resulting in much higher errors (Monthly XGBoost MAE: 1272.90-2555.47 vs. Weekly SARIMA MAE: 169.33-382.28). This demonstrates that Nielsen should maintain weekly sales tracking for optimal forecasting results.

Classical Models Remain Competitive

Despite advances in machine learning and deep learning, classical statistical methods like SARIMA remain highly competitive for time series forecasting. SARIMA models delivered consistent, reliable results with simpler implementation and less tuning than complex alternatives. For "The Very Hungry Caterpillar", SARIMA achieved the best performance of all models, highlighting its robustness for volatile sales patterns.

Volatility Challenges All Models

Higher volatility in "The Very Hungry Caterpillar" sales resulted in larger absolute errors across all models compared to "The Alchemist". This suggests that more volatile titles require more sophisticated forecasting approaches and potentially the incorporation of external variables (e.g., school terms, holidays) to improve prediction accuracy.

Hybrid Models Show Promise for Stable Patterns

The Sequential Hybrid architecture, which uses LSTM to model SARIMA residuals, achieved the best overall results for "The Alchemist", demonstrating that combining statistical and machine learning approaches can capture complementary aspects of time series for titles with more stable sales patterns.

Business Implications & Recommendations

Strategic insights and actionable recommendations for Nielsen BookScan and publishers

1

Implement Tailored Forecasting Systems

For Nielsen's new service aimed at small to medium-sized independent publishers, implement a dual forecasting approach based on book characteristics:

  • For stable titles with clear seasonality (like "The Alchemist"): Deploy Sequential Hybrid models combining SARIMA and LSTM to maximise forecast accuracy.
  • For volatile titles with higher sales variance (like "The Very Hungry Caterpillar"): Utilise robust SARIMA models with weekly data to balance accuracy and implementation complexity.

This tailored approach would optimise the balance between forecasting accuracy and system complexity, delivering the best value for independent publishers.

2

Develop Book Category-Specific Models

Different book categories exhibit distinctive sales patterns that impact forecasting accuracy. Nielsen should:

  • Create category-specific forecasting models for fiction, non-fiction, children's books, and educational titles
  • Incorporate relevant external variables (e.g., school terms for educational books, cultural events for fiction)
  • Provide category-specific benchmarks to help publishers contextualise their forecasts

This approach would improve forecast accuracy and allow publishers to make more informed decisions about inventory management and reprinting schedules based on how similar titles have performed historically.

3

Maintain Weekly Data Granularity

The analysis clearly demonstrated that weekly data significantly outperforms monthly aggregation for sales forecasting. Nielsen should:

  • Continue collecting and providing weekly sales data for all forecasting applications
  • Ensure consistent weekly data capture to avoid gaps that compromise forecast accuracy
  • Implement automated quality checks to identify and address data anomalies

Maintaining this granularity is essential for capturing short-term sales fluctuations that drive accurate forecasting, especially for titles with seasonal patterns tied to specific weeks of the year.

4

Develop an Economic Lifespan Prediction Tool

To address Nielsen's objective of helping publishers understand the "useful economic life span" of titles, create a specialised prediction tool that:

  • Analyses historical sales trajectories to identify patterns of sustained demand
  • Forecasts multiple years ahead to identify potential "longevity candidates"
  • Calculates economic viability thresholds based on production costs and expected returns

This tool would help small and medium publishers make more strategic decisions about initial print runs and reprint scheduling, reducing both overstock risk and missed sales opportunities.

5

Enhance Services with Uncertainty Estimates

The analysis revealed challenges with confidence interval estimation across models. Nielsen should:

  • Improve uncertainty quantification in forecasts to help publishers manage risk
  • Develop best-case, worst-case, and most-likely scenarios for sales projections
  • Provide visualisations that clearly communicate forecast uncertainty to non-technical users

Better uncertainty estimates would allow publishers to make more robust inventory decisions, planning for contingencies while optimising for expected outcomes.

Nielsen BookScan Time Series Analysis Project

Using time series analysis for sales and demand forecasting in book retail