Automobile Price Prediction Analysis - Interactive Guide

Comprehensive Feature Engineering & Machine Learning Approach

9-Step Analysis Process

  1. Library Imports: Import essential libraries for data analysis, visualisation, and machine learning, including pandas, numpy, sklearn, and seaborn
  2. Data Exploration: Load the automobile dataset (205 rows, 26 features), examine data types, identify unique values, and correct brand name typos
  3. Data Preprocessing: Handle missing values, then apply label encoding, binary encoding, and one-hot encoding to the categorical variables
  4. Feature Scaling: Apply MinMax and Standard scaling to normalise features, and remove outliers using the IQR method
  5. Multicollinearity Check: Use the correlation matrix and eigenvalue analysis to identify highly correlated features (threshold > 0.95)
  6. Feature Selection: Apply Mutual Information Regression and Recursive Feature Elimination (RFE) to identify the top 15 features
  7. Correlation Heatmap: Create heatmaps to visualise correlations with price (|correlation| > 0.5) for feature validation
  8. Feature DataFrames: Create target (price) and input feature DataFrames based on feature importance rankings
  9. Model Evaluation: Train a Linear Regression model, calculate performance metrics (MSE, R²), and visualise results
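
The multicollinearity check in step 5 can be sketched as follows. This is a minimal illustration on synthetic data, not the code used in the analysis: the values are randomly generated, the column names are borrowed from the dataset only for readability, and `high_pairs` and `eigenvalues` are hypothetical variable names. An eigenvalue of the correlation matrix close to zero signals that some features are nearly linear combinations of others.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): wheelbase is built to nearly duplicate carlength.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
X = pd.DataFrame({
    "carlength": a,
    "wheelbase": a + 0.05 * rng.normal(size=n),
    "peakrpm": b,
})

# Absolute pairwise correlations; keep the upper triangle so each pair appears once.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.95  # threshold taken from the guide
]

# Eigenvalues of the correlation matrix, in ascending order.
eigenvalues = np.linalg.eigvalsh(X.corr().to_numpy())

print(high_pairs)
print(eigenvalues)
```

One feature from each flagged pair can then be dropped before modelling; which one to keep is a judgement call.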

Comprehensive Visual Guide

Data Preprocessing Techniques

  • Brand name typo correction (maxda → mazda, toyouta → toyota)
  • Created Brand_Modified feature grouping low-frequency brands
  • Label encoding for cylindernumber (four→4, six→6, etc.)
  • Binary encoding for fueltype, aspiration, enginelocation
  • One-hot encoding for remaining categorical variables
  • TF-IDF vectorisation for CarName text analysis
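
The encoding steps above could look roughly like this in pandas. The four-row frame is a hypothetical sample, and several details are assumptions rather than facts from the analysis: the rarity threshold for Brand_Modified, the "other" label, and the choice of diesel = 1 for the binary column.

```python
import pandas as pd

# Hypothetical mini-sample; the real dataset has 205 rows and 26 features.
df = pd.DataFrame({
    "CarName": ["maxda rx3", "toyouta corona", "bmw 320i", "mazda glc"],
    "cylindernumber": ["four", "four", "six", "four"],
    "fueltype": ["gas", "diesel", "gas", "gas"],
    "carbody": ["hatchback", "sedan", "sedan", "hatchback"],
})

# Brand extraction: first token of CarName, followed by typo correction.
df["Brand"] = df["CarName"].str.split().str[0].str.lower()
df["Brand"] = df["Brand"].replace({"maxda": "mazda", "toyouta": "toyota"})

# Group low-frequency brands (threshold of 2 and the label "other" are assumptions).
counts = df["Brand"].value_counts()
rare = counts[counts < 2].index
df["Brand_Modified"] = df["Brand"].where(~df["Brand"].isin(rare), "other")

# Label encoding: map cylinder words to their numeric value.
cylinder_map = {"two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "eight": 8, "twelve": 12}
df["cylindernumber"] = df["cylindernumber"].map(cylinder_map)

# Binary encoding for a two-level column (diesel = 1 is an assumption).
df["fueltype"] = (df["fueltype"] == "diesel").astype(int)

# One-hot encoding for a remaining categorical column.
df = pd.get_dummies(df, columns=["carbody"], dtype=int)

print(df)
```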

Feature Engineering Methods

  • Outlier removal using IQR method (Q1-1.5*IQR, Q3+1.5*IQR)
  • Standard scaling for normalisation
  • Float64 to integer conversion for consistency
  • Missing value imputation (mean for numerical, 'Unknown' for categorical)
  • Feature creation from existing data (Brand extraction)
  • Multicollinearity detection (correlation > 0.95)
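
As a rough sketch of the IQR rule and standard scaling listed above, the snippet below filters rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and then standardises what remains. The data values and the helper name `remove_iqr_outliers` are hypothetical; whether the original analysis filtered every numeric column or only some is not stated.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values; 400 is a deliberate outlier in horsepower.
df = pd.DataFrame({
    "horsepower": [60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0, 400.0],
    "enginesize": [90.0, 92.0, 97.0, 98.0, 108.0, 109.0, 120.0, 130.0],
})

def remove_iqr_outliers(frame, columns):
    """Keep rows whose values lie within the IQR fences for every listed column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= frame[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return frame[mask]

clean = remove_iqr_outliers(df, ["horsepower", "enginesize"])

# Standard scaling: zero mean and unit variance per column.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(clean),
    columns=clean.columns,
    index=clean.index,
)

print(len(df), "->", len(clean), "rows")
```

Filtering before scaling means the mean and standard deviation used by the scaler are not distorted by the rows that were removed.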

Feature Selection Approaches

  • Mutual Information Regression for feature importance
  • Recursive Feature Elimination (RFE) with DecisionTreeRegressor
  • Correlation analysis with price target
  • Eigenvalue analysis for multicollinearity
  • Cross-validation of feature rankings
  • Selection of top 15 features for final model
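
The two selection methods named above can be combined along these lines. This is a sketch on synthetic data in which only `f0` and `f1` drive the target; the feature names, the sample size, and selecting 2 features (instead of the guide's 15) are assumptions made to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, mutual_info_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): six candidate features, two of them informative.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"f{i}" for i in range(6)])
y = 3.0 * X["f0"] - 2.0 * X["f1"] + 0.1 * rng.normal(size=n)

# Mutual information between each feature and the continuous target.
mi = pd.Series(
    mutual_info_regression(X, y, random_state=0),
    index=X.columns,
).sort_values(ascending=False)

# RFE repeatedly drops the feature with the lowest tree importance.
rfe = RFE(DecisionTreeRegressor(random_state=0), n_features_to_select=2)
rfe.fit(X, y)
selected = X.columns[rfe.support_].tolist()

print(mi)
print(selected)
```

Comparing the top of the `mi` ranking with `selected` is one simple way to cross-check the two methods against each other.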

Model Validation Techniques

  • Train-test split with random_state=10
  • Linear Regression model fitting
  • Mean Squared Error (MSE) calculation
  • R-squared (R²) performance metric
  • Actual vs Predicted price visualisation
  • Residual plot analysis for model diagnostics
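
A minimal version of this validation loop is sketched below. Only `random_state=10`, the model type, and the two metrics come from the guide; the synthetic data and the 80/20 split ratio are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption) standing in for the selected features and price.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -0.7, 0.3]) + 0.5 * rng.normal(size=200)

# test_size=0.2 is an assumption; random_state=10 follows the guide.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Residuals feed the diagnostic plot: they should scatter around zero without structure.
residuals = y_test - y_pred

print(f"MSE={mse:.3f}, R2={r2:.3f}")
```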

Technical Implementation Details

Key Technical Decisions:

Regression vs Classification: mutual_info_regression and DecisionTreeRegressor were used rather than their classification counterparts, because price is a continuous target.

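Scaling Choice: Standard scaling was chosen over MinMax. Both are affected by outliers (MinMax through the minimum and maximum, Standard through the mean and standard deviation), so the IQR outlier removal still matters. Neither scaler changes the shape of a feature's distribution.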
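
To see how a single extreme value affects each scaler, the toy comparison below may help. The five values are invented for illustration and are not drawn from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Four ordinary values and one extreme value (hypothetical numbers).
values = np.array([[10.0], [11.0], [12.0], [13.0], [100.0]])

minmax = MinMaxScaler().fit_transform(values).ravel()
standard = StandardScaler().fit_transform(values).ravel()

# MinMax: the extreme value defines the range, squeezing the rest near 0.
print(np.round(minmax, 3))
# Standard: the extreme value shifts the mean and inflates the standard deviation.
print(np.round(standard, 3))
```

In both outputs the four ordinary values end up close together, which is why removing outliers before scaling is a common companion step.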

Encoding Strategy: Combined approach using label, binary, and one-hot encoding based on feature characteristics.

Feature Count: Selected 15 features to balance model complexity and performance.

Key Findings & Conclusions

0.73
R² Score
The model explains 73% of the variance in price, a reasonably strong result for a linear model
0.277
MSE
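Mean Squared Error, presumably computed on the scaled price (0.277 would be implausibly small in squared currency units); judge it relative to the target's variance rather than in isolation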
15
Optimal Features
Selected features aim to retain most of the predictive power whilst limiting the risk of overfitting
205
Data Points
Rows in the source dataset; the count remaining after IQR outlier removal is not stated
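
The R² and MSE figures above are linked: on the same data, R² = 1 - MSE / Var(y). If the price was standardised before modelling (an assumption, since the guide does not say so), Var(y) is close to 1 and an MSE of 0.277 corresponds to an R² of about 0.72, in line with the reported 0.73. The sketch below checks the identity on synthetic numbers.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic target and noisy predictions (assumption, for illustration only).
rng = np.random.default_rng(0)
y_true = rng.normal(size=500)
y_pred = y_true + 0.5 * rng.normal(size=500)

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# Same quantity recovered from the MSE and the (population) variance of the target.
r2_from_mse = 1.0 - mse / np.var(y_true)

print(r2, r2_from_mse)
```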

Most Important Features (RFE Rankings)

symboling
wheelbase
carlength
carwidth
carheight
curbweight
enginesize
horsepower
peakrpm
citympg
highwaympg
compressionratio
carbody_sedan
drivewheel_fwd
enginetype_ohc

Business Implications & Recommendations

💰
Pricing Strategy
Key Insight: Engine size, horsepower, and curbweight are the strongest price predictors.
Recommendation: Focus marketing on these specifications for premium vehicles. Optimise engine performance features to justify higher pricing tiers.
🎯
Product Development
Key Insight: Car dimensions (length, width, height) significantly impact price perception.
Recommendation: Design vehicles with optimal proportions. Consider segment positioning based on dimensional characteristics.
📊
Market Positioning
Key Insight: Fuel efficiency (citympg, highwaympg) negatively correlates with price.
Recommendation: Balance performance and efficiency. Position eco-friendly models in different market segments from performance vehicles.
🔧
Manufacturing Focus
Key Insight: Engine type (OHC) and drive system (FWD) influence pricing significantly.
Recommendation: Invest in advanced engine technologies. Consider drivetrain options as key differentiators in product lineup.
📈
Sales & Marketing
Key Insight: The model explains 73% of price variance (R² = 0.73) using key specifications.
Recommendation: Use predictive model for competitive pricing analysis and market positioning. Train sales teams on value-driving features.
⚖️
Risk Management
Key Insight: Strong correlation between physical and performance attributes reduces pricing uncertainty.
Recommendation: Use feature importance rankings to assess portfolio risk and identify potential pricing gaps in current product range.

Implementation Roadmap

Recommended Implementation Steps:

  1. Short-term (1-3 months): Integrate pricing model into sales tools and competitive analysis processes
  2. Medium-term (3-6 months): Refine product specifications based on feature importance findings
  3. Long-term (6-12 months): Develop new vehicle concepts optimising high-impact features identified in analysis
  4. Ongoing: Continuously update model with new market data and emerging feature trends