Implementing PCA and t-SNE for Dimensionality Reduction in Real-World Automotive Data
Load automobiles.csv dataset and perform initial data exploration
Handle missing values and prepare data for encoding
Transform categorical variables into numerical representations
Remove outliers and ensure data quality
Apply StandardScaler for feature normalisation
Apply PCA for dimensionality reduction
Calculate mutual information scores for feature ranking
Apply t-SNE for visualisation and build predictive models
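The preparation steps above (missing values, encoding, outlier removal, scaling) can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the small synthetic DataFrame stands in for automobiles.csv, the fueltype column and the value ranges are hypothetical, and median imputation and the 1.5 × IQR rule are assumed choices that the source does not specify.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for automobiles.csv (columns and values are illustrative).
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    "wheelbase": rng.normal(98.0, 6.0, n),
    "carwidth": rng.normal(65.0, 2.0, n),
    "fueltype": rng.choice(["gas", "diesel"], n),
    "price": rng.normal(13000.0, 4000.0, n),
})
df.loc[0, "wheelbase"] = np.nan   # simulate a missing value
df.loc[1, "price"] = 90000.0      # simulate an extreme outlier

# 1. Handle missing values: fill numeric columns with their median (assumed strategy).
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# 2. Transform categorical variables into numeric indicator columns.
encoded = pd.get_dummies(df, columns=["fueltype"], dtype=float)

# 3. Remove outliers on price with the 1.5 * IQR rule (assumed rule).
q1, q3 = encoded["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = encoded["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = encoded[in_range]

# 4. Standardise the features to zero mean and unit variance.
X = clean.drop(columns="price")
y = clean["price"]
X_scaled = StandardScaler().fit_transform(X)

print(f"{len(df)} rows before cleaning, {len(clean)} after; "
      f"{X_scaled.shape[1]} features after encoding")
```

Scaling comes last so that the means and variances are computed only on the rows that survive outlier removal.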
PCA is a dimensionality reduction technique that transforms correlated features into uncorrelated principal components while preserving maximum variance.
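A minimal sketch of PCA with a 95% variance threshold, using scikit-learn. The data here is synthetic (60 correlated features generated from 10 hypothetical latent factors), so the number of components it keeps will differ from the figures reported for the automobile dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 60 correlated features driven by 10 latent factors.
latent = rng.normal(size=(200, 10))
mixing = rng.normal(size=(10, 60))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 60))

# PCA is sensitive to scale, so standardise first.
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA to keep just enough components
# to explain at least that fraction of the variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print("components kept:", X_pca.shape[1])
print("variance retained:", pca.explained_variance_ratio_.sum())

# The transformed features are uncorrelated: their covariance matrix is diagonal.
cov = np.cov(X_pca, rowvar=False)
off_diagonal = cov - np.diag(np.diag(cov))
print("largest off-diagonal covariance:", np.abs(off_diagonal).max())
```

The `explained_variance_ratio_` attribute gives the share of variance carried by each component, which is the quantity a scree plot or cumulative-variance chart would show.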
Key Components:
t-SNE is a non-linear dimensionality reduction technique for visualising high-dimensional data in 2D space whilst preserving local structure.
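As a rough illustration, a 2D t-SNE embedding can be computed with scikit-learn as below. The three synthetic clusters in 28 dimensions are a hypothetical stand-in for the PCA-reduced automobile features, and the perplexity value is an assumed default rather than a setting taken from the source.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in: three well-separated groups of 40 points in 28 dimensions.
centers = rng.normal(scale=10.0, size=(3, 28))
X = np.vstack([center + rng.normal(size=(40, 28)) for center in centers])
labels = np.repeat([0, 1, 2], 40)  # group labels, e.g. for colouring a scatter plot

# perplexity must be smaller than the number of samples; 30 is a common starting point.
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)  # one 2D coordinate per sample
```

The two columns of `embedding` can then be passed to any scatter-plot routine, with `labels` (or a price category) as the colour. Distances between clusters in a t-SNE plot are not directly interpretable, so the plot is best read for local grouping only.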
Visualisation Results:
Mutual information (MI) is a statistical measure that quantifies the dependency between each feature and the target variable, identifying the most informative components.
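A minimal sketch of ranking features by mutual information with a continuous target, using scikit-learn's `mutual_info_regression`. The data is synthetic and the feature names are hypothetical: the target depends strongly on the first feature, weakly on the second, and not at all on the last two.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)

# Synthetic data: only the first two columns carry information about y.
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

# MI is estimated with a nearest-neighbour method, so fix the seed for repeatability.
mi = mutual_info_regression(X, y, random_state=0)

# Rank features from most to least informative.
ranking = np.argsort(mi)[::-1]
for idx in ranking:
    print(f"{feature_names[idx]}: {mi[idx]:.3f}")
```

Unlike a correlation coefficient, MI also picks up non-linear dependencies, which is why it is a reasonable ranking tool ahead of a tree-based model.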
Top Features by MI Score:
Comparison of Random Forest Regression and Linear Regression models for automobile price prediction using dimensionality-reduced features.
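A comparison of this kind might be set up as sketched below. The data is synthetic with a deliberately non-linear target, and the 80/20 split and `n_estimators=200` are assumed settings, so the scores it prints are not the ones reported for the automobile dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data with a non-linear relationship between features and target.
X = rng.uniform(-2.0, 2.0, size=(600, 5))
y = np.sin(2.0 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=600)

# Hold out 20% of the rows so both models are scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, predictions),
        "r2": r2_score(y_test, predictions),
    }
    print(f"{name}: MSE = {results[name]['mse']:.4f}, R² = {results[name]['r2']:.4f}")
```

Scoring both models on the same held-out split keeps the comparison fair and avoids reporting a fit that only reflects the training data.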
Model Performance:
PCA successfully reduced feature space from 60 to 28 dimensions whilst retaining 95% of variance. This significant reduction improves computational efficiency without sacrificing predictive power.
The three most informative features for price prediction are symboling, wheelbase, and carwidth, which account for 19.70%, 10.98%, and 7.13% of total variance respectively.
Random Forest significantly outperformed Linear Regression with R² = 0.9993 vs 0.8544, indicating strong non-linear relationships in automotive pricing data.
t-SNE visualisation revealed distinct clustering of automobiles into price categories, with clear separation between lower, medium, and high-cost vehicles.
Categorical encoding expanded the dataset from 26 to 60 features, providing a richer representation, whilst outlier removal improved model stability.
The final Random Forest model achieved MSE = 0.0005 and R² = 0.9993, demonstrating exceptional predictive accuracy for real-world application.
Focus pricing decisions on the three most influential factors: symboling (insurance risk rating), wheelbase (vehicle size), and carwidth (interior space). These features explain over 37% of price variance and should guide premium positioning strategies.
The clear clustering of vehicles by price categories suggests distinct market segments. Manufacturers can leverage these insights to position new models within optimal price brackets and identify potential market gaps.
Deploy the Random Forest model (R² = 0.9993) for dynamic pricing strategies, inventory valuation, and competitive analysis. The model's exceptional performance enables confident real-time pricing decisions.
R&D investments should prioritise features with high mutual information scores. Improvements in symboling (insurance risk rating) and optimal wheelbase design provide maximum impact on market value and customer appeal.
Implement dimensionality reduction techniques across other automotive datasets to identify key value drivers efficiently. PCA's 53% feature reduction demonstrates significant computational savings without accuracy loss.
Use t-SNE visualisations to map competitor vehicles within price-feature space, identifying positioning opportunities and understanding market dynamics through cluster analysis of automotive specifications.