My Data Voyage | Data Science Portfolio

My Data Voyage

QA Lead Transitioning to Data Science | Cambridge Level 7 Certificate 2025 | Python + ML/NLP Portfolio

Project Portfolio

Customer Segmentation

🎯 Challenge: Businesses lack data-driven methods to identify distinct customer clusters, resulting in inefficient marketing and suboptimal engagement.

Analysed a dataset through exploration and preprocessing, conducted feature engineering, determined the optimal number of clusters (k), and applied machine learning models to segment customers effectively.

Technologies: Python, Scikit-learn, Pandas, Clustering Algorithms

Student Dropout Prediction

🎯 Challenge: Educational institutions struggle to predict at-risk students early due to fragmented academic and personal data.

Conducted phased data exploration, preprocessing, and feature engineering. Built and compared predictive models using XGBoost and a neural network to forecast student dropout rates with high accuracy.

Technologies: Python, XGBoost, TensorFlow, Pandas

Statistical Hypothesis Testing

🎯 Challenge: Organisations misinterpret data by conflating correlation with causation, leading to flawed decision-making without rigorous validation.

Applied statistical hypothesis testing to evaluate organisational data scenarios. Explored the differences between correlation and causation in data analysis.

Technologies: Python, Statistical Methods
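
As an illustration of the kind of test this project describes, the sketch below runs a two-sided permutation test for a difference in means using only the standard library. The function name, the sample data, and the choice of a permutation test are assumptions made for illustration; the project itself does not state which tests were used. Note that a small p-value only indicates an association between group and outcome, not a causal link.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative)."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign observations to groups at random and recompute the gap.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical weekly ticket counts for two teams (made-up data).
team_a = [12, 15, 14, 16, 13, 15, 14, 17]
team_b = [22, 25, 24, 23, 26, 24, 25, 27]
p_value = permutation_test(team_a, team_b)
print(f"p-value: {p_value:.4f}")
```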

Anomaly Detection

🎯 Challenge: Conventional monitoring misses subtle anomalies in operational systems, exposing organisations to financial losses without automated detection.

Explored a dataset to identify patterns, preprocessed data, and performed feature engineering. Applied statistical techniques and machine learning algorithms to detect anomalies, followed by a detailed report summarising findings and recommendations.

Technologies: Python, Pandas, Scikit-learn, Statistical Methods
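
One simple statistical technique of the kind mentioned above is the modified z-score based on the median absolute deviation (MAD). The sketch below is a minimal standard-library version; the function name, the sample readings, and the 3.5 cut-off are assumptions for illustration rather than details taken from the project.

```python
from statistics import median


def mad_anomalies(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        return []  # no spread, so this statistic cannot flag anything
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical sensor readings with one obvious outlier at index 6.
readings = [70.1, 69.8, 70.4, 70.0, 69.9, 70.2, 85.3, 70.1, 69.7, 70.3]
print(mad_anomalies(readings))
```

Because the median and MAD are barely affected by the outlier itself, this approach tends to be more robust than a mean and standard deviation rule on small samples.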

Time Series Forecasting

🎯 Challenge: Retailers face volatile demand fluctuations as baseline forecasting methods fail to capture temporal patterns accurately.

Analysed historical sales data using time series decomposition, feature engineering, and ARIMA modelling to forecast future demand. Achieved 15% improvement in forecast accuracy over baseline methods.

Technologies: Python, Statsmodels, Prophet, Pandas
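
A figure such as "15% improvement over baseline" is usually a relative reduction in forecast error. The sketch below shows one way to compute it, assuming mean absolute percentage error (MAPE) as the metric; the page does not say which metric was used, and the sales figures here are made up for illustration.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100 * sum(errors) / len(errors)


def relative_improvement(baseline_error, model_error):
    """Percentage reduction in error relative to the baseline."""
    return 100 * (baseline_error - model_error) / baseline_error


# Hypothetical monthly sales and two competing forecasts.
actual = [120, 135, 150, 160]
naive_forecast = [110, 120, 135, 150]   # e.g. previous period carried forward
model_forecast = [118, 130, 146, 163]   # e.g. output of a fitted ARIMA model

baseline_mape = mape(actual, naive_forecast)
model_mape = mape(actual, model_forecast)
print(f"baseline MAPE: {baseline_mape:.2f}%")
print(f"model MAPE:    {model_mape:.2f}%")
print(f"improvement:   {relative_improvement(baseline_mape, model_mape):.1f}%")
```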

Neural Network Project

🎯 Challenge: Simplistic models fail to handle high-dimensional data, requiring advanced architectures for intricate feature learning.

Designed and implemented a deep neural network architecture from scratch. Applied forward and backward propagation algorithms, optimised hyperparameters, and achieved state-of-the-art performance on classification tasks.

Technologies: Python, TensorFlow, Keras, NumPy, Matplotlib

Explore More

Movie Review Sentiment Classification System

🎯 Challenge: Streaming platforms struggle to gauge audience reactions from vast review volumes without automated sentiment analysis.

Advanced Neural Network Architecture Visualisation

🎯 Challenge: Practitioners struggle to understand complex architectures without interactive tools showing layer interactions and data flows.

Interactive Neural Network Learning Demonstrator

🎯 Challenge: Novice learners find neural network training opaque without real-time demonstrations of weight updates and convergence.

Hyperparameter Optimisation Visual Analytics

🎯 Challenge: Tuning hyperparameters remains time-consuming without visual dashboards to track optimisation trajectories across parameter spaces.

Foundation Neural Network Model Explorer

🎯 Challenge: Beginners lack accessible tools to experiment with foundational neural network concepts and activation functions.

Deep Learning Network Implementation Framework

🎯 Challenge: Developing bespoke networks is hindered by fragmented libraries and steep learning curves for low-level implementations.

Automated Hyperparameter Tuning Pipeline

🎯 Challenge: Model tuning is slow and expensive guesswork without real-time visualisation of the search space.

Comprehensive Model Evaluation Metrics Suite

🎯 Challenge: Developers overlook performance aspects beyond accuracy without integrated evaluation suites for precision and recall.

Gradient Descent Optimiser Comparative Analysis

🎯 Challenge: Selecting optimal optimisers is challenging amid varying convergence speeds without comparative analyses across datasets.

Titanic Survival Prediction Neural Network

🎯 Challenge: Historical datasets with imbalanced features and noise impede development of robust classifiers for risk assessment.

Supervised Learning Algorithm Implementation

🎯 Challenge: Practitioners waste time using the wrong algorithm type because task boundaries between regression and classification are unclear.

Advanced Time Series Forecasting Models

🎯 Challenge: Retailers face volatile demand as baseline forecasts fail to capture seasonality and shocks.

Customer Behavioural Segmentation Analytics

🎯 Challenge: Businesses waste marketing budget because customer behaviour clusters remain hidden.

Maritime Engine Anomaly Detection System

🎯 Challenge: Subtle engine failures slip past rule-based monitoring, risking safety and huge repair costs.

Statistical Hypothesis Testing Framework

🎯 Challenge: Teams misinterpret data by confusing correlation with causation without rigorous testing.

Student Retention Predictive Analytics

🎯 Challenge: Universities lose talented students because at-risk cases cannot be spotted early.

Advanced NLP Sentiment Classification Engine

🎯 Challenge: Companies drown in unstructured text and miss critical customer sentiment signals.

Custom Deep Learning Architecture Design

🎯 Challenge: Off-the-shelf models fail on specialised high-dimensional problems requiring bespoke architectures.

Baltimore Police ARIMA Crime Forecasting System

🎯 Challenge: Police cannot predict crime hotspots accurately, leading to inefficient patrols and public safety gaps.

Baltimore Crime Patterns Time Series Analysis

🎯 Challenge: Law enforcement struggles with unpredictable crime patterns without reliable predictive models to anticipate hotspots and trends.

RNN Model Comparison for Text Classification

🎯 Challenge: Text classification suffers from inconsistent performance across recurrent architectures without systematic LSTM, GRU, and RNN comparisons.

Decision Tree Analysis with SHAP Interpretation

🎯 Challenge: Interpretable models are underutilised in high-stakes decisions without explainability tools to demystify feature importance.

Bank Customer Churn Prediction System

🎯 Challenge: Banks lose revenue from customer churn as siloed data hinders early identification of at-risk clients for retention efforts.

Neural Network Optimiser Performance Analysis

🎯 Challenge: Selecting optimal optimisers is challenging amid varying convergence speeds without comparative analyses.

Titanic Survivor Prediction Optimisation Study

🎯 Challenge: Even classic datasets hide subtle interactions that only optimised models can uncover.

Neural Network Manual Propagation Framework

🎯 Challenge: Understanding backpropagation deeply requires implementing it from scratch, which most practitioners never do.

Regression vs Classification Decision Framework

🎯 Challenge: Practitioners waste time using the wrong algorithm type because task boundaries are unclear.

Automobile Price Prediction with PCA Analysis

🎯 Challenge: Automotive marketplaces face opaque pricing influenced by interdependent features without dimensionality reduction techniques.

Advanced Dimensionality Reduction Visualisation

🎯 Challenge: High-dimensional data is impossible to interpret without powerful reduction and visualisation.

Comprehensive Automobile Price Analysis Guide

🎯 Challenge: Car pricing appears random when dozens of correlated features hide the real drivers.

Customer Loyalty Predictive Analytics System

🎯 Challenge: Retailers face declining loyalty and escalating costs as traditional metrics fail to predict long-term engagement without integrated analytics.

Medical Insurance Cost Correlation Analysis

🎯 Challenge: Insurers cannot price policies fairly without understanding hidden correlations between lifestyle and cost.

Statistical Hypothesis Testing Analysis Dashboard

🎯 Challenge: Non-technical stakeholders cannot trust statistical claims without interactive p-value and power analysis tools.

Technical Skills

Python & Data Science Stack

Pandas • NumPy • Scikit-learn • TensorFlow • PyTorch • Jupyter • Git • Production-ready ML pipelines • Automated/scalable workflows

Machine Learning & Deep Learning

Supervised & unsupervised learning • XGBoost • Random Forests • SVM • Neural networks (custom architectures, forward/backward propagation, gradient descent) • Ensemble methods • Clustering (K-means, DBSCAN, HDBSCAN, hierarchical)

Natural Language Processing & Generative AI

Hugging Face Transformers • FinBERT • FinLLaMA • BERT • BERTopic • VADER • GPT-2 • BART • Sentence Transformers • spaCy • NLTK • Text classification & sentiment analysis (92% accuracy on customer-review dataset)

Time-Series Analysis & Forecasting

ARIMA/SARIMA • Prophet • LSTM • Statsmodels • Decomposition techniques • Demand & financial forecasting (15% accuracy improvement vs baseline on book-sales project)

Anomaly Detection

Isolation Forests • Autoencoders • Statistical methods • Real-time maritime/engine anomaly detection project

Model Evaluation & Optimisation

Hyperparameter tuning (Grid, Random, Bayesian) • ROC-AUC • Precision-Recall • Custom business metrics • SHAP interpretability • A/B testing

Feature Engineering & Dimensionality Reduction

Feature creation/selection • PCA • t-SNE • UMAP • Autoencoders • High-dimensional data processing

Data Visualisation & BI

Matplotlib • Seaborn • Plotly • Power BI • Interactive dashboards • Business intelligence reporting

Statistical Analysis & Hypothesis Testing

Parametric & non-parametric tests • Correlation & causal inference • Model validation

MLOps & Deployment Fundamentals

Experiment tracking • Model versioning • Deployment pipelines • Drift detection concepts • Automated retraining basics (academic & portfolio exposure)

Customer & Business Analytics

RFM analysis • Cohort analysis • Behavioural segmentation • Retention optimisation • Targeted marketing insights

Visualisation Gallery

A selection of my data visualisation techniques

Contact Me

Interested in working together? Fill out the form below, and I'll get back to you promptly.

Location

Based in London, UK

Bank of England Financial Analysis System
G-SIB Financial Analysis - Interactive Visual Guide

🏦 Capstone Project (University of Cambridge 2025) - Prototype G-SIB Risk Assessment System

Built an end-to-end prototype analysing 81 public quarterly financial reports & earnings-call transcripts of three Global Systemically Important Banks (2023–2025). Combined advanced NLP (FinBERT, FinLLaMA, BERTopic, VADER) with structured financial metrics extraction and ARIMA forecasting. Produced an interactive dashboard presenting regulatory-style risk insights.

3 G-SIB Banks Analysed
81 Financial Documents
9 Quarters Analysed
7 Analysis Methods

🎯 The Challenge

Amid escalating geopolitical tensions and economic volatility, regulators face mounting pressure to improve real-time monitoring of the systemic risks posed by Global Systemically Important Banks (G-SIBs). Manual analysis of large volumes of financial reports and transcripts remains inefficient and prone to oversight, which hinders proactive risk mitigation.

✅ The Solution & Impact

Built an end-to-end prototype analysing 81 public quarterly financial reports and earnings-call transcripts using advanced NLP (FinBERT, FinLLaMA, BERTopic, VADER) combined with ARIMA forecasting. Produced an interactive dashboard presenting regulatory-style risk insights, demonstrating feasibility of automated systemic risk monitoring for regulatory applications.

🎯 Problem Statement & Strategic Context

Business Challenge

Global Systemically Important Banks (G-SIBs) produce quarterly financial results accompanied by analyst Q&A transcripts and webcasts. The Bank of England's Prudential Regulation Authority (PRA) supervises these institutions to promote their safety and soundness and, through that, the stability of the UK financial system.

Core Problem: While quantitative metrics are readily incorporated by existing risk-assessment frameworks, qualitative insights embedded in earnings-call discussions remain under-utilised.

Research Objective

By analysing multiple Global Systemically Important Banks and their quarterly earnings results over the period 2023-2025, identify key insights using advanced analytical techniques that may be missed by traditional quantitative analysis methods.

Strategic Rationale & Methodology Justification

🔍 Beyond Traditional Metrics
Traditional financial analysis focuses primarily on numerical data from financial statements. Our approach incorporates textual analysis of qualitative reports and transcripts, extracting insights from narrative context surrounding the numbers that conventional methods often overlook.
🎯 Multi-Faceted Analysis Framework
By combining sentiment analysis, risk assessment, topic modelling, and financial metrics extraction, we provide a holistic view of bank performance that increases the likelihood of identifying unique and valuable regulatory insights.
🦙 Domain-Specific AI Models
FinLLaMA's financial domain expertise surpasses general-purpose models in understanding complex financial statements, regulatory language, and subtle sentiment shifts critical for G-SIB supervision.
🎤 Transcript Intelligence Mining
Earnings call transcripts contain forward-looking statements and management commentary not found in static reports, revealing subtle strategic shifts and concerns essential for proactive regulatory oversight.
📊 Advanced Topic Discovery
BERTopic's state-of-the-art capabilities discover nuanced, coherent themes within complex financial documents, identifying emerging trends and regulatory concerns not explicitly stated in quantitative data.
Real-Time Risk Intelligence
Multi-model sentiment analysis provides early warning capabilities for identifying operational vulnerabilities and reputational risks before they manifest in traditional financial metrics.

🏛️ Regulatory Innovation Impact

Enhanced Supervision
Qualitative risk indicators complement quantitative frameworks
Proactive Monitoring
Early identification of emerging systemic risks
Data-Driven Insights
Evidence-based regulatory decision making

📋 Complete Analysis Process Flow

1. Document Processing & Extraction (Complete)
Processing 81 financial documents across Bank A, Bank B, and Bank C from Q1 2023 to Q1 2025, including quarterly earnings, Q&A transcripts, and presentations.
Key Components:
Multi-format processing (PDF, DOCX, TXT)
Quality scoring system for document assessment
Comprehensive metadata extraction
Automated text cleaning and normalisation
2. Financial Metrics Extraction (Complete)
Extracting structured financial data from unstructured text using regex patterns with confidence scoring for ROE, NIM, capital ratios, and other key metrics.
Extracted Metrics:
Return on Equity (ROE) analysis
Net Interest Margin (NIM) tracking
Capital ratio calculations
Confidence scoring for each metric
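
The regex-with-confidence idea in this step can be sketched roughly as follows. The patterns, the plausibility ranges, and the confidence weights are all hypothetical choices for illustration; the project's actual patterns and scoring rules are not given on this page, and real filings vary far more in wording.

```python
import re

# Hypothetical patterns: each captures a percentage following the metric name.
NUMBER = r"\s*(?:of|was|:)?\s*(\d+(?:\.\d+)?)\s*%"
PATTERNS = {
    "ROE": re.compile(r"return on (?:common )?equity\s*(?:\(ROE\))?" + NUMBER, re.I),
    "NIM": re.compile(r"net interest margin\s*(?:\(NIM\))?" + NUMBER, re.I),
    "CET1": re.compile(r"CET1(?: capital)? ratio" + NUMBER, re.I),
}

# Assumed plausibility ranges (in percent) used only to adjust confidence.
PLAUSIBLE = {"ROE": (-50, 50), "NIM": (0, 10), "CET1": (4, 30)}


def extract_metrics(text):
    """Return {metric: {"value": float, "confidence": float}} for matched metrics."""
    results = {}
    for name, pattern in PATTERNS.items():
        values = [float(m.group(1)) for m in pattern.finditer(text)]
        if not values:
            continue
        low, high = PLAUSIBLE[name]
        confidence = 0.5                      # base score for any match
        if low <= values[0] <= high:
            confidence += 0.3                 # value falls in a plausible range
        if len(set(values)) == 1:
            confidence += 0.2                 # all mentions agree with each other
        results[name] = {"value": values[0], "confidence": round(confidence, 2)}
    return results


sample = (
    "Return on equity (ROE) was 12.4% for the quarter. "
    "Net interest margin of 2.1% was stable. The CET1 ratio was 14.8%."
)
print(extract_metrics(sample))
```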
3. Risk Assessment Analysis (Complete)
Comprehensive risk identification across operational and financial dimensions, highlighting potential vulnerabilities that could impact the financial system.
Risk Categories:
Operational risk assessment
Credit risk analysis
Market risk evaluation
Liquidity risk monitoring
4. G-SIB Analysis (Complete)
Assessment based on Basel III framework covering cross-jurisdictional activity, size, interconnectedness, substitutability, and complexity factors.
G-SIB Categories:
Cross-jurisdictional activity assessment
Size indicator analysis
Interconnectedness evaluation
Substitutability factor scoring
Complexity indicator measurement
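
To make the five-category assessment concrete, the sketch below combines category scores into a single figure. It assumes the equal 20% weighting per category used in the Basel Committee's published G-SIB assessment methodology; how this prototype actually weighted or scored the categories is not stated here, and the example scores are hypothetical.

```python
# Equal weighting across the five categories (assumed, per the public BCBS methodology).
CATEGORY_WEIGHTS = {
    "size": 0.20,
    "interconnectedness": 0.20,
    "substitutability": 0.20,
    "complexity": 0.20,
    "cross_jurisdictional_activity": 0.20,
}


def gsib_score(category_scores_bps):
    """Weighted average of category scores, each expressed in basis points."""
    missing = set(CATEGORY_WEIGHTS) - set(category_scores_bps)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(
        weight * category_scores_bps[name]
        for name, weight in CATEGORY_WEIGHTS.items()
    )


# Hypothetical category scores for an anonymised bank (not real data).
example_bank = {
    "size": 310,
    "interconnectedness": 280,
    "substitutability": 190,
    "complexity": 350,
    "cross_jurisdictional_activity": 420,
}
print(f"overall score: {gsib_score(example_bank):.0f} bps")
```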
5. Transcript Analysis (Complete)
Analysis of earnings call transcripts providing insights from forward-looking statements, management commentary, and Q&A sessions not found in static reports.
Analysis Components:
Speaker sentiment analysis
Topic identification in Q&A sessions
Regulatory mention tracking
Management commentary evaluation
6. Advanced Sentiment Analysis (Complete)
Multi-model sentiment analysis using FinBERT and VADER with intelligent text chunking for nuanced financial text analysis across the document corpus.
Sentiment Models:
FinBERT financial domain-specific analysis
VADER sentiment intensity scoring
Text chunking with context preservation
Quarterly sentiment trend tracking
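
Chunking matters because BERT-based models such as FinBERT accept at most 512 tokens per input. One common way to preserve context is to overlap consecutive chunks, as in the sketch below. It splits on whitespace as a rough stand-in for model tokens, and the function name and default sizes are assumptions rather than the project's actual settings.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into word chunks where each chunk repeats the tail of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


demo = " ".join(str(i) for i in range(10))
print(chunk_words(demo, chunk_size=4, overlap=1))
```

Each chunk can then be scored separately and the scores aggregated per document, for example by averaging, to obtain a document-level sentiment figure.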
7. BERTopic Modelling (Complete)
State-of-the-art topic discovery using BERTopic to identify recurring themes and emerging trends within the financial documents corpus.
Topic Analysis:
Hierarchically structured topic discovery
Coherent theme identification
Sentiment-topic correlation analysis
Granular regulatory theme detection
8. FinLLaMA Summarisation (Complete)
Domain-specific financial summarisation using FinLLaMA (LLaMA 3.1 fine-tuned) providing superior financial understanding compared to general-purpose models.
LLM Capabilities:
Financial domain expertise
Regulatory language interpretation
Contextual financial insight generation
Sentiment shift identification
9. Comprehensive Reporting (Complete)
Generation of multi-format outputs including CSV datasets, interactive HTML dashboards, and executive summary reports for stakeholder review.
Output Deliverables:
Interactive dashboard creation
Structured CSV datasets
Executive summary generation
Regulatory compliance reports

📊 Comprehensive Bank Analysis

🏦 Bank A
Negative Sentiment: 18.2%
Risk Profile: High
Highest Risk Institution

Persistent elevated negative sentiment following major acquisition integration. Requires enhanced supervisory oversight and weekly sentiment monitoring.

🏛️ Bank B
Negative Sentiment: 8.4%
Risk Profile: Low
Most Stable Institution

Consistently lowest negative sentiment with 50% net income growth. Positioned as stabilising G-SIB force with effective risk management.

🏢 Bank C
Negative Sentiment: 12.7%
Risk Profile: Medium
Volatile Q2 Pattern

Notable Q2 volatility spikes in both 2023 and 2024. Sharp Q1 2025 improvement requires investigation of underlying factors.

🎯 Stress Testing Results

Post-Stress Capital Ratios: 13%+
Max Impact (Bank B): 3.55%
Systemic Resilience: Adequate

🔍 Key Findings & Conclusions

📈 Risk Differentiation Identified
Bank A demonstrates highest risk profile with 18.2% negative sentiment, significantly above the 13.1% average. Persistent pattern following major acquisition integration indicates ongoing operational vulnerabilities requiring enhanced supervision.
🏆 Bank B Excellence
Most stable institution with 8.4% negative sentiment and 50% net income growth. Positioned as stabilising G-SIB force with consistently effective risk management and sentiment control.
⚠️ Bank C Q2 Volatility
Notable Q2 sentiment spikes in both 2023 and 2024, followed by sharp Q1 2025 improvement. Pattern requires investigation to understand underlying seasonal or operational factors.
🛡️ Capital Adequacy Maintained
All banks maintain post-stress capital ratios exceeding 13%, well above regulatory minimums. Quantitative resilience validated despite qualitative sentiment variations.
🎯 Topic Analysis Insights
Key themes focus on financial stability, risk exposure, and reporting transparency. Discussions centre on post-pandemic normalisation, integration challenges, and global economic impacts.
🤖 Advanced Analytics Value
FinLLaMA outperformed general LLMs in financial domain understanding. Sentiment analysis revealed risk differentials that traditional quantitative metrics miss, providing enhanced supervisory intelligence.

🎯 Model Validation Events

Major Acquisition Event (Q1 2023)
Bank A sentiment spike correlated with acquisition announcement
Market Volatility (Q2 2024)
Bank C sentiment deterioration aligned with trading challenges
Interest Rate Environment
Bank B resilience demonstrated across rate cycles

💼 Business Implications & Strategic Recommendations

🚨 Immediate Actions Required
Bank A Enhanced Supervision: Implement weekly sentiment monitoring and enhanced oversight protocols. Address operational vulnerabilities identified through persistent negative sentiment patterns.

Bank C Q2 Investigation: Conduct targeted analysis of Q2 volatility patterns to identify underlying operational or seasonal factors requiring mitigation.
📊 Real-Time Monitoring Framework
Automated Alert Systems: Deploy sentiment-based early warning systems with threshold breach notifications for all G-SIBs.

Quarterly Intelligence: Integrate advanced analytics into regular supervisory review processes for enhanced risk identification.
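
A threshold-breach alert of the kind proposed above could start as simply as the sketch below. It uses the negative-sentiment shares reported in this analysis and, as an assumption for illustration only, treats the peer average as the alert threshold; a production system would need a calibrated threshold and time-series inputs.

```python
from statistics import mean

# Negative-sentiment shares (%) reported in the bank analysis above.
negative_sentiment = {"Bank A": 18.2, "Bank B": 8.4, "Bank C": 12.7}


def breaches(scores, threshold):
    """Return the banks whose score exceeds the threshold, in alphabetical order."""
    return sorted(bank for bank, value in scores.items() if value > threshold)


# Assumed rule: flag any bank above the peer average.
peer_average = mean(list(negative_sentiment.values()))
alerts = breaches(negative_sentiment, threshold=peer_average)

print(f"peer average: {peer_average:.1f}%")
print(f"alerts: {alerts}")
```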
🏆 Best Practice Analysis
Bank B Model: Study and disseminate effective risk management and communication strategies demonstrated by Bank B's consistent performance.

Industry Benchmarking: Establish sentiment-based performance benchmarks across G-SIB institutions.
🤖 Advanced Analytics Integration
FinLLaMA Development: Further fine-tune models on Basel reports and BoE/PRA filings for enhanced regulatory intelligence.

Methodology Expansion: Scale advanced sentiment and topic analysis across additional financial institutions and document types.
📈 Strategic Implementation
Phased Rollout: Extend analysis framework to additional G-SIBs and domestic banks with proven methodology.

Integration Planning: Incorporate advanced analytics into existing supervisory technology stack with staff training programmes.
🔮 Long-Term Development
Predictive Capabilities: Develop forward-looking risk indicators based on sentiment trends and topic evolution.

Regulatory Innovation: Position BoE as leader in advanced analytics application for financial supervision and systemic risk management.