My Data Voyage | Data Science Portfolio

My Data Voyage

QA Lead Transitioning to Data Science | Cambridge Level 7 Certificate 2025 | Python + ML/NLP Portfolio

Featured Projects

Capstone Project (University of Cambridge 2025) - Prototype G-SIB Risk Assessment System

🎯 Challenge: Regulators struggle to monitor systemic bank risks as manual analysis of financial reports remains inefficient and prone to oversight.

Built an end-to-end prototype analysing 81 public quarterly financial reports & earnings-call transcripts of three Global Systemically Important Banks (2023–2025). Combined advanced NLP (FinBERT, FinLLaMA, BERTopic, VADER) with structured financial metrics extraction and ARIMA forecasting. Produced an interactive dashboard presenting regulatory-style risk insights. Technologies: Python, FinBERT, VADER, BERTopic, GPT-2, ARIMA modelling, Sentence Transformers, BART, HDBSCAN clustering. Prototype built using only publicly available data – no affiliation with or delivery to the Bank of England.

NLP Customer Review Sentiment Analysis for Wellness Centre

🎯 Challenge: Wellness centres struggle to extract actionable insights from unstructured customer feedback, missing service improvement opportunities.

Developed an NLP solution to analyse customer feedback sentiment. Implemented text preprocessing techniques and trained a BERT-based model to classify sentiment with 92% accuracy. Technologies: Python, NLTK, Transformers, PyTorch.

Project Portfolio

Customer Segmentation

🎯 Challenge: Businesses lack data-driven methods to identify distinct customer clusters, resulting in inefficient marketing and suboptimal engagement.

Analysed a dataset through exploration and preprocessing, conducted feature engineering, determined the optimal number of clusters (k), and applied machine learning models to segment customers effectively.

Technologies: Python, Scikit-learn, Pandas, Clustering Algorithms
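As a rough illustration of how the number of clusters (k) might be chosen, the sketch below runs k-means for several values of k and records the inertia (within-cluster sum of squares); the "elbow" is the point where adding a cluster stops reducing inertia by much. The synthetic data, the range of k, and all variable names are assumptions for illustration, not the project's actual dataset or code.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for customer features: three well-separated groups (assumed data).
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centres])

# Fit k-means for k = 1..6 and record the inertia for each k.
ks = list(range(1, 7))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

# Successive drops in inertia: a large drop followed by small ones marks the elbow.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

On data like this the drop from k=2 to k=3 dwarfs the drop from k=3 to k=4, which is the visual cue the elbow plot relies on. Real customer data is rarely this clean, so a silhouette score is often checked alongside it.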

Student Dropout Prediction

🎯 Challenge: Educational institutions struggle to predict at-risk students early due to fragmented academic and personal data.

Conducted phased data exploration, preprocessing, and feature engineering. Built and compared predictive models using XGBoost and a neural network to forecast student dropout rates with high accuracy.

Technologies: Python, XGBoost, TensorFlow, Pandas

Statistical Hypothesis Testing

🎯 Challenge: Organisations misinterpret data by conflating correlation with causation, leading to flawed decision-making without rigorous validation.

Applied statistical hypothesis testing to evaluate organisational data scenarios. Explored the differences between correlation and causation in data analysis.

Technologies: Python, Statistical Methods
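One way to make the idea concrete is a permutation test, a non-parametric check of whether two groups differ by more than chance would explain. The sketch below uses only the standard library; the sample values, the function name `permutation_test`, and the number of permutations are illustrative assumptions rather than the project's own code.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the labels and recompute the difference under the null hypothesis.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical task-completion times (minutes) under two page layouts.
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
group_b = [13.0, 13.4, 12.9, 13.2, 13.5, 13.1, 12.8, 13.3]

p_value = permutation_test(group_a, group_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value only says the difference is unlikely under random labelling. Whether the layout caused it still depends on how the groups were assigned, which is the correlation-versus-causation point.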

Anomaly Detection

🎯 Challenge: Conventional monitoring misses subtle anomalies in operational systems, exposing organisations to financial losses without automated detection.

Explored a dataset to identify patterns, preprocessed data, and performed feature engineering. Applied statistical techniques and machine learning algorithms to detect anomalies, followed by a detailed report summarising findings and recommendations.

Technologies: Python, Pandas, Scikit-learn, Statistical Methods
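For the statistical side, a minimal sketch of one common rule, the interquartile-range (IQR) fence, is shown below. It is an assumed example of a "statistical technique", not necessarily the one used in the project; the readings and the multiplier `k=1.5` are hypothetical.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical sensor readings with one high and one low excursion.
readings = [70.1, 69.8, 70.4, 70.0, 95.2, 69.9, 70.2, 44.0, 70.3]
mask = iqr_outliers(readings)
print("Anomalous positions:", np.flatnonzero(mask).tolist())
```

A model-based detector such as an Isolation Forest can then be compared against this simple baseline on the same data.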

Time Series Forecasting

🎯 Challenge: Retailers face volatile demand fluctuations as baseline forecasting methods fail to capture temporal patterns accurately.

Analysed historical sales data using time series decomposition, feature engineering, and ARIMA modelling to forecast future demand. Achieved a 15% improvement in forecast accuracy over baseline methods.

Technologies: Python, Statsmodels, Prophet, Pandas
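To show how an improvement over a baseline can be measured, the sketch below compares the mean absolute error (MAE) of a naive "previous period" forecast with that of a model forecast. The choice of MAE, the demand figures, and the model outputs are all assumptions made for illustration; they are not the project's data or results.

```python
import numpy as np


def mae(actual, forecast):
    """Mean absolute error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs(actual - forecast)))


# Hypothetical demand for six periods.
actual = [120, 132, 128, 141, 150, 147]
# Naive baseline: each forecast is the previous period's observed value.
baseline = [118, 120, 132, 128, 141, 150]
# Hypothetical output of a fitted forecasting model (e.g. ARIMA).
model = [121, 129, 131, 138, 147, 149]

baseline_mae = mae(actual, baseline)
model_mae = mae(actual, model)

# Relative improvement: the share of baseline error removed by the model.
improvement = (baseline_mae - model_mae) / baseline_mae
print(f"baseline MAE={baseline_mae:.2f}, model MAE={model_mae:.2f}, "
      f"improvement={improvement:.1%}")
```

Stating the metric and the baseline alongside a headline figure makes the improvement reproducible.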

Neural Network Project

🎯 Challenge: Simplistic models fail to handle high-dimensional data, requiring advanced architectures for intricate feature learning.

Designed and implemented a deep neural network architecture from scratch. Applied forward and backward propagation algorithms, optimised hyperparameters, and achieved state-of-the-art performance on classification tasks.

Technologies: Python, TensorFlow, Keras, NumPy, Matplotlib
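A minimal sketch of forward and backward propagation in plain NumPy follows. It trains a single-hidden-layer network on the XOR problem, which is far smaller than the project's network; the layer sizes, learning rate, and iteration count are assumptions chosen only to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

# One hidden layer of 4 tanh units followed by a sigmoid output (assumed sizes).
W1 = rng.normal(scale=1.0, size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(scale=1.0, size=(4, 1))
b2 = np.zeros((1, 1))
lr = 0.5
losses = []

for _ in range(2000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    losses.append(float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))))

    # Backward pass: for sigmoid + binary cross-entropy, dL/dz2 = (p - y) / n.
    dz2 = (p - y) / len(X)
    dW2 = h.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    dz1 = (dz2 @ W2.T) * (1 - h ** 2)  # chain rule through tanh
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)

    # Gradient descent update.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The same forward, backward, and update steps repeat per layer in a deeper network; frameworks such as TensorFlow automate the backward pass.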

Explore More

Movie Review Sentiment Classification System

🎯 Challenge: Streaming platforms struggle to gauge audience reactions from vast review volumes without automated sentiment analysis.

Advanced Neural Network Architecture Visualisation

🎯 Challenge: Practitioners struggle to understand complex architectures without interactive tools showing layer interactions and data flows.

Interactive Neural Network Learning Demonstrator

🎯 Challenge: Novice learners find neural network training opaque without real-time demonstrations of weight updates and convergence.

Hyperparameter Optimisation Visual Analytics

🎯 Challenge: Tuning hyperparameters remains time-consuming without visual dashboards to track optimisation trajectories across parameter spaces.

Foundation Neural Network Model Explorer

🎯 Challenge: Beginners lack accessible tools to experiment with foundational neural network concepts and activation functions.

Deep Learning Network Implementation Framework

🎯 Challenge: Developing bespoke networks is hindered by fragmented libraries and steep learning curves for low-level implementations.

Automated Hyperparameter Tuning Pipeline

🎯 Challenge: Model tuning is slow and expensive guesswork without real-time visualisation of the search space.

Comprehensive Model Evaluation Metrics Suite

🎯 Challenge: Developers overlook performance aspects beyond accuracy without integrated evaluation suites for precision and recall.

Gradient Descent Optimiser Comparative Analysis

🎯 Challenge: Selecting optimal optimisers is challenging amid varying convergence speeds without comparative analyses across datasets.

Titanic Survival Prediction Neural Network

🎯 Challenge: Historical datasets with imbalanced features and noise impede development of robust classifiers for risk assessment.

Supervised Learning Algorithm Implementation

🎯 Challenge: Practitioners waste time using the wrong algorithm type because task boundaries between regression and classification are unclear.

Advanced Time Series Forecasting Models

🎯 Challenge: Retailers face volatile demand as baseline forecasts fail to capture seasonality and shocks.

Customer Behavioural Segmentation Analytics

🎯 Challenge: Businesses waste marketing budget because customer behaviour clusters remain hidden.

Maritime Engine Anomaly Detection System

🎯 Challenge: Subtle engine failures slip past rule-based monitoring, risking safety and huge repair costs.

Statistical Hypothesis Testing Framework

🎯 Challenge: Teams misinterpret data by confusing correlation with causation without rigorous testing.

Student Retention Predictive Analytics

🎯 Challenge: Universities lose talented students because at-risk cases cannot be spotted early.

Advanced NLP Sentiment Classification Engine

🎯 Challenge: Companies drown in unstructured text and miss critical customer sentiment signals.

Custom Deep Learning Architecture Design

🎯 Challenge: Off-the-shelf models fail on specialised high-dimensional problems requiring bespoke architectures.

Baltimore Police ARIMA Crime Forecasting System

🎯 Challenge: Police cannot predict crime hotspots accurately, leading to inefficient patrols and public safety gaps.

Baltimore Crime Patterns Time Series Analysis

🎯 Challenge: Law enforcement struggles with unpredictable crime patterns without reliable predictive models to anticipate hotspots and trends.

RNN Model Comparison for Text Classification

🎯 Challenge: Text classification suffers from inconsistent performance across recurrent architectures without systematic LSTM, GRU, and RNN comparisons.

Decision Tree Analysis with SHAP Interpretation

🎯 Challenge: Interpretable models are underutilised in high-stakes decisions without explainability tools to demystify feature importance.

Bank Customer Churn Prediction System

🎯 Challenge: Banks lose revenue from customer churn as siloed data hinders early identification of at-risk clients for retention efforts.

Neural Network Optimiser Performance Analysis

🎯 Challenge: Selecting optimal optimisers is challenging amid varying convergence speeds without comparative analyses.

Titanic Survivor Prediction Optimisation Study

🎯 Challenge: Even classic datasets hide subtle interactions that only optimised models can uncover.

Neural Network Manual Propagation Framework

🎯 Challenge: Understanding backpropagation deeply requires implementing it from scratch — most never do.

Regression vs Classification Decision Framework

🎯 Challenge: Practitioners waste time using the wrong algorithm type because task boundaries are unclear.

Automobile Price Prediction with PCA Analysis

🎯 Challenge: Automotive marketplaces face opaque pricing influenced by interdependent features without dimensionality reduction techniques.

Advanced Dimensionality Reduction Visualisation

🎯 Challenge: High-dimensional data is impossible to interpret without powerful reduction and visualisation.

Comprehensive Automobile Price Analysis Guide

🎯 Challenge: Car pricing appears random when dozens of correlated features hide the real drivers.

Customer Loyalty Predictive Analytics System

🎯 Challenge: Retailers face declining loyalty and escalating costs as traditional metrics fail to predict long-term engagement without integrated analytics.

Medical Insurance Cost Correlation Analysis

🎯 Challenge: Insurers cannot price policies fairly without understanding hidden correlations between lifestyle and cost.

Statistical Hypothesis Testing Analysis Dashboard

🎯 Challenge: Non-technical stakeholders cannot trust statistical claims without interactive p-value and power analysis tools.

Technical Skills

Python & Data Science Stack

Pandas • NumPy • Scikit-learn • TensorFlow • PyTorch • Jupyter • Git • Production-ready ML pipelines • Automated/scalable workflows

Machine Learning & Deep Learning

Supervised & unsupervised learning • XGBoost • Random Forests • SVM • Neural networks (custom architectures, forward/backward propagation, gradient descent) • Ensemble methods • Clustering (K-means, DBSCAN, HDBSCAN, hierarchical)

Natural Language Processing & Generative AI

Hugging Face Transformers • FinBERT • FinLLaMA • BERT • BERTopic • VADER • GPT-2 • BART • Sentence Transformers • spaCy • NLTK • Text classification & sentiment analysis (92% accuracy on customer-review dataset)

Time-Series Analysis & Forecasting

ARIMA/SARIMA • Prophet • LSTM • Statsmodels • Decomposition techniques • Demand & financial forecasting (15% accuracy improvement vs baseline on book-sales project)

Anomaly Detection

Isolation Forests • Autoencoders • Statistical methods • Real-time maritime/engine anomaly detection project

Model Evaluation & Optimisation

Hyperparameter tuning (Grid, Random, Bayesian) • ROC-AUC • Precision-Recall • Custom business metrics • SHAP interpretability • A/B testing

Feature Engineering & Dimensionality Reduction

Feature creation/selection • PCA • t-SNE • UMAP • Autoencoders • High-dimensional data processing

Data Visualisation & BI

Matplotlib • Seaborn • Plotly • Power BI • Interactive dashboards • Business intelligence reporting

Statistical Analysis & Hypothesis Testing

Parametric & non-parametric tests • Correlation & causal inference • Model validation

MLOps & Model Deployment Fundamentals

Experiment tracking • Model versioning • Deployment pipelines • Drift detection concepts • Automated retraining basics (academic & portfolio exposure)

Customer & Business Analytics

RFM analysis • Cohort analysis • Behavioural segmentation • Retention optimisation • Targeted marketing insights

Visualisation Gallery

A selection of my data visualisation techniques

Optimal cluster count determination using the elbow method for anomaly detection

Principal Component Analysis for anomaly identification

Multi-feature boxplot visualisation for examining distribution patterns across customer metrics

Plot showing k-means clustering

Violin plot illustrating SHAP values

Visual representation of hierarchical clustering showing cluster distances

Contact Me

Interested in working together? Fill out the form below, and I'll get back to you promptly.

Location

Based in London, UK

G-SIB Financial Analysis - Interactive Visual Guide

🏦 Capstone Project (University of Cambridge 2025) - Prototype G-SIB Risk Assessment System

3 G-SIB Banks Analysed

81 Financial Documents

9 Quarters Analysed

7 Analysis Methods

🎯 The Challenge

In the wake of escalating geopolitical tensions and economic volatility, regulators face mounting pressure to enhance real-time monitoring of systemic risks posed by Global Systemically Important Banks (G-SIBs), yet manual analysis of vast financial reports and transcripts remains inefficient and prone to oversight, hindering proactive risk mitigation.

✅ The Solution & Impact

Built an end-to-end prototype analysing 81 public quarterly financial reports and earnings-call transcripts using advanced NLP (FinBERT, FinLLaMA, BERTopic, VADER) combined with ARIMA forecasting. Produced an interactive dashboard presenting regulatory-style risk insights, demonstrating feasibility of automated systemic risk monitoring for regulatory applications.

🎯 Problem Statement & Strategic Context

Business Challenge

Global Systemically Important Banks (G-SIBs) publish quarterly financial results accompanied by analyst Q&A transcripts and webcasts. The Bank of England's Prudential Regulation Authority (PRA) supervises these institutions to promote their safety and soundness and to support financial stability.

Core Problem: While quantitative metrics are readily incorporated by existing risk-assessment frameworks, qualitative insights embedded in earnings-call discussions remain under-utilised.

Research Objective

Analyse the quarterly earnings results of multiple Global Systemically Important Banks over the period 2023–2025, using advanced analytical techniques to surface key insights that traditional quantitative analysis may miss.

Strategic Rationale & Methodology Justification

🔍

Beyond Traditional Metrics

Traditional financial analysis focuses primarily on numerical data from financial statements. Our approach incorporates textual analysis of qualitative reports and transcripts, extracting insights from narrative context surrounding the numbers that conventional methods often overlook.

🎯

Multi-Faceted Analysis Framework

By combining sentiment analysis, risk assessment, topic modelling, and financial metrics extraction, we provide a holistic view of bank performance that increases the likelihood of identifying unique and valuable regulatory insights.

🦙

Domain-Specific AI Models

FinLLaMA's financial domain expertise surpasses general-purpose models in understanding complex financial statements, regulatory language, and subtle sentiment shifts critical for G-SIB supervision.

🎤

Transcript Intelligence Mining

Earnings call transcripts contain forward-looking statements and management commentary not found in static reports, revealing subtle strategic shifts and concerns essential for proactive regulatory oversight.

📊

Advanced Topic Discovery

BERTopic's state-of-the-art capabilities discover nuanced, coherent themes within complex financial documents, identifying emerging trends and regulatory concerns not explicitly stated in quantitative data.

⚡

Real-Time Risk Intelligence

Multi-model sentiment analysis provides early warning capabilities for identifying operational vulnerabilities and reputational risks before they manifest in traditional financial metrics.

🏛️ Regulatory Innovation Impact

Enhanced Supervision
Qualitative risk indicators complement quantitative frameworks

Proactive Monitoring
Early identification of emerging systemic risks

Data-Driven Insights
Evidence-based regulatory decision making

📋 Complete Analysis Process Flow

Document Processing & Extraction

Complete

Processing 81 financial documents across Bank A, Bank B, and Bank C from Q1 2023 to Q1 2025, including quarterly earnings, Q&A transcripts, and presentations.

Key Components:

Multi-format processing (PDF, DOCX, TXT)

Quality scoring system for document assessment

Comprehensive metadata extraction

Automated text cleaning and normalisation

Financial Metrics Extraction

Complete

Extracting structured financial data from unstructured text using regex patterns with confidence scoring for ROE, NIM, capital ratios, and other key metrics.

Extracted Metrics:

Return on Equity (ROE) analysis

Net Interest Margin (NIM) tracking

Capital ratio calculations

Confidence scoring for each metric
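The extraction step can be sketched roughly as follows. The regex patterns, the plausible ranges, and the two-level confidence rule are hypothetical stand-ins written for this illustration; the prototype's actual patterns and scoring are not reproduced here.

```python
import re

NUMBER = r"(\d+(?:\.\d+)?)\s*%"

# Hypothetical patterns: metric name -> (regex, plausible range in percent).
PATTERNS = {
    "ROE": (re.compile(r"return on (?:average )?equity\D{0,40}?" + NUMBER, re.I),
            (0.0, 40.0)),
    "NIM": (re.compile(r"net interest margin\D{0,40}?" + NUMBER, re.I),
            (0.0, 10.0)),
    "CET1": (re.compile(r"CET1(?: capital)? ratio\D{0,40}?" + NUMBER, re.I),
             (4.5, 30.0)),
}


def extract_metrics(text):
    """Return {metric: (value, confidence)} for the first match of each pattern."""
    results = {}
    for name, (pattern, (low, high)) in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            continue
        value = float(match.group(1))
        # Assumed confidence rule: values inside the plausible range are trusted more.
        confidence = 0.9 if low <= value <= high else 0.4
        results[name] = (value, confidence)
    return results


sample = ("Return on equity was 11.2% for the quarter. "
          "Net interest margin of 3.05%. The CET1 ratio stood at 13.4%.")
metrics = extract_metrics(sample)
for name, (value, confidence) in metrics.items():
    print(f"{name}: {value}% (confidence {confidence})")
```

Bounding the gap between label and number with `\D{0,40}?` stops a pattern from skipping past an unrelated figure. The cost is that phrasings such as "improved by 12 basis points to 3.05%" are not matched, which is one reason low-confidence or missing values should be reviewed by hand.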

Risk Assessment Analysis

Complete

Comprehensive risk identification across operational and financial dimensions, highlighting potential vulnerabilities that could impact the financial system.

Risk Categories:

Operational risk assessment

Credit risk analysis

Market risk evaluation

Liquidity risk monitoring

G-SIB Analysis

Complete

Assessment based on Basel III framework covering cross-jurisdictional activity, size, interconnectedness, substitutability, and complexity factors.

G-SIB Categories:

Cross-jurisdictional activity assessment

Size indicator analysis

Interconnectedness evaluation

Substitutability factor scoring

Complexity indicator measurement
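As a simplified sketch of how the five categories combine, the Basel Committee's G-SIB methodology gives each category an equal 20% weight. The function below applies that weighting to category scores; the example scores are hypothetical, and the sketch omits the indicator-level calculations and other refinements of the full methodology.

```python
CATEGORY_WEIGHT = 0.20
CATEGORIES = (
    "cross_jurisdictional_activity",
    "size",
    "interconnectedness",
    "substitutability",
    "complexity",
)


def gsib_score(category_scores):
    """Equal-weighted average of the five category scores (same units as the inputs)."""
    missing = set(CATEGORIES) - set(category_scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(CATEGORY_WEIGHT * category_scores[name] for name in CATEGORIES)


# Hypothetical category scores in basis points for an illustrative bank.
example = {
    "cross_jurisdictional_activity": 200,
    "size": 150,
    "interconnectedness": 180,
    "substitutability": 120,
    "complexity": 250,
}
score = gsib_score(example)
print(f"Overall score: {score:.0f} bps")
```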

Transcript Analysis

Complete

Analysis of earnings call transcripts providing insights from forward-looking statements, management commentary, and Q&A sessions not found in static reports.

Analysis Components:

Speaker sentiment analysis

Topic identification in Q&A sessions

Regulatory mention tracking

Management commentary evaluation

Advanced Sentiment Analysis

Complete

Multi-model sentiment analysis using FinBERT and VADER with intelligent text chunking for nuanced financial text analysis across the document corpus.

Sentiment Models:

FinBERT financial domain-specific analysis

VADER sentiment intensity scoring

Text chunking with context preservation

Quarterly sentiment trend tracking
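Chunking matters because BERT-based models such as FinBERT accept at most 512 tokens per input, so long transcripts have to be split before scoring. A minimal sketch of overlapping chunks is shown below; it counts whitespace-separated words as a rough proxy for tokens, and the function name and the `max_words` and `overlap` defaults are assumptions rather than the prototype's settings.

```python
def chunk_words(text, max_words=200, overlap=40):
    """Split text into word chunks, where consecutive chunks share `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


# Hypothetical 500-word document.
document = " ".join(f"w{i}" for i in range(500))
chunks = chunk_words(document)
print(len(chunks), "chunks;", [len(c.split()) for c in chunks], "words each")
```

The overlap preserves context across chunk boundaries. Each chunk can then be scored separately and the scores averaged per document, optionally weighted by chunk length.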

BERTopic Modelling

Complete

State-of-the-art topic discovery using BERTopic to identify recurring themes and emerging trends within the financial documents corpus.

Topic Analysis:

Hierarchically structured topic discovery

Coherent theme identification

Sentiment-topic correlation analysis

Granular regulatory theme detection

FinLLaMA Summarisation

Complete

Domain-specific financial summarisation using FinLLaMA (LLaMA 3.1 fine-tuned) providing superior financial understanding compared to general-purpose models.

LLM Capabilities:

Financial domain expertise

Regulatory language interpretation

Contextual financial insight generation

Sentiment shift identification

Comprehensive Reporting

Complete

Generation of multi-format outputs including CSV datasets, interactive HTML dashboards, and executive summary reports for stakeholder review.

Output Deliverables:

Interactive dashboard creation

Structured CSV datasets

Executive summary generation

Regulatory compliance reports

📊 Comprehensive Bank Analysis

🏦 Bank A

18.2%

Negative Sentiment

High

Risk Profile

Highest Risk Institution

Persistent elevated negative sentiment following major acquisition integration. Requires enhanced supervisory oversight and weekly sentiment monitoring.

🏛️ Bank B

8.4%

Negative Sentiment

Low

Risk Profile

Most Stable Institution

Consistently lowest negative sentiment with 50% net income growth. Positioned as stabilising G-SIB force with effective risk management.

🏢 Bank C

12.7%

Negative Sentiment

Medium

Risk Profile

Volatile Q2 Pattern

Notable Q2 volatility spikes in both 2023 and 2024. Sharp Q1 2025 improvement requires investigation of underlying factors.

🎯 Stress Testing Results

13%+

Post-Stress Capital Ratios

3.55%

Max Impact (Bank B)

Adequate

Systemic Resilience

🔍 Key Findings & Conclusions

📈

Risk Differentiation Identified

Bank A shows the highest risk profile, with 18.2% negative sentiment, well above the 13.1% average across the three banks. The persistent pattern following a major acquisition integration indicates ongoing operational vulnerabilities requiring enhanced supervision.

🏆

Bank B Excellence

Most stable institution with 8.4% negative sentiment and 50% net income growth. Positioned as stabilising G-SIB force with consistently effective risk management and sentiment control.

⚠️

Bank C Q2 Volatility

Notable Q2 sentiment spikes in both 2023 and 2024, followed by sharp Q1 2025 improvement. Pattern requires investigation to understand underlying seasonal or operational factors.

🛡️

Capital Adequacy Maintained

All banks maintain post-stress capital ratios exceeding 13%, well above regulatory minimums. Quantitative resilience validated despite qualitative sentiment variations.

🎯

Topic Analysis Insights

Key themes focus on financial stability, risk exposure, and reporting transparency. Discussions centre on post-pandemic normalisation, integration challenges, and global economic impacts.

🤖

Advanced Analytics Value

FinLLaMA outperformed general LLMs in financial domain understanding. Sentiment analysis revealed risk differentials that traditional quantitative metrics miss, providing enhanced supervisory intelligence.

🎯 Model Validation Events

Major Acquisition Event (Q1 2023)
Bank A sentiment spike correlated with acquisition announcement

Market Volatility (Q2 2024)
Bank C sentiment deterioration aligned with trading challenges

Interest Rate Environment
Bank B resilience demonstrated across rate cycles

💼 Business Implications & Strategic Recommendations

🚨 Immediate Actions Required

Bank A Enhanced Supervision: Implement weekly sentiment monitoring and enhanced oversight protocols. Address operational vulnerabilities identified through persistent negative sentiment patterns.

Bank C Q2 Investigation: Conduct targeted analysis of Q2 volatility patterns to identify underlying operational or seasonal factors requiring mitigation.

📊 Real-Time Monitoring Framework

Automated Alert Systems: Deploy sentiment-based early warning systems with threshold breach notifications for all G-SIBs.

Quarterly Intelligence: Integrate advanced analytics into regular supervisory review processes for enhanced risk identification.
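A threshold-breach check of this kind can be sketched in a few lines. The negative-sentiment figures are the ones reported in this analysis; using the cross-bank average as the alert threshold, and the function name `breaches`, are assumptions for illustration only.

```python
# Negative-sentiment share (%) per bank, as reported in this analysis.
negative_sentiment = {"Bank A": 18.2, "Bank B": 8.4, "Bank C": 12.7}


def breaches(scores, threshold):
    """Return the names whose score exceeds the threshold, in alphabetical order."""
    return sorted(name for name, value in scores.items() if value > threshold)


# Assumed rule: alert on any bank above the cross-bank average.
average = sum(negative_sentiment.values()) / len(negative_sentiment)
alerts = breaches(negative_sentiment, threshold=average)

print(f"Average negative sentiment: {average:.1f}%")
print("Threshold breaches:", alerts)
```

In practice the threshold would more likely be calibrated per institution from its own history, so that a persistently cautious tone does not trigger constant alerts.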

🏆 Best Practice Analysis

Bank B Model: Study and disseminate effective risk management and communication strategies demonstrated by Bank B's consistent performance.

Industry Benchmarking: Establish sentiment-based performance benchmarks across G-SIB institutions.

🤖 Advanced Analytics Integration

FinLLaMA Development: Further fine-tune models on Basel reports and BoE/PRA filings for enhanced regulatory intelligence.

Methodology Expansion: Scale advanced sentiment and topic analysis across additional financial institutions and document types.

📈 Strategic Implementation

Phased Rollout: Extend analysis framework to additional G-SIBs and domestic banks with proven methodology.

Integration Planning: Incorporate advanced analytics into existing supervisory technology stack with staff training programmes.

🔮 Long-Term Development

Predictive Capabilities: Develop forward-looking risk indicators based on sentiment trends and topic evolution.

Regulatory Innovation: Position BoE as leader in advanced analytics application for financial supervision and systemic risk management.