QA Lead Transitioning to Data Science | Cambridge Level 7 Certificate 2025 | Python + ML/NLP Portfolio
🎯 Challenge: Regulators struggle to monitor systemic bank risks as manual analysis of financial reports remains inefficient and prone to oversight.
Built an end-to-end prototype analysing 81 public quarterly financial reports & earnings-call transcripts of three Global Systemically Important Banks (2023–2025). Combined advanced NLP (FinBERT, FinLLaMA, BERTopic, VADER) with structured financial metrics extraction and ARIMA forecasting. Produced an interactive dashboard presenting regulatory-style risk insights. Technologies: Python, FinBERT, VADER, BERTopic, GPT-2, ARIMA modelling, Sentence Transformers, BART, HDBSCAN clustering. Prototype built using only publicly available data – no affiliation with or delivery to the Bank of England.
🎯 Challenge: Wellness centres struggle to extract actionable insights from unstructured customer feedback, missing service improvement opportunities.
Developed an NLP solution to analyse customer feedback sentiment. Implemented text preprocessing techniques and trained a BERT-based model to classify sentiment with 92% accuracy. Technologies: Python, NLTK, Transformers, PyTorch.
🎯 Challenge: Businesses lack data-driven methods to identify distinct customer clusters, resulting in inefficient marketing and suboptimal engagement.
Analysed a dataset through exploration and preprocessing, conducted feature engineering, determined the optimal number of clusters (k), and applied machine learning models to segment customers effectively.
Technologies: Python, Scikit-learn, Pandas, Clustering Algorithms
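A minimal sketch of one common way to choose the number of clusters (k): run k-means for several candidate values and keep the k with the highest mean silhouette score. Everything here is an assumption for illustration. The customer points are synthetic, the helper names are hypothetical, and the project itself may have used a different criterion (for example the elbow method) or Scikit-learn's built-in implementations.

```python
import math


def init_centroids(points, k):
    # Deterministic farthest-point initialisation (an illustrative choice).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=50):
    # Plain Lloyd's algorithm; returns one cluster label per point.
    centroids = init_centroids(points, k)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels


def silhouette(points, labels):
    # Mean silhouette score: near 1 means tight, well-separated clusters.
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        scores.append((b - a) / (max(a, b) or 1.0))
    return sum(scores) / len(scores)


# Synthetic (spend, visits) points forming three well-separated groups.
customers = (
    [(1.0 + 0.1 * i, 1.0 + 0.05 * i) for i in range(6)]
    + [(5.0 + 0.1 * i, 5.0 - 0.05 * i) for i in range(6)]
    + [(9.0 + 0.1 * i, 1.0 + 0.05 * i) for i in range(6)]
)

scores = {k: silhouette(customers, kmeans(customers, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real customer data the features would normally be scaled first, and the silhouette result is best read alongside business judgement rather than as a definitive answer.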
🎯 Challenge: Educational institutions struggle to predict at-risk students early due to fragmented academic and personal data.
Conducted phased data exploration, preprocessing, and feature engineering. Built and compared predictive models using XGBoost and a neural network to forecast student dropout rates with high accuracy.
Technologies: Python, XGBoost, TensorFlow, Pandas
🎯 Challenge: Organisations misinterpret data by conflating correlation with causation, leading to flawed decision-making without rigorous validation.
Applied statistical hypothesis testing to evaluate organisational data scenarios. Explored the differences between correlation and causation in data analysis.
Technologies: Python, Statistical Methods
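A short sketch of one hypothesis test that fits this theme: a permutation test for whether an observed correlation could plausibly arise by chance. The data (training hours against a productivity score) is invented for illustration, and the function names are hypothetical rather than taken from the project.

```python
import random


def pearson(x, y):
    # Pearson correlation coefficient computed from first principles.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


def permutation_p_value(x, y, n_permutations=5000, seed=0):
    # Two-sided p-value: how often does shuffling y give a correlation
    # at least as strong as the observed one?
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical data: training hours and a productivity score per employee.
hours = [2, 4, 5, 7, 8, 10, 11, 13, 14, 16]
productivity = [51, 55, 54, 60, 62, 65, 64, 70, 71, 75]

r = pearson(hours, productivity)
p_value = permutation_p_value(hours, productivity)
print(f"r = {r:.3f}, p = {p_value:.4f}")
```

A small p-value only indicates that the association is unlikely to be chance. It says nothing about whether training causes productivity, whether the reverse holds, or whether a confounder drives both, which is the distinction the project explores.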
🎯 Challenge: Conventional monitoring misses subtle anomalies in operational systems, exposing organisations to financial losses without automated detection.
Explored a dataset to identify patterns, preprocessed data, and performed feature engineering. Applied statistical techniques and machine learning algorithms to detect anomalies, followed by a detailed report summarising findings and recommendations.
Technologies: Python, Pandas, Scikit-learn, Statistical Methods
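As an illustration of the statistical side of anomaly detection, the sketch below flags outliers with a modified z-score based on the median absolute deviation (MAD), which is less distorted by the anomalies themselves than a mean and standard deviation. The sensor readings, the function name, and the 3.5 threshold are assumptions for this example and not the project's actual method or data.

```python
import statistics


def mad_anomalies(values, threshold=3.5):
    # Return indices whose modified z-score exceeds the threshold.
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread, so the score is undefined
    # 0.6745 rescales MAD to be comparable with a standard deviation
    # under a normal distribution.
    return [
        i
        for i, v in enumerate(values)
        if 0.6745 * abs(v - median) / mad > threshold
    ]


# Hypothetical temperature readings with two injected faults.
readings = [70.1, 69.8, 70.4, 70.0, 69.9, 95.2, 70.2, 70.3, 69.7, 70.1, 44.0, 70.0]

anomalies = mad_anomalies(readings)
print(anomalies)  # → [5, 10]
```

A rule like this is usually a first pass. Multivariate methods such as Isolation Forest can then catch anomalies that only appear in combinations of features.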
🎯 Challenge: Retailers face volatile demand fluctuations as baseline forecasting methods fail to capture temporal patterns accurately.
Analysed historical sales data using time series decomposition, feature engineering, and ARIMA modelling to forecast future demand. Achieved a 15% improvement in forecast accuracy over baseline methods.
Technologies: Python, Statsmodels, Prophet, Pandas
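A sketch of how an improvement over a baseline can be quantified: hold out the most recent periods, forecast them with both a baseline and a candidate model, and compare the errors. This is an assumption-laden illustration. The sales figures are synthetic, mean absolute error is only one possible metric, and a seasonal-naive forecast stands in for the ARIMA model so that the example runs without external libraries.

```python
def mae(actual, predicted):
    # Mean absolute error between two equal-length sequences.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def naive_forecast(history, horizon):
    # Baseline: repeat the last observed value.
    return [history[-1]] * horizon


def seasonal_naive_forecast(history, horizon, season_length):
    # Stand-in candidate: repeat the value from the same season last cycle.
    return [history[-season_length + (h % season_length)] for h in range(horizon)]


def improvement_pct(baseline_error, model_error):
    # Relative error reduction of the model versus the baseline, in percent.
    return 100.0 * (baseline_error - model_error) / baseline_error


# Synthetic quarterly sales with a clear seasonal pattern.
sales = [100, 120, 90, 150, 104, 125, 93, 156, 108, 130, 97, 162]
train, test = sales[:8], sales[8:]

baseline_mae = mae(test, naive_forecast(train, len(test)))
model_mae = mae(test, seasonal_naive_forecast(train, len(test), season_length=4))
improvement = improvement_pct(baseline_mae, model_mae)

print(baseline_mae, model_mae, round(improvement, 1))
```

The size of the improvement depends heavily on which baseline and which error metric are chosen, so both are worth stating alongside any headline figure.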
🎯 Challenge: Simplistic models fail to handle high-dimensional data, requiring advanced architectures for intricate feature learning.
Designed and implemented a deep neural network architecture from scratch. Applied forward and backward propagation algorithms, optimised hyperparameters, and achieved state-of-the-art performance on classification tasks.
Technologies: Python, TensorFlow, Keras, NumPy, Matplotlib
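To make forward and backward propagation concrete, here is a deliberately tiny sketch: one hidden tanh layer, a sigmoid output, and binary cross-entropy loss, written in plain Python. The layer sizes, learning rate, and single training example are arbitrary assumptions, and the project's actual network was deeper. The sketch also compares one analytic gradient against a finite-difference estimate, a common sanity check for hand-written backpropagation.

```python
import math
import random


def forward(params, x):
    # Forward pass: hidden tanh activations and sigmoid output.
    W1, b1, W2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    z = sum(w * hi for w, hi in zip(W2, h)) + b2
    return h, 1.0 / (1.0 + math.exp(-z))


def loss(params, x, y):
    # Binary cross-entropy for a single example.
    _, y_hat = forward(params, x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def backward(params, x, y):
    # Backward pass: gradients of the loss for every parameter.
    _, _, W2, _ = params
    h, y_hat = forward(params, x)
    dz = y_hat - y  # gradient at the output pre-activation
    dW2 = [dz * hi for hi in h]
    dpre = [dz * w * (1 - hi ** 2) for w, hi in zip(W2, h)]  # through tanh
    dW1 = [[d * xi for xi in x] for d in dpre]
    return dW1, dpre, dW2, dz


def sgd_step(params, grads, lr):
    # One gradient descent update; returns new parameters.
    W1, b1, W2, b2 = params
    dW1, db1, dW2, db2 = grads
    W1 = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W1, dW1)]
    b1 = [b - lr * g for b, g in zip(b1, db1)]
    W2 = [w - lr * g for w, g in zip(W2, dW2)]
    return W1, b1, W2, b2 - lr * db2


rng = random.Random(0)
params = (
    [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)],  # W1: 3x2
    [0.0, 0.0, 0.0],                                             # b1
    [rng.uniform(-1, 1) for _ in range(3)],                      # W2
    0.0,                                                         # b2
)
x, y = [0.5, -1.0], 1.0

# Gradient check on the output bias using a central difference.
eps = 1e-5
W1, b1, W2, b2 = params
numeric_db2 = (loss((W1, b1, W2, b2 + eps), x, y)
               - loss((W1, b1, W2, b2 - eps), x, y)) / (2 * eps)
analytic_db2 = backward(params, x, y)[3]

loss_before = loss(params, x, y)
for _ in range(20):
    params = sgd_step(params, backward(params, x, y), lr=0.1)
loss_after = loss(params, x, y)
print(numeric_db2, analytic_db2, loss_before, loss_after)
```

Frameworks such as TensorFlow and Keras automate these gradient computations. Writing them by hand once mainly helps in understanding what those frameworks do.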
🎯 Challenge: Streaming platforms struggle to gauge audience reactions from vast review volumes without automated sentiment analysis.
🎯 Challenge: Practitioners struggle to understand complex architectures without interactive tools showing layer interactions and data flows.
🎯 Challenge: Novice learners find neural network training opaque without real-time demonstrations of weight updates and convergence.
🎯 Challenge: Tuning hyperparameters remains time-consuming without visual dashboards to track optimisation trajectories across parameter spaces.
🎯 Challenge: Beginners lack accessible tools to experiment with foundational neural network concepts and activation functions.
🎯 Challenge: Developing bespoke networks is hindered by fragmented libraries and steep learning curves for low-level implementations.
🎯 Challenge: Model tuning is slow and expensive guesswork without real-time visualisation of the search space.
🎯 Challenge: Developers overlook performance aspects beyond accuracy without integrated evaluation suites for precision and recall.
🎯 Challenge: Selecting optimal optimisers is challenging amid varying convergence speeds without comparative analyses across datasets.
🎯 Challenge: Historical datasets with imbalanced features and noise impede development of robust classifiers for risk assessment.
🎯 Challenge: Practitioners waste time using the wrong algorithm type because task boundaries between regression and classification are unclear.
🎯 Challenge: Retailers face volatile demand as baseline forecasts fail to capture seasonality and shocks.
🎯 Challenge: Businesses waste marketing budget because customer behaviour clusters remain hidden.
🎯 Challenge: Subtle engine failures slip past rule-based monitoring, risking safety and huge repair costs.
🎯 Challenge: Teams misinterpret data by confusing correlation with causation without rigorous testing.
🎯 Challenge: Universities lose talented students because at-risk cases cannot be spotted early.
🎯 Challenge: Companies drown in unstructured text and miss critical customer sentiment signals.
🎯 Challenge: Off-the-shelf models fail on specialised high-dimensional problems requiring bespoke architectures.
🎯 Challenge: Police cannot predict crime hotspots accurately, leading to inefficient patrols and public safety gaps.
🎯 Challenge: Law enforcement struggles with unpredictable crime patterns without reliable predictive models to anticipate hotspots and trends.
🎯 Challenge: Text classification suffers from inconsistent performance across recurrent architectures without systematic LSTM, GRU, and RNN comparisons.
🎯 Challenge: Interpretable models are underutilised in high-stakes decisions without explainability tools to demystify feature importance.
🎯 Challenge: Banks lose revenue from customer churn as siloed data hinders early identification of at-risk clients for retention efforts.
🎯 Challenge: Selecting optimal optimisers is challenging amid varying convergence speeds without comparative analyses.
🎯 Challenge: Even classic datasets hide subtle interactions that only optimised models can uncover.
🎯 Challenge: Understanding backpropagation deeply requires implementing it from scratch — most never do.
🎯 Challenge: Practitioners waste time using the wrong algorithm type because task boundaries are unclear.
🎯 Challenge: Automotive marketplaces face opaque pricing influenced by interdependent features without dimensionality reduction techniques.
🎯 Challenge: High-dimensional data is impossible to interpret without powerful reduction and visualisation.
🎯 Challenge: Car pricing appears random when dozens of correlated features hide the real drivers.
🎯 Challenge: Retailers face declining loyalty and escalating costs as traditional metrics fail to predict long-term engagement without integrated analytics.
🎯 Challenge: Insurers cannot price policies fairly without understanding hidden correlations between lifestyle and cost.
🎯 Challenge: Non-technical stakeholders cannot trust statistical claims without interactive p-value and power analysis tools.
Pandas • NumPy • Scikit-learn • TensorFlow • PyTorch • Jupyter • Git • Production-ready ML pipelines • Automated/scalable workflows
Supervised & unsupervised learning • XGBoost • Random Forests • SVM • Neural networks (custom architectures, forward/backward propagation, gradient descent) • Ensemble methods • Clustering (K-means, DBSCAN, HDBSCAN, hierarchical)
Hugging Face Transformers • FinBERT • FinLLaMA • BERT • BERTopic • VADER • GPT-2 • BART • Sentence Transformers • spaCy • NLTK • Text classification & sentiment analysis (92% accuracy on customer-review dataset)
ARIMA/SARIMA • Prophet • LSTM • Statsmodels • Decomposition techniques • Demand & financial forecasting (15% accuracy improvement vs baseline on book-sales project)
Isolation Forests • Autoencoders • Statistical methods • Real-time maritime/engine anomaly detection project
Hyperparameter tuning (Grid, Random, Bayesian) • ROC-AUC • Precision-Recall • Custom business metrics • SHAP interpretability • A/B testing
Feature creation/selection • PCA • t-SNE • UMAP • Autoencoders • High-dimensional data processing
Matplotlib • Seaborn • Plotly • Power BI • Interactive dashboards • Business intelligence reporting
Parametric & non-parametric tests • Correlation & causal inference • Model validation
Experiment tracking • Model versioning • Drift detection concepts • Automated retraining basics (academic & portfolio exposure)
Model versioning • Experiment tracking • Deployment pipelines • Drift detection • Automated retraining
RFM analysis • Cohort analysis • Behavioural segmentation • Retention optimisation • Targeted marketing insights
A selection of my data visualisation techniques
Interested in working together? Fill out the form below, and I'll get back to you promptly.
Based in London, UK
Amid escalating geopolitical tensions and economic volatility, regulators face mounting pressure to monitor in real time the systemic risks posed by Global Systemically Important Banks (G-SIBs). Manual analysis of the large volume of financial reports and transcripts remains inefficient and prone to oversight, which hinders proactive risk mitigation.
Built an end-to-end prototype analysing 81 public quarterly financial reports and earnings-call transcripts using advanced NLP (FinBERT, FinLLaMA, BERTopic, VADER) combined with ARIMA forecasting. Produced an interactive dashboard presenting regulatory-style risk insights, demonstrating feasibility of automated systemic risk monitoring for regulatory applications.
Global Systemically Important Banks (G-SIBs) publish quarterly financial results accompanied by analyst Q&A transcripts and webcasts. The Bank of England's Prudential Regulation Authority (PRA) supervises those operating in the UK to promote their safety and soundness, which in turn supports financial stability.
Core Problem: While quantitative metrics are readily incorporated by existing risk-assessment frameworks, qualitative insights embedded in earnings-call discussions remain under-utilised.
Analyse the quarterly earnings results of multiple Global Systemically Important Banks over 2023–2025, using advanced analytical techniques to identify key insights that traditional quantitative analysis may miss.
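One simple way qualitative signals from earnings calls can be turned into a comparable metric is to aggregate sentence-level sentiment labels into a negative-sentiment share per bank and quarter. The sketch below assumes the labels have already been produced by a classifier such as FinBERT. The records, bank names, and function name are hypothetical placeholders and not project data.

```python
from collections import defaultdict


def negative_share(records):
    # Map (bank, quarter) to the fraction of sentences labelled negative.
    counts = defaultdict(lambda: [0, 0])  # [negative count, total count]
    for bank, quarter, label in records:
        bucket = counts[(bank, quarter)]
        bucket[0] += label == "negative"
        bucket[1] += 1
    return {key: negative / total for key, (negative, total) in counts.items()}


# Hypothetical sentence-level labels: (bank, quarter, sentiment label).
records = [
    ("Bank A", "2023Q1", "negative"),
    ("Bank A", "2023Q1", "neutral"),
    ("Bank A", "2023Q1", "negative"),
    ("Bank A", "2023Q1", "positive"),
    ("Bank B", "2023Q1", "positive"),
    ("Bank B", "2023Q1", "neutral"),
    ("Bank B", "2023Q1", "negative"),
    ("Bank B", "2023Q1", "positive"),
]

shares = negative_share(records)
print(shares)
```

Tracking this share quarter by quarter gives a series that can be compared across banks and set alongside the structured financial metrics.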
Persistent elevated negative sentiment following major acquisition integration. Requires enhanced supervisory oversight and weekly sentiment monitoring.
Consistently lowest negative sentiment with 50% net income growth. Positioned as a stabilising G-SIB force with effective risk management.
Notable Q2 volatility spikes in both 2023 and 2024. Sharp Q1 2025 improvement requires investigation of underlying factors.