An end-to-end NLP pipeline that analyses quarterly financial reports and earnings call transcripts from Global Systemically Important Banks (G-SIBs), extracting regulatory-style risk insights that would take human analysts weeks to compile manually.
Type
Cambridge Capstone (2025)
Domain
Financial Risk & Regulation
Data
81 Public Reports (2023-2025)
Status
Completed
The Challenge
Financial regulators face a growing structural problem: the volume of unstructured data published by systemically important banks far exceeds what human analysts can process consistently. Quarterly earnings reports, investor call transcripts, risk disclosures, and supplementary filings contain critical signals about capital adequacy, liquidity risk, and emerging vulnerabilities.
Manual review of these documents is slow, inconsistent, and prone to omissions. Key risk indicators are often buried in dense financial language, and cross-bank comparisons require analysts to synthesise information across hundreds of pages per quarter.
Approach
01
Data Collection and Preprocessing
Collected 81 publicly available quarterly financial reports and earnings call transcripts from three G-SIBs spanning 2023 to 2025. Cleaned and structured the raw text data, handling financial formatting, tables, and multi-section document structures.
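The cleaning step might look something like the sketch below. The project's actual rules are not described here, so the function name `clean_report_text` and the three rules it applies (re-joining hyphenated line breaks, dropping page-number lines, collapsing whitespace) are assumptions chosen for illustration.

```python
import re


def clean_report_text(raw: str) -> str:
    """Normalise text extracted from a report (hypothetical cleaning rules)."""
    text = raw.replace("\r\n", "\n")
    # Re-join words hyphenated across a line break: "liquid-\nity" -> "liquidity".
    # Limitation: a genuine compound split at its hyphen ("risk-\nweighted")
    # would also lose the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that hold only a page number, e.g. "12" or "Page 12".
    lines = [
        line for line in text.split("\n")
        if not re.fullmatch(r"\s*(?:Page\s+)?\d+\s*", line)
    ]
    # Collapse runs of spaces and tabs inside each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in lines]
    # Reduce three or more consecutive newlines to a single paragraph break.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
```

Tables and multi-section layouts would need their own handling; this only covers running text.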
02
Multi-Model NLP Pipeline
Built a layered NLP architecture combining FinBERT for domain-specific sentiment scoring, VADER for complementary lexicon-based analysis, BERTopic for dynamic topic modelling to surface emerging risk themes, and GPT-2 for contextual summarisation.
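As a rough illustration of how the two sentiment layers could sit side by side, the sketch below maps a FinBERT label onto a signed score and blends it with VADER's compound score. How the project actually combined the two is not stated, so the 0.7 weighting, the helper names, and the `ProsusAI/finbert` checkpoint are all assumptions. The third-party imports (`transformers`, `vaderSentiment`) are kept inside `score_passage` so the pure helpers can be used without them.

```python
def signed_finbert_score(label: str, score: float) -> float:
    """Map a FinBERT-style (label, confidence) pair onto [-1, 1]."""
    sign = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}[label.lower()]
    return sign * score


def blend_sentiment(finbert: float, vader: float, finbert_weight: float = 0.7) -> float:
    """Weighted average of two scores in [-1, 1]; the weight is a hypothetical choice."""
    if not 0.0 <= finbert_weight <= 1.0:
        raise ValueError("finbert_weight must be in [0, 1]")
    return finbert_weight * finbert + (1.0 - finbert_weight) * vader


def score_passage(text: str, finbert_weight: float = 0.7) -> float:
    """Score one passage with both models (downloads the FinBERT checkpoint)."""
    # Local imports: these packages are only needed when models are actually run.
    from transformers import pipeline
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    # Assumed checkpoint; it emits "positive", "negative" or "neutral" labels.
    classifier = pipeline("text-classification", model="ProsusAI/finbert")
    result = classifier(text, truncation=True)[0]
    finbert = signed_finbert_score(result["label"], result["score"])
    # VADER's compound score already lies in [-1, 1].
    vader = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
    return blend_sentiment(finbert, vader, finbert_weight)
```

In a real run the classifier would be built once and reused across passages rather than re-created per call.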
03
Financial Metrics Extraction
Implemented structured extraction of quantitative financial metrics from unstructured text, enabling direct comparison of capital ratios, loss provisions, and liquidity indicators across banks and quarters.
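One simple way to pull such figures out of running text is pattern matching. The sketch below is a minimal example under that assumption: the metric names, the regular expressions, and the sample sentence are hypothetical, and the project's own extraction logic may differ.

```python
import re
from typing import Dict, List

# Hypothetical patterns: metric name -> regex capturing a percentage value.
_VALUE = r"\s+(?:of|was|at|stood at)\s+(\d+(?:\.\d+)?)\s*%"
METRIC_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "cet1_ratio": re.compile(r"CET1(?:\s+capital)?\s+ratio" + _VALUE, re.IGNORECASE),
    "lcr": re.compile(r"liquidity\s+coverage\s+ratio" + _VALUE, re.IGNORECASE),
}


def extract_metrics(text: str) -> Dict[str, List[float]]:
    """Return every percentage found for each known metric, in order of appearance."""
    return {
        name: [float(value) for value in pattern.findall(text)]
        for name, pattern in METRIC_PATTERNS.items()
    }


# Made-up sentence for illustration only, not taken from any filing.
sample = "Our CET1 ratio was 13.8% at quarter end, and the liquidity coverage ratio stood at 145%."
metrics = extract_metrics(sample)
```

Storing the results keyed by bank and quarter is what makes the cross-bank comparison possible; phrasing varies between banks, so real patterns would need to be broader than these two.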
04
ARIMA Forecasting
Applied ARIMA time-series modelling to the extracted metrics, generating forward-looking projections of key risk indicators to add a predictive dimension beyond retrospective analysis.
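To make the forecasting idea concrete, here is a hand-rolled ARIMA(1,1,0) without a constant term: the series is differenced once, an AR(1) coefficient is fitted to the differences by least squares, and the forecast differences are summed back onto the last observed level. The model order and the fitting method used in the project are not stated, so both are assumptions here; a library implementation would normally be preferred over this sketch.

```python
from typing import List, Sequence


def fit_ar1_on_differences(series: Sequence[float]) -> float:
    """Least-squares AR(1) coefficient on first differences (no constant)."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    if len(diffs) < 2:
        raise ValueError("need at least three observations")
    numerator = sum(prev * cur for prev, cur in zip(diffs, diffs[1:]))
    denominator = sum(prev * prev for prev in diffs[:-1])
    # A constant series gives a zero denominator; fall back to no autocorrelation.
    return numerator / denominator if denominator else 0.0


def forecast(series: Sequence[float], steps: int) -> List[float]:
    """Project the series forward by integrating forecast differences."""
    phi = fit_ar1_on_differences(series)
    level = series[-1]
    diff = series[-1] - series[-2]
    projected = []
    for _ in range(steps):
        diff = phi * diff      # AR(1) step on the differenced series
        level += diff          # undo the differencing
        projected.append(level)
    return projected


# Illustrative numbers only, not real bank data.
ratio_history = [10.0, 12.0, 13.0, 13.5, 13.75]
next_two_quarters = forecast(ratio_history, steps=2)
```

With only a handful of quarterly observations per metric, any such projection carries wide uncertainty, which a fuller implementation would report as confidence intervals.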
05
Interactive Dashboard
Produced a comprehensive interactive dashboard presenting regulatory-style risk insights, enabling exploration of sentiment trends, topic evolution, metric comparisons, and forecast trajectories.
Results
81
Quarterly reports and earnings call transcripts analysed across 3 G-SIBs
5
Models integrated into a single pipeline (FinBERT, VADER, BERTopic, GPT-2, and ARIMA)
2 yrs
Of financial data spanning 2023-2025
The system demonstrated that automated NLP analysis can surface risk signals from public filings at a speed and consistency that manual review cannot match. Topic modelling revealed emerging themes across banks that would be difficult to identify through siloed reading.
Important note: This is a prototype built using only publicly available data. It has no affiliation with, and was not delivered to, the Bank of England or any regulatory body.