Financial Regulation / NLP

G-SIB Risk Assessment System

An end-to-end NLP pipeline that analyses quarterly financial reports and earnings call transcripts from Global Systemically Important Banks, extracting regulatory-style risk insights that would take human analysts weeks to compile manually.

Type: Cambridge Capstone (2025)
Domain: Financial Risk & Regulation
Data: 81 Public Reports (2023-2025)
Status: Completed

The Challenge

Financial regulators face a growing structural problem: the volume of unstructured data published by systemically important banks far exceeds what human analysts can process consistently. Quarterly earnings reports, investor call transcripts, risk disclosures, and supplementary filings contain critical signals about capital adequacy, liquidity risk, and emerging vulnerabilities.

Manual review of these documents is slow, inconsistent, and prone to oversight. Key risk indicators are often buried in dense financial language, and cross-bank comparisons require analysts to synthesise information across hundreds of pages per quarter.

Approach

01
Data Collection and Preprocessing
Collected 81 publicly available quarterly financial reports and earnings call transcripts from three G-SIBs spanning 2023 to 2025. Cleaned and structured the raw text data, handling financial formatting, tables, and multi-section document structures.
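To illustrate the cleaning step, here is a minimal sketch that assumes the reports have already been converted to plain text. The function names, the regular expressions, and the "short upper-case line means heading" heuristic are all illustrative assumptions, not the project's actual code.

```python
import re


def clean_report_text(raw: str) -> str:
    """Normalise whitespace and repair common text-extraction artefacts."""
    text = raw.replace("\u00a0", " ")             # non-breaking spaces
    # Re-join words split across lines ("re-\nsilient" -> "resilient").
    # Caveat: this also merges genuine compounds such as "risk-\nweighted".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces/tabs
    text = re.sub(r" ?\n ?", "\n", text)          # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line
    return text.strip()


def split_sections(text: str) -> dict[str, str]:
    """Split cleaned text on heading-like lines (assumed: short and upper-case)."""
    sections: dict[str, str] = {}
    current = "PREAMBLE"                          # text before the first heading
    buffer: list[str] = []
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped.isupper() and len(stripped) <= 60:
            sections[current] = "\n".join(buffer).strip()
            current, buffer = stripped, []
        else:
            buffer.append(line)
    sections[current] = "\n".join(buffer).strip()
    # Drop sections that ended up empty (for example an unused preamble).
    return {name: body for name, body in sections.items() if body}
```

Real filings would likely need bank-specific heading rules and separate handling for tables; this sketch only covers the whitespace and sectioning ideas.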
02
Multi-Model NLP Pipeline
Built a layered NLP architecture combining FinBERT for domain-specific sentiment scoring, VADER for complementary lexicon-based analysis, BERTopic for dynamic topic modelling to surface emerging risk themes, and GPT-2 for contextual summarisation.
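The sentiment layer could be sketched roughly as follows. The page does not say how the FinBERT and VADER scores were combined, so the weighted average, the 0.7 weight, the disagreement flag, and the `ProsusAI/finbert` checkpoint are all assumptions made for illustration. The heavy libraries are imported lazily so the blending logic can be exercised with stand-in scorers.

```python
from typing import Callable

Scorer = Callable[[str], float]  # maps a passage to a score in [-1, 1]


def make_finbert_scorer(model_name: str = "ProsusAI/finbert") -> Scorer:
    """Wrap a FinBERT checkpoint as P(positive) - P(negative). Checkpoint is assumed."""
    from transformers import pipeline  # lazy import: large dependency

    clf = pipeline("text-classification", model=model_name, top_k=None)

    def score(text: str) -> float:
        # With a list input, the pipeline returns one list of
        # {"label", "score"} dicts per input text.
        results = clf([text], truncation=True)[0]
        probs = {item["label"].lower(): item["score"] for item in results}
        return probs.get("positive", 0.0) - probs.get("negative", 0.0)

    return score


def make_vader_scorer() -> Scorer:
    """Wrap VADER's compound score, which already lies in [-1, 1]."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]


def blend_sentiment(
    text: str, finbert: Scorer, vader: Scorer, finbert_weight: float = 0.7
) -> dict:
    """Combine both scores; the weighting scheme is a hypothetical choice."""
    f, v = finbert(text), vader(text)
    return {
        "finbert": f,
        "vader": v,
        "blended": finbert_weight * f + (1.0 - finbert_weight) * v,
        "disagree": f * v < 0,  # the two models point in opposite directions
    }
```

Keeping both raw scores alongside the blend makes it easy to flag passages where the domain model and the lexicon disagree, which may be the passages most worth a human look.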
03
Financial Metrics Extraction
Implemented structured extraction of quantitative financial metrics from unstructured text, enabling direct comparison of capital ratios, loss provisions, and liquidity indicators across banks and quarters.
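One simple way to approach this kind of extraction is pattern matching on metric names followed by a percentage. The sketch below is an assumption about how it could work, not the project's method: the metric keys and regular expressions are hypothetical, and it only handles ratios expressed as percentages.

```python
import re

# Hypothetical patterns: a metric label, up to 40 non-digit filler
# characters, then a percentage value.
_NUMBER = r"(\d+(?:\.\d+)?)\s*%"
METRIC_PATTERNS = {
    "cet1_ratio": re.compile(r"CET1(?:\s+capital)?\s+ratio\D{0,40}?" + _NUMBER, re.I),
    "lcr": re.compile(r"liquidity\s+coverage\s+ratio\D{0,40}?" + _NUMBER, re.I),
    "leverage_ratio": re.compile(r"leverage\s+ratio\D{0,40}?" + _NUMBER, re.I),
}


def extract_metrics(text: str) -> dict[str, float]:
    """Return the first percentage found for each known metric label."""
    found: dict[str, float] = {}
    for name, pattern in METRIC_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[name] = float(match.group(1))
    return found


def metrics_table(documents: dict[tuple[str, str], str]) -> list[dict]:
    """Build one row per (bank, quarter) so values can be compared side by side."""
    rows = []
    for (bank, quarter), text in sorted(documents.items()):
        rows.append({"bank": bank, "quarter": quarter, **extract_metrics(text)})
    return rows
```

Because the filler between label and value may not contain digits, a sentence such as "CET1 ratio for Q3 was 14.2%" yields no match rather than a wrong value; a fuller implementation would need to handle such cases and currency amounts like loss provisions.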
04
ARIMA Forecasting
Applied ARIMA time-series modelling to the extracted metrics, generating forward-looking projections of key risk indicators to add a predictive dimension beyond retrospective analysis.
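To make the forecasting idea concrete, here is a deliberately small, dependency-free sketch of the simplest non-trivial case, ARIMA(1,1,0) with no constant term: difference the series once, fit an AR(1) coefficient on the differences, then integrate the forecast differences back to levels. The model order and the least-squares estimator are assumptions for illustration; the page does not state which orders or library the project used, and a library fit (for example statsmodels' `ARIMA` class) would normally use maximum likelihood instead.

```python
def fit_arima_110(series: list[float]) -> float:
    """Estimate phi in d_t = phi * d_(t-1) + e_t, where d is the first difference."""
    if len(series) < 3:
        raise ValueError("need at least three observations")
    diffs = [b - a for a, b in zip(series, series[1:])]   # d = 1: first differences
    # Least-squares slope through the origin of d_t on d_(t-1).
    numerator = sum(prev * cur for prev, cur in zip(diffs, diffs[1:]))
    denominator = sum(prev * prev for prev in diffs[:-1])
    return numerator / denominator if denominator else 0.0


def forecast_arima_110(series: list[float], steps: int) -> list[float]:
    """Project the series `steps` periods ahead."""
    phi = fit_arima_110(series)
    level = series[-1]
    diff = series[-1] - series[-2]
    forecast = []
    for _ in range(steps):
        diff = phi * diff      # AR(1) step on the differenced series
        level = level + diff   # integrate back to the original scale
        forecast.append(level)
    return forecast
```

With only a handful of quarterly observations per bank, any such projection carries wide uncertainty, so it is best read as a directional indicator rather than a point estimate.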
05
Interactive Dashboard
Produced a comprehensive interactive dashboard presenting regulatory-style risk insights, enabling exploration of sentiment trends, topic evolution, metric comparisons, and forecast trajectories.
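As a rough illustration of one dashboard panel, the sketch below shapes per-bank sentiment scores into a figure specification using Plotly's dictionary schema (`data` and `layout` keys). The record layout, the function name, and the quarter labels such as "2023Q1" are assumptions; the page does not describe how the dashboard was actually built.

```python
def sentiment_trend_figure(records: list[tuple[str, str, float]]) -> dict:
    """Build a Plotly-style figure dict from (bank, quarter, score) records."""
    by_bank: dict[str, list[tuple[str, float]]] = {}
    for bank, quarter, score in records:
        by_bank.setdefault(bank, []).append((quarter, score))

    traces = []
    for bank in sorted(by_bank):
        # Labels like "2023Q1" sort chronologically as plain strings.
        points = sorted(by_bank[bank])
        traces.append(
            {
                "type": "scatter",
                "mode": "lines+markers",
                "name": bank,
                "x": [quarter for quarter, _ in points],
                "y": [score for _, score in points],
            }
        )

    layout = {
        "title": {"text": "Sentiment by quarter"},
        "xaxis": {"title": {"text": "Quarter"}, "type": "category"},
        "yaxis": {"title": {"text": "Sentiment score"}, "range": [-1, 1]},
    }
    return {"data": traces, "layout": layout}
```

With Plotly installed, the returned dictionary can be passed to `plotly.graph_objects.Figure(...)` and displayed with `.show()`; building it as plain data keeps the shaping logic testable without the plotting library.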

Results

81 quarterly reports analysed across 3 G-SIBs
5 models integrated into a single pipeline (4 NLP, 1 time-series)
2 yrs of financial data spanning 2023-2025

Processed 81 quarterly reports across three global systemically important banks (UBS, Morgan Stanley, Barclays) spanning 2023-2025, extracting financial metrics, sentiment trajectories, and emerging risk themes at a speed and consistency that manual analyst review cannot replicate.

A five-model pipeline (FinBERT, VADER, BERTopic, and GPT-2 for language analysis, plus ARIMA for time-series forecasting) delivered structured intelligence from unstructured filings, turning earnings transcripts and regulatory disclosures into comparable, queryable data across banks and quarters. Topic modelling surfaced cross-bank themes that siloed reading would miss.

For regulatory teams managing growing volumes of public disclosures, this approach can replace weeks of manual extraction per reporting cycle with automated, auditable analysis that scales without proportional headcount increases.

Important note: This is a prototype built using only publicly available data. It demonstrates the methodology and capability, not a production deployment.

Technology Stack

Python, FinBERT, VADER, BERTopic, GPT-2, ARIMA, Sentence Transformers, BART, HDBSCAN, Pandas, Plotly