Financial Regulation / NLP

G-SIB Risk Assessment System

An end-to-end NLP pipeline that analyses quarterly financial reports and earnings call transcripts from Global Systemically Important Banks, extracting regulatory-style risk insights that would take human analysts weeks to compile manually.

Type

Cambridge Capstone (2025)

Domain

Financial Risk & Regulation

Data

81 Public Reports (2023-2025)

Status

Completed

The Challenge

Financial regulators face a growing structural problem: the volume of unstructured data published by systemically important banks far exceeds what human analysts can process consistently. Quarterly earnings reports, investor call transcripts, risk disclosures, and supplementary filings contain critical signals about capital adequacy, liquidity risk, and emerging vulnerabilities.

Manual review of these documents is slow, inconsistent, and prone to oversight. Key risk indicators are often buried in dense financial language, and cross-bank comparisons require analysts to synthesise information across hundreds of pages per quarter.

Approach

Data Collection and Preprocessing

Collected 81 publicly available quarterly financial reports and earnings call transcripts from three G-SIBs spanning 2023 to 2025. Cleaned and structured the raw text data, handling financial formatting, tables, and multi-section document structures.
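
A minimal sketch of what one text-cleaning step of this kind might look like, using only the Python standard library. The function name `clean_report_text` and the specific rules (rejoining hyphenated line breaks, dropping "Page N of M" lines, collapsing whitespace) are illustrative assumptions, not the project's actual preprocessing code.

```python
import re


def clean_report_text(raw: str) -> str:
    """Normalise raw text extracted from a report or transcript (hypothetical helper)."""
    text = raw.replace("\r\n", "\n")
    # Rejoin words split across a line break: "provi-\nsions" -> "provisions".
    # Caveat: this also fuses genuine compounds such as "risk-\nweighted".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines holding only a page marker such as "Page 3" or "Page 3 of 40".
    text = re.sub(
        r"(?im)^[ \t]*page[ \t]+\d+(?:[ \t]+of[ \t]+\d+)?[ \t]*$\n?", "", text
    )
    # Collapse runs of spaces and tabs, then reduce repeated blank lines to one.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()


# Made-up sample input, for illustration only.
sample = (
    "Capital  ratios\r\nremained strong; loss provi-\nsions rose.\n"
    "Page 3 of 40\n\n\n\nLiquidity\tcoverage"
)
cleaned = clean_report_text(sample)
print(cleaned)
```

Tables and multi-section layouts would need further handling, which this sketch does not attempt.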

Multi-Model NLP Pipeline

Built a layered NLP architecture combining FinBERT for domain-specific sentiment scoring, VADER for complementary lexicon-based analysis, BERTopic for dynamic topic modelling to surface emerging risk themes, and GPT-2 for contextual summarisation.
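
The layering described above could be wired together roughly as follows. This is a sketch under stated assumptions: `DocumentInsight`, `analyse_documents`, and `sentiment_gap` are hypothetical names, the FinBERT score is assumed to be mapped onto a -1 to +1 scale, and the real models are replaced by stand-in callables so the example runs without downloading anything. The one structural point it reflects is that topic modelling works on the whole corpus at once, while sentiment scoring and summarisation work per document.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class DocumentInsight:
    """Combined per-document output of the four NLP layers (hypothetical schema)."""
    doc_id: str
    finbert_score: float   # assumed scale: -1 (negative) to +1 (positive)
    vader_compound: float  # VADER's compound score lies in [-1, 1]
    topic_id: int
    summary: str

    @property
    def sentiment_gap(self) -> float:
        # A large gap marks documents where the two sentiment layers disagree.
        return abs(self.finbert_score - self.vader_compound)


def analyse_documents(
    docs: Dict[str, str],
    finbert: Callable[[str], float],
    vader: Callable[[str], float],
    topic_model: Callable[[Sequence[str]], List[int]],
    summariser: Callable[[str], str],
) -> List[DocumentInsight]:
    doc_ids = list(docs)
    texts = [docs[doc_id] for doc_id in doc_ids]
    # Topics are assigned over the full corpus, one topic id per document.
    topics = topic_model(texts)
    if len(topics) != len(texts):
        raise ValueError("topic_model must return one topic id per document")
    return [
        DocumentInsight(doc_id, finbert(text), vader(text), topic, summariser(text))
        for doc_id, text, topic in zip(doc_ids, texts, topics)
    ]


# Stand-in callables and made-up text, for illustration only.
demo = analyse_documents(
    {"bank_a_q1": "Provisions rose.", "bank_b_q1": "Capital remained strong."},
    finbert=lambda text: -0.6 if "rose" in text else 0.7,
    vader=lambda text: 0.0,
    topic_model=lambda texts: list(range(len(texts))),
    summariser=lambda text: text[:40],
)
```

Passing the models in as callables keeps each layer swappable and lets the orchestration be tested without the models themselves.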

Financial Metrics Extraction

Implemented structured extraction of quantitative financial metrics from unstructured text, enabling direct comparison of capital ratios, loss provisions, and liquidity indicators across banks and quarters.
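
One common way to pull figures like these out of prose is pattern matching. The sketch below assumes two hypothetical metrics (a CET1 ratio and a liquidity coverage ratio) and a few typical phrasings. The pattern set, the metric names, and the sample sentence are all illustrative, and the project's own extraction logic may differ.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns: each captures a percentage that follows a metric name.
_VALUE = r"\s+(?:of|was|at|stood at)\s+(\d+(?:\.\d+)?)\s*%"
METRIC_PATTERNS = {
    "cet1_ratio_pct": re.compile(r"CET1(?:\s+capital)?\s+ratio" + _VALUE, re.I),
    "lcr_pct": re.compile(r"liquidity\s+coverage\s+ratio" + _VALUE, re.I),
}


def extract_metrics(text: str) -> Dict[str, Optional[float]]:
    """Return the first match per metric, or None when a metric is not mentioned."""
    results: Dict[str, Optional[float]] = {}
    for name, pattern in METRIC_PATTERNS.items():
        match = pattern.search(text)
        results[name] = float(match.group(1)) if match else None
    return results


# Made-up sentence, for illustration only.
sample_sentence = (
    "Our CET1 capital ratio was 14.2% at quarter end, "
    "and the liquidity coverage ratio stood at 158 %."
)
metrics = extract_metrics(sample_sentence)
```

Returning `None` for a missing metric, rather than omitting the key, keeps every document's record the same shape, which simplifies comparison across banks and quarters.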

ARIMA Forecasting

Applied ARIMA time-series modelling to the extracted metrics, generating forward-looking projections of key risk indicators to add a predictive dimension beyond retrospective analysis.
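
To make the forecasting step concrete, here is a hand-rolled sketch of the simplest useful case, ARIMA(1,1,0) with no constant, fitted by conditional least squares. It is a teaching example only: the model order, the function names, and the sample series are assumptions, and a real pipeline would more likely rely on an established time-series library than on code like this.

```python
from typing import List, Sequence


def fit_arima_110(series: Sequence[float]) -> float:
    """Estimate phi in d_t = phi * d_(t-1) + e_t, where d is the first difference."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    if len(diffs) < 2:
        raise ValueError("need at least three observations")
    # Least-squares slope through the origin of d_t regressed on d_(t-1).
    numerator = sum(cur * prev for prev, cur in zip(diffs, diffs[1:]))
    denominator = sum(prev * prev for prev in diffs[:-1])
    if denominator == 0:
        raise ValueError("cannot estimate phi: lagged differences are all zero")
    return numerator / denominator


def forecast_arima_110(series: Sequence[float], steps: int) -> List[float]:
    """Project the series forward by iterating the fitted difference equation."""
    phi = fit_arima_110(series)
    level = series[-1]
    diff = series[-1] - series[-2]
    forecasts = []
    for _ in range(steps):
        diff = phi * diff   # next expected change
        level += diff       # integrate the change back into the level
        forecasts.append(level)
    return forecasts


# Made-up quarterly values, for illustration only.
history = [100.0, 102.0, 103.0, 103.5, 103.75]
projection = forecast_arima_110(history, steps=2)
```

With only a handful of quarterly observations per metric, any such projection carries wide uncertainty and is best read as a direction of travel rather than a point estimate.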

Interactive Dashboard

Produced a comprehensive interactive dashboard presenting regulatory-style risk insights, enabling exploration of sentiment trends, topic evolution, metric comparisons, and forecast trajectories.

Results

81 quarterly reports analysed across 3 G-SIBs

5 models integrated into a single pipeline

2 years of financial data spanning 2023-2025

Processed 81 quarterly reports across three global systemically important banks (UBS, Morgan Stanley, Barclays) spanning 2023-2025, extracting financial metrics, sentiment trajectories, and emerging risk themes at a speed and consistency that manual analyst review cannot replicate.

A five-model pipeline delivered structured intelligence from unstructured filings, turning earnings transcripts and regulatory disclosures into comparable, queryable data across banks and quarters. Four NLP models (FinBERT, VADER, BERTopic, GPT-2) handled the text, and ARIMA handled the forecasting. Topic modelling surfaced cross-bank themes that siloed reading would miss.

For regulatory teams managing growing volumes of public disclosures, this approach could replace weeks of manual extraction per reporting cycle with automated, auditable analysis that scales without a proportional increase in headcount.

Important note: This is a prototype built using only publicly available data. It demonstrates the methodology and capability, not a production deployment.

Technology Stack

Python, FinBERT, VADER, BERTopic, GPT-2, ARIMA, Sentence Transformers, BART, HDBSCAN, Pandas, Plotly
