Private Document Analysis AI
A privacy-first document intelligence system that answers questions about uploaded documents and spreadsheets using a three-tier query architecture. Deterministic logic handles what it can, a schema engine handles what it should, and a local LLM with RAG handles the rest. No data ever leaves the server.
The Challenge
Organisations handling sensitive documents - legal transcripts, membership records, internal reports - need to query and analyse that data without sending it to third-party AI services. Off-the-shelf tools like ChatGPT require uploading content to external servers, which is unacceptable when confidentiality is non-negotiable.
The system needed to run entirely on private infrastructure, accept a wide range of file formats (spreadsheets, PDFs, Word documents, images), answer questions with verifiable accuracy where possible, and only invoke AI when simpler methods genuinely could not answer the question.
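The multi-format ingestion requirement can be sketched as a simple dispatch table keyed on file extension. The extractor functions below are hypothetical stand-ins for real parsers (a spreadsheet reader, a PDF text extractor, a Word parser, an OCR engine); the names and behaviour are illustrative, not the system's actual API.

```python
# Sketch of format dispatch for ingestion. Each extractor is a
# hypothetical placeholder for a real parsing library.
from pathlib import Path

def extract_spreadsheet(path: Path) -> str:
    return f"tabular:{path.name}"   # would parse rows/columns

def extract_pdf(path: Path) -> str:
    return f"text:{path.name}"      # would extract page text

def extract_docx(path: Path) -> str:
    return f"text:{path.name}"      # would extract paragraphs

def extract_image(path: Path) -> str:
    return f"ocr:{path.name}"       # would run OCR

EXTRACTORS = {
    ".xlsx": extract_spreadsheet, ".csv": extract_spreadsheet,
    ".pdf": extract_pdf,
    ".docx": extract_docx,
    ".png": extract_image, ".jpg": extract_image,
}

def ingest(path: str) -> str:
    """Route a file to the extractor for its format, or fail loudly."""
    ext = Path(path).suffix.lower()
    extractor = EXTRACTORS.get(ext)
    if extractor is None:
        raise ValueError(f"Unsupported format: {ext}")
    return extractor(Path(path))
```

Keeping the mapping in one table makes it obvious at a glance which formats are supported and where to add a new one.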
Approach
A three-tier query pipeline routes each question to the simplest method that can answer it. Tier 1 applies deterministic logic for direct lookups and computations, Tier 2 uses a schema engine for structured queries over tabular data, and Tier 3 falls back to a local LLM with retrieval-augmented generation for open-ended reasoning. Every stage runs on private infrastructure, so no document content leaves the server.
Results
The three-tier architecture means the majority of data questions are answered instantly with guaranteed accuracy, reserving AI processing for questions that genuinely require reasoning or synthesis. Tiers 1 and 2 were validated against 96 test queries and are hallucination-free by design - they use deterministic logic, not probabilistic generation.
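The tier cascade described above can be sketched as a chain of handlers, each of which either answers or declines. The handler names, the toy table format, and the example queries below are all illustrative assumptions, not the project's real interfaces; the point is that Tiers 1 and 2 only ever return values computed from the data, and the LLM is reached only when both decline.

```python
# Sketch of three-tier query routing. Handlers and query patterns are
# hypothetical; each deterministic tier returns None when it cannot
# answer, and only then does the question fall through to the LLM.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Answer:
    text: str
    tier: int        # which tier produced the answer
    verified: bool   # True for deterministic tiers 1-2

def tier1_deterministic(question: str, table: dict) -> Optional[Answer]:
    """Tier 1: exact lookups computed directly from the data."""
    if question.lower().startswith("how many rows"):
        n = len(next(iter(table.values()), []))
        return Answer(text=str(n), tier=1, verified=True)
    return None  # cannot answer deterministically

def tier2_schema(question: str, table: dict) -> Optional[Answer]:
    """Tier 2: schema-aware queries, e.g. 'sum of <column>'."""
    q = question.lower()
    for column, values in table.items():
        if q == f"sum of {column.lower()}":
            return Answer(text=str(sum(values)), tier=2, verified=True)
    return None

def tier3_llm(question: str, table: dict) -> Answer:
    """Tier 3: local LLM with RAG (stubbed here)."""
    return Answer(text="[LLM-generated answer]", tier=3, verified=False)

def route(question: str, table: dict) -> Answer:
    for handler in (tier1_deterministic, tier2_schema):
        answer = handler(question, table)
        if answer is not None:
            return answer
    return tier3_llm(question, table)
```

Because deterministic tiers short-circuit the cascade, a question like "sum of revenue" never touches the model, while an open-ended question falls through to Tier 3.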
The split report architecture solves a problem that plagues most AI document tools: when numbers and narrative are generated by the same model, the AI can invent statistics. Here, quantitative outputs (charts, counts, tables) are produced by Python code with guaranteed accuracy, while qualitative analysis (themes, quotes, insights) is generated by the AI from document passages. The two streams are kept strictly separate and stitched together only at presentation.
For the client, this delivered something commercially unavailable: the analytical capability of a modern AI assistant with the data sovereignty of a fully private, on-premises system. Sensitive documents could be queried, cross-referenced, and analysed without any content leaving their infrastructure.