Accountants have always been document processors. The core job involves reading financial documents, extracting the relevant numbers, categorising them correctly, and producing accurate reports. Natural Language Processing is automating each of these steps, but the gap between reading text and understanding its financial meaning is where the real transformation is happening.
From OCR to NLP: Reading vs Understanding
Optical Character Recognition (OCR) has existed for decades. It converts images of text into machine-readable characters. Modern OCR engines from Google (Vision AI), Amazon (Textract), and Microsoft (Azure AI Document Intelligence) achieve character-level accuracy rates above 99% on clean, printed documents.
But OCR only reads. It does not understand. An OCR engine scanning an invoice will output a stream of text: "Supplier: Mediterranean Office Supplies Ltd. Invoice No: INV-8847. Date: 15/03/2026. Item: A4 Paper 5 Reams. Qty: 10. Unit Price: EUR 4.50. VAT 18%: EUR 8.10. Total: EUR 53.10."
That text is accurate but unstructured. The OCR engine does not know that "EUR 53.10" is the total amount, that "18%" is the VAT rate, or that "Mediterranean Office Supplies Ltd" is the supplier name. It just sees characters on a page.
NLP bridges this gap. It takes the OCR output and applies linguistic and contextual understanding to extract meaning. It identifies that the number following "Total:" is the invoice total. It recognises that the percentage near "VAT" is the tax rate. It understands that the text at the top of the document near "Supplier:" is the vendor name.
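In its simplest form, this extraction step can be rule-based. Here is a minimal sketch using the OCR output above, assuming the field labels ("Supplier:", "Total:", and so on) appear as shown; production systems replace hand-written patterns like these with learned models:

```python
import re

OCR_TEXT = (
    "Supplier: Mediterranean Office Supplies Ltd. Invoice No: INV-8847. "
    "Date: 15/03/2026. Item: A4 Paper 5 Reams. Qty: 10. "
    "Unit Price: EUR 4.50. VAT 18%: EUR 8.10. Total: EUR 53.10."
)

# Label-anchored patterns: each field is located by the text that precedes it.
PATTERNS = {
    "supplier": r"Supplier:\s*(.+?)\.\s+Invoice",
    "invoice_no": r"Invoice No:\s*([A-Z0-9-]+)",
    "date": r"Date:\s*(\d{2}/\d{2}/\d{4})",
    "vat_rate": r"VAT\s+(\d+(?:\.\d+)?)%",
    "total": r"Total:\s*EUR\s*([\d,]+\.\d{2})",
}

def extract_fields(text: str) -> dict:
    """Apply each pattern and collect the first match, if any."""
    out = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            out[field] = match.group(1)
    return out

fields = extract_fields(OCR_TEXT)
```

The fragility of this approach is exactly why learned models took over: a supplier who writes "Amount Due" instead of "Total" breaks the pattern entirely.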
Named Entity Recognition for Financial Documents
Named Entity Recognition (NER) is the NLP technique most directly relevant to accounting document processing. Standard NER models identify entities like people, organisations, locations, and dates. Financial NER extends this to accounting-specific entities:
- Monetary amounts — distinguishing the subtotal, tax amount, total, and individual line item prices
- Tax identifiers — VAT registration numbers, tax identification numbers, and company registration numbers
- Dates — invoice date, due date, delivery date, and payment terms dates
- Supplier and buyer details — company names, addresses, and contact information
- Document references — invoice numbers, purchase order numbers, and credit note references
Training financial NER models requires annotated datasets where humans have marked each entity in thousands of documents. The Stanford NER system and spaCy's entity recognition pipeline provide foundations, but financial document processing demands custom models trained on domain-specific data.
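The annotated data such models learn from typically marks each entity as a character span over the raw text. A minimal sketch of that span-based annotation format (the labels TOTAL_AMOUNT and SUPPLIER, and the example sentence, are illustrative, not a standard label set):

```python
# One annotated training example: the raw text plus (start, end, label) spans.
example = {
    "text": "Total: EUR 53.10 due to Mediterranean Office Supplies Ltd",
    "entities": [
        (7, 16, "TOTAL_AMOUNT"),
        (24, 57, "SUPPLIER"),
    ],
}

def resolve_spans(example: dict) -> list:
    """Return the surface string for each span so annotators can verify offsets."""
    text = example["text"]
    return [(text[start:end], label) for start, end, label in example["entities"]]

spans = resolve_spans(example)
```

Offset errors are a common source of silently broken training data, which is why annotation pipelines usually round-trip spans back to their surface strings like this before training.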
Research published at EMNLP and ACL conferences has shown that transformer-based NER models, particularly those that incorporate layout information alongside text, achieve F1 scores above 95% on structured financial documents. Microsoft's LayoutLM family of models, now in its third iteration, was specifically designed for document understanding tasks that require both textual and spatial features.
The Format Chaos
If every invoice followed the same template, document processing would be a solved problem. They do not. A single accounting firm might receive documents in dozens of formats:
- PDF invoices — from formal corporate suppliers with consistent layouts
- Email invoices — sometimes as attachments, sometimes as inline text in the email body
- Scanned paper documents — varying quality, sometimes handwritten, occasionally coffee-stained
- Phone photographs — taken at angles, with shadows, under varying lighting conditions
- WhatsApp messages — increasingly common for informal business transactions, particularly among sole traders and small businesses
- Web-based invoices — HTML emails from services like Stripe, PayPal, and subscription platforms
Each format presents different challenges. PDF invoices are relatively clean but may use embedded fonts that OCR engines struggle with. Phone photographs introduce perspective distortion, uneven lighting, and motion blur. WhatsApp messages mix conversation with financial data in an unstructured stream.
The NLP system must handle all of these. Pre-processing pipelines correct for image quality issues before OCR. Document classification models identify what type of document has been received. Specialised extraction models then apply the appropriate processing strategy for each document type.
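The document classification step can be illustrated with a deliberately simple stand-in. This sketch uses keyword cues (the cue lists are assumptions for illustration; real systems use trained classifiers over both text and layout features):

```python
# Illustrative keyword heuristics standing in for a trained document classifier.
DOCUMENT_CUES = {
    "invoice": ["invoice no", "vat", "total due"],
    "receipt": ["cash", "change", "thank you"],
    "credit_note": ["credit note", "refund"],
}

def classify(text: str) -> str:
    """Pick the document type whose cue words appear most often."""
    lowered = text.lower()
    scores = {
        doc_type: sum(cue in lowered for cue in cues)
        for doc_type, cues in DOCUMENT_CUES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

The "unknown" fallback matters: routing an unrecognised document to human review is cheaper than misclassifying it and running the wrong extraction model.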
Multilingual Financial Documents
A Maltese business operates in an inherently multilingual environment. English and Maltese are both official languages. Italian is widely understood. EU regulations and cross-border transactions introduce French, German, and other European languages.
Multilingual NLP for financial documents faces specific challenges beyond general translation:
- Financial terminology varies across languages and jurisdictions. The Maltese "taxxa fuq il-valur miżjud" and the English "value added tax" refer to the same concept but appear differently in documents.
- Number formatting differs: 1,234.56 in English becomes 1.234,56 in most European languages. A model must understand these conventions to extract amounts correctly.
- Date formats vary: DD/MM/YYYY in Europe, MM/DD/YYYY in the US, and various textual representations across languages.
- Address formats differ by country, affecting supplier identification and jurisdiction determination.
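Normalising amounts across these numbering conventions is a small but essential step. A sketch, assuming the heuristic that whichever separator appears last in the string is the decimal mark:

```python
from decimal import Decimal

def parse_amount(raw: str) -> Decimal:
    """Normalise '1,234.56' and '1.234,56' to the same Decimal.

    Heuristic: whichever of '.' or ',' appears last is the decimal mark;
    the other is a thousands separator. Genuinely ambiguous inputs like
    '1.234' would need document-level context (language, other amounts).
    """
    raw = raw.strip()
    last_dot, last_comma = raw.rfind("."), raw.rfind(",")
    if last_comma > last_dot:          # European style: 1.234,56
        raw = raw.replace(".", "").replace(",", ".")
    else:                              # English style: 1,234.56
        raw = raw.replace(",", "")
    return Decimal(raw)
```

Using Decimal rather than float is deliberate: binary floating point cannot represent most monetary values exactly, and accounting systems cannot tolerate rounding drift.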
Multilingual transformer models like XLM-RoBERTa and mBERT provide cross-lingual understanding, but fine-tuning on financial documents in each target language significantly improves performance. The practical challenge is assembling sufficient annotated training data in languages with smaller digital footprints.
The Accuracy Challenge
No AI system processes financial documents with 100% accuracy. The critical question is where errors occur and how they are handled.
Common extraction errors include:
- Character confusion — the digit "1" misread as lowercase "l", or the letter "O" confused with the number "0". These errors are particularly common on thermal paper receipts, where print quality degrades over time.
- Amount ambiguity — when a document contains multiple monetary amounts (subtotal, tax, total, discounts), the model must correctly assign each amount to its role. Errors here can cascade through the accounting process.
- Date parsing — "05/06/2026" could be May 6th or June 5th depending on convention. Without additional context, the model must infer the correct interpretation from the document's language, jurisdiction, and other dates present.
- Supplier identification — matching an abbreviated or misspelled supplier name to the correct entity in the accounting system requires fuzzy matching and contextual understanding.
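Fuzzy supplier matching can be sketched with the standard library's difflib (the supplier list and the 0.6 cutoff are illustrative; production systems add contextual signals such as VAT numbers and addresses):

```python
import difflib

# Supplier names already present in the accounting system (illustrative).
KNOWN_SUPPLIERS = [
    "Mediterranean Office Supplies Ltd",
    "Malta Freight Services Ltd",
    "Valletta IT Consulting",
]

def match_supplier(extracted, cutoff=0.6):
    """Return the closest known supplier, or None if nothing is close enough."""
    matches = difflib.get_close_matches(
        extracted, KNOWN_SUPPLIERS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None
```

A misspelled "Mediteranean Office Supplies" on a scanned invoice would still resolve to the correct ledger entity, while a name with no plausible match returns None and falls through to human review.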
Confidence Scores and Human Review
Well-designed systems address extraction uncertainty through confidence scoring. Each extracted field is assigned a confidence score reflecting the model's certainty. A clearly printed "Total: EUR 1,500.00" on a high-quality PDF might receive a 99% confidence score. A partially obscured amount on a crumpled receipt might receive 72%.
The system then applies thresholds: extractions above an upper threshold (typically 95%) are accepted automatically, while those below are flagged for human review. This creates a tiered workflow:
- High confidence (95%+): Automatically processed, reducing human workload
- Medium confidence (80-95%): Presented to a human reviewer with the AI's best guess pre-populated, requiring only confirmation or correction
- Low confidence (below 80%): Flagged for manual processing with the AI's output available as a starting point
This approach optimises the allocation of human attention. Rather than reviewing every document, humans focus on the cases where the AI is genuinely uncertain. The corrections humans make feed back into the training data, progressively improving the model's accuracy on similar documents.
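The tiered routing described above can be sketched in a few lines (the thresholds come from the tiers in the text; the field-by-field structure is an assumption about how the system is organised):

```python
def route(field_name: str, value: str, confidence: float) -> str:
    """Route an extracted field to a processing tier by confidence score."""
    if confidence >= 0.95:
        return "auto_accept"       # processed without human review
    if confidence >= 0.80:
        return "review_prefilled"  # human confirms or corrects the AI's guess
    return "manual"                # human processes; AI output as starting point
```

In practice the routing decision is often made per document rather than per field: one low-confidence field is usually enough to pull the whole document into the review queue.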
Real-World Accuracy Rates
Published benchmarks and industry reports provide a realistic picture of current capabilities:
- Structured invoices (standard PDF formats from large suppliers): 97-99% field-level accuracy
- Semi-structured documents (varied layouts but typed text): 93-97% accuracy
- Unstructured documents (handwritten notes, informal receipts): 85-93% accuracy
- Poor quality images (faded thermal receipts, blurry photos): 75-90% accuracy
These figures represent field-level accuracy, meaning the percentage of individual fields (date, amount, supplier name) correctly extracted. Document-level accuracy, where every field must be correct for the document to count as successfully processed, is naturally lower.
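The gap between the two metrics is easy to quantify: if field errors were independent, document-level accuracy would be the field-level accuracy raised to the power of the number of fields. A sketch under that simplifying assumption (real errors correlate, e.g. a blurry scan degrades every field at once):

```python
def document_accuracy(field_accuracy: float, num_fields: int) -> float:
    """Probability that every field is correct, assuming independent errors."""
    return field_accuracy ** num_fields

# A ten-field invoice at 97% per-field accuracy: only about 74% of
# documents come through with every field correct.
p = document_accuracy(0.97, 10)
```

This is why a headline field-level figure of 97% can coexist with a review queue containing a quarter of all documents.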
The trajectory is consistently improving. Google's Document AI accuracy improved by 15-20% between 2021 and 2024 on benchmark datasets. Amazon Textract's invoice processing feature, launched in 2022, reduced error rates by approximately 30% compared to its general-purpose document analysis. Each major model version brings measurable improvements, particularly on the difficult edge cases that drive accuracy from 95% toward 99%.
For the accounting profession, these accuracy rates are already sufficient to transform the workflow. The AI handles the bulk processing. Humans handle the exceptions. The result is faster, cheaper, and often more accurate than fully manual processing, where human data entry error rates of 1-4% are well documented.
Michael Cutajar, CPA — Founder of Accora.