There is a common misconception that you can point ChatGPT at a pile of invoices and get accurate bookkeeping. You cannot. The gap between a general-purpose large language model and a system that reliably processes financial documents is enormous, and understanding why requires looking at how these models are actually trained.
General LLMs vs Domain-Specific Financial Models
Large language models like GPT-4, Claude, and Gemini are trained on broad internet text: Wikipedia, books, code repositories, news articles, and web pages. They develop impressive general reasoning, but their knowledge of accounting is shallow. They know what a VAT return is in the same way a well-read generalist might. They do not know the specific rules that govern how reverse charge VAT applies to a cross-border service supplied by a Maltese self-employed professional to a German company.
Domain-specific financial models, by contrast, are trained or fine-tuned on financial documents. Companies like Intuit, Xero, and a wave of fintech startups have spent years building proprietary datasets of labelled invoices, bank transactions, and tax filings. The difference in output quality is not incremental. It is categorical.
What Training Data Looks Like
Training a financial AI system requires structured and labelled examples. The raw materials include:
- Invoices and receipts — millions of them, across different formats, languages, and layouts. Each document must be annotated with the correct supplier name, date, line items, amounts, currency, and tax rate.
- Bank statements — transaction descriptions paired with their correct accounting categories. A payment to "AMZN*2847XQ" needs to be labelled as an office supplies purchase, not an Amazon Prime subscription.
- Tax forms — completed returns mapped to the underlying transactions that generated them. This teaches the model the relationship between source documents and regulatory outputs.
- Chart of accounts mappings — how different businesses categorise the same type of expense differently depending on their industry and accounting framework.
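What one labelled training example might look like in practice can be sketched as a simple record pairing raw OCR text with human-verified fields. The schema below is illustrative; real pipelines use richer annotations (bounding boxes, line-item links, confidence flags), and every field name here is an assumption.

```python
from dataclasses import dataclass

@dataclass
class LineItem:
    description: str
    quantity: float
    unit_price: float  # in the invoice currency

@dataclass
class LabelledInvoice:
    """One supervised training example: raw OCR text plus human-verified labels."""
    raw_text: str       # OCR output of the scanned document
    supplier_name: str
    invoice_date: str   # ISO 8601
    currency: str       # ISO 4217 code
    tax_rate: float     # e.g. 0.18 for Malta's standard VAT rate
    line_items: list
    total: float        # gross, VAT-inclusive

example = LabelledInvoice(
    raw_text="ACME Stationery Ltd ... Total EUR 47.50",
    supplier_name="ACME Stationery Ltd",
    invoice_date="2024-03-15",
    currency="EUR",
    tax_rate=0.18,
    line_items=[LineItem("A4 paper, 5 reams", 5, 8.05)],
    total=47.50,
)
```

Millions of records in this shape, spanning layouts and languages, are what the extraction model actually learns from.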
The scale matters. Bloomberg's BloombergGPT, announced in 2023, was trained on 363 billion tokens of financial data alongside 345 billion tokens of general text. JPMorgan's DocLLM, published in early 2024, was specifically designed to understand document layouts alongside text, a critical requirement for processing invoices where spatial positioning determines meaning.
Supervised vs Unsupervised Learning
Most financial AI systems rely on supervised learning for core tasks. A human labels ten thousand invoices with the correct extracted fields, and the model learns the patterns. This works well for structured documents like standard invoices from large suppliers.
Unsupervised learning plays a different role. It excels at clustering similar transactions, detecting anomalies, and identifying patterns that humans did not explicitly define. When a system notices that transactions from a particular merchant always appear on Fridays and suddenly appear on a Tuesday, that insight emerges from unsupervised pattern recognition.
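The weekday example above boils down to frequency analysis. A minimal sketch, with the threshold and interface chosen purely for illustration:

```python
from collections import Counter

def flag_unusual_day(history_days, new_day, min_share=0.9):
    """Flag a transaction whose weekday breaks an established pattern.

    history_days: weekdays (0=Mon .. 6=Sun) of past transactions for one merchant.
    If a single weekday accounts for at least min_share of the history and the
    new transaction falls on a different day, flag it for review.
    """
    counts = Counter(history_days)
    dominant_day, freq = counts.most_common(1)[0]
    return freq / len(history_days) >= min_share and new_day != dominant_day

# A merchant that has always billed on Fridays (weekday 4)...
history = [4] * 20
print(flag_unusual_day(history, 1))  # Tuesday payment -> True (anomaly)
print(flag_unusual_day(history, 4))  # another Friday  -> False
```

Production systems learn these patterns across many dimensions at once (amount, day, merchant, frequency) rather than one hand-coded rule, but the principle is the same: deviation from a learned baseline.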
Semi-supervised approaches are increasingly common. A small set of expertly labelled data trains an initial model, which then generates predictions on unlabelled data. Humans review the uncertain cases, and the corrected predictions feed back into training. This active learning loop dramatically reduces the volume of manual labelling required.
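The loop just described can be sketched in a few lines. The `model` and `review` interfaces below are illustrative stand-ins, not any specific library's API: `predict` returns a label with a confidence score, and `review` represents the human labeller.

```python
def active_learning_round(model, X, y, unlabelled, review, threshold=0.8):
    """One round of the active-learning loop.

    model is assumed to expose fit(X, y) and predict(doc) -> (label, confidence);
    review(doc) stands in for the human labeller.
    """
    model.fit(X, y)
    newly_labelled = []
    for doc in unlabelled:
        label, confidence = model.predict(doc)
        if confidence >= threshold:
            newly_labelled.append((doc, label))        # trust the model
        else:
            newly_labelled.append((doc, review(doc)))  # human corrects it
    return newly_labelled  # merged into the training set for the next round

class KeywordModel:
    """Toy stand-in: confident only on transactions it has seen before."""
    def fit(self, X, y):
        self.known = dict(zip(X, y))
    def predict(self, doc):
        if doc in self.known:
            return self.known[doc], 0.95
        return "unknown", 0.30

result = active_learning_round(
    KeywordModel(),
    X=["AMZN*2847XQ"], y=["office_supplies"],
    unlabelled=["AMZN*2847XQ", "SP*COFFEECO"],
    review=lambda doc: "meals",
)
print(result)  # [('AMZN*2847XQ', 'office_supplies'), ('SP*COFFEECO', 'meals')]
```

Only the uncertain transaction reaches the human; the confident one is labelled automatically, which is where the reduction in manual effort comes from.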
Fine-Tuning for Accounting Tasks
Fine-tuning takes a pre-trained model and adapts it to a specific domain. Rather than training from scratch on financial data (which would require enormous compute resources), you start with a model that already understands language and teach it the specifics of accounting.
Google's research on PaLM demonstrated that fine-tuning on as few as a thousand domain-specific examples can significantly improve performance on specialised tasks. For accounting, this means taking a base model and training it on examples like:
- Given this invoice image, extract the supplier name, date, and total amount
- Given this bank transaction description, classify it into the correct expense category
- Given this set of transactions, identify which ones are likely input VAT claimable
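Fine-tuning datasets for tasks like these are commonly serialised as prompt/completion pairs in JSONL. The records below are purely illustrative: the schema varies by provider, and the labels shown are examples, not tax guidance.

```python
import json

# Illustrative fine-tuning records in a prompt/completion JSONL style.
# Field names and labels are assumptions for the sake of the example.
records = [
    {"prompt": "Extract supplier, date, total from: 'ACME Ltd  15/03/2024  EUR 47.50'",
     "completion": '{"supplier": "ACME Ltd", "date": "2024-03-15", "total": 47.50}'},
    {"prompt": "Classify this bank transaction: 'AMZN*2847XQ  32.99'",
     "completion": "office_supplies"},
    {"prompt": "Which of these are likely input VAT claimable: "
               "['office supplies EUR 47.50', 'client entertainment EUR 120.00']",
     "completion": "office supplies EUR 47.50"},
]

jsonl = "\n".join(json.dumps(r) for r in records)
```

Thousands of such pairs, reviewed by accountants, teach the base model the mapping from messy source text to structured, category-correct output.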
The results are dramatic. A fine-tuned model might achieve 97% accuracy on invoice field extraction where a general LLM achieves 80%. That difference sounds small in percentage terms but translates to one error in every five invoices versus one in thirty-three. For a business processing hundreds of documents monthly, this is the difference between a usable system and an unreliable one.
The Multilingual Challenge
Financial documents are inherently multilingual. A Maltese business might receive invoices in English, Italian, and occasionally German or French. A single invoice might contain English field labels with Italian addresses and amounts formatted with European comma decimals.
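Even the amount formats need handling before any classification happens. A minimal heuristic sketch: treat the right-most separator as the decimal mark. Real systems combine this with currency and locale cues from the rest of the document, so this function is illustrative, not robust.

```python
def parse_amount(text):
    """Parse an amount that may use European ('1.234,56') or
    Anglo ('1,234.56') formatting. Heuristic: the right-most
    separator is the decimal mark."""
    cleaned = text.replace(" ", "").lstrip("€$£")
    if cleaned.rfind(",") > cleaned.rfind("."):
        # comma is the decimal mark: strip dots, swap comma for dot
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # dot is the decimal mark: strip grouping commas
        cleaned = cleaned.replace(",", "")
    return float(cleaned)

print(parse_amount("1.234,56"))  # 1234.56
print(parse_amount("1,234.56"))  # 1234.56
print(parse_amount("€47,50"))    # 47.5
```

Ambiguous cases like a bare "1,234" (one thousand or one-and-a-fraction?) are exactly why these parsers defer to document-level context in practice.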
Training multilingual financial models requires parallel corpora: the same types of documents in multiple languages, all correctly labelled. This is expensive to produce and difficult to source. Some approaches use translation as a data augmentation technique, but financial terminology does not always translate cleanly. The Italian "fattura" and the English "invoice" map directly, but concepts like "imposta sul valore aggiunto" and "value added tax" carry jurisdiction-specific implications that a simple translation misses.
Recent work by Meta's NLLB (No Language Left Behind) project and Google's multilingual models has improved cross-lingual transfer learning, where a model trained primarily on English financial documents can transfer some of that knowledge to other languages. But accuracy still drops measurably on languages with less training data.
Why Generic GPT Is Not Enough
Here is the fundamental problem. If you ask GPT-4 to classify a transaction, it will give you a plausible answer. Plausible is not the same as correct. Tax law is not a matter of opinion or probability. The VAT rate on a specific good or service in a specific jurisdiction is a fact, and it either matches the legislation or it does not.
Large language models hallucinate. They generate confident, articulate, and completely wrong answers. OpenAI's own research shows hallucination rates that, while improving, remain far too high for financial applications where errors have legal and monetary consequences.
This is not a temporary limitation that will be solved by the next model version. It is an architectural characteristic of how these models work. They predict the most likely next token based on patterns in training data. They do not reason from first principles about tax statute.
The Hybrid Approach
The systems that actually work in production use a hybrid architecture. Probabilistic AI handles what it is good at: reading messy documents, extracting data from varied formats, classifying transactions based on patterns, and flagging anomalies. Deterministic rules engines handle what they are good at: applying tax rates, calculating thresholds, enforcing compliance rules, and generating returns that match legislative requirements exactly.
The AI component says "I am 94% confident this receipt is for office supplies totalling EUR 47.50 including 18% VAT." The rules engine says "office supplies in Malta attract 18% VAT, the input VAT claimable is EUR 7.25, and this must be reported in Box 4 of the VAT return."
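The deterministic half of that hand-off is a fixed calculation, not a prediction: for a VAT-inclusive amount, VAT = gross × rate / (1 + rate). A minimal sketch (the rate table and function names are illustrative):

```python
from decimal import Decimal, ROUND_HALF_UP

# Statutory rates are facts, not predictions; Malta's standard rate is 18%.
VAT_RATES = {("MT", "standard"): Decimal("0.18")}

def input_vat(gross, country, band):
    """Deterministic rule: VAT portion of a VAT-inclusive amount.

    gross = net * (1 + rate), so VAT = gross * rate / (1 + rate).
    """
    rate = VAT_RATES[(country, band)]
    vat = gross * rate / (1 + rate)
    return vat.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(input_vat(Decimal("47.50"), "MT", "standard"))  # 7.25
```

Note the use of Decimal rather than floating point: monetary rules engines must round exactly as the legislation prescribes, which binary floats cannot guarantee.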
Neither system works well alone. The AI without rules will occasionally apply the wrong VAT rate and generate an incorrect return. The rules engine without AI cannot read a crumpled receipt photographed at an angle under fluorescent lighting. Together, they achieve something neither can alone: reliable, automated financial document processing.
This is not a theoretical architecture. It is how every serious financial AI system in production works today, from Intuit's QuickBooks to newer entrants in the space. The details of implementation differ, but the principle is universal: use AI where ambiguity is inherent, and use rules where correctness is non-negotiable.
Michael Cutajar, CPA — Founder of Accora.