The Limitations of Large Language Models in Accounting

By Michael Cutajar

The current wave of enthusiasm around large language models (LLMs) like GPT-4, Claude, Gemini, and their open-source counterparts has led to bold claims about AI transforming every industry. Accounting is no exception. But if you work in finance and have actually tested these models on real accounting tasks, you will have noticed something: they are remarkably good at some things and dangerously unreliable at others.

Understanding this distinction is not academic. It is the difference between building useful tools and building ticking time bombs.

Where LLMs Excel

LLMs are genuinely impressive at several accounting-adjacent tasks:

Text Extraction and Structuring

Give an LLM a scanned invoice or a bank statement PDF and ask it to extract the vendor name, amount, date, VAT number, and line items. It will perform well, often better than traditional OCR plus rule-based extraction. LLMs handle variation in invoice formats, languages, and layouts with a flexibility that rigid template-based systems cannot match.
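
As a concrete illustration, here is a minimal sketch of that extraction pattern in Python. The call_llm helper is a hypothetical stand-in for whichever model API you use, and the field names are assumptions, not a prescribed schema:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM provider's API call."""
    raise NotImplementedError

def extract_invoice_fields(invoice_text: str) -> dict:
    # Ask the model for structured JSON rather than free text.
    prompt = (
        "Extract the following fields from this invoice and reply with "
        "JSON only: vendor_name, invoice_date (ISO 8601), total_amount, "
        "currency, vat_number, line_items (list of {description, amount}).\n\n"
        + invoice_text
    )
    raw = call_llm(prompt)
    data = json.loads(raw)  # fails loudly if the model strayed from JSON
    # Basic shape check: extraction output should never be trusted blindly.
    required = {"vendor_name", "invoice_date", "total_amount", "vat_number"}
    missing = required - data.keys()
    if missing:
        raise ValueError(f"Extraction incomplete, missing: {missing}")
    return data
```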

Classification and Categorisation

"Is this transaction a business expense or a personal expense?" "Which accounting category does this fall into?" LLMs can perform these classifications with reasonable accuracy, especially when given context about the business and examples of prior categorisations.

Summarisation

Summarising financial reports, extracting key points from lengthy contracts, or distilling a complex tax ruling into plain language. This is core LLM territory, and they do it well.

Natural Language Queries

"How much did I spend on travel in Q3?" When connected to structured financial data, an LLM can translate natural language questions into database queries and present the results in a readable format. This is a meaningful improvement over navigating accounting software through menus and report builders.

Where LLMs Fail

Arithmetic

This surprises people, but LLMs are unreliable at arithmetic. They do not calculate; they predict the next token. When you ask an LLM to add up a column of numbers, it is not performing addition. It is generating text that looks like the result of addition.

For simple calculations, the prediction often matches the correct answer, because the model has seen millions of examples of simple arithmetic and has learned the patterns. But for complex calculations involving many steps, unusual numbers, or precision requirements, errors creep in.

In accounting, arithmetic errors are not tolerable. Your tax liability needs to be correct to the cent, not approximately right. A VAT return that is off by even a small amount triggers queries and potential penalties.
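
This is why sums that matter should be computed in code, not generated by a model. A small illustration of exact, to-the-cent arithmetic in Python using Decimal, which also avoids ordinary binary floating-point drift (the 18% rate here is just an example):

```python
from decimal import Decimal, ROUND_HALF_UP

line_items = ["19.99", "104.50", "0.07", "1250.00"]

# Binary floats can accumulate representation error; Decimal does not.
float_total = sum(float(x) for x in line_items)
exact_total = sum(Decimal(x) for x in line_items)

vat = (exact_total * Decimal("0.18")).quantize(Decimal("0.01"),
                                               rounding=ROUND_HALF_UP)
print(float_total)  # may show drift, e.g. 1374.5600000000002
print(exact_total)  # 1374.56, exact
print(vat)          # 247.42, rounded to the cent, the same way every time
```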

Applying Specific Tax Rules

Tax law is precise, conditional, and jurisdiction-specific. A question like "Am I entitled to the MicroInvest tax credit?" requires checking multiple conditions: the type of business, the type of expenditure, the amount, whether the maximum has been reached, whether the business meets the size criteria, and whether the expenditure was incurred in the qualifying period.

LLMs have a general awareness of tax concepts from their training data, but they do not reliably apply specific rules from specific jurisdictions. They may conflate rules from different countries, apply outdated rates, or miss conditions. And critically, they will do this with the same confident tone they use when they are correct.
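
Contrast that with rules written as code. The sketch below is illustrative only; the thresholds and conditions are placeholders, not the actual MicroInvest rules. The point is that every condition is explicit, versioned, and testable:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Claim:
    expenditure_type: str
    credits_already_claimed: float
    employee_count: int
    expenditure_date: date

# Placeholder values for illustration; NOT the real MicroInvest parameters.
ELIGIBLE_EXPENDITURE = {"equipment", "refurbishment", "certification"}
MAX_CREDIT = 50_000.0
MAX_EMPLOYEES = 50
PERIOD = (date(2024, 1, 1), date(2024, 12, 31))

def check_eligibility(claim: Claim) -> tuple[bool, list[str]]:
    """Evaluate every condition explicitly and report each failure."""
    failures = []
    if claim.expenditure_type not in ELIGIBLE_EXPENDITURE:
        failures.append("expenditure type not eligible")
    if claim.credits_already_claimed >= MAX_CREDIT:
        failures.append("credit ceiling already reached")
    if claim.employee_count > MAX_EMPLOYEES:
        failures.append("business exceeds size criteria")
    if not (PERIOD[0] <= claim.expenditure_date <= PERIOD[1]):
        failures.append("expenditure outside qualifying period")
    return (not failures, failures)
```

Unlike a model's answer, this function cannot conflate jurisdictions or quietly skip a condition, and when a rule changes, the change is a reviewable diff.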

Consistency Across Runs

Ask an LLM the same tax question twice, and you may get two different answers. This is by design: LLMs use sampling to generate responses, and the "temperature" parameter controls how much randomness is in the output.

Even at low temperature settings, LLMs are not deterministic in the way that a tax calculation engine is. The same inputs will not always produce the same output. In accounting, where reproducibility is a fundamental requirement (the same transactions should always produce the same financial statements), this is a serious limitation.
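
The difference is easy to demonstrate. A rules function is a pure mapping from inputs to outputs; even a temperature-0 model call carries no such guarantee (the call_llm stub below is hypothetical):

```python
def call_llm(prompt: str, temperature: float = 0.0) -> str: ...  # hypothetical

def vat_due(net_cents: int) -> int:
    """Deterministic: identical input always yields the identical output."""
    return net_cents * 18 // 100

assert vat_due(137_456) == vat_due(137_456)  # always holds, by construction

a = call_llm("What VAT is due on EUR 1,374.56 at 18%?", temperature=0.0)
b = call_llm("What VAT is due on EUR 1,374.56 at 18%?", temperature=0.0)
# a == b is NOT guaranteed: temperature 0 reduces but does not eliminate
# nondeterminism (e.g. floating-point nonassociativity on parallel hardware).
```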

The Hallucination Problem

LLMs sometimes generate plausible-sounding but factually incorrect information. In a general conversation, a hallucinated fact might be harmless. In a financial context, it can be costly.

An LLM might confidently state that a particular expense is deductible when it is not. It might cite a tax rate that does not exist. It might invent a filing deadline. These hallucinations look exactly like correct responses, and without independent verification, there is no way for a non-expert to tell the difference.

Research from Microsoft, Google, and academic institutions has documented hallucination rates across various domains. In specialised fields like tax and accounting, where the training data is less abundant and the rules are more nuanced, hallucination rates tend to be higher.

The Right Architecture: LLMs as a Layer, Not a Decision Engine

The failures above do not mean LLMs are useless in accounting. They mean LLMs should not be the final decision-maker for financial calculations or compliance determinations.

The correct architecture treats the LLM as a preprocessing and interface layer:

Extraction layer. The LLM reads documents, extracts structured data, and classifies transactions.

Calculation layer. A deterministic rules engine (traditional software, not an LLM) performs all arithmetic and applies specific tax rules. This engine produces the same output every time for the same input.

Validation layer. Automated checks verify that outputs are within expected ranges, that calculations balance, and that results are consistent with prior periods.

Presentation layer. The LLM translates the calculated results into natural language explanations, generates summaries, and answers user questions about the data.

In this architecture, the LLM never calculates your tax liability. It helps you understand your tax liability after a reliable engine has calculated it.
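
Putting the layers together, here is a minimal end-to-end sketch. The function names, the 18% rate, and the call_llm stub are illustrative assumptions, not a reference implementation:

```python
import json
from decimal import Decimal

def call_llm(prompt: str) -> str: ...  # hypothetical provider call

def run_pipeline(document_text: str) -> str:
    # 1. Extraction layer: the LLM turns unstructured text into structured data.
    extracted = json.loads(call_llm(
        "Extract net_amount and category from this document as JSON:\n"
        + document_text
    ))

    # 2. Calculation layer: deterministic code, never the model, does the arithmetic.
    net = Decimal(str(extracted["net_amount"]))
    vat = (net * Decimal("0.18")).quantize(Decimal("0.01"))

    # 3. Validation layer: sanity checks before anything reaches the user.
    if not (Decimal("0") <= vat <= net):
        raise ValueError("VAT outside plausible range")

    # 4. Presentation layer: the LLM explains figures it did not compute.
    return call_llm(
        f"Explain in plain language: net {net} EUR, VAT due {vat} EUR."
    )
```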

Guardrails and Validation

Any system that uses LLMs in a financial context needs guardrails:

Output validation. Every LLM-generated classification, extraction, or recommendation should be checked against rules and constraints. If the LLM classifies a 50,000-euro transaction as "office supplies," a simple reasonableness check should flag it for review (a combined sketch of these guardrails follows below).

Confidence scoring. The system should track how confident the LLM is in its outputs and route low-confidence items to human review. Many LLM APIs provide log probabilities that can be used for this purpose.

Audit trail. Every LLM interaction should be logged: the input, the output, and any downstream actions taken. This is essential for regulatory compliance and for debugging when things go wrong.

Human-in-the-loop. For any decision that has financial or legal consequences, a qualified human should review and approve the output. The LLM assists; it does not decide.
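
A sketch combining these guardrails. The threshold values, logger setup, and the classify stub (standing in for an LLM call that also reports a confidence score) are assumptions for illustration:

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(filename="llm_audit.log", level=logging.INFO)

def classify(description: str, amount: float) -> tuple[str, float]:
    """Hypothetical LLM call returning (category, confidence in [0, 1])."""
    ...

def process_transaction(description: str, amount: float) -> dict:
    category, confidence = classify(description, amount)
    needs_review = False

    # Output validation: flag implausible category/amount combinations.
    if category == "office_supplies" and amount > 10_000:
        needs_review = True

    # Confidence scoring: route low-confidence outputs to a human.
    if confidence < 0.85:
        needs_review = True

    # Audit trail: log the input, the output, and the routing decision.
    logging.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "input": {"description": description, "amount": amount},
        "output": {"category": category, "confidence": confidence},
        "routed_to_human": needs_review,  # human-in-the-loop
    }))
    return {"category": category, "needs_review": needs_review}
```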

Why "AI-Powered" Does Not Mean "AI-Decided"

The marketing landscape is full of claims about AI-powered accounting, AI-powered tax filing, and AI-powered compliance. These claims range from genuine to misleading.

A system where AI extracts data from invoices, categorises transactions, and presents a dashboard for human review is legitimately AI-powered. The AI is doing real work that saves real time.

A system where AI calculates your tax liability, files your return, and makes compliance decisions without meaningful human oversight is a liability waiting to happen.

The distinction matters. When evaluating any AI-enabled accounting tool, the question is not "Does it use AI?" The question is "Where exactly does AI make decisions, and where do humans make decisions?" If AI is making the decisions anywhere that touches compliance obligations or financial calculations, proceed with extreme caution.

The Path Forward

LLMs will continue to improve. Arithmetic capabilities are being augmented with tool use (the model calls a calculator rather than predicting the answer). Hallucination rates are decreasing with better training techniques and retrieval-augmented generation. Consistency is improving with better prompting strategies and deterministic decoding.
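
Tool use in this sense follows a simple loop: the model emits a structured request instead of an answer, the host executes it with real arithmetic, and the result is fed back. A schematic sketch, with the message format an assumption rather than any particular vendor's API:

```python
import json
from decimal import Decimal

def call_llm(prompt: str) -> str: ...  # hypothetical provider call

def answer_with_calculator(question: str) -> str:
    # Ask the model to request a calculation instead of guessing the digits.
    reply = call_llm(
        "If this question needs arithmetic, reply with JSON "
        '{"tool": "sum", "operands": [...]} instead of an answer.\n'
        + question
    )
    request = json.loads(reply)
    if request.get("tool") == "sum":
        # The host, not the model, performs the arithmetic, exactly.
        total = sum(Decimal(str(x)) for x in request["operands"])
        return call_llm(f"{question}\nThe computed total is {total}. "
                        "State the answer using exactly this figure.")
    return reply
```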

But the fundamental principle will remain: LLMs are probabilistic systems, and accounting is a domain that demands deterministic accuracy. The firms that understand this boundary will build reliable, trustworthy systems. The firms that ignore it will build systems that work brilliantly until they do not.


Michael Cutajar, CPA — Founder of Accora.