AI & Technology

How AI Categorises Transactions: The Technology Behind Automated Bookkeeping

By Michael Cutajar · 9 min read

Every self-employed professional generates hundreds, sometimes thousands, of bank transactions per year. Each one needs a category: is that payment to Shell for vehicle fuel, office supplies, or a client dinner at a restaurant that happens to be called Shell? This is the categorisation problem, and it sits at the heart of automated bookkeeping.

The Rule-Based Era

The earliest attempts at automated categorisation used rule-based systems. The logic was simple: if the merchant name contains "Shell," assign it to "Motor Expenses." If it contains "Vodafone," assign it to "Telephone." These keyword-matching systems were easy to build and easy to understand.
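A keyword matcher of this kind fits in a few lines of Python. This is a minimal sketch (the rules and category names are illustrative), and its brittleness is visible immediately: the first matching keyword wins, regardless of context.

```python
# Minimal keyword-based categoriser: first matching rule wins.
RULES = [
    ("SHELL", "Motor Expenses"),
    ("VODAFONE", "Telephone"),
]

def categorise(description: str) -> str:
    desc = description.upper()
    for keyword, category in RULES:
        if keyword in desc:
            return category
    return "Uncategorised"

print(categorise("POS 2847 SHELL MSIDA MT"))   # Motor Expenses
print(categorise("SHELL CAFE VALLETTA"))       # Motor Expenses -- a lunch, misfiled
```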

They also broke constantly.

A payment to "Shell Cafe" would land in motor expenses. A transfer to "J. Shell Consulting" would do the same. "Amazon" could be office supplies, software subscriptions, personal purchases, or client gifts. The more transactions you processed, the more edge cases you discovered, and each edge case required another rule. Maintaining these rule sets became a full-time job in itself.

Enter Machine Learning Classification

Modern transaction categorisation uses supervised machine learning, specifically multi-class classification models. Instead of hand-coding rules, you train a model on millions of previously categorised transactions and let it learn the patterns.

The shift from rules to ML is significant. A rule-based system knows that "Shell" means fuel because someone told it so. An ML model learns that transactions at Shell stations tend to be between 20 and 80 euros, occur during commuting hours, appear roughly weekly, and follow geographic patterns consistent with the user's known location. When it sees "Shell Cafe" with a 12-euro charge at lunchtime, the model has enough context to classify it differently.
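The contextual reasoning described above can be illustrated with a toy nearest-neighbour classifier over just two features, amount and hour of day. The training examples and categories below are invented for illustration; a production model would use far richer features and a proper learning algorithm, but the principle is the same: context, not the merchant name alone, drives the prediction.

```python
import math

# Toy labelled history: (amount_eur, hour_of_day) -> category.
TRAINING = [
    ((55.0, 8), "Motor Expenses"),   # weekly fuel fill-up, commuting hours
    ((62.0, 18), "Motor Expenses"),
    ((11.5, 13), "Meals"),           # small lunchtime cafe charge
    ((9.0, 12), "Meals"),
]

def classify(amount: float, hour: int) -> str:
    """1-nearest-neighbour on (amount, hour); a stand-in for a real model."""
    def dist(example):
        (a, h), _ = example
        return math.hypot(a - amount, (h - hour) * 5)  # crude weighting of hour
    return min(TRAINING, key=dist)[1]

# "Shell Cafe", 12 euros at lunchtime: context outweighs the merchant name.
print(classify(12.0, 13))   # Meals
print(classify(58.0, 8))    # Motor Expenses
```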

Feature Engineering: What the Model Actually Sees

Raw transaction data from a bank feed is messy. The description field might say "POS 2847 SHELL MSIDA MT" or "DD REF 9928471 VODAFONE." Turning this into useful input for a model requires feature engineering.

Typical features include:

Merchant-level features: Cleaned merchant name, merchant category code (MCC) when available, merchant location, and merchant type from external databases.

Transaction-level features: Amount, currency, date, day of week, time of day (where available), and transaction type (card payment, direct debit, bank transfer, cash withdrawal).

Behavioural features: Transaction frequency with this merchant, average amount at this merchant, time since last transaction with this merchant, and spending patterns by category over time.

Contextual features: User's industry (a real estate agent's "marketing" spend looks different from a developer's), VAT registration status, and historical category assignments.

Feature engineering is often where the real competitive advantage lies. Two teams using the same gradient-boosted tree model will get very different results depending on how well they prepare the input data.
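A simplified sketch of the cleaning step, turning a raw bank-feed description like "POS 2847 SHELL MSIDA MT" into model-ready features. The regex and bucket thresholds here are illustrative assumptions, not a production pipeline:

```python
import re
from datetime import datetime

def extract_features(description: str, amount: float, booked: datetime) -> dict:
    """Turn a raw bank-feed line into a flat feature dict (simplified sketch)."""
    # Strip channel prefixes and reference numbers, e.g. "POS 2847", "DD REF 9928471".
    cleaned = re.sub(r"^(POS|DD REF|TRF)\s*\d*\s*", "", description.upper()).strip()
    return {
        "merchant": cleaned,
        "amount": amount,
        "day_of_week": booked.weekday(),      # 0 = Monday
        "hour": booked.hour,
        "is_weekend": booked.weekday() >= 5,
        "amount_bucket": "small" if amount < 20 else "medium" if amount < 100 else "large",
    }

feats = extract_features("POS 2847 SHELL MSIDA MT", 54.30, datetime(2024, 3, 4, 8, 15))
print(feats["merchant"])        # SHELL MSIDA MT
print(feats["amount_bucket"])   # medium
```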

Accuracy Rates and What They Mean

Published benchmarks from companies like Plaid, Yodlee, and various open banking providers report categorisation accuracy between 85% and 95%, depending on the granularity of categories and the quality of the training data.

But "accuracy" is a slippery metric in this context. If 60% of your transactions are straightforward card payments at well-known merchants, a model that only gets those right already hits 60% accuracy. The hard cases are bank transfers with cryptic references, international payments, and transactions at small local businesses that don't appear in any merchant database.

What matters more than headline accuracy is precision and recall per category. You want high precision on tax-sensitive categories (you don't want non-deductible personal expenses accidentally classified as business costs) and high recall on categories where missing a deduction costs the client money.
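Per-category precision and recall can be computed directly from a confusion of true and predicted labels. The tiny example below (invented labels) shows why headline accuracy hides the problem: a "Motor" prediction on a personal expense drags down precision on exactly the category where that matters.

```python
from collections import Counter

def per_category_metrics(y_true, y_pred):
    """Precision and recall per category from parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this category wrongly
            fn[truth] += 1  # missed this category
    cats = set(y_true) | set(y_pred)
    return {
        c: {
            "precision": tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0,
            "recall": tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0,
        }
        for c in cats
    }

truth = ["Motor", "Motor", "Personal", "Meals"]
pred  = ["Motor", "Meals", "Motor",    "Meals"]
metrics = per_category_metrics(truth, pred)
# Low precision on "Motor" means personal spend leaked into a deductible category.
print(metrics["Motor"])   # {'precision': 0.5, 'recall': 0.5}
```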

The Cold Start Problem

ML models learn from historical data. When a user signs up for a new accounting service, there is no history. When a new vendor appears in the market, there is no training data for that vendor. This is the cold start problem.

Solutions typically involve a combination of approaches. Global models trained on aggregate data from all users provide a reasonable baseline. Transfer learning from similar users in the same industry helps narrow things down. And fallback to merchant database lookups (using MCC codes or external enrichment services like Plaid's merchant identification) fills gaps for known vendors that are new to a particular user.
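The combination of approaches above amounts to a fallback chain. This sketch uses plain dictionaries as stand-ins for the real components (personal history, global model, MCC lookup); the merchants and codes are illustrative:

```python
# Cold-start fallback chain: personalised model, then global model, then MCC lookup.
USER_HISTORY = {}                              # empty: brand-new user
GLOBAL_MODEL = {"VODAFONE": "Telephone"}       # aggregate patterns across all users
MCC_LOOKUP = {"5541": "Motor Expenses"}        # 5541 = service stations (illustrative)

def categorise(merchant, mcc):
    """Return (category, source) so downstream code knows how confident to be."""
    if merchant in USER_HISTORY:
        return USER_HISTORY[merchant], "personalised"
    if merchant in GLOBAL_MODEL:
        return GLOBAL_MODEL[merchant], "global"
    if mcc and mcc in MCC_LOOKUP:
        return MCC_LOOKUP[mcc], "mcc_fallback"
    return "Uncategorised", "needs_review"

print(categorise("VODAFONE", None))       # ('Telephone', 'global')
print(categorise("SHELL MSIDA", "5541"))  # ('Motor Expenses', 'mcc_fallback')
```

Returning the source alongside the category is a deliberate choice: a prediction that came from the MCC fallback deserves a lower confidence score than one backed by the user's own history.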

Over time, as the system processes more transactions and receives corrections from the user or their accountant, the model adapts. This is where online learning or periodic retraining comes in, and it is one of the reasons that accounting platforms improve with scale.

Learning from Corrections

When a user or accountant reclassifies a transaction, that correction is gold. It tells the model exactly where it went wrong and provides a new labelled example for future training.

But corrections need to be handled carefully. A single user reclassifying "Uber" from "Travel" to "Client Entertainment" reflects their specific use case, not a universal truth. The model needs to learn that this particular user uses Uber for client meetings, without concluding that Uber is always client entertainment for everyone.

This is where user-level personalisation layers sit on top of global models. The global model provides a strong prior (Uber is probably travel), and the personalisation layer adjusts based on individual behaviour.
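One simple way to blend the two layers is to treat the global model's output as pseudo-counts and weight the user's own corrections more heavily. The weights and counts below are invented for illustration; real systems tune this balance empirically.

```python
from collections import Counter

# Global prior as pseudo-counts: across all users, Uber is usually travel.
GLOBAL_PRIOR = {"UBER": Counter({"Travel": 90, "Client Entertainment": 10})}

class PersonalisedCategoriser:
    """Blend a global prior with this user's own corrections (illustrative weights)."""
    def __init__(self, user_weight: float = 5.0):
        self.user_counts = Counter()    # keyed by (merchant, category)
        self.user_weight = user_weight  # one correction counts as 5 global examples

    def correct(self, merchant: str, category: str) -> None:
        self.user_counts[(merchant, category)] += 1

    def predict(self, merchant: str) -> str:
        scores = Counter(GLOBAL_PRIOR.get(merchant, Counter()))
        for (m, cat), n in self.user_counts.items():
            if m == merchant:
                scores[cat] += n * self.user_weight
        return scores.most_common(1)[0][0] if scores else "Uncategorised"

model = PersonalisedCategoriser()
print(model.predict("UBER"))        # Travel (global prior dominates)
for _ in range(20):                 # this user keeps reclassifying Uber rides
    model.correct("UBER", "Client Entertainment")
print(model.predict("UBER"))        # Client Entertainment (personal history wins)
```

Crucially, the corrections only shift the prediction for this user; the global prior, and every other user's predictions, stay untouched.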

Why 100% Automation Is the Wrong Goal

There is a temptation in the fintech world to chase full automation. But in accounting, a 100% automation target is actually dangerous.

Consider the implications. If a model categorises a personal expense as a business deduction, the client claims a tax deduction they are not entitled to. If it miscategorises a VAT-exempt transaction as standard-rated, the client either overpays VAT or, worse, under-reports it.

The more pragmatic approach is what practitioners call the 95/5 split. Automate the 95% of transactions where the model has high confidence and the stakes of an error are low. Route the remaining 5% (the ambiguous, high-value, or unusual transactions) to a human reviewer.
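In practice the split is implemented as a routing rule on top of the model's output. The threshold, category list, and amount cut-off below are illustrative assumptions; each firm tunes them to its own risk appetite.

```python
CONFIDENCE_THRESHOLD = 0.90   # illustrative cut-off
HIGH_STAKES = {"Client Entertainment", "Personal", "Related-Party"}

def route(category: str, confidence: float, amount: float) -> str:
    """Decide whether a prediction is auto-applied or queued for human review."""
    if confidence < CONFIDENCE_THRESHOLD:
        return "review"              # model is unsure
    if category in HIGH_STAKES or amount >= 1000:
        return "review"              # tax-sensitive category or high value
    return "auto"

print(route("Telephone", 0.99, 25.0))              # auto
print(route("Client Entertainment", 0.97, 80.0))   # review
print(route("Office Supplies", 0.95, 3000.0))      # review
```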

This is not a failure of AI. It is intelligent system design. The human reviewer is not checking everything; they are checking only the transactions that genuinely need expert judgement. This is where a qualified accountant adds real value: not in categorising the 200th Vodafone direct debit, but in deciding whether that 3,000-euro payment to your cousin's company is a legitimate business expense or a related-party transaction that needs disclosure.

The Path Forward

Transaction categorisation is one of the most mature applications of ML in accounting. The models are good and getting better. But the real innovation is not in the model architecture; it is in understanding where automation should stop and human judgement should begin.

The firms that get this balance right, automating the routine while flagging the exceptions, will deliver both efficiency and accuracy. The firms that chase 100% automation will eventually face a compliance problem they could have avoided.


Michael Cutajar, CPA — Founder of Accora.