
Embedding Models and Financial Search: Finding the Needle in the Haystack

By Michael Cutajar · 9 min read

Try searching your accounting records for "entertainment expenses." A keyword search will find transactions explicitly tagged or described with that phrase. But it will miss the client dinner at a restaurant, the team lunch for a project milestone, the tickets to an industry conference networking event, and the gift basket sent to a referral partner.

These are all entertainment expenses. But none of them contain the word "entertainment."

This is the fundamental limitation of keyword-based search in financial data. And it is a problem that embedding models solve elegantly.

What Are Embeddings?

An embedding is a way of representing a piece of text (a word, a sentence, an entire document) as a list of numbers, typically hundreds or thousands of them. This list of numbers is called a vector, and it lives in a high-dimensional space where similar meanings end up close together.

The key insight is that embeddings capture meaning, not just words. The embedding for "client dinner" will be close to the embedding for "business entertainment" in vector space, because these phrases refer to similar concepts, even though they share no words.

Modern embedding models like OpenAI's text-embedding-3, Cohere's Embed, and open-source alternatives like Sentence-BERT are trained on massive text corpora and develop a nuanced understanding of semantic relationships. They know that "invoice" and "bill" are related. They know that "VAT refund" and "tax credit" overlap conceptually. They know that "amortisation" and "depreciation" are related but distinct.
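"Close together" in vector space is usually measured with cosine similarity. The sketch below shows the mechanics using hand-made four-dimensional toy vectors; a real model such as those named above produces hundreds or thousands of dimensions, and the specific numbers here are illustrative only:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, chosen so that related phrases point in
# similar directions (a real model would compute these from text).
embeddings = {
    "client dinner":          [0.9, 0.8, 0.1, 0.0],
    "business entertainment": [0.8, 0.9, 0.2, 0.1],
    "server hardware":        [0.1, 0.0, 0.9, 0.8],
}

sim_related = cosine_similarity(embeddings["client dinner"],
                                embeddings["business entertainment"])
sim_unrelated = cosine_similarity(embeddings["client dinner"],
                                  embeddings["server hardware"])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

The related pair scores close to 1.0 while the unrelated pair scores near 0, which is exactly the signal semantic search ranks by.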

Why This Matters for Accounting

Financial data is full of semantic relationships that keyword search misses.

Transaction descriptions are inconsistent. One merchant codes a payment as "CONSULT FEE," another as "PROFESSIONAL SERVICES," and a third as "ADVISORY RETAINER." A keyword search for "consulting" finds only the first. A semantic search finds all three.

Invoices use varied language. Your supplier invoices describe the same service differently each month. The underlying meaning is the same, but the words change.

Queries are natural language. When a business owner asks "how much did I spend on marketing last quarter?" they expect the answer to include Google Ads, Facebook campaigns, business card printing, trade show fees, and sponsored content. Keyword matching on "marketing" misses most of these.

Vector Databases for Accounting

To make semantic search work at scale, you need a vector database: a system optimised for storing and querying high-dimensional vectors. Products like Pinecone, Weaviate, Qdrant, and the open-source FAISS library provide this capability.

Here is how it works in an accounting context:

  1. Every transaction description, invoice, receipt, and contract gets converted into an embedding vector.
  2. These vectors are stored in a vector database alongside the original records.
  3. When a user searches, their query is also converted into an embedding.
  4. The database returns the records whose embeddings are closest to the query embedding.

The search is based on meaning, not exact text matching. And it happens in milliseconds, even across millions of records.
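The four steps above can be sketched as a brute-force nearest-neighbour search. Everything here is a stand-in: the vectors are hand-made toys, and a production system would call an embedding model for steps 1 and 3 and delegate step 4 to a vector database index rather than scanning a Python list:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Steps 1-2: transaction descriptions stored alongside their vectors.
INDEX = [
    ("CONSULT FEE",           [0.90, 0.10, 0.10]),
    ("PROFESSIONAL SERVICES", [0.80, 0.20, 0.10]),
    ("ADVISORY RETAINER",     [0.85, 0.15, 0.20]),
    ("OFFICE RENT",           [0.10, 0.90, 0.10]),
]

def search(query_vec, index, top_k=3):
    """Steps 3-4: rank stored records by closeness to the query vector."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [desc for desc, _ in scored[:top_k]]

query_vec = [0.90, 0.10, 0.15]  # pretend embedding of the query "consulting"
print(search(query_vec, INDEX))
```

All three consulting-related descriptions rank above "OFFICE RENT" even though none of them contains the word "consulting" — the inconsistent-merchant-coding problem from earlier, solved by proximity in vector space.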

Practical Use Cases

Finding Related Transactions

A user wants to review all transactions related to a specific project. Some transactions mention the project name. Others mention the client name. Others mention the deliverable. A semantic search with the project name as the query retrieves all of these, because their descriptions are semantically related.

Invoice-to-Purchase-Order Matching

Matching incoming invoices to outstanding purchase orders is a common accounting task. The invoice might describe the goods as "premium office chairs, ergonomic, black" while the purchase order says "ergonomic seating, 10 units." Keyword matching fails. Embedding similarity succeeds.
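One way to sketch this is a best-match lookup with a similarity threshold, so an invoice with no plausible counterpart returns no match rather than a wrong one. The purchase orders and vectors below are hypothetical toys standing in for real embedding-model output:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical open purchase orders with hand-made stand-in vectors.
OPEN_POS = {
    "PO-1001: ergonomic seating, 10 units": [0.90, 0.20, 0.10],
    "PO-1002: laptop docking stations":     [0.10, 0.90, 0.20],
}

def match_invoice(invoice_vec, purchase_orders, threshold=0.8):
    """Return the closest PO, or None if nothing clears the threshold."""
    best_po = max(purchase_orders,
                  key=lambda po: cosine(invoice_vec, purchase_orders[po]))
    if cosine(invoice_vec, purchase_orders[best_po]) >= threshold:
        return best_po
    return None

# Pretend embedding of "premium office chairs, ergonomic, black".
invoice_vec = [0.85, 0.25, 0.15]
print(match_invoice(invoice_vec, OPEN_POS))
```

The chairs invoice matches the seating PO despite sharing almost no words with it, while a dissimilar invoice falls below the threshold and is flagged for human review instead of being force-matched.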

Teams at several fintech companies have reported that embedding-based matching significantly outperforms rule-based approaches for three-way matching (purchase order, goods receipt, invoice), particularly when descriptions are inconsistent.

Searching Across Years of History

When preparing for a tax audit or responding to a query from the tax authority, you might need to find all transactions of a certain type across multiple years. Embedding search makes this feasible even when categorisation was inconsistent, naming conventions changed, or different accounting systems were used in different years.

Document Retrieval

Beyond transactions, embedding search works across all financial documents: contracts, engagement letters, regulatory correspondence, and board minutes. Searching for "indemnity clause" retrieves documents containing limitation of liability provisions, hold harmless agreements, and risk allocation terms, even if none of them use the word "indemnity."

The Privacy Advantage

One of the most important properties of embeddings is that they can be computed locally.

When you send a document to a cloud-based LLM for processing, the raw text of that document travels to a remote server. For financial data containing client names, transaction amounts, and sensitive business information, this raises legitimate privacy and confidentiality concerns.

Embedding models can run on local hardware. Open-source models like those in the Sentence-Transformers library can compute embeddings on a laptop. The raw financial data never leaves the organisation's infrastructure.

Once computed, embeddings are also not straightforwardly reversible. You cannot reconstruct the original text from its embedding vector (at least not with current techniques). This means that even if embedding vectors were intercepted, the underlying financial data would not be directly exposed.

This makes embeddings an attractive approach for organisations that need the benefits of semantic search without the privacy trade-offs of sending raw data to external AI services.

Limitations and Considerations

Embedding models are not perfect for financial search. There are important caveats:

Numerical precision. Embeddings capture semantic meaning, not numerical values. A search for "transactions over 10,000 euros" requires traditional filtering, not embedding search. The best systems combine semantic search with structured queries.
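A minimal sketch of that combination: apply the structured numeric filter first, then rank the survivors semantically. The records and vectors are hand-made toys; in practice the amount filter would be a SQL WHERE clause and the vectors would come from an embedding model:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Toy records; amounts would normally live in a relational table.
TRANSACTIONS = [
    {"desc": "client dinner",      "amount": 450.00,    "vec": [0.9, 0.1]},
    {"desc": "conference tickets", "amount": 12_000.00, "vec": [0.7, 0.3]},
    {"desc": "office rent",        "amount": 15_000.00, "vec": [0.1, 0.9]},
]

def hybrid_search(query_vec, min_amount):
    # Structured filter first: embeddings cannot answer "over 10,000 euros".
    candidates = [t for t in TRANSACTIONS if t["amount"] > min_amount]
    # Then rank the survivors by semantic similarity to the query.
    candidates.sort(key=lambda t: cosine(query_vec, t["vec"]), reverse=True)
    return [t["desc"] for t in candidates]

query_vec = [0.9, 0.2]  # pretend embedding of "entertainment and events"
print(hybrid_search(query_vec, 10_000))
```

The numeric predicate removes the small dinner, and the semantic ranking then puts the events-related transaction ahead of the rent payment.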

Domain specificity. General-purpose embedding models may not fully understand accounting terminology. A model that has not seen enough financial text might not know that "T&E" means travel and entertainment, or that "COGS" means cost of goods sold. Domain-adapted or fine-tuned models perform better.

Retrieval is not understanding. Finding relevant documents is only the first step. Making sense of them, applying judgement, and drawing conclusions still requires human expertise (or additional AI layers with their own limitations).

The Direction of Travel

Semantic search is becoming a standard capability in modern software. Major database vendors (PostgreSQL with pgvector, MongoDB with Atlas Vector Search) are adding vector search natively. This means embedding-based search does not require a separate specialist system; it can be built into the existing data architecture.

For accounting and financial services, this capability transforms how professionals interact with financial data. Instead of navigating rigid chart-of-accounts hierarchies and remembering exact account codes, users can search by meaning.

The professionals who leverage this will spend less time looking for information and more time acting on it.


Michael Cutajar, CPA — Founder of Accora.