Note (2026-04-24): After this document was written, `legal_entity` was renamed to `tenant` and the old `tenant` was renamed to `organization`. Read references to these terms with the pre-rename meaning.
Date: 2026-02-24 Status: Draft
Automatic matching of financial documents: Invoice ↔ Purchase Order ↔ Contract ↔ Goods Received Note. Documents arrive in arbitrary order; matching runs asynchronously on each arrival.
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Ingestion       │     │ SQS Queue       │     │ Matching        │
│ Pipeline        │────▶│ document-ready  │────▶│ Worker          │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
                                                ┌─────────────────┐
                                                │ PostgreSQL      │
                                                │ - document      │
                                                │ - document_match│
                                                └─────────────────┘
Event flow:
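1. The ingestion pipeline finishes a document with `valid_structured_data = true`.
2. It publishes a `document-ready` event to SQS.
3. The matching worker consumes the event and calls `match-document!`.
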
Future: Replace SQS with SNS fan-out for multiple subscribers with different delivery guarantees.
-- Cluster membership on document (denormalized for O(1) queries)
ALTER TABLE document ADD COLUMN cluster_id uuid;
ALTER TABLE document ADD COLUMN searchable_text text;
ALTER TABLE document ADD COLUMN embedding vector(768);
CREATE INDEX idx_document_cluster_id ON document(cluster_id);
CREATE INDEX idx_document_embedding ON document USING hnsw (embedding vector_cosine_ops);
CREATE INDEX idx_document_searchable_text ON document USING gin (to_tsvector('simple', searchable_text));
-- Pairwise match edges (source of truth)
CREATE TABLE document_match (
    document_a_id uuid REFERENCES document(id) ON DELETE CASCADE,
    document_b_id uuid REFERENCES document(id) ON DELETE CASCADE,
    confidence    decimal(5,4),
    match_method  text NOT NULL,  -- 'rule-based', 'llm'
    evidence      jsonb,          -- [{"signal": "po-number-exact", "value": "PO-123", "weight": 60}, ...]
    created_at    timestamptz NOT NULL DEFAULT now(),
    updated_at    timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (document_a_id, document_b_id),
    CHECK (document_a_id < document_b_id)  -- canonical ordering, symmetric
);
CREATE INDEX idx_document_match_b ON document_match(document_b_id);
CREATE INDEX idx_document_match_created ON document_match(created_at);
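The `CHECK (document_a_id < document_b_id)` constraint means a writer has to order each pair before inserting. The helper below is a minimal sketch of that step; `canonical-pair` is a hypothetical name, not one from this design. It compares the lower-cased string forms of the ids because `java.util.UUID`'s own `compareTo` treats the two 64-bit halves as signed numbers, which can disagree with PostgreSQL's bytewise `uuid` ordering.

```clojure
(defn canonical-pair
  "Returns [a b] such that a sorts before b in PostgreSQL's uuid order.
  Hypothetical helper: accepts java.util.UUID values or their string forms.
  Throws when both ids are equal, since a document cannot match itself."
  [id-x id-y]
  (let [key-x (.toLowerCase (str id-x))
        key-y (.toLowerCase (str id-y))
        order (compare key-x key-y)]
    (cond
      (neg? order) [id-x id-y]
      (pos? order) [id-y id-x]
      :else (throw (ex-info "A document cannot be matched to itself"
                            {:document-id id-x})))))
```

Calling it on every insert and lookup keeps `(a, b)` and `(b, a)` from ever being treated as two different edges.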
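Create match (three cases):
1. Neither document has a `cluster_id` → generate new UUID, assign to both
2. Only one document has a `cluster_id` → assign same to the other
3. Both have a different `cluster_id` → merge: update all docs with cluster_b to cluster_a

Remove match (may disconnect cluster):
1. Delete the edge from `document_match`
2. Recompute the affected cluster; documents left without any match get `cluster_id = NULL`

Integrity: Periodic job rebuilds connected components from edges, verifies consistency.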
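The cluster rules can be sketched as two pure functions over in-memory data, which is also a convenient shape for the integrity job to test against. Both names (`assign-clusters`, `connected-components`) and the map-based representation are assumptions made for illustration; the real implementation would run as SQL updates on `document.cluster_id`.

```clojure
(defn assign-clusters
  "Applies the create-match rules to `cluster-of`, a map of
  document id -> cluster id (missing or nil means unclustered)."
  [cluster-of a b]
  (let [ca (get cluster-of a)
        cb (get cluster-of b)]
    (cond
      ;; Neither is clustered: new cluster for both
      (and (nil? ca) (nil? cb))
      (let [new-id (java.util.UUID/randomUUID)]
        (assoc cluster-of a new-id b new-id))
      ;; Exactly one is clustered: the other joins it
      (nil? cb) (assoc cluster-of b ca)
      (nil? ca) (assoc cluster-of a cb)
      ;; Already in the same cluster: nothing to do
      (= ca cb) cluster-of
      ;; Two different clusters: move every member of cb into ca
      :else
      (into {} (map (fn [[doc-id c]] [doc-id (if (= c cb) ca c)])) cluster-of))))

(defn connected-components
  "Rebuilds clusters from match edges. `edges` is a seq of [a b] id pairs.
  Returns a set of sets of document ids."
  [edges]
  (let [adjacency (reduce (fn [adj [a b]]
                            (-> adj
                                (update a (fnil conj #{}) b)
                                (update b (fnil conj #{}) a)))
                          {}
                          edges)]
    (loop [unvisited (set (keys adjacency))
           components #{}]
      (if (empty? unvisited)
        components
        ;; Depth-first walk from an arbitrary unvisited document
        (let [component (loop [stack [(first unvisited)]
                               seen #{}]
                          (if-let [node (peek stack)]
                            (if (seen node)
                              (recur (pop stack) seen)
                              (recur (into (pop stack) (adjacency node))
                                     (conj seen node)))
                            seen))]
          (recur (reduce disj unvisited component)
                 (conj components component)))))))
```

The integrity job could compare the output of `connected-components` over all `document_match` rows with the groups implied by `cluster_id` and flag any difference.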
-- How much of this PO has been invoiced?
SELECT COALESCE(SUM((d.structured_data->>'total')::decimal), 0)
FROM document_match dm
JOIN document d ON d.id = dm.document_a_id OR d.id = dm.document_b_id
WHERE (dm.document_a_id = :po_id OR dm.document_b_id = :po_id)
  AND d.id != :po_id
  AND d.type = 'invoice';
No cached column; the invoiced total is computed on demand.
(def matchable-pairs
  #{#{:invoice :purchase-order}
    #{:invoice :contract}
    #{:purchase-order :contract}
    #{:goods-received-note :purchase-order}})
Invoice ↔ GRN and GRN ↔ Contract are indirect (via PO).
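`find-candidates` relies on a `get-matchable-types` helper that this document does not define. One plausible way to derive it from `matchable-pairs` is sketched below; the body is an assumption, and the set is repeated only so the snippet runs on its own.

```clojure
;; Copied from the definition above so the sketch is self-contained.
(def matchable-pairs
  #{#{:invoice :purchase-order}
    #{:invoice :contract}
    #{:purchase-order :contract}
    #{:goods-received-note :purchase-order}})

(defn get-matchable-types
  "Returns the set of document types that may be matched directly
  with `doc-type`. Unknown types yield an empty set."
  [doc-type]
  (into #{}
        (comp (filter #(contains? % doc-type)) ; pairs that mention this type
              (mapcat #(disj % doc-type)))     ; keep the other side of each pair
        matchable-pairs))
```

Under this sketch a goods received note only searches for purchase orders, while a purchase order searches for invoices, contracts, and goods received notes.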
Single broad search, type-specific scoring:
(defn find-candidates
  [doc]
  (let [matchable-types (get-matchable-types (:type doc))
        query-text (build-searchable-text doc)
        query-embedding (embed query-text)]
    (-> (hybrid-search {:legal-entity-id (:legal-entity-id doc)
                        :types matchable-types
                        :query-text query-text
                        :query-embedding query-embedding
                        :limit 50})
        (exclude-self doc))))
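This document does not say how `hybrid-search` merges the full-text and vector result lists. One common option is reciprocal rank fusion, sketched here purely as an assumption; the function name and the constant `k = 60` (the value conventionally used with this technique) are not taken from the design.

```clojure
(defn reciprocal-rank-fusion
  "Merges several ranked lists of ids into one list, best first.
  Each id scores the sum of 1 / (k + rank) over the lists it appears in,
  with rank starting at 1."
  ([rankings] (reciprocal-rank-fusion rankings 60))
  ([rankings k]
   (->> rankings
        (mapcat (fn [ranking]
                  (map-indexed (fn [i id] [id (/ 1.0 (+ k i 1))]) ranking)))
        (reduce (fn [acc [id score]] (update acc id (fnil + 0.0) score)) {})
        (sort-by val >)
        (mapv key))))
```

For example, `(reciprocal-rank-fusion [[:a :b :c] [:b :c :d]])` returns `[:b :c :a :d]`: documents found by both searches outrank those found by only one.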
Searchable text (same for indexing and querying):
;; Assumes the namespace requires [clojure.string :as str].
(defn build-searchable-text
  [{:keys [type structured-data]}]
  (case type
    :invoice
    (str/join " | "
              [(get-in structured-data [:issuer :name])
               (get-in structured-data [:issuer :vat-id])
               (:invoice-number structured-data)
               (some-> structured-data :total str)
               (:currency structured-data)
               (:po-reference structured-data)])
    :purchase-order
    (str/join " | "
              [(get-in structured-data [:supplier :name])
               (get-in structured-data [:supplier :vat-id])
               (:po-number structured-data)
               (some-> structured-data :total str)
               (:currency structured-data)
               (:contract-reference structured-data)])
    :contract
    (str/join " | "
              [(get-in structured-data [:party-b :name])
               (get-in structured-data [:party-b :vat-id])
               (:contract-number structured-data)])
    :goods-received-note
    (str/join " | "
              [(get-in structured-data [:supplier :name])
               (get-in structured-data [:supplier :vat-id])
               (:grn-number structured-data)
               (:po-reference structured-data)])))
;; Evidence signal weights
(def evidence-signals
  {;; Exact reference matches (highest value)
   :po-number-exact     60
   :contract-ref-exact  55
   :po-ref-exact        55
   ;; Identity matches
   :vat-id-match        30
   :iban-match          25
   ;; Fuzzy/range matches
   :amount-within-2pct  20  ; compared against remaining PO amount for Invoice-PO
   :amount-within-5pct  10
   :supplier-name-fuzzy 15  ; >0.8 similarity
   :date-in-validity    10
   ;; Negative signals
   :vat-id-mismatch    -40})
(def match-thresholds
  {:high   0.70
   :medium 0.50
   :low    0.30})
Scoring normalizes sum of weights to 0.0–1.0 range.
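The normalization formula is not spelled out here. The sketch below assumes the simplest reading: sum the weights of the signals that fired, divide by 100, and clamp to the 0.0–1.0 range. It also assumes the two amount signals are mutually exclusive tiers. `amount-signal` and `normalize-score` are hypothetical names.

```clojure
(defn amount-signal
  "Returns the amount signal for an invoice total compared against the
  remaining (not yet invoiced) PO amount, or nil when they differ by
  more than 5%. ASSUMPTION: the 2% and 5% signals are exclusive tiers."
  [invoice-total remaining]
  (when (pos? remaining)
    (let [relative-diff (/ (Math/abs (double (- invoice-total remaining)))
                           remaining)]
      (cond
        (<= relative-diff 0.02) :amount-within-2pct
        (<= relative-diff 0.05) :amount-within-5pct))))

(defn normalize-score
  "Sums the weights of the fired signals and maps the sum to 0.0-1.0.
  ASSUMPTION: divide by 100 and clamp; the source does not give the formula.
  `weights` is a map such as evidence-signals; unknown signals count as 0."
  [weights signals]
  (let [total (transduce (map #(get weights % 0)) + 0 signals)]
    (-> total (/ 100.0) (max 0.0) (min 1.0))))
```

With the weights above, an exact PO number plus a VAT ID match sums to 90 and, under this assumption, scores 0.9, which clears the `:high` threshold of 0.70. A lone VAT ID mismatch sums to -40 and clamps to 0.0.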
Per matchable type, not global:
(defn match-document!
  [doc]
  ;; 1. Clear existing matches (may split clusters)
  (remove-matches-for-document! (:id doc))
  ;; 2. Find and score candidates
  (let [all-candidates (->> (find-candidates doc)
                            (map #(assoc % :score (compute-score doc %)))
                            (filter #(>= (:score %) (:low match-thresholds))))
        by-type (group-by #(get-in % [:doc :type]) all-candidates)]
    ;; 3. Decide per document type
    (doseq [[doc-type candidates] by-type]
      (let [sorted (sort-by :score > candidates)
            high-candidates (filter #(>= (:score %) (:high match-thresholds)) sorted)]
        (cond
          ;; Single high-confidence candidate → match directly
          (= 1 (count high-candidates))
          (create-match! doc (first high-candidates) {:match-method "rule-based"})
          ;; Any other case with candidates → LLM decides
          (seq candidates)
          (let [llm-result (llm-match-decision doc candidates)]
            (doseq [match (:matches llm-result)]
              (create-match! doc (:candidate match) {:match-method "llm"}))))))))
| Scenario | Action |
|---|---|
| 1 candidate ≥ high (per type) | Match directly, no LLM |
| Multiple candidates ≥ high | LLM decides |
| Candidates between low and high | LLM decides |
| No candidates ≥ low | No match, document orphaned |
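For the "LLM decides" rows, `match-document!` passes `(:candidate match)` straight to `create-match!`, so something has to turn the model's reply back into candidate maps. The helper below is a sketch of that step under stated assumptions: the reply is already parsed into a map with keyword keys, each entry carries a 1-based candidate number and a confidence label, and only `high` or `medium` entries are accepted. `accepted-candidates` is a hypothetical name.

```clojure
(defn accepted-candidates
  "Maps a parsed LLM reply back to the candidates it confirms.
  ASSUMPTIONS: `llm-response` is a map with keyword keys, and :candidate
  is a 1-based position in `candidates`. Entries with low confidence,
  a non-integer candidate, or an out-of-range position are dropped."
  [candidates llm-response]
  (let [candidates (vec candidates)]
    (for [{:keys [candidate confidence]} (:matches llm-response)
          :when (and (integer? candidate)
                     (contains? #{"high" "medium"} confidence))
          :let [chosen (get candidates (dec candidate))]
          :when chosen]
      chosen)))
```

Dropping out-of-range positions rather than throwing keeps a malformed reply from creating a match against the wrong document.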
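When invoked: for a given document type, either several candidates score ≥ high, or candidates exist but none reaches high.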
Prompt structure:
(defn build-match-prompt
  [{:keys [source-doc candidates]}]
  {:system "You are a document matching assistant for financial documents.
Determine which candidate(s) match the source document.
Consider: supplier identity, amounts, dates, reference numbers.
Be conservative - only confirm matches you're confident about."
   :user (str "## Source Document\n"
              (format-document-summary source-doc)
              "\n\n## Candidates\n"
              (format-candidates candidates)
              "\n\n## Task\n"
              "Which candidate(s) match? Return JSON:\n"
              "{\"matches\": [{\"candidate\": 1, \"confidence\": \"high|medium|low\", \"reasoning\": \"...\"}]}")})
Response handling:
- `high` or `medium` confidence → create match
- `low` confidence → no match

Endpoint:
GET /api/documents/:id/matches
Returns:
{
  "matches": [
    {
      "document_id": "uuid",
      "document_type": "purchase-order",
      "document_number": "PO-2024-001",
      "confidence": 0.95,
      "match_method": "rule-based",
      "matched_at": "2026-02-24T10:00:00Z"
    }
  ],
  "cluster_id": "uuid"
}
UI: Simple list on document detail page showing matched documents with confidence.