Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Four-Way Document Matching Design

Date: 2026-02-24
Status: Draft

Overview

Automatic matching of financial documents: Invoice ↔ Purchase Order ↔ Contract ↔ Goods Received Note. Documents arrive in arbitrary order; matching runs asynchronously on each arrival.

Goals

Non-Goals (v1)


Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Ingestion     │     │   SQS Queue     │     │   Matching      │
│   Pipeline      │────▶│ document-ready  │────▶│   Worker        │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                        │
                                                        ▼
                                               ┌─────────────────┐
                                               │  PostgreSQL     │
                                               │  - document     │
                                               │  - document_match│
                                               └─────────────────┘

Event flow:

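  1. Document completes ingestion with valid_structured_data = true
  2. Ingestion publishes a document-ready event to SQS
  3. Matching worker picks up the event
  4. Worker runs match-document!, which clears the document's existing matches, retrieves and scores candidates, and decides matches per candidate type
  5. Matching is idempotent: re-processing the same document produces the same results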

Future: Replace SQS with SNS fan-out for multiple subscribers with different delivery guarantees.


Data Model

Schema Changes

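-- The vector type and the hnsw index below require the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Cluster membership on document (denormalized so cluster lookups are a single indexed read, no graph traversal)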
ALTER TABLE document ADD COLUMN cluster_id uuid;
ALTER TABLE document ADD COLUMN searchable_text text;
ALTER TABLE document ADD COLUMN embedding vector(768);

CREATE INDEX idx_document_cluster_id ON document(cluster_id);
CREATE INDEX idx_document_embedding ON document USING hnsw (embedding vector_cosine_ops);
CREATE INDEX idx_document_searchable_text ON document USING gin (to_tsvector('simple', searchable_text));

-- Pairwise match edges (source of truth)
CREATE TABLE document_match (
  document_a_id  uuid REFERENCES document(id) ON DELETE CASCADE,
  document_b_id  uuid REFERENCES document(id) ON DELETE CASCADE,
  confidence     decimal(5,4),
  match_method   text NOT NULL,  -- 'rule-based', 'llm'
  evidence       jsonb,          -- [{"signal": "po-number-exact", "value": "PO-123", "weight": 60}, ...]
  created_at     timestamptz NOT NULL DEFAULT now(),
  updated_at     timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (document_a_id, document_b_id),
  CHECK (document_a_id < document_b_id)  -- canonical ordering: each unordered pair is stored exactly once
);

CREATE INDEX idx_document_match_b ON document_match(document_b_id);
CREATE INDEX idx_document_match_created ON document_match(created_at);
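
Because of the CHECK constraint, callers have to order the two ids before inserting an edge. The sketch below is one way to do that; canonical-pair is a hypothetical helper name, not part of the design. It compares the lowercase string forms of the UUIDs, which should agree with PostgreSQL's byte-wise uuid ordering, whereas java.util.UUID's own compareTo treats the two 64-bit halves as signed longs and can disagree for ids with the high bit set.

```clojure
(defn canonical-pair
  "Returns [a b] with a sorted before b, so that the pair satisfies
   CHECK (document_a_id < document_b_id). Hypothetical helper."
  [id-a id-b]
  ;; Compare string forms: hex digits and hyphen positions sort the same
  ;; way as the raw bytes, unlike java.util.UUID/compareTo.
  (let [c (compare (str id-a) (str id-b))]
    (cond
      (neg? c) [id-a id-b]
      (pos? c) [id-b id-a]
      :else    (throw (ex-info "a document cannot be matched to itself"
                               {:id id-a})))))
```

Usage: (canonical-pair invoice-id po-id) yields the two values to bind to document_a_id and document_b_id, regardless of which document arrived first.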

Cluster Operations

Create match (three cases; if both documents already share a cluster_id, nothing changes):

  1. Neither document has cluster_id → generate new UUID, assign to both
  2. One has cluster_id → assign the same cluster_id to the other
  3. Both have different cluster_id → merge: reassign every document in cluster_b to cluster_a
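
The create-match cases can be sketched as a pure function over an in-memory map of document id to cluster id. This is only an illustration: assign-cluster and new-id-fn are hypothetical names, and the real operation would be UPDATE statements on document.cluster_id inside a transaction.

```clojure
(defn assign-cluster
  "Returns the updated {doc-id cluster-id} map after matching doc-a and doc-b.
   new-id-fn is a hypothetical hook that generates a fresh cluster id."
  [clusters doc-a doc-b new-id-fn]
  (let [ca (get clusters doc-a)
        cb (get clusters doc-b)]
    (cond
      ;; Case 1: neither document is clustered yet
      (and (nil? ca) (nil? cb))
      (let [cid (new-id-fn)]
        (assoc clusters doc-a cid doc-b cid))

      ;; Case 2: exactly one side already has a cluster
      (nil? ca) (assoc clusters doc-a cb)
      (nil? cb) (assoc clusters doc-b ca)

      ;; Both already share a cluster: nothing to do
      (= ca cb) clusters

      ;; Case 3: merge cluster b into cluster a
      :else
      (into {}
            (map (fn [[doc cid]] [doc (if (= cid cb) ca cid)]))
            clusters))))
```

In production new-id-fn would typically be random-uuid or similar; passing it in keeps the function deterministic under test.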

Remove match (may disconnect cluster):

  1. Delete edge from document_match
  2. Run connected components on affected cluster
  3. If multiple components: first keeps original cluster_id, others get new UUIDs
  4. Single-node components with no edges get cluster_id = NULL
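
The removal steps can be sketched in memory as follows, given the documents of the affected cluster and the edges that remain after the delete. All function names here are hypothetical. The design does not say which component counts as "first", so this sketch assumes the largest component keeps the original cluster_id.

```clojure
(defn neighbours
  "Adjacency map {doc-id #{doc-id ...}} built from undirected [a b] edges."
  [edges]
  (reduce (fn [adj [a b]]
            (-> adj
                (update a (fnil conj #{}) b)
                (update b (fnil conj #{}) a)))
          {}
          edges))

(defn component
  "Set of nodes reachable from start, found by depth-first traversal."
  [adj start]
  (loop [stack [start] seen #{}]
    (if-let [node (peek stack)]
      (if (seen node)
        (recur (pop stack) seen)
        (recur (into (pop stack) (get adj node #{})) (conj seen node)))
      seen)))

(defn connected-components
  "Partitions nodes into connected components under the given edges."
  [nodes edges]
  (let [adj (neighbours edges)]
    (loop [remaining (set nodes) acc []]
      (if-let [start (first remaining)]
        (let [found (component adj start)]
          (recur (reduce disj remaining found) (conj acc found)))
        acc))))

(defn recluster
  "Returns {doc-id cluster-id} after an edge removal. Assumption: the largest
   multi-node component keeps original-id; others get ids from new-id-fn;
   isolated documents get nil."
  [original-id nodes edges new-id-fn]
  (let [comps    (connected-components nodes edges)
        multi    (sort-by count > (filter #(> (count %) 1) comps))
        ids      (cons original-id (repeatedly new-id-fn))
        assigned (mapcat (fn [c cid] (map #(vector % cid) c)) multi ids)
        orphans  (for [c comps :when (= 1 (count c)) n c] [n nil])]
    (into {} (concat assigned orphans))))
```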

Integrity: A periodic job rebuilds connected components from the document_match edges and verifies that every document.cluster_id is consistent with them.

Consumed Amount (Dynamic)

-- How much of this PO has been invoiced?
SELECT COALESCE(SUM((d.structured_data->>'total')::decimal), 0)
FROM document_match dm
JOIN document d ON d.id = dm.document_a_id OR d.id = dm.document_b_id
WHERE (dm.document_a_id = :po_id OR dm.document_b_id = :po_id)
  AND d.id != :po_id
  AND d.type = 'invoice';

There is no cached column; the consumed amount is computed on demand.
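
The consumed amount feeds the amount signals used in scoring, where an invoice total is compared against the remaining PO amount. A minimal sketch of that comparison follows. It assumes the tolerance is measured relative to the remaining amount, which the design does not spell out, and amount-signal is a hypothetical name.

```clojure
(defn amount-signal
  "Returns :amount-within-2pct, :amount-within-5pct, or nil.
   Assumption: tolerance is relative to the remaining PO amount
   (po-total minus consumed)."
  [po-total consumed invoice-total]
  (let [remaining (- po-total consumed)
        delta     (- invoice-total remaining)
        abs-delta (if (neg? delta) (- delta) delta)]
    (when (pos? remaining)
      ;; Cross-multiply instead of dividing, so BigDecimal inputs never
      ;; hit a non-terminating decimal expansion.
      (cond
        (<= (* 100 abs-delta) (* 2 remaining)) :amount-within-2pct
        (<= (* 100 abs-delta) (* 5 remaining)) :amount-within-5pct))))
```

For a PO of 1000 with 400 already invoiced, an invoice of 610 falls within 2% of the remaining 600, while an invoice of 700 yields no amount signal.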


Matchable Pairs

(def matchable-pairs
  #{#{:invoice :purchase-order}
    #{:invoice :contract}
    #{:purchase-order :contract}
    #{:goods-received-note :purchase-order}})

Invoice ↔ GRN and GRN ↔ Contract are indirect (via PO).
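
Candidate retrieval calls get-matchable-types, which this document does not define. One plausible sketch derives it from the pair set; the implementation is an assumption, and matchable-pairs is repeated here only so the snippet stands alone.

```clojure
(def matchable-pairs
  #{#{:invoice :purchase-order}
    #{:invoice :contract}
    #{:purchase-order :contract}
    #{:goods-received-note :purchase-order}})

(defn get-matchable-types
  "Returns the set of document types that can be matched directly
   with doc-type (assumed implementation)."
  [doc-type]
  (into #{}
        (comp (filter #(contains? % doc-type))   ; pairs involving doc-type
              (mapcat #(disj % doc-type)))       ; keep the other member
        matchable-pairs))
```

Under this sketch a goods received note only ever searches for purchase orders, which is consistent with GRN links to invoices and contracts being indirect.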


Candidate Retrieval

Single broad search, type-specific scoring:

(defn find-candidates
  [doc]
  (let [matchable-types (get-matchable-types (:type doc))
        query-text (build-searchable-text doc)
        query-embedding (embed query-text)]
    (-> (hybrid-search {:legal-entity-id (:legal-entity-id doc)
                        :types matchable-types
                        :query-text query-text
                        :query-embedding query-embedding
                        :limit 50})
        (exclude-self doc))))
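
The design does not specify how hybrid-search combines full-text and embedding results. One common option, shown here purely as an assumption, is reciprocal rank fusion: each ranked list contributes 1/(k + rank) per document and the sums decide the final order. rrf-merge is a hypothetical name, and k = 60 is a conventional default rather than a value taken from this design.

```clojure
(defn rrf-merge
  "Fuses ranked id lists with reciprocal rank fusion and returns the
   top `limit` ids. Ranks are 1-based within each list."
  [k limit & ranked-lists]
  (->> ranked-lists
       (mapcat (fn [ids]
                 (map-indexed (fn [i id] [id (/ 1.0 (+ k (inc i)))]) ids)))
       ;; Sum the contributions of each id across all lists
       (reduce (fn [acc [id s]] (update acc id (fnil + 0.0) s)) {})
       (sort-by val >)
       (map key)
       (take limit)))
```

A document that ranks well in both lists outranks one that tops only a single list, which suits documents sharing both a reference number and a similar supplier description.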

Searchable text (same for indexing and querying):

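;; Assumes the namespace requires [clojure.string :as str].
(defn build-searchable-text
  "Builds the pipe-separated text used for both indexing and querying a document."
  [{:keys [type structured-data]}]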
  (case type
    :invoice
    (str/join " | "
      [(get-in structured-data [:issuer :name])
       (get-in structured-data [:issuer :vat-id])
       (:invoice-number structured-data)
       (some-> structured-data :total str)
       (:currency structured-data)
       (:po-reference structured-data)])

    :purchase-order
    (str/join " | "
      [(get-in structured-data [:supplier :name])
       (get-in structured-data [:supplier :vat-id])
       (:po-number structured-data)
       (some-> structured-data :total str)
       (:currency structured-data)
       (:contract-reference structured-data)])

    :contract
    (str/join " | "
      [(get-in structured-data [:party-b :name])
       (get-in structured-data [:party-b :vat-id])
       (:contract-number structured-data)])

    :goods-received-note
    (str/join " | "
      [(get-in structured-data [:supplier :name])
       (get-in structured-data [:supplier :vat-id])
       (:grn-number structured-data)
       (:po-reference structured-data)])))

Evidence Signals & Scoring

;; Evidence signal weights
(def evidence-signals
  {;; Exact reference matches (highest value)
   :po-number-exact      60
   :contract-ref-exact   55
   :po-ref-exact         55

   ;; Identity matches
   :vat-id-match         30
   :iban-match           25

   ;; Fuzzy/range matches
   :amount-within-2pct   20  ; compared against remaining PO amount for Invoice-PO
   :amount-within-5pct   10
   :supplier-name-fuzzy  15  ; >0.8 similarity
   :date-in-validity     10

   ;; Negative signals
   :vat-id-mismatch     -40})

(def match-thresholds
  {:high   0.70
   :medium 0.50
   :low    0.30})

Scoring sums the weights of the signals that fired and normalizes the result to the 0.0–1.0 range.
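
The normalization itself is not specified. The sketch below assumes the summed weight is divided by a fixed maximum and clamped, so that negative signals cannot push a score below 0.0 and stacked positives cannot exceed 1.0. normalize-score and the max-weight of 100 are hypothetical; only a subset of the weights is repeated so the snippet stands alone.

```clojure
(def signal-weights
  ;; Subset of evidence-signals, copied from the definition above
  {:po-number-exact    60
   :vat-id-match       30
   :amount-within-2pct 20
   :vat-id-mismatch   -40})

(defn normalize-score
  "Sums the weights of the fired signals, divides by max-weight
   (an assumed constant) and clamps the result to [0.0, 1.0]."
  [weights max-weight fired]
  (let [total (reduce + 0 (map #(get weights % 0) fired))]
    (-> (/ total (double max-weight))
        (max 0.0)
        (min 1.0))))
```

With a max-weight of 100, an exact PO number plus a VAT ID match scores 0.9 and clears the high threshold, while an exact PO number alone scores 0.6 and is left to the LLM path.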


Decision Logic

Per matchable type, not global:

(defn match-document!
  [doc]
  ;; 1. Clear existing matches (may split clusters)
  (remove-matches-for-document! (:id doc))

  ;; 2. Find and score candidates
  (let [all-candidates (->> (find-candidates doc)
                            (map #(assoc % :score (compute-score doc %)))
                            (filter #(>= (:score %) (:low match-thresholds))))
        by-type (group-by #(get-in % [:doc :type]) all-candidates)]

    ;; 3. Decide per document type
    (doseq [[doc-type candidates] by-type]
      (let [sorted (sort-by :score > candidates)
            high-candidates (filter #(>= (:score %) (:high match-thresholds)) sorted)]

        (cond
          ;; Single high-confidence candidate → match directly
          (= 1 (count high-candidates))
          (create-match! doc (first high-candidates) {:match-method "rule-based"})

          ;; Any other case with candidates → LLM decides
          (seq candidates)
          (let [llm-result (llm-match-decision doc candidates)]
            (doseq [match (:matches llm-result)]
              (create-match! doc (:candidate match) {:match-method "llm"}))))))))

Scenario                          | Action
----------------------------------|-----------------------------
1 candidate ≥ high (per type)     | Match directly, no LLM
Multiple candidates ≥ high        | LLM decides
Candidates between low and high   | LLM decides
No candidates ≥ low               | No match, document orphaned

LLM Integration

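When invoked: per candidate type, whenever at least one candidate scores ≥ low but there is not exactly one candidate scoring ≥ high (see Decision Logic).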

Prompt structure:

(defn build-match-prompt
  [{:keys [source-doc candidates]}]
  {:system "You are a document matching assistant for financial documents.
            Determine which candidate(s) match the source document.
            Consider: supplier identity, amounts, dates, reference numbers.
            Be conservative - only confirm matches you're confident about."

   :user (str "## Source Document\n"
              (format-document-summary source-doc)
              "\n\n## Candidates\n"
              (format-candidates candidates)
              "\n\n## Task\n"
              "Which candidate(s) match? Return JSON:\n"
              "{\"matches\": [{\"candidate\": 1, \"confidence\": \"high|medium|low\", \"reasoning\": \"...\"}]}")})

Response handling:
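
The details of response handling are not given here. As a sketch under stated assumptions: the prompt asks for a candidate number, so the parsed response has to be mapped back to candidate maps before create-match! can use it. The code assumes candidates are numbered from 1 in the order they were sent, that the JSON has already been parsed into a map with keyword keys, and that "low" confidence answers are discarded in line with the prompt's instruction to be conservative. resolve-llm-matches is a hypothetical name.

```clojure
(defn resolve-llm-matches
  "Maps 1-based candidate numbers in a parsed LLM response back to the
   candidate maps that were sent. Out-of-range numbers and answers whose
   confidence is not accepted are dropped (assumed policy)."
  [parsed-response candidates]
  (let [candidates (vec candidates)
        accepted   #{"high" "medium"}]
    (for [{:keys [candidate confidence reasoning]} (:matches parsed-response)
          :let  [hit (when (integer? candidate)
                       (get candidates (dec candidate)))]
          :when (and hit (accepted confidence))]
      {:candidate  hit
       :confidence confidence
       :reasoning  reasoning})))
```

Each returned map carries the resolved candidate under :candidate, the same key match-document! reads when it calls create-match!.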


API & UI

Endpoint:

GET /api/documents/:id/matches

Returns:

{
  "matches": [
    {
      "document_id": "uuid",
      "document_type": "purchase-order",
      "document_number": "PO-2024-001",
      "confidence": 0.95,
      "match_method": "rule-based",
      "matched_at": "2026-02-24T10:00:00Z"
    }
  ],
  "cluster_id": "uuid"
}
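
Because edges are stored in canonical order, the requested document may sit on either side of a document_match row. The sketch below shows one way to build an entry of the matches array from an edge row and the matched document's row. The map keys (:document-a-id, :number, :created-at and so on) are assumed shapes for already-fetched rows, and mapping matched_at to created_at is a guess.

```clojure
(defn other-document-id
  "Returns the id on the opposite side of the edge from doc-id."
  [edge doc-id]
  (if (= doc-id (:document-a-id edge))
    (:document-b-id edge)
    (:document-a-id edge)))

(defn edge->match-entry
  "Builds one element of the matches array. Key names on the input maps
   are hypothetical."
  [edge other-doc]
  {:document_id     (:id other-doc)
   :document_type   (name (:type other-doc))
   :document_number (:number other-doc)
   :confidence      (:confidence edge)
   :match_method    (:match-method edge)
   :matched_at      (str (:created-at edge))})  ; assumption: created_at
```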

UI: Simple list on document detail page showing matched documents with confidence.


Future Work