Case Study · Major Energy Operator

Invoice Governance Across Source-to-Pay at a Major Energy Operator

Industry: Oil & Gas — Source-to-Pay
Stack before: pgvector + PostgreSQL + custom ETL + homegrown audit logging
Stack after: Infino

The Problem

A major energy operator runs procurement across hundreds of suppliers, thousands of contracts, and versioned pricebooks that change on a rolling effective-date schedule. Invoices arrive as PDFs. Operational logs live in a separate field system. Contract scope is buried in legal prose. The same piece of equipment gets called "MWD Service," "measurement while drilling - daily," or "CIBP" depending on which supplier's AP clerk filled out the line item.

The operator needed to detect five classes of billing discrepancy — non-catalog overcharges, third-party markup violations, duplicate invoices, out-of-scope billing, and performance-based payment failures — reliably, repeatably, and with enough explainability to send to a supplier or hold up in an audit. At the volume of invoice lines running through the Source-to-Pay system, spot-checking is not a control.

Where pgvector Hits a Wall

The first prototype used the obvious stack: a document extraction service for PDFs, pgvector for embedding-based description matching, PostgreSQL for structured joins.

It worked in the demo. Here's where it broke down in practice.

Description matching requires vector search scoped by structured predicates. Finding invoice descriptions that semantically match pricebook items sounds like a vector search problem. It is — but only when scoped to the pricebook version effective on the invoice date, and to the same product/service category. In pgvector, filtered vector search works post-hoc: retrieve the top-K nearest neighbors globally, then discard the ones that don't match the date range and category. When the filters are highly selective — the normal case when you have 30 contracts, each with its own effective pricebook — you either raise K and traverse most of the graph, or you throw away most of the result set and come back with too few usable candidates. Recall degrades.

Joining vector results with structured data is application code. A single non-catalog detection step requires: call an embedding API for the invoice line description, pass the vector to pgvector, receive candidate IDs, join those against the pricebook effective-date table in PostgreSQL, compare prices with UoM normalization, and aggregate by supplier. Across a real Source-to-Pay backlog, that's a loop the application owns — with its own failure modes and retry logic at each hop.
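
A stripped-down sketch of what that loop looked like in the prototype, with placeholder table and column names, an embed callable standing in for whichever embedding API was used, and the per-hop retry logic omitted:

# Illustrative sketch of the prototype's application-owned loop.
# Names are placeholders; each numbered hop had its own failure modes.

def detect_noncatalog_overcharges(conn, embed, invoice_lines):
    """conn: a psycopg2 connection with pgvector installed; embed: text -> list[float]."""
    flagged = []
    for line in invoice_lines:
        vec = embed(line["description"])                    # hop 1: embedding API
        qvec = "[" + ",".join(str(x) for x in vec) + "]"    # pgvector literal
        with conn.cursor() as cur:
            # hop 2: global top-K from pgvector; no date/category scoping here,
            # the filtering happens post-hoc in application code
            cur.execute(
                "SELECT item_id, psl, pricebook_id, unit_price, uom "
                "FROM pricebook_items ORDER BY embedding <=> %s::vector LIMIT 50",
                (qvec,),
            )
            candidates = cur.fetchall()
            # hop 3: a second query for the pricebooks effective on the invoice date
            cur.execute(
                "SELECT pricebook_id FROM pricebooks WHERE contract_id = %s "
                "AND effective_date <= %s AND (end_date IS NULL OR end_date >= %s)",
                (line["contract_id"], line["invoice_date"], line["invoice_date"]),
            )
            effective = {r[0] for r in cur.fetchall()}
        # hop 4: discard candidates outside the effective pricebook or PSL category,
        # then compare prices (UoM normalization elided)
        for item_id, psl, pricebook_id, unit_price, uom in candidates:
            if pricebook_id in effective and psl == line["psl"] \
                    and line["unit_price"] > unit_price:
                flagged.append((line["line_id"], item_id, line["unit_price"] - unit_price))
    return flagged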

Governance lives in three places and they don't compose. Column privileges in PostgreSQL for direct SQL access. Middleware in the application for API filtering. Separate instrumentation for audit logging. Change the role definition and you have to update all three. A new query path — say, an agent running a different code path — bypasses whichever layers don't cover it. Auditors asking "what did the system access last Tuesday" get a query to write, not an answer.
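
The shape of that fragmentation, with placeholder names: the same rule written three separate times, in three layers that nothing keeps in sync:

# Illustrative only: the rule "AP reviewers don't see markup terms"
# expressed once per layer, with placeholder names throughout.

# 1. Column privileges in PostgreSQL, for direct SQL access
#    (markup columns deliberately left out of the grant):
COLUMN_GRANT = (
    "GRANT SELECT (contract_id, supplier_id, scope_text) "
    "ON contracts TO ap_reviewer;"
)

# 2. Middleware filtering in the application, for the API path:
def scrub_for_role(record: dict, role: str) -> dict:
    if role == "ap_reviewer":
        record.pop("third_party_markup_pct", None)
        record.pop("markup_clause_text", None)
    return record

# 3. Hand-rolled audit instrumentation, wherever someone remembered to call it:
import logging
audit = logging.getLogger("audit")

def log_access(user: str, dataset: str, fields: list[str]) -> None:
    audit.info("user=%s dataset=%s fields=%s", user, dataset, ",".join(fields))

# A new query path (an agent, a scheduled job) gets only whichever of these
# its author remembered to wire in.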

The Architecture That Worked

PDF contracts / invoices / pricebooks
  → Document extraction
  → Infino ingest

Standard datasets (structured joins)
  invoices · pricebooks · work_orders · wellview_daily_reports

Vector-indexed datasets (auto-embedding on ingest)
  invoice_line_items     ← free-text descriptions
  pricebook_items        ← catalog item descriptions
  contracts              ← clause text
  subcontracts           ← scope of work / explicit exclusions
  service_confirmations  ← field tickets

Governance layer (enforced uniformly across all query paths)
  RBAC: procurement_manager / ap_reviewer / auditor
  Field-level security: markup rates, clause text
  Audit log: every request, every field accessed, queryable

Infino stores vectors, structured records, and full-text in the same engine and queries them in a single request. Embedding is automatic on ingest — designate a dataset as vector-indexed and Infino handles embedding generation and HNSW graph maintenance transparently. The application never calls an embedding API or manages those credentials.
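
A minimal sketch of what that designation could look like through the same SDK shown later for roles; the create_dataset name and its parameters are assumptions for illustration, not the documented client surface:

# Hypothetical sketch: method name and parameters are assumptions, shown only
# to illustrate "designate a dataset as vector-indexed" at ingest time.
sdk.create_dataset(
    "invoice_line_items",
    schema={
        "line_id":      "keyword",
        "description":  "text",      # free-text supplier description
        "unit_price":   "float",
        "uom":          "keyword",
        "invoice_date": "date",
        "psl":          "keyword",
        "contract_id":  "keyword",
    },
    vector_fields=["description"],   # embedding + HNSW maintenance happen in-engine
)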

The agent's first pass for any single invoice line is a Query DSL request against pricebook_items alone — a filtered hybrid search that combines structured predicates (PSL category, contract scope) with semantic ranking on the catalog description, all evaluated during graph traversal:

{
  "index": "pricebook_items",
  "query": {
    "bool": {
      "filter": [
        { "term":  { "psl": "1.4.2" } },
        { "term":  { "is_flex_price": false } },
        { "terms": { "pricebook_id": ["pb_a7f1", "pb_a7f2"] } }
      ],
      "must": [
        {
          "semantic": {
            "field": "description",
            "query": "high-pressure mud pump fluid end module",
            "k": 10
          }
        }
      ]
    }
  },
  "fields": ["item_id", "description", "unit_price", "uom"]
}
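
Issuing that request from the agent is a single call; the sketch below assumes a hypothetical sdk.search entry point and response shape rather than the documented client API, with query holding the request body above:

# Hypothetical: sdk.search and the hit structure are assumptions.
hits = sdk.search(index="pricebook_items", body=query)   # query = the request above
for hit in hits:
    # each candidate returns the requested catalog fields plus a similarity score
    print(hit["item_id"], hit["unit_price"], hit["uom"], hit["_score"])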

That's enough to ground a single line. The full overcharge sweep runs the same comparison across every non-catalog line on the invoice, joined against invoice_line_items and the pricebook versions effective for each contract on the invoice date — a multi-dataset join, which is what SQL is for:

SELECT
    il.line_id,
    il.description                      AS invoiced_description,
    pi.description                      AS matched_catalog_item,
    il.unit_price                       AS invoiced_price,
    pi.unit_price                       AS catalog_price,
    il.unit_price - pi.unit_price       AS price_delta,
    il.uom, pi.uom, il.supplier_id
FROM invoice_line_items il
JOIN pricebook_items pi
    ON pi.pricebook_id IN (
        SELECT pricebook_id FROM pricebooks
        WHERE contract_id = il.contract_id
          AND effective_date <= il.invoice_date
          AND (end_date IS NULL OR end_date >= il.invoice_date)
    )
   AND pi.psl = il.psl
   AND pi.item_id IN (SEMANTIC_SEARCH(il.description, 'description', 10))
WHERE il.is_catalog = false
  AND pi.is_flex_price = false
  AND il.unit_price > pi.unit_price
ORDER BY price_delta DESC;

No embedding call. No result merging. The vector traversal and the date range join and the price comparison are one plan.

Filtered vector search evaluates predicates during graph traversal, not after. When the query includes a category or date filter, the search doesn't retrieve global candidates and discard them — it walks only the portion of the graph that satisfies the predicate. Recall stays stable as the filters get more selective. This is the difference between a system that works in a filtered production setting and one tuned for demo recall numbers.
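
A toy, single-layer sketch of the idea (not Infino's implementation): the structured predicate is evaluated as the neighbor graph is walked, so only admissible nodes ever enter the result set and nothing is discarded afterwards:

# Toy sketch of in-traversal filtering over a neighbor graph.
import heapq

def filtered_search(graph, dist, entry, predicate, k):
    """graph: node -> iterable of neighbors; dist(node): distance from the query;
    predicate(node): the structured filter (category, effective date, ...)."""
    visited = {entry}
    frontier = [(dist(entry), entry)]          # min-heap of candidates to expand
    results = []                               # max-heap (negated) of admissible hits
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(results) == k and d > -results[0][0]:
            break                              # nothing closer left to find
        if predicate(node):                    # checked during traversal, not after
            heapq.heappush(results, (-d, node))
            if len(results) > k:
                heapq.heappop(results)         # drop the farthest admissible hit
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier, (dist(nb), nb))
    return sorted((-nd, n) for nd, n in results)   # (distance, node), nearest first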

A Team of Agents, Not One

Invoice governance is not one agent running one query. It's a pipeline of specialized agents, each with a narrow job, coordinating through shared state:

  • Extraction agent — ingests incoming contract, invoice, and pricebook PDFs; normalizes them into typed records; writes to the shared line-item and contract datasets.
  • Clause agent — pulls the applicable contract, sub-contract, and work-order clauses for each invoice line based on effective date and hierarchy; stores the resolved clause context alongside the line.
  • Matching agent — runs the semantic + structured filter join against pricebook items scoped to the effective date and product category; writes candidate matches with similarity scores.
  • Variance agent — computes price, markup, and UoM variances against the matched clause and catalog entry; classifies each line as clean, over-billed, out-of-scope, or ineligible markup.
  • Duplicate agent — runs exact, near-duplicate, and split-invoice detection across the invoice history.
  • NPT agent — joins performance clauses to operational logs and computes payable vs non-payable days.
  • Exception reviewer agent — assembles flags from the specialist agents, deduplicates, assigns confidence, and routes to the AP reviewer, procurement manager, or audit queue.
  • Supplier-response agent — drafts the supplier communication for each flagged line, grounded in the specific clause and line record.

These are not independent scripts wired together. They are agents sharing one backend — one identity model, one retrieval layer, one audit log. When the extraction agent writes a new line, the matching agent sees it. When the variance agent flags an issue, the exception reviewer sees the full provenance. There are no CSV handoffs, no vector-store-to-warehouse sync jobs, no separate auth contexts to reconcile.
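
A compressed sketch of that coordination; the dataset and column names follow the architecture above, but sdk.write, sdk.sql, the match_status column, and the line_item_matches dataset are illustrative stand-ins rather than the documented client surface:

# Illustrative only: two of the specialist agents sharing one backend.
# Every call runs under the agent's own role and lands in the audit log.

def extraction_agent(sdk, parsed_invoice):
    # normalizes extracted PDF lines into typed records on the shared dataset;
    # the free-text description is embedded on ingest, not here
    for line in parsed_invoice["lines"]:
        sdk.write("invoice_line_items", line)

def matching_agent(sdk):
    # sees the lines the extraction agent just wrote: no queue, no CSV handoff
    pending = sdk.sql(
        "SELECT line_id, description, contract_id, invoice_date, psl "
        "FROM invoice_line_items WHERE match_status = 'pending'"
    )
    for line in pending:
        # the hybrid search from the previous section would run here (elided);
        # its outcome is written back for the variance agent to pick up
        sdk.write("line_item_matches", {"line_id": line["line_id"],
                                        "status": "matched"})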

Governance Declared Once, Enforced Everywhere

The governance rules are the same rules whether the query is written by a human, generated by an agent, or scheduled as a recurring job. They are declared once on the backend and enforced inside the engine on every query that runs:

sdk.create_role("ap_reviewer", permissions={
    "datasets": ["invoice_line_items", "invoices"],
    "field_deny": {
        "contracts":       ["third_party_markup_pct", "markup_clause_text"],
        "pricebook_items": ["unit_price"]
    },
    "row_filter": {
        "invoice_line_items": "cost_center IN ($user.cost_centers)"
    }
})

Three layers of access control, all server-side:

  • RBAC — the AP reviewer agent can run detections and assemble exceptions, but cannot access pricing tables or modify contract records.
  • Field-level security — contracted markup percentages and clause text are masked from any query the AP reviewer agent issues, regardless of whether the query is SQL, semantic search, or a natural language question routed through its console.
  • Row-level security — the AP reviewer agent only sees line items for cost centers its user is assigned to. The row filter is applied to the row source before the plan executes, not by application code.

The alternative is the usual pattern: column privileges in the warehouse, middleware filtering in the app, separate auth in the vector store, a side-channel for the scheduler, and a best-effort audit pipeline somewhere. Every new agent is a new place where a scope can leak. The backend-enforced model removes that entire class of work. A new specialist agent is declared with a role, and the governance applies automatically to every query it ever issues.
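
For instance, onboarding the auditor role from the architecture sketch is one more declaration of the same shape (the dataset selection and empty deny lists here are illustrative):

# Same declaration shape as the ap_reviewer role above; contents illustrative.
sdk.create_role("auditor", permissions={
    "datasets": ["invoices", "invoice_line_items", "contracts", "pricebook_items"],
    "field_deny": {},     # auditors see markup rates and clause text in full
    "row_filter": {}      # and are not scoped to a cost center
})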

The audit log captures every request with requesting agent, requesting user, datasets accessed, fields returned, rows filtered, and timestamp. The log is itself a queryable dataset. "What did the matching agent access on Tuesday?" is a SQL query, not a grep through log files.
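
A sketch of that question as a query, assuming the log is exposed as an audit_log dataset with columns along these lines and reachable through a hypothetical sdk.sql call:

# Illustrative: dataset name, columns, and sdk.sql are assumptions.
rows = sdk.sql("""
    SELECT requesting_user, dataset, fields_returned, rows_filtered, ts
    FROM audit_log
    WHERE requesting_agent = 'matching_agent'
      AND ts >= DATE '2025-06-10' AND ts < DATE '2025-06-11'
    ORDER BY ts
""")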

Five Use Cases, One Substrate

Each detection use case is handled by the team of specialist agents against the same datasets. Each detection is a stored query — visible, reviewable, and schedulable as a recurring job — that returns a structured exception report with similarity scores, confidence ratings, monetary impact, and per-line explanations.

  • Non-catalog overcharges — higher invoiced price than catalog, UoM mismatch, abbreviation/synonym match, repetitive non-catalog billing
  • Markup violations — over-markup, ineligible markup on excluded categories, inconsistent markup across suppliers, hidden markup in lump sums
  • Duplicate invoices — exact match, near-duplicate descriptions, split invoices, consecutive-day billing without operational record
  • Out-of-scope billing — no scope match, explicit exclusion, wrong contract context, ambiguous free text
  • NPT performance — NPT above threshold, billing during downtime, inconsistent supplier logs, missing proration

The generated queries reference specific source records — invoice IDs, contract clauses, pricebook entries, operational log lines. Every exception in the output traces back to data, not to a model confidence score in isolation.

What Changed

Before: a separate vector store, a separate embedding pipeline with its own credentials and retry logic, application-layer auth checks that didn't compose across query paths, and custom instrumentation to produce any audit record at all. Every new agent added to the workflow was another integration point to secure, another scope to get wrong, another audit gap to close by hand.

After: one backend. The extraction, matching, variance, NPT, and reviewer agents all work on top of the same datasets, the same identity model, and the same audit log. Governance is declared once and applied uniformly — the AP reviewer agent cannot see markup rates no matter how the query is phrased, the matching agent only sees line items for its scoped cost centers, and every action any agent takes is recorded.

Exception reports that used to take the governance team a full sprint to produce, and couldn't be reproduced a month later, now run on a recurring schedule. Every flag traces back to a specific line item, a specific clause, a specific query, and the specific agent that produced it.

Questions? Get in touch.