Eval from search snapshots

You can’t tune what you don’t measure. This guide builds a small product corpus from real search snapshots, runs the examples/fashion-search eval harness offline (zero dependencies), then against the live hybrid engine, and reads the scorecard honestly.

raw snapshots → subset corpus → offline eval (keyword) → live ingest → live eval (hybrid)

Each snapshot is { query, results[] }: a search query and the products the upstream search returned for it. Taking a few results from each file yields a small corpus.json of Product and EvalQuery records, where each query’s relevant set is the products taken from its own snapshot.
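The subsetting step can be sketched as follows. This is a minimal illustration, not the actual build-lk-subset.ts: the `SnapshotResult` fields, the `buildSubset` name, the `q1..qN` naming, and deduplication by id are all assumptions.

```typescript
// Hypothetical shapes; real snapshot files may carry more fields.
type SnapshotResult = { id: string; title: string; price: number };
type Snapshot = { query: string; results: SnapshotResult[] };
type SubsetQuery = { name: string; q: string; relevant: string[] };

// Take the first `per` results of each snapshot. Those products become
// both corpus entries and the relevant set for that snapshot's query.
function buildSubset(snapshots: Snapshot[], per: number) {
  const products = new Map<string, SnapshotResult>(); // dedupe by id (assumption)
  const queries: SubsetQuery[] = [];
  snapshots.forEach((snap, i) => {
    const taken = snap.results.slice(0, per);
    for (const r of taken) products.set(r.id, r);
    queries.push({ name: `q${i + 1}`, q: snap.query, relevant: taken.map((r) => r.id) });
  });
  return { products: Array.from(products.values()), queries };
}

const demo = buildSubset(
  [
    { query: "linen shirt", results: [
      { id: "a", title: "Linen shirt", price: 3200 },
      { id: "b", title: "Oxford shirt", price: 4100 },
      { id: "c", title: "Silk shirt", price: 7800 },
    ] },
    { query: "work dress", results: [
      { id: "b", title: "Oxford shirt", price: 4100 },
      { id: "d", title: "Midi dress", price: 4900 },
    ] },
  ],
  2,
);
console.log(demo.products.length, demo.queries.map((q) => q.relevant));
```

Because relevance labels come from the upstream search itself, the eval measures agreement with that search rather than human judgement, which is worth keeping in mind when reading the scores.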

cd examples/fashion-search
LK_SNAPSHOTS_DIR=/path/to/search-snapshots bun build-lk-subset.ts --per 3
# → 30 products / 10 queries; q7 ("…under 5000") becomes a price-constrained query

The harness data model (validated by eval.ts):

type Product = { id, title, brand, category, colors[], material, price, available };
type EvalQuery = { name, q, filters?, constraints?, relevant: string[] };

With no live server configured, eval.ts runs an in-process keyword search — your baseline.
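To make the baseline concrete, here is one way an in-process keyword search could work. It is a sketch under assumptions: the scoring (count of distinct query terms found in title, brand, and category) and the names `tokenize` and `keywordSearch` are illustrative and may differ from what eval.ts does.

```typescript
// Hypothetical minimal document shape for the sketch.
type Doc = { id: string; title: string; brand: string; category: string };

// Lowercase and split into alphanumeric tokens.
const tokenize = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// Score each doc by how many distinct query terms it contains,
// drop non-matching docs, and return the top-k ids (ties broken by id).
function keywordSearch(docs: Doc[], q: string, k = 3): string[] {
  const terms = Array.from(new Set(tokenize(q)));
  return docs
    .map((d) => {
      const docTokens = new Set(tokenize(`${d.title} ${d.brand} ${d.category}`));
      const score = terms.filter((t) => docTokens.has(t)).length;
      return { id: d.id, score };
    })
    .filter((h) => h.score > 0)
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id))
    .slice(0, k)
    .map((h) => h.id);
}

const docs: Doc[] = [
  { id: "p1", title: "Black linen shirt", brand: "Acme", category: "shirts" },
  { id: "p2", title: "Red silk dress", brand: "Acme", category: "dresses" },
  { id: "p3", title: "White linen dress", brand: "Nord", category: "dresses" },
];
console.log(keywordSearch(docs, "linen dress"));
```

A baseline this simple has no notion of price or synonyms, which is why a query whose constraint lives in the text ("under 5000") can score zero offline.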

FASHION_DATASET_DIR=datasets/lk-snapshot-subset bun eval.ts

It scores relevance@3 (does each query surface its own products), constraint compliance (do hits respect price/availability), zero-result and relaxation rates, and latency, writing .samesake/fashion-eval.{json,md}.
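The two headline metrics can be sketched like this. The exact formulas in eval.ts are not shown here, so treat these as assumptions: relevance@k is taken as relevant hits in the top k divided by min(k, size of the relevant set), and compliance as the share of hits that satisfy a hypothetical `Constraints` shape.

```typescript
// Hypothetical shapes for the sketch.
type Hit = { id: string; price: number; available: boolean };
type Constraints = { maxPrice?: number; availableOnly?: boolean };

// Fraction of the achievable relevant hits that appear in the top k.
function relevanceAtK(hitIds: string[], relevant: string[], k = 3): number {
  const rel = new Set(relevant);
  const denom = Math.min(k, rel.size);
  if (denom === 0) return 0;
  const topK = Array.from(new Set(hitIds.slice(0, k))); // ignore duplicate ids
  return topK.filter((id) => rel.has(id)).length / denom;
}

// Share of returned hits that respect the query's price/availability constraints.
function constraintCompliance(hits: Hit[], c: Constraints): number {
  if (hits.length === 0) return 1; // assumption: an empty result violates nothing
  const ok = hits.filter(
    (h) =>
      (c.maxPrice === undefined || h.price <= c.maxPrice) &&
      (!c.availableOnly || h.available),
  ).length;
  return ok / hits.length;
}

console.log(relevanceAtK(["a", "x", "y"], ["a", "b", "c"]));
console.log(
  constraintCompliance(
    [
      { id: "a", price: 4000, available: true },
      { id: "b", price: 6000, available: true },
    ],
    { maxPrice: 5000 },
  ),
);
```

With three relevant products per query, each query's score moves in steps of one third, so a mean over 10 queries is coarse by construction.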

Push the same products through the real pipeline — enrich → embed → index — into a dedicated project, then run the same queries through matcher.search (NLQ + FTS + cosine ANN).

On a deliberately tiny 30-product corpus, this is a real result (not spin):

| metric | offline keyword | live hybrid |
| --- | --- | --- |
| relevance@3 (mean) | 0.70 | 0.63 |
| constraint compliance | 1.00 | 1.00 |
| relaxation rate | 0.00 | 0.10 |
  • Where hybrid wins: the price query “modest dress for work under 5000” went 0.00 → 0.33 — NLQ parsed the budget into a price ≤ 5000 filter and relaxed soft filters to avoid a dead-end.
  • Where it regresses: broad/use-case queries. With only 30 docs, vector recall pulls cross-category neighbours that score as misses against a narrow relevant set.
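The budget behaviour described above can be illustrated with a toy version: pull an "under N" cap out of the query as a hard filter, and drop a soft filter only when it would leave no results. The real NLQ step is a model call; this regex-based `parseBudget`, the `searchWithRelaxation` helper, and the colour soft filter are all hypothetical stand-ins.

```typescript
type Parsed = { text: string; maxPrice?: number };

// Extract "under N" / "below N" as a price cap and strip it from the text.
function parseBudget(q: string): Parsed {
  const m = q.match(/\b(?:under|below)\s+(\d[\d,]*)\b/i);
  if (!m) return { text: q.trim() };
  const maxPrice = Number(m[1].replace(/,/g, ""));
  const text = q.replace(m[0], " ").replace(/\s+/g, " ").trim();
  return { text, maxPrice };
}

type Item = { id: string; price: number; color: string };

// The price cap is hard (never relaxed); the colour filter is soft
// and is dropped only if it would produce an empty result.
function searchWithRelaxation(items: Item[], maxPrice?: number, softColor?: string) {
  const hard = items.filter((i) => maxPrice === undefined || i.price <= maxPrice);
  if (softColor === undefined) return { hits: hard, relaxed: false };
  const strict = hard.filter((i) => i.color === softColor);
  return strict.length > 0
    ? { hits: strict, relaxed: false }
    : { hits: hard, relaxed: true };
}

const parsed = parseBudget("modest dress for work under 5000");
const items: Item[] = [
  { id: "a", price: 4000, color: "black" },
  { id: "b", price: 6000, color: "red" },
];
const result = searchWithRelaxation(items, parsed.maxPrice, "red");
console.log(parsed, result);
```

Here the only red item is over budget, so the colour filter is relaxed while the price cap still holds. A relaxed query counts toward the relaxation rate in the scorecard.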

The takeaway: corpus size matters. Embedding recall needs a catalog big enough to disambiguate; at 30 docs you mostly measure keyword overlap plus constraint handling. Scale up (--per 20, or the full set) — the harness, metrics, and commands are identical.

Against a ~5.5k-product catalog with 50 golden queries (label-free: objective + behavioural metrics):

  • zero-result rate: 0 across all 50 queries
  • budget enforcement: 4 of the 5 budget queries returned 0% over-budget results in the top 5. Those were the explicit “under N” queries, where hard filters held. The one miss was an implicit “cheap” query (a percentile heuristic that didn’t match the stated cap): an actionable gap, surfaced honestly.
  • latency: ~2s/query, dominated by the per-query model calls (NLQ rewrite + query embed); the Postgres hybrid SQL itself is sub-millisecond at this scale.
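Label-free metrics like these need only the returned hits and the budget stated in the query, not relevance judgements. A minimal sketch, assuming a hypothetical `Run` record per query (the names and shapes are not taken from the harness):

```typescript
// One record per executed query; `budget` is set only for budget queries.
type Run = { hits: { price: number }[]; budget?: number };

// Share of queries that returned nothing at all.
function zeroResultRate(runs: Run[]): number {
  if (runs.length === 0) return 0;
  return runs.filter((r) => r.hits.length === 0).length / runs.length;
}

// Share of the top-k hits priced above the stated budget.
// Returns null for queries without a budget, so they can be skipped.
function overBudgetRate(run: Run, k = 5): number | null {
  if (run.budget === undefined) return null;
  const budget = run.budget;
  const top = run.hits.slice(0, k);
  if (top.length === 0) return 0;
  return top.filter((h) => h.price > budget).length / top.length;
}

const runs: Run[] = [
  { hits: [{ price: 4200 }, { price: 4900 }], budget: 5000 },
  { hits: [{ price: 3000 }, { price: 9000 }], budget: 5000 },
  { hits: [{ price: 7000 }] },
];
console.log(zeroResultRate(runs), runs.map((r) => overBudgetRate(r)));
```

A query passes budget enforcement when its over-budget rate is 0; counting the passes over all budget queries gives a figure in the same form as the one reported above.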