omnigraph

apunkt/omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-30 02:49:39 +02:00

Commit graph

Author	SHA1	Message	Date

Claude

a1e9f32ee1

pq-l2: bench quality fixes — pre-alloc output, warmup, black_box

Three related fixes from the code-review pass that make the per-query
timing measure kernel work and only kernel work:

1. distance_table API now takes `&mut [f32]` output buffer
   - Old: `fn distance_table(&self, query: &[f32]) -> Vec<f32>` — every
     call allocated a fresh Vec inside the timed region. An agent that
     reduced allocator pressure (e.g., via interior-mutability hacks with
     RefCell + thread-local scratch) would have shown up as a "kernel win"
     when it was actually just dodging the allocator.
   - New: `fn distance_table(&self, query: &[f32], out: &mut [f32])`.
     run_experiment pre-allocates one buffer per workload and reuses it
     across queries. Same for the criterion bench (one scratch buffer per
     bench_function closure). Timing now reflects only the kernel work.

2. Warmup query per workload
   - The first query of each (shape × distribution) combo paid cold-cache
     cost on the codes array (1.9 MB for the (768,96,256) shape, exceeds
     L2 on many laptops) and on the codebook (786 KB at that shape). With
     SPEED_NUM_QUERIES=32 that's a ~3% first-query bias on the geomean.
   - run_experiment now does one untimed distance_table + probe_top_k call
     per workload before the timing loop. Black-boxed so it can't be DCE'd.

3. std::hint::black_box on probe_top_k result in the trial loop
   - The criterion bench already did this; the trial harness (which is the
     load-bearing measurement) did not. Under LTO + opt-level=3, since the
     binary was the only consumer of `_hits`, the optimizer could in
     principle DCE the heap maintenance work. black_box makes the result
     observably live.
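
The three fixes above can be sketched together as one timing loop. This is an illustration under stated assumptions, not the harness code: `ToyKernel`, its naive distance body, and `time_queries` are hypothetical stand-ins. Only the caller-provided `out` buffer, the untimed warmup call, and `std::hint::black_box` come from the description above.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the real kernel; only the signature shape
/// (caller-provided `out` buffer) mirrors the commit message.
struct ToyKernel {
    /// Centroids of `dim` floats each, stored row-major.
    centroids: Vec<f32>,
    dim: usize,
}

impl ToyKernel {
    fn rows(&self) -> usize {
        self.centroids.len() / self.dim
    }

    /// Writes the squared L2 distance from `query` to every centroid into `out`.
    /// Nothing is allocated here, so the timed region holds kernel work only.
    fn distance_table(&self, query: &[f32], out: &mut [f32]) {
        assert_eq!(query.len(), self.dim);
        assert_eq!(out.len(), self.rows());
        for (slot, centroid) in out.iter_mut().zip(self.centroids.chunks_exact(self.dim)) {
            *slot = centroid
                .iter()
                .zip(query)
                .map(|(c, q)| (c - q) * (c - q))
                .sum();
        }
    }
}

/// Times each query after one untimed warmup call, reusing a single buffer.
fn time_queries(kernel: &ToyKernel, queries: &[Vec<f32>]) -> Vec<Duration> {
    // One buffer per workload, allocated outside the timed region.
    let mut out = vec![0.0f32; kernel.rows()];

    // Warmup: pay the cold-cache cost before the first timed query.
    if let Some(first) = queries.first() {
        kernel.distance_table(first, &mut out);
        black_box(&out);
    }

    queries
        .iter()
        .map(|q| {
            let start = Instant::now();
            kernel.distance_table(q, &mut out);
            // Keep the result observably live so the optimizer cannot drop the work.
            black_box(&out);
            start.elapsed()
        })
        .collect()
}

fn main() {
    let kernel = ToyKernel { centroids: vec![0.0, 0.0, 3.0, 4.0], dim: 2 };
    let queries = vec![vec![0.0, 0.0], vec![3.0, 4.0]];
    let timings = time_queries(&kernel, &queries);
    println!("timed {} queries", timings.len());
}
```

With the buffer owned by the caller, an implementation cannot gain time by avoiding the allocator, because no allocation is left inside the measured span.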

Doc updates:
- crates/pq-l2/program.md: API contract reflects the new signature; the
  obsolete "avoid the Vec alloc in distance_table" prior is replaced with
  a note about reducing probe_top_k's Vec<(u32, f32)> allocation
  (single small alloc per query, real concern once the kernel SIMDs).
- docs/targets/pq-l2.md: API description updated.

Verified:
- cargo build / clippy / test: clean
- baseline trial: correctness pass, exit 0, ~40s wall-clock
- baseline numbers are now slower than before (geomean 1.35M ns vs prior
  880k ns; (768,96,256) 5.2M ns vs prior 4.3M ns) because the prior numbers
  were artificially low: allocator pressure improvements masqueraded as
  kernel improvements, and LTO could in principle DCE heap maintenance.
  The new numbers measure actual kernel work.

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5

2026-05-15 01:24:54 +00:00

Claude

7b1b0b5b75

research: fix lance-autoresearch correctness bugs surfaced by code review

A code review pass found a cluster of real bugs in metrics and contract;
fixing them before any agent loop runs against this harness.

Critical metric bug:
- harness-common::sysinfo::peak_rss_mb read VmPeak (virtual address space
  high-water-mark, includes mmap'd files / guard pages / untouched
  allocations) instead of VmHWM (resident pages high-water-mark). The
  function name and HARNESS.md contract both promised RSS. Every
  peak_mem_mb row logged under the old code was virtual peak, not RSS.
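
The VmPeak/VmHWM distinction can be sketched as follows. This is a minimal Linux-only illustration, not the harness-common code: the helper `status_field_kb`, the `Option` return types, and the sample values are assumptions. Only the field names and the `/proc` status file format are standard.

```rust
/// Extracts a `kB` field such as `VmHWM` from the text of /proc/<pid>/status.
/// Hypothetical helper; the real harness code is not shown in this log.
fn status_field_kb(status: &str, key: &str) -> Option<u64> {
    status.lines().find_map(|line| {
        let rest = line.strip_prefix(key)?.strip_prefix(':')?;
        rest.split_whitespace().next()?.parse().ok()
    })
}

/// Peak resident set size in MiB; `None` where /proc is unavailable.
fn peak_rss_mb() -> Option<f64> {
    let status = std::fs::read_to_string("/proc/self/status").ok()?;
    // VmHWM is the resident high-water mark; VmPeak would report the
    // virtual address space peak instead.
    status_field_kb(&status, "VmHWM").map(|kb| kb as f64 / 1024.0)
}

fn main() {
    // Made-up sample in the /proc status layout, for illustration only.
    let sample = "Name:\tdemo\nVmPeak:\t  204800 kB\nVmHWM:\t   93184 kB\n";
    println!("VmPeak = {:?} kB", status_field_kb(sample, "VmPeak"));
    println!("VmHWM  = {:?} kB", status_field_kb(sample, "VmHWM"));
    println!("live peak RSS = {:?} MiB", peak_rss_mb());
}
```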

Correctness contract bug:
- reference::topk_consistent's tie-tolerance had a flawed neighbor-scan
  check: when the K-th distance fell in a multi-way tie, agent and
  reference could legally return different K-sized subsets of the tied
  band (heap eviction order vs. sort stability), and the neighbor scan
  required both endpoints to be present, false-negativing legitimate
  cases. Simplified to a positional distance-tolerance check; ids at the
  same rank may differ silently because the distance match within tol
  constrains the swap to a 2*tol band. Diagnostic comment explains the
  rationale.
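
A positional distance-tolerance check of the kind described above might look like this. It is a sketch under assumptions: the `(u32, f32)` hit type is borrowed from the probe_top_k mention elsewhere in this log, and the signature and body are hypothetical, not the actual reference.rs code.

```rust
/// Hypothetical positional top-K check: ranks are compared by distance
/// only, so ids inside a tied band may differ between the two lists.
fn topk_consistent(agent: &[(u32, f32)], reference: &[(u32, f32)], tol: f32) -> bool {
    agent.len() == reference.len()
        && agent
            .iter()
            .zip(reference)
            .all(|(a, r)| (a.1 - r.1).abs() <= tol)
}

fn main() {
    // The last slot of a top-2 result sits in a tie at distance 1.0.
    let reference = [(7, 0.5), (3, 1.0)];
    let agent_tie = [(7, 0.5), (9, 1.0)]; // different id, same distance: accepted
    let agent_bad = [(7, 0.5), (9, 1.5)]; // distance off by 0.5: rejected
    println!("{}", topk_consistent(&agent_tie, &reference, 1e-4));
    println!("{}", topk_consistent(&agent_bad, &reference, 1e-4));
}
```

Because each rank must match the reference distance within `tol`, any id that appears at a given rank is at most `tol` away from the reference distance at that rank, which is what bounds a swap to a narrow band.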

API hygiene:
- Removed dead PqKernel::shape() and ScalarReference::shape() — declared
  in the public API contract (program.md, kernels.rs comment), required
  to be stable, never called by the bench / benches / inputs / reference.
  Now the contract reflects what the bench actually uses.
- Removed dead `anyhow` workspace dependency.

Determinism:
- PRNG seed mixing now uses the SplitMix64 finalizer per part instead of
  raw XOR. Raw XOR is commutative and small-constant collisions are
  reachable; mix_seeds iterates the finalizer once per ingredient so
  distinct (seed, shape, kind) tuples produce distinct streams with
  vanishingly small collision probability.
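
The difference between a raw XOR fold and an iterated finalizer can be sketched as below. The finalizer constants are the standard SplitMix64 ones; the fold shape of `mix_seeds`, its starting constant, and the `xor_seeds` comparison helper are assumptions made for illustration, not the harness-common implementation.

```rust
/// SplitMix64 finalizer (the output mixing step of the SplitMix64 generator).
fn splitmix64_finalize(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical seed mixer: one finalizer pass per ingredient, so the
/// result depends on the order of `parts`, unlike a plain XOR fold.
fn mix_seeds(parts: &[u64]) -> u64 {
    parts
        .iter()
        .fold(0x9E37_79B9_7F4A_7C15_u64, |acc, &p| splitmix64_finalize(acc ^ p))
}

/// The raw XOR combination that the iterated finalizer replaces.
fn xor_seeds(parts: &[u64]) -> u64 {
    parts.iter().fold(0u64, |acc, &p| acc ^ p)
}

fn main() {
    // Raw XOR cannot tell (1, 2) from (2, 1), and (1, 2, 3) collapses to 0.
    assert_eq!(xor_seeds(&[1, 2]), xor_seeds(&[2, 1]));
    assert_eq!(xor_seeds(&[1, 2, 3]), 0);
    // The iterated finalizer separates the reordered tuples.
    assert_ne!(mix_seeds(&[1, 2]), mix_seeds(&[2, 1]));
    println!("{:#018x}", mix_seeds(&[1, 2]));
}
```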

License headers:
- kernels.rs SPDX changed from Apache-2.0 to MIT OR Apache-2.0 to match
  the crate's Cargo.toml license field (the rest of the crate is dual-
  licensed). Added matching SPDX headers to reference.rs and inputs.rs.

Doc cleanups:
- design.md: replaced the broken relative link
  `../../docs/research/llm-evolutionary-sampling.md` (which resolved inside
  lance-autoresearch where the note doesn't live) with a path-explained
  reference noting the note lives in the parent OmniGraph repo and won't
  ship on extraction.
- README.md: clarified that the target table mixes a single landed target
  with a candidate roadmap; the candidates have no code yet.
- HARNESS.md: added exit code 1 (internal error) to the exit-code summary;
  it was documented in run_experiment.rs but not in the loop contract.
- adding-a-target.md: dropped the misleading "cp -r plus surgical edits"
  framing — the workflow rewrites 7 files; what's inherited is Cargo
  manifest, license headers, workspace registration, and shared utilities.

Verified end-to-end: cargo build / clippy / test all green. Baseline
trial runs `correctness: pass` exit 0 in ~34s (peak_mem_mb now reads
RSS — same workload reports 91 MB, plausibly correct given the temporary
fixture-construction buffers).

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5

2026-05-15 00:55:57 +00:00

Claude

0d72cc69fb

research: restructure lance-autoresearch as multi-target workspace

The original lance-autoresearch was one Cargo crate optimizing one Lance
kernel (PQ L2 distance). With 9+ candidate targets enumerated in the research
note, a single-crate shape doesn't scale: per-target deps will collide, the
agent's edits to one target's kernels.rs would conflict with another's lib
path, and build/test isolation is lost. Restructure into a Cargo workspace.

Layout:

  research/lance-autoresearch/
  ├── Cargo.toml          (workspace root)
  ├── README.md           (target table, contract overview, repo layout)
  ├── HARNESS.md          (universal loop contract every target inherits)
  ├── crates/
  │   ├── harness-common/ (shared: SplitMix64, geomean, peak RSS,
  │   │                    MAX_ABS_ERR, TOPK_DIST_TOL, TIME_BUDGET_SECS)
  │   └── pq-l2/          (the landed target; was the previous single crate)
  └── docs/
      ├── design.md           (rationale for workspace shape, no Target trait)
      ├── adding-a-target.md  (step-by-step workflow for new targets)
      └── targets/pq-l2.md    (per-target capsule)

Decisions documented in docs/design.md:

- Workspace, not single crate: per-target Cargo.toml so deps don't collide;
  per-target src tree so agent edits don't conflict; per-target build/test
  isolation for faster agent iteration.
- harness-common as a plumbing-only crate (PRNG, geomean, peak RSS, tolerance
  constants, time budget). Intentionally NO Target trait - decode kernel
  signatures and distance kernel signatures differ enough that a unifying
  trait would either bloat or require erased boxing. Each target is its own
  natural shape.
- Per-target program.md + shared HARNESS.md: the loop contract is universal,
  the priors and API spec are per-target. Two files instead of one because
  copy-pasting the universal loop into every program.md would drift.

pq-l2 refactor:
- src/* moved into crates/pq-l2/src/* via git mv (preserves history)
- crate renamed lance-autoresearch -> pq-l2
- SplitMix64, geomean, peak_rss_mb, MAX_ABS_ERR, TOPK_DIST_TOL,
  TIME_BUDGET_SECS now imported from harness-common (drops ~70 lines of
  duplication that would have been copy-pasted into every new target)
- program.md trimmed: setup/loop/hygiene moved to HARNESS.md; only the
  PQ-L2-specific API contract and SIMD priors remain
- Cargo.toml depends on harness-common via path; workspace.dependencies
  pins criterion uniformly across targets

The 10 candidate targets from the research note (A1 cosine/dot/hamming, A2
IVF partition select, A3 FTS BM25, A4 bitpack decode, A5 dictionary decode,
A6 FSST decode, A7 take/gather, A8 predicate eval, A9 posting list intersect,
A10 top-K merge) are listed in README.md's target table as "candidate"; each
gets a docs/targets/<name>.md capsule when it's spun up. docs/adding-a-target.md
documents the cp -r + edit-Cargo.toml + rewrite-three-files workflow.

Verified end-to-end:
- cargo build --release: clean, both crates compile
- cargo clippy --release --workspace --all-targets -- -D warnings: clean
- cargo test --release --workspace: 6/6 pass (4 harness-common + 2 pq-l2)
- cargo run --release --bin run_experiment -p pq-l2: correctness pass,
  geomean ~880k ns, exit 0, ~30s wall-clock
- omnigraph parent workspace unchanged (research/ excluded as before)

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5

2026-05-15 00:15:02 +00:00

3 commits