diff --git a/research/lance-autoresearch/Cargo.toml b/research/lance-autoresearch/Cargo.toml index 96d72d4..a56f436 100644 --- a/research/lance-autoresearch/Cargo.toml +++ b/research/lance-autoresearch/Cargo.toml @@ -1,32 +1,14 @@ -# Empty `[workspace]` section so cargo treats this directory as its own -# workspace root and does NOT walk up to the parent omnigraph workspace. -# Without this, cargo from inside `research/lance-autoresearch/` will try to -# resolve omnigraph's dependencies even though we're excluded as a member. [workspace] +resolver = "2" +members = [ + "crates/harness-common", + "crates/pq-l2", +] -[package] -name = "lance-autoresearch" -version = "0.1.0" -edition = "2024" -license = "MIT OR Apache-2.0" -description = "Autoresearch-style harness for evolving Lance PQ L2 distance kernels via LLM agents." -publish = false - -[lib] -path = "src/lib.rs" - -[[bin]] -name = "run_experiment" -path = "src/bin/run_experiment.rs" - -[[bench]] -name = "pq_l2" -harness = false - -[dependencies] +# Each per-target crate sets its own deps. Shared deps below pin versions +# uniformly across targets so the workspace lockfile stays clean. +[workspace.dependencies] anyhow = "1" - -[dev-dependencies] criterion = { version = "0.5", default-features = false, features = ["plotters", "cargo_bench_support"] } [profile.release] diff --git a/research/lance-autoresearch/HARNESS.md b/research/lance-autoresearch/HARNESS.md new file mode 100644 index 0000000..25defc7 --- /dev/null +++ b/research/lance-autoresearch/HARNESS.md @@ -0,0 +1,137 @@ +# HARNESS — shared loop contract for every lance-autoresearch target + +This document is the universal part of every target's agent instructions. Each +target's `program.md` is a thin layer of *target-specific priors and API spec* +on top of the conventions below. The agent reads `HARNESS.md` and the target's +`program.md` at the start of every session. 
+ +## What this harness is + +A single agent (you) edits one file in one target crate to optimize a Lance +kernel. Per trial, you build, run a binary that exercises the kernel against +diverse inputs, parse a fixed-format output block, and decide keep-or-revert. + +This is a Karpathy-style autoresearch loop. It assumes: + +- Per-trial eval is **seconds-scale**. Long enough to measure, short enough to + iterate hundreds of times in a session. +- The kernel has a **deterministic correctness oracle** — a scalar reference + that produces the same answer to compare against. +- The optimization target is **dataset-independent**: the harness generates + diverse inputs each trial, so wins generalize across distributions and + shapes by construction. + +Targets that don't fit these constraints (index-build parameter tuning, +plan-patching, anything where eval is minutes-to-hours) belong in the +BauplanLabs tournament-loop shape, not this harness. See `docs/design.md` for +the boundary. + +## What's editable, per target + +| Path | Mutability | Why | +|---|---|---| +| `crates//src/kernels.rs` | **mutable** | Your playground. The whole point. | +| `crates//src/reference.rs` | immutable | The oracle. Touching it makes wins meaningless. | +| `crates//src/inputs.rs` | immutable | The fixture generator. Touching it makes timings incomparable across trials. | +| `crates//src/lib.rs` | immutable | Shared types pinned by the bench (`PqShape` etc.). | +| `crates//src/bin/run_experiment.rs` | immutable | The trial harness. | +| `crates//benches/*.rs` | immutable | Criterion bench, optional read-only reference. | +| `crates//Cargo.toml` | immutable | Adding deps changes the optimization target. | +| `crates//program.md` | human-iterated between runs | Not edited by you in-loop; the human refines it. | +| `crates//results.tsv` | append-only | Your audit log. Gitignored. | +| `crates/harness-common/**` | immutable | Workspace-shared infrastructure. 
| +| `HARNESS.md` (this file) | immutable | Workspace-shared loop contract. | + +You may add `#[cfg(test)] mod tests { ... }` inside `kernels.rs` for in-file +property checks. You may NOT add new crate dependencies. You may NOT use +unsafe-only-on-broken-assumptions tricks (e.g., assuming a fixture invariant +that holds today but isn't documented). + +## The metric + +Every target's `run_experiment` binary prints a fixed-format output block ending +with these universal fields: + +- `correctness:` — `pass` or `fail`. Set by comparing your kernel against the + scalar reference on every input the bench generates. +- `geomean_ns_per_*:` — geometric mean of per-operation wall-clock across all + timed operations. +- `worst_ns_per_*:` — slowest combo's geomean. +- `peak_mem_mb:` — process RSS high-water-mark. +- `total_seconds:` — trial wall-clock. + +A kernel is **kept** iff: + +1. `correctness: pass` (any failure → `std::process::exit(2)`). +2. `geomean_ns_per_*` strictly better than the previous best-kept kernel + (allow ~1% noise band). +3. `worst_ns_per_*` ≤ 1.05 × the previous best-kept kernel's worst. +4. `total_seconds` ≤ 600 (the per-trial cap; exceed it → `std::process::exit(3)`). +5. Build clean: `cargo build --release` and + `cargo clippy --release --all-targets -- -D warnings` both succeed. + +Ties break toward simpler code: same speed within ~3% noise → fewer lines / +less `unsafe` wins. + +## The loop + +After reading `HARNESS.md` and the target's `program.md`: + +1. **Setup (once per session).** Confirm `results.tsv` exists; if not, create + it with a per-target header (the target's `program.md` defines the columns). + Run the baseline trial: + ``` + cargo run --release --bin run_experiment -p > run.log 2>&1 + ``` + Append a row tagged `keep=baseline` and commit it. + +2. **Observe state.** Read the last ~5 rows of `results.tsv`. Note which ideas + have been tried, what won, what regressed. 
Form one hypothesis with one + sentence stating the change and the predicted effect on speed and + correctness. + +3. **Edit `kernels.rs`.** Keep the diff focused on the one hypothesis. + +4. **Build and lint.** + ``` + cargo build --release + cargo clippy --release --all-targets -- -D warnings + ``` + If either fails, fix and retry. Do not commit broken state. + +5. **Run the trial.** + ``` + cargo run --release --bin run_experiment -p > run.log 2>&1 + ``` + +6. **Parse and decide.** Extract the universal fields plus any per-target + fields. Compute deltas vs. the last-kept row. Apply the keep criteria above. + +7. **Log.** Append one row to `results.tsv` matching the target's header. + +8. **Commit.** One-line message describing the change and the headline number, + e.g. `transpose codebook in new(); 18.2k → 14.1k geomean ns (worst -8%)`. + +9. **Hygiene.** + - Always commit `kernels.rs` changes; never commit `results.tsv` or + `run.log` (gitignored). + - If a change fails to build, do not commit. Iterate or revert cleanly. + - If two consecutive ideas regress, take a beat: re-read the last ~10 rows + and update your mental model before proposing the next. + - Per-trial cap: 10 minutes. If `cargo run` is still going after 10 min, + kill it and mark the trial as `timeout`. + +## Never stop + +Keep going until interrupted. Each loop iteration is one hypothesis, one edit, +one measurement, one commit. No multi-step plans across iterations. + +## Working across multiple targets + +If your work spans multiple targets, work on **one target per session**. Don't +edit `kernels.rs` in two crates between commits — your mental model is +shared but the keep-decision is per-target. Pick a target, do a session there, +commit, switch. + +The human is responsible for selecting which target to work on next. Don't +proactively switch targets unless the human asks. 
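Reviewer sketch (not part of the patch): the five keep criteria in `HARNESS.md` can be read as a single predicate over the parsed trial fields. The block below is a minimal illustration under stated assumptions, not the harness's code. The `Trial` struct and `should_keep` function are hypothetical names, and the "~1% noise band" is interpreted here as "the candidate must beat the best-kept geomean by more than 1%", which is one possible reading of criterion 2.

```rust
// Hypothetical types and names; HARNESS.md defines the criteria, not this code.
#[derive(Clone, Copy, Debug)]
struct Trial {
    build_clean: bool,      // cargo build + clippy both succeeded
    correctness_pass: bool, // `correctness: pass`
    geomean_ns: u64,        // `geomean_ns_per_*`
    worst_ns: u64,          // `worst_ns_per_*`
    total_seconds: f64,     // `total_seconds`
}

// Assumed reading of "allow ~1% noise band": require more than a 1% improvement.
const NOISE_BAND: f64 = 0.01;
// Worst-case guard from criterion 3.
const WORST_GUARD: f64 = 1.05;
// Per-trial cap from criterion 4.
const TIME_BUDGET_SECS: f64 = 600.0;

/// Returns true when `candidate` clears every keep criterion against `best`.
fn should_keep(candidate: &Trial, best: &Trial) -> bool {
    let geomean_improved =
        (candidate.geomean_ns as f64) < (best.geomean_ns as f64) * (1.0 - NOISE_BAND);
    let worst_guarded =
        (candidate.worst_ns as f64) <= (best.worst_ns as f64) * WORST_GUARD;
    candidate.build_clean
        && candidate.correctness_pass
        && geomean_improved
        && worst_guarded
        && candidate.total_seconds <= TIME_BUDGET_SECS
}

fn main() {
    let best = Trial {
        build_clean: true,
        correctness_pass: true,
        geomean_ns: 18_234,
        worst_ns: 24_515,
        total_seconds: 12.3,
    };
    // Clearly faster, worst case also better.
    let faster = Trial { geomean_ns: 14_100, worst_ns: 22_500, ..best };
    // Faster on average but the slowest combo regressed past the 1.05x guard.
    let lopsided = Trial { geomean_ns: 14_100, worst_ns: 26_000, ..best };
    println!("faster kept:   {}", should_keep(&faster, &best));
    println!("lopsided kept: {}", should_keep(&lopsided, &best));
}
```

Keeping the decision in one pure function makes the "compute deltas vs. the last-kept row" step easy to audit against `results.tsv`; the tie-break toward simpler code is left to judgment and is not modeled here.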
diff --git a/research/lance-autoresearch/README.md b/research/lance-autoresearch/README.md index c2573fd..7f52b77 100644 --- a/research/lance-autoresearch/README.md +++ b/research/lance-autoresearch/README.md @@ -1,112 +1,143 @@ # lance-autoresearch -An autoresearch-style harness for evolving [Lance](https://github.com/lance-format/lance) -PQ L2 distance kernels via LLM coding agents (Claude Code, Codex, Cursor). - -Modeled on Andrej Karpathy's +A multi-target workspace for evolving [Lance](https://github.com/lance-format/lance) +hot-path kernels via LLM coding agents (Claude Code, Codex, Cursor), +in the style of Andrej Karpathy's [`nanochat-research`](https://x.com/karpathy/status/1855651423497650238) -three-file contract: +single-agent autoresearch loop. -- **Immutable bench** — `src/bin/run_experiment.rs` + `src/inputs.rs` + - `src/reference.rs`. The agent cannot touch these. -- **Mutable kernel** — `src/kernels.rs`. The agent's playground. Starts as a - scalar baseline matching Lance's PQ L2 algorithm shape; the agent's job is to - beat it. -- **Human-iterated program** — `program.md`. The "skill" the agent reads at - the start of every session. The human refines it between runs. 
+Each target is an independent Rust crate under `crates/`: + +| Target | Status | Lance source area | What's optimized | +|---|---|---|---| +| [`crates/pq-l2`](crates/pq-l2) | landed | `lance-linalg::distance::l2`, PQ probe | PQ L2 distance: build LUT, probe codes, top-K | +| `crates/pq-cosine` | candidate (A1) | `lance-linalg::distance::cosine` | PQ cosine distance | +| `crates/pq-dot` | candidate (A1) | `lance-linalg::distance::dot` | PQ dot-product distance | +| `crates/ivf-partition` | candidate (A2) | `lance-index::vector::ivf` partition select | IVF partition selection (centroid scan) | +| `crates/fts-bm25` | candidate (A3) | `lance-index::scalar::inverted` BM25 | FTS BM25 scoring inner loop | +| `crates/bitpack` | candidate (A4) | `lance-encoding::encodings::bitpack` | Bitpack integer decode | +| `crates/dictionary` | candidate (A5) | `lance-encoding::encodings::dictionary` | Dictionary decode | +| `crates/fsst` | candidate (A6) | `lance-encoding::encodings::fsst` | FSST string decode | +| `crates/take` | candidate (A7) | `lance-core::utils::take` | Take / gather kernel | +| `crates/predicate` | candidate (A8) | `lance-datafusion` filter eval | Predicate evaluation kernels | +| `crates/posting-intersect` | candidate (A9) | `lance-index::scalar::inverted` | Posting list intersection (FTS AND) | +| `crates/topk-merge` | candidate (A10) | scan-merge | Top-K k-way merge | + +The candidate targets are documented in [`docs/targets/`](docs/targets/) and can +be added by following [`docs/adding-a-target.md`](docs/adding-a-target.md). The +single landed target (`pq-l2`) proves the harness shape; the candidates wait +for an agent to spin them up. 
+ +## The contract every target follows + +Karpathy's three-file shape, applied per target: + +| File (per target crate) | Mutability | Edited by | +|---|---|---| +| `src/kernels.rs` | **mutable** | the agent | +| `src/reference.rs`, `src/inputs.rs`, `src/lib.rs`, `src/bin/run_experiment.rs`, `benches/*.rs` | immutable | — | +| `program.md` | human-iterated | the human, between runs | +| `results.tsv` | append-only | the agent, per trial (gitignored) | + +The shared utilities — deterministic PRNG, geomean, peak-RSS readback, +tolerance constants, time-budget — live in [`crates/harness-common`](crates/harness-common/src/lib.rs) +and are consumed by every target. There is intentionally **no `Target` trait**: +decode-kernel signatures and distance-kernel signatures are different enough +that a unifying trait would either bloat or require erased boxing. Each target +is its own natural shape; the shared crate is plumbing only. + +The shared loop conventions every target's `program.md` inherits live in +[`HARNESS.md`](HARNESS.md). Per-target priors and API specifics live in each +target's own `program.md`. ## Dataset-independent by design Every other ANN benchmark you've seen is "compete on this fixed dataset" -(SIFT1M, GIST1M, DEEP1B). That conflates two things: *kernel correctness* -(the math) and *kernel speed under one specific data distribution*. An LLM -agent given recall@K as the oracle has incentive to overfit to the dataset's -quirks. +(SIFT1M, GIST1M, DEEP1B). That conflates two things: *kernel correctness* (the +math) and *kernel speed under one specific data distribution*. An LLM agent +given recall@K as the oracle has incentive to overfit to the dataset's quirks. -We split them: +We split them, every target: -- **Correctness** = bit-equivalent (`max_abs_err ≤ 1e-4`) match to a scalar - reference kernel, on diverse generated inputs (Gaussian, uniform, sparse, - large-dynamic-range, mostly-zero) × multiple PQ shapes. 
This is mathematical - equivalence; there's no dataset to overfit. Lossy techniques fail this gate. -- **Speed** = geomean ns/query across multiple PQ shapes × - multiple data distributions. A kernel that wins on one distribution and - regresses on another fails the worst-case guard. +- **Correctness** = bit-equivalent (`max_abs_err ≤ 1e-4` for floats; bitwise for + integer/byte kernels) match to a scalar reference, on diverse generated + inputs. Mathematical equivalence; no dataset to overfit. Lossy techniques fail + this gate. +- **Speed** = geomean ns/operation across multiple shape × distribution + combinations, with worst-case guard. A kernel that wins on one distribution + and regresses on another fails to keep. By construction, an "improvement" generalizes across distributions and shapes. -There is no `wget sift.tar.gz` step; the harness is fully self-contained. +There is no `wget sift.tar.gz` step; every target is fully self-contained. -## Why a separate repo +## Why a separate repo (and a workspace, not a single crate) OmniGraph (the graph engine that motivated this) pins Lance at a released -version and consumes its kernels via the public crate API. Improvements live one -layer below: in Lance itself. A standalone repo with no OmniGraph dep keeps the -optimization target pure (only the kernel changes), keeps the license clean for -upstream contribution (dual MIT/Apache-2.0 → Apache-2.0 PRs to Lance), and -keeps the agent's working set tiny. +version and consumes its kernels via the public crate API. Improvements live +one layer below: in Lance itself. A standalone repo with no OmniGraph dep keeps +the optimization target pure (only the kernel changes), keeps the license clean +for upstream contribution (dual MIT/Apache-2.0 → Apache-2.0 PRs to Lance), and +keeps each agent's working set tiny. 
+ +**Workspace not single-crate** because per-target deps differ — FSST decode +will want a different dependency set than PQ kernels — and the agent's edits +to one target's `kernels.rs` must not collide with another's lib path. Each +target is buildable, testable, and runnable in isolation: `cd crates/ +&& cargo run --release --bin run_experiment`. ## Quick start ```bash -cargo run --release --bin run_experiment +# Run the landed PQ L2 target's baseline. +cargo run --release --bin run_experiment -p pq-l2 -# Or run with Claude Code / Codex: -# Open the repo in your agent of choice and prompt: -# Hi, have a look at program.md and let's kick off a new experiment. +# Or with Claude Code / Codex, working on one target: +cd crates/pq-l2 +# Open in your agent of choice and prompt: +# Hi, have a look at program.md and let's kick off a new experiment. + +# Add a new target (see docs/adding-a-target.md): +cp -r crates/pq-l2 crates/pq-cosine +# ... edit Cargo.toml name, kernels.rs / reference.rs / inputs.rs / program.md ``` -## File ownership - -| File | Mutability | Edited by | -|---|---|---| -| `src/kernels.rs` | **mutable** | the agent | -| `src/bin/run_experiment.rs` | immutable | — | -| `src/reference.rs` | immutable | — | -| `src/inputs.rs` | immutable | — | -| `src/lib.rs` | immutable (shared types) | — | -| `benches/pq_l2.rs` | immutable | — | -| `program.md` | human-iterated | the human, between runs | -| `results.tsv` | append-only | the agent, per trial (gitignored) | - -## The metric - -`run_experiment` runs two phases per trial: a correctness check and a -multi-shape × multi-distribution speed measurement. 
Output looks like: +## Repo layout ``` -correctness: pass ---- -correctness: pass -shapes_tested: (128,16,256) (256,16,256) (768,96,256) -distributions_tested: clustered uniform sparse -geomean_ns_per_query: 18234 -worst_ns_per_query: 24515 ((768,96,256), sparse) -best_ns_per_query: 12876 ((128,16,256), clustered) -per_combo_geomean_ns: - (128,16,256) clustered -> 12876 ns - (128,16,256) uniform -> 13441 ns - ... -peak_mem_mb: 28.4 -total_seconds: 12.3 +lance-autoresearch/ +├── Cargo.toml # workspace root +├── README.md # you are here +├── HARNESS.md # shared loop contract every target inherits +├── LICENSE-MIT, LICENSE-APACHE # dual-licensed (Apache compat for Lance PRs) +├── crates/ +│ ├── harness-common/ # shared: SplitMix64, geomean, peak RSS, tolerance, time budget +│ │ └── src/{lib,prng,stats,sysinfo,tolerance}.rs +│ └── pq-l2/ # landed target +│ ├── Cargo.toml +│ ├── program.md # this target's agent skill +│ ├── src/ +│ │ ├── lib.rs # PqShape + module wiring (immutable) +│ │ ├── kernels.rs # MUTABLE — agent's playground +│ │ ├── reference.rs # IMMUTABLE — scalar reference, oracle helpers +│ │ ├── inputs.rs # IMMUTABLE — diverse test-data generators +│ │ └── bin/run_experiment.rs # IMMUTABLE — per-trial entry point +│ └── benches/pq_l2.rs # criterion benchmark (immutable) +└── docs/ + ├── design.md # rationale for the workspace shape + ├── adding-a-target.md # workflow for spinning up a new target + └── targets/ + └── pq-l2.md # capsule: upstream Lance pointers, oracle, status ``` -A kernel is "kept" iff: - -- Correctness phase passes (mathematical equivalence to scalar reference) -- `geomean_ns_per_query` strictly better than the previous best-kept kernel -- `worst_ns_per_query` ≤ 1.05 × the previous best-kept kernel's worst -- `total_seconds` ≤ 600 - -See `program.md` for the full loop spec. 
- ## Upstream contribution path -When a commit clears the keep bar by a meaningful margin (≥10% geomean -speedup with worst-case guard intact), the human reviews the diff, ports the -technique against [`lance-format/lance`](https://github.com/lance-format/lance) -HEAD, runs Lance's own test suite, and opens a PR. Because `src/kernels.rs` is -dual MIT/Apache-2.0 licensed and algorithmically modeled on Lance's existing -path, the upstream PR inherits Apache-2.0 cleanly. +When a commit on any target clears the keep bar by a meaningful margin +(≥10% geomean speedup with worst-case guard intact), the human reviews the +diff, ports the technique against +[`lance-format/lance`](https://github.com/lance-format/lance) HEAD, runs +Lance's own test suite, and opens a PR. Because the workspace is dual +MIT/Apache-2.0 licensed and each target's kernel is algorithmically modeled on +Lance's existing path, the upstream PR inherits Apache-2.0 cleanly. ## License diff --git a/research/lance-autoresearch/crates/harness-common/Cargo.toml b/research/lance-autoresearch/crates/harness-common/Cargo.toml new file mode 100644 index 0000000..fc530bc --- /dev/null +++ b/research/lance-autoresearch/crates/harness-common/Cargo.toml @@ -0,0 +1,10 @@ +[package] +name = "harness-common" +version = "0.1.0" +edition = "2024" +license = "MIT OR Apache-2.0" +description = "Shared utilities for lance-autoresearch per-target harnesses (PRNG, geomean, peak RSS, tolerance constants, time budget)." +publish = false + +[lib] +path = "src/lib.rs" diff --git a/research/lance-autoresearch/crates/harness-common/src/lib.rs b/research/lance-autoresearch/crates/harness-common/src/lib.rs new file mode 100644 index 0000000..3671f71 --- /dev/null +++ b/research/lance-autoresearch/crates/harness-common/src/lib.rs @@ -0,0 +1,36 @@ +//! Shared utilities for lance-autoresearch per-target harnesses. +//! +//! Each target crate (`pq-l2`, future `pq-cosine`, `bitpack-decode`, etc.) +//! 
defines its own `kernels.rs` (mutable, the agent's playground), `reference.rs` +//! (immutable scalar reference), `inputs.rs` (immutable test-data generators), +//! and `bin/run_experiment.rs` (immutable per-trial entry point). They all need +//! the same handful of building blocks: a deterministic PRNG, a geomean +//! aggregator, peak-RSS readback, tolerance constants for the bit-exact oracle, +//! and a single shared time-budget constant. That's everything in this crate. +//! +//! What is **not** here, and intentionally not abstracted: +//! +//! - A `Target` trait. Decode kernels (`bitpack`, `dictionary`, `FSST`) have +//! very different signatures than distance kernels (`PqKernel::probe_top_k`), +//! and forcing them into one trait shape would either bloat the trait or +//! require erased boxing. Keep each target's API natural to its kernel. +//! +//! - Output-format orchestration. Each target's `run_experiment.rs` prints its +//! own fixed-format result block — different targets report different +//! per-combo dimensions (PQ shapes vs bit widths vs distribution kinds vs ...). +//! Sharing the format would make the per-target binaries less readable and +//! gain very little — `println!` is cheap. + +pub mod prng; +pub mod stats; +pub mod sysinfo; +pub mod tolerance; + +pub use prng::SplitMix64; +pub use stats::geomean; +pub use sysinfo::peak_rss_mb; +pub use tolerance::{MAX_ABS_ERR, TOPK_DIST_TOL}; + +/// Per-trial wall-clock cap. Targets should `std::process::exit(3)` if exceeded +/// so the agent's loop logs the trial as a timeout instead of a measurement. +pub const TIME_BUDGET_SECS: u64 = 600; diff --git a/research/lance-autoresearch/crates/harness-common/src/prng.rs b/research/lance-autoresearch/crates/harness-common/src/prng.rs new file mode 100644 index 0000000..ef33519 --- /dev/null +++ b/research/lance-autoresearch/crates/harness-common/src/prng.rs @@ -0,0 +1,52 @@ +//! Deterministic SplitMix64 PRNG. Same seed produces the same sequence on +//! 
every machine; no platform-specific RNG / no `rand` crate. Reproducibility +//! across trials is the whole point. + +pub struct SplitMix64 { + state: u64, +} + +impl SplitMix64 { + pub fn new(seed: u64) -> Self { + Self { state: seed } + } + + pub fn next_u64(&mut self) -> u64 { + self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15); + let mut z = self.state; + z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9); + z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB); + z ^ (z >> 31) + } + + /// Uniform in `[0, 1)` with 24 bits of mantissa precision. + pub fn next_f32(&mut self) -> f32 { + let bits = (self.next_u64() >> 40) as u32; + bits as f32 / ((1u32 << 24) as f32) + } + + /// Standard normal via Box–Muller. Cheap and sufficient for fixture + /// generation; not cryptographically anything. + pub fn next_normal(&mut self) -> f32 { + let mut u1 = self.next_f32(); + if u1 < 1e-7 { + u1 = 1e-7; + } + let u2 = self.next_f32(); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn deterministic_across_calls() { + let mut a = SplitMix64::new(0x1234_5678); + let mut b = SplitMix64::new(0x1234_5678); + for _ in 0..1000 { + assert_eq!(a.next_u64(), b.next_u64()); + } + } +} diff --git a/research/lance-autoresearch/crates/harness-common/src/stats.rs b/research/lance-autoresearch/crates/harness-common/src/stats.rs new file mode 100644 index 0000000..5dc3772 --- /dev/null +++ b/research/lance-autoresearch/crates/harness-common/src/stats.rs @@ -0,0 +1,36 @@ +//! Geometric mean of u64 timings. Robust to outliers; the right aggregator for +//! latency distributions because halving one query and doubling another cancels. 
+ +pub fn geomean(xs: &[u64]) -> u64 { + if xs.is_empty() { + return 0; + } + let mut sum_ln = 0.0f64; + for &x in xs { + sum_ln += (x.max(1) as f64).ln(); + } + (sum_ln / xs.len() as f64).exp() as u64 +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn empty_yields_zero() { + assert_eq!(geomean(&[]), 0); + } + + #[test] + fn single_value_round_trips() { + assert_eq!(geomean(&[100]), 100); + } + + #[test] + fn geomean_is_below_arithmetic_mean() { + let xs = [1, 10, 100, 1000]; + let g = geomean(&xs); + let am: u64 = xs.iter().sum::<u64>() / xs.len() as u64; + assert!(g < am); + } +} diff --git a/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs b/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs new file mode 100644 index 0000000..d389ff4 --- /dev/null +++ b/research/lance-autoresearch/crates/harness-common/src/sysinfo.rs @@ -0,0 +1,24 @@ +//! Peak resident-set-size readback (Linux only; non-Linux returns 0). + +#[cfg(target_os = "linux")] +pub fn peak_rss_mb() -> f64 { + let Ok(s) = std::fs::read_to_string("/proc/self/status") else { + return 0.0; + }; + for line in s.lines() { + if let Some(rest) = line.strip_prefix("VmHWM:") { + let kb: f64 = rest + .split_whitespace() + .next() + .and_then(|t| t.parse().ok()) + .unwrap_or(0.0); + return kb / 1024.0; + } + } + 0.0 +} + +#[cfg(not(target_os = "linux"))] +pub fn peak_rss_mb() -> f64 { + 0.0 +} diff --git a/research/lance-autoresearch/crates/harness-common/src/tolerance.rs b/research/lance-autoresearch/crates/harness-common/src/tolerance.rs new file mode 100644 index 0000000..ee19887 --- /dev/null +++ b/research/lance-autoresearch/crates/harness-common/src/tolerance.rs @@ -0,0 +1,15 @@ +//! Default tolerance constants for bit-exact correctness oracles. +//! +//! These suit float-arithmetic kernels (PQ distance, BM25 scoring, vector +//! normalization) where SIMD-accumulator reordering is legal but real bugs +//! shift values by orders of magnitude. 
Targets that operate on integer or +//! byte-exact data (bitpack decode, dictionary decode, FSST decode) should +//! assert strict bitwise equality and not use these constants. + +/// Maximum permitted absolute element error between agent kernel output and +/// scalar reference output, for float kernels. +pub const MAX_ABS_ERR: f32 = 1e-4; + +/// Maximum permitted distance error when comparing top-K results between +/// agent kernel and scalar reference, for float kernels. +pub const TOPK_DIST_TOL: f32 = 1e-4; diff --git a/research/lance-autoresearch/crates/pq-l2/Cargo.toml b/research/lance-autoresearch/crates/pq-l2/Cargo.toml new file mode 100644 index 0000000..39ddf9a --- /dev/null +++ b/research/lance-autoresearch/crates/pq-l2/Cargo.toml @@ -0,0 +1,24 @@ +[package] +name = "pq-l2" +version = "0.1.0" +edition = "2024" +license = "MIT OR Apache-2.0" +description = "Autoresearch target: Lance PQ L2 distance kernel optimization." +publish = false + +[lib] +path = "src/lib.rs" + +[[bin]] +name = "run_experiment" +path = "src/bin/run_experiment.rs" + +[[bench]] +name = "pq_l2" +harness = false + +[dependencies] +harness-common = { path = "../harness-common" } + +[dev-dependencies] +criterion = { workspace = true } diff --git a/research/lance-autoresearch/benches/pq_l2.rs b/research/lance-autoresearch/crates/pq-l2/benches/pq_l2.rs similarity index 93% rename from research/lance-autoresearch/benches/pq_l2.rs rename to research/lance-autoresearch/crates/pq-l2/benches/pq_l2.rs index 716b133..022cacc 100644 --- a/research/lance-autoresearch/benches/pq_l2.rs +++ b/research/lance-autoresearch/crates/pq-l2/benches/pq_l2.rs @@ -7,8 +7,8 @@ use std::hint::black_box; use criterion::{Criterion, criterion_group, criterion_main}; -use lance_autoresearch::inputs::{SHAPES, SPEED_TOP_K, speed_workloads}; -use lance_autoresearch::kernels::PqKernel; +use pq_l2::inputs::{SHAPES, SPEED_TOP_K, speed_workloads}; +use pq_l2::kernels::PqKernel; fn bench_pq_l2(c: &mut Criterion) { let 
workloads = speed_workloads(0xBE3C_C0DE_F1AC_BABE); diff --git a/research/lance-autoresearch/crates/pq-l2/program.md b/research/lance-autoresearch/crates/pq-l2/program.md new file mode 100644 index 0000000..7e778fb --- /dev/null +++ b/research/lance-autoresearch/crates/pq-l2/program.md @@ -0,0 +1,98 @@ +# Target: PQ L2 — agent instructions + +This is the per-target overlay on top of [`../../HARNESS.md`](../../HARNESS.md). +Read **HARNESS.md first** for the universal loop contract (what's editable, +the metric, the loop, hygiene, never stop). This file adds the PQ-L2-specific +API spec and priors. + +## Setup (once per session) + +1. Read in this order: + - `../../HARNESS.md` + - `../../README.md` + - `program.md` (this file) + - `src/lib.rs` + - `src/kernels.rs` *(the only file you may edit)* + - `src/reference.rs` + - `src/inputs.rs` + - `src/bin/run_experiment.rs` +2. Ensure `results.tsv` exists. If not, create it with this header: + ``` + commit timestamp correctness geomean_ns worst_ns worst_combo best_ns best_combo peak_mem_mb total_seconds keep description + ``` +3. Baseline trial: + ``` + cargo run --release --bin run_experiment > run.log 2>&1 + ``` + Append a row tagged `keep=baseline`, commit it. + +## Public API contract (must remain stable) + +The bench imports these from `crate::kernels`. You may NOT change their +signatures. You MAY add private helpers, internal data layouts, `unsafe` +blocks, `std::arch` intrinsics under `#[cfg(target_arch = ...)]` gates, +pre-computed state inside `PqKernel`, etc. 
+ +```rust +pub struct PqKernel { /* agent's private fields */ } + +impl PqKernel { + pub fn new(shape: PqShape, codebook: &[f32]) -> Self; + pub fn shape(&self) -> &PqShape; + pub fn distance_table(&self, query: &[f32]) -> Vec<f32>; + pub fn probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>; +} +``` + +Pre-processing in `new` is free — the bench measures `distance_table + +probe_top_k` per query, not per (build + query). Codebook transposes, +cached `c·c`, packed LUTs, etc., should live in `new`. + +## What you can / cannot do + +(See HARNESS.md for the universal table; this is the PQ-L2 specific +addition.) + +- **Cannot** change `PqShape` or the constants in `lib.rs`. They define + the optimization target. +- **Cannot** introduce lossy techniques (LUT u8/u16 quantization, asymmetric + approximation, anything that drops bits relative to the scalar reference). + The correctness phase asserts `max_abs_err ≤ 1e-4` against the scalar + reference; lossy techniques fail this gate. If you want to explore a lossy + track, propose it to the human as a separate kernel surface. +- **Can** mark hot functions `#[inline]`, split them, add private helpers. +- **Can** add `#[cfg(test)] mod tests { ... }` inside `kernels.rs` for in-file + property checks against the scalar path. + +## Lance-PQ-specific priors + +These are the directions that pay off on this kernel shape without +compromising arithmetic accuracy. Pick one hypothesis per trial; don't try +to combine multiple ideas at once. + +- **Codebook layout transpose.** The reference layout is `[m][k][d]`. + Transposing to `[m][d][k]` makes the centroid index `k` contiguous, so you + can SIMD-load 8 centroids' values for one dimension `d` and broadcast the + query's `q[d]` across the lanes. Do the transpose in `PqKernel::new` once. +- **Cache `c·c` per centroid.** The diff–square–sum is + `(q - c)·(q - c) = q·q - 2q·c + c·c`. Hoist `q·q` per sub-vector, + precompute `c·c` once at `new()` time, store next to the codebook. Inner + loop becomes one FMA. 
Watch sign / accumulator ordering so rounding stays + within `MAX_ABS_ERR`. +- **Probe-side code transpose.** Probe is dominated by + `acc += table[m][codes[off+m]]` × `num_sub_vectors`. Transposing codes to + `[m][i]` (one row per sub-quantizer, contiguous over base index) lets you + process 32+ vectors per inner iteration with `vpgatherdq`-style loads. +- **Top-K block-then-merge.** `push()` does a branch + heap sift on every + code. At 20k probes per query × 9 (shape × dist) combos that's the + second-biggest cost after the gather. Block the probe (e.g., 512 codes at + a time), find the local top-K with a branchless pass, then merge into the + global heap. +- **Prefetch.** `_mm_prefetch(codes.as_ptr().add(off + 64), _MM_HINT_T0)` + ahead of the gather is usually a pure win at 20k+ scale. +- **FMA chains for table build.** The diff–square–sum maps cleanly to FMA + on AVX2/NEON. Even without intrinsics, structuring the inner loop so + `rustc` emits FMA helps. +- **Avoid the `Vec<f32>` allocation in the hot path.** `distance_table` allocates + a fresh `Vec<f32>` per call. The public API is fixed (returns `Vec<f32>`), + but you can reuse a thread-local scratch buffer internally and copy to a + `Vec<f32>` at the boundary if it speeds the build. 
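Reviewer sketch (not part of the patch): the "cache `c·c` per centroid" prior in `program.md` rests on the identity `(q - c)·(q - c) = q·q - 2q·c + c·c`. The block below is a small, self-contained check of that identity on one sub-vector. The function names `l2_direct`, `dot`, and `l2_expanded` are hypothetical helpers for illustration and are not part of the `PqKernel` API; the `1e-4` bound mirrors `MAX_ABS_ERR` from `harness-common` but is hard-coded here as an assumption.

```rust
/// Direct diff-square-sum over one sub-vector: sum over d of (q[d] - c[d])^2.
fn l2_direct(q: &[f32], c: &[f32]) -> f32 {
    q.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum()
}

/// Plain dot product; stands in for the per-dimension multiply-add loop.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Expanded form: q·q - 2 q·c + c·c, with q·q and c·c supplied by the caller
/// (q·q hoisted per query sub-vector, c·c precomputed once per centroid).
fn l2_expanded(qq: f32, q: &[f32], c: &[f32], cc: f32) -> f32 {
    qq - 2.0 * dot(q, c) + cc
}

fn main() {
    // Values chosen to be exactly representable in f32.
    let q = [0.5f32, -1.25, 2.0, 0.75];
    let c = [0.25f32, -1.0, 1.5, 1.0];

    let cc = dot(&c, &c); // would be cached at build time
    let qq = dot(&q, &q); // would be hoisted once per query sub-vector

    let direct = l2_direct(&q, &c);
    let expanded = l2_expanded(qq, &q, &c, cc);

    // Assumed tolerance, mirroring MAX_ABS_ERR = 1e-4.
    assert!((direct - expanded).abs() <= 1e-4);
    println!("direct = {direct}, expanded = {expanded}");
}
```

The expanded form subtracts two large terms, so on inputs with a large dynamic range it can lose more precision than the direct form; that is the reason the prior warns about sign and accumulator ordering relative to `MAX_ABS_ERR`.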
diff --git a/research/lance-autoresearch/src/bin/run_experiment.rs b/research/lance-autoresearch/crates/pq-l2/src/bin/run_experiment.rs similarity index 87% rename from research/lance-autoresearch/src/bin/run_experiment.rs rename to research/lance-autoresearch/crates/pq-l2/src/bin/run_experiment.rs index 0f0de21..86ed8b0 100644 --- a/research/lance-autoresearch/src/bin/run_experiment.rs +++ b/research/lance-autoresearch/crates/pq-l2/src/bin/run_experiment.rs @@ -35,18 +35,18 @@ use std::time::Instant; -use lance_autoresearch::inputs::{ +use harness_common::{MAX_ABS_ERR, TIME_BUDGET_SECS, TOPK_DIST_TOL, geomean, peak_rss_mb}; +use pq_l2::inputs::{ DISTRIBUTIONS, DataDistribution, SHAPES, SpeedWorkload, correctness_battery, speed_workloads, }; -use lance_autoresearch::kernels::PqKernel; -use lance_autoresearch::reference::{ScalarReference, max_abs_err, topk_consistent}; -use lance_autoresearch::{MAX_ABS_ERR, PqShape, TOPK_DIST_TOL}; +use pq_l2::kernels::PqKernel; +use pq_l2::reference::{ScalarReference, max_abs_err, topk_consistent}; +use pq_l2::PqShape; // Any constants; the only requirement is that they're pinned across trials so // the inputs and the timings are reproducible. 
const CORRECTNESS_SEED: u64 = 0xC0FF_EEC0_DEBE_EFFE; const SPEED_SEED: u64 = 0x5EED_F1AC_BABE_FACE; -const TIME_BUDGET_SECS: u64 = 600; fn main() { let start = Instant::now(); @@ -210,17 +210,6 @@ fn run_speed(workloads: &[SpeedWorkload]) -> SpeedReport { } } -fn geomean(xs: &[u64]) -> u64 { - if xs.is_empty() { - return 0; - } - let mut sum_ln = 0.0f64; - for &x in xs { - sum_ln += (x.max(1) as f64).ln(); - } - (sum_ln / xs.len() as f64).exp() as u64 -} - fn format_shape(s: &PqShape) -> String { format!("({},{},{})", s.dim, s.num_sub_vectors, s.num_centroids) } @@ -233,26 +222,3 @@ fn format_dist(d: &DataDistribution) -> String { } .to_string() } - -#[cfg(target_os = "linux")] -fn peak_rss_mb() -> f64 { - let Ok(s) = std::fs::read_to_string("/proc/self/status") else { - return 0.0; - }; - for line in s.lines() { - if let Some(rest) = line.strip_prefix("VmPeak:") { - let kb: f64 = rest - .split_whitespace() - .next() - .and_then(|t| t.parse().ok()) - .unwrap_or(0.0); - return kb / 1024.0; - } - } - 0.0 -} - -#[cfg(not(target_os = "linux"))] -fn peak_rss_mb() -> f64 { - 0.0 -} diff --git a/research/lance-autoresearch/src/inputs.rs b/research/lance-autoresearch/crates/pq-l2/src/inputs.rs similarity index 93% rename from research/lance-autoresearch/src/inputs.rs rename to research/lance-autoresearch/crates/pq-l2/src/inputs.rs index f153028..78917e5 100644 --- a/research/lance-autoresearch/src/inputs.rs +++ b/research/lance-autoresearch/crates/pq-l2/src/inputs.rs @@ -16,6 +16,7 @@ //! the codebook is shape-appropriate, not random. use crate::PqShape; +use harness_common::SplitMix64; /// PQ shapes the bench evaluates. The agent's kernel must produce correct /// output and competitive speed on every one. @@ -295,36 +296,6 @@ fn encode(shape: PqShape, n: usize, base: &[f32], codebook: &[f32]) -> Vec { out } -/// SplitMix64 — small, deterministic; bit-for-bit reproducible across machines. 
-struct SplitMix64 { - state: u64, -} - -impl SplitMix64 { - fn new(seed: u64) -> Self { - Self { state: seed } - } - fn next_u64(&mut self) -> u64 { - self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15); - let mut z = self.state; - z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9); - z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB); - z ^ (z >> 31) - } - fn next_f32(&mut self) -> f32 { - let bits = (self.next_u64() >> 40) as u32; - bits as f32 / ((1u32 << 24) as f32) - } - fn next_normal(&mut self) -> f32 { - let mut u1 = self.next_f32(); - if u1 < 1e-7 { - u1 = 1e-7; - } - let u2 = self.next_f32(); - (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() - } -} - fn shape_hash(s: PqShape) -> u64 { (s.dim as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15) ^ (s.num_sub_vectors as u64).wrapping_mul(0xBF58_476D_1CE4_E5B9) diff --git a/research/lance-autoresearch/src/kernels.rs b/research/lance-autoresearch/crates/pq-l2/src/kernels.rs similarity index 100% rename from research/lance-autoresearch/src/kernels.rs rename to research/lance-autoresearch/crates/pq-l2/src/kernels.rs diff --git a/research/lance-autoresearch/src/lib.rs b/research/lance-autoresearch/crates/pq-l2/src/lib.rs similarity index 62% rename from research/lance-autoresearch/src/lib.rs rename to research/lance-autoresearch/crates/pq-l2/src/lib.rs index b050c03..7a0e9bc 100644 --- a/research/lance-autoresearch/src/lib.rs +++ b/research/lance-autoresearch/crates/pq-l2/src/lib.rs @@ -1,17 +1,20 @@ -//! Lance autoresearch harness — public API for the bench binary, benchmarks, and tests. +//! Autoresearch target: Lance PQ L2 distance kernel optimization. //! -//! Contract (Karpathy-style three files): +//! Karpathy-style three-file contract: //! //! - `kernels` — the AGENT'S PLAYGROUND. Modify freely. //! - `reference` — IMMUTABLE. Scalar reference kernel. Defines the math. //! - `inputs` — IMMUTABLE. Diverse test-data + workload generators, //! 
deterministic per fixed seed, varied across the input battery. //! -//! The optimization target is dataset-independent: the agent's kernel must match -//! the scalar reference within `MAX_ABS_ERR` on every input the bench generates, -//! and minimize geomean ns/query across multiple PQ shapes and data -//! distributions. There is no fixed dataset; an "improvement" by construction -//! generalizes across distributions and shapes. +//! The optimization target is dataset-independent: the agent's kernel must +//! match the scalar reference within `harness_common::MAX_ABS_ERR` on every +//! input the bench generates, and minimize geomean ns/query across multiple +//! PQ shapes and data distributions. There is no fixed dataset. +//! +//! Shared utilities (deterministic PRNG, geomean, peak RSS, tolerance +//! constants, time budget) come from the `harness-common` workspace crate. +//! See `../HARNESS.md` for the harness conventions every target follows. pub mod inputs; pub mod kernels; @@ -45,12 +48,3 @@ impl PqShape { self.num_sub_vectors * self.num_centroids * self.sub_vector_dim() } } - -/// Tolerance for the agent kernel's distance values vs. the scalar reference. -/// Loose enough to permit legal SIMD-accumulator reordering; tight enough to -/// catch real arithmetic bugs. -pub const MAX_ABS_ERR: f32 = 1e-4; - -/// Tolerance for top-K *distances* (id sets are compared with tie-tolerance — -/// see `reference::topk_consistent`). 
-pub const TOPK_DIST_TOL: f32 = 1e-4; diff --git a/research/lance-autoresearch/src/reference.rs b/research/lance-autoresearch/crates/pq-l2/src/reference.rs similarity index 100% rename from research/lance-autoresearch/src/reference.rs rename to research/lance-autoresearch/crates/pq-l2/src/reference.rs diff --git a/research/lance-autoresearch/docs/adding-a-target.md b/research/lance-autoresearch/docs/adding-a-target.md new file mode 100644 index 0000000..4e3cecc --- /dev/null +++ b/research/lance-autoresearch/docs/adding-a-target.md @@ -0,0 +1,192 @@ +# Adding a new target + +Walk through this when spinning up a new optimization target (A1 cosine, A4 +bitpack, etc.). It's a `cp -r` plus surgical edits — no architectural +decisions to make per target if the kernel fits the autoresearch shape. + +If your target's per-trial eval is more than ~30 seconds, or the correctness +oracle can't be a deterministic comparison against a scalar reference, this +harness is the wrong fit — see [`design.md`](design.md) "When to revisit" +for the boundary. + +## Steps + +### 1. Pick a template target + +The closest existing target. For now there's just `pq-l2`, but as more land: +- Distance / scoring kernels that take a query and return per-row scores → + template off `pq-l2`. +- Decode kernels that take encoded bytes and return an Arrow array → + template off `bitpack` once it lands. +- Scan / merge kernels → template off `topk-merge` once it lands. + +```bash +cp -r crates/pq-l2 crates/ +``` + +### 2. Rewrite `Cargo.toml` + +```toml +[package] +name = "" +# version, edition, license, publish stay the same +``` + +Add the target to the workspace `members` in the root `Cargo.toml`: + +```toml +[workspace] +members = [ + "crates/harness-common", + "crates/pq-l2", + "crates/", # add this +] +``` + +### 3. Rewrite `src/lib.rs` + +Define the target's `Shape` type (analogue of `PqShape`) and any other types +shared between `kernels.rs` and `reference.rs` and `inputs.rs`. 
Document +which fields are pinned by the harness vs. agent-tunable. + +This file is **immutable** to the agent. The shape parameters define the +optimization target — changing them changes what's being optimized. + +### 4. Rewrite `src/reference.rs` + +Implement the scalar reference kernel — the math, in plain Rust, no SIMD, +no cleverness. This is what the agent's kernel is compared against. Mirror +the public API of `kernels.rs` exactly. + +For float kernels, also export `max_abs_err(a, b)` and `topk_consistent(...)` +(or analogues) — the comparison helpers the bench uses to assert +near-bit-exact equivalence with `harness_common::MAX_ABS_ERR` / +`TOPK_DIST_TOL`. + +For integer / byte kernels, the comparison is simpler — `assert_eq!` on the +returned Arrow array. No tolerance constants needed. + +### 5. Rewrite `src/inputs.rs` + +Two surfaces: + +- `correctness_battery(seed) -> Vec` — diverse shape × + distribution combinations, sized small enough that the correctness phase + finishes in seconds. The point is breadth, not realism. +- `speed_workloads(seed) -> Vec` — larger shape × distribution + combinations sized for stable timings. Aim for total trial wall-clock + ≤ 60s; the agent's iteration latency dominates correctness elsewhere. + +Use `harness_common::SplitMix64` for determinism. Same seed → same battery +across trials. + +### 6. Rewrite `src/kernels.rs` (the agent's playground) + +Implement a clean scalar baseline matching the algorithm shape of the Lance +upstream code. The header comment must: + +- Cite the upstream Lance source (`lance-format/lance` rev / file path) the + algorithm is modeled on. +- Document the public API the bench calls — these are the surfaces the agent + may NOT change. +- List "what you can do" / "what you cannot do" rules specific to this + target. + +The starting kernel must be correct (passes the correctness phase against +`reference.rs`) and lint-clean. The agent's job is to make it faster. + +### 7. 
Rewrite `src/bin/run_experiment.rs` + +Two phases: + +- **Correctness phase:** for each `CorrectnessCase`, run agent kernel + + reference, compare. Any mismatch → print `correctness: fail`, diagnostic + line, exit 2. +- **Speed phase:** for each `SpeedWorkload`, run agent kernel and time per + query / per row / per byte. Aggregate geomean / worst / best across all + combos. Print fixed-format result block. + +Universal output fields (every target) are listed in `HARNESS.md` "The +metric." Add per-target fields above them as needed (e.g., `bit_widths_tested` +for bitpack). + +Use: +- `harness_common::geomean` for the aggregator +- `harness_common::peak_rss_mb` for memory readback +- `harness_common::TIME_BUDGET_SECS` for the time-budget check + +### 8. (Optional) Rewrite `benches/.rs` + +Criterion benchmark with the same kernel calls as `run_experiment` but +under criterion's statistical-sampling harness. Optional — the per-trial +binary is the agent's primary measurement; criterion is for the human's +deeper investigation. + +### 9. Write `program.md` + +Per-target agent skill, layered on top of `HARNESS.md`. Sections: + +- **Setup** — which files to read at session start (always include + `../../HARNESS.md`). +- **Public API contract** — the exact functions / structs the agent must + keep stable. +- **Target-specific priors** — known SIMD techniques for this kernel shape, + algorithmic transformations worth trying, common pitfalls. This is the + highest-leverage content; spend time on it. +- **`results.tsv` header** — the per-target column set. + +### 10. Write the per-target capsule in `docs/targets/.md` + +A short doc covering: + +- What's optimized (one sentence) +- Upstream Lance source pointers (rev, file paths, function names) +- Oracle definition (bit-exact / `max_abs_err`) +- Speed workload shape (what shapes × distributions span) +- Status (candidate / landed / has-results) + +### 11. 
Verify end-to-end + +```bash +cargo build --release -p +cargo clippy --release -p --all-targets -- -D warnings +cargo run --release --bin run_experiment -p +``` + +The baseline trial must: +- Print `correctness: pass` +- Exit 0 +- Finish within ~60s +- Reference a sensible `geomean_ns_per_*` baseline number + +Smoke-test the gate: deliberately break `kernels.rs` (e.g., return constant +zero), confirm the trial exits 2 with `correctness: fail`. Restore. + +### 12. Add the target row to the top-level `README.md` + +In the targets table at the top of the README, change the new target's row +from `candidate` to `landed`. + +### 13. Commit + +One commit for the target's scaffolding. Don't bundle multiple targets in +one commit — each target's history should be independently revertible. + +## Common gotchas + +- **Forgetting the empty `[workspace]` block** at the root means cargo walks + up to the omnigraph parent workspace. Already handled; just don't remove it. +- **Per-target `Cargo.toml` referencing the wrong `harness-common` path.** + Use `harness-common = { path = "../harness-common" }`. +- **Picking a `SHAPES` set that's too small.** Three shapes is the floor; + with one shape an agent could specialize and pass, with two there's not + enough variety. Ensure the shapes span at least one "outlier" (e.g., for + PQ, one shape with `sub_vector_dim != 8`). +- **Correctness battery too narrow.** Five distributions is the floor: at + minimum Gaussian / uniform / sparse / large-dynamic-range / mostly-zero (or + the integer analogue: uniform / clustered / skewed / few-distinct / + monotonic). +- **Trial time too long.** If the speed phase exceeds ~60s, agent iteration + rate drops below useful. Reduce workload sizes; the speed metric is + per-operation, not per-workload, so absolute size doesn't change the + comparison. 
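The `SHAPES` gotcha above can be checked mechanically. The following is a rough sketch, assuming a hypothetical `Shape` struct whose fields mirror `PqShape` and a hypothetical `check_shape_set` helper; neither exists in the workspace. The 256-centroid cap is inferred from the `codes: &[u8]` parameter in the PQ API and would not apply to other targets.

```rust
/// Hypothetical stand-in for a target's shape type; fields mirror `PqShape`.
#[derive(Clone, Copy, Debug)]
struct Shape {
    dim: usize,
    num_sub_vectors: usize,
    num_centroids: usize,
}

impl Shape {
    fn sub_vector_dim(&self) -> usize {
        self.dim / self.num_sub_vectors
    }
}

/// Returns a description of the first violated rule, or `Ok(())`.
/// `common_sub_dim` is the sub-vector width most shapes share (8 for PQ).
fn check_shape_set(shapes: &[Shape], common_sub_dim: usize) -> Result<(), String> {
    if shapes.len() < 3 {
        return Err(format!("only {} shapes; three is the floor", shapes.len()));
    }
    for s in shapes {
        if s.num_sub_vectors == 0 || s.dim % s.num_sub_vectors != 0 {
            return Err(format!("{:?}: dim must divide evenly into sub-vectors", s));
        }
        // Codes are one byte each in the PQ API, so a code can index at most 256 centroids.
        if s.num_centroids == 0 || s.num_centroids > 256 {
            return Err(format!("{:?}: num_centroids must be in 1..=256", s));
        }
    }
    if shapes.iter().all(|s| s.sub_vector_dim() == common_sub_dim) {
        return Err("no outlier shape: every sub_vector_dim is the common one".to_string());
    }
    Ok(())
}
```

With the three `pq-l2` shapes, the `(256, 16, 256)` entry is the outlier because its `sub_vector_dim` is 16 rather than 8.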
diff --git a/research/lance-autoresearch/docs/design.md b/research/lance-autoresearch/docs/design.md new file mode 100644 index 0000000..8dc2087 --- /dev/null +++ b/research/lance-autoresearch/docs/design.md @@ -0,0 +1,152 @@ +# Design — why the workspace is shaped this way + +This document records the rationale for the multi-target workspace shape so +future contributors don't relitigate the early decisions. + +## The thing we're building + +A multi-target harness for LLM-driven optimization of Lance hot-path kernels. +"Multi-target" because Lance has many such kernels — distance kernels in +`lance-linalg`, decoders in `lance-encoding`, scan/merge kernels — and the +right harness shape is identical across them: bit-exact correctness oracle, +geomean-across-distributions speed metric, single-agent autoresearch loop. + +The original [research note](../../docs/research/llm-evolutionary-sampling.md) +enumerates ten such candidates (A1–A10) clustered by Lance crate. The first +landed (`pq-l2`) proves the harness shape; the rest follow the same template. + +## Decision: workspace, not single crate + +A single crate exposing multiple binaries (`run_experiment_pq_l2`, +`run_experiment_bitpack`, ...) was the obvious-looking alternative. Rejected +for three reasons: + +1. **Per-target deps differ.** FSST decode wants different deps than PQ + kernels (a string-compression library vs. just `f32` math). A single + `Cargo.toml` would either bundle every target's deps into every build or + require fine-grained features. Workspaces give per-target `Cargo.toml` + for free. + +2. **Edit isolation.** The agent edits one target's `kernels.rs` at a time. + In a single crate, `kernels.rs` files would collide on path or have to live + in target-specific submodules with target-specific naming. Per-target + crates put `src/kernels.rs` at the natural location every time and let the + agent navigate one tree per session. + +3. 
**Build / test isolation.** `cargo build -p pq-l2` builds only what's + needed for the PQ L2 target; `cargo test -p pq-l2` runs only its tests. + The agent's iteration loop is faster because it doesn't pay for unrelated + targets' compile time. + +The downside — workspace boilerplate, per-target `Cargo.toml`, the empty +`[workspace]` block at the workspace root that prevents cargo from walking up +to the parent omnigraph workspace — is a one-time cost. Per-target overhead +of adding a new target is one `cp -r` plus path edits. + +## Decision: shared `harness-common` crate, no `Target` trait + +A `Target` trait was the obvious-looking other alternative — express the +common loop generically, plug in target-specific types. Rejected because: + +1. **Kernel signatures vary too much for a single trait shape.** PQ + `probe_top_k` returns `Vec<(u32, f32)>`. Bitpack decode returns an + `IntArray`. FSST decode returns `Vec`. Predicate evaluation returns a + `BooleanArray`. A unifying trait would need erased boxing or a wide + associated-type surface, both of which obscure the actual hot path the + agent is editing. + +2. **The orchestration that *is* shared is small.** A deterministic PRNG + (~30 lines), a geomean (~10 lines), peak RSS readback (~20 lines), four + tolerance constants. Total ~70 lines of shared code. Building a trait + abstraction over 70 lines costs more than it saves. + +3. **The output format isn't worth sharing.** Each target's + `run_experiment.rs` prints a fixed-format result block; the *fields* + differ per target (PQ shapes vs bit widths vs distribution kinds). A + shared formatter would be either trivial wrapping of `println!` (no + value) or a complicated builder API (negative value). + +`harness-common` therefore exposes plumbing only: `SplitMix64`, `geomean`, +`peak_rss_mb`, `MAX_ABS_ERR`, `TOPK_DIST_TOL`, `TIME_BUDGET_SECS`. Each +target consumes what it needs. The shared loop contract is documented in +`HARNESS.md`, not encoded in code. 
+ +## Decision: per-target `program.md` + shared `HARNESS.md` + +The agent reads two files at session start: + +- `HARNESS.md` (workspace-level) — universal: the loop, the metric, the + edit-permission table, hygiene rules. +- `crates//program.md` (per-target) — specific: the kernel API the + agent must keep stable, target-specific priors (which SIMD intrinsics tend + to win on this kernel shape), the `results.tsv` column header. + +The shape mirrors how Karpathy's `nanochat-research` `program.md` works, +factored across the dimension that varies (per target) vs. doesn't (the loop +itself). Two files instead of one because copy-pasting the universal loop +into every `program.md` makes them drift. + +## Decision: dataset-independent oracle every target + +The first iteration of the harness used recall@K vs. SIFT1M as the +correctness oracle. We replaced it with bit-exact (or near-bit-exact for +floats) match against a scalar reference because: + +1. The agent had incentive to overfit lossy approximations to the dataset's + cluster structure, even though we didn't ask for that. +2. SIFT1M is 250 MB and a hassle to download; the harness benefited from + being self-contained. +3. Mathematical equivalence is a strictly stronger contract than recall + preservation: if the kernel is bit-equivalent to the scalar reference, + recall is automatically identical because the distance values are the + same. There's nothing recall@K catches that bit-exactness doesn't. + +This decision generalizes to every target. Decode kernels get strict bitwise +equality (no float arithmetic involved). Distance and BM25 kernels get +`max_abs_err ≤ 1e-4` (loose enough for SIMD-accumulator reordering, tight +enough for real bugs). Targets that genuinely require lossy techniques to +get headroom — there might be some; LUT u8 quantization in PQ is one — go +in a separate "lossy track" with a recall-based oracle on diverse datasets, +not the bit-exact track. 
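The float oracle described in this section can be illustrated with a minimal sketch. The names `max_abs_err_sketch`, `TOL`, and `passes_float_oracle` are hypothetical; the real `max_abs_err` helper lives in each target's `reference.rs` and its signature is not shown in this diff, so the one below is an assumption.

```rust
/// Value the document quotes for `MAX_ABS_ERR`; redefined here so the sketch stands alone.
const TOL: f32 = 1e-4;

/// Largest element-wise absolute difference between two equal-length slices.
/// Returns NaN as soon as any difference is NaN, so a NaN output can never pass.
fn max_abs_err_sketch(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "outputs must have the same length");
    let mut worst = 0.0f32;
    for (x, y) in a.iter().zip(b) {
        let err = (x - y).abs();
        if err.is_nan() {
            return f32::NAN;
        }
        worst = worst.max(err);
    }
    worst
}

/// True when the agent output is within tolerance of the scalar reference.
/// `NaN <= TOL` is false, so NaN results fail the gate.
fn passes_float_oracle(agent: &[f32], reference: &[f32]) -> bool {
    max_abs_err_sketch(agent, reference) <= TOL
}
```

The explicit NaN check matters because `f32::max` returns the non-NaN operand, so a plain fold over `max` would silently discard a NaN difference.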
+ +## Decision: per-target speed measurement spans multiple shapes × distributions + +A single dataset would let an agent overfit to that dataset's distribution. +Each target's `inputs.rs` therefore generates speed workloads across: + +- Multiple **shapes** of the kernel's domain (PQ: `(dim, num_sub_vectors, + num_centroids)`; bitpack: bit width; etc.). Captures how the kernel + performs at different sizes Lance users actually encounter. +- Multiple **data distributions** (Gaussian / uniform / sparse for floats; + uniform / skewed / clustered for integers; etc.). Captures whether the + kernel's win is data-distribution-conditional. + +The keep gate uses geomean across all (shape × distribution) combos with a +worst-case guard: a kernel that wins on one combo and regresses ≥5% on +another fails to keep, even if the geomean improves. This forces wins to +generalize. + +## What's deliberately not abstracted + +- **Output format.** Each target prints its own field block. See above. +- **`TopKHeap` and other small data structures.** When two targets need a + `TopKHeap`, the second one copies the first's. Three copies of a 30-line + struct is cheaper than one trait-erased indirection. +- **Test data shapes.** Each target's `inputs.rs` knows its own kernel's + fixture shape. Sharing would require a generic `Fixture` trait, + which would either be too narrow (forces every kernel into a `query + + workload` shape) or too wide (gives up the type safety that makes the + bench's correctness check obvious). + +## When to revisit + +If the workspace grows past ~6 active targets and we notice we're +copy-pasting more than ~50 lines of `run_experiment.rs` boilerplate per new +target, consider extracting a shared `RunExperiment` helper that takes +closures for the correctness and speed phases. Don't pre-extract — wait +until the duplication is real and visible. 
+ +If we add a target that genuinely doesn't fit the autoresearch loop (eval +crosses ~30s; tournament sampling becomes the right control loop), it +belongs in a separate workspace, not this one. The boundary line is the +loop shape, not the target type. diff --git a/research/lance-autoresearch/docs/targets/pq-l2.md b/research/lance-autoresearch/docs/targets/pq-l2.md new file mode 100644 index 0000000..7ac7daf --- /dev/null +++ b/research/lance-autoresearch/docs/targets/pq-l2.md @@ -0,0 +1,98 @@ +# Target: `pq-l2` + +PQ L2 distance kernel for f32 dense vectors — the asymmetric-distance compute +that runs on every `IvfPq` / `IvfHnswPq` ANN query in Lance. + +## Status + +**Landed.** Baseline scalar kernel committed; the agent's job is to find +generalizable speedups against it. + +## What's optimized + +Two functions in `crates/pq-l2/src/kernels.rs`: + +- `PqKernel::distance_table(query)` — builds the asymmetric distance table + (`[num_sub_vectors][num_centroids]`) for one query against the codebook. + Cost: `num_sub_vectors × num_centroids × sub_vector_dim` MAC ops per query. +- `PqKernel::probe_top_k(table, codes, num_vectors, k)` — probes + `num_vectors` PQ-encoded vectors, accumulates per-vector distance via + `num_sub_vectors` table lookups, returns top-K. Cost: + `num_vectors × num_sub_vectors` lookups + heap maintenance per query. + This is the dominant cost at typical scales. + +`PqKernel::new(shape, codebook)` is also editable — the agent may pre-process +the codebook (transpose layout, cache `c·c` for the FMA trick, pack the LUT) +and amortize over queries; build cost is excluded from per-query timing. + +## Upstream Lance source + +Algorithmically modeled on `lance-linalg::distance::l2` plus the PQ +asymmetric-distance compute in `lance::index::vector::pq`. Specifically the +f32 dense path; the byte / fixed-point variants are out of scope for this +target. 
+ +When porting a winning kernel upstream: +- File: `lance-linalg/src/distance/l2.rs` and the L2-specific path in + `lance/src/index/vector/pq.rs`. +- License: Apache-2.0 (matches our dual MIT/Apache-2.0 → upstream takes + the Apache half). + +## Oracle + +**Float-accumulator-tolerance match against scalar reference.** Per +`harness_common::MAX_ABS_ERR = 1e-4`: + +- Distance table values must match the scalar reference within `1e-4` per + element. Loose enough for legal SIMD-accumulator reordering, tight enough + to catch real arithmetic bugs. +- Top-K results compared with `harness_common::TOPK_DIST_TOL = 1e-4` plus + tie-tolerant id substitution (any permutation within a tied-distance band + is accepted). + +The correctness phase asserts both on every input combination — five input +distributions × three PQ shapes = 15 cases per trial. + +## Speed workload + +Three shapes: +- `(128, 16, 256)` — SIFT-like; sub_vector_dim = 8 +- `(256, 16, 256)` — sub_vector_dim = 16 +- `(768, 96, 256)` — BERT-base-like; large codebook + +Three data distributions: +- `Clustered` — 32 cluster centers, low intra-cluster noise +- `Uniform` — uniform on [-1, 1] +- `Sparse` — 90% zeros + 10% Gaussian + +Per (shape × distribution): 20,000 base vectors PQ-encoded, 32 queries +timed. Total trial wall-clock: ~30–60s on a developer laptop. + +## Output fields + +``` +correctness: pass | fail +shapes_tested: (128,16,256) (256,16,256) (768,96,256) +distributions_tested: clustered uniform sparse +geomean_ns_per_query: +worst_ns_per_query: (, ) +best_ns_per_query: (, ) +per_combo_geomean_ns: + (...) +peak_mem_mb: +total_seconds: +``` + +## Known headroom (priors for the agent) + +See `crates/pq-l2/program.md` "Lance-PQ-specific priors" for the canonical +list. Highlights: + +- Codebook layout transpose (`[m][k][d]` → `[m][d][k]`) for SIMD-broadcast + table build. +- Cache `c·c` per centroid in `new()` so the inner loop is `q·q − 2qc + c·c` + (one FMA chain). 
+- Probe-side code transpose so the inner loop processes 32+ vectors per + iteration via gather. +- Top-K block-then-merge instead of per-vector heap insert. +- Prefetch on `codes[i+64]` ahead of gather. diff --git a/research/lance-autoresearch/program.md b/research/lance-autoresearch/program.md deleted file mode 100644 index 73f73e3..0000000 --- a/research/lance-autoresearch/program.md +++ /dev/null @@ -1,172 +0,0 @@ -# Lance PQ L2 kernel research — agent instructions - -You are an autonomous research assistant. Your job is to improve `src/kernels.rs` -so that `cargo run --release --bin run_experiment` reports a **lower -`geomean_ns_per_query`** while: - -1. The **correctness phase passes** — your kernel's distance values must match the - scalar reference within `MAX_ABS_ERR = 1e-4`, and the top-K must be - tie-tolerant equivalent on every input the bench generates. -2. The `worst_ns_per_query` does **not regress more than 5%** against the - last-kept kernel — if you win on one (shape × distribution) and lose - significantly on another, the change isn't a generalizable improvement. - -This bench is intentionally **dataset-independent**: there is no fixed dataset. -The correctness oracle is mathematical equivalence to the scalar reference, -checked across multiple PQ shapes and synthetic input distributions -(Gaussian / uniform / sparse / large-dynamic-range / mostly-zero). The speed -oracle is the geomean across multiple shapes × distributions, with worst-case -guarded. A win that depends on a specific data distribution or PQ shape will -fail to clear the bar by construction. - -Read this file end-to-end before doing anything else. Then run setup, then the loop. - -## Setup (do once at the start of every session) - -1. Read these files, in this order: - - `README.md` - - `program.md` (this file) - - `src/lib.rs` - - `src/kernels.rs` *(the only file you may edit)* - - `src/reference.rs` - - `src/inputs.rs` - - `src/bin/run_experiment.rs` -2. 
Ensure `results.tsv` exists. If not, create it with this header line: - ``` - commit timestamp correctness geomean_ns worst_ns worst_combo best_ns best_combo peak_mem_mb total_seconds keep description - ``` -3. Run the baseline trial: `cargo run --release --bin run_experiment > run.log 2>&1`. - Confirm `correctness: pass`. Parse `run.log` and append a row to `results.tsv` - with `keep=baseline` and `description="seeded scalar PQ-L2 baseline"`. This - is your reference number. -4. Commit the baseline row with a one-line message like `baseline: `. - -## What you CAN do - -- Modify **`src/kernels.rs`** freely. You may: - - Pre-process the codebook in `PqKernel::new` (transpose layouts, cache - `c·c` for the FMA trick, pack the codebook for register-resident lookup, - etc.). This cost is paid once per dataset and amortized across queries — - the bench measures per-query, not per-(build + query). - - Reorder loops, switch internal data layouts, drop down to `std::arch` - intrinsics under `#[cfg(target_arch = ...)]` gates. **Always keep a - portable scalar fallback** so the kernel compiles everywhere. - - Use `unsafe` if needed; document the invariants you're relying on. - - Mark hot functions `#[inline]`; add private helpers freely. - - Add `#[cfg(test)] mod tests { ... }` inside `src/kernels.rs` if you want - in-file property checks. - -## What you CANNOT do - -- Do **not** modify `src/lib.rs` (`PqShape` and the tolerance constants are - shared with the immutable scaffolding). -- Do **not** modify `src/bin/run_experiment.rs`, `src/reference.rs`, - `src/inputs.rs`, `benches/pq_l2.rs`, or `Cargo.toml`. -- Do **not** add new crate dependencies. 
-- Do **not** alter the public API of `kernels::PqKernel`: - - `PqKernel::new(shape: PqShape, codebook: &[f32]) -> Self` - - `PqKernel::shape(&self) -> &PqShape` - - `PqKernel::distance_table(&self, query: &[f32]) -> Vec` - - `PqKernel::probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>` -- Do **not** introduce lossy techniques (LUT u8/u16 quantization, asymmetric- - distance approximation, etc.) — the correctness phase asserts exact-up-to-ε - match against the scalar reference. If you want to explore a lossy track, - surface that in a separate kernel and propose a track extension. - -## The metric - -Minimize `geomean_ns_per_query` (geometric mean of per-query wall-clock across -all timed queries, all shapes, all distributions) subject to: - -1. Correctness phase: **pass** (exit-2 otherwise). -2. `worst_ns_per_query` ≤ 1.05 × the last-kept kernel's worst. -3. `total_seconds` ≤ 600. -4. Build is clean: `cargo build --release` succeeds, `cargo clippy --release - --all-targets -- -D warnings` reports zero issues. - -Ties break toward simpler code. If two kernels report the same speed within -~3% noise, prefer fewer lines / less `unsafe`. - -## Lance-PQ-specific priors (lossless directions) - -These directions are known to pay off without compromising arithmetic accuracy. -Pick one hypothesis at a time; implement; measure; decide. - -- **Codebook layout.** The reference layout is `[m][k][d]`. For a fixed query, - iterating over centroids stays in cache, but the inner loop over `d` is - short. Transposing to `[m][d][k]` lets you SIMD-load 8 `(query - centroid)` - lanes across `d` and broadcast over `k`. Do the transpose in `PqKernel::new` - once. -- **Cache `c·c`.** The diff–square–sum is `(q - c)·(q - c) = q·q - 2qc + c·c`. - Hoist `q·q` per sub-vector, precompute `c·c` once at codebook-load time. - Inner loop becomes one FMA (`-2qc + cc`). Watch the sign / accumulator - ordering so the rounding stays within tolerance. 
-- **Probe layout.** The probe is dominated by `acc += table[m][codes[off+m]]` - × `num_sub_vectors`. Transposing codes to `[m][i]` (one row per sub-quantizer, - contiguous over base index) lets you process up to 32+ vectors per inner - iteration with `vpgatherdq`-style loads. -- **Top-K integration.** `push()` does a branch + heap sift on every code. - At 50k probes per query × 9 (shape × dist) combos that's the second-biggest - cost after the gather. Block the probe (e.g., 512 codes at a time), find the - local top-K with a branchless pass, then merge into the global heap. -- **Prefetch.** A `_mm_prefetch(codes.as_ptr().add(off + 64), _MM_HINT_T0)` - ahead of the gather is usually pure win at 50k+ scale where codes don't all - fit in L2. -- **FMA chains for table build.** The diff–square–sum maps cleanly to FMA on - AVX2/NEON. Even without intrinsics, structuring the inner loop so `rustc` - emits FMA helps. -- **Avoid the `Vec` allocation in the hot path.** `distance_table` allocates a - fresh `Vec` per call. Returning a fixed-capacity buffer is a public-API - change you can't make — but you can reuse a thread-local scratch buffer - internally if it speeds the build. - -## The loop - -Once setup is done, repeat indefinitely: - -1. **Observe state.** Read the last ~5 rows of `results.tsv`. Note which ideas - have been tried, what won, what regressed. Form a hypothesis with one - sentence stating the change and the predicted effect on speed and - correctness. -2. **Edit `src/kernels.rs`.** Keep the diff focused on the one hypothesis. -3. **Build and lint.** - ``` - cargo build --release - cargo clippy --release --all-targets -- -D warnings - ``` - If either fails, fix and try again — do not commit broken state. -4. **Run the trial.** - ``` - cargo run --release --bin run_experiment > run.log 2>&1 - ``` -5. **Parse the result.** Extract `correctness`, `geomean_ns_per_query`, - `worst_ns_per_query` (with combo), `peak_mem_mb`, `total_seconds`. Compute - deltas vs. 
baseline. -6. **Decide keep or revert.** - - **Keep** iff: `correctness: pass`, geomean strictly better than the - last-kept row (allow ~1% noise band), and `worst_ns_per_query` ≤ 1.05 × - last-kept's worst. - - **Revert** otherwise: `git restore src/kernels.rs` (or commit and - `git revert` if you want the revert in history). Note what failed. -7. **Log.** Append one row to `results.tsv`: - ``` - - ``` -8. **Commit.** One-line message describing the change and the headline number, - e.g. `transpose codebook in new(); 18.2k → 14.1k geomean ns (worst -8%)`. - -## Hygiene - -- Always commit `src/kernels.rs` changes; never commit `results.tsv` or - `run.log` (they're gitignored). -- If a change fails to build, do not commit. Iterate until it builds, or - revert cleanly. -- If two consecutive ideas regress, take a beat: re-read the last ~10 rows of - `results.tsv` and update your mental model before proposing the next. -- Per-trial cap: 10 minutes. If `cargo run` is still going after 10 min, kill it - and mark the trial as `timeout`. - -## Never stop - -Keep going until interrupted. Each loop iteration is one hypothesis, one edit, -one measurement, one commit. No multi-step plans across iterations.
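The keep-or-revert rule from step 6 of the loop can be sketched as a pure function. `TrialResult` and `should_keep` are hypothetical names, and the reading of the "~1% noise band" is an assumption: here the candidate must beat the last-kept geomean by more than the band.

```rust
/// Hypothetical summary of one parsed `run.log`.
#[derive(Clone, Copy, Debug)]
struct TrialResult {
    correctness_pass: bool,
    geomean_ns: u64,
    worst_ns: u64,
}

/// Keep iff correctness passes, the geomean improves by more than `noise_band`
/// (e.g. 0.01 for 1%), and the worst combo regresses by at most 5%.
fn should_keep(candidate: &TrialResult, last_kept: &TrialResult, noise_band: f64) -> bool {
    if !candidate.correctness_pass {
        return false;
    }
    let geomean_ok =
        (candidate.geomean_ns as f64) < (last_kept.geomean_ns as f64) * (1.0 - noise_band);
    let worst_ok = (candidate.worst_ns as f64) <= (last_kept.worst_ns as f64) * 1.05;
    geomean_ok && worst_ok
}
```

A trial that improves the geomean but pushes `worst_ns` more than 5% above the last-kept value is rejected, which is what forces wins to generalize across shape × distribution combos.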