mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-09 01:35:18 +02:00
research: restructure lance-autoresearch as multi-target workspace
The original lance-autoresearch was one Cargo crate optimizing one Lance
kernel (PQ L2 distance). With 9+ candidate targets enumerated in the research
note, a single-crate shape doesn't scale: per-target deps will collide, the
agent's edits to one target's kernels.rs would conflict with another's lib
path, and build/test isolation is lost. Restructure into a Cargo workspace.
Layout:

  research/lance-autoresearch/
  ├── Cargo.toml               (workspace root)
  ├── README.md                (target table, contract overview, repo layout)
  ├── HARNESS.md               (universal loop contract every target inherits)
  ├── crates/
  │   ├── harness-common/      (shared: SplitMix64, geomean, peak RSS,
  │   │                         MAX_ABS_ERR, TOPK_DIST_TOL, TIME_BUDGET_SECS)
  │   └── pq-l2/               (the landed target; was the previous single crate)
  └── docs/
      ├── design.md            (rationale for workspace shape, no Target trait)
      ├── adding-a-target.md   (step-by-step workflow for new targets)
      └── targets/pq-l2.md     (per-target capsule)

Decisions documented in docs/design.md:
- Workspace, not single crate: per-target Cargo.toml so deps don't collide;
per-target src tree so agent edits don't conflict; per-target build/test
isolation for faster agent iteration.
- harness-common as a plumbing-only crate (PRNG, geomean, peak RSS, tolerance
constants, time budget). Intentionally NO Target trait - decode kernel
signatures and distance kernel signatures differ enough that a unifying
trait would either bloat or require erased boxing. Each target is its own
natural shape.
- Per-target program.md + shared HARNESS.md: the loop contract is universal,
the priors and API spec are per-target. Two files instead of one because
copy-pasting the universal loop into every program.md would drift.
pq-l2 refactor:
- src/* moved into crates/pq-l2/src/* via git mv (preserves history)
- crate renamed lance-autoresearch -> pq-l2
- SplitMix64, geomean, peak_rss_mb, MAX_ABS_ERR, TOPK_DIST_TOL,
TIME_BUDGET_SECS now imported from harness-common (drops ~70 lines of
duplication that would have been copy-pasted into every new target)
- program.md trimmed: setup/loop/hygiene moved to HARNESS.md; only the
PQ-L2-specific API contract and SIMD priors remain
- Cargo.toml depends on harness-common via path; workspace.dependencies
pins criterion uniformly across targets
The 10 candidate targets from the research note (A1 cosine/dot/hamming, A2
IVF partition select, A3 FTS BM25, A4 bitpack decode, A5 dictionary decode,
A6 FSST decode, A7 take/gather, A8 predicate eval, A9 posting list intersect,
A10 top-K merge) are listed in README.md's target table as "candidate"; each
gets a docs/targets/<name>.md capsule when it's spun up. docs/adding-a-target.md
documents the cp -r + edit-Cargo.toml + rewrite-three-files workflow.
Verified end-to-end:
- cargo build --release: clean, both crates compile
- cargo clippy --release --workspace --all-targets -- -D warnings: clean
- cargo test --release --workspace: 6/6 pass (4 harness-common + 2 pq-l2)
- cargo run --release --bin run_experiment -p pq-l2: correctness pass,
geomean ~880k ns, exit 0, ~30s wall-clock
- omnigraph parent workspace unchanged (research/ excluded as before)
https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
parent 92ce8f1e7f
commit 0d72cc69fb
21 changed files with 1012 additions and 366 deletions

@@ -1,32 +1,14 @@
# Empty `[workspace]` section so cargo treats this directory as its own
# workspace root and does NOT walk up to the parent omnigraph workspace.
# Without this, cargo from inside `research/lance-autoresearch/` will try to
# resolve omnigraph's dependencies even though we're excluded as a member.
[workspace]
resolver = "2"
members = [
    "crates/harness-common",
    "crates/pq-l2",
]

[package]
name = "lance-autoresearch"
version = "0.1.0"
edition = "2024"
license = "MIT OR Apache-2.0"
description = "Autoresearch-style harness for evolving Lance PQ L2 distance kernels via LLM agents."
publish = false

[lib]
path = "src/lib.rs"

[[bin]]
name = "run_experiment"
path = "src/bin/run_experiment.rs"

[[bench]]
name = "pq_l2"
harness = false

[dependencies]
# Each per-target crate sets its own deps. Shared deps below pin versions
# uniformly across targets so the workspace lockfile stays clean.
[workspace.dependencies]
anyhow = "1"

[dev-dependencies]
criterion = { version = "0.5", default-features = false, features = ["plotters", "cargo_bench_support"] }

[profile.release]

137
research/lance-autoresearch/HARNESS.md
Normal file
@@ -0,0 +1,137 @@
# HARNESS — shared loop contract for every lance-autoresearch target

This document is the universal part of every target's agent instructions. Each
target's `program.md` is a thin layer of *target-specific priors and API spec*
on top of the conventions below. The agent reads `HARNESS.md` and the target's
`program.md` at the start of every session.

## What this harness is

A single agent (you) edits one file in one target crate to optimize a Lance
kernel. Per trial, you build, run a binary that exercises the kernel against
diverse inputs, parse a fixed-format output block, and decide keep-or-revert.

This is a Karpathy-style autoresearch loop. It assumes:

- Per-trial eval is **seconds-scale**. Long enough to measure, short enough to
  iterate hundreds of times in a session.
- The kernel has a **deterministic correctness oracle** — a scalar reference
  that produces the same answer to compare against.
- The optimization target is **dataset-independent**: the harness generates
  diverse inputs each trial, so wins generalize across distributions and
  shapes by construction.

Targets that don't fit these constraints (index-build parameter tuning,
plan-patching, anything where eval is minutes-to-hours) belong in the
BauplanLabs tournament-loop shape, not this harness. See `docs/design.md` for
the boundary.

## What's editable, per target

| Path | Mutability | Why |
|---|---|---|
| `crates/<target>/src/kernels.rs` | **mutable** | Your playground. The whole point. |
| `crates/<target>/src/reference.rs` | immutable | The oracle. Touching it makes wins meaningless. |
| `crates/<target>/src/inputs.rs` | immutable | The fixture generator. Touching it makes timings incomparable across trials. |
| `crates/<target>/src/lib.rs` | immutable | Shared types pinned by the bench (`PqShape` etc.). |
| `crates/<target>/src/bin/run_experiment.rs` | immutable | The trial harness. |
| `crates/<target>/benches/*.rs` | immutable | Criterion bench, optional read-only reference. |
| `crates/<target>/Cargo.toml` | immutable | Adding deps changes the optimization target. |
| `crates/<target>/program.md` | human-iterated between runs | Not edited by you in-loop; the human refines it. |
| `crates/<target>/results.tsv` | append-only | Your audit log. Gitignored. |
| `crates/harness-common/**` | immutable | Workspace-shared infrastructure. |
| `HARNESS.md` (this file) | immutable | Workspace-shared loop contract. |

You may add `#[cfg(test)] mod tests { ... }` inside `kernels.rs` for in-file
property checks. You may NOT add new crate dependencies. You may NOT use
unsafe-only-on-broken-assumptions tricks (e.g., assuming a fixture invariant
that holds today but isn't documented).

## The metric

Every target's `run_experiment` binary prints a fixed-format output block ending
with these universal fields:

- `correctness:` — `pass` or `fail`. Set by comparing your kernel against the
  scalar reference on every input the bench generates.
- `geomean_ns_per_*:` — geometric mean of per-operation wall-clock across all
  timed operations.
- `worst_ns_per_*:` — slowest combo's geomean.
- `peak_mem_mb:` — process RSS high-water-mark.
- `total_seconds:` — trial wall-clock.

A kernel is **kept** iff:

1. `correctness: pass` (any failure → `std::process::exit(2)`).
2. `geomean_ns_per_*` strictly better than the previous best-kept kernel
   (allow ~1% noise band).
3. `worst_ns_per_*` ≤ 1.05 × the previous best-kept kernel's worst.
4. `total_seconds` ≤ 600 (the per-trial cap; exceed it → `std::process::exit(3)`).
5. Build clean: `cargo build --release` and
   `cargo clippy --release --all-targets -- -D warnings` both succeed.

Ties break toward simpler code: same speed within ~3% noise → fewer lines /
less `unsafe` wins.
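The five keep rules above can be read as one pure predicate. The sketch below is only an illustration, not code from the harness: `TrialResult`, `should_keep`, and the constant names are hypothetical, and it assumes one particular reading of "allow ~1% noise band" (a candidate must beat the best-kept geomean by more than 1%).

```rust
/// Hypothetical summary of one trial's universal fields (names are illustrative).
#[derive(Debug, Clone, Copy)]
pub struct TrialResult {
    pub correctness_pass: bool,
    pub geomean_ns: f64,
    pub worst_ns: f64,
    pub total_seconds: f64,
    pub build_and_clippy_clean: bool,
}

/// Assumed noise band: the geomean must improve by more than 1% to count.
const NOISE_BAND: f64 = 0.01;
/// Worst-case guard from rule 3.
const WORST_GUARD: f64 = 1.05;
/// Per-trial cap from rule 4, in seconds.
const TIME_CAP_SECS: f64 = 600.0;

/// Returns true when `candidate` satisfies all five keep rules relative to `best`.
pub fn should_keep(candidate: &TrialResult, best: &TrialResult) -> bool {
    candidate.correctness_pass
        && candidate.build_and_clippy_clean
        && candidate.total_seconds <= TIME_CAP_SECS
        && candidate.geomean_ns < best.geomean_ns * (1.0 - NOISE_BAND)
        && candidate.worst_ns <= best.worst_ns * WORST_GUARD
}
```

Keeping the decision in one function like this makes the thresholds easy to audit; the tie-break toward simpler code is a human judgment and is deliberately left out of the sketch.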

## The loop

After reading `HARNESS.md` and the target's `program.md`:

1. **Setup (once per session).** Confirm `results.tsv` exists; if not, create
   it with a per-target header (the target's `program.md` defines the columns).
   Run the baseline trial:
   ```
   cargo run --release --bin run_experiment -p <target> > run.log 2>&1
   ```
   Append a row tagged `keep=baseline` and commit it.

2. **Observe state.** Read the last ~5 rows of `results.tsv`. Note which ideas
   have been tried, what won, what regressed. Form one hypothesis with one
   sentence stating the change and the predicted effect on speed and
   correctness.

3. **Edit `kernels.rs`.** Keep the diff focused on the one hypothesis.

4. **Build and lint.**
   ```
   cargo build --release
   cargo clippy --release --all-targets -- -D warnings
   ```
   If either fails, fix and retry. Do not commit broken state.

5. **Run the trial.**
   ```
   cargo run --release --bin run_experiment -p <target> > run.log 2>&1
   ```

6. **Parse and decide.** Extract the universal fields plus any per-target
   fields. Compute deltas vs. the last-kept row. Apply the keep criteria above.

7. **Log.** Append one row to `results.tsv` matching the target's header.

8. **Commit.** One-line message describing the change and the headline number,
   e.g. `transpose codebook in new(); 18.2k → 14.1k geomean ns (worst -8%)`.

9. **Hygiene.**
   - Always commit `kernels.rs` changes; never commit `results.tsv` or
     `run.log` (gitignored).
   - If a change fails to build, do not commit. Iterate or revert cleanly.
   - If two consecutive ideas regress, take a beat: re-read the last ~10 rows
     and update your mental model before proposing the next.
   - Per-trial cap: 10 minutes. If `cargo run` is still going after 10 min,
     kill it and mark the trial as `timeout`.
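Step 6 (parse and decide) amounts to pulling `key: value` lines out of `run.log`. The sketch below is one possible way to do that, under the assumption that each universal field sits on its own line in `key: value` form; `parse_fields` and `leading_number` are hypothetical helper names, not part of the harness.

```rust
use std::collections::HashMap;

/// Collect `key: value` lines from a run log into a map (hypothetical helper).
/// Lines without a colon, with an empty value, or whose key contains a space
/// are skipped, so per-combo rows and prose in the log are ignored.
pub fn parse_fields(log: &str) -> HashMap<String, String> {
    let mut fields = HashMap::new();
    for line in log.lines() {
        if let Some((key, value)) = line.split_once(':') {
            let (key, value) = (key.trim(), value.trim());
            if !key.is_empty() && !value.is_empty() && !key.contains(' ') {
                fields.insert(key.to_string(), value.to_string());
            }
        }
    }
    fields
}

/// Parse the leading numeric token of a field, so a value such as
/// `24515 ((768,96,256), sparse)` yields 24515.0.
pub fn leading_number(fields: &HashMap<String, String>, key: &str) -> Option<f64> {
    fields.get(key)?.split_whitespace().next()?.parse().ok()
}
```

Because the log is captured with `2>&1`, cargo diagnostics may share the file with the result block; looking fields up by exact key name keeps those extra lines from affecting the decision.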

## Never stop

Keep going until interrupted. Each loop iteration is one hypothesis, one edit,
one measurement, one commit. No multi-step plans across iterations.

## Working across multiple targets

If the work spans multiple targets, work on **one target per session**. Don't
edit `kernels.rs` in two crates between commits — the agent's mental model is
shared but the keep-decision is per-target. Pick a target, do a session there,
commit, switch.

The human is responsible for selecting which target to work on next. Don't
proactively switch targets unless the user asks.

@@ -1,112 +1,143 @@
# lance-autoresearch

An autoresearch-style harness for evolving [Lance](https://github.com/lance-format/lance)
PQ L2 distance kernels via LLM coding agents (Claude Code, Codex, Cursor).

Modeled on Andrej Karpathy's
A multi-target workspace for evolving [Lance](https://github.com/lance-format/lance)
hot-path kernels via LLM coding agents (Claude Code, Codex, Cursor),
in the style of Andrej Karpathy's
[`nanochat-research`](https://x.com/karpathy/status/1855651423497650238)
three-file contract:
single-agent autoresearch loop.

- **Immutable bench** — `src/bin/run_experiment.rs` + `src/inputs.rs` +
  `src/reference.rs`. The agent cannot touch these.
- **Mutable kernel** — `src/kernels.rs`. The agent's playground. Starts as a
  scalar baseline matching Lance's PQ L2 algorithm shape; the agent's job is to
  beat it.
- **Human-iterated program** — `program.md`. The "skill" the agent reads at
  the start of every session. The human refines it between runs.
Each target is an independent Rust crate under `crates/`:

| Target | Status | Lance source area | What's optimized |
|---|---|---|---|
| [`crates/pq-l2`](crates/pq-l2) | landed | `lance-linalg::distance::l2`, PQ probe | PQ L2 distance: build LUT, probe codes, top-K |
| `crates/pq-cosine` | candidate (A1) | `lance-linalg::distance::cosine` | PQ cosine distance |
| `crates/pq-dot` | candidate (A1) | `lance-linalg::distance::dot` | PQ dot-product distance |
| `crates/ivf-partition` | candidate (A2) | `lance-index::vector::ivf` partition select | IVF partition selection (centroid scan) |
| `crates/fts-bm25` | candidate (A3) | `lance-index::scalar::inverted` BM25 | FTS BM25 scoring inner loop |
| `crates/bitpack` | candidate (A4) | `lance-encoding::encodings::bitpack` | Bitpack integer decode |
| `crates/dictionary` | candidate (A5) | `lance-encoding::encodings::dictionary` | Dictionary decode |
| `crates/fsst` | candidate (A6) | `lance-encoding::encodings::fsst` | FSST string decode |
| `crates/take` | candidate (A7) | `lance-core::utils::take` | Take / gather kernel |
| `crates/predicate` | candidate (A8) | `lance-datafusion` filter eval | Predicate evaluation kernels |
| `crates/posting-intersect` | candidate (A9) | `lance-index::scalar::inverted` | Posting list intersection (FTS AND) |
| `crates/topk-merge` | candidate (A10) | scan-merge | Top-K k-way merge |

The candidate targets are documented in [`docs/targets/`](docs/targets/) and can
be added by following [`docs/adding-a-target.md`](docs/adding-a-target.md). The
single landed target (`pq-l2`) proves the harness shape; the candidates wait
for an agent to spin them up.

## The contract every target follows

Karpathy's three-file shape, applied per target:

| File (per target crate) | Mutability | Edited by |
|---|---|---|
| `src/kernels.rs` | **mutable** | the agent |
| `src/reference.rs`, `src/inputs.rs`, `src/lib.rs`, `src/bin/run_experiment.rs`, `benches/*.rs` | immutable | — |
| `program.md` | human-iterated | the human, between runs |
| `results.tsv` | append-only | the agent, per trial (gitignored) |

The shared utilities — deterministic PRNG, geomean, peak-RSS readback,
tolerance constants, time-budget — live in [`crates/harness-common`](crates/harness-common/src/lib.rs)
and are consumed by every target. There is intentionally **no `Target` trait**:
decode-kernel signatures and distance-kernel signatures are different enough
that a unifying trait would either bloat or require erased boxing. Each target
is its own natural shape; the shared crate is plumbing only.
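To make the "no `Target` trait" point concrete, the sketch below sets two toy kernels side by side. Both are hypothetical stand-ins written for this illustration (they are not the real `PqKernel` or any Lance decode API): one carries precomputed state and maps a float query to a float, the other is a stateless function from packed bytes to integers.

```rust
/// Hypothetical distance-kernel shape: precomputed state, float query in, float out.
pub struct ToyDistanceKernel {
    pub centroid: Vec<f32>,
}

impl ToyDistanceKernel {
    /// Squared L2 distance from `query` to the stored centroid.
    pub fn distance(&self, query: &[f32]) -> f32 {
        self.centroid
            .iter()
            .zip(query)
            .map(|(c, q)| (c - q) * (c - q))
            .sum()
    }
}

/// Hypothetical decode-kernel shape: packed bytes in, integers out, no state.
/// Each input byte holds two 4-bit values, low nibble first.
pub fn toy_unpack_4bit(packed: &[u8], out: &mut Vec<u8>) {
    for &byte in packed {
        out.push(byte & 0x0F);
        out.push(byte >> 4);
    }
}
```

A single trait covering both would need associated types or boxed inputs and outputs for everything that differs here, which is the bloat the paragraph above describes.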

The shared loop conventions every target's `program.md` inherits live in
[`HARNESS.md`](HARNESS.md). Per-target priors and API specifics live in each
target's own `program.md`.

## Dataset-independent by design

Every other ANN benchmark you've seen is "compete on this fixed dataset"
(SIFT1M, GIST1M, DEEP1B). That conflates two things: *kernel correctness*
(the math) and *kernel speed under one specific data distribution*. An LLM
agent given recall@K as the oracle has incentive to overfit to the dataset's
quirks.
(SIFT1M, GIST1M, DEEP1B). That conflates two things: *kernel correctness* (the
math) and *kernel speed under one specific data distribution*. An LLM agent
given recall@K as the oracle has incentive to overfit to the dataset's quirks.

We split them:
We split them, every target:

- **Correctness** = bit-equivalent (`max_abs_err ≤ 1e-4`) match to a scalar
  reference kernel, on diverse generated inputs (Gaussian, uniform, sparse,
  large-dynamic-range, mostly-zero) × multiple PQ shapes. This is mathematical
  equivalence; there's no dataset to overfit. Lossy techniques fail this gate.
- **Speed** = geomean ns/query across multiple PQ shapes ×
  multiple data distributions. A kernel that wins on one distribution and
  regresses on another fails the worst-case guard.
- **Correctness** = bit-equivalent (`max_abs_err ≤ 1e-4` for floats; bitwise for
  integer/byte kernels) match to a scalar reference, on diverse generated
  inputs. Mathematical equivalence; no dataset to overfit. Lossy techniques fail
  this gate.
- **Speed** = geomean ns/operation across multiple shape × distribution
  combinations, with worst-case guard. A kernel that wins on one distribution
  and regresses on another fails to keep.

By construction, an "improvement" generalizes across distributions and shapes.
There is no `wget sift.tar.gz` step; the harness is fully self-contained.
There is no `wget sift.tar.gz` step; every target is fully self-contained.

## Why a separate repo
## Why a separate repo (and a workspace, not a single crate)

OmniGraph (the graph engine that motivated this) pins Lance at a released
version and consumes its kernels via the public crate API. Improvements live one
layer below: in Lance itself. A standalone repo with no OmniGraph dep keeps the
optimization target pure (only the kernel changes), keeps the license clean for
upstream contribution (dual MIT/Apache-2.0 → Apache-2.0 PRs to Lance), and
keeps the agent's working set tiny.
version and consumes its kernels via the public crate API. Improvements live
one layer below: in Lance itself. A standalone repo with no OmniGraph dep keeps
the optimization target pure (only the kernel changes), keeps the license clean
for upstream contribution (dual MIT/Apache-2.0 → Apache-2.0 PRs to Lance), and
keeps each agent's working set tiny.

**Workspace not single-crate** because per-target deps differ — FSST decode
will want a different dependency set than PQ kernels — and the agent's edits
to one target's `kernels.rs` must not collide with another's lib path. Each
target is buildable, testable, and runnable in isolation: `cd crates/<target>
&& cargo run --release --bin run_experiment`.

## Quick start

```bash
cargo run --release --bin run_experiment
# Run the landed PQ L2 target's baseline.
cargo run --release --bin run_experiment -p pq-l2

# Or run with Claude Code / Codex:
# Open the repo in your agent of choice and prompt:
# Hi, have a look at program.md and let's kick off a new experiment.
# Or with Claude Code / Codex, working on one target:
cd crates/pq-l2
# Open in your agent of choice and prompt:
# Hi, have a look at program.md and let's kick off a new experiment.

# Add a new target (see docs/adding-a-target.md):
cp -r crates/pq-l2 crates/pq-cosine
# ... edit Cargo.toml name, kernels.rs / reference.rs / inputs.rs / program.md
```

## File ownership

| File | Mutability | Edited by |
|---|---|---|
| `src/kernels.rs` | **mutable** | the agent |
| `src/bin/run_experiment.rs` | immutable | — |
| `src/reference.rs` | immutable | — |
| `src/inputs.rs` | immutable | — |
| `src/lib.rs` | immutable (shared types) | — |
| `benches/pq_l2.rs` | immutable | — |
| `program.md` | human-iterated | the human, between runs |
| `results.tsv` | append-only | the agent, per trial (gitignored) |

## The metric

`run_experiment` runs two phases per trial: a correctness check and a
multi-shape × multi-distribution speed measurement. Output looks like:
## Repo layout

```
correctness: pass
---
correctness: pass
shapes_tested: (128,16,256) (256,16,256) (768,96,256)
distributions_tested: clustered uniform sparse
geomean_ns_per_query: 18234
worst_ns_per_query: 24515 ((768,96,256), sparse)
best_ns_per_query: 12876 ((128,16,256), clustered)
per_combo_geomean_ns:
  (128,16,256) clustered -> 12876 ns
  (128,16,256) uniform -> 13441 ns
  ...
peak_mem_mb: 28.4
total_seconds: 12.3
lance-autoresearch/
├── Cargo.toml                         # workspace root
├── README.md                          # you are here
├── HARNESS.md                         # shared loop contract every target inherits
├── LICENSE-MIT, LICENSE-APACHE        # dual-licensed (Apache compat for Lance PRs)
├── crates/
│   ├── harness-common/                # shared: SplitMix64, geomean, peak RSS, tolerance, time budget
│   │   └── src/{lib,prng,stats,sysinfo,tolerance}.rs
│   └── pq-l2/                         # landed target
│       ├── Cargo.toml
│       ├── program.md                 # this target's agent skill
│       ├── src/
│       │   ├── lib.rs                 # PqShape + module wiring (immutable)
│       │   ├── kernels.rs             # MUTABLE — agent's playground
│       │   ├── reference.rs           # IMMUTABLE — scalar reference, oracle helpers
│       │   ├── inputs.rs              # IMMUTABLE — diverse test-data generators
│       │   └── bin/run_experiment.rs  # IMMUTABLE — per-trial entry point
│       └── benches/pq_l2.rs           # criterion benchmark (immutable)
└── docs/
    ├── design.md                      # rationale for the workspace shape
    ├── adding-a-target.md             # workflow for spinning up a new target
    └── targets/
        └── pq-l2.md                   # capsule: upstream Lance pointers, oracle, status
```

A kernel is "kept" iff:

- Correctness phase passes (mathematical equivalence to scalar reference)
- `geomean_ns_per_query` strictly better than the previous best-kept kernel
- `worst_ns_per_query` ≤ 1.05 × the previous best-kept kernel's worst
- `total_seconds` ≤ 600

See `program.md` for the full loop spec.

## Upstream contribution path

When a commit clears the keep bar by a meaningful margin (≥10% geomean
speedup with worst-case guard intact), the human reviews the diff, ports the
technique against [`lance-format/lance`](https://github.com/lance-format/lance)
HEAD, runs Lance's own test suite, and opens a PR. Because `src/kernels.rs` is
dual MIT/Apache-2.0 licensed and algorithmically modeled on Lance's existing
path, the upstream PR inherits Apache-2.0 cleanly.
When a commit on any target clears the keep bar by a meaningful margin
(≥10% geomean speedup with worst-case guard intact), the human reviews the
diff, ports the technique against
[`lance-format/lance`](https://github.com/lance-format/lance) HEAD, runs
Lance's own test suite, and opens a PR. Because the workspace is dual
MIT/Apache-2.0 licensed and each target's kernel is algorithmically modeled on
Lance's existing path, the upstream PR inherits Apache-2.0 cleanly.

## License

10
research/lance-autoresearch/crates/harness-common/Cargo.toml
Normal file
@@ -0,0 +1,10 @@
[package]
name = "harness-common"
version = "0.1.0"
edition = "2024"
license = "MIT OR Apache-2.0"
description = "Shared utilities for lance-autoresearch per-target harnesses (PRNG, geomean, peak RSS, tolerance constants, time budget)."
publish = false

[lib]
path = "src/lib.rs"
36
research/lance-autoresearch/crates/harness-common/src/lib.rs
Normal file
@@ -0,0 +1,36 @@
//! Shared utilities for lance-autoresearch per-target harnesses.
//!
//! Each target crate (`pq-l2`, future `pq-cosine`, `bitpack-decode`, etc.)
//! defines its own `kernels.rs` (mutable, the agent's playground), `reference.rs`
//! (immutable scalar reference), `inputs.rs` (immutable test-data generators),
//! and `bin/run_experiment.rs` (immutable per-trial entry point). They all need
//! the same handful of building blocks: a deterministic PRNG, a geomean
//! aggregator, peak-RSS readback, tolerance constants for the bit-exact oracle,
//! and a single shared time-budget constant. That's everything in this crate.
//!
//! What is **not** here, and intentionally not abstracted:
//!
//! - A `Target` trait. Decode kernels (`bitpack`, `dictionary`, `FSST`) have
//!   very different signatures than distance kernels (`PqKernel::probe_top_k`),
//!   and forcing them into one trait shape would either bloat the trait or
//!   require erased boxing. Keep each target's API natural to its kernel.
//!
//! - Output-format orchestration. Each target's `run_experiment.rs` prints its
//!   own fixed-format result block — different targets report different
//!   per-combo dimensions (PQ shapes vs bit widths vs distribution kinds vs ...).
//!   Sharing the format would make the per-target binaries less readable and
//!   gain very little — `println!` is cheap.

pub mod prng;
pub mod stats;
pub mod sysinfo;
pub mod tolerance;

pub use prng::SplitMix64;
pub use stats::geomean;
pub use sysinfo::peak_rss_mb;
pub use tolerance::{MAX_ABS_ERR, TOPK_DIST_TOL};

/// Per-trial wall-clock cap. Targets should `std::process::exit(3)` if exceeded
/// so the agent's loop logs the trial as a timeout instead of a measurement.
pub const TIME_BUDGET_SECS: u64 = 600;
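The doc comment on `TIME_BUDGET_SECS` describes a timeout exit code, and the keep rules in HARNESS.md describe exit code 2 for a correctness failure. A minimal sketch of how a `run_experiment` binary might combine the two follows; `trial_exit_code` is a hypothetical helper, and checking correctness before the time budget is an assumption rather than something the source specifies.

```rust
use std::time::Duration;

/// Mirrors the shared constant defined in `harness-common`.
pub const TIME_BUDGET_SECS: u64 = 600;

/// Map a trial outcome to a process exit code (hypothetical helper):
/// 2 = correctness failure, 3 = time budget exceeded, 0 = valid measurement.
pub fn trial_exit_code(correctness_pass: bool, elapsed: Duration) -> i32 {
    if !correctness_pass {
        2
    } else if elapsed > Duration::from_secs(TIME_BUDGET_SECS) {
        3
    } else {
        0
    }
}
```

A binary would end with something like `std::process::exit(trial_exit_code(ok, start.elapsed()))`, where `ok` and `start` are whatever the binary tracked during the run.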

@@ -0,0 +1,52 @@
//! Deterministic SplitMix64 PRNG. Same seed produces the same sequence on
//! every machine; no platform-specific RNG / no `rand` crate. Reproducibility
//! across trials is the whole point.

pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform in `[0, 1)` with 24 bits of mantissa precision.
    pub fn next_f32(&mut self) -> f32 {
        let bits = (self.next_u64() >> 40) as u32;
        bits as f32 / ((1u32 << 24) as f32)
    }

    /// Standard normal via Box–Muller. Cheap and sufficient for fixture
    /// generation; not cryptographically anything.
    pub fn next_normal(&mut self) -> f32 {
        let mut u1 = self.next_f32();
        if u1 < 1e-7 {
            u1 = 1e-7;
        }
        let u2 = self.next_f32();
        (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn deterministic_across_calls() {
        let mut a = SplitMix64::new(0x1234_5678);
        let mut b = SplitMix64::new(0x1234_5678);
        for _ in 0..1000 {
            assert_eq!(a.next_u64(), b.next_u64());
        }
    }
}
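As a usage illustration, the sketch below shows how a seeded generator of this kind could fill a reproducible fixture vector. The `SplitMix64` core is repeated from the diff above only so the block compiles on its own; `uniform_vector` is a hypothetical helper and is not claimed to exist in any target's `inputs.rs`.

```rust
/// Copy of the SplitMix64 core from the diff above, repeated so this sketch
/// compiles on its own.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Top 24 bits of the next output, scaled into `[0, 1)`.
    pub fn next_f32(&mut self) -> f32 {
        let bits = (self.next_u64() >> 40) as u32;
        bits as f32 / ((1u32 << 24) as f32)
    }
}

/// Hypothetical fixture helper (not in the source): `n` values spread
/// uniformly over roughly `[lo, hi)`.
pub fn uniform_vector(rng: &mut SplitMix64, n: usize, lo: f32, hi: f32) -> Vec<f32> {
    (0..n).map(|_| lo + (hi - lo) * rng.next_f32()).collect()
}
```

Two generators built from the same seed yield identical vectors, which is what lets timings from different trials be compared on the same inputs.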

@@ -0,0 +1,36 @@
//! Geometric mean of u64 timings. Robust to outliers; the right aggregator for
//! latency distributions because halving one query and doubling another cancels.

pub fn geomean(xs: &[u64]) -> u64 {
    if xs.is_empty() {
        return 0;
    }
    let mut sum_ln = 0.0f64;
    for &x in xs {
        sum_ln += (x.max(1) as f64).ln();
    }
    (sum_ln / xs.len() as f64).exp() as u64
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn empty_yields_zero() {
        assert_eq!(geomean(&[]), 0);
    }

    #[test]
    fn single_value_round_trips() {
        assert_eq!(geomean(&[100]), 100);
    }

    #[test]
    fn geomean_is_below_arithmetic_mean() {
        let xs = [1, 10, 100, 1000];
        let g = geomean(&xs);
        let am: u64 = xs.iter().sum::<u64>() / xs.len() as u64;
        assert!(g < am);
    }
}
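The module comment's claim that halving one query and doubling another cancels can be checked numerically. The sketch below uses a hypothetical `f64` variant, `geomean_f64`, so that the comparison is not affected by the `as u64` cast in the version above (that cast truncates toward zero); `arithmetic_mean` is likewise only for contrast.

```rust
/// Hypothetical f64 variant of `geomean`, used so the comparison below is not
/// affected by truncation to `u64`.
pub fn geomean_f64(xs: &[f64]) -> f64 {
    if xs.is_empty() {
        return 0.0;
    }
    // Mean of logs, then exponentiate; values below 1 are clamped like the u64 version.
    let sum_ln: f64 = xs.iter().map(|x| x.max(1.0).ln()).sum();
    (sum_ln / xs.len() as f64).exp()
}

/// Plain arithmetic mean, for contrast.
pub fn arithmetic_mean(xs: &[f64]) -> f64 {
    if xs.is_empty() {
        return 0.0;
    }
    xs.iter().sum::<f64>() / xs.len() as f64
}
```

Starting from two queries at 100 ns each, halving one and doubling the other gives 50 ns and 200 ns: the geometric mean stays at about 100 ns (the square root of 50 × 200), while the arithmetic mean moves to 125 ns.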

@@ -0,0 +1,24 @@
//! Peak resident-set-size readback (Linux only; non-Linux returns 0).

#[cfg(target_os = "linux")]
pub fn peak_rss_mb() -> f64 {
    let Ok(s) = std::fs::read_to_string("/proc/self/status") else {
        return 0.0;
    };
    for line in s.lines() {
        if let Some(rest) = line.strip_prefix("VmPeak:") {
            let kb: f64 = rest
                .split_whitespace()
                .next()
                .and_then(|t| t.parse().ok())
                .unwrap_or(0.0);
            return kb / 1024.0;
        }
    }
    0.0
}

#[cfg(not(target_os = "linux"))]
pub fn peak_rss_mb() -> f64 {
    0.0
}
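One detail worth noting: in the Linux `proc(5)` documentation, `VmPeak` is the peak virtual memory size, while `VmHWM` is the peak resident set size ("high water mark"). The function above matches `VmPeak:`, so a readback of peak RSS as the module comment describes would look for `VmHWM:` instead. The sketch below parses either field from status-style text; `status_field_mb` is a hypothetical helper that takes the text as an argument so it can be exercised without reading `/proc`.

```rust
/// Extract a `<field> <n> kB` value from `/proc/<pid>/status`-style text and
/// convert it to MiB. Hypothetical helper; `field` must include the trailing
/// colon, e.g. "VmHWM:".
pub fn status_field_mb(status: &str, field: &str) -> Option<f64> {
    status.lines().find_map(|line| {
        let rest = line.strip_prefix(field)?;
        let kb: f64 = rest.split_whitespace().next()?.parse().ok()?;
        Some(kb / 1024.0)
    })
}
```

Returning `Option` instead of `0.0` also lets a caller distinguish "field not present" from a genuinely tiny process, which the `0.0` fallback cannot express.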

@@ -0,0 +1,15 @@
//! Default tolerance constants for bit-exact correctness oracles.
//!
//! These suit float-arithmetic kernels (PQ distance, BM25 scoring, vector
//! normalization) where SIMD-accumulator reordering is legal but real bugs
//! shift values by orders of magnitude. Targets that operate on integer or
//! byte-exact data (bitpack decode, dictionary decode, FSST decode) should
//! assert strict bitwise equality and not use these constants.

/// Maximum permitted absolute element error between agent kernel output and
/// scalar reference output, for float kernels.
pub const MAX_ABS_ERR: f32 = 1e-4;

/// Maximum permitted distance error when comparing top-K results between
/// agent kernel and scalar reference, for float kernels.
pub const TOPK_DIST_TOL: f32 = 1e-4;
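A minimal sketch of how an oracle might apply `MAX_ABS_ERR` follows. The helper names `max_abs_err` and `within_tolerance` are hypothetical, and treating a length mismatch or a NaN difference as a failure is an assumption, not something the source states.

```rust
/// Mirrors the constant defined above.
pub const MAX_ABS_ERR: f32 = 1e-4;

/// Largest absolute element-wise difference between the two outputs.
/// Returns `None` if the lengths differ or any difference is NaN.
pub fn max_abs_err(candidate: &[f32], reference: &[f32]) -> Option<f32> {
    if candidate.len() != reference.len() {
        return None;
    }
    let mut worst = 0.0f32;
    for (c, r) in candidate.iter().zip(reference) {
        let d = (c - r).abs();
        if d.is_nan() {
            return None;
        }
        worst = worst.max(d);
    }
    Some(worst)
}

/// True when every element of `candidate` is within `MAX_ABS_ERR` of `reference`.
pub fn within_tolerance(candidate: &[f32], reference: &[f32]) -> bool {
    matches!(max_abs_err(candidate, reference), Some(e) if e <= MAX_ABS_ERR)
}
```

The explicit NaN check matters because `f32::max` ignores a NaN operand, so a NaN output would otherwise slip through as a zero error.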

24
research/lance-autoresearch/crates/pq-l2/Cargo.toml
Normal file
@@ -0,0 +1,24 @@
[package]
name = "pq-l2"
version = "0.1.0"
edition = "2024"
license = "MIT OR Apache-2.0"
description = "Autoresearch target: Lance PQ L2 distance kernel optimization."
publish = false

[lib]
path = "src/lib.rs"

[[bin]]
name = "run_experiment"
path = "src/bin/run_experiment.rs"

[[bench]]
name = "pq_l2"
harness = false

[dependencies]
harness-common = { path = "../harness-common" }

[dev-dependencies]
criterion = { workspace = true }

@@ -7,8 +7,8 @@ use std::hint::black_box;

use criterion::{Criterion, criterion_group, criterion_main};

use lance_autoresearch::inputs::{SHAPES, SPEED_TOP_K, speed_workloads};
use lance_autoresearch::kernels::PqKernel;
use pq_l2::inputs::{SHAPES, SPEED_TOP_K, speed_workloads};
use pq_l2::kernels::PqKernel;

fn bench_pq_l2(c: &mut Criterion) {
    let workloads = speed_workloads(0xBE3C_C0DE_F1AC_BABE);

98
research/lance-autoresearch/crates/pq-l2/program.md
Normal file
|
|
@@ -0,0 +1,98 @@
|
|||
# Target: PQ L2 — agent instructions
|
||||
|
||||
This is the per-target overlay on top of [`../../HARNESS.md`](../../HARNESS.md).
|
||||
Read **HARNESS.md first** for the universal loop contract (what's editable,
|
||||
the metric, the loop, hygiene, never stop). This file adds the PQ-L2-specific
|
||||
API spec and priors.
|
||||
|
||||
## Setup (once per session)
|
||||
|
||||
1. Read in this order:
|
||||
- `../../HARNESS.md`
|
||||
- `../../README.md`
|
||||
- `program.md` (this file)
|
||||
- `src/lib.rs`
|
||||
- `src/kernels.rs` *(the only file you may edit)*
|
||||
- `src/reference.rs`
|
||||
- `src/inputs.rs`
|
||||
- `src/bin/run_experiment.rs`
|
||||
2. Ensure `results.tsv` exists. If not, create it with this header:
|
||||
```
|
||||
commit timestamp correctness geomean_ns worst_ns worst_combo best_ns best_combo peak_mem_mb total_seconds keep description
|
||||
```
|
||||
3. Baseline trial:
|
||||
```
|
||||
cargo run --release --bin run_experiment > run.log 2>&1
|
||||
```
|
||||
Append a row tagged `keep=baseline`, commit it.
|
||||
|
||||
## Public API contract (must remain stable)
|
||||
|
||||
The bench imports these from `crate::kernels`. You may NOT change their
|
||||
signatures. You MAY add private helpers, internal data layouts, `unsafe`
|
||||
blocks, `std::arch` intrinsics under `#[cfg(target_arch = ...)]` gates,
|
||||
pre-computed state inside `PqKernel`, etc.
|
||||
|
||||
```rust
|
||||
pub struct PqKernel { /* agent's private fields */ }
|
||||
|
||||
impl PqKernel {
|
||||
pub fn new(shape: PqShape, codebook: &[f32]) -> Self;
|
||||
pub fn shape(&self) -> &PqShape;
|
||||
pub fn distance_table(&self, query: &[f32]) -> Vec<f32>;
|
||||
pub fn probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>;
|
||||
}
|
||||
```
|
||||
|
||||
Pre-processing in `new` is free — the bench measures `distance_table +
|
||||
probe_top_k` per query, not per (build + query). Codebook transposes,
|
||||
cached `c·c`, packed LUTs, etc., should live in `new`.
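
For orientation, the two timed calls can be sketched in scalar form. This is a minimal sketch, not the contents of `kernels.rs` or `reference.rs`: the `Shape` struct, the free functions, the sort-based top-K, and the flat index conventions (`table[m * num_centroids + k]`, `codes[i * num_sub_vectors + m]`) are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for `PqShape`; the real type lives in `lib.rs`.
#[derive(Clone, Copy)]
struct Shape {
    dim: usize,
    num_sub_vectors: usize,
    num_centroids: usize,
}

impl Shape {
    fn sub_vector_dim(&self) -> usize {
        self.dim / self.num_sub_vectors
    }
}

/// Scalar table build over an assumed `[m][k][d]` codebook:
/// table[m * K + k] = squared L2 distance between query sub-vector m and centroid k.
fn distance_table(shape: Shape, codebook: &[f32], query: &[f32]) -> Vec<f32> {
    let (m_count, k_count, d) = (
        shape.num_sub_vectors,
        shape.num_centroids,
        shape.sub_vector_dim(),
    );
    let mut table = vec![0.0f32; m_count * k_count];
    for m in 0..m_count {
        let q = &query[m * d..(m + 1) * d];
        for k in 0..k_count {
            let start = (m * k_count + k) * d;
            let c = &codebook[start..start + d];
            table[m * k_count + k] = q
                .iter()
                .zip(c)
                .map(|(a, b)| (a - b) * (a - b))
                .sum::<f32>();
        }
    }
    table
}

/// Scalar probe: one table lookup per sub-quantizer, summed per encoded vector,
/// then the k smallest distances are kept (full sort here, for clarity only).
fn probe_top_k(
    shape: Shape,
    table: &[f32],
    codes: &[u8],
    num_vectors: usize,
    k: usize,
) -> Vec<(u32, f32)> {
    let (m_count, k_count) = (shape.num_sub_vectors, shape.num_centroids);
    let mut all: Vec<(u32, f32)> = (0..num_vectors)
        .map(|i| {
            let dist: f32 = (0..m_count)
                .map(|m| table[m * k_count + codes[i * m_count + m] as usize])
                .sum();
            (i as u32, dist)
        })
        .collect();
    all.sort_by(|a, b| a.1.total_cmp(&b.1));
    all.truncate(k);
    all
}
```

Anything that depends only on `shape` and `codebook` (transposed layouts, cached norms) can move out of these two functions and into the constructor, since build time is not timed.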
|
||||
|
||||
## What you can / cannot do
|
||||
|
||||
(See HARNESS.md for the universal table; this is the PQ-L2 specific
|
||||
addition.)
|
||||
|
||||
- **Cannot** change `PqShape` or the constants in `lib.rs`. They define
|
||||
the optimization target.
|
||||
- **Cannot** introduce lossy techniques (LUT u8/u16 quantization, asymmetric
|
||||
approximation, anything that drops bits relative to the scalar reference).
|
||||
The correctness phase asserts `max_abs_err ≤ 1e-4` against the scalar
|
||||
reference; lossy techniques fail this gate. If you want to explore a lossy
|
||||
track, propose it to the human as a separate kernel surface.
|
||||
- **Can** mark hot functions `#[inline]`, split them, add private helpers.
|
||||
- **Can** add `#[cfg(test)] mod tests { ... }` inside `kernels.rs` for in-file
|
||||
property checks against the scalar path.
|
||||
|
||||
## Lance-PQ-specific priors
|
||||
|
||||
These are the directions that pay off on this kernel shape without
|
||||
compromising arithmetic accuracy. Pick one hypothesis per trial; don't try
|
||||
to combine multiple ideas at once.
|
||||
|
||||
- **Codebook layout transpose.** The reference layout is `[m][k][d]`.
|
||||
Transposing to `[m][d][k]` makes `k` contiguous, so you can SIMD-load 8 centroid
lanes across `k` and broadcast each query component `q[d]` against them. Do the transpose in `PqKernel::new` once.
|
||||
- **Cache `c·c` per centroid.** The diff–square–sum is
|
||||
`(q - c)·(q - c) = q·q - 2(q·c) + c·c`. Hoist `q·q` per sub-vector,
|
||||
precompute `c·c` once at `new()` time, store next to the codebook. Inner
|
||||
loop becomes one FMA. Watch sign / accumulator ordering so rounding stays
|
||||
within `MAX_ABS_ERR`.
|
||||
- **Probe-side code transpose.** Probe is dominated by
|
||||
`acc += table[m][codes[off+m]]` × `num_sub_vectors`. Transposing codes to
|
||||
`[m][i]` (one row per sub-quantizer, contiguous over base index) lets you
|
||||
process 32+ vectors per inner iteration with `vgatherdps`-style loads.
|
||||
- **Top-K block-then-merge.** `push()` does a branch + heap sift on every
|
||||
code. At 20k probes per query × 9 (shape × dist) combos that's the
|
||||
second-biggest cost after the gather. Block the probe (e.g., 512 codes at
|
||||
a time), find the local top-K with a branchless pass, then merge into the
|
||||
global heap.
|
||||
- **Prefetch.** `_mm_prefetch::<_MM_HINT_T0>(codes.as_ptr().wrapping_add(off + 64) as *const i8)`
ahead of the gather is usually a pure win at 20k+ scale.
|
||||
- **FMA chains for table build.** The diff–square–sum maps cleanly to FMA
|
||||
on AVX2/NEON. Even without intrinsics, structuring the inner loop so
|
||||
`rustc` emits FMA helps.
|
||||
- **Avoid the `Vec` allocation in the hot path.** `distance_table` allocates
|
||||
a fresh `Vec<f32>` per call. The public API is fixed (returns `Vec<f32>`),
|
||||
but you can reuse a thread-local scratch buffer internally and copy to a
|
||||
`Vec` at the boundary if it speeds the build.
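
The `c·c` caching prior can be checked in isolation before touching the kernel. The sketch below is illustrative only: `l2_direct`, `l2_expanded` and `dot` are hypothetical helper names, not functions in this crate. It compares the direct diff-square-sum against the expanded form with `c·c` supplied from outside, as it would be if precomputed in `new`.

```rust
/// Direct form: sum over the sub-vector of (q - c)^2.
fn l2_direct(q: &[f32], c: &[f32]) -> f32 {
    q.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum()
}

/// Plain dot product, used here to produce the cached q·q and c·c terms.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Expanded form: q·q - 2(q·c) + c·c.
/// `qq` is hoisted per sub-vector, `cc` is assumed precomputed per centroid.
fn l2_expanded(q: &[f32], c: &[f32], qq: f32, cc: f32) -> f32 {
    let mut qc = 0.0f32;
    for (a, b) in q.iter().zip(c) {
        // One fused multiply-add per element of the sub-vector.
        qc = a.mul_add(*b, qc);
    }
    qq - 2.0 * qc + cc
}
```

The expanded form subtracts large, nearly equal terms when `q` and `c` are close, so it loses more precision than the direct form. An in-file property test comparing both against `MAX_ABS_ERR` across the wide-dynamic-range inputs is a cheap guard before running a full trial.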
|
||||
|
|
@@ -35,18 +35,18 @@
|
|||
|
||||
use std::time::Instant;
|
||||
|
||||
use lance_autoresearch::inputs::{
|
||||
use harness_common::{MAX_ABS_ERR, TIME_BUDGET_SECS, TOPK_DIST_TOL, geomean, peak_rss_mb};
|
||||
use pq_l2::inputs::{
|
||||
DISTRIBUTIONS, DataDistribution, SHAPES, SpeedWorkload, correctness_battery, speed_workloads,
|
||||
};
|
||||
use lance_autoresearch::kernels::PqKernel;
|
||||
use lance_autoresearch::reference::{ScalarReference, max_abs_err, topk_consistent};
|
||||
use lance_autoresearch::{MAX_ABS_ERR, PqShape, TOPK_DIST_TOL};
|
||||
use pq_l2::kernels::PqKernel;
|
||||
use pq_l2::reference::{ScalarReference, max_abs_err, topk_consistent};
|
||||
use pq_l2::PqShape;
|
||||
|
||||
// Any constants; the only requirement is that they're pinned across trials so
|
||||
// the inputs and the timings are reproducible.
|
||||
const CORRECTNESS_SEED: u64 = 0xC0FF_EEC0_DEBE_EFFE;
|
||||
const SPEED_SEED: u64 = 0x5EED_F1AC_BABE_FACE;
|
||||
const TIME_BUDGET_SECS: u64 = 600;
|
||||
|
||||
fn main() {
|
||||
let start = Instant::now();
|
||||
|
|
@@ -210,17 +210,6 @@ fn run_speed(workloads: &[SpeedWorkload]) -> SpeedReport {
|
|||
}
|
||||
}
|
||||
|
||||
fn geomean(xs: &[u64]) -> u64 {
|
||||
if xs.is_empty() {
|
||||
return 0;
|
||||
}
|
||||
let mut sum_ln = 0.0f64;
|
||||
for &x in xs {
|
||||
sum_ln += (x.max(1) as f64).ln();
|
||||
}
|
||||
(sum_ln / xs.len() as f64).exp() as u64
|
||||
}
|
||||
|
||||
fn format_shape(s: &PqShape) -> String {
|
||||
format!("({},{},{})", s.dim, s.num_sub_vectors, s.num_centroids)
|
||||
}
|
||||
|
|
@@ -233,26 +222,3 @@ fn format_dist(d: &DataDistribution) -> String {
|
|||
}
|
||||
.to_string()
|
||||
}
|
||||
|
||||
#[cfg(target_os = "linux")]
|
||||
fn peak_rss_mb() -> f64 {
|
||||
let Ok(s) = std::fs::read_to_string("/proc/self/status") else {
|
||||
return 0.0;
|
||||
};
|
||||
for line in s.lines() {
|
||||
if let Some(rest) = line.strip_prefix("VmPeak:") {
|
||||
let kb: f64 = rest
|
||||
.split_whitespace()
|
||||
.next()
|
||||
.and_then(|t| t.parse().ok())
|
||||
.unwrap_or(0.0);
|
||||
return kb / 1024.0;
|
||||
}
|
||||
}
|
||||
0.0
|
||||
}
|
||||
|
||||
#[cfg(not(target_os = "linux"))]
|
||||
fn peak_rss_mb() -> f64 {
|
||||
0.0
|
||||
}
|
||||
|
|
@@ -16,6 +16,7 @@
|
|||
//! the codebook is shape-appropriate, not random.
|
||||
|
||||
use crate::PqShape;
|
||||
use harness_common::SplitMix64;
|
||||
|
||||
/// PQ shapes the bench evaluates. The agent's kernel must produce correct
|
||||
/// output and competitive speed on every one.
|
||||
|
|
@@ -295,36 +296,6 @@ fn encode(shape: PqShape, n: usize, base: &[f32], codebook: &[f32]) -> Vec<u8> {
|
|||
out
|
||||
}
|
||||
|
||||
/// SplitMix64 — small, deterministic; bit-for-bit reproducible across machines.
|
||||
struct SplitMix64 {
|
||||
state: u64,
|
||||
}
|
||||
|
||||
impl SplitMix64 {
|
||||
fn new(seed: u64) -> Self {
|
||||
Self { state: seed }
|
||||
}
|
||||
fn next_u64(&mut self) -> u64 {
|
||||
self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
|
||||
let mut z = self.state;
|
||||
z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
|
||||
z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
|
||||
z ^ (z >> 31)
|
||||
}
|
||||
fn next_f32(&mut self) -> f32 {
|
||||
let bits = (self.next_u64() >> 40) as u32;
|
||||
bits as f32 / ((1u32 << 24) as f32)
|
||||
}
|
||||
fn next_normal(&mut self) -> f32 {
|
||||
let mut u1 = self.next_f32();
|
||||
if u1 < 1e-7 {
|
||||
u1 = 1e-7;
|
||||
}
|
||||
let u2 = self.next_f32();
|
||||
(-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos()
|
||||
}
|
||||
}
|
||||
|
||||
fn shape_hash(s: PqShape) -> u64 {
|
||||
(s.dim as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
|
||||
^ (s.num_sub_vectors as u64).wrapping_mul(0xBF58_476D_1CE4_E5B9)
|
||||
|
|
@@ -1,17 +1,20 @@
|
|||
//! Lance autoresearch harness — public API for the bench binary, benchmarks, and tests.
|
||||
//! Autoresearch target: Lance PQ L2 distance kernel optimization.
|
||||
//!
|
||||
//! Contract (Karpathy-style three files):
|
||||
//! Karpathy-style three-file contract:
|
||||
//!
|
||||
//! - `kernels` — the AGENT'S PLAYGROUND. Modify freely.
|
||||
//! - `reference` — IMMUTABLE. Scalar reference kernel. Defines the math.
|
||||
//! - `inputs` — IMMUTABLE. Diverse test-data + workload generators,
|
||||
//! deterministic per fixed seed, varied across the input battery.
|
||||
//!
|
||||
//! The optimization target is dataset-independent: the agent's kernel must match
|
||||
//! the scalar reference within `MAX_ABS_ERR` on every input the bench generates,
|
||||
//! and minimize geomean ns/query across multiple PQ shapes and data
|
||||
//! distributions. There is no fixed dataset; an "improvement" by construction
|
||||
//! generalizes across distributions and shapes.
|
||||
//! The optimization target is dataset-independent: the agent's kernel must
|
||||
//! match the scalar reference within `harness_common::MAX_ABS_ERR` on every
|
||||
//! input the bench generates, and minimize geomean ns/query across multiple
|
||||
//! PQ shapes and data distributions. There is no fixed dataset.
|
||||
//!
|
||||
//! Shared utilities (deterministic PRNG, geomean, peak RSS, tolerance
|
||||
//! constants, time budget) come from the `harness-common` workspace crate.
|
||||
//! See `../HARNESS.md` for the harness conventions every target follows.
|
||||
|
||||
pub mod inputs;
|
||||
pub mod kernels;
|
||||
|
|
@@ -45,12 +48,3 @@ impl PqShape {
|
|||
self.num_sub_vectors * self.num_centroids * self.sub_vector_dim()
|
||||
}
|
||||
}
|
||||
|
||||
/// Tolerance for the agent kernel's distance values vs. the scalar reference.
|
||||
/// Loose enough to permit legal SIMD-accumulator reordering; tight enough to
|
||||
/// catch real arithmetic bugs.
|
||||
pub const MAX_ABS_ERR: f32 = 1e-4;
|
||||
|
||||
/// Tolerance for top-K *distances* (id sets are compared with tie-tolerance —
|
||||
/// see `reference::topk_consistent`).
|
||||
pub const TOPK_DIST_TOL: f32 = 1e-4;
|
||||
192
research/lance-autoresearch/docs/adding-a-target.md
Normal file
|
|
@@ -0,0 +1,192 @@
|
|||
# Adding a new target
|
||||
|
||||
Walk through this when spinning up a new optimization target (A1 cosine, A4
|
||||
bitpack, etc.). It's a `cp -r` plus surgical edits — no architectural
|
||||
decisions to make per target if the kernel fits the autoresearch shape.
|
||||
|
||||
If your target's per-trial eval is more than ~30 seconds, or the correctness
|
||||
oracle can't be a deterministic comparison against a scalar reference, this
|
||||
harness is the wrong fit — see [`design.md`](design.md) "When to revisit"
|
||||
for the boundary.
|
||||
|
||||
## Steps
|
||||
|
||||
### 1. Pick a template target
|
||||
|
||||
Pick the closest existing target. For now there's just `pq-l2`, but as more land:
|
||||
- Distance / scoring kernels that take a query and return per-row scores →
|
||||
template off `pq-l2`.
|
||||
- Decode kernels that take encoded bytes and return an Arrow array →
|
||||
template off `bitpack` once it lands.
|
||||
- Scan / merge kernels → template off `topk-merge` once it lands.
|
||||
|
||||
```bash
|
||||
cp -r crates/pq-l2 crates/<my-target>
|
||||
```
|
||||
|
||||
### 2. Rewrite `Cargo.toml`
|
||||
|
||||
```toml
|
||||
[package]
|
||||
name = "<my-target>"
|
||||
# version, edition, license, publish stay the same
|
||||
```
|
||||
|
||||
Add the target to the workspace `members` in the root `Cargo.toml`:
|
||||
|
||||
```toml
|
||||
[workspace]
|
||||
members = [
|
||||
"crates/harness-common",
|
||||
"crates/pq-l2",
|
||||
"crates/<my-target>", # add this
|
||||
]
|
||||
```
|
||||
|
||||
### 3. Rewrite `src/lib.rs`
|
||||
|
||||
Define the target's `Shape` type (analogue of `PqShape`) and any other types
|
||||
shared among `kernels.rs`, `reference.rs`, and `inputs.rs`. Document
|
||||
which fields are pinned by the harness vs. agent-tunable.
|
||||
|
||||
This file is **immutable** to the agent. The shape parameters define the
|
||||
optimization target — changing them changes what's being optimized.
|
||||
|
||||
### 4. Rewrite `src/reference.rs`
|
||||
|
||||
Implement the scalar reference kernel — the math, in plain Rust, no SIMD,
|
||||
no cleverness. This is what the agent's kernel is compared against. Mirror
|
||||
the public API of `kernels.rs` exactly.
|
||||
|
||||
For float kernels, also export `max_abs_err(a, b)` and `topk_consistent(...)`
|
||||
(or analogues) — the comparison helpers the bench uses to assert
|
||||
near-bit-exact equivalence with `harness_common::MAX_ABS_ERR` /
|
||||
`TOPK_DIST_TOL`.
|
||||
|
||||
For integer / byte kernels, the comparison is simpler — `assert_eq!` on the
|
||||
returned Arrow array. No tolerance constants needed.
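
A float comparison helper along these lines is one way to fill the `max_abs_err(a, b)` slot. It is a sketch under stated assumptions, not the `pq-l2` implementation: the handling of length mismatches and NaN (both treated as an unconditional failure) is a choice made here, and the local `MAX_ABS_ERR` only mirrors the value exported by `harness_common`.

```rust
/// Mirrors `harness_common::MAX_ABS_ERR`; redefined so the sketch is self-contained.
const MAX_ABS_ERR: f32 = 1e-4;

/// Hypothetical helper: largest absolute element-wise difference.
/// A length mismatch or any NaN difference returns +inf so the gate always fails.
fn max_abs_err(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return f32::INFINITY;
    }
    let mut worst = 0.0f32;
    for (x, y) in a.iter().zip(b) {
        let e = (x - y).abs();
        if e.is_nan() {
            return f32::INFINITY;
        }
        worst = worst.max(e);
    }
    worst
}

/// Gate used by the correctness phase in this sketch.
fn within_tolerance(agent: &[f32], reference: &[f32]) -> bool {
    max_abs_err(agent, reference) <= MAX_ABS_ERR
}
```

Returning infinity for NaN matters because `f32::max` ignores NaN operands, so a naive fold would silently pass a kernel that produces NaN.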
|
||||
|
||||
### 5. Rewrite `src/inputs.rs`
|
||||
|
||||
Two surfaces:
|
||||
|
||||
- `correctness_battery(seed) -> Vec<CorrectnessCase>` — diverse shape ×
|
||||
distribution combinations, sized small enough that the correctness phase
|
||||
finishes in seconds. The point is breadth, not realism.
|
||||
- `speed_workloads(seed) -> Vec<SpeedWorkload>` — larger shape × distribution
|
||||
combinations sized for stable timings. Aim for total trial wall-clock
|
||||
≤ 60s, because per-trial latency bounds how fast the agent can iterate.
|
||||
|
||||
Use `harness_common::SplitMix64` for determinism. Same seed → same battery
|
||||
across trials.
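
The determinism property can be illustrated with the generator itself. The `SplitMix64` body below follows the code that this commit moves out of `pq-l2`'s `inputs.rs`; the `uniform_case` function is a hypothetical example of a battery builder, not part of the harness.

```rust
/// SplitMix64, as in the pre-refactor `inputs.rs` (visibility is an assumption).
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform in [0, 1), built from the top 24 bits.
    fn next_f32(&mut self) -> f32 {
        let bits = (self.next_u64() >> 40) as u32;
        bits as f32 / ((1u32 << 24) as f32)
    }
}

/// Hypothetical battery builder: `n` values uniform on [-1, 1) from one seed.
fn uniform_case(seed: u64, n: usize) -> Vec<f32> {
    let mut rng = SplitMix64::new(seed);
    (0..n).map(|_| rng.next_f32() * 2.0 - 1.0).collect()
}
```

Because the generator has no hidden state beyond the seed, deriving each case's seed from the battery seed plus a per-case hash keeps cases independent while staying reproducible.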
|
||||
|
||||
### 6. Rewrite `src/kernels.rs` (the agent's playground)
|
||||
|
||||
Implement a clean scalar baseline matching the algorithm shape of the Lance
|
||||
upstream code. The header comment must:
|
||||
|
||||
- Cite the upstream Lance source (`lance-format/lance` rev / file path) the
|
||||
algorithm is modeled on.
|
||||
- Document the public API the bench calls — these are the surfaces the agent
|
||||
may NOT change.
|
||||
- List "what you can do" / "what you cannot do" rules specific to this
|
||||
target.
|
||||
|
||||
The starting kernel must be correct (passes the correctness phase against
|
||||
`reference.rs`) and lint-clean. The agent's job is to make it faster.
|
||||
|
||||
### 7. Rewrite `src/bin/run_experiment.rs`
|
||||
|
||||
Two phases:
|
||||
|
||||
- **Correctness phase:** for each `CorrectnessCase`, run agent kernel +
|
||||
reference, compare. Any mismatch → print `correctness: fail`, diagnostic
|
||||
line, exit 2.
|
||||
- **Speed phase:** for each `SpeedWorkload`, run agent kernel and time per
|
||||
query / per row / per byte. Aggregate geomean / worst / best across all
|
||||
combos. Print fixed-format result block.
|
||||
|
||||
Universal output fields (every target) are listed in `HARNESS.md` "The
|
||||
metric." Add per-target fields above them as needed (e.g., `bit_widths_tested`
|
||||
for bitpack).
|
||||
|
||||
Use:
|
||||
- `harness_common::geomean` for the aggregator
|
||||
- `harness_common::peak_rss_mb` for memory readback
|
||||
- `harness_common::TIME_BUDGET_SECS` for the time-budget check
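
The two-phase flow can be sketched as follows. The `geomean` body follows the function this commit moves into `harness-common`; `report` is a hypothetical driver that returns the exit code and output text instead of calling `std::process::exit`, so the logic stays testable.

```rust
/// Geometric mean of nanosecond timings, computed in log space.
/// Zero samples are clamped to 1 so `ln` stays finite.
fn geomean(xs: &[u64]) -> u64 {
    if xs.is_empty() {
        return 0;
    }
    let mut sum_ln = 0.0f64;
    for &x in xs {
        sum_ln += (x.max(1) as f64).ln();
    }
    (sum_ln / xs.len() as f64).exp() as u64
}

/// Hypothetical driver: a correctness failure short-circuits with exit code 2,
/// otherwise the speed aggregate is printed and the trial exits 0.
fn report(correct: bool, per_combo_ns: &[u64]) -> (i32, String) {
    if !correct {
        return (2, "correctness: fail".to_string());
    }
    let g = geomean(per_combo_ns);
    (0, format!("correctness: pass\ngeomean_ns_per_query: {g}"))
}
```

Running the correctness phase first means a broken kernel never produces a timing row, so a fast but wrong trial cannot be mistaken for an improvement.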
|
||||
|
||||
### 8. (Optional) Rewrite `benches/<my-target>.rs`
|
||||
|
||||
Criterion benchmark with the same kernel calls as `run_experiment` but
|
||||
under criterion's statistical-sampling harness. Optional — the per-trial
|
||||
binary is the agent's primary measurement; criterion is for the human's
|
||||
deeper investigation.
|
||||
|
||||
### 9. Write `program.md`
|
||||
|
||||
Per-target agent skill, layered on top of `HARNESS.md`. Sections:
|
||||
|
||||
- **Setup** — which files to read at session start (always include
|
||||
`../../HARNESS.md`).
|
||||
- **Public API contract** — the exact functions / structs the agent must
|
||||
keep stable.
|
||||
- **Target-specific priors** — known SIMD techniques for this kernel shape,
|
||||
algorithmic transformations worth trying, common pitfalls. This is the
|
||||
highest-leverage content; spend time on it.
|
||||
- **`results.tsv` header** — the per-target column set.
|
||||
|
||||
### 10. Write the per-target capsule in `docs/targets/<my-target>.md`
|
||||
|
||||
A short doc covering:
|
||||
|
||||
- What's optimized (one sentence)
|
||||
- Upstream Lance source pointers (rev, file paths, function names)
|
||||
- Oracle definition (bit-exact / `max_abs_err`)
|
||||
- Speed workload shape (which shapes × distributions it spans)
|
||||
- Status (candidate / landed / has-results)
|
||||
|
||||
### 11. Verify end-to-end
|
||||
|
||||
```bash
|
||||
cargo build --release -p <my-target>
|
||||
cargo clippy --release -p <my-target> --all-targets -- -D warnings
|
||||
cargo run --release --bin run_experiment -p <my-target>
|
||||
```
|
||||
|
||||
The baseline trial must:
|
||||
- Print `correctness: pass`
|
||||
- Exit 0
|
||||
- Finish within ~60s
|
||||
- Report a sensible `geomean_ns_per_*` baseline number
|
||||
|
||||
Smoke-test the gate: deliberately break `kernels.rs` (e.g., return constant
|
||||
zero), confirm the trial exits 2 with `correctness: fail`. Restore.
|
||||
|
||||
### 12. Update the target row in the top-level `README.md`
|
||||
|
||||
In the targets table at the top of the README, change the new target's row
|
||||
from `candidate` to `landed`.
|
||||
|
||||
### 13. Commit
|
||||
|
||||
One commit for the target's scaffolding. Don't bundle multiple targets in
|
||||
one commit — each target's history should be independently revertible.
|
||||
|
||||
## Common gotchas
|
||||
|
||||
- **Forgetting the empty `[workspace]` block** at the root means cargo walks
|
||||
up to the omnigraph parent workspace. Already handled; just don't remove it.
|
||||
- **Per-target `Cargo.toml` referencing the wrong `harness-common` path.**
|
||||
Use `harness-common = { path = "../harness-common" }`.
|
||||
- **Picking a `SHAPES` set that's too small.** Three shapes is the floor;
|
||||
with one shape an agent could specialize and pass, with two there's not
|
||||
enough variety. Ensure the shapes span at least one "outlier" (e.g., for
|
||||
PQ, one shape with `sub_vector_dim != 8`).
|
||||
- **Correctness battery too narrow.** Five distributions is the floor: at
|
||||
minimum Gaussian / uniform / sparse / large-dynamic-range / mostly-zero (or
|
||||
the integer analogue: uniform / clustered / skewed / few-distinct /
|
||||
monotonic).
|
||||
- **Trial time too long.** If the speed phase exceeds ~60s, the agent's iteration
rate drops too low to be useful. Reduce workload sizes; the speed metric is
|
||||
per-operation, not per-workload, so absolute size doesn't change the
|
||||
comparison.
|
||||
152
research/lance-autoresearch/docs/design.md
Normal file
|
|
@@ -0,0 +1,152 @@
|
|||
# Design — why the workspace is shaped this way
|
||||
|
||||
This document records the rationale for the multi-target workspace shape so
|
||||
future contributors don't relitigate the early decisions.
|
||||
|
||||
## The thing we're building
|
||||
|
||||
A multi-target harness for LLM-driven optimization of Lance hot-path kernels.
|
||||
"Multi-target" because Lance has many such kernels — distance kernels in
|
||||
`lance-linalg`, decoders in `lance-encoding`, scan/merge kernels — and the
|
||||
right harness shape is identical across them: bit-exact correctness oracle,
|
||||
geomean-across-distributions speed metric, single-agent autoresearch loop.
|
||||
|
||||
The original [research note](../../docs/research/llm-evolutionary-sampling.md)
|
||||
enumerates ten such candidates (A1–A10) clustered by Lance crate. The first
|
||||
landed target (`pq-l2`) proves the harness shape; the rest follow the same template.
|
||||
|
||||
## Decision: workspace, not single crate
|
||||
|
||||
A single crate exposing multiple binaries (`run_experiment_pq_l2`,
|
||||
`run_experiment_bitpack`, ...) was the obvious-looking alternative. Rejected
|
||||
for three reasons:
|
||||
|
||||
1. **Per-target deps differ.** FSST decode wants different deps than PQ
|
||||
kernels (a string-compression library vs. just `f32` math). A single
|
||||
`Cargo.toml` would either bundle every target's deps into every build or
|
||||
require fine-grained features. Workspaces give per-target `Cargo.toml`
|
||||
for free.
|
||||
|
||||
2. **Edit isolation.** The agent edits one target's `kernels.rs` at a time.
|
||||
In a single crate, `kernels.rs` files would collide on path or have to live
|
||||
in target-specific submodules with target-specific naming. Per-target
|
||||
crates put `src/kernels.rs` at the natural location every time and let the
|
||||
agent navigate one tree per session.
|
||||
|
||||
3. **Build / test isolation.** `cargo build -p pq-l2` builds only what's
|
||||
needed for the PQ L2 target; `cargo test -p pq-l2` runs only its tests.
|
||||
The agent's iteration loop is faster because it doesn't pay for unrelated
|
||||
targets' compile time.
|
||||
|
||||
The downside — workspace boilerplate, per-target `Cargo.toml`, the empty
|
||||
`[workspace]` block at the workspace root that prevents cargo from walking up
|
||||
to the parent omnigraph workspace — is a one-time cost. Per-target overhead
|
||||
of adding a new target is one `cp -r` plus path edits.
|
||||
|
||||
## Decision: shared `harness-common` crate, no `Target` trait
|
||||
|
||||
A `Target` trait was the other obvious-looking alternative: express the
|
||||
common loop generically, plug in target-specific types. Rejected because:
|
||||
|
||||
1. **Kernel signatures vary too much for a single trait shape.** PQ
|
||||
`probe_top_k` returns `Vec<(u32, f32)>`. Bitpack decode returns an
|
||||
`IntArray`. FSST decode returns `Vec<u8>`. Predicate evaluation returns a
|
||||
`BooleanArray`. A unifying trait would need erased boxing or a wide
|
||||
associated-type surface, both of which obscure the actual hot path the
|
||||
agent is editing.
|
||||
|
||||
2. **The orchestration that *is* shared is small.** A deterministic PRNG
|
||||
(~30 lines), a geomean (~10 lines), peak RSS readback (~20 lines), two
tolerance constants and a time budget. Total ~70 lines of shared code. Building a trait
|
||||
abstraction over 70 lines costs more than it saves.
|
||||
|
||||
3. **The output format isn't worth sharing.** Each target's
|
||||
`run_experiment.rs` prints a fixed-format result block; the *fields*
|
||||
differ per target (PQ shapes vs bit widths vs distribution kinds). A
|
||||
shared formatter would be either trivial wrapping of `println!` (no
|
||||
value) or a complicated builder API (negative value).
|
||||
|
||||
`harness-common` therefore exposes plumbing only: `SplitMix64`, `geomean`,
|
||||
`peak_rss_mb`, `MAX_ABS_ERR`, `TOPK_DIST_TOL`, `TIME_BUDGET_SECS`. Each
|
||||
target consumes what it needs. The shared loop contract is documented in
|
||||
`HARNESS.md`, not encoded in code.
|
||||
|
||||
## Decision: per-target `program.md` + shared `HARNESS.md`
|
||||
|
||||
The agent reads two files at session start:
|
||||
|
||||
- `HARNESS.md` (workspace-level) — universal: the loop, the metric, the
|
||||
edit-permission table, hygiene rules.
|
||||
- `crates/<target>/program.md` (per-target) — specific: the kernel API the
|
||||
agent must keep stable, target-specific priors (which SIMD intrinsics tend
|
||||
to win on this kernel shape), the `results.tsv` column header.
|
||||
|
||||
The shape mirrors how Karpathy's `nanochat-research` `program.md` works,
|
||||
factored across the dimension that varies (per target) vs. doesn't (the loop
|
||||
itself). Two files instead of one because copy-pasting the universal loop
|
||||
into every `program.md` makes them drift.
|
||||
|
||||
## Decision: dataset-independent oracle for every target
|
||||
|
||||
The first iteration of the harness used recall@K vs. SIFT1M as the
|
||||
correctness oracle. We replaced it with bit-exact (or near-bit-exact for
|
||||
floats) match against a scalar reference because:
|
||||
|
||||
1. The agent had incentive to overfit lossy approximations to the dataset's
|
||||
cluster structure, even though we didn't ask for that.
|
||||
2. SIFT1M is 250 MB and a hassle to download; the harness benefited from
|
||||
being self-contained.
|
||||
3. Mathematical equivalence is a strictly stronger contract than recall
|
||||
preservation: if the kernel is bit-equivalent to the scalar reference,
|
||||
recall is automatically identical because the distance values are the
|
||||
same. There's nothing recall@K catches that bit-exactness doesn't.
|
||||
|
||||
This decision generalizes to every target. Decode kernels get strict bitwise
|
||||
equality (no float arithmetic involved). Distance and BM25 kernels get
|
||||
`max_abs_err ≤ 1e-4` (loose enough for SIMD-accumulator reordering, tight
|
||||
enough for real bugs). Targets that genuinely require lossy techniques to
|
||||
get headroom — there might be some; LUT u8 quantization in PQ is one — go
|
||||
in a separate "lossy track" with a recall-based oracle on diverse datasets,
|
||||
not the bit-exact track.
|
||||
|
||||
## Decision: per-target speed measurement spans multiple shapes × distributions
|
||||
|
||||
A single dataset would let an agent overfit to that dataset's distribution.
|
||||
Each target's `inputs.rs` therefore generates speed workloads across:
|
||||
|
||||
- Multiple **shapes** of the kernel's domain (PQ: `(dim, num_sub_vectors,
|
||||
num_centroids)`; bitpack: bit width; etc.). Captures how the kernel
|
||||
performs at different sizes Lance users actually encounter.
|
||||
- Multiple **data distributions** (Gaussian / uniform / sparse for floats;
|
||||
uniform / skewed / clustered for integers; etc.). Captures whether the
|
||||
kernel's win is data-distribution-conditional.
|
||||
|
||||
The keep gate uses geomean across all (shape × distribution) combos with a
|
||||
worst-case guard: a kernel that wins on one combo and regresses ≥5% on
|
||||
another fails to keep, even if the geomean improves. This forces wins to
|
||||
generalize.
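
One way to read the keep gate is sketched below. It is an assumption-laden illustration rather than the harness's actual rule: the function name `keep`, the per-combo comparison against the last-kept timings, and passing the 5% threshold as a parameter are all choices made here.

```rust
/// Geometric mean in f64, with zero samples clamped to 1.
fn geo(xs: &[u64]) -> f64 {
    let sum_ln: f64 = xs.iter().map(|&x| (x.max(1) as f64).ln()).sum();
    (sum_ln / xs.len() as f64).exp()
}

/// Hypothetical keep gate. `prev` and `next` hold ns/query per
/// (shape x distribution) combo, in the same order.
/// Keep only if the geomean improves and no single combo regresses
/// by `max_regress` (e.g. 0.05) or more.
fn keep(prev: &[u64], next: &[u64], max_regress: f64) -> bool {
    if prev.is_empty() || prev.len() != next.len() {
        return false;
    }
    let improved = geo(next) < geo(prev);
    let no_regression = prev
        .iter()
        .zip(next)
        .all(|(&p, &n)| (n as f64) < (p as f64) * (1.0 + max_regress));
    improved && no_regression
}
```

With `prev = [100, 100, 100]`, a trial measuring `[10, 100, 110]` improves the geomean substantially yet is rejected, because the third combo regressed by 10%.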
|
||||
|
||||
## What's deliberately not abstracted
|
||||
|
||||
- **Output format.** Each target prints its own field block. See above.
|
||||
- **`TopKHeap` and other small data structures.** When two targets need a
|
||||
`TopKHeap`, the second one copies the first's. Three copies of a 30-line
|
||||
struct is cheaper than one trait-erased indirection.
|
||||
- **Test data shapes.** Each target's `inputs.rs` knows its own kernel's
|
||||
fixture shape. Sharing would require a generic `Fixture<Kernel>` trait,
|
||||
which would either be too narrow (forces every kernel into a `query +
|
||||
workload` shape) or too wide (gives up the type safety that makes the
|
||||
bench's correctness check obvious).
|
||||
|
||||
## When to revisit
|
||||
|
||||
If the workspace grows past ~6 active targets and we notice we're
|
||||
copy-pasting more than ~50 lines of `run_experiment.rs` boilerplate per new
|
||||
target, consider extracting a shared `RunExperiment` helper that takes
|
||||
closures for the correctness and speed phases. Don't pre-extract — wait
|
||||
until the duplication is real and visible.
|
||||
|
||||
If we add a target that genuinely doesn't fit the autoresearch loop (eval
|
||||
crosses ~30s; tournament sampling becomes the right control loop), it
|
||||
belongs in a separate workspace, not this one. The boundary line is the
|
||||
loop shape, not the target type.
|
||||
98
research/lance-autoresearch/docs/targets/pq-l2.md
Normal file
|
|
@@ -0,0 +1,98 @@
|
|||
# Target: `pq-l2`
|
||||
|
||||
PQ L2 distance kernel for f32 dense vectors — the asymmetric-distance compute
|
||||
that runs on every `IvfPq` / `IvfHnswPq` ANN query in Lance.
|
||||
|
||||
## Status
|
||||
|
||||
**Landed.** Baseline scalar kernel committed; the agent's job is to find
|
||||
generalizable speedups against it.
|
||||
|
||||
## What's optimized
|
||||
|
||||
Two functions in `crates/pq-l2/src/kernels.rs`:
|
||||
|
||||
- `PqKernel::distance_table(query)` — builds the asymmetric distance table
|
||||
(`[num_sub_vectors][num_centroids]`) for one query against the codebook.
|
||||
Cost: `num_sub_vectors × num_centroids × sub_vector_dim` MAC ops per query.
|
||||
- `PqKernel::probe_top_k(table, codes, num_vectors, k)` — probes
|
||||
`num_vectors` PQ-encoded vectors, accumulates per-vector distance via
|
||||
`num_sub_vectors` table lookups, returns top-K. Cost:
|
||||
`num_vectors × num_sub_vectors` lookups + heap maintenance per query.
|
||||
This is the dominant cost at typical scales.
|
||||
|
||||
`PqKernel::new(shape, codebook)` is also editable — the agent may pre-process
|
||||
the codebook (transpose layout, cache `c·c` for the FMA trick, pack the LUT)
|
||||
and amortize over queries; build cost is excluded from per-query timing.
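
Plugging the speed-workload shapes into the two cost formulas shows why the probe dominates. The helper below is only worked arithmetic; `per_query_cost` is a hypothetical name, and the 20,000-vector figure is the base-set size given under "Speed workload".

```rust
/// Returns (table-build MACs, probe lookups) for one query.
/// `dim`, `m` (num_sub_vectors) and `k` (num_centroids) follow the shape tuple order.
fn per_query_cost(dim: usize, m: usize, k: usize, num_vectors: usize) -> (usize, usize) {
    let sub_vector_dim = dim / m;
    let table_macs = m * k * sub_vector_dim;
    let probe_lookups = num_vectors * m;
    (table_macs, probe_lookups)
}
```

For `(128, 16, 256)` this gives 32,768 MACs against 320,000 lookups, and for `(768, 96, 256)` it gives 196,608 MACs against 1,920,000 lookups, so the probe does roughly ten times as many operations as the table build at this scale, before heap maintenance is counted.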
|
||||
|
||||
## Upstream Lance source
|
||||
|
||||
Algorithmically modeled on `lance-linalg::distance::l2` plus the PQ
|
||||
asymmetric-distance compute in `lance::index::vector::pq`. Specifically the
|
||||
f32 dense path; the byte / fixed-point variants are out of scope for this
|
||||
target.
|
||||
|
||||
When porting a winning kernel upstream:
|
||||
- File: `lance-linalg/src/distance/l2.rs` and the L2-specific path in
|
||||
`lance/src/index/vector/pq.rs`.
|
||||
- License: Apache-2.0 (matches our dual MIT/Apache-2.0 → upstream takes
|
||||
the Apache half).
|
||||
|
||||
## Oracle
|
||||
|
||||
**Float-accumulator-tolerance match against scalar reference.** Per
|
||||
`harness_common::MAX_ABS_ERR = 1e-4`:
|
||||
|
||||
- Distance table values must match the scalar reference within `1e-4` per
|
||||
element. Loose enough for legal SIMD-accumulator reordering, tight enough
|
||||
to catch real arithmetic bugs.
|
||||
- Top-K results compared with `harness_common::TOPK_DIST_TOL = 1e-4` plus
|
||||
tie-tolerant id substitution (any permutation within a tied-distance band
|
||||
is accepted).
|
||||
|
||||
The correctness phase asserts both on every input combination — five input
|
||||
distributions × three PQ shapes = 15 cases per trial.
|
||||
|
||||
## Speed workload

Three shapes:
- `(128, 16, 256)`: SIFT-like; sub_vector_dim = 8
- `(256, 16, 256)`: sub_vector_dim = 16
- `(768, 96, 256)`: BERT-base-like; large codebook

Three data distributions:
- `Clustered`: 32 cluster centers, low intra-cluster noise
- `Uniform`: uniform on [-1, 1]
- `Sparse`: 90% zeros + 10% Gaussian

Per (shape × distribution): 20,000 base vectors PQ-encoded, 32 queries
timed. Total trial wall-clock: ~30–60s on a developer laptop.

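The sizes implied by these shapes can be worked out with a small sketch. It assumes each tuple reads as `(dim, num_sub_vectors, num_centroids)`, which matches the quoted `sub_vector_dim` values. The `Shape` struct and its field names are hypothetical stand-ins, not the crate's `PqShape`.

```rust
/// Hypothetical stand-in for the crate's `PqShape`. Field names are assumed
/// from reading each tuple as (dim, num_sub_vectors, num_centroids).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Shape {
    dim: usize,
    num_sub_vectors: usize,
    num_centroids: usize,
}

impl Shape {
    /// Dimensions covered by one sub-quantizer.
    fn sub_vector_dim(&self) -> usize {
        assert_eq!(self.dim % self.num_sub_vectors, 0, "dim must divide evenly");
        self.dim / self.num_sub_vectors
    }
    /// f32 values in the codebook: m * k * d = num_centroids * dim.
    fn codebook_len(&self) -> usize {
        self.num_centroids * self.dim
    }
    /// f32 values in one per-query distance table: m * k.
    fn table_len(&self) -> usize {
        self.num_sub_vectors * self.num_centroids
    }
    /// Bytes of PQ codes for `n` base vectors, one u8 per sub-vector.
    fn codes_len(&self, n: usize) -> usize {
        n * self.num_sub_vectors
    }
}
```

Under these assumptions the `(768, 96, 256)` shape has a 196,608-float codebook (768 KiB), a 24,576-float distance table (96 KiB) per query, and 1,920,000 code bytes for the 20,000 base vectors.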
## Output fields

```
correctness: pass | fail
shapes_tested: (128,16,256) (256,16,256) (768,96,256)
distributions_tested: clustered uniform sparse
geomean_ns_per_query: <u64>
worst_ns_per_query: <u64> (<shape>, <dist>)
best_ns_per_query: <u64> (<shape>, <dist>)
per_combo_geomean_ns:
  (...)
peak_mem_mb: <f64>
total_seconds: <f64>
```

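A hedged sketch of reading these fields back out of `run.log`. It assumes the plain `key: value` layout shown above; the function names are hypothetical, and the exact format of the per-combo rows is not specified here, so the parser makes no attempt to interpret them.

```rust
use std::collections::HashMap;

/// Collect every line containing a colon into a key/value map.
/// Keys and values are trimmed; lines without a colon are skipped.
fn parse_fields(log: &str) -> HashMap<String, String> {
    log.lines()
        .filter_map(|line| line.split_once(':'))
        .map(|(k, v)| (k.trim().to_string(), v.trim().to_string()))
        .collect()
}

/// Leading unsigned integer of a value such as `20500 (<shape>, <dist>)`.
fn leading_u64(value: &str) -> Option<u64> {
    value.split_whitespace().next()?.parse().ok()
}
```

For example, `leading_u64(&fields["worst_ns_per_query"])` yields the nanosecond count and leaves the trailing combo label in the raw string.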
## Known headroom (priors for the agent)

See `crates/pq-l2/program.md` "Lance-PQ-specific priors" for the canonical
list. Highlights:

- Codebook layout transpose (`[m][k][d]` → `[m][d][k]`) for SIMD-broadcast
  table build.
- Cache `c·c` per centroid in `new()` so the inner loop is `q·q − 2qc + c·c`
  (one FMA chain).
- Probe-side code transpose so the inner loop processes 32+ vectors per
  iteration via gather.
- Top-K block-then-merge instead of per-vector heap insert.
- Prefetch on `codes[i+64]` ahead of gather.
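The first highlight can be illustrated with a scalar sketch. It assumes a row-major source layout; `transpose_codebook` and `table_row_transposed` are hypothetical names, and the inner loop is left for the compiler to vectorize rather than written with intrinsics.

```rust
/// Transpose a row-major `[m][k][d]` codebook into `[m][d][k]`, so the `k`
/// centroids of one sub-quantizer are contiguous for each dimension.
fn transpose_codebook(src: &[f32], m: usize, k: usize, d: usize) -> Vec<f32> {
    assert_eq!(src.len(), m * k * d);
    let mut dst = vec![0.0f32; src.len()];
    for mi in 0..m {
        for ki in 0..k {
            for di in 0..d {
                dst[(mi * d + di) * k + ki] = src[(mi * k + ki) * d + di];
            }
        }
    }
    dst
}

/// Build one sub-quantizer's row of the distance table from its `[d][k]`
/// slice: for each dimension, broadcast the query element and stream over `k`.
fn table_row_transposed(q_sub: &[f32], cb_t: &[f32], k: usize) -> Vec<f32> {
    assert_eq!(cb_t.len(), q_sub.len() * k);
    let mut row = vec![0.0f32; k];
    for (di, &q) in q_sub.iter().enumerate() {
        let lane = &cb_t[di * k..(di + 1) * k];
        for (acc, &c) in row.iter_mut().zip(lane) {
            let diff = q - c;
            *acc += diff * diff;
        }
    }
    row
}
```

The inner loop walks contiguous memory over `k` with a loop-invariant `q`, which is the access pattern a broadcast-and-accumulate SIMD kernel would use.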
@@ -1,172 +0,0 @@
# Lance PQ L2 kernel research: agent instructions

You are an autonomous research assistant. Your job is to improve `src/kernels.rs`
so that `cargo run --release --bin run_experiment` reports a **lower
`geomean_ns_per_query`** while:

1. The **correctness phase passes**: your kernel's distance values must match the
   scalar reference within `MAX_ABS_ERR = 1e-4`, and the top-K must be
   tie-tolerant equivalent on every input the bench generates.
2. The `worst_ns_per_query` does **not regress more than 5%** against the
   last-kept kernel. If you win on one (shape × distribution) and lose
   significantly on another, the change isn't a generalizable improvement.

This bench is intentionally **dataset-independent**: there is no fixed dataset.
The correctness oracle is mathematical equivalence to the scalar reference,
checked across multiple PQ shapes and synthetic input distributions
(Gaussian / uniform / sparse / large-dynamic-range / mostly-zero). The speed
oracle is the geomean across multiple shapes × distributions, with worst-case
guarded. A win that depends on a specific data distribution or PQ shape will
fail to clear the bar by construction.

Read this file end-to-end before doing anything else. Then run setup, then the loop.

## Setup (do once at the start of every session)

1. Read these files, in this order:
   - `README.md`
   - `program.md` (this file)
   - `src/lib.rs`
   - `src/kernels.rs` *(the only file you may edit)*
   - `src/reference.rs`
   - `src/inputs.rs`
   - `src/bin/run_experiment.rs`
2. Ensure `results.tsv` exists. If not, create it with this header line:
   ```
   commit timestamp correctness geomean_ns worst_ns worst_combo best_ns best_combo peak_mem_mb total_seconds keep description
   ```
3. Run the baseline trial: `cargo run --release --bin run_experiment > run.log 2>&1`.
   Confirm `correctness: pass`. Parse `run.log` and append a row to `results.tsv`
   with `keep=baseline` and `description="seeded scalar PQ-L2 baseline"`. This
   is your reference number.
4. Commit the baseline row with a one-line message like `baseline: <numbers>`.

## What you CAN do

- Modify **`src/kernels.rs`** freely. You may:
  - Pre-process the codebook in `PqKernel::new` (transpose layouts, cache
    `c·c` for the FMA trick, pack the codebook for register-resident lookup,
    etc.). This cost is paid once per dataset and amortized across queries;
    the bench measures per-query, not per-(build + query).
  - Reorder loops, switch internal data layouts, drop down to `std::arch`
    intrinsics under `#[cfg(target_arch = ...)]` gates. **Always keep a
    portable scalar fallback** so the kernel compiles everywhere.
  - Use `unsafe` if needed; document the invariants you're relying on.
  - Mark hot functions `#[inline]`; add private helpers freely.
  - Add `#[cfg(test)] mod tests { ... }` inside `src/kernels.rs` if you want
    in-file property checks.

## What you CANNOT do

- Do **not** modify `src/lib.rs` (`PqShape` and the tolerance constants are
  shared with the immutable scaffolding).
- Do **not** modify `src/bin/run_experiment.rs`, `src/reference.rs`,
  `src/inputs.rs`, `benches/pq_l2.rs`, or `Cargo.toml`.
- Do **not** add new crate dependencies.
- Do **not** alter the public API of `kernels::PqKernel`:
  - `PqKernel::new(shape: PqShape, codebook: &[f32]) -> Self`
  - `PqKernel::shape(&self) -> &PqShape`
  - `PqKernel::distance_table(&self, query: &[f32]) -> Vec<f32>`
  - `PqKernel::probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)>`
- Do **not** introduce lossy techniques (LUT u8/u16 quantization,
  asymmetric-distance approximation, etc.), because the correctness phase
  asserts exact-up-to-ε match against the scalar reference. If you want to
  explore a lossy track, surface that in a separate kernel and propose a
  track extension.

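For orientation, a plain scalar kernel with the four signatures listed above might look like the sketch below. It is not the repository's `src/kernels.rs`: the `PqShape` field names, the `[m][k][d]` codebook layout, and the `codes[i * m + mi]` code layout are all assumptions made for illustration.

```rust
/// Hypothetical shape type; the real `PqShape` lives in `src/lib.rs` and its
/// fields may differ.
#[derive(Clone, Debug)]
pub struct PqShape {
    pub dim: usize,
    pub num_sub_vectors: usize,
    pub num_centroids: usize,
}

pub struct PqKernel {
    shape: PqShape,
    codebook: Vec<f32>, // assumed row-major [m][k][d]
}

impl PqKernel {
    pub fn new(shape: PqShape, codebook: &[f32]) -> Self {
        assert_eq!(codebook.len(), shape.num_centroids * shape.dim);
        Self { shape, codebook: codebook.to_vec() }
    }

    pub fn shape(&self) -> &PqShape {
        &self.shape
    }

    /// Squared L2 distance from each query sub-vector to each centroid,
    /// returned as a flat `[m][k]` table.
    pub fn distance_table(&self, query: &[f32]) -> Vec<f32> {
        let (m, k) = (self.shape.num_sub_vectors, self.shape.num_centroids);
        let d = self.shape.dim / m;
        let mut table = Vec::with_capacity(m * k);
        for mi in 0..m {
            let q = &query[mi * d..(mi + 1) * d];
            for ki in 0..k {
                let c = &self.codebook[(mi * k + ki) * d..][..d];
                table.push(q.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum::<f32>());
            }
        }
        table
    }

    /// Sum table lookups per encoded vector, then keep the `k` smallest.
    pub fn probe_top_k(&self, table: &[f32], codes: &[u8], num_vectors: usize, k: usize) -> Vec<(u32, f32)> {
        let (m, kc) = (self.shape.num_sub_vectors, self.shape.num_centroids);
        let mut all: Vec<(u32, f32)> = (0..num_vectors)
            .map(|i| {
                let dist: f32 = (0..m)
                    .map(|mi| table[mi * kc + codes[i * m + mi] as usize])
                    .sum();
                (i as u32, dist)
            })
            .collect();
        // Full sort for clarity; ties break toward the lower id.
        all.sort_by(|a, b| a.1.total_cmp(&b.1).then(a.0.cmp(&b.0)));
        all.truncate(k);
        all
    }
}
```

Any optimization has to preserve these signatures and match the values this kind of straightforward implementation produces, up to the stated tolerance.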
## The metric

Minimize `geomean_ns_per_query` (geometric mean of per-query wall-clock across
all timed queries, all shapes, all distributions) subject to:

1. Correctness phase: **pass** (the run exits with code 2 otherwise).
2. `worst_ns_per_query` ≤ 1.05 × the last-kept kernel's worst.
3. `total_seconds` ≤ 600.
4. Build is clean: `cargo build --release` succeeds, `cargo clippy --release
   --all-targets -- -D warnings` reports zero issues.

Ties break toward simpler code. If two kernels report the same speed within
~3% noise, prefer fewer lines / less `unsafe`.

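As a reminder of what the headline number measures, here is a small sketch of a geometric mean over nanosecond samples. The harness has its own `geomean` helper whose signature is not shown in this file, so `geomean_ns` below is a hypothetical illustration only.

```rust
/// Geometric mean computed in log space, so the product of many large
/// nanosecond counts cannot overflow. Returns `None` for empty input or a
/// zero sample (whose logarithm is undefined).
fn geomean_ns(samples: &[u64]) -> Option<f64> {
    if samples.is_empty() || samples.contains(&0) {
        return None;
    }
    let log_sum: f64 = samples.iter().map(|&ns| (ns as f64).ln()).sum();
    Some((log_sum / samples.len() as f64).exp())
}
```

Because the mean is taken over logarithms, halving the time of a fast combo moves the metric exactly as much as halving the time of a slow one, which is why the separate `worst_ns_per_query` guard exists.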
## Lance-PQ-specific priors (lossless directions)

These directions are known to pay off without compromising arithmetic accuracy.
Pick one hypothesis at a time; implement; measure; decide.

- **Codebook layout.** The reference layout is `[m][k][d]`. For a fixed query,
  iterating over centroids stays in cache, but the inner loop over `d` is
  short. Transposing to `[m][d][k]` makes the `k` axis contiguous, so you can
  SIMD-load 8 centroid lanes across `k` and broadcast each query element over
  them. Do the transpose in `PqKernel::new` once.
- **Cache `c·c`.** The diff–square–sum is `(q - c)·(q - c) = q·q - 2qc + c·c`.
  Hoist `q·q` per sub-vector, precompute `c·c` once at codebook-load time.
  Inner loop becomes one FMA (`-2qc + cc`). Watch the sign / accumulator
  ordering so the rounding stays within tolerance.
- **Probe layout.** The probe is dominated by `acc += table[m][codes[off+m]]`
  × `num_sub_vectors`. Transposing codes to `[m][i]` (one row per sub-quantizer,
  contiguous over base index) lets you process 32 or more vectors per inner
  iteration with `vgatherdps`-style loads.
- **Top-K integration.** `push()` does a branch + heap sift on every code.
  At 50k probes per query × 9 (shape × dist) combos that's the second-biggest
  cost after the gather. Block the probe (e.g., 512 codes at a time), find the
  local top-K with a branchless pass, then merge into the global heap.
- **Prefetch.** A
  `_mm_prefetch::<_MM_HINT_T0>(codes.as_ptr().add(off + 64).cast())`
  ahead of the gather is usually pure win at 50k+ scale where codes don't all
  fit in L2.
- **FMA chains for table build.** The diff–square–sum maps cleanly to FMA on
  AVX2/NEON. Even without intrinsics, structuring the inner loop so `rustc`
  emits FMA helps.
- **Avoid the `Vec` allocation in the hot path.** `distance_table` allocates a
  fresh `Vec<f32>` per call. Returning a fixed-capacity buffer is a public-API
  change you can't make, but you can reuse a thread-local scratch buffer
  internally if it speeds the build.

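The block-then-merge structure from the top-K item can be sketched as below. This shows only the control flow, under stated assumptions: it selects with `select_nth_unstable_by` rather than the branchless pass the prior describes, it merges by sorting a small vector instead of a heap, and `top_k_blocked` is a hypothetical name.

```rust
/// Smallest-`k` selection over `dists`, processed `block` entries at a time.
/// Each block is reduced to its local top-k, then merged into the running
/// result, so the global structure is touched once per block instead of
/// once per vector.
fn top_k_blocked(dists: &[f32], k: usize, block: usize) -> Vec<(u32, f32)> {
    assert!(block > 0);
    // Order by distance, then by id so ties resolve deterministically.
    let by_dist = |a: &(u32, f32), b: &(u32, f32)| a.1.total_cmp(&b.1).then(a.0.cmp(&b.0));
    let mut best: Vec<(u32, f32)> = Vec::with_capacity(2 * k);
    for (bi, chunk) in dists.chunks(block).enumerate() {
        let base = (bi * block) as u32;
        let mut local: Vec<(u32, f32)> = chunk
            .iter()
            .enumerate()
            .map(|(i, &d)| (base + i as u32, d))
            .collect();
        if k > 0 && local.len() > k {
            // Partition so the k smallest entries occupy local[..k].
            local.select_nth_unstable_by(k - 1, by_dist);
            local.truncate(k);
        }
        best.extend(local);
        best.sort_by(by_dist);
        best.truncate(k);
    }
    best
}
```

The result is identical to a full sort followed by truncation, so a kernel restructured this way stays lossless.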
## The loop

Once setup is done, repeat indefinitely:

1. **Observe state.** Read the last ~5 rows of `results.tsv`. Note which ideas
   have been tried, what won, what regressed. Form a hypothesis with one
   sentence stating the change and the predicted effect on speed and
   correctness.
2. **Edit `src/kernels.rs`.** Keep the diff focused on the one hypothesis.
3. **Build and lint.**
   ```
   cargo build --release
   cargo clippy --release --all-targets -- -D warnings
   ```
   If either fails, fix and try again; do not commit broken state.
4. **Run the trial.**
   ```
   cargo run --release --bin run_experiment > run.log 2>&1
   ```
5. **Parse the result.** Extract `correctness`, `geomean_ns_per_query`,
   `worst_ns_per_query` (with combo), `peak_mem_mb`, `total_seconds`. Compute
   deltas vs. the baseline and the last-kept row.
6. **Decide keep or revert.**
   - **Keep** iff: `correctness: pass`, geomean strictly better than the
     last-kept row (allow ~1% noise band), and `worst_ns_per_query` ≤ 1.05 ×
     last-kept's worst.
   - **Revert** otherwise: `git restore src/kernels.rs` (or commit and
     `git revert` if you want the revert in history). Note what failed.
7. **Log.** Append one row to `results.tsv`:
   ```
   <short_sha> <iso8601> <correctness> <geomean_ns> <worst_ns> <worst_combo> <best_ns> <best_combo> <peak_mem> <elapsed> <keep|revert> <one-line description>
   ```
8. **Commit.** One-line message describing the change and the headline number,
   e.g. `transpose codebook in new(); 18.2k → 14.1k geomean ns (worst -8%)`.

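The keep-or-revert rule in step 6 can be written down as a small predicate. This is a sketch under assumptions: the `Trial` and `Decision` types are hypothetical, and the "~1% noise band" is interpreted here as the minimum improvement required before a change counts as a win, which is one reading of the rule rather than a confirmed one.

```rust
#[derive(Debug, Clone, Copy)]
struct Trial {
    correctness_pass: bool,
    geomean_ns: f64,
    worst_ns: f64,
    total_seconds: f64,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Keep,
    Revert,
}

/// Apply the keep/revert rule to a candidate against the last-kept row.
fn decide(candidate: &Trial, last_kept: &Trial) -> Decision {
    const NOISE_BAND: f64 = 0.01; // assumed meaning of "~1% noise band"
    const WORST_SLACK: f64 = 1.05; // worst case may regress at most 5%
    const TIME_BUDGET_SECS: f64 = 600.0; // per-trial cap from the metric section

    let faster = candidate.geomean_ns < last_kept.geomean_ns * (1.0 - NOISE_BAND);
    let worst_ok = candidate.worst_ns <= last_kept.worst_ns * WORST_SLACK;
    let in_budget = candidate.total_seconds <= TIME_BUDGET_SECS;

    if candidate.correctness_pass && faster && worst_ok && in_budget {
        Decision::Keep
    } else {
        Decision::Revert
    }
}
```

With the example numbers from the commit message in step 8, a drop from 18,200 to 14,100 geomean ns with an improved worst case is kept, while the same geomean win paired with a 10% worst-case regression is reverted.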
## Hygiene

- Always commit `src/kernels.rs` changes; never commit `results.tsv` or
  `run.log` (they're gitignored).
- If a change fails to build, do not commit. Iterate until it builds, or
  revert cleanly.
- If two consecutive ideas regress, take a beat: re-read the last ~10 rows of
  `results.tsv` and update your mental model before proposing the next.
- Per-trial cap: 10 minutes. If `cargo run` is still going after 10 min, kill it
  and mark the trial as `timeout`.

## Never stop

Keep going until interrupted. Each loop iteration is one hypothesis, one edit,
one measurement, one commit. No multi-step plans across iterations.