omnigraph/research/lance-autoresearch/scripts/prepare_fixtures.sh
Claude ed376af7d8
research: lance-autoresearch — PQ L2 kernel autoresearch harness
Stand up a standalone Rust project under research/lance-autoresearch/ for
LLM-driven optimization of Lance's PQ L2 distance kernels, following Karpathy's
three-file autoresearch contract:

  - src/kernels.rs (mutable, the agent's playground): scalar baseline PQ L2
    distance + top-K matching Lance 4.x's algorithm shape (16 sub-vectors,
    256 centroids, 8-bit codes, 128-d f32).
  - src/{fixture,reference,bin/run_experiment}.rs (immutable): SIFT1M loader
    (fvecs/ivecs + frozen codebook) with deterministic synthetic fallback,
    brute-force ground truth, fixed-format result block with recall@10 floor
    + time-budget exits.
  - program.md (human-iterated): the skill the agent reads each session —
    setup, what it can / cannot edit, the metric, Lance-PQ-specific priors,
    the keep/revert loop.

Smoke tests pass: baseline build clean, recall@10 = 0.66 on synthetic above
the 0.50 floor (exit 0), broken kernel triggers floor failure (exit 2),
clippy -D warnings clean. Excludes research/ from omnigraph workspace so
the nested project doesn't enter omnigraph's cargo build graph.

Licensed dual MIT / Apache-2.0 to keep the upstream-PR path to lance-format/lance
clean.

https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
2026-05-14 22:38:39 +00:00

46 lines
1.8 KiB
Bash
Executable file

#!/usr/bin/env bash
# IMMUTABLE. One-time SIFT1M fixture preparation.
#
# Downloads SIFT1M from the Texmex corpus (Inria), extracts the f32 vector
# files + ground-truth, then runs the in-tree fixture builder to train a
# product-quantization codebook and encode the base set. All artifacts are
# written under ~/.cache/lance-autoresearch/ so they survive between trials
# but stay out of git.
#
# Total time: ~5-10 min on a fresh laptop. ~250 MB download.
set -euo pipefail

CACHE_DIR="${HOME}/.cache/lance-autoresearch"
SIFT_URL="ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz"
SIFT_URL_MIRROR="https://huggingface.co/datasets/qbo-odp/sift1m/resolve/main/sift.tar.gz"
# Resolve the project root up front: a relative $0 is no longer valid once we
# cd into the cache directory below.
PROJECT_DIR="$(cd "$(dirname "$0")/.." && pwd)"

mkdir -p "${CACHE_DIR}"
cd "${CACHE_DIR}"

if [[ ! -f sift_base.fvecs || ! -f sift_query.fvecs || ! -f sift_groundtruth.ivecs ]]; then
    echo "[prepare_fixtures] downloading SIFT1M..."
    if [[ ! -f sift.tar.gz ]]; then
        # Try the primary Texmex host first, then fall back to the mirror.
        curl --fail -L -o sift.tar.gz "${SIFT_URL}" || \
            curl --fail -L -o sift.tar.gz "${SIFT_URL_MIRROR}"
    fi
    echo "[prepare_fixtures] extracting..."
    tar xzf sift.tar.gz
    mv sift/sift_base.fvecs ./sift_base.fvecs
    mv sift/sift_query.fvecs ./sift_query.fvecs
    mv sift/sift_groundtruth.ivecs ./sift_groundtruth.ivecs
    rm -rf sift sift.tar.gz
fi

if [[ ! -f pq_codebook.bin || ! -f pq_codes.bin ]]; then
    echo "[prepare_fixtures] training PQ codebook + encoding base..."
    # The fixture builder is run as a `cargo test` with a marker env var so we
    # don't have to add a second binary just for one-time setup. The test reads
    # SIFT1M, calls the in-tree `train_codebook` + `encode`, and writes the
    # frozen artifacts next to the dataset.
    cd "${PROJECT_DIR}"
    LANCE_AUTORESEARCH_BUILD_FIXTURES=1 cargo test --release --lib build_fixtures -- --ignored --nocapture
fi
echo "[prepare_fixtures] done — fixtures in ${CACHE_DIR}"
ls -la "${CACHE_DIR}"
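
A truncated download is the most likely way for the extraction step above to leave bad fixtures behind. The sketch below is not part of the original script: it is a hypothetical size check, with the helper names `expected_vecs_bytes` and `check_vecs_file` invented for illustration. It assumes the standard .fvecs/.ivecs layout (a 4-byte dimension header followed by that many 4-byte values per record) and the usual SIFT1M shapes (1,000,000 base vectors and 10,000 queries at 128 dimensions, 100 ground-truth neighbours per query).

```shell
# Hypothetical sanity check, assuming the standard .fvecs/.ivecs record layout.

# Bytes occupied by COUNT records of DIM 4-byte components, where every record
# starts with a 4-byte dimension header.
expected_vecs_bytes() {
    local count="$1" dim="$2"
    echo $(( count * (4 + 4 * dim) ))
}

# Compare a file's on-disk size with the expected size.
# Returns non-zero (and reports on stderr) on a mismatch.
check_vecs_file() {
    local path="$1" count="$2" dim="$3"
    local want got
    want="$(expected_vecs_bytes "${count}" "${dim}")"
    got="$(wc -c < "${path}" | tr -d '[:space:]')"
    if [[ "${got}" != "${want}" ]]; then
        echo "[check] ${path}: expected ${want} bytes, found ${got}" >&2
        return 1
    fi
    echo "[check] ${path}: ok (${got} bytes)"
}

# Example calls, assuming the usual SIFT1M shapes:
#   check_vecs_file "${CACHE_DIR}/sift_base.fvecs"        1000000 128
#   check_vecs_file "${CACHE_DIR}/sift_query.fvecs"       10000   128
#   check_vecs_file "${CACHE_DIR}/sift_groundtruth.ivecs" 10000   100
```

Checking sizes rather than hashing keeps the check fast on the base file and needs no published checksum, at the cost of missing corruption that preserves length.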