mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-07-03 02:51:04 +02:00
Stand up a standalone Rust project under research/lance-autoresearch/ for
LLM-driven optimization of Lance's PQ L2 distance kernels, following Karpathy's
three-file autoresearch contract:
- src/kernels.rs (mutable, the agent's playground): scalar baseline PQ L2
  distance + top-K matching Lance 4.x's algorithm shape (16 sub-vectors,
  256 centroids, 8-bit codes, 128-d f32).
- src/{fixture,reference,bin/run_experiment}.rs (immutable): SIFT1M loader
  (fvecs/ivecs + frozen codebook) with deterministic synthetic fallback,
  brute-force ground truth, fixed-format result block with recall@10 floor
  + time-budget exits.
- program.md (human-iterated): the skill the agent reads each session:
  setup, what it can / cannot edit, the metric, Lance-PQ-specific priors,
  the keep/revert loop.
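To make the kernel shape in the first bullet concrete, here is a minimal sketch of a scalar PQ L2 distance with a full-sort top-K, using the stated parameters (128-d f32, 16 sub-vectors, 256 centroids, 8-bit codes). The function names and the flattened `[sub-vector][centroid][dimension]` codebook layout are assumptions for illustration; the actual contents of src/kernels.rs are not shown here and may differ.

```rust
// Shape from the commit message: 128-d f32 vectors, 16 sub-vectors,
// 256 centroids per sub-vector, one 8-bit code per sub-vector.
pub const DIM: usize = 128;
pub const NUM_SUB: usize = 16;
pub const SUB_DIM: usize = DIM / NUM_SUB; // 8 dimensions per sub-vector
pub const NUM_CENTROIDS: usize = 256;

/// Build the per-query lookup table: table[m * 256 + c] holds the squared L2
/// distance between query sub-vector m and centroid c of sub-codebook m.
/// ASSUMPTION: the codebook is flattened row-major as [m][c][d].
pub fn build_distance_table(query: &[f32], codebook: &[f32]) -> Vec<f32> {
    assert_eq!(query.len(), DIM);
    assert_eq!(codebook.len(), NUM_SUB * NUM_CENTROIDS * SUB_DIM);
    let mut table = vec![0.0f32; NUM_SUB * NUM_CENTROIDS];
    for m in 0..NUM_SUB {
        let q = &query[m * SUB_DIM..(m + 1) * SUB_DIM];
        for c in 0..NUM_CENTROIDS {
            let start = (m * NUM_CENTROIDS + c) * SUB_DIM;
            let centroid = &codebook[start..start + SUB_DIM];
            table[m * NUM_CENTROIDS + c] = q
                .iter()
                .zip(centroid)
                .map(|(a, b)| (a - b) * (a - b))
                .sum::<f32>();
        }
    }
    table
}

/// Approximate squared L2 distance to one encoded vector: one table lookup
/// per sub-vector, summed.
pub fn pq_distance(table: &[f32], code: &[u8]) -> f32 {
    code.iter()
        .enumerate()
        .map(|(m, &c)| table[m * NUM_CENTROIDS + c as usize])
        .sum::<f32>()
}

/// Score every encoded row and return the k nearest (distance, row id)
/// pairs in ascending distance order. A full sort keeps the baseline simple.
pub fn top_k(table: &[f32], codes: &[u8], k: usize) -> Vec<(f32, usize)> {
    let mut scored: Vec<(f32, usize)> = codes
        .chunks_exact(NUM_SUB)
        .enumerate()
        .map(|(id, code)| (pq_distance(table, code), id))
        .collect();
    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
    scored.truncate(k);
    scored
}
```

The lookup table is built once per query, so the per-row cost is 16 loads and adds. That inner scan is the natural target for an optimizing agent, while the table build stays comparatively cheap.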
Smoke tests pass: baseline build clean, recall@10 = 0.66 on synthetic above
the 0.50 floor (exit 0), broken kernel triggers floor failure (exit 2),
clippy -D warnings clean. Excludes research/ from omnigraph workspace so
the nested project doesn't enter omnigraph's cargo build graph.
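The floor check described above can be sketched as follows. This assumes recall@10 is the average fraction of true top-k neighbours that appear in the kernel's top-k, and that result ids within one query are distinct; `recall_at_k` and `exit_code_for` are hypothetical names, and only the 0.50 floor and exit codes 0 and 2 come from the text. The time-budget exit is left out because its code is not stated.

```rust
/// Floor taken from the smoke-test description above.
pub const RECALL_FLOOR: f64 = 0.50;

/// recall@k averaged over queries: the fraction of the true k nearest
/// neighbours that appear among the first k ids returned for each query.
/// ASSUMPTION: ids within one result list are distinct.
pub fn recall_at_k(results: &[Vec<u32>], truth: &[Vec<u32>], k: usize) -> f64 {
    assert_eq!(results.len(), truth.len());
    if results.is_empty() || k == 0 {
        return 0.0;
    }
    let mut hits = 0usize;
    for (res, gt) in results.iter().zip(truth) {
        // Only the first k ground-truth ids count as relevant.
        let gt_k = &gt[..k.min(gt.len())];
        hits += res
            .iter()
            .take(k)
            .filter(|&&id| gt_k.contains(&id))
            .count();
    }
    hits as f64 / (results.len() * k) as f64
}

/// Map a measured recall to a process exit code: 0 at or above the floor,
/// 2 below it.
pub fn exit_code_for(recall: f64) -> i32 {
    if recall >= RECALL_FLOOR {
        0
    } else {
        2
    }
}
```

With this mapping, the synthetic baseline at 0.66 exits 0, and a kernel that returns mostly wrong neighbours exits 2.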
Licensed dual MIT / Apache-2.0 to keep the upstream-PR path to lance-format/lance
clean.
https://claude.ai/code/session_01Aq8kBUcjmEPobcEufnWbW5
46 lines
1.8 KiB
Bash
Executable file
#!/usr/bin/env bash
# IMMUTABLE. One-time SIFT1M fixture preparation.
#
# Downloads SIFT1M from the Texmex corpus (Inria), extracts the f32 vector
# files + ground-truth, then runs the in-tree fixture builder to train a
# product-quantization codebook and encode the base set. All artifacts are
# written under ~/.cache/lance-autoresearch/ so they survive between trials
# but stay out of git.
#
# Total time: ~5–10 min on a fresh laptop. ~250 MB download.

set -euo pipefail

# Resolve the project root up front: the script changes into CACHE_DIR below,
# after which a relative "$0" would no longer point at this file.
PROJECT_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"

CACHE_DIR="${HOME}/.cache/lance-autoresearch"
SIFT_URL="ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz"
SIFT_URL_MIRROR="https://huggingface.co/datasets/qbo-odp/sift1m/resolve/main/sift.tar.gz"

mkdir -p "${CACHE_DIR}"
cd "${CACHE_DIR}"

if [[ ! -f sift_base.fvecs || ! -f sift_query.fvecs || ! -f sift_groundtruth.ivecs ]]; then
    echo "[prepare_fixtures] downloading SIFT1M..."
    if [[ ! -f sift.tar.gz ]]; then
        # Try the primary FTP source first, then fall back to the mirror.
        curl --fail -L -o sift.tar.gz "${SIFT_URL}" || \
            curl --fail -L -o sift.tar.gz "${SIFT_URL_MIRROR}"
    fi
    echo "[prepare_fixtures] extracting..."
    tar xzf sift.tar.gz
    mv sift/sift_base.fvecs ./sift_base.fvecs
    mv sift/sift_query.fvecs ./sift_query.fvecs
    mv sift/sift_groundtruth.ivecs ./sift_groundtruth.ivecs
    rm -rf sift sift.tar.gz
fi

if [[ ! -f pq_codebook.bin || ! -f pq_codes.bin ]]; then
    echo "[prepare_fixtures] training PQ codebook + encoding base..."
    # The fixture builder is run as a `cargo test` with a marker env var so we
    # don't have to add a second binary just for one-time setup. The test reads
    # SIFT1M, calls the in-tree `train_codebook` + `encode`, and writes the
    # frozen artifacts next to the dataset.
    cd "${PROJECT_ROOT}"
    LANCE_AUTORESEARCH_BUILD_FIXTURES=1 cargo test --release --lib build_fixtures -- --ignored --nocapture
fi

echo "[prepare_fixtures] done: fixtures in ${CACHE_DIR}"
ls -la "${CACHE_DIR}"
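For readers unfamiliar with the files this script extracts, the sketch below parses the .fvecs layout used by the Texmex corpus: every vector is stored as a little-endian 32-bit dimension followed by that many little-endian f32 components. `parse_fvecs` is a hypothetical name and works on an in-memory buffer; the project's actual loader in src/fixture.rs is not shown here and may be structured differently.

```rust
/// Parse an in-memory .fvecs buffer into (dimension, flat row-major data).
/// Each record is: i32 dimension (little-endian), then `dimension` f32 values.
/// ASSUMPTION: all records share one dimension; a mismatch is an error.
pub fn parse_fvecs(bytes: &[u8]) -> Result<(usize, Vec<f32>), String> {
    let mut pos = 0usize;
    let mut dim = 0usize;
    let mut data: Vec<f32> = Vec::new();
    while pos < bytes.len() {
        // Read the 4-byte dimension header of the next record.
        let header = bytes
            .get(pos..pos + 4)
            .ok_or("truncated dimension header")?;
        let d = i32::from_le_bytes([header[0], header[1], header[2], header[3]]);
        if d <= 0 {
            return Err(format!("invalid dimension {}", d));
        }
        let d = d as usize;
        if dim == 0 {
            dim = d;
        } else if d != dim {
            return Err(format!("dimension changed from {} to {}", dim, d));
        }
        pos += 4;
        // Read the vector body: d little-endian f32 values.
        let body = bytes
            .get(pos..pos + d * 4)
            .ok_or("truncated vector body")?;
        data.extend(
            body.chunks_exact(4)
                .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]])),
        );
        pos += d * 4;
    }
    Ok((dim, data))
}
```

For sift_base.fvecs this would yield a dimension of 128 and one million rows. The .ivecs ground-truth file uses the same framing with i32 payloads, so the same loop applies with `i32::from_le_bytes` on the body.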