From b6440d6b17d677cbd41e084e37db4c12e1a3704e Mon Sep 17 00:00:00 2001
From: Ragnor Comerford
Date: Tue, 28 Apr 2026 23:48:28 +0200
Subject: [PATCH] =?UTF-8?q?Add=20docs/lance.md=20=E2=80=94=20task-organize?=
 =?UTF-8?q?d=20index=20of=20Lance=20upstream=20docs?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Curates the Lance documentation site (lance.org) into a problem-domain
index so agents fetch the right page when working on Lance-touching
code instead of guessing or grepping our codebase.

Organized by topic: storage format & file layout, branching/tags/time
travel, indexes (scalar + system + vector), reads/writes, schema
evolution, object store, data types, performance, compaction,
DataFusion integration, SDK reference, plus quick-starts and the
upstream AGENTS.md.

Skips ~200 irrelevant URLs from the upstream sitemap (Namespace REST
API model surface, Spark/Trino/Databricks/etc. integrations,
Python/Ray/HuggingFace docs, community pages) since omnigraph is
Rust-only and doesn't run a Lance Namespace catalog.

AGENTS.md surfaces it in the topic index and adds a directive: "when
you hit a Lance-shaped problem, consult docs/lance.md and fetch the
upstream URL before guessing."

Co-Authored-By: Claude Opus 4.7
---
 AGENTS.md     |   3 +
 docs/lance.md | 157 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 160 insertions(+)
 create mode 100644 docs/lance.md

diff --git a/AGENTS.md b/AGENTS.md
index c5e4766..31dbee9 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -5,6 +5,8 @@ This file is the always-on map for AI coding agents (Claude Code, Codex, Cursor,
 > **Required reading every session: [docs/invariants.md](docs/invariants.md).** Load this in full before proposing, reviewing, or implementing any change — the §IX deny-list and §X review checklist apply to every PR, not only architecture work. Tools that support `@`-imports (Claude Code) auto-include it via the import below; other agents must open it explicitly at the start of each session.
 >
 > @docs/invariants.md
+>
+> **When you hit a Lance-shaped problem** (file format, fragments, indexes, transactions, branches/tags, compaction, schema evolution, vector / FTS internals): consult [docs/lance.md](docs/lance.md) and fetch the listed upstream URL before grepping our code or guessing. Lance is the substrate; behavior is documented there, not here.
 
 `CLAUDE.md` is a symlink to this file — there is exactly one source of truth. Edit `AGENTS.md`.
 
@@ -60,6 +62,7 @@ Full diagram and concurrency model: [docs/architecture.md](docs/architecture.md)
 | Area | Read |
 |---|---|
 | **Architectural invariants & deny-list (read before any non-trivial proposal or review)** | **[docs/invariants.md](docs/invariants.md)** |
+| **Lance docs index — fetch upstream Lance docs by problem domain** | **[docs/lance.md](docs/lance.md)** |
 | Architecture, L1/L2 framing, concurrency model | [docs/architecture.md](docs/architecture.md) |
 | Storage layout, `__manifest` schema, URI schemes, S3 env vars | [docs/storage.md](docs/storage.md) |
 | `.pg` schema language, types, constraints, annotations, migration planning | [docs/schema-language.md](docs/schema-language.md) |
diff --git a/docs/lance.md b/docs/lance.md
new file mode 100644
index 0000000..700e105
--- /dev/null
+++ b/docs/lance.md
@@ -0,0 +1,157 @@
+# Lance Docs Index (for OmniGraph agents)
+
+OmniGraph sits on top of Lance. Many problems — index lifecycle, branching, transactions, fragments, compaction, vector/FTS internals — are answered upstream in Lance's docs, not in this repo.
+
+This file is the curated entry point. **When you hit a Lance-shaped problem, find the matching topic below and fetch the listed URL(s) before guessing.** Don't grep our codebase for behavior that is documented authoritatively in Lance.
+
+Base URL: `https://lance.org`. Use `WebFetch` (or your tool's equivalent) on the full URLs. Keep this index curated to relevant material — the upstream sitemap has hundreds of URLs (notably the Namespace REST API model surface, Spark/Trino/Databricks integrations) that we don't use.
+
+> **Substrate boundary check.** Before fetching, recall [docs/invariants.md §I](invariants.md): if Lance already does the thing, we don't reimplement it. The most common reason to read these docs is to confirm a substrate behavior, not to learn what to clone.
+
+## Quick-start (read these once per project)
+
+| Read when | URL |
+|---|---|
+| Onboarding to Lance — concepts in 10 min | https://lance.org/quickstart/ |
+| Onboarding to vector search | https://lance.org/quickstart/vector-search/ |
+| Onboarding to full-text search | https://lance.org/quickstart/full-text-search/ |
+| Onboarding to versioning / time travel | https://lance.org/quickstart/versioning/ |
+| Lance's own AGENTS.md (its agent guide) | https://lance.org/format/AGENTS/ |
+
+## By problem domain
+
+### Storage format & file layout
+
+Touching `db/manifest`, fragment lifecycle, dataset reconstruction, or anything that reads/writes raw Lance state.
+
+| Topic | URL |
+|---|---|
+| Lance file format overview | https://lance.org/format/ |
+| File-level format spec | https://lance.org/format/file/ |
+| File encoding | https://lance.org/format/file/encoding/ |
+| File-level versioning | https://lance.org/format/file/versioning/ |
+| Table layout (fragments, manifest) | https://lance.org/format/table/layout/ |
+| Table schema metadata | https://lance.org/format/table/schema/ |
+| Table-level versioning | https://lance.org/format/table/versioning/ |
+| Transactions (commit semantics, conflict types) | https://lance.org/format/table/transaction/ |
+| MemWAL (durability story) | https://lance.org/format/table/mem_wal/ |
+| Row-ID lineage (stable row IDs) | https://lance.org/format/table/row_id_lineage/ |
+| Branches & tags (Lance native) | https://lance.org/format/table/branch_tag/ |
+
+### Branching / tags / time travel
+
+Touching graph-level branches, snapshots, run isolation, the commit graph.
+
+| Topic | URL |
+|---|---|
+| Branch & tag format | https://lance.org/format/table/branch_tag/ |
+| Tags & branches operational guide | https://lance.org/guide/tags_and_branches/ |
+| Versioning quick-start | https://lance.org/quickstart/versioning/ |
+| Table-level versioning spec | https://lance.org/format/table/versioning/ |
+
+### Indexes
+
+Adding/changing index types, fixing coverage, debugging FTS or vector recall, designing the reconciler.
+
+| Topic | URL |
+|---|---|
+| Index spec overview | https://lance.org/format/table/index/ |
+| BTREE scalar index | https://lance.org/format/table/index/scalar/btree/ |
+| Bitmap scalar index | https://lance.org/format/table/index/scalar/bitmap/ |
+| Bloom-filter scalar index | https://lance.org/format/table/index/scalar/bloom_filter/ |
+| Label-list scalar index | https://lance.org/format/table/index/scalar/label_list/ |
+| Zone-map scalar index | https://lance.org/format/table/index/scalar/zonemap/ |
+| R-Tree scalar index (spatial) | https://lance.org/format/table/index/scalar/rtree/ |
+| Full-text search (FTS) index | https://lance.org/format/table/index/scalar/fts/ |
+| N-gram scalar index | https://lance.org/format/table/index/scalar/ngram/ |
+| Vector index | https://lance.org/format/table/index/vector/ |
+| Fragment-reuse system index | https://lance.org/format/table/index/system/frag_reuse/ |
+| MemWAL system index | https://lance.org/format/table/index/system/mem_wal/ |
+| HNSW Rust example | https://lance.org/examples/rust/hnsw/ |
+| Distributed indexing | https://lance.org/guide/distributed_indexing/ |
+| Tokenizer (FTS, n-gram) | https://lance.org/guide/tokenizer/ |
+
+### Reads & writes
+
+Touching the bulk loader, mutation execution, `merge_insert`, `WriteMode` selection.
+
+| Topic | URL |
+|---|---|
+| Read-and-write guide | https://lance.org/guide/read_and_write/ |
+| Distributed write | https://lance.org/guide/distributed_write/ |
+| Rust example: write & read a dataset | https://lance.org/examples/rust/write_read_dataset/ |
+
+### Schema evolution
+
+Touching `apply_schema`, the migration planner, additive evolution.
+
+| Topic | URL |
+|---|---|
+| Data-evolution guide | https://lance.org/guide/data_evolution/ |
+| Migration guide | https://lance.org/guide/migration/ |
+
+### Object store / S3
+
+Touching `storage.rs`, S3-compatible backends (RustFS, MinIO), env vars.
+
+| Topic | URL |
+|---|---|
+| Object-store guide | https://lance.org/guide/object_store/ |
+
+### Data types
+
+Touching schema-language scalar mappings, blob columns, JSON, list columns.
+
+| Topic | URL |
+|---|---|
+| Data types overview | https://lance.org/guide/data_types/ |
+| Arrays / list types | https://lance.org/guide/arrays/ |
+| Blobs (LargeBinary) | https://lance.org/guide/blob/ |
+| JSON | https://lance.org/guide/json/ |
+
+### Performance & tuning
+
+Optimizing scans, fragment counts, cache behavior, memory pool sizing.
+
+| Topic | URL |
+|---|---|
+| Performance guide | https://lance.org/guide/performance/ |
+
+### Compaction & cleanup
+
+Touching `omnigraph optimize` / `cleanup`, the underlying `compact_files` / `cleanup_old_versions`.
+
+| Topic | URL |
+|---|---|
+| Read-and-write guide (covers `compact_files`, `cleanup_old_versions`) | https://lance.org/guide/read_and_write/ |
+| Performance (compaction tradeoffs) | https://lance.org/guide/performance/ |
+| Fragment-reuse index | https://lance.org/format/table/index/system/frag_reuse/ |
+
+### DataFusion integration
+
+The runtime substrate that may carry our query execution. See [docs/invariants.md §I.4](invariants.md): we don't rebuild relational machinery.
+
+| Topic | URL |
+|---|---|
+| DataFusion integration | https://lance.org/integrations/datafusion/ |
+
+### SDK reference
+
+Looking up a specific Rust API (signature, return type, error variant).
+
+| Topic | URL |
+|---|---|
+| SDK docs landing | https://lance.org/sdk_docs/ |
+
+## What's not in this index (and why)
+
+- **Namespace REST API model surface** (`/format/namespace/client/operations/models/...`) — hundreds of REST schema docs for the Lance Namespace catalog API. Omnigraph does not run a Lance Namespace server, so these are not reachable from our problem space.
+- **Spark / Trino / Databricks / Dataproc / Hive / Glue / Polaris / Iceberg / Unity / OneLake / Gravitino integrations** — not part of OmniGraph's deployment surface.
+- **Python / TF / PyTorch / Hugging Face / Ray integrations** — OmniGraph is Rust-only; Python notebooks aren't relevant.
+- **Community / governance / release / voting / PMC pages** — meta, not technical.
+
+If a future need pulls one of these into scope, add a row to the matching domain section above and link it from `AGENTS.md`'s topic index.
+
+## Maintenance
+
+When Lance ships a major release that changes any of the above (file format bump, new index type, transaction semantics change, new branching primitive), refresh this index in the same change as the omnigraph upgrade. Stale Lance pointers are worse than no pointers.