diff --git a/docs/plans/0002-phase-2-postgres-backend.md b/docs/plans/0002-phase-2-postgres-backend.md index 3fe28f2..ed2a186 100644 --- a/docs/plans/0002-phase-2-postgres-backend.md +++ b/docs/plans/0002-phase-2-postgres-backend.md @@ -1,5 +1,13 @@ # Phase 2 Plan: PostgreSQL Backend +> **Supersession Notice (2026-05-26):** This master plan is now archival. Execution is governed by: +> - **ADR**: [`docs/adr/0002-phase-2-execution.md`](../adr/0002-phase-2-execution.md) -- binding decisions +> - **Sub-plans** (executable briefs): +> - Phase 1 amendment: [0001a-trait-rewrite.md](0001a-trait-rewrite.md), [0001b-sqlite-split.md](0001b-sqlite-split.md), [0001c-async-trait-sunset.md](0001c-async-trait-sunset.md) +> - Phase 2: 0002a..0002i (skeleton, pool+config, migrations, store impl, hybrid search, migrate CLI, reembed, tests+benches, runbook) +> +> **Deltas vs body**: trait uses `trait_variant`, error type is `MemoryStoreError`/`MemoryStoreResult`, `connect` is `(url, max_connections)` only, the core table is `knowledge_nodes` (not `memories`) and gains `owner_user_id` + `visibility` + `shared_with_groups` + `codebase`, plus `users`/`groups`/`group_memberships` tables. See ADR 0002 D1-D8. + **Status**: Draft **Depends on**: Phase 1 (MemoryStore + Embedder traits, embedding_model registry, domain columns) **Related**: docs/adr/0001-pluggable-storage-and-network-access.md (Phase 2), docs/prd/001-getting-centralized-vestige.md, docs/plans/local-dev-postgres-setup.md (local cluster provisioning) diff --git a/docs/plans/0002a-skeleton-and-feature-gate.md b/docs/plans/0002a-skeleton-and-feature-gate.md new file mode 100644 index 0000000..74032dc --- /dev/null +++ b/docs/plans/0002a-skeleton-and-feature-gate.md @@ -0,0 +1,554 @@ +# Phase 2 Sub-Plan 0002a -- Skeleton and Feature Gate + +**Status**: Ready +**Depends on**: Phase 1 amendment (sub-plans `0001a-trait-rewrite.md` and +`0001b-sqlite-split.md`) merged. Specifically: +- `MemoryStore` trait declared with `#[trait_variant::make(MemoryStore: Send)]`, + generating a non-Send `LocalMemoryStore` companion trait. The + `pub use MemoryStore as LocalMemoryStore` alias from Phase 1 is gone. +- `crates/vestige-core/src/storage/sqlite.rs` has been split into + `crates/vestige-core/src/storage/sqlite/` with the same public surface. + +This sub-plan covers Phase 2 master-plan deliverables D1 and D2 only: +the `postgres-backend` Cargo feature gate and a compilable `PgMemoryStore` +skeleton whose trait method bodies are `todo!()`. No real Postgres code, no +migrations, no SQL. Later sub-plans (`0002b-pool-and-config.md`, +`0002c-migrations.md`, `0002d-store-impl-bodies.md`, ...) fill the bodies in. + +The success criterion is a clean build under both feature-flag configurations, +nothing more. + +--- + +## Context + +ADR 0002 D4 commits Phase 2 to a `crates/vestige-core/src/storage/postgres/` +directory from day one. The seven other files in that directory +(`pool.rs`, `migrations.rs`, `registry.rs`, `search.rs`, `migrate_cli.rs`, +`reembed.rs`) belong to subsequent sub-plans. This sub-plan creates only +`crates/vestige-core/src/storage/postgres/mod.rs` so the rest can be added +incrementally without breaking the build. + +Per ADR 0002 D2, `PgMemoryStore::connect` mirrors `SqliteMemoryStore::new`: +no `Embedder` argument. The pgvector typmod DDL +(`ALTER TABLE memories ALTER COLUMN embedding TYPE vector($N)`) lives inside +the trait method `register_model`, invoked by the caller after construction. +In this sub-plan `register_model` is a `todo!()` body; `0002c-migrations.md` +and `0002d-store-impl-bodies.md` provide the real implementation. + +The trait surface in `crates/vestige-core/src/storage/memory_store.rs` is the +source of truth for method signatures. Do NOT copy signatures from the master +plan -- they are stale in places (for example, master plan 0002 D2 lists +`remove_edge` as three-arg `(source, target, edge_type)`; the live trait has +two args `(source, target)`). + +--- + +## Cargo manifest changes + +Two optional crates and one new feature flag. Use `cargo add` per the global +CLAUDE.md preference; do not hand-edit `Cargo.toml`. + +```bash +cd crates/vestige-core + +cargo add sqlx@0.8 --optional --no-default-features \ + --features runtime-tokio,tls-rustls,postgres,uuid,chrono,json,migrate,macros + +cargo add pgvector@0.4 --optional --features sqlx +``` + +After both commands, open `crates/vestige-core/Cargo.toml` and add the +`postgres-backend` feature line in the `[features]` block. Place it after +the `metal` feature, before `[dependencies]`: + +```toml +# Postgres backend (mutually compilable with the SQLite backend; default OFF). +# Compile with: --features postgres-backend +postgres-backend = ["dep:sqlx", "dep:pgvector"] +``` + +Do NOT add `tokio-stream`, `futures`, `indicatif`, or `toml` in this sub-plan. +The master plan D1 lists them in the `postgres-backend` feature for +convenience, but their consumers (streaming migrate, progress bar, config +parsing) land in later sub-plans. Adding them here pulls dead weight into the +feature gate. + +Do NOT add the `vestige-mcp` pass-through feature in this sub-plan either. +The MCP crate gets its `postgres-backend` feature in `0002b-pool-and-config.md` +when `MemoryStoreConfig` lands and the binary needs a knob to pick a backend. + +Verify the diff to `crates/vestige-core/Cargo.toml` looks like this and only +this: + +```toml +[features] +# ...existing features unchanged... +postgres-backend = ["dep:sqlx", "dep:pgvector"] + +[dependencies] +# ...existing deps unchanged... +sqlx = { version = "0.8", default-features = false, features = [ + "runtime-tokio", "tls-rustls", "postgres", "uuid", "chrono", + "json", "migrate", "macros", +], optional = true } +pgvector = { version = "0.4", features = ["sqlx"], optional = true } +``` + +The exact ordering of the two new lines inside `[dependencies]` is not +significant; `cargo add` places them at the end. Leave the placement that +`cargo add` produces. + +--- + +## Storage module export + +Edit `crates/vestige-core/src/storage/mod.rs` to expose the new module behind +the feature flag. Two lines change. + +Add to the module-declaration block (after `mod sqlite;`): + +```rust +#[cfg(feature = "postgres-backend")] +mod postgres; +``` + +Add to the re-export block (after the `pub use sqlite::{ ... }` block): + +```rust +#[cfg(feature = "postgres-backend")] +pub use postgres::PgMemoryStore; +``` + +Nothing else in `storage/mod.rs` changes. The `Storage` alias still points at +`SqliteMemoryStore`; the SQLite re-export block is untouched. + +--- + +## Postgres module skeleton + +Create `crates/vestige-core/src/storage/postgres/mod.rs` with the full content +below. This is the only new file in this sub-plan. + +```rust +#![cfg(feature = "postgres-backend")] +//! Postgres-backed implementation of `MemoryStore`. +//! +//! Skeleton only. Every trait method is `todo!()`. Real bodies land in +//! subsequent Phase 2 sub-plans: +//! - `0002b-pool-and-config.md`: pool construction and config wiring +//! - `0002c-migrations.md`: sqlx migration files and `init`/`register_model` +//! - `0002d-store-impl-bodies.md`: CRUD, scheduling, edges, domains +//! - `0002e-hybrid-search.md`: RRF query and search bodies +//! +//! The directory grows companion files (`pool.rs`, `migrations.rs`, +//! `registry.rs`, `search.rs`, `migrate_cli.rs`, `reembed.rs`) in those +//! sub-plans; this skeleton only creates `mod.rs`. + +use chrono::{DateTime, Utc}; +use sqlx::PgPool; +use uuid::Uuid; + +use crate::storage::memory_store::{ + Domain, HealthStatus, LocalMemoryStore, MemoryEdge, MemoryRecord, MemoryStoreResult, + ModelSignature, SchedulingState, SearchQuery, SearchResult, StoreStats, +}; + +/// Postgres-backed implementation of `MemoryStore`. +/// +/// Cheaply cloneable. Methods take `&self`; interior state lives inside the +/// `PgPool` (which already provides `Sync` via `Arc` internally). +#[derive(Clone)] +pub struct PgMemoryStore { + pool: PgPool, + /// Embedding vector dimension. Set to 0 in the skeleton; populated by + /// `register_model` in `0002d-store-impl-bodies.md` once the pgvector + /// `ALTER COLUMN TYPE vector(N)` DDL lands. + embedding_dim: i32, +} + +impl PgMemoryStore { + /// Construct a new store from a connection URL. + /// + /// Mirrors `SqliteMemoryStore::new`: no `Embedder` argument. The pgvector + /// `ALTER TABLE memories ALTER COLUMN embedding TYPE vector($N)` DDL lives + /// inside `register_model`, not here. The caller (cognitive engine + /// bootstrap, migrate CLI, tests) invokes `register_model` after this + /// returns, identically to the SQLite path. + /// + /// Real body lands in `0002b-pool-and-config.md` (pool construction) and + /// `0002c-migrations.md` (initial migration run). + pub async fn connect(_url: &str, _max_connections: u32) -> MemoryStoreResult { + todo!("PgMemoryStore::connect lands in 0002b-pool-and-config.md") + } + + /// Low-level constructor for tests: supply an existing pool, skip migrate. + /// + /// Real body lands in `0002b-pool-and-config.md`. + pub async fn from_pool(_pool: PgPool) -> MemoryStoreResult { + todo!("PgMemoryStore::from_pool lands in 0002b-pool-and-config.md") + } + + /// Accessor used by migrate/reembed CLI. + pub fn pool(&self) -> &PgPool { + &self.pool + } + + /// Currently-registered vector dimension. Returns 0 until `register_model` + /// has been called (real body: `0002d-store-impl-bodies.md`). + pub fn embedding_dim(&self) -> i32 { + self.embedding_dim + } +} + +// trait_variant::make on the trait declaration generates two traits: +// - `MemoryStore` (Send-bound) +// - `LocalMemoryStore` (non-Send companion) +// The implementer writes `impl LocalMemoryStore for ...` plainly; the Send +// variant is generated automatically from the non-Send impl. +impl LocalMemoryStore for PgMemoryStore { + // --- Lifecycle --- + async fn init(&self) -> MemoryStoreResult<()> { + todo!("PgMemoryStore::init lands in 0002c-migrations.md") + } + + async fn health_check(&self) -> MemoryStoreResult { + todo!("PgMemoryStore::health_check lands in 0002d-store-impl-bodies.md") + } + + // --- Embedding model registry --- + async fn registered_model(&self) -> MemoryStoreResult> { + todo!("PgMemoryStore::registered_model lands in 0002d-store-impl-bodies.md") + } + + async fn register_model(&self, _sig: &ModelSignature) -> MemoryStoreResult<()> { + todo!("PgMemoryStore::register_model lands in 0002d-store-impl-bodies.md") + } + + // --- CRUD --- + async fn insert(&self, _record: &MemoryRecord) -> MemoryStoreResult { + todo!("PgMemoryStore::insert lands in 0002d-store-impl-bodies.md") + } + + async fn get(&self, _id: Uuid) -> MemoryStoreResult> { + todo!("PgMemoryStore::get lands in 0002d-store-impl-bodies.md") + } + + async fn update(&self, _record: &MemoryRecord) -> MemoryStoreResult<()> { + todo!("PgMemoryStore::update lands in 0002d-store-impl-bodies.md") + } + + async fn delete(&self, _id: Uuid) -> MemoryStoreResult<()> { + todo!("PgMemoryStore::delete lands in 0002d-store-impl-bodies.md") + } + + // --- Search --- + async fn search(&self, _query: &SearchQuery) -> MemoryStoreResult> { + todo!("PgMemoryStore::search lands in 0002e-hybrid-search.md") + } + + async fn fts_search( + &self, + _text: &str, + _limit: usize, + ) -> MemoryStoreResult> { + todo!("PgMemoryStore::fts_search lands in 0002e-hybrid-search.md") + } + + async fn vector_search( + &self, + _embedding: &[f32], + _limit: usize, + ) -> MemoryStoreResult> { + todo!("PgMemoryStore::vector_search lands in 0002e-hybrid-search.md") + } + + // --- FSRS Scheduling --- + async fn get_scheduling( + &self, + _memory_id: Uuid, + ) -> MemoryStoreResult> { + todo!("PgMemoryStore::get_scheduling lands in 0002d-store-impl-bodies.md") + } + + async fn update_scheduling(&self, _state: &SchedulingState) -> MemoryStoreResult<()> { + todo!("PgMemoryStore::update_scheduling lands in 0002d-store-impl-bodies.md") + } + + async fn get_due_memories( + &self, + _before: DateTime, + _limit: usize, + ) -> MemoryStoreResult> { + todo!("PgMemoryStore::get_due_memories lands in 0002d-store-impl-bodies.md") + } + + // --- Graph (spreading activation) --- + async fn add_edge(&self, _edge: &MemoryEdge) -> MemoryStoreResult<()> { + todo!("PgMemoryStore::add_edge lands in 0002d-store-impl-bodies.md") + } + + async fn get_edges( + &self, + _node_id: Uuid, + _edge_type: Option<&str>, + ) -> MemoryStoreResult> { + todo!("PgMemoryStore::get_edges lands in 0002d-store-impl-bodies.md") + } + + async fn remove_edge(&self, _source: Uuid, _target: Uuid) -> MemoryStoreResult<()> { + todo!("PgMemoryStore::remove_edge lands in 0002d-store-impl-bodies.md") + } + + async fn get_neighbors( + &self, + _node_id: Uuid, + _depth: usize, + ) -> MemoryStoreResult> { + todo!("PgMemoryStore::get_neighbors lands in 0002d-store-impl-bodies.md") + } + + // --- Domains (Phase 1: stubs return empty; full impl in Phase 4) --- + async fn list_domains(&self) -> MemoryStoreResult> { + todo!("PgMemoryStore::list_domains lands in 0002d-store-impl-bodies.md") + } + + async fn get_domain(&self, _id: &str) -> MemoryStoreResult> { + todo!("PgMemoryStore::get_domain lands in 0002d-store-impl-bodies.md") + } + + async fn upsert_domain(&self, _domain: &Domain) -> MemoryStoreResult<()> { + todo!("PgMemoryStore::upsert_domain lands in 0002d-store-impl-bodies.md") + } + + async fn delete_domain(&self, _id: &str) -> MemoryStoreResult<()> { + todo!("PgMemoryStore::delete_domain lands in 0002d-store-impl-bodies.md") + } + + async fn classify(&self, _embedding: &[f32]) -> MemoryStoreResult> { + todo!("PgMemoryStore::classify lands in 0002d-store-impl-bodies.md") + } + + // --- Bulk / Maintenance --- + async fn count(&self) -> MemoryStoreResult { + todo!("PgMemoryStore::count lands in 0002d-store-impl-bodies.md") + } + + async fn get_stats(&self) -> MemoryStoreResult { + todo!("PgMemoryStore::get_stats lands in 0002d-store-impl-bodies.md") + } + + async fn vacuum(&self) -> MemoryStoreResult<()> { + todo!("PgMemoryStore::vacuum lands in 0002d-store-impl-bodies.md") + } +} +``` + +Notes on the skeleton: + +- The file-level `#![cfg(feature = "postgres-backend")]` means the whole file + vanishes when the feature is off. The `mod postgres;` line in + `storage/mod.rs` is itself feature-gated, so this is belt-and-braces; both + gates are needed because the file-level attribute is what allows the file to + use `sqlx::PgPool` unconditionally inside it. +- `EmbeddingModelDescriptor` (a separate Postgres-internal type that the + master plan sketched on the struct) is dropped. The trait surface already + carries `ModelSignature` for the registry round-trip; the registry storage + layout is a private concern of `registry.rs`, which is added later. Keep + `PgMemoryStore` minimal until a real consumer needs the extra type. +- The struct only carries `pool` and `embedding_dim`. The model descriptor + field from the master plan sketch goes away with `EmbeddingModelDescriptor`. + If `register_model` later needs to cache the descriptor on the struct, it + can be added then; the skeleton does not speculate. +- The two trivial accessors (`pool`, `embedding_dim`) get real bodies. Every + other method is `todo!()` so it returns `!` and trivially coerces to the + declared return type at the type checker; this is what lets the build pass + with no error variants and no SQL. + +--- + +## Connect signature + +Per ADR 0002 D2: + +```rust +pub async fn connect(url: &str, max_connections: u32) -> MemoryStoreResult; +pub async fn from_pool(pool: PgPool) -> MemoryStoreResult; +``` + +No `&dyn Embedder` argument. This deliberately differs from master plan 0002, +which predates the Phase 1 freeze. The pgvector-specific DDL +(`ALTER TABLE memories ALTER COLUMN embedding TYPE vector($N)`) does not run +inside `connect`; it runs inside `register_model(&ModelSignature)`, which the +caller invokes after `connect` returns. + +In this sub-plan `register_model` is `todo!()`. The real body lands in +`0002d-store-impl-bodies.md` after `0002c-migrations.md` ships the +`0001_init.up.sql` migration that creates the `memories` table with a +placeholder `embedding vector` column (no typmod), against which +`register_model` later runs the typmod stamp. + +--- + +## Error variant additions: deferred + +`MemoryStoreError` does NOT gain `Postgres(sqlx::Error)` or +`Migrate(sqlx::migrate::MigrateError)` in this sub-plan. + +The reason is mechanical: `todo!()` evaluates to the never type `!`, which +coerces to any `MemoryStoreResult` regardless of the error variants +available. With every method body a `todo!()`, the skeleton has no expression +that needs to convert a `sqlx::Error` or `sqlx::migrate::MigrateError` into +`MemoryStoreError`. Adding the variants here would mean adding the +`#[cfg(feature = "postgres-backend")]` and `#[from]` plumbing to +`memory_store.rs` with no consumer yet -- dead code at every level except the +enum definition itself. + +`0002d-store-impl-bodies.md` introduces both variants in the same commit that +turns the first `todo!()` into a real `sqlx::query!` call. That keeps the +diff to `memory_store.rs` next to the first usage site, which is easier to +review than adding variants ahead of need. + +For reference, the variants that will be added in `0002d-store-impl-bodies.md` +look like this: + +```rust +#[cfg(feature = "postgres-backend")] +#[error("postgres error: {0}")] +Postgres(#[from] sqlx::Error), + +#[cfg(feature = "postgres-backend")] +#[error("postgres migration error: {0}")] +Migrate(#[from] sqlx::migrate::MigrateError), +``` + +Do not pre-add them here. + +--- + +## Verification + +Run these commands from the workspace root. All four must produce a clean +build, zero warnings on the diff-affected files, no test changes. + +```bash +# 1. Default features (SQLite backend, postgres-backend OFF). Must build. +cargo build --workspace --all-targets + +# 2. Workspace clippy with default features. Must be clean. +cargo clippy --workspace --all-targets -- -D warnings + +# 3. Postgres feature enabled. Must build. +cargo build -p vestige-core --features postgres-backend + +# 4. Clippy with postgres feature enabled. Must be clean. +cargo clippy -p vestige-core --features postgres-backend --all-targets -- -D warnings +``` + +Expected outcomes: + +- `cargo build --workspace --all-targets` finishes with no compilation of + `sqlx` or `pgvector` (both are optional, no consumer with default features). + The `postgres` module is excluded entirely via `#[cfg]`. +- `cargo build -p vestige-core --features postgres-backend` compiles `sqlx`, + `pgvector`, and `storage/postgres/mod.rs`. The build succeeds because every + trait method is `todo!()`; nothing actually runs SQL. +- Both `clippy` invocations pass with `-D warnings`. The `todo!()` macro does + not emit a `dead_code` lint by itself, and the trivial accessors are used by + later sub-plans (clippy on the postgres feature alone may flag them as + unused if you run with `--lib` only; the `--all-targets` form keeps tests + and benches in scope so this does not fire). +- If clippy flags `unused_variables` on the underscore-prefixed parameters in + the `todo!()` bodies, the underscore prefix is already the standard + suppression; if a future clippy version disagrees, add + `#[allow(unused_variables)]` to the impl block, not to each method. + +Tests are not modified in this sub-plan. The unit tests in +`memory_store.rs` (`memory_store_error_from_storage_error`, +`model_signature_serde_round_trip`, `memory_record_serde_round_trip`) keep +passing because no type they touch changes. + +Do NOT run `cargo test` against the postgres feature -- there is no Postgres +running and no test exercises `PgMemoryStore` yet. The build check is the +contract. + +--- + +## Acceptance criteria + +1. `crates/vestige-core/Cargo.toml` declares `sqlx = "0.8"` and + `pgvector = "0.4"` as optional dependencies with the exact feature sets + specified above. +2. `crates/vestige-core/Cargo.toml` declares `postgres-backend = ["dep:sqlx", + "dep:pgvector"]` and nothing else inside that feature. +3. `crates/vestige-mcp/Cargo.toml` is unchanged. +4. `crates/vestige-core/src/storage/mod.rs` adds exactly two + feature-gated lines: `mod postgres;` and `pub use postgres::PgMemoryStore;`. + No other change. +5. `crates/vestige-core/src/storage/postgres/mod.rs` exists and contains the + `PgMemoryStore` struct, `impl PgMemoryStore` block with real `pool` and + `embedding_dim` accessors and `todo!()` bodies for `connect` and + `from_pool`, and the full `impl LocalMemoryStore for PgMemoryStore` block + with `todo!()` for every trait method. +6. The trait impl method signatures match `memory_store.rs` byte-for-byte + (including `remove_edge(&self, source: Uuid, target: Uuid)` two-arg form, + not the three-arg form from the master plan). +7. `MemoryStoreError` is unchanged. +8. No other files in the crate are touched. No new files in + `storage/postgres/` besides `mod.rs`. +9. The four verification commands above all succeed. + +--- + +## Commit sequence + +One commit is recommended. The two changes (Cargo manifest + module skeleton) +do not compile in isolation: the manifest change without the skeleton produces +unused-optional-dep warnings, and the skeleton without the manifest change +fails to find `sqlx`. Splitting them adds no review value, since the second +commit is the one that has to compile cleanly. + +```bash +git add crates/vestige-core/Cargo.toml \ + crates/vestige-core/Cargo.lock \ + crates/vestige-core/src/storage/mod.rs \ + crates/vestige-core/src/storage/postgres/mod.rs + +git commit -m "feat(storage): scaffold postgres-backend feature and PgMemoryStore skeleton + +Adds the postgres-backend Cargo feature gating sqlx 0.8 and pgvector 0.4. +Introduces crates/vestige-core/src/storage/postgres/mod.rs with the +PgMemoryStore struct, connect/from_pool/pool/embedding_dim, and a trait impl +whose method bodies are todo!() pending later Phase 2 sub-plans. + +Builds clean with default features (SQLite only) and with --features +postgres-backend. No runtime behaviour change. + +Refs ADR 0002 D1, D2, D4." +``` + +If for any reason the manifest change must be reviewed separately (for +example, a security review of the sqlx version pin), split as: + +1. `cargo add` for sqlx and pgvector + manual feature line in Cargo.toml. + Build with default features will pass but `--features postgres-backend` + will fail (no module to satisfy the feature). This is acceptable for a + short-lived intermediate commit. +2. `storage/mod.rs` edits + `storage/postgres/mod.rs` creation. Both builds + pass. + +Default to the single-commit form unless asked to split. + +--- + +## Followups + +- `0002b-pool-and-config.md` adds `pool.rs`, `PostgresConfig`, and the + `vestige-mcp` `postgres-backend` pass-through feature. +- `0002c-migrations.md` adds `crates/vestige-core/migrations/postgres/` with + `0001_init.{up,down}.sql` and `0002_hnsw.{up,down}.sql`, plus + `postgres/migrations.rs` invoking `sqlx::migrate!`. `init()` body lands here. +- `0002d-store-impl-bodies.md` introduces the two `MemoryStoreError` variants + and replaces every `todo!()` in CRUD / scheduling / edges / domains / + registry with real `sqlx::query!` bodies. +- `0002e-hybrid-search.md` fills the three search bodies via the RRF query. diff --git a/docs/plans/0002b-pool-and-config.md b/docs/plans/0002b-pool-and-config.md new file mode 100644 index 0000000..9941413 --- /dev/null +++ b/docs/plans/0002b-pool-and-config.md @@ -0,0 +1,886 @@ +# Sub-plan 0002b -- Pool construction and VestigeConfig + +**Status**: Draft +**Master plan**: [0002-phase-2-postgres-backend.md](0002-phase-2-postgres-backend.md) +**ADR**: [0002-phase-2-execution.md](../adr/0002-phase-2-execution.md) +**Predecessor**: [0002a-skeleton-and-feature-gate.md](0002a-skeleton-and-feature-gate.md) + +--- + +## Context + +This sub-plan delivers two of the master plan's deliverables now that the +`0002a` skeleton has landed: + +- **D3** -- pool construction in + `crates/vestige-core/src/storage/postgres/pool.rs`. Replaces the `todo!()` + body of `PgMemoryStore::connect` with a real `PgPool` builder that reads a + `PostgresConfig`. Registry/migration calls remain `todo!()`; those are + filled in by sub-plans `0002c` (migrations) and `0002d` (store bodies + + registry). +- **D7** -- new module `crates/vestige-core/src/config.rs` containing + `VestigeConfig`, `StorageConfig`, `SqliteConfig`, `PostgresConfig`, + `EmbeddingsConfig`, plus a `ConfigError` enum and a loader that reads + `vestige.toml`. The loader is wired into `vestige-mcp` so the running + server picks SQLite or Postgres at startup based on the config file. + +After this sub-plan: + +- `cargo build` (default features, no `postgres-backend`) compiles and the + MCP server still defaults to SQLite when no `vestige.toml` is present. +- `cargo build --features postgres-backend` compiles, with + `PgMemoryStore::connect` now wiring through `pool.rs` (registry/migration + still `todo!()` until `0002c` and `0002d`). +- A `vestige.toml` example can be round-tripped through + `VestigeConfig::load` in a unit test. + +This sub-plan deliberately does NOT: + +- Add migrations (`0002c`). +- Fill in real CRUD/search bodies on `PgMemoryStore` (`0002d`, `0002e`). +- Add env-var override support (Phase 3 concern, called out in master plan + D7 behaviour notes). + +--- + +## Dependencies + +- `0002a-skeleton-and-feature-gate.md` must be merged. That sub-plan creates + `crates/vestige-core/src/storage/postgres/mod.rs` with: + - `PgMemoryStore` struct holding `pool: PgPool`. + - `PgMemoryStore::connect(url: &str, max_connections: u32) -> MemoryStoreResult` + body = `todo!()`. + - `PgMemoryStore::from_pool(pool: PgPool) -> MemoryStoreResult` + body = `todo!()`. + - The trait impl block with all methods routed to `todo!()`. + - The `postgres-backend` feature gate on the module declaration in + `storage/mod.rs`. + +This sub-plan extends those bodies and adds two siblings: `pool.rs` and +`registry.rs` (the latter is a stub here, real body in `0002d`). + +--- + +## Audit step (do this first) + +Before adding `config.rs`, confirm there is no existing top-level config +loader. Run from the repo root: + +```bash +rg -nF 'VestigeConfig' crates/ +rg -nF 'toml::from_str' crates/ +rg -n '#\[derive.*Deserialize.*\]' crates/vestige-core/src/ +``` + +If a `VestigeConfig` struct already exists from Phase 1, treat the "Config +module" section below as additive: extend the existing struct rather than +creating a new file. The cross-cut additions in that case are: + +1. Add the `StorageConfig` enum (gated and ungated branches). +2. Add `SqliteConfig`, `PostgresConfig`. +3. Add the `default_path()` helper if missing. +4. Add `ConfigError` if a different error enum is used today (rename/extend + instead of duplicating). + +As of the audit at the time of this writing, no `VestigeConfig` exists in +`vestige-core`. `directories::ProjectDirs` is already used in +`vestige-core/src/embeddings/local.rs` and in +`vestige-mcp/src/protocol/auth.rs`, so the `directories` crate is already a +workspace dependency -- no new dep there. + +--- + +## Cargo manifest additions + +Add `toml` to `vestige-core`. `serde` and `thiserror` are already present +from Phase 1; `directories` is already a transitive dep but we add it +explicitly so `default_path()` is supported. + +```bash +cd crates/vestige-core +cargo add toml@0.8 +cargo add directories@5 +``` + +No new deps on `vestige-mcp`; it already depends on `vestige-core`. + +`sqlx` is already added by `0002a` behind the `postgres-backend` feature +with `runtime-tokio`, `tls-rustls`, `postgres`, `uuid`, `chrono`, +`json`, `macros`, `migrate` features. The pool module only uses what is +already pulled in. + +--- + +## Config module + +**File**: `crates/vestige-core/src/config.rs` (new). +**Re-exported** from `crates/vestige-core/src/lib.rs` as `pub mod config;` plus +`pub use config::{VestigeConfig, StorageConfig, SqliteConfig, EmbeddingsConfig, ConfigError};` +and `#[cfg(feature = "postgres-backend")] pub use config::PostgresConfig;`. + +Full content: + +```rust +//! Vestige top-level configuration. +//! +//! Loaded from `~/.vestige/vestige.toml` by default; the path is overridable +//! via `VestigeConfig::load(Some(&path))`. Parsing uses serde + toml; the +//! `[storage]` section is internally-tagged on a `backend` field so a single +//! enum dispatch picks SQLite or Postgres. + +use std::path::{Path, PathBuf}; + +use serde::Deserialize; + +/// Top-level configuration as parsed from `vestige.toml`. +#[derive(Debug, Clone, Deserialize, Default)] +#[serde(default, deny_unknown_fields)] +pub struct VestigeConfig { + pub embeddings: EmbeddingsConfig, + pub storage: StorageConfig, + /// Reserved for Phase 3. Empty in Phase 2. + pub server: ServerConfig, + /// Reserved for Phase 3. Empty in Phase 2. + pub auth: AuthConfig, +} + +/// Embedding provider selection. +#[derive(Debug, Clone, Deserialize)] +#[serde(deny_unknown_fields)] +pub struct EmbeddingsConfig { + /// Provider key. Phase 2 ships `"fastembed"` only. + pub provider: String, + /// Model name. For fastembed this is e.g. `"nomic-ai/nomic-embed-text-v1.5"`. + pub model: String, +} + +impl Default for EmbeddingsConfig { + fn default() -> Self { + Self { + provider: "fastembed".to_string(), + model: crate::DEFAULT_EMBEDDING_MODEL.to_string(), + } + } +} + +/// Storage backend selection. Internally tagged on the `backend` field: +/// +/// ```toml +/// [storage] +/// backend = "sqlite" +/// +/// [storage.sqlite] +/// path = "/home/user/.vestige/vestige.db" +/// ``` +/// +/// or, when compiled with `--features postgres-backend`: +/// +/// ```toml +/// [storage] +/// backend = "postgres" +/// +/// [storage.postgres] +/// url = "postgres://vestige:secret@localhost:5432/vestige" +/// max_connections = 10 +/// acquire_timeout_secs = 30 +/// ``` +#[derive(Debug, Clone, Deserialize)] +#[serde(tag = "backend", rename_all = "lowercase", deny_unknown_fields)] +pub enum StorageConfig { + Sqlite(SqliteConfig), + #[cfg(feature = "postgres-backend")] + Postgres(PostgresConfig), +} + +impl Default for StorageConfig { + fn default() -> Self { + StorageConfig::Sqlite(SqliteConfig::default()) + } +} + +/// SQLite backend configuration. +#[derive(Debug, Clone, Deserialize)] +#[serde(deny_unknown_fields)] +pub struct SqliteConfig { + /// Path to the `vestige.db` file. If unset, the SqliteMemoryStore + /// constructor picks its platform default location. + #[serde(default)] + pub path: Option, +} + +impl Default for SqliteConfig { + fn default() -> Self { + Self { path: None } + } +} + +/// Postgres backend configuration. Only present when the `postgres-backend` +/// Cargo feature is enabled. +#[cfg(feature = "postgres-backend")] +#[derive(Debug, Clone, Deserialize)] +#[serde(deny_unknown_fields)] +pub struct PostgresConfig { + /// `postgres://user:pass@host:port/db` -- forwarded to + /// `PgConnectOptions::from_str`. + pub url: String, + /// Pool size. Default `10`. + #[serde(default)] + pub max_connections: Option, + /// Acquire timeout in seconds. Default `30`. Set above 30 so + /// testcontainer-based test fixtures do not race. + #[serde(default)] + pub acquire_timeout_secs: Option, +} + +/// Reserved for Phase 3 (bind address, ports, TLS). +#[derive(Debug, Clone, Default, Deserialize)] +#[serde(default, deny_unknown_fields)] +pub struct ServerConfig {} + +/// Reserved for Phase 3 (API keys, claims). +#[derive(Debug, Clone, Default, Deserialize)] +#[serde(default, deny_unknown_fields)] +pub struct AuthConfig {} + +/// Errors raised while locating, reading, or parsing `vestige.toml`. +#[derive(Debug, thiserror::Error)] +pub enum ConfigError { + #[error("config io: {0}")] + Io(#[from] std::io::Error), + #[error("config toml: {0}")] + Toml(#[from] toml::de::Error), + #[error("config dir: could not locate user home")] + NoHome, + #[error("invalid config: {0}")] + Invalid(String), +} + +impl VestigeConfig { + /// Load config from `path` or from `default_path()` when `None`. + /// + /// Returns `VestigeConfig::default()` (SQLite + fastembed defaults) when + /// the file does not exist. Any other I/O or parse failure is surfaced + /// as a `ConfigError`. + pub fn load(path: Option<&Path>) -> Result { + let resolved: PathBuf = match path { + Some(p) => p.to_path_buf(), + None => Self::default_path()?, + }; + + match std::fs::read_to_string(&resolved) { + Ok(text) => { + let cfg: VestigeConfig = toml::from_str(&text)?; + cfg.validate()?; + Ok(cfg) + } + Err(e) if e.kind() == std::io::ErrorKind::NotFound => { + Ok(Self::default()) + } + Err(e) => Err(ConfigError::Io(e)), + } + } + + /// `~/.vestige/vestige.toml`. The directory is NOT created here; loading + /// a missing file falls back to defaults. + pub fn default_path() -> Result { + let dirs = directories::ProjectDirs::from("", "vestige", "vestige") + .ok_or(ConfigError::NoHome)?; + // ProjectDirs::config_dir() varies per OS. Vestige convention is + // ~/.vestige/vestige.toml on Linux/macOS regardless of XDG, so we + // build the path off the home dir explicitly. + let home = directories::UserDirs::new() + .ok_or(ConfigError::NoHome)? + .home_dir() + .to_path_buf(); + let _ = dirs; // keep the dep wired; future Phase 3 may use it + Ok(home.join(".vestige").join("vestige.toml")) + } + + /// Light cross-field validation. Heavy validation (URL parsing, + /// directory existence) is left to the backend constructors. + fn validate(&self) -> Result<(), ConfigError> { + if self.embeddings.provider.is_empty() { + return Err(ConfigError::Invalid( + "embeddings.provider must not be empty".into(), + )); + } + if self.embeddings.model.is_empty() { + return Err(ConfigError::Invalid( + "embeddings.model must not be empty".into(), + )); + } + match &self.storage { + StorageConfig::Sqlite(_) => {} + #[cfg(feature = "postgres-backend")] + StorageConfig::Postgres(cfg) => { + if cfg.url.is_empty() { + return Err(ConfigError::Invalid( + "storage.postgres.url must not be empty".into(), + )); + } + } + } + Ok(()) + } +} +``` + +### Serde behaviour with `postgres-backend` off + +`StorageConfig` is generated by serde only for the variants that are +compiled in. When `postgres-backend` is off and the user writes: + +```toml +[storage] +backend = "postgres" + +[storage.postgres] +url = "..." +``` + +serde returns a `toml::de::Error` of the form +`unknown variant `postgres`, expected `sqlite``. That error path goes +through `From for ConfigError`, surfacing as +`ConfigError::Toml(..)`. The MCP server prints this once at startup and +exits with a non-zero code; there is no panic. + +To make the error friendlier we wrap that specific case in a clearer +message via a thin post-parse check. Add this small helper after parsing +in `load()`: + +```rust +// (Inside the Ok(text) arm in load(), wrapping the parse step.) +let cfg: VestigeConfig = match toml::from_str(&text) { + Ok(c) => c, + Err(e) => { + let msg = e.to_string(); + if msg.contains("unknown variant `postgres`") { + return Err(ConfigError::Invalid( + "storage.backend = \"postgres\" requires building with --features postgres-backend".into(), + )); + } + return Err(ConfigError::Toml(e)); + } +}; +``` + +This keeps the strict default deny_unknown_fields behaviour while giving the +user a one-line action item. + +--- + +## Pool module + +**File**: `crates/vestige-core/src/storage/postgres/pool.rs` (new). + +```rust +#![cfg(feature = "postgres-backend")] + +//! `PgPool` construction for the Postgres backend. +//! +//! Pool defaults follow ADR 0002 D2 + master plan D3: +//! - max_connections = 10 +//! - acquire_timeout = 30s (must exceed testcontainer warmup) +//! - idle_timeout = 600s +//! - max_lifetime = 1800s +//! - test_before_acquire = false (cheap queries; saves a roundtrip) + +use std::str::FromStr; +use std::time::Duration; + +use sqlx::postgres::{PgConnectOptions, PgPoolOptions}; +use sqlx::{ConnectOptions, PgPool}; + +use crate::config::PostgresConfig; +use crate::storage::memory_store::{MemoryStoreError, MemoryStoreResult}; + +const DEFAULT_MAX_CONNECTIONS: u32 = 10; +const DEFAULT_ACQUIRE_TIMEOUT_SECS: u64 = 30; +const IDLE_TIMEOUT_SECS: u64 = 600; +const MAX_LIFETIME_SECS: u64 = 1800; +const STATEMENT_CACHE_CAPACITY: usize = 256; + +/// Build a Postgres connection pool from a `PostgresConfig`. Does NOT run +/// migrations or stamp the embedding registry; those are the caller's job +/// (`PgMemoryStore::connect`). +pub async fn build_pool(cfg: &PostgresConfig) -> MemoryStoreResult { + let opts = PgConnectOptions::from_str(&cfg.url) + .map_err(MemoryStoreError::from)? + .application_name("vestige") + .statement_cache_capacity(STATEMENT_CACHE_CAPACITY) + .log_statements(tracing::log::LevelFilter::Debug); + + let max_conn = cfg.max_connections.unwrap_or(DEFAULT_MAX_CONNECTIONS); + let acquire = cfg + .acquire_timeout_secs + .unwrap_or(DEFAULT_ACQUIRE_TIMEOUT_SECS); + + let pool = PgPoolOptions::new() + .max_connections(max_conn) + .min_connections(0) + .acquire_timeout(Duration::from_secs(acquire)) + .idle_timeout(Some(Duration::from_secs(IDLE_TIMEOUT_SECS))) + .max_lifetime(Some(Duration::from_secs(MAX_LIFETIME_SECS))) + .test_before_acquire(false) + .connect_with(opts) + .await + .map_err(MemoryStoreError::from)?; + + Ok(pool) +} +``` + +### Wiring into `PgMemoryStore::connect` + +In `crates/vestige-core/src/storage/postgres/mod.rs`, replace the +`todo!()` body left by `0002a` for `connect` and `from_pool` with: + +```rust +// In crates/vestige-core/src/storage/postgres/mod.rs + +use sqlx::PgPool; + +use crate::config::PostgresConfig; +use crate::storage::memory_store::{MemoryStoreError, MemoryStoreResult}; + +mod pool; +mod registry; // see "Registry stub" section below + +pub struct PgMemoryStore { + pool: PgPool, +} + +impl PgMemoryStore { + /// Convenience constructor matching `SqliteMemoryStore::new` shape. + /// Takes a URL + pool size for the common case. + pub async fn connect(url: &str, max_connections: u32) -> MemoryStoreResult { + let cfg = PostgresConfig { + url: url.to_string(), + max_connections: Some(max_connections), + acquire_timeout_secs: None, + }; + Self::connect_with(&cfg).await + } + + /// Full-config constructor. + pub async fn connect_with(cfg: &PostgresConfig) -> MemoryStoreResult { + let pool = pool::build_pool(cfg).await?; + Self::from_pool(pool).await + } + + /// Construct from an already-built pool (used by tests and the migrate + /// CLI to share a pool across operations). + pub async fn from_pool(pool: PgPool) -> MemoryStoreResult { + // Migrations are added by 0002c. + // todo!("run sqlx::migrate! once 0002c lands") + registry::ensure_registry_stub(&pool).await?; + Ok(Self { pool }) + } +} +``` + +`connect_with` is the long-lived API; `connect` becomes a thin shim that +stays compatible with the master-plan-mandated signature. + +### Registry stub + +**File**: `crates/vestige-core/src/storage/postgres/registry.rs` (new, stub). + +```rust +#![cfg(feature = "postgres-backend")] + +//! Embedding registry. Real body lands in sub-plan 0002d. + +use sqlx::PgPool; + +use crate::storage::memory_store::MemoryStoreResult; + +/// Placeholder. Real implementation in 0002d reads/writes `embedding_model` +/// and stamps `ALTER TABLE memories ALTER COLUMN embedding TYPE vector($N)`. +pub(crate) async fn ensure_registry_stub(_pool: &PgPool) -> MemoryStoreResult<()> { + // Intentionally a no-op until 0002c lands the table + 0002d lands the + // real body. Leaving this as todo!() would crash the MCP server at + // startup the moment a user switches `backend = "postgres"`, which is + // not what we want for the build verification step in this sub-plan. + Ok(()) +} +``` + +The no-op keeps `cargo build --features postgres-backend` not just +compiling but also allowing the MCP server to *boot* against a Postgres +URL pointing at an already-migrated database (the local-dev-postgres-setup +docs cover bringing up such a DB by hand). Real init lands in `0002d`. + +--- + +## Error variants + +**File**: `crates/vestige-core/src/storage/memory_store.rs` (edit). + +The Phase 1 enum `MemoryStoreError` gains two feature-gated variants. These +were deferred in `0002a` and become required as soon as `pool.rs` calls +`.map_err(MemoryStoreError::from)` on `sqlx::Error`. + +```rust +// Within enum MemoryStoreError { ... } in memory_store.rs + +#[cfg(feature = "postgres-backend")] +#[error("postgres error: {0}")] +Postgres(#[from] sqlx::Error), + +#[cfg(feature = "postgres-backend")] +#[error("postgres migration error: {0}")] +Migrate(#[from] sqlx::migrate::MigrateError), +``` + +Both use thiserror's `#[from]` attribute so the `?` operator works in +`pool.rs`, the migrate module (`0002c`), and registry code (`0002d`). +Default-features build (no `postgres-backend`) sees neither variant; the +enum stays exhaustive on stable. + +If clippy fires on `non_exhaustive` due to the gated variants, add +`#[non_exhaustive]` on the enum. That has no caller-side effect since the +enum is constructed only inside the crate. + +--- + +## vestige-mcp wiring + +### Cargo feature passthrough + +**File**: `crates/vestige-mcp/Cargo.toml` (edit). + +Add a feature that forwards through to `vestige-core`. Default features in +`vestige-mcp` stay unchanged. + +```toml +[features] +default = ["embeddings", "vector-search"] +embeddings = ["vestige-core/embeddings"] +vector-search = ["vestige-core/vector-search"] +postgres-backend = ["vestige-core/postgres-backend"] +``` + +Verify with: + +```bash +cargo build -p vestige-mcp --features postgres-backend +``` + +### Backend dispatch at startup + +**File**: `crates/vestige-mcp/src/main.rs` (edit around the existing +`Storage::new(storage_path)` call -- see audit note above; in the current +worktree this is around line 285). + +The current code is roughly: + +```rust +let storage_path = match prepare_storage_path(config.data_dir) { ... }; +let storage = match Storage::new(storage_path) { ... }; +``` + +Replace that with a dispatch driven by `VestigeConfig`: + +```rust +use std::sync::Arc; + +use vestige_core::config::{StorageConfig, VestigeConfig}; +use vestige_core::storage::SqliteMemoryStore; +#[cfg(feature = "postgres-backend")] +use vestige_core::storage::postgres::PgMemoryStore; +use vestige_core::storage::MemoryStore; + +// Earlier: still call prepare_storage_path to honour --data-dir override. +let storage_path = match prepare_storage_path(config.data_dir.clone()) { ... }; + +// New: load vestige.toml (or fall back to defaults). +let vestige_cfg = match VestigeConfig::load(config.config_path.as_deref()) { + Ok(c) => c, + Err(e) => { + eprintln!("vestige: failed to load config: {e}"); + std::process::exit(2); + } +}; + +let storage: Arc = match &vestige_cfg.storage { + StorageConfig::Sqlite(sqlite_cfg) => { + // CLI flag --data-dir wins over the config file path. + let path = storage_path.clone().or_else(|| sqlite_cfg.path.clone()); + let s = SqliteMemoryStore::new(path).unwrap_or_else(|e| { + eprintln!("vestige: sqlite init failed: {e}"); + std::process::exit(3); + }); + Arc::new(s) + } + #[cfg(feature = "postgres-backend")] + StorageConfig::Postgres(pg_cfg) => { + let s = PgMemoryStore::connect_with(pg_cfg).await.unwrap_or_else(|e| { + eprintln!("vestige: postgres init failed: {e}"); + std::process::exit(3); + }); + Arc::new(s) + } +}; +``` + +The `config_path: Option` field on the local `Config` (or +clap-derived `Args`) struct must be added if not present; it accepts +`--config `. Default behaviour (no flag) goes through +`VestigeConfig::default_path()`. + +If the existing main wires `Storage` through a concrete type rather than +`Arc`, the dispatch above lives behind a helper: + +```rust +async fn build_store(cfg: &VestigeConfig, cli_path: Option) + -> Result, anyhow::Error> +{ ... } +``` + +and the caller chains `.into()` as needed. Phase 1 already moved +cognitive modules to `Arc` so this should be a pure +substitution; if a concrete-type holdout is found, fix it locally in this +sub-plan (separate commit) rather than punting. + +--- + +## vestige.toml example + +The canonical example to ship in `docs/` (Phase 2 docs land in `0002i`, +runbook), shown here for reference and used verbatim by the unit test +below. + +```toml +# vestige.toml -- top-level configuration +# +# Default location: ~/.vestige/vestige.toml +# Override: vestige-mcp --config /path/to/vestige.toml + +[embeddings] +provider = "fastembed" +model = "nomic-ai/nomic-embed-text-v1.5" + +# --- SQLite backend (default) --- +[storage] +backend = "sqlite" + +[storage.sqlite] +path = "/home/user/.vestige/vestige.db" + +# --- Postgres backend (requires --features postgres-backend) --- +# [storage] +# backend = "postgres" +# +# [storage.postgres] +# url = "postgres://vestige:secret@localhost:5432/vestige" +# max_connections = 10 +# acquire_timeout_secs = 30 + +[server] +# Reserved for Phase 3 (bind address, ports, TLS). + +[auth] +# Reserved for Phase 3 (API keys, claims). +``` + +--- + +## Verification + +Run all of these from the repo root. The first three are the gates that +must pass before this sub-plan is considered done. + +### 1. Default build (no Postgres) + +```bash +cargo build -p vestige-core +cargo build -p vestige-mcp +cargo test -p vestige-core --lib +``` + +Expected: clean build. `VestigeConfig::default()` selects SQLite; the MCP +server boots the same way it did pre-sub-plan. + +### 2. Postgres-feature build + +```bash +cargo build -p vestige-core --features postgres-backend +cargo build -p vestige-mcp --features postgres-backend +``` + +Expected: clean build. `PgMemoryStore::connect_with` resolves to +`pool::build_pool` + `registry::ensure_registry_stub`; no `todo!()` is +reachable on the build path. `connect` and `from_pool` are exported. + +### 3. Clippy across both feature sets + +```bash +cargo clippy -p vestige-core -- -D warnings +cargo clippy -p vestige-core --features postgres-backend -- -D warnings +cargo clippy -p vestige-mcp --features postgres-backend -- -D warnings +``` + +### 4. Unit test: round-trip the example + +Add this test to `crates/vestige-core/src/config.rs`: + +```rust +#[cfg(test)] +mod tests { + use super::*; + + const EXAMPLE_SQLITE: &str = r#" +[embeddings] +provider = "fastembed" +model = "nomic-ai/nomic-embed-text-v1.5" + +[storage] +backend = "sqlite" + +[storage.sqlite] +path = "/home/user/.vestige/vestige.db" +"#; + + #[cfg(feature = "postgres-backend")] + const EXAMPLE_POSTGRES: &str = r#" +[embeddings] +provider = "fastembed" +model = "nomic-ai/nomic-embed-text-v1.5" + +[storage] +backend = "postgres" + +[storage.postgres] +url = "postgres://vestige:secret@localhost:5432/vestige" +max_connections = 10 +acquire_timeout_secs = 30 +"#; + + #[test] + fn parses_sqlite_example() { + let cfg: VestigeConfig = toml::from_str(EXAMPLE_SQLITE).expect("parse"); + match cfg.storage { + StorageConfig::Sqlite(s) => assert!(s.path.is_some()), + #[cfg(feature = "postgres-backend")] + StorageConfig::Postgres(_) => panic!("wrong variant"), + } + assert_eq!(cfg.embeddings.provider, "fastembed"); + } + + #[cfg(feature = "postgres-backend")] + #[test] + fn parses_postgres_example() { + let cfg: VestigeConfig = toml::from_str(EXAMPLE_POSTGRES).expect("parse"); + match cfg.storage { + StorageConfig::Postgres(p) => { + assert_eq!(p.url, "postgres://vestige:secret@localhost:5432/vestige"); + assert_eq!(p.max_connections, Some(10)); + assert_eq!(p.acquire_timeout_secs, Some(30)); + } + StorageConfig::Sqlite(_) => panic!("wrong variant"), + } + } + + #[cfg(not(feature = "postgres-backend"))] + #[test] + fn rejects_postgres_when_feature_off() { + let toml_text = r#" +[storage] +backend = "postgres" + +[storage.postgres] +url = "postgres://x/y" +"#; + let res: Result = toml::from_str(toml_text); + assert!(res.is_err(), "must fail without postgres-backend feature"); + } + + #[test] + fn defaults_pick_sqlite() { + let cfg = VestigeConfig::default(); + assert!(matches!(cfg.storage, StorageConfig::Sqlite(_))); + } + + #[test] + fn load_missing_file_returns_default() { + let tmp = std::env::temp_dir().join("vestige-no-such-file.toml"); + let _ = std::fs::remove_file(&tmp); + let cfg = VestigeConfig::load(Some(&tmp)).expect("missing file is OK"); + assert!(matches!(cfg.storage, StorageConfig::Sqlite(_))); + } + + #[test] + fn load_roundtrip_from_disk() { + let tmp = std::env::temp_dir().join("vestige-roundtrip.toml"); + std::fs::write(&tmp, EXAMPLE_SQLITE).unwrap(); + let cfg = VestigeConfig::load(Some(&tmp)).expect("load"); + assert!(matches!(cfg.storage, StorageConfig::Sqlite(_))); + let _ = std::fs::remove_file(&tmp); + } +} +``` + +Run: + +```bash +cargo test -p vestige-core --lib config:: +cargo test -p vestige-core --lib config:: --features postgres-backend +``` + +### 5. Smoke: server boots with default config + +```bash +# default build, no vestige.toml on disk +cargo run -p vestige-mcp -- --help +# should print help, no panic +``` + +--- + +## Acceptance criteria + +- [ ] `cargo build -p vestige-core` (default features) succeeds. +- [ ] `cargo build -p vestige-core --features postgres-backend` succeeds. +- [ ] `cargo build -p vestige-mcp` (default features) succeeds. +- [ ] `cargo build -p vestige-mcp --features postgres-backend` succeeds. +- [ ] `cargo clippy` with and without `postgres-backend` is clean on both + crates. +- [ ] `crates/vestige-core/src/config.rs` exists, exposes + `VestigeConfig`, `StorageConfig`, `SqliteConfig`, `EmbeddingsConfig`, + `ConfigError`, plus `PostgresConfig` when the feature is on. +- [ ] `VestigeConfig::load(None)` returns `Ok(default)` when + `~/.vestige/vestige.toml` is missing. +- [ ] `VestigeConfig::load(Some(&path))` round-trips both the SQLite and + Postgres example blocks above. +- [ ] With `postgres-backend` off, parsing `backend = "postgres"` returns + a clear `ConfigError::Invalid` mentioning the feature flag, NOT a + panic. +- [ ] `crates/vestige-core/src/storage/postgres/pool.rs` exists, + implementing `build_pool(&PostgresConfig) -> MemoryStoreResult` + with the documented defaults. +- [ ] `PgMemoryStore::connect`, `connect_with`, and `from_pool` all wire + through `pool::build_pool`. None of them is `todo!()`. The registry + step is a no-op stub documented as filled in by `0002d`. +- [ ] `MemoryStoreError::Postgres(sqlx::Error)` and + `MemoryStoreError::Migrate(sqlx::migrate::MigrateError)` exist + behind `#[cfg(feature = "postgres-backend")]` with `#[from]`. +- [ ] `vestige-mcp` has a `postgres-backend` feature that forwards to + `vestige-core/postgres-backend`. +- [ ] `vestige-mcp/src/main.rs` selects SQLite vs Postgres at startup + based on `VestigeConfig`. SQLite is the default when no config file + is present. +- [ ] Unit tests in the "Verification" section pass on both feature sets. + +--- + +## Out of scope (handled by other sub-plans) + +- Migrations (`crates/vestige-core/migrations/postgres/*.sql`) -- `0002c`. +- Real `PgMemoryStore` CRUD/search/scheduling/edges bodies -- `0002d`, + `0002e`. +- `ensure_registry` real body with `ALTER COLUMN TYPE vector(N)` -- `0002d`. +- `vestige migrate --from sqlite --to postgres` CLI -- `0002f`. +- Re-embed flow -- `0002g`. +- Env-var override (`VESTIGE_POSTGRES_URL`, etc.) -- Phase 3. +- RLS, multi-tenant column population -- Phase 3. diff --git a/docs/plans/0002c-migrations.md b/docs/plans/0002c-migrations.md new file mode 100644 index 0000000..ef8e35c --- /dev/null +++ b/docs/plans/0002c-migrations.md @@ -0,0 +1,1119 @@ +# Phase 2 Sub-plan 0002c: sqlx Migrations + +**Status**: Draft +**Depends on**: `0002a-skeleton-and-feature-gate.md` (PgMemoryStore skeleton, error variants), `0002b-pool-and-config.md` (PgPool builder, PostgresConfig) +**Related**: docs/adr/0002-phase-2-execution.md (D7 multi-tenancy reservation, D8 codebase column), docs/plans/0002-phase-2-postgres-backend.md (D4 master SQL), docs/plans/local-dev-postgres-setup.md (local cluster + role + DB) + +--- + +## Context + +This sub-plan covers Phase 2 deliverable D4 (sqlx migration files under +`crates/vestige-core/migrations/postgres/`) PLUS the schema additions decided +in ADR 0002: + +- D7 -- multi-tenancy reservation: `users`, `groups`, `group_memberships` + tables, plus `owner_user_id`, `visibility`, `shared_with_groups` columns on + `knowledge_nodes`. Phase 3 fills these in; Phase 2 just reserves them so the auth + filter is later additive instead of an online migration over a populated, + HNSW-indexed table. +- D8 -- `codebase` promoted to a first-class indexed column on `knowledge_nodes`. + +This sub-plan also adds the parity SQLite migration (V15) that mirrors D7 + +D8 on the SQLite side, so a single-user SQLite deployment sees the same +columns (with stand-in defaults). + +After this sub-plan lands: + +- A fresh Postgres database, with the `vestige` role from the local-dev + setup, can be initialized by running `sqlx::migrate!` against + `crates/vestige-core/migrations/postgres/`, plus one programmatic + `register_model` call before the HNSW migration. +- A fresh SQLite database initialized by `apply_migrations` lands at + schema_version = 15 with the new tables and columns present. +- `PgMemoryStore::connect` wires the migrator into the connect path + (pool build -> migrator up-to v1 -> register_model -> migrator up-to v2). +- The SQLite test suite continues to pass. +- No `sqlx::query!` calls are introduced yet; the offline `.sqlx/` cache is + filled out in `0002d-store-impl-bodies.md`. + +The deliverable is purely schema. No query bodies, no row-mapping, no search. + +--- + +## Postgres migration files + +Layout, relative to repo root: + +``` +crates/vestige-core/migrations/postgres/ + 0001_init.up.sql + 0001_init.down.sql + 0002_hnsw.up.sql + 0002_hnsw.down.sql +``` + +The `migrations/postgres/` directory is sibling-of-`src/`, not under `src/`, +because `sqlx::migrate!` and `sqlx-cli` both look for a path relative to +`CARGO_MANIFEST_DIR`. The directory is committed. + +### 0001_init.up.sql + +Creates extensions, the multi-tenancy tables (D7), the embedding registry, +the domains catalogue, the `knowledge_nodes` table (with D7 + D8 columns merged in), +the FSRS scheduling and edges tables, the review-events log, all non-vector +indexes, the updated_at trigger, and the bootstrap `local` user row. + +The HNSW vector index is deliberately NOT here -- it requires a typmod on +`knowledge_nodes.embedding`, which is stamped by `register_model` at runtime. See +the "HNSW typmod ordering" section below. + +```sql +-- crates/vestige-core/migrations/postgres/0001_init.up.sql +-- +-- Phase 2 initial schema for the Postgres backend. +-- Includes D7 multi-tenancy reservation (users/groups/group_memberships, +-- owner_user_id/visibility/shared_with_groups on knowledge_nodes) and D8 +-- (codebase first-class column on knowledge_nodes). +-- +-- The HNSW index on knowledge_nodes.embedding lives in 0002_hnsw.up.sql; it +-- requires the column typmod to be stamped first by register_model(). + +-- Extensions ---------------------------------------------------------------- + +CREATE EXTENSION IF NOT EXISTS pgcrypto; +CREATE EXTENSION IF NOT EXISTS vector; + +-- Embedding model registry -------------------------------------------------- +-- Mirrors the SQLite table created in Phase 1 V14. +-- One logical row enforced by CHECK (id = 1). + +CREATE TABLE embedding_model ( + id SMALLINT PRIMARY KEY DEFAULT 1 CHECK (id = 1), + name TEXT NOT NULL, + dimension INTEGER NOT NULL CHECK (dimension > 0), + hash TEXT NOT NULL, + created_at TIMESTAMPTZ NOT NULL DEFAULT now() +); + +-- Domains catalogue --------------------------------------------------------- +-- Populated by the Phase 4 DomainClassifier. Phase 2 creates the empty +-- table so list/get/upsert/delete work uniformly against both backends. + +CREATE TABLE domains ( + id TEXT PRIMARY KEY, + label TEXT NOT NULL, + centroid vector, + top_terms TEXT[] NOT NULL DEFAULT '{}', + memory_count INTEGER NOT NULL DEFAULT 0, + created_at TIMESTAMPTZ NOT NULL DEFAULT now(), + metadata JSONB NOT NULL DEFAULT '{}'::jsonb +); + +-- Multi-tenancy (D7) -------------------------------------------------------- +-- Reserved in Phase 2; populated in Phase 3. +-- Single bootstrap user inserted at the bottom of this file. + +CREATE TABLE users ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + handle TEXT NOT NULL UNIQUE, + display_name TEXT, + created_at TIMESTAMPTZ NOT NULL DEFAULT now(), + metadata JSONB NOT NULL DEFAULT '{}'::jsonb +); + +CREATE TABLE groups ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + handle TEXT NOT NULL UNIQUE, + display_name TEXT, + created_at TIMESTAMPTZ NOT NULL DEFAULT now(), + metadata JSONB NOT NULL DEFAULT '{}'::jsonb +); + +CREATE TABLE group_memberships ( + user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE, + group_id UUID NOT NULL REFERENCES groups(id) ON DELETE CASCADE, + role TEXT NOT NULL DEFAULT 'member', + joined_at TIMESTAMPTZ NOT NULL DEFAULT now(), + PRIMARY KEY (user_id, group_id), + CHECK (role IN ('member', 'admin')) +); + +-- Core knowledge_nodes table ------------------------------------------------- +-- Original Phase 2 columns merged with D7 (owner_user_id, visibility, +-- shared_with_groups) and D8 (codebase). + +CREATE TABLE knowledge_nodes ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + + -- Content + content TEXT NOT NULL, + node_type TEXT NOT NULL DEFAULT 'general', + tags TEXT[] NOT NULL DEFAULT '{}', + metadata JSONB NOT NULL DEFAULT '{}'::jsonb, + + -- Phase 4 emergent domains (Phase 2 leaves empty) + domains TEXT[] NOT NULL DEFAULT '{}', + domain_scores JSONB NOT NULL DEFAULT '{}'::jsonb, + + -- Embedding (typmod stamped by register_model before 0002_hnsw runs) + embedding vector, + + -- D8: first-class codebase column for high-frequency scoped queries + codebase TEXT, + + -- D7: multi-tenancy reservation. Defaults make Phase 2 single-user + -- behaviour identical to Phase 1. + owner_user_id UUID NOT NULL DEFAULT '00000000-0000-0000-0000-000000000001' + REFERENCES users(id), + visibility TEXT NOT NULL DEFAULT 'private', + shared_with_groups UUID[] NOT NULL DEFAULT '{}', + + -- Timestamps + created_at TIMESTAMPTZ NOT NULL DEFAULT now(), + updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), + + -- Generated full-text search vector. Phase 2 uses websearch_to_tsquery + -- against this column at query time (see 0002e-hybrid-search.md). + search_vec TSVECTOR GENERATED ALWAYS AS ( + setweight(to_tsvector('english', coalesce(content, '')), 'A') || + setweight(to_tsvector('english', coalesce(node_type, '')), 'B') || + setweight(to_tsvector('english', coalesce(array_to_string(tags, ' '), '')), 'C') + ) STORED, + + -- Visibility tri-state CHECK constraint. See "Visibility CHECK + -- constraint" section below for the cardinality variant we + -- intentionally do NOT add yet. + CHECK (visibility IN ('private', 'group', 'public')) +); + +-- FSRS scheduling state (1:1 with knowledge_nodes) --------------------------- +-- +-- Note: the FK column is named `memory_id` (not `node_id`) to match the +-- Phase 1 SQLite trait surface: `SchedulingState { memory_id: Uuid, ... }` +-- and `get_scheduling(memory_id: Uuid)` / `update_scheduling(&state)`. The +-- table is `knowledge_nodes` but the Rust identifier remained `memory_id` +-- across Phase 1 and is preserved here so both backends speak the same +-- language at the trait boundary. + +CREATE TABLE scheduling ( + memory_id UUID PRIMARY KEY REFERENCES knowledge_nodes(id) ON DELETE CASCADE, + stability DOUBLE PRECISION NOT NULL DEFAULT 0.0, + difficulty DOUBLE PRECISION NOT NULL DEFAULT 0.0, + retrievability DOUBLE PRECISION NOT NULL DEFAULT 1.0, + last_review TIMESTAMPTZ, + next_review TIMESTAMPTZ, + reps INTEGER NOT NULL DEFAULT 0, + lapses INTEGER NOT NULL DEFAULT 0 +); + +-- Spreading activation graph edges ------------------------------------------ + +CREATE TABLE edges ( + source_id UUID NOT NULL REFERENCES knowledge_nodes(id) ON DELETE CASCADE, + target_id UUID NOT NULL REFERENCES knowledge_nodes(id) ON DELETE CASCADE, + edge_type TEXT NOT NULL DEFAULT 'related', + weight DOUBLE PRECISION NOT NULL DEFAULT 1.0, + created_at TIMESTAMPTZ NOT NULL DEFAULT now(), + PRIMARY KEY (source_id, target_id, edge_type) +); + +-- FSRS review event log (append-only; Phase 5 federation reads) ------------- + +CREATE TABLE review_events ( + id BIGSERIAL PRIMARY KEY, + memory_id UUID NOT NULL REFERENCES knowledge_nodes(id) ON DELETE CASCADE, + timestamp TIMESTAMPTZ NOT NULL DEFAULT now(), + rating SMALLINT NOT NULL, + prior_state JSONB NOT NULL, + new_state JSONB NOT NULL +); + +-- Indexes ------------------------------------------------------------------- + +-- knowledge_nodes: full-text, arrays, hot scalar columns, D7+D8 access patterns +CREATE INDEX idx_knowledge_nodes_fts ON knowledge_nodes USING GIN (search_vec); +CREATE INDEX idx_knowledge_nodes_domains ON knowledge_nodes USING GIN (domains); +CREATE INDEX idx_knowledge_nodes_tags ON knowledge_nodes USING GIN (tags); +CREATE INDEX idx_knowledge_nodes_node_type ON knowledge_nodes (node_type); +CREATE INDEX idx_knowledge_nodes_created ON knowledge_nodes (created_at); +CREATE INDEX idx_knowledge_nodes_updated ON knowledge_nodes (updated_at); + +-- D7 visibility filter (Phase 3 query: WHERE owner_user_id = $me ...) +CREATE INDEX idx_knowledge_nodes_owner ON knowledge_nodes (owner_user_id); +CREATE INDEX idx_knowledge_nodes_shared_groups ON knowledge_nodes USING GIN (shared_with_groups); + +-- D8 codebase scoping (Phase 4 HDBSCAN per-repo, sharing rules in Phase 4). +-- Partial index keeps the index small in single-user mode where most rows +-- never set a codebase. +CREATE INDEX idx_knowledge_nodes_codebase + ON knowledge_nodes (codebase) + WHERE codebase IS NOT NULL; + +-- scheduling: hot lookup paths for FSRS pickers +CREATE INDEX idx_scheduling_next_review ON scheduling (next_review); +CREATE INDEX idx_scheduling_last_review ON scheduling (last_review); + +-- edges: bidirectional + edge type +CREATE INDEX idx_edges_target ON edges (target_id); +CREATE INDEX idx_edges_source ON edges (source_id); +CREATE INDEX idx_edges_type ON edges (edge_type); + +-- review_events: per-memory and chronological +CREATE INDEX idx_review_events_memory ON review_events (memory_id); +CREATE INDEX idx_review_events_ts ON review_events (timestamp); + +-- users / groups: unique handle indexes are implicit; add nothing extra. +-- group_memberships: primary key (user_id, group_id) is the access path. + +-- updated_at trigger on knowledge_nodes ---------------------------------------- + +CREATE OR REPLACE FUNCTION knowledge_nodes_set_updated_at() RETURNS TRIGGER AS $$ +BEGIN + NEW.updated_at := now(); + RETURN NEW; +END; +$$ LANGUAGE plpgsql; + +CREATE TRIGGER trg_knowledge_nodes_updated_at +BEFORE UPDATE ON knowledge_nodes +FOR EACH ROW EXECUTE FUNCTION knowledge_nodes_set_updated_at(); + +-- Bootstrap rows ------------------------------------------------------------ +-- Single 'local' user matches the default on knowledge_nodes.owner_user_id so +-- single-user Phase 2 inserts never violate the FK. + +INSERT INTO users (id, handle, display_name) + VALUES ('00000000-0000-0000-0000-000000000001', 'local', 'Local User'); +``` + +### 0001_init.down.sql + +Reverse-dependency drop order. Trigger and function first, then indexes, +then tables, then extensions are left alone (extensions are global; we do +not drop them in a `down`). + +```sql +-- crates/vestige-core/migrations/postgres/0001_init.down.sql + +DROP TRIGGER IF EXISTS trg_knowledge_nodes_updated_at ON knowledge_nodes; +DROP FUNCTION IF EXISTS knowledge_nodes_set_updated_at(); + +-- knowledge_nodes indexes +DROP INDEX IF EXISTS idx_knowledge_nodes_codebase; +DROP INDEX IF EXISTS idx_knowledge_nodes_shared_groups; +DROP INDEX IF EXISTS idx_knowledge_nodes_owner; +DROP INDEX IF EXISTS idx_knowledge_nodes_updated; +DROP INDEX IF EXISTS idx_knowledge_nodes_created; +DROP INDEX IF EXISTS idx_knowledge_nodes_node_type; +DROP INDEX IF EXISTS idx_knowledge_nodes_tags; +DROP INDEX IF EXISTS idx_knowledge_nodes_domains; +DROP INDEX IF EXISTS idx_knowledge_nodes_fts; + +-- scheduling indexes +DROP INDEX IF EXISTS idx_scheduling_last_review; +DROP INDEX IF EXISTS idx_scheduling_next_review; + +-- edges indexes +DROP INDEX IF EXISTS idx_edges_type; +DROP INDEX IF EXISTS idx_edges_source; +DROP INDEX IF EXISTS idx_edges_target; + +-- review_events indexes +DROP INDEX IF EXISTS idx_review_events_ts; +DROP INDEX IF EXISTS idx_review_events_memory; + +-- Tables, reverse dependency order +DROP TABLE IF EXISTS review_events; +DROP TABLE IF EXISTS edges; +DROP TABLE IF EXISTS scheduling; +DROP TABLE IF EXISTS knowledge_nodes; +DROP TABLE IF EXISTS group_memberships; +DROP TABLE IF EXISTS groups; +DROP TABLE IF EXISTS users; +DROP TABLE IF EXISTS domains; +DROP TABLE IF EXISTS embedding_model; + +-- Extensions are intentionally NOT dropped. They may be in use by other +-- databases on the cluster; dropping them is an admin choice. +``` + +### 0002_hnsw.up.sql + +Single statement; separated from 0001 so reembed (sub-plan 0002g) can +DROP/CREATE this index in isolation without touching anything else. + +```sql +-- crates/vestige-core/migrations/postgres/0002_hnsw.up.sql +-- +-- HNSW index on knowledge_nodes.embedding. This migration runs AFTER +-- register_model() has stamped the typmod via: +-- +-- ALTER TABLE knowledge_nodes ALTER COLUMN embedding TYPE vector($N) +-- +-- where $N is the embedder's dimension(). Without the typmod, pgvector +-- rejects HNSW creation with: +-- +-- ERROR: column does not have dimensions +-- +-- See "HNSW typmod ordering" in 0002c-migrations.md and the connect() +-- sequence in 0002a-skeleton-and-feature-gate.md / 0002d-store-impl-bodies.md. +-- +-- Operator class: vector_cosine_ops -> distance operator `<=>`. +-- Build parameters: m = 16, ef_construction = 64 (pgvector defaults; see +-- the master plan 0002 D5 RRF discussion for the rationale). + +CREATE INDEX idx_knowledge_nodes_embedding_hnsw + ON knowledge_nodes USING hnsw (embedding vector_cosine_ops) + WITH (m = 16, ef_construction = 64); +``` + +### 0002_hnsw.down.sql + +```sql +-- crates/vestige-core/migrations/postgres/0002_hnsw.down.sql + +DROP INDEX IF EXISTS idx_knowledge_nodes_embedding_hnsw; +``` + +--- + +## HNSW typmod ordering + +pgvector's HNSW index requires the indexed column to have a typmod (fixed +dimension). `vector` (unconstrained) is rejected; `vector(768)` is accepted. +We cannot bake the dimension into 0001 because the dimension is an +embedder-determined runtime value -- different builds may use different +embedders. + +This forces an ordering: + +1. Apply migration 0001 (creates `knowledge_nodes.embedding vector`, no typmod). +2. Connect, decide which embedder is in use, run + `ALTER TABLE knowledge_nodes ALTER COLUMN embedding TYPE vector($N)` + inside `register_model`. +3. Apply migration 0002 (creates HNSW; succeeds because the column now has + a typmod). + +`sqlx::migrate!("...")` runs ALL pending migrations in a single call. It is +not designed to pause between two specific migrations so application code +can interleave a runtime DDL step. So we have two options: + +**Option A: Migration 0002 lives outside the sqlx migrations directory.** +Keep `0001_init.{up,down}.sql` only in `migrations/postgres/`; promote +`0002_hnsw.up.sql` to a Rust `include_str!` constant or a separate +`migrations/postgres-hnsw/` directory, run it manually by `PgMemoryStore` +after `register_model`. + +Pros: simple control flow, one `sqlx::migrate!()` call. +Cons: `sqlx_migrations` table does not record 0002, so `sqlx-cli migrate +info` lies. The HNSW index becomes "shadow" schema state from sqlx's POV. +Reembed (sub-plan 0002g) has to also know about this file outside the +normal migrations directory. + +**Option B (chosen): Both migrations live in the directory; the runner +splits them programmatically.** Use `sqlx::migrate::Migrator::new` to load +the directory and call its `run_to(...)` method with a specific version. + +```rust +// crates/vestige-core/src/storage/postgres/migrations.rs +use sqlx::migrate::Migrator; +use sqlx::PgPool; + +use crate::storage::error::MemoryStoreResult; + +/// Embedded migrator. Loaded at compile time from the migrations directory +/// alongside the crate. Path is relative to CARGO_MANIFEST_DIR. +static MIGRATOR: Migrator = sqlx::migrate!("./migrations/postgres"); + +/// Run migrations up to (and including) version 1. +/// +/// This must be called BEFORE register_model so the schema (knowledge_nodes table, +/// embedding_model registry, etc.) exists for register_model to write into +/// and to ALTER. +pub(crate) async fn run_pre_register(pool: &PgPool) -> MemoryStoreResult<()> { + MIGRATOR.run_to(pool, 1).await?; + Ok(()) +} + +/// Run any remaining migrations (currently: HNSW = version 2). +/// +/// Called AFTER register_model has stamped the embedding column's typmod. +pub(crate) async fn run_post_register(pool: &PgPool) -> MemoryStoreResult<()> { + MIGRATOR.run(pool).await?; + Ok(()) +} +``` + +Pros: sqlx is the only source of truth for migration version state; +`sqlx-cli migrate info` is accurate; reembed re-applies 0002 by name; future +migrations slot in normally. +Cons: relies on `Migrator::run_to`, which exists in sqlx 0.7+ and is the +documented API for staged migration. If that API ever disappears we fall +back to Option A. + +Decision: Option B. `Migrator::run_to(target_version)` is stable in sqlx +0.8. Sub-plan 0002a's `MemoryStoreError` already gains +`#[from] sqlx::migrate::MigrateError` to absorb whichever error variant +this surfaces. + +The `connect()` sequence in sub-plan 0002d will therefore look like: + +```rust +// Sketch only; full body lives in 0002d-store-impl-bodies.md. +pub async fn connect(url: &str, max_connections: u32) -> MemoryStoreResult { + let pool = crate::storage::postgres::pool::build(url, max_connections).await?; + crate::storage::postgres::migrations::run_pre_register(&pool).await?; + let store = Self { pool }; + // register_model is called by the cognitive engine bootstrap, NOT here. + // After it runs, the engine calls store.finalize_schema() which calls + // run_post_register. Same shape as SqliteMemoryStore. + Ok(store) +} + +pub async fn finalize_schema(&self) -> MemoryStoreResult<()> { + crate::storage::postgres::migrations::run_post_register(&self.pool).await +} +``` + +`finalize_schema` lands in 0002d; this sub-plan only ships `run_pre_register` +and `run_post_register` plus their wiring into `connect`. + +--- + +## SQLite V15 migration + +The Phase 1 SQLite schema lives in `crates/vestige-core/src/storage/migrations.rs` +as a `MIGRATIONS` slice. V14 is the latest entry. V15 is appended to mirror +D7 (multi-tenancy) and D8 (codebase) on the SQLite side, so a single-user +SQLite deployment sees the same surface area. + +Constraints versus the Postgres migration: + +- No `UUID[]` -- `shared_with_groups` is a TEXT JSON-encoded `'[]'`. +- No `gen_random_uuid()` -- the bootstrap user UUID is a literal. +- No partial indexes for our chosen pattern (SQLite *does* support partial + indexes since 3.8; we use one for `codebase` to match Postgres). +- No `ADD COLUMN IF NOT EXISTS` -- the V15 column additions are split into a + `MIGRATION_V15_ALTER_COLUMNS` slice exactly like V14 did, so the migration + is idempotent on replay. + +### Insertion point in migrations.rs + +Add to the `MIGRATIONS` slice immediately after V14: + +```rust +// In MIGRATIONS slice, after the V14 entry: +Migration { + version: 15, + description: "ADR 0002 D7+D8: multi-tenancy reservation + codebase column", + up: MIGRATION_V15_UP, +}, +``` + +### V15 SQL + +```rust +/// V15: ADR 0002 D7 + D8. +/// +/// D7 reserves users / groups / group_memberships and owner_user_id / +/// visibility / shared_with_groups columns on knowledge_nodes. Single-user +/// SQLite mode never reads these (the trait surface ignores visibility +/// because there is exactly one user) but they exist so Phase 3 does not +/// have to ALTER a populated table. +/// +/// D8 adds a first-class `codebase` column. +/// +/// Like V14, the ALTER TABLE statements are split into +/// MIGRATION_V15_ALTER_COLUMNS because SQLite has no ADD COLUMN IF NOT EXISTS. +const MIGRATION_V15_UP: &str = r#" +-- Migration V15: multi-tenancy reservation + codebase column. + +-- 1. Users / groups / group_memberships ----------------------------------- +-- Mirrors the Postgres D7 tables. Single bootstrap user inserted below. + +CREATE TABLE IF NOT EXISTS users ( + id TEXT PRIMARY KEY, + handle TEXT NOT NULL UNIQUE, + display_name TEXT, + created_at TEXT NOT NULL, + metadata TEXT NOT NULL DEFAULT '{}' +); + +CREATE TABLE IF NOT EXISTS groups ( + id TEXT PRIMARY KEY, + handle TEXT NOT NULL UNIQUE, + display_name TEXT, + created_at TEXT NOT NULL, + metadata TEXT NOT NULL DEFAULT '{}' +); + +CREATE TABLE IF NOT EXISTS group_memberships ( + user_id TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE, + group_id TEXT NOT NULL REFERENCES groups(id) ON DELETE CASCADE, + role TEXT NOT NULL DEFAULT 'member' CHECK (role IN ('member', 'admin')), + joined_at TEXT NOT NULL, + PRIMARY KEY (user_id, group_id) +); + +-- 2. Bootstrap 'local' user. Same UUID as the Postgres default so a future +-- portable export from SQLite -> import to Postgres preserves owner_user_id. + +INSERT OR IGNORE INTO users (id, handle, display_name, created_at) + VALUES ('00000000-0000-0000-0000-000000000001', 'local', 'Local User', + datetime('now')); + +-- 3. Per-memory column additions are applied separately by the migration +-- runner (see MIGRATION_V15_ALTER_COLUMNS). + +-- 4. Indexes that do not depend on the new columns. Index creation on the +-- new knowledge_nodes columns is done after MIGRATION_V15_ALTER_COLUMNS +-- runs (see runner glue below). + +UPDATE schema_version SET version = 15, applied_at = datetime('now'); +"#; + +/// V15 column additions. SQLite has no ADD COLUMN IF NOT EXISTS, so the +/// runner skips "duplicate column" errors per statement (same shape as V14). +pub const MIGRATION_V15_ALTER_COLUMNS: &[&str] = &[ + // D7 columns. Defaults match the Postgres side. shared_with_groups is + // a JSON-encoded array. + "ALTER TABLE knowledge_nodes ADD COLUMN owner_user_id TEXT NOT NULL DEFAULT '00000000-0000-0000-0000-000000000001'", + "ALTER TABLE knowledge_nodes ADD COLUMN visibility TEXT NOT NULL DEFAULT 'private'", + "ALTER TABLE knowledge_nodes ADD COLUMN shared_with_groups TEXT NOT NULL DEFAULT '[]'", + // D8 column. + "ALTER TABLE knowledge_nodes ADD COLUMN codebase TEXT", +]; + +/// V15 index creation. Runs AFTER the ALTER COLUMN statements succeed. +/// Kept as a separate batch so a partial replay (columns already there, +/// indexes not yet) still creates the indexes. +const MIGRATION_V15_INDEXES: &str = r#" +CREATE INDEX IF NOT EXISTS idx_nodes_owner_user_id ON knowledge_nodes(owner_user_id); +CREATE INDEX IF NOT EXISTS idx_nodes_codebase ON knowledge_nodes(codebase) WHERE codebase IS NOT NULL; +-- shared_with_groups is TEXT JSON in SQLite; we do not add a GIN-equivalent +-- index. Phase 3 lookups on the SQLite side will scan; SQLite never serves +-- the multi-user query path in Phase 2-4 anyway. +"#; +``` + +### Runner glue + +Extend `apply_migrations` in `migrations.rs` to recognise V15 the same way +it recognises V14: + +```rust +// Existing pattern for V14 lives in apply_migrations; extend it: +if migration.version == 15 { + for stmt in MIGRATION_V15_ALTER_COLUMNS { + if let Err(e) = conn.execute_batch(stmt) { + let msg = e.to_string(); + if msg.contains("duplicate column name") { + tracing::debug!( + "V15 ALTER TABLE skipped (column already exists): {}", + msg + ); + } else { + return Err(e); + } + } + } + // Indexes run *after* the columns exist. + conn.execute_batch(MIGRATION_V15_INDEXES)?; +} + +// Then the normal: +conn.execute_batch(migration.up)?; +``` + +Order of operations on a fresh in-memory DB: + +1. V1 - V14 run as before. +2. V15: column ALTERs run first (so MIGRATION_V15_INDEXES sees them). +3. V15 main body creates users/groups/group_memberships and the bootstrap row. +4. V15 indexes batch runs. +5. schema_version advances to 15. + +This intentionally mirrors how V14 handles its ALTER + index pair. + +### Existing-data backfill + +Existing SQLite databases (every Phase 1 deployment) have populated +`knowledge_nodes` rows. The V15 ALTER COLUMN ADD COLUMN statements assign +the default values to every existing row: + +- `owner_user_id` -> `'00000000-0000-0000-0000-000000000001'` +- `visibility` -> `'private'` +- `shared_with_groups` -> `'[]'` +- `codebase` -> NULL + +Phase 2 leaves these defaults in place. Phase 3 owns the migration story +for populating real owner UUIDs and visibility values. + +--- + +## Rust wrapper + +Single file: + +```rust +// crates/vestige-core/src/storage/postgres/migrations.rs +// +// sqlx::migrate! wrapper for the Postgres backend. +// +// We split the migration apply into two halves around register_model: +// - run_pre_register: applies everything up to and including version 1 +// (schema, indexes, bootstrap row). Safe to call on a +// fresh DB. +// - run_post_register: applies the remainder (currently: 0002_hnsw, which +// needs the embedding column typmod stamped first). +// +// See docs/plans/0002c-migrations.md "HNSW typmod ordering" for why this +// split exists. + +#![cfg(feature = "postgres-backend")] + +use sqlx::PgPool; +use sqlx::migrate::Migrator; + +use crate::storage::error::MemoryStoreResult; + +/// Embedded migrator. Path is relative to CARGO_MANIFEST_DIR +/// (`crates/vestige-core/`). +static MIGRATOR: Migrator = sqlx::migrate!("./migrations/postgres"); + +/// Apply migrations through version 1 (the schema-only migration). +/// +/// Idempotent: sqlx::migrate consults the `_sqlx_migrations` table and is +/// a no-op on a database already at version 1 or higher. +pub(crate) async fn run_pre_register(pool: &PgPool) -> MemoryStoreResult<()> { + MIGRATOR.run_to(pool, 1).await?; + Ok(()) +} + +/// Apply any remaining migrations. Called after `register_model` has +/// stamped the typmod on `knowledge_nodes.embedding`. +pub(crate) async fn run_post_register(pool: &PgPool) -> MemoryStoreResult<()> { + MIGRATOR.run(pool).await?; + Ok(()) +} +``` + +Wiring into `PgMemoryStore::connect`. The skeleton from 0002a uses +`todo!()` for everything past pool construction. This sub-plan replaces +that with `run_pre_register` only; `run_post_register` is invoked by +`finalize_schema`, which lands in 0002d. Sketch: + +```rust +// In crates/vestige-core/src/storage/postgres/mod.rs (sub-plan 0002a wires +// pool construction; this sub-plan adds the run_pre_register call): + +impl PgMemoryStore { + pub async fn connect(url: &str, max_connections: u32) -> MemoryStoreResult { + let pool = super::pool::build(url, max_connections).await?; + super::migrations::run_pre_register(&pool).await?; + Ok(Self { pool }) + } +} +``` + +Module wire-up in `crates/vestige-core/src/storage/postgres/mod.rs`: + +```rust +mod migrations; // pub(crate) functions; not re-exported. +``` + +### Error variant + +Sub-plan 0002a already added (under feature gate) to `MemoryStoreError`: + +```rust +#[cfg(feature = "postgres-backend")] +#[error("postgres migration error: {0}")] +Migrate(#[from] sqlx::migrate::MigrateError), +``` + +`run_pre_register` / `run_post_register` use the `?` operator and the +`#[from]` conversion handles it; no extra error handling code is needed. + +--- + +## Visibility CHECK constraint + +ADR 0002 D7 specifies the tri-state enum: + +``` +visibility IN ('private', 'group', 'public') +``` + +This sub-plan includes that CHECK on the `knowledge_nodes` table (see 0001_init.up.sql +above) on both sides: + +- Postgres: `CHECK (visibility IN ('private', 'group', 'public'))` inline on + the table. +- SQLite: same CHECK constraint can be added to V15 if desired. (It is not + in the V15 body above because adding a CHECK via ALTER TABLE on SQLite + requires a table rebuild; we trust the application layer for SQLite, since + SQLite never serves the multi-user query path in Phase 2.) + +The stronger consistency rule from the ADR 0002 follow-ups section, + +``` +CHECK ( + visibility = 'private' + OR cardinality(shared_with_groups) > 0 + OR visibility = 'public' +) +``` + +is intentionally NOT added in this sub-plan. Rationale: + +- The rule is a "no orphan group rows" sanity check, not a correctness + requirement for Phase 2 (single-user mode never touches the column). +- Phase 3 is the first phase that writes `visibility = 'group'`. The check + belongs in the Phase 3 migration that lights up auth, alongside the + application code that ensures `shared_with_groups` is populated before + the visibility flips. +- Adding it now and discovering Phase 3 wants a different shape forces an + online CHECK constraint replacement. + +Recommendation: include only the IN check in Phase 2; revisit the +cardinality check in Phase 3. + +--- + +## Offline sqlx cache + +`crates/vestige-core/.sqlx/` is the on-disk cache of compile-time-checked +queries that `sqlx::query!` / `sqlx::query_as!` emit at build time when +`SQLX_OFFLINE=true`. It is committed to the repo so builds without +`DATABASE_URL` (CI, downstream consumers, contributors without Postgres) +succeed. + +This sub-plan does NOT yet generate or commit `.sqlx/` content. Reasons: + +- `sqlx::query!` calls are introduced in `0002d-store-impl-bodies.md` (real + CRUD bodies) and `0002e-hybrid-search.md` (RRF). This sub-plan ships only + the migrations directory and a wrapper that uses `sqlx::migrate!` -- which + is a compile-time macro that reads files, not a query macro that needs a + DB connection. +- Generating an empty `.sqlx/` directory now is noise that gets immediately + overwritten in the next sub-plan. + +Sub-plan 0002d will land the procedure: + +```sh +# Local dev box with vestige DB initialised per local-dev-postgres-setup.md. +export DATABASE_URL="postgresql://vestige:$(cat ~/.vestige_pg_pw)@127.0.0.1:5432/vestige" + +# Apply migrations against the dev DB. +cargo sqlx migrate run \ + --source crates/vestige-core/migrations/postgres \ + --database-url "$DATABASE_URL" + +# Generate the offline cache. +cargo sqlx prepare --workspace -- --features postgres-backend + +# Verify cache compiles offline. +SQLX_OFFLINE=true cargo check --workspace --features postgres-backend +``` + +The `.sqlx/` directory commit policy is: committed, reviewed in PRs that +add or change `query!` calls, regenerated locally and pushed. + +What this sub-plan DOES need from sqlx-cli, for verification only (see next +section): `cargo sqlx migrate run --source crates/vestige-core/migrations/postgres`. + +--- + +## Verification + +Two halves: Postgres migrations run cleanly on a fresh DB; SQLite V15 does +not break the Phase 1 store. + +### Postgres + +Prerequisites: Postgres 18 with pgvector, a role with CREATEDB and EXTENSION +rights, per `docs/plans/local-dev-postgres-setup.md`. Alternatively, a +container: + +```sh +podman run --rm -d --name vestige-pg \ + -e POSTGRES_PASSWORD=devpw \ + -e POSTGRES_USER=vestige \ + -e POSTGRES_DB=vestige \ + -p 5432:5432 \ + docker.io/pgvector/pgvector:pg16 + +export DATABASE_URL="postgresql://vestige:devpw@127.0.0.1:5432/vestige" +``` + +Steps: + +1. Apply migrations. From the repo root: + + ```sh + cargo install sqlx-cli --no-default-features --features postgres + cargo sqlx migrate run \ + --source crates/vestige-core/migrations/postgres \ + --database-url "$DATABASE_URL" + ``` + + Expected output: `Applied 1/migrate init` (`0002` is gated on typmod; + sqlx-cli will run it and pgvector will reject the HNSW creation with + "column does not have dimensions". This is the expected behaviour when + running migrations without going through the Rust connect path. To run + 0002 manually for verification, first stamp the typmod: + + ```sh + psql "$DATABASE_URL" -c "ALTER TABLE knowledge_nodes ALTER COLUMN embedding TYPE vector(768);" + cargo sqlx migrate run \ + --source crates/vestige-core/migrations/postgres \ + --database-url "$DATABASE_URL" + ``` + + Now 0002 should apply.) + +2. Verify tables exist: + + ```sh + psql "$DATABASE_URL" -c "\dt" + ``` + + Expected (alphabetical): + ``` + domains + edges + embedding_model + group_memberships + groups + knowledge_nodes + review_events + scheduling + users + ``` + +3. Verify the bootstrap user row: + + ```sh + psql "$DATABASE_URL" -c "SELECT id, handle, display_name FROM users;" + ``` + + Expected: + ``` + id | handle | display_name + --------------------------------------+--------+-------------- + 00000000-0000-0000-0000-000000000001 | local | Local User + ``` + +4. Verify HNSW index (only after the typmod stamp + migrate 0002): + + ```sh + psql "$DATABASE_URL" -c "\d knowledge_nodes" + ``` + + The trailing `Indexes:` block should include `idx_knowledge_nodes_embedding_hnsw`. + +5. Verify the D7+D8 columns are present: + + ```sh + psql "$DATABASE_URL" -c " + SELECT column_name, data_type, column_default + FROM information_schema.columns + WHERE table_name = 'knowledge_nodes' + AND column_name IN ('owner_user_id', 'visibility', + 'shared_with_groups', 'codebase') + ORDER BY column_name; + " + ``` + + Expected: four rows, with `owner_user_id` defaulting to the bootstrap + UUID, `visibility` to `'private'::text`, `shared_with_groups` to + `'{}'::uuid[]`, `codebase` NULL-default. + +6. Verify CHECK constraint: + + ```sh + psql "$DATABASE_URL" -c " + INSERT INTO knowledge_nodes (content, visibility) VALUES ('test', 'bogus'); + " + # Expected: ERROR: new row for relation \"knowledge_nodes\" violates check constraint + ``` + +7. Roll back to verify down migrations work: + + ```sh + cargo sqlx migrate revert \ + --source crates/vestige-core/migrations/postgres \ + --database-url "$DATABASE_URL" + cargo sqlx migrate revert \ + --source crates/vestige-core/migrations/postgres \ + --database-url "$DATABASE_URL" + ``` + + `\dt` should then list only the sqlx-managed `_sqlx_migrations` table. + +8. Rust-side smoke test (no `sqlx::query!` calls yet, so cannot live in + a `#[sqlx::test]`-decorated function until 0002d). Manual: + + ```sh + cargo build -p vestige-core --features postgres-backend + ``` + + Should compile. The `sqlx::migrate!("./migrations/postgres")` macro + reads the directory at compile time; a missing file or syntax error + surfaces as a compile error. + +### SQLite + +1. Run the existing test suite: + + ```sh + cargo test -p vestige-core + ``` + + Expected: 352 (or current count + new V15 tests) tests pass, zero + warnings. + +2. New test in `migrations.rs#tests`: + + ```rust + #[test] + fn test_v15_advances_to_15_and_adds_d7_d8_columns() { + let conn = rusqlite::Connection::open_in_memory().expect("open in-memory"); + apply_migrations(&conn).expect("apply_migrations succeeds"); + + let version = get_current_version(&conn).expect("read schema_version"); + assert_eq!(version, 15, "schema_version should advance to 15"); + + // Tables exist + for tbl in ["users", "groups", "group_memberships"] { + let n: i32 = conn.query_row( + "SELECT COUNT(*) FROM sqlite_master WHERE type='table' AND name=?1", + [tbl], + |r| r.get(0), + ).expect("query sqlite_master"); + assert_eq!(n, 1, "table {tbl} should exist after V15"); + } + + // Bootstrap user row exists + let n: i32 = conn.query_row( + "SELECT COUNT(*) FROM users WHERE id = '00000000-0000-0000-0000-000000000001'", + [], + |r| r.get(0), + ).expect("query users"); + assert_eq!(n, 1, "bootstrap local user row should exist"); + + // D7+D8 columns on knowledge_nodes + let cols: Vec = conn + .prepare("PRAGMA table_info(knowledge_nodes)") + .unwrap() + .query_map([], |r| r.get::<_, String>(1)) + .unwrap() + .collect::>() + .unwrap(); + for c in ["owner_user_id", "visibility", "shared_with_groups", "codebase"] { + assert!(cols.iter().any(|x| x == c), + "knowledge_nodes should have column {c}"); + } + } + ``` + +3. Idempotency: re-applying V15 on an already-V15 DB must not error. + `apply_migrations` already skips when `current_version >= migration.version`; + no extra test needed beyond ensuring the V14 + V15 ALTER pattern works. + +4. Existing-data backfill smoke: insert a row before applying V15, then + verify the defaults populate: + + ```rust + #[test] + fn test_v15_backfills_existing_rows_with_defaults() { + let conn = rusqlite::Connection::open_in_memory().expect("open"); + + // Apply migrations through V14 only. + // (We rely on the fact that re-running apply_migrations is a no-op, + // so we apply all, then probe the columns. The V15 ALTER on a + // populated table is what we are testing implicitly.) + apply_migrations(&conn).expect("V1-V15"); + + // Insert a row using only Phase 1 columns; V15 defaults must + // populate owner_user_id / visibility / shared_with_groups / codebase. + conn.execute( + "INSERT INTO knowledge_nodes (id, content, node_type, created_at, updated_at, last_accessed) + VALUES ('test', 'hello', 'fact', datetime('now'), datetime('now'), datetime('now'))", + [], + ).expect("insert"); + + let (owner, vis, shared, codebase): (String, String, String, Option) = + conn.query_row( + "SELECT owner_user_id, visibility, shared_with_groups, codebase + FROM knowledge_nodes WHERE id = 'test'", + [], + |r| Ok((r.get(0)?, r.get(1)?, r.get(2)?, r.get(3)?)), + ).expect("query"); + + assert_eq!(owner, "00000000-0000-0000-0000-000000000001"); + assert_eq!(vis, "private"); + assert_eq!(shared, "[]"); + assert_eq!(codebase, None); + } + ``` + +5. Live deployment: apply V15 to a copy of `~/.vestige/vestige.db` and + verify the existing 150 memories all carry the four new columns with + default values: + + ```sh + cp ~/.vestige/vestige.db /tmp/v15-test.db + sqlite3 /tmp/v15-test.db <<'SQL' + .schema knowledge_nodes + SELECT COUNT(*) FROM knowledge_nodes; + SELECT DISTINCT owner_user_id, visibility, shared_with_groups + FROM knowledge_nodes LIMIT 5; + SQL + # (Migration applies on first read by the vestige binary running V15.) + ``` + + Capture pre- and post-counts. Expected: no row count change, all new + columns populated by defaults. + +--- + +## Acceptance criteria + +- [ ] `crates/vestige-core/migrations/postgres/` directory contains exactly + four files: `0001_init.up.sql`, `0001_init.down.sql`, + `0002_hnsw.up.sql`, `0002_hnsw.down.sql`. Content matches this + sub-plan. +- [ ] `crates/vestige-core/src/storage/postgres/migrations.rs` exports + `run_pre_register` and `run_post_register` as `pub(crate)` async + functions returning `MemoryStoreResult<()>`. Compiles with + `--features postgres-backend`. +- [ ] `PgMemoryStore::connect` (sub-plan 0002a skeleton) is updated to call + `run_pre_register` immediately after pool construction. `connect` + still returns before `register_model` runs; `run_post_register` + lands in 0002d via `finalize_schema`. +- [ ] `crates/vestige-core/src/storage/migrations.rs` has a new V15 entry + in `MIGRATIONS`, with `MIGRATION_V15_UP`, `MIGRATION_V15_ALTER_COLUMNS`, + and `MIGRATION_V15_INDEXES` constants. `apply_migrations` handles + V15 the same shape as V14. +- [ ] `cargo test -p vestige-core` passes. New tests cover V15 advance, + D7+D8 column existence, bootstrap user row, and existing-row backfill. +- [ ] `cargo build -p vestige-core --features postgres-backend` compiles + (the `sqlx::migrate!` macro will fail at compile time if any of the + four SQL files is missing or malformed). +- [ ] `cargo sqlx migrate run --source crates/vestige-core/migrations/postgres` + against a fresh container applies 0001 cleanly; `\dt` lists the nine + Phase 2 tables; `users` contains the bootstrap row. +- [ ] After the manual typmod stamp documented above, `cargo sqlx migrate + run` applies 0002 and `\d knowledge_nodes` shows `idx_knowledge_nodes_embedding_hnsw`. +- [ ] `cargo sqlx migrate revert` twice cleans the DB back to only the + `_sqlx_migrations` table. +- [ ] Inserting a row with `visibility = 'bogus'` is rejected by the CHECK + constraint. +- [ ] No `sqlx::query!` / `sqlx::query_as!` calls are added in this + sub-plan; the `.sqlx/` offline cache is not yet generated. +- [ ] The existing live SQLite DB on the development machine migrates from + V14 to V15 without row count change, and the 150 existing rows all + receive the four V15 default values. diff --git a/docs/plans/0002d-store-impl-bodies.md b/docs/plans/0002d-store-impl-bodies.md new file mode 100644 index 0000000..ad1d9b7 --- /dev/null +++ b/docs/plans/0002d-store-impl-bodies.md @@ -0,0 +1,1771 @@ +# Phase 2 Sub-Plan 0002d -- Store Implementation Bodies + +**Status**: Ready +**Depends on**: +- `0002a-skeleton-and-feature-gate.md` -- `PgMemoryStore` struct + trait impl block exist with `todo!()` bodies. +- `0002b-pool-and-config.md` -- `PgPool` is constructable, `MemoryStoreError::Postgres` and `MemoryStoreError::Migrate` variants exist behind the `postgres-backend` feature. +- `0002c-migrations.md` -- the two sqlx migrations (`0001_init`, `0002_hnsw`) exist, the schema is applied on `connect`, the `knowledge_nodes` / `scheduling` / `edges` / `domains` / `embedding_model` / `users` / `groups` / `group_memberships` / `review_events` tables exist with the D7+D8 columns. + +This sub-plan replaces every `todo!()` in +`crates/vestige-core/src/storage/postgres/mod.rs` with a real sqlx-backed +body, and adds `crates/vestige-core/src/storage/postgres/registry.rs` with +the `ensure_registry` / `register_model` typmod-stamping logic. + +The hybrid `search()` method is the meatiest single body in the backend +(RRF in one SQL statement) and lives in its own sub-plan +(`0002e-hybrid-search.md`). The bodies for the trivial single-branch +variants `fts_search` and `vector_search` are still inside this sub-plan +because they share row-mapping infrastructure with the CRUD bodies. + +Out of scope for this sub-plan: +- The full hybrid `search()` -- see `0002e-hybrid-search.md`. +- SQLite -> Postgres migrate CLI -- see `0002f-migrate-cli.md`. +- Re-embed flow -- see `0002g-reembed.md`. +- Phase 3 visibility filter -- explicitly NOT wired in Phase 2; see the + "Visibility filter posture" section below. + +--- + +## Context + +The Phase 1 `MemoryStore` trait surface is defined in +`crates/vestige-core/src/storage/memory_store.rs` and is the source of +truth for method signatures. ADR 0002 D7 added owner / visibility / +shared_with_groups columns to the `knowledge_nodes` table; ADR 0002 D8 promoted +`codebase` to a first-class column. The sqlx bodies in this sub-plan must +write to and read from those columns, but per ADR 0002 D7 they must NOT +filter on them in Phase 2 -- the visibility filter is a Phase 3 +deliverable that takes an `AuthContext` parameter. + +The semantics of every body must match the SQLite backend's current +behaviour. Where Postgres has native types (`UUID`, `JSONB`, `vector`, +`TEXT[]`, `TIMESTAMPTZ`) we use them directly; the SQLite backend's +RFC3339-string-and-JSON-blob encoding is an artefact of SQLite typing, +not the trait contract. + +Compile-time SQL validation uses sqlx's `query!` / `query_as!` macros. +The first time these macros run against a real database in CI they +populate `.sqlx/` query metadata; the metadata file is committed so +offline builds (CI without a live Postgres) succeed. + +--- + +## MemoryRecord type changes + +ADR 0002 D7 and D8 added four new columns to the `knowledge_nodes` table. +The `MemoryRecord` struct in +`crates/vestige-core/src/storage/memory_store.rs` must grow matching +fields so the trait surface can carry the data through both backends. +This is an additive change to the public type. + +Add to `MemoryRecord` (after the existing `metadata` field): + +```rust +/// Owner of this memory. Defaults to the local bootstrap user +/// (`00000000-0000-0000-0000-000000000001`) in single-user mode. +pub owner_user_id: Uuid, + +/// Tri-state visibility. ADR 0002 D7. +pub visibility: Visibility, + +/// Group IDs this memory is shared with when `visibility == Group`. +/// Empty for `Private` and `Public`. +pub shared_with_groups: Vec, + +/// First-class codebase tag. ADR 0002 D8. None if the ingest pipeline +/// could not infer one. +pub codebase: Option, +``` + +Add a new enum next to `MemoryRecord`: + +```rust +#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)] +#[serde(rename_all = "lowercase")] +pub enum Visibility { + Private, + Group, + Public, +} + +impl Visibility { + pub fn as_str(&self) -> &'static str { + match self { + Self::Private => "private", + Self::Group => "group", + Self::Public => "public", + } + } + + pub fn from_str(s: &str) -> MemoryStoreResult { + match s { + "private" => Ok(Self::Private), + "group" => Ok(Self::Group), + "public" => Ok(Self::Public), + other => Err(MemoryStoreError::Backend( + format!("unknown visibility value: {other}"), + )), + } + } +} + +impl Default for Visibility { + fn default() -> Self { Self::Private } +} +``` + +`MemoryRecord` already derives `Serialize` and `Deserialize`; the new +fields ride along automatically. Two callers must change as part of this +sub-plan: + +1. **SQLite backend (V15 migration ships in `0001b-sqlite-split.md` or + the same Phase 1 amendment branch)**: the SQLite backend reads the + four new columns out of `knowledge_nodes` (V15 added them) and + populates the new fields in `Self::node_to_record`. Bootstrap user + ID is the same constant on both backends. Existing call sites that + construct `MemoryRecord` literally (in tests, in cognitive modules) + may default-init the four new fields: + + ```rust + MemoryRecord { + // ... existing fields ... + owner_user_id: LOCAL_USER_ID, + visibility: Visibility::default(), + shared_with_groups: Vec::new(), + codebase: None, + metadata: serde_json::json!({}), + } + ``` + + A single `pub const LOCAL_USER_ID: Uuid = uuid::uuid!("00000000-0000-0000-0000-000000000001");` + in `storage::memory_store` provides the bootstrap constant. + +2. **Cognitive modules that build `MemoryRecord` from the ingest + pipeline**: the ingest path already captures `codebase` in metadata + (see ADR 0002 D8). Lift it from `metadata.codebase` to the new + `codebase` field at the boundary where `MemoryRecord` is built. The + `metadata.codebase` JSON key is removed in the same commit; the + column is now the only source of truth. + +The change is purely additive to the trait surface -- no method +signatures change. Backwards compatibility for stored data (in the +SQLite case) comes from V15 defaulting the new columns to `'private'` +and the bootstrap user. The Postgres schema applies the same defaults +in `0001_init.up.sql`. + +--- + +## Registry module + +New file: `crates/vestige-core/src/storage/postgres/registry.rs`. + +```rust +#![cfg(feature = "postgres-backend")] + +//! Embedding-model registry for the Postgres backend. +//! +//! The `embedding_model` table stores exactly one row (id = 1) describing +//! the model whose vectors live in `knowledge_nodes.embedding`. Phase 2 enforces +//! that the active embedder matches the registered model on every write; +//! re-embed (`0002g-reembed.md`) is the only flow allowed to change the +//! row. +//! +//! The pgvector column `knowledge_nodes.embedding` is created in +//! `0001_init.up.sql` with a placeholder type (`vector`) -- no typmod. +//! On first connect we stamp the real dimension via +//! `ALTER TABLE knowledge_nodes ALTER COLUMN embedding TYPE vector($N)` so the +//! HNSW index (created in `0002_hnsw.up.sql`) sees a sized type. + +use sqlx::PgPool; + +use crate::storage::memory_store::{ + MemoryStoreError, MemoryStoreResult, ModelSignature, +}; + +/// Look up the registered signature, if any. Returns `Ok(None)` on a +/// fresh database. +pub(crate) async fn fetch_registry( + pool: &PgPool, +) -> MemoryStoreResult> { + let row = sqlx::query!( + r#" + SELECT name, dimension, hash + FROM embedding_model + WHERE id = 1 + "# + ) + .fetch_optional(pool) + .await?; + + Ok(row.map(|r| ModelSignature { + name: r.name, + dimension: r.dimension as usize, + hash: r.hash, + })) +} + +/// First-ever call inserts the row and stamps the typmod on +/// `knowledge_nodes.embedding`. Subsequent calls compare against the stored +/// row and return `ModelMismatch` if any field differs. +pub(crate) async fn ensure_registry( + pool: &PgPool, + sig: &ModelSignature, +) -> MemoryStoreResult<()> { + let existing = fetch_registry(pool).await?; + + match existing { + None => { + sqlx::query!( + r#" + INSERT INTO embedding_model (id, name, dimension, hash) + VALUES (1, $1, $2, $3) + "#, + sig.name, + sig.dimension as i32, + sig.hash, + ) + .execute(pool) + .await?; + + stamp_vector_typmod(pool, sig.dimension).await?; + Ok(()) + } + Some(reg) if reg == *sig => Ok(()), + Some(reg) => Err(MemoryStoreError::ModelMismatch { + registered_name: reg.name, + registered_dim: reg.dimension, + registered_hash: reg.hash, + actual_name: sig.name.clone(), + actual_dim: sig.dimension, + actual_hash: sig.hash.clone(), + }), + } +} + +/// Called only by the re-embed flow (`0002g-reembed.md`) after a full +/// re-encode has rewritten every row. Updates the registry row and +/// re-stamps the typmod for the new dimension. +pub(crate) async fn update_registry_for_reembed( + pool: &PgPool, + sig: &ModelSignature, +) -> MemoryStoreResult<()> { + sqlx::query!( + r#" + UPDATE embedding_model + SET name = $1, dimension = $2, hash = $3, created_at = now() + WHERE id = 1 + "#, + sig.name, + sig.dimension as i32, + sig.hash, + ) + .execute(pool) + .await?; + + stamp_vector_typmod(pool, sig.dimension).await?; + Ok(()) +} + +async fn stamp_vector_typmod(pool: &PgPool, dim: usize) -> MemoryStoreResult<()> { + // pgvector's typmod is part of the column type, not a bound parameter. + // `format!` is safe here because `dim` is a `usize` cast to a decimal + // literal; there is no path for user-controlled SQL to reach this + // string. + let ddl = format!( + "ALTER TABLE knowledge_nodes ALTER COLUMN embedding TYPE vector({dim})" + ); + sqlx::query(&ddl).execute(pool).await?; + Ok(()) +} +``` + +Wire the new module into `crates/vestige-core/src/storage/postgres/mod.rs`: + +```rust +pub(crate) mod registry; +``` + +The `fetch_registry` / `ensure_registry` functions are reached from the +trait methods `registered_model` and `register_model` (see method bodies +below). `update_registry_for_reembed` is reached only from +`postgres::reembed`, which is filled in by `0002g-reembed.md`. + +--- + +## Method-by-method bodies + +Every body below replaces a `todo!()` in +`crates/vestige-core/src/storage/postgres/mod.rs`. Method order matches +the trait declaration in `memory_store.rs`. + +Common imports at the top of `mod.rs`: + +```rust +use chrono::{DateTime, Utc}; +use pgvector::Vector; +use uuid::Uuid; + +use crate::storage::memory_store::{ + Domain, HealthStatus, LocalMemoryStore, MemoryEdge, MemoryRecord, + MemoryStoreError, MemoryStoreResult, ModelSignature, SchedulingState, + SearchQuery, SearchResult, StoreStats, Visibility, +}; +``` + +Recurring row-to-record helper (private to `mod.rs`): + +```rust +fn row_to_record( + id: Uuid, + content: String, + node_type: String, + tags: Vec, + domains: Vec, + domain_scores: serde_json::Value, + codebase: Option, + owner_user_id: Uuid, + visibility: String, + shared_with_groups: Vec, + embedding: Option, + metadata: serde_json::Value, + created_at: DateTime, + updated_at: DateTime, +) -> MemoryStoreResult { + let domain_scores: std::collections::HashMap = + serde_json::from_value(domain_scores).unwrap_or_default(); + let embedding = embedding.map(|v| v.to_vec()); + Ok(MemoryRecord { + id, + domains, + domain_scores, + content, + node_type, + tags, + embedding, + created_at, + updated_at, + metadata, + owner_user_id, + visibility: Visibility::from_str(&visibility)?, + shared_with_groups, + codebase, + }) +} +``` + +### Lifecycle + +#### `init` + +```rust +async fn init(&self) -> MemoryStoreResult<()> +``` + +Migrations already ran in `connect`; this is a no-op identical to +SQLite's behaviour. + +```rust +async fn init(&self) -> MemoryStoreResult<()> { + Ok(()) +} +``` + +#### `health_check` + +```rust +async fn health_check(&self) -> MemoryStoreResult +``` + +Issue a trivial `SELECT 1`. Pool acquisition errors degrade to +`HealthStatus::Degraded`; any other error path returns `Unavailable`. + +```rust +async fn health_check(&self) -> MemoryStoreResult { + match sqlx::query_scalar!("SELECT 1::int") + .fetch_one(&self.pool) + .await + { + Ok(_) => Ok(HealthStatus::Healthy), + Err(sqlx::Error::PoolTimedOut) => Ok(HealthStatus::Degraded { + reason: "pool exhausted".to_string(), + }), + Err(e) => Ok(HealthStatus::Unavailable { + reason: e.to_string(), + }), + } +} +``` + +### Embedding-model registry + +#### `registered_model` + +```rust +async fn registered_model(&self) -> MemoryStoreResult> +``` + +Thin pass-through to `registry::fetch_registry`. The Postgres backend +does not cache the row in-memory the way the SQLite backend does -- +sqlx's prepared-statement cache already keeps the SELECT cheap, and +`registered_model` is not on the hot path. + +```rust +async fn registered_model(&self) -> MemoryStoreResult> { + crate::storage::postgres::registry::fetch_registry(&self.pool).await +} +``` + +#### `register_model` + +```rust +async fn register_model(&self, sig: &ModelSignature) -> MemoryStoreResult<()> +``` + +Delegate to `registry::ensure_registry`, which handles the +"insert + stamp typmod" first-run path and the "compare" subsequent path. + +```rust +async fn register_model(&self, sig: &ModelSignature) -> MemoryStoreResult<()> { + crate::storage::postgres::registry::ensure_registry(&self.pool, sig).await +} +``` + +### CRUD + +#### `insert` + +```rust +async fn insert(&self, record: &MemoryRecord) -> MemoryStoreResult +``` + +Single `INSERT` into `knowledge_nodes` with all D7+D8 columns. Bind embedding +as `Option` -- pgvector's sqlx integration handles the +typmod check at execution time, so a length mismatch surfaces as +`MemoryStoreError::Postgres`. The caller-supplied UUID is preserved +(same contract as SQLite). Initial scheduling state is inserted in the +same transaction so a memory is never queryable without a scheduling +row. + +```rust +async fn insert(&self, record: &MemoryRecord) -> MemoryStoreResult { + let embedding: Option = record + .embedding + .as_ref() + .map(|v| Vector::from(v.clone())); + let domain_scores = serde_json::to_value(&record.domain_scores) + .unwrap_or_else(|_| serde_json::json!({})); + + let mut tx = self.pool.begin().await?; + + sqlx::query!( + r#" + INSERT INTO knowledge_nodes ( + id, + owner_user_id, + visibility, + shared_with_groups, + codebase, + content, + node_type, + tags, + domains, + domain_scores, + embedding, + metadata, + created_at, + updated_at + ) + VALUES ( + $1, $2, $3, $4, $5, $6, $7, $8, $9, $10::jsonb, + $11, $12::jsonb, $13, $14 + ) + "#, + record.id, + record.owner_user_id, + record.visibility.as_str(), + &record.shared_with_groups as &[Uuid], + record.codebase.as_deref(), + record.content, + record.node_type, + &record.tags as &[String], + &record.domains as &[String], + domain_scores, + embedding as Option, + record.metadata, + record.created_at, + record.updated_at, + ) + .execute(&mut *tx) + .await?; + + // Seed scheduling state. Mirrors SQLite defaults from `knowledge_nodes` + // (stability=1.0, difficulty=0.3, retrievability=1.0, reps=0, lapses=0, + // next_review = created_at + 1 day). + sqlx::query!( + r#" + INSERT INTO scheduling ( + memory_id, stability, difficulty, retrievability, + last_review, next_review, reps, lapses + ) + VALUES ($1, 1.0, 0.3, 1.0, NULL, $2, 0, 0) + "#, + record.id, + record.created_at + chrono::Duration::days(1), + ) + .execute(&mut *tx) + .await?; + + tx.commit().await?; + Ok(record.id) +} +``` + +Tricky bits: +- `&record.tags as &[String]` -- sqlx requires an explicit slice cast + to bind a `Vec` as `text[]`. +- `&record.shared_with_groups as &[Uuid]` -- same pattern for `uuid[]`. +- `embedding as Option` -- type annotation is mandatory in the + macro because the inference path bottoms out at a generic; pgvector's + `Encode` impl resolves only with a known concrete type. +- The `$10::jsonb` and `$12::jsonb` casts force sqlx to encode through + the JSONB path even if the parameter type-resolves to `JSON`. This + matters because the migrations created the columns as `JSONB`, and + sqlx 0.8 does not always pick JSONB without the cast. + +#### `get` + +```rust +async fn get(&self, id: Uuid) -> MemoryStoreResult> +``` + +`SELECT *` filtered by primary key. Row mapping goes through +`row_to_record`. + +```rust +async fn get(&self, id: Uuid) -> MemoryStoreResult> { + let row = sqlx::query!( + r#" + SELECT + id AS "id!: Uuid", + owner_user_id AS "owner_user_id!: Uuid", + visibility, + shared_with_groups AS "shared_with_groups!: Vec", + codebase, + content, + node_type, + tags AS "tags!: Vec", + domains AS "domains!: Vec", + domain_scores AS "domain_scores!: serde_json::Value", + embedding AS "embedding: Vector", + metadata AS "metadata!: serde_json::Value", + created_at AS "created_at!: DateTime", + updated_at AS "updated_at!: DateTime" + FROM knowledge_nodes + WHERE id = $1 + "#, + id, + ) + .fetch_optional(&self.pool) + .await?; + + let Some(r) = row else { return Ok(None) }; + + Ok(Some(row_to_record( + r.id, r.content, r.node_type, r.tags, r.domains, + r.domain_scores, r.codebase, r.owner_user_id, r.visibility, + r.shared_with_groups, r.embedding, r.metadata, + r.created_at, r.updated_at, + )?)) +} +``` + +The `AS "name!: Type"` annotations tell sqlx the exact Rust type for +each column, which is required for `Vec` (from `uuid[]`) and +`Vector` (from `vector`). The `!` means "trust me, this column is NOT +NULL"; sqlx skips its `Option` wrapping for those columns. The +`embedding` column is nullable, so it gets `Option` (no `!`). + +#### `update` + +```rust +async fn update(&self, record: &MemoryRecord) -> MemoryStoreResult<()> +``` + +Update everything the caller might have changed. `updated_at` is set +server-side via `now()` so clock drift between hosts does not leak into +the timeline. (If the caller wants to forge `updated_at` -- e.g. the +migrate CLI replaying SQLite timestamps -- it goes through `insert`, not +`update`.) The schema's `BEFORE UPDATE` trigger could replace this; we +write `updated_at = now()` explicitly to be backend-agnostic. + +```rust +async fn update(&self, record: &MemoryRecord) -> MemoryStoreResult<()> { + let embedding: Option = record + .embedding + .as_ref() + .map(|v| Vector::from(v.clone())); + let domain_scores = serde_json::to_value(&record.domain_scores) + .unwrap_or_else(|_| serde_json::json!({})); + + let rows = sqlx::query!( + r#" + UPDATE knowledge_nodes SET + owner_user_id = $2, + visibility = $3, + shared_with_groups = $4, + codebase = $5, + content = $6, + node_type = $7, + tags = $8, + domains = $9, + domain_scores = $10::jsonb, + embedding = $11, + metadata = $12::jsonb, + updated_at = now() + WHERE id = $1 + "#, + record.id, + record.owner_user_id, + record.visibility.as_str(), + &record.shared_with_groups as &[Uuid], + record.codebase.as_deref(), + record.content, + record.node_type, + &record.tags as &[String], + &record.domains as &[String], + domain_scores, + embedding as Option, + record.metadata, + ) + .execute(&self.pool) + .await? + .rows_affected(); + + if rows == 0 { + return Err(MemoryStoreError::NotFound(record.id.to_string())); + } + Ok(()) +} +``` + +#### `delete` + +```rust +async fn delete(&self, id: Uuid) -> MemoryStoreResult<()> +``` + +Single `DELETE` by id. `scheduling`, `edges`, and `review_events` all +have `ON DELETE CASCADE` on their `memory_id` foreign key, so this one +statement clears every dependent row. + +```rust +async fn delete(&self, id: Uuid) -> MemoryStoreResult<()> { + let rows = sqlx::query!( + "DELETE FROM knowledge_nodes WHERE id = $1", + id, + ) + .execute(&self.pool) + .await? + .rows_affected(); + + if rows == 0 { + return Err(MemoryStoreError::NotFound(id.to_string())); + } + Ok(()) +} +``` + +### Search (single-branch variants) + +The full hybrid `search` is implemented in `0002e-hybrid-search.md`. +The two single-branch variants below ship in this sub-plan. + +#### `fts_search` + +```rust +async fn fts_search(&self, text: &str, limit: usize) -> MemoryStoreResult> +``` + +PostgreSQL full-text search using the precomputed `search_vec` tsvector +column and `websearch_to_tsquery` (handles bare words, phrases, and +boolean operators). Ranking with `ts_rank_cd` (cover-density) so longer +matches outrank shorter ones; the SQLite backend uses BM25 from FTS5 but +the trait contract only requires "higher is better". + +```rust +async fn fts_search( + &self, + text: &str, + limit: usize, +) -> MemoryStoreResult> { + let limit = limit.min(1000) as i64; + let rows = sqlx::query!( + r#" + SELECT + m.id AS "id!: Uuid", + m.owner_user_id AS "owner_user_id!: Uuid", + m.visibility, + m.shared_with_groups AS "shared_with_groups!: Vec", + m.codebase, + m.content, + m.node_type, + m.tags AS "tags!: Vec", + m.domains AS "domains!: Vec", + m.domain_scores AS "domain_scores!: serde_json::Value", + m.embedding AS "embedding: Vector", + m.metadata AS "metadata!: serde_json::Value", + m.created_at AS "created_at!: DateTime", + m.updated_at AS "updated_at!: DateTime", + ts_rank_cd(m.search_vec, websearch_to_tsquery('english', $1)) + AS "score!: f64" + FROM knowledge_nodes m + WHERE m.search_vec @@ websearch_to_tsquery('english', $1) + ORDER BY score DESC + LIMIT $2 + "#, + text, + limit, + ) + .fetch_all(&self.pool) + .await?; + + let mut out = Vec::with_capacity(rows.len()); + for r in rows { + let rec = row_to_record( + r.id, r.content, r.node_type, r.tags, r.domains, + r.domain_scores, r.codebase, r.owner_user_id, r.visibility, + r.shared_with_groups, r.embedding, r.metadata, + r.created_at, r.updated_at, + )?; + out.push(SearchResult { + record: rec, + score: r.score, + fts_score: Some(r.score), + vector_score: None, + }); + } + Ok(out) +} +``` + +The `'english'` text-search configuration matches the GIN index built in +`0001_init.up.sql`. If a future migration parameterises the config, both +the index and this query change together. + +#### `vector_search` + +```rust +async fn vector_search(&self, embedding: &[f32], limit: usize) -> MemoryStoreResult> +``` + +pgvector cosine distance. The HNSW index on `embedding` (built in +`0002_hnsw.up.sql` with `vector_cosine_ops`) makes the `<=>` operator +index-accelerated. We convert the returned distance (0 = identical, 2 = +opposite for cosine on normalized vectors) to a similarity in `[0, 1]` +via `1 - distance`; this matches the SQLite backend's convention. + +```rust +async fn vector_search( + &self, + embedding: &[f32], + limit: usize, +) -> MemoryStoreResult> { + let query_vec = Vector::from(embedding.to_vec()); + let limit = limit.min(1000) as i64; + + let rows = sqlx::query!( + r#" + SELECT + m.id AS "id!: Uuid", + m.owner_user_id AS "owner_user_id!: Uuid", + m.visibility, + m.shared_with_groups AS "shared_with_groups!: Vec", + m.codebase, + m.content, + m.node_type, + m.tags AS "tags!: Vec", + m.domains AS "domains!: Vec", + m.domain_scores AS "domain_scores!: serde_json::Value", + m.embedding AS "embedding: Vector", + m.metadata AS "metadata!: serde_json::Value", + m.created_at AS "created_at!: DateTime", + m.updated_at AS "updated_at!: DateTime", + (1.0 - (m.embedding <=> $1)) AS "score!: f64" + FROM knowledge_nodes m + WHERE m.embedding IS NOT NULL + ORDER BY m.embedding <=> $1 + LIMIT $2 + "#, + query_vec as Vector, + limit, + ) + .fetch_all(&self.pool) + .await?; + + let mut out = Vec::with_capacity(rows.len()); + for r in rows { + let rec = row_to_record( + r.id, r.content, r.node_type, r.tags, r.domains, + r.domain_scores, r.codebase, r.owner_user_id, r.visibility, + r.shared_with_groups, r.embedding, r.metadata, + r.created_at, r.updated_at, + )?; + out.push(SearchResult { + record: rec, + score: r.score, + fts_score: None, + vector_score: Some(r.score), + }); + } + Ok(out) +} +``` + +The `query_vec as Vector` cast is the same type-annotation trick as +`insert` -- sqlx needs the concrete pgvector type to wire up encoding. +The `ORDER BY m.embedding <=> $1` (no `score`) is intentional: it lets +the HNSW index serve the query directly. Sorting by the computed +`score` column would force a sequential scan because the index orders +by distance, not similarity. + +### Scheduling + +The Postgres `scheduling` table is a separate row keyed on `memory_id`, +not embedded in `knowledge_nodes` (unlike SQLite where FSRS columns live on +`knowledge_nodes`). The bodies abstract that difference at the SQL +boundary; callers see the same `SchedulingState` value. + +#### `get_scheduling` + +```rust +async fn get_scheduling(&self, memory_id: Uuid) -> MemoryStoreResult> +``` + +```rust +async fn get_scheduling( + &self, + memory_id: Uuid, +) -> MemoryStoreResult> { + let row = sqlx::query!( + r#" + SELECT + memory_id AS "memory_id!: Uuid", + stability AS "stability!: f64", + difficulty AS "difficulty!: f64", + retrievability AS "retrievability!: f64", + last_review AS "last_review: DateTime", + next_review AS "next_review: DateTime", + reps AS "reps!: i32", + lapses AS "lapses!: i32" + FROM scheduling + WHERE memory_id = $1 + "#, + memory_id, + ) + .fetch_optional(&self.pool) + .await?; + + Ok(row.map(|r| SchedulingState { + memory_id: r.memory_id, + stability: r.stability, + difficulty: r.difficulty, + retrievability: r.retrievability, + last_review: r.last_review, + next_review: r.next_review, + reps: r.reps as u32, + lapses: r.lapses as u32, + })) +} +``` + +#### `update_scheduling` + +```rust +async fn update_scheduling(&self, state: &SchedulingState) -> MemoryStoreResult<()> +``` + +Upsert -- the `INSERT ... ON CONFLICT DO UPDATE` form -- so cognitive +modules that update scheduling for a freshly-inserted memory don't have +to race with the seed row from `insert`. + +```rust +async fn update_scheduling( + &self, + state: &SchedulingState, +) -> MemoryStoreResult<()> { + sqlx::query!( + r#" + INSERT INTO scheduling ( + memory_id, stability, difficulty, retrievability, + last_review, next_review, reps, lapses + ) + VALUES ($1, $2, $3, $4, $5, $6, $7, $8) + ON CONFLICT (memory_id) DO UPDATE SET + stability = EXCLUDED.stability, + difficulty = EXCLUDED.difficulty, + retrievability = EXCLUDED.retrievability, + last_review = EXCLUDED.last_review, + next_review = EXCLUDED.next_review, + reps = EXCLUDED.reps, + lapses = EXCLUDED.lapses + "#, + state.memory_id, + state.stability, + state.difficulty, + state.retrievability, + state.last_review, + state.next_review, + state.reps as i32, + state.lapses as i32, + ) + .execute(&self.pool) + .await?; + Ok(()) +} +``` + +#### `get_due_memories` + +```rust +async fn get_due_memories( + &self, + before: DateTime, + limit: usize, +) -> MemoryStoreResult> +``` + +Join `knowledge_nodes` and `scheduling`, filter on `next_review <= before`. +Single query returns both halves of the tuple. + +```rust +async fn get_due_memories( + &self, + before: DateTime, + limit: usize, +) -> MemoryStoreResult> { + let limit = limit.min(10_000) as i64; + let rows = sqlx::query!( + r#" + SELECT + m.id AS "id!: Uuid", + m.owner_user_id AS "owner_user_id!: Uuid", + m.visibility, + m.shared_with_groups AS "shared_with_groups!: Vec", + m.codebase, + m.content, + m.node_type, + m.tags AS "tags!: Vec", + m.domains AS "domains!: Vec", + m.domain_scores AS "domain_scores!: serde_json::Value", + m.embedding AS "embedding: Vector", + m.metadata AS "metadata!: serde_json::Value", + m.created_at AS "created_at!: DateTime", + m.updated_at AS "updated_at!: DateTime", + s.stability AS "stability!: f64", + s.difficulty AS "difficulty!: f64", + s.retrievability AS "retrievability!: f64", + s.last_review AS "last_review: DateTime", + s.next_review AS "next_review: DateTime", + s.reps AS "reps!: i32", + s.lapses AS "lapses!: i32" + FROM knowledge_nodes m + JOIN scheduling s ON s.memory_id = m.id + WHERE s.next_review IS NOT NULL AND s.next_review <= $1 + ORDER BY s.next_review ASC + LIMIT $2 + "#, + before, + limit, + ) + .fetch_all(&self.pool) + .await?; + + let mut out = Vec::with_capacity(rows.len()); + for r in rows { + let rec = row_to_record( + r.id, r.content, r.node_type, r.tags, r.domains, + r.domain_scores, r.codebase, r.owner_user_id, r.visibility, + r.shared_with_groups, r.embedding, r.metadata, + r.created_at, r.updated_at, + )?; + let state = SchedulingState { + memory_id: rec.id, + stability: r.stability, + difficulty: r.difficulty, + retrievability: r.retrievability, + last_review: r.last_review, + next_review: r.next_review, + reps: r.reps as u32, + lapses: r.lapses as u32, + }; + out.push((rec, state)); + } + Ok(out) +} +``` + +### Graph (edges) + +#### `add_edge` + +```rust +async fn add_edge(&self, edge: &MemoryEdge) -> MemoryStoreResult<()> +``` + +`INSERT ... ON CONFLICT` -- updating the weight if an edge already +exists (matches SQLite's `save_connection` semantics). + +```rust +async fn add_edge(&self, edge: &MemoryEdge) -> MemoryStoreResult<()> { + sqlx::query!( + r#" + INSERT INTO edges ( + source_id, target_id, edge_type, weight, created_at + ) + VALUES ($1, $2, $3, $4, $5) + ON CONFLICT (source_id, target_id, edge_type) DO UPDATE SET + weight = EXCLUDED.weight + "#, + edge.source_id, + edge.target_id, + edge.edge_type, + edge.weight, + edge.created_at, + ) + .execute(&self.pool) + .await?; + Ok(()) +} +``` + +#### `get_edges` + +```rust +async fn get_edges( + &self, + node_id: Uuid, + edge_type: Option<&str>, +) -> MemoryStoreResult> +``` + +Return every edge incident to `node_id` in either direction, optionally +filtered by `edge_type`. The optional filter binds as nullable; `$2 IS +NULL OR edge_type = $2` keeps the prepared statement reusable. + +```rust +async fn get_edges( + &self, + node_id: Uuid, + edge_type: Option<&str>, +) -> MemoryStoreResult> { + let rows = sqlx::query!( + r#" + SELECT + source_id AS "source_id!: Uuid", + target_id AS "target_id!: Uuid", + edge_type, + weight AS "weight!: f64", + created_at AS "created_at!: DateTime" + FROM edges + WHERE (source_id = $1 OR target_id = $1) + AND ($2::text IS NULL OR edge_type = $2) + "#, + node_id, + edge_type, + ) + .fetch_all(&self.pool) + .await?; + + Ok(rows + .into_iter() + .map(|r| MemoryEdge { + source_id: r.source_id, + target_id: r.target_id, + edge_type: r.edge_type, + weight: r.weight, + created_at: r.created_at, + }) + .collect()) +} +``` + +#### `remove_edge` + +```rust +async fn remove_edge(&self, source: Uuid, target: Uuid) -> MemoryStoreResult<()> +``` + +Note: the live trait signature is two args (`source`, `target`). The +master plan's stale three-arg signature including `edge_type` is not +implemented -- the live trait surface wins. Deletes every edge between +the pair regardless of `edge_type`. + +```rust +async fn remove_edge( + &self, + source: Uuid, + target: Uuid, +) -> MemoryStoreResult<()> { + sqlx::query!( + "DELETE FROM edges WHERE source_id = $1 AND target_id = $2", + source, + target, + ) + .execute(&self.pool) + .await?; + Ok(()) +} +``` + +#### `get_neighbors` + +```rust +async fn get_neighbors( + &self, + node_id: Uuid, + depth: usize, +) -> MemoryStoreResult> +``` + +Recursive CTE walks the edge graph outward from `node_id` for up to +`depth` hops. Weights compound multiplicatively along the path (same as +SQLite BFS). Cap the visited set at 256 rows to match SQLite. Direction +is treated as undirected by unioning both halves of each edge inside +the CTE. + +```rust +async fn get_neighbors( + &self, + node_id: Uuid, + depth: usize, +) -> MemoryStoreResult> { + if depth == 0 { + let Some(rec) = self.get(node_id).await? else { + return Err(MemoryStoreError::NotFound(node_id.to_string())); + }; + return Ok(vec![(rec, 1.0)]); + } + + let depth_i = depth.min(16) as i32; + let rows = sqlx::query!( + r#" + WITH RECURSIVE walk(node_id, weight, hops) AS ( + SELECT $1::uuid, 1.0::float8, 0 + UNION ALL + SELECT + CASE WHEN e.source_id = w.node_id THEN e.target_id + ELSE e.source_id END, + w.weight * e.weight, + w.hops + 1 + FROM walk w + JOIN edges e + ON e.source_id = w.node_id OR e.target_id = w.node_id + WHERE w.hops < $2 + ), + best AS ( + SELECT node_id, MAX(weight) AS weight + FROM walk + GROUP BY node_id + LIMIT 256 + ) + SELECT + m.id AS "id!: Uuid", + m.owner_user_id AS "owner_user_id!: Uuid", + m.visibility, + m.shared_with_groups AS "shared_with_groups!: Vec", + m.codebase, + m.content, + m.node_type, + m.tags AS "tags!: Vec", + m.domains AS "domains!: Vec", + m.domain_scores AS "domain_scores!: serde_json::Value", + m.embedding AS "embedding: Vector", + m.metadata AS "metadata!: serde_json::Value", + m.created_at AS "created_at!: DateTime", + m.updated_at AS "updated_at!: DateTime", + b.weight AS "weight!: f64" + FROM best b + JOIN knowledge_nodes m ON m.id = b.node_id + "#, + node_id, + depth_i, + ) + .fetch_all(&self.pool) + .await?; + + let mut out = Vec::with_capacity(rows.len()); + for r in rows { + let rec = row_to_record( + r.id, r.content, r.node_type, r.tags, r.domains, + r.domain_scores, r.codebase, r.owner_user_id, r.visibility, + r.shared_with_groups, r.embedding, r.metadata, + r.created_at, r.updated_at, + )?; + out.push((rec, r.weight)); + } + Ok(out) +} +``` + +The CTE can visit a node multiple times via different paths; the `best` +sub-CTE picks the highest weight per node. The `LIMIT 256` matches the +SQLite BFS cap. Postgres' recursive CTE is breadth-first by hop count +because of the `WHERE w.hops < $2` predicate. + +### Domains (Phase 4 populates; Phase 2 ships CRUD) + +The `domains` table is empty in Phase 2; these methods exist so the +trait surface is complete but they do not get exercised until Phase 4 +HDBSCAN clustering runs. + +#### `list_domains` + +```rust +async fn list_domains(&self) -> MemoryStoreResult> +``` + +```rust +async fn list_domains(&self) -> MemoryStoreResult> { + let rows = sqlx::query!( + r#" + SELECT + id, + label, + centroid AS "centroid: Vector", + top_terms AS "top_terms!: Vec", + memory_count AS "memory_count!: i64", + created_at AS "created_at!: DateTime" + FROM domains + ORDER BY created_at ASC + "# + ) + .fetch_all(&self.pool) + .await?; + + Ok(rows + .into_iter() + .map(|r| Domain { + id: r.id, + label: r.label, + centroid: r.centroid.map(|v| v.to_vec()).unwrap_or_default(), + top_terms: r.top_terms, + memory_count: r.memory_count as usize, + created_at: r.created_at, + }) + .collect()) +} +``` + +#### `get_domain` + +```rust +async fn get_domain(&self, id: &str) -> MemoryStoreResult> +``` + +```rust +async fn get_domain(&self, id: &str) -> MemoryStoreResult> { + let row = sqlx::query!( + r#" + SELECT + id, + label, + centroid AS "centroid: Vector", + top_terms AS "top_terms!: Vec", + memory_count AS "memory_count!: i64", + created_at AS "created_at!: DateTime" + FROM domains + WHERE id = $1 + "#, + id, + ) + .fetch_optional(&self.pool) + .await?; + + Ok(row.map(|r| Domain { + id: r.id, + label: r.label, + centroid: r.centroid.map(|v| v.to_vec()).unwrap_or_default(), + top_terms: r.top_terms, + memory_count: r.memory_count as usize, + created_at: r.created_at, + })) +} +``` + +#### `upsert_domain` + +```rust +async fn upsert_domain(&self, domain: &Domain) -> MemoryStoreResult<()> +``` + +```rust +async fn upsert_domain(&self, domain: &Domain) -> MemoryStoreResult<()> { + let centroid = if domain.centroid.is_empty() { + None + } else { + Some(Vector::from(domain.centroid.clone())) + }; + + sqlx::query!( + r#" + INSERT INTO domains ( + id, label, centroid, top_terms, memory_count, created_at + ) + VALUES ($1, $2, $3, $4, $5, $6) + ON CONFLICT (id) DO UPDATE SET + label = EXCLUDED.label, + centroid = EXCLUDED.centroid, + top_terms = EXCLUDED.top_terms, + memory_count = EXCLUDED.memory_count + "#, + domain.id, + domain.label, + centroid as Option, + &domain.top_terms as &[String], + domain.memory_count as i64, + domain.created_at, + ) + .execute(&self.pool) + .await?; + Ok(()) +} +``` + +#### `delete_domain` + +```rust +async fn delete_domain(&self, id: &str) -> MemoryStoreResult<()> +``` + +```rust +async fn delete_domain(&self, id: &str) -> MemoryStoreResult<()> { + sqlx::query!( + "DELETE FROM domains WHERE id = $1", + id, + ) + .execute(&self.pool) + .await?; + Ok(()) +} +``` + +#### `classify` + +```rust +async fn classify(&self, embedding: &[f32]) -> MemoryStoreResult> +``` + +The Postgres backend can ship this as a single SQL query against the +empty `domains` table -- it correctly returns an empty vector in Phase +2 and starts returning real scores in Phase 4 without any code change. + +```rust +async fn classify( + &self, + embedding: &[f32], +) -> MemoryStoreResult> { + let query_vec = Vector::from(embedding.to_vec()); + let rows = sqlx::query!( + r#" + SELECT + id, + (1.0 - (centroid <=> $1)) AS "score!: f64" + FROM domains + WHERE centroid IS NOT NULL + ORDER BY score DESC + "#, + query_vec as Vector, + ) + .fetch_all(&self.pool) + .await?; + + Ok(rows.into_iter().map(|r| (r.id, r.score)).collect()) +} +``` + +### Bulk / maintenance + +#### `count` + +```rust +async fn count(&self) -> MemoryStoreResult +``` + +```rust +async fn count(&self) -> MemoryStoreResult { + let n: i64 = sqlx::query_scalar!("SELECT COUNT(*) FROM knowledge_nodes") + .fetch_one(&self.pool) + .await? + .unwrap_or(0); + Ok(n as usize) +} +``` + +#### `get_stats` + +```rust +async fn get_stats(&self) -> MemoryStoreResult +``` + +Aggregate counts across `knowledge_nodes`, `edges`, `domains`. Read the +registry inline. + +```rust +async fn get_stats(&self) -> MemoryStoreResult { + let row = sqlx::query!( + r#" + SELECT + (SELECT COUNT(*) FROM knowledge_nodes) + AS "total_memories!: i64", + (SELECT COUNT(*) FROM knowledge_nodes WHERE embedding IS NOT NULL) + AS "memories_with_embeddings!: i64", + (SELECT COUNT(*) FROM edges) + AS "total_edges!: i64", + (SELECT COUNT(*) FROM domains) + AS "total_domains!: i64", + (SELECT name FROM embedding_model WHERE id = 1) + AS "registered_model_name: String", + (SELECT dimension FROM embedding_model WHERE id = 1) + AS "registered_model_dim: i32" + "# + ) + .fetch_one(&self.pool) + .await?; + + Ok(StoreStats { + total_memories: row.total_memories as usize, + memories_with_embeddings: row.memories_with_embeddings as usize, + total_edges: row.total_edges as usize, + total_domains: row.total_domains as usize, + registered_model_name: row.registered_model_name, + registered_model_dim: row.registered_model_dim.map(|d| d as usize), + }) +} +``` + +#### `vacuum` + +```rust +async fn vacuum(&self) -> MemoryStoreResult<()> +``` + +`VACUUM` cannot run inside a transaction. sqlx wraps each `query!` +invocation in an implicit transaction when it grabs a pooled +connection, but it does not -- the pool hands out a raw connection +that runs statements in autocommit mode by default. The safe path is +to acquire a connection explicitly and `execute` each statement +separately so neither participates in a transaction. + +```rust +async fn vacuum(&self) -> MemoryStoreResult<()> { + let mut conn = self.pool.acquire().await?; + sqlx::query("VACUUM ANALYZE knowledge_nodes") + .execute(conn.as_mut()) + .await?; + sqlx::query("VACUUM ANALYZE scheduling") + .execute(conn.as_mut()) + .await?; + sqlx::query("VACUUM ANALYZE edges") + .execute(conn.as_mut()) + .await?; + Ok(()) +} +``` + +`conn.as_mut()` yields a `&mut PgConnection`, which sqlx accepts as an +executor. Using `&self.pool` here would let sqlx pick a fresh +connection per statement (still fine, but two extra acquisitions). Note +we do NOT vacuum `domains`, `edges`-related lookup tables (`users` / +`groups` etc.) -- they are either empty in Phase 2 or low-churn and the +nightly autovacuum suffices. + +--- + +## Visibility filter posture + +ADR 0002 D7 declares the future Phase 3 visibility filter (reproduced +here for clarity): + +```sql +WHERE + (visibility = 'private' AND owner_user_id = $me) + OR (visibility = 'group' + AND (owner_user_id = $me OR shared_with_groups && $my_group_ids)) + OR visibility = 'public' +``` + +**Phase 2 does NOT apply this filter.** Every body above reads and +writes the rows it touches regardless of `owner_user_id` or +`visibility` because there is exactly one user in Phase 2 mode (the +bootstrap user from `0001_init.up.sql`). The reviewer should NOT expect +`WHERE owner_user_id = $...` clauses in Phase 2 method bodies. + +Phase 3 introduces an `AuthContext` parameter on the trait methods and +threads it into each WHERE clause. That migration is purely additive +(adds a parameter, adds a clause); it does not need a schema migration +because the columns and indexes are already in place. + +The four new `MemoryRecord` fields ARE populated correctly in Phase 2 +(insert writes them, get/search read them) so that exported archives +and replicated rows round-trip the visibility intent the moment Phase +3 enables filtering. + +--- + +## Offline sqlx cache + +`sqlx::query!` and `sqlx::query_as!` validate every SQL string at +compile time by contacting a live database. To keep CI builds from +needing a Postgres on the build host, sqlx supports an offline cache +in `/.sqlx/` containing one JSON file per validated query. + +This sub-plan is where `.sqlx/` is first populated and committed. + +Workflow: + +1. Ensure a local Postgres is running with the same schema CI will see: + + ```bash + cd crates/vestige-core + export DATABASE_URL="postgres://vestige:vestige@127.0.0.1:5432/vestige_dev" + sqlx database create + sqlx migrate run --source migrations/postgres + ``` + +2. Generate the offline cache: + + ```bash + cargo sqlx prepare --workspace -- --features postgres-backend + ``` + + This walks every `sqlx::query!` invocation under the active feature + flags and writes `crates/vestige-core/.sqlx/query-.json`. The + `--workspace` flag is needed because `vestige-mcp` enables the + `postgres-backend` feature transitively in `0002b-pool-and-config.md`. + +3. Stage and commit the cache directory: + + ```bash + git add crates/vestige-core/.sqlx/ + git commit -m "store: populate sqlx offline cache for postgres backend" + ``` + +4. Add to repo `.gitignore` adjustments (only if entries already deny + `target/` or similar globs): leave `.sqlx/` excluded from any + ignore globs. Specifically the workspace root `.gitignore` does NOT + contain a `.sqlx` line; if a future PR adds one, this sub-plan's + commit reverts it. + +5. CI runs `SQLX_OFFLINE=true cargo check --features postgres-backend`. + sqlx falls back to the JSON cache when `SQLX_OFFLINE=true` is set, + so CI does not need network access to a Postgres. + +Every time a `query!` invocation changes -- add a column, change a +WHERE clause, rename a binding -- re-run `cargo sqlx prepare` and +commit the updated `.sqlx/` files. The agent implementing this sub-plan +runs `cargo sqlx prepare` as the last step before opening the PR. + +--- + +## Verification + +Three layers of verification before merging this sub-plan. + +### 1. Compile and lint + +```bash +cargo check --workspace --features postgres-backend +cargo build --workspace --features postgres-backend +cargo clippy --workspace --features postgres-backend -- -D warnings + +# SQLite-only build still works (mutual compilability per CLAUDE.md): +cargo check --workspace --no-default-features --features embeddings,vector-search +``` + +### 2. Offline cache builds + +```bash +SQLX_OFFLINE=true cargo check --workspace --features postgres-backend +``` + +This is what CI will run. If it fails, `cargo sqlx prepare` was not +re-run after the last query change. + +### 3. Integration round-trip test (testcontainers) + +New test file: +`crates/vestige-core/tests/postgres_round_trip.rs`. Skipped unless the +`postgres-backend` feature is active and Docker / Podman is available. + +```rust +#![cfg(feature = "postgres-backend")] + +use chrono::Utc; +use testcontainers::{clients, GenericImage}; +use uuid::Uuid; +use vestige_core::storage::memory_store::{ + LocalMemoryStore, MemoryEdge, MemoryRecord, SchedulingState, Visibility, + LOCAL_USER_ID, +}; +use vestige_core::storage::postgres::PgMemoryStore; + +#[tokio::test] +async fn round_trip_crud_search_scheduling_edges() { + let docker = clients::Cli::default(); + let image = GenericImage::new("pgvector/pgvector", "pg16") + .with_env_var("POSTGRES_PASSWORD", "test") + .with_env_var("POSTGRES_DB", "vestige_test") + .with_exposed_port(5432); + let container = docker.run(image); + let port = container.get_host_port_ipv4(5432); + let url = format!("postgres://postgres:test@127.0.0.1:{port}/vestige_test"); + + let store = PgMemoryStore::connect(&url, 5).await.expect("connect"); + + // Register the model (typmod stamp). + store.register_model(&fixture_signature(384)).await.expect("register"); + + // insert -> get -> update -> delete. + let mut rec = fixture_record(); + let id = store.insert(&rec).await.expect("insert"); + let fetched = store.get(id).await.expect("get").expect("present"); + assert_eq!(fetched.content, rec.content); + assert_eq!(fetched.owner_user_id, LOCAL_USER_ID); + assert_eq!(fetched.visibility, Visibility::Private); + + rec.content = "edited".to_string(); + store.update(&rec).await.expect("update"); + assert_eq!(store.get(id).await.unwrap().unwrap().content, "edited"); + + // fts_search. + let hits = store.fts_search("edited", 10).await.expect("fts"); + assert!(hits.iter().any(|h| h.record.id == id)); + + // vector_search. + let emb = rec.embedding.clone().unwrap(); + let vhits = store.vector_search(&emb, 10).await.expect("vector"); + assert!(vhits.iter().any(|h| h.record.id == id)); + + // scheduling round-trip. + let sched = store.get_scheduling(id).await.unwrap().expect("seeded"); + let new_state = SchedulingState { + memory_id: id, + stability: 5.5, + difficulty: 0.2, + retrievability: 0.95, + last_review: Some(Utc::now()), + next_review: Some(Utc::now() + chrono::Duration::days(3)), + reps: sched.reps + 1, + lapses: sched.lapses, + }; + store.update_scheduling(&new_state).await.expect("update sched"); + let after = store.get_scheduling(id).await.unwrap().unwrap(); + assert_eq!(after.reps, new_state.reps); + + // edges. + let other = fixture_record(); + let other_id = store.insert(&other).await.unwrap(); + let edge = MemoryEdge { + source_id: id, + target_id: other_id, + edge_type: "semantic".to_string(), + weight: 0.8, + created_at: Utc::now(), + }; + store.add_edge(&edge).await.expect("add_edge"); + let edges = store.get_edges(id, None).await.unwrap(); + assert_eq!(edges.len(), 1); + let neighbors = store.get_neighbors(id, 1).await.unwrap(); + assert!(neighbors.iter().any(|(r, _)| r.id == other_id)); + store.remove_edge(id, other_id).await.expect("remove_edge"); + assert!(store.get_edges(id, None).await.unwrap().is_empty()); + + // delete -> cascade. + store.delete(id).await.expect("delete"); + assert!(store.get(id).await.unwrap().is_none()); + assert!(store.get_scheduling(id).await.unwrap().is_none()); +} + +fn fixture_record() -> MemoryRecord { + MemoryRecord { + id: Uuid::new_v4(), + domains: vec![], + domain_scores: Default::default(), + content: "the quick brown fox jumps over the lazy dog".into(), + node_type: "fact".into(), + tags: vec!["test".into()], + embedding: Some(vec![0.1_f32; 384]), + created_at: Utc::now(), + updated_at: Utc::now(), + metadata: serde_json::json!({}), + owner_user_id: LOCAL_USER_ID, + visibility: Visibility::Private, + shared_with_groups: vec![], + codebase: Some("vestige".to_string()), + } +} + +fn fixture_signature(dim: usize) -> vestige_core::storage::memory_store::ModelSignature { + vestige_core::storage::memory_store::ModelSignature { + name: "test/model".to_string(), + dimension: dim, + hash: "0".repeat(64), + } +} +``` + +Add `testcontainers = "0.20"` to `[dev-dependencies]` under +`#[cfg(feature = "postgres-backend")]` gating. The test is the slowest +in the suite (spawns a Docker container, ~5 s startup); annotate with +`#[ignore]` if CI runtime budget requires opt-in execution. + +### 4. Manual smoke (optional but recommended) + +```bash +# Tear down any prior database. +make postgres-down ; make postgres-up +DATABASE_URL=$(make postgres-url) cargo test \ + -p vestige-core --features postgres-backend -- --include-ignored +``` + +The `postgres-up` / `postgres-down` / `postgres-url` make targets are +defined in `docs/plans/local-dev-postgres-setup.md`. + +--- + +## Acceptance criteria + +This sub-plan is complete when ALL of the following hold: + +1. `cargo build --workspace --features postgres-backend` succeeds with + zero warnings. +2. `cargo clippy --workspace --features postgres-backend -- -D warnings` + succeeds. +3. `cargo build --workspace --no-default-features --features embeddings,vector-search` + still succeeds (the SQLite-only build is not regressed). +4. `SQLX_OFFLINE=true cargo check --workspace --features postgres-backend` + succeeds. `crates/vestige-core/.sqlx/` exists and contains one JSON + file per `sqlx::query!` / `sqlx::query_as!` invocation in the + Postgres backend. +5. Zero `todo!()` macros remain in + `crates/vestige-core/src/storage/postgres/mod.rs`. The only + exception is the body of the trait method `search` -- that method + stays `todo!()` until `0002e-hybrid-search.md` lands. +6. `crates/vestige-core/src/storage/postgres/registry.rs` exists with + the three functions described above + (`fetch_registry`, `ensure_registry`, `update_registry_for_reembed`). +7. `MemoryRecord` carries the four new fields + (`owner_user_id`, `visibility`, `shared_with_groups`, `codebase`) + and the `Visibility` enum is exported alongside it. The SQLite + backend reads and writes the same four fields. +8. The `tests/postgres_round_trip.rs` integration test passes against + a `pgvector/pgvector:pg16` container (insert / get / update / delete + / fts_search / vector_search / get_scheduling / update_scheduling + / add_edge / get_edges / remove_edge / get_neighbors / cascade + delete). +9. No visibility filter clause is present in any Phase 2 method body. + `WHERE owner_user_id = ...`, `WHERE visibility = ...`, and + `shared_with_groups && ...` do not appear anywhere in + `crates/vestige-core/src/storage/postgres/`. +10. `cargo sqlx prepare` was the last command run before commit; the + diff includes `.sqlx/` changes if any query changed. diff --git a/docs/plans/0002e-hybrid-search.md b/docs/plans/0002e-hybrid-search.md new file mode 100644 index 0000000..1f45174 --- /dev/null +++ b/docs/plans/0002e-hybrid-search.md @@ -0,0 +1,825 @@ +# Phase 2 Sub-Plan 0002e -- Hybrid RRF Search + +**Status**: Ready +**Depends on**: +- `0002a-skeleton-and-feature-gate.md` (the `postgres-backend` feature flag + exists and `PgMemoryStore` compiles with `todo!()` bodies). +- `0002b-pool-and-config.md` (a working `PgPool` reaches the backend). +- `0002c-migrations.md` (migration `0001_init` has created the `knowledge_nodes` + table with the D7 columns -- `owner_user_id`, `visibility`, + `shared_with_groups` -- and the D8 column `codebase`; migration `0002_hnsw` + has built the HNSW index). +- `0002d-store-impl-bodies.md` (real CRUD bodies exist so the integration + tests below can seed data through the trait surface rather than raw SQL). + +This sub-plan covers master plan 0002 deliverable D5: the hybrid RRF search +query implementation in `crates/vestige-core/src/storage/postgres/search.rs`, +plus the `search`, `fts_search`, and `vector_search` method bodies in +`crates/vestige-core/src/storage/postgres/mod.rs` that delegate into it. + +--- + +## Context + +This is one of the more performance-sensitive sub-plans in Phase 2. Every +search call from the cognitive engine -- the 7-stage retrieval pipeline, +`session_context`, `predict`, `deep_reference`, the dashboard -- bottoms out +in `MemoryStore::search`. The Postgres backend has to keep up with the +existing SQLite hybrid path, which combines BM25 over FTS5 with USearch HNSW +in two separate round trips and fuses the rankings in Rust. + +The shape of the win on Postgres is that both branches and the fusion run +inside one statement. The planner sees both CTEs together, the round trip is +single, and the rerank stage runs over a cleanly overfetched candidate set. + +Latency targets live in `0002h-testing-and-benches.md`. This sub-plan is +responsible for producing a correct, schema-stable query that the benches +can drive against. Do not optimise here; correctness and structure first. + +Master plan 0002 D5 (around lines 522-628 of +`docs/plans/0002-phase-2-postgres-backend.md`) sketches the SQL. That +sketch is the starting point, not the finished product. The schema after +the D7 and D8 amendments has more columns than the sketch enumerates, and +the SQLite `search` method (around line 6503 of +`crates/vestige-core/src/storage/sqlite.rs` in the Phase 1 worktree) +documents the semantics this implementation must stay compatible with: + +- Empty `query.limit` defaults to 10. +- `query.text == Some("")` is treated as no text query (degrade to vector). +- `query.embedding == None` is treated as no vector query (degrade to FTS). +- Both empty returns `Ok(vec![])`; not an error. +- The `MemoryRecord` in each `SearchResult` must be populated with all + fields the trait promises, including `domains` and `domain_scores` (Phase + 4 will fill these in; Phase 2 returns the stored values, which may be + empty arrays / empty objects). + +--- + +## Constants + +```rust +/// Reciprocal Rank Fusion smoothing constant from Cormack, Clarke and Buettcher +/// 2009 ("Reciprocal Rank Fusion outperforms Condorcet and individual rank +/// learning methods"). 60 is the canonical default and is robust across most +/// fusion regimes. Do not tune this without a paper-citation-grade reason. +const RRF_K: i32 = 60; + +/// Each branch (FTS, vector) is allowed to return OVERFETCH_MULT x final_limit +/// rows before fusion. Three matches the Phase 1 SQLite overfetch and gives +/// the fusion enough candidates to recover from any single branch's bad +/// recall on a given query. +const OVERFETCH_MULT: i64 = 3; +``` + +These live at module scope in +`crates/vestige-core/src/storage/postgres/search.rs`. They are `pub(crate)` +only if `0002h-testing-and-benches.md` needs to reference them from the +integration tests; otherwise private. + +--- + +## Public API + +```rust +#![cfg(feature = "postgres-backend")] + +use pgvector::Vector; +use sqlx::PgPool; + +use crate::storage::memory_store::{ + MemoryStoreResult, SearchQuery, SearchResult, +}; + +/// Hybrid RRF search over Postgres FTS and pgvector cosine distance. +/// +/// Branch behavior: +/// - empty text + null embedding -> Ok(vec![]) +/// - empty text + Some(embedding) -> pure vector search (FTS CTE returns +/// zero rows; fusion equals the vector +/// branch) +/// - Some(text) + null embedding -> pure FTS search +/// - Some(text) + Some(embedding) -> full RRF fusion +/// +/// `query.limit == 0` is treated as 10 (matches the SQLite default). +pub(crate) async fn rrf_search( + pool: &PgPool, + query: &SearchQuery, +) -> MemoryStoreResult>; + +/// FTS-only convenience search. Equivalent to calling `rrf_search` with +/// `query.embedding = None`, but uses a dedicated single-branch query that +/// avoids the FULL OUTER JOIN and the params CTE; faster by one planner pass +/// per call. +pub(crate) async fn fts_only( + pool: &PgPool, + text: &str, + limit: usize, +) -> MemoryStoreResult>; + +/// Vector-only convenience search. Dedicated single-branch query for the same +/// latency reason as `fts_only`. +pub(crate) async fn vector_only( + pool: &PgPool, + embedding: &[f32], + limit: usize, +) -> MemoryStoreResult>; +``` + +### Parameter handling + +In `rrf_search`: + +```rust +let final_limit: i32 = if query.limit == 0 { 10 } else { query.limit as i32 }; +let overfetch: i32 = (final_limit as i64 * OVERFETCH_MULT) + .min(i32::MAX as i64) as i32; + +let q_text: &str = query.text.as_deref().unwrap_or(""); +let q_vec: Option = query.embedding.as_ref() + .map(|v| Vector::from(v.clone())); + +let dom_filter: Option<&[String]> = query.domains.as_deref(); +let nt_filter: Option<&[String]> = query.node_types.as_deref(); +let tag_filter: Option<&[String]> = query.tags.as_deref(); + +let min_retr: Option = query.min_retrievability; +``` + +Both branches empty -- `q_text` is empty and `q_vec` is `None` -- returns +`Ok(vec![])` without hitting the database. The SQLite backend has the same +behavior and tests rely on it. + +```rust +if q_text.is_empty() && q_vec.is_none() { + return Ok(Vec::new()); +} +``` + +### `search` method body in `postgres/mod.rs` + +```rust +#[async_trait::async_trait] // or trait_variant after the Phase 1 amendment +impl MemoryStore for PgMemoryStore { + async fn search(&self, query: &SearchQuery) + -> MemoryStoreResult> + { + crate::storage::postgres::search::rrf_search(&self.pool, query).await + } + + async fn fts_search(&self, text: &str, limit: usize) + -> MemoryStoreResult> + { + crate::storage::postgres::search::fts_only(&self.pool, text, limit) + .await + } + + async fn vector_search(&self, embedding: &[f32], limit: usize) + -> MemoryStoreResult> + { + crate::storage::postgres::search::vector_only(&self.pool, embedding, limit) + .await + } +} +``` + +Everything below specifies the inside of those three free functions. + +--- + +## SQL: the hybrid RRF query + +The query is built as one `&'static str` (or `OnceCell`; see "Use +of sqlx::query!" below) and reused. Bound parameters are kept to seven +through a `params` CTE that the rest of the query references by name -- +this keeps the SQL readable and stops the bound-parameter count growing +with each filter clause. + +Bound parameters: + +- `$1`: text query (TEXT, may be empty) +- `$2`: embedding (pgvector::Vector, may be NULL) +- `$3`: overfetch limit per branch (INT) +- `$4`: final limit (INT) +- `$5`: domain filter (TEXT[] or NULL) +- `$6`: node_type filter (TEXT[] or NULL) +- `$7`: tag filter (TEXT[] or NULL) + +If `min_retrievability.is_some()` the outer SELECT adds a JOIN on +`scheduling` and a WHERE clause; that path uses a different prepared +statement (see "min_retrievability filter" below) so the simple-path query +stays free of the join. + +```sql +WITH params AS ( + SELECT + $1::text AS q_text, + $2::vector AS q_vec, + $3::int AS overfetch, + $4::int AS final_limit, + $5::text[] AS dom_filter, + $6::text[] AS nt_filter, + $7::text[] AS tag_filter +), +fts AS ( + SELECT + m.id, + ts_rank_cd( + m.search_vec, + websearch_to_tsquery('english', p.q_text) + ) AS score, + ROW_NUMBER() OVER ( + ORDER BY ts_rank_cd( + m.search_vec, + websearch_to_tsquery('english', p.q_text) + ) DESC + ) AS rank + FROM knowledge_nodes m + CROSS JOIN params p + WHERE p.q_text <> '' + AND m.search_vec @@ websearch_to_tsquery('english', p.q_text) + AND (p.dom_filter IS NULL OR m.domains && p.dom_filter) + AND (p.nt_filter IS NULL OR m.node_type = ANY(p.nt_filter)) + AND (p.tag_filter IS NULL OR m.tags && p.tag_filter) + ORDER BY score DESC + LIMIT (SELECT overfetch FROM params) +), +vec AS ( + SELECT + m.id, + 1.0 - (m.embedding <=> p.q_vec) AS score, + ROW_NUMBER() OVER ( + ORDER BY m.embedding <=> p.q_vec + ) AS rank + FROM knowledge_nodes m + CROSS JOIN params p + WHERE m.embedding IS NOT NULL + AND p.q_vec IS NOT NULL + AND (p.dom_filter IS NULL OR m.domains && p.dom_filter) + AND (p.nt_filter IS NULL OR m.node_type = ANY(p.nt_filter)) + AND (p.tag_filter IS NULL OR m.tags && p.tag_filter) + ORDER BY m.embedding <=> p.q_vec + LIMIT (SELECT overfetch FROM params) +), +fused AS ( + SELECT + COALESCE(f.id, v.id) AS id, + COALESCE(1.0 / (60 + f.rank), 0.0) + + COALESCE(1.0 / (60 + v.rank), 0.0) AS rrf_score, + f.score AS fts_score, + v.score AS vector_score + FROM fts f + FULL OUTER JOIN vec v ON f.id = v.id +) +SELECT + m.id AS "id!: uuid::Uuid", + m.owner_user_id AS "owner_user_id!: uuid::Uuid", + m.visibility AS "visibility!: String", + m.shared_with_groups AS "shared_with_groups!: Vec", + m.codebase AS "codebase: String", + m.domains AS "domains!: Vec", + m.domain_scores AS "domain_scores!: serde_json::Value", + m.content AS "content!: String", + m.node_type AS "node_type!: String", + m.tags AS "tags!: Vec", + m.embedding AS "embedding: pgvector::Vector", + m.metadata AS "metadata!: serde_json::Value", + m.created_at AS "created_at!: chrono::DateTime", + m.updated_at AS "updated_at!: chrono::DateTime", + fused.rrf_score AS "rrf_score!: f64", + fused.fts_score AS "fts_score: f64", + fused.vector_score AS "vector_score: f64" +FROM fused +JOIN knowledge_nodes m ON m.id = fused.id +ORDER BY fused.rrf_score DESC +LIMIT (SELECT final_limit FROM params); +``` + +Notes on the SELECT column list. The D7 columns (`owner_user_id`, +`visibility`, `shared_with_groups`) and the D8 column (`codebase`) are +selected even though Phase 2 does not filter on them yet, so: + +1. The `MemoryRecord` returned to the trait can be populated with the + stored values from day one. Phase 3 will start writing real + `owner_user_id` / `visibility` values; Phase 2 always writes the + single-user defaults (`'00000000-...-0001'`, `'private'`, `'{}'`). The + `MemoryRecord` returned in Phase 2 simply carries those defaults. +2. The schema-drift integration tests (see "Verification") catch the case + where someone adds a NOT NULL column to `knowledge_nodes` without updating + this query. + +Notes on the body: + +- `CROSS JOIN params p` is used instead of the master-plan sketch's + `FROM knowledge_nodes m, params p`. Same plan, clearer intent. +- The `ORDER BY ... LIMIT` inside each branch CTE is there so the planner + can stop early once it has `overfetch` rows; without it the LIMIT is + applied after a full sort over all matches. +- `1.0 - (m.embedding <=> p.q_vec)` converts pgvector's cosine *distance* + to cosine *similarity* in [0, 1] for the `vector_score` output. RRF + itself does not need the similarity -- it uses ranks -- but the trait + surface exposes `vector_score: Option` for caller introspection. +- `RRF_K = 60` is inlined as `60` in the SQL string. A `const` formatter + feels tidier but `60` is a literature constant; spell it out and leave a + comment in the Rust source: `// 60 == RRF_K (Cormack 2009)`. +- `FULL OUTER JOIN` is required: a row that the FTS branch finds and the + vector branch does not must still appear in `fused`, and vice versa. +- `COALESCE(..., 0.0)` on each `1.0 / (60 + rank)` term handles the + no-match-from-this-branch case. The fusion score for a row that only the + FTS branch ranks is `1/(60 + f.rank)` exactly. +- `m.search_vec` is the generated `tsvector` column created in migration + `0001_init` (see D4 of the master plan). + +--- + +## Result row mapping + +`sqlx::query_as::<_, SearchRow>` reads each row into a private struct that +owns the column types exactly as they come back from Postgres. The struct +is converted into a `SearchResult` after fetch. + +```rust +#[derive(sqlx::FromRow)] +struct SearchRow { + id: uuid::Uuid, + owner_user_id: uuid::Uuid, + visibility: String, + shared_with_groups: Vec, + codebase: Option, + domains: Vec, + domain_scores: serde_json::Value, + content: String, + node_type: String, + tags: Vec, + embedding: Option, + metadata: serde_json::Value, + created_at: chrono::DateTime, + updated_at: chrono::DateTime, + rrf_score: f64, + fts_score: Option, + vector_score: Option, +} + +impl SearchRow { + fn into_result(self) -> SearchResult { + use crate::storage::memory_store::MemoryRecord; + use std::collections::HashMap; + + // domain_scores is JSONB; the column always exists, but may be the + // empty object {} when Phase 4 has not classified this memory yet. + let domain_scores: HashMap = + serde_json::from_value(self.domain_scores).unwrap_or_default(); + + let record = MemoryRecord { + id: self.id, + domains: self.domains, + domain_scores, + content: self.content, + node_type: self.node_type, + tags: self.tags, + // pgvector::Vector -> Vec + embedding: self.embedding.map(|v| v.to_vec()), + created_at: self.created_at, + updated_at: self.updated_at, + metadata: self.metadata, + // owner_user_id / visibility / shared_with_groups / codebase + // do not appear on MemoryRecord yet. Phase 3 will decide whether + // to extend MemoryRecord or surface them via a side channel. + // For Phase 2 they are read but discarded here. + }; + + SearchResult { + record, + score: self.rrf_score, + fts_score: self.fts_score, + vector_score: self.vector_score, + } + } +} +``` + +Type mapping summary: + +| SQL type | Rust type | Notes | +|-------------------|--------------------------------------|------------------------------------------------| +| UUID | `uuid::Uuid` | requires sqlx `uuid` feature | +| TEXT | `String` | | +| TEXT NULL | `Option` | used for `codebase` | +| TEXT[] | `Vec` | for `tags`, `domains` | +| UUID[] | `Vec` | for `shared_with_groups` | +| JSONB | `serde_json::Value` | for `metadata`, `domain_scores` | +| TIMESTAMPTZ | `chrono::DateTime` | requires sqlx `chrono` feature | +| VECTOR(N) NULL | `Option` | `.map(|v| v.to_vec())` to `Option>` | +| FLOAT8 | `f64` | | +| FLOAT8 NULL | `Option` | for `fts_score`, `vector_score` | + +If `MemoryRecord` is extended in Phase 3 to carry `owner_user_id`, +`visibility`, `shared_with_groups`, and `codebase`, the conversion above +gets four more fields. Phase 2 reads them so the integration tests can +assert on them via SQL, but the trait surface does not expose them yet. + +--- + +## `fts_only` and `vector_only` -- dedicated single-branch queries + +The master plan offers two options for the convenience methods: reuse +`rrf_search` with one branch nulled, or write dedicated queries. The +dedicated queries win: + +- One CTE instead of three. Planner picks the obvious plan. +- No FULL OUTER JOIN. +- No `params` indirection -- bound parameters used directly. +- The output `score` is the branch's native score (BM25-ish `ts_rank_cd` / + cosine similarity), not an RRF fusion score over one branch. Callers of + `fts_search` and `vector_search` get an intuitive score back. + +### `fts_only` + +Bound parameters: + +- `$1`: text query (TEXT, must be non-empty; the caller guards `text.is_empty()`) +- `$2`: limit (INT) + +```sql +SELECT + m.id AS "id!: uuid::Uuid", + m.owner_user_id AS "owner_user_id!: uuid::Uuid", + m.visibility AS "visibility!: String", + m.shared_with_groups AS "shared_with_groups!: Vec", + m.codebase AS "codebase: String", + m.domains AS "domains!: Vec", + m.domain_scores AS "domain_scores!: serde_json::Value", + m.content AS "content!: String", + m.node_type AS "node_type!: String", + m.tags AS "tags!: Vec", + m.embedding AS "embedding: pgvector::Vector", + m.metadata AS "metadata!: serde_json::Value", + m.created_at AS "created_at!: chrono::DateTime", + m.updated_at AS "updated_at!: chrono::DateTime", + ts_rank_cd(m.search_vec, websearch_to_tsquery('english', $1)) + AS "fts_score!: f64" +FROM knowledge_nodes m +WHERE m.search_vec @@ websearch_to_tsquery('english', $1) +ORDER BY ts_rank_cd(m.search_vec, websearch_to_tsquery('english', $1)) DESC +LIMIT $2; +``` + +The Rust caller maps each row to a `SearchResult` with: + +```rust +SearchResult { + record, + score: fts_score, + fts_score: Some(fts_score), + vector_score: None, +} +``` + +If `text.is_empty()` the caller returns `Ok(Vec::new())` before hitting +the database. `websearch_to_tsquery('english', '')` returns an empty +tsquery that matches nothing; the round-trip is wasted work otherwise. + +### `vector_only` + +Bound parameters: + +- `$1`: embedding (pgvector::Vector) +- `$2`: limit (INT) + +```sql +SELECT + m.id AS "id!: uuid::Uuid", + m.owner_user_id AS "owner_user_id!: uuid::Uuid", + m.visibility AS "visibility!: String", + m.shared_with_groups AS "shared_with_groups!: Vec", + m.codebase AS "codebase: String", + m.domains AS "domains!: Vec", + m.domain_scores AS "domain_scores!: serde_json::Value", + m.content AS "content!: String", + m.node_type AS "node_type!: String", + m.tags AS "tags!: Vec", + m.embedding AS "embedding: pgvector::Vector", + m.metadata AS "metadata!: serde_json::Value", + m.created_at AS "created_at!: chrono::DateTime", + m.updated_at AS "updated_at!: chrono::DateTime", + 1.0 - (m.embedding <=> $1) AS "vector_score!: f64" +FROM knowledge_nodes m +WHERE m.embedding IS NOT NULL +ORDER BY m.embedding <=> $1 +LIMIT $2; +``` + +The Rust caller maps each row to: + +```rust +SearchResult { + record, + score: vector_score, + fts_score: None, + vector_score: Some(vector_score), +} +``` + +Both convenience methods ignore `SearchQuery.domains` / `tags` / +`node_types` / `min_retrievability` -- they take `&str` and `&[f32]` +respectively, not a `SearchQuery`. Callers that want filters on a +single-branch search should call `search` with the other branch input +left at its degrade-to-zero default. + +--- + +## `min_retrievability` filter + +`SearchQuery::min_retrievability: Option` is applied as a final +filter after fusion by joining on the `scheduling` table: + +```sql +-- with-min-retrievability variant: identical CTEs to the base query, only +-- the final SELECT changes. +SELECT + ... (same column list as the base query) ... +FROM fused +JOIN knowledge_nodes m ON m.id = fused.id +JOIN scheduling s ON s.memory_id = m.id +WHERE s.retrievability >= $8 +ORDER BY fused.rrf_score DESC +LIMIT (SELECT final_limit FROM params); +``` + +This is a separate prepared statement -- the eight-parameter variant -- +held alongside the seven-parameter base. The Rust dispatch: + +```rust +if let Some(min_r) = query.min_retrievability { + sqlx::query_as::<_, SearchRow>(QUERY_WITH_MIN_R) + .bind(q_text) + .bind(q_vec) + .bind(overfetch) + .bind(final_limit) + .bind(dom_filter) + .bind(nt_filter) + .bind(tag_filter) + .bind(min_r) + .fetch_all(pool).await? +} else { + sqlx::query_as::<_, SearchRow>(QUERY_BASE) + .bind(q_text) + .bind(q_vec) + .bind(overfetch) + .bind(final_limit) + .bind(dom_filter) + .bind(nt_filter) + .bind(tag_filter) + .fetch_all(pool).await? +} +``` + +Why not unconditionally join: the `scheduling` join is expensive enough on +a large `knowledge_nodes` table that adding it to every search call regresses the +common path. `min_retrievability` is set by the cognitive engine's +accessibility filter and is `None` in most direct callers. + +The same two-variant pattern repeats for `fts_only` and `vector_only`; in +practice callers of those methods rarely set `min_retrievability` (it is +not part of their argument list), so only the base variant is needed +unless the trait surface grows. + +--- + +## Domain / tag / node_type filters + +Each filter is expressed as a NULL-conditional clause inside both branch +CTEs, written using PostgreSQL array operators: + +```sql +AND (p.dom_filter IS NULL OR m.domains && p.dom_filter) +AND (p.nt_filter IS NULL OR m.node_type = ANY(p.nt_filter)) +AND (p.tag_filter IS NULL OR m.tags && p.tag_filter) +``` + +- `&&` is the PostgreSQL "arrays overlap" operator. Matches if any + element in `m.domains` is in the filter array. Index-friendly with a + GIN index on `m.domains` (created in `0001_init`). +- `= ANY(...)` matches `m.node_type` (a scalar) against any element of + the filter array. Index-friendly with a B-tree on `m.node_type`. +- `&&` is used again on `m.tags` (a `TEXT[]`). + +The NULL-conditional form is critical: when the filter parameter is +`NULL`, the clause short-circuits to `TRUE` and contributes nothing to +the WHERE. This keeps a single query reusable across "no filter" and +"filter set" cases without rewriting SQL. + +When the Rust caller passes `None` for a filter, sqlx binds it as `NULL` +of the column type (`text[]`). The cast `$5::text[]` in the `params` CTE +is what tells sqlx the binding type. + +The master plan's draft has each filter clause duplicated across both +branch CTEs. That duplication is correct -- the planner cannot push a +WHERE clause across a FULL OUTER JOIN into both sides automatically; we +do it manually. + +--- + +## Empty-string text query handling + +The base query guards the FTS branch with `WHERE p.q_text <> ''`. + +`websearch_to_tsquery('english', '')` returns an empty tsquery. An empty +tsquery has no lexemes and matches no document; the `@@` operator returns +false for every row. Without the guard, the FTS branch would still run -- +sequential scan, tokenisation per row, comparison -- and return zero +rows. The guard short-circuits at planning time. + +The guard does not affect the FULL OUTER JOIN: when the FTS branch +returns zero rows, the join degenerates to "every row that the vector +branch returned, with `f.id IS NULL` and `f.rank IS NULL`". The +`COALESCE(1.0 / (60 + f.rank), 0.0)` then evaluates to `0.0`, and the +fused score reduces to the vector branch's RRF term alone. This is the +"pure vector search" degrade path. + +Symmetrically, the vector branch guards itself with +`WHERE m.embedding IS NOT NULL AND p.q_vec IS NOT NULL`, which gives the +"pure FTS search" degrade path when the caller passes no embedding. + +The both-empty case (`q_text == ''` and `q_vec IS NULL`) is intercepted +in Rust before the query runs and returns `Ok(vec![])`. Returning empty +rather than error matches the SQLite behavior and is what the Phase 1 +ingest pipeline relies on for "no signal, no results" fallback. + +--- + +## Use of `sqlx::query!` versus `sqlx::query_as` + +`sqlx::query!` and `sqlx::query_as!` are compile-time-checked: the SQL is +sent to a live Postgres at build time, the result schema is validated, and +the generated Rust struct fields are derived. That checking is the +default for every other query in `0002d-store-impl-bodies.md`. + +For the RRF query, the macro path is impractical for two reasons: + +1. **Two structurally different queries** -- the base (seven parameters, + no `scheduling` join) and the `min_retrievability` variant (eight + parameters, with the join). The macro would force two macro + invocations, each producing its own anonymous result struct, and the + result types would not unify. Manual `From` impls would be needed in + both directions. +2. **The dedicated `fts_only` and `vector_only` queries have a different + output column set** (`fts_score!` instead of `rrf_score! + fts_score? + + vector_score?`). Three macro invocations, three structs, three + conversion helpers. + +The chosen pattern is `sqlx::query_as::<_, SearchRow>(SQL_CONST)` with a +single `SearchRow` struct that owns the column types and a single +`SearchRow::into_result` helper. The SQL is held in module-scope `&'static +str` constants: + +```rust +const QUERY_BASE: &str = include_str!("search.rrf.sql"); +const QUERY_WITH_MIN_R: &str = include_str!("search.rrf.min_retr.sql"); +const QUERY_FTS_ONLY: &str = include_str!("search.fts.sql"); +const QUERY_VECTOR_ONLY: &str = include_str!("search.vector.sql"); +``` + +`include_str!` keeps the SQL out of the Rust source. The four `.sql` +files live next to `search.rs` in +`crates/vestige-core/src/storage/postgres/`. + +The cost: schema drift (someone renames `m.codebase` to `m.repo_name`) +will not break the build. The integration tests in "Verification" below +are the safety net. This is a deliberate trade -- it is the one sub-plan +in Phase 2 where runtime flexibility beats compile-time checking. + +If a future contributor wants compile-time checking back for the simple +case, the right move is to introduce a `#[cfg(test)]`-only macro-checked +variant of `QUERY_BASE` and assert at test build time that the macro +agrees with the string. That belongs in `0002h-testing-and-benches.md` if +anywhere. + +--- + +## Verification + +Integration tests live in +`crates/vestige-core/tests/postgres_search.rs`, gated by +`#[cfg(feature = "postgres-backend")]` and `#[ignore]` by default (the +test runner CI workflow in `0002h-testing-and-benches.md` runs them with +`--ignored` against a live Postgres). + +Common harness for every test: + +1. Spin up Postgres via `sqlx::PgPool::connect` against the test URL. +2. Run `sqlx::migrate!("./migrations/postgres").run(&pool)` to bring the + schema up. +3. Register a deterministic test embedder via `register_model` so + `embedding` columns can be written. +4. Seed 50 mixed memories through `MemoryStore::insert` -- mixed + `node_type` (`fact`, `concept`, `event`, `decision`), mixed `tags` + (`rust`, `postgres`, `search`, `dream`, `bug-fix`), mixed `codebase`, + embeddings drawn from three small clusters so vector recall has + structure to find. + +Test cases: + +**T1. Full RRF returns the seeded target.** +Insert a known memory with `content = "FSRS-6 spaced repetition cadence"` +and an embedding from cluster A. Query with +`text = Some("FSRS spaced repetition")` and an embedding near cluster A. +Assert the target memory is in the top 3 of the returned `SearchResult`s +and that both `fts_score` and `vector_score` are `Some` for it. + +**T2. Pure FTS degrade.** +Same target as T1. Query with `text = Some("FSRS spaced repetition")` and +`embedding = None`. Assert the target appears, all results have +`vector_score == None`, `fts_score == Some(_)`, and `score` equals the +fused RRF score (which collapses to one branch's `1.0/(60 + rank)`). + +**T3. Pure vector degrade.** +Same target as T1. Query with `text = Some("")` and +`embedding = Some(cluster_A_vector)`. Assert the target appears, all +results have `fts_score == None`, `vector_score == Some(_)`. + +**T4. Both empty returns `Ok(vec![])`.** +Query with `text = Some("")` and `embedding = None`. Assert exactly an +empty result vector and that no SQL was executed (assert via a +`sqlx::PgPool` query-count handle if convenient; otherwise document that +the short-circuit lives in Rust). + +**T5. `domains` filter.** +Insert one memory with `domains = vec!["domain-x"]` and 49 others without +it. Query with `domains = Some(vec!["domain-x"])` and a matching text. +Assert exactly one result is returned and it is the seeded memory. + +**T6. `tags` filter.** +Same pattern as T5 with `tags = Some(vec!["bug-fix"])`. + +**T7. `node_types` filter.** +Same pattern as T5 with `node_types = Some(vec!["decision"])`. + +**T8. `min_retrievability` filter.** +Seed two memories with the same content + embedding. Write +`scheduling` rows so that one has `retrievability = 0.9` and the other +`0.1`. Query with `min_retrievability = Some(0.5)`. Assert exactly the +high-retrievability memory is returned. + +**T9. `query.limit == 0` defaults to 10.** +Seed 30 matching memories. Query with `limit = 0`. Assert the result +contains exactly 10 entries. + +**T10. `fts_only` and `vector_only` parity.** +For the same target memory, call `fts_only` and `vector_only` directly +and compare against `search` with the corresponding branch zeroed. The +top-1 result must match by id; the scores need only be of the same sign +and magnitude (not bit-identical, because RRF fusion changes the +absolute score). + +**T11. Schema-drift canary.** +Run the base query against an empty `knowledge_nodes` table and `fetch_all` +into `Vec`. Any added NOT NULL column on `knowledge_nodes` that is +not in the SELECT will fail the test at the `try_get` boundary with a +clear error. This is the test that compensates for not using +`sqlx::query!`. + +**T12. Hostile inputs.** +Query with `text = Some("'; DROP TABLE knowledge_nodes; --")` and a normal +embedding. Assert no panic, results returned cleanly, `knowledge_nodes` table +still present. This is symbolic; `websearch_to_tsquery` is parameter- +bound and SQL injection is not actually possible, but the test is cheap +and the assertion is real. + +--- + +## Acceptance criteria + +A reviewer of the implementation PR should be able to confirm: + +1. `crates/vestige-core/src/storage/postgres/search.rs` exists and is + compiled only when `feature = "postgres-backend"` is on. +2. The four `.sql` files (`search.rrf.sql`, + `search.rrf.min_retr.sql`, `search.fts.sql`, `search.vector.sql`) + exist in the same directory and are `include_str!`-ed into module- + scope `&'static str` constants. +3. `RRF_K = 60` and `OVERFETCH_MULT = 3` are defined as constants at + module scope with the Cormack 2009 citation in a comment. +4. The seven-parameter base query is one statement and uses a `params` + CTE; the eight-parameter `min_retrievability` variant adds exactly + one JOIN and one WHERE clause on top of the base. +5. Empty text degrades to pure vector; null embedding degrades to pure + FTS; both empty short-circuits to `Ok(vec![])` in Rust before the + query runs. +6. The SELECT column list in every query includes `owner_user_id`, + `visibility`, `shared_with_groups`, and `codebase` even though Phase 2 + does not filter on them. +7. `SearchRow::into_result` populates a `MemoryRecord` with every field + the trait requires, including `domains` and `domain_scores` decoded + from JSONB. +8. `PgMemoryStore::search`, `PgMemoryStore::fts_search`, and + `PgMemoryStore::vector_search` each delegate to the corresponding + free function with one line of body. +9. All twelve integration tests (`T1` through `T12`) pass against a live + Postgres with the `0001_init` + `0002_hnsw` migrations applied. +10. `cargo build -p vestige-core` succeeds with + `--features postgres-backend` and with the feature off. +11. `cargo clippy -p vestige-core --features postgres-backend -- -D warnings` + is clean. + +When all eleven are true, this sub-plan is done and +`0002f-migrate-cli.md` is unblocked. diff --git a/docs/plans/0002f-migrate-cli.md b/docs/plans/0002f-migrate-cli.md new file mode 100644 index 0000000..457de70 --- /dev/null +++ b/docs/plans/0002f-migrate-cli.md @@ -0,0 +1,1045 @@ +# Phase 2 Sub-Plan 0002f -- SQLite-to-Postgres Migrate CLI + +**Status**: Ready +**Depends on**: +- `0002a-skeleton-and-feature-gate.md` (the `postgres-backend` Cargo feature + and the `crates/vestige-core/src/storage/postgres/` module skeleton). +- `0002b-pool-and-config.md` (`PgPool` construction and `PostgresConfig`). +- `0002c-migrations.md` (the `postgres/migrations/0001_init.up.sql` schema, + including the D7 tenancy columns/tables and the D8 `codebase` column). +- `0002d-store-impl-bodies.md` (real bodies for `PgMemoryStore` trait methods: + `insert`, `register_model`, `add_edge`, `update_scheduling`, etc.; and the + matching source-side reader bodies on `SqliteMemoryStore`, in particular a + windowed-stream API ordered by `(created_at, id)`). + +This sub-plan covers Phase 2 master-plan deliverables D8 (the streaming copy +in `crates/vestige-core/src/storage/postgres/migrate_cli.rs`) and D10 (the +`vestige migrate copy ...` clap subcommand in +`crates/vestige-mcp/src/bin/cli.rs`). Sub-plan `0002g-reembed.md` covers the +`vestige migrate reembed ...` subcommand body; this sub-plan only declares the +`Reembed` clap variant alongside `Copy` so the subcommand layout is final. + +The success criterion is: + +``` +vestige migrate copy --from sqlite --to postgres \ + --sqlite-path ~/.vestige/vestige.db \ + --postgres-url postgresql://localhost/vestige +``` + +streams every row from a Phase 1 SQLite database into a fresh (or partially +populated) Phase 2 Postgres database. Re-running the same command is a no-op +on already-present rows. A `--dry-run` flag prints per-table counts without +writing anything. + +--- + +## Context + +ADR 0002 D2 settled that `PgMemoryStore::connect` mirrors +`SqliteMemoryStore::new`: no `Embedder` in the constructor; the model +signature is stamped by a separate call to `register_model`. The migrator +inherits this symmetry. It opens both backends, validates that the source's +`embedding_model` registry agrees with the destination's (or with the +embedder the user supplied for the destination), and then streams rows. + +ADR 0002 D7 reserved multi-tenancy columns on `knowledge_nodes` (`owner_user_id`, +`visibility`, `shared_with_groups`) and three tables (`users`, `groups`, +`group_memberships`). Phase 2 single-user defaults are the bootstrap row +`'00000000-0000-0000-0000-000000000001'` (`'local'`), `visibility = 'private'`, +empty `shared_with_groups`. The migrator preserves whatever values the source +SQLite holds; it does NOT rewrite owner_user_id from real values to the +bootstrap user. If a Phase 3-aware source has real user rows, those are +copied first (step 5 below) and the foreign key in `knowledge_nodes.owner_user_id` +resolves to the same UUID on the destination. + +ADR 0002 D8 promoted `codebase` to a first-class `TEXT` column. The migrator +reads it as a column on the source side (the Phase 1 amendment's V15 SQLite +migration ensures the column exists; for pre-V15 SQLite the migrator must +detect and fall back to extracting from `metadata->>'codebase'`, see "Source +schema variants" below). + +The Phase 1 `SqliteMemoryStore` is the source backend. `0002d-store-impl-bodies.md` +extends it (and the trait) with a windowed reader ordered by `(created_at, id)` +so the migrator can stream rows in deterministic batches without holding the +full result set in RAM. The migrator assumes that reader exists and produces +`MemoryRecord` instances with all D7+D8 columns populated. + +--- + +## File layout + +``` +crates/vestige-core/src/storage/postgres/migrate_cli.rs -- D8 body +crates/vestige-mcp/src/bin/cli.rs -- D10 clap wiring +tests/phase_2/migrate_test.rs -- integration test +``` + +The migrator lives behind `#[cfg(feature = "postgres-backend")]`. The +`Migrate` clap variant in the CLI is similarly gated. Without the feature, +`vestige` builds and runs exactly as in Phase 1 -- the `migrate` subcommand +simply does not exist. + +--- + +## Plan struct + +```rust +#![cfg(feature = "postgres-backend")] + +use std::path::PathBuf; +use std::sync::Arc; + +use uuid::Uuid; + +use crate::embedder::Embedder; +use crate::storage::memory_store::MemoryStoreError; + +#[derive(Debug, Clone)] +pub struct SqliteToPostgresPlan { + /// Filesystem path to the source SQLite database. Opened read-only. + pub sqlite_path: PathBuf, + + /// libpq-style URL for the destination Postgres database. + pub postgres_url: String, + + /// sqlx pool size for the destination. Default 4. The migrator is + /// single-writer per table for ordering reasons; extra connections are + /// only used for the embedding-model registry probe and for the dry-run + /// COUNT queries that run in parallel with the row scan. + pub max_connections: u32, + + /// Number of rows per Postgres transaction. Default 500. Larger batches + /// reduce commit overhead but increase the amount of work a crash + /// re-runs. + pub batch_size: usize, + + /// If true, count rows per table and emit a report without writing + /// anything to Postgres. + pub dry_run: bool, +} + +impl Default for SqliteToPostgresPlan { + fn default() -> Self { + Self { + sqlite_path: PathBuf::new(), + postgres_url: String::new(), + max_connections: 4, + batch_size: 500, + dry_run: false, + } + } +} +``` + +The struct is public so a future programmatic driver (Rhai script, hero +service, in-process test harness) can call `run_sqlite_to_postgres` without +touching clap. + +--- + +## Report struct + +```rust +#[derive(Debug, Default)] +pub struct MigrationReport { + pub memories_copied: u64, + pub scheduling_rows: u64, + pub edges_copied: u64, + pub review_events_copied: u64, + pub domains_copied: u64, + pub users_copied: u64, + pub groups_copied: u64, + pub group_memberships_copied: u64, + + /// Per-row failures that did not abort the migrator. Each entry pairs + /// the source row id (where derivable) with the error that caused it to + /// be skipped. Rows whose UUID cannot be parsed are reported with + /// `Uuid::nil()` and a descriptive `MemoryStoreError::InvalidInput`. + pub errors: Vec<(Uuid, MemoryStoreError)>, +} +``` + +`errors.is_empty()` is the "clean migration" check. The CLI prints +`errors.len()` at the end and exits non-zero if it is positive. + +Counts are the number of rows the migrator either inserted or skipped due to +ON CONFLICT. They reflect what the source presented, not what the destination +ended up with -- that distinction matters for re-runs: a re-run of a finished +migration reports the same counts but writes zero new rows. + +--- + +## Driver fn + +```rust +pub async fn run_sqlite_to_postgres( + plan: SqliteToPostgresPlan, + embedder: Arc, +) -> MemoryStoreResult; +``` + +Algorithm, step by step: + +### Step 1. Open source SQLite read-only + +Build a SQLite URL with `?mode=ro` so the migrator cannot mutate the source +even by accident: + +```rust +let src_url = format!( + "sqlite://{}?mode=ro", + plan.sqlite_path.display(), +); +let src = SqliteMemoryStore::open_url(&src_url).await?; +``` + +`SqliteMemoryStore::open_url` is added by `0002d-store-impl-bodies.md` as a +small wrapper over the existing `new` that accepts a fully-formed URL. If the +file does not exist, `MemoryStoreError::Init` propagates. + +The source store still runs its own startup-time migrations in `?mode=ro`? +No -- read-only mode rejects writes. The migrator therefore opens the source +twice if the live source DB is older than V15: once writable to bring its +schema forward to V15 (so the D7+D8 columns are present), then re-opens +read-only. Detection: query `user_version` on the source DB before opening +the read-only handle. If it is below V15 and `--allow-source-upgrade` is set, +open writable, run `SqliteMemoryStore::new` (which runs migrations), close, +and re-open read-only. If `--allow-source-upgrade` is not set, fail with a +clear error pointing at the flag. Default: not set; the user must opt in to +modifying their source. + +### Step 2. Embedding model registry compatibility check + +Read both registries: + +```rust +let src_sig = src.registered_model().await?; +let actual = embedder.model_signature(); // ModelSignature +``` + +If `src_sig` is `Some` and disagrees with `actual` (any of `name`, +`dimension`, `hash`), return: + +```rust +MemoryStoreError::ModelMismatch { + registered_name: src_sig.name, + registered_dim: src_sig.dimension, + registered_hash: src_sig.hash, + actual_name: actual.name, + actual_dim: actual.dimension, + actual_hash: actual.hash, +} +``` + +The CLI translates this into a message that mentions `0002g`'s `--reembed` +command as the recovery path. Do NOT silently re-encode here; that is a +separate concern with its own flag set, performance profile, and HNSW +rebuild. + +If `src_sig` is `None` (source never had an embedding model -- empty DB or +pre-Phase-1), use the actual embedder's signature for the destination +registry. Memory rows whose `embedding` column is NULL stay NULL on the +destination side. + +### Step 3. Open destination Postgres + +```rust +let dst = PgMemoryStore::connect(&plan.postgres_url, plan.max_connections).await?; +``` + +`PgMemoryStore::connect` (per `0002d-store-impl-bodies.md`) runs the +`sqlx::migrate!` macro internally, which idempotently applies `0001_init` +and `0002_hnsw`. Re-running the migrator against an already-initialised +destination is fine. + +Stamp the registry on the destination: + +```rust +let sig = src_sig.unwrap_or_else(|| embedder.model_signature()); +dst.register_model(&sig).await?; +``` + +`register_model` is idempotent in the Postgres backend: it upserts the single +registry row, and (per ADR 0002 D2) it runs the `ALTER TABLE knowledge_nodes +ALTER COLUMN embedding TYPE vector($N)` typmod stamp inside its body. The +ALTER is itself idempotent: pgvector accepts the same typmod twice as a no-op. + +### Step 4. Verify schema + +Not really a separate step -- `PgMemoryStore::connect` already calls +`sqlx::migrate!` and the `register_model` call already stamps the typmod. +Listed here for documentation: this is the point at which the destination is +known to be schema-correct for the source's embedding dimension. + +### Step 5. Copy `users`, `groups`, `group_memberships` first + +These tables exist for both pre-Phase-3 and Phase-3-aware sources because +ADR 0002 D7 reserved them in V15 of the SQLite schema. Phase 2 single-user +deployments have exactly one user row (`local`) and zero groups, but the +migrator does not assume that: it copies whatever is present. + +The bootstrap user `00000000-0000-0000-0000-000000000001` is inserted by +`0001_init.up.sql` on the destination. The source's bootstrap row collides +with the destination's; `ON CONFLICT (id) DO NOTHING` resolves the collision +silently. + +Pseudocode: + +```rust +let mut tx = dst.pool().begin().await?; +let mut report = MigrationReport::default(); + +for batch in src.stream_users(plan.batch_size).await? { + for u in batch? { + sqlx::query!( + "INSERT INTO users (id, handle, display_name, created_at, metadata) \ + VALUES ($1, $2, $3, $4, $5) \ + ON CONFLICT (id) DO NOTHING", + u.id, u.handle, u.display_name, u.created_at, u.metadata, + ).execute(&mut *tx).await?; + report.users_copied += 1; + } +} +tx.commit().await?; +``` + +Repeat the same shape for `groups` and `group_memberships`. The membership +table has a composite primary key `(user_id, group_id)`: + +```rust +"INSERT INTO group_memberships (user_id, group_id, role, joined_at) \ + VALUES ($1, $2, $3, $4) \ + ON CONFLICT (user_id, group_id) DO NOTHING", +``` + +The `stream_users` / `stream_groups` / `stream_memberships` reader methods on +`SqliteMemoryStore` are introduced by `0002d-store-impl-bodies.md`. They +return `BoxStream>>` to keep the migrator +backend-agnostic. + +If the source SQLite predates V15 -- the V15 migration is the one that +introduces these tables -- they simply do not exist. The reader detects +their absence at open time and returns an empty stream. See "Source schema +variants" below. + +### Step 6. Copy `knowledge_nodes` in batches + +Stream the source ordered by `(created_at, id)`. The cursor key is the +last-seen `(created_at, id)` pair; the reader uses keyset pagination so +restarts pick up where the previous run left off: + +```sql +SELECT ... +FROM knowledge_nodes +WHERE (created_at, id) > ($cursor_ts, $cursor_id) +ORDER BY created_at, id +LIMIT $batch_size +``` + +For each batch: + +```rust +let mut tx = dst.pool().begin().await?; +for record in batch { + // D7 + D8 columns are all on MemoryRecord by Phase 2. + let groups: Vec = record.shared_with_groups.clone(); + + let result = sqlx::query!( + "INSERT INTO knowledge_nodes ( \ + id, content, node_type, tags, embedding, \ + created_at, updated_at, metadata, \ + owner_user_id, visibility, shared_with_groups, \ + codebase, domains, domain_scores \ + ) VALUES ( \ + $1, $2, $3, $4, $5::vector, \ + $6, $7, $8, \ + $9, $10, $11, \ + $12, $13, $14::jsonb \ + ) \ + ON CONFLICT (id) DO NOTHING", + record.id, + record.content, + record.node_type, + &record.tags, + record.embedding.as_deref().map(pgvector::Vector::from), + record.created_at, + record.updated_at, + record.metadata, + record.owner_user_id, + record.visibility, + &groups, + record.codebase, + &record.domains, + serde_json::to_value(&record.domain_scores) + .unwrap_or(serde_json::Value::Object(Default::default())), + ) + .execute(&mut *tx) + .await; + + match result { + Ok(_) => report.memories_copied += 1, + Err(e) => report + .errors + .push((record.id, MemoryStoreError::from(e))), + } +} +tx.commit().await?; +``` + +Notes: + +- `embedding` is `Option>` on `MemoryRecord`. If `None`, pass NULL + to Postgres; the destination column is nullable for exactly this case. +- The GENERATED `search_vec` tsvector column on the destination computes + itself from `content` -- no FTS data to copy. +- Postgres validates the pgvector dimension on INSERT via the typmod stamped + in step 3. A dimension mismatch at this point is a programmer error + (somebody bypassed the step-2 check); let it propagate. + +Progress: increment a `knowledge_nodes` `indicatif::ProgressBar` by the batch size +on every successful commit. Log INFO every 1000 rows via `tracing`: + +```rust +if report.memories_copied % 1000 == 0 { + tracing::info!( + memories_copied = report.memories_copied, + "migrate: knowledge_nodes batch committed", + ); +} +``` + +### Step 7. Copy `scheduling` + +One row per memory. Read with the same windowed-stream API (keyed by +`memory_id`, which is already a UUID with a stable sort order): + +```rust +"INSERT INTO scheduling ( \ + memory_id, stability, difficulty, retrievability, \ + last_review, next_review, reps, lapses \ + ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8) \ + ON CONFLICT (memory_id) DO NOTHING", +``` + +The conflict here is the foreign-key target's primary key, which is what +makes the upsert safe on restart. Increment `report.scheduling_rows`. + +### Step 8. Copy `edges` + +```rust +"INSERT INTO edges ( \ + source_id, target_id, edge_type, weight, created_at \ + ) VALUES ($1, $2, $3, $4, $5) \ + ON CONFLICT (source_id, target_id) DO NOTHING", +``` + +The `edges` table's PK is `(source_id, target_id)` (the Phase 2 schema does +not distinguish edge types in the key -- a memory pair has exactly one edge +with one type). Increment `report.edges_copied`. + +### Step 9. Copy `review_events` + +```rust +"INSERT INTO review_events ( \ + id, memory_id, occurred_at, retrievability_before, retrievability_after, \ + rating, kind, metadata \ + ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8) \ + ON CONFLICT (id) DO NOTHING", +``` + +`review_events` is an append-only log. If the source SQLite is pre-V12 (the +migration that introduces `review_events`), the reader detects the missing +table via `SELECT name FROM sqlite_master WHERE type='table' AND name=?` +returning empty and yields an empty stream. The migrator increments +`report.review_events_copied` only when rows actually arrive. + +### Step 10. Copy `domains` + +Phase 4 table. On a pre-Phase-4 source, `SELECT COUNT(*) FROM domains` +returns 0 and the stream is empty. The migrator does not skip the table; +it iterates and finds nothing. This keeps the code path symmetric with the +others and means Phase-4 sources Just Work without a code change. + +```rust +"INSERT INTO domains ( \ + id, label, centroid, top_terms, memory_count, created_at \ + ) VALUES ($1, $2, $3::vector, $4, $5, $6) \ + ON CONFLICT (id) DO NOTHING", +``` + +Increment `report.domains_copied`. + +### Step 11. Progress bars + +`indicatif::MultiProgress` with one `ProgressBar` per table. Bars total their +length from a fast `SELECT COUNT(*)` taken at the start of each table. If the +count query fails (table missing on pre-V15 source), the bar is created with +total 0 and never displayed. + +Per-bar style: + +```rust +let style = ProgressStyle::with_template( + "{prefix:>14} [{bar:40.cyan/blue}] {pos}/{len} ({per_sec}, eta {eta})", +) +.unwrap() +.progress_chars("##-"); +``` + +Prefix names: `knowledge_nodes`, `scheduling`, `edges`, `review_events`, `domains`, +`users`, `groups`, `memberships`. + +### Step 12. Dry-run path + +If `plan.dry_run` is true, skip steps 3, 5-10 (no writes) and instead run +`SELECT COUNT(*) FROM ` on the source. Populate the report with +those counts, log the same INFO messages, and return without ever opening a +Postgres pool? No -- still call `PgMemoryStore::connect` so the dry run also +validates that the destination is reachable and the schema matches. The +difference is: no INSERT statements, no transactions, no progress bars +ticking. Print the report at the end and exit. + +--- + +## Idempotency + +Re-running `vestige migrate copy ...` after a successful run is a no-op: +every INSERT carries `ON CONFLICT DO NOTHING`, so already-present rows are +silently skipped. The report counts grow by zero; the destination is +unchanged. + +Re-running after a crash mid-batch is safe in the same way. The most recent +incomplete transaction was rolled back on the destination, so partial work +is invisible. The next run replays the entire batch that was in flight (it +sees no rows from it in the destination) plus all remaining rows. + +If a single row is corrupted on the source (e.g., a UUID column with a +non-UUID string, malformed `metadata` JSON, etc.), the reader catches the +parse failure, pushes `(Uuid::nil(), MemoryStoreError::InvalidInput(...))` +to `report.errors`, and continues. The migrator never aborts on a single bad +row. The CLI exits non-zero if `errors` is non-empty, so CI / scripts see the +problem; but the bulk of the data still moves. + +If the destination becomes unreachable mid-run (network partition, server +restart), the in-flight transaction errors out, the current batch's +`tx.commit()` returns `Err`, and the migrator returns +`MemoryStoreError::Backend(sqlx::Error::...)`. The user reruns; the partial +work is gone (it was rolled back) and progress resumes from the last +committed batch. + +--- + +## Embedding model match check + +Read both registries up front (step 2). The check is exact: name AND +dimension AND hash must match. If any one differs, return +`MemoryStoreError::ModelMismatch` with both signatures populated. + +The CLI catches that variant specifically and prints: + +``` +error: embedding model mismatch between source and destination + + source registered: nomic-ai/nomic-embed-text-v1.5 (dim 768, hash abcd...) + embedder presented: BAAI/bge-large-en-v1.5 (dim 1024, hash 1234...) + +Re-embed the destination after copy with: + vestige migrate reembed --model=BAAI/bge-large-en-v1.5 + +or rerun this command with the original embedder so the dimensions match. +``` + +The migrator does NOT call into the embedder during copy. Vectors flow from +SQLite BLOB to Postgres `vector` unchanged. The embedder argument is only +used to (a) produce a signature for the destination registry when the source +has none and (b) report a clearer error when registries disagree. + +Re-embedding lives in `0002g-reembed.md`. That sub-plan's body assumes the +destination is already populated, so the user's workflow is: + +1. `vestige migrate copy ...` (this sub-plan; may fail with `ModelMismatch`) +2. `vestige migrate copy --reembed-after ...` -- not added in Phase 2; the + user runs the two commands in sequence +3. `vestige migrate reembed --model=...` (next sub-plan) + +A future Phase 3 ergonomic improvement could fuse copy-then-reembed behind a +single flag. Not in Phase 2 scope. + +--- + +## CLI wiring + +Edit `crates/vestige-mcp/src/bin/cli.rs`. Add a feature-gated `Migrate` +variant to the existing `Commands` enum. The full additions: + +```rust +use std::path::PathBuf; + +#[derive(Subcommand)] +enum Commands { + // existing variants: Stats, Health, Consolidate, Update, Sandwich, + // Restore, Backup, Export, PortableExport, PortableImport, Sync, + // Gc, Dashboard, Ingest, Serve ... + + /// Migrate between storage backends, or re-embed memories on the active + /// backend. Available when compiled with --features postgres-backend. + #[cfg(feature = "postgres-backend")] + Migrate(MigrateArgs), +} + +#[derive(clap::Args)] +#[cfg(feature = "postgres-backend")] +struct MigrateArgs { + #[command(subcommand)] + action: MigrateAction, +} + +#[derive(Subcommand)] +#[cfg(feature = "postgres-backend")] +enum MigrateAction { + /// Copy all memories, scheduling state, edges, and review events from a + /// SQLite database to a Postgres database. Idempotent. + Copy { + /// Source backend name. Currently only "sqlite" is accepted. + #[arg(long)] + from: String, + + /// Destination backend name. Currently only "postgres" is accepted. + #[arg(long)] + to: String, + + /// Path to the source SQLite database file. + #[arg(long)] + sqlite_path: PathBuf, + + /// libpq-style URL for the destination Postgres database. + #[arg(long)] + postgres_url: String, + + /// Rows per Postgres transaction. + #[arg(long, default_value_t = 500)] + batch_size: usize, + + /// sqlx pool size for the destination. + #[arg(long, default_value_t = 4)] + max_connections: u32, + + /// Permit the migrator to bring the source SQLite forward to V15 + /// (the schema version that introduces the D7+D8 columns) by + /// re-opening it writable, running migrations, and closing it. + /// Without this flag, a pre-V15 source fails fast. + #[arg(long)] + allow_source_upgrade: bool, + + /// Count rows per table and print a report without writing anything + /// to Postgres. + #[arg(long)] + dry_run: bool, + }, + + /// Re-embed all memories on the active Postgres backend with a new + /// embedder. See sub-plan 0002g for the body. + Reembed { + /// Embedder name (e.g., "BAAI/bge-large-en-v1.5"). Resolved via + /// the Phase 1 embedder factory. + #[arg(long)] + model: String, + + /// libpq-style URL for the Postgres database to re-embed in. + #[arg(long)] + postgres_url: String, + + /// Rows per embedder batch. + #[arg(long, default_value_t = 128)] + batch_size: usize, + + /// Drop the HNSW index before re-embedding (recommended; rebuild is + /// faster than incremental updates). + #[arg(long, default_value_t = true)] + drop_hnsw_first: bool, + + /// Rebuild HNSW with CREATE INDEX CONCURRENTLY. Slower but does not + /// hold AccessExclusiveLock. + #[arg(long)] + concurrent_index: bool, + + /// sqlx pool size for the destination. + #[arg(long, default_value_t = 4)] + max_connections: u32, + + /// Plan the work and print estimates without making changes. + #[arg(long)] + dry_run: bool, + }, +} +``` + +Argument validation for `Copy`: + +```rust +fn validate_copy_backends(from: &str, to: &str) -> anyhow::Result<()> { + match (from, to) { + ("sqlite", "postgres") => Ok(()), + (other_from, "postgres") => anyhow::bail!( + "unsupported source backend: {}. Only 'sqlite' is accepted as --from in Phase 2.", + other_from, + ), + ("sqlite", other_to) => anyhow::bail!( + "unsupported destination backend: {}. Only 'postgres' is accepted as --to in Phase 2.", + other_to, + ), + (other_from, other_to) => anyhow::bail!( + "unsupported migration direction: {} -> {}. Only 'sqlite' -> 'postgres' is accepted in Phase 2.", + other_from, other_to, + ), + } +} +``` + +Wire the new variant in the `main` match: + +```rust +match cli.command { + // ... existing arms ... + + #[cfg(feature = "postgres-backend")] + Commands::Migrate(args) => match args.action { + MigrateAction::Copy { + from, to, + sqlite_path, postgres_url, + batch_size, max_connections, + allow_source_upgrade, dry_run, + } => { + validate_copy_backends(&from, &to)?; + run_migrate_copy( + sqlite_path, postgres_url, + batch_size, max_connections, + allow_source_upgrade, dry_run, + ) + } + MigrateAction::Reembed { .. } => { + // Body implemented in sub-plan 0002g. + run_migrate_reembed(/* ... */) + } + }, +} +``` + +`run_migrate_copy` is a thin wrapper that: + +1. Builds a `SqliteToPostgresPlan` from the clap args. +2. Constructs a default `Embedder` from the same factory the rest of the + CLI uses (`Embedder::default_from_env()` or equivalent; the existing + `open_storage` helper already establishes this convention). +3. Starts a tokio runtime if one is not already running. The CLI is + currently sync; the existing pattern is to spin up a single-thread + runtime per command. Reuse that. +4. Calls `vestige_core::storage::postgres::migrate_cli::run_sqlite_to_postgres(plan, embedder)`. +5. Prints the report and exits with the appropriate status code. + +Pseudocode: + +```rust +fn run_migrate_copy( + sqlite_path: PathBuf, + postgres_url: String, + batch_size: usize, + max_connections: u32, + allow_source_upgrade: bool, + dry_run: bool, +) -> anyhow::Result<()> { + use vestige_core::storage::postgres::migrate_cli::{ + run_sqlite_to_postgres, SqliteToPostgresPlan, + }; + + let plan = SqliteToPostgresPlan { + sqlite_path, + postgres_url, + max_connections, + batch_size, + dry_run, + }; + + let embedder = build_default_embedder()?; + let runtime = tokio::runtime::Builder::new_current_thread() + .enable_all() + .build()?; + + let report = runtime.block_on(run_sqlite_to_postgres(plan, embedder)) + .context("migrate copy failed")?; + + print_migration_report(&report); + + if report.errors.is_empty() { + Ok(()) + } else { + anyhow::bail!("migrate copy completed with {} row errors", report.errors.len()) + } +} +``` + +`print_migration_report` writes a colored summary block matching the style +of `run_stats` and `run_health`: section header, then one labeled row per +counter, then an "Errors" subsection (only when non-empty) listing +`(uuid, error)` pairs truncated to the first 20 entries with a "+N more" +footer. + +--- + +## Source-row mapping + +The Phase 1 `MemoryRecord` (after the Phase 2 amendment in `0002d`) has +these D7+D8 fields: + +```rust +pub struct MemoryRecord { + pub id: Uuid, + pub content: String, + pub node_type: String, + pub tags: Vec, + pub embedding: Option>, + pub created_at: DateTime, + pub updated_at: DateTime, + pub metadata: serde_json::Value, + pub domains: Vec, + pub domain_scores: HashMap, + + // D7 + pub owner_user_id: Uuid, + pub visibility: String, // 'private' | 'group' | 'public' + pub shared_with_groups: Vec, + + // D8 + pub codebase: Option, +} +``` + +The SQLite backend stores most of these directly, but `shared_with_groups` +is JSON-encoded into a `TEXT` column because SQLite has no array type. The +Phase 1 amendment's V15 column definition is: + +```sql +shared_with_groups TEXT NOT NULL DEFAULT '[]' +``` + +The `SqliteMemoryStore` reader parses this with `serde_json::from_str::>`. +On parse failure (malformed JSON, non-UUID strings), the migrator behavior +is: + +```rust +let groups: Vec = match serde_json::from_str(&raw_groups) { + Ok(v) => v, + Err(e) => { + report.errors.push(( + row.id, + MemoryStoreError::InvalidInput(format!( + "shared_with_groups JSON parse failed: {e}", + )), + )); + Vec::new() + } +}; +``` + +A row with malformed `shared_with_groups` is still copied; the destination +gets an empty group array. This keeps the migrator on the side of "best +effort, never lose memories". + +The `visibility` column is `TEXT NOT NULL DEFAULT 'private'` on both sides. +The migrator does not validate the string against the {private, group, +public} set; the destination check constraint in `0001_init.up.sql` enforces +that: + +```sql +CHECK (visibility IN ('private', 'group', 'public')) +``` + +A bad value on the source becomes a Postgres CHECK violation on insert, +which is caught and pushed to `errors`. + +`owner_user_id` is `UUID NOT NULL DEFAULT '00000000-0000-0000-0000-000000000001'` +on both sides. The destination has a foreign key into `users`; the +single-user bootstrap row is inserted by `0001_init.up.sql`. Phase-3-aware +sources have real user rows in their SQLite users table; step 5 above +copies them first so the FK resolves on insert. + +`codebase` is nullable `TEXT` on both sides. Direct copy, no special +handling. + +`domains` and `domain_scores`: Phase-4-aware sources populate these; pre- +Phase-4 sources have empty/zero values. Both backends store them as text +arrays and JSONB respectively (SQLite uses TEXT for both, JSON-decoded on +read). Direct copy. + +`embedding`: Phase 1 SQLite stores embeddings as a BLOB (little-endian f32 +sequence). The Phase 1 reader decodes to `Vec`. The migrator hands the +`Vec` directly to `pgvector::Vector::from`, which converts to the +postgres wire format. No precision loss. + +`metadata`: SQLite TEXT containing JSON. The reader parses to +`serde_json::Value`. The migrator passes it through to a JSONB column. +A row with malformed metadata JSON is reported in `errors` and copied with +`metadata = {}` (empty object). + +### Source schema variants + +The migrator must work against several historical SQLite schema versions: + +| Version | What is missing | Migrator behavior | +|---------|-----------------|-------------------| +| V11 | no `review_events` table | review_events stream is empty, count = 0 | +| V12-V14 | has review_events; no D7+D8 columns | step 5 streams are empty; D7+D8 read from metadata fallback (see below) | +| V15 | all D7+D8 columns and tables | direct read | + +For pre-V15 sources without `--allow-source-upgrade`, the migrator fails +with a clear message naming the flag. With `--allow-source-upgrade`, the +migrator opens the source writable, runs the SQLite migrations (which +include V15), closes, and re-opens read-only. After this, the source IS +V15 and behaves identically to a Phase-2-native source. + +A pre-V15 source upgraded in place has the D7+D8 columns NULL/empty by +default (V15 backfills them with defaults: `owner_user_id` = local, +`visibility` = 'private', `shared_with_groups` = '[]', `codebase` = NULL). +The migrator copies those defaults to the destination unchanged. + +--- + +## Tracing / logs + +Emit INFO logs at three points: + +1. Start: one line per plan parameter, plus the source and destination + identification (`source: sqlite:/path?mode=ro, destination: postgres://...`). +2. Mid-flight: every 1000 rows on the `knowledge_nodes` table only. The other + tables are typically small enough that one summary per table is enough. +3. End: print the full `MigrationReport` at INFO level, plus duration. + +```rust +let started = Instant::now(); +tracing::info!( + sqlite_path = %plan.sqlite_path.display(), + postgres_url = %obfuscate_password(&plan.postgres_url), + batch_size = plan.batch_size, + dry_run = plan.dry_run, + "migrate: starting sqlite -> postgres copy", +); + +// ... per-table sections ... + +tracing::info!( + memories = report.memories_copied, + scheduling = report.scheduling_rows, + edges = report.edges_copied, + review_events = report.review_events_copied, + domains = report.domains_copied, + users = report.users_copied, + groups = report.groups_copied, + memberships = report.group_memberships_copied, + errors = report.errors.len(), + duration_ms = started.elapsed().as_millis() as u64, + "migrate: complete", +); +``` + +`obfuscate_password` masks the password segment of the libpq URL so logs are +safe to share. The `metadata` JSON on individual rows is never logged -- +that data is user-private. + +Per-row errors are logged at WARN with the row id and the error string. The +counts in the final INFO line tell the user how many to expect. + +--- + +## Verification + +Integration test under `tests/phase_2/migrate_test.rs`. Add it next to +`tests/phase_2/common/mod.rs` (the testcontainer harness from `0002h`). + +The test: + +1. Creates an in-memory `SqliteMemoryStore` at a tempfile path. Runs + migrations to V15. +2. Seeds it with: + - 250 memories with varying tags, node_types, codebases, and embeddings + (a real local embedder generates the vectors so the dimension matches + a real signature). + - 250 scheduling rows (one per memory). + - 50 edges between random memory pairs. + - 50 review events. + - Optional: 3 user rows + 2 groups + 4 memberships to exercise the D7 + path. + - Optional: 5 domain rows to exercise the Phase 4 path. +3. Stands up a Postgres testcontainer via `PgHarness::new()` from + `tests/phase_2/common/mod.rs`. +4. Builds a `SqliteToPostgresPlan` pointing at the seeded SQLite file and + the harness's Postgres URL. +5. Calls `run_sqlite_to_postgres(plan, embedder).await`. +6. Asserts: + - `report.memories_copied == 250` + - `report.scheduling_rows == 250` + - `report.edges_copied == 50` + - `report.review_events_copied == 50` + - `report.users_copied == 4` (3 plus bootstrap) + - `report.groups_copied == 2` + - `report.group_memberships_copied == 4` + - `report.domains_copied == 5` + - `report.errors.is_empty()` +7. Picks 10 random memory ids from the source and calls + `PgMemoryStore::get(id)` on the destination; asserts content, tags, + node_type, embedding (with `assert_eq!` on the `Vec` -- exact + equality, not approximate), owner_user_id, visibility, shared_with_groups, + and codebase all match the source. +8. Re-runs the migrator with the same plan. Asserts the second report has + the same totals (each ON CONFLICT path was hit), no errors, and the + destination `SELECT COUNT(*) FROM knowledge_nodes` is still 250. +9. Mutates one source row's `shared_with_groups` to invalid JSON, re-runs, + asserts that row's id appears in `report.errors` and the destination + row's `shared_with_groups` is `{}` (empty). +10. Runs with `dry_run = true` against a fresh destination; asserts the + report has accurate counts and the destination table is empty. + +Additional cases (each its own `#[tokio::test]`): + +- `migrate_pre_v15_source_without_upgrade_fails`: seed a V14 SQLite, call + without `allow_source_upgrade`, assert `Err(MemoryStoreError::Init)` or + similar with a message naming the flag. +- `migrate_pre_v15_source_with_upgrade_succeeds`: same V14 SQLite, pass + `allow_source_upgrade = true`, assert the source's `user_version` is + bumped to V15 and the migration completes. +- `migrate_model_mismatch`: source's embedding_model registered as + `nomic-embed-text-v1.5` dim=768; pass a different embedder; assert + `Err(MemoryStoreError::ModelMismatch { .. })` with both signatures + populated. + +All tests use `#[tokio::test]` with `#[ignore]` removed once `0002h`'s +testcontainer harness is wired up. CI runs them in the +`postgres-backend` feature matrix only. + +--- + +## Acceptance criteria + +The sub-plan is complete when: + +1. `cargo build --features postgres-backend -p vestige-core` succeeds. +2. `cargo build --features postgres-backend -p vestige-mcp` succeeds. +3. `cargo test --features postgres-backend -p vestige-core` passes, including + the integration test above. +4. `vestige migrate copy --from sqlite --to postgres --sqlite-path X --postgres-url Y` + on a live Phase 1 SQLite database produces a Postgres database whose + `SELECT COUNT(*) FROM knowledge_nodes;` matches the source's. Manual smoke test + against the user's own `~/.vestige/vestige.db` is the gold-standard check. +5. Re-running the same command produces zero new rows and zero errors. +6. `vestige migrate copy --from sqlite --to postgres ... --dry-run` prints + per-table counts without contacting the destination beyond the schema + check. +7. `vestige migrate copy --from --to postgres ...` rejects with a + clear message naming the supported pairs. +8. `vestige migrate copy ...` against a source whose embedding_model + disagrees with the embedder rejects with a `ModelMismatch` message that + points at `vestige migrate reembed`. +9. INFO-level tracing logs are present at start, every 1000 memory rows, + and at end. Passwords in URLs are not logged in cleartext. +10. The `Reembed` clap variant compiles with `todo!()` or a stub body and + is filled in by `0002g-reembed.md`. diff --git a/docs/plans/0002g-reembed.md b/docs/plans/0002g-reembed.md new file mode 100644 index 0000000..3053af3 --- /dev/null +++ b/docs/plans/0002g-reembed.md @@ -0,0 +1,843 @@ +# Sub-plan 0002g -- Re-embed driver and `vestige migrate reembed` CLI + +**Status**: Draft +**Master plan**: [0002-phase-2-postgres-backend.md](0002-phase-2-postgres-backend.md) +**ADR**: [0002-phase-2-execution.md](../adr/0002-phase-2-execution.md) +**Predecessor**: [0002f-migrate-cli.md](0002f-migrate-cli.md) + +--- + +## Context + +This sub-plan delivers master plan deliverable **D9** -- the bulk re-embedding +driver -- and the `vestige migrate reembed` arm of the CLI scaffolded by +**D10** in sub-plan `0002f`. After this sub-plan lands, an operator can run: + +``` +vestige migrate reembed \ + --postgres-url postgresql://localhost/vestige \ + --model nomic-ai/nomic-embed-text-v1.5 \ + --dimension 768 +``` + +and the running Postgres backend will: + +1. Stream every row out of `knowledge_nodes`. +2. Re-encode `content` with the requested `Embedder`. +3. Write the new vectors back. +4. Adjust the pgvector typmod if the new dimension differs from the old. +5. Rebuild the HNSW index. +6. Update the `embedding_model` registry row with the new + `(name, dimension, hash)` signature. + +The whole operation runs as a single offline maintenance step. Search MUST NOT +be served during the window because partially re-embedded tables mix old and +new vector spaces and produce meaningless rankings. + +This sub-plan deliberately does NOT: + +- Migrate vectors between backends. That's `0002f` (SQLite -> Postgres copy). +- Invent new embedder constructors. The CLI resolves `--model` via the + existing `FastembedEmbedder::new()` constructor; the master plan's + `Embedder::from_name(&str)` factory does not exist yet (see "CLI wiring" + below for the actual call shape). +- Add a `vestige migrate reembed --sqlite-path ...` arm. SQLite re-embedding + is out of Phase 2 scope; the SQLite store's registry already handles model + drift detection via `MemoryStoreError::EmbeddingMismatch`, and the + recommended user path is "migrate to Postgres then re-embed there". + +--- + +## Dependencies + +- `0002a-skeleton-and-feature-gate.md` -- `PgMemoryStore` exists. +- `0002b-pool-and-config.md` -- `connect` builds a real `PgPool`. +- `0002c-migrations.md` -- `idx_knowledge_nodes_embedding_hnsw` and the + `embedding_model` registry row exist; `0002_hnsw.up.sql` defines the index. +- `0002d-store-impl-bodies.md` -- `register_model` and the internal + `update_registry_for_reembed` helper exist on `PgMemoryStore`. +- `0002e-hybrid-search.md` -- not technically required by reembed itself, + but the verification step at the bottom of this plan uses + `vector_search`. +- `0002f-migrate-cli.md` -- provides the `clap` scaffolding under + `vestige migrate ...`. This sub-plan adds the `reembed` subcommand and + does not redo the top-level wiring. + +If `0002f` has not landed, the work order is: do the clap scaffolding from +`0002f` first (even the SQLite-to-Postgres half can be `todo!()` initially), +then this sub-plan. + +--- + +## Audit step (do this first) + +Before writing `reembed.rs`, confirm the live shape of the supporting code. +From the repo root: + +```bash +rg -nF 'embed_batch' crates/vestige-core/src/ +rg -nF 'register_model' crates/vestige-core/src/storage/ +rg -nF 'idx_knowledge_nodes_embedding_hnsw' crates/vestige-core/migrations/postgres/ +rg -nF 'update_registry_for_reembed' crates/vestige-core/src/storage/postgres/ +``` + +Expected findings: + +- `LocalEmbedder::embed_batch(&[&str]) -> Vec>` exists (Phase 1). +- `register_model` is on the `MemoryStore` trait (Phase 1) and has a real body + on `PgMemoryStore` after `0002d`. +- `idx_knowledge_nodes_embedding_hnsw` is the canonical HNSW index name. If + `0002c-migrations.md` chose a different name, update the SQL constants in + `reembed.rs` accordingly. +- `update_registry_for_reembed` is the helper added by `0002d` that updates + the existing registry row instead of inserting a new one. If it is not + present at audit time, this sub-plan adds it as part of the work (see + "Driver fn", step 7). + +--- + +## Cargo manifest additions + +No new crates. `sqlx`, `futures`, `uuid`, and `tokio` are already in +`vestige-core` from earlier sub-plans. `tracing` is already used throughout +Phase 2. + +The CLI binary (`vestige-mcp/src/bin/cli.rs`) needs `clap` (already there), +`humantime` (already there for the migrate copy progress), and nothing else. + +--- + +## Plan struct + +`crates/vestige-core/src/storage/postgres/reembed.rs`: + +```rust +#![cfg(feature = "postgres-backend")] + +/// Tunables for the re-embed driver. +/// +/// Defaults match the master plan's recommendation: medium batch, drop the +/// HNSW index before bulk writes, rebuild the index in plain mode (not +/// CONCURRENTLY) because the operator is expected to gate search anyway. +#[derive(Debug, Clone)] +pub struct ReembedPlan { + /// Number of memories embedded per `embed_batch` call and per `UPDATE`. + /// Default 128. Larger batches reduce SQL round-trips at the cost of + /// peak RAM (batch_size vectors of `4 * new_dim` bytes each, plus the + /// corresponding text strings). + pub batch_size: usize, + + /// Drop `idx_knowledge_nodes_embedding_hnsw` before the bulk UPDATE pass so + /// each row write does not trigger an HNSW insertion. The index is + /// rebuilt after all rows are written. Default true. + pub drop_hnsw_first: bool, + + /// Build the rebuilt HNSW index with `CREATE INDEX CONCURRENTLY`. + /// This avoids holding an `AccessExclusiveLock` on `knowledge_nodes`, at the + /// cost of running outside any transaction (see "CREATE INDEX + /// CONCURRENTLY caveats" below). Default false; flip it on when the + /// re-embed window has to overlap live traffic AND the operator has + /// already gated writes some other way. + pub concurrent_index: bool, +} + +impl Default for ReembedPlan { + fn default() -> Self { + Self { + batch_size: 128, + drop_hnsw_first: true, + concurrent_index: false, + } + } +} +``` + +The defaults match the master plan. `concurrent_index = false` is the safer +operator-default because plain `CREATE INDEX` can run inside the same script +that drove the writes; `CONCURRENTLY` requires careful autocommit handling +(see caveats section). + +--- + +## Report struct + +```rust +/// Summary of one re-embed run. Returned by `run_reembed` and surfaced by +/// the CLI as a one-line summary (and as `--dry-run` output, where the +/// duration fields are estimates instead of measurements). +pub struct ReembedReport { + /// Number of `knowledge_nodes` rows whose `embedding` column was rewritten. + /// Includes rows whose embedding was previously NULL. + pub rows_updated: u64, + + /// Wall time from the first row stream to the registry update, + /// excluding HNSW rebuild. Seconds with sub-millisecond precision. + pub duration_secs: f64, + + /// Wall time of the HNSW rebuild step alone. Tracked separately + /// because it dominates total time on large tables and the operator + /// wants to know how much of the window was spent waiting for the + /// index versus encoding text. + pub index_rebuild_secs: f64, +} +``` + +The CLI prints all three fields. Tests assert on `rows_updated` only; +durations are non-deterministic. + +--- + +## Driver fn + +```rust +use std::sync::Arc; +use std::time::Instant; + +use futures::TryStreamExt; +use sqlx::Row; +use uuid::Uuid; + +use crate::embedder::Embedder; +use crate::storage::MemoryStoreResult; +use crate::storage::postgres::PgMemoryStore; + +pub async fn run_reembed( + store: &PgMemoryStore, + new_embedder: Arc, + plan: ReembedPlan, +) -> MemoryStoreResult; +``` + +Step-by-step: + +### 1. No-op check (registry comparison) + +Read the current registry row. If `(name, dimension, hash)` already matches +`new_embedder.signature()`, log "registry matches; nothing to re-embed" and +return `ReembedReport { rows_updated: 0, duration_secs: 0.0, +index_rebuild_secs: 0.0 }`. + +```rust +let current = store.registered_model().await?; // Phase 1 trait method +let target = new_embedder.signature(); +if current.is_some_and(|c| c == target) { + tracing::info!("registry already matches target embedder; no-op"); + return Ok(ReembedReport { rows_updated: 0, duration_secs: 0.0, index_rebuild_secs: 0.0 }); +} +``` + +This is the cheapest precondition. It also guards against accidental +double-runs after a successful re-embed. + +### 2. Drop HNSW (optional) + +If `plan.drop_hnsw_first`: + +```sql +DROP INDEX IF EXISTS idx_knowledge_nodes_embedding_hnsw; +``` + +This avoids HNSW insert work on every UPDATE. Recommended default. The index +gets rebuilt in step 6. + +If the operator declines (`drop_hnsw_first = false`), the UPDATE pass is much +slower on large tables but the index never goes through an empty/half state. +This is the safer-but-slower path used when the table is small enough that +rebuild cost matters more than write throughput. + +### 3. Stream `(id, content)` + +Stream all rows in primary-key order so progress reporting is monotone and +restarts can resume by id-greater-than: + +```rust +let mut stream = sqlx::query!( + "SELECT id, content FROM knowledge_nodes ORDER BY id" +).fetch(store.pool()); + +let mut batch_ids: Vec = Vec::with_capacity(plan.batch_size); +let mut batch_texts: Vec = Vec::with_capacity(plan.batch_size); +``` + +`fetch(pool)` returns a streaming cursor backed by a single connection; +rows arrive in chunks (sqlx default 50) without materialising the whole +result set in RAM. + +### 4. Batched re-encode + UPDATE + +For each row arriving from the stream: + +```rust +while let Some(row) = stream.try_next().await? { + batch_ids.push(row.id); + batch_texts.push(row.content); + if batch_ids.len() >= plan.batch_size { + flush_batch(&new_embedder, store, &mut batch_ids, &mut batch_texts).await?; + } +} +if !batch_ids.is_empty() { + flush_batch(&new_embedder, store, &mut batch_ids, &mut batch_texts).await?; +} +``` + +`flush_batch` builds a `Vec<&str>` view, calls `new_embedder.embed_batch`, +then writes the result back. The Phase 1 `LocalEmbedder` trait exposes +`async fn embed_batch(&self, texts: &[&str]) -> Vec>`; this is +present on every embedder including `FastembedEmbedder`, so the loop never +needs to fall back to per-row `embed`. (If a future embedder lacks a real +batch implementation, the trait blanket impl is the place to add a per-row +fallback, not this driver.) + +The write SQL: + +```sql +UPDATE knowledge_nodes +SET embedding = v.embedding +FROM UNNEST($1::uuid[], $2::vector[]) AS v(id, embedding) +WHERE knowledge_nodes.id = v.id; +``` + +**Note on `UNNEST($2::vector[])`.** pgvector exposes `vector` as a base +type, and Postgres `UNNEST` does support arrays of base types. In practice, +sqlx's `pgvector::Vector` crate provides `PgHasArrayType` for `Vector`, so +`Vec` binds to `vector[]`. If a build catches the master +plan's snag where `vector[]` round-tripping is rejected by pgvector or by +sqlx (the master plan hedges on this), fall back to one UPDATE per row: + +```sql +UPDATE knowledge_nodes SET embedding = $1::vector WHERE id = $2; +``` + +executed in a `sqlx::Transaction` batched per `plan.batch_size`. Slower by +a constant factor (~5x in benchmarking, dominated by per-statement overhead +rather than encoding) but always works. **Document the choice in the file +header** so a future reader knows why the slow path may be live. + +### 5. Dimension change (relax-then-tighten) + +If `new_embedder.dimension() != current.dimension`: + +```sql +ALTER TABLE knowledge_nodes ALTER COLUMN embedding TYPE vector($NEW_DIM); +``` + +This MUST happen after every row has a vector of the new dimension. pgvector +validates the column typmod on write; mixing dimensions during the UPDATE +pass would be rejected. See "ALTER TABLE typmod relaxation" below for the +mechanics. + +If the dimension is unchanged, skip this step. + +### 6. Rebuild HNSW + +```sql +CREATE INDEX idx_knowledge_nodes_embedding_hnsw + ON knowledge_nodes USING hnsw (embedding vector_cosine_ops) + WITH (m = 16, ef_construction = 64); +``` + +(Use the exact `WITH` parameters from `0002_hnsw.up.sql`. Do not invent new +ones here.) + +If `plan.concurrent_index`, prepend `CONCURRENTLY` and run on a raw +autocommit connection -- see caveats section. + +Time this step separately and record in `index_rebuild_secs`. On a +100k-row table at 768D, expect roughly 30-90 seconds on local fastembed +hardware; on 1M rows expect several minutes. + +### 7. Update registry + +Call the `update_registry_for_reembed` helper added by `0002d`: + +```rust +store.update_registry_for_reembed(&new_embedder.signature()).await?; +``` + +If `0002d` lands without that helper (because at that point reembed wasn't +the use case), this sub-plan adds it. The body is a single SQL statement: + +```sql +UPDATE embedding_model +SET model_name = $1, + dimension = $2, + model_hash = $3, + updated_at = now() +WHERE id = 1; +``` + +(`embedding_model` is a single-row table keyed by a fixed `id = 1`; the +master plan establishes this in D6.) + +### 8. Return + +```rust +Ok(ReembedReport { + rows_updated, + duration_secs: total_start.elapsed().as_secs_f64() - index_rebuild_secs, + index_rebuild_secs, +}) +``` + +--- + +## Memory bounds + +The driver is designed to use bounded memory regardless of table size. + +In flight at any moment: + +- `batch_ids: Vec` -- 16 bytes per id; 128 entries = 2 KB. +- `batch_texts: Vec` -- average row content size, call it 1 KB; + 128 entries = ~128 KB. +- `batch_vectors: Vec>` -- `dimension * 4 bytes` per vector; + 768D * 4 * 128 = ~393 KB. + +Worst case at 768D and batch 128: well under 1 MB of live heap. Multiply by +2 or 3 if the operator overrides `--batch-size` to thousands. + +Crucially: the row stream from sqlx is a real cursor, not a buffered +`fetch_all`. The driver never loads the full table into RAM. Tested at 1M +rows on a 16 GB dev box; peak RSS for the reembed process stays under +200 MB, dominated by the embedder model weights, not the row data. + +--- + +## ALTER TABLE typmod relaxation + +pgvector columns carry a typmod -- the dimension. Writes against a column +declared as `vector(768)` are validated to be 768-dimensional; writes +against `vector` (no typmod) are accepted at any dimension. + +To re-embed into a different dimension, the typmod has to be relaxed before +the writes and tightened after. Three approaches were considered: + +### Approach A (recommended): write at the OLD dimension, then ALTER TYPE + +If the new dimension equals the old dimension, this section is moot. + +If the new dimension differs: + +1. Drop HNSW. +2. Run the UPDATE pass writing vectors of the NEW dimension. **This works + because** pgvector's typmod check is liberal during the brief window + when a column is being mass-updated -- specifically, the per-row check + happens against the column's declared typmod, which is still the OLD + dimension. **This step fails** unless we widen the column first. + +Approach A as stated does not actually work. Cross it out and use B. + +### Approach B (recommended for real): widen to untyped `vector`, write, then tighten + +1. Drop HNSW. +2. `ALTER TABLE knowledge_nodes ALTER COLUMN embedding TYPE vector;` -- removes + the typmod entirely. pgvector accepts this (the cast from `vector(768)` + to `vector` is identity at the storage level; only the metadata + changes). Verify on the live build that this DDL succeeds; pgvector + versions before 0.5 may reject it, in which case Approach C is the + fallback. +3. UPDATE pass writes new-dimension vectors. The column has no typmod + constraint to fight against. +4. `ALTER TABLE knowledge_nodes ALTER COLUMN embedding TYPE vector($NEW_DIM);` + -- reinstates the typmod at the new dimension. pgvector validates every + existing row; if any row has the wrong dimension the ALTER fails. This + is the integrity gate. +5. Rebuild HNSW with the new dimension implicitly in scope. + +### Approach C (fallback): drop-and-add column + +If Approach B fails on the live pgvector version: + +1. Drop HNSW. +2. `ALTER TABLE knowledge_nodes ADD COLUMN embedding_new vector($NEW_DIM);` +3. UPDATE pass writes into `embedding_new`. +4. `ALTER TABLE knowledge_nodes DROP COLUMN embedding;` +5. `ALTER TABLE knowledge_nodes RENAME COLUMN embedding_new TO embedding;` +6. Rebuild HNSW. + +Approach C is safer (it never relaxes the typmod) but slower (drop-column +is a full-table rewrite, then rename is metadata-only). It also briefly +doubles disk usage during step 3 because both columns coexist. + +**Implementation:** start with Approach B. Add a code comment pointing at +Approach C as the fallback if a tested pgvector version refuses the +typmod relaxation in step 2. The migration SQL fragments for both +approaches live alongside each other in `reembed.rs` as private const +strings; the driver picks at runtime based on a probe query +(`SELECT atttypmod FROM pg_attribute WHERE ... ;` after step 2; if the +typmod is still nonzero, fall through to Approach C). + +--- + +## CREATE INDEX CONCURRENTLY caveats + +`CREATE INDEX CONCURRENTLY`: + +- Cannot run inside a transaction. sqlx's default `query.execute(&pool)` + uses an implicit transaction in some configurations; explicit + autocommit is required. +- Takes roughly 2-3x as long as plain `CREATE INDEX` because it does + two table scans. +- Can fail late (after most of the work is done) if a concurrent write + conflicts; the resulting index is left in `INVALID` state and must be + dropped before retrying. + +Implementation pattern: + +```rust +async fn rebuild_hnsw_concurrent(pool: &PgPool) -> MemoryStoreResult<()> { + let mut conn = pool.acquire().await?; + // sqlx acquires a connection in autocommit mode; the trick is to + // NOT wrap this in a `begin().await?` transaction. + sqlx::query( + "CREATE INDEX CONCURRENTLY idx_knowledge_nodes_embedding_hnsw \ + ON knowledge_nodes USING hnsw (embedding vector_cosine_ops) \ + WITH (m = 16, ef_construction = 64)" + ) + .execute(&mut *conn) + .await?; + Ok(()) +} +``` + +If the index already exists (because a prior run partially succeeded), +the operator must run `DROP INDEX idx_knowledge_nodes_embedding_hnsw;` +themselves before retrying. The driver intentionally does NOT auto-drop +in CONCURRENTLY mode because that could mask a real schema problem. + +For the default `concurrent_index = false` path, use plain +`CREATE INDEX ...` against `pool.execute(...)`; transactions are fine. + +--- + +## dry_run mode + +```rust +pub async fn dry_run_reembed( + store: &PgMemoryStore, + new_embedder: Arc, + plan: &ReembedPlan, +) -> MemoryStoreResult; + +pub struct DryRunSummary { + pub rows_to_update: u64, + pub embedder_batches: u64, + pub estimated_walltime_secs: f64, + pub current_signature: ModelSignature, + pub target_signature: ModelSignature, + pub would_alter_typmod: bool, +} +``` + +Behaviour: + +1. `SELECT COUNT(*) FROM knowledge_nodes;` to get `rows_to_update`. +2. `embedder_batches = ceil(rows_to_update / plan.batch_size)`. +3. `estimated_walltime_secs = rows_to_update / 50.0` -- the master plan's + 50-rows-per-second baseline for local fastembed. Add a 30s flat fee for + the HNSW rebuild on tables under 100k rows; scale linearly past that. +4. `would_alter_typmod = current_signature.dimension != target_signature.dimension`. +5. Print everything to stderr in a human-friendly summary; emit JSON on + stdout if `--json` is set. +6. Return without writing anything. + +The dry-run path performs zero embedder calls and zero `knowledge_nodes` writes. +It is safe to run against production at any time. + +--- + +## CLI wiring + +The `clap` subcommand surface, extending what `0002f` already added: + +```rust +#[derive(Subcommand)] +#[cfg(feature = "postgres-backend")] +enum MigrateAction { + /// Copy SQLite -> Postgres. Owned by 0002f. + Copy { /* ... see 0002f ... */ }, + + /// Re-embed all memories in a Postgres backend with a new embedder. + Reembed(ReembedArgs), +} + +#[derive(clap::Args)] +#[cfg(feature = "postgres-backend")] +struct ReembedArgs { + /// Postgres URL of the target backend. + #[arg(long)] + postgres_url: String, + + /// Embedder model name. Today only `nomic-ai/nomic-embed-text-v1.5` + /// is supported (the FastembedEmbedder default). The argument is + /// kept so a future embedder factory can resolve other names + /// without changing the CLI surface. + #[arg(long)] + model: String, + + /// Vector dimension produced by the embedder. Cross-checked against + /// the embedder's `dimension()` at startup; mismatch is a fatal + /// error before any writes occur. + #[arg(long)] + dimension: usize, + + /// Embedder + UPDATE batch size. Default 128. + #[arg(long, default_value_t = 128)] + batch_size: usize, + + /// Drop idx_knowledge_nodes_embedding_hnsw before the UPDATE pass. + /// Default true. + #[arg(long, default_value_t = true)] + drop_hnsw_first: bool, + + /// Use CREATE INDEX CONCURRENTLY for the rebuild. Default false. + #[arg(long, default_value_t = false)] + concurrent_index: bool, + + /// Print the plan without writing anything. + #[arg(long, default_value_t = false)] + dry_run: bool, +} +``` + +The handler: + +```rust +async fn run_reembed_cli(args: ReembedArgs) -> anyhow::Result<()> { + let embedder: Arc = resolve_embedder(&args.model)?; + if embedder.dimension() != args.dimension { + anyhow::bail!( + "embedder '{}' produces dimension {}, --dimension was {}", + embedder.model_name(), embedder.dimension(), args.dimension, + ); + } + let store = PgMemoryStore::connect(&args.postgres_url, 4).await?; + let plan = ReembedPlan { + batch_size: args.batch_size, + drop_hnsw_first: args.drop_hnsw_first, + concurrent_index: args.concurrent_index, + }; + if args.dry_run { + let summary = dry_run_reembed(&store, embedder, &plan).await?; + print_dry_run(&summary); + return Ok(()); + } + let report = run_reembed(&store, embedder, plan).await?; + print_report(&report); + Ok(()) +} + +fn resolve_embedder(model: &str) -> anyhow::Result> { + // Today, Phase 1 provides exactly one Embedder constructor: + // FastembedEmbedder::new(). The master plan calls out a future + // `Embedder::from_name(&str)` factory that does not yet exist. + // Until that factory lands, this function accepts only the + // FastembedEmbedder's `model_name()` value and errors on anything + // else. Adding a real registry is a follow-up task. + let candidate = FastembedEmbedder::new(); + if candidate.model_name() == model { + return Ok(Arc::new(candidate)); + } + anyhow::bail!( + "unknown embedder model '{}'. Known: {}", + model, + candidate.model_name(), + ); +} +``` + +**Important honesty note for the implementer:** the master plan claims +`Embedder::from_name(&str)` already exists in Phase 1. As of audit (see +"Audit step" above), it does not. This sub-plan ships the +`FastembedEmbedder::new()` matcher and leaves the factory pattern for a +future change. Do not block on inventing the factory just to satisfy the +master plan's wording -- doing so expands scope without a real second +embedder to use it. + +The CLI invocation matches the form requested in the master plan: + +``` +vestige migrate reembed \ + --postgres-url postgresql://localhost/vestige \ + --model nomic-ai/nomic-embed-text-v1.5 \ + --dimension 768 \ + --batch-size 128 \ + --drop-hnsw-first \ + --dry-run +``` + +--- + +## Failure handling + +The driver makes a single, important promise: **between step 4 (UPDATE +pass) and step 7 (registry update), the database is in an inconsistent +state**. Specifically: + +- Rows already processed in step 4 carry vectors in the NEW embedding + space. +- Rows not yet processed carry vectors in the OLD embedding space. +- The `embedding_model` registry still says OLD. +- The HNSW index is dropped (if `drop_hnsw_first = true`). + +If the driver crashes, is killed, loses its DB connection, or the +operator hits Ctrl-C in this window, the partial state is broken in a +specific way: a `vector_search` against the table would mix vectors +from two different model spaces, producing nonsensical similarity +rankings. The operator MUST NOT serve search until the re-embed +completes. + +**Recovery procedure** (document this loudly in the operator-facing log): + +1. The CLI log already says, on every batch, `"reembed: wrote batch N + (M rows)"`. The last such log line indicates how far the pass got. +2. The recovery action is to **re-run reembed** with the same arguments. + The driver's step 1 (no-op check) will see that the registry still + says OLD and will re-do the work. The UPDATE pass overwrites rows + that were already re-embedded (harmless; the new vector is + deterministic per content), and processes the rest. +3. Once the second run completes through step 7, the table is + consistent again. + +The driver logs a one-time WARNING at startup, before any writes: + +``` +WARN: vestige migrate reembed is starting. Search results will be +WARN: incorrect until this run completes. Stop the MCP server now if +WARN: it is connected to this database. Press Ctrl-C within 5 seconds +WARN: to abort. +``` + +The 5-second pause is implemented with `tokio::time::sleep` and can be +suppressed with `--no-confirm` for scripted use. + +There is no "resume from row N" feature in this iteration. Re-embedding +is idempotent at the row level (same content + same embedder = same +vector), so a full re-run is correct, just wasteful. If the table grows +large enough that full re-runs are unacceptable, a follow-up adds a +checkpoint table; that is out of Phase 2 scope. + +--- + +## Verification + +### Unit tests (colocated in `reembed.rs`) + +1. **`reembed_no_op_when_signature_matches`** -- seed a `PgMemoryStore` + via testcontainers, register a fake embedder dim=64, call + `run_reembed` with the same fake embedder, assert the returned + `ReembedReport.rows_updated == 0` and that no embedder calls were + made (use a counter-wrapped fake). + +2. **`reembed_plan_defaults`** -- `ReembedPlan::default()` returns + `batch_size = 128`, `drop_hnsw_first = true`, + `concurrent_index = false`. + +3. **`reembed_dry_run_returns_summary_without_writing`** -- seed 50 + rows, call `dry_run_reembed`, assert `rows_to_update == 50` and + that the original embeddings are untouched. + +### Integration test (under `tests/phase_2/pg_reembed.rs`) + +Acceptance test that exercises the dimension-change path end to end: + +```rust +#![cfg(feature = "postgres-backend")] + +use std::sync::Arc; + +mod common; +use common::test_embedder::{FakeEmbedder, FakeEmbedderConfig}; +use common::pg_harness::PgHarness; + +#[tokio::test] +async fn reembed_changes_dimension_and_search_still_works() { + let old = Arc::new(FakeEmbedder::new(FakeEmbedderConfig { + name: "fake-old", + dimension: 64, + })); + let harness = PgHarness::start(old.clone()).await.unwrap(); + + // Seed 100 memories. Each gets a 64-d vector from `old`. + for i in 0..100 { + let content = format!("memory number {i} talks about rust and async"); + let vec = old.embed(&content).await.unwrap(); + harness.store.insert(/* ... record with embedding = vec ... */).await.unwrap(); + } + + // Now re-embed with a different fake at dim 128. + let new = Arc::new(FakeEmbedder::new(FakeEmbedderConfig { + name: "fake-new", + dimension: 128, + })); + + let report = run_reembed( + &harness.store, + new.clone(), + ReembedPlan::default(), + ).await.unwrap(); + + assert_eq!(report.rows_updated, 100); + + // (a) Every row has a 128-d vector. + let dims: Vec = sqlx::query_scalar( + "SELECT vector_dims(embedding) FROM knowledge_nodes" + ).fetch_all(harness.store.pool()).await.unwrap(); + assert!(dims.iter().all(|&d| d == 128)); + + // (b) Registry reflects the new signature. + let sig = harness.store.registered_model().await.unwrap().unwrap(); + assert_eq!(sig.name, "fake-new"); + assert_eq!(sig.dimension, 128); + + // (c) vector_search returns results in the new space. + let probe = new.embed("memory number 5 talks about rust and async").await.unwrap(); + let results = harness.store.vector_search(&probe, 10).await.unwrap(); + assert!(!results.is_empty()); +} +``` + +The `FakeEmbedder` from `common/test_embedder.rs` produces deterministic +vectors by hashing the input; both the seed and the search probe use the +same hash, so the test does not depend on actual semantic similarity. + +### Bench (optional, not gating) + +A simple benchmark in `crates/vestige-core/benches/reembed.rs` reports +throughput at 100k rows with `FakeEmbedder`. Useful for catching +regressions in the UPDATE-pass batching pattern. Not part of CI. + +--- + +## Acceptance criteria + +This sub-plan is complete when: + +1. `crates/vestige-core/src/storage/postgres/reembed.rs` exists and + compiles under `--features postgres-backend`. +2. `ReembedPlan` and `ReembedReport` are public types matching the + shapes in this document. +3. `run_reembed` implements the eight numbered steps in the Driver fn + section, including the no-op short-circuit at step 1 and the + typmod relaxation logic at step 5. +4. `dry_run_reembed` returns counts and estimates without writing. +5. The `vestige migrate reembed ...` subcommand is wired through + `crates/vestige-mcp/src/bin/cli.rs`, gated on `--features + postgres-backend`, validating `--dimension` against + `embedder.dimension()`. +6. The three unit tests pass. +7. The `pg_reembed.rs` integration test passes against the + testcontainer harness from `0002h` (or against a locally provisioned + pgvector instance if `0002h` is not yet merged). +8. The operator-facing WARN banner is printed before any writes and + honours `--no-confirm`. +9. The recovery semantics from "Failure handling" are documented in + the module-level rustdoc of `reembed.rs`, so a future operator + reading `cargo doc` sees the "you must re-run to completion before + serving search" rule without finding this sub-plan first. +10. `cargo sqlx prepare --workspace` updates `.sqlx/` with the new + queries; the resulting JSON files are committed. + +When all ten items are checked, sub-plan `0002g` lands. Master plan +deliverable D9 is satisfied. The remaining Phase 2 work is `0002h` +(testing and benches) and `0002i` (runbook). diff --git a/docs/plans/0002h-testing-and-benches.md b/docs/plans/0002h-testing-and-benches.md new file mode 100644 index 0000000..d6bcebc --- /dev/null +++ b/docs/plans/0002h-testing-and-benches.md @@ -0,0 +1,1223 @@ +# Sub-plan 0002h -- Testing and benches for the Postgres backend + +**Status**: Draft +**Master plan**: [0002-phase-2-postgres-backend.md](0002-phase-2-postgres-backend.md) +**ADR**: [0002-phase-2-execution.md](../adr/0002-phase-2-execution.md) +**Predecessors**: `0002a` through `0002d` (skeleton, pool/config, migrations, +store impl bodies). `0002e` (hybrid search), `0002f` (migrate CLI), and `0002g` +(reembed) provide additional code under test but are not strict blockers -- +the search and migrate test files can be stubbed against the trait surface +and filled out as their implementations land. + +--- + +## Context + +This sub-plan covers master plan deliverables **D14** (six integration test +files under `tests/phase_2/`) and **D15** (Criterion benches for RRF search +at 1k and 100k memories). It can execute in parallel with `0002e`, `0002f`, +`0002g` once `0002d` is merged, because the trait surface they exercise is +frozen by Phase 1 and the directory layout is reserved by `0002a`. + +The deliverable is a Postgres-feature-gated test and bench suite that catches +regressions before they ship. Single goal: when somebody changes +`storage/postgres/`, `cargo test -p vestige-core --features postgres-backend` +either passes (change is safe) or fails fast with a clear localised error +(change broke something a reviewer can name). + +Scope: + +- Add the testcontainer harness in `tests/phase_2/common/`. +- Add six integration test files, each gated on `postgres-backend`. +- Add the Criterion bench `pg_hybrid_search.rs` with two bench groups. +- Wire dev-dependencies, `[[test]]`, and `[[bench]]` entries in + `crates/vestige-core/Cargo.toml`. +- Document how the suite is run locally and what CI must provide. + +Explicitly NOT in scope: + +- Trait-parity testing (`tests/phase_2/pg_trait_parity.rs` from the master + plan). That file's matrix is delegated to the larger Phase 2 parity push + and is tracked in the master plan's D14; this sub-plan ships six focused + files instead, listed below. +- Concurrency stress tests (`pg_concurrency.rs` from the master plan). + Deferred to a follow-up; the ingest/search code in `0002d`/`0002e` does + not change MVCC semantics, so a dedicated stress test is lower priority + than coverage. +- Re-embed integration tests beyond a smoke check. `0002g` ships its own + unit test against an in-memory plan; an end-to-end re-embed test is + worth a follow-up but not required to call Phase 2 done. + +The six test files in this sub-plan map to the methods most likely to +regress during Phase 2 commits: init/registry, CRUD with the new D7/D8 +columns, search, scheduling, graph, and the SQLite to Postgres migrator. + +--- + +## Prerequisites + +- `0002a` -- `crates/vestige-core/src/storage/postgres/mod.rs` exists, the + `postgres-backend` feature gate is declared, `PgMemoryStore` is a real + type. Method bodies may still be `todo!()` for the parts a given test + does not touch. +- `0002b` -- pool construction works; `PgMemoryStore::connect` and + `PgMemoryStore::from_pool` return real pools. +- `0002c` -- `sqlx::migrate!` wired; tests can call + `PgMemoryStore::run_migrations(&pool).await?` (or whatever the migration + helper ends up named in `0002c`) and reach a populated schema. +- `0002d` -- CRUD, scheduling, and graph method bodies are real (not + `todo!()`). Without `0002d` the CRUD/scheduling/graph tests cannot pass. +- `0002e` -- hybrid search body is real. The search test depends on it. + If `0002e` is not yet merged, the search test file can be stubbed + `#[ignore]` and unignored once `0002e` lands. +- `0002f` -- migrate CLI streaming copy is callable as a library function + (`run_sqlite_to_postgres` or equivalent). The migrate test depends on it + and follows the same stub/unignore pattern if needed. +- Docker or Podman is available at test time. CI must provide it. Local + developers without Docker skip the suite via the runtime check described + below. + +--- + +## Dev-dependencies + +Add `testcontainers` and `testcontainers-modules` as optional dev-deps +gated on the `postgres-backend` feature. `criterion` is already in +dev-dependencies from Phase 1 (`search_bench.rs` uses it). + +From the repo root, run: + +```bash +cargo add --package vestige-core --dev --optional testcontainers@0.22 +cargo add --package vestige-core --dev --optional \ + testcontainers-modules@0.10 --features postgres +cargo add --package vestige-core --dev anyhow +cargo add --package vestige-core --dev tokio --features rt-multi-thread,macros +cargo add --package vestige-core --dev rand@0.8 +``` + +`anyhow` is convenient for the harness's error type (`anyhow::Result<...>` +inside the `common/` helper matches master plan D12). `rand` provides the +deterministic seeded RNG used by the search and migrate tests. `tokio` may +already be in dev-deps via Phase 1 -- run `cargo add` anyway; cargo will +update the features in place rather than duplicate. + +Then mark the testcontainer deps as activated only when the +`postgres-backend` feature is on. Cargo does not have a direct +"dev-dependency required-features" syntax; the convention is to declare the +deps as `optional = true` in `[dev-dependencies]` and reference them inside +the new test files behind `#![cfg(feature = "postgres-backend")]`. The +resulting `Cargo.toml` block looks like: + +```toml +[dev-dependencies] +tempfile = "3" +criterion = { version = "0.5", features = ["html_reports"] } +anyhow = "1" +tokio = { version = "1", features = ["rt-multi-thread", "macros"] } +rand = "0.8" +testcontainers = { version = "0.22", optional = true } +testcontainers-modules = { version = "0.10", features = ["postgres"], optional = true } +``` + +The `optional = true` flag prevents `cargo test` (default features) from +pulling in 30+ MB of testcontainer transitive deps on every contributor +laptop. Activation happens via the `postgres-backend` feature itself; the +test files import `testcontainers::...` only under +`#[cfg(feature = "postgres-backend")]`, so the unused-dep warning is +suppressed by the gate. + +If a future reviewer pushes back on `optional = true` for dev-deps +(rustc/clippy gives `unused_optional_dependency` in some toolchain versions), +the fallback is to drop `optional = true` and accept the dev-dep weight; the +testcontainers crate is dev-only and never ships in a release build either +way. + +--- + +## Test container helper + +**File**: `crates/vestige-core/tests/phase_2/common/mod.rs` + +This is shared infrastructure for every test in `tests/phase_2/`. It is not +its own `[[test]]`; it is a `mod common;` import inside each test file. + +```rust +//! Shared testcontainer setup for Phase 2 Postgres integration tests. +#![cfg(feature = "postgres-backend")] + +use std::sync::Arc; + +use anyhow::Result; +use testcontainers::core::{ContainerPort, IntoContainerPort, WaitFor}; +use testcontainers::runners::AsyncRunner; +use testcontainers::{ContainerAsync, GenericImage, ImageExt}; +use testcontainers_modules::postgres::Postgres; + +use vestige_core::embedder::Embedder; +use vestige_core::storage::postgres::PgMemoryStore; + +/// Spin up a fresh pgvector-enabled Postgres container and return a fully +/// migrated PgMemoryStore connected to it. +/// +/// The ContainerAsync handle is returned alongside the store; callers must +/// keep it alive for the duration of the test. Dropping it tears the +/// container down. +pub async fn fresh_pg_store( + embedder: Arc, +) -> Result<(PgMemoryStore, ContainerAsync)> { + // pgvector/pgvector:pg16 is the official pgvector image built on the + // postgres:16 base. testcontainers-modules::postgres::Postgres targets + // the upstream postgres image by default; we override name + tag. + let container = Postgres::default() + .with_name("pgvector/pgvector") + .with_tag("pg16") + .start() + .await?; + + let port = container.get_host_port_ipv4(5432).await?; + let url = format!("postgresql://postgres:postgres@127.0.0.1:{port}/postgres"); + + // Pool size 4 is enough for tests and stays well below the container's + // default max_connections = 100. + let store = PgMemoryStore::connect(&url, 4).await?; + + // Run migrations. `0002c` decides the exact helper name. The canonical + // call point is whichever is true after that sub-plan; pseudocode here: + store.run_migrations().await?; + + // Register the embedder so the dimension typmod stamp is in place + // before any insert. `0002d` lands the real register_model body. + let sig = embedder.signature(); + store.register_model(&sig).await?; + + Ok((store, container)) +} + +/// Fixed embedder used by every test. Deterministic, no ONNX dependency, +/// returns a 768-dim vector hashed from input text. Lives in +/// `tests/phase_2/common/test_embedder.rs`. +pub use test_embedder::TestEmbedder; + +mod test_embedder; +``` + +**File**: `crates/vestige-core/tests/phase_2/common/test_embedder.rs` + +```rust +//! Deterministic hash-based embedder for tests. +//! +//! Avoids the fastembed/ONNX dependency in CI. Returns a 768-dim vector +//! built from a stable hash of the input text. Two equal strings produce +//! equal vectors; near-equal strings produce near-equal vectors only at +//! the trivial token-overlap level (good enough for a smoke check that +//! the vector pipeline is wired, not a real embedding quality test). +#![cfg(feature = "postgres-backend")] + +use std::sync::Arc; + +use async_trait::async_trait; + +use vestige_core::embedder::{Embedder, EmbedderError, ModelSignature}; + +pub struct TestEmbedder { + pub name: String, + pub dim: usize, +} + +impl TestEmbedder { + pub fn new_768() -> Arc { + Arc::new(Self { name: "test-768".into(), dim: 768 }) + } + pub fn new_1024() -> Arc { + Arc::new(Self { name: "test-1024".into(), dim: 1024 }) + } +} + +#[async_trait] +impl Embedder for TestEmbedder { + fn signature(&self) -> ModelSignature { + ModelSignature { + name: self.name.clone(), + dimension: self.dim, + hash: format!("{}-h", self.name), + } + } + + async fn embed(&self, text: &str) -> Result, EmbedderError> { + let mut v = vec![0.0f32; self.dim]; + let bytes = text.as_bytes(); + for (i, b) in bytes.iter().enumerate() { + v[i % self.dim] += (*b as f32) / 255.0; + } + // Normalize so cosine similarity is meaningful. + let norm: f32 = v.iter().map(|x| x * x).sum::().sqrt(); + if norm > 0.0 { + for x in &mut v { + *x /= norm; + } + } + Ok(v) + } +} +``` + +Notes: + +- The exact `Embedder` trait shape is owned by Phase 1; the example above + may need `embed_batch`, `dimension()`, etc. depending on the frozen + surface. Whoever lands this file mirrors whatever the Phase 1 trait + exposes. +- The container handle is returned, not stored in a `static`. Per-test + isolation matters: one failing test must not leak state into the next. +- A runtime Docker check is added inside `fresh_pg_store` if the + containers can't start: catch the connect error, downgrade it to a + `println!` plus `panic!("docker unreachable; skipping")`, and have each + test use `if docker_available()` to early-return. + +A small helper guards CI environments without Docker: + +```rust +/// Returns Ok if a `docker` or `podman` binary is on PATH and responds. +/// Tests that need a container call this first and `eprintln!`+`return` +/// rather than failing when Docker is absent. +pub fn docker_available() -> bool { + use std::process::Command; + for bin in ["docker", "podman"] { + if Command::new(bin).arg("info").output().map(|o| o.status.success()).unwrap_or(false) { + return true; + } + } + false +} +``` + +Each test starts with: + +```rust +if !common::docker_available() { + eprintln!("docker/podman not available; skipping {}", file!()); + return; +} +``` + +This is preferable to `#[ignore]` because the developer sees the skip in +test output rather than silently passing zero tests. + +--- + +## Six test files + +Each file is at `crates/vestige-core/tests/phase_2/.rs`, declares +`#![cfg(feature = "postgres-backend")]` at the top, imports +`mod common;`, and uses `#[tokio::test(flavor = "multi_thread")]`. + +Each file is also wired as a separate `[[test]]` entry in the Cargo.toml +(see "Cargo.toml" section below). This keeps `cargo test` parallelism +per-file and lets a developer run just one file with +`cargo test --features postgres-backend --test `. + +### 1. `tests/phase_2/init_test.rs` + +**Purpose**: verify the migration pipeline and the embedding registry +behave correctly on first connect, on idempotent reconnect, and on +embedder mismatch. + +**Tests**: + +```rust +#![cfg(feature = "postgres-backend")] + +mod common; +use common::{docker_available, fresh_pg_store, TestEmbedder}; + +#[tokio::test(flavor = "multi_thread")] +async fn migrations_apply_cleanly() { + if !docker_available() { eprintln!("docker unavailable; skip"); return; } + let embedder = TestEmbedder::new_768(); + let (_store, _container) = fresh_pg_store(embedder).await.unwrap(); + // If we reached here, sqlx::migrate! ran 0001_init + 0002_hnsw without + // error against a fresh pgvector container. +} + +#[tokio::test(flavor = "multi_thread")] +async fn registry_persists_after_first_connect() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _container) = fresh_pg_store(embedder.clone()).await.unwrap(); + let registered = store.registered_model().await.unwrap(); + assert!(registered.is_some()); + let sig = registered.unwrap(); + assert_eq!(sig.name, "test-768"); + assert_eq!(sig.dimension, 768); +} + +#[tokio::test(flavor = "multi_thread")] +async fn second_connect_with_same_embedder_is_idempotent() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store_a, container) = fresh_pg_store(embedder.clone()).await.unwrap(); + // Reuse the same container, build a second store against the same URL, + // call register_model again. Must not error. + let port = container.get_host_port_ipv4(5432).await.unwrap(); + let url = format!("postgresql://postgres:postgres@127.0.0.1:{port}/postgres"); + let store_b = vestige_core::storage::postgres::PgMemoryStore::connect(&url, 4).await.unwrap(); + store_b.register_model(&embedder.signature()).await.unwrap(); + assert_eq!( + store_a.registered_model().await.unwrap().unwrap().name, + store_b.registered_model().await.unwrap().unwrap().name, + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn second_connect_with_different_embedder_returns_mismatch() { + if !docker_available() { return; } + let e768 = TestEmbedder::new_768(); + let (_store, container) = fresh_pg_store(e768).await.unwrap(); + let port = container.get_host_port_ipv4(5432).await.unwrap(); + let url = format!("postgresql://postgres:postgres@127.0.0.1:{port}/postgres"); + let store2 = vestige_core::storage::postgres::PgMemoryStore::connect(&url, 4).await.unwrap(); + let e1024 = TestEmbedder::new_1024(); + let err = store2.register_model(&e1024.signature()).await; + assert!(matches!(err, Err(vestige_core::storage::MemoryStoreError::EmbeddingMismatch { .. }))); +} +``` + +### 2. `tests/phase_2/crud_test.rs` + +**Purpose**: insert + get + update + delete round-trip; non-existent id +returns `Ok(None)`; D7+D8 columns (`owner_user_id`, `visibility`, +`shared_with_groups`, `codebase`) round-trip correctly. + +**Tests**: + +```rust +#![cfg(feature = "postgres-backend")] + +mod common; +use common::{docker_available, fresh_pg_store, TestEmbedder}; +use vestige_core::memory::{MemoryRecord, Visibility}; +use uuid::Uuid; + +#[tokio::test(flavor = "multi_thread")] +async fn insert_get_update_delete_roundtrip() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _container) = fresh_pg_store(embedder.clone()).await.unwrap(); + + let mut rec = MemoryRecord::new("hello world"); + rec.tags = vec!["test".into(), "crud".into()]; + rec.embedding = Some(embedder.embed(&rec.content).await.unwrap()); + let id = store.insert(&rec).await.unwrap(); + + let got = store.get(&id).await.unwrap().unwrap(); + assert_eq!(got.content, "hello world"); + assert_eq!(got.tags, vec!["test", "crud"]); + + let mut updated = got.clone(); + updated.content = "hello updated".into(); + updated.embedding = Some(embedder.embed("hello updated").await.unwrap()); + store.update(&updated).await.unwrap(); + let after = store.get(&id).await.unwrap().unwrap(); + assert_eq!(after.content, "hello updated"); + + store.delete(&id).await.unwrap(); + assert!(store.get(&id).await.unwrap().is_none()); +} + +#[tokio::test(flavor = "multi_thread")] +async fn get_nonexistent_returns_ok_none() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _container) = fresh_pg_store(embedder).await.unwrap(); + let missing = Uuid::new_v4(); + assert!(store.get(&missing).await.unwrap().is_none()); +} + +#[tokio::test(flavor = "multi_thread")] +async fn update_nonexistent_returns_not_found() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _container) = fresh_pg_store(embedder.clone()).await.unwrap(); + let mut rec = MemoryRecord::new("ghost"); + rec.id = Uuid::new_v4(); + rec.embedding = Some(embedder.embed("ghost").await.unwrap()); + // Contract: update on a missing id is Err(NotFound) or Ok with + // rows_updated == 0. Whichever 0002d picks is what this test asserts. + let res = store.update(&rec).await; + // Adjust to actual contract once 0002d lands: + assert!(res.is_err() || res.is_ok()); +} + +#[tokio::test(flavor = "multi_thread")] +async fn d7_d8_columns_roundtrip() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _container) = fresh_pg_store(embedder.clone()).await.unwrap(); + + let owner = Uuid::parse_str("00000000-0000-0000-0000-000000000001").unwrap(); + let group_a = Uuid::new_v4(); + let group_b = Uuid::new_v4(); + + let mut rec = MemoryRecord::new("contextful"); + rec.owner_user_id = owner; + rec.visibility = Visibility::Group; + rec.shared_with_groups = vec![group_a, group_b]; + rec.codebase = Some("vestige".to_string()); + rec.embedding = Some(embedder.embed(&rec.content).await.unwrap()); + + let id = store.insert(&rec).await.unwrap(); + let got = store.get(&id).await.unwrap().unwrap(); + + assert_eq!(got.owner_user_id, owner); + assert_eq!(got.visibility, Visibility::Group); + assert_eq!(got.shared_with_groups, vec![group_a, group_b]); + assert_eq!(got.codebase.as_deref(), Some("vestige")); +} +``` + +### 3. `tests/phase_2/search_test.rs` + +**Purpose**: exercise the three search modes (fts only, vector only, +hybrid), then the domain/tag/node_type/min_retrievability filters, then +the empty-query edge case. + +**Tests**: + +```rust +#![cfg(feature = "postgres-backend")] + +mod common; +use common::{docker_available, fresh_pg_store, TestEmbedder}; +use vestige_core::memory::MemoryRecord; +use vestige_core::storage::SearchQuery; + +async fn seed(store: &impl vestige_core::storage::MemoryStore, embedder: &(impl vestige_core::embedder::Embedder + ?Sized)) { + let seeds: &[(&str, &[&str], &str)] = &[ + ("rust async trait", &["rust", "async"], "code"), + ("postgres hnsw vector", &["postgres", "vector"], "code"), + ("fastembed onnx model", &["embeddings", "onnx"], "model"), + ("breakfast tacos recipe", &["food"], "note"), + ("morning bike commute", &["health"], "event"), + ]; + for (text, tags, node_type) in seeds { + let mut r = MemoryRecord::new(*text); + r.tags = tags.iter().map(|s| s.to_string()).collect(); + r.node_type = node_type.to_string(); + r.embedding = Some(embedder.embed(text).await.unwrap()); + store.insert(&r).await.unwrap(); + } +} + +#[tokio::test(flavor = "multi_thread")] +async fn fts_only_returns_keyword_matches() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _c) = fresh_pg_store(embedder.clone()).await.unwrap(); + seed(&store, embedder.as_ref()).await; + + let q = SearchQuery { text: Some("rust".into()), embedding: None, limit: 10, ..Default::default() }; + let hits = store.search(&q).await.unwrap(); + assert!(hits.iter().any(|h| h.content.contains("rust async trait"))); +} + +#[tokio::test(flavor = "multi_thread")] +async fn vector_only_returns_semantic_matches() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _c) = fresh_pg_store(embedder.clone()).await.unwrap(); + seed(&store, embedder.as_ref()).await; + + let qe = embedder.embed("vector search").await.unwrap(); + let q = SearchQuery { text: None, embedding: Some(qe), limit: 10, ..Default::default() }; + let hits = store.search(&q).await.unwrap(); + assert!(!hits.is_empty()); +} + +#[tokio::test(flavor = "multi_thread")] +async fn hybrid_returns_rrf_fused_results() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _c) = fresh_pg_store(embedder.clone()).await.unwrap(); + seed(&store, embedder.as_ref()).await; + + let qe = embedder.embed("postgres vector").await.unwrap(); + let q = SearchQuery { + text: Some("postgres".into()), + embedding: Some(qe), + limit: 10, + ..Default::default() + }; + let hits = store.search(&q).await.unwrap(); + let top = hits.first().unwrap(); + assert!(top.content.contains("postgres")); + // RRF score must be at least the floor of two contributions at rank 0. + assert!(top.score >= 1.0 / 61.0); +} + +#[tokio::test(flavor = "multi_thread")] +async fn filter_by_tag_and_node_type() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _c) = fresh_pg_store(embedder.clone()).await.unwrap(); + seed(&store, embedder.as_ref()).await; + + let q = SearchQuery { + text: Some("model".into()), + tags: vec!["embeddings".into()], + node_type: Some("model".into()), + limit: 10, + ..Default::default() + }; + let hits = store.search(&q).await.unwrap(); + assert!(hits.iter().all(|h| h.tags.contains(&"embeddings".into()))); + assert!(hits.iter().all(|h| h.node_type == "model")); +} + +#[tokio::test(flavor = "multi_thread")] +async fn min_retrievability_filter() { + // After 0002e ships the filter wiring this exercises it. For now, + // assert the empty / pass-through case: min_retrievability = 0.0 + // returns all results. + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _c) = fresh_pg_store(embedder.clone()).await.unwrap(); + seed(&store, embedder.as_ref()).await; + + let q = SearchQuery { text: Some("rust".into()), min_retrievability: 0.0, limit: 10, ..Default::default() }; + let hits = store.search(&q).await.unwrap(); + assert!(!hits.is_empty()); +} + +#[tokio::test(flavor = "multi_thread")] +async fn empty_query_returns_ok_empty_or_all() { + // Contract chosen in 0002e; this test asserts whichever it picks. + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _c) = fresh_pg_store(embedder).await.unwrap(); + let q = SearchQuery { text: None, embedding: None, limit: 10, ..Default::default() }; + let hits = store.search(&q).await.unwrap(); + let _ = hits; // assert is intentionally weak until 0002e fixes the contract +} +``` + +### 4. `tests/phase_2/scheduling_test.rs` + +**Purpose**: FSRS state round-trip via `get_scheduling` / +`update_scheduling` with `ON CONFLICT DO UPDATE` semantics, and +`get_due_memories` paging. + +**Tests**: + +```rust +#![cfg(feature = "postgres-backend")] + +mod common; +use common::{docker_available, fresh_pg_store, TestEmbedder}; +use chrono::{Duration, Utc}; +use vestige_core::memory::MemoryRecord; +use vestige_core::scheduling::SchedulingState; + +#[tokio::test(flavor = "multi_thread")] +async fn scheduling_update_and_get_roundtrip() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _c) = fresh_pg_store(embedder.clone()).await.unwrap(); + + let mut rec = MemoryRecord::new("fsrs target"); + rec.embedding = Some(embedder.embed("fsrs target").await.unwrap()); + let id = store.insert(&rec).await.unwrap(); + + let s = SchedulingState { + memory_id: id, + stability: 2.5, + difficulty: 6.7, + reps: 1, + lapses: 0, + next_review: Utc::now() + Duration::days(1), + last_review: Some(Utc::now()), + }; + store.update_scheduling(&s).await.unwrap(); + + let back = store.get_scheduling(&id).await.unwrap().unwrap(); + assert!((back.stability - 2.5).abs() < 1e-6); + assert_eq!(back.reps, 1); +} + +#[tokio::test(flavor = "multi_thread")] +async fn scheduling_on_conflict_overwrites() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _c) = fresh_pg_store(embedder.clone()).await.unwrap(); + + let mut rec = MemoryRecord::new("repeating"); + rec.embedding = Some(embedder.embed("repeating").await.unwrap()); + let id = store.insert(&rec).await.unwrap(); + + for reps in [1u32, 2, 3] { + let s = SchedulingState { + memory_id: id, + stability: reps as f32, + difficulty: 5.0, + reps, + lapses: 0, + next_review: Utc::now() + Duration::days(reps as i64), + last_review: Some(Utc::now()), + }; + store.update_scheduling(&s).await.unwrap(); + } + let final_state = store.get_scheduling(&id).await.unwrap().unwrap(); + assert_eq!(final_state.reps, 3); +} + +#[tokio::test(flavor = "multi_thread")] +async fn get_due_memories_pages() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _c) = fresh_pg_store(embedder.clone()).await.unwrap(); + + let now = Utc::now(); + // Insert 25 due memories with next_review in the past. + for i in 0..25 { + let mut rec = MemoryRecord::new(format!("due {i}")); + rec.embedding = Some(embedder.embed(&rec.content).await.unwrap()); + let id = store.insert(&rec).await.unwrap(); + let s = SchedulingState { + memory_id: id, + stability: 1.0, + difficulty: 5.0, + reps: 1, + lapses: 0, + next_review: now - Duration::hours(i as i64 + 1), + last_review: Some(now - Duration::hours(i as i64 + 2)), + }; + store.update_scheduling(&s).await.unwrap(); + } + let page1 = store.get_due_memories(now, 10, 0).await.unwrap(); + let page2 = store.get_due_memories(now, 10, 10).await.unwrap(); + let page3 = store.get_due_memories(now, 10, 20).await.unwrap(); + assert_eq!(page1.len(), 10); + assert_eq!(page2.len(), 10); + assert_eq!(page3.len(), 5); +} +``` + +### 5. `tests/phase_2/graph_test.rs` + +**Purpose**: `add_edge`, `get_edges`, `remove_edge`, and `get_neighbors` +with a non-trivial depth. + +**Tests**: + +```rust +#![cfg(feature = "postgres-backend")] + +mod common; +use common::{docker_available, fresh_pg_store, TestEmbedder}; +use vestige_core::memory::MemoryRecord; +use vestige_core::storage::Edge; + +async fn insert_n(store: &impl vestige_core::storage::MemoryStore, embedder: &(impl vestige_core::embedder::Embedder + ?Sized), n: usize) -> Vec { + let mut ids = Vec::with_capacity(n); + for i in 0..n { + let mut r = MemoryRecord::new(format!("node {i}")); + r.embedding = Some(embedder.embed(&r.content).await.unwrap()); + ids.push(store.insert(&r).await.unwrap()); + } + ids +} + +#[tokio::test(flavor = "multi_thread")] +async fn add_get_remove_edge() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _c) = fresh_pg_store(embedder.clone()).await.unwrap(); + let ids = insert_n(&store, embedder.as_ref(), 3).await; + + let e = Edge { + source_id: ids[0], + target_id: ids[1], + edge_type: "semantic".into(), + strength: 0.8, + activation_count: 0, + created_at: chrono::Utc::now(), + last_activated: None, + }; + store.add_edge(&e).await.unwrap(); + + let edges = store.get_edges(&ids[0]).await.unwrap(); + assert_eq!(edges.len(), 1); + assert_eq!(edges[0].target_id, ids[1]); + + store.remove_edge(&ids[0], &ids[1], "semantic").await.unwrap(); + assert!(store.get_edges(&ids[0]).await.unwrap().is_empty()); +} + +#[tokio::test(flavor = "multi_thread")] +async fn get_neighbors_with_depth() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + let (store, _c) = fresh_pg_store(embedder.clone()).await.unwrap(); + let ids = insert_n(&store, embedder.as_ref(), 5).await; + + // Chain: 0 -> 1 -> 2 -> 3 -> 4 + for w in ids.windows(2) { + let e = Edge { + source_id: w[0], + target_id: w[1], + edge_type: "semantic".into(), + strength: 1.0, + activation_count: 0, + created_at: chrono::Utc::now(), + last_activated: None, + }; + store.add_edge(&e).await.unwrap(); + } + + let depth_1 = store.get_neighbors(&ids[0], 1).await.unwrap(); + let depth_2 = store.get_neighbors(&ids[0], 2).await.unwrap(); + let depth_4 = store.get_neighbors(&ids[0], 4).await.unwrap(); + + assert_eq!(depth_1.len(), 1); + assert_eq!(depth_2.len(), 2); + assert_eq!(depth_4.len(), 4); +} +``` + +### 6. `tests/phase_2/migrate_test.rs` + +**Purpose**: seed SQLite with a small dataset, run the migrator, verify +counts and a sample row. + +**Tests**: + +```rust +#![cfg(feature = "postgres-backend")] + +mod common; +use common::{docker_available, fresh_pg_store, TestEmbedder}; +use vestige_core::memory::MemoryRecord; +use vestige_core::storage::{SqliteMemoryStore, MemoryStore}; +use vestige_core::storage::postgres::migrate_cli::run_sqlite_to_postgres; + +#[tokio::test(flavor = "multi_thread")] +async fn sqlite_to_postgres_small_corpus() { + if !docker_available() { return; } + let embedder = TestEmbedder::new_768(); + + // Seed SQLite (in-memory or tempfile). + let tmp = tempfile::tempdir().unwrap(); + let sqlite_path = tmp.path().join("seed.db"); + let sqlite = SqliteMemoryStore::new(&sqlite_path).unwrap(); + sqlite.register_model(&embedder.signature()).await.unwrap(); + for i in 0..50 { + let mut r = MemoryRecord::new(format!("seed row {i}")); + r.tags = vec![format!("tag-{}", i % 3)]; + r.embedding = Some(embedder.embed(&r.content).await.unwrap()); + sqlite.insert(&r).await.unwrap(); + } + + // Spin up Postgres and migrate. + let (pg, _container) = fresh_pg_store(embedder.clone()).await.unwrap(); + let report = run_sqlite_to_postgres(&sqlite, &pg, embedder.clone()).await.unwrap(); + + assert_eq!(report.memories_copied, 50); + assert_eq!(pg.count().await.unwrap(), 50); + + // Spot-check a sample row. + let sample_id = sqlite.list_ids(1, 0).await.unwrap()[0]; + let from_sqlite = sqlite.get(&sample_id).await.unwrap().unwrap(); + let from_pg = pg.get(&sample_id).await.unwrap().unwrap(); + assert_eq!(from_sqlite.content, from_pg.content); + assert_eq!(from_sqlite.tags, from_pg.tags); +} +``` + +If `0002f` is not yet merged when this sub-plan executes, the test file is +still added but the body sits behind `#[ignore = "depends on 0002f"]`, +removed once `0002f` lands. + +--- + +## How tests are run + +```bash +# Run all six phase_2 integration tests: +cargo test -p vestige-core --features postgres-backend --test '*' + +# Run a single file: +cargo test -p vestige-core --features postgres-backend --test init_test +cargo test -p vestige-core --features postgres-backend --test crud_test +cargo test -p vestige-core --features postgres-backend --test search_test +cargo test -p vestige-core --features postgres-backend --test scheduling_test +cargo test -p vestige-core --features postgres-backend --test graph_test +cargo test -p vestige-core --features postgres-backend --test migrate_test + +# SQLite-only sanity check (must continue to pass, Phase 1 unchanged): +cargo test -p vestige-core +``` + +Requirements: + +- Docker or Podman must be reachable. `testcontainers` connects via the + default Docker socket (`/var/run/docker.sock` on Linux, `~/.docker/run/docker.sock` + or the Docker Desktop socket on macOS, the Podman REST socket if + `DOCKER_HOST` points there). +- On a developer machine without Docker, the suite skips at runtime via + the `docker_available()` check in `common/mod.rs`. The test output + includes a `docker unavailable; skip` line per test so the developer + knows the tests were not silently dropped. +- The pgvector image (`pgvector/pgvector:pg16`) is pulled on first run; + ~200 MB. A pre-pulled image keeps the per-run overhead at the cold-start + container boot (~2-5 seconds). + +--- + +## Benches + +**File**: `crates/vestige-core/benches/pg_hybrid_search.rs` + +Two Criterion benches: `search_1k` and `search_100k`. Both gated on the +`postgres-backend` feature via `required-features` in the bench entry and +via a top-of-file `#![cfg(feature = "postgres-backend")]`. + +```rust +//! Criterion benches for the Postgres backend's hybrid RRF search. +#![cfg(feature = "postgres-backend")] + +use std::sync::Arc; +use std::sync::OnceLock; + +use criterion::{black_box, criterion_group, criterion_main, Criterion}; +use rand::{rngs::StdRng, Rng, SeedableRng}; +use testcontainers::runners::AsyncRunner; +use testcontainers::{ContainerAsync, ImageExt}; +use testcontainers_modules::postgres::Postgres; +use tokio::runtime::Runtime; + +use vestige_core::embedder::Embedder; +use vestige_core::memory::MemoryRecord; +use vestige_core::storage::postgres::PgMemoryStore; +use vestige_core::storage::{MemoryStore, SearchQuery}; + +// Bench fixture lives in tests/phase_2/common/test_embedder.rs; +// duplicate the type here under benches/ so the bench compiles without +// depending on tests/. +mod test_embedder; +use test_embedder::TestEmbedder; + +struct Bench { + rt: Runtime, + store: PgMemoryStore, + embedder: Arc, + _container: ContainerAsync, + query_embedding: Vec, +} + +async fn build_bench(rows: usize) -> Bench { + let rt_handle = tokio::runtime::Handle::current(); + let _ = rt_handle; // proves we are inside an executor + let embedder = TestEmbedder::new_768(); + let container = Postgres::default() + .with_name("pgvector/pgvector") + .with_tag("pg16") + .start() + .await + .unwrap(); + let port = container.get_host_port_ipv4(5432).await.unwrap(); + let url = format!("postgresql://postgres:postgres@127.0.0.1:{port}/postgres"); + let store = PgMemoryStore::connect(&url, 8).await.unwrap(); + store.run_migrations().await.unwrap(); + store.register_model(&embedder.signature()).await.unwrap(); + + let mut rng = StdRng::seed_from_u64(0xc0ffee); + let vocab = [ + "rust", "postgres", "vector", "hnsw", "fastembed", "onnx", + "search", "memory", "fsrs", "consolidate", "graph", "edge", + "async", "trait", "tokio", "sqlx", "pgvector", "embedding", + ]; + for i in 0..rows { + let words: String = (0..8) + .map(|_| vocab[rng.gen_range(0..vocab.len())]) + .collect::>() + .join(" "); + let mut r = MemoryRecord::new(format!("{i}: {words}")); + r.tags = vec![format!("tag-{}", i % 7)]; + r.embedding = Some(embedder.embed(&r.content).await.unwrap()); + store.insert(&r).await.unwrap(); + } + let query_embedding = embedder.embed("postgres vector search").await.unwrap(); + Bench { + rt: tokio::runtime::Runtime::new().unwrap(), + store, + embedder, + _container: container, + query_embedding, + } +} + +fn bench_search_1k(c: &mut Criterion) { + let rt = tokio::runtime::Runtime::new().unwrap(); + let bench = rt.block_on(build_bench(1_000)); + c.bench_function("pg_search_1k", |b| { + b.iter(|| { + let q = SearchQuery { + text: Some("postgres vector".into()), + embedding: Some(bench.query_embedding.clone()), + limit: 10, + ..Default::default() + }; + let hits = bench.rt.block_on(bench.store.search(&q)).unwrap(); + black_box(hits); + }) + }); +} + +// Heavy: 100k rows; seed time runs into minutes. Gated by an env var so +// `cargo bench --features postgres-backend --bench pg_hybrid_search` does +// not pay the cost by default. +fn bench_search_100k(c: &mut Criterion) { + if std::env::var("VESTIGE_BENCH_HEAVY").is_err() { + eprintln!("skip pg_search_100k (set VESTIGE_BENCH_HEAVY=1 to enable)"); + return; + } + let rt = tokio::runtime::Runtime::new().unwrap(); + let bench = rt.block_on(build_bench(100_000)); + c.bench_function("pg_search_100k", |b| { + b.iter(|| { + let q = SearchQuery { + text: Some("postgres vector".into()), + embedding: Some(bench.query_embedding.clone()), + limit: 10, + ..Default::default() + }; + let hits = bench.rt.block_on(bench.store.search(&q)).unwrap(); + black_box(hits); + }) + }); +} + +criterion_group!(benches, bench_search_1k, bench_search_100k); +criterion_main!(benches); +``` + +**File**: `crates/vestige-core/benches/test_embedder.rs` + +Duplicate of `tests/phase_2/common/test_embedder.rs`. Cargo's bench target +cannot `mod` into `tests/`; the duplication is the standard fix. Keep both +files in sync; if either grows non-trivially, refactor into a shared +`pub(crate)` module under `src/embedder/test_support.rs` gated on +`#[cfg(any(test, feature = "test-support"))]`. + +`VESTIGE_BENCH_HEAVY` gate: the 100k seed step takes several minutes (one +`INSERT` per row plus HNSW upsert). Skipping by default keeps `cargo bench` +under a minute for the 1k bench. Document this gate in the runbook +(`0002i`). + +--- + +## Cargo.toml + +Final state of the relevant sections of +`crates/vestige-core/Cargo.toml` after this sub-plan lands: + +```toml +[dev-dependencies] +tempfile = "3" +criterion = { version = "0.5", features = ["html_reports"] } +anyhow = "1" +tokio = { version = "1", features = ["rt-multi-thread", "macros"] } +rand = "0.8" +testcontainers = { version = "0.22", optional = true } +testcontainers-modules = { version = "0.10", features = ["postgres"], optional = true } + +[[bench]] +name = "search_bench" +harness = false + +[[bench]] +name = "pg_hybrid_search" +harness = false +required-features = ["postgres-backend"] + +[[test]] +name = "init_test" +path = "tests/phase_2/init_test.rs" +required-features = ["postgres-backend"] + +[[test]] +name = "crud_test" +path = "tests/phase_2/crud_test.rs" +required-features = ["postgres-backend"] + +[[test]] +name = "search_test" +path = "tests/phase_2/search_test.rs" +required-features = ["postgres-backend"] + +[[test]] +name = "scheduling_test" +path = "tests/phase_2/scheduling_test.rs" +required-features = ["postgres-backend"] + +[[test]] +name = "graph_test" +path = "tests/phase_2/graph_test.rs" +required-features = ["postgres-backend"] + +[[test]] +name = "migrate_test" +path = "tests/phase_2/migrate_test.rs" +required-features = ["postgres-backend"] +``` + +Notes: + +- `required-features = ["postgres-backend"]` on each `[[test]]` ensures + the file is only built (and only counted by `cargo test`) when the + feature is on. Cargo silently skips it otherwise -- exactly the desired + behavior for default `cargo test` runs. +- The benches use the same `required-features` shape so default + `cargo bench` is unaffected. + +--- + +## CI considerations + +- GitHub Actions / Forgejo Actions runners need Docker available. Default + `ubuntu-latest` runners include Docker. Self-hosted Forgejo runners on + TFGrid VMs must install `docker.io` or run `podman` with the Docker + socket compatibility shim. Document the runner requirement in the + runbook (`0002i`). +- The Postgres feature tests should run in a separate CI matrix entry to + isolate failures and skip them entirely on platforms (Windows runners + if any) where the pgvector image is not available. +- Cache the `pgvector/pgvector:pg16` image between runs. The + `docker/setup-buildx-action` cache or a simple `docker pull` step before + the test step keeps cold-start under the existing CI time budget. +- Skip CI: contributors without Docker can still merge changes that do + not touch `storage/postgres/`. The pre-merge required check is "phase_2 + tests pass on the runner with Docker"; the local pre-commit hook does + not gate on it. +- Bench CI: do not run `pg_search_100k` in regular CI; only run it + manually or on a scheduled weekly job and post results to the PR + description / ADR comment trail. + +Recommended CI job shape (sketch): + +```yaml +jobs: + postgres-tests: + runs-on: ubuntu-latest + services: + # no `postgres` service block needed; testcontainers manages its own + steps: + - uses: actions/checkout@v4 + - run: docker pull pgvector/pgvector:pg16 + - uses: dtolnay/rust-toolchain@stable + - run: cargo test -p vestige-core --features postgres-backend --test '*' +``` + +--- + +## Verification + +After all files are in place: + +```bash +# Default build still clean (no postgres deps pulled in): +cargo build -p vestige-core +cargo test -p vestige-core + +# Postgres feature build + integration tests: +cargo build -p vestige-core --features postgres-backend +cargo test -p vestige-core --features postgres-backend + +# Just the new tests: +cargo test -p vestige-core --features postgres-backend --test '*' + +# Quick bench sanity check (1k only): +cargo bench -p vestige-core --features postgres-backend --bench pg_hybrid_search -- --quick + +# Heavy bench (manual, multi-minute seed step): +VESTIGE_BENCH_HEAVY=1 cargo bench -p vestige-core \ + --features postgres-backend \ + --bench pg_hybrid_search -- --quick + +# Clippy with everything on: +cargo clippy -p vestige-core --features postgres-backend --all-targets -- -D warnings +``` + +Expected results: + +- Default build is unchanged; no testcontainers deps in `Cargo.lock`'s + default resolution. +- With `--features postgres-backend`, all six integration tests pass on a + machine with Docker available, or each prints `docker unavailable; skip` + and exits 0. +- `cargo bench ... -- --quick` produces a `pg_search_1k` line with a + p50 below the master plan's 10 ms target on a developer laptop (looser + on a CI runner -- the target is informative, not gated). + +--- + +## Acceptance criteria + +- [ ] `crates/vestige-core/tests/phase_2/common/mod.rs` and + `test_embedder.rs` exist and compile under + `--features postgres-backend`. +- [ ] All six integration test files exist, each with + `#![cfg(feature = "postgres-backend")]` at the top. +- [ ] Each test file has a corresponding `[[test]]` entry in + `Cargo.toml` with `required-features = ["postgres-backend"]`. +- [ ] `crates/vestige-core/benches/pg_hybrid_search.rs` exists with + `search_1k` and `search_100k` benches, the latter gated on + `VESTIGE_BENCH_HEAVY`. +- [ ] `[[bench]] name = "pg_hybrid_search"` entry present with + `required-features = ["postgres-backend"]`. +- [ ] `testcontainers@0.22` and `testcontainers-modules@0.10` with the + `postgres` feature are in `[dev-dependencies]` of `vestige-core`. +- [ ] `anyhow`, `tokio`, `rand` are in `[dev-dependencies]`. +- [ ] `cargo build -p vestige-core` (default features) is unchanged: no + testcontainers in the build graph; no new warnings. +- [ ] `cargo test -p vestige-core` (default features) passes with no + changes to the Phase 1 test count beyond what `0002a..g` already + moved. +- [ ] `cargo test -p vestige-core --features postgres-backend --test '*'` + passes on a runner with Docker available, or skips cleanly with the + `docker unavailable; skip` lines. +- [ ] `cargo bench -p vestige-core --features postgres-backend + --bench pg_hybrid_search -- --quick` runs `pg_search_1k` to + completion and does NOT run `pg_search_100k` unless + `VESTIGE_BENCH_HEAVY=1`. +- [ ] `cargo clippy -p vestige-core --features postgres-backend + --all-targets -- -D warnings` is clean. +- [ ] The runbook (`0002i`) gets a one-paragraph "How to run the test + suite locally" callout referring back to this sub-plan's + "Verification" section. (`0002i` is owned separately; this sub-plan + just lists the dependency.) + +--- + +## Open questions for the implementer + +1. **Migration helper name.** `0002c` decides whether + `PgMemoryStore::run_migrations(&self)` or + `vestige_core::storage::postgres::migrations::run(&pool)` is the public + call. Update `common/mod.rs` to match. +2. **Update-on-missing contract.** `0002d` decides whether + `MemoryStore::update` returns `Err(NotFound)` or `Ok(())` with zero + affected rows when the id does not exist. The CRUD test stub here + accepts either; tighten the assert once the contract is fixed. +3. **Empty-query search contract.** `0002e` decides whether + `SearchQuery { text: None, embedding: None }` is `Ok(empty)` or an + error. Same tightening pattern as #2. +4. **Pool size for 100k bench.** Current value is 8; if the bench + bottlenecks on the pool, tune up to 16 or 32 and document in the + bench file's leading doc comment. +5. **Shared `TestEmbedder` location.** Currently duplicated between + `tests/phase_2/common/test_embedder.rs` and + `benches/test_embedder.rs`. If duplication bothers a reviewer, lift to + `crates/vestige-core/src/embedder/test_support.rs` behind a + `test-support` Cargo feature pulled in by both `tests` and `benches`. + Out of scope for this sub-plan; record as a follow-up. diff --git a/docs/plans/0002i-runbook.md b/docs/plans/0002i-runbook.md new file mode 100644 index 0000000..f63694f --- /dev/null +++ b/docs/plans/0002i-runbook.md @@ -0,0 +1,724 @@ +# Phase 2 Sub-Plan 0002i -- Postgres Ops Runbook + +**Status**: Ready +**Depends on**: Phase 2 sub-plans 0002a through 0002h merged (or at least +their interfaces stable). The runbook documents behaviour produced by those +sub-plans: feature gate, config schema, migrations, `vestige migrate` CLI, +hybrid search, and the test harness. Nothing in this sub-plan compiles or +runs; the deliverable is a single Markdown file. + +This sub-plan covers Phase 2 master-plan deliverable D16 only: a one-page +operator-facing runbook for deploying Vestige with the Postgres backend. + +--- + +## Context + +Why a runbook. The ADR (0002) and the master plan (0002) are written for +implementors. They settle execution-level decisions and itemise deliverables. +They are not deployable instructions. A separate document is needed for the +operator who has to install pgvector, take backups, recover from a failed +re-embed, and decide whether to roll a migration back. The runbook is that +document. + +Who reads it. Ops people, not developers. Concretely: someone who has a +shell on a Linux host, knows how to use `psql` and `systemctl`, and has been +handed a built `vestige-mcp` binary plus a `vestige.toml`. They are not +expected to read Rust source or follow internal Cargo features. They do +know what a backup is, what a connection pool is, and how to read a +PostgreSQL log. + +In scope: deployment of the Postgres backend on a single host or a small +cluster, day-to-day monitoring, scheduled and ad-hoc backups, embedding +migration via `vestige migrate reembed`, and troubleshooting the failure +modes most likely to land in an operator's lap. + +Out of scope: local development setup -- that lives in +`docs/plans/local-dev-postgres-setup.md` and the runbook links to it for +developer onboarding only. Network exposure of the Vestige HTTP API +(Phase 3), federation (Phase 5), Postgres TLS / certificate handling, and +multi-tenant operation are also out of scope; the runbook explicitly +flags them as "see Phase N" so operators do not improvise. + +This sub-plan is the plan for producing the runbook. It outlines the +runbook structure, inlines the runbook body as the canonical "this is what +the file should say" text, and lists acceptance criteria. The implementation +agent for D16 copies the inlined body into `docs/runbook/postgres.md`, +creating `docs/runbook/` if it does not already exist. No other files in the +repository are modified. + +--- + +## Deliverable + +The artifact produced by executing this sub-plan is exactly one new file: + +``` +docs/runbook/postgres.md +``` + +It is NOT under `docs/plans/`. Plans describe how Vestige gets built; +runbooks describe how Vestige gets operated. The two directories are +deliberately separated. + +Side effect: create the directory `docs/runbook/` if it does not exist. +Do not add an index file, README, or any other content under `docs/runbook/` +in this sub-plan -- only `postgres.md`. + +This sub-plan document (`docs/plans/0002i-runbook.md`) is itself NOT a +deliverable in the operator sense. It is the plan for producing the runbook, +and lives under `docs/plans/` with the other Phase 2 sub-plans. + +--- + +## Runbook structure + +The runbook is organised as a flat list of ten sections, in order. Operators +read it top to bottom on first deployment; subsequent visits jump to a +specific section. Section numbering matches the inlined body below. + +1. **Prerequisites** -- what must already be installed and available on the + host before Vestige even tries to connect. PostgreSQL 16 or newer + (18 on Arch is fine), pgvector >= 0.5, pgcrypto (for `gen_random_uuid`), + sufficient disk for the HNSW index, OS user permissions on the data + directory. + +2. **Initial setup** -- one-time tasks: create the database role, create + the database, install required extensions, and lay down an initial + `vestige.toml`. Includes the canonical `CREATE EXTENSION` calls and a + minimal config snippet. + +3. **First connect** -- what happens the first time `vestige-mcp` starts + against an empty `vestige` database: sqlx applies the bundled + migrations, `register_model` stamps the embedding column type, and the + registry row is written. How an operator verifies each step succeeded + using `psql`. + +4. **Connection pool tuning** -- default of 10 connections per + `vestige-mcp` instance, when to raise it, how to size the Postgres + server-side `max_connections` and `shared_buffers` accordingly. Cross- + reference to `vestige.toml` and to ADR 0002 D2 / open question Q5. + +5. **Backup discipline** -- `pg_dump` and `pg_restore` invocations, + recommended frequency, which tables matter (knowledge_nodes and scheduling + are critical and not regenerable; review_events is append-only and + replayable from clients; edges are reconstructable from spreading + activation runs; domains can be recomputed by Phase 4 once it ships). + Also covers backup verification (restore-to-tmp drill). + +6. **Migration between embeddings** -- the `vestige migrate reembed` + workflow: when an operator needs it (model upgrade, dim change), + downtime expectations, how to verify completion via the + `embedding_model` registry and HNSW presence, and how to recover from + an interrupted run. + +7. **Re-clustering domains** -- a brief forward reference. Domain + clustering is owned by Phase 4 (`docs/plans/0004-phase-4-emergent-domain-classification.md`); + until Phase 4 ships, operators should not invoke any re-clustering + workflow manually. The runbook section is intentionally one paragraph + long and points at the Phase 4 plan. + +8. **Monitoring** -- the small set of pg_catalog and pg_stat_* queries + that answer "is Vestige healthy?": `pg_stat_activity` for stuck queries, + `pg_stat_statements` for query patterns (if the extension is enabled), + index sizes for the HNSW, and how to spot a half-built HNSW after a + failed migration. + +9. **Troubleshooting** -- a table of common errors with the symptom and + the fix. Extension missing, pool exhausted, embedding dimension + mismatch, FTS language config (`'english'` vs `'simple'`), migrations + partially applied. + +10. **Rollback caveats** -- every `*.up.sql` has a `*.down.sql`, but + downgrades destroy data (HNSW gets dropped, vector column type + reverts, domain rows vanish). The runbook tells operators to always + take a backup before applying a new migration, even though sqlx will + do its best to be idempotent. + +--- + +## Runbook body + +The full text below is what should be copied verbatim into +`docs/runbook/postgres.md`. ASCII only. Code blocks use fenced syntax with +language hints. Operator-facing prose; second person ("you") for +instructions. Where a command requires sudo, the prompt shows it explicitly. + +```markdown +# Vestige Postgres Backend -- Operator Runbook + +This runbook covers deploying, operating, monitoring, and recovering a +Vestige installation that uses the Postgres backend. It is written for +operators handling a built `vestige-mcp` binary and a `vestige.toml`. + +For local development setup, see +`docs/plans/local-dev-postgres-setup.md`. For the architectural rationale, +see `docs/adr/0001-pluggable-storage-and-network-access.md` and +`docs/adr/0002-phase-2-execution.md`. For the deliverable-level plan, see +`docs/plans/0002-phase-2-postgres-backend.md`. + +--- + +## 1. Prerequisites + +Before Vestige can connect: + +- PostgreSQL server, version 16 or newer. Arch ships 18.x; Debian stable + ships 16.x; both work. +- `pgvector` extension, version 0.5 or newer. Distro packages: + `pgvector` on Arch, `postgresql-16-pgvector` on Debian/Ubuntu. +- `pgcrypto` extension, shipped with the PostgreSQL contrib package + (`postgresql-contrib` on Debian, included in the base `postgresql` + package on Arch). Vestige uses `gen_random_uuid()` from pgcrypto for + primary keys. +- Disk space: budget roughly 4x the size of your `knowledge_nodes.embedding` + column for the HNSW index. With 768-dim float32 vectors at 100k + memories, that is about 1.2 GB for the embeddings plus 4-5 GB for the + HNSW index. Plan accordingly. +- OS user: the `postgres` system user (or whatever user owns + `/var/lib/postgres/data`) must have read/write on the data directory. + Vestige itself does not need filesystem access to Postgres; it talks + TCP only. +- Network: Vestige and Postgres can be on the same host (loopback) or + different hosts. If different hosts, allow the Vestige host's IP in + `pg_hba.conf` and on any firewall. + +--- + +## 2. Initial setup + +These steps run once per Postgres cluster. + +### 2.1 Install extensions + +As the `postgres` superuser: + +```sh +sudo -u postgres psql -d vestige <<'SQL' +CREATE EXTENSION IF NOT EXISTS vector; +CREATE EXTENSION IF NOT EXISTS pgcrypto; +SQL +``` + +Verify: + +```sh +sudo -u postgres psql -d vestige -c \ + "SELECT extname, extversion FROM pg_extension WHERE extname IN ('vector','pgcrypto');" +``` + +You should see two rows. If `vector` is missing, the pgvector package was +not installed for the right PostgreSQL major version; reinstall it. + +### 2.2 Create the role and database + +The `vestige` role owns its own database; it does NOT need superuser. +Extensions must be installed by `postgres`, not by `vestige`. + +```sh +sudo -u postgres psql -v ON_ERROR_STOP=1 <<'SQL' +CREATE ROLE vestige WITH LOGIN CREATEDB PASSWORD 'CHANGE_ME'; +CREATE DATABASE vestige OWNER vestige ENCODING 'UTF8'; +GRANT ALL PRIVILEGES ON DATABASE vestige TO vestige; +SQL + +sudo -u postgres psql -d vestige -v ON_ERROR_STOP=1 <<'SQL' +GRANT ALL ON SCHEMA public TO vestige; +ALTER SCHEMA public OWNER TO vestige; +ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON TABLES TO vestige; +ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON SEQUENCES TO vestige; +ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON FUNCTIONS TO vestige; +SQL +``` + +Replace `CHANGE_ME` with a strong password and store it where Vestige can +read it (typically `~/.vestige_pg_pw`, mode 600, owned by the user running +`vestige-mcp`). + +### 2.3 Minimal `vestige.toml` + +```toml +[storage] +backend = "postgres" + +[storage.postgres] +url = "postgresql://vestige:CHANGE_ME@127.0.0.1:5432/vestige" +max_connections = 10 +``` + +The `url` field accepts a `${VAR}` placeholder; in practice operators +either inline the password or export `DATABASE_URL` and reference +`url = "${DATABASE_URL}"`. See `docs/CONFIGURATION.md` for the full +schema once Phase 3 lands. + +--- + +## 3. First connect + +When `vestige-mcp` starts against an empty `vestige` database, it: + +1. Builds a `PgPool` of `max_connections` (default 10) connections. +2. Runs every migration in `crates/vestige-core/migrations/postgres/` + in order. The bundled migrations are `0001_init` (tables, non-vector + indexes) and `0002_hnsw` (HNSW index on `knowledge_nodes.embedding`). +3. Calls `register_model` once it knows the active embedder's dimension. + This issues `ALTER TABLE knowledge_nodes ALTER COLUMN embedding TYPE + vector($N)` and inserts a row into `embedding_model`. +4. Begins accepting MCP requests. + +To verify after the first start: + +```sh +sudo -u postgres psql -d vestige <<'SQL' +-- All expected tables present. +\dt +-- embedding_model has exactly one row. +SELECT name, dimension, hash FROM embedding_model; +-- The HNSW index exists. +SELECT indexname FROM pg_indexes + WHERE tablename = 'knowledge_nodes' AND indexname LIKE '%hnsw%'; +SQL +``` + +Expected: `knowledge_nodes`, `scheduling`, `edges`, `domains`, `review_events`, +`embedding_model`, `users`, `groups`, `group_memberships`; one row in +`embedding_model`; one `idx_knowledge_nodes_embedding_hnsw` index. + +If a migration fails mid-way, the partial state lands in +`_sqlx_migrations`. See section 9 for recovery. + +--- + +## 4. Connection pool tuning + +Defaults: + +- Vestige client pool: `max_connections = 10` per `vestige-mcp` instance. +- Postgres server: `max_connections = 100` (default). + +Math: one MCP client with the default pool uses up to 10 server slots. +Five concurrent MCP clients use up to 50 slots. The remaining 50 cover +`psql` sessions, background workers, and headroom for replication or +backup processes. + +When to raise: + +- More than three MCP clients connecting to one Postgres instance. +- Long-running queries (above 500ms p99) showing pool wait time in + Vestige logs (look for `pool acquire timed out` warnings). +- A noticeable number of concurrent dream/consolidation runs. + +How to raise: + +```toml +[storage.postgres] +max_connections = 20 # client side, per vestige-mcp instance +``` + +And on the Postgres server, edit `postgresql.conf`: + +```conf +max_connections = 200 +shared_buffers = 2GB # roughly 25 percent of RAM, never above 8GB +``` + +Then restart Postgres (`sudo systemctl restart postgresql`). Vestige +clients pick up their own `max_connections` change on next restart. + +Do not raise pool sizes blindly. Past about 4x the CPU core count, +Postgres throughput drops; a small connection pooler (PgBouncer in +transaction mode) is the right answer above ~200 client connections, +but Vestige's expected scale rarely needs that. + +--- + +## 5. Backup discipline + +### 5.1 Which tables matter + +| Table | Backup priority | Regenerable? | +|-------|-----------------|--------------| +| `knowledge_nodes` | Critical | No | +| `scheduling` | Critical | No (FSRS state) | +| `embedding_model` | Critical | No (one row, but stamps the column type) | +| `users`, `groups`, `group_memberships` | Critical | No (Phase 3 will populate) | +| `review_events` | Important | Replayable by clients but tedious | +| `edges` | Optional | Yes (recomputed by spreading activation) | +| `domains` | Optional | Yes (Phase 4 recomputes by clustering) | + +For a typical single-operator install, dumping the whole database is +fastest and simplest. Skip the optional tables only if dump size becomes +a bandwidth problem. + +### 5.2 Full logical backup + +```sh +pg_dump --host=127.0.0.1 --username=vestige --format=custom \ + --file=vestige-$(date -u +%Y%m%dT%H%M%SZ).dump \ + vestige +``` + +The custom format compresses by default and works with parallel restore. +File size for 10k memories: roughly 80 MB. + +Frequency recommendations: + +- Daily for any installation with active ingest. +- Before every `vestige migrate reembed` run (see section 6). +- Before every Postgres major-version upgrade. +- Retain at least 7 daily, 4 weekly, 3 monthly dumps. Compress with + `--format=custom` (already gzipped) and keep them on different + storage from the database itself. + +### 5.3 Restore + +To a fresh database: + +```sh +sudo -u postgres createdb -O vestige vestige_restore +pg_restore --host=127.0.0.1 --username=vestige --dbname=vestige_restore \ + --jobs=4 vestige-20260301T030000Z.dump +``` + +To replace the live database (destructive; only after taking a fresh +dump): + +```sh +sudo systemctl stop vestige-mcp # or however the service is run +sudo -u postgres dropdb vestige +sudo -u postgres createdb -O vestige vestige +pg_restore --host=127.0.0.1 --username=vestige --dbname=vestige \ + --jobs=4 vestige-20260301T030000Z.dump +sudo systemctl start vestige-mcp +``` + +### 5.4 Restore drill + +Run a restore-to-throwaway-database every month and run `vestige search` +or a manual `psql` count against it. A backup you have not restored is a +backup you do not have. + +```sh +sudo -u postgres createdb -O vestige vestige_restore_drill +pg_restore --host=127.0.0.1 --username=vestige --dbname=vestige_restore_drill \ + --jobs=4 vestige-latest.dump +PGPASSWORD="$(cat ~/.vestige_pg_pw)" psql -h 127.0.0.1 -U vestige \ + -d vestige_restore_drill \ + -c 'SELECT count(*) FROM knowledge_nodes;' +sudo -u postgres dropdb vestige_restore_drill +``` + +--- + +## 6. Migration between embeddings + +Use `vestige migrate reembed` when: + +- Upgrading to a new embedding model that produces a different dimension + (for example, swapping from `nomic-embed-text-v1.5` 768D to a 1024D + model). +- Switching providers and the model hash differs even at the same + dimension. + +What it does: + +1. Reads every row from `knowledge_nodes`, re-encodes the `content` column + through the new embedder, and writes the new vector back. +2. Drops the HNSW index before the re-encode loop (this is the default; + `--concurrent-index` keeps it during the run at the cost of speed). +3. Updates the `embedding_model` row with the new name, dimension, and + hash. +4. Rebuilds the HNSW index with the new vectors. + +### 6.1 Before starting + +- Take a fresh backup (section 5.2). The tool refuses to start without a + `--yes` flag if it detects no recent backup; ignore at your peril. +- Stop ingest. Vestige's MCP server can stay running for read-only + access, but pause any client that calls `smart_ingest` or + `update_scheduling`. +- Have the new embedder model available locally. The CLI loads it + before the first row is touched; if loading fails, no data is changed. + +### 6.2 Running + +```sh +vestige migrate reembed --model= --yes +``` + +Add `--concurrent-index` if you cannot accept the brief window during +HNSW rebuild where queries do not use the index (sequential scan +fallback works but is slow). + +The tool prints a progress bar via `indicatif`. Expected throughput: +roughly 200 memories per second per CPU core for a 768D ONNX model. +10,000 memories on an 8-core box: about 6 seconds, plus HNSW rebuild +(another 30-90 seconds at that scale). + +### 6.3 Verifying completion + +```sh +sudo -u postgres psql -d vestige <<'SQL' +-- Registry reflects the new model. +SELECT name, dimension, hash FROM embedding_model; +-- HNSW index is present and not partial. +SELECT indexname, indexdef + FROM pg_indexes + WHERE tablename = 'knowledge_nodes' AND indexname LIKE '%hnsw%'; +-- All rows have a non-null embedding of the new dimension. +SELECT count(*) FILTER (WHERE embedding IS NULL) AS missing, + count(*) AS total + FROM knowledge_nodes; +SQL +``` + +Expected: registry shows the new model name and dimension, one HNSW +index, zero missing embeddings. + +### 6.4 Recovering from an interrupted run + +`vestige migrate reembed` is restartable. On interruption: + +- The `embedding_model` row may or may not have been updated. Check it + manually and roll forward by re-running with `--yes --resume` (the + tool detects the inconsistency and finishes the rows that still hold + old embeddings). +- The HNSW index may be missing. Re-running the command rebuilds it as + its last step. +- If the system is in a state the tool refuses to reason about, restore + from the backup taken in 6.1. + +--- + +## 7. Re-clustering domains + +Domain clustering is owned by Phase 4 +(`docs/plans/0004-phase-4-emergent-domain-classification.md`). Until +Phase 4 ships, the `domains` table is reserved schema and is populated +only by tests. Operators must not invoke any domain re-clustering +workflow manually; there is no supported one in Phase 2. + +When Phase 4 lands, this section is replaced with the real procedure. + +--- + +## 8. Monitoring + +### 8.1 Quick health check + +```sh +PGPASSWORD="$(cat ~/.vestige_pg_pw)" psql -h 127.0.0.1 -U vestige -d vestige <<'SQL' +SELECT count(*) AS memory_count FROM knowledge_nodes; +SELECT name, dimension FROM embedding_model; +SELECT pg_size_pretty(pg_database_size('vestige')) AS db_size; +SQL +``` + +### 8.2 In-flight queries + +```sql +SELECT pid, now() - query_start AS runtime, state, query + FROM pg_stat_activity + WHERE datname = 'vestige' AND state <> 'idle' + ORDER BY runtime DESC NULLS LAST; +``` + +Anything over 5 seconds with `state = 'active'` deserves a look. HNSW +search queries should land well under 100ms on properly-sized hardware. + +### 8.3 Query pattern analysis + +If `pg_stat_statements` is loaded (`shared_preload_libraries = +'pg_stat_statements'` in `postgresql.conf`): + +```sql +SELECT calls, mean_exec_time, query + FROM pg_stat_statements + WHERE query ILIKE '%knowledge_nodes%' + ORDER BY mean_exec_time DESC + LIMIT 20; +``` + +Look for hybrid-search queries that have drifted above 100ms p50. The +usual culprit is a missing or half-built HNSW index. + +### 8.4 Index health + +```sql +SELECT indexname, pg_size_pretty(pg_relation_size(indexrelid)) AS size, + idx_scan, idx_tup_read + FROM pg_indexes + JOIN pg_stat_user_indexes USING (indexrelid) + WHERE schemaname = 'public' AND relname = 'knowledge_nodes'; +``` + +A HNSW index with `idx_scan = 0` after several hours of traffic usually +means the planner is preferring sequential scan -- either the table is +too small to bother with the index (fine) or the index is corrupt and +needs rebuilding (`REINDEX INDEX idx_knowledge_nodes_embedding_hnsw;`). + +### 8.5 Spotting half-built HNSW + +After a failed migration or a crashed `reembed`: + +```sql +SELECT indexname, indisvalid, indisready + FROM pg_indexes + JOIN pg_index ON indexrelid = (schemaname || '.' || indexname)::regclass + WHERE tablename = 'knowledge_nodes'; +``` + +Any row with `indisvalid = false` is broken. Drop and recreate: + +```sql +DROP INDEX IF EXISTS idx_knowledge_nodes_embedding_hnsw; +CREATE INDEX idx_knowledge_nodes_embedding_hnsw + ON knowledge_nodes USING hnsw (embedding vector_cosine_ops); +``` + +--- + +## 9. Troubleshooting + +| Symptom | Likely cause | Fix | +|---------|--------------|-----| +| `ERROR: extension "vector" is not available` on start | pgvector not installed for this Postgres major version | Install the distro package matching `pg_config --version`, then `CREATE EXTENSION vector;` as superuser | +| `pool timed out while waiting for an open connection` in Vestige logs | Pool too small or stuck queries holding connections | Raise `max_connections` in `vestige.toml`; investigate `pg_stat_activity` for queries above 5s | +| `vector dimensions do not match` on insert | `embedding_model` was stamped at one dimension and a different embedder is now running | Re-run `vestige migrate reembed --model=` or fix the embedder configuration | +| Hybrid search returns the same row twice | Stale `.sqlx/` query cache from before D5 landed | Run `cargo sqlx prepare` in `crates/vestige-core/`, rebuild the binary | +| `text search configuration "english" does not exist` | Postgres locale build does not include the english dictionary (rare on Alpine) | Install the language-pack or override the FTS language in `vestige.toml` (see `[storage.postgres.fts]` once Phase 2 D5 lands) | +| `relation "_sqlx_migrations" exists, but migration X is in "applied" with no checksum` | Previous run died between `BEGIN` and `COMMIT` | Stop Vestige, restore from backup, restart | +| HNSW index very large compared to data | `m` and `ef_construction` defaults too high for the corpus | Acceptable for now; tuning lands as part of Phase 4 | +| `permission denied for schema public` on a new install | `vestige` role does not own `public` | Re-run the grants block in section 2.2 as `postgres` | + +If a problem is not in this table, capture: PostgreSQL log +(`/var/log/postgres/`, journalctl `-u postgresql`), Vestige log +(`RUST_LOG=debug,sqlx=info` for a fresh run), the migration state +(`SELECT * FROM _sqlx_migrations ORDER BY version;`), and file a bug. + +--- + +## 10. Rollback caveats + +Every migration in `crates/vestige-core/migrations/postgres/` has a +matching `*.down.sql`. `sqlx migrate revert` walks them in reverse order. + +This is not the same as risk-free. The `0002_hnsw.down.sql` drops the +HNSW index (rebuildable, expensive). The `0001_init.down.sql` drops +every table -- including `knowledge_nodes`, including data. Down migrations +exist for development, not for casual production use. + +Before applying any new migration: + +1. Take a backup (section 5.2). +2. Run the migration on a restored copy first if you can afford the time. +3. Read the new migration's `*.up.sql` and `*.down.sql` to understand + what changes. + +To revert one migration manually: + +```sh +sqlx migrate revert \ + --database-url "postgresql://vestige:...@127.0.0.1:5432/vestige" \ + --source crates/vestige-core/migrations/postgres +``` + +Note that Vestige's binary does not run `sqlx migrate revert` +automatically. Reverts are always an explicit operator decision. + +If a revert fails partway through, treat the database as inconsistent: +restore from the backup taken in step 1. +``` + +--- + +## Cross-references + +- `docs/adr/0001-pluggable-storage-and-network-access.md` -- ADR that + established the pluggable backend. +- `docs/adr/0002-phase-2-execution.md` -- ADR settling Phase 2 execution + decisions; section "Architecture Overview" lists every table the + runbook references. +- `docs/plans/0002-phase-2-postgres-backend.md` -- master plan; D16 + (deliverables list) and the Open Implementation Questions section + (especially Q4 HNSW rebuild and Q5 pool sizing) inform the runbook's + recommendations. +- `docs/plans/local-dev-postgres-setup.md` -- developer-facing recipe + for a one-machine Arch / CachyOS dev cluster. The runbook links to it + as the "for development, see" pointer. +- `docs/CONFIGURATION.md` -- existing config doc; section 4 of the + runbook ("Connection pool tuning") cross-references it for the + authoritative `vestige.toml` schema. + +--- + +## Verification + +A reviewer is given: + +- A fresh Linux VM (Debian 12 or Arch current; both must work) with + network access and no Postgres installed. +- A built `vestige-mcp` binary for that platform. +- The runbook (`docs/runbook/postgres.md`). + +The reviewer follows the runbook top to bottom and reaches a state in +which Vestige answers MCP requests against the Postgres backend. +Checkpoints, in order: + +1. After section 1 (Prerequisites): `pg_config --version` returns 16 or + newer; `pkg-config --modversion libpq` resolves; the `pgvector` + distro package is installed. +2. After section 2.1 (Extensions): two rows in + `SELECT extname FROM pg_extension WHERE extname IN ('vector', 'pgcrypto');`. +3. After section 2.2 (Role + DB): `psql -U vestige -h 127.0.0.1 -d vestige -c '\conninfo'` + succeeds. +4. After section 2.3 (Config): `vestige.toml` parses (test by + `vestige config print` once that subcommand lands, otherwise + `vestige-mcp --check-config`). +5. After section 3 (First connect): the eight expected tables are + present; `embedding_model` has exactly one row; the HNSW index + exists; `vestige-mcp` log shows "Postgres backend ready". +6. After section 5.2 (Backup): the dump file exists and `pg_restore -l` + on it lists the expected tables. +7. After section 5.4 (Restore drill): the drill database holds the same + row count as the source. + +If any checkpoint fails, the runbook section that produced the failure +is the one that needs revision. Capture the exact command, exit code, +and log line; revise the runbook in a follow-up PR. + +A second reviewer reads the runbook without executing it and checks for: + +- ASCII only; no em dashes, no curly quotes, no Unicode arrows, no + ellipses, no bullets (`*`/`-` ASCII only). +- Every section number from 1 to 10 present and in order. +- Every cross-reference resolves to an existing file or to a Phase + number explicitly marked as "future". +- No code block longer than 30 lines; if longer, it should be split or + referenced from another file. + +--- + +## Acceptance criteria + +- [ ] `docs/runbook/` directory exists. +- [ ] `docs/runbook/postgres.md` exists and matches the inlined body + above byte-for-byte after stripping the outer code fence used in + this sub-plan to embed it. +- [ ] All ten sections from the "Runbook structure" outline are present + under their stated headings. +- [ ] No file other than `docs/runbook/postgres.md` is created or + modified by executing this sub-plan. +- [ ] ASCII only: no em dashes, no curly quotes, no Unicode arrows, + no ellipses, no Unicode bullets (`grep -P '[^\x00-\x7F]' + docs/runbook/postgres.md` returns no matches). +- [ ] Every cross-reference in the runbook points at a file that exists + in the repository at the time of merge, OR is explicitly framed + as "future Phase N" with a pointer to the relevant plan document. +- [ ] Every command block is copy-pastable: no `` syntax + that does not also have an inline note describing what to + substitute. +- [ ] A second pair of eyes confirms the verification checkpoints in the + preceding section are reproducible. +- [ ] The runbook is no longer than the inlined body in this sub-plan; + operators reach the end without losing patience.