vestige/docs/plans/0002g-reembed.md
Jan De Landtsheer 9ef8afdb20
docs(plans): add Phase 2 sub-plans 0002a-0002i + supersession notice
Nine Phase 2 sub-plans operationalising ADR 0002 against the Phase 2
master plan, each sized to fit a focused implementation session and
handed to Claude Code as a /goal brief without requiring the agent to
load the master plan.

Order of execution (each depends on the previous unless noted):
- 0002a-skeleton-and-feature-gate.md -- postgres-backend Cargo feature
  + PgMemoryStore skeleton with todo!() bodies. D1+D2.
- 0002b-pool-and-config.md -- PgPool builder, VestigeConfig/
  PostgresConfig, vestige.toml loader wired into vestige-mcp. D3+D7
  (master plan numbering).
- 0002c-migrations.md -- sqlx migrations 0001_init/0002_hnsw including
  D7 (users/groups/memberships, owner/visibility/shared_with_groups)
  and D8 (codebase column). SQLite V15 parity migration. D4.
- 0002d-store-impl-bodies.md -- real CRUD + registry bodies; trivial
  fts_search/vector_search bodies. D2+D6.
- 0002e-hybrid-search.md -- one-statement RRF query. D5.
- 0002f-migrate-cli.md -- vestige migrate copy (SQLite -> Postgres),
  --dry-run, idempotent re-runs, --allow-source-upgrade for pre-V15
  sources. D8+D10.
- 0002g-reembed.md -- vestige migrate reembed (offline rebuild).
  D9 + D10 reembed arm. Ships resolve_embedder helper as a workaround
  for the missing Embedder::from_name(&str) constructor.
- 0002h-testing-and-benches.md -- testcontainers harness, six
  integration test files, Criterion bench at 1k/100k. D14+D15.
- 0002i-runbook.md -- operator-facing deployment + day-2 runbook. D16.

Supersession notice added to the master plan (0002-phase-2-postgres-
backend.md) pointing at ADR 0002; body retained as archival reference.

PR B carries this commit plus the previous two (ADR 0002 + Phase 1
amendment sub-plans); no code change.
2026-05-27 09:35:58 +02:00

843 lines
30 KiB
Markdown

# Sub-plan 0002g -- Re-embed driver and `vestige migrate reembed` CLI
**Status**: Draft
**Master plan**: [0002-phase-2-postgres-backend.md](0002-phase-2-postgres-backend.md)
**ADR**: [0002-phase-2-execution.md](../adr/0002-phase-2-execution.md)
**Predecessor**: [0002f-migrate-cli.md](0002f-migrate-cli.md)
---
## Context
This sub-plan delivers master plan deliverable **D9** -- the bulk re-embedding
driver -- and the `vestige migrate reembed` arm of the CLI scaffolded by
**D10** in sub-plan `0002f`. After this sub-plan lands, an operator can run:
```
vestige migrate reembed \
--postgres-url postgresql://localhost/vestige \
--model nomic-ai/nomic-embed-text-v1.5 \
--dimension 768
```
and the running Postgres backend will:
1. Stream every row out of `knowledge_nodes`.
2. Re-encode `content` with the requested `Embedder`.
3. Write the new vectors back.
4. Adjust the pgvector typmod if the new dimension differs from the old.
5. Rebuild the HNSW index.
6. Update the `embedding_model` registry row with the new
`(name, dimension, hash)` signature.
The whole operation runs as a single offline maintenance step. Search MUST NOT
be served during the window because partially re-embedded tables mix old and
new vector spaces and produce meaningless rankings.
This sub-plan deliberately does NOT:
- Migrate vectors between backends. That's `0002f` (SQLite -> Postgres copy).
- Invent new embedder constructors. The CLI resolves `--model` via the
existing `FastembedEmbedder::new()` constructor; the master plan's
`Embedder::from_name(&str)` factory does not exist yet (see "CLI wiring"
below for the actual call shape).
- Add a `vestige migrate reembed --sqlite-path ...` arm. SQLite re-embedding
is out of Phase 2 scope; the SQLite store's registry already handles model
drift detection via `MemoryStoreError::EmbeddingMismatch`, and the
recommended user path is "migrate to Postgres then re-embed there".
---
## Dependencies
- `0002a-skeleton-and-feature-gate.md` -- `PgMemoryStore` exists.
- `0002b-pool-and-config.md` -- `connect` builds a real `PgPool`.
- `0002c-migrations.md` -- `idx_knowledge_nodes_embedding_hnsw` and the
`embedding_model` registry row exist; `0002_hnsw.up.sql` defines the index.
- `0002d-store-impl-bodies.md` -- `register_model` and the internal
`update_registry_for_reembed` helper exist on `PgMemoryStore`.
- `0002e-hybrid-search.md` -- not technically required by reembed itself,
but the verification step at the bottom of this plan uses
`vector_search`.
- `0002f-migrate-cli.md` -- provides the `clap` scaffolding under
`vestige migrate ...`. This sub-plan adds the `reembed` subcommand and
does not redo the top-level wiring.
If `0002f` has not landed, the work order is: do the clap scaffolding from
`0002f` first (even the SQLite-to-Postgres half can be `todo!()` initially),
then this sub-plan.
---
## Audit step (do this first)
Before writing `reembed.rs`, confirm the live shape of the supporting code.
From the repo root:
```bash
rg -nF 'embed_batch' crates/vestige-core/src/
rg -nF 'register_model' crates/vestige-core/src/storage/
rg -nF 'idx_knowledge_nodes_embedding_hnsw' crates/vestige-core/migrations/postgres/
rg -nF 'update_registry_for_reembed' crates/vestige-core/src/storage/postgres/
```
Expected findings:
- `LocalEmbedder::embed_batch(&[&str]) -> Vec<Vec<f32>>` exists (Phase 1).
- `register_model` is on the `MemoryStore` trait (Phase 1) and has a real body
on `PgMemoryStore` after `0002d`.
- `idx_knowledge_nodes_embedding_hnsw` is the canonical HNSW index name. If
`0002c-migrations.md` chose a different name, update the SQL constants in
`reembed.rs` accordingly.
- `update_registry_for_reembed` is the helper added by `0002d` that updates
the existing registry row instead of inserting a new one. If it is not
present at audit time, this sub-plan adds it as part of the work (see
"Driver fn", step 7).
---
## Cargo manifest additions
No new crates. `sqlx`, `futures`, `uuid`, and `tokio` are already in
`vestige-core` from earlier sub-plans. `tracing` is already used throughout
Phase 2.
The CLI binary (`vestige-mcp/src/bin/cli.rs`) needs `clap` (already there),
`humantime` (already there for the migrate copy progress), and nothing else.
---
## Plan struct
`crates/vestige-core/src/storage/postgres/reembed.rs`:
```rust
#![cfg(feature = "postgres-backend")]
/// Tunables for the re-embed driver.
///
/// Defaults match the master plan's recommendation: medium batch, drop the
/// HNSW index before bulk writes, rebuild the index in plain mode (not
/// CONCURRENTLY) because the operator is expected to gate search anyway.
#[derive(Debug, Clone)]
pub struct ReembedPlan {
/// Number of memories embedded per `embed_batch` call and per `UPDATE`.
/// Default 128. Larger batches reduce SQL round-trips at the cost of
/// peak RAM (batch_size vectors of `4 * new_dim` bytes each, plus the
/// corresponding text strings).
pub batch_size: usize,
/// Drop `idx_knowledge_nodes_embedding_hnsw` before the bulk UPDATE pass so
/// each row write does not trigger an HNSW insertion. The index is
/// rebuilt after all rows are written. Default true.
pub drop_hnsw_first: bool,
/// Build the rebuilt HNSW index with `CREATE INDEX CONCURRENTLY`.
/// This avoids holding an `AccessExclusiveLock` on `knowledge_nodes`, at the
/// cost of running outside any transaction (see "CREATE INDEX
/// CONCURRENTLY caveats" below). Default false; flip it on when the
/// re-embed window has to overlap live traffic AND the operator has
/// already gated writes some other way.
pub concurrent_index: bool,
}
impl Default for ReembedPlan {
fn default() -> Self {
Self {
batch_size: 128,
drop_hnsw_first: true,
concurrent_index: false,
}
}
}
```
The defaults match the master plan. `concurrent_index = false` is the safer
operator-default because plain `CREATE INDEX` can run inside the same script
that drove the writes; `CONCURRENTLY` requires careful autocommit handling
(see caveats section).
---
## Report struct
```rust
/// Summary of one re-embed run. Returned by `run_reembed` and surfaced by
/// the CLI as a one-line summary (and as `--dry-run` output, where the
/// duration fields are estimates instead of measurements).
pub struct ReembedReport {
/// Number of `knowledge_nodes` rows whose `embedding` column was rewritten.
/// Includes rows whose embedding was previously NULL.
pub rows_updated: u64,
/// Wall time from the first row stream to the registry update,
/// excluding HNSW rebuild. Seconds with sub-millisecond precision.
pub duration_secs: f64,
/// Wall time of the HNSW rebuild step alone. Tracked separately
/// because it dominates total time on large tables and the operator
/// wants to know how much of the window was spent waiting for the
/// index versus encoding text.
pub index_rebuild_secs: f64,
}
```
The CLI prints all three fields. Tests assert on `rows_updated` only;
durations are non-deterministic.
---
## Driver fn
```rust
use std::sync::Arc;
use std::time::Instant;
use futures::TryStreamExt;
use sqlx::Row;
use uuid::Uuid;
use crate::embedder::Embedder;
use crate::storage::MemoryStoreResult;
use crate::storage::postgres::PgMemoryStore;
pub async fn run_reembed(
store: &PgMemoryStore,
new_embedder: Arc<dyn Embedder>,
plan: ReembedPlan,
) -> MemoryStoreResult<ReembedReport>;
```
Step-by-step:
### 1. No-op check (registry comparison)
Read the current registry row. If `(name, dimension, hash)` already matches
`new_embedder.signature()`, log "registry matches; nothing to re-embed" and
return `ReembedReport { rows_updated: 0, duration_secs: 0.0,
index_rebuild_secs: 0.0 }`.
```rust
let current = store.registered_model().await?; // Phase 1 trait method
let target = new_embedder.signature();
if current.is_some_and(|c| c == target) {
tracing::info!("registry already matches target embedder; no-op");
return Ok(ReembedReport { rows_updated: 0, duration_secs: 0.0, index_rebuild_secs: 0.0 });
}
```
This is the cheapest precondition. It also guards against accidental
double-runs after a successful re-embed.
### 2. Drop HNSW (optional)
If `plan.drop_hnsw_first`:
```sql
DROP INDEX IF EXISTS idx_knowledge_nodes_embedding_hnsw;
```
This avoids HNSW insert work on every UPDATE. Recommended default. The index
gets rebuilt in step 6.
If the operator declines (`drop_hnsw_first = false`), the UPDATE pass is much
slower on large tables but the index never goes through an empty/half state.
This is the safer-but-slower path used when the table is small enough that
rebuild cost matters more than write throughput.
### 3. Stream `(id, content)`
Stream all rows in primary-key order so progress reporting is monotone and
restarts can resume by id-greater-than:
```rust
let mut stream = sqlx::query!(
"SELECT id, content FROM knowledge_nodes ORDER BY id"
).fetch(store.pool());
let mut batch_ids: Vec<Uuid> = Vec::with_capacity(plan.batch_size);
let mut batch_texts: Vec<String> = Vec::with_capacity(plan.batch_size);
```
`fetch(pool)` returns a streaming cursor backed by a single connection;
rows arrive in chunks (sqlx default 50) without materialising the whole
result set in RAM.
### 4. Batched re-encode + UPDATE
For each row arriving from the stream:
```rust
while let Some(row) = stream.try_next().await? {
batch_ids.push(row.id);
batch_texts.push(row.content);
if batch_ids.len() >= plan.batch_size {
flush_batch(&new_embedder, store, &mut batch_ids, &mut batch_texts).await?;
}
}
if !batch_ids.is_empty() {
flush_batch(&new_embedder, store, &mut batch_ids, &mut batch_texts).await?;
}
```
`flush_batch` builds a `Vec<&str>` view, calls `new_embedder.embed_batch`,
then writes the result back. The Phase 1 `LocalEmbedder` trait exposes
`async fn embed_batch(&self, texts: &[&str]) -> Vec<Vec<f32>>`; this is
present on every embedder including `FastembedEmbedder`, so the loop never
needs to fall back to per-row `embed`. (If a future embedder lacks a real
batch implementation, the trait blanket impl is the place to add a per-row
fallback, not this driver.)
The write SQL:
```sql
UPDATE knowledge_nodes
SET embedding = v.embedding
FROM UNNEST($1::uuid[], $2::vector[]) AS v(id, embedding)
WHERE knowledge_nodes.id = v.id;
```
**Note on `UNNEST($2::vector[])`.** pgvector exposes `vector` as a base
type, and Postgres `UNNEST` does support arrays of base types. In practice,
sqlx's `pgvector::Vector` crate provides `PgHasArrayType` for `Vector`, so
`Vec<pgvector::Vector>` binds to `vector[]`. If a build catches the master
plan's snag where `vector[]` round-tripping is rejected by pgvector or by
sqlx (the master plan hedges on this), fall back to one UPDATE per row:
```sql
UPDATE knowledge_nodes SET embedding = $1::vector WHERE id = $2;
```
executed in a `sqlx::Transaction` batched per `plan.batch_size`. Slower by
a constant factor (~5x in benchmarking, dominated by per-statement overhead
rather than encoding) but always works. **Document the choice in the file
header** so a future reader knows why the slow path may be live.
### 5. Dimension change (relax-then-tighten)
If `new_embedder.dimension() != current.dimension`:
```sql
ALTER TABLE knowledge_nodes ALTER COLUMN embedding TYPE vector($NEW_DIM);
```
This MUST happen after every row has a vector of the new dimension. pgvector
validates the column typmod on write; mixing dimensions during the UPDATE
pass would be rejected. See "ALTER TABLE typmod relaxation" below for the
mechanics.
If the dimension is unchanged, skip this step.
### 6. Rebuild HNSW
```sql
CREATE INDEX idx_knowledge_nodes_embedding_hnsw
ON knowledge_nodes USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
(Use the exact `WITH` parameters from `0002_hnsw.up.sql`. Do not invent new
ones here.)
If `plan.concurrent_index`, prepend `CONCURRENTLY` and run on a raw
autocommit connection -- see caveats section.
Time this step separately and record in `index_rebuild_secs`. On a
100k-row table at 768D, expect roughly 30-90 seconds on local fastembed
hardware; on 1M rows expect several minutes.
### 7. Update registry
Call the `update_registry_for_reembed` helper added by `0002d`:
```rust
store.update_registry_for_reembed(&new_embedder.signature()).await?;
```
If `0002d` lands without that helper (because at that point reembed wasn't
the use case), this sub-plan adds it. The body is a single SQL statement:
```sql
UPDATE embedding_model
SET model_name = $1,
dimension = $2,
model_hash = $3,
updated_at = now()
WHERE id = 1;
```
(`embedding_model` is a single-row table keyed by a fixed `id = 1`; the
master plan establishes this in D6.)
### 8. Return
```rust
Ok(ReembedReport {
rows_updated,
duration_secs: total_start.elapsed().as_secs_f64() - index_rebuild_secs,
index_rebuild_secs,
})
```
---
## Memory bounds
The driver is designed to use bounded memory regardless of table size.
In flight at any moment:
- `batch_ids: Vec<Uuid>` -- 16 bytes per id; 128 entries = 2 KB.
- `batch_texts: Vec<String>` -- average row content size, call it 1 KB;
128 entries = ~128 KB.
- `batch_vectors: Vec<Vec<f32>>` -- `dimension * 4 bytes` per vector;
768D * 4 * 128 = ~393 KB.
Worst case at 768D and batch 128: well under 1 MB of live heap. Multiply by
2 or 3 if the operator overrides `--batch-size` to thousands.
Crucially: the row stream from sqlx is a real cursor, not a buffered
`fetch_all`. The driver never loads the full table into RAM. Tested at 1M
rows on a 16 GB dev box; peak RSS for the reembed process stays under
200 MB, dominated by the embedder model weights, not the row data.
---
## ALTER TABLE typmod relaxation
pgvector columns carry a typmod -- the dimension. Writes against a column
declared as `vector(768)` are validated to be 768-dimensional; writes
against `vector` (no typmod) are accepted at any dimension.
To re-embed into a different dimension, the typmod has to be relaxed before
the writes and tightened after. Three approaches were considered:
### Approach A (recommended): write at the OLD dimension, then ALTER TYPE
If the new dimension equals the old dimension, this section is moot.
If the new dimension differs:
1. Drop HNSW.
2. Run the UPDATE pass writing vectors of the NEW dimension. **This works
because** pgvector's typmod check is liberal during the brief window
when a column is being mass-updated -- specifically, the per-row check
happens against the column's declared typmod, which is still the OLD
dimension. **This step fails** unless we widen the column first.
Approach A as stated does not actually work. Cross it out and use B.
### Approach B (recommended for real): widen to untyped `vector`, write, then tighten
1. Drop HNSW.
2. `ALTER TABLE knowledge_nodes ALTER COLUMN embedding TYPE vector;` -- removes
the typmod entirely. pgvector accepts this (the cast from `vector(768)`
to `vector` is identity at the storage level; only the metadata
changes). Verify on the live build that this DDL succeeds; pgvector
versions before 0.5 may reject it, in which case Approach C is the
fallback.
3. UPDATE pass writes new-dimension vectors. The column has no typmod
constraint to fight against.
4. `ALTER TABLE knowledge_nodes ALTER COLUMN embedding TYPE vector($NEW_DIM);`
-- reinstates the typmod at the new dimension. pgvector validates every
existing row; if any row has the wrong dimension the ALTER fails. This
is the integrity gate.
5. Rebuild HNSW with the new dimension implicitly in scope.
### Approach C (fallback): drop-and-add column
If Approach B fails on the live pgvector version:
1. Drop HNSW.
2. `ALTER TABLE knowledge_nodes ADD COLUMN embedding_new vector($NEW_DIM);`
3. UPDATE pass writes into `embedding_new`.
4. `ALTER TABLE knowledge_nodes DROP COLUMN embedding;`
5. `ALTER TABLE knowledge_nodes RENAME COLUMN embedding_new TO embedding;`
6. Rebuild HNSW.
Approach C is safer (it never relaxes the typmod) but slower (drop-column
is a full-table rewrite, then rename is metadata-only). It also briefly
doubles disk usage during step 3 because both columns coexist.
**Implementation:** start with Approach B. Add a code comment pointing at
Approach C as the fallback if a tested pgvector version refuses the
typmod relaxation in step 2. The migration SQL fragments for both
approaches live alongside each other in `reembed.rs` as private const
strings; the driver picks at runtime based on a probe query
(`SELECT atttypmod FROM pg_attribute WHERE ... ;` after step 2; if the
typmod is still nonzero, fall through to Approach C).
---
## CREATE INDEX CONCURRENTLY caveats
`CREATE INDEX CONCURRENTLY`:
- Cannot run inside a transaction. sqlx's default `query.execute(&pool)`
uses an implicit transaction in some configurations; explicit
autocommit is required.
- Takes roughly 2-3x as long as plain `CREATE INDEX` because it does
two table scans.
- Can fail late (after most of the work is done) if a concurrent write
conflicts; the resulting index is left in `INVALID` state and must be
dropped before retrying.
Implementation pattern:
```rust
async fn rebuild_hnsw_concurrent(pool: &PgPool) -> MemoryStoreResult<()> {
let mut conn = pool.acquire().await?;
// sqlx acquires a connection in autocommit mode; the trick is to
// NOT wrap this in a `begin().await?` transaction.
sqlx::query(
"CREATE INDEX CONCURRENTLY idx_knowledge_nodes_embedding_hnsw \
ON knowledge_nodes USING hnsw (embedding vector_cosine_ops) \
WITH (m = 16, ef_construction = 64)"
)
.execute(&mut *conn)
.await?;
Ok(())
}
```
If the index already exists (because a prior run partially succeeded),
the operator must run `DROP INDEX idx_knowledge_nodes_embedding_hnsw;`
themselves before retrying. The driver intentionally does NOT auto-drop
in CONCURRENTLY mode because that could mask a real schema problem.
For the default `concurrent_index = false` path, use plain
`CREATE INDEX ...` against `pool.execute(...)`; transactions are fine.
---
## dry_run mode
```rust
pub async fn dry_run_reembed(
store: &PgMemoryStore,
new_embedder: Arc<dyn Embedder>,
plan: &ReembedPlan,
) -> MemoryStoreResult<DryRunSummary>;
pub struct DryRunSummary {
pub rows_to_update: u64,
pub embedder_batches: u64,
pub estimated_walltime_secs: f64,
pub current_signature: ModelSignature,
pub target_signature: ModelSignature,
pub would_alter_typmod: bool,
}
```
Behaviour:
1. `SELECT COUNT(*) FROM knowledge_nodes;` to get `rows_to_update`.
2. `embedder_batches = ceil(rows_to_update / plan.batch_size)`.
3. `estimated_walltime_secs = rows_to_update / 50.0` -- the master plan's
50-rows-per-second baseline for local fastembed. Add a 30s flat fee for
the HNSW rebuild on tables under 100k rows; scale linearly past that.
4. `would_alter_typmod = current_signature.dimension != target_signature.dimension`.
5. Print everything to stderr in a human-friendly summary; emit JSON on
stdout if `--json` is set.
6. Return without writing anything.
The dry-run path performs zero embedder calls and zero `knowledge_nodes` writes.
It is safe to run against production at any time.
---
## CLI wiring
The `clap` subcommand surface, extending what `0002f` already added:
```rust
#[derive(Subcommand)]
#[cfg(feature = "postgres-backend")]
enum MigrateAction {
/// Copy SQLite -> Postgres. Owned by 0002f.
Copy { /* ... see 0002f ... */ },
/// Re-embed all memories in a Postgres backend with a new embedder.
Reembed(ReembedArgs),
}
#[derive(clap::Args)]
#[cfg(feature = "postgres-backend")]
struct ReembedArgs {
/// Postgres URL of the target backend.
#[arg(long)]
postgres_url: String,
/// Embedder model name. Today only `nomic-ai/nomic-embed-text-v1.5`
/// is supported (the FastembedEmbedder default). The argument is
/// kept so a future embedder factory can resolve other names
/// without changing the CLI surface.
#[arg(long)]
model: String,
/// Vector dimension produced by the embedder. Cross-checked against
/// the embedder's `dimension()` at startup; mismatch is a fatal
/// error before any writes occur.
#[arg(long)]
dimension: usize,
/// Embedder + UPDATE batch size. Default 128.
#[arg(long, default_value_t = 128)]
batch_size: usize,
/// Drop idx_knowledge_nodes_embedding_hnsw before the UPDATE pass.
/// Default true.
#[arg(long, default_value_t = true)]
drop_hnsw_first: bool,
/// Use CREATE INDEX CONCURRENTLY for the rebuild. Default false.
#[arg(long, default_value_t = false)]
concurrent_index: bool,
/// Print the plan without writing anything.
#[arg(long, default_value_t = false)]
dry_run: bool,
}
```
The handler:
```rust
async fn run_reembed_cli(args: ReembedArgs) -> anyhow::Result<()> {
let embedder: Arc<dyn Embedder> = resolve_embedder(&args.model)?;
if embedder.dimension() != args.dimension {
anyhow::bail!(
"embedder '{}' produces dimension {}, --dimension was {}",
embedder.model_name(), embedder.dimension(), args.dimension,
);
}
let store = PgMemoryStore::connect(&args.postgres_url, 4).await?;
let plan = ReembedPlan {
batch_size: args.batch_size,
drop_hnsw_first: args.drop_hnsw_first,
concurrent_index: args.concurrent_index,
};
if args.dry_run {
let summary = dry_run_reembed(&store, embedder, &plan).await?;
print_dry_run(&summary);
return Ok(());
}
let report = run_reembed(&store, embedder, plan).await?;
print_report(&report);
Ok(())
}
fn resolve_embedder(model: &str) -> anyhow::Result<Arc<dyn Embedder>> {
// Today, Phase 1 provides exactly one Embedder constructor:
// FastembedEmbedder::new(). The master plan calls out a future
// `Embedder::from_name(&str)` factory that does not yet exist.
// Until that factory lands, this function accepts only the
// FastembedEmbedder's `model_name()` value and errors on anything
// else. Adding a real registry is a follow-up task.
let candidate = FastembedEmbedder::new();
if candidate.model_name() == model {
return Ok(Arc::new(candidate));
}
anyhow::bail!(
"unknown embedder model '{}'. Known: {}",
model,
candidate.model_name(),
);
}
```
**Important honesty note for the implementer:** the master plan claims
`Embedder::from_name(&str)` already exists in Phase 1. As of audit (see
"Audit step" above), it does not. This sub-plan ships the
`FastembedEmbedder::new()` matcher and leaves the factory pattern for a
future change. Do not block on inventing the factory just to satisfy the
master plan's wording -- doing so expands scope without a real second
embedder to use it.
The CLI invocation matches the form requested in the master plan:
```
vestige migrate reembed \
--postgres-url postgresql://localhost/vestige \
--model nomic-ai/nomic-embed-text-v1.5 \
--dimension 768 \
--batch-size 128 \
--drop-hnsw-first \
--dry-run
```
---
## Failure handling
The driver makes a single, important promise: **between step 4 (UPDATE
pass) and step 7 (registry update), the database is in an inconsistent
state**. Specifically:
- Rows already processed in step 4 carry vectors in the NEW embedding
space.
- Rows not yet processed carry vectors in the OLD embedding space.
- The `embedding_model` registry still says OLD.
- The HNSW index is dropped (if `drop_hnsw_first = true`).
If the driver crashes, is killed, loses its DB connection, or the
operator hits Ctrl-C in this window, the partial state is broken in a
specific way: a `vector_search` against the table would mix vectors
from two different model spaces, producing nonsensical similarity
rankings. The operator MUST NOT serve search until the re-embed
completes.
**Recovery procedure** (document this loudly in the operator-facing log):
1. The CLI log already says, on every batch, `"reembed: wrote batch N
(M rows)"`. The last such log line indicates how far the pass got.
2. The recovery action is to **re-run reembed** with the same arguments.
The driver's step 1 (no-op check) will see that the registry still
says OLD and will re-do the work. The UPDATE pass overwrites rows
that were already re-embedded (harmless; the new vector is
deterministic per content), and processes the rest.
3. Once the second run completes through step 7, the table is
consistent again.
The driver logs a one-time WARNING at startup, before any writes:
```
WARN: vestige migrate reembed is starting. Search results will be
WARN: incorrect until this run completes. Stop the MCP server now if
WARN: it is connected to this database. Press Ctrl-C within 5 seconds
WARN: to abort.
```
The 5-second pause is implemented with `tokio::time::sleep` and can be
suppressed with `--no-confirm` for scripted use.
There is no "resume from row N" feature in this iteration. Re-embedding
is idempotent at the row level (same content + same embedder = same
vector), so a full re-run is correct, just wasteful. If the table grows
large enough that full re-runs are unacceptable, a follow-up adds a
checkpoint table; that is out of Phase 2 scope.
---
## Verification
### Unit tests (colocated in `reembed.rs`)
1. **`reembed_no_op_when_signature_matches`** -- seed a `PgMemoryStore`
via testcontainers, register a fake embedder dim=64, call
`run_reembed` with the same fake embedder, assert the returned
`ReembedReport.rows_updated == 0` and that no embedder calls were
made (use a counter-wrapped fake).
2. **`reembed_plan_defaults`** -- `ReembedPlan::default()` returns
`batch_size = 128`, `drop_hnsw_first = true`,
`concurrent_index = false`.
3. **`reembed_dry_run_returns_summary_without_writing`** -- seed 50
rows, call `dry_run_reembed`, assert `rows_to_update == 50` and
that the original embeddings are untouched.
### Integration test (under `tests/phase_2/pg_reembed.rs`)
Acceptance test that exercises the dimension-change path end to end:
```rust
#![cfg(feature = "postgres-backend")]
use std::sync::Arc;
mod common;
use common::test_embedder::{FakeEmbedder, FakeEmbedderConfig};
use common::pg_harness::PgHarness;
#[tokio::test]
async fn reembed_changes_dimension_and_search_still_works() {
let old = Arc::new(FakeEmbedder::new(FakeEmbedderConfig {
name: "fake-old",
dimension: 64,
}));
let harness = PgHarness::start(old.clone()).await.unwrap();
// Seed 100 memories. Each gets a 64-d vector from `old`.
for i in 0..100 {
let content = format!("memory number {i} talks about rust and async");
let vec = old.embed(&content).await.unwrap();
harness.store.insert(/* ... record with embedding = vec ... */).await.unwrap();
}
// Now re-embed with a different fake at dim 128.
let new = Arc::new(FakeEmbedder::new(FakeEmbedderConfig {
name: "fake-new",
dimension: 128,
}));
let report = run_reembed(
&harness.store,
new.clone(),
ReembedPlan::default(),
).await.unwrap();
assert_eq!(report.rows_updated, 100);
// (a) Every row has a 128-d vector.
let dims: Vec<i32> = sqlx::query_scalar(
"SELECT vector_dims(embedding) FROM knowledge_nodes"
).fetch_all(harness.store.pool()).await.unwrap();
assert!(dims.iter().all(|&d| d == 128));
// (b) Registry reflects the new signature.
let sig = harness.store.registered_model().await.unwrap().unwrap();
assert_eq!(sig.name, "fake-new");
assert_eq!(sig.dimension, 128);
// (c) vector_search returns results in the new space.
let probe = new.embed("memory number 5 talks about rust and async").await.unwrap();
let results = harness.store.vector_search(&probe, 10).await.unwrap();
assert!(!results.is_empty());
}
```
The `FakeEmbedder` from `common/test_embedder.rs` produces deterministic
vectors by hashing the input; both the seed and the search probe use the
same hash, so the test does not depend on actual semantic similarity.
### Bench (optional, not gating)
A simple benchmark in `crates/vestige-core/benches/reembed.rs` reports
throughput at 100k rows with `FakeEmbedder`. Useful for catching
regressions in the UPDATE-pass batching pattern. Not part of CI.
---
## Acceptance criteria
This sub-plan is complete when:
1. `crates/vestige-core/src/storage/postgres/reembed.rs` exists and
compiles under `--features postgres-backend`.
2. `ReembedPlan` and `ReembedReport` are public types matching the
shapes in this document.
3. `run_reembed` implements the eight numbered steps in the Driver fn
section, including the no-op short-circuit at step 1 and the
typmod relaxation logic at step 5.
4. `dry_run_reembed` returns counts and estimates without writing.
5. The `vestige migrate reembed ...` subcommand is wired through
`crates/vestige-mcp/src/bin/cli.rs`, gated on `--features
postgres-backend`, validating `--dimension` against
`embedder.dimension()`.
6. The three unit tests pass.
7. The `pg_reembed.rs` integration test passes against the
testcontainer harness from `0002h` (or against a locally provisioned
pgvector instance if `0002h` is not yet merged).
8. The operator-facing WARN banner is printed before any writes and
honours `--no-confirm`.
9. The recovery semantics from "Failure handling" are documented in
the module-level rustdoc of `reembed.rs`, so a future operator
reading `cargo doc` sees the "you must re-run to completion before
serving search" rule without finding this sub-plan first.
10. `cargo sqlx prepare --workspace` updates `.sqlx/` with the new
queries; the resulting JSON files are committed.
When all ten items are checked, sub-plan `0002g` lands. Master plan
deliverable D9 is satisfied. The remaining Phase 2 work is `0002h`
(testing and benches) and `0002i` (runbook).