mirror of https://github.com/samvallad33/vestige.git
synced 2026-04-25 00:36:22 +02:00

docs: ADR 0001 + Phase 1-4 implementation plans

Pluggable storage backend, network access, and emergent domain classification.
Introduces MemoryStore + Embedder traits, PgMemoryStore alongside
SqliteMemoryStore, HTTP MCP + API key auth, and HDBSCAN-based domain
clustering. Phase 5 federation deferred to a follow-up ADR.

- docs/adr/0001-pluggable-storage-and-network-access.md -- Accepted
- docs/plans/0001-phase-1-storage-trait-extraction.md
- docs/plans/0002-phase-2-postgres-backend.md
- docs/plans/0003-phase-3-network-access.md
- docs/plans/0004-phase-4-emergent-domain-classification.md
- docs/prd/001-getting-centralized-vestige.md -- source RFC

parent 2391acf480
commit 0d273c5641

6 changed files with 5667 additions and 0 deletions

docs/adr/0001-pluggable-storage-and-network-access.md (303 lines, new file)
# ADR 0001: Pluggable Storage Backend, Network Access, and Emergent Domains

**Status**: Accepted
**Date**: 2026-04-21
**Related**: [docs/prd/001-getting-centralized-vestige.md](../prd/001-getting-centralized-vestige.md)

---

## Context

Vestige v2.x runs as a per-machine local process: stdio MCP transport, SQLite +
FTS5 + USearch HNSW in `~/.vestige/`, and fastembed running locally for
embeddings. This is ideal for single-machine, single-agent use but blocks three
real needs:

- **Multi-machine access** -- the same memory brain from laptop, desktop, and server
- **Multi-agent access** -- multiple AI clients against one store concurrently
- **Future federation** -- syncing memory between decentralized nodes (MOS /
  Threefold grid)

SQLite's single-writer model and lack of a native network protocol make it
unsuitable as a centralized server. PostgreSQL + pgvector collapses our three
storage layers (SQLite, FTS5, USearch) into one engine with MVCC concurrency,
auth, and replication.

Separately, Vestige today has no notion of domain or project scope -- all
memories share one namespace. For a multi-machine brain, users want soft
topical boundaries ("dev", "infra", "home") without manual tenanting. HDBSCAN
clustering on embeddings produces these boundaries from the data itself.

The PRD at `docs/prd/001-getting-centralized-vestige.md` sketches the full
design. This ADR records the architectural decisions and resolves the open
questions from that document.

---

## Decision

Introduce two new trait boundaries, a network transport layer, and a domain
classification module. All four changes ship in separate, independently
shippable phases.

**Trait boundaries:**

1. `MemoryStore` -- a single trait covering CRUD, hybrid search, FSRS
   scheduling, graph edges, and domains. One big trait, not four.
2. `Embedder` -- a separate trait for text-to-vector encoding. Storage never
   calls fastembed directly. Callers (the cognitive engine locally, the HTTP
   server remotely) compute embeddings and pass them into the store.
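
The `Embedder` boundary can be sketched as a trait plus a deterministic test double. This is a simplified synchronous sketch (the real trait is async per this ADR); `MockEmbedder` and its hashing scheme are hypothetical illustrations, not code from the repo.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Simplified, synchronous sketch of the `Embedder` boundary.
/// The real trait uses `async fn embed`; async is dropped here for brevity.
pub trait Embedder {
    fn embed(&self, text: &str) -> Result<Vec<f32>, String>;
    fn model_name(&self) -> &str;
    fn dimension(&self) -> usize;
}

/// Hypothetical deterministic test double -- not part of the ADR itself.
pub struct MockEmbedder {
    pub dim: usize,
}

impl Embedder for MockEmbedder {
    fn embed(&self, text: &str) -> Result<Vec<f32>, String> {
        // Derive a repeatable pseudo-vector from the text hash (LCG expansion).
        let mut h = DefaultHasher::new();
        text.hash(&mut h);
        let mut seed = h.finish();
        Ok((0..self.dim)
            .map(|_| {
                seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1);
                (seed >> 40) as f32 / (1u64 << 24) as f32 // each component in [0, 1)
            })
            .collect())
    }

    fn model_name(&self) -> &str {
        "mock-embedder"
    }

    fn dimension(&self) -> usize {
        self.dim
    }
}
```

Because the caller computes the embedding and hands it to the store, the same `CognitiveEngine` code path works whether the vectors come from fastembed locally or are computed server-side for a remote client.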

**Backends:**

- `SqliteMemoryStore` -- existing code refactored behind the trait, no behavior
  change.
- `PgMemoryStore` -- new, using sqlx + pgvector + tsvector. Selectable at
  runtime via `vestige.toml`.

**Network:**

- MCP over Streamable HTTP on the existing Axum server.
- API key auth middleware (blake3-hashed keys, stored in an `api_keys` table).
- The dashboard uses the same API keys for login, then signed session cookies
  for subsequent requests.

**Domain classification:**

- HDBSCAN clustering over embeddings to discover domains automatically.
- Soft multi-domain assignment -- raw similarity scores are stored per memory,
  and every domain above a threshold is assigned.
- Conservative drift handling -- propose splits/merges, never auto-apply.

---

## Architecture Overview

### Component Breakdown

1. **`Embedder` trait** (new module `crates/vestige-core/src/embedder/`)
   - `async fn embed(&self, text: &str) -> Result<Vec<f32>>`
   - `fn model_name(&self) -> &str`
   - `fn dimension(&self) -> usize`
   - Impls: `FastembedEmbedder` (local ONNX, today); future `JinaEmbedder`,
     `OpenAiEmbedder`, etc.
   - Stays pluggable indefinitely -- no lock-in to fastembed or to
     nomic-embed-text.

2. **`MemoryStore` trait** (new module `crates/vestige-core/src/storage/trait.rs`)
   - One trait, ~25 methods across CRUD, search, FSRS, graph, and domain
     sections.
   - Uses `trait_variant::make` to generate a `Send`-bound variant for
     `Arc<dyn MemoryStore>` in Axum/tokio contexts.
   - The 29 cognitive modules operate exclusively through this trait -- no
     direct SQLite or Postgres access from the modules.

3. **`SqliteMemoryStore`** (refactor of existing `crates/vestige-core/src/storage/sqlite.rs`)
   - Existing rusqlite + FTS5 + USearch code, wrapped behind the trait.
   - Adds a `domains TEXT[]` equivalent (JSON-encoded array column in SQLite).
   - Adds a `domain_scores` JSON column.
   - No behavioral change for current users.

4. **`PgMemoryStore`** (new `crates/vestige-core/src/storage/postgres.rs`)
   - `sqlx::PgPool` with compile-time checked queries.
   - pgvector HNSW index for vector search; tsvector + GIN for FTS.
   - Native array columns for `domains`; JSONB for `domain_scores` and
     `metadata`.
   - Hybrid search via RRF (Reciprocal Rank Fusion) in a single SQL query.

5. **Model registry**
   - Per-database table `embedding_model` with `(name, dimension, hash, created_at)`.
   - Both backends refuse writes from an embedder whose signature doesn't match
     the registered row.
   - A model swap is `vestige migrate --reembed --model=<new>` -- O(n) cost,
     explicit.

6. **`DomainClassifier` cognitive module** (new `crates/vestige-core/src/neuroscience/domain_classifier.rs`)
   - Owns the HDBSCAN discovery pass (using the `hdbscan` crate).
   - Computes soft-assignment scores for every memory against every centroid.
   - Stores raw `domain_scores: HashMap<String, f64>` per memory; thresholds
     into the `domains` array using `assign_threshold` (default 0.65).
   - Runs discovery on demand (`vestige domains discover`) or during dream
     consolidation passes.

7. **HTTP MCP transport** (extension of the existing Axum server in `crates/vestige-mcp/src/`)
   - New route `POST /mcp` for Streamable HTTP JSON-RPC.
   - New route `GET /mcp` for SSE (long-running operations).
   - REST API under `/api/v1/` for direct HTTP clients (non-MCP integrations).
   - Auth middleware validates `Authorization: Bearer ...` or `X-API-Key`, plus
     signed session cookies for the dashboard.

8. **Key management** (new `crates/vestige-mcp/src/auth/`)
   - `api_keys` table -- blake3-hashed keys, scopes, optional domain filter,
     last-used timestamp.
   - CLI: `vestige keys create|list|revoke`.

9. **FSRS review event log** (future-proofing for federation)
   - New table `review_events` -- append-only `(memory_id, timestamp, rating,
     prior_state, new_state)`.
   - The current `scheduling` table becomes a materialized view over the event
     log (reconstructible from events).
   - Phase 5 federation merges event logs, not derived state. Zero cost today,
     avoids lock-in tomorrow.
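
The RRF fusion mentioned for `PgMemoryStore` (item 4) runs inside one SQL query in the real backend; the fusion math itself is small enough to show in plain Rust. This is an illustrative sketch only -- the constant `k = 60` is the conventional RRF damping value, an assumption, not a number fixed by this ADR.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank_d),
/// with 1-based ranks. `k` damps the dominance of top ranks. Illustrative
/// sketch -- the actual backend performs this fusion inside a Postgres query.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (rank, id) in ranking.iter().enumerate() {
            // `rank` is 0-based; RRF uses 1-based ranks.
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (k + (rank + 1) as f64);
        }
    }
    let mut out: Vec<(String, f64)> = scores.into_iter().collect();
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    out
}
```

A document ranked near the top of both the vector ranking and the FTS ranking beats one ranked first in only one of them, which is exactly the behavior hybrid search wants.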

### Data Flow

**Local mode (stdio MCP, unchanged UX):**

```
stdio client -> McpServer -> CognitiveEngine -> FastembedEmbedder -> MemoryStore (SQLite)
```

**Server mode (HTTP MCP, new):**

```
Remote client -> Axum HTTP -> auth middleware -> CognitiveEngine
              -> FastembedEmbedder (server-side) -> MemoryStore (Postgres)
```

The cognitive engine is backend-agnostic; the embedder and the store are both
swappable. The 7-stage search pipeline (overfetch -> cross-encoder rerank ->
temporal -> accessibility -> context match -> competition -> spreading
activation) sits *above* the `MemoryStore` trait and works identically against
either backend.

### Orthogonality of HDBSCAN and Reranking

HDBSCAN and the cross-encoder reranker solve different problems, and both stay:

- **HDBSCAN** discovers domains by clustering embeddings. It runs once per
  discovery pass and produces centroids, which are used to *filter* search
  candidates, not to rank them.
- **Cross-encoder reranker** (Jina Reranker v1 Turbo) scores query-document
  pairs at search time. It runs on every search and produces ranked results.

Domain membership is a filter applied before or during overfetch; reranking
runs on whatever candidate set survives the filter.

---

## Alternatives Considered

| Alternative | Pros | Cons | Why Not |
|-------------|------|------|---------|
| Split into 4 traits (`MemoryStore` + `SchedulingStore` + `GraphStore` + `DomainStore`) | Cleaner interface segregation | Every module holds 4 trait objects and coordinates transactions across them | One trait is fine in Rust; extract sub-traits later if a genuine need appears |
| Embedding computed inside the backend | Simpler call sites | Backend becomes aware of embedding models; can't support remote clients without local fastembed | Keep storage pure; a separate `Embedder` trait handles pluggability |
| Unconstrained pgvector `vector` (no dimension) | Flexible for model swaps | HNSW still needs fixed dims at index creation; hides a meaningful migration as "silent" | Fixed dimension per install, explicit `--reembed` migration |
| Separate dashboard auth (cookies only, no keys) | Simpler dashboard UX | Two auth systems to maintain | Shared API keys with a session-cookie layer on top |
| Auto-tuned `assign_threshold` targeting an unclassified ratio | Adapts to the corpus | Hard to debug ("why did this memory change domain?"); magical | Static 0.65 default, config-tunable; dashboard shows `domain_scores` for manual retuning |
| Aggressive drift (auto-reassign memories whose scores drifted) | Always up-to-date domains | Breaks user muscle memory; silent reshuffling | Conservative: always propose, user accepts |
| CRDTs for federation state | Mathematically clean merges | Massive complexity and performance cost; overkill | Defer; design FSRS as an event log now so any future sync model works |

---

## Consequences

### Positive

- A single memory brain accessible from every machine.
- Multi-agent concurrent access via Postgres MVCC.
- Natural topical scoping emerges from the data, not from manual tenants.
- Future embedding model swaps are a config change plus a migration, not a
  rewrite.
- Federation has a clean on-ramp (event log merge) without committing now.
- The `Embedder` / `MemoryStore` split unlocks other storage backends later
  (Redis, Qdrant, an Iroh-backed blob store, etc.) with minimal work.

### Negative

- Operating a Postgres instance is more work than managing a SQLite file.
- Users who stay on SQLite gain nothing from this ADR (but lose nothing either).
- Migration (`vestige migrate --from sqlite --to postgres`) is a sensitive
  operation for users with months of memories -- it needs strong testing.
- HDBSCAN plus re-soft-assignment runs in O(n) over all embeddings. At 100k+
  memories this starts to matter; manageable but not free.

### Risks

- **Trait abstraction leaks**: a cognitive module might need backend-specific
  behavior (e.g., Postgres triggers for tsvector). Mitigation: keep such logic
  inside the backend impl; the trait stays pure. Escalation: if a module
  genuinely cannot express what it needs through the trait, the trait grows --
  the module does not bypass it.
- **Embedding model drift**: users on older fastembed versions silently produce
  slightly different vectors after a fastembed upgrade. Mitigation: model hash
  in the registry; refuse mismatched writes and surface a clear error.
- **Auth misconfiguration**: a user binds to `0.0.0.0` without setting
  `auth.enabled = true`. Mitigation: refuse to start with a non-localhost bind
  and auth disabled. Hard error, not a warning.
- **Re-clustering feedback loop**: dream consolidation proposes re-clusters,
  which the user accepts, which changes classifications, which affects future
  retrievals, which affect future dreams. Mitigation: cap re-cluster frequency
  (every 5th dream by default) and require explicit user acceptance of
  proposals.
- **Cross-domain spreading activation weight (0.5 default)**: an arbitrary
  choice that could be too aggressive or too lax. Mitigation: config-tunable;
  instrument retrieval-quality metrics in the dashboard so the user sees the
  impact.
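
The auth-misconfiguration mitigation (hard error on a non-localhost bind with auth disabled) amounts to a small startup guard. A minimal sketch, assuming a string bind address and a boolean auth flag -- the names `validate_bind` and `auth_enabled` are illustrative, not the actual config keys:

```rust
use std::net::SocketAddr;

/// Startup guard for the "auth misconfiguration" risk: hard-error when the
/// server would listen on a non-loopback address with auth disabled.
fn validate_bind(bind: &str, auth_enabled: bool) -> Result<SocketAddr, String> {
    let addr: SocketAddr = bind
        .parse()
        .map_err(|e| format!("invalid bind address {bind:?}: {e}"))?;
    if !addr.ip().is_loopback() && !auth_enabled {
        // Refuse to start: exposing the store without auth is never intended.
        return Err(format!(
            "refusing to bind {addr}: non-localhost bind requires auth.enabled = true"
        ));
    }
    Ok(addr)
}
```

Making this a hard error rather than a log line means a misconfigured server never comes up at all, which is the behavior the Risks section calls for.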

---

## Resolved Decisions (from Q&A)

| # | Question | Resolution |
|---|----------|------------|
| 1 | Trait granularity | Single `MemoryStore` trait |
| 2 | Embedding on insert | Caller provides; separate `Embedder` trait for pluggability |
| 3 | pgvector dimension | Fixed per install, derived from `Embedder::dimension()` at schema init |
| 4 | Federation sync | Defer the algorithm; store FSRS reviews as an append-only event log now |
| 5 | Dashboard auth | Shared API keys + signed session cookie |
| 6 | HDBSCAN `min_cluster_size` | Default 10; user reruns with `--min-cluster-size N`; no auto-sweep |
| 7 | Domain drift | Conservative -- always propose splits/merges, never auto-apply |
| 8 | Cross-domain spreading activation | Follow with decay factor 0.5 (tunable) |
| 9 | Assignment threshold | Static 0.65 default, config-tunable; raw `domain_scores` stored for introspection |

---

## Implementation Plan

Five phases, each independently shippable.

### Phase 1: Storage trait extraction

- Define `MemoryStore` and `Embedder` traits in `vestige-core`.
- Refactor `SqliteMemoryStore` to implement `MemoryStore`; no behavior change.
- Refactor `FastembedEmbedder` to implement `Embedder`.
- Add the `embedding_model` registry table; enforce consistency on write.
- Add `domains TEXT[]`-equivalent and `domain_scores` JSON columns to SQLite
  (empty for all existing rows).
- Convert all 29 cognitive modules to operate via the traits.
- **Acceptance**: existing test suite passes unchanged. Zero warnings.

### Phase 2: PostgreSQL backend

- `PgMemoryStore` with sqlx, pgvector, tsvector.
- sqlx migrations (`crates/vestige-core/migrations/postgres/`).
- Backend selection via the `vestige.toml` `[storage]` section.
- `vestige migrate --from sqlite --to postgres` command.
- `vestige migrate --reembed` command for model swaps.
- **Acceptance**: full test suite runs green against Postgres with a
  testcontainer.

### Phase 3: Network access

- Streamable HTTP MCP route on Axum (`POST /mcp`, `GET /mcp` for SSE).
- REST API under `/api/v1/`.
- API key table + blake3 hashing + `vestige keys create|list|revoke`.
- Auth middleware (Bearer, X-API-Key, session cookie).
- Refuse a non-localhost bind without auth enabled.
- **Acceptance**: an MCP client over HTTP works from a second machine; the
  dashboard login flow works; unauthenticated requests return 401.

### Phase 4: Emergent domain classification

- `DomainClassifier` module using the `hdbscan` crate.
- `vestige domains discover|list|rename|merge` CLI.
- Automatic soft-assignment pipeline (compute `domain_scores` on ingest,
  threshold into `domains`).
- Re-cluster every Nth dream consolidation (default 5); surface proposals in
  the dashboard.
- Context signals (git repo, IDE) as soft priors on classification.
- Cross-domain spreading activation with 0.5 decay.
- **Acceptance**: on a corpus of 500+ mixed memories, `discover` produces
  sensible clusters; search scoped to a domain returns tightly relevant
  results.

### Phase 5: Federation (future; explicitly out of scope for this ADR's acceptance)

- Node discovery (Mycelium / mDNS).
- Memory sync protocol over append-only review events and LWW-per-UUID for
  memory records.
- Explicit follow-up ADR before any code.

---

## Open Questions

None at ADR acceptance time. Follow-up items that are *implementation
choices*, not architectural:

- Precise cross-domain decay weight (start at 0.5, instrument, tune)
- Dashboard histogram of `domain_scores` (UX design detail)
- Whether to gate Postgres behind a Cargo feature flag (`postgres-backend`) or
  always compile it in (lean toward the feature flag to keep SQLite-only
  builds small)

docs/plans/0001-phase-1-storage-trait-extraction.md (1026 lines, new file; diff suppressed)
docs/plans/0002-phase-2-postgres-backend.md (1269 lines, new file; diff suppressed)
docs/plans/0003-phase-3-network-access.md (1435 lines, new file; diff suppressed)
docs/plans/0004-phase-4-emergent-domain-classification.md (883 lines, new file)

# Phase 4 Plan: Emergent Domain Classification

**Status**: Draft
**Depends on**: Phase 1 (domain columns on memories, `Domain` struct + `DomainStore` methods on `MemoryStore`, `Embedder` trait), Phase 2 (Postgres JSONB + TEXT[] support for domain fields, `embedding_model` registry parity), Phase 3 (Axum HTTP server, REST `/api/v1/` scaffolding, API key auth middleware, signed dashboard session cookies)
**Related**: docs/adr/0001-pluggable-storage-and-network-access.md (Phase 4), docs/prd/001-getting-centralized-vestige.md (Emergent Domain Model)

---

## Scope

### In scope

- `DomainClassifier` cognitive module under `crates/vestige-core/src/neuroscience/domain_classifier.rs`, alongside the existing neuroscience modules (spreading_activation, synaptic_tagging, ...).
- HDBSCAN discovery pipeline using the `hdbscan` crate (v0.10): load all embeddings, cluster, extract centroids, extract top terms via TF-IDF over cluster members, persist via the trait's `DomainStore` methods.
- Soft-assignment pipeline: for each memory, compute `cosine_similarity(memory.embedding, domain.centroid)` for every domain, store raw scores in `domain_scores` JSONB, and threshold into `domains[]` using `assign_threshold` (default 0.65).
- Automatic classification on ingest: runs through `CognitiveEngine` / `smart_ingest` so new memories are classified against existing centroids immediately; skipped when `domain_count == 0` (Phase 0 accumulation).
- Re-cluster hook in dream consolidation: every Nth four-phase dream cycle (N=5 default) triggers a discovery pass and generates proposals (split / merge / none). Proposals land in a new `domain_proposals` table, surface in the dashboard, and are never auto-applied (conservative drift, ADR Q7).
- Context signals: a `SignalSource` trait with `GitRepoSignal` (detects `.git` in the CWD or `metadata.cwd`) and `IdeHintSignal` (reads `metadata.editor` / `metadata.ide`). Each returns a `boost_map` of `domain_id -> additive delta` (typically +0.05), injected as a `signal_boost: Option<HashMap<String, f64>>` parameter into `DomainClassifier::classify`.
- Cross-domain spreading activation decay: `ActivationNetwork` traversal multiplies an edge's effective weight by `cross_domain_decay` (default 0.5) when `target.domains` and `source.domains` are disjoint. Strict "no overlap" policy, not graded.
- CLI subcommands (in `crates/vestige-mcp/src/bin/cli.rs`, under a new `Domains` command group): `list`, `discover [--min-cluster-size N] [--force]`, `rename <id> <new_label>`, `merge <a> <b> [--into <id>]`. Human-readable tables on stdout; JSON via `--json`.
- Dashboard UI additions (`apps/dashboard/src/routes/(app)/domains/`): list page, per-domain detail (memories, centroid top_terms, score histogram, proposal review controls).
- REST endpoints under `/api/v1/domains` (introduced as a skeleton in Phase 3, implemented in Phase 4): list, discover, rename, merge, proposal list / accept / reject.
- Config additions: a `[domains]` section in `vestige.toml` covering `assign_threshold`, `recluster_interval`, `min_cluster_size`, `cross_domain_decay`, `discovery_threshold`, `merge_threshold`, and `signal_boost` (per-signal toggle).
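
The strict "no overlap" decay rule from the scope list fits in a few lines. A minimal sketch, assuming plain `Vec<String>` domain sets; the free-function form and name are illustrative (the real logic lives inside `ActivationNetwork` traversal):

```rust
/// Strict "no overlap" cross-domain decay: the edge weight is multiplied by
/// `cross_domain_decay` (default 0.5) only when source and target share no
/// domain at all. Any shared domain leaves the weight untouched (not graded).
fn effective_weight(
    weight: f64,
    source_domains: &[String],
    target_domains: &[String],
    cross_domain_decay: f64,
) -> f64 {
    let overlaps = source_domains.iter().any(|d| target_domains.contains(d));
    if overlaps {
        weight
    } else {
        weight * cross_domain_decay
    }
}
```

Note that two memories with empty domain sets count as disjoint under this rule and get the decay; whether unclassified memories should be exempt is a Phase 4 implementation detail this sketch does not decide.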

### Out of scope

- Phase 5 federation (explicit separate ADR). Domain centroids are installation-local; no sync.
- Learned re-weighting of domain scores (future, only if retrieval-quality metrics show a need).
- Interactive cluster-membership editing in the UI (drag-and-drop reassign) -- future enhancement.
- Multi-user domain namespaces. One domain set per installation; API keys that carry a `domain_filter` just restrict access, they do not create namespaces.
- Auto-sweep of `min_cluster_size` / auto-tuned `assign_threshold` (ADR resolutions Q6 + Q9: static defaults, user tunes).
- Graded cross-domain decay (`|A intersect B| / max(|A|, |B|)`) -- strict "no overlap" is the Phase 4 rule.

---

## Prerequisites

Artifacts that Phases 1-3 are expected to have landed:

- In `vestige-core`:
  - `Embedder` trait (`crates/vestige-core/src/embedder/`).
  - `MemoryStore` trait (`crates/vestige-core/src/storage/trait.rs` or similar) including the `DomainStore` methods: `list_domains`, `get_domain`, `upsert_domain`, `delete_domain`, `classify(&[f32]) -> Vec<(String, f64)>`, plus a bulk accessor such as `all_embeddings()` (already present in sqlite.rs as `get_all_embeddings`) and a `get_all_memories_with_embeddings()` iterator for discovery. The trait must expose a method to batch-update `(domains, domain_scores)` for a memory id.
  - `Domain` struct: `{ id: String, label: String, centroid: Vec<f32>, top_terms: Vec<String>, memory_count: usize, created_at: DateTime<Utc> }`.
  - Columns on memories in both SQLite and Postgres: `domains TEXT[]` (or a JSON array on SQLite) and `domain_scores JSONB` (or TEXT JSON on SQLite).
  - The `domains` table in both backends (see the PRD schema sketch).
- In `vestige-mcp`:
  - Axum `/api/v1/` router prefix with auth middleware.
  - CLI skeleton (`bin/cli.rs`) using `clap`; Phase 4 adds a `Domains` subcommand tree.
  - REST handler file structure ready under `crates/vestige-mcp/src/dashboard/handlers.rs` (legacy) and a dedicated REST handler under `/api/v1/`; Phase 4 adds a `domains.rs` handler module.
- SvelteKit dashboard (`apps/dashboard/`) with existing `(app)/memories`, `(app)/timeline`, `(app)/stats`, etc. Phase 4 adds `(app)/domains/`.

New workspace crate additions required (added manually to `Cargo.toml`, since `cargo add` is not run from the plan):

- `hdbscan = "0.10"` in `crates/vestige-core/Cargo.toml` (feature-gated behind `domain-classification`).
- Optional: a lightweight inline stop-word constant; no external stop-word crate -- the neuroscience modules already tokenize on whitespace + length > 3 (see `dreams.rs::content_similarity`). Reuse that style. No `ndarray` is needed because `hdbscan` v0.10 accepts `&[Vec<f32>]` directly (verified from the PRD snippet).
- No new deps in `vestige-mcp` for Phase 4 -- the CLI reuses `clap` / `colored` / `comfy-table` if already present, otherwise a hand-rolled padded print. We pick hand-rolled output to avoid adding a table crate; this matches the existing style of `run_stats` in `cli.rs`.
Test fixtures:
|
||||
|
||||
- A JSON seed corpus checked into `tests/phase_4/fixtures/seed_500.json` containing >= 500 memories drawn from three plausible clusters. A builder function `tests/phase_4/support/fixtures.rs::build_seed_corpus()` deterministically generates or loads this corpus. Each record has `content`, `tags`, `embedding` (768D bge-base-en-v1.5; use a committed vector or a deterministic mock embedder in tests). For deterministic tests we fake embeddings by hashing content -- acceptable as long as the fake preserves cluster separability (prefix-based: "DEV-...", "INFRA-...", "HOME-..." seeds three Gaussian blobs).
|
||||
- Reuse `Embedder` mock from Phase 1 tests (`MockEmbedder`) for discovery tests that need real cosine similarity.
|
||||
- A minimal git-repo fixture created in a tempdir (`tempfile::tempdir` + `std::process::Command::new("git").arg("init")`) for context-signal tests.
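
The prefix-seeded fake embedder described above can be sketched as "cluster center from the prefix hash, plus small per-content jitter". Everything here is a hypothetical test helper, not repo code; the dimension, jitter scale, and the `'-'`-split prefix rule are illustrative assumptions:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fake embedding: the prefix ("DEV", "INFRA", "HOME") seeds a cluster
/// center, and the full content adds small jitter, so fake vectors stay
/// cluster-separable for deterministic tests.
fn fake_embedding(content: &str, dim: usize) -> Vec<f32> {
    let prefix = content.split('-').next().unwrap_or(content);
    let center = pseudo_vec(prefix, dim, 1.0);
    let jitter = pseudo_vec(content, dim, 0.05); // small relative to the center
    center.iter().zip(&jitter).map(|(c, j)| c + j).collect()
}

/// Deterministic pseudo-random vector in [-scale, scale), seeded by a hash.
fn pseudo_vec(seed_text: &str, dim: usize, scale: f32) -> Vec<f32> {
    let mut h = DefaultHasher::new();
    seed_text.hash(&mut h);
    let mut s = h.finish();
    (0..dim)
        .map(|_| {
            s = s.wrapping_mul(6364136223846793005).wrapping_add(1);
            (((s >> 40) as f32 / (1u64 << 23) as f32) - 1.0) * scale
        })
        .collect()
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}
```

Because the jitter norm is tiny compared to the center norm, two same-prefix contents end up nearly parallel while cross-prefix contents are nearly orthogonal -- exactly the separability property the fixture bullet requires.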

---

## Deliverables

1. `DomainClassifier` cognitive module: struct, defaults, `classify`, `classify_with_boost`, `reassign_all`, `discover`.
2. `domain_terms` helper (TF-IDF over cluster members, returning `top_k` terms).
3. `cli domains discover` subcommand.
4. `cli domains list` / `rename` / `merge` subcommands.
5. Auto-classify hook on ingest (wired into the cognitive engine's ingest pipeline before persistence).
6. Re-cluster hook in dream consolidation (the `DreamEngine::run` orchestrator gets an optional `DomainReClusterHook`; triggers every Nth dream).
7. Context signal extractor module (`crates/vestige-core/src/neuroscience/context_signals.rs`) with the `SignalSource` trait + `GitRepoSignal` + `IdeHintSignal`.
8. Cross-domain spreading activation decay in `ActivationNetwork::activate` (config-driven).
9. `vestige.toml` `[domains]` section + defaults loader.
10. Dashboard UI: SvelteKit routes `(app)/domains/+page.svelte` (list), `(app)/domains/[id]/+page.svelte` (detail), `(app)/domains/proposals/+page.svelte` (review).
11. REST endpoints under `/api/v1/domains` + `/api/v1/domains/proposals`.
12. `domain_proposals` table + migration + `DomainProposal` trait methods on `MemoryStore`.
13. WebSocket event `VestigeEvent::DomainProposalCreated` so the dashboard gets a live notification after a re-cluster fires.

---

## Detailed Task Breakdown

### 1. `DomainClassifier` cognitive module

**File**: `crates/vestige-core/src/neuroscience/domain_classifier.rs`
**Export**: in `crates/vestige-core/src/neuroscience/mod.rs`, add `pub mod domain_classifier;` and re-export `pub use domain_classifier::{DomainClassifier, ClassificationResult, DomainProposal, ProposalKind};`
**Deps**: `hdbscan = "0.10"`, `serde`, `serde_json`, `chrono`, `tracing`, plus the existing `crate::storage::Domain` and `crate::storage::MemoryStore` trait.

Struct and defaults (matching the PRD exactly):

```rust
pub struct DomainClassifier {
    pub assign_threshold: f64,      // default 0.65
    pub discovery_threshold: usize, // default 150
    pub recluster_interval: usize,  // default 5 (every 5th dream)
    pub min_cluster_size: usize,    // default 10
    pub min_samples: usize,         // default 5 (HDBSCAN)
    pub cross_domain_decay: f64,    // default 0.5
    pub merge_threshold: f64,       // default 0.90 (centroid cosine)
    pub top_terms_k: usize,         // default 10
}

impl Default for DomainClassifier { ... }
```

Result types:

```rust
#[derive(Debug, Clone)]
pub struct ClassificationResult {
    pub scores: HashMap<String, f64>, // raw per-domain similarities
    pub domains: Vec<String>,         // above assign_threshold
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProposalKind {
    Split { parent: String, children: Vec<String> },
    Merge { targets: Vec<String>, suggested_label: String },
    NewCluster { top_terms: Vec<String> },
}

#[derive(Debug, Clone)]
pub struct DomainProposal {
    pub id: String, // uuid v4
    pub kind: ProposalKind,
    pub rationale: String,
    pub confidence: f64,
    pub created_at: DateTime<Utc>,
    pub status: ProposalStatus, // Pending | Accepted | Rejected
}
```

Key methods (all pure where possible; all pub):

```rust
impl DomainClassifier {
    pub fn classify(&self, embedding: &[f32], domains: &[Domain]) -> ClassificationResult;

    pub fn classify_with_boost(
        &self,
        embedding: &[f32],
        domains: &[Domain],
        boost: Option<&HashMap<String, f64>>,
    ) -> ClassificationResult;

    pub async fn reassign_all(
        &self,
        store: &dyn MemoryStore,
        domains: &[Domain],
    ) -> Result<usize, StorageError>;

    pub async fn discover(
        &self,
        store: &dyn MemoryStore,
    ) -> Result<Vec<Domain>, StorageError>;

    pub async fn propose_changes(
        &self,
        store: &dyn MemoryStore,
        existing: &[Domain],
        newly_discovered: &[Domain],
    ) -> Result<Vec<DomainProposal>, StorageError>;

    pub async fn apply_proposal(
        &self,
        store: &dyn MemoryStore,
        proposal: &DomainProposal,
    ) -> Result<(), StorageError>;
}
```

Behavior notes:
- `classify` returns an empty `{ scores: {}, domains: [] }` iff `domains.is_empty()` (accumulation phase). This matches the PRD snippet verbatim.
- `classify_with_boost` adds the boost delta to each score AFTER cosine, before thresholding, and clamps the result to `[0.0, 1.0]`. Boost keys not present in `domains` are ignored.
- `reassign_all` streams memories in batches of 500 (an iterator on the store) to keep memory bounded; for each memory it issues a single `UPDATE memories SET domains = ?, domain_scores = ? WHERE id = ?` call. Returns the count of memories whose `domains` vector actually changed.
- `discover` loads all `(id, embedding)` pairs via an `all_embeddings()` method on the store (this exists under `#[cfg(all(feature = "embeddings", feature = "vector-search"))]` in `sqlite.rs::get_all_embeddings`; Phase 1 should promote it onto the trait -- if not yet promoted, add the method). Then:
  1. Build a `Vec<Vec<f32>>` and an index -> id map.
  2. `Hdbscan::default_hyper_params(&embeddings).min_cluster_size(self.min_cluster_size).min_samples(self.min_samples).build()` (the exact builder depends on the hdbscan 0.10 surface; see Open Question).
  3. `let labels = clusterer.cluster()?;`
  4. `let centers = clusterer.calc_centers(Center::Centroid, &labels)?;`
  5. Group indices by label, ignoring -1 (noise). For each cluster compute `top_terms` via `compute_top_terms`.
  6. Preserve stable IDs where possible: match each new cluster centroid to the closest existing domain by cosine; if similarity > 0.85, reuse the existing domain id + label. Otherwise generate a fresh id `cluster_{n}` with a label derived from the first 2 terms.
  7. Upsert all resulting `Domain`s via the store.
- `propose_changes` compares old vs new clusters:
  - **Split**: an old domain that best-matches two or more new domains, each with >= `min_cluster_size` members. Rationale: "domain `dev` is now 2 clusters of >= 10 memories: `systems` and `networking`".
  - **Merge**: two old domains whose centroids now satisfy `cosine > merge_threshold` get a merge proposal.
  - **NewCluster**: a new cluster that doesn't match any old domain above 0.85 similarity.
- `apply_proposal` runs the split or merge against the store (reassigning memberships via `reassign_all`), then marks the proposal `Accepted`. It never runs automatically -- only via the CLI or dashboard.
|
||||
|
||||
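The scoring path described in the first two notes can be sketched as follows (a minimal std-only illustration; the `Domain` and `ClassificationResult` shapes here are simplified stand-ins for the vestige-core types):

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real vestige-core types.
struct Domain { id: String, centroid: Vec<f32> }
struct ClassificationResult { scores: HashMap<String, f64>, domains: Vec<String> }

fn cosine(a: &[f32], b: &[f32]) -> f64 {
    if a.len() != b.len() { return 0.0; }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (x, y) in a.iter().zip(b) {
        dot += (*x as f64) * (*y as f64);
        na += (*x as f64) * (*x as f64);
        nb += (*y as f64) * (*y as f64);
    }
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na.sqrt() * nb.sqrt()) }
}

fn classify_with_boost(
    embedding: &[f32],
    domains: &[Domain],
    boost: Option<&HashMap<String, f64>>,
    assign_threshold: f64,
) -> ClassificationResult {
    // Accumulation phase: no domains yet, classification is a no-op.
    if domains.is_empty() {
        return ClassificationResult { scores: HashMap::new(), domains: Vec::new() };
    }
    let mut scores = HashMap::new();
    let mut assigned = Vec::new();
    for d in domains {
        let mut s = cosine(embedding, &d.centroid);
        // Boost is applied after cosine, before thresholding; clamp to [0, 1].
        if let Some(b) = boost {
            s = (s + b.get(&d.id).copied().unwrap_or(0.0)).clamp(0.0, 1.0);
        }
        if s >= assign_threshold { assigned.push(d.id.clone()); }
        scores.insert(d.id.clone(), s);
    }
    ClassificationResult { scores, domains: assigned }
}
```
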
Helper:

```rust
fn compute_top_terms(documents: &[&str], k: usize) -> Vec<String>;
```

Uses TF-IDF with IDF computed over the entire passed-in corpus (the `documents` slice); tokenization: lowercase, split on non-alphanumeric, drop tokens shorter than 4 chars, and drop a small built-in stop-word list (`the`, `and`, `for`, `that`, `with`, ...). This matches the tokenizer used in `dreams.rs::content_similarity` and `dreams.rs::extract_patterns`, so behavior is predictable.

Cosine similarity helper:

```rust
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64;
```

Keep the existing crate-level `cosine_similarity` if one is already present (check `embeddings::` or `search::`); otherwise add a private one. Returns 0.0 on dimension mismatch; a panic would be a bug.

### 2. Top-terms computation helper

**File**: same module, private section.

- `fn tokenize(text: &str) -> Vec<String>`: lowercase, split on non-alphanumeric, filter len >= 4, drop stop-words.
- `fn tfidf_top_k(docs: &[&str], k: usize) -> Vec<String>`:
  1. `tf[doc_idx][term] = count / total_terms`.
  2. `df[term] = docs containing term`.
  3. `idf[term] = log((N + 1) / (df[term] + 1)) + 1` (smoothed).
  4. For each term, average `tf` across docs in the cluster; multiply by `idf`; sort desc; return top `k`.

Cluster top-terms are computed over cluster members only, with IDF over the **whole corpus** (all memory contents), not the cluster, so common words get penalized globally. Recompute global IDF once per `discover` call.

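The tokenize + TF-IDF steps above can be sketched as follows (a minimal std-only illustration with a tiny stop-word list; in the real helper the IDF corpus is the full memory corpus, whereas this sketch uses the passed-in `docs` as the corpus):

```rust
use std::collections::{HashMap, HashSet};

const STOP_WORDS: &[&str] = &["the", "and", "for", "that", "with", "this", "from"];

fn tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| t.len() >= 4 && !STOP_WORDS.contains(t))
        .map(|t| t.to_string())
        .collect()
}

fn tfidf_top_k(docs: &[&str], k: usize) -> Vec<String> {
    let tokenized: Vec<Vec<String>> = docs.iter().map(|&d| tokenize(d)).collect();
    let n = tokenized.len() as f64;
    // df[term] = number of docs containing the term
    let mut df: HashMap<&str, f64> = HashMap::new();
    for doc in &tokenized {
        for term in doc.iter().map(String::as_str).collect::<HashSet<_>>() {
            *df.entry(term).or_insert(0.0) += 1.0;
        }
    }
    // score[term] = mean tf across docs * smoothed idf
    let mut score: HashMap<&str, f64> = HashMap::new();
    for doc in &tokenized {
        let total = doc.len().max(1) as f64;
        let mut tf: HashMap<&str, f64> = HashMap::new();
        for term in doc {
            *tf.entry(term.as_str()).or_insert(0.0) += 1.0;
        }
        for (term, count) in tf {
            let idf = ((n + 1.0) / (df[term] + 1.0)).ln() + 1.0;
            *score.entry(term).or_insert(0.0) += (count / total) / n * idf;
        }
    }
    let mut ranked: Vec<(&str, f64)> = score.into_iter().collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked.into_iter().take(k).map(|(t, _)| t.to_string()).collect()
}
```
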
### 3. CLI subcommand: `vestige domains discover`

**File**: `crates/vestige-mcp/src/bin/cli.rs`

Add to `enum Commands`:

```rust
/// Emergent domain management
Domains {
    #[command(subcommand)]
    action: DomainAction,
},
```

```rust
#[derive(clap::Subcommand)]
enum DomainAction {
    /// List all discovered domains
    List {
        #[arg(long)] json: bool,
    },
    /// Run HDBSCAN discovery on all embeddings and propose domains
    Discover {
        #[arg(long, default_value_t = 10)] min_cluster_size: usize,
        /// Skip the proposal flow and write new domains directly (first-time use)
        #[arg(long)] force: bool,
        #[arg(long)] json: bool,
    },
    /// Rename a domain (by id)
    Rename {
        id: String,
        new_label: String,
    },
    /// Merge two domains
    Merge {
        a: String,
        b: String,
        #[arg(long)] into: Option<String>, // default: `a`
    },
}
```

Handler plumbing lives in `run_domains(action)`, dispatching to `run_domains_list`, `run_domains_discover`, `run_domains_rename`, `run_domains_merge`. Each opens the default `Storage`, constructs a `DomainClassifier::default()`, and invokes the appropriate method.

Output format for `list`:

```
ID      LABEL            MEMORIES  TOP TERMS
dev     Development      87        rust, trait, async, tokio, zinit
infra   Infrastructure   47        bgp, sonic, vlan, frr, peering
home    Home             31        solar, kwh, battery, pool, esphome
        (unclassified)   12
```

Produced via plain `print!` with fixed-width column padding (the Rust equivalent of `%-15s %-18s %-10d %s`, i.e. `{:<15}{:<18}{:<10}{}` format specs). `--json` emits `serde_json::to_string_pretty(&domains)`.

Output format for `discover` with `--force`:

```
HDBSCAN: 500 embeddings, min_cluster_size=10, min_samples=5
Found 3 clusters (ignoring 14 noise points)
  cluster_0 (N=47)  top: bgp, sonic, vlan, frr, peering
  cluster_1 (N=31)  top: solar, kwh, battery, pool, esphome
  cluster_2 (N=22)  top: rust, trait, async, tokio, zinit

Writing 3 domains to the store...
Soft-assigning 500 memories against centroids...
  multi-domain: 43
  single-domain: 412
  unclassified (below threshold 0.65): 45
Done in 7.4s.
```

Output format for `discover` without `--force` (post-Phase-0):

```
HDBSCAN: 623 embeddings, min_cluster_size=10
Comparing to existing 3 domains...

Proposals (pending, accept via dashboard or `vestige domains proposals`):
  [split] dev -> (systems:34, networking:28)   confidence 0.82
  [new]   cluster_5 (books, novels, reading)   confidence 0.71

Run `vestige domains proposals` to review, or open the dashboard.
```

### 4. CLI: `list`, `rename`, `merge`

- `list`: calls `store.list_domains()`, fetches the unclassified count via `store.count_memories_without_domains()` (Phase 1 should have provided this; if not, Phase 4 adds it to the trait and both backends).
- `rename`: `store.get_domain(id)` -> mutate `label` -> `store.upsert_domain`. No memory touched.
- `merge`: load both, compute a blended centroid (weighted by `memory_count`), merge `top_terms` (union; recompute the TF-IDF rank if both sides share the corpus), delete the non-`into` domain, call `reassign_all`. Wrapped in a transaction on Postgres; on SQLite rely on the existing writer-lock pattern.

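The blended-centroid step in `merge` can be sketched as a count-weighted average (illustrative only; weighting by `memory_count` follows the description above):

```rust
// Blend two centroids, weighting each by its domain's memory count.
fn blend_centroids(a: &[f32], count_a: usize, b: &[f32], count_b: usize) -> Vec<f32> {
    let total = (count_a + count_b).max(1) as f32;
    a.iter()
        .zip(b)
        .map(|(x, y)| (x * count_a as f32 + y * count_b as f32) / total)
        .collect()
}
```
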
### 5. Auto-classify on ingest

**File**: `crates/vestige-core/src/cognitive.rs` (or the equivalent ingest entry in `vestige-mcp/src/tools/smart_ingest.rs`).

Integration point: just before the record is persisted in the smart-ingest path, after the embedder has produced `embedding` and before `storage.insert(...)`. Trace the current call site -- today `Storage::ingest(IngestInput)` computes the embedding inside storage; in Phase 1 the embedder becomes external (ADR decision Q2), so classification can hook in right there in the cognitive engine.

Pseudocode:

```rust
let embedding = embedder.embed(&input.content).await?;
let domains = store.list_domains().await?;

let (domains_assigned, domain_scores) = if domains.is_empty() {
    (Vec::new(), HashMap::new())
} else {
    let boost = context_signals.gather_boost(&input.metadata, &domains);
    let result = classifier.classify_with_boost(&embedding, &domains, boost.as_ref());
    (result.domains, result.scores)
};

record.embedding = Some(embedding);
record.domains = domains_assigned;
record.domain_scores = domain_scores;
store.insert(&record).await?;
```

Edge cases:

- Accumulation phase (`domains.is_empty()`): skip classification entirely. Zero overhead.
- Embedding failed / skipped: leave `domains = []`, `domain_scores = {}`. Never fail ingest because of classification.
- Metric: emit `VestigeEvent::MemoryClassified { id, domains, top_score }` on the WebSocket bus so the dashboard sees it live.

### 6. Re-cluster hook in dream consolidation

**File**: `crates/vestige-core/src/advanced/dreams.rs` (a long file; the 1131-line `dream()` entry on the `MemoryDreamer` impl) plus `crates/vestige-core/src/consolidation/phases.rs` (the `DreamEngine::run` orchestrator).

Design: `DreamEngine::run(...)` returns `FourPhaseDreamResult`. It does not currently know how many times it has run. Phase 4 introduces a persistent counter on disk (a `dream_cycle_count` column on a new singleton `system_state` table, or a simple row in the existing `metadata` / `embedding_model` registry). After the Integration phase finishes, the cognitive engine increments the counter and, if `counter % recluster_interval == 0`, launches discovery asynchronously.

Extension struct in `phases.rs`:

```rust
pub struct DreamReClusterHook<'a> {
    pub classifier: &'a DomainClassifier,
    pub store: &'a dyn MemoryStore,
    pub event_tx: Option<&'a tokio::sync::mpsc::UnboundedSender<VestigeEvent>>,
}

impl<'a> DreamReClusterHook<'a> {
    pub async fn tick(&self, cycle_count: usize) -> Result<Vec<DomainProposal>, StorageError> {
        if cycle_count == 0 || cycle_count % self.classifier.recluster_interval != 0 {
            return Ok(vec![]);
        }
        let existing = self.store.list_domains().await?;
        let rediscovered = self.classifier.discover(self.store).await?;
        let proposals = self
            .classifier
            .propose_changes(self.store, &existing, &rediscovered)
            .await?;
        for p in &proposals {
            self.store.insert_domain_proposal(p).await?;
            if let Some(tx) = self.event_tx {
                let _ = tx.send(VestigeEvent::DomainProposalCreated {
                    id: p.id.clone(),
                    kind: format!("{:?}", p.kind),
                    confidence: p.confidence,
                    timestamp: Utc::now(),
                });
            }
        }
        Ok(proposals)
    }
}
```

The caller wires `tick()` in after `DreamEngine::run()` returns, at the ingest/consolidation orchestrator level. The hook never mutates existing domains -- it only writes proposals. The acceptance path is manual (CLI or dashboard).

Counter storage: add a method `store.bump_dream_cycle_count() -> Result<usize>` returning the new count. Single-row table:

```sql
CREATE TABLE IF NOT EXISTS system_state (
    key   TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
-- seed: ('dream_cycle_count', '0')
```

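One possible shape for `bump_dream_cycle_count` on the SQLite side, assuming the singleton row seeded above (note: `RETURNING` requires SQLite >= 3.35; older targets can run the `UPDATE` and a follow-up `SELECT` in one transaction, and the Postgres variant can use `RETURNING` directly):

```sql
UPDATE system_state
SET value = CAST(CAST(value AS INTEGER) + 1 AS TEXT)
WHERE key = 'dream_cycle_count'
RETURNING CAST(value AS INTEGER) AS dream_cycle_count;
```
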
### 7. Context signal extractor

**File**: `crates/vestige-core/src/neuroscience/context_signals.rs`

```rust
pub trait SignalSource: Send + Sync {
    /// Returns domain_id -> additive boost (positive or negative, typically in [-0.1, +0.1]).
    fn boost_map(
        &self,
        input_metadata: &serde_json::Value,
        domains: &[Domain],
    ) -> HashMap<String, f64>;

    fn name(&self) -> &'static str;
}

pub struct GitRepoSignal {
    pub boost: f64, // default +0.05
}

pub struct IdeHintSignal {
    pub boost: f64,
}

pub struct ContextSignals {
    sources: Vec<Box<dyn SignalSource>>,
}

impl ContextSignals {
    pub fn gather_boost(
        &self,
        input_metadata: &serde_json::Value,
        domains: &[Domain],
    ) -> Option<HashMap<String, f64>>;
}
```

Signal encoding convention (document in the module header):

- A signal is a **soft prior**. It nudges the post-cosine score by a small additive delta, clamped to `[-0.10, +0.10]` per signal.
- Multiple signals sum, then the final boost per domain is clamped to `[-0.15, +0.15]` so signals cannot by themselves push a memory into or out of a domain; the embedding similarity dominates.
- Signals target domains by heuristic: `GitRepoSignal` boosts any domain whose `top_terms` overlaps `{"rust","async","trait","function","class","def","git","commit","fn","code"}`. `IdeHintSignal` does the same for `{"file","line","editor","vscode","neovim","rust-analyzer","lsp"}`.
- All signal boosts are logged via `tracing::debug!` so users can audit why a memory picked up a domain.

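The summing-and-clamping convention can be sketched as follows (a minimal illustration of the two clamp stages; the real `gather_boost` additionally walks its `SignalSource` list to produce the per-signal maps):

```rust
use std::collections::HashMap;

// Combine per-signal boost maps: clamp each signal's delta to [-0.10, +0.10],
// sum per domain, then clamp the final per-domain boost to [-0.15, +0.15].
fn combine_boosts(maps: &[HashMap<String, f64>]) -> HashMap<String, f64> {
    let mut out: HashMap<String, f64> = HashMap::new();
    for m in maps {
        for (domain, delta) in m {
            let delta = delta.clamp(-0.10, 0.10); // per-signal clamp
            *out.entry(domain.clone()).or_insert(0.0) += delta;
        }
    }
    for v in out.values_mut() {
        *v = v.clamp(-0.15, 0.15); // final per-domain clamp
    }
    out
}
```
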
`GitRepoSignal::boost_map` implementation:

```rust
fn boost_map(&self, meta: &Value, domains: &[Domain]) -> HashMap<String, f64> {
    let is_git = meta.get("cwd")
        .and_then(|v| v.as_str())
        .map(|cwd| std::path::Path::new(cwd).join(".git").exists())
        .unwrap_or(false)
        || meta.get("git_repo").is_some();
    if !is_git { return HashMap::new(); }
    let mut out = HashMap::new();
    for d in domains {
        let code_hits = d.top_terms.iter()
            .filter(|t| CODE_TERMS.contains(t.as_str()))
            .count();
        if code_hits > 0 { out.insert(d.id.clone(), self.boost); }
    }
    out
}
```

Config knobs in `[domains.signals]`: `git = true`, `ide = true`, `git_boost = 0.05`, `ide_boost = 0.05`.

### 8. Cross-domain spreading activation decay

**File**: `crates/vestige-core/src/neuroscience/spreading_activation.rs`

Modify `ActivationConfig`:

```rust
pub struct ActivationConfig {
    pub decay_factor: f64,
    pub max_hops: u32,
    pub min_threshold: f64,
    pub allow_cycles: bool,
    pub cross_domain_decay: f64, // NEW, default 0.5
}
```

Domain metadata on nodes: the current `ActivationNode` has `id`, `activation`, `last_activated`, `edges: Vec<String>`. Phase 4 adds `pub domains: Vec<String>`, populated when nodes are added (propagated from the memory's `domains` field). The network is rebuilt on each search from the store; if the in-memory network is persisted (check the `ActivationNetwork` lifetime in `CognitiveEngine`), the population happens in the engine at boot and on insert.

Traversal change, in the `ActivationNetwork::activate` loop, replacing the single line `let propagated = current_activation * edge.strength * self.config.decay_factor;`:

```rust
let cross_penalty = {
    let src_doms = self.nodes.get(&current_id).map(|n| &n.domains);
    let tgt_doms = self.nodes.get(&target_id).map(|n| &n.domains);
    match (src_doms, tgt_doms) {
        (Some(s), Some(t)) if !s.is_empty() && !t.is_empty() => {
            let overlap = s.iter().any(|d| t.contains(d));
            if overlap { 1.0 } else { self.config.cross_domain_decay }
        }
        _ => 1.0, // unclassified on either side: no penalty
    }
};
let propagated = current_activation * edge.strength * self.config.decay_factor * cross_penalty;
```

Rationale for "unclassified -> no penalty": unclassified memories are Phase-0 or low-confidence corpus members; penalizing them would block useful cross-pollination during the accumulation ramp.

API to update a node's domains after reclassification:

```rust
pub fn set_node_domains(&mut self, id: &str, domains: Vec<String>);
```

Called by the reassignment pipeline after `reassign_all`.

### 9. `vestige.toml` `[domains]` section

**File**: wherever `vestige.toml` is loaded (search for the `[storage]` / `[server]` loaders). Add:

```toml
[domains]
assign_threshold = 0.65
discovery_threshold = 150
recluster_interval = 5
min_cluster_size = 10
min_samples = 5
cross_domain_decay = 0.5
merge_threshold = 0.90
top_terms_k = 10

[domains.signals]
git = true
ide = true
git_boost = 0.05
ide_boost = 0.05
```

Rust-side: a `DomainsConfig { ... }` struct with `serde(default)` so a `vestige.toml` without a `[domains]` section falls back to hard-coded defaults. `DomainClassifier::from_config(cfg: &DomainsConfig) -> Self`.

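A sketch of that config struct with the defaults from the TOML above (field names mirror the TOML keys; the real version additionally derives `serde::Deserialize` with `#[serde(default)]`, omitted here to keep the sketch std-only):

```rust
#[derive(Debug, Clone)]
pub struct DomainsConfig {
    pub assign_threshold: f64,
    pub discovery_threshold: usize,
    pub recluster_interval: usize,
    pub min_cluster_size: usize,
    pub min_samples: usize,
    pub cross_domain_decay: f64,
    pub merge_threshold: f64,
    pub top_terms_k: usize,
}

// Hard-coded fallbacks used when vestige.toml has no [domains] section.
impl Default for DomainsConfig {
    fn default() -> Self {
        Self {
            assign_threshold: 0.65,
            discovery_threshold: 150,
            recluster_interval: 5,
            min_cluster_size: 10,
            min_samples: 5,
            cross_domain_decay: 0.5,
            merge_threshold: 0.90,
            top_terms_k: 10,
        }
    }
}
```
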
### 10. Dashboard UI additions

**SvelteKit routes** (`apps/dashboard/src/routes/(app)/domains/`):

- `+page.svelte` (list): fetches `GET /api/v1/domains` and `GET /api/v1/domains/unclassified-count`. Renders a table: `label`, `memories`, `top_terms` chips, `created_at`. Each row links to `/domains/[id]`. A "Discover" button posts `POST /api/v1/domains/discover`.
- `[id]/+page.svelte` (detail): fetches `GET /api/v1/domains/:id`, `GET /api/v1/domains/:id/memories?limit=100`, `GET /api/v1/domains/:id/score-histogram`. Renders:
  - Header: label (editable, triggers `PUT /api/v1/domains/:id`), top-terms chips, memory count, created_at.
  - Histogram: a vertical bar chart of `domain_scores[:id]` buckets 0-0.1, 0.1-0.2, ..., 0.9-1.0 across all memories. Data source: the server precomputes buckets so the client does not need to fetch all scores.
  - Memory list: paginated; each row shows the raw score for this domain.
- `proposals/+page.svelte`: fetches `GET /api/v1/domains/proposals?status=pending`. Each pending proposal card shows `kind`, `rationale`, `confidence`, `created_at`, and buttons "Accept" (posts `POST /api/v1/domains/proposals/:id/accept`) and "Reject" (`POST .../reject`). Live updates arrive via the existing WebSocket channel (`/ws`), reacting to `DomainProposalCreated` events.

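The server-side precomputation for the score histogram can be sketched as a fixed ten-bucket fold (illustrative; bucket boundaries follow the 0-0.1 ... 0.9-1.0 scheme above):

```rust
// Fold raw per-memory scores for one domain into ten fixed buckets.
fn score_histogram(scores: &[f64]) -> [u32; 10] {
    let mut buckets = [0u32; 10];
    for &s in scores {
        // min(9) so a score of exactly 1.0 lands in the last bucket.
        let idx = ((s * 10.0) as usize).min(9);
        buckets[idx] += 1;
    }
    buckets
}
```
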
Styling reuses the existing Tailwind + shadcn-svelte conventions in `apps/dashboard/src/lib/components/`.

The existing `(app)/stats` and `(app)/feed` pages get a small "Domains" summary panel that links to `/domains`.

### 11. REST endpoints

**File**: `crates/vestige-mcp/src/protocol/http.rs` or a new `crates/vestige-mcp/src/api/domains.rs` module, wired into the `/api/v1/` router.

| Method | Path | Handler |
|--------|------|---------|
| GET    | `/api/v1/domains` | `list_domains` -- returns `[Domain...]` + unclassified count |
| POST   | `/api/v1/domains/discover` | `trigger_discover` -- body `{ min_cluster_size?: usize, force?: bool }`, returns proposals or applied domains |
| GET    | `/api/v1/domains/:id` | `get_domain` |
| PUT    | `/api/v1/domains/:id` | `update_domain` -- rename |
| DELETE | `/api/v1/domains/:id` | `delete_domain` -- with `?merge_into=other_id` |
| GET    | `/api/v1/domains/:id/memories` | paginated memories in this domain |
| GET    | `/api/v1/domains/:id/score-histogram` | precomputed buckets |
| GET    | `/api/v1/domains/proposals` | `list_proposals?status=pending` |
| POST   | `/api/v1/domains/proposals/:id/accept` | `accept_proposal` |
| POST   | `/api/v1/domains/proposals/:id/reject` | `reject_proposal` |

All handlers go through the Phase 3 auth middleware (Bearer / X-API-Key / session cookie). Responses are JSON; error paths use `StatusCode::*` with a small `{"error": "..."}` body.

### 12. `domain_proposals` table + trait methods

Postgres migration (`crates/vestige-core/migrations/postgres/00XX_domain_proposals.sql`):

```sql
CREATE TABLE domain_proposals (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    kind        TEXT NOT NULL,               -- 'split' | 'merge' | 'new_cluster'
    payload     JSONB NOT NULL,              -- serialized ProposalKind body
    rationale   TEXT NOT NULL,
    confidence  DOUBLE PRECISION NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending',  -- pending|accepted|rejected
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    resolved_at TIMESTAMPTZ
);
CREATE INDEX idx_domain_proposals_status ON domain_proposals (status, created_at DESC);
```

SQLite migration: same table, with `UUID` -> `TEXT`, `JSONB` -> `TEXT` (JSON-encoded bodies), and `TIMESTAMPTZ` -> `TEXT` ISO-8601.

`MemoryStore` trait additions:

```rust
async fn insert_domain_proposal(&self, p: &DomainProposal) -> Result<()>;
async fn list_domain_proposals(&self, status: Option<&str>) -> Result<Vec<DomainProposal>>;
async fn get_domain_proposal(&self, id: &str) -> Result<Option<DomainProposal>>;
async fn set_proposal_status(&self, id: &str, status: &str) -> Result<()>;
```

### 13. WebSocket event for proposals

**File**: `crates/vestige-mcp/src/dashboard/events.rs`

Add variants:

```rust
pub enum VestigeEvent {
    // ... existing ...
    DomainProposalCreated {
        id: String,
        kind: String,
        confidence: f64,
        timestamp: DateTime<Utc>,
    },
    MemoryClassified {
        id: String,
        domains: Vec<String>,
        top_score: f64,
        timestamp: DateTime<Utc>,
    },
}
```

The SvelteKit dashboard's WS client reacts to both events: classified events refresh any open domain-detail page; proposal events push a toast and a badge on the navbar.

---

## Test Plan

Test root: `tests/phase_4/` (a new member of the workspace; mirror the `tests/e2e` layout).

`tests/phase_4/Cargo.toml`:

```toml
[package]
name = "vestige-phase4-tests"
version = "0.0.0"
edition = "2024"
publish = false

[dependencies]
vestige-core = { path = "../../crates/vestige-core", features = ["embeddings", "vector-search", "domain-classification"] }
vestige-mcp = { path = "../../crates/vestige-mcp" }
tokio = { workspace = true }
anyhow = "1"
tempfile = "3"
serde_json = { workspace = true }
uuid = { workspace = true }
```

### Unit tests (colocated in `domain_classifier.rs::tests`, `context_signals.rs::tests`, `spreading_activation.rs::tests`)

Each public function must have at least one test:

- `classify_empty_domains_returns_empty`: `classify(&[0.0; 768], &[])` returns `ClassificationResult { scores: {}, domains: [] }`.
- `classify_single_domain_scores`: one `Domain` with a known centroid; input embedding equal to the centroid; expect score 1.0 and `domains == [id]`.
- `classify_multi_domain_overlap`: two domains A, B; input halfway between centroids; expect both scores >= `assign_threshold`; expect `domains == [A, B]` (order not guaranteed).
- `classify_below_threshold_returns_empty_domains_but_scores_filled`: input orthogonal to all centroids; expect `scores` populated, `domains` empty.
- `classify_with_boost_adds_delta`: same input as above, with `boost = {A: 0.4}`; expect A now above threshold, B unchanged.
- `classify_boost_clamps_to_unit`: `boost = {A: 5.0}`; the resulting `scores[A]` must be <= 1.0.
- `tfidf_top_k_returns_distinct_terms`: given three fake docs, `top_k=3` returns three non-duplicate strings, in descending TF-IDF order.
- `tfidf_top_k_drops_stopwords`: `["the and for"]` + real content -> stop-words absent.
- `compute_top_terms_handles_empty_cluster`: returns `vec![]` (no panic).
- `signal_git_present_vs_absent`: `GitRepoSignal` given metadata with `.git` in cwd returns a non-empty map; without it returns empty.
- `signal_ide_present_vs_absent`: `IdeHintSignal` ditto for `metadata.editor == "vscode"`.
- `signal_combined_clamped`: two signals both firing, each at +0.10 -> combined map values <= +0.15.
- `cross_domain_decay_full_weight_on_overlap`: graph with node A in domain `dev`, node B in domain `dev`, edge A->B strength 1.0; after `activate`, B's activation equals the standard `initial * strength * decay_factor` (no extra penalty).
- `cross_domain_decay_half_weight_no_overlap`: A in `dev`, B in `infra`, same edge -> B's activation is 0.5x that of the overlap case.
- `cross_domain_decay_unclassified_no_penalty`: A classified, B unclassified -> full weight.
- `propose_changes_detects_split`: existing domain `dev`; new discovery returns two clusters whose centroids both sit close to the old `dev` centroid, each >= min_cluster_size members -> proposal of kind `Split { parent: "dev", children: [a, b] }`.
- `propose_changes_detects_merge`: two existing domains whose new centroids now have cosine > `merge_threshold` -> proposal of kind `Merge`.
- `propose_changes_detects_new_cluster`: a new cluster with no match >= 0.85 to any existing domain -> `NewCluster`.
- `apply_proposal_split_updates_memberships`: after accept, memories previously in `dev` get reassigned (some to child a, some to child b) via `reassign_all`.

### Integration tests (`tests/phase_4/tests/`)

One file per behavior listed in the Phase 4 acceptance sheet.

- `discover_seed_corpus.rs` -- loads the 500-memory fixture, runs `classifier.discover(&store).await`, asserts at least 3 clusters, asserts per-cluster intra-similarity mean > 0.6, asserts discovery wall time < 10s in release. Also asserts `top_terms` for each cluster contains at least one expected keyword per cluster (dev: contains any of `rust/trait/async`; infra: `bgp/vlan/network`; home: `solar/battery/pool`).
- `soft_assign_multi_domain.rs` -- inserts a memory "deploy zinit containers over BGP network"; after classify, `domains` contains both `dev` and `infra` (from a known centroid setup).
- `auto_classify_on_ingest.rs` -- with three existing domains, a fresh `smart_ingest` of a dev-ish sentence ends up with `domains == ["dev"]` and non-empty `domain_scores`.
- `reembed_triggers_recluster.rs` -- after `vestige migrate --reembed`, centroids must be recomputed; verify `list_domains()` returns fresh `centroid` values (different from pre-reembed).
- `dream_consolidation_recluster_hook.rs` -- run 5 dream cycles with heavy synthetic memory insertion; after the 5th, assert `list_domain_proposals("pending")` has at least one proposal.
- `proposal_accept_applies_changes.rs` -- accept a split proposal via `apply_proposal`; verify that memories in `dev` are now distributed across the new children and that the old `dev` domain is removed.
- `proposal_reject_leaves_state.rs` -- reject a proposal; verify all domains and memberships unchanged.
- `drift_is_proposal_only.rs` -- over 5 dream cycles with new inserts, never call accept; verify every memory's `domains` field equals its initial post-discovery value. No auto-apply.
- `cross_domain_activation_decay.rs` -- build an `ActivationNetwork` with two memories linked by a strength-1.0 edge, one in `dev`, one in `infra`; activate the `dev` memory with 1.0; assert the `infra` memory's activation == `0.5 * decay_factor` (0.35 with the default decay_factor 0.7). Then set both to `dev` and reassert activation == `0.7`.
- `cli_domains_discover.rs` -- spawn `cargo run -- domains discover --force --json`, parse stdout, assert at least 3 clusters and a valid JSON shape.
- `cli_domains_rename_merge.rs` -- happy-path rename then merge, with stdout assertions.
- `context_signal_git_repo.rs` -- ingest the same sentence from inside a tempdir with `.git` vs outside; assert the git run produces slightly higher `domain_scores` for the code-related domain (diff >= 0.04, matching `git_boost = 0.05`).
- `threshold_tunable.rs` -- same memory, two runs with `assign_threshold = 0.40` vs `0.85`; the low-threshold run assigns more domains than the high-threshold run for the same content.
- `signal_boost_clamped.rs` -- artificially configure `git_boost = 5.0` and assert the resulting per-domain score is still <= 1.0.
- `discover_preserves_stable_ids.rs` -- run discover twice with no new memories; the second run's domain ids match the first's (via the centroid-similarity stable-ID matching above 0.85).

### Dashboard UI tests (`tests/phase_4/ui/`)

Use curl-driven smoke tests (this avoids taking Playwright as a new hard dependency here; a Playwright setup already exists at `apps/dashboard/playwright.config.ts` and can be extended later).

- `domains_list_renders.sh` -- `curl -H "X-API-Key: $KEY" http://localhost:3927/api/v1/domains` returns 200 + a JSON array with the expected keys.
- `domain_detail_histogram.sh` -- `curl .../api/v1/domains/dev/score-histogram` returns 10 buckets.
- `proposal_review_flow.sh` -- create a pending proposal via SQL insert; `curl POST .../api/v1/domains/proposals/<id>/accept`; `curl GET .../proposals?status=accepted` shows it.
- `unauth_domain_list_rejected.sh` -- no auth header -> 401.

### Benchmarks (`tests/phase_4/benches/`)

Criterion benches:

- `bench_discover_10k.rs` -- synthetic 10k x 768D embeddings drawn from 5 blobs; assert `discover` wall p95 < 30s on a warm release build.
- `bench_auto_classify_single.rs` -- 20 domains in memory, classify one 768D vector; assert p99 < 5ms.
- `bench_reassign_all.rs` -- 10k memories, 5 domains; assert full `reassign_all` wall time < 90s (~110 rows/s baseline).

---

## Acceptance Criteria

- [ ] `cargo build -p vestige-core --features domain-classification` zero warnings.
- [ ] `cargo build -p vestige-mcp` zero warnings.
- [ ] `cargo clippy --workspace --all-targets --all-features -- -D warnings` clean.
- [ ] `cargo test -p vestige-phase4-tests` -- all tests in `tests/phase_4/` pass.
- [ ] On a 500+ memory seed corpus covering three natural clusters (dev / infra / home), `vestige domains discover --force` produces sensible top-terms matching the expected keyword sets and labels are stable on a second run.
- [ ] `vestige search` with domain filter `["dev"]` excludes any memory whose `domains` array does not include `dev`.
- [ ] After 5 dream cycles with ongoing inserts, no existing memory's `domains` has silently changed; proposals exist in the `domain_proposals` table; accepting a proposal reassigns as described.
- [ ] Cross-domain spreading activation: a query in `dev` that crosses a single edge into an `infra`-only memory still returns the memory but with activation `cross_domain_decay * in-domain activation`.
- [ ] `vestige domains discover --min-cluster-size 20` produces strictly fewer or equal clusters than the default, with larger per-cluster membership.
- [ ] Dashboard `/dashboard/domains` route renders all domains within 2 seconds on the seed corpus.
- [ ] Proposal UI flow (open pending, accept, confirmed in store) works end-to-end.
- [ ] Benchmarks meet targets (discover 10k p95 < 30s, auto-classify p99 < 5ms).

---

## Rollback Notes

- **Feature gate**: add `domain-classification` to `crates/vestige-core/Cargo.toml`'s `[features]`. When disabled, the `DomainClassifier` module is not compiled, the classification call in the ingest path is a no-op (`#[cfg]`-guarded), and cross-domain decay collapses to `1.0`. The CLI `domains` subcommand emits "domain classification is disabled in this build".
- **Revert strategy**: drop the Phase 4 `domain_proposals` table; the `domains` table, if created in Phase 1, is retained. A DOWN migration clears `memories.domains` and `memories.domain_scores`. Existing memories simply lose their domain assignments; all search and retrieval paths work unchanged because `domains = []` is the documented "unclassified" state.
- **Idempotency**: rerunning `discover` is always safe. Cluster numeric IDs may differ between runs, but the stable-ID match by centroid similarity preserves user-assigned labels. Do not persist cluster ids in client-side bookmarks; link via the user-assigned label.
- **Data-loss risk**: `apply_proposal` is a destructive operation (it deletes the old parent domain in a split, or merges two domains into one). The dashboard's accept button double-confirms with a modal that shows the number of affected memories.

---
|
||||
|
||||
## Open Implementation Questions
|
||||
|
||||
Each question + candidates + RECOMMENDATION.
|
||||
|
||||
### OQ1. Top-terms extraction: TF-IDF vs BM25 vs frequency?

- TF-IDF with smoothed IDF -- standard, cheap, good enough.
- BM25 -- better for long-document discrimination; overkill for short memory contents.
- Raw frequency -- noisy; stop-words dominate.

**RECOMMENDATION**: TF-IDF with global IDF over the entire memory corpus (not just cluster members), recomputed once per `discover` call. Use the same tokenizer as the `dreams.rs::content_similarity` Jaccard for consistency.
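
A minimal sketch of what the recommended scheme could look like. The `top_terms` helper, the whitespace tokenizer, and the smoothing formula are illustrative assumptions, not the shared tokenizer the recommendation mandates:

```rust
use std::collections::{HashMap, HashSet};

/// TF over the cluster, smoothed IDF over the whole corpus.
/// Terms common everywhere (global IDF) score low even if frequent in the cluster.
fn top_terms(cluster_docs: &[&str], corpus_docs: &[&str], k: usize) -> Vec<String> {
    // Document frequency over the entire corpus (global IDF).
    let mut df: HashMap<String, usize> = HashMap::new();
    for doc in corpus_docs {
        let unique: HashSet<&str> = doc.split_whitespace().collect();
        for t in unique {
            *df.entry(t.to_lowercase()).or_insert(0) += 1;
        }
    }
    let n = corpus_docs.len() as f64;

    // Term frequency within the cluster.
    let mut tf: HashMap<String, usize> = HashMap::new();
    for doc in cluster_docs {
        for t in doc.split_whitespace() {
            *tf.entry(t.to_lowercase()).or_insert(0) += 1;
        }
    }

    // Smoothed IDF: ln((1 + N) / (1 + df)) + 1.
    let mut scored: Vec<(String, f64)> = tf
        .into_iter()
        .map(|(t, f)| {
            let d = *df.get(&t).unwrap_or(&0) as f64;
            let idf = ((1.0 + n) / (1.0 + d)).ln() + 1.0;
            (t, f as f64 * idf)
        })
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(t, _)| t).collect()
}
```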
### OQ2. Proposal persistence: DB table vs in-memory with dashboard notification?

- DB table (`domain_proposals`) -- durable, surfaces across restarts, enables audit.
- In-memory only -- simpler, but loses proposals on server restart.

**RECOMMENDATION**: DB table. Proposals are rare (every 5th dream) and valuable user-facing artifacts; durability is mandatory.

### OQ3. `hdbscan` crate: f32 vs f64 input, exact API surface?

- v0.10 historically takes `&[Vec<f64>]`; embeddings are `Vec<f32>`.
- Cost of converting f32 -> f64 at discovery time: `10k * 768 = 7.68M` f64 values, ~60MB transient -- acceptable.

**RECOMMENDATION**: verify v0.10's type signature at implementation time; if it requires f64, perform the conversion in `discover()` behind a single allocation. Document this in the module header. If the crate API diverged from the PRD snippet, fall back to the manual builder style (`HdbscanHyperParams::builder().min_cluster_size(n).min_samples(s).build()`).
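
The widening conversion itself is trivial; a sketch, assuming the crate ends up wanting `Vec<Vec<f64>>` (the helper name is illustrative):

```rust
/// Widen f32 embeddings to f64 once, at discovery time.
/// One Vec allocation per row; ~60MB transient for 10k x 768.
fn widen_embeddings(embeddings: &[Vec<f32>]) -> Vec<Vec<f64>> {
    embeddings
        .iter()
        .map(|e| e.iter().map(|&x| x as f64).collect())
        .collect()
}
```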
### OQ4. Stable domain IDs across discover re-runs?

- Option A: numeric IDs from HDBSCAN labels -- unstable; re-runs shuffle them.
- Option B: hash(top_terms) -- stable if top-terms are stable, but top-terms drift.
- Option C (recommended): after computing new centroids, match each to the closest existing domain by centroid cosine; if similarity > 0.85, reuse the existing domain's `id` and `label`. Otherwise mint a fresh `id = "cluster_<uuid>"`.

**RECOMMENDATION**: Option C. Preserves user-assigned labels across drift. The 0.85 threshold can later be made config-tunable via `stable_id_threshold` if needed.
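
A sketch of the Option C matching step. The function names are illustrative; the 0.85 threshold and the `cluster_<uuid>` fallback follow the text above:

```rust
/// Reuse an existing domain's id when the new centroid is close enough.
/// Returns None when no existing centroid clears the threshold,
/// in which case the caller mints a fresh `cluster_<uuid>` id.
fn stable_id(
    new_centroid: &[f32],
    existing: &[(String, Vec<f32>)], // (id, centroid)
    threshold: f32,
) -> Option<String> {
    existing
        .iter()
        .map(|(id, c)| (id, cosine(new_centroid, c)))
        .filter(|(_, sim)| *sim > threshold)
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(id, _)| id.clone())
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}
```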
### OQ5. Context signal injection site: ingest handler vs embedder vs classifier?

- Embedder -- would alter the embedding; signals are not about embedding quality.
- Ingest handler -- signals are known there, but then `DomainClassifier` cannot be tested in isolation.
- Classifier, via a `classify_with_boost(boost: Option<&HashMap>)` parameter -- pure, testable, composable.

**RECOMMENDATION**: classifier parameter. The cognitive engine constructs the boost map via `ContextSignals::gather_boost(&metadata, &domains)` and hands it to the classifier. Keeps the classifier stateless w.r.t. signals.

### OQ6. Re-cluster proposal cadence: event-based (every Nth dream) vs time-based (weekly)?

- ADR resolution Q7: every Nth dream (N=5 default).
- Alternative: once per week regardless of dream cadence.

**RECOMMENDATION**: stick with every Nth dream. Users who dream rarely re-cluster rarely -- that matches the philosophy ("memory work triggers memory bookkeeping"). Note the alternative as a future consideration; if users complain about never seeing proposals, add a time-based fallback.

### OQ7. Minimum corpus size for first discover?

- PRD default: 150.
- Too low -> noisy initial clusters, proposals every dream.
- Too high -> the user waits forever for domains to appear.

**RECOMMENDATION**: 150 as the default discovery gate. HDBSCAN's `min_cluster_size=10` will produce 0 clusters for < 100 memories, so the system gracefully produces no domains until the corpus is large enough. Test with `N=80, 150, 500` in `threshold_tunable.rs` to confirm sensible behavior.

### OQ8. Cross-domain decay: strict no-overlap vs graded?

- Strict: `1.0` if any overlap, `cross_domain_decay` otherwise.
- Graded: `max(cross_domain_decay, |A intersect B| / max(|A|, |B|))`.

**RECOMMENDATION**: strict for Phase 4. Easier to reason about, easier to tune, easier to test. Graded is a marked future enhancement; file an issue if retrieval-quality metrics justify it.
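
The strict rule fits in a few lines; a sketch (the function name is illustrative, and how two unclassified nodes -- both `domains = []` -- should be treated is left to the implementation, since empty sets have no overlap):

```rust
use std::collections::HashSet;

/// Strict variant: full edge strength if the two nodes share any domain,
/// otherwise apply cross_domain_decay. Note that two unclassified nodes
/// ([] vs []) have no overlap and therefore decay under this rule.
fn edge_decay(a: &[String], b: &[String], cross_domain_decay: f64) -> f64 {
    let sa: HashSet<&String> = a.iter().collect();
    if b.iter().any(|d| sa.contains(d)) {
        1.0
    } else {
        cross_domain_decay
    }
}
```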
### OQ9. Classifier invocation from remote HTTP clients?

- In server mode, an agent posts `smart_ingest` -> the server embeds -> the server classifies.
- All the work stays server-side; MCP clients never do classification.

**RECOMMENDATION**: confirmed server-side-only. Document in the MCP tool schema that `smart_ingest` now returns `domains` and `domain_scores` in its response so clients can display the classification to the user.

### OQ10. Where to store the dream-cycle counter?

- In-memory on `CognitiveEngine` -- lost on restart, miscounts cadence.
- New `system_state` singleton table.

**RECOMMENDATION**: the `system_state` table. It survives restarts and is also useful for future metrics (total memories ever, total dreams ever).

### OQ11. Scope of `reassign_all` after a proposal accept vs a normal discover?

- On `discover --force` (first-time), run `reassign_all` against all memories.
- On proposal accept (split / merge), reassign only the affected memories (the parent's members for a split; both parents' members for a merge) to avoid touching unrelated records.

**RECOMMENDATION**: scoped reassignment where possible; fall back to full `reassign_all` only on `discover --force` or when the set of domains has fundamentally changed. This reduces write amplification on large corpora.

### OQ12. Proposal freshness?

- Multiple re-clusters could stack up pending proposals.

**RECOMMENDATION**: before inserting a new proposal, check for existing pending proposals with the same `kind + targets`; if present, bump `created_at` and `confidence` instead of creating a duplicate. Add a `confidence_history` array in the `payload` JSONB for audit.
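
A sketch of a dedup key, assuming the duplicate check normalizes target order (the helper name is hypothetical):

```rust
/// Dedup key: kind plus the sorted target set, so the same split proposed
/// on consecutive dreams collides, and target order never matters for merges.
fn proposal_key(kind: &str, targets: &[String]) -> String {
    let mut t: Vec<&str> = targets.iter().map(|s| s.as_str()).collect();
    t.sort_unstable();
    format!("{}:{}", kind, t.join(","))
}
```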

---

## Implementation Sequencing (suggested order)

1. Land the `DomainClassifier` struct, `classify` / `classify_with_boost`, unit tests. (Day 1)
2. Add `compute_top_terms` + TF-IDF helper, tests. (Day 1)
3. Wire `discover` end-to-end against SQLite; `discover_seed_corpus` integration test. (Day 2)
4. Add `domain_proposals` table migrations + trait methods; both backends. (Day 2)
5. Implement `propose_changes` + `apply_proposal`; proposal unit tests. (Day 3)
6. Context signals module + tests. (Day 3)
7. Hook classifier into ingest path; `auto_classify_on_ingest` integration test. (Day 4)
8. Cross-domain decay in spreading activation; unit + integration tests. (Day 4)
9. Dream re-cluster hook + `system_state` counter; integration tests for drift-only behavior. (Day 5)
10. CLI subcommands. (Day 6)
11. REST endpoints. (Day 6)
12. SvelteKit dashboard routes + WebSocket event wiring. (Day 7-8)
13. Benchmarks + acceptance sweep on the 500-memory seed. (Day 9)

---

## File Map (everything Phase 4 touches or creates)

Creates:

- `crates/vestige-core/src/neuroscience/domain_classifier.rs`
- `crates/vestige-core/src/neuroscience/context_signals.rs`
- `crates/vestige-core/migrations/postgres/00XX_domain_proposals.sql`
- `crates/vestige-core/migrations/sqlite/00XX_domain_proposals.sql` (or inline in `storage/migrations.rs`)
- `crates/vestige-mcp/src/api/domains.rs` (REST handlers)
- `apps/dashboard/src/routes/(app)/domains/+page.svelte`
- `apps/dashboard/src/routes/(app)/domains/[id]/+page.svelte`
- `apps/dashboard/src/routes/(app)/domains/proposals/+page.svelte`
- `apps/dashboard/src/lib/api/domains.ts`
- `tests/phase_4/Cargo.toml`
- `tests/phase_4/tests/*.rs` (per the integration test list)
- `tests/phase_4/fixtures/seed_500.json`
- `tests/phase_4/support/fixtures.rs`

Modifies:

- `crates/vestige-core/Cargo.toml` -- add `hdbscan = "0.10"` under a new `domain-classification` feature.
- `crates/vestige-core/src/neuroscience/mod.rs` -- register new modules, re-exports.
- `crates/vestige-core/src/neuroscience/spreading_activation.rs` -- `cross_domain_decay` field in `ActivationConfig`, `domains` field on `ActivationNode`, decay math in `activate`.
- `crates/vestige-core/src/consolidation/phases.rs` -- `DreamReClusterHook`.
- `crates/vestige-core/src/advanced/dreams.rs` -- accept a hook callback from the orchestrator (if the orchestration is done at this level).
- `crates/vestige-core/src/storage/trait.rs` -- add proposal + system_state methods.
- `crates/vestige-core/src/storage/sqlite.rs` -- implement proposal + system_state methods + `all_embeddings_with_meta` if not already on the trait.
- `crates/vestige-core/src/storage/postgres.rs` (Phase 2) -- same.
- `crates/vestige-core/src/lib.rs` -- re-exports.
- `crates/vestige-core/src/cognitive.rs` (or equivalent ingest orchestrator) -- auto-classify injection.
- `crates/vestige-mcp/src/bin/cli.rs` -- `Domains` subcommand + dispatch.
- `crates/vestige-mcp/src/dashboard/mod.rs` -- wire new REST routes.
- `crates/vestige-mcp/src/dashboard/events.rs` -- new event variants.
- `crates/vestige-mcp/src/dashboard/handlers.rs` -- if the legacy dashboard gets a domains panel (optional).
- `vestige.toml` config loader -- `[domains]` section + struct + defaults.
- Root `Cargo.toml` workspace members -- add `tests/phase_4`.

---

## Risks

- **HDBSCAN determinism**: HDBSCAN is deterministic given input order; sorting embeddings by memory id before feeding the clusterer guarantees reproducibility across runs. Do this in `discover()` and document it.
- **Embedding dimension drift**: Phase 1's `embedding_model` registry blocks writes from mismatched embedders. If `discover()` ever sees two dimensions, it bails with a clear error and points at `vestige migrate --reembed`.
- **Classification latency on ingest**: for users with thousands of domains (unlikely but possible), `classify` is O(n_domains * dim). 20 domains * 768 f32 is ~15k flops per classification -- trivial. Still, expose a `classify_budget_ms` config knob for paranoia.
- **Re-cluster proposal storms**: if the corpus is borderline-stable, small changes can produce conflicting proposals on consecutive dreams. Mitigation: OQ12 (dedup by target set, bump confidence instead of stacking).
- **Dashboard feature gap**: if the SvelteKit app lands with the domains route but the REST endpoints are not yet deployed, the route 404s. Mitigation: ship the REST endpoints in the same release; a feature flag on the client toggles the nav entry.

---

## Non-Goals Reminder

- No Phase 5 federation concerns in this plan.
- No cross-installation domain sync.
- No automatic accept of proposals, ever.
- No graded cross-domain decay; strict only.
- No ML-based domain label suggestion (top-terms are enough for v1).
- No editing individual memory memberships from the UI in this phase.

---

docs/prd/001-getting-centralized-vestige.md (new file, 751 lines)

# RFC: Pluggable Storage Backend + Network Access for Vestige

**Status**: Draft / Discussion
**Author**: Jan
**Date**: 2026-02-26
**Vestige version**: v2.x (current main)

## Summary

Add a pluggable storage backend trait to Vestige, enabling PostgreSQL (+pgvector) as an alternative to the current SQLite+FTS5+USearch stack. Simultaneously add HTTP MCP transport with API key authentication to enable centralized/remote deployment.

This keeps the existing local-first SQLite mode fully intact while opening up a server deployment model.

## Motivation

Vestige currently runs as a local process per machine (MCP via stdio, SQLite in `~/.vestige/`). This works great for single-machine use but doesn't support:

- **Multi-machine access**: the same memory brain from laptop, desktop, and server
- **Multi-agent access**: multiple AI clients hitting one memory store concurrently
- **Future federation**: syncing memory between decentralized nodes (e.g., MOS/Threefold grid)

SQLite's single-writer model and lack of a native network protocol make it unsuitable as a centralized server. PostgreSQL is a natural fit: built-in concurrency (MVCC), authentication, replication, and with `pgvector` + built-in FTS it collapses three separate storage layers into one.

## Design

### Storage Trait

The core abstraction. All 29 cognitive modules interact with storage exclusively through this trait (or a small family of traits).

```rust
use std::collections::HashMap;
use uuid::Uuid;

/// Core memory record, backend-agnostic
#[derive(Debug, Clone)]
pub struct MemoryRecord {
    pub id: Uuid,
    pub domains: Vec<String>,                // [] = unclassified, ["dev"], ["dev", "infra"], etc.
    pub domain_scores: HashMap<String, f64>, // raw similarities: {"dev": 0.82, "infra": 0.71}
    pub content: String,
    pub node_type: String,
    pub tags: Vec<String>,
    pub embedding: Option<Vec<f32>>,         // dimensionality is runtime config
    pub created_at: chrono::DateTime<chrono::Utc>,
    pub updated_at: chrono::DateTime<chrono::Utc>,
    pub metadata: serde_json::Value,
}

/// FSRS scheduling state, stored alongside each memory
#[derive(Debug, Clone)]
pub struct SchedulingState {
    pub memory_id: Uuid,
    pub stability: f64,
    pub difficulty: f64,
    pub retrievability: f64,
    pub last_review: Option<chrono::DateTime<chrono::Utc>>,
    pub next_review: Option<chrono::DateTime<chrono::Utc>>,
    pub reps: u32,
    pub lapses: u32,
}

/// Hybrid search request
#[derive(Debug, Clone)]
pub struct SearchQuery {
    pub domains: Option<Vec<String>>,    // None = search all domains
    pub text: Option<String>,            // FTS query
    pub embedding: Option<Vec<f32>>,     // vector similarity
    pub tags: Option<Vec<String>>,       // tag filter
    pub node_types: Option<Vec<String>>,
    pub limit: usize,
    pub min_retrievability: Option<f64>, // filter by FSRS state
}

#[derive(Debug, Clone)]
pub struct SearchResult {
    pub record: MemoryRecord,
    pub score: f64,              // combined/fused score
    pub fts_score: Option<f64>,
    pub vector_score: Option<f64>,
}

/// Connection/edge between memories (for spreading activation)
#[derive(Debug, Clone)]
pub struct MemoryEdge {
    pub source_id: Uuid,
    pub target_id: Uuid,
    pub edge_type: String,
    pub weight: f64,
    pub created_at: chrono::DateTime<chrono::Utc>,
}

/// Main storage trait — one impl per backend.
/// trait_variant generates a Send-bound `MemoryStore` alias,
/// enabling Arc<dyn MemoryStore> without manual boxing.
#[trait_variant::make(MemoryStore: Send)]
pub trait LocalMemoryStore: Sync + 'static {
    // --- Lifecycle ---
    async fn init(&self) -> Result<()>;
    async fn health_check(&self) -> Result<HealthStatus>;

    // --- CRUD ---
    async fn insert(&self, record: &MemoryRecord) -> Result<Uuid>;
    async fn get(&self, id: Uuid) -> Result<Option<MemoryRecord>>;
    async fn update(&self, record: &MemoryRecord) -> Result<()>;
    async fn delete(&self, id: Uuid) -> Result<()>;

    // --- Search ---
    async fn search(&self, query: &SearchQuery) -> Result<Vec<SearchResult>>;
    async fn fts_search(&self, text: &str, limit: usize) -> Result<Vec<SearchResult>>;
    async fn vector_search(&self, embedding: &[f32], limit: usize) -> Result<Vec<SearchResult>>;

    // --- FSRS Scheduling ---
    async fn get_scheduling(&self, memory_id: Uuid) -> Result<Option<SchedulingState>>;
    async fn update_scheduling(&self, state: &SchedulingState) -> Result<()>;
    async fn get_due_memories(&self, before: chrono::DateTime<chrono::Utc>, limit: usize) -> Result<Vec<(MemoryRecord, SchedulingState)>>;

    // --- Graph (spreading activation) ---
    async fn add_edge(&self, edge: &MemoryEdge) -> Result<()>;
    async fn get_edges(&self, node_id: Uuid, edge_type: Option<&str>) -> Result<Vec<MemoryEdge>>;
    async fn remove_edge(&self, source: Uuid, target: Uuid) -> Result<()>;
    async fn get_neighbors(&self, node_id: Uuid, depth: usize) -> Result<Vec<(MemoryRecord, f64)>>;

    // --- Bulk / Maintenance ---
    async fn count(&self) -> Result<usize>;
    async fn get_stats(&self) -> Result<StoreStats>;
    async fn vacuum(&self) -> Result<()>;
}
```

**Design notes:**

- `trait_variant::make` generates a `MemoryStore` trait alias with `Send`-bound futures, allowing `Arc<dyn MemoryStore>` for runtime backend selection. `LocalMemoryStore` is the base (usable in single-threaded contexts); `MemoryStore` is the Send variant for Axum/tokio.
- `embedding: Option<Vec<f32>>` — dimensions are determined at runtime by the configured fastembed model. The backend stores whatever it gets.
- The trait is intentionally flat. The cognitive modules (FSRS-6, spreading activation, synaptic tagging, prediction error gating, etc.) sit *above* this trait and don't need to know about the backend.
- `search()` does hybrid RRF fusion at the backend level — both the SQLite and Postgres implementations handle this internally.
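
Runtime backend selection then lives in one place, the composition root. A simplified sketch — synchronous stand-in trait and struct bodies are illustrative, not the real `MemoryStore` alias or backend types:

```rust
use std::sync::Arc;

// Simplified, synchronous stand-in for the generated MemoryStore alias.
trait MemoryStoreLike: Send + Sync {
    fn backend_name(&self) -> &'static str;
}

struct SqliteMemoryStore;
struct PgMemoryStore;

impl MemoryStoreLike for SqliteMemoryStore {
    fn backend_name(&self) -> &'static str { "sqlite" }
}
impl MemoryStoreLike for PgMemoryStore {
    fn backend_name(&self) -> &'static str { "postgres" }
}

/// Composition root: pick the backend once from [storage].backend,
/// then share one Arc<dyn ...> with every cognitive module.
fn open_store(backend: &str) -> Arc<dyn MemoryStoreLike> {
    match backend {
        "postgres" => Arc::new(PgMemoryStore),
        _ => Arc::new(SqliteMemoryStore),
    }
}
```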
### Backend: SQLite (existing, refactored)

Wraps the current implementation behind the trait:

```
SqliteMemoryStore
├── rusqlite connection pool (r2d2 or deadpool)
├── FTS5 virtual table (keyword search)
├── USearch HNSW index (vector search, behind RwLock)
└── WAL mode + busy timeout for concurrent readers
```

No behavioral changes — just the trait boundary.
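
Because FTS5 and USearch are separate indexes here, the SQLite backend's `search()` has to fuse the two ranked result lists in-process. A sketch of that RRF step, using the conventional k = 60 constant (the function shape is illustrative):

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over two ranked lists of memory ids,
/// as the SQLite backend could do in-process for hybrid search.
/// score(id) = sum over lists of 1 / (60 + rank), rank starting at 1.
fn rrf_fuse(fts_ranked: &[u64], vec_ranked: &[u64], limit: usize) -> Vec<u64> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for (i, id) in fts_ranked.iter().enumerate() {
        *scores.entry(*id).or_insert(0.0) += 1.0 / (60.0 + (i + 1) as f64);
    }
    for (i, id) in vec_ranked.iter().enumerate() {
        *scores.entry(*id).or_insert(0.0) += 1.0 / (60.0 + (i + 1) as f64);
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused.into_iter().take(limit).map(|(id, _)| id).collect()
}
```

Ids appearing in both lists accumulate both contributions, so agreement between keyword and vector search floats a memory to the top even when neither list ranks it first.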
### Backend: PostgreSQL (new)

```
PgMemoryStore
├── sqlx::PgPool (connection pool, compile-time checked queries)
├── tsvector + GIN index (keyword search)
├── pgvector + HNSW index (vector search)
└── Standard PostgreSQL MVCC concurrency
```

**Schema sketch:**

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Domain registry — populated by clustering, not by user
CREATE TABLE domains (
    id           TEXT PRIMARY KEY,               -- auto-generated or user-named
    label        TEXT NOT NULL,                  -- human label (suggested or user-provided)
    centroid     vector,                         -- mean embedding of domain members
    top_terms    TEXT[] NOT NULL DEFAULT '{}',   -- top keywords for display
    memory_count INTEGER NOT NULL DEFAULT 0,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    metadata     JSONB NOT NULL DEFAULT '{}'
);

CREATE TABLE memories (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    domains       TEXT[] NOT NULL DEFAULT '{}',  -- [] = unclassified
    domain_scores JSONB NOT NULL DEFAULT '{}',   -- {"dev": 0.82, "infra": 0.71} raw similarities
    content       TEXT NOT NULL,
    node_type     TEXT NOT NULL DEFAULT 'general',
    tags          TEXT[] NOT NULL DEFAULT '{}',
    embedding     vector,                        -- dimension set at table creation or unconstrained
    metadata      JSONB NOT NULL DEFAULT '{}',
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT now(),

    -- FTS: auto-maintained tsvector column
    search_vec TSVECTOR GENERATED ALWAYS AS (
        setweight(to_tsvector('english', content), 'A') ||
        setweight(to_tsvector('english', coalesce(node_type, '')), 'B') ||
        setweight(array_to_tsvector(tags), 'C')
    ) STORED
);

-- FTS index
CREATE INDEX idx_memories_fts ON memories USING GIN (search_vec);

-- Vector similarity (HNSW)
CREATE INDEX idx_memories_embedding ON memories
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Common filters
CREATE INDEX idx_memories_domains ON memories USING GIN (domains);
CREATE INDEX idx_memories_node_type ON memories (node_type);
CREATE INDEX idx_memories_tags ON memories USING GIN (tags);
CREATE INDEX idx_memories_created ON memories (created_at);

-- FSRS scheduling state
CREATE TABLE scheduling (
    memory_id      UUID PRIMARY KEY REFERENCES memories(id) ON DELETE CASCADE,
    stability      DOUBLE PRECISION NOT NULL DEFAULT 0.0,
    difficulty     DOUBLE PRECISION NOT NULL DEFAULT 0.0,
    retrievability DOUBLE PRECISION NOT NULL DEFAULT 1.0,
    last_review    TIMESTAMPTZ,
    next_review    TIMESTAMPTZ,
    reps           INTEGER NOT NULL DEFAULT 0,
    lapses         INTEGER NOT NULL DEFAULT 0
);

CREATE INDEX idx_scheduling_next ON scheduling (next_review);

-- Graph edges (spreading activation)
-- Edges can cross domain boundaries — spreading activation respects
-- domain filters when provided, traverses freely when searching all domains.
CREATE TABLE edges (
    source_id  UUID NOT NULL REFERENCES memories(id) ON DELETE CASCADE,
    target_id  UUID NOT NULL REFERENCES memories(id) ON DELETE CASCADE,
    edge_type  TEXT NOT NULL DEFAULT 'related',
    weight     DOUBLE PRECISION NOT NULL DEFAULT 1.0,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (source_id, target_id, edge_type)
);

CREATE INDEX idx_edges_target ON edges (target_id);

-- API keys
CREATE TABLE api_keys (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    key_hash      TEXT NOT NULL UNIQUE,               -- blake3
    label         TEXT NOT NULL,
    scopes        TEXT[] NOT NULL DEFAULT '{read,write}',
    domain_filter TEXT[] NOT NULL DEFAULT '{}',       -- {} = access all domains
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    last_used     TIMESTAMPTZ,
    active        BOOLEAN NOT NULL DEFAULT true
);
```

**Hybrid search in SQL:**

```sql
-- RRF (Reciprocal Rank Fusion) combining FTS + vector
-- $1 = query text, $2 = embedding, $3 = limit, $4 = domain filter (NULL for all)
WITH fts AS (
    SELECT id, ts_rank_cd(search_vec, websearch_to_tsquery('english', $1)) AS score,
           ROW_NUMBER() OVER (ORDER BY ts_rank_cd(search_vec, websearch_to_tsquery('english', $1)) DESC) AS rank
    FROM memories
    WHERE search_vec @@ websearch_to_tsquery('english', $1)
      AND ($4::text[] IS NULL OR domains && $4)  -- array overlap: any match
    LIMIT 50
),
vec AS (
    SELECT id, 1 - (embedding <=> $2::vector) AS score,
           ROW_NUMBER() OVER (ORDER BY embedding <=> $2::vector) AS rank
    FROM memories
    WHERE embedding IS NOT NULL
      AND ($4::text[] IS NULL OR domains && $4)
    LIMIT 50
)
SELECT COALESCE(f.id, v.id) AS id,
       COALESCE(1.0 / (60 + f.rank), 0) + COALESCE(1.0 / (60 + v.rank), 0) AS rrf_score,
       f.score AS fts_score,
       v.score AS vector_score
FROM fts f FULL OUTER JOIN vec v ON f.id = v.id
ORDER BY rrf_score DESC
LIMIT $3;
```

### Embedding Configuration

The embedding layer stays external to the storage backend. fastembed runs locally and produces vectors that get passed into `MemoryRecord.embedding`.

```toml
# vestige.toml
[embeddings]
provider = "fastembed"             # only local for now
model = "BAAI/bge-base-en-v1.5"    # 768 dimensions
# model = "BAAI/bge-large-en-v1.5" # 1024 dimensions
# model = "BAAI/bge-small-en-v1.5" # 384 dimensions

[storage]
backend = "postgres"               # or "sqlite"

[storage.sqlite]
path = "~/.vestige/vestige.db"

[storage.postgres]
url = "postgresql://vestige:secret@localhost:5432/vestige"
max_connections = 10
```

On init, the backend reads the embedding dimension from the first stored vector (or from config) and validates consistency.

For pgvector: the column can be created as `vector(768)` (fixed, faster) or as unconstrained `vector` (flexible, slightly slower). Recommendation: a fixed dimension derived from config, with a migration path if the model changes.
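
A sketch of the init-time consistency check (function name and error wording are illustrative; the `vestige migrate --reembed` pointer matches the recovery path named elsewhere in these docs):

```rust
/// Validate that the configured model's dimension matches what the
/// store already holds. An empty store adopts the configured dimension.
fn validate_dimension(configured: usize, first_stored: Option<usize>) -> Result<usize, String> {
    match first_stored {
        None => Ok(configured), // empty store: config wins
        Some(d) if d == configured => Ok(d),
        Some(d) => Err(format!(
            "embedding dimension mismatch: store has {}, config expects {} \
             (run `vestige migrate --reembed` after changing models)",
            d, configured
        )),
    }
}
```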
### Emergent Domain Model

Instead of user-defined tenants, domains emerge automatically from the data via clustering. The user never has to decide where a memory belongs — the system figures it out.

#### Pipeline

```
Phase 1: Accumulate (cold start, 0 → N memories)
│  All memories stored with domains = [] (unclassified)
│  No classification overhead, just embed and store
│  Threshold N is configurable, default ~150 memories
│
Phase 2: Discover (triggered once at threshold, or manually)
│  Run HDBSCAN on all embeddings:
│    - min_cluster_size: ~10
│    - min_samples: ~5
│    - No eps parameter needed (unlike DBSCAN)
│    - Automatically determines number of clusters
│    - Handles variable-density clusters
│    - Border points between clusters flagged naturally
│
│  For each cluster, extract:
│    - Centroid (mean embedding)
│    - Top terms (TF-IDF or frequency over cluster members)
│    - Suggested label from top terms
│
│  Present to user (via dashboard or CLI):
│    "I found 3 natural groupings in your memories:
│       ● cluster_0 (47 memories): BGP, SONiC, VLAN, FRR, peering...
│       ● cluster_1 (31 memories): solar, kWh, battery, pool, ESPHome...
│       ● cluster_2 (22 memories): Rust, trait, async, zinit, tokio..."
│
│  User can:
│    - Name them: cluster_0 → "infra", cluster_1 → "home", cluster_2 → "dev"
│    - Accept suggested names
│    - Merge clusters
│    - Do nothing (auto-names stick)
│
Phase 3: Soft-assign all existing memories
│  Now that centroids exist, re-score every memory (including
│  those from discovery) against all centroids.
│  This replaces HDBSCAN's hard labels with continuous scores:
│
│  For each memory:
│    similarities = [(domain, cosine_sim(embedding, centroid)) for each domain]
│    domains = [id for (id, score) in similarities if score >= threshold]
│
│  Memories in overlap zones get multiple domains.
│  Memories far from all centroids stay unclassified.
│
Phase 4: Classify (ongoing, after discovery)
│  New memory ingested:
│    1. Compute embedding
│    2. Compute similarity to ALL domain centroids
│    3. Store raw scores in domain_scores JSONB
│    4. Threshold into domains[] array
│    5. Update domain centroids incrementally (running mean)
│
│  Context signals as soft priors:
│    - Git repo / IDE metadata → boost similarity to code-related domains
│    - No workspace context → slight boost toward non-technical domains
│    - These shift the score, never override the embedding distance
│
Phase 5: Re-cluster (periodic, during dream consolidation)
   Re-run HDBSCAN on all embeddings including new ones
   Detect:
     - New clusters forming from previously unclassified memories
     - Existing clusters splitting (domain grew too broad)
     - Clusters merging (domains that were artificially separate)
   Propose changes to user:
     "Your 'dev' domain may have split into two groups:
        - systems (zinit, MOS, containers, VMs) — 34 memories
        - networking (BGP, SONiC, VLANs, MLAG) — 28 memories
      Split them? [yes / no / later]"
   Re-run soft assignment on all memories after structural changes
   Centroid vectors are updated regardless
```
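
The incremental centroid update in Phase 4, step 5 is the standard running mean; a sketch (the helper name is illustrative):

```rust
/// Fold one new member into a domain centroid without reloading the domain.
/// Running mean: new_mean = old_mean + (x - old_mean) / new_count,
/// where new_count already includes the new member.
fn update_centroid(centroid: &mut [f32], new_embedding: &[f32], new_count: usize) {
    let n = new_count as f32;
    for (c, x) in centroid.iter_mut().zip(new_embedding) {
        *c += (x - *c) / n;
    }
}
```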
#### Domain Storage

```rust
#[derive(Debug, Clone)]
pub struct Domain {
    pub id: String,
    pub label: String,
    pub centroid: Vec<f32>,
    pub top_terms: Vec<String>,
    pub memory_count: usize,
    pub created_at: chrono::DateTime<chrono::Utc>,
}
```

Added to the `MemoryStore` trait:

```rust
// --- Domains ---
async fn list_domains(&self) -> Result<Vec<Domain>>;
async fn get_domain(&self, id: &str) -> Result<Option<Domain>>;
async fn upsert_domain(&self, domain: &Domain) -> Result<()>;
async fn delete_domain(&self, id: &str) -> Result<()>;
async fn classify(&self, embedding: &[f32]) -> Result<Vec<(String, f64)>>;
// Returns [(domain_id, similarity)] sorted by similarity desc.
// Caller decides the threshold for assignment.
```

#### Classification Module

A new cognitive module alongside FSRS, spreading activation, etc.:

```rust
pub struct DomainClassifier {
    /// Similarity threshold — domains scoring above this are assigned
    pub assign_threshold: f64,      // default: 0.65
    /// Minimum memories before running initial discovery
    pub discovery_threshold: usize, // default: 150
    /// How often to re-cluster (in dream consolidation passes)
    pub recluster_interval: usize,  // default: every 5th consolidation
    /// HDBSCAN min_cluster_size
    pub min_cluster_size: usize,    // default: 10
}

/// Raw classification result — all scores, before thresholding
#[derive(Debug, Clone)]
pub struct ClassificationResult {
    /// Similarity to every known domain centroid
    pub scores: HashMap<String, f64>, // {"dev": 0.82, "infra": 0.71, "home": 0.34}
    /// Domains above assign_threshold
    pub domains: Vec<String>,         // ["dev", "infra"]
}

impl DomainClassifier {
    /// Score a memory against all domain centroids.
    /// Returns raw scores AND the thresholded domain list.
    pub fn classify(
        &self,
        embedding: &[f32],
        domains: &[Domain],
    ) -> ClassificationResult {
        if domains.is_empty() {
            return ClassificationResult {
                scores: HashMap::new(),
                domains: vec![], // still in accumulation phase
            };
        }

        let scores: HashMap<String, f64> = domains.iter()
            .map(|d| (d.id.clone(), cosine_similarity(embedding, &d.centroid)))
            .collect();

        let assigned: Vec<String> = scores.iter()
            .filter(|(_, &s)| s >= self.assign_threshold)
            .map(|(id, _)| id.clone())
            .collect();

        ClassificationResult { scores, domains: assigned }
    }

    /// Soft-assign all existing memories after discovery or re-clustering.
    /// Returns the number of memories whose domains changed.
    pub async fn reassign_all(
        &self,
        store: &dyn MemoryStore,
        domains: &[Domain],
    ) -> Result<usize> {
        // Load all memories, re-score, update domains + domain_scores.
        // Batched to avoid loading everything into memory at once.
        todo!()
    }
}
```

**Key distinction from the previous design:** there's no "closest wins" or "margin" logic. Every domain gets a score, and *all* domains above threshold are assigned. A memory about "deploying zinit containers via BGP-routed network" might score 0.78 on "dev" and 0.72 on "infra" — it gets both. A memory about "solar panel output today" scores 0.85 on "home" and 0.31 on everything else — it only gets "home".

The raw `domain_scores` are always stored, so you (or the dashboard) can see *why* a memory was classified the way it was, and the threshold can be adjusted retroactively without re-computing embeddings.
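Because the raw scores persist, re-thresholding is a pure function over stored data. A hypothetical helper sketching that (names assumed, not part of the current codebase):

```rust
use std::collections::HashMap;

/// Re-derive a memory's domain list from its stored `domain_scores` at a new
/// threshold; no embedding or centroid work is needed. Hypothetical helper.
fn rethreshold(domain_scores: &HashMap<String, f64>, threshold: f64) -> Vec<String> {
    let mut assigned: Vec<String> = domain_scores.iter()
        .filter(|(_, &score)| score >= threshold)
        .map(|(id, _)| id.clone())
        .collect();
    assigned.sort(); // deterministic order for display and tests
    assigned
}
```

Running this over all memories is exactly the retroactive adjustment described above: tighten the threshold and multi-domain memories collapse to their strongest domain; loosen it and borderline memories pick up additional domains.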

#### Search Behavior

- **Default (no domain filter)**: searches all memories across all domains
- **Domain-scoped**: `domains: Some(vec!["dev"])` — only memories tagged with `dev`
- **Multi-domain**: `domains: Some(vec!["dev", "infra"])` — memories in either
- **MCP clients can set `X-Vestige-Domain` header** for default scoping, but the system works fine without it
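These filter semantics reduce to a set-intersection check with OR semantics; a hypothetical sketch for illustration (the actual filtering would happen inside the SQL and vector queries, not in application code):

```rust
/// OR-semantics domain filter: no filter matches everything; otherwise any
/// overlap between a memory's domains and the requested set is a match.
/// Hypothetical helper illustrating the search behavior above.
fn matches_domains(memory_domains: &[String], filter: Option<&[&str]>) -> bool {
    match filter {
        None => true, // default: search across all domains
        Some(wanted) => memory_domains.iter().any(|d| wanted.contains(&d.as_str())),
    }
}
```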

#### HDBSCAN Implementation

Domain discovery runs HDBSCAN (Hierarchical DBSCAN) over the embedding vectors. Advantages over plain DBSCAN:

- **No `eps` parameter** — the hardest thing to tune in DBSCAN. HDBSCAN determines density thresholds from the data hierarchy.
- **Variable-density clusters** — a tight cluster of networking memories and a spread-out cluster of personal memories are both detected correctly.
- **Border points** — memories between clusters are identified as low-confidence members, which aligns perfectly with soft assignment.

Implementation: the `hdbscan` crate in Rust. Load all embeddings into memory (at 768d × f32 × 10k memories ≈ 30MB — fine), cluster, compute centroids, soft-assign all memories against the centroids.

```rust
use std::collections::HashMap;

use hdbscan::{Center, Hdbscan};

fn discover_domains(
    embeddings: &[Vec<f32>],
    min_cluster_size: usize,
) -> (Vec<Vec<usize>>, Vec<Vec<f32>>) { // (cluster → member indices, centroids)
    let clusterer = Hdbscan::default(embeddings);
    let labels = clusterer.cluster().unwrap();
    let centroids = clusterer.calc_centers(Center::Centroid, &labels).unwrap();

    // Group indices by label, ignoring noise (-1)
    let mut clusters: HashMap<i32, Vec<usize>> = HashMap::new();
    for (i, &label) in labels.iter().enumerate() {
        if label >= 0 {
            clusters.entry(label).or_default().push(i);
        }
    }
    (clusters.into_values().collect(), centroids)
}
```
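`Center::Centroid` is just the per-dimension mean of each cluster's member embeddings. For intuition, the equivalent computation looks like this (a sketch of what the crate computes, not its internals verbatim):

```rust
/// Per-dimension mean of a cluster's member embeddings — the vector the
/// soft-assignment pass then scores every memory against.
fn centroid(embeddings: &[Vec<f32>], members: &[usize]) -> Vec<f32> {
    let dim = embeddings[members[0]].len();
    let mut acc = vec![0.0f32; dim];
    for &i in members {
        for (a, v) in acc.iter_mut().zip(&embeddings[i]) {
            *a += v;
        }
    }
    for a in acc.iter_mut() {
        *a /= members.len() as f32;
    }
    acc
}
```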

After HDBSCAN produces hard clusters, the soft-assignment pass (Phase 3) immediately re-scores all memories — including the ones HDBSCAN assigned — against the computed centroids. So HDBSCAN's hard labels are only used to *define* the centroids. The actual domain assignments always come from the continuous similarity scores.

This works identically for both SQLite and Postgres backends — clustering runs in Rust application code, results are written back to the storage layer.

### Network Transport

#### MCP over Streamable HTTP

Extend the existing Axum server:

```rust
// Alongside existing dashboard routes
let app = Router::new()
    // Existing dashboard
    .route("/api/health", get(health_handler))
    .route("/dashboard/*path", get(dashboard_handler))
    // New: MCP over HTTP
    .route("/mcp", post(mcp_handler).get(mcp_sse_handler))
    // New: REST API
    // X-Vestige-Domain header optionally scopes to a domain
    .route("/api/v1/memories", post(create_memory).get(list_memories))
    .route("/api/v1/memories/:id", get(get_memory).put(update_memory).delete(delete_memory))
    .route("/api/v1/search", post(search_memories))
    .route("/api/v1/consolidate", post(trigger_consolidation))
    .route("/api/v1/stats", get(get_stats))
    .route("/api/v1/domains", get(list_domains))
    .route("/api/v1/domains/discover", post(trigger_discovery))
    .route("/api/v1/domains/:id", put(rename_domain).delete(merge_domain))
    // Auth on everything except health. The middleware uses a State
    // extractor, so it must be registered with from_fn_with_state.
    .layer(middleware::from_fn_with_state(store.clone(), api_key_auth));
```

#### Auth Middleware

```rust
async fn api_key_auth(
    State(store): State<Arc<dyn MemoryStore>>,
    request: axum::extract::Request,
    next: middleware::Next,
) -> Result<Response, StatusCode> {
    // Skip auth for the health endpoint
    if request.uri().path() == "/api/health" {
        return Ok(next.run(request).await);
    }

    // Accept either `Authorization: Bearer <key>` or `X-API-Key: <key>`.
    // The key is copied out so `request` can be moved into next.run below.
    let key = request.headers()
        .get("Authorization")
        .and_then(|v| v.to_str().ok())
        .and_then(|v| v.strip_prefix("Bearer "))
        .or_else(|| request.headers()
            .get("X-API-Key")
            .and_then(|v| v.to_str().ok()))
        .map(str::to_owned);

    // `.await` is not allowed in a match guard, so verify up front.
    let authorized = match &key {
        Some(k) => verify_api_key(store.as_ref(), k).await,
        None => false,
    };

    if authorized {
        Ok(next.run(request).await)
    } else {
        Err(StatusCode::UNAUTHORIZED)
    }
}
```
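`verify_api_key` isn't spelled out here; the intended shape is hash-and-compare, so plaintext keys are never stored. A hypothetical sketch with the hash function abstracted out (the crate dependency list names blake3 for the real hashing; every name here is an assumption):

```rust
/// Hypothetical core of a key check: hash the presented key and look it up
/// among stored hashes. `hash` stands in for blake3 in this sketch.
fn key_is_valid(
    presented: &str,
    stored_hashes: &[String],
    hash: impl Fn(&str) -> String,
) -> bool {
    let hashed = hash(presented);
    stored_hashes.iter().any(|stored| stored == &hashed)
}
```

A production check would additionally enforce the key's scopes and domain restrictions, and use a constant-time comparison to avoid timing leaks.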

#### Client Configuration

```json
// Claude Desktop / Claude Code — single key, all domains
{
  "mcpServers": {
    "vestige": {
      "url": "http://vestige.local:3927/mcp",
      "headers": {
        "Authorization": "Bearer vst_a1b2c3..."
      }
    }
  }
}
```

No domain header needed — searches all domains by default. The MCP tools include an optional `domain` parameter for scoped queries if the LLM or user wants to narrow down.

Alternatively, scope a connection to a specific domain:

```json
// Domain-scoped connection (e.g., for a home automation agent)
{
  "mcpServers": {
    "vestige-home": {
      "url": "http://vestige.local:3927/mcp",
      "headers": {
        "Authorization": "Bearer vst_e5f6g7...",
        "X-Vestige-Domain": "home"
      }
    }
  }
}
```

### Server Configuration

```toml
# vestige.toml — full example for server mode
[server]
bind = "0.0.0.0:3927"            # or mycelium IPv6 address
# tls_cert = "/path/to/cert.pem" # optional
# tls_key = "/path/to/key.pem"

[auth]
enabled = true
# If false, no key required (local-only mode)

[storage]
backend = "postgres"

[storage.postgres]
url = "postgresql://vestige:secret@localhost:5432/vestige"
max_connections = 10

[embeddings]
provider = "fastembed"
model = "BAAI/bge-base-en-v1.5"
```

### CLI Extensions

```bash
# Domain management (mostly automatic, but user can inspect/rename)
vestige domains list
# → dev    Development     (auto)  memories: 87  top: Rust, trait, async, tokio
# → infra  Infrastructure  (auto)  memories: 47  top: BGP, SONiC, VLAN, FRR
# → home   Home            (auto)  memories: 31  top: solar, kWh, pool, ESPHome
# → (unclassified)                 memories: 12

vestige domains rename cluster_0 infra --label "Infrastructure"
vestige domains merge home personal --into home
vestige domains discover --force   # re-run HDBSCAN now

# Key management
vestige keys create --label "macbook"
# → Created key: vst_a1b2c3d4... (store this, shown once)

vestige keys create --label "home-assistant" --scopes read --domains home
# → Created key: vst_e5f6g7h8... (read-only, home domain only)

vestige keys list
# → macbook         vst_a1b2...  scopes: [read,write]  domains: [all]
# → home-assistant  vst_e5f6...  scopes: [read]        domains: [home]

vestige keys revoke vst_a1b2c3d4...

# Migration
vestige migrate --from sqlite --to postgres \
    --sqlite-path ~/.vestige/vestige.db \
    --postgres-url postgresql://localhost/vestige
```

## Implementation Plan

### Phase 1: Storage Trait Extraction

- Define the `MemoryStore` trait (including domain methods)
- Refactor current SQLite code to implement it
- Add `domains TEXT[]` column to existing SQLite schema
- Verify all 29 modules work through the trait (no direct SQLite access)
- **No behavioral changes** — all memories start as unclassified

### Phase 2: PostgreSQL Backend

- Implement `PgMemoryStore`
- Schema migrations (sqlx or refinery)
- `vestige migrate` command for SQLite → Postgres
- Config file support for backend selection

### Phase 3: Network Access

- MCP Streamable HTTP endpoint on existing Axum server
- API key auth middleware + CLI management
- REST API endpoints
- Feature flags for stdio vs HTTP mode

### Phase 4: Emergent Domain Classification

- `DomainClassifier` cognitive module
- HDBSCAN clustering via `hdbscan` crate (runs on both backends)
- Soft assignment pass: score all memories against centroids, threshold into domains
- `domain_scores` JSONB stored per memory for transparency / retroactive re-thresholding
- Domain discovery CLI and dashboard UI
- Auto-classification on ingest (once domains exist)
- Re-clustering during dream consolidation passes
- Domain management CLI (rename, merge, inspect)

### Phase 5: Federation (future)

- Node discovery via Mycelium / mDNS
- Memory sync protocol (UUID-based, last-write-wins)
- Possibly Iroh for content-addressed replication
- FSRS state merge (review history append, not overwrite)

## Crate Dependencies (new)

```toml
# Phase 1 — trait abstraction
trait-variant = "0.1"

# Phase 2 — Postgres
sqlx = { version = "0.8", features = ["runtime-tokio", "postgres", "uuid", "chrono", "json"] }
pgvector = "0.4"  # sqlx integration for vector type

# Phase 3 — Auth
blake3 = "1"      # key hashing
rand = "0.8"      # key generation

# Phase 4 — Domain clustering
hdbscan = "0.10"  # HDBSCAN — no eps tuning, variable density, built-in centroid calc
```

## Open Questions

1. **Trait granularity**: One big `MemoryStore` trait or split into `MemoryStore + SchedulingStore + GraphStore + DomainStore`? Splitting is cleaner but means more `dyn` parameters threading through handlers.

2. **Embedding on insert**: Should the storage backend call fastembed, or should the caller always provide the embedding? Current design says caller provides it, keeping the backend pure storage. But this means every client needs fastembed locally even if the DB is remote. For the server model, having the server compute embeddings makes more sense.

3. **pgvector dimension**: Fixed (e.g., `vector(768)`) or unconstrained (`vector`)? Fixed is faster for HNSW but requires migration if the model changes.

4. **Sync conflict resolution for federation**: LWW per-UUID is simple but lossy. CRDTs would be more correct but massively more complex. For FSRS state specifically, merging review event logs would be ideal.

5. **Dashboard auth**: The 3D dashboard currently runs unauthenticated on localhost. With remote access, it needs the same auth. Should it use the same API keys or have a separate session/cookie mechanism?

6. **HDBSCAN `min_cluster_size`**: The main tuning knob. Too small → noisy micro-clusters. Too large → distinct topics get merged. Default of 10 should work for most cases, but may need a manual override or auto-sweep (run with several values, pick the one with the best silhouette score).

7. **Domain drift**: Over time, the character of a domain changes. How aggressively should re-clustering reshape existing domains? Conservative (only propose splits/merges, never auto-apply) vs. aggressive (auto-reassign memories whose scores drifted below threshold)?

8. **Spreading activation across domains**: When searching within a single domain, should graph edges that cross into other domains be followed? Probably yes for recall quality, but with decaying weight as you cross boundaries.

9. **Threshold tuning**: The `assign_threshold` (0.65 default) determines how many memories are multi-domain vs single-domain vs unclassified. Too low → everything is multi-domain (useless). Too high → too many unclassified. Could be auto-tuned per dataset by targeting a specific unclassified ratio (e.g., "keep fewer than 10% unclassified").
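One way question 9 could be pinned down: sweep the threshold downward over each memory's best (highest) stored domain score until the unclassified fraction drops to the target. A hypothetical sketch, not part of the current design:

```rust
/// Find the highest threshold (sweeping down in 0.05 steps) that leaves at
/// most `target_unclassified` of memories with no domain. `best_scores` holds
/// each memory's highest domain score. Hypothetical auto-tuner.
fn tune_threshold(best_scores: &[f64], target_unclassified: f64) -> f64 {
    let n = best_scores.len() as f64;
    let mut t = 0.90;
    while t > 0.30 {
        let unclassified = best_scores.iter().filter(|&&s| s < t).count() as f64;
        if unclassified / n <= target_unclassified {
            return t;
        }
        t -= 0.05;
    }
    0.30 // floor: below this, assignments stop being meaningful
}
```

Because it only reads stored scores, the sweep is cheap enough to re-run on every consolidation pass.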