docs: ADR 0001 + Phase 1-4 implementation plans

Pluggable storage backend, network access, and emergent domain
classification. Introduces MemoryStore + Embedder traits, PgMemoryStore
alongside SqliteMemoryStore, HTTP MCP + API key auth, and HDBSCAN-based
domain clustering. Phase 5 federation deferred to a follow-up ADR.

- docs/adr/0001-pluggable-storage-and-network-access.md -- Accepted
- docs/plans/0001-phase-1-storage-trait-extraction.md
- docs/plans/0002-phase-2-postgres-backend.md
- docs/plans/0003-phase-3-network-access.md
- docs/plans/0004-phase-4-emergent-domain-classification.md
- docs/prd/001-getting-centralized-vestige.md -- source RFC
Jan De Landtsheer 2026-04-21 20:29:40 +02:00
parent 2391acf480
commit 0d273c5641
GPG key ID: 95CD37F0C226040B
6 changed files with 5667 additions and 0 deletions

@ -0,0 +1,883 @@
# Phase 4 Plan: Emergent Domain Classification
**Status**: Draft
**Depends on**: Phase 1 (domain columns on memories, `Domain` struct + `DomainStore` methods on `MemoryStore`, `Embedder` trait), Phase 2 (Postgres JSONB + TEXT[] support for domain fields, `embedding_model` registry parity), Phase 3 (Axum HTTP server, REST `/api/v1/` scaffolding, API key auth middleware, signed dashboard session cookies)
**Related**: docs/adr/0001-pluggable-storage-and-network-access.md (Phase 4), docs/prd/001-getting-centralized-vestige.md (Emergent Domain Model)
---
## Scope
### In scope
- `DomainClassifier` cognitive module under `crates/vestige-core/src/neuroscience/domain_classifier.rs`, alongside existing neuroscience modules (spreading_activation, synaptic_tagging, ...).
- HDBSCAN discovery pipeline using the `hdbscan` crate (v0.10): load all embeddings, cluster, extract centroids, extract top-terms via TF-IDF over cluster members, persist via the trait's `DomainStore` methods.
- Soft-assignment pipeline: for each memory, compute `cosine_similarity(memory.embedding, domain.centroid)` for every domain, store raw scores in `domain_scores` JSONB, threshold into `domains[]` using `assign_threshold` (default 0.65).
- Automatic classification on ingest: run through `CognitiveEngine` / `smart_ingest` so new memories get classified against existing centroids immediately; skip when `domain_count == 0` (Phase 0 accumulation).
- Re-cluster hook in dream consolidation: every Nth four-phase dream cycle (N=5 default) triggers a discovery pass and generates proposals (split / merge / none). Proposals land in a new `domain_proposals` table, surface in the dashboard, and are never auto-applied (conservative drift, ADR Q7).
- Context signals: `SignalSource` trait with `GitRepoSignal` (detects `.git` in CWD or `metadata.cwd`) and `IdeHintSignal` (reads `metadata.editor` / `metadata.ide`). Each returns a `boost_map` of `domain_id -> additive delta` (typical +0.05). Injected as a `signal_boost: Option<HashMap<String, f64>>` parameter into `DomainClassifier::classify`.
- Cross-domain spreading activation decay: `ActivationNetwork` traversal multiplies the edge's effective weight by `cross_domain_decay` (default 0.5) when both nodes are classified and `target.domains` and `source.domains` are disjoint. A node with no domains on either side incurs no penalty. Strict "no overlap" policy, not graded.
- CLI subcommands (in `crates/vestige-mcp/src/bin/cli.rs`, under a new `Domains` command group): `list`, `discover [--min-cluster-size N] [--force]`, `rename <id> <new_label>`, `merge <a> <b> [--into <id>]`. Human-readable tables on stdout; JSON via `--json`.
- Dashboard UI additions (`apps/dashboard/src/routes/(app)/domains/`): list page, per-domain detail (memories, centroid top_terms, score histogram, proposal review controls).
- REST endpoints under `/api/v1/domains` (introduced by Phase 3 skeleton, implemented in Phase 4): list, discover, rename, merge, proposal list / accept / reject.
- Config additions: `[domains]` section in `vestige.toml` covering `assign_threshold`, `recluster_interval`, `min_cluster_size`, `cross_domain_decay`, `discovery_threshold`, `merge_threshold`, `signal_boost` (per-signal toggle).
### Out of scope
- Phase 5 federation (explicit separate ADR). Domain centroids are installation-local; no sync.
- Learned re-weighting of domain scores (future, only if retrieval-quality metrics show a need).
- Interactive cluster-membership editing in the UI (drag-and-drop reassign) -- future enhancement.
- Multi-user domain namespaces. One domain set per installation; API keys that carry `domain_filter` just restrict access, they do not create namespaces.
- Auto-sweep of `min_cluster_size` / auto-tuned `assign_threshold` (ADR resolution Q6 + Q9: static defaults, user tunes).
- Graded cross-domain decay (`|A intersect B| / max(|A|,|B|)`) -- strict "no overlap" is the Phase 4 rule.
---
## Prerequisites
Artifacts that Phases 1-3 are expected to have landed:
- In `vestige-core`:
  - `Embedder` trait (`crates/vestige-core/src/embedder/`).
  - `MemoryStore` trait (`crates/vestige-core/src/storage/trait.rs` or similar) including `DomainStore` methods: `list_domains`, `get_domain`, `upsert_domain`, `delete_domain`, `classify(&[f32]) -> Vec<(String, f64)>`, plus a bulk accessor such as `all_embeddings()` (already present in sqlite.rs as `get_all_embeddings`) and a `get_all_memories_with_embeddings()` iterator for discovery. The trait must expose a method to batch-update `(domains, domain_scores)` for a memory id.
  - `Domain` struct: `{ id: String, label: String, centroid: Vec<f32>, top_terms: Vec<String>, memory_count: usize, created_at: DateTime<Utc> }`.
  - Columns on memories in both SQLite and Postgres: `domains TEXT[]` (or JSON array on SQLite) and `domain_scores JSONB` (or TEXT JSON on SQLite).
  - The `domains` table in both backends (see PRD schema sketch).
- In `vestige-mcp`:
  - Axum `/api/v1/` router prefix with auth middleware.
  - CLI skeleton (`bin/cli.rs`) using `clap`; Phase 4 adds a `Domains` subcommand tree.
  - REST handlers file structure ready under `crates/vestige-mcp/src/dashboard/handlers.rs` (legacy) and a dedicated REST handler under `/api/v1/`; Phase 4 adds `domains.rs` handler module.
- SvelteKit dashboard (`apps/dashboard/`) with existing `(app)/memories`, `(app)/timeline`, `(app)/stats`, etc. Phase 4 adds `(app)/domains/`.
New workspace crate additions required (added manually to `Cargo.toml`, since `cargo add` is not run from the plan):
- `hdbscan = "0.10"` in `crates/vestige-core/Cargo.toml` (feature-gated behind `domain-classification`).
- Optional: a lightweight stop-word constant inline; no external stop-word crate -- the neuroscience modules already do tokenization on whitespace + length>3 (see `dreams.rs::content_similarity`). Reuse that style; no `ndarray` needed because `hdbscan` v0.10 accepts `&[Vec<f32>]` directly (verified from PRD snippet).
- No new deps in `vestige-mcp` for Phase 4 -- CLI reuses `clap` / `colored` / `comfy-table` if already present, otherwise a hand-rolled padded print. We pick hand-rolled to avoid adding a table crate; this matches the existing style of `run_stats` in `cli.rs`.
Test fixtures:
- A JSON seed corpus checked into `tests/phase_4/fixtures/seed_500.json` containing >= 500 memories drawn from three plausible clusters. A builder function `tests/phase_4/support/fixtures.rs::build_seed_corpus()` deterministically generates or loads this corpus. Each record has `content`, `tags`, `embedding` (768D bge-base-en-v1.5; use a committed vector or a deterministic mock embedder in tests). For deterministic tests we fake embeddings by hashing content -- acceptable as long as the fake preserves cluster separability (prefix-based: "DEV-...", "INFRA-...", "HOME-..." seeds three Gaussian blobs).
- Reuse `Embedder` mock from Phase 1 tests (`MockEmbedder`) for discovery tests that need real cosine similarity.
- A minimal git-repo fixture created in a tempdir (`tempfile::tempdir` + `std::process::Command::new("git").arg("init")`) for context-signal tests.
---
## Deliverables
1. `DomainClassifier` cognitive module: struct, defaults, `classify`, `classify_with_boost`, `reassign_all`, `discover`.
2. `domain_terms` helper (TF-IDF over cluster members, returning `top_k` terms).
3. `cli domains discover` subcommand.
4. `cli domains list` / `rename` / `merge` subcommands.
5. Auto-classify hook on ingest (wired into the cognitive engine's ingest pipeline before persistence).
6. Re-cluster hook in dream consolidation (`DreamEngine::run` orchestrator gets an optional `DreamReClusterHook`; triggers every Nth dream).
7. Context signal extractor module (`crates/vestige-core/src/neuroscience/context_signals.rs`) with `SignalSource` trait + `GitRepoSignal` + `IdeHintSignal`.
8. Cross-domain spreading activation decay in `ActivationNetwork::activate` (config-driven).
9. `vestige.toml` `[domains]` section + defaults loader.
10. Dashboard UI: SvelteKit routes `(app)/domains/+page.svelte` (list), `(app)/domains/[id]/+page.svelte` (detail), `(app)/domains/proposals/+page.svelte` (review).
11. REST endpoints under `/api/v1/domains` + `/api/v1/domains/proposals`.
12. `domain_proposals` table + migration + `DomainProposal` trait methods on `MemoryStore`.
13. WebSocket event `VestigeEvent::DomainProposalCreated` so the dashboard gets a live notification after a re-cluster fires.
---
## Detailed Task Breakdown
### 1. `DomainClassifier` cognitive module
**File**: `crates/vestige-core/src/neuroscience/domain_classifier.rs`
**Export**: in `crates/vestige-core/src/neuroscience/mod.rs`, add `pub mod domain_classifier;` and re-export `pub use domain_classifier::{DomainClassifier, ClassificationResult, DomainProposal, ProposalKind, ProposalStatus};`
**Deps**: `hdbscan = "0.10"`, `serde`, `serde_json`, `chrono`, `tracing`, existing `crate::storage::Domain`, `crate::storage::MemoryStore` trait.
Struct and defaults (match PRD exactly):
```rust
pub struct DomainClassifier {
    pub assign_threshold: f64,      // default 0.65
    pub discovery_threshold: usize, // default 150
    pub recluster_interval: usize,  // default 5 (every 5th dream)
    pub min_cluster_size: usize,    // default 10
    pub min_samples: usize,         // default 5 (HDBSCAN)
    pub cross_domain_decay: f64,    // default 0.5
    pub merge_threshold: f64,       // default 0.90 (centroid cosine)
    pub top_terms_k: usize,         // default 10
}

impl Default for DomainClassifier { ... }
```
Result types:
```rust
use std::collections::HashMap;

use chrono::{DateTime, Utc};

#[derive(Debug, Clone)]
pub struct ClassificationResult {
    pub scores: HashMap<String, f64>, // raw per-domain similarities
    pub domains: Vec<String>,         // above assign_threshold
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProposalKind {
    Split { parent: String, children: Vec<String> },
    Merge { targets: Vec<String>, suggested_label: String },
    NewCluster { top_terms: Vec<String> },
}

// Lifecycle of a proposal; referenced by `DomainProposal::status`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProposalStatus {
    Pending,
    Accepted,
    Rejected,
}

#[derive(Debug, Clone)]
pub struct DomainProposal {
    pub id: String, // uuid v4
    pub kind: ProposalKind,
    pub rationale: String,
    pub confidence: f64,
    pub created_at: DateTime<Utc>,
    pub status: ProposalStatus, // Pending | Accepted | Rejected
}
```
Key methods (all pure where possible; all pub):
```rust
impl DomainClassifier {
    pub fn classify(&self, embedding: &[f32], domains: &[Domain]) -> ClassificationResult;

    pub fn classify_with_boost(
        &self,
        embedding: &[f32],
        domains: &[Domain],
        boost: Option<&HashMap<String, f64>>,
    ) -> ClassificationResult;

    pub async fn reassign_all(
        &self,
        store: &dyn MemoryStore,
        domains: &[Domain],
    ) -> Result<usize, StorageError>;

    pub async fn discover(
        &self,
        store: &dyn MemoryStore,
    ) -> Result<Vec<Domain>, StorageError>;

    pub async fn propose_changes(
        &self,
        store: &dyn MemoryStore,
        existing: &[Domain],
        newly_discovered: &[Domain],
    ) -> Result<Vec<DomainProposal>, StorageError>;

    pub async fn apply_proposal(
        &self,
        store: &dyn MemoryStore,
        proposal: &DomainProposal,
    ) -> Result<(), StorageError>;
}
```
Behavior notes:
- `classify` returns empty `{ scores: {}, domains: [] }` iff `domains.is_empty()` (accumulation phase). This matches the PRD snippet verbatim.
- `classify_with_boost` adds the boost delta to each score AFTER cosine, before thresholding. It clamps to `[0.0, 1.0]`. Boost keys not present in `domains` are ignored.
- `reassign_all` streams memories in batches of 500 (iterator on the store) to keep memory bounded; for each memory issues a single `UPDATE memories SET domains = ?, domain_scores = ? WHERE id = ?` call. Returns count of memories whose `domains` vector actually changed.
- `discover` loads all `(id, embedding)` pairs via an `all_embeddings()` method on the store (exists under `#[cfg(all(feature = "embeddings", feature = "vector-search"))]` in `sqlite.rs::get_all_embeddings`; Phase 1 should promote this onto the trait -- if not yet promoted, add the method). Then:
  1. Build `Vec<Vec<f32>>` and index -> id map.
  2. Build the hyper-parameters with `HdbscanHyperParams::builder().min_cluster_size(self.min_cluster_size).min_samples(self.min_samples).build()` and construct the clusterer with `Hdbscan::new(&embeddings, hyper_params)`. `Hdbscan::default_hyper_params(&embeddings)` returns a ready clusterer with the crate defaults rather than a builder (exact surface depends on hdbscan 0.10; see Open Question).
  3. `let labels = clusterer.cluster()?;`
  4. `let centers = clusterer.calc_centers(Center::Centroid, &labels)?;`
  5. Group indices by label ignoring -1 (noise). For each cluster compute `top_terms` via `compute_top_terms`.
  6. Preserve stable IDs where possible: match each new cluster centroid to the closest existing domain by cosine; if similarity > 0.85, reuse the existing domain id + label. Otherwise generate a fresh id `cluster_{n}` with a label derived from the first 2 terms.
  7. Upsert all resulting `Domain`s via the store.
- `propose_changes` compares old vs new clusters:
  - **Split**: an old domain that best-matches two or more new domains each with >= `min_cluster_size` members. Rationale: "domain `dev` is now 2 clusters of >=10 memories: `systems` and `networking`".
  - **Merge**: two old domains whose centroids now satisfy `cosine > merge_threshold` get a merge proposal.
  - **NewCluster**: a new cluster that doesn't match any old domain above 0.85 similarity.
- `apply_proposal` runs the split or merge against the store (reassign memberships via `reassign_all`), then marks the proposal `Accepted`. It never runs automatically -- only via the CLI or dashboard.
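Step 6 of `discover` (stable-ID reuse) can be sketched as a pure function. This is a minimal illustration, not the planned implementation: `match_existing` and the `(id, centroid)` tuple shape are hypothetical stand-ins for the real `Domain` type, and the 0.85 threshold is passed in by the caller.

```rust
/// Cosine similarity; returns 0.0 on dimension mismatch or zero-norm input.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| (*x as f64) * (*y as f64)).sum();
    let na: f64 = a.iter().map(|x| (*x as f64).powi(2)).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| (*x as f64).powi(2)).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Hypothetical helper: returns the id of the closest existing domain when its
/// centroid similarity exceeds `reuse_threshold`, otherwise `None` (the caller
/// would then mint a fresh `cluster_{n}` id).
pub fn match_existing<'a>(
    new_centroid: &[f32],
    existing: &'a [(String, Vec<f32>)],
    reuse_threshold: f64,
) -> Option<&'a str> {
    existing
        .iter()
        .map(|(id, centroid)| (id.as_str(), cosine(new_centroid, centroid)))
        .filter(|(_, sim)| *sim > reuse_threshold)
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(id, _)| id)
}
```

The sketch does not resolve the case where two new clusters both match the same existing domain; in the plan that situation is what `propose_changes` reports as a split.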
Helper:
```rust
fn compute_top_terms(documents: &[&str], k: usize) -> Vec<String>;
```
Uses TF-IDF. Term frequencies come from the passed-in `documents` slice (the cluster members); IDF comes from the whole corpus, as Section 2 specifies, so the final signature will likely need the precomputed IDF table as an extra argument. Tokenization = lowercase, split on non-alphanumeric characters, drop tokens shorter than 4 chars and a small built-in stop-word list (`the`, `and`, `for`, `that`, `with`, ...). Matches the tokenizer used in `dreams.rs::content_similarity` and `dreams.rs::extract_patterns` so behavior is predictable.
Cosine similarity helper:
```rust
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64;
```
Keep the existing crate-level `cosine_similarity` if already present (check `embeddings::` or `search::`); otherwise add a private one. It returns 0.0 on dimension mismatch; a panic here would be a bug.
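A minimal sketch of the cosine helper together with the boost-then-threshold rule from the behavior notes. Assumptions: `soft_assign` and the `(id, centroid)` tuples are hypothetical simplifications of `classify_with_boost` and `Domain`, and "above threshold" is read as `>=`.

```rust
use std::collections::HashMap;

/// Cosine similarity in f64. Returns 0.0 on dimension mismatch, empty input,
/// or a zero-norm vector instead of panicking.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (x, y) in a.iter().zip(b) {
        let (x, y) = (*x as f64, *y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return 0.0;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Hypothetical stand-in for `classify_with_boost`: score every centroid,
/// add the optional boost after cosine, clamp to [0, 1], then threshold.
pub fn soft_assign(
    embedding: &[f32],
    centroids: &[(String, Vec<f32>)],
    boost: Option<&HashMap<String, f64>>,
    assign_threshold: f64,
) -> (HashMap<String, f64>, Vec<String>) {
    let mut scores = HashMap::new();
    let mut assigned = Vec::new();
    for (id, centroid) in centroids {
        // Boost keys that name no known domain are never looked up, so they are ignored.
        let delta = boost.and_then(|b| b.get(id)).copied().unwrap_or(0.0);
        let score = (cosine_similarity(embedding, centroid) + delta).clamp(0.0, 1.0);
        if score >= assign_threshold {
            assigned.push(id.clone());
        }
        scores.insert(id.clone(), score);
    }
    (scores, assigned)
}
```

With an empty `centroids` slice both return values are empty, which mirrors the accumulation-phase behavior of `classify`.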
### 2. Top-terms computation helper
**File**: same module, private section.
- `fn tokenize(text: &str) -> Vec<String>`: lowercase, split on non-alphanumeric, filter len >= 4, drop stop-words.
- `fn tfidf_top_k(docs: &[&str], k: usize) -> Vec<String>`:
  1. `tf[doc_idx][term] = count / total_terms`.
  2. `df[term] = docs containing term`.
  3. `idf[term] = log((N + 1) / (df[term] + 1)) + 1` (smoothed).
  4. For each term, average `tf` across docs in the cluster; multiply by `idf`; sort desc; return top `k`.
Cluster top-terms are computed over cluster members only, with IDF over the **whole corpus** (all memory contents), not the cluster, so common words get penalized globally. Recompute global IDF once per `discover` call.
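The split between cluster-level TF and corpus-level IDF can be sketched as below. This is an illustration under assumptions: the natural logarithm is used for `log`, the stop-word list is only the five words named above, ties are broken alphabetically for determinism, and passing the IDF table as a parameter is one possible way to share it across clusters.

```rust
use std::collections::{HashMap, HashSet};

// Only the stop-words named in the plan; the real list would be longer.
const STOP_WORDS: &[&str] = &["the", "and", "for", "that", "with"];

/// Lowercase, split on non-alphanumeric, keep tokens of length >= 4, drop stop-words.
pub fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .map(|t| t.to_lowercase())
        .filter(|t| t.chars().count() >= 4 && !STOP_WORDS.contains(&t.as_str()))
        .collect()
}

/// Smoothed IDF over the whole corpus: ln((N + 1) / (df + 1)) + 1.
pub fn global_idf(corpus: &[&str]) -> HashMap<String, f64> {
    let n = corpus.len() as f64;
    let mut df: HashMap<String, usize> = HashMap::new();
    for doc in corpus {
        let unique: HashSet<String> = tokenize(doc).into_iter().collect();
        for term in unique {
            *df.entry(term).or_insert(0) += 1;
        }
    }
    df.into_iter()
        .map(|(term, d)| (term, ((n + 1.0) / (d as f64 + 1.0)).ln() + 1.0))
        .collect()
}

/// Average TF over the cluster's documents, weighted by the global IDF.
pub fn tfidf_top_k(cluster_docs: &[&str], idf: &HashMap<String, f64>, k: usize) -> Vec<String> {
    if cluster_docs.is_empty() {
        return Vec::new();
    }
    let mut tf_sum: HashMap<String, f64> = HashMap::new();
    for doc in cluster_docs {
        let tokens = tokenize(doc);
        let total = tokens.len() as f64;
        for t in tokens {
            *tf_sum.entry(t).or_insert(0.0) += 1.0 / total;
        }
    }
    let n_docs = cluster_docs.len() as f64;
    let mut scored: Vec<(String, f64)> = tf_sum
        .into_iter()
        .map(|(term, sum)| {
            // Terms missing from the IDF table get a neutral weight of 1.0.
            let weight = idf.get(&term).copied().unwrap_or(1.0);
            (term, (sum / n_docs) * weight)
        })
        .collect();
    // Descending score, alphabetical on ties.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    scored.into_iter().take(k).map(|(term, _)| term).collect()
}
```

Computing `global_idf` once per `discover` call and reusing it for every cluster keeps the cost linear in corpus size.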
### 3. CLI subcommand: `vestige domains discover`
**File**: `crates/vestige-mcp/src/bin/cli.rs`
Add to `enum Commands`:
```rust
/// Emergent domain management
Domains {
    #[command(subcommand)]
    action: DomainAction,
},
```
```rust
#[derive(clap::Subcommand)]
enum DomainAction {
    /// List all discovered domains
    List {
        #[arg(long)]
        json: bool,
    },
    /// Run HDBSCAN discovery on all embeddings and propose domains
    Discover {
        #[arg(long, default_value_t = 10)]
        min_cluster_size: usize,
        /// Skip the proposal flow and write new domains directly (first-time use)
        #[arg(long)]
        force: bool,
        #[arg(long)]
        json: bool,
    },
    /// Rename a domain (by id)
    Rename {
        id: String,
        new_label: String,
    },
    /// Merge two domains
    Merge {
        a: String,
        b: String,
        #[arg(long)]
        into: Option<String>, // default: `a`
    },
}
```
Handler plumbing lives in `run_domains(action)` dispatching to `run_domains_list`, `run_domains_discover`, `run_domains_rename`, `run_domains_merge`. Each opens the default `Storage`, constructs a `DomainClassifier::default()`, and invokes the appropriate method.
Output format for `list`:
```
ID              LABEL              MEMORIES   TOP TERMS
dev             Development        87         rust, trait, async, tokio, zinit
infra           Infrastructure     47         bgp, sonic, vlan, frr, peering
home            Home               31         solar, kwh, battery, pool, esphome
(unclassified)                     12
```
Produced via plain `println!` with `{:<15} {:<18} {:<10} {}` style padding (Rust format width specifiers, not printf-style `%-15s`). `--json` emits `serde_json::to_string_pretty(&domains)`.
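A small sketch of the hand-rolled row formatting, assuming column widths of 15, 18 and 10 characters. `format_domain_row` is a hypothetical helper name, not an existing function in `cli.rs`.

```rust
/// Hypothetical helper: left-align id, label and memory count in fixed-width
/// columns, then append the comma-separated top terms.
pub fn format_domain_row(id: &str, label: &str, memories: usize, top_terms: &[&str]) -> String {
    format!(
        "{:<15} {:<18} {:<10} {}",
        id,
        label,
        memories,
        top_terms.join(", ")
    )
}
```

Values longer than the column width are not truncated by `{:<N}`, so very long labels would push later columns to the right.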
Output format for `discover` with `--force`:
```
HDBSCAN: 500 embeddings, min_cluster_size=10, min_samples=5
Found 3 clusters (ignoring 14 noise points)
cluster_0 (N=47) top: bgp, sonic, vlan, frr, peering
cluster_1 (N=31) top: solar, kwh, battery, pool, esphome
cluster_2 (N=22) top: rust, trait, async, tokio, zinit
Writing 3 domains to the store...
Soft-assigning 500 memories against centroids...
multi-domain: 43
single-domain: 412
unclassified (below threshold 0.65): 45
Done in 7.4s.
```
Output format for `discover` without `--force` (post-Phase-0):
```
HDBSCAN: 623 embeddings, min_cluster_size=10
Comparing to existing 3 domains...
Proposals (pending, accept via dashboard or `vestige domains proposals`):
[split] dev -> (systems:34, networking:28) confidence 0.82
[new] cluster_5 (books, novels, reading) confidence 0.71
Run `vestige domains proposals` to review, or open the dashboard.
```
### 4. CLI: `list`, `rename`, `merge`
- `list`: calls `store.list_domains()`, fetches unclassified count via `store.count_memories_without_domains()` (Phase 1 should have provided this; if not, Phase 4 adds it to the trait and both backends).
- `rename`: `store.get_domain(id)` -> mutate `label` -> `store.upsert_domain`. No memory touch.
- `merge`: load both, compute blended centroid (weighted by `memory_count`), merge `top_terms` (union, recompute TF-IDF rank if both sides share the corpus), delete the non-`into` domain, call `reassign_all`. Wrapped in a transaction on Postgres; on SQLite rely on the existing writer-lock pattern.
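The blended centroid used by `merge` can be sketched as a weighted average. `blend_centroids` is a hypothetical helper; falling back to a plain average when both counts are zero is an assumption, not something the plan specifies.

```rust
/// Hypothetical helper: centroid of the merged domain, weighted by each side's
/// `memory_count`. Returns `None` when the dimensions differ.
pub fn blend_centroids(
    a: &[f32],
    a_count: usize,
    b: &[f32],
    b_count: usize,
) -> Option<Vec<f32>> {
    if a.len() != b.len() {
        return None;
    }
    let total = (a_count + b_count) as f32;
    // Assumption: two empty domains are blended with equal weight.
    let (wa, wb) = if total == 0.0 {
        (0.5, 0.5)
    } else {
        (a_count as f32 / total, b_count as f32 / total)
    };
    Some(a.iter().zip(b).map(|(x, y)| x * wa + y * wb).collect())
}
```

The blended vector is generally not unit-length; that does not affect cosine scoring, which normalizes both sides.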
### 5. Auto-classify on ingest
**File**: `crates/vestige-core/src/cognitive.rs` (or equivalent ingest entry in `vestige-mcp/src/tools/smart_ingest.rs`).
Integration point: just before the record is persisted in the smart-ingest path, after the embedder has produced `embedding` and before `storage.insert(...)`. Trace the current call site -- today `Storage::ingest(IngestInput)` computes embedding inside storage; in Phase 1 the embedder becomes external (ADR decision Q2), so classification can hook right there in the cognitive engine.
Pseudocode:
```rust
let embedding = embedder.embed(&input.content).await?;
let domains = store.list_domains().await?;
let (domains_assigned, domain_scores) = if domains.is_empty() {
    (Vec::new(), HashMap::new())
} else {
    let boost = context_signals.gather_boost(&input.metadata, &domains);
    let result = classifier.classify_with_boost(&embedding, &domains, boost.as_ref());
    (result.domains, result.scores)
};
record.embedding = Some(embedding);
record.domains = domains_assigned;
record.domain_scores = domain_scores;
store.insert(&record).await?;
```
Edge cases:
- Accumulation phase (`domains.is_empty()`): skip classification entirely. Zero overhead.
- Embedding failed / skipped: leave `domains = []`, `domain_scores = {}`. Never fail ingest because of classification.
- Metric: emit `VestigeEvent::MemoryClassified { id, domains, top_score }` on the WebSocket bus so the dashboard sees it live.
### 6. Re-cluster hook in dream consolidation
**File**: `crates/vestige-core/src/advanced/dreams.rs` (long file, 1131-line `dream()` entry on the `MemoryDreamer` impl) plus `crates/vestige-core/src/consolidation/phases.rs` (the `DreamEngine::run` orchestrator).
Design: `DreamEngine::run(...)` returns `FourPhaseDreamResult` and does not currently know how many times it has run. Phase 4 introduces a persistent counter on disk (column `dream_cycle_count` on a new singleton `system_state` table, or a simple row in the existing `metadata` / `embedding_model` registry). After the Integration phase finishes, the cognitive engine increments the counter and, if `counter % recluster_interval == 0`, launches discovery asynchronously.
Extension struct in `phases.rs`:
```rust
pub struct DreamReClusterHook<'a> {
    pub classifier: &'a DomainClassifier,
    pub store: &'a dyn MemoryStore,
    pub event_tx: Option<&'a tokio::sync::mpsc::UnboundedSender<VestigeEvent>>,
}

impl<'a> DreamReClusterHook<'a> {
    pub async fn tick(&self, cycle_count: usize) -> Result<Vec<DomainProposal>, StorageError> {
        // An interval of 0 disables re-clustering and avoids a modulo-by-zero panic.
        let interval = self.classifier.recluster_interval;
        if interval == 0 || cycle_count == 0 || cycle_count % interval != 0 {
            return Ok(vec![]);
        }
        let existing = self.store.list_domains().await?;
        let rediscovered = self.classifier.discover(self.store).await?;
        let proposals = self
            .classifier
            .propose_changes(self.store, &existing, &rediscovered)
            .await?;
        for p in &proposals {
            self.store.insert_domain_proposal(p).await?;
            if let Some(tx) = self.event_tx {
                // A closed channel only means no dashboard is listening.
                let _ = tx.send(VestigeEvent::DomainProposalCreated {
                    id: p.id.clone(),
                    kind: format!("{:?}", p.kind),
                    confidence: p.confidence,
                    timestamp: Utc::now(),
                });
            }
        }
        Ok(proposals)
    }
}
```
Caller wires `tick()` after `DreamEngine::run()` returns, at the ingest/consolidation orchestrator level. The hook never mutates existing domains -- it only writes proposals. The acceptance path is manual (CLI or dashboard).
Counter storage: add method `store.bump_dream_cycle_count() -> Result<usize>` returning the new count. Single-row table:
```sql
CREATE TABLE IF NOT EXISTS system_state (
key TEXT PRIMARY KEY,
value TEXT NOT NULL
);
-- seed: ('dream_cycle_count', '0')
```
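The counter bump and the gating rule can be sketched independently of the store. Both function names are hypothetical; the sketch assumes the counter is kept as a decimal string in the `value` column, and that a missing or corrupt value restarts the count rather than failing the dream cycle.

```rust
/// Hypothetical helper: parse the TEXT value stored under `dream_cycle_count`
/// and return the incremented count. A missing or unparsable value restarts at 1.
pub fn next_cycle_count(stored: Option<&str>) -> usize {
    stored
        .and_then(|v| v.trim().parse::<usize>().ok())
        .unwrap_or(0)
        .saturating_add(1)
}

/// Hypothetical helper: true on every Nth cycle. An interval of 0 disables
/// re-clustering instead of triggering a modulo-by-zero panic.
pub fn should_recluster(cycle_count: usize, recluster_interval: usize) -> bool {
    recluster_interval != 0 && cycle_count != 0 && cycle_count % recluster_interval == 0
}
```

With the default interval of 5, cycles 5, 10, 15 and so on trigger a discovery pass.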
### 7. Context signal extractor
**File**: `crates/vestige-core/src/neuroscience/context_signals.rs`
```rust
pub trait SignalSource: Send + Sync {
    /// Returns domain_id -> additive boost (positive or negative, typically in [-0.1, +0.1]).
    fn boost_map(
        &self,
        input_metadata: &serde_json::Value,
        domains: &[Domain],
    ) -> HashMap<String, f64>;

    fn name(&self) -> &'static str;
}

pub struct GitRepoSignal {
    pub boost: f64, // default +0.05
}

pub struct IdeHintSignal {
    pub boost: f64,
}

pub struct ContextSignals {
    sources: Vec<Box<dyn SignalSource>>,
}

impl ContextSignals {
    pub fn gather_boost(
        &self,
        input_metadata: &serde_json::Value,
        domains: &[Domain],
    ) -> Option<HashMap<String, f64>>;
}
```
Signal encoding convention (document in the module header):
- A signal is a **soft prior**. It nudges the post-cosine score by a small additive delta, clamped to `[-0.10, +0.10]` per signal.
- Multiple signals sum, then the final boost per domain is clamped to `[-0.15, +0.15]`, so signals can only tip a memory whose cosine score already lies within 0.15 of `assign_threshold`; the embedding similarity dominates.
- Signals target domains by heuristic: `GitRepoSignal` boosts any domain whose `top_terms` overlaps `{"rust","async","trait","function","class","def","git","commit","fn","code"}`. `IdeHintSignal` does the same for `{"file","line","editor","vscode","neovim","rust-analyzer","lsp"}`.
- All signal boosts are logged via `tracing::debug!` so users can audit why a memory picked up a domain.
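The two clamping stages in the convention above can be sketched as follows. `combine_boosts` is a hypothetical helper that `gather_boost` might delegate to; returning `None` for an empty result is an assumption based on the `Option` in the `gather_boost` signature.

```rust
use std::collections::HashMap;

const PER_SIGNAL_CLAMP: f64 = 0.10;
const COMBINED_CLAMP: f64 = 0.15;

/// Hypothetical helper: clamp each signal's delta, sum per domain, then clamp
/// the combined boost. Returns `None` when no signal fired.
pub fn combine_boosts(per_signal: &[HashMap<String, f64>]) -> Option<HashMap<String, f64>> {
    let mut total: HashMap<String, f64> = HashMap::new();
    for map in per_signal {
        for (domain_id, delta) in map {
            let clamped = (*delta).clamp(-PER_SIGNAL_CLAMP, PER_SIGNAL_CLAMP);
            *total.entry(domain_id.clone()).or_insert(0.0) += clamped;
        }
    }
    for value in total.values_mut() {
        *value = value.clamp(-COMBINED_CLAMP, COMBINED_CLAMP);
    }
    if total.is_empty() { None } else { Some(total) }
}
```

Two signals firing at +0.10 each on the same domain therefore contribute +0.15, not +0.20.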
`GitRepoSignal::boost_map` implementation:
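```rust
use std::collections::HashMap;

use serde_json::Value;

/// Terms that mark a domain as code-related (same set as in the convention above).
const CODE_TERMS: &[&str] = &[
    "rust", "async", "trait", "function", "class", "def", "git", "commit", "fn", "code",
];

fn boost_map(&self, meta: &Value, domains: &[Domain]) -> HashMap<String, f64> {
    // A repo is detected from `<cwd>/.git` on disk or an explicit `git_repo` metadata key.
    let is_git = meta.get("cwd")
        .and_then(|v| v.as_str())
        .map(|cwd| std::path::Path::new(cwd).join(".git").exists())
        .unwrap_or(false)
        || meta.get("git_repo").is_some();
    if !is_git {
        return HashMap::new();
    }
    let mut out = HashMap::new();
    for d in domains {
        let code_hits = d.top_terms.iter()
            .filter(|t| CODE_TERMS.contains(&t.as_str()))
            .count();
        if code_hits > 0 {
            out.insert(d.id.clone(), self.boost);
        }
    }
    out
}
```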
Config knob in `[domains.signals]`: `git = true`, `ide = true`, `git_boost = 0.05`, `ide_boost = 0.05`.
### 8. Cross-domain spreading activation decay
**File**: `crates/vestige-core/src/neuroscience/spreading_activation.rs`
Modify `ActivationConfig`:
```rust
pub struct ActivationConfig {
    pub decay_factor: f64,
    pub max_hops: u32,
    pub min_threshold: f64,
    pub allow_cycles: bool,
    pub cross_domain_decay: f64, // NEW, default 0.5
}
```
Domain metadata on nodes: the current `ActivationNode` has `id`, `activation`, `last_activated`, `edges: Vec<String>`. Phase 4 adds `pub domains: Vec<String>`. Populated when nodes get added (propagated from the memory's `domains` field). The network is rebuilt on each search from the store; if the in-memory network is persisted (check `ActivationNetwork` lifetime in `CognitiveEngine`), the population happens in the engine at boot and on insert.
Traversal change, in `ActivationNetwork::activate` loop, replacing the single line `let propagated = current_activation * edge.strength * self.config.decay_factor;`:
```rust
let cross_penalty = {
    let src_doms = self.nodes.get(&current_id).map(|n| &n.domains);
    let tgt_doms = self.nodes.get(&target_id).map(|n| &n.domains);
    match (src_doms, tgt_doms) {
        (Some(s), Some(t)) if !s.is_empty() && !t.is_empty() => {
            let overlap = s.iter().any(|d| t.contains(d));
            if overlap { 1.0 } else { self.config.cross_domain_decay }
        }
        _ => 1.0, // unclassified on either side: no penalty
    }
};
let propagated = current_activation * edge.strength * self.config.decay_factor * cross_penalty;
```
Rationale for "unclassified -> no penalty": unclassified memories are Phase-0 or low-confidence corpus members; penalizing them would block useful cross-pollination during the accumulation ramp.
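The penalty rule is easy to unit-test when pulled out as a pure function. A minimal sketch, assuming a free function named `cross_domain_penalty` (hypothetical) rather than the inline block inside `ActivationNetwork::activate`:

```rust
/// Hypothetical helper: multiplier applied to an edge during traversal.
/// 1.0 when either side is unclassified or the domain sets overlap,
/// `cross_domain_decay` when both are classified and share no domain.
pub fn cross_domain_penalty(
    source: &[String],
    target: &[String],
    cross_domain_decay: f64,
) -> f64 {
    if source.is_empty() || target.is_empty() {
        return 1.0; // unclassified on either side: no penalty
    }
    if source.iter().any(|d| target.contains(d)) {
        1.0
    } else {
        cross_domain_decay
    }
}
```

For an edge of strength 1.0 with `decay_factor` 0.7, a `dev` to `infra` hop propagates 0.35 of the source activation instead of 0.7.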
API to update a node's domains after reclassification:
```rust
pub fn set_node_domains(&mut self, id: &str, domains: Vec<String>);
```
Called by the reassignment pipeline after `reassign_all`.
### 9. `vestige.toml` `[domains]` section
**File**: wherever `vestige.toml` is loaded (search for `[storage]` / `[server]` loaders). Add:
```toml
[domains]
assign_threshold = 0.65
discovery_threshold = 150
recluster_interval = 5
min_cluster_size = 10
min_samples = 5
cross_domain_decay = 0.5
merge_threshold = 0.90
top_terms_k = 10
[domains.signals]
git = true
ide = true
git_boost = 0.05
ide_boost = 0.05
```
Rust-side: `DomainsConfig { ... }` struct with `serde(default)` so `vestige.toml` without a `[domains]` section falls back to hard-coded defaults. `DomainClassifier::from_config(cfg: &DomainsConfig) -> Self`.
### 10. Dashboard UI additions
**SvelteKit routes** (`apps/dashboard/src/routes/(app)/domains/`):
- `+page.svelte` (list): fetches `GET /api/v1/domains` and `GET /api/v1/domains/unclassified-count`. Renders a table: `label`, `memories`, `top_terms` chips, `created_at`. Each row links to `/domains/[id]`. A "Discover" button posts `POST /api/v1/domains/discover`.
- `[id]/+page.svelte` (detail): fetches `GET /api/v1/domains/:id`, `GET /api/v1/domains/:id/memories?limit=100`, `GET /api/v1/domains/:id/score-histogram`. Renders:
  - Header: label (editable, triggers `PUT /api/v1/domains/:id`), top-terms chips, memory count, created_at.
  - Histogram: a vertical bar chart of `domain_scores[:id]` buckets 0-0.1, 0.1-0.2, ..., 0.9-1.0 across all memories. Data source: server precomputes buckets so the client does not need to fetch all scores.
  - Memory list: paginated, each row shows the raw score for this domain.
- `proposals/+page.svelte`: fetches `GET /api/v1/domains/proposals?status=pending`. Each pending proposal card shows `kind`, `rationale`, `confidence`, `created_at`, buttons "Accept" (posts `POST /api/v1/domains/proposals/:id/accept`) and "Reject" (`POST .../reject`). Live updates via the existing WebSocket channel (`/ws`) reacting to `DomainProposalCreated` events.
Styling reuses the existing Tailwind + shadcn-svelte conventions in `apps/dashboard/src/lib/components/`.
Existing `(app)/stats` and `(app)/feed` pages get a small "Domains" summary panel that links to `/domains`.
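The server-side bucket precomputation for the score histogram could look like the sketch below. `score_histogram` is a hypothetical helper; placing a score of exactly 1.0 in the last bucket and skipping non-finite values are assumptions.

```rust
/// Hypothetical helper: count scores into ten buckets
/// [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]. Non-finite scores are skipped.
pub fn score_histogram(scores: &[f64]) -> [usize; 10] {
    let mut buckets = [0usize; 10];
    for &score in scores {
        if !score.is_finite() {
            continue;
        }
        // Clamp into [0, 1]; `min(9)` puts a score of exactly 1.0 in the last bucket.
        let idx = ((score.clamp(0.0, 1.0) * 10.0) as usize).min(9);
        buckets[idx] += 1;
    }
    buckets
}
```

Returning a fixed-size array keeps the response shape of `/api/v1/domains/:id/score-histogram` constant regardless of corpus size.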
### 11. REST endpoints
**File**: `crates/vestige-mcp/src/protocol/http.rs` or a new `crates/vestige-mcp/src/api/domains.rs` module, wired into the `/api/v1/` router.
| Method | Path | Handler |
|--------|------|---------|
| GET | `/api/v1/domains` | `list_domains` -- returns `[Domain...]` + unclassified count |
| POST | `/api/v1/domains/discover` | `trigger_discover` -- body `{ min_cluster_size?: usize, force?: bool }`, returns proposals or applied domains |
| GET | `/api/v1/domains/:id` | `get_domain` |
| PUT | `/api/v1/domains/:id` | `update_domain` -- rename |
| DELETE | `/api/v1/domains/:id` | `delete_domain` -- with `?merge_into=other_id` |
| GET | `/api/v1/domains/:id/memories` | paginated memories in this domain |
| GET | `/api/v1/domains/:id/score-histogram` | precomputed buckets |
| GET | `/api/v1/domains/proposals` | `list_proposals` -- optional `?status=pending` filter |
| POST | `/api/v1/domains/proposals/:id/accept` | `accept_proposal` |
| POST | `/api/v1/domains/proposals/:id/reject` | `reject_proposal` |
All handlers go through the Phase 3 auth middleware (Bearer / X-API-Key / session cookie). Responses are JSON; error paths use `StatusCode::*` with a small `{"error": "..."}` body.
### 12. `domain_proposals` table + trait methods
Postgres migration (`crates/vestige-core/migrations/postgres/00XX_domain_proposals.sql`):
```sql
CREATE TABLE domain_proposals (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    kind TEXT NOT NULL,                     -- 'split' | 'merge' | 'new_cluster'
    payload JSONB NOT NULL,                 -- serialized ProposalKind body
    rationale TEXT NOT NULL,
    confidence DOUBLE PRECISION NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending', -- pending|accepted|rejected
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    resolved_at TIMESTAMPTZ
);
CREATE INDEX idx_domain_proposals_status ON domain_proposals (status, created_at DESC);
```
SQLite migration: same table, `UUID` -> `TEXT`, `JSONB` -> `TEXT` with JSON-encoded bodies, `TIMESTAMPTZ` -> `TEXT` ISO-8601.
`MemoryStore` trait additions:
```rust
async fn insert_domain_proposal(&self, p: &DomainProposal) -> Result<()>;
async fn list_domain_proposals(&self, status: Option<&str>) -> Result<Vec<DomainProposal>>;
async fn get_domain_proposal(&self, id: &str) -> Result<Option<DomainProposal>>;
async fn set_proposal_status(&self, id: &str, status: &str) -> Result<()>;
```
### 13. WebSocket event for proposals
**File**: `crates/vestige-mcp/src/dashboard/events.rs`
Add variant:
```rust
pub enum VestigeEvent {
    // ... existing ...
    DomainProposalCreated {
        id: String,
        kind: String,
        confidence: f64,
        timestamp: DateTime<Utc>,
    },
    MemoryClassified {
        id: String,
        domains: Vec<String>,
        top_score: f64,
        timestamp: DateTime<Utc>,
    },
}
```
The SvelteKit dashboard's WS client reacts to both events: classified events refresh any open domain-detail page; proposal events push a toast and a badge on the navbar.
---
## Test Plan
Test root: `tests/phase_4/` (a new member of the workspace; mirror the `tests/e2e` layout).
`tests/phase_4/Cargo.toml`:
```toml
[package]
name = "vestige-phase4-tests"
version = "0.0.0"
edition = "2024"
publish = false
[dependencies]
vestige-core = { path = "../../crates/vestige-core", features = ["embeddings", "vector-search", "domain-classification"] }
vestige-mcp = { path = "../../crates/vestige-mcp" }
tokio = { workspace = true }
anyhow = "1"
tempfile = "3"
serde_json = { workspace = true }
uuid = { workspace = true }
```
### Unit tests (colocated in `domain_classifier.rs::tests`, `context_signals.rs::tests`, `spreading_activation.rs::tests`)
Each public function must have at least one test:
- `classify_empty_domains_returns_empty`: `classify(&[0.0; 768], &[])` returns `ClassificationResult { scores: {}, domains: [] }`.
- `classify_single_domain_scores`: one `Domain` with a known centroid; input embedding equal to centroid; expect score 1.0 and `domains == [id]`.
- `classify_multi_domain_overlap`: two domains A, B; input halfway between centroids; expect both scores >= `assign_threshold`; expect `domains == [A, B]` (order not guaranteed).
- `classify_below_threshold_returns_empty_domains_but_scores_filled`: input orthogonal to all centroids; expect `scores` populated, `domains` empty.
- `classify_with_boost_adds_delta`: same input as above, with `boost = {A: 0.4}`; expect A now above threshold, B unchanged.
- `classify_boost_clamps_to_unit`: `boost = {A: 5.0}`; resulting `scores[A]` must be <= 1.0.
- `tfidf_top_k_returns_distinct_terms`: given three fake docs, `top_k=3` returns three non-duplicate strings, in descending TF-IDF order.
- `tfidf_top_k_drops_stopwords`: `["the and for"]` + real content -> stop-words absent.
- `compute_top_terms_handles_empty_cluster`: returns `vec![]` (no panic).
- `signal_git_present_vs_absent`: `GitRepoSignal` given metadata with `.git` in cwd returns non-empty map; without it returns empty.
- `signal_ide_present_vs_absent`: `IdeHintSignal` ditto for `metadata.editor == "vscode"`.
- `signal_combined_clamped`: two signals both firing each at +0.10 -> combined map values <= +0.15.
- `cross_domain_decay_full_weight_on_overlap`: graph with node A in domain `dev`, node B in domain `dev`, edge A->B strength 1.0; after `activate`, B's activation equals the standard `initial * strength * decay_factor` (no extra penalty).
- `cross_domain_decay_half_weight_no_overlap`: A in `dev`, B in `infra`, same edge -> B's activation is 0.5x that of the overlap case.
- `cross_domain_decay_unclassified_no_penalty`: A classified, B unclassified -> full weight.
- `propose_changes_detects_split`: existing domain `dev`; new discovery returns two clusters whose centroids both sit close to old `dev` centroid, each >= min_cluster_size members -> proposal of kind `Split { parent: "dev", children: [a, b] }`.
- `propose_changes_detects_merge`: two existing domains whose new centroids now have cosine > `merge_threshold` -> proposal of kind `Merge`.
- `propose_changes_detects_new_cluster`: a new cluster with no match >= 0.85 to any existing -> `NewCluster`.
- `apply_proposal_split_updates_memberships`: after accept, memories previously in `dev` get reassigned (some to child a, some to child b) via `reassign_all`.
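The `classify_*` cases above can be read against a small reference sketch of cosine scoring with an optional boost. The signature below is an assumption made for illustration: it takes `(id, centroid)` pairs instead of the plan's `Domain` struct and passes `assign_threshold` explicitly.

```rust
use std::collections::HashMap;

/// Cosine similarity; returns 0.0 when either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| f64::from(*x) * f64::from(*y)).sum();
    let na: f64 = a.iter().map(|x| f64::from(*x).powi(2)).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| f64::from(*x).powi(2)).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

pub struct ClassificationResult {
    pub scores: HashMap<String, f64>,
    pub domains: Vec<String>,
}

/// Scores every domain, adds the per-domain boost (if any), caps at 1.0,
/// and assigns each domain whose final score reaches the threshold.
pub fn classify_with_boost(
    embedding: &[f32],
    centroids: &[(String, Vec<f32>)], // hypothetical stand-in for `&[Domain]`
    boost: Option<&HashMap<String, f64>>,
    assign_threshold: f64,
) -> ClassificationResult {
    let mut scores = HashMap::new();
    let mut domains = Vec::new();
    for (id, centroid) in centroids {
        let delta = boost.and_then(|b| b.get(id)).copied().unwrap_or(0.0);
        let score = (cosine(embedding, centroid) + delta).min(1.0);
        if score >= assign_threshold {
            domains.push(id.clone());
        }
        scores.insert(id.clone(), score);
    }
    ClassificationResult { scores, domains }
}
```

Because the boost arrives as a parameter, the function stays pure: the same inputs always produce the same scores, which is what the unit tests rely on.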
### Integration tests (`tests/phase_4/tests/`)
One file per behavior listed in the Phase 4 acceptance sheet.
- `discover_seed_corpus.rs` -- loads the 500-memory fixture, runs `classifier.discover(&store).await`, asserts at least 3 clusters, asserts per-cluster intra-similarity mean > 0.6, asserts discovery wall time < 10s in release. Also asserts `top_terms` for each cluster contains at least one expected keyword per cluster (dev: contains any of `rust/trait/async`; infra: `bgp/vlan/network`; home: `solar/battery/pool`).
- `soft_assign_multi_domain.rs` -- inserts a memory "deploy zinit containers over BGP network"; after classify, `domains` contains both `dev` and `infra` (from a known centroid setup).
- `auto_classify_on_ingest.rs` -- with three existing domains, a fresh `smart_ingest` of a dev-ish sentence ends up with `domains == ["dev"]` and non-empty `domain_scores`.
- `reembed_triggers_recluster.rs` -- after `vestige migrate --reembed`, centroids must be recomputed; verify `list_domains()` returns fresh `centroid` values (different from pre-reembed).
- `dream_consolidation_recluster_hook.rs` -- run 5 dream cycles with heavy synthetic memory insertion; after the 5th, assert `list_domain_proposals("pending")` has at least one proposal.
- `proposal_accept_applies_changes.rs` -- accept a split proposal via `apply_proposal`; verify that memories in `dev` are now distributed across the new children and that the old `dev` domain is removed.
- `proposal_reject_leaves_state.rs` -- reject a proposal; verify all domains and memberships unchanged.
- `drift_is_proposal_only.rs` -- over 5 dream cycles with new inserts, never call accept; verify every memory's `domains` field equals its initial post-discovery value. No auto-apply.
- `cross_domain_activation_decay.rs` -- build an `ActivationNetwork` with two memories linked by a strength-1.0 edge, one in `dev`, one in `infra`; activate the `dev` memory with 1.0; assert the `infra` memory's activation == `0.5 * decay_factor` (0.35 with default decay_factor 0.7). Then set both to `dev` and reassert activation == `0.7`.
- `cli_domains_discover.rs` -- spawn `cargo run -- domains discover --force --json`, parse stdout, assert at least 3 clusters and valid JSON shape.
- `cli_domains_rename_merge.rs` -- happy-path rename then merge, with stdout assertions.
- `context_signal_git_repo.rs` -- ingest the same sentence from inside a tempdir with `.git` vs outside; assert the git-run produces slightly higher `domain_scores` for the code-related domain (diff >= 0.04, matches `git_boost = 0.05`).
- `threshold_tunable.rs` -- same memory, two runs with `assign_threshold = 0.40` vs `0.85`; the low-threshold run assigns more domains than the high-threshold run for the same content.
- `signal_boost_clamped.rs` -- artificially configure `git_boost = 5.0` and assert the resulting per-domain score is still <= 1.0.
- `discover_preserves_stable_ids.rs` -- run discover twice with no new memories; the second run's domain ids match the first's (via centroid-similarity stable-ID matching above 0.85).
### Dashboard UI tests (`tests/phase_4/ui/`)
Use curl-driven smoke tests. This avoids making Playwright a hard dependency of the Phase 4 test suite; the existing Playwright setup at `apps/dashboard/playwright.config.ts` can be extended later.
- `domains_list_renders.sh` -- `curl -H "X-API-Key: $KEY" http://localhost:3927/api/v1/domains` returns 200 + JSON array with expected keys.
- `domain_detail_histogram.sh` -- `curl .../api/v1/domains/dev/score-histogram` returns 10 buckets.
- `proposal_review_flow.sh` -- create a pending proposal via SQL insert; `curl POST .../api/v1/domains/proposals/<id>/accept`; `curl GET .../proposals?status=accepted` shows it.
- `unauth_domain_list_rejected.sh` -- no auth header -> 401.
### Benchmarks (`tests/phase_4/benches/`)
Criterion benches:
- `bench_discover_10k.rs` -- synthetic 10k x 768D embeddings drawn from 5 blobs; assert `discover` wall p95 < 30s on a warm release build.
- `bench_auto_classify_single.rs` -- 20 domains in memory, classify one 768D vector; assert p99 < 5ms.
- `bench_reassign_all.rs` -- 10k memories, 5 domains; assert full `reassign_all` wall time < 90s (a floor of roughly 110 rows/s).
---
## Acceptance Criteria
- [ ] `cargo build -p vestige-core --features domain-classification` zero warnings.
- [ ] `cargo build -p vestige-mcp` zero warnings.
- [ ] `cargo clippy --workspace --all-targets --all-features -- -D warnings` clean.
- [ ] `cargo test -p vestige-phase4-tests` -- all tests in `tests/phase_4/` pass.
- [ ] On a 500+ memory seed corpus covering three natural clusters (dev / infra / home), `vestige domains discover --force` produces sensible top-terms matching the expected keyword sets and labels are stable on a second run.
- [ ] `vestige search` with domain filter `["dev"]` excludes any memory whose `domains` array does not include `dev`.
- [ ] After 5 dream cycles with ongoing inserts, no existing memory's `domains` has silently changed; proposals exist in `domain_proposals` table; accepting a proposal reassigns as described.
- [ ] Cross-domain spreading activation: a query in `dev` that crosses a single edge into an `infra`-only memory still returns the memory but with activation `cross_domain_decay * in-domain_activation`.
- [ ] `vestige domains discover --min-cluster-size 20` produces no more clusters than the default, and with larger per-cluster membership.
- [ ] Dashboard `/dashboard/domains` route renders all domains within 2 seconds on the seed corpus.
- [ ] Proposal UI flow (open pending, accept, confirmed in store) works end-to-end.
- [ ] Benchmarks meet targets (discover 10k p95 < 30s, auto-classify p99 < 5ms).
---
## Rollback Notes
- **Feature gate**: add `domain-classification` to `crates/vestige-core/Cargo.toml`'s `[features]`. When disabled, the `DomainClassifier` module is not compiled, the classification call in the ingest path is a no-op (`#[cfg]`-guarded), and cross-domain decay collapses to `1.0`. The CLI `domains` subcommand emits "domain classification is disabled in this build".
- **Revert strategy**: drop the `domain_proposals` table (Phase 4); the `domains` table, if it was created in Phase 1, is retained. A DOWN migration clears `memories.domains` and `memories.domain_scores`. Existing memories simply lose their domain assignments; all search and retrieval paths work unchanged because `domains = []` is the documented "unclassified" state.
- **Idempotency**: rerunning `discover` is always safe. Cluster numeric IDs may differ between runs, but the stable-ID match by centroid similarity preserves user-assigned labels. Do not persist cluster ids in client-side bookmarks; link via the user-assigned label.
- **Data-loss risk**: `apply_proposal` is a destructive operation (a split deletes the old parent domain; a merge collapses two domains into one). The dashboard's accept button double-confirms with a modal that shows the number of affected memories.
---
## Open Implementation Questions
Each question lists its candidate options, followed by a RECOMMENDATION.
### OQ1. Top-terms extraction: TF-IDF vs BM25 vs frequency?
- TF-IDF with smoothed IDF -- standard, cheap, good-enough.
- BM25 -- better for long-document discrimination, overkill for short memory contents.
- Raw frequency -- noisy; stop-words dominate.
**RECOMMENDATION**: TF-IDF with global IDF over the entire memory corpus (not just cluster members), recomputed once per `discover` call. Same tokenizer as the `dreams.rs::content_similarity` Jaccard for consistency.
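As a rough sketch of this recommendation, the snippet below counts term frequency over the cluster members only and document frequency over the whole corpus. The tokenizer and stop-word list are hypothetical placeholders (not the `dreams.rs` tokenizer), and the smoothing `ln((1 + N) / (1 + df)) + 1` is one common variant chosen here as an assumption.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stop-word list, far shorter than a real one.
const STOP_WORDS: &[&str] = &["the", "and", "for", "a", "of", "to", "in"];

/// Lowercases, splits on non-alphanumerics, and drops stop-words.
fn tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty() && !STOP_WORDS.contains(t))
        .map(str::to_string)
        .collect()
}

/// Top-k terms for one cluster. `cluster` holds indices into `corpus`
/// (out-of-range indices panic in this sketch).
pub fn top_terms(corpus: &[&str], cluster: &[usize], k: usize) -> Vec<String> {
    let n_docs = corpus.len() as f64;
    let tokenized: Vec<Vec<String>> = corpus.iter().map(|d| tokenize(d)).collect();

    // Global document frequency: in how many corpus documents each term occurs.
    let mut df: HashMap<String, usize> = HashMap::new();
    for doc in &tokenized {
        let unique: HashSet<&String> = doc.iter().collect();
        for term in unique {
            *df.entry(term.clone()).or_insert(0) += 1;
        }
    }

    // Term frequency restricted to the cluster members.
    let mut tf: HashMap<&str, usize> = HashMap::new();
    for &i in cluster {
        for term in &tokenized[i] {
            *tf.entry(term.as_str()).or_insert(0) += 1;
        }
    }

    let mut scored: Vec<(&str, f64)> = tf
        .into_iter()
        .map(|(term, count)| {
            let idf = ((1.0 + n_docs) / (1.0 + df[term] as f64)).ln() + 1.0;
            (term, count as f64 * idf)
        })
        .collect();
    // Descending score; ties broken alphabetically so output is deterministic.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    scored.into_iter().take(k).map(|(t, _)| t.to_string()).collect()
}
```

An empty cluster yields an empty vector, which matches the `compute_top_terms_handles_empty_cluster` expectation.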
### OQ2. Proposal persistence: DB table vs in-memory with dashboard notification?
- DB table (`domain_proposals`) -- durable, surfaces across restarts, enables audit.
- In-memory only -- simpler, but loses proposals on server restart.
**RECOMMENDATION**: DB table. Proposals are rare (every 5th dream) and valuable user-facing artifacts; durability is mandatory.
### OQ3. `hdbscan` crate: f32 vs f64 input, exact API surface?
- v0.10 historically takes `&[Vec<f64>]`; embeddings are `Vec<f32>`.
- Cost of converting f32 -> f64 at discovery time: `10k * 768 = 7.68M` f64 values at 8 bytes each, about 61 MB transient, acceptable.
**RECOMMENDATION**: verify v0.10's type signature at implementation time; if it requires f64, perform the conversion in `discover()` behind a single allocation. Document in module header. If the crate API diverged from the PRD snippet, fall back to the manual builder style (`HdbscanHyperParams::builder().min_cluster_size(n).min_samples(s).build()`).
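If the f64 path turns out to be needed, the conversion itself is small. The sketch below uses hypothetical helper names (`to_f64_rows`, `transient_bytes`) and deliberately does not call the `hdbscan` crate, since its exact signature is the open question here.

```rust
/// Widens every embedding to f64. The outer vector is preallocated once;
/// each row still needs its own allocation because the target is `Vec<Vec<f64>>`.
pub fn to_f64_rows(embeddings: &[Vec<f32>]) -> Vec<Vec<f64>> {
    let mut out = Vec::with_capacity(embeddings.len());
    for row in embeddings {
        out.push(row.iter().map(|&x| f64::from(x)).collect());
    }
    out
}

/// Payload size of the converted matrix in bytes (excludes Vec headers).
pub fn transient_bytes(rows: usize, dim: usize) -> usize {
    rows * dim * std::mem::size_of::<f64>()
}
```

For the 10k x 768 case, `transient_bytes(10_000, 768)` evaluates to 61,440,000 bytes, in line with the estimate above.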
### OQ4. Stable domain IDs across discover re-runs?
- Option A: numeric IDs from HDBSCAN labels -- unstable, re-runs shuffle them.
- Option B: hash(top_terms) -- stable if top-terms stable, but top-terms drift.
- Option C (recommended): after computing new centroids, match each to the closest existing domain by centroid cosine; if similarity > 0.85, reuse the existing domain's `id` and `label`. Otherwise mint a fresh `id = "cluster_<uuid>"`.
**RECOMMENDATION**: Option C. Preserves user-assigned labels across drift. Threshold 0.85 is config-tunable via `stable_id_threshold` if needed later.
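Option C could look roughly like the following. `ExistingDomain` and `resolve_identity` are hypothetical names, and the fresh-id generator is passed in as a closure so that the sketch does not depend on a UUID crate.

```rust
/// Hypothetical view of a persisted domain; only the fields the match needs.
pub struct ExistingDomain {
    pub id: String,
    pub label: String,
    pub centroid: Vec<f32>,
}

/// Cosine similarity; returns 0.0 when either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| f64::from(*x) * f64::from(*y)).sum();
    let na: f64 = a.iter().map(|x| f64::from(*x).powi(2)).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| f64::from(*x).powi(2)).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Returns `(id, label)` for a freshly computed centroid: the closest existing
/// domain's identity if its similarity exceeds `threshold`, otherwise a new id
/// from `mint_id` and no label.
pub fn resolve_identity(
    new_centroid: &[f32],
    existing: &[ExistingDomain],
    threshold: f64,
    mint_id: impl FnOnce() -> String,
) -> (String, Option<String>) {
    let best = existing
        .iter()
        .map(|d| (d, cosine(new_centroid, &d.centroid)))
        .max_by(|a, b| a.1.total_cmp(&b.1));
    match best {
        Some((d, sim)) if sim > threshold => (d.id.clone(), Some(d.label.clone())),
        _ => (mint_id(), None),
    }
}
```

This sketch resolves one centroid at a time. It does not prevent two new clusters from claiming the same existing domain, so a real implementation would need to resolve such conflicts, for example by assigning matches greedily in descending similarity.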
### OQ5. Context signal injection site: ingest handler vs embedder vs classifier?
- Embedder -- would alter embedding; signals are not about embedding quality.
- Ingest handler -- signals known there, but then `DomainClassifier` cannot be tested in isolation.
- Classifier as a `classify_with_boost(boost: Option<&HashMap>)` parameter -- pure, testable, composable.
**RECOMMENDATION**: classifier parameter. The cognitive engine constructs the boost map via `ContextSignals::gather_boost(&metadata, &domains)` and hands it to the classifier. Keeps the classifier stateless w.r.t. signals.
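A minimal sketch of how the boost map might be assembled before it reaches the classifier: per-signal maps are summed per domain and then capped. `combine_boosts` is a hypothetical helper, and the 0.15 cap used in the usage note is taken from the `signal_combined_clamped` test expectation rather than from a confirmed config key.

```rust
use std::collections::HashMap;

/// Sums the per-domain deltas from every signal, then caps each total at `cap`.
pub fn combine_boosts(signals: &[HashMap<String, f64>], cap: f64) -> HashMap<String, f64> {
    let mut combined: HashMap<String, f64> = HashMap::new();
    for signal in signals {
        for (domain, delta) in signal {
            *combined.entry(domain.clone()).or_insert(0.0) += *delta;
        }
    }
    for total in combined.values_mut() {
        *total = (*total).min(cap);
    }
    combined
}
```

With two signals each contributing +0.10 to `dev` and a cap of 0.15, the combined entry for `dev` is 0.15; a single signal passes through unchanged at 0.10.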
### OQ6. Re-cluster proposal cadence: event-based (every Nth dream) vs time-based (weekly)?
- ADR resolution Q7: every Nth dream (N=5 default).
- Alternative: once per week regardless of dream cadence.
**RECOMMENDATION**: stick with every Nth dream. Users who dream rarely re-cluster rarely -- that matches the philosophy ("memory work triggers memory bookkeeping"). Note the alternative as future consideration; if users complain about never seeing proposals, add a time-based fallback.
### OQ7. Minimum corpus size for first discover?
- PRD default: 150.
- Too low -> noisy initial clusters, proposals every dream.
- Too high -> user waits forever for domains to appear.
**RECOMMENDATION**: 150 as the default discovery gate. With `min_cluster_size=10`, HDBSCAN is expected to yield few or no clusters on very small corpora (the working assumption is fewer than roughly 100 memories), so the system gracefully produces no domains until the corpus is large enough. Test with `N=80, 150, 500` in `threshold_tunable.rs` to confirm sensible behavior.
### OQ8. Cross-domain decay: strict no-overlap vs graded?
- Strict: `1.0` if any overlap, `cross_domain_decay` otherwise.
- Graded: `max(cross_domain_decay, |A intersect B| / max(|A|, |B|))`.
**RECOMMENDATION**: strict for Phase 4. Easier to reason about, easier to tune, easier to test. Graded is a marked future enhancement; file an issue if retrieval-quality metrics justify it.
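The two candidates differ only in the multiplier applied to an edge. The sketch below writes both as free functions over domain sets; the function names are hypothetical, and unclassified nodes (empty sets) receive no penalty, as the unit tests require.

```rust
use std::collections::HashSet;

/// Strict rule: full weight on any overlap or when either side is
/// unclassified; `cross_domain_decay` otherwise.
pub fn strict_factor(a: &HashSet<&str>, b: &HashSet<&str>, cross_domain_decay: f64) -> f64 {
    if a.is_empty() || b.is_empty() || !a.is_disjoint(b) {
        1.0
    } else {
        cross_domain_decay
    }
}

/// Graded rule: max(cross_domain_decay, |A ∩ B| / max(|A|, |B|)).
pub fn graded_factor(a: &HashSet<&str>, b: &HashSet<&str>, cross_domain_decay: f64) -> f64 {
    if a.is_empty() || b.is_empty() {
        return 1.0;
    }
    let overlap = a.intersection(b).count() as f64;
    let denom = a.len().max(b.len()) as f64;
    (overlap / denom).max(cross_domain_decay)
}

/// Activation reaching the neighbour across one edge.
pub fn propagate(initial: f64, strength: f64, decay_factor: f64, domain_factor: f64) -> f64 {
    initial * strength * decay_factor * domain_factor
}
```

With `decay_factor = 0.7` and `cross_domain_decay = 0.5`, a `dev` -> `infra` edge of strength 1.0 carries `1.0 * 1.0 * 0.7 * 0.5 = 0.35` under the strict rule, and 0.7 when both nodes share `dev`.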
### OQ9. Classifier invocation from remote HTTP clients?
- In server mode, an agent posts `smart_ingest` -> server embeds -> server classifies.
- All the work stays server-side; MCP clients never do classification.
**RECOMMENDATION**: confirmed server-side-only. Document in the MCP tool schema that `smart_ingest` now returns `domains` and `domain_scores` in its response so clients can display the classification to the user.
### OQ10. Where to store the dream-cycle counter?
- In-memory on `CognitiveEngine` -- lost on restart, miscounts cadence.
- New `system_state` singleton table.
**RECOMMENDATION**: `system_state` table. Survives restarts. Also useful for future metrics (total memories ever, total dreams ever).
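The counter logic itself is tiny; what matters is that the count is loaded from and written back to durable storage. In the sketch below, `SystemState` is a hypothetical in-memory stand-in for the singleton row, and treating `every_n == 0` as "never re-cluster" is an assumption.

```rust
/// Stand-in for the persisted singleton row (assumption: one counter column).
#[derive(Default)]
pub struct SystemState {
    pub dream_count: u64,
}

impl SystemState {
    /// Records one completed dream cycle and reports whether a re-cluster
    /// proposal pass is due. `every_n == 0` disables the hook.
    pub fn record_dream(&mut self, every_n: u64) -> bool {
        self.dream_count += 1;
        every_n != 0 && self.dream_count % every_n == 0
    }
}
```

Because the count survives restarts, a server that stops after four dreams and restarts still fires the hook on the next one: `SystemState { dream_count: 4 }.record_dream(5)` returns `true`.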
### OQ11. Scope of `reassign_all` after a proposal accept vs a normal discover?
- On discover --force (first-time), run `reassign_all` against all memories.
- On proposal accept (split / merge), run `reassign_all` only on affected memories (parent's members for split; both parents' members for merge) to avoid touching unrelated records.
**RECOMMENDATION**: scoped reassignment where possible; fall back to full `reassign_all` only on `discover --force` or when the set of domains has fundamentally changed. Reduces write amplification on large corpora.
### OQ12. Proposal freshness?
- Multiple re-clusters could stack up pending proposals.
**RECOMMENDATION**: before inserting a new proposal, check for existing pending proposals with the same `kind + targets`; if present, bump `created_at` and `confidence` instead of creating a duplicate. Add a `confidence_history` array in the `payload` JSONB for audit.
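The dedup rule can be sketched as an upsert over the pending set. Everything here is illustrative: `PendingProposal` and `upsert_proposal` are hypothetical, timestamps are plain integers, and pushing the previous confidence into the history before overwriting it is an assumption about what the audit trail should record.

```rust
#[derive(Debug, Clone)]
pub struct PendingProposal {
    pub kind: String,
    pub targets: Vec<String>, // stored sorted so comparison ignores order
    pub confidence: f64,
    pub created_at: u64, // unix seconds, stand-in for TIMESTAMPTZ
    pub confidence_history: Vec<f64>,
}

/// Inserts a proposal unless a pending one with the same kind and target set
/// exists; in that case refreshes it. Returns true if a new row was added.
pub fn upsert_proposal(
    pending: &mut Vec<PendingProposal>,
    kind: &str,
    targets: &[&str],
    confidence: f64,
    now: u64,
) -> bool {
    // Normalise the target set so ["a", "b"] and ["b", "a"] compare equal.
    let mut key: Vec<String> = targets.iter().map(|t| t.to_string()).collect();
    key.sort();
    key.dedup();

    if let Some(existing) = pending.iter_mut().find(|p| p.kind == kind && p.targets == key) {
        existing.confidence_history.push(existing.confidence);
        existing.confidence = confidence;
        existing.created_at = now;
        return false;
    }
    pending.push(PendingProposal {
        kind: kind.to_string(),
        targets: key,
        confidence,
        created_at: now,
        confidence_history: Vec::new(),
    });
    true
}
```

Repeated re-cluster passes that reach the same conclusion then keep a single pending row whose history shows how the confidence moved over time.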
---
## Implementation Sequencing (suggested order)
1. Land the `DomainClassifier` struct, `classify` / `classify_with_boost`, unit tests. (Day 1)
2. Add `compute_top_terms` + TF-IDF helper, tests. (Day 1)
3. Wire `discover` end-to-end against SQLite; `discover_seed_corpus` integration test. (Day 2)
4. Add `domain_proposals` table migrations + trait methods; both backends. (Day 2)
5. Implement `propose_changes` + `apply_proposal`; proposal unit tests. (Day 3)
6. Context signals module + tests. (Day 3)
7. Hook classifier into ingest path; `auto_classify_on_ingest` integration test. (Day 4)
8. Cross-domain decay in spreading activation; unit + integration tests. (Day 4)
9. Dream re-cluster hook + `system_state` counter; integration tests for drift-only behavior. (Day 5)
10. CLI subcommands. (Day 6)
11. REST endpoints. (Day 6)
12. SvelteKit dashboard routes + WebSocket event wiring. (Day 7-8)
13. Benchmarks + acceptance sweep on the 500-memory seed. (Day 9)
---
## File Map (everything Phase 4 touches or creates)
Creates:
- `crates/vestige-core/src/neuroscience/domain_classifier.rs`
- `crates/vestige-core/src/neuroscience/context_signals.rs`
- `crates/vestige-core/migrations/postgres/00XX_domain_proposals.sql`
- `crates/vestige-core/migrations/sqlite/00XX_domain_proposals.sql` (or inline in `storage/migrations.rs`)
- `crates/vestige-mcp/src/api/domains.rs` (REST handlers)
- `apps/dashboard/src/routes/(app)/domains/+page.svelte`
- `apps/dashboard/src/routes/(app)/domains/[id]/+page.svelte`
- `apps/dashboard/src/routes/(app)/domains/proposals/+page.svelte`
- `apps/dashboard/src/lib/api/domains.ts`
- `tests/phase_4/Cargo.toml`
- `tests/phase_4/tests/*.rs` (per the Integration test list)
- `tests/phase_4/fixtures/seed_500.json`
- `tests/phase_4/support/fixtures.rs`
Modifies:
- `crates/vestige-core/Cargo.toml` -- add `hdbscan = "0.10"` under a new `domain-classification` feature.
- `crates/vestige-core/src/neuroscience/mod.rs` -- register new modules, re-exports.
- `crates/vestige-core/src/neuroscience/spreading_activation.rs` -- `cross_domain_decay` field in `ActivationConfig`, `domains` field on `ActivationNode`, decay math in `activate`.
- `crates/vestige-core/src/consolidation/phases.rs` -- `DreamReClusterHook`.
- `crates/vestige-core/src/advanced/dreams.rs` -- accept a hook callback from the orchestrator (if the orchestration is done at this level).
- `crates/vestige-core/src/storage/trait.rs` -- add proposal + system_state methods.
- `crates/vestige-core/src/storage/sqlite.rs` -- implement proposal + system_state methods + `all_embeddings_with_meta` if not already on the trait.
- `crates/vestige-core/src/storage/postgres.rs` (Phase 2) -- same.
- `crates/vestige-core/src/lib.rs` -- re-exports.
- `crates/vestige-core/src/cognitive.rs` (or equivalent ingest orchestrator) -- auto-classify injection.
- `crates/vestige-mcp/src/bin/cli.rs` -- `Domains` subcommand + dispatch.
- `crates/vestige-mcp/src/dashboard/mod.rs` -- wire new REST routes.
- `crates/vestige-mcp/src/dashboard/events.rs` -- new event variants.
- `crates/vestige-mcp/src/dashboard/handlers.rs` -- if legacy dashboard gets a domains panel (optional).
- `vestige.toml` config loader -- `[domains]` section + struct + defaults.
- Root `Cargo.toml` workspace members -- add `tests/phase_4`.
---
## Risks
- **HDBSCAN determinism**: HDBSCAN is deterministic given input order; sorting embeddings by memory id before feeding the clusterer guarantees reproducibility across runs -- do this in `discover()` and document it.
- **Embedding dimension drift**: Phase 1's `embedding_model` registry blocks writes from mismatched embedders. If `discover()` ever sees two dimensions, it bails with a clear error and points at `vestige migrate --reembed`.
- **Classification latency on ingest**: `classify` is O(n_domains * dim). At the expected scale, 20 domains * 768 dimensions is about 15k multiply-adds per classification, which is trivial; even thousands of domains (unlikely but possible) stay in the low millions. Still, expose a `classify_budget_ms` config knob for paranoia.
- **Re-cluster proposal storms**: if the corpus is borderline-stable, small changes can produce conflicting proposals on consecutive dreams. Mitigation: OQ12 (dedup by target set, bump confidence instead of stacking).
- **Dashboard feature gap**: if the SvelteKit app lands with the domains route but the REST endpoints are not yet deployed, the route 404s. Mitigation: ship the REST endpoints in the same release; a feature flag on the client toggles the nav entry.
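The first two mitigations (stable input order and a single embedding dimension) can share one preparation step ahead of clustering. The sketch below is an assumption about shape only: `prepare_for_clustering` is a hypothetical helper, and the error is a plain `String` rather than the project's error type.

```rust
/// Sorts `(memory_id, embedding)` rows by id so the clusterer always sees the
/// same order, and rejects corpora that mix embedding dimensions.
pub fn prepare_for_clustering(
    mut rows: Vec<(String, Vec<f32>)>,
) -> Result<Vec<(String, Vec<f32>)>, String> {
    rows.sort_by(|a, b| a.0.cmp(&b.0));

    if let Some(first) = rows.first() {
        let dim = first.1.len();
        if let Some(bad) = rows.iter().find(|r| r.1.len() != dim) {
            return Err(format!(
                "embedding dimension mismatch: memory {} has {} dims, expected {}; \
                 run `vestige migrate --reembed`",
                bad.0,
                bad.1.len(),
                dim
            ));
        }
    }
    Ok(rows)
}
```

Sorting by id is a lexicographic string sort here; any total order works as long as it is the same on every run.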
---
## Non-Goals Reminder
- No Phase 5 federation concerns in this plan.
- No cross-installation domain sync.
- No automatic accept of proposals, ever.
- No graded cross-domain decay; strict only.
- No ML-based domain label suggestion (top-terms are enough for v1).
- No editing individual memory memberships from the UI in this phase.