Pluggable storage backend, network access, and emergent domain classification. Introduces MemoryStore + Embedder traits, PgMemoryStore alongside SqliteMemoryStore, HTTP MCP + API key auth, and HDBSCAN-based domain clustering. Phase 5 federation deferred to a follow-up ADR. - docs/adr/0001-pluggable-storage-and-network-access.md -- Accepted - docs/plans/0001-phase-1-storage-trait-extraction.md - docs/plans/0002-phase-2-postgres-backend.md - docs/plans/0003-phase-3-network-access.md - docs/plans/0004-phase-4-emergent-domain-classification.md - docs/prd/001-getting-centralized-vestige.md -- source RFC
Phase 4 Plan: Emergent Domain Classification
Status: Draft
Depends on: Phase 1 (domain columns on memories, Domain struct + DomainStore methods on MemoryStore, Embedder trait), Phase 2 (Postgres JSONB + TEXT[] support for domain fields, embedding_model registry parity), Phase 3 (Axum HTTP server, REST /api/v1/ scaffolding, API key auth middleware, signed dashboard session cookies)
Related: docs/adr/0001-pluggable-storage-and-network-access.md (Phase 4), docs/prd/001-getting-centralized-vestige.md (Emergent Domain Model)
Scope
In scope
- DomainClassifier cognitive module under crates/vestige-core/src/neuroscience/domain_classifier.rs, alongside existing neuroscience modules (spreading_activation, synaptic_tagging, ...).
- HDBSCAN discovery pipeline using the hdbscan crate (v0.10): load all embeddings, cluster, extract centroids, extract top-terms via TF-IDF over cluster members, persist via the trait's DomainStore methods.
- Soft-assignment pipeline: for each memory, compute cosine_similarity(memory.embedding, domain.centroid) for every domain, store raw scores in domain_scores JSONB, threshold into domains[] using assign_threshold (default 0.65).
- Automatic classification on ingest: run through CognitiveEngine / smart_ingest so new memories get classified against existing centroids immediately; skip when domain_count == 0 (Phase 0 accumulation).
- Re-cluster hook in dream consolidation: every Nth four-phase dream cycle (N=5 default) triggers a discovery pass and generates proposals (split / merge / none). Proposals land in a new domain_proposals table, surface in the dashboard, and are never auto-applied (conservative drift, ADR Q7).
- Context signals: SignalSource trait with GitRepoSignal (detects .git in CWD or metadata.cwd) and IdeHintSignal (reads metadata.editor / metadata.ide). Each returns a boost_map of domain_id -> additive delta (typical +0.05). Injected as a signal_boost: Option<HashMap<String, f64>> parameter into DomainClassifier::classify.
- Cross-domain spreading activation decay: ActivationNetwork traversal multiplies the edge's effective weight by cross_domain_decay (default 0.5) when target.domains and source.domains are disjoint. Strict "no overlap" policy, not graded.
- CLI subcommands (in crates/vestige-mcp/src/bin/cli.rs, under a new Domains command group): list, discover [--min-cluster-size N] [--force], rename <id> <new_label>, merge <a> <b> [--into <id>]. Human-readable tables on stdout; JSON via --json.
- Dashboard UI additions (apps/dashboard/src/routes/(app)/domains/): list page, per-domain detail (memories, centroid top_terms, score histogram, proposal review controls).
- REST endpoints under /api/v1/domains (introduced by Phase 3 skeleton, implemented in Phase 4): list, discover, rename, merge, proposal list / accept / reject.
- Config additions: [domains] section in vestige.toml covering assign_threshold, recluster_interval, min_cluster_size, cross_domain_decay, discovery_threshold, merge_threshold, signal_boost (per-signal toggle).
Out of scope
- Phase 5 federation (explicit separate ADR). Domain centroids are installation-local; no sync.
- Learned re-weighting of domain scores (future, only if retrieval-quality metrics show a need).
- Interactive cluster-membership editing in the UI (drag-and-drop reassign) -- future enhancement.
- Multi-user domain namespaces. One domain set per installation; API keys that carry domain_filter just restrict access, they do not create namespaces.
- Auto-sweep of min_cluster_size / auto-tuned assign_threshold (ADR resolution Q6 + Q9: static defaults, user tunes).
- Graded cross-domain decay (|A intersect B| / max(|A|,|B|)) -- strict "no overlap" is the Phase 4 rule.
Prerequisites
Artifacts that Phases 1-3 are expected to have landed:
- In vestige-core:
  - Embedder trait (crates/vestige-core/src/embedder/).
  - MemoryStore trait (crates/vestige-core/src/storage/trait.rs or similar) including DomainStore methods: list_domains, get_domain, upsert_domain, delete_domain, classify(&[f32]) -> Vec<(String, f64)>, plus a bulk accessor such as all_embeddings() (already present in sqlite.rs as get_all_embeddings) and a get_all_memories_with_embeddings() iterator for discovery. The trait must expose a method to batch-update (domains, domain_scores) for a memory id.
  - Domain struct: { id: String, label: String, centroid: Vec<f32>, top_terms: Vec<String>, memory_count: usize, created_at: DateTime<Utc> }.
  - Columns on memories in both SQLite and Postgres: domains TEXT[] (or JSON array on SQLite) and domain_scores JSONB (or TEXT JSON on SQLite).
  - The domains table in both backends (see PRD schema sketch).
- In vestige-mcp:
  - Axum /api/v1/ router prefix with auth middleware.
  - CLI skeleton (bin/cli.rs) using clap; Phase 4 adds a Domains subcommand tree.
  - REST handlers file structure ready under crates/vestige-mcp/src/dashboard/handlers.rs (legacy) and a dedicated REST handler under /api/v1/; Phase 4 adds domains.rs handler module.
  - SvelteKit dashboard (apps/dashboard/) with existing (app)/memories, (app)/timeline, (app)/stats, etc. Phase 4 adds (app)/domains/.
New workspace crate additions required (added manually to Cargo.toml, since cargo add is not run from the plan):
- hdbscan = "0.10" in crates/vestige-core/Cargo.toml (feature-gated behind domain-classification).
- Optional: a lightweight stop-word constant inline; no external stop-word crate -- the neuroscience modules already do tokenization on whitespace + length>3 (see dreams.rs::content_similarity). Reuse that style; no ndarray needed because hdbscan v0.10 accepts &[Vec<f32>] directly (verified from PRD snippet).
- No new deps in vestige-mcp for Phase 4 -- CLI reuses clap / colored / comfy-table if already present, otherwise a hand-rolled padded print. We pick hand-rolled to avoid adding a table crate; this matches the existing style of run_stats in cli.rs.
Test fixtures:
- A JSON seed corpus checked into tests/phase_4/fixtures/seed_500.json containing >= 500 memories drawn from three plausible clusters. A builder function tests/phase_4/support/fixtures.rs::build_seed_corpus() deterministically generates or loads this corpus. Each record has content, tags, embedding (768D bge-base-en-v1.5; use a committed vector or a deterministic mock embedder in tests). For deterministic tests we fake embeddings by hashing content -- acceptable as long as the fake preserves cluster separability (prefix-based: "DEV-...", "INFRA-...", "HOME-..." seeds three Gaussian blobs).
- Reuse Embedder mock from Phase 1 tests (MockEmbedder) for discovery tests that need real cosine similarity.
- A minimal git-repo fixture created in a tempdir (tempfile::tempdir + std::process::Command::new("git").arg("init")) for context-signal tests.
Deliverables
- DomainClassifier cognitive module: struct, defaults, classify, classify_with_boost, reassign_all, discover.
- domain_terms helper (TF-IDF over cluster members, returning top_k terms).
- cli domains discover subcommand.
- cli domains list / rename / merge subcommands.
- Auto-classify hook on ingest (wired into the cognitive engine's ingest pipeline before persistence).
- Re-cluster hook in dream consolidation (DreamEngine::run orchestrator gets an optional DreamReClusterHook; triggers every Nth dream).
- Context signal extractor module (crates/vestige-core/src/neuroscience/context_signals.rs) with SignalSource trait + GitRepoSignal + IdeHintSignal.
- Cross-domain spreading activation decay in ActivationNetwork::activate (config-driven).
- vestige.toml [domains] section + defaults loader.
- Dashboard UI: SvelteKit routes (app)/domains/+page.svelte (list), (app)/domains/[id]/+page.svelte (detail), (app)/domains/proposals/+page.svelte (review).
- REST endpoints under /api/v1/domains + /api/v1/domains/proposals.
- domain_proposals table + migration + DomainProposal trait methods on MemoryStore.
- WebSocket event VestigeEvent::DomainProposalCreated so the dashboard gets a live notification after a re-cluster fires.
Detailed Task Breakdown
1. DomainClassifier cognitive module
File: crates/vestige-core/src/neuroscience/domain_classifier.rs
Export: in crates/vestige-core/src/neuroscience/mod.rs, add pub mod domain_classifier; and re-export pub use domain_classifier::{DomainClassifier, ClassificationResult, DomainProposal, ProposalKind};
Deps: hdbscan = "0.10", serde, serde_json, chrono, tracing, existing crate::storage::Domain, crate::storage::MemoryStore trait.
Struct and defaults (match PRD exactly):
pub struct DomainClassifier {
pub assign_threshold: f64, // default 0.65
pub discovery_threshold: usize, // default 150
pub recluster_interval: usize, // default 5 (every 5th dream)
pub min_cluster_size: usize, // default 10
pub min_samples: usize, // default 5 (HDBSCAN)
pub cross_domain_decay: f64, // default 0.5
pub merge_threshold: f64, // default 0.90 (centroid cosine)
pub top_terms_k: usize, // default 10
}
impl Default for DomainClassifier { ... }
Result types:
#[derive(Debug, Clone)]
pub struct ClassificationResult {
pub scores: HashMap<String, f64>, // raw per-domain similarities
pub domains: Vec<String>, // above assign_threshold
}
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProposalKind {
Split { parent: String, children: Vec<String> },
Merge { targets: Vec<String>, suggested_label: String },
NewCluster { top_terms: Vec<String> },
}
#[derive(Debug, Clone)]
pub struct DomainProposal {
pub id: String, // uuid v4
pub kind: ProposalKind,
pub rationale: String,
pub confidence: f64,
pub created_at: DateTime<Utc>,
pub status: ProposalStatus, // Pending | Accepted | Rejected
}
Key methods (all pure where possible; all pub):
impl DomainClassifier {
pub fn classify(&self, embedding: &[f32], domains: &[Domain]) -> ClassificationResult;
pub fn classify_with_boost(
&self,
embedding: &[f32],
domains: &[Domain],
boost: Option<&HashMap<String, f64>>,
) -> ClassificationResult;
pub async fn reassign_all(
&self,
store: &dyn MemoryStore,
domains: &[Domain],
) -> Result<usize, StorageError>;
pub async fn discover(
&self,
store: &dyn MemoryStore,
) -> Result<Vec<Domain>, StorageError>;
pub async fn propose_changes(
&self,
store: &dyn MemoryStore,
existing: &[Domain],
newly_discovered: &[Domain],
) -> Result<Vec<DomainProposal>, StorageError>;
pub async fn apply_proposal(
&self,
store: &dyn MemoryStore,
proposal: &DomainProposal,
) -> Result<(), StorageError>;
}
Behavior notes:
- classify returns empty { scores: {}, domains: [] } iff domains.is_empty() (accumulation phase). This matches the PRD snippet verbatim.
- classify_with_boost adds the boost delta to each score AFTER cosine, before thresholding. It clamps to [0.0, 1.0]. Boost keys not present in domains are ignored.
- reassign_all streams memories in batches of 500 (iterator on the store) to keep memory bounded; for each memory issues a single UPDATE memories SET domains = ?, domain_scores = ? WHERE id = ? call. Returns count of memories whose domains vector actually changed.
- discover loads all (id, embedding) pairs via an all_embeddings() method on the store (exists under #[cfg(all(feature = "embeddings", feature = "vector-search"))] in sqlite.rs::get_all_embeddings; Phase 1 should promote this onto the trait -- if not yet promoted, add the method). Then:
  - Build Vec<Vec<f32>> and index -> id map.
  - Hdbscan::default_hyper_params(&embeddings).min_cluster_size(self.min_cluster_size).min_samples(self.min_samples).build() (exact builder depends on hdbscan 0.10 surface; see Open Question).
  - let labels = clusterer.cluster()?;
  - let centers = clusterer.calc_centers(Center::Centroid, &labels)?;
  - Group indices by label ignoring -1 (noise). For each cluster compute top_terms via compute_top_terms.
  - Preserve stable IDs where possible: match each new cluster centroid to the closest existing domain by cosine; if similarity > 0.85, reuse the existing domain id + label. Otherwise generate a fresh id cluster_{n} with a label derived from the first 2 terms.
  - Upsert all resulting Domains via the store.
- propose_changes compares old vs new clusters:
  - Split: an old domain that best-matches two or more new domains each with >= min_cluster_size members. Rationale: "domain dev is now 2 clusters of >=10 memories: systems and networking".
  - Merge: two old domains whose centroids now satisfy cosine > merge_threshold get a merge proposal.
  - NewCluster: a new cluster that doesn't match any old domain above 0.85 similarity.
- apply_proposal runs the split or merge against the store (reassign memberships via reassign_all), then marks the proposal Accepted. It never runs automatically -- only via the CLI or dashboard.
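The stable-ID step of discover can be sketched in isolation. This is a minimal illustration under stated assumptions, not the planned implementation: assign_stable_ids and its tuple-based inputs are hypothetical names, the 0.85 cutoff is passed in as reuse_threshold, and the sketch does not handle two new clusters matching the same old domain (the split case) or a fresh cluster_{n} id colliding with an existing one.

```rust
/// Cosine similarity in f64; 0.0 on dimension mismatch or a zero-norm vector.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| f64::from(*x) * f64::from(*y)).sum();
    let na: f64 = a.iter().map(|x| f64::from(*x).powi(2)).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| f64::from(*x).powi(2)).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// For each new centroid, reuse the id of the closest existing domain when the
/// similarity exceeds `reuse_threshold`; otherwise mint `cluster_{n}`.
/// `existing` holds hypothetical (domain id, centroid) pairs.
pub fn assign_stable_ids(
    existing: &[(String, Vec<f32>)],
    new_centroids: &[Vec<f32>],
    reuse_threshold: f64,
) -> Vec<String> {
    new_centroids
        .iter()
        .enumerate()
        .map(|(n, centroid)| {
            // Closest existing domain by cosine similarity, if any exist.
            let best = existing
                .iter()
                .map(|(id, old)| (id, cosine(centroid, old)))
                .max_by(|a, b| a.1.total_cmp(&b.1));
            match best {
                Some((id, sim)) if sim > reuse_threshold => id.clone(),
                _ => format!("cluster_{n}"),
            }
        })
        .collect()
}
```

Running discover twice over an unchanged corpus should then map every new centroid back to its previous id, which is what the discover_preserves_stable_ids test relies on.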
Helper:
fn compute_top_terms(documents: &[&str], k: usize) -> Vec<String>;
Uses TF-IDF with IDF computed over the entire passed-in corpus (the documents slice), tokenization = whitespace split, lowercase, strip non-alphanumeric, drop tokens shorter than 4 chars and a small built-in stop-word list (the, and, for, that, with, ...). Matches the tokenizer used in dreams.rs::content_similarity and dreams.rs::extract_patterns so behavior is predictable.
Cosine similarity helper:
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64;
Keep the existing crate-level cosine_similarity if already present (check embeddings:: or search::); otherwise add a private one. Returns 0.0 on dimension mismatch, panics would be a bug.
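Combining the cosine helper with the boost-and-clamp rule from the behavior notes gives roughly the following. It is a sketch under assumptions: Domain is trimmed to the two fields used here, classify_with_boost is written as a free function taking assign_threshold explicitly, and ">=" is assumed for the threshold comparison (the plan says "above" in one place and ">=" in the tests).

```rust
use std::collections::HashMap;

/// Trimmed stand-in for the plan's Domain struct (only the fields used here).
pub struct Domain {
    pub id: String,
    pub centroid: Vec<f32>,
}

pub struct ClassificationResult {
    pub scores: HashMap<String, f64>, // per-domain score after boost + clamp
    pub domains: Vec<String>,         // ids at or above the threshold
}

/// Cosine similarity in f64; returns 0.0 on dimension mismatch or zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (x, y) in a.iter().zip(b) {
        let (x, y) = (f64::from(*x), f64::from(*y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return 0.0;
    }
    dot / (na.sqrt() * nb.sqrt())
}

pub fn classify_with_boost(
    embedding: &[f32],
    domains: &[Domain],
    boost: Option<&HashMap<String, f64>>,
    assign_threshold: f64,
) -> ClassificationResult {
    let mut scores = HashMap::new();
    let mut assigned = Vec::new();
    for d in domains {
        // Boost keys that match no domain are never looked up, so they are ignored.
        let delta = boost.and_then(|b| b.get(&d.id)).copied().unwrap_or(0.0);
        // Boost is applied after cosine and before thresholding, then clamped.
        let score = (cosine_similarity(embedding, &d.centroid) + delta).clamp(0.0, 1.0);
        if score >= assign_threshold {
            assigned.push(d.id.clone());
        }
        scores.insert(d.id.clone(), score);
    }
    ClassificationResult { scores, domains: assigned }
}
```

An empty domains slice falls out naturally: the loop body never runs, so both scores and domains stay empty, matching the accumulation-phase behavior.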
2. Top-terms computation helper
File: same module, private section.
- fn tokenize(text: &str) -> Vec<String>: lowercase, split on non-alphanumeric, filter len >= 4, drop stop-words.
- fn tfidf_top_k(docs: &[&str], k: usize) -> Vec<String>:
  - tf[doc_idx][term] = count / total_terms.
  - df[term] = docs containing term.
  - idf[term] = log((N + 1) / (df[term] + 1)) + 1 (smoothed).
  - For each term, average tf across docs in the cluster; multiply by idf; sort desc; return top k.
Cluster top-terms are computed over cluster members only, with IDF over the whole corpus (all memory contents), not the cluster, so common words get penalized globally. Recompute global IDF once per discover call.
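The TF-IDF recipe above can be sketched as follows. Several details are assumptions made for illustration: the explicit corpus parameter (the plan's signature takes only docs and k), the contents of STOP_WORDS, and the alphabetical tie-break that keeps the output deterministic.

```rust
use std::collections::{HashMap, HashSet};

// Illustrative stop-word list; the plan only specifies "a small built-in" one.
const STOP_WORDS: &[&str] = &["that", "with", "this", "from", "have"];

/// Lowercase, split on non-alphanumeric, keep tokens of >= 4 chars, drop stop-words.
fn tokenize(text: &str) -> Vec<String> {
    let lower = text.to_lowercase();
    lower
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| t.chars().count() >= 4 && !STOP_WORDS.contains(t))
        .map(str::to_string)
        .collect()
}

/// Top-k terms for one cluster: mean TF over the cluster's documents,
/// multiplied by a smoothed IDF computed over the whole corpus.
fn tfidf_top_k(cluster_docs: &[&str], corpus: &[&str], k: usize) -> Vec<String> {
    // Document frequency over the whole corpus (each doc counts a term once).
    let mut df: HashMap<String, usize> = HashMap::new();
    for doc in corpus {
        let unique: HashSet<String> = tokenize(doc).into_iter().collect();
        for term in unique {
            *df.entry(term).or_insert(0) += 1;
        }
    }
    let n = corpus.len() as f64;

    // Sum of per-document term frequencies (count / total_terms) in the cluster.
    let mut tf_sum: HashMap<String, f64> = HashMap::new();
    for doc in cluster_docs {
        let tokens = tokenize(doc);
        let total = tokens.len() as f64;
        for t in tokens {
            *tf_sum.entry(t).or_insert(0.0) += 1.0 / total;
        }
    }

    let docs = cluster_docs.len().max(1) as f64;
    let mut scored: Vec<(String, f64)> = tf_sum
        .into_iter()
        .map(|(term, tf)| {
            let d = df.get(&term).copied().unwrap_or(0) as f64;
            let idf = ((n + 1.0) / (d + 1.0)).ln() + 1.0; // smoothed IDF
            (term, (tf / docs) * idf)
        })
        .collect();
    // Descending by score; ties broken alphabetically for determinism.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    scored.into_iter().take(k).map(|(t, _)| t).collect()
}
```

Computing df once per discover call and sharing it across clusters would avoid re-tokenizing the corpus for every cluster; the sketch recomputes it to stay self-contained.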
3. CLI subcommand: vestige domains discover
File: crates/vestige-mcp/src/bin/cli.rs
Add to enum Commands:
/// Emergent domain management
Domains {
#[command(subcommand)]
action: DomainAction,
},
#[derive(clap::Subcommand)]
enum DomainAction {
/// List all discovered domains
List {
#[arg(long)] json: bool,
},
/// Run HDBSCAN discovery on all embeddings and propose domains
Discover {
#[arg(long, default_value_t = 10)] min_cluster_size: usize,
/// Skip the proposal flow and write new domains directly (first-time use)
#[arg(long)] force: bool,
#[arg(long)] json: bool,
},
/// Rename a domain (by id)
Rename {
id: String,
new_label: String,
},
/// Merge two domains
Merge {
a: String,
b: String,
#[arg(long)] into: Option<String>, // default: `a`
},
}
Handler plumbing lives in run_domains(action) dispatching to run_domains_list, run_domains_discover, run_domains_rename, run_domains_merge. Each opens the default Storage, constructs a DomainClassifier::default(), and invokes the appropriate method.
Output format for list:
ID LABEL MEMORIES TOP TERMS
dev Development 87 rust, trait, async, tokio, zinit
infra Infrastructure 47 bgp, sonic, vlan, frr, peering
home Home 31 solar, kwh, battery, pool, esphome
(unclassified) 12
Produced via plain print! with {:<15} {:<18} {:<10} {} style padding (the Rust equivalent of printf's %-15s %-18s %-10d %s). --json emits serde_json::to_string_pretty(&domains).
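A hand-rolled padded table of this shape needs nothing beyond std formatting. The sketch below is illustrative only: DomainRow and format_domain_table are hypothetical names, and the column widths (15, 18, 10) are taken from the padding described above.

```rust
/// Hypothetical row type for the list output; not part of the plan's API.
pub struct DomainRow {
    pub id: String,
    pub label: String,
    pub memories: usize,
    pub top_terms: Vec<String>,
}

/// Render a left-aligned, space-padded table with a header line.
pub fn format_domain_table(rows: &[DomainRow]) -> String {
    let mut out = format!(
        "{:<15} {:<18} {:<10} {}\n",
        "ID", "LABEL", "MEMORIES", "TOP TERMS"
    );
    for r in rows {
        out.push_str(&format!(
            "{:<15} {:<18} {:<10} {}\n",
            r.id,
            r.label,
            r.memories,
            r.top_terms.join(", ")
        ));
    }
    out
}
```

Returning a String instead of printing directly keeps the formatter testable; the CLI handler can pass the result to print!.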
Output format for discover with --force:
HDBSCAN: 500 embeddings, min_cluster_size=10, min_samples=5
Found 3 clusters (ignoring 14 noise points)
cluster_0 (N=47) top: bgp, sonic, vlan, frr, peering
cluster_1 (N=31) top: solar, kwh, battery, pool, esphome
cluster_2 (N=22) top: rust, trait, async, tokio, zinit
Writing 3 domains to the store...
Soft-assigning 500 memories against centroids...
multi-domain: 43
single-domain: 412
unclassified (below threshold 0.65): 45
Done in 7.4s.
Output format for discover without --force (post-Phase-0):
HDBSCAN: 623 embeddings, min_cluster_size=10
Comparing to existing 3 domains...
Proposals (pending, accept via dashboard or `vestige domains proposals`):
[split] dev -> (systems:34, networking:28) confidence 0.82
[new] cluster_5 (books, novels, reading) confidence 0.71
Run `vestige domains proposals` to review, or open the dashboard.
4. CLI: list, rename, merge
- list: calls store.list_domains(), fetches unclassified count via store.count_memories_without_domains() (Phase 1 should have provided this; if not, Phase 4 adds it to the trait and both backends).
- rename: store.get_domain(id) -> mutate label -> store.upsert_domain. No memory touch.
- merge: load both, compute blended centroid (weighted by memory_count), merge top_terms (union, recompute TF-IDF rank if both sides share the corpus), delete the non-into domain, call reassign_all. Wrapped in a transaction on Postgres; on SQLite rely on the existing writer-lock pattern.
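The blended centroid used by merge is a weighted mean of the two centroids. A minimal sketch, assuming a hypothetical blend_centroids helper that returns None on a dimension mismatch or when both domains are empty:

```rust
/// Weighted mean of two centroids, with weights proportional to memory_count.
/// Returns None if the dimensions differ or both counts are zero.
pub fn blend_centroids(
    a: &[f32],
    a_count: usize,
    b: &[f32],
    b_count: usize,
) -> Option<Vec<f32>> {
    let total = a_count + b_count;
    if a.len() != b.len() || total == 0 {
        return None;
    }
    let wa = a_count as f32 / total as f32;
    let wb = b_count as f32 / total as f32;
    Some(a.iter().zip(b).map(|(x, y)| wa * x + wb * y).collect())
}
```

The blend is not re-normalized; cosine similarity is scale-invariant, so classification against the merged centroid is unaffected by its length.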
5. Auto-classify on ingest
File: crates/vestige-core/src/cognitive.rs (or equivalent ingest entry in vestige-mcp/src/tools/smart_ingest.rs).
Integration point: just before the record is persisted in the smart-ingest path, after the embedder has produced embedding and before storage.insert(...). Trace the current call site -- today Storage::ingest(IngestInput) computes embedding inside storage; in Phase 1 the embedder becomes external (ADR decision Q2), so classification can hook right there in the cognitive engine.
Pseudocode:
let embedding = embedder.embed(&input.content).await?;
let domains = store.list_domains().await?;
let (domains_assigned, domain_scores) = if domains.is_empty() {
(Vec::new(), HashMap::new())
} else {
let boost = context_signals.gather_boost(&input.metadata, &domains);
let result = classifier.classify_with_boost(&embedding, &domains, boost.as_ref());
(result.domains, result.scores)
};
record.embedding = Some(embedding);
record.domains = domains_assigned;
record.domain_scores = domain_scores;
store.insert(&record).await?;
Edge cases:
- Accumulation phase (domains.is_empty()): skip classification entirely. Zero overhead.
- Embedding failed / skipped: leave domains = [], domain_scores = {}. Never fail ingest because of classification.
- Metric: emit VestigeEvent::MemoryClassified { id, domains, top_score } on the WebSocket bus so the dashboard sees it live.
6. Re-cluster hook in dream consolidation
File: crates/vestige-core/src/advanced/dreams.rs (long file, 1131-line dream() entry on the MemoryDreamer impl) plus crates/vestige-core/src/consolidation/phases.rs (the DreamEngine::run orchestrator).
Design: the DreamEngine::run(...) returns FourPhaseDreamResult. It does not currently know how many times it has run. Phase 4 introduces a persistent counter on disk (column dream_cycle_count on a new singleton system_state table, or a simple row in the existing metadata / embedding_model registry). After the Integration phase finishes, the cognitive engine increments the counter and, if counter % recluster_interval == 0, launches discovery asynchronously:
Extension struct in phases.rs:
pub struct DreamReClusterHook<'a> {
pub classifier: &'a DomainClassifier,
pub store: &'a dyn MemoryStore,
pub event_tx: Option<&'a tokio::sync::mpsc::UnboundedSender<VestigeEvent>>,
}
impl<'a> DreamReClusterHook<'a> {
pub async fn tick(&self, cycle_count: usize) -> Result<Vec<DomainProposal>, StorageError> {
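// Guard against recluster_interval == 0, which would make the modulo panic.
if cycle_count == 0 || self.classifier.recluster_interval == 0 || cycle_count % self.classifier.recluster_interval != 0 {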
return Ok(vec![]);
}
let existing = self.store.list_domains().await?;
let rediscovered = self.classifier.discover(self.store).await?;
let proposals = self
.classifier
.propose_changes(self.store, &existing, &rediscovered)
.await?;
for p in &proposals {
self.store.insert_domain_proposal(p).await?;
if let Some(tx) = self.event_tx {
let _ = tx.send(VestigeEvent::DomainProposalCreated {
id: p.id.clone(),
kind: format!("{:?}", p.kind),
confidence: p.confidence,
timestamp: Utc::now(),
});
}
}
Ok(proposals)
}
}
Caller wires tick() after DreamEngine::run() returns, at the ingest/consolidation orchestrator level. The hook never mutates existing domains -- it only writes proposals. The acceptance path is manual (CLI or dashboard).
Counter storage: add method store.bump_dream_cycle_count() -> Result<usize> returning the new count. Single-row table:
CREATE TABLE IF NOT EXISTS system_state (
key TEXT PRIMARY KEY,
value TEXT NOT NULL
);
-- seed: ('dream_cycle_count', '0')
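The counter-and-trigger logic can be exercised without a database. In this sketch the system_state table is replaced by an in-memory HashMap, and bump_dream_cycle_count and should_recluster are free functions invented for illustration; the real method would live on the store and return a Result.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the single-row system_state table (key -> value).
/// Increments dream_cycle_count and returns the new value; a missing or
/// unparsable value is treated as 0.
pub fn bump_dream_cycle_count(state: &mut HashMap<String, String>) -> usize {
    let next = state
        .get("dream_cycle_count")
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(0)
        + 1;
    state.insert("dream_cycle_count".to_string(), next.to_string());
    next
}

/// True on every Nth cycle; an interval of 0 disables re-clustering
/// instead of dividing by zero.
pub fn should_recluster(cycle_count: usize, recluster_interval: usize) -> bool {
    recluster_interval != 0 && cycle_count != 0 && cycle_count % recluster_interval == 0
}
```

With the default interval of 5, the hook would fire after the 5th, 10th, 15th dream cycle, and so on.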
7. Context signal extractor
File: crates/vestige-core/src/neuroscience/context_signals.rs
pub trait SignalSource: Send + Sync {
/// Returns domain_id -> additive boost (positive or negative, typically in [-0.1, +0.1]).
fn boost_map(
&self,
input_metadata: &serde_json::Value,
domains: &[Domain],
) -> HashMap<String, f64>;
fn name(&self) -> &'static str;
}
pub struct GitRepoSignal {
pub boost: f64, // default +0.05
}
pub struct IdeHintSignal {
pub boost: f64,
}
pub struct ContextSignals {
sources: Vec<Box<dyn SignalSource>>,
}
impl ContextSignals {
pub fn gather_boost(
&self,
input_metadata: &serde_json::Value,
domains: &[Domain],
) -> Option<HashMap<String, f64>>;
}
Signal encoding convention (document in the module header):
- A signal is a soft prior. It nudges the post-cosine score by a small additive delta, clamped to [-0.10, +0.10] per signal.
- Multiple signals sum, then the final boost per domain is clamped to [-0.15, +0.15] so signals cannot by themselves push a memory into or out of a domain; the embedding similarity dominates.
- Signals target domains by heuristic: GitRepoSignal boosts any domain whose top_terms overlaps {"rust","async","trait","function","class","def","git","commit","fn","code"}. IdeHintSignal does the same for {"file","line","editor","vscode","neovim","rust-analyzer","lsp"}.
- All signal boosts are logged via tracing::debug! so users can audit why a memory picked up a domain.
GitRepoSignal::boost_map implementation:
fn boost_map(&self, meta: &Value, domains: &[Domain]) -> HashMap<String, f64> {
let is_git = meta.get("cwd")
.and_then(|v| v.as_str())
.map(|cwd| std::path::Path::new(cwd).join(".git").exists())
.unwrap_or(false)
|| meta.get("git_repo").is_some();
if !is_git { return HashMap::new(); }
let mut out = HashMap::new();
for d in domains {
let code_hits = d.top_terms.iter()
.filter(|t| CODE_TERMS.contains(t.as_str()))
.count();
if code_hits > 0 { out.insert(d.id.clone(), self.boost); }
}
out
}
Config knob in [domains.signals]: git = true, ide = true, git_boost = 0.05, ide_boost = 0.05.
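The two-level clamp from the signal encoding convention (per signal, then per domain) might be combined along these lines. This is a sketch of what gather_boost could do once each SignalSource has produced its map; combine_boosts and the two constants are hypothetical names.

```rust
use std::collections::HashMap;

const PER_SIGNAL_CLAMP: f64 = 0.10; // each signal's delta is limited to +/- 0.10
const TOTAL_CLAMP: f64 = 0.15; // the summed boost per domain is limited to +/- 0.15

/// Sum per-signal boost maps into one map, applying both clamps.
/// Returns None when no signal fired, so the caller can skip the boost path.
pub fn combine_boosts(per_signal: &[HashMap<String, f64>]) -> Option<HashMap<String, f64>> {
    let mut total: HashMap<String, f64> = HashMap::new();
    for map in per_signal {
        for (domain_id, delta) in map {
            *total.entry(domain_id.clone()).or_insert(0.0) +=
                delta.clamp(-PER_SIGNAL_CLAMP, PER_SIGNAL_CLAMP);
        }
    }
    for v in total.values_mut() {
        *v = v.clamp(-TOTAL_CLAMP, TOTAL_CLAMP);
    }
    if total.is_empty() { None } else { Some(total) }
}
```

Under these clamps a misconfigured git_boost = 5.0 contributes at most 0.10 on its own, and at most 0.15 together with other signals.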
8. Cross-domain spreading activation decay
File: crates/vestige-core/src/neuroscience/spreading_activation.rs
Modify ActivationConfig:
pub struct ActivationConfig {
pub decay_factor: f64,
pub max_hops: u32,
pub min_threshold: f64,
pub allow_cycles: bool,
pub cross_domain_decay: f64, // NEW, default 0.5
}
Domain metadata on nodes: the current ActivationNode has id, activation, last_activated, edges: Vec<String>. Phase 4 adds pub domains: Vec<String>. Populated when nodes get added (propagated from the memory's domains field). The network is rebuilt on each search from the store; if the in-memory network is persisted (check ActivationNetwork lifetime in CognitiveEngine), the population happens in the engine at boot and on insert.
Traversal change, in ActivationNetwork::activate loop, replacing the single line let propagated = current_activation * edge.strength * self.config.decay_factor;:
let cross_penalty = {
let src_doms = self.nodes.get(&current_id).map(|n| &n.domains);
let tgt_doms = self.nodes.get(&target_id).map(|n| &n.domains);
match (src_doms, tgt_doms) {
(Some(s), Some(t)) if !s.is_empty() && !t.is_empty() => {
let overlap = s.iter().any(|d| t.contains(d));
if overlap { 1.0 } else { self.config.cross_domain_decay }
}
_ => 1.0, // unclassified on either side: no penalty
}
};
let propagated = current_activation * edge.strength * self.config.decay_factor * cross_penalty;
Rationale for "unclassified -> no penalty": unclassified memories are Phase-0 or low-confidence corpus members; penalizing them would block useful cross-pollination during the accumulation ramp.
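The penalty rule can be isolated from the network traversal and checked numerically. The sketch below assumes two hypothetical free functions, cross_domain_penalty and propagated, that mirror the inline code above.

```rust
/// Strict "no overlap" rule: 1.0 if either side is unclassified or the two
/// domain lists share at least one id, otherwise cross_domain_decay.
pub fn cross_domain_penalty(src: &[String], tgt: &[String], cross_domain_decay: f64) -> f64 {
    if src.is_empty() || tgt.is_empty() {
        return 1.0; // unclassified on either side: no penalty
    }
    if src.iter().any(|d| tgt.contains(d)) {
        1.0
    } else {
        cross_domain_decay
    }
}

/// Activation passed along one edge, including the cross-domain penalty.
pub fn propagated(
    current_activation: f64,
    edge_strength: f64,
    decay_factor: f64,
    penalty: f64,
) -> f64 {
    current_activation * edge_strength * decay_factor * penalty
}
```

With an initial activation of 1.0, edge strength 1.0, decay_factor 0.7, and cross_domain_decay 0.5, a dev -> infra edge propagates 1.0 * 1.0 * 0.7 * 0.5 = 0.35, while a dev -> dev edge propagates 0.7.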
API to update a node's domains after reclassification:
pub fn set_node_domains(&mut self, id: &str, domains: Vec<String>);
Called by the reassignment pipeline after reassign_all.
9. vestige.toml [domains] section
File: wherever vestige.toml is loaded (search for [storage] / [server] loaders). Add:
[domains]
assign_threshold = 0.65
discovery_threshold = 150
recluster_interval = 5
min_cluster_size = 10
min_samples = 5
cross_domain_decay = 0.5
merge_threshold = 0.90
top_terms_k = 10
[domains.signals]
git = true
ide = true
git_boost = 0.05
ide_boost = 0.05
Rust-side: DomainsConfig { ... } struct with serde(default) so vestige.toml without a [domains] section falls back to hard-coded defaults. DomainClassifier::from_config(cfg: &DomainsConfig) -> Self.
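A dependency-free sketch of the defaults loader follows. It leaves out the serde(default) attribute and the [domains.signals] sub-table to stay std-only, so treat the struct layout as an assumption; the default values are the ones listed in the [domains] section above.

```rust
/// Mirror of the [domains] section; the serde derive is omitted in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct DomainsConfig {
    pub assign_threshold: f64,
    pub discovery_threshold: usize,
    pub recluster_interval: usize,
    pub min_cluster_size: usize,
    pub min_samples: usize,
    pub cross_domain_decay: f64,
    pub merge_threshold: f64,
    pub top_terms_k: usize,
}

impl Default for DomainsConfig {
    fn default() -> Self {
        Self {
            assign_threshold: 0.65,
            discovery_threshold: 150,
            recluster_interval: 5,
            min_cluster_size: 10,
            min_samples: 5,
            cross_domain_decay: 0.5,
            merge_threshold: 0.90,
            top_terms_k: 10,
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
pub struct DomainClassifier {
    pub assign_threshold: f64,
    pub discovery_threshold: usize,
    pub recluster_interval: usize,
    pub min_cluster_size: usize,
    pub min_samples: usize,
    pub cross_domain_decay: f64,
    pub merge_threshold: f64,
    pub top_terms_k: usize,
}

impl DomainClassifier {
    /// Copy every tunable from the loaded config into the classifier.
    pub fn from_config(cfg: &DomainsConfig) -> Self {
        Self {
            assign_threshold: cfg.assign_threshold,
            discovery_threshold: cfg.discovery_threshold,
            recluster_interval: cfg.recluster_interval,
            min_cluster_size: cfg.min_cluster_size,
            min_samples: cfg.min_samples,
            cross_domain_decay: cfg.cross_domain_decay,
            merge_threshold: cfg.merge_threshold,
            top_terms_k: cfg.top_terms_k,
        }
    }
}

impl Default for DomainClassifier {
    fn default() -> Self {
        Self::from_config(&DomainsConfig::default())
    }
}
```

Routing DomainClassifier::default() through DomainsConfig::default() keeps a single source of truth for the defaults. A partial override then looks like DomainsConfig { assign_threshold: 0.4, ..DomainsConfig::default() }.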
10. Dashboard UI additions
SvelteKit routes (apps/dashboard/src/routes/(app)/domains/):
- +page.svelte (list): fetches GET /api/v1/domains and GET /api/v1/domains/unclassified-count. Renders a table: label, memories, top_terms chips, created_at. Each row links to /domains/[id]. A "Discover" button posts POST /api/v1/domains/discover.
- [id]/+page.svelte (detail): fetches GET /api/v1/domains/:id, GET /api/v1/domains/:id/memories?limit=100, GET /api/v1/domains/:id/score-histogram. Renders:
  - Header: label (editable, triggers PUT /api/v1/domains/:id), top-terms chips, memory count, created_at.
  - Histogram: a vertical bar chart of domain_scores[:id] buckets 0-0.1, 0.1-0.2, ..., 0.9-1.0 across all memories. Data source: server precomputes buckets so the client does not need to fetch all scores.
  - Memory list: paginated, each row shows the raw score for this domain.
- proposals/+page.svelte: fetches GET /api/v1/domains/proposals?status=pending. Each pending proposal card shows kind, rationale, confidence, created_at, buttons "Accept" (posts POST /api/v1/domains/proposals/:id/accept) and "Reject" (POST .../reject). Live updates via the existing WebSocket channel (/ws) reacting to DomainProposalCreated events.
Styling reuses the existing Tailwind + shadcn-svelte conventions in apps/dashboard/src/lib/components/.
Existing (app)/stats and (app)/feed pages get a small "Domains" summary panel that links to /domains.
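The server-side bucketing behind the score-histogram endpoint reduces to a few lines. This sketch assumes a hypothetical score_histogram helper with ten equal-width buckets where the top bucket is closed at 1.0; non-finite scores are skipped.

```rust
/// Count scores into ten buckets: [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0].
/// Scores outside [0, 1] are clamped; NaN and infinities are ignored.
pub fn score_histogram(scores: &[f64]) -> [usize; 10] {
    let mut buckets = [0usize; 10];
    for &s in scores {
        if !s.is_finite() {
            continue;
        }
        // A score of exactly 1.0 would index bucket 10, so cap at 9.
        let idx = ((s.clamp(0.0, 1.0) * 10.0) as usize).min(9);
        buckets[idx] += 1;
    }
    buckets
}
```

Scores that sit exactly on a bucket edge (such as 0.3) can land on either side because of floating-point rounding; a production version might bucket on integer-scaled scores if exact edges matter.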
11. REST endpoints
File: crates/vestige-mcp/src/protocol/http.rs or a new crates/vestige-mcp/src/api/domains.rs module, wired into the /api/v1/ router.
| Method | Path | Handler |
|---|---|---|
| GET | /api/v1/domains | list_domains -- returns [Domain...] + unclassified count |
| POST | /api/v1/domains/discover | trigger_discover -- body { min_cluster_size?: usize, force?: bool }, returns proposals or applied domains |
| GET | /api/v1/domains/:id | get_domain |
| PUT | /api/v1/domains/:id | update_domain -- rename |
| DELETE | /api/v1/domains/:id | delete_domain -- with ?merge_into=other_id |
| GET | /api/v1/domains/:id/memories | paginated memories in this domain |
| GET | /api/v1/domains/:id/score-histogram | precomputed buckets |
| GET | /api/v1/domains/proposals | list_proposals?status=pending |
| POST | /api/v1/domains/proposals/:id/accept | accept_proposal |
| POST | /api/v1/domains/proposals/:id/reject | reject_proposal |
All handlers go through the Phase 3 auth middleware (Bearer / X-API-Key / session cookie). Responses are JSON; error paths use StatusCode::* with a small {"error": "..."} body.
12. domain_proposals table + trait methods
Postgres migration (crates/vestige-core/migrations/postgres/00XX_domain_proposals.sql):
CREATE TABLE domain_proposals (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
kind TEXT NOT NULL, -- 'split' | 'merge' | 'new_cluster'
payload JSONB NOT NULL, -- serialized ProposalKind body
rationale TEXT NOT NULL,
confidence DOUBLE PRECISION NOT NULL,
status TEXT NOT NULL DEFAULT 'pending', -- pending|accepted|rejected
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
resolved_at TIMESTAMPTZ
);
CREATE INDEX idx_domain_proposals_status ON domain_proposals (status, created_at DESC);
SQLite migration: same table, UUID -> TEXT, JSONB -> TEXT with JSON-encoded bodies, TIMESTAMPTZ -> TEXT ISO-8601.
MemoryStore trait additions:
async fn insert_domain_proposal(&self, p: &DomainProposal) -> Result<()>;
async fn list_domain_proposals(&self, status: Option<&str>) -> Result<Vec<DomainProposal>>;
async fn get_domain_proposal(&self, id: &str) -> Result<Option<DomainProposal>>;
async fn set_proposal_status(&self, id: &str, status: &str) -> Result<()>;
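Because the trait methods and the table pass kind and status as plain strings, a small mapping layer helps keep them in sync with the Rust enums. The sketch below is an assumption-laden illustration: ProposalStatus is referenced by DomainProposal but not defined in this plan, and as_str, parse, and tag are hypothetical helper names.

```rust
/// Status values stored in domain_proposals.status.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProposalStatus {
    Pending,
    Accepted,
    Rejected,
}

impl ProposalStatus {
    /// String form used in the status column and the ?status= query parameter.
    pub fn as_str(self) -> &'static str {
        match self {
            ProposalStatus::Pending => "pending",
            ProposalStatus::Accepted => "accepted",
            ProposalStatus::Rejected => "rejected",
        }
    }

    /// Inverse of as_str; unknown strings yield None instead of a default.
    pub fn parse(s: &str) -> Option<Self> {
        match s {
            "pending" => Some(ProposalStatus::Pending),
            "accepted" => Some(ProposalStatus::Accepted),
            "rejected" => Some(ProposalStatus::Rejected),
            _ => None,
        }
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProposalKind {
    Split { parent: String, children: Vec<String> },
    Merge { targets: Vec<String>, suggested_label: String },
    NewCluster { top_terms: Vec<String> },
}

impl ProposalKind {
    /// Short tag for the kind column: 'split' | 'merge' | 'new_cluster'.
    pub fn tag(&self) -> &'static str {
        match self {
            ProposalKind::Split { .. } => "split",
            ProposalKind::Merge { .. } => "merge",
            ProposalKind::NewCluster { .. } => "new_cluster",
        }
    }
}
```

A tag method like this yields the short column values, whereas formatting the enum with {:?} would emit the full Debug representation including the payload fields.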
13. WebSocket event for proposals
File: crates/vestige-mcp/src/dashboard/events.rs
Add variant:
pub enum VestigeEvent {
// ... existing ...
DomainProposalCreated {
id: String,
kind: String,
confidence: f64,
timestamp: DateTime<Utc>,
},
MemoryClassified {
id: String,
domains: Vec<String>,
top_score: f64,
timestamp: DateTime<Utc>,
},
}
The SvelteKit dashboard's WS client reacts to both events: classified events refresh any open domain-detail page; proposal events push a toast and a badge on the navbar.
Test Plan
Test root: tests/phase_4/ (a new member of the workspace; mirror the tests/e2e layout).
tests/phase_4/Cargo.toml:
[package]
name = "vestige-phase4-tests"
version = "0.0.0"
edition = "2024"
publish = false
[dependencies]
vestige-core = { path = "../../crates/vestige-core", features = ["embeddings", "vector-search", "domain-classification"] }
vestige-mcp = { path = "../../crates/vestige-mcp" }
tokio = { workspace = true }
anyhow = "1"
tempfile = "3"
serde_json = { workspace = true }
uuid = { workspace = true }
Unit tests (colocated in domain_classifier.rs::tests, context_signals.rs::tests, spreading_activation.rs::tests)
Each public function must have at least one test:
- classify_empty_domains_returns_empty: classify(&[0.0; 768], &[]) returns ClassificationResult { scores: {}, domains: [] }.
- classify_single_domain_scores: one Domain with a known centroid; input embedding equal to centroid; expect score 1.0 and domains == [id].
- classify_multi_domain_overlap: two domains A, B; input halfway between centroids; expect both scores >= assign_threshold; expect domains == [A, B] (order not guaranteed).
- classify_below_threshold_returns_empty_domains_but_scores_filled: input orthogonal to all centroids; expect scores populated, domains empty.
- classify_with_boost_adds_delta: same input as above, with a boost large enough to cross assign_threshold from a raw score of 0.0 (e.g. boost = {A: 0.7} at the default 0.65); expect A now above threshold, B unchanged.
- classify_boost_clamps_to_unit: boost = {A: 5.0}; resulting scores[A] must be <= 1.0.
- tfidf_top_k_returns_distinct_terms: given three fake docs, top_k=3 returns three non-duplicate strings, in descending TF-IDF order.
- tfidf_top_k_drops_stopwords: ["the and for"] + real content -> stop-words absent.
- compute_top_terms_handles_empty_cluster: returns an empty Vec.
- signal_git_present_vs_absent: GitRepoSignal given metadata with .git in cwd returns non-empty map; without it returns empty.
- signal_ide_present_vs_absent: IdeHintSignal ditto for metadata.editor == "vscode".
- signal_combined_clamped: two signals both firing each at +0.10 -> combined map values <= +0.15.
- cross_domain_decay_full_weight_on_overlap: graph with node A in domain dev, node B in domain dev, edge A->B strength 1.0; after activate, B's activation equals the standard initial * strength * decay_factor (no extra penalty).
- cross_domain_decay_half_weight_no_overlap: A in dev, B in infra, same edge -> B's activation is 0.5x that of the overlap case.
- cross_domain_decay_unclassified_no_penalty: A classified, B unclassified -> full weight.
- propose_changes_detects_split: existing domain dev; new discovery returns two clusters whose centroids both sit close to old dev centroid, each >= min_cluster_size members -> proposal of kind Split { parent: "dev", children: [a, b] }.
- propose_changes_detects_merge: two existing domains whose new centroids now have cosine > merge_threshold -> proposal of kind Merge.
- propose_changes_detects_new_cluster: a new cluster with no match >= 0.85 to any existing -> NewCluster.
- apply_proposal_split_updates_memberships: after accept, memories previously in dev get reassigned (some to child a, some to child b) via reassign_all.
Integration tests (tests/phase_4/tests/)
One file per behavior listed in the Phase 4 acceptance sheet.
- `discover_seed_corpus.rs` -- loads the 500-memory fixture, runs `classifier.discover(&store).await`, asserts at least 3 clusters, asserts per-cluster intra-similarity mean > 0.6, asserts discovery wall time < 10s in release. Also asserts `top_terms` for each cluster contains at least one expected keyword per cluster (dev: contains any of `rust` / `trait` / `async`; infra: `bgp` / `vlan` / `network`; home: `solar` / `battery` / `pool`).
- `soft_assign_multi_domain.rs` -- inserts a memory "deploy zinit containers over BGP network"; after classify, `domains` contains both `dev` and `infra` (from a known centroid setup).
- `auto_classify_on_ingest.rs` -- with three existing domains, a fresh `smart_ingest` of a dev-ish sentence ends up with `domains == ["dev"]` and non-empty `domain_scores`.
- `reembed_triggers_recluster.rs` -- after `vestige migrate --reembed`, centroids must be recomputed; verify `list_domains()` returns fresh `centroid` values (different from pre-reembed).
- `dream_consolidation_recluster_hook.rs` -- run 5 dream cycles with heavy synthetic memory insertion; after the 5th, assert `list_domain_proposals("pending")` has at least one proposal.
- `proposal_accept_applies_changes.rs` -- accept a split proposal via `apply_proposal`; verify that memories in `dev` are now distributed across the new children and that the old `dev` domain is removed.
- `proposal_reject_leaves_state.rs` -- reject a proposal; verify all domains and memberships unchanged.
- `drift_is_proposal_only.rs` -- over 5 dream cycles with new inserts, never call accept; verify every memory's `domains` field equals its initial post-discovery value. No auto-apply.
- `cross_domain_activation_decay.rs` -- build an `ActivationNetwork` with two memories linked by a strength-1.0 edge, one in `dev`, one in `infra`; activate the `dev` memory with 1.0; assert the `infra` memory's activation == `0.5 * decay_factor` (0.35 with default `decay_factor` 0.7). Then set both to `dev` and reassert activation == `0.7`.
- `cli_domains_discover.rs` -- spawn `cargo run -- domains discover --force --json`, parse stdout, assert at least 3 clusters and valid JSON shape.
- `cli_domains_rename_merge.rs` -- happy-path rename then merge, with stdout assertions.
- `context_signal_git_repo.rs` -- ingest the same sentence from inside a tempdir with `.git` vs outside; assert the git run produces slightly higher `domain_scores` for the code-related domain (diff >= 0.04, matches `git_boost = 0.05`).
- `threshold_tunable.rs` -- same memory, two runs with `assign_threshold = 0.40` vs `0.85`; the low-threshold run assigns more domains than the high-threshold run for the same content.
- `signal_boost_clamped.rs` -- artificially configure `git_boost = 5.0` and assert the resulting per-domain score is still <= 1.0.
- `discover_preserves_stable_ids.rs` -- run discover twice with no new memories; the second run's domain ids match the first's (via centroid-similarity stable-ID matching above 0.85).
Dashboard UI tests (tests/phase_4/ui/)
Use curl-driven smoke tests. This avoids making Playwright a hard dependency of the phase tests; a Playwright config already exists at `apps/dashboard/playwright.config.ts` and can be extended later.
- `domains_list_renders.sh` -- `curl -H "X-API-Key: $KEY" http://localhost:3927/api/v1/domains` returns 200 + JSON array with expected keys.
- `domain_detail_histogram.sh` -- `curl .../api/v1/domains/dev/score-histogram` returns 10 buckets.
- `proposal_review_flow.sh` -- create a pending proposal via SQL insert; `curl POST .../api/v1/domains/proposals/<id>/accept`; `curl GET .../proposals?status=accepted` shows it.
- `unauth_domain_list_rejected.sh` -- no auth header -> 401.
Benchmarks (tests/phase_4/benches/)
Criterion benches:
- `bench_discover_10k.rs` -- synthetic 10k x 768D embeddings drawn from 5 blobs; assert `discover` wall p95 < 30s on a warm release build.
- `bench_auto_classify_single.rs` -- 20 domains in memory, classify one 768D vector; assert p99 < 5ms.
- `bench_reassign_all.rs` -- 10k memories, 5 domains; assert full `reassign_all` wall time < 90s (100 rows/ms baseline).
Acceptance Criteria
- `cargo build -p vestige-core --features domain-classification` zero warnings.
- `cargo build -p vestige-mcp` zero warnings.
- `cargo clippy --workspace --all-targets --all-features -- -D warnings` clean.
- `cargo test -p vestige-phase4-tests` -- all tests in `tests/phase_4/` pass.
- On a 500+ memory seed corpus covering three natural clusters (dev / infra / home), `vestige domains discover --force` produces sensible top-terms matching the expected keyword sets, and labels are stable on a second run.
- `vestige search` with domain filter `["dev"]` excludes any memory whose `domains` array does not include `dev`.
- After 5 dream cycles with ongoing inserts, no existing memory's `domains` has silently changed; proposals exist in the `domain_proposals` table; accepting a proposal reassigns as described.
- Cross-domain spreading activation: a query in `dev` that crosses a single edge into an `infra`-only memory still returns the memory but with activation `cross_domain_decay * in-domain_activation`.
- `vestige domains discover --min-cluster-size 20` produces no more clusters than the default, and with larger per-cluster membership.
- Dashboard `/dashboard/domains` route renders all domains within 2 seconds on the seed corpus.
- Proposal UI flow (open pending, accept, confirmed in store) works end-to-end.
- Benchmarks meet targets (discover 10k p95 < 30s, auto-classify p99 < 5ms).
Rollback Notes
- Feature gate: add `domain-classification` to `crates/vestige-core/Cargo.toml`'s `[features]`. When disabled, the `DomainClassifier` module is not compiled, the classification call in the ingest path is a no-op (`#[cfg]`-guarded), and cross-domain decay collapses to `1.0`. The CLI `domains` subcommand emits "domain classification is disabled in this build".
- Revert strategy: drop the two new tables, `domains` (unless the table created in Phase 1 is retained) and `domain_proposals` (Phase 4). A DOWN migration clears `memories.domains` and `memories.domain_scores`. Existing memories simply lose their domain assignments; all search and retrieval paths work unchanged because `domains = []` is the documented "unclassified" state.
- Idempotency: rerunning `discover` is always safe. Cluster numeric IDs may differ between runs, but the stable-ID match by centroid similarity preserves user-assigned labels. Do not persist cluster ids in client-side bookmarks; link via the user-assigned label.
- Data-loss risk: `apply_proposal` is a destructive operation (it deletes the old parent domain in a split, or collapses two domains into one in a merge). The dashboard's accept button double-confirms with a modal that shows the number of affected memories.
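The feature-gate behavior described above (ingest classification becomes a no-op, cross-domain decay collapses to `1.0`) can be sketched with plain `cfg` attributes. The function names and the closure parameter standing in for the real classifier are hypothetical; only the feature name `domain-classification` comes from this plan.

```rust
/// With the feature enabled, delegate to the real classifier
/// (passed in as a closure here so the sketch stays self-contained).
#[cfg(feature = "domain-classification")]
pub fn classify_on_ingest(
    embedding: &[f32],
    classify: impl Fn(&[f32]) -> Vec<String>,
) -> Vec<String> {
    classify(embedding)
}

/// With the feature disabled, the ingest hook compiles to a no-op that
/// leaves the memory in the documented "unclassified" state.
#[cfg(not(feature = "domain-classification"))]
pub fn classify_on_ingest(
    _embedding: &[f32],
    _classify: impl Fn(&[f32]) -> Vec<String>,
) -> Vec<String> {
    Vec::new()
}

/// Cross-domain decay collapses to 1.0 (no penalty) when the feature is off.
pub fn effective_cross_domain_decay(configured: f32) -> f32 {
    if cfg!(feature = "domain-classification") {
        configured
    } else {
        1.0
    }
}
```

Both variants share one signature, so call sites in the ingest path need no `#[cfg]` of their own.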
Open Implementation Questions
Each question lists its candidates, followed by a RECOMMENDATION.
OQ1. Top-terms extraction: TF-IDF vs BM25 vs frequency?
- TF-IDF with smoothed IDF -- standard, cheap, good-enough.
- BM25 -- better for long-document discrimination, overkill for short memory contents.
- Raw frequency -- noisy; stop-words dominate.
RECOMMENDATION: TF-IDF with global IDF over the entire memory corpus (not just cluster members), recomputed once per `discover` call. Same tokenizer as the `dreams.rs::content_similarity` Jaccard for consistency.
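To make the recommendation concrete, here is a minimal sketch of top-term extraction with a corpus-wide IDF table. The tokenizer, the stop-word list, and the smoothing formula `ln((1 + N) / (1 + df)) + 1` are assumptions chosen for illustration; the plan only says "smoothed IDF" and defers the tokenizer to `dreams.rs`.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical tokenizer: lowercase, split on non-alphanumerics, drop stop-words.
fn tokenize(text: &str) -> Vec<String> {
    const STOP_WORDS: [&str; 6] = ["the", "and", "for", "a", "of", "to"];
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .filter(|t| !STOP_WORDS.contains(&t.as_str()))
        .collect()
}

/// Smoothed IDF over the whole corpus, computed once per discovery pass.
pub fn global_idf(corpus: &[&str]) -> HashMap<String, f64> {
    let n = corpus.len() as f64;
    let mut df: HashMap<String, f64> = HashMap::new();
    for doc in corpus {
        // Count each term at most once per document.
        let unique: HashSet<String> = tokenize(doc).into_iter().collect();
        for term in unique {
            *df.entry(term).or_insert(0.0) += 1.0;
        }
    }
    df.into_iter()
        .map(|(term, d)| (term, ((1.0 + n) / (1.0 + d)).ln() + 1.0))
        .collect()
}

/// Top-k terms for one cluster: term frequency within the cluster times global IDF.
pub fn compute_top_terms(
    cluster: &[&str],
    idf: &HashMap<String, f64>,
    top_k: usize,
) -> Vec<String> {
    let mut tf: HashMap<String, f64> = HashMap::new();
    for doc in cluster {
        for term in tokenize(doc) {
            *tf.entry(term).or_insert(0.0) += 1.0;
        }
    }
    let mut scored: Vec<(String, f64)> = tf
        .into_iter()
        .map(|(term, count)| {
            let weight = idf.get(&term).copied().unwrap_or(1.0);
            (term, count * weight)
        })
        .collect();
    // Descending by score; ties broken alphabetically so output is deterministic.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    scored.into_iter().take(top_k).map(|(term, _)| term).collect()
}
```

Because term counts are keyed in a map, the returned terms are distinct by construction, and an empty cluster yields an empty vector without a special case.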
OQ2. Proposal persistence: DB table vs in-memory with dashboard notification?
- DB table (`domain_proposals`) -- durable, surfaces across restarts, enables audit.
- In-memory only -- simpler, but loses proposals on server restart.
RECOMMENDATION: DB table. Proposals are rare (every 5th dream) and valuable user-facing artifacts; durability is mandatory.
OQ3. hdbscan crate: f32 vs f64 input, exact API surface?
- v0.10 historically takes `&[Vec<f64>]`; embeddings are `Vec<f32>`.
- Cost of converting f32 -> f64 at discovery time: `10k * 768 = 7.68M` f64 values at 8 bytes each, roughly 60MB transient, acceptable.
RECOMMENDATION: verify v0.10's type signature at implementation time; if it requires f64, perform the conversion in `discover()` behind a single allocation. Document in module header. If the crate API diverged from the PRD snippet, fall back to the manual builder style (`HdbscanHyperParams::builder().min_cluster_size(n).min_samples(s).build()`).
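The conversion cost quoted above can be checked with a short sketch. It assumes the clusterer wants nested `Vec<Vec<f64>>` rows, which is this plan's unverified recollection of the crate's input type; the helper names are hypothetical. With that nested shape the outer vector is sized once up front, but each row still needs its own buffer.

```rust
/// Widen every embedding to f64 in one pass.
/// Assumption: the clustering crate takes `&[Vec<f64>]`.
pub fn to_f64_rows(embeddings: &[Vec<f32>]) -> Vec<Vec<f64>> {
    let mut rows = Vec::with_capacity(embeddings.len());
    for embedding in embeddings {
        rows.push(embedding.iter().map(|&x| f64::from(x)).collect());
    }
    rows
}

/// Payload size of the transient f64 copy, excluding per-Vec overhead.
pub fn transient_f64_bytes(n_memories: usize, dim: usize) -> usize {
    n_memories * dim * std::mem::size_of::<f64>()
}
```

For 10,000 memories at 768 dimensions this gives 61,440,000 bytes, in line with the "about 60MB" estimate.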
OQ4. Stable domain IDs across discover re-runs?
- Option A: numeric IDs from HDBSCAN labels -- unstable, re-runs shuffle them.
- Option B: hash(top_terms) -- stable if top-terms stable, but top-terms drift.
- Option C (recommended): after computing new centroids, match each to the closest existing domain by centroid cosine; if similarity > 0.85, reuse the existing domain's `id` and `label`. Otherwise mint a fresh `id = "cluster_<uuid>"`.
RECOMMENDATION: Option C. Preserves user-assigned labels across drift. Threshold 0.85 is config-tunable via `stable_id_threshold` if needed later.
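Option C can be sketched as a single matching function. The `ExistingDomain` struct and the `mint_id` callback are hypothetical stand-ins (the plan mints `cluster_<uuid>` ids; a callback keeps this example free of external crates).

```rust
/// Minimal view of a persisted domain (hypothetical shape).
pub struct ExistingDomain {
    pub id: String,
    pub label: String,
    pub centroid: Vec<f32>,
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Reuse the closest existing domain's id and label when similarity exceeds
/// `stable_id_threshold`; otherwise mint a fresh id with no label.
pub fn resolve_domain_id(
    new_centroid: &[f32],
    existing: &[ExistingDomain],
    stable_id_threshold: f32,
    mint_id: &mut dyn FnMut() -> String,
) -> (String, Option<String>) {
    let best = existing
        .iter()
        .map(|d| (d, cosine(new_centroid, &d.centroid)))
        .max_by(|a, b| a.1.total_cmp(&b.1));
    match best {
        Some((domain, sim)) if sim > stable_id_threshold => {
            (domain.id.clone(), Some(domain.label.clone()))
        }
        _ => (mint_id(), None),
    }
}
```

This sketch resolves each new centroid independently. If two new centroids both match the same existing domain (for example right after a split), a real implementation would need a tie-breaking rule, which is not specified here.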
OQ5. Context signal injection site: ingest handler vs embedder vs classifier?
- Embedder -- would alter embedding; signals are not about embedding quality.
- Ingest handler -- signals known there, but then `DomainClassifier` cannot be tested in isolation.
- Classifier as a `classify_with_boost(boost: Option<&HashMap>)` parameter -- pure, testable, composable.
RECOMMENDATION: classifier parameter. The cognitive engine constructs the boost map via `ContextSignals::gather_boost(&metadata, &domains)` and hands it to the classifier. Keeps the classifier stateless w.r.t. signals.
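As an illustration of what the engine would hand to the classifier, here is a sketch of merging per-signal boost maps into one capped map. The function name `combine_boosts` is hypothetical, and the 0.15 cap is taken from the `signal_combined_clamped` unit test rather than from a documented config key.

```rust
use std::collections::HashMap;

/// Cap on the combined per-domain boost (value from `signal_combined_clamped`).
pub const MAX_COMBINED_BOOST: f32 = 0.15;

/// Sum the boost maps produced by individual signals, then cap each domain.
pub fn combine_boosts(signals: &[HashMap<String, f32>]) -> HashMap<String, f32> {
    let mut combined: HashMap<String, f32> = HashMap::new();
    for signal in signals {
        for (domain, delta) in signal {
            *combined.entry(domain.clone()).or_insert(0.0) += delta;
        }
    }
    for value in combined.values_mut() {
        *value = (*value).min(MAX_COMBINED_BOOST);
    }
    combined
}
```

The resulting map is plain data, so the classifier can be unit-tested with hand-written boosts and never needs to know which signals fired.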
OQ6. Re-cluster proposal cadence: event-based (every Nth dream) vs time-based (weekly)?
- ADR resolution Q7: every Nth dream (N=5 default).
- Alternative: once per week regardless of dream cadence.
RECOMMENDATION: stick with every Nth dream. Users who dream rarely re-cluster rarely -- that matches the philosophy ("memory work triggers memory bookkeeping"). Note the alternative as future consideration; if users complain about never seeing proposals, add a time-based fallback.
OQ7. Minimum corpus size for first discover?
- PRD default: 150.
- Too low -> noisy initial clusters, proposals every dream.
- Too high -> user waits forever for domains to appear.
RECOMMENDATION: 150 as the default discovery gate; with `min_cluster_size=10`, HDBSCAN tends to find few or no clusters below roughly 100 memories, so the system gracefully produces no domains until the corpus is large enough. Test with `N=80, 150, 500` in `threshold_tunable.rs` to confirm sensible behavior.
OQ8. Cross-domain decay: strict no-overlap vs graded?
- Strict: `1.0` if any overlap, `cross_domain_decay` otherwise.
- Graded: `max(cross_domain_decay, |A intersect B| / max(|A|, |B|))`.
RECOMMENDATION: strict for Phase 4. Easier to reason about, easier to tune, easier to test. Graded is marked as a future enhancement; file an issue if retrieval-quality metrics justify it.
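The two candidate rules are small enough to state as code. In this sketch the function names are hypothetical, and treating an unclassified node (empty domain set) as "no penalty" in the graded variant is an assumption carried over from the strict rule's unit test.

```rust
use std::collections::HashSet;

/// Strict rule: full weight on any overlap or when either side is
/// unclassified, `cross_domain_decay` otherwise.
pub fn strict_decay<'a>(
    a: &HashSet<&'a str>,
    b: &HashSet<&'a str>,
    cross_domain_decay: f32,
) -> f32 {
    if a.is_empty() || b.is_empty() || !a.is_disjoint(b) {
        1.0
    } else {
        cross_domain_decay
    }
}

/// Graded rule: max(cross_domain_decay, |A intersect B| / max(|A|, |B|)).
pub fn graded_decay<'a>(
    a: &HashSet<&'a str>,
    b: &HashSet<&'a str>,
    cross_domain_decay: f32,
) -> f32 {
    if a.is_empty() || b.is_empty() {
        return 1.0; // assumption: unclassified nodes are never penalised
    }
    let overlap = a.intersection(b).count() as f32;
    let denom = a.len().max(b.len()) as f32;
    (overlap / denom).max(cross_domain_decay)
}

/// Activation passed along one edge, with the domain multiplier applied last.
pub fn edge_activation(initial: f32, strength: f32, decay_factor: f32, domain_mult: f32) -> f32 {
    initial * strength * decay_factor * domain_mult
}
```

With the defaults used elsewhere in this plan (`decay_factor = 0.7`, `cross_domain_decay = 0.5`), a strength-1.0 edge from a `dev` node into an `infra`-only node carries `1.0 * 1.0 * 0.7 * 0.5 = 0.35`, and `0.7` once both nodes share a domain.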
OQ9. Classifier invocation from remote HTTP clients?
- In server mode, an agent posts `smart_ingest` -> server embeds -> server classifies.
- All the work stays server-side; MCP clients never do classification.
RECOMMENDATION: confirmed server-side-only. Document in the MCP tool schema that `smart_ingest` now returns `domains` and `domain_scores` in its response so clients can display the classification to the user.
OQ10. Where to store the dream-cycle counter?
- In-memory on `CognitiveEngine` -- lost on restart, miscounts cadence.
- New `system_state` singleton table.
RECOMMENDATION: `system_state` table. Survives restarts. Also useful for future metrics (total memories ever, total dreams ever).
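The reason persistence matters for cadence can be shown with a tiny counter sketch. `SystemState` here is a hypothetical in-memory stand-in for the `system_state` row, and `record_dream_cycle` is an illustrative name, not an existing API.

```rust
/// Hypothetical stand-in for the persisted `system_state` singleton row.
pub struct SystemState {
    pub dream_cycles: u64,
}

/// Increment the counter and report whether this cycle should trigger a
/// re-cluster pass. `recluster_every == 0` disables the hook.
pub fn record_dream_cycle(state: &mut SystemState, recluster_every: u64) -> bool {
    state.dream_cycles += 1;
    recluster_every > 0 && state.dream_cycles % recluster_every == 0
}
```

If the process restarts after three cycles and the counter is reloaded as 3, the fifth cycle still fires the hook. An in-memory counter would restart from 0 and push the next re-cluster out to what is really the eighth cycle.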
OQ11. Scope of reassign_all after a proposal accept vs a normal discover?
- On `discover --force` (first-time), run `reassign_all` against all memories.
- On proposal accept (split / merge), run `reassign_all` only on affected memories (parent's members for split; both parents' members for merge) to avoid touching unrelated records.
RECOMMENDATION: scoped reassignment where possible; fall back to full `reassign_all` only on `discover --force` or when the set of domains has fundamentally changed. Reduces write amplification on large corpora.
OQ12. Proposal freshness?
- Multiple re-clusters could stack up pending proposals.
RECOMMENDATION: before inserting a new proposal, check for existing pending proposals with the same `kind + targets`; if present, bump `created_at` and `confidence` instead of creating a duplicate. Add a `confidence_history` array in the `payload` JSONB for audit.
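A sketch of that dedup rule, using an in-memory `Vec` in place of the `domain_proposals` table. The struct layout and the `upsert_pending` name are hypothetical, and reading "bump `confidence`" as "replace with the latest value while archiving the old one" is an assumption.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum ProposalKind {
    Split,
    Merge,
    NewCluster,
}

#[derive(Debug, Clone)]
pub struct Proposal {
    pub kind: ProposalKind,
    pub targets: Vec<String>, // kept sorted so target sets compare by value
    pub confidence: f32,
    pub created_at: u64, // unix seconds
    pub confidence_history: Vec<f32>,
}

/// Insert a pending proposal, or refresh the matching one if it already exists.
pub fn upsert_pending(
    pending: &mut Vec<Proposal>,
    kind: ProposalKind,
    mut targets: Vec<String>,
    confidence: f32,
    now: u64,
) {
    targets.sort();
    let idx = pending
        .iter()
        .position(|p| p.kind == kind && p.targets == targets);
    match idx {
        Some(i) => {
            let existing = &mut pending[i];
            // Archive the previous confidence before overwriting it.
            existing.confidence_history.push(existing.confidence);
            existing.confidence = confidence;
            existing.created_at = now;
        }
        None => pending.push(Proposal {
            kind,
            targets,
            confidence,
            created_at: now,
            confidence_history: Vec::new(),
        }),
    }
}
```

Sorting the targets before comparison makes `[a, b]` and `[b, a]` the same proposal, which matters for merges where target order carries no meaning.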
Implementation Sequencing (suggested order)
- Land the `DomainClassifier` struct, `classify` / `classify_with_boost`, unit tests. (Day 1)
- Add `compute_top_terms` + TF-IDF helper, tests. (Day 1)
- Wire `discover` end-to-end against SQLite; `discover_seed_corpus` integration test. (Day 2)
- Add `domain_proposals` table migrations + trait methods; both backends. (Day 2)
- Implement `propose_changes` + `apply_proposal`; proposal unit tests. (Day 3)
- Context signals module + tests. (Day 3)
- Hook classifier into ingest path; `auto_classify_on_ingest` integration test. (Day 4)
- Cross-domain decay in spreading activation; unit + integration tests. (Day 4)
- Dream re-cluster hook + `system_state` counter; integration tests for drift-only behavior. (Day 5)
- CLI subcommands. (Day 6)
- REST endpoints. (Day 6)
- SvelteKit dashboard routes + WebSocket event wiring. (Day 7-8)
- Benchmarks + acceptance sweep on the 500-memory seed. (Day 9)
File Map (everything Phase 4 touches or creates)
Creates:
- `crates/vestige-core/src/neuroscience/domain_classifier.rs`
- `crates/vestige-core/src/neuroscience/context_signals.rs`
- `crates/vestige-core/migrations/postgres/00XX_domain_proposals.sql`
- `crates/vestige-core/migrations/sqlite/00XX_domain_proposals.sql` (or inline in `storage/migrations.rs`)
- `crates/vestige-mcp/src/api/domains.rs` (REST handlers)
- `apps/dashboard/src/routes/(app)/domains/+page.svelte`
- `apps/dashboard/src/routes/(app)/domains/[id]/+page.svelte`
- `apps/dashboard/src/routes/(app)/domains/proposals/+page.svelte`
- `apps/dashboard/src/lib/api/domains.ts`
- `tests/phase_4/Cargo.toml`
- `tests/phase_4/tests/*.rs` (per the Integration test list)
- `tests/phase_4/fixtures/seed_500.json`
- `tests/phase_4/support/fixtures.rs`
Modifies:
- `crates/vestige-core/Cargo.toml` -- add `hdbscan = "0.10"` under a new `domain-classification` feature.
- `crates/vestige-core/src/neuroscience/mod.rs` -- register new modules, re-exports.
- `crates/vestige-core/src/neuroscience/spreading_activation.rs` -- `cross_domain_decay` field in `ActivationConfig`, `domains` field on `ActivationNode`, decay math in `activate`.
- `crates/vestige-core/src/consolidation/phases.rs` -- `DreamReClusterHook`.
- `crates/vestige-core/src/advanced/dreams.rs` -- accept a hook callback from the orchestrator (if the orchestration is done at this level).
- `crates/vestige-core/src/storage/trait.rs` -- add proposal + system_state methods.
- `crates/vestige-core/src/storage/sqlite.rs` -- implement proposal + system_state methods + `all_embeddings_with_meta` if not already on the trait.
- `crates/vestige-core/src/storage/postgres.rs` (Phase 2) -- same.
- `crates/vestige-core/src/lib.rs` -- re-exports.
- `crates/vestige-core/src/cognitive.rs` (or equivalent ingest orchestrator) -- auto-classify injection.
- `crates/vestige-mcp/src/bin/cli.rs` -- `Domains` subcommand + dispatch.
- `crates/vestige-mcp/src/dashboard/mod.rs` -- wire new REST routes.
- `crates/vestige-mcp/src/dashboard/events.rs` -- new event variants.
- `crates/vestige-mcp/src/dashboard/handlers.rs` -- if legacy dashboard gets a domains panel (optional).
- `vestige.toml` config loader -- `[domains]` section + struct + defaults.
- Root `Cargo.toml` workspace members -- add `tests/phase_4`.
Risks
- HDBSCAN determinism: HDBSCAN is deterministic given input order; sorting embeddings by memory id before feeding the clusterer guarantees reproducibility across runs -- do this in `discover()` and document it.
- Embedding dimension drift: Phase 1's `embedding_model` registry blocks writes from mismatched embedders. If `discover()` ever sees two dimensions, it bails with a clear error and points at `vestige migrate --reembed`.
- Classification latency on ingest: `classify` is O(n_domains * dim), which only matters for users with thousands of domains (unlikely but possible). 20 domains * 768 f32 is about 15k multiply-adds per classification, trivial. Still, expose a `classify_budget_ms` config knob for paranoia.
- Re-cluster proposal storms: if the corpus is borderline-stable, small changes can produce conflicting proposals on consecutive dreams. Mitigation: OQ12 (dedup by target set, bump confidence instead of stacking).
- Dashboard feature gap: if the SvelteKit app lands with the domains route but the REST endpoints are not yet deployed, the route 404s. Mitigation: ship the REST endpoints in the same release; a feature flag on the client toggles the nav entry.
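The first two mitigations above (fixed input order and a dimension guard) fit in one pre-clustering step. A minimal sketch, assuming embeddings arrive as `(memory_id, vector)` pairs; the function name and the `String` error type are hypothetical simplifications.

```rust
/// Sort rows by memory id so the clusterer always sees the same input order,
/// and refuse to proceed if embeddings of different dimensions are mixed.
pub fn prepare_for_clustering(
    mut rows: Vec<(String, Vec<f32>)>,
) -> Result<Vec<(String, Vec<f32>)>, String> {
    rows.sort_by(|a, b| a.0.cmp(&b.0));
    if let Some((_, first)) = rows.first() {
        let dim = first.len();
        if let Some((id, bad)) = rows.iter().find(|(_, e)| e.len() != dim) {
            return Err(format!(
                "memory {id} has dimension {} but expected {dim}; run `vestige migrate --reembed`",
                bad.len()
            ));
        }
    }
    Ok(rows)
}
```

Sorting by id is cheap compared with clustering and removes any dependence on the order in which the storage backend happens to return rows.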
Non-Goals Reminder
- No Phase 5 federation concerns in this plan.
- No cross-installation domain sync.
- No automatic accept of proposals, ever.
- No graded cross-domain decay; strict only.
- No ML-based domain label suggestion (top-terms are enough for v1).
- No editing individual memory memberships from the UI in this phase.