trustgraph/docs/tech-specs/workspace-scoped-services.md
cybermaggedon 9f2bfbce0c
Per-workspace queue routing for workspace-scoped services (#862)
Workspace identity is now determined by queue infrastructure instead of
message body fields, closing a privilege-escalation vector where a caller
could spoof workspace in the request payload.

- Add WorkspaceProcessor base class: discovers workspaces from config at
  startup, creates per-workspace consumers (queue:workspace), and manages
  consumer lifecycle on workspace create/delete events
- Roll out to librarian, flow-svc, knowledge cores, and config-svc
- Config service gets a dual-queue regime: a system queue for
  cross-workspace ops (getvalues-all-ws, bootstrapper writes to
  __workspaces__) and per-workspace queues for tenant-scoped ops, with
  workspace discovery from its own Cassandra store
- Remove workspace field from request schemas (FlowRequest,
  LibrarianRequest, KnowledgeRequest, CollectionManagementRequest) and
  from DocumentMetadata / ProcessingMetadata — table stores now accept
  workspace as an explicit parameter
- Strip workspace encode/decode from all message translators and gateway
  serializers
- Gateway enforces workspace existence: reject requests targeting
  non-existent workspaces instead of routing to queues with no consumer
- Config service provisions new workspaces from __template__ on creation
- Add workspace lifecycle hooks to AsyncProcessor so any processor can
  react to workspace create/delete without subclassing WorkspaceProcessor
2026-05-04 10:30:03 +01:00

366 lines
16 KiB
Markdown

---
layout: default
title: "Workspace-Scoped Services"
parent: "Tech Specs"
---
# Workspace-Scoped Services
## Problem Statement
Workspace-scoped services (librarian, config, knowledge, collection
management) currently operate on global queues — a single
`request:tg:librarian` queue handles requests for all workspaces.
Workspace identity is carried as a field in the request body, set by
the gateway after authentication. This creates several problems:
- **No structural isolation.** All workspaces share a single queue.
Workspace scoping depends entirely on a body field being populated
correctly. If the field is missing or wrong, the service operates
on the wrong workspace — or fails with a confusing error. This is
a security concern: workspace isolation should be enforced by
infrastructure, not by trusting a field.
- **Redundant workspace fields.** Nested objects within requests
(e.g. `processing-metadata`, `document-metadata`) carry their own
`workspace` fields alongside the top-level request workspace.
The gateway resolves the top-level workspace but does not
propagate into nested payloads. Services that read workspace from
a nested object instead of the top-level address see `None` and
fail.
- **No workspace lifecycle awareness.** Workspace-scoped services
have no mechanism to learn when workspaces are created or deleted.
Flow processors discover workspaces indirectly through config
entries, but workspace-scoped services on global queues have no
equivalent. There is no event when a workspace appears or
disappears.
- **Inconsistency with flow-scoped services.** Flow-scoped services
already use per-workspace, per-flow queue names
(`request:tg:{workspace}:embeddings:{class}`). Workspace-scoped
services are the exception — they sit on global queues while
everything else is structurally isolated.
## Design
### Per-workspace queues for workspace-scoped services
Workspace-scoped services move from global queues to per-workspace
queues. The queue name includes the workspace identifier:
**Current (global):**
```
request:tg:librarian
request:tg:config
```
**Proposed (per-workspace):**
```
request:tg:librarian:{workspace}
request:tg:config:{workspace}
```
The gateway routes requests to the correct queue based on the
resolved workspace from authentication — the same workspace that
today gets written into the request body. The workspace is now part
of the queue address, not just a field in the payload.
Services subscribe to per-workspace queues. When a new workspace is
created, they subscribe to its queue. When a workspace is deleted,
they unsubscribe.
### Workspace lifecycle via the `__workspaces__` config namespace
Workspace lifecycle events are modelled as config changes in a
reserved `__workspaces__` namespace. This mirrors the existing
`__system__` namespace — a reserved space for infrastructure
concerns that don't belong to any user workspace.
When IAM creates a workspace, it writes an entry to the config
service:
```
workspace: __workspaces__
type: workspace
key: <workspace-id>
value: {"enabled": true}
```
When IAM deletes (or disables) a workspace, it updates or deletes
the entry. The config service sees this as a normal config change
and pushes a notification through the existing `ConfigPush`
mechanism.
This avoids introducing a new notification channel. The config
service already has the machinery to notify subscribers of changes
by type and workspace. Workspace lifecycle is just another config
type that services can register handlers for.
### Config push changes
#### Remove `_`-prefix suppression
The config service currently suppresses notifications for workspaces
whose names start with `_`. This suppression is removed — the
config service pushes notifications for all workspaces
unconditionally.
The filtering moves to the consumer side. `AsyncProcessor` already
filters `_`-prefixed workspaces in its config handler dispatch
(lines 212 and 315 of `async_processor.py`). This filtering is
retained as the default behaviour, but handlers can opt in to
infrastructure namespaces by registering for them explicitly (see
`WorkspaceProcessor` below).
#### Workspace change events
The `ConfigPush` message gains a `workspace_changes` field alongside
the existing `changes` field:
```python
@dataclass
class ConfigPush:
version: int = 0
# Config changes: type -> [affected workspaces]
changes: dict[str, list[str]] = field(default_factory=dict)
# Workspace lifecycle: created/deleted workspace lists
workspace_changes: WorkspaceChanges | None = None
@dataclass
class WorkspaceChanges:
created: list[str] = field(default_factory=list)
deleted: list[str] = field(default_factory=list)
```
The config service populates `workspace_changes` when it detects
changes to the `__workspaces__` config namespace. A new key
appearing is a creation; a key being deleted is a deletion.
Services that don't care about workspace lifecycle ignore the field.
Services that do (workspace-scoped services, the gateway) react by
subscribing to or tearing down per-workspace queues.
### The `WorkspaceProcessor` base class
A new base class sits between `AsyncProcessor` and `FlowProcessor`
in the processor hierarchy:
```
AsyncProcessor → WorkspaceProcessor → FlowProcessor
```
`WorkspaceProcessor` manages per-workspace queue lifecycle the same
way `FlowProcessor` manages per-flow lifecycle. It:
1. On startup, discovers existing workspaces by fetching config from
the `__workspaces__` namespace (using the existing
`_fetch_type_all_workspaces` pattern).
2. For each workspace, subscribes to the service's per-workspace
queue (e.g. `request:tg:librarian:{workspace}`).
3. Registers a config handler for the `workspace` type in the
`__workspaces__` namespace. When a workspace is created, it
subscribes to the new queue. When a workspace is deleted, it
unsubscribes and tears down.
4. Exposes hooks for derived classes:
- `on_workspace_created(workspace)` — called after subscribing
to the new workspace's queue.
- `on_workspace_deleted(workspace)` — called before
unsubscribing from the workspace's queue.
`FlowProcessor` extends `WorkspaceProcessor` instead of
`AsyncProcessor`. Flows exist within workspaces, so the hierarchy
is natural: workspace creation triggers queue subscription, then
flow config changes within that workspace trigger flow start/stop.
Services that are workspace-scoped but not flow-scoped (librarian,
knowledge, collection management) extend `WorkspaceProcessor`
directly.
### Gateway routing changes
The gateway currently dispatches workspace-scoped requests to global
service dispatchers. This changes to per-workspace dispatchers that
route to per-workspace queues.
For HTTP requests, the resolved workspace from the URL path
(`/api/v1/workspaces/{w}/library`) determines the target queue.
For WebSocket requests via the Mux, the resolved workspace from
`enforce_workspace` determines the target queue. The Mux already
resolves workspace before dispatching (line 214 of `mux.py`); the
change is that `invoke_global_service` uses workspace to select the
queue, rather than routing to a single global queue.
System-level services (IAM) remain on global queues — they are not
workspace-scoped.
### Workspace field on nested metadata objects
With per-workspace queues, the workspace is part of the queue
address. Services know which workspace they are serving by which
queue a message arrived on.
The `workspace` field on `DocumentMetadata` and
`ProcessingMetadata` in the librarian schema becomes a storage
attribute — the workspace the record belongs to, populated by the
service from the request context, not by the caller. The service
reads workspace from `request.workspace` (the resolved address) or
from the queue context, never from a nested payload field.
Callers are not required to populate workspace on nested objects.
The service fills it in authoritatively from the request context
before storing.
## Interaction with existing specs
### IAM (`iam.md`, `iam-contract.md`)
IAM is the authority for workspace existence. When IAM creates or
deletes a workspace, it writes to the `__workspaces__` config
namespace. This is a two-step operation: register the workspace in
IAM's own store (`iam_workspaces` table), then announce it via
config.
The IAM service itself remains on a global queue — it is a
system-level service, not workspace-scoped.
### Config service
The config service is workspace-scoped — it stores per-workspace
configuration. Under this design, the config service moves to
per-workspace queues like other workspace-scoped services.
On startup, the config service discovers workspaces from its own
store (it has direct access to the config tables, unlike other
services that fetch via request/response). It subscribes to
per-workspace queues for each known workspace.
When IAM writes a new workspace entry to the `__workspaces__`
namespace, the config service sees the write directly (it is the
config service), creates the per-workspace queue, and pushes the
notification.
### Flow blueprints (`flow-blueprint-definition.md`)
Flow blueprints already use `{workspace}` in queue name templates.
No changes needed — flows are created within an already-existing
workspace, so the per-workspace infrastructure is in place before
flow start.
### Data ownership (`data-ownership-model.md`)
This spec reinforces the data ownership model: a workspace is the
primary isolation boundary, and per-workspace queues make that
boundary structural rather than conventional.
## Migration
### Queue naming
Existing deployments use global queues for workspace-scoped
services. Migration requires:
1. Deploy updated services that subscribe to both global and
per-workspace queues during a transition period.
2. Update the gateway to route to per-workspace queues.
3. Drain the global queues.
4. Remove global queue subscriptions from services.
### `__workspaces__` bootstrap
On first start after migration, IAM populates the `__workspaces__`
config namespace with entries for all existing workspaces from
`iam_workspaces`. This seeds the config store so that
workspace-scoped services discover existing workspaces on startup.
### Config push compatibility
The `workspace_changes` field on `ConfigPush` is additive.
Services that don't understand it ignore it (the field defaults to
`None`). No breaking change to the push protocol.
## Summary of changes
| Component | Change |
|-----------|--------|
| Queue names | Workspace-scoped services move from `request:tg:{service}` to `request:tg:{service}:{workspace}` |
| `__workspaces__` namespace | New reserved config namespace for workspace lifecycle |
| IAM service | Writes to `__workspaces__` on workspace create/delete |
| Config service | Removes `_`-prefix notification suppression; generates `workspace_changes` events; moves to per-workspace queues |
| `ConfigPush` schema | Adds `workspace_changes` field (`WorkspaceChanges` dataclass) |
| `WorkspaceProcessor` | New base class managing per-workspace queue lifecycle |
| `FlowProcessor` | Extends `WorkspaceProcessor` instead of `AsyncProcessor` |
| `AsyncProcessor` | Relaxes `_`-prefix filtering to allow opt-in for infrastructure namespaces |
| Gateway | Routes workspace-scoped requests to per-workspace queues |
| Librarian schema | `workspace` on nested metadata becomes a service-populated storage attribute, not a caller-supplied address |
## Implementation Plan
### Phase 1: Foundation — `__workspaces__` namespace and config push
- **`ConfigPush` schema** (`trustgraph-base/trustgraph/schema/services/config.py`): Add `WorkspaceChanges` dataclass and `workspace_changes` field.
- **Config push serialization** (`trustgraph-base/trustgraph/messaging/translators/`): Encode/decode the new field.
- **Config service** (`trustgraph-flow/trustgraph/config/`): Detect writes to `__workspaces__` namespace and populate `workspace_changes` on the push message. Remove `_`-prefix notification suppression.
- **`AsyncProcessor`** (`trustgraph-base/trustgraph/base/async_processor.py`): Relax `_`-prefix filtering so handlers can opt in to infrastructure namespaces.
- **IAM service** (`trustgraph-flow/trustgraph/iam/`): Write to `__workspaces__` config namespace on `create-workspace` and `delete-workspace`. Add bootstrap step to seed `__workspaces__` entries for existing workspaces.
### Phase 2: `WorkspaceProcessor` base class
- **New `WorkspaceProcessor`** (`trustgraph-base/trustgraph/base/workspace_processor.py`): Implements workspace discovery on startup, per-workspace queue subscribe/unsubscribe, workspace lifecycle handler registration, `on_workspace_created`/`on_workspace_deleted` hooks.
- **`FlowProcessor`** (`trustgraph-base/trustgraph/base/flow_processor.py`): Re-parent from `AsyncProcessor` to `WorkspaceProcessor`.
- **Verify** existing flow processors continue to work — the new layer should be transparent to them.
### Phase 3: Per-workspace queues for workspace-scoped services
- **Queue definitions** (`trustgraph-base/trustgraph/schema/`): Update queue names for librarian, config, knowledge, collection management to include `{workspace}`.
- **Librarian** (`trustgraph-flow/trustgraph/librarian/`): Extend `WorkspaceProcessor`. Remove reliance on workspace from nested metadata objects.
- **Knowledge service, collection management** and other workspace-scoped services: Extend `WorkspaceProcessor`.
- **Config service**: Self-bootstrap per-workspace queues from its own store on startup; subscribe to new workspace queues when `__workspaces__` entries appear.
### Phase 4: Gateway routing
- **Gateway dispatcher manager** (`trustgraph-flow/trustgraph/gateway/dispatch/manager.py`): Route workspace-scoped services to per-workspace queues using resolved workspace. System-level services (IAM) remain on global queues.
- **Mux** (`trustgraph-flow/trustgraph/gateway/dispatch/mux.py`): Pass workspace to `invoke_global_service` for workspace-scoped services.
- **HTTP endpoints** (`trustgraph-flow/trustgraph/gateway/endpoint/`): Route to per-workspace queues based on URL path workspace.
### Phase 5: Schema cleanup
- **`DocumentMetadata`, `ProcessingMetadata`** (`trustgraph-base/trustgraph/schema/services/library.py`): Remove `workspace` field from nested metadata objects, or retain as a service-populated storage attribute only.
- **Serialization** (`trustgraph-flow/trustgraph/gateway/dispatch/serialize.py`, `trustgraph-base/trustgraph/messaging/translators/metadata.py`): Update translators to match.
- **API client** (`trustgraph-base/trustgraph/api/library.py`): Stop sending workspace in nested payloads.
- **Librarian service** (`trustgraph-flow/trustgraph/librarian/`): Populate workspace on stored records from request context.
### Dependencies
```
Phase 1 (foundation)
Phase 2 (WorkspaceProcessor)
Phase 3 (per-workspace queues) ←→ Phase 4 (gateway routing)
↓ ↓
Phase 5 (schema cleanup)
```
Phases 3 and 4 can be developed in parallel but must be deployed together — services expecting per-workspace queues need the gateway to route to them.
## References
- [Identity and Access Management](iam.md) — workspace registry,
authentication, and workspace resolution.
- [IAM Contract](iam-contract.md) — resource model and workspace as
address vs. parameter.
- [Data Ownership and Information Separation](data-ownership-model.md)
— workspace as isolation boundary.
- [Config Push and Poke](config-push-poke.md) — config notification
mechanism.
- [Flow Blueprint Definition](flow-blueprint-definition.md) —
`{workspace}` template variable in queue names.
- [Flow Service Queue Lifecycle](flow-service-queue-lifecycle.md) —
queue ownership and lifecycle model.