release/v2.4 -> master (#844)

2026-06-24 06:08:05 +02:00 · 2026-04-22 15:19:57 +01:00 · 2026-04-22 15:19:57 +01:00 · 89cabee1b4
commit 89cabee1b4
parent a24df8e990
386 changed files with 7202 additions and 5741 deletions
--- a/docs/tech-specs/data-ownership-model.md
+++ b/docs/tech-specs/data-ownership-model.md
@@ -0,0 +1,309 @@
+---
+layout: default
+title: "Data Ownership and Information Separation"
+parent: "Tech Specs"
+---
+
+# Data Ownership and Information Separation
+
+## Purpose
+
+This document defines the logical ownership model for data in
+TrustGraph: what the artefacts are, who owns them, and how they relate
+to each other.
+
+The IAM spec ([iam.md](iam.md)) describes authentication and
+authorisation mechanics. This spec addresses the prior question: what
+are the boundaries around data, and who owns what?
+
+## Concepts
+
+### Workspace
+
+A workspace is the primary isolation boundary. It represents an
+organisation, team, or independent operating unit. All data belongs to
+exactly one workspace. Cross-workspace access is never permitted through
+the API.
+
+A workspace owns:
+- Source documents
+- Flows (processing pipeline definitions)
+- Knowledge cores (stored extraction output)
+- Collections (organisational units for extracted knowledge)
+
+### Collection
+
+A collection is an organisational unit within a workspace. It groups
+extracted knowledge produced from source documents. A workspace can
+have multiple collections, allowing:
+
+- Processing the same documents with different parameters or models.
+- Maintaining separate knowledge bases for different purposes.
+- Deleting extracted knowledge without deleting source documents.
+
+Collections do not own source documents. A source document exists at the
+workspace level and can be processed into multiple collections.
+
+### Source document
+
+A source document (PDF, text file, etc.) is raw input uploaded to the
+system. Documents belong to the workspace, not to a specific collection.
+
+This is intentional. A document is an asset that exists independently
+of how it is processed. The same PDF might be processed into multiple
+collections with different chunking parameters or extraction models.
+Tying a document to a single collection would force re-upload for each
+collection.
+
+### Flow
+
+A flow defines a processing pipeline: which models to use, what
+parameters to apply (chunk size, temperature, etc.), and how processing
+services are connected. Flows belong to a workspace.
+
+The processing services themselves (document-decoder, chunker,
+embeddings, LLM completion, etc.) are shared infrastructure — they serve
+all workspaces. Each flow has its own queues, keeping data from
+different workspaces and flows separate as it moves through the
+pipeline.
+
+Different workspaces can define different flows. Workspace A might use
+GPT-5.2 with a chunk size of 2000, while workspace B uses Claude with a
+chunk size of 1000.
+
+### Prompts
+
+Prompts are templates that control how the LLM behaves during knowledge
+extraction and query answering. They belong to a workspace, allowing
+different workspaces to have different extraction strategies, response
+styles, or domain-specific instructions.
+
+### Ontology
+
+An ontology defines the concepts, entities, and relationships that the
+extraction pipeline looks for in source documents. Ontologies belong to
+a workspace. A medical workspace might define ontologies around diseases,
+symptoms, and treatments, while a legal workspace defines ontologies
+around statutes, precedents, and obligations.
+
+### Schemas
+
+Schemas define structured data types for extraction. They specify what
+fields to extract, their types, and how they relate. Schemas belong to
+a workspace, as different workspaces extract different structured
+information from their documents.
+
+### Tools, tool services, and MCP tools
+
+Tools define capabilities available to agents: what actions they can
+take, what external services they can call. Tool services configure how
+tools connect to backend services. MCP tools configure connections to
+remote MCP servers, including authentication tokens. All belong to a
+workspace.
+
+### Agent patterns and agent task types
+
+Agent patterns define agent behaviour strategies (how an agent reasons,
+what steps it follows). Agent task types define the kinds of tasks
+agents can perform. Both belong to a workspace, as different workspaces
+may have different agent configurations.
+
+### Token costs
+
+Token cost definitions specify pricing for LLM token usage per model.
+These belong to a workspace since different workspaces may use different
+models or have different billing arrangements.
+
+### Flow blueprints
+
+Flow blueprints are templates for creating flows. They define the
+default pipeline structure and parameters. Blueprints belong to a
+workspace, allowing workspaces to define custom processing templates.
+
+### Parameter types
+
+Parameter types define the kinds of parameters that flows accept (e.g.
+"llm-model", "temperature"), including their defaults and validation
+rules. They belong to a workspace since workspaces that define custom
+flows need to define the parameter types those flows use.
+
+### Interface descriptions
+
+Interface descriptions define the connection points of a flow — what
+queues and topics it uses. They belong to a workspace since they
+describe workspace-owned flows.
+
+### Knowledge core
+
+A knowledge core is a stored snapshot of extracted knowledge (triples
+and graph embeddings). Knowledge cores belong to a workspace and can be
+loaded into any collection within that workspace.
+
+Knowledge cores serve as a portable extraction output. You process
+documents through a flow, the pipeline produces triples and embeddings,
+and the results can be stored as a knowledge core. That core can later
+be loaded into a different collection or reloaded after a collection is
+cleared.
+
+### Extracted knowledge
+
+Extracted knowledge is the live, queryable content within a collection:
+triples in the knowledge graph, graph embeddings, and document
+embeddings. It is the product of processing source documents through a
+flow into a specific collection.
+
+Extracted knowledge is scoped to a workspace and a collection. It
+cannot exist without both.
+
+### Processing record
+
+A processing record tracks which source document was processed, through
+which flow, into which collection. It links the source document
+(workspace-scoped) to the extracted knowledge (workspace + collection
+scoped).
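
A processing record as described above can be pictured as a small value type. The following is a minimal sketch only: the field names (`workspace`, `collection`, `document_id`, `flow_id`) and the sample identifiers are hypothetical and are not taken from the codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingRecord:
    # All field names are hypothetical, chosen for illustration only.
    workspace: str    # owning workspace (the isolation boundary)
    collection: str   # target collection within that workspace
    document_id: str  # workspace-scoped source document
    flow_id: str      # flow the document was processed through

    def knowledge_scope(self) -> tuple:
        """Scope of the extracted knowledge this record points at."""
        return (self.workspace, self.collection)


# The same document processed into two collections yields two records.
records = [
    ProcessingRecord("ws-acme", "research", "doc-1", "flow-a"),
    ProcessingRecord("ws-acme", "legal", "doc-1", "flow-b"),
]
scopes = {r.knowledge_scope() for r in records}
```

The document appears once at workspace level, while each record carries its own workspace and collection pair, which matches the scoping stated for extracted knowledge.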
+
+## Ownership summary
+
+| Artefact | Owned by | Shared across collections? |
+|----------|----------|---------------------------|
+| Workspaces | Global (platform) | N/A |
+| User accounts | Global (platform) | N/A |
+| API keys | Global (platform) | N/A |
+| Source documents | Workspace | Yes |
+| Flows | Workspace | N/A |
+| Flow blueprints | Workspace | N/A |
+| Prompts | Workspace | N/A |
+| Ontologies | Workspace | N/A |
+| Schemas | Workspace | N/A |
+| Tools | Workspace | N/A |
+| Tool services | Workspace | N/A |
+| MCP tools | Workspace | N/A |
+| Agent patterns | Workspace | N/A |
+| Agent task types | Workspace | N/A |
+| Token costs | Workspace | N/A |
+| Parameter types | Workspace | N/A |
+| Interface descriptions | Workspace | N/A |
+| Knowledge cores | Workspace | Yes — can be loaded into any collection |
+| Collections | Workspace | N/A |
+| Extracted knowledge | Workspace + collection | No |
+| Processing records | Workspace + collection | No |
+
+## Scoping summary
+
+### Global (system-level)
+
+A small number of artefacts exist outside any workspace:
+
+- **Workspace registry** — the list of workspaces itself
+- **User accounts** — users reference a workspace but are not owned by
+  one
+- **API keys** — belong to users, not workspaces
+
+These are managed by the IAM layer and exist at the platform level.
+
+### Workspace-owned
+
+All other configuration and data is workspace-owned:
+
+- Flow definitions and parameters
+- Flow blueprints
+- Prompts
+- Ontologies
+- Schemas
+- Tools, tool services, and MCP tools
+- Agent patterns and agent task types
+- Token costs
+- Parameter types
+- Interface descriptions
+- Collection definitions
+- Knowledge cores
+- Source documents
+- Collections and their extracted knowledge
+
+## Relationship between artefacts
+
+```
+Platform (global)
+ |
+ +-- Workspaces
+ |
+ +-- User accounts (each assigned to a workspace)
+ |
+ +-- API keys (belong to users)
+
+Workspace
+ |
+ +-- Source documents (uploaded, unprocessed)
+ |
+ +-- Flows (pipeline definitions: models, parameters, queues)
+ |
+ +-- Flow blueprints (templates for creating flows)
+ |
+ +-- Prompts (LLM instruction templates)
+ |
+ +-- Ontologies (entity and relationship definitions)
+ |
+ +-- Schemas (structured data type definitions)
+ |
+ +-- Tools, tool services, MCP tools (agent capabilities)
+ |
+ +-- Agent patterns and agent task types (agent behaviour)
+ |
+ +-- Token costs (LLM pricing per model)
+ |
+ +-- Parameter types (flow parameter definitions)
+ |
+ +-- Interface descriptions (flow connection points)
+ |
+ +-- Knowledge cores (stored extraction snapshots)
+ |
+ +-- Collections
+      |
+      +-- Extracted knowledge (triples, embeddings)
+      |
+      +-- Processing records (links documents to collections)
+```
+
+A typical workflow:
+
+1. A source document is uploaded to the workspace.
+2. A flow defines how to process it (which models, what parameters).
+3. The document is processed through the flow into a collection.
+4. Processing records track what was processed.
+5. Extracted knowledge (triples, embeddings) is queryable within the
+   collection.
+6. Optionally, the extracted knowledge is stored as a knowledge core
+   for later reuse.
+
+## Implementation notes
+
+The current codebase uses a `user` field in message metadata and storage
+partition keys to identify the workspace. The `collection` field
+identifies the collection within that workspace. The IAM spec describes
+how the gateway maps authenticated credentials to a workspace identity
+and sets these fields.
+
+For details on how each storage backend implements this scoping, see:
+
+- [Entity-Centric Graph](entity-centric-graph.md) — Cassandra KG schema
+- [Neo4j User Collection Isolation](neo4j-user-collection-isolation.md)
+- [Collection Management](collection-management.md)
+
+### Known inconsistencies in current implementation
+
+- **Pipeline intermediate tables** do not include collection in their
+  partition keys. Re-processing the same document into a different
+  collection may overwrite intermediate state.
+- **Processing metadata** stores collection in the row payload but not
+  in the partition key, making collection-based queries inefficient.
+- **Upload sessions** are keyed by upload ID, not workspace. The
+  gateway should validate workspace ownership before allowing
+  operations on upload sessions.
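
The upload-session check suggested in the last item could look roughly like this. It is a sketch under assumptions: the `sessions` mapping, the `workspace` key on each session record, and the exception name are all hypothetical.

```python
class UploadSessionError(Exception):
    """Raised when a caller may not operate on an upload session."""


def check_upload_session(sessions: dict, upload_id: str, workspace: str) -> dict:
    # `sessions` is a hypothetical store mapping upload ID -> session record.
    # Each record is assumed to carry the workspace that created it.
    session = sessions.get(upload_id)
    if session is None:
        raise UploadSessionError("unknown upload session")
    if session["workspace"] != workspace:
        # The upload ID alone is not proof of ownership.
        raise UploadSessionError("upload session belongs to another workspace")
    return session


sessions = {"upload-1": {"workspace": "ws-acme", "filename": "report.pdf"}}
owned = check_upload_session(sessions, "upload-1", "ws-acme")
```

A gateway doing this would pass the authenticated workspace, never a caller-supplied one, as the `workspace` argument.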
+
+## References
+
+- [Identity and Access Management](iam.md)
+- [Collection Management](collection-management.md)
+- [Entity-Centric Graph](entity-centric-graph.md)
+- [Neo4j User Collection Isolation](neo4j-user-collection-isolation.md)
+- [Multi-Tenant Support](multi-tenant-support.md)
--- a/docs/tech-specs/flow-class-definition.md
+++ b/docs/tech-specs/flow-class-definition.md
@@ -20,8 +20,8 @@ Defines shared service processors that are instantiated once per flow blueprint.
 ```json
 "class": {
  "service-name:{class}": {
-    "request": "queue-pattern:{class}",
-    "response": "queue-pattern:{class}",
+    "request": "queue-pattern:{workspace}:{class}",
+    "response": "queue-pattern:{workspace}:{class}",
    "settings": {
      "setting-name": "fixed-value",
      "parameterized-setting": "{parameter-name}"
@@ -31,11 +31,11 @@ Defines shared service processors that are instantiated once per flow blueprint.
 ```

 **Characteristics:**
-- Shared across all flow instances of the same class
+- Shared across all flow instances of the same class within a workspace
 - Typically expensive or stateless services (LLMs, embedding models)
-- Use `{class}` template variable for queue naming
+- Use `{workspace}` and `{class}` template variables for queue naming
 - Settings can be fixed values or parameterized with `{parameter-name}` syntax
-- Examples: `embeddings:{class}`, `text-completion:{class}`, `graph-rag:{class}`
+- Examples: `embeddings:{workspace}:{class}`, `text-completion:{workspace}:{class}`

 ### 2. Flow Section
 Defines flow-specific processors that are instantiated for each individual flow instance. Each flow gets its own isolated set of these processors.
@@ -43,8 +43,8 @@ Defines flow-specific processors that are instantiated for each individual flow
 ```json
 "flow": {
  "processor-name:{id}": {
-    "input": "queue-pattern:{id}",
-    "output": "queue-pattern:{id}",
+    "input": "queue-pattern:{workspace}:{id}",
+    "output": "queue-pattern:{workspace}:{id}",
    "settings": {
      "setting-name": "fixed-value",
      "parameterized-setting": "{parameter-name}"
@@ -56,9 +56,9 @@ Defines flow-specific processors that are instantiated for each individual flow
 **Characteristics:**
 - Unique instance per flow
 - Handle flow-specific data and state
-- Use `{id}` template variable for queue naming
+- Use `{workspace}` and `{id}` template variables for queue naming
 - Settings can be fixed values or parameterized with `{parameter-name}` syntax
-- Examples: `chunker:{id}`, `pdf-decoder:{id}`, `kg-extract-relationships:{id}`
+- Examples: `chunker:{workspace}:{id}`, `pdf-decoder:{workspace}:{id}`

 ### 3. Interfaces Section
 Defines the entry points and interaction contracts for the flow. These form the API surface for external systems and internal component communication.
@@ -68,8 +68,8 @@ Interfaces can take two forms:
 **Fire-and-Forget Pattern** (single queue):
 ```json
 "interfaces": {
-  "document-load": "persistent://tg/flow/document-load:{id}",
-  "triples-store": "persistent://tg/flow/triples-store:{id}"
+  "document-load": "persistent://tg/flow/{workspace}:document-load:{id}",
+  "triples-store": "persistent://tg/flow/{workspace}:triples-store:{id}"
 }
 ```

@@ -77,8 +77,8 @@ Interfaces can take two forms:
 ```json
 "interfaces": {
  "embeddings": {
-    "request": "non-persistent://tg/request/embeddings:{class}",
-    "response": "non-persistent://tg/response/embeddings:{class}"
+    "request": "non-persistent://tg/request/{workspace}:embeddings:{class}",
+    "response": "non-persistent://tg/response/{workspace}:embeddings:{class}"
  }
 }
 ```
@@ -117,6 +117,16 @@ Additional information about the flow blueprint:

 ### System Variables

+#### {workspace}
+- Replaced with the workspace identifier
+- Isolates queue names between workspaces so that two workspaces
+  starting the same flow do not share queues
+- Must be included in every queue name pattern of every blueprint
+  template to ensure workspace isolation
+- Example: `ws-acme`, `ws-globex`
+
 #### {id}
 - Replaced with the unique flow instance identifier
 - Creates isolated resources for each flow
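
The substitution of `{workspace}` and `{id}` into queue name patterns can be sketched as below. This is an illustration under assumptions, not the blueprint loader itself: the function name and the flow identifier `flow-1` are hypothetical, and only the pattern strings come from the examples in this diff.

```python
def expand_queue_pattern(pattern: str, workspace: str, **variables: str) -> str:
    """Substitute system variables into a queue name pattern."""
    # Enforce the rule that every queue pattern carries {workspace}.
    if "{workspace}" not in pattern:
        raise ValueError(f"queue pattern lacks {{workspace}}: {pattern}")
    result = pattern.replace("{workspace}", workspace)
    for name, value in variables.items():
        result = result.replace("{" + name + "}", value)
    return result


pattern = "persistent://tg/flow/{workspace}:document-load:{id}"
acme = expand_queue_pattern(pattern, "ws-acme", id="flow-1")
globex = expand_queue_pattern(pattern, "ws-globex", id="flow-1")
```

Two workspaces starting a flow with the same identifier end up on different queues, which is the isolation property the `{workspace}` variable is introduced for.
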
--- a/docs/tech-specs/iam.md
+++ b/docs/tech-specs/iam.md
@@ -0,0 +1,858 @@
+---
+layout: default
+title: "Identity and Access Management"
+parent: "Tech Specs"
+---
+
+# Identity and Access Management
+
+## Problem Statement
+
+TrustGraph has no meaningful identity or access management. The system
+relies on a single shared gateway token for authentication and an
+honour-system `user` query parameter for data isolation. This creates
+several problems:
+
+- **No user identity.** There are no user accounts, no login, and no way
+  to know who is making a request. The `user` field in message metadata
+  is a caller-supplied string with no validation — any client can claim
+  to be any user.
+
+- **No access control.** A valid gateway token grants unrestricted access
+  to every endpoint, every user's data, every collection, and every
+  administrative operation. There is no way to limit what an
+  authenticated caller can do.
+
+- **No credential isolation.** All callers share one static token. There
+  is no per-user credential, no token expiration, and no rotation
+  mechanism. Revoking access means changing the shared token, which
+  affects all callers.
+
+- **Data isolation is unenforced.** Storage backends (Cassandra, Neo4j,
+  Qdrant) filter queries by `user` and `collection`, but the gateway
+  does not prevent a caller from specifying another user's identity.
+  Cross-user data access is trivial.
+
+- **No audit trail.** There is no logging of who accessed what. Without
+  user identity, audit logging is impossible.
+
+These gaps make the system unsuitable for multi-user deployments,
+multi-tenant SaaS, or any environment where access needs to be
+controlled or audited.
+
+## Current State
+
+### Authentication
+
+The API gateway supports a single shared token configured via the
+`GATEWAY_SECRET` environment variable or `--api-token` CLI argument. If
+unset, authentication is disabled entirely. When enabled, every HTTP
+endpoint requires an `Authorization: Bearer <token>` header. WebSocket
+connections pass the token as a query parameter.
+
+Implementation: `trustgraph-flow/trustgraph/gateway/auth.py`
+
+```python
+class Authenticator:
+    def __init__(self, token=None, allow_all=False):
+        self.token = token
+        self.allow_all = allow_all
+
+    def permitted(self, token, roles):
+        if self.allow_all: return True
+        if self.token != token: return False
+        return True
+```
+
+The `roles` parameter is accepted but never evaluated. All authenticated
+requests have identical privileges.
+
+MCP tool configurations support an optional per-tool `auth-token` for
+service-to-service authentication with remote MCP servers. These are
+static, system-wide tokens — not per-user credentials. See
+[mcp-tool-bearer-token.md](mcp-tool-bearer-token.md) for details.
+
+### User identity
+
+The `user` field is passed explicitly by the caller as a query parameter
+(e.g. `?user=trustgraph`) or set by CLI tools. It flows through the
+system in the core `Metadata` dataclass:
+
+```python
+@dataclass
+class Metadata:
+    id: str = ""
+    root: str = ""
+    user: str = ""
+    collection: str = ""
+```
+
+There is no user registration, login, user database, or session
+management.
+
+### Data isolation
+
+The `user` + `collection` pair is used at the storage layer to partition
+data:
+
+- **Cassandra**: queries filter by `user` and `collection` columns
+- **Neo4j**: queries filter by `user` and `collection` properties
+- **Qdrant**: vector search filters by `user` and `collection` metadata
+
+| Layer | Isolation mechanism | Enforced by |
+|-------|-------------------|-------------|
+| Gateway | Single shared token | `Authenticator` class |
+| Message metadata | `user` + `collection` fields | Caller (honour system) |
+| Cassandra | Column filters on `user`, `collection` | Query layer |
+| Neo4j | Property filters on `user`, `collection` | Query layer |
+| Qdrant | Metadata filters on `user`, `collection` | Query layer |
+| Pub/sub topics | Per-flow topic namespacing | Flow service |
+
+The storage-layer isolation depends on all queries correctly filtering by
+`user` and `collection`. There is no gateway-level enforcement preventing
+a caller from querying another user's data by passing a different `user`
+parameter.
+
+### Configuration and secrets
+
+| Setting | Source | Default | Purpose |
+|---------|--------|---------|---------|
+| `GATEWAY_SECRET` | Env var | Empty (auth disabled) | Gateway bearer token |
+| `--api-token` | CLI arg | None | Gateway bearer token (overrides env) |
+| `PULSAR_API_KEY` | Env var | None | Pub/sub broker auth |
+| MCP `auth-token` | Config service | None | Per-tool MCP server auth |
+
+No secrets are encrypted at rest. The gateway token and MCP tokens are
+stored and transmitted in plaintext (aside from any transport-layer
+encryption such as TLS).
+
+### Capabilities that do not exist
+
+- Per-user authentication (JWT, OAuth, SAML, API keys per user)
+- User accounts or user management
+- Role-based access control (RBAC)
+- Attribute-based access control (ABAC)
+- Per-user or per-workspace API keys
+- Token expiration or rotation
+- Session management
+- Per-user rate limiting
+- Audit logging of user actions
+- Permission checks preventing cross-user data access
+- Multi-workspace credential isolation
+
+### Key files
+
+| File | Purpose |
+|------|---------|
+| `trustgraph-flow/trustgraph/gateway/auth.py` | Authenticator class |
+| `trustgraph-flow/trustgraph/gateway/service.py` | Gateway init, token config |
+| `trustgraph-flow/trustgraph/gateway/endpoint/*.py` | Per-endpoint auth checks |
+| `trustgraph-base/trustgraph/schema/core/metadata.py` | `Metadata` dataclass with `user` field |
+
+## Technical Design
+
+### Design principles
+
+- **Auth at the edge.** The gateway is the single enforcement point.
+  Internal services trust the gateway and do not re-authenticate.
+  This avoids distributing credential validation across dozens of
+  microservices.
+
+- **Identity from credentials, not from callers.** The gateway derives
+  user identity from authentication credentials. Callers can no longer
+  self-declare their identity via query parameters.
+
+- **Workspace isolation by default.** Every authenticated user belongs to
+  a workspace. All data operations are scoped to that workspace.
+  Cross-workspace access is not possible through the API.
+
+- **Extensible API contract.** The API accepts an optional workspace
+  parameter on every request. This allows the same protocol to support
+  single-workspace deployments today and multi-workspace extensions in
+  the future without breaking changes.
+
+- **Simple roles, not fine-grained permissions.** A small number of
+  predefined roles controls what operations a user can perform. This is
+  sufficient for the current API surface and avoids the complexity of
+  per-resource permission management.
+
+### Authentication
+
+The gateway supports two credential types. Both are carried as a Bearer
+token in the `Authorization` header for HTTP requests. The gateway
+distinguishes them by format.
+
+For WebSocket connections, credentials are not passed in the URL or
+headers. Instead, the client authenticates after connecting by sending
+an auth message as the first frame:
+
+```
+Client: opens WebSocket to /api/v1/socket
+Server: accepts connection (unauthenticated state)
+Client: sends {"type": "auth", "token": "tg_abc123..."}
+Server: validates token
+  success → {"type": "auth-ok", "workspace": "acme"}
+  failure → {"type": "auth-failed", "error": "invalid token"}
+```
+
+The server rejects all non-auth messages until authentication succeeds.
+The socket remains open on auth failure, allowing the client to retry
+with a different token without reconnecting. The client can also send
+a new auth message at any time to re-authenticate — for example, to
+refresh an expiring JWT or to switch workspace. The
+resolved identity (user, workspace, roles) is updated on each
+successful auth.
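
The first-frame protocol above can be modelled as a small per-connection state machine. This is a sketch under assumptions: the `resolve` callback, the `error` and `accepted` reply shapes for non-auth traffic, and the choice to keep an earlier identity after a failed re-authentication are all hypothetical and not specified by the text.

```python
from typing import Callable, Optional


class SocketSession:
    """Per-connection auth state for the first-frame protocol."""

    def __init__(self, resolve: Callable[[str], Optional[dict]]):
        # `resolve` is a hypothetical callback: token -> identity dict or None.
        self.resolve = resolve
        self.identity: Optional[dict] = None  # None means unauthenticated

    def handle(self, message: dict) -> dict:
        if message.get("type") == "auth":
            identity = self.resolve(message.get("token", ""))
            if identity is None:
                # The socket stays open so the client can retry.
                # Assumption: an earlier identity is left untouched.
                return {"type": "auth-failed", "error": "invalid token"}
            self.identity = identity  # updated on every successful auth
            return {"type": "auth-ok", "workspace": identity["workspace"]}
        if self.identity is None:
            # Hypothetical reply shape for messages sent before auth.
            return {"type": "error", "error": "not authenticated"}
        return {"type": "accepted"}  # placeholder for normal dispatch


tokens = {"tg_abc123": {"user": "alice", "workspace": "acme", "roles": ["reader"]}}
session = SocketSession(tokens.get)
```

Calling `session.handle({"type": "auth", "token": "tg_abc123"})` moves the connection into the authenticated state; sending the same message type again later re-authenticates it.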
+
+#### API keys
+
+For programmatic access: CLI tools, scripts, and integrations.
+
+- Opaque tokens (e.g. `tg_a1b2c3d4e5f6...`). Not JWTs — short,
+  simple, easy to paste into CLI tools and headers.
+- Each user has one or more API keys.
+- Keys are stored hashed (SHA-256 with salt) in the IAM service. The
+  plaintext key is returned once at creation time and cannot be
+  retrieved afterwards.
+- Keys can be revoked individually without affecting other users.
+- Keys optionally have an expiry date. Expired keys are rejected.
+
+On each request, the gateway resolves an API key by:
+
+1. Hashing the token.
+2. Checking a local cache (hash → user/workspace/roles).
+3. On cache miss, calling the IAM service to resolve.
+4. Caching the result with a short TTL (e.g. 60 seconds).
+
+A revoked key keeps working until its cache entry expires, so
+revocation takes effect within one TTL. No push invalidation is needed.
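
The lookup steps above can be sketched as a small resolver with a TTL cache. This is a minimal sketch with assumptions: `lookup` stands in for the IAM service call, the injectable `clock` exists only to make expiry testable, and the salt mentioned above is left out for brevity.

```python
import hashlib
import time
from typing import Callable, Optional


class ApiKeyResolver:
    def __init__(self, lookup: Callable[[str], Optional[dict]],
                 ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.lookup = lookup  # hypothetical IAM call: key hash -> identity
        self.ttl = ttl
        self.clock = clock
        self.cache: dict = {}  # key hash -> (expiry time, identity)

    def resolve(self, token: str) -> Optional[dict]:
        # Salting is omitted here; plain SHA-256 keeps the sketch short.
        key_hash = hashlib.sha256(token.encode("utf-8")).hexdigest()
        now = self.clock()
        entry = self.cache.get(key_hash)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit
        identity = self.lookup(key_hash)  # cache miss: ask the IAM service
        if identity is None:
            self.cache.pop(key_hash, None)  # drop any stale entry
            return None
        self.cache[key_hash] = (now + self.ttl, identity)
        return identity
```

With this shape, a revoked key is rejected on the first request after its cache entry expires, because the fresh lookup returns nothing.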
+
+#### JWTs (login sessions)
+
+For interactive access via the UI or WebSocket connections.
+
+- A user logs in with username and password. The gateway forwards the
+  request to the IAM service, which validates the credentials and
+  returns a signed JWT.
+- The JWT carries the user ID, workspace, and roles as claims.
+- The gateway validates JWTs locally using the IAM service's public
+  signing key — no service call needed on subsequent requests.
+- Token expiry is enforced by standard JWT validation at the time the
+  request is made (or, for WebSocket, when the auth message is
+  received).
+- For long-lived WebSocket connections, the JWT is validated only when
+  an auth message is received. The connection remains authenticated
+  until it closes or the client re-authenticates.
+
+The IAM service manages the signing key. The gateway fetches the public
+key at startup (or on first JWT encounter) and caches it.
+
+#### Login endpoint
+
+```
+POST /api/v1/auth/login
+{
+    "username": "alice",
+    "password": "..."
+}
+→ {
+    "token": "eyJ...",
+    "expires": "2026-04-20T19:00:00Z"
+}
+```
+
+The gateway forwards this to the IAM service, which validates
+credentials and returns a signed JWT. The gateway returns the JWT to
+the caller.
+
+#### IAM service delegation
+
+The gateway stays thin. Its authentication logic is:
+
+1. Extract the Bearer token from the `Authorization` header (or from
+   the first-frame auth message for WebSocket).
+2. If the token has JWT format (dotted structure), validate the
+   signature locally and extract claims.
+3. Otherwise, treat as an API key: hash it and check the local cache.
+   On cache miss, call the IAM service to resolve.
+4. If neither succeeds, return 401.
+
+All user management, key management, credential validation, and token
+signing logic lives in the IAM service. The gateway is a generic
+enforcement point that can be replaced without changing the IAM
+service.
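
The format check in steps 1 to 3 can be sketched as follows. It only selects a validation path and validates nothing itself; both function names are hypothetical.

```python
from typing import Optional


def bearer_token(authorization: str) -> Optional[str]:
    """Extract the token from an `Authorization: Bearer <token>` value."""
    scheme, _, value = authorization.partition(" ")
    if scheme.lower() != "bearer" or not value.strip():
        return None
    return value.strip()


def classify_token(token: str) -> str:
    """Return "jwt" or "api-key" based on token shape alone."""
    parts = token.split(".")
    # A compact-serialised signed JWT has exactly three non-empty,
    # dot-separated segments: header, payload, signature.
    if len(parts) == 3 and all(parts):
        return "jwt"
    return "api-key"


kind = classify_token(bearer_token("Bearer tg_a1b2c3d4e5f6"))
```

A token classified as `jwt` still has to pass signature and expiry validation, and one classified as `api-key` still has to resolve through the cache or the IAM service; otherwise the gateway returns 401.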
+
+#### No legacy token support
+
+The existing `GATEWAY_SECRET` shared token is removed. All
+authentication uses API keys or JWTs. On first start, the bootstrap
+process creates a default workspace and admin user with an initial API
+key.
+
+### User identity
+
+A user belongs to exactly one workspace. The design supports extending
+this to multi-workspace access in the future (see
+[Extension points](#extension-points)).
+
+A user record contains:
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `id` | string | Unique user identifier (UUID) |
+| `name` | string | Display name |
+| `email` | string | Email address (optional) |
+| `workspace` | string | Workspace the user belongs to |
+| `roles` | list[string] | Assigned roles (e.g. `["reader"]`) |
+| `enabled` | bool | Whether the user can authenticate |
+| `created` | datetime | Account creation timestamp |
+
+The `workspace` field maps to the existing `user` field in `Metadata`.
+This means the storage-layer isolation (Cassandra, Neo4j, Qdrant
+filtering by `user` + `collection`) works without changes — the gateway
+sets the `user` metadata field to the authenticated user's workspace.
+
+### Workspaces
+
+A workspace is an isolated data boundary. Users belong to a workspace,
+and all data operations are scoped to it. Workspaces map to the existing
+`user` field in `Metadata` and the corresponding Cassandra keyspace,
+Qdrant collection prefix, and Neo4j property filters.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `id` | string | Unique workspace identifier |
+| `name` | string | Display name |
+| `enabled` | bool | Whether the workspace is active |
+| `created` | datetime | Creation timestamp |
+
+All data operations are scoped to a workspace. The gateway determines
+the effective workspace for each request as follows:
+
+1. If the request includes a `workspace` parameter, validate it against
+   the user's assigned workspace.
+   - If it matches, use it.
+   - If it does not match, return 403. (This could be extended to
+     check a workspace access grant list.)
+2. If no `workspace` parameter is provided, use the user's assigned
+   workspace.
+
+The gateway sets the `user` field in `Metadata` to the effective
+workspace ID, replacing the caller-supplied `?user=` query parameter.
+
+This design ensures forward compatibility. Clients that pass a
+workspace parameter will work unchanged if multi-workspace support is
+added later. Requests for an unassigned workspace get a clear 403
+rather than silent misbehaviour.
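
The workspace resolution rule above is small enough to sketch directly. The function and exception names are hypothetical; the behaviour follows the two numbered rules.

```python
from typing import Optional


class Forbidden(Exception):
    """Maps to an HTTP 403 response in this sketch."""


def effective_workspace(assigned: str, requested: Optional[str] = None) -> str:
    if requested is None:
        return assigned  # rule 2: default to the assigned workspace
    if requested == assigned:
        return requested  # rule 1: explicit parameter that matches
    # Rule 1, mismatch case. A grant-list check could slot in here later.
    raise Forbidden(f"no access to workspace {requested!r}")


workspace = effective_workspace("acme", "acme")
```

The returned value is what the gateway would place in the `user` field of `Metadata`, in place of the caller-supplied `?user=` parameter.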
+
+### Roles and access control
+
+Three roles with fixed permissions:
+
+| Role | Data operations | Admin operations | System |
+|------|----------------|-----------------|--------|
+| `reader` | Query knowledge graph, embeddings, RAG | None | None |
+| `writer` | All reader operations + load documents, manage collections | None | None |
+| `admin` | All writer operations | Config, flows, collection management, user management | Metrics |
+
+Role checks happen at the gateway before dispatching to backend
+services. Each endpoint declares the minimum role required:
+
+| Endpoint pattern | Minimum role |
+|-----------------|--------------|
+| `GET /api/v1/socket` (queries) | `reader` |
+| `POST /api/v1/librarian` | `writer` |
+| `POST /api/v1/flow/*/import/*` | `writer` |
+| `POST /api/v1/config` | `admin` |
+| `GET /api/v1/flow/*` | `admin` |
+| `GET /api/metrics` | `admin` |
+
+Roles are hierarchical: `admin` implies `writer`, which implies
+`reader`.
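
The hierarchical role check can be sketched with a rank per role and a table of endpoint patterns. The patterns and minimum roles are copied from the table above; the glob matching via `fnmatchcase`, the function name, and the deny-by-default fallback for unlisted endpoints are assumptions.

```python
from fnmatch import fnmatchcase

# Higher rank implies every lower role.
ROLE_RANK = {"reader": 1, "writer": 2, "admin": 3}

# (method, path pattern, minimum role), copied from the table above.
ENDPOINT_ROLES = [
    ("GET", "/api/v1/socket", "reader"),
    ("POST", "/api/v1/librarian", "writer"),
    ("POST", "/api/v1/flow/*/import/*", "writer"),
    ("POST", "/api/v1/config", "admin"),
    ("GET", "/api/v1/flow/*", "admin"),
    ("GET", "/api/metrics", "admin"),
]


def is_permitted(roles, method: str, path: str) -> bool:
    for rule_method, pattern, minimum in ENDPOINT_ROLES:
        if method == rule_method and fnmatchcase(path, pattern):
            # Unknown role names rank as 0 and grant nothing.
            best = max((ROLE_RANK.get(r, 0) for r in roles), default=0)
            return best >= ROLE_RANK[minimum]
    return False  # assumption: endpoints not listed are denied


allowed = is_permitted(["writer"], "POST", "/api/v1/librarian")
```

Because the comparison is on rank, a user holding only `admin` passes every check that `writer` or `reader` would pass.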
+
+### IAM service
+
+The IAM service is a new backend service that manages all identity and
+access data. It is the authority for users, workspaces, API keys, and
+credentials. The gateway delegates to it.
+
+#### Data model
+
+```
+iam_workspaces (
+    id text PRIMARY KEY,
+    name text,
+    enabled boolean,
+    created timestamp
+)
+
+iam_users (
+    id text PRIMARY KEY,
+    workspace text,
+    name text,
+    email text,
+    password_hash text,
+    roles set<text>,
+    enabled boolean,
+    created timestamp
+)
+
+iam_api_keys (
+    key_hash text PRIMARY KEY,
+    user_id text,
+    name text,
+    expires timestamp,
+    created timestamp
+)
+```
+
+A secondary index on `iam_api_keys.user_id` supports listing a user's
+keys.
+
+#### Responsibilities
+
+- User CRUD (create, list, update, disable)
+- Workspace CRUD (create, list, update, disable)
+- API key management (create, revoke, list)
+- API key resolution (hash → user/workspace/roles)
+- Credential validation (username/password → signed JWT)
+- JWT signing key management (initialise, rotate)
+- Bootstrap (create default workspace and admin user on first start)
+
+#### Communication
+
+The IAM service communicates via the standard request/response pub/sub
+pattern, the same as the config service. The gateway calls it to
+resolve API keys and to handle login requests. User management
+operations (create user, revoke key, etc.) also go through the IAM
+service.
+
+### Gateway changes
+
+The current `Authenticator` class is replaced with a thin authentication
+middleware that delegates to the IAM service:
+
+For HTTP requests:
+
+1. Extract Bearer token from the `Authorization` header.
+2. If the token has JWT format (dotted structure):
+   - Validate signature locally using the cached public key.
+   - Extract user ID, workspace, and roles from claims.
+3. Otherwise, treat as an API key:
+   - Hash the token and check the local cache.
+   - On cache miss, call the IAM service to resolve.
+   - Cache the result (user/workspace/roles) with a short TTL.
+4. If neither succeeds, return 401.
+5. If the user or workspace is disabled, return 403.
+6. Check the user's role against the endpoint's minimum role. If
+   insufficient, return 403.
+7. Resolve the effective workspace:
+   - If the request includes a `workspace` parameter, validate it
+     against the user's assigned workspace. Return 403 on mismatch.
+   - If no `workspace` parameter, use the user's assigned workspace.
+8. Set the `user` field in the request context to the effective
+   workspace ID. This propagates through `Metadata` to all downstream
+   services.
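+
+The token handling in steps 2, 3, and 7 can be sketched as follows. This
+is a minimal illustration, not the gateway implementation: `Identity`,
+`ApiKeyCache`, `looks_like_jwt`, and `effective_workspace` are
+hypothetical names, SHA-256 is an assumed hash function, and the
+60-second TTL is an arbitrary stand-in for "short".
+
+```python
+import hashlib
+import time
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class Identity:
+    user_id: str
+    workspace: str
+    roles: tuple
+
+
+def looks_like_jwt(token: str) -> bool:
+    # Step 2: a JWT has three non-empty, dot-separated segments.
+    parts = token.split(".")
+    return len(parts) == 3 and all(parts)
+
+
+class ApiKeyCache:
+    """Step 3: key hash -> Identity cache with a short TTL."""
+
+    def __init__(self, resolve, ttl_seconds=60.0, clock=time.monotonic):
+        self._resolve = resolve      # stand-in for the IAM service call
+        self._ttl = ttl_seconds      # assumed value, not from the spec
+        self._clock = clock
+        self._entries = {}           # key hash -> (expiry, Identity)
+
+    def lookup(self, token: str):
+        # SHA-256 is an assumption; the spec only says the token is hashed.
+        key_hash = hashlib.sha256(token.encode("utf-8")).hexdigest()
+        hit = self._entries.get(key_hash)
+        if hit is not None and hit[0] > self._clock():
+            return hit[1]
+        identity = self._resolve(key_hash)   # None if the key is unknown
+        if identity is not None:
+            self._entries[key_hash] = (self._clock() + self._ttl, identity)
+        return identity
+
+
+def effective_workspace(identity: Identity, requested=None) -> str:
+    # Step 7: a requested workspace must match the assigned one.
+    if requested is not None and requested != identity.workspace:
+        raise PermissionError("workspace mismatch")  # maps to HTTP 403
+    return identity.workspace
+```
+
+A caller would route a token through `looks_like_jwt` first and fall
+back to `ApiKeyCache.lookup` otherwise. The short TTL bounds how long a
+revoked key can keep resolving from the cache.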
+
+For WebSocket connections:
+
+1. Accept the connection in an unauthenticated state.
+2. Wait for an auth message (`{"type": "auth", "token": "..."}`).
+3. Validate the token using the same logic as steps 2-7 above.
+4. On success, attach the resolved identity to the connection and
+   send `{"type": "auth-ok", ...}`.
+5. On failure, send `{"type": "auth-failed", ...}` but keep the
+   socket open.
+6. Reject all non-auth messages until authentication succeeds.
+7. Accept new auth messages at any time to re-authenticate.
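+
+The connection states above can be sketched as a small per-connection
+object. This is an illustration under assumptions: `WsAuthSession` is a
+hypothetical name, `validate_token` stands in for the HTTP token logic,
+and the shape of the rejection reply for non-auth messages is invented
+for the example. The sketch also assumes that a failed re-authentication
+leaves an earlier identity in place, which the steps above do not
+specify.
+
+```python
+import json
+
+
+class WsAuthSession:
+    """Tracks the authentication state of one WebSocket connection."""
+
+    def __init__(self, validate_token):
+        self._validate = validate_token   # returns an identity or None
+        self.identity = None              # step 1: unauthenticated
+
+    def handle(self, raw: str):
+        """Return (reply, forward).
+
+        reply is a JSON string to send back, or None.
+        forward is a decoded message to pass downstream, or None.
+        """
+        msg = json.loads(raw)
+        if msg.get("type") == "auth":
+            # Steps 2-5 and 7: auth messages are accepted at any time.
+            identity = self._validate(msg.get("token", ""))
+            if identity is None:
+                # The socket stays open; only the reply signals failure.
+                return json.dumps({"type": "auth-failed"}), None
+            self.identity = identity
+            return json.dumps({"type": "auth-ok"}), None
+        if self.identity is None:
+            # Step 6: reject everything else until authenticated.
+            # The reply format here is a placeholder.
+            reply = {"type": "error", "error": "not authenticated"}
+            return json.dumps(reply), None
+        return None, msg
+```
+
+Keeping the state in one object per connection means re-authentication
+is just another call to `handle` with a new auth message.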
+
+### CLI changes
+
+CLI tools authenticate with API keys:
+
+- `--api-key` argument on all CLI tools, replacing `--api-token`.
+- `tg-create-workspace`, `tg-list-workspaces` for workspace management.
+- `tg-create-user`, `tg-list-users`, `tg-disable-user` for user
+  management.
+- `tg-create-api-key`, `tg-list-api-keys`, `tg-revoke-api-key` for
+  key management.
+- `--workspace` argument on tools that operate on workspace-scoped
+  data.
+- The API key is passed as a Bearer token in the same way as the
+  current shared token, so the transport protocol is unchanged.
+
+### Audit logging
+
+With user identity established, the gateway logs the following for each
+request: timestamp, user ID, workspace, endpoint, HTTP method, and
+response status.
+
+Audit logs are written to the standard logging output as structured
+JSON. Integration with external log aggregation (Loki, ELK) is a
+deployment concern, not an application concern.
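+
+One possible shape for an audit entry is sketched below. The JSON field
+names, the `audit` logger name, and the helper functions are assumptions
+made for illustration; only the set of logged fields comes from this
+spec.
+
+```python
+import json
+import logging
+from datetime import datetime, timezone
+
+audit_logger = logging.getLogger("audit")   # logger name is an assumption
+
+
+def audit_record(user_id, workspace, endpoint, method, status, now=None):
+    """Build one audit entry carrying the fields the gateway logs."""
+    now = now or datetime.now(timezone.utc)
+    return {
+        "timestamp": now.isoformat(),
+        "user_id": user_id,
+        "workspace": workspace,
+        "endpoint": endpoint,
+        "method": method,
+        "status": status,
+    }
+
+
+def log_request(**fields):
+    # One JSON object per line on the standard logging output.
+    audit_logger.info(json.dumps(audit_record(**fields), sort_keys=True))
+```
+
+Emitting one JSON object per line keeps the output easy to parse for
+whichever log aggregator a deployment uses.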
+
+### Config service changes
+
+All configuration is workspace-scoped (see
+[data-ownership-model.md](data-ownership-model.md)). The config service
+needs to support this.
+
+#### Schema change
+
+The config table adds workspace as a key dimension:
+
+```
+config (
+    workspace text,
+    class text,
+    key text,
+    value text,
+    PRIMARY KEY ((workspace, class), key)
+)
+```
+
+#### Request format
+
+Config requests add a `workspace` field at the request level. The
+existing `(type, key)` structure is unchanged within each workspace.
+
+**Get:**
+```json
+{
+    "operation": "get",
+    "workspace": "workspace-a",
+    "keys": [{"type": "prompt", "key": "rag-prompt"}]
+}
+```
+
+**Put:**
+```json
+{
+    "operation": "put",
+    "workspace": "workspace-a",
+    "values": [{"type": "prompt", "key": "rag-prompt", "value": "..."}]
+}
+```
+
+**List (all keys of a type within a workspace):**
+```json
+{
+    "operation": "list",
+    "workspace": "workspace-a",
+    "type": "prompt"
+}
+```
+
+**Delete:**
+```json
+{
+    "operation": "delete",
+    "workspace": "workspace-a",
+    "keys": [{"type": "prompt", "key": "rag-prompt"}]
+}
+```
+
+The workspace is set by:
+
+- **Gateway** — from the authenticated user's workspace for API-facing
+  requests.
+- **Internal services** — explicitly, based on `Metadata.user` from
+  the message being processed, or `_system` for operational config.
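+
+As a rough sketch of the internal-service case, a request builder might
+look like the following. `config_request`, `SYSTEM_WORKSPACE`, and
+`message_user` are hypothetical names; the request fields mirror the
+JSON examples above.
+
+```python
+import json
+
+SYSTEM_WORKSPACE = "_system"   # reserved workspace for operational config
+
+
+def config_request(operation, workspace, **fields):
+    """Build a config request with the workspace at the request level."""
+    request = {"operation": operation, "workspace": workspace}
+    request.update(fields)
+    return request
+
+
+# An internal service takes the workspace from the message it is handling.
+message_user = "workspace-a"   # stand-in for Metadata.user
+get_prompt = config_request(
+    "get", message_user, keys=[{"type": "prompt", "key": "rag-prompt"}]
+)
+
+# Operational config names the reserved workspace explicitly.
+list_logging = config_request("list", SYSTEM_WORKSPACE, type="logging")
+
+print(json.dumps(get_prompt, indent=4))
+```
+
+Making `workspace` a required argument means a caller cannot fall back
+to a default workspace by accident.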
+
+#### System config namespace
+
+Processor-level operational config (logging levels, connection strings,
+resource limits) is not workspace-specific. This stays in a reserved
+`_system` workspace that is not associated with any user workspace.
+Services read system config at startup without needing a workspace
+context.
+
+#### Config change notifications
+
+The config notify mechanism pushes change notifications via pub/sub
+when config is updated. A single update may affect multiple workspaces
+and multiple config types. The notification message carries a dict of
+changes keyed by config type, with each value being the list of
+affected workspaces:
+
+```json
+{
+    "version": 42,
+    "changes": {
+        "prompt": ["workspace-a", "workspace-b"],
+        "schema": ["workspace-a"]
+    }
+}
+```
+
+System config changes use the reserved `_system` workspace:
+
+```json
+{
+    "version": 43,
+    "changes": {
+        "logging": ["_system"]
+    }
+}
+```
+
+This structure is keyed by type because handlers register by type. A
+handler registered for `prompt` looks up `"prompt"` directly and gets
+the list of affected workspaces — no iteration over unrelated types.
+
+#### Config change handlers
+
+The current `on_config` hook mechanism needs two modes to support shared
+processing services:
+
+- **Workspace-scoped handlers** are notified when a config type changes in a
+  specific workspace. The handler looks up its registered type in the
+  changes dict and checks if its workspace is in the list. Used by the
+  gateway and by services that serve a single workspace.
+
+- **Global handlers** are notified when a config type changes in any
+  workspace. The handler looks up its registered type in the changes
+  dict and gets the full list of affected workspaces. Used by shared
+  processing services (prompt-rag, agent manager, etc.) that serve all
+  workspaces. Each workspace in the list tells the handler which cache
+  entry to update rather than reloading everything.
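+
+The two handler modes can be sketched as a small registry. This is an
+illustration only: `ConfigHandlers` and its method names are
+hypothetical and do not describe the existing `on_config` API. The
+notification format is the one shown under "Config change
+notifications".
+
+```python
+class ConfigHandlers:
+    """Registry for workspace-scoped and global config change handlers."""
+
+    def __init__(self):
+        self._scoped = []    # (config_type, workspace, callback)
+        self._global = []    # (config_type, callback)
+
+    def on_workspace(self, config_type, workspace, callback):
+        # Workspace-scoped mode: fires for one workspace only.
+        self._scoped.append((config_type, workspace, callback))
+
+    def on_any_workspace(self, config_type, callback):
+        # Global mode: fires for any workspace, with the affected list.
+        self._global.append((config_type, callback))
+
+    def dispatch(self, notification):
+        changes = notification["changes"]
+        version = notification["version"]
+        for config_type, workspace, callback in self._scoped:
+            # Direct lookup by type, then a membership check.
+            if workspace in changes.get(config_type, ()):
+                callback(version)
+        for config_type, callback in self._global:
+            affected = changes.get(config_type)
+            if affected:
+                callback(version, list(affected))
+```
+
+Because the changes dict is keyed by type, neither mode has to scan
+entries for types it did not register.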
+
+#### Per-workspace config caching
+
+Shared services that handle messages from multiple workspaces maintain a
+per-workspace config cache. When a message arrives, the service looks up
+the config for the workspace identified in `Metadata.user`. If the
+workspace is not yet cached, the service fetches its config on demand.
+Config change notifications update the relevant cache entry.
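+
+A minimal sketch of such a cache follows. `WorkspaceConfigCache` is a
+hypothetical name and `fetch` stands in for a request to the config
+service. Refreshing only workspaces that are already cached is one
+reasonable policy, not something this spec mandates.
+
+```python
+class WorkspaceConfigCache:
+    """Per-workspace config cache with on-demand fetch."""
+
+    def __init__(self, fetch):
+        self._fetch = fetch     # callable: workspace -> config dict
+        self._configs = {}      # workspace -> cached config
+
+    def get(self, workspace):
+        # Called per message with the workspace from Metadata.user.
+        if workspace not in self._configs:
+            self._configs[workspace] = self._fetch(workspace)
+        return self._configs[workspace]
+
+    def on_change(self, workspaces):
+        # Called with the affected workspaces from a change notification.
+        # Workspaces never seen by this service are left to load lazily.
+        for workspace in workspaces:
+            if workspace in self._configs:
+                self._configs[workspace] = self._fetch(workspace)
+```
+
+With this policy a service only holds config for workspaces it has
+actually processed messages for.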
+
+### Flow and queue isolation
+
+Flows are workspace-owned. When two workspaces start flows with the same
+name and blueprint, their queues must be separate to prevent data
+mixing.
+
+Flow blueprint templates currently use `{id}` (flow instance ID) and
+`{class}` (blueprint name) as template variables in queue names. A new
+`{workspace}` variable is added so queue names include the workspace:
+
+**Current queue names (no workspace isolation):**
+```
+flow:tg:document-load:{id}         → flow:tg:document-load:default
+request:tg:embeddings:{class}      → request:tg:embeddings:everything
+```
+
+**With workspace isolation:**
+```
+flow:tg:{workspace}:document-load:{id}      → flow:tg:ws-a:document-load:default
+request:tg:{workspace}:embeddings:{class}   → request:tg:ws-a:embeddings:everything
+```
+
+The flow service substitutes `{workspace}` from the authenticated
+workspace when starting a flow, the same way it substitutes `{id}` and
+`{class}` today.
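+
+The substitution can be illustrated with plain string formatting. This
+is a sketch rather than the flow service's code: `queue_name` is a
+hypothetical helper, and the templates are the examples shown above.
+
+```python
+def queue_name(template, workspace, flow_id, blueprint):
+    """Expand {workspace}, {id} and {class} in a queue name template."""
+    values = {"workspace": workspace, "id": flow_id, "class": blueprint}
+    # format_map takes the values as a dict, so the key "class" is fine
+    # even though it is a Python keyword.
+    return template.format_map(values)
+
+
+flow_queue = queue_name(
+    "flow:tg:{workspace}:document-load:{id}", "ws-a", "default", "everything"
+)
+request_queue = queue_name(
+    "request:tg:{workspace}:embeddings:{class}", "ws-a", "default", "everything"
+)
+print(flow_queue)      # flow:tg:ws-a:document-load:default
+print(request_queue)   # request:tg:ws-a:embeddings:everything
+```
+
+Two workspaces that start a flow with the same ID and blueprint get
+different queue names because only the `{workspace}` value differs.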
+
+Processing services are shared infrastructure — they consume from
+workspace-specific queues but are not themselves workspace-aware. The
+workspace is carried in `Metadata.user` on every message, so services
+know which workspace's data they are processing.
+
+Blueprint templates need updating to include `{workspace}` in all queue
+name patterns. For migration, the flow service can inject the workspace
+into queue names automatically if the template does not include
+`{workspace}`, defaulting to the legacy behaviour for existing
+blueprints.
+
+See [flow-class-definition.md](flow-class-definition.md) for the full
+blueprint template specification.
+
+### What changes and what doesn't
+
+**Changes:**
+
+| Component | Change |
+|-----------|--------|
+| `gateway/auth.py` | Replace `Authenticator` with new auth middleware |
+| `gateway/service.py` | Initialise IAM client, configure JWT validation |
+| `gateway/endpoint/*.py` | Add role requirement per endpoint |
+| Metadata propagation | Gateway sets `user` from workspace, ignores query param |
+| Config service | Add workspace dimension to config schema |
+| Config table | `PRIMARY KEY ((workspace, class), key)` |
+| Config request/response schema | Add `workspace` field |
+| Config notify messages | Include workspace ID in change notifications |
+| `on_config` handlers | Support workspace-scoped and global modes |
+| Shared services | Per-workspace config caching |
+| Flow blueprints | Add `{workspace}` template variable to queue names |
+| Flow service | Substitute `{workspace}` when starting flows |
+| CLI tools | New workspace, user, and API key management commands; `--api-key` and `--workspace` arguments |
+| Cassandra schema | New `iam_workspaces`, `iam_users`, `iam_api_keys` tables |
+
+**Does not change:**
+
+| Component | Reason |
+|-----------|--------|
+| Internal service-to-service pub/sub | Services trust the gateway |
+| `Metadata` dataclass | `user` field continues to carry workspace identity |
+| Storage-layer isolation | Same `user` + `collection` filtering |
+| Message serialisation | No schema changes |
+
+### Migration
+
+This is a breaking change. Existing deployments must be reconfigured:
+
+1. `GATEWAY_SECRET` is removed. Authentication requires API keys or
+   JWT login tokens.
+2. The `?user=` query parameter is removed. Workspace identity comes
+   from authentication.
+3. On first start, the IAM service bootstraps a default workspace and
+   admin user. The initial API key is output to the service log.
+4. Operators create additional workspaces and users via CLI tools.
+5. Flow blueprints must be updated to include `{workspace}` in queue
+   name patterns.
+6. Config data must be migrated to include the workspace dimension.
+
+## Extension points
+
+The design includes deliberate extension points for future capabilities.
+These are not implemented but the architecture does not preclude them:
+
+- **Multi-workspace access.** Users could be granted access to
+  additional workspaces beyond their primary assignment. The workspace
+  validation step checks a grant list instead of a single assignment.
+- **Rules-based access control.** A separate access control service
+  could evaluate fine-grained policies (per-collection permissions,
+  operation-level restrictions, time-based access). The gateway
+  delegates authorisation decisions to this service.
+- **External identity provider integration.** SAML, LDAP, and OIDC
+  flows (group mapping, claims-based role assignment) could be added
+  to the IAM service.
+- **Cross-workspace administration.** A `superadmin` role for platform
+  operators who manage multiple workspaces.
+- **Delegated workspace provisioning.** APIs for programmatic workspace
+  creation and user onboarding.
+
+These extensions are additive — they extend the validation logic
+without changing the request/response protocol. The gateway can be
+replaced with an alternative implementation that supports these
+capabilities while the IAM service and backend services remain
+unchanged.
+
+## Implementation plan
+
+Workspace support is a prerequisite for auth — users are assigned to
+workspaces, config is workspace-scoped, and flows use workspace in
+queue names. Implementing workspaces first allows the structural changes
+to be tested end-to-end without auth complicating debugging.
+
+### Phase 1: Workspace support (no auth)
+
+This phase covers all workspace-scoped data and processing changes. The
+system works with workspaces but without authentication: callers pass
+the workspace as a parameter on an honour-system basis. This allows
+full end-to-end testing of multiple workspaces with separate flows,
+config, queues, and data.
+
+#### Config service
+
+- Update config client API to accept a workspace parameter on all
+  requests
+- Update config storage schema to add workspace as a key dimension
+- Update config notification API to report changes as a dict of
+  type → workspace list
+- Update the processor base class to understand workspaces in config
+  notifications (workspace-scoped and global handler modes)
+- Update all processors to implement workspace-aware config handling
+  (per-workspace config caching, on-demand fetch)
+
+#### Flow and queue isolation
+
+- Update flow blueprints to include `{workspace}` in all queue name
+  patterns
+- Update the flow service to substitute `{workspace}` when starting
+  flows
+- Update all built-in blueprints to include `{workspace}`
+
+#### CLI tools (workspace support)
+
+- Add `--workspace` argument to CLI tools that operate on
+  workspace-scoped data
+- Add `tg-create-workspace`, `tg-list-workspaces` commands
+
+### Phase 2: Authentication and access control
+
+With workspaces working, add the IAM service and lock down the gateway.
+
+#### IAM service
+
+A new service handling identity and access management on behalf of the
+API gateway:
+
+- Add workspace table support (CRUD, enable/disable)
+- Add user table support (CRUD, enable/disable, workspace assignment)
+- Add roles support (role assignment, role validation)
+- Add API key support (create, revoke, list, hash storage)
+- Add ability to initialise a JWT signing key for token grants
+- Add token grant endpoint: user/password login returns a signed JWT
+- Add bootstrap/initialisation mechanism: ability to set the signing
+  key and create the initial workspace + admin user on first start
+
+#### API gateway integration
+
+- Add IAM middleware to the API gateway replacing the current
+  `Authenticator`
+- Add local JWT validation (public key from IAM service)
+- Add API key resolution with local cache (hash → user/workspace/roles,
+  cache miss calls IAM service, short TTL)
+- Add login endpoint forwarding to IAM service
+- Add workspace resolution: validate requested workspace against user
+  assignment
+- Add role-based endpoint access checks
+- Add user management API endpoints (forwarded to IAM service)
+- Add audit logging (user ID, workspace, endpoint, method, status)
+- Add WebSocket auth via first-message protocol (auth message after
+  connect, socket stays open on failure, re-auth supported)
+
+#### CLI tools (auth support)
+
+- Add `tg-create-user`, `tg-list-users`, `tg-disable-user` commands
+- Add `tg-create-api-key`, `tg-list-api-keys`, `tg-revoke-api-key`
+  commands
+- Replace `--api-token` with `--api-key` on existing CLI tools
+
+#### Bootstrap and cutover
+
+- Create default workspace and admin user on first start if IAM tables
+  are empty
+- Remove `GATEWAY_SECRET` and `?user=` query parameter support
+
+## Design decisions
+
+### IAM data store
+
+IAM data is stored in dedicated Cassandra tables owned by the IAM
+service, not in the config service. Reasons:
+
+- **Security isolation.** The config service has a broad, generic
+  protocol. An access control failure on the config service could
+  expose credentials. A dedicated IAM service with a purpose-built
+  protocol limits the attack surface and makes security auditing
+  clearer.
+- **Data model fit.** IAM needs indexed lookups (API key hash → user,
+  list keys by user). The config service's `(workspace, type, key) →
+  value` model stores opaque JSON strings with no secondary indexes.
+- **Scope.** IAM data is global (workspaces, users, keys). Config is
+  workspace-scoped. Mixing global and workspace-scoped data in the
+  same store adds complexity.
+- **Audit.** IAM operations (key creation, revocation, login attempts)
+  are security events that should be logged separately from general
+  config changes.
+
+## Deferred to future design
+
+- **External identity provider integration.** Support for SAML, LDAP,
+  and OIDC is left for future implementation. The extension points
+  section describes where this fits architecturally.
+- **API key scoping.** API keys could be scoped to specific collections
+  within a workspace rather than granting workspace-wide access. To be
+  designed when the need arises.
+- **Multi-workspace initialisation.** `tg-init-trustgraph` currently
+  initialises only a single workspace.
+
+## References
+
+- [Data Ownership and Information Separation](data-ownership-model.md)
+- [MCP Tool Bearer Token Specification](mcp-tool-bearer-token.md)
+- [Multi-Tenant Support Specification](multi-tenant-support.md)
+- [Neo4j User Collection Isolation](neo4j-user-collection-isolation.md)