feat: workspace-based multi-tenancy, replacing user as tenancy axis (#840)
Introduces `workspace` as the isolation boundary for config, flows,
library, and knowledge data. Removes `user` as a schema-level field
throughout the code, API specs, and tests; workspace provides the
same separation more cleanly at the trusted flow.workspace layer
rather than through client-supplied message fields.
Design
------
- IAM tech spec (docs/tech-specs/iam.md) documents current state,
proposed auth/access model, and migration direction.
- Data ownership model (docs/tech-specs/data-ownership-model.md)
captures the workspace/collection/flow hierarchy.
Schema + messaging
------------------
- Drop `user` field from AgentRequest/Step, GraphRagQuery,
DocumentRagQuery, Triples/Graph/Document/Row EmbeddingsRequest,
Sparql/Rows/Structured QueryRequest, ToolServiceRequest.
- Keep collection/workspace routing via flow.workspace at the
service layer.
- Translators updated to not serialise/deserialise user.
API specs
---------
- OpenAPI schemas and path examples cleaned of user fields.
- Websocket async-api messages updated.
- Removed the unused parameters/User.yaml.
Services + base
---------------
- Librarian, collection manager, knowledge, config: all operations
scoped by workspace. Config client API takes workspace as first
positional arg (see the sketch at the end of this section).
- `flow.workspace` set at flow start time by the infrastructure;
no longer pass-through from clients.
- Tool service drops user-personalisation passthrough.
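
A hypothetical sketch of the config-client change; the class and
method names are illustrative, not the actual SDK surface:

```python
# Sketch: config values are scoped by workspace, passed as the first
# positional argument, with no user field anywhere in the call.
class ConfigClient:
    def __init__(self, store: dict[tuple[str, str], str]):
        self.store = store

    def get(self, workspace: str, key: str) -> str:
        return self.store[(workspace, key)]

config = ConfigClient({("acme", "prompt.extract"): "Extract entities..."})
print(config.get("acme", "prompt.extract"))
```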
CLI + SDK
---------
- tg-init-workspace and workspace-aware import/export.
- All tg-* commands drop user args; accept --workspace.
- Python API/SDK (flow, socket_client, async_*, explainability,
library) drop user kwargs from every method signature.
MCP server
----------
- All tool endpoints drop user parameters; socket_manager no longer
keyed per user.
Flow service
------------
- Closure-based topic cleanup on flow stop: only delete topics
whose blueprint template was parameterised AND no remaining
live flow (across all workspaces) still resolves to that topic.
Four scopes fall out naturally from template analysis (sketched below):
* {id} -> per-flow, deleted on stop
* {blueprint} -> per-blueprint, kept while any flow of the
same blueprint exists
* {workspace} -> per-workspace, kept while any flow in the
workspace exists
* literal -> global, never deleted (e.g. tg.request.librarian)
Fixes a bug where stopping a flow silently destroyed the global
librarian exchange, wedging all library operations until manual
restart.
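
A minimal sketch of that cleanup rule, assuming flows are dicts with
id/blueprint/workspace keys; names are illustrative, not the actual
flow-service code:

```python
def topic_scope(template: str) -> str:
    """Classify a topic template by the placeholder it contains."""
    if "{id}" in template:
        return "per-flow"        # deleted when the flow stops
    if "{blueprint}" in template:
        return "per-blueprint"   # kept while any flow of the blueprint runs
    if "{workspace}" in template:
        return "per-workspace"   # kept while any flow in its workspace runs
    return "global"              # literal topics are never deleted

def may_delete(template: str, flow: dict, live_flows: list[dict]) -> bool:
    """Delete only if parameterised AND no remaining live flow
    (across all workspaces) still resolves to the same topic."""
    if topic_scope(template) == "global":
        return False
    topic = template.format(**flow)
    return not any(template.format(**f) == topic for f in live_flows)
```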
RabbitMQ backend
----------------
- heartbeat=60, blocked_connection_timeout=300. Catches silently
dead connections (broker restart, orphaned channels, network
partitions) within ~2 heartbeat windows, so the consumer
reconnects and re-binds its queue rather than sitting forever
on a zombie connection.
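
For illustration, with pika (a common Python RabbitMQ client) the
settings look like this; the host name and usage are assumptions:

```python
import pika

params = pika.ConnectionParameters(
    host="rabbitmq",
    heartbeat=60,                    # broker and client probe each other
    blocked_connection_timeout=300,  # raise instead of hanging when blocked
)
connection = pika.BlockingConnection(params)
channel = connection.channel()
```

A missed heartbeat surfaces as a raised connection error rather than an
indefinite block, which is what lets the consumer reconnect and re-bind.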
Tests
-----
- Full test refresh: unit, integration, contract, provenance.
- Dropped user-field assertions and constructor kwargs across
~100 test files.
- Renamed user-collection isolation tests to workspace-collection.
parent 9332089b3d
commit d35473f7f7
377 changed files with 6868 additions and 5785 deletions
docs/tech-specs/data-ownership-model.md (new file, 309 lines)
---
layout: default
title: "Data Ownership and Information Separation"
parent: "Tech Specs"
---

# Data Ownership and Information Separation

## Purpose

This document defines the logical ownership model for data in
TrustGraph: what the artefacts are, who owns them, and how they relate
to each other.

The IAM spec ([iam.md](iam.md)) describes authentication and
authorisation mechanics. This spec addresses the prior question: what
are the boundaries around data, and who owns what?

## Concepts

### Workspace

A workspace is the primary isolation boundary. It represents an
organisation, team, or independent operating unit. All data belongs to
exactly one workspace. Cross-workspace access is never permitted through
the API.

A workspace owns:
- Source documents
- Flows (processing pipeline definitions)
- Knowledge cores (stored extraction output)
- Collections (organisational units for extracted knowledge)

### Collection

A collection is an organisational unit within a workspace. It groups
extracted knowledge produced from source documents. A workspace can
have multiple collections, allowing:

- Processing the same documents with different parameters or models.
- Maintaining separate knowledge bases for different purposes.
- Deleting extracted knowledge without deleting source documents.

Collections do not own source documents. A source document exists at the
workspace level and can be processed into multiple collections.

### Source document

A source document (PDF, text file, etc.) is raw input uploaded to the
system. Documents belong to the workspace, not to a specific collection.

This is intentional. A document is an asset that exists independently
of how it is processed. The same PDF might be processed into multiple
collections with different chunking parameters or extraction models.
Tying a document to a single collection would force re-upload for each
collection.

### Flow

A flow defines a processing pipeline: which models to use, what
parameters to apply (chunk size, temperature, etc.), and how processing
services are connected. Flows belong to a workspace.

The processing services themselves (document-decoder, chunker,
embeddings, LLM completion, etc.) are shared infrastructure; they serve
all workspaces. Each flow has its own queues, keeping data from
different workspaces and flows separate as it moves through the
pipeline.

Different workspaces can define different flows. Workspace A might use
GPT-5.2 with a chunk size of 2000, while workspace B uses Claude with a
chunk size of 1000.
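
For illustration only, two such workspace-owned flow definitions might
differ like this; the dict shape and field names are assumptions, not
the actual flow schema:

```python
# Hypothetical flow definitions: same pipeline, per-workspace parameters.
flow_a = {
    "workspace": "workspace-a",
    "flow": "research-ingest",
    "parameters": {"llm-model": "gpt-5.2", "chunk-size": 2000},
}
flow_b = {
    "workspace": "workspace-b",
    "flow": "research-ingest",
    "parameters": {"llm-model": "claude", "chunk-size": 1000},
}
```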

### Prompts

Prompts are templates that control how the LLM behaves during knowledge
extraction and query answering. They belong to a workspace, allowing
different workspaces to have different extraction strategies, response
styles, or domain-specific instructions.

### Ontology

An ontology defines the concepts, entities, and relationships that the
extraction pipeline looks for in source documents. Ontologies belong to
a workspace. A medical workspace might define ontologies around diseases,
symptoms, and treatments, while a legal workspace defines ontologies
around statutes, precedents, and obligations.

### Schemas

Schemas define structured data types for extraction. They specify what
fields to extract, their types, and how they relate. Schemas belong to
a workspace, as different workspaces extract different structured
information from their documents.

### Tools, tool services, and MCP tools

Tools define capabilities available to agents: what actions they can
take, what external services they can call. Tool services configure how
tools connect to backend services. MCP tools configure connections to
remote MCP servers, including authentication tokens. All belong to a
workspace.

### Agent patterns and agent task types

Agent patterns define agent behaviour strategies (how an agent reasons,
what steps it follows). Agent task types define the kinds of tasks
agents can perform. Both belong to a workspace, as different workspaces
may have different agent configurations.

### Token costs

Token cost definitions specify pricing for LLM token usage per model.
These belong to a workspace since different workspaces may use different
models or have different billing arrangements.

### Flow blueprints

Flow blueprints are templates for creating flows. They define the
default pipeline structure and parameters. Blueprints belong to a
workspace, allowing workspaces to define custom processing templates.

### Parameter types

Parameter types define the kinds of parameters that flows accept (e.g.
"llm-model", "temperature"), including their defaults and validation
rules. They belong to a workspace since workspaces that define custom
flows need to define the parameter types those flows use.

### Interface descriptions

Interface descriptions define the connection points of a flow: what
queues and topics it uses. They belong to a workspace since they
describe workspace-owned flows.

### Knowledge core

A knowledge core is a stored snapshot of extracted knowledge (triples
and graph embeddings). Knowledge cores belong to a workspace and can be
loaded into any collection within that workspace.

Knowledge cores serve as a portable extraction output. You process
documents through a flow, the pipeline produces triples and embeddings,
and the results can be stored as a knowledge core. That core can later
be loaded into a different collection or reloaded after a collection is
cleared.

### Extracted knowledge

Extracted knowledge is the live, queryable content within a collection:
triples in the knowledge graph, graph embeddings, and document
embeddings. It is the product of processing source documents through a
flow into a specific collection.

Extracted knowledge is scoped to a workspace and a collection. It
cannot exist without both.
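
A minimal sketch of that composite scope, assuming a simple in-memory
store; the class and names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KnowledgeScope:
    """Extracted knowledge is addressed by the pair, never one alone."""
    workspace: str
    collection: str

# Every read or write of extracted knowledge is keyed by the full pair.
graph_store: dict[KnowledgeScope, list[tuple[str, str, str]]] = {}
graph_store[KnowledgeScope("workspace-a", "research")] = [
    ("disease:flu", "has-symptom", "symptom:fever"),
]
```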

### Processing record

A processing record tracks which source document was processed, through
which flow, into which collection. It links the source document
(workspace-scoped) to the extracted knowledge (workspace + collection
scoped).

## Ownership summary

| Artefact | Owned by | Shared across collections? |
|----------|----------|----------------------------|
| Workspaces | Global (platform) | N/A |
| User accounts | Global (platform) | N/A |
| API keys | Global (platform) | N/A |
| Source documents | Workspace | Yes |
| Flows | Workspace | N/A |
| Flow blueprints | Workspace | N/A |
| Prompts | Workspace | N/A |
| Ontologies | Workspace | N/A |
| Schemas | Workspace | N/A |
| Tools | Workspace | N/A |
| Tool services | Workspace | N/A |
| MCP tools | Workspace | N/A |
| Agent patterns | Workspace | N/A |
| Agent task types | Workspace | N/A |
| Token costs | Workspace | N/A |
| Parameter types | Workspace | N/A |
| Interface descriptions | Workspace | N/A |
| Knowledge cores | Workspace | Yes (can be loaded into any collection) |
| Collections | Workspace | N/A |
| Extracted knowledge | Workspace + collection | No |
| Processing records | Workspace + collection | No |

## Scoping summary

### Global (system-level)

A small number of artefacts exist outside any workspace:

- **Workspace registry**: the list of workspaces itself
- **User accounts**: users reference a workspace but are not owned by
  one
- **API keys**: belong to users, not workspaces

These are managed by the IAM layer and exist at the platform level.

### Workspace-owned

All other configuration and data is workspace-owned:

- Flow definitions and parameters
- Flow blueprints
- Prompts
- Ontologies
- Schemas
- Tools, tool services, and MCP tools
- Agent patterns and agent task types
- Token costs
- Parameter types
- Interface descriptions
- Collection definitions
- Knowledge cores
- Source documents
- Collections and their extracted knowledge
## Relationship between artefacts

```
Platform (global)
  |
  +-- Workspaces
  |
  +-- User accounts (each assigned to a workspace)
  |
  +-- API keys (belong to users)

Workspace
  |
  +-- Source documents (uploaded, unprocessed)
  |
  +-- Flows (pipeline definitions: models, parameters, queues)
  |
  +-- Flow blueprints (templates for creating flows)
  |
  +-- Prompts (LLM instruction templates)
  |
  +-- Ontologies (entity and relationship definitions)
  |
  +-- Schemas (structured data type definitions)
  |
  +-- Tools, tool services, MCP tools (agent capabilities)
  |
  +-- Agent patterns and agent task types (agent behaviour)
  |
  +-- Token costs (LLM pricing per model)
  |
  +-- Parameter types (flow parameter definitions)
  |
  +-- Interface descriptions (flow connection points)
  |
  +-- Knowledge cores (stored extraction snapshots)
  |
  +-- Collections
        |
        +-- Extracted knowledge (triples, embeddings)
        |
        +-- Processing records (links documents to collections)
```

A typical workflow:

1. A source document is uploaded to the workspace.
2. A flow defines how to process it (which models, what parameters).
3. The document is processed through the flow into a collection.
4. Processing records track what was processed.
5. Extracted knowledge (triples, embeddings) is queryable within the
   collection.
6. Optionally, the extracted knowledge is stored as a knowledge core
   for later reuse.

## Implementation notes

The current codebase uses a `user` field in message metadata and storage
partition keys to identify the workspace. The `collection` field
identifies the collection within that workspace. The IAM spec describes
how the gateway maps authenticated credentials to a workspace identity
and sets these fields.
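
As a hedged sketch of that mapping (the lookup and field names are
hypothetical, not the gateway's actual code), the gateway stamps the
trusted scope server-side:

```python
def lookup_workspace(api_key: str) -> str:
    # Stub standing in for the IAM layer's credential-to-workspace lookup.
    return {"key-123": "workspace-a"}[api_key]

def scope_request(api_key: str, collection: str, payload: dict) -> dict:
    # The client never supplies the workspace; it is resolved from
    # trusted credentials and set on the message by the gateway.
    return {
        **payload,
        "workspace": lookup_workspace(api_key),
        "collection": collection,
    }
```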

For details on how each storage backend implements this scoping, see:

- [Entity-Centric Graph](entity-centric-graph.md): Cassandra KG schema
- [Neo4j User Collection Isolation](neo4j-user-collection-isolation.md)
- [Collection Management](collection-management.md)

### Known inconsistencies in current implementation

- **Pipeline intermediate tables** do not include collection in their
  partition keys. Re-processing the same document into a different
  collection may overwrite intermediate state.
- **Processing metadata** stores collection in the row payload but not
  in the partition key, making collection-based queries inefficient.
- **Upload sessions** are keyed by upload ID, not workspace. The
  gateway should validate workspace ownership before allowing
  operations on upload sessions.

## References

- [Identity and Access Management](iam.md)
- [Collection Management](collection-management.md)
- [Entity-Centric Graph](entity-centric-graph.md)
- [Neo4j User Collection Isolation](neo4j-user-collection-isolation.md)
- [Multi-Tenant Support](multi-tenant-support.md)