feat: workspace-based multi-tenancy, replacing user as tenancy axis (#840)
Introduces `workspace` as the isolation boundary for config, flows,
library, and knowledge data. Removes `user` as a schema-level field
throughout the code, API specs, and tests; workspace provides the
same separation more cleanly at the trusted flow.workspace layer
rather than through client-supplied message fields.
Design
------
- IAM tech spec (docs/tech-specs/iam.md) documents current state,
proposed auth/access model, and migration direction.
- Data ownership model (docs/tech-specs/data-ownership-model.md)
captures the workspace/collection/flow hierarchy.
Schema + messaging
------------------
- Drop `user` field from AgentRequest/Step, GraphRagQuery,
DocumentRagQuery, Triples/Graph/Document/Row EmbeddingsRequest,
Sparql/Rows/Structured QueryRequest, ToolServiceRequest.
- Keep collection/workspace routing via flow.workspace at the
service layer.
- Translators updated to not serialise/deserialise user.
API specs
---------
- OpenAPI schemas and path examples cleaned of user fields.
- Websocket async-api messages updated.
- Removed the unused parameters/User.yaml.
Services + base
---------------
- Librarian, collection manager, knowledge, config: all operations
scoped by workspace. Config client API takes workspace as first
positional arg (see the sketch at the end of this section).
- `flow.workspace` set at flow start time by the infrastructure;
no longer pass-through from clients.
- Tool service drops user-personalisation passthrough.
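
A hypothetical sketch of the config-client change; the class and
method names are illustrative, not the actual SDK surface:

```python
# Sketch: config values are scoped by workspace, passed as the first
# positional argument, with no user field anywhere in the call.
class ConfigClient:
    def __init__(self, store: dict[tuple[str, str], str]):
        self.store = store

    def get(self, workspace: str, key: str) -> str:
        return self.store[(workspace, key)]

config = ConfigClient({("acme", "prompt.extract"): "Extract entities..."})
print(config.get("acme", "prompt.extract"))
```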
CLI + SDK
---------
- tg-init-workspace and workspace-aware import/export.
- All tg-* commands drop user args; accept --workspace.
- Python API/SDK (flow, socket_client, async_*, explainability,
library) drop user kwargs from every method signature.
MCP server
----------
- All tool endpoints drop user parameters; socket_manager no longer
keyed per user.
Flow service
------------
- Closure-based topic cleanup on flow stop: only delete topics
whose blueprint template was parameterised AND no remaining
live flow (across all workspaces) still resolves to that topic.
Four scopes fall out naturally from template analysis (sketched below):
* {id} -> per-flow, deleted on stop
* {blueprint} -> per-blueprint, kept while any flow of the
same blueprint exists
* {workspace} -> per-workspace, kept while any flow in the
workspace exists
* literal -> global, never deleted (e.g. tg.request.librarian)
Fixes a bug where stopping a flow silently destroyed the global
librarian exchange, wedging all library operations until manual
restart.
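
A minimal sketch of that cleanup rule, assuming flows are dicts with
id/blueprint/workspace keys; names are illustrative, not the actual
flow-service code:

```python
def topic_scope(template: str) -> str:
    """Classify a topic template by the placeholder it contains."""
    if "{id}" in template:
        return "per-flow"        # deleted when the flow stops
    if "{blueprint}" in template:
        return "per-blueprint"   # kept while any flow of the blueprint runs
    if "{workspace}" in template:
        return "per-workspace"   # kept while any flow in its workspace runs
    return "global"              # literal topics are never deleted

def may_delete(template: str, flow: dict, live_flows: list[dict]) -> bool:
    """Delete only if parameterised AND no remaining live flow
    (across all workspaces) still resolves to the same topic."""
    if topic_scope(template) == "global":
        return False
    topic = template.format(**flow)
    return not any(template.format(**f) == topic for f in live_flows)
```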
RabbitMQ backend
----------------
- heartbeat=60, blocked_connection_timeout=300. Catches silently
dead connections (broker restart, orphaned channels, network
partitions) within ~2 heartbeat windows, so the consumer
reconnects and re-binds its queue rather than sitting forever
on a zombie connection.
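
For illustration, with pika (a common Python RabbitMQ client) the
settings look like this; the host name and usage are assumptions:

```python
import pika

params = pika.ConnectionParameters(
    host="rabbitmq",
    heartbeat=60,                    # broker and client probe each other
    blocked_connection_timeout=300,  # raise instead of hanging when blocked
)
connection = pika.BlockingConnection(params)
channel = connection.channel()
```

A missed heartbeat surfaces as a raised connection error rather than an
indefinite block, which is what lets the consumer reconnect and re-bind.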
Tests
-----
- Full test refresh: unit, integration, contract, provenance.
- Dropped user-field assertions and constructor kwargs across
~100 test files.
- Renamed user-collection isolation tests to workspace-collection.
parent 9332089b3d
commit d35473f7f7
377 changed files with 6868 additions and 5785 deletions
docs/tech-specs/data-ownership-model.md (new file, 309 lines)
---
layout: default
title: "Data Ownership and Information Separation"
parent: "Tech Specs"
---

# Data Ownership and Information Separation

## Purpose

This document defines the logical ownership model for data in
TrustGraph: what the artefacts are, who owns them, and how they relate
to each other.

The IAM spec ([iam.md](iam.md)) describes authentication and
authorisation mechanics. This spec addresses the prior question: what
are the boundaries around data, and who owns what?

## Concepts

### Workspace

A workspace is the primary isolation boundary. It represents an
organisation, team, or independent operating unit. All data belongs to
exactly one workspace. Cross-workspace access is never permitted through
the API.

A workspace owns:
- Source documents
- Flows (processing pipeline definitions)
- Knowledge cores (stored extraction output)
- Collections (organisational units for extracted knowledge)

### Collection

A collection is an organisational unit within a workspace. It groups
extracted knowledge produced from source documents. A workspace can
have multiple collections, allowing:

- Processing the same documents with different parameters or models.
- Maintaining separate knowledge bases for different purposes.
- Deleting extracted knowledge without deleting source documents.

Collections do not own source documents. A source document exists at the
workspace level and can be processed into multiple collections.

### Source document

A source document (PDF, text file, etc.) is raw input uploaded to the
system. Documents belong to the workspace, not to a specific collection.

This is intentional. A document is an asset that exists independently
of how it is processed. The same PDF might be processed into multiple
collections with different chunking parameters or extraction models.
Tying a document to a single collection would force re-upload for each
collection.

### Flow

A flow defines a processing pipeline: which models to use, what
parameters to apply (chunk size, temperature, etc.), and how processing
services are connected. Flows belong to a workspace.

The processing services themselves (document-decoder, chunker,
embeddings, LLM completion, etc.) are shared infrastructure; they serve
all workspaces. Each flow has its own queues, keeping data from
different workspaces and flows separate as it moves through the
pipeline.

Different workspaces can define different flows. Workspace A might use
GPT-5.2 with a chunk size of 2000, while workspace B uses Claude with a
chunk size of 1000.
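
For illustration only, two such workspace-owned flow definitions might
differ like this; the dict shape and field names are assumptions, not
the actual flow schema:

```python
# Hypothetical flow definitions: same pipeline, per-workspace parameters.
flow_a = {
    "workspace": "workspace-a",
    "flow": "research-ingest",
    "parameters": {"llm-model": "gpt-5.2", "chunk-size": 2000},
}
flow_b = {
    "workspace": "workspace-b",
    "flow": "research-ingest",
    "parameters": {"llm-model": "claude", "chunk-size": 1000},
}
```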

### Prompts

Prompts are templates that control how the LLM behaves during knowledge
extraction and query answering. They belong to a workspace, allowing
different workspaces to have different extraction strategies, response
styles, or domain-specific instructions.

### Ontology

An ontology defines the concepts, entities, and relationships that the
extraction pipeline looks for in source documents. Ontologies belong to
a workspace. A medical workspace might define ontologies around diseases,
symptoms, and treatments, while a legal workspace defines ontologies
around statutes, precedents, and obligations.

### Schemas

Schemas define structured data types for extraction. They specify what
fields to extract, their types, and how they relate. Schemas belong to
a workspace, as different workspaces extract different structured
information from their documents.

### Tools, tool services, and MCP tools

Tools define capabilities available to agents: what actions they can
take, what external services they can call. Tool services configure how
tools connect to backend services. MCP tools configure connections to
remote MCP servers, including authentication tokens. All belong to a
workspace.

### Agent patterns and agent task types

Agent patterns define agent behaviour strategies (how an agent reasons,
what steps it follows). Agent task types define the kinds of tasks
agents can perform. Both belong to a workspace, as different workspaces
may have different agent configurations.

### Token costs

Token cost definitions specify pricing for LLM token usage per model.
These belong to a workspace since different workspaces may use different
models or have different billing arrangements.

### Flow blueprints

Flow blueprints are templates for creating flows. They define the
default pipeline structure and parameters. Blueprints belong to a
workspace, allowing workspaces to define custom processing templates.

### Parameter types

Parameter types define the kinds of parameters that flows accept (e.g.
"llm-model", "temperature"), including their defaults and validation
rules. They belong to a workspace since workspaces that define custom
flows need to define the parameter types those flows use.

### Interface descriptions

Interface descriptions define the connection points of a flow: what
queues and topics it uses. They belong to a workspace since they
describe workspace-owned flows.

### Knowledge core

A knowledge core is a stored snapshot of extracted knowledge (triples
and graph embeddings). Knowledge cores belong to a workspace and can be
loaded into any collection within that workspace.

Knowledge cores serve as a portable extraction output. You process
documents through a flow, the pipeline produces triples and embeddings,
and the results can be stored as a knowledge core. That core can later
be loaded into a different collection or reloaded after a collection is
cleared.

### Extracted knowledge

Extracted knowledge is the live, queryable content within a collection:
triples in the knowledge graph, graph embeddings, and document
embeddings. It is the product of processing source documents through a
flow into a specific collection.

Extracted knowledge is scoped to a workspace and a collection. It
cannot exist without both.
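
A minimal sketch of that composite scope, assuming a simple in-memory
store; the class and names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KnowledgeScope:
    """Extracted knowledge is addressed by the pair, never one alone."""
    workspace: str
    collection: str

# Every read or write of extracted knowledge is keyed by the full pair.
graph_store: dict[KnowledgeScope, list[tuple[str, str, str]]] = {}
graph_store[KnowledgeScope("workspace-a", "research")] = [
    ("disease:flu", "has-symptom", "symptom:fever"),
]
```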

### Processing record

A processing record tracks which source document was processed, through
which flow, into which collection. It links the source document
(workspace-scoped) to the extracted knowledge (workspace + collection
scoped).

## Ownership summary

| Artefact | Owned by | Shared across collections? |
|----------|----------|----------------------------|
| Workspaces | Global (platform) | N/A |
| User accounts | Global (platform) | N/A |
| API keys | Global (platform) | N/A |
| Source documents | Workspace | Yes |
| Flows | Workspace | N/A |
| Flow blueprints | Workspace | N/A |
| Prompts | Workspace | N/A |
| Ontologies | Workspace | N/A |
| Schemas | Workspace | N/A |
| Tools | Workspace | N/A |
| Tool services | Workspace | N/A |
| MCP tools | Workspace | N/A |
| Agent patterns | Workspace | N/A |
| Agent task types | Workspace | N/A |
| Token costs | Workspace | N/A |
| Parameter types | Workspace | N/A |
| Interface descriptions | Workspace | N/A |
| Knowledge cores | Workspace | Yes (can be loaded into any collection) |
| Collections | Workspace | N/A |
| Extracted knowledge | Workspace + collection | No |
| Processing records | Workspace + collection | No |

## Scoping summary

### Global (system-level)

A small number of artefacts exist outside any workspace:

- **Workspace registry**: the list of workspaces itself
- **User accounts**: users reference a workspace but are not owned by
  one
- **API keys**: belong to users, not workspaces

These are managed by the IAM layer and exist at the platform level.

### Workspace-owned

All other configuration and data is workspace-owned:

- Flow definitions and parameters
- Flow blueprints
- Prompts
- Ontologies
- Schemas
- Tools, tool services, and MCP tools
- Agent patterns and agent task types
- Token costs
- Parameter types
- Interface descriptions
- Collection definitions
- Knowledge cores
- Source documents
- Collections and their extracted knowledge
## Relationship between artefacts

```
Platform (global)
  |
  +-- Workspaces
  |
  +-- User accounts (each assigned to a workspace)
  |
  +-- API keys (belong to users)

Workspace
  |
  +-- Source documents (uploaded, unprocessed)
  |
  +-- Flows (pipeline definitions: models, parameters, queues)
  |
  +-- Flow blueprints (templates for creating flows)
  |
  +-- Prompts (LLM instruction templates)
  |
  +-- Ontologies (entity and relationship definitions)
  |
  +-- Schemas (structured data type definitions)
  |
  +-- Tools, tool services, MCP tools (agent capabilities)
  |
  +-- Agent patterns and agent task types (agent behaviour)
  |
  +-- Token costs (LLM pricing per model)
  |
  +-- Parameter types (flow parameter definitions)
  |
  +-- Interface descriptions (flow connection points)
  |
  +-- Knowledge cores (stored extraction snapshots)
  |
  +-- Collections
        |
        +-- Extracted knowledge (triples, embeddings)
        |
        +-- Processing records (links documents to collections)
```

A typical workflow:

1. A source document is uploaded to the workspace.
2. A flow defines how to process it (which models, what parameters).
3. The document is processed through the flow into a collection.
4. Processing records track what was processed.
5. Extracted knowledge (triples, embeddings) is queryable within the
   collection.
6. Optionally, the extracted knowledge is stored as a knowledge core
   for later reuse.

## Implementation notes

The current codebase uses a `user` field in message metadata and storage
partition keys to identify the workspace. The `collection` field
identifies the collection within that workspace. The IAM spec describes
how the gateway maps authenticated credentials to a workspace identity
and sets these fields.
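
As a hedged sketch of that mapping (the lookup and field names are
hypothetical, not the gateway's actual code), the gateway stamps the
trusted scope server-side:

```python
def lookup_workspace(api_key: str) -> str:
    # Stub standing in for the IAM layer's credential-to-workspace lookup.
    return {"key-123": "workspace-a"}[api_key]

def scope_request(api_key: str, collection: str, payload: dict) -> dict:
    # The client never supplies the workspace; it is resolved from
    # trusted credentials and set on the message by the gateway.
    return {
        **payload,
        "workspace": lookup_workspace(api_key),
        "collection": collection,
    }
```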

For details on how each storage backend implements this scoping, see:

- [Entity-Centric Graph](entity-centric-graph.md): Cassandra KG schema
- [Neo4j User Collection Isolation](neo4j-user-collection-isolation.md)
- [Collection Management](collection-management.md)

### Known inconsistencies in current implementation

- **Pipeline intermediate tables** do not include collection in their
  partition keys. Re-processing the same document into a different
  collection may overwrite intermediate state.
- **Processing metadata** stores collection in the row payload but not
  in the partition key, making collection-based queries inefficient.
- **Upload sessions** are keyed by upload ID, not workspace. The
  gateway should validate workspace ownership before allowing
  operations on upload sessions.

## References

- [Identity and Access Management](iam.md)
- [Collection Management](collection-management.md)
- [Entity-Centric Graph](entity-centric-graph.md)
- [Neo4j User Collection Isolation](neo4j-user-collection-isolation.md)
- [Multi-Tenant Support](multi-tenant-support.md)