---
layout: default
title: "Data Ownership and Information Separation"
parent: "Tech Specs"
---

# Data Ownership and Information Separation

## Purpose

This document defines the logical ownership model for data in
TrustGraph: what the artefacts are, who owns them, and how they relate
to each other.

The IAM spec ([iam.md](iam.md)) describes authentication and
authorisation mechanics. This spec addresses the prior question: what
are the boundaries around data, and who owns what?

## Concepts

### Workspace

A workspace is the primary isolation boundary. It represents an
organisation, team, or independent operating unit. All data belongs to
exactly one workspace. Cross-workspace access is never permitted through
the API.

A workspace owns:

- Source documents
- Flows (processing pipeline definitions)
- Knowledge cores (stored extraction output)
- Collections (organisational units for extracted knowledge)

### Collection

A collection is an organisational unit within a workspace. It groups
extracted knowledge produced from source documents. A workspace can
have multiple collections, allowing:

- Processing the same documents with different parameters or models.
- Maintaining separate knowledge bases for different purposes.
- Deleting extracted knowledge without deleting source documents.

Collections do not own source documents. A source document exists at the
workspace level and can be processed into multiple collections.

### Source document

A source document (PDF, text file, etc.) is raw input uploaded to the
system. Documents belong to the workspace, not to a specific collection.

This is intentional. A document is an asset that exists independently
of how it is processed. The same PDF might be processed into multiple
collections with different chunking parameters or extraction models.
Tying a document to a single collection would force re-upload for each
collection.

### Flow

A flow defines a processing pipeline: which models to use, what
parameters to apply (chunk size, temperature, etc.), and how processing
services are connected. Flows belong to a workspace.

The processing services themselves (document-decoder, chunker,
embeddings, LLM completion, etc.) are shared infrastructure — they serve
all workspaces. Each flow has its own queues, keeping data from
different workspaces and flows separate as it moves through the
pipeline.

Different workspaces can define different flows. Workspace A might use
GPT-5.2 with a chunk size of 2000, while workspace B uses Claude with a
chunk size of 1000.
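
The contrast between workspaces A and B can be sketched as configuration. This is a minimal illustration assuming a dict-based flow definition; the field names (`parameters`, `queues`) are hypothetical, not TrustGraph's actual flow schema.

```python
# Two workspace-owned flow definitions. Field names are illustrative only.
flow_a = {
    "workspace": "workspace-a",
    "name": "research-pipeline",
    "parameters": {"llm-model": "gpt-5.2", "chunk-size": 2000},
    # Each flow has its own queues, so data from different workspaces
    # stays separate while moving through shared processing services.
    "queues": {"input": "workspace-a.research-pipeline.input"},
}

flow_b = {
    "workspace": "workspace-b",
    "name": "default",
    "parameters": {"llm-model": "claude", "chunk-size": 1000},
    "queues": {"input": "workspace-b.default.input"},
}
```

In this sketch, the shared services read from per-flow queues, so isolation falls out of the queue naming rather than any workspace logic in the service layer.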

### Prompts

Prompts are templates that control how the LLM behaves during knowledge
extraction and query answering. They belong to a workspace, allowing
different workspaces to have different extraction strategies, response
styles, or domain-specific instructions.

### Ontology

An ontology defines the concepts, entities, and relationships that the
extraction pipeline looks for in source documents. Ontologies belong to
a workspace. A medical workspace might define ontologies around diseases,
symptoms, and treatments, while a legal workspace defines ontologies
around statutes, precedents, and obligations.

### Schemas

Schemas define structured data types for extraction. They specify what
fields to extract, their types, and how they relate. Schemas belong to
a workspace, as different workspaces extract different structured
information from their documents.
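
To make this concrete, here is a hypothetical workspace-owned schema; the shape (`fields`, `type`, `required`) is invented for illustration and is not TrustGraph's schema format.

```python
# An illustrative schema: the structured fields a workspace wants the
# extraction pipeline to pull from its documents.
invoice_schema = {
    "workspace": "workspace-a",
    "name": "invoice",
    "fields": [
        {"name": "invoice_number", "type": "string", "required": True},
        {"name": "total", "type": "decimal", "required": True},
        {"name": "issued", "type": "date", "required": False},
    ],
}

# Derived view: which fields the extractor must always produce.
required = [f["name"] for f in invoice_schema["fields"] if f["required"]]
```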

### Tools, tool services, and MCP tools

Tools define capabilities available to agents: what actions they can
take, what external services they can call. Tool services configure how
tools connect to backend services. MCP tools configure connections to
remote MCP servers, including authentication tokens. All belong to a
workspace.

### Agent patterns and agent task types

Agent patterns define agent behaviour strategies (how an agent reasons,
what steps it follows). Agent task types define the kinds of tasks
agents can perform. Both belong to a workspace, as different workspaces
may have different agent configurations.

### Token costs

Token cost definitions specify pricing for LLM token usage per model.
These belong to a workspace since different workspaces may use different
models or have different billing arrangements.

### Flow blueprints

Flow blueprints are templates for creating flows. They define the
default pipeline structure and parameters. Blueprints belong to a
workspace, allowing workspaces to define custom processing templates.

### Parameter types

Parameter types define the kinds of parameters that flows accept (e.g.
"llm-model", "temperature"), including their defaults and validation
rules. They belong to a workspace since workspaces that define custom
flows need to define the parameter types those flows use.

### Interface descriptions

Interface descriptions define the connection points of a flow — what
queues and topics it uses. They belong to a workspace since they
describe workspace-owned flows.

### Knowledge core

A knowledge core is a stored snapshot of extracted knowledge (triples
and graph embeddings). Knowledge cores belong to a workspace and can be
loaded into any collection within that workspace.

Knowledge cores serve as a portable extraction output. You process
documents through a flow, the pipeline produces triples and embeddings,
and the results can be stored as a knowledge core. That core can later
be loaded into a different collection or reloaded after a collection is
cleared.

### Extracted knowledge

Extracted knowledge is the live, queryable content within a collection:
triples in the knowledge graph, graph embeddings, and document
embeddings. It is the product of processing source documents through a
flow into a specific collection.

Extracted knowledge is scoped to a workspace and a collection. It
cannot exist without both.

### Processing record

A processing record tracks which source document was processed, through
which flow, into which collection. It links the source document
(workspace-scoped) to the extracted knowledge (workspace + collection
scoped).
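
That linkage can be sketched as a small record type. This is a hypothetical shape assuming string identifiers, not the actual storage layout.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProcessingRecord:
    """Illustrative: joins a workspace-scoped document to
    collection-scoped extracted knowledge via the flow used."""
    workspace: str    # owning workspace, shared by both sides
    document_id: str  # workspace-scoped source document
    flow: str         # flow the document was processed through
    collection: str   # collection that received the extracted knowledge

record = ProcessingRecord("acme", "doc-42", "research-pipeline", "trial-1")
```

Because the same document can be processed into several collections, one document may have many processing records, each pointing at a different collection.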

## Ownership summary

| Artefact | Owned by | Shared across collections? |
|----------|----------|----------------------------|
| Workspaces | Global (platform) | N/A |
| User accounts | Global (platform) | N/A |
| API keys | Global (platform) | N/A |
| Source documents | Workspace | Yes |
| Flows | Workspace | N/A |
| Flow blueprints | Workspace | N/A |
| Prompts | Workspace | N/A |
| Ontologies | Workspace | N/A |
| Schemas | Workspace | N/A |
| Tools | Workspace | N/A |
| Tool services | Workspace | N/A |
| MCP tools | Workspace | N/A |
| Agent patterns | Workspace | N/A |
| Agent task types | Workspace | N/A |
| Token costs | Workspace | N/A |
| Parameter types | Workspace | N/A |
| Interface descriptions | Workspace | N/A |
| Knowledge cores | Workspace | Yes — can be loaded into any collection |
| Collections | Workspace | N/A |
| Extracted knowledge | Workspace + collection | No |
| Processing records | Workspace + collection | No |

## Scoping summary

### Global (system-level)

A small number of artefacts exist outside any workspace:

- **Workspace registry** — the list of workspaces itself
- **User accounts** — users reference a workspace but are not owned by
  one
- **API keys** — belong to users, not workspaces

These are managed by the IAM layer and exist at the platform level.

### Workspace-owned

All other configuration and data is workspace-owned:

- Flow definitions and parameters
- Flow blueprints
- Prompts
- Ontologies
- Schemas
- Tools, tool services, and MCP tools
- Agent patterns and agent task types
- Token costs
- Parameter types
- Interface descriptions
- Collection definitions
- Knowledge cores
- Source documents
- Collections and their extracted knowledge

## Relationship between artefacts

```
Platform (global)
|
+-- Workspaces
|   |
|   +-- User accounts (each assigned to a workspace)
|   |
|   +-- API keys (belong to users)

Workspace
|
+-- Source documents (uploaded, unprocessed)
|
+-- Flows (pipeline definitions: models, parameters, queues)
|
+-- Flow blueprints (templates for creating flows)
|
+-- Prompts (LLM instruction templates)
|
+-- Ontologies (entity and relationship definitions)
|
+-- Schemas (structured data type definitions)
|
+-- Tools, tool services, MCP tools (agent capabilities)
|
+-- Agent patterns and agent task types (agent behaviour)
|
+-- Token costs (LLM pricing per model)
|
+-- Parameter types (flow parameter definitions)
|
+-- Interface descriptions (flow connection points)
|
+-- Knowledge cores (stored extraction snapshots)
|
+-- Collections
    |
    +-- Extracted knowledge (triples, embeddings)
    |
    +-- Processing records (links documents to collections)
```

A typical workflow:

1. A source document is uploaded to the workspace.
2. A flow defines how to process it (which models, what parameters).
3. The document is processed through the flow into a collection.
4. Processing records track what was processed.
5. Extracted knowledge (triples, embeddings) is queryable within the
   collection.
6. Optionally, the extracted knowledge is stored as a knowledge core
   for later reuse.
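
The ownership rules running through this workflow can be sketched as a guard, using plain dicts as stand-ins for the real artefacts; the function and field names are invented for illustration.

```python
def process(document: dict, flow: dict, collection: dict) -> dict:
    """Illustrative only: process a document through a flow into a
    collection, enforcing the scoping rules described above."""
    # Cross-workspace access is never permitted: the document, the flow,
    # and the target collection must all belong to the same workspace.
    assert document["workspace"] == flow["workspace"] == collection["workspace"]
    # Extracted knowledge is scoped to workspace + collection; it
    # cannot exist without both.
    return {
        "workspace": collection["workspace"],
        "collection": collection["name"],
        "triples": [],      # produced by the extraction pipeline
        "embeddings": [],   # graph and document embeddings
    }
```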

## Implementation notes

The current codebase uses a `user` field in message metadata and storage
partition keys to identify the workspace. The `collection` field
identifies the collection within that workspace. The IAM spec describes
how the gateway maps authenticated credentials to a workspace identity
and sets these fields.
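
Under these conventions, the scoping might map onto partition keys like the following sketch. The helper names are invented; the real schemas live in the storage backends linked below.

```python
def extracted_knowledge_key(user: str, collection: str) -> tuple:
    # `user` identifies the workspace; `collection` scopes within it.
    # Extracted knowledge needs both.
    return (user, collection)

def source_document_key(user: str, document_id: str) -> tuple:
    # Documents are workspace-scoped only: no collection in the key,
    # so one upload can feed many collections.
    return (user, document_id)
```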

For details on how each storage backend implements this scoping, see:

- [Entity-Centric Graph](entity-centric-graph.md) — Cassandra KG schema
- [Neo4j User Collection Isolation](neo4j-user-collection-isolation.md)
- [Collection Management](collection-management.md)

### Known inconsistencies in current implementation

- **Pipeline intermediate tables** do not include collection in their
  partition keys. Re-processing the same document into a different
  collection may overwrite intermediate state.
- **Processing metadata** stores collection in the row payload but not
  in the partition key, making collection-based queries inefficient.
- **Upload sessions** are keyed by upload ID, not workspace. The
  gateway should validate workspace ownership before allowing
  operations on upload sessions.
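
The first inconsistency can be demonstrated with a dict standing in for an intermediate table. Keying by (user, document) alone means a second collection's run clobbers the first; adding collection to the key keeps both.

```python
# Current behaviour: no collection in the key, so re-processing the
# same document into a different collection overwrites the state.
intermediate = {}
intermediate[("acme", "doc-1")] = "state for collection-a"
intermediate[("acme", "doc-1")] = "state for collection-b"  # overwrites

# With collection in the key, both runs coexist.
fixed = {}
fixed[("acme", "collection-a", "doc-1")] = "state for collection-a"
fixed[("acme", "collection-b", "doc-1")] = "state for collection-b"
```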

## References

- [Identity and Access Management](iam.md)
- [Collection Management](collection-management.md)
- [Entity-Centric Graph](entity-centric-graph.md)
- [Neo4j User Collection Isolation](neo4j-user-collection-isolation.md)
- [Multi-Tenant Support](multi-tenant-support.md)