Implements all three changes from the knowledge-core-completeness tech spec: 1. Named graph field preserved through Cassandra storage (7-element tuple), enabling provenance triples to retain their graph URIs on round-trip. 2. Provenance triples already arrive on triples-input — no routing change needed; Change 1 was sufficient. 3. Source material (library documents) streamed alongside triples and embeddings during core download/upload. The knowledge manager fetches the document hierarchy from the librarian on download and recreates it on upload, preserving the full provenance chain across instances.
21 KiB
| layout | title | parent |
|---|---|---|
| default | Knowledge Core Completeness | Tech Specs |
Knowledge Core Completeness
Overview
Knowledge cores are portable snapshots of extracted knowledge: triples, graph
embeddings, and document embeddings stored in Cassandra's knowledge keyspace.
They can be downloaded as files, transferred between TrustGraph instances, and
loaded back into vector and graph stores.
Recent additions to TrustGraph — explainability/provenance and named graphs — were not carried through to the knowledge core system. This means that exporting and re-importing a core loses provenance links, graph assignments, and source material, breaking the explainability chain.
This specification addresses three gaps:
- Named graphs not stored — The
g(graph name) field on triples is silently dropped when writing to the core store and comes back asNoneon read. - Provenance triples not captured — Provenance triples (PROV-O) are generated during extraction and flow to graph stores, but never enter the knowledge core store. It is unclear whether they arrive at the store in the correct form.
- Source material not included — Documents, text pages, and chunks in the librarian's bucket store are not part of the core. After loading a core on a different instance, provenance links to source material point at nothing.
Goals
- Self-contained cores: A downloaded knowledge core file contains everything needed to reconstruct the full knowledge graph including provenance and source attribution on a fresh instance.
- Named graph preservation: Round-tripping a core preserves graph assignments on all triples.
- Backward compatibility: Existing core files (without graph names or source material) can still be uploaded and loaded. New fields are optional on import.
- No change to core identity: A core is still identified by its document ID. The additional data is associated with the same core ID.
- Minimal file format changes: Extend the existing msgpack record format with new record types rather than restructuring existing ones.
Background
Current Lifecycle
Extraction pipeline
│
├─ triples ──────────────────► knowledge core store (Cassandra)
├─ graph embeddings ─────────► knowledge core store (Cassandra)
├─ document embeddings ──────► knowledge core store (Cassandra)
├─ provenance triples ───────► graph store (only)
└─ source documents ─────────► librarian bucket store (only)
Download: Cassandra ──► knowledge manager ──► API gateway ──► client file
Upload: client file ──► API gateway ──► knowledge manager ──► Cassandra
Load: Cassandra ──► knowledge manager ──► Pulsar topics ──► graph/vector stores
Current Core File Format (msgpack)
A core file is a sequence of concatenated msgpack records. Each record is a
2-element tuple: (type_tag, payload).
| Type tag | Payload | Description |
|---|---|---|
"t" |
{"m": {id, root, collection}, "t": [triple_dicts]} |
Triple batch |
"ge" |
{"m": {id, root, collection}, "e": [{entity, vector}]} |
Graph embedding batch |
What's Missing
Named Graphs
The Triple dataclass has a g: str | None field (graph name IRI), used to
separate provenance graphs (urn:graph:source, urn:graph:retrieval) from
the default graph. However:
- Cassandra schema (
knowledge.triplestable): stores a 6-tuple per triple(s_val, s_is_uri, p_val, p_is_uri, o_val, o_is_uri)— no graph field. add_triples()(tables/knowledge.py:231): destructures onlys,p,o—gis discarded.get_triples()(tables/knowledge.py:396): reconstructsTriplewithgdefaulting toNone.- Core file format: triple dicts do not include a graph field.
Provenance Triples
Provenance triples are generated in the extraction pipeline
(trustgraph-base/trustgraph/provenance/triples.py) and published to graph
store topics. They use named graphs (urn:graph:source,
urn:graph:retrieval) and PROV-O vocabulary.
The knowledge core store processor (storage/knowledge/store.py) listens on
triples-input and graph-embeddings-input. Whether provenance triples
arrive on the same triples-input topic or a separate one needs
verification. Even if they do arrive, the graph name would be lost (per
above).
Source Material
The librarian stores the full document hierarchy in a separate system:
- Blob store (S3/MinIO): original documents, text pages, chunks —
keyed by object UUID under
doc/{object_id}. - Cassandra
librarykeyspace: document metadata includingid,kind(MIME type),title,parent_id,document_type(source/extracted),object_id(blob reference).
Provenance triples link extracted facts back to chunk/page/document IDs. Those IDs resolve through the librarian. When a core is loaded on a different instance, the librarian has no matching documents, so the entire provenance chain is broken.
Key Source Files
| Component | File | Purpose |
|---|---|---|
| Core Cassandra schema | trustgraph-flow/trustgraph/tables/knowledge.py |
Table definitions, read/write |
| Core manager | trustgraph-flow/trustgraph/cores/knowledge.py |
API operations, load-to-store |
| Core store processor | trustgraph-flow/trustgraph/storage/knowledge/store.py |
Extraction → Cassandra |
| CLI download | trustgraph-cli/trustgraph/cli/get_kg_core.py |
Core → msgpack file |
| CLI upload | trustgraph-cli/trustgraph/cli/put_kg_core.py |
Msgpack file → core |
| CLI load | trustgraph-cli/trustgraph/cli/load_kg_core.py |
Core → graph/vector stores |
| API client | trustgraph-base/trustgraph/api/knowledge.py |
Client-side knowledge API |
| Triple schema | trustgraph-base/trustgraph/schema/core/primitives.py |
Triple dataclass with g field |
| Provenance generation | trustgraph-base/trustgraph/provenance/triples.py |
PROV-O triple creation |
| Librarian | trustgraph-flow/trustgraph/librarian/librarian.py |
Document storage service |
| Library tables | trustgraph-flow/trustgraph/tables/library.py |
Document metadata in Cassandra |
| Blob store | trustgraph-flow/trustgraph/librarian/blob_store.py |
S3/MinIO object storage |
Technical Design
Change 1: Named Graph Field in Core Storage
Cassandra Schema
Extend the triples tuple from 6 to 7 elements, adding the graph name:
triples list<tuple<
text, boolean, -- s_val, s_is_uri
text, boolean, -- p_val, p_is_uri
text, boolean, -- o_val, o_is_uri
text -- graph name (empty string = default graph)
>>
Migration: The schema change uses ALTER TABLE or is handled by
creating a new table version. Existing rows with 6-element tuples must be
handled gracefully on read — if the tuple has 6 elements, treat graph as
default.
Write Path (add_triples)
Change tables/knowledge.py:add_triples() to include triple.g:
triples = [
(
*term_to_tuple(v.s), *term_to_tuple(v.p), *term_to_tuple(v.o),
v.g or ""
)
for v in m.triples
]
Read Path (get_triples)
Change tables/knowledge.py:get_triples() to restore the graph name:
Triple(
s = tuple_to_term(elt[0], elt[1]),
p = tuple_to_term(elt[2], elt[3]),
o = tuple_to_term(elt[4], elt[5]),
g = elt[6] if len(elt) > 6 and elt[6] else None,
)
The len(elt) > 6 guard provides backward compatibility with existing
6-element rows.
Core File Format
Extend triple dicts in the "t" record to include the graph name:
# In get_kg_core.py write_triple — each triple dict gains "g" key
{"s": ..., "p": ..., "o": ..., "g": "urn:graph:source"}
On read (put_kg_core.py), treat missing "g" key as default graph for
backward compatibility with old core files.
Change 2: Provenance Triples in Cores
Investigation Required
Before implementation, verify:
- Whether provenance triples arrive on the
triples-inputtopic that the knowledge core store processor already listens on. - If not, which topic they use, and whether the store processor should subscribe to it.
If provenance triples already arrive at the store
The only change needed is Change 1 (named graphs) — the provenance triples are already being stored, just without their graph name. Once graph names are preserved, provenance triples will round-trip correctly.
If provenance triples do NOT arrive at the store
Two options:
Option A — Route provenance to the existing store topic: Configure the
flow so provenance triples are published to the same triples-input topic.
This is the simpler approach and keeps the store processor unchanged.
Option B — Add a subscription: Add a new ConsumerSpec in the store
processor for the provenance topic. This keeps provenance routing
independent but adds complexity.
Recommendation: Option A, unless there is a reason provenance triples are intentionally kept off the core store topic.
Change 3: Source Material in Cores
This is the largest change. The goal is that when a core is loaded on a fresh instance, provenance links to source material resolve.
Architecture
Source material is not stored in the knowledge core tables. It lives in
the librarian (Cassandra library keyspace + S3/MinIO blob store) and is
fetched on demand via the librarian's existing service API.
The knowledge manager acts as a client of the librarian service — it calls the librarian's request/response API over pub/sub to retrieve document metadata and content. It does not access the library's Cassandra tables or blob store directly.
Transport
The librarian's pub/sub API already handles chunking of large documents. This chunking is designed to be websocket-friendly, so library content flowing through the API gateway to external clients does not require re-chunking. The API gateway remains a transport layer.
Download:
Knowledge manager ──pub/sub──► Librarian (fetch metadata + content)
Knowledge manager ──pub/sub──► API gateway ──websocket──► Client
Upload:
Client ──websocket──► API gateway ──pub/sub──► Knowledge manager
Knowledge manager ──pub/sub──► Librarian (store metadata + content)
What to Include
The provenance chain links facts → chunks → pages → documents. For the chain to resolve, the core must include:
- Document metadata — the library record for each document in the hierarchy (id, kind, title, parent_id, document_type, etc.)
- Document content — the blob data for each document (original file, extracted text pages, text chunks)
Including the full hierarchy is necessary because:
- A user viewing provenance needs to traverse fact → chunk → page → document
- The chunk text is needed to show what text a fact was extracted from
- The page text provides broader context
- The original document is needed for full source attribution
Size Implications
Source material will significantly increase core file sizes. A rough model:
| Component | Typical size per document |
|---|---|
| Triples + embeddings (current) | 1-10 MB |
| Chunk text (all chunks) | ~same as original document |
| Page text (all pages) | ~same as original document |
| Original document (PDF, etc.) | Varies widely (KB to hundreds of MB) |
For a 10 MB PDF, the core could grow from ~5 MB to ~25 MB (original + derived text + existing data). For large document sets, cores could become very large.
Decision needed: Whether to include original documents or just derived text (pages + chunks). Including only derived text still allows provenance display but loses the ability to serve the original file.
New Core File Record Types
Add new msgpack record types for library content:
| Type tag | Payload | Description |
|---|---|---|
"lm" |
{"id", "kind", "title", "parent_id", "document_type", "comments", "tags", "metadata"} |
Library document metadata |
"lb" |
{"id", "data"} |
Library document blob content (chunked by pub/sub layer) |
These are emitted after the existing "t" and "ge" records during
download and processed during upload.
Download Path
Extend KnowledgeManager.get_kg_core() to:
- Stream triples and graph embeddings from the core store (existing behavior).
- Use the librarian service API to retrieve documents associated with
this core ID:
a. Fetch the root document metadata and content.
b. Use
list-childrento discover child documents (pages, chunks). c. Recursively fetch metadata and content for each child. - Stream each document as
"lm"(metadata) and"lb"(content) records.
The knowledge manager gains the librarian service as a pub/sub dependency. Large document content is chunked by the librarian's existing pub/sub transport — the knowledge manager receives and forwards these chunks without buffering the full blob in memory.
Upload Path
Extend KnowledgeManager.put_kg_core() to handle the new record types:
- For
"lm"records: call the librarian service API to create/update the document metadata. - For
"lb"records: call the librarian service API to store the document content.
Parent-child relationships are preserved because parent_id is stored in
the metadata. Documents should be processed in hierarchy order (parent
before child) to satisfy any ordering constraints.
Load Path
The load path (_load_kg_core) publishes triples and embeddings to Pulsar
topics for ingestion into graph/vector stores. Source material does not need
to flow through the load path — it is already in the librarian after the
upload step and can be accessed directly by services that need it.
No changes to the load path for source material.
CLI Changes
tg-get-kg-core: Add handling for "lm" and "lb" record types in
the file writer.
tg-put-kg-core: Add handling for "lm" and "lb" record types in
the file reader. Send library records to the knowledge manager alongside
triple/embedding records.
Associating Documents with Cores
The core ID is metadata.root, which is the root document ID from the
librarian. This provides a natural join: the core's root document and all
its children (pages, chunks) are the source material for that core.
The librarian's list-children API provides the child documents. A
recursive traversal from the root document collects the full hierarchy.
API Changes
KnowledgeResponse Schema
Add optional fields to KnowledgeResponse for library data:
@dataclass
class KnowledgeResponse:
error: Error | None = None
ids: list | None = None
eos: bool = False
triples: Triples | None = None
graph_embeddings: GraphEmbeddings | None = None
document_embeddings: DocumentEmbeddings | None = None
library_metadata: LibraryMetadata | None = None # new
library_blob: LibraryBlob | None = None # new
New Schema Types
@dataclass
class LibraryMetadata:
id: str
kind: str | None = None
title: str | None = None
parent_id: str | None = None
document_type: str | None = None
comments: str | None = None
tags: list[str] | None = None
metadata: list[Triple] | None = None
@dataclass
class LibraryBlob:
id: str
data: bytes
Socket API
The existing streaming protocol for get-kg-core / put-kg-core carries
these new fields naturally — responses already stream multiple record types.
Dependencies Between Changes
Change 1 (named graphs) ◄── Change 2 depends on this
│
└── Change 2 (provenance triples)
│
└── Change 3 (source material) is independent
Change 1 is a prerequisite for Change 2 (provenance triples use named graphs). Change 3 is independent and can be implemented in parallel.
Security Considerations
- Workspace isolation: Core download/upload must respect workspace boundaries. Source material from the librarian must only be included if it belongs to the same workspace as the core. This is already enforced by the existing workspace-scoped queries.
- Large blob transfer: Streaming large documents through the API is handled by the librarian's existing pub/sub chunking, which is designed to be websocket-friendly. No additional chunking layer is needed.
- Cross-instance trust: When uploading a core from an external source, the library content should be treated as untrusted input. Document metadata and blob content should be validated before insertion.
Performance Considerations
- Core file size: Including source material will significantly increase core file sizes. Consider adding a flag to download/upload commands to optionally exclude source material for use cases where only the knowledge graph is needed.
- Streaming: All paths already use streaming (paged Cassandra queries, msgpack record-at-a-time). Library content should follow the same pattern.
- Cassandra schema migration: Changing the tuple width in the
triplestable requires careful handling. Cassandra frozen tuples cannot be altered in place — a migration strategy is needed (see Migration Plan).
Testing Strategy
- Unit tests: Triple round-trip with graph name (write → read →
verify
gfield preserved). Backward compatibility with 6-element tuples. - Integration tests: Full lifecycle — extract with provenance → download core → upload to fresh instance → load → verify provenance chain resolves.
- File format tests: Read old-format core files (no graph name, no library records) and verify they load without error.
- Library inclusion tests: Download core with source material → upload → verify documents accessible through librarian.
Migration Plan
Cassandra Schema
The triples table stores tuples in a list<tuple<...>> column. Cassandra
does not support altering the type of an existing column. Options:
Option A — New table: Create a triples_v2 table with the 7-element
tuple. Migrate data from triples to triples_v2. The read path checks
both tables during a transition period, then the old table is dropped.
Option B — Dual read: Keep the existing table. The read path handles both 6-element and 7-element tuples by checking length. New writes use 7-element tuples. This works if Cassandra accepts variable-length tuples in a list — needs verification.
Option C — Separate graph column: Instead of extending the tuple, add a
parallel graphs list<text> column where graphs[i] corresponds to
triples[i]. This avoids tuple migration entirely but requires keeping the
two lists in sync.
Recommendation: Verify Option B first (simplest). Fall back to Option A if Cassandra rejects mixed tuple lengths.
Core File Format
Backward compatible by design:
- Old files lack
"g"in triple dicts and have no"lm"/"lb"records → handled by defaults. - New files read by old code → old code ignores unknown record types (the
existing
read_messageraises on unknown types, so this needs a small fix to skip unknown types gracefully).
Open Questions
-
Provenance topic routing: Do provenance triples currently arrive at the
triples-inputtopic consumed by the knowledge core store? If not, what topic are they on? -
Include original documents?: Should cores include the original uploaded document (e.g. PDF), or only derived text (pages + chunks)? Including originals makes cores fully self-contained but potentially very large. Excluding them preserves provenance text display but loses the ability to serve the original file.
-
Optional source material: Should there be a flag on download/upload to include or exclude source material? This would let users choose between compact cores (knowledge only) and complete cores (knowledge + sources).
-
Cassandra tuple migration: Can Cassandra handle mixed-length tuples in a
list<tuple<...>>column, or is a table migration required? -
Document embedding cores: DE cores are managed alongside KG cores. Do they need the same treatment (source material inclusion)? The document embeddings reference chunk IDs — the same provenance chain applies.
-
Core versioning: Should the core file include a version marker so readers can distinguish old-format from new-format files without trial-and-error parsing?
References
- Extraction-time provenance:
docs/tech-specs/extraction-time-provenance.md - Query-time explainability:
docs/tech-specs/query-time-explainability.md - Agent explainability:
docs/tech-specs/agent-explainability.md - Data ownership model:
docs/tech-specs/data-ownership-model.md