mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-06-10 15:25:14 +02:00

feat: complete knowledge core storage — named graphs, provenance, source material (#973 )

Implements all three changes from the knowledge-core-completeness tech spec:

1. Named graph field preserved through Cassandra storage (7-element tuple),
   enabling provenance triples to retain their graph URIs on round-trip.

2. Provenance triples already arrive on triples-input — no routing change
   needed; Change 1 was sufficient.

3. Source material (library documents) streamed alongside triples and
   embeddings during core download/upload. The knowledge manager fetches
   the document hierarchy from the librarian on download and recreates it
   on upload, preserving the full provenance chain across instances.

2026-06-03 10:46:52 +01:00

21 KiB

Raw Blame History

layout	title	parent
default	Knowledge Core Completeness	Tech Specs

Knowledge Core Completeness

Overview

Knowledge cores are portable snapshots of extracted knowledge: triples, graph embeddings, and document embeddings stored in Cassandra's knowledge keyspace. They can be downloaded as files, transferred between TrustGraph instances, and loaded back into vector and graph stores.

Recent additions to TrustGraph — explainability/provenance and named graphs — were not carried through to the knowledge core system. This means that exporting and re-importing a core loses provenance links, graph assignments, and source material, breaking the explainability chain.

This specification addresses three gaps:

Named graphs not stored — The g (graph name) field on triples is silently dropped when writing to the core store and comes back as None on read.
Provenance triples not captured — Provenance triples (PROV-O) are generated during extraction and flow to graph stores, but never enter the knowledge core store. It is unclear whether they arrive at the store in the correct form.
Source material not included — Documents, text pages, and chunks in the librarian's bucket store are not part of the core. After loading a core on a different instance, provenance links to source material point at nothing.

Goals

Self-contained cores: A downloaded knowledge core file contains everything needed to reconstruct the full knowledge graph including provenance and source attribution on a fresh instance.
Named graph preservation: Round-tripping a core preserves graph assignments on all triples.
Backward compatibility: Existing core files (without graph names or source material) can still be uploaded and loaded. New fields are optional on import.
No change to core identity: A core is still identified by its document ID. The additional data is associated with the same core ID.
Minimal file format changes: Extend the existing msgpack record format with new record types rather than restructuring existing ones.

Background

Current Lifecycle

Extraction pipeline
    │
    ├─ triples ──────────────────► knowledge core store (Cassandra)
    ├─ graph embeddings ─────────► knowledge core store (Cassandra)
    ├─ document embeddings ──────► knowledge core store (Cassandra)
    ├─ provenance triples ───────► graph store (only)
    └─ source documents ─────────► librarian bucket store (only)

Download:  Cassandra ──► knowledge manager ──► API gateway ──► client file
Upload:    client file ──► API gateway ──► knowledge manager ──► Cassandra
Load:      Cassandra ──► knowledge manager ──► Pulsar topics ──► graph/vector stores

Current Core File Format (msgpack)

A core file is a sequence of concatenated msgpack records. Each record is a 2-element tuple: (type_tag, payload).

Type tag	Payload	Description
`"t"`	`{"m": {id, root, collection}, "t": [triple_dicts]}`	Triple batch
`"ge"`	`{"m": {id, root, collection}, "e": [{entity, vector}]}`	Graph embedding batch

What's Missing

Named Graphs

The Triple dataclass has a g: str | None field (graph name IRI), used to separate provenance graphs (urn:graph:source, urn:graph:retrieval) from the default graph. However:

Cassandra schema (knowledge.triples table): stores a 6-tuple per triple (s_val, s_is_uri, p_val, p_is_uri, o_val, o_is_uri) — no graph field.
add_triples() (tables/knowledge.py:231): destructures only s, p, o — g is discarded.
get_triples() (tables/knowledge.py:396): reconstructs Triple with g defaulting to None.
Core file format: triple dicts do not include a graph field.

Provenance Triples

Provenance triples are generated in the extraction pipeline (trustgraph-base/trustgraph/provenance/triples.py) and published to graph store topics. They use named graphs (urn:graph:source, urn:graph:retrieval) and PROV-O vocabulary.

The knowledge core store processor (storage/knowledge/store.py) listens on triples-input and graph-embeddings-input. Whether provenance triples arrive on the same triples-input topic or a separate one needs verification. Even if they do arrive, the graph name would be lost (per above).

Source Material

The librarian stores the full document hierarchy in a separate system:

Blob store (S3/MinIO): original documents, text pages, chunks — keyed by object UUID under doc/{object_id}.
Cassandra library keyspace: document metadata including id, kind (MIME type), title, parent_id, document_type (source/extracted), object_id (blob reference).

Provenance triples link extracted facts back to chunk/page/document IDs. Those IDs resolve through the librarian. When a core is loaded on a different instance, the librarian has no matching documents, so the entire provenance chain is broken.

Key Source Files

Component	File	Purpose
Core Cassandra schema	`trustgraph-flow/trustgraph/tables/knowledge.py`	Table definitions, read/write
Core manager	`trustgraph-flow/trustgraph/cores/knowledge.py`	API operations, load-to-store
Core store processor	`trustgraph-flow/trustgraph/storage/knowledge/store.py`	Extraction → Cassandra
CLI download	`trustgraph-cli/trustgraph/cli/get_kg_core.py`	Core → msgpack file
CLI upload	`trustgraph-cli/trustgraph/cli/put_kg_core.py`	Msgpack file → core
CLI load	`trustgraph-cli/trustgraph/cli/load_kg_core.py`	Core → graph/vector stores
API client	`trustgraph-base/trustgraph/api/knowledge.py`	Client-side knowledge API
Triple schema	`trustgraph-base/trustgraph/schema/core/primitives.py`	Triple dataclass with `g` field
Provenance generation	`trustgraph-base/trustgraph/provenance/triples.py`	PROV-O triple creation
Librarian	`trustgraph-flow/trustgraph/librarian/librarian.py`	Document storage service
Library tables	`trustgraph-flow/trustgraph/tables/library.py`	Document metadata in Cassandra
Blob store	`trustgraph-flow/trustgraph/librarian/blob_store.py`	S3/MinIO object storage

Technical Design

Change 1: Named Graph Field in Core Storage

Cassandra Schema

Extend the triples tuple from 6 to 7 elements, adding the graph name:

triples list<tuple<
    text, boolean,       -- s_val, s_is_uri
    text, boolean,       -- p_val, p_is_uri
    text, boolean,       -- o_val, o_is_uri
    text                 -- graph name (empty string = default graph)
>>

Migration: The schema change uses ALTER TABLE or is handled by creating a new table version. Existing rows with 6-element tuples must be handled gracefully on read — if the tuple has 6 elements, treat graph as default.

Write Path (`add_triples`)

Change tables/knowledge.py:add_triples() to include triple.g:

triples = [
    (
        *term_to_tuple(v.s), *term_to_tuple(v.p), *term_to_tuple(v.o),
        v.g or ""
    )
    for v in m.triples
]

Read Path (`get_triples`)

Change tables/knowledge.py:get_triples() to restore the graph name:

Triple(
    s = tuple_to_term(elt[0], elt[1]),
    p = tuple_to_term(elt[2], elt[3]),
    o = tuple_to_term(elt[4], elt[5]),
    g = elt[6] if len(elt) > 6 and elt[6] else None,
)

The len(elt) > 6 guard provides backward compatibility with existing 6-element rows.

Core File Format

Extend triple dicts in the "t" record to include the graph name:

# In get_kg_core.py write_triple — each triple dict gains "g" key
{"s": ..., "p": ..., "o": ..., "g": "urn:graph:source"}

On read (put_kg_core.py), treat missing "g" key as default graph for backward compatibility with old core files.

Change 2: Provenance Triples in Cores

Investigation Required

Before implementation, verify:

Whether provenance triples arrive on the triples-input topic that the knowledge core store processor already listens on.
If not, which topic they use, and whether the store processor should subscribe to it.

If provenance triples already arrive at the store

The only change needed is Change 1 (named graphs) — the provenance triples are already being stored, just without their graph name. Once graph names are preserved, provenance triples will round-trip correctly.

If provenance triples do NOT arrive at the store

Two options:

Option A — Route provenance to the existing store topic: Configure the flow so provenance triples are published to the same triples-input topic. This is the simpler approach and keeps the store processor unchanged.

Option B — Add a subscription: Add a new ConsumerSpec in the store processor for the provenance topic. This keeps provenance routing independent but adds complexity.

Recommendation: Option A, unless there is a reason provenance triples are intentionally kept off the core store topic.

Change 3: Source Material in Cores

This is the largest change. The goal is that when a core is loaded on a fresh instance, provenance links to source material resolve.

Architecture

Source material is not stored in the knowledge core tables. It lives in the librarian (Cassandra library keyspace + S3/MinIO blob store) and is fetched on demand via the librarian's existing service API.

The knowledge manager acts as a client of the librarian service — it calls the librarian's request/response API over pub/sub to retrieve document metadata and content. It does not access the library's Cassandra tables or blob store directly.

Transport

The librarian's pub/sub API already handles chunking of large documents. This chunking is designed to be websocket-friendly, so library content flowing through the API gateway to external clients does not require re-chunking. The API gateway remains a transport layer.

Download:
  Knowledge manager ──pub/sub──► Librarian (fetch metadata + content)
  Knowledge manager ──pub/sub──► API gateway ──websocket──► Client

Upload:
  Client ──websocket──► API gateway ──pub/sub──► Knowledge manager
  Knowledge manager ──pub/sub──► Librarian (store metadata + content)

What to Include

The provenance chain links facts → chunks → pages → documents. For the chain to resolve, the core must include:

Document metadata — the library record for each document in the hierarchy (id, kind, title, parent_id, document_type, etc.)
Document content — the blob data for each document (original file, extracted text pages, text chunks)

Including the full hierarchy is necessary because:

A user viewing provenance needs to traverse fact → chunk → page → document
The chunk text is needed to show what text a fact was extracted from
The page text provides broader context
The original document is needed for full source attribution

Size Implications

Source material will significantly increase core file sizes. A rough model:

Component	Typical size per document
Triples + embeddings (current)	1-10 MB
Chunk text (all chunks)	~same as original document
Page text (all pages)	~same as original document
Original document (PDF, etc.)	Varies widely (KB to hundreds of MB)

For a 10 MB PDF, the core could grow from ~5 MB to ~25 MB (original + derived text + existing data). For large document sets, cores could become very large.

Decision needed: Whether to include original documents or just derived text (pages + chunks). Including only derived text still allows provenance display but loses the ability to serve the original file.

New Core File Record Types

Add new msgpack record types for library content:

Type tag	Payload	Description
`"lm"`	`{"id", "kind", "title", "parent_id", "document_type", "comments", "tags", "metadata"}`	Library document metadata
`"lb"`	`{"id", "data"}`	Library document blob content (chunked by pub/sub layer)

These are emitted after the existing "t" and "ge" records during download and processed during upload.

Download Path

Extend KnowledgeManager.get_kg_core() to:

Stream triples and graph embeddings from the core store (existing behavior).
Use the librarian service API to retrieve documents associated with this core ID: a. Fetch the root document metadata and content. b. Use list-children to discover child documents (pages, chunks). c. Recursively fetch metadata and content for each child.
Stream each document as "lm" (metadata) and "lb" (content) records.

The knowledge manager gains the librarian service as a pub/sub dependency. Large document content is chunked by the librarian's existing pub/sub transport — the knowledge manager receives and forwards these chunks without buffering the full blob in memory.

Upload Path

Extend KnowledgeManager.put_kg_core() to handle the new record types:

For "lm" records: call the librarian service API to create/update the document metadata.
For "lb" records: call the librarian service API to store the document content.

Parent-child relationships are preserved because parent_id is stored in the metadata. Documents should be processed in hierarchy order (parent before child) to satisfy any ordering constraints.

Load Path

The load path (_load_kg_core) publishes triples and embeddings to Pulsar topics for ingestion into graph/vector stores. Source material does not need to flow through the load path — it is already in the librarian after the upload step and can be accessed directly by services that need it.

No changes to the load path for source material.

CLI Changes

tg-get-kg-core: Add handling for "lm" and "lb" record types in the file writer.

tg-put-kg-core: Add handling for "lm" and "lb" record types in the file reader. Send library records to the knowledge manager alongside triple/embedding records.

Associating Documents with Cores

The core ID is metadata.root, which is the root document ID from the librarian. This provides a natural join: the core's root document and all its children (pages, chunks) are the source material for that core.

The librarian's list-children API provides the child documents. A recursive traversal from the root document collects the full hierarchy.

API Changes

KnowledgeResponse Schema

Add optional fields to KnowledgeResponse for library data:

@dataclass
class KnowledgeResponse:
    error: Error | None = None
    ids: list | None = None
    eos: bool = False
    triples: Triples | None = None
    graph_embeddings: GraphEmbeddings | None = None
    document_embeddings: DocumentEmbeddings | None = None
    library_metadata: LibraryMetadata | None = None    # new
    library_blob: LibraryBlob | None = None            # new

New Schema Types

@dataclass
class LibraryMetadata:
    id: str
    kind: str | None = None
    title: str | None = None
    parent_id: str | None = None
    document_type: str | None = None
    comments: str | None = None
    tags: list[str] | None = None
    metadata: list[Triple] | None = None

@dataclass
class LibraryBlob:
    id: str
    data: bytes

Socket API

The existing streaming protocol for get-kg-core / put-kg-core carries these new fields naturally — responses already stream multiple record types.

Dependencies Between Changes

Change 1 (named graphs)  ◄── Change 2 depends on this
         │
         └── Change 2 (provenance triples)
                      │
                      └── Change 3 (source material) is independent

Change 1 is a prerequisite for Change 2 (provenance triples use named graphs). Change 3 is independent and can be implemented in parallel.

Security Considerations

Workspace isolation: Core download/upload must respect workspace boundaries. Source material from the librarian must only be included if it belongs to the same workspace as the core. This is already enforced by the existing workspace-scoped queries.
Large blob transfer: Streaming large documents through the API is handled by the librarian's existing pub/sub chunking, which is designed to be websocket-friendly. No additional chunking layer is needed.
Cross-instance trust: When uploading a core from an external source, the library content should be treated as untrusted input. Document metadata and blob content should be validated before insertion.

Performance Considerations

Core file size: Including source material will significantly increase core file sizes. Consider adding a flag to download/upload commands to optionally exclude source material for use cases where only the knowledge graph is needed.
Streaming: All paths already use streaming (paged Cassandra queries, msgpack record-at-a-time). Library content should follow the same pattern.
Cassandra schema migration: Changing the tuple width in the triples table requires careful handling. Cassandra frozen tuples cannot be altered in place — a migration strategy is needed (see Migration Plan).

Testing Strategy

Unit tests: Triple round-trip with graph name (write → read → verify g field preserved). Backward compatibility with 6-element tuples.
Integration tests: Full lifecycle — extract with provenance → download core → upload to fresh instance → load → verify provenance chain resolves.
File format tests: Read old-format core files (no graph name, no library records) and verify they load without error.
Library inclusion tests: Download core with source material → upload → verify documents accessible through librarian.

Migration Plan

Cassandra Schema

The triples table stores tuples in a list<tuple<...>> column. Cassandra does not support altering the type of an existing column. Options:

Option A — New table: Create a triples_v2 table with the 7-element tuple. Migrate data from triples to triples_v2. The read path checks both tables during a transition period, then the old table is dropped.

Option B — Dual read: Keep the existing table. The read path handles both 6-element and 7-element tuples by checking length. New writes use 7-element tuples. This works if Cassandra accepts variable-length tuples in a list — needs verification.

Option C — Separate graph column: Instead of extending the tuple, add a parallel graphs list<text> column where graphs[i] corresponds to triples[i]. This avoids tuple migration entirely but requires keeping the two lists in sync.

Recommendation: Verify Option B first (simplest). Fall back to Option A if Cassandra rejects mixed tuple lengths.

Core File Format

Backward compatible by design:

Old files lack "g" in triple dicts and have no "lm"/"lb" records → handled by defaults.
New files read by old code → old code ignores unknown record types (the existing read_message raises on unknown types, so this needs a small fix to skip unknown types gracefully).

Open Questions

Provenance topic routing: Do provenance triples currently arrive at the triples-input topic consumed by the knowledge core store? If not, what topic are they on?
Include original documents?: Should cores include the original uploaded document (e.g. PDF), or only derived text (pages + chunks)? Including originals makes cores fully self-contained but potentially very large. Excluding them preserves provenance text display but loses the ability to serve the original file.
Optional source material: Should there be a flag on download/upload to include or exclude source material? This would let users choose between compact cores (knowledge only) and complete cores (knowledge + sources).
Cassandra tuple migration: Can Cassandra handle mixed-length tuples in a list<tuple<...>> column, or is a table migration required?
Document embedding cores: DE cores are managed alongside KG cores. Do they need the same treatment (source material inclusion)? The document embeddings reference chunk IDs — the same provenance chain applies.
Core versioning: Should the core file include a version marker so readers can distinguish old-format from new-format files without trial-and-error parsing?

References

Extraction-time provenance: docs/tech-specs/extraction-time-provenance.md
Query-time explainability: docs/tech-specs/query-time-explainability.md
Agent explainability: docs/tech-specs/agent-explainability.md
Data ownership model: docs/tech-specs/data-ownership-model.md

21 KiB Raw Blame History

Knowledge Core Completeness

Overview

Goals

Background

Current Lifecycle

Current Core File Format (msgpack)

What's Missing

Named Graphs

Provenance Triples

Source Material

Key Source Files

Technical Design

Change 1: Named Graph Field in Core Storage

Cassandra Schema

Write Path (add_triples)

Read Path (get_triples)

Core File Format

Change 2: Provenance Triples in Cores

Investigation Required

If provenance triples already arrive at the store

If provenance triples do NOT arrive at the store

Change 3: Source Material in Cores

Architecture

Transport

What to Include

Size Implications

New Core File Record Types

Download Path

Upload Path

Load Path

CLI Changes

Associating Documents with Cores

API Changes

KnowledgeResponse Schema

New Schema Types

Socket API

Dependencies Between Changes

Security Considerations

Performance Considerations

Testing Strategy

Migration Plan

Cassandra Schema

Core File Format

Open Questions

References

21 KiB

Raw Blame History

Write Path (`add_triples`)

Read Path (`get_triples`)