feat: complete knowledge core storage — named graphs, provenance, source material (#973)

Implements all three changes from the knowledge-core-completeness tech spec: 1. Named graph field preserved through Cassandra storage (7-element tuple), enabling provenance triples to retain their graph URIs on round-trip. 2. Provenance triples already arrive on triples-input — no routing change needed; Change 1 was sufficient. 3. Source material (library documents) streamed alongside triples and embeddings during core download/upload. The knowledge manager fetches the document hierarchy from the librarian on download and recreates it on upload, preserving the full provenance chain across instances.
2026-06-10 23:35:14 +02:00 · 2026-06-03 10:46:52 +01:00 · 2026-06-03 10:46:52 +01:00 · 6df7471a55
commit 6df7471a55
parent aa158e1ba3
14 changed files with 1347 additions and 15 deletions
--- a/docs/tech-specs/knowledge-core-completeness.md
+++ b/docs/tech-specs/knowledge-core-completeness.md
@ -0,0 +1,535 @@
+---
+layout: default
+title: "Knowledge Core Completeness"
+parent: "Tech Specs"
+---
+
+# Knowledge Core Completeness
+
+## Overview
+
+Knowledge cores are portable snapshots of extracted knowledge: triples, graph
+embeddings, and document embeddings stored in Cassandra's `knowledge` keyspace.
+They can be downloaded as files, transferred between TrustGraph instances, and
+loaded back into vector and graph stores.
+
+Recent additions to TrustGraph — explainability/provenance and named graphs —
+were not carried through to the knowledge core system. This means that
+exporting and re-importing a core loses provenance links, graph assignments,
+and source material, breaking the explainability chain.
+
+This specification addresses three gaps:
+
+1. **Named graphs not stored** — The `g` (graph name) field on triples is
+   silently dropped when writing to the core store and comes back as `None`
+   on read.
+2. **Provenance triples not captured** — Provenance triples (PROV-O) are
+   generated during extraction and flow to graph stores, but never enter
+   the knowledge core store. It is unclear whether they arrive at the store
+   in the correct form.
+3. **Source material not included** — Documents, text pages, and chunks in
+   the librarian's bucket store are not part of the core. After loading a
+   core on a different instance, provenance links to source material point
+   at nothing.
+
+## Goals
+
+- **Self-contained cores**: A downloaded knowledge core file contains
+  everything needed to reconstruct the full knowledge graph including
+  provenance and source attribution on a fresh instance.
+- **Named graph preservation**: Round-tripping a core preserves graph
+  assignments on all triples.
+- **Backward compatibility**: Existing core files (without graph names or
+  source material) can still be uploaded and loaded. New fields are optional
+  on import.
+- **No change to core identity**: A core is still identified by its document
+  ID. The additional data is associated with the same core ID.
+- **Minimal file format changes**: Extend the existing msgpack record format
+  with new record types rather than restructuring existing ones.
+
+## Background
+
+### Current Lifecycle
+
+```
+Extraction pipeline
+    │
+    ├─ triples ──────────────────► knowledge core store (Cassandra)
+    ├─ graph embeddings ─────────► knowledge core store (Cassandra)
+    ├─ document embeddings ──────► knowledge core store (Cassandra)
+    ├─ provenance triples ───────► graph store (only)
+    └─ source documents ─────────► librarian bucket store (only)
+
+Download:  Cassandra ──► knowledge manager ──► API gateway ──► client file
+Upload:    client file ──► API gateway ──► knowledge manager ──► Cassandra
+Load:      Cassandra ──► knowledge manager ──► Pulsar topics ──► graph/vector stores
+```
+
+### Current Core File Format (msgpack)
+
+A core file is a sequence of concatenated msgpack records. Each record is a
+2-element tuple: `(type_tag, payload)`.
+
+| Type tag | Payload | Description |
+|----------|---------|-------------|
+| `"t"` | `{"m": {id, root, collection}, "t": [triple_dicts]}` | Triple batch |
+| `"ge"` | `{"m": {id, root, collection}, "e": [{entity, vector}]}` | Graph embedding batch |
+
+### What's Missing
+
+#### Named Graphs
+
+The `Triple` dataclass has a `g: str | None` field (graph name IRI), used to
+separate provenance graphs (`urn:graph:source`, `urn:graph:retrieval`) from
+the default graph. However:
+
+- **Cassandra schema** (`knowledge.triples` table): stores a 6-tuple per
+  triple `(s_val, s_is_uri, p_val, p_is_uri, o_val, o_is_uri)` — no graph
+  field.
+- **`add_triples()`** (`tables/knowledge.py:231`): destructures only `s`,
+  `p`, `o` — `g` is discarded.
+- **`get_triples()`** (`tables/knowledge.py:396`): reconstructs `Triple`
+  with `g` defaulting to `None`.
+- **Core file format**: triple dicts do not include a graph field.
+
+#### Provenance Triples
+
+Provenance triples are generated in the extraction pipeline
+(`trustgraph-base/trustgraph/provenance/triples.py`) and published to graph
+store topics. They use named graphs (`urn:graph:source`,
+`urn:graph:retrieval`) and PROV-O vocabulary.
+
+The knowledge core store processor (`storage/knowledge/store.py`) listens on
+`triples-input` and `graph-embeddings-input`. Whether provenance triples
+arrive on the same `triples-input` topic or a separate one needs
+verification. Even if they do arrive, the graph name would be lost (per
+above).
+
+#### Source Material
+
+The librarian stores the full document hierarchy in a separate system:
+
+- **Blob store** (S3/MinIO): original documents, text pages, chunks —
+  keyed by object UUID under `doc/{object_id}`.
+- **Cassandra `library` keyspace**: document metadata including `id`,
+  `kind` (MIME type), `title`, `parent_id`, `document_type`
+  (`source`/`extracted`), `object_id` (blob reference).
+
+Provenance triples link extracted facts back to chunk/page/document IDs.
+Those IDs resolve through the librarian. When a core is loaded on a
+different instance, the librarian has no matching documents, so the entire
+provenance chain is broken.
+
+### Key Source Files
+
+| Component | File | Purpose |
+|-----------|------|---------|
+| Core Cassandra schema | `trustgraph-flow/trustgraph/tables/knowledge.py` | Table definitions, read/write |
+| Core manager | `trustgraph-flow/trustgraph/cores/knowledge.py` | API operations, load-to-store |
+| Core store processor | `trustgraph-flow/trustgraph/storage/knowledge/store.py` | Extraction → Cassandra |
+| CLI download | `trustgraph-cli/trustgraph/cli/get_kg_core.py` | Core → msgpack file |
+| CLI upload | `trustgraph-cli/trustgraph/cli/put_kg_core.py` | Msgpack file → core |
+| CLI load | `trustgraph-cli/trustgraph/cli/load_kg_core.py` | Core → graph/vector stores |
+| API client | `trustgraph-base/trustgraph/api/knowledge.py` | Client-side knowledge API |
+| Triple schema | `trustgraph-base/trustgraph/schema/core/primitives.py` | Triple dataclass with `g` field |
+| Provenance generation | `trustgraph-base/trustgraph/provenance/triples.py` | PROV-O triple creation |
+| Librarian | `trustgraph-flow/trustgraph/librarian/librarian.py` | Document storage service |
+| Library tables | `trustgraph-flow/trustgraph/tables/library.py` | Document metadata in Cassandra |
+| Blob store | `trustgraph-flow/trustgraph/librarian/blob_store.py` | S3/MinIO object storage |
+
+## Technical Design
+
+### Change 1: Named Graph Field in Core Storage
+
+#### Cassandra Schema
+
+Extend the `triples` tuple from 6 to 7 elements, adding the graph name:
+
+```
+triples list<tuple<
+    text, boolean,       -- s_val, s_is_uri
+    text, boolean,       -- p_val, p_is_uri
+    text, boolean,       -- o_val, o_is_uri
+    text                 -- graph name (empty string = default graph)
+>>
+```
+
+**Migration**: The schema change uses `ALTER TABLE` or is handled by
+creating a new table version. Existing rows with 6-element tuples must be
+handled gracefully on read — if the tuple has 6 elements, treat graph as
+default.
+
+#### Write Path (`add_triples`)
+
+Change `tables/knowledge.py:add_triples()` to include `triple.g`:
+
+```python
+triples = [
+    (
+        *term_to_tuple(v.s), *term_to_tuple(v.p), *term_to_tuple(v.o),
+        v.g or ""
+    )
+    for v in m.triples
+]
+```
+
+#### Read Path (`get_triples`)
+
+Change `tables/knowledge.py:get_triples()` to restore the graph name:
+
+```python
+Triple(
+    s = tuple_to_term(elt[0], elt[1]),
+    p = tuple_to_term(elt[2], elt[3]),
+    o = tuple_to_term(elt[4], elt[5]),
+    g = elt[6] if len(elt) > 6 and elt[6] else None,
+)
+```
+
+The `len(elt) > 6` guard provides backward compatibility with existing
+6-element rows.
+
+#### Core File Format
+
+Extend triple dicts in the `"t"` record to include the graph name:
+
+```python
+# In get_kg_core.py write_triple — each triple dict gains "g" key
+{"s": ..., "p": ..., "o": ..., "g": "urn:graph:source"}
+```
+
+On read (`put_kg_core.py`), treat missing `"g"` key as default graph for
+backward compatibility with old core files.
+
+### Change 2: Provenance Triples in Cores
+
+#### Investigation Required
+
+Before implementation, verify:
+
+1. Whether provenance triples arrive on the `triples-input` topic that the
+   knowledge core store processor already listens on.
+2. If not, which topic they use, and whether the store processor should
+   subscribe to it.
+
+#### If provenance triples already arrive at the store
+
+The only change needed is Change 1 (named graphs) — the provenance triples
+are already being stored, just without their graph name. Once graph names
+are preserved, provenance triples will round-trip correctly.
+
+#### If provenance triples do NOT arrive at the store
+
+Two options:
+
+**Option A — Route provenance to the existing store topic**: Configure the
+flow so provenance triples are published to the same `triples-input` topic.
+This is the simpler approach and keeps the store processor unchanged.
+
+**Option B — Add a subscription**: Add a new `ConsumerSpec` in the store
+processor for the provenance topic. This keeps provenance routing
+independent but adds complexity.
+
+Recommendation: Option A, unless there is a reason provenance triples are
+intentionally kept off the core store topic.
+
+### Change 3: Source Material in Cores
+
+This is the largest change. The goal is that when a core is loaded on a
+fresh instance, provenance links to source material resolve.
+
+#### Architecture
+
+Source material is **not stored in the knowledge core tables**. It lives in
+the librarian (Cassandra `library` keyspace + S3/MinIO blob store) and is
+fetched on demand via the librarian's existing service API.
+
+The knowledge manager acts as a **client of the librarian service** — it
+calls the librarian's request/response API over pub/sub to retrieve document
+metadata and content. It does not access the library's Cassandra tables or
+blob store directly.
+
+#### Transport
+
+The librarian's pub/sub API already handles chunking of large documents.
+This chunking is designed to be websocket-friendly, so library content
+flowing through the API gateway to external clients does not require
+re-chunking. The API gateway remains a transport layer.
+
+```
+Download:
+  Knowledge manager ──pub/sub──► Librarian (fetch metadata + content)
+  Knowledge manager ──pub/sub──► API gateway ──websocket──► Client
+
+Upload:
+  Client ──websocket──► API gateway ──pub/sub──► Knowledge manager
+  Knowledge manager ──pub/sub──► Librarian (store metadata + content)
+```
+
+#### What to Include
+
+The provenance chain links facts → chunks → pages → documents. For the
+chain to resolve, the core must include:
+
+1. **Document metadata** — the library record for each document in the
+   hierarchy (id, kind, title, parent_id, document_type, etc.)
+2. **Document content** — the blob data for each document (original file,
+   extracted text pages, text chunks)
+
+Including the full hierarchy is necessary because:
+- A user viewing provenance needs to traverse fact → chunk → page → document
+- The chunk text is needed to show what text a fact was extracted from
+- The page text provides broader context
+- The original document is needed for full source attribution
+
+#### Size Implications
+
+Source material will significantly increase core file sizes. A rough model:
+
+| Component | Typical size per document |
+|-----------|-------------------------|
+| Triples + embeddings (current) | 1-10 MB |
+| Chunk text (all chunks) | ~same as original document |
+| Page text (all pages) | ~same as original document |
+| Original document (PDF, etc.) | Varies widely (KB to hundreds of MB) |
+
+For a 10 MB PDF, the core could grow from ~5 MB to ~25 MB (original +
+derived text + existing data). For large document sets, cores could become
+very large.
+
+**Decision needed**: Whether to include original documents or just derived
+text (pages + chunks). Including only derived text still allows provenance
+display but loses the ability to serve the original file.
+
+#### New Core File Record Types
+
+Add new msgpack record types for library content:
+
+| Type tag | Payload | Description |
+|----------|---------|-------------|
+| `"lm"` | `{"id", "kind", "title", "parent_id", "document_type", "comments", "tags", "metadata"}` | Library document metadata |
+| `"lb"` | `{"id", "data"}` | Library document blob content (chunked by pub/sub layer) |
+
+These are emitted after the existing `"t"` and `"ge"` records during
+download and processed during upload.
+
+#### Download Path
+
+Extend `KnowledgeManager.get_kg_core()` to:
+
+1. Stream triples and graph embeddings from the core store (existing
+   behavior).
+2. Use the librarian service API to retrieve documents associated with
+   this core ID:
+   a. Fetch the root document metadata and content.
+   b. Use `list-children` to discover child documents (pages, chunks).
+   c. Recursively fetch metadata and content for each child.
+3. Stream each document as `"lm"` (metadata) and `"lb"` (content) records.
+
+The knowledge manager gains the librarian service as a pub/sub dependency.
+Large document content is chunked by the librarian's existing pub/sub
+transport — the knowledge manager receives and forwards these chunks without
+buffering the full blob in memory.
+
+#### Upload Path
+
+Extend `KnowledgeManager.put_kg_core()` to handle the new record types:
+
+1. For `"lm"` records: call the librarian service API to create/update
+   the document metadata.
+2. For `"lb"` records: call the librarian service API to store the
+   document content.
+
+Parent-child relationships are preserved because `parent_id` is stored in
+the metadata. Documents should be processed in hierarchy order (parent
+before child) to satisfy any ordering constraints.
+
+#### Load Path
+
+The load path (`_load_kg_core`) publishes triples and embeddings to Pulsar
+topics for ingestion into graph/vector stores. Source material does not need
+to flow through the load path — it is already in the librarian after the
+upload step and can be accessed directly by services that need it.
+
+No changes to the load path for source material.
+
+#### CLI Changes
+
+**`tg-get-kg-core`**: Add handling for `"lm"` and `"lb"` record types in
+the file writer.
+
+**`tg-put-kg-core`**: Add handling for `"lm"` and `"lb"` record types in
+the file reader. Send library records to the knowledge manager alongside
+triple/embedding records.
+
+#### Associating Documents with Cores
+
+The core ID is `metadata.root`, which is the root document ID from the
+librarian. This provides a natural join: the core's root document and all
+its children (pages, chunks) are the source material for that core.
+
+The librarian's `list-children` API provides the child documents. A
+recursive traversal from the root document collects the full hierarchy.
+
+### API Changes
+
+#### KnowledgeResponse Schema
+
+Add optional fields to `KnowledgeResponse` for library data:
+
+```python
+@dataclass
+class KnowledgeResponse:
+    error: Error | None = None
+    ids: list | None = None
+    eos: bool = False
+    triples: Triples | None = None
+    graph_embeddings: GraphEmbeddings | None = None
+    document_embeddings: DocumentEmbeddings | None = None
+    library_metadata: LibraryMetadata | None = None    # new
+    library_blob: LibraryBlob | None = None            # new
+```
+
+#### New Schema Types
+
+```python
+@dataclass
+class LibraryMetadata:
+    id: str
+    kind: str | None = None
+    title: str | None = None
+    parent_id: str | None = None
+    document_type: str | None = None
+    comments: str | None = None
+    tags: list[str] | None = None
+    metadata: list[Triple] | None = None
+
+@dataclass
+class LibraryBlob:
+    id: str
+    data: bytes
+```
+
+#### Socket API
+
+The existing streaming protocol for `get-kg-core` / `put-kg-core` carries
+these new fields naturally — responses already stream multiple record types.
+
+### Dependencies Between Changes
+
+```
+Change 1 (named graphs)  ◄── Change 2 depends on this
+         │
+         └── Change 2 (provenance triples)
+                      │
+                      └── Change 3 (source material) is independent
+```
+
+Change 1 is a prerequisite for Change 2 (provenance triples use named
+graphs). Change 3 is independent and can be implemented in parallel.
+
+## Security Considerations
+
+- **Workspace isolation**: Core download/upload must respect workspace
+  boundaries. Source material from the librarian must only be included if
+  it belongs to the same workspace as the core. This is already enforced
+  by the existing workspace-scoped queries.
+- **Large blob transfer**: Streaming large documents through the API
+  is handled by the librarian's existing pub/sub chunking, which is
+  designed to be websocket-friendly. No additional chunking layer is
+  needed.
+- **Cross-instance trust**: When uploading a core from an external source,
+  the library content should be treated as untrusted input. Document
+  metadata and blob content should be validated before insertion.
+
+## Performance Considerations
+
+- **Core file size**: Including source material will significantly increase
+  core file sizes. Consider adding a flag to download/upload commands to
+  optionally exclude source material for use cases where only the knowledge
+  graph is needed.
+- **Streaming**: All paths already use streaming (paged Cassandra queries,
+  msgpack record-at-a-time). Library content should follow the same pattern.
+- **Cassandra schema migration**: Changing the tuple width in the `triples`
+  table requires careful handling. Cassandra frozen tuples cannot be altered
+  in place — a migration strategy is needed (see Migration Plan).
+
+## Testing Strategy
+
+- **Unit tests**: Triple round-trip with graph name (write → read →
+  verify `g` field preserved). Backward compatibility with 6-element tuples.
+- **Integration tests**: Full lifecycle — extract with provenance → download
+  core → upload to fresh instance → load → verify provenance chain resolves.
+- **File format tests**: Read old-format core files (no graph name, no
+  library records) and verify they load without error.
+- **Library inclusion tests**: Download core with source material → upload →
+  verify documents accessible through librarian.
+
+## Migration Plan
+
+### Cassandra Schema
+
+The `triples` table stores tuples in a `list<tuple<...>>` column. Cassandra
+does not support altering the type of an existing column. Options:
+
+**Option A — New table**: Create a `triples_v2` table with the 7-element
+tuple. Migrate data from `triples` to `triples_v2`. The read path checks
+both tables during a transition period, then the old table is dropped.
+
+**Option B — Dual read**: Keep the existing table. The read path handles
+both 6-element and 7-element tuples by checking length. New writes use
+7-element tuples. This works if Cassandra accepts variable-length tuples in
+a list — **needs verification**.
+
+**Option C — Separate graph column**: Instead of extending the tuple, add a
+parallel `graphs list<text>` column where `graphs[i]` corresponds to
+`triples[i]`. This avoids tuple migration entirely but requires keeping the
+two lists in sync.
+
+Recommendation: Verify Option B first (simplest). Fall back to Option A if
+Cassandra rejects mixed tuple lengths.
+
+### Core File Format
+
+Backward compatible by design:
+- Old files lack `"g"` in triple dicts and have no `"lm"`/`"lb"` records →
+  handled by defaults.
+- New files read by old code → old code ignores unknown record types (the
+  existing `read_message` raises on unknown types, so this needs a small
+  fix to skip unknown types gracefully).
+
+## Open Questions
+
+1. **Provenance topic routing**: Do provenance triples currently arrive at
+   the `triples-input` topic consumed by the knowledge core store? If not,
+   what topic are they on?
+
+2. **Include original documents?**: Should cores include the original
+   uploaded document (e.g. PDF), or only derived text (pages + chunks)?
+   Including originals makes cores fully self-contained but potentially
+   very large. Excluding them preserves provenance text display but loses
+   the ability to serve the original file.
+
+3. **Optional source material**: Should there be a flag on download/upload
+   to include or exclude source material? This would let users choose
+   between compact cores (knowledge only) and complete cores (knowledge +
+   sources).
+
+4. **Cassandra tuple migration**: Can Cassandra handle mixed-length tuples
+   in a `list<tuple<...>>` column, or is a table migration required?
+
+5. **Document embedding cores**: DE cores are managed alongside KG cores.
+   Do they need the same treatment (source material inclusion)?  The
+   document embeddings reference chunk IDs — the same provenance chain
+   applies.
+
+6. **Core versioning**: Should the core file include a version marker so
+   readers can distinguish old-format from new-format files without
+   trial-and-error parsing?
+
+## References
+
+- Extraction-time provenance: `docs/tech-specs/extraction-time-provenance.md`
+- Query-time explainability: `docs/tech-specs/query-time-explainability.md`
+- Agent explainability: `docs/tech-specs/agent-explainability.md`
+- Data ownership model: `docs/tech-specs/data-ownership-model.md`