trustgraph/docs/tech-specs/knowledge-core-completeness.md

---
layout: default
title: "Knowledge Core Completeness"
parent: "Tech Specs"
---

# Knowledge Core Completeness

## Overview

Knowledge cores are portable snapshots of extracted knowledge: triples, graph
embeddings, and document embeddings stored in Cassandra's `knowledge` keyspace.
They can be downloaded as files, transferred between TrustGraph instances, and
loaded back into vector and graph stores.

Recent additions to TrustGraph — explainability/provenance and named graphs —
were not carried through to the knowledge core system. This means that
exporting and re-importing a core loses provenance links, graph assignments,
and source material, breaking the explainability chain.

This specification addresses three gaps:

1. **Named graphs not stored** — The `g` (graph name) field on triples is
   silently dropped when writing to the core store and comes back as `None`
   on read.
2. **Provenance triples not captured** — Provenance triples (PROV-O) are
   generated during extraction and flow to graph stores, but never enter
   the knowledge core store. It is unclear whether they arrive at the store
   in the correct form.
3. **Source material not included** — Documents, text pages, and chunks in
   the librarian's bucket store are not part of the core. After loading a
   core on a different instance, provenance links to source material point
   at nothing.

## Goals

- **Self-contained cores**: A downloaded knowledge core file contains
  everything needed to reconstruct the full knowledge graph including
  provenance and source attribution on a fresh instance.
- **Named graph preservation**: Round-tripping a core preserves graph
  assignments on all triples.
- **Backward compatibility**: Existing core files (without graph names or
  source material) can still be uploaded and loaded. New fields are optional
  on import.
- **No change to core identity**: A core is still identified by its document
  ID. The additional data is associated with the same core ID.
- **Minimal file format changes**: Extend the existing msgpack record format
  with new record types rather than restructuring existing ones.

## Background

### Current Lifecycle

```
Extraction pipeline
    │
    ├─ triples ──────────────────► knowledge core store (Cassandra)
    ├─ graph embeddings ─────────► knowledge core store (Cassandra)
    ├─ document embeddings ──────► knowledge core store (Cassandra)
    ├─ provenance triples ───────► graph store (only)
    └─ source documents ─────────► librarian bucket store (only)

Download:  Cassandra ──► knowledge manager ──► API gateway ──► client file
Upload:    client file ──► API gateway ──► knowledge manager ──► Cassandra
Load:      Cassandra ──► knowledge manager ──► Pulsar topics ──► graph/vector stores
```

### Current Core File Format (msgpack)

A core file is a sequence of concatenated msgpack records. Each record is a
2-element tuple: `(type_tag, payload)`.

| Type tag | Payload | Description |
|----------|---------|-------------|
| `"t"` | `{"m": {id, root, collection}, "t": [triple_dicts]}` | Triple batch |
| `"ge"` | `{"m": {id, root, collection}, "e": [{entity, vector}]}` | Graph embedding batch |

### What's Missing

#### Named Graphs

The `Triple` dataclass has a `g: str | None` field (graph name IRI), used to
separate provenance graphs (`urn:graph:source`, `urn:graph:retrieval`) from
the default graph. However:

- **Cassandra schema** (`knowledge.triples` table): stores a 6-tuple per
  triple `(s_val, s_is_uri, p_val, p_is_uri, o_val, o_is_uri)` — no graph
  field.
- **`add_triples()`** (`tables/knowledge.py:231`): destructures only `s`,
  `p`, `o` — `g` is discarded.
- **`get_triples()`** (`tables/knowledge.py:396`): reconstructs `Triple`
  with `g` defaulting to `None`.
- **Core file format**: triple dicts do not include a graph field.

#### Provenance Triples

Provenance triples are generated in the extraction pipeline
(`trustgraph-base/trustgraph/provenance/triples.py`) and published to graph
store topics. They use named graphs (`urn:graph:source`,
`urn:graph:retrieval`) and PROV-O vocabulary.

The knowledge core store processor (`storage/knowledge/store.py`) listens on
`triples-input` and `graph-embeddings-input`. Whether provenance triples
arrive on the same `triples-input` topic or a separate one needs
verification. Even if they do arrive, the graph name would be lost (per
above).

#### Source Material

The librarian stores the full document hierarchy in a separate system:

- **Blob store** (S3/MinIO): original documents, text pages, chunks —
  keyed by object UUID under `doc/{object_id}`.
- **Cassandra `library` keyspace**: document metadata including `id`,
  `kind` (MIME type), `title`, `parent_id`, `document_type`
  (`source`/`extracted`), `object_id` (blob reference).

Provenance triples link extracted facts back to chunk/page/document IDs.
Those IDs resolve through the librarian. When a core is loaded on a
different instance, the librarian has no matching documents, so the entire
provenance chain is broken.

### Key Source Files

| Component | File | Purpose |
|-----------|------|---------|
| Core Cassandra schema | `trustgraph-flow/trustgraph/tables/knowledge.py` | Table definitions, read/write |
| Core manager | `trustgraph-flow/trustgraph/cores/knowledge.py` | API operations, load-to-store |
| Core store processor | `trustgraph-flow/trustgraph/storage/knowledge/store.py` | Extraction → Cassandra |
| CLI download | `trustgraph-cli/trustgraph/cli/get_kg_core.py` | Core → msgpack file |
| CLI upload | `trustgraph-cli/trustgraph/cli/put_kg_core.py` | Msgpack file → core |
| CLI load | `trustgraph-cli/trustgraph/cli/load_kg_core.py` | Core → graph/vector stores |
| API client | `trustgraph-base/trustgraph/api/knowledge.py` | Client-side knowledge API |
| Triple schema | `trustgraph-base/trustgraph/schema/core/primitives.py` | Triple dataclass with `g` field |
| Provenance generation | `trustgraph-base/trustgraph/provenance/triples.py` | PROV-O triple creation |
| Librarian | `trustgraph-flow/trustgraph/librarian/librarian.py` | Document storage service |
| Library tables | `trustgraph-flow/trustgraph/tables/library.py` | Document metadata in Cassandra |
| Blob store | `trustgraph-flow/trustgraph/librarian/blob_store.py` | S3/MinIO object storage |

## Technical Design

### Change 1: Named Graph Field in Core Storage

#### Cassandra Schema

Extend the `triples` tuple from 6 to 7 elements, adding the graph name:

```
triples list<tuple<
    text, boolean,       -- s_val, s_is_uri
    text, boolean,       -- p_val, p_is_uri
    text, boolean,       -- o_val, o_is_uri
    text                 -- graph name (empty string = default graph)
>>
```

**Migration**: The schema change uses `ALTER TABLE` or is handled by
creating a new table version. Existing rows with 6-element tuples must be
handled gracefully on read — if the tuple has 6 elements, treat graph as
default.

#### Write Path (`add_triples`)

Change `tables/knowledge.py:add_triples()` to include `triple.g`:

```python
triples = [
    (
        *term_to_tuple(v.s), *term_to_tuple(v.p), *term_to_tuple(v.o),
        v.g or ""
    )
    for v in m.triples
]
```

#### Read Path (`get_triples`)

Change `tables/knowledge.py:get_triples()` to restore the graph name:

```python
Triple(
    s = tuple_to_term(elt[0], elt[1]),
    p = tuple_to_term(elt[2], elt[3]),
    o = tuple_to_term(elt[4], elt[5]),
    g = elt[6] if len(elt) > 6 and elt[6] else None,
)
```

The `len(elt) > 6` guard provides backward compatibility with existing
6-element rows.

#### Core File Format

Extend triple dicts in the `"t"` record to include the graph name:

```python
# In get_kg_core.py write_triple — each triple dict gains "g" key
{"s": ..., "p": ..., "o": ..., "g": "urn:graph:source"}
```

On read (`put_kg_core.py`), treat missing `"g"` key as default graph for
backward compatibility with old core files.

### Change 2: Provenance Triples in Cores

#### Investigation Required

Before implementation, verify:

1. Whether provenance triples arrive on the `triples-input` topic that the
   knowledge core store processor already listens on.
2. If not, which topic they use, and whether the store processor should
   subscribe to it.

#### If provenance triples already arrive at the store

The only change needed is Change 1 (named graphs) — the provenance triples
are already being stored, just without their graph name. Once graph names
are preserved, provenance triples will round-trip correctly.

#### If provenance triples do NOT arrive at the store

Two options:

**Option A — Route provenance to the existing store topic**: Configure the
flow so provenance triples are published to the same `triples-input` topic.
This is the simpler approach and keeps the store processor unchanged.

**Option B — Add a subscription**: Add a new `ConsumerSpec` in the store
processor for the provenance topic. This keeps provenance routing
independent but adds complexity.

Recommendation: Option A, unless there is a reason provenance triples are
intentionally kept off the core store topic.

### Change 3: Source Material in Cores

This is the largest change. The goal is that when a core is loaded on a
fresh instance, provenance links to source material resolve.

#### Architecture

Source material is **not stored in the knowledge core tables**. It lives in
the librarian (Cassandra `library` keyspace + S3/MinIO blob store) and is
fetched on demand via the librarian's existing service API.

The knowledge manager acts as a **client of the librarian service** — it
calls the librarian's request/response API over pub/sub to retrieve document
metadata and content. It does not access the library's Cassandra tables or
blob store directly.

#### Transport

The librarian's pub/sub API already handles chunking of large documents.
This chunking is designed to be websocket-friendly, so library content
flowing through the API gateway to external clients does not require
re-chunking. The API gateway remains a transport layer.

```
Download:
  Knowledge manager ──pub/sub──► Librarian (fetch metadata + content)
  Knowledge manager ──pub/sub──► API gateway ──websocket──► Client

Upload:
  Client ──websocket──► API gateway ──pub/sub──► Knowledge manager
  Knowledge manager ──pub/sub──► Librarian (store metadata + content)
```

#### What to Include

The provenance chain links facts → chunks → pages → documents. For the
chain to resolve, the core must include:

1. **Document metadata** — the library record for each document in the
   hierarchy (id, kind, title, parent_id, document_type, etc.)
2. **Document content** — the blob data for each document (original file,
   extracted text pages, text chunks)

Including the full hierarchy is necessary because:
- A user viewing provenance needs to traverse fact → chunk → page → document
- The chunk text is needed to show what text a fact was extracted from
- The page text provides broader context
- The original document is needed for full source attribution

#### Size Implications

Source material will significantly increase core file sizes. A rough model:

| Component | Typical size per document |
|-----------|-------------------------|
| Triples + embeddings (current) | 1-10 MB |
| Chunk text (all chunks) | ~same as original document |
| Page text (all pages) | ~same as original document |
| Original document (PDF, etc.) | Varies widely (KB to hundreds of MB) |

For a 10 MB PDF, the core could grow from ~5 MB to ~25 MB (original +
derived text + existing data). For large document sets, cores could become
very large.

**Decision needed**: Whether to include original documents or just derived
text (pages + chunks). Including only derived text still allows provenance
display but loses the ability to serve the original file.

#### New Core File Record Types

Add new msgpack record types for library content:

| Type tag | Payload | Description |
|----------|---------|-------------|
| `"lm"` | `{"id", "kind", "title", "parent_id", "document_type", "comments", "tags", "metadata"}` | Library document metadata |
| `"lb"` | `{"id", "data"}` | Library document blob content (chunked by pub/sub layer) |

These are emitted after the existing `"t"` and `"ge"` records during
download and processed during upload.

#### Download Path

Extend `KnowledgeManager.get_kg_core()` to:

1. Stream triples and graph embeddings from the core store (existing
   behavior).
2. Use the librarian service API to retrieve documents associated with
   this core ID:
   a. Fetch the root document metadata and content.
   b. Use `list-children` to discover child documents (pages, chunks).
   c. Recursively fetch metadata and content for each child.
3. Stream each document as `"lm"` (metadata) and `"lb"` (content) records.

The knowledge manager gains the librarian service as a pub/sub dependency.
Large document content is chunked by the librarian's existing pub/sub
transport — the knowledge manager receives and forwards these chunks without
buffering the full blob in memory.

#### Upload Path

Extend `KnowledgeManager.put_kg_core()` to handle the new record types:

1. For `"lm"` records: call the librarian service API to create/update
   the document metadata.
2. For `"lb"` records: call the librarian service API to store the
   document content.

Parent-child relationships are preserved because `parent_id` is stored in
the metadata. Documents should be processed in hierarchy order (parent
before child) to satisfy any ordering constraints.

#### Load Path

The load path (`_load_kg_core`) publishes triples and embeddings to Pulsar
topics for ingestion into graph/vector stores. Source material does not need
to flow through the load path — it is already in the librarian after the
upload step and can be accessed directly by services that need it.

No changes to the load path for source material.

#### CLI Changes

**`tg-get-kg-core`**: Add handling for `"lm"` and `"lb"` record types in
the file writer.

**`tg-put-kg-core`**: Add handling for `"lm"` and `"lb"` record types in
the file reader. Send library records to the knowledge manager alongside
triple/embedding records.

#### Associating Documents with Cores

The core ID is `metadata.root`, which is the root document ID from the
librarian. This provides a natural join: the core's root document and all
its children (pages, chunks) are the source material for that core.

The librarian's `list-children` API provides the child documents. A
recursive traversal from the root document collects the full hierarchy.

### API Changes

#### KnowledgeResponse Schema

Add optional fields to `KnowledgeResponse` for library data:

```python
@dataclass
class KnowledgeResponse:
    error: Error | None = None
    ids: list | None = None
    eos: bool = False
    triples: Triples | None = None
    graph_embeddings: GraphEmbeddings | None = None
    document_embeddings: DocumentEmbeddings | None = None
    library_metadata: LibraryMetadata | None = None    # new
    library_blob: LibraryBlob | None = None            # new
```

#### New Schema Types

```python
@dataclass
class LibraryMetadata:
    id: str
    kind: str | None = None
    title: str | None = None
    parent_id: str | None = None
    document_type: str | None = None
    comments: str | None = None
    tags: list[str] | None = None
    metadata: list[Triple] | None = None

@dataclass
class LibraryBlob:
    id: str
    data: bytes
```

#### Socket API

The existing streaming protocol for `get-kg-core` / `put-kg-core` carries
these new fields naturally — responses already stream multiple record types.

### Dependencies Between Changes

```
Change 1 (named graphs)  ◄── Change 2 depends on this
         │
         └── Change 2 (provenance triples)
                      │
                      └── Change 3 (source material) is independent
```

Change 1 is a prerequisite for Change 2 (provenance triples use named
graphs). Change 3 is independent and can be implemented in parallel.

## Security Considerations

- **Workspace isolation**: Core download/upload must respect workspace
  boundaries. Source material from the librarian must only be included if
  it belongs to the same workspace as the core. This is already enforced
  by the existing workspace-scoped queries.
- **Large blob transfer**: Streaming large documents through the API
  is handled by the librarian's existing pub/sub chunking, which is
  designed to be websocket-friendly. No additional chunking layer is
  needed.
- **Cross-instance trust**: When uploading a core from an external source,
  the library content should be treated as untrusted input. Document
  metadata and blob content should be validated before insertion.

## Performance Considerations

- **Core file size**: Including source material will significantly increase
  core file sizes. Consider adding a flag to download/upload commands to
  optionally exclude source material for use cases where only the knowledge
  graph is needed.
- **Streaming**: All paths already use streaming (paged Cassandra queries,
  msgpack record-at-a-time). Library content should follow the same pattern.
- **Cassandra schema migration**: Changing the tuple width in the `triples`
  table requires careful handling. Cassandra frozen tuples cannot be altered
  in place — a migration strategy is needed (see Migration Plan).

## Testing Strategy

- **Unit tests**: Triple round-trip with graph name (write → read →
  verify `g` field preserved). Backward compatibility with 6-element tuples.
- **Integration tests**: Full lifecycle — extract with provenance → download
  core → upload to fresh instance → load → verify provenance chain resolves.
- **File format tests**: Read old-format core files (no graph name, no
  library records) and verify they load without error.
- **Library inclusion tests**: Download core with source material → upload →
  verify documents accessible through librarian.

## Migration Plan

### Cassandra Schema

The `triples` table stores tuples in a `list<tuple<...>>` column. Cassandra
does not support altering the type of an existing column. Options:

**Option A — New table**: Create a `triples_v2` table with the 7-element
tuple. Migrate data from `triples` to `triples_v2`. The read path checks
both tables during a transition period, then the old table is dropped.

**Option B — Dual read**: Keep the existing table. The read path handles
both 6-element and 7-element tuples by checking length. New writes use
7-element tuples. This works if Cassandra accepts variable-length tuples in
a list — **needs verification**.

**Option C — Separate graph column**: Instead of extending the tuple, add a
parallel `graphs list<text>` column where `graphs[i]` corresponds to
`triples[i]`. This avoids tuple migration entirely but requires keeping the
two lists in sync.

Recommendation: Verify Option B first (simplest). Fall back to Option A if
Cassandra rejects mixed tuple lengths.

### Core File Format

Backward compatible by design:
- Old files lack `"g"` in triple dicts and have no `"lm"`/`"lb"` records →
  handled by defaults.
- New files read by old code → old code ignores unknown record types (the
  existing `read_message` raises on unknown types, so this needs a small
  fix to skip unknown types gracefully).

## Open Questions

1. **Provenance topic routing**: Do provenance triples currently arrive at
   the `triples-input` topic consumed by the knowledge core store? If not,
   what topic are they on?

2. **Include original documents?**: Should cores include the original
   uploaded document (e.g. PDF), or only derived text (pages + chunks)?
   Including originals makes cores fully self-contained but potentially
   very large. Excluding them preserves provenance text display but loses
   the ability to serve the original file.

3. **Optional source material**: Should there be a flag on download/upload
   to include or exclude source material? This would let users choose
   between compact cores (knowledge only) and complete cores (knowledge +
   sources).

4. **Cassandra tuple migration**: Can Cassandra handle mixed-length tuples
   in a `list<tuple<...>>` column, or is a table migration required?

5. **Document embedding cores**: DE cores are managed alongside KG cores.
   Do they need the same treatment (source material inclusion)?  The
   document embeddings reference chunk IDs — the same provenance chain
   applies.

6. **Core versioning**: Should the core file include a version marker so
   readers can distinguish old-format from new-format files without
   trial-and-error parsing?

## References

- Extraction-time provenance: `docs/tech-specs/extraction-time-provenance.md`
- Query-time explainability: `docs/tech-specs/query-time-explainability.md`
- Agent explainability: `docs/tech-specs/agent-explainability.md`
- Data ownership model: `docs/tech-specs/data-ownership-model.md`