Subscribes to prompt request/response Pulsar queues, correlates
messages by ID, and logs a summary with template name, truncated
terms, and elapsed time. Streaming responses are accumulated and
shown at completion. Supports prompt and prompt-rag queue types.
Replace per-triple provenance reification with subgraph model
Extraction provenance previously created a full reification (statement
URI, activity, agent) for every single extracted triple, producing ~13
provenance triples per knowledge triple. Since each chunk is processed
by a single LLM call, this was both redundant and semantically
inaccurate.
Now one subgraph object is created per chunk extraction, with
tg:contains linking to each extracted triple. For 20 extractions from
a chunk this reduces provenance from ~260 triples to ~33.
- Rename tg:reifies -> tg:contains, stmt_uri -> subgraph_uri
- Replace triple_provenance_triples() with subgraph_provenance_triples()
- Refactor kg-extract-definitions and kg-extract-relationships to
generate provenance once per chunk instead of per triple
- Add subgraph provenance to kg-extract-ontology and kg-extract-agent
(previously had none)
- Update CLI tools and tech specs to match
Also rename tg-show-document-hierarchy to tg-show-extraction-provenance.
Added extra typing for extraction provenance, fixed extraction prov CLI
Add explainability CLI tools for debugging provenance data
- tg-show-document-hierarchy: Display document → page → chunk → edge
hierarchy by traversing prov:wasDerivedFrom relationships
- tg-list-explain-traces: List all GraphRAG sessions with questions
and timestamps from the retrieval graph
- tg-show-explain-trace: Show full explainability cascade for a
GraphRAG session (question → exploration → focus → synthesis)
These tools query the source and retrieval graphs to help debug
and explore provenance/explainability data stored during document
processing and GraphRAG queries.
New CLI tool that enables selective queries against the triple store
unlike tg-show-graph which dumps the entire graph.
Features:
- Filter by subject, predicate, object, and/or named graph
- Auto-detection of term types (IRI, literal, quoted triple)
- Two ways to specify quoted triples:
- Inline Turtle-style: -o "<<s p o>>"
- Explicit flags: --qt-subject, --qt-predicate, --qt-object
- Output formats: space-separated, pipe-separated, JSON, JSON Lines
- Streaming mode for efficient large result sets
Auto-detection rules:
- http://, https://, urn:, or <wrapped> -> IRI
- <<s p o>> -> quoted triple
- Otherwise -> literal
Tech spec
BlobStore (trustgraph-flow/trustgraph/librarian/blob_store.py):
- get_stream() - yields document content in chunks for streaming retrieval
- create_multipart_upload() - initializes S3 multipart upload, returns
upload_id
- upload_part() - uploads a single part, returns etag
- complete_multipart_upload() - finalizes upload with part etags
- abort_multipart_upload() - cancels and cleans up
Cassandra schema (trustgraph-flow/trustgraph/tables/library.py):
- New upload_session table with 24-hour TTL
- Index on user for listing sessions
- Prepared statements for all operations
- Methods: create_upload_session(), get_upload_session(),
update_upload_session_chunk(), delete_upload_session(),
list_upload_sessions()
- Schema extended with UploadSession, UploadProgress, and new
request/response fields
- Librarian methods: begin_upload, upload_chunk, complete_upload,
abort_upload, get_upload_status, list_uploads
- Service routing for all new operations
- Python SDK with transparent chunked upload:
- add_document() auto-switches to chunked for files > 10MB
- Progress callback support (on_progress)
- get_pending_uploads(), get_upload_status(), abort_upload(),
resume_upload()
- Document table: Added parent_id and document_type columns with index
- Document schema (knowledge/document.py): Added document_id field for
streaming retrieval
- Librarian operations:
- add-child-document for extracted PDF pages
- list-children to get child documents
- stream-document for chunked content retrieval
- Cascade delete removes children when parent is deleted
- list-documents filters children by default
- PDF decoder (decoding/pdf/pdf_decoder.py): Updated to stream large
documents from librarian API to temp file
- Librarian service (librarian/service.py): Sends document_id instead of
content for large PDFs (>2MB)
- Deprecated tools (load_pdf.py, load_text.py): Added deprecation
warnings directing users to tg-add-library-document +
tg-start-library-processing
Remove load_pdf and load_text utils
Move chunker/librarian comms to base class
Updating tests
* CLI tools for tg-invoke-graph-embeddings, tg-invoke-document-embeddings,
and tg-invoke-embeddings. Just useful for diagnostics.
* Fix tg-load-knowledge
* Changed schema for Value -> Term, majorly breaking change
* Following the schema change, Value -> Term into all processing
* Updated Cassandra for g, p, s, o index patterns (7 indexes)
* Reviewed and updated all tests
* Neo4j, Memgraph and FalkorDB remain broken, will look at once settled down
* Updates for agent API with streaming support
* Added tg-dump-queues tool to dump Pulsar queues to a log
* Updated tg-invoke-agent, incremental output
* Queue dumper CLI - might be useful for debug
* Updating for tests
* Tweak the structured query schema
* Structure query service
* Gateway support for nlp-query and structured-query
* API support
* Added CLI
* Update tests
* More tests