1312 commits

Author SHA1 Message Date
cybermaggedon
24bbe94136
Document chunks not stored in vector store (#665)
- Schema - ChunkEmbeddings now uses chunk_id: str instead of chunk: bytes
- Schema - DocumentEmbeddingsResponse now returns chunk_ids: list[str]
  instead of chunks
- Translators - Updated to serialize/deserialize chunk_id
- Clients - DocumentEmbeddingsClient.query() returns chunk_ids
- SDK/API - flow.py, socket_client.py, bulk_client.py updated
- Document embeddings service - Stores chunk_id (document ID) instead
  of chunk text
- Storage writers - Qdrant, Milvus, Pinecone store chunk_id in payload
- Query services - Return chunk_id from vector store searches
- Gateway dispatchers - Serialize chunk_id in API responses
- Document RAG - Added librarian client to fetch chunk content from
  Garage using chunk_ids
- CLI tools - Updated all three tools:
  - invoke_document_embeddings.py - displays chunk_ids, removed
    max_chunk_length
  - save_doc_embeds.py - exports chunk_id
  - load_doc_embeds.py - imports chunk_id
2026-03-07 23:10:45 +00:00
cybermaggedon
be358efe67
Fix tests (#663) 2026-03-06 12:40:02 +00:00
cybermaggedon
2b9232917c
Fix/extraction prov (#662)
Quoted triple fixes, including...

1. Updated triple_provenance_triples() in triples.py:
   - Now accepts a Triple object directly
   - Creates the reification triple using TRIPLE term type: stmt_uri tg:reifies
         <<extracted_triple>>
   - Includes it in the returned provenance triples
    
2. Updated definitions extractor:
   - Added imports for provenance functions and component version
   - Added ParameterSpec for optional llm-model and ontology flow parameters
   - For each definition triple, generates provenance with reification
    
3. Updated relationships extractor:
   - Same changes as definitions extractor
2026-03-06 12:23:58 +00:00
cybermaggedon
cd5580be59
Extract-time provenance (#661)
1. Shared Provenance Module - URI generators, namespace constants,
   triple builders, vocabulary bootstrap
2. Librarian - Emits document metadata to graph on processing
   initiation (vocabulary bootstrap + PROV-O triples)
3. PDF Extractor - Saves pages as child documents, emits parent-child
   provenance edges, forwards page IDs
4. Chunker - Saves chunks as child documents, emits provenance edges,
   forwards chunk ID + content
5. Knowledge Extractors (both definitions and relationships):
   - Link entities to chunks via SUBJECT_OF (not top-level document)
   - Removed duplicate metadata emission (now handled by librarian)
   - Get chunk_doc_id and chunk_uri from incoming Chunk message
6. Embedding Provenance:
   - EntityContext schema has chunk_id field
   - EntityEmbeddings schema has chunk_id field
   - Definitions extractor sets chunk_id when creating EntityContext
   - Graph embeddings processor passes chunk_id through to
     EntityEmbeddings

Provenance Flow:
Document → Page (PDF) → Chunk → Extracted Facts/Embeddings
    ↓           ↓          ↓              ↓
  librarian  librarian  librarian    (chunk_id reference)
  + graph    + graph    + graph

Each artifact is stored in librarian with parent-child linking, and PROV-O
edges are emitted to the knowledge graph for full traceability from any
extracted fact back to its source document.

Also, updating tests
2026-03-05 18:36:10 +00:00
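The provenance flow described in the commit above (Document → Page → Chunk → extracted facts) can be illustrated with a small sketch. It is not the project's provenance module: the `urn:example:` identifiers and both helper functions are hypothetical, and the use of `prov:wasDerivedFrom` is an assumption, since the message only says that PROV-O edges are emitted.

```python
from typing import Dict, List, Tuple

# PROV-O derivation predicate; assumed here, the commit does not name the predicate.
PROV_WAS_DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"

Triple = Tuple[str, str, str]


def derivation_edges(chain: List[str]) -> List[Triple]:
    """Build child -> parent edges for a chain ordered from source to leaf."""
    return [
        (child, PROV_WAS_DERIVED_FROM, parent)
        for parent, child in zip(chain, chain[1:])
    ]


def trace_to_source(start: str, triples: List[Triple]) -> List[str]:
    """Follow derivation edges from an artifact back to its root document.

    Assumes each artifact has at most one parent and the edges are acyclic.
    """
    parents: Dict[str, str] = {
        s: o for s, p, o in triples if p == PROV_WAS_DERIVED_FROM
    }
    path = [start]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path


# Hypothetical identifiers for a document, one of its pages, and one chunk.
chain = [
    "urn:example:doc-1",
    "urn:example:doc-1/page-2",
    "urn:example:doc-1/page-2/chunk-0",
]
edges = derivation_edges(chain)
lineage = trace_to_source(chain[-1], edges)
```

Given a chunk_id stored alongside an embedding, a lookup like `trace_to_source` is what lets a retrieved fact be traced back to its source document.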
cybermaggedon
d8f0a576af
Document API updates (#660)
* Doc streaming from librarian

* Fix chunk minimum confusion

* Add CLI args
2026-03-05 15:20:45 +00:00
cybermaggedon
a630e143ef
Incremental / large document loading (#659)
Tech spec

BlobStore (trustgraph-flow/trustgraph/librarian/blob_store.py):
- get_stream() - yields document content in chunks for streaming retrieval
- create_multipart_upload() - initializes S3 multipart upload, returns
  upload_id
- upload_part() - uploads a single part, returns etag
- complete_multipart_upload() - finalizes upload with part etags
- abort_multipart_upload() - cancels and cleans up

Cassandra schema (trustgraph-flow/trustgraph/tables/library.py):
- New upload_session table with 24-hour TTL
- Index on user for listing sessions
- Prepared statements for all operations
- Methods: create_upload_session(), get_upload_session(),
  update_upload_session_chunk(), delete_upload_session(),
  list_upload_sessions()

- Schema extended with UploadSession, UploadProgress, and new
  request/response fields
- Librarian methods: begin_upload, upload_chunk, complete_upload,
  abort_upload, get_upload_status, list_uploads
- Service routing for all new operations
- Python SDK with transparent chunked upload:
  - add_document() auto-switches to chunked for files > 10MB
  - Progress callback support (on_progress)
  - get_pending_uploads(), get_upload_status(), abort_upload(),
    resume_upload()

- Document table: Added parent_id and document_type columns with index
- Document schema (knowledge/document.py): Added document_id field for
  streaming retrieval
- Librarian operations:
  - add-child-document for extracted PDF pages
  - list-children to get child documents
  - stream-document for chunked content retrieval
  - Cascade delete removes children when parent is deleted
  - list-documents filters children by default
- PDF decoder (decoding/pdf/pdf_decoder.py): Updated to stream large
  documents from librarian API to temp file
- Librarian service (librarian/service.py): Sends document_id instead of
  content for large PDFs (>2MB)
- Deprecated tools (load_pdf.py, load_text.py): Added deprecation
  warnings directing users to tg-add-library-document +
  tg-start-library-processing

Remove load_pdf and load_text utils

Move chunker/librarian comms to base class

Updating tests
2026-03-04 16:57:58 +00:00
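The chunked upload with a progress callback described in the commit above can be sketched as follows. This is an illustration only, not the SDK's code: `split_parts`, `upload_in_parts`, and `fake_send_part` are hypothetical names, and the in-memory `fake_send_part` stands in for whatever call actually uploads a part and returns its etag.

```python
import hashlib
from typing import Callable, Iterator, List, Optional, Tuple


def split_parts(data: bytes, part_size: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (part_number, payload) pairs; part numbers start at 1."""
    if part_size < 1:
        raise ValueError("part_size must be at least 1")
    for number, offset in enumerate(range(0, len(data), part_size), start=1):
        yield number, data[offset:offset + part_size]


def upload_in_parts(
    data: bytes,
    part_size: int,
    send_part: Callable[[int, bytes], str],
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> List[Tuple[int, str]]:
    """Send each part in order and collect (part_number, etag) pairs."""
    sent = 0
    etags: List[Tuple[int, str]] = []
    for number, payload in split_parts(data, part_size):
        etags.append((number, send_part(number, payload)))
        sent += len(payload)
        if on_progress is not None:
            on_progress(sent, len(data))  # bytes sent so far, total bytes
    return etags


def fake_send_part(number: int, payload: bytes) -> str:
    """Hypothetical stand-in for a real part upload; returns a fake etag."""
    return hashlib.md5(payload).hexdigest()


progress: List[Tuple[int, int]] = []
etags = upload_in_parts(
    b"x" * 25, 10, fake_send_part, lambda sent, total: progress.append((sent, total))
)
```

The collected `(part_number, etag)` pairs are the kind of information a completion step needs in order to finalize a multipart upload.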
cybermaggedon
a38ca9474f
Tool services - dynamically pluggable tool implementations for agent frameworks (#658)
* New schema

* Tool service implementation

* Base class

* Joke service, for testing

* Update unit tests for tool services
2026-03-04 14:51:32 +00:00
cybermaggedon
0b83c08ae4
Use model in Azure LLM integration (#657) 2026-03-04 12:06:06 +00:00
cybermaggedon
e19ea8667d
Tool services tech spec (#656) 2026-02-28 14:46:13 +00:00
cybermaggedon
4d31cd4c03
Agent explainability tech specs (#655)
* Query time provenance tech spec

* Extraction provenance placeholder
2026-02-28 14:44:18 +00:00
cybermaggedon
88fe8468bc
Update CI for 2.1 release (#653) 2026-02-28 11:10:11 +00:00
cybermaggedon
b915602635
Merge master into release/v2.1 (#652)
* Just to align the README, making future merge easier

---------

Co-authored-by: Jack Colquitt <126733989+JackColquitt@users.noreply.github.com>
2026-02-28 11:07:03 +00:00
cybermaggedon
b9d7bf9a8b
Merge 2.0 to master (#651) 2026-02-28 11:03:14 +00:00
Jack Colquitt
3666ece2c5
Fix formatting in README.md for RAG pipelines 2026-02-24 18:15:00 -08:00
Jack Colquitt
da7f7adb8d
Update retrieval policies description in README 2026-02-24 16:08:47 -08:00
Jack Colquitt
e520a67442
Update link to Context Cores in README 2026-02-24 16:07:48 -08:00
Jack Colquitt
325c2acd32
Revise README for clarity and updated features
Updated README to enhance clarity and focus on the context backend for AI agents. Removed outdated features and added details about Context Cores.
2026-02-24 16:05:57 -08:00
cybermaggedon
6d8da748d7
Fix mismatching ge-query / graph-embeddings-query service idents (#648) 2026-02-24 12:17:29 +00:00
cybermaggedon
7d2d59a80f
Fix/tests (#647) 2026-02-23 22:01:47 +00:00
cybermaggedon
4bbc6d844f
Row embeddings APIs exposed (#646)
* Added row embeddings API and CLI support

* Updated protocol specs

* Row embeddings agent tool

* Add new agent tool to CLI
2026-02-23 21:52:56 +00:00
cybermaggedon
1809c1f56d
Structured data 2 (#645)
* Structured data refactor - multi-index tables, remove need for manual mods to the Cassandra tables

* Tech spec updated to track implementation
2026-02-23 15:56:29 +00:00
cybermaggedon
5ffad92345
Fix subscriber unexpected message causing queue clogging (#642)
2026-02-23 14:34:05 +00:00
cybermaggedon
0116eb3dea
Fix Google AI Studio (#641) 2026-02-20 10:27:47 +00:00
cybermaggedon
08063a5ee9
Remove unused deps (#640)
* Removed the Google GenAI hard-coded install
2026-02-20 10:13:44 +00:00
cybermaggedon
2d8dbf4cdb
Move GAIStudio to vertexai package to simplify deps (#639) 2026-02-20 08:46:29 +00:00
cybermaggedon
769c56bbea
Use ClientError & code to determine 429 error (#638) 2026-02-20 08:00:07 +00:00
cybermaggedon
b2e768c309
Fixing Uri import error (#636) 2026-02-16 19:18:40 +00:00
cybermaggedon
89b69fdb08
Fix weird Ontology URI issue (#637) 2026-02-16 19:18:29 +00:00
cybermaggedon
d886358be6
Entity & triple batch size limits (#635)
* Entities and triples are emitted in batches, with a batch size limit to
avoid overloading downstream consumers.

* Update tests
2026-02-16 17:38:03 +00:00
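The batch size limit described in the commit above amounts to splitting an output stream into bounded lists. A minimal sketch, assuming a generic helper (the name `batched` and the limit of 2 are illustrative, not taken from the codebase):

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], limit: int) -> Iterator[List[T]]:
    """Yield lists of at most `limit` items; an empty input yields nothing."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == limit:
            yield batch
            batch = []
    if batch:
        # Emit the final, partially filled batch.
        yield batch


# Five hypothetical triples emitted with a batch limit of 2.
triples = [(f"s{i}", "p", "o") for i in range(5)]
batches = list(batched(triples, 2))
```

Each yielded list would be sent as one message, so no single message carries more than `limit` entities or triples.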
cybermaggedon
fe389354f6
Fix d/g attribute error (#634) 2026-02-16 13:34:08 +00:00
cybermaggedon
00c1ca681b
Entity-centric graph (#633)
* Tech spec for new entity-centric graph schema

* Graph implementation
2026-02-16 13:26:43 +00:00
Jack Colquitt
9f06161616
Fix formatting of deployment instructions in README 2026-02-11 20:07:15 -08:00
Jack Colquitt
6fd708d7cd
Revise README sections for Getting Started and Tech Stack
Updated section titles and links in the README for clarity.
2026-02-11 20:04:50 -08:00
Jack Colquitt
9c89a512c7
Refine README content for clarity and consistency
Updated formatting and capitalization for consistency in the README.
2026-02-11 20:00:18 -08:00
Jack Colquitt
e18d2bd855
Revise README for Docker deployment and remove section
Updated README to include Docker deployment instructions and removed Configuration Builder section.
2026-02-11 19:54:56 -08:00
Jack Colquitt
103b3c31e5
Enhance README with quickstart and data streaming info
Added quickstart section and updated control plane summary.
2026-02-11 19:43:29 -08:00
cybermaggedon
f24f1ebd80
Migrate VertexAI to google-genai SDK from deprecated library (#632)
* Migrate VertexAI to google-genai SDK from deprecated library

* Fix tests, mock the correct API
2026-02-09 20:43:33 +00:00
cybermaggedon
2781c7d87c
Fix LLM metrics (#631)
* Fix mistral metrics

* Fix to other models
2026-02-09 19:35:42 +00:00
cybermaggedon
4fca97d555
Output the entity term as well as its definition as entity contexts (#629) 2026-02-09 15:18:05 +00:00
cybermaggedon
8574861196
Protect null embeddings - v2.0 (#627)
* Don't emit graph embeddings if there aren't any.

* Don't store graph embeddings in a knowledge store if there's an empty list.

* Translate Cassandra's 'null', which represents an empty list, back to the
  empty list that the surrounding code expects (and stored in the first
  place).

* Avoid emitting empty embedding lists

* Avoid outputting empty triple lists

* Fix tests
2026-02-09 14:57:36 +00:00
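The null handling described in the commit above can be sketched with two small helpers. Both function names are hypothetical; the sketch only shows the two rules stated in the message: a null read back from the store is treated as an empty list, and empty lists are not emitted.

```python
from typing import List, Optional, Sequence


def from_store(value: Optional[Sequence[float]]) -> List[float]:
    """Map a null read back from the store to the empty list callers expect."""
    return [] if value is None else list(value)


def should_emit(embeddings: Optional[Sequence]) -> bool:
    """Emit only when there is at least one embedding to send."""
    return bool(embeddings)


restored = from_store(None)         # stored as an empty list, read back as null
vector = from_store((0.1, 0.2))     # non-empty values pass through as a list
```

Normalizing at the storage boundary keeps the rest of the code free of `None` checks.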
Jack Colquitt
377be2e6df
Fix header capitalization in README.md 2026-02-07 19:34:32 -08:00
Jack Colquitt
0d4236aef6
Revise README with new title and feature descriptions
Updated the README to reflect new branding and clarify the features of TrustGraph.
2026-02-07 15:21:25 -08:00
Jack Colquitt
f13345c567
Revise README for context and terminology updates
Updated the README to reflect changes in terminology and improve clarity.
2026-02-05 18:39:07 -08:00
cybermaggedon
98827e5561
Fix version needing updating in pipelines (#625) 2026-02-04 14:12:01 +00:00
cybermaggedon
6bf08c3ace
Feature/more cli diags (#624)
* Added CLI tools tg-invoke-graph-embeddings, tg-invoke-document-embeddings,
and tg-invoke-embeddings, mainly useful for diagnostics.

* Fix tg-load-knowledge
2026-02-04 14:10:30 +00:00
Jack Colquitt
f20a1d3fe1
Update project title from 'Factory' to 'Engine' 2026-01-28 12:28:53 -08:00
cybermaggedon
23cc4dfdd1
Fix: version needed updating in pipelines (#623) 2026-01-27 15:42:01 +00:00
cybermaggedon
cf0daedefa
Changed schema for Value -> Term, majorly breaking change (#622)
* Changed schema for Value -> Term, majorly breaking change

* Propagated the Value -> Term schema change through all processing

* Updated Cassandra for g, p, s, o index patterns (7 indexes)

* Reviewed and updated all tests

* Neo4j, Memgraph and FalkorDB remain broken; will revisit once things have settled down
2026-01-27 13:48:08 +00:00
cybermaggedon
e061f2c633
Graph contexts tech spec (#621) 2026-01-26 22:41:00 +00:00
cybermaggedon
e214eb4e02
Feature/prompts jsonl (#619)
* Tech spec

* JSONL implementation complete

* Updated prompt client users

* Fix tests
2026-01-26 17:38:00 +00:00