Commit graph

1113 commits

Author SHA1 Message Date
cybermaggedon
4fa7cc7d7c
Fix/embeddings integration 2 (#670) 2026-03-08 19:42:26 +00:00
cybermaggedon
919b760c05
Update embeddings integration for new batch embeddings interfaces (#669)
* Fix vector extraction

* Fix embeddings integration
2026-03-08 19:41:52 +00:00
cybermaggedon
0a2ce47a88
Batch embeddings (#668)
Base Service (trustgraph-base/trustgraph/base/embeddings_service.py):
- Changed on_request to use request.texts

FastEmbed Processor
(trustgraph-flow/trustgraph/embeddings/fastembed/processor.py):
- on_embeddings(texts, model=None) now processes full batch efficiently
- Returns [[v.tolist()] for v in vecs] - list of vector sets

Ollama Processor (trustgraph-flow/trustgraph/embeddings/ollama/processor.py):
- on_embeddings(texts, model=None) passes list directly to Ollama
- Returns [[embedding] for embedding in embeds.embeddings]

EmbeddingsClient (trustgraph-base/trustgraph/base/embeddings_client.py):
- embed(texts, timeout=300) accepts list of texts

Tests Updated:
- test_fastembed_dynamic_model.py - 4 tests updated for new interface
- test_ollama_dynamic_model.py - 4 tests updated for new interface

Updated CLI, SDK and APIs
2026-03-08 18:36:54 +00:00
cybermaggedon
3bf8a65409
Fix tests (#666) 2026-03-07 23:38:09 +00:00
cybermaggedon
24bbe94136
Document chunks not stored in vector store (#665)
- Schema - ChunkEmbeddings now uses chunk_id: str instead of chunk: bytes
- Schema - DocumentEmbeddingsResponse now returns chunk_ids: list[str]
  instead of chunks
- Translators - Updated to serialize/deserialize chunk_id
- Clients - DocumentEmbeddingsClient.query() returns chunk_ids
- SDK/API - flow.py, socket_client.py, bulk_client.py updated
- Document embeddings service - Stores chunk_id (document ID) instead
  of chunk text
- Storage writers - Qdrant, Milvus, Pinecone store chunk_id in payload
- Query services - Return chunk_id from vector store searches
- Gateway dispatchers - Serialize chunk_id in API responses
- Document RAG - Added librarian client to fetch chunk content from
  Garage using chunk_ids
- CLI tools - Updated all three tools:
  - invoke_document_embeddings.py - displays chunk_ids, removed
    max_chunk_length
  - save_doc_embeds.py - exports chunk_id
  - load_doc_embeds.py - imports chunk_id
2026-03-07 23:10:45 +00:00
cybermaggedon
be358efe67
Fix tests (#663) 2026-03-06 12:40:02 +00:00
cybermaggedon
2b9232917c
Fix/extraction prov (#662)
Quoted triple fixes, including...

1. Updated triple_provenance_triples() in triples.py:
   - Now accepts a Triple object directly
   - Creates the reification triple using TRIPLE term type: stmt_uri tg:reifies
         <<extracted_triple>>
   - Includes it in the returned provenance triples
    
2. Updated definitions extractor:
   - Added imports for provenance functions and component version
   - Added ParameterSpec for optional llm-model and ontology flow parameters
   - For each definition triple, generates provenance with reification
    
3. Updated relationships extractor:
   - Same changes as definitions extractor
2026-03-06 12:23:58 +00:00
cybermaggedon
cd5580be59
Extract-time provenance (#661)
1. Shared Provenance Module - URI generators, namespace constants,
   triple builders, vocabulary bootstrap
2. Librarian - Emits document metadata to graph on processing
   initiation (vocabulary bootstrap + PROV-O triples)
3. PDF Extractor - Saves pages as child documents, emits parent-child
   provenance edges, forwards page IDs
4. Chunker - Saves chunks as child documents, emits provenance edges,
   forwards chunk ID + content
5. Knowledge Extractors (both definitions and relationships):
   - Link entities to chunks via SUBJECT_OF (not top-level document)
   - Removed duplicate metadata emission (now handled by librarian)
   - Get chunk_doc_id and chunk_uri from incoming Chunk message
6. Embedding Provenance:
   - EntityContext schema has chunk_id field
   - EntityEmbeddings schema has chunk_id field
   - Definitions extractor sets chunk_id when creating EntityContext
   - Graph embeddings processor passes chunk_id through to
     EntityEmbeddings

Provenance Flow:
Document → Page (PDF) → Chunk → Extracted Facts/Embeddings
    ↓           ↓          ↓              ↓
  librarian  librarian  librarian    (chunk_id reference)
  + graph    + graph    + graph

Each artifact is stored in librarian with parent-child linking, and PROV-O
edges are emitted to the knowledge graph for full traceability from any
extracted fact back to its source document.

Also, updating tests
2026-03-05 18:36:10 +00:00
cybermaggedon
d8f0a576af
Document API updates (#660)
* Doc streaming from librarian

* Fix chunk minimum confusion

* Add CLI args
2026-03-05 15:20:45 +00:00
cybermaggedon
a630e143ef
Incremental / large document loading (#659)
Tech spec

BlobStore (trustgraph-flow/trustgraph/librarian/blob_store.py):
- get_stream() - yields document content in chunks for streaming retrieval
- create_multipart_upload() - initializes S3 multipart upload, returns
  upload_id
- upload_part() - uploads a single part, returns etag
- complete_multipart_upload() - finalizes upload with part etags
- abort_multipart_upload() - cancels and cleans up

Cassandra schema (trustgraph-flow/trustgraph/tables/library.py):
- New upload_session table with 24-hour TTL
- Index on user for listing sessions
- Prepared statements for all operations
- Methods: create_upload_session(), get_upload_session(),
  update_upload_session_chunk(), delete_upload_session(),
  list_upload_sessions()

- Schema extended with UploadSession, UploadProgress, and new
  request/response fields
- Librarian methods: begin_upload, upload_chunk, complete_upload,
  abort_upload, get_upload_status, list_uploads
- Service routing for all new operations
- Python SDK with transparent chunked upload:
  - add_document() auto-switches to chunked for files > 10MB
  - Progress callback support (on_progress)
  - get_pending_uploads(), get_upload_status(), abort_upload(),
    resume_upload()

- Document table: Added parent_id and document_type columns with index
- Document schema (knowledge/document.py): Added document_id field for
  streaming retrieval
- Librarian operations:
  - add-child-document for extracted PDF pages
  - list-children to get child documents
  - stream-document for chunked content retrieval
  - Cascade delete removes children when parent is deleted
  - list-documents filters children by default
- PDF decoder (decoding/pdf/pdf_decoder.py): Updated to stream large
  documents from librarian API to temp file
- Librarian service (librarian/service.py): Sends document_id instead of
  content for large PDFs (>2MB)
- Deprecated tools (load_pdf.py, load_text.py): Added deprecation
  warnings directing users to tg-add-library-document +
  tg-start-library-processing

Remove load_pdf and load_text utils

Move chunker/librarian comms to base class

Updating tests
2026-03-04 16:57:58 +00:00
cybermaggedon
a38ca9474f
Tool services - dynamically pluggable tool implementations for agent frameworks (#658)
* New schema

* Tool service implementation

* Base class

* Joke service, for testing

* Update unit tests for tool services
2026-03-04 14:51:32 +00:00
cybermaggedon
0b83c08ae4
Use model in Azure LLM integration (#657) 2026-03-04 12:06:06 +00:00
cybermaggedon
e19ea8667d
Tool services tech spec (#656) 2026-02-28 14:46:13 +00:00
cybermaggedon
4d31cd4c03
Agent explainability tech specs (#655)
* Query time provenance tech spec

* Extraction provenance placeholder
2026-02-28 14:44:18 +00:00
cybermaggedon
88fe8468bc
Update CI for 2.1 release (#653) 2026-02-28 11:10:11 +00:00
cybermaggedon
b915602635
Merge master into release/v2.1 (#652)
* Just to align the README, making future merge easier

---------

Co-authored-by: Jack Colquitt <126733989+JackColquitt@users.noreply.github.com>
2026-02-28 11:07:03 +00:00
cybermaggedon
6d8da748d7
Fix mismatching ge-query / graph-embeddings-query service idents (#648) 2026-02-24 12:17:29 +00:00
cybermaggedon
7d2d59a80f
Fix/tests (#647) 2026-02-23 22:01:47 +00:00
cybermaggedon
4bbc6d844f
Row embeddings APIs exposed (#646)
* Added row embeddings API and CLI support

* Updated protocol specs

* Row embeddings agent tool

* Add new agent tool to CLI
2026-02-23 21:52:56 +00:00
cybermaggedon
1809c1f56d
Structured data 2 (#645)
* Structured data refactor - multi-index tables, remove need for manual mods to the Cassandra tables

* Tech spec updated to track implementation
2026-02-23 15:56:29 +00:00
cybermaggedon
5ffad92345
Fix subscriber unexpected message causing queue clogging (#642)
queue clogging.
2026-02-23 14:34:05 +00:00
cybermaggedon
0116eb3dea
Fix Goog AI Studio (#641) 2026-02-20 10:27:47 +00:00
cybermaggedon
08063a5ee9
Remove unused deps (#640)
* Removed the Google GenAI hard-coded install
2026-02-20 10:13:44 +00:00
cybermaggedon
2d8dbf4cdb
Move GAIStudio to vertexai package to simplify deps (#639) 2026-02-20 08:46:29 +00:00
cybermaggedon
769c56bbea
Use ClientError & code to determine 429 error (#638) 2026-02-20 08:00:07 +00:00
cybermaggedon
b2e768c309
Fixing Uri import error (#636) 2026-02-16 19:18:40 +00:00
cybermaggedon
89b69fdb08
Fix weird Onttology URI issue (#637) 2026-02-16 19:18:29 +00:00
cybermaggedon
d886358be6
Entity & triple batch size limits (#635)
* Entities and triples are emitted in batches with a batch limit to manage
overloading downstream.

* Update tests
2026-02-16 17:38:03 +00:00
cybermaggedon
fe389354f6
Fix d/g attribute error (#634) 2026-02-16 13:34:08 +00:00
cybermaggedon
00c1ca681b
Entity-centric graph (#633)
* Tech spec for new entity-centric graph schema

* Graph implementation
2026-02-16 13:26:43 +00:00
cybermaggedon
f24f1ebd80
Migrate to VertexAI to google-genai SDK from deprecated library (#632)
* Migrate to VertexAI to google-genai SDK from deprecated library

* Fix tests, mock the correct API
2026-02-09 20:43:33 +00:00
cybermaggedon
2781c7d87c
Fix LLM metrics (#631)
* Fix mistral metrics

* Fix to other models
2026-02-09 19:35:42 +00:00
cybermaggedon
4fca97d555
Output the entity term as well as its definition as entity contexts (#629) 2026-02-09 15:18:05 +00:00
cybermaggedon
8574861196
Protect null embeddings - v2.0 (#627)
* Don't emit graph embeddings if there aren't any.

* Don't store graph embeddings in a knowledge store if there's an empty list.

* Translate between Cassandra's 'null' representing an empty list and an
  empty list which is what the surrounding code wants (and stored in the
  first place).

* Avoid emitting empty embedding lists

* Avoid output empty triple lists

* Fix tests
2026-02-09 14:57:36 +00:00
cybermaggedon
98827e5561
Fix version needing updating in pipelines (#625) 2026-02-04 14:12:01 +00:00
cybermaggedon
6bf08c3ace
Feature/more cli diags (#624)
* CLI tools for tg-invoke-graph-embeddings, tg-invoke-document-embeddings,
and tg-invoke-embeddings.  Just useful for diagnostics.

* Fix tg-load-knowledge
2026-02-04 14:10:30 +00:00
cybermaggedon
23cc4dfdd1
Fix: version needed updating in pipelines (#623) 2026-01-27 15:42:01 +00:00
cybermaggedon
cf0daedefa
Changed schema for Value -> Term, majorly breaking change (#622)
* Changed schema for Value -> Term, majorly breaking change

* Following the schema change, Value -> Term into all processing

* Updated Cassandra for g, p, s, o index patterns (7 indexes)

* Reviewed and updated all tests

* Neo4j, Memgraph and FalkorDB remain broken, will look at once settled down
2026-01-27 13:48:08 +00:00
cybermaggedon
e061f2c633
Graph contexts tech spec (#621) 2026-01-26 22:41:00 +00:00
cybermaggedon
e214eb4e02
Feature/prompts jsonl (#619)
* Tech spec

* JSONL implementation complete

* Updated prompt client users

* Fix tests
2026-01-26 17:38:00 +00:00
cybermaggedon
e4f0013841
Open 1.9 branch (#620) 2026-01-26 17:36:25 +00:00
cybermaggedon
11f41b07ab
Get neo4j to use limit (#618)
* Get neo4j to use limit

* Fix tests - they we exact matching on query strings
2026-01-22 15:16:34 +00:00
cybermaggedon
58c00149a7
Normalise URLs so that / suffix is optional (#617) 2026-01-19 13:34:33 +00:00
cybermaggedon
83ea15dae7
Fixed flows/flow key issue in config (#616) 2026-01-16 00:10:44 +00:00
cybermaggedon
1c006d5b14
Python API docs (#614)
* Python API docs working

* Python API doc generation
2026-01-15 15:12:32 +00:00
cybermaggedon
8a17375603
Add AsyncAPI spec for websocket (#613)
* AsyncAPI for websocket docs

* Delete old docs

* Update docs/README.md to point to docs site

* Add generated API docs
2026-01-15 11:57:16 +00:00
cybermaggedon
fce43ae035
REST API OpenAPI spec (#612)
* OpenAPI spec in specs/api.  Checked lint with redoc.
2026-01-15 11:04:37 +00:00
cybermaggedon
62b754d788
Fix flow loading (#611) 2026-01-14 16:23:15 +00:00
cybermaggedon
387afee7b7
Fix load-doc (#610) 2026-01-14 15:46:29 +00:00
cybermaggedon
b08db761d7
Fix config inconsistency (#609)
* Plural/singular confusion in config key

* Flow class vs flow blueprint nomenclature change

* Update docs & CLI to reflect the above
2026-01-14 12:31:40 +00:00