Commit graph

119 commits

Author SHA1 Message Date
cybermaggedon
d9dc4cbab5
SPARQL query service (#754)
SPARQL 1.1 query service wrapping pub/sub triples interface

Add a backend-agnostic SPARQL query service that parses SPARQL
queries using rdflib, decomposes them into triple pattern lookups
via the existing TriplesClient pub/sub interface, and performs
in-memory joins, filters, and projections.
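
The decomposition described above can be sketched as follows. This is a minimal illustration, not the service's code: an in-memory list of triples stands in for the TriplesClient lookups, and the names `match_pattern` and `join` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Binding = Dict[str, str]       # variable name -> bound term
Triple = Tuple[str, str, str]  # (subject, predicate, object)


def match_pattern(pattern: Triple, triples: Iterable[Triple]) -> List[Binding]:
    """Return one binding per triple that matches the pattern.

    Terms starting with '?' are variables; anything else must match exactly.
    In the real service this lookup would go over pub/sub (assumption: here
    it is a plain scan of a list).
    """
    solutions: List[Binding] = []
    for triple in triples:
        binding: Binding = {}
        matched = True
        for want, got in zip(pattern, triple):
            if want.startswith("?"):
                # A variable repeated inside one pattern must bind consistently.
                if binding.setdefault(want, got) != got:
                    matched = False
                    break
            elif want != got:
                matched = False
                break
        if matched:
            solutions.append(binding)
    return solutions


def join(left: List[Binding], right: List[Binding]) -> List[Binding]:
    """Nested-loop join: keep pairs that agree on every shared variable."""
    joined: List[Binding] = []
    for l in left:
        for r in right:
            if all(l[v] == r[v] for v in l.keys() & r.keys()):
                joined.append({**l, **r})
    return joined


triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "name", "Bob"),
]

# In spirit: SELECT ?x ?n WHERE { alice knows ?x . ?x name ?n }
step1 = match_pattern(("alice", "knows", "?x"), triples)
step2 = match_pattern(("?x", "name", "?n"), triples)
result = join(step1, step2)
print(result)
```

Each triple pattern becomes one lookup, and the joins, filters and projections then run over the returned bindings in memory, which is what keeps the approach backend-agnostic.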

Includes:
- SPARQL parser, algebra evaluator, expression evaluator, solution
  sequence operations (BGP, JOIN, OPTIONAL, UNION, FILTER, BIND,
  VALUES, GROUP BY, ORDER BY, LIMIT/OFFSET, DISTINCT, aggregates)
- FlowProcessor service with TriplesClientSpec
- Gateway dispatcher, request/response translators, API spec
- Python SDK method (FlowInstance.sparql_query)
- CLI command (tg-invoke-sparql-query)
- Tech spec (docs/tech-specs/sparql-query.md)

New unit tests for SPARQL query
2026-04-02 17:21:39 +01:00
cybermaggedon
24f0190ce7
RabbitMQ pub/sub backend with topic exchange architecture (#752)
Adds a RabbitMQ backend as an alternative to Pulsar, selectable via
PUBSUB_BACKEND=rabbitmq. Both backends implement the same PubSubBackend
protocol — no application code changes needed to switch.

RabbitMQ topology:
- Single topic exchange per topicspace (e.g. 'tg')
- Routing key derived from queue class and topic name
- Shared consumers: named queue bound to exchange (competing, round-robin)
- Exclusive consumers: anonymous auto-delete queue (broadcast, each gets
  every message). Used by Subscriber and config push consumer.
- Thread-local producer connections (pika is not thread-safe)
- Push-based consumption via basic_consume with process_data_events
  for heartbeat processing
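
The difference between shared and exclusive consumers can be illustrated without a broker. The sketch below is a pure-Python simulation under simplifying assumptions (strict round-robin, no prefetch or redelivery); the function names are hypothetical and no pika calls are involved.

```python
from itertools import cycle
from typing import Dict, List


def deliver_shared(messages: List[str], consumers: List[str]) -> Dict[str, List[str]]:
    """Competing consumers on one named queue: each message reaches exactly one consumer."""
    received: Dict[str, List[str]] = {name: [] for name in consumers}
    for message, name in zip(messages, cycle(consumers)):
        received[name].append(message)
    return received


def deliver_exclusive(messages: List[str], consumers: List[str]) -> Dict[str, List[str]]:
    """One private queue per consumer, all bound to the exchange: everyone sees everything."""
    return {name: list(messages) for name in consumers}


messages = ["m1", "m2", "m3", "m4"]
shared = deliver_shared(messages, ["worker-a", "worker-b"])
exclusive = deliver_exclusive(messages, ["worker-a", "worker-b"])
print(shared)
print(exclusive)
```

Broadcast delivery is what a config push needs, since every processor instance must see every update, whereas work queues want each message handled once.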

Consumer model changes:
- Consumer class creates one backend consumer per concurrent task
  (required for pika thread safety, harmless for Pulsar)
- Consumer class accepts consumer_type parameter
- Subscriber passes consumer_type='exclusive' for broadcast semantics
- Config push consumer uses consumer_type='exclusive' so every
  processor instance receives config updates
- handle_one_from_queue receives consumer as parameter for correct
  per-connection ack/nack

LibrarianClient:
- New shared client class replacing duplicated librarian request-response
  code across 6+ services (chunking, decoders, RAG, etc.)
- Uses stream-document instead of get-document-content for fetching
  document content in 1MB chunks (avoids broker message size limits)
- Standalone object (self.librarian = LibrarianClient(...)) not a mixin
- get-document-content marked deprecated in schema and OpenAPI spec
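
Fetching content in fixed-size chunks, as stream-document is described as doing, might look like the sketch below. The generator name and the in-memory document are assumptions for illustration; the real client exchanges chunks over pub/sub.

```python
from typing import Iterator

CHUNK_SIZE = 1024 * 1024  # 1 MB, the chunk size named in the commit message


def stream_chunks(content: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield successive slices of at most chunk_size bytes."""
    for offset in range(0, len(content), chunk_size):
        yield content[offset:offset + chunk_size]


document = bytes(2_500_000)           # stand-in for a stored document
chunks = list(stream_chunks(document))
reassembled = b"".join(chunks)        # what the receiving side would do
print([len(c) for c in chunks])
```

No single message exceeds the chunk size, so document size no longer depends on the broker's per-message limit.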

Serialisation:
- Extracted dataclass_to_dict/dict_to_dataclass to shared
  serialization.py (used by both Pulsar and RabbitMQ backends)
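
A dataclass round trip of the kind these helpers perform can be sketched as follows. This is not the code in serialization.py: `to_dict`, `from_dict` and the two example classes are hypothetical, and only nested dataclass fields are handled.

```python
from dataclasses import asdict, dataclass, fields, is_dataclass
from typing import Any, Dict, Type, TypeVar, get_type_hints

T = TypeVar("T")


def to_dict(obj: Any) -> Dict[str, Any]:
    """Convert a dataclass instance, including nested dataclasses, to plain dicts."""
    return asdict(obj)


def from_dict(cls: Type[T], data: Dict[str, Any]) -> T:
    """Rebuild a dataclass from a dict, recursing into nested dataclass fields."""
    hints = get_type_hints(cls)
    kwargs: Dict[str, Any] = {}
    for field in fields(cls):
        if field.name not in data:
            continue  # leave the field to its default, if it has one
        value = data[field.name]
        field_type = hints[field.name]
        if is_dataclass(field_type) and isinstance(value, dict):
            value = from_dict(field_type, value)
        kwargs[field.name] = value
    return cls(**kwargs)


@dataclass
class Metadata:  # hypothetical message part
    id: str
    user: str = "default"


@dataclass
class Chunk:  # hypothetical message
    metadata: Metadata
    text: str


original = Chunk(metadata=Metadata(id="doc-1"), text="hello")
wire = to_dict(original)          # JSON-friendly form a backend could send
restored = from_dict(Chunk, wire)
print(wire)
```

Keeping the conversion outside any one backend lets both transports share a single wire representation.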

Librarian queues:
- Changed from flow class (persistent) back to request/response class
  now that stream-document eliminates large single messages
- API upload chunk size reduced from 5MB to 3MB to stay under broker
  limits after base64 encoding
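
The effect of base64 on chunk size is easy to check: every 3 input bytes become 4 output characters. The sketch below assumes "MB" means 1024 * 1024 bytes and makes no claim about the actual broker limit.

```python
import base64

MIB = 1024 * 1024  # assumption: the commit's "MB" is a binary megabyte


def base64_size(raw_bytes: int) -> int:
    """Length of standard padded base64 output for raw_bytes of input."""
    return 4 * ((raw_bytes + 2) // 3)


size_3mb = base64_size(3 * MIB)
size_5mb = base64_size(5 * MIB)

# Cross-check the formula against the real encoder on a small input.
assert len(base64.b64encode(bytes(1000))) == base64_size(1000)

print(size_3mb / MIB)
print(size_5mb / MIB)
```

A 3MB chunk encodes to exactly 4MB, while a 5MB chunk grows to roughly 6.7MB, before any message envelope is added.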

Factory and CLI:
- get_pubsub() handles 'rabbitmq' backend with RabbitMQ connection params
- add_pubsub_args() includes RabbitMQ options (host, port, credentials)
- add_pubsub_args(standalone=True) defaults to localhost for CLI tools
- init_trustgraph skips Pulsar admin setup for non-Pulsar backends
- tg-dump-queues and tg-monitor-prompts use backend abstraction
- BaseClient and ConfigClient accept generic pubsub config
2026-04-02 12:47:16 +01:00
cybermaggedon
e65ea217a2
agent-orchestrator improvements (#743)
agent-orchestrator improvements:
- Improve agent trace
- Improve queue dumping
- Fixing supervisor pattern
- Fix synthesis step to remove loop

Minor dev environment improvements:
- Improve queue dump output for JSON
- Reduce dev container rebuild
2026-03-31 11:24:30 +01:00
cybermaggedon
5c6fe90fe2
Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.

New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
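
Two of the grouping strategies named above, heading and count, can be sketched like this. The element tuples and function names are hypothetical stand-ins; the type labels are modelled on the kind of categories the unstructured library reports, and the decoder's real logic may differ.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (element type, text); a stand-in for parsed elements


def group_by_heading(elements: List[Element]) -> List[List[Element]]:
    """Start a new section at every 'Title' element."""
    sections: List[List[Element]] = []
    for element in elements:
        if element[0] == "Title" or not sections:
            sections.append([])
        sections[-1].append(element)
    return sections


def group_by_count(elements: List[Element], count: int) -> List[List[Element]]:
    """Put a fixed number of elements in each section."""
    return [elements[i:i + count] for i in range(0, len(elements), count)]


elements = [
    ("Title", "Introduction"),
    ("NarrativeText", "First paragraph."),
    ("Title", "Method"),
    ("NarrativeText", "Second paragraph."),
    ("Table", "<table>...</table>"),
]

by_heading = group_by_heading(elements)
by_count = group_by_count(elements, 2)
print([len(s) for s in by_heading])
print([len(s) for s in by_count])
```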

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable.  PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.

Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.
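
The simplified routing rule amounts to a single comparison. In this sketch the function name is hypothetical and stripping MIME parameters is an added assumption, not something the commit states.

```python
def route_document(mime_type: str) -> str:
    """Pick the loader for a document, following the rule in the commit message."""
    # Assumption: parameters such as '; charset=utf-8' are ignored.
    base_type = mime_type.split(";", 1)[0].strip().lower()
    return "text-load" if base_type == "text/plain" else "document-load"


routes = {m: route_document(m) for m in ["text/plain", "application/pdf", "text/html"]}
print(routes)
```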

New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.

Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00
cybermaggedon
1809c1f56d
Structured data 2 (#645)
* Structured data refactor - multi-index tables, remove need for manual mods to the Cassandra tables

* Tech spec updated to track implementation
2026-02-23 15:56:29 +00:00
cybermaggedon
cf0daedefa
Changed schema for Value -> Term, majorly breaking change (#622)
* Changed schema for Value -> Term, majorly breaking change

* Following the schema change, Value -> Term into all processing

* Updated Cassandra for g, p, s, o index patterns (7 indexes)

* Reviewed and updated all tests

* Neo4j, Memgraph and FalkorDB remain broken, will look at once settled down
2026-01-27 13:48:08 +00:00
cybermaggedon
9a34ab1b93
Complete remaining parameter work (#530)
* Fix CLI typo

* Complete flow parameters work, still needs implementation in LLMs
2025-09-24 13:58:34 +01:00
cybermaggedon
f6bccd7438
Parallel container builds (#515) 2025-09-11 12:32:04 +01:00
cybermaggedon
6c681967ab
Validate librarian collection (#453) 2025-08-07 21:36:24 +01:00
cybermaggedon
98022d6af4
Migrate from setup.py to pyproject.toml (#440)
* Converted setup.py to pyproject.toml

* Modern package infrastructure as recommended by py docs
2025-07-23 21:22:08 +01:00
cybermaggedon
ac977d18f4
Add MCP container push (#425) 2025-07-03 17:00:59 +01:00
cybermaggedon
f907ea7db8
PoC MCP server (#419)
* Very initial MCP server PoC for TrustGraph

* Put service on port 8000

* Add MCP container and packages to buildout
2025-07-02 18:19:23 +01:00
cybermaggedon
4461d7b289
Feature/persist config (#370)
* Cassandra tables for config

* Config is backed by Cassandra
2025-05-07 12:58:32 +01:00
cybermaggedon
d0da122bed
Fix/llms (#366)
* Fix LMStudio, cache documents with tg-load-sample-documents

* Fix Mistral
2025-05-06 16:17:16 +01:00
cybermaggedon
a9197d11ee
Feature/configure flows (#345)
- Keeps processing in different flows separate so that data can go to different stores / collections etc.
- Potentially supports different processing flows
- Tidies the processing API with common base-classes for e.g. LLMs, and automatic configuration of 'clients' to use the right queue names in a flow
2025-04-22 20:21:38 +01:00
cybermaggedon
1d222235d3
Configuration initialisation (#335)
* - Fixed error reporting in config
- Updated tg-init-pulsar to be able to load initial config to config-svc
- Tweaked API naming and added more config calls

* Tools to dump out prompts and agent tools
2025-04-02 13:52:33 +01:00
cybermaggedon
c759d55734
Added module which does OCR for PDF, pdf-ocr in a separate package (#324)
(has a lot of dependencies).  Uses Tesseract.
2025-03-20 09:29:40 +00:00
JackColquitt
5f5cf8fd07 Added basic Mistral API support 2025-03-14 17:47:59 -07:00
cybermaggedon
edcdc4d59d
Feature/separate containers (#287)
* Separate containerfiles

* Add push to Makefile

* Update image names in the templates
2025-01-28 19:36:05 +00:00
cybermaggedon
dc2b599fda
Fix/release broken (#257)
* Break release into 3 jobs

* Replace Github action with podman command
2025-01-06 21:45:42 +00:00
Avi Avni
1ab2a7ff6b
wip integrate falkordb (#211) 2024-12-18 21:01:24 +00:00
cybermaggedon
7df7843dad
Main/remove parquet (#195)
* Remove Parquet code, and package build
2024-12-06 08:51:10 +00:00
Cyber MacGeddon
43756d872b Set dependencies up for the 0.13 branch. Set version=0.0.0 in Makefile
to spot build errors.
2024-10-15 00:31:08 +01:00
Cyber MacGeddon
5850b6c136 Merge branch 'release/v0.11' into release/v0.12 2024-10-09 19:38:13 +01:00
cybermaggedon
a711bc1dde
Fix trustgraph broken linkage (#109) 2024-10-08 20:33:14 +01:00
cybermaggedon
148092a6af
Fix/lock 0.11 version (#108)
* - Locked 0.11 packages to 0.11 deps
- Added 'trustgraph' uber-package which installs the rest
- Added dependency to set package versions before building packages

* Bump version
2024-10-04 22:12:39 +01:00
cybermaggedon
dda29bb663
Workflows (#105)
* Some basic structure for workflows
* Add PyPI publication for 0.12
* Bump version
* Test bundle generation
* Install jsonnet
* Use release action to automate release creation
2024-10-04 17:28:07 +01:00
cybermaggedon
222dc9982c
Feature/azure openai templates (#104)
* Azure OpenAI LLM templates
* Bump version, fix package versions
* Add azure-openai to template generation
2024-10-04 15:47:46 +01:00
Cyber MacGeddon
adba99f270 Bump version 2024-10-02 22:25:48 +01:00
cybermaggedon
db9ed06b1c
Fix/broken kg extract topics (#97)
* Add missing kg-extract-topics service

* Bump version
2024-10-02 22:23:00 +01:00
cybermaggedon
14672f7f0e
Fix/processor state specify prom (#93)
* Provide a means to specify the Prometheus server with -p
* Bump version
2024-10-01 22:14:28 +01:00
Cyber MacGeddon
2e6be5cdce Bump version 2024-10-01 21:06:07 +01:00
cybermaggedon
56a9ac3ba9
Change LLM latency dashboard to be rate & bump version (#92) 2024-10-01 21:04:55 +01:00
cybermaggedon
ef1b8b5a13
Feature/metering dashboard (#89)
* Bump version

* Added Prom metrics to metering, added dashboard

* Update YAMLs

* Add $ on axis

* Tweak dashboard
2024-10-01 06:46:41 +01:00
cybermaggedon
88a7dfa126
Maint/rename pkg (#88)
* Rename trustgraph-utils -> trustgraph-cli
* Update YAMLs
2024-09-30 22:20:26 +01:00
cybermaggedon
771d9fc2c7
Make util pathnames have tg- prefix (#87) 2024-09-30 21:24:22 +01:00
cybermaggedon
0e4c9c69ee
Add twine upload target (#86) 2024-09-30 21:07:18 +01:00
cybermaggedon
c26ada08c2
Fix VertexAI package. Add Python packaging to Makefile. (#85)
Bump version & generate templates.
2024-09-30 20:50:20 +01:00
cybermaggedon
f00baab1b8
Maint/fix build env (#84)
* Put README placeholders for packages in place
* Bump version
2024-09-30 19:47:09 +01:00
cybermaggedon
9b91d5eee3
Feature/pkgsplit (#83)
* Starting to spawn base package
* More package hacking
* Bedrock and VertexAI
* Parquet split
* Updated templates
* Utils
2024-09-30 19:36:09 +01:00
cybermaggedon
3fb75c617b
Maint/auto pkg versions (#82)
* Remove need to manage setup.py version

* Update YAMLs
2024-09-30 16:38:50 +01:00
cybermaggedon
cdace22ee4
Feature/simpler subpackages (#81)
* Back to simpler directory structure

* Bump version, update templates
2024-09-30 16:16:20 +01:00
cybermaggedon
f081933217
Feature/subpackages (#80)
* Renaming what will become the core package

* Tweaking to get package build working

* Fix metering merge

* Rename to core directory

* Bump version.  Use namespace searching for packaging trustgraph-core

* Change references to trustgraph-core

* Forming embeddings-hf package

* Reference modules in core package.

* Build both packages to one container, bump version

* Update YAMLs
2024-09-30 14:00:29 +01:00
cybermaggedon
14d79ef9f1
Streamline startup (#79)
* Separate Prom metrics, different processors as different jobs

* Create producers before consumers, may streamline startup.

* Bump version

* Add Pulsar init command, will replace pulsar-admin invocations.

* Integrate tg-init-pulsar with YAMLs

* Update YAMLs
2024-09-30 12:19:22 +01:00
Cyber MacGeddon
5e8a1520ee Bump version & update templates 2024-09-30 00:00:36 +01:00
cybermaggedon
74a14639bd
Feature/track processor state (#78)
* Add a Prom metric to consumers & consumer/producers to track the running
state.

* New script, gets processor state using prometheus

* Bump version, add tg-processor-state to package

* Update templates
2024-09-29 23:50:57 +01:00
cybermaggedon
efc364583b
Fix/graph rag uses wrong prompt (#77)
* Fix queue name invocation, use correct names, not defaults

* Bump version

* Update templates
2024-09-29 20:38:50 +01:00
Cyber MacGeddon
5951fb4e56 Bump version to 0.11.1 2024-09-29 18:15:34 +01:00
cybermaggedon
fa30544999
Fix/revert template change (#71)
* Ditched the deploy directory (going away in 0.11) and putting
YAML files in top-dir of Github (for now).

* Update Makefile for the template change
2024-09-29 18:13:34 +01:00
Cyber MacGeddon
e5249c2bac Bump version 2024-09-29 18:03:32 +01:00