Commit graph

122 commits

Author SHA1 Message Date
cybermaggedon
d35473f7f7
feat: workspace-based multi-tenancy, replacing user as tenancy axis (#840)
Introduces `workspace` as the isolation boundary for config, flows,
library, and knowledge data. Removes `user` as a schema-level field
throughout the code, API specs, and tests; workspace provides the
same separation more cleanly at the trusted flow.workspace layer
rather than through client-supplied message fields.

Design
------
- IAM tech spec (docs/tech-specs/iam.md) documents current state,
  proposed auth/access model, and migration direction.
- Data ownership model (docs/tech-specs/data-ownership-model.md)
  captures the workspace/collection/flow hierarchy.

Schema + messaging
------------------
- Drop `user` field from AgentRequest/Step, GraphRagQuery,
  DocumentRagQuery, Triples/Graph/Document/Row EmbeddingsRequest,
  Sparql/Rows/Structured QueryRequest, ToolServiceRequest.
- Keep collection/workspace routing via flow.workspace at the
  service layer.
- Translators updated to not serialise/deserialise user.

API specs
---------
- OpenAPI schemas and path examples cleaned of user fields.
- Websocket async-api messages updated.
- Removed the unused parameters/User.yaml.

Services + base
---------------
- Librarian, collection manager, knowledge, config: all operations
  scoped by workspace. Config client API takes workspace as first
  positional arg.
- `flow.workspace` is set at flow start time by the infrastructure;
  it is no longer passed through from clients.
- Tool service drops user-personalisation passthrough.

CLI + SDK
---------
- tg-init-workspace and workspace-aware import/export.
- All tg-* commands drop user args; accept --workspace.
- Python API/SDK (flow, socket_client, async_*, explainability,
  library) drop user kwargs from every method signature.

MCP server
----------
- All tool endpoints drop user parameters; socket_manager no longer
  keyed per user.

Flow service
------------
- Closure-based topic cleanup on flow stop: only delete topics
  whose blueprint template was parameterised AND no remaining
  live flow (across all workspaces) still resolves to that topic.
  Three scopes fall out naturally from template analysis:
    * {id} -> per-flow, deleted on stop
    * {blueprint} -> per-blueprint, kept while any flow of the
      same blueprint exists
    * {workspace} -> per-workspace, kept while any flow in the
      workspace exists
    * literal -> global, never deleted (e.g. tg.request.librarian)
  Fixes a bug where stopping a flow silently destroyed the global
  librarian exchange, wedging all library operations until manual
  restart.
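
The closure rule above can be illustrated with a minimal sketch. It assumes each flow is a plain dict and that templates use only the three placeholders named above; the function names and every topic template except tg.request.librarian are hypothetical, not the flow service's actual code.

```python
import re

# Placeholders named in the commit message above.
PLACEHOLDER = re.compile(r"\{(?:id|blueprint|workspace)\}")

def resolve(flow):
    """Map each concrete topic name of a flow to the template it came from."""
    return {
        template.format(
            id=flow["id"],
            blueprint=flow["blueprint"],
            workspace=flow["workspace"],
        ): template
        for template in flow["templates"]
    }

def topics_to_delete(stopping, remaining):
    """Topics of the stopping flow that no remaining live flow still uses."""
    still_used = set()
    for flow in remaining:
        still_used.update(resolve(flow))
    return sorted(
        topic
        for topic, template in resolve(stopping).items()
        # Literal (unparameterised) templates are global and never deleted.
        if PLACEHOLDER.search(template) and topic not in still_used
    )

# Hypothetical templates, one per scope, plus the global librarian topic.
TEMPLATES = [
    "tg.flow.{id}",
    "tg.blueprint.{blueprint}",
    "tg.workspace.{workspace}",
    "tg.request.librarian",
]
flow_a = {"id": "f1", "blueprint": "rag", "workspace": "w1", "templates": TEMPLATES}
flow_b = {"id": "f2", "blueprint": "rag", "workspace": "w2", "templates": TEMPLATES}

# flow_b shares the blueprint, so the per-blueprint topic survives.
print(topics_to_delete(flow_a, [flow_b]))
```

In this sketch no scope is classified explicitly: a topic is kept only because some other live flow still resolves to the same name, which is why the three scopes need no special-casing.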

RabbitMQ backend
----------------
- heartbeat=60, blocked_connection_timeout=300. Catches silently
  dead connections (broker restart, orphaned channels, network
  partitions) within ~2 heartbeat windows, so the consumer
  reconnects and re-binds its queue rather than sitting forever
  on a zombie connection.

Tests
-----
- Full test refresh: unit, integration, contract, provenance.
- Dropped user-field assertions and constructor kwargs across
  ~100 test files.
- Renamed user-collection isolation tests to workspace-collection.
2026-04-21 23:23:01 +01:00
Cyber MacGeddon
5e6c96bdd1 Separate platform builds & combine to single manifest 2026-04-13 23:14:37 +01:00
cybermaggedon
76c1752748
Use manifests to build for amd64 and arm64 (#798)
This adds ARM container builds to the CI pipeline
2026-04-13 20:51:09 +01:00
cybermaggedon
d9dc4cbab5
SPARQL query service (#754)
SPARQL 1.1 query service wrapping pub/sub triples interface

Add a backend-agnostic SPARQL query service that parses SPARQL
queries using rdflib, decomposes them into triple pattern lookups
via the existing TriplesClient pub/sub interface, and performs
in-memory joins, filters, and projections.
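
As a rough illustration of the decomposition described above, the sketch below evaluates a basic graph pattern by looking up one triple pattern at a time and joining the variable bindings in memory. It is not the service's code: the local TRIPLES list stands in for the pub/sub triples interface, parsing with rdflib is omitted, and all names are hypothetical.

```python
# Toy data standing in for the triple store behind the pub/sub interface.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "age", "30"),
]

def is_var(term):
    """Variables are written SPARQL-style with a leading '?'."""
    return term.startswith("?")

def match_pattern(pattern, binding):
    """Yield every extension of `binding` that satisfies one triple pattern."""
    for triple in TRIPLES:
        extended = dict(binding)
        for term, value in zip(pattern, triple):
            if is_var(term):
                # Bind the variable, or check it against its existing value.
                if extended.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield extended

def evaluate_bgp(patterns):
    """Join the patterns left to right, starting from one empty solution."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for binding in solutions
            for extended in match_pattern(pattern, binding)
        ]
    return solutions

# Who does ?a know, and who does that person know in turn?
print(evaluate_bgp([("?a", "knows", "?b"), ("?b", "knows", "?c")]))
```

Filters and projections would then operate on the resulting list of binding dicts.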

Includes:
- SPARQL parser, algebra evaluator, expression evaluator, solution
  sequence operations (BGP, JOIN, OPTIONAL, UNION, FILTER, BIND,
  VALUES, GROUP BY, ORDER BY, LIMIT/OFFSET, DISTINCT, aggregates)
- FlowProcessor service with TriplesClientSpec
- Gateway dispatcher, request/response translators, API spec
- Python SDK method (FlowInstance.sparql_query)
- CLI command (tg-invoke-sparql-query)
- Tech spec (docs/tech-specs/sparql-query.md)

New unit tests for SPARQL query
2026-04-02 17:21:39 +01:00
cybermaggedon
24f0190ce7
RabbitMQ pub/sub backend with topic exchange architecture (#752)
Adds a RabbitMQ backend as an alternative to Pulsar, selectable via
PUBSUB_BACKEND=rabbitmq. Both backends implement the same PubSubBackend
protocol — no application code changes needed to switch.

RabbitMQ topology:
- Single topic exchange per topicspace (e.g. 'tg')
- Routing key derived from queue class and topic name
- Shared consumers: named queue bound to exchange (competing, round-robin)
- Exclusive consumers: anonymous auto-delete queue (broadcast, each gets
  every message). Used by Subscriber and config push consumer.
- Thread-local producer connections (pika is not thread-safe)
- Push-based consumption via basic_consume with process_data_events
  for heartbeat processing
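
The difference between the two consumer kinds can be shown with an in-memory toy. This is a sketch only: it is not a RabbitMQ client, it ignores routing keys, and the class, the queue name "work", and the 'shared' type string are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import count

class ToyExchange:
    """In-memory stand-in for one exchange; every queue gets every message."""

    def __init__(self):
        self.queues = defaultdict(list)  # queue name -> list of consumer inboxes
        self.cursor = defaultdict(int)   # queue name -> round-robin position
        self._anon = count()

    def subscribe(self, consumer_type, name=None):
        """Attach a consumer and return its inbox (a plain list)."""
        if consumer_type == "exclusive":
            # Each exclusive consumer gets its own anonymous queue.
            name = f"anon-{next(self._anon)}"
        inbox = []
        self.queues[name].append(inbox)
        return inbox

    def publish(self, message):
        """Deliver one copy per queue, round-robin among that queue's consumers."""
        for name, inboxes in self.queues.items():
            inboxes[self.cursor[name] % len(inboxes)].append(message)
            self.cursor[name] += 1

exchange = ToyExchange()
worker_1 = exchange.subscribe("shared", "work")
worker_2 = exchange.subscribe("shared", "work")
listener_1 = exchange.subscribe("exclusive")
listener_2 = exchange.subscribe("exclusive")

for message in ["a", "b", "c", "d"]:
    exchange.publish(message)

print(worker_1, worker_2)      # the two workers split the messages
print(listener_1, listener_2)  # each listener sees all of them
```

Consumers sharing a named queue compete for messages, while each anonymous queue receives its own copy, which gives the broadcast behaviour described above.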

Consumer model changes:
- Consumer class creates one backend consumer per concurrent task
  (required for pika thread safety, harmless for Pulsar)
- Consumer class accepts consumer_type parameter
- Subscriber passes consumer_type='exclusive' for broadcast semantics
- Config push consumer uses consumer_type='exclusive' so every
  processor instance receives config updates
- handle_one_from_queue receives consumer as parameter for correct
  per-connection ack/nack

LibrarianClient:
- New shared client class replacing duplicated librarian request-response
  code across 6+ services (chunking, decoders, RAG, etc.)
- Uses stream-document instead of get-document-content for fetching
  document content in 1MB chunks (avoids broker message size limits)
- Standalone object (self.librarian = LibrarianClient(...)) not a mixin
- get-document-content marked deprecated in schema and OpenAPI spec

Serialisation:
- Extracted dataclass_to_dict/dict_to_dataclass to shared
  serialization.py (used by both Pulsar and RabbitMQ backends)

Librarian queues:
- Changed from flow class (persistent) back to request/response class
  now that stream-document eliminates large single messages
- API upload chunk size reduced from 5MB to 3MB to stay under broker
  limits after base64 encoding
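
The effect of base64 on chunk size can be checked with a few lines of arithmetic. This sketch assumes 1 MB means 1024 * 1024 bytes and standard padded base64; the broker's actual limit is not stated here, and the helper name is hypothetical.

```python
import base64

MB = 1024 * 1024

def encoded_size(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of `raw_bytes` bytes."""
    return 4 * ((raw_bytes + 2) // 3)

# Sanity check against the standard library on a small payload.
assert encoded_size(1000) == len(base64.b64encode(bytes(1000)))

for chunk_mb in (5, 3):
    size = encoded_size(chunk_mb * MB)
    print(f"{chunk_mb} MB raw -> {size / MB:.2f} MB encoded")
```

Base64 emits four output bytes for every three input bytes, so a 3MB chunk becomes exactly 4MB on the wire, while a 5MB chunk grows to roughly 6.67MB.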

Factory and CLI:
- get_pubsub() handles 'rabbitmq' backend with RabbitMQ connection params
- add_pubsub_args() includes RabbitMQ options (host, port, credentials)
- add_pubsub_args(standalone=True) defaults to localhost for CLI tools
- init_trustgraph skips Pulsar admin setup for non-Pulsar backends
- tg-dump-queues and tg-monitor-prompts use backend abstraction
- BaseClient and ConfigClient accept generic pubsub config
2026-04-02 12:47:16 +01:00
cybermaggedon
e65ea217a2
agent-orchestrator improvements (#743)
agent-orchestrator improvements:
- Improve agent trace
- Improve queue dumping
- Fixing supervisor pattern
- Fix synthesis step to remove loop

Minor dev environment improvements:
- Improve queue dump output for JSON
- Reduce dev container rebuild
2026-03-31 11:24:30 +01:00
cybermaggedon
5c6fe90fe2
Add universal document decoder with multi-format support (#705)
Add universal document decoder with multi-format support
using 'unstructured'.

New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.

All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable.  PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.

Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.

New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.

Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
2026-03-23 12:56:35 +00:00
cybermaggedon
1809c1f56d
Structured data 2 (#645)
* Structured data refactor - multi-index tables, remove need for manual mods to the Cassandra tables

* Tech spec updated to track implementation
2026-02-23 15:56:29 +00:00
cybermaggedon
cf0daedefa
Changed schema for Value -> Term, major breaking change (#622)
* Changed schema for Value -> Term, major breaking change

* Following the schema change, Value -> Term into all processing

* Updated Cassandra for g, p, s, o index patterns (7 indexes)

* Reviewed and updated all tests

* Neo4j, Memgraph and FalkorDB remain broken; will revisit once the schema has settled down
2026-01-27 13:48:08 +00:00
cybermaggedon
9a34ab1b93
Complete remaining parameter work (#530)
* Fix CLI typo

* Complete flow parameters work, still needs implementation in LLMs
2025-09-24 13:58:34 +01:00
cybermaggedon
f6bccd7438
Parallel container builds (#515) 2025-09-11 12:32:04 +01:00
cybermaggedon
6c681967ab
Validate librarian collection (#453) 2025-08-07 21:36:24 +01:00
cybermaggedon
98022d6af4
Migrate from setup.py to pyproject.toml (#440)
* Converted setup.py to pyproject.toml

* Modern package infrastructure as recommended by the Python docs
2025-07-23 21:22:08 +01:00
cybermaggedon
ac977d18f4
Add MCP container push (#425) 2025-07-03 17:00:59 +01:00
cybermaggedon
f907ea7db8
PoC MCP server (#419)
* Very initial MCP server PoC for TrustGraph

* Put service on port 8000

* Add MCP container and packages to buildout
2025-07-02 18:19:23 +01:00
cybermaggedon
4461d7b289
Feature/persist config (#370)
* Cassandra tables for config

* Config is backed by Cassandra
2025-05-07 12:58:32 +01:00
cybermaggedon
d0da122bed
Fix/llms (#366)
* Fix LMStudio, cache documents with tg-load-sample-documents

* Fix Mistral
2025-05-06 16:17:16 +01:00
cybermaggedon
a9197d11ee
Feature/configure flows (#345)
- Keeps processing in different flows separate so that data can go to different stores / collections etc.
- Potentially supports different processing flows
- Tidies the processing API with common base-classes for e.g. LLMs, and automatic configuration of 'clients' to use the right queue names in a flow
2025-04-22 20:21:38 +01:00
cybermaggedon
1d222235d3
Configuration initialisation (#335)
* - Fixed error reporting in config
- Updated tg-init-pulsar to be able to load initial config to config-svc
- Tweaked API naming and added more config calls

* Tools to dump out prompts and agent tools
2025-04-02 13:52:33 +01:00
cybermaggedon
c759d55734
Added a module which does OCR for PDFs (pdf-ocr), in a separate package (#324)
The package is separate because it has a lot of dependencies. Uses Tesseract.
2025-03-20 09:29:40 +00:00
JackColquitt
5f5cf8fd07 Added basic Mistral API support 2025-03-14 17:47:59 -07:00
cybermaggedon
edcdc4d59d
Feature/separate containers (#287)
* Separate containerfiles

* Add push to Makefile

* Update image names in the templates
2025-01-28 19:36:05 +00:00
cybermaggedon
dc2b599fda
Fix/release broken (#257)
* Break release into 3 jobs

* Replace Github action with podman command
2025-01-06 21:45:42 +00:00
Avi Avni
1ab2a7ff6b
wip integrate falkordb (#211) 2024-12-18 21:01:24 +00:00
cybermaggedon
7df7843dad
Main/remove parquet (#195)
* Remove Parquet code, and package build
2024-12-06 08:51:10 +00:00
Cyber MacGeddon
43756d872b Set dependencies up for the 0.13 branch. Set version=0.0.0 in Makefile
to spot build errors.
2024-10-15 00:31:08 +01:00
Cyber MacGeddon
5850b6c136 Merge branch 'release/v0.11' into release/v0.12 2024-10-09 19:38:13 +01:00
cybermaggedon
a711bc1dde
Fix trustgraph broken linkage (#109) 2024-10-08 20:33:14 +01:00
cybermaggedon
148092a6af
Fix/lock 0.11 version (#108)
* - Locked 0.11 packages to 0.11 deps
- Added 'trustgraph' uber-package which installs the rest
- Added dependency to set package versions before building packages

* Bump version
2024-10-04 22:12:39 +01:00
cybermaggedon
dda29bb663
Workflows (#105)
* Some basic structure for workflows
* Add PyPI publication for 0.12
* Bump version
* Test bundle generation
* Install jsonnet
* Use release action to automate release creation
2024-10-04 17:28:07 +01:00
cybermaggedon
222dc9982c
Feature/azure openai templates (#104)
* Azure OpenAI LLM templates
* Bump version, fix package versions
* Add azure-openai to template generation
2024-10-04 15:47:46 +01:00
Cyber MacGeddon
adba99f270 Bump version 2024-10-02 22:25:48 +01:00
cybermaggedon
db9ed06b1c
Fix/broken kg extract topics (#97)
* Add missing kg-extract-topics service

* Bump version
2024-10-02 22:23:00 +01:00
cybermaggedon
14672f7f0e
Fix/processor state specify prom (#93)
* Provide a means to specify the Prometheus server with -p
* Bump version
2024-10-01 22:14:28 +01:00
Cyber MacGeddon
2e6be5cdce Bump version 2024-10-01 21:06:07 +01:00
cybermaggedon
56a9ac3ba9
Change LLM latency dashboard to be rate & bump version (#92) 2024-10-01 21:04:55 +01:00
cybermaggedon
ef1b8b5a13
Feature/metering dashboard (#89)
* Bump version

* Added Prom metrics to metering, added dashboard

* Update YAMLs

* Add $ on axis

* Tweak dashboard
2024-10-01 06:46:41 +01:00
cybermaggedon
88a7dfa126
Maint/rename pkg (#88)
* Rename trustgraph-utils -> trustgraph-cli
* Update YAMLs
2024-09-30 22:20:26 +01:00
cybermaggedon
771d9fc2c7
Make util pathnames have tg- prefix (#87) 2024-09-30 21:24:22 +01:00
cybermaggedon
0e4c9c69ee
Add twine upload target (#86) 2024-09-30 21:07:18 +01:00
cybermaggedon
c26ada08c2
Fix VertexAI package. Add Python packaging to Makefile. (#85)
Bump version & generate templates.
2024-09-30 20:50:20 +01:00
cybermaggedon
f00baab1b8
Maint/fix build env (#84)
* Put README placeholders for packages in place
* Bump version
2024-09-30 19:47:09 +01:00
cybermaggedon
9b91d5eee3
Feature/pkgsplit (#83)
* Starting to spawn base package
* More package hacking
* Bedrock and VertexAI
* Parquet split
* Updated templates
* Utils
2024-09-30 19:36:09 +01:00
cybermaggedon
3fb75c617b
Maint/auto pkg versions (#82)
* Remove need to manage setup.py version

* Update YAMLs
2024-09-30 16:38:50 +01:00
cybermaggedon
cdace22ee4
Feature/simpler subpackages (#81)
* Back to simpler directory structure

* Bump version, update templates
2024-09-30 16:16:20 +01:00
cybermaggedon
f081933217
Feature/subpackages (#80)
* Renaming what will become the core package

* Tweaking to get package build working

* Fix metering merge

* Rename to core directory

* Bump version.  Use namespace searching for packaging trustgraph-core

* Change references to trustgraph-core

* Forming embeddings-hf package

* Reference modules in core package.

* Build both packages to one container, bump version

* Update YAMLs
2024-09-30 14:00:29 +01:00
cybermaggedon
14d79ef9f1
Streamline startup (#79)
* Separate Prom metrics, different processors as different jobs

* Create producers before consumers, may streamline startup.

* Bump version

* Add Pulsar init command, will replace pulsar-admin invocations.

* Integrate tg-init-pulsar with YAMLs

* Update YAMLs
2024-09-30 12:19:22 +01:00
Cyber MacGeddon
5e8a1520ee Bump version & update templates 2024-09-30 00:00:36 +01:00
cybermaggedon
74a14639bd
Feature/track processor state (#78)
* Add a Prom metric to consumers & consumer/producers to track the running
state.

* New script, gets processor state using prometheus

* Bump version, add tg-processor-state to package

* Update templates
2024-09-29 23:50:57 +01:00
cybermaggedon
efc364583b
Fix/graph rag uses wrong prompt (#77)
* Fix queue name invocation, use correct names, not defaults

* Bump version

* Update templates
2024-09-29 20:38:50 +01:00