Pull requests back-ported to release/v2.2: 801, 802, 804, 805
Restructure container builds for multi-platform support, enabling
ARM-based deployments (e.g. Apple Silicon via Docker Desktop).
Makefile:
- Replace per-container named targets with pattern rules
(container-%, manifest-%, platform-%-{amd64,arm64},
combine-manifest-%)
- Add parallel CI targets: platform builds push per-arch images,
combine-manifest creates and pushes the multi-arch manifest list
- Remove obsolete legacy targets (update-dcs, update-templates)
CI (release.yaml):
- Split single deploy job into build-platform-image (16 parallel
jobs: 8 containers x 2 platforms) and combine-manifests (8 jobs,
metadata only)
- Use native ARM runners (ubuntu-24.04-arm)
Containerfile.hf:
- Downgrade to Python 3.12 (PyTorch lacks arm64 wheels for 3.13)
- Use standard PyTorch package instead of +cpu variant (no arm64 wheels
on the cpu index)
SPARQL 1.1 query service wrapping pub/sub triples interface
Add a backend-agnostic SPARQL query service that parses SPARQL
queries using rdflib, decomposes them into triple pattern lookups
via the existing TriplesClient pub/sub interface, and performs
in-memory joins, filters, and projections.
Includes:
- SPARQL parser, algebra evaluator, expression evaluator, solution
sequence operations (BGP, JOIN, OPTIONAL, UNION, FILTER, BIND,
VALUES, GROUP BY, ORDER BY, LIMIT/OFFSET, DISTINCT, aggregates)
- FlowProcessor service with TriplesClientSpec
- Gateway dispatcher, request/response translators, API spec
- Python SDK method (FlowInstance.sparql_query)
- CLI command (tg-invoke-sparql-query)
- Tech spec (docs/tech-specs/sparql-query.md)
New unit tests for SPARQL query
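A rough sketch of the parsing step, using rdflib's public SPARQL machinery (the service's internal evaluator and TriplesClient wiring are not shown; the SDK call at the end uses the method name from this change with the surrounding setup assumed):

```python
from rdflib.plugins.sparql import prepareQuery

# rdflib parses the query text into a nested algebra tree; the
# evaluator walks this tree, issuing one TriplesClient lookup per
# triple pattern and joining the resulting bindings in memory.
q = prepareQuery("""
    SELECT ?s ?o WHERE {
        ?s a ?type .
        OPTIONAL { ?s ?p ?o }
    } LIMIT 10
""")

print(q.algebra)  # e.g. SelectQuery -> Slice -> Project -> LeftJoin -> ...

# SDK-side usage (FlowInstance setup assumed):
#   results = flow.sparql_query("SELECT ?s WHERE { ?s ?p ?o } LIMIT 5")
```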
Adds a RabbitMQ backend as an alternative to Pulsar, selectable via
PUBSUB_BACKEND=rabbitmq. Both backends implement the same PubSubBackend
protocol — no application code changes needed to switch.
RabbitMQ topology:
- Single topic exchange per topicspace (e.g. 'tg')
- Routing key derived from queue class and topic name
- Shared consumers: named queue bound to exchange (competing, round-robin)
- Exclusive consumers: anonymous auto-delete queue (broadcast, each gets
every message). Used by Subscriber and config push consumer.
- Thread-local producer connections (pika is not thread-safe)
- Push-based consumption via basic_consume with process_data_events
for heartbeat processing
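A minimal pika sketch of the two consumer shapes, assuming the 'tg' exchange above; routing-key derivation is simplified to a literal:

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
ch = conn.channel()

# One topic exchange per topicspace.
ch.exchange_declare(exchange="tg", exchange_type="topic", durable=True)

routing_key = "flow.example-topic"  # derived from queue class + topic name

# Shared consumer: a named queue; all consumers bound to it compete
# for messages (round-robin).
ch.queue_declare(queue="shared-q", durable=True)
ch.queue_bind(queue="shared-q", exchange="tg", routing_key=routing_key)

# Exclusive consumer: an anonymous auto-delete queue per consumer,
# so each consumer receives its own copy of every message (broadcast).
anon = ch.queue_declare(queue="", exclusive=True, auto_delete=True)
ch.queue_bind(queue=anon.method.queue, exchange="tg", routing_key=routing_key)

def on_message(channel, method, properties, body):
    channel.basic_ack(delivery_tag=method.delivery_tag)

ch.basic_consume(queue="shared-q", on_message_callback=on_message)

# Push-based consumption; process_data_events also services AMQP
# heartbeats, keeping the connection alive.
while True:
    conn.process_data_events(time_limit=1)
```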
Consumer model changes:
- Consumer class creates one backend consumer per concurrent task
(required for pika thread safety, harmless for Pulsar)
- Consumer class accepts consumer_type parameter
- Subscriber passes consumer_type='exclusive' for broadcast semantics
- Config push consumer uses consumer_type='exclusive' so every
processor instance receives config updates
- handle_one_from_queue receives consumer as parameter for correct
per-connection ack/nack
LibrarianClient:
- New shared client class replacing duplicated librarian request-response
code across 6+ services (chunking, decoders, RAG, etc.)
- Uses stream-document instead of get-document-content for fetching
document content in 1MB chunks (avoids broker message size limits)
- Standalone object (self.librarian = LibrarianClient(...)), not a mixin
- get-document-content marked deprecated in schema and OpenAPI spec
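A sketch of the composition pattern, with hypothetical method and constructor signatures (the real LibrarianClient API may differ):

```python
class ChunkingService:
    def __init__(self, pubsub):
        # Standalone client object, not a mixin: the service holds a
        # LibrarianClient rather than inheriting its plumbing.
        self.librarian = LibrarianClient(pubsub)  # hypothetical signature

    async def load(self, document_id):
        # stream-document delivers content in ~1MB chunks, so no
        # single pub/sub message approaches the broker size limit.
        content = bytearray()
        async for chunk in self.librarian.stream_document(document_id):
            content.extend(chunk)
        return bytes(content)
```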
Serialisation:
- Extracted dataclass_to_dict/dict_to_dataclass to shared
serialization.py (used by both Pulsar and RabbitMQ backends)
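A minimal sketch of what these helpers do, assuming flat or simply nested dataclasses (the shared implementation may cover more cases):

```python
import dataclasses

def dataclass_to_dict(obj):
    # Recursively convert a dataclass instance into plain dicts/lists,
    # ready for JSON encoding on either backend.
    return dataclasses.asdict(obj)

def dict_to_dataclass(cls, data):
    # Rebuild a dataclass from a dict, recursing into dataclass fields.
    kwargs = {}
    for f in dataclasses.fields(cls):
        value = data.get(f.name)
        if dataclasses.is_dataclass(f.type) and isinstance(value, dict):
            value = dict_to_dataclass(f.type, value)
        kwargs[f.name] = value
    return cls(**kwargs)
```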
Librarian queues:
- Changed from flow class (persistent) back to request/response class
now that stream-document eliminates large single messages
- API upload chunk size reduced from 5MB to 3MB to stay under broker
limits after base64 encoding
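The arithmetic: base64 expands payloads by a factor of 4/3, so a 5MB chunk encodes to about 6.7MB (over a 5MB broker message limit) while a 3MB chunk encodes to exactly 4MB:

```python
import base64

MB = 1024 * 1024

for chunk_mb in (5, 3):
    encoded = base64.b64encode(b"\0" * (chunk_mb * MB))
    print(f"{chunk_mb} MB -> {len(encoded) / MB:.2f} MB encoded")

# 5 MB -> 6.67 MB encoded (over a 5 MB limit)
# 3 MB -> 4.00 MB encoded (safely under)
```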
Factory and CLI:
- get_pubsub() handles 'rabbitmq' backend with RabbitMQ connection params
- add_pubsub_args() includes RabbitMQ options (host, port, credentials)
- add_pubsub_args(standalone=True) defaults to localhost for CLI tools
- init_trustgraph skips Pulsar admin setup for non-Pulsar backends
- tg-dump-queues and tg-monitor-prompts use backend abstraction
- BaseClient and ConfigClient accept generic pubsub config
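A hypothetical sketch of the factory dispatch (get_pubsub and PUBSUB_BACKEND come from this change; module paths and class names are assumptions):

```python
import os

def get_pubsub(**params):
    backend = params.get("backend") or os.environ.get("PUBSUB_BACKEND", "pulsar")
    if backend == "rabbitmq":
        # Assumed module path and class name.
        from trustgraph.messaging.rabbitmq import RabbitMQBackend
        return RabbitMQBackend(
            host=params.get("rabbitmq_host", "localhost"),
            port=params.get("rabbitmq_port", 5672),
        )
    # Default: Pulsar (assumed module path and class name).
    from trustgraph.messaging.pulsar import PulsarBackend
    return PulsarBackend(url=params.get("pulsar_url"))
```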
Add universal document decoder with multi-format support
using 'unstructured'.
New universal decoder service powered by the unstructured
library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF,
ODT, EPUB and more through a single service. Tables are preserved
as HTML markup for better downstream extraction. Images are
stored in the librarian but excluded from the text
pipeline. Configurable section grouping strategies
(whole-document, heading, element-type, count, size) for non-page
formats. Page-based formats (PDF, PPTX, XLSX) are automatically
grouped by page.
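The decoding core can be illustrated with the unstructured partitioning API (a sketch; grouping, librarian storage and provenance sit on top):

```python
from unstructured.partition.auto import partition

# partition() auto-detects the format (DOCX, XLSX, PPTX, HTML, EPUB, ...)
# and returns a flat list of typed elements.
elements = partition(filename="report.docx")

for el in elements:
    if el.category == "Table":
        # Tables carry an HTML rendering (when available), preserved
        # for downstream extraction rather than flattened to text.
        print(el.metadata.text_as_html)
    elif el.category == "Image":
        pass  # stored in the librarian, excluded from the text pipeline
    else:
        print(el.category, (el.text or "")[:80])
```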
All four decoders (PDF, Mistral OCR, Tesseract OCR, universal)
now share the "document-decoder" ident so they are
interchangeable. PDF-only decoders fetch document metadata to
check MIME type and gracefully skip unsupported formats.
Librarian changes: removed MIME type whitelist validation so any
document format can be ingested. Simplified routing so text/plain
goes to text-load and everything else goes to document-load.
Removed dual inline/streaming data paths — documents always use
document_id for content retrieval.
New provenance entity types (tg:Section, tg:Image) and metadata
predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for
richer explainability.
Universal decoder is in its own package (trustgraph-unstructured)
and container image (trustgraph-unstructured).
* Changed schema from Value -> Term, a major breaking change
* Following the schema change, propagated Value -> Term through all processing
* Updated Cassandra for g, p, s, o index patterns (7 indexes)
* Reviewed and updated all tests
* Neo4j, Memgraph and FalkorDB remain broken; will revisit once the schema change settles down
- Keeps processing in different flows separate so that data can be routed to different stores, collections, etc.
- Potentially supports different processing flows
- Tidies the processing API with common base classes (e.g. for LLMs) and automatic configuration of clients to use the correct queue names within a flow
* Fixed error reporting in config
- Updated tg-init-pulsar to load initial config into config-svc
- Tweaked API naming and added more config calls
* Tools to dump out prompts and agent tools
* Locked 0.11 packages to 0.11 deps
- Added 'trustgraph' uber-package which installs the rest
- Added dependency to set package versions before building packages
* Bump version
* Some basic structure for workflows
* Add PyPI publication for 0.12
* Bump version
* Test bundle generation
* Install jsonnet
* Use release action to automate release creation
* Renaming what will become the core package
* Tweaking to get package build working
* Fix metering merge
* Rename to core directory
* Bump version. Use namespace searching for packaging trustgraph-core
* Change references to trustgraph-core
* Forming embeddings-hf package
* Reference modules in core package.
* Build both packages to one container, bump version
* Update YAMLs
* Separate Prom metrics, different processors as different jobs
* Create producers before consumers; may streamline startup.
* Bump version
* Add Pulsar init command, will replace pulsar-admin invocations.
* Integrate tg-init-pulsar with YAMLs
* Update YAMLs
* Add a Prom metric to consumers & consumer/producers to track the running
state.
* New script that gets processor state using Prometheus
* Bump version, add tg-processor-state to package
* Update templates