Commit graph

1132 commits

Author SHA1 Message Date
cybermaggedon
6cbaf88fc6 fix: ontology extractor reads .objects, not .object, from PromptResult (#842)
The extract-with-ontologies prompt is a JSONL prompt, which means the
prompt service returns a PromptResult with response_type="jsonl" and
the parsed items in `.objects` (plural).  The ontology extractor was
reading `.object` (singular) — the field used for response_type="json"
— which is always None for JSONL prompts.

Effect: the parser received None on every chunk, hit its "Unexpected
response type: <class 'NoneType'>" branch, returned no ExtractionResult,
and extract_with_simplified_format returned []. Every extraction
silently produced zero triples.

Graphs populated only with the seed ontology schema (TBox) and
document/chunk provenance — no instance triples at all.  The e2e test
threshold of >=100 edges per collection was met by schema + provenance
alone, so the failure mode was invisible until RAG queries couldn't
find any content.

Regression introduced in v2.3 with the token-usage work (commit
56d700f3 / 14e49d83) when PromptClient.prompt() began returning a
PromptResult wrapper instead of the raw text/dict/list.  All other
call sites of .prompt() across retrieval/, agent/, orchestrator/ were
already reading the correct field for their prompt's response_type;
ontology extraction was the sole stranded caller.

Also adds tests/unit/test_extract/test_ontology/test_extract_with_simplified_format.py
covering:
  - happy path: populated .objects produces non-empty triples
  - production failure shape: .objects=None returns [] cleanly
  - empty .objects returns [] without raising
  - defensive: do not silently fall back to .object for a JSONL prompt
2026-04-22 12:10:42 +01:00
cybermaggedon
8be128aa59 fix: api-gateway evicts cached dispatchers when a flow stops (#841)
DispatcherManager caches one ServiceRequestor per (flow_id, kind) in
self.dispatchers, lazily created on first use.  stop_flow dropped the
flow from self.flows but never touched the cached dispatchers, so
their publisher/subscriber connections persisted — bound to the
per-flow exchanges that flow-svc tears down when the flow stops.

If the same flow id was later re-created, flow-svc re-declared fresh
per-flow exchanges, but the gateway's cached dispatcher still held a
subscription queue bound to the now-gone old response exchange.
Requests went out fine (publishers target exchanges by name and the
new exchange has the right name), but responses landed on an exchange
with no binding to the dispatcher's queue and were silently dropped.
The calling CLI or websocket session hung waiting for a reply that
would never arrive.

Reproduction before fix:

    tg-start-flow -i test-flow-1 ...
    # any query on test-flow-1 works
    tg-stop-flow  -i test-flow-1
    tg-start-flow -i test-flow-1 ...
    tg-show-graph -f test-flow-1 -C <collection>   # hangs

Flows that were never stopped (e.g. "default" in a typical session)
were unaffected — their cached dispatcher still pointed at live
plumbing.  That's why the bug appeared flow-name-specific at first
glance; it's actually lifecycle-specific.

Fix: in stop_flow, evict and cleanly stop() every cached dispatcher
keyed on the stopped flow id.  Next request after restart constructs
a fresh dispatcher against the freshly-declared exchanges.  Tuple
shape check preserves global dispatchers, which use (None, kind) as
their key and must survive flow churn.

Uses pop(id, None) instead of del in case stop_flow is invoked
defensively for a flow the gateway never saw.
2026-04-22 12:10:21 +01:00
cybermaggedon
d35473f7f7
feat: workspace-based multi-tenancy, replacing user as tenancy axis (#840)
Introduces `workspace` as the isolation boundary for config, flows,
library, and knowledge data. Removes `user` as a schema-level field
throughout the code, API specs, and tests; workspace provides the
same separation more cleanly at the trusted flow.workspace layer
rather than through client-supplied message fields.

Design
------
- IAM tech spec (docs/tech-specs/iam.md) documents current state,
  proposed auth/access model, and migration direction.
- Data ownership model (docs/tech-specs/data-ownership-model.md)
  captures the workspace/collection/flow hierarchy.

Schema + messaging
------------------
- Drop `user` field from AgentRequest/Step, GraphRagQuery,
  DocumentRagQuery, Triples/Graph/Document/Row EmbeddingsRequest,
  Sparql/Rows/Structured QueryRequest, ToolServiceRequest.
- Keep collection/workspace routing via flow.workspace at the
  service layer.
- Translators updated to not serialise/deserialise user.

API specs
---------
- OpenAPI schemas and path examples cleaned of user fields.
- Websocket async-api messages updated.
- Removed the unused parameters/User.yaml.

Services + base
---------------
- Librarian, collection manager, knowledge, config: all operations
  scoped by workspace. Config client API takes workspace as first
  positional arg.
- `flow.workspace` set at flow start time by the infrastructure;
  no longer pass-through from clients.
- Tool service drops user-personalisation passthrough.

CLI + SDK
---------
- tg-init-workspace and workspace-aware import/export.
- All tg-* commands drop user args; accept --workspace.
- Python API/SDK (flow, socket_client, async_*, explainability,
  library) drop user kwargs from every method signature.

MCP server
----------
- All tool endpoints drop user parameters; socket_manager no longer
  keyed per user.

Flow service
------------
- Closure-based topic cleanup on flow stop: only delete topics
  whose blueprint template was parameterised AND no remaining
  live flow (across all workspaces) still resolves to that topic.
  Three scopes fall out naturally from template analysis:
    * {id} -> per-flow, deleted on stop
    * {blueprint} -> per-blueprint, kept while any flow of the
      same blueprint exists
    * {workspace} -> per-workspace, kept while any flow in the
      workspace exists
    * literal -> global, never deleted (e.g. tg.request.librarian)
  Fixes a bug where stopping a flow silently destroyed the global
  librarian exchange, wedging all library operations until manual
  restart.

RabbitMQ backend
----------------
- heartbeat=60, blocked_connection_timeout=300. Catches silently
  dead connections (broker restart, orphaned channels, network
  partitions) within ~2 heartbeat windows, so the consumer
  reconnects and re-binds its queue rather than sitting forever
  on a zombie connection.

Tests
-----
- Full test refresh: unit, integration, contract, provenance.
- Dropped user-field assertions and constructor kwargs across
  ~100 test files.
- Renamed user-collection isolation tests to workspace-collection.
2026-04-21 23:23:01 +01:00
cybermaggedon
9332089b3d
Setup for 2.4 release branch (#839) 2026-04-21 21:36:46 +01:00
cybermaggedon
424ace44c4
Fix library queue lifecycle (#838)
* Don't delete the global queues (librarian) when flows are deleted

* 60s heartbeat timeouts on RabbitMQ
2026-04-21 21:30:19 +01:00
Het Patel
0ef49ab6ae
feat: standardize LLM rate-limiting and exception handling (#835)
- HTTP 429 translates to TooManyRequests (retryable)
- HTTP 503 translates to LlmError
2026-04-21 16:15:11 +01:00
cybermaggedon
e7efb673ef
Structure the tech specs directory (#836)
Tech spec some subdirectories for different languages
2026-04-21 16:06:41 +01:00
cybermaggedon
48da6c5f8b
Test fixes for Kafka (#834)
Test fixes for Kafka:
- Consumer: isinstance(max_concurrency, int) instead of is not None:
  MagicMock won't pass the check
- Kafka test: updated expected topic name to
  tg.request.prompt.default (colons replaced with dots)
2026-04-18 23:06:01 +01:00
cybermaggedon
b615150624
fix: resolve multiple Kafka backend issues blocking message delivery (#833)
- Producer: add delivery callback to surface send errors instead of
  silently swallowing them, and raise message.max.bytes to 10MB
- Consumer: raise fetch.message.max.bytes to 10MB to match producer,
  tighten session/heartbeat timeouts for fast group joins, and add
  partition assign/revoke logging for diagnostics
- Topic naming: replace colons with dots in topic names since Kafka
  rejects colons (flow:tg:document-load:default was producing invalid
  topic name tg.flow.document-load:default)
- Response consumers: use auto.offset.reset=earliest instead of latest
  so responses published before partition assignment aren't lost
- UNKNOWN_TOPIC_OR_PART: treat as timeout instead of fatal error so
  consumers wait for auto-created topics instead of crashing
- Concurrency: cap consumer workers to 1 for Kafka since topics have
  1 partition — extra consumers trigger rebalance storms that block
  all message delivery
2026-04-18 22:42:24 +01:00
Het Patel
adea976203 feat: implement retry logic and exponential backoff for S3 operations (#829)
* feat: implement retry logic and exponential backoff for S3 operations

* test: fix librarian mocks after BlobStore async conversion
2026-04-18 12:05:37 +01:00
cybermaggedon
cce3acd84f
fix: repair deferred imports to preserve module-level names for test patching (#831)
A previous commit moved SDK imports into __init__/methods and
stashed them on self, which broke @patch targets in 24 unit tests.

This fixes the approach: chunker and pdf_decoder use module-level
sentinels with global/if-None guards so imports are still deferred but
patchable. Google AI Studio reverts to standard module-level imports
since the module is only loaded when communicating with Gemini.
Keeps lazy loading on other imports.
2026-04-18 11:43:21 +01:00
cybermaggedon
d7745baab4
Add Kafka pub/sub backend (#830)
Third backend alongside Pulsar and RabbitMQ. Topics map 1:1 to Kafka
topics, subscriptions map to consumer groups. Response/notify uses
unique consumer groups with correlation ID filtering. Topic lifecycle
managed via AdminClient with class-based retention.

Initial code drop: Needs major integration testing
2026-04-18 11:18:34 +01:00
Syed Ishmum Ahnaf
81cde7baf9 fix for issue #821: deferring optional SDK imports to runtime for provider modules (#828) 2026-04-18 11:16:37 +01:00
cybermaggedon
3505bfdd25
refactor: use one fanout exchange per topic instead of shared topic exchange (#827)
The RabbitMQ backend used a single topic exchange per topicspace
with routing keys to differentiate logical topics. This meant the
flow service had to manually create named queues for every
processor-topic pair, including producer-side topics — creating
phantom queues that accumulated unread message copies indefinitely.

Replace with one fanout exchange per logical topic. Consumers now
declare and bind their own queues on connect. The flow service
manages topic lifecycle (create/delete exchanges) rather than queue
lifecycle, and only collects unique topic identifiers instead of
per-processor (topic, subscription) pairs.

Backend API: create_queue/delete_queue/ensure_queue replaced with
create_topic/delete_topic/ensure_topic (subscription parameter
removed).
2026-04-17 18:01:35 +01:00
Het Patel
391b9076f3 feat: add domain and range validation to triple extraction in extract.py (#825) 2026-04-17 11:29:57 +01:00
cybermaggedon
9f84891fcc
Flow service lifecycle management (#822)
feat: separate flow service from config service with explicit queue
lifecycle management

The flow service is now an independent service that owns the lifecycle
of flow and blueprint queues. System services own their own queues.
Consumers never create queues.

Flow service separation:
- New service at trustgraph-flow/trustgraph/flow/service/
- Uses async ConfigClient (RequestResponse pattern) to talk to config
  service
- Config service stripped of all flow handling

Queue lifecycle management:
- PubSubBackend protocol gains create_queue, delete_queue,
  queue_exists, ensure_queue — all async
- RabbitMQ: implements via pika with asyncio.to_thread internally
- Pulsar: stubs for future admin REST API implementation
- Consumer _connect() no longer creates queues (passive=True for named
  queues)
- System services call ensure_queue on startup
- Flow service creates queues on flow start, deletes on flow stop
- Flow service ensures queues for pre-existing flows on startup

Two-phase flow stop:
- Phase 1: set flow status to "stopping", delete processor config
  entries
- Phase 2: retry queue deletion, then delete flow record

Config restructure:
- active-flow config replaced with processor:{name} types
- Each processor has its own config type, each flow variant is a key
- Flow start/stop use batch put/delete — single config push per
  operation
- FlowProcessor subscribes to its own type only

Blueprint format:
- Processor entries split into topics and parameters dicts
- Flow interfaces use {"flow": "topic"} instead of bare strings
- Specs (ConsumerSpec, ProducerSpec, etc.) read from
  definition["topics"]

Tests updated
2026-04-16 17:19:39 +01:00
Lennard Geißler
645b6a66fd
fix: replace deprecated asyncio.iscoroutinefunction with inspect.iscoroutinefunction (#819)
asyncio.iscoroutinefunction is deprecated since Python 3.14 and slated for
removal in 3.16. The inspect equivalent has an identical signature and return
semantics. Replaces 8 call sites across 3 modules to silence DeprecationWarnings
reported in #818.
2026-04-16 10:57:39 +01:00
Trevin Chow
ef8bb3aed4
fix: replace deprecated datetime.utcnow() with timezone-aware datetime.now(timezone.utc) (#816)
Python 3.14 deprecates datetime.utcnow(). Replace all 9 occurrences with
datetime.now(timezone.utc) and normalize the output to preserve the existing
ISO-8601 "Z"-suffixed format so downstream parsers are unaffected.

Fixes #814
2026-04-16 10:16:11 +01:00
cybermaggedon
95e4839da7
Fix docstring breakage in type hint change (#817) 2026-04-16 10:10:58 +01:00
RaccoonLabs
706e62b7c2 feat: add type hints to all public functions in trustgraph/base (#803)
feat: add type hints to all public functions in trustgraph/base

Add type annotations to 23 modules covering:
- Metrics classes (ConsumerMetrics, ProducerMetrics, etc.)
- Spec classes (ConsumerSpec, ProducerSpec, SubscriberSpec, etc.)
- Service classes with add_args() and run() methods
- Utility functions (logging, pubsub, clients)
- AsyncProcessor methods

All 93 public functions now fully typed.

Refs #785

* refactor: deduplicate imports and move __future__ after docstrings

Addresses review feedback on PR #803:
- Remove duplicate 'from argparse import ArgumentParser' across 12 files
- Move 'from __future__ import annotations' to line 1 in all files
- Clean up excessive blank lines
2026-04-16 10:08:19 +01:00
cybermaggedon
22096e07e2
Fix tests broken by the recent RabbitMQ/Cassandra async fixes (#815)
- Fix invalid key in config causing rogue warning
- Fix asyncio test tags
2026-04-16 10:00:18 +01:00
Alex Jenkins
fdb52a6bfc Add docstrings to public classes (#812)
Add class-level docstrings to five public classes in trustgraph-base:
Flow, LlmService, ConsumerMetrics, ToolClient, and TriplesStoreService.
Each docstring summarises the class's role in the system to aid
discoverability for new contributors.

Signed-off-by: Jenkins, Kenneth Alexander <kjenkins60@gatech.edu>
2026-04-16 09:07:08 +01:00
cybermaggedon
2f64ffc99d
Add missing PyYAML package (#811) 2026-04-15 15:37:46 +01:00
cybermaggedon
2bf4af294e
Better proc group logging and concurrency (#810)
- Silence pika, cassandra etc. logging at INFO (too much chatter) 
- Add per processor log tags so that logs can be understood in
  processor group.
- Deal with RabbitMQ lag weirdness
- Added more processor group examples
2026-04-15 14:52:01 +01:00
cybermaggedon
ce3c8b421b
Remove Pulsar healthcheck from tg-verify-system-health (#809)
The Pulsar healthcheck was removed from
tg-verify-system-health and re-added accidentally
in a PR.  This removes it again.
2026-04-14 16:12:24 +01:00
cybermaggedon
f11c0ad0cb
Processor group implementation: dev wrapper (#808)
Processor group implementation: A wrapper to launch multiple
processors in a single processor

- trustgraph-base/trustgraph/base/processor_group.py — group runner
  module. run_group(config) is the async body; run() is the
  endpoint. Loads JSON or YAML config, validates that every entry
  has a unique params.id, instantiates each class via importlib,
  shares one TaskGroup, mirrors AsyncProcessor.launch's retry loop
  and Prometheus startup.
- trustgraph-base/pyproject.toml — added [project.scripts] block
  with processor-group = "trustgraph.base.processor_group:run".

Key behaviours:
- Unique id enforced up front — missing or duplicate params.id fails
  fast with a clear error, preventing the Prometheus Info label
  collision we flagged.
- No registry — dotted class path is the identifier; any
  AsyncProcessor descendant importable at runtime is packable.
- YAML import is lazy — only pulled in if the config file ends in
  .yaml/.yml, so JSON-only users don't need PyYAML installed.
- Single Prometheus server — start_http_server runs once at
  startup, before the retry loop, matching launch()'s pattern.
- Retry loop — same shape as AsyncProcessor.launch: catches
  ExceptionGroup from TaskGroup, logs, sleeps 4s,
  retries. Fail-group semantics (one processor dying tears down the
  group) — simple and surfaces bugs, as discussed.

Example config:

  processors:
    - class: trustgraph.extract.kg.definitions.extract.Processor
      params:
        id: kg-extract-definitions
    - class: trustgraph.chunking.recursive.Processor
      params:
        id: chunker-recursive

Run with processor-group -c group.yaml.
2026-04-14 15:19:04 +01:00
Alex Jenkins
8954fa3ad7 Feat: TrustGraph i18n & Documentation Translation Updates (#781)
Native CLI i18n: The TrustGraph CLI has built-in translation support
that dynamically loads language strings. You can test and use
different languages by simply passing the --lang flag (e.g., --lang
es for Spanish, --lang ru for Russian) or by configuring your
environment's LANG variable.

Automated Docs Translations: This PR introduces autonomously
translated Markdown documentation into several target languages,
including Spanish, Swahili, Portuguese, Turkish, Hindi, Hebrew,
Arabic, Simplified Chinese, and Russian.
2026-04-14 12:08:32 +01:00
cybermaggedon
f976f1b6fe
Fix test suite Prometheus registry pollution and remove default (#806)
Fix test suite Prometheus registry pollution and remove default coverage

The test_metrics.py fixture (added in 4e63dbda) deleted
class-level metric singleton attributes without restoring them. This
stripped the hasattr() guards that prevent duplicate Prometheus
registration, so every subsequent Processor construction in the test
run raised ValueError: Duplicated timeseries. Fix by saving and
restoring the original attributes around each test.

Remove --cov=trustgraph from pytest.ini defaults — the coverage
import hooks interact badly with the trustgraph namespace package,
causing duplicate class loading. Coverage can still be requested
explicitly via the command line.
2026-04-14 11:07:23 +01:00
Zeel Desai
39dcd1d386 Add unit tests for base helper modules (#797)
- add unit tests for base metrics, logging, spec, parameter_spec,
  and flow modules
- add a lightweight test-only module loader so these tests can run
  without optional runtime dependencies
- fix Parameter.start/stop to accept self
2026-04-14 10:58:15 +01:00
cybermaggedon
a89d6261c7
Move ARM container builds to ARM runners (#805) 2026-04-14 09:39:07 +01:00
Cyber MacGeddon
5e6c96bdd1 Separate platform builds & combine to single manifest 2026-04-13 23:14:37 +01:00
cybermaggedon
5108b3db95
Get hf onto Python 3.12 so that it works on AMD64 and ARM64 (#802) 2026-04-13 21:49:59 +01:00
cybermaggedon
3af794ceea
Use manifests to build for amd64 and arm64 (#798) (#801)
Remove Pytorch +cpu invocation, so it builds on ARM
2026-04-13 21:18:58 +01:00
cybermaggedon
baf8d861af
Fix Qemu setup (#800) 2026-04-13 21:03:45 +01:00
cybermaggedon
d11fcabaed
Fix CI pipeline (#799)
Matrix labels are wrong, fix them
2026-04-13 20:57:09 +01:00
cybermaggedon
76c1752748
Use manifests to build for amd64 and arm64 (#798)
This adds ARM container builds to the CI pipeline
2026-04-13 20:51:09 +01:00
cybermaggedon
1515dbaf08
Updated tech specs for agent & explainability changes (#796) 2026-04-13 17:29:24 +01:00
cybermaggedon
d2751553a3
Add agent explainability instrumentation and unify envelope field naming (#795)
Addresses recommendations from the UX developer's agent experience report.
Adds provenance predicates, DAG structure changes, error resilience, and
a published OWL ontology.

Explainability additions:

- Tool candidates: tg:toolCandidate on Analysis events lists the tools
  visible to the LLM for each iteration (names only, descriptions in config)
- Termination reason: tg:terminationReason on Conclusion/Synthesis events
  (final-answer, plan-complete, subagents-complete)
- Step counter: tg:stepNumber on iteration events
- Pattern decision: new tg:PatternDecision entity in the DAG between
  session and first iteration, carrying tg:pattern and tg:taskType
- Latency: tg:llmDurationMs on Analysis events, tg:toolDurationMs on
  Observation events
- Token counts on events: tg:inToken/tg:outToken/tg:llmModel on
  Grounding, Focus, Synthesis, and Analysis events
- Tool/parse errors: tg:toolError on Observation events with tg:Error
  mixin type. Parse failures return as error observations instead of
  crashing the agent, giving it a chance to retry.

Envelope unification:

- Rename chunk_type to message_type across AgentResponse schema,
  translator, SDK types, socket clients, CLI, and all tests.
  Agent and RAG services now both use message_type on the wire.

Ontology:

- specs/ontology/trustgraph.ttl — OWL vocabulary covering all 26 classes,
  7 object properties, and 36+ datatype properties including new predicates.

DAG structure tests:

- tests/unit/test_provenance/test_dag_structure.py verifies the
  wasDerivedFrom chain for GraphRAG, DocumentRAG, and all three agent
  patterns (react, plan, supervisor) including the pattern-decision link.
2026-04-13 16:16:42 +01:00
cybermaggedon
14e49d83c7
Expose LLM token usage across all service layers (#782)
Expose LLM token usage (in_token, out_token, model) across all
service layers

Propagate token counts from LLM services through the prompt,
text-completion, graph-RAG, document-RAG, and agent orchestrator
pipelines to the API gateway and Python SDK. All fields are Optional
— None means "not available", distinguishing from a real zero count.

Key changes:

- Schema: Add in_token/out_token/model to TextCompletionResponse,
  PromptResponse, GraphRagResponse, DocumentRagResponse,
  AgentResponse

- TextCompletionClient: New TextCompletionResult return type. Split
  into text_completion() (non-streaming) and
  text_completion_stream() (streaming with per-chunk handler
  callback)

- PromptClient: New PromptResult with response_type
  (text/json/jsonl), typed fields (text/object/objects), and token
  usage. All callers updated.

- RAG services: Accumulate token usage across all prompt calls
  (extract-concepts, edge-scoring, edge-reasoning,
  synthesis). Non-streaming path sends single combined response
  instead of chunk + end_of_session.

- Agent orchestrator: UsageTracker accumulates tokens across
  meta-router, pattern prompt calls, and react reasoning. Attached
  to end_of_dialog.

- Translators: Encode token fields when not None (is not None, not truthy)

- Python SDK: RAG and text-completion methods return
  TextCompletionResult (non-streaming) or RAGChunk/AgentAnswer with
  token fields (streaming)

- CLI: --show-usage flag on tg-invoke-llm, tg-invoke-prompt,
  tg-invoke-graph-rag, tg-invoke-document-rag, tg-invoke-agent
2026-04-13 14:38:34 +01:00
cybermaggedon
67cfa80836
SPARQL CLI reports errors from service (#794)
SPARQL query CLI ignores errors from the SPARQL service and just
emits a zero row output. This change causes an error to be reported
2026-04-13 14:31:33 +01:00
cybermaggedon
ffe310af7c
Fix RabbitMQ request/response race and chunker Flow API drift (#779)
* Fix Metadata/EntityEmbeddings schema migration tail and add regression tests (#776)

The Metadata dataclass dropped its `metadata: list[Triple]` field
and EntityEmbeddings/ChunkEmbeddings settled on a singular
`vector: list[float]` field, but several call sites kept passing
`Metadata(metadata=...)` and `EntityEmbeddings(vectors=...)`. The
bugs were latent until a websocket client first hit
`/api/v1/flow/default/import/entity-contexts`, at which point the
dispatcher TypeError'd on construction.

Production fixes (5 call sites on the same migration tail):

  * trustgraph-flow gateway dispatchers entity_contexts_import.py
    and graph_embeddings_import.py — drop the stale
    Metadata(metadata=...)  kwarg; switch graph_embeddings_import
    to the singular `vector` wire key.
  * trustgraph-base messaging translators knowledge.py and
    document_loading.py — fix decode side to read the singular
    `"vector"` key, matching what their own encode sides have
    always written.
  * trustgraph-flow tables/knowledge.py — fix Cassandra row
    deserialiser to construct EntityEmbeddings(vector=...)
    instead of vectors=.
  * trustgraph-flow gateway core_import/core_export — switch the
    kg-core msgpack wire format to the singular `"v"`/`"vector"`
    key and drop the dead `m["m"]` envelope field that referenced
    the removed Metadata.metadata triples list (it was a
    guaranteed KeyError on the export side).

Defense-in-depth regression coverage (32 new tests across 7 files):

  * tests/contract/test_schema_field_contracts.py — pin the field
    set of Metadata, EntityEmbeddings, ChunkEmbeddings,
    EntityContext so any future schema rename fails CI loudly
    with a clear diff.
  * tests/unit/test_translators/test_knowledge_translator_roundtrip.py
    and test_document_embeddings_translator_roundtrip.py -
    encode→decode round-trip the affected translators end to end,
    locking in the singular `"vector"` wire key.
  * tests/unit/test_gateway/test_entity_contexts_import_dispatcher.py
    and test_graph_embeddings_import_dispatcher.py — exercise the
    websocket dispatchers' receive() path with realistic
    payloads, the direct regression test for the original
    production crash.
  * tests/unit/test_gateway/test_core_import_export_roundtrip.py
    — pack/unpack the kg-core msgpack format through the real
    dispatcher classes (with KnowledgeRequestor mocked),
    including a full export→import round-trip.
  * tests/unit/test_tables/test_knowledge_table_store.py —
    exercise the Cassandra row → schema conversion via __new__ to
    bypass the live cluster connection.

Also fixes an unrelated leaked-coroutine RuntimeWarning in
test_gateway/test_service.py::test_run_method_calls_web_run_app: the
mocked aiohttp.web.run_app now closes the coroutine that Api.run() hands
it, mirroring what the real run_app would do, instead of leaving it for
the GC to complain about.

* Fix RabbitMQ request/response race and chunker Flow API drift

Two unrelated regressions surfaced after the v2.2 queue class
refactor.  Bundled here because both are small and both block
production.

1. Request/response race against ephemeral RabbitMQ response
queues

Commit feeb92b3 switched response/notify queues to per-subscriber
auto-delete exclusive queues. That fixed orphaned-queue
accumulation but introduced a setup race: Subscriber.start()
created the run() task and returned immediately, while the
underlying RabbitMQ consumer only declared and bound its queue
lazily on the first receive() call.  RequestResponse.request()
therefore published the request before any queue was bound to the
matching routing key, and the broker dropped the reply. Symptoms:
"Failed to fetch config on notify" / "Request timeout exception"
repeating roughly every 10s in api-gateway, document-embeddings
and any other service exercising the config notify path.

Fix:

  * Add ensure_connected() to the BackendConsumer protocol;
    implement it on RabbitMQBackendConsumer (calls _connect
    synchronously, declaring and binding the queue) and as a
    no-op on PulsarBackendConsumer (Pulsar's client.subscribe is
    already synchronous at construction).

  * Convert Subscriber's readiness signal from a non-existent
    Event to an asyncio.Future created in start(). run() calls
    consumer.ensure_connected() immediately after
    create_consumer() and sets _ready.set_result(None) on first
    successful bind. start() awaits the future via asyncio.wait
    so it returns only once the consumer is fully bound. Any
    reply published after start() returns is therefore guaranteed
    to land in a bound queue.

  * First-attempt connection failures call
    _ready.set_exception(e) and exit run() so start() unblocks
    with the error rather than hanging forever — the existing
    higher-level retry pattern in fetch_and_apply_config takes
    over from there. Runtime failures after a successful start
    still go through the existing retry-with-backoff path.

  * Update the two existing graceful-shutdown tests that
    monkey-patch Subscriber.run with a custom coroutine to honor
    the new contract by signalling _ready themselves.

  * Add tests/unit/test_base/test_subscriber_readiness.py with
    five regression tests pinning the readiness contract:
    ensure_connected must be called before start() returns;
    start() must block while ensure_connected runs
    (race-condition guard with a threading.Event gate);
    first-attempt create_consumer and ensure_connected failures
    must propagate to start() instead of hanging;
    ensure_connected must run before any receive() call.

2. Chunker Flow parameter lookup using the wrong attribute

trustgraph-base/trustgraph/base/chunking_service.py was reading
flow.parameters.get("chunk-size") and chunk-overlap, but the Flow
class has no `parameters` attribute — parameter lookup is exposed
through Flow.__call__ (flow("chunk-size") returns the resolved
value or None).  The exception was caught and logged as a
WARNING, so chunking continued with the default sizes and any
configured chunk-size / chunk-overlap was silently ignored:

    chunker - WARNING - Could not parse chunk-size parameter:
    'Flow' object has no attribute 'parameters'

The chunker tests didn't catch this because they constructed
mock_flow = MagicMock() and configured
mock_flow.parameters.get.side_effect = ..., which is the same
phantom attribute MagicMock auto-creates on demand. Tests and
production agreed on the wrong API.

Fix: switch chunking_service.py to flow("chunk-size") /
flow("chunk-overlap"). Update both chunker test files to mock the
__call__ side_effect instead of the phantom parameters.get,
merging parameter values into the existing flow() lookup the
on_message tests already used for producer resolution.
2026-04-11 01:29:38 +01:00
cybermaggedon
c23e28aa66
Fix Metadata/EntityEmbeddings schema migration tail and add regression tests (#777)
The Metadata dataclass dropped its `metadata: list[Triple]` field
and EntityEmbeddings/ChunkEmbeddings settled on a singular
`vector: list[float]` field, but several call sites kept passing
`Metadata(metadata=...)` and `EntityEmbeddings(vectors=...)`. The
bugs were latent until a websocket client first hit
`/api/v1/flow/default/import/entity-contexts`, at which point the
dispatcher TypeError'd on construction.

Production fixes (5 call sites on the same migration tail):

  * trustgraph-flow gateway dispatchers entity_contexts_import.py
    and graph_embeddings_import.py — drop the stale
    Metadata(metadata=...)  kwarg; switch graph_embeddings_import
    to the singular `vector` wire key.
  * trustgraph-base messaging translators knowledge.py and
    document_loading.py — fix decode side to read the singular
    `"vector"` key, matching what their own encode sides have
    always written.
  * trustgraph-flow tables/knowledge.py — fix Cassandra row
    deserialiser to construct EntityEmbeddings(vector=...)
    instead of vectors=.
  * trustgraph-flow gateway core_import/core_export — switch the
    kg-core msgpack wire format to the singular `"v"`/`"vector"`
    key and drop the dead `m["m"]` envelope field that referenced
    the removed Metadata.metadata triples list (it was a
    guaranteed KeyError on the export side).

Defense-in-depth regression coverage (32 new tests across 7 files):

  * tests/contract/test_schema_field_contracts.py — pin the field
    set of Metadata, EntityEmbeddings, ChunkEmbeddings,
    EntityContext so any future schema rename fails CI loudly
    with a clear diff.
  * tests/unit/test_translators/test_knowledge_translator_roundtrip.py
    and test_document_embeddings_translator_roundtrip.py -
    encode→decode round-trip the affected translators end to end,
    locking in the singular `"vector"` wire key.
  * tests/unit/test_gateway/test_entity_contexts_import_dispatcher.py
    and test_graph_embeddings_import_dispatcher.py — exercise the
    websocket dispatchers' receive() path with realistic
    payloads, the direct regression test for the original
    production crash.
  * tests/unit/test_gateway/test_core_import_export_roundtrip.py
    — pack/unpack the kg-core msgpack format through the real
    dispatcher classes (with KnowledgeRequestor mocked),
    including a full export→import round-trip.
  * tests/unit/test_tables/test_knowledge_table_store.py —
    exercise the Cassandra row → schema conversion via __new__ to
    bypass the live cluster connection.

Also fixes an unrelated leaked-coroutine RuntimeWarning in
test_gateway/test_service.py::test_run_method_calls_web_run_app: the
mocked aiohttp.web.run_app now closes the coroutine that Api.run() hands
it, mirroring what the real run_app would do, instead of leaving it for
the GC to complain about.
2026-04-10 20:43:45 +01:00
cybermaggedon
0994d4b05f
Open 2.3 release branch (#775)
* Update packages and CI for new release branch
2026-04-10 14:42:19 +01:00
cybermaggedon
ad0bff10ee
master -> release/v2.3 (#774)
* Mainly README changes
2026-04-10 14:38:46 +01:00
cybermaggedon
feeb92b33f
Refactor: Derive consumer behaviour from queue class (#772)
Derive consumer behaviour from queue class, remove
consumer_type parameter

The queue class prefix (flow, request, response, notify) now
fully determines consumer behaviour in both RabbitMQ and Pulsar
backends.  Added 'notify' class for ephemeral broadcast (config
push notifications).  Response and notify classes always create
per-subscriber auto-delete queues, eliminating orphaned queues
that accumulated on service restarts.

Change init-trustgraph to set up the 'notify' namespace in
Pulsar instead of old hangover 'state'.

Fixes 'stuck backlog' on RabbitMQ config notification queue.
2026-04-09 09:55:41 +01:00
cybermaggedon
aff96e57cb
Added Explainable AI agent demo in Typescript (#770)
(Not functional code)
2026-04-08 14:16:14 +01:00
cybermaggedon
e81418c58f
fix: preserve literal types in focus quoted triples and document tracing (#769)
The triples client returns Uri/Literal (str subclasses), not Term
objects.  _quoted_triple() treated all values as IRIs, so literal
objects like skos:definition values were mistyped in focus
provenance events, and trace_source_documents could not match
them in the store.

Added to_term() to convert Uri/Literal back to Term, threaded a
term_map from follow_edges_batch through
get_subgraph/get_labelgraph into uri_map, and updated
_quoted_triple to accept Term objects directly.
2026-04-08 13:37:02 +01:00
cybermaggedon
4b5bfacab1
Forward missing explain_triples through RAG clients and agent tool callback (#768)
fix: forward explain_triples through RAG clients and agent tool callback
- RAG clients and the KnowledgeQueryImpl tool callback were
  dropping explain_triples from explain events, losing provenance
  data (including focus edge selections) when graph-rag is invoked
  via the agent.

Tests for provenance and explainability (56 new):
- Client-level forwarding of explain_triples
- Graph-RAG structural chain
  (question → grounding → exploration → focus → synthesis)
- Graph-RAG integration with mocked subsidiary clients
- Document-RAG integration
  (question → grounding → exploration → synthesis)
- Agent-orchestrator all 3 patterns: react, plan-then-execute,
  supervisor
2026-04-08 11:41:17 +01:00
cybermaggedon
e899370d98
Update docs for 2.2 release (#766)
- Update protocol specs
- Update protocol docs
- Update API specs
2026-04-07 22:24:59 +01:00
cybermaggedon
c20e6540ec
Subscriber resilience and RabbitMQ fixes (#765)
Subscriber resilience: recreate consumer after connection failure

- Move consumer creation from Subscriber.start() into the run() loop,
  matching the pattern used by Consumer. If the connection drops and the
  consumer is closed in the finally block, the loop now recreates it on
  the next iteration instead of spinning forever on a None consumer.

Consumer thread safety:
- Dedicated ThreadPoolExecutor per consumer so all pika operations
  (create, receive, acknowledge, negative_acknowledge) run on the
  same thread — pika BlockingConnection is not thread-safe
- Applies to both Consumer and Subscriber classes

Config handler type audit — fix four mismatched type registrations:
- librarian: was ["librarian"] (non-existent type), now ["flow",
  "active-flow"] (matches config["flow"] that the handler reads)
- cores/service: was ["kg-core"], now ["flow"] (reads
  config["flow"])
- metering/counter: was ["token-costs"], now ["token-cost"]
  (singular)
- agent/mcp_tool: was ["mcp-tool"], now ["mcp"] (reads
  config["mcp"])

Update tests
2026-04-07 14:51:14 +01:00