trustgraph

mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-06-12 08:15:14 +02:00

Author	SHA1	Message	Date
cybermaggedon	dfb6d26a56	Fix RabbitMQ request/response race and chunker Flow API drift (#780) Two unrelated regressions surfaced after the v2.2 queue class refactor. Bundled here because both are small and both block production. 1. Request/response race against ephemeral RabbitMQ response queues Commit `feeb92b3` switched response/notify queues to per-subscriber auto-delete exclusive queues. That fixed orphaned-queue accumulation but introduced a setup race: Subscriber.start() created the run() task and returned immediately, while the underlying RabbitMQ consumer only declared and bound its queue lazily on the first receive() call. RequestResponse.request() therefore published the request before any queue was bound to the matching routing key, and the broker dropped the reply. Symptoms: "Failed to fetch config on notify" / "Request timeout exception" repeating roughly every 10s in api-gateway, document-embeddings and any other service exercising the config notify path. Fix: * Add ensure_connected() to the BackendConsumer protocol; implement it on RabbitMQBackendConsumer (calls _connect synchronously, declaring and binding the queue) and as a no-op on PulsarBackendConsumer (Pulsar's client.subscribe is already synchronous at construction). * Convert Subscriber's readiness signal from a non-existent Event to an asyncio.Future created in start(). run() calls consumer.ensure_connected() immediately after create_consumer() and sets _ready.set_result(None) on first successful bind. start() awaits the future via asyncio.wait so it returns only once the consumer is fully bound. Any reply published after start() returns is therefore guaranteed to land in a bound queue. * First-attempt connection failures call _ready.set_exception(e) and exit run() so start() unblocks with the error rather than hanging forever; the existing higher-level retry pattern in fetch_and_apply_config takes over from there. Runtime failures after a successful start still go through the existing retry-with-backoff path. * Update the two existing graceful-shutdown tests that monkey-patch Subscriber.run with a custom coroutine to honor the new contract by signalling _ready themselves. * Add tests/unit/test_base/test_subscriber_readiness.py with five regression tests pinning the readiness contract: ensure_connected must be called before start() returns; start() must block while ensure_connected runs (race-condition guard with a threading.Event gate); first-attempt create_consumer and ensure_connected failures must propagate to start() instead of hanging; ensure_connected must run before any receive() call. 2. Chunker Flow parameter lookup using the wrong attribute trustgraph-base/trustgraph/base/chunking_service.py was reading flow.parameters.get("chunk-size") and chunk-overlap, but the Flow class has no `parameters` attribute; parameter lookup is exposed through Flow.__call__ (flow("chunk-size") returns the resolved value or None). The exception was caught and logged as a WARNING, so chunking continued with the default sizes and any configured chunk-size / chunk-overlap was silently ignored: `chunker - WARNING - Could not parse chunk-size parameter: 'Flow' object has no attribute 'parameters'`. The chunker tests didn't catch this because they constructed mock_flow = MagicMock() and configured mock_flow.parameters.get.side_effect = ..., which is the same phantom attribute MagicMock auto-creates on demand. Tests and production agreed on the wrong API. Fix: switch chunking_service.py to flow("chunk-size") / flow("chunk-overlap"). Update both chunker test files to mock the __call__ side_effect instead of the phantom parameters.get, merging parameter values into the existing flow() lookup the on_message tests already used for producer resolution.	2026-04-11 01:29:33 +01:00
cybermaggedon	c20e6540ec	Subscriber resilience and RabbitMQ fixes (#765) Subscriber resilience: recreate consumer after connection failure - Move consumer creation from Subscriber.start() into the run() loop, matching the pattern used by Consumer. If the connection drops and the consumer is closed in the finally block, the loop now recreates it on the next iteration instead of spinning forever on a None consumer. Consumer thread safety: - Dedicated ThreadPoolExecutor per consumer so all pika operations (create, receive, acknowledge, negative_acknowledge) run on the same thread (pika BlockingConnection is not thread-safe) - Applies to both Consumer and Subscriber classes Config handler type audit, fixing four mismatched type registrations: - librarian: was ["librarian"] (non-existent type), now ["flow", "active-flow"] (matches config["flow"] that the handler reads) - cores/service: was ["kg-core"], now ["flow"] (reads config["flow"]) - metering/counter: was ["token-costs"], now ["token-cost"] (singular) - agent/mcp_tool: was ["mcp-tool"], now ["mcp"] (reads config["mcp"]) Update tests	2026-04-07 14:51:14 +01:00
cybermaggedon	4acd853023	Config push notify pattern: replace stateful pub/sub with signal+fetch (#760) Replace the config push mechanism that broadcast the full config blob on a 'state' class pub/sub queue with a lightweight notify signal containing only the version number and affected config types. Processors fetch the full config via request/response from the config service when notified. This eliminates the need for the pub/sub 'state' queue class and stateful pub/sub services entirely. The config push queue moves from 'state' to 'flow' class, a simple transient signal rather than a retained message. This solves the RabbitMQ late-subscriber problem where restarting processes never received the current config because their fresh queue had no historical messages. Key changes: - ConfigPush schema: config dict replaced with types list - Subscribe-then-fetch startup with retry: processors subscribe to notify queue, fetch config via request/response, then process buffered notifies with version comparison to avoid race conditions - register_config_handler() accepts optional types parameter so handlers only fire when their config types change - Short-lived config request/response clients to avoid subscriber contention on non-persistent response topics - Config service passes affected types through put/delete/flow operations - Gateway ConfigReceiver rewritten with same notify pattern and retry loop Tests updated. New tests: - register_config_handler: without types, with types, multiple types, multiple handlers - on_config_notify: old/same version skipped, irrelevant types skipped (version still updated), relevant type triggers fetch, handler without types always called, mixed handler filtering, empty types invokes all, fetch failure handled gracefully - fetch_config: returns config+version, raises on error response, stops client even on exception - fetch_and_apply_config: applies to all handlers on startup, retries on failure	2026-04-06 16:57:27 +01:00
cybermaggedon	f2ae0e8623	Embeddings API scores (#671) - Put scores in all responses - Remove unused 'middle' vector layer. Vector of texts -> vector of (vector embedding)	2026-03-09 10:53:44 +00:00
cybermaggedon	3bf8a65409	Fix tests (#666)	2026-03-07 23:38:09 +00:00
cybermaggedon	5ffad92345	Fix subscriber unexpected message causing queue clogging (#642)	2026-02-23 14:34:05 +00:00
cybermaggedon	b08db761d7	Fix config inconsistency (#609) * Plural/singular confusion in config key * Flow class vs flow blueprint nomenclature change * Update docs & CLI to reflect the above	2026-01-14 12:31:40 +00:00
cybermaggedon	f79d0603f7	Update to add streaming tests (#600)	2026-01-06 21:48:05 +00:00
cybermaggedon	5304f96fe6	Fix tests (#593) * Fix unit/integration/contract tests which were broken by messaging fabric work	2025-12-19 08:53:21 +00:00
cybermaggedon	7d07f802a8	Basic multitenant support (#583) * Tech spec * Address multi-tenant queue option problems in CLI * Modified collection service to use config * Changed storage management to use the config service definition	2025-12-05 21:45:30 +00:00
cybermaggedon	43cfcb18a0	More LLM param test coverage (#535) * More LLM tests * Fixing tests	2025-09-26 01:00:30 +01:00
cybermaggedon	85e669c763	Fixing more Cassandra consistency issues (#488) * Fixing more Cassandra work * Fix tests	2025-09-04 00:58:11 +01:00
cybermaggedon	ccaec88a72	Feature/consolidate cassandra config (#483) * Cassandra consolidation of parameters * New Cassandra configuration helper * Implemented Cassandra config refactor * New tests	2025-09-03 23:41:22 +01:00
cybermaggedon	38826c7de1	trustgraph-base .chunks / .documents confusion in the API (#481) * trustgraph-base .chunks / .documents confusion in the API * Added tests, fixed test failures in code * Fix file dup error * Fix contract error	2025-09-02 17:58:53 +01:00
cybermaggedon	96c2b73457	Fix import export graceful shutdown (#476) * Tech spec for graceful shutdown * Graceful shutdown of importers/exporters * Update socket to include graceful shutdown orchestration * Adding tests for conditions tracked in this PR	2025-08-28 13:39:28 +01:00
cybermaggedon	2f7fddd206	Test suite executed from CI pipeline (#433) * Test strategy & test cases * Unit tests * Integration tests	2025-07-14 14:57:44 +01:00
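
The readiness contract described in commit dfb6d26a56 can be sketched as follows. This is a minimal illustration and not the TrustGraph code: `FakeConsumer` is a hypothetical stand-in for a backend consumer, the `Subscriber` here is heavily simplified, and running the blocking connect through `asyncio.to_thread` is an assumption. Only the names `ensure_connected()`, `start()`, `run()` and `_ready` come from the commit message.

```python
import asyncio


class FakeConsumer:
    """Hypothetical stand-in for a backend consumer (not a TrustGraph class)."""

    def __init__(self, fail=False):
        self.fail = fail
        self.bound = False

    def ensure_connected(self):
        # A real backend would declare and bind its queue here.
        if self.fail:
            raise ConnectionError("broker unavailable")
        self.bound = True


class Subscriber:
    """Simplified subscriber: start() returns only once the consumer is bound."""

    def __init__(self, consumer):
        self.consumer = consumer
        self._ready = None
        self._task = None

    async def start(self):
        self._ready = asyncio.get_running_loop().create_future()
        self._task = asyncio.create_task(self.run())
        # Blocks until run() reports success or a first-attempt failure.
        await self._ready

    async def run(self):
        try:
            # Assumption: the blocking connect is run off the event loop.
            await asyncio.to_thread(self.consumer.ensure_connected)
        except Exception as e:
            # Unblock start() with the error instead of hanging.
            self._ready.set_exception(e)
            return
        self._ready.set_result(None)
        # A real run() would enter its receive loop at this point.


async def demo():
    ok = Subscriber(FakeConsumer())
    await ok.start()
    bound_after_start = ok.consumer.bound

    try:
        await Subscriber(FakeConsumer(fail=True)).start()
        error = None
    except ConnectionError as e:
        error = str(e)
    return bound_after_start, error


result = asyncio.run(demo())
```

In this sketch a request published after `start()` returns cannot precede the queue binding, and a first-attempt failure surfaces to the caller of `start()`, which is where a higher-level retry could take over.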

16 commits
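
The chunker issue in commit dfb6d26a56 (tests and production agreeing on a `parameters` attribute that does not exist) can be reproduced with a small sketch. The `Flow` class below is a hypothetical minimal version that only supports lookup by call, the two `chunk_size_*` helpers are illustrative, and the default of 2000 is an arbitrary placeholder. Passing `spec=` to `MagicMock` is one possible guard against phantom attributes; the commit itself instead mocks the `__call__` side effect.

```python
from unittest.mock import MagicMock


class Flow:
    """Hypothetical minimal Flow: parameters are resolved by calling it."""

    def __init__(self, params):
        self._params = params

    def __call__(self, name):
        return self._params.get(name)


def chunk_size_buggy(flow, default=2000):
    # Reads an attribute that Flow does not have; the error is swallowed.
    try:
        return int(flow.parameters.get("chunk-size"))
    except Exception:
        return default


def chunk_size_fixed(flow, default=2000):
    # Uses the call interface, falling back to the default when unset.
    value = flow("chunk-size")
    return int(value) if value is not None else default


real = Flow({"chunk-size": 500})
buggy_on_real = chunk_size_buggy(real)   # configured value silently ignored
fixed_on_real = chunk_size_fixed(real)

# An unconstrained MagicMock creates `parameters` on demand, so the buggy
# code appears to work under test.
loose = MagicMock()
loose.parameters.get.return_value = 500
buggy_on_loose = chunk_size_buggy(loose)

# A mock constrained by spec=Flow has no phantom attribute and is
# configured through its call side effect.
strict = MagicMock(spec=Flow)
strict.side_effect = {"chunk-size": 500}.get
strict_has_parameters = hasattr(strict, "parameters")
fixed_on_strict = chunk_size_fixed(strict)
```

With the real `Flow`, the buggy helper falls back to the default while the fixed helper returns the configured value; the unconstrained mock hides the difference, and the spec-constrained mock does not.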