Two unrelated regressions surfaced after the v2.2 queue class
refactor. Bundled here because both are small and both block
production.
1. Request/response race against ephemeral RabbitMQ response
queues
Commit feeb92b3 switched response/notify queues to per-subscriber
auto-delete exclusive queues. That fixed orphaned-queue
accumulation but introduced a setup race: Subscriber.start()
created the run() task and returned immediately, while the
underlying RabbitMQ consumer only declared and bound its queue
lazily on the first receive() call. RequestResponse.request()
therefore published the request before any queue was bound to the
matching routing key, and the broker dropped the reply. Symptoms:
"Failed to fetch config on notify" / "Request timeout exception"
repeating roughly every 10s in api-gateway, document-embeddings
and any other service exercising the config notify path.
Fix:
* Add ensure_connected() to the BackendConsumer protocol;
implement it on RabbitMQBackendConsumer (calls _connect
synchronously, declaring and binding the queue) and as a
no-op on PulsarBackendConsumer (Pulsar's client.subscribe is
already synchronous at construction).
* Convert Subscriber's readiness signal from a non-existent
Event to an asyncio.Future created in start(). run() calls
consumer.ensure_connected() immediately after
create_consumer() and sets _ready.set_result(None) on first
successful bind. start() awaits the future via asyncio.wait
so it returns only once the consumer is fully bound. Any
reply published after start() returns is therefore guaranteed
to land in a bound queue.
* First-attempt connection failures call
_ready.set_exception(e) and exit run() so start() unblocks
with the error rather than hanging forever — the existing
higher-level retry pattern in fetch_and_apply_config takes
over from there. Runtime failures after a successful start
still go through the existing retry-with-backoff path.
* Update the two existing graceful-shutdown tests that
monkey-patch Subscriber.run with a custom coroutine to honor
the new contract by signalling _ready themselves.
* Add tests/unit/test_base/test_subscriber_readiness.py with
five regression tests pinning the readiness contract:
ensure_connected must be called before start() returns;
start() must block while ensure_connected runs
(race-condition guard with a threading.Event gate);
first-attempt create_consumer and ensure_connected failures
must propagate to start() instead of hanging;
ensure_connected must run before any receive() call.
2. Chunker Flow parameter lookup using the wrong attribute
trustgraph-base/trustgraph/base/chunking_service.py was reading
flow.parameters.get("chunk-size") and chunk-overlap, but the Flow
class has no `parameters` attribute — parameter lookup is exposed
through Flow.__call__ (flow("chunk-size") returns the resolved
value or None). The exception was caught and logged as a
WARNING, so chunking continued with the default sizes and any
configured chunk-size / chunk-overlap was silently ignored:
chunker - WARNING - Could not parse chunk-size parameter:
'Flow' object has no attribute 'parameters'
The chunker tests didn't catch this because they constructed
mock_flow = MagicMock() and configured
mock_flow.parameters.get.side_effect = ..., which is the same
phantom attribute MagicMock auto-creates on demand. Tests and
production agreed on the wrong API.
Fix: switch chunking_service.py to flow("chunk-size") /
flow("chunk-overlap"). Update both chunker test files to mock the
__call__ side_effect instead of the phantom parameters.get,
merging parameter values into the existing flow() lookup the
on_message tests already used for producer resolution.
Subscriber resilience: recreate consumer after connection failure
- Move consumer creation from Subscriber.start() into the run() loop,
matching the pattern used by Consumer. If the connection drops and the
consumer is closed in the finally block, the loop now recreates it on
the next iteration instead of spinning forever on a None consumer.
Consumer thread safety:
- Dedicated ThreadPoolExecutor per consumer so all pika operations
(create, receive, acknowledge, negative_acknowledge) run on the
same thread — pika BlockingConnection is not thread-safe
- Applies to both Consumer and Subscriber classes
Config handler type audit — fix four mismatched type registrations:
- librarian: was ["librarian"] (non-existent type), now ["flow",
"active-flow"] (matches config["flow"] that the handler reads)
- cores/service: was ["kg-core"], now ["flow"] (reads
config["flow"])
- metering/counter: was ["token-costs"], now ["token-cost"]
(singular)
- agent/mcp_tool: was ["mcp-tool"], now ["mcp"] (reads
config["mcp"])
Update tests
Replace the config push mechanism that broadcast the full config
blob on a 'state' class pub/sub queue with a lightweight notify
signal containing only the version number and affected config
types. Processors fetch the full config via request/response from
the config service when notified.
This eliminates the need for the pub/sub 'state' queue class and
stateful pub/sub services entirely. The config push queue moves
from 'state' to 'flow' class — a simple transient signal rather
than a retained message. This solves the RabbitMQ
late-subscriber problem where restarting processes never received
the current config because their fresh queue had no historical
messages.
Key changes:
- ConfigPush schema: config dict replaced with types list
- Subscribe-then-fetch startup with retry: processors subscribe
to notify queue, fetch config via request/response, then
process buffered notifies with version comparison to avoid race
conditions
- register_config_handler() accepts optional types parameter so
handlers only fire when their config types change
- Short-lived config request/response clients to avoid subscriber
contention on non-persistent response topics
- Config service passes affected types through put/delete/flow
operations
- Gateway ConfigReceiver rewritten with same notify pattern and
retry loop
Tests updated
New tests:
- register_config_handler: without types, with types, multiple
types, multiple handlers
- on_config_notify: old/same version skipped, irrelevant types
skipped (version still updated), relevant type triggers fetch,
handler without types always called, mixed handler filtering,
empty types invokes all, fetch failure handled gracefully
- fetch_config: returns config+version, raises on error response,
stops client even on exception
- fetch_and_apply_config: applies to all handlers on startup,
retries on failure
* Tech spec
* Address multi-tenant queue option problems in CLI
* Modified collection service to use config
* Changed storage management to use the config service definition
* Tech spec for graceful shutdown
* Graceful shutdown of importers/exporters
* Update socket to include graceful shutdown orchestration
* Adding tests for conditions tracked in this PR