Fix RabbitMQ request/response race and chunker Flow API drift (#779)

* Fix Metadata/EntityEmbeddings schema migration tail and add regression tests (#776)

The Metadata dataclass dropped its `metadata: list[Triple]` field
and EntityEmbeddings/ChunkEmbeddings settled on a singular
`vector: list[float]` field, but several call sites kept passing
`Metadata(metadata=...)` and `EntityEmbeddings(vectors=...)`. The
bugs were latent until a websocket client first hit
`/api/v1/flow/default/import/entity-contexts`, at which point the
dispatcher TypeError'd on construction.

Production fixes (5 call sites on the same migration tail):

  * trustgraph-flow gateway dispatchers entity_contexts_import.py
    and graph_embeddings_import.py — drop the stale
    Metadata(metadata=...)  kwarg; switch graph_embeddings_import
    to the singular `vector` wire key.
  * trustgraph-base messaging translators knowledge.py and
    document_loading.py — fix decode side to read the singular
    `"vector"` key, matching what their own encode sides have
    always written.
  * trustgraph-flow tables/knowledge.py — fix Cassandra row
    deserialiser to construct EntityEmbeddings(vector=...)
    instead of vectors=.
  * trustgraph-flow gateway core_import/core_export — switch the
    kg-core msgpack wire format to the singular `"v"`/`"vector"`
    key and drop the dead `m["m"]` envelope field that referenced
    the removed Metadata.metadata triples list (it was a
    guaranteed KeyError on the export side).

Defense-in-depth regression coverage (32 new tests across 7 files):

  * tests/contract/test_schema_field_contracts.py — pin the field
    set of Metadata, EntityEmbeddings, ChunkEmbeddings,
    EntityContext so any future schema rename fails CI loudly
    with a clear diff.
  * tests/unit/test_translators/test_knowledge_translator_roundtrip.py
    and test_document_embeddings_translator_roundtrip.py -
    encode→decode round-trip the affected translators end to end,
    locking in the singular `"vector"` wire key.
  * tests/unit/test_gateway/test_entity_contexts_import_dispatcher.py
    and test_graph_embeddings_import_dispatcher.py — exercise the
    websocket dispatchers' receive() path with realistic
    payloads, the direct regression test for the original
    production crash.
  * tests/unit/test_gateway/test_core_import_export_roundtrip.py
    — pack/unpack the kg-core msgpack format through the real
    dispatcher classes (with KnowledgeRequestor mocked),
    including a full export→import round-trip.
  * tests/unit/test_tables/test_knowledge_table_store.py —
    exercise the Cassandra row → schema conversion via __new__ to
    bypass the live cluster connection.

Also fixes an unrelated leaked-coroutine RuntimeWarning in
test_gateway/test_service.py::test_run_method_calls_web_run_app: the
mocked aiohttp.web.run_app now closes the coroutine that Api.run() hands
it, mirroring what the real run_app would do, instead of leaving it for
the GC to complain about.

* Fix RabbitMQ request/response race and chunker Flow API drift

Two unrelated regressions surfaced after the v2.2 queue class
refactor.  Bundled here because both are small and both block
production.

1. Request/response race against ephemeral RabbitMQ response
queues

Commit feeb92b3 switched response/notify queues to per-subscriber
auto-delete exclusive queues. That fixed orphaned-queue
accumulation but introduced a setup race: Subscriber.start()
created the run() task and returned immediately, while the
underlying RabbitMQ consumer only declared and bound its queue
lazily on the first receive() call.  RequestResponse.request()
therefore published the request before any queue was bound to the
matching routing key, and the broker dropped the reply. Symptoms:
"Failed to fetch config on notify" / "Request timeout exception"
repeating roughly every 10s in api-gateway, document-embeddings
and any other service exercising the config notify path.

Fix:

  * Add ensure_connected() to the BackendConsumer protocol;
    implement it on RabbitMQBackendConsumer (calls _connect
    synchronously, declaring and binding the queue) and as a
    no-op on PulsarBackendConsumer (Pulsar's client.subscribe is
    already synchronous at construction).

  * Convert Subscriber's readiness signal from a non-existent
    Event to an asyncio.Future created in start(). run() calls
    consumer.ensure_connected() immediately after
    create_consumer() and sets _ready.set_result(None) on first
    successful bind. start() awaits the future via asyncio.wait
    so it returns only once the consumer is fully bound. Any
    reply published after start() returns is therefore guaranteed
    to land in a bound queue.

  * First-attempt connection failures call
    _ready.set_exception(e) and exit run() so start() unblocks
    with the error rather than hanging forever — the existing
    higher-level retry pattern in fetch_and_apply_config takes
    over from there. Runtime failures after a successful start
    still go through the existing retry-with-backoff path.

  * Update the two existing graceful-shutdown tests that
    monkey-patch Subscriber.run with a custom coroutine to honor
    the new contract by signalling _ready themselves.

  * Add tests/unit/test_base/test_subscriber_readiness.py with
    five regression tests pinning the readiness contract:
    ensure_connected must be called before start() returns;
    start() must block while ensure_connected runs
    (race-condition guard with a threading.Event gate);
    first-attempt create_consumer and ensure_connected failures
    must propagate to start() instead of hanging;
    ensure_connected must run before any receive() call.

2. Chunker Flow parameter lookup using the wrong attribute

trustgraph-base/trustgraph/base/chunking_service.py was reading
flow.parameters.get("chunk-size") and chunk-overlap, but the Flow
class has no `parameters` attribute — parameter lookup is exposed
through Flow.__call__ (flow("chunk-size") returns the resolved
value or None).  The exception was caught and logged as a
WARNING, so chunking continued with the default sizes and any
configured chunk-size / chunk-overlap was silently ignored:

    chunker - WARNING - Could not parse chunk-size parameter:
    'Flow' object has no attribute 'parameters'

The chunker tests didn't catch this because they constructed
mock_flow = MagicMock() and configured
mock_flow.parameters.get.side_effect = ..., which is the same
phantom attribute MagicMock auto-creates on demand. Tests and
production agreed on the wrong API.

Fix: switch chunking_service.py to flow("chunk-size") /
flow("chunk-overlap"). Update both chunker test files to mock the
__call__ side_effect instead of the phantom parameters.get,
merging parameter values into the existing flow() lookup the
on_message tests already used for producer resolution.
This commit is contained in:
cybermaggedon 2026-04-11 01:29:38 +01:00 committed by GitHub
parent c23e28aa66
commit ffe310af7c
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
9 changed files with 329 additions and 24 deletions

View file

@ -70,11 +70,12 @@ class TestRecursiveChunkerSimple(IsolatedAsyncioTestCase):
# Mock message and flow
mock_message = MagicMock()
mock_consumer = MagicMock()
# Flow exposes parameter lookup via __call__: flow("chunk-size")
mock_flow = MagicMock()
mock_flow.parameters.get.side_effect = lambda param: {
mock_flow.side_effect = lambda key: {
"chunk-size": 2000, # Override chunk size
"chunk-overlap": None # Use default chunk overlap
}.get(param)
}.get(key)
# Act
chunk_size, chunk_overlap = await processor.chunk_document(
@ -105,10 +106,10 @@ class TestRecursiveChunkerSimple(IsolatedAsyncioTestCase):
mock_message = MagicMock()
mock_consumer = MagicMock()
mock_flow = MagicMock()
mock_flow.parameters.get.side_effect = lambda param: {
mock_flow.side_effect = lambda key: {
"chunk-size": None, # Use default chunk size
"chunk-overlap": 200 # Override chunk overlap
}.get(param)
}.get(key)
# Act
chunk_size, chunk_overlap = await processor.chunk_document(
@ -139,10 +140,10 @@ class TestRecursiveChunkerSimple(IsolatedAsyncioTestCase):
mock_message = MagicMock()
mock_consumer = MagicMock()
mock_flow = MagicMock()
mock_flow.parameters.get.side_effect = lambda param: {
mock_flow.side_effect = lambda key: {
"chunk-size": 1500, # Override chunk size
"chunk-overlap": 150 # Override chunk overlap
}.get(param)
}.get(key)
# Act
chunk_size, chunk_overlap = await processor.chunk_document(
@ -195,15 +196,15 @@ class TestRecursiveChunkerSimple(IsolatedAsyncioTestCase):
mock_consumer = MagicMock()
mock_producer = AsyncMock()
mock_triples_producer = AsyncMock()
# Flow.__call__ resolves parameters and producers/consumers from the
# same dict — merge both kinds here.
mock_flow = MagicMock()
mock_flow.parameters.get.side_effect = lambda param: {
mock_flow.side_effect = lambda key: {
"chunk-size": 1500,
"chunk-overlap": 150,
}.get(param)
mock_flow.side_effect = lambda name: {
"output": mock_producer,
"triples": mock_triples_producer,
}.get(name)
}.get(key)
# Act
await processor.on_message(mock_message, mock_consumer, mock_flow)
@ -241,7 +242,7 @@ class TestRecursiveChunkerSimple(IsolatedAsyncioTestCase):
mock_message = MagicMock()
mock_consumer = MagicMock()
mock_flow = MagicMock()
mock_flow.parameters.get.return_value = None # No overrides
mock_flow.side_effect = lambda key: None # No overrides
# Act
chunk_size, chunk_overlap = await processor.chunk_document(