trustgraph/tests/unit/test_base/test_subscriber_readiness.py


Fix RabbitMQ request/response race and chunker Flow API drift (#779)

* Fix Metadata/EntityEmbeddings schema migration tail and add regression tests (#776)

The Metadata dataclass dropped its `metadata: list[Triple]` field and
EntityEmbeddings/ChunkEmbeddings settled on a singular `vector: list[float]`
field, but several call sites kept passing `Metadata(metadata=...)` and
`EntityEmbeddings(vectors=...)`. The bugs were latent until a websocket client
first hit `/api/v1/flow/default/import/entity-contexts`, at which point the
dispatcher TypeError'd on construction.

Production fixes (5 call sites on the same migration tail):

* trustgraph-flow gateway dispatchers entity_contexts_import.py and
  graph_embeddings_import.py — drop the stale Metadata(metadata=...) kwarg;
  switch graph_embeddings_import to the singular `vector` wire key.
* trustgraph-base messaging translators knowledge.py and document_loading.py —
  fix the decode side to read the singular `"vector"` key, matching what their
  own encode sides have always written.
* trustgraph-flow tables/knowledge.py — fix the Cassandra row deserialiser to
  construct EntityEmbeddings(vector=...) instead of vectors=.
* trustgraph-flow gateway core_import/core_export — switch the kg-core msgpack
  wire format to the singular `"v"`/`"vector"` key and drop the dead `m["m"]`
  envelope field that referenced the removed Metadata.metadata triples list
  (it was a guaranteed KeyError on the export side).

Defense-in-depth regression coverage (32 new tests across 7 files):

* tests/contract/test_schema_field_contracts.py — pin the field set of
  Metadata, EntityEmbeddings, ChunkEmbeddings, EntityContext so any future
  schema rename fails CI loudly with a clear diff.
* tests/unit/test_translators/test_knowledge_translator_roundtrip.py and
  test_document_embeddings_translator_roundtrip.py — encode→decode round-trip
  the affected translators end to end, locking in the singular `"vector"`
  wire key.
* tests/unit/test_gateway/test_entity_contexts_import_dispatcher.py and
  test_graph_embeddings_import_dispatcher.py — exercise the websocket
  dispatchers' receive() path with realistic payloads, the direct regression
  test for the original production crash.
* tests/unit/test_gateway/test_core_import_export_roundtrip.py — pack/unpack
  the kg-core msgpack format through the real dispatcher classes (with
  KnowledgeRequestor mocked), including a full export→import round-trip.
* tests/unit/test_tables/test_knowledge_table_store.py — exercise the
  Cassandra row → schema conversion via __new__ to bypass the live cluster
  connection.

Also fixes an unrelated leaked-coroutine RuntimeWarning in
test_gateway/test_service.py::test_run_method_calls_web_run_app: the mocked
aiohttp.web.run_app now closes the coroutine that Api.run() hands it,
mirroring what the real run_app would do, instead of leaving it for the GC
to complain about.

* Fix RabbitMQ request/response race and chunker Flow API drift

Two unrelated regressions surfaced after the v2.2 queue class refactor.
Bundled here because both are small and both block production.

1. Request/response race against ephemeral RabbitMQ response queues

Commit feeb92b3 switched response/notify queues to per-subscriber auto-delete
exclusive queues. That fixed orphaned-queue accumulation but introduced a
setup race: Subscriber.start() created the run() task and returned
immediately, while the underlying RabbitMQ consumer only declared and bound
its queue lazily on the first receive() call. RequestResponse.request()
therefore published the request before any queue was bound to the matching
routing key, and the broker dropped the reply.

Symptoms: "Failed to fetch config on notify" / "Request timeout exception"
repeating roughly every 10s in api-gateway, document-embeddings and any other
service exercising the config notify path.

Fix:

* Add ensure_connected() to the BackendConsumer protocol; implement it on
  RabbitMQBackendConsumer (calls _connect synchronously, declaring and
  binding the queue) and as a no-op on PulsarBackendConsumer (Pulsar's
  client.subscribe is already synchronous at construction).
* Convert Subscriber's readiness signal from a non-existent Event to an
  asyncio.Future created in start(). run() calls consumer.ensure_connected()
  immediately after create_consumer() and sets _ready.set_result(None) on
  first successful bind. start() awaits the future via asyncio.wait so it
  returns only once the consumer is fully bound. Any reply published after
  start() returns is therefore guaranteed to land in a bound queue.
* First-attempt connection failures call _ready.set_exception(e) and exit
  run() so start() unblocks with the error rather than hanging forever — the
  existing higher-level retry pattern in fetch_and_apply_config takes over
  from there. Runtime failures after a successful start still go through the
  existing retry-with-backoff path.
* Update the two existing graceful-shutdown tests that monkey-patch
  Subscriber.run with a custom coroutine to honor the new contract by
  signalling _ready themselves.
* Add tests/unit/test_base/test_subscriber_readiness.py with five regression
  tests pinning the readiness contract: ensure_connected must be called
  before start() returns; start() must block while ensure_connected runs
  (race-condition guard with a threading.Event gate); first-attempt
  create_consumer and ensure_connected failures must propagate to start()
  instead of hanging; ensure_connected must run before any receive() call.

2. Chunker Flow parameter lookup using the wrong attribute

trustgraph-base/trustgraph/base/chunking_service.py was reading
flow.parameters.get("chunk-size") and chunk-overlap, but the Flow class has
no `parameters` attribute — parameter lookup is exposed through Flow.__call__
(flow("chunk-size") returns the resolved value or None).

The exception was caught and logged as a WARNING, so chunking continued with
the default sizes and any configured chunk-size / chunk-overlap was silently
ignored:

    chunker - WARNING - Could not parse chunk-size parameter: 'Flow' object has no attribute 'parameters'

The chunker tests didn't catch this because they constructed
mock_flow = MagicMock() and configured
mock_flow.parameters.get.side_effect = ..., which is the same phantom
attribute MagicMock auto-creates on demand. Tests and production agreed on
the wrong API.

Fix: switch chunking_service.py to flow("chunk-size") / flow("chunk-overlap").
Update both chunker test files to mock the __call__ side_effect instead of
the phantom parameters.get, merging parameter values into the existing flow()
lookup the on_message tests already used for producer resolution.
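The MagicMock phantom-attribute pitfall described above can be reproduced in a few lines. This is a minimal sketch: the `Flow` class here is a hypothetical stand-in for the real trustgraph class, reduced to the one behaviour that matters (lookup via `__call__`, no `parameters` attribute).

```python
from unittest.mock import MagicMock

# Hypothetical stand-in for the real Flow: parameter lookup goes through
# __call__, and there is no `parameters` attribute.
class Flow:
    def __init__(self, params):
        self._params = params

    def __call__(self, key):
        return self._params.get(key)

# A bare MagicMock auto-creates the phantom `parameters` attribute on
# demand, so a test written against the wrong API still passes:
mock_flow = MagicMock()
mock_flow.parameters.get.return_value = 2000
assert mock_flow.parameters.get("chunk-size") == 2000  # wrong API "works"

# Constraining the mock with spec= makes the same mistake fail loudly:
strict_flow = MagicMock(spec=Flow)
raised = False
try:
    strict_flow.parameters
except AttributeError:
    raised = True
assert raised, "spec'd mock should reject the phantom attribute"

# The corrected lookup, as the fixed code and tests use it:
flow = Flow({"chunk-size": 2000})
assert flow("chunk-size") == 2000
assert flow("chunk-overlap") is None
```

Specced mocks (`spec=` or `create_autospec`) are the general guard against tests and production agreeing on a nonexistent API.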
2026-04-11 01:29:38 +01:00
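The readiness barrier exercised by the tests below can be sketched as a minimal, self-contained simplification. Everything here is illustrative, not the real trustgraph Subscriber: the class names, the fake backend, and the use of `asyncio.to_thread` for the synchronous bind are assumptions made for the sketch.

```python
import asyncio

class MiniSubscriber:
    """Illustrative reduction of the fixed Subscriber: start() returns only
    once the consumer is bound, or re-raises a first-attempt failure."""

    def __init__(self, backend):
        self.backend = backend
        self._ready = None
        self._task = None

    async def start(self):
        self._ready = asyncio.get_running_loop().create_future()
        self._task = asyncio.create_task(self.run())
        # Readiness barrier: block until run() has bound the consumer
        # (set_result) or failed on the first attempt (set_exception).
        await self._ready

    async def run(self):
        try:
            consumer = self.backend.create_consumer()
            # Eagerly declare/bind, off the event loop, before signalling.
            await asyncio.to_thread(consumer.ensure_connected)
        except Exception as e:
            self._ready.set_exception(e)
            return
        self._ready.set_result(None)
        # ... the receive loop would start here ...

class FakeConsumer:
    def __init__(self):
        self.connected = False

    def ensure_connected(self):
        self.connected = True

class FakeBackend:
    def __init__(self):
        self.consumer = FakeConsumer()

    def create_consumer(self):
        return self.consumer

backend = FakeBackend()
asyncio.run(MiniSubscriber(backend).start())
# By the time start() has returned, the consumer is declared and bound.
assert backend.consumer.connected
```

On the failure path the same future carries the error via `set_exception`, which is why a caller awaiting `start()` observes the exception instead of hanging.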
"""
Regression tests for Subscriber.start() readiness barrier.
Background: prior to the eager-connect fix, Subscriber.start() created
the run() task and returned immediately. The underlying backend consumer
was lazily connected on its first receive() call, which left a setup
race for request/response clients using ephemeral per-subscriber response
queues (RabbitMQ auto-delete exclusive queues): the request would be
published before the response queue was bound, and the broker would
silently drop the reply. fetch_config(), document-embeddings, and
api-gateway all hit this with "Failed to fetch config on notify" /
"Request timeout exception" symptoms.
These tests pin the readiness contract:
await subscriber.start()
# at this point, consumer.ensure_connected() MUST have run
so that any future change which removes the eager bind, or moves it
back to lazy initialisation, fails CI loudly.
"""
import asyncio
import pytest
from unittest.mock import MagicMock
from trustgraph.base.subscriber import Subscriber


def _make_backend(ensure_connected_side_effect=None,
                  receive_side_effect=None):
    """Build a fake backend whose consumer records ensure_connected /
    receive calls. ensure_connected_side_effect lets a test inject a
    delay or exception."""
    backend = MagicMock()
    consumer = MagicMock()
    consumer.ensure_connected = MagicMock(
        side_effect=ensure_connected_side_effect,
    )
    # By default receive raises a timeout-style exception that the
    # subscriber loop is supposed to swallow as a "no message yet" — this
    # keeps the subscriber idling cleanly while the test inspects state.
    if receive_side_effect is None:
        receive_side_effect = TimeoutError("No message received within timeout")
    consumer.receive = MagicMock(side_effect=receive_side_effect)
    consumer.acknowledge = MagicMock()
    consumer.negative_acknowledge = MagicMock()
    consumer.pause_message_listener = MagicMock()
    consumer.unsubscribe = MagicMock()
    consumer.close = MagicMock()
    backend.create_consumer.return_value = consumer
    return backend, consumer


def _make_subscriber(backend):
    return Subscriber(
        backend=backend,
        topic="response:tg:config",
        subscription="test-sub",
        consumer_name="test-consumer",
        schema=dict,
        max_size=10,
        drain_timeout=1.0,
        backpressure_strategy="block",
    )


class TestSubscriberReadiness:

    @pytest.mark.asyncio
    async def test_start_calls_ensure_connected_before_returning(self):
        """The barrier: ensure_connected must have been invoked at least
        once by the time start() returns."""
        backend, consumer = _make_backend()
        subscriber = _make_subscriber(backend)
        await subscriber.start()
        try:
            consumer.ensure_connected.assert_called_once()
        finally:
            await subscriber.stop()

    @pytest.mark.asyncio
    async def test_start_blocks_until_ensure_connected_completes(self):
        """If ensure_connected is slow, start() must wait for it. This is
        the actual race-condition guard: it would have failed against the
        buggy version, where start() returned before run() had even
        scheduled the consumer creation."""
        # ensure_connected runs off the event loop (in an executor
        # thread), so both gates must be threading.Events, not
        # asyncio.Events; the loop side observes them via to_thread.
        import threading
        connect_started = threading.Event()
        gate = threading.Event()

        def slow_connect():
            connect_started.set()
            gate.wait(timeout=2.0)

        backend, consumer = _make_backend(
            ensure_connected_side_effect=slow_connect,
        )
        subscriber = _make_subscriber(backend)
        start_task = asyncio.create_task(subscriber.start())
        # Wait until ensure_connected has begun executing.
        started = await asyncio.to_thread(connect_started.wait, 2.0)
        assert started, "ensure_connected() never started executing"
        # ensure_connected is in flight — start() must NOT have returned.
        assert not start_task.done(), (
            "start() returned before ensure_connected() completed — "
            "the readiness barrier is broken and the request/response "
            "race condition is back."
        )
        # Release the gate; start() should now complete promptly.
        gate.set()
        await asyncio.wait_for(start_task, timeout=2.0)
        consumer.ensure_connected.assert_called_once()
        await subscriber.stop()

    @pytest.mark.asyncio
    async def test_start_propagates_consumer_creation_failure(self):
        """If create_consumer() raises, start() must surface the error
        rather than hang on the readiness future. The old code path
        retried indefinitely inside run() and never let start() unblock."""
        backend = MagicMock()
        backend.create_consumer.side_effect = RuntimeError("broker down")
        subscriber = _make_subscriber(backend)
        with pytest.raises(RuntimeError, match="broker down"):
            await asyncio.wait_for(subscriber.start(), timeout=2.0)

    @pytest.mark.asyncio
    async def test_start_propagates_ensure_connected_failure(self):
        """Same contract for an ensure_connected() that raises (e.g. the
        broker is up but the queue declare/bind fails)."""
        backend, consumer = _make_backend(
            ensure_connected_side_effect=RuntimeError("queue declare failed"),
        )
        subscriber = _make_subscriber(backend)
        with pytest.raises(RuntimeError, match="queue declare failed"):
            await asyncio.wait_for(subscriber.start(), timeout=2.0)

    @pytest.mark.asyncio
    async def test_ensure_connected_runs_before_subscriber_running_log(self):
        """Subtle ordering: ensure_connected MUST happen before the
        receive loop, so that any reply is captured. We assert this by
        checking ensure_connected was called before any receive call."""
        call_order = []

        def record_ensure():
            call_order.append("ensure_connected")

        def record_receive(*args, **kwargs):
            call_order.append("receive")
            raise TimeoutError("No message received within timeout")

        backend, consumer = _make_backend(
            ensure_connected_side_effect=record_ensure,
            receive_side_effect=record_receive,
        )
        subscriber = _make_subscriber(backend)
        await subscriber.start()
        # Give the receive loop a tick to run at least once.
        await asyncio.sleep(0.05)
        await subscriber.stop()
        # ensure_connected must come first; receive may not have happened
        # yet on a slow machine, but if it did, it must come after.
        assert call_order, "neither ensure_connected nor receive was called"
        assert call_order[0] == "ensure_connected"