Replace the config push mechanism that broadcast the full config blob on a 'state' class pub/sub queue with a lightweight notify signal containing only the version number and affected config types. Processors fetch the full config via request/response from the config service when notified. This eliminates the need for the pub/sub 'state' queue class and stateful pub/sub services entirely. The config push queue moves from 'state' to 'flow' class: a simple transient signal rather than a retained message. This solves the RabbitMQ late-subscriber problem where restarting processes never received the current config because their fresh queue had no historical messages.

Key changes:
- ConfigPush schema: config dict replaced with types list
- Subscribe-then-fetch startup with retry: processors subscribe to the notify queue, fetch config via request/response, then process buffered notifies with version comparison to avoid race conditions
- register_config_handler() accepts an optional types parameter so handlers only fire when their config types change
- Short-lived config request/response clients to avoid subscriber contention on non-persistent response topics
- Config service passes affected types through put/delete/flow operations
- Gateway ConfigReceiver rewritten with the same notify pattern and retry loop

Tests updated. New tests:
- register_config_handler: without types, with types, multiple types, multiple handlers
- on_config_notify: old/same version skipped, irrelevant types skipped (version still updated), relevant type triggers fetch, handler without types always called, mixed handler filtering, empty types invokes all, fetch failure handled gracefully
- fetch_config: returns config+version, raises on error response, stops client even on exception
- fetch_and_apply_config: applies to all handlers on startup, retries on failure
Config Push "Notify" Pattern Technical Specification
Overview
Replace the current config push mechanism — which broadcasts the full config
blob on a state class queue — with a lightweight "notify" message
containing only the version number and the affected config types. Processors
that care about those types fetch the full config via the existing
request/response interface.
This solves the RabbitMQ late-subscriber problem: when a process restarts, its fresh queue has no historical messages, so it never receives the current config state. With the notify pattern, the push queue is only a signal — the source of truth is the config service's request/response API, which is always available.
Problem
On Pulsar, state class queues are persistent topics. A new subscriber
with InitialPosition.Earliest reads from message 0 and receives the
last config push. On RabbitMQ, each subscriber gets a fresh per-subscriber
queue (named with a new UUID). Messages published before the queue existed
are gone. A restarting processor never gets the current config.
Design
The Notify Message
The ConfigPush schema changes from carrying the full config to carrying
just a version number and the list of affected config types:
@dataclass
class ConfigPush:
    version: int = 0
    types: list[str] = field(default_factory=list)
When the config service handles a put or delete, it knows which types
were affected (from the request's values[].type or keys[].type). It
includes those in the notify. On startup, the config service sends a notify
with an empty types list (meaning "everything").
Subscribe-then-Fetch Startup (No Race Condition)
The critical ordering to avoid missing an update:
- Subscribe to the config push queue. Buffer incoming notify messages.
- Fetch the full config via request/response (operation: "config"). This returns the config dict and a version number.
- Apply the fetched config to all registered handlers.
- Process buffered notifies. For any notify with version > fetched_version, re-fetch and re-apply. Discard notifies with version <= fetched_version.
- Enter steady state. Process future notifies as they arrive.
This is safe because:
- If an update happens before the subscription, the fetch picks it up.
- If an update happens between subscribe and fetch, it's in the buffer.
- If an update happens after the fetch, it arrives on the queue normally.
- Version comparison ensures no duplicate processing.
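The drain-with-version-comparison step can be exercised in isolation. The sketch below uses an asyncio.Queue as the notify buffer and plain callables (fetch_config, apply_config) as stand-ins for the real request/response client and handler dispatch; all names here are illustrative, not the codebase API:

```python
import asyncio

async def subscribe_then_fetch(buffer, fetch_config, apply_config):
    """Fetch current config, then drain notifies buffered while fetching."""
    # The subscription (represented here by `buffer`) is already live,
    # so any update published during the fetch is sitting in the buffer.
    config, version = await fetch_config()
    await apply_config(config, version)

    # Drain buffered notifies with version comparison: only a notify
    # newer than the fetched version triggers a re-fetch.
    while not buffer.empty():
        notify = buffer.get_nowait()
        if notify["version"] > version:
            config, version = await fetch_config()
            await apply_config(config, version)
    return version
```

A stale notify (version 1) buffered alongside a newer one (version 3) causes exactly one re-fetch, which is the property the bullet list above argues for.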
Processor API
The current API invokes every handler on every config change and requires processors to understand the full config dict structure. The new API is cleaner: processors declare which config types they care about, and their handlers are only invoked when those types change.
Current API
# In processor __init__:
self.register_config_handler(self.on_configure_flows)

# Handler receives the entire config dict:
async def on_configure_flows(self, config, version):
    if "active-flow" not in config:
        return
    if self.id in config["active-flow"]:
        flow_config = json.loads(config["active-flow"][self.id])
        # ...
New API
# In processor __init__:
self.register_config_handler(
    handler=self.on_configure_flows,
    types=["active-flow"],
)

# Same signature; config still contains the full dict, but the
# handler is only called when the "active-flow" type changes
# (or on startup):
async def on_configure_flows(self, config, version):
    if "active-flow" not in config:
        return
    # ...
The types parameter is optional. If omitted, the handler is called for
every config change (backward compatible). If specified, the handler is
only invoked when the notify's types list intersects with the handler's
types, or on startup (empty types list = everything).
Internal Registration Structure
# In AsyncProcessor:
# In AsyncProcessor:
def register_config_handler(self, handler, types=None):
    self.config_handlers.append({
        "handler": handler,
        "types": set(types) if types else None,  # None = all types
    })
Notify Processing Logic
async def on_config_notify(self, message, consumer, flow):
    notify_version = message.value().version
    notify_types = set(message.value().types)

    # Skip if we already have this version or newer
    if notify_version <= self.config_version:
        return

    # Fetch full config from config service
    config, version = await self.config_client.config()
    self.config_version = version

    # Determine which handlers to invoke
    for entry in self.config_handlers:
        handler_types = entry["types"]
        if handler_types is None:
            # Handler cares about everything
            await entry["handler"](config, version)
        elif not notify_types or notify_types & handler_types:
            # notify_types empty = startup (invoke all),
            # or intersection with handler's types
            await entry["handler"](config, version)
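The handler-selection rule can be factored out as a pure function for illustration. This is a sketch of the same rule, not the codebase's actual dispatch loop:

```python
def handlers_to_invoke(handlers, notify_types):
    """Return the handlers that should fire for a notify.

    handlers: list of {"handler": ..., "types": set | None}
    notify_types: set of affected types; empty set means "everything".
    """
    selected = []
    for entry in handlers:
        if entry["types"] is None:
            # Handler registered without types: fires on every change.
            selected.append(entry["handler"])
        elif not notify_types:
            # Empty types list = startup / everything changed.
            selected.append(entry["handler"])
        elif notify_types & entry["types"]:
            # Non-empty intersection with the handler's declared types.
            selected.append(entry["handler"])
    return selected
```

Separating the rule this way makes the filtering cases in the test list (irrelevant types skipped, empty types invokes all, mixed handler filtering) easy to check independently of the fetch.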
Config Service Changes
Push Method
The push() method changes to send only version + types:
async def push(self, types=None):
    version = await self.config.get_version()
    resp = ConfigPush(
        version=version,
        types=types or [],
    )
    await self.config_push_producer.send(resp)
Put/Delete Handlers
Extract affected types and pass to push:
async def handle_put(self, v):
    types = list(set(k.type for k in v.values))
    for k in v.values:
        await self.table_store.put_config(k.type, k.key, k.value)
    await self.inc_version()
    await self.push(types=types)

async def handle_delete(self, v):
    types = list(set(k.type for k in v.keys))
    for k in v.keys:
        await self.table_store.delete_key(k.type, k.key)
    await self.inc_version()
    await self.push(types=types)
Queue Class Change
The config push queue changes from state class to flow class. The push
is now a transient signal — the source of truth is the config service's
request/response API, not the queue. flow class is persistent (survives
broker restarts) but doesn't require last-message retention, which was the
root cause of the RabbitMQ problem.
config_push_queue = queue('config', cls='flow') # was cls='state'
Startup Push
On startup, the config service sends a notify with empty types list (signalling "everything changed"):
async def start(self):
    await self.push(types=[])  # Empty = all types
    await self.config_request_consumer.start()
AsyncProcessor Changes
The AsyncProcessor needs a config request/response client alongside the
push consumer. The startup sequence becomes:
async def start(self):
    # 1. Start the push consumer (begins buffering notifies)
    await self.config_sub_task.start()

    # 2. Fetch current config via request/response
    config, version = await self.config_client.config()
    self.config_version = version

    # 3. Apply to all handlers (startup = all handlers invoked)
    for entry in self.config_handlers:
        await entry["handler"](config, version)

    # 4. Buffered notifies are now processed by on_config_notify,
    #    which skips versions <= self.config_version
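The change summary also calls for retrying the startup fetch on failure. A minimal backoff loop might look like the following; the retry count and delay are assumptions for this sketch, not values from the spec:

```python
import asyncio

async def fetch_with_retry(fetch, retries=5, delay=0.1):
    """Retry the startup config fetch with exponential backoff, so a
    briefly-unavailable config service doesn't kill the processor."""
    for attempt in range(retries):
        try:
            return await fetch()
        except Exception:
            if attempt == retries - 1:
                # Out of attempts: surface the failure to the caller.
                raise
            await asyncio.sleep(delay * (2 ** attempt))
```

Step 2 of the startup sequence would call this wrapper instead of invoking the client directly.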
The config client needs to be created in __init__ using the existing
request/response queue infrastructure. The ConfigClient from
trustgraph.clients.config_client already exists but uses a synchronous
blocking pattern. An async variant or integration with the processor's
pub/sub backend is needed.
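The change summary mentions short-lived request/response clients to avoid subscriber contention on non-persistent response topics. The lifecycle can be sketched as start, one request, guaranteed stop; the backend interface here is invented for illustration and is not the existing ConfigClient API:

```python
import asyncio

class ShortLivedConfigClient:
    """Open a response subscription, make one request, tear down."""

    def __init__(self, backend):
        self.backend = backend

    async def fetch(self):
        await self.backend.start()
        try:
            return await self.backend.request({"operation": "config"})
        finally:
            # Stop even on exception, so the non-persistent response
            # topic is never left with a dangling subscriber.
            await self.backend.stop()
```

This matches the test requirement that the client is stopped even when the request raises.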
Existing Config Handler Types
For reference, the config types currently used by handlers:
| Handler | Type(s) | Used By |
|---|---|---|
| on_configure_flows | active-flow | All FlowProcessor subclasses |
| on_collection_config | collection | Storage services (triples, embeddings, rows) |
| on_prompt_config | prompt | Prompt template service, agent extract |
| on_schema_config | schema | Rows storage, row embeddings, NLP query, structured diag |
| on_cost_config | token-costs | Metering service |
| on_ontology_config | ontology | Ontology extraction |
| on_librarian_config | librarian | Librarian service |
| on_mcp_config | mcp-tool | MCP tool service |
| on_knowledge_config | kg-core | Cores service |
Implementation Order
- Update ConfigPush schema — change the config field to a types field.
- Update config service — modify push() to send version + types. Modify handle_put / handle_delete to extract affected types.
- Add async config query to AsyncProcessor — create a request/response client for config queries within the processor's event loop.
- Implement subscribe-then-fetch startup — reorder AsyncProcessor.start() to subscribe first, then fetch, then process buffered notifies with version comparison.
- Update register_config_handler — add the optional types parameter. Update on_config_notify to filter by type intersection.
- Update existing handlers — add the types parameter to all register_config_handler calls across the codebase.
- Backward compatibility — handlers without a types parameter continue to work (invoked for all changes).
Risks
- Thundering herd: if many processors restart simultaneously, they all hit the config service API at once. Mitigated by the config service already being designed for request/response load, and by the number of processors being small (tens, not thousands).
- Config service availability: processors now depend on the config service being up at startup, not just on having received a push. This is already the case in practice — without config, processors can't do anything useful.