mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-27 17:36:23 +02:00
Config push notify pattern: replace stateful pub/sub with signal+ fetch (#760)
Replace the config push mechanism that broadcast the full config blob on a
'state' class pub/sub queue with a lightweight notify signal containing
only the version number and affected config types. Processors fetch the
full config via request/response from the config service when notified.

This eliminates the need for the pub/sub 'state' queue class and stateful
pub/sub services entirely. The config push queue moves from 'state' to
'flow' class: a simple transient signal rather than a retained message.
This solves the RabbitMQ late-subscriber problem where restarting processes
never received the current config because their fresh queue had no
historical messages.

Key changes:
- ConfigPush schema: config dict replaced with types list
- Subscribe-then-fetch startup with retry: processors subscribe to notify
  queue, fetch config via request/response, then process buffered notifies
  with version comparison to avoid race conditions
- register_config_handler() accepts optional types parameter so handlers
  only fire when their config types change
- Short-lived config request/response clients to avoid subscriber
  contention on non-persistent response topics
- Config service passes affected types through put/delete/flow operations
- Gateway ConfigReceiver rewritten with same notify pattern and retry loop

Tests updated

New tests:
- register_config_handler: without types, with types, multiple types,
  multiple handlers
- on_config_notify: old/same version skipped, irrelevant types skipped
  (version still updated), relevant type triggers fetch, handler without
  types always called, mixed handler filtering, empty types invokes all,
  fetch failure handled gracefully
- fetch_config: returns config+version, raises on error response, stops
  client even on exception
- fetch_and_apply_config: applies to all handlers on startup, retries on
  failure
parent
d4723566cb
commit
4acd853023
37 changed files with 1449 additions and 406 deletions
282
docs/tech-specs/config-push-poke.md
Normal file
@ -0,0 +1,282 @@
# Config Push "Notify" Pattern Technical Specification

## Overview

Replace the current config push mechanism, which broadcasts the full config
blob on a `state` class queue, with a lightweight "notify" message
containing only the version number and affected types. Processors that care
about those types fetch the full config via the existing request/response
interface.

This solves the RabbitMQ late-subscriber problem: when a process restarts,
its fresh queue has no historical messages, so it never receives the current
config state. With the notify pattern, the push queue is only a signal; the
source of truth is the config service's request/response API, which a
processor can query at any time.

## Problem

On Pulsar, `state` class queues are persistent topics. A new subscriber
with `InitialPosition.Earliest` reads from message 0 and receives the
last config push. On RabbitMQ, each subscriber gets a fresh per-subscriber
queue (named with a new UUID). Messages published before the queue existed
are gone, so a restarting processor never gets the current config.
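
The late-subscriber behaviour can be reproduced with a toy in-memory broker. This is only an illustrative sketch: `PerSubscriberBroker` is a hypothetical stand-in for the per-subscriber queue model described above, not RabbitMQ or TrustGraph code.

```python
class PerSubscriberBroker:
    """Toy broker: each subscriber gets its own queue, created on subscribe."""

    def __init__(self):
        self.queues = []

    def subscribe(self):
        # A fresh, empty queue; nothing published earlier is replayed
        queue = []
        self.queues.append(queue)
        return queue

    def publish(self, message):
        # Deliver only to queues that exist at publish time
        for queue in self.queues:
            queue.append(message)


broker = PerSubscriberBroker()
early = broker.subscribe()            # subscribed before the push
broker.publish({"version": 1})        # config push
late = broker.subscribe()             # restarted processor, fresh queue

# `early` holds the push; `late` is empty and would wait forever
```

The restarted subscriber has no way to learn the current state from the queue alone, which is why the design below moves the source of truth to a request/response fetch.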

## Design

### The Notify Message

The `ConfigPush` schema changes from carrying the full config to carrying
just a version number and the list of affected config types:

```python
from dataclasses import dataclass, field

@dataclass
class ConfigPush:
    version: int = 0
    types: list[str] = field(default_factory=list)
```

When the config service handles a `put` or `delete`, it knows which types
were affected (from the request's `values[].type` or `keys[].type`). It
includes those in the notify. On startup, the config service sends a notify
with an empty types list (meaning "everything").

### Subscribe-then-Fetch Startup (No Race Condition)

The critical ordering to avoid missing an update:

1. **Subscribe** to the config push queue. Buffer incoming notify messages.
2. **Fetch** the full config via request/response (`operation: "config"`).
   This returns the config dict and a version number.
3. **Apply** the fetched config to all registered handlers.
4. **Process** buffered notifies. For any notify with `version > fetched_version`,
   re-fetch and re-apply. Discard notifies with `version <= fetched_version`.
5. **Enter steady state**. Process future notifies as they arrive.

This is safe because:

- If an update happens before the subscription, the fetch picks it up.
- If an update happens between subscribe and fetch, it's in the buffer.
- If an update happens after the fetch, it arrives on the queue normally.
- Version comparison ensures that a notify for an already-applied version
  does not trigger a second fetch.
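
The ordering above can be sketched with an in-memory stand-in for the config service. Everything here is an assumption made for illustration: `ToyConfigService`, `start_processor` and the `applied` list are hypothetical names, and an `asyncio.Queue` plays the role of the notify subscription.

```python
import asyncio


class ToyConfigService:
    """Hypothetical in-memory stand-in for the config service."""

    def __init__(self):
        self.version = 0
        self.config = {}
        self.subscribers = []  # one asyncio.Queue per subscriber

    def subscribe(self):
        queue = asyncio.Queue()
        self.subscribers.append(queue)
        return queue

    async def put(self, type_, key, value):
        self.config.setdefault(type_, {})[key] = value
        self.version += 1
        # Notify carries only the version and the affected types
        for queue in self.subscribers:
            await queue.put((self.version, [type_]))

    async def fetch(self):
        # Request/response path: returns a snapshot and its version
        return {t: dict(v) for t, v in self.config.items()}, self.version


async def start_processor(service, applied):
    # 1. Subscribe first; the queue buffers notifies from now on
    notifies = service.subscribe()

    # Simulate an update landing between subscribe and fetch
    await service.put("prompt", "greeting", "hello")

    # 2. Fetch the full config and its version
    config, version = await service.fetch()

    # 3. Apply the fetched config (recorded here as the applied version)
    applied.append(version)

    # 4. Drain buffered notifies; only newer versions trigger a re-fetch
    while not notifies.empty():
        notify_version, _types = notifies.get_nowait()
        if notify_version > version:
            config, version = await service.fetch()
            applied.append(version)

    return config, version
```

In this sketch the update made between subscribe and fetch shows up twice, once in the fetched snapshot and once as a buffered notify, and the version comparison discards the notify, so the config is applied only once.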

### Processor API

The current API requires processors to understand the full config dict
structure. The new API should be cleaner: processors declare which config
types they care about and provide a handler that is invoked only when one
of those types changes. The handler signature is unchanged and it still
receives the full config dict.

#### Current API

```python
# In processor __init__:
self.register_config_handler(self.on_configure_flows)

# Handler receives the entire config dict:
async def on_configure_flows(self, config, version):
    if "active-flow" not in config:
        return
    if self.id in config["active-flow"]:
        flow_config = json.loads(config["active-flow"][self.id])
        # ...
```

#### New API

```python
# In processor __init__:
self.register_config_handler(
    handler=self.on_configure_flows,
    types=["active-flow"],
)

# Handler signature is unchanged:
async def on_configure_flows(self, config, version):
    # config still contains the full dict, but handler is only called
    # when "active-flow" type changes (or on startup)
    if "active-flow" not in config:
        return
    # ...
```

The `types` parameter is optional. If omitted, the handler is called for
every config change (backward compatible). If specified, the handler is
only invoked when the notify's `types` list intersects with the handler's
types, or when the notify's `types` list is empty (startup, meaning
"everything").
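
The filtering rule can be written as a small predicate. This is a sketch for illustration only; `should_invoke` is a hypothetical helper name, not part of the processor API.

```python
def should_invoke(handler_types, notify_types):
    """Decide whether a handler runs for a given notify.

    handler_types: set of type names, or None if registered without `types`.
    notify_types: set of type names from the notify; empty means "everything".
    """
    if handler_types is None:
        return True   # no filter: called for every change
    if not notify_types:
        return True   # empty notify list: startup, invoke all handlers
    return bool(handler_types & notify_types)
```

For example, a handler registered with `types=["active-flow"]` is skipped for a notify whose types are `{"prompt"}`, but runs for `{"active-flow", "prompt"}` and for an empty set.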

#### Internal Registration Structure

```python
# In AsyncProcessor:
def register_config_handler(self, handler, types=None):
    self.config_handlers.append({
        "handler": handler,
        "types": set(types) if types else None,  # None = all types
    })
```

#### Notify Processing Logic

```python
async def on_config_notify(self, message, consumer, flow):
    notify_version = message.value().version
    notify_types = set(message.value().types)

    # Skip if we already have this version or newer
    if notify_version <= self.config_version:
        return

    # Fetch full config from config service
    config, version = await self.config_client.config()
    self.config_version = version

    # Determine which handlers to invoke
    for entry in self.config_handlers:
        handler_types = entry["types"]
        if handler_types is None:
            # Handler cares about everything
            await entry["handler"](config, version)
        elif not notify_types or notify_types & handler_types:
            # notify_types empty = startup (invoke all),
            # or intersection with handler's types
            await entry["handler"](config, version)
```

### Config Service Changes

#### Push Method

The `push()` method changes to send only version + types:

```python
async def push(self, types=None):
    version = await self.config.get_version()
    resp = ConfigPush(
        version=version,
        types=types or [],
    )
    await self.config_push_producer.send(resp)
```

#### Put/Delete Handlers

Extract affected types and pass to push:

```python
async def handle_put(self, v):
    types = list(set(k.type for k in v.values))
    for k in v.values:
        await self.table_store.put_config(k.type, k.key, k.value)
    await self.inc_version()
    await self.push(types=types)

async def handle_delete(self, v):
    types = list(set(k.type for k in v.keys))
    for k in v.keys:
        await self.table_store.delete_key(k.type, k.key)
    await self.inc_version()
    await self.push(types=types)
```

#### Queue Class Change

The config push queue changes from `state` class to `flow` class. The push
is now a transient signal; the source of truth is the config service's
request/response API, not the queue. `flow` class is persistent (survives
broker restarts) but doesn't require last-message retention. Dependence on
that retention was the root cause of the RabbitMQ problem.

```python
config_push_queue = queue('config', cls='flow')  # was cls='state'
```

#### Startup Push

On startup, the config service sends a notify with an empty types list
(signalling "everything changed"):

```python
async def start(self):
    await self.push(types=[])  # Empty = all types
    await self.config_request_consumer.start()
```

### AsyncProcessor Changes

The `AsyncProcessor` needs a config request/response client alongside the
push consumer. The startup sequence becomes:

```python
async def start(self):
    # 1. Start the push consumer (begins buffering notifies)
    await self.config_sub_task.start()

    # 2. Fetch current config via request/response
    config, version = await self.config_client.config()
    self.config_version = version

    # 3. Apply to all handlers (startup = all handlers invoked)
    for entry in self.config_handlers:
        await entry["handler"](config, version)

    # 4. Buffered notifies are now processed by on_config_notify,
    #    which skips versions <= self.config_version
```

The config client needs to be created in `__init__` using the existing
request/response queue infrastructure. The `ConfigClient` from
`trustgraph.clients.config_client` already exists but uses a synchronous
blocking pattern, so an async variant, or integration with the processor's
pub/sub backend, is needed.
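
One possible shape for the async fetch is a short-lived client that is created per request and always stopped afterwards, as the commit message describes. The sketch below is not the TrustGraph client: `FakeConfigClient`, its `start`/`request`/`stop` methods and the dict-shaped response are all assumptions chosen to keep the example self-contained.

```python
import asyncio


class FakeConfigClient:
    """Hypothetical async request/response client, for illustration only."""

    def __init__(self, response):
        self.response = response
        self.stopped = False

    async def start(self):
        pass

    async def request(self, operation):
        return self.response

    async def stop(self):
        self.stopped = True


async def fetch_config(make_client):
    # A fresh client per fetch, so no long-lived response subscription
    client = make_client()
    await client.start()
    try:
        resp = await client.request(operation="config")
        if resp.get("error"):
            raise RuntimeError(f"config fetch failed: {resp['error']}")
        return resp["config"], resp["version"]
    finally:
        # Always release the client, even when the request raised
        await client.stop()
```

The `try`/`finally` is the point of the sketch: an error response or a failed request still stops the client, so a retry loop does not accumulate subscribers.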

### Existing Config Handler Types

For reference, the config types currently used by handlers:

| Handler | Type(s) | Used By |
|---------|---------|---------|
| `on_configure_flows` | `active-flow` | All FlowProcessor subclasses |
| `on_collection_config` | `collection` | Storage services (triples, embeddings, rows) |
| `on_prompt_config` | `prompt` | Prompt template service, agent extract |
| `on_schema_config` | `schema` | Rows storage, row embeddings, NLP query, structured diag |
| `on_cost_config` | `token-costs` | Metering service |
| `on_ontology_config` | `ontology` | Ontology extraction |
| `on_librarian_config` | `librarian` | Librarian service |
| `on_mcp_config` | `mcp-tool` | MCP tool service |
| `on_knowledge_config` | `kg-core` | Cores service |

## Implementation Order

1. **Update ConfigPush schema**: change `config` field to `types` field.

2. **Update config service**: modify `push()` to send version + types.
   Modify `handle_put`/`handle_delete` to extract affected types.

3. **Add async config query to AsyncProcessor**: create a
   request/response client for config queries within the processor's
   event loop.

4. **Implement subscribe-then-fetch startup**: reorder
   `AsyncProcessor.start()` to subscribe first, then fetch, then
   process buffered notifies with version comparison.

5. **Update register_config_handler**: add optional `types` parameter.
   Update `on_config_notify` to filter by type intersection.

6. **Update existing handlers**: add `types` parameter to all
   `register_config_handler` calls across the codebase.

7. **Backward compatibility**: handlers without `types` parameter
   continue to work (invoked for all changes).

## Risks

- **Thundering herd**: if many processors restart simultaneously, they
  all hit the config service API at once. Mitigated by the config service
  already being designed for request/response load, and the number of
  processors being small (tens, not thousands).

- **Config service availability**: processors now depend on the config
  service being up at startup, not just having received a push. This is
  already the case in practice: without config, processors can't do
  anything useful.
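
Both risks touch the startup fetch, which the commit message says is retried on failure. A minimal sketch of such a retry loop follows. The backoff and jitter policy, the `fetch_with_retry` name and its parameters are assumptions made for illustration, not the policy the implementation uses.

```python
import asyncio
import random


async def fetch_with_retry(fetch, base_delay=1.0, max_delay=30.0,
                           sleep=asyncio.sleep):
    """Call `fetch()` until it succeeds, backing off between attempts."""
    delay = base_delay
    while True:
        try:
            return await fetch()
        except Exception:
            # Full jitter: sleep a random time up to the current ceiling so
            # processors that restarted together do not retry in lockstep
            await sleep(random.uniform(0, delay))
            # Double the ceiling for the next attempt, capped at max_delay
            delay = min(max_delay, delay * 2)
```

The `sleep` parameter exists only so the loop can be exercised without real delays. Task cancellation still propagates, because `asyncio.CancelledError` is not a subclass of `Exception`.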