mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-04-25 08:26:21 +02:00

Alex Jenkins 8954fa3ad7 Feat: TrustGraph i18n & Documentation Translation Updates (#781 )

2026-04-14 12:08:32 +01:00

Pub/Sub Abstraction: Broker-Independent Messaging

Problem

TrustGraph's messaging infrastructure is deeply coupled to Apache Pulsar in ways that go beyond the transport layer. This coupling creates several concrete problems.

1. Schema system is Pulsar-native

Every message type in the system is defined as a pulsar.schema.Record subclass using Pulsar field types (String(), Integer(), Boolean(), etc.). This means:

The pulsar Python package is a build dependency for trustgraph-base, even though trustgraph-base contains no transport logic
Any code that imports a message schema transitively depends on Pulsar
The schema definitions cannot be reused with a different broker without the Pulsar library installed
What's actually happening on the wire is JSON serialisation — the Pulsar schema machinery adds complexity without adding value over plain JSON encode/decode

2. Translators are named after the broker

The translator layer that converts between internal Python objects and wire format uses methods called to_pulsar() and from_pulsar(). These are really just JSON encode/decode operations — they have nothing to do with Pulsar specifically. The naming creates a false impression that the translation is broker-specific, when in reality any broker that carries JSON payloads would use identical logic.

3. Queue names use Pulsar URI format

Queue identifiers throughout the codebase use Pulsar's persistent://tenant/namespace/topic or non-persistent://tenant/namespace/topic URI format. These are hardcoded in schema definitions and referenced across services. RabbitMQ, Redis Streams, or any other broker would use completely different naming conventions. There is no abstraction between the logical identity of a queue and its broker-specific address.

4. Broker selection is not configurable

There is no mechanism to select a different pub/sub backend at deployment time. The Pulsar client is instantiated directly in the gateway and via PulsarClient in the base processor. Switching to a different broker would require code changes across multiple packages, not a configuration change.

5. Architectural requirements are implicit

TrustGraph relies on specific pub/sub behaviours — shared subscriptions for load balancing, message acknowledgement for reliability, message properties for correlation — but these requirements are not documented. This makes it difficult to evaluate whether a candidate broker (RabbitMQ, Redis Streams, NATS, etc.) actually satisfies the system's needs, or where the gaps would be.

Design Goals

Goal 1: Remove the link between Pulsar schemas and application code

Message types should be plain Python objects (dataclasses) that know how to serialise to and from JSON. The pulsar.schema.Record base class and Pulsar field types should not appear in schema definitions. The pub/sub transport layer sends and receives JSON bytes; the schema layer handles the mapping between JSON and typed Python objects independently.

Goal 2: Remove `to_pulsar` / `from_pulsar` naming

The translator methods should reflect what they actually do: encode a Python object to a JSON-compatible dict, and decode a JSON-compatible dict back to a Python object. The naming should be broker-neutral (e.g. encode / decode, or to_dict / from_dict).

Goal 3: Schema objects provide encode/decode

Each message type should be a Python dataclass (or similar) with a well-defined mapping to and from JSON. For example:

from dataclasses import dataclass

@dataclass
class TextCompletionRequest:
    system: str
    prompt: str
    streaming: bool = False

Given {"system": "You are helpful", "prompt": "Hello", "streaming": false} on the wire, decoding produces an object where request.system is "You are helpful", request.prompt is "Hello", and request.streaming is False. Encoding does the reverse. This is the schema's concern, not the broker's.
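The round trip described above can be sketched with the standard library alone. The helper names encode() and decode() are illustrative assumptions rather than the project's actual functions, and the sketch only handles flat dataclasses (no nested types):

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class TextCompletionRequest:
    system: str
    prompt: str
    streaming: bool = False


def encode(obj) -> bytes:
    """Dataclass instance -> JSON bytes (hypothetical helper name)."""
    return json.dumps(asdict(obj)).encode("utf-8")


def decode(cls, payload: bytes):
    """JSON bytes -> dataclass instance; keys the dataclass does not know are ignored."""
    data = json.loads(payload)
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in data.items() if k in known})


wire = b'{"system": "You are helpful", "prompt": "Hello", "streaming": false}'
request = decode(TextCompletionRequest, wire)
```

Nothing here refers to a broker: the same two functions would serve Pulsar, RabbitMQ, or any other transport that carries bytes.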

Goal 4: Abstract queue naming

Queue identifiers should not use Pulsar URI format (persistent://tg/flow/topic). A broker-neutral naming scheme is needed so that each backend can map logical queue names to its native format. The right approach here is not yet clear and needs to be worked through — considerations include how to express quality-of-service, multi-tenancy, and namespace separation without leaking broker concepts.

Goal 5: Document pub/sub architectural requirements

TrustGraph's actual requirements from the pub/sub layer need to be formally specified. This includes:

Delivery semantics: Which queues need at-least-once delivery? Are any fire-and-forget?
Consumer patterns: Shared subscriptions (competing consumers for load balancing), exclusive subscriptions, fan-out/broadcast
Message acknowledgement: Positive ack, negative ack (redelivery), timeout-based redelivery
Message properties: Key-value metadata on messages used for correlation (e.g. request IDs, flow routing)
Ordering guarantees: Per-topic ordering, per-key ordering, or no ordering required
Message size: Typical and maximum message sizes (some payloads include base64-encoded documents)
Persistence: Which messages must survive broker restarts
Consumer positioning: Ability to consume from earliest (replay) vs latest (live tail)
Connection model: Long-lived connections with reconnection, or transient

Documenting these requirements makes it possible to evaluate RabbitMQ or any other candidate against concrete criteria rather than discovering gaps during implementation.

Pub/Sub Architectural Requirements (As-Is)

This section documents what TrustGraph currently needs from its pub/sub layer. These are the as-is requirements — some may be revisited or relaxed in a future design if it makes broker portability easier.

Consumer model

All consumers use shared subscriptions (competing consumers). Multiple instances of the same processor read from the same subscription, and each message is delivered to exactly one instance. This is the load-balancing mechanism.

No exclusive or failover subscriptions are used anywhere in the codebase, despite infrastructure support for them.

Consumers support configurable concurrency — multiple async tasks within a single process can independently call receive() on the same subscription.

Delivery semantics

Almost all queues are non-persistent / best-effort (q0). The only persistent queue is config_push_queue (q2, exactly-once), which pushes full configuration state to processors. Since config pushes are idempotent (full state, not deltas), the persistence requirement here is about surviving broker restarts, not about exactly-once semantics per se.

Flow processing queues (request/response pairs for LLM, RAG, agent, etc.) are all non-persistent. Messages in flight are lost on broker restart. This is acceptable because:

Requests originate from a client that will time out and retry
There is no durable work-in-progress that would be corrupted by message loss
The system is designed for real-time query processing, not batch pipelines

Message acknowledgement

Positive acknowledgement: After successful handler execution, the message is acknowledged. This removes it from the subscription.

Negative acknowledgement: On handler failure (unhandled exception or rate-limit timeout), the message is negatively acknowledged, which triggers redelivery by the broker. Rate-limited messages retry for up to 7200 seconds before giving up and negatively acknowledging.

Orphaned messages: In the request-response subscriber pattern, messages that arrive with no matching waiter (e.g. the requester timed out) are positively acknowledged and discarded. This prevents redelivery storms.
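A minimal sketch of the positive/negative acknowledgement path, assuming a consumer object that exposes acknowledge() and negative_acknowledge(). RecordingConsumer and process() are hypothetical stand-ins for illustration, and the rate-limit retry window is deliberately left out:

```python
class RecordingConsumer:
    """Stand-in for a backend consumer; records calls instead of talking to a broker."""

    def __init__(self):
        self.acked = []
        self.nacked = []

    def acknowledge(self, message):
        self.acked.append(message)

    def negative_acknowledge(self, message):
        self.nacked.append(message)


def process(consumer, message, handler):
    """Run the handler, then ack on success or nack on any unhandled exception."""
    try:
        handler(message)
    except Exception:
        # Failure: ask the broker to redeliver the message.
        consumer.negative_acknowledge(message)
        return False
    # Success: remove the message from the subscription.
    consumer.acknowledge(message)
    return True


def failing_handler(message):
    raise RuntimeError("handler failed")


consumer = RecordingConsumer()
process(consumer, "m1", lambda message: None)
process(consumer, "m2", failing_handler)
```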

Message properties

Messages carry a small set of key-value string properties as metadata, separate from the payload. The primary use is an "id" property for request-response correlation: the requester generates a unique ID, attaches it as a property, and the responder echoes it back so the subscriber can match responses to waiters.

Agent orchestration correlation (correlation_id, parent_session_id) is carried in the message payload, not in properties.
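The "id" property correlation can be sketched as a small lookup table keyed by request ID. ResponseMatcher and its method names are hypothetical; the real Subscriber class is asynchronous and is not reproduced here:

```python
import uuid


class ResponseMatcher:
    """Matches responses to waiting requesters via the "id" message property."""

    def __init__(self):
        self.waiters = {}  # request id -> list of payloads received so far

    def register(self) -> str:
        """Create a waiter and return the ID to attach to the outgoing request."""
        request_id = str(uuid.uuid4())
        self.waiters[request_id] = []
        return request_id

    def cancel(self, request_id: str) -> None:
        """Drop a waiter, for example after the requester timed out."""
        self.waiters.pop(request_id, None)

    def deliver(self, properties: dict, payload) -> bool:
        """Route a response to its waiter; False means it is an orphan."""
        waiter = self.waiters.get(properties.get("id"))
        if waiter is None:
            return False  # caller acknowledges and discards the message
        waiter.append(payload)
        return True
```

When deliver() returns False the message is an orphan. Per the acknowledgement rules above, it is still positively acknowledged so the broker does not redeliver it.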

Consumer positioning

Two modes are used:

Earliest: The configuration consumer starts from the beginning of the topic to receive full configuration history on startup. This is the only use of earliest positioning.
Latest (default): All flow consumers start from the current position, processing only new messages.

Message ordering

Not required. The codebase explicitly does not depend on message ordering:

Shared subscriptions distribute messages across consumers without ordering guarantees
Concurrent handler tasks within a consumer process messages in arbitrary order
Request-response correlation uses IDs, not positional ordering
The supervisor fan-out/fan-in pattern collects results in a dictionary, order-independent
Configuration pushes are full state snapshots, not ordered deltas

Message sizes

Most messages are small JSON payloads (< 10KB). The exceptions:

Document content: Large documents (PDFs, text files) can be sent through the chunking service with base64 encoding. Pulsar's chunking feature (chunking_enabled) handles automatic splitting of oversized messages.
Agent observations: LLM-generated text can be several KB but rarely exceeds typical message size limits.

A replacement broker needs to either support large messages natively or provide a chunking/streaming mechanism. Alternatively, the large-document path could be refactored to use a side-channel (e.g. object store reference) instead of inline payload.

Fan-out patterns

Supervisor fan-out: One supervisor request decomposes into N independent sub-agent requests, each emitted as a separate message on the agent request queue. Different agent instances pick them up via the shared subscription. A correlation ID links the completions back to the original decomposition. This is not pub/sub fan-out (one message to many consumers) — it's application-level fan-out (many messages to one queue).

Request-response isolation: Each client creates a unique subscription name on response queues so it only receives its own responses. This means the response queue effectively has many independent subscribers. Each subscription is delivered every message published to the queue, and the subscriber keeps only the messages whose "id" property matches one of its waiters.

Reconnection and resilience

Reconnection logic lives in the Consumer/Producer/Publisher/Subscriber classes, not in the broker client. These classes handle:

Automatic reconnection on connection loss
Retry loops with backoff
Graceful shutdown (unsubscribe, close)

The broker client itself is expected to provide a basic connection that can fail, and the wrapper classes handle recovery. This is important for the abstraction — the backend interface can be simple because resilience is handled above it.
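A retry loop with exponential backoff of the kind the wrapper classes own might look like the sketch below. The function name, the delays, and the use of ConnectionError are assumptions for illustration, not values taken from the codebase:

```python
import time


def connect_with_backoff(connect, max_attempts=5, base_delay=0.5,
                         max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the delay after each failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

Because the loop only needs a callable that may raise, the backend can expose a bare connect operation and leave recovery policy to the layer above.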

Queue inventory

Queue	Persistence	Purpose
config push	Persistent (q2)	Full configuration state broadcast
config request/response	Non-persistent	Configuration queries
flow request/response	Non-persistent	Flow management
knowledge request/response	Non-persistent	Knowledge graph operations
librarian request/response	Non-persistent	Document storage operations
document embeddings request/response	Non-persistent	Document vector queries
row embeddings request/response	Non-persistent	Row vector queries
collection request/response	Non-persistent	Collection management

Additionally, each processing service (LLM, RAG, agent, prompt, embeddings, etc.) has dynamically defined request/response queue pairs configured at deployment time.

Summary of hard requirements for a replacement broker

Shared subscription / competing consumers — multiple consumers on one queue, each message delivered to exactly one
Message acknowledgement — positive ack (remove from queue) and negative ack (trigger redelivery)
Message properties — key-value metadata on messages, at minimum a string "id" field
Two consumer start positions — from beginning of topic and from current position
Persistence for at least one queue — config state must survive broker restart
Messages up to several MB — or a chunking mechanism for large payloads
No ordering requirement — simplifies broker selection significantly

Candidate Brokers

A quick assessment of alternatives against the hard requirements above.

RabbitMQ

The primary candidate. Mature, widely deployed, well understood.

Competing consumers: Yes — multiple consumers on a queue, round-robin delivery. This is RabbitMQ's native model.
Acknowledgement: Yes — basic.ack and basic.nack with requeue flag.
Message properties: Yes — headers and properties on every message. The correlation_id and message_id fields are first-class concepts.
Consumer positioning: Yes, via RabbitMQ Streams (3.9+). Streams are append-only logs that support reading from any offset — beginning, end, or timestamp. Classic queues are consumed destructively (no replay), but streams solve this cleanly. The state queue class maps to a RabbitMQ stream. Additionally, the Last Value Cache Exchange plugin can retain the most recent message per routing key for new consumers.
Persistence: Yes — durable queues and persistent messages survive broker restart.
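Large messages: Bounded by the broker's max_message_size setting (128MB by default according to the figure used in this spec; the default differs between RabbitMQ versions, so confirm it for the version deployed). Not designed for very large payloads, but adequate for current use.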
Ordering: FIFO per queue (stronger than required).
Operational complexity: Low. A single broker service with no ZooKeeper/BookKeeper dependencies (it does require the Erlang runtime). Significantly simpler to operate than Pulsar.
Ecosystem: Excellent client libraries, management UI, mature tooling.

Gaps: None significant. RabbitMQ Streams cover the replay/earliest positioning requirement.

Apache Kafka

High-throughput distributed log. More infrastructure than TrustGraph likely needs.

Competing consumers: Yes — consumer groups with partition assignment.
Acknowledgement: Yes — offset commits. No per-message negative ack; failed messages require application-level retry or dead-letter handling.
Message properties: Yes — message headers (key-value byte arrays).
Consumer positioning: Yes — seek to earliest or latest offset. Supports full replay.
Persistence: Yes — all messages are persisted to the log by default.
Large messages: Configurable (message.max.bytes on the broker, max.message.bytes per topic), default about 1MB, can be increased. Large payloads are discouraged by design.
Ordering: Per-partition ordering (stronger than required).
Operational complexity: High. Requires ZooKeeper (or KRaft), partition management, replication config. Overkill for typical TrustGraph deployments.
Ecosystem: Excellent client libraries, schema registry, Connect framework.

Gaps: No native negative acknowledgement. Operational complexity is high for small-to-medium deployments. Partition count must be planned upfront for parallelism.

Redis Streams

Lightweight option using Redis as a message broker.

Competing consumers: Yes — consumer groups with XREADGROUP.
Acknowledgement: Yes — XACK. Pending entries list tracks unacknowledged messages. No explicit negative ack but unacknowledged messages can be claimed after timeout via XAUTOCLAIM.
Message properties: No native separation between properties and payload. Would need to encode properties as fields within the stream entry or in the payload.
Consumer positioning: Yes — 0 (earliest) or $ (latest) on group creation.
Persistence: Yes — Redis persistence (RDB/AOF), though Redis is primarily an in-memory system.
Large messages: Practical limit tied to Redis memory. Not suited for large payloads.
Ordering: Per-stream ordering (stronger than required).
Operational complexity: Low if Redis is already in the stack. No additional infrastructure.

Gaps: No native message properties. Memory-bound. Persistence depends on Redis configuration. Not a natural fit for message broker patterns.

NATS / NATS JetStream

Lightweight, high-performance messaging. JetStream adds persistence.

Competing consumers: Yes — queue groups in core NATS; consumer groups in JetStream.
Acknowledgement: JetStream only — Ack, Nak (with redelivery), InProgress (extend timeout).
Message properties: Yes — message headers (key-value).
Consumer positioning: JetStream — deliver all, deliver last, deliver new, deliver by sequence/time.
Persistence: JetStream only. Core NATS is fire-and-forget.
Large messages: Default 1MB, configurable up to 64MB.
Ordering: Per-subject ordering.
Operational complexity: Very low. Single binary, no dependencies. Clustering is straightforward.

Gaps: Requires JetStream for persistence and acknowledgement. Smaller ecosystem than RabbitMQ/Kafka.

Assessment Summary

Requirement	RabbitMQ	Kafka	Redis Streams	NATS JetStream
Competing consumers	Yes	Yes	Yes	Yes
Positive/negative ack	Yes	Partial	Partial	Yes
Message properties	Yes	Yes	No	Yes
Earliest positioning	Yes (Streams)	Yes	Yes	Yes
Persistence	Yes	Yes	Partial	Yes
Large messages	Yes	Configurable	No	Configurable
Operational simplicity	Good	Poor	Good	Good

RabbitMQ is the strongest candidate given TrustGraph's requirements and deployment profile. Its one weak spot, earliest consumer positioning for config, is not available on classic queues but is covered by RabbitMQ Streams or by the fanout and last-value workarounds. Operational simplicity is a significant advantage over Pulsar.

Approach

Current state

The codebase has already undergone a partial abstraction. The picture is better than the problem statement might suggest:

Backend abstraction exists: backend.py defines Protocol-based interfaces (PubSubBackend, BackendProducer, BackendConsumer, Message). The Pulsar implementation lives in pulsar_backend.py.
Schemas are already dataclasses: Message types in schema/services/*.py are plain Python dataclasses with type hints, not Pulsar Record subclasses. This was the hardest part of the old spec and it's done.
Serialization is JSON-based: pulsar_backend.py contains dataclass_to_dict() and dict_to_dataclass() helpers that handle the round-trip. The wire format is JSON.
Factory pattern exists: pubsub.py has get_pubsub() which creates a backend from configuration. Currently only Pulsar is implemented.
Consumer/Producer/Publisher/Subscriber are backend-agnostic: These classes accept a backend parameter and delegate transport operations to it. They own retry, reconnection, metrics, and concurrency.
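For orientation, a Protocol-based interface of this shape could look roughly like the sketch below. The class and method names come from this spec, but every signature and parameter is an assumption; the real definitions in backend.py may differ:

```python
from typing import Any, Dict, Optional, Protocol, runtime_checkable


class Message(Protocol):
    def value(self) -> Any: ...
    def properties(self) -> Dict[str, str]: ...


class BackendProducer(Protocol):
    def send(self, message: Any,
             properties: Optional[Dict[str, str]] = None) -> None: ...
    def close(self) -> None: ...


class BackendConsumer(Protocol):
    def receive(self, timeout_millis: int = 2000) -> Message: ...
    def acknowledge(self, message: Message) -> None: ...
    def negative_acknowledge(self, message: Message) -> None: ...
    def close(self) -> None: ...


@runtime_checkable
class PubSubBackend(Protocol):
    # Parameter names below are illustrative assumptions.
    def create_producer(self, topic: str, schema: type) -> BackendProducer: ...
    def create_consumer(self, topic: str, subscription: str, schema: type,
                        initial_position: str = "latest") -> BackendConsumer: ...
    def close(self) -> None: ...
```

Because these are structural Protocols, a new backend does not need to inherit from anything; it only has to provide methods with matching names.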

What remains is cleanup, not a rewrite.

What needs to change

1. Rename translator methods

The translator base class (messaging/translators/base.py) defines to_pulsar() and from_pulsar() as abstract methods. Every translator implements these. The methods convert between external API dicts and internal dataclass objects — nothing Pulsar-specific happens in them.

Change: Rename to_pulsar() to decode() (external dict → dataclass) and from_pulsar() to encode() (dataclass → external dict). Update all translator subclasses and all call sites.

This is a mechanical rename. The method bodies don't change.

2. Clean up translator base classes

The base classes Translator, MessageTranslator, and SendTranslator reference "pulsar" in docstrings and parameter names. Clean these up so the naming reflects what the layer actually does: translating between the external API representation (JSON dicts from HTTP/WebSocket) and the internal schema (dataclasses).

3. Move serialization out of the Pulsar backend

dataclass_to_dict() and dict_to_dataclass() currently live in pulsar_backend.py but are not Pulsar-specific. They handle the conversion between dataclasses and JSON-compatible dicts, which every backend needs.

Change: Move these to a shared location (e.g. trustgraph/base/serialization.py or alongside the schema definitions). The backend interface sends and receives dicts; serialization to/from dataclasses happens at a layer above.

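This means the backend Protocol simplifies: send() accepts a dict and properties, value() returns a dict. The Consumer/Producer layer handles dataclass ↔ dict conversion using the shared serializers. Note that the Serialization boundary section later in this spec records the opposite decision (the backend owns the wire format and keeps its own serialization helpers), so this item needs to be reconciled with that decision before implementation.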

4. Abstract queue naming

Queue names currently use the format q0/tg/flow/queue-name or q2/tg/config/queue-name, which the Pulsar backend maps to non-persistent://tg/flow/queue-name or persistent://tg/config/queue-name.

This is an open design question. Options:

Option A: Simple string names. Queues are just strings like "text-completion-request". The backend is responsible for mapping to its native format (Pulsar adds persistent://tg/flow/ prefix, RabbitMQ uses the string as-is or adds a vhost prefix). Persistence and namespace are configuration concerns, not embedded in the name.

Option B: Structured queue descriptor. A small object that carries the logical name plus metadata:

from dataclasses import dataclass

@dataclass
class QueueDescriptor:
    name: str                    # e.g. "text-completion-request"
    namespace: str = "flow"      # logical grouping
    persistent: bool = False     # must survive broker restart

The backend maps this to its native format.

Option C: Keep the current format (q0/tg/flow/name) but document it as a TrustGraph convention, not a Pulsar convention. Backends parse it.

Option B is the most explicit. Option A is the simplest. Either is workable. The key constraint is that persistence is a property of the queue definition, not a runtime choice — the config push queue is persistent, everything else is not.
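To make Option B concrete, the sketch below maps a descriptor to two native forms. The function names and the tenant default are hypothetical. The RabbitMQ result is simply a dict of keyword arguments in the style accepted by a queue declaration call (queue, durable), not a call into any client library:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueDescriptor:
    name: str
    namespace: str = "flow"
    persistent: bool = False


def to_pulsar_topic(q: QueueDescriptor, tenant: str = "tg") -> str:
    """Build a Pulsar topic URI; persistence selects the URI scheme."""
    scheme = "persistent" if q.persistent else "non-persistent"
    return f"{scheme}://{tenant}/{q.namespace}/{q.name}"


def to_rabbitmq_declare_args(q: QueueDescriptor) -> dict:
    """Persistence becomes queue durability; the name is used as-is."""
    return {"queue": q.name, "durable": q.persistent}


config_push = QueueDescriptor("config-push", namespace="config", persistent=True)
llm_request = QueueDescriptor("text-completion-request")
```

Persistence is fixed where the descriptor is defined, which matches the constraint that it is a property of the queue definition rather than a runtime choice.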

5. Implement RabbitMQ backend

Write rabbitmq_backend.py implementing the PubSubBackend Protocol:

create_producer(): Creates a channel and declares the target queue. send() publishes to the default exchange with the queue name as routing key. Properties map to AMQP basic properties (specifically message_id for the "id" property).
create_consumer(): Declares the queue and starts consuming with basic_consume. Shared subscription is the default RabbitMQ model — multiple consumers on one queue get round-robin delivery. acknowledge() maps to basic_ack, negative_acknowledge() maps to basic_nack with requeue=True.
Persistence: For persistent queues, declare as durable with delivery_mode=2 on messages. For non-persistent queues, declare as non-durable.
Consumer positioning: RabbitMQ queues are consumed destructively, so "earliest" doesn't apply in the Pulsar sense. For the config push use case, use a fanout exchange with per-consumer exclusive queues — each new processor gets its own queue that receives all config publishes, plus the last-value can be handled by having the config service re-publish on startup.
Large messages: RabbitMQ handles messages up to rabbit.max_message_size (default 128MB). No chunking needed.

The factory in pubsub.py gets a new branch:

if backend_type == 'rabbitmq':
    return RabbitMQBackend(
        host=config.get('rabbitmq_host'),
        port=config.get('rabbitmq_port'),
        username=config.get('rabbitmq_username'),
        password=config.get('rabbitmq_password'),
        vhost=config.get('rabbitmq_vhost', '/'),
    )

Backend selection via PUBSUB_BACKEND=rabbitmq environment variable or --pubsub-backend rabbitmq CLI flag.
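One way to resolve the backend choice is sketched below. The precedence (CLI flag over environment variable), the "pulsar" default, and the function name are assumptions made for illustration; only the PUBSUB_BACKEND variable name comes from this spec:

```python
import os

SUPPORTED_BACKENDS = {"pulsar", "rabbitmq"}


def resolve_backend_type(cli_value=None, environ=None) -> str:
    """Pick the backend name: CLI flag first, then PUBSUB_BACKEND, then the default."""
    environ = os.environ if environ is None else environ
    backend = (cli_value or environ.get("PUBSUB_BACKEND") or "pulsar").lower()
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported pub/sub backend: {backend!r}")
    return backend
```

Failing fast on an unknown name keeps a typo in deployment configuration from silently falling back to Pulsar.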

6. Clean up remaining Pulsar references

After the above changes, Pulsar-specific code should be confined to:

pulsar_backend.py — the Pulsar implementation
pubsub.py — the factory that imports it

Audit and remove any remaining Pulsar imports, Pulsar exception handling, or Pulsar-specific concepts from:

async_processor.py (currently catches _pulsar.Interrupted)
consumer.py, subscriber.py (if any Pulsar exceptions leak through)
Schema files (should be clean already, but verify)
Gateway service (currently instantiates Pulsar client directly)

The gateway is a special case — it currently bypasses the abstraction layer and creates a Pulsar client directly for dispatching API requests. It should use the same get_pubsub() factory as everything else.

What stays the same

Schema definitions: Already dataclasses. No changes needed.
Consumer/Producer/Publisher/Subscriber: Already backend-agnostic. No changes to their core logic.
FlowProcessor and spec wiring: Already uses processor.pubsub to create backend instances. No changes.
Backend Protocol: The interface in backend.py is sound. Minor refinement possible (dict vs dataclass at the boundary) but the shape is right.

Concrete cleanups

The following files have Pulsar-specific imports that should not be there after the abstraction is complete. Pulsar imports should be confined to pulsar_backend.py and the factory in pubsub.py.

Dead imports (unused, can just be removed):

trustgraph-base/trustgraph/base/pubsub.py — from pulsar.schema import JsonSchema, import pulsar, import _pulsar. The JsonSchema import is unused since the switch to BytesSchema. The pulsar/_pulsar imports are only used by the legacy PulsarClient class which should be removed (superseded by PulsarBackend).
trustgraph-base/trustgraph/base/flow_processor.py — from pulsar.schema import JsonSchema. Unused.

Legacy PulsarClient class:

trustgraph-base/trustgraph/base/pubsub.py — The PulsarClient class is a leftover from before the backend abstraction. get_pubsub() still references PulsarClient.default_pulsar_host for defaults. Move the defaults to PulsarBackend or to environment variable reads in the factory, then delete PulsarClient.

Client libraries using Pulsar directly:

trustgraph-base/trustgraph/clients/base.py — import pulsar, import _pulsar, from pulsar.schema import JsonSchema. This is the base class for the old synchronous client library. These clients predate the backend abstraction and use Pulsar directly.
trustgraph-base/trustgraph/clients/embeddings_client.py — from pulsar.schema import JsonSchema, import _pulsar.
trustgraph-base/trustgraph/clients/*.py (agent, config, document_embeddings, document_rag, graph_embeddings, graph_rag, llm, prompt, row_embeddings, triples_query) — all import _pulsar for exception handling.

These clients are the internal request-response clients used by processors. They need to be migrated to use the backend abstraction or their Pulsar exception handling needs to be wrapped behind a backend-agnostic exception type.
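The second option, wrapping broker exceptions behind a backend-agnostic type, could be sketched as follows. The exception names and the translate_errors() helper are hypothetical. The built-in TimeoutError stands in for a broker-specific exception, so the example runs without any broker library installed:

```python
import contextlib


class PubSubError(Exception):
    """Base class for backend-agnostic pub/sub failures (hypothetical name)."""


class PubSubTimeout(PubSubError):
    """No message arrived within the receive timeout."""


@contextlib.contextmanager
def translate_errors(mapping):
    """Re-raise broker exceptions as the neutral types given in `mapping`."""
    try:
        yield
    except tuple(mapping) as exc:
        for source, target in mapping.items():
            if isinstance(exc, source):
                # Keep the original exception as __cause__ for debugging.
                raise target(str(exc)) from exc
        raise


def receive_with_timeout():
    # In a real backend the mapping keys would be the broker's own exception types.
    with translate_errors({TimeoutError: PubSubTimeout}):
        raise TimeoutError("no message within timeout")
```

Client code would then catch PubSubTimeout and never import the broker's exception module.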

Translator base class:

trustgraph-base/trustgraph/messaging/translators/base.py — from pulsar.schema import Record. Used in type hints. Should be removed when to_pulsar/from_pulsar are renamed.

Gateway service (bypasses abstraction):

trustgraph-flow/trustgraph/gateway/service.py — import pulsar. Creates a Pulsar client directly.
trustgraph-flow/trustgraph/gateway/config/receiver.py — import pulsar. Direct Pulsar usage.

The gateway should use get_pubsub() like everything else.

Storage writers:

trustgraph-flow/trustgraph/storage/triples/neo4j/write.py — import pulsar
trustgraph-flow/trustgraph/storage/triples/memgraph/write.py — import pulsar
trustgraph-flow/trustgraph/storage/triples/falkordb/write.py — import pulsar
trustgraph-flow/trustgraph/storage/triples/cassandra/write.py — import pulsar

These need investigation — likely Pulsar exception handling or direct client usage that should go through the abstraction.

Log level:

trustgraph-base/trustgraph/log_level.py — import _pulsar. Used to set Pulsar's log level. Should be moved into pulsar_backend.py.

Queue naming

The current scheme encodes QoS, tenant, namespace, and queue name into a slash-separated string (q0/tg/request/config) which the Pulsar backend parses and maps to a Pulsar URI (non-persistent://tg/request/config). This was an attempt at abstraction but it has problems:

QoS in the name was a mistake — it's a property of the queue definition, not something that belongs in the name. A queue is either persistent or it isn't; that's decided once when the queue is defined.
The tenant/namespace structure mirrors Pulsar's model. RabbitMQ doesn't use this; it has vhosts and exchange/queue names. Presenting the scheme as broker-neutral just leaks Pulsar concepts.
The topic() helper generates these strings, and the backend parses them apart. This is unnecessary indirection.
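For reference, the parsing step criticised above amounts to something like the following sketch. The function name is hypothetical, and only the two QoS prefixes that appear in this spec (q0 and q2) are handled:

```python
def legacy_queue_to_pulsar_uri(queue: str) -> str:
    """Map the slash-separated legacy name onto a Pulsar topic URI."""
    qos, tenant, namespace, name = queue.split("/", 3)
    schemes = {"q0": "non-persistent", "q2": "persistent"}
    if qos not in schemes:
        raise ValueError(f"unknown QoS prefix: {qos!r}")
    return f"{schemes[qos]}://{tenant}/{namespace}/{name}"
```

The string is assembled by one helper and immediately taken apart by another, and three of its four fields exist only because Pulsar needs them.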

There are two categories of queue in TrustGraph:

Infrastructure queues — defined in code, used for system services. These are fixed and well-known:

Queue	Persistent	Purpose
`config-request`	No	Config queries
`config-response`	No	Config query responses
`config-push`	Yes	Config state broadcast
`flow-request`	No	Flow management queries
`flow-response`	No	Flow management responses
`librarian-request`	No	Document storage operations
`librarian-response`	No	Document storage responses
`knowledge-request`	No	Knowledge graph operations
`knowledge-response`	No	Knowledge graph responses
`document-embeddings-request`	No	Document vector queries
`document-embeddings-response`	No	Document vector responses
`row-embeddings-request`	No	Row vector queries
`row-embeddings-response`	No	Row vector responses
`collection-request`	No	Collection management
`collection-response`	No	Collection management responses

Flow queues — defined in configuration, created dynamically per flow. The queue names come from the config service (e.g. text-completion-request, graph-rag-request, agent-request). Each flow instance has its own set of these queues.

For infrastructure queues, the name is just a string. Persistence is a property of the queue definition, not encoded in the name. The backend maps the name to whatever its native format requires.

For flow queues, the name comes from configuration. The config service already distributes queue names as strings — the backend just needs to be able to use them.

Proposed scheme: CLASS:TOPICSPACE:TOPIC

A queue name has three parts separated by colons:

CLASS — a small enum that defines the queue's operational characteristics. The backend knows what each class means in terms of persistence, TTL, memory limits, etc. There are only four classes:

Class	Persistent	TTL	Behaviour
`flow`	Yes	Long	Processing pipeline queues. Messages survive broker restart.
`request`	No	Short	Transient request-response. Low TTL, no persistence needed — clients retry on failure.
`response`	No	Short	Same as request, for the response side.
`state`	Yes	Retained	Last-value state broadcast. Consumers need the most recent value on startup, plus any future updates. Config push is the primary example.

TOPICSPACE — deployment isolation. Keeps different TrustGraph deployments separate when sharing the same pub/sub infrastructure. Most deployments just use tg. Avoids the overloaded terms "tenant" and "namespace".
TOPIC — the logical queue identity. What the queue is for.

Examples:

flow:tg:text-completion-request
flow:tg:graph-rag-request
flow:tg:agent-request
request:tg:librarian
response:tg:librarian
request:tg:config
response:tg:config
state:tg:config
request:tg:flow
response:tg:flow

Backend mapping:

Each backend parses the three parts and maps them to its native concepts:

Pulsar: flow:tg:text-completion-request → persistent://tg/flow/text-completion-request. Class maps to persistent/non-persistent and namespace. State class uses persistent topic with earliest consumer positioning.
RabbitMQ: Topicspace maps to vhost. Class determines queue durability and TTL policy. State class uses a last-value queue (via plugin) or a fanout exchange pattern where each consumer gets the retained state on connect.
Kafka: flow.tg.text-completion-request as topic name. Class determines retention and compaction policy. State class maps to a compacted topic (last value per key).
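A sketch of parsing the proposed names and applying the Pulsar and Kafka mappings described above. The type and function names are hypothetical. Using the class name as the Pulsar namespace for every class is an assumption extrapolated from the single flow example:

```python
from dataclasses import dataclass
from enum import Enum


class QueueClass(Enum):
    FLOW = "flow"
    REQUEST = "request"
    RESPONSE = "response"
    STATE = "state"


# Classes marked persistent in the class table.
PERSISTENT = {QueueClass.FLOW, QueueClass.STATE}


@dataclass(frozen=True)
class QueueName:
    cls: QueueClass
    topicspace: str
    topic: str


def parse_queue(name: str) -> QueueName:
    """Split CLASS:TOPICSPACE:TOPIC; raises ValueError on malformed input."""
    parts = name.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected CLASS:TOPICSPACE:TOPIC, got {name!r}")
    return QueueName(QueueClass(parts[0]), parts[1], parts[2])


def to_pulsar(q: QueueName) -> str:
    scheme = "persistent" if q.cls in PERSISTENT else "non-persistent"
    return f"{scheme}://{q.topicspace}/{q.cls.value}/{q.topic}"


def to_kafka(q: QueueName) -> str:
    return f"{q.cls.value}.{q.topicspace}.{q.topic}"
```

An unknown class fails at parse time because the enum lookup raises ValueError, which keeps new classes rare and deliberate.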

Why this works:

The class enum is small and stable — adding a new class is rare and deliberate
Queue properties (persistence, TTL) are implied by class, not encoded in the name
Dynamic registration works naturally — the config service publishes flow:tg:text-completion-request and the backend knows how to declare it from the flow class
The colon separator is unambiguous, easy to split, and doesn't conflict with the URIs or path separators that backends use internally
No pretence of being generic — this is a TrustGraph convention, and that's fine

Serialization boundary

Decision: the backend owns the wire format.

The contract between the Consumer/Producer layer and the backend is dataclass objects in, dataclass objects out:

send() accepts a dataclass instance and properties dict
receive() returns a message whose value() is a dataclass instance

What happens on the wire is the backend's concern. The Pulsar backend uses JSON (via dataclass_to_dict / dict_to_dataclass). A RabbitMQ backend would likely also use JSON. A future backend could use Protobuf, MessagePack, or Avro if the broker benefits from it.

The serialization helpers stay inside the backend that uses them — they are not shared infrastructure. Each backend brings its own serialization strategy. The Consumer/Producer layer never thinks about wire format.
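
As a rough illustration of a backend-private JSON wire format, the sketch below round-trips a dataclass through bytes. TextCompletionRequest, to_wire and from_wire are hypothetical stand-ins; the real dataclass_to_dict / dict_to_dataclass helpers may behave differently, and this version only handles flat dataclasses with JSON-compatible fields.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TextCompletionRequest:
    """Hypothetical message type used only for this example."""
    system: str
    prompt: str


def to_wire(message) -> bytes:
    """Backend-internal: dataclass instance to JSON bytes."""
    return json.dumps(asdict(message)).encode("utf-8")


def from_wire(payload: bytes, schema):
    """Backend-internal: JSON bytes back to an instance of `schema`.

    Only flat dataclasses are handled; nested types would need recursion.
    """
    return schema(**json.loads(payload.decode("utf-8")))
```

A backend with a different wire format would replace these two functions and nothing above the backend would change.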

Gateway service

Decision: the gateway uses the backend abstraction like any other component.

The gateway currently bridges WebSocket/REST to Pulsar directly, bypassing the abstraction layer. It translates incoming API JSON to Pulsar schema objects, sends them, receives responses as Pulsar schema objects, and translates back to API JSON. Since the wire format is JSON in both directions, this is effectively a no-op round trip through the schema machinery.

With the backend abstraction, the gateway follows the same pattern as every other component:

Incoming API JSON → translator decode() → dataclass
Dataclass → backend send() (backend handles wire format)
Backend receive() → dataclass
Dataclass → translator encode() → API JSON → WebSocket/REST client

This is architecturally simple — one code path, no special cases. The gateway depends on the schema dataclasses and the translator layer, which it already does. The overhead of deserialize-then-reserialize is negligible for the message sizes involved. And it keeps all options open — if a future backend uses a non-JSON wire format, the gateway still works without changes.
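
The four-step gateway path can be sketched end to end with an in-memory stand-in for the backend. Every class here (LibrarianRequest, LibrarianRequestTranslator, InMemoryBackend, Message) is hypothetical and exists only to show the shape of the flow; in a real deployment the response would be a different message type arriving on a separate response queue.

```python
import json
from collections import deque
from dataclasses import asdict, dataclass


@dataclass
class LibrarianRequest:
    """Hypothetical schema dataclass."""
    operation: str
    document_id: str


class LibrarianRequestTranslator:
    """Hypothetical translator: API JSON <-> dataclass."""

    def decode(self, api_json: dict) -> LibrarianRequest:
        return LibrarianRequest(api_json["operation"], api_json["document-id"])

    def encode(self, obj: LibrarianRequest) -> dict:
        return {"operation": obj.operation, "document-id": obj.document_id}


class Message:
    """Received message whose value() is a dataclass instance."""

    def __init__(self, value, properties):
        self._value, self._properties = value, properties

    def value(self):
        return self._value

    def properties(self):
        return self._properties


class InMemoryBackend:
    """Stand-in backend: dataclasses in, dataclasses out, JSON in between."""

    def __init__(self, schema):
        self._schema = schema
        self._wire = deque()

    def send(self, message, properties=None):
        body = json.dumps(asdict(message)).encode("utf-8")  # wire format is private
        self._wire.append((body, dict(properties or {})))

    def receive(self) -> Message:
        body, properties = self._wire.popleft()
        return Message(self._schema(**json.loads(body)), properties)


translator = LibrarianRequestTranslator()
backend = InMemoryBackend(LibrarianRequest)

# API JSON -> decode() -> dataclass -> send()
backend.send(translator.decode({"operation": "get", "document-id": "doc-1"}), {"id": "42"})
# receive() -> dataclass -> encode() -> API JSON
received = backend.receive()
reply = translator.encode(received.value())
```

The gateway code never touches the encoded bytes, which is what lets a non-JSON backend slot in later.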

Implementation Order

Phase 1: Rename translators

Rename to_pulsar() → decode() and from_pulsar() → encode() across all translator classes and call sites. Remove the `from pulsar.schema import Record` import from the translator base class. This is a mechanical find-and-replace with no behavioural changes.

Phase 2: Queue naming

Replace the topic() helper with the CLASS:TOPICSPACE:TOPIC scheme. Update all queue definitions in schema/services/*.py and schema/knowledge/*.py. Update PulsarBackend.map_topic() to parse the new format. Verify all existing functionality still works with Pulsar.

Phase 3: Clean up Pulsar leaks

Work through the concrete cleanups list: remove dead imports, delete the legacy PulsarClient class, migrate the client libraries and gateway to use the backend abstraction. After this phase, pulsar imports exist only in pulsar_backend.py.

Phase 4: RabbitMQ backend

Implement rabbitmq_backend.py against the existing PubSubBackend Protocol. Map queue classes to RabbitMQ concepts: flow → durable queues, request/response → non-durable queues with TTL, state → RabbitMQ streams. Add rabbitmq as a backend option in the factory. Test end-to-end with PUBSUB_BACKEND=rabbitmq.
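
One way the class-to-RabbitMQ mapping might look, expressed as plain queue-declaration options. The function name and the TTL value are placeholders invented for this sketch. The argument keys x-message-ttl (milliseconds) and x-queue-type are standard RabbitMQ queue arguments.

```python
# Placeholder value: the spec says "short" but does not give a number.
SHORT_TTL_MS = 30_000


def rabbitmq_queue_options(queue_class: str) -> dict:
    """Return durable/arguments settings for declaring a queue of this class."""
    if queue_class == "flow":
        # Durable queue: the queue definition survives a broker restart.
        return {"durable": True, "arguments": {}}
    if queue_class in ("request", "response"):
        # Transient queue with a per-message TTL; clients retry on failure.
        return {"durable": False, "arguments": {"x-message-ttl": SHORT_TTL_MS}}
    if queue_class == "state":
        # Streams are declared as durable queues with the stream queue type.
        return {"durable": True, "arguments": {"x-queue-type": "stream"}}
    raise ValueError(f"unknown queue class: {queue_class!r}")
```

The returned dict is shaped so it could be unpacked into a client library's queue-declare call, though that wiring is left out here.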

Phases 1-3 are safe to do on main — they don't change behaviour, just clean up. Phase 4 is additive — it adds a new backend without touching the existing one.

Config distribution on RabbitMQ

The state queue class needs "start from earliest" semantics — a newly started processor must receive the current configuration state.

RabbitMQ Streams (available since 3.9) solve this directly. Streams are persistent, append-only logs that support consumer offset positioning. The RabbitMQ backend maps the state class to a stream, and consumers attach with offset first to read from the beginning, or last to start from the most recently written chunk and then receive future updates.

Since config pushes are full state snapshots (not deltas), a consumer only needs the most recent entry. The RabbitMQ backend can use last offset positioning for state class consumers. RabbitMQ defines last as the last written chunk, which can hold more than one message, so a consumer may see a few older snapshots before the newest one. Because each snapshot replaces the previous one in full, applying them in order still ends at the current state, and new messages follow as they arrive. This matches the current behaviour where processors read config on startup and then react to updates.
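
A small sketch of the consumer side, assuming a hypothetical snapshot shape of {"version": ..., "config": {...}}. ConfigState and consume_arguments are names made up for the example. The x-stream-offset consumer argument is the standard way to request a starting position on a RabbitMQ stream.

```python
def consume_arguments(position: str = "last") -> dict:
    """Consumer arguments selecting where to start reading a stream."""
    if position not in ("first", "last", "next"):
        raise ValueError(f"unsupported offset: {position!r}")
    return {"x-stream-offset": position}


class ConfigState:
    """Holds the most recent config snapshot seen by a processor."""

    def __init__(self):
        self.version = None
        self.config = {}

    def on_message(self, snapshot: dict) -> None:
        # Each push is a full snapshot, so replace rather than merge.
        self.version = snapshot["version"]
        self.config = dict(snapshot["config"])
```

Replacing instead of merging means a key dropped from a later snapshot also disappears from the processor's view.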
