A generic, long-running bootstrap processor that converges a
deployment to its configured initial state and then idles.
Replaces the previous one-shot `tg-init-trustgraph` container model
and provides an extension point for enterprise / third-party
initialisers.
See docs/tech-specs/bootstrap.md for the full design.
Bootstrapper
------------
A single AsyncProcessor (trustgraph.bootstrap.bootstrapper.Processor)
that:
* Reads a list of initialiser specifications (class, name, flag,
params) from either a direct `initialisers` parameter
(processor-group embedding) or a YAML/JSON file (`-c`, CLI).
* On each wake, runs a cheap service-gate (config-svc +
flow-svc round-trips), then iterates the initialiser list,
running each whose configured flag differs from the one stored
in __system__/init-state/<name>.
* Stores per-initialiser completion state in the reserved
__system__ workspace.
* Adapts cadence: ~5s on gate failure, ~15s while converging,
~300s in steady state.
* Isolates failures — one initialiser's exception does not block
others in the same cycle; the failed one retries next wake.
Initialiser contract
--------------------
* Subclass trustgraph.bootstrap.base.Initialiser.
* Implement async run(ctx, old_flag, new_flag).
* Opt out of the service gate with class attr
wait_for_services=False (only used by PulsarTopology, since
config-svc cannot come up until Pulsar namespaces exist).
* ctx carries short-lived config and flow-svc clients plus a
scoped logger.
Core initialisers (trustgraph.bootstrap.initialisers.*)
-------------------------------------------------------
* PulsarTopology — creates Pulsar tenant + namespaces
(pre-gate, blocking HTTP offloaded to
executor).
* TemplateSeed — seeds __template__ from an external JSON
file; re-run is upsert-missing by default,
overwrite-all opt-in.
* WorkspaceInit — populates a named workspace from either
the full contents of __template__ or a
seed file; raises cleanly if the template
isn't seeded yet so the bootstrapper retries
on the next cycle.
* DefaultFlowStart — starts a specific flow in a workspace;
no-ops if the flow is already running.
Enterprise or third-party initialisers plug in via fully-qualified
dotted class paths in the bootstrapper's configuration — no core
code change required.
Config service
--------------
* push(): filter out reserved workspaces (ids starting with "_")
from the change notifications. Stored config is preserved; only
the broadcast is suppressed, so bootstrap / template state lives
in config-svc without live processors ever reacting to it.
Config client
-------------
* ConfigClient.get_all(workspace): wraps the existing `config`
operation to return {type: {key: value}} for a workspace.
WorkspaceInit uses it to copy __template__ without needing a
hardcoded types list.
pyproject.toml
--------------
* Adds a `bootstrap` console script pointing at the new Processor.
* Remove tg-init-trustgraph, superceded by bootstrap processor
- Connect failures (DNS, connect refused, server disconnect) now
return 502 Bad Gateway with a body that names the upstream URL.
- Other exceptions still return 500 but now include the exception
message in the body and log with exc_info=True so the stack trace
lands in the gateway log.
- Also fixed the logging.error → logger.error inconsistency in the
same block (module had a named logger at the top that wasn't being
used).
flow-svc's long-lived ConfigClient was constructed with
subscription=f"{self.id}--config--{id}", where id=params.get("id") is
the deterministic processor id. On Pulsar the config-response topic
maps to class=response -> Exclusive subscription; when the supervisor
restarts flow-svc within Pulsar's inactive-subscription TTL (minutes),
the previous process's ghost consumer still holds the subscription
and the new process's re-subscribe is rejected with ConsumerBusy,
crash-looping flow-svc.
This is a v2.2 -> v2.3 regression in practice, but not a change in
subscription semantics: the Exclusive mapping for response/notify is
identical between releases. The regression is that PR #822 split
flow-svc out of config-svc and added this new, long-lived
request/response call site — the new site simply didn't follow the
uuid convention used by the equivalent sites elsewhere
(gateway/config/receiver.py, AsyncProcessor._create_config_client).
Fix: generate a fresh uuid per process instance for the subscription
suffix, matching that convention.
The extract-with-ontologies prompt is a JSONL prompt, which means the
prompt service returns a PromptResult with response_type="jsonl" and
the parsed items in `.objects` (plural). The ontology extractor was
reading `.object` (singular) — the field used for response_type="json"
— which is always None for JSONL prompts.
Effect: the parser received None on every chunk, hit its "Unexpected
response type: <class 'NoneType'>" branch, returned no ExtractionResult,
and extract_with_simplified_format returned []. Every extraction
silently produced zero triples.
Graphs populated only with the seed ontology schema (TBox) and
document/chunk provenance — no instance triples at all. The e2e test
threshold of >=100 edges per collection was met by schema + provenance
alone, so the failure mode was invisible until RAG queries couldn't
find any content.
Regression introduced in v2.3 with the token-usage work (commit
56d700f3 / 14e49d83) when PromptClient.prompt() began returning a
PromptResult wrapper instead of the raw text/dict/list. All other
call sites of .prompt() across retrieval/, agent/, orchestrator/ were
already reading the correct field for their prompt's response_type;
ontology extraction was the sole stranded caller.
Also adds tests/unit/test_extract/test_ontology/test_extract_with_simplified_format.py
covering:
- happy path: populated .objects produces non-empty triples
- production failure shape: .objects=None returns [] cleanly
- empty .objects returns [] without raising
- defensive: do not silently fall back to .object for a JSONL prompt
DispatcherManager caches one ServiceRequestor per (flow_id, kind) in
self.dispatchers, lazily created on first use. stop_flow dropped the
flow from self.flows but never touched the cached dispatchers, so
their publisher/subscriber connections persisted — bound to the
per-flow exchanges that flow-svc tears down when the flow stops.
If the same flow id was later re-created, flow-svc re-declared fresh
per-flow exchanges, but the gateway's cached dispatcher still held a
subscription queue bound to the now-gone old response exchange.
Requests went out fine (publishers target exchanges by name and the
new exchange has the right name), but responses landed on an exchange
with no binding to the dispatcher's queue and were silently dropped.
The calling CLI or websocket session hung waiting for a reply that
would never arrive.
Reproduction before fix:
tg-start-flow -i test-flow-1 ...
# any query on test-flow-1 works
tg-stop-flow -i test-flow-1
tg-start-flow -i test-flow-1 ...
tg-show-graph -f test-flow-1 -C <collection> # hangs
Flows that were never stopped (e.g. "default" in a typical session)
were unaffected — their cached dispatcher still pointed at live
plumbing. That's why the bug appeared flow-name-specific at first
glance; it's actually lifecycle-specific.
Fix: in stop_flow, evict and cleanly stop() every cached dispatcher
keyed on the stopped flow id. Next request after restart constructs
a fresh dispatcher against the freshly-declared exchanges. Tuple
shape check preserves global dispatchers, which use (None, kind) as
their key and must survive flow churn.
Uses pop(id, None) instead of del in case stop_flow is invoked
defensively for a flow the gateway never saw.
Introduces `workspace` as the isolation boundary for config, flows,
library, and knowledge data. Removes `user` as a schema-level field
throughout the code, API specs, and tests; workspace provides the
same separation more cleanly at the trusted flow.workspace layer
rather than through client-supplied message fields.
Design
------
- IAM tech spec (docs/tech-specs/iam.md) documents current state,
proposed auth/access model, and migration direction.
- Data ownership model (docs/tech-specs/data-ownership-model.md)
captures the workspace/collection/flow hierarchy.
Schema + messaging
------------------
- Drop `user` field from AgentRequest/Step, GraphRagQuery,
DocumentRagQuery, Triples/Graph/Document/Row EmbeddingsRequest,
Sparql/Rows/Structured QueryRequest, ToolServiceRequest.
- Keep collection/workspace routing via flow.workspace at the
service layer.
- Translators updated to not serialise/deserialise user.
API specs
---------
- OpenAPI schemas and path examples cleaned of user fields.
- Websocket async-api messages updated.
- Removed the unused parameters/User.yaml.
Services + base
---------------
- Librarian, collection manager, knowledge, config: all operations
scoped by workspace. Config client API takes workspace as first
positional arg.
- `flow.workspace` set at flow start time by the infrastructure;
no longer pass-through from clients.
- Tool service drops user-personalisation passthrough.
CLI + SDK
---------
- tg-init-workspace and workspace-aware import/export.
- All tg-* commands drop user args; accept --workspace.
- Python API/SDK (flow, socket_client, async_*, explainability,
library) drop user kwargs from every method signature.
MCP server
----------
- All tool endpoints drop user parameters; socket_manager no longer
keyed per user.
Flow service
------------
- Closure-based topic cleanup on flow stop: only delete topics
whose blueprint template was parameterised AND no remaining
live flow (across all workspaces) still resolves to that topic.
Three scopes fall out naturally from template analysis:
* {id} -> per-flow, deleted on stop
* {blueprint} -> per-blueprint, kept while any flow of the
same blueprint exists
* {workspace} -> per-workspace, kept while any flow in the
workspace exists
* literal -> global, never deleted (e.g. tg.request.librarian)
Fixes a bug where stopping a flow silently destroyed the global
librarian exchange, wedging all library operations until manual
restart.
RabbitMQ backend
----------------
- heartbeat=60, blocked_connection_timeout=300. Catches silently
dead connections (broker restart, orphaned channels, network
partitions) within ~2 heartbeat windows, so the consumer
reconnects and re-binds its queue rather than sitting forever
on a zombie connection.
Tests
-----
- Full test refresh: unit, integration, contract, provenance.
- Dropped user-field assertions and constructor kwargs across
~100 test files.
- Renamed user-collection isolation tests to workspace-collection.
Test fixes for Kafka:
- Consumer: isinstance(max_concurrency, int) instead of is not None:
MagicMock won't pass the check
- Kafka test: updated expected topic name to
tg.request.prompt.default (colons replaced with dots)
- Producer: add delivery callback to surface send errors instead of
silently swallowing them, and raise message.max.bytes to 10MB
- Consumer: raise fetch.message.max.bytes to 10MB to match producer,
tighten session/heartbeat timeouts for fast group joins, and add
partition assign/revoke logging for diagnostics
- Topic naming: replace colons with dots in topic names since Kafka
rejects colons (flow:tg:document-load:default was producing invalid
topic name tg.flow.document-load:default)
- Response consumers: use auto.offset.reset=earliest instead of latest
so responses published before partition assignment aren't lost
- UNKNOWN_TOPIC_OR_PART: treat as timeout instead of fatal error so
consumers wait for auto-created topics instead of crashing
- Concurrency: cap consumer workers to 1 for Kafka since topics have
1 partition — extra consumers trigger rebalance storms that block
all message delivery
A previous commit moved SDK imports into __init__/methods and
stashed them on self, which broke @patch targets in 24 unit tests.
This fixes the approach: chunker and pdf_decoder use module-level
sentinels with global/if-None guards so imports are still deferred but
patchable. Google AI Studio reverts to standard module-level imports
since the module is only loaded when communicating with Gemini.
Keeps lazy loading on other imports.
Third backend alongside Pulsar and RabbitMQ. Topics map 1:1 to Kafka
topics, subscriptions map to consumer groups. Response/notify uses
unique consumer groups with correlation ID filtering. Topic lifecycle
managed via AdminClient with class-based retention.
Initial code drop: Needs major integration testing
The RabbitMQ backend used a single topic exchange per topicspace
with routing keys to differentiate logical topics. This meant the
flow service had to manually create named queues for every
processor-topic pair, including producer-side topics — creating
phantom queues that accumulated unread message copies indefinitely.
Replace with one fanout exchange per logical topic. Consumers now
declare and bind their own queues on connect. The flow service
manages topic lifecycle (create/delete exchanges) rather than queue
lifecycle, and only collects unique topic identifiers instead of
per-processor (topic, subscription) pairs.
Backend API: create_queue/delete_queue/ensure_queue replaced with
create_topic/delete_topic/ensure_topic (subscription parameter
removed).
feat: separate flow service from config service with explicit queue
lifecycle management
The flow service is now an independent service that owns the lifecycle
of flow and blueprint queues. System services own their own queues.
Consumers never create queues.
Flow service separation:
- New service at trustgraph-flow/trustgraph/flow/service/
- Uses async ConfigClient (RequestResponse pattern) to talk to config
service
- Config service stripped of all flow handling
Queue lifecycle management:
- PubSubBackend protocol gains create_queue, delete_queue,
queue_exists, ensure_queue — all async
- RabbitMQ: implements via pika with asyncio.to_thread internally
- Pulsar: stubs for future admin REST API implementation
- Consumer _connect() no longer creates queues (passive=True for named
queues)
- System services call ensure_queue on startup
- Flow service creates queues on flow start, deletes on flow stop
- Flow service ensures queues for pre-existing flows on startup
Two-phase flow stop:
- Phase 1: set flow status to "stopping", delete processor config
entries
- Phase 2: retry queue deletion, then delete flow record
Config restructure:
- active-flow config replaced with processor:{name} types
- Each processor has its own config type, each flow variant is a key
- Flow start/stop use batch put/delete — single config push per
operation
- FlowProcessor subscribes to its own type only
Blueprint format:
- Processor entries split into topics and parameters dicts
- Flow interfaces use {"flow": "topic"} instead of bare strings
- Specs (ConsumerSpec, ProducerSpec, etc.) read from
definition["topics"]
Tests updated
asyncio.iscoroutinefunction is deprecated since Python 3.14 and slated for
removal in 3.16. The inspect equivalent has an identical signature and return
semantics. Replaces 8 call sites across 3 modules to silence DeprecationWarnings
reported in #818.
Python 3.14 deprecates datetime.utcnow(). Replace all 9 occurrences with
datetime.now(timezone.utc) and normalize the output to preserve the existing
ISO-8601 "Z"-suffixed format so downstream parsers are unaffected.
Fixes#814
feat: add type hints to all public functions in trustgraph/base
Add type annotations to 23 modules covering:
- Metrics classes (ConsumerMetrics, ProducerMetrics, etc.)
- Spec classes (ConsumerSpec, ProducerSpec, SubscriberSpec, etc.)
- Service classes with add_args() and run() methods
- Utility functions (logging, pubsub, clients)
- AsyncProcessor methods
All 93 public functions now fully typed.
Refs #785
* refactor: deduplicate imports and move __future__ after docstrings
Addresses review feedback on PR #803:
- Remove duplicate 'from argparse import ArgumentParser' across 12 files
- Move 'from __future__ import annotations' to line 1 in all files
- Clean up excessive blank lines
Add class-level docstrings to five public classes in trustgraph-base:
Flow, LlmService, ConsumerMetrics, ToolClient, and TriplesStoreService.
Each docstring summarises the class's role in the system to aid
discoverability for new contributors.
Signed-off-by: Jenkins, Kenneth Alexander <kjenkins60@gatech.edu>
- Silence pika, cassandra etc. logging at INFO (too much chatter)
- Add per processor log tags so that logs can be understood in
processor group.
- Deal with RabbitMQ lag weirdness
- Added more processor group examples
Processor group implementation: A wrapper to launch multiple
processors in a single processor
- trustgraph-base/trustgraph/base/processor_group.py — group runner
module. run_group(config) is the async body; run() is the
endpoint. Loads JSON or YAML config, validates that every entry
has a unique params.id, instantiates each class via importlib,
shares one TaskGroup, mirrors AsyncProcessor.launch's retry loop
and Prometheus startup.
- trustgraph-base/pyproject.toml — added [project.scripts] block
with processor-group = "trustgraph.base.processor_group:run".
Key behaviours:
- Unique id enforced up front — missing or duplicate params.id fails
fast with a clear error, preventing the Prometheus Info label
collision we flagged.
- No registry — dotted class path is the identifier; any
AsyncProcessor descendant importable at runtime is packable.
- YAML import is lazy — only pulled in if the config file ends in
.yaml/.yml, so JSON-only users don't need PyYAML installed.
- Single Prometheus server — start_http_server runs once at
startup, before the retry loop, matching launch()'s pattern.
- Retry loop — same shape as AsyncProcessor.launch: catches
ExceptionGroup from TaskGroup, logs, sleeps 4s,
retries. Fail-group semantics (one processor dying tears down the
group) — simple and surfaces bugs, as discussed.
Example config:
processors:
- class: trustgraph.extract.kg.definitions.extract.Processor
params:
id: kg-extract-definitions
- class: trustgraph.chunking.recursive.Processor
params:
id: chunker-recursive
Run with processor-group -c group.yaml.
Native CLI i18n: The TrustGraph CLI has built-in translation support
that dynamically loads language strings. You can test and use
different languages by simply passing the --lang flag (e.g., --lang
es for Spanish, --lang ru for Russian) or by configuring your
environment's LANG variable.
Automated Docs Translations: This PR introduces autonomously
translated Markdown documentation into several target languages,
including Spanish, Swahili, Portuguese, Turkish, Hindi, Hebrew,
Arabic, Simplified Chinese, and Russian.
Fix test suite Prometheus registry pollution and remove default coverage
The test_metrics.py fixture (added in 4e63dbda) deleted
class-level metric singleton attributes without restoring them. This
stripped the hasattr() guards that prevent duplicate Prometheus
registration, so every subsequent Processor construction in the test
run raised ValueError: Duplicated timeseries. Fix by saving and
restoring the original attributes around each test.
Remove --cov=trustgraph from pytest.ini defaults — the coverage
import hooks interact badly with the trustgraph namespace package,
causing duplicate class loading. Coverage can still be requested
explicitly via the command line.
- add unit tests for base metrics, logging, spec, parameter_spec,
and flow modules
- add a lightweight test-only module loader so these tests can run
without optional runtime dependencies
- fix Parameter.start/stop to accept self