A generic, long-running bootstrap processor that converges a
deployment to its configured initial state and then idles.
Replaces the previous one-shot `tg-init-trustgraph` container model
and provides an extension point for enterprise / third-party
initialisers.
See docs/tech-specs/bootstrap.md for the full design.
Bootstrapper
------------
A single AsyncProcessor (trustgraph.bootstrap.bootstrapper.Processor)
that:
* Reads a list of initialiser specifications (class, name, flag,
params) from either a direct `initialisers` parameter
(processor-group embedding) or a YAML/JSON file (`-c`, CLI).
* On each wake, runs a cheap service-gate (config-svc +
flow-svc round-trips), then iterates the initialiser list,
running each whose configured flag differs from the one stored
in __system__/init-state/<name>.
* Stores per-initialiser completion state in the reserved
__system__ workspace.
* Adapts cadence: ~5s on gate failure, ~15s while converging,
~300s in steady state.
* Isolates failures — one initialiser's exception does not block
others in the same cycle; the failed one retries next wake.
Initialiser contract
--------------------
* Subclass trustgraph.bootstrap.base.Initialiser.
* Implement async run(ctx, old_flag, new_flag).
* Opt out of the service gate with class attr
wait_for_services=False (only used by PulsarTopology, since
config-svc cannot come up until Pulsar namespaces exist).
* ctx carries short-lived config and flow-svc clients plus a
scoped logger.
Core initialisers (trustgraph.bootstrap.initialisers.*)
-------------------------------------------------------
* PulsarTopology — creates Pulsar tenant + namespaces
(pre-gate, blocking HTTP offloaded to
executor).
* TemplateSeed — seeds __template__ from an external JSON
file; re-run is upsert-missing by default,
overwrite-all opt-in.
* WorkspaceInit — populates a named workspace from either
the full contents of __template__ or a
seed file; raises cleanly if the template
isn't seeded yet so the bootstrapper retries
on the next cycle.
* DefaultFlowStart — starts a specific flow in a workspace;
no-ops if the flow is already running.
Enterprise or third-party initialisers plug in via fully-qualified
dotted class paths in the bootstrapper's configuration — no core
code change required.
Config service
--------------
* push(): filter out reserved workspaces (ids starting with "_")
from the change notifications. Stored config is preserved; only
the broadcast is suppressed, so bootstrap / template state lives
in config-svc without live processors ever reacting to it.
Config client
-------------
* ConfigClient.get_all(workspace): wraps the existing `config`
operation to return {type: {key: value}} for a workspace.
WorkspaceInit uses it to copy __template__ without needing a
hardcoded types list.
pyproject.toml
--------------
* Adds a `bootstrap` console script pointing at the new Processor.
* Remove tg-init-trustgraph, superceded by bootstrap processor
12 KiB
| layout | title | parent |
|---|---|---|
| default | Bootstrap Framework Technical Specification | Tech Specs |
Bootstrap Framework Technical Specification
Overview
A generic, pluggable framework for running one-time initialisation steps
against a TrustGraph deployment — replacing the dedicated
tg-init-trustgraph container with a long-running processor that
converges the system to a desired initial state and then idles.
The framework is content-agnostic. It knows how to run, retry,
mark-as-done, and surface failures; the actual init work lives in
small pluggable classes called initialisers. Core initialisers
ship in the trustgraph-flow package; enterprise and third-party
initialisers can be loaded by dotted path without any core code
change.
Motivation
The existing tg-init-trustgraph is a one-shot CLI run in its own
container. It performs two very different jobs (Pulsar topology
setup and config seeding) in a single script, is wasteful as a whole
container, cannot handle partial-success states, and has no way to
extend the boot process with enterprise-specific concerns (user
provisioning, workspace initialisation, IAM scaffolding) without
forking the tool.
A pluggable, long-running reconciler addresses all of this and slots naturally into the existing processor-group model.
Design
Bootstrapper Processor
A single AsyncProcessor subclass. One entry in a processor group.
Parameters include the processor's own identity and a list of
initialiser specifications — each spec names a class (by dotted
path), a unique instance name, a flag string, and the parameters
that will be passed to the initialiser's constructor.
On each wake the bootstrapper does the following, in order:
- Open a short-lived context (config client, flow-svc client, logger). The context is torn down at the end of the wake so steady-state idle cost is effectively nil.
- Run all pre-service initialisers (those that opt out of the
service gate — principally
PulsarTopology, which must run before the services it gates on can even come up). - Check the service gate: cheap round-trips to config-svc and flow-svc. If either fails, skip to the sleep step using the short gate-retry cadence.
- Run all post-service initialisers that haven't already completed at the currently-configured flag.
- Sleep. Cadence adapts to state (see below).
Initialiser Contract
An initialiser is a class with:
- A class-level
nameidentifier, unique within the bootstrapper's configuration. This is the key under which completion state is stored. - A class-level
wait_for_servicesflag. WhenTrue(the default) the initialiser runs only after the service gate passes. WhenFalse, it runs before the gate, on every wake. - A constructor that accepts the initialiser's own params as kwargs.
- An async
run(ctx, old_flag, new_flag)method that performs the init work and returns on success. Any raised exception is logged and treated as a transient failure — the stored flag is not updated and the initialiser will re-run on the next cycle.
old_flag is the previously-stored flag string, or None if the
initialiser has never successfully run in this deployment. new_flag
is the flag the operator has configured for this run. This pair
lets an initialiser distinguish a clean first-run from a migration
between flag versions and behave accordingly (see "Flag change and
re-run safety" below).
Context
The context is the bootstrapper-owned object passed to every
initialiser's run() method. Its fields are deliberately narrow:
| Field | Purpose |
|---|---|
logger |
A child logger named for the initialiser instance |
config |
A short-lived ConfigClient for config-svc reads/writes |
flow |
A short-lived RequestResponse client for flow-svc |
The context is always fully-populated regardless of which services a given initialiser uses, for symmetry. Additional fields may be added in future without breaking existing initialisers. Clients are started at the beginning of a wake cycle and stopped at the end.
Initialisers that need services beyond config-svc and flow-svc are responsible for their own readiness checks and for raising cleanly when a prerequisite is not met.
Completion State
Per-initialiser completion state is stored in the reserved
__system__ workspace, under a dedicated config type for bootstrap
state. The stored value is the flag string that was configured when
the initialiser last succeeded.
On each cycle, for each initialiser, the bootstrapper reads the stored flag and compares it to the currently-configured flag. If they match, the initialiser is skipped silently. If they differ, the initialiser runs; on success, the stored flag is updated.
Because the state lives in a reserved (_-prefixed) workspace, it
is stored by config-svc but excluded from the config push broadcast.
Live processors never see it and cannot act on it.
The Service Gate
The gate is a cheap, bootstrapper-internal check that config-svc
and flow-svc are both reachable and responsive. It is intentionally
a simple pair of low-cost round-trips — a config list against
__system__ and a flow-svc list-blueprints — rather than any
deeper health check.
Its purpose is to avoid filling logs with noise and to concentrate
retry effort during the brief window when services are coming up.
The gate is applied only to initialisers with
wait_for_services=True (the default); False is reserved for
initialisers that set up infrastructure the gate itself depends on.
Adaptive Cadence
The sleep between wake cycles is chosen from three tiers based on observed state:
| Tier | Duration | When |
|---|---|---|
| Gate backoff | ~5 s | Services not responding — concentrate retry during startup |
| Init retry | ~15 s | Gate passes but at least one initialiser is not yet at its configured flag — transient failures, waiting on prereqs, recently-bumped flag not yet applied |
| Steady | ~300 s | All configured initialisers at their configured flag; gate passes; nothing to do |
The short tiers ensure a fresh deployment converges quickly; steady state costs a single round-trip per initialiser every few minutes.
Failure Handling
An initialiser raising an exception does not stop the bootstrapper or block other initialisers. Each initialiser in the cycle is attempted independently; failures are logged and retried on the next cycle. This means there is no ordered-DAG enforcement: order of initialisers in the configuration determines the attempt order within a cycle, but a dependency between two initialisers is expressed by the dependant raising cleanly when its prerequisite isn't satisfied. Over successive cycles the system converges.
Flag Change and Re-run Safety
Each initialiser's completion state is a string flag chosen by the
operator. Typically these follow a simple version pattern
(v1, v2, ...), but the bootstrapper imposes no format.
Changing the flag in the group configuration causes the corresponding initialiser to re-run on the next cycle. Initialisers must be written so that re-running after a flag bump is safe — they receive both the previous and the new flag and are responsible for either cleanly re-applying the work or performing a step-change migration from the prior state.
This gives operators an explicit, visible mechanism for triggering re-initialisation. Re-runs are never implicit.
Core Initialisers
The following initialisers ship in trustgraph.bootstrap.initialisers
and cover the base deployment case.
PulsarTopology
Creates the Pulsar tenant and the four namespaces
(flow, request, response, notify) with appropriate
retention policies if they don't exist.
Opts out of the service gate (wait_for_services = False) because
config-svc and flow-svc cannot come online until the Pulsar
namespaces exist.
Parameters: Pulsar admin URL, tenant name.
Idempotent via the admin API (GET-then-PUT). Flag change causes re-evaluation of all namespaces; any absent are created.
TemplateSeed
Populates the reserved __template__ workspace from an external
JSON seed file. The seed file has the standard shape of
{config-type: {config-key: value}}.
Runs post-gate. Parameters: path to the seed file, overwrite policy (upsert-missing only, or overwrite-all).
On clean run, writes the whole file. On flag change, behaviour depends on the overwrite policy — typically upsert-missing so that operator-customised keys are preserved across seed-file upgrades.
WorkspaceInit
Creates a named workspace and populates it from the seed file or
from the full contents of the __template__ workspace.
Runs post-gate. Parameters: workspace name, source (seed file or
__template__), optional seed_file path, overwrite flag.
When source is template, the initialiser copies every config
type and key present in __template__ — there is no per-type
selection. Deployments that want to seed only a subset should
either curate the seed file they feed to TemplateSeed or use
source: seed-file directly here.
Raises cleanly if its source does not exist — depends on
TemplateSeed having run in the same cycle or a prior one.
DefaultFlowStart
Starts a specific flow in a specific workspace using a specific blueprint.
Runs post-gate. Parameters: workspace name, flow id, blueprint name, description, optional parameter overrides.
Separated from WorkspaceInit deliberately so that deployments
which want a workspace without an auto-started flow can simply omit
this initialiser from their bootstrap configuration.
Extensibility
New initialisers are added by:
- Subclassing the initialiser base class.
- Implementing
run(ctx, old_flag, new_flag). - Choosing
wait_for_services(almost alwaysTrue). - Adding an entry in the bootstrapper's configuration with the new class's dotted path.
No core code changes are required to add an enterprise or third-party
initialiser. Enterprise builds ship their own package with their own
initialiser classes (e.g. CreateAdminUser, ProvisionWorkspaces)
and reference them in the bootstrapper config alongside the core
initialisers.
Reserved Workspaces
This specification relies on the "reserved workspace" convention:
- Any workspace id beginning with
_is reserved. - Reserved workspaces are stored normally by config-svc but never appear in the config push broadcast.
- Live processors cannot react to reserved-workspace state.
The bootstrapper uses two reserved workspaces:
__template__— factory-default seed config, readable by initialisers that copy-from-template.__system__— bootstrapper completion state (under theinit-stateconfig type) and any other system-internal bookkeeping.
See the reserved-workspace convention in the config service for the general rule and its enforcement.
Non-Goals
- No DAG scheduling across initialisers. Dependencies are expressed by the dependant failing cleanly until its prerequisite is met, and convergence over subsequent cycles.
- No parallel execution of initialisers within a cycle. A cycle runs each initialiser sequentially.
- No implicit re-runs. Re-running an initialiser requires an explicit flag change by the operator.
- No cross-initialiser atomicity. Each initialiser's completion is recorded independently on its own success.
Operational Notes
- Running the bootstrapper as a processor-group entry replaces the
previous
tg-init-trustgraphcontainer. The bootstrapper is also CLI-invocable directly for standalone testing viaProcessor.launch(...). - First-boot convergence is typically a handful of short cycles followed by a transition to the steady cadence. Deployments should expect the first few minutes of logs to show initialisation activity, thereafter effective silence.
- Bumping a flag is a deliberate operational act. The log line emitted on re-run makes the event visible for audit.