trustgraph/docs/tech-specs/bootstrap.md
cybermaggedon ae9936c9cc
feat: pluggable bootstrap framework with ordered initialisers (#847)
A generic, long-running bootstrap processor that converges a
deployment to its configured initial state and then idles.
Replaces the previous one-shot `tg-init-trustgraph` container model
and provides an extension point for enterprise / third-party
initialisers.

See docs/tech-specs/bootstrap.md for the full design.

Bootstrapper
------------
A single AsyncProcessor (trustgraph.bootstrap.bootstrapper.Processor)
that:

  * Reads a list of initialiser specifications (class, name, flag,
    params) from either a direct `initialisers` parameter
    (processor-group embedding) or a YAML/JSON file (`-c`, CLI).
  * On each wake, runs a cheap service-gate (config-svc +
    flow-svc round-trips), then iterates the initialiser list,
    running each whose configured flag differs from the one stored
    in __system__/init-state/<name>.
  * Stores per-initialiser completion state in the reserved
    __system__ workspace.
  * Adapts cadence: ~5s on gate failure, ~15s while converging,
    ~300s in steady state.
  * Isolates failures — one initialiser's exception does not block
    others in the same cycle; the failed one retries next wake.

Initialiser contract
--------------------
  * Subclass trustgraph.bootstrap.base.Initialiser.
  * Implement async run(ctx, old_flag, new_flag).
  * Opt out of the service gate with class attr
    wait_for_services=False (only used by PulsarTopology, since
    config-svc cannot come up until Pulsar namespaces exist).
  * ctx carries short-lived config and flow-svc clients plus a
    scoped logger.

Core initialisers (trustgraph.bootstrap.initialisers.*)
-------------------------------------------------------
  * PulsarTopology   — creates Pulsar tenant + namespaces
                       (pre-gate, blocking HTTP offloaded to
                        executor).
  * TemplateSeed     — seeds __template__ from an external JSON
                       file; re-run is upsert-missing by default,
                       overwrite-all opt-in.
  * WorkspaceInit    — populates a named workspace from either
                       the full contents of __template__ or a
                       seed file; raises cleanly if the template
                       isn't seeded yet so the bootstrapper retries
                       on the next cycle.
  * DefaultFlowStart — starts a specific flow in a workspace;
                       no-ops if the flow is already running.

Enterprise or third-party initialisers plug in via fully-qualified
dotted class paths in the bootstrapper's configuration — no core
code change required.

Config service
--------------
  * push(): filter out reserved workspaces (ids starting with "_")
    from the change notifications.  Stored config is preserved; only
    the broadcast is suppressed, so bootstrap / template state lives
    in config-svc without live processors ever reacting to it.

Config client
-------------
  * ConfigClient.get_all(workspace): wraps the existing `config`
    operation to return {type: {key: value}} for a workspace.
    WorkspaceInit uses it to copy __template__ without needing a
    hardcoded types list.

pyproject.toml
--------------
  * Adds a `bootstrap` console script pointing at the new Processor.

* Remove tg-init-trustgraph, superseded by bootstrap processor
2026-04-22 18:03:46 +01:00

layout: default
title: Bootstrap Framework Technical Specification
parent: Tech Specs

Bootstrap Framework Technical Specification

Overview

A generic, pluggable framework for running one-time initialisation steps against a TrustGraph deployment — replacing the dedicated tg-init-trustgraph container with a long-running processor that converges the system to a desired initial state and then idles.

The framework is content-agnostic. It knows how to run, retry, mark-as-done, and surface failures; the actual init work lives in small pluggable classes called initialisers. Core initialisers ship in the trustgraph-flow package; enterprise and third-party initialisers can be loaded by dotted path without any core code change.

Motivation

The existing tg-init-trustgraph is a one-shot CLI run in its own container. It performs two very different jobs (Pulsar topology setup and config seeding) in a single script, occupies a whole container for work that runs once, cannot handle partial-success states, and cannot be extended with enterprise-specific concerns (user provisioning, workspace initialisation, IAM scaffolding) without forking the tool.

A pluggable, long-running reconciler addresses all of this and slots naturally into the existing processor-group model.

Design

Bootstrapper Processor

A single AsyncProcessor subclass. One entry in a processor group. Parameters include the processor's own identity and a list of initialiser specifications — each spec names a class (by dotted path), a unique instance name, a flag string, and the parameters that will be passed to the initialiser's constructor.

On each wake the bootstrapper does the following, in order:

  1. Open a short-lived context (config client, flow-svc client, logger). The context is torn down at the end of the wake so steady-state idle cost is effectively nil.
  2. Run all pre-service initialisers (those that opt out of the service gate — principally PulsarTopology, which must run before the services it gates on can even come up).
  3. Check the service gate: cheap round-trips to config-svc and flow-svc. If either fails, skip to the sleep step using the short gate-retry cadence.
  4. Run all post-service initialisers that haven't already completed at the currently-configured flag.
  5. Sleep. Cadence adapts to state (see below).
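
The ordering of the steps above can be sketched roughly as follows. This is an illustration only, not the shipped implementation: `Spec`, `wake_once`, and the `gate_ok` callable are hypothetical names, completion state is held in a plain dict instead of config-svc, and the sketch assumes pre-service initialisers are flag-compared in the same way as post-service ones.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List, Optional


@dataclass
class Spec:
    """Hypothetical in-memory form of one initialiser entry."""
    name: str
    flag: str
    wait_for_services: bool
    run: Callable[[Optional[str], str], Awaitable[None]]


async def wake_once(specs: List[Spec], stored: Dict[str, str], gate_ok) -> str:
    """Run one wake and report 'gate-fail', 'converging' or 'steady'."""

    async def attempt(spec: Spec) -> bool:
        old = stored.get(spec.name)
        if old == spec.flag:
            return True                  # already at the configured flag
        try:
            await spec.run(old, spec.flag)
        except Exception:
            return False                 # transient failure; retry next wake
        stored[spec.name] = spec.flag    # record success
        return True

    # Step 2: pre-service initialisers run before the gate is checked.
    pre = [await attempt(s) for s in specs if not s.wait_for_services]

    # Step 3: if the gate fails, skip the post-service initialisers.
    if not await gate_ok():
        return "gate-fail"

    # Step 4: post-service initialisers, in configured order.
    post = [await attempt(s) for s in specs if s.wait_for_services]
    return "steady" if all(pre + post) else "converging"
```

The returned label is what a caller would use to pick the sleep duration for step 5.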

Initialiser Contract

An initialiser is a class with:

  • A class-level name identifier, unique within the bootstrapper's configuration. This is the key under which completion state is stored.
  • A class-level wait_for_services flag. When True (the default) the initialiser runs only after the service gate passes. When False, it runs before the gate, on every wake.
  • A constructor that accepts the initialiser's own params as kwargs.
  • An async run(ctx, old_flag, new_flag) method that performs the init work and returns on success. Any raised exception is logged and treated as a transient failure — the stored flag is not updated and the initialiser will re-run on the next cycle.

old_flag is the previously-stored flag string, or None if the initialiser has never successfully run in this deployment. new_flag is the flag the operator has configured for this run. This pair lets an initialiser distinguish a clean first-run from a migration between flag versions and behave accordingly (see "Flag change and re-run safety" below).
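
As a minimal sketch of the contract, the example below defines a local stand-in for the base class so that it runs on its own; a real initialiser would subclass the base class shipped with the framework instead. `EnsureBanner`, its `text` parameter, and the `applied` list are hypothetical and exist only to show the shape of `run()` and the use of the flag pair.

```python
import asyncio
import logging
from types import SimpleNamespace


class Initialiser:
    """Local stand-in for the framework base class (illustration only)."""
    wait_for_services = True

    async def run(self, ctx, old_flag, new_flag):
        raise NotImplementedError


class EnsureBanner(Initialiser):
    """Hypothetical initialiser that records what it would have applied."""

    def __init__(self, text="hello", **params):
        # Constructor receives the initialiser's own params as kwargs.
        self.text = text
        self.applied = []

    async def run(self, ctx, old_flag, new_flag):
        if old_flag is None:
            # Never succeeded in this deployment: clean first run.
            ctx.logger.info("first run at flag %s", new_flag)
            self.applied.append(("create", self.text))
        else:
            # Flag was bumped: re-apply or migrate from the prior state.
            ctx.logger.info("migrating %s -> %s", old_flag, new_flag)
            self.applied.append(("update", self.text))
        # Returning normally signals success; raising requests a retry.


# Stand-in context: only the logger is used by this example.
ctx = SimpleNamespace(
    logger=logging.getLogger("bootstrap.ensure-banner"),
    config=None,
    flow=None,
)
init = EnsureBanner(text="welcome")
asyncio.run(init.run(ctx, None, "v1"))   # first run
asyncio.run(init.run(ctx, "v1", "v2"))   # re-run after a flag bump
```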

Context

The context is the bootstrapper-owned object passed to every initialiser's run() method. Its fields are deliberately narrow:

| Field | Purpose |
|-------|---------|
| logger | A child logger named for the initialiser instance |
| config | A short-lived ConfigClient for config-svc reads/writes |
| flow | A short-lived RequestResponse client for flow-svc |

The context is always fully-populated regardless of which services a given initialiser uses, for symmetry. Additional fields may be added in future without breaking existing initialisers. Clients are started at the beginning of a wake cycle and stopped at the end.

Initialisers that need services beyond config-svc and flow-svc are responsible for their own readiness checks and for raising cleanly when a prerequisite is not met.

Completion State

Per-initialiser completion state is stored in the reserved __system__ workspace, under a dedicated config type for bootstrap state. The stored value is the flag string that was configured when the initialiser last succeeded.

On each cycle, for each initialiser, the bootstrapper reads the stored flag and compares it to the currently-configured flag. If they match, the initialiser is skipped silently. If they differ, the initialiser runs; on success, the stored flag is updated.

Because the state lives in a reserved (_-prefixed) workspace, it is stored by config-svc but excluded from the config push broadcast. Live processors never see it and cannot act on it.
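
The comparison can be pictured with a nested dict standing in for config-svc, keyed as workspace, then config type, then key. The helper names below are hypothetical; only the `__system__` workspace and the `init-state` type name come from this specification.

```python
from typing import Dict, Optional

SYSTEM_WORKSPACE = "__system__"
STATE_TYPE = "init-state"

# Stand-in for config-svc: {workspace: {type: {key: value}}}
Store = Dict[str, Dict[str, Dict[str, str]]]


def stored_flag(store: Store, name: str) -> Optional[str]:
    """Return the flag recorded at the last success, or None if never run."""
    return store.get(SYSTEM_WORKSPACE, {}).get(STATE_TYPE, {}).get(name)


def needs_run(store: Store, name: str, configured_flag: str) -> bool:
    """An initialiser runs only when stored and configured flags differ."""
    return stored_flag(store, name) != configured_flag


def mark_done(store: Store, name: str, flag: str) -> None:
    """Record the configured flag after a successful run."""
    store.setdefault(SYSTEM_WORKSPACE, {}).setdefault(STATE_TYPE, {})[name] = flag


store: Store = {}
first = needs_run(store, "template-seed", "v1")    # never run, so it runs
mark_done(store, "template-seed", "v1")
second = needs_run(store, "template-seed", "v1")   # flags match, so skipped
third = needs_run(store, "template-seed", "v2")    # flag bumped, so it runs
```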

The Service Gate

The gate is a cheap, bootstrapper-internal check that config-svc and flow-svc are both reachable and responsive. It is intentionally a simple pair of low-cost round-trips — a config list against __system__ and a flow-svc list-blueprints — rather than any deeper health check.

Its purpose is to avoid filling logs with noise and to concentrate retry effort during the brief window when services are coming up. The gate is applied only to initialisers with wait_for_services=True (the default); False is reserved for initialisers that set up infrastructure the gate itself depends on.
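
A gate of this kind can be sketched as a sequence of awaited probes where any error or timeout counts as "not ready". The `gate_ok` name, the list-of-callables shape, and the two-second timeout are assumptions for illustration; in the real processor the probes would be the config-svc and flow-svc round-trips described above.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def gate_ok(
    checks: Iterable[Callable[[], Awaitable[object]]],
    timeout: float = 2.0,   # assumed value, not taken from the spec
) -> bool:
    """Return True only if every probe answers within the timeout."""
    for check in checks:
        try:
            await asyncio.wait_for(check(), timeout)
        except Exception:
            # Any failure, including a timeout, means the gate is closed.
            return False
    return True
```

Treating every exception the same way keeps the gate cheap and avoids log noise: the caller only needs to know whether to back off.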

Adaptive Cadence

The sleep between wake cycles is chosen from three tiers based on observed state:

| Tier | Duration | When |
|------|----------|------|
| Gate backoff | ~5 s | Services not responding; concentrate retry during startup |
| Init retry | ~15 s | Gate passes but at least one initialiser is not yet at its configured flag (transient failures, waiting on prereqs, recently-bumped flag not yet applied) |
| Steady | ~300 s | All configured initialisers at their configured flag; gate passes; nothing to do |

The short tiers ensure a fresh deployment converges quickly; steady state costs a single round-trip per initialiser roughly every five minutes.
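
The tier selection reduces to a small pure function. The constant names and the `next_sleep` signature are hypothetical, and the durations are the nominal values from the table; the shipped values may differ.

```python
GATE_BACKOFF_S = 5.0    # services not responding
INIT_RETRY_S = 15.0     # gate passed, something still pending
STEADY_S = 300.0        # everything at its configured flag


def next_sleep(gate_passed: bool, pending: int) -> float:
    """Pick the sleep duration from the state observed in this wake.

    pending is the number of initialisers not yet at their configured flag.
    """
    if not gate_passed:
        return GATE_BACKOFF_S
    if pending > 0:
        return INIT_RETRY_S
    return STEADY_S
```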

Failure Handling

An initialiser raising an exception does not stop the bootstrapper or block other initialisers. Each initialiser in the cycle is attempted independently; failures are logged and retried on the next cycle. This means there is no ordered-DAG enforcement: the order of initialisers in the configuration determines the attempt order within a cycle, but a dependency between two initialisers is expressed by the dependant raising cleanly when its prerequisite isn't satisfied. Over successive cycles the system converges.
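
The convergence behaviour can be seen in a toy run where the dependant is deliberately listed before its prerequisite. Everything here is a stand-in: `run_cycle`, `PrerequisiteNotReady`, and the `world` dict are hypothetical, and a set of names replaces the stored flags.

```python
import asyncio


class PrerequisiteNotReady(Exception):
    """Hypothetical error raised when a prerequisite is not yet satisfied."""


async def run_cycle(initialisers, done):
    """Attempt each pending initialiser; one failure never blocks the rest."""
    for name, run in initialisers:
        if name in done:
            continue
        try:
            await run()
        except Exception as exc:
            print(f"{name}: will retry next cycle ({exc})")
            continue
        done.add(name)


world = {}


async def template_seed():
    world["template"] = {"example-type": {"example-key": "value"}}


async def workspace_init():
    if "template" not in world:
        raise PrerequisiteNotReady("template not seeded yet")
    world["workspace"] = dict(world["template"])


# Dependant listed first, so it fails once and succeeds on the next cycle.
order = [("workspace-init", workspace_init), ("template-seed", template_seed)]
done = set()
asyncio.run(run_cycle(order, done))   # template-seed succeeds
asyncio.run(run_cycle(order, done))   # workspace-init now succeeds
```

Listing the prerequisite first would let both complete in a single cycle, which is why configuration order still matters even without DAG enforcement.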

Flag Change and Re-run Safety

Each initialiser's completion state is a string flag chosen by the operator. Typically these follow a simple version pattern (v1, v2, ...), but the bootstrapper imposes no format.

Changing the flag in the group configuration causes the corresponding initialiser to re-run on the next cycle. Initialisers must be written so that re-running after a flag bump is safe — they receive both the previous and the new flag and are responsible for either cleanly re-applying the work or performing a step-change migration from the prior state.

This gives operators an explicit, visible mechanism for triggering re-initialisation. Re-runs are never implicit.

Core Initialisers

The following initialisers ship in trustgraph.bootstrap.initialisers and cover the base deployment case.

PulsarTopology

Creates the Pulsar tenant and the four namespaces (flow, request, response, notify) with appropriate retention policies if they don't exist.

Opts out of the service gate (wait_for_services = False) because config-svc and flow-svc cannot come online until the Pulsar namespaces exist.

Parameters: Pulsar admin URL, tenant name.

Idempotent via the admin API (GET-then-PUT). Flag change causes re-evaluation of all namespaces; any absent are created.
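
The check-then-create pattern, run off the event loop, might look like the sketch below. It is not the shipped code: `FakeAdmin` is an in-memory stand-in for a blocking HTTP client, the resource keys are illustrative labels and not real admin REST paths, the tenant name `tg` in the usage is an assumption, and retention policies are omitted.

```python
import asyncio

NAMESPACES = ("flow", "request", "response", "notify")


class FakeAdmin:
    """In-memory stand-in for a blocking Pulsar admin client."""

    def __init__(self):
        self.resources = set()
        self.puts = 0

    def exists(self, key):        # would be a GET in a real client
        return key in self.resources

    def create(self, key):        # would be a PUT in a real client
        self.puts += 1
        self.resources.add(key)


def ensure_topology(admin, tenant):
    """Blocking: create the tenant and any absent namespaces."""
    keys = [f"tenants/{tenant}"]
    keys += [f"namespaces/{tenant}/{ns}" for ns in NAMESPACES]
    created = []
    for key in keys:
        if not admin.exists(key):
            admin.create(key)
            created.append(key)
    return created


async def run_topology(admin, tenant):
    """Offload the blocking calls to the default executor."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, ensure_topology, admin, tenant)


admin = FakeAdmin()
first = asyncio.run(run_topology(admin, "tg"))    # creates everything
second = asyncio.run(run_topology(admin, "tg"))   # nothing left to create
```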

TemplateSeed

Populates the reserved __template__ workspace from an external JSON seed file. The seed file has the standard shape of {config-type: {config-key: value}}.

Runs post-gate. Parameters: path to the seed file, overwrite policy (upsert-missing only, or overwrite-all).

On a clean first run, writes the whole file. On flag change, behaviour depends on the overwrite policy: typically upsert-missing, so that operator-customised keys are preserved across seed-file upgrades.
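
The difference between the two policies can be shown on plain dicts in the seed-file shape. `plan_writes` is a hypothetical helper that only computes which keys would be written; it does not talk to config-svc.

```python
from typing import Dict, List, Tuple

Config = Dict[str, Dict[str, str]]   # {config-type: {config-key: value}}


def plan_writes(existing: Config, seed: Config,
                overwrite: bool = False) -> List[Tuple[str, str, str]]:
    """Return (type, key, value) triples that the seed would write.

    overwrite=False is upsert-missing: keys already present are left alone.
    overwrite=True is overwrite-all: every seed key is written.
    """
    writes = []
    for ctype, entries in seed.items():
        current = existing.get(ctype, {})
        for key, value in entries.items():
            if overwrite or key not in current:
                writes.append((ctype, key, value))
    return writes


existing = {"prompt": {"system": "operator-customised"}}
seed = {"prompt": {"system": "factory", "user": "factory"}}

upsert = plan_writes(existing, seed)                   # only the missing key
replace = plan_writes(existing, seed, overwrite=True)  # both keys
```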

WorkspaceInit

Creates a named workspace and populates it from the seed file or from the full contents of the __template__ workspace.

Runs post-gate. Parameters: workspace name, source (seed file or __template__), optional seed_file path, overwrite flag.

When source is template, the initialiser copies every config type and key present in __template__ — there is no per-type selection. Deployments that want to seed only a subset should either curate the seed file they feed to TemplateSeed or use source: seed-file directly here.

Raises cleanly if its source does not exist. When the source is the template, this means WorkspaceInit depends on TemplateSeed having run in the same cycle or a prior one.

DefaultFlowStart

Starts a specific flow in a specific workspace using a specific blueprint.

Runs post-gate. Parameters: workspace name, flow id, blueprint name, description, optional parameter overrides.

Separated from WorkspaceInit deliberately so that deployments which want a workspace without an auto-started flow can simply omit this initialiser from their bootstrap configuration.

Extensibility

New initialisers are added by:

  1. Subclassing the initialiser base class.
  2. Implementing run(ctx, old_flag, new_flag).
  3. Choosing wait_for_services (almost always True).
  4. Adding an entry in the bootstrapper's configuration with the new class's dotted path.

No core code changes are required to add an enterprise or third-party initialiser. Enterprise builds ship their own package with their own initialiser classes (e.g. CreateAdminUser, ProvisionWorkspaces) and reference them in the bootstrapper config alongside the core initialisers.
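
Loading a class from a dotted path needs nothing beyond the standard library. The sketch below is an assumption about how such loading could work, not the bootstrapper's actual loader; `load_class` is a hypothetical helper, and a standard-library class stands in for an initialiser so the example runs without TrustGraph installed. The spec fields (class, name, flag, params) follow the bootstrapper's configuration.

```python
import importlib


def load_class(dotted_path: str):
    """Resolve 'package.module.ClassName' to the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"not a dotted class path: {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# One initialiser specification; the class path here is a stand-in.
spec = {
    "class": "collections.OrderedDict",
    "name": "demo",
    "flag": "v1",
    "params": {},
}

cls = load_class(spec["class"])
instance = cls(**spec["params"])   # params are passed to the constructor
```

An import error or missing attribute surfaces as an ordinary exception, so a misconfigured path fails loudly at load time.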

Reserved Workspaces

This specification relies on the "reserved workspace" convention:

  • Any workspace id beginning with _ is reserved.
  • Reserved workspaces are stored normally by config-svc but never appear in the config push broadcast.
  • Live processors cannot react to reserved-workspace state.
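
The convention amounts to a prefix test applied when building the broadcast. The function names are hypothetical; only the leading-underscore rule comes from this specification.

```python
from typing import Iterable, List


def is_reserved(workspace_id: str) -> bool:
    """A workspace id beginning with '_' is reserved."""
    return workspace_id.startswith("_")


def broadcastable(changed: Iterable[str]) -> List[str]:
    """Drop reserved workspaces from a list of changed workspace ids.

    Stored config is untouched; only the notification list is filtered.
    """
    return [w for w in changed if not is_reserved(w)]


visible = broadcastable(["default", "__system__", "__template__", "team-a"])
```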

The bootstrapper uses two reserved workspaces:

  • __template__ — factory-default seed config, readable by initialisers that copy-from-template.
  • __system__ — bootstrapper completion state (under the init-state config type) and any other system-internal bookkeeping.

See the reserved-workspace convention in the config service for the general rule and its enforcement.

Non-Goals

  • No DAG scheduling across initialisers. Dependencies are expressed by the dependant failing cleanly until its prerequisite is met; the system converges over subsequent cycles.
  • No parallel execution of initialisers within a cycle. A cycle runs each initialiser sequentially.
  • No implicit re-runs. Re-running an initialiser requires an explicit flag change by the operator.
  • No cross-initialiser atomicity. Each initialiser's completion is recorded independently on its own success.

Operational Notes

  • Running the bootstrapper as a processor-group entry replaces the previous tg-init-trustgraph container. The bootstrapper is also CLI-invocable directly for standalone testing via Processor.launch(...).
  • First-boot convergence is typically a handful of short cycles followed by a transition to the steady cadence. Deployments should expect the first few minutes of logs to show initialisation activity, followed by effective silence.
  • Bumping a flag is a deliberate operational act. The log line emitted on re-run makes the event visible for audit.