trustgraph/docs/tech-specs/bootstrap.md

---
layout: default
title: "Bootstrap Framework Technical Specification"
parent: "Tech Specs"
---

# Bootstrap Framework Technical Specification

## Overview

A generic, pluggable framework for running one-time initialisation steps
against a TrustGraph deployment — replacing the dedicated
`tg-init-trustgraph` container with a long-running processor that
converges the system to a desired initial state and then idles.

The framework is content-agnostic. It knows how to run, retry,
mark-as-done, and surface failures; the actual init work lives in
small pluggable classes called **initialisers**. Core initialisers
ship in the `trustgraph-flow` package; enterprise and third-party
initialisers can be loaded by dotted path without any core code
change.

## Motivation

The existing `tg-init-trustgraph` is a one-shot CLI run in its own
container. It performs two very different jobs (Pulsar topology
setup and config seeding) in a single script, is wasteful as a whole
container, cannot handle partial-success states, and has no way to
extend the boot process with enterprise-specific concerns (user
provisioning, workspace initialisation, IAM scaffolding) without
forking the tool.

A pluggable, long-running reconciler addresses all of this and slots
naturally into the existing processor-group model.

## Design

### Bootstrapper Processor

A single `AsyncProcessor` subclass. One entry in a processor group.
Parameters include the processor's own identity and a list of
**initialiser specifications** — each spec names a class (by dotted
path), a unique instance name, a flag string, and the parameters
that will be passed to the initialiser's constructor.

On each wake the bootstrapper does the following, in order:

1. Open a short-lived context (config client, flow-svc client,
   logger). The context is torn down at the end of the wake so
   steady-state idle cost is effectively nil.
2. Run all **pre-service initialisers** (those that opt out of the
   service gate — principally `PulsarTopology`, which must run
   before the services it gates on can even come up).
3. Check the **service gate**: cheap round-trips to config-svc and
   flow-svc. If either fails, skip to the sleep step using the
   short gate-retry cadence.
4. Run all **post-service initialisers** that haven't already
   completed at the currently-configured flag.
5. Sleep. Cadence adapts to state (see below).

### Initialiser Contract

An initialiser is a class with:

- A class-level `name` identifier, unique within the bootstrapper's
  configuration. This is the key under which completion state is
  stored.
- A class-level `wait_for_services` flag. When `True` (the default)
  the initialiser runs only after the service gate passes. When
  `False`, it runs before the gate, on every wake.
- A constructor that accepts the initialiser's own params as kwargs.
- An async `run(ctx, old_flag, new_flag)` method that performs the
  init work and returns on success. Any raised exception is
  logged and treated as a transient failure — the stored flag is
  not updated and the initialiser will re-run on the next cycle.

`old_flag` is the previously-stored flag string, or `None` if the
initialiser has never successfully run in this deployment. `new_flag`
is the flag the operator has configured for this run. This pair
lets an initialiser distinguish a clean first-run from a migration
between flag versions and behave accordingly (see "Flag change and
re-run safety" below).

### Context

The context is the bootstrapper-owned object passed to every
initialiser's `run()` method. Its fields are deliberately narrow:

| Field | Purpose |
|---|---|
| `logger` | A child logger named for the initialiser instance |
| `config` | A short-lived `ConfigClient` for config-svc reads/writes |
| `flow`   | A short-lived `RequestResponse` client for flow-svc |

The context is always fully-populated regardless of which services
a given initialiser uses, for symmetry. Additional fields may be
added in future without breaking existing initialisers. Clients are
started at the beginning of a wake cycle and stopped at the end.

Initialisers that need services beyond config-svc and flow-svc are
responsible for their own readiness checks and for raising cleanly
when a prerequisite is not met.

### Completion State

Per-initialiser completion state is stored in the reserved
`__system__` workspace, under a dedicated config type for bootstrap
state. The stored value is the flag string that was configured when
the initialiser last succeeded.

On each cycle, for each initialiser, the bootstrapper reads the
stored flag and compares it to the currently-configured flag. If
they match, the initialiser is skipped silently. If they differ,
the initialiser runs; on success, the stored flag is updated.

Because the state lives in a reserved (`_`-prefixed) workspace, it
is stored by config-svc but excluded from the config push broadcast.
Live processors never see it and cannot act on it.

### The Service Gate

The gate is a cheap, bootstrapper-internal check that config-svc
and flow-svc are both reachable and responsive. It is intentionally
a simple pair of low-cost round-trips — a config list against
`__system__` and a flow-svc `list-blueprints` — rather than any
deeper health check.

Its purpose is to avoid filling logs with noise and to concentrate
retry effort during the brief window when services are coming up.
The gate is applied only to initialisers with
`wait_for_services=True` (the default); `False` is reserved for
initialisers that set up infrastructure the gate itself depends on.

### Adaptive Cadence

The sleep between wake cycles is chosen from three tiers based on
observed state:

| Tier | Duration | When |
|---|---|---|
| Gate backoff | ~5 s | Services not responding — concentrate retry during startup |
| Init retry | ~15 s | Gate passes but at least one initialiser is not yet at its configured flag — transient failures, waiting on prereqs, recently-bumped flag not yet applied |
| Steady | ~300 s | All configured initialisers at their configured flag; gate passes; nothing to do |

The short tiers ensure a fresh deployment converges quickly;
steady state costs a single round-trip per initialiser every few
minutes.

### Failure Handling

An initialiser raising an exception does not stop the bootstrapper
or block other initialisers. Each initialiser in the cycle is
attempted independently; failures are logged and retried on the next
cycle. This means there is no ordered-DAG enforcement: order of
initialisers in the configuration determines the attempt order
within a cycle, but a dependency between two initialisers is
expressed by the dependant raising cleanly when its prerequisite
isn't satisfied. Over successive cycles the system converges.

### Flag Change and Re-run Safety

Each initialiser's completion state is a string flag chosen by the
operator. Typically these follow a simple version pattern
(`v1`, `v2`, ...), but the bootstrapper imposes no format.

Changing the flag in the group configuration causes the
corresponding initialiser to re-run on the next cycle. Initialisers
must be written so that re-running after a flag bump is safe — they
receive both the previous and the new flag and are responsible for
either cleanly re-applying the work or performing a step-change
migration from the prior state.

This gives operators an explicit, visible mechanism for triggering
re-initialisation. Re-runs are never implicit.

## Core Initialisers

The following initialisers ship in `trustgraph.bootstrap.initialisers`
and cover the base deployment case.

### PulsarTopology

Creates the Pulsar tenant and the four namespaces
(`flow`, `request`, `response`, `notify`) with appropriate
retention policies if they don't exist.

Opts out of the service gate (`wait_for_services = False`) because
config-svc and flow-svc cannot come online until the Pulsar
namespaces exist.

Parameters: Pulsar admin URL, tenant name.

Idempotent via the admin API (GET-then-PUT). Flag change causes
re-evaluation of all namespaces; any absent are created.

### TemplateSeed

Populates the reserved `__template__` workspace from an external
JSON seed file. The seed file has the standard shape of
`{config-type: {config-key: value}}`.

Runs post-gate. Parameters: path to the seed file, overwrite
policy (upsert-missing only, or overwrite-all).

On clean run, writes the whole file. On flag change, behaviour
depends on the overwrite policy — typically upsert-missing so
that operator-customised keys are preserved across seed-file
upgrades.

### WorkspaceInit

Creates a named workspace and populates it from the seed file or
from the full contents of the `__template__` workspace.

Runs post-gate. Parameters: workspace name, source (seed file or
`__template__`), optional `seed_file` path, `overwrite` flag.

When `source` is `template`, the initialiser copies every config
type and key present in `__template__` — there is no per-type
selection. Deployments that want to seed only a subset should
either curate the seed file they feed to `TemplateSeed` or use
`source: seed-file` directly here.

Raises cleanly if its source does not exist — depends on
`TemplateSeed` having run in the same cycle or a prior one.

### DefaultFlowStart

Starts a specific flow in a specific workspace using a specific
blueprint.

Runs post-gate. Parameters: workspace name, flow id, blueprint
name, description, optional parameter overrides.

Separated from `WorkspaceInit` deliberately so that deployments
which want a workspace without an auto-started flow can simply omit
this initialiser from their bootstrap configuration.

## Extensibility

New initialisers are added by:

1. Subclassing the initialiser base class.
2. Implementing `run(ctx, old_flag, new_flag)`.
3. Choosing `wait_for_services` (almost always `True`).
4. Adding an entry in the bootstrapper's configuration with the new
   class's dotted path.

No core code changes are required to add an enterprise or third-party
initialiser. Enterprise builds ship their own package with their own
initialiser classes (e.g. `CreateAdminUser`, `ProvisionWorkspaces`)
and reference them in the bootstrapper config alongside the core
initialisers.

## Reserved Workspaces

This specification relies on the "reserved workspace" convention:

- Any workspace id beginning with `_` is reserved.
- Reserved workspaces are stored normally by config-svc but never
  appear in the config push broadcast.
- Live processors cannot react to reserved-workspace state.

The bootstrapper uses two reserved workspaces:

- `__template__` — factory-default seed config, readable by
  initialisers that copy-from-template.
- `__system__` — bootstrapper completion state (under the
  `init-state` config type) and any other system-internal bookkeeping.

See the reserved-workspace convention in the config service for
the general rule and its enforcement.

## Non-Goals

- No DAG scheduling across initialisers. Dependencies are expressed
  by the dependant failing cleanly until its prerequisite is met,
  and convergence over subsequent cycles.
- No parallel execution of initialisers within a cycle. A cycle runs
  each initialiser sequentially.
- No implicit re-runs. Re-running an initialiser requires an explicit
  flag change by the operator.
- No cross-initialiser atomicity. Each initialiser's completion is
  recorded independently on its own success.

## Operational Notes

- Running the bootstrapper as a processor-group entry replaces the
  previous `tg-init-trustgraph` container. The bootstrapper is also
  CLI-invocable directly for standalone testing via
  `Processor.launch(...)`.
- First-boot convergence is typically a handful of short cycles
  followed by a transition to the steady cadence. Deployments
  should expect the first few minutes of logs to show
  initialisation activity, thereafter effective silence.
- Bumping a flag is a deliberate operational act. The log line
  emitted on re-run makes the event visible for audit.
feat: pluggable bootstrap framework with ordered initialisers (#847) A generic, long-running bootstrap processor that converges a deployment to its configured initial state and then idles. Replaces the previous one-shot `tg-init-trustgraph` container model and provides an extension point for enterprise / third-party initialisers. See docs/tech-specs/bootstrap.md for the full design. Bootstrapper ------------ A single AsyncProcessor (trustgraph.bootstrap.bootstrapper.Processor) that: * Reads a list of initialiser specifications (class, name, flag, params) from either a direct `initialisers` parameter (processor-group embedding) or a YAML/JSON file (`-c`, CLI). * On each wake, runs a cheap service-gate (config-svc + flow-svc round-trips), then iterates the initialiser list, running each whose configured flag differs from the one stored in __system__/init-state/<name>. * Stores per-initialiser completion state in the reserved __system__ workspace. * Adapts cadence: ~5s on gate failure, ~15s while converging, ~300s in steady state. * Isolates failures — one initialiser's exception does not block others in the same cycle; the failed one retries next wake. Initialiser contract -------------------- * Subclass trustgraph.bootstrap.base.Initialiser. * Implement async run(ctx, old_flag, new_flag). * Opt out of the service gate with class attr wait_for_services=False (only used by PulsarTopology, since config-svc cannot come up until Pulsar namespaces exist). * ctx carries short-lived config and flow-svc clients plus a scoped logger. Core initialisers (trustgraph.bootstrap.initialisers.) ------------------------------------------------------- PulsarTopology — creates Pulsar tenant + namespaces (pre-gate, blocking HTTP offloaded to executor). * TemplateSeed — seeds __template__ from an external JSON file; re-run is upsert-missing by default, overwrite-all opt-in. * WorkspaceInit — populates a named workspace from either the full contents of __template__ or a seed file; raises cleanly if the template isn't seeded yet so the bootstrapper retries on the next cycle. * DefaultFlowStart — starts a specific flow in a workspace; no-ops if the flow is already running. Enterprise or third-party initialisers plug in via fully-qualified dotted class paths in the bootstrapper's configuration — no core code change required. Config service -------------- * push(): filter out reserved workspaces (ids starting with "_") from the change notifications. Stored config is preserved; only the broadcast is suppressed, so bootstrap / template state lives in config-svc without live processors ever reacting to it. Config client ------------- * ConfigClient.get_all(workspace): wraps the existing `config` operation to return {type: {key: value}} for a workspace. WorkspaceInit uses it to copy __template__ without needing a hardcoded types list. pyproject.toml -------------- * Adds a `bootstrap` console script pointing at the new Processor. * Remove tg-init-trustgraph, superceded by bootstrap processor 2026-04-22 18:03:46 +01:00			`---`
			`layout: default`
			`title: "Bootstrap Framework Technical Specification"`
			`parent: "Tech Specs"`
			`---`

			`# Bootstrap Framework Technical Specification`

			`## Overview`

			`A generic, pluggable framework for running one-time initialisation steps`
			`against a TrustGraph deployment — replacing the dedicated`
			`tg-init-trustgraph` container with a long-running processor that
			`converges the system to a desired initial state and then idles.`

			`The framework is content-agnostic. It knows how to run, retry,`
			`mark-as-done, and surface failures; the actual init work lives in`
			`small pluggable classes called initialisers. Core initialisers`
			ship in the `trustgraph-flow` package; enterprise and third-party
			`initialisers can be loaded by dotted path without any core code`
			`change.`

			`## Motivation`

			The existing `tg-init-trustgraph` is a one-shot CLI run in its own
			`container. It performs two very different jobs (Pulsar topology`
			`setup and config seeding) in a single script, is wasteful as a whole`
			`container, cannot handle partial-success states, and has no way to`
			`extend the boot process with enterprise-specific concerns (user`
			`provisioning, workspace initialisation, IAM scaffolding) without`
			`forking the tool.`

			`A pluggable, long-running reconciler addresses all of this and slots`
			`naturally into the existing processor-group model.`

			`## Design`

			`### Bootstrapper Processor`

			A single `AsyncProcessor` subclass. One entry in a processor group.
			`Parameters include the processor's own identity and a list of`
			`initialiser specifications — each spec names a class (by dotted`
			`path), a unique instance name, a flag string, and the parameters`
			`that will be passed to the initialiser's constructor.`

			`On each wake the bootstrapper does the following, in order:`

			`1. Open a short-lived context (config client, flow-svc client,`
			`logger). The context is torn down at the end of the wake so`
			`steady-state idle cost is effectively nil.`
			`2. Run all pre-service initialisers (those that opt out of the`
			service gate — principally `PulsarTopology`, which must run
			`before the services it gates on can even come up).`
			`3. Check the service gate: cheap round-trips to config-svc and`
			`flow-svc. If either fails, skip to the sleep step using the`
			`short gate-retry cadence.`
			`4. Run all post-service initialisers that haven't already`
			`completed at the currently-configured flag.`
			`5. Sleep. Cadence adapts to state (see below).`

			`### Initialiser Contract`

			`An initialiser is a class with:`

			- A class-level `name` identifier, unique within the bootstrapper's
			`configuration. This is the key under which completion state is`
			`stored.`
			- A class-level `wait_for_services` flag. When `True` (the default)
			`the initialiser runs only after the service gate passes. When`
			`False`, it runs before the gate, on every wake.
			`- A constructor that accepts the initialiser's own params as kwargs.`
			- An async `run(ctx, old_flag, new_flag)` method that performs the
			`init work and returns on success. Any raised exception is`
			`logged and treated as a transient failure — the stored flag is`
			`not updated and the initialiser will re-run on the next cycle.`

			`old_flag` is the previously-stored flag string, or `None` if the
			initialiser has never successfully run in this deployment. `new_flag`
			`is the flag the operator has configured for this run. This pair`
			`lets an initialiser distinguish a clean first-run from a migration`
			`between flag versions and behave accordingly (see "Flag change and`
			`re-run safety" below).`

			`### Context`

			`The context is the bootstrapper-owned object passed to every`
			initialiser's `run()` method. Its fields are deliberately narrow:

			`\| Field \| Purpose \|`
			`\|---\|---\|`
			\| `logger` \| A child logger named for the initialiser instance \|
			\| `config` \| A short-lived `ConfigClient` for config-svc reads/writes \|
			\| `flow` \| A short-lived `RequestResponse` client for flow-svc \|

			`The context is always fully-populated regardless of which services`
			`a given initialiser uses, for symmetry. Additional fields may be`
			`added in future without breaking existing initialisers. Clients are`
			`started at the beginning of a wake cycle and stopped at the end.`

			`Initialisers that need services beyond config-svc and flow-svc are`
			`responsible for their own readiness checks and for raising cleanly`
			`when a prerequisite is not met.`

			`### Completion State`

			`Per-initialiser completion state is stored in the reserved`
			`__system__` workspace, under a dedicated config type for bootstrap
			`state. The stored value is the flag string that was configured when`
			`the initialiser last succeeded.`

			`On each cycle, for each initialiser, the bootstrapper reads the`
			`stored flag and compares it to the currently-configured flag. If`
			`they match, the initialiser is skipped silently. If they differ,`
			`the initialiser runs; on success, the stored flag is updated.`

			Because the state lives in a reserved (`_`-prefixed) workspace, it
			`is stored by config-svc but excluded from the config push broadcast.`
			`Live processors never see it and cannot act on it.`

			`### The Service Gate`

			`The gate is a cheap, bootstrapper-internal check that config-svc`
			`and flow-svc are both reachable and responsive. It is intentionally`
			`a simple pair of low-cost round-trips — a config list against`
			`__system__` and a flow-svc `list-blueprints` — rather than any
			`deeper health check.`

			`Its purpose is to avoid filling logs with noise and to concentrate`
			`retry effort during the brief window when services are coming up.`
			`The gate is applied only to initialisers with`
			`wait_for_services=True` (the default); `False` is reserved for
			`initialisers that set up infrastructure the gate itself depends on.`

			`### Adaptive Cadence`

			`The sleep between wake cycles is chosen from three tiers based on`
			`observed state:`

			`\| Tier \| Duration \| When \|`
			`\|---\|---\|---\|`
			`\| Gate backoff \| ~5 s \| Services not responding — concentrate retry during startup \|`
			`\| Init retry \| ~15 s \| Gate passes but at least one initialiser is not yet at its configured flag — transient failures, waiting on prereqs, recently-bumped flag not yet applied \|`
			`\| Steady \| ~300 s \| All configured initialisers at their configured flag; gate passes; nothing to do \|`

			`The short tiers ensure a fresh deployment converges quickly;`
			`steady state costs a single round-trip per initialiser every few`
			`minutes.`

			`### Failure Handling`

			`An initialiser raising an exception does not stop the bootstrapper`
			`or block other initialisers. Each initialiser in the cycle is`
			`attempted independently; failures are logged and retried on the next`
			`cycle. This means there is no ordered-DAG enforcement: order of`
			`initialisers in the configuration determines the attempt order`
			`within a cycle, but a dependency between two initialisers is`
			`expressed by the dependant raising cleanly when its prerequisite`
			`isn't satisfied. Over successive cycles the system converges.`

			`### Flag Change and Re-run Safety`

			`Each initialiser's completion state is a string flag chosen by the`
			`operator. Typically these follow a simple version pattern`
			(`v1`, `v2`, ...), but the bootstrapper imposes no format.

			`Changing the flag in the group configuration causes the`
			`corresponding initialiser to re-run on the next cycle. Initialisers`
			`must be written so that re-running after a flag bump is safe — they`
			`receive both the previous and the new flag and are responsible for`
			`either cleanly re-applying the work or performing a step-change`
			`migration from the prior state.`

			`This gives operators an explicit, visible mechanism for triggering`
			`re-initialisation. Re-runs are never implicit.`

			`## Core Initialisers`

			The following initialisers ship in `trustgraph.bootstrap.initialisers`
			`and cover the base deployment case.`

			`### PulsarTopology`

			`Creates the Pulsar tenant and the four namespaces`
			(`flow`, `request`, `response`, `notify`) with appropriate
			`retention policies if they don't exist.`

			Opts out of the service gate (`wait_for_services = False`) because
			`config-svc and flow-svc cannot come online until the Pulsar`
			`namespaces exist.`

			`Parameters: Pulsar admin URL, tenant name.`

			`Idempotent via the admin API (GET-then-PUT). Flag change causes`
			`re-evaluation of all namespaces; any absent are created.`

			`### TemplateSeed`

			Populates the reserved `__template__` workspace from an external
			`JSON seed file. The seed file has the standard shape of`
			`{config-type: {config-key: value}}`.

			`Runs post-gate. Parameters: path to the seed file, overwrite`
			`policy (upsert-missing only, or overwrite-all).`

			`On clean run, writes the whole file. On flag change, behaviour`
			`depends on the overwrite policy — typically upsert-missing so`
			`that operator-customised keys are preserved across seed-file`
			`upgrades.`

			`### WorkspaceInit`

			`Creates a named workspace and populates it from the seed file or`
			from the full contents of the `__template__` workspace.

			`Runs post-gate. Parameters: workspace name, source (seed file or`
			`__template__`), optional `seed_file` path, `overwrite` flag.

			When `source` is `template`, the initialiser copies every config
			type and key present in `__template__` — there is no per-type
			`selection. Deployments that want to seed only a subset should`
			either curate the seed file they feed to `TemplateSeed` or use
			`source: seed-file` directly here.

			`Raises cleanly if its source does not exist — depends on`
			`TemplateSeed` having run in the same cycle or a prior one.

			`### DefaultFlowStart`

			`Starts a specific flow in a specific workspace using a specific`
			`blueprint.`

			`Runs post-gate. Parameters: workspace name, flow id, blueprint`
			`name, description, optional parameter overrides.`

			Separated from `WorkspaceInit` deliberately so that deployments
			`which want a workspace without an auto-started flow can simply omit`
			`this initialiser from their bootstrap configuration.`

			`## Extensibility`

			`New initialisers are added by:`

			`1. Subclassing the initialiser base class.`
			2. Implementing `run(ctx, old_flag, new_flag)`.
			3. Choosing `wait_for_services` (almost always `True`).
			`4. Adding an entry in the bootstrapper's configuration with the new`
			`class's dotted path.`

			`No core code changes are required to add an enterprise or third-party`
			`initialiser. Enterprise builds ship their own package with their own`
			initialiser classes (e.g. `CreateAdminUser`, `ProvisionWorkspaces`)
			`and reference them in the bootstrapper config alongside the core`
			`initialisers.`

			`## Reserved Workspaces`

			`This specification relies on the "reserved workspace" convention:`

			- Any workspace id beginning with `_` is reserved.
			`- Reserved workspaces are stored normally by config-svc but never`
			`appear in the config push broadcast.`
			`- Live processors cannot react to reserved-workspace state.`

			`The bootstrapper uses two reserved workspaces:`

			- `__template__` — factory-default seed config, readable by
			`initialisers that copy-from-template.`
			- `__system__` — bootstrapper completion state (under the
			`init-state` config type) and any other system-internal bookkeeping.

			`See the reserved-workspace convention in the config service for`
			`the general rule and its enforcement.`

			`## Non-Goals`

			`- No DAG scheduling across initialisers. Dependencies are expressed`
			`by the dependant failing cleanly until its prerequisite is met,`
			`and convergence over subsequent cycles.`
			`- No parallel execution of initialisers within a cycle. A cycle runs`
			`each initialiser sequentially.`
			`- No implicit re-runs. Re-running an initialiser requires an explicit`
			`flag change by the operator.`
			`- No cross-initialiser atomicity. Each initialiser's completion is`
			`recorded independently on its own success.`

			`## Operational Notes`

			`- Running the bootstrapper as a processor-group entry replaces the`
			previous `tg-init-trustgraph` container. The bootstrapper is also
			`CLI-invocable directly for standalone testing via`
			`Processor.launch(...)`.
			`- First-boot convergence is typically a handful of short cycles`
			`followed by a transition to the steady cadence. Deployments`
			`should expect the first few minutes of logs to show`
			`initialisation activity, thereafter effective silence.`
			`- Bumping a flag is a deliberate operational act. The log line`
			`emitted on re-run makes the event visible for audit.`