---
layout: default
title: "Bootstrap Framework Technical Specification"
parent: "Tech Specs"
---
# Bootstrap Framework Technical Specification
## Overview
A generic, pluggable framework for running one-time initialisation steps
against a TrustGraph deployment — replacing the dedicated
`tg-init-trustgraph` container with a long-running processor that
converges the system to a desired initial state and then idles.

The framework is content-agnostic. It knows how to run, retry,
mark-as-done, and surface failures; the actual init work lives in
small pluggable classes called **initialisers**. Core initialisers
ship in the `trustgraph-flow` package; enterprise and third-party
initialisers can be loaded by dotted path without any core code
change.
## Motivation
The existing `tg-init-trustgraph` is a one-shot CLI run in its own
container. It performs two very different jobs (Pulsar topology
setup and config seeding) in a single script, occupies a whole
container for a task that runs once, cannot handle partial-success
states, and cannot extend the boot process with enterprise-specific
concerns (user provisioning, workspace initialisation, IAM
scaffolding) without forking the tool.

A pluggable, long-running reconciler addresses all of this and slots
naturally into the existing processor-group model.
## Design
### Bootstrapper Processor
A single `AsyncProcessor` subclass. One entry in a processor group.
Parameters include the processor's own identity and a list of
**initialiser specifications** — each spec names a class (by dotted
path), a unique instance name, a flag string, and the parameters
that will be passed to the initialiser's constructor.
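
As an illustration, a specification list could be modelled as below. This is a minimal sketch, not the framework's actual schema: the `InitialiserSpec` type, the `parse_specs` helper, the key spellings (`class`, `name`, `flag`, `params`), the class paths and the parameter names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class InitialiserSpec:
    """One entry in the bootstrapper's initialiser list (hypothetical shape)."""
    cls: str    # dotted path to the initialiser class
    name: str   # unique instance name
    flag: str   # operator-chosen flag string, e.g. "v1"
    params: dict[str, Any] = field(default_factory=dict)  # constructor kwargs


def parse_specs(raw: list[dict[str, Any]]) -> list[InitialiserSpec]:
    """Convert raw config entries to specs, rejecting duplicate names."""
    specs = []
    seen = set()
    for entry in raw:
        spec = InitialiserSpec(
            cls=entry["class"],
            name=entry["name"],
            flag=entry["flag"],
            params=dict(entry.get("params", {})),
        )
        if spec.name in seen:
            raise ValueError(f"duplicate initialiser name: {spec.name}")
        seen.add(spec.name)
        specs.append(spec)
    return specs


# Example configuration; class paths and parameter names are illustrative.
RAW = [
    {"class": "trustgraph.bootstrap.initialisers.PulsarTopology",
     "name": "pulsar-topology", "flag": "v1",
     "params": {"admin_url": "http://pulsar:8080", "tenant": "tg"}},
    {"class": "trustgraph.bootstrap.initialisers.TemplateSeed",
     "name": "template-seed", "flag": "v1",
     "params": {"seed_file": "/etc/trustgraph/seed.json"}},
]
SPECS = parse_specs(RAW)
```
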
On each wake the bootstrapper does the following, in order:
1. Open a short-lived context (config client, flow-svc client,
logger). The context is torn down at the end of the wake so
steady-state idle cost is effectively nil.
2. Run all **pre-service initialisers** (those that opt out of the
service gate, principally `PulsarTopology`, which must run
before the gated services can even come up).
3. Check the **service gate**: cheap round-trips to config-svc and
flow-svc. If either fails, skip to the sleep step using the
short gate-retry cadence.
4. Run all **post-service initialisers** that haven't already
completed at the currently-configured flag.
5. Sleep. Cadence adapts to state (see below).
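
The ordering of the steps above can be sketched as a single coroutine. This is an illustrative outline only: `wake`, `gate_ok` and `run_one` are hypothetical names, context set-up and tear-down (step 1) are left to the caller, and the choice to ignore the outcome of pre-service initialisers when picking the tier is an assumption of the sketch.

```python
import asyncio


async def wake(pre, post, gate_ok, run_one):
    """Run one wake cycle and return the sleep tier: "gate", "retry" or "steady".

    pre, post : pre-service and post-service initialisers, in configured order
    gate_ok   : async callable, True when config-svc and flow-svc both respond
    run_one   : async callable(init) -> True if init is at its configured flag
    """
    # Step 2: pre-service initialisers run before the gate is consulted.
    for init in pre:
        await run_one(init)

    # Step 3: on gate failure, skip straight to the short sleep.
    if not await gate_ok():
        return "gate"

    # Step 4: post-service initialisers, one at a time.
    all_done = True
    for init in post:
        if not await run_one(init):
            all_done = False

    # Step 5: the caller sleeps according to the returned tier.
    return "steady" if all_done else "retry"
```
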
### Initialiser Contract
An initialiser is a class with:
- A class-level `name` identifier, unique within the bootstrapper's
configuration. This is the key under which completion state is
stored.
- A class-level `wait_for_services` flag. When `True` (the default)
the initialiser runs only after the service gate passes. When
`False`, it runs before the gate, on every wake.
- A constructor that accepts the initialiser's own params as kwargs.
- An async `run(ctx, old_flag, new_flag)` method that performs the
init work and returns on success. Any raised exception is
logged and treated as a transient failure — the stored flag is
not updated and the initialiser will re-run on the next cycle.

`old_flag` is the previously-stored flag string, or `None` if the
initialiser has never successfully run in this deployment. `new_flag`
is the flag the operator has configured for this run. This pair
lets an initialiser distinguish a clean first-run from a migration
between flag versions and behave accordingly (see "Flag change and
re-run safety" below).
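
The shape of the contract can be illustrated with stand-in classes. The `Context` and `Initialiser` definitions here are simplified placeholders written for this example, not the real TrustGraph classes, and `SeedGreeting` is a hypothetical initialiser; a plain dict stands in for the config client.

```python
import logging
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Context:
    """Stand-in for the bootstrapper-owned context object."""
    logger: logging.Logger
    config: Any   # here: a plain dict standing in for a config client
    flow: Any     # unused in this sketch


class Initialiser:
    """Stand-in base class; the real base class may differ."""
    wait_for_services = True   # default: run only after the service gate

    async def run(self, ctx: Context, old_flag: Optional[str],
                  new_flag: str) -> None:
        raise NotImplementedError


class SeedGreeting(Initialiser):
    """Hypothetical initialiser that writes a single config value."""
    name = "seed-greeting"

    def __init__(self, text: str = "hello"):
        # The constructor receives the spec's params as keyword arguments.
        self.text = text

    async def run(self, ctx, old_flag, new_flag):
        if not self.text:
            # Raising marks the attempt as failed; it is retried next cycle.
            raise ValueError("greeting text must not be empty")
        ctx.config["greeting"] = self.text
        ctx.logger.info("greeting written at flag %s", new_flag)
        # Returning normally signals success.
```
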
### Context
The context is the bootstrapper-owned object passed to every
initialiser's `run()` method. Its fields are deliberately narrow:

| Field | Purpose |
|---|---|
| `logger` | A child logger named for the initialiser instance |
| `config` | A short-lived `ConfigClient` for config-svc reads/writes |
| `flow` | A short-lived `RequestResponse` client for flow-svc |

The context is always fully-populated regardless of which services
a given initialiser uses, for symmetry. Additional fields may be
added in future without breaking existing initialisers. Clients are
started at the beginning of a wake cycle and stopped at the end.

Initialisers that need services beyond config-svc and flow-svc are
responsible for their own readiness checks and for raising cleanly
when a prerequisite is not met.
### Completion State
Per-initialiser completion state is stored in the reserved
`__system__` workspace, under the `init-state` config type, keyed by
initialiser name. The stored value is the flag string that was
configured when the initialiser last succeeded.

On each cycle, for each initialiser, the bootstrapper reads the
stored flag and compares it to the currently-configured flag. If
they match, the initialiser is skipped silently. If they differ,
the initialiser runs; on success, the stored flag is updated.

Because the state lives in a reserved (`_`-prefixed) workspace, it
is stored by config-svc but excluded from the config push broadcast.
Live processors never see it and cannot act on it.
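
The compare-then-update rule can be sketched with an in-memory stand-in for the `__system__` workspace. `StateStore` and `needs_run` are hypothetical names for this example; the real bootstrapper reads and writes the state through config-svc.

```python
from typing import Optional

INIT_STATE_TYPE = "init-state"   # config type holding bootstrap state


class StateStore:
    """In-memory stand-in for the `__system__` workspace."""

    def __init__(self):
        self._data = {}   # {config-type: {key: value}}

    def get_flag(self, name: str) -> Optional[str]:
        # None means the initialiser has never succeeded.
        return self._data.get(INIT_STATE_TYPE, {}).get(name)

    def set_flag(self, name: str, flag: str) -> None:
        # Called only after the initialiser's run() returned successfully.
        self._data.setdefault(INIT_STATE_TYPE, {})[name] = flag


def needs_run(store: StateStore, name: str, configured_flag: str) -> bool:
    """An initialiser runs only when stored and configured flags differ."""
    return store.get_flag(name) != configured_flag
```
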
### The Service Gate
The gate is a cheap, bootstrapper-internal check that config-svc
and flow-svc are both reachable and responsive. It is intentionally
a simple pair of low-cost round-trips — a config list against
`__system__` and a flow-svc `list-blueprints` — rather than any
deeper health check.

Its purpose is to avoid filling logs with noise and to concentrate
retry effort during the brief window when services are coming up.

The gate is applied only to initialisers with
`wait_for_services=True` (the default); `False` is reserved for
initialisers that set up infrastructure the gate itself depends on.
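
A gate of this kind can be sketched as a sequence of probes, each bounded by a timeout. The `service_gate` function, the probe callables and the 2-second default are assumptions for illustration; in the real bootstrapper the probes would be the config list and `list-blueprints` round-trips described above.

```python
import asyncio


async def service_gate(probes, timeout: float = 2.0) -> bool:
    """Return True only if every probe completes without error in time.

    probes: zero-argument async callables, one per service round-trip.
    """
    for probe in probes:
        try:
            await asyncio.wait_for(probe(), timeout)
        except Exception:
            # Any error or timeout means the gate is closed for this wake.
            return False
    return True
```
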
### Adaptive Cadence
The sleep between wake cycles is chosen from three tiers based on
observed state:

| Tier | Duration | When |
|---|---|---|
| Gate backoff | ~5 s | Services not responding — concentrate retry during startup |
| Init retry | ~15 s | Gate passes but at least one initialiser is not yet at its configured flag — transient failures, waiting on prereqs, recently-bumped flag not yet applied |
| Steady | ~300 s | All configured initialisers at their configured flag; gate passes; nothing to do |

The short tiers ensure a fresh deployment converges quickly;
steady state costs a single round-trip per initialiser every few
minutes.
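
The tier selection reduces to a small decision function. The constants mirror the approximate durations in the table; the function name and its inputs are hypothetical, and the real implementation may use different exact values.

```python
GATE_BACKOFF_S = 5.0    # services not responding
INIT_RETRY_S = 15.0     # gate passes, some initialisers still pending
STEADY_S = 300.0        # everything at its configured flag


def next_sleep(gate_ok: bool, pending: int) -> float:
    """Choose the sleep before the next wake from the observed state.

    gate_ok : result of the service gate on this wake
    pending : number of initialisers not yet at their configured flag
    """
    if not gate_ok:
        return GATE_BACKOFF_S
    if pending > 0:
        return INIT_RETRY_S
    return STEADY_S
```
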
### Failure Handling
An initialiser raising an exception does not stop the bootstrapper
or block other initialisers. Each initialiser in the cycle is
attempted independently; failures are logged and retried on the next
cycle. This means there is no ordered-DAG enforcement: order of
initialisers in the configuration determines the attempt order
within a cycle, but a dependency between two initialisers is
expressed by the dependant raising cleanly when its prerequisite
isn't satisfied. Over successive cycles the system converges.
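
The convergence behaviour can be demonstrated with a toy example in which three dependent steps are deliberately configured in reverse order. Everything here (`run_cycle`, `converge`, the `world` set and the three step functions) is hypothetical scaffolding for the illustration, not framework code.

```python
import asyncio
import logging

log = logging.getLogger("bootstrap-sketch")


async def run_cycle(initialisers, done):
    """Attempt each not-yet-done initialiser once, in configured order."""
    for name, run in initialisers:
        if name in done:
            continue
        try:
            await run()
        except Exception as exc:
            # Logged and left for the next cycle; later entries still run.
            log.warning("initialiser %s failed: %s", name, exc)
        else:
            done.add(name)


world = set()   # toy stand-in for deployment state


async def seed_template():
    world.add("template")


async def init_workspace():
    if "template" not in world:
        raise RuntimeError("template not seeded yet")
    world.add("workspace")


async def start_flow():
    if "workspace" not in world:
        raise RuntimeError("workspace not initialised yet")
    world.add("flow")


# Deliberately listed in reverse dependency order.
ORDER = [("start-flow", start_flow),
         ("init-workspace", init_workspace),
         ("seed-template", seed_template)]


async def converge(max_cycles: int = 5) -> int:
    """Run cycles until every initialiser has succeeded; return the count."""
    done = set()
    cycles = 0
    while len(done) < len(ORDER) and cycles < max_cycles:
        await run_cycle(ORDER, done)
        cycles += 1
    return cycles
```

With this worst-case ordering, each cycle completes exactly one more step, so listing initialisers in dependency order is still worthwhile: it lets a fresh deployment converge in fewer cycles.
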
### Flag Change and Re-run Safety
Each initialiser's completion state is a string flag chosen by the
operator. Typically these follow a simple version pattern
(`v1`, `v2`, ...), but the bootstrapper imposes no format.

Changing the flag in the group configuration causes the
corresponding initialiser to re-run on the next cycle. Initialisers
must be written so that re-running after a flag bump is safe — they
receive both the previous and the new flag and are responsible for
either cleanly re-applying the work or performing a step-change
migration from the prior state.

This gives operators an explicit, visible mechanism for triggering
re-initialisation. An initialiser that has completed at its
configured flag never re-runs implicitly.
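
One way an initialiser might use the flag pair is to dispatch on it at the top of `run()`. The helper below is a hypothetical sketch; the action names and the specific `v1` to `v2` migration are invented for the example.

```python
from typing import Optional


def plan(old_flag: Optional[str], new_flag: str) -> str:
    """Decide what kind of work a run should perform (illustrative only)."""
    if old_flag is None:
        # Never succeeded in this deployment: clean first run.
        return "first-run"
    if (old_flag, new_flag) == ("v1", "v2"):
        # A known step-change between two specific flag values.
        return "migrate-v1-to-v2"
    # Any other flag change: re-apply the work idempotently.
    return "reapply"
```
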
## Core Initialisers
The following initialisers ship in `trustgraph.bootstrap.initialisers`
and cover the base deployment case.
### PulsarTopology
Creates the Pulsar tenant and the four namespaces
(`flow`, `request`, `response`, `notify`) with appropriate
retention policies if they don't exist.

Opts out of the service gate (`wait_for_services = False`) because
config-svc and flow-svc cannot come online until the Pulsar
namespaces exist.

Parameters: Pulsar admin URL, tenant name.

Idempotent via the admin API (GET-then-PUT). Flag change causes
re-evaluation of all namespaces; any absent are created.
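
The GET-then-PUT pattern can be sketched against an in-memory fake so that it runs without a broker. The paths are modelled on Pulsar's admin REST API v2 and should be checked against the Pulsar version in use; `FakeAdmin`, `ensure_topology` and the `standalone` cluster name are assumptions of this example, and retention policies are omitted.

```python
NAMESPACES = ("flow", "request", "response", "notify")


class FakeAdmin:
    """In-memory stand-in for an HTTP client talking to the admin API."""

    def __init__(self):
        self.tenants = set()
        self.namespaces = set()   # entries look like "tenant/namespace"

    def get(self, path):
        if path == "/admin/v2/tenants":
            return sorted(self.tenants)
        tenant = path[len("/admin/v2/namespaces/"):]
        return sorted(n for n in self.namespaces
                      if n.startswith(tenant + "/"))

    def put(self, path, body=None):
        parts = path.split("/")[3:]   # drop "", "admin", "v2"
        if parts[0] == "tenants":
            self.tenants.add(parts[1])
        else:
            self.namespaces.add("/".join(parts[1:]))


def ensure_topology(admin, tenant, cluster="standalone"):
    """GET current state, PUT only what is absent; return created paths."""
    created = []
    if tenant not in admin.get("/admin/v2/tenants"):
        path = f"/admin/v2/tenants/{tenant}"
        admin.put(path, {"allowedClusters": [cluster]})
        created.append(path)
    existing = set(admin.get(f"/admin/v2/namespaces/{tenant}"))
    for ns in NAMESPACES:
        if f"{tenant}/{ns}" not in existing:
            path = f"/admin/v2/namespaces/{tenant}/{ns}"
            admin.put(path)
            created.append(path)
    return created
```

Running `ensure_topology` a second time against the same state creates nothing, which is what makes it safe to re-run after a flag change.
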
### TemplateSeed
Populates the reserved `__template__` workspace from an external
JSON seed file. The seed file has the standard shape of
`{config-type: {config-key: value}}`.

Runs post-gate. Parameters: path to the seed file, overwrite
policy (upsert-missing only, or overwrite-all).

On a clean first run, writes the whole file. On a flag change,
behaviour depends on the overwrite policy: typically upsert-missing,
so that operator-customised keys are preserved across seed-file
upgrades.
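
The two overwrite policies differ only in which keys are written. A minimal sketch, assuming both the existing workspace contents and the seed use the `{config-type: {config-key: value}}` shape; `seed_writes` is a hypothetical helper name.

```python
def seed_writes(existing, seed, overwrite=False):
    """Return the writes to apply, as {config-type: {config-key: value}}.

    overwrite=False : upsert-missing, existing keys are left untouched
    overwrite=True  : overwrite-all, every seed key is written
    """
    writes = {}
    for ctype, entries in seed.items():
        current = existing.get(ctype, {})
        for key, value in entries.items():
            if overwrite or key not in current:
                writes.setdefault(ctype, {})[key] = value
    return writes
```
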
### WorkspaceInit
Creates a named workspace and populates it either from a seed file or
from the full contents of the `__template__` workspace.

Runs post-gate. Parameters: workspace name, `source` (`template` or
`seed-file`), optional `seed_file` path, `overwrite` flag.

When `source` is `template`, the initialiser copies every config
type and key present in `__template__`; there is no per-type
selection. Deployments that want to seed only a subset should
either curate the seed file they feed to `TemplateSeed` or use
`source: seed-file` directly here.

Raises cleanly if its source does not exist. With `source: template`
this means it depends on `TemplateSeed` having run in the same cycle
or a prior one.
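
Source selection and the "raise cleanly" behaviour might look like the following. This is a sketch under stated assumptions: `load_source` and `SourceNotReady` are hypothetical names, the template is passed in as an already-fetched dict, and an empty template is taken to mean "not seeded yet".

```python
import copy
import json


class SourceNotReady(Exception):
    """Raised when the configured source has nothing to copy yet."""


def load_source(source, template, seed_file=None):
    """Return {config-type: {config-key: value}} from the chosen source."""
    if source == "template":
        if not template:
            # Assumption: an empty template means TemplateSeed has not run.
            raise SourceNotReady("__template__ has not been seeded yet")
        return copy.deepcopy(template)
    if source == "seed-file":
        if seed_file is None:
            raise ValueError("seed_file is required for source 'seed-file'")
        # A missing file raises FileNotFoundError, also a clean failure.
        with open(seed_file, encoding="utf-8") as f:
            return json.load(f)
    raise ValueError(f"unknown source: {source!r}")
```
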
### DefaultFlowStart
Starts a specific flow in a specific workspace using a specific
blueprint.

Runs post-gate. Parameters: workspace name, flow id, blueprint
name, description, optional parameter overrides.

Separated from `WorkspaceInit` deliberately so that deployments
which want a workspace without an auto-started flow can simply omit
this initialiser from their bootstrap configuration.
## Extensibility
New initialisers are added by:
1. Subclassing the initialiser base class.
2. Implementing `run(ctx, old_flag, new_flag)`.
3. Choosing `wait_for_services` (almost always `True`).
4. Adding an entry in the bootstrapper's configuration with the new
class's dotted path.

No core code changes are required to add an enterprise or third-party
initialiser. Enterprise builds ship their own package with their own
initialiser classes (e.g. `CreateAdminUser`, `ProvisionWorkspaces`)
and reference them in the bootstrapper config alongside the core
initialisers.
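
Loading a class by dotted path is commonly done with the standard library's `importlib`. The helper names below are hypothetical, and the bootstrapper's actual loader may differ in detail.

```python
import importlib


def load_class(dotted_path: str):
    """Resolve "package.module.ClassName" to the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"not a dotted class path: {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def instantiate(dotted_path: str, params: dict):
    """Construct the class, passing the spec's params as keyword arguments."""
    return load_class(dotted_path)(**params)
```

Because resolution happens at runtime, an enterprise package only needs to be importable in the bootstrapper's environment for its initialisers to be usable.
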
## Reserved Workspaces
This specification relies on the "reserved workspace" convention:
- Any workspace id beginning with `_` is reserved.
- Reserved workspaces are stored normally by config-svc but never
appear in the config push broadcast.
- Live processors cannot react to reserved-workspace state.

The bootstrapper uses two reserved workspaces:
- `__template__` — factory-default seed config, readable by
initialisers that copy-from-template.
- `__system__` — bootstrapper completion state (under the
`init-state` config type) and any other system-internal bookkeeping.

See the reserved-workspace convention in the config service for
the general rule and its enforcement.
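
The convention amounts to a prefix test applied when deciding what to broadcast. A minimal sketch with hypothetical function names; the config service's real push path is not shown here.

```python
def is_reserved(workspace_id: str) -> bool:
    """Any workspace id beginning with "_" is reserved."""
    return workspace_id.startswith("_")


def workspaces_to_broadcast(changed):
    """Drop reserved workspaces from a change notification.

    Stored config is unaffected; only the broadcast list is filtered.
    """
    return [w for w in changed if not is_reserved(w)]
```
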
## Non-Goals
- No DAG scheduling across initialisers. Dependencies are expressed
by the dependant failing cleanly until its prerequisite is met;
the system converges over subsequent cycles.
- No parallel execution of initialisers within a cycle. A cycle runs
each initialiser sequentially.
- No implicit re-runs. Once an initialiser has succeeded at its
configured flag, re-running it requires an explicit flag change by
the operator; only failed attempts are retried automatically.
- No cross-initialiser atomicity. Each initialiser's completion is
recorded independently on its own success.
## Operational Notes
- Running the bootstrapper as a processor-group entry replaces the
previous `tg-init-trustgraph` container. The bootstrapper is also
CLI-invocable directly for standalone testing via
`Processor.launch(...)`.
- First-boot convergence is typically a handful of short cycles
followed by a transition to the steady cadence. Deployments
should expect the first few minutes of logs to show
initialisation activity, and effective silence thereafter.
- Bumping a flag is a deliberate operational act. The log line
emitted on re-run makes the event visible for audit.