Commit graph

4 commits

Author SHA1 Message Date
cybermaggedon
5e28d3cce0
refactor(iam): pluggable IAM regime via authenticate/authorise contract (#853)
The gateway no longer holds any policy state — capability sets, role
definitions, workspace scope rules.  Per the IAM contract it asks the
regime "may this identity perform this capability on this resource?"
per request.  That moves the OSS role-based regime entirely into
iam-svc, which can be replaced (SSO, ABAC, ReBAC) without changing
the gateway, the wire protocol, or backend services.

Contract:
- authenticate(credential) -> Identity (handle, workspace,
  principal_id, source).  No roles, claims, or policy state surface
  to the gateway.
- authorise(identity, capability, resource, parameters) -> (allow,
  ttl).  Cached per-decision (regime TTL clamped above; fail-closed
  on regime errors).
- authorise_many available as a fan-out variant.

Operation registry drives every authorisation decision:
- /api/v1/iam -> IamEndpoint, looks up bare op name (create-user,
  list-workspaces, ...).
- /api/v1/{kind} -> RegistryRoutedVariableEndpoint, <kind>:<op>
  (config:get, flow:list-blueprints, librarian:add-document, ...).
- /api/v1/flow/{flow}/service/{kind} -> flow-service:<kind>.
- /api/v1/flow/{flow}/{import,export}/{kind} ->
  flow-{import,export}:<kind>.
- WS Mux per-frame -> flow-service:<kind>; closes a gap where
  authenticated users could hit any service kind.
85 operations registered across the surface.

JWT carries identity only — sub + workspace.  The roles claim is gone;
the gateway never reads policy state from a credential.

The three coarse *_KIND_CAPABILITY maps are removed.  The registry is
the only source of truth for the capability + resource shape of an
operation.  Tests migrated to the new Identity shape and to
authorise()-mocked auth doubles.

Specs updated: docs/tech-specs/iam-contract.md (Identity surface,
caching, registry-naming conventions), iam.md (JWT shape, gateway
flow, role section reframed as OSS-regime detail), iam-protocol.md
(positioned as one implementation of the contract).
2026-04-28 16:19:41 +01:00
cybermaggedon
67b2fc448f
feat: IAM service, gateway auth middleware, capability model, and CLIs (#849)
Replaces the legacy GATEWAY_SECRET shared-token gate with an IAM-backed
identity and authorisation model.  The gateway no longer has an
"allow-all" or "no auth" mode; every request is authenticated via the
IAM service, authorised against a capability model that encodes both
the operation and the workspace it targets, and rejected with a
deliberately-uninformative 401 / 403 on any failure.

IAM service (trustgraph-flow/trustgraph/iam, trustgraph-base/schema/iam)
-----------------------------------------------------------------------
* New backend service (iam-svc) owning users, workspaces, API keys,
  passwords and JWT signing keys in Cassandra.  Reached over the
  standard pub/sub request/response pattern; gateway is the only
  caller.
* Operations: bootstrap, resolve-api-key, login, get-signing-key-public,
  rotate-signing-key, create/list/get/update/disable/delete/enable-user,
  change-password, reset-password, create/list/get/update/disable-
  workspace, create/list/revoke-api-key.
* Ed25519 JWT signing (alg=EdDSA).  Key rotation writes a new kid and
  retires the previous one; validation is grace-period friendly.
* Passwords: PBKDF2-HMAC-SHA-256, 600k iterations, per-user salt.
* API keys: 128-bit random, SHA-256 hashed.  Plaintext returned once.
* Bootstrap is explicit: --bootstrap-mode {token,bootstrap} is a
  required startup argument with no permissive default.  Masked
  "auth failure" errors hide whether a refused bootstrap request was
  due to mode, state, or authorisation.

Gateway authentication (trustgraph-flow/trustgraph/gateway/auth.py)
-------------------------------------------------------------------
* IamAuth replaces the legacy Authenticator.  Distinguishes JWTs
  (three-segment dotted) from API keys by shape; verifies JWTs
  locally using the cached IAM public key; resolves API keys via
  IAM with a short-TTL hash-keyed cache.  Every failure path
  surfaces the same 401 body ("auth failure") so callers cannot
  enumerate credential state.
* Public key is fetched at gateway startup with a bounded retry loop;
  traffic does not begin flowing until auth has started.

Capability model (trustgraph-flow/trustgraph/gateway/capabilities.py)
---------------------------------------------------------------------
* Roles have two dimensions: a capability set and a workspace scope.
  OSS ships reader / writer / admin; the first two are workspace-
  assigned, admin is cross-workspace ("*").  No "cross-workspace"
  pseudo-capability — workspace permission is a property of the role.
* check(identity, capability, target_workspace=None) is the single
  authorisation test: some role must grant the capability *and* be
  active in the target workspace.
* enforce_workspace validates a request-body workspace against the
  caller's role scopes and injects the resolved value.  Cross-
  workspace admin is permitted by role scope, not by a bypass.
* Gateway endpoints declare a required capability explicitly — no
  permissive default.  Construction fails fast if omitted.  Enterprise
  editions can replace the role table without changing the wire
  protocol.

WebSocket first-frame auth (dispatch/mux.py, endpoint/socket.py)
----------------------------------------------------------------
* /api/v1/socket handshake unconditionally accepts; authentication
  runs on the first WebSocket frame ({"type":"auth","token":"..."})
  with {"type":"auth-ok","workspace":"..."} / {"type":"auth-failed"}.
  The socket stays open on failure so the client can re-authenticate
  — browsers treat a handshake-time 401 as terminal, breaking
  reconnection.
* Mux.receive rejects every non-auth frame before auth succeeds,
  enforces the caller's workspace (envelope + inner payload) using
  the role-scope resolver, and supports mid-session re-auth.
* Flow import/export streaming endpoints keep the legacy ?token=
  handshake (URL-scoped short-lived transfers; no re-auth need).

Auth surface
------------
* POST /api/v1/auth/login — public, returns a JWT.
* POST /api/v1/auth/bootstrap — public; forwards to IAM's bootstrap
  op which itself enforces mode + tables-empty.
* POST /api/v1/auth/change-password — any authenticated user.
* POST /api/v1/iam — admin-only generic forwarder for the rest of
  the IAM API (per-op REST endpoints to follow in a later change).

Removed / breaking
------------------
* GATEWAY_SECRET / --api-token / default_api_token and the legacy
  Authenticator.permitted contract.  The gateway cannot run without
  IAM.
* ?token= on /api/v1/socket.
* DispatcherManager and Mux both raise on auth=None — no silent
  downgrade path.

CLI tools (trustgraph-cli)
--------------------------
tg-bootstrap-iam, tg-login, tg-create-user, tg-list-users,
tg-disable-user, tg-enable-user, tg-delete-user, tg-change-password,
tg-reset-password, tg-create-api-key, tg-list-api-keys,
tg-revoke-api-key, tg-create-workspace, tg-list-workspaces.  Passwords
read via getpass; tokens / one-time secrets written to stdout with
operator context on stderr so shell composition works cleanly.
AsyncSocketClient / SocketClient updated to the first-frame auth
protocol.

Specifications
--------------
* docs/tech-specs/iam.md updated with the error policy, workspace
  resolver extension point, and OSS role-scope model.
* docs/tech-specs/iam-protocol.md (new) — transport, dataclasses,
  operation table, error taxonomy, bootstrap modes.
* docs/tech-specs/capabilities.md (new) — capability vocabulary, OSS
  role bundles, agent-as-composition note, enforcement-boundary
  policy, enterprise extensibility.

Tests
-----
* test_auth.py (rewritten) — IamAuth + JWT round-trip with real
  Ed25519 keypairs + API-key cache behaviour.
* test_capabilities.py (new) — role table sanity, check across
  role x workspace combinations, enforce_workspace paths,
  unknown-cap / unknown-role fail-closed.
* Every endpoint test construction now names its capability
  explicitly (no permissive defaults relied upon).  New tests pin
  the fail-closed invariants: DispatcherManager / Mux refuse
  auth=None; i18n path-traversal defense is exercised.
* test_socket_graceful_shutdown rewritten against IamAuth.
2026-04-24 17:29:10 +01:00
cybermaggedon
ae9936c9cc
feat: pluggable bootstrap framework with ordered initialisers (#847)
A generic, long-running bootstrap processor that converges a
deployment to its configured initial state and then idles.
Replaces the previous one-shot `tg-init-trustgraph` container model
and provides an extension point for enterprise / third-party
initialisers.

See docs/tech-specs/bootstrap.md for the full design.

Bootstrapper
------------
A single AsyncProcessor (trustgraph.bootstrap.bootstrapper.Processor)
that:

  * Reads a list of initialiser specifications (class, name, flag,
    params) from either a direct `initialisers` parameter
    (processor-group embedding) or a YAML/JSON file (`-c`, CLI).
  * On each wake, runs a cheap service-gate (config-svc +
    flow-svc round-trips), then iterates the initialiser list,
    running each whose configured flag differs from the one stored
    in __system__/init-state/<name>.
  * Stores per-initialiser completion state in the reserved
    __system__ workspace.
  * Adapts cadence: ~5s on gate failure, ~15s while converging,
    ~300s in steady state.
  * Isolates failures — one initialiser's exception does not block
    others in the same cycle; the failed one retries next wake.

Initialiser contract
--------------------
  * Subclass trustgraph.bootstrap.base.Initialiser.
  * Implement async run(ctx, old_flag, new_flag).
  * Opt out of the service gate with class attr
    wait_for_services=False (only used by PulsarTopology, since
    config-svc cannot come up until Pulsar namespaces exist).
  * ctx carries short-lived config and flow-svc clients plus a
    scoped logger.

Core initialisers (trustgraph.bootstrap.initialisers.*)
-------------------------------------------------------
  * PulsarTopology   — creates Pulsar tenant + namespaces
                       (pre-gate, blocking HTTP offloaded to
                        executor).
  * TemplateSeed     — seeds __template__ from an external JSON
                       file; re-run is upsert-missing by default,
                       overwrite-all opt-in.
  * WorkspaceInit    — populates a named workspace from either
                       the full contents of __template__ or a
                       seed file; raises cleanly if the template
                       isn't seeded yet so the bootstrapper retries
                       on the next cycle.
  * DefaultFlowStart — starts a specific flow in a workspace;
                       no-ops if the flow is already running.

Enterprise or third-party initialisers plug in via fully-qualified
dotted class paths in the bootstrapper's configuration — no core
code change required.

Config service
--------------
  * push(): filter out reserved workspaces (ids starting with "_")
    from the change notifications.  Stored config is preserved; only
    the broadcast is suppressed, so bootstrap / template state lives
    in config-svc without live processors ever reacting to it.

Config client
-------------
  * ConfigClient.get_all(workspace): wraps the existing `config`
    operation to return {type: {key: value}} for a workspace.
    WorkspaceInit uses it to copy __template__ without needing a
    hardcoded types list.

pyproject.toml
--------------
  * Adds a `bootstrap` console script pointing at the new Processor.

* Remove tg-init-trustgraph, superceded by bootstrap processor
2026-04-22 18:03:46 +01:00
cybermaggedon
d35473f7f7
feat: workspace-based multi-tenancy, replacing user as tenancy axis (#840)
Introduces `workspace` as the isolation boundary for config, flows,
library, and knowledge data. Removes `user` as a schema-level field
throughout the code, API specs, and tests; workspace provides the
same separation more cleanly at the trusted flow.workspace layer
rather than through client-supplied message fields.

Design
------
- IAM tech spec (docs/tech-specs/iam.md) documents current state,
  proposed auth/access model, and migration direction.
- Data ownership model (docs/tech-specs/data-ownership-model.md)
  captures the workspace/collection/flow hierarchy.

Schema + messaging
------------------
- Drop `user` field from AgentRequest/Step, GraphRagQuery,
  DocumentRagQuery, Triples/Graph/Document/Row EmbeddingsRequest,
  Sparql/Rows/Structured QueryRequest, ToolServiceRequest.
- Keep collection/workspace routing via flow.workspace at the
  service layer.
- Translators updated to not serialise/deserialise user.

API specs
---------
- OpenAPI schemas and path examples cleaned of user fields.
- Websocket async-api messages updated.
- Removed the unused parameters/User.yaml.

Services + base
---------------
- Librarian, collection manager, knowledge, config: all operations
  scoped by workspace. Config client API takes workspace as first
  positional arg.
- `flow.workspace` set at flow start time by the infrastructure;
  no longer pass-through from clients.
- Tool service drops user-personalisation passthrough.

CLI + SDK
---------
- tg-init-workspace and workspace-aware import/export.
- All tg-* commands drop user args; accept --workspace.
- Python API/SDK (flow, socket_client, async_*, explainability,
  library) drop user kwargs from every method signature.

MCP server
----------
- All tool endpoints drop user parameters; socket_manager no longer
  keyed per user.

Flow service
------------
- Closure-based topic cleanup on flow stop: only delete topics
  whose blueprint template was parameterised AND no remaining
  live flow (across all workspaces) still resolves to that topic.
  Three scopes fall out naturally from template analysis:
    * {id} -> per-flow, deleted on stop
    * {blueprint} -> per-blueprint, kept while any flow of the
      same blueprint exists
    * {workspace} -> per-workspace, kept while any flow in the
      workspace exists
    * literal -> global, never deleted (e.g. tg.request.librarian)
  Fixes a bug where stopping a flow silently destroyed the global
  librarian exchange, wedging all library operations until manual
  restart.

RabbitMQ backend
----------------
- heartbeat=60, blocked_connection_timeout=300. Catches silently
  dead connections (broker restart, orphaned channels, network
  partitions) within ~2 heartbeat windows, so the consumer
  reconnects and re-binds its queue rather than sitting forever
  on a zombie connection.

Tests
-----
- Full test refresh: unit, integration, contract, provenance.
- Dropped user-field assertions and constructor kwargs across
  ~100 test files.
- Renamed user-collection isolation tests to workspace-collection.
2026-04-21 23:23:01 +01:00