trustgraph/docs/tech-specs/iam-protocol.md
cybermaggedon 67b2fc448f
feat: IAM service, gateway auth middleware, capability model, and CLIs (#849)
Replaces the legacy GATEWAY_SECRET shared-token gate with an IAM-backed
identity and authorisation model.  The gateway no longer has an
"allow-all" or "no auth" mode; every request is authenticated via the
IAM service, authorised against a capability model that encodes both
the operation and the workspace it targets, and rejected with a
deliberately-uninformative 401 / 403 on any failure.

IAM service (trustgraph-flow/trustgraph/iam, trustgraph-base/schema/iam)
-----------------------------------------------------------------------
* New backend service (iam-svc) owning users, workspaces, API keys,
  passwords and JWT signing keys in Cassandra.  Reached over the
  standard pub/sub request/response pattern; gateway is the only
  caller.
* Operations: bootstrap, resolve-api-key, login, get-signing-key-public,
  rotate-signing-key, create/list/get/update/disable/delete/enable-user,
  change-password, reset-password, create/list/get/update/disable-
  workspace, create/list/revoke-api-key.
* Ed25519 JWT signing (alg=EdDSA).  Key rotation writes a new kid and
  retires the previous one; validation is grace-period friendly.
* Passwords: PBKDF2-HMAC-SHA-256, 600k iterations, per-user salt.
* API keys: 128-bit random, SHA-256 hashed.  Plaintext returned once.
* Bootstrap is explicit: --bootstrap-mode {token,bootstrap} is a
  required startup argument with no permissive default.  Masked
  "auth failure" errors hide whether a refused bootstrap request was
  due to mode, state, or authorisation.

Gateway authentication (trustgraph-flow/trustgraph/gateway/auth.py)
-------------------------------------------------------------------
* IamAuth replaces the legacy Authenticator.  Distinguishes JWTs
  (three-segment dotted) from API keys by shape; verifies JWTs
  locally using the cached IAM public key; resolves API keys via
  IAM with a short-TTL hash-keyed cache.  Every failure path
  surfaces the same 401 body ("auth failure") so callers cannot
  enumerate credential state.
* Public key is fetched at gateway startup with a bounded retry loop;
  traffic does not begin flowing until auth has started.

Capability model (trustgraph-flow/trustgraph/gateway/capabilities.py)
---------------------------------------------------------------------
* Roles have two dimensions: a capability set and a workspace scope.
  OSS ships reader / writer / admin; the first two are workspace-
  assigned, admin is cross-workspace ("*").  No "cross-workspace"
  pseudo-capability — workspace permission is a property of the role.
* check(identity, capability, target_workspace=None) is the single
  authorisation test: some role must grant the capability *and* be
  active in the target workspace.
* enforce_workspace validates a request-body workspace against the
  caller's role scopes and injects the resolved value.  Cross-
  workspace admin is permitted by role scope, not by a bypass.
* Gateway endpoints declare a required capability explicitly — no
  permissive default.  Construction fails fast if omitted.  Enterprise
  editions can replace the role table without changing the wire
  protocol.

WebSocket first-frame auth (dispatch/mux.py, endpoint/socket.py)
----------------------------------------------------------------
* /api/v1/socket handshake unconditionally accepts; authentication
  runs on the first WebSocket frame ({"type":"auth","token":"..."})
  with {"type":"auth-ok","workspace":"..."} / {"type":"auth-failed"}.
  The socket stays open on failure so the client can re-authenticate
  — browsers treat a handshake-time 401 as terminal, breaking
  reconnection.
* Mux.receive rejects every non-auth frame before auth succeeds,
  enforces the caller's workspace (envelope + inner payload) using
  the role-scope resolver, and supports mid-session re-auth.
* Flow import/export streaming endpoints keep the legacy ?token=
  handshake (URL-scoped short-lived transfers; no re-auth need).

Auth surface
------------
* POST /api/v1/auth/login — public, returns a JWT.
* POST /api/v1/auth/bootstrap — public; forwards to IAM's bootstrap
  op which itself enforces mode + tables-empty.
* POST /api/v1/auth/change-password — any authenticated user.
* POST /api/v1/iam — admin-only generic forwarder for the rest of
  the IAM API (per-op REST endpoints to follow in a later change).

Removed / breaking
------------------
* GATEWAY_SECRET / --api-token / default_api_token and the legacy
  Authenticator.permitted contract.  The gateway cannot run without
  IAM.
* ?token= on /api/v1/socket.
* DispatcherManager and Mux both raise on auth=None — no silent
  downgrade path.

CLI tools (trustgraph-cli)
--------------------------
tg-bootstrap-iam, tg-login, tg-create-user, tg-list-users,
tg-disable-user, tg-enable-user, tg-delete-user, tg-change-password,
tg-reset-password, tg-create-api-key, tg-list-api-keys,
tg-revoke-api-key, tg-create-workspace, tg-list-workspaces.  Passwords
read via getpass; tokens / one-time secrets written to stdout with
operator context on stderr so shell composition works cleanly.
AsyncSocketClient / SocketClient updated to the first-frame auth
protocol.

Specifications
--------------
* docs/tech-specs/iam.md updated with the error policy, workspace
  resolver extension point, and OSS role-scope model.
* docs/tech-specs/iam-protocol.md (new) — transport, dataclasses,
  operation table, error taxonomy, bootstrap modes.
* docs/tech-specs/capabilities.md (new) — capability vocabulary, OSS
  role bundles, agent-as-composition note, enforcement-boundary
  policy, enterprise extensibility.

Tests
-----
* test_auth.py (rewritten) — IamAuth + JWT round-trip with real
  Ed25519 keypairs + API-key cache behaviour.
* test_capabilities.py (new) — role table sanity, check across
  role x workspace combinations, enforce_workspace paths,
  unknown-cap / unknown-role fail-closed.
* Every endpoint test construction now names its capability
  explicitly (no permissive defaults relied upon).  New tests pin
  the fail-closed invariants: DispatcherManager / Mux refuse
  auth=None; i18n path-traversal defense is exercised.
* test_socket_graceful_shutdown rewritten against IamAuth.
2026-04-24 17:29:10 +01:00

329 lines
14 KiB
Markdown

---
layout: default
title: "IAM Service Protocol Technical Specification"
parent: "Tech Specs"
---
# IAM Service Protocol Technical Specification
## Overview
The IAM service is a backend processor, reached over the standard
request/response pub/sub pattern. It is the authority for users,
workspaces, API keys, and login credentials. The API gateway
delegates to it for authentication resolution and for all user /
workspace / key management.
This document defines the wire protocol: the `IamRequest` and
`IamResponse` dataclasses, the operation set, the per-operation
input and output fields, the error taxonomy, and the initial HTTP
forwarding endpoint used while IAM is being integrated into the
gateway.
Architectural context — roles, capabilities, workspace scoping,
enforcement boundary — lives in [`iam.md`](iam.md) and
[`capabilities.md`](capabilities.md).
## Transport
- **Request topic:** `request:tg/request/iam-request`
- **Response topic:** `response:tg/response/iam-response`
- **Pattern:** request/response, correlated by the `id` message
property, the same pattern used by `config-svc` and `flow-svc`.
- **Caller:** the API gateway only. Under the enforcement-boundary
policy (see capabilities spec), the IAM service trusts the bus
and performs no per-request authentication or capability check
against the caller. The gateway has already evaluated capability
membership and workspace scoping before sending the request.
## Dataclasses
### `IamRequest`
```python
@dataclass
class IamRequest:
# One of the operation strings below.
operation: str = ""
# Scope of this request. Required on every workspace-scoped
# operation. Omitted (or empty) for system-level ops
# (workspace CRUD, signing-key ops, bootstrap, resolve-api-key,
# login).
workspace: str = ""
# Acting user id, for audit. Set by the gateway to the
# authenticated caller's id on user-initiated operations.
# Empty for internal-origin (bootstrap, reconcilers) and for
# resolve-api-key / login (no actor yet).
actor: str = ""
# --- identity selectors ---
user_id: str = ""
username: str = "" # login; unique within a workspace
key_id: str = "" # revoke-api-key, list-api-keys (own)
api_key: str = "" # resolve-api-key (plaintext)
# --- credentials ---
password: str = "" # login, change-password (current)
new_password: str = "" # change-password
# --- user fields ---
user: UserInput | None = None # create-user, update-user
# --- workspace fields ---
workspace_record: WorkspaceInput | None = None # create-workspace, update-workspace
# --- api key fields ---
key: ApiKeyInput | None = None # create-api-key
```
### `IamResponse`
```python
@dataclass
class IamResponse:
# Populated on success of operations that return them.
user: UserRecord | None = None # create-user, get-user, update-user
users: list[UserRecord] = field(default_factory=list) # list-users
workspace: WorkspaceRecord | None = None # create-workspace, get-workspace, update-workspace
workspaces: list[WorkspaceRecord] = field(default_factory=list) # list-workspaces
# create-api-key returns the plaintext once. Never populated
# on any other operation.
api_key_plaintext: str = ""
api_key: ApiKeyRecord | None = None # create-api-key
api_keys: list[ApiKeyRecord] = field(default_factory=list) # list-api-keys
# login, rotate-signing-key
jwt: str = ""
jwt_expires: str = "" # ISO-8601 UTC
# get-signing-key-public
signing_key_public: str = "" # PEM
# resolve-api-key returns who this key authenticates as.
resolved_user_id: str = ""
resolved_workspace: str = ""
resolved_roles: list[str] = field(default_factory=list)
# reset-password
temporary_password: str = "" # returned once to the operator
# bootstrap: on first run, the initial admin's one-time API key
# is returned for the operator to capture.
bootstrap_admin_user_id: str = ""
bootstrap_admin_api_key: str = ""
# Present on any failed operation.
error: Error | None = None
```
### Value types
```python
@dataclass
class UserInput:
username: str = ""
name: str = ""
email: str = ""
password: str = "" # only on create-user; never on update-user
roles: list[str] = field(default_factory=list)
enabled: bool = True
must_change_password: bool = False
@dataclass
class UserRecord:
id: str = ""
workspace: str = ""
username: str = ""
name: str = ""
email: str = ""
roles: list[str] = field(default_factory=list)
enabled: bool = True
must_change_password: bool = False
created: str = "" # ISO-8601 UTC
# Password hash is never included in any response.
@dataclass
class WorkspaceInput:
id: str = ""
name: str = ""
enabled: bool = True
@dataclass
class WorkspaceRecord:
id: str = ""
name: str = ""
enabled: bool = True
created: str = "" # ISO-8601 UTC
@dataclass
class ApiKeyInput:
user_id: str = ""
name: str = "" # operator-facing label, e.g. "laptop"
expires: str = "" # optional ISO-8601 UTC; empty = no expiry
@dataclass
class ApiKeyRecord:
id: str = ""
user_id: str = ""
name: str = ""
prefix: str = "" # first 4 chars of plaintext, for identification in lists
expires: str = "" # empty = no expiry
created: str = ""
last_used: str = "" # empty if never used
# key_hash is never included in any response.
```
## Operations
| Operation | Request fields | Response fields | Notes |
|---|---|---|---|
| `login` | `username`, `password`, `workspace` (optional) | `jwt`, `jwt_expires` | If `workspace` omitted, IAM resolves to the user's assigned workspace. |
| `resolve-api-key` | `api_key` (plaintext) | `resolved_user_id`, `resolved_workspace`, `resolved_roles` | Gateway-internal. Service returns `auth-failed` for unknown / expired / revoked keys. |
| `change-password` | `user_id`, `password` (current), `new_password` | — | Self-service. IAM validates `password` against stored hash. |
| `reset-password` | `user_id` | `temporary_password` | Admin-initiated. IAM generates a random password, sets `must_change_password=true` on the user, returns the plaintext once. |
| `create-user` | `workspace`, `user` | `user` | Admin-only. `user.password` is hashed and stored; `user.roles` must be subset of known roles. |
| `list-users` | `workspace` | `users` | |
| `get-user` | `workspace`, `user_id` | `user` | |
| `update-user` | `workspace`, `user_id`, `user` | `user` | `password` field on `user` is rejected; use `change-password` / `reset-password`. |
| `disable-user` | `workspace`, `user_id` | — | Soft-delete; sets `enabled=false`. Revokes all the user's API keys. |
| `create-workspace` | `workspace_record` | `workspace` | System-level. |
| `list-workspaces` | — | `workspaces` | System-level. |
| `get-workspace` | `workspace_record` (id only) | `workspace` | System-level. |
| `update-workspace` | `workspace_record` | `workspace` | System-level. |
| `disable-workspace` | `workspace_record` (id only) | — | System-level. Sets `enabled=false`; revokes all workspace API keys; disables all users in the workspace. |
| `create-api-key` | `workspace`, `key` | `api_key_plaintext`, `api_key` | Plaintext returned **once**; only hash stored. `key.name` required. |
| `list-api-keys` | `workspace`, `user_id` | `api_keys` | |
| `revoke-api-key` | `workspace`, `key_id` | — | Deletes the key record. |
| `get-signing-key-public` | — | `signing_key_public` | Gateway fetches this at startup. |
| `rotate-signing-key` | — | — | System-level. Introduces a new signing key; old key continues to validate JWTs for a grace period (implementation-defined, minimum 1h). |
| `bootstrap` | — | `bootstrap_admin_user_id`, `bootstrap_admin_api_key` | If IAM tables are empty, creates the initial `default` workspace, an `admin` user, an initial API key, and an initial signing key; returns them once. No-op on subsequent calls (returns empty fields). |
## Error taxonomy
All errors are carried in the `IamResponse.error` field. `error.type`
is one of the values below; `error.message` is a human-readable
string that is **not** surfaced verbatim to external callers (the
gateway maps to `auth failure` / `access denied` per the IAM error
policy).
| `type` | When |
|---|---|
| `invalid-argument` | Malformed request (missing required field, unknown operation, invalid format). |
| `not-found` | Named resource does not exist (`user_id`, `key_id`, workspace). |
| `duplicate` | Create operation collides with an existing resource (username, workspace id, key name). |
| `auth-failed` | `login` with wrong credentials; `resolve-api-key` with unknown / expired / revoked key; `change-password` with wrong current password. Single bucket to deny oracle attacks. |
| `weak-password` | Password does not meet policy (length, complexity — policy defined at service level). |
| `disabled` | Target user or workspace has `enabled=false`. |
| `operation-not-permitted` | Non-admin attempting system-level operation, or workspace-scoped operation attempting to affect another workspace. |
| `internal-error` | Unexpected IAM-side failure. Log and surface as 500 at the gateway. |
The gateway is responsible for translating `auth-failed` and
`operation-not-permitted` into the obfuscated external error
response (`"auth failure"` / `"access denied"`); `invalid-argument`
becomes a descriptive 400; `not-found` / `duplicate` /
`weak-password` / `disabled` become descriptive 4xx but never leak
IAM-internal detail.
## Credential storage
- **Passwords** are stored using a slow KDF (bcrypt / argon2id — the
service picks; documented as an implementation detail). The
`password_hash` column stores the full KDF-encoded string
(algorithm, cost, salt, hash). Not a plain SHA-256.
- **API keys** are stored as SHA-256 of the plaintext. API keys
are 128-bit random values (`tg_` + base64url); the entropy
makes a slow hash unnecessary. The hash serves as the primary
key on the `iam_api_keys` table, enabling O(1) lookup on
`resolve-api-key`.
- **JWT signing key** is stored as an RSA or Ed25519 private key
(implementation choice) in a dedicated `iam_signing_keys` table
with a `kid`, `created`, and optional `retired` timestamp. At
most one active key; up to N retired keys are kept for a grace
period to validate previously-issued JWTs.
Passwords, API-key plaintext, and signing-key private material are
never returned in any response other than the explicit one-time
responses above (`reset-password`, `create-api-key`, `bootstrap`).
## Bootstrap modes
`iam-svc` requires a bootstrap mode to be chosen at startup. There is
no default — an unset or invalid mode causes the service to refuse
to start. The purpose is to force the operator to make an explicit
security decision rather than rely on an implicit "safe" fallback.
| Mode | Startup behaviour | `bootstrap` operation | Suitability |
|---|---|---|---|
| `token` | On first start with empty tables, auto-seeds the `default` workspace, admin user, admin API key (using the operator-provided `--bootstrap-token`), and an initial signing key. No-op on subsequent starts. | Refused — returns `auth-failed` / `"auth failure"` regardless of caller. | Production, any public-exposure deployment. |
| `bootstrap` | No startup seeding. Tables remain empty until the `bootstrap` operation is invoked over the pub/sub bus (typically via `tg-bootstrap-iam`). | Live while tables are empty. Generates and returns the admin API key once. Refused (`auth-failed`) once tables are populated. | Dev / compose up / CI. **Not safe under public exposure** — any caller reaching the gateway's `/api/v1/iam` forwarder before the operator can cause a token to be issued to them. Operators choosing this mode accept that risk. |
### Error masking
In both modes, any refused invocation of the `bootstrap` operation
returns the same error (`auth-failed` / `"auth failure"`). A caller
cannot distinguish:
- "service is in token mode"
- "service is in bootstrap mode but already bootstrapped"
- "operation forbidden"
This matches the general IAM error-policy stance (see `iam.md`) and
prevents externally enumerating IAM's state.
### Bootstrap-token lifecycle
The bootstrap token — whether operator-supplied (`token` mode) or
service-generated (`bootstrap` mode) — is a one-time credential. It
is stored as admin's single API key, tagged `name="bootstrap"`. The
operator's first admin action after bootstrap should be:
1. Create a durable admin user and API key (or issue a durable API
key to the bootstrap admin).
2. Revoke the bootstrap key via `revoke-api-key`.
3. Remove the bootstrap token from any deployment configuration.
The `name="bootstrap"` marker makes bootstrap keys easy to detect in
tooling (e.g. a `tg-list-api-keys` filter).
## HTTP forwarding (initial integration)
For the initial gateway integration — before the IAM service is
wired into the authentication middleware — the gateway exposes a
single forwarding endpoint:
```
POST /api/v1/iam
```
- Request body is a JSON encoding of `IamRequest`.
- Response body is a JSON encoding of `IamResponse`.
- The gateway's existing authentication (`GATEWAY_SECRET` bearer)
gates access to this endpoint so the IAM protocol can be
exercised end-to-end in tests without touching the live auth
path.
- This endpoint is **not** the final shape. Once the middleware is
in place, per-operation REST endpoints replace it (for example
`POST /api/v1/auth/login`, `POST /api/v1/users`, `DELETE
/api/v1/api-keys/{id}`), and this generic forwarder is removed.
The endpoint performs only message marshalling: it does not read
or rewrite fields in the request, and it applies no capability
check. All authorisation for user / workspace / key management
lands in the subsequent middleware work.
## Non-goals for this spec
- REST endpoint shape for the final gateway surface — covered in
Phase 2 of the IAM implementation plan, not here.
- OIDC / SAML external IdP protocol — out of scope for open source.
- Key-signing algorithm choice, password KDF choice, JWT claim
layout — implementation details captured in code + ADRs, not
locked in the protocol spec.
## References
- [Identity and Access Management Specification](iam.md)
- [Capability Vocabulary Specification](capabilities.md)