feat: IAM service, gateway auth middleware, capability model, and CLIs (#849)

Replaces the legacy GATEWAY_SECRET shared-token gate with an IAM-backed
identity and authorisation model. The gateway no longer has an
"allow-all" or "no auth" mode; every request is authenticated via the
IAM service, authorised against a capability model that encodes both
the operation and the workspace it targets, and rejected with a
deliberately-uninformative 401 / 403 on any failure.

IAM service (trustgraph-flow/trustgraph/iam, trustgraph-base/schema/iam)
-----------------------------------------------------------------------
* New backend service (iam-svc) owning users, workspaces, API keys,
  passwords and JWT signing keys in Cassandra. Reached over the
  standard pub/sub request/response pattern; gateway is the only
  caller.
* Operations: bootstrap, resolve-api-key, login,
  get-signing-key-public, rotate-signing-key,
  create/list/get/update/disable/delete/enable-user, change-password,
  reset-password, create/list/get/update/disable-workspace,
  create/list/revoke-api-key.
* Ed25519 JWT signing (alg=EdDSA). Key rotation writes a new kid and
  retires the previous one; validation is grace-period friendly.
* Passwords: PBKDF2-HMAC-SHA-256, 600k iterations, per-user salt.
* API keys: 128-bit random, SHA-256 hashed. Plaintext returned once.
* Bootstrap is explicit: --bootstrap-mode {token,bootstrap} is a
  required startup argument with no permissive default. Masked
  "auth failure" errors hide whether a refused bootstrap request was
  due to mode, state, or authorisation.

Gateway authentication (trustgraph-flow/trustgraph/gateway/auth.py)
-------------------------------------------------------------------
* IamAuth replaces the legacy Authenticator. Distinguishes JWTs
  (three-segment dotted) from API keys by shape; verifies JWTs locally
  using the cached IAM public key; resolves API keys via IAM with a
  short-TTL hash-keyed cache. Every failure path surfaces the same
  401 body ("auth failure") so callers cannot enumerate credential
  state.
* Public key is fetched at gateway startup with a bounded retry loop;
  traffic does not begin flowing until auth has started.

Capability model (trustgraph-flow/trustgraph/gateway/capabilities.py)
---------------------------------------------------------------------
* Roles have two dimensions: a capability set and a workspace scope.
  OSS ships reader / writer / admin; the first two are
  workspace-assigned, admin is cross-workspace ("*"). No
  "cross-workspace" pseudo-capability — workspace permission is a
  property of the role.
* check(identity, capability, target_workspace=None) is the single
  authorisation test: some role must grant the capability *and* be
  active in the target workspace.
* enforce_workspace validates a request-body workspace against the
  caller's role scopes and injects the resolved value. Cross-workspace
  admin is permitted by role scope, not by a bypass.
* Gateway endpoints declare a required capability explicitly — no
  permissive default. Construction fails fast if omitted. Enterprise
  editions can replace the role table without changing the wire
  protocol.

WebSocket first-frame auth (dispatch/mux.py, endpoint/socket.py)
----------------------------------------------------------------
* /api/v1/socket handshake unconditionally accepts; authentication
  runs on the first WebSocket frame ({"type":"auth","token":"..."})
  with {"type":"auth-ok","workspace":"..."} / {"type":"auth-failed"}.
  The socket stays open on failure so the client can re-authenticate —
  browsers treat a handshake-time 401 as terminal, breaking
  reconnection.
* Mux.receive rejects every non-auth frame before auth succeeds,
  enforces the caller's workspace (envelope + inner payload) using the
  role-scope resolver, and supports mid-session re-auth.
* Flow import/export streaming endpoints keep the legacy ?token=
  handshake (URL-scoped short-lived transfers; no re-auth need).

Auth surface
------------
* POST /api/v1/auth/login — public, returns a JWT.
* POST /api/v1/auth/bootstrap — public; forwards to IAM's bootstrap op
  which itself enforces mode + tables-empty.
* POST /api/v1/auth/change-password — any authenticated user.
* POST /api/v1/iam — admin-only generic forwarder for the rest of the
  IAM API (per-op REST endpoints to follow in a later change).

Removed / breaking
------------------
* GATEWAY_SECRET / --api-token / default_api_token and the legacy
  Authenticator.permitted contract. The gateway cannot run without
  IAM.
* ?token= on /api/v1/socket.
* DispatcherManager and Mux both raise on auth=None — no silent
  downgrade path.

CLI tools (trustgraph-cli)
--------------------------
tg-bootstrap-iam, tg-login, tg-create-user, tg-list-users,
tg-disable-user, tg-enable-user, tg-delete-user, tg-change-password,
tg-reset-password, tg-create-api-key, tg-list-api-keys,
tg-revoke-api-key, tg-create-workspace, tg-list-workspaces.

Passwords read via getpass; tokens / one-time secrets written to
stdout with operator context on stderr so shell composition works
cleanly. AsyncSocketClient / SocketClient updated to the first-frame
auth protocol.

Specifications
--------------
* docs/tech-specs/iam.md updated with the error policy, workspace
  resolver extension point, and OSS role-scope model.
* docs/tech-specs/iam-protocol.md (new) — transport, dataclasses,
  operation table, error taxonomy, bootstrap modes.
* docs/tech-specs/capabilities.md (new) — capability vocabulary, OSS
  role bundles, agent-as-composition note, enforcement-boundary
  policy, enterprise extensibility.

Tests
-----
* test_auth.py (rewritten) — IamAuth + JWT round-trip with real
  Ed25519 keypairs + API-key cache behaviour.
* test_capabilities.py (new) — role table sanity, check across
  role x workspace combinations, enforce_workspace paths,
  unknown-cap / unknown-role fail-closed.
* Every endpoint test construction now names its capability
  explicitly (no permissive defaults relied upon). New tests pin the
  fail-closed invariants: DispatcherManager / Mux refuse auth=None;
  i18n path-traversal defense is exercised.
* test_socket_graceful_shutdown rewritten against IamAuth.
2026-05-01 11:26:22 +02:00 · 2026-04-24 17:29:10 +01:00 · 2026-04-24 17:29:10 +01:00 · 67b2fc448f
commit 67b2fc448f
parent ae9936c9cc
61 changed files with 6474 additions and 792 deletions
--- a/docs/tech-specs/capabilities.md
+++ b/docs/tech-specs/capabilities.md
@@ -0,0 +1,218 @@
+---
+layout: default
+title: "Capability Vocabulary Technical Specification"
+parent: "Tech Specs"
+---
+
+# Capability Vocabulary Technical Specification
+
+## Overview
+
+Authorisation in TrustGraph is **capability-based**. Every gateway
+endpoint maps to exactly one *capability*; a user's roles each grant
+a set of capabilities; an authenticated request is permitted when
+the required capability is a member of the union of the caller's
+role capability sets.
+
+This document defines the capability vocabulary — the closed list
+of capability strings that the gateway recognises — and the
+open-source edition's role bundles.
+
+The capability mechanism is shared between the open-source edition
+and potential third-party enterprise editions. The open-source
+edition ships a fixed set of three roles (`reader`, `writer`,
+`admin`). Enterprise editions may define additional roles by
+composing their own capability bundles from the same vocabulary; no
+protocol, gateway, or backend-service change is required.
+
+## Motivation
+
+The original IAM spec used hierarchical "minimum role" checks
+(`admin` implies `writer` implies `reader`). That shape is simple
+but paints the role model into a corner: any enterprise need to
+grant a subset of admin abilities (helpdesk that can reset
+passwords but not edit flows; analyst who can query but not ingest)
+requires a protocol-level change.
+
+A capability vocabulary decouples "what a request needs" from
+"what roles a user has" and makes the role table pure data. The
+open-source bundles can stay coarse while the enterprise role
+table expands without any code movement.
+
+## Design
+
+### Capability string format
+
+`<subsystem>:<verb>` or `<subsystem>` (for capabilities with no
+natural read/write split). All lowercase, kebab-case for
+multi-word subsystems.
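
A minimal sketch (not part of this change) of how the format rule above could be checked. The regular expression and the `parse_capability` helper are hypothetical names introduced for illustration; the gateway's actual validation may differ.

```python
from __future__ import annotations

import re

# Hypothetical pattern for "<subsystem>" or "<subsystem>:<verb>":
# lowercase words, kebab-case allowed in the subsystem part.
CAPABILITY_RE = re.compile(r"[a-z]+(?:-[a-z]+)*(?::[a-z]+)?")


def parse_capability(cap: str) -> tuple[str, str | None]:
    """Split a capability string into (subsystem, verb-or-None)."""
    if not CAPABILITY_RE.fullmatch(cap):
        raise ValueError(f"malformed capability: {cap!r}")
    subsystem, _, verb = cap.partition(":")
    return subsystem, verb or None
```

Under these assumptions `parse_capability("graph:read")` yields `("graph", "read")`, `parse_capability("agent")` yields `("agent", None)`, and a mixed-case string such as `"Graph:Read"` raises `ValueError`.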
+
+### Capability list
+
+**Data plane**
+
+| Capability | Covers |
+|---|---|
+| `agent` | agent (query-only; no write counterpart) |
+| `graph:read` | graph-rag, graph-embeddings-query, triples-query, sparql, graph-embeddings-export, triples-export |
+| `graph:write` | triples-import, graph-embeddings-import |
+| `documents:read` | document-rag, document-embeddings-query, document-embeddings-export, entity-contexts-export, document-stream-export, library list / fetch |
+| `documents:write` | document-embeddings-import, entity-contexts-import, text-load, document-load, library add / replace / delete |
+| `rows:read` | rows-query, row-embeddings-query, nlp-query, structured-query, structured-diag |
+| `rows:write` | rows-import |
+| `llm` | text-completion, prompt (stateless invocation) |
+| `embeddings` | Raw text-embedding service (stateless compute; typed-data embedding stores live under their data-subject capability) |
+| `mcp` | mcp-tool |
+| `collections:read` | List / describe collections |
+| `collections:write` | Create / delete collections |
+| `knowledge:read` | List / get knowledge cores |
+| `knowledge:write` | Create / delete knowledge cores |
+
+**Control plane**
+
+| Capability | Covers |
+|---|---|
+| `config:read` | Read workspace config |
+| `config:write` | Write workspace config |
+| `flows:read` | List / describe flows, blueprints, flow classes |
+| `flows:write` | Start / stop / update flows |
+| `users:read` | List / get users within the workspace |
+| `users:write` | Create / update / disable users within the workspace |
+| `users:admin` | Assign / remove roles on users within the workspace |
+| `keys:self` | Create / revoke / list **own** API keys |
+| `keys:admin` | Create / revoke / list **any user's** API keys within the workspace |
+| `workspaces:admin` | Create / delete / disable workspaces (system-level) |
+| `iam:admin` | JWT signing-key rotation, IAM-level operations |
+| `metrics:read` | Prometheus metrics proxy |
+
+### Open-source role bundles
+
+The open-source edition ships three roles:
+
+| Role | Capabilities |
+|---|---|
+| `reader` | `agent`, `graph:read`, `documents:read`, `rows:read`, `llm`, `embeddings`, `mcp`, `collections:read`, `knowledge:read`, `flows:read`, `config:read`, `keys:self` |
+| `writer` | everything in `reader` **+** `graph:write`, `documents:write`, `rows:write`, `collections:write`, `knowledge:write` |
+| `admin` | everything in `writer` **+** `config:write`, `flows:write`, `users:read`, `users:write`, `users:admin`, `keys:admin`, `workspaces:admin`, `iam:admin`, `metrics:read` |
+
+Open-source bundles are deliberately coarse. `workspaces:admin` and
+`iam:admin` live inside `admin` without a separate role; a single
+`admin` user holds the keys to the whole deployment.
+
+### The `agent` capability and composition
+
+The `agent` capability is granted independently of the capabilities
+it composes under the hood (`llm`, `graph`, `documents`, `rows`,
+`mcp`, etc.). A user holding `agent` but not `llm` can still cause
+LLM invocations because the agent implementation chooses which
+services to invoke on the caller's behalf.
+
+This is deliberate. A common policy is "allow controlled access
+via the agent, deny raw model calls" — granting `agent` without
+granting `llm` expresses exactly that. An administrator granting
+`agent` should treat it as a grant of everything the agent
+composes at deployment time.
+
+### Authorisation evaluation
+
+For a request bearing a resolved set of roles
+`R = {r1, r2, ...}` against an endpoint that requires capability
+`c`:
+
+```
+allow if c IN union(bundle(r) for r in R)
+```
+
+No hierarchy, no precedence, no role-order sensitivity. A user
+with a single role is the common case; a user with multiple roles
+gets the union of their bundles.
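
The evaluation rule above can be sketched in a few lines of Python. The bundle contents are copied from the open-source role table in this spec; the names `ROLE_BUNDLES` and `is_allowed` are hypothetical and are not taken from `capabilities.py`.

```python
# Role bundles copied from the open-source table in this spec.
READER = frozenset({
    "agent", "graph:read", "documents:read", "rows:read", "llm",
    "embeddings", "mcp", "collections:read", "knowledge:read",
    "flows:read", "config:read", "keys:self",
})
WRITER = READER | {
    "graph:write", "documents:write", "rows:write",
    "collections:write", "knowledge:write",
}
ADMIN = WRITER | {
    "config:write", "flows:write", "users:read", "users:write",
    "users:admin", "keys:admin", "workspaces:admin", "iam:admin",
    "metrics:read",
}
ROLE_BUNDLES = {"reader": READER, "writer": WRITER, "admin": ADMIN}


def is_allowed(roles, capability):
    """Allow if the capability is in the union of the roles' bundles."""
    granted = set()
    for role in roles:
        # Unknown roles contribute zero capabilities.
        granted |= ROLE_BUNDLES.get(role, frozenset())
    return capability in granted
```

Because the test is plain set membership over a union, role order and role count do not affect the outcome. This sketch covers only the capability dimension; workspace scoping is a separate step.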
+
+### Enforcement boundary
+
+Capability checks — and authentication — are applied **only at the
+API gateway**, on requests arriving from external callers.
+Operations originating inside the platform (backend service to
+backend service, agent to LLM, flow-svc to config-svc, bootstrap
+initialisers, scheduled reconcilers, autonomous flow steps) are
+**not capability-checked**. Backend services trust the workspace
+set by the gateway on inbound pub/sub messages and trust
+internally-originated messages without further authorisation.
+
+This policy has four consequences that are part of the spec, not
+accidents of implementation:
+
+1. **The gateway is the single trust boundary for user
+   authorisation.** Every backend service is a downstream consumer
+   of an already-authorised workspace scope.
+2. **Pub/sub carries workspace, not user identity.** Messages on
+   the bus do not carry credentials or the identity that originated
+   a request; they carry the resolved workspace only. This keeps
+   the bus protocol free of secrets and aligns with the workspace
+   resolver's role as the gateway-side narrowing step.
+3. **Composition is transitive.** Granting a capability that the
+   platform composes internally (for example, `agent`) transitively
+   grants everything that capability composes under the hood,
+   because the downstream calls are internal-origin and are not
+   re-checked. The composite nature of `agent` described above is
+   a consequence of this policy, not a special case.
+4. **Internal-origin operations have no user.** Bootstrap,
+   reconcilers, and other platform-initiated work act with
+   system-level authority. The workspace field on such messages
+   identifies which workspace's data is being touched, not who
+   asked.
+
+**Trust model.** Whoever has pub/sub access is implicitly trusted
+to act as any workspace. Defense-in-depth within the backend is
+not part of this design; the security perimeter is the gateway
+and the bus itself (TLS / network isolation between the bus and
+any untrusted network).
+
+### Unknown capabilities and unknown roles
+
+- An endpoint declaring an unknown capability is a server-side bug
+  and fails closed (403, logged).
+- A role name on a user record that is not defined in the role
+  table is ignored for authorisation purposes and logged as a
+  warning. Behaviour is deterministic: unknown roles contribute
+  zero capabilities.
+
+### Capability scope
+
+Every capability is **implicitly scoped to the caller's resolved
+workspace**. A `users:write` capability does not permit a user
+in workspace `acme` to create users in workspace `beta` — the
+workspace-resolver has already narrowed the request to one
+workspace before the capability check runs. See the IAM
+specification for the workspace-resolver contract.
+
+The three exceptions are the system-level capabilities
+`workspaces:admin` and `iam:admin`, which operate across
+workspaces by definition, and `metrics:read`, which returns
+process-level series not scoped to any workspace.
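
A hedged sketch of a workspace-scoped check, following the `check(identity, capability, target_workspace=None)` signature and the `"*"` admin scope named in the commit description. The `Identity` shape, the `"assigned"` scope marker, and the trimmed role table are assumptions made for illustration only.

```python
from dataclasses import dataclass, field

# Trimmed, illustrative role table: (capability set, workspace scope).
# "assigned" = only the user's own workspace; "*" = any workspace.
ROLES = {
    "reader": ({"graph:read"}, "assigned"),
    "writer": ({"graph:read", "graph:write"}, "assigned"),
    "admin": ({"graph:read", "graph:write", "users:write"}, "*"),
}


@dataclass
class Identity:  # hypothetical shape
    user_id: str
    workspace: str  # the user's assigned workspace
    roles: list = field(default_factory=list)


def check(identity, capability, target_workspace=None):
    """Some role must grant the capability and cover the target workspace."""
    target = target_workspace or identity.workspace
    for role in identity.roles:
        entry = ROLES.get(role)
        if entry is None:
            continue  # unknown role: contributes nothing
        capabilities, scope = entry
        in_scope = scope == "*" or target == identity.workspace
        if capability in capabilities and in_scope:
            return True
    return False
```

In this sketch a `writer` assigned to `acme` passes `check(alice, "graph:write")` but fails `check(alice, "graph:write", "beta")`, while an `admin` passes for either workspace because its scope is `"*"`.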
+
+## Enterprise extensibility
+
+Enterprise editions extend the role table additively:
+
+```
+data-analyst:   {graph:read, documents:read, rows:read, collections:read, knowledge:read}
+helpdesk:       {users:read, users:write, users:admin, keys:admin}
+data-engineer:  writer + {flows:read, config:read}
+workspace-owner: admin − {workspaces:admin, iam:admin}
+```
+
+None of this requires a protocol change — the wire-protocol `roles`
+field on user records is already a set, the gateway's
+capability-check is already capability-based, and the capability
+vocabulary is closed. Enterprises may introduce roles whose bundles
+compose the same capabilities differently.
+
+When an enterprise introduces a new capability (e.g. for a feature
+that does not exist in open source), the capability string is
+added to the vocabulary and recognised by the gateway build that
+ships that feature.
+
+## References
+
+- [Identity and Access Management Specification](iam.md)
+- [Architecture Principles](architecture-principles.md)
--- a/docs/tech-specs/iam-protocol.md
+++ b/docs/tech-specs/iam-protocol.md
@@ -0,0 +1,329 @@
+---
+layout: default
+title: "IAM Service Protocol Technical Specification"
+parent: "Tech Specs"
+---
+
+# IAM Service Protocol Technical Specification
+
+## Overview
+
+The IAM service is a backend processor, reached over the standard
+request/response pub/sub pattern. It is the authority for users,
+workspaces, API keys, and login credentials. The API gateway
+delegates to it for authentication resolution and for all user /
+workspace / key management.
+
+This document defines the wire protocol: the `IamRequest` and
+`IamResponse` dataclasses, the operation set, the per-operation
+input and output fields, the error taxonomy, and the initial HTTP
+forwarding endpoint used while IAM is being integrated into the
+gateway.
+
+Architectural context — roles, capabilities, workspace scoping,
+enforcement boundary — lives in [`iam.md`](iam.md) and
+[`capabilities.md`](capabilities.md).
+
+## Transport
+
+- **Request topic:** `request:tg/request/iam-request`
+- **Response topic:** `response:tg/response/iam-response`
+- **Pattern:** request/response, correlated by the `id` message
+  property, the same pattern used by `config-svc` and `flow-svc`.
+- **Caller:** the API gateway only. Under the enforcement-boundary
+  policy (see capabilities spec), the IAM service trusts the bus
+  and performs no per-request authentication or capability check
+  against the caller. The gateway has already evaluated capability
+  membership and workspace scoping before sending the request.
+
+## Dataclasses
+
+### `IamRequest`
+
+```python
+@dataclass
+class IamRequest:
+    # One of the operation strings below.
+    operation: str = ""
+
+    # Scope of this request.  Required on every workspace-scoped
+    # operation.  Omitted (or empty) for system-level ops
+    # (workspace CRUD, signing-key ops, bootstrap, resolve-api-key,
+    # login).
+    workspace: str = ""
+
+    # Acting user id, for audit.  Set by the gateway to the
+    # authenticated caller's id on user-initiated operations.
+    # Empty for internal-origin (bootstrap, reconcilers) and for
+    # resolve-api-key / login (no actor yet).
+    actor: str = ""
+
+    # --- identity selectors ---
+    user_id: str = ""
+    username: str = ""          # login; unique within a workspace
+    key_id: str = ""            # revoke-api-key, list-api-keys (own)
+    api_key: str = ""           # resolve-api-key (plaintext)
+
+    # --- credentials ---
+    password: str = ""          # login, change-password (current)
+    new_password: str = ""      # change-password
+
+    # --- user fields ---
+    user: UserInput | None = None       # create-user, update-user
+
+    # --- workspace fields ---
+    workspace_record: WorkspaceInput | None = None   # create-workspace, update-workspace
+
+    # --- api key fields ---
+    key: ApiKeyInput | None = None      # create-api-key
+```
+
+### `IamResponse`
+
+```python
+@dataclass
+class IamResponse:
+    # Populated on success of operations that return them.
+    user: UserRecord | None = None              # create-user, get-user, update-user
+    users: list[UserRecord] = field(default_factory=list)   # list-users
+    workspace: WorkspaceRecord | None = None    # create-workspace, get-workspace, update-workspace
+    workspaces: list[WorkspaceRecord] = field(default_factory=list)  # list-workspaces
+
+    # create-api-key returns the plaintext once.  Never populated
+    # on any other operation.
+    api_key_plaintext: str = ""
+    api_key: ApiKeyRecord | None = None          # create-api-key
+    api_keys: list[ApiKeyRecord] = field(default_factory=list)  # list-api-keys
+
+    # login, rotate-signing-key
+    jwt: str = ""
+    jwt_expires: str = ""        # ISO-8601 UTC
+
+    # get-signing-key-public
+    signing_key_public: str = ""  # PEM
+
+    # resolve-api-key returns who this key authenticates as.
+    resolved_user_id: str = ""
+    resolved_workspace: str = ""
+    resolved_roles: list[str] = field(default_factory=list)
+
+    # reset-password
+    temporary_password: str = ""  # returned once to the operator
+
+    # bootstrap: on first run, the initial admin's one-time API key
+    # is returned for the operator to capture.
+    bootstrap_admin_user_id: str = ""
+    bootstrap_admin_api_key: str = ""
+
+    # Present on any failed operation.
+    error: Error | None = None
+```
+
+### Value types
+
+```python
+@dataclass
+class UserInput:
+    username: str = ""
+    name: str = ""
+    email: str = ""
+    password: str = ""          # only on create-user; never on update-user
+    roles: list[str] = field(default_factory=list)
+    enabled: bool = True
+    must_change_password: bool = False
+
+@dataclass
+class UserRecord:
+    id: str = ""
+    workspace: str = ""
+    username: str = ""
+    name: str = ""
+    email: str = ""
+    roles: list[str] = field(default_factory=list)
+    enabled: bool = True
+    must_change_password: bool = False
+    created: str = ""           # ISO-8601 UTC
+    # Password hash is never included in any response.
+
+@dataclass
+class WorkspaceInput:
+    id: str = ""
+    name: str = ""
+    enabled: bool = True
+
+@dataclass
+class WorkspaceRecord:
+    id: str = ""
+    name: str = ""
+    enabled: bool = True
+    created: str = ""           # ISO-8601 UTC
+
+@dataclass
+class ApiKeyInput:
+    user_id: str = ""
+    name: str = ""              # operator-facing label, e.g. "laptop"
+    expires: str = ""           # optional ISO-8601 UTC; empty = no expiry
+
+@dataclass
+class ApiKeyRecord:
+    id: str = ""
+    user_id: str = ""
+    name: str = ""
+    prefix: str = ""            # first 4 chars of plaintext, for identification in lists
+    expires: str = ""           # empty = no expiry
+    created: str = ""
+    last_used: str = ""         # empty if never used
+    # key_hash is never included in any response.
+```
+
+## Operations
+
+| Operation | Request fields | Response fields | Notes |
+|---|---|---|---|
+| `login` | `username`, `password`, `workspace` (optional) | `jwt`, `jwt_expires` | If `workspace` omitted, IAM resolves to the user's assigned workspace. |
+| `resolve-api-key` | `api_key` (plaintext) | `resolved_user_id`, `resolved_workspace`, `resolved_roles` | Gateway-internal. Service returns `auth-failed` for unknown / expired / revoked keys. |
+| `change-password` | `user_id`, `password` (current), `new_password` | — | Self-service. IAM validates `password` against stored hash. |
+| `reset-password` | `user_id` | `temporary_password` | Admin-initiated. IAM generates a random password, sets `must_change_password=true` on the user, returns the plaintext once. |
+| `create-user` | `workspace`, `user` | `user` | Admin-only. `user.password` is hashed and stored; `user.roles` must be subset of known roles. |
+| `list-users` | `workspace` | `users` | |
+| `get-user` | `workspace`, `user_id` | `user` | |
+| `update-user` | `workspace`, `user_id`, `user` | `user` | `password` field on `user` is rejected; use `change-password` / `reset-password`. |
+| `disable-user` | `workspace`, `user_id` | — | Soft-delete; sets `enabled=false`. Revokes all the user's API keys. |
+| `create-workspace` | `workspace_record` | `workspace` | System-level. |
+| `list-workspaces` | — | `workspaces` | System-level. |
+| `get-workspace` | `workspace_record` (id only) | `workspace` | System-level. |
+| `update-workspace` | `workspace_record` | `workspace` | System-level. |
+| `disable-workspace` | `workspace_record` (id only) | — | System-level. Sets `enabled=false`; revokes all workspace API keys; disables all users in the workspace. |
+| `create-api-key` | `workspace`, `key` | `api_key_plaintext`, `api_key` | Plaintext returned **once**; only hash stored. `key.name` required. |
+| `list-api-keys` | `workspace`, `user_id` | `api_keys` | |
+| `revoke-api-key` | `workspace`, `key_id` | — | Deletes the key record. |
+| `get-signing-key-public` | — | `signing_key_public` | Gateway fetches this at startup. |
+| `rotate-signing-key` | — | — | System-level. Introduces a new signing key; old key continues to validate JWTs for a grace period (implementation-defined, minimum 1h). |
+| `bootstrap` | — | `bootstrap_admin_user_id`, `bootstrap_admin_api_key` | Live only in `bootstrap` mode while IAM tables are empty: creates the initial `default` workspace, an `admin` user, an initial API key, and an initial signing key; returns them once. Refused with `auth-failed` in `token` mode or once tables are populated (see Bootstrap modes). |
+
+## Error taxonomy
+
+All errors are carried in the `IamResponse.error` field. `error.type`
+is one of the values below; `error.message` is a human-readable
+string that is **not** surfaced verbatim to external callers (the
+gateway maps to `auth failure` / `access denied` per the IAM error
+policy).
+
+| `type` | When |
+|---|---|
+| `invalid-argument` | Malformed request (missing required field, unknown operation, invalid format). |
+| `not-found` | Named resource does not exist (`user_id`, `key_id`, workspace). |
+| `duplicate` | Create operation collides with an existing resource (username, workspace id, key name). |
+| `auth-failed` | `login` with wrong credentials; `resolve-api-key` with unknown / expired / revoked key; `change-password` with wrong current password. Single bucket to deny oracle attacks. |
+| `weak-password` | Password does not meet policy (length, complexity — policy defined at service level). |
+| `disabled` | Target user or workspace has `enabled=false`. |
+| `operation-not-permitted` | Non-admin attempting system-level operation, or workspace-scoped operation attempting to affect another workspace. |
+| `internal-error` | Unexpected IAM-side failure. Log and surface as 500 at the gateway. |
+
+The gateway is responsible for translating `auth-failed` and
+`operation-not-permitted` into the obfuscated external error
+response (`"auth failure"` / `"access denied"`); `invalid-argument`
+becomes a descriptive 400; `not-found` / `duplicate` /
+`weak-password` / `disabled` become descriptive 4xx but never leak
+IAM-internal detail.
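
One way the gateway-side translation described above might look. The two masked bodies and the 400 / 500 mappings come from this spec; the specific status codes and wording chosen for the "descriptive 4xx" group are assumptions, as is the helper name `to_http`.

```python
# Masked categories: fixed body, no diagnostic detail.
MASKED = {
    "auth-failed": (401, "auth failure"),
    "operation-not-permitted": (403, "access denied"),
}

# Descriptive categories. Gateway-authored text is used instead of
# IAM's error.message, which is never surfaced verbatim.
# Status codes other than 400 for invalid-argument are assumed.
DESCRIPTIVE = {
    "invalid-argument": (400, "invalid request"),
    "not-found": (404, "resource not found"),
    "duplicate": (409, "resource already exists"),
    "weak-password": (400, "password does not meet policy"),
    "disabled": (403, "user or workspace is disabled"),
}


def to_http(error_type):
    """Map an IamResponse error type to (status, JSON body)."""
    for table in (MASKED, DESCRIPTIVE):
        if error_type in table:
            status, text = table[error_type]
            return status, {"error": text}
    # internal-error and anything unrecognised: 500, no detail.
    return 500, {"error": "internal error"}
```

Treating unrecognised types as 500 keeps the mapping fail-closed: a new IAM error type cannot leak detail before the gateway learns about it.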
+
+## Credential storage
+
+- **Passwords** are stored using a slow KDF (PBKDF2 / bcrypt /
+  argon2id; the service picks, and the choice is documented as an
+  implementation detail). The `password_hash` column stores the
+  full KDF-encoded string (algorithm, cost, salt, hash). Not a
+  plain SHA-256.
+- **API keys** are stored as SHA-256 of the plaintext. API keys
+  are 128-bit random values (`tg_` + base64url); the entropy
+  makes a slow hash unnecessary. The hash serves as the primary
+  key on the `iam_api_keys` table, enabling O(1) lookup on
+  `resolve-api-key`.
+- **JWT signing key** is stored as an RSA or Ed25519 private key
+  (implementation choice) in a dedicated `iam_signing_keys` table
+  with a `kid`, `created`, and optional `retired` timestamp. At
+  most one active key; up to N retired keys are kept for a grace
+  period to validate previously-issued JWTs.
+
+Passwords, API-key plaintext, and signing-key private material are
+never returned in any response other than the explicit one-time
+responses above (`reset-password`, `create-api-key`, `bootstrap`).
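
A standard-library sketch of the two storage schemes. The PBKDF2-HMAC-SHA-256 choice and the 600k iteration count are taken from the commit description; the `$`-separated encoding, the decision to hash the key including its `tg_` prefix, and all function names are assumptions.

```python
import base64
import hashlib
import hmac
import secrets


def new_api_key():
    """Return (plaintext, sha256-hex). Only the hash would be stored."""
    raw = secrets.token_bytes(16)  # 128 bits of randomness
    token = base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    plaintext = "tg_" + token
    return plaintext, hashlib.sha256(plaintext.encode()).hexdigest()


def hash_password(password, iterations=600_000):
    """Encode as algorithm$cost$salt$hash (hypothetical layout)."""
    salt = secrets.token_bytes(16)  # per-user salt
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), salt, iterations
    )
    return f"pbkdf2-sha256${iterations}${salt.hex()}${digest.hex()}"


def verify_password(password, encoded):
    """Recompute the KDF with the stored salt and cost, then compare."""
    algorithm, iterations, salt_hex, digest_hex = encoded.split("$")
    if algorithm != "pbkdf2-sha256":
        return False
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), bytes.fromhex(salt_hex), int(iterations)
    )
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(candidate.hex(), digest_hex)
```

Storing the cost and salt inside the encoded string lets the iteration count be raised later without invalidating existing hashes, since each record carries the parameters needed to verify it.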
+
+## Bootstrap modes
+
+`iam-svc` requires a bootstrap mode to be chosen at startup. There is
+no default — an unset or invalid mode causes the service to refuse
+to start. The purpose is to force the operator to make an explicit
+security decision rather than rely on an implicit "safe" fallback.
+
+| Mode | Startup behaviour | `bootstrap` operation | Suitability |
+|---|---|---|---|
+| `token` | On first start with empty tables, auto-seeds the `default` workspace, admin user, admin API key (using the operator-provided `--bootstrap-token`), and an initial signing key. No-op on subsequent starts. | Refused — returns `auth-failed` / `"auth failure"` regardless of caller. | Production, any public-exposure deployment. |
+| `bootstrap` | No startup seeding. Tables remain empty until the `bootstrap` operation is invoked over the pub/sub bus (typically via `tg-bootstrap-iam`). | Live while tables are empty. Generates and returns the admin API key once. Refused (`auth-failed`) once tables are populated. | Dev / compose up / CI. **Not safe under public exposure** — any caller reaching the gateway's `/api/v1/iam` forwarder before the operator can cause a token to be issued to them. Operators choosing this mode accept that risk. |
+
+### Error masking
+
+In both modes, any refused invocation of the `bootstrap` operation
+returns the same error (`auth-failed` / `"auth failure"`). A caller
+cannot distinguish:
+
+- "service is in token mode"
+- "service is in bootstrap mode but already bootstrapped"
+- "operation forbidden"
+
+This matches the general IAM error-policy stance (see `iam.md`) and
+prevents externally enumerating IAM's state.
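
The masking rule can be illustrated with a small decision function. This is a sketch under stated assumptions: `handle_bootstrap` and the dictionary shapes are hypothetical, and the actual seeding work is elided.

```python
AUTH_FAILED = {"type": "auth-failed", "message": "auth failure"}


def handle_bootstrap(mode, tables_empty):
    """Decide whether a bootstrap request proceeds or is refused."""
    if mode == "bootstrap" and tables_empty:
        # The real service would seed the workspace, admin user,
        # API key and signing key here, returning the key once.
        return {"ok": True}
    # Token mode, already bootstrapped, or anything else:
    # every refusal produces the identical masked error.
    return {"ok": False, "error": dict(AUTH_FAILED)}
```

Because every refusal path returns the same value, a caller comparing responses learns nothing about which condition applied.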
+
+### Bootstrap-token lifecycle
+
+The bootstrap token — whether operator-supplied (`token` mode) or
+service-generated (`bootstrap` mode) — is a one-time credential. It
+is stored as admin's single API key, tagged `name="bootstrap"`. The
+operator's first admin action after bootstrap should be:
+
+1. Create a durable admin user and API key (or issue a durable API
+   key to the bootstrap admin).
+2. Revoke the bootstrap key via `revoke-api-key`.
+3. Remove the bootstrap token from any deployment configuration.
+
+The `name="bootstrap"` marker makes bootstrap keys easy to detect in
+tooling (e.g. a `tg-list-api-keys` filter).
+
+## HTTP forwarding (initial integration)
+
+For the initial gateway integration — before the IAM service is
+wired into the authentication middleware — the gateway exposes a
+single forwarding endpoint:
+
+```
+POST /api/v1/iam
+```
+
+- Request body is a JSON encoding of `IamRequest`.
+- Response body is a JSON encoding of `IamResponse`.
+- The gateway's existing authentication (`GATEWAY_SECRET` bearer)
+  gates access to this endpoint so the IAM protocol can be
+  exercised end-to-end in tests without touching the live auth
+  path.
+- This endpoint is **not** the final shape. Once the middleware is
+  in place, per-operation REST endpoints replace it (for example
+  `POST /api/v1/auth/login`, `POST /api/v1/users`, `DELETE
+  /api/v1/api-keys/{id}`), and this generic forwarder is removed.
+
+The endpoint performs only message marshalling: it does not read
+or rewrite fields in the request, and it applies no capability
+check. All authorisation for user / workspace / key management
+lands in the subsequent middleware work.
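
A rough sketch of building a request body for the forwarding endpoint. The dataclass is trimmed to three fields from `IamRequest`; the JSON field naming (underscored, matching the dataclass) and the choice to drop empty fields are assumptions, since this spec does not pin down the JSON encoding.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class IamRequest:  # trimmed to the fields used in this example
    operation: str = ""
    workspace: str = ""
    user_id: str = ""


def encode_request(req):
    """Serialise a request, dropping fields left at their empty default."""
    body = {k: v for k, v in asdict(req).items() if v not in ("", None)}
    return json.dumps(body, sort_keys=True)


body = encode_request(
    IamRequest(operation="get-user", workspace="default", user_id="u-123")
)
# body == '{"operation": "get-user", "user_id": "u-123", "workspace": "default"}'
```

The resulting string would be sent as the body of `POST /api/v1/iam`, with whatever credential the gateway requires for that endpoint.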
+
+## Non-goals for this spec
+
+- REST endpoint shape for the final gateway surface — covered in
+  Phase 2 of the IAM implementation plan, not here.
+- OIDC / SAML external IdP protocol — out of scope for open source.
+- Key-signing algorithm choice, password KDF choice, JWT claim
+  layout — implementation details captured in code + ADRs, not
+  locked in the protocol spec.
+
+## References
+
+- [Identity and Access Management Specification](iam.md)
+- [Capability Vocabulary Specification](capabilities.md)
--- a/docs/tech-specs/iam.md
+++ b/docs/tech-specs/iam.md
@@ -423,6 +423,37 @@ resolve API keys and to handle login requests. User management
 operations (create user, revoke key, etc.) also go through the IAM
 service.

+### Error policy
+
+External error responses carry **no diagnostic detail** for
+authentication or access-control failures. The goal is to give an
+attacker probing the endpoint no signal about which condition they
+tripped.
+
+| Category | HTTP | Body | WebSocket frame |
+|----------|------|------|-----------------|
+| Authentication failure | `401 Unauthorized` | `{"error": "auth failure"}` | `{"type": "auth-failed", "error": "auth failure"}` |
+| Access control failure | `403 Forbidden` | `{"error": "access denied"}` | `{"error": "access denied"}` (endpoint-specific frame type) |
+
+"Authentication failure" covers missing credential, malformed
+credential, invalid signature, expired token, revoked API key, and
+unknown API key — all indistinguishable to the caller.
+
+"Access control failure" covers role insufficient, workspace
+mismatch, user disabled, and workspace disabled — all
+indistinguishable to the caller.
+
+**Server-side logging is richer.** The audit log records the specific
+reason (`"workspace-mismatch: user alice assigned 'acme', requested
+'beta'"`, `"role-insufficient: admin required, user has writer"`,
+etc.) for operators and post-incident forensics. These messages never
+appear in responses.
+
+Other error classes (bad request, internal error) remain descriptive
+because they do not reveal anything about the auth or access-control
+surface — e.g. `"missing required field 'workspace'"` or
+`"invalid JSON"` is fine.
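
A small sketch of the split between what is logged and what is returned. The status codes, bodies, and WebSocket frame come from the table above; the logger name, the category keys, and the `deny` helper are hypothetical.

```python
import logging

audit_log = logging.getLogger("audit")  # hypothetical logger name

RESPONSES = {
    "authentication": (401, {"error": "auth failure"}),
    "access-control": (403, {"error": "access denied"}),
}


def deny(category, reason):
    """Log the specific reason server-side; return only the fixed body."""
    audit_log.warning("%s failure: %s", category, reason)
    status, body = RESPONSES[category]
    return status, dict(body)  # copy so callers cannot mutate the table


def auth_failed_frame():
    """WebSocket equivalent of the 401 response."""
    return {"type": "auth-failed", "error": "auth failure"}
```

For example, `deny("access-control", "workspace-mismatch: user alice assigned 'acme', requested 'beta'")` records the full reason in the audit log while the caller receives only `403` and `{"error": "access denied"}`.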
+
 ### Gateway changes

 The current `Authenticator` class is replaced with a thin authentication
@@ -713,6 +744,16 @@ These are not implemented but the architecture does not preclude them:
 - **Multi-workspace access.** Users could be granted access to
  additional workspaces beyond their primary assignment. The workspace
  validation step checks a grant list instead of a single assignment.
+- **Workspace resolver.** Workspace resolution on each authenticated
+  request — "given this user and this requested workspace, which
+  workspace (if any) may the request operate on?" — is encapsulated
+  in a single pluggable resolver. The open-source edition ships a
+  resolver that permits only the user's single assigned workspace;
+  enterprise editions that implement multi-workspace access swap in a
+  resolver that consults a permitted set. The wire protocol (the
+  optional `workspace` field on the authenticated request) is
+  identical in both editions, so clients written against one edition
+  work unchanged against the other.
 - **Rules-based access control.** A separate access control service
  could evaluate fine-grained policies (per-collection permissions,
  operation-level restrictions, time-based access). The gateway