mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-04-30 19:06:21 +02:00

iam: self-service ops, optional workspace filters, Mux service routing (#855 )

Three threads, all reinforcing the contract's system-level vs.
workspace-association distinction.

WS Mux service routing
- tg-show-flows (and any workspace-level service over the WS) was
  failing with "unknown service" because the post-refactor Mux
  unconditionally looked up flow-service:<kind>.  Now branches on
  the envelope's flow field: with flow → flow-service:<kind>;
  without flow → <kind>:<op> from the inner body; with bare op
  lookup for service=iam.  Resource and parameters come from the
  matched op's own extractors — same path the HTTP endpoints take.

Optional workspace on system-level user/key ops
- list-users returns the deployment-wide list when no workspace is
  supplied, filters when one is.  get-user, update-user,
  disable-user, enable-user, delete-user, reset-password,
  create-api-key, list-api-keys, revoke-api-key all treat workspace
  as an optional integrity check rather than a required argument.
- create-user keeps workspace required — there it's the new user's
  home-workspace binding, a parameter rather than an address.
- API keys reclassified as SYSTEM-level resources.  By the same
  reasoning that makes users system-level, an API key is a
  credential record on a deployment-wide registry; the workspace it
  authenticates to is a property, not a containment.

Self-service surface
- whoami: returns the caller's own user record.  AUTHENTICATED-only;
  no users:read capability required.  Foundation for UI affordances
  that depend on the caller's permissions.
- bootstrap-status: POST /api/v1/auth/bootstrap-status, PUBLIC,
  side-effect-free.  Returns {bootstrap_available: bool} so a
  first-run UI can decide whether to render setup without consuming
  the bootstrap op.
- Gateway now injects actor=identity.handle on every authenticated
  forward to iam-svc (IamEndpoint and WS Mux iam path), overwriting
  any caller-supplied value.  Underpins whoami, audit logging, and
  future regime-side decisions that need actor identity.
- tg-whoami and tg-update-user CLIs.

Spec polish
- iam-contract.md: actor-injection rule documented; whoami /
  bootstrap-status added to operations list; permission-scope
  framing tightened (workspace scope is a property of the grant,
  not the user or role).
- iam.md: self-service section; gateway flow gains the actor-
  injection step; role section reframed so iam-svc constraints
  don't leak into contract-level prose.
- iam-protocol.md: ops table updated for whoami, bootstrap-status,
  optional-workspace pattern; bootstrap_available added to the
  IamResponse listing.

2026-04-28 22:13:12 +01:00

18 KiB

Raw Blame History

layout	title	parent
default	IAM Contract Technical Specification	Tech Specs

IAM Contract Technical Specification

Overview

The IAM contract is the abstraction between the API gateway and any identity / access management regime that fronts it. The gateway treats IAM as a black box behind two operations — authenticate and authorise — plus a small surface of management operations. No regime-specific concept (roles, scopes, groups, claims, policy languages) is visible to the gateway, and no gateway-specific concept (capability vocabulary, request anatomy) is visible to backend services.

The TrustGraph open-source distribution ships one IAM regime — a role-based implementation defined in iam-protocol.md — that is one implementation of this contract. Enterprise editions can replace it with a different regime (OIDC / SSO, ABAC, ReBAC, external policy engine) without changing the gateway, the wire protocol, or the backends.

Motivation

Authorisation models vary by deployment. A small team might be happy with three predefined roles; an enterprise might need group- mapping from an upstream IdP, attribute-based policies, or relationship-based access control. Hard-wiring any one of those into the gateway forces every other regime to either compromise its model or be re-implemented.

A narrow contract — "authenticate this credential" and "may this identity perform this operation on this resource" — captures what the gateway actually needs to know without committing to a policy shape. The IAM regime owns the policy decision; the gateway is a generic enforcement point.

Operations

`authenticate`

authenticate(credential: bytes) → Identity | AuthFailure

Validates a credential the client presented. The gateway treats the credential as opaque bytes — for the OSS regime today that's either an API key plaintext or a JWT, but the gateway does not parse them; the IAM regime decides.

On success, returns an Identity. On any failure the IAM regime returns the same opaque AuthFailure — never a description of which condition failed. This is the spec's masked-error rule: an attacker probing the endpoint cannot distinguish "no such key", "expired", "wrong signature", "revoked", "user disabled", etc.

`authorise`

authorise(identity: Identity,
          capability: str,
          resource: Resource,
          parameters: dict)
    → Decision

Asks whether the identity is permitted to perform the named capability on the named resource, given the operation's parameters. Returns allow or deny. identity is whatever authenticate returned for this caller; the gateway never decomposes it.

The four arguments separate concerns:

identity — who is asking.
capability — what permission they are exercising (e.g. users:write, graph:read). Permission, not structure.
resource — what is being operated on, as a structured identifier. See The Resource model below.
parameters — operation-specific data that the regime may need to consider beyond the resource identifier. Used when a decision depends on attributes the request supplies — e.g. creating a user with workspace association W: the resource is the system-level user registry, and W is a parameter the regime checks against the caller's permissions for users:write.

Different regimes use the four arguments differently — one regime might evaluate role bundles whose grants carry workspace scope; another might consult upstream IdP group memberships; an ABAC regime evaluates a policy with all four as inputs. The contract is unchanged.

`authorise_many`

authorise_many(identity: Identity,
               checks: list[(str, Resource, dict)])
    → list[Decision]

Bulk variant of authorise. Same semantics, one round-trip for many decisions. Used when an operation fans out to multiple resources (e.g. an agent that touches several workspaces) and a single permission check isn't sufficient.

authorise_many is not just a performance optimisation; it pins the contract for fan-out operations early, before clients (or internal callers) build patterns that assume one-permission-check- per-request. Regimes implement it as a loop over authorise unless they have a more efficient path.

Management operations

Beyond the request-time authenticate / authorise, the contract also covers identity-lifecycle and credential-lifecycle operations that are invoked by administrative requests rather than by the authentication path. These are regime-specific in detail (an SSO regime that delegates user management to the IdP may not implement most of them) but the operation set the gateway can forward is:

User management: create-user, list-users, get-user, update-user, disable-user, enable-user, delete-user
Credential management: create-api-key, list-api-keys, revoke-api-key, change-password, reset-password
Workspace management: create-workspace, list-workspaces, get-workspace, update-workspace, disable-workspace
Session management: login, whoami
Key management: get-signing-key-public, rotate-signing-key
Bootstrap: bootstrap, bootstrap-status

whoami is the self-read counterpart to get-user: any authenticated caller can read their own identity record without holding a user-management capability. It is the gating-free probe a UI uses to render affordances appropriate to the caller's role.

bootstrap-status is a side-effect-free probe of whether an unconsumed bootstrap call would currently succeed. It exists so a first-run UI can decide whether to render setup without invoking the consuming bootstrap op. Public — no authentication.

A regime that does not support one of these (e.g. an SSO regime where users are managed in the IdP) returns a defined "not supported" error; the gateway surfaces it as a 501.

Actor injection

For any management operation forwarded by the gateway after authentication, the gateway injects the authenticated caller's handle as an actor field on the request. Regimes use actor to identify who is making the request — distinct from the operation's target (which lives in user_id / key_id / workspace_record / etc.) — for purposes such as:

Self-service operations (whoami, change-password) that resolve "the caller" without taking a target argument.
Audit logging, where the actor is recorded against the change.
Decisions that depend on the resolved resource state. The gateway authorises against the parameters on the request, but it cannot know the resolved resource's actual properties (e.g. the workspace association of a target user) before the regime loads it. When that matters, the regime can re-decide using the actor's permissions and the resolved record — closing a class of cases the gateway-side check can't see.

Caller-supplied actor values on the request body are overwritten by the gateway — the gateway is the only authority for actor identity, and a regime that consults actor can rely on it being authentic.

The `Identity` surface

Identity is mostly opaque. The gateway holds the value as a token to quote back when calling authorise, never decomposing it. But there are a few gateway-side concerns that need a small surface:

Field	Purpose
`handle`	Opaque reference passed back to `authorise`. Regime-defined; gateway treats as a string.
`workspace`	The workspace this credential authenticates to. Used by the gateway only as a default-fill-in for operations that omit a workspace. Never used as policy input — when authorisation needs to know which workspace the operation acts on, the operation places it in the resource address (or a parameter), and the regime decides.
`principal_id`	Stable identifier the gateway logs for audit (a user id, a sub claim, a service account id). Never used for authorisation — that's `authorise`'s job.
`source`	How the credential was presented (`api-key`, `jwt`, …). Non-policy; useful for logs and metrics only.

Anything else — roles, claims, group memberships, policy attributes — stays inside the regime and is reachable only via authorise.

The `Resource` model

A Resource is a structured value identifying what is being operated on. Resources live at one of three levels in TrustGraph, based on where the resource exists in the deployment:

Resource levels

Level	What lives there	Resource shape
System	The user registry, the workspace registry, the signing key, the audit log — anything that exists once per deployment.	`{}`
Workspace	A workspace's config, flow definitions, library (documents), knowledge cores, collections — things that exist within a workspace.	`{workspace: "..."}`
Flow	A flow's knowledge graph, agent state, LLM context, embedding state, MCP context — things that exist within a flow within a workspace.	`{workspace: "...", flow: "..."}`

Note carefully:

Users are a system-level resource. A user record exists at the deployment level; the fact that a user has a workspace association (one in OSS, possibly many in other regimes) is a property of the user record, not a containment. Operations on the user registry have resource = {}; the workspace association appears as a parameter, not as a resource address component.
Workspaces themselves are a system-level resource. The workspace registry exists at the deployment level. create- workspace and list-workspaces are system-level operations; the workspace identifier in their bodies is a parameter, not an address.
A workspace's contents are workspace-level resources. A workspace's config, flows, library, etc. live within a workspace. Their resource address is {workspace: ...}.
A flow's contents are flow-level resources. Knowledge graphs, agents, etc. live within a flow. Their resource address is {workspace: ..., flow: ...}.

Component vocabulary

Component	Type	Meaning	Used by
`workspace`	string	Identifier of the workspace whose contents are being operated on	workspace-level and flow-level resource addresses
`flow`	string	Identifier of a flow within a workspace; always paired with `workspace`	flow-level resource addresses
`collection`	string	Reserved for finer-grained scoping within a workspace	future / enterprise
`document`	string	Reserved for per-document scoping	future / enterprise

A Resource is a partial mapping of these components to values. The level of the resource (system / workspace / flow) determines which components must be present. An empty {} is the system-level resource.

Workspace as parameter vs. address

Workspace plays two distinct roles in operations and shows up in two distinct places:

As a resource address component — workspace identifies the thing being operated on. Lives in resource.workspace. Example: config:read reads the config of workspace W.
As an operation parameter — workspace is data the operation acts on or filters by, while the resource itself is system-level. Lives in parameters.workspace. Example: users:write creates a user with workspace association W; the resource is the user registry (system), and W is a parameter.

These are not interchangeable. The IAM regime considers each role separately; the OSS role table, for instance, applies workspace- scope to the address component when checking workspace-level operations, and to a parameter when checking "create-user-with-workspace-W". Both end up enforcing the admin's scope, but through different code paths.

Extension rules

The vocabulary is closed but extensible. Adding a new component:

The component is added to the vocabulary in this spec, with a defined name, type, and meaning.
Existing IAM regimes ignore unknown components (forward compatibility — adding a new component does not break older regimes that don't understand it).
Older gateways that don't populate a new component leave it unset; regimes that need it for a decision treat "unset" as "absent" and decide accordingly (typically: cannot grant permission scoped to a component the gateway didn't supply).

A regime that wants stricter behaviour (e.g. fail-closed on unknown components rather than ignoring them) declares so as part of its own configuration; the contract default is "ignore unknown".

Operation registry (gateway-side)

Mapping a request onto (capability, resource, parameters) is service-specific — it cannot be inferred from the capability alone. The gateway maintains an operation registry that declares, per operation:

The required capability.
The resource level (system / workspace / flow) — determines the shape of the resource identifier.
How to extract the resource address components (workspace, flow) from the request — from URL path, WebSocket envelope, or body.
Which body fields are operation parameters (and which of those the IAM regime should see in the parameters argument).

This registry is part of the gateway's endpoint declarations, not part of the IAM contract. The contract specifies what arguments authorise receives; how the gateway populates them is its own concern.

In the OSS gateway, registry keys follow these conventions:

Pattern	Used by	Resource level
bare op name (`create-user`, `list-users`, `login`, …)	`/api/v1/iam` and the auth surface	system / workspace, per op
`<kind>:<op>` (`config:get`, `flow:list-blueprints`, `librarian:add-document`, …)	`/api/v1/{kind}` (workspace-scoped global services)	workspace
`flow-service:<kind>` (`flow-service:agent`, `flow-service:graph-rag`, …)	`/api/v1/flow/{flow}/service/{kind}` and the WS Mux	flow
`flow-import:<kind>` / `flow-export:<kind>`	`/api/v1/flow/{flow}/{import,export}/{kind}` streaming sockets	flow

Keys are an OSS-gateway implementation detail — the contract does not constrain naming. The conventions above exist so the registry key is uniquely derivable from the request path and (where applicable) body without ambiguity.

Caching

Both authenticate and authorise results are cached at the gateway, on different policies:

authenticate — cached by a hash of the credential. The OSS gateway uses a fixed short TTL (currently 60 s) so that revoked API keys and disabled users stop working within the TTL window without any push mechanism. Regimes that want a different behaviour can return an expires hint with the identity; the gateway honours the smaller of expires and its own ceiling.
authorise — cached by a hash of (handle, capability, resource, parameters). The regime returns a suggested TTL with the decision; the gateway clamps it above by a deployment-set ceiling (currently 60 s). Both allow and deny decisions are cached; denies briefly, to avoid hammering the regime with repeated rejected attempts.

The TTL ceiling caps the revocation latency window — a role revoked at the regime takes effect at the gateway no later than the ceiling. Operators that need stricter revocation can lower the ceiling.

Failure modes

Condition	Behaviour
`authenticate` returns AuthFailure	Gateway responds 401 with the masked `auth failure` body.
`authorise` returns deny	Gateway responds 403 with the masked `access denied` body.
IAM regime unreachable	Gateway responds 401 / 503 (deployment-defined). No fail-open.
`authorise_many` partial deny	Gateway treats the request as denied; the operation is rejected. Partial-success semantics are not part of the contract.
Regime returns "not supported" for a management operation	Gateway responds 501.

There is no fallback or "soft" decision path. An IAM regime that is unavailable, slow, or returning errors causes requests to fail closed.

Implementations

Open-source role-based regime

Defined in iam-protocol.md. Implements the contract via:

A pub/sub request/response service (iam-svc) reached only by the gateway over the message bus.
Credentials are API keys (opaque) or JWTs (Ed25519, locally validated by the gateway against the regime's published public key).
authorise reduces to a lookup against the role bundles in capabilities.md, with each grant's workspace scope checked against the operation's workspace component.
Identity, user, and workspace records live in Cassandra.

The OSS regime is deliberately simple — three roles, a single workspace association per user (a regime data-model decision, not a contract assertion), no policy language. Other regimes can grant the same user different permissions in different workspaces without changing anything outside the regime.

Future regimes

The contract is shaped to admit, without code change in the gateway:

OIDC / SSO — authenticate validates an OIDC ID token via the IdP's JWKS; Identity.handle carries the verified subject and group claims; authorise evaluates against group-to- capability mappings configured at the regime.
ABAC / Policy engine — authorise calls out to a policy engine (Rego, Cedar, custom DSL) with the identity's attributes and the resource as the policy input.
ReBAC (Zanzibar-style) — authorise translates (identity, capability, resource) into a relationship-tuple lookup against a tuple store.
Hybrid — multiple regimes composed: e.g. authenticate via SSO, authorise via local policy.

None of these require gateway changes. The contract surface is the same; the regime is what differs.

References

Identity and Access Management Specification — overall design and the gateway-side framing.
IAM Service Protocol Specification — the OSS regime's wire-level protocol.
Capability Vocabulary Specification — the capability strings the gateway uses as authorise input.

18 KiB Raw Blame History