trustgraph

mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-04-28 09:56:22 +02:00

Author	SHA1	Message	Date
cybermaggedon	67b2fc448f	feat: IAM service, gateway auth middleware, capability model, and CLIs (#849 ) Replaces the legacy GATEWAY_SECRET shared-token gate with an IAM-backed identity and authorisation model. The gateway no longer has an "allow-all" or "no auth" mode; every request is authenticated via the IAM service, authorised against a capability model that encodes both the operation and the workspace it targets, and rejected with a deliberately-uninformative 401 / 403 on any failure. IAM service (trustgraph-flow/trustgraph/iam, trustgraph-base/schema/iam) ----------------------------------------------------------------------- * New backend service (iam-svc) owning users, workspaces, API keys, passwords and JWT signing keys in Cassandra. Reached over the standard pub/sub request/response pattern; gateway is the only caller. * Operations: bootstrap, resolve-api-key, login, get-signing-key-public, rotate-signing-key, create/list/get/update/disable/delete/enable-user, change-password, reset-password, create/list/get/update/disable- workspace, create/list/revoke-api-key. * Ed25519 JWT signing (alg=EdDSA). Key rotation writes a new kid and retires the previous one; validation is grace-period friendly. * Passwords: PBKDF2-HMAC-SHA-256, 600k iterations, per-user salt. * API keys: 128-bit random, SHA-256 hashed. Plaintext returned once. * Bootstrap is explicit: --bootstrap-mode {token,bootstrap} is a required startup argument with no permissive default. Masked "auth failure" errors hide whether a refused bootstrap request was due to mode, state, or authorisation. Gateway authentication (trustgraph-flow/trustgraph/gateway/auth.py) ------------------------------------------------------------------- * IamAuth replaces the legacy Authenticator. Distinguishes JWTs (three-segment dotted) from API keys by shape; verifies JWTs locally using the cached IAM public key; resolves API keys via IAM with a short-TTL hash-keyed cache. Every failure path surfaces the same 401 body ("auth failure") so callers cannot enumerate credential state. * Public key is fetched at gateway startup with a bounded retry loop; traffic does not begin flowing until auth has started. Capability model (trustgraph-flow/trustgraph/gateway/capabilities.py) --------------------------------------------------------------------- * Roles have two dimensions: a capability set and a workspace scope. OSS ships reader / writer / admin; the first two are workspace- assigned, admin is cross-workspace (""). No "cross-workspace" pseudo-capability — workspace permission is a property of the role. check(identity, capability, target_workspace=None) is the single authorisation test: some role must grant the capability and be active in the target workspace. * enforce_workspace validates a request-body workspace against the caller's role scopes and injects the resolved value. Cross- workspace admin is permitted by role scope, not by a bypass. * Gateway endpoints declare a required capability explicitly — no permissive default. Construction fails fast if omitted. Enterprise editions can replace the role table without changing the wire protocol. WebSocket first-frame auth (dispatch/mux.py, endpoint/socket.py) ---------------------------------------------------------------- * /api/v1/socket handshake unconditionally accepts; authentication runs on the first WebSocket frame ({"type":"auth","token":"..."}) with {"type":"auth-ok","workspace":"..."} / {"type":"auth-failed"}. The socket stays open on failure so the client can re-authenticate — browsers treat a handshake-time 401 as terminal, breaking reconnection. * Mux.receive rejects every non-auth frame before auth succeeds, enforces the caller's workspace (envelope + inner payload) using the role-scope resolver, and supports mid-session re-auth. * Flow import/export streaming endpoints keep the legacy ?token= handshake (URL-scoped short-lived transfers; no re-auth need). Auth surface ------------ * POST /api/v1/auth/login — public, returns a JWT. * POST /api/v1/auth/bootstrap — public; forwards to IAM's bootstrap op which itself enforces mode + tables-empty. * POST /api/v1/auth/change-password — any authenticated user. * POST /api/v1/iam — admin-only generic forwarder for the rest of the IAM API (per-op REST endpoints to follow in a later change). Removed / breaking ------------------ * GATEWAY_SECRET / --api-token / default_api_token and the legacy Authenticator.permitted contract. The gateway cannot run without IAM. * ?token= on /api/v1/socket. * DispatcherManager and Mux both raise on auth=None — no silent downgrade path. CLI tools (trustgraph-cli) -------------------------- tg-bootstrap-iam, tg-login, tg-create-user, tg-list-users, tg-disable-user, tg-enable-user, tg-delete-user, tg-change-password, tg-reset-password, tg-create-api-key, tg-list-api-keys, tg-revoke-api-key, tg-create-workspace, tg-list-workspaces. Passwords read via getpass; tokens / one-time secrets written to stdout with operator context on stderr so shell composition works cleanly. AsyncSocketClient / SocketClient updated to the first-frame auth protocol. Specifications -------------- * docs/tech-specs/iam.md updated with the error policy, workspace resolver extension point, and OSS role-scope model. * docs/tech-specs/iam-protocol.md (new) — transport, dataclasses, operation table, error taxonomy, bootstrap modes. * docs/tech-specs/capabilities.md (new) — capability vocabulary, OSS role bundles, agent-as-composition note, enforcement-boundary policy, enterprise extensibility. Tests ----- * test_auth.py (rewritten) — IamAuth + JWT round-trip with real Ed25519 keypairs + API-key cache behaviour. * test_capabilities.py (new) — role table sanity, check across role x workspace combinations, enforce_workspace paths, unknown-cap / unknown-role fail-closed. * Every endpoint test construction now names its capability explicitly (no permissive defaults relied upon). New tests pin the fail-closed invariants: DispatcherManager / Mux refuse auth=None; i18n path-traversal defense is exercised. * test_socket_graceful_shutdown rewritten against IamAuth.	2026-04-24 17:29:10 +01:00
cybermaggedon	d35473f7f7	feat: workspace-based multi-tenancy, replacing user as tenancy axis (#840 ) Introduces `workspace` as the isolation boundary for config, flows, library, and knowledge data. Removes `user` as a schema-level field throughout the code, API specs, and tests; workspace provides the same separation more cleanly at the trusted flow.workspace layer rather than through client-supplied message fields. Design ------ - IAM tech spec (docs/tech-specs/iam.md) documents current state, proposed auth/access model, and migration direction. - Data ownership model (docs/tech-specs/data-ownership-model.md) captures the workspace/collection/flow hierarchy. Schema + messaging ------------------ - Drop `user` field from AgentRequest/Step, GraphRagQuery, DocumentRagQuery, Triples/Graph/Document/Row EmbeddingsRequest, Sparql/Rows/Structured QueryRequest, ToolServiceRequest. - Keep collection/workspace routing via flow.workspace at the service layer. - Translators updated to not serialise/deserialise user. API specs --------- - OpenAPI schemas and path examples cleaned of user fields. - Websocket async-api messages updated. - Removed the unused parameters/User.yaml. Services + base --------------- - Librarian, collection manager, knowledge, config: all operations scoped by workspace. Config client API takes workspace as first positional arg. - `flow.workspace` set at flow start time by the infrastructure; no longer pass-through from clients. - Tool service drops user-personalisation passthrough. CLI + SDK --------- - tg-init-workspace and workspace-aware import/export. - All tg-* commands drop user args; accept --workspace. - Python API/SDK (flow, socket_client, async_, explainability, library) drop user kwargs from every method signature. MCP server ---------- - All tool endpoints drop user parameters; socket_manager no longer keyed per user. Flow service ------------ - Closure-based topic cleanup on flow stop: only delete topics whose blueprint template was parameterised AND no remaining live flow (across all workspaces) still resolves to that topic. Three scopes fall out naturally from template analysis: {id} -> per-flow, deleted on stop * {blueprint} -> per-blueprint, kept while any flow of the same blueprint exists * {workspace} -> per-workspace, kept while any flow in the workspace exists * literal -> global, never deleted (e.g. tg.request.librarian) Fixes a bug where stopping a flow silently destroyed the global librarian exchange, wedging all library operations until manual restart. RabbitMQ backend ---------------- - heartbeat=60, blocked_connection_timeout=300. Catches silently dead connections (broker restart, orphaned channels, network partitions) within ~2 heartbeat windows, so the consumer reconnects and re-binds its queue rather than sitting forever on a zombie connection. Tests ----- - Full test refresh: unit, integration, contract, provenance. - Dropped user-field assertions and constructor kwargs across ~100 test files. - Renamed user-collection isolation tests to workspace-collection.	2026-04-21 23:23:01 +01:00
cybermaggedon	9f84891fcc	Flow service lifecycle management (#822 ) feat: separate flow service from config service with explicit queue lifecycle management The flow service is now an independent service that owns the lifecycle of flow and blueprint queues. System services own their own queues. Consumers never create queues. Flow service separation: - New service at trustgraph-flow/trustgraph/flow/service/ - Uses async ConfigClient (RequestResponse pattern) to talk to config service - Config service stripped of all flow handling Queue lifecycle management: - PubSubBackend protocol gains create_queue, delete_queue, queue_exists, ensure_queue — all async - RabbitMQ: implements via pika with asyncio.to_thread internally - Pulsar: stubs for future admin REST API implementation - Consumer _connect() no longer creates queues (passive=True for named queues) - System services call ensure_queue on startup - Flow service creates queues on flow start, deletes on flow stop - Flow service ensures queues for pre-existing flows on startup Two-phase flow stop: - Phase 1: set flow status to "stopping", delete processor config entries - Phase 2: retry queue deletion, then delete flow record Config restructure: - active-flow config replaced with processor:{name} types - Each processor has its own config type, each flow variant is a key - Flow start/stop use batch put/delete — single config push per operation - FlowProcessor subscribes to its own type only Blueprint format: - Processor entries split into topics and parameters dicts - Flow interfaces use {"flow": "topic"} instead of bare strings - Specs (ConsumerSpec, ProducerSpec, etc.) read from definition["topics"] Tests updated	2026-04-16 17:19:39 +01:00
V.Sreeram	d4723566cb	fix: prevent duplicate dispatcher creation race condition in invoke_global_service (#715 ) * fix: prevent duplicate dispatcher creation race condition in invoke_global_service Concurrent coroutines could all pass the `if key in self.dispatchers` check before any of them wrote the result back, because `await dispatcher.start()` yields to the event loop. This caused multiple Pulsar consumers to be created on the same shared subscription, distributing responses round-robin and dropping ~2/3 of them — manifesting as a permanent spinner in the Workbench UI. Apply a double-checked asyncio.Lock in both `invoke_global_service` and `invoke_flow_service` so only one dispatcher is ever created per service key. * test: add concurrent-dispatch tests for race condition fix Add asyncio.gather-based tests that verify invoke_global_service and invoke_flow_service create exactly one dispatcher under concurrent calls, preventing the duplicate Pulsar consumer bug.	2026-04-06 11:14:32 +01:00
cybermaggedon	5304f96fe6	Fix tests (#593 ) * Fix unit/integration/contract tests which were broken by messaging fabric work	2025-12-19 08:53:21 +00:00
cybermaggedon	ba95fa226b	Gateway queue overrides (#584 )	2025-12-06 11:01:20 +00:00
cybermaggedon	2f7fddd206	Test suite executed from CI pipeline (#433 ) * Test strategy & test cases * Unit tests * Integration tests	2025-07-14 14:57:44 +01:00

7 commits