feat: IAM service, gateway auth middleware, capability model, and CLIs (#849)

Replaces the legacy GATEWAY_SECRET shared-token gate with an IAM-backed
identity and authorisation model.  The gateway no longer has an
"allow-all" or "no auth" mode; every request is authenticated via the
IAM service, authorised against a capability model that encodes both
the operation and the workspace it targets, and rejected with a
deliberately-uninformative 401 / 403 on any failure.

IAM service (trustgraph-flow/trustgraph/iam, trustgraph-base/schema/iam)
-----------------------------------------------------------------------
* New backend service (iam-svc) owning users, workspaces, API keys,
  passwords and JWT signing keys in Cassandra.  Reached over the
  standard pub/sub request/response pattern; gateway is the only
  caller.
* Operations: bootstrap, resolve-api-key, login, get-signing-key-public,
  rotate-signing-key, create/list/get/update/disable/delete/enable-user,
  change-password, reset-password, create/list/get/update/disable-
  workspace, create/list/revoke-api-key.
* Ed25519 JWT signing (alg=EdDSA).  Key rotation writes a new kid and
  retires the previous one; validation is grace-period friendly.
* Passwords: PBKDF2-HMAC-SHA-256, 600k iterations, per-user salt.
* API keys: 128-bit random, SHA-256 hashed.  Plaintext returned once.
* Bootstrap is explicit: --bootstrap-mode {token,bootstrap} is a
  required startup argument with no permissive default.  Masked
  "auth failure" errors hide whether a refused bootstrap request was
  due to mode, state, or authorisation.
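
The password and API-key bullets above can be sketched with the Python
standard library alone.  This is a minimal illustration, not the iam-svc
implementation: the function names, the 16-byte salt length and the hex
encoding of the key are assumptions; only the algorithm names, the 600k
iteration count and the 128-bit key size come from this message.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple

PBKDF2_ITERATIONS = 600_000  # iteration count quoted in the commit message


def hash_password(password: str, salt: Optional[bytes] = None,
                  iterations: int = PBKDF2_ITERATIONS) -> Tuple[bytes, bytes]:
    """Return (salt, derived_key) using PBKDF2-HMAC-SHA-256."""
    if salt is None:
        # Per-user random salt; the 16-byte length is an assumption.
        salt = secrets.token_bytes(16)
    derived = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations,
    )
    return salt, derived


def verify_password(password: str, salt: bytes, expected: bytes,
                    iterations: int = PBKDF2_ITERATIONS) -> bool:
    """Re-derive the key and compare in constant time."""
    _, derived = hash_password(password, salt, iterations)
    return hmac.compare_digest(derived, expected)


def new_api_key() -> Tuple[str, str]:
    """Return (plaintext, sha256_hex); only the hash would be stored."""
    plaintext = secrets.token_hex(16)  # 16 random bytes = 128 bits
    digest = hashlib.sha256(plaintext.encode("ascii")).hexdigest()
    return plaintext, digest
```

Storing only the SHA-256 of a high-entropy key is why the plaintext can
be shown once and never recovered afterwards.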

Gateway authentication (trustgraph-flow/trustgraph/gateway/auth.py)
-------------------------------------------------------------------
* IamAuth replaces the legacy Authenticator.  Distinguishes JWTs
  (three-segment dotted) from API keys by shape; verifies JWTs
  locally using the cached IAM public key; resolves API keys via
  IAM with a short-TTL hash-keyed cache.  Every failure path
  surfaces the same 401 body ("auth failure") so callers cannot
  enumerate credential state.
* Public key is fetched at gateway startup with a bounded retry loop;
  traffic does not begin flowing until auth has started.
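
A rough sketch of the two mechanisms described above: telling JWTs from
API keys by shape, and a short-TTL cache keyed by the hash of the API
key.  The names looks_like_jwt and ApiKeyCache, the 30-second default
TTL and the injectable clock are all assumptions made for illustration;
the real code lives in gateway/auth.py and may differ.

```python
import hashlib
import time
from typing import Any, Callable, Dict, Optional, Tuple


def looks_like_jwt(credential: str) -> bool:
    """True when the credential has three non-empty dot-separated segments."""
    parts = credential.split(".")
    return len(parts) == 3 and all(parts)


class ApiKeyCache:
    """Short-TTL cache keyed by SHA-256 of the API key (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    @staticmethod
    def _key(api_key: str) -> str:
        # The plaintext key is never used as a dictionary key.
        return hashlib.sha256(api_key.encode("utf-8")).hexdigest()

    def get(self, api_key: str) -> Optional[Any]:
        key = self._key(api_key)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, identity = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the caller falls back to IAM.
            del self._entries[key]
            return None
        return identity

    def put(self, api_key: str, identity: Any) -> None:
        self._entries[self._key(api_key)] = (self._clock() + self._ttl, identity)
```

A short TTL bounds how long a revoked key could keep working from the
cache, at the cost of more round trips to IAM.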

Capability model (trustgraph-flow/trustgraph/gateway/capabilities.py)
---------------------------------------------------------------------
* Roles have two dimensions: a capability set and a workspace scope.
  OSS ships reader / writer / admin; the first two are workspace-
  assigned, admin is cross-workspace ("*").  No "cross-workspace"
  pseudo-capability — workspace permission is a property of the role.
* check(identity, capability, target_workspace=None) is the single
  authorisation test: some role must grant the capability *and* be
  active in the target workspace.
* enforce_workspace validates a request-body workspace against the
  caller's role scopes and injects the resolved value.  Cross-
  workspace admin is permitted by role scope, not by a bypass.
* Gateway endpoints declare a required capability explicitly — no
  permissive default.  Construction fails fast if omitted.  Enterprise
  editions can replace the role table without changing the wire
  protocol.
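
To make the two-dimensional role idea concrete, here is a small sketch
of a check() with the signature given above.  The capability names, the
role table contents and the Identity / RoleAssignment dataclasses are
hypothetical; the behaviour when target_workspace is None (capability
only) is also an assumption, not something this message specifies.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional

# Hypothetical role table: the role names come from the commit message,
# the capability strings are invented for illustration.
ROLE_CAPABILITIES: Dict[str, FrozenSet[str]] = {
    "reader": frozenset({"graph:read"}),
    "writer": frozenset({"graph:read", "graph:write"}),
    "admin": frozenset({"graph:read", "graph:write", "users:admin"}),
}


@dataclass(frozen=True)
class RoleAssignment:
    role: str
    workspace: str  # a workspace id, or "*" for every workspace


@dataclass
class Identity:
    user: str
    roles: List[RoleAssignment] = field(default_factory=list)


def check(identity: Identity, capability: str,
          target_workspace: Optional[str] = None) -> bool:
    """Some role must grant the capability and cover the target workspace."""
    for assignment in identity.roles:
        # Unknown roles grant nothing, so the check fails closed.
        granted = ROLE_CAPABILITIES.get(assignment.role, frozenset())
        if capability not in granted:
            continue
        if target_workspace is None:
            # Assumption: without a target, only the capability is tested.
            return True
        if assignment.workspace in ("*", target_workspace):
            return True
    return False
```

Because scope is stored on the role assignment, a cross-workspace admin
passes the same test as everyone else rather than skipping it.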

WebSocket first-frame auth (dispatch/mux.py, endpoint/socket.py)
----------------------------------------------------------------
* /api/v1/socket handshake unconditionally accepts; authentication
  runs on the first WebSocket frame ({"type":"auth","token":"..."})
  with {"type":"auth-ok","workspace":"..."} / {"type":"auth-failed"}.
  The socket stays open on failure so the client can re-authenticate
  — browsers treat a handshake-time 401 as terminal, breaking
  reconnection.
* Mux.receive rejects every non-auth frame before auth succeeds,
  enforces the caller's workspace (envelope + inner payload) using
  the role-scope resolver, and supports mid-session re-auth.
* Flow import/export streaming endpoints keep the legacy ?token=
  handshake (URL-scoped short-lived transfers; no re-auth need).
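
The first-frame protocol can be pictured as a small state machine.  The
sketch below is transport-free and is not the Mux code: the class name,
the authenticate callback, the reply used for rejected frames, the
"dispatch" marker and the choice to clear the session on a failed
re-auth are all assumptions.  Only the auth / auth-ok / auth-failed
frame shapes come from this message.

```python
from typing import Callable, Optional


class FirstFrameAuthSession:
    """Tracks auth state for one socket (hypothetical helper)."""

    def __init__(self, authenticate: Callable[[str], Optional[str]]):
        # authenticate(token) returns a workspace name, or None on failure.
        self._authenticate = authenticate
        self.workspace: Optional[str] = None

    def handle(self, frame: dict) -> dict:
        if frame.get("type") == "auth":
            token = frame.get("token")
            workspace = (
                self._authenticate(token) if isinstance(token, str) else None
            )
            # Assumption: a failed re-auth clears the previous identity.
            self.workspace = workspace
            if workspace is None:
                # The socket stays open so the client can try again.
                return {"type": "auth-failed"}
            return {"type": "auth-ok", "workspace": workspace}

        if self.workspace is None:
            # Non-auth frame before auth succeeded: reject it.
            # The reply shape here is a guess.
            return {"type": "auth-failed"}

        # Authenticated: hand the frame on, pinned to the caller's workspace.
        return {"type": "dispatch", "workspace": self.workspace, "frame": frame}
```

Keeping the state per socket is what allows mid-session re-auth: a later
auth frame simply replaces the stored workspace.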

Auth surface
------------
* POST /api/v1/auth/login — public, returns a JWT.
* POST /api/v1/auth/bootstrap — public; forwards to IAM's bootstrap
  op which itself enforces mode + tables-empty.
* POST /api/v1/auth/change-password — any authenticated user.
* POST /api/v1/iam — admin-only generic forwarder for the rest of
  the IAM API (per-op REST endpoints to follow in a later change).

Removed / breaking
------------------
* GATEWAY_SECRET / --api-token / default_api_token and the legacy
  Authenticator.permitted contract.  The gateway cannot run without
  IAM.
* ?token= on /api/v1/socket.
* DispatcherManager and Mux both raise on auth=None — no silent
  downgrade path.

CLI tools (trustgraph-cli)
--------------------------
tg-bootstrap-iam, tg-login, tg-create-user, tg-list-users,
tg-disable-user, tg-enable-user, tg-delete-user, tg-change-password,
tg-reset-password, tg-create-api-key, tg-list-api-keys,
tg-revoke-api-key, tg-create-workspace, tg-list-workspaces.  Passwords
read via getpass; tokens / one-time secrets written to stdout with
operator context on stderr so shell composition works cleanly.
AsyncSocketClient / SocketClient updated to the first-frame auth
protocol.
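
The stdout / stderr split mentioned above could look roughly like this.
The helper name emit_secret and its parameters are hypothetical and are
not taken from trustgraph-cli.

```python
import sys
from typing import Optional, TextIO


def emit_secret(secret: str, context: str,
                out: Optional[TextIO] = None,
                err: Optional[TextIO] = None) -> None:
    """Write the secret alone to stdout and operator context to stderr."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    # Human-readable context goes to stderr so it never pollutes a pipe.
    print(context, file=err)
    # Only the secret reaches stdout, followed by a single newline.
    print(secret, file=out)
```

With this split, a shell command substitution around the tool captures
only the token while the operator still sees the context on the
terminal.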

Specifications
--------------
* docs/tech-specs/iam.md updated with the error policy, workspace
  resolver extension point, and OSS role-scope model.
* docs/tech-specs/iam-protocol.md (new) — transport, dataclasses,
  operation table, error taxonomy, bootstrap modes.
* docs/tech-specs/capabilities.md (new) — capability vocabulary, OSS
  role bundles, agent-as-composition note, enforcement-boundary
  policy, enterprise extensibility.

Tests
-----
* test_auth.py (rewritten) — IamAuth + JWT round-trip with real
  Ed25519 keypairs + API-key cache behaviour.
* test_capabilities.py (new) — role table sanity, check across
  role x workspace combinations, enforce_workspace paths,
  unknown-cap / unknown-role fail-closed.
* Every endpoint test construction now names its capability
  explicitly (no permissive defaults relied upon).  New tests pin
  the fail-closed invariants: DispatcherManager / Mux refuse
  auth=None; i18n path-traversal defense is exercised.
* test_socket_graceful_shutdown rewritten against IamAuth.
cybermaggedon 2026-04-24 17:29:10 +01:00 committed by GitHub
parent ae9936c9cc
commit 67b2fc448f
GPG key ID: B5690EEEBB952194
61 changed files with 6474 additions and 792 deletions

@@ -1,355 +1,179 @@
"""
Tests for Gateway Service API
Tests for gateway/service.py: the Api class that wires together
the pub/sub backend, IAM auth, config receiver, dispatcher manager,
and endpoint manager.
The legacy ``GATEWAY_SECRET`` / ``default_api_token`` / allow-all
surface is gone, so the tests here focus on the Api's construction
and composition rather than the removed auth behaviour. IamAuth's
own behaviour is covered in test_auth.py.
"""
import pytest
import asyncio
from unittest.mock import Mock, patch, MagicMock, AsyncMock
from unittest.mock import AsyncMock, Mock, patch
from aiohttp import web
import pulsar
from trustgraph.gateway.service import Api, run, default_pulsar_host, default_prometheus_url, default_timeout, default_port, default_api_token
# Tests for Gateway Service API
from trustgraph.gateway.service import (
    Api,
    default_pulsar_host, default_prometheus_url,
    default_timeout, default_port,
)
from trustgraph.gateway.auth import IamAuth
class TestApi:
"""Test cases for Api class"""
# -- constants -------------------------------------------------------------
def test_api_initialization_with_defaults(self):
"""Test Api initialization with default values"""
with patch('trustgraph.gateway.service.get_pubsub') as mock_get_pubsub:
mock_backend = Mock()
mock_get_pubsub.return_value = mock_backend
api = Api()
class TestDefaults:
assert api.port == default_port
assert api.timeout == default_timeout
assert api.pulsar_host == default_pulsar_host
assert api.pulsar_api_key is None
assert api.prometheus_url == default_prometheus_url + "/"
assert api.auth.allow_all is True
def test_exports_default_constants(self):
# These are consumed by CLIs / tests / docs. Sanity-check
# that they're the expected shape.
assert default_port == 8088
assert default_timeout == 600
assert default_pulsar_host.startswith("pulsar://")
assert default_prometheus_url.startswith("http")
# Verify get_pubsub was called
mock_get_pubsub.assert_called_once()
def test_api_initialization_with_custom_config(self):
"""Test Api initialization with custom configuration"""
# -- Api construction ------------------------------------------------------
@pytest.fixture
def mock_backend():
    return Mock()

@pytest.fixture
def api(mock_backend):
    with patch(
        "trustgraph.gateway.service.get_pubsub",
        return_value=mock_backend,
    ):
        yield Api()
class TestApiConstruction:
    def test_defaults(self, api):
        assert api.port == default_port
        assert api.timeout == default_timeout
        assert api.pulsar_host == default_pulsar_host
        assert api.pulsar_api_key is None
        # prometheus_url gets normalised with a trailing slash
        assert api.prometheus_url == default_prometheus_url + "/"

    def test_auth_is_iam_backed(self, api):
        # Any Api always gets an IamAuth. There is no "no auth" mode
        # (GATEWAY_SECRET / allow_all has been removed; see IAM spec).
        assert isinstance(api.auth, IamAuth)

    def test_components_wired(self, api):
        assert api.config_receiver is not None
        assert api.dispatcher_manager is not None
        assert api.endpoint_manager is not None

    def test_dispatcher_manager_has_auth(self, api):
        # The Mux uses this handle for first-frame socket auth.
        assert api.dispatcher_manager.auth is api.auth
def test_custom_config(self, mock_backend):
config = {
"port": 9000,
"timeout": 300,
"pulsar_host": "pulsar://custom-host:6650",
"pulsar_api_key": "test-api-key",
"pulsar_listener": "custom-listener",
"pulsar_api_key": "custom-key",
"prometheus_url": "http://custom-prometheus:9090",
"api_token": "secret-token"
}
with patch(
"trustgraph.gateway.service.get_pubsub",
return_value=mock_backend,
):
a = Api(**config)
with patch('trustgraph.gateway.service.get_pubsub') as mock_get_pubsub:
mock_backend = Mock()
mock_get_pubsub.return_value = mock_backend
assert a.port == 9000
assert a.timeout == 300
assert a.pulsar_host == "pulsar://custom-host:6650"
assert a.pulsar_api_key == "custom-key"
# Trailing slash added.
assert a.prometheus_url == "http://custom-prometheus:9090/"
api = Api(**config)
def test_prometheus_url_already_has_trailing_slash(self, mock_backend):
with patch(
"trustgraph.gateway.service.get_pubsub",
return_value=mock_backend,
):
a = Api(prometheus_url="http://p:9090/")
assert a.prometheus_url == "http://p:9090/"
assert api.port == 9000
assert api.timeout == 300
assert api.pulsar_host == "pulsar://custom-host:6650"
assert api.pulsar_api_key == "test-api-key"
assert api.prometheus_url == "http://custom-prometheus:9090/"
assert api.auth.token == "secret-token"
assert api.auth.allow_all is False
    def test_queue_overrides_parsed_for_config(self, mock_backend):
        with patch(
            "trustgraph.gateway.service.get_pubsub",
            return_value=mock_backend,
        ):
            a = Api(
                config_request_queue="alt-config-req",
                config_response_queue="alt-config-resp",
            )
        overrides = a.dispatcher_manager.queue_overrides
        assert overrides.get("config", {}).get("request") == "alt-config-req"
        assert overrides.get("config", {}).get("response") == "alt-config-resp"
# Verify get_pubsub was called with config
mock_get_pubsub.assert_called_once_with(**config)
def test_api_initialization_with_pulsar_api_key(self):
"""Test Api initialization with Pulsar API key authentication"""
with patch('trustgraph.gateway.service.get_pubsub') as mock_get_pubsub:
mock_get_pubsub.return_value = Mock()
# -- app_factory -----------------------------------------------------------
api = Api(pulsar_api_key="test-key")
# Verify api key was stored
assert api.pulsar_api_key == "test-key"
mock_get_pubsub.assert_called_once()
def test_api_initialization_prometheus_url_normalization(self):
"""Test that prometheus_url gets normalized with trailing slash"""
with patch('trustgraph.gateway.service.get_pubsub') as mock_get_pubsub:
mock_get_pubsub.return_value = Mock()
# Test URL without trailing slash
api = Api(prometheus_url="http://prometheus:9090")
assert api.prometheus_url == "http://prometheus:9090/"
# Test URL with trailing slash
api = Api(prometheus_url="http://prometheus:9090/")
assert api.prometheus_url == "http://prometheus:9090/"
def test_api_initialization_empty_api_token_means_no_auth(self):
"""Test that empty API token results in allow_all authentication"""
with patch('trustgraph.gateway.service.get_pubsub') as mock_get_pubsub:
mock_get_pubsub.return_value = Mock()
api = Api(api_token="")
assert api.auth.allow_all is True
def test_api_initialization_none_api_token_means_no_auth(self):
"""Test that None API token results in allow_all authentication"""
with patch('trustgraph.gateway.service.get_pubsub') as mock_get_pubsub:
mock_get_pubsub.return_value = Mock()
api = Api(api_token=None)
assert api.auth.allow_all is True
class TestAppFactory:
@pytest.mark.asyncio
async def test_app_factory_creates_application(self):
"""Test that app_factory creates aiohttp application"""
with patch('trustgraph.gateway.service.get_pubsub') as mock_get_pubsub:
mock_get_pubsub.return_value = Mock()
api = Api()
# Mock the dependencies
api.config_receiver = Mock()
api.config_receiver.start = AsyncMock()
api.endpoint_manager = Mock()
api.endpoint_manager.add_routes = Mock()
api.endpoint_manager.start = AsyncMock()
app = await api.app_factory()
assert isinstance(app, web.Application)
assert app._client_max_size == 256 * 1024 * 1024
# Verify that config receiver was started
api.config_receiver.start.assert_called_once()
# Verify that endpoint manager was configured
api.endpoint_manager.add_routes.assert_called_once_with(app)
api.endpoint_manager.start.assert_called_once()
    async def test_creates_aiohttp_app(self, api):
        # Stub out the long-tail dependencies that reach out to IAM /
        # pub/sub so we can exercise the factory in isolation.
        api.auth.start = AsyncMock()
        api.config_receiver = Mock()
        api.config_receiver.start = AsyncMock()
        api.endpoint_manager = Mock()
        api.endpoint_manager.add_routes = Mock()
        api.endpoint_manager.start = AsyncMock()
        api.endpoints = []
        app = await api.app_factory()
        assert isinstance(app, web.Application)
        assert app._client_max_size == 256 * 1024 * 1024
        api.auth.start.assert_called_once()
        api.config_receiver.start.assert_called_once()
        api.endpoint_manager.add_routes.assert_called_once_with(app)
        api.endpoint_manager.start.assert_called_once()
@pytest.mark.asyncio
async def test_app_factory_with_custom_endpoints(self):
"""Test app_factory with custom endpoints"""
with patch('trustgraph.gateway.service.get_pubsub') as mock_get_pubsub:
mock_get_pubsub.return_value = Mock()
api = Api()
# Mock custom endpoints
mock_endpoint1 = Mock()
mock_endpoint1.add_routes = Mock()
mock_endpoint1.start = AsyncMock()
mock_endpoint2 = Mock()
mock_endpoint2.add_routes = Mock()
mock_endpoint2.start = AsyncMock()
api.endpoints = [mock_endpoint1, mock_endpoint2]
# Mock the dependencies
api.config_receiver = Mock()
api.config_receiver.start = AsyncMock()
api.endpoint_manager = Mock()
api.endpoint_manager.add_routes = Mock()
api.endpoint_manager.start = AsyncMock()
app = await api.app_factory()
# Verify custom endpoints were configured
mock_endpoint1.add_routes.assert_called_once_with(app)
mock_endpoint1.start.assert_called_once()
mock_endpoint2.add_routes.assert_called_once_with(app)
mock_endpoint2.start.assert_called_once()
async def test_auth_start_runs_before_accepting_traffic(self, api):
"""``auth.start()`` fetches the IAM signing key, and must
complete (or time out) before the gateway begins accepting
requests. It's the first await in app_factory."""
order = []
def test_run_method_calls_web_run_app(self):
"""Test that run method calls web.run_app"""
with patch('trustgraph.gateway.service.get_pubsub') as mock_get_pubsub, \
patch('aiohttp.web.run_app') as mock_run_app:
mock_get_pubsub.return_value = Mock()
# AsyncMock.side_effect expects a sync callable (its return
# value becomes the coroutine's return); a plain list.append
# avoids the "coroutine was never awaited" trap of an async
# side_effect.
api.auth.start = AsyncMock(
side_effect=lambda: order.append("auth"),
)
api.config_receiver = Mock()
api.config_receiver.start = AsyncMock(
side_effect=lambda: order.append("config"),
)
api.endpoint_manager = Mock()
api.endpoint_manager.add_routes = Mock()
api.endpoint_manager.start = AsyncMock(
side_effect=lambda: order.append("endpoints"),
)
api.endpoints = []
# Api.run() passes self.app_factory() — a coroutine — to
# web.run_app, which would normally consume it inside its own
# event loop. Since we mock run_app, close the coroutine here
# so it doesn't leak as an "unawaited coroutine" RuntimeWarning.
def _consume_coro(coro, **kwargs):
coro.close()
mock_run_app.side_effect = _consume_coro
await api.app_factory()
api = Api(port=8080)
api.run()
# Verify run_app was called once with the correct port
mock_run_app.assert_called_once()
args, kwargs = mock_run_app.call_args
assert len(args) == 1 # Should have one positional arg (the coroutine)
assert kwargs == {'port': 8080} # Should have port keyword arg
def test_api_components_initialization(self):
"""Test that all API components are properly initialized"""
with patch('trustgraph.gateway.service.get_pubsub') as mock_get_pubsub:
mock_get_pubsub.return_value = Mock()
api = Api()
# Verify all components are initialized
assert api.config_receiver is not None
assert api.dispatcher_manager is not None
assert api.endpoint_manager is not None
assert api.endpoints == []
# Verify component relationships
assert api.dispatcher_manager.backend == api.pubsub_backend
assert api.dispatcher_manager.config_receiver == api.config_receiver
assert api.endpoint_manager.dispatcher_manager == api.dispatcher_manager
# EndpointManager doesn't store auth directly, it passes it to individual endpoints
class TestRunFunction:
"""Test cases for the run() function"""
def test_run_function_with_metrics_enabled(self):
"""Test run function with metrics enabled"""
import warnings
# Suppress the specific async warning with a broader pattern
warnings.filterwarnings("ignore", message=".*Api.app_factory.*was never awaited", category=RuntimeWarning)
with patch('argparse.ArgumentParser.parse_args') as mock_parse_args, \
patch('trustgraph.gateway.service.start_http_server') as mock_start_http_server:
# Mock command line arguments
mock_args = Mock()
mock_args.metrics = True
mock_args.metrics_port = 8000
mock_parse_args.return_value = mock_args
# Create a simple mock instance without any async methods
mock_api_instance = Mock()
mock_api_instance.run = Mock()
# Create a mock Api class without importing the real one
mock_api = Mock(return_value=mock_api_instance)
# Patch using context manager to avoid importing the real Api class
with patch('trustgraph.gateway.service.Api', mock_api):
# Mock vars() to return a dict
with patch('builtins.vars') as mock_vars:
mock_vars.return_value = {
'metrics': True,
'metrics_port': 8000,
'pulsar_host': default_pulsar_host,
'timeout': default_timeout
}
run()
# Verify metrics server was started
mock_start_http_server.assert_called_once_with(8000)
# Verify Api was created and run was called
mock_api.assert_called_once()
mock_api_instance.run.assert_called_once()
@patch('trustgraph.gateway.service.start_http_server')
@patch('argparse.ArgumentParser.parse_args')
def test_run_function_with_metrics_disabled(self, mock_parse_args, mock_start_http_server):
"""Test run function with metrics disabled"""
# Mock command line arguments
mock_args = Mock()
mock_args.metrics = False
mock_parse_args.return_value = mock_args
# Create a simple mock instance without any async methods
mock_api_instance = Mock()
mock_api_instance.run = Mock()
# Patch the Api class inside the test without using decorators
with patch('trustgraph.gateway.service.Api') as mock_api:
mock_api.return_value = mock_api_instance
# Mock vars() to return a dict
with patch('builtins.vars') as mock_vars:
mock_vars.return_value = {
'metrics': False,
'metrics_port': 8000,
'pulsar_host': default_pulsar_host,
'timeout': default_timeout
}
run()
# Verify metrics server was NOT started
mock_start_http_server.assert_not_called()
# Verify Api was created and run was called
mock_api.assert_called_once()
mock_api_instance.run.assert_called_once()
@patch('argparse.ArgumentParser.parse_args')
def test_run_function_argument_parsing(self, mock_parse_args):
"""Test that run function properly parses command line arguments"""
# Mock command line arguments
mock_args = Mock()
mock_args.metrics = False
mock_parse_args.return_value = mock_args
# Create a simple mock instance without any async methods
mock_api_instance = Mock()
mock_api_instance.run = Mock()
# Mock vars() to return a dict with all expected arguments
expected_args = {
'pulsar_host': 'pulsar://test:6650',
'pulsar_api_key': 'test-key',
'pulsar_listener': 'test-listener',
'prometheus_url': 'http://test-prometheus:9090',
'port': 9000,
'timeout': 300,
'api_token': 'secret',
'log_level': 'INFO',
'metrics': False,
'metrics_port': 8001
}
# Patch the Api class inside the test without using decorators
with patch('trustgraph.gateway.service.Api') as mock_api:
mock_api.return_value = mock_api_instance
with patch('builtins.vars') as mock_vars:
mock_vars.return_value = expected_args
run()
# Verify Api was created with the parsed arguments
mock_api.assert_called_once_with(**expected_args)
mock_api_instance.run.assert_called_once()
def test_run_function_creates_argument_parser(self):
"""Test that run function creates argument parser with correct arguments"""
with patch('argparse.ArgumentParser') as mock_parser_class:
mock_parser = Mock()
mock_parser_class.return_value = mock_parser
mock_parser.parse_args.return_value = Mock(metrics=False)
with patch('trustgraph.gateway.service.Api') as mock_api, \
patch('builtins.vars') as mock_vars:
mock_vars.return_value = {'metrics': False}
mock_api.return_value = Mock()
run()
# Verify ArgumentParser was created
mock_parser_class.assert_called_once()
# Verify add_argument was called for each expected argument
expected_arguments = [
'pulsar-host', 'pulsar-api-key', 'pulsar-listener',
'prometheus-url', 'port', 'timeout', 'api-token',
'log-level', 'metrics', 'metrics-port'
]
# Check that add_argument was called multiple times (once for each arg)
assert mock_parser.add_argument.call_count >= len(expected_arguments)
# auth.start must be first (before config receiver, before
# any endpoint starts).
assert order[0] == "auth"
# All three must have run.
assert set(order) == {"auth", "config", "endpoints"}