Commit graph

1358 commits

Author SHA1 Message Date
Cyber MacGeddon
159b1e2824 Merge branch 'release/v2.4' 2026-05-11 15:15:50 +01:00
KOTHA-SRIVIBHU
dd974b0cac fix: replace bare excepts in NLTK initialization (#896) 2026-05-11 15:12:25 +01:00
KOTHA-SRIVIBHU
ab02c02b33
fix: replace bare excepts in NLTK initialization (#896) 2026-05-11 15:11:30 +01:00
Sahil Yadav
d08ec56a73 fix: resolve publisher resource leak and field parse validation (#886) 2026-05-11 15:06:54 +01:00
Sahil Yadav
c2f1759bdf
fix: resolve publisher resource leak and field parse validation (#886) 2026-05-11 15:06:24 +01:00
Jack Colquitt
80a7579639
Enhance README.md descriptions for TrustGraph and Context Core (#892)
Refined the description of TrustGraph and updated the Context Core explanation for clarity.
2026-05-08 13:45:06 -07:00
cybermaggedon
fd8d5b2c42
Recent fixes -> release/v2.4 (#891)
* Fix publisher resource leak in librarian submit_document (#883)

Wrap pub.start()/pub.send() in try/finally to guarantee pub.stop() is
called on error. Remove unnecessary asyncio.sleep(1) kludge.

* Make Cassandra replication factor configurable (issue #787) (#887)

Add CASSANDRA_REPLICATION_FACTOR environment variable and
--cassandra-replication-factor CLI argument to cassandra_config.py.

Update all four table store constructors (ConfigTableStore,
KnowledgeTableStore, LibraryTableStore, IamTableStore) to accept
an optional replication_factor parameter and use it in keyspace
creation CQL queries.

Thread the replication factor through all service constructors:
Configuration, KnowledgeManager, Librarian, IamService, and
knowledge store Processor.

* Update tests

---------

Co-authored-by: gittihub-jpg <rico@springer-mail.net>
2026-05-08 19:48:12 +01:00
gittihub-jpg
e23d4a5b58
Make Cassandra replication factor configurable (issue #787) (#887)
Add CASSANDRA_REPLICATION_FACTOR environment variable and
--cassandra-replication-factor CLI argument to cassandra_config.py.

Update all four table store constructors (ConfigTableStore,
KnowledgeTableStore, LibraryTableStore, IamTableStore) to accept
an optional replication_factor parameter and use it in keyspace
creation CQL queries.

Thread the replication factor through all service constructors:
Configuration, KnowledgeManager, Librarian, IamService, and
knowledge store Processor.
2026-05-08 19:31:58 +01:00
gittihub-jpg
f9d6606423
Fix publisher resource leak in librarian submit_document (#883)
Wrap pub.start()/pub.send() in try/finally to guarantee pub.stop() is
called on error. Remove unnecessary asyncio.sleep(1) kludge.
2026-05-08 19:22:48 +01:00
cybermaggedon
fe542b3d33
Remove race condition in workspace initialisation if iam-svc is up (#867)
Remove race condition in workspace initialisation if iam-svc is up
before config-svc.

iam.py — handle_create_workspace:
- Config registration (_on_workspace_created) moves before the IAM
  table write, so it's a prerequisite. If the config put fails,
  the exception propagates and the IAM create doesn't happen.
- On duplicate, the IAM table write is skipped but config
  registration still runs (idempotent put). Returns the existing
  record with no error instead of returning _err("duplicate", ...).

service.py — _announce_workspace_created → _ensure_workspace_registered:
- Renamed to reflect the new semantics.
- Exceptions propagate instead of being swallowed — if config
  registration fails, the caller sees the error.
2026-05-06 21:21:48 +01:00
cybermaggedon
d282d72db1
Fixed document-rag workspace problem (#866)
- Fixed document-rag workspace problem
- OpenAI text-completion processor now puts 'not-set' in the token
  if no token is set (new OpenAI library requires it to be set to
  something.
- Update tests
2026-05-06 14:55:21 +01:00
cybermaggedon
03cc5ac80f
Per-flow librarian clients and per-workspace response queues (#865)
Replace singleton LibrarianClient with per-flow instances via the new
LibrarianSpec, giving each flow its own librarian tied to the
workspace-scoped request/response queues from the blueprint.

Move all workspace-scoped services (config, flow, librarian, knowledge)
from a single base-queue response producer to per-workspace response
producers created alongside the existing per-workspace request
consumers.  Update the gateway dispatcher and bootstrapper flow client
to subscribe to the matching workspace-scoped response queues.

Fix WorkspaceInit to register workspaces through the IAM
create-workspace API so they appear in __workspaces__ and are visible
to the gateway.  Simplify the bootstrapper gate to only check
config-svc reachability.

Updated tests accordingly.
2026-05-06 12:01:01 +01:00
Jack Colquitt
1ffae12559
Revise README to reflect agent runtime platform (#864)
Updated platform description and added messaging systems.
2026-05-05 11:47:19 -07:00
cybermaggedon
01bf1d89d5
Fixed a circular dependency causing bootstrap to fail (#863)
TemplateSeed and WorkspaceInit now run pre-gate. They'll
write templates and register the default workspace before the gate
checks flow-svc, breaking the circular dependency.
2026-05-05 16:00:21 +01:00
cybermaggedon
9f2bfbce0c
Per-workspace queue routing for workspace-scoped services (#862)
Workspace identity is now determined by queue infrastructure instead of
message body fields, closing a privilege-escalation vector where a caller
could spoof workspace in the request payload.

- Add WorkspaceProcessor base class: discovers workspaces from config at
  startup, creates per-workspace consumers (queue:workspace), and manages
  consumer lifecycle on workspace create/delete events
- Roll out to librarian, flow-svc, knowledge cores, and config-svc
- Config service gets a dual-queue regime: a system queue for
  cross-workspace ops (getvalues-all-ws, bootstrapper writes to
  __workspaces__) and per-workspace queues for tenant-scoped ops, with
  workspace discovery from its own Cassandra store
- Remove workspace field from request schemas (FlowRequest,
  LibrarianRequest, KnowledgeRequest, CollectionManagementRequest) and
  from DocumentMetadata / ProcessingMetadata — table stores now accept
  workspace as an explicit parameter
- Strip workspace encode/decode from all message translators and gateway
  serializers
- Gateway enforces workspace existence: reject requests targeting
  non-existent workspaces instead of routing to queues with no consumer
- Config service provisions new workspaces from __template__ on creation
- Add workspace lifecycle hooks to AsyncProcessor so any processor can
  react to workspace create/delete without subclassing WorkspaceProcessor
2026-05-04 10:30:03 +01:00
cybermaggedon
9be257ceee
Update packages with vulns in container builds (#861)
* Fix vulns-flagged imports

* Fix archaic pulls in the "trustgraph" package

* Add unstructured to meta package
2026-04-30 20:02:53 +01:00
cybermaggedon
89f058d35b
Fix erroneous handling of __template__ and __system__ workspaces (#860)
__template__ and __system__ are not real workspace but config
zones for workspace template and init handling.  They should not
be processed as workspaces.

Now ignores workspace beginning with '_' prefix.
2026-04-30 09:53:58 +01:00
cybermaggedon
69b0b9b326
Add missing websockets dep (#859)
Added 'websockets' to pyproject.toml in trustgraph-base
2026-04-30 09:53:32 +01:00
Cyber MacGeddon
c112af0ab0 align chunker + googleaistudio fixes with release/v2.4
Master had a parallel sibling fix for issue #821 (PR #828) using
self.RecursiveCharacterTextSplitter / self.TokenTextSplitter; release
branches converged on the bare module-level form.  Adopt release/v2.4's
version so downstream branches don't drift further.
2026-04-29 18:01:31 +01:00
Cyber MacGeddon
f3434307c5 Merge branch 'release/v2.4' 2026-04-29 17:57:24 +01:00
Jack Colquitt
627cb1e0d8
Remove Cossmology badge from README (#857)
Removed the badge for trustgraph on Cossmology from the README.
2026-04-28 16:54:23 -07:00
cybermaggedon
d0850ff381
Delete some stuff to free up disk space (#856) 2026-04-28 22:46:02 +01:00
cybermaggedon
9fc1d4527b
iam: self-service ops, optional workspace filters, Mux service routing (#855)
Three threads, all reinforcing the contract's system-level vs.
workspace-association distinction.

WS Mux service routing
- tg-show-flows (and any workspace-level service over the WS) was
  failing with "unknown service" because the post-refactor Mux
  unconditionally looked up flow-service:<kind>.  Now branches on
  the envelope's flow field: with flow → flow-service:<kind>;
  without flow → <kind>:<op> from the inner body; with bare op
  lookup for service=iam.  Resource and parameters come from the
  matched op's own extractors — same path the HTTP endpoints take.

Optional workspace on system-level user/key ops
- list-users returns the deployment-wide list when no workspace is
  supplied, filters when one is.  get-user, update-user,
  disable-user, enable-user, delete-user, reset-password,
  create-api-key, list-api-keys, revoke-api-key all treat workspace
  as an optional integrity check rather than a required argument.
- create-user keeps workspace required — there it's the new user's
  home-workspace binding, a parameter rather than an address.
- API keys reclassified as SYSTEM-level resources.  By the same
  reasoning that makes users system-level, an API key is a
  credential record on a deployment-wide registry; the workspace it
  authenticates to is a property, not a containment.

Self-service surface
- whoami: returns the caller's own user record.  AUTHENTICATED-only;
  no users:read capability required.  Foundation for UI affordances
  that depend on the caller's permissions.
- bootstrap-status: POST /api/v1/auth/bootstrap-status, PUBLIC,
  side-effect-free.  Returns {bootstrap_available: bool} so a
  first-run UI can decide whether to render setup without consuming
  the bootstrap op.
- Gateway now injects actor=identity.handle on every authenticated
  forward to iam-svc (IamEndpoint and WS Mux iam path), overwriting
  any caller-supplied value.  Underpins whoami, audit logging, and
  future regime-side decisions that need actor identity.
- tg-whoami and tg-update-user CLIs.

Spec polish
- iam-contract.md: actor-injection rule documented; whoami /
  bootstrap-status added to operations list; permission-scope
  framing tightened (workspace scope is a property of the grant,
  not the user or role).
- iam.md: self-service section; gateway flow gains the actor-
  injection step; role section reframed so iam-svc constraints
  don't leak into contract-level prose.
- iam-protocol.md: ops table updated for whoami, bootstrap-status,
  optional-workspace pattern; bootstrap_available added to the
  IamResponse listing.
2026-04-28 22:13:12 +01:00
Trevin Chow
6302eb8c97 test(ontology): harden domain/range validation + add missing tests (#848)
Fixes #826. Addresses all five points the maintainer called out in
the follow-up to #825.

Source change (trustgraph-flow/trustgraph/extract/kg/ontology/extract.py):
- Added `_is_subclass_of(cls, target, ontology_subset, max_depth=100)`
  helper with visited-set cycle detection + a defensive depth cap.
  LLM-generated ontologies may emit cycles (A subclass_of B,
  B subclass_of A); the prior while-loop would infinite-loop on that.
- Replaced both near-identical domain and range subclass walks in
  `is_valid_triple` with a single call to the new helper. Net is
  -20 duplicated lines + 26-line helper.

Tests (tests/unit/test_extract/test_ontology/test_prompt_and_extraction.py):
- test_is_valid_triple_subclass_is_accepted: domain expects Recipe,
  actual type is Cake (subclass), validates.
- test_is_valid_triple_handles_subclass_cycle_without_infinite_loop:
  A subclass_of B, B subclass_of A; call returns False within the
  depth cap rather than hanging.
- test_parse_and_validate_triples_collects_entity_types_from_rdf_type:
  end-to-end path: rdf:type triples build the entity_types dict,
  subsequent domain-check triples validate against it.
- test_is_valid_triple_entity_types_none_default: the None default
  path now has explicit coverage.

156 existing tests in tests/unit/test_extract/test_ontology still pass.
2026-04-28 16:37:42 +01:00
Trevin Chow
e59aa023d4
test(ontology): harden domain/range validation + add missing tests (#848)
Fixes #826. Addresses all five points the maintainer called out in
the follow-up to #825.

Source change (trustgraph-flow/trustgraph/extract/kg/ontology/extract.py):
- Added `_is_subclass_of(cls, target, ontology_subset, max_depth=100)`
  helper with visited-set cycle detection + a defensive depth cap.
  LLM-generated ontologies may emit cycles (A subclass_of B,
  B subclass_of A); the prior while-loop would infinite-loop on that.
- Replaced both near-identical domain and range subclass walks in
  `is_valid_triple` with a single call to the new helper. Net is
  -20 duplicated lines + 26-line helper.

Tests (tests/unit/test_extract/test_ontology/test_prompt_and_extraction.py):
- test_is_valid_triple_subclass_is_accepted: domain expects Recipe,
  actual type is Cake (subclass), validates.
- test_is_valid_triple_handles_subclass_cycle_without_infinite_loop:
  A subclass_of B, B subclass_of A; call returns False within the
  depth cap rather than hanging.
- test_parse_and_validate_triples_collects_entity_types_from_rdf_type:
  end-to-end path: rdf:type triples build the entity_types dict,
  subsequent domain-check triples validate against it.
- test_is_valid_triple_entity_types_none_default: the None default
  path now has explicit coverage.

156 existing tests in tests/unit/test_extract/test_ontology still pass.
2026-04-28 16:33:49 +01:00
cybermaggedon
5e28d3cce0
refactor(iam): pluggable IAM regime via authenticate/authorise contract (#853)
The gateway no longer holds any policy state — capability sets, role
definitions, workspace scope rules.  Per the IAM contract it asks the
regime "may this identity perform this capability on this resource?"
per request.  That moves the OSS role-based regime entirely into
iam-svc, which can be replaced (SSO, ABAC, ReBAC) without changing
the gateway, the wire protocol, or backend services.

Contract:
- authenticate(credential) -> Identity (handle, workspace,
  principal_id, source).  No roles, claims, or policy state surface
  to the gateway.
- authorise(identity, capability, resource, parameters) -> (allow,
  ttl).  Cached per-decision (regime TTL clamped above; fail-closed
  on regime errors).
- authorise_many available as a fan-out variant.

Operation registry drives every authorisation decision:
- /api/v1/iam -> IamEndpoint, looks up bare op name (create-user,
  list-workspaces, ...).
- /api/v1/{kind} -> RegistryRoutedVariableEndpoint, <kind>:<op>
  (config:get, flow:list-blueprints, librarian:add-document, ...).
- /api/v1/flow/{flow}/service/{kind} -> flow-service:<kind>.
- /api/v1/flow/{flow}/{import,export}/{kind} ->
  flow-{import,export}:<kind>.
- WS Mux per-frame -> flow-service:<kind>; closes a gap where
  authenticated users could hit any service kind.
85 operations registered across the surface.

JWT carries identity only — sub + workspace.  The roles claim is gone;
the gateway never reads policy state from a credential.

The three coarse *_KIND_CAPABILITY maps are removed.  The registry is
the only source of truth for the capability + resource shape of an
operation.  Tests migrated to the new Identity shape and to
authorise()-mocked auth doubles.

Specs updated: docs/tech-specs/iam-contract.md (Identity surface,
caching, registry-naming conventions), iam.md (JWT shape, gateway
flow, role section reframed as OSS-regime detail), iam-protocol.md
(positioned as one implementation of the contract).
2026-04-28 16:19:41 +01:00
cybermaggedon
9f2d9adcb1
Fix Ollama async issue (#854)
* Fix Ollama sync issues - replaced with async

* Fix tests
2026-04-28 15:43:04 +01:00
cybermaggedon
b15f1a167c
Smoke-test websocket tool (#852) 2026-04-28 15:05:35 +01:00
cybermaggedon
666af1c4b3
feat(iam): allow bootstrap mode and token to be sourced from env vars (#851)
Adds an environment-variable fallback for the iam-svc bootstrap
configuration so the token can be injected from a Kubernetes Secret
(or any equivalent secret store) without ever appearing in the
processor-group YAML — which is typically version-controlled.

Resolution order is fixed and per-setting:

  bootstrap_mode  = params["bootstrap_mode"]   or  $IAM_BOOTSTRAP_MODE
  bootstrap_token = params["bootstrap_token"]  or  $IAM_BOOTSTRAP_TOKEN

If neither source supplies a value, the service refuses to start with
a clear message naming both options.  The two settings are resolved
independently, which lets operators commit the mode in YAML (it is
not a secret) while pulling the token from a Secret-backed
``IAM_BOOTSTRAP_TOKEN`` env var.

Validation invariants are unchanged:

* mode must be 'token' or 'bootstrap'
* mode='token' requires a token (from any source)
* mode='bootstrap' must NOT have a token (ambiguous intent)

There is no permissive fallback — the service fails closed in every
branch where configuration is incomplete.

docs/tech-specs/iam-protocol.md gains a 'Configuration sources'
subsection under 'Bootstrap modes' that documents the precedence
table and the K8s injection pattern.  The 'Bootstrap-token
lifecycle' step about removing the token after rotation now applies
to whichever source was used (Secret, env var, or YAML field).
2026-04-28 15:00:33 +01:00
cybermaggedon
67b2fc448f
feat: IAM service, gateway auth middleware, capability model, and CLIs (#849)
Replaces the legacy GATEWAY_SECRET shared-token gate with an IAM-backed
identity and authorisation model.  The gateway no longer has an
"allow-all" or "no auth" mode; every request is authenticated via the
IAM service, authorised against a capability model that encodes both
the operation and the workspace it targets, and rejected with a
deliberately-uninformative 401 / 403 on any failure.

IAM service (trustgraph-flow/trustgraph/iam, trustgraph-base/schema/iam)
-----------------------------------------------------------------------
* New backend service (iam-svc) owning users, workspaces, API keys,
  passwords and JWT signing keys in Cassandra.  Reached over the
  standard pub/sub request/response pattern; gateway is the only
  caller.
* Operations: bootstrap, resolve-api-key, login, get-signing-key-public,
  rotate-signing-key, create/list/get/update/disable/delete/enable-user,
  change-password, reset-password, create/list/get/update/disable-
  workspace, create/list/revoke-api-key.
* Ed25519 JWT signing (alg=EdDSA).  Key rotation writes a new kid and
  retires the previous one; validation is grace-period friendly.
* Passwords: PBKDF2-HMAC-SHA-256, 600k iterations, per-user salt.
* API keys: 128-bit random, SHA-256 hashed.  Plaintext returned once.
* Bootstrap is explicit: --bootstrap-mode {token,bootstrap} is a
  required startup argument with no permissive default.  Masked
  "auth failure" errors hide whether a refused bootstrap request was
  due to mode, state, or authorisation.

Gateway authentication (trustgraph-flow/trustgraph/gateway/auth.py)
-------------------------------------------------------------------
* IamAuth replaces the legacy Authenticator.  Distinguishes JWTs
  (three-segment dotted) from API keys by shape; verifies JWTs
  locally using the cached IAM public key; resolves API keys via
  IAM with a short-TTL hash-keyed cache.  Every failure path
  surfaces the same 401 body ("auth failure") so callers cannot
  enumerate credential state.
* Public key is fetched at gateway startup with a bounded retry loop;
  traffic does not begin flowing until auth has started.

Capability model (trustgraph-flow/trustgraph/gateway/capabilities.py)
---------------------------------------------------------------------
* Roles have two dimensions: a capability set and a workspace scope.
  OSS ships reader / writer / admin; the first two are workspace-
  assigned, admin is cross-workspace ("*").  No "cross-workspace"
  pseudo-capability — workspace permission is a property of the role.
* check(identity, capability, target_workspace=None) is the single
  authorisation test: some role must grant the capability *and* be
  active in the target workspace.
* enforce_workspace validates a request-body workspace against the
  caller's role scopes and injects the resolved value.  Cross-
  workspace admin is permitted by role scope, not by a bypass.
* Gateway endpoints declare a required capability explicitly — no
  permissive default.  Construction fails fast if omitted.  Enterprise
  editions can replace the role table without changing the wire
  protocol.

WebSocket first-frame auth (dispatch/mux.py, endpoint/socket.py)
----------------------------------------------------------------
* /api/v1/socket handshake unconditionally accepts; authentication
  runs on the first WebSocket frame ({"type":"auth","token":"..."})
  with {"type":"auth-ok","workspace":"..."} / {"type":"auth-failed"}.
  The socket stays open on failure so the client can re-authenticate
  — browsers treat a handshake-time 401 as terminal, breaking
  reconnection.
* Mux.receive rejects every non-auth frame before auth succeeds,
  enforces the caller's workspace (envelope + inner payload) using
  the role-scope resolver, and supports mid-session re-auth.
* Flow import/export streaming endpoints keep the legacy ?token=
  handshake (URL-scoped short-lived transfers; no re-auth need).

Auth surface
------------
* POST /api/v1/auth/login — public, returns a JWT.
* POST /api/v1/auth/bootstrap — public; forwards to IAM's bootstrap
  op which itself enforces mode + tables-empty.
* POST /api/v1/auth/change-password — any authenticated user.
* POST /api/v1/iam — admin-only generic forwarder for the rest of
  the IAM API (per-op REST endpoints to follow in a later change).

Removed / breaking
------------------
* GATEWAY_SECRET / --api-token / default_api_token and the legacy
  Authenticator.permitted contract.  The gateway cannot run without
  IAM.
* ?token= on /api/v1/socket.
* DispatcherManager and Mux both raise on auth=None — no silent
  downgrade path.

CLI tools (trustgraph-cli)
--------------------------
tg-bootstrap-iam, tg-login, tg-create-user, tg-list-users,
tg-disable-user, tg-enable-user, tg-delete-user, tg-change-password,
tg-reset-password, tg-create-api-key, tg-list-api-keys,
tg-revoke-api-key, tg-create-workspace, tg-list-workspaces.  Passwords
read via getpass; tokens / one-time secrets written to stdout with
operator context on stderr so shell composition works cleanly.
AsyncSocketClient / SocketClient updated to the first-frame auth
protocol.

Specifications
--------------
* docs/tech-specs/iam.md updated with the error policy, workspace
  resolver extension point, and OSS role-scope model.
* docs/tech-specs/iam-protocol.md (new) — transport, dataclasses,
  operation table, error taxonomy, bootstrap modes.
* docs/tech-specs/capabilities.md (new) — capability vocabulary, OSS
  role bundles, agent-as-composition note, enforcement-boundary
  policy, enterprise extensibility.

Tests
-----
* test_auth.py (rewritten) — IamAuth + JWT round-trip with real
  Ed25519 keypairs + API-key cache behaviour.
* test_capabilities.py (new) — role table sanity, check across
  role x workspace combinations, enforce_workspace paths,
  unknown-cap / unknown-role fail-closed.
* Every endpoint test construction now names its capability
  explicitly (no permissive defaults relied upon).  New tests pin
  the fail-closed invariants: DispatcherManager / Mux refuse
  auth=None; i18n path-traversal defense is exercised.
* test_socket_graceful_shutdown rewritten against IamAuth.
2026-04-24 17:29:10 +01:00
Jack Colquitt
cd9c307453
Add Cossmology badge to README (#850)
Added a badge for trustgraph on Cossmology to README.
2026-04-23 13:40:39 -07:00
cybermaggedon
ae9936c9cc
feat: pluggable bootstrap framework with ordered initialisers (#847)
A generic, long-running bootstrap processor that converges a
deployment to its configured initial state and then idles.
Replaces the previous one-shot `tg-init-trustgraph` container model
and provides an extension point for enterprise / third-party
initialisers.

See docs/tech-specs/bootstrap.md for the full design.

Bootstrapper
------------
A single AsyncProcessor (trustgraph.bootstrap.bootstrapper.Processor)
that:

  * Reads a list of initialiser specifications (class, name, flag,
    params) from either a direct `initialisers` parameter
    (processor-group embedding) or a YAML/JSON file (`-c`, CLI).
  * On each wake, runs a cheap service-gate (config-svc +
    flow-svc round-trips), then iterates the initialiser list,
    running each whose configured flag differs from the one stored
    in __system__/init-state/<name>.
  * Stores per-initialiser completion state in the reserved
    __system__ workspace.
  * Adapts cadence: ~5s on gate failure, ~15s while converging,
    ~300s in steady state.
  * Isolates failures — one initialiser's exception does not block
    others in the same cycle; the failed one retries next wake.

Initialiser contract
--------------------
  * Subclass trustgraph.bootstrap.base.Initialiser.
  * Implement async run(ctx, old_flag, new_flag).
  * Opt out of the service gate with class attr
    wait_for_services=False (only used by PulsarTopology, since
    config-svc cannot come up until Pulsar namespaces exist).
  * ctx carries short-lived config and flow-svc clients plus a
    scoped logger.

Core initialisers (trustgraph.bootstrap.initialisers.*)
-------------------------------------------------------
  * PulsarTopology   — creates Pulsar tenant + namespaces
                       (pre-gate, blocking HTTP offloaded to
                        executor).
  * TemplateSeed     — seeds __template__ from an external JSON
                       file; re-run is upsert-missing by default,
                       overwrite-all opt-in.
  * WorkspaceInit    — populates a named workspace from either
                       the full contents of __template__ or a
                       seed file; raises cleanly if the template
                       isn't seeded yet so the bootstrapper retries
                       on the next cycle.
  * DefaultFlowStart — starts a specific flow in a workspace;
                       no-ops if the flow is already running.

Enterprise or third-party initialisers plug in via fully-qualified
dotted class paths in the bootstrapper's configuration — no core
code change required.

Config service
--------------
  * push(): filter out reserved workspaces (ids starting with "_")
    from the change notifications.  Stored config is preserved; only
    the broadcast is suppressed, so bootstrap / template state lives
    in config-svc without live processors ever reacting to it.

Config client
-------------
  * ConfigClient.get_all(workspace): wraps the existing `config`
    operation to return {type: {key: value}} for a workspace.
    WorkspaceInit uses it to copy __template__ without needing a
    hardcoded types list.

pyproject.toml
--------------
  * Adds a `bootstrap` console script pointing at the new Processor.

* Remove tg-init-trustgraph, superceded by bootstrap processor
2026-04-22 18:03:46 +01:00
cybermaggedon
31027e30ae
Better reporting from api-gateway's metric endpoint (#845)
- Connect failures (DNS, connect refused, server disconnect) now
  return 502 Bad Gateway with a body that names the upstream URL.
- Other exceptions still return 500 but now include the exception
  message in the body and log with exc_info=True so the stack trace
  lands in the gateway log.
- Also fixed the logging.error → logger.error inconsistency in the
  same block (module had a named logger at the top that wasn't being
  used).
2026-04-22 16:16:57 +01:00
cybermaggedon
95c3b62ef1
Correct output from tg-show-flow-state and tg-show-processor-state (#846) 2026-04-22 16:16:36 +01:00
cybermaggedon
89cabee1b4
release/v2.4 -> master (#844) 2026-04-22 15:19:57 +01:00
cybermaggedon
7521e152b9 fix: uuid-ify flow-svc ConfigClient subscription to avoid Pulsar ConsumerBusy on restart (#843)
flow-svc's long-lived ConfigClient was constructed with
subscription=f"{self.id}--config--{id}", where id=params.get("id") is
the deterministic processor id.  On Pulsar the config-response topic
maps to class=response -> Exclusive subscription; when the supervisor
restarts flow-svc within Pulsar's inactive-subscription TTL (minutes),
the previous process's ghost consumer still holds the subscription
and the new process's re-subscribe is rejected with ConsumerBusy,
crash-looping flow-svc.

This is a v2.2 -> v2.3 regression in practice, but not a change in
subscription semantics: the Exclusive mapping for response/notify is
identical between releases.  The regression is that PR #822 split
flow-svc out of config-svc and added this new, long-lived
request/response call site — the new site simply didn't follow the
uuid convention used by the equivalent sites elsewhere
(gateway/config/receiver.py, AsyncProcessor._create_config_client).

Fix: generate a fresh uuid per process instance for the subscription
suffix, matching that convention.
2026-04-22 15:17:56 +01:00
cybermaggedon
6cbaf88fc6 fix: ontology extractor reads .objects, not .object, from PromptResult (#842)
The extract-with-ontologies prompt is a JSONL prompt, which means the
prompt service returns a PromptResult with response_type="jsonl" and
the parsed items in `.objects` (plural).  The ontology extractor was
reading `.object` (singular) — the field used for response_type="json"
— which is always None for JSONL prompts.

Effect: the parser received None on every chunk, hit its "Unexpected
response type: <class 'NoneType'>" branch, returned no ExtractionResult,
and extract_with_simplified_format returned []. Every extraction
silently produced zero triples.

Graphs populated only with the seed ontology schema (TBox) and
document/chunk provenance — no instance triples at all.  The e2e test
threshold of >=100 edges per collection was met by schema + provenance
alone, so the failure mode was invisible until RAG queries couldn't
find any content.

Regression introduced in v2.3 with the token-usage work (commit
56d700f3 / 14e49d83) when PromptClient.prompt() began returning a
PromptResult wrapper instead of the raw text/dict/list.  All other
call sites of .prompt() across retrieval/, agent/, orchestrator/ were
already reading the correct field for their prompt's response_type;
ontology extraction was the sole stranded caller.

Also adds tests/unit/test_extract/test_ontology/test_extract_with_simplified_format.py
covering:
  - happy path: populated .objects produces non-empty triples
  - production failure shape: .objects=None returns [] cleanly
  - empty .objects returns [] without raising
  - defensive: do not silently fall back to .object for a JSONL prompt
2026-04-22 12:10:42 +01:00
cybermaggedon
8be128aa59 fix: api-gateway evicts cached dispatchers when a flow stops (#841)
DispatcherManager caches one ServiceRequestor per (flow_id, kind) in
self.dispatchers, lazily created on first use.  stop_flow dropped the
flow from self.flows but never touched the cached dispatchers, so
their publisher/subscriber connections persisted — bound to the
per-flow exchanges that flow-svc tears down when the flow stops.

If the same flow id was later re-created, flow-svc re-declared fresh
per-flow exchanges, but the gateway's cached dispatcher still held a
subscription queue bound to the now-gone old response exchange.
Requests went out fine (publishers target exchanges by name and the
new exchange has the right name), but responses landed on an exchange
with no binding to the dispatcher's queue and were silently dropped.
The calling CLI or websocket session hung waiting for a reply that
would never arrive.

Reproduction before fix:

    tg-start-flow -i test-flow-1 ...
    # any query on test-flow-1 works
    tg-stop-flow  -i test-flow-1
    tg-start-flow -i test-flow-1 ...
    tg-show-graph -f test-flow-1 -C <collection>   # hangs

Flows that were never stopped (e.g. "default" in a typical session)
were unaffected — their cached dispatcher still pointed at live
plumbing.  That's why the bug appeared flow-name-specific at first
glance; it's actually lifecycle-specific.

Fix: in stop_flow, evict and cleanly stop() every cached dispatcher
keyed on the stopped flow id.  Next request after restart constructs
a fresh dispatcher against the freshly-declared exchanges.  Tuple
shape check preserves global dispatchers, which use (None, kind) as
their key and must survive flow churn.

Uses pop(id, None) instead of del in case stop_flow is invoked
defensively for a flow the gateway never saw.
2026-04-22 12:10:21 +01:00
cybermaggedon
d35473f7f7
feat: workspace-based multi-tenancy, replacing user as tenancy axis (#840)
Introduces `workspace` as the isolation boundary for config, flows,
library, and knowledge data. Removes `user` as a schema-level field
throughout the code, API specs, and tests; workspace provides the
same separation more cleanly at the trusted flow.workspace layer
rather than through client-supplied message fields.

Design
------
- IAM tech spec (docs/tech-specs/iam.md) documents current state,
  proposed auth/access model, and migration direction.
- Data ownership model (docs/tech-specs/data-ownership-model.md)
  captures the workspace/collection/flow hierarchy.

Schema + messaging
------------------
- Drop `user` field from AgentRequest/Step, GraphRagQuery,
  DocumentRagQuery, Triples/Graph/Document/Row EmbeddingsRequest,
  Sparql/Rows/Structured QueryRequest, ToolServiceRequest.
- Keep collection/workspace routing via flow.workspace at the
  service layer.
- Translators updated to not serialise/deserialise user.

API specs
---------
- OpenAPI schemas and path examples cleaned of user fields.
- Websocket async-api messages updated.
- Removed the unused parameters/User.yaml.

Services + base
---------------
- Librarian, collection manager, knowledge, config: all operations
  scoped by workspace. Config client API takes workspace as first
  positional arg.
- `flow.workspace` set at flow start time by the infrastructure;
  no longer pass-through from clients.
- Tool service drops user-personalisation passthrough.

CLI + SDK
---------
- tg-init-workspace and workspace-aware import/export.
- All tg-* commands drop user args; accept --workspace.
- Python API/SDK (flow, socket_client, async_*, explainability,
  library) drop user kwargs from every method signature.

MCP server
----------
- All tool endpoints drop user parameters; socket_manager no longer
  keyed per user.

Flow service
------------
- Closure-based topic cleanup on flow stop: only delete topics
  whose blueprint template was parameterised AND no remaining
  live flow (across all workspaces) still resolves to that topic.
  Three scopes fall out naturally from template analysis:
    * {id} -> per-flow, deleted on stop
    * {blueprint} -> per-blueprint, kept while any flow of the
      same blueprint exists
    * {workspace} -> per-workspace, kept while any flow in the
      workspace exists
    * literal -> global, never deleted (e.g. tg.request.librarian)
  Fixes a bug where stopping a flow silently destroyed the global
  librarian exchange, wedging all library operations until manual
  restart.

RabbitMQ backend
----------------
- heartbeat=60, blocked_connection_timeout=300. Catches silently
  dead connections (broker restart, orphaned channels, network
  partitions) within ~2 heartbeat windows, so the consumer
  reconnects and re-binds its queue rather than sitting forever
  on a zombie connection.

Tests
-----
- Full test refresh: unit, integration, contract, provenance.
- Dropped user-field assertions and constructor kwargs across
  ~100 test files.
- Renamed user-collection isolation tests to workspace-collection.
2026-04-21 23:23:01 +01:00
cybermaggedon
9332089b3d
Setup for 2.4 release branch (#839) 2026-04-21 21:36:46 +01:00
cybermaggedon
424ace44c4
Fix library queue lifecycle (#838)
* Don't delete the global queues (librarian) when flows are deleted

* 60s heartbeat timeouts on RabbitMQ
2026-04-21 21:30:19 +01:00
cybermaggedon
a24df8e990
release/v2.3 -> master (#837) 2026-04-21 16:30:02 +01:00
Het Patel
0ef49ab6ae
feat: standardize LLM rate-limiting and exception handling (#835)
- HTTP 429 translates to TooManyRequests (retryable)
- HTTP 503 translates to LlmError
2026-04-21 16:15:11 +01:00
cybermaggedon
e7efb673ef
Structure the tech specs directory (#836)
Tech spec some subdirectories for different languages
2026-04-21 16:06:41 +01:00
cybermaggedon
48da6c5f8b
Test fixes for Kafka (#834)
Test fixes for Kafka:
- Consumer: isinstance(max_concurrency, int) instead of is not None:
  MagicMock won't pass the check
- Kafka test: updated expected topic name to
  tg.request.prompt.default (colons replaced with dots)
2026-04-18 23:06:01 +01:00
cybermaggedon
b615150624
fix: resolve multiple Kafka backend issues blocking message delivery (#833)
- Producer: add delivery callback to surface send errors instead of
  silently swallowing them, and raise message.max.bytes to 10MB
- Consumer: raise fetch.message.max.bytes to 10MB to match producer,
  tighten session/heartbeat timeouts for fast group joins, and add
  partition assign/revoke logging for diagnostics
- Topic naming: replace colons with dots in topic names since Kafka
  rejects colons (flow:tg:document-load:default was producing invalid
  topic name tg.flow.document-load:default)
- Response consumers: use auto.offset.reset=earliest instead of latest
  so responses published before partition assignment aren't lost
- UNKNOWN_TOPIC_OR_PART: treat as timeout instead of fatal error so
  consumers wait for auto-created topics instead of crashing
- Concurrency: cap consumer workers to 1 for Kafka since topics have
  1 partition — extra consumers trigger rebalance storms that block
  all message delivery
2026-04-18 22:42:24 +01:00
Cyber MacGeddon
222537c26b Merge branch 'release/v2.3' 2026-04-18 12:09:52 +01:00
Het Patel
adea976203 feat: implement retry logic and exponential backoff for S3 operations (#829)
* feat: implement retry logic and exponential backoff for S3 operations

* test: fix librarian mocks after BlobStore async conversion
2026-04-18 12:05:37 +01:00
Het Patel
9a1b2463b6
feat: implement retry logic and exponential backoff for S3 operations (#829)
* feat: implement retry logic and exponential backoff for S3 operations

* test: fix librarian mocks after BlobStore async conversion
2026-04-18 12:05:19 +01:00
cybermaggedon
cce3acd84f
fix: repair deferred imports to preserve module-level names for test patching (#831)
A previous commit moved SDK imports into __init__/methods and
stashed them on self, which broke @patch targets in 24 unit tests.

This fixes the approach: chunker and pdf_decoder use module-level
sentinels with global/if-None guards so imports are still deferred but
patchable. Google AI Studio reverts to standard module-level imports
since the module is only loaded when communicating with Gemini.
Keeps lazy loading on other imports.
2026-04-18 11:43:21 +01:00