trustgraph

mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-06-13 16:55:14 +02:00

Author	SHA1	Message	Date
Cyber MacGeddon	ef412f2a99	Added delete user	2026-04-24 14:10:24 +01:00
Cyber MacGeddon	9ae79ff712	Updated CLI	2026-04-24 12:41:46 +01:00
Cyber MacGeddon	3bdb677607	CLI tools	2026-04-24 11:08:00 +01:00
Cyber MacGeddon	b5821bca6d	Fix socket token handling	2026-04-24 09:40:15 +01:00
Cyber MacGeddon	2a071b12d0	Putting workspace in the inner request	2026-04-24 08:25:35 +01:00
Cyber MacGeddon	843e68cded	Websocket auth protocol	2026-04-23 20:11:41 +01:00
Cyber MacGeddon	d5dabad001	IAM integrated with gateway	2026-04-23 19:45:28 +01:00
Cyber MacGeddon	8348b7728b	IAM secure bootstrap options	2026-04-23 19:23:18 +01:00
Cyber MacGeddon	832a030703	Rest of the IAM operations	2026-04-23 14:20:32 +01:00
Cyber MacGeddon	1f639642ac	ECDSA	2026-04-23 13:58:29 +01:00
Cyber MacGeddon	7be781b6e2	Add JWT login support	2026-04-23 13:47:49 +01:00
Cyber MacGeddon	0ca0f9999c	IAM implementation	2026-04-23 13:19:03 +01:00
Cyber MacGeddon	2f8fce4030	IAM implementation	2026-04-23 12:46:34 +01:00
Cyber MacGeddon	762a51e214	IAM protocol spec	2026-04-23 12:14:52 +01:00
Cyber MacGeddon	f1e336138d	Updated IAM spec	2026-04-23 11:58:07 +01:00
Cyber MacGeddon	b7c6c3a3a0	Update IAM spec	2026-04-23 11:29:09 +01:00
Cyber MacGeddon	78487d871f	Update IAM spec	2026-04-23 11:28:12 +01:00
cybermaggedon	ae9936c9cc	feat: pluggable bootstrap framework with ordered initialisers (#847 ) A generic, long-running bootstrap processor that converges a deployment to its configured initial state and then idles. Replaces the previous one-shot `tg-init-trustgraph` container model and provides an extension point for enterprise / third-party initialisers. See docs/tech-specs/bootstrap.md for the full design. Bootstrapper ------------ A single AsyncProcessor (trustgraph.bootstrap.bootstrapper.Processor) that: * Reads a list of initialiser specifications (class, name, flag, params) from either a direct `initialisers` parameter (processor-group embedding) or a YAML/JSON file (`-c`, CLI). * On each wake, runs a cheap service-gate (config-svc + flow-svc round-trips), then iterates the initialiser list, running each whose configured flag differs from the one stored in __system__/init-state/<name>. * Stores per-initialiser completion state in the reserved __system__ workspace. * Adapts cadence: ~5s on gate failure, ~15s while converging, ~300s in steady state. * Isolates failures — one initialiser's exception does not block others in the same cycle; the failed one retries next wake. Initialiser contract -------------------- * Subclass trustgraph.bootstrap.base.Initialiser. * Implement async run(ctx, old_flag, new_flag). * Opt out of the service gate with class attr wait_for_services=False (only used by PulsarTopology, since config-svc cannot come up until Pulsar namespaces exist). * ctx carries short-lived config and flow-svc clients plus a scoped logger. Core initialisers (trustgraph.bootstrap.initialisers.) ------------------------------------------------------- PulsarTopology — creates Pulsar tenant + namespaces (pre-gate, blocking HTTP offloaded to executor). * TemplateSeed — seeds __template__ from an external JSON file; re-run is upsert-missing by default, overwrite-all opt-in. * WorkspaceInit — populates a named workspace from either the full contents of __template__ or a seed file; raises cleanly if the template isn't seeded yet so the bootstrapper retries on the next cycle. * DefaultFlowStart — starts a specific flow in a workspace; no-ops if the flow is already running. Enterprise or third-party initialisers plug in via fully-qualified dotted class paths in the bootstrapper's configuration — no core code change required. Config service -------------- * push(): filter out reserved workspaces (ids starting with "_") from the change notifications. Stored config is preserved; only the broadcast is suppressed, so bootstrap / template state lives in config-svc without live processors ever reacting to it. Config client ------------- * ConfigClient.get_all(workspace): wraps the existing `config` operation to return {type: {key: value}} for a workspace. WorkspaceInit uses it to copy __template__ without needing a hardcoded types list. pyproject.toml -------------- * Adds a `bootstrap` console script pointing at the new Processor. * Remove tg-init-trustgraph, superceded by bootstrap processor	2026-04-22 18:03:46 +01:00
cybermaggedon	31027e30ae	Better reporting from api-gateway's metric endpoint (#845 ) - Connect failures (DNS, connect refused, server disconnect) now return 502 Bad Gateway with a body that names the upstream URL. - Other exceptions still return 500 but now include the exception message in the body and log with exc_info=True so the stack trace lands in the gateway log. - Also fixed the logging.error → logger.error inconsistency in the same block (module had a named logger at the top that wasn't being used).	2026-04-22 16:16:57 +01:00
cybermaggedon	95c3b62ef1	Correct output from tg-show-flow-state and tg-show-processor-state (#846 )	2026-04-22 16:16:36 +01:00
cybermaggedon	7521e152b9	fix: uuid-ify flow-svc ConfigClient subscription to avoid Pulsar ConsumerBusy on restart (#843 ) flow-svc's long-lived ConfigClient was constructed with subscription=f"{self.id}--config--{id}", where id=params.get("id") is the deterministic processor id. On Pulsar the config-response topic maps to class=response -> Exclusive subscription; when the supervisor restarts flow-svc within Pulsar's inactive-subscription TTL (minutes), the previous process's ghost consumer still holds the subscription and the new process's re-subscribe is rejected with ConsumerBusy, crash-looping flow-svc. This is a v2.2 -> v2.3 regression in practice, but not a change in subscription semantics: the Exclusive mapping for response/notify is identical between releases. The regression is that PR #822 split flow-svc out of config-svc and added this new, long-lived request/response call site — the new site simply didn't follow the uuid convention used by the equivalent sites elsewhere (gateway/config/receiver.py, AsyncProcessor._create_config_client). Fix: generate a fresh uuid per process instance for the subscription suffix, matching that convention.	2026-04-22 15:17:56 +01:00
cybermaggedon	6cbaf88fc6	fix: ontology extractor reads .objects, not .object, from PromptResult (#842 ) The extract-with-ontologies prompt is a JSONL prompt, which means the prompt service returns a PromptResult with response_type="jsonl" and the parsed items in `.objects` (plural). The ontology extractor was reading `.object` (singular) — the field used for response_type="json" — which is always None for JSONL prompts. Effect: the parser received None on every chunk, hit its "Unexpected response type: <class 'NoneType'>" branch, returned no ExtractionResult, and extract_with_simplified_format returned []. Every extraction silently produced zero triples. Graphs populated only with the seed ontology schema (TBox) and document/chunk provenance — no instance triples at all. The e2e test threshold of >=100 edges per collection was met by schema + provenance alone, so the failure mode was invisible until RAG queries couldn't find any content. Regression introduced in v2.3 with the token-usage work (commit `56d700f3` / `14e49d83`) when PromptClient.prompt() began returning a PromptResult wrapper instead of the raw text/dict/list. All other call sites of .prompt() across retrieval/, agent/, orchestrator/ were already reading the correct field for their prompt's response_type; ontology extraction was the sole stranded caller. Also adds tests/unit/test_extract/test_ontology/test_extract_with_simplified_format.py covering: - happy path: populated .objects produces non-empty triples - production failure shape: .objects=None returns [] cleanly - empty .objects returns [] without raising - defensive: do not silently fall back to .object for a JSONL prompt	2026-04-22 12:10:42 +01:00
cybermaggedon	8be128aa59	fix: api-gateway evicts cached dispatchers when a flow stops (#841 ) DispatcherManager caches one ServiceRequestor per (flow_id, kind) in self.dispatchers, lazily created on first use. stop_flow dropped the flow from self.flows but never touched the cached dispatchers, so their publisher/subscriber connections persisted — bound to the per-flow exchanges that flow-svc tears down when the flow stops. If the same flow id was later re-created, flow-svc re-declared fresh per-flow exchanges, but the gateway's cached dispatcher still held a subscription queue bound to the now-gone old response exchange. Requests went out fine (publishers target exchanges by name and the new exchange has the right name), but responses landed on an exchange with no binding to the dispatcher's queue and were silently dropped. The calling CLI or websocket session hung waiting for a reply that would never arrive. Reproduction before fix: tg-start-flow -i test-flow-1 ... # any query on test-flow-1 works tg-stop-flow -i test-flow-1 tg-start-flow -i test-flow-1 ... tg-show-graph -f test-flow-1 -C <collection> # hangs Flows that were never stopped (e.g. "default" in a typical session) were unaffected — their cached dispatcher still pointed at live plumbing. That's why the bug appeared flow-name-specific at first glance; it's actually lifecycle-specific. Fix: in stop_flow, evict and cleanly stop() every cached dispatcher keyed on the stopped flow id. Next request after restart constructs a fresh dispatcher against the freshly-declared exchanges. Tuple shape check preserves global dispatchers, which use (None, kind) as their key and must survive flow churn. Uses pop(id, None) instead of del in case stop_flow is invoked defensively for a flow the gateway never saw.	2026-04-22 12:10:21 +01:00
cybermaggedon	d35473f7f7	feat: workspace-based multi-tenancy, replacing user as tenancy axis (#840 ) Introduces `workspace` as the isolation boundary for config, flows, library, and knowledge data. Removes `user` as a schema-level field throughout the code, API specs, and tests; workspace provides the same separation more cleanly at the trusted flow.workspace layer rather than through client-supplied message fields. Design ------ - IAM tech spec (docs/tech-specs/iam.md) documents current state, proposed auth/access model, and migration direction. - Data ownership model (docs/tech-specs/data-ownership-model.md) captures the workspace/collection/flow hierarchy. Schema + messaging ------------------ - Drop `user` field from AgentRequest/Step, GraphRagQuery, DocumentRagQuery, Triples/Graph/Document/Row EmbeddingsRequest, Sparql/Rows/Structured QueryRequest, ToolServiceRequest. - Keep collection/workspace routing via flow.workspace at the service layer. - Translators updated to not serialise/deserialise user. API specs --------- - OpenAPI schemas and path examples cleaned of user fields. - Websocket async-api messages updated. - Removed the unused parameters/User.yaml. Services + base --------------- - Librarian, collection manager, knowledge, config: all operations scoped by workspace. Config client API takes workspace as first positional arg. - `flow.workspace` set at flow start time by the infrastructure; no longer pass-through from clients. - Tool service drops user-personalisation passthrough. CLI + SDK --------- - tg-init-workspace and workspace-aware import/export. - All tg-* commands drop user args; accept --workspace. - Python API/SDK (flow, socket_client, async_, explainability, library) drop user kwargs from every method signature. MCP server ---------- - All tool endpoints drop user parameters; socket_manager no longer keyed per user. Flow service ------------ - Closure-based topic cleanup on flow stop: only delete topics whose blueprint template was parameterised AND no remaining live flow (across all workspaces) still resolves to that topic. Three scopes fall out naturally from template analysis: {id} -> per-flow, deleted on stop * {blueprint} -> per-blueprint, kept while any flow of the same blueprint exists * {workspace} -> per-workspace, kept while any flow in the workspace exists * literal -> global, never deleted (e.g. tg.request.librarian) Fixes a bug where stopping a flow silently destroyed the global librarian exchange, wedging all library operations until manual restart. RabbitMQ backend ---------------- - heartbeat=60, blocked_connection_timeout=300. Catches silently dead connections (broker restart, orphaned channels, network partitions) within ~2 heartbeat windows, so the consumer reconnects and re-binds its queue rather than sitting forever on a zombie connection. Tests ----- - Full test refresh: unit, integration, contract, provenance. - Dropped user-field assertions and constructor kwargs across ~100 test files. - Renamed user-collection isolation tests to workspace-collection.	2026-04-21 23:23:01 +01:00
cybermaggedon	9332089b3d	Setup for 2.4 release branch (#839 )	2026-04-21 21:36:46 +01:00
cybermaggedon	424ace44c4	Fix library queue lifecycle (#838 ) * Don't delete the global queues (librarian) when flows are deleted * 60s heartbeat timeouts on RabbitMQ	2026-04-21 21:30:19 +01:00
Het Patel	0ef49ab6ae	feat: standardize LLM rate-limiting and exception handling (#835 ) - HTTP 429 translates to TooManyRequests (retryable) - HTTP 503 translates to LlmError	2026-04-21 16:15:11 +01:00
cybermaggedon	e7efb673ef	Structure the tech specs directory (#836 ) Tech spec some subdirectories for different languages	2026-04-21 16:06:41 +01:00
cybermaggedon	48da6c5f8b	Test fixes for Kafka (#834 ) Test fixes for Kafka: - Consumer: isinstance(max_concurrency, int) instead of is not None: MagicMock won't pass the check - Kafka test: updated expected topic name to tg.request.prompt.default (colons replaced with dots)	2026-04-18 23:06:01 +01:00
cybermaggedon	b615150624	fix: resolve multiple Kafka backend issues blocking message delivery (#833 ) - Producer: add delivery callback to surface send errors instead of silently swallowing them, and raise message.max.bytes to 10MB - Consumer: raise fetch.message.max.bytes to 10MB to match producer, tighten session/heartbeat timeouts for fast group joins, and add partition assign/revoke logging for diagnostics - Topic naming: replace colons with dots in topic names since Kafka rejects colons (flow:tg:document-load:default was producing invalid topic name tg.flow.document-load:default) - Response consumers: use auto.offset.reset=earliest instead of latest so responses published before partition assignment aren't lost - UNKNOWN_TOPIC_OR_PART: treat as timeout instead of fatal error so consumers wait for auto-created topics instead of crashing - Concurrency: cap consumer workers to 1 for Kafka since topics have 1 partition — extra consumers trigger rebalance storms that block all message delivery	2026-04-18 22:42:24 +01:00
Het Patel	adea976203	feat: implement retry logic and exponential backoff for S3 operations (#829 ) * feat: implement retry logic and exponential backoff for S3 operations * test: fix librarian mocks after BlobStore async conversion	2026-04-18 12:05:37 +01:00
cybermaggedon	cce3acd84f	fix: repair deferred imports to preserve module-level names for test patching (#831 ) A previous commit moved SDK imports into __init__/methods and stashed them on self, which broke @patch targets in 24 unit tests. This fixes the approach: chunker and pdf_decoder use module-level sentinels with global/if-None guards so imports are still deferred but patchable. Google AI Studio reverts to standard module-level imports since the module is only loaded when communicating with Gemini. Keeps lazy loading on other imports.	2026-04-18 11:43:21 +01:00
cybermaggedon	d7745baab4	Add Kafka pub/sub backend (#830 ) Third backend alongside Pulsar and RabbitMQ. Topics map 1:1 to Kafka topics, subscriptions map to consumer groups. Response/notify uses unique consumer groups with correlation ID filtering. Topic lifecycle managed via AdminClient with class-based retention. Initial code drop: Needs major integration testing	2026-04-18 11:18:34 +01:00
Syed Ishmum Ahnaf	81cde7baf9	fix for issue #821 : deferring optional SDK imports to runtime for provider modules (#828 )	2026-04-18 11:16:37 +01:00
cybermaggedon	3505bfdd25	refactor: use one fanout exchange per topic instead of shared topic exchange (#827 ) The RabbitMQ backend used a single topic exchange per topicspace with routing keys to differentiate logical topics. This meant the flow service had to manually create named queues for every processor-topic pair, including producer-side topics — creating phantom queues that accumulated unread message copies indefinitely. Replace with one fanout exchange per logical topic. Consumers now declare and bind their own queues on connect. The flow service manages topic lifecycle (create/delete exchanges) rather than queue lifecycle, and only collects unique topic identifiers instead of per-processor (topic, subscription) pairs. Backend API: create_queue/delete_queue/ensure_queue replaced with create_topic/delete_topic/ensure_topic (subscription parameter removed).	2026-04-17 18:01:35 +01:00
Het Patel	391b9076f3	feat: add domain and range validation to triple extraction in extract.py (#825 )	2026-04-17 11:29:57 +01:00
cybermaggedon	9f84891fcc	Flow service lifecycle management (#822 ) feat: separate flow service from config service with explicit queue lifecycle management The flow service is now an independent service that owns the lifecycle of flow and blueprint queues. System services own their own queues. Consumers never create queues. Flow service separation: - New service at trustgraph-flow/trustgraph/flow/service/ - Uses async ConfigClient (RequestResponse pattern) to talk to config service - Config service stripped of all flow handling Queue lifecycle management: - PubSubBackend protocol gains create_queue, delete_queue, queue_exists, ensure_queue — all async - RabbitMQ: implements via pika with asyncio.to_thread internally - Pulsar: stubs for future admin REST API implementation - Consumer _connect() no longer creates queues (passive=True for named queues) - System services call ensure_queue on startup - Flow service creates queues on flow start, deletes on flow stop - Flow service ensures queues for pre-existing flows on startup Two-phase flow stop: - Phase 1: set flow status to "stopping", delete processor config entries - Phase 2: retry queue deletion, then delete flow record Config restructure: - active-flow config replaced with processor:{name} types - Each processor has its own config type, each flow variant is a key - Flow start/stop use batch put/delete — single config push per operation - FlowProcessor subscribes to its own type only Blueprint format: - Processor entries split into topics and parameters dicts - Flow interfaces use {"flow": "topic"} instead of bare strings - Specs (ConsumerSpec, ProducerSpec, etc.) read from definition["topics"] Tests updated	2026-04-16 17:19:39 +01:00
Lennard Geißler	645b6a66fd	fix: replace deprecated asyncio.iscoroutinefunction with inspect.iscoroutinefunction (#819 ) asyncio.iscoroutinefunction is deprecated since Python 3.14 and slated for removal in 3.16. The inspect equivalent has an identical signature and return semantics. Replaces 8 call sites across 3 modules to silence DeprecationWarnings reported in #818.	2026-04-16 10:57:39 +01:00
Trevin Chow	ef8bb3aed4	fix: replace deprecated datetime.utcnow() with timezone-aware datetime.now(timezone.utc) (#816 ) Python 3.14 deprecates datetime.utcnow(). Replace all 9 occurrences with datetime.now(timezone.utc) and normalize the output to preserve the existing ISO-8601 "Z"-suffixed format so downstream parsers are unaffected. Fixes #814	2026-04-16 10:16:11 +01:00
cybermaggedon	95e4839da7	Fix docstring breakage in type hint change (#817 )	2026-04-16 10:10:58 +01:00
RaccoonLabs	706e62b7c2	feat: add type hints to all public functions in trustgraph/base (#803 ) feat: add type hints to all public functions in trustgraph/base Add type annotations to 23 modules covering: - Metrics classes (ConsumerMetrics, ProducerMetrics, etc.) - Spec classes (ConsumerSpec, ProducerSpec, SubscriberSpec, etc.) - Service classes with add_args() and run() methods - Utility functions (logging, pubsub, clients) - AsyncProcessor methods All 93 public functions now fully typed. Refs #785 * refactor: deduplicate imports and move __future__ after docstrings Addresses review feedback on PR #803: - Remove duplicate 'from argparse import ArgumentParser' across 12 files - Move 'from __future__ import annotations' to line 1 in all files - Clean up excessive blank lines	2026-04-16 10:08:19 +01:00
cybermaggedon	22096e07e2	Fix tests broken by the recent RabbitMQ/Cassandra async fixes (#815 ) - Fix invalid key in config causing rogue warning - Fix asyncio test tags	2026-04-16 10:00:18 +01:00
Alex Jenkins	fdb52a6bfc	Add docstrings to public classes (#812 ) Add class-level docstrings to five public classes in trustgraph-base: Flow, LlmService, ConsumerMetrics, ToolClient, and TriplesStoreService. Each docstring summarises the class's role in the system to aid discoverability for new contributors. Signed-off-by: Jenkins, Kenneth Alexander <kjenkins60@gatech.edu>	2026-04-16 09:07:08 +01:00
cybermaggedon	2f64ffc99d	Add missing PyYAML package (#811 )	2026-04-15 15:37:46 +01:00
cybermaggedon	2bf4af294e	Better proc group logging and concurrency (#810 ) - Silence pika, cassandra etc. logging at INFO (too much chatter) - Add per processor log tags so that logs can be understood in processor group. - Deal with RabbitMQ lag weirdness - Added more processor group examples	2026-04-15 14:52:01 +01:00
cybermaggedon	ce3c8b421b	Remove Pulsar healthcheck from tg-verify-system-health (#809 ) The Pulsar healthcheck was removed from tg-verify-system-health and re-added accidentally in a PR. This removes it again.	2026-04-14 16:12:24 +01:00
cybermaggedon	f11c0ad0cb	Processor group implementation: dev wrapper (#808 ) Processor group implementation: A wrapper to launch multiple processors in a single processor - trustgraph-base/trustgraph/base/processor_group.py — group runner module. run_group(config) is the async body; run() is the endpoint. Loads JSON or YAML config, validates that every entry has a unique params.id, instantiates each class via importlib, shares one TaskGroup, mirrors AsyncProcessor.launch's retry loop and Prometheus startup. - trustgraph-base/pyproject.toml — added [project.scripts] block with processor-group = "trustgraph.base.processor_group:run". Key behaviours: - Unique id enforced up front — missing or duplicate params.id fails fast with a clear error, preventing the Prometheus Info label collision we flagged. - No registry — dotted class path is the identifier; any AsyncProcessor descendant importable at runtime is packable. - YAML import is lazy — only pulled in if the config file ends in .yaml/.yml, so JSON-only users don't need PyYAML installed. - Single Prometheus server — start_http_server runs once at startup, before the retry loop, matching launch()'s pattern. - Retry loop — same shape as AsyncProcessor.launch: catches ExceptionGroup from TaskGroup, logs, sleeps 4s, retries. Fail-group semantics (one processor dying tears down the group) — simple and surfaces bugs, as discussed. Example config: processors: - class: trustgraph.extract.kg.definitions.extract.Processor params: id: kg-extract-definitions - class: trustgraph.chunking.recursive.Processor params: id: chunker-recursive Run with processor-group -c group.yaml.	2026-04-14 15:19:04 +01:00
Alex Jenkins	8954fa3ad7	Feat: TrustGraph i18n & Documentation Translation Updates (#781 ) Native CLI i18n: The TrustGraph CLI has built-in translation support that dynamically loads language strings. You can test and use different languages by simply passing the --lang flag (e.g., --lang es for Spanish, --lang ru for Russian) or by configuring your environment's LANG variable. Automated Docs Translations: This PR introduces autonomously translated Markdown documentation into several target languages, including Spanish, Swahili, Portuguese, Turkish, Hindi, Hebrew, Arabic, Simplified Chinese, and Russian.	2026-04-14 12:08:32 +01:00
cybermaggedon	f976f1b6fe	Fix test suite Prometheus registry pollution and remove default (#806 ) Fix test suite Prometheus registry pollution and remove default coverage The test_metrics.py fixture (added in `4e63dbda`) deleted class-level metric singleton attributes without restoring them. This stripped the hasattr() guards that prevent duplicate Prometheus registration, so every subsequent Processor construction in the test run raised ValueError: Duplicated timeseries. Fix by saving and restoring the original attributes around each test. Remove --cov=trustgraph from pytest.ini defaults — the coverage import hooks interact badly with the trustgraph namespace package, causing duplicate class loading. Coverage can still be requested explicitly via the command line.	2026-04-14 11:07:23 +01:00
Zeel Desai	39dcd1d386	Add unit tests for base helper modules (#797 ) - add unit tests for base metrics, logging, spec, parameter_spec, and flow modules - add a lightweight test-only module loader so these tests can run without optional runtime dependencies - fix Parameter.start/stop to accept self	2026-04-14 10:58:15 +01:00

1 2 3 4 5 ...

1153 commits