trustgraph

mirror of https://github.com/trustgraph-ai/trustgraph.git synced 2026-06-14 09:15:13 +02:00

Author	SHA1	Message	Date
cybermaggedon	ae9936c9cc	feat: pluggable bootstrap framework with ordered initialisers (#847) A generic, long-running bootstrap processor that converges a deployment to its configured initial state and then idles. Replaces the previous one-shot `tg-init-trustgraph` container model and provides an extension point for enterprise / third-party initialisers. See docs/tech-specs/bootstrap.md for the full design. Bootstrapper ------------ A single AsyncProcessor (trustgraph.bootstrap.bootstrapper.Processor) that: * Reads a list of initialiser specifications (class, name, flag, params) from either a direct `initialisers` parameter (processor-group embedding) or a YAML/JSON file (`-c`, CLI). * On each wake, runs a cheap service-gate (config-svc + flow-svc round-trips), then iterates the initialiser list, running each whose configured flag differs from the one stored in __system__/init-state/<name>. * Stores per-initialiser completion state in the reserved __system__ workspace. * Adapts cadence: ~5s on gate failure, ~15s while converging, ~300s in steady state. * Isolates failures: one initialiser's exception does not block others in the same cycle; the failed one retries on the next wake. Initialiser contract -------------------- * Subclass trustgraph.bootstrap.base.Initialiser. * Implement async run(ctx, old_flag, new_flag). * Opt out of the service gate with class attr wait_for_services=False (only used by PulsarTopology, since config-svc cannot come up until Pulsar namespaces exist). * ctx carries short-lived config and flow-svc clients plus a scoped logger. Core initialisers (trustgraph.bootstrap.initialisers.) ------------------------------------------------------- * PulsarTopology: creates Pulsar tenant + namespaces (pre-gate, blocking HTTP offloaded to executor). * TemplateSeed: seeds __template__ from an external JSON file; re-run is upsert-missing by default, overwrite-all opt-in. 
* WorkspaceInit: populates a named workspace from either the full contents of __template__ or a seed file; raises cleanly if the template isn't seeded yet so the bootstrapper retries on the next cycle. * DefaultFlowStart: starts a specific flow in a workspace; no-ops if the flow is already running. Enterprise or third-party initialisers plug in via fully-qualified dotted class paths in the bootstrapper's configuration; no core code change is required. Config service -------------- * push(): filters out reserved workspaces (ids starting with "_") from the change notifications. Stored config is preserved; only the broadcast is suppressed, so bootstrap / template state lives in config-svc without live processors ever reacting to it. Config client ------------- * ConfigClient.get_all(workspace): wraps the existing `config` operation to return {type: {key: value}} for a workspace. WorkspaceInit uses it to copy __template__ without needing a hardcoded types list. pyproject.toml -------------- * Adds a `bootstrap` console script pointing at the new Processor. * Removes tg-init-trustgraph, superseded by the bootstrap processor.	2026-04-22 18:03:46 +01:00
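The flag-comparison cycle described in the bootstrap commit can be sketched roughly as follows. This is an illustration only, not the project's code: `FakeInitialiser`, `run_cycle`, and the in-memory `state` dict are hypothetical stand-ins for real Initialiser subclasses and the `__system__/init-state/<name>` store, and the service gate and cadence logic are left out.

```python
import asyncio

class FakeInitialiser:
    """Hypothetical stand-in for an Initialiser subclass."""
    def __init__(self, fail=False):
        self.fail = fail
        self.calls = []

    async def run(self, ctx, old_flag, new_flag):
        self.calls.append((old_flag, new_flag))
        if self.fail:
            raise RuntimeError("simulated failure")

async def run_cycle(specs, state, ctx=None):
    """Run each initialiser whose configured flag differs from the stored one.

    specs: list of (name, flag, initialiser); state: dict of name -> stored flag.
    Returns the names that failed so the caller can retry them on the next wake.
    """
    failed = []
    for name, flag, init in specs:
        old = state.get(name)
        if old == flag:
            continue  # already converged for this flag
        try:
            await init.run(ctx, old, flag)
            state[name] = flag  # record completion only on success
        except Exception:
            failed.append(name)  # isolate the failure, keep going
    return failed

def demo():
    ok, bad = FakeInitialiser(), FakeInitialiser(fail=True)
    specs = [("bad", "v1", bad), ("ok", "v1", ok)]
    state = {}
    first = asyncio.run(run_cycle(specs, state))
    second = asyncio.run(run_cycle(specs, state))
    return first, second, state, len(ok.calls), len(bad.calls)
```

In this sketch the failing initialiser runs again on the second cycle while the successful one is skipped, which is the retry and isolation behaviour the commit message describes.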
cybermaggedon	31027e30ae	Better reporting from api-gateway's metric endpoint (#845 ) - Connect failures (DNS, connect refused, server disconnect) now return 502 Bad Gateway with a body that names the upstream URL. - Other exceptions still return 500 but now include the exception message in the body and log with exc_info=True so the stack trace lands in the gateway log. - Also fixed the logging.error → logger.error inconsistency in the same block (module had a named logger at the top that wasn't being used).	2026-04-22 16:16:57 +01:00
cybermaggedon	7521e152b9	fix: uuid-ify flow-svc ConfigClient subscription to avoid Pulsar ConsumerBusy on restart (#843 ) flow-svc's long-lived ConfigClient was constructed with subscription=f"{self.id}--config--{id}", where id=params.get("id") is the deterministic processor id. On Pulsar the config-response topic maps to class=response -> Exclusive subscription; when the supervisor restarts flow-svc within Pulsar's inactive-subscription TTL (minutes), the previous process's ghost consumer still holds the subscription and the new process's re-subscribe is rejected with ConsumerBusy, crash-looping flow-svc. This is a v2.2 -> v2.3 regression in practice, but not a change in subscription semantics: the Exclusive mapping for response/notify is identical between releases. The regression is that PR #822 split flow-svc out of config-svc and added this new, long-lived request/response call site — the new site simply didn't follow the uuid convention used by the equivalent sites elsewhere (gateway/config/receiver.py, AsyncProcessor._create_config_client). Fix: generate a fresh uuid per process instance for the subscription suffix, matching that convention.	2026-04-22 15:17:56 +01:00
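A minimal sketch of the uuid convention named in the fix above. The helper name `subscription_name` is hypothetical; only the `{id}--config--{suffix}` shape comes from the commit message.

```python
import uuid

def subscription_name(processor_id: str) -> str:
    """Build a subscription name that is unique per process instance.

    A fresh uuid suffix means a restarted process never collides with a
    ghost consumer still holding the previous Exclusive subscription.
    """
    return f"{processor_id}--config--{uuid.uuid4()}"
```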
cybermaggedon	6cbaf88fc6	fix: ontology extractor reads .objects, not .object, from PromptResult (#842 ) The extract-with-ontologies prompt is a JSONL prompt, which means the prompt service returns a PromptResult with response_type="jsonl" and the parsed items in `.objects` (plural). The ontology extractor was reading `.object` (singular) — the field used for response_type="json" — which is always None for JSONL prompts. Effect: the parser received None on every chunk, hit its "Unexpected response type: <class 'NoneType'>" branch, returned no ExtractionResult, and extract_with_simplified_format returned []. Every extraction silently produced zero triples. Graphs populated only with the seed ontology schema (TBox) and document/chunk provenance — no instance triples at all. The e2e test threshold of >=100 edges per collection was met by schema + provenance alone, so the failure mode was invisible until RAG queries couldn't find any content. Regression introduced in v2.3 with the token-usage work (commit `56d700f3` / `14e49d83`) when PromptClient.prompt() began returning a PromptResult wrapper instead of the raw text/dict/list. All other call sites of .prompt() across retrieval/, agent/, orchestrator/ were already reading the correct field for their prompt's response_type; ontology extraction was the sole stranded caller. Also adds tests/unit/test_extract/test_ontology/test_extract_with_simplified_format.py covering: - happy path: populated .objects produces non-empty triples - production failure shape: .objects=None returns [] cleanly - empty .objects returns [] without raising - defensive: do not silently fall back to .object for a JSONL prompt	2026-04-22 12:10:42 +01:00
cybermaggedon	8be128aa59	fix: api-gateway evicts cached dispatchers when a flow stops (#841 ) DispatcherManager caches one ServiceRequestor per (flow_id, kind) in self.dispatchers, lazily created on first use. stop_flow dropped the flow from self.flows but never touched the cached dispatchers, so their publisher/subscriber connections persisted — bound to the per-flow exchanges that flow-svc tears down when the flow stops. If the same flow id was later re-created, flow-svc re-declared fresh per-flow exchanges, but the gateway's cached dispatcher still held a subscription queue bound to the now-gone old response exchange. Requests went out fine (publishers target exchanges by name and the new exchange has the right name), but responses landed on an exchange with no binding to the dispatcher's queue and were silently dropped. The calling CLI or websocket session hung waiting for a reply that would never arrive. Reproduction before fix: tg-start-flow -i test-flow-1 ... # any query on test-flow-1 works tg-stop-flow -i test-flow-1 tg-start-flow -i test-flow-1 ... tg-show-graph -f test-flow-1 -C <collection> # hangs Flows that were never stopped (e.g. "default" in a typical session) were unaffected — their cached dispatcher still pointed at live plumbing. That's why the bug appeared flow-name-specific at first glance; it's actually lifecycle-specific. Fix: in stop_flow, evict and cleanly stop() every cached dispatcher keyed on the stopped flow id. Next request after restart constructs a fresh dispatcher against the freshly-declared exchanges. Tuple shape check preserves global dispatchers, which use (None, kind) as their key and must survive flow churn. Uses pop(id, None) instead of del in case stop_flow is invoked defensively for a flow the gateway never saw.	2026-04-22 12:10:21 +01:00
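The eviction step in the fix above might look roughly like this. `DispatcherCache` and `FakeDispatcher` are hypothetical simplifications of DispatcherManager and ServiceRequestor; only the `(flow_id, kind)` and `(None, kind)` key shapes and the `pop(id, None)` detail come from the commit message.

```python
import asyncio

class FakeDispatcher:
    """Hypothetical stand-in for a cached ServiceRequestor."""
    def __init__(self):
        self.stopped = False

    async def stop(self):
        self.stopped = True

class DispatcherCache:
    def __init__(self):
        self.flows = {}
        self.dispatchers = {}  # (flow_id, kind) or (None, kind) -> dispatcher

    async def stop_flow(self, flow_id):
        # pop() rather than del: tolerate flows this gateway never saw
        self.flows.pop(flow_id, None)
        stale = [
            key for key in self.dispatchers
            if isinstance(key, tuple) and len(key) == 2
            and key[0] is not None and key[0] == flow_id
        ]
        for key in stale:
            dispatcher = self.dispatchers.pop(key)
            await dispatcher.stop()  # release publisher/subscriber cleanly
        return stale

def demo():
    cache = DispatcherCache()
    cache.flows["test-flow-1"] = {}
    victim = FakeDispatcher()
    cache.dispatchers[("test-flow-1", "triples")] = victim
    cache.dispatchers[("default", "triples")] = FakeDispatcher()
    cache.dispatchers[(None, "config")] = FakeDispatcher()  # global dispatcher
    evicted = asyncio.run(cache.stop_flow("test-flow-1"))
    again = asyncio.run(cache.stop_flow("test-flow-1"))
    return evicted, again, set(cache.dispatchers), victim.stopped
```

Global dispatchers and dispatchers for other flows survive, so the next request on a re-created flow builds a fresh dispatcher while everything else keeps its live connections.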
cybermaggedon	d35473f7f7	feat: workspace-based multi-tenancy, replacing user as tenancy axis (#840 ) Introduces `workspace` as the isolation boundary for config, flows, library, and knowledge data. Removes `user` as a schema-level field throughout the code, API specs, and tests; workspace provides the same separation more cleanly at the trusted flow.workspace layer rather than through client-supplied message fields. Design ------ - IAM tech spec (docs/tech-specs/iam.md) documents current state, proposed auth/access model, and migration direction. - Data ownership model (docs/tech-specs/data-ownership-model.md) captures the workspace/collection/flow hierarchy. Schema + messaging ------------------ - Drop `user` field from AgentRequest/Step, GraphRagQuery, DocumentRagQuery, Triples/Graph/Document/Row EmbeddingsRequest, Sparql/Rows/Structured QueryRequest, ToolServiceRequest. - Keep collection/workspace routing via flow.workspace at the service layer. - Translators updated to not serialise/deserialise user. API specs --------- - OpenAPI schemas and path examples cleaned of user fields. - Websocket async-api messages updated. - Removed the unused parameters/User.yaml. Services + base --------------- - Librarian, collection manager, knowledge, config: all operations scoped by workspace. Config client API takes workspace as first positional arg. - `flow.workspace` set at flow start time by the infrastructure; no longer pass-through from clients. - Tool service drops user-personalisation passthrough. CLI + SDK --------- - tg-init-workspace and workspace-aware import/export. - All tg-* commands drop user args; accept --workspace. - Python API/SDK (flow, socket_client, async_, explainability, library) drop user kwargs from every method signature. MCP server ---------- - All tool endpoints drop user parameters; socket_manager no longer keyed per user. 
Flow service ------------ - Closure-based topic cleanup on flow stop: only delete topics whose blueprint template was parameterised AND no remaining live flow (across all workspaces) still resolves to that topic. Three scopes fall out naturally from template analysis: {id} -> per-flow, deleted on stop * {blueprint} -> per-blueprint, kept while any flow of the same blueprint exists * {workspace} -> per-workspace, kept while any flow in the workspace exists * literal -> global, never deleted (e.g. tg.request.librarian) Fixes a bug where stopping a flow silently destroyed the global librarian exchange, wedging all library operations until manual restart. RabbitMQ backend ---------------- - heartbeat=60, blocked_connection_timeout=300. Catches silently dead connections (broker restart, orphaned channels, network partitions) within ~2 heartbeat windows, so the consumer reconnects and re-binds its queue rather than sitting forever on a zombie connection. Tests ----- - Full test refresh: unit, integration, contract, provenance. - Dropped user-field assertions and constructor kwargs across ~100 test files. - Renamed user-collection isolation tests to workspace-collection.	2026-04-21 23:23:01 +01:00
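The closure-based cleanup rule in the flow service section can be illustrated with a short sketch. The topic templates and flow dictionaries below are hypothetical, and the sketch assumes every flow shares one template list, which the real blueprints need not do.

```python
from string import Formatter

def is_parameterised(template: str) -> bool:
    """True if the template contains at least one {placeholder}."""
    return any(field is not None for _, field, _, _ in Formatter().parse(template))

def topics_to_delete(templates, stopped, remaining):
    """Return topics that are safe to delete when `stopped` goes away.

    A topic is deleted only if its template was parameterised AND no
    remaining live flow (in any workspace) still resolves to it.
    """
    live = {t.format(**flow) for flow in remaining for t in templates}
    doomed = set()
    for template in templates:
        if not is_parameterised(template):
            continue  # literal topic: global, never deleted
        topic = template.format(**stopped)
        if topic not in live:
            doomed.add(topic)
    return doomed

TEMPLATES = [
    "tg.flow.{id}.in",           # per-flow
    "tg.bp.{blueprint}.shared",  # per-blueprint
    "tg.ws.{workspace}.events",  # per-workspace
    "tg.request.librarian",      # literal, global
]
```

With these assumed templates, stopping flow `f1` while another flow of the same blueprint lives in a different workspace removes only the per-flow and per-workspace topics; the librarian topic is never a candidate.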
cybermaggedon	9332089b3d	Setup for 2.4 release branch (#839)	2026-04-21 21:36:46 +01:00
cybermaggedon	424ace44c4	Fix library queue lifecycle (#838) * Don't delete the global queues (librarian) when flows are deleted * 60s heartbeat timeouts on RabbitMQ	2026-04-21 21:30:19 +01:00
Het Patel	0ef49ab6ae	feat: standardize LLM rate-limiting and exception handling (#835) - HTTP 429 translates to TooManyRequests (retryable) - HTTP 503 translates to LlmError	2026-04-21 16:15:11 +01:00
Het Patel	adea976203	feat: implement retry logic and exponential backoff for S3 operations (#829) * feat: implement retry logic and exponential backoff for S3 operations * test: fix librarian mocks after BlobStore async conversion	2026-04-18 12:05:37 +01:00
cybermaggedon	cce3acd84f	fix: repair deferred imports to preserve module-level names for test patching (#831 ) A previous commit moved SDK imports into __init__/methods and stashed them on self, which broke @patch targets in 24 unit tests. This fixes the approach: chunker and pdf_decoder use module-level sentinels with global/if-None guards so imports are still deferred but patchable. Google AI Studio reverts to standard module-level imports since the module is only loaded when communicating with Gemini. Keeps lazy loading on other imports.	2026-04-18 11:43:21 +01:00
Syed Ishmum Ahnaf	81cde7baf9	fix for issue #821: deferring optional SDK imports to runtime for provider modules (#828)	2026-04-18 11:16:37 +01:00
cybermaggedon	3505bfdd25	refactor: use one fanout exchange per topic instead of shared topic exchange (#827 ) The RabbitMQ backend used a single topic exchange per topicspace with routing keys to differentiate logical topics. This meant the flow service had to manually create named queues for every processor-topic pair, including producer-side topics — creating phantom queues that accumulated unread message copies indefinitely. Replace with one fanout exchange per logical topic. Consumers now declare and bind their own queues on connect. The flow service manages topic lifecycle (create/delete exchanges) rather than queue lifecycle, and only collects unique topic identifiers instead of per-processor (topic, subscription) pairs. Backend API: create_queue/delete_queue/ensure_queue replaced with create_topic/delete_topic/ensure_topic (subscription parameter removed).	2026-04-17 18:01:35 +01:00
Het Patel	391b9076f3	feat: add domain and range validation to triple extraction in extract.py (#825)	2026-04-17 11:29:57 +01:00
cybermaggedon	9f84891fcc	Flow service lifecycle management (#822 ) feat: separate flow service from config service with explicit queue lifecycle management The flow service is now an independent service that owns the lifecycle of flow and blueprint queues. System services own their own queues. Consumers never create queues. Flow service separation: - New service at trustgraph-flow/trustgraph/flow/service/ - Uses async ConfigClient (RequestResponse pattern) to talk to config service - Config service stripped of all flow handling Queue lifecycle management: - PubSubBackend protocol gains create_queue, delete_queue, queue_exists, ensure_queue — all async - RabbitMQ: implements via pika with asyncio.to_thread internally - Pulsar: stubs for future admin REST API implementation - Consumer _connect() no longer creates queues (passive=True for named queues) - System services call ensure_queue on startup - Flow service creates queues on flow start, deletes on flow stop - Flow service ensures queues for pre-existing flows on startup Two-phase flow stop: - Phase 1: set flow status to "stopping", delete processor config entries - Phase 2: retry queue deletion, then delete flow record Config restructure: - active-flow config replaced with processor:{name} types - Each processor has its own config type, each flow variant is a key - Flow start/stop use batch put/delete — single config push per operation - FlowProcessor subscribes to its own type only Blueprint format: - Processor entries split into topics and parameters dicts - Flow interfaces use {"flow": "topic"} instead of bare strings - Specs (ConsumerSpec, ProducerSpec, etc.) read from definition["topics"] Tests updated	2026-04-16 17:19:39 +01:00
Lennard Geißler	645b6a66fd	fix: replace deprecated asyncio.iscoroutinefunction with inspect.iscoroutinefunction (#819 ) asyncio.iscoroutinefunction is deprecated since Python 3.14 and slated for removal in 3.16. The inspect equivalent has an identical signature and return semantics. Replaces 8 call sites across 3 modules to silence DeprecationWarnings reported in #818.	2026-04-16 10:57:39 +01:00
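For reference, the replacement call behaves as follows on plain and async functions. The function names here are illustrative only.

```python
import inspect

async def fetch():
    return 1

def plain():
    return 1

def is_async(fn) -> bool:
    """Drop-in check using inspect rather than the deprecated asyncio helper."""
    return inspect.iscoroutinefunction(fn)
```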
Trevin Chow	ef8bb3aed4	fix: replace deprecated datetime.utcnow() with timezone-aware datetime.now(timezone.utc) (#816) datetime.utcnow() has been deprecated since Python 3.12. Replace all 9 occurrences with datetime.now(timezone.utc) and normalize the output to preserve the existing ISO-8601 "Z"-suffixed format so downstream parsers are unaffected. Fixes #814	2026-04-16 10:16:11 +01:00
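One way to keep the "Z"-suffixed output with a timezone-aware value is sketched below. The helper name `utc_now_iso_z` is hypothetical, and the exact normalisation used in the commit may differ.

```python
from datetime import datetime, timezone

def utc_now_iso_z(now=None) -> str:
    """Return an ISO-8601 UTC timestamp ending in "Z".

    An aware datetime serialises with a "+00:00" offset, so the offset is
    rewritten to keep the format that naive utcnow() + "Z" produced.
    """
    now = now or datetime.now(timezone.utc)
    return now.isoformat().replace("+00:00", "Z")
```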
cybermaggedon	2bf4af294e	Better proc group logging and concurrency (#810 ) - Silence pika, cassandra etc. logging at INFO (too much chatter) - Add per processor log tags so that logs can be understood in processor group. - Deal with RabbitMQ lag weirdness - Added more processor group examples	2026-04-15 14:52:01 +01:00
cybermaggedon	f11c0ad0cb	Processor group implementation: dev wrapper (#808) Processor group implementation: a wrapper to launch multiple processors in a single process - trustgraph-base/trustgraph/base/processor_group.py: group runner module. run_group(config) is the async body; run() is the entry point. Loads JSON or YAML config, validates that every entry has a unique params.id, instantiates each class via importlib, shares one TaskGroup, mirrors AsyncProcessor.launch's retry loop and Prometheus startup. - trustgraph-base/pyproject.toml: added [project.scripts] block with processor-group = "trustgraph.base.processor_group:run". Key behaviours: - Unique id enforced up front: missing or duplicate params.id fails fast with a clear error, preventing the Prometheus Info label collision we flagged. - No registry: dotted class path is the identifier; any AsyncProcessor descendant importable at runtime is packable. - YAML import is lazy: only pulled in if the config file ends in .yaml/.yml, so JSON-only users don't need PyYAML installed. - Single Prometheus server: start_http_server runs once at startup, before the retry loop, matching launch()'s pattern. - Retry loop: same shape as AsyncProcessor.launch; it catches ExceptionGroup from TaskGroup, logs, sleeps 4s, retries. Fail-group semantics (one processor dying tears down the group) are simple and surface bugs, as discussed. Example config: processors: - class: trustgraph.extract.kg.definitions.extract.Processor params: id: kg-extract-definitions - class: trustgraph.chunking.recursive.Processor params: id: chunker-recursive Run with processor-group -c group.yaml.	2026-04-14 15:19:04 +01:00
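The up-front unique-id check described above can be sketched as a small validator. `validate_group` is a hypothetical name; the `processors`, `class`, and `params.id` keys follow the example config in the commit message.

```python
def validate_group(config: dict) -> list:
    """Fail fast if any processor entry lacks a params.id or reuses one."""
    seen = set()
    for entry in config.get("processors", []):
        proc_id = (entry.get("params") or {}).get("id")
        if not proc_id:
            raise ValueError(f"{entry.get('class')}: missing params.id")
        if proc_id in seen:
            raise ValueError(f"duplicate params.id: {proc_id}")
        seen.add(proc_id)
    return sorted(seen)
```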
Alex Jenkins	8954fa3ad7	Feat: TrustGraph i18n & Documentation Translation Updates (#781 ) Native CLI i18n: The TrustGraph CLI has built-in translation support that dynamically loads language strings. You can test and use different languages by simply passing the --lang flag (e.g., --lang es for Spanish, --lang ru for Russian) or by configuring your environment's LANG variable. Automated Docs Translations: This PR introduces autonomously translated Markdown documentation into several target languages, including Spanish, Swahili, Portuguese, Turkish, Hindi, Hebrew, Arabic, Simplified Chinese, and Russian.	2026-04-14 12:08:32 +01:00
cybermaggedon	d2751553a3	Add agent explainability instrumentation and unify envelope field naming (#795 ) Addresses recommendations from the UX developer's agent experience report. Adds provenance predicates, DAG structure changes, error resilience, and a published OWL ontology. Explainability additions: - Tool candidates: tg:toolCandidate on Analysis events lists the tools visible to the LLM for each iteration (names only, descriptions in config) - Termination reason: tg:terminationReason on Conclusion/Synthesis events (final-answer, plan-complete, subagents-complete) - Step counter: tg:stepNumber on iteration events - Pattern decision: new tg:PatternDecision entity in the DAG between session and first iteration, carrying tg:pattern and tg:taskType - Latency: tg:llmDurationMs on Analysis events, tg:toolDurationMs on Observation events - Token counts on events: tg:inToken/tg:outToken/tg:llmModel on Grounding, Focus, Synthesis, and Analysis events - Tool/parse errors: tg:toolError on Observation events with tg:Error mixin type. Parse failures return as error observations instead of crashing the agent, giving it a chance to retry. Envelope unification: - Rename chunk_type to message_type across AgentResponse schema, translator, SDK types, socket clients, CLI, and all tests. Agent and RAG services now both use message_type on the wire. Ontology: - specs/ontology/trustgraph.ttl — OWL vocabulary covering all 26 classes, 7 object properties, and 36+ datatype properties including new predicates. DAG structure tests: - tests/unit/test_provenance/test_dag_structure.py verifies the wasDerivedFrom chain for GraphRAG, DocumentRAG, and all three agent patterns (react, plan, supervisor) including the pattern-decision link.	2026-04-13 16:16:42 +01:00
cybermaggedon	14e49d83c7	Expose LLM token usage across all service layers (#782 ) Expose LLM token usage (in_token, out_token, model) across all service layers Propagate token counts from LLM services through the prompt, text-completion, graph-RAG, document-RAG, and agent orchestrator pipelines to the API gateway and Python SDK. All fields are Optional — None means "not available", distinguishing from a real zero count. Key changes: - Schema: Add in_token/out_token/model to TextCompletionResponse, PromptResponse, GraphRagResponse, DocumentRagResponse, AgentResponse - TextCompletionClient: New TextCompletionResult return type. Split into text_completion() (non-streaming) and text_completion_stream() (streaming with per-chunk handler callback) - PromptClient: New PromptResult with response_type (text/json/jsonl), typed fields (text/object/objects), and token usage. All callers updated. - RAG services: Accumulate token usage across all prompt calls (extract-concepts, edge-scoring, edge-reasoning, synthesis). Non-streaming path sends single combined response instead of chunk + end_of_session. - Agent orchestrator: UsageTracker accumulates tokens across meta-router, pattern prompt calls, and react reasoning. Attached to end_of_dialog. - Translators: Encode token fields when not None (is not None, not truthy) - Python SDK: RAG and text-completion methods return TextCompletionResult (non-streaming) or RAGChunk/AgentAnswer with token fields (streaming) - CLI: --show-usage flag on tg-invoke-llm, tg-invoke-prompt, tg-invoke-graph-rag, tg-invoke-document-rag, tg-invoke-agent	2026-04-13 14:38:34 +01:00
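The "is not None, not truthy" rule for translators matters because a real count of zero must still be encoded. A minimal sketch, with a hypothetical function name and assumed wire keys:

```python
def encode_usage(in_token=None, out_token=None, model=None) -> dict:
    """Encode only the usage fields that are actually available.

    None means "not available" and is omitted; 0 is a real count and is kept.
    """
    out = {}
    if in_token is not None:
        out["in_token"] = in_token
    if out_token is not None:
        out["out_token"] = out_token
    if model is not None:
        out["model"] = model
    return out
```

A truthiness check (`if in_token:`) would silently drop a genuine zero, which is the distinction the commit calls out.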
cybermaggedon	c23e28aa66	Fix Metadata/EntityEmbeddings schema migration tail and add regression tests (#777 ) The Metadata dataclass dropped its `metadata: list[Triple]` field and EntityEmbeddings/ChunkEmbeddings settled on a singular `vector: list[float]` field, but several call sites kept passing `Metadata(metadata=...)` and `EntityEmbeddings(vectors=...)`. The bugs were latent until a websocket client first hit `/api/v1/flow/default/import/entity-contexts`, at which point the dispatcher TypeError'd on construction. Production fixes (5 call sites on the same migration tail): * trustgraph-flow gateway dispatchers entity_contexts_import.py and graph_embeddings_import.py — drop the stale Metadata(metadata=...) kwarg; switch graph_embeddings_import to the singular `vector` wire key. * trustgraph-base messaging translators knowledge.py and document_loading.py — fix decode side to read the singular `"vector"` key, matching what their own encode sides have always written. * trustgraph-flow tables/knowledge.py — fix Cassandra row deserialiser to construct EntityEmbeddings(vector=...) instead of vectors=. * trustgraph-flow gateway core_import/core_export — switch the kg-core msgpack wire format to the singular `"v"`/`"vector"` key and drop the dead `m["m"]` envelope field that referenced the removed Metadata.metadata triples list (it was a guaranteed KeyError on the export side). Defense-in-depth regression coverage (32 new tests across 7 files): * tests/contract/test_schema_field_contracts.py — pin the field set of Metadata, EntityEmbeddings, ChunkEmbeddings, EntityContext so any future schema rename fails CI loudly with a clear diff. * tests/unit/test_translators/test_knowledge_translator_roundtrip.py and test_document_embeddings_translator_roundtrip.py - encode→decode round-trip the affected translators end to end, locking in the singular `"vector"` wire key. 
* tests/unit/test_gateway/test_entity_contexts_import_dispatcher.py and test_graph_embeddings_import_dispatcher.py — exercise the websocket dispatchers' receive() path with realistic payloads, the direct regression test for the original production crash. * tests/unit/test_gateway/test_core_import_export_roundtrip.py — pack/unpack the kg-core msgpack format through the real dispatcher classes (with KnowledgeRequestor mocked), including a full export→import round-trip. * tests/unit/test_tables/test_knowledge_table_store.py — exercise the Cassandra row → schema conversion via __new__ to bypass the live cluster connection. Also fixes an unrelated leaked-coroutine RuntimeWarning in test_gateway/test_service.py::test_run_method_calls_web_run_app: the mocked aiohttp.web.run_app now closes the coroutine that Api.run() hands it, mirroring what the real run_app would do, instead of leaving it for the GC to complain about.	2026-04-10 20:43:45 +01:00
cybermaggedon	0994d4b05f	Open 2.3 release branch (#775) * Update packages and CI for new release branch	2026-04-10 14:42:19 +01:00
cybermaggedon	feeb92b33f	Refactor: Derive consumer behaviour from queue class (#772 ) Derive consumer behaviour from queue class, remove consumer_type parameter The queue class prefix (flow, request, response, notify) now fully determines consumer behaviour in both RabbitMQ and Pulsar backends. Added 'notify' class for ephemeral broadcast (config push notifications). Response and notify classes always create per-subscriber auto-delete queues, eliminating orphaned queues that accumulated on service restarts. Change init-trustgraph to set up the 'notify' namespace in Pulsar instead of old hangover 'state'. Fixes 'stuck backlog' on RabbitMQ config notification queue.	2026-04-09 09:55:41 +01:00
cybermaggedon	e81418c58f	fix: preserve literal types in focus quoted triples and document tracing (#769 ) The triples client returns Uri/Literal (str subclasses), not Term objects. _quoted_triple() treated all values as IRIs, so literal objects like skos:definition values were mistyped in focus provenance events, and trace_source_documents could not match them in the store. Added to_term() to convert Uri/Literal back to Term, threaded a term_map from follow_edges_batch through get_subgraph/get_labelgraph into uri_map, and updated _quoted_triple to accept Term objects directly.	2026-04-08 13:37:02 +01:00
cybermaggedon	4b5bfacab1	Forward missing explain_triples through RAG clients and agent tool callback (#768 ) fix: forward explain_triples through RAG clients and agent tool callback - RAG clients and the KnowledgeQueryImpl tool callback were dropping explain_triples from explain events, losing provenance data (including focus edge selections) when graph-rag is invoked via the agent. Tests for provenance and explainability (56 new): - Client-level forwarding of explain_triples - Graph-RAG structural chain (question → grounding → exploration → focus → synthesis) - Graph-RAG integration with mocked subsidiary clients - Document-RAG integration (question → grounding → exploration → synthesis) - Agent-orchestrator all 3 patterns: react, plan-then-execute, supervisor	2026-04-08 11:41:17 +01:00
cybermaggedon	c20e6540ec	Subscriber resilience and RabbitMQ fixes (#765 ) Subscriber resilience: recreate consumer after connection failure - Move consumer creation from Subscriber.start() into the run() loop, matching the pattern used by Consumer. If the connection drops and the consumer is closed in the finally block, the loop now recreates it on the next iteration instead of spinning forever on a None consumer. Consumer thread safety: - Dedicated ThreadPoolExecutor per consumer so all pika operations (create, receive, acknowledge, negative_acknowledge) run on the same thread — pika BlockingConnection is not thread-safe - Applies to both Consumer and Subscriber classes Config handler type audit — fix four mismatched type registrations: - librarian: was ["librarian"] (non-existent type), now ["flow", "active-flow"] (matches config["flow"] that the handler reads) - cores/service: was ["kg-core"], now ["flow"] (reads config["flow"]) - metering/counter: was ["token-costs"], now ["token-cost"] (singular) - agent/mcp_tool: was ["mcp-tool"], now ["mcp"] (reads config["mcp"]) Update tests	2026-04-07 14:51:14 +01:00
cybermaggedon	ddd4bd7790	Deliver explainability triples inline in retrieval response stream (#763 ) Provenance triples are now included directly in explain messages from GraphRAG, DocumentRAG, and Agent services, eliminating the need for follow-up knowledge graph queries to retrieve explainability details. Each explain message in the response stream now carries: - explain_id: root URI for this provenance step (unchanged) - explain_graph: named graph where triples are stored (unchanged) - explain_triples: the actual provenance triples for this step (new) Changes across the stack: - Schema: added explain_triples field to GraphRagResponse, DocumentRagResponse, and AgentResponse - Services: all explain message call sites pass triples through (graph_rag, document_rag, agent react, agent orchestrator) - Translators: encode explain_triples via TripleTranslator for gateway wire format - Python SDK: ProvenanceEvent now includes parsed ExplainEntity and raw triples; expanded event_type detection - CLI: invoke_graph_rag, invoke_agent, invoke_document_rag use inline entity when available, fall back to graph query - Tech specs updated Additional explainability test	2026-04-07 12:19:05 +01:00
cybermaggedon	2f8d6a3ffb	Fix agent config handler registration, remove debug prints, disable RabbitMQ heartbeats (#764 ) - Fix agent react and orchestrator services appending bare methods to config_handlers instead of using register_config_handler() — caused 'method object is not subscriptable' on config notify - Add exc_info to config fetch retry logging for proper tracebacks - Remove debug print statements from collection management dispatcher and translator - Disable RabbitMQ heartbeats (heartbeat=0) to prevent broker closing idle producer connections that can't process heartbeat frames from BlockingConnection	2026-04-07 12:11:12 +01:00
cybermaggedon	4acd853023	Config push notify pattern: replace stateful pub/sub with signal + fetch (#760) Replace the config push mechanism that broadcast the full config blob on a 'state' class pub/sub queue with a lightweight notify signal containing only the version number and affected config types. Processors fetch the full config via request/response from the config service when notified. This eliminates the need for the pub/sub 'state' queue class and stateful pub/sub services entirely. The config push queue moves from 'state' to 'flow' class, carrying a simple transient signal rather than a retained message. This solves the RabbitMQ late-subscriber problem where restarting processes never received the current config because their fresh queue had no historical messages. Key changes: - ConfigPush schema: config dict replaced with types list - Subscribe-then-fetch startup with retry: processors subscribe to notify queue, fetch config via request/response, then process buffered notifies with version comparison to avoid race conditions - register_config_handler() accepts optional types parameter so handlers only fire when their config types change - Short-lived config request/response clients to avoid subscriber contention on non-persistent response topics - Config service passes affected types through put/delete/flow operations - Gateway ConfigReceiver rewritten with same notify pattern and retry loop Tests updated. New tests: - register_config_handler: without types, with types, multiple types, multiple handlers - on_config_notify: old/same version skipped, irrelevant types skipped (version still updated), relevant type triggers fetch, handler without types always called, mixed handler filtering, empty types invokes all, fetch failure handled gracefully - fetch_config: returns config+version, raises on error response, stops client even on exception - fetch_and_apply_config: applies to all handlers on startup, retries on failure	2026-04-06 16:57:27 +01:00
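The notify handling rules listed in the new tests can be sketched as below. `ConfigWatcher` is a hypothetical, synchronous simplification: the real handlers are async, and fetch-failure handling is omitted here.

```python
class ConfigWatcher:
    def __init__(self):
        self.version = 0
        self.handlers = []  # list of (handler, set of types or None)

    def register_config_handler(self, handler, types=None):
        self.handlers.append((handler, set(types) if types else None))

    def on_config_notify(self, version, types, fetch):
        """Return True if a fetch happened and handlers were invoked."""
        if version <= self.version:
            return False  # old or same version: skip
        self.version = version  # updated even if no handler is interested
        changed = set(types)
        wanted = [
            handler for handler, wants in self.handlers
            # no type filter, or empty notify types, or overlapping types
            if wants is None or not changed or (wants & changed)
        ]
        if not wanted:
            return False  # irrelevant types: no fetch needed
        config = fetch()
        for handler in wanted:
            handler(config, version)
        return True
```

The signal carries only a version and a list of types, so the full config is fetched once per relevant notify and not broadcast to every subscriber.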
V.Sreeram	d4723566cb	fix: prevent duplicate dispatcher creation race condition in invoke_global_service (#715 ) * fix: prevent duplicate dispatcher creation race condition in invoke_global_service Concurrent coroutines could all pass the `if key in self.dispatchers` check before any of them wrote the result back, because `await dispatcher.start()` yields to the event loop. This caused multiple Pulsar consumers to be created on the same shared subscription, distributing responses round-robin and dropping ~2/3 of them — manifesting as a permanent spinner in the Workbench UI. Apply a double-checked asyncio.Lock in both `invoke_global_service` and `invoke_flow_service` so only one dispatcher is ever created per service key. * test: add concurrent-dispatch tests for race condition fix Add asyncio.gather-based tests that verify invoke_global_service and invoke_flow_service create exactly one dispatcher under concurrent calls, preventing the duplicate Pulsar consumer bug.	2026-04-06 11:14:32 +01:00
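The double-checked `asyncio.Lock` fix described in the commit above can be illustrated with a minimal sketch. `DispatcherManager`, `get_dispatcher` and the `factory` callable are hypothetical names chosen for the example; only the `if key in self.dispatchers` check and the pattern itself come from the commit message.

```python
import asyncio


class DispatcherManager:
    """Hypothetical stand-in for a manager that caches one dispatcher per key."""

    def __init__(self, factory):
        self.factory = factory      # async callable: key -> started dispatcher
        self.dispatchers = {}
        self.lock = asyncio.Lock()

    async def get_dispatcher(self, key):
        # First check: fast path, no locking once the dispatcher exists.
        if key in self.dispatchers:
            return self.dispatchers[key]

        async with self.lock:
            # Second check: another coroutine may have created the
            # dispatcher while this one was waiting for the lock.
            if key not in self.dispatchers:
                self.dispatchers[key] = await self.factory(key)
            return self.dispatchers[key]


async def demo():
    created = []

    async def factory(key):
        # Yield to the event loop, as an awaited start() call would.
        await asyncio.sleep(0)
        created.append(key)
        return object()

    manager = DispatcherManager(factory)
    results = await asyncio.gather(
        *(manager.get_dispatcher("svc") for _ in range(10))
    )
    return len(created), len({id(r) for r in results})


creation_count, distinct_dispatchers = asyncio.run(demo())
```

Without the lock and second check, every coroutine that passes the first check before the factory returns would create its own dispatcher, which is the race the commit describes.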
Alex Jenkins	10a931f04c	Feat: Auto-pull missing Ollama models (#757 ) * fix deadlink in readme Signed-off-by: Jenkins, Kenneth Alexander <kjenkins60@gatech.edu> * feat: Auto-pull Ollama models Signed-off-by: Jenkins, Kenneth Alexander <kjenkins60@gatech.edu> * fix: Restore namespace __init__.py files for package resolution Signed-off-by: Jenkins, Kenneth Alexander <kjenkins60@gatech.edu> * fix CI Signed-off-by: Jenkins, Kenneth Alexander <kjenkins60@gatech.edu>	2026-04-06 11:10:53 +01:00
cybermaggedon	ee65d90fdd	SPARQL service supports batching/streaming (#755)	2026-04-02 17:54:07 +01:00
cybermaggedon	d9dc4cbab5	SPARQL query service (#754 ) SPARQL 1.1 query service wrapping pub/sub triples interface Add a backend-agnostic SPARQL query service that parses SPARQL queries using rdflib, decomposes them into triple pattern lookups via the existing TriplesClient pub/sub interface, and performs in-memory joins, filters, and projections. Includes: - SPARQL parser, algebra evaluator, expression evaluator, solution sequence operations (BGP, JOIN, OPTIONAL, UNION, FILTER, BIND, VALUES, GROUP BY, ORDER BY, LIMIT/OFFSET, DISTINCT, aggregates) - FlowProcessor service with TriplesClientSpec - Gateway dispatcher, request/response translators, API spec - Python SDK method (FlowInstance.sparql_query) - CLI command (tg-invoke-sparql-query) - Tech spec (docs/tech-specs/sparql-query.md) New unit tests for SPARQL query	2026-04-02 17:21:39 +01:00
cybermaggedon	24f0190ce7	RabbitMQ pub/sub backend with topic exchange architecture (#752 ) Adds a RabbitMQ backend as an alternative to Pulsar, selectable via PUBSUB_BACKEND=rabbitmq. Both backends implement the same PubSubBackend protocol — no application code changes needed to switch. RabbitMQ topology: - Single topic exchange per topicspace (e.g. 'tg') - Routing key derived from queue class and topic name - Shared consumers: named queue bound to exchange (competing, round-robin) - Exclusive consumers: anonymous auto-delete queue (broadcast, each gets every message). Used by Subscriber and config push consumer. - Thread-local producer connections (pika is not thread-safe) - Push-based consumption via basic_consume with process_data_events for heartbeat processing Consumer model changes: - Consumer class creates one backend consumer per concurrent task (required for pika thread safety, harmless for Pulsar) - Consumer class accepts consumer_type parameter - Subscriber passes consumer_type='exclusive' for broadcast semantics - Config push consumer uses consumer_type='exclusive' so every processor instance receives config updates - handle_one_from_queue receives consumer as parameter for correct per-connection ack/nack LibrarianClient: - New shared client class replacing duplicated librarian request-response code across 6+ services (chunking, decoders, RAG, etc.) 
- Uses stream-document instead of get-document-content for fetching document content in 1MB chunks (avoids broker message size limits) - Standalone object (self.librarian = LibrarianClient(...)) not a mixin - get-document-content marked deprecated in schema and OpenAPI spec Serialisation: - Extracted dataclass_to_dict/dict_to_dataclass to shared serialization.py (used by both Pulsar and RabbitMQ backends) Librarian queues: - Changed from flow class (persistent) back to request/response class now that stream-document eliminates large single messages - API upload chunk size reduced from 5MB to 3MB to stay under broker limits after base64 encoding Factory and CLI: - get_pubsub() handles 'rabbitmq' backend with RabbitMQ connection params - add_pubsub_args() includes RabbitMQ options (host, port, credentials) - add_pubsub_args(standalone=True) defaults to localhost for CLI tools - init_trustgraph skips Pulsar admin setup for non-Pulsar backends - tg-dump-queues and tg-monitor-prompts use backend abstraction - BaseClient and ConfigClient accept generic pubsub config	2026-04-02 12:47:16 +01:00
cybermaggedon	4fb0b4d8e8	Pub/sub abstraction: decouple from Pulsar (#751 ) Remove Pulsar-specific concepts from application code so that the pub/sub backend is swappable via configuration. Rename translators: - to_pulsar/from_pulsar → decode/encode across all translator classes, dispatch handlers, and tests (55+ files) - from_response_with_completion → encode_with_completion - Remove pulsar.schema.Record from translator base class Queue naming (CLASS:TOPICSPACE:TOPIC): - Replace topic() helper with queue() using new format: flow:tg:name, request:tg:name, response:tg:name, state:tg:name - Queue class implies persistence/TTL (no QoS in names) - Update Pulsar backend map_topic() to parse new format - Librarian queues use flow class (persistent, for chunking) - Config push uses state class (persistent, last-value) - Remove 15 dead topic imports from schema files - Update init_trustgraph.py namespace: config → state Confine Pulsar to pulsar_backend.py: - Delete legacy PulsarClient class from pubsub.py - Move add_args to add_pubsub_args() with standalone flag for CLI tools (defaults to localhost) - PulsarBackendConsumer.receive() catches _pulsar.Timeout, raises standard TimeoutError - Remove Pulsar imports from: async_processor, flow_processor, log_level, all 11 client files, 4 storage writers, gateway service, gateway config receiver - Remove log_level/LoggerLevel from client API - Rewrite tg-monitor-prompts to use backend abstraction - Update tg-dump-queues to use add_pubsub_args Also: pubsub-abstraction.md tech spec covering problem statement, design goals, as-is requirements, candidate broker assessment, approach, and implementation order.	2026-04-01 20:16:53 +01:00
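The CLASS:TOPICSPACE:TOPIC queue naming introduced in the commit above could be built and parsed along these lines. The helper names `format_queue` and `parse_queue` are hypothetical and are not the project's `queue()` or `map_topic()` functions; the four class names and the `tg` topicspace are taken from the commit message.

```python
from typing import NamedTuple

# Queue classes named in the commit message.
QUEUE_CLASSES = {"flow", "request", "response", "state"}


class QueueName(NamedTuple):
    queue_class: str
    topicspace: str
    topic: str


def format_queue(queue_class, topic, topicspace="tg"):
    """Build a CLASS:TOPICSPACE:TOPIC queue name (illustrative helper)."""
    if queue_class not in QUEUE_CLASSES:
        raise ValueError(f"unknown queue class: {queue_class!r}")
    return f"{queue_class}:{topicspace}:{topic}"


def parse_queue(name):
    """Split a queue name back into its three parts (illustrative helper)."""
    # maxsplit=2 keeps any further colons inside the topic part.
    parts = name.split(":", 2)
    if len(parts) != 3 or parts[0] not in QUEUE_CLASSES:
        raise ValueError(f"not a CLASS:TOPICSPACE:TOPIC name: {name!r}")
    return QueueName(*parts)
```

Because the class is the first field, a backend can decide persistence and TTL from the name alone, which is the point the commit makes about keeping QoS out of queue names.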
cybermaggedon	2bcf375103	Wire message_id on all answer chunks, fix DAG structure (#748 ) Wire message_id on all answer chunks, fix DAG structure message_id: - Add message_id to AgentAnswer dataclass and propagate in socket_client._parse_chunk - Wire message_id into answer callbacks and send_final_response for all three patterns (react, plan-then-execute, supervisor) - Supervisor decomposition thought and synthesis answer chunks now carry message_id DAG structure fixes: - Observation derives from sub-trace Synthesis (not Analysis) when a tool produces a sub-trace; tracked via last_sub_explain_uri on context - Subagent sessions derive from parent's Decomposition via parent_uri on agent_session_triples - Findings derive from subagent Conclusions (not Decomposition) - Synthesis derives from all findings (multiple wasDerivedFrom) ensuring single terminal node - agent_synthesis_triples accepts list of parent URIs - Explainability chain walker follows from sub-trace terminal to find downstream Observation Emit Analysis before tool execution: - Add on_action callback to react() in agent_manager.py, called after reason() but before tool invocation - Orchestrator and old service emit Analysis+ToolUse triples via on_action so sub-traces appear after their parent in the stream	2026-04-01 13:27:41 +01:00
cybermaggedon	153ae9ad30	Split Analysis into Analysis+ToolUse and Observation, add message_id (#747 ) Refactor agent provenance so that the decision (thought + tool selection) and the result (observation) are separate DAG entities: Question ← Analysis+ToolUse ← Observation ← ... ← Conclusion Analysis gains tg:ToolUse as a mixin RDF type and is emitted before tool execution via an on_action callback in react(). This ensures sub-traces (e.g. GraphRAG) appear after their parent Analysis in the streaming event order. Observation becomes a standalone prov:Entity with tg:Observation type, emitted after tool execution. The linear DAG chain runs through Observation — subsequent iterations and the Conclusion derive from it, not from the Analysis. message_id is populated on streaming AgentResponse for thought and observation chunks, using the provenance URI of the entity being built. This lets clients group streamed chunks by entity. Wire changes: - provenance/agent.py: Add ToolUse type, new agent_observation_triples(), remove observation from iteration - agent_manager.py: Add on_action callback between reason() and tool execution - orchestrator/pattern_base.py: Split emit, wire message_id, chain through observation URIs - orchestrator/react_pattern.py: Emit Analysis via on_action before tool runs - agent/react/service.py: Same for non-orchestrator path - api/explainability.py: New Observation class, updated dispatch and chain walker - api/types.py: Add message_id to AgentThought/AgentObservation - cli: Render Observation separately, [analysis: tool] labels	2026-03-31 17:51:22 +01:00
cybermaggedon	89e13a756a	Minor agent-orchestrator updates (#746 ) Tidy agent-orchestrator logs Added CLI support for selecting the pattern... tg-invoke-agent -q "What is the document about?" -p supervisor -v tg-invoke-agent -q "What is the document about?" -p plan-then-execute -v tg-invoke-agent -q "What is the document about?" -p react -v Added new event types to tg-show-explain-trace	2026-03-31 13:29:04 +01:00
cybermaggedon	7b734148b3	agent-orchestrator: add explainability provenance for all patterns (#744 ) agent-orchestrator: add explainability provenance for all agent patterns Extend the provenance/explainability system to provide human-readable reasoning traces for the orchestrator's three agent patterns. Previously only ReAct emitted provenance (session, iteration, conclusion). Now each pattern records its cognitive steps as typed RDF entities in the knowledge graph, using composable mixin types (e.g. Finding + Answer). New provenance chains: - Supervisor: Question → Decomposition → Finding ×N → Synthesis - Plan-then-Execute: Question → Plan → StepResult ×N → Synthesis - ReAct: Question → Analysis ×N → Conclusion (unchanged) New RDF types: Decomposition, Finding, Plan, StepResult. New predicates: tg:subagentGoal, tg:planStep. Reuses existing Synthesis + Answer mixin for final answers. Provenance library (trustgraph-base): - Triple builders, URI generators, vocabulary labels for new types - Client dataclasses with from_triples() dispatch - fetch_agent_trace() follows branching provenance chains - API exports updated Orchestrator (trustgraph-flow): - PatternBase emit methods for decomposition, finding, plan, step result, and synthesis - SupervisorPattern emits decomposition during fan-out - PlanThenExecutePattern emits plan and step results - Service emits finding triples on subagent completion - Synthesis provenance replaces generic final triples CLI (trustgraph-cli): - invoke_agent -x displays new entity types inline	2026-03-31 12:54:51 +01:00
cybermaggedon	e65ea217a2	agent-orchestrator improvements (#743 ) agent-orchestrator improvements: - Improve agent trace - Improve queue dumping - Fixing supervisor pattern - Fix synthesis step to remove loop Minor dev environment improvements: - Improve queue dump output for JSON - Reduce dev container rebuild	2026-03-31 11:24:30 +01:00
cybermaggedon	849987f0e6	Add multi-pattern orchestrator with plan-then-execute and supervisor (#739 ) Introduce an agent orchestrator service that supports three execution patterns (ReAct, plan-then-execute, supervisor) with LLM-based meta-routing to select the appropriate pattern and task type per request. Update the agent schema to support orchestration fields (correlation, sub-agents, plan steps) and remove legacy response fields (answer, thought, observation).	2026-03-31 00:32:49 +01:00
cybermaggedon	20204d87c3	Fix OpenAI compatibility issues for newer models and Azure config (#727) Use max_completion_tokens for OpenAI and Azure OpenAI providers: The OpenAI API deprecated max_tokens in favor of max_completion_tokens for chat completions. Newer models (gpt-4o, o1, o3) reject the old parameter with a 400 error. AZURE_API_VERSION env var now overrides the default API version (falls back to 2024-12-01-preview). Update tests to check for the expected structures	2026-03-28 11:19:45 +00:00
cybermaggedon	a634520509	Fix websocket error responses in Mux dispatcher (#726 ) Error responses from the websocket multiplexer were missing the request ID and using a bare string format instead of the structured error protocol. This caused clients to hang when a request failed (e.g. unsupported service for a flow) because the error could not be routed to the waiting caller. Include request ID in all error paths, use structured error format ({message, type}) with complete flag, and extract the ID early in receive() so even malformed requests get a routable error when possible. Updated tests - tests were coded against invalid protocol messages	2026-03-28 10:58:28 +00:00
cybermaggedon	25995d03f4	Fix stray log messages caused by librarian messages (#706) Warnings were generated by librarian responses meant for other services (chunker, embeddings, etc.) arriving on the shared response queue. The decoder's subscription picks them up, can't match them to a pending request, and logs a warning. Removed the warnings, as they served no purpose.	2026-03-23 13:16:39 +00:00
cybermaggedon	5c6fe90fe2	Add universal document decoder with multi-format support (#705 ) Add universal document decoder with multi-format support using 'unstructured'. New universal decoder service powered by the unstructured library, handling DOCX, XLSX, PPTX, HTML, Markdown, CSV, RTF, ODT, EPUB and more through a single service. Tables are preserved as HTML markup for better downstream extraction. Images are stored in the librarian but excluded from the text pipeline. Configurable section grouping strategies (whole-document, heading, element-type, count, size) for non-page formats. Page-based formats (PDF, PPTX, XLSX) are automatically grouped by page. All four decoders (PDF, Mistral OCR, Tesseract OCR, universal) now share the "document-decoder" ident so they are interchangeable. PDF-only decoders fetch document metadata to check MIME type and gracefully skip unsupported formats. Librarian changes: removed MIME type whitelist validation so any document format can be ingested. Simplified routing so text/plain goes to text-load and everything else goes to document-load. Removed dual inline/streaming data paths — documents always use document_id for content retrieval. New provenance entity types (tg:Section, tg:Image) and metadata predicates (tg:elementTypes, tg:tableCount, tg:imageCount) for richer explainability. Universal decoder is in its own package (trustgraph-unstructured) and container image (trustgraph-unstructured).	2026-03-23 12:56:35 +00:00
cybermaggedon	4609424afe	Prepare 2.2 release branch (#704)	2026-03-22 15:23:23 +00:00
cybermaggedon	96fd1eab15	Use UUID-based URNs for page and chunk IDs (#703) Page and chunk document IDs were deterministic ({doc_id}/p{num}, {doc_id}/p{num}/c{num}), causing "Document already exists" errors when reprocessing documents through different flows. Content may differ between runs due to different parameters or extractors, so deterministic IDs are incorrect. Pages now use urn:page:{uuid}, chunks use urn:chunk:{uuid}. Parent-child relationships are tracked via librarian metadata and provenance triples. Also brings Mistral OCR and Tesseract OCR decoders up to parity with the PDF decoder: librarian fetch/save support, per-page output with unique IDs, and provenance triple emission. Fixes Mistral OCR bug where only the first 5 pages were processed.	2026-03-21 21:17:03 +00:00
cybermaggedon	1a7b654bd3	Add semantic pre-filter for GraphRAG edge scoring (#702 ) Embed edge descriptions and compute cosine similarity against grounding concepts to reduce the number of edges sent to expensive LLM scoring. Controlled by edge_score_limit parameter (default 30), skipped when edge count is already below the limit. Also plumbs edge_score_limit and edge_limit parameters end-to-end: - CLI args (--edge-score-limit, --edge-limit) in both invoke and service - Socket client: fix parameter mapping to use hyphenated wire-format keys - Flow API, message translator, gateway all pass through correctly - Explainable code path (_question_explainable_api) now forwards all params - Default edge_score_limit changed from 50 to 30 based on typical subgraph sizes	2026-03-21 20:06:29 +00:00
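The semantic pre-filter in the commit above (cosine similarity between edge-description embeddings and grounding concepts, capped by `edge_score_limit`) might look roughly like this. `prefilter_edges` and the input shapes are hypothetical, and scoring each edge by its best match against any concept is an assumption made for the example; the commit does not say how the similarities are combined.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def prefilter_edges(edges, concept_vectors, edge_score_limit=30):
    """Keep the edges most similar to the grounding concepts.

    edges: list of (edge, embedding) pairs (hypothetical shape).
    concept_vectors: embeddings of the grounding concepts.
    """
    # Skip the filter when there are already few enough edges.
    if len(edges) <= edge_score_limit:
        return [edge for edge, _ in edges]

    def best_match(pair):
        # Assumption: an edge is scored by its closest grounding concept.
        _, vector = pair
        return max((cosine(vector, c) for c in concept_vectors), default=0.0)

    ranked = sorted(edges, key=best_match, reverse=True)
    return [edge for edge, _ in ranked[:edge_score_limit]]
```

Only the edges that survive this cheap ranking would then be passed on to the more expensive LLM scoring step.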


353 commits