Commit graph

1469 commits

Author SHA1 Message Date
cybermaggedon
3af794ceea
Use manifests to build for amd64 and arm64 (#798) (#801)
Remove Pytorch +cpu invocation, so it builds on ARM
2026-04-13 21:18:58 +01:00
cybermaggedon
baf8d861af
Fix Qemu setup (#800) 2026-04-13 21:03:45 +01:00
cybermaggedon
d11fcabaed
Fix CI pipeline (#799)
Matrix labels are wrong, fix them
2026-04-13 20:57:09 +01:00
cybermaggedon
76c1752748
Use manifests to build for amd64 and arm64 (#798)
This adds ARM container builds to the CI pipeline
2026-04-13 20:51:09 +01:00
cybermaggedon
1515dbaf08
Updated tech specs for agent & explainability changes (#796) 2026-04-13 17:29:24 +01:00
cybermaggedon
d2751553a3
Add agent explainability instrumentation and unify envelope field naming (#795)
Addresses recommendations from the UX developer's agent experience report.
Adds provenance predicates, DAG structure changes, error resilience, and
a published OWL ontology.

Explainability additions:

- Tool candidates: tg:toolCandidate on Analysis events lists the tools
  visible to the LLM for each iteration (names only, descriptions in config)
- Termination reason: tg:terminationReason on Conclusion/Synthesis events
  (final-answer, plan-complete, subagents-complete)
- Step counter: tg:stepNumber on iteration events
- Pattern decision: new tg:PatternDecision entity in the DAG between
  session and first iteration, carrying tg:pattern and tg:taskType
- Latency: tg:llmDurationMs on Analysis events, tg:toolDurationMs on
  Observation events
- Token counts on events: tg:inToken/tg:outToken/tg:llmModel on
  Grounding, Focus, Synthesis, and Analysis events
- Tool/parse errors: tg:toolError on Observation events with tg:Error
  mixin type. Parse failures return as error observations instead of
  crashing the agent, giving it a chance to retry.

Envelope unification:

- Rename chunk_type to message_type across AgentResponse schema,
  translator, SDK types, socket clients, CLI, and all tests.
  Agent and RAG services now both use message_type on the wire.

Ontology:

- specs/ontology/trustgraph.ttl — OWL vocabulary covering all 26 classes,
  7 object properties, and 36+ datatype properties including new predicates.

DAG structure tests:

- tests/unit/test_provenance/test_dag_structure.py verifies the
  wasDerivedFrom chain for GraphRAG, DocumentRAG, and all three agent
  patterns (react, plan, supervisor) including the pattern-decision link.
2026-04-13 16:16:42 +01:00
cybermaggedon
14e49d83c7
Expose LLM token usage across all service layers (#782)
Expose LLM token usage (in_token, out_token, model) across all
service layers

Propagate token counts from LLM services through the prompt,
text-completion, graph-RAG, document-RAG, and agent orchestrator
pipelines to the API gateway and Python SDK. All fields are Optional
— None means "not available", distinguishing from a real zero count.

Key changes:

- Schema: Add in_token/out_token/model to TextCompletionResponse,
  PromptResponse, GraphRagResponse, DocumentRagResponse,
  AgentResponse

- TextCompletionClient: New TextCompletionResult return type. Split
  into text_completion() (non-streaming) and
  text_completion_stream() (streaming with per-chunk handler
  callback)

- PromptClient: New PromptResult with response_type
  (text/json/jsonl), typed fields (text/object/objects), and token
  usage. All callers updated.

- RAG services: Accumulate token usage across all prompt calls
  (extract-concepts, edge-scoring, edge-reasoning,
  synthesis). Non-streaming path sends single combined response
  instead of chunk + end_of_session.

- Agent orchestrator: UsageTracker accumulates tokens across
  meta-router, pattern prompt calls, and react reasoning. Attached
  to end_of_dialog.

- Translators: Encode token fields when not None (is not None, not truthy)

- Python SDK: RAG and text-completion methods return
  TextCompletionResult (non-streaming) or RAGChunk/AgentAnswer with
  token fields (streaming)

- CLI: --show-usage flag on tg-invoke-llm, tg-invoke-prompt,
  tg-invoke-graph-rag, tg-invoke-document-rag, tg-invoke-agent
2026-04-13 14:38:34 +01:00
cybermaggedon
67cfa80836
SPARQL CLI reports errors from service (#794)
SPARQL query CLI ignores errors from the SPARQL service and just
emits a zero row output. This change causes an error to be reported
2026-04-13 14:31:33 +01:00
elpresidank
5a6ea1e70e merged in V2T 2026-04-12 17:34:39 -05:00
elpresidank
ee45cb4850 feat: fix RAG pipelines, Beep Graph branding, PWA, and ambient glow UI
Pipeline fixes:
- Fix agent getting empty response from graph-rag by combining answer +
  explain data in single message (RequestResponse returns first msg)
- Fix Doc RAG pipeline: add content field to Qdrant doc payload, seed 10
  document chunks, fix type mismatches across base/flow/client
- Forward explainability events from agent's KnowledgeQuery to client
- Add "agent" to TERM_BEARING_RESPONSE_SERVICES for triple translation
- Fix embeddings env var (OLLAMA_URL), user/collection threading, edge
  scoring threshold, and various protocol mismatches

Branding:
- Rename TrustGraph → Beep Graph (title, sidebar, settings, about)
- Custom lambda + ThugLife pixel glasses SVG logo component
- Forest green color palette (brand-50 through brand-900)
- SVG favicon + PNG icons (16/32/180/192/512)
- PWA manifest with service worker for offline shell caching
- Splash screen with animated logo pulse on app load
- Ambient glow background with drifting green radial blobs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 10:19:10 -05:00
elpresidank
87f6e5eb05 feat: chat message actions, explainability graphs, and graph query filters
Add chat UX improvements: message actions toolbar (copy/delete/regenerate)
on hover, inline explainability subgraph visualization from RAG/agent
queries, and token metadata for all chat modes. Enhance graph page with
SPO query filters, configurable triple limit, and type legend overlay.
Extract shared graph utilities for reuse across components.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 02:55:46 -05:00
elpresidank
d5dd15be72 feat: MCP Tools management UI with QA accessibility fixes
Add dedicated /mcp-tools page for managing MCP servers and tools from the
workbench. Includes CRUD dialogs, config API integration, and feature flag
gating via mcpTools switch. QA pass also fixes accessibility across existing
pages: aria-expanded on chat phase blocks, tabpanel tabindex on prompts,
toggle contrast ratio (WCAG 2.1 SC 1.4.11) on settings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 00:59:20 -05:00
elpresidank
f4d6e49217 Merge remote-tracking branch 'origin/master' into ts-port 2026-04-11 23:56:34 -05:00
elpresidank
338adf8668 fix: global focus-visible rings and light-mode border contrast
- Add global focus-visible outline for buttons, switches, selects, and
  inputs so all interactive elements show a visible brand-500 ring on
  keyboard focus (not just NavLinks and dialog close)
- Darken light-mode --color-border from #e4e4e7 to #d4d4d8 so input
  borders, dividers, and mode selector outlines are visible on white

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 23:28:38 -05:00
elpresidank
d097b790ff fix: comprehensive a11y and contrast QA pass across workbench
Automated QA loop (6 parallel browser agents, 2 rounds) found and fixed
15 accessibility, contrast, and responsive issues across all 8 pages:

- WCAG contrast: light-mode warning (#854d0e), error (#b91c1c), toggle
  off-state (surface-400), connection badge (fg-muted)
- ARIA: mode selector group+pressed, tab pattern ids+labelledby, nav
  and aside labels, dialog focus-return, alert roles on banners
- Responsive: library header flex-wrap, search/button aria-labels
- Focus: NavLink visible ring, dialog close button ring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 23:26:28 -05:00
cybermaggedon
ffe310af7c
Fix RabbitMQ request/response race and chunker Flow API drift (#779)
* Fix Metadata/EntityEmbeddings schema migration tail and add regression tests (#776)

The Metadata dataclass dropped its `metadata: list[Triple]` field
and EntityEmbeddings/ChunkEmbeddings settled on a singular
`vector: list[float]` field, but several call sites kept passing
`Metadata(metadata=...)` and `EntityEmbeddings(vectors=...)`. The
bugs were latent until a websocket client first hit
`/api/v1/flow/default/import/entity-contexts`, at which point the
dispatcher TypeError'd on construction.

Production fixes (5 call sites on the same migration tail):

  * trustgraph-flow gateway dispatchers entity_contexts_import.py
    and graph_embeddings_import.py — drop the stale
    Metadata(metadata=...)  kwarg; switch graph_embeddings_import
    to the singular `vector` wire key.
  * trustgraph-base messaging translators knowledge.py and
    document_loading.py — fix decode side to read the singular
    `"vector"` key, matching what their own encode sides have
    always written.
  * trustgraph-flow tables/knowledge.py — fix Cassandra row
    deserialiser to construct EntityEmbeddings(vector=...)
    instead of vectors=.
  * trustgraph-flow gateway core_import/core_export — switch the
    kg-core msgpack wire format to the singular `"v"`/`"vector"`
    key and drop the dead `m["m"]` envelope field that referenced
    the removed Metadata.metadata triples list (it was a
    guaranteed KeyError on the export side).

Defense-in-depth regression coverage (32 new tests across 7 files):

  * tests/contract/test_schema_field_contracts.py — pin the field
    set of Metadata, EntityEmbeddings, ChunkEmbeddings,
    EntityContext so any future schema rename fails CI loudly
    with a clear diff.
  * tests/unit/test_translators/test_knowledge_translator_roundtrip.py
    and test_document_embeddings_translator_roundtrip.py -
    encode→decode round-trip the affected translators end to end,
    locking in the singular `"vector"` wire key.
  * tests/unit/test_gateway/test_entity_contexts_import_dispatcher.py
    and test_graph_embeddings_import_dispatcher.py — exercise the
    websocket dispatchers' receive() path with realistic
    payloads, the direct regression test for the original
    production crash.
  * tests/unit/test_gateway/test_core_import_export_roundtrip.py
    — pack/unpack the kg-core msgpack format through the real
    dispatcher classes (with KnowledgeRequestor mocked),
    including a full export→import round-trip.
  * tests/unit/test_tables/test_knowledge_table_store.py —
    exercise the Cassandra row → schema conversion via __new__ to
    bypass the live cluster connection.

Also fixes an unrelated leaked-coroutine RuntimeWarning in
test_gateway/test_service.py::test_run_method_calls_web_run_app: the
mocked aiohttp.web.run_app now closes the coroutine that Api.run() hands
it, mirroring what the real run_app would do, instead of leaving it for
the GC to complain about.

* Fix RabbitMQ request/response race and chunker Flow API drift

Two unrelated regressions surfaced after the v2.2 queue class
refactor.  Bundled here because both are small and both block
production.

1. Request/response race against ephemeral RabbitMQ response
queues

Commit feeb92b3 switched response/notify queues to per-subscriber
auto-delete exclusive queues. That fixed orphaned-queue
accumulation but introduced a setup race: Subscriber.start()
created the run() task and returned immediately, while the
underlying RabbitMQ consumer only declared and bound its queue
lazily on the first receive() call.  RequestResponse.request()
therefore published the request before any queue was bound to the
matching routing key, and the broker dropped the reply. Symptoms:
"Failed to fetch config on notify" / "Request timeout exception"
repeating roughly every 10s in api-gateway, document-embeddings
and any other service exercising the config notify path.

Fix:

  * Add ensure_connected() to the BackendConsumer protocol;
    implement it on RabbitMQBackendConsumer (calls _connect
    synchronously, declaring and binding the queue) and as a
    no-op on PulsarBackendConsumer (Pulsar's client.subscribe is
    already synchronous at construction).

  * Convert Subscriber's readiness signal from a non-existent
    Event to an asyncio.Future created in start(). run() calls
    consumer.ensure_connected() immediately after
    create_consumer() and sets _ready.set_result(None) on first
    successful bind. start() awaits the future via asyncio.wait
    so it returns only once the consumer is fully bound. Any
    reply published after start() returns is therefore guaranteed
    to land in a bound queue.

  * First-attempt connection failures call
    _ready.set_exception(e) and exit run() so start() unblocks
    with the error rather than hanging forever — the existing
    higher-level retry pattern in fetch_and_apply_config takes
    over from there. Runtime failures after a successful start
    still go through the existing retry-with-backoff path.

  * Update the two existing graceful-shutdown tests that
    monkey-patch Subscriber.run with a custom coroutine to honor
    the new contract by signalling _ready themselves.

  * Add tests/unit/test_base/test_subscriber_readiness.py with
    five regression tests pinning the readiness contract:
    ensure_connected must be called before start() returns;
    start() must block while ensure_connected runs
    (race-condition guard with a threading.Event gate);
    first-attempt create_consumer and ensure_connected failures
    must propagate to start() instead of hanging;
    ensure_connected must run before any receive() call.

2. Chunker Flow parameter lookup using the wrong attribute

trustgraph-base/trustgraph/base/chunking_service.py was reading
flow.parameters.get("chunk-size") and chunk-overlap, but the Flow
class has no `parameters` attribute — parameter lookup is exposed
through Flow.__call__ (flow("chunk-size") returns the resolved
value or None).  The exception was caught and logged as a
WARNING, so chunking continued with the default sizes and any
configured chunk-size / chunk-overlap was silently ignored:

    chunker - WARNING - Could not parse chunk-size parameter:
    'Flow' object has no attribute 'parameters'

The chunker tests didn't catch this because they constructed
mock_flow = MagicMock() and configured
mock_flow.parameters.get.side_effect = ..., which is the same
phantom attribute MagicMock auto-creates on demand. Tests and
production agreed on the wrong API.

Fix: switch chunking_service.py to flow("chunk-size") /
flow("chunk-overlap"). Update both chunker test files to mock the
__call__ side_effect instead of the phantom parameters.get,
merging parameter values into the existing flow() lookup the
on_message tests already used for producer resolution.
2026-04-11 01:29:38 +01:00
cybermaggedon
c23e28aa66
Fix Metadata/EntityEmbeddings schema migration tail and add regression tests (#777)
The Metadata dataclass dropped its `metadata: list[Triple]` field
and EntityEmbeddings/ChunkEmbeddings settled on a singular
`vector: list[float]` field, but several call sites kept passing
`Metadata(metadata=...)` and `EntityEmbeddings(vectors=...)`. The
bugs were latent until a websocket client first hit
`/api/v1/flow/default/import/entity-contexts`, at which point the
dispatcher TypeError'd on construction.

Production fixes (5 call sites on the same migration tail):

  * trustgraph-flow gateway dispatchers entity_contexts_import.py
    and graph_embeddings_import.py — drop the stale
    Metadata(metadata=...)  kwarg; switch graph_embeddings_import
    to the singular `vector` wire key.
  * trustgraph-base messaging translators knowledge.py and
    document_loading.py — fix decode side to read the singular
    `"vector"` key, matching what their own encode sides have
    always written.
  * trustgraph-flow tables/knowledge.py — fix Cassandra row
    deserialiser to construct EntityEmbeddings(vector=...)
    instead of vectors=.
  * trustgraph-flow gateway core_import/core_export — switch the
    kg-core msgpack wire format to the singular `"v"`/`"vector"`
    key and drop the dead `m["m"]` envelope field that referenced
    the removed Metadata.metadata triples list (it was a
    guaranteed KeyError on the export side).

Defense-in-depth regression coverage (32 new tests across 7 files):

  * tests/contract/test_schema_field_contracts.py — pin the field
    set of Metadata, EntityEmbeddings, ChunkEmbeddings,
    EntityContext so any future schema rename fails CI loudly
    with a clear diff.
  * tests/unit/test_translators/test_knowledge_translator_roundtrip.py
    and test_document_embeddings_translator_roundtrip.py -
    encode→decode round-trip the affected translators end to end,
    locking in the singular `"vector"` wire key.
  * tests/unit/test_gateway/test_entity_contexts_import_dispatcher.py
    and test_graph_embeddings_import_dispatcher.py — exercise the
    websocket dispatchers' receive() path with realistic
    payloads, the direct regression test for the original
    production crash.
  * tests/unit/test_gateway/test_core_import_export_roundtrip.py
    — pack/unpack the kg-core msgpack format through the real
    dispatcher classes (with KnowledgeRequestor mocked),
    including a full export→import round-trip.
  * tests/unit/test_tables/test_knowledge_table_store.py —
    exercise the Cassandra row → schema conversion via __new__ to
    bypass the live cluster connection.

Also fixes an unrelated leaked-coroutine RuntimeWarning in
test_gateway/test_service.py::test_run_method_calls_web_run_app: the
mocked aiohttp.web.run_app now closes the coroutine that Api.run() hands
it, mirroring what the real run_app would do, instead of leaving it for
the GC to complain about.
2026-04-10 20:43:45 +01:00
cybermaggedon
0994d4b05f
Open 2.3 release branch (#775)
* Update packages and CI for new release branch
2026-04-10 14:42:19 +01:00
cybermaggedon
ad0bff10ee
master -> release/v2.3 (#774)
* Mainly README changes
2026-04-10 14:38:46 +01:00
cybermaggedon
ec8f740de3
release/v2.2 -> master (#773) 2026-04-10 14:36:58 +01:00
elpresidank
77a5fa5044 fix: QA regression pass — graph sizing, focus trap, contrast, accessibility
- Fix graph canvas using window dimensions instead of container by using
  ResizeObserver ref callback (attaches when conditionally-rendered
  container mounts)
- Fix dialog focus trap escaping to background — filter hidden/disabled
  elements from focusable selector
- Fix sidebar connection status and disconnection banner contrast — use
  semantic text-warning/bg-warning instead of amber-400 (1.65:1 → 4.5:1+)
- Add aria-label to chat textarea, htmlFor/id pairs to flows dialog inputs
- Add ARIA tab pattern to prompts page (role=tablist/tab/tabpanel,
  aria-selected, aria-controls)
- Fix prompts heading hierarchy (H1→H2 instead of H1→H3)
- Add flex-wrap to flows page header, fix badge contrast across pages
- Fix service-call race condition with early returns instead of console.log

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 07:48:01 -05:00
elpresidank
b854b56558 feat: MCP tool client infrastructure for agent extensibility
Add the full MCP tool pipeline enabling agents to invoke external tools
(like Brave Search) via MCP servers:

- Add ToolRequest/ToolResponse types and mcp-tool topics to @trustgraph/base
- Create McpToolService (FlowProcessor) that connects to external MCP servers
  via @modelcontextprotocol/sdk StreamableHTTP transport
- Add createMcpTool() to wire MCP tools into the agent's ReAct loop
- Implement config-driven tool registration in AgentService with backward-
  compatible fallback to hardcoded tools
- Add tool filtering by group and state (port of Python tool_filter.py)
- Register mcp-tool in gateway dispatcher and export from @trustgraph/flow
- Fix flow restart race condition: skip restart when flow definitions unchanged
- Update seed config with MCP server config and tool definitions
- Add run scripts for MCP tool service and Brave Search MCP server

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 05:45:46 -05:00
elpresidank
f2b376abef fix: FalkorDB result parsing, embeddings routing, triples query response, graph visualization
- Fix FalkorDB triples query: client v5 returns objects not arrays, use named field access
- Fix embeddings service: align spec names to "embeddings-request"/"embeddings-response"
- Fix client triplesQuery: read `triples` field instead of `response` from backend
- Fix graph page crash: guard against non-array triples, accept literals as entity nodes
- Add seed:demo script for AI industry knowledge graph (254 triples, 64 entities)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 04:59:36 -05:00
cybermaggedon
feeb92b33f
Refactor: Derive consumer behaviour from queue class (#772)
Derive consumer behaviour from queue class, remove
consumer_type parameter

The queue class prefix (flow, request, response, notify) now
fully determines consumer behaviour in both RabbitMQ and Pulsar
backends.  Added 'notify' class for ephemeral broadcast (config
push notifications).  Response and notify classes always create
per-subscriber auto-delete queues, eliminating orphaned queues
that accumulated on service restarts.

Change init-trustgraph to set up the 'notify' namespace in
Pulsar instead of old hangover 'state'.

Fixes 'stuck backlog' on RabbitMQ config notification queue.
2026-04-09 09:55:41 +01:00
cybermaggedon
befe951b2e
Merge 2.2 into master (#771) 2026-04-08 14:17:18 +01:00
cybermaggedon
aff96e57cb
Added Explainable AI agent demo in Typescript (#770)
(Not functional code)
2026-04-08 14:16:14 +01:00
cybermaggedon
e81418c58f
fix: preserve literal types in focus quoted triples and document tracing (#769)
The triples client returns Uri/Literal (str subclasses), not Term
objects.  _quoted_triple() treated all values as IRIs, so literal
objects like skos:definition values were mistyped in focus
provenance events, and trace_source_documents could not match
them in the store.

Added to_term() to convert Uri/Literal back to Term, threaded a
term_map from follow_edges_batch through
get_subgraph/get_labelgraph into uri_map, and updated
_quoted_triple to accept Term objects directly.
2026-04-08 13:37:02 +01:00
cybermaggedon
4b5bfacab1
Forward missing explain_triples through RAG clients and agent tool callback (#768)
fix: forward explain_triples through RAG clients and agent tool callback
- RAG clients and the KnowledgeQueryImpl tool callback were
  dropping explain_triples from explain events, losing provenance
  data (including focus edge selections) when graph-rag is invoked
  via the agent.

Tests for provenance and explainability (56 new):
- Client-level forwarding of explain_triples
- Graph-RAG structural chain
  (question → grounding → exploration → focus → synthesis)
- Graph-RAG integration with mocked subsidiary clients
- Document-RAG integration
  (question → grounding → exploration → synthesis)
- Agent-orchestrator all 3 patterns: react, plan-then-execute,
  supervisor
2026-04-08 11:41:17 +01:00
Cyber MacGeddon
dc72ed3cca Merge branch 'release/v2.2' 2026-04-07 22:29:55 +01:00
cybermaggedon
e899370d98
Update docs for 2.2 release (#766)
- Update protocol specs
- Update protocol docs
- Update API specs
2026-04-07 22:24:59 +01:00
elpresidank
580ee319a3 fix: prevent dispatcher race condition via promise-based lazy init
Store the initialization Promise in the requestors map synchronously
before yielding, so concurrent callers for the same key await the same
instance — prevents orphaned RequestResponse objects and duplicate NATS
subscriptions. Mirrors upstream fix 8f18ba02.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 12:11:11 -05:00
elpresidank
2a2e8e76a3 Merge remote-tracking branch 'origin/master' into ts-port 2026-04-07 10:51:24 -05:00
elpresidank
5e3929a883 fix: comprehensive QA audit — light mode, accessibility, error handling, code quality
- Fix light mode: theme-aware graph node labels, remove prose-invert for
  theme-safe markdown, add brand/semantic color overrides for light backgrounds
- Add 404 catch-all route redirecting unknown paths to /chat
- FalkorDB: add .catch() to connectPromise, add ensureConnected() to all
  store methods (createLiteral, relateNode, relateLiteral, deleteCollection)
- Accessibility: dialog role/aria-modal, toast aria-live, dismiss/zoom/search
  button aria-labels, close panel aria-label
- Lazy-load ForceGraph2D (splits 189KB into separate chunk, main bundle -26%)
- Cap conversation localStorage at 200 messages to prevent quota overflow
- Fix pnpm test: add --passWithNoTests to cli/mcp packages
- Add upload error notification instead of silent catch
- Remove unused class-variance-authority dep and dead tabs.tsx component
- Add @types/node to flow package devDependencies
- Remove stale FIXME comment in messages.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 09:15:59 -05:00
cybermaggedon
c20e6540ec
Subscriber resilience and RabbitMQ fixes (#765)
Subscriber resilience: recreate consumer after connection failure

- Move consumer creation from Subscriber.start() into the run() loop,
  matching the pattern used by Consumer. If the connection drops and the
  consumer is closed in the finally block, the loop now recreates it on
  the next iteration instead of spinning forever on a None consumer.

Consumer thread safety:
- Dedicated ThreadPoolExecutor per consumer so all pika operations
  (create, receive, acknowledge, negative_acknowledge) run on the
  same thread — pika BlockingConnection is not thread-safe
- Applies to both Consumer and Subscriber classes

Config handler type audit — fix four mismatched type registrations:
- librarian: was ["librarian"] (non-existent type), now ["flow",
  "active-flow"] (matches config["flow"] that the handler reads)
- cores/service: was ["kg-core"], now ["flow"] (reads
  config["flow"])
- metering/counter: was ["token-costs"], now ["token-cost"]
  (singular)
- agent/mcp_tool: was ["mcp-tool"], now ["mcp"] (reads
  config["mcp"])

Update tests
2026-04-07 14:51:14 +01:00
elpresidank
9ef9ef854f fix: iterative QA pass — resolve remaining bugs, UX and accessibility improvements
Three QA iterations to convergence (zero issues remaining):

Workbench UI:
- Connection badge: amber "Connected (no auth)" for unauthenticated state
- Theme persistence: restore script in index.html + localStorage sync
- Settings About section: add bottom padding so content isn't clipped
- Clear messages: cancel in-flight requests when clearing chat
- Feature switch labels: proper casing + acronym handling (MCP, LLM)
- Token Cost badge: hidden during loading state
- ARIA: role="switch", aria-checked on toggles, aria-labels on buttons
- ConfigApi: null-safe chaining for getPrompts/getSystemPrompt

Grafana dashboards:
- Auto-refresh 30s on all 3 dashboards
- Panel heights reduced to fit viewport without scrolling
- Anonymous role upgraded to Editor for Explore access

Infrastructure:
- Nginx: DNS resolver with variable-based upstream (prevents crash loop)
- Workbench port set to 3002 in .env

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 06:33:22 -05:00
cybermaggedon
ddd4bd7790
Deliver explainability triples inline in retrieval response stream (#763)
Provenance triples are now included directly in explain messages from
GraphRAG, DocumentRAG, and Agent services, eliminating the need for
follow-up knowledge graph queries to retrieve explainability details.

Each explain message in the response stream now carries:
- explain_id: root URI for this provenance step (unchanged)
- explain_graph: named graph where triples are stored (unchanged)
- explain_triples: the actual provenance triples for this step (new)

Changes across the stack:
- Schema: added explain_triples field to GraphRagResponse,
  DocumentRagResponse, and AgentResponse
- Services: all explain message call sites pass triples through
  (graph_rag, document_rag, agent react, agent orchestrator)
- Translators: encode explain_triples via TripleTranslator for
  gateway wire format
- Python SDK: ProvenanceEvent now includes parsed ExplainEntity
  and raw triples; expanded event_type detection
- CLI: invoke_graph_rag, invoke_agent, invoke_document_rag use
  inline entity when available, fall back to graph query
- Tech specs updated

Additional explainability test
2026-04-07 12:19:05 +01:00
cybermaggedon
2f8d6a3ffb
Fix agent config handler registration, remove debug prints, disable RabbitMQ heartbeats (#764)
- Fix agent react and orchestrator services appending bare methods
  to config_handlers instead of using register_config_handler() —
  caused 'method object is not subscriptable' on config notify
- Add exc_info to config fetch retry logging for proper tracebacks
- Remove debug print statements from collection management
  dispatcher and translator
- Disable RabbitMQ heartbeats (heartbeat=0) to prevent broker
  closing idle producer connections that can't process heartbeat
  frames from BlockingConnection
2026-04-07 12:11:12 +01:00
Sreeram Venkatasubramanian
c737e8c356
fix: reduce consumer poll timeout from 2000ms to 100ms (#761) 2026-04-07 12:09:20 +01:00
Sreeram Venkatasubramanian
f0c9039b76 fix: reduce consumer poll timeout from 2000ms to 100ms 2026-04-07 12:02:27 +01:00
elpresidank
3a80872482 fix: comprehensive QA — resolve 13 bugs, add UX improvements across all services
Client SDK: add .catch() to graphRagStreaming/documentRagStreaming (silent timeout),
null-guard JSON.parse in getPrompts/getSystemPrompt/getPrompt.

Backend: implement "getvalues" config operation for token costs, null-check
createTerm() in FalkorDB triples query, add knowledge-cores service entrypoint
and Docker entry, return proper HTTP 400/404 for gateway error responses.

Workbench: cancel button + elapsed timer for chat, clear agent spinner on error,
flow dialog inline validation, responsive header wrapping, knowledge cores
loading timeout, sidebar/page naming consistency, theme toggle indicator.

Infrastructure: enable Grafana Explore for viewers, add gateway Prometheus
scrape target, fix RAG pipeline dashboard layout (6 panels visible),
filter Service Health to configured targets only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 05:20:10 -05:00
elpresidank
72870a7e2e feat: add unit tests, Docker polish, and workbench UX improvements
Unit tests: Consumer class (7), recursive-splitter (10), parseJsonResponse (11) — 28 total.
Docker: add 5 commented LLM provider services, dev compose override, .env.example.
Workbench: chat persistence, error boundary, disconnect banner, prompts error handling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 03:51:29 -05:00
elpresidank
c7eefee607 feat: add Docker entrypoints, LLM providers, pipeline hardening, workbench pages
Phase 9 — four parallel workstreams:

- Stream A: 14 Docker entrypoints for containerized deployment
- Stream B: Pipeline hardening — robust JSON parsing, LLM retry logic,
  consumer negative-ack, FalkorDB test import fix
- Stream C: Azure OpenAI, OpenAI-compatible, and Mistral LLM providers
- Stream D: Workbench Prompts, Token Cost, Knowledge Cores pages +
  Settings feature switches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 03:22:55 -05:00
elpresidank
50fb311d2d feat: real PDF pipeline test — end-to-end knowledge extraction working
Add full pipeline test that generates a real PDF, processes it through
the entire pipeline, and verifies knowledge lands in FalkorDB:

- Create test PDF generator using pdf-lib (2-page doc about Acme Corp)
- Add testFullPipeline() to integration tests with store verification
- Fix FalkorDB client connect() — createClient returns unconnected client
  in both TriplesStore and TriplesQuery classes

Results: PDF decoded (2 pages) → chunked (2 chunks) → extracted
(4 relationships) → 16 triples stored in FalkorDB including:
  alice-johnson → is-a-senior-engineer → acme-corporation
  cloudsync → uses-aws-for-hosting → amazon-web-services
  provenance: pages → prov:wasDerivedFrom → source document

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 02:19:12 -05:00
elpresidank
5bc7a1b6fc fix: resolve FlowProcessor topic collisions, librarian timeout, tests
Two bugs found during end-to-end testing:

1. FlowProcessor never restarted flows when config changed — it only
   started them once. Stale NATS JetStream data from previous sessions
   caused services to bind to wrong topics. Fix: stop and restart flows
   on every config push that includes flow definitions.

2. Gateway publishToTopic sent messages without an id property. Pipeline
   FlowProcessor handlers check properties.id and silently return if
   missing. Fix: auto-generate a message id when publishing to topics.

Both fixes validated: 13/13 integration tests passing, PDF decoder
correctly receives and processes document messages through the pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:53:55 -05:00
elpresidank
c545213224 feat: add query/retrieval FlowProcessor services and missing runner scripts
Wire up the query and retrieval side of the pipeline so the agent can
answer questions from stored knowledge:

- Triples query service (FalkorDB) — all SPO pattern queries via NATS
- Graph embeddings query service (Qdrant) — entity vector similarity
- Document embeddings query service (Qdrant) — chunk vector similarity
- Graph RAG service — full concept→entity→traverse→score→synthesize pipeline
- Document RAG service — embed→find chunks→synthesize pipeline
- Runner scripts for chunker, extractor, embeddings (missing from Phase 5)
- Add DocumentEmbeddingsRequest/Response schema types
- Add RAG prompt templates (extract-concepts, edge-scoring, synthesize)
- Add graph/doc embeddings query topics to seed config + flow manager
- Add all pipeline/query/retrieval services to docker-compose
- 8 new runner scripts, 8 new pnpm script aliases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:05:54 -05:00
elpresidank
8f7008822a feat: add document pipeline — PDF decoder, Ollama LLM, storage services
Add end-to-end document processing pipeline:
- PDF decoder service (pdfjs-dist) extracts text per page from librarian docs
- Ollama native LLM service for local model inference
- FalkorDB triples store FlowProcessor consumer
- Qdrant graph embeddings store FlowProcessor consumer
- Fix spec name collisions in chunker/extractor (input→chunk-input, etc.)
- Gateway /load endpoint to trigger document processing
- Align flow manager blueprint and seed config with full pipeline topics
- Add runner scripts and test coverage for document load

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 23:47:43 -05:00
elpresidank
8f9de7604e fix: make abstract class constructors protected
Marks FlowProcessor and EmbeddingsService constructors as protected
since these classes should only be instantiated via subclasses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 21:52:00 -05:00
cybermaggedon
4acd853023
Config push notify pattern: replace stateful pub/sub with signal+ fetch (#760)
Replace the config push mechanism that broadcast the full config
blob on a 'state' class pub/sub queue with a lightweight notify
signal containing only the version number and affected config
types. Processors fetch the full config via request/response from
the config service when notified.

This eliminates the need for the pub/sub 'state' queue class and
stateful pub/sub services entirely. The config push queue moves
from 'state' to 'flow' class — a simple transient signal rather
than a retained message.  This solves the RabbitMQ
late-subscriber problem where restarting processes never received
the current config because their fresh queue had no historical
messages.

Key changes:
- ConfigPush schema: config dict replaced with types list
- Subscribe-then-fetch startup with retry: processors subscribe
  to notify queue, fetch config via request/response, then
  process buffered notifies with version comparison to avoid race
  conditions
- register_config_handler() accepts optional types parameter so
  handlers only fire when their config types change
- Short-lived config request/response clients to avoid subscriber
  contention on non-persistent response topics
- Config service passes affected types through put/delete/flow
  operations
- Gateway ConfigReceiver rewritten with same notify pattern and
  retry loop

Tests updated

New tests:
- register_config_handler: without types, with types, multiple
  types, multiple handlers
- on_config_notify: old/same version skipped, irrelevant types
  skipped (version still updated), relevant type triggers fetch,
  handler without types always called, mixed handler filtering,
  empty types invokes all, fetch failure handled gracefully
- fetch_config: returns config+version, raises on error response,
  stops client even on exception
- fetch_and_apply_config: applies to all handlers on startup,
  retries on failure
2026-04-06 16:57:27 +01:00
V.Sreeram
d4723566cb fix: prevent duplicate dispatcher creation race condition in invoke_global_service (#715)
* fix: prevent duplicate dispatcher creation race condition in invoke_global_service

Concurrent coroutines could all pass the `if key in self.dispatchers` check
before any of them wrote the result back, because `await dispatcher.start()`
yields to the event loop. This caused multiple Pulsar consumers to be created
on the same shared subscription, distributing responses round-robin and
dropping ~2/3 of them — manifesting as a permanent spinner in the Workbench UI.

Apply a double-checked asyncio.Lock in both `invoke_global_service` and
`invoke_flow_service` so only one dispatcher is ever created per service key.

* test: add concurrent-dispatch tests for race condition fix

Add asyncio.gather-based tests that verify invoke_global_service and
invoke_flow_service create exactly one dispatcher under concurrent calls,
preventing the duplicate Pulsar consumer bug.
2026-04-06 11:14:32 +01:00
V.Sreeram
8f18ba0257
fix: prevent duplicate dispatcher creation race condition in invoke_global_service (#715)
* fix: prevent duplicate dispatcher creation race condition in invoke_global_service

Concurrent coroutines could all pass the `if key in self.dispatchers` check
before any of them wrote the result back, because `await dispatcher.start()`
yields to the event loop. This caused multiple Pulsar consumers to be created
on the same shared subscription, distributing responses round-robin and
dropping ~2/3 of them — manifesting as a permanent spinner in the Workbench UI.

Apply a double-checked asyncio.Lock in both `invoke_global_service` and
`invoke_flow_service` so only one dispatcher is ever created per service key.

* test: add concurrent-dispatch tests for race condition fix

Add asyncio.gather-based tests that verify invoke_global_service and
invoke_flow_service create exactly one dispatcher under concurrent calls,
preventing the duplicate Pulsar consumer bug.
2026-04-06 11:13:59 +01:00