Commit graph

1459 commits

Author SHA1 Message Date
elpresidank
87f6e5eb05 feat: chat message actions, explainability graphs, and graph query filters
Add chat UX improvements: message actions toolbar (copy/delete/regenerate)
on hover, inline explainability subgraph visualization from RAG/agent
queries, and token metadata for all chat modes. Enhance graph page with
SPO query filters, configurable triple limit, and type legend overlay.
Extract shared graph utilities for reuse across components.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 02:55:46 -05:00
elpresidank
d5dd15be72 feat: MCP Tools management UI with QA accessibility fixes
Add dedicated /mcp-tools page for managing MCP servers and tools from the
workbench. Includes CRUD dialogs, config API integration, and feature flag
gating via mcpTools switch. QA pass also fixes accessibility across existing
pages: aria-expanded on chat phase blocks, tabpanel tabindex on prompts,
toggle contrast ratio (WCAG 2.1 SC 1.4.11) on settings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 00:59:20 -05:00
elpresidank
f4d6e49217 Merge remote-tracking branch 'origin/master' into ts-port 2026-04-11 23:56:34 -05:00
elpresidank
338adf8668 fix: global focus-visible rings and light-mode border contrast
- Add global focus-visible outline for buttons, switches, selects, and
  inputs so all interactive elements show a visible brand-500 ring on
  keyboard focus (not just NavLinks and dialog close)
- Darken light-mode --color-border from #e4e4e7 to #d4d4d8 so input
  borders, dividers, and mode selector outlines are visible on white

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 23:28:38 -05:00
elpresidank
d097b790ff fix: comprehensive a11y and contrast QA pass across workbench
Automated QA loop (6 parallel browser agents, 2 rounds) found and fixed
15 accessibility, contrast, and responsive issues across all 8 pages:

- WCAG contrast: light-mode warning (#854d0e), error (#b91c1c), toggle
  off-state (surface-400), connection badge (fg-muted)
- ARIA: mode selector group+pressed, tab pattern ids+labelledby, nav
  and aside labels, dialog focus-return, alert roles on banners
- Responsive: library header flex-wrap, search/button aria-labels
- Focus: NavLink visible ring, dialog close button ring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 23:26:28 -05:00
cybermaggedon
ffe310af7c
Fix RabbitMQ request/response race and chunker Flow API drift (#779)
* Fix Metadata/EntityEmbeddings schema migration tail and add regression tests (#776)

The Metadata dataclass dropped its `metadata: list[Triple]` field
and EntityEmbeddings/ChunkEmbeddings settled on a singular
`vector: list[float]` field, but several call sites kept passing
`Metadata(metadata=...)` and `EntityEmbeddings(vectors=...)`. The
bugs were latent until a websocket client first hit
`/api/v1/flow/default/import/entity-contexts`, at which point the
dispatcher TypeError'd on construction.

Production fixes (5 call sites on the same migration tail):

  * trustgraph-flow gateway dispatchers entity_contexts_import.py
    and graph_embeddings_import.py — drop the stale
    Metadata(metadata=...)  kwarg; switch graph_embeddings_import
    to the singular `vector` wire key.
  * trustgraph-base messaging translators knowledge.py and
    document_loading.py — fix decode side to read the singular
    `"vector"` key, matching what their own encode sides have
    always written.
  * trustgraph-flow tables/knowledge.py — fix Cassandra row
    deserialiser to construct EntityEmbeddings(vector=...)
    instead of vectors=.
  * trustgraph-flow gateway core_import/core_export — switch the
    kg-core msgpack wire format to the singular `"v"`/`"vector"`
    key and drop the dead `m["m"]` envelope field that referenced
    the removed Metadata.metadata triples list (it was a
    guaranteed KeyError on the export side).

Defense-in-depth regression coverage (32 new tests across 7 files):

  * tests/contract/test_schema_field_contracts.py — pin the field
    set of Metadata, EntityEmbeddings, ChunkEmbeddings,
    EntityContext so any future schema rename fails CI loudly
    with a clear diff.
  * tests/unit/test_translators/test_knowledge_translator_roundtrip.py
    and test_document_embeddings_translator_roundtrip.py -
    encode→decode round-trip the affected translators end to end,
    locking in the singular `"vector"` wire key.
  * tests/unit/test_gateway/test_entity_contexts_import_dispatcher.py
    and test_graph_embeddings_import_dispatcher.py — exercise the
    websocket dispatchers' receive() path with realistic
    payloads, the direct regression test for the original
    production crash.
  * tests/unit/test_gateway/test_core_import_export_roundtrip.py
    — pack/unpack the kg-core msgpack format through the real
    dispatcher classes (with KnowledgeRequestor mocked),
    including a full export→import round-trip.
  * tests/unit/test_tables/test_knowledge_table_store.py —
    exercise the Cassandra row → schema conversion via __new__ to
    bypass the live cluster connection.

Also fixes an unrelated leaked-coroutine RuntimeWarning in
test_gateway/test_service.py::test_run_method_calls_web_run_app: the
mocked aiohttp.web.run_app now closes the coroutine that Api.run() hands
it, mirroring what the real run_app would do, instead of leaving it for
the GC to complain about.

* Fix RabbitMQ request/response race and chunker Flow API drift

Two unrelated regressions surfaced after the v2.2 queue class
refactor.  Bundled here because both are small and both block
production.

1. Request/response race against ephemeral RabbitMQ response
queues

Commit feeb92b3 switched response/notify queues to per-subscriber
auto-delete exclusive queues. That fixed orphaned-queue
accumulation but introduced a setup race: Subscriber.start()
created the run() task and returned immediately, while the
underlying RabbitMQ consumer only declared and bound its queue
lazily on the first receive() call.  RequestResponse.request()
therefore published the request before any queue was bound to the
matching routing key, and the broker dropped the reply. Symptoms:
"Failed to fetch config on notify" / "Request timeout exception"
repeating roughly every 10s in api-gateway, document-embeddings
and any other service exercising the config notify path.

Fix:

  * Add ensure_connected() to the BackendConsumer protocol;
    implement it on RabbitMQBackendConsumer (calls _connect
    synchronously, declaring and binding the queue) and as a
    no-op on PulsarBackendConsumer (Pulsar's client.subscribe is
    already synchronous at construction).

  * Convert Subscriber's readiness signal from a non-existent
    Event to an asyncio.Future created in start(). run() calls
    consumer.ensure_connected() immediately after
    create_consumer() and sets _ready.set_result(None) on first
    successful bind. start() awaits the future via asyncio.wait
    so it returns only once the consumer is fully bound. Any
    reply published after start() returns is therefore guaranteed
    to land in a bound queue.

  * First-attempt connection failures call
    _ready.set_exception(e) and exit run() so start() unblocks
    with the error rather than hanging forever — the existing
    higher-level retry pattern in fetch_and_apply_config takes
    over from there. Runtime failures after a successful start
    still go through the existing retry-with-backoff path.

  * Update the two existing graceful-shutdown tests that
    monkey-patch Subscriber.run with a custom coroutine to honor
    the new contract by signalling _ready themselves.

  * Add tests/unit/test_base/test_subscriber_readiness.py with
    five regression tests pinning the readiness contract:
    ensure_connected must be called before start() returns;
    start() must block while ensure_connected runs
    (race-condition guard with a threading.Event gate);
    first-attempt create_consumer and ensure_connected failures
    must propagate to start() instead of hanging;
    ensure_connected must run before any receive() call.

2. Chunker Flow parameter lookup using the wrong attribute

trustgraph-base/trustgraph/base/chunking_service.py was reading
flow.parameters.get("chunk-size") and chunk-overlap, but the Flow
class has no `parameters` attribute — parameter lookup is exposed
through Flow.__call__ (flow("chunk-size") returns the resolved
value or None).  The exception was caught and logged as a
WARNING, so chunking continued with the default sizes and any
configured chunk-size / chunk-overlap was silently ignored:

    chunker - WARNING - Could not parse chunk-size parameter:
    'Flow' object has no attribute 'parameters'

The chunker tests didn't catch this because they constructed
mock_flow = MagicMock() and configured
mock_flow.parameters.get.side_effect = ..., which is the same
phantom attribute MagicMock auto-creates on demand. Tests and
production agreed on the wrong API.

Fix: switch chunking_service.py to flow("chunk-size") /
flow("chunk-overlap"). Update both chunker test files to mock the
__call__ side_effect instead of the phantom parameters.get,
merging parameter values into the existing flow() lookup the
on_message tests already used for producer resolution.
2026-04-11 01:29:38 +01:00
cybermaggedon
c23e28aa66
Fix Metadata/EntityEmbeddings schema migration tail and add regression tests (#777)
The Metadata dataclass dropped its `metadata: list[Triple]` field
and EntityEmbeddings/ChunkEmbeddings settled on a singular
`vector: list[float]` field, but several call sites kept passing
`Metadata(metadata=...)` and `EntityEmbeddings(vectors=...)`. The
bugs were latent until a websocket client first hit
`/api/v1/flow/default/import/entity-contexts`, at which point the
dispatcher TypeError'd on construction.

Production fixes (5 call sites on the same migration tail):

  * trustgraph-flow gateway dispatchers entity_contexts_import.py
    and graph_embeddings_import.py — drop the stale
    Metadata(metadata=...)  kwarg; switch graph_embeddings_import
    to the singular `vector` wire key.
  * trustgraph-base messaging translators knowledge.py and
    document_loading.py — fix decode side to read the singular
    `"vector"` key, matching what their own encode sides have
    always written.
  * trustgraph-flow tables/knowledge.py — fix Cassandra row
    deserialiser to construct EntityEmbeddings(vector=...)
    instead of vectors=.
  * trustgraph-flow gateway core_import/core_export — switch the
    kg-core msgpack wire format to the singular `"v"`/`"vector"`
    key and drop the dead `m["m"]` envelope field that referenced
    the removed Metadata.metadata triples list (it was a
    guaranteed KeyError on the export side).

Defense-in-depth regression coverage (32 new tests across 7 files):

  * tests/contract/test_schema_field_contracts.py — pin the field
    set of Metadata, EntityEmbeddings, ChunkEmbeddings,
    EntityContext so any future schema rename fails CI loudly
    with a clear diff.
  * tests/unit/test_translators/test_knowledge_translator_roundtrip.py
    and test_document_embeddings_translator_roundtrip.py -
    encode→decode round-trip the affected translators end to end,
    locking in the singular `"vector"` wire key.
  * tests/unit/test_gateway/test_entity_contexts_import_dispatcher.py
    and test_graph_embeddings_import_dispatcher.py — exercise the
    websocket dispatchers' receive() path with realistic
    payloads, the direct regression test for the original
    production crash.
  * tests/unit/test_gateway/test_core_import_export_roundtrip.py
    — pack/unpack the kg-core msgpack format through the real
    dispatcher classes (with KnowledgeRequestor mocked),
    including a full export→import round-trip.
  * tests/unit/test_tables/test_knowledge_table_store.py —
    exercise the Cassandra row → schema conversion via __new__ to
    bypass the live cluster connection.

Also fixes an unrelated leaked-coroutine RuntimeWarning in
test_gateway/test_service.py::test_run_method_calls_web_run_app: the
mocked aiohttp.web.run_app now closes the coroutine that Api.run() hands
it, mirroring what the real run_app would do, instead of leaving it for
the GC to complain about.
2026-04-10 20:43:45 +01:00
cybermaggedon
0994d4b05f
Open 2.3 release branch (#775)
* Update packages and CI for new release branch
2026-04-10 14:42:19 +01:00
cybermaggedon
ad0bff10ee
master -> release/v2.3 (#774)
* Mainly README changes
2026-04-10 14:38:46 +01:00
cybermaggedon
ec8f740de3
release/v2.2 -> master (#773) 2026-04-10 14:36:58 +01:00
elpresidank
77a5fa5044 fix: QA regression pass — graph sizing, focus trap, contrast, accessibility
- Fix graph canvas using window dimensions instead of container by using
  ResizeObserver ref callback (attaches when conditionally-rendered
  container mounts)
- Fix dialog focus trap escaping to background — filter hidden/disabled
  elements from focusable selector
- Fix sidebar connection status and disconnection banner contrast — use
  semantic text-warning/bg-warning instead of amber-400 (1.65:1 → 4.5:1+)
- Add aria-label to chat textarea, htmlFor/id pairs to flows dialog inputs
- Add ARIA tab pattern to prompts page (role=tablist/tab/tabpanel,
  aria-selected, aria-controls)
- Fix prompts heading hierarchy (H1→H2 instead of H1→H3)
- Add flex-wrap to flows page header, fix badge contrast across pages
- Fix service-call race condition with early returns instead of console.log

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 07:48:01 -05:00
elpresidank
b854b56558 feat: MCP tool client infrastructure for agent extensibility
Add the full MCP tool pipeline enabling agents to invoke external tools
(like Brave Search) via MCP servers:

- Add ToolRequest/ToolResponse types and mcp-tool topics to @trustgraph/base
- Create McpToolService (FlowProcessor) that connects to external MCP servers
  via @modelcontextprotocol/sdk StreamableHTTP transport
- Add createMcpTool() to wire MCP tools into the agent's ReAct loop
- Implement config-driven tool registration in AgentService with backward-
  compatible fallback to hardcoded tools
- Add tool filtering by group and state (port of Python tool_filter.py)
- Register mcp-tool in gateway dispatcher and export from @trustgraph/flow
- Fix flow restart race condition: skip restart when flow definitions unchanged
- Update seed config with MCP server config and tool definitions
- Add run scripts for MCP tool service and Brave Search MCP server

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 05:45:46 -05:00
elpresidank
f2b376abef fix: FalkorDB result parsing, embeddings routing, triples query response, graph visualization
- Fix FalkorDB triples query: client v5 returns objects not arrays, use named field access
- Fix embeddings service: align spec names to "embeddings-request"/"embeddings-response"
- Fix client triplesQuery: read `triples` field instead of `response` from backend
- Fix graph page crash: guard against non-array triples, accept literals as entity nodes
- Add seed:demo script for AI industry knowledge graph (254 triples, 64 entities)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 04:59:36 -05:00
cybermaggedon
feeb92b33f
Refactor: Derive consumer behaviour from queue class (#772)
Derive consumer behaviour from queue class, remove
consumer_type parameter

The queue class prefix (flow, request, response, notify) now
fully determines consumer behaviour in both RabbitMQ and Pulsar
backends.  Added 'notify' class for ephemeral broadcast (config
push notifications).  Response and notify classes always create
per-subscriber auto-delete queues, eliminating orphaned queues
that accumulated on service restarts.

Change init-trustgraph to set up the 'notify' namespace in
Pulsar instead of old hangover 'state'.

Fixes 'stuck backlog' on RabbitMQ config notification queue.
2026-04-09 09:55:41 +01:00
cybermaggedon
befe951b2e
Merge 2.2 into master (#771) 2026-04-08 14:17:18 +01:00
cybermaggedon
aff96e57cb
Added Explainable AI agent demo in Typescript (#770)
(Not functional code)
2026-04-08 14:16:14 +01:00
cybermaggedon
e81418c58f
fix: preserve literal types in focus quoted triples and document tracing (#769)
The triples client returns Uri/Literal (str subclasses), not Term
objects.  _quoted_triple() treated all values as IRIs, so literal
objects like skos:definition values were mistyped in focus
provenance events, and trace_source_documents could not match
them in the store.

Added to_term() to convert Uri/Literal back to Term, threaded a
term_map from follow_edges_batch through
get_subgraph/get_labelgraph into uri_map, and updated
_quoted_triple to accept Term objects directly.
2026-04-08 13:37:02 +01:00
cybermaggedon
4b5bfacab1
Forward missing explain_triples through RAG clients and agent tool callback (#768)
fix: forward explain_triples through RAG clients and agent tool callback
- RAG clients and the KnowledgeQueryImpl tool callback were
  dropping explain_triples from explain events, losing provenance
  data (including focus edge selections) when graph-rag is invoked
  via the agent.

Tests for provenance and explainability (56 new):
- Client-level forwarding of explain_triples
- Graph-RAG structural chain
  (question → grounding → exploration → focus → synthesis)
- Graph-RAG integration with mocked subsidiary clients
- Document-RAG integration
  (question → grounding → exploration → synthesis)
- Agent-orchestrator all 3 patterns: react, plan-then-execute,
  supervisor
2026-04-08 11:41:17 +01:00
Cyber MacGeddon
dc72ed3cca Merge branch 'release/v2.2' 2026-04-07 22:29:55 +01:00
cybermaggedon
e899370d98
Update docs for 2.2 release (#766)
- Update protocol specs
- Update protocol docs
- Update API specs
2026-04-07 22:24:59 +01:00
elpresidank
580ee319a3 fix: prevent dispatcher race condition via promise-based lazy init
Store the initialization Promise in the requestors map synchronously
before yielding, so concurrent callers for the same key await the same
instance — prevents orphaned RequestResponse objects and duplicate NATS
subscriptions. Mirrors upstream fix 8f18ba02.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 12:11:11 -05:00
elpresidank
2a2e8e76a3 Merge remote-tracking branch 'origin/master' into ts-port 2026-04-07 10:51:24 -05:00
elpresidank
5e3929a883 fix: comprehensive QA audit — light mode, accessibility, error handling, code quality
- Fix light mode: theme-aware graph node labels, remove prose-invert for
  theme-safe markdown, add brand/semantic color overrides for light backgrounds
- Add 404 catch-all route redirecting unknown paths to /chat
- FalkorDB: add .catch() to connectPromise, add ensureConnected() to all
  store methods (createLiteral, relateNode, relateLiteral, deleteCollection)
- Accessibility: dialog role/aria-modal, toast aria-live, dismiss/zoom/search
  button aria-labels, close panel aria-label
- Lazy-load ForceGraph2D (splits 189KB into separate chunk, main bundle -26%)
- Cap conversation localStorage at 200 messages to prevent quota overflow
- Fix pnpm test: add --passWithNoTests to cli/mcp packages
- Add upload error notification instead of silent catch
- Remove unused class-variance-authority dep and dead tabs.tsx component
- Add @types/node to flow package devDependencies
- Remove stale FIXME comment in messages.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 09:15:59 -05:00
cybermaggedon
c20e6540ec
Subscriber resilience and RabbitMQ fixes (#765)
Subscriber resilience: recreate consumer after connection failure

- Move consumer creation from Subscriber.start() into the run() loop,
  matching the pattern used by Consumer. If the connection drops and the
  consumer is closed in the finally block, the loop now recreates it on
  the next iteration instead of spinning forever on a None consumer.

Consumer thread safety:
- Dedicated ThreadPoolExecutor per consumer so all pika operations
  (create, receive, acknowledge, negative_acknowledge) run on the
  same thread — pika BlockingConnection is not thread-safe
- Applies to both Consumer and Subscriber classes

Config handler type audit — fix four mismatched type registrations:
- librarian: was ["librarian"] (non-existent type), now ["flow",
  "active-flow"] (matches config["flow"] that the handler reads)
- cores/service: was ["kg-core"], now ["flow"] (reads
  config["flow"])
- metering/counter: was ["token-costs"], now ["token-cost"]
  (singular)
- agent/mcp_tool: was ["mcp-tool"], now ["mcp"] (reads
  config["mcp"])

Update tests
2026-04-07 14:51:14 +01:00
elpresidank
9ef9ef854f fix: iterative QA pass — resolve remaining bugs, UX and accessibility improvements
Three QA iterations to convergence (zero issues remaining):

Workbench UI:
- Connection badge: amber "Connected (no auth)" for unauthenticated state
- Theme persistence: restore script in index.html + localStorage sync
- Settings About section: add bottom padding so content isn't clipped
- Clear messages: cancel in-flight requests when clearing chat
- Feature switch labels: proper casing + acronym handling (MCP, LLM)
- Token Cost badge: hidden during loading state
- ARIA: role="switch", aria-checked on toggles, aria-labels on buttons
- ConfigApi: null-safe chaining for getPrompts/getSystemPrompt

Grafana dashboards:
- Auto-refresh 30s on all 3 dashboards
- Panel heights reduced to fit viewport without scrolling
- Anonymous role upgraded to Editor for Explore access

Infrastructure:
- Nginx: DNS resolver with variable-based upstream (prevents crash loop)
- Workbench port set to 3002 in .env

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 06:33:22 -05:00
cybermaggedon
ddd4bd7790
Deliver explainability triples inline in retrieval response stream (#763)
Provenance triples are now included directly in explain messages from
GraphRAG, DocumentRAG, and Agent services, eliminating the need for
follow-up knowledge graph queries to retrieve explainability details.

Each explain message in the response stream now carries:
- explain_id: root URI for this provenance step (unchanged)
- explain_graph: named graph where triples are stored (unchanged)
- explain_triples: the actual provenance triples for this step (new)

Changes across the stack:
- Schema: added explain_triples field to GraphRagResponse,
  DocumentRagResponse, and AgentResponse
- Services: all explain message call sites pass triples through
  (graph_rag, document_rag, agent react, agent orchestrator)
- Translators: encode explain_triples via TripleTranslator for
  gateway wire format
- Python SDK: ProvenanceEvent now includes parsed ExplainEntity
  and raw triples; expanded event_type detection
- CLI: invoke_graph_rag, invoke_agent, invoke_document_rag use
  inline entity when available, fall back to graph query
- Tech specs updated

Additional explainability test
2026-04-07 12:19:05 +01:00
cybermaggedon
2f8d6a3ffb
Fix agent config handler registration, remove debug prints, disable RabbitMQ heartbeats (#764)
- Fix agent react and orchestrator services appending bare methods
  to config_handlers instead of using register_config_handler() —
  caused 'method object is not subscriptable' on config notify
- Add exc_info to config fetch retry logging for proper tracebacks
- Remove debug print statements from collection management
  dispatcher and translator
- Disable RabbitMQ heartbeats (heartbeat=0) to prevent broker
  closing idle producer connections that can't process heartbeat
  frames from BlockingConnection
2026-04-07 12:11:12 +01:00
Sreeram Venkatasubramanian
c737e8c356
fix: reduce consumer poll timeout from 2000ms to 100ms (#761) 2026-04-07 12:09:20 +01:00
Sreeram Venkatasubramanian
f0c9039b76 fix: reduce consumer poll timeout from 2000ms to 100ms 2026-04-07 12:02:27 +01:00
elpresidank
3a80872482 fix: comprehensive QA — resolve 13 bugs, add UX improvements across all services
Client SDK: add .catch() to graphRagStreaming/documentRagStreaming (silent timeout),
null-guard JSON.parse in getPrompts/getSystemPrompt/getPrompt.

Backend: implement "getvalues" config operation for token costs, null-check
createTerm() in FalkorDB triples query, add knowledge-cores service entrypoint
and Docker entry, return proper HTTP 400/404 for gateway error responses.

Workbench: cancel button + elapsed timer for chat, clear agent spinner on error,
flow dialog inline validation, responsive header wrapping, knowledge cores
loading timeout, sidebar/page naming consistency, theme toggle indicator.

Infrastructure: enable Grafana Explore for viewers, add gateway Prometheus
scrape target, fix RAG pipeline dashboard layout (6 panels visible),
filter Service Health to configured targets only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 05:20:10 -05:00
elpresidank
72870a7e2e feat: add unit tests, Docker polish, and workbench UX improvements
Unit tests: Consumer class (7), recursive-splitter (10), parseJsonResponse (11) — 28 total.
Docker: add 5 commented LLM provider services, dev compose override, .env.example.
Workbench: chat persistence, error boundary, disconnect banner, prompts error handling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 03:51:29 -05:00
elpresidank
c7eefee607 feat: add Docker entrypoints, LLM providers, pipeline hardening, workbench pages
Phase 9 — four parallel workstreams:

- Stream A: 14 Docker entrypoints for containerized deployment
- Stream B: Pipeline hardening — robust JSON parsing, LLM retry logic,
  consumer negative-ack, FalkorDB test import fix
- Stream C: Azure OpenAI, OpenAI-compatible, and Mistral LLM providers
- Stream D: Workbench Prompts, Token Cost, Knowledge Cores pages +
  Settings feature switches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 03:22:55 -05:00
elpresidank
50fb311d2d feat: real PDF pipeline test — end-to-end knowledge extraction working
Add full pipeline test that generates a real PDF, processes it through
the entire pipeline, and verifies knowledge lands in FalkorDB:

- Create test PDF generator using pdf-lib (2-page doc about Acme Corp)
- Add testFullPipeline() to integration tests with store verification
- Fix FalkorDB client connect() — createClient returns unconnected client
  in both TriplesStore and TriplesQuery classes

Results: PDF decoded (2 pages) → chunked (2 chunks) → extracted
(4 relationships) → 16 triples stored in FalkorDB including:
  alice-johnson → is-a-senior-engineer → acme-corporation
  cloudsync → uses-aws-for-hosting → amazon-web-services
  provenance: pages → prov:wasDerivedFrom → source document

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 02:19:12 -05:00
elpresidank
5bc7a1b6fc fix: resolve FlowProcessor topic collisions, librarian timeout, tests
Two bugs found during end-to-end testing:

1. FlowProcessor never restarted flows when config changed — it only
   started them once. Stale NATS JetStream data from previous sessions
   caused services to bind to wrong topics. Fix: stop and restart flows
   on every config push that includes flow definitions.

2. Gateway publishToTopic sent messages without an id property. Pipeline
   FlowProcessor handlers check properties.id and silently return if
   missing. Fix: auto-generate a message id when publishing to topics.

Both fixes validated: 13/13 integration tests passing, PDF decoder
correctly receives and processes document messages through the pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:53:55 -05:00
elpresidank
c545213224 feat: add query/retrieval FlowProcessor services and missing runner scripts
Wire up the query and retrieval side of the pipeline so the agent can
answer questions from stored knowledge:

- Triples query service (FalkorDB) — all SPO pattern queries via NATS
- Graph embeddings query service (Qdrant) — entity vector similarity
- Document embeddings query service (Qdrant) — chunk vector similarity
- Graph RAG service — full concept→entity→traverse→score→synthesize pipeline
- Document RAG service — embed→find chunks→synthesize pipeline
- Runner scripts for chunker, extractor, embeddings (missing from Phase 5)
- Add DocumentEmbeddingsRequest/Response schema types
- Add RAG prompt templates (extract-concepts, edge-scoring, synthesize)
- Add graph/doc embeddings query topics to seed config + flow manager
- Add all pipeline/query/retrieval services to docker-compose
- 8 new runner scripts, 8 new pnpm script aliases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 01:05:54 -05:00
elpresidank
8f7008822a feat: add document pipeline — PDF decoder, Ollama LLM, storage services
Add end-to-end document processing pipeline:
- PDF decoder service (pdfjs-dist) extracts text per page from librarian docs
- Ollama native LLM service for local model inference
- FalkorDB triples store FlowProcessor consumer
- Qdrant graph embeddings store FlowProcessor consumer
- Fix spec name collisions in chunker/extractor (input→chunk-input, etc.)
- Gateway /load endpoint to trigger document processing
- Align flow manager blueprint and seed config with full pipeline topics
- Add runner scripts and test coverage for document load

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 23:47:43 -05:00
elpresidank
8f9de7604e fix: make abstract class constructors protected
Marks FlowProcessor and EmbeddingsService constructors as protected
since these classes should only be instantiated via subclasses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 21:52:00 -05:00
cybermaggedon
4acd853023
Config push notify pattern: replace stateful pub/sub with signal+ fetch (#760)
Replace the config push mechanism that broadcast the full config
blob on a 'state' class pub/sub queue with a lightweight notify
signal containing only the version number and affected config
types. Processors fetch the full config via request/response from
the config service when notified.

This eliminates the need for the pub/sub 'state' queue class and
stateful pub/sub services entirely. The config push queue moves
from 'state' to 'flow' class — a simple transient signal rather
than a retained message.  This solves the RabbitMQ
late-subscriber problem where restarting processes never received
the current config because their fresh queue had no historical
messages.

Key changes:
- ConfigPush schema: config dict replaced with types list
- Subscribe-then-fetch startup with retry: processors subscribe
  to notify queue, fetch config via request/response, then
  process buffered notifies with version comparison to avoid race
  conditions
- register_config_handler() accepts optional types parameter so
  handlers only fire when their config types change
- Short-lived config request/response clients to avoid subscriber
  contention on non-persistent response topics
- Config service passes affected types through put/delete/flow
  operations
- Gateway ConfigReceiver rewritten with same notify pattern and
  retry loop

Tests updated

New tests:
- register_config_handler: without types, with types, multiple
  types, multiple handlers
- on_config_notify: old/same version skipped, irrelevant types
  skipped (version still updated), relevant type triggers fetch,
  handler without types always called, mixed handler filtering,
  empty types invokes all, fetch failure handled gracefully
- fetch_config: returns config+version, raises on error response,
  stops client even on exception
- fetch_and_apply_config: applies to all handlers on startup,
  retries on failure
2026-04-06 16:57:27 +01:00
V.Sreeram
d4723566cb fix: prevent duplicate dispatcher creation race condition in invoke_global_service (#715)
* fix: prevent duplicate dispatcher creation race condition in invoke_global_service

Concurrent coroutines could all pass the `if key in self.dispatchers` check
before any of them wrote the result back, because `await dispatcher.start()`
yields to the event loop. This caused multiple Pulsar consumers to be created
on the same shared subscription, distributing responses round-robin and
dropping ~2/3 of them — manifesting as a permanent spinner in the Workbench UI.

Apply a double-checked asyncio.Lock in both `invoke_global_service` and
`invoke_flow_service` so only one dispatcher is ever created per service key.

* test: add concurrent-dispatch tests for race condition fix

Add asyncio.gather-based tests that verify invoke_global_service and
invoke_flow_service create exactly one dispatcher under concurrent calls,
preventing the duplicate Pulsar consumer bug.
2026-04-06 11:14:32 +01:00
V.Sreeram
8f18ba0257
fix: prevent duplicate dispatcher creation race condition in invoke_global_service (#715)
* fix: prevent duplicate dispatcher creation race condition in invoke_global_service

Concurrent coroutines could all pass the `if key in self.dispatchers` check
before any of them wrote the result back, because `await dispatcher.start()`
yields to the event loop. This caused multiple Pulsar consumers to be created
on the same shared subscription, distributing responses round-robin and
dropping ~2/3 of them — manifesting as a permanent spinner in the Workbench UI.

Apply a double-checked asyncio.Lock in both `invoke_global_service` and
`invoke_flow_service` so only one dispatcher is ever created per service key.

* test: add concurrent-dispatch tests for race condition fix

Add asyncio.gather-based tests that verify invoke_global_service and
invoke_flow_service create exactly one dispatcher under concurrent calls,
preventing the duplicate Pulsar consumer bug.
2026-04-06 11:13:59 +01:00
Alex Jenkins
10a931f04c Feat: Auto-pull missing Ollama models (#757)
* fix deadlink in readme

Signed-off-by: Jenkins, Kenneth Alexander <kjenkins60@gatech.edu>

* feat: Auto-pull Ollama models

Signed-off-by: Jenkins, Kenneth Alexander <kjenkins60@gatech.edu>

* fix: Restore namespace __init__.py files for package resolution

Signed-off-by: Jenkins, Kenneth Alexander <kjenkins60@gatech.edu>

* fix CI

Signed-off-by: Jenkins, Kenneth Alexander <kjenkins60@gatech.edu>
2026-04-06 11:10:53 +01:00
Alex Jenkins
7daa06e9e4
Feat: Auto-pull missing Ollama models (#757)
* fix deadlink in readme

Signed-off-by: Jenkins, Kenneth Alexander <kjenkins60@gatech.edu>

* feat: Auto-pull Ollama models

Signed-off-by: Jenkins, Kenneth Alexander <kjenkins60@gatech.edu>

* fix: Restore namespace __init__.py files for package resolution

Signed-off-by: Jenkins, Kenneth Alexander <kjenkins60@gatech.edu>

* fix CI

Signed-off-by: Jenkins, Kenneth Alexander <kjenkins60@gatech.edu>
2026-04-06 11:10:14 +01:00
elpresidank
25d4227cb5 fix: resolve FlowProcessor topic collisions, librarian timeout, tests
Fix critical bug where all FlowProcessor services shared the same spec
names ("request"/"response"), causing them to steal each other's NATS
topics. Now each service uses unique spec names matching the flow config
topic keys (e.g., "text-completion-request", "prompt-request",
"agent-request").

Fix librarian NATS consumer timeout (500ms → 2000ms, below NATS minimum).

Update seed-config and test-pipeline with correct flow topic mappings.
Add prompt template runner script.

Smoke test results: 11/11 passing (config CRUD, WebSocket, LLM,
librarian CRUD). Agent routing verified via manual curl test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 01:02:10 -05:00
elpresidank
515fc0c264 fix: Docker build fixes, add agent/librarian/flow-manager to compose
Fix Containerfiles:
- Move tsconfig.json to workspace config layer for early availability
- Add missing workspace package.json entries for pnpm lockfile resolution

Docker Compose:
- Move Grafana from port 3000 to 3030 (avoid conflicts)
- Add agent, librarian, and flow-manager app services
- Add librarian-data volume for document persistence

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 00:41:01 -05:00
elpresidank
7db5a1023e feat: add flow manager, config seeding, and expanded integration tests
Flow Management Service:
- FlowManagerService (AsyncProcessor) handling list/get/start/stop flows
  and list/get blueprints via kebab-case wire format
- Default blueprint with all service topic mappings
- Pushes flow config to config service on start/stop

Config Seeding:
- seed-config.ts script pushes prompt templates (extract-relationships,
  extract-definitions, document-prompt, kg-prompt) and default flow
  definition via gateway REST API

Integration Tests:
- Librarian CRUD: add-document, list-documents, get-content, delete
- Agent query: verifies routing through gateway to agent service
- Skip flags: SKIP_LIBRARIAN=1, SKIP_AGENT=1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 00:37:03 -05:00
elpresidank
d1f24cf759 feat: add Docker deployment with Containerfile, entrypoints, and nginx
Multi-stage Containerfile for all Node.js services (single image,
different CMD per docker-compose service). ESM entrypoints for gateway,
config, text-completion, prompt, embeddings, agent, and librarian.

Workbench gets a separate Containerfile (nginx:alpine) with SPA routing
and API/WebSocket proxy to gateway.

Docker Compose updated with 6 app services (gateway, config-service,
text-completion, prompt, embeddings, workbench) using shared
trustgraph-ts:local image.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 00:21:00 -05:00
elpresidank
f09ef4de45 feat: add document pipeline, ReAct agent, and knowledge core services
Document Pipeline (Team A):
- LibrarianService: document storage with filesystem backend, metadata
  persistence, child document hierarchy, collection management
- ChunkingService: recursive character text splitter with configurable
  chunk size/overlap, FlowProcessor pattern
- KnowledgeExtractService: combined relationship + definition extraction
  using prompt service and LLM, emits RDF triples and entity contexts
- KnowledgeCoreService: knowledge core CRUD with streaming export and
  flow-based loading

ReAct Agent (Team B):
- StreamingReActParser: state machine for parsing LLM output into
  Thought/Action/ActionInput/FinalAnswer sections
- Three MVP tools: KnowledgeQuery (GraphRAG), DocumentQuery (DocRAG),
  TriplesQuery with RequestResponse clients
- AgentService FlowProcessor with ReAct loop, tool execution, and
  streaming chunk responses (thought/observation/answer)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 00:19:37 -05:00
elpresidank
5ed3f0e2d8 feat: add schema foundation for document pipeline, agent, and deployment
Add missing topics (librarian, knowledge, collection-management, flow),
pipeline message types (TextDocument, Chunk, Triples, EntityContexts),
service message types (Librarian, Knowledge, Collection, Flow CRUD),
and update AgentResponse for streaming chunk format.

Add RequestResponseSpec enabling flow-scoped request/response calls
(needed by knowledge extraction and agent services). Add requestor
registry to Flow class with proper lifecycle management.

Add end_of_dialog to gateway's isComplete() check for agent streaming.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 00:11:29 -05:00
elpresidank
28747e1a92 fix: NATS pipeline bugs, add integration tests and service runners
Fix three critical bugs preventing the NATS message pipeline from working:

- FlowProcessor now subscribes to config-push topic (was missing entirely),
  using DeliverPolicy.All to replay config on service restart
- NATS streams use wildcard subjects (tg.flow.>) instead of per-topic
  narrow filters that caused 503 errors on publish
- Subscriber dispatch loop has exponential backoff on errors to prevent
  tight error loops

Add service runner scripts (gateway, config, LLM) and a 7-test
integration suite that verifies config CRUD, WebSocket round-trip,
and full LLM text-completion through the NATS pipeline.

Fix Docker Compose infra: pin Tempo to v2.6.1, remove deprecated Loki
config fields, add user:0 for volume permissions, remap conflicting
ports (FalkorDB 6380, OTLP 4327/4328).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 23:41:39 -05:00
elpresidank
0042f9259c fix: linter cleanup on flow service implementations
Minor fixes from linter: readonly modifiers, unused parameter prefixes,
type narrowing in graph-rag BFS traversal and edge scoring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 22:52:40 -05:00