mirror of
https://github.com/trustgraph-ai/trustgraph.git
synced 2026-04-25 00:16:23 +02:00
Processor group implementation: dev wrapper (#808)
Processor group implementation: A wrapper to launch multiple
processors in a single processor
- trustgraph-base/trustgraph/base/processor_group.py — group runner
  module. run_group(config) is the async body; run() is the
  endpoint. Loads JSON or YAML config, validates that every entry
  has a unique params.id, instantiates each class via importlib,
  shares one TaskGroup, mirrors AsyncProcessor.launch's retry loop
  and Prometheus startup.
- trustgraph-base/pyproject.toml — added [project.scripts] block
  with processor-group = "trustgraph.base.processor_group:run".
Key behaviours:
- Unique id enforced up front — missing or duplicate params.id fails
  fast with a clear error, preventing the Prometheus Info label
  collision we flagged.
- No registry — dotted class path is the identifier; any
  AsyncProcessor descendant importable at runtime is packable.
- YAML import is lazy — only pulled in if the config file ends in
  .yaml/.yml, so JSON-only users don't need PyYAML installed.
- Single Prometheus server — start_http_server runs once at
  startup, before the retry loop, matching launch()'s pattern.
- Retry loop — same shape as AsyncProcessor.launch: catches the
  ExceptionGroup a failed TaskGroup raises, logs, sleeps 4s, retries.
  Each processor runs under its own supervisor, so one processor
  dying restarts only that processor rather than tearing down the
  whole group.
Example config:

  processors:
    - class: trustgraph.extract.kg.definitions.extract.Processor
      params:
        id: kg-extract-definitions
    - class: trustgraph.chunking.recursive.Processor
      params:
        id: chunker-recursive

Run with processor-group -c group.yaml.
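The no-registry lookup described above comes down to a few lines. This sketch mirrors the module's `_resolve_class`; a stdlib class stands in for a processor class here, purely for illustration:

```python
import importlib

def resolve_class(dotted):
    # Split "pkg.module.Class" into module path and attribute name.
    module_path, _, class_name = dotted.rpartition(".")
    if not module_path:
        raise ValueError(f"expected a dotted path, got {dotted!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# Any class importable at runtime is packable; no registry is consulted.
cls = resolve_class("json.decoder.JSONDecoder")
```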
This commit is contained in:
parent 8954fa3ad7
commit f11c0ad0cb

6 changed files with 580 additions and 11 deletions

117 dev-tools/proc-group/README.md (new file)
# proc-group — run TrustGraph as a single process

A dev-focused alternative to the per-container deployment. Instead of 30+
containers each running a single processor, `processor-group` runs all the
processors as asyncio tasks inside one Python process, sharing the event
loop, Prometheus registry, and (importantly) resources on your laptop.

This is **not** for production. Scale deployments should keep using
per-processor containers — a hard crash taking down every processor at
once, no horizontal scaling, and a single giant log are fine for dev and
a bad idea in prod.

## What this directory contains

- `group.yaml` — the group runner config. One entry per processor, each
  with the dotted class path and a params dict. Defaults (pubsub backend,
  rabbitmq host, log level) are pulled in per-entry with a YAML anchor.
- `README.md` — this file.
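The YAML anchor mentioned above merges the shared defaults into each entry's params. In plain Python terms, each params mapping behaves like the defaults overlaid with the entry's own keys (an illustration of the merge semantics, not TrustGraph code):

```python
# What `<<: *defaults` amounts to for each processor entry: shared keys
# come from the anchor, and keys the entry sets itself win on overlap.
defaults = {
    "pubsub_backend": "rabbitmq",
    "rabbitmq_host": "localhost",
    "log_level": "INFO",
}

entry_params = {"id": "chunker", "chunk_size": 2000}

params = {**defaults, **entry_params}
```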
## Prerequisites

Install the TrustGraph packages into a venv:

```
pip install trustgraph-base trustgraph-flow trustgraph-unstructured
```

`trustgraph-base` provides the `processor-group` endpoint. The others
provide the processor classes that `group.yaml` imports at runtime.
`trustgraph-unstructured` is only needed if you want `document-decoder`
(the `universal-decoder` processor).
## Running it

Start infrastructure (cassandra, qdrant, rabbitmq, garage, observability
stack) with a working compose file. These aren't packable into the group -
they're third-party services. You may be able to run these as standalone
services.

To make Cassandra accessible from the host, set a couple of environment
variables on its service:

```
CASSANDRA_BROADCAST_ADDRESS: 127.0.0.1
CASSANDRA_LISTEN_ADDRESS: 127.0.0.1
```

and also set `network: host`. Then start services:

```
podman-compose up -d cassandra qdrant rabbitmq
podman-compose up -d garage garage-init
podman-compose up -d loki prometheus grafana
podman-compose up -d init-trustgraph
```

`init-trustgraph` is a one-shot that seeds config and the default flow
into cassandra/rabbitmq. Don't leave too long a delay between starting
`init-trustgraph` and running the processor-group, because it needs to
talk to the config service.

Run the api-gateway separately — it's an aiohttp HTTP server, not an
`AsyncProcessor`, so the group runner doesn't host it.

Raise the file descriptor limit — 30+ processors sharing one process
open far more sockets than the default 1024 allows:

```
ulimit -n 65536
```
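To confirm the limit actually took effect in the shell the group will run from, you can query it from Python (stdlib `resource` module, POSIX only; a quick check, not part of the repo):

```python
import resource

# Soft limit is what the process is held to; hard limit is the ceiling
# an unprivileged `ulimit -n` can raise it to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")
```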
Then start the group from a terminal:

```
processor-group -c group.yaml --no-loki-enabled
```

You'll see every processor's startup messages interleaved in one log.
Each processor has a supervisor that restarts it independently on
failure, so a transient crash (or a dependency that isn't ready yet)
only affects that one processor — siblings keep running and the failing
one self-heals on the next retry.

Finally, when everything is running, you can start the API gateway from
its own terminal:

```
api-gateway \
    --pubsub-backend rabbitmq --rabbitmq-host localhost \
    --loki-url http://localhost:3100/loki/api/v1/push \
    --no-metrics
```
## When things go wrong

- **"Too many open files"** — raise `ulimit -n` further. 65536 is
  usually plenty but some workflows need more.
- **One processor failing repeatedly** — look for its id in the log. The
  supervisor will log each failure before restarting. Fix the cause
  (missing env var, unreachable dependency, bad params) and the
  processor self-heals on the next 4-second retry without restarting
  the whole group.
- **Ctrl-C leaves the process hung** — the pika and cassandra drivers
  spawn non-cooperative threads that asyncio can't cancel. Use Ctrl-\
  (SIGQUIT) to force-kill. Not a bug in the group runner, just a
  limitation of those libraries.
## Environment variables

Processors that talk to external LLMs or APIs read their credentials
from env vars, same as in the per-container deployment:

- `OPENAI_TOKEN`, `OPENAI_BASE_URL` — for `text-completion` /
  `text-completion-rag`

Export whatever your particular `group.yaml` needs before running.
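A fail-fast check before launching saves a crash-loop later. This helper is hypothetical (not part of TrustGraph); it sketches the idea of verifying required variables up front:

```python
import os

def require_env(names):
    # Hypothetical helper: report every missing variable at once rather
    # than letting one processor crash-loop on the first lookup failure.
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )

os.environ.setdefault("OPENAI_TOKEN", "dummy-token")  # demo value only
require_env(["OPENAI_TOKEN"])  # passes; raises if anything is unset
```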
257 dev-tools/proc-group/group.yaml (new file)
# Multi-processor group config, derived from docker-compose.yaml.
#
# Covers every AsyncProcessor-based service from the compose file.
# Out of scope:
# - api-gateway (aiohttp, not AsyncProcessor)
# - init-trustgraph (one-shot init, not a processor)
# - document-decoder (universal-decoder, trustgraph-unstructured package —
#   packable but lives in a separate image/package)
# - mcp-server (trustgraph-mcp package, separate image)
# - ddg-mcp-server (third-party image)
# - infrastructure (cassandra, rabbitmq, qdrant, garage, grafana,
#   prometheus, loki, workbench-ui)
#
# Run with:
#   processor-group -c group.yaml

_defaults: &defaults
  pubsub_backend: rabbitmq
  rabbitmq_host: localhost
  log_level: INFO

processors:

  - class: trustgraph.agent.orchestrator.Processor
    params:
      <<: *defaults
      id: agent-manager

  - class: trustgraph.chunking.recursive.Processor
    params:
      <<: *defaults
      id: chunker
      chunk_size: 2000
      chunk_overlap: 50

  - class: trustgraph.config.service.Processor
    params:
      <<: *defaults
      id: config-svc
      cassandra_host: localhost

  - class: trustgraph.decoding.universal.Processor
    params:
      <<: *defaults
      id: document-decoder

  - class: trustgraph.embeddings.document_embeddings.Processor
    params:
      <<: *defaults
      id: document-embeddings

  - class: trustgraph.retrieval.document_rag.Processor
    params:
      <<: *defaults
      id: document-rag
      doc_limit: 20

  - class: trustgraph.embeddings.fastembed.Processor
    params:
      <<: *defaults
      id: embeddings
      concurrency: 1

  - class: trustgraph.embeddings.graph_embeddings.Processor
    params:
      <<: *defaults
      id: graph-embeddings

  - class: trustgraph.retrieval.graph_rag.Processor
    params:
      <<: *defaults
      id: graph-rag
      concurrency: 1
      entity_limit: 50
      triple_limit: 30
      edge_limit: 30
      edge_score_limit: 10
      max_subgraph_size: 100
      max_path_length: 2

  - class: trustgraph.extract.kg.agent.Processor
    params:
      <<: *defaults
      id: kg-extract-agent
      concurrency: 1

  - class: trustgraph.extract.kg.definitions.Processor
    params:
      <<: *defaults
      id: kg-extract-definitions
      concurrency: 1

  - class: trustgraph.extract.kg.ontology.Processor
    params:
      <<: *defaults
      id: kg-extract-ontology
      concurrency: 1

  - class: trustgraph.extract.kg.relationships.Processor
    params:
      <<: *defaults
      id: kg-extract-relationships
      concurrency: 1

  - class: trustgraph.extract.kg.rows.Processor
    params:
      <<: *defaults
      id: kg-extract-rows
      concurrency: 1

  - class: trustgraph.cores.service.Processor
    params:
      <<: *defaults
      id: knowledge
      cassandra_host: localhost

  - class: trustgraph.storage.knowledge.store.Processor
    params:
      <<: *defaults
      id: kg-store
      cassandra_host: localhost

  - class: trustgraph.librarian.Processor
    params:
      <<: *defaults
      id: librarian
      cassandra_host: localhost
      object_store_endpoint: localhost:3900
      object_store_access_key: GK000000000000000000000001
      object_store_secret_key: b171f00be9be4c32c734f4c05fe64c527a8ab5eb823b376cfa8c2531f70fc427
      object_store_region: garage

  - class: trustgraph.agent.mcp_tool.Service
    params:
      <<: *defaults
      id: mcp-tool

  - class: trustgraph.metering.Processor
    params:
      <<: *defaults
      id: metering

  - class: trustgraph.metering.Processor
    params:
      <<: *defaults
      id: metering-rag

  - class: trustgraph.retrieval.nlp_query.Processor
    params:
      <<: *defaults
      id: nlp-query

  - class: trustgraph.prompt.template.Processor
    params:
      <<: *defaults
      id: prompt
      concurrency: 1

  - class: trustgraph.prompt.template.Processor
    params:
      <<: *defaults
      id: prompt-rag
      concurrency: 1

  - class: trustgraph.query.doc_embeddings.qdrant.Processor
    params:
      <<: *defaults
      id: doc-embeddings-query
      store_uri: http://localhost:6333

  - class: trustgraph.query.graph_embeddings.qdrant.Processor
    params:
      <<: *defaults
      id: graph-embeddings-query
      store_uri: http://localhost:6333

  - class: trustgraph.query.row_embeddings.qdrant.Processor
    params:
      <<: *defaults
      id: row-embeddings-query
      store_uri: http://localhost:6333

  - class: trustgraph.query.rows.cassandra.Processor
    params:
      <<: *defaults
      id: rows-query
      cassandra_host: localhost

  - class: trustgraph.query.triples.cassandra.Processor
    params:
      <<: *defaults
      id: triples-query
      cassandra_host: localhost

  - class: trustgraph.embeddings.row_embeddings.Processor
    params:
      <<: *defaults
      id: row-embeddings

  - class: trustgraph.query.sparql.Processor
    params:
      <<: *defaults
      id: sparql-query

  - class: trustgraph.storage.doc_embeddings.qdrant.Processor
    params:
      <<: *defaults
      id: doc-embeddings-write
      store_uri: http://localhost:6333

  - class: trustgraph.storage.graph_embeddings.qdrant.Processor
    params:
      <<: *defaults
      id: graph-embeddings-write
      store_uri: http://localhost:6333

  - class: trustgraph.storage.row_embeddings.qdrant.Processor
    params:
      <<: *defaults
      id: row-embeddings-write
      store_uri: http://localhost:6333

  - class: trustgraph.storage.rows.cassandra.Processor
    params:
      <<: *defaults
      id: rows-write
      cassandra_host: localhost

  - class: trustgraph.storage.triples.cassandra.Processor
    params:
      <<: *defaults
      id: triples-write
      cassandra_host: localhost

  - class: trustgraph.retrieval.structured_diag.Processor
    params:
      <<: *defaults
      id: structured-diag

  - class: trustgraph.retrieval.structured_query.Processor
    params:
      <<: *defaults
      id: structured-query

  - class: trustgraph.model.text_completion.openai.Processor
    params:
      <<: *defaults
      id: text-completion
      max_output: 8192
      temperature: 0.0

  - class: trustgraph.model.text_completion.openai.Processor
    params:
      <<: *defaults
      id: text-completion-rag
      max_output: 8192
      temperature: 0.0
trustgraph-base/pyproject.toml
@ -24,6 +24,9 @@ classifiers = [
 [project.urls]
 Homepage = "https://github.com/trustgraph-ai/trustgraph"
 
+[project.scripts]
+processor-group = "trustgraph.base.processor_group:run"
+
 [tool.setuptools.packages.find]
 include = ["trustgraph*"]
197 trustgraph-base/trustgraph/base/processor_group.py (new file)
# Multi-processor group runner. Runs multiple AsyncProcessor descendants
# as concurrent tasks inside a single process, sharing one event loop,
# one Prometheus HTTP server, and one pub/sub backend pool.
#
# Intended for dev and resource-constrained deployments. Scale deployments
# should continue to use per-processor endpoints.
#
# Group config is a YAML or JSON file with shape:
#
# processors:
#   - class: trustgraph.extract.kg.definitions.extract.Processor
#     params:
#       id: kg-extract-definitions
#       triples_batch_size: 1000
#   - class: trustgraph.chunking.recursive.Processor
#     params:
#       id: chunker-recursive
#
# Each entry's params are passed directly to the class constructor alongside
# the shared taskgroup. Defaults live inside each processor class.

import argparse
import asyncio
import importlib
import json
import logging
import time

from prometheus_client import start_http_server

from . logging import add_logging_args, setup_logging

logger = logging.getLogger(__name__)


def _load_config(path):
    with open(path) as f:
        text = f.read()
    if path.endswith((".yaml", ".yml")):
        import yaml
        return yaml.safe_load(text)
    return json.loads(text)


def _resolve_class(dotted):
    module_path, _, class_name = dotted.rpartition(".")
    if not module_path:
        raise ValueError(
            f"Processor class must be a dotted path, got {dotted!r}"
        )
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


RESTART_DELAY_SECONDS = 4


async def _supervise(entry):
    """Run one processor with its own nested TaskGroup, restarting on any
    failure. Each processor is isolated from its siblings — a crash here
    does not propagate to the outer group."""

    pid = entry["params"]["id"]
    class_path = entry["class"]

    while True:

        try:

            async with asyncio.TaskGroup() as inner_tg:

                cls = _resolve_class(class_path)
                params = dict(entry.get("params", {}))
                params["taskgroup"] = inner_tg

                logger.info(f"Starting {class_path} as {pid}")

                p = cls(**params)
                await p.start()
                inner_tg.create_task(p.run())

            # Clean exit — processor's run() returned without raising.
            # Treat as a transient shutdown and restart, matching the
            # behaviour of per-container `restart: on-failure`.
            logger.warning(
                f"Processor {pid} exited cleanly, will restart"
            )

        except asyncio.CancelledError:
            logger.info(f"Processor {pid} cancelled")
            raise

        except BaseExceptionGroup as eg:
            for e in eg.exceptions:
                logger.error(
                    f"Processor {pid} failure: {type(e).__name__}: {e}",
                    exc_info=e,
                )

        except Exception as e:
            logger.error(
                f"Processor {pid} failure: {type(e).__name__}: {e}",
                exc_info=True,
            )

        logger.info(
            f"Restarting {pid} in {RESTART_DELAY_SECONDS}s..."
        )
        await asyncio.sleep(RESTART_DELAY_SECONDS)


async def run_group(config):

    entries = config.get("processors", [])
    if not entries:
        raise RuntimeError("Group config has no processors")

    seen_ids = set()
    for entry in entries:
        pid = entry.get("params", {}).get("id")
        if pid is None:
            raise RuntimeError(
                f"Entry {entry.get('class')!r} missing params.id — "
                f"required for metrics labelling"
            )
        if pid in seen_ids:
            raise RuntimeError(f"Duplicate processor id {pid!r} in group")
        seen_ids.add(pid)

    async with asyncio.TaskGroup() as outer_tg:
        for entry in entries:
            outer_tg.create_task(_supervise(entry))


def run():

    parser = argparse.ArgumentParser(
        prog="processor-group",
        description="Run multiple processors as tasks in one process",
    )

    parser.add_argument(
        "-c", "--config",
        required=True,
        help="Path to group config file (JSON or YAML)",
    )

    parser.add_argument(
        "--metrics",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Metrics enabled (default: true)",
    )

    parser.add_argument(
        "-P", "--metrics-port",
        type=int,
        default=8000,
        help="Prometheus metrics port (default: 8000)",
    )

    add_logging_args(parser)

    args = vars(parser.parse_args())

    setup_logging(args)

    config = _load_config(args["config"])

    if args["metrics"]:
        start_http_server(args["metrics_port"])

    while True:

        logger.info("Starting group...")

        try:
            asyncio.run(run_group(config))

        except KeyboardInterrupt:
            logger.info("Keyboard interrupt.")
            return

        except ExceptionGroup as e:
            logger.error("Exception group:")
            for se in e.exceptions:
                logger.error(f"  Type: {type(se)}")
                logger.error(f"  Exception: {se}", exc_info=se)

        except Exception as e:
            logger.error(f"Type: {type(e)}")
            logger.error(f"Exception: {e}", exc_info=True)

        logger.warning("Will retry...")
        time.sleep(4)
        logger.info("Retrying...")
@ -9,7 +9,7 @@ from aiohttp import web
 import logging
 import os
 
-from trustgraph.base.logging import setup_logging
+from trustgraph.base.logging import setup_logging, add_logging_args
 from trustgraph.base.pubsub import get_pubsub, add_pubsub_args
 
 from . auth import Authenticator

@ -195,12 +195,7 @@ def run():
         help=f'Secret API token (default: no auth)',
     )
 
-    parser.add_argument(
-        '-l', '--log-level',
-        default='INFO',
-        choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'],
-        help=f'Log level (default: INFO)'
-    )
+    add_logging_args(parser)
 
     parser.add_argument(
         '--metrics',
@ -102,10 +102,10 @@ class Processor(FlowProcessor):
         __class__.cost_metric.labels(model=modelname, direction="input").inc(cost_in)
         __class__.cost_metric.labels(model=modelname, direction="output").inc(cost_out)
 
-        logger.info(f"Model: {modelname}")
-        logger.info(f"Input Tokens: {num_in}")
-        logger.info(f"Output Tokens: {num_out}")
-        logger.info(f"Cost for call: ${cost_per_call}")
+        logger.debug(
+            f"Model: {modelname}, in={num_in}, out={num_out}, "
+            f"cost=${cost_per_call}"
+        )
 
     @staticmethod
     def add_args(parser):