feat(cli): add ingest LLM rate-limit governor with paced retries (#261)

* feat(cli): add ingest rate limit governor
* feat(cli): wire ingest rate-limit config
* feat(cli): report provider rate-limit signals
* feat(cli): show ingest rate-limit waits
* fix(cli): complete rate-limit event coverage
* fix(cli): abort ingest provider calls cleanly
* fix(cli): propagate ingest cancellation
* fix(cli): reject pre-aborted ingest rate-limit waits
* fix(cli): honor Claude rate-limit reset waits
* fix(cli): retry thrown Codex rate-limit failures
* fix(cli): type Claude rate-limit result details
* fix(cli): emit ingest rate-limit countdowns from rejected signals
* fix(cli): report ai sdk rate-limit header utilization
* fix(cli): gate LLM rate-limit retries on the governor budget

The AI SDK and Codex runtimes retried 429 / opaque rate-limit failures up to 6-7 times with no backoff when constructed without a RateLimitGovernor (scan, memory, setup) or with pacing disabled, ignoring Retry-After and worsening the limit. The outer retry loop only cooperates with the governor's pause, so without active pacing there is no backoff to apply.

Route the retry bound through a single source: RateLimitGovernor.maxRetryAttempts(), which returns retry.maxAttempts when enabled and 1 (no outer retry) when absent or disabled. All three runtimes (ai-sdk, codex, claude-code) now use it, so ingest.rateLimit.retry.maxAttempts genuinely controls attempts and the hard-coded 6 (plus Codex's off-by-one extra attempt) is gone. Backend-native retry (e.g. the AI SDK's maxRetries) still handles transient 429s.

Also correct the ktx.yaml docs for maxWaitMs (caps each wait, not the whole run) and maxAttempts, and sync uv.lock ktx-sl/ktx-daemon to 0.9.0.
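
The retry gate described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the `RateLimitConfig` shape, the static form of `maxRetryAttempts`, the `opaqueDelayMs` helper, and `callWithRetry` are all hypothetical, and the doubling backoff is an assumption (the message only says opaque backoff grows between `baseDelayMs` and `maxDelayMs`). Jitter is left out so the delays stay deterministic.

```typescript
// Hypothetical shapes for illustration; field names mirror ingest.rateLimit in ktx.yaml.
interface RateLimitConfig {
  enabled: boolean;
  retry: { maxAttempts: number; baseDelayMs: number; maxDelayMs: number };
}

class RateLimitGovernor {
  readonly config: RateLimitConfig;

  constructor(config: RateLimitConfig) {
    this.config = config;
  }

  // Single source for the retry bound: retry.maxAttempts when pacing is
  // enabled, 1 (no outer retry) when the governor is absent or disabled.
  static maxRetryAttempts(governor?: RateLimitGovernor): number {
    if (!governor || !governor.config.enabled) return 1;
    return governor.config.retry.maxAttempts;
  }

  // Assumed backoff shape: double from baseDelayMs, capped at maxDelayMs.
  opaqueDelayMs(attempt: number): number {
    const { baseDelayMs, maxDelayMs } = this.config.retry;
    return Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
  }
}

// Outer retry loop that a runtime could share in this sketch.
// `sleep` is injected so callers (and tests) control how waiting happens.
async function callWithRetry<T>(
  call: () => Promise<T>,
  isRateLimited: (err: unknown) => boolean,
  sleep: (ms: number) => Promise<void>,
  governor?: RateLimitGovernor,
): Promise<T> {
  const maxAttempts = RateLimitGovernor.maxRetryAttempts(governor);
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // attempt counts the first try, so a bound of 1 rethrows immediately.
      if (!isRateLimited(err) || attempt >= maxAttempts || !governor) throw err;
      await sleep(governor.opaqueDelayMs(attempt));
    }
  }
}
```

With no governor, or with `enabled: false`, the bound is 1 and the first rate-limit failure surfaces to the caller, leaving any retrying to the backend's own mechanism.
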
2026-07-25 12:01:03 +02:00 · 2026-06-05 12:10:27 +02:00 · 2026-06-05 12:10:27 +02:00 · c3d8cedb0b
commit c3d8cedb0b
parent 5a8821073b
35 changed files with 2336 additions and 72 deletions
--- a/docs-site/content/docs/cli-reference/ktx-ingest.mdx
+++ b/docs-site/content/docs/cli-reference/ktx-ingest.mdx
@@ -177,7 +177,9 @@ Slowest phase: reconciliation (2m 05s, 48% of wall time). 2 work units (1 failed

 Work units run serially by default (`ingest.workUnits.maxConcurrency` is `1`);
 raise it in `ktx.yaml` if the profile shows the run is bound by serialized
-work-unit agent loops.
+work-unit agent loops. If the provider reports an LLM rate limit, **ktx** shows
+a transient wait message and temporarily reduces effective work-unit concurrency
+according to `ingest.rateLimit`.

 ## Common errors

--- a/docs-site/content/docs/configuration/ktx-yaml.mdx
+++ b/docs-site/content/docs/configuration/ktx-yaml.mdx
@@ -452,6 +452,16 @@ ingest:
    stepBudget: 40
    maxConcurrency: 2
    failureMode: continue
+  rateLimit:
+    enabled: true
+    throttleThreshold: 0.8
+    minConcurrencyUnderPressure: 1
+    maxWaitMs: 600000
+    retry:
+      maxAttempts: 6
+      baseDelayMs: 1000
+      maxDelayMs: 60000
+      jitter: true
 ```

 ### Adapters
@@ -498,6 +508,24 @@ handles failures.
 | `workUnits.maxConcurrency` | `int > 0` | `1` | How many work units run in parallel. |
 | `workUnits.failureMode` | `abort` \| `continue` | `continue` | `abort` stops the whole ingest run on the first failure; `continue` records it and keeps going. |

+### Rate limits
+
+`rateLimit` controls provider-neutral pacing for LLM calls during ingest. When a
+provider reports a subscription window, retry-after delay, or HTTP 429,
+**ktx** pauses new work-unit model calls, shows a transient wait in the CLI,
+and reduces work-unit concurrency while the provider is under pressure.
+
+| Field | Type | Default | Purpose |
+|-------|------|---------|---------|
+| `rateLimit.enabled` | `boolean` | `true` | Master switch for ingest LLM rate-limit pacing and visible waits. |
+| `rateLimit.throttleThreshold` | `number between 0 and 1` | `0.8` | Fraction of a known provider window at which **ktx** starts reducing concurrency. |
+| `rateLimit.minConcurrencyUnderPressure` | `int > 0` | `1` | Effective work-unit concurrency while a provider is under rate-limit pressure. |
+| `rateLimit.maxWaitMs` | `int > 0` | unset | Caps how long a single provider-reset wait can last. This bounds each wait, not the whole run: after a capped wait elapses **ktx** retries and may pause again. Omit to wait until the provider's reset time. |
+| `rateLimit.retry.maxAttempts` | `int > 0` | `6` | Maximum attempts for a single rate-limited LLM call before the failure surfaces (counts the first try). Also bounds how far opaque backoff grows for responses without a reset time or retry-after value. |
+| `rateLimit.retry.baseDelayMs` | `int > 0` | `1000` | Initial opaque retry delay in milliseconds. |
+| `rateLimit.retry.maxDelayMs` | `int > 0` | `60000` | Maximum opaque retry delay in milliseconds. |
+| `rateLimit.retry.jitter` | `boolean` | `true` | Add jitter to opaque retry delays. |
+
 ## `scan`

 `scan` configures how schema-level inputs become structured context: