* feat: add Helm chart for Kubernetes deployment
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Replace bundled Bitnami subcharts with in-chart manifests on official images
The Bitnami catalog removed all versioned image tags from docker.io/bitnami in
Aug 2025 (old images frozen in bitnamilegacy, maintained catalog now behind a
Broadcom subscription), so the bundled postgresql/redis/minio subcharts no
longer pull. Replace them with plain in-chart manifests built on official
upstream images, keeping the internal/all-in-one path fully self-contained and
free of third-party chart packaging that can disappear:
- internal-postgres.yaml: pgvector/pgvector:pg17 — upstream Postgres plus the
`vector` extension the migrations require. POSTGRES_USER=dograh is the initdb
superuser, so CREATE EXTENSION vector succeeds.
- internal-redis.yaml: redis:7.4-alpine, password-protected, AOF persistence.
- internal-minio.yaml: minio/minio, root creds shared with the app via a single
secret (can't drift); the app auto-creates its bucket.
Service/secret names are unchanged (<rel>-postgresql, <rel>-redisinternal-master,
<rel>-minio) so the app wiring is untouched. Dep passwords are generated once and
persisted across upgrades via lookup. Drop the Chart.yaml dependencies,
Chart.lock, and the `helm dependency` step; the internal manifests gate on the
mode toggles (database.mode=internal, etc.).
Also fixes surfaced by smoke-testing on a live EKS cluster:
- Dockerfile: ship the per-service run_*.sh entrypoints the chart invokes.
- migrate-job: run as a post-install/pre-upgrade hook (the bundled Postgres does
not exist during pre-install) with a wait-for-postgres init container.
- backend env: declare POSTGRES_PASSWORD/REDIS_PASSWORD before the DATABASE_URL/
REDIS_URL that interpolate them (Kubernetes only expands back-references).
- worker liveness probes: pgrep isn't in the slim runtime image; check
/proc/1/cmdline instead (each worker execs its process as PID 1).
- UI: set HOSTNAME=0.0.0.0 so Next.js standalone doesn't bind to the k8s-injected
pod name (which maps to the pod IP only, breaking port-forward/loopback).
Verified end-to-end on EKS 1.36: all pods Ready, migrations applied (pgvector
extension + 27 tables), UI login page and web API served via port-forward.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(webhooks): durable retrying delivery for final webhooks
Final webhook nodes were fired inline with a single best-effort httpx POST
(run_integrations._execute_webhook_node). On a transient error the failure was
swallowed at three levels, so ARQ never retried and the final call report was
permanently lost -- leaving downstream receivers stuck (e.g. a CRM showing a
call as still "in conversation").
Replace the one-shot POST with a durable, idempotent delivery pipeline modelled
on the campaign retry pattern (persisted row + scheduled_for + bounded attempts):
- New webhook_deliveries table (WebhookDeliveryModel) is the source of truth.
Payload is rendered once and frozen so retries are deterministic; secrets are
not stored -- the credential is referenced by uuid and re-resolved at send time.
- run_integrations now persists a delivery row and enqueues deliver_webhook with
a deterministic ARQ job id instead of sending inline.
- deliver_webhook (new ARQ task) sends the request and:
* 2xx -> succeeded
* transient -> retry with capped exponential backoff (RequestError /
5xx / 408 / 425 / 429), up to max_attempts then dead_letter
* permanent 4xx -> dead_letter immediately (no pointless looping)
It is idempotent: a non-pending delivery is a no-op, so a duplicate enqueue or
sweeper re-injection can't double-send.
- sweep_webhook_deliveries cron (every 5 min) re-enqueues overdue pending
deliveries so nothing is lost to a worker restart / Redis flush.
- Stable X-Dograh-Delivery-Id / Workflow-Run-Id / Attempt headers let receivers
dedupe retried deliveries.
- enqueue_job now forwards ARQ job options (_job_id, _defer_by); failures log
repr(e) so empty-message errors like ConnectTimeout are diagnosable.
Config via DEFAULT_WEBHOOK_DELIVERY_CONFIG (env-overridable): max_attempts=5,
base_delay=30s, max_delay=600s, timeout=30s.
Tests cover payload rendering, persist+enqueue, success, transient retry,
retryable 5xx, permanent 4xx dead-letter, attempt exhaustion, and idempotency.
Migration verified to apply/rollback against Postgres; table/enum/indexes confirmed.
* fix(webhooks): atomic claim, safe success-recording, sweep paging, migration cleanup
Address review feedback on the webhook delivery pipeline:
- deliver_webhook now atomically claims a delivery (conditional UPDATE that
leases scheduled_for) before sending, so concurrent ARQ executions can't
double-send (the prior status=='pending' read was non-atomic).
- Recording success is moved out of the dead-letter try-block: if the receiver
accepted the payload (2xx) but the success DB-write fails, the row is left
pending for the sweeper to reconcile instead of being dead-lettered.
- The sweep keyset-paginates by id so a backlog over the page size is fully
drained, and logs the true re-enqueued total.
- Migration downgrade drops the enum via op.execute(DROP TYPE IF EXISTS ...)
instead of the deprecated op.get_bind().
* fix(webhooks): idempotent delivery creation and drop secret custom headers
Address the remaining review feedback:
- Add a (workflow_run_id, webhook_node_id) unique constraint and make
create_webhook_delivery a get-or-create returning (delivery, created). A
retried run_integrations now reuses the existing row instead of creating and
sending a duplicate final webhook; only a freshly-created row is enqueued.
- Stop persisting secret-looking custom headers (Authorization, X-API-Key,
Cookie, ...) in plaintext on the delivery row: they are dropped with a warning
pointing at the credential store (which is re-resolved securely at send time).
Non-secret custom headers are unaffected.
* fix(webhooks): harden idempotency key, secret-header match, sweep reclaim id
Address follow-up review feedback:
- webhook_node_id is now NOT NULL so a NULL can't slip past the
(workflow_run_id, webhook_node_id) unique constraint and create duplicates.
- Secret-header filtering matches normalized markers (auth/token/secret/cookie/
api-key/...) instead of an exact name list, catching variants like
X-Custom-Auth-Token while leaving benign headers (e.g. X-Idempotency-Key).
- The sweeper re-enqueues with a reclaim-specific job id (the lease timestamp)
so reconciling a delivered-but-unrecorded row isn't deduped against the
original attempt's already-completed ARQ job. The atomic claim still ensures
at most one send.
* fix(webhooks): scope delivery rows to workflow org
---------
Co-authored-by: Abhishek Kumar <abhishek@a6k.me>
ChatwootWidget's visibility effect took the synchronous fast path whenever
`window.$chatwoot` was truthy. But the SDK assigns `$chatwoot` (with
`toggleBubbleVisibility`) synchronously in run(), while `.woot--bubble-holder`
is only appended later, after the widget iframe finishes loading. Calling
`toggleBubbleVisibility("show")` in that gap makes the SDK dereference
`null.classList` — an intermittent crash on navigating to /workflow that
surfaces via the React error boundary (follow-up to #483).
Gate the immediate-apply path on the bubble holder actually being in the DOM;
when it's absent, fall through to the existing `chatwoot:ready` listener, which
the SDK fires only after the holder is created.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: auto-assign per-worktree backend port via VS Code folderOpen task
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: remove .conductor dev setup (moved to native git worktrees)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: drain active calls before rolling updates
* Use provisional VAD interruption strategy
* feat: wire provisional VAD configuration
* chore: refactor user turn strategies
* chore: bump pipecat
* fix: reject misrouted smallwebrtc runs on the telephony websocket
A smallwebrtc (browser/WebRTC) workflow run is established through the WebRTC
signaling endpoint, not the PSTN telephony websocket. When such a run reached
_handle_telephony_websocket it read no "provider" from initial_context and
closed with an opaque "Provider type not found". Detect smallwebrtc runs and
close with a clear reason pointing to the signaling endpoint, without setting
the run to running or invoking a telephony provider. Also store the provider on
smallwebrtc runs at creation so they are self-describing, and make the generic
no-provider close reason include the run id and mode.
Closes#433
* fix: merge workflow run initial context defaults
---------
Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Co-authored-by: Abhishek Kumar <abhishek@a6k.me>
* feat(twilio): add Answering Machine Detection (AMD) support via telephony config
Closes#339
* chore: regenerate OpenAPI spec to fix drift-check
The openapi.json snapshot had drifted from the FastAPI app definition
because main gained new organization endpoints (billing, credits,
context) after this branch was created. Regenerate it with
'python -m scripts.dump_docs_openapi' to bring it back in sync.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: add provider-level AMD hooks
* fix: handle db error while persisting amd result
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Sabiha Khan <sabihak89@gmail.com>
Co-authored-by: Sabiha Khan <87858386+chewwbaka@users.noreply.github.com>
* feat(scripts): generate REDIS_PASSWORD on setup, plumb through compose
Per the discussion on #453, this takes the recommended path of extending
the setup scripts rather than introducing a parallel compose file.
- scripts/setup_remote.sh now generates REDIS_PASSWORD alongside
OSS_JWT_SECRET and POSTGRES_PASSWORD and writes it to the rendered
.env (with a short comment noting it can be rotated, unlike the
postgres password which is baked into the volume on first init).
- scripts/start_docker.sh now generates REDIS_PASSWORD on first run
if missing, mirroring the existing OSS_JWT_SECRET pattern (reuses
generate_secret, which falls back through python3 → openssl →
/dev/urandom).
- docker-compose.yaml and docker-compose-local.yaml now interpolate
${REDIS_PASSWORD:-redissecret} in the redis --requirepass, the redis
healthcheck, and the api REDIS_URL.
The :-redissecret fallback preserves backwards compatibility for users
with an existing .env that predates this change — they keep the old
value until they regenerate. New installs (via either script) get a
secure random hex.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Harden local Docker secret setup
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Abhishek Kumar <abhishek@a6k.me>
* update tuner documentation
* docs: update Tuner integration walkthrough video
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: mohamed salem <259547077+mohamedsalem-bot@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>