dograh/api
Tararais fd0d144b08
feat(webhooks): durable retrying delivery for final webhooks (#478)
* feat(webhooks): durable retrying delivery for final webhooks

Final webhook nodes were fired inline with a single best-effort httpx POST
(run_integrations._execute_webhook_node). On a transient error the failure was
swallowed at three levels, so ARQ never retried and the final call report was
permanently lost -- leaving downstream receivers stuck (e.g. a CRM showing a
call as still "in conversation").

Replace the one-shot POST with a durable, idempotent delivery pipeline modelled
on the campaign retry pattern (persisted row + scheduled_for + bounded attempts):

- New webhook_deliveries table (WebhookDeliveryModel) is the source of truth.
  Payload is rendered once and frozen so retries are deterministic; secrets are
  not stored -- the credential is referenced by uuid and re-resolved at send time.
- run_integrations now persists a delivery row and enqueues deliver_webhook with
  a deterministic ARQ job id instead of sending inline.
- deliver_webhook (new ARQ task) sends the request and:
    * 2xx            -> succeeded
    * transient      -> retry with capped exponential backoff (RequestError /
                        5xx / 408 / 425 / 429), up to max_attempts then dead_letter
    * permanent 4xx  -> dead_letter immediately (no pointless looping)
  It is idempotent: a non-pending delivery is a no-op, so a duplicate enqueue or
  sweeper re-injection can't double-send.
- sweep_webhook_deliveries cron (every 5 min) re-enqueues overdue pending
  deliveries so nothing is lost to a worker restart / Redis flush.
- Stable X-Dograh-Delivery-Id / Workflow-Run-Id / Attempt headers let receivers
  dedupe retried deliveries.
- enqueue_job now forwards ARQ job options (_job_id, _defer_by); failures log
  repr(e) so empty-message errors like ConnectTimeout are diagnosable.
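
The outcome rules listed for deliver_webhook can be sketched as a small pure function. This is an illustration, not the code from this change: `Outcome`, `classify_response`, and the convention that `status_code=None` stands for a transport error (such as httpx.RequestError) are hypothetical names. Treating any other non-2xx status (for example 3xx) as permanent is also an assumption.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    RETRY = "retry"
    DEAD_LETTER = "dead_letter"


# Non-5xx statuses the commit message lists as transient.
RETRYABLE_STATUS = {408, 425, 429}


def classify_response(
    status_code: Optional[int], attempt: int, max_attempts: int = 5
) -> Outcome:
    """Map one send attempt to an outcome.

    status_code is None when no response arrived (assumed convention for a
    transport error). attempt is the 1-based number of the attempt just made.
    """
    if status_code is not None and 200 <= status_code < 300:
        return Outcome.SUCCEEDED
    transient = (
        status_code is None
        or status_code >= 500
        or status_code in RETRYABLE_STATUS
    )
    if transient and attempt < max_attempts:
        return Outcome.RETRY
    # Permanent failure, or a transient one that exhausted its attempts.
    return Outcome.DEAD_LETTER
```

For example, `classify_response(503, attempt=1)` gives `Outcome.RETRY`, while `classify_response(404, attempt=1)` and `classify_response(503, attempt=5)` both give `Outcome.DEAD_LETTER`.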

Config via DEFAULT_WEBHOOK_DELIVERY_CONFIG (env-overridable): max_attempts=5,
base_delay=30s, max_delay=600s, timeout=30s.
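
To make these defaults concrete, here is a minimal sketch of capped exponential backoff. The doubling factor, the absence of jitter, and the name `backoff_delay` are assumptions; the commit message only states that the backoff is exponential and capped.

```python
def backoff_delay(
    attempt: int, base_delay: float = 30.0, max_delay: float = 600.0
) -> float:
    """Seconds to wait after the given 1-based failed attempt.

    Assumes the delay doubles per attempt and is clamped to max_delay.
    """
    return min(max_delay, base_delay * 2 ** (attempt - 1))


# Delays after attempts 1..5 with the default values quoted above.
schedule = [backoff_delay(a) for a in range(1, 6)]
```

Under these assumptions the delays after attempts 1 to 5 are 30, 60, 120, 240 and 480 seconds, and the 600 s cap would first apply from a sixth attempt onward.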

Tests cover payload rendering, persist+enqueue, success, transient retry,
retryable 5xx, permanent 4xx dead-letter, attempt exhaustion, and idempotency.
Migration verified to apply/rollback against Postgres; table/enum/indexes confirmed.

* fix(webhooks): atomic claim, safe success-recording, sweep paging, migration cleanup

Address review feedback on the webhook delivery pipeline:

- deliver_webhook now atomically claims a delivery (conditional UPDATE that
  leases scheduled_for) before sending, so concurrent ARQ executions can't
  double-send (the prior status=='pending' read was non-atomic).
- Recording success is moved out of the dead-letter try-block: if the receiver
  accepted the payload (2xx) but the success DB-write fails, the row is left
  pending for the sweeper to reconcile instead of being dead-lettered.
- The sweep keyset-paginates by id so a backlog over the page size is fully
  drained, and logs the true re-enqueued total.
- Migration downgrade drops the enum via op.execute(DROP TYPE IF EXISTS ...)
  instead of the deprecated op.get_bind().
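
The atomic claim can be illustrated with a conditional UPDATE whose affected-row count decides the winner. This sketch uses the standard-library sqlite3 module and integer epoch seconds purely for demonstration (the change itself targets Postgres); `claim_delivery`, `make_demo_db`, `lease_seconds`, and the exact WHERE clause are assumptions.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """In-memory table with one pending delivery that is already due."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE webhook_deliveries ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL, "
        "scheduled_for INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO webhook_deliveries VALUES (1, 'pending', 0)")
    conn.commit()
    return conn


def claim_delivery(
    conn: sqlite3.Connection, delivery_id: int, now: int, lease_seconds: int = 60
) -> bool:
    """Try to claim a due, pending delivery.

    The UPDATE pushes scheduled_for into the future (the lease), so a second
    caller's WHERE clause no longer matches and it claims nothing.
    """
    cur = conn.execute(
        "UPDATE webhook_deliveries SET scheduled_for = ? "
        "WHERE id = ? AND status = 'pending' AND scheduled_for <= ?",
        (now + lease_seconds, delivery_id, now),
    )
    conn.commit()
    return cur.rowcount == 1
```

Calling `claim_delivery(conn, 1, now=1000)` twice returns `True` and then `False`. Once the lease has expired the row becomes claimable again, which is what lets a sweeper reconcile a delivery whose worker died mid-send.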

* fix(webhooks): idempotent delivery creation and drop secret custom headers

Address the remaining review feedback:

- Add a (workflow_run_id, webhook_node_id) unique constraint and make
  create_webhook_delivery a get-or-create returning (delivery, created). A
  retried run_integrations now reuses the existing row instead of creating and
  sending a duplicate final webhook; only a freshly-created row is enqueued.
- Stop persisting secret-looking custom headers (Authorization, X-API-Key,
  Cookie, ...) in plaintext on the delivery row: they are dropped with a warning
  pointing at the credential store (which is re-resolved securely at send time).
  Non-secret custom headers are unaffected.
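
The get-or-create behaviour can be sketched by leaning on the unique constraint itself. The example below uses sqlite3 for a self-contained demonstration and returns the row id in place of the delivery object; the schema and the insert-then-catch strategy are assumptions, and only the `(delivery, created)` return shape comes from this change.

```python
import sqlite3
from typing import Tuple

SCHEMA = """
CREATE TABLE webhook_deliveries (
    id INTEGER PRIMARY KEY,
    workflow_run_id INTEGER NOT NULL,
    webhook_node_id TEXT NOT NULL,
    UNIQUE (workflow_run_id, webhook_node_id)
)
"""


def create_webhook_delivery(
    conn: sqlite3.Connection, workflow_run_id: int, webhook_node_id: str
) -> Tuple[int, bool]:
    """Return (delivery_id, created); only a created row should be enqueued."""
    try:
        cur = conn.execute(
            "INSERT INTO webhook_deliveries (workflow_run_id, webhook_node_id) "
            "VALUES (?, ?)",
            (workflow_run_id, webhook_node_id),
        )
        conn.commit()
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # The unique constraint fired: a previous run already created the row.
        conn.rollback()
        row = conn.execute(
            "SELECT id FROM webhook_deliveries "
            "WHERE workflow_run_id = ? AND webhook_node_id = ?",
            (workflow_run_id, webhook_node_id),
        ).fetchone()
        return row[0], False
```

A caller would enqueue the delivery task only when `created` is true, so a retried run finds the existing row and sends nothing new.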

* fix(webhooks): harden idempotency key, secret-header match, sweep reclaim id

Address follow-up review feedback:

- webhook_node_id is now NOT NULL so a NULL can't slip past the
  (workflow_run_id, webhook_node_id) unique constraint and create duplicates.
- Secret-header filtering matches normalized markers (auth/token/secret/cookie/
  api-key/...) instead of an exact name list, catching variants like
  X-Custom-Auth-Token while leaving benign headers (e.g. X-Idempotency-Key).
- The sweeper re-enqueues with a reclaim-specific job id (the lease timestamp)
  so reconciling a delivered-but-unrecorded row isn't deduped against the
  original attempt's already-completed ARQ job. The atomic claim still ensures
  at most one send.
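
Marker-based secret-header filtering might look like the sketch below. The normalization (lowercase, strip non-alphanumerics) and the names `SECRET_MARKERS`, `is_secret_header`, and `filter_custom_headers` are assumptions; the marker list contains only the markers named above, and the real list is longer.

```python
import re
from typing import Dict, List, Tuple

# Markers named in the commit message, written in normalized form.
SECRET_MARKERS = ("auth", "token", "secret", "cookie", "apikey")


def is_secret_header(name: str) -> bool:
    """True if the normalized header name contains any secret marker."""
    normalized = re.sub(r"[^a-z0-9]", "", name.lower())
    return any(marker in normalized for marker in SECRET_MARKERS)


def filter_custom_headers(
    headers: Dict[str, str],
) -> Tuple[Dict[str, str], List[str]]:
    """Split headers into (safe to persist, names that were dropped)."""
    kept = {k: v for k, v in headers.items() if not is_secret_header(k)}
    dropped = [k for k in headers if is_secret_header(k)]
    return kept, dropped


kept, dropped = filter_custom_headers(
    {
        "X-Custom-Auth-Token": "abc",
        "X-API-Key": "def",
        "X-Idempotency-Key": "run-42",
        "Content-Type": "application/json",
    }
)
```

Here `kept` holds only `X-Idempotency-Key` and `Content-Type`, while `dropped` lists `X-Custom-Auth-Token` and `X-API-Key`. Substring matching trades some false positives (any name containing "auth") for catching unlisted variants.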

* fix(webhooks): scope delivery rows to workflow org

---------

Co-authored-by: Abhishek Kumar <abhishek@a6k.me>
2026-07-02 21:44:14 +05:30
| Name | Last commit | Date |
| --- | --- | --- |
| alembic | feat(webhooks): durable retrying delivery for final webhooks (#478) | 2026-07-02 21:44:14 +05:30 |
| assets | feat: telephony call transfer (#155) | 2026-02-16 14:33:33 +05:30 |
| db | feat(webhooks): durable retrying delivery for final webhooks (#478) | 2026-07-02 21:44:14 +05:30 |
| errors | Feat/inbound telephony (#113) | 2026-01-12 10:10:30 +05:30 |
| mcp_server | feat: create tools using MCP | 2026-05-31 16:50:44 +05:30 |
| native/rnnoise | Initial Commit 🚀 🚀 | 2025-09-09 14:37:32 +05:30 |
| routes | feat: support inbound vonage calls (#480) | 2026-06-29 16:27:19 +05:30 |
| schemas | feat(webhooks): durable retrying delivery for final webhooks (#478) | 2026-07-02 21:44:14 +05:30 |
| services | feat: add template variable rendering for transfer call destination | 2026-07-02 13:36:29 +05:30 |
| tasks | feat(webhooks): durable retrying delivery for final webhooks (#478) | 2026-07-02 21:44:14 +05:30 |
| tests | feat(webhooks): durable retrying delivery for final webhooks (#478) | 2026-07-02 21:44:14 +05:30 |
| utils | Fix realtime initial greeting handling (#481) | 2026-06-29 17:25:42 +05:30 |
| .cursorignore | Initial Commit 🚀 🚀 | 2025-09-09 14:37:32 +05:30 |
| .dockerignore | Initial Commit 🚀 🚀 | 2025-09-09 14:37:32 +05:30 |
| .env.example | chore: drain active calls before rolling updates (#474) | 2026-06-29 06:00:31 +05:30 |
| .env.test.example | chore: drain active calls before rolling updates (#474) | 2026-06-29 06:00:31 +05:30 |
| .gitignore | Initial Commit 🚀 🚀 | 2025-09-09 14:37:32 +05:30 |
| __init__.py | Initial Commit 🚀 🚀 | 2025-09-09 14:37:32 +05:30 |
| AGENTS.md | feat: add chat based testing for voice agent (#308) | 2026-05-21 15:20:02 +05:30 |
| alembic.ini | chore: bump pipecat version and fix tests (#263) | 2026-05-04 21:35:37 +05:30 |
| app.py | fix: add CORS preflight handler and ACAO header for embed config endpoint (#403) | 2026-06-03 21:27:44 +05:30 |
| CLAUDE.md | Chore/add setup and contributing docs (#90) | 2025-12-27 09:25:20 +05:30 |
| conftest.py | feat: add devcontainer based setup (#352) | 2026-05-25 20:44:22 +05:30 |
| constants.py | feat(webhooks): durable retrying delivery for final webhooks (#478) | 2026-07-02 21:44:14 +05:30 |
| Dockerfile | feat(scripts): free trusted HTTPS via sslip.io for public-IP remote i… (#460) | 2026-06-27 17:19:29 +05:30 |
| enums.py | chore: refactor status processor (#465) | 2026-06-24 22:07:35 +05:30 |
| logging_config.py | feat: add headless mode, redesign floating widget, refactor lifecycle callbacks (#268) | 2026-05-07 12:23:41 +05:30 |
| pyproject.toml | chore(main): release dograh 1.39.0 (#469) | 2026-06-27 17:20:00 +05:30 |
| pytest.ini | feat: refactor node spec and add mcp tools (#244) | 2026-04-21 07:56:16 +05:30 |
| requirements.dev.txt | feat: add devcontainer based setup (#352) | 2026-05-25 20:44:22 +05:30 |
| requirements.txt | Implement cost calculator for Tuber (#471) | 2026-07-02 12:51:14 +05:30 |
| sdk_expose.py | feat: refactor node spec and add mcp tools (#244) | 2026-04-21 07:56:16 +05:30 |