"""Execute integrations (QA analysis, webhooks) after workflow run completion."""

import random
feat(webhooks): durable retrying delivery for final webhooks (#478)
* feat(webhooks): durable retrying delivery for final webhooks
Final webhook nodes were fired inline with a single best-effort httpx POST
(run_integrations._execute_webhook_node). On a transient error the failure was
swallowed at three levels, so ARQ never retried and the final call report was
permanently lost -- leaving downstream receivers stuck (e.g. a CRM showing a
call as still "in conversation").
Replace the one-shot POST with a durable, idempotent delivery pipeline modelled
on the campaign retry pattern (persisted row + scheduled_for + bounded attempts):
- New webhook_deliveries table (WebhookDeliveryModel) is the source of truth.
Payload is rendered once and frozen so retries are deterministic; secrets are
not stored -- the credential is referenced by uuid and re-resolved at send time.
- run_integrations now persists a delivery row and enqueues deliver_webhook with
a deterministic ARQ job id instead of sending inline.
- deliver_webhook (new ARQ task) sends the request and:
* 2xx -> succeeded
* transient -> retry with capped exponential backoff (RequestError /
5xx / 408 / 425 / 429), up to max_attempts then dead_letter
* permanent 4xx -> dead_letter immediately (no pointless looping)
It is idempotent: a non-pending delivery is a no-op, so a duplicate enqueue or
sweeper re-injection can't double-send.
- sweep_webhook_deliveries cron (every 5 min) re-enqueues overdue pending
deliveries so nothing is lost to a worker restart / Redis flush.
- Stable X-Dograh-Delivery-Id / Workflow-Run-Id / Attempt headers let receivers
dedupe retried deliveries.
- enqueue_job now forwards ARQ job options (_job_id, _defer_by); failures log
repr(e) so empty-message errors like ConnectTimeout are diagnosable.
Config via DEFAULT_WEBHOOK_DELIVERY_CONFIG (env-overridable): max_attempts=5,
base_delay=30s, max_delay=600s, timeout=30s.
Tests cover payload rendering, persist+enqueue, success, transient retry,
retryable 5xx, permanent 4xx dead-letter, attempt exhaustion, and idempotency.
Migration verified to apply/rollback against Postgres; table/enum/indexes confirmed.
* fix(webhooks): atomic claim, safe success-recording, sweep paging, migration cleanup
Address review feedback on the webhook delivery pipeline:
- deliver_webhook now atomically claims a delivery (conditional UPDATE that
leases scheduled_for) before sending, so concurrent ARQ executions can't
double-send (the prior status=='pending' read was non-atomic).
- Recording success is moved out of the dead-letter try-block: if the receiver
accepted the payload (2xx) but the success DB-write fails, the row is left
pending for the sweeper to reconcile instead of being dead-lettered.
- The sweep keyset-paginates by id so a backlog over the page size is fully
drained, and logs the true re-enqueued total.
- Migration downgrade drops the enum via op.execute(DROP TYPE IF EXISTS ...)
instead of the deprecated op.get_bind().
* fix(webhooks): idempotent delivery creation and drop secret custom headers
Address the remaining review feedback:
- Add a (workflow_run_id, webhook_node_id) unique constraint and make
create_webhook_delivery a get-or-create returning (delivery, created). A
retried run_integrations now reuses the existing row instead of creating and
sending a duplicate final webhook; only a freshly-created row is enqueued.
- Stop persisting secret-looking custom headers (Authorization, X-API-Key,
Cookie, ...) in plaintext on the delivery row: they are dropped with a warning
pointing at the credential store (which is re-resolved securely at send time).
Non-secret custom headers are unaffected.
* fix(webhooks): harden idempotency key, secret-header match, sweep reclaim id
Address follow-up review feedback:
- webhook_node_id is now NOT NULL so a NULL can't slip past the
(workflow_run_id, webhook_node_id) unique constraint and create duplicates.
- Secret-header filtering matches normalized markers (auth/token/secret/cookie/
api-key/...) instead of an exact name list, catching variants like
X-Custom-Auth-Token while leaving benign headers (e.g. X-Idempotency-Key).
- The sweeper re-enqueues with a reclaim-specific job id (the lease timestamp)
so reconciling a delivered-but-unrecorded row isn't deduped against the
original attempt's already-completed ARQ job. The atomic claim still ensures
at most one send.
* fix(webhooks): scope delivery rows to workflow org
---------
Co-authored-by: Abhishek Kumar <abhishek@a6k.me>
2026-07-02 17:14:14 +01:00
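The retry policy and secret-header filtering described in the commit message above can be illustrated with a minimal sketch. It is not the code from the commit: the function names, constant names, and marker tuple are hypothetical, the doubling formula is only one plausible reading of "capped exponential backoff", and treating 3xx responses as permanent failures is an assumption the message does not confirm. Network-level failures (RequestError) are described as transient too but are not modelled here.

```python
# Hypothetical constants mirroring the defaults quoted in the commit message.
MAX_ATTEMPTS = 5
BASE_DELAY_S = 30
MAX_DELAY_S = 600

# Non-5xx statuses the commit message lists as transient.
RETRYABLE_STATUS = {408, 425, 429}

# Assumed marker list; the commit message elides the full set with "...".
SECRET_MARKERS = ("auth", "token", "secret", "cookie", "apikey")


def classify_status(status_code: int) -> str:
    """Map an HTTP status code to 'succeeded', 'retry', or 'dead_letter'."""
    if 200 <= status_code < 300:
        return "succeeded"
    if status_code >= 500 or status_code in RETRYABLE_STATUS:
        return "retry"
    # Everything else (assumed to include 3xx) is treated as permanent.
    return "dead_letter"


def backoff_delay(attempt: int) -> int:
    """Delay in seconds before the next try, doubling per attempt up to a cap."""
    return min(BASE_DELAY_S * 2 ** (attempt - 1), MAX_DELAY_S)


def next_state(status_code: int, attempt: int) -> str:
    """Apply the attempt budget on top of the status classification."""
    outcome = classify_status(status_code)
    if outcome == "retry" and attempt >= MAX_ATTEMPTS:
        return "dead_letter"
    return outcome


def is_secret_header(name: str) -> bool:
    """Match normalized markers rather than an exact list of header names."""
    normalized = "".join(ch for ch in name.lower() if ch.isalnum())
    return any(marker in normalized for marker in SECRET_MARKERS)
```

Under these assumptions, a 503 on the first attempt is retried after 30 seconds, while a 503 on the fifth attempt is dead-lettered. Substring matching on the normalized name catches variants such as X-Custom-Auth-Token without flagging X-Idempotency-Key, at the cost of occasional false positives on names that merely contain a marker.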
from datetime import UTC, datetime
from typing import Any, Dict, Optional

from loguru import logger
from pipecat.utils.enums import EndTaskReason
from pipecat.utils.run_context import set_current_org_id, set_current_run_id
from pydantic import ValidationError

from api.constants import BACKEND_API_ENDPOINT, DEFAULT_WEBHOOK_DELIVERY_CONFIG
from api.db import db_client
from api.db.models import WorkflowRunModel
from api.enums import OrganizationConfigurationKey
from api.services.integrations import (
    IntegrationCompletionContext,
    has_completion_handlers,
    run_completion_handlers,
)
from api.services.pipecat.tracing_config import register_org_langfuse_credentials
from api.services.workflow.dto import (
    QANodeData,
    QARFNode,
    WebhookNodeData,
    WebhookRFNode,
)
from api.services.workflow.qa import run_per_node_qa_analysis
from api.tasks.function_names import FunctionNames
from api.utils.recording_artifacts import get_recording_storage_key
from api.utils.template_renderer import render_template


def _should_skip_qa(
    qa_data: QANodeData,
    workflow_run: WorkflowRunModel,
) -> str | None:
    """Check whether QA analysis should be skipped for this call.

    Returns a reason string if the call should be skipped, or None if it should proceed.
    """
    usage_info = workflow_run.usage_info or {}
    call_duration = usage_info.get("call_duration_seconds")
    if call_duration is not None and call_duration < qa_data.qa_min_call_duration:
        return (
            f"call duration ({call_duration:.1f}s) below minimum "
            f"({qa_data.qa_min_call_duration}s)"
        )

    if not qa_data.qa_voicemail_calls:
        gathered_context = workflow_run.gathered_context or {}
        call_disposition = gathered_context.get("call_disposition", "")
        if call_disposition == EndTaskReason.VOICEMAIL_DETECTED.value:
            return "voicemail call and QA voicemail calls is disabled"

    if qa_data.qa_sample_rate < 100:
        roll = random.randint(1, 100)
        if roll > qa_data.qa_sample_rate:
            return (
                f"excluded by sampling ({qa_data.qa_sample_rate}% sample rate, "
                f"rolled {roll})"
            )

    return None


async def _run_qa_nodes(
    qa_nodes: list[dict],
    workflow_run: WorkflowRunModel,
    workflow_run_id: int,
    workflow_definition: dict,
    definition_id: int | None,
) -> Dict[str, Any]:
    """Run QA analysis for each enabled QA node and aggregate results.

    Returns:
        Dict keyed by node ID with QA analysis results.
    """
    results: Dict[str, Any] = {}

    for node in qa_nodes:
        node_id = node.get("id", "unknown")
        try:
            qa_node = QARFNode.model_validate(node)
        except ValidationError as e:
            logger.warning(f"QA node #{node_id} failed validation, skipping: {e}")
            results[f"qa_{node_id}"] = {"error": "validation_failed"}
            continue

        qa_data = qa_node.data
        node_name = qa_data.name

        if not qa_data.qa_enabled:
            logger.debug(f"QA node '{node_name}' is disabled, skipping")
            continue

        skip_reason = _should_skip_qa(qa_data, workflow_run)
        if skip_reason:
            logger.info(f"Skipping QA node '{node_name}' (#{node_id}): {skip_reason}")
            results[f"qa_{node_id}"] = {"skipped": True, "reason": skip_reason}
            continue

        try:
            logger.info(f"Running QA analysis for node '{node_name}' (#{node_id})")
            result = await run_per_node_qa_analysis(
                qa_data,
                workflow_run,
                workflow_run_id,
                workflow_definition,
                definition_id,
            )
            results[f"qa_{node_id}"] = result
            # Log summary from node_results
            node_results = result.get("node_results", {})
            logger.info(
                f"QA analysis complete for '{node_name}': "
                f"{len(node_results)} nodes analyzed"
            )
        except Exception as e:
            logger.error(f"QA analysis failed for node '{node_name}': {e}")
            results[f"qa_{node_id}"] = {"error": str(e)}

    return results


async def _update_usage_info_with_qa_tokens(
    workflow_run_id: int,
    workflow_run: WorkflowRunModel,
    qa_results: Dict[str, Any],
) -> None:
    """Add QA analysis LLM token usage to the workflow run's usage_info."""
    try:
        usage_info = dict(workflow_run.usage_info or {})
        llm_usage = dict(usage_info.get("llm", {}))

        for _node_key, result in qa_results.items():
            token_usage = result.get("token_usage")
            model = result.get("model")
            if not token_usage or not model:
                continue

            key = f"QAAnalysis|||{model}"
            if key in llm_usage:
                # Aggregate if multiple QA nodes use the same model
                existing = llm_usage[key]
                for field in (
                    "prompt_tokens",
                    "completion_tokens",
                    "total_tokens",
                    "cache_read_input_tokens",
                ):
                    existing[field] = (existing.get(field) or 0) + (
                        token_usage.get(field) or 0
                    )
            else:
                llm_usage[key] = token_usage

        usage_info["llm"] = llm_usage
        await db_client.update_workflow_run(
            run_id=workflow_run_id, usage_info=usage_info
        )
        logger.info(f"Updated usage_info with QA token usage for run {workflow_run_id}")
    except Exception as e:
        logger.error(f"Failed to update usage_info with QA tokens: {e}")


async def run_integrations_post_workflow_run(_ctx, workflow_run_id: int):
    """
    Run integrations after a workflow run completes.

    This function:
    1. Gets the workflow run and its contexts
    2. Runs QA analysis nodes (if any)
    3. Stores QA results in annotations
    4. Executes webhook nodes with QA results available in render context
    """
    set_current_run_id(workflow_run_id)
    logger.info("Running integrations for workflow run")

    try:
        # Step 1: Get workflow run with full context
        workflow_run, organization_id = await db_client.get_workflow_run_with_context(
            workflow_run_id
        )

        if not workflow_run or not workflow_run.workflow:
            logger.warning("Workflow run or workflow not found")
            return

        if not organization_id:
            logger.warning("No organization found, skipping integrations")
            return

        # Set org context for tracing and register org-specific Langfuse credentials
        # FIXME: If an org removes Langfuse credentials during an existing deployment,
        # we should unregister the existing Langfuse credentials for that org.
        set_current_org_id(organization_id)
        langfuse_config = await db_client.get_configuration_value(
            organization_id,
            OrganizationConfigurationKey.LANGFUSE_CREDENTIALS.value,
        )
        if langfuse_config:
            register_org_langfuse_credentials(
                org_id=organization_id,
                host=langfuse_config.get("host"),
                public_key=langfuse_config.get("public_key"),
                secret_key=langfuse_config.get("secret_key"),
            )

        # Step 2: Get workflow definition from the run's pinned version
        workflow_definition = workflow_run.definition.workflow_json
        definition_id = workflow_run.definition.id

        if not workflow_definition:
            logger.debug("No workflow definition, skipping integrations")
            return

        # Step 3: Extract integration nodes
        nodes = workflow_definition.get("nodes", [])
        qa_nodes = [n for n in nodes if n.get("type") == "qa"]
        webhook_nodes = [n for n in nodes if n.get("type") == "webhook"]
        has_registered_integrations = has_completion_handlers(workflow_definition)

        # Step 4: Generate a public access token for any run that needs post-call work.
        has_campaign = workflow_run.campaign_id is not None
        if (
            not webhook_nodes
            and not qa_nodes
            and not has_registered_integrations
            and not has_campaign
        ):
            logger.debug("No integration nodes and no campaign, skipping")
            return

        public_token = await db_client.ensure_public_access_token(workflow_run_id)

        # Step 5: Run QA analysis before webhooks
        if qa_nodes:
            logger.info(f"Found {len(qa_nodes)} QA nodes to execute")
            qa_results = await _run_qa_nodes(
                qa_nodes,
                workflow_run,
                workflow_run_id,
                workflow_definition,
                definition_id,
            )

            if qa_results:
                # Add QA token usage to workflow run's usage_info
                await _update_usage_info_with_qa_tokens(
                    workflow_run_id, workflow_run, qa_results
                )

                # Collect unique tags across all QA node results for top-level filtering
                all_tags: set[str] = set()
                for qa_key, qa_result in qa_results.items():
                    for node_result in qa_result.get("node_results", {}).values():
                        for tag in node_result.get("tags", []):
                            if isinstance(tag, str):
                                all_tags.add(tag)
                            elif isinstance(tag, dict) and "tag" in tag:
                                all_tags.add(tag["tag"])
                if all_tags:
                    qa_results["tags"] = sorted(all_tags)

                await db_client.update_workflow_run(
                    workflow_run_id, annotations=qa_results
                )

                # Re-fetch workflow_run to get updated annotations
                workflow_run, _ = await db_client.get_workflow_run_with_context(
                    workflow_run_id
                )

        # Step 6: Run registered third-party integrations after uploads are complete
        integration_results = await run_completion_handlers(
            context=IntegrationCompletionContext(
                workflow_run_id=workflow_run_id,
                workflow_run=workflow_run,
                workflow_definition=workflow_definition,
                definition_id=definition_id,
                organization_id=organization_id,
                public_token=public_token,
            )
        )

        if integration_results:
            await db_client.update_workflow_run(
                workflow_run_id, annotations=integration_results
            )
            workflow_run, _ = await db_client.get_workflow_run_with_context(
                workflow_run_id
            )

        # Step 7: Execute webhooks
        if not webhook_nodes:
            logger.debug("No webhook nodes in workflow")
            return

        logger.info(f"Found {len(webhook_nodes)} webhook nodes to execute")

        # Step 8: Build render context (includes annotations from QA and integrations)
        render_context = _build_render_context(workflow_run, public_token)

        # Step 9: Execute each webhook node
        for node in webhook_nodes:
            node_id = node.get("id", "unknown")
            try:
                webhook_node = WebhookRFNode.model_validate(node)
            except ValidationError as e:
                logger.warning(
                    f"Webhook node #{node_id} failed validation, skipping: {e}"
                )
                continue

            webhook_data = webhook_node.data
            try:
await _enqueue_webhook_delivery(
|
2025-12-22 14:08:30 +05:30
|
|
|
webhook_data=webhook_data,
|
|
|
|
|
render_context=render_context,
|
|
|
|
|
organization_id=organization_id,
|
feat(webhooks): durable retrying delivery for final webhooks (#478)
* feat(webhooks): durable retrying delivery for final webhooks
Final webhook nodes were fired inline with a single best-effort httpx POST
(run_integrations._execute_webhook_node). On a transient error the failure was
swallowed at three levels, so ARQ never retried and the final call report was
permanently lost -- leaving downstream receivers stuck (e.g. a CRM showing a
call as still "in conversation").
Replace the one-shot POST with a durable, idempotent delivery pipeline modelled
on the campaign retry pattern (persisted row + scheduled_for + bounded attempts):
- New webhook_deliveries table (WebhookDeliveryModel) is the source of truth.
Payload is rendered once and frozen so retries are deterministic; secrets are
not stored -- the credential is referenced by uuid and re-resolved at send time.
- run_integrations now persists a delivery row and enqueues deliver_webhook with
a deterministic ARQ job id instead of sending inline.
- deliver_webhook (new ARQ task) sends the request and:
* 2xx -> succeeded
* transient -> retry with capped exponential backoff (RequestError /
5xx / 408 / 425 / 429), up to max_attempts then dead_letter
* permanent 4xx -> dead_letter immediately (no pointless looping)
It is idempotent: a non-pending delivery is a no-op, so a duplicate enqueue or
sweeper re-injection can't double-send.
- sweep_webhook_deliveries cron (every 5 min) re-enqueues overdue pending
deliveries so nothing is lost to a worker restart / Redis flush.
- Stable X-Dograh-Delivery-Id / Workflow-Run-Id / Attempt headers let receivers
dedupe retried deliveries.
- enqueue_job now forwards ARQ job options (_job_id, _defer_by); failures log
repr(e) so empty-message errors like ConnectTimeout are diagnosable.
Config via DEFAULT_WEBHOOK_DELIVERY_CONFIG (env-overridable): max_attempts=5,
base_delay=30s, max_delay=600s, timeout=30s.
Tests cover payload rendering, persist+enqueue, success, transient retry,
retryable 5xx, permanent 4xx dead-letter, attempt exhaustion, and idempotency.
Migration verified to apply/rollback against Postgres; table/enum/indexes confirmed.
* fix(webhooks): atomic claim, safe success-recording, sweep paging, migration cleanup
Address review feedback on the webhook delivery pipeline:
- deliver_webhook now atomically claims a delivery (conditional UPDATE that
leases scheduled_for) before sending, so concurrent ARQ executions can't
double-send (the prior status=='pending' read was non-atomic).
- Recording success is moved out of the dead-letter try-block: if the receiver
accepted the payload (2xx) but the success DB-write fails, the row is left
pending for the sweeper to reconcile instead of being dead-lettered.
- The sweep keyset-paginates by id so a backlog over the page size is fully
drained, and logs the true re-enqueued total.
- Migration downgrade drops the enum via op.execute(DROP TYPE IF EXISTS ...)
instead of the deprecated op.get_bind().
* fix(webhooks): idempotent delivery creation and drop secret custom headers
Address the remaining review feedback:
- Add a (workflow_run_id, webhook_node_id) unique constraint and make
create_webhook_delivery a get-or-create returning (delivery, created). A
retried run_integrations now reuses the existing row instead of creating and
sending a duplicate final webhook; only a freshly-created row is enqueued.
- Stop persisting secret-looking custom headers (Authorization, X-API-Key,
Cookie, ...) in plaintext on the delivery row: they are dropped with a warning
pointing at the credential store (which is re-resolved securely at send time).
Non-secret custom headers are unaffected.
* fix(webhooks): harden idempotency key, secret-header match, sweep reclaim id
Address follow-up review feedback:
- webhook_node_id is now NOT NULL so a NULL can't slip past the
(workflow_run_id, webhook_node_id) unique constraint and create duplicates.
- Secret-header filtering matches normalized markers (auth/token/secret/cookie/
api-key/...) instead of an exact name list, catching variants like
X-Custom-Auth-Token while leaving benign headers (e.g. X-Idempotency-Key).
- The sweeper re-enqueues with a reclaim-specific job id (the lease timestamp)
so reconciling a delivered-but-unrecorded row isn't deduped against the
original attempt's already-completed ARQ job. The atomic claim still ensures
at most one send.
* fix(webhooks): scope delivery rows to workflow org
---------
Co-authored-by: Abhishek Kumar <abhishek@a6k.me>
2026-07-02 17:14:14 +01:00
|
|
|
workflow_run_id=workflow_run_id,
|
|
|
|
|
webhook_node_id=str(node_id),
|
2025-12-22 14:08:30 +05:30
|
|
|
)
|
|
|
|
|
except Exception as e:
|
feat(webhooks): durable retrying delivery for final webhooks (#478)
* feat(webhooks): durable retrying delivery for final webhooks
Final webhook nodes were fired inline with a single best-effort httpx POST
(run_integrations._execute_webhook_node). On a transient error the failure was
swallowed at three levels, so ARQ never retried and the final call report was
permanently lost -- leaving downstream receivers stuck (e.g. a CRM showing a
call as still "in conversation").
Replace the one-shot POST with a durable, idempotent delivery pipeline modelled
on the campaign retry pattern (persisted row + scheduled_for + bounded attempts):
- New webhook_deliveries table (WebhookDeliveryModel) is the source of truth.
Payload is rendered once and frozen so retries are deterministic; secrets are
not stored -- the credential is referenced by uuid and re-resolved at send time.
- run_integrations now persists a delivery row and enqueues deliver_webhook with
a deterministic ARQ job id instead of sending inline.
- deliver_webhook (new ARQ task) sends the request and:
* 2xx -> succeeded
* transient -> retry with capped exponential backoff (RequestError /
5xx / 408 / 425 / 429), up to max_attempts then dead_letter
* permanent 4xx -> dead_letter immediately (no pointless looping)
It is idempotent: a non-pending delivery is a no-op, so a duplicate enqueue or
sweeper re-injection can't double-send.
- sweep_webhook_deliveries cron (every 5 min) re-enqueues overdue pending
deliveries so nothing is lost to a worker restart / Redis flush.
- Stable X-Dograh-Delivery-Id / Workflow-Run-Id / Attempt headers let receivers
dedupe retried deliveries.
- enqueue_job now forwards ARQ job options (_job_id, _defer_by); failures log
repr(e) so empty-message errors like ConnectTimeout are diagnosable.
Config via DEFAULT_WEBHOOK_DELIVERY_CONFIG (env-overridable): max_attempts=5,
base_delay=30s, max_delay=600s, timeout=30s.
Tests cover payload rendering, persist+enqueue, success, transient retry,
retryable 5xx, permanent 4xx dead-letter, attempt exhaustion, and idempotency.
Migration verified to apply/rollback against Postgres; table/enum/indexes confirmed.
* fix(webhooks): atomic claim, safe success-recording, sweep paging, migration cleanup
Address review feedback on the webhook delivery pipeline:
- deliver_webhook now atomically claims a delivery (conditional UPDATE that
leases scheduled_for) before sending, so concurrent ARQ executions can't
double-send (the prior status=='pending' read was non-atomic).
- Recording success is moved out of the dead-letter try-block: if the receiver
accepted the payload (2xx) but the success DB-write fails, the row is left
pending for the sweeper to reconcile instead of being dead-lettered.
- The sweep keyset-paginates by id so a backlog over the page size is fully
drained, and logs the true re-enqueued total.
- Migration downgrade drops the enum via op.execute(DROP TYPE IF EXISTS ...)
instead of the deprecated op.get_bind().
* fix(webhooks): idempotent delivery creation and drop secret custom headers
Address the remaining review feedback:
- Add a (workflow_run_id, webhook_node_id) unique constraint and make
create_webhook_delivery a get-or-create returning (delivery, created). A
retried run_integrations now reuses the existing row instead of creating and
sending a duplicate final webhook; only a freshly-created row is enqueued.
- Stop persisting secret-looking custom headers (Authorization, X-API-Key,
Cookie, ...) in plaintext on the delivery row: they are dropped with a warning
pointing at the credential store (which is re-resolved securely at send time).
Non-secret custom headers are unaffected.
* fix(webhooks): harden idempotency key, secret-header match, sweep reclaim id
Address follow-up review feedback:
- webhook_node_id is now NOT NULL so a NULL can't slip past the
(workflow_run_id, webhook_node_id) unique constraint and create duplicates.
- Secret-header filtering matches normalized markers (auth/token/secret/cookie/
api-key/...) instead of an exact name list, catching variants like
X-Custom-Auth-Token while leaving benign headers (e.g. X-Idempotency-Key).
- The sweeper re-enqueues with a reclaim-specific job id (the lease timestamp)
so reconciling a delivered-but-unrecorded row isn't deduped against the
original attempt's already-completed ARQ job. The atomic claim still ensures
at most one send.
* fix(webhooks): scope delivery rows to workflow org
---------
Co-authored-by: Abhishek Kumar <abhishek@a6k.me>
2026-07-02 17:14:14 +01:00
|
|
|
logger.warning(f"Failed to enqueue webhook '{webhook_data.name}': {e}")
|
2025-09-09 14:37:32 +05:30
|
|
|
|
|
|
|
|
except Exception as e:
|
2026-02-25 13:53:30 +05:30
|
|
|
logger.error(f"Error running integrations: {e}", exc_info=True)
|
2025-09-09 14:37:32 +05:30
|
|
|
raise


def _build_render_context(
    workflow_run: WorkflowRunModel, public_token: Optional[str] = None
) -> Dict[str, Any]:
    """Build the context dict for template rendering.

    Args:
        workflow_run: The workflow run model
        public_token: Optional public access token for download URLs

    Returns:
        Dict containing all fields available for template rendering
    """
    extra = workflow_run.extra or {}
    user_recording_key = get_recording_storage_key(extra, "user")
    bot_recording_key = get_recording_storage_key(extra, "bot")

    context = {
        # Top-level fields
        "workflow_run_id": workflow_run.id,
        "workflow_run_name": workflow_run.name,
        "workflow_id": workflow_run.workflow_id,
        "workflow_name": workflow_run.workflow.name if workflow_run.workflow else None,
        "campaign_id": workflow_run.campaign_id,
        "call_time": (workflow_run.created_at or datetime.now(UTC)).isoformat(),
        # Nested contexts
        "initial_context": workflow_run.initial_context or {},
        "gathered_context": workflow_run.gathered_context or {},
        "cost_info": workflow_run.usage_info or {},
        # Annotations (includes QA results)
        "annotations": workflow_run.annotations or {},
        "extra": extra,
    }

    # Add public download URLs if token is available
    if public_token:
        base_url = (
            f"{BACKEND_API_ENDPOINT}/api/v1/public/download/workflow/{public_token}"
        )
        context["recording_url"] = (
            f"{base_url}/recording" if workflow_run.recording_url else None
        )
        context["transcript_url"] = (
            f"{base_url}/transcript" if workflow_run.transcript_url else None
        )
        context["user_recording_url"] = (
            f"{base_url}/user_recording" if user_recording_key else None
        )
        context["bot_recording_url"] = (
            f"{base_url}/bot_recording" if bot_recording_key else None
        )
    else:
        context["recording_url"] = workflow_run.recording_url
        context["transcript_url"] = workflow_run.transcript_url
        context["user_recording_url"] = user_recording_key
        context["bot_recording_url"] = bot_recording_key

    return context
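
# Illustrative sketch, not part of this module: the URL branch above either
# builds tokenized public download URLs or falls back to the stored values.
# ``pick_download_urls`` and ``EXAMPLE_ENDPOINT`` are hypothetical names that
# stand in for the real function and ``BACKEND_API_ENDPOINT``; only the
# recording/transcript pair is shown.

```python
from typing import Any, Dict, Optional

# Hypothetical stand-in for the module's BACKEND_API_ENDPOINT setting.
EXAMPLE_ENDPOINT = "https://api.example.com"


def pick_download_urls(
    public_token: Optional[str],
    recording_url: Optional[str],
    transcript_url: Optional[str],
) -> Dict[str, Any]:
    """Return tokenized public URLs when a token exists, else the stored values."""
    if public_token:
        base_url = f"{EXAMPLE_ENDPOINT}/api/v1/public/download/workflow/{public_token}"
        return {
            # A public URL is only emitted when the underlying artifact exists.
            "recording_url": f"{base_url}/recording" if recording_url else None,
            "transcript_url": f"{base_url}/transcript" if transcript_url else None,
        }
    return {"recording_url": recording_url, "transcript_url": transcript_url}


urls = pick_download_urls("tok123", "recordings/1.wav", None)
```

# With a token, a missing artifact yields ``None`` rather than a dead link;
# without a token, the stored values pass through unchanged.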


def _build_webhook_payload(
    webhook_data: WebhookNodeData, render_context: Dict[str, Any]
) -> Any:
    """Render the webhook payload once, so retries are deterministic.

    Always surfaces the call disposition on the outgoing payload, even when the
    template author didn't reference it. Fill only if absent so a template that
    sets it explicitly keeps its own value.
    """
    payload = render_template(webhook_data.payload_template or {}, render_context)

    if isinstance(payload, dict):
        gathered_context = render_context.get("gathered_context") or {}
        payload.setdefault(
            "call_disposition", gathered_context.get("call_disposition", "")
        )

    return payload
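
# Illustrative sketch, not part of this module: the "fill only if absent" rule
# relies on ``dict.setdefault``. ``fill_disposition`` is a hypothetical helper
# that reproduces just that step, with the template rendering left out.

```python
from typing import Any, Dict


def fill_disposition(payload: Any, render_context: Dict[str, Any]) -> Any:
    """Add call_disposition only when the rendered payload lacks the key."""
    if isinstance(payload, dict):
        gathered = render_context.get("gathered_context") or {}
        # setdefault writes only when the key is missing, so explicit values win.
        payload.setdefault("call_disposition", gathered.get("call_disposition", ""))
    return payload


ctx = {"gathered_context": {"call_disposition": "interested"}}
filled = fill_disposition({"lead": 7}, ctx)
kept = fill_disposition({"call_disposition": "custom"}, ctx)
empty = fill_disposition({"lead": 7}, {})
untouched = fill_disposition(["not", "a", "dict"], ctx)
```

# "Absent" means the key is missing: a template that sets ``call_disposition``
# to an empty or null value keeps that value. Non-dict payloads pass through
# unchanged.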


# Substrings that mark a header as likely carrying a secret. Matched against the
# normalized key so variants are caught too (e.g. ``X-Custom-Auth-Token``,
# ``My-Api-Key``), not just exact names. Their values are NOT persisted on the
# delivery row (which would store them in plaintext); secrets belong in the
# credential store, re-resolved at send time. Bare "key" is intentionally absent
# to avoid dropping benign headers like ``X-Idempotency-Key``.
_SECRET_HEADER_MARKERS = (
    "authorization",
    "auth",
    "token",
    "secret",
    "password",
    "passwd",
    "cookie",
    "credential",
    "api-key",
    "apikey",
    "api_key",
    "access-key",
)


def _looks_like_secret_header(key: str) -> bool:
    normalized = key.strip().lower()
    return any(marker in normalized for marker in _SECRET_HEADER_MARKERS)
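
# Illustrative sketch, not part of this module: how substring matching against
# the normalized key behaves for a few header names. ``MARKERS`` copies the
# tuple above and ``looks_like_secret`` is a hypothetical standalone copy of the
# check; ``X-Author`` is an invented header name used to show the trade-off.

```python
MARKERS = (
    "authorization", "auth", "token", "secret", "password", "passwd",
    "cookie", "credential", "api-key", "apikey", "api_key", "access-key",
)


def looks_like_secret(key: str) -> bool:
    """True when any marker occurs anywhere in the lowercased, trimmed key."""
    normalized = key.strip().lower()
    return any(marker in normalized for marker in MARKERS)


results = {
    name: looks_like_secret(name)
    for name in ("X-Custom-Auth-Token", "My-Api-Key", "X-Idempotency-Key", "X-Author")
}
```

# Substring matching errs on the side of dropping: ``X-Author`` is flagged
# because it contains ``auth``, while ``X-Idempotency-Key`` passes since bare
# "key" is not a marker.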


def _safe_custom_headers(
    webhook_data: WebhookNodeData, webhook_name: str
) -> list[dict]:
    """Custom headers to persist, with secret-looking ones dropped.

    Persisting arbitrary header values would store credentials (Authorization,
    X-API-Key, ...) in plaintext on the delivery row. Drop those and tell the
    operator to use a credential instead.
    """
    safe = []
    for h in webhook_data.custom_headers or []:
        if not (h.key and h.value):
            continue
        if _looks_like_secret_header(h.key):
            logger.warning(
                f"Webhook '{webhook_name}' custom header '{h.key}' looks like a "
                f"secret; it will not be stored or sent. Use a credential instead."
            )
            continue
        safe.append({"key": h.key, "value": h.value})
    return safe
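
# Illustrative sketch, not part of this module: the filtering loop above skips
# incomplete entries and secret-looking keys, and keeps the rest as plain
# dicts. ``Header`` and ``filter_headers`` are hypothetical stand-ins for the
# custom header model and ``_safe_custom_headers``; the secret check is passed
# in as a predicate and logging is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Header:
    """Hypothetical stand-in for one custom_headers entry."""

    key: Optional[str]
    value: Optional[str]


def filter_headers(
    headers: List[Header], is_secret: Callable[[str], bool]
) -> List[Dict[str, str]]:
    """Keep complete, non-secret headers as {"key": ..., "value": ...} dicts."""
    safe = []
    for h in headers:
        if not (h.key and h.value):
            continue  # skip entries with a missing or empty key/value
        if is_secret(h.key):
            continue  # never persist secret-looking headers
        safe.append({"key": h.key, "value": h.value})
    return safe


kept_headers = filter_headers(
    [
        Header("X-Source", "dograh"),
        Header("Authorization", "Bearer abc"),
        Header("X-Empty", ""),
        Header(None, "orphan"),
    ],
    is_secret=lambda key: "auth" in key.lower(),
)
```

# Only ``X-Source`` survives here: the other three entries are either
# secret-looking or incomplete.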
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
async def _enqueue_webhook_delivery(
|
2026-04-21 07:56:16 +05:30
|
|
|
webhook_data: WebhookNodeData,
|
2025-12-22 14:08:30 +05:30
|
|
|
render_context: Dict[str, Any],
|
|
|
|
|
organization_id: int,
|
feat(webhooks): durable retrying delivery for final webhooks (#478)
* feat(webhooks): durable retrying delivery for final webhooks
Final webhook nodes were fired inline with a single best-effort httpx POST
(run_integrations._execute_webhook_node). On a transient error the failure was
swallowed at three levels, so ARQ never retried and the final call report was
permanently lost -- leaving downstream receivers stuck (e.g. a CRM showing a
call as still "in conversation").
Replace the one-shot POST with a durable, idempotent delivery pipeline modelled
on the campaign retry pattern (persisted row + scheduled_for + bounded attempts):
- New webhook_deliveries table (WebhookDeliveryModel) is the source of truth.
Payload is rendered once and frozen so retries are deterministic; secrets are
not stored -- the credential is referenced by uuid and re-resolved at send time.
- run_integrations now persists a delivery row and enqueues deliver_webhook with
a deterministic ARQ job id instead of sending inline.
- deliver_webhook (new ARQ task) sends the request and:
* 2xx -> succeeded
* transient -> retry with capped exponential backoff (RequestError /
5xx / 408 / 425 / 429), up to max_attempts then dead_letter
* permanent 4xx -> dead_letter immediately (no pointless looping)
It is idempotent: a non-pending delivery is a no-op, so a duplicate enqueue or
sweeper re-injection can't double-send.
- sweep_webhook_deliveries cron (every 5 min) re-enqueues overdue pending
deliveries so nothing is lost to a worker restart / Redis flush.
- Stable X-Dograh-Delivery-Id / Workflow-Run-Id / Attempt headers let receivers
dedupe retried deliveries.
- enqueue_job now forwards ARQ job options (_job_id, _defer_by); failures log
repr(e) so empty-message errors like ConnectTimeout are diagnosable.
Config via DEFAULT_WEBHOOK_DELIVERY_CONFIG (env-overridable): max_attempts=5,
base_delay=30s, max_delay=600s, timeout=30s.
Tests cover payload rendering, persist+enqueue, success, transient retry,
retryable 5xx, permanent 4xx dead-letter, attempt exhaustion, and idempotency.
Migration verified to apply/rollback against Postgres; table/enum/indexes confirmed.
* fix(webhooks): atomic claim, safe success-recording, sweep paging, migration cleanup
Address review feedback on the webhook delivery pipeline:
- deliver_webhook now atomically claims a delivery (conditional UPDATE that
leases scheduled_for) before sending, so concurrent ARQ executions can't
double-send (the prior status=='pending' read was non-atomic).
- Recording success is moved out of the dead-letter try-block: if the receiver
accepted the payload (2xx) but the success DB-write fails, the row is left
pending for the sweeper to reconcile instead of being dead-lettered.
- The sweep keyset-paginates by id so a backlog over the page size is fully
drained, and logs the true re-enqueued total.
- Migration downgrade drops the enum via op.execute(DROP TYPE IF EXISTS ...)
instead of the deprecated op.get_bind().
* fix(webhooks): idempotent delivery creation and drop secret custom headers
Address the remaining review feedback:
- Add a (workflow_run_id, webhook_node_id) unique constraint and make
create_webhook_delivery a get-or-create returning (delivery, created). A
retried run_integrations now reuses the existing row instead of creating and
sending a duplicate final webhook; only a freshly-created row is enqueued.
- Stop persisting secret-looking custom headers (Authorization, X-API-Key,
Cookie, ...) in plaintext on the delivery row: they are dropped with a warning
pointing at the credential store (which is re-resolved securely at send time).
Non-secret custom headers are unaffected.
* fix(webhooks): harden idempotency key, secret-header match, sweep reclaim id
Address follow-up review feedback:
- webhook_node_id is now NOT NULL so a NULL can't slip past the
(workflow_run_id, webhook_node_id) unique constraint and create duplicates.
- Secret-header filtering matches normalized markers (auth/token/secret/cookie/
api-key/...) instead of an exact name list, catching variants like
X-Custom-Auth-Token while leaving benign headers (e.g. X-Idempotency-Key).
- The sweeper re-enqueues with a reclaim-specific job id (the lease timestamp)
so reconciling a delivered-but-unrecorded row isn't deduped against the
original attempt's already-completed ARQ job. The atomic claim still ensures
at most one send.
* fix(webhooks): scope delivery rows to workflow org
---------
Co-authored-by: Abhishek Kumar <abhishek@a6k.me>
2026-07-02 17:14:14 +01:00
|
|
|
workflow_run_id: int,
|
|
|
|
|
webhook_node_id: str,
|
|
|
|
|
) -> None:
|
|
|
|
|
"""Persist a durable delivery record and enqueue its first send attempt.
|
2025-09-09 14:37:32 +05:30
|
|
|
|
feat(webhooks): durable retrying delivery for final webhooks (#478)
* feat(webhooks): durable retrying delivery for final webhooks
Final webhook nodes were fired inline with a single best-effort httpx POST
(run_integrations._execute_webhook_node). On a transient error the failure was
swallowed at three levels, so ARQ never retried and the final call report was
permanently lost -- leaving downstream receivers stuck (e.g. a CRM showing a
call as still "in conversation").
Replace the one-shot POST with a durable, idempotent delivery pipeline modelled
on the campaign retry pattern (persisted row + scheduled_for + bounded attempts):
- New webhook_deliveries table (WebhookDeliveryModel) is the source of truth.
Payload is rendered once and frozen so retries are deterministic; secrets are
not stored -- the credential is referenced by uuid and re-resolved at send time.
- run_integrations now persists a delivery row and enqueues deliver_webhook with
a deterministic ARQ job id instead of sending inline.
- deliver_webhook (new ARQ task) sends the request and:
* 2xx -> succeeded
* transient -> retry with capped exponential backoff (RequestError /
5xx / 408 / 425 / 429), up to max_attempts then dead_letter
* permanent 4xx -> dead_letter immediately (no pointless looping)
It is idempotent: a non-pending delivery is a no-op, so a duplicate enqueue or
sweeper re-injection can't double-send.
- sweep_webhook_deliveries cron (every 5 min) re-enqueues overdue pending
deliveries so nothing is lost to a worker restart / Redis flush.
- Stable X-Dograh-Delivery-Id / Workflow-Run-Id / Attempt headers let receivers
dedupe retried deliveries.
- enqueue_job now forwards ARQ job options (_job_id, _defer_by); failures log
repr(e) so empty-message errors like ConnectTimeout are diagnosable.
Config via DEFAULT_WEBHOOK_DELIVERY_CONFIG (env-overridable): max_attempts=5,
base_delay=30s, max_delay=600s, timeout=30s.
Tests cover payload rendering, persist+enqueue, success, transient retry,
retryable 5xx, permanent 4xx dead-letter, attempt exhaustion, and idempotency.
Migration verified to apply/rollback against Postgres; table/enum/indexes confirmed.
* fix(webhooks): atomic claim, safe success-recording, sweep paging, migration cleanup
Address review feedback on the webhook delivery pipeline:
- deliver_webhook now atomically claims a delivery (conditional UPDATE that
leases scheduled_for) before sending, so concurrent ARQ executions can't
double-send (the prior status=='pending' read was non-atomic).
- Recording success is moved out of the dead-letter try-block: if the receiver
accepted the payload (2xx) but the success DB-write fails, the row is left
pending for the sweeper to reconcile instead of being dead-lettered.
- The sweep keyset-paginates by id so a backlog over the page size is fully
drained, and logs the true re-enqueued total.
- Migration downgrade drops the enum via op.execute(DROP TYPE IF EXISTS ...)
instead of the deprecated op.get_bind().
* fix(webhooks): idempotent delivery creation and drop secret custom headers
Address the remaining review feedback:
- Add a (workflow_run_id, webhook_node_id) unique constraint and make
create_webhook_delivery a get-or-create returning (delivery, created). A
retried run_integrations now reuses the existing row instead of creating and
sending a duplicate final webhook; only a freshly-created row is enqueued.
- Stop persisting secret-looking custom headers (Authorization, X-API-Key,
Cookie, ...) in plaintext on the delivery row: they are dropped with a warning
pointing at the credential store (which is re-resolved securely at send time).
Non-secret custom headers are unaffected.
* fix(webhooks): harden idempotency key, secret-header match, sweep reclaim id
Address follow-up review feedback:
- webhook_node_id is now NOT NULL so a NULL can't slip past the
(workflow_run_id, webhook_node_id) unique constraint and create duplicates.
- Secret-header filtering matches normalized markers (auth/token/secret/cookie/
api-key/...) instead of an exact name list, catching variants like
X-Custom-Auth-Token while leaving benign headers (e.g. X-Idempotency-Key).
- The sweeper re-enqueues with a reclaim-specific job id (the lease timestamp)
so reconciling a delivered-but-unrecorded row isn't deduped against the
original attempt's already-completed ARQ job. The atomic claim still ensures
at most one send.
* fix(webhooks): scope delivery rows to workflow org
---------
Co-authored-by: Abhishek Kumar <abhishek@a6k.me>
2026-07-02 17:14:14 +01:00
|
|
|
    The actual HTTP request is performed by the ``deliver_webhook`` task, which
    retries transient failures with backoff and dead-letters exhausted/permanent
    ones. This replaces the previous one-shot, best-effort inline POST that lost
    the webhook entirely on a single network error.

    Idempotent on ``(workflow_run_id, webhook_node_id)``: a retried run reuses the
    existing delivery row and does not enqueue a second send.
    """
    webhook_name = webhook_data.name

    if not webhook_data.enabled:
        logger.debug(f"Webhook '{webhook_name}' is disabled, skipping")
        return

    url = webhook_data.endpoint_url
    if not url:
        logger.warning(f"Webhook '{webhook_name}' has no endpoint URL")
        return

    payload = _build_webhook_payload(webhook_data, render_context)

    # Persist non-secret request definition. The credential is stored by reference
    # (uuid) and re-resolved at send time so secrets never land in this row.
    custom_headers = _safe_custom_headers(webhook_data, webhook_name)
    method = (webhook_data.http_method or "POST").upper()

    delivery, created = await db_client.create_webhook_delivery(
        workflow_run_id=workflow_run_id,
        organization_id=organization_id,
        endpoint_url=url,
        payload=payload,
        max_attempts=DEFAULT_WEBHOOK_DELIVERY_CONFIG["max_attempts"],
        http_method=method,
        webhook_name=webhook_name,
        custom_headers=custom_headers or None,
        credential_uuid=webhook_data.credential_uuid,
        webhook_node_id=webhook_node_id,
    )

    if not created:
        logger.info(
            f"Webhook '{webhook_name}' delivery already exists for run "
            f"{workflow_run_id} node {webhook_node_id}; not re-enqueuing"
        )
        return

    # Lazy import avoids a circular import (arq imports this module at load time).
    from api.tasks.arq import enqueue_job

    await enqueue_job(
        FunctionNames.DELIVER_WEBHOOK,
        delivery.id,
        _job_id=f"webhook-delivery-{delivery.id}-0",
    )
    logger.info(
        f"Enqueued webhook '{webhook_name}' delivery {delivery.delivery_uuid} "
        f"for run {workflow_run_id}"
    )
|
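

# Illustrative sketch only; nothing in the delivery pipeline calls these helpers.
# The webhook docstring in this module says ``deliver_webhook`` retries transient
# failures with backoff and dead-letters permanent ones. The two functions below
# show one way such a policy could be expressed. Their names, the doubling
# schedule and the default values (30s base delay, 600s cap) are assumptions made
# for illustration, and the real task may classify responses or compute delays
# differently.
def _sketch_is_transient_status(status_code: int) -> bool:
    """Return True for HTTP statuses this sketch treats as worth retrying."""
    return status_code >= 500 or status_code in (408, 425, 429)


def _sketch_backoff_delay(
    attempt: int, base_delay: float = 30.0, max_delay: float = 600.0
) -> float:
    """Capped exponential backoff for a 1-based attempt number.

    The delay doubles with each attempt and never exceeds ``max_delay``.
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    return min(max_delay, base_delay * 2 ** (attempt - 1))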