Mirror of https://github.com/dograh-hq/dograh.git, synced 2026-07-04 10:52:17 +02:00
* feat(webhooks): durable retrying delivery for final webhooks
Final webhook nodes were fired inline with a single best-effort httpx POST
(run_integrations._execute_webhook_node). On a transient error the failure was
swallowed at three levels, so ARQ never retried and the final call report was
permanently lost -- leaving downstream receivers stuck (e.g. a CRM showing a
call as still "in conversation").
Replace the one-shot POST with a durable, idempotent delivery pipeline modelled
on the campaign retry pattern (persisted row + scheduled_for + bounded attempts):
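The "persisted row + scheduled_for + bounded attempts" pattern can be made concrete with a minimal in-memory sketch of one delivery row. It uses a plain dataclass rather than the real WebhookDeliveryModel. url, payload_json, credential_uuid and attempts are hypothetical field names, and freeze_payload / is_due are illustrative helpers, not functions from the codebase. The sketch shows the payload frozen as a deterministic JSON string and the credential kept only as a uuid reference.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def _utcnow() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class WebhookDelivery:
    """Hypothetical in-memory stand-in for one webhook_deliveries row."""

    workflow_run_id: int
    webhook_node_id: str
    url: str                                # hypothetical field name
    payload_json: str                       # rendered once, never re-rendered
    credential_uuid: Optional[str] = None   # reference only; no secret stored
    status: str = "pending"                 # pending | succeeded | dead_letter
    attempts: int = 0                       # hypothetical attempt counter
    max_attempts: int = 5
    scheduled_for: datetime = field(default_factory=_utcnow)


def freeze_payload(rendered: Dict[str, Any]) -> str:
    # Deterministic serialization: every retry sends a byte-identical body.
    return json.dumps(rendered, sort_keys=True, separators=(",", ":"))


def is_due(delivery: WebhookDelivery, now: datetime) -> bool:
    # A row is eligible for a send attempt only while pending and overdue.
    return delivery.status == "pending" and delivery.scheduled_for <= now
```

Because the body is stored as text rather than re-rendered, a retry hours later cannot pick up changed template inputs.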
- New webhook_deliveries table (WebhookDeliveryModel) is the source of truth.
  The payload is rendered once and frozen so retries are deterministic; secrets
  are not stored -- the credential is referenced by uuid and re-resolved at
  send time.
- run_integrations now persists a delivery row and enqueues deliver_webhook with
  a deterministic ARQ job id instead of sending inline.
- deliver_webhook (new ARQ task) sends the request and:
    * 2xx -> succeeded
    * transient (RequestError / 5xx / 408 / 425 / 429) -> retry with capped
      exponential backoff, up to max_attempts, then dead_letter
    * permanent 4xx -> dead_letter immediately (no pointless looping)
  It is idempotent: a non-pending delivery is a no-op, so a duplicate enqueue or
  sweeper re-injection can't double-send.
- sweep_webhook_deliveries cron (every 5 min) re-enqueues overdue pending
  deliveries so nothing is lost to a worker restart / Redis flush.
- Stable X-Dograh-Delivery-Id / Workflow-Run-Id / Attempt headers let receivers
  dedupe retried deliveries.
- enqueue_job now forwards ARQ job options (_job_id, _defer_by); failures log
  repr(e) so empty-message errors like ConnectTimeout are diagnosable.
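The three outcomes listed for deliver_webhook can be sketched as a pure function. This sketch assumes the caller passes None as the status when httpx raised a RequestError (no response), counts attempts from 1, and treats any other non-2xx status as permanent, including 3xx, which the commit does not mention. The names Outcome, classify_attempt and may_send are hypothetical.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    RETRY = "retry"
    DEAD_LETTER = "dead_letter"


# Non-5xx statuses the commit lists as transient.
RETRYABLE_4XX = frozenset({408, 425, 429})


def classify_attempt(status_code: Optional[int], attempt: int, max_attempts: int) -> Outcome:
    """Decide what happens after send attempt number `attempt` (1-based).

    status_code is None when no response was received (the RequestError case).
    """
    if status_code is not None and 200 <= status_code < 300:
        return Outcome.SUCCEEDED
    transient = (
        status_code is None
        or status_code >= 500
        or status_code in RETRYABLE_4XX
    )
    if not transient:
        # Permanent failure (other 4xx, and in this sketch also 1xx/3xx):
        # retrying cannot change the answer, so dead-letter immediately.
        return Outcome.DEAD_LETTER
    if attempt >= max_attempts:
        return Outcome.DEAD_LETTER  # retry budget exhausted
    return Outcome.RETRY


def may_send(status: str) -> bool:
    # Idempotency guard: a delivery that is no longer pending is a no-op.
    # (A later fix in this squash replaces this plain read with an atomic claim.)
    return status == "pending"
```

Keeping the decision free of I/O lets each bullet be checked directly. For example, classify_attempt(429, 1, 5) returns Outcome.RETRY, while classify_attempt(404, 1, 5) returns Outcome.DEAD_LETTER.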
Config via DEFAULT_WEBHOOK_DELIVERY_CONFIG (env-overridable): max_attempts=5,
base_delay=30s, max_delay=600s, timeout=30s.
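With this configuration, the retry delays can be computed as follows. The commit only says "capped exponential backoff", so the doubling formula base_delay * 2 ** (attempt - 1) is an assumption, as are the class and function names. The default values are the ones quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WebhookDeliveryConfig:
    # Default values quoted from the commit; the class itself is hypothetical.
    max_attempts: int = 5
    base_delay: float = 30.0   # seconds
    max_delay: float = 600.0   # seconds
    timeout: float = 30.0      # per-request timeout, seconds


def retry_delay(attempt: int, cfg: WebhookDeliveryConfig) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based).

    Assumed formula: base_delay doubled per attempt, capped at max_delay.
    """
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    return min(cfg.base_delay * 2 ** (attempt - 1), cfg.max_delay)


def retry_schedule(cfg: WebhookDeliveryConfig) -> List[float]:
    # Attempts 1..max_attempts-1 are followed by a retry; a transient failure
    # on the final attempt dead-letters instead of scheduling another one.
    return [retry_delay(a, cfg) for a in range(1, cfg.max_attempts)]
```

Under this formula, the defaults give four retries that wait 30, 60, 120 and 240 seconds. The 600 s cap only takes effect once max_attempts is raised to 7 or more.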
Tests cover payload rendering, persist+enqueue, success, transient retry,
retryable 5xx, permanent 4xx dead-letter, attempt exhaustion, and idempotency.
Migration verified to apply/rollback against Postgres; table/enum/indexes confirmed.
* fix(webhooks): atomic claim, safe success-recording, sweep paging, migration cleanup
Address review feedback on the webhook delivery pipeline:
- deliver_webhook now atomically claims a delivery (conditional UPDATE that
  leases scheduled_for) before sending, so concurrent ARQ executions can't
  double-send (the prior status=='pending' read was non-atomic).
- Recording success is moved out of the dead-letter try-block: if the receiver
  accepted the payload (2xx) but the success DB-write fails, the row is left
  pending for the sweeper to reconcile instead of being dead-lettered.
- The sweep keyset-paginates by id so a backlog over the page size is fully
  drained, and logs the true re-enqueued total.
- Migration downgrade drops the enum via op.execute(DROP TYPE IF EXISTS ...)
  instead of the deprecated op.get_bind().
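The atomic claim and the keyset-paginated sweep can be illustrated with the standard-library sqlite3 module standing in for Postgres. This sketch reduces the table to three columns, stores scheduled_for as epoch seconds and assumes ids are positive integers. claim and overdue_ids are hypothetical helper names. On Postgres, the same conditional UPDATE is safe under concurrency: a second UPDATE blocked on the row re-checks its WHERE clause against the version the first one committed, and that version no longer matches.

```python
import sqlite3
from typing import Iterator

SCHEMA = """
CREATE TABLE webhook_deliveries (
    id            INTEGER PRIMARY KEY,
    status        TEXT NOT NULL DEFAULT 'pending',
    scheduled_for REAL NOT NULL  -- epoch seconds in this sketch
)
"""


def claim(conn: sqlite3.Connection, delivery_id: int, now: float, lease: float) -> bool:
    """Lease a due, pending row; True means this caller alone may send it.

    The check (status, scheduled_for) and the write (push scheduled_for past
    `now`) happen in one UPDATE, unlike a separate read-then-send.
    """
    cur = conn.execute(
        "UPDATE webhook_deliveries SET scheduled_for = ? "
        "WHERE id = ? AND status = 'pending' AND scheduled_for <= ?",
        (now + lease, delivery_id, now),
    )
    conn.commit()
    return cur.rowcount == 1


def overdue_ids(conn: sqlite3.Connection, now: float, page_size: int) -> Iterator[int]:
    """Yield every overdue pending id, paging by id rather than by OFFSET."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id FROM webhook_deliveries "
            "WHERE status = 'pending' AND scheduled_for <= ? AND id > ? "
            "ORDER BY id LIMIT ?",
            (now, last_id, page_size),
        ).fetchall()
        if not rows:
            return
        for (row_id,) in rows:
            yield row_id
        last_id = rows[-1][0]
```

Paging by `id > last_id` keeps each query cheap. Unlike OFFSET, it also does not skip rows when earlier rows leave the pending set between pages.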
* fix(webhooks): idempotent delivery creation and drop secret custom headers
Address the remaining review feedback:
- Add a (workflow_run_id, webhook_node_id) unique constraint and make
  create_webhook_delivery a get-or-create returning (delivery, created). A
  retried run_integrations now reuses the existing row instead of creating and
  sending a duplicate final webhook; only a freshly-created row is enqueued.
- Stop persisting secret-looking custom headers (Authorization, X-API-Key,
  Cookie, ...) in plaintext on the delivery row: they are dropped with a warning
  pointing at the credential store, whose credentials are re-resolved securely
  at send time. Non-secret custom headers are unaffected.
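The get-or-create against the unique constraint can be sketched with sqlite3 again as a stand-in. The column set and the helper name get_or_create_delivery are hypothetical. webhook_node_id is already declared NOT NULL, as the follow-up commit requires. `INSERT ... ON CONFLICT ... DO NOTHING` (SQLite 3.24+, mirroring the Postgres clause) leaves the first row and its frozen payload in place, and `created` tells the caller whether to enqueue.

```python
import sqlite3
from typing import Tuple

SCHEMA = """
CREATE TABLE webhook_deliveries (
    id              INTEGER PRIMARY KEY,
    workflow_run_id INTEGER NOT NULL,
    webhook_node_id TEXT    NOT NULL,
    payload         TEXT    NOT NULL,
    UNIQUE (workflow_run_id, webhook_node_id)
)
"""


def get_or_create_delivery(
    conn: sqlite3.Connection, workflow_run_id: int, webhook_node_id: str, payload: str
) -> Tuple[int, bool]:
    """Return (delivery_id, created); only created=True should be enqueued."""
    before = conn.total_changes
    conn.execute(
        "INSERT INTO webhook_deliveries (workflow_run_id, webhook_node_id, payload) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (workflow_run_id, webhook_node_id) DO NOTHING",
        (workflow_run_id, webhook_node_id, payload),
    )
    # No change recorded means the unique key already existed.
    created = conn.total_changes > before
    (delivery_id,) = conn.execute(
        "SELECT id FROM webhook_deliveries "
        "WHERE workflow_run_id = ? AND webhook_node_id = ?",
        (workflow_run_id, webhook_node_id),
    ).fetchone()
    conn.commit()
    return delivery_id, created
```

PostgreSQL, like SQLite, treats NULLs as distinct in a UNIQUE constraint by default. The NOT NULL on webhook_node_id is therefore what makes this key a reliable dedupe guard.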
* fix(webhooks): harden idempotency key, secret-header match, sweep reclaim id
Address follow-up review feedback:
- webhook_node_id is now NOT NULL so a NULL can't slip past the
  (workflow_run_id, webhook_node_id) unique constraint and create duplicates.
- Secret-header filtering matches normalized markers (auth/token/secret/cookie/
  api-key/...) instead of an exact name list, catching variants like
  X-Custom-Auth-Token while leaving benign headers (e.g. X-Idempotency-Key)
  untouched.
- The sweeper re-enqueues with a reclaim-specific job id (the lease timestamp)
  so reconciling a delivered-but-unrecorded row isn't deduped against the
  original attempt's already-completed ARQ job. The atomic claim still ensures
  at most one send.
* fix(webhooks): scope delivery rows to workflow org
---------
Co-authored-by: Abhishek Kumar <abhishek@a6k.me>
| File |
|---|
| 0a1b2c3d4e5f_add_mcp_in_toolcategory.py |
| 0c1bbc83fe9e_fix_unique_constraint_on_workflow_.py |
| 0c1223cc266f_make_json_not_nullable.py |
| 0fe708f2acb9_add_call_disposition_codes.py |
| 1a7d74d54e8f_add_transfer_call_tool_category.py |
| 1d441e79db94_add_user_model.py |
| 1da1d650c0e4_add_total_call_duration_to_usage.py |
| 2d6e2f41caa2_add_workflow_run.py |
| 2dfee251117b_fix_defaults.py |
| 2ed4baa89f15_add_annotations_in_workflow_run_model.py |
| 2f638891cbb6_add_workflow_run_text_sessions.py |
| 02ffd7f23d1d_add_index_in_workflow_run.py |
| 3a0384c5ab2e_add_user_configuration.py |
| 3a30110d7cd7_added_livekit_room.py |
| 3cd3155084a2_dedup_org_scoped_recordings.py |
| 4c1f1e3e8ef2_drop_looptalk_tables.py |
| 4d8e9b2a3c5f_drop_workflow_run_mode_enum.py |
| 6bd9f67ec994_add_folders_and_workflow_folder_id.py |
| 6d2f94baf4b7_add_ari_mode.py |
| 6fd8fac02883_add_user_email_and_password.py |
| 7e90cc8d025b_add_workflow_template.py |
| 7feef09d7cc6_add_price_per_second_usd.py |
| 08bb6e7f1397_added_campaign_table.py |
| 9be6240baa00_add_recording_and_transcript_urls_in_.py |
| 9ef49df72862_add_workflow_id_in_workflow_definition.py |
| 9f5f2d35f6fb_remove_workflow_definition_id_from_.py |
| 9f25ff8f3cbd_add_actions.py |
| 13ccd6e1f5ad_add_workflow_run_mode.py |
| 19d2a4b6c8ef_rename_integrations_organisation_id.py |
| 20c780c2a218_add_provider.py |
| 34c8537dfde5_add_processing_state_in_queued_runs.py |
| 36b5dbf670e4_add_external_credentials_model.py |
| 37d0a90fccba_add_api_keys_model.py |
| 45fa7fec2993_add_selected_organisation_id_field_in_.py |
| 49a8fe6841e6_add_state_field_to_workflow_runs.py |
| 58f17b468b3c_add_workflow_definition.py |
| 67a5cf3e09d0_unique_recording_id_per_org.py |
| 91cc6ba3e1c7_add_key_to_user_configurations.py |
| 93a1ddbb6ffd_add_workflow_model.py |
| 384be6596b36_make_email_case_insensitive.py |
| 477b47ce346b_add_organisation_id_in_workflow.py |
| 488eb58e4e6e_add_cloudonix_mode.py |
| 493ca2bb001f_add_end_call_tool_category.py |
| 594f16adf97c_add_timezone_info_on_tables.py |
| 693a865c011f_add_status_in_workflow.py |
| 982ec8e434be_add_storage_backend_in_workflow.py |
| 1225ac786848_add_organization_configurations.py |
| 2159d4ac431a_added_quota_tables.py |
| 3717ae6146e2_add_workflow_configurations.py |
| 4735a1f0cdb3_add_queued_runs_table.py |
| 6499c608d0f6_add_campaign_logs_column.py |
| 9641b4f306cd_add_is_superuser_to_user_model.py |
| 181475b2a1a1_add_public_access_token.py |
| 5253971e3f03_add_teplate_context_variables_in_.py |
| a1b2c3d4e5f6_make_tts_columns_nullable.py |
| a2b092ff7282_add_usage_info.py |
| a29b05f31ddf_add_last_validated_at.py |
| a57d25b75117_add_vonage_and_rename_config.py |
| a75ae71af479_add_log_in_workflow_run_model.py |
| a188ff90e76f_add_vobiz_mode_for_workflow.py |
| a399b39479fe_add_versioning_in_workflow_definitions.py |
| a2355fc6bdc1_add_multi_telephony_config_tables.py |
| ac6da37c5034_add_created_by_in_integration.py |
| b3a1c7e94f12_add_telnyx_mode.py |
| b7e3c9a1d2f4_add_webhook_deliveries.py |
| b79f19f68157_add_call_type_column_to_workflow_runs.py |
| bee2a9fcc6a6_fix_datetime_to_be_in_utc.py |
| c7c56dd36b21_add_agent_trigger.py |
| c71db647d354_add_calculator_in_toolcategory.py |
| c425d3445750_add_columns_in_usage_table.py |
| cdc80a4fd2dd_add_template_description.py |
| cdcf9f65913b_add_workflow_uuid.py |
| d1dac4c93e61_add_partial_index_on_call_id_for_.py |
| d11fbd083a55_add_connection_details.py |
| d0060de90c18_fix_migrations.py |
| d666f3244648_add_integrations.py |
| d688d0da1123_backfill_workflow_definition_versioning.py |
| dc33eef8dabe_add_document_tables.py |
| dcb0a27d98c6_remove_unique_tool_name_constraint.py |
| e0d1a9b9f6c4_add_looptalk_testing_tables_without_.py |
| e02f387b7538_add_embed_token_model.py |
| e54ddb048535_add_recording_table.py |
| e7254d2c6c18_add_retrieval_mode_in_document.py |
| ebc80cea7965_add_tools_model.py |
| ec010596a0b4_change_datatype_of_usage_to_float.py |
| efe356f488f9_add_extra_column_in_workflow_runs.py |
| f2e1d0c9b8a7_add_plivo_mode.py |
| f6f19156bcb7_add_organisation_table.py |
| f952c9c1105a_add_failed_state_to_queued_runs.py |
| fec0fb9a8db7_add_index.py |
| fefdd1835b7d_retry_outbound_calls_for_campaigns.py |