feat(automations): static_inputs on triggers + vertical-slice api/services

This commit is contained in:
CREDO23 2026-05-27 21:21:43 +02:00
parent 84d99f19a2
commit 27ab367a13
27 changed files with 915 additions and 356 deletions

View file

@ -34,24 +34,27 @@ system will survive feature growth:
---
## 2. The four-layer contract
## 2. The three-layer contract
The system is structured as four layers. Layers 1, 2, and 4 are defined by
SurfSense developers (at registration time). Layer 3 is what users write
(or the NL generator produces). The runtime reads all four to do its job.
The system is structured as three layers. Layers 1 and 3 are defined by
SurfSense developers (at registration time). Layer 2 is what users write
(or the NL generator produces). The runtime reads all three to do its job.
| Layer | What it is | Defined by |
| ----- | ---------- | ---------- |
| **1. Capability registry** | What this SurfSense instance can do | Developers, at startup |
| **2. Action contract** | Per-action input/output schema | Developers, at startup |
| **3. Automation definition** | One concrete saved automation | Users (or NL generator) |
| **4. Trigger contract** | Per-trigger config and payload schemas | Developers, at startup |
| **1. Action contract** | Per-action params and output schema | Developers, at startup |
| **2. Automation definition** | One concrete saved automation | Users (or NL generator) |
| **3. Trigger contract** | Per-trigger params and payload schemas | Developers, at startup |
Each layer constrains the one above. The runtime reads all four but doesn't
know what's in them ahead of time. That's how a new capability or trigger
Each layer constrains the next. The runtime reads all three but doesn't
know what's in them ahead of time. That's how a new action or trigger
type becomes available across the engine without code changes outside its
registration.
A unification layer below Layer 1 — one catalog of "things this SurfSense
instance can do," shared by automations, agents, and future surfaces — was
considered and deferred (§3). v1 actions are stand-alone.
### Schema language
Every shape in every layer is described in **JSON Schema (draft 2020-12).**
@ -66,167 +69,126 @@ extensions on top:
---
## 3. Capability registry (Layer 1)
## 3. Capability unification layer — deferred to post-v1
A `Capability` is one discrete thing the SurfSense backend exposes —
"post a Slack message," "query the Search Space," "generate a podcast." It
is the atomic unit of "things automations can do."
Earlier drafts introduced a `Capability` registry as Layer 1: one catalog
of "things this SurfSense instance can do," shared by the automation
engine (as actions), the agent (as tools), and any future HTTP surface.
The motivation is real — one source of truth beats N parallel registries —
but v1 has a single action (`agent_task`) and a single consumer (the
automation engine). The five-field shape sketched earlier (`id`,
`description`, `input_schema`, `output_schema`, `handler`) cannot safely
host any non-trivial capability: it carries no caller identity, no
search-space scoping, and no authorization gate on tool delegation.
Building the abstraction with one consumer would lock in a shape that
doesn't survive the second consumer.
```python
@dataclass
class Capability:
id: str # "slack.post_message"
description: str # for the NL generator + UI label
input_schema: dict # JSON Schema
output_schema: dict # JSON Schema
handler: AsyncHandler
```
The unification layer returns when the second consumer lands (Phase 2
tight actions or Phase 4 MCP), redesigned from the start with:
### v1-minimum: five fields, nothing else
- A `CallContext` carrying caller user id, search space id, and run id,
passed to every handler invocation.
- Explicit scope declarations per capability (e.g. `reads:documents`,
`writes:slack`, `destructive`) for the authorization layer to read.
- A per-user, per-search-space filter consulted at both definition save
time (validating `agent_task.tools`) and run time (scoping the agent's
tool list to what the automation creator can delegate).
The Capability is **deliberately five fields in v1**. Every additional field
that earlier drafts considered (`name`, `required_credentials`,
`side_effects`, `expected_duration_seconds`, `cost_estimate`) has been
removed until a concrete consumer feature demands it. Authoring stays cheap
and the registry stays trivial to introspect:
Until then:
- `name` → folded into `description`. The UI can render a short label from
the first line of `description` or fall back to `id`. No separate field
needed in v1.
- `required_credentials` → returns when external-credential capabilities
ship (Phase 2). v1 capabilities run server-side with app config; nothing
to declare.
- `side_effects` → returns when RBAC inside automations or
`READ_ONLY`-only agent tool gating arrives. v1 capabilities are
hand-picked and all trusted code.
- `expected_duration_seconds` → returns when multi-queue routing ships.
Single Celery queue in v1.
- `cost_estimate` → never returns as a declared field; cost is measured
per run from a ledger, aggregated per Capability, and surfaced as a
historical average. Pre-flight checks are deferred.
The runtime invariant: a Capability is **a typed, named, callable thing
the system can do.** Every consumer (executor, agent tool layer, future
HTTP API) sees the same five-field shape and uses it the same way.
### Where capabilities live (v1)
In v1, the capability registry is a single in-memory dict, populated at
process startup from native registrations in
`automations/registries/capabilities/`. Identical across all workers.
No database persistence, no closures rebuilt per worker.
### MCP integration — deferred to Phase 4
The earlier two-tier registry (native + MCP-derived), the
`mcp_connections` / `mcp_tools` tables, the harvester, and the lazy
per-worker closure cache are **deferred to Phase 4** along with the
rest of the integration-tooling surface. They are removed from v1
because:
- v1 has no external connector capabilities (no Slack, Notion, Drive,
etc.). The only capabilities that will ship are server-side helpers
(search-space query / fetch) plus the loose `agent_task` action.
- Without external connectors, the lifecycle mismatch that motivates
the two-tier design (connect Monday, run Friday, workers restarted
in between) doesn't arise. A startup-time dict is sufficient.
- Phase 4 reintroduces this design as-is — the registry interface in
v1 is the same callable surface a Phase-4 MCP harvester will register
into. The deferral is additive, not a different design.
See archived design at `docs/automation/archived/mcp-registry.md` once
v1 ships; for now the only consumer of the registry is the in-memory
native path.
- v1 actions are stand-alone units (Layer 1 below); the automation engine
reads its own action registry, nothing else.
- `agent_task.params.tools` is a forward-looking allowlist field with no
v1 semantics beyond "list of string identifiers." The handler's tool
resolution is opaque to the automation contract.
### Credentials — deferred to Phase 2
The earlier per-call credential resolution pattern (`ctx.resolve_mcp_client`,
`ctx.resolve_http_client`, `ctx.resolve_llm`) is **deferred to Phase 2**.
v1 capabilities run server-side using app-level configuration; none of
the seven v1 capabilities needs per-user or per-connection auth.
External-credential handlers (Slack, email, etc.) require per-user or
per-connection auth. v1 actions run server-side with app-level
configuration. When tight actions ship in Phase 2, the credential design
lands as part of the unification redesign: connection IDs in the
definition (never tokens); credentials loaded per-call by the handler
context (never pre-loaded into worker memory); credentials never enter
LLM context.
When Phase 2 ships external-credential capabilities (Slack, email, etc.),
the three guarantees the original design promised are reintroduced
unchanged:
### MCP — deferred to Phase 4
- Credentials never appear in the automation definition (connection IDs
only).
- Credentials never appear in the LLM's context (the host holds them
and uses them on the LLM's behalf when executing tool calls).
- Credentials are loaded per-call, not pre-loaded into worker memory.
The Phase-2 design returns as-is; only the v1 surface is simplified.
External tool servers feeding tools into a shared registry land with the
rest of the integration tooling in Phase 4, after the unification layer
is in place. The two-tier registry, `mcp_connections` and `mcp_tools`
tables, and the harvester arrive as a single coherent step then.
---
## 4. Action contract (Layer 2)
## 4. Action contract
An `Action` is what a user references in a plan step. Most actions are
thin wrappers around one capability (e.g., `slack_post` wraps
`slack.post_message`). Some compose: `agent_task` is one action whose
handler invokes the LangGraph runtime, which in turn can call many
capabilities.
An `Action` is what a user references in a plan step. Some actions are
deterministic single-purpose handlers (`slack_post`, `send_email`); one
action (`agent_task`) hosts an LLM and a tool allowlist for cases where
judgment is needed. The contract is the same in both cases — only the
handler differs.
```python
@dataclass
@dataclass(frozen=True, slots=True)
class ActionDefinition:
type: str # "agent_task", "slack_post"
name: str # for the UI
description: str # for the NL generator
config_schema: dict # JSON Schema for action.config
output_contract: dict | DynamicOutput # what it produces
uses_capabilities: list[str] # IDs from the registry
produces_artifacts: list[ArtifactSpec] # see §8
handler: AsyncHandler
type: str # "agent_task", "slack_post"
name: str # short UI label
description: str # for the NL generator and the UI
params_schema: dict # JSON Schema for step.params
handler: ActionHandler
```
This is the v1 shape: five fields, no handler context, no output
contract, no artifact declaration. The deferrals are intentional:
- **`output_contract`** — Phase 2. Deterministic handlers will return
a fixed shape; v1's only action (`agent_task`) takes an
`output_schema` inside `params` and validates against that instead.
- **`produces_artifacts`** — Phase 5. Artifact lifecycle (storage,
signed URLs, retention) is its own design step; v1 handlers
persist their own outputs.
- **Handler context** — paired with the unification redesign (§3).
v1 handlers receive `(args)` only; per-user / per-search-space
behavior is not yet a v1 concern.
### Tight vs loose actions
Two patterns coexist by design:
- **Tight actions** (`slack_post`, `linear_create_issue`, `send_email`):
config_schema is fully specified, output_contract is fixed, handler is a
thin wrapper. ~20 LOC each. Used when the user knows exactly what they
want done — no LLM tokens spent on trivial work.
- **Tight actions** (`slack_post`, `linear_create_issue`,
`send_email`) — deterministic single-purpose handlers. ~20 LOC
each. **Phase 2.**
- **Loose actions** (`agent_task`) — params_schema accepts a `prompt`,
a `tools` allowlist, and an optional `output_schema` declaring what
the agent must return; the handler validates the agent's output
against it. **v1.**
- **Loose actions** (`agent_task`): config_schema accepts a `prompt` and a
`tools` allowlist; output_contract is *dynamic* — the user declares the
output shape they want via `output_schema` in the step config; the
handler asks the LLM to return that shape and validates. Used when
judgment is needed.
The agent's tool list is **the same capabilities** that tight actions call
directly. One registry, two invocation modes. Adding a new MCP server gives
both modes access to its tools automatically.
The agent's `tools` allowlist resolves opaquely in v1; the redesigned
unification layer (§3) will give both invocation modes access to the
same vocabulary, with per-user authorization gating both.
### How names in the definition become function calls
The definition contains strings like `"action": "slack_post"`. The string is
just a name — it does not point to a function. At runtime, the executor
performs a **name-based lookup** against the action registry:
The definition contains strings like `"action": "agent_task"`. The
string is just a name — it does not point to a function. At runtime,
the executor performs a **name-based lookup** against the action
registry:
```python
# step.action is a string from the JSON definition, e.g. "slack_post"
action_def = _ACTION_REGISTRY[step.action] # dict lookup
handler = action_def.handler # Python callable
result = await handler(ctx, resolved_config) # invocation
action_def = action_registry.get(step.action) # dict lookup
handler = action_def.handler # Python callable
result = await handler(resolved_params) # invocation
```
The registry is a Python dict (or a thin wrapper around one) populated at
process startup. Each entry in `automations/actions/*.py` calls a
`register_action(...)` function at module import time, putting its
`ActionDefinition` (including the handler function reference) into the
registry.
The registry is a Python dict populated at process startup. Each entry
in `automations/registries/actions/*.py` calls `register_action(...)`
at module import time, putting its `ActionDefinition` (including the
handler function reference) into the registry.
The same pattern applies to capabilities. The definition references
capabilities by ID (`"slack.post_message"`); the capability registry maps
the ID to a `Capability` object holding the handler. Definitions never
reference Python code directly — they reference names that the registry
resolves to code.
This separation is what makes the contract portable. The definition is
pure data. The registry is the engine's runtime vocabulary. They meet at
name-based lookup; nothing else crosses the boundary.
The definition is pure data. The registry is the engine's runtime
vocabulary. They meet at name-based lookup; nothing else crosses the
boundary.
### The full expressive spectrum
@ -238,7 +200,7 @@ fully agentic. Six practical shapes worth recognizing:
| **1. Direct call** | `slack_post` with literal channel and template | No LLM. ~200ms. Fractions of a cent. |
| **2. Direct call with computed inputs** | `linear_create_issue` using `{{summary.title}}` from a prior step | No LLM for this step. Cheap. |
| **3. Single-domain agent task** | `agent_task` with `tools: ["slack.*"]` only | One LLM, bounded toolset. |
| **4. Multi-domain agent task, narrow** | `agent_task` with `tools: ["github.list_pull_requests", "linear.create_issue"]` | One LLM, named capabilities. |
| **4. Multi-domain agent task, narrow** | `agent_task` with `tools: ["github.list_pull_requests", "linear.create_issue"]` | One LLM, named tools. |
| **5. Multi-domain agent task, broad** | `agent_task` with `tools: ["slack.*", "github.*", "linear.*"]` | One LLM, large toolset, most agentic. |
| **6. Composed plan** | `agent_task` (narrow) for thinking → `slack_post` + `linear_create_issue` for acting | Best cost-to-power ratio. |
@ -258,7 +220,7 @@ user's.
---
## 5. Automation definition (Layer 3)
## 5. Automation definition
This is the JSON the user writes (or the NL generator produces). Stored in
`automations.definition` as JSONB.
@ -287,7 +249,7 @@ This is the JSON the user writes (or the NL generator produces). Stored in
"triggers": [
{
"type": "schedule",
"config": { "cron": "0 9 * * 1-5", "timezone": "Africa/Kigali" }
"params": { "cron": "0 9 * * 1-5", "timezone": "Africa/Kigali" }
}
],
@ -295,7 +257,7 @@ This is the JSON the user writes (or the NL generator produces). Stored in
{
"step_id": "research",
"action": "agent_task",
"config": {
"params": {
"prompt": "Find documents tagged {{inputs.tags}} indexed since {{inputs.since}}. Return JSON with bullets and source_doc_ids.",
"tools": ["search_space.query", "search_space.fetch_document"],
"model": "anthropic/claude-sonnet-4-7",
@ -313,7 +275,7 @@ This is the JSON the user writes (or the NL generator produces). Stored in
{
"step_id": "deliver",
"action": "slack_post",
"config": {
"params": {
"channel_id": "C0123",
"message_template": "*Competitor digest*\n\n{% for b in summary.bullets %}• {{b}}\n{% endfor %}"
}
@ -325,11 +287,10 @@ This is the JSON the user writes (or the NL generator produces). Stored in
"max_retries": 2,
"retry_backoff": "exponential",
"concurrency": "drop_if_running",
"budget_cap_usd": 1.50,
"on_failure": [ /* steps to run if main plan fails after retries */ ]
},
"metadata": { "tags": ["digest"], "created_from_nl": true }
"metadata": { "tags": ["digest"] }
}
```
@ -340,7 +301,7 @@ This is the JSON the user writes (or the NL generator produces). Stored in
"step_id": "...", // unique within plan
"action": "...", // references an ActionDefinition.type
"when": "{{ ... }}", // optional Jinja expr → bool; false = skip
"config": { ... }, // validated against action's config_schema
"params": { ... }, // validated against action's params_schema
"output_as": "...", // binds output to this name for later steps
"max_retries": 0, // optional, overrides automation default
"timeout_seconds": 1200 // optional, overrides automation default
@ -354,7 +315,7 @@ about it, or they compose automations through events (§7.5).
---
## 6. Trigger contract (Layer 4)
## 6. Trigger contract
Three trigger types. That's the entire taxonomy.
@ -363,23 +324,12 @@ Three trigger types. That's the entire taxonomy.
```python
TriggerDefinition(
type="schedule",
config_schema={
"type": "object",
"required": ["cron", "timezone"],
"properties": {
"cron": { "type": "string" },
"timezone": { "type": "string", "format": "iana-timezone" }
}
},
payload_schema={
"type": "object",
"properties": {
"fired_at": { "type": "string", "format": "date-time" },
"scheduled_for": { "type": "string", "format": "date-time" },
"last_fired_at": { "type": "string", "format": "date-time" }
}
}
params_model=ScheduleTriggerParams, # cron + timezone
)
# At fire time the schedule producer emits runtime inputs
# (fired_at, scheduled_for, last_fired_at) which are merged with the
# trigger row's static_inputs (static wins) and validated against
# automation.definition.inputs.schema_.
```
Implementation: extends `app/utils/periodic_scheduler.py`, which already
@ -395,7 +345,7 @@ want an event trigger instead.
```python
TriggerDefinition(
type="webhook",
config_schema={
params_schema={
"type": "object",
"properties": {
"input_mapping": {
@ -422,7 +372,7 @@ Dedups against runs in the last 24 hours.
```python
TriggerDefinition(
type="event",
config_schema={
params_schema={
"type": "object",
"required": ["event_type"],
"properties": {
@ -485,11 +435,13 @@ Common path (after a trigger has fired):
4. **Snapshot the resolved definition** into the run row (immutable history)
5. Enqueue executor task on the single `automations_default` Celery queue
The cost-estimate pre-check (originally step 3) is **deferred**.
v1 capabilities do not declare `cost_estimate`; pre-flight budgeting
returns when a historical-cost ledger exists. The mid-flight budget
cap (§7.2) still kills the run if accumulated cost crosses
`budget_cap_usd`.
The cost-estimate pre-check (originally step 3) is **deferred**. v1
actions do not declare cost estimates, the run row has no `cost_usd`
column, and no handler reports tokens used — so neither pre-flight
prediction nor mid-flight accumulation can be enforced. `Execution`
therefore does not expose `budget_cap_usd` in v1; it returns as a single
field addition the day the cost ledger ships (per-action cost reporting
+ `automation_runs.cost_usd` column + executor accumulation).
Queue routing by `expected_duration_seconds` is **deferred** until load
patterns justify a second queue. v1 uses a single queue.
@ -510,15 +462,15 @@ async def execute_run(run_id: int) -> None:
if step.when and not evaluate_predicate(step.when, context | step_outputs):
record_step_skipped(run, step); continue
resolved_config = render_config(step.config, context | step_outputs)
resolved_params = render_params(step.params, context | step_outputs)
action = action_registry.get(step.action)
validate(resolved_config, action.config_schema)
validate(resolved_params, action.params_schema)
try:
result = await with_retries(
action.handler,
ctx=build_action_context(run, action),
args=resolved_config,
args=resolved_params,
policy=step.retry_policy or run.execution.retry_policy,
)
validate(result, step.output_schema)
@ -541,14 +493,20 @@ validated dict come back; it doesn't know that step was "smart."
### 7.3 Action handlers
One handler per `ActionDefinition.type`. Receives `(ctx, args)`, returns
a dict matching `output_contract` (or matching the user-declared
`output_schema` for dynamic-output actions like `agent_task`).
One handler per `ActionDefinition.type`. Receives the validated `args`
dict and returns whatever the step's output validates against (a fixed
shape declared by tight actions, or a dynamic shape declared via
`output_schema` in the step params for `agent_task`).
Handlers handle their own credential resolution via `ctx.resolve_credentials`.
They do not know about retries, timeouts, or budget caps — those are the
Handlers do not know about retries or timeouts — those are the
executor's concern.
In v1, handlers take `(args)` only. The `CallContext` parameter sketched
in §7.2's pseudo-code (caller user id, search space id, run id,
credential resolver) arrives with the unification layer redesign (§3);
v1's single action (`agent_task`) reads what it needs from app-level
configuration.
### 7.4 Template engine
#### Why it exists
@ -747,7 +705,7 @@ Three fields, per-automation defaults with optional per-step overrides:
- `timeout_seconds`: integer
Retries on:
- Capability handler exceptions
- Action handler exceptions
- Output schema validation failures (for dynamic-output actions, the
validation error is fed back to the LLM in the retry)
@ -755,12 +713,21 @@ Not retries:
- `when:` evaluation failures (these are user errors, surface immediately)
- Input validation failures (caught at dispatch, never reach the executor)
### Budget enforcement
### Budget enforcement *(deferred — not in v1)*
`budget_cap_usd` is per-run. The dispatcher refuses to enqueue if estimated
cost exceeds it. The executor kills the run if accumulated cost crosses it
mid-flight (the LLM ops handler reports tokens consumed back to the
executor between calls).
Future shape: `budget_cap_usd` on `Execution`, dispatcher refuses to
enqueue if estimated cost exceeds it, executor kills the run if
accumulated cost crosses it mid-flight (the LLM ops handler reports
tokens consumed back to the executor between calls).
Prerequisites before this can land:
- Each action declares cost reporting (tokens × model price, API call
charges) — `ActionDefinition` has no such field today.
- `automation_runs.cost_usd` column + executor accumulates per step.
- A historical-cost ledger so pre-flight estimation can return useful
numbers (otherwise the dispatcher gate is guessing).
Until all three exist, v1 has no surface for budget enforcement.
### On-failure handlers
@ -787,14 +754,13 @@ nightly Celery Beat task deletes expired artifacts).
### Duration classes and queue routing — deferred
The original design routed runs to multiple Celery queues based on each
capability's declared `expected_duration_seconds`. v1 ships with **one
queue** (`automations_default`) and capabilities do not declare a
duration. Multi-queue routing returns when burst load on a single queue
actually justifies the operational complexity of independent worker
pools.
action's declared `expected_duration_seconds`. v1 ships with **one
queue** (`automations_default`) and actions do not declare a duration.
Multi-queue routing returns when burst load on a single queue actually
justifies the operational complexity of independent worker pools.
Adding the second queue is a config change plus reintroducing
`expected_duration_seconds` on the `Capability` dataclass — both
`expected_duration_seconds` on the `ActionDefinition` dataclass — both
mechanical, additive, and free of design rewrite.
---
@ -832,14 +798,16 @@ and an immutable run history.
### `automation_triggers`
| field | type | notes |
| --------------- | ----------------------------------------------------------------------------- | ------------------------------------------- |
| `id` | int PK | |
| `automation_id` | FK | |
| `type` | enum: `schedule`, `manual` (Phase 2/3 add `webhook`, `event`) | |
| `config` | jsonb | validated against trigger's `config_schema` |
| `enabled` | bool | |
| `last_fired_at` | timestamp | |
| field | type | notes |
| --------------- | ----------------------------------------------------------------------------- | ----------------------------------------------------------- |
| `id` | int PK | |
| `automation_id` | FK | |
| `type` | enum: `schedule`, `manual` (Phase 2/3 add `webhook`, `event`) | |
| `params` | jsonb | trigger-type config, validated against trigger's `params_schema` |
| `static_inputs` | jsonb | per-attachment domain values merged into every run (static wins on collision) |
| `enabled` | bool | |
| `last_fired_at` | timestamp | |
| `next_fire_at` | timestamp / null | precomputed next fire moment for schedule triggers |
`secret_hash` (for webhook bearer tokens) is **deferred to Phase 2** with
the webhook trigger.
@ -853,8 +821,7 @@ the webhook trigger.
| `trigger_id` | FK / null | null = manual via UI |
| `status` | enum | `pending`, `running`, `succeeded`, `failed`, `cancelled`, `timed_out` |
| `definition_snapshot` | jsonb | the definition as it was when this run fired |
| `trigger_payload` | jsonb | |
| `resolved_inputs` | jsonb | |
| `inputs` | jsonb | merged & validated inputs (trigger.static_inputs producer runtime data, static wins) |
| `step_results` | jsonb | array of per-step results with timing |
| `output` | jsonb / null | |
| `artifacts` | jsonb | references to created artifacts |
@ -863,7 +830,7 @@ the webhook trigger.
| `agent_session_id`| str / null | link to LangGraph trace if agent_task was used |
`cost_usd` (per-run accumulated cost) is **deferred** until at least one
v1 capability records token-level cost. When reintroduced it lands as a
action records token-level cost. When reintroduced it lands as a
column-only migration.
### Deferred tables
@ -897,8 +864,8 @@ not "trusted authors only."
User provides natural-language input. The Generator LLM is given:
- The full schema set (input schema for definition, registry of action
types with their config_schemas, registry of trigger types, available
capabilities for this SearchSpace, list of allowed Jinja filters)
types with their params_schemas, registry of trigger types, list of
allowed Jinja filters)
- A tool to list available connectors, channels, and other SearchSpace
resources, so it doesn't invent names that don't exist
- A few-shot set of examples
@ -918,13 +885,13 @@ Output: a structured proposal matching the automation definition schema.
Server-side, before the proposal reaches the user:
- Validate against JSON Schema (shape correctness)
- Verify every capability referenced exists in the registry (resource existence)
- Verify every action and trigger type referenced exists in the registry
- Verify every connector/channel/resource referenced exists in this SearchSpace
- Validate every template against the sandbox's allowlist (no underscore
attributes, no unregistered filter names, length under cap)
Failures here are deterministic errors, not warnings. A proposal that
references a non-existent capability or includes a template using
references a non-existent action or includes a template using
`{{x.__class__}}` is rejected before the user sees it; the Generator is
re-prompted with the validation error and asked to fix the proposal.
@ -947,7 +914,7 @@ produces two outputs for the user:
- Action sequences that touch external systems without obvious benefit
to the user
- Cost estimates that seem high relative to the goal
- References to capabilities the user hasn't used before
- References to actions the user hasn't used before
- Schedules tighter than 15 minutes (likely should be event triggers)
The Review LLM is a **UX layer** that makes review actually useful. It is
@ -1009,33 +976,18 @@ always.
surfsense_backend/app/
├── automations/ # NEW: the engine
│ ├── __init__.py
│ ├── models.py # SQLAlchemy models for 6 tables
│ ├── schemas.py # Pydantic schemas (definition envelope, etc.)
│ ├── persistence/ # SQLAlchemy models + enums for 3 tables
│ ├── schemas/ # Pydantic schemas (definition envelope, etc.)
│ ├── routes.py # FastAPI router (/api/v1/automations)
│ ├── service.py # CRUD + business logic
│ ├── dispatcher.py # trigger matching, cost check, run creation
│ ├── dispatcher.py # trigger matching, run creation
│ ├── executor.py # the Celery task that runs a plan
│ ├── templating.py # Jinja sandbox + filters
│ ├── events.py # publish/subscribe for domain_events
│ ├── filters.py # JSON filter grammar evaluator
│ ├── actions/
│ │ ├── registry.py
│ │ ├── agent_task.py
│ │ ├── transform_data.py
│ │ ├── slack_post.py
│ │ ├── send_email.py
│ │ ├── notification.py
│ │ └── (more in Phase 5: podcast_generation, report_generation, ...)
│ ├── triggers/
│ │ ├── registry.py
│ │ ├── schedule.py # Celery Beat hookup
│ │ ├── webhook.py # /fire endpoint
│ │ └── event.py # subscribes to domain_events
│ ├── capabilities/
│ │ ├── registry.py
│ │ ├── native.py # native capability registrations
│ │ ├── mcp_harvester.py # registers MCP tools as capabilities (Phase 4)
│ │ └── (LLM ops registered alongside)
│ ├── registries/ # action and trigger registries
│ │ ├── actions/ # ActionDefinition + handler registration
│ │ └── triggers/ # TriggerDefinition
│ └── nl/ # Phase 1 — primary user path
│ ├── generator.py # Generator LLM
│ ├── reviewer.py # Review LLM (summary + flagged items)
@ -1070,23 +1022,22 @@ automations in natural language.
**Step 1 (current scope, this batch of commits):**
- 3 tables (`automations`, `automation_triggers`, `automation_runs`) +
Alembic migration
- Empty Capability, Action, Trigger registries (concrete entries land in
later steps when the consuming feature lands)
- Empty action and trigger registries under
`app/automations/registries/` (concrete entries land in later steps)
- Pydantic schemas for the automation definition envelope, the two v1
trigger configs (`schedule`, `manual`), and the one v1 action config
(`agent_task`)
- Module structure under `app/automations/` (data/, schemas/,
trigger params shapes (`schedule`, `manual`), and the one v1 action
params shape (`agent_task`)
- Module structure under `app/automations/` (persistence/, schemas/,
registries/), fully isolated from the existing codebase
**Step 2:**
- Register the `agent_task` action and the `schedule` / `manual`
triggers in the registries
- Capability registry populated with native deliverable-producing
capabilities (chosen when this step starts)
- The `agent_task` action handler and the `schedule` / `manual` triggers
registered in `app/automations/registries/`. Tool resolution for
`agent_task.params.tools` is opaque to the contract — the handler
decides what string identifiers it accepts and how they resolve.
**Step 3:**
- Executor (single-queue Celery task) with retries, timeouts, budget
caps measured against `cost_usd` ledger on the run
- Executor (single-queue Celery task) with retries and timeouts
- Template engine (Jinja sandbox + the v1 filter allowlist + runtime
limits)
- Manual "Run now" endpoint
@ -1122,19 +1073,23 @@ somewhere humans see, complex pipelines have proper error handling.
**After Phase 3**: NL authoring is the polished primary surface; edit
flows are conversational rather than form-only.
### Phase 4 — Event triggers
### Phase 4 — Event triggers + integration tooling
- `domain_events` table and `events.py` module
- Indexing pipeline publishes `connector.*` events (smallest change — just
add publish calls to the existing flow)
- Automations publish `automation.run.*` events on completion
- `event` trigger with filter grammar
- MCP capability harvester (so MCP-backed events and tools both work)
- The unification layer redesign (see §3) — `CallContext`, scope
declarations, per-user authorization gating
- MCP integration on top of the unification layer (external tool servers
harvested into the shared catalog)
**After Phase 4**: "do X when Y happens" automations work, including
automation-chaining through events.
automation-chaining through events; external MCP tools and SurfSense
actions share one vocabulary.
### Phase 5 — Wrapping existing features and sharing
- Wrap existing SurfSense capabilities as actions: `podcast_generation`,
- Wrap existing SurfSense features as actions: `podcast_generation`,
`report_generation`, `indexing_sweep`
- Artifact lifecycle implementation
- `expected_duration_seconds` based queue routing (split `automations_long`
@ -1144,7 +1099,7 @@ automation-chaining through events.
shift documented in §7.4's pre-Phase-5 gate
- Cross-automation composition examples in the docs
**After Phase 5**: every existing SurfSense capability is automatable
**After Phase 5**: every existing SurfSense feature is automatable
without any per-feature code, and automations can be shared between
SearchSpaces and users.
@ -1156,13 +1111,12 @@ For reference — every decision made through the design process, in one
place.
### Foundations
1. ✅ JSON Schema 2020-12 is the single schema language for everything
1. ✅ JSON Schema (draft 2020-12) is the single schema language for everything
2. ✅ Definition is the program; infrastructure is the interpreter
3. ✅ List of steps (not single action) in the plan, with `output_as` chaining
4. ✅ One capability registry serving native + MCP + LLM operations through the same interface
5. ✅ Capability IDs do not leak handler kind (`slack.post_message`, not `mcp.slack.post_message`)
6. ✅ Name-based resolution: definitions reference actions and capabilities by string ID. The registry is the runtime's vocabulary; lookup is a dict access. No code references in definitions.
7. ✅ The expressive spectrum runs from pure direct calls to broad agent_task; the NL generator proposes the cheapest shape that meets intent (Shape 6 from §4 by default)
4. ⏸ Capability unification layer (one catalog shared by automations, agents, and future surfaces) — **deferred to post-v1** (see §3). v1 ships actions only.
5. ✅ Name-based resolution: definitions reference action and trigger types by string ID. The registry is the runtime's vocabulary; lookup is a dict access. No code references in definitions.
6. ✅ The expressive spectrum runs from pure direct calls to broad agent_task; the NL generator proposes the cheapest shape that meets intent (Shape 6 from §4 by default)
### Trigger taxonomy
8. ✅ Three trigger types: `schedule`, `webhook`, `event`
@ -1183,7 +1137,7 @@ place.
19. ✅ No DAGs, no parallelism, no loops — composition via agent_task or events
20. ✅ `on_failure` part of execution policy from v1
21. ✅ Step-level retry and timeout overrides
22. ✅ Budget cap enforced pre-enqueue and mid-flight
22. ⏸ Budget cap enforced pre-enqueue and mid-flight — **deferred** until the cost ledger ships (see §8 Budget enforcement)
### Components
23. ✅ Dispatcher / executor / handlers / registry — distinct, each replaceable
@ -1197,25 +1151,22 @@ place.
29. ✅ Automations publish run events for composability
30. ✅ Publish/subscribe behind interface — no direct table access elsewhere
### Capability storage
31. ✅ Native capabilities registered in-memory at startup from the codebase. Identical across all workers.
32. ⏸ MCP capability metadata persisted in `mcp_connections` and `mcp_tools` tables — **deferred to Phase 4**
33. ⏸ MCP handler closures built lazily per worker from database state — **deferred to Phase 4**
34. ⏸ MCP server tool list re-harvested on a schedule — **deferred to Phase 4**
35. ⏸ MCP tools harvested into the capability registry at connection time — **deferred to Phase 4**
36. ⏸ Side effects inferred from MCP hints + naming + admin overrides — **deferred to Phase 4**
37. ⏸ MCP tools callable directly (no agent required) when caller knows args — **deferred to Phase 4**
### Capability unification — all deferred to post-v1
31. ⏸ One shared catalog of "things this SurfSense instance can do" — **deferred**, see §3
32. ⏸ Handler `CallContext` (caller user id, search space id, run id) — **deferred** with unification
33. ⏸ Per-capability scope declarations driving authorization — **deferred** with unification
34. ⏸ MCP integration on top of the unification layer (`mcp_connections`, `mcp_tools`, harvester) — **deferred to Phase 4**
### Credentials — all deferred to Phase 2
38. ⏸ Credentials never appear in the automation definition — only connection IDs do — **Phase 2**
39. ⏸ Credentials never appear in the LLM's context — the host holds them — **Phase 2**
40. ⏸ Credentials resolved per-call by `ActionContext`, not pre-loaded into worker environment — **Phase 2**
41. ⏸ Tokens encrypted at rest; refresh handled automatically by `ActionContext.resolve_*_client` — **Phase 2**
35. ⏸ Credentials never appear in the automation definition — only connection IDs do — **Phase 2**
36. ⏸ Credentials never appear in the LLM's context — the host holds them — **Phase 2**
37. ⏸ Credentials resolved per-call by the handler context, not pre-loaded into worker environment — **Phase 2**
38. ⏸ Tokens encrypted at rest; refresh handled automatically by the handler context — **Phase 2**
### v1-minimum (new lock)
v1. ✅ `Capability` is exactly five fields: `id`, `description`, `input_schema`, `output_schema`, `handler`. Additional fields are added only when a concrete consumer feature requires them.
v2. ✅ Cost is **measured** from a per-run ledger, not declared. Pre-flight cost checks return when the ledger has enough history.
v3. ✅ Single `automations_default` Celery queue in v1. Multi-queue routing returns when load justifies it.
### v1-minimum
39. ✅ v1 ships actions only — no separate capability layer. `ActionDefinition` is five fields: `type`, `name`, `description`, `params_schema`, `handler`. Additional fields are added only when a concrete consumer feature requires them.
40. ✅ Cost is **measured** from a per-run ledger, not declared. Pre-flight cost checks return when the ledger has enough history.
41. ✅ Single `automations_default` Celery queue in v1. Multi-queue routing returns when load justifies it.
### NL authoring
42. ✅ LLM-authored templates is the primary path from day one — not a Phase 3 addition. Hand-authoring JSON is supported but secondary
@ -1227,7 +1178,7 @@ v3. ✅ Single `automations_default` Celery queue in v1. Multi-queue routing ret
48. ✅ NL drafts are transient storage, not a core table
### Data model
49. ✅ Six tables total — four for engine state, two for MCP persistence
49. ✅ v1 ships three tables (`automations`, `automation_triggers`, `automation_runs`). `domain_events` lands in Phase 3; `mcp_connections` and `mcp_tools` in Phase 4.
50. ✅ Run rows snapshot the definition (immutable history)
51. ✅ All entities scoped by `search_space_id` for RBAC
52. ✅ Editing an automation bumps `version`; existing runs unaffected
@ -1283,7 +1234,7 @@ Schemas spelled out concretely. Those follow mechanically from this plan.
agent (cron and skills subsystems); n8n documentation on node types and
workflow data model; the SurfSense repository and DeepWiki architecture
notes (FastAPI + Celery Beat + Electric SQL + LangGraph Deep Agents +
Search Space RBAC); Model Context Protocol specification for capability
harvesting; AWS EventBridge for filter grammar; workflow-pattern
Search Space RBAC); Model Context Protocol specification for external
tool harvesting; AWS EventBridge for filter grammar; workflow-pattern
literature (van der Aalst et al.) for the trigger / action / concurrency
vocabulary.*