mirror of
https://github.com/MODSetter/SurfSense.git
synced 2026-05-29 19:35:20 +02:00
§9 (Data model): drop from six tables to three. v1 ships automations, automation_triggers, automation_runs only. domain_events deferred to Phase 3 (event trigger); mcp_connections/mcp_tools deferred to Phase 4 (MCP integration). Remove the table definitions for the deferred ones and replace with a deferred-tables note pointing to the consuming phase. automation_triggers.type enum narrowed to schedule|manual for v1. Webhook and event types ship with their respective phases. secret_hash column deferred to Phase 2 alongside the webhook trigger. automation_runs.cost_usd column deferred until at least one v1 capability records token-level cost — additive when reintroduced. §14 (Phase 1) reorganized into four explicit steps matching the work we're about to do: scaffolding + schemas + empty registries (step 1), then registry population (step 2), then executor (step 3), then NL authoring + UI (step 4). The current commit batch lands step 1 only.
1289 lines
58 KiB
Markdown
# SurfSense Automation Feature — Design Plan (v2)

A generic, extensible automation system for SurfSense that lets users (and
future SurfSense features) trigger agent work on a schedule, on an external
event, or on demand — with the ability to author automations either by hand
or from a natural-language description that yields an editable, structured
definition.

This document supersedes the v1 draft. It folds in the design audit pass and
the corrections from working through worked examples (notably: removing the
connector bias, clarifying the executor's role, integrating MCP cleanly, and
committing to JSON Schema as the single declarative language).

---

## 1. The load-bearing principle

> **The JSON definition is the program. Everything else is interpreter.**

Every decision in this document serves that principle. If we ever face a
design choice and one option lets some behavior leak out of the definition
into the engine, we pick the other option.

Three properties follow from this principle, and they're the reason the
system will survive feature growth:

- **Reproducibility** — same definition + same inputs → same observable
  behavior, regardless of which version of the engine runs it.
- **Portability** — definitions can be exported, imported,
  version-controlled, code-reviewed, and shared across SurfSense instances.
- **LLM tractability** — the NL authoring flow works because the LLM only
  needs to produce a self-contained JSON document that validates against a
  schema. It doesn't need to understand the engine.

---

## 2. The four-layer contract

The system is structured as four layers. Layers 1, 2, and 4 are defined by
SurfSense developers (at registration time). Layer 3 is what users write
(or the NL generator produces). The runtime reads all four to do its job.

| Layer | What it is | Defined by |
| ----- | ---------- | ---------- |
| **1. Capability registry** | What this SurfSense instance can do | Developers, at startup |
| **2. Action contract** | Per-action input/output schema | Developers, at startup |
| **3. Automation definition** | One concrete saved automation | Users (or NL generator) |
| **4. Trigger contract** | Per-trigger config and payload schemas | Developers, at startup |

Each layer constrains the one above. The runtime reads all four but doesn't
know what's in them ahead of time. That's how a new capability or trigger
type becomes available across the engine without code changes outside its
registration.

### Schema language

Every shape in every layer is described in **JSON Schema (draft 2020-12).**
No exceptions, no parallel languages, no inline shorthand. Two documented
extensions on top:

- `default: "$some_token"` — runtime-resolved defaults. The vocabulary is
  fixed: `$last_fired_at`, `$creator`, `$space_default`. The engine resolves
  these to values before validation.
- `x-surfsense-*` annotations — editor hints (widget type, autocomplete
  source). The validator ignores them; the form editor reads them.

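The runtime-resolved defaults can be sketched as a small pre-validation pass. This is a minimal illustration, not the engine's implementation: the three token names come from the list above, while `resolve_defaults`, `RUNTIME_DEFAULTS`, and the `ctx` keys (`last_fired_at`, `creator_id`, `space_default`) are hypothetical names chosen for the example. It handles top-level properties only and leaves validation to whichever JSON Schema validator the engine uses.

```python
import copy

# Hypothetical resolver table. Token names are from the plan; the lookup
# functions and the ctx keys they read are assumptions for illustration.
RUNTIME_DEFAULTS = {
    "$last_fired_at": lambda ctx: ctx["last_fired_at"],
    "$creator": lambda ctx: ctx["creator_id"],
    "$space_default": lambda ctx: ctx["space_default"],
}


def resolve_defaults(schema: dict, supplied: dict, ctx: dict) -> dict:
    """Fill missing top-level inputs from schema defaults, resolving $tokens."""
    resolved = dict(supplied)
    for name, prop in schema.get("properties", {}).items():
        if name in resolved or "default" not in prop:
            continue  # caller supplied a value, or the schema has no default
        default = prop["default"]
        if isinstance(default, str) and default.startswith("$"):
            if default not in RUNTIME_DEFAULTS:
                raise ValueError(f"unknown runtime default token: {default}")
            default = RUNTIME_DEFAULTS[default](ctx)
        # Copy so a list/dict default is never shared between runs.
        resolved[name] = copy.deepcopy(default)
    return resolved
```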

---

## 3. Capability registry (Layer 1)

A `Capability` is one discrete thing the SurfSense backend exposes —
"post a Slack message," "query the Search Space," "generate a podcast." It
is the atomic unit of "things automations can do."

```python
from dataclasses import dataclass
from typing import Any, Awaitable, Callable

# Assumed shape; the plan only names the handler type.
AsyncHandler = Callable[..., Awaitable[Any]]

@dataclass
class Capability:
    id: str                 # "slack.post_message"
    description: str        # for the NL generator + UI label
    input_schema: dict      # JSON Schema
    output_schema: dict     # JSON Schema
    handler: AsyncHandler
```

### v1-minimum: five fields, nothing else

The Capability is **deliberately five fields in v1**. Every additional field
that earlier drafts considered (`name`, `required_credentials`,
`side_effects`, `expected_duration_seconds`, `cost_estimate`) has been
removed until a concrete consumer feature demands it. Authoring stays cheap
and the registry stays trivial to introspect:

- `name` → folded into `description`. The UI can render a short label from
  the first line of `description` or fall back to `id`. No separate field
  needed in v1.
- `required_credentials` → returns when external-credential capabilities
  ship (Phase 2). v1 capabilities run server-side with app config; nothing
  to declare.
- `side_effects` → returns when RBAC inside automations or
  `READ_ONLY`-only agent tool gating arrives. v1 capabilities are
  hand-picked and all trusted code.
- `expected_duration_seconds` → returns when multi-queue routing ships.
  Single Celery queue in v1.
- `cost_estimate` → never returns as a declared field; cost is measured
  per run from a ledger, aggregated per Capability, and surfaced as a
  historical average. Pre-flight checks are deferred.

The runtime invariant: a Capability is **a typed, named, callable thing
the system can do.** Every consumer (executor, agent tool layer, future
HTTP API) sees the same five-field shape and uses it the same way.

### Where capabilities live (v1)

In v1, the capability registry is a single in-memory dict, populated at
process startup from native registrations in
`automations/registries/capabilities/`. Identical across all workers.
No database persistence, no closures rebuilt per worker.

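A startup-time registry of this kind could look roughly like the sketch below. It assumes the five-field `Capability` shape from above; `register_capability`, `get_capability`, the `_CAPABILITIES` dict, and the stub handler are hypothetical names for illustration, not the module's actual API. The `search_space.query` ID is taken from the plan's own examples.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable

AsyncHandler = Callable[..., Awaitable[Any]]  # assumed handler shape


@dataclass(frozen=True)
class Capability:
    id: str
    description: str
    input_schema: dict
    output_schema: dict
    handler: AsyncHandler


# One process-wide dict, filled at import time and read-only afterwards.
_CAPABILITIES: dict[str, Capability] = {}


def register_capability(cap: Capability) -> Capability:
    """Add a capability; duplicate IDs are a programming error."""
    if cap.id in _CAPABILITIES:
        raise ValueError(f"duplicate capability id: {cap.id}")
    _CAPABILITIES[cap.id] = cap
    return cap


def get_capability(cap_id: str) -> Capability:
    """Name-based lookup used by every consumer of the registry."""
    if cap_id not in _CAPABILITIES:
        raise LookupError(f"unknown capability: {cap_id}")
    return _CAPABILITIES[cap_id]


async def _query_stub(ctx: Any, args: dict) -> dict:
    # Placeholder body; a real handler would query the Search Space.
    return {"results": [], "query": args.get("query", "")}


register_capability(Capability(
    id="search_space.query",
    description="Query the Search Space",
    input_schema={"type": "object", "properties": {"query": {"type": "string"}}},
    output_schema={"type": "object"},
    handler=_query_stub,
))

if __name__ == "__main__":
    cap = get_capability("search_space.query")
    print(asyncio.run(cap.handler(None, {"query": "competitor"})))
```

Because registration happens at module import, every worker that imports the same modules ends up with the same dict, which is the property the paragraph above relies on.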
### MCP integration — deferred to Phase 4

The earlier two-tier registry (native + MCP-derived), the
`mcp_connections` / `mcp_tools` tables, the harvester, and the lazy
per-worker closure cache are **deferred to Phase 4** along with the
rest of the integration-tooling surface. They are removed from v1
because:

- v1 has no external connector capabilities (no Slack, Notion, Drive,
  etc.). The only capabilities that will ship are server-side helpers
  (search-space query / fetch) plus the loose `agent_task` action.
- Without external connectors, the lifecycle mismatch that motivates
  the two-tier design (connect Monday, run Friday, workers restarted
  in between) doesn't arise. A startup-time dict is sufficient.
- Phase 4 reintroduces this design as-is — the registry interface in
  v1 is the same callable surface a Phase-4 MCP harvester will register
  into. The deferral is additive, not a different design.

See archived design at `docs/automation/archived/mcp-registry.md` once
v1 ships; for now the only consumer of the registry is the in-memory
native path.

### Credentials — deferred to Phase 2

The earlier per-call credential resolution pattern (`ctx.resolve_mcp_client`,
`ctx.resolve_http_client`, `ctx.resolve_llm`) is **deferred to Phase 2**.
v1 capabilities run server-side using app-level configuration; none of
the seven v1 capabilities needs per-user or per-connection auth.

When Phase 2 ships external-credential capabilities (Slack, email, etc.),
the three guarantees the original design promised are reintroduced
unchanged:

- Credentials never appear in the automation definition (connection IDs
  only).
- Credentials never appear in the LLM's context (the host holds them
  and uses them on the LLM's behalf when executing tool calls).
- Credentials are loaded per-call, not pre-loaded into worker memory.

The Phase-2 design returns as-is; only the v1 surface is simplified.

---

## 4. Action contract (Layer 2)

An `Action` is what a user references in a plan step. Most actions are
thin wrappers around one capability (e.g., `slack_post` wraps
`slack.post_message`). Some compose: `agent_task` is one action whose
handler invokes the LangGraph runtime, which in turn can call many
capabilities.

```python
from dataclasses import dataclass

# DynamicOutput, ArtifactSpec, and AsyncHandler are assumed to be defined
# elsewhere in the automations package.

@dataclass
class ActionDefinition:
    type: str                                # "agent_task", "slack_post"
    name: str                                # for the UI
    description: str                         # for the NL generator
    config_schema: dict                      # JSON Schema for action.config
    output_contract: dict | DynamicOutput    # what it produces
    uses_capabilities: list[str]             # IDs from the registry
    produces_artifacts: list[ArtifactSpec]   # see §8
    handler: AsyncHandler
```

### Tight vs loose actions

Two patterns coexist by design:

- **Tight actions** (`slack_post`, `linear_create_issue`, `send_email`):
  config_schema is fully specified, output_contract is fixed, handler is a
  thin wrapper. ~20 LOC each. Used when the user knows exactly what they
  want done — no LLM tokens spent on trivial work.

- **Loose actions** (`agent_task`): config_schema accepts a `prompt` and a
  `tools` allowlist; output_contract is *dynamic* — the user declares the
  output shape they want via `output_schema` in the step config; the
  handler asks the LLM to return that shape and validates. Used when
  judgment is needed.

The agent's tool list is **the same capabilities** that tight actions call
directly. One registry, two invocation modes. Adding a new MCP server gives
both modes access to its tools automatically.

### How names in the definition become function calls

The definition contains strings like `"action": "slack_post"`. The string is
just a name — it does not point to a function. At runtime, the executor
performs a **name-based lookup** against the action registry:

```python
# step.action is a string from the JSON definition, e.g. "slack_post"
action_def = _ACTION_REGISTRY[step.action]     # dict lookup
handler = action_def.handler                   # Python callable
result = await handler(ctx, resolved_config)   # invocation
```

The registry is a Python dict (or a thin wrapper around one) populated at
process startup. Each entry in `automations/actions/*.py` calls a
`register_action(...)` function at module import time, putting its
`ActionDefinition` (including the handler function reference) into the
registry.

The same pattern applies to capabilities. The definition references
capabilities by ID (`"slack.post_message"`); the capability registry maps
the ID to a `Capability` object holding the handler. Definitions never
reference Python code directly — they reference names that the registry
resolves to code.

This separation is what makes the contract portable. The definition is
pure data. The registry is the engine's runtime vocabulary. They meet at
name-based lookup; nothing else crosses the boundary.

### The full expressive spectrum

The contract supports a continuous spectrum from purely deterministic to
fully agentic. Six practical shapes worth recognizing:

| Shape | Example | Cost / latency profile |
| --- | --- | --- |
| **1. Direct call** | `slack_post` with literal channel and template | No LLM. ~200ms. Fractions of a cent. |
| **2. Direct call with computed inputs** | `linear_create_issue` using `{{summary.title}}` from a prior step | No LLM for this step. Cheap. |
| **3. Single-domain agent task** | `agent_task` with `tools: ["slack.*"]` only | One LLM, bounded toolset. |
| **4. Multi-domain agent task, narrow** | `agent_task` with `tools: ["github.list_pull_requests", "linear.create_issue"]` | One LLM, named capabilities. |
| **5. Multi-domain agent task, broad** | `agent_task` with `tools: ["slack.*", "github.*", "linear.*"]` | One LLM, large toolset, most agentic. |
| **6. Composed plan** | `agent_task` (narrow) for thinking → `slack_post` + `linear_create_issue` for acting | Best cost-to-power ratio. |

Shape 6 is the underrated one and the cost-and-speed answer. The agent
reasons once (Shape 3 or 4) and its structured output drives several
deterministic actions. This is roughly 5–10x cheaper and 3–4x faster than
forcing the agent to do everything (Shape 5) and produces the same outcome.

**The NL generator's job is to propose Shape 6-style plans by default.**
The Review LLM flags proposals that use `agent_task` for steps a
deterministic action could handle. This is the discipline that keeps
automations cheap at scale.

The user navigates the spectrum by intent (describing what they want), not
by mechanism — the shape selection is the engine's responsibility, not the
user's.

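The `tools` allowlists in Shapes 3 to 5 mix exact capability IDs with wildcard patterns such as `slack.*`. One way to expand such a list against the registry is shell-style matching, sketched below. The plan does not specify the matching rules, so treating `*` as an `fnmatch`-style wildcard, rejecting patterns that match nothing, and the `expand_tool_allowlist` name are all assumptions made for this example.

```python
from fnmatch import fnmatchcase


def expand_tool_allowlist(patterns: list[str], registered_ids: list[str]) -> list[str]:
    """Return the registered capability IDs matched by any allowlist pattern.

    Order follows the registry, and each ID appears at most once. A pattern
    that matches nothing is treated as an authoring error (an assumption).
    """
    unmatched = [
        p for p in patterns
        if not any(fnmatchcase(cap_id, p) for cap_id in registered_ids)
    ]
    if unmatched:
        raise ValueError(f"allowlist patterns match no capability: {unmatched}")
    return [
        cap_id for cap_id in registered_ids
        if any(fnmatchcase(cap_id, p) for p in patterns)
    ]
```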

---

## 5. Automation definition (Layer 3)

This is the JSON the user writes (or the NL generator produces). Stored in
`automations.definition` as JSONB.

### Top-level shape

```jsonc
{
  "schema_version": "1.0",
  "name": "Daily competitor digest",
  "goal": "Summarize new competitor content and post to Slack",

  "inputs": {
    "schema": {
      "type": "object",
      "required": ["since"],
      "properties": {
        "since": { "type": "string", "format": "date-time",
                   "default": "$last_fired_at" },
        "tags":  { "type": "array", "items": { "type": "string" },
                   "default": ["competitor"] }
      }
    }
  },

  "triggers": [
    {
      "type": "schedule",
      "config": { "cron": "0 9 * * 1-5", "timezone": "Africa/Kigali" }
    }
  ],

  "plan": [
    {
      "step_id": "research",
      "action": "agent_task",
      "config": {
        "prompt": "Find documents tagged {{inputs.tags}} indexed since {{inputs.since}}. Return JSON with bullets and source_doc_ids.",
        "tools": ["search_space.query", "search_space.fetch_document"],
        "model": "anthropic/claude-sonnet-4-7",
        "output_schema": {
          "type": "object",
          "required": ["bullets", "source_doc_ids"],
          "properties": {
            "bullets": { "type": "array", "items": { "type": "string" } },
            "source_doc_ids": { "type": "array", "items": { "type": "string" } }
          }
        }
      },
      "output_as": "summary"
    },
    {
      "step_id": "deliver",
      "action": "slack_post",
      "config": {
        "channel_id": "C0123",
        "message_template": "*Competitor digest*\n\n{% for b in summary.bullets %}• {{b}}\n{% endfor %}"
      }
    }
  ],

  "execution": {
    "timeout_seconds": 600,
    "max_retries": 2,
    "retry_backoff": "exponential",
    "concurrency": "drop_if_running",
    "budget_cap_usd": 1.50,
    "on_failure": [ /* steps to run if main plan fails after retries */ ]
  },

  "metadata": { "tags": ["digest"], "created_from_nl": true }
}
```

### Plan steps

```jsonc
{
  "step_id": "...",           // unique within plan
  "action": "...",            // references an ActionDefinition.type
  "when": "{{ ... }}",        // optional Jinja expr → bool; false = skip
  "config": { ... },          // validated against action's config_schema
  "output_as": "...",         // binds output to this name for later steps
  "max_retries": 0,           // optional, overrides automation default
  "timeout_seconds": 1200     // optional, overrides automation default
}
```

Steps run **sequentially**. No parallelism, no DAGs, no loops. If a user
needs branching, they use `when:` on multiple steps. If they need
parallelism or iteration, they use `agent_task` and let the agent reason
about it, or they compose automations through events (§7.5).

---

## 6. Trigger contract (Layer 4)

Three trigger types. That's the entire taxonomy.

### `schedule`

```python
TriggerDefinition(
    type="schedule",
    config_schema={
        "type": "object",
        "required": ["cron", "timezone"],
        "properties": {
            "cron": { "type": "string" },
            "timezone": { "type": "string", "format": "iana-timezone" }
        }
    },
    payload_schema={
        "type": "object",
        "properties": {
            "fired_at": { "type": "string", "format": "date-time" },
            "scheduled_for": { "type": "string", "format": "date-time" },
            "last_fired_at": { "type": "string", "format": "date-time" }
        }
    }
)
```

Implementation: extends `app/utils/periodic_scheduler.py`, which already
reads connector sync schedules. Adds a second source —
`automation_triggers WHERE type='schedule'`. Same Celery Beat checker, two
source tables.

Minimum interval: 1 minute (the existing checker's resolution). When a user
sets an interval under 15 minutes, the form editor warns that an event
trigger is probably a better fit.

### `webhook`

```python
TriggerDefinition(
    type="webhook",
    config_schema={
        "type": "object",
        "properties": {
            "input_mapping": {
                "type": "object",
                "additionalProperties": { "type": "string" }
                # values are JSONPath expressions
            }
        }
    },
    # payload is whatever the POST body is; user-defined shape via mapping
)
```

Endpoint: `POST /api/v1/automations/{id}/fire`. Bearer token shown once,
hashed at rest, rotatable, revocable. Returns `202 Accepted` with the
created run's URL. Caller polls for status; we do not push callbacks in
v1 (a `callback_webhook` action can be added later).

Idempotency: honors `Idempotency-Key` header or `idempotency_key` in body.
Dedups against runs in the last 24 hours.

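The token and idempotency rules above can be illustrated with a short standard-library sketch. The plan says only that the token is "hashed at rest"; using SHA-256 here is an assumption, as are the function names. The in-memory `seen` dict stands in for a query against the runs table and is not how the dispatcher would actually store keys.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timedelta, timezone

DEDUP_WINDOW = timedelta(hours=24)  # "runs in the last 24 hours"


def issue_token() -> tuple[str, str]:
    """Return (plaintext shown once to the user, hash stored at rest)."""
    token = secrets.token_urlsafe(32)
    return token, hashlib.sha256(token.encode()).hexdigest()


def verify_token(presented: str, stored_hash: str) -> bool:
    """Hash the presented bearer token and compare in constant time."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(digest, stored_hash)


def is_duplicate(key: str, seen: dict[str, datetime], now: datetime) -> bool:
    """True if `key` was already accepted inside the dedup window."""
    first_seen = seen.get(key)
    if first_seen is not None and now - first_seen < DEDUP_WINDOW:
        return True
    seen[key] = now  # first sighting, or the previous one has expired
    return False


if __name__ == "__main__":
    token, stored = issue_token()
    print(verify_token(token, stored))            # the issued token verifies
    seen: dict[str, datetime] = {}
    now = datetime.now(timezone.utc)
    print(is_duplicate("order-42", seen, now))    # first fire is accepted
    print(is_duplicate("order-42", seen, now))    # replay is a duplicate
```

Rotation and revocation then amount to replacing or deleting the stored hash; the plaintext never needs to be recoverable.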
### `event`

```python
TriggerDefinition(
    type="event",
    config_schema={
        "type": "object",
        "required": ["event_type"],
        "properties": {
            "event_type": { "type": "string" },   # e.g. "drive.file_added"
                                                  # or "surfsense.podcast.generated"
            "filters": { "$ref": "#/definitions/filter_expression" }
        }
    }
    # payload shape is documented per event_type in a separate registry
)
```

**Events absorb both connector events and internal SurfSense events.** A
file added to Drive and a podcast finishing in SurfSense are both events
in the same `domain_events` table, both subscribable by automations, both
matched by the same dispatcher code. The engine doesn't distinguish.

### Filter grammar

Filters are JSON-structured operators, not expressions. This is the one
place we deliberately don't use Jinja, because filters run on a hot path
(every event matched against every subscribing trigger) and structured
filters can be indexed and short-circuited.

Vocabulary:
- Equality: `equals`, `not_equals`
- String: `starts_with`, `ends_with`, `contains`, `regex`
- Numeric: `gt`, `gte`, `lt`, `lte`
- Set: `in`, `not_in`
- Existence: `exists`
- Composition: `$and`, `$or`, `$not`

Inspired by AWS EventBridge and MongoDB query syntax. The filter grammar
itself is published as a JSON Schema, so users get inline error messages.

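To make the vocabulary concrete, here is a minimal evaluator sketch. The plan lists the operators but not the document shape, so the layout used here is an assumption: each key is either a composition operator or a dotted field path whose value is an `{operator: operand}` object, for example `{"file.size": {"gt": 1024}}`. Treating every operator except `exists` as false on a missing field is also an assumption, and all function names are hypothetical.

```python
import re
from typing import Any

_MISSING = object()  # sentinel for "field not present in the event"


def _is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


OPERATORS = {
    "equals": lambda a, e: a == e,
    "not_equals": lambda a, e: a != e,
    "starts_with": lambda a, e: isinstance(a, str) and a.startswith(e),
    "ends_with": lambda a, e: isinstance(a, str) and a.endswith(e),
    "contains": lambda a, e: isinstance(a, str) and e in a,
    "regex": lambda a, e: isinstance(a, str) and re.search(e, a) is not None,
    "gt": lambda a, e: _is_number(a) and a > e,
    "gte": lambda a, e: _is_number(a) and a >= e,
    "lt": lambda a, e: _is_number(a) and a < e,
    "lte": lambda a, e: _is_number(a) and a <= e,
    "in": lambda a, e: a in e,
    "not_in": lambda a, e: a not in e,
}


def _lookup(event: dict, dotted_path: str) -> Any:
    """Walk a dotted path through nested dicts; return _MISSING if absent."""
    current: Any = event
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def _field_matches(actual: Any, conditions: dict) -> bool:
    for op, expected in conditions.items():
        if op == "exists":
            if (actual is not _MISSING) != expected:
                return False
        elif actual is _MISSING or not OPERATORS[op](actual, expected):
            return False
    return True


def matches(filter_expr: dict, event: dict) -> bool:
    """All top-level keys must hold; all() and any() short-circuit."""
    for key, cond in filter_expr.items():
        if key == "$and":
            ok = all(matches(sub, event) for sub in cond)
        elif key == "$or":
            ok = any(matches(sub, event) for sub in cond)
        elif key == "$not":
            ok = not matches(cond, event)
        else:
            ok = _field_matches(_lookup(event, key), cond)
        if not ok:
            return False
    return True
```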

---

## 7. Runtime components

Each component is distinct, replaceable, and has one job.

### 7.1 Dispatcher

What it does: matches firing triggers to automations, creates `AutomationRun`
rows, enqueues executor tasks.

For schedule triggers: Celery Beat polls the trigger table, computes due
ones, fires.

For webhook triggers: the FastAPI handler is the dispatcher entry point.
Validates token, runs input_mapping, creates run.

For event triggers: subscribes to the `domain_events` table. For each new
event, evaluates all matching triggers' filters, fires the matches.

Common path (after a trigger has fired):
1. Resolve `inputs` from trigger payload and defaults
2. Validate resolved inputs against the automation's input schema
3. **Idempotency check** — dedup against existing pending/running runs
4. **Snapshot the resolved definition** into the run row (immutable history)
5. Enqueue executor task on the single `automations_default` Celery queue

The cost-estimate pre-check (originally step 3) is **deferred**.
v1 capabilities do not declare `cost_estimate`; pre-flight budgeting
returns when a historical-cost ledger exists. The mid-flight budget
cap (§7.2) still kills the run if accumulated cost crosses
`budget_cap_usd`.

Queue routing by `expected_duration_seconds` is **deferred** until load
patterns justify a second queue. v1 uses a single queue.

### 7.2 Executor

What it is: **a Celery task wrapping a single function that walks a plan
step by step.** Not an agent, not a workflow engine, not a scheduler. A
loop with bookkeeping. Maybe 200 lines.

```python
async def execute_run(run_id: int) -> None:
    run = load_run(run_id); run.status = "running"; save(run)
    context = build_run_context(run)
    step_outputs = {}

    for step in run.plan:
        if step.when and not evaluate_predicate(step.when, context | step_outputs):
            record_step_skipped(run, step); continue

        resolved_config = render_config(step.config, context | step_outputs)
        action = action_registry.get(step.action)
        validate(resolved_config, action.config_schema)
        # Dynamic-output actions (agent_task) declare output_schema in the
        # step config; every other action uses its fixed output_contract.
        output_schema = resolved_config.get("output_schema", action.output_contract)

        try:
            result = await with_retries(
                action.handler,
                ctx=build_action_context(run, action),
                args=resolved_config,
                policy=step.retry_policy or run.execution.retry_policy,
            )
            validate(result, output_schema)
            if step.output_as:
                step_outputs[step.output_as] = result
            record_step_succeeded(run, step, result)
        except Exception as e:
            record_step_failed(run, step, e)
            await run_on_failure(run, e)
            run.status = "failed"; save(run)
            publish_event("automation.run.failed", run)   # see §7.5
            return

    run.status = "succeeded"; save(run)
    publish_event("automation.run.succeeded", run)   # see §7.5
```

Intelligence lives **inside handlers**, not in the executor. The most
intelligent handler is `agent_task`, which spins up a LangGraph Deep Agent
for one step and returns when the agent finishes. The executor sees a
validated dict come back; it doesn't know that step was "smart."

### 7.3 Action handlers

One handler per `ActionDefinition.type`. Receives `(ctx, args)`, returns
a dict matching `output_contract` (or matching the user-declared
`output_schema` for dynamic-output actions like `agent_task`).

Handlers resolve their own credentials via `ctx.resolve_credentials`
(needed only once Phase 2 capabilities ship; see §3). They do not know
about retries, timeouts, or budget caps; those are the executor's concern.

### 7.4 Template engine

#### Why it exists

Most fields in an automation definition contain literal strings the user
authored once — but the actual rendered value has to change per run, because
it includes data from the trigger payload or from prior step outputs. The
template engine is what turns `"Daily digest for {{run.started_at}}"` into
`"Daily digest for 2026-05-26"` at run time.

Three fields use it:
- `*_template` strings in tight action configs (Slack messages, email bodies,
  Linear titles, etc.)
- `prompt` in `agent_task` configs (so the agent sees resolved values, not
  `{{...}}` placeholders)
- `when:` step predicates (which need to evaluate to a boolean)

#### Public interface

Single module, ~80 lines. Three public functions — everything else in the
engine routes through these:

```python
def render_template(template: str, context: dict) -> str: ...
def evaluate_predicate(expression: str, context: dict) -> bool: ...
def build_run_context(run, step_outputs) -> dict: ...
```

Backed by Jinja2's `SandboxedEnvironment`. The whole module is the seam: if
the template language is ever swapped, only this file changes.

#### Security architecture: allowlist by default

The engine's `SandboxedEnvironment` starts empty. A stock instance still
ships Jinja's default filters, tests, and globals (`range`, `dict`,
`namespace`, and a few others), so the engine clears those at construction.
After that, a template has access to:
- Variables in the context dict we pass in (`run`, `inputs`, prior step
  outputs)
- Public (non-underscore) attributes of those variables
- Jinja's built-in control flow (`{% if %}`, `{% for %}`, `{% set %}`)

Nothing else. No Python builtins, no modules, no I/O, no network, no
filesystem. Everything beyond the above must be **explicitly registered.**
This is the structurally important property: anything we didn't add is
inaccessible. The risk surface equals the size of what we registered.

The three sandbox rules that enforce this:
1. **Attribute access is filtered** — names starting with underscore are
   rejected. This blocks the entire family of `{{x.__class__.__mro__...}}`
   Python escape paths in one rule.
2. **Globals are allowlist-only** — `open`, `eval`, `exec`, `__import__`,
   `getattr`, every module name, are all absent unless we register them.
   We register zero globals and remove Jinja's defaults.
3. **Unsafe callables are blocked** — `str.format` and `str.format_map`
   are intercepted and routed through a sandboxed formatter (the fixes for
   CVE-2016-10745 and CVE-2019-10906), and anything marked
   `unsafe_callable` is refused.

#### What we register, exactly

- **Filters: a curated 15**, no more. `join`, `length`, `default`, `upper`,
  `lower`, `truncate`, `tojson`, `date`, `replace`, `trim`, `slugify`,
  `first`, `last`, `sort`, `reverse`. Each one is audited for what it does
  with its input; none of them takes a callable, runs `eval`, or reaches
  into Python objects beyond simple data transformation.
- **Globals: none.**
- **Tests: only the safe built-ins** (`defined`, `none`, `number`, `string`,
  `mapping`, `sequence`, `boolean`).

Adding a new filter requires a deliberate code change and review: does this
filter do anything dangerous with its input? If yes, don't add it. The list
only grows by audited additions.

#### Runtime limits (defense in depth)

The sandbox handles the attack surface inside the template language. Three
additional limits handle resource exhaustion that the language permits but
the runtime shouldn't tolerate:

- **Template source length capped at 8 KB.** Checked before parsing.
- **Render time capped at 100 ms per render.** Implemented via a watchdog
  thread; renders that exceed the cap are killed and the step fails.
  Catches nested loop bombs over large context values (and
  `{% for i in range(10**9) %}`, should `range` ever be registered).
- **Output size capped at 1 MB.** A small template can produce a multi-GB
  string via `{{ 'A' * 10**8 }}`-style multiplication; this catches it.

Plus `StrictUndefined`: any reference to a missing variable raises
immediately rather than silently rendering empty, so misconfigurations
fail fast.

#### Threat model and residual risk

The trust model from day one is:

- Templates are generated by an LLM from a user's natural-language input
  (see §10), or written/edited by humans in the editable form
- A second LLM reviews the proposal and produces a plain-language summary
  plus flagged anomalies for the user
- The user reviews and approves before the automation runs
- The Generator LLM's input is scoped (user prompt + schema + registry
  only — no arbitrary document content), minimizing prompt-injection paths

The sandbox + runtime limits + curated filter list protect against the
malformed-template attack. Human review protects against the
semantically-malicious-but-syntactically-valid attack. These are
complementary layers, not redundant.

Known residual risks, each genuinely small:

- **Future Jinja CVEs.** Historical sandbox bypasses have existed and
  been patched. This is a generic third-party-dependency risk, comparable
  to bugs in any other library we rely on. Mitigation: subscribe to
  security advisories, ship updates within a week of disclosure.
- **Side channels via prompts to LLMs.** A template that renders into a
  prompt can attempt prompt injection of the agent at run time. This is
  not a sandbox concern but a separate concern in `agent_task`'s design.
- **Operator deployments with long-lived secrets in worker env vars.**
  Mitigation: credentials fetched per-handler-per-call via
  `ActionContext.resolve_credentials`, never pre-loaded into worker
  env vars accessible to templates.

The sandbox-with-allowlist architecture means **the attack surface
equals the set of things we registered.** With zero globals registered
and 15 audited filters, the surface is small, bounded, and reviewable.
This is the structural property that makes the architecture sound, and
it doesn't depend on hypothetical assumptions about who authors templates.

#### Pre-Phase-5 gate

One trust-model change is documented in the roadmap: **Phase 5 introduces
template sharing across SearchSpaces** (automation templates as
exportable, importable artifacts). At that point, the *approver* of a
template (the original author) is no longer the *runner* (the importer).
The "human reviews before save" mitigation breaks down because the
reviewer doesn't bear the risk.

Before Phase 5 ships, this needs an explicit re-approval flow: importing
a template triggers a fresh review pass by the importing user, with the
flagged-anomalies output prominently displayed, and the import cannot
complete without explicit per-template approval.

This is a UX/flow decision, not a template-language migration. Jinja
itself stays; what changes is the approval workflow at the import boundary.

#### The `run.*` namespace exposed in every template

```
run.id, run.started_at, run.automation_id, run.automation_name,
run.automation_version, run.trigger_type, run.trigger_id,
run.search_space_id, run.creator_id, run.attempt,
run.failed_step_id, run.error.*    (only in on_failure context)
```

#### Default value rendering

Non-string template values render as JSON by default (via the `finalize`
hook): lists become `["a", "b"]`, dicts become `{"k": "v"}`, datetimes
become ISO 8601. The `| join`, `| length`, `| tojson` filters give explicit
control. Strings render as themselves with no quoting. `None` renders as
empty string in templates, as `null` in JSON contexts.

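The rendering rules above fit in one small function of the kind that could be passed as the `finalize` argument when constructing the Jinja environment. This is a standard-library sketch of the rules as stated, not the engine's code; `finalize_value` and `_json_default` are hypothetical names, and rendering numbers and booleans through `json.dumps` is an assumption that follows from "non-string values render as JSON".

```python
import json
from datetime import date, datetime
from typing import Any


def _json_default(obj: Any) -> str:
    """Serialize datetimes nested inside lists or dicts as ISO 8601."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    raise TypeError(f"not JSON-serializable: {type(obj).__name__}")


def finalize_value(value: Any) -> str:
    """Turn the result of a `{{ ... }}` expression into output text."""
    if value is None:
        return ""                    # None renders as empty in templates
    if isinstance(value, str):
        return value                 # strings render unquoted
    if isinstance(value, (datetime, date)):
        return value.isoformat()     # top-level datetimes: bare ISO 8601
    # Lists, dicts, numbers, booleans: JSON (nested None becomes null).
    return json.dumps(value, default=_json_default)


if __name__ == "__main__":
    print(finalize_value(["a", "b"]))
    print(finalize_value({"k": "v"}))
    print(finalize_value(datetime(2026, 5, 26, 9, 0)))
```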
### 7.5 Event bus

The event bus is a `domain_events` table, polled by Celery Beat alongside
the existing scheduler. Both connector events and internal SurfSense events
publish to it. Both are consumed by the dispatcher's event-trigger subscriber.

**Automations themselves publish events.** Successful and failed runs emit
`automation.run.succeeded` / `automation.run.failed` events with the run
metadata. This makes automations composable through events — chain them by
subscribing one automation's event trigger to another's run event. No new
mechanism; the trigger filter and event publishing already exist.

The upgrade path is documented: when throughput or latency demands it,
replace PostgreSQL polling with Redis Streams. The `events.publish()` and
`events.subscribe()` interfaces stay the same. Nothing else changes.

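The "same interface, swappable backend" idea can be illustrated with a small sketch. Only the method names `publish` and `subscribe` come from the text; `DomainEvent`, `EventBus`, and the in-memory backend are assumptions made for illustration, standing in for the PostgreSQL-polling and Redis Streams backends.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Callable, Protocol


@dataclass(frozen=True)
class DomainEvent:
    """Hypothetical event shape: a dotted type plus a JSON-like payload."""
    type: str                                   # e.g. "automation.run.succeeded"
    payload: dict[str, Any] = field(default_factory=dict)


Handler = Callable[[DomainEvent], None]


class EventBus(Protocol):
    """The two calls every backend would expose."""
    def publish(self, event: DomainEvent) -> None: ...
    def subscribe(self, event_type: str, handler: Handler) -> None: ...


class InMemoryEventBus:
    """Stand-in backend; delivers synchronously instead of polling a table."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: DomainEvent) -> None:
        # Copy the list so handlers may subscribe during delivery.
        for handler in list(self._handlers[event.type]):
            handler(event)
```

Chaining two automations then amounts to subscribing the second one's trigger to `automation.run.succeeded` events published by the first.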
---

## 8. Cross-cutting concerns

### Concurrency policy

Per-automation `concurrency` field controls what happens when a new fire
occurs while a previous run is still running:

- `drop_if_running` — silently skip the new fire
- `queue` — execute serially, in arrival order
- `allow_parallel` — start a new run independently

The dispatcher enforces this before enqueueing.

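The three policies reduce to a small decision the dispatcher can make before enqueueing. The sketch below is an assumption about how that check might look; the `Decision` enum and the `decide` function are hypothetical names, and only the three policy strings come from the text.

```python
from enum import Enum


class Concurrency(str, Enum):
    DROP_IF_RUNNING = "drop_if_running"
    QUEUE = "queue"
    ALLOW_PARALLEL = "allow_parallel"


class Decision(str, Enum):
    SKIP = "skip"          # do not create a run for this fire
    DEFER = "defer"        # hold the fire until the active run finishes
    ENQUEUE = "enqueue"    # start a run right away


def decide(policy: Concurrency, has_active_run: bool) -> Decision:
    """Map (policy, is a previous run still active?) to a dispatcher action."""
    if not has_active_run or policy is Concurrency.ALLOW_PARALLEL:
        return Decision.ENQUEUE
    if policy is Concurrency.DROP_IF_RUNNING:
        return Decision.SKIP
    return Decision.DEFER  # policy is QUEUE: run serially, in arrival order
```

How a deferred fire is stored until the active run finishes is left open here; the text only requires arrival order.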
### Retry policy

Three fields, per-automation defaults with optional per-step overrides:

- `max_retries`: integer, 0–10
- `retry_backoff`: `none` | `linear` | `exponential`
- `timeout_seconds`: integer

Retries on:

- Capability handler exceptions
- Output schema validation failures (for dynamic-output actions, the
  validation error is fed back to the LLM in the retry)

Does not retry on:

- `when:` evaluation failures (these are user errors, surface immediately)
- Input validation failures (caught at dispatch, never reach the executor)

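As an illustration of the three `retry_backoff` modes, here is a minimal sketch of a delay calculation. The base delay, the ceiling, and the function names are assumptions; the plan explicitly leaves the exact formulas (multipliers, jitter, ceilings) to implementation.

```python
def validate_max_retries(max_retries: int) -> int:
    """Enforce the documented 0-10 range."""
    if not 0 <= max_retries <= 10:
        raise ValueError("max_retries must be between 0 and 10")
    return max_retries


def retry_delay_seconds(
    backoff: str,
    attempt: int,
    base: float = 2.0,       # assumed base delay, not specified in the plan
    ceiling: float = 300.0,  # assumed upper bound, not specified in the plan
) -> float:
    """Delay before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if backoff == "none":
        return 0.0
    if backoff == "linear":
        delay = base * attempt
    elif backoff == "exponential":
        delay = base * 2 ** (attempt - 1)
    else:
        raise ValueError(f"unknown retry_backoff: {backoff!r}")
    return min(delay, ceiling)
```

A production version would probably add jitter so that retries from many runs do not line up.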
### Budget enforcement

`budget_cap_usd` is per-run. The dispatcher refuses to enqueue if estimated
cost exceeds it. The executor kills the run if accumulated cost crosses it
mid-flight (the LLM ops handler reports tokens consumed back to the
executor between calls).

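The mid-flight half of this check can be sketched as a small ledger that the executor updates between calls. `CostLedger` and `BudgetExceeded` are hypothetical names, and converting reported tokens into USD is assumed to happen before `record` is called.

```python
from dataclasses import dataclass
from decimal import Decimal


class BudgetExceeded(Exception):
    """Raised when accumulated cost crosses the per-run cap."""


@dataclass
class CostLedger:
    budget_cap_usd: Decimal
    spent_usd: Decimal = Decimal("0")

    def record(self, cost_usd: Decimal) -> None:
        """Add the cost of one call, then stop the run if the cap is crossed."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.budget_cap_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd} USD, cap is {self.budget_cap_usd} USD"
            )
```

`Decimal` is used instead of `float` so that repeated small additions do not drift around the cap.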
### On-failure handlers

`execution.on_failure` is a list of steps that run after the main plan has
failed and all retries are exhausted. They use the same step shape as the
main plan. On-failure steps cannot have their own `on_failure`. Failure
details are exposed as `run.error.*` in the run context.

### Artifacts

Actions that produce artifacts declare `produces_artifacts: list[ArtifactSpec]`:

```python
from dataclasses import dataclass


@dataclass
class ArtifactSpec:
    kind: str        # "audio", "document", "image", "data"
    retention: str   # "transient" | "default" | "permanent"
    visibility: str  # "private" | "search_space" | "shared"
```

The engine handles storage (writes to SurfSense's existing object storage),
URL generation (signed, scoped to the run's permissions), and cleanup (a
nightly Celery Beat task deletes expired artifacts).

### Duration classes and queue routing — deferred

The original design routed runs to multiple Celery queues based on each
capability's declared `expected_duration_seconds`. v1 ships with **one
queue** (`automations_default`) and capabilities do not declare a
duration. Multi-queue routing returns when burst load on a single queue
actually justifies the operational complexity of independent worker
pools.

Adding the second queue is a config change plus reintroducing
`expected_duration_seconds` on the `Capability` dataclass — both
mechanical, additive, and free of design rewrite.

---

## 9. Data model

**v1 ships three tables:** `automations`, `automation_triggers`,
`automation_runs`. All scoped by `search_space_id` for RBAC.

The other three tables described in earlier drafts are deferred:

- `domain_events` → **deferred to Phase 3** (introduced with the event
  trigger).
- `mcp_connections`, `mcp_tools` → **deferred to Phase 4** (MCP
  integration).

The deferred tables ship as-is when their consuming feature lands;
nothing in the v1 schema needs to change to accommodate them. The three
v1 tables form the engine's persistent state — definitions, triggers,
and an immutable run history.

### `automations`

| field | type | notes |
| --- | --- | --- |
| `id` | int PK | |
| `search_space_id` | FK → `search_spaces.id` | |
| `created_by` | FK → `users.id` | runs execute as this identity |
| `name` | str | |
| `description` | str | |
| `status` | enum | `active`, `paused`, `archived` |
| `definition` | jsonb | the editable structured spec |
| `version` | int | bumped on every edit |
| `created_at` / `updated_at` | timestamps | |

### `automation_triggers`

| field | type | notes |
| --- | --- | --- |
| `id` | int PK | |
| `automation_id` | FK | |
| `type` | enum: `schedule`, `manual` (Phase 2/3 add `webhook`, `event`) | |
| `config` | jsonb | validated against trigger's `config_schema` |
| `enabled` | bool | |
| `last_fired_at` | timestamp | |

`secret_hash` (for webhook bearer tokens) is **deferred to Phase 2** with
the webhook trigger.

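For when `secret_hash` is introduced, the usual pattern is to store only a hash of the bearer token and compare in constant time. The sketch below assumes SHA-256 and uses hypothetical function names; the plan does not fix the hashing scheme.

```python
import hashlib
import hmac
import secrets


def _hash_token(token: str) -> str:
    """Hex SHA-256 of the token (the hash choice is an assumption)."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def issue_webhook_token() -> tuple[str, str]:
    """Return (token shown once to the user, hash to store in secret_hash)."""
    token = secrets.token_urlsafe(32)
    return token, _hash_token(token)


def verify_webhook_token(presented: str, stored_hash: str) -> bool:
    """Constant-time comparison of the presented token against the stored hash."""
    return hmac.compare_digest(_hash_token(presented), stored_hash)
```

Only the hash is persisted, so a leaked database row does not reveal a usable token.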
### `automation_runs`

| field | type | notes |
| --- | --- | --- |
| `id` | int PK | |
| `automation_id` | FK | |
| `trigger_id` | FK / null | null = manual via UI |
| `status` | enum | `pending`, `running`, `succeeded`, `failed`, `cancelled`, `timed_out` |
| `definition_snapshot` | jsonb | the definition as it was when this run fired |
| `trigger_payload` | jsonb | |
| `resolved_inputs` | jsonb | |
| `step_results` | jsonb | array of per-step results with timing |
| `output` | jsonb / null | |
| `artifacts` | jsonb | references to created artifacts |
| `error` | jsonb / null | |
| `started_at` / `finished_at` | timestamps | |
| `agent_session_id` | str / null | link to LangGraph trace if agent_task was used |

`cost_usd` (per-run accumulated cost) is **deferred** until at least one
v1 capability records token-level cost. When reintroduced it lands as a
column-only migration.

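The interaction between `version` and `definition_snapshot` can be shown with plain dataclasses. This is a sketch of the intended behaviour, not the SQLAlchemy models; `Automation`, `AutomationRun`, `edit`, and `create_run` are hypothetical names, and the definition shape in the usage note is made up.

```python
import copy
from dataclasses import dataclass
from typing import Any


@dataclass
class Automation:
    id: int
    definition: dict[str, Any]
    version: int = 1

    def edit(self, new_definition: dict[str, Any]) -> None:
        """Every edit replaces the definition and bumps the version."""
        self.definition = new_definition
        self.version += 1


@dataclass(frozen=True)
class AutomationRun:
    automation_id: int
    definition_snapshot: dict[str, Any]
    status: str = "pending"


def create_run(automation: Automation) -> AutomationRun:
    """Deep-copy the definition so later edits cannot alter run history."""
    return AutomationRun(
        automation_id=automation.id,
        definition_snapshot=copy.deepcopy(automation.definition),
    )
```

A run created before an edit keeps the definition it fired with, while the automation itself moves on to the next version.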
### Deferred tables

- **`domain_events`** — the event bus backing event triggers. Ships in
  Phase 3 with the event trigger. v1 only emits `automation.run.*`
  events into application logs; the table is added when at least one
  consumer needs to subscribe to them.
- **`mcp_connections`** / **`mcp_tools`** — see §3. Both ship in Phase 4
  alongside the MCP harvester and the two-tier registry.

NL drafts are **not** a core table. They live in a generic short-TTL
store (Redis or a transient table) when the NL flow is built in
Phase 3.

---

## 10. NL authoring flow

**This is how the system is intended to be used from day one, not just a
Phase 3 addition.** The product surface is: user describes intent in natural
language, LLM produces a structured proposal, user reviews and edits in an
auto-generated form, then saves. Hand-authoring JSON directly is supported
but is not the primary path.

This shapes the trust model. Templates are LLM-generated from day one, not
hand-written by power users. The mitigation is human-in-the-loop review,
not "trusted authors only."

### Pass 1: Proposal generation

User provides natural-language input. The Generator LLM is given:

- The full schema set (input schema for definition, registry of action
  types with their config_schemas, registry of trigger types, available
  capabilities for this SearchSpace, list of allowed Jinja filters)
- A tool to list available connectors, channels, and other SearchSpace
  resources, so it doesn't invent names that don't exist
- A few-shot set of examples

**Scoped input.** The Generator does *not* receive arbitrary SearchSpace
document content. Its context is the user's prompt plus the schema and
registry information. This minimizes the prompt-injection surface — there's
no document text in the context for an attacker to seed instructions into.

If a user wants document-aware generation later ("create an automation
that processes documents like this one"), that's a deliberate feature
extension with its own prompt-injection mitigations, not the default flow.

Output: a structured proposal matching the automation definition schema.

### Pass 2: Deterministic validation

Server-side, before the proposal reaches the user:

- Validate against JSON Schema (shape correctness)
- Verify every capability referenced exists in the registry (resource existence)
- Verify every connector/channel/resource referenced exists in this SearchSpace
- Validate every template against the sandbox's allowlist (no underscore
  attributes, no unregistered filter names, length under cap)

Failures here are deterministic errors, not warnings. A proposal that
references a non-existent capability or includes a template using
`{{x.__class__}}` is rejected before the user sees it; the Generator is
re-prompted with the validation error and asked to fix the proposal.

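The validate-then-re-prompt loop described above can be sketched as follows. Everything here is illustrative: the proposal shape (`steps` with `capability` and `template` keys), the three-filter allowlist, the capability set, and the function names are assumptions, and the regex checks only approximate what a real implementation would do by walking the parsed template.

```python
import re
from typing import Any, Callable

MAX_TEMPLATE_BYTES = 8 * 1024                    # source length cap
ALLOWED_FILTERS = {"join", "length", "tojson"}   # illustrative subset of the allowlist
KNOWN_CAPABILITIES = {"slack.post_message"}      # stand-in for the capability registry

_EXPR = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)
_UNDERSCORE_ATTR = re.compile(r"\.\s*_\w*")
_FILTER = re.compile(r"\|\s*([A-Za-z_]\w*)")


def template_errors(source: str) -> list[str]:
    """Deterministic template checks: length, underscore attributes, filter names."""
    errors = []
    if len(source.encode("utf-8")) > MAX_TEMPLATE_BYTES:
        errors.append("template exceeds length cap")
    for expr in _EXPR.findall(source):
        if _UNDERSCORE_ATTR.search(expr):
            errors.append("underscore attribute access: " + expr.strip())
        for name in _FILTER.findall(expr):
            if name not in ALLOWED_FILTERS:
                errors.append("unregistered filter: " + name)
    return errors


def validate_proposal(proposal: dict[str, Any]) -> list[str]:
    """Collect every error so the Generator can fix them in one retry."""
    errors = []
    for step in proposal.get("steps", []):
        if step.get("capability") not in KNOWN_CAPABILITIES:
            errors.append("unknown capability: " + str(step.get("capability")))
        errors.extend(template_errors(step.get("template", "")))
    return errors


def generate_valid_proposal(
    generate: Callable[[str, list[str]], dict[str, Any]],
    prompt: str,
    max_attempts: int = 3,
) -> dict[str, Any]:
    """Call the Generator, feeding validation errors back until the proposal passes."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        proposal = generate(prompt, feedback)
        feedback = validate_proposal(proposal)
        if not feedback:
            return proposal
    raise ValueError(f"no valid proposal after {max_attempts} attempts: {feedback}")
```

JSON Schema validation and the SearchSpace resource checks are omitted here for brevity; they would add to the same error list.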
### Pass 2.5: Review pass

A second LLM call — the **Review LLM** — examines the validated proposal and
produces two outputs for the user:

1. **A plain-language summary** of what the automation will do, in business
   terms. "This automation will run every weekday at 9am. It reads documents
   in this SearchSpace tagged 'competitor' that were indexed since the last
   run, asks an agent to summarize them as 5 bullets, and posts the summary
   to your #engineering-standup Slack channel. Estimated cost: $0.40 per
   run."

2. **A "things worth checking" list** flagging anything unusual:
   - Templates with unusual attribute paths or filter usage
   - Prompts containing instructions that look more like commands than
     descriptions ("ignore previous instructions" style)
   - Action sequences that touch external systems without obvious benefit
     to the user
   - Cost estimates that seem high relative to the goal
   - References to capabilities the user hasn't used before
   - Schedules tighter than 15 minutes (likely should be event triggers)

The Review LLM is a **UX layer** that makes review actually useful. It is
**not a security boundary.** The deterministic controls (sandbox, runtime
limits, schema validator) are the security boundaries. The Review LLM
helps users catch their own intent mismatches and surfaces anomalies for
attention, but the sandbox would block dangerous templates even if the
Review LLM missed them.

This separation is important: two probabilistic controls compounding can
create a false sense of security. The Review LLM is explicitly framed in
the architecture as helper, not gatekeeper.

### Pass 3: Editable review

The user lands on a form pre-filled with the proposal. The page shows:

- The plain-language summary from the Review pass
- The flagged items, prominently displayed near the relevant fields
- The full editable form, auto-generated from the JSON Schemas
- Cost estimate and impact summary (which external systems get touched)

**Every field is editable.** Clarifications appear as required fields.
Templates are shown in code-styled fields with syntax highlighting and the
filter palette visible. The user can edit any field; saving re-runs Pass 2
(deterministic validation) before persisting.

Hitting **Save** promotes the proposal to an `automation` row.

### Editing existing automations

NL editing of an existing automation is a patch operation: the Generator
LLM receives the current definition plus the NL instruction and produces a
modified proposal. The same Pass 2 (validation) and Pass 2.5 (review) run
against the modified version, and the user reviews the diff before saving.
Existing run history is unaffected — only future runs use the new version.

### Why human-in-the-loop is non-negotiable

The Generator LLM, the Review LLM, and the sandbox are three layers of
defense against malformed or malicious proposals. The human approval step
is the fourth and most important layer. It exists because:

- LLMs can be prompt-injected; humans can spot text that asks them to
  ignore instructions
- LLMs can produce confident-but-wrong proposals; humans can catch
  semantic mismatches between intent and output
- The cost of a bad automation running unattended is high; the cost of a
  user clicking "approve" after reading is low

The architecture must never offer "auto-approve" or "skip review" options
for LLM-generated proposals. Save requires human action on the proposal,
always.

---

## 11. Repository layout

```
surfsense_backend/app/
├── automations/                  # NEW: the engine
│   ├── __init__.py
│   ├── models.py                 # SQLAlchemy models (3 tables in v1)
│   ├── schemas.py                # Pydantic schemas (definition envelope, etc.)
│   ├── routes.py                 # FastAPI router (/api/v1/automations)
│   ├── service.py                # CRUD + business logic
│   ├── dispatcher.py             # trigger matching, cost check, run creation
│   ├── executor.py               # the Celery task that runs a plan
│   ├── templating.py             # Jinja sandbox + filters
│   ├── events.py                 # publish/subscribe for domain_events
│   ├── filters.py                # JSON filter grammar evaluator
│   ├── actions/
│   │   ├── registry.py
│   │   ├── agent_task.py
│   │   ├── transform_data.py
│   │   ├── slack_post.py
│   │   ├── send_email.py
│   │   ├── notification.py
│   │   └── (more in Phase 5: podcast_generation, report_generation, ...)
│   ├── triggers/
│   │   ├── registry.py
│   │   ├── schedule.py           # Celery Beat hookup
│   │   ├── webhook.py            # /fire endpoint
│   │   └── event.py              # subscribes to domain_events
│   ├── capabilities/
│   │   ├── registry.py
│   │   ├── native.py             # native capability registrations
│   │   ├── mcp_harvester.py      # registers MCP tools as capabilities (Phase 4)
│   │   └── (LLM ops registered alongside)
│   └── nl/                       # Phase 1 — primary user path
│       ├── generator.py          # Generator LLM
│       ├── reviewer.py           # Review LLM (summary + flagged items)
│       ├── validator.py          # deterministic schema + resource checks
│       └── prompts.py            # system prompts for both LLMs
│
├── utils/
│   └── periodic_scheduler.py     # EXTENDED to scan automation_triggers
│
└── alembic/versions/
    └── NN_add_automation_tables.py

surfsense_web/app/(routes)/
└── automations/                  # NEW: UI
    ├── page.tsx                  # list
    ├── new/page.tsx              # NL input + draft preview (Phase 1)
    ├── [id]/page.tsx             # editor (auto-generated forms)
    └── [id]/runs/page.tsx        # run history, streamed via Electric SQL
```

---

## 12. Phased delivery

Each phase delivers something usable. Each de-risks the next. **NL authoring
is the primary user path from Phase 1** — what evolves across phases is
which actions and triggers are available, not whether users can describe
automations in natural language.

### Phase 1 — Engine MVP with NL authoring

**Step 1 (current scope, this batch of commits):**

- 3 tables (`automations`, `automation_triggers`, `automation_runs`) +
  Alembic migration
- Empty Capability, Action, Trigger registries (concrete entries land in
  later steps when the consuming feature lands)
- Pydantic schemas for the automation definition envelope, the two v1
  trigger configs (`schedule`, `manual`), and the one v1 action config
  (`agent_task`)
- Module structure under `app/automations/` (data/, schemas/,
  registries/), fully isolated from the existing codebase

**Step 2:**

- Register the `agent_task` action and the `schedule` / `manual`
  triggers in the registries
- Capability registry populated with native deliverable-producing
  capabilities (chosen when this step starts)

**Step 3:**

- Executor (single-queue Celery task) with retries, timeouts, budget
  caps measured against `cost_usd` ledger on the run
- Template engine (Jinja sandbox + the v1 filter allowlist + runtime
  limits)
- Manual "Run now" endpoint

**Step 4:**

- NL authoring flow: Generator LLM, deterministic validator, Review LLM,
  editable form
- Run history UI with Electric SQL streaming

**After Phase 1**: a user can describe an automation in natural language,
review the proposal (with summary + flagged anomalies), edit any field,
save, and watch it run on a schedule.

### Phase 2 — Webhooks and delivery

- `webhook` trigger with per-automation bearer tokens
- Tight actions: `slack_post`, `send_email`, `notification`
- `transform_data` action
- `on_failure` hooks
- Step-level retry/timeout overrides
- Concurrency policy enforcement

**After Phase 2**: external systems can drive automations, results go
somewhere humans see, complex pipelines have proper error handling.

### Phase 3 — NL authoring polish

- NL patch flow for editing existing automations (diff-based)
- Conversational refinement during proposal review ("change the schedule
  to weekdays only," "add a Slack notification on failure")
- Improved Review LLM coverage (more anomaly patterns,
  cost-relative-to-goal heuristics)
- Saved prompt templates and starter examples

**After Phase 3**: NL authoring is the polished primary surface; edit
flows are conversational rather than form-only.

### Phase 4 — Event triggers

- `domain_events` table and `events.py` module
- Indexing pipeline publishes `connector.*` events (smallest change — just
  add publish calls to the existing flow)
- Automations publish `automation.run.*` events on completion
- `event` trigger with filter grammar
- MCP capability harvester (so MCP-backed events and tools both work)

**After Phase 4**: "do X when Y happens" automations work, including
automation-chaining through events.

### Phase 5 — Wrapping existing features and sharing

- Wrap existing SurfSense capabilities as actions: `podcast_generation`,
  `report_generation`, `indexing_sweep`
- Artifact lifecycle implementation
- `expected_duration_seconds`-based queue routing (split `automations_long`
  from `automations_default`)
- **Automation templates** (shareable, exportable, importable) — with
  the import re-approval flow that handles the approver-≠-runner trust
  shift documented in §7.4's pre-Phase-5 gate
- Cross-automation composition examples in the docs

**After Phase 5**: every existing SurfSense capability is automatable
without any per-feature code, and automations can be shared between
SearchSpaces and users.

---

## 13. Decisions locked

For reference — every decision made through the design process, in one
place.

### Foundations

1. ✅ JSON Schema 2020-12 is the single schema language for everything
2. ✅ Definition is the program; infrastructure is the interpreter
3. ✅ List of steps (not single action) in the plan, with `output_as` chaining
4. ✅ One capability registry serving native + MCP + LLM operations through the same interface
5. ✅ Capability IDs do not leak handler kind (`slack.post_message`, not `mcp.slack.post_message`)
6. ✅ Name-based resolution: definitions reference actions and capabilities by string ID. The registry is the runtime's vocabulary; lookup is a dict access. No code references in definitions.
7. ✅ The expressive spectrum runs from pure direct calls to broad agent_task; the NL generator proposes the cheapest shape that meets intent (Shape 6 from §4 by default)

### Trigger taxonomy

8. ✅ Three trigger types: `schedule`, `webhook`, `event`
9. ✅ Events absorb both connector events and internal SurfSense events
10. ✅ Filter grammar is JSON-structured operators (not Jinja)

### Templating cluster

11. ✅ Jinja2 `SandboxedEnvironment` for templates and `when:` predicates — but with the explicit understanding that the sandbox is an allowlist-by-default architecture, not a denylist
12. ✅ Zero globals registered. Curated 15 filters only, each audited for safe behavior with hostile input. List grows only by reviewed addition
13. ✅ Four runtime mitigations: `StrictUndefined`, 8 KB template source cap, 100 ms render time cap (watchdog-enforced), 1 MB output size cap
14. ✅ Non-string template values render as JSON by default
15. ✅ Fixed `run.*` namespace, documented
16. ⏸ **Pre-Phase-5 gate**: template sharing across SearchSpaces breaks the approver-equals-runner trust model. Mitigation is a re-approval flow at the import boundary (UX-level), not a template-language migration. Jinja itself stays.

### Execution

17. ✅ Executor is a Celery task wrapping a sequential loop — not an agent
18. ✅ `when:` is optional per step; false = skipped (not failed)
19. ✅ No DAGs, no parallelism, no loops — composition via agent_task or events
20. ✅ `on_failure` part of execution policy from v1
21. ✅ Step-level retry and timeout overrides
22. ✅ Budget cap enforced pre-enqueue and mid-flight

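Decisions 3 and 17 to 19 describe a plain sequential loop, which can be sketched in a few lines. This is an illustration under assumptions: steps are shown as dicts with hypothetical `id` and `action` keys, `when` is a Python callable standing in for the sandboxed Jinja predicate, and retries, timeouts, and `on_failure` are left out.

```python
from typing import Any, Callable

Context = dict[str, Any]
Handler = Callable[[Context], Any]


def run_plan(
    steps: list[dict[str, Any]],
    handlers: dict[str, Handler],
    context: Context,
) -> list[dict[str, str]]:
    """Run steps in order; no branching, no parallelism, no loops over steps."""
    results = []
    for step in steps:
        when = step.get("when")
        if when is not None and not when(context):
            # A false predicate means skipped, not failed.
            results.append({"id": step["id"], "status": "skipped"})
            continue
        output = handlers[step["action"]](context)   # name-based lookup
        if "output_as" in step:
            context[step["output_as"]] = output       # visible to later steps
        results.append({"id": step["id"], "status": "succeeded"})
    return results
```

Because each step only reads and extends a shared context, chaining via `output_as` needs no graph structure.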
### Components

23. ✅ Dispatcher / executor / handlers / registry — distinct, each replaceable
24. ⏸ Side effects are a set, including `USER_VISIBLE` — **deferred** until multi-user automation RBAC ships
25. ⏸ `expected_duration_seconds` integer drives queue routing — **deferred** until a second Celery queue is needed
26. ⏸ `produces_artifacts` is a list of `ArtifactSpec`, not a bool — **deferred** until artifacts beyond the deliverable handlers' own persistence are needed
27. ✅ Output schemas recommended on `agent_task`; editor warns when missing

### Event bus

28. ✅ `domain_events` table for v1, with upgrade path to Redis Streams
29. ✅ Automations publish run events for composability
30. ✅ Publish/subscribe behind interface — no direct table access elsewhere

### Capability storage

31. ✅ Native capabilities registered in-memory at startup from the codebase. Identical across all workers.
32. ⏸ MCP capability metadata persisted in `mcp_connections` and `mcp_tools` tables — **deferred to Phase 4**
33. ⏸ MCP handler closures built lazily per worker from database state — **deferred to Phase 4**
34. ⏸ MCP server tool list re-harvested on a schedule — **deferred to Phase 4**
35. ⏸ MCP tools harvested into the capability registry at connection time — **deferred to Phase 4**
36. ⏸ Side effects inferred from MCP hints + naming + admin overrides — **deferred to Phase 4**
37. ⏸ MCP tools callable directly (no agent required) when caller knows args — **deferred to Phase 4**

### Credentials — all deferred to Phase 2

38. ⏸ Credentials never appear in the automation definition — only connection IDs do — **Phase 2**
39. ⏸ Credentials never appear in the LLM's context — the host holds them — **Phase 2**
40. ⏸ Credentials resolved per-call by `ActionContext`, not pre-loaded into worker environment — **Phase 2**
41. ⏸ Tokens encrypted at rest; refresh handled automatically by `ActionContext.resolve_*_client` — **Phase 2**

### v1-minimum (new lock)

v1. ✅ `Capability` is exactly five fields: `id`, `description`, `input_schema`, `output_schema`, `handler`. Additional fields are added only when a concrete consumer feature requires them.
v2. ✅ Cost is **measured** from a per-run ledger, not declared. Pre-flight cost checks return when the ledger has enough history.
v3. ✅ Single `automations_default` Celery queue in v1. Multi-queue routing returns when load justifies it.

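The five-field `Capability` and the name-based lookup from decision 6 fit in a short sketch. The field names come from the lock above; the `CapabilityRegistry` class, its method names, and the duplicate-ID check are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Capability:
    id: str                          # e.g. "slack.post_message"
    description: str
    input_schema: dict[str, Any]     # JSON Schema for the handler's input
    output_schema: dict[str, Any]    # JSON Schema for the handler's output
    handler: Callable[..., Any]


class CapabilityRegistry:
    """Starts empty; concrete entries are registered at startup."""

    def __init__(self) -> None:
        self._by_id: dict[str, Capability] = {}

    def register(self, capability: Capability) -> None:
        if capability.id in self._by_id:
            raise ValueError(f"duplicate capability id: {capability.id}")
        self._by_id[capability.id] = capability

    def get(self, capability_id: str) -> Capability:
        # Lookup is a dict access; unknown IDs raise KeyError.
        return self._by_id[capability_id]

    def __contains__(self, capability_id: str) -> bool:
        return capability_id in self._by_id
```

Definitions only ever carry the string ID, so swapping a handler implementation does not touch stored automations.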
### NL authoring

42. ✅ LLM-authored templates are the primary path from day one — not a Phase 3 addition. Hand-authoring JSON is supported but secondary
43. ✅ Generator LLM produces JSON; deterministic schema + resource validation runs before user sees the proposal
44. ✅ Review LLM produces plain-language summary + flagged anomalies for the user — UX layer, not a security boundary
45. ✅ Generator LLM's input is scoped (user prompt + schema + registry only); arbitrary document content is not fed in
46. ✅ Human approval is required before save — no auto-approval option, ever
47. ✅ Every field editable in the proposal; unresolved questions surface as clarifications
48. ✅ NL drafts are transient storage, not a core table

### Data model

49. ✅ Six tables in the full design (four for engine state, two for MCP persistence); v1 ships three of them (see §9)
50. ✅ Run rows snapshot the definition (immutable history)
51. ✅ All entities scoped by `search_space_id` for RBAC
52. ✅ Editing an automation bumps `version`; existing runs unaffected

---

## 14. Open questions deferred to implementation

None of these block design; they're decisions a developer will make in
context, with the principle from §1 as their guide.

- Exact retry backoff formulas (multipliers, jitter, ceilings)
- Webhook signature verification standards (HMAC scheme, header naming)
- Whether to support inline JSON Schema `$ref` to external schemas, or
  inline everything
- Specific CDN/storage backend choices for artifacts (probably
  whatever SurfSense already uses for podcasts)
- Rate limits per SearchSpace and per user
- Audit log retention policy

---

## 15. Why this is ready to build

This document satisfies five tests:

1. **The four worked examples** (digest, CI webhook, file-added-trigger,
   weekly podcast) all express cleanly in the contract without special
   cases. Each one was used to find gaps before the gaps reached code.

2. **The audit pass identified six refinements**, all incorporated. No
   pending audit items.

3. **Every decision points back to the principle from §1.** When a future
   feature request lands, "does it belong in the definition or in the
   engine?" gives a clear answer.

4. **The build is staged** so Phase 1 ships in weeks, not months, and
   each subsequent phase delivers user value while de-risking the next.

5. **Existing SurfSense infrastructure is reused**, not paralleled. Celery
   Beat, PostgreSQL/JSONB, Electric SQL, SQLAlchemy/Alembic, the existing
   `tools/registry.py` pattern, the existing Search Space scoping — all
   continue to do what they already do. The automation engine is a new
   directory, not a new system.

The next document a developer needs is the Pydantic models and JSON
Schemas spelled out concretely. Those follow mechanically from this plan.

---

*Sources consulted: Claude Code Routines documentation;
NousResearch/hermes-agent (cron and skills subsystems); n8n documentation
on node types and workflow data model; the SurfSense repository and
DeepWiki architecture notes (FastAPI + Celery Beat + Electric SQL +
LangGraph Deep Agents + Search Space RBAC); Model Context Protocol
specification for capability harvesting; AWS EventBridge for filter
grammar; workflow-pattern literature (van der Aalst et al.) for the
trigger / action / concurrency vocabulary.*