Top-level tests that span multiple submodules:
- test_stores.py (7): the trigger + action registry contracts — register
round-trip, unknown type → None (not raise), duplicate registration
rejected, defensive snapshot from all_*.
- test_definition_types.py (2): params_schema property on both
ActionDefinition and TriggerDefinition reflects the Pydantic model.
- test_persistence_enums.py (3): exact string values + member sets of
AutomationStatus / RunStatus / TriggerType — the postgres-mirrored
contract that breaks stored rows if drifted.
- test_import_registrations.py (2): the bundled agent_task action and
schedule trigger self-register on package import (canary for the
side-effect import chain).
conftest.py adds isolated_action_registry / isolated_trigger_registry
fixtures: snapshot + restore of the module-level _REGISTRY dicts so
tests that add their own definitions don't leak across the suite.
14 tests, pure unit.
auto_decide.build_auto_decisions (3): produces one decision per
action_request entry, defaults to one decision for legacy scalar
interrupts, and skips malformed interrupts silently so a misbehaving
tool can't take down the whole agent_task step.
finalize.extract_final_assistant_message (4): string-content AIMessage
returned verbatim, list-of-parts content concatenated (skipping
non-text parts like tool_use), walks back past trailing ToolMessages
to find the last AIMessage, and returns None when no extractable text
is present (so callers can branch on silence vs. empty).
7 tests, pure unit.
render.py (4): variable substitution, StrictUndefined raises on missing
keys, evaluate_predicate coerces to bool, render_value walks dicts/lists
and renders string leaves.
filters.py (4): slugify produces URL-safe output, date formats datetime
with strftime, date(None) → "" so templates can write
{{ inputs.last_fired_at | date }} on first run, date(str) passes through.
environment.py (4): the sandbox boundary — disallowed Jinja built-ins
(e.g. pprint) raise, and the finalize hook coerces non-string outputs
to predictable wire shapes (datetime → ISO, None → "", dict → JSON).
context.py (1): build_run_context exposes {run, inputs, steps} with the
exact shape every plan template body relies on.
13 tests total, all pure unit.
execute_step (6 tests): happy path, when=falsy → skipped, unknown action
→ ActionNotFound failure, retry budget exhaustion (attempts = 1 +
max_retries), retry recovery, and template-rendering of step params
against the run context.
with_retries (3 tests): first-try success returns attempts=1, recovery
returns the actual attempt that produced the result, and exhaustion
re-raises the last exception with the handler called 1 + max_retries
times.
All tests use backoff="none" to keep wall-clock time zero; timeout
testing is intentionally skipped (would need >= 1s per the int contract,
and exhaustion already locks that any Exception triggers retry).
Cover the input-validation contract dispatch_run relies on:
- no declared schema → inputs pass through unchanged (regression site
that previously stripped runtime keys like fired_at / last_fired_at
and broke Jinja templates).
- declared schema, valid inputs → passthrough validated.
- declared schema, invalid inputs → DispatchError (uniform exception
type, not raw jsonschema.ValidationError).
Plus the DispatchError exception identity (Exception subclass, message
preserved, isinstance-friendly for the dispatch layer's consumers).
4 tests, pure unit.
Cover the cron + IANA timezone + UTC normalization contract for the
schedule trigger: next-match strictly-after, DST offset shift across
spring-forward, malformed cron / unknown timezone rejection, and the
ScheduleTriggerParams Pydantic gate that surfaces InvalidCronError as
ValidationError at the API boundary.
8 tests, pure unit (no DB, no mocks).
Adds 34 tests under tests/unit/tasks/chat/streaming/ that cover the
new flows tree against the legacy stream_new_chat.py module to gate
the upcoming cutover. Coverage:
* Public entry points: stream_new_chat and stream_resume_chat are
async generator functions whose parameter signatures (name, kind,
annotation, default) match the legacy versions one-for-one. Uses a
normalized-annotation comparison so PEP-563 vs eager-annotation
representation differences are tolerated.
* Extracted helpers: image-capability gate, runtime-context builders
for new-chat and resume-chat, LLM-bundle dispatcher, premium-quota
needs check + reservation dataclass, rate-limit recovery truth
table, persistence-spawn registration/self-unregistration, await
helpers.
* SSE frame iterators: iter_initial_frames + iter_final_frames emit
the canonical sequence; iter_token_usage_frame skips on None.
* Initial thinking step: 4 parametrized branches (text, image-only,
empty, mentioned-docs), long-query truncation, many-docs collapse.
These tests are scaffolding for the cutover and will be removed once
the legacy module is deleted.
Resolves: surfsense_backend/app/agents/new_chat/middleware/memory_injection.py
- Took both imports: upstream moved MEMORY_HARD_LIMIT/SOFT_LIMIT to
app.services.memory; kept our perf-logger import for timing.
Pulls in upstream changes:
- Memory document feature (services/memory refactor, removal of
app.agents.new_chat.memory_extraction and background extraction in
stream_new_chat — agent now drives memory via update_memory tool).
- BACKEND_URL env refactor across web tool-ui/editor/chat/dashboard/lib.
- GitHub Actions backend test workflow + pre-commit biome bump.
- Token-display polish in MessageInfoDropdown; save_memory no-update
sentinel.
Verified: 1723 unit tests pass, ruff clean. No semantic regression in
stream_new_chat (their memory-extraction deletion and our preflight
removal touch different functions).
Skip the ~1-3s MCP initialize + list_tools handshake on every cache miss
by reading tool definitions from the connector row we already load. Lazy
populate on first miss, self-heal on corrupt cache, zero schema migration.
Splits the OpenAI-family gate into per-param predicates so AZURE and
AZURE_OPENAI configs now receive prompt_cache_key for backend routing
affinity (Microsoft auto-caches GPT-4o+ deployments at >=1024 tokens;
the key clusters same-prefix requests on the same GPU pool and raises
hit rate on turn 2+). prompt_cache_retention stays opted out for Azure
because litellm 1.83.14's Azure transformer would drop it silently;
revisit when Azure's supported params list is updated.
The preflight pattern probed the LLM with a 1-token ping before each
cold turn (when requested_llm_config_id==0, llm_config_id<0, and the
45s healthy TTL had expired) to detect 429s before fanning out into
planner/classifier/title-gen. To absorb its ~1-5s RTT cost we built the
agent speculatively in parallel; on 429 we discarded the build and
repinned.
Three problems with that design:
1. False security. Provider rate limits are token-bucket. A 1-token
ping consumes ~5 tokens; the real request consumes 10-50K. The
probe can return 200 while the real call still 429s.
2. Pure overhead in the common case. On warm-agent-cache turns the
probe dominates wall time: ~2.5s of TTFT pure tax for ~99% of users
who never see a 429.
3. The in-stream recovery loop (catch of _is_provider_rate_limited
gated by not _first_event_logged) already does the right thing
reactively: mark_runtime_cooldown -> resolve_or_get_pinned_llm_config_id
with exclude_config_ids={previous} -> rebuild agent -> retry the
stream. Preflight was never the only safety net; it was a redundant
probe in front of one.
Changes:
- Delete _preflight_llm, _settle_speculative_agent_build, and the
_PREFLIGHT_TIMEOUT_SEC / _PREFLIGHT_MAX_TOKENS constants.
- Drop the parallel agent_build_task / preflight_task plumbing in
both stream_new_chat and stream_resume_chat; build the agent inline
with await _build_main_agent_for_thread(...).
- Drop the unused is_recently_healthy / mark_healthy imports here
(still exported from auto_model_pin_service since OpenRouter
catalogue refresh and a few tests reference clear_healthy).
- Remove the obsolete preflight + settle-speculative tests from
test_stream_new_chat_contract.py.
Net: -447 LOC. ~2.5s removed from TTFT on every cold preflight-eligible
turn. 429 recovery path is unchanged - same repin/rebuild/retry, just
not paid in advance on the healthy path.
Updated the test for the indexing pipeline to verify that both the standard and hybrid chunkers are called via asyncio.to_thread, ensuring non-blocking behavior. This change reflects the routing of non-code documents through the hybrid chunker, maintaining the event loop contract.
Only MCP tools have a persistence target for 'approve_always' (the
connector's trusted-tools list); for native tools the decision lives
only in the in-memory runtime ruleset. Reflect that in the wire palette
so the FE can stay a pure renderer of allowed_decisions instead of
peeking at context.mcp_connector_id to decide whether to show the
'Always Allow' button.
The backend still accepts an 'approve_always' reply for any tool kind
(in-memory promotion is harmless), it just doesn't advertise it when
there's nowhere to persist.
Renames the SurfSense HITL extension decision-type from "always" to
"approve_always" so it sits in the same verb-first family as "approve",
"reject", and "edit". The Python constant is now SURFSENSE_DECISION_APPROVE_ALWAYS;
the wire value, the permission-domain decision_type, and the FE union members
all match (no wire/internal mismatch).
Both the multi_agent_chat permission middleware and the legacy new_chat one
accept the new wire value; the FE types.ts union is updated accordingly.
The "context.always" payload key is intentionally left untouched - it's the
patterns-to-promote field, semantically distinct from the decision type.
Until now an "Always Allow" reply only updated the in-memory runtime
ruleset, evaporating after the session ended. Persist it to the
existing connector.config['trusted_tools'] list so the next session's
fetch_user_allowlist_rulesets picks it up and the user is never asked
again for the same (connector, tool) pair.
- TrustedToolSaver + make_trusted_tool_saver(user_id) in
user_tool_allowlist: opens its own session via async_session_maker
per call, logs and swallows failures (in-memory promotion is the
canonical "always" path, durable persistence is opportunistic).
- PermissionMiddleware._process is now pure: returns
(state_update, list[_AlwaysPromotion]). aafter_model awaits the
saver for each promotion; after_model discards them. Promotions are
only emitted for tools whose metadata exposes mcp_connector_id, so
native tools and KB FS ops are correctly skipped.
- main_agent factory builds the saver once per turn and stashes it in
dependencies["trusted_tool_saver"]; pack_subagent and the KB
middleware stack forward it through build_permission_mw.
- Renamed pm._process(state, None) call sites in two existing tests to
pm.after_model(state, None) so they exercise the public hook
contract instead of the now-tuple-returning private method.
The FE permission card needs mcp_connector_id, mcp_server, and
tool_description in the interrupt context to render "Always Allow"
against the right connected account. Thread the tool through the
ask pipeline:
- pack_subagent → build_permission_mw(tools=...) → PermissionMiddleware
(tools_by_name) → request_permission_decision(tool=...) →
build_permission_ask_payload(tool=...) projects card fields out of
BaseTool.
- mcp_tool.py: stdio path now stashes mcp_connector_id in metadata for
parity with the HTTP path.