Extracts finalize_assistant_message: the post-stream server-side write
of the final assistant message (with content parts + token usage)
guarded by asyncio.shield + shielded_async_session so a client
disconnect cannot abort the persist.
Add-only; legacy stream_new_chat.py keeps its inline finalize block
until cutover.
Two cooperating modules that wrap stream_agent_events with in-stream
recovery from provider 429s:
* rate_limit_recovery: can_recover_provider_rate_limit truth-table
guard, reroute_to_next_auto_pin (selects the next eligible auto-pin
config and reloads the LLM bundle), log_rate_limit_recovered.
* stream_loop: run_stream_loop drives stream_agent_events in a
while-True loop, delegating recovery to a flow-supplied RecoverFn
callback so new_chat and resume_chat can share the same loop while
keeping their own nonlocal state.
Add-only; not yet wired into any orchestrator.
Extracts handle_terminal_exception: the shared except-branch behavior for
the chat orchestrators. Classifies the raised exception, logs the
structured chat_stream error event, and emits the terminal-error SSE
frame + done sentinel via the streaming service.
Add-only; nothing imports it yet.
Centralizes the premium-credits lifecycle for chat turns:
* needs_premium_quota: gate check (premium user + non-fallback config).
* PremiumReservation: dataclass capturing reservation state + token totals.
* reserve_premium / finalize_premium / release_premium: idempotent
reservation, commit, and rollback used by the orchestrators.
Add-only; legacy stream_new_chat.py keeps its inline quota handling
until cutover.
Six small, single-purpose modules shared by the upcoming new_chat and
resume_chat orchestrators:
* llm_bundle: dispatches negative config_id to the YAML loader and
non-negative config_id to the DB loader, returning (llm, AgentConfig).
* pre_stream_setup: builds the connector service, resolves the
Firecrawl API key, and returns the chat checkpointer.
* first_frames: iter_initial_frames + iter_final_frames emit the canonical
message-start / step-start / idle / finish / done SSE envelope.
* finalize_emit: iter_token_usage_frame emits the per-turn usage frame
from a TokenAccumulator summary.
* finally_cleanup: close_session_and_clear_ai_responding and run_gc_pass
centralize the finally-block bookkeeping.
* span: open_chat_request_span / set_agent_mode / close_chat_request_span /
record_outcome_attrs wrap the OpenTelemetry chat_request span.
Add-only; these are not yet wired into stream_new_chat.py.
Extracts the inner agent-streaming driver previously inlined as
_stream_agent_events in stream_new_chat.py.
stream_agent_events drives graph_stream.event_stream.stream_output and,
after the agent finishes, performs the post-stream safety-net work:
* commit any pending content the agent never explicitly finished
* evaluate file-operation contract outcomes and emit the appropriate
contract verdict for desktop_local_folder turns
This unit is what flows/shared/stream_loop.py wraps in the rate-limit
recovery while-loop. Add-only; no existing wiring uses it yet.
Extracts the agent-construction wrapper that the chat streamers call to
materialize the LangGraph agent for a given thread. Centralizes how we
pass the agent factory plus checkpointer, runtime context, and the
in-memory content builder.
Add-only; pre-existing inline equivalent in stream_new_chat.py stays
until cutover.
Extracts the desktop_local_folder file-operation contract helpers:
* contract_enforcement_active: gates the contract on filesystem mode.
* evaluate_file_contract_outcome: scores tool outputs as success/no-op.
* log_file_contract: structured logging of contract verdicts.
This is the unit responsible for catching agents that claim to have
written/edited a file without actually invoking the filesystem tool.
Add-only; stream_new_chat.py keeps its inline duplicates until cutover.
Extracts two pure context helpers used during input-state assembly:
* mentioned_docs.format_mentioned_surfsense_docs_as_context: renders the
user's @-mentioned SurfSense docs into the LLM context block.
* deepagents_todos.extract_todos_from_deepagents: pulls the in-progress
todo list from a deep-agents state snapshot for the title generator.
Add-only; existing call sites in stream_new_chat.py remain untouched
until cutover.
Foundation layer for the parallel refactor of stream_new_chat.py.
Extracts the StreamResult dataclass (tracks per-turn streaming state)
and a small set of shared utilities (resume_step_prefix, safe_float).
Add-only; no existing code imports from this package yet. Existing
stream_new_chat.py keeps its inline equivalents until cutover.
The citations fix (cacb27e0) added a "Chunk citations in your prose"
section to system_prompt_desktop.md telling the KB subagent to always
leave `evidence.chunk_ids` null and emit no `[citation:...]` markers in
desktop mode, but left the pre-existing line declaring that
`chunk_ids` apply to `<priority_documents>` hits. The two rules
contradicted each other; the model picked one per turn.
Strike the stale conditional clause and point at the dedicated section
as the single source of truth. Matches the parallel line in
system_prompt_cloud.md and the already-consistent
system_prompt_readonly_desktop.md.
Resolves: surfsense_backend/app/agents/new_chat/middleware/memory_injection.py
- Took both imports: upstream moved MEMORY_HARD_LIMIT/SOFT_LIMIT to
app.services.memory; kept our perf-logger import for timing.
Pulls in upstream changes:
- Memory document feature (services/memory refactor, removal of
app.agents.new_chat.memory_extraction and background extraction in
stream_new_chat — agent now drives memory via update_memory tool).
- BACKEND_URL env refactor across web tool-ui/editor/chat/dashboard/lib.
- GitHub Actions backend test workflow + pre-commit biome bump.
- Token-display polish in MessageInfoDropdown; save_memory no-update
sentinel.
Verified: 1723 unit tests pass, ruff clean. No semantic regression in
stream_new_chat (their memory-extraction deletion and our preflight
removal touch different functions).
Collapse the invalidate + warmup pair into a single
refresh_mcp_tools_cache_for_connector(connector_id, search_space_id)
helper and scope live discovery to the one connector that changed
instead of the whole search space.
- new mcp_tool.discover_single_mcp_connector: load one connector,
refresh OAuth if needed, force live MCP discovery so its cached_tools
row is rewritten; returned wrappers are discarded since the in-process
LRU is rebuilt lazily on the next user query
- mcp_tools_cache.refresh_mcp_tools_cache_for_connector: synchronously
evicts the per-space LRU (LRU keys cannot scope finer) and schedules
the per-connector prefetch via loop.create_task
- routes (OAuth callback, MCP POST, MCP PUT) collapse their two
back-to-back calls into a single refresh call; DELETE handlers keep
using bare invalidate_mcp_tools_cache (nothing to prefetch)
No new automated tests: the new functions are I/O glue (DB + network)
where mocked unit tests would test implementation rather than behavior.
The existing 9 unit tests for the cached_tools data shape are unchanged.
Skip the ~1-3s MCP initialize + list_tools handshake on every cache miss
by reading tool definitions from the connector row we already load. Lazy
populate on first miss, self-heal on corrupt cache, zero schema migration.
Splits the OpenAI-family gate into per-param predicates so AZURE and
AZURE_OPENAI configs now receive prompt_cache_key for backend routing
affinity (Microsoft auto-caches GPT-4o+ deployments at >=1024 tokens;
the key clusters same-prefix requests on the same GPU pool and raises
hit rate on turn 2+). prompt_cache_retention stays opted out for Azure
because litellm 1.83.14's Azure transformer would drop it silently;
revisit when Azure's supported params list is updated.
Adds an optional planner LLM role wired through KnowledgePriorityMiddleware
so KB query rewriting, date extraction, and recency classification run on a
cheap model (e.g. gpt-4o-mini, Haiku, Azure nano) instead of the user's
chat LLM. Operators opt in by setting is_planner: true on exactly one
global config; without it, behavior is unchanged.
The preflight pattern probed the LLM with a 1-token ping before each
cold turn (when requested_llm_config_id==0, llm_config_id<0, and the
45s healthy TTL had expired) to detect 429s before fanning out into
planner/classifier/title-gen. To absorb its ~1-5s RTT cost we built the
agent speculatively in parallel; on 429 we discarded the build and
repinned.
Three problems with that design:
1. False security. Provider rate limits are token-bucket. A 1-token
ping consumes ~5 tokens; the real request consumes 10-50K. The
probe can return 200 while the real call still 429s.
2. Pure overhead in the common case. On warm-agent-cache turns the
probe dominates wall time: ~2.5s of TTFT pure tax for ~99% of users
who never see a 429.
3. The in-stream recovery loop (catch of _is_provider_rate_limited
gated by not _first_event_logged) already does the right thing
reactively: mark_runtime_cooldown -> resolve_or_get_pinned_llm_config_id
with exclude_config_ids={previous} -> rebuild agent -> retry the
stream. Preflight was never the only safety net; it was a redundant
probe in front of one.
Changes:
- Delete _preflight_llm, _settle_speculative_agent_build, and the
_PREFLIGHT_TIMEOUT_SEC / _PREFLIGHT_MAX_TOKENS constants.
- Drop the parallel agent_build_task / preflight_task plumbing in
both stream_new_chat and stream_resume_chat; build the agent inline
with await _build_main_agent_for_thread(...).
- Drop the unused is_recently_healthy / mark_healthy imports here
(still exported from auto_model_pin_service since OpenRouter
catalogue refresh and a few tests reference clear_healthy).
- Remove the obsolete preflight + settle-speculative tests from
test_stream_new_chat_contract.py.
Net: -447 LOC. ~2.5s removed from TTFT on every cold preflight-eligible
turn. 429 recovery path is unchanged - same repin/rebuild/retry, just
not paid in advance on the healthy path.