Collapse the invalidate + warmup pair into a single
refresh_mcp_tools_cache_for_connector(connector_id, search_space_id)
helper and scope live discovery to the one connector that changed
instead of the whole search space.
- new mcp_tool.discover_single_mcp_connector: load one connector,
refresh OAuth if needed, force live MCP discovery so its cached_tools
row is rewritten; returned wrappers are discarded since the in-process
LRU is rebuilt lazily on the next user query
- mcp_tools_cache.refresh_mcp_tools_cache_for_connector: synchronously
evicts the per-space LRU (LRU keys cannot scope finer) and schedules
the per-connector prefetch via loop.create_task
- routes (OAuth callback, MCP POST, MCP PUT) collapse their two
back-to-back calls into a single refresh call; DELETE handlers keep
using bare invalidate_mcp_tools_cache (nothing to prefetch)
No new automated tests: the new functions are I/O glue (DB + network)
where mocked unit tests would test implementation rather than behavior.
The existing 9 unit tests for the cached_tools data shape are unchanged.
Skip the ~1-3s MCP initialize + list_tools handshake on every cache miss
by reading tool definitions from the connector row we already load. Lazy
populate on first miss, self-heal on corrupt cache, zero schema migration.
Splits the OpenAI-family gate into per-param predicates so AZURE and
AZURE_OPENAI configs now receive prompt_cache_key for backend routing
affinity (Microsoft auto-caches GPT-4o+ deployments at >=1024 tokens;
the key clusters same-prefix requests on the same GPU pool and raises
hit rate on turn 2+). prompt_cache_retention stays opted out for Azure
because litellm 1.83.14's Azure transformer would drop it silently;
revisit when Azure's supported params list is updated.
Adds an optional planner LLM role wired through KnowledgePriorityMiddleware
so KB query rewriting, date extraction, and recency classification run on a
cheap model (e.g. gpt-4o-mini, Haiku, Azure nano) instead of the user's
chat LLM. Operators opt in by setting is_planner: true on exactly one
global config; without it, behavior is unchanged.
The preflight pattern probed the LLM with a 1-token ping before each
cold turn (when requested_llm_config_id==0, llm_config_id<0, and the
45s healthy TTL had expired) to detect 429s before fanning out into
planner/classifier/title-gen. To absorb its ~1-5s RTT cost we built the
agent speculatively in parallel; on 429 we discarded the build and
repinned.
Three problems with that design:
1. False security. Provider rate limits are token-bucket. A 1-token
ping consumes ~5 tokens; the real request consumes 10-50K. The
probe can return 200 while the real call still 429s.
2. Pure overhead in the common case. On warm-agent-cache turns the
probe dominates wall time: ~2.5s of TTFT pure tax for ~99% of users
who never see a 429.
3. The in-stream recovery loop (catch of _is_provider_rate_limited
gated by not _first_event_logged) already does the right thing
reactively: mark_runtime_cooldown -> resolve_or_get_pinned_llm_config_id
with exclude_config_ids={previous} -> rebuild agent -> retry the
stream. Preflight was never the only safety net; it was a redundant
probe in front of one.
Changes:
- Delete _preflight_llm, _settle_speculative_agent_build, and the
_PREFLIGHT_TIMEOUT_SEC / _PREFLIGHT_MAX_TOKENS constants.
- Drop the parallel agent_build_task / preflight_task plumbing in
both stream_new_chat and stream_resume_chat; build the agent inline
with await _build_main_agent_for_thread(...).
- Drop the unused is_recently_healthy / mark_healthy imports here
(still exported from auto_model_pin_service since OpenRouter
catalogue refresh and a few tests reference clear_healthy).
- Remove the obsolete preflight + settle-speculative tests from
test_stream_new_chat_contract.py.
Net: -447 LOC. ~2.5s removed from TTFT on every cold preflight-eligible
turn. 429 recovery path is unchanged - same repin/rebuild/retry, just
not paid in advance on the healthy path.
Connector kb_sync_services (gmail, onedrive, google_calendar, jira),
streaming indexers (discord, luma, teams) and the file-processor save
path all called embed_text inside async coroutines, blocking the
background worker's event loop for the duration of the embed. Wrap each
call site in asyncio.to_thread so concurrent indexing tasks stop
serialising on the embed.
_restore_in_place_document and _reinsert_document_from_revision are
async helpers invoked by the synchronous-feeling POST /api/threads/.../revert
route; both ran embed_texts inline, blocking the event loop while the
HTTP client waited.
generate_document_summary and create_document_chunks are async helpers
called from the chat path and from many connector indexers. Both wrapped
embed_text/embed_texts directly inside the coroutine, blocking the event
loop for the full duration of the embedding call.
_create_document and _update_document run on the chat critical path
when the filesystem subagent writes via the user's chat turn. Both
called embed_texts synchronously inside an async coroutine, blocking
the event loop for the duration of the embed.
embed_texts holds a threading.Lock and runs a sync embedding call inside
search_knowledge_base, an async coroutine on the KB priority middleware
critical path. Blocking the event loop here stalls every other coroutine
on the worker (SSE keepalives, concurrent chat requests, background
tasks). Wrap in asyncio.to_thread so the embed runs on the default
executor pool while the loop keeps serving.
LiteLLM normalizes every provider's cache fields onto
usage.prompt_tokens_details (cached_tokens + cache_creation_tokens).
The earlier fallback to usage.cache_read_input_tokens /
usage.cache_creation_input_tokens was wrong: Anthropic-shaped fields
only live there via a trailing setattr loop, and the canonical field
name on the wrapper is cache_creation_tokens (not _input_tokens).
Updated the test for the indexing pipeline to verify that both the standard and hybrid chunkers are called via asyncio.to_thread, ensuring non-blocking behavior. This change reflects the routing of non-code documents through the hybrid chunker, maintaining the event loop contract.
ResumeDecision is the Pydantic gate at the /resume HTTP route. It was
the last spot still rejecting the new wire decision-type, so the FE's
'approve_always' dispatch was being 422'd before it could reach the
permission middleware that already speaks it.
Only MCP tools have a persistence target for 'approve_always' (the
connector's trusted-tools list); for native tools the decision lives
only in the in-memory runtime ruleset. Reflect that in the wire palette
so the FE can stay a pure renderer of allowed_decisions instead of
peeking at context.mcp_connector_id to decide whether to show the
'Always Allow' button.
The backend still accepts an 'approve_always' reply for any tool kind
(in-memory promotion is harmless), it just doesn't advertise it when
there's nowhere to persist.