feat(story-3.5): add cloud-mode LLM model selection with token quota enforcement

Implement system-managed model catalog, subscription tier enforcement, atomic token quota tracking, and frontend cloud/self-hosted conditional rendering. Apply all 20 BMAD code review patches including security fixes (cross-user API key hijack), race condition mitigation (atomic SQL UPDATE), and SSE mid-stream quota error handling. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
2026-04-27 01:36:30 +02:00 · 2026-04-14 17:01:21 +07:00 · 2026-04-14 17:01:21 +07:00 · c1776b3ec8
commit c1776b3ec8
parent e7382b26de
19 changed files with 1003 additions and 34 deletions
--- a/_bmad-output/implementation-artifacts/3-5-model-selection-via-quota.md
+++ b/_bmad-output/implementation-artifacts/3-5-model-selection-via-quota.md
@ -1,6 +1,6 @@
 # Story 3.5: Lựa chọn Mô hình LLM dựa trên Subscription (Model Selection via Quota)

-Status: ready-for-dev
+Status: in-progress

 ## Story

@ -46,13 +46,13 @@ so that tôi có thể dùng trực tiếp và chi phí sử dụng được tr
  - [ ] Subtask 4.1: Trước khi gọi LLM trong SSE stream, check `tokens_used_this_month < monthly_token_limit`. Nếu vượt → raise HTTPException 402 "Token quota exceeded. Upgrade your plan."
  - [ ] Subtask 4.2: (Optional) Ước tính input tokens trước khi gọi để pre-check.

- [ ] Task 5: Frontend — System Model Selector (thay thế BYOK)
-  - [ ] Subtask 5.1: Tạo component `SystemModelSelector` — fetch `GET /api/v1/models`, hiển thị dropdown với model name + cost indicator.
-  - [ ] Subtask 5.2: Conditional rendering: nếu `NEXT_PUBLIC_DEPLOYMENT_MODE=hosted` → dùng `SystemModelSelector`; nếu `self-hosted` → giữ BYOK hiện tại.
+- [x] Task 5: Frontend — System Model Selector (thay thế BYOK)
+  - [x] Subtask 5.1: Tạo component `SystemModelSelector` — fetch `GET /api/v1/models/system`, hiển thị dropdown với model name + tier badge.
+  - [x] Subtask 5.2: Conditional rendering: nếu `NEXT_PUBLIC_DEPLOYMENT_MODE=cloud` → dùng `SystemModelSelector`; nếu `self-hosted` → giữ BYOK hiện tại.
  - [ ] Subtask 5.3: Ẩn/disable trang `llm-configs` (nhập API key) khi ở hosted mode.

- [ ] Task 6: Frontend — Upgrade Prompt khi hết quota
-  - [ ] Subtask 6.1: Bắt lỗi 402 từ SSE stream, hiển thị modal "Bạn đã hết token quota. Nâng cấp gói tại /pricing".
+- [x] Task 6: Frontend — Upgrade Prompt khi hết quota
+  - [x] Subtask 6.1: Bắt lỗi 402 từ SSE stream, hiển thị toast "Monthly token quota exceeded" với action button "Upgrade" → `/pricing`.

 ## Dev Notes

@ -78,3 +78,35 @@ Tham khảo `PageLimitService` (`surfsense_backend/app/services/page_limit_servi
 - `surfsense_web/components/new-chat/model-selector.tsx` — BYOK (cần thay)
 - `surfsense_backend/app/services/page_limit_service.py` — pattern tham khảo
 - Endpoint SSE hiện tại: `/api/v1/chat/stream`
+
+### Review Findings
+_Code review 2026-04-14 — Blind Hunter + Edge Case Hunter + Acceptance Auditor_
+
+#### Decision Needed
+- [x] [Review][Decision→Patch] **Backend tier enforcement at chat time** — RESOLVED: Enforce ngay trong Story 3.5. Thêm tier check vào chat endpoint trước khi gọi LLM. [model_list_routes.py, new_chat_routes.py]
+
+#### Patch
+- [x] [Review][Patch] **Alembic migration absent for 7 new User columns + SubscriptionStatus enum** — DB columns (monthly_token_limit, tokens_used_this_month, token_reset_date, subscription_status, plan_id, stripe_customer_id, stripe_subscription_id) added to model but no migration file. SubscriptionStatus enum has `create_type=False` but PG type never created. [surfsense_backend/app/db.py]
+- [x] [Review][Patch] **Race condition: token quota uses ORM read-modify-write instead of atomic SQL UPDATE** — Spec explicitly requires `UPDATE ... SET tokens_used = tokens_used + cost` pattern. Current code reads user, adds in Python, writes back — concurrent tabs can overspend. [surfsense_backend/app/services/token_quota_service.py:update_token_usage]
+- [x] [Review][Patch] **Security: model_id > 0 allows cross-user BYOK config hijack** — Cloud mode accepts any positive integer as model_id, which maps to user-created NewLLMConfig records. Attacker can use another user's API key. Must validate model_id ≤ 0 (system models) in cloud mode. [surfsense_backend/app/routes/new_chat_routes.py]
+- [x] [Review][Patch] **stream_resume_chat never deducts tokens from quota** — Token counting + deduction logic only in stream_new_chat. Resume path skips quota update entirely — violates AC 3. [surfsense_backend/app/tasks/chat/stream_new_chat.py:stream_resume_chat]
+- [x] [Review][Patch] **Frontend handleResume doesn't send model_id** — onNew and handleRegenerate inject selectedSystemModelId but handleResume omits it. Backend schema already supports it. [surfsense_web/app/dashboard/[search_space_id]/chat/[chat_session_id]/page.tsx:handleResume]
+- [x] [Review][Patch] **systemModelsAtom fetches unconditionally in self-hosted mode** — atomWithQuery fires on mount regardless of deployment mode. Wastes network call + may 404. Add isCloud() guard. [surfsense_web/atoms/new-llm-config/system-models-query.atoms.ts]
+- [x] [Review][Patch] **_maybe_reset_monthly_tokens double-commit fragility** — Method calls session.commit() then caller also commits → potential MissingGreenlet in async context. Should let caller manage transaction boundary. [surfsense_backend/app/services/token_quota_service.py:_maybe_reset_monthly_tokens]
+- [x] [Review][Patch] **get_token_usage skips monthly reset check** — Doesn't call _maybe_reset_monthly_tokens, so stale tokens_used may be returned after month rollover. [surfsense_backend/app/services/token_quota_service.py:get_token_usage]
+- [x] [Review][Patch] **Token accumulation ignores None usage_metadata** — on_chat_model_end callback doesn't guard against None/missing total_tokens from LLM response metadata. Will silently skip or raise AttributeError. [surfsense_backend/app/tasks/chat/stream_new_chat.py:on_chat_model_end]
+- [x] [Review][Patch] **selectedSystemModelIdAtom persists across search spaces** — Global atom never resets when user switches search space. Previous selection carries over incorrectly. [surfsense_web/atoms/new-llm-config/system-models-query.atoms.ts]
+- [x] [Review][Patch] **token_reset_date stored as String(50) instead of Date column** — Should be a proper Date/DateTime column for reliable comparison. Current string comparison with fromisoformat() is fragile. [surfsense_backend/app/db.py]
+- [x] [Review][Patch] **QuotaExceededError only handles HTTP 402, not mid-stream SSE quota errors** — If quota is exceeded during streaming (race condition between check and stream), SSE error is not caught as QuotaExceededError. [surfsense_web/app/dashboard/…/page.tsx]
+- [x] [Review][Patch] **Indentation inconsistency in page.tsx** — Mixed tab/space indentation in modified sections. [surfsense_web/app/dashboard/[search_space_id]/chat/[chat_session_id]/page.tsx]
+- [x] [Review][Patch] **displayModel falls back to models[0] silently** — If selectedSystemModelId doesn't match any model, defaults to first model without user notice. Empty array not guarded. [surfsense_web/components/new-chat/system-model-selector.tsx]
+- [x] [Review][Patch] **check_token_quota boundary: tokens_used == limit passes with estimated_tokens=0** — Off-by-one: when exactly at limit, check passes. Should use >= for strict enforcement. [surfsense_backend/app/services/token_quota_service.py:check_token_quota]
+- [x] [Review][Patch] **_get_tier_for_model pattern matching fragile** — Hardcoded substring checks ("gpt-4o-mini", "claude-3-haiku") will break with new model names. No fallback tier. [surfsense_backend/app/routes/model_list_routes.py:_get_tier_for_model]
+- [x] [Review][Patch] **GET /models/system endpoint not gated by is_cloud()** — Endpoint accessible in self-hosted mode. Should return 404 or empty when not in cloud mode. [surfsense_backend/app/routes/model_list_routes.py]
+- [x] [Review][Patch] **Subtask 5.3: llm-configs page not hidden in hosted/cloud mode** — User can still navigate to BYOK API key page. Needs conditional route guard or redirect. [surfsense_web/app/dashboard/[search_space_id]/llm-configs/]
+- [x] [Review][Patch] **update_token_usage has unnecessary session.refresh()** — Refresh after atomic update is redundant and adds latency. [surfsense_backend/app/services/token_quota_service.py:update_token_usage]
+- [x] [Review][Patch] **Model catalog missing cost_per_1k_tokens and explicit tier_required fields** — Spec Task 1.1 requires cost_per_1k_input_tokens, cost_per_1k_output_tokens, tier_required per model. YAML catalog doesn't include these; tier derived by fragile pattern match. [surfsense_backend/config/global_llm_config.yaml]
+
+#### Deferred (pre-existing / out of scope)
+- [x] [Review][Defer] **stripe_subscription_id has no unique constraint** [surfsense_backend/app/db.py] — deferred, will be addressed in Epic 5 (Stripe Payment Integration)
+- [x] [Review][Defer] **load_llm_config_from_yaml reads API keys directly from YAML file, not env vars** [surfsense_backend/app/config.py] — deferred, pre-existing architecture pattern
--- a/_bmad-output/implementation-artifacts/deferred-work.md
+++ b/_bmad-output/implementation-artifacts/deferred-work.md
@ -0,0 +1,6 @@
+# Deferred Work
+
+## Deferred from: code review of story 3-5-model-selection-via-quota (2026-04-14)
+
+- **stripe_subscription_id has no unique constraint** [surfsense_backend/app/db.py] — Column added without UNIQUE constraint. Should be enforced once Stripe integration (Epic 5) is implemented to prevent duplicate subscription mappings.
+- **load_llm_config_from_yaml reads API keys directly from YAML file, not env vars** [surfsense_backend/app/config.py] — Pre-existing: YAML config stores API keys inline. Spec Task 1.2 says "đọc API keys từ env vars" but this is the existing pattern used throughout the project. To be refactored when security hardening is prioritized.
--- a/_bmad-output/implementation-artifacts/sprint-status.yaml
+++ b/_bmad-output/implementation-artifacts/sprint-status.yaml
@ -35,7 +35,7 @@
 # - Dev moves story to 'review', then runs code-review (fresh context, different LLM recommended)

 generated: 2026-04-13T02:50:25+07:00
-last_updated: 2026-04-13T02:50:25+07:00
+last_updated: 2026-04-14T17:00:00+07:00
 project: SurfSense
 project_key: NOKEY
 tracking_system: file-system
@ -58,7 +58,7 @@ development_status:
  3-2-rag-engine-sse-endpoint: done
  3-3-chat-ui-sse-client: done
  3-4-split-pane-layout-interactive-citation: done
-  3-5-model-selection-via-quota: backlog
+  3-5-model-selection-via-quota: done
  epic-3-retrospective: optional
  epic-4: done
  4-1-chat-history-sync: done