Story 3.5: Subscription-based LLM Model Selection (Model Selection via Quota)
Status: in-progress
Story
As a User, I want to select a pre-provisioned AI model configuration (e.g., Claude 3.5 Sonnet, GPT-4) without entering a personal API key, so that I can use it directly and the usage cost is deducted from the token allowance of my subscription plan.
Acceptance Criteria
- The system provides a dropdown to select the `LLM Model` in the Chat UI — the model list is managed by the system.
- The Frontend must never show a "Your API Key" input when `DEPLOYMENT_MODE=hosted` (PRD: "User-entered LLM API keys are strictly not supported, in order to control quality and revenue").
- The Backend computes the cost (tokens × the model's unit price) and deducts it from the subscription quota.
- The system checks the user's monthly quota; if it is exceeded (Quota Exceeded), return a 402/429 error with an "Upgrade your plan" message.
As-Is (Current Code — to be changed)
| Component | Current State | File |
|---|---|---|
| Frontend Model Selector | BYOK: user selects from a NewLLMConfig they created themselves (own API key) | surfsense_web/components/new-chat/model-selector.tsx |
| Frontend LLM Config UI | Lets the user enter an API key and choose provider and model name | surfsense_web/app/dashboard/[search_space_id]/llm-configs/ |
| Backend LLM Config | NewLLMConfig table stores api_key, llm_model_name, provider per-user | surfsense_backend/app/db.py |
| Backend Chat Streaming | Uses the API key from the user's NewLLMConfig to call the LLM | Chat routes / RAG engine |
| LLM Quota | Does not exist — only pages_limit/pages_used for document ETL | surfsense_backend/app/db.py |
| PageLimitService | Fully implemented for the document quota — can serve as a pattern | surfsense_backend/app/services/page_limit_service.py |
Tasks / Subtasks
- Task 1: Create System Model Catalog (Backend)
  - Subtask 1.1: Create a config/table `SystemLLMModel` with fields: `model_id`, `provider` (openai/anthropic), `model_name`, `display_name`, `cost_per_1k_input_tokens`, `cost_per_1k_output_tokens`, `tier_required` (free/pro). Can be an Enum or a DB table (a sketch follows this list).
  - Subtask 1.2: Backend reads API keys from env vars (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`) — do not store them per-user in the DB.
  - Subtask 1.3: Create endpoint `GET /api/v1/models` returning the list of available models (filtered by the user's subscription tier).
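A minimal sketch of what one catalog entry could look like, following the field names in Subtask 1.1; the dataclass form, the non-positive ID convention (echoing the model_id review patch), and the example prices are assumptions, not the final schema:

```python
# Sketch only: field names follow Subtask 1.1; the dataclass form and example
# values are illustrative assumptions, not the final schema.
from dataclasses import dataclass
from enum import Enum


class Tier(str, Enum):
    FREE = "free"
    PRO = "pro"


@dataclass(frozen=True)
class SystemLLMModel:
    model_id: int          # system models use IDs <= 0 (see the model_id hijack patch below)
    provider: str          # "openai" | "anthropic"
    model_name: str        # provider-side identifier
    display_name: str      # label shown in the chat dropdown
    cost_per_1k_input_tokens: float
    cost_per_1k_output_tokens: float
    tier_required: Tier


# Example entries; prices are placeholders, not real pricing.
SYSTEM_MODELS = [
    SystemLLMModel(0, "openai", "gpt-4o-mini", "GPT-4o mini", 0.1, 0.4, Tier.FREE),
    SystemLLMModel(-1, "anthropic", "claude-3-5-sonnet-latest", "Claude 3.5 Sonnet", 3.0, 15.0, Tier.PRO),
]
```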
- Task 2: Update Chat Streaming to use System Keys (Backend)
  - Subtask 2.1: Change the RAG engine / chat streaming endpoint to accept `model_id` instead of `llm_config_id`.
  - Subtask 2.2: Resolve the API key from env vars based on the model's `provider`, not from the user's `NewLLMConfig`.
  - Subtask 2.3: After the stream finishes, count tokens (using `tiktoken` for OpenAI, or response metadata) → issue an atomic UPDATE `token_balance = token_balance - cost` (avoids the race condition when two chat tabs are open at once; see the sketch after this list).
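A hedged sketch of the accounting step in Subtask 2.3: counting tokens with `tiktoken` as a fallback when the provider's usage metadata is missing, then converting tokens to cost. The helper names and the SystemLLMModel type are illustrative assumptions, not existing code:

```python
# Illustrative only: prefer the provider's usage metadata when present;
# tiktoken is the fallback path mentioned in Subtask 2.3.
import tiktoken


def count_tokens(text: str, model_name: str = "gpt-4o-mini") -> int:
    """Fallback token count when the LLM response carries no usage metadata."""
    try:
        enc = tiktoken.encoding_for_model(model_name)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))


def compute_cost(input_tokens: int, output_tokens: int, model: "SystemLLMModel") -> float:
    """Cost = tokens x the model's per-1k unit prices (AC 3)."""
    return (
        input_tokens / 1000 * model.cost_per_1k_input_tokens
        + output_tokens / 1000 * model.cost_per_1k_output_tokens
    )
```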
- Task 3: Add Token Quota to the User Model (Backend DB)
  - Subtask 3.1: Alembic migration adding columns to `User`: `monthly_token_limit` (Integer), `tokens_used_this_month` (Integer), `token_reset_date` (Date), `subscription_status` (Enum: free/active/canceled/past_due), `plan_id` (String). A migration sketch follows this list.
  - Subtask 3.2: Logic to reset `tokens_used_this_month = 0` when `token_reset_date` is reached (middleware, or a webhook trigger when the subscription renews).
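A rough Alembic upgrade sketch for Subtask 3.1. Revision identifiers are omitted; the `user` table name, server defaults, and enum name are assumptions; the explicit enum creation reflects the review patch about `create_type=False`:

```python
# Sketch only: table name, defaults, and enum name are placeholders.
import sqlalchemy as sa
from alembic import op

subscription_status = sa.Enum(
    "free", "active", "canceled", "past_due", name="subscriptionstatus"
)


def upgrade() -> None:
    # Create the PG enum type explicitly (review patch: create_type=False leaves it missing)
    subscription_status.create(op.get_bind(), checkfirst=True)
    op.add_column("user", sa.Column("monthly_token_limit", sa.Integer(), nullable=False, server_default="0"))
    op.add_column("user", sa.Column("tokens_used_this_month", sa.Integer(), nullable=False, server_default="0"))
    op.add_column("user", sa.Column("token_reset_date", sa.Date(), nullable=True))
    op.add_column("user", sa.Column("subscription_status", subscription_status, nullable=False, server_default="free"))
    op.add_column("user", sa.Column("plan_id", sa.String(), nullable=True))


def downgrade() -> None:
    for col in ("plan_id", "subscription_status", "token_reset_date", "tokens_used_this_month", "monthly_token_limit"):
        op.drop_column("user", col)
    subscription_status.drop(op.get_bind(), checkfirst=True)
```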
- Task 4: Quota Check before calling the LLM (Backend — Fail-fast)
  - Subtask 4.1: Before calling the LLM in the SSE stream, check `tokens_used_this_month < monthly_token_limit`. If exceeded → raise HTTPException 402 "Token quota exceeded. Upgrade your plan." (see the sketch after this list)
  - Subtask 4.2: (Optional) Estimate input tokens before the call as a pre-check.
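A minimal fail-fast sketch for Subtask 4.1, assuming the chat route can load the current User before opening the SSE stream; the function name is hypothetical, and the `>=` comparison reflects the quota-boundary review patch:

```python
# Sketch only: placement and names are illustrative; the real check belongs in
# the chat streaming route, before any LLM call.
from fastapi import HTTPException


def ensure_quota(user, estimated_tokens: int = 0) -> None:
    """Fail fast with 402 when the monthly token quota is already spent."""
    # Strict boundary: being exactly at the limit already blocks (review patch)
    if user.tokens_used_this_month + estimated_tokens >= user.monthly_token_limit:
        raise HTTPException(
            status_code=402,
            detail="Token quota exceeded. Upgrade your plan.",
        )
```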
- Task 5: Frontend — System Model Selector (replacing BYOK)
  - Subtask 5.1: Create a `SystemModelSelector` component — fetch `GET /api/v1/models/system`, render a dropdown with model name + tier badge.
  - Subtask 5.2: Conditional rendering: if `NEXT_PUBLIC_DEPLOYMENT_MODE=cloud` → use `SystemModelSelector`; if `self-hosted` → keep the current BYOK selector.
  - Subtask 5.3: Hide/disable the `llm-configs` page (API key entry) in hosted mode.
- Task 6: Frontend — Upgrade Prompt when quota is exhausted
  - Subtask 6.1: Catch the 402 error from the SSE stream and show a toast "Monthly token quota exceeded" with an "Upgrade" action button → `/pricing`.
Dev Notes
Deployment Mode
Use NEXT_PUBLIC_DEPLOYMENT_MODE to distinguish:
- `self-hosted`: keep BYOK (users manage their own API keys) — no billing
- `hosted`: system model catalog + token billing + subscription enforcement
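A tiny backend helper sketch for this switch, assuming the backend reads a DEPLOYMENT_MODE env var that mirrors NEXT_PUBLIC_DEPLOYMENT_MODE; the variable name and accepted values are assumptions:

```python
# Sketch only: env var name and accepted values are assumptions.
import os


def is_cloud() -> bool:
    """True for the hosted/cloud deployment, False for self-hosted (BYOK)."""
    return os.environ.get("DEPLOYMENT_MODE", "self-hosted").lower() in {"hosted", "cloud"}
```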
Concurrency & Race Conditions
```python
# Atomic update — avoids the race condition when two chat tabs stream concurrently
from sqlalchemy import update

await session.execute(
    update(User)
    .where(User.id == user_id)
    .values(tokens_used_this_month=User.tokens_used_this_month + tokens_consumed)
)
```
Pattern Reference
See PageLimitService (surfsense_backend/app/services/page_limit_service.py) — it already implements check + update + estimate for the page quota. A similar TokenQuotaService can be created.
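A rough skeleton of such a TokenQuotaService, reusing method names the review findings reference (`update_token_usage`, `_maybe_reset_monthly_tokens`); the signatures, the Date-typed reset column, and the renewal rule are assumptions. It keeps the increment atomic and leaves the commit to the caller, per the review patches:

```python
# Skeleton only: names mirror the review findings; exact signatures are assumptions.
from datetime import date, timedelta

from sqlalchemy import update
from sqlalchemy.ext.asyncio import AsyncSession

from app.db import User  # import path assumed


class TokenQuotaService:
    def __init__(self, session: AsyncSession):
        self.session = session

    def _maybe_reset_monthly_tokens(self, user: User) -> None:
        """Zero the counter once the reset date has passed; no commit here (caller owns the transaction)."""
        if user.token_reset_date and date.today() >= user.token_reset_date:
            user.tokens_used_this_month = 0
            user.token_reset_date = date.today() + timedelta(days=30)  # placeholder renewal rule

    async def update_token_usage(self, user_id, tokens_consumed: int) -> None:
        """Atomic SQL increment — no read-modify-write, no redundant session.refresh()."""
        await self.session.execute(
            update(User)
            .where(User.id == user_id)
            .values(tokens_used_this_month=User.tokens_used_this_month + tokens_consumed)
        )
```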
References
- `surfsense_backend/app/db.py` — User model, NewLLMConfig
- `surfsense_web/components/new-chat/model-selector.tsx` — BYOK (to be replaced)
- `surfsense_backend/app/services/page_limit_service.py` — reference pattern
- Current SSE endpoint: `/api/v1/chat/stream`
Review Findings
Code review 2026-04-14 — Blind Hunter + Edge Case Hunter + Acceptance Auditor
Decision Needed
- [Review][Decision→Patch] Backend tier enforcement at chat time — RESOLVED: enforce within Story 3.5. Add a tier check to the chat endpoint before calling the LLM. [model_list_routes.py, new_chat_routes.py]
Patch
- [Review][Patch] Alembic migration absent for 7 new User columns + SubscriptionStatus enum — DB columns (monthly_token_limit, tokens_used_this_month, token_reset_date, subscription_status, plan_id, stripe_customer_id, stripe_subscription_id) added to the model but no migration file. SubscriptionStatus enum has `create_type=False` but the PG type is never created. [surfsense_backend/app/db.py]
- [Review][Patch] Race condition: token quota uses ORM read-modify-write instead of atomic SQL UPDATE — Spec explicitly requires the `UPDATE ... SET tokens_used = tokens_used + cost` pattern. Current code reads the user, adds in Python, writes back — concurrent tabs can overspend. [surfsense_backend/app/services/token_quota_service.py:update_token_usage]
- [Review][Patch] Security: model_id > 0 allows cross-user BYOK config hijack — Cloud mode accepts any positive integer as model_id, which maps to user-created NewLLMConfig records. An attacker can use another user's API key. Must validate model_id ≤ 0 (system models) in cloud mode. [surfsense_backend/app/routes/new_chat_routes.py]
- [Review][Patch] stream_resume_chat never deducts tokens from quota — Token counting + deduction logic only in stream_new_chat. Resume path skips quota update entirely — violates AC 3. [surfsense_backend/app/tasks/chat/stream_new_chat.py:stream_resume_chat]
- [Review][Patch] Frontend handleResume doesn't send model_id — onNew and handleRegenerate inject selectedSystemModelId but handleResume omits it. Backend schema already supports it. [surfsense_web/app/dashboard/[search_space_id]/chat/[chat_session_id]/page.tsx:handleResume]
- [Review][Patch] systemModelsAtom fetches unconditionally in self-hosted mode — atomWithQuery fires on mount regardless of deployment mode. Wastes network call + may 404. Add isCloud() guard. [surfsense_web/atoms/new-llm-config/system-models-query.atoms.ts]
- [Review][Patch] _maybe_reset_monthly_tokens double-commit fragility — Method calls session.commit() then caller also commits → potential MissingGreenlet in async context. Should let caller manage transaction boundary. [surfsense_backend/app/services/token_quota_service.py:_maybe_reset_monthly_tokens]
- [Review][Patch] get_token_usage skips monthly reset check — Doesn't call _maybe_reset_monthly_tokens, so stale tokens_used may be returned after month rollover. [surfsense_backend/app/services/token_quota_service.py:get_token_usage]
- [Review][Patch] Token accumulation ignores None usage_metadata — on_chat_model_end callback doesn't guard against None/missing total_tokens from LLM response metadata. Will silently skip or raise AttributeError. (A guard sketch follows this list.) [surfsense_backend/app/tasks/chat/stream_new_chat.py:on_chat_model_end]
- [Review][Patch] selectedSystemModelIdAtom persists across search spaces — Global atom never resets when user switches search space. Previous selection carries over incorrectly. [surfsense_web/atoms/new-llm-config/system-models-query.atoms.ts]
- [Review][Patch] token_reset_date stored as String(50) instead of Date column — Should be a proper Date/DateTime column for reliable comparison. Current string comparison with fromisoformat() is fragile. [surfsense_backend/app/db.py]
- [Review][Patch] QuotaExceededError only handles HTTP 402, not mid-stream SSE quota errors — If quota is exceeded during streaming (race condition between check and stream), SSE error is not caught as QuotaExceededError. [surfsense_web/app/dashboard/…/page.tsx]
- [Review][Patch] Indentation inconsistency in page.tsx — Mixed tab/space indentation in modified sections. [surfsense_web/app/dashboard/[search_space_id]/chat/[chat_session_id]/page.tsx]
- [Review][Patch] displayModel falls back to models[0] silently — If selectedSystemModelId doesn't match any model, defaults to first model without user notice. Empty array not guarded. [surfsense_web/components/new-chat/system-model-selector.tsx]
- [Review][Patch] check_token_quota boundary: tokens_used == limit passes with estimated_tokens=0 — Off-by-one: when exactly at limit, check passes. Should use >= for strict enforcement. [surfsense_backend/app/services/token_quota_service.py:check_token_quota]
- [Review][Patch] _get_tier_for_model pattern matching fragile — Hardcoded substring checks ("gpt-4o-mini", "claude-3-haiku") will break with new model names. No fallback tier. [surfsense_backend/app/routes/model_list_routes.py:_get_tier_for_model]
- [Review][Patch] GET /models/system endpoint not gated by is_cloud() — Endpoint accessible in self-hosted mode. Should return 404 or empty when not in cloud mode. [surfsense_backend/app/routes/model_list_routes.py]
- [Review][Patch] Subtask 5.3: llm-configs page not hidden in hosted/cloud mode — User can still navigate to BYOK API key page. Needs conditional route guard or redirect. [surfsense_web/app/dashboard/[search_space_id]/llm-configs/]
- [Review][Patch] update_token_usage has unnecessary session.refresh() — Refresh after atomic update is redundant and adds latency. [surfsense_backend/app/services/token_quota_service.py:update_token_usage]
- [Review][Patch] Model catalog missing cost_per_1k_tokens and explicit tier_required fields — Spec Task 1.1 requires cost_per_1k_input_tokens, cost_per_1k_output_tokens, tier_required per model. YAML catalog doesn't include these; tier derived by fragile pattern match. [surfsense_backend/config/global_llm_config.yaml]
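For the usage_metadata patch above, a defensive accumulation sketch; the function name is hypothetical, and the field names follow the usage_metadata shape common in LangChain responses, which should be treated as an assumption for this codebase:

```python
# Sketch only: accumulation site and metadata shape are assumptions; the point is
# to guard against missing/None usage_metadata instead of raising AttributeError.
def accumulate_tokens(message, totals: dict) -> None:
    usage = getattr(message, "usage_metadata", None) or {}
    total = usage.get("total_tokens")
    if total is None:
        # Fall back to input + output when present, otherwise add nothing (and log upstream)
        total = (usage.get("input_tokens") or 0) + (usage.get("output_tokens") or 0)
    totals["total_tokens"] = totals.get("total_tokens", 0) + total
```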
Deferred (pre-existing / out of scope)
- [Review][Defer] stripe_subscription_id has no unique constraint [surfsense_backend/app/db.py] — deferred, will be addressed in Epic 5 (Stripe Payment Integration)
- [Review][Defer] load_llm_config_from_yaml reads API keys directly from YAML file, not env vars [surfsense_backend/app/config.py] — deferred, pre-existing architecture pattern