SurfSense

mirror of https://github.com/MODSetter/SurfSense.git synced 2026-07-10 22:32:16 +02:00

Author	SHA1	Message	Date
DESKTOP-RTLN3BA\$punk	9bcd50164d	feat(evals): publish multimodal_doc parser_compare benchmark + n=171 report	2026-05-14 19:54:41 -07:00
    Adds the full parser_compare experiment for the multimodal_doc suite: six arms compared on 30 PDFs / 171 questions from MMLongBench-Doc with anthropic/claude-sonnet-4.5 across the board.

    Source code:
    - core/parsers/{azure_di,llamacloud,pdf_pages}.py: direct parser SDK callers (Azure Document Intelligence prebuilt-read/layout, LlamaParse parse_page_with_llm/parse_page_with_agent) used by the LC arms, bypassing the SurfSense backend so each (basic/premium) extraction is a clean A/B independent of backend ETL routing.
    - suites/multimodal_doc/parser_compare/{ingest,runner,prompt}.py: six-arm benchmark (native_pdf, azure_basic_lc, azure_premium_lc, llamacloud_basic_lc, llamacloud_premium_lc, surfsense_agentic) with byte-identical prompts per question, deterministic grader, Wilson CIs, and the per-page preprocessing tariff cost overlay.

    Reproducibility:
    - pyproject.toml + uv.lock pin pypdf, azure-ai-documentintelligence, llama-cloud-services as new deps.
    - .env.example documents the AZURE_DI_* and LLAMA_CLOUD_API_KEY env vars now required for parser_compare.
    - 12 analysis scripts under scripts/: retry pass with exponential backoff, post-retry accuracy merge, McNemar / latency / per-PDF stats, context-overflow hypothesis test, etc. Each produces one number cited by the blog report.

    Citation surface:
    - reports/blog/multimodal_doc_parser_compare_n171_report.md: 1219-line technical writeup (16 sections) covering headline accuracy, per-format accuracy, McNemar pairwise significance, latency / token / per-PDF distributions, error analysis, retry experiment, post-retry final accuracy, cost amortization model with closed-form derivation, threats to validity, and reproducibility appendix.
    - data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/{raw,raw_retries,raw_post_retry}.jsonl + run_artifact.json + retry summary whitelisted via data/.gitignore as the verifiable numbers source.

    Gitignore:
    - ignore logs_*.txt + retry_run.log; structured artifacts cover the citation surface, debug logs are noise.
    - data/.gitignore default-ignores everything, whitelists the n=171 run artifacts only (parser manifest left ignored to avoid leaking local Windows usernames in absolute paths; manifest is fully regenerable via 'ingest multimodal_doc parser_compare').
    - reports/.gitignore now whitelists hand-curated reports/blog/.

    Also retires the abandoned CRAG Task 3 implementation (download script, streaming Task 3 ingest, CragTask3Benchmark + tests) and trims the runner / ingest module APIs to match.

    Co-authored-by: Cursor <cursoragent@cursor.com>
CREDO23	ef1152b80e	multi_agent_chat/permissions: layer user allow-list into subagent compile	2026-05-14 21:57:38 +02:00
CREDO23	e99c06c887	user_tool_allowlist: extract trust-tool storage into reusable service	2026-05-14 21:20:30 +02:00
CREDO23	31d6b43a42	multi_agent_chat/shared: drop bucket types and helpers	2026-05-14 20:10:25 +02:00
CREDO23	014801c764	multi_agent_chat/loader: MCP tools as flat list[BaseTool] per agent	2026-05-14 20:10:11 +02:00
CREDO23	5a00df8e48	multi_agent_chat/builtins: KB+deliverables+memory+research adopt RULESET + flat load_tools()	2026-05-14 20:09:55 +02:00
CREDO23	3bb90124d2	multi_agent_chat/connectors: every route declares its own RULESET + flat load_tools()	2026-05-14 20:09:49 +02:00
CREDO23	d45dfbfbd6	multi_agent_chat: pack_subagent owns per-subagent PermissionMiddleware via Ruleset	2026-05-14 20:09:29 +02:00
CREDO23	67142e68b1	multi_agent_chat: scope MCP allow/ask permissions per subagent + drop "policy" synonym	2026-05-14 18:09:14 +02:00
CREDO23	0723702320	multi_agent_chat: real-graph regressions for unified HITL paths + format pass	2026-05-14 17:41:24 +02:00
CREDO23	adb52fb575	multi_agent_chat: KB owns its ruleset, drop interrupt_on duplication	2026-05-14 17:41:07 +02:00
CREDO23	d68280113b	multi_agent_chat/connectors+builtins: adopt symmetric self_gated_tool_permission_row helper	2026-05-14 17:40:59 +02:00
CREDO23	a06aec2821	multi_agent_chat/subagents: HITL umbrella + ToolKind rename	2026-05-14 17:40:29 +02:00
CREDO23	8eaab12971	multi_agent_chat/permissions: restructure slice + simplify factory	2026-05-14 17:40:12 +02:00
CREDO23	a36b15b834	multi_agent_chat/middleware: tighten parallel-keying test with heterogeneous bundles and per-slice assertions	2026-05-14 10:11:51 +02:00
CREDO23	d69d2cc1fc	multi_agent_chat/middleware: tighten heterogeneous slice arithmetic to (2,3) bundles	2026-05-14 10:05:04 +02:00
CREDO23	668b89927b	multi_agent_chat/middleware: real-graph regression test for partial-pause parallel routing	2026-05-14 09:47:24 +02:00
CREDO23	8e10f38f32	multi_agent_chat/middleware: real-graph regression test for all-reject parallel routing	2026-05-14 09:36:03 +02:00
CREDO23	ca57b2106e	multi_agent_chat/middleware: real-graph regression test for heterogeneous parallel decisions	2026-05-14 09:26:08 +02:00
DESKTOP-RTLN3BA\$punk	3737118050	chore: evals	2026-05-13 14:02:26 -07:00
CREDO23	f2495092da	chat/stream_resume: salt thinking-step prefix with turn_id to avoid duplicate React keys	2026-05-13 21:15:51 +02:00
CREDO23	1bb9f435e5	chat-messages: render and batch-submit multiple HITL approval cards	2026-05-13 21:00:01 +02:00
CREDO23	0fd87ccb7f	chat/stream_resume: key Command(resume=...) by Interrupt.id for parallel HITL	2026-05-13 20:59:57 +02:00
CREDO23	c06dd6e8ba	chat/stream_new_chat: emit one SSE frame per pending interrupt	2026-05-13 20:59:48 +02:00
CREDO23	583ac83735	multi_agent_chat/middleware: refresh module layout docs	2026-05-13 19:58:59 +02:00
CREDO23	22e9dd3cf3	multi_agent_chat/main_agent: routing prompt for parallel and serial specialist work	2026-05-13 19:58:34 +02:00
CREDO23	03cf1466d3	chat/stream_resume: route a flat decisions list per paused subagent	2026-05-13 19:58:13 +02:00
CREDO23	1001f56206	multi_agent_chat/middleware: parallel task tests and full bridge coverage	2026-05-13 19:57:57 +02:00
CREDO23	6fb011c95c	multi_agent_chat/middleware: real-graph regression tests for interrupt stamping	2026-05-13 19:57:09 +02:00
CREDO23	e27883e88c	multi_agent_chat/middleware: stamp tool_call_id on subagent interrupts at task chokepoint	2026-05-13 19:57:02 +02:00
CREDO23	fc2c5b6445	multi_agent_chat/middleware: per-call thread_id, tcid-keyed resume, decisions slicer	2026-05-13 19:56:51 +02:00
guangyang1206	b7b4443276	fix(web): invalidate all log cache keys on log mutations	2026-05-13 20:59:08 +08:00
    Fixes #1369: log create/update/delete mutations did not invalidate the query keys that useLogs actually subscribes to, causing UI staleness. Replace narrow invalidations (list, summary) with prefix-level invalidation (["logs"]) to cover withQueryParams, list, summary and detail in one shot.
Anish Sarkar	883c72396c	chore: add minimumReleaseAge configuration to pnpm workspace for dependency management	2026-05-13 03:38:04 +05:30
Anish Sarkar	d9ec401835	chore: remove caret from @rocicorp/zero dependency version	2026-05-13 03:34:28 +05:30
CREDO23	246dae40a8	Merge upstream/dev into feature/multi-agent	2026-05-12 21:23:37 +02:00
Anish Sarkar	bd452b3df4	fix(tests): improve composio module hijack in integration tests	2026-05-13 00:44:20 +05:30
Anish Sarkar	9b926b3133	refactor: update test for index() to use chunk_text_hybrid	2026-05-13 00:22:43 +05:30
CREDO23	6b60d324a3	multi_agent_chat/main_agent: one specialist per task; advertise write_todos for multi-turn plans	2026-05-12 20:39:14 +02:00
Anish Sarkar	6eb900cb0f	chore: update packageManager version to pnpm@10.26.0 in both desktop and web projects	2026-05-12 23:59:58 +05:30
CREDO23	379cc992f4	multi_agent_chat/subagents: expose knowledge_base as ask_knowledge_base tool for siblings	2026-05-12 20:03:59 +02:00
Anish Sarkar	0884b63406	chore: ran linting	2026-05-12 23:25:33 +05:30
Anish Sarkar	32ff864fd3	refactor(assistant-ui): streamline docstrings and comments	2026-05-12 23:24:01 +05:30
Anish Sarkar	2437716752	refactor(assistant-ui): enhance mention chip handling and editor focus behavior	2026-05-12 23:18:45 +05:30
CREDO23	f2f62c1c05	multi_agent_chat/permissions: break circular import in interrupt subpackage	2026-05-12 18:20:07 +02:00
CREDO23	d843468256	multi_agent_chat/subagents: dict-keyed middleware_stack + always-on KB	2026-05-12 18:04:54 +02:00
Anish Sarkar	0c2beb7ce8	fix(thread): conditionally render screen capture button for desktop users	2026-05-12 21:26:33 +05:30
Anish Sarkar	8ea042e88c	refactor(chat): improve user query handling and mention chip functionality	2026-05-12 20:57:15 +05:30
CREDO23	eee861bb3d	multi_agent_chat/main_agent: rewrite system prompt to hierarchical prompts/ tree	2026-05-12 15:35:48 +02:00
Anish Sarkar	c43bfdb1d9	chore(migration): update migration files to enforce new publication mutation pattern	2026-05-12 17:31:49 +05:30
Anish Sarkar	d923d34e38	feat(migration): add migration 143 to force zero-cache resync after Zero upgrade	2026-05-12 16:43:50 +05:30


5833 commits