Commit graph

5889 commits

Author SHA1 Message Date
Anish Sarkar
8001cae1b4 refactor: update MessageInfoDropdown to accept chatTurnId prop and enhance RevertTurnButton integration for improved functionality 2026-05-15 10:48:57 +05:30
DESKTOP-RTLN3BA\$punk
e8aad48ddf refactor(report): enhance citations and clarify implementation details
Updated the multimodal_doc_parser_compare_n171_report.md to include detailed code citations for preprocessing costs and retry logic. Improved clarity on the implementation of the retry mechanism and its impact on failure rates. Added a new section for a code citations index to ensure reproducibility of technical claims.

This enhances the report's transparency and allows readers to trace the source of each claim back to the codebase.
2026-05-14 20:07:14 -07:00
DESKTOP-RTLN3BA\$punk
9bcd50164d feat(evals): publish multimodal_doc parser_compare benchmark + n=171 report
Adds the full parser_compare experiment for the multimodal_doc suite:
six arms compared on 30 PDFs / 171 questions from MMLongBench-Doc with
anthropic/claude-sonnet-4.5 across the board.

Source code:
- core/parsers/{azure_di,llamacloud,pdf_pages}.py: direct parser SDK
  callers (Azure Document Intelligence prebuilt-read/layout, LlamaParse
  parse_page_with_llm/parse_page_with_agent) used by the LC arms,
  bypassing the SurfSense backend so each (basic/premium) extraction
  is a clean A/B independent of backend ETL routing.
- suites/multimodal_doc/parser_compare/{ingest,runner,prompt}.py:
  six-arm benchmark (native_pdf, azure_basic_lc, azure_premium_lc,
  llamacloud_basic_lc, llamacloud_premium_lc, surfsense_agentic) with
  byte-identical prompts per question, deterministic grader, Wilson
  CIs, and the per-page preprocessing tariff cost overlay.
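
The Wilson CIs mentioned above can be illustrated with a minimal sketch. The function name `wilson_interval`, the default `z = 1.96` (a 95% interval), and the counts in the usage note are assumptions for illustration; the commit does not show how the runner computes its intervals.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Hypothetical helper: not taken from the repository.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    # Center is the observed proportion shrunk toward 0.5.
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

For example, `wilson_interval(85, 171)` (made-up counts, not results from the report) returns an interval that brackets the observed proportion 85/171. Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] even when an arm scores near 0% or 100%.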

Reproducibility:
- pyproject.toml + uv.lock pin pypdf, azure-ai-documentintelligence,
  llama-cloud-services as new deps.
- .env.example documents the AZURE_DI_* and LLAMA_CLOUD_API_KEY env
  vars now required for parser_compare.
- 12 analysis scripts under scripts/: retry pass with exponential
  backoff, post-retry accuracy merge, McNemar / latency / per-PDF
  stats, context-overflow hypothesis test, etc. Each produces one
  number cited by the blog report.
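
A retry pass with exponential backoff, as named in the list above, might look roughly like the sketch below. The name `retry_with_backoff`, its parameters, the 10% jitter, and the injectable `sleep` argument are all assumptions; the actual script under scripts/ is not shown in this commit and may differ.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, doubling the wait after each failure.

    Hypothetical helper: not the repository's retry script.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel retries do not fire in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in a benchmark run the default `time.sleep` would be used.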

Citation surface:
- reports/blog/multimodal_doc_parser_compare_n171_report.md: 1219-line
  technical writeup (16 sections) covering headline accuracy, per-format
  accuracy, McNemar pairwise significance, latency / token / per-PDF
  distributions, error analysis, retry experiment, post-retry final
  accuracy, cost amortization model with closed-form derivation, threats
  to validity, and reproducibility appendix.
- data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/{raw,
  raw_retries,raw_post_retry}.jsonl + run_artifact.json + retry summary
  whitelisted via data/.gitignore as the verifiable numbers source.
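
For readers unfamiliar with the McNemar pairwise significance test cited above, here is a minimal sketch of the exact (binomial) two-sided variant. The commit does not say whether the analysis scripts use the exact or the chi-square form, so the choice of variant and the name `mcnemar_exact` are assumptions.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: questions arm A answered correctly and arm B did not.
    c: questions arm B answered correctly and arm A did not.
    Hypothetical helper: not taken from the repository.
    """
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    n = b + c
    if n == 0:
        # No discordant pairs: no evidence that the arms differ.
        return 1.0
    k = min(b, c)
    # Under H0 each discordant pair favors either arm with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the discordant pairs enter the test. Questions that both arms get right or both get wrong carry no information about which arm is better, which is why the test suits a paired design where every arm answers the same questions.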

Gitignore:
- ignore logs_*.txt + retry_run.log; structured artifacts cover the
  citation surface, debug logs are noise.
- data/.gitignore default-ignores everything, whitelists the n=171 run
  artifacts only (parser manifest left ignored to avoid leaking local
  Windows usernames in absolute paths; manifest is fully regenerable
  via 'ingest multimodal_doc parser_compare').
- reports/.gitignore now whitelists hand-curated reports/blog/.

Also retires the abandoned CRAG Task 3 implementation (download script,
streaming Task 3 ingest, CragTask3Benchmark + tests) and trims the
runner / ingest module APIs to match.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-14 19:54:41 -07:00
Anish Sarkar
5092bd3e8c refactor: integrate mobile preview functionality in citation components and enhance styling for improved usability 2026-05-15 04:13:58 +05:30
Anish Sarkar
4dd5871318 refactor: enhance citation components with mobile support and improved styling for better user experience 2026-05-15 03:56:01 +05:30
Anish Sarkar
01d7379914 refactor: add public URL handling for SurfSense documents across various components and schemas 2026-05-15 02:05:11 +05:30
Anish Sarkar
ea087d1d23 refactor: implement CitationHoverPopover component to enhance inline citation functionality and improve user interaction 2026-05-15 02:00:17 +05:30
CREDO23
ef1152b80e multi_agent_chat/permissions: layer user allow-list into subagent compile 2026-05-14 21:57:38 +02:00
CREDO23
e99c06c887 user_tool_allowlist: extract trust-tool storage into reusable service 2026-05-14 21:20:30 +02:00
Anish Sarkar
56239548c8 refactor: replace Activity icons with Workflow icons across various components for consistency 2026-05-15 00:47:21 +05:30
Anish Sarkar
3846056bc7 refactor: update icons and improve sidebar styling for enhanced user experience 2026-05-15 00:25:58 +05:30
CREDO23
31d6b43a42 multi_agent_chat/shared: drop bucket types and helpers 2026-05-14 20:10:25 +02:00
CREDO23
014801c764 multi_agent_chat/loader: MCP tools as flat list[BaseTool] per agent 2026-05-14 20:10:11 +02:00
CREDO23
5a00df8e48 multi_agent_chat/builtins: KB+deliverables+memory+research adopt RULESET + flat load_tools() 2026-05-14 20:09:55 +02:00
CREDO23
3bb90124d2 multi_agent_chat/connectors: every route declares its own RULESET + flat load_tools() 2026-05-14 20:09:49 +02:00
CREDO23
d45dfbfbd6 multi_agent_chat: pack_subagent owns per-subagent PermissionMiddleware via Ruleset 2026-05-14 20:09:29 +02:00
Anish Sarkar
c180417329 refactor: enhance loading and sidebar components with improved skeleton loading states and styling 2026-05-14 23:28:41 +05:30
Anish Sarkar
2bdd59611a refactor: update SidebarUserProfile and Composer components with improved styling and tooltip integration 2026-05-14 23:22:32 +05:30
Anish Sarkar
4083d33b5c refactor: enhance SidebarUserProfile component with download tracking and improved button styling 2026-05-14 22:53:41 +05:30
CREDO23
67142e68b1 multi_agent_chat: scope MCP allow/ask permissions per subagent + drop "policy" synonym 2026-05-14 18:09:14 +02:00
CREDO23
0723702320 multi_agent_chat: real-graph regressions for unified HITL paths + format pass 2026-05-14 17:41:24 +02:00
CREDO23
adb52fb575 multi_agent_chat: KB owns its ruleset, drop interrupt_on duplication 2026-05-14 17:41:07 +02:00
CREDO23
d68280113b multi_agent_chat/connectors+builtins: adopt symmetric self_gated_tool_permission_row helper 2026-05-14 17:40:59 +02:00
CREDO23
a06aec2821 multi_agent_chat/subagents: HITL umbrella + ToolKind rename 2026-05-14 17:40:29 +02:00
CREDO23
8eaab12971 multi_agent_chat/permissions: restructure slice + simplify factory 2026-05-14 17:40:12 +02:00
Anish Sarkar
c77babf39b refactor: replace button elements with Button component for improved consistency and styling across additional UI components 2026-05-14 15:02:46 +05:30
Anish Sarkar
13b2e874f6 refactor: replace button elements with Button component 2026-05-14 14:46:48 +05:30
Anish Sarkar
da55c75e5e refactor: remove ThreadList component and associated thread management logic 2026-05-14 14:42:41 +05:30
Anish Sarkar
ee72a49ab1 refactor: replace button elements with Button component for improved consistency and styling across additional UI components 2026-05-14 14:40:08 +05:30
Anish Sarkar
3d42712b3f refactor: replace button elements with Button component for improved consistency and styling across multiple UI components 2026-05-14 14:17:44 +05:30
Anish Sarkar
23e05acc7c refactor: remove CitationList component to streamline citation handling and improve code maintainability 2026-05-14 13:49:52 +05:30
Anish Sarkar
f98bde1e04 refactor: replace button elements with Button component for improved consistency and styling across UI components 2026-05-14 13:49:33 +05:30
CREDO23
a36b15b834 multi_agent_chat/middleware: tighten parallel-keying test with heterogeneous bundles and per-slice assertions 2026-05-14 10:11:51 +02:00
CREDO23
d69d2cc1fc multi_agent_chat/middleware: tighten heterogeneous slice arithmetic to (2,3) bundles 2026-05-14 10:05:04 +02:00
Anish Sarkar
198c38b335 refactor: replace button elements with Button component for consistent styling across various UI components 2026-05-14 13:30:20 +05:30
CREDO23
668b89927b multi_agent_chat/middleware: real-graph regression test for partial-pause parallel routing 2026-05-14 09:47:24 +02:00
CREDO23
8e10f38f32 multi_agent_chat/middleware: real-graph regression test for all-reject parallel routing 2026-05-14 09:36:03 +02:00
CREDO23
ca57b2106e multi_agent_chat/middleware: real-graph regression test for heterogeneous parallel decisions 2026-05-14 09:26:08 +02:00
Anish Sarkar
88f9c3353c refactor: update UI components to utilize 'popover-border' for consistent styling and enhance overall design coherence 2026-05-14 12:53:52 +05:30
DESKTOP-RTLN3BA\$punk
3737118050 chore: evals 2026-05-13 14:02:26 -07:00
Anish Sarkar
468f4ef121 refactor: standardize hover effects and focus styles across UI components 2026-05-14 02:10:33 +05:30
Anish Sarkar
cbfbf59c46 refactor: enhance UI components with improved hover effects and color consistency 2026-05-14 02:07:53 +05:30
CREDO23
f2495092da chat/stream_resume: salt thinking-step prefix with turn_id to avoid duplicate React keys 2026-05-13 21:15:51 +02:00
CREDO23
1bb9f435e5 chat-messages: render and batch-submit multiple HITL approval cards 2026-05-13 21:00:01 +02:00
CREDO23
0fd87ccb7f chat/stream_resume: key Command(resume=...) by Interrupt.id for parallel HITL 2026-05-13 20:59:57 +02:00
CREDO23
c06dd6e8ba chat/stream_new_chat: emit one SSE frame per pending interrupt 2026-05-13 20:59:48 +02:00
Anish Sarkar
bd5f1b3cdf feat: add CollapsedInboxIcon component to enhance sidebar functionality 2026-05-14 00:13:53 +05:30
Anish Sarkar
75b7a9cc6c refactor: update UI components to enhance hover effects and color consistency 2026-05-13 23:53:09 +05:30
CREDO23
583ac83735 multi_agent_chat/middleware: refresh module layout docs 2026-05-13 19:58:59 +02:00
CREDO23
22e9dd3cf3 multi_agent_chat/main_agent: routing prompt for parallel and serial specialist work 2026-05-13 19:58:34 +02:00