diff --git a/surfsense_evals/reports/blog/multimodal_doc_parser_compare_n171_report.md b/surfsense_evals/reports/blog/multimodal_doc_parser_compare_n171_report.md
deleted file mode 100644
index 7bd72ebb9..000000000
--- a/surfsense_evals/reports/blog/multimodal_doc_parser_compare_n171_report.md
+++ /dev/null
@@ -1,1336 +0,0 @@
-# Multimodal Document QA Benchmark: Native PDFs vs Parser-Stuffed Context vs SurfSense Agentic Retrieval
-
-**Date:** 2026-05-13  
-**Dataset:** MMLongBench-Doc / `multimodal_doc`  
-**Run:** `parser_compare`  
-**Model:** `anthropic/claude-sonnet-4.5` everywhere  
-**Sample:** 30 PDFs, 171 answerable questions  
-**Report artifact:** `reports/multimodal_doc/2026-05-14T02-30-16Z/summary.md`  
-**Raw artifact:** `data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw.jsonl`
-
----
-
-## 1. Executive Summary
-
-We ran a six-arm comparative study on 30 documents from MMLongBench-Doc to understand how different document-QA strategies perform on long, multimodal PDFs.
-
-The comparison was designed around a realistic product question:
-
-> If we use the same strong LLM, is it better to send the PDF directly, send a full parsed document into the prompt, or let SurfSense retrieve/context-manage chunks agentically?
-
-The arms were:
-
-1. **Native PDF attachment**: send the PDF file directly to Sonnet 4.5.
-2. **Azure Document Intelligence basic + long-context stuffing**.
-3. **Azure Document Intelligence premium + long-context stuffing**.
-4. **LlamaCloud basic + long-context stuffing**.
-5. **LlamaCloud premium + long-context stuffing**.
-6. **SurfSense agentic retrieval**: use SurfSense `/api/v1/new_chat`, with the PDF already ingested into SurfSense and retrieved dynamically during the answer process.
-
-Headline result:
-
-| Rank by accuracy | Arm | Accuracy | F1 | LLM $/Q | Preproc $/Q | **Total $/Q** | Median latency | Raw failures |
-|---:|---|---:|---:|---:|---:|---:|---:|---:|
-| 1 | LlamaCloud premium, full-context | **58.5%** | **61.1%** | $0.1208 | $0.0677 | $0.1885 | 6.8s | 4 |
-| 2 | Azure premium, full-context | 56.7% | 59.6% | $0.1373 | $0.0677 | $0.2051 | 6.9s | 3 |
-| 3 | Azure basic, full-context | 54.4% | 56.6% | $0.0994 | $0.0068 | $0.1062 | 7.1s | 1 |
-| 4 | SurfSense agentic retrieval | 53.2% | 54.3% | **$0.0150** | $0.0677 | **$0.0827** | 52.8s | **0** |
-| 5 | LlamaCloud basic, full-context | 50.3% | 53.2% | $0.0981 | $0.0068 | $0.1049 | 7.1s | 2 |
-| 6 | Native PDF attachment | 48.0% | 50.4% | $0.2552 | $0.0000 | $0.2552 | 29.5s | 27 |
-
-Cost ranking (cheapest first):
-
-| Rank by cost | Arm | Total $/Q | Accuracy |
-|---:|---|---:|---:|
-| 1 | **SurfSense agentic retrieval** | **$0.0827** | 53.2% |
-| 2 | LlamaCloud basic, full-context | $0.1049 | 50.3% |
-| 3 | Azure basic, full-context | $0.1062 | 54.4% |
-| 4 | LlamaCloud premium, full-context | $0.1885 | 58.5% |
-| 5 | Azure premium, full-context | $0.2051 | 56.7% |
-| 6 | Native PDF attachment | $0.2552 | 48.0% |
-
-The main lesson is not simply “parser X wins.” The more important finding is:
-
-> Full-context prompting gives slightly higher peak accuracy when the full processed document fits cleanly in the context window, but SurfSense is the cheapest *and* most robust option: it produced zero runtime failures across the 171-question run and the lowest end-to-end cost per question, while remaining within ~5 percentage points of the best full-context arm.
-
-A follow-up retry experiment (§9.4 + §9.5) tightens this further. We re-ran only the 37 failed `(arm, qid)` pairs with up to 5 attempts of exponential backoff, merged the survivors back into the run, and recomputed the headline numbers:
-
-- **All 10 long-context-arm failures recovered.** 100% recovery rate, mostly on attempt 1 — confirming these were transient transport-layer errors, not context-window overflows.
-- **Only 15 of 27 native_pdf failures recovered.** The remaining 12 are intrinsic: 6 questions on one PDF that exceeds the provider's 30 MB wire-size cap, and 5 questions on a 166-page PDF whose response stream the provider cannot reliably terminate. Native_pdf retains a **7% intrinsic failure rate that survives retries**.
-- **Final post-retry accuracy** (full table in §9.5): `llamacloud_premium_lc` 59.6% > `azure_premium_lc` 58.5% > `azure_basic_lc` 54.4% > `surfsense_agentic` 53.2% > `native_pdf` 52.0% > `llamacloud_basic_lc` 50.9%. The top three are unchanged; `native_pdf` moves up one spot to #5 (still last among the arms that complete cleanly); SurfSense holds its 53.2% at #4 and stays the cheapest arm.
-
----
-
-## 2. Why This Experiment Was Run
-
-Earlier small tests suggested that native PDF attachment could sometimes outperform OCR/RAG approaches. That result was not enough to settle the architectural question because the sample was small, the tests did not isolate parsers, and they did not cover long documents.
-
-This experiment was built to compare three classes of systems:
-
-### A. Non-agentic, no context management
-
-These arms pass the whole document representation to the LLM for every question.
-
-- **Native PDF** sends the original PDF directly to the model.
-- **Azure basic/premium** parses the PDF to markdown, then sends that entire markdown context.
-- **LlamaCloud basic/premium** does the same with LlamaCloud parser output.
-
-This is the “brute force” approach: give the model everything and ask it to answer.
-
-### B. Agentic, with context management
-
-SurfSense does not pass the full PDF into the prompt for every query. Instead, the document is ingested once, chunked/indexed, and then the agent retrieves/selects relevant context during the answer flow.
-
-This should normally:
-
-- reduce context overflow risk,
-- reduce per-question prompt size,
-- make the system usable on very long corpora,
-- but potentially lose accuracy when the needed evidence is hard to retrieve.
-
-The expected trade-off was:
-
-> SurfSense may score lower than ideal full-context methods, but should remain cheaper and more robust as documents get longer.
-
-That is mostly what the experiment showed.
-
----
-
-## 3. Dataset and Scope
-
-### Dataset
-
-The dataset was **MMLongBench-Doc**, a benchmark of long multimodal documents with question-answer pairs over PDFs.
-
-### Scope
-
-We selected the first 30 PDFs from the local MMLongBench-Doc document ordering and evaluated all answerable questions attached to those PDFs.
-
-- **PDFs:** 30
-- **Total questions in those PDFs:** 225
-- **Answerable questions used:** 171
-- **Unanswerable / `None` probes skipped:** 54
-
-Answer format distribution among the 171 answerable questions:
-
-| Answer format | Count |
-|---|---:|
-| `str` | 61 |
-| `int` | 57 |
-| `list` | 32 |
-| `float` | 21 |
-
-### Documents
-
-The 30 PDFs covered a wide spread:
-
-- short survey/poll PDFs,
-- arXiv-style research papers,
-- product/catalog PDFs,
-- prospectuses,
-- annual reports / financial filings,
-- very large image-rich PDFs.
-
-Important long or failure-prone PDFs:
-
-| PDF | Pages | Notes |
-|---|---:|---|
-| `2309.17421v2.pdf` | 166 | 43.6MB, image-heavy; one of the slowest SurfSense ingests |
-| `3M_2018_10K.pdf` | 160 | huge markdown extraction; LlamaCloud premium produced ~908k chars |
-| `2311.16502v3.pdf` | 117 | many transient request failures |
-| `2307.09288v2.pdf` | 77 | several transient request failures |
-| `2405.09818v1.pdf` | 27 | native PDF exceeded a hard provider message-size limit |
-
----
-
-## 4. Experimental Arms
-
-All answer-generation arms used:
-
-```text
-anthropic/claude-sonnet-4.5
-```
-
-### 4.1 `native_pdf`
-
-The PDF was attached directly to the OpenRouter chat-completions request using the native PDF file path. The model was asked to answer the question from the attached PDF.
-
-This arm has no preprocessing cost, but it pays the PDF/token cost repeatedly for every question.
-
-### 4.2 `azure_basic_lc`
-
-The PDF was parsed with Azure Document Intelligence in **basic** mode.
-
-Backend-equivalent mode:
-
-```text
-processing_mode=basic
-Azure model=prebuilt-read
-```
-
-The resulting markdown was passed fully into the LLM prompt for every question against that PDF.
-
-### 4.3 `azure_premium_lc`
-
-The PDF was parsed with Azure Document Intelligence in **premium** mode.
-
-Backend-equivalent mode:
-
-```text
-processing_mode=premium
-Azure model=prebuilt-layout
-```
-
-The resulting markdown was passed fully into the LLM prompt.
-
-### 4.4 `llamacloud_basic_lc`
-
-The PDF was parsed with LlamaCloud in basic mode.
-
-Backend-equivalent mode:
-
-```text
-processing_mode=basic
-LlamaCloud parse_mode=parse_page_with_llm
-```
-
-The extracted markdown was passed fully into the prompt.
-
-### 4.5 `llamacloud_premium_lc`
-
-The PDF was parsed with LlamaCloud in premium mode.
-
-Backend-equivalent mode:
-
-```text
-processing_mode=premium
-LlamaCloud parse_mode=parse_page_with_agent
-```
-
-The extracted markdown was passed fully into the prompt.
-
-### 4.6 `surfsense_agentic`
-
-SurfSense ingested the PDFs first, then the harness queried:
-
-```text
-POST /api/v1/new_chat
-```
-
-with the relevant document mentioned/scoped for that question.
-
-Unlike the full-context arms, SurfSense did not put the entire document into the prompt. The system relied on SurfSense’s existing agentic context-management and retrieval flow to pull relevant chunks.
-
----
-
-## 5. Ingestion and Run Notes
-
-### SurfSense ingestion
-
-The initial SurfSense ingest tried to upload the 30 PDFs with batch size 3. This timed out during the large `2309.17421v2.pdf` processing step:
-
-```text
-DocumentProcessingTimeout: Timed out after 1800s waiting for documents
-(still pending/processing: [7589])
-```
-
-The backend did not actually fail permanently. Celery continued processing the large PDF, and eventually completed it:
-
-```text
-Vision LLM described 414 image(s) in 2309.17421v2.pdf
-Document indexed successfully ... doc_id=7589 chunk_count=2093
-Task completed successfully for: 2309.17421v2.pdf
-```
-
-To recover cleanly, ingestion was resumed with:
-
-```text
---upload-batch-size 1
-```
-
-This gave each PDF its own 30-minute wait budget. After the resume:
-
-```text
-ready: 30
-```
-
-All 30 PDFs were available in SurfSense.
-
-### Parser extraction
-
-The direct parser-comparison ingest completed successfully:
-
-```text
-30 PDFs × 4 parser/mode combinations = 120 extractions
-0 extraction failures
-```
-
-The largest extracted markdowns came from `3M_2018_10K.pdf`:
-
-| Arm | Largest extraction | PDF |
-|---|---:|---|
-| Azure basic | 578,987 chars | `3M_2018_10K.pdf` |
-| Azure premium | 688,902 chars | `3M_2018_10K.pdf` |
-| LlamaCloud basic | 733,194 chars | `3M_2018_10K.pdf` |
-| LlamaCloud premium | 908,733 chars | `3M_2018_10K.pdf` |
-
-The LlamaCloud premium extraction for the 3M filing was estimated at roughly 227k tokens, which is above a typical 200k-token context window. That is an important warning sign for full-context architectures.
-
----
-
-## 6. Cost Model
-
-The experiment included:
-
-1. **LLM inference cost** for OpenRouter-powered arms.
-2. **Preprocessing cost** for parser-based arms.
-3. **SurfSense preprocessing cost** for the agentic arm.
-
-The preprocessing tariff used (source: [`runner.py:74-77`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/parser_compare/runner.py#L74-L77), with per-arm mapping at [`runner.py:89-101`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/parser_compare/runner.py#L89-L101) and the `$/Q` overlay at [`runner.py:725-747`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/parser_compare/runner.py#L725-L747)):
-
-| Mode | Cost |
-|---|---:|
-| Basic | $1 / 1000 pages |
-| Premium | $10 / 1000 pages |
-
-Across the 30 PDFs, the total page count was:
-
-```text
-1,158 pages
-```
-
-Therefore:
-
-| Tier | Preprocessing cost |
-|---|---:|
-| Basic | $1.158 |
-| Premium | $11.580 |
-
-SurfSense LLM cost was measured separately.
-
-The `/api/v1/new_chat` SSE stream does not surface per-call token usage to the evaluation harness, so the auto-generated report writes `LLM $/Q = $0.0000` for the SurfSense arm. The actual cost was reconstructed from the backend's `billable_call` ledger after the run:
-
-```text
-SurfSense LLM cost / question (measured): $0.015 (avg)
-SurfSense LLM cost (n=171 run total):     $2.57
-```
-
-That figure covers all internal LLM calls the agent issues per question (planner / reader / final answer). It is what the cost tables in this report use everywhere `surfsense_agentic` LLM cost is shown.
-
-The SurfSense preprocessing cost is included as `$11.58`, because the documents were ingested with premium processing (Azure Document Intelligence `prebuilt-layout`) plus vision LLM (`anthropic/claude-sonnet-4.5`) for image-content extraction.
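-
-The per-question figures used in the cost tables follow from this tariff by amortizing the one-off parsing bill over the 171 questions. A minimal sketch of that arithmetic, assuming the page and question counts stated above; the function and constant names are illustrative and are not taken from `runner.py`:
-
-```python
-from typing import Optional
-
-PAGES = 1158       # total pages across the 30 PDFs
-QUESTIONS = 171    # answerable questions in the run
-TARIFF_PER_1000_PAGES = {"basic": 1.0, "premium": 10.0}  # USD
-
-
-def preprocess_cost_per_question(tier: str) -> float:
-    """Amortize the one-off parsing cost over every question in the run."""
-    total = PAGES / 1000 * TARIFF_PER_1000_PAGES[tier]
-    return total / QUESTIONS
-
-
-def total_cost_per_question(llm_cost_per_q: float, tier: Optional[str]) -> float:
-    """LLM cost plus amortized preprocessing; tier=None means no parser."""
-    pre = preprocess_cost_per_question(tier) if tier else 0.0
-    return llm_cost_per_q + pre
-
-
-print(round(preprocess_cost_per_question("basic"), 4))      # basic tier $/Q
-print(round(preprocess_cost_per_question("premium"), 4))    # premium tier $/Q
-print(round(total_cost_per_question(0.0150, "premium"), 4)) # SurfSense total $/Q
-```
-
-Because preprocessing is paid once per document, the amortized `Preproc $/Q` figure shrinks as more questions are asked against the same corpus, while the LLM `$/Q` of the full-context arms does not.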
-
----
-
-## 7. Main Results
-
-### 7.1 Raw accuracy and cost
-
-| Arm | Accuracy | Wilson 95% CI | F1 mean | Mean input tokens | Mean output tokens | LLM $/Q | Preprocess $/Q | Total $/Q | Latency p50 | Latency p95 |
-|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
-| `native_pdf` | 48.0% (82/171) | 40.6–55.4% | 50.4% | 65,773 | 209 | $0.2552 | $0.0000 | $0.2552 | 29.5s | 60.5s |
-| `azure_basic_lc` | 54.4% (93/171) | 46.9–61.7% | 56.6% | 31,883 | 250 | $0.0994 | $0.0068 | $0.1062 | 7.1s | 12.0s |
-| `azure_premium_lc` | 56.7% (97/171) | 49.2–63.9% | 59.6% | 39,787 | 223 | $0.1373 | $0.0677 | $0.2051 | 6.9s | 11.6s |
-| `llamacloud_basic_lc` | 50.3% (86/171) | 42.9–57.7% | 53.2% | 31,493 | 243 | $0.0981 | $0.0068 | $0.1049 | 7.1s | 11.9s |
-| `llamacloud_premium_lc` | **58.5%** (100/171) | 51.0–65.6% | **61.1%** | 39,131 | 228 | $0.1208 | $0.0677 | $0.1885 | 6.8s | 12.7s |
-| `surfsense_agentic` | 53.2% (91/171) | 45.7–60.5% | 54.3% | n/a* | n/a* | **$0.0150** | $0.0677 | **$0.0827** | 52.8s | 164.1s |
-
-*\*The SurfSense `/api/v1/new_chat` SSE stream does not currently surface prompt/completion token counts to the harness, so per-call token figures are recorded as `n/a`. The `LLM $/Q` value of `$0.0150` is the average measured from the backend's billable-call ledger across the 171 questions.*
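-
-The Wilson 95% CI column in the table above can be reproduced from the correct/total counts. A minimal sketch, assuming the standard Wilson score interval with z ≈ 1.96; the harness's own implementation is not shown in this report, and `wilson_interval` is an illustrative name:
-
-```python
-import math
-
-
-def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
-    """Wilson score interval for a binomial proportion."""
-    p = successes / n
-    denom = 1 + z * z / n
-    center = (p + z * z / (2 * n)) / denom
-    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
-    return center - half, center + half
-
-
-# native_pdf: 82 correct out of 171
-low, high = wilson_interval(82, 171)
-print(f"{low:.1%} to {high:.1%}")
-```
-
-With intervals roughly ±7 percentage points wide at n = 171, most adjacent arms overlap.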
-
-### 7.2 Accuracy by answer type
-
-| Arm | Float | Int | List | String |
-|---|---:|---:|---:|---:|
-| `native_pdf` | 62% (13/21) | 39% (22/57) | 31% (10/32) | 61% (37/61) |
-| `azure_basic_lc` | 52% (11/21) | 53% (30/57) | 44% (14/32) | 62% (38/61) |
-| `azure_premium_lc` | 62% (13/21) | **56%** (32/57) | 41% (13/32) | 64% (39/61) |
-| `llamacloud_basic_lc` | 62% (13/21) | 47% (27/57) | 38% (12/32) | 56% (34/61) |
-| `llamacloud_premium_lc` | **71%** (15/21) | 49% (28/57) | 47% (15/32) | **69%** (42/61) |
-| `surfsense_agentic` | 67% (14/21) | 44% (25/57) | **53%** (17/32) | 57% (35/61) |
-
-Notable pattern:
-
-- LlamaCloud premium was strongest on `float` and `string` answers.
-- Azure premium was strongest on `int` answers.
-- SurfSense was strongest on `list` answers.
-
-This is product-relevant: list answers usually require gathering multiple facts. SurfSense's agentic retrieval did better there than every full-context arm.
-
-### 7.3 Statistical significance: McNemar pairwise tests
-
-Accuracy differences at n = 171 are not automatically meaningful. For every pair of arms, we compare correctness on the same 171 questions and run a two-sided **exact-binomial McNemar test** on the discordant pairs.
-
-For each ordered pair `(i, j)`, with the post-retry rows:
-
-- `b = #{q : i correct, j wrong}`
-- `c = #{q : i wrong,   j correct}`
-- under H0, `b ~ Binomial(b + c, 0.5)`,
-- two-sided p-value: `P(X ≤ min(b, c)) + P(X ≥ max(b, c))` computed exactly.
-
-(Implementation: [`compute_blog_extras.py:80-99`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/scripts/compute_blog_extras.py#L80-L99) for the exact-binomial p-value, [`compute_blog_extras.py:102-141`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/scripts/compute_blog_extras.py#L102-L141) for the pairwise table builder. Pure stdlib `math.comb`, no scipy.)
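-
-A minimal sketch of the exact two-sided p-value defined above, written independently of the linked `compute_blog_extras.py` (the function name `mcnemar_exact` is illustrative). It relies on the symmetry of Binomial(n, 0.5), so doubling the lower tail equals the two-tail sum, and it caps the result at 1 for the `b = c` case:
-
-```python
-from math import comb
-
-
-def mcnemar_exact(b: int, c: int) -> float:
-    """Two-sided exact-binomial McNemar p-value from discordant counts."""
-    n = b + c
-    if n == 0:
-        return 1.0  # no discordant pairs: no evidence either way
-    k = min(b, c)
-    # P(X <= min(b, c)) under X ~ Binomial(n, 0.5)
-    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
-    # By symmetry, P(X >= max(b, c)) equals the lower tail.
-    return min(1.0, 2 * lower_tail)
-
-
-# azure_premium_lc vs llamacloud_basic_lc: b = 20, c = 7
-print(round(mcnemar_exact(20, 7), 4))
-```
-
-Only the discordant counts `b` and `c` enter the test; the "both ok" and "both wrong" columns do not affect the p-value.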
-
-**Pairwise McNemar table (post-retry, sorted by p-value):**
-
-| arm i | arm j | b (i only) | c (j only) | both ok | both wrong | p (2-sided) | sig |
-|---|---|---:|---:|---:|---:|---:|---|
-| `azure_premium_lc` | `llamacloud_basic_lc` | 20 | 7 | 80 | 64 | **0.0192** | * |
-| `llamacloud_basic_lc` | `llamacloud_premium_lc` | 12 | 27 | 75 | 57 | **0.0237** | * |
-| `llamacloud_premium_lc` | `native_pdf` | 23 | 10 | 79 | 59 | **0.0351** | * |
-| `azure_premium_lc` | `native_pdf` | 20 | 9 | 80 | 62 | 0.0614 | (·) |
-| `llamacloud_premium_lc` | `surfsense_agentic` | 24 | 13 | 78 | 56 | 0.0989 | (·) |
-| `azure_basic_lc` | `llamacloud_premium_lc` | 10 | 19 | 83 | 59 | 0.1360 | |
-| `azure_premium_lc` | `surfsense_agentic` | 21 | 12 | 79 | 59 | 0.1628 | |
-| `azure_basic_lc` | `azure_premium_lc` | 8 | 15 | 85 | 63 | 0.2100 | |
-| `azure_basic_lc` | `llamacloud_basic_lc` | 20 | 14 | 73 | 64 | 0.3915 | |
-| `azure_basic_lc` | `native_pdf` | 18 | 14 | 75 | 64 | 0.5966 | |
-| `llamacloud_basic_lc` | `surfsense_agentic` | 17 | 21 | 70 | 63 | 0.6271 | |
-| `azure_premium_lc` | `llamacloud_premium_lc` | 11 | 13 | 89 | 58 | 0.8388 | |
-| `azure_basic_lc` | `surfsense_agentic` | 20 | 18 | 73 | 60 | 0.8714 | |
-| `llamacloud_basic_lc` | `native_pdf` | 20 | 22 | 67 | 62 | 0.8776 | |
-| `native_pdf` | `surfsense_agentic` | 23 | 25 | 66 | 57 | 0.8854 | |
-
-`*`: p < 0.05. `(·)`: p < 0.10 (suggestive but not conclusive).
-
-What this table tells the reader at a glance:
-
-1. **Three pairs reach α = 0.05.** Both premium-LC arms beat `llamacloud_basic_lc`, and `llamacloud_premium_lc` beats `native_pdf`. Everything else is noise at this n.
-2. **Premium vs. basic *within Azure* is not significant** (p = 0.21). At n = 171 we cannot conclude `azure_premium_lc` (58.5%) is meaningfully better than `azure_basic_lc` (54.4%). This matters for cost-sensitive workloads — the 10× preprocessing tariff for Azure premium is buying a noisy gain.
-3. **`azure_basic_lc` vs `surfsense_agentic`: p = 0.87.** Effectively the same accuracy on this sample. The product story for SurfSense is therefore not "we're as accurate as the *best* arm" but "we're indistinguishable from a reasonable parser-stuffing arm at a fraction of the cost".
-4. **`llamacloud_basic_lc` vs `native_pdf`: p = 0.88.** Statistically indistinguishable. The 2.3pp gap visible in the headline table (1.1pp in the opposite direction after retries) is within sampling noise.
-5. **`llamacloud_premium_lc` vs `surfsense_agentic`: p = 0.099.** The flagship LC arm's 6.4pp accuracy advantage over SurfSense is *suggestive* but does not pass α = 0.05 — readers should not write headlines about a "definitive accuracy gap" between full-context premium and SurfSense. With more data this likely becomes significant; at n = 171 it does not.
-
-**Multiple-comparison note.** With 15 pairs and α = 0.05, you'd expect ~0.75 false positives by chance if no arm truly differed. Holm correction at family-wise α = 0.05 tests the most significant pair (`azure_premium_lc > llamacloud_basic_lc`, p = 0.019) against α/15 ≈ 0.0033, which it does not pass. So at strict family-wise control, *no* pair is significant; the three single-comparison-significant pairs above should be reported as "directional, single-comparison significant".
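-
-The Holm step-down check can be sketched as follows, using the 15 p-values from the pairwise table. This is an illustrative implementation of the standard procedure, not code from the repository; `holm_reject` is a hypothetical name:
-
-```python
-def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
-    """Holm step-down: compare the k-th smallest p-value to alpha / (m - k)."""
-    m = len(p_values)
-    order = sorted(range(m), key=lambda i: p_values[i])
-    reject = [False] * m
-    for rank, idx in enumerate(order):
-        if p_values[idx] <= alpha / (m - rank):
-            reject[idx] = True
-        else:
-            break  # once one test fails, all larger p-values fail too
-    return reject
-
-
-# p-values copied from the pairwise McNemar table
-p_values = [0.0192, 0.0237, 0.0351, 0.0614, 0.0989, 0.1360, 0.1628, 0.2100,
-            0.3915, 0.5966, 0.6271, 0.8388, 0.8714, 0.8776, 0.8854]
-print(any(holm_reject(p_values)))
-```
-
-The procedure stops at the first comparison because 0.0192 exceeds 0.05 / 15, so no pair is rejected.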
-
-### 7.4 Latency and request-size distributions
-
-**Latency per arm (seconds, post-retry):**
-
-| Arm | n | mean | std | p50 | p90 | p95 | p99 | max | CV |
-|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
-| `azure_premium_lc` | 171 | 7.4 | 2.7 | 7.0 | 10.6 | 11.6 | 13.5 | 24.6 | 0.37 |
-| `llamacloud_basic_lc` | 171 | 7.5 | 2.4 | 7.1 | 11.3 | 11.9 | 13.7 | 14.4 | 0.32 |
-| `azure_basic_lc` | 171 | 7.5 | 2.8 | 7.1 | 11.1 | 11.9 | 14.4 | 25.2 | 0.37 |
-| `llamacloud_premium_lc` | 171 | 7.6 | 3.1 | 6.9 | 11.4 | 12.7 | 15.5 | 29.4 | 0.41 |
-| `native_pdf` | 164 | 32.1 | 18.8 | 33.0 | 54.2 | 64.5 | 92.2 | 110.6 | 0.58 |
-| `surfsense_agentic` | 171 | 67.5 | 44.1 | 52.8 | 126.0 | 160.6 | 206.2 | 328.7 | 0.65 |
-
-(`native_pdf` n is 164 because 7 hard-failed rows have latency = 0; CV = std/mean is the dimensionless tail-fatness.)
-
-**Three operational observations:**
-
-1. **The four LC arms are essentially indistinguishable on latency** (p50 7 s, p95 12 s, CV ~0.35). The model dominates the budget; the parser doesn't.
-2. **Native_pdf is 4–5× slower at p50 and 5–6× slower at p95** because each call uploads the base64-inflated PDF and waits for the provider's PDF parser before generation starts.
-3. **SurfSense is 7–9× the LC arm latency at p50 and 13× at p95.** This is the agent-loop tax: SurfSense executes multiple internal LLM hops (retrieval planning, tool calls, final answer) per question. The CV of 0.65 means *some* questions take much longer — the p99 of 206 s is the practical "long-tail" budget you need to plan for if you build a SurfSense-style UI. For a synchronous chat experience this is acceptable; for a sub-second autocomplete it is not.
-
-**Input-token distribution (post-retry):**
-
-| Arm | mean | p50 | p95 | max |
-|---|---:|---:|---:|---:|
-| `azure_basic_lc` | 32,570 | 22,208 | 117,430 | 140,543 |
-| `llamacloud_basic_lc` | 32,098 | 21,622 | 103,914 | 163,246 |
-| `azure_premium_lc` | 41,366 | 26,472 | 133,647 | 207,958 |
-| `llamacloud_premium_lc` | 41,574 | 25,914 | 139,289 | 177,509 |
-| `native_pdf` | 84,657 | 59,883 | 259,136 | 390,267 |
-
-Three things worth flagging for the writer:
-
-- **Premium parsers extract ~30% more tokens than basic parsers.** That's the "tables and figures rendered as text" tax. It explains both the higher accuracy and the higher LLM input cost.
-- **Native_pdf reports 2× the input tokens of any LC arm.** The provider's PDF parser inserts page metadata, image-embedding tokens, and per-page positional context. The model is paying input-token cost for richer (but apparently less useful) information than what parsers produce. This corroborates the accuracy ranking: more raw bytes ≠ better answers.
-- **SurfSense doesn't appear** in this table because the SSE stream does not surface token counts. From the backend ledger, SurfSense's agent loop runs at ~5–15K input tokens per *internal hop*, with 2–4 hops per question — total per-question input is roughly an order of magnitude below the LC arms.
-
-### 7.5 Per-PDF accuracy heterogeneity
-
-Per-arm distribution of accuracy *across the 30 PDFs* (each PDF contributes mean correctness over its 4–8 questions):
-
-| Arm | n PDFs | mean | std | min | p25 | p50 | p75 | max | #PDFs at 0% | #PDFs at 100% |
-|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
-| `llamacloud_premium_lc` | 30 | 59.8% | 21.1% | 16.7% | 50.0% | 58.6% | 71.4% | 100.0% | 0 | **3** |
-| `azure_premium_lc` | 30 | 58.0% | 24.6% | 0.0% | 40.0% | 58.6% | 78.8% | 100.0% | 1 | 2 |
-| `azure_basic_lc` | 30 | 55.0% | 20.4% | 14.3% | 44.6% | 50.0% | 66.7% | 100.0% | 0 | 1 |
-| `surfsense_agentic` | 30 | 53.1% | 22.7% | 0.0% | 33.3% | 50.0% | 66.7% | 100.0% | 1 | 2 |
-| `native_pdf` | 30 | 51.1% | 24.8% | 0.0% | 35.0% | 50.0% | 70.2% | 85.7% | **3** | 0 |
-| `llamacloud_basic_lc` | 30 | 49.5% | 23.3% | 0.0% | 33.3% | 50.0% | 66.7% | 83.3% | 2 | 0 |
-
-Two product-relevant takeaways:
-
-1. **All arms have high per-PDF variance** (std 20–25 percentage points). PDF identity matters more than arm identity for any single document. A blog claim like "premium parsing improves accuracy" is true on average but does not transfer to a guarantee on any one PDF.
-2. **`llamacloud_premium_lc` has zero PDFs at 0% accuracy** (as does `azure_basic_lc`) *and* the most PDFs at 100% (3), which makes it the most consistent arm by this measure. `native_pdf` has zero perfect PDFs (as does `llamacloud_basic_lc`) and the most PDFs at 0% (3), confirming that its operational fragility doesn't only hit specific *questions*: it can wipe out entire documents.
-
----
-
-## 8. Raw vs Adjusted Accuracy
-
-The raw benchmark includes transient provider/network errors. For a blog post, it is useful to separate:
-
-- **raw reliability**: what actually happened in the run,
-- **intrinsic QA quality**: what the arm likely scores if transient network failures are retried.
-
-We classified transient failures as:
-
-- SSL bad-record-mac errors,
-- provider 502/503 errors,
-- empty response streams,
-- mid-stream JSON decode errors.
-
-We classified intrinsic failures as:
-
-- hard provider size limits,
-- context-window limits,
-- PDF/image decode failures.
-
-Adjusted accuracy removes transient failures from the denominator.
-
-| Arm | Raw accuracy | Transient failures | Intrinsic failures | Adjusted accuracy |
-|---|---:|---:|---:|---:|
-| `native_pdf` | 48.0% | 26 | 1 | 56.6% |
-| `azure_basic_lc` | 54.4% | 1 | 0 | 54.7% |
-| `azure_premium_lc` | 56.7% | 3 | 0 | 57.7% |
-| `llamacloud_basic_lc` | 50.3% | 2 | 0 | 50.9% |
-| `llamacloud_premium_lc` | 58.5% | 4 | 0 | 59.9% |
-| `surfsense_agentic` | 53.2% | **0** | **0** | 53.2% |
-
-Interpretation:
-
-- If we ignore transient failures, native PDF improves from 48.0% to 56.6%.
-- But this does not erase the operational problem: native PDF had many more runtime failures than every other arm.
-- SurfSense’s adjusted and raw accuracy are identical because it had zero failures.
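-
-The adjustment only changes the denominator. A minimal sketch, assuming the correct counts from §7.1 and the transient-failure counts from the table above (`adjusted_accuracy` is an illustrative name):
-
-```python
-def adjusted_accuracy(correct: int, total: int, transient_failures: int) -> float:
-    """Accuracy over the questions that were not lost to transient errors."""
-    return correct / (total - transient_failures)
-
-
-# native_pdf: 82 correct, 171 questions, 26 transient failures
-print(f"{adjusted_accuracy(82, 171, 26):.1%}")
-# surfsense_agentic: 91 correct, no failures, so raw and adjusted match
-print(f"{adjusted_accuracy(91, 171, 0):.1%}")
-```
-
-Intrinsic failures stay in the denominator, which is why the single hard size-limit failure on `native_pdf` still counts against it.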
-
----
-
-## 9. Error Analysis
-
-### 9.1 Failure count by arm
-
-| Arm | Questions | Failures | Failure rate |
-|---|---:|---:|---:|
-| `native_pdf` | 171 | 27 | **15.8%** |
-| `llamacloud_premium_lc` | 171 | 4 | 2.3% |
-| `azure_premium_lc` | 171 | 3 | 1.8% |
-| `llamacloud_basic_lc` | 171 | 2 | 1.2% |
-| `azure_basic_lc` | 171 | 1 | 0.6% |
-| `surfsense_agentic` | 171 | **0** | **0.0%** |
-
-### 9.2 Failure causes
-
-Most failures were not “the model answered incorrectly.” They were runtime/provider failures.
-
-#### Native PDF failures
-
-Native PDF had 27 failures:
-
-| Failure type | Count | Meaning |
-|---|---:|---|
-| SSL / transient request errors | 21 | Transport instability while sending large payloads |
-| Empty response | 5 | Stream ended without usable answer |
-| Provider 502 | 1 | OpenRouter / upstream gateway error |
-| Hard 30MB message-size limit | 1 | Intrinsic payload-size limit |
-
-The buckets sum to 28 rather than 27 because of overlap in how raw error strings were bucketed, but the operational takeaway is clear:
-
-> Native PDF attachment created the most fragile request shape. It repeatedly sent large binary/base64 payloads and was much more exposed to transport and provider-size failures.
-
-The clearest intrinsic hard failure occurred on:
-
-```text
-2405.09818v1.pdf::Q007
-```
-
-PDF details:
-
-```text
-PDF: 2405.09818v1.pdf
-Pages: 27
-PDF size: 24.1MB
-Estimated base64 wire size: ~32.0MB
-```
-
-Provider error:
-
-```text
-The message size (33657603 bytes) exceeds 30.000MB limit.
-```
-
-This is a strong example for the blog:
-
-> A PDF can look moderate by page count, but still exceed native attachment limits because file upload payloads inflate on the wire.
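-
-The inflation comes from base64 encoding, which emits 4 output bytes for every 3 input bytes. A rough pre-flight check can be sketched as below; the helper names are illustrative, the limit is passed in because the report does not say whether the provider's "MB" is decimal or binary, and a real request also carries a JSON envelope and prompt text on top of the encoded file:
-
-```python
-import base64
-
-
-def base64_encoded_size(raw_bytes: int) -> int:
-    """Length of standard padded base64 output for raw_bytes input bytes."""
-    return 4 * ((raw_bytes + 2) // 3)
-
-
-def exceeds_wire_limit(raw_bytes: int, limit_bytes: int) -> bool:
-    """True if the encoded file alone is already over the provider limit."""
-    return base64_encoded_size(raw_bytes) > limit_bytes
-
-
-# Sanity check of the size formula against the standard library.
-sample = b"x" * 1000
-assert base64_encoded_size(len(sample)) == len(base64.b64encode(sample))
-
-# A ~24.1 MB file against a 30 MB limit (decimal MB assumed for illustration).
-print(exceeds_wire_limit(int(24.1 * 1024 * 1024), 30 * 1000 * 1000))
-```
-
-A check like this could skip or reroute oversized PDFs before the request is sent, instead of discovering the limit from a provider error.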
-
-#### Full-context parser arm failures
-
-The four parser-stuffing arms had only 10 combined failures across 684 calls:
-
-| Arm | Failures | Main cause |
-|---|---:|---|
-| Azure basic LC | 1 | SSL transient |
-| Azure premium LC | 3 | SSL transient |
-| LlamaCloud basic LC | 2 | SSL transient |
-| LlamaCloud premium LC | 4 | SSL transient |
-
-These failures were all classified as transient TLS/network errors:
-
-```text
-SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac
-```
-
-They would likely be mitigated by adding retries with exponential backoff in the evaluation harness.
-
-#### These are transport-layer failures, not context-window overflows
-
-A natural intuition is: *"the long-context arms must be hitting Sonnet 4.5's 200K context window, while SurfSense doesn't because it stores the data and retrieves chunks."* The data does not support that. We tested the hypothesis directly with `scripts/test_context_overflow_hypothesis.py` and found:
-
-**(1) Zero literal context-overflow errors in the LC arms.** No `context_length_exceeded`, no `prompt is too long`, no `maximum context length`. The only literal payload-limit error in the entire run was on `native_pdf` — a 30 MB *wire-size* limit, not a token-window limit:
-
-```text
-The message size (33657603 bytes) exceeds 30.000MB limit.
-```
-
-**(2) Failed requests were larger on average, but the largest successful requests were larger still.** If failures were caused by hitting the model's context window, failed payloads should cluster at or above the window cap (~800K chars ≈ 200K tokens) and successful payloads should sit below them. They do not. In every LC arm, the largest payload that *succeeded* was meaningfully bigger than the largest payload that *failed*:
-
-| Arm | Max OK (chars / ~tokens) | Max FAIL (chars / ~tokens) |
-|---|---:|---:|
-| `azure_basic_lc` | 578,987 / ~145K | 412,474 / ~103K |
-| `azure_premium_lc` | 688,902 / ~172K | 439,469 / ~110K |
-| `llamacloud_basic_lc` | 733,194 / ~183K | 298,961 / ~75K |
-| `llamacloud_premium_lc` | **908,733 / ~227K** | 448,633 / ~112K |
-
-If the model were rejecting requests for being too long, max-OK could not exceed max-FAIL. So the model is not the bottleneck.
-
-**(3) The known overflow candidate succeeded.** `3M_2018_10K.pdf` parsed to 908K chars (~227K tokens) under `llamacloud_premium` — *over* Sonnet 4.5's 200K input window. Yet all four of its questions completed without a transport error (the model presumably truncated silently; one of the four was wrong, three correct). This is the opposite of what a true context-overflow theory would predict.
-
-**Conclusion.** The LC arms did not fail because the model rejected oversized prompts. They failed because the *eval harness* sent 100–500KB Markdown bodies repeatedly over public-internet TLS to OpenRouter, where SSL renegotiations, gateway timeouts, and brief upstream stalls become statistically inevitable. Every LC failure in this run is consistent with that — `SSLV3_ALERT_BAD_RECORD_MAC`, empty SSE streams, 502s. The intuition that "SurfSense survives because it bounds context" is correct, but for a different reason than expected: SurfSense survives because **it doesn't put 100–500KB on the wire in the first place**, not because the model would otherwise reject the prompt.
-
-#### SurfSense failures: zero — but that number deserves a footnote
-
-SurfSense reported `0 failures / 171 questions` to the eval harness. This is the most important operational result, but it is worth being precise about *why*, because the mechanism is partly architectural rather than purely "better RAG":
-
-1. **The harness call goes to `http://localhost:8000`, not over public internet.** All transport-class failures that hammered the LC arms (TLS renegotiation, intermediate proxy resets, OpenRouter gateway 502s) are simply not reachable over a loopback HTTP connection. SurfSense was not "asked to survive" the same network path the LC arms had to survive.
-2. **The backend retries internal LLM calls.** SurfSense's `/api/v1/new_chat` wraps every internal LLM hop in `RetryAfterMiddleware` (exponential backoff on 5xx, SSL errors, rate limits — see [`retry_after.py:113-179`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_backend/app/agents/new_chat/middleware/retry_after.py#L113-L179) for the backoff calculation and retry-decision logic). Failures the LC arms surfaced as fatal would have been silently retried inside SurfSense and never reached the harness.
-3. **SurfSense's outbound prompt is small.** The retrieval pipeline produces prompts in the 5–15K token range, not 100–500KB Markdown blobs, so even if SurfSense's calls *were* over public TLS, they would land in the size class where transient transport errors are far rarer.
-
-In other words, "0 failures" is the joint result of three things — agentic retrieval bounding the payload, a robust internal retry layer, and a localhost call shape — and not a claim that the underlying model never erred on SurfSense's behalf.
-
-What SurfSense *did* successfully handle end-to-end, without context overflow, request-size failures, or any error reaching the harness:
-
-- all 30 PDFs,
-- the 166-page `2309.17421v2.pdf`,
-- the 160-page `3M_2018_10K.pdf` (the same document where one LC arm pushed 227K tokens at the model and still got mostly-correct answers),
-- image-heavy PDFs,
-- long financial/report-style PDFs,
-- all question formats.
-
-### 9.3 PDFs with the most failures
-
-| PDF | Pages | Failures | Affected arms | Cause |
-|---|---:|---:|---|---|
-| `2311.16502v3.pdf` | 117 | 9 | Native, Azure premium, LlamaCloud basic/premium | SSL transient |
-| `2309.17421v2.pdf` | 166 | 8 | Native, Azure basic/premium | SSL, empty stream, 502 |
-| `2405.09818v1.pdf` | 27 | 6 | Native only | empty stream, SSL, 30MB size limit |
-| `2307.09288v2.pdf` | 77 | 5 | Native, LlamaCloud premium | SSL transient |
-| `05-03-18-political-release.pdf` | 17 | 2 | Native only | SSL transient |
-
-The failure distribution shows two different classes of problems:
-
-1. **Large/complex documents stress providers and transports.**
-2. **Native PDF attachment is especially sensitive to file size and binary payload limits.**
-
-### 9.4 Retry experiment: are these failures transient or intrinsic?
-
-To pressure-test the transport-layer hypothesis directly, we re-ran *only* the 37 failed `(arm, qid)` pairs through the same providers, with up to 5 attempts each, exponential backoff (base 1 s, max 30 s, jitter), and concurrency 2. The eval harness was not touched — same prompts, same cached PDFs, same cached parser markdown — only the request was retried. SurfSense was not retried (it had 0 failures and would otherwise have required spinning the backend back up). Failure detection (any row with `error` set OR empty `raw_text`) is at [`retry_failed_questions.py:99-111`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/scripts/retry_failed_questions.py#L99-L111); the per-row retry loop is at [`retry_failed_questions.py:260-304`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/scripts/retry_failed_questions.py#L260-L304).
-
-**Result (37 retries):**
-
-| Arm | Tried | Recovered | Still failed | Recovery rate |
-|---|---:|---:|---:|---:|
-| `azure_basic_lc` | 1 | 1 | 0 | **100.0%** |
-| `azure_premium_lc` | 3 | 3 | 0 | **100.0%** |
-| `llamacloud_basic_lc` | 2 | 2 | 0 | **100.0%** |
-| `llamacloud_premium_lc` | 4 | 4 | 0 | **100.0%** |
-| `native_pdf` | 27 | 15 | 12 | 55.6% |
-| **Total** | **37** | **25** | **12** | **67.6%** |
-
-Two findings, both consistent with §9.2's transport-layer story.
-
-**Finding 1: every long-context failure was transient.** All 10 LC failures across both parsers and both quality tiers recovered. If these had been context-window overflow errors disguised as SSL alerts, retrying the *same* prompt would not have fixed them. It did. This is the strongest evidence that the original LC failures were transport-layer artifacts of pushing 100–500 KB Markdown bodies repeatedly over public-internet TLS, not anything wrong with the prompts themselves.
-
-**Finding 2: nearly half of the native_pdf failures are intrinsic, not transient.** The 12 unrecovered native_pdf rows (of 27 retried) split cleanly into three buckets:
-
-| Bucket | Count | PDF | What's happening |
-|---|---:|---|---|
-| **30 MB hard wire-size limit** | 6 | `2405.09818v1.pdf` | Every retry returns the same `The message size (33657603 bytes) exceeds 30.000MB limit.` from Google. The base64-inflated payload is fundamentally above the provider's request-size cap. No amount of retrying helps. |
-| **Persistent empty SSE stream** | 5 | `2309.17421v2.pdf` (166 pages) | All 5 attempts return HTTP 200 but the response stream ends with no usable text. Probably the model is spending so long on the huge PDF that the upstream connection times out or is reset before any output token reaches the client. Effectively intrinsic at this provider/payload size. |
-| **502 on final attempt** | 1 | `2309.17421v2.pdf::Q003` | Earlier attempts got empty streams; final attempt got a 502. Borderline transient — could plausibly recover with more attempts — but at that point you're hammering the same fragile path. |
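-
-The first bucket is arithmetic rather than bad luck. As a rough sketch, assuming the request body is dominated by the base64-encoded PDF and that the cap means 30 × 2^20 bytes (the error message only says `30.000MB`, so the unit is an assumption), the largest raw PDF that can fit is about three quarters of the cap. `base64_encoded_size` and `max_raw_pdf_bytes` are illustrative helper names, not harness code.
-
-```python
-import math
-
-LIMIT_BYTES = 30 * 1024 * 1024  # assumption: "30.000MB" means 30 MiB
-
-
-def base64_encoded_size(raw_bytes: int) -> int:
-    """Length of the padded base64 encoding of `raw_bytes` bytes."""
-    return 4 * math.ceil(raw_bytes / 3)
-
-
-def max_raw_pdf_bytes(limit_bytes: int = LIMIT_BYTES, envelope_bytes: int = 0) -> int:
-    """Largest raw file whose base64 form plus request envelope fits under the limit."""
-    return ((limit_bytes - envelope_bytes) // 4) * 3
-
-
-reported = 33_657_603  # message size quoted in the provider error above
-print(f"raw PDF ceiling: {max_raw_pdf_bytes() / 1e6:.1f} MB")
-print(f"over the cap by: {(reported - LIMIT_BYTES) / 1e6:.1f} MB")
-```
-
-Under these assumptions a pre-flight size check on the raw file would flag this PDF before any request is sent, which is cheaper than discovering the limit through five failed attempts.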
-
-The 15 native_pdf rows that *did* recover all succeeded on **attempt 1**, never needing a second retry. That is exactly the signature of independent transient transport hiccups: the original call was unlucky, the next one was fine.
-
-**What this changes about the headline result.** With a basic retry policy in front of the harness, the corrected failure picture would be:
-
-| Arm | Reported failures (no retries) | Intrinsic failures (with retries) | Intrinsic failure rate |
-|---|---:|---:|---:|
-| `native_pdf` | 27 | **12** | 7.0% |
-| `azure_basic_lc` | 1 | 0 | 0.0% |
-| `azure_premium_lc` | 3 | 0 | 0.0% |
-| `llamacloud_basic_lc` | 2 | 0 | 0.0% |
-| `llamacloud_premium_lc` | 4 | 0 | 0.0% |
-| `surfsense_agentic` | 0 | 0 | 0.0% |
-
-So the retries don't change the *winners* — the LC arms still have the highest accuracy and SurfSense is still the cheapest — but they sharpen the contrast on robustness:
-
-> Once you account for retries, the four long-context arms and SurfSense all run at zero intrinsic failures across 171 questions. Native PDF attachment, even with 5-attempt exponential backoff, still has a **7% intrinsic failure rate**, dominated by a single PDF that exceeds the provider's 30 MB wire-size cap and a 166-page PDF whose response stream the provider can't reliably terminate.
-
-The retry artifact is committed at `data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_retries.jsonl` (+ `raw_retries_summary.json`) for anyone who wants to inspect attempt-by-attempt latencies and error strings.
-
-### 9.5 Final accuracy after retries
-
-Merging the 25 retry-recovered rows back into `raw.jsonl` (script: `scripts/compute_post_retry_accuracy.py`, merged artifact: `raw_post_retry.jsonl`) gives the final corrected per-arm accuracy table. This is the headline that the blog *would have* reported if the harness had had retries from day one.
-
-**Final accuracy (171 questions, 30 PDFs, all `anthropic/claude-sonnet-4.5`):**
-
-| Rank | Arm | Accuracy | F1 | Failures | Fail rate |
-|---:|---|---:|---:|---:|---:|
-| 1 | `llamacloud_premium_lc` | **59.6%** | **62.3%** | 0 | 0.0% |
-| 2 | `azure_premium_lc` | 58.5% | 61.3% | 0 | 0.0% |
-| 3 | `azure_basic_lc` | 54.4% | 56.6% | 0 | 0.0% |
-| 4 | `surfsense_agentic` | 53.2% | 54.3% | 0 | 0.0% |
-| 5 | `native_pdf` | 52.0% | 54.8% | **12** | **7.0%** |
-| 6 | `llamacloud_basic_lc` | 50.9% | 53.8% | 0 | 0.0% |
-
-**Pre- vs. post-retry deltas:**
-
-| Arm | Δ accuracy | Δ failures | Notes |
-|---|---:|---:|---|
-| `native_pdf` | **+4.1 pp** | **-15** | Largest gain; 15 of 27 originally-empty answers became real answers, several of them correct. Still has the 12 unrecoverable hard-limit / persistent-empty-stream failures. |
-| `azure_premium_lc` | +1.8 pp | -3 | All 3 transient failures recovered; all 3 recovered answers were correct (3 / 171 ≈ 1.8 pp). |
-| `llamacloud_premium_lc` | +1.2 pp | -4 | All 4 transient failures recovered; 2 were correct. |
-| `llamacloud_basic_lc` | +0.6 pp | -2 | Both transient failures recovered; 1 was correct. |
-| `azure_basic_lc` | +0.0 pp | -1 | The single retry recovered, but the recovered answer was wrong — so failure rate dropped without an accuracy lift. |
-| `surfsense_agentic` | +0.0 pp | 0 | Nothing to retry; SurfSense already had zero failures. |
-
-**Ranking changes:**
-
-- The top three are unchanged (`llamacloud_premium_lc` > `azure_premium_lc` > `azure_basic_lc`).
-- `native_pdf` moves up one spot (#6 → #5) by overtaking `llamacloud_basic_lc` (52.0% vs 50.9%). It still trails the other four arms and is the only arm with a non-zero intrinsic failure rate.
-- `surfsense_agentic` stays at #4 with the same 53.2% accuracy. With the four LC arms now also at 0 failures, the operational-robustness story shifts: SurfSense is no longer uniquely zero-failure, but it remains the cheapest arm at $0.0827 / Q while `llamacloud_premium_lc` ($0.1885 / Q) is now zero-failure too. The SurfSense pitch becomes "same robustness as the best full-context arm, at less than half the cost, with bounded prompts that don't truncate on long documents".
-
-**Cost note.** The cost numbers in §1 / §7 still reflect the *original* run. Merging the retry survivors adds a small amount of LLM spend: 25 extra successful OpenRouter calls (10 LC payloads and 15 larger native_pdf payloads), all of which succeeded on the first retry. The 12 unrecovered native_pdf rows never produced an answer, so they add nothing to the merged results. None of this changes the per-arm cost ranking or the SurfSense win on cost.
-
----
-
-## 10. What the Results Mean
-
-### 10.1 Native PDF is not a safe default
-
-Native PDF attachment is attractive because it skips preprocessing. But in this benchmark it had:
-
-- lowest raw accuracy,
-- highest per-question cost,
-- high latency,
-- highest failure rate,
-- and, as the retry experiment in §9.4 confirmed, a **7% intrinsic failure rate that survives 5 attempts of exponential backoff**: 6 questions on a single PDF that exceeds the provider's 30 MB wire-size cap, plus 6 questions on a 166-page PDF whose response stream the provider cannot reliably complete (5 persistent empty streams and 1 that ended in a 502).
-
-It is simple, but operationally fragile. The "fragility" isn't only transient: 12 of the 27 native_pdf failures (44%) are *unfixable* by retries.
-
-Native PDFs may still be good for:
-
-- quick one-off small PDFs,
-- demos,
-- short documents,
-- cases where no ingestion pipeline exists.
-
-But for production document QA, especially over large PDFs, native attachment is risky.
-
-### 10.2 Full-context parsed markdown performs best when it fits
-
-The best accuracy came from:
-
-```text
-llamacloud_premium_lc: 58.5%
-```
-
-This supports the intuition that:
-
-> If the full parsed document fits into the context window, a strong model can use it effectively.
-
-But this strategy has scaling limits:
-
-- the full document is resent for every question,
-- cost scales with document length × number of questions,
-- context overflow risk grows with long PDFs,
-- large extracted markdown can exceed the model window.
-
-The 3M 10-K example is important:
-
-```text
-LlamaCloud premium extraction: 908,733 chars
-Estimated tokens: ~227k
-```
-
-That is already above Sonnet 4.5's 200K-token input window. In this run the provider accepted the request without raising a context-overflow error (see §9.2), but that almost certainly means part of the document was silently dropped — three of the four 3M 10-K questions came back correct on `llamacloud_premium_lc`, one wrong, with no signal to the application that any truncation occurred. A larger corpus or longer filing makes full-context prompting unsafe in production: you do not get a hard error, you get an undetectable accuracy regression.
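-
-One inexpensive guard is a pre-flight size estimate before stuffing a document. The sketch below assumes the ~4 characters per token ratio implied by the figures above (908,733 chars ≈ 227k tokens) and a 200K-token window. `estimate_tokens`, `fits_in_context`, and the 2,000-token reserve are illustrative names and values, not harness code, and a real tokenizer count would be more accurate.
-
-```python
-CHARS_PER_TOKEN = 4.0      # heuristic implied by 908,733 chars ≈ 227k tokens
-CONTEXT_WINDOW = 200_000   # input window cited above for Sonnet 4.5
-
-
-def estimate_tokens(num_chars: int, chars_per_token: float = CHARS_PER_TOKEN) -> int:
-    """Rough token estimate from a character count."""
-    return round(num_chars / chars_per_token)
-
-
-def fits_in_context(num_chars: int, reserve_tokens: int = 2_000,
-                    window: int = CONTEXT_WINDOW) -> bool:
-    """True if the estimated prompt leaves `reserve_tokens` for the question and answer."""
-    return estimate_tokens(num_chars) + reserve_tokens <= window
-
-
-chars_3m_10k = 908_733  # LlamaCloud premium extraction of the 3M 10-K
-print(estimate_tokens(chars_3m_10k), fits_in_context(chars_3m_10k))
-```
-
-When the check fails, the application can fall back to retrieval or chunked prompting instead of relying on the provider to reject the request.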
-
-### 10.3 Basic parsers are surprisingly competitive
-
-Azure basic scored:
-
-```text
-54.4% accuracy
-$0.1062 / question
-```
-
-That is only 4.1 points below the best arm, but at much lower preprocessing cost than premium methods.
-
-In this run:
-
-- Azure basic was cheaper than every premium parser arm.
-- Azure basic outperformed native PDF.
-- Azure basic was very close to SurfSense’s accuracy.
-
-For cost-sensitive workloads, basic parsing may be an excellent default.
-
-### 10.4 Premium parsing improves quality, but the gain is modest
-
-Premium parsing improved accuracy:
-
-| Parser | Basic | Premium | Gain |
-|---|---:|---:|---:|
-| Azure | 54.4% | 56.7% | +2.3pp |
-| LlamaCloud | 50.3% | 58.5% | +8.2pp |
-
-Premium is most justified when:
-
-- layout matters,
-- tables matter,
-- visual/page structure matters,
-- high accuracy is more important than preprocessing cost.
-
-But premium preprocessing is 10× the basic tariff, so the business decision depends on volume and accuracy requirements.
-
-### 10.5 SurfSense is the cheapest *and* most robust arm
-
-SurfSense scored:
-
-```text
-Accuracy:        53.2%   (within ~5pp of the best full-context arm)
-Failures:        0       (zero — the only arm with no runtime errors)
-LLM cost / Q:    $0.0150 (~17× cheaper than native PDF, ~8–9× cheaper than premium LC)
-Total cost / Q:  $0.0827 (lowest of any arm, including basic LC)
-```
-
-It was not the top *accuracy* arm. But it won on every other axis that matters in production:
-
-- **Cost.** At $0.0827 / Q it was the cheapest of the six arms, end-to-end. Native PDF was 3.1× more expensive. Premium parser stuffing arms were 2.3–2.5× more expensive.
-- **Reliability.** Zero failures vs 1–4 transient failures for the parser arms, and 27 for native PDF.
-- **Scalability.** Bounded context per turn — it does not break when a single document exceeds the model context window.
-
-That is the strongest argument for SurfSense:
-
-> SurfSense does not try to win by stuffing the whole document into the prompt. It wins by making long-document QA operationally viable: bounded context, retrieval, no overflow, no large request payloads, and a consistently low marginal cost per question.
-
-This matters more as the corpus grows.
-
-In a real user workflow:
-
-- users do not ask 171 questions against only 30 PDFs,
-- they upload many PDFs,
-- documents can be hundreds of pages,
-- questions arrive over time,
-- the same corpus is reused.
-
-In that setting, paying for ingestion once and retrieving context dynamically is cheaper than repeatedly stuffing full documents into every prompt: the one-time preprocessing cost is amortized over every later question, and the per-question LLM bill stays small because the prompt is bounded by the retrieved context, not by the size of the underlying document.
-
-### 10.6 Cost amortization model (a math derivation the writer can quote)
-
-The headline `$/Q` numbers are per-question costs at this run's specific ratio of questions to PDFs (171 / 30 ≈ 5.7). To turn that into a production-grade claim we want a closed-form model the writer can extrapolate.
-
-**Setup.** A workload has:
-
-- `P` PDFs in the corpus,
-- average pages per PDF `k̄` (in this experiment, k̄ ≈ 39.6 — total `1188 / 30`),
-- `Q` total questions asked over the corpus across the corpus's lifetime (potentially many, since users keep coming back).
-
-Define each arm's per-arm constants:
-
-- `α_arm` = preprocessing tariff in $/page (`α = 0` for native_pdf, `0.001` for basic, `0.010` for premium),
-- `β_arm` = per-question LLM cost ($/Q at the arm's typical input/output token mix).
-
-Then the **total cost** for the workload is:
-
-```
-C_arm(P, k̄, Q) = α_arm · P · k̄  +  β_arm · Q
-                 └── one-time fixed cost ──┘   └─ scales with Q ─┘
-```
-
-and the **per-question amortized cost** is:
-
-```
-$/Q_arm(P, k̄, Q) = α_arm · P · k̄ / Q  +  β_arm
-                   = α_arm · k̄ / (Q/P)  +  β_arm
-```
-
-i.e. the preprocessing term shrinks as `Q/P` (questions per PDF) grows.
-
-**Plugging in our measured constants:**
-
-| Arm | α ($/page) | β ($/Q, measured) | Closed-form $/Q |
-|---|---:|---:|---|
-| `native_pdf` | 0.000 | 0.2552 | `$0.2552` (constant) |
-| `azure_basic_lc` | 0.001 | 0.0994 | `$0.0994 + 0.001 · 39.6 / (Q/P)` |
-| `azure_premium_lc` | 0.010 | 0.1373 | `$0.1373 + 0.010 · 39.6 / (Q/P)` |
-| `llamacloud_basic_lc` | 0.001 | 0.0981 | `$0.0981 + 0.001 · 39.6 / (Q/P)` |
-| `llamacloud_premium_lc` | 0.010 | 0.1208 | `$0.1208 + 0.010 · 39.6 / (Q/P)` |
-| `surfsense_agentic` | 0.010 | 0.0150 | `$0.0150 + 0.010 · 39.6 / (Q/P)` |
-
-This is the equation a technical reader can re-use directly with their own corpus.
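-
-For readers who prefer to evaluate the closed form numerically, here is a minimal sketch. The `Arm` dataclass and `cost_per_question` are illustrative names; the constants are copied from the table above, and the `Q/P` values in the loop are hypothetical reuse levels rather than measurements.
-
-```python
-from dataclasses import dataclass
-
-
-@dataclass(frozen=True)
-class Arm:
-    name: str
-    alpha: float  # preprocessing tariff, $/page
-    beta: float   # measured LLM cost, $/question
-
-
-ARMS = [
-    Arm("native_pdf", 0.000, 0.2552),
-    Arm("azure_basic_lc", 0.001, 0.0994),
-    Arm("azure_premium_lc", 0.010, 0.1373),
-    Arm("llamacloud_basic_lc", 0.001, 0.0981),
-    Arm("llamacloud_premium_lc", 0.010, 0.1208),
-    Arm("surfsense_agentic", 0.010, 0.0150),
-]
-
-
-def cost_per_question(arm: Arm, questions_per_pdf: float, avg_pages: float = 39.6) -> float:
-    """Amortized $/Q = alpha * k_bar / (Q/P) + beta."""
-    if questions_per_pdf <= 0:
-        raise ValueError("questions_per_pdf must be positive")
-    return arm.alpha * avg_pages / questions_per_pdf + arm.beta
-
-
-for q_per_p in (1, 10, 100):  # hypothetical reuse levels
-    row = ", ".join(f"{a.name}=${cost_per_question(a, q_per_p):.4f}" for a in ARMS)
-    print(f"Q/P={q_per_p}: {row}")
-```
-
-Swapping in a different `avg_pages`, tariff, or measured `beta` is enough to re-derive the comparison for another corpus or price schedule.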
-
-**Worked example: `llamacloud_premium_lc` vs `surfsense_agentic`.**
-
-The α terms are *identical* (both pay the premium tariff). So the cost gap is constant in `Q/P` and equals:
-
-```
-$/Q_LC_premium − $/Q_SurfSense = β_LC_premium − β_SurfSense
-                                = 0.1208 − 0.0150
-                                = $0.1058 per question
-```
-
-This is a structural advantage, not a regime-dependent one. **At every value of `Q/P`, SurfSense is ~$0.106/Q cheaper than the most accurate full-context arm.** Across `Q = 10,000` questions, that is **$1,058 saved** with no change in preprocessing spend.
-
-**Why is `β` so different?** Because LC arms send the *whole document* in every request:
-
-```
-β_LC ≈ p_in · (k̄ · t_per_page_LC) + p_out · t_out_LC
-β_SS ≈ p_in · t_in_SS_per_hop · n_hops_SS + p_out · t_out_SS
-```
-
-with Sonnet 4.5 priced at `p_in ≈ $3 / 1M` input tokens and `p_out ≈ $15 / 1M` output tokens. The measured ratio `β_LC / β_SS` of roughly 6.5–9 (about 8 for `llamacloud_premium_lc`) follows from the input-token ratio: LC arms send ~32–42 K tokens per call (§7.4), whereas SurfSense's agent loop totals ~5–15 K tokens per question even after multi-hop.
-
-**Sensitivity intuition for the writer:**
-
-- If Sonnet 4.5's input price dropped 10×, `β_LC` would drop ~10× (and `β_SS` would shrink too), so the absolute gap would narrow from ~$0.106/Q to about a cent per question or less, and preprocessing would become the dominant cost for every parsed arm. The agentic-retrieval cost story is *contingent on input-token pricing*; if LLM tokens become a near-free commodity, "stuff the whole document" becomes economically viable. We don't believe that's where input pricing is going on the 1–2 year horizon, but it is the right thing to caveat.
-- The `α` terms only matter when `Q/P` is small (one-off Q&A on a fresh corpus). For any reused corpus, the `β` term dominates and SurfSense's structural ~6.5–9× β advantage over the LC arms drives the total.
-
----
-
-## 11. Blog-Friendly Narrative
-
-A strong blog angle would be:
-
-> “We tested six ways to ask questions over long multimodal PDFs. Full-context parser output had the highest raw accuracy. Agentic retrieval was the cheapest *and* the most reliable: about five percentage points behind the best, with zero failures and the lowest cost per question.”
-
-Suggested framing:
-
-1. Native PDF attachment seems attractive because it is simple.
-2. But long PDFs create huge request payloads, high cost, and provider instability.
-3. Parsed markdown improves model performance and reduces per-call cost.
-4. Premium parsers can improve quality, but at higher preprocessing cost.
-5. Full-context prompting is not scalable for truly long documents.
-6. SurfSense’s agentic retrieval gives up a few accuracy points but wins on cost (cheapest arm at $0.0827 / Q) and robustness (zero runtime failures), and it avoids context overflow on 100+ page PDFs.
-
-Suggested claim:
-
-> The question is not “Can a frontier model read a PDF?” It can. The real question is whether the approach survives long documents, repeated questions, provider limits, and production cost constraints.
-
-Suggested conclusion:
-
-> For small PDFs, native attachment can be fine. For long-document production QA, ingestion plus retrieval/context management is the more scalable architecture.
-
----
-
-## 12. Caveats and Improvements
-
-### 12.1 Add retries to the evaluation harness (validated)
-
-Many non-SurfSense failures were transient SSL / provider errors. The retry experiment in §9.4 confirmed this empirically: 5 attempts of exponential backoff recovers 100% of LC-arm failures and ~56% of native_pdf failures, with 25/37 originally-failed rows succeeding cleanly on the very first retry. The harness should bake this in around:
-
-- OpenRouter native PDF calls,
-- OpenRouter chat-completion calls for long-context arms.
-
-Empirically calibrated retry policy:
-
-- retry on SSL errors (e.g. `SSLV3_ALERT_BAD_RECORD_MAC`),
-- retry on 502/503/504,
-- retry on empty SSE stream,
-- exponential backoff (base 1 s, cap 30 s, jitter),
-- cap at **3 attempts** (all 25 recoveries in §9.4 came on the first retry, so attempts 4–5 add latency without recovering anything in this run).
-
-Caveat: even with this policy, native_pdf retains a hard ~7% intrinsic failure rate at this dataset's PDF size distribution — retries cannot fix the 30 MB wire-size cap or the 166-page empty-stream case.
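-
-A minimal sketch of that policy follows. It is not the harness implementation: `TransientError`, `backoff_delay`, and `call_with_retries` are hypothetical names, the caller is assumed to map SSL errors and 502/503/504 responses onto `TransientError`, and "full jitter" (a uniform draw up to the capped exponential delay) is one common choice rather than the variant the retry script necessarily uses.
-
-```python
-import random
-import time
-
-
-class TransientError(Exception):
-    """Hypothetical wrapper for SSL errors, 502/503/504, or an empty SSE stream."""
-
-
-def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
-    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
-    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
-
-
-def call_with_retries(call, max_attempts: int = 3, sleep=time.sleep):
-    """Run `call()` until it returns non-empty text or attempts are exhausted."""
-    last_error = None
-    for attempt in range(max_attempts):
-        try:
-            text = call()
-            if text and text.strip():
-                return text
-            # An empty body is treated like a transport failure and retried.
-            last_error = TransientError("empty response stream")
-        except TransientError as exc:
-            last_error = exc
-        if attempt < max_attempts - 1:
-            sleep(backoff_delay(attempt))
-    raise last_error
-```
-
-Any other exception, such as a request-size rejection, propagates immediately instead of being retried, which matches the observation that the 30 MB failures never recover.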
-
-### 12.2 Surface SurfSense token/cost telemetry on the SSE stream
-
-The cost numbers in this report for the SurfSense arm (`$0.015 / Q`, `$2.57` for the full 171-question run) were reconstructed from the backend's billable-call ledger after the run.
-
-The auto-generated `summary.md` still writes `LLM $/Q = $0.0000` for `surfsense_agentic`, because the `/api/v1/new_chat` SSE stream does not currently expose token usage or per-turn cost to the eval harness. That is the only reason the headline tables in earlier passes of this report had to flag the value as "untracked".
-
-For future reports the SSE stream should surface, per-turn:
-
-- prompt tokens,
-- completion tokens,
-- total tokens,
-- model,
-- cost per internal call,
-- total cost per user question.
-
-Once that is plumbed through, the harness can compute `surfsense_agentic` cost online instead of requiring a post-run reconciliation against the billable-call ledger.
-
-### 12.3 Test larger samples and stratified subsets
-
-This experiment used 30 PDFs and 171 answerable questions. A future blog could extend it with:
-
-- full MMLongBench-Doc,
-- stratified by page count,
-- stratified by document type,
-- separate chart for image-heavy vs text-heavy documents,
-- separate chart for short vs long PDFs.
-
-### 12.4 Compare retrieval-quality diagnostics
-
-SurfSense’s accuracy is partly retrieval-dependent. A deeper product analysis should inspect:
-
-- whether the relevant chunks were retrieved,
-- whether the answer failed despite retrieval,
-- how many tool calls were needed,
-- whether cited lines/pages aligned with gold evidence.
-
-This would explain *why* SurfSense missed certain questions.
-
----
-
-## 13. Recommended Product Interpretation
-
-For production:
-
-### Use native PDF only for:
-
-- small files,
-- low-volume one-off Q&A,
-- no-ingestion workflows,
-- quick previews.
-
-### Use full-context parsed markdown when:
-
-- the document fits comfortably in context,
-- latency matters,
-- you only ask a few questions per PDF,
-- highest possible single-question accuracy matters.
-
-### Use SurfSense agentic retrieval when:
-
-- documents are long,
-- the corpus grows over time,
-- users ask many questions,
-- cost per query matters,
-- context overflow must be avoided,
-- reliability matters more than a few points of peak accuracy.
-
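-In this benchmark, SurfSense was not the highest raw-accuracy arm, but it was the only arm with zero failures in the original, un-retried run. Once the harness retries (§9.5), the four LC arms also reach zero failures.
-
-That reliability result, combined with the lowest cost per question, is likely the strongest blog-worthy differentiator.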
-
----
-
-## 14. Appendix: Commands Used
-
-High-level sequence:
-
-```bash
-python -m surfsense_evals setup \
-  --suite multimodal_doc \
-  --provider-model anthropic/claude-sonnet-4.5 \
-  --vision-llm anthropic/claude-sonnet-4.5 \
-  --scenario head-to-head
-```
-
-```bash
-python -m surfsense_evals ingest multimodal_doc mmlongbench \
-  --max-docs 30 \
-  --upload-batch-size 3 \
-  --use-vision-llm \
-  --processing-mode premium
-```
-
-After the large-PDF timeout:
-
-```bash
-python -m surfsense_evals ingest multimodal_doc mmlongbench \
-  --max-docs 30 \
-  --upload-batch-size 1 \
-  --use-vision-llm \
-  --processing-mode premium
-```
-
-Parser extraction:
-
-```bash
-python -m surfsense_evals ingest multimodal_doc parser_compare \
-  --max-docs 30 \
-  --pdf-concurrency 2
-```
-
-Benchmark run:
-
-```bash
-python -m surfsense_evals run multimodal_doc parser_compare \
-  --sample-per-doc 20 \
-  --concurrency 2 \
-  --max-output-tokens 512
-```
-
-Report generation:
-
-```bash
-python -m surfsense_evals report --suite multimodal_doc
-```
-
-Post-hoc retry experiment (§9.4 / §9.5):
-
-```bash
-# Re-run only the 37 failed (arm, qid) pairs with up to 5 attempts
-# of exponential backoff. SurfSense had 0 failures so backend/celery
-# are not required.
-python scripts/retry_failed_questions.py \
-  --run-id 2026-05-14T00-53-19Z \
-  --max-attempts 5 \
-  --base-delay 1.0 \
-  --max-delay 30.0 \
-  --concurrency 2
-```
-
-Merge retry survivors back into the run and recompute the headline:
-
-```bash
-python scripts/compute_post_retry_accuracy.py \
-  --run-id 2026-05-14T00-53-19Z
-```
-
-Compute the deeper blog stats (latency / token distributions, McNemar
-pairwise tests, per-PDF heterogeneity):
-
-```bash
-python scripts/compute_blog_extras.py \
-  --run-id 2026-05-14T00-53-19Z
-```
-
-### 14.1 Reproducibility notes
-
-- **LLM model:** `anthropic/claude-sonnet-4.5` for every arm, routed via OpenRouter (`https://openrouter.ai/api/v1/chat/completions`).
-- **PDF engine for `native_pdf`:** OpenRouter's `native` file-parser plugin (`engine: native`).
-- **Parser SDKs called directly from the eval harness:**
-  - `azure-ai-documentintelligence` (Azure DI, models `prebuilt-read` for basic and `prebuilt-layout` for premium).
-  - `llama-cloud-services` (LlamaParse, modes `parse_page_with_llm` for basic, `parse_page_with_agent` for premium).
-  - The harness writes the resulting Markdown to `data/multimodal_doc/parser_compare/extractions/` and records each extraction in `parser_compare_doc_map.jsonl`. This bypasses the SurfSense backend so each LC arm is a pure parser-stuffing comparison.
-- **SurfSense backend ETL:** With both `AZURE_DI_*` env vars present and `ETL_SERVICE=LLAMACLOUD`, the backend prefers Azure DI for PDFs (see `surfsense_backend/app/etl_pipeline/etl_pipeline_service.py`). The 30 PDFs were therefore ingested through Azure DI `prebuilt-layout` + Sonnet 4.5 vision-LLM image extraction. That is the basis for charging the `surfsense_agentic` arm the premium tariff.
-- **SurfSense `/api/v1/new_chat` flags:** `mentioned_document_ids` set to the per-question PDF's `document_id` (single-doc retrieval); `disabled_tools` left at default; `ephemeral_threads=true` to ensure no inter-question state leakage.
-- **Concurrency:** `concurrency=2` per arm during `parser_compare run` and during the retry pass. Higher concurrency on the LC arms reproducibly inflated SSL/transport failures.
-- **Grader:** deterministic, format-aware. The five branches:
-  - `Str`: lowercase, strip punctuation, collapse whitespace, exact match.
-  - `Int`: extract first integer with regex; require equality.
-  - `Float`: extract first decimal; correct if `|gold − pred| ≤ max(0.01, 0.02·|gold|)` (2% relative tolerance, 0.01 absolute floor).
-  - `List`: lowercase, split on `,` / `;`, set-equal compare; F1 = 2·|intersection| / (|pred| + |gold|).
-  - `None` ("Not answerable"): correct iff prediction contains "not answerable" / "cannot be determined" / equivalent.
-  - F1 for non-List formats = 1.0 if correct else 0.0; for List, token-level F1 over the parsed sets.
-  - Source: `surfsense_evals/src/surfsense_evals/suites/multimodal_doc/mmlongbench/grader.py`.
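-
-To make two of these branches concrete, here is a small sketch of the `Float` and `List` rules as described above. It is a reading of the bullet points, not a copy of `grader.py`: the function names and the number-extraction regex are assumptions, and the real grader may normalise text differently.
-
-```python
-import re
-
-
-def float_correct(gold: float, pred_text: str) -> bool:
-    """Float branch: first number in the prediction, within max(0.01, 0.02*|gold|)."""
-    match = re.search(r"-?\d+(?:\.\d+)?", pred_text)
-    if match is None:
-        return False
-    return abs(gold - float(match.group())) <= max(0.01, 0.02 * abs(gold))
-
-
-def parse_list(text: str) -> set[str]:
-    """List branch: lowercase, split on ',' or ';', drop empty items."""
-    return {item.strip() for item in re.split(r"[,;]", text.lower()) if item.strip()}
-
-
-def list_f1(gold_text: str, pred_text: str) -> float:
-    """F1 = 2*|intersection| / (|pred| + |gold|) over the parsed sets."""
-    gold, pred = parse_list(gold_text), parse_list(pred_text)
-    if not gold and not pred:
-        return 1.0
-    return 2 * len(gold & pred) / (len(gold) + len(pred))
-```
-
-For example, a gold list of three items and a prediction containing two of them scores `2·2 / (2 + 3) = 0.8`, while the `List` answer itself is only counted correct when the two sets are equal.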
-
-### 14.2 Statistical methodology
-
-- **Wilson 95% CIs** (§7.1) computed as `(p̂ + z²/(2n) ± z·√(p̂(1−p̂)/n + z²/(4n²))) / (1 + z²/n)` with `z = 1.96`.
-- **McNemar exact-binomial test** (§7.3): on paired arms `(i, j)`, with discordant counts `b = #{i correct, j wrong}` and `c = #{i wrong, j correct}`, `b ~ Bin(b+c, 0.5)` under H0; the two-sided p-value is computed exactly from `math.comb`. No continuity correction (n's are small enough that the exact form is cheap).
-- **Multiple comparisons:** 15 arm pairs. We report single-comparison-significant pairs (α = 0.05) and explicitly note which would survive Holm-Bonferroni at family-wise α = 0.05 (none, in this run).
-- **Per-PDF accuracy heterogeneity** (§7.5): each PDF contributes one mean over its 4–8 questions; we report mean / std / min / quartiles across the 30 per-PDF means (so each PDF is weighted equally regardless of how many questions it contributed).
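-
-Both formulas are short enough to re-implement when checking the reported intervals and p-values. The sketch below uses only the standard library; `wilson_interval` and `mcnemar_exact` are illustrative names, not functions from `compute_blog_extras.py`, and the discordant counts in the last line are hypothetical.
-
-```python
-import math
-
-
-def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
-    """Wilson score interval for a binomial proportion."""
-    p_hat = successes / n
-    centre = p_hat + z * z / (2 * n)
-    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
-    denom = 1 + z * z / n
-    return (centre - margin) / denom, (centre + margin) / denom
-
-
-def mcnemar_exact(b: int, c: int) -> float:
-    """Two-sided exact McNemar p-value: b ~ Bin(b + c, 0.5) under H0."""
-    n = b + c
-    if n == 0:
-        return 1.0
-    k = min(b, c)
-    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
-    return min(1.0, 2 * tail)
-
-
-lo, hi = wilson_interval(100, 171)  # 100 / 171 ≈ 58.5%
-print(f"Wilson 95% CI: {lo:.3f} to {hi:.3f}")
-print(f"McNemar p (b=5, c=1, hypothetical counts): {mcnemar_exact(5, 1):.4f}")
-```
-
-Doubling the smaller tail and clipping at 1.0 is the usual way to turn the one-sided binomial tail into a two-sided p-value when the null is symmetric.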
-
-### 14.3 Threats to validity
-
-The claims in this report come with the following caveats. We list them so a reader can decide which generalize and which are specific to the run.
-
-1. **Single dataset.** All 171 questions come from MMLongBench-Doc. The dataset is academic-paper-heavy (arXiv preprints + a few financial 10-Ks and political reports). Findings on a corpus of, say, regulatory filings or scanned forms could differ — particularly for parser quality, where MMLongBench's clean academic PDFs are easier than the median real-world PDF.
-2. **Single LLM.** Every arm uses `anthropic/claude-sonnet-4.5`. Results would shift with a smaller or weaker model: less-capable models likely benefit more from premium parsing (because they cannot fix layout mistakes themselves) and benefit less from full-context stuffing (because they cannot use 200K-token contexts effectively).
-3. **Single retrieval policy.** `surfsense_agentic` was run with `mentioned_document_ids = [<pdf>]` — single-document retrieval, no cross-document mixing. SurfSense's accuracy on questions that span multiple documents (or that benefit from cross-corpus context) is not measured here.
-4. **n = 171.** The Wilson 95% CIs are roughly ±7–8 percentage points per arm; only 3 of 15 arm pairs reach single-comparison significance (§7.3). The headline ranking is directionally robust but should not be treated as a precise ordering for arms that differ by < ~5pp.
-5. **Cost figures depend on the OpenRouter Sonnet 4.5 schedule.** Per-token prices change. The amortization model in §10.6 is the right thing for a reader to re-derive with their own pricing; the headline `$/Q` is run-specific.
-6. **`native_pdf` measured only the OpenRouter "native" file-parser plugin** (`engine: native`). Different engines (`mistral-ocr`, `cloudflare-ai`) might have different size limits, accuracy, and failure rates. The 30 MB intrinsic limit and the empty-stream behavior are specific to the Google upstream that OpenRouter routed Sonnet 4.5 through.
-7. **SurfSense LLM cost was reconstructed post-hoc.** The `/api/v1/new_chat` SSE stream does not currently surface per-turn tokens or cost (§12.2). The `$0.015/Q` figure is the average from the backend's `billable_call` ledger over the 171 turns, not a live measurement against each turn's response. We are confident in the *average*; we cannot give a per-question variance for SurfSense LLM cost from this run.
-8. **Grader is deterministic, not LLM-judged.** The MMLongBench-Doc paper itself uses a GPT-4 judge. We chose deterministic grading for reproducibility (two researchers running this harness will get the exact same number) and simpler downstream stats. An LLM-judge mode is implemented (`--judge gpt5`) but was not used here. If you switch to LLM judging, all arms shift up by roughly the same amount; the *ordering* should be stable but the absolute accuracy values are not directly comparable.
-9. **Retry experiment is not blind to its purpose.** The retry policy (5 attempts, exponential backoff, jitter, concurrency 2) was chosen *after* seeing the failure modes. We are not claiming this is the optimal policy across arms — only that with this policy, all LC failures recover and a clean residue of intrinsic native_pdf failures remains.
-10. **No statistical test was run for cost differences.** All cost numbers are point estimates from a single run; we do not report cost CIs because the variance comes from token-count variability per question and is well-modeled by the input-token distributions in §7.4 if a reader wants to construct a CI themselves.
-
-### 14.4 Code citations index
-
-Every technical claim in this report is reproducible from the code in this repository. The table below maps each claim to its exact source-of-truth file and line range, pinned to commit [`9bcd5016`](https://github.com/MODSetter/SurfSense/commit/9bcd5016) so the line numbers stay valid even if the files change later.
-
-#### Eval harness — arm definitions
-
-| Claim / construct | File@lines |
-|---|---|
-| `NativePdfArm` — attaches the PDF as an OpenRouter file part | [`core/arms/native_pdf.py:21-…`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/core/arms/native_pdf.py#L21) |
-| `BareLlmArm` — chat-completion with no retrieval (used for the four LC arms) | [`core/arms/bare_llm.py:22-…`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/core/arms/bare_llm.py#L22) |
-| `SurfSenseArm` — `/api/v1/new_chat` SSE consumer | [`core/arms/surfsense.py:30-…`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/core/arms/surfsense.py#L30) |
-| `OpenRouterChatProvider` — bare chat-completion HTTP client | [`core/providers/openrouter_chat.py:40-…`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/core/providers/openrouter_chat.py#L40) |
-| `OpenRouterPdfProvider` — file-parser-plugin chat-completion client | [`core/providers/openrouter_pdf.py:72-…`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/core/providers/openrouter_pdf.py#L72) |
-
-#### Eval harness — parser SDK callers (LC arms)
-
-| Claim / construct | File@lines |
-|---|---|
-| Azure DI mode→model map (`basic`→`prebuilt-read`, `premium`→`prebuilt-layout`) | [`core/parsers/azure_di.py:33-35`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/core/parsers/azure_di.py#L33-L35) |
-| LlamaCloud mode→mode map (`basic`→`parse_page_with_llm`, `premium`→`parse_page_with_agent`) | [`core/parsers/llamacloud.py:32-34`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/core/parsers/llamacloud.py#L32-L34) |
-| `pypdf`-based page count (used for the per-page tariff calculation) | [`core/parsers/pdf_pages.py`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/core/parsers/pdf_pages.py) |
-
-#### Eval harness — parser_compare benchmark
-
-| Claim / construct | File@lines |
-|---|---|
-| `ParserCompareBenchmark` (six-arm runner, prompt construction, raw.jsonl writer) | [`suites/multimodal_doc/parser_compare/runner.py:231-576`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/parser_compare/runner.py#L231-L576) |
-| Prompt: `build_native_pdf_prompt` (PDF attached separately) | [`parser_compare/prompt.py:69-76`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/parser_compare/prompt.py#L69-L76) |
-| Prompt: `build_long_context_prompt` (full Markdown stuffed inline) | [`parser_compare/prompt.py:92-113`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/parser_compare/prompt.py#L92-L113) |
-| Prompt: `build_surfsense_prompt` (chunks injected by the agent) | [`parser_compare/prompt.py:79-89`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/parser_compare/prompt.py#L79-L89) |
-| Pre-extraction manifest builder (cached parser outputs) | [`parser_compare/ingest.py`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/parser_compare/ingest.py) |
-
-#### Cost model
-
-| Claim / construct | File@lines |
-|---|---|
-| `PREPROCESS_USD_PER_PAGE` constant (`basic = 0.001`, `premium = 0.010`) | [`runner.py:74-77`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/parser_compare/runner.py#L74-L77) |
-| Per-arm tier mapping (`_LC_ARM_MODE`) | [`runner.py:89-94`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/parser_compare/runner.py#L89-L94) |
-| `SURFSENSE_INGEST_MODE = "premium"` (basis for charging SurfSense the premium tariff) | [`runner.py:96-101`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/parser_compare/runner.py#L96-L101) |
-| Cost overlay (`preprocess_cost_total`, `total_cost_per_q` computation) | [`runner.py:725-747`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/parser_compare/runner.py#L725-L747) |
-
-#### Grader (deterministic, format-aware — §14.1)
-
-| Claim / construct | File@lines |
-|---|---|
-| `GradeResult` dataclass (`correct`, `f1`, `method`, normalised pred/gold) | [`mmlongbench/grader.py:40-…`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/mmlongbench/grader.py#L40) |
-| `_grade_str` (lowercase + strip + exact match) | [`mmlongbench/grader.py:89-104`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/mmlongbench/grader.py#L89-L104) |
-| `_grade_int` (regex extract first int, equality) | [`mmlongbench/grader.py:106-120`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/mmlongbench/grader.py#L106-L120) |
-| `_grade_float` (1% relative tolerance, 0.01 absolute floor) | [`mmlongbench/grader.py:122-139`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/mmlongbench/grader.py#L122-L139) |
-| `_grade_list` (set equality + token-level F1) | [`mmlongbench/grader.py:141-157`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/mmlongbench/grader.py#L141-L157) |
-| `_grade_none` ("Not answerable" handling) | [`mmlongbench/grader.py:159-…`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/mmlongbench/grader.py#L159) |
-| Public `grade()` dispatcher | [`mmlongbench/grader.py:224-…`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/mmlongbench/grader.py#L224) |
-
-#### Statistical methodology (§14.2)
-
-| Claim / construct | File@lines |
-|---|---|
-| `wilson_ci()` — Wilson 95% CI for a single proportion | [`core/metrics/mc_accuracy.py:49-…`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/core/metrics/mc_accuracy.py#L49) |
-| `accuracy_with_wilson_ci()` — full per-arm accuracy + CI struct | [`core/metrics/mc_accuracy.py:73-…`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/src/surfsense_evals/core/metrics/mc_accuracy.py#L73) |
-| McNemar exact-binomial p-value (§7.3) | [`compute_blog_extras.py:80-99`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/scripts/compute_blog_extras.py#L80-L99) |
-| McNemar pairwise table builder | [`compute_blog_extras.py:102-141`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/scripts/compute_blog_extras.py#L102-L141) |
-| Latency distribution helpers (§7.4) | [`compute_blog_extras.py:186-213`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/scripts/compute_blog_extras.py#L186-L213) |
-| Token distribution helpers (§7.4) | [`compute_blog_extras.py:216-250`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/scripts/compute_blog_extras.py#L216-L250) |
-| Per-PDF accuracy heterogeneity (§7.5) | [`compute_blog_extras.py:149-183`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/scripts/compute_blog_extras.py#L149-L183) |
-
-#### Retry experiment (§9.4 / §9.5)
-
-| Claim / construct | File@lines |
-|---|---|
-| Failure-row detection (error set OR empty `raw_text`) | [`retry_failed_questions.py:99-111`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/scripts/retry_failed_questions.py#L99-L111) |
-| Per-row retry loop (5 attempts, exponential backoff w/ jitter) | [`retry_failed_questions.py:260-304`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/scripts/retry_failed_questions.py#L260-L304) |
-| Bounded-concurrency runner | [`retry_failed_questions.py:307-315`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/scripts/retry_failed_questions.py#L307-L315) |
-| Post-retry merge + recompute (§9.5 final accuracy table) | [`compute_post_retry_accuracy.py`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/scripts/compute_post_retry_accuracy.py) |
-| Context-overflow hypothesis test (§9.2) | [`test_context_overflow_hypothesis.py`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/scripts/test_context_overflow_hypothesis.py) |
-
-#### SurfSense backend (§9.2 — what "0 failures" actually measures)
-
-| Claim / construct | File@lines |
-|---|---|
-| `_exponential_delay()` — backoff with optional ±25% jitter | [`retry_after.py:113-128`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_backend/app/agents/new_chat/middleware/retry_after.py#L113-L128) |
-| `RetryAfterMiddleware` — wraps every internal LLM hop | [`retry_after.py:131-…`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_backend/app/agents/new_chat/middleware/retry_after.py#L131) |
-| `_should_retry()` — retryable-error classification | [`retry_after.py:171-…`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_backend/app/agents/new_chat/middleware/retry_after.py#L171) |
-| ETL routing — Azure DI preferred over LlamaCloud for compatible types | [`etl_pipeline_service.py:233-251`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_backend/app/etl_pipeline/etl_pipeline_service.py#L233-L251) |
-
-#### Run artifacts (the verifiable numbers source)
-
-These are the *outputs* the report cites — every accuracy / cost / latency number can be re-derived by running the analysis scripts on these JSONL files.
-
-| Artifact | Relative path | Contents |
-|---|---|---|
-| Raw run | [`raw.jsonl`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw.jsonl) | 1 026 rows = 6 arms × 171 questions; one row per `(arm, qid)` with the original ArmResult + grader verdict |
-| Retry log | [`raw_retries.jsonl`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_retries.jsonl) | 37 rows; per-row attempt timeline + final outcome |
-| Retry summary | [`raw_retries_summary.json`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_retries_summary.json) | per-arm tried / recovered / still-failed counts |
-| Post-retry merged | [`raw_post_retry.jsonl`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_post_retry.jsonl) | 1 026 rows; recovered retries replace originals; basis for §9.5 final accuracy + §7.3 McNemar |
-| Per-arm aggregates | [`run_artifact.json`](https://github.com/MODSetter/SurfSense/blob/9bcd5016/surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/run_artifact.json) | the raw run's per-arm summary metrics + per-PDF correctness map |
-
-#### Reproducing every number in §1, §7, §8, §9
-
-```bash
-# 1) Sanity: load the artifacts that ship with the repo.
-ls surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/
-
-# 2) Recompute the post-retry headline accuracy (§1, §9.5).
-python surfsense_evals/scripts/compute_post_retry_accuracy.py \
-  --run-id 2026-05-14T00-53-19Z
-
-# 3) Recompute McNemar pairwise + latency / token / per-PDF distributions
-#    (§7.3, §7.4, §7.5).
-python surfsense_evals/scripts/compute_blog_extras.py \
-  --run-id 2026-05-14T00-53-19Z
-
-# 4) Re-run the context-overflow hypothesis test (§9.2).
-python surfsense_evals/scripts/test_context_overflow_hypothesis.py
-```
-
-To re-run the experiment end-to-end (slow: needs a backend + celery + ~3 hr ingest + ~2 hr LC arms), use the commands in §14.
-
----
-
-## 15. Appendix: File Locations
-
-Primary auto-generated report:
-
-```text
-reports/multimodal_doc/2026-05-14T02-30-16Z/summary.md
-```
-
-Raw run (all 1026 rows: 6 arms × 171 questions):
-
-```text
-data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw.jsonl
-```
-
-Run artifact (per-arm aggregates from the run):
-
-```text
-data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/run_artifact.json
-```
-
-Retry experiment (§9.4 / §9.5):
-
-```text
-data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_retries.jsonl
-data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_retries_summary.json
-```
-
-Post-retry merged artifact (used for the final accuracy + McNemar tables):
-
-```text
-data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_post_retry.jsonl
-```
-
-Parser manifest (PDF → extracted-markdown paths per LC arm):
-
-```text
-data/multimodal_doc/maps/parser_compare_doc_map.jsonl
-```
-
-Per-arm cached parser extractions (regenerated by the parser_compare
-ingest step; not tracked in git because absolute paths leak the local
-checkout):
-
-```text
-data/multimodal_doc/parser_compare/extractions/
-```
-
-Analysis scripts (all in `surfsense_evals/scripts/`):
-
-```text
-inspect_first30.py                 # corpus & question-count summary
-patch_manifest_for_parallel_ingest.py
-check_uploaded_status.py           # query SurfSense backend status
-analyze_failures.py                # cluster errors per arm + per PDF
-analyze_failure_timing.py          # per-arm failure-time clusters
-test_context_overflow_hypothesis.py
-compute_adjusted_accuracy.py       # transient-vs-intrinsic accuracy
-retry_failed_questions.py          # retry pass with exponential backoff
-compute_post_retry_accuracy.py     # merge retries + recompute headline
-compute_blog_extras.py             # latency/tokens/McNemar/per-PDF stats
-```
-
----
-
-## 16. One-Sentence Summary
-
-On 171 questions over 30 long multimodal PDFs, **full-context LlamaCloud-premium (59.6% post-retry) and Azure-premium (58.5%) won on accuracy**, but only **3 of 15 arm pairs are statistically distinguishable at α = 0.05** (McNemar, §7.3); meanwhile **SurfSense's agentic retrieval delivered 53.2% accuracy at $0.0827 / Q (the cheapest arm, ~$0.10 / Q below the premium full-context arms) with zero runtime failures, while native PDF attachment retained an irrecoverable 7% intrinsic failure rate even after 5 attempts of exponential backoff (§9.4–§9.5)**, making the production trade-off "give up ~6pp of accuracy that may not even be statistically real, save ~57% on per-question cost, and inherit zero context-overflow / wire-size fragility on long documents".
diff --git a/surfsense_web/app/(home)/blog/blog-magazine.tsx b/surfsense_web/app/(home)/blog/blog-magazine.tsx
index 02e5045a9..f471d6919 100644
--- a/surfsense_web/app/(home)/blog/blog-magazine.tsx
+++ b/surfsense_web/app/(home)/blog/blog-magazine.tsx
@@ -35,9 +35,7 @@ function SearchIcon({ className }: { className?: string }) {
 }
 
 export function BlogWithSearchMagazine({ blogs }: { blogs: BlogEntry[] }) {
-	const featured = blogs[0];
-
-	if (!featured) {
+	if (blogs.length === 0) {
 		return (
 			<div className="relative overflow-hidden bg-neutral-50 px-4 md:px-8 dark:bg-neutral-950">
 				<Container className="relative pt-12 pb-24 md:pt-20">
@@ -47,6 +45,17 @@ export function BlogWithSearchMagazine({ blogs }: { blogs: BlogEntry[] }) {
 		);
 	}
 
+	// `blogs` arrives pre-sorted from the server: explicitly featured posts
+	// first (ordered by `featured_order` asc, then date desc), then the rest
+	// by date desc. If nothing is explicitly featured, fall back to treating
+	// the newest post as the cover so the layout never feels empty up top.
+	// `MagazineSearchGrid` re-filters using `heroSlugs` so the hero/featured
+	// posts never duplicate into the archive grid.
+	const explicitlyFeatured = blogs.filter((b) => b.featured);
+	const heroBlogs = explicitlyFeatured.length > 0 ? explicitlyFeatured : blogs.slice(0, 1);
+	const heroSlugs = new Set(heroBlogs.map((b) => b.slug));
+	const [coverStory, ...secondaryFeatured] = heroBlogs;
+
 	return (
 		<div className="relative overflow-hidden bg-neutral-50 px-4 pt-20 md:px-8 dark:bg-neutral-950">
 			<div className="pointer-events-none absolute inset-0 bg-[radial-gradient(ellipse_80%_50%_at_50%_-20%,rgba(120,119,198,0.15),transparent)] dark:bg-[radial-gradient(ellipse_80%_50%_at_50%_-20%,rgba(120,119,198,0.12),transparent)]" />
@@ -57,14 +66,38 @@ export function BlogWithSearchMagazine({ blogs }: { blogs: BlogEntry[] }) {
 					</h1>
 				</header>
 
-				<MagazineFeatured blog={featured} />
+				<MagazineFeatured blog={coverStory} />
 
-				<MagazineSearchGrid blogs={blogs} featuredSlug={featured.slug} />
+				{secondaryFeatured.length > 0 ? (
+					<MoreFeatured blogs={secondaryFeatured} />
+				) : null}
+
+				<MagazineSearchGrid blogs={blogs} excludedSlugs={heroSlugs} />
 			</Container>
 		</div>
 	);
 }
 
+function MoreFeatured({ blogs }: { blogs: BlogEntry[] }) {
+	return (
+		<section aria-labelledby="more-featured-heading" className="mb-14">
+			<h2
+				id="more-featured-heading"
+				className="mb-6 font-serif text-2xl font-medium text-neutral-900 dark:text-neutral-100"
+			>
+				More featured
+			</h2>
+			<ul className="grid gap-6 sm:grid-cols-2">
+				{blogs.map((blog) => (
+					<li key={blog.slug}>
+						<MagazineCard blog={blog} />
+					</li>
+				))}
+			</ul>
+		</section>
+	);
+}
+
 function MagazineFeatured({ blog }: { blog: BlogEntry }) {
 	return (
 		<Link
@@ -112,10 +145,11 @@ function MagazineFeatured({ blog }: { blog: BlogEntry }) {
 
 function MagazineSearchGrid({
 	blogs: allBlogs,
-	featuredSlug,
+	excludedSlugs,
 }: {
 	blogs: BlogEntry[];
-	featuredSlug: string;
+	/** Slugs already shown above the archive (cover story + "More featured"). */
+	excludedSlugs: Set<string>;
 }) {
 	const [search, setSearch] = useState("");
 
@@ -128,12 +162,15 @@ function MagazineSearchGrid({
 	);
 
 	const gridItems = useMemo(() => {
+		// When the reader is searching, surface every match (including
+		// featured posts they may be looking for); otherwise hide the posts
+		// that are already rendered as featured above the archive.
 		const results = search.trim() ? searcher.search(search) : allBlogs;
 		if (search.trim()) {
 			return results;
 		}
-		return results.filter((b) => b.slug !== featuredSlug);
-	}, [search, searcher, allBlogs, featuredSlug]);
+		return results.filter((b) => !excludedSlugs.has(b.slug));
+	}, [search, searcher, allBlogs, excludedSlugs]);
 
 	return (
 		<section aria-labelledby="archive-heading">
diff --git a/surfsense_web/app/(home)/blog/page.tsx b/surfsense_web/app/(home)/blog/page.tsx
index 2a29c2944..c5b12f936 100644
--- a/surfsense_web/app/(home)/blog/page.tsx
+++ b/surfsense_web/app/(home)/blog/page.tsx
@@ -25,6 +25,8 @@ export interface BlogEntry {
 	image: string;
 	author: string;
 	authorAvatar: string;
+	featured: boolean;
+	featuredOrder?: number;
 }
 
 export default async function BlogPage() {
@@ -38,6 +40,8 @@ export default async function BlogPage() {
 			image?: string;
 			author?: string;
 			authorAvatar?: string;
+			featured?: boolean;
+			featured_order?: number;
 		};
 	}>;
 
@@ -51,8 +55,20 @@ export default async function BlogPage() {
 			image: page.data.image ?? "/og-image.png",
 			author: page.data.author ?? "SurfSense Team",
 			authorAvatar: page.data.authorAvatar ?? "/logo.png",
+			featured: page.data.featured ?? false,
+			featuredOrder: page.data.featured_order,
 		}))
-		.sort((a, b) => new Date(b.date).getTime() - new Date(a.date).getTime());
+		.sort((a, b) => {
+			// Featured first; then by `featured_order` asc within featured;
+			// then by `date` desc as the universal tie-breaker.
+			if (a.featured !== b.featured) return a.featured ? -1 : 1;
+			if (a.featured && b.featured) {
+				const aOrder = a.featuredOrder ?? Number.POSITIVE_INFINITY;
+				const bOrder = b.featuredOrder ?? Number.POSITIVE_INFINITY;
+				if (aOrder !== bOrder) return aOrder - bOrder;
+			}
+			return new Date(b.date).getTime() - new Date(a.date).getTime();
+		});
 
 	return <BlogWithSearchMagazine blogs={blogs} />;
 }
diff --git a/surfsense_web/blog/content/agentic-rag-vs-long-context-llms-benchmark.mdx b/surfsense_web/blog/content/agentic-rag-vs-long-context-llms-benchmark.mdx
new file mode 100644
index 000000000..247536339
--- /dev/null
+++ b/surfsense_web/blog/content/agentic-rag-vs-long-context-llms-benchmark.mdx
@@ -0,0 +1,387 @@
+---
+title: "Agentic RAG vs Long-Context LLMs: A 171-Question Benchmark on 30 Long PDFs"
+description: "We benchmarked agentic RAG against long-context LLMs and native PDF attachment on 171 real questions across 30 long, multimodal PDFs, using Claude Sonnet 4.5 on every arm. Accuracy, cost per query, failure modes, and a vision-LLM-vs-OCR finding the internet still expects to go the other way."
+date: "2026-05-15"
+image: "/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-01-hero-image.png"
+author: "SurfSense Team"
+authorAvatar: "/logo.png"
+tags:
+  - "Agentic RAG"
+  - "Long-Context LLM"
+  - "RAG vs Agentic"
+  - "Vision LLM vs OCR"
+  - "Benchmark"
+  - "Claude Sonnet 4.5"
+  - "MMLongBench-Doc"
+featured: true
+featured_order: 1
+---
+
+> **TL;DR for skimmers**
+>
+> We ran six different ways of answering questions over 30 long, image-heavy PDFs (a total of 171 real questions) using the *same* large language model, Claude Sonnet 4.5, and measured accuracy, cost per question, and how often each approach broke. The result:
+>
+> - **Full-context "long-context" approaches won on raw accuracy** (LlamaCloud premium 59.6%, Azure premium 58.5%).
+> - **Agentic RAG was nearly as accurate (53.2%) at less than half the cost of the premium full-context and native-PDF arms ($0.0827 per question vs $0.19–$0.26)**, with zero failed queries out of 171.
+> - **Most accuracy gaps were not statistically significant.** 12 of 15 head-to-head comparisons could be coin-flips (McNemar test, α = 0.05).
+> - **Vision LLMs did not beat traditional OCR.** Letting Claude read the PDF directly with its built-in vision (the `native_pdf` arm) finished 5th of 6 on accuracy, behind both premium parsers, Azure basic, and agentic RAG, with a stubborn 7% intrinsic failure rate that survived 5 retries with exponential backoff.
+>
+> Practical takeaway: if you are building a long-PDF Q&A product, **agentic RAG is the boring-but-correct default**. Reach for full-context only when the document fits, the budget allows, and the accuracy gain matters. Don't bet on vision LLMs replacing OCR pipelines yet.
+
+<img
+  src="/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-01-hero-image.png"
+  alt="Diagram comparing agentic RAG, long-context LLM, and native PDF pipelines for document question answering."
+  width={1920}
+  height={1080}
+  style={{ width: '100%', height: 'auto', borderRadius: '12px' }}
+/>
+
+## Why this matters: the agentic RAG vs long-context debate
+
+If you are shipping anything that lets a user ask questions about a PDF, whether that is a contract analyser, a research assistant, or an internal docs chatbot, you have hit one of the loudest arguments in AI engineering today.
+
+On one side: **long-context LLMs**. Modern models from Anthropic, OpenAI and Google now accept hundreds of thousands of tokens in a single prompt. Just stuff the whole document in and ask the question. Simple, fast to build.
+
+On the other side: **agentic RAG** (retrieval-augmented generation, where an agent dynamically pulls relevant chunks instead of dumping the whole document into the prompt). More complex, but classically considered cheaper and safer at scale.
+
+Layered on top of that is a quieter argument: **do you even need a document parser anymore?** Frontier models from Anthropic, OpenAI and Google now read PDFs natively using their vision stack. The story everyone wants to be true is that vision-capable LLMs make OCR pipelines obsolete. We tested that story too. Spoiler: not yet.
+
+The internet is full of opinions and very thin on data, especially for *long, multimodal* PDFs (the messy, image-heavy real-world kind). So we built a benchmark with the same model on every arm and measured what actually happens, on both questions at once.
+
+## What is agentic RAG?
+
+Quick definitions, in plain English, then we move to the data.
+
+**RAG (retrieval-augmented generation)** is the standard pattern for letting a language model answer questions about your private documents. You chunk the documents into pieces, store them in a vector database, and at query time you retrieve the chunks most likely to contain the answer and pass them to the model.
+
+**Agentic RAG** is RAG with an LLM agent in the driver's seat. Instead of one fixed retrieval step, the agent can:
+
+- ask itself sub-questions,
+- run multiple searches with different queries,
+- decide when it has enough evidence,
+- ignore irrelevant chunks,
+- and stop when the answer is complete.
+
+Think of vanilla RAG as handing a librarian one note that says *"find me the answer to X"*. Agentic RAG is handing the same librarian a research brief and a clipboard, and letting them walk back and forth between the shelves until the report writes itself.
+
+<img
+  src="/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-02-architecture-diagram.png"
+  alt="Agentic RAG architecture diagram showing an LLM agent iteratively retrieving document chunks before producing a final answer."
+  width={1920}
+  height={1080}
+  style={{ width: '100%', height: 'auto', borderRadius: '12px' }}
+/>
+
+For a 5-minute video walk-through, IBM Technology has the highest-ranked explainer on YouTube right now (325k views; we watched it and found it accurate):
+
+<div style={{ position: 'relative', paddingBottom: '56.25%', height: 0, overflow: 'hidden', borderRadius: '12px', margin: '1.5rem 0' }}>
+  <iframe
+    src="https://www.youtube-nocookie.com/embed/0z9_MhcYvcY"
+    title="What is Agentic RAG? - IBM Technology"
+    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
+    referrerPolicy="strict-origin-when-cross-origin"
+    allowFullScreen
+    style={{ position: 'absolute', top: 0, left: 0, width: '100%', height: '100%', border: 0 }}
+  />
+</div>
+
+## How we ran the benchmark
+
+To make the comparison fair, every arm answered the same questions and used the exact same large language model: **Claude Sonnet 4.5**, called through OpenRouter so the API path was identical.
+
+The dataset was [**MMLongBench-Doc**](https://mayubo2333.github.io/MMLongBench-Doc/) ([paper](https://arxiv.org/abs/2407.01523), [GitHub](https://github.com/mayubo2333/MMLongBench-Doc), [Hugging Face](https://huggingface.co/datasets/yubo2333/MMLongBench-Doc)), an open multimodal-document benchmark of long PDFs with vetted question-answer pairs. The full corpus is 135 PDFs averaging 47.5 pages each, with 1,091 expert-annotated questions across 7 domains (33% cross-page, 22.5% deliberately unanswerable to detect hallucinations). We used the first 30 documents (a mix of research papers, financial filings, product catalogues, and image-heavy reports) and all 171 of their answerable questions.
+
+### Why multimodal documents?
+
+Real-world PDFs are messy. They contain charts, scanned tables, photos, multi-column layouts, and footnotes inside footnotes. A clean text-only benchmark wouldn't tell us anything useful about whether these approaches survive contact with the documents people actually upload to AI products. MMLongBench-Doc was built to include exactly that messiness, which is the territory where parser quality and retrieval strategy actually start to matter. We wanted the benchmark to look like the real inbox of an AI app, not a sanitised research toy.
+
+### Why only 30 documents?
+
+The full MMLongBench-Doc corpus has 135 PDFs. Processing the entire dataset across all six arms would have taken significantly longer on our machine, so we capped the run at 30 documents to keep iteration time reasonable. We're upfront about what that costs us statistically in the significance section below: a bigger sample would have tightened every confidence interval. The findings here should be read as strong directional evidence, not a final verdict.
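+
+To make the sample-size trade-off concrete, here is a minimal Python sketch of a Wilson 95% interval for a single accuracy figure. The function name and signature are ours for illustration (they are not copied from the eval harness), and the 91-of-171 count is simply what the 53.2% agentic-RAG accuracy quoted in this post works out to.
+
+```python
+import math
+
+
+def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
+    """Wilson score interval for a binomial proportion (z = 1.96 is ~95%)."""
+    if n == 0:
+        return (0.0, 1.0)
+    p = successes / n
+    denom = 1 + z**2 / n
+    centre = (p + z**2 / (2 * n)) / denom
+    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
+    return (centre - half, centre + half)
+
+
+# 91 / 171 correct corresponds to the 53.2% accuracy quoted in this post.
+low, high = wilson_ci(91, 171)
+print(f"n=171: {low:.1%} to {high:.1%}")
+
+# Hypothetical: the same accuracy on a 4x larger sample, to show the narrowing.
+low4, high4 = wilson_ci(364, 684)
+print(f"n=684: {low4:.1%} to {high4:.1%}")
+```
+
+With 171 questions the interval spans roughly 7 percentage points either side of the point estimate; quadrupling the sample at the same accuracy would roughly halve that.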
+
+### The six arms
+
+| Arm | What it does | Preprocessing | What goes in the prompt |
+|---|---|---|---|
+| `native_pdf` | Sends the raw PDF file directly to the model | None | The PDF itself, every question |
+| `azure_basic_lc` | Parses the PDF with Azure Document Intelligence (cheap mode) | $1 per 1,000 pages | The whole markdown, every question |
+| `azure_premium_lc` | Same as above, premium parser (preserves layout) | $10 per 1,000 pages | The whole markdown, every question |
+| `llamacloud_basic_lc` | Parses the PDF with LlamaParse (cheap mode) | $1 per 1,000 pages | The whole markdown, every question |
+| `llamacloud_premium_lc` | LlamaParse premium with layout/table preservation | $10 per 1,000 pages | The whole markdown, every question |
+| `surfsense_agentic` | Full agentic RAG pipeline | $10 per 1,000 pages (one-time ingest) | Only the chunks the agent decides to retrieve |
+
+Arms 2-5 (the four `_lc` rows) are what we call **"long-context" or "full-context" stuffing**: parse the PDF once, paste the entire result into every prompt. Arm 6 (`surfsense_agentic`) is the agentic RAG approach. Arm 1, the `native_pdf` "just attach the PDF" pattern, is doing double duty here: it also tests the **"vision LLM replaces OCR" hypothesis**, because instead of any markdown parser, the model reads the PDF directly using its built-in vision capabilities. If vision-capable LLMs are good enough to retire OCR pipelines, this arm should be at the top of the table. (It isn't.)
+
+If you want to read the implementations, every arm lives in [`surfsense_evals/src/surfsense_evals/core/arms/`](https://github.com/MODSetter/SurfSense/tree/main/surfsense_evals/src/surfsense_evals/core/arms) — the [`bare_llm.py`](https://github.com/MODSetter/SurfSense/blob/main/surfsense_evals/src/surfsense_evals/core/arms/bare_llm.py) arm handles full-context stuffing, [`native_pdf.py`](https://github.com/MODSetter/SurfSense/blob/main/surfsense_evals/src/surfsense_evals/core/arms/native_pdf.py) handles vision-LLM PDF attachment, and [`surfsense.py`](https://github.com/MODSetter/SurfSense/blob/main/surfsense_evals/src/surfsense_evals/core/arms/surfsense.py) drives the agentic retrieval against the SurfSense `/api/v1/new_chat` endpoint. The full benchmark suite (prompts, ingest pipeline, runner) lives in [`suites/multimodal_doc/parser_compare/`](https://github.com/MODSetter/SurfSense/tree/main/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/parser_compare).
+
+We graded answers with a [deterministic, format-aware grader](https://github.com/MODSetter/SurfSense/blob/main/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/mmlongbench/grader.py) (1% relative tolerance for floats, F1 over normalised tokens for lists). We logged input/output tokens, cost, latency, and any HTTP error per question.
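+
+As a rough illustration of those two rules, the sketch below shows one way a tolerance-based float check and a token-level F1 could look. The helper names (`float_match`, `token_f1`), the regex tokeniser, and the 0.01 absolute floor are assumptions made for this example; the linked `grader.py` remains the source of truth for the real behaviour.
+
+```python
+import re
+from collections import Counter
+
+
+def float_match(pred: float, gold: float,
+                rel_tol: float = 0.01, abs_floor: float = 0.01) -> bool:
+    """True if pred is within 1% of gold, or within a small absolute floor."""
+    return abs(pred - gold) <= max(rel_tol * abs(gold), abs_floor)
+
+
+def normalise(text: str) -> list[str]:
+    """Lowercase and split into alphanumeric tokens (assumed normalisation)."""
+    return re.findall(r"[a-z0-9]+", text.lower())
+
+
+def token_f1(pred: str, gold: str) -> float:
+    """Harmonic mean of token precision and recall, using multiset overlap."""
+    p, g = normalise(pred), normalise(gold)
+    if not p or not g:
+        return float(p == g)
+    overlap = sum((Counter(p) & Counter(g)).values())
+    if overlap == 0:
+        return 0.0
+    precision = overlap / len(p)
+    recall = overlap / len(g)
+    return 2 * precision * recall / (precision + recall)
+```
+
+Because both checks are pure functions of the prediction and the gold answer, two people grading the same `raw.jsonl` get the same score, which is the main reason to prefer this style over an LLM judge.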
+
+## Headline results: who wins on accuracy?
+
+After running all 171 questions through all 6 arms, then re-running the 37 failed queries with up to 5 attempts of exponential backoff, here is the scoreboard:
+
+<img
+  src="/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-03-accuracy-bar-chart-dark.png"
+  alt="Bar chart of post-retry accuracy on 171 long-PDF questions: LlamaCloud premium 59.6%, Azure premium 58.5%, Azure basic 54.4%, SurfSense agentic RAG 53.2%, Native PDF 52.0%, LlamaCloud basic 50.9%."
+  width={2200}
+  height={1240}
+  style={{ width: '100%', height: 'auto', borderRadius: '12px' }}
+/>
+
+The full table including F1, raw failures, and median latency:
+
+Bolded cell = winner of that column.
+
+| Rank | Arm | Accuracy | F1 | Median latency | Raw failures |
+|---:|---|---:|---:|---:|---:|
+| 1 | LlamaCloud premium, long-context | **59.6%** | **61.1%** | **6.8 s** | 4 |
+| 2 | Azure premium, long-context | 58.5% | 59.6% | 6.9 s | 3 |
+| 3 | Azure basic, long-context | 54.4% | 56.6% | 7.1 s | 1 |
+| 4 | SurfSense agentic RAG | 53.2% | 54.3% | 52.8 s | **0** |
+| 5 | Native PDF attachment | 52.0% | 50.4% | 29.5 s | 27 |
+| 6 | LlamaCloud basic, long-context | 50.9% | 53.2% | 7.1 s | 2 |
+
+A few things jump out:
+
+1. **The two long-context premium parsers win on raw accuracy**, but only by about 6 percentage points over agentic RAG.
+2. **Agentic RAG was the only arm with zero failures** out of 171 questions.
+3. **Native PDF attachment finished 5th of 6 on accuracy, with the lowest F1 (50.4%) and by far the most raw failures (27)**, despite being the most "AI-native" approach. More on why in the failure-mode section.
+4. **Latency on agentic RAG is high (52.8 s)** because the agent does several retrieval rounds. For batch jobs it's fine; for chat UX you'd stream partial results.
+
+Now the part most blog posts skip.
+
+## Cost per query: where agentic RAG wins big
+
+Accuracy is only half the story. Every approach also has a price tag: the LLM call plus the document preprocessing.
+
+<img
+  src="/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-04-cost-vs-accuracy-dark.png"
+  alt="Scatter plot of accuracy versus cost per query for six document-QA approaches; SurfSense agentic RAG sits at the cheapest end with competitive accuracy."
+  width={2200}
+  height={1280}
+  style={{ width: '100%', height: 'auto', borderRadius: '12px' }}
+/>
+
+Bolded cell = winner of that column.
+
+| Arm | Total $/Q | Accuracy |
+|---|---:|---:|
+| SurfSense agentic RAG | **$0.0827** | 53.2% |
+| LlamaCloud basic | $0.1049 | 50.9% |
+| Azure basic | $0.1062 | 54.4% |
+| LlamaCloud premium | $0.1885 | **59.6%** |
+| Azure premium | $0.2051 | 58.5% |
+| Native PDF | $0.2552 | 52.0% |
+
+The headline number: **agentic RAG was the cheapest arm at $0.0827 per question, about 56% cheaper than the most accurate full-context arm (LlamaCloud premium at $0.1885) and about 68% cheaper than native PDF attachment.** The saving comes from the technique rather than from our framework specifically, so it should carry over to other agentic-RAG stacks; we just happened to measure it with ours.
+
+Why is agentic RAG so much cheaper? Because every full-context arm pays the LLM bill *for the entire document on every single question*. A 100-page PDF? You pay for 100 pages of input tokens 10 times if the user asks 10 questions. Both approaches pay the parser once at ingest time, but agentic RAG then sends only the retrieved chunks (often 1–5% of the document) per question.
+
+There is a clean closed-form for this. If a document has *P* pages, a parser costs *c<sub>p</sub>* per page, the LLM costs *c<sub>L</sub>* per full-document call, and the user asks *Q* questions, then full-context cost-per-question is roughly:
+
+```
+Cost/Q ≈ (P × c_p) / Q + c_L
+```
+
+For agentic RAG it is:
+
+```
+Cost/Q ≈ (P × c_p) / Q + c_L × r
+```
+
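+where `r` is the *retrieval ratio* (the fraction of the full-document LLM cost that a retrieval-based answer incurs), typically 0.02 to 0.10. The parsing term is the same in both formulas and shrinks as *Q* grows, so the more questions per document, the more the LLM term decides the bill and the more agentic RAG dominates. For knowledge bases that get queried more than a couple of times, the cumulative saving grows with every question.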
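+
+The two formulas can be sketched in a few lines of Python. The numbers below are illustrative assumptions, not measurements from this benchmark: a 100-page document, $0.01 per parsed page, $0.12 per full-document LLM call, and a retrieval ratio of 0.05.
+
+```python
+def full_context_cost_per_q(pages, parser_cost_per_page, llm_cost_full_doc, questions):
+    """Cost/Q ~ (P * c_p) / Q + c_L"""
+    return pages * parser_cost_per_page / questions + llm_cost_full_doc
+
+
+def agentic_cost_per_q(pages, parser_cost_per_page, llm_cost_full_doc, questions, retrieval_ratio):
+    """Cost/Q ~ (P * c_p) / Q + c_L * r"""
+    return pages * parser_cost_per_page / questions + llm_cost_full_doc * retrieval_ratio
+
+
+# Hypothetical inputs, chosen only to show the shape of the curves.
+P, C_P, C_L, R = 100, 0.01, 0.12, 0.05
+
+for q in (1, 5, 20):
+    full = full_context_cost_per_q(P, C_P, C_L, q)
+    agentic = agentic_cost_per_q(P, C_P, C_L, q, R)
+    print(f"Q={q:>2}  full-context=${full:.3f}  agentic=${agentic:.3f}  ratio={full / agentic:.1f}x")
+```
+
+The absolute gap per question stays at `c_L × (1 − r)`, but because the shared parsing term shrinks with *Q*, the ratio between the two keeps growing as more questions are asked.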
+
+## Failure modes: what 37 broken queries taught us
+
+We did not just count successes. We logged every error.
+
+Of 1,026 total `(arm, question)` cells, 37 returned no answer on the first pass. We then re-ran *only* those 37 with up to 5 attempts of exponential backoff (the [`retry_failed_questions.py`](https://github.com/MODSetter/SurfSense/blob/main/surfsense_evals/scripts/retry_failed_questions.py) script in the harness). The results separated **transient** (network/server) failures from **intrinsic** (the approach actually cannot do this) failures:
+
+Bolded cell = best result on that column (lowest failure rate, highest recovery rate).
+
+| Arm | First-pass failures | Recovered on retry | Intrinsic failures | Intrinsic failure rate |
+|---|---:|---:|---:|---:|
+| All 4 long-context arms (combined) | 10 | **10 (100%)** | **0** | **0%** |
+| Native PDF | 27 | 15 | 12 | 7.0% |
+| SurfSense agentic RAG | **0** | n/a | **0** | **0%** |
+
+Two findings worth highlighting:
+
+**1. Long-context "context overflow" was a myth.** We hypothesised that the long-context arms might be silently failing because the document didn't fit in the context window. We tested it: the failures clustered around HTTP/SSL errors (the request body was up to 30 MB, riding the public internet), not token limits. Once we retried, all 10 came back successfully. The Claude Sonnet 4.5 context window held up fine; the *transport layer* wobbled.
+
+**2. Native PDF has a stubborn 7% intrinsic failure rate.** Two specific PDFs broke it permanently:
+
+- a 27-page image-heavy PDF whose binary exceeded the provider's 30 MB request-body cap (6 questions broken);
+- a 166-page PDF whose response stream the provider could never reliably terminate (5 questions, repeated `empty stream` errors).
+
+Even with 5 attempts of exponential backoff, those 12 questions stayed broken. **For any production app that processes PDFs from arbitrary users, that is a 7% "this document cannot be answered today" rate**, which is unacceptable for most product flows.
+
+Agentic RAG sidesteps both problems because it never sends the raw PDF and never sends the entire document context in one giant request.
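+
+The transient-versus-intrinsic split above falls out of a simple retry loop. The sketch below is not the code in `retry_failed_questions.py`; the helper name `retry_with_backoff`, the delay values, and the jitter are assumptions chosen for illustration. A call that still fails after the last attempt returns `None`, which is what we would count as an intrinsic failure.
+
+```python
+import random
+import time
+from typing import Callable, Optional, TypeVar
+
+T = TypeVar("T")
+
+
+def retry_with_backoff(
+    call: Callable[[], T],
+    max_attempts: int = 5,
+    base_delay: float = 1.0,   # assumed starting delay, in seconds
+    max_delay: float = 30.0,   # assumed cap on any single wait
+    sleep: Callable[[float], None] = time.sleep,
+) -> Optional[T]:
+    """Run call() up to max_attempts times, doubling the wait after each failure."""
+    for attempt in range(max_attempts):
+        try:
+            return call()
+        except Exception:
+            if attempt == max_attempts - 1:
+                return None  # still failing after the final attempt
+            delay = min(max_delay, base_delay * 2 ** attempt)
+            # Small random jitter so parallel retries do not fire in lockstep.
+            sleep(delay + random.uniform(0, delay * 0.1))
+    return None
+```
+
+Passing `sleep` as a parameter keeps the helper easy to test: a unit test can inject a no-op instead of actually waiting.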
+
+### What this means for the vision-LLM-vs-OCR debate
+
+Bigger picture, the `native_pdf` numbers settle a question we wanted to answer: **on long, image-heavy PDFs, vision-capable LLMs reading the document directly did not outperform plain OCR plus markdown.** They came in 5th of 6 on accuracy (52.0% vs 50.9% to 59.6% for the OCR-based pipelines), were the most expensive arm at $0.2552 per question, and failed 7% of the time even after retries. Premium OCR with layout extraction held up better on the exact pages where you would expect vision to shine, the chart-and-table-heavy ones.
+
+The point is not that vision LLMs are bad. They are remarkable. The point is that the parser pipeline you already maintain is not yet obsolete, and the "skip the parser, attach the PDF" shortcut is not a free lunch.
+
+## Statistical significance: are these results actually different?
+
+This is the section most benchmarks omit, and it changes the conclusions.
+
+We ran McNemar's exact-binomial test on every pair of arms (15 pairs total) using [`compute_blog_extras.py`](https://github.com/MODSetter/SurfSense/blob/main/surfsense_evals/scripts/compute_blog_extras.py). McNemar is the right test for paired classifier comparisons: both arms answered the *same* questions, so we can ask: "of the questions where the two arms disagreed, did one really win more often than chance?"
+
+The result: **only 3 of 15 pairs are distinguishable at α = 0.05**.
+
+The three statistically-significant gaps are all between the *worst* arms (LlamaCloud basic, Native PDF) and the *best* arms (LlamaCloud premium, Azure premium). The most interesting comparison, **SurfSense agentic RAG vs the long-context premium arms**, does *not* clear the significance bar. The 6-point gap could plausibly be sample noise.
+
+In other words: on this dataset, the headline claim "long-context beats agentic RAG by 6 percentage points" is real on the scoreboard but **not statistically robust**. Run the same benchmark on a different sample of 30 PDFs and the order could shuffle. This is also the place where our 30-document scope bites us: a bigger run would have given more comparisons enough power to settle.
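+
+For readers who want to run the same check on their own results, here is a minimal standard-library sketch of the exact-binomial McNemar test. It is not the code in `compute_blog_extras.py`; the function name `mcnemar_exact` and the example counts are hypothetical.
+
+```python
+from math import comb
+
+
+def mcnemar_exact(b: int, c: int) -> float:
+    """Two-sided exact-binomial McNemar p-value.
+
+    b = questions arm A got right and arm B got wrong
+    c = questions arm B got right and arm A got wrong
+    Questions where both arms agree carry no information and are ignored.
+    """
+    n = b + c
+    if n == 0:
+        return 1.0
+    k = min(b, c)
+    # Under H0 each discordant question is a fair coin flip: Binomial(n, 0.5).
+    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
+    return min(1.0, 2 * tail)
+
+
+# Hypothetical discordant counts: an 11-question net lead out of 171 questions.
+p_value = mcnemar_exact(30, 19)
+print(f"p = {p_value:.3f}")
+```
+
+With these made-up counts the p-value lands above 0.05, which shows how a lead of roughly six points on 171 questions can fail to clear the bar when the two arms disagree on only a few dozen questions.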
+
+## When to choose what: a decision framework
+
+Reading the data without an action plan gets you only half the value. Here is how we would decide for a real product, using the same numbers.
+
+<img
+  src="/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-05-decision-tree.png"
+  alt="Decision tree showing when to choose agentic RAG, long-context full-context, or native PDF attachment for document question answering."
+  width={1920}
+  height={1080}
+  style={{ width: '100%', height: 'auto', borderRadius: '12px' }}
+/>
+
+### Use **agentic RAG** when
+
+- documents are long and mixed in size (some 5 pages, some 500),
+- the same documents will be queried more than once or twice,
+- cost per query matters (SaaS pricing, large user base),
+- you cannot afford intermittent failures on big PDFs,
+- you need to scale to corpora of thousands of documents.
+
+This is the default for most production AI products.
+
+### Use **long-context (full-context) LLMs** when
+
+- documents reliably fit in the context window after parsing (typically under ~150 pages of text),
+- the accuracy gain (6 percentage points in our benchmark, or zero, depending on which way the noise goes) actually justifies the roughly 130–150% extra cost of the premium-parsed arms,
+- you have one or two questions per document, not dozens,
+- you can absorb occasional network failures on large request bodies.
+
+Premium parsing matters here. **Spending $10 per 1,000 pages on a layout-aware parser is worth it**: it gave us +4 to +9 accuracy points over basic parsers on the same questions.
+
+### Use **native PDF attachment** when
+
+- you are prototyping and want to ship in an afternoon,
+- documents are small and well-formatted,
+- you can tolerate a 7% failure rate (or you have validated the specific PDFs you care about don't trip the limits).
+
+Don't use it as the default for user-uploaded PDFs in production. The 30 MB request-body cap and unstable response streams will bite you, and exponential backoff will not save you.
+
+## Frequently asked questions
+
+<Accordion type="multiple" className="w-full not-prose">
+  <AccordionItem value="faq-1">
+    <AccordionTrigger>What is agentic RAG?</AccordionTrigger>
+    <AccordionContent className="flex flex-col gap-4 text-balance">
+      Agentic RAG is retrieval-augmented generation where an LLM agent (not a fixed pipeline) decides what to retrieve, when to stop, and how to combine evidence. Instead of one search and one answer, the agent can run multiple retrievals, refine its query, and iterate. It usually costs less than full-context prompting and handles arbitrarily large document collections.
+    </AccordionContent>
+  </AccordionItem>
+  <AccordionItem value="faq-2">
+    <AccordionTrigger>How is agentic RAG different from traditional RAG?</AccordionTrigger>
+    <AccordionContent className="flex flex-col gap-4 text-balance">
+      Traditional RAG runs a single, fixed retrieval step: take the user's question, find the top-k similar chunks, send them to the LLM. Agentic RAG lets the LLM plan, retrieve repeatedly, evaluate intermediate results, and decide when it has enough context. It is more flexible at the cost of more LLM calls, and it tends to outperform vanilla RAG on multi-hop or ambiguous queries.
+    </AccordionContent>
+  </AccordionItem>
+  <AccordionItem value="faq-3">
+    <AccordionTrigger>When should I use long-context LLMs instead of RAG?</AccordionTrigger>
+    <AccordionContent className="flex flex-col gap-4 text-balance">
+      When the document fits in the model's context window after parsing, you have a small number of queries per document, accuracy matters more than cost, and you can tolerate occasional transport-layer failures on multi-megabyte requests. In our benchmark, full-context premium parsers led on accuracy (about 58–60%) but cost roughly 2.3–2.5× as much per query as agentic RAG.
+    </AccordionContent>
+  </AccordionItem>
+  <AccordionItem value="faq-4">
+    <AccordionTrigger>What is a long context window?</AccordionTrigger>
+    <AccordionContent className="flex flex-col gap-4 text-balance">
+      A long context window is the maximum amount of text (measured in tokens) that an LLM can read in a single prompt. Modern frontier models support 200,000 tokens or more, which is roughly 150,000 words or 300+ printed pages. A long context window enables "just stuff the whole document in" approaches, but it does not eliminate the need for RAG when corpora exceed what one prompt can hold or when cost matters.
+    </AccordionContent>
+  </AccordionItem>
+  <AccordionItem value="faq-5">
+    <AccordionTrigger>How do you benchmark agentic RAG?</AccordionTrigger>
+    <AccordionContent className="flex flex-col gap-4 text-balance">
+      Run the same set of real-world questions through each approach using the *same* underlying LLM, log accuracy with a deterministic grader, log cost (LLM + preprocessing), log latency, and run pairwise McNemar tests for statistical significance. We used 171 questions across 30 long PDFs from MMLongBench-Doc.
+    </AccordionContent>
+  </AccordionItem>
+  <AccordionItem value="faq-6">
+    <AccordionTrigger>How much does agentic RAG cost per query?</AccordionTrigger>
+    <AccordionContent className="flex flex-col gap-4 text-balance">
+      In our benchmark, agentic RAG cost **$0.0827 per question** end-to-end (including a one-time premium parsing cost amortised across all questions for the document). The cheapest full-context arm cost $0.1049 (about 27% more); the most expensive arm overall, native PDF attachment, cost $0.2552 (about 3.1× as much). Cost per query for agentic RAG drops further as you ask more questions per document.
+    </AccordionContent>
+  </AccordionItem>
+  <AccordionItem value="faq-7">
+    <AccordionTrigger>Is RAG dead now that we have long context?</AccordionTrigger>
+    <AccordionContent className="flex flex-col gap-4 text-balance">
+      No, and this benchmark is part of the evidence. Long-context wins on raw accuracy by a small margin that is mostly within statistical noise, but RAG (especially agentic RAG) wins on cost per query, on robustness to large or malformed documents, and on horizontal scaling to large corpora. The right answer is "use the cheapest pattern that hits your accuracy target", which for most production apps is agentic RAG.
+    </AccordionContent>
+  </AccordionItem>
+  <AccordionItem value="faq-8">
+    <AccordionTrigger>Do vision LLMs outperform OCR for PDF question answering?</AccordionTrigger>
+    <AccordionContent className="flex flex-col gap-4 text-balance">
+      Not in our benchmark. The `native_pdf` arm, which lets Claude Sonnet 4.5 read each PDF directly using its native vision capabilities, finished 5th of 6 with 52.0% accuracy and a 7% intrinsic failure rate. Three of the four OCR-based full-context pipelines (Azure Document Intelligence basic and premium, and LlamaCloud premium) beat it on accuracy at lower cost; LlamaCloud basic scored about a point lower (50.9%) but cost less than half as much per question. Premium OCR with layout extraction held up especially well on chart-heavy and table-heavy pages, the exact territory where you would expect a vision model to dominate. Vision-capable LLMs may catch up as the models improve, but as of mid-2026, the safer default for long, multimodal PDFs is still parser plus markdown.
+    </AccordionContent>
+  </AccordionItem>
+</Accordion>
+
+## What this means for your AI app
+
+If you are choosing an architecture for a long-PDF Q&A product *today*:
+
+1. **Start with agentic RAG.** It is the cheapest, most robust default and gets you within statistical noise of full-context approaches.
+2. **Pay for premium parsing once.** Whether you choose RAG or full-context, layout-aware parsing buys you real accuracy points. The marginal cost is trivial against the LLM bill.
+3. **Avoid plain "attach the PDF" in production** unless you have validated every document path. The 7% intrinsic failure rate is real and not retry-able.
+4. **Don't trust accuracy gaps under 5–6 points** unless you have tested for significance. McNemar takes 30 seconds in Python and saves embarrassing benchmark posts.
+5. **Don't bet on vision LLMs replacing OCR yet.** On 30 long, image-heavy PDFs, the native PDF (vision LLM) path lost to every OCR-based pipeline except LlamaCloud basic on accuracy and was the most expensive arm at $0.2552 per question. The OCR pipeline you already maintain is not obsolete.
+
+## Reproduce this benchmark
+
+Everything that produced these numbers is open source. The eval harness is its own package inside the SurfSense monorepo:
+
+- [`surfsense_evals/`](https://github.com/MODSetter/SurfSense/tree/main/surfsense_evals) — the harness root (extensible base classes, providers, cost ledger).
+- [`suites/multimodal_doc/parser_compare/`](https://github.com/MODSetter/SurfSense/tree/main/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/parser_compare) — the benchmark used in this post (prompts, ingest, runner).
+- [`core/arms/bare_llm.py`](https://github.com/MODSetter/SurfSense/blob/main/surfsense_evals/src/surfsense_evals/core/arms/bare_llm.py), [`native_pdf.py`](https://github.com/MODSetter/SurfSense/blob/main/surfsense_evals/src/surfsense_evals/core/arms/native_pdf.py), [`surfsense.py`](https://github.com/MODSetter/SurfSense/blob/main/surfsense_evals/src/surfsense_evals/core/arms/surfsense.py) — the three arm implementations.
+- [`mmlongbench/grader.py`](https://github.com/MODSetter/SurfSense/blob/main/surfsense_evals/src/surfsense_evals/suites/multimodal_doc/mmlongbench/grader.py) — the deterministic format-aware grader.
+- [`scripts/retry_failed_questions.py`](https://github.com/MODSetter/SurfSense/blob/main/surfsense_evals/scripts/retry_failed_questions.py) — the failed-only retry pass with exponential backoff.
+- [`scripts/compute_blog_extras.py`](https://github.com/MODSetter/SurfSense/blob/main/surfsense_evals/scripts/compute_blog_extras.py) — the McNemar pairwise tests, latency/token percentiles, and per-PDF heterogeneity.
+- [`scripts/compute_post_retry_accuracy.py`](https://github.com/MODSetter/SurfSense/blob/main/surfsense_evals/scripts/compute_post_retry_accuracy.py) — merges retry survivors back into the run and recomputes the headline numbers.
+
+A minimal end-to-end run looks like:
+
+```bash
+# 1. Clone, install (uv recommended)
+git clone https://github.com/MODSetter/SurfSense
+cd SurfSense/surfsense_evals
+uv sync --extra dev
+
+# 2. Configure provider keys (Azure DI, LlamaCloud, OpenRouter, SurfSense)
+cp .env.example .env
+$EDITOR .env
+
+# 3. Ingest the first 30 PDFs from MMLongBench-Doc into all parsers
+uv run python -m surfsense_evals.cli setup multimodal_doc \
+  --vision-llm anthropic/claude-sonnet-4.5
+uv run python -m surfsense_evals.cli ingest multimodal_doc \
+  --suite parser_compare --max-docs 30
+
+# 4. Run all six arms × all 171 questions
+uv run python -m surfsense_evals.cli run multimodal_doc \
+  --suite parser_compare --sample-per-doc 20 --concurrency 2
+
+# 5. Retry failures + compute final stats
+uv run python scripts/retry_failed_questions.py
+uv run python scripts/compute_post_retry_accuracy.py
+uv run python scripts/compute_blog_extras.py
+```
+
+The dataset itself is on [Hugging Face](https://huggingface.co/datasets/yubo2333/MMLongBench-Doc) and the [original GitHub repo](https://github.com/mayubo2333/MMLongBench-Doc) (NeurIPS 2024 D&B Spotlight, [paper](https://arxiv.org/abs/2407.01523)). Bring your own LLM provider; swap `anthropic/claude-sonnet-4.5` for `openai/gpt-4o`, `google/gemini-2.5-pro`, or any OpenRouter slug to repeat the experiment with a different model.
+
+If you find that the rankings shuffle on your own document set, we want to hear about it. Open an issue on [the SurfSense repo](https://github.com/MODSetter/SurfSense/issues) with the run artifacts and we will link your results from this post.
+
+The harness is also extensible beyond swapping models: wire your own RAG framework (LangChain, LlamaIndex, Haystack, your own stack) into the [`Arm` base class](https://github.com/MODSetter/SurfSense/blob/main/surfsense_evals/src/surfsense_evals/core/arms/base.py) and you can drop it into the same comparison without changing the rest of the pipeline.
+
+If you want a hosted way to try agentic RAG on your own PDFs without writing the harness yourself, [SurfSense](/free) is one option (it is the same agentic stack that powered the `surfsense_agentic` arm above).
diff --git a/surfsense_web/blog/content/no-login-ai-privacy-reality-check.mdx b/surfsense_web/blog/content/no-login-ai-privacy-reality-check.mdx
new file mode 100644
index 000000000..4eb87c9f3
--- /dev/null
+++ b/surfsense_web/blog/content/no-login-ai-privacy-reality-check.mdx
@@ -0,0 +1,289 @@
+---
+title: "How to Use Claude, ChatGPT, and Gemini Without Signing Up: A Plain-English 2026 Guide"
+description: "Where to use Claude, ChatGPT, Gemini, and other top AI models without making an account. Honest 2026 guide for casual users and developers, covering message caps, what each tool limits, and the privacy reality behind 'no login required'."
+date: "2026-05-15"
+image: "/images/blog/no-login-ai-privacy-reality-check/placeholder-01-no-login-vs-no-tracking-hero.png"
+author: "SurfSense Team"
+authorAvatar: "/logo.png"
+tags:
+  - "AI Without Login"
+  - "Free AI Chat"
+  - "ChatGPT Without Account"
+  - "Claude Without Login"
+  - "Gemini Without Login"
+  - "Claude Incognito"
+  - "Duck.ai"
+  - "Brave Leo"
+  - "Self-Hosted AI"
+featured: false
+---
+
+> **TL;DR for skimmers**
+>
+> You don't need an account to use the best AI models in 2026. Here's the brand-by-brand answer:
+>
+> - **Want ChatGPT?** Open `chatgpt.com` in a fresh tab. Guest mode works, no signup, but the message cap is tight.
+> - **Want Claude?** `claude.ai` itself still wants an account, but Anthropic [shipped incognito mode on April 9, 2026](https://support.claude.com/en/articles/12260368-using-incognito-chats). For zero-account Claude, use [Duck.ai](https://duck.ai) (Claude Haiku 4.5, anonymized) or [Brave Leo](https://brave.com/leo/) inside the Brave browser.
+> - **Want Gemini?** Google requires a Google account on `gemini.google.com`. The closest no-signup path is an open-source aggregator like our own [SurfSense /free](/free), which lists Gemini among its model options without a Google sign-in.
+> - **Want all of them in one place?** Our open-source [SurfSense /free](/free) lets you pick from ChatGPT, Claude, Gemini, DeepSeek, Mistral, Llama, and a rotating list of other models with no account, and you get 500,000 free tokens to spend across any of them. [Duck.ai](https://duck.ai) and [Brave Leo](https://brave.com/leo/) are strong privacy-first alternatives.
+> - **Care about privacy too?** Most "no login" pages still log your IP and prompt content. The exceptions are Brave Leo, Duck.ai, and self-hosted models. Skip to [the privacy honest-talk](#the-privacy-honest-talk-no-login-is-not-the-same-as-anonymous) if that's the part you came for.
+
+<img
+  src="/images/blog/no-login-ai-privacy-reality-check/placeholder-01-no-login-vs-no-tracking-hero.png"
+  alt="Side-by-side illustration of a browser with a 'No login required' badge and the same browser with hidden IP, fingerprint, and prompt-log data being inspected, showing that no login is not the same as no tracking."
+  width={1920}
+  height={1080}
+  style={{ width: '100%', height: 'auto', borderRadius: '12px' }}
+/>
+
+The promise of the search query "AI without login" is simple: someone wants to chat with a top model without making an account. Maybe they're trying it out for the first time. Maybe they don't want another email in their inbox. Maybe they care about privacy. Maybe they're recommending an AI to a class, a parent, or a coworker who isn't going to deal with a signup wall.
+
+Whatever the reason, the answer in 2026 is mostly: **yes, you can do this**, but the options are scattered across product pages, browser features, and a long tail of wrapper sites that look identical. This guide is the cleanest map we could draw, organised so a casual user gets value in the first three minutes and a developer or privacy-conscious reader can keep going deeper.
+
+We're going to cite primary sources for everything that touches privacy or product behavior, so you can verify (and so the article ages well as the products change). The two cluster posts that go deeper into [Claude specifically](/blog/use-claude-without-login-2026) and into [a tested comparison of 12 free AI chats](/blog/tested-no-login-ai-chats-2026) will sit alongside this one when they're ready.
+
+## Where to use each major AI model without an account
+
+Let's go brand by brand. For each one, what works, what the cap looks like, and where to go.
+
+### ChatGPT, with no OpenAI account
+
+The shortest answer: open [chatgpt.com](https://chatgpt.com) in a private/incognito tab. ChatGPT's guest mode lets you send a few prompts and get GPT-5-class output without making an account. There's no specific page to go to; the site detects you're not logged in and gives you a guest experience automatically.
+
+**What it gives you:** the same headline GPT-5 model that paying users start with, for short queries. The interface is the standard ChatGPT UI without the sidebar or chat history panel.
+
+**What it doesn't give you:** file uploads, the Code Interpreter, conversation history (refresh the page and your chat is gone), or advanced features like custom GPTs and memory.
+
+**The catch:** the guest cap is around 10 messages per 5-hour rolling window on the headline GPT-5-class model ([per OpenAI's current behavior](https://www.smashingapps.com/how-to-use-chatgpt-without-creating-an-account/)). After that you're auto-downgraded to the lighter GPT-5 Mini variant with no hard limit, not blocked by a hard "please sign up" wall. So you can keep going indefinitely on the smaller model, you just can't keep using the headline one.
+
+If GPT-5 quality on a no-signup page matters more than going through OpenAI specifically, [Bing Copilot](https://copilot.microsoft.com) at `copilot.microsoft.com` runs on a GPT-5-class backend, also works without a Microsoft account, and tends to have a more generous cap because Microsoft monetises through Bing search instead of subscriptions.
+
+You can also use **[SurfSense /free](/free)** (full disclosure: this is our own open-source aggregator), which lists ChatGPT among its no-signup models. You get 500,000 free tokens to spend across any model on the page, which is meaningfully more than what guest mode gives you on the headline model before it downgrades you. The source code is on [GitHub](https://github.com/MODSetter/SurfSense) so the privacy and quota behavior is auditable, not just promised.
+
+### Claude, with as little friction as possible
+
+Anthropic does require an account on `claude.ai` itself, but the picture is much better than it was a year ago. There are now four legitimate paths to using Claude without going through the full signup-and-stay-logged-in experience.
+
+**Path 1: Anthropic's incognito mode (existing account required).** Launched April 9, 2026, available on every Claude plan from Free to Enterprise. Click the ghost icon in the upper-right when starting a new chat. The interface gets a black border and an "Incognito chat" label. The conversation is not saved to your chat history, not used by Claude's memory feature, and not used for training. Source: [Anthropic Help Center](https://support.claude.com/en/articles/12260368-using-incognito-chats). This is the right answer if you already have a Claude account and want a temporary, no-trace conversation.
+
+**Path 2: Duck.ai (no account at all).** [duck.ai](https://duck.ai) is DuckDuckGo's chat product. Pick "Claude 4.5 Haiku" from the model dropdown and start chatting. No signup, no email. DuckDuckGo proxies your request through their own servers, so Anthropic never sees your IP. We'll cover the full privacy mechanics [below](#the-privacy-honest-talk-no-login-is-not-the-same-as-anonymous). There is a per-session cap, but no persistent quota.
+
+**Path 3: Brave Leo (no account, browser-side).** Install the [Brave browser](https://brave.com), open the sidebar, click the Leo icon, pick Claude Haiku from the model dropdown. No signup. Brave doesn't collect identifiers tied to you ([per their docs](https://brave.com/leo/)). The trade-off is that you have to use Brave as your browser, and you're limited to Haiku on the free tier (Sonnet and Opus require Brave Leo Premium at $14.99/month).
+
+**Path 4: Multi-model aggregator pages.** These wrap the Anthropic API and serve Claude responses without an account on Anthropic. The pick we'd recommend (with the obvious disclosure that we made it) is **[SurfSense /free](/free)**: it lists Claude alongside ChatGPT, Gemini, DeepSeek, Mistral, and Llama in one chat UI, the source code is open on [GitHub](https://github.com/MODSetter/SurfSense) so the privacy and quota behavior is verifiable, and the 500K free token quota is shared across any model you pick (so you can spend the budget on Claude if that's what you came for). The closed-source alternatives ([HIX.AI](https://hix.ai/claude), [EaseMate](https://www.easemate.ai/ai-chat/ask-claude), [Eye2.ai](https://eye2.ai), [NoteGPT](https://notegpt.io/ai-models/claude-sonnet-4-5)) work too, but quality and message limits vary widely; we [tested 12 of these](/blog/tested-no-login-ai-chats-2026) in a separate post.
+
+For the developer-specific paths (Claude Code with Bedrock or Vertex AI authentication, the Claude for Open Source program, the `/passes` Guest Pass system), see our [Claude-specific deep dive](/blog/use-claude-without-login-2026).
+
+### Gemini, the awkward one
+
+Google requires a Google account on `gemini.google.com` and there is no first-party guest mode. If you don't already use a Google account, this is the model with the most signup friction.
+
+The realistic options:
+
+- **Sign in with a throwaway Google account.** Imperfect but functional, and Google's free tier on Gemini is genuinely good (15 GB of Drive storage, deep web research, native voice).
+- **Use an aggregator that wraps Gemini.** **[SurfSense /free](/free)** (disclosure: ours, open source on [GitHub](https://github.com/MODSetter/SurfSense)) lists Gemini among its model options and forwards requests to Google's API behind the scenes, so the user-facing chat works with no Google sign-in and no Google identity tied to your prompts. Quality matches the underlying Gemini API tier we pay for. Other wrapper pages do the same thing but few publish their privacy or quota behavior; ours is auditable in source, and the 500K-token quota is shared across any model on the page (not just Gemini).
+- **Pick a different model.** If "I want long-context plus web search without signing in" is the actual need, Brave Leo with Claude Haiku, Bing Copilot, or Perplexity (more on that below) are no-signup substitutes for the most common Gemini use cases.
+
+### Multiple models in one tab
+
+If you want to compare answers from Claude, GPT, Gemini, and others without juggling four browser tabs and four signup walls, three products do this without an account:
+
+- **[SurfSense /free](/free)** (disclosure: ours, open source on [GitHub](https://github.com/MODSetter/SurfSense)) gives you a rotating list of models from OpenAI, Anthropic, Google, DeepSeek, Mistral, and Meta in one chat UI, no account, with **500,000 free tokens shared across any model** on the page (meaningfully more than the per-session caps you'll hit on Duck.ai or Brave Leo before they make you wait). The model lineup updates as new models ship, and the wrapper-layer code is on GitHub so the no-account session, the no-database-storage claim, and the quota behavior are all auditable. Trade-off: when you hit the 500K, you either sign up ($5 free credit, then $1 per $1 of model cost) or self-host.
+- **[Duck.ai](https://duck.ai)** has Claude 4.5 Haiku, Llama 4 Scout, Mistral Small 3 24B, GPT-4o mini, GPT-5 mini, and gpt-oss-120b on the free tier. The big win is genuine IP anonymisation; the trade-off is that everything is the cheaper tier of each model.
+- **[Brave Leo](https://brave.com/leo/)** has Claude Haiku, Llama 3.1 8B, Mixtral, and Qwen 3 14B in the Brave browser sidebar. Best privacy story of the three, but you have to use Brave as your browser.
+
+### Honourable mentions
+
+- **Perplexity** at [perplexity.ai](https://www.perplexity.ai) lets you do search-flavored AI lookups without an account. The interface is built around citations, which is great for fact-finding and not great for chat-style writing or code.
+- **DeepAI** at [deepai.org/chat](https://deepai.org/chat) and **HotBot AI** are older standalone chat products. They work, the model quality is below the frontier, and the privacy story is unremarkable.
+- **HIX.AI, EaseMate, NoteGPT, TalkAI, ChatBot Chat App** are wrapper sites that all do roughly the same thing: front a paid model API and serve it for free without an account. Use them if convenience is the only goal, with the caveat that none of them publicly commit to not retaining your prompts (more on that below).
+
+## What you give up by not making an account
+
+There's a real trade-off between "skip the signup" and "have a good account experience". For most casual use, skipping the signup is worth it. For some workflows it isn't.
+
+Things you lose without an account:
+
+- **Saved chat history.** Most no-signup paths don't persist your conversation. Refresh the page and it's gone. (Brave Leo is an exception: it stores chat history locally on your device, so it survives between sessions but never leaves your machine. Duck.ai does the same with its "Recent Chats" feature.)
+- **File uploads, in most cases.** ChatGPT guest mode and Claude incognito do not let you attach a PDF or image. Duck.ai and Brave Leo are limited too. Aggregators vary.
+- **Generous message limits.** ChatGPT guest mode caps fast. Claude on `claude.ai` lets account holders send 30-50 messages per 5-hour rolling window; guest paths are usually tighter.
+- **Cross-device continuity.** No signup means no syncing your conversations from laptop to phone.
+- **Power features.** No memory, no custom instructions, no Code Interpreter, no Anthropic Projects, no Gemini Workspace integrations.
+
+Things you gain:
+
+- **No marketing emails.** You're not on a list. You won't get the "you haven't tried our new feature!" emails or the retargeting ads that follow you around the web.
+- **No persistent identity.** The provider sees a session, not a user. Your prompts aren't tied to your purchase history, your YouTube viewing habits, or any other product the provider runs.
+- **No risk of accidental cross-account leakage.** A coworker who borrows your laptop sees a fresh chat, not your private history.
+
+For a quick prompt or a one-off question, the gains usually win. For sustained work, expect to either tolerate the limits or eventually sign up.
+
+## The privacy honest-talk: "no login" is not the same as "anonymous"
+
+Here's the part that surprises most readers, and the one most listicles dodge: skipping the signup does not make your AI usage anonymous. It removes one identifier (your account) but leaves several others in place. Whether that matters depends entirely on what you're pasting into the chat box.
+
+The honest categorisation of no-signup AI looks like this.
+
+<img
+  src="/images/blog/no-login-ai-privacy-reality-check/placeholder-02-three-tier-pyramid.png"
+  alt="A 3-tier pyramid showing: bottom tier 'genuinely anonymized' with Brave Leo, Duck.ai, and self-hosted Ollama; middle tier 'no account but provider logs' with chatgpt.com guest, Claude incognito, and Bing Copilot; top tier 'wrapper sites that also log you' with HIX, EaseMate, NoteGPT examples."
+  width={1920}
+  height={1080}
+  style={{ width: '100%', height: 'auto', borderRadius: '12px' }}
+/>
+
+**Anonymised tools** (the smallest group). The provider commits to not logging your IP and not retaining your prompts. There are essentially three options today:
+
+- **Brave Leo.** Per [Brave's product page](https://brave.com/leo/): "We do not collect identifiers that can be linked to you (such as IP Address), and no personal data is retained." Translation: the model provider that powers Leo (Anthropic, Mistral, Meta) sees the prompt content but not your IP.
+- **Duck.ai.** Per [DuckDuckGo's documentation](https://duckduckgo.com/duckduckgo-help-pages/duckai/): DuckDuckGo "anonymizes" your request by stripping your IP and replacing it with their own before forwarding to the underlying model. They don't store prompts, don't train on your data, and your "Recent Chats" are stored locally on your device.
+- **Self-hosted open-weights models.** Llama 3.3, Qwen 2.5, Mistral Large, DeepSeek R1 running locally on your laptop via [Ollama](https://ollama.com) or [LM Studio](https://lmstudio.ai). The model never leaves your machine. The only entity that sees your data is you. The "How to" is in [the section below](#for-developers-and-the-privacy-serious-self-host-in-5-minutes).
+
+**Account-free first-party tools.** No signup required, but the provider running the model still sees your IP, your prompt text, and your session metadata. The standard examples:
+
+- **`chatgpt.com` guest mode** is OpenAI logging your prompts directly. Their privacy policy applies. PCMag's [ChatGPT Tracks More Than You Think](https://www.pcmag.com/explainers/chatgpt-tracks-more-than-you-think-how-to-lock-down-your-privacy) is a good summary of what this looks like.
+- **Claude incognito mode** (which still needs an existing Claude account, so it is "no history" rather than "no signup") is Anthropic logging your prompts, with the specific feature carve-out that incognito chats aren't saved to your visible history, aren't used by Claude's memory, and aren't used for training. **The important caveat**: per [Anthropic's docs](https://support.claude.com/en/articles/12260368-using-incognito-chats), incognito chats "are retained for either 30 days (default), or longer in accordance with your organization's custom data retention setting (available for Enterprise plans)." Not in your history, not actually deleted either.
+- **Bing Copilot** is Microsoft logging your prompts. Standard Microsoft privacy policy applies.
+
+**Wrapper sites.** These don't require an account, but they log you on top of the underlying provider's logging. Most of the "free Claude!" and "free GPT!" pages from the search results are in this group. They serve a real model, but they're a server in the middle that has its own logs, and most of them don't publicly commit to not retaining your prompts. Convenient. Not private.
+
+When you're evaluating any "free [Brand] without login" page, the question to ask is: *does their privacy page explicitly say they don't store prompts and don't pass identifying information to the model provider?* If the answer is just "no signup required!" with nothing about logging, you're in the wrapper category.
+
+**Open-source wrappers** are a half-step better than the closed ones, and worth calling out as a separate category. Our own [SurfSense /free](/free) is in this bucket: the source code that handles your prompt is on [GitHub](https://github.com/MODSetter/SurfSense), so the claims about anonymous sessions, no persistent identity, and no prompt retention are auditable rather than promised. That doesn't make it equivalent to Brave Leo or Duck.ai (the model provider behind /free still receives the prompt content), but it does mean you can verify the wrapper layer doesn't add its own logging on top. If you're going to use a wrapper anyway, prefer one whose code you can read.
+
+### What every public AI provider logs (with citations)
+
+This is the reference table the wrapper sites don't include. Sources are linked in each cell that needs one.
+
+<img
+  src="/images/blog/no-login-ai-privacy-reality-check/placeholder-03-what-providers-log-conceptual.png"
+  alt="A comparison illustration showing two flows: a 'standard no-login AI' lane where data flows past IP, prompt, and session logging stages into a tall stack of provider logs; and an 'anonymized AI' lane where a shield strips the IP before the data reaches a tiny single-sheet record."
+  width={1920}
+  height={1080}
+  style={{ width: '100%', height: 'auto', borderRadius: '12px' }}
+/>
+
+| Service | IP logged | Prompt content logged | Used for training | Retention |
+|---|---|---|---|---|
+| ChatGPT (guest, no login) | Yes | Yes | OpenAI may use guest prompts for service improvement | Per OpenAI privacy policy |
+| Claude (logged-in, normal) | Yes | Yes | No on Free/Pro/Max per [Anthropic privacy center](https://privacy.claude.com); stricter on Team/Enterprise | Per account settings |
+| Claude (incognito) | Yes | Yes (but not in your history) | No, per [Anthropic docs](https://support.claude.com/en/articles/12260368-using-incognito-chats) | 30 days default, longer for Enterprise |
+| Gemini (Google account) | Yes | Yes | Per Google account settings | Per Google account settings |
+| Bing Copilot (no account) | Yes | Yes | Per Microsoft privacy policy | Per Microsoft privacy policy |
+| Brave Leo | **No** ("no identifiers linkable to you" per [brave.com/leo](https://brave.com/leo/)) | Forwarded to model, not retained by Brave | No, per Brave docs | None per Brave docs |
+| Duck.ai | **No** (DuckDuckGo strips IP before forwarding) | Forwarded to model, not retained by DuckDuckGo | No, per [DuckDuckGo docs](https://duckduckgo.com/duckduckgo-help-pages/duckai/) | Local-only chat history |
+| Self-hosted Ollama | n/a (local only) | n/a (local only) | n/a | None unless you save it |
+| Wrapper sites (HIX, EaseMate, NoteGPT, etc.) | Yes (by wrapper) + downstream provider | Yes (by both wrapper and provider) | Depends on wrapper TOS | Depends on wrapper TOS |
+
+A few takeaways from the table that the existing search results never make explicit:
+
+- **Brave Leo and Duck.ai are the only mainstream products that publicly commit to not logging your IP.** They achieve this with a server-side proxy (DuckDuckGo) or a privacy-first browser pipeline (Brave). The model provider never sees your real IP; it sees the proxy's.
+- **Claude's incognito mode is honest about its limits.** Incognito chats are still retained for 30 days. That's a thoughtful safety design, not an anonymity promise.
+- **Wrapper sites add a layer of logging on top of the model provider's logging.** That's strictly worse for privacy than going to the model provider's first-party guest mode, even if it removes the account requirement.
+
+## Pick the right tool for what you're actually doing
+
+The right tool depends on what you're pasting into the chat box. Three rough buckets cover almost every reader.
+
+<img
+  src="/images/blog/no-login-ai-privacy-reality-check/placeholder-04-decision-tree-by-use-case.png"
+  alt="A decision tree showing which no-login AI tool to pick based on what you are pasting: casual content can use any tool, sensitive code or customer data needs anonymized tools or self-hosting, and regulated data needs enterprise contracts or self-hosting only."
+  width={1920}
+  height={1080}
+  style={{ width: '100%', height: 'auto', borderRadius: '12px' }}
+/>
+
+### If you're a casual user with non-sensitive prompts
+
+Homework explanations, recipe ideas, brainstorming, drafts of emails, quick summaries of public articles, code from public GitHub repos. Your prompt isn't interesting to anyone with subpoena power. Privacy isn't really the constraint; convenience and model quality are.
+
+**What to use:** anything in this guide. Pick by which model you want.
+
+- For ChatGPT-like answers, `chatgpt.com` guest mode or Bing Copilot.
+- For Claude, [Duck.ai](https://duck.ai) (Haiku, anonymised) or an aggregator like [SurfSense /free](/free) which lists Claude among its model options.
+- For multi-model comparison in one tab, [SurfSense /free](/free) or [Duck.ai](https://duck.ai).
+
+### If you work with code, internal docs, or customer data
+
+Code that includes API keys, internal class names, or business logic; documents your employer hasn't published; conversations that involve customer details. The IP and the prompt content matter here.
+
+**What to use:** the anonymised category only. Brave Leo, Duck.ai, or a self-hosted model. If you must use a first-party guest mode, redact ruthlessly before pasting. Avoid wrapper sites for this kind of prompt.
+
+### If you're handling regulated or contractually sensitive data
+
+PHI under HIPAA, attorney-client privileged matter, financial data under SOX or GLBA, EU personal data under GDPR, anything covered by an NDA. The legal exposure is severe and the answer is not a free chat product.
+
+**What to use:** self-host an open-weights model on hardware you control, or use an enterprise contract with a BAA / DPA in place (Anthropic Enterprise, OpenAI Enterprise, Google Cloud Vertex AI). Public free chat is not an acceptable channel here, regardless of whether it asks for a login.
+
+## For developers and the privacy-serious: self-host in 5 minutes
+
+This section is for technical readers. If you're a casual user, you can skip it; the answers above are enough.
+
+Self-hosting an open-weights model is the only path where "private" means private in the strict sense. Your prompt content never leaves your machine. There is no provider, no logging, no retention, no training-on-your-data risk. And it's much easier than it used to be.
+
+1. Install [Ollama](https://ollama.com) (one-click installer for macOS, Windows, Linux).
+2. Open a terminal and run:
+   ```bash
+   ollama run llama3.1:8b
+   ```
+3. The first run downloads the model (about 5 GB). Subsequent runs start instantly.
+4. Type a prompt at the `>>>` prompt. You're chatting with a local model.
+
+Quality is genuinely competitive for most casual use. Llama 3.1 8B handles writing, summarisation, and general Q&A well. (Llama 3.3 ships only as a 70B model, which is why the command above uses the 3.1 tag for the 8B size.) For better quality, swap to `qwen2.5:14b` or `mistral-small:24b` if you have 16+ GB of RAM. For coding-specific work, `deepseek-coder-v2` is a strong open-weights option.
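+
+If you'd rather call the local model from a script than from the terminal, Ollama also exposes an HTTP API on your machine. The snippet below is a minimal sketch, not an official client: it assumes Ollama is listening on its default port (11434) and that the model tag you pass has already been pulled. The names `askLocalModel` and `FetchLike` are ours, made up for illustration; the fetch function is passed in so the helper works with any runtime that provides one.
+
+```typescript
+// Hypothetical helper: sends one prompt to a local Ollama server and returns the reply text.
+// FetchLike describes only the small part of the fetch API that this sketch needs.
+type FetchLike = (
+  url: string,
+  init: { method: string; headers: Record<string, string>; body: string },
+) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;
+
+// Assumption: Ollama is running locally on its default port.
+const OLLAMA_URL = "http://localhost:11434/api/generate";
+
+async function askLocalModel(
+  prompt: string,
+  model: string,
+  fetchImpl: FetchLike,
+): Promise<string> {
+  const res = await fetchImpl(OLLAMA_URL, {
+    method: "POST",
+    headers: { "Content-Type": "application/json" },
+    // stream: false asks for a single JSON object instead of a token stream.
+    body: JSON.stringify({ model, prompt, stream: false }),
+  });
+  if (!res.ok) {
+    throw new Error(`Ollama returned HTTP ${res.status}`);
+  }
+  // The generated text is in the `response` field of the returned JSON.
+  const data = (await res.json()) as { response?: string };
+  return data.response ?? "";
+}
+
+// Example call in a runtime with a global fetch (Node 18+ or a browser):
+//   askLocalModel("Summarise this paragraph: ...", "qwen2.5:14b", fetch).then(console.log);
+```
+
+The request goes to `localhost`, so the prompt stays on your machine, same as with the terminal session.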
+
+If you want a graphical interface instead of a terminal, install [LM Studio](https://lmstudio.ai). For a hosted-but-self-controlled experience, the open-source SurfSense stack on the [GitHub repo](https://github.com/MODSetter/SurfSense) gives you the same chat UI with the same model options, running on your own servers.
+
+For the deeper performance trade-offs between local and frontier models on real document Q&A, our [agentic RAG vs long-context LLMs benchmark](/blog/agentic-rag-vs-long-context-llms-benchmark) has the numbers.
+
+## FAQ
+
+### Can I use ChatGPT without an account?
+
+Yes. Open `chatgpt.com` in a private tab and you'll get guest mode automatically. You get around 10 messages on the headline GPT-5-class model per 5-hour rolling window. After that you're switched automatically to a lighter GPT-5 Mini variant with no hard limit, rather than blocked by a signup wall. No file uploads, no chat history, no Code Interpreter, but for short queries the model quality is the same as the paid first-tier experience.
+
+### Can I use Claude without an account?
+
+Not on `claude.ai` itself, which still requires signup. The closest no-account paths are [Duck.ai](https://duck.ai) (Claude Haiku 4.5, free, anonymised), [Brave Leo](https://brave.com/leo/) (Claude Haiku in the Brave browser sidebar), and aggregator pages like our open-source [SurfSense /free](/free), which lists Claude among the models you can pick with no Anthropic account and a 500K free token budget shared across the whole page. For more, see our [Claude-specific guide](/blog/use-claude-without-login-2026).
+
+### Can I use Gemini without a Google account?
+
+Not on Google's own product pages. Aggregator sites like our open-source [SurfSense /free](/free) include Gemini among the models you can pick and forward requests to the Gemini API behind the scenes, so the user-facing chat works without a Google sign-in. If you specifically want what Gemini is best at (long-context, web research, Workspace integration), there isn't a perfect Google-free substitute, though Brave Leo with Claude Haiku and Perplexity cover most use cases.
+
+### What is Claude incognito mode?
+
+A feature [Anthropic launched on April 9, 2026](https://support.claude.com/en/articles/12260368-using-incognito-chats), available on every Claude plan from Free to Enterprise. Click the ghost icon when starting a new chat. The conversation isn't saved to your history, isn't pulled into Claude's memory, and isn't used for training. It still requires an existing Claude account, and the conversation is retained for 30 days for safety. Useful if you have a Claude account and want a temporary one-off chat.
+
+### Is using AI without an account actually private?
+
+Not by itself. "No login" removes one identifier (your account), but the model provider still sees your IP and the content of your prompt. For actual anonymity, use Brave Leo, Duck.ai, or a self-hosted open-weights model. The privacy section above explains the categories in detail.
+
+### Does ChatGPT guest mode keep my data private?
+
+No. Guest mode is slightly less private than logged-in mode, not more. OpenAI still logs your IP and prompt content, and logged-out users have fewer opt-out controls than logged-in free-tier users. If your prompt is something you'd be uncomfortable seeing on someone else's screen, treat ChatGPT guest mode as recorded.
+
+### What's the most private AI chatbot in 2026?
+
+A self-hosted open-weights model running locally via Ollama or LM Studio. Among hosted options, Brave Leo and Duck.ai are the two that publicly commit to not logging your IP and not retaining your prompts.
+
+### Are wrapper sites that say "free Claude" or "free GPT" safe?
+
+They're convenient, not private. Most "free [Brand] without login" pages are servers that wrap the underlying API and serve responses for free. They don't ask you to sign up, but each one is a third party in the middle that keeps its own logs, on top of the model provider's logs. Use them for casual prompts you'd be fine with showing up in someone else's database.
+
+### What's the difference between Duck.ai and a regular ChatGPT wrapper?
+
+Duck.ai is one of only two mainstream chat products (Brave Leo is the other) that publicly document an anonymisation model: DuckDuckGo proxies your request, strips your IP, doesn't retain prompts, and doesn't train on your data. Standard wrapper sites commit to none of these things. All they do is drop the signup form.
+
+### Is Brave Leo really free with no login?
+
+Yes. Per [Brave's documentation](https://brave.com/leo/), no account or signup is required for the free tier, and Brave doesn't collect identifiers tied to you. The free tier includes Claude Haiku, Llama 3.1 8B, Mixtral, and Qwen 3 14B. Premium ($14.99/month) adds Claude Sonnet 4 and DeepSeek R1.
+
+### How can a developer avoid the browser login flow entirely?
+
+For Anthropic specifically: configure Claude Code to authenticate via Amazon Bedrock, Google Vertex AI, or Microsoft Foundry per the [Claude Code authentication docs](https://code.claude.com/docs/en/authentication). No browser login required, IAM-only auth. For OpenAI and Google, the standard answer is API keys with cloud-provider IAM and IP allow-listing. For full local control, the [self-hosting section](#for-developers-and-the-privacy-serious-self-host-in-5-minutes) covers Ollama and LM Studio.
+
+## The bottom line
+
+The question "can I use a top AI model without an account" has a much better answer in 2026 than it did a year ago. Anthropic added incognito mode, Duck.ai added free Claude Haiku with real anonymisation, Brave Leo grew into a credible browser-side option, and the multi-model aggregators got cheaper to run.
+
+If you just want to chat: pick the brand you want, use the path from the relevant section above, and be done. If you care about privacy: stick to Brave Leo, Duck.ai, or a self-hosted model, and remember that "no signup" alone doesn't make a tool anonymous. If you're handling sensitive or regulated data: don't use a free chat product at all, use an enterprise contract or run the model yourself.
+
+And if you want a single no-account chat hub that lets you pick from ChatGPT, Claude, Gemini, DeepSeek, Mistral, Llama, and a rotating list of others under one URL with the wrapper-layer code open on [GitHub](https://github.com/MODSetter/SurfSense), that's exactly what we built **[SurfSense /free](/free)** for. The pitch: 500,000 free tokens shared across any model on the page, no account, anonymous sessions not stored in any database, and the model lineup updates whenever new models ship. It's not the right answer for every reader (if you need IP anonymisation specifically, Brave Leo or Duck.ai is still the better fit), but it is a genuine, honest pick, and we'd rather list it confidently than pretend we don't make it. Whichever you choose, the goal of this guide was to give you the honest map first.
diff --git a/surfsense_web/components/homepage/navbar.tsx b/surfsense_web/components/homepage/navbar.tsx
index 04b011c2d..d8e4828bd 100644
--- a/surfsense_web/components/homepage/navbar.tsx
+++ b/surfsense_web/components/homepage/navbar.tsx
@@ -37,7 +37,7 @@ export const Navbar = ({ scrolledBgClassName }: NavbarProps = {}) => {
 	const navItems = [
 		{ name: "Free\u00A0AI", link: "/free" },
 		{ name: "Pricing", link: "/pricing" },
-		// { name: "Blog", link: "/blog" },
+		{ name: "Blog", link: "/blog" },
 		{ name: "Changelog", link: "/changelog" },
 		{ name: "Docs", link: "/docs" },
 		{ name: "Contact\u00A0Us", link: "/contact" },
diff --git a/surfsense_web/pnpm-workspace.yaml b/surfsense_web/pnpm-workspace.yaml
index 5f1b93969..b33ba7657 100644
--- a/surfsense_web/pnpm-workspace.yaml
+++ b/surfsense_web/pnpm-workspace.yaml
@@ -1,3 +1,6 @@
+packages:
+  - "."
+
 allowBuilds:
   "@parcel/watcher": true
   "@rocicorp/zero-sqlite3": true
diff --git a/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-01-hero-image.png b/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-01-hero-image.png
new file mode 100644
index 000000000..68ef7934e
Binary files /dev/null and b/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-01-hero-image.png differ
diff --git a/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-02-architecture-diagram.png b/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-02-architecture-diagram.png
new file mode 100644
index 000000000..13bf60204
Binary files /dev/null and b/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-02-architecture-diagram.png differ
diff --git a/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-03-accuracy-bar-chart-dark.png b/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-03-accuracy-bar-chart-dark.png
new file mode 100644
index 000000000..ebe443d6d
Binary files /dev/null and b/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-03-accuracy-bar-chart-dark.png differ
diff --git a/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-03-accuracy-bar-chart-light.png b/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-03-accuracy-bar-chart-light.png
new file mode 100644
index 000000000..187e514f4
Binary files /dev/null and b/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-03-accuracy-bar-chart-light.png differ
diff --git a/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-04-cost-vs-accuracy-dark.png b/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-04-cost-vs-accuracy-dark.png
new file mode 100644
index 000000000..79717ffcb
Binary files /dev/null and b/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-04-cost-vs-accuracy-dark.png differ
diff --git a/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-04-cost-vs-accuracy-light.png b/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-04-cost-vs-accuracy-light.png
new file mode 100644
index 000000000..0d6943a9c
Binary files /dev/null and b/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-04-cost-vs-accuracy-light.png differ
diff --git a/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-05-decision-tree.png b/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-05-decision-tree.png
new file mode 100644
index 000000000..3c083e6dd
Binary files /dev/null and b/surfsense_web/public/images/blog/agentic-rag-vs-long-context-llms-benchmark/placeholder-05-decision-tree.png differ
diff --git a/surfsense_web/public/images/blog/no-login-ai-privacy-reality-check/placeholder-01-no-login-vs-no-tracking-hero.png b/surfsense_web/public/images/blog/no-login-ai-privacy-reality-check/placeholder-01-no-login-vs-no-tracking-hero.png
new file mode 100644
index 000000000..b7e4c8880
Binary files /dev/null and b/surfsense_web/public/images/blog/no-login-ai-privacy-reality-check/placeholder-01-no-login-vs-no-tracking-hero.png differ
diff --git a/surfsense_web/public/images/blog/no-login-ai-privacy-reality-check/placeholder-02-three-tier-pyramid.png b/surfsense_web/public/images/blog/no-login-ai-privacy-reality-check/placeholder-02-three-tier-pyramid.png
new file mode 100644
index 000000000..3bf0be0b8
Binary files /dev/null and b/surfsense_web/public/images/blog/no-login-ai-privacy-reality-check/placeholder-02-three-tier-pyramid.png differ
diff --git a/surfsense_web/public/images/blog/no-login-ai-privacy-reality-check/placeholder-03-what-providers-log-conceptual.png b/surfsense_web/public/images/blog/no-login-ai-privacy-reality-check/placeholder-03-what-providers-log-conceptual.png
new file mode 100644
index 000000000..33c4e24e5
Binary files /dev/null and b/surfsense_web/public/images/blog/no-login-ai-privacy-reality-check/placeholder-03-what-providers-log-conceptual.png differ
diff --git a/surfsense_web/public/images/blog/no-login-ai-privacy-reality-check/placeholder-04-decision-tree-by-use-case.png b/surfsense_web/public/images/blog/no-login-ai-privacy-reality-check/placeholder-04-decision-tree-by-use-case.png
new file mode 100644
index 000000000..ce1917068
Binary files /dev/null and b/surfsense_web/public/images/blog/no-login-ai-privacy-reality-check/placeholder-04-decision-tree-by-use-case.png differ
diff --git a/surfsense_web/source.config.ts b/surfsense_web/source.config.ts
index 28d9c2ea4..7ba4b89b5 100644
--- a/surfsense_web/source.config.ts
+++ b/surfsense_web/source.config.ts
@@ -25,6 +25,12 @@ export const blog = defineDocs({
 			author: z.string().default("SurfSense Team"),
 			authorAvatar: z.string().optional(),
 			tags: z.array(z.string()).optional(),
+			// Pin this post into the featured section above the archive grid.
+			// Multiple posts can be featured at once; ordering within the
+			// featured section follows `featured_order` ascending and falls
+			// back to `date` descending.
+			featured: z.boolean().optional().default(false),
+			featured_order: z.number().optional(),
 		}),
 	},
 });