feat(evals): publish multimodal_doc parser_compare benchmark + n=171 report

Adds the full parser_compare experiment for the multimodal_doc suite: six arms compared on 30 PDFs / 171 questions from MMLongBench-Doc with anthropic/claude-sonnet-4.5 across the board. Source code: - core/parsers/{azure_di,llamacloud,pdf_pages}.py: direct parser SDK callers (Azure Document Intelligence prebuilt-read/layout, LlamaParse parse_page_with_llm/parse_page_with_agent) used by the LC arms, bypassing the SurfSense backend so each (basic/premium) extraction is a clean A/B independent of backend ETL routing. - suites/multimodal_doc/parser_compare/{ingest,runner,prompt}.py: six-arm benchmark (native_pdf, azure_basic_lc, azure_premium_lc, llamacloud_basic_lc, llamacloud_premium_lc, surfsense_agentic) with byte-identical prompts per question, deterministic grader, Wilson CIs, and the per-page preprocessing tariff cost overlay. Reproducibility: - pyproject.toml + uv.lock pin pypdf, azure-ai-documentintelligence, llama-cloud-services as new deps. - .env.example documents the AZURE_DI_* and LLAMA_CLOUD_API_KEY env vars now required for parser_compare. - 12 analysis scripts under scripts/: retry pass with exponential backoff, post-retry accuracy merge, McNemar / latency / per-PDF stats, context-overflow hypothesis test, etc. Each produces one number cited by the blog report. Citation surface: - reports/blog/multimodal_doc_parser_compare_n171_report.md: 1219-line technical writeup (16 sections) covering headline accuracy, per-format accuracy, McNemar pairwise significance, latency / token / per-PDF distributions, error analysis, retry experiment, post-retry final accuracy, cost amortization model with closed-form derivation, threats to validity, and reproducibility appendix. - data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/{raw, raw_retries,raw_post_retry}.jsonl + run_artifact.json + retry summary whitelisted via data/.gitignore as the verifiable numbers source. Gitignore: - ignore logs_*.txt + retry_run.log; structured artifacts cover the citation surface, debug logs are noise. - data/.gitignore default-ignores everything, whitelists the n=171 run artifacts only (parser manifest left ignored to avoid leaking local Windows usernames in absolute paths; manifest is fully regenerable via 'ingest multimodal_doc parser_compare'). - reports/.gitignore now whitelists hand-curated reports/blog/. Also retires the abandoned CRAG Task 3 implementation (download script, streaming Task 3 ingest, CragTask3Benchmark + tests) and trims the runner / ingest module APIs to match. Co-authored-by: Cursor <cursoragent@cursor.com>
2026-07-06 22:12:12 +02:00 · 2026-05-14 19:54:41 -07:00 · 2026-05-14 19:54:41 -07:00 · 9bcd50164d
commit 9bcd50164d
parent 3737118050
40 changed files with 9303 additions and 993 deletions
--- a/surfsense_evals/data/.gitignore
+++ b/surfsense_evals/data/.gitignore
@ -1,2 +1,22 @@
+# Default: don't track anything under data/ (large benchmarks, PDFs,
+# extracted markdown, ingestion caches, per-run artifacts can balloon).
 *
+
+# Always keep this gitignore file
 !.gitignore
+
+# Whitelist the artifacts the n=171 multimodal_doc / parser_compare blog
+# report (`reports/blog/multimodal_doc_parser_compare_n171_report.md`)
+# cites by path. These are the verifiable numbers source for the report.
+#
+# Path components have to be unblocked one level at a time because git
+# does not look into a directory whose parent is ignored.
+!multimodal_doc/
+!multimodal_doc/runs/
+!multimodal_doc/runs/2026-05-14T00-53-19Z/
+!multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/
+!multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw.jsonl
+!multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_retries.jsonl
+!multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_retries_summary.json
+!multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_post_retry.jsonl
+!multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/run_artifact.json
--- a/surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw.jsonl
+++ b/surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw.jsonl
--- a/surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_post_retry.jsonl
+++ b/surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_post_retry.jsonl
--- a/surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_retries.jsonl
+++ b/surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_retries.jsonl
@ -0,0 +1,37 @@
+{"qid": "05-03-18-political-release.pdf::Q000", "doc_id": "05-03-18-political-release.pdf", "arm": "native_pdf", "answer_format": "int", "gold": "541", "pages": 17, "document_id": 5219, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T01:50:15Z", "latency_ms": 20985, "error": null, "raw_text_chars": 696}]}, "question_id": "05-03-18-political-release.pdf::Q000", "answer_letter": null, "raw_text": "I need to find information about how many adults rated Trump's government ethical standards as \"poor\" in the survey conducted April 25-May 1, 2018.\n\nLooking through the document, I find relevant information on page 4 (document page 6):\n\nThe document states: \"About four-in-ten Americans rate the ethical standards of Trump administration officials as excellent (9%) or good (30%), while 58% say they are not good (21%) or poor (36%).\"\n\nThe survey was conducted among 1,503 adults (as stated in the methodology section on page 14).\n\nTo find the number of adults who rated the ethical standards as \"poor\":\n36% of 1,503 = 0.36 \u00d7 1,503 = 541.08\n\nRounding to the nearest whole number: 541\n\nAnswer: 541", "citations": [], "input_tokens": 33894, "output_tokens": 191, "cost_micros": 104547, "latency_ms": 20968, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": "stop", "pdf_filename": "05-03-18-political-release.pdf"}, "graded": {"correct": true, "f1": 1.0, "method": "int_eq", "normalised_pred": "541", "normalised_gold": "541"}}
+{"qid": "05-03-18-political-release.pdf::Q002", "doc_id": "05-03-18-political-release.pdf", "arm": "native_pdf", "answer_format": "list", "gold": "['Make good decisions about economic policy ', 'Make wise decisions about immigration policy ']", "pages": 17, "document_id": 5219, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T01:50:16Z", "latency_ms": 20453, "error": null, "raw_text_chars": 544}]}, "question_id": "05-03-18-political-release.pdf::Q002", "answer_letter": null, "raw_text": "Looking at page 5 of the document, which shows \"Public's confidence in Trump to handle several policy areas mixed,\" I can see the data for \"very confident\" and \"not at all confident\" percentages across different domains.\n\nFor \"Very confident\" (highest percentage):\n- Negotiate favorable trade agreements with other countries: 30%\n\nFor \"Not at all confident\" (highest percentage):\n- Make wise decisions about immigration policy: 55%\n\nAnswer: Make wise decisions about immigration policy, Negotiate favorable trade agreements with other countries", "citations": [], "input_tokens": 33942, "output_tokens": 117, "cost_micros": 103581, "latency_ms": 20438, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": "stop", "pdf_filename": "05-03-18-political-release.pdf"}, "graded": {"correct": false, "f1": 0.5, "method": "list_set", "normalised_pred": "make wise decisions about immigration policy, negotiate favorable trade agreements with other countries", "normalised_gold": "make good decisions about economic policy, make wise decisions about immigration policy"}}
+{"qid": "0b85477387a9d0cc33fca0f4becaa0e5.pdf::Q005", "doc_id": "0b85477387a9d0cc33fca0f4becaa0e5.pdf", "arm": "native_pdf", "answer_format": "int", "gold": "6", "pages": 16, "document_id": 5220, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T01:50:36Z", "latency_ms": 41750, "error": null, "raw_text_chars": 1479}]}, "question_id": "0b85477387a9d0cc33fca0f4becaa0e5.pdf::Q005", "answer_letter": null, "raw_text": "I need to count the pictures in the document that contain only one person.\n\nLet me go through the document page by page:\n\nPage 1:\n- Photo of Dr. Xiaoyan \"Cathy\" Zou with Rear Adm. (Ret.) Tim Ziemer - 2 people\n\nPage 2:\n- Photo of Capt. John W. Sanders - 1 person\n- Photo of Capt. Steven M. Wechsler - 1 person\n\nPage 3:\n- Photo of Afghan trainees in laboratory - multiple people\n\nPage 4:\n- Group photo of Pacific Partnership 2012 Internal Medicine Department staff - multiple people\n- Photo of Cmdr. Charmagne Beckett on flight deck - 1 person\n\nPage 5:\n- Photo of Petty Officer 1st Class Dennis Gonzales - 1 person\n\nPage 6:\n- Photo of Lance Cpl. Kip Boker and Cpl. Jacob Echeverri - 2 people\n\nPage 7:\n- Photo with Lt. Cmdr. Jennifer Curry, Capt. Buhari Oyofo, Dr. Walter T. Gwenigale, Lt. Joseph Diclaro, and Dr. Fatorma Bolay - 5 people\n- Photo with Lt. Cmdr. Jennifer Curry, Capt. Chris Martinez, Capt. Buhari Oyofo, Col. Vernon Graham, and Lt. Joseph Diclaro - 5 people\n\nPage 8:\n- Group photo of Kazakh scientists and NMRC staff - multiple people\n\nPage 9:\n- Photo of Lt. j.g. Michael Rucker treating a child - 2 people\n- Photo of U.S. Marines and Sailors in aircraft - multiple people\n\nPage 10:\n- Group photo of Joint Combat Casualty Research Team - multiple people\n\nPage 12:\n- Photo of NMRC 2012 Dining Out Event - multiple people\n- Photo with Rear Adm. Bruce Doll, Capt. John Sanders and others - multiple people\n\nPage 13:\n- Photo with Ellen Flannery, Emmanuel Adu-Gyamfi, Eil", "citations": [], "input_tokens": 40387, "output_tokens": 512, "cost_micros": 128841, "latency_ms": 41735, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": "length", "pdf_filename": "0b85477387a9d0cc33fca0f4becaa0e5.pdf"}, "graded": {"correct": false, "f1": 0.0, "method": "int_eq", "normalised_pred": "None", "normalised_gold": "6"}}
+{"qid": "2005.12872v3.pdf::Q005", "doc_id": "2005.12872v3.pdf", "arm": "native_pdf", "answer_format": "int", "gold": "36", "pages": 26, "document_id": 7577, "original_error": "JSONDecodeError: Expecting value: line 169 column 1 (char 924)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T01:50:36Z", "latency_ms": 36889, "error": null, "raw_text_chars": 1214}]}, "question_id": "2005.12872v3.pdf::Q005", "answer_letter": null, "raw_text": "I need to find information about DETR's default structure, specifically the number of multi-head self-attention layers.\n\nLet me search through the document for information about DETR's architecture.\n\nFrom page 2, I can see that DETR has a transformer encoder-decoder architecture.\n\nOn page 10 (Section 4.1), I find: \"To be comparable in the number of parameters we choose a model with 6 transformer and 6 decoder layers of width 256 with 8 attention heads.\"\n\nLooking at the detailed architecture description on page 22 (Section A.3) and Figure 10, I can see that:\n- The encoder has multiple layers, each containing multi-head self-attention\n- The decoder has multiple layers, each containing multi-head self-attention AND multi-head attention (cross-attention)\n\nFrom page 9-10, the baseline model uses \"6 encoder, 6 decoder layers\"\n\nEach encoder layer has one multi-head self-attention layer.\nEach decoder layer has one multi-head self-attention layer and one multi-head cross-attention layer.\n\nSo the total number of multi-head self-attention layers (not counting cross-attention) is:\n- 6 encoder layers \u00d7 1 self-attention per layer = 6\n- 6 decoder layers \u00d7 1 self-attention per layer = 6\n- Total = 12\n\nAnswer: 12", "citations": [], "input_tokens": 59866, "output_tokens": 318, "cost_micros": 184368, "latency_ms": 36875, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": "stop", "pdf_filename": "2005.12872v3.pdf"}, "graded": {"correct": false, "f1": 0.0, "method": "int_eq", "normalised_pred": "12", "normalised_gold": "36"}}
+{"qid": "2306.05425v1.pdf::Q002", "doc_id": "2306.05425v1.pdf", "arm": "llamacloud_basic_lc", "answer_format": "int", "gold": "6", "pages": 22, "document_id": 7587, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T01:51:13Z", "latency_ms": 11328, "error": null, "raw_text_chars": 1376}]}, "question_id": "2306.05425v1.pdf::Q002", "answer_letter": null, "raw_text": "I need to carefully examine the document for bar plots.\n\nLooking through the figures in the document:\n\n1. **Figure 1**: Shows example images with text descriptions - no bar plots\n2. **Figure 2**: Shows data format comparison with images and text - no bar plots\n3. **Figure 3**: Shows the Syphus overview diagram - no bar plots\n4. **Figure 4**: This appears to show statistics and analysis. Let me examine it carefully:\n   - Part (a): Shows verb-noun structure for instructions - this appears to be a bar chart\n   - Part (b): Shows verb-noun structure for responses - this appears to be a bar chart\n   - Part (c): Shows \"Diversity in our dataset\" with what appears to be distribution plots\n\nLooking at Figure 4 more carefully:\n- (a) contains what looks like a horizontal bar chart showing verbs and nouns from instructions\n- (b) contains what looks like a horizontal bar chart showing verbs and nouns from responses\n- (c) contains multiple distribution plots showing lengths and counts\n\n5. **Figure 5**: Shows response examples - no bar plots\n6. **Figure 6**: The document mentions Fig. 6(b) and Fig. 6(c) but these are not fully visible in the provided pages\n\nFrom what I can see clearly in Figure 4:\n- Subfigure (a): 1 bar plot (showing instruction verb-noun structure)\n- Subfigure (b): 1 bar plot (showing response verb-noun structure)\n\nTotal visible bar plots: 2\n\nAnswer: 2", "citations": [], "input_tokens": 20240, "output_tokens": 354, "cost_micros": 66030, "latency_ms": 11328, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "finish_reason": "stop"}, "graded": {"correct": false, "f1": 0.0, "method": "int_eq", "normalised_pred": "2", "normalised_gold": "6"}}
+{"qid": "2307.09288v2.pdf::Q000", "doc_id": "2307.09288v2.pdf", "arm": "native_pdf", "answer_format": "int", "gold": "5", "pages": 77, "document_id": 7588, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T01:51:18Z", "latency_ms": 52514, "error": null, "raw_text_chars": 860}]}, "question_id": "2307.09288v2.pdf::Q000", "answer_letter": null, "raw_text": "I'll examine pages 5-10 to count the tables.\n\nPage 5: No tables visible in the text content shown.\n\nPage 6: Table 1 is present - \"Llama 2 family of models\" showing training data, parameters, context length, GQA, tokens, and LR for Llama 1 and Llama 2 models.\n\nPage 7: \n- Table 2: \"CO2 emissions during pretraining\" showing time, power consumption, and carbon emitted for different Llama 2 models.\n- Table 3: \"Overall performance on grouped academic benchmarks\" comparing MPT, Falcon, Llama 1, and Llama 2 models.\n\nPage 8:\n- Table 4: \"Comparison to closed-source models on academic benchmarks\" showing results for GPT-3.5, GPT-4, PaLM, PaLM-2-L, and Llama 2.\n- Table 5: \"SFT annotation\" showing examples of helpfulness and safety annotations.\n\nPage 9: No complete tables visible in the shown content.\n\nPage 10: No tables visible in the shown content.\n\nAnswer: 5", "citations": [], "input_tokens": 199213, "output_tokens": 280, "cost_micros": 601839, "latency_ms": 52468, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": "stop", "pdf_filename": "2307.09288v2.pdf"}, "graded": {"correct": true, "f1": 1.0, "method": "int_eq", "normalised_pred": "5", "normalised_gold": "5"}}
+{"qid": "2307.09288v2.pdf::Q002", "doc_id": "2307.09288v2.pdf", "arm": "native_pdf", "answer_format": "list", "gold": "['Vicuna13b-v1.3', 'PaLM Bison', 'Vicuna13b-v1.1']", "pages": 77, "document_id": 7588, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T01:51:25Z", "latency_ms": 40452, "error": null, "raw_text_chars": 575}]}, "question_id": "2307.09288v2.pdf::Q002", "answer_letter": null, "raw_text": "Looking at Figure 3 on page 4, which shows \"Safety human evaluation results for Llama 2-Chat compared to other open-source and closed-source models,\" I need to identify the models with the lowest violation percentages (since lower is better for safety).\n\nFrom the bar chart showing \"Violation % (lower is safer)\":\n- Llama 2-70b-chat has the lowest violation percentage (approximately 0-5%)\n- Llama 2-13b-chat has the second lowest (approximately 0-5%)\n- Llama 2-34b-chat has the third lowest (approximately 5-10%)\n\nAnswer: Llama 2-70b-chat, Llama 2-13b-chat, Llama 2-34b-chat", "citations": [], "input_tokens": 199224, "output_tokens": 182, "cost_micros": 600402, "latency_ms": 40406, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": "stop", "pdf_filename": "2307.09288v2.pdf"}, "graded": {"correct": false, "f1": 0.0, "method": "list_set", "normalised_pred": "llama 2 13b chat, llama 2 34b chat, llama 2 70b chat", "normalised_gold": "palm bison, vicuna13b v1 1, vicuna13b v1 3"}}
+{"qid": "2307.09288v2.pdf::Q002", "doc_id": "2307.09288v2.pdf", "arm": "llamacloud_premium_lc", "answer_format": "list", "gold": "['Vicuna13b-v1.3', 'PaLM Bison', 'Vicuna13b-v1.1']", "pages": 77, "document_id": 7588, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T01:52:05Z", "latency_ms": 10343, "error": null, "raw_text_chars": 376}]}, "question_id": "2307.09288v2.pdf::Q002", "answer_letter": null, "raw_text": "Looking at Figure 3, which shows \"Safety human evaluation results for LLAMA 2-CHAT compared to other open-source and closed-source models,\" I can see the violation percentages (lower is safer):\n\nThe top-3 safest models (lowest violation %) are:\n1. Llama-2 7b-chat: 3%\n2. Llama-2 13b-chat: 3%\n3. Llama-2 70b-chat: 4%\n\nAnswer: Llama-2 7b-chat, Llama-2 13b-chat, Llama-2 70b-chat", "citations": [], "input_tokens": 105645, "output_tokens": 151, "cost_micros": 319200, "latency_ms": 10343, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "finish_reason": "stop"}, "graded": {"correct": false, "f1": 0.0, "method": "list_set", "normalised_pred": "llama 2 13b chat, llama 2 70b chat, llama 2 7b chat", "normalised_gold": "palm bison, vicuna13b v1 1, vicuna13b v1 3"}}
+{"qid": "2307.09288v2.pdf::Q004", "doc_id": "2307.09288v2.pdf", "arm": "llamacloud_premium_lc", "answer_format": "str", "gold": "Llama 1 33B", "pages": 77, "document_id": 7588, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T01:52:10Z", "latency_ms": 29391, "error": null, "raw_text_chars": 525}]}, "question_id": "2307.09288v2.pdf::Q004", "answer_letter": null, "raw_text": "Looking at Table 23 in the document, which shows \"Comparison to open-source models on reading comprehension (SQUAD and QUAC)\", I need to find the QUAC (f1) 0-shot column.\n\nFrom the table, the QUAC (f1) 0-shot scores are:\n- MPT 7B: 38.0\n- MPT 30B: 40.4\n- Falcon 7B: 24.0\n- Falcon 40B: 41.2\n- LLAMA 1 7B: 38.9\n- LLAMA 1 13B: 39.9\n- LLAMA 1 33B: 44.1\n- LLAMA 1 65B: 41.0\n- LLAMA 2 7B: 39.4\n- LLAMA 2 13B: 42.7\n- LLAMA 2 34B: 42.9\n- LLAMA 2 70B: 42.4\n\nThe highest score is 44.1, which belongs to LLAMA 1 33B.\n\nAnswer: LLAMA 1 33B", "citations": [], "input_tokens": 105649, "output_tokens": 274, "cost_micros": 321057, "latency_ms": 29391, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "finish_reason": "stop"}, "graded": {"correct": true, "f1": 1.0, "method": "str_norm", "normalised_pred": "llama 1 33b", "normalised_gold": "llama 1 33b"}}
+{"qid": "2307.09288v2.pdf::Q007", "doc_id": "2307.09288v2.pdf", "arm": "native_pdf", "answer_format": "int", "gold": "15", "pages": 77, "document_id": 7588, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T01:52:15Z", "latency_ms": 43610, "error": null, "raw_text_chars": 650}]}, "question_id": "2307.09288v2.pdf::Q007", "answer_letter": null, "raw_text": "I need to find Figure 9 and count the emojis in the left and right subfigures.\n\nLooking through the document, I find Figure 9 on page 16 (document page 16). The figure shows \"Issues with multi-turn memory (left) can be improved with GAtt (right).\"\n\nLooking at the left subfigure:\n- It shows a conversation with text-only responses (no emojis visible)\n\nLooking at the right subfigure:\n- First response: 4 emojis (\ud83d\udc4b\ud83d\udc41\ufe0f\ud83d\udc44\ud83d\udc41\ufe0f)\n- Second response: 4 emojis (\u2708\ufe0f\u2708\ufe0f\ud83d\udeeb\ud83d\udcba)\n- Third response: 4 emojis (\ud83e\uddb4\ud83e\udd95\ud83e\uddb4\ud83d\udd25)\n- Fourth response: 2 emojis (\ud83c\uddee\ud83c\uddf9\ud83c\udf5d\ud83d\udc90)\n\nTotal in right subfigure: 4 + 4 + 4 + 2 = 14 emojis\n\nTotal in left subfigure: 0 emojis\n\nDifference: 14 - 0 = 14\n\nAnswer: 14", "citations": [], "input_tokens": 199223, "output_tokens": 267, "cost_micros": 601674, "latency_ms": 43578, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": "stop", "pdf_filename": "2307.09288v2.pdf"}, "graded": {"correct": false, "f1": 0.0, "method": "int_eq", "normalised_pred": "14", "normalised_gold": "15"}}
+{"qid": "2309.17421v2.pdf::Q000", "doc_id": "2309.17421v2.pdf", "arm": "native_pdf", "answer_format": "str", "gold": "Blue", "pages": 166, "document_id": 7589, "original_error": "HTTPStatusError: OpenRouter HTTP 502: {\"type\":\"https://developers.cloudflare.com/support/troubleshooting/http-status-codes/cloudflare-5xx-errors/error-502/\",\"title\":\"Error 502: Bad gateway\",\"status\":502,\"detail\":\"The origin web server returned an invalid or incomplete response to Cloudflare. This typically indicates the origin is overlo", "retry": {"max_attempts": 5, "n_attempts": 5, "recovered": false, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T01:52:40Z", "latency_ms": 43593, "error": "HTTPStatusError: OpenRouter HTTP 502: {\"type\":\"https://developers.cloudflare.com/support/troubleshooting/http-status-codes/cloudflare-5xx-errors/error-502/\",\"title\":\"Error 502: Bad gateway\",\"status\":502,\"detail\":\"The origin web server returned an invalid or incomplete response to Cloudflare. This typically indicates the origin is overlo", "raw_text_chars": 0}, {"attempt": 2, "started_iso": "2026-05-15T01:53:24Z", "latency_ms": 44391, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 3, "started_iso": "2026-05-15T01:54:10Z", "latency_ms": 35797, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 4, "started_iso": "2026-05-15T01:54:48Z", "latency_ms": 42266, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 5, "started_iso": "2026-05-15T01:55:41Z", "latency_ms": 31983, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}]}, "question_id": "2309.17421v2.pdf::Q000", "answer_letter": null, "raw_text": "", "citations": [], "input_tokens": 0, "output_tokens": 0, "cost_micros": 0, "latency_ms": 31843, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": null, "pdf_filename": "2309.17421v2.pdf"}, "graded": {"correct": false, "f1": 0.0, "method": "str_norm", "normalised_pred": "", "normalised_gold": "blue"}}
+{"qid": "2309.17421v2.pdf::Q001", "doc_id": "2309.17421v2.pdf", "arm": "native_pdf", "answer_format": "str", "gold": "YouTube Music", "pages": 166, "document_id": 7589, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 5, "recovered": false, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T01:52:59Z", "latency_ms": 33250, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 2, "started_iso": "2026-05-15T01:53:33Z", "latency_ms": 40889, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 3, "started_iso": "2026-05-15T01:54:16Z", "latency_ms": 51515, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 4, "started_iso": "2026-05-15T01:55:10Z", "latency_ms": 54500, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 5, "started_iso": "2026-05-15T01:56:14Z", "latency_ms": 32641, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}]}, "question_id": "2309.17421v2.pdf::Q001", "answer_letter": null, "raw_text": "", "citations": [], "input_tokens": 0, "output_tokens": 0, "cost_micros": 0, "latency_ms": 32500, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": null, "pdf_filename": "2309.17421v2.pdf"}, "graded": {"correct": false, "f1": 0.0, "method": "str_norm", "normalised_pred": "", "normalised_gold": "youtube music"}}
+{"qid": "2309.17421v2.pdf::Q002", "doc_id": "2309.17421v2.pdf", "arm": "native_pdf", "answer_format": "float", "gold": "$49.99", "pages": 166, "document_id": 7589, "original_error": null, "retry": {"max_attempts": 5, "n_attempts": 5, "recovered": false, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T01:56:13Z", "latency_ms": 42110, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 2, "started_iso": "2026-05-15T01:56:56Z", "latency_ms": 52202, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 3, "started_iso": "2026-05-15T01:57:50Z", "latency_ms": 40469, "error": "HTTPStatusError: OpenRouter HTTP 502: {\"type\":\"https://developers.cloudflare.com/support/troubleshooting/http-status-codes/cloudflare-5xx-errors/error-502/\",\"title\":\"Error 502: Bad gateway\",\"status\":502,\"detail\":\"The origin web server returned an invalid or incomplete response to Cloudflare. This typically indicates the origin is overlo", "raw_text_chars": 0}, {"attempt": 4, "started_iso": "2026-05-15T01:58:34Z", "latency_ms": 49766, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 5, "started_iso": "2026-05-15T01:59:30Z", "latency_ms": 42110, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}]}, "question_id": "2309.17421v2.pdf::Q002", "answer_letter": null, "raw_text": "", "citations": [], "input_tokens": 0, "output_tokens": 0, "cost_micros": 0, "latency_ms": 41968, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": null, "pdf_filename": "2309.17421v2.pdf"}, "graded": {"correct": false, "f1": 0.0, "method": "float_tol", "normalised_pred": "", "normalised_gold": "49.99"}}
+{"qid": "2309.17421v2.pdf::Q002", "doc_id": "2309.17421v2.pdf", "arm": "azure_premium_lc", "answer_format": "float", "gold": "$49.99", "pages": 166, "document_id": 7589, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T01:56:47Z", "latency_ms": 24578, "error": null, "raw_text_chars": 640}]}, "question_id": "2309.17421v2.pdf::Q002", "answer_letter": null, "raw_text": "I need to find Figure 111 in the document and identify the price of the keyboard shown in the screenshot.\n\nLooking through the document, I can find Figure 111 which is part of Section 9.8 on GUI Navigation, specifically about Online Shopping for an Ergonomic Keyboard.\n\nIn Figure 111, the screenshot shows a shopping cart with a keyboard item. The text indicates:\n\n\"Subtotal $4999\"\n\nAnd in the cart details, it shows:\n\"Kensington Pro Fit Ergonomic Wireless Keyboard - Black (K7540...\n$4999\"\n\nThe price shown is $49.99 (represented as $4999 in the format used in the screenshot, which appears to be missing the decimal point).\n\nAnswer: 49.99", "citations": [], "input_tokens": 133648, "output_tokens": 176, "cost_micros": 403584, "latency_ms": 24578, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "finish_reason": "stop"}, "graded": {"correct": true, "f1": 1.0, "method": "float_tol", "normalised_pred": "49.99", "normalised_gold": "49.99"}}
+{"qid": "2309.17421v2.pdf::Q003", "doc_id": "2309.17421v2.pdf", "arm": "native_pdf", "answer_format": "float", "gold": "76%", "pages": 166, "document_id": 7589, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 5, "recovered": false, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T01:57:11Z", "latency_ms": 69062, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 2, "started_iso": "2026-05-15T01:58:21Z", "latency_ms": 35639, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 3, "started_iso": "2026-05-15T01:58:59Z", "latency_ms": 51312, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 4, "started_iso": "2026-05-15T01:59:54Z", "latency_ms": 49218, "error": "HTTPStatusError: OpenRouter HTTP 502: {\"type\":\"https://developers.cloudflare.com/support/troubleshooting/http-status-codes/cloudflare-5xx-errors/error-502/\",\"title\":\"Error 502: Bad gateway\",\"status\":502,\"detail\":\"The origin web server returned an invalid or incomplete response to Cloudflare. This typically indicates the origin is overlo", "raw_text_chars": 0}, {"attempt": 5, "started_iso": "2026-05-15T02:00:48Z", "latency_ms": 56375, "error": "HTTPStatusError: OpenRouter HTTP 502: {\"type\":\"https://developers.cloudflare.com/support/troubleshooting/http-status-codes/cloudflare-5xx-errors/error-502/\",\"title\":\"Error 502: Bad gateway\",\"status\":502,\"detail\":\"The origin web server returned an invalid or incomplete response to Cloudflare. This typically indicates the origin is overlo", "raw_text_chars": 0}]}, "question_id": "2309.17421v2.pdf::Q003", "answer_letter": null, "raw_text": "", "citations": [], "input_tokens": 0, "output_tokens": 0, "cost_micros": 0, "latency_ms": 0, "error": "HTTPStatusError: OpenRouter HTTP 502: {\"type\":\"https://developers.cloudflare.com/support/troubleshooting/http-status-codes/cloudflare-5xx-errors/error-502/\",\"title\":\"Error 502: Bad gateway\",\"status\":502,\"detail\":\"The origin web server returned an invalid or incomplete response to Cloudflare. This typically indicates the origin is overlo", "extra": {}, "graded": {"correct": false, "f1": 0.0, "method": "float_tol", "normalised_pred": "", "normalised_gold": "76.0"}}
+{"qid": "2309.17421v2.pdf::Q003", "doc_id": "2309.17421v2.pdf", "arm": "azure_basic_lc", "answer_format": "float", "gold": "76%", "pages": 166, "document_id": 7589, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:00:12Z", "latency_ms": 7967, "error": null, "raw_text_chars": 594}]}, "question_id": "2309.17421v2.pdf::Q003", "answer_letter": null, "raw_text": "I need to find Figure 107 in the document and identify the battery percentage shown in that screenshot.\n\nLooking through the document for Figure 107, I can see it's in Section 9.8 on GUI Navigation, specifically related to online shopping.\n\nIn the image caption list, Figure 107 is described as \"Section 9.8: online shopping.\"\n\nLooking at the actual Figure 107 in the document, I can see it's a smartphone screenshot showing an Amazon shopping interface. At the top of the screenshot, there is a time display \"5:32\" and next to it shows \"75\" which represents the battery percentage.\n\nAnswer: 75", "citations": [], "input_tokens": 117429, "output_tokens": 143, "cost_micros": 354432, "latency_ms": 7967, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "finish_reason": "stop"}, "graded": {"correct": false, "f1": 0.0, "method": "float_tol", "normalised_pred": "75.0", "normalised_gold": "76.0"}}
+{"qid": "2309.17421v2.pdf::Q004", "doc_id": "2309.17421v2.pdf", "arm": "native_pdf", "answer_format": "str", "gold": "Ukraine", "pages": 166, "document_id": 7589, "original_error": null, "retry": {"max_attempts": 5, "n_attempts": 5, "recovered": false, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:00:20Z", "latency_ms": 33344, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 2, "started_iso": "2026-05-15T02:00:54Z", "latency_ms": 43733, "error": "HTTPStatusError: OpenRouter HTTP 502: {\"type\":\"https://developers.cloudflare.com/support/troubleshooting/http-status-codes/cloudflare-5xx-errors/error-502/\",\"title\":\"Error 502: Bad gateway\",\"status\":502,\"detail\":\"The origin web server returned an invalid or incomplete response to Cloudflare. This typically indicates the origin is overlo", "raw_text_chars": 0}, {"attempt": 3, "started_iso": "2026-05-15T02:01:39Z", "latency_ms": 33390, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 4, "started_iso": "2026-05-15T02:02:16Z", "latency_ms": 33641, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 5, "started_iso": "2026-05-15T02:02:59Z", "latency_ms": 38000, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}]}, "question_id": "2309.17421v2.pdf::Q004", "answer_letter": null, "raw_text": "", "citations": [], "input_tokens": 0, "output_tokens": 0, "cost_micros": 0, "latency_ms": 37859, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": null, "pdf_filename": "2309.17421v2.pdf"}, "graded": {"correct": false, "f1": 0.0, "method": "str_norm", "normalised_pred": "", "normalised_gold": "ukraine"}}
+{"qid": "2309.17421v2.pdf::Q005", "doc_id": "2309.17421v2.pdf", "arm": "native_pdf", "answer_format": "str", "gold": "JoooDeee", "pages": 166, "document_id": 7589, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 5, "recovered": false, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:01:45Z", "latency_ms": 35218, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 2, "started_iso": "2026-05-15T02:02:21Z", "latency_ms": 44921, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 3, "started_iso": "2026-05-15T02:03:08Z", "latency_ms": 52842, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 4, "started_iso": "2026-05-15T02:04:05Z", "latency_ms": 27250, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}, {"attempt": 5, "started_iso": "2026-05-15T02:04:41Z", "latency_ms": 48031, "error": "EmptyResponse: stream ended with no text", "raw_text_chars": 0}]}, "question_id": "2309.17421v2.pdf::Q005", "answer_letter": null, "raw_text": "", "citations": [], "input_tokens": 0, "output_tokens": 0, "cost_micros": 0, "latency_ms": 47889, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": null, "pdf_filename": "2309.17421v2.pdf"}, "graded": {"correct": false, "f1": 0.0, "method": "str_norm", "normalised_pred": "", "normalised_gold": "jooodeee"}}
+{"qid": "2310.05634v2.pdf::Q005", "doc_id": "2310.05634v2.pdf", "arm": "native_pdf", "answer_format": "str", "gold": "If it can be partially verified by the knowledge graph G", "pages": 24, "document_id": 7590, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:03:37Z", "latency_ms": 20843, "error": null, "raw_text_chars": 345}]}, "question_id": "2310.05634v2.pdf::Q005", "answer_letter": null, "raw_text": "Looking at the document, specifically on page 2, the task formulation section discusses the \"Setting of Conscious Incompetence.\"\n\nThe relevant text states: \"A sentence can map to both [NA] and a list of sub-graph knowledge if it can be partially verified by the knowledge graph G.\"\n\nAnswer: if it can be partially verified by the knowledge graph", "citations": [], "input_tokens": 60205, "output_tokens": 80, "cost_micros": 181815, "latency_ms": 20828, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": "stop", "pdf_filename": "2310.05634v2.pdf"}, "graded": {"correct": true, "f1": 0.9473684210526316, "method": "str_norm", "normalised_pred": "if it can be partially verified by knowledge graph", "normalised_gold": "if it can be partially verified by knowledge graph g"}}
+{"qid": "2310.07609v1.pdf::Q002", "doc_id": "2310.07609v1.pdf", "arm": "native_pdf", "answer_format": "int", "gold": "5", "pages": 10, "document_id": 7591, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:03:57Z", "latency_ms": 14639, "error": null, "raw_text_chars": 1219}]}, "question_id": "2310.07609v1.pdf::Q002", "answer_letter": null, "raw_text": "Looking at the document, I need to identify which modules use Large Language Models (LLMs) as their backbone.\n\nFrom Section 3, the QACHECK system has 5 modules:\n1. **Claim Verifier** - Section 3.1 states it's \"based on InstructGPT\" (an LLM)\n2. **Question Generator** - Section 3.2 states \"Similar to the claim verifier, we also leverage InstructGPT\" (an LLM)\n3. **Question Answering Model** - Section 3.3 describes three implementations, with the default being \"GPT Reciter-Reader\" which uses InstructGPT (an LLM). However, this module also has non-LLM options (Retriever-Reader, FLAN-T5).\n4. **QA Validator** - Section 3.4 states \"We again implement the validator by prompting InstructGPT\" (an LLM)\n5. **Reasoner** - Section 3.5 provides \"two different implementations: 1) the end-to-end QA model based on FLAN-T5, and 2) the InstructGPT model\"\n\nCounting the modules that use LLM backbones:\n- Claim Verifier: Yes (InstructGPT)\n- Question Generator: Yes (InstructGPT)\n- Question Answering Model: Yes (uses GPT Reciter-Reader by default)\n- QA Validator: Yes (InstructGPT)\n- Reasoner: Yes (InstructGPT is one of the two implementations)\n\nAll 5 modules have LLM implementations, with some offering alternatives.\n\nAnswer: 5", "citations": [], "input_tokens": 27149, "output_tokens": 403, "cost_micros": 87492, "latency_ms": 14639, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": "stop", "pdf_filename": "2310.07609v1.pdf"}, "graded": {"correct": true, "f1": 1.0, "method": "int_eq", "normalised_pred": "5", "normalised_gold": "5"}}
+{"qid": "2311.16502v3.pdf::Q000", "doc_id": "2311.16502v3.pdf", "arm": "llamacloud_premium_lc", "answer_format": "int", "gold": "287", "pages": 117, "document_id": 7593, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:04:12Z", "latency_ms": 6562, "error": null, "raw_text_chars": 406}]}, "question_id": "2311.16502v3.pdf::Q000", "answer_letter": null, "raw_text": "Looking at the document, I need to find information about the sociology subject in MMMU.\n\nIn Figure 3, which shows the distribution of subjects across disciplines, I can see:\n\nUnder \"Humanities & Social Sci. (9%)\":\n- History (313, 2.71%)\n- Literature (147, 1.27%)\n- Psychology (340, 2.94%)\n- **Sociology (287, 2.48%)**\n\nThe number 287 represents the number of samples in the sociology subject.\n\nAnswer: 287", "citations": [], "input_tokens": 103232, "output_tokens": 123, "cost_micros": 311541, "latency_ms": 6562, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "finish_reason": "stop"}, "graded": {"correct": true, "f1": 1.0, "method": "int_eq", "normalised_pred": "287", "normalised_gold": "287"}}
+{"qid": "2311.16502v3.pdf::Q002", "doc_id": "2311.16502v3.pdf", "arm": "native_pdf", "answer_format": "int", "gold": "2", "pages": 117, "document_id": 7593, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:04:19Z", "latency_ms": 62062, "error": null, "raw_text_chars": 760}]}, "question_id": "2311.16502v3.pdf::Q002", "answer_letter": null, "raw_text": "I need to find Economics-related error cases that fall into the \"Perceptual Error\" category.\n\nLet me search through the document for Economics error cases:\n\nFrom the Case Study section (pages 20-21), I can see Table 10 which shows the index of case study figures by subjects with associated error categories.\n\nLooking at the Economics row in Table 10:\n- Economics has: Correct Case (22), Perceptual Error (23, 24)\n\nThis indicates that there are 2 Economics-related error cases in the Perceptual Error category (Figures 23 and 24).\n\nLet me verify by checking the actual figures:\n- Figure 23 (page 37): Economics: Macroeconomics - Error Category: Perceptual Error \u2713\n- Figure 24 (page 38): Economics: Macroeconomics - Error Category: Perceptual Error \u2713\n\nAnswer: 2", "citations": [], "input_tokens": 259127, "output_tokens": 212, "cost_micros": 1559532, "latency_ms": 62014, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": "stop", "pdf_filename": "2311.16502v3.pdf"}, "graded": {"correct": true, "f1": 1.0, "method": "int_eq", "normalised_pred": "2", "normalised_gold": "2"}}
+{"qid": "2311.16502v3.pdf::Q003", "doc_id": "2311.16502v3.pdf", "arm": "llamacloud_basic_lc", "answer_format": "list", "gold": "['Thermodynamics', 'Heat Transfer', 'Fluid Mechanics']", "pages": 117, "document_id": 7593, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:05:21Z", "latency_ms": 5093, "error": null, "raw_text_chars": 174}]}, "question_id": "2311.16502v3.pdf::Q003", "answer_letter": null, "raw_text": "Looking at Table 11 in the document which lists the subfields of each subject:\n\nFor Energy & Power, the subfields are:\n\nAnswer: Thermodynamics, Heat Transfer, Fluid Mechanics", "citations": [], "input_tokens": 83153, "output_tokens": 47, "cost_micros": 250164, "latency_ms": 5093, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "finish_reason": "stop"}, "graded": {"correct": true, "f1": 1.0, "method": "list_set", "normalised_pred": "fluid mechanics, heat transfer, thermodynamics", "normalised_gold": "fluid mechanics, heat transfer, thermodynamics"}}
+{"qid": "2311.16502v3.pdf::Q007", "doc_id": "2311.16502v3.pdf", "arm": "native_pdf", "answer_format": "int", "gold": "17", "pages": 117, "document_id": 7593, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:05:26Z", "latency_ms": 64485, "error": null, "raw_text_chars": 1513}]}, "question_id": "2311.16502v3.pdf::Q007", "answer_letter": null, "raw_text": "I need to find Figure 1 in the document and count the different icons shown as different image types.\n\nLooking at page 1, Figure 1 is titled \"Overview of the MMMU dataset\" and shows four main sections:\n\n1. **Comprehensive Disciplines** - Shows icons/images representing different disciplines (Engineering, Science, Art & Design, Business, Humanities & Social Sci., Medicine)\n\n2. **Heterogeneous Image Types** - Shows various image type icons including:\n   - Diagrams\n   - Tables\n   - Plots and Charts\n   - Photographs\n   - Chemical Structures\n   - Paintings\n   - Medical Images\n   - Sheet Music\n   - Geometric\n   - Pathology images\n   - Microscopic Images\n   - Comics\n\n3. **Interleaved Text and Images** - Shows a medical question example with 3 MRI images\n\n4. **Expert-level Skills Test** - Shows a diagram with three components: Perception, Knowledge, and Reasoning\n\nThe question asks specifically about \"different icon are shown as different image types\" in Figure 1. Looking at the \"Heterogeneous Image Types\" section, I can count the distinct image type icons shown.\n\nCounting the image types listed in the \"Heterogeneous Image Types\" section:\nDiagrams, Tables, Plots and Charts, Photographs, Chemical Structures, Paintings, Medical Images, Sheet Music, Geometric, Pathology images, Microscopic Images, Comics, and there's an ellipsis (...) suggesting more types.\n\nBased on what's explicitly visible and labeled in Figure 1, I can count approximately 12 distinct image type representations shown.\n\nAnswer: 12", "citations": [], "input_tokens": 259123, "output_tokens": 378, "cost_micros": 1563243, "latency_ms": 64406, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": "stop", "pdf_filename": "2311.16502v3.pdf"}, "graded": {"correct": false, "f1": 0.0, "method": "int_eq", "normalised_pred": "12", "normalised_gold": "17"}}
+{"qid": "2311.16502v3.pdf::Q009", "doc_id": "2311.16502v3.pdf", "arm": "native_pdf", "answer_format": "int", "gold": "4", "pages": 117, "document_id": 7593, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:05:29Z", "latency_ms": 75843, "error": null, "raw_text_chars": 831}]}, "question_id": "2311.16502v3.pdf::Q009", "answer_letter": null, "raw_text": "I need to count the charts that appear in pages 5-10 of the document.\n\nLet me go through each page:\n\n**Page 5:**\n- Figure 4: A comparison chart showing benchmarks plotted on a 2D graph with \"Breadth (Knowledge)\" on x-axis and \"Depth (Reasoning)\" on y-axis\n- A table comparing different datasets\nTotal: 1 chart\n\n**Page 6:**\n- Table 2: Results table (not a chart)\nTotal: 0 charts\n\n**Page 7:**\n- Figure 5: Bar chart showing \"Performance of models on different types of images\"\n- Table 3: Results table (not a chart)\nTotal: 1 chart\n\n**Page 8:**\n- Figure 6: Pie chart showing \"Error distribution over 150 annotated GPT-4V errors\"\nTotal: 1 chart\n\n**Page 9:**\n- No charts, only text\nTotal: 0 charts\n\n**Page 10:**\n- No charts, only text (references section)\nTotal: 0 charts\n\nTotal charts in pages 5-10: 1 + 0 + 1 + 1 + 0 + 0 = 3\n\nAnswer: 3", "citations": [], "input_tokens": 259120, "output_tokens": 289, "cost_micros": 1561222, "latency_ms": 75796, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": "stop", "pdf_filename": "2311.16502v3.pdf"}, "graded": {"correct": false, "f1": 0.0, "method": "int_eq", "normalised_pred": "3", "normalised_gold": "4"}}
+{"qid": "2311.16502v3.pdf::Q013", "doc_id": "2311.16502v3.pdf", "arm": "native_pdf", "answer_format": "list", "gold": "['Perceptual Error', 'Lack of Knowledge', 'Reasoning Error']", "pages": 117, "document_id": 7593, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:06:30Z", "latency_ms": 77217, "error": null, "raw_text_chars": 404}]}, "question_id": "2311.16502v3.pdf::Q013", "answer_letter": null, "raw_text": "Looking at Figure 6 on page 8 of the document, which shows the \"Error distribution over 150 annotated GPT-4V errors\":\n\nThe error categories and their percentages are:\n- Perceptual Error: 35%\n- Lack of Knowledge: 29%\n- Reasoning Error: 26%\n- Textual Understanding: 4%\n- Reject to Answer: 3%\n- Annotation Error: 2%\n- Answer Extraction Error: 1%\n\nAnswer: Perceptual Error, Lack of Knowledge, Reasoning Error", "citations": [], "input_tokens": 259144, "output_tokens": 133, "cost_micros": 1557856, "latency_ms": 77171, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": "stop", "pdf_filename": "2311.16502v3.pdf"}, "graded": {"correct": true, "f1": 1.0, "method": "list_set", "normalised_pred": "lack of knowledge, perceptual error, reasoning error", "normalised_gold": "lack of knowledge, perceptual error, reasoning error"}}
+{"qid": "2311.16502v3.pdf::Q013", "doc_id": "2311.16502v3.pdf", "arm": "azure_premium_lc", "answer_format": "list", "gold": "['Perceptual Error', 'Lack of Knowledge', 'Reasoning Error']", "pages": 117, "document_id": 7593, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:06:45Z", "latency_ms": 7750, "error": null, "raw_text_chars": 387}]}, "question_id": "2311.16502v3.pdf::Q013", "answer_letter": null, "raw_text": "Looking at Figure 6 \"Error distribution over 150 annotated GPT-4V errors\":\n\nThe pie chart shows the following error categories and their percentages:\n- Perceptual Error: 35%\n- Lack of Knowledge: 29%\n- Reasoning Error: 26%\n- Textual Understanding: 4%\n- Reject to Answer: 3%\n- Annotation Error: 2%\n- Answer Extraction Error: 1%\n\nAnswer: Perceptual Error, Lack of Knowledge, Reasoning Error", "citations": [], "input_tokens": 109199, "output_tokens": 125, "cost_micros": 329472, "latency_ms": 7750, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "finish_reason": "stop"}, "graded": {"correct": true, "f1": 1.0, "method": "list_set", "normalised_pred": "lack of knowledge, perceptual error, reasoning error", "normalised_gold": "lack of knowledge, perceptual error, reasoning error"}}
+{"qid": "2311.16502v3.pdf::Q014", "doc_id": "2311.16502v3.pdf", "arm": "native_pdf", "answer_format": "list", "gold": "['Reasoning Error', 'Perceptual Error', 'Lack of Knowledge']", "pages": 117, "document_id": 7593, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:06:53Z", "latency_ms": 59125, "error": null, "raw_text_chars": 830}]}, "question_id": "2311.16502v3.pdf::Q014", "answer_letter": null, "raw_text": "I'll search through pages 95-100 for error types mentioned in the case studies.\n\nFrom page 95 (Figure 80 - Architecture and Engineering: Surveying and Mapping):\n- This is a correct case, no error type mentioned\n\nFrom page 96 (Figure 82 - Architecture and Engineering: Civil Engineering):\n- Error Category: Reasoning Error\n\nFrom page 97 (Figure 83 - Computer Science: Data Structure and Algorithm):\n- This is a correct case, no error type mentioned\n\nFrom page 98 (Figure 84 - Computer Science: Compiler Principle):\n- Error Category: Perceptual Error, Lack of Knowledge\n\nFrom page 99 (Figure 85 - Computer Science: Operating System):\n- Error Category: Perceptual Error\n\nFrom page 100 (Figure 86 - Computer Science: Compiler Principle):\n- Error Category: Perceptual Error\n\nAnswer: Reasoning Error, Perceptual Error, Lack of Knowledge", "citations": [], "input_tokens": 259136, "output_tokens": 226, "cost_micros": 1559901, "latency_ms": 59077, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": "stop", "pdf_filename": "2311.16502v3.pdf"}, "graded": {"correct": true, "f1": 1.0, "method": "list_set", "normalised_pred": "lack of knowledge, perceptual error, reasoning error", "normalised_gold": "lack of knowledge, perceptual error, reasoning error"}}
+{"qid": "2311.16502v3.pdf::Q015", "doc_id": "2311.16502v3.pdf", "arm": "llamacloud_premium_lc", "answer_format": "int", "gold": "4", "pages": 117, "document_id": 7593, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:07:48Z", "latency_ms": 11110, "error": null, "raw_text_chars": 1296}]}, "question_id": "2311.16502v3.pdf::Q015", "answer_letter": null, "raw_text": "I need to count the successful cases (correct predictions by GPT-4V) in Pages 65-70.\n\nLet me go through each figure in this range:\n\n**Page 65 - Figure 51 (Clinical Medicine: Pathology)**\n- GPT-4V correctly identified (E) Familial polyposis coli\n- Ground Truth: (E) Familial polyposis coli\n- **SUCCESS**\n\n**Page 66 - Figure 52 (Clinical Medicine: Pathology)**\n- GPT-4V correctly identified (D) Ruptured berry aneurysm\n- Ground Truth: (D) Ruptured berry aneurysm\n- **SUCCESS**\n\n**Page 67 - Figure 53 (Clinical Medicine: Pathology)**\n- GPT-4V answered (A) Human papillomavirus infection\n- Ground Truth: (E) Chronic inflammation\n- Error Category: Perceptual Error\n- **FAILURE**\n\n**Page 68 - Figure 54 (Clinical Medicine: Pathology)**\n- GPT-4V answered (A) Human papillomavirus infection\n- Ground Truth: (E) Chronic inflammation\n- Error Category: Perceptual Error\n- **FAILURE**\n\n**Page 69 - Figure 55 (Clinical Medicine: Heart Disease)**\n- GPT-4V answered (B) Mitral regurgitation\n- Ground Truth: (C) Aortic value regurgitation\n- Error Category: Lack of Knowledge\n- **FAILURE**\n\n**Page 70 - Figure 56 (Diagnostics and Laboratory Medicine: Radiology)**\n- GPT-4V correctly identified (C) Dandy-Walker syndrome\n- Ground Truth: (C) Dandy-Walker syndrome\n- **SUCCESS**\n\nTotal successful cases: 3\n\nAnswer: 3", "citations": [], "input_tokens": 103232, "output_tokens": 424, "cost_micros": 316056, "latency_ms": 11110, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "finish_reason": "stop"}, "graded": {"correct": false, "f1": 0.0, "method": "int_eq", "normalised_pred": "3", "normalised_gold": "4"}}
+{"qid": "2312.10997v5.pdf::Q000", "doc_id": "2312.10997v5.pdf", "arm": "native_pdf", "answer_format": "str", "gold": "Pre-training", "pages": 21, "document_id": 7596, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:07:52Z", "latency_ms": 23593, "error": null, "raw_text_chars": 1091}]}, "question_id": "2312.10997v5.pdf::Q000", "answer_letter": null, "raw_text": "I need to find the figure with a tree shape and identify which branch has the least leaves.\n\nLooking through the document, I can see Figure 1 on page 2, which shows a \"Technology tree of RAG research\" with a tree-like structure.\n\nExamining this tree diagram, I can see it has a trunk labeled \"Retrieval-Augmented Generation\" and branches extending upward. The main branches appear to be:\n\n1. A left branch (colored in teal/green) showing the evolution from pre-training through fine-tuning stages\n2. A middle branch (colored in orange) labeled \"Instructlens\" \n3. A right branch (colored in blue) showing the inference stage with multiple methods\n\nLooking at the number of \"leaves\" (individual methods/papers) on each branch:\n- The left branch (Pre-training/Fine-tuning) has several leaves\n- The middle branch (Instructlens) appears to have the fewest leaves - it's relatively short with minimal branching\n- The right branch (Inference) has many leaves with extensive branching\n\nThe branch labeled \"Instructlens\" in the middle appears to have the least number of leaves.\n\nAnswer: Instructlens", "citations": [], "input_tokens": 64633, "output_tokens": 269, "cost_micros": 197934, "latency_ms": 23577, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "engine": "native", "answer_letter_strategy": "none", "finish_reason": "stop", "pdf_filename": "2312.10997v5.pdf"}, "graded": {"correct": false, "f1": 0.0, "method": "str_norm", "normalised_pred": "instructlens", "normalised_gold": "pre training"}}
+{"qid": "2401.18059v1.pdf::Q002", "doc_id": "2401.18059v1.pdf", "arm": "azure_premium_lc", "answer_format": "str", "gold": "Collapsed Tree Algorithm", "pages": 23, "document_id": 7597, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 1, "recovered": true, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:07:59Z", "latency_ms": 7156, "error": null, "raw_text_chars": 752}]}, "question_id": "2401.18059v1.pdf::Q002", "answer_letter": null, "raw_text": "Looking at Appendix F \"PSEUDOCODE FOR RETRIEVAL METHODS\":\n\n**Algorithm 1 Tree Traversal Algorithm:**\n- Lines: function declaration, initialization of Scurrent, for layer loop, initialization of topk, for node loop, score calculation, append to top_k, end for, Slayer assignment, Scurrent update, end for, return statement, end function\n- Total: 13 lines\n\n**Algorithm 2 Collapsed Tree Algorithm:**\n- Lines: function declaration, flatten tree, initialization of top_nodes, for node loop, append with dot product, end for, sort top_nodes, result initialization, total_tokens initialization, for node loop, if condition, result append, end if, total_tokens update, end for, return statement, end function\n- Total: 16 lines\n\nAnswer: Collapsed Tree Algorithm", "citations": [], "input_tokens": 27211, "output_tokens": 190, "cost_micros": 84483, "latency_ms": 7156, "error": null, "extra": {"model": "anthropic/claude-sonnet-4.5", "finish_reason": "stop"}, "graded": {"correct": true, "f1": 1.0, "method": "str_norm", "normalised_pred": "collapsed tree algorithm", "normalised_gold": "collapsed tree algorithm"}}
+{"qid": "2405.09818v1.pdf::Q000", "doc_id": "2405.09818v1.pdf", "arm": "native_pdf", "answer_format": "list", "gold": "['Figure 5', 'Figure 6']", "pages": 27, "document_id": 7598, "original_error": null, "retry": {"max_attempts": 5, "n_attempts": 5, "recovered": false, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:08:06Z", "latency_ms": 16641, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657610 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 2, "started_iso": "2026-05-15T02:08:23Z", "latency_ms": 11860, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657610 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 3, "started_iso": "2026-05-15T02:08:36Z", "latency_ms": 12422, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657610 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 4, "started_iso": "2026-05-15T02:08:54Z", "latency_ms": 19625, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657610 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 5, "started_iso": "2026-05-15T02:09:25Z", "latency_ms": 14078, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657610 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}]}, "question_id": "2405.09818v1.pdf::Q000", "answer_letter": null, "raw_text": "", "citations": [], "input_tokens": 0, "output_tokens": 0, "cost_micros": 0, "latency_ms": 0, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657610 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "extra": {}, "graded": {"correct": false, "f1": 0.0, "method": "list_set", "normalised_pred": "", "normalised_gold": "figure 5, figure 6"}}
+{"qid": "2405.09818v1.pdf::Q001", "doc_id": "2405.09818v1.pdf", "arm": "native_pdf", "answer_format": "str", "gold": "text tokens", "pages": 27, "document_id": 7598, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 5, "recovered": false, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:08:16Z", "latency_ms": 13922, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657593 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 2, "started_iso": "2026-05-15T02:08:30Z", "latency_ms": 25952, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657593 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 3, "started_iso": "2026-05-15T02:08:57Z", "latency_ms": 14718, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657593 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 4, "started_iso": "2026-05-15T02:09:14Z", "latency_ms": 33452, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657593 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 5, "started_iso": "2026-05-15T02:09:54Z", "latency_ms": 22093, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657593 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}]}, "question_id": "2405.09818v1.pdf::Q001", "answer_letter": null, "raw_text": "", "citations": [], "input_tokens": 0, "output_tokens": 0, "cost_micros": 0, "latency_ms": 0, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657593 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "extra": {}, "graded": {"correct": false, "f1": 0.0, "method": "str_norm", "normalised_pred": "", "normalised_gold": "text tokens"}}
+{"qid": "2405.09818v1.pdf::Q003", "doc_id": "2405.09818v1.pdf", "arm": "native_pdf", "answer_format": "int", "gold": "18", "pages": 27, "document_id": 7598, "original_error": "SSLError: [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2559)", "retry": {"max_attempts": 5, "n_attempts": 5, "recovered": false, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:09:39Z", "latency_ms": 11906, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657608 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 2, "started_iso": "2026-05-15T02:09:52Z", "latency_ms": 15921, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657608 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 3, "started_iso": "2026-05-15T02:10:10Z", "latency_ms": 21921, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657608 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 4, "started_iso": "2026-05-15T02:10:34Z", "latency_ms": 11906, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657608 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 5, "started_iso": "2026-05-15T02:10:53Z", "latency_ms": 13686, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657608 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}]}, "question_id": "2405.09818v1.pdf::Q003", "answer_letter": null, "raw_text": "", "citations": [], "input_tokens": 0, "output_tokens": 0, "cost_micros": 0, "latency_ms": 0, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657608 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "extra": {}, "graded": {"correct": false, "f1": 0.0, "method": "int_eq", "normalised_pred": "None", "normalised_gold": "18"}}
+{"qid": "2405.09818v1.pdf::Q004", "doc_id": "2405.09818v1.pdf", "arm": "native_pdf", "answer_format": "int", "gold": "1", "pages": 27, "document_id": 7598, "original_error": null, "retry": {"max_attempts": 5, "n_attempts": 5, "recovered": false, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:10:16Z", "latency_ms": 14281, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657583 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 2, "started_iso": "2026-05-15T02:10:32Z", "latency_ms": 12125, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657583 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 3, "started_iso": "2026-05-15T02:10:45Z", "latency_ms": 11952, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657583 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 4, "started_iso": "2026-05-15T02:11:01Z", "latency_ms": 20641, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657583 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 5, "started_iso": "2026-05-15T02:11:30Z", "latency_ms": 13985, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657583 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}]}, "question_id": "2405.09818v1.pdf::Q004", "answer_letter": null, "raw_text": "", "citations": [], "input_tokens": 0, "output_tokens": 0, "cost_micros": 0, "latency_ms": 0, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657583 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "extra": {}, "graded": {"correct": false, "f1": 0.0, "method": "int_eq", "normalised_pred": "None", "normalised_gold": "1"}}
+{"qid": "2405.09818v1.pdf::Q005", "doc_id": "2405.09818v1.pdf", "arm": "native_pdf", "answer_format": "str", "gold": "no", "pages": 27, "document_id": 7598, "original_error": null, "retry": {"max_attempts": 5, "n_attempts": 5, "recovered": false, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:11:06Z", "latency_ms": 17157, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"{\\\"message\\\":\\\"Input is too long.\\\"}\",\"provider_name\":\"Amazon Bedrock\",\"is_byok\":false}},\"user_id\":\"user_3CNdnY1vL3Ln9TYRiGAii5kmBvu\"}", "raw_text_chars": 0}, {"attempt": 2, "started_iso": "2026-05-15T02:11:25Z", "latency_ms": 12125, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657607 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 3, "started_iso": "2026-05-15T02:11:40Z", "latency_ms": 13734, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657607 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 4, "started_iso": "2026-05-15T02:11:59Z", "latency_ms": 15828, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657607 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 5, "started_iso": "2026-05-15T02:12:25Z", "latency_ms": 19985, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657607 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}]}, "question_id": "2405.09818v1.pdf::Q005", "answer_letter": null, "raw_text": "", "citations": [], "input_tokens": 0, "output_tokens": 0, "cost_micros": 0, "latency_ms": 0, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657607 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "extra": {}, "graded": {"correct": false, "f1": 0.0, "method": "str_norm", "normalised_pred": "", "normalised_gold": "no"}}
+{"qid": "2405.09818v1.pdf::Q007", "doc_id": "2405.09818v1.pdf", "arm": "native_pdf", "answer_format": "str", "gold": "150k", "pages": 27, "document_id": 7598, "original_error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657603 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "retry": {"max_attempts": 5, "n_attempts": 5, "recovered": false, "attempts": [{"attempt": 1, "started_iso": "2026-05-15T02:11:44Z", "latency_ms": 15875, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657603 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 2, "started_iso": "2026-05-15T02:12:00Z", "latency_ms": 16625, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657603 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 3, "started_iso": "2026-05-15T02:12:19Z", "latency_ms": 12937, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657603 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 4, "started_iso": "2026-05-15T02:12:35Z", "latency_ms": 14625, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657603 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}, {"attempt": 5, "started_iso": "2026-05-15T02:12:57Z", "latency_ms": 12156, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657603 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "raw_text_chars": 0}]}, "question_id": "2405.09818v1.pdf::Q007", "answer_letter": null, "raw_text": "", "citations": [], "input_tokens": 0, "output_tokens": 0, "cost_micros": 0, "latency_ms": 0, "error": "HTTPStatusError: OpenRouter HTTP 400: {\"error\":{\"message\":\"Provider returned error\",\"code\":400,\"metadata\":{\"raw\":\"[{\\n  \\\"error\\\": {\\n    \\\"code\\\": 400,\\n    \\\"message\\\": \\\"The message size (33657603 bytes) exceeds 30.000MB limit.\\\",\\n    \\\"status\\\": \\\"FAILED_PRECONDITION\\\"\\n  }\\n}\\n]\",\"provider_name\":\"Google\",\"is_byok\":false}},\"user_id", "extra": {}, "graded": {"correct": false, "f1": 0.0, "method": "str_norm", "normalised_pred": "", "normalised_gold": "150k"}}
--- a/surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_retries_summary.json
+++ b/surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/raw_retries_summary.json
@ -0,0 +1,100 @@
+{
+  "config": {
+    "base_delay": 1.0,
+    "concurrency": 2,
+    "llm_model": "anthropic/claude-sonnet-4.5",
+    "max_attempts": 5,
+    "max_delay": 30.0,
+    "max_output_tokens": 512,
+    "pdf_engine": "native"
+  },
+  "elapsed_s": 1373.6,
+  "n_failed_rows_input": 37,
+  "n_retried": 37,
+  "per_arm": {
+    "azure_basic_lc": {
+      "attempts_distribution": [
+        1
+      ],
+      "recovered": 1,
+      "recovery_rate": 1.0,
+      "still_failed": 0,
+      "tried": 1
+    },
+    "azure_premium_lc": {
+      "attempts_distribution": [
+        1,
+        1,
+        1
+      ],
+      "recovered": 3,
+      "recovery_rate": 1.0,
+      "still_failed": 0,
+      "tried": 3
+    },
+    "llamacloud_basic_lc": {
+      "attempts_distribution": [
+        1,
+        1
+      ],
+      "recovered": 2,
+      "recovery_rate": 1.0,
+      "still_failed": 0,
+      "tried": 2
+    },
+    "llamacloud_premium_lc": {
+      "attempts_distribution": [
+        1,
+        1,
+        1,
+        1
+      ],
+      "recovered": 4,
+      "recovery_rate": 1.0,
+      "still_failed": 0,
+      "tried": 4
+    },
+    "native_pdf": {
+      "attempts_distribution": [
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        1,
+        5,
+        5,
+        5,
+        5,
+        5,
+        5,
+        5,
+        5,
+        5,
+        5,
+        5,
+        5
+      ],
+      "recovered": 15,
+      "recovery_rate": 0.5555555555555556,
+      "still_failed": 12,
+      "tried": 27
+    }
+  },
+  "raw_retries_path": "data\\multimodal_doc\\runs\\2026-05-14T00-53-19Z\\parser_compare\\raw_retries.jsonl",
+  "run_id": "2026-05-14T00-53-19Z",
+  "totals": {
+    "recovered": 25,
+    "still_failed": 12,
+    "tried": 37
+  }
+}
--- a/surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/run_artifact.json
+++ b/surfsense_evals/data/multimodal_doc/runs/2026-05-14T00-53-19Z/parser_compare/run_artifact.json