PageIndex/cookbook/agentic_retrieval.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "XTboY7brzyp2"
},
"source": [
"![pageindex_banner](https://pageindex.ai/static/images/pageindex_banner.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EtjMbl9Pz3S-"
},
"source": [
"<p align=\"center\">Reasoning-based RAG&nbsp; ◦ &nbsp;No Vector DB&nbsp; ◦ &nbsp;No Chunking&nbsp; ◦ &nbsp;Human-like Retrieval</p>\n",
"\n",
"<p align=\"center\">\n",
" <a href=\"https://vectify.ai\">🏠 Homepage</a>&nbsp; • &nbsp;\n",
" <a href=\"https://chat.pageindex.ai\">🖥️ Platform</a>&nbsp; • &nbsp;\n",
" <a href=\"https://docs.pageindex.ai/quickstart\">📚 API Docs</a>&nbsp; • &nbsp;\n",
" <a href=\"https://github.com/VectifyAI/PageIndex\">📦 GitHub</a>&nbsp; • &nbsp;\n",
" <a href=\"https://discord.com/invite/VuXuf29EUj\">💬 Discord</a>&nbsp; • &nbsp;\n",
" <a href=\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\">✉️ Contact</a>&nbsp;\n",
"</p>\n",
"\n",
"<div align=\"center\">\n",
"\n",
"[![Star us on GitHub](https://img.shields.io/github/stars/VectifyAI/PageIndex?style=for-the-badge&logo=github&label=⭐️%20Star%20Us)](https://github.com/VectifyAI/PageIndex) &nbsp;&nbsp; [![Follow us on X](https://img.shields.io/badge/Follow%20Us-000000?style=for-the-badge&logo=x&logoColor=white)](https://twitter.com/VectifyAI)\n",
"\n",
"</div>\n",
"\n",
"---\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bbC9uLWCz8zl"
},
"source": [
"# Agentic Retrieval with PageIndex Chat API\n",
"\n",
"Similarity-based RAG built on vector databases has shown significant limitations in recent AI applications, and reasoning-based or agentic retrieval has become increasingly important. But unlike the classic RAG pipeline (embed the input, return the top-K chunks, re-rank), what should an agentic-native retrieval API look like?\n",
"\n",
"An agentic-native retrieval system should let you prompt for retrieval as naturally as you interact with ChatGPT. Below, we provide an example of how the PageIndex Chat API enables this style of prompt-driven retrieval.\n",
"\n",
"\n",
"## PageIndex Chat API\n",
"[PageIndex Chat](https://chat.pageindex.ai/) is an AI assistant that lets you chat with multiple very long documents without worrying about limited context or context rot. It is built on [PageIndex](https://pageindex.ai/blog/pageindex-intro), a vectorless, reasoning-based RAG framework that gives more transparent and reliable results, the way a human expert would.\n",
"<div align=\"center\">\n",
" <img src=\"https://docs.pageindex.ai/images/cookbook/vectorless-rag.png\" width=\"70%\">\n",
"</div>\n",
"\n",
"You can now access PageIndex Chat through the API or the SDK.\n",
"\n",
"## 📝 Notebook Overview\n",
"\n",
"This notebook demonstrates a simple, minimal example of agentic retrieval with PageIndex. You will learn:\n",
"- [x] How to use the PageIndex Chat API.\n",
"- [x] How to prompt PageIndex Chat so that it acts as a retrieval system."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "77SQbPoe-LTN"
},
"source": [
"### Install PageIndex SDK"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"id": "6Eiv_cHf0OXz"
},
"outputs": [],
"source": [
"%pip install -q --upgrade pageindex"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UR9-qkdD-Om7"
},
"source": [
"### Set up PageIndex"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"id": "AFzsW4gq0fjh"
},
"outputs": [],
"source": [
"from pageindex import PageIndexClient\n",
"\n",
"# Get your PageIndex API key from https://dash.pageindex.ai/api-keys\n",
"PAGEINDEX_API_KEY = \"YOUR_PAGEINDEX_API_KEY\"\n",
"pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uvzf9oWL-Ts9"
},
"source": [
"### Upload a document"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "qf7sNRoL0hGw",
"outputId": "529f53c1-c827-45a7-cf01-41f567d4feaa"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloaded https://arxiv.org/pdf/2507.13334.pdf\n",
"Document Submitted: pi-cmi34m6jy01sg0bqzofch62n8\n"
]
}
],
"source": [
"import os, requests\n",
"\n",
"pdf_url = \"https://arxiv.org/pdf/2507.13334.pdf\"\n",
"pdf_path = os.path.join(\"../data\", pdf_url.split('/')[-1])\n",
"os.makedirs(os.path.dirname(pdf_path), exist_ok=True)\n",
"\n",
"response = requests.get(pdf_url, timeout=60)\n",
"response.raise_for_status()  # fail early if the download did not succeed\n",
"with open(pdf_path, \"wb\") as f:\n",
"    f.write(response.content)\n",
"print(f\"Downloaded {pdf_url}\")\n",
"\n",
"doc_id = pi_client.submit_document(pdf_path)[\"doc_id\"]\n",
"print('Document Submitted:', doc_id)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "U4hpLB4T-fCt"
},
"source": [
"### Check the processing status"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "PB1S_CWd2n87",
"outputId": "472a64ab-747d-469c-9e46-3329456df212"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'createdAt': '2025-11-16T08:36:41.177000',\n",
" 'description': 'This survey provides a comprehensive overview and taxonomy of '\n",
" 'Context Engineering for Large Language Models, covering '\n",
" 'foundational components, system implementations, evaluation '\n",
" 'methods, and future research directions.',\n",
" 'id': 'pi-cmi1gp1hg01t20do2l3bgzwz1',\n",
" 'name': '2507.13334_19.pdf',\n",
" 'pageNum': 166,\n",
" 'status': 'completed'}\n",
"\n",
" Document ready! (166 pages)\n"
]
}
],
"source": [
"from pprint import pprint\n",
"\n",
"doc_info = pi_client.get_document(doc_id)\n",
"pprint(doc_info)\n",
"\n",
"if doc_info['status'] == 'completed':\n",
" print(f\"\\n Document ready! ({doc_info['pageNum']} pages)\")\n",
"elif doc_info['status'] == 'processing':\n",
" print(\"\\n Document is still processing. Please wait and check again.\")"
]
},
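{
"cell_type": "markdown",
"metadata": {},
"source": [
"Processing can take a while for long documents. Instead of re-running the status check by hand, you could poll until the document is ready. The cell below is a minimal sketch, not part of the SDK: `wait_until_ready`, `poll_seconds`, and `timeout_seconds` are hypothetical names, and it assumes `'processing'` and `'completed'` are the only statuses you expect to see, treating anything else as an error."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"def wait_until_ready(client, doc_id, poll_seconds=5, timeout_seconds=600):\n",
"    \"\"\"Poll get_document() until the document is processed (hypothetical helper).\"\"\"\n",
"    deadline = time.monotonic() + timeout_seconds\n",
"    while True:\n",
"        info = client.get_document(doc_id)\n",
"        status = info[\"status\"]\n",
"        if status == \"completed\":\n",
"            return info\n",
"        if status != \"processing\":\n",
"            # Assumption: any status other than the two shown above is a failure.\n",
"            raise RuntimeError(f\"Unexpected document status: {status!r}\")\n",
"        if time.monotonic() >= deadline:\n",
"            raise TimeoutError(f\"Document {doc_id} not ready after {timeout_seconds}s\")\n",
"        time.sleep(poll_seconds)\n",
"\n",
"doc_info = wait_until_ready(pi_client, doc_id)\n",
"print(f\"Document ready! ({doc_info['pageNum']} pages)\")"
]
},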
{
"cell_type": "markdown",
"metadata": {
"id": "z1C9FOvO-p1m"
},
"source": [
"### Ask a question about this document"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "X3RbQvy_0nt7",
"outputId": "9bfb314b-24ad-4eb2-d26c-01be5728d3cc"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I'll help you find the evaluation methods used in this paper. Let me start by examining the document structure to locate the relevant sections.{\"doc_name\": \"2507.13334_19.pdf\"}Perfect! I can see there's a dedicated section on \"Evaluation\" (node_id: 0015) that covers pages 45-50. Let me extract the content from those pages to get detailed information about the evaluation methods.{\"doc_name\": \"2507.13334_19.pdf\", \"pages\": \"45-50\"}Based on the comprehensive evaluation section of the paper, here are the **evaluation methods** used:\n",
"\n",
"## Evaluation Framework Overview\n",
"\n",
"The paper presents a comprehensive evaluation framework organized into **Component-Level Assessment** and **System-Level Integration Assessment**.\n",
"\n",
"### 1. **Component-Level Assessment (Intrinsic Evaluation)**\n",
"\n",
"#### Prompt Engineering Evaluation:\n",
"- **Semantic similarity metrics**\n",
"- **Response quality assessment**\n",
"- **Robustness testing** across diverse input variations\n",
"- **Contextual calibration** assessment\n",
"\n",
"#### Long Context Processing Evaluation:\n",
"- **\"Needle in a haystack\"** evaluation paradigm - tests models' ability to retrieve specific information embedded within long contexts\n",
"- **Multi-document reasoning tasks** - assess synthesis capabilities\n",
"- **Position interpolation techniques** evaluation\n",
"- **Information retention, positional bias, and reasoning coherence** metrics\n",
"\n",
"#### Self-Contextualization Evaluation:\n",
"- **Meta-learning assessments**\n",
"- **Adaptation speed measurements**\n",
"- **Consistency analysis** across multiple iterations\n",
"- Self-refinement frameworks: **Self-Refine, Reflexion, N-CRITICS**\n",
"- Performance improvements measured (~20% improvement with GPT-4)\n",
"\n",
"#### Structured/Relational Data Integration:\n",
"- **Knowledge graph traversal accuracy**\n",
"- **Table comprehension assessment**\n",
"- **Database query generation evaluation**\n",
"\n",
"### 2. **System-Level Integration Assessment (Extrinsic Evaluation)**\n",
"\n",
"#### Retrieval-Augmented Generation (RAG):\n",
"- **Precision, recall, relevance metrics**\n",
"- **Factual accuracy assessment**\n",
"- **Task decomposition accuracy**\n",
"- **Multi-plan selection effectiveness**\n",
"- Memory-augmented planning evaluation\n",
"\n",
"#### Memory Systems Evaluation:\n",
"- **LongMemEval benchmark** (500 curated questions covering):\n",
" - Information extraction\n",
" - Temporal reasoning\n",
" - Multi-session reasoning\n",
" - Knowledge updates\n",
"- Dedicated benchmarks: **NarrativeQA, QMSum, QuALITY, MEMENTO**\n",
"- Accuracy degradation tracking (~30% degradation in extended interactions)\n",
"\n",
"#### Tool-Integrated Reasoning:\n",
"- **MCP-RADAR framework** for standardized evaluation\n",
"- **Berkeley Function Calling Leaderboard (BFCL)** - 2,000 test cases\n",
"- **T-Eval** - 553 tool-use cases\n",
"- **API-Bank** - 73 APIs, 314 dialogues\n",
"- **ToolHop** - 995 queries, 3,912 tools\n",
"- **StableToolBench** for API instability\n",
"- **WebArena** and **Mind2Web** for web agents\n",
"- **VideoWebArena** for multimodal agents\n",
"- Metrics: tool selection accuracy, parameter extraction precision, execution success rates, error recovery\n",
"\n",
"#### Multi-Agent Systems:\n",
"- **Communication effectiveness metrics**\n",
"- **Coordination efficiency assessment**\n",
"- **Protocol adherence evaluation**\n",
"- **Task decomposition accuracy**\n",
"- **Emergent collaborative behaviors** assessment\n",
"- Context handling and transaction support evaluation\n",
"\n",
"### 3. **Emerging Evaluation Paradigms**\n",
"\n",
"#### Self-Refinement Evaluation:\n",
"- Iterative improvement assessment across multiple cycles\n",
"- Multi-dimensional feedback mechanisms\n",
"- Ensemble-based evaluation approaches\n",
"\n",
"#### Multi-Aspect Feedback:\n",
"- Correctness, relevance, clarity, and robustness dimensions\n",
"- Self-rewarding mechanisms for autonomous evolution\n",
"\n",
"#### Criticism-Guided Evaluation:\n",
"- Specialized critic models providing detailed feedback\n",
"- Fine-grained assessment of reasoning quality, factual accuracy, logical consistency\n",
"\n",
"### 4. **Safety and Robustness Assessment**\n",
"\n",
"- **Adversarial attack resistance testing**\n",
"- **Distribution shift evaluation**\n",
"- **Input perturbation testing**\n",
"- **Alignment assessment** (adherence to intended behaviors)\n",
"- **Graceful degradation strategies**\n",
"- **Error recovery protocols**\n",
"- **Long-term behavior consistency** evaluation\n",
"\n",
"### Key Benchmarks Mentioned:\n",
"- GAIA (general assistant tasks - 92% human vs 15% GPT-4 accuracy)\n",
"- GTA benchmark (GPT-4 <50% task completion vs 92% human)\n",
"- WebArena Leaderboard (with success rates ranging from 23.5% to 61.7%)\n",
"\n",
"### Challenges Identified:\n",
"- Traditional metrics (BLEU, ROUGE, perplexity) inadequate for complex systems\n",
"- Need for \"living\" benchmarks that co-evolve with AI capabilities\n",
"- Longitudinal evaluation frameworks for tracking memory fidelity over time\n",
"- Compositional generalization assessment\n",
"- Evaluation of \"unknown unknowns\" in multi-agent systems\n",
"\n",
"The paper emphasizes a **paradigm shift from static benchmarks to dynamic, holistic assessments** that evaluate not just task success but reasoning quality, robustness, and long-term autonomy."
]
}
],
"source": [
"query = \"What are the evaluation methods used in this paper?\"\n",
"\n",
"for chunk in pi_client.chat_completions(\n",
" messages=[{\"role\": \"user\", \"content\": query}],\n",
" doc_id=doc_id,\n",
" stream=True\n",
"):\n",
" print(chunk, end='', flush=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RXrdnxv92LY1"
},
"source": [
"## Agentic Retrieval with PageIndex Chat API\n",
"\n",
"With a suitable prompt, the PageIndex Chat API can act as a retrieval assistant that returns the raw relevant page content instead of a synthesized answer."
]
},
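{
"cell_type": "markdown",
"metadata": {},
"source": [
"The retrieval prompt in the next cell asks for a fenced JSON list of `page` / `content` objects, but the streamed reply also contains the assistant's narration. Once the full reply has been collected into a string, a helper along these lines could pull out the list. This is a sketch under assumptions: `parse_retrieved_pages` is a hypothetical name, it relies on the reply containing a `json` code fence as requested, and `json.loads` will raise an error if the model emits invalid JSON."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import re\n",
"\n",
"def parse_retrieved_pages(response_text):\n",
"    \"\"\"Parse the last ```json fenced block in a reply (hypothetical helper).\"\"\"\n",
"    blocks = re.findall(r\"```json\\s*(.*?)```\", response_text, flags=re.DOTALL)\n",
"    if not blocks:\n",
"        raise ValueError(\"No ```json block found in the reply\")\n",
"    return json.loads(blocks[-1])\n",
"\n",
"# Small self-contained check on a made-up reply (not real API output).\n",
"sample_reply = 'Let me look.\\n```json\\n[{\"page\": 45, \"content\": \"## 6. Evaluation\"}]\\n```'\n",
"pages = parse_retrieved_pages(sample_reply)\n",
"print(pages[0][\"page\"], pages[0][\"content\"])"
]
},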
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "puyOGkAi0wrH",
"outputId": "b08c682f-abfb-4381-8bb0-726d210120ad"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I'll help you find the evaluation methods discussed in this paper. Let me start by examining the document structure to locate relevant sections.{\"doc_name\": \"2507.13334_22.pdf\"}Now I can see the document structure. Section 6 is dedicated to \"Evaluation\" and spans pages 45-50. Let me retrieve the content from those pages to get the detailed evaluation methods.{\"doc_name\": \"2507.13334_22.pdf\", \"pages\": \"45-50\"}Perfect! I have retrieved the comprehensive evaluation section. Now let me compile the evaluation methods in the requested JSON format.\n",
"\n",
"```json\n",
"[\n",
" {\n",
" \"page\": 45,\n",
" \"content\": \"## 6. Evaluation\\n\\nThe evaluation of context-engineered systems presents unprecedented challenges that transcend traditional language model assessment paradigms. These systems exhibit complex, multi-component architectures with dynamic, context-dependent behaviors requiring comprehensive evaluation frameworks that assess component-level diagnostics, task-based performance, and overall system robustness [841, 1141].\\n\\nThe heterogeneous nature of context engineering components-spanning retrieval mechanisms, memory systems, reasoning chains, and multi-agent coordination-demands evaluation methodologies that can capture both individual component effectiveness and emergent system-level behaviors [314, 939].\\n\\n### 6.1. Evaluation Frameworks and Methodologies\\n\\nThis subsection presents comprehensive approaches for evaluating both individual components and integrated systems in context engineering.\\n\\n#### 6.1.1. Component-Level Assessment\\n\\nIntrinsic evaluation focuses on the performance of individual components in isolation, providing foundational insights into system capabilities and failure modes.\\n\\nFor prompt engineering components, evaluation encompasses prompt effectiveness measurement through semantic similarity metrics, response quality assessment, and robustness testing across diverse input variations. Current approaches reveal brittleness and robustness challenges in prompt design, necessitating more sophisticated evaluation frameworks that can assess contextual calibration and adaptive prompt optimization $[1141,669]$.\"\n",
" },\n",
" {\n",
" \"page\": 46,\n",
" \"content\": \"Long context processing evaluation requires specialized metrics addressing information retention, positional bias, and reasoning coherence across extended sequences. The \\\"needle in a haystack\\\" evaluation paradigm tests models' ability to retrieve specific information embedded within long contexts, while multi-document reasoning tasks assess synthesis capabilities across multiple information sources. Position interpolation techniques and ultra-long sequence processing methods face significant computational challenges that limit practical evaluation scenarios [737, 299].\\n\\nSelf-contextualization mechanisms undergo evaluation through meta-learning assessments, adaptation speed measurements, and consistency analysis across multiple iterations. Self-refinement frameworks including Self-Refine, Reflexion, and N-CRITICS demonstrate substantial performance improvements, with GPT-4 achieving approximately 20\\\\% improvement through iterative self-refinement processes [741, 964, 795]. Multi-dimensional feedback mechanisms and ensemble-based evaluation approaches provide comprehensive assessment of autonomous evolution capabilities [583, 710].\\n\\nStructured and relational data integration evaluation examines accuracy in knowledge graph traversal, table comprehension, and database query generation. However, current evaluation frameworks face significant limitations in assessing structural reasoning capabilities, with high-quality structured training data development presenting ongoing challenges. LSTM-based models demonstrate increased errors when sequential and structural information conflict, highlighting the need for more sophisticated benchmarks testing structural understanding $[769,674,167]$.\\n\\n#### 6.1.2. 
System-Level Integration Assessment\\n\\nExtrinsic evaluation measures end-to-end performance on downstream tasks, providing holistic assessments of system utility through comprehensive benchmarks spanning question answering, reasoning, and real-world applications.\\n\\nSystem-level evaluation must capture emergent behaviors arising from component interactions, including synergistic effects where combined components exceed individual performance and potential interference patterns where component integration degrades overall effectiveness [841, 1141].\\n\\nRetrieval-Augmented Generation evaluation encompasses both retrieval quality and generation effectiveness through comprehensive metrics addressing precision, recall, relevance, and factual accuracy. Agentic RAG systems introduce additional complexity requiring evaluation of task decomposition accuracy, multi-plan selection effectiveness, and memory-augmented planning capabilities. Self-reflection mechanisms demonstrate iterative improvement through feedback loops, with MemoryBank implementations incorporating Ebbinghaus Forgetting Curve principles for enhanced memory evaluation [444, 166, 1372, 1192, 41].\\n\\nMemory systems evaluation encounters substantial difficulties stemming from the absence of standardized assessment frameworks and the inherently stateless characteristics of contemporary LLMs. LongMemEval offers 500 carefully curated questions that evaluate fundamental capabilities encompassing information extraction, temporal reasoning, multi-session reasoning, and knowledge updates. Commercial AI assistants exhibit $30 \\\\%$ accuracy degradation throughout extended interactions, underscoring significant deficiencies in memory persistence and retrieval effectiveness [1340, 1180, 463, 847, 390]. 
Dedicated benchmarks such as NarrativeQA, QMSum, QuALITY, and MEMENTO tackle episodic memory evaluation challenges [556, 572].\\n\\nTool-integrated reasoning systems require comprehensive evaluation covering the entire interaction trajectory, including tool selection accuracy, parameter extraction precision, execution success rates, and error recovery capabilities. The MCP-RADAR framework provides standardized evaluation employing objective metrics for software engineering and mathematical reasoning domains. Real-world evaluati
" },\n",
" {\n",
" \"page\": 47,\n",
" \"content\": \"significant performance gaps, with GPT-4 completing less than 50\\\\% of tasks in the GTA benchmark, compared to human performance of $92 \\\\%$ [314, 1098, 126, 939]. Advanced benchmarks including BFCL (2,000 testing cases), T-Eval (553 tool-use cases), API-Bank (73 APIs, 314 dialogues), and ToolHop ( 995 queries, 3,912 tools) address multi-turn interactions and nested tool calling scenarios [263, 363, 377, 1264, 160, 835].\\n\\nMulti-agent systems evaluation captures communication effectiveness, coordination efficiency, and collective outcome quality through specialized metrics addressing protocol adherence, task decomposition accuracy, and emergent collaborative behaviors. Contemporary orchestration frameworks including LangGraph, AutoGen, and CAMEL demonstrate insufficient transaction support, with validation limitations emerging as systems rely exclusively on LLM self-validation capabilities without independent validation procedures. Context handling failures compound challenges as agents struggle with long-term context maintenance encompassing both episodic and semantic information [128, 394, 901].\\n\\n### 6.2. Benchmark Datasets and Evaluation Paradigms\\n\\nThis subsection reviews specialized benchmarks and evaluation paradigms designed for assessing context engineering system performance.\\n\\n#### 6.2.1. Foundational Component Benchmarks\\n\\nLong context processing evaluation employs specialized benchmark suites designed to test information retention, reasoning, and synthesis across extended sequences. Current benchmarks face significant computational complexity challenges, with $\\\\mathrm{O}\\\\left(\\\\mathrm{n}^{2}\\\\right)$ scaling limitations in attention mechanisms creating substantial memory constraints for ultra-long sequences. 
Position interpolation and extension techniques require sophisticated evaluation frameworks that can assess both computational efficiency and reasoning quality across varying sequence lengths [737, 299, 1236].\\n\\nAdvanced architectures including LongMamba and specialized position encoding methods demonstrate promising directions for long context processing, though evaluation reveals persistent challenges in maintaining coherence across extended sequences. The development of sliding attention mechanisms and memory-efficient implementations requires comprehensive benchmarks that can assess both computational tractability and task performance [1267, 351].\\n\\nStructured and relational data integration benchmarks encompass diverse knowledge representation formats and reasoning patterns. However, current evaluation frameworks face limitations in assessing structural reasoning capabilities, with the development of high-quality structured training data presenting ongoing challenges. Evaluation must address the fundamental tension between sequential and structural information processing, particularly in scenarios where these information types conflict [769, 674, 167].\\n\\n#### 6.2.2. System Implementation Benchmarks\\n\\nRetrieval-Augmented Generation evaluation leverages comprehensive benchmark suites addressing diverse retrieval and generation challenges. Modular RAG architectures demonstrate enhanced flexibility through specialized modules for retrieval, augmentation, and generation, enabling fine-grained evaluation of individual components and their interactions. Graph-enhanced RAG systems incorporating GraphRAG and LightRAG demonstrate improved performance in complex reasoning scenarios, though evaluation frameworks must address the additional complexity of graph traversal and multi-hop reasoning assessment [316, 973, 364].\\n\\nAgentic RAG systems introduce sophisticated planning and reflection mechanisms requiring evaluation\"\n",
" },\n",
" {\n",
" \"page\": 48,\n",
" \"content\": \"of task decomposition accuracy, multi-plan selection effectiveness, and iterative refinement capabilities. Real-time and streaming RAG applications present unique evaluation challenges in assessing both latency and accuracy under dynamic information conditions [444, 166, 1192].\\n\\nTool-integrated reasoning system evaluation employs comprehensive benchmarks spanning diverse tool usage scenarios and complexity levels. The Berkeley Function Calling Leaderboard (BFCL) provides 2,000 testing cases with step-by-step and end-to-end assessments measuring call accuracy, pass rates, and win rates across increasingly complex scenarios. T-Eval contributes 553 tool-use cases testing multi-turn interactions and nested tool calling capabilities [263, 1390, 835]. Advanced benchmarks including StableToolBench address API instability challenges, while NesTools evaluates nested tool scenarios and ToolHop assesses multi-hop tool usage across 995 queries and 3,912 tools [363, 377, 1264].\\n\\nWeb agent evaluation frameworks including WebArena and Mind2Web provide comprehensive assessment across thousands of tasks spanning 137 websites, revealing significant performance gaps in current LLM capabilities for complex web interactions. VideoWebArena extends evaluation to multimodal agents, while Deep Research Bench and DeepShop address specialized evaluation for research and shopping agents respectively $[1378,206,87,482]$.\\n\\nMulti-agent system evaluation employs specialized frameworks addressing coordination, communication, and collective intelligence. However, current frameworks face significant challenges in transactional integrity across complex workflows, with many systems lacking adequate compensation mechanisms for partial failures. 
Orchestration evaluation must address context management, coordination strategy effectiveness, and the ability to maintain system coherence under varying operational conditions [128, 901].\\n\\n| Release Date | Open Source | Method / Model | Success Rate (\\\\%) | Source |\\n| :-- | :--: | :-- | :--: | :-- |\\n| $2025-02$ | $\\\\times$ | IBM CUGA | 61.7 | $[753]$ |\\n| $2025-01$ | $\\\\times$ | OpenAI Operator | 58.1 | $[813]$ |\\n| $2024-08$ | $\\\\times$ | Jace.AI | 57.1 | $[476]$ |\\n| $2024-12$ | $\\\\times$ | ScribeAgent + GPT-4o | 53.0 | $[950]$ |\\n| $2025-01$ | $\\\\checkmark$ | AgentSymbiotic | 52.1 | $[1323]$ |\\n| $2025-01$ | $\\\\checkmark$ | Learn-by-Interact | 48.0 | $[998]$ |\\n| $2024-10$ | $\\\\checkmark$ | AgentOccam-Judge | 45.7 | $[1231]$ |\\n| $2024-08$ | $\\\\times$ | WebPilot | 37.2 | $[1331]$ |\\n| $2024-10$ | $\\\\checkmark$ | GUI-API Hybrid Agent | 35.8 | $[988]$ |\\n| $2024-09$ | $\\\\checkmark$ | Agent Workflow Memory | 35.5 | $[1144]$ |\\n| $2024-04$ | $\\\\checkmark$ | SteP | 33.5 | $[979]$ |\\n| $2025-06$ | $\\\\checkmark$ | TTI | 26.1 | $[951]$ |\\n| $2024-04$ | $\\\\checkmark$ | BrowserGym + GPT-4 | 23.5 | $[238]$ |\\n\\nTable 8: WebArena [1378] Leaderboard: Top performing models with their success rates and availability status.\\n\\n### 6.3. Evaluation Challenges and Emerging Paradigms\\n\\nThis subsection identifies current limitations in evaluation methodologies and explores emerging approaches for more effective assessment.\"\n",
" },\n",
" {\n",
" \"page\": 49,\n",
" \"content\": \"#### 6.3.1. Methodological Limitations and Biases\\n\\nTraditional evaluation metrics prove fundamentally inadequate for capturing the nuanced, dynamic behaviors exhibited by context-engineered systems. Static metrics like BLEU, ROUGE, and perplexity, originally designed for simpler text generation tasks, fail to assess complex reasoning chains, multi-step interactions, and emergent system behaviors. The inherent complexity and interdependencies of multi-component systems create attribution challenges where isolating failures and identifying root causes becomes computationally and methodologically intractable. Future metrics must evolve to capture not just task success, but the quality and robustness of the underlying reasoning process, especially in scenarios requiring compositional generalization and creative problem-solving [841, 1141].\\n\\nMemory system evaluation faces particular challenges due to the lack of standardized benchmarks and the stateless nature of current LLMs. Automated memory testing frameworks must address the isolation problem where different memory testing stages cannot be effectively separated, leading to unreliable assessment results. Commercial AI assistants demonstrate significant performance degradation during sustained interactions, with accuracy drops of up to $30 \\\\%$ highlighting critical gaps in current evaluation methodologies and pointing to the need for longitudinal evaluation frameworks that track memory fidelity over time $[1340,1180,463]$.\\n\\nTool-integrated reasoning system evaluation reveals substantial performance gaps between current systems and human-level capabilities. The GAIA benchmark demonstrates that while humans achieve $92 \\\\%$ accuracy on general assistant tasks, advanced models like GPT-4 achieve only $15 \\\\%$ accuracy, indicating fundamental limitations in current evaluation frameworks and system capabilities [778, 1098, 126]. 
Evaluation frameworks must address the complexity of multi-tool coordination, error recovery, and adaptive tool selection across diverse operational contexts [314, 939].\\n\\n#### 6.3.2. Emerging Evaluation Paradigms\\n\\nSelf-refinement evaluation paradigms leverage iterative improvement mechanisms to assess system capabilities across multiple refinement cycles. Frameworks including Self-Refine, Reflexion, and N-CRITICS demonstrate substantial performance improvements through multi-dimensional feedback and ensemblebased evaluation approaches. GPT-4 achieves approximately 20\\\\% improvement through self-refinement processes, highlighting the importance of evaluating systems across multiple iteration cycles rather than single-shot assessments. However, a key future challenge lies in evaluating the meta-learning capability itself—not just whether the system improves, but how efficiently and robustly it learns to refine its strategies over time $[741,964,795,583]$.\\n\\nMulti-aspect feedback evaluation incorporates diverse feedback dimensions including correctness, relevance, clarity, and robustness, providing comprehensive assessment of system outputs. Self-rewarding mechanisms enable autonomous evolution and meta-learning assessment, allowing systems to develop increasingly sophisticated evaluation criteria through iterative refinement [710].\\n\\nCriticism-guided evaluation employs specialized critic models to provide detailed feedback on system outputs, enabling fine-grained assessment of reasoning quality, factual accuracy, and logical consistency. 
These approaches address the limitations of traditional metrics by providing contextual, content-aware evaluation that can adapt to diverse task requirements and output formats [795, 583].\\n\\nOrchestration evaluation frameworks address the unique challenges of multi-agent coordination by incorporating transactional integrity assessment, context management evaluation, and coordination strategy effectiveness measurement. Advanced frameworks including SagaLLM provide transaction support and\"\n",
" },\n",
" {\n",
" \"page\": 50,\n",
" \"content\": \"independent validation procedures to address the limitations of systems that rely exclusively on LLM selfvalidation capabilities $[128,394]$.\\n\\n#### 6.3.3. Safety and Robustness Assessment\\n\\nSafety-oriented evaluation incorporates comprehensive robustness testing, adversarial attack resistance, and alignment assessment to ensure responsible development of context-engineered systems. Particular attention must be paid to the evaluation of agentic systems that can operate autonomously across extended periods, as these systems present unique safety challenges that traditional evaluation frameworks cannot adequately address $[973,364]$.\\n\\nRobustness evaluation must assess system performance under distribution shifts, input perturbations, and adversarial conditions through comprehensive stress testing protocols. Multi-agent systems face additional challenges in coordination failure scenarios, where partial system failures can cascade through the entire agent network. Evaluation frameworks must address graceful degradation strategies, error recovery protocols, and the ability to maintain system functionality under adverse conditions. Beyond predefined failure modes, future evaluation must grapple with assessing resilience to \\\"unknown unknowns\\\"-emergent and unpredictable failure cascades in highly complex, autonomous multi-agent systems [128, 394].\\n\\nAlignment evaluation measures system adherence to intended behaviors, value consistency, and beneficial outcome optimization through specialized assessment frameworks. Context engineering systems present unique alignment challenges due to their dynamic adaptation capabilities and complex interaction patterns across multiple components. 
Long-term evaluation must assess whether systems maintain beneficial behaviors as they adapt and evolve through extended operational periods [901].\\n\\nLooking ahead, the evaluation of context-engineered systems requires a paradigm shift from static benchmarks to dynamic, holistic assessments. Future frameworks must move beyond measuring task success to evaluating compositional generalization for novel problems and tracking long-term autonomy in interactive environments. The development of 'living' benchmarks that co-evolve with AI capabilities, alongside the integration of socio-technical and economic metrics, will be critical for ensuring these advanced systems are not only powerful but also reliable, efficient, and aligned with human values in real-world applications $[314,1378,1340]$.\\n\\nThe evaluation landscape for context-engineered systems continues evolving rapidly as new architectures, capabilities, and applications emerge. Future evaluation paradigms must address increasing system complexity while providing reliable, comprehensive, and actionable insights for system improvement and deployment decisions. The integration of multiple evaluation approaches-from component-level assessment to systemwide robustness testing-represents a critical research priority for ensuring the reliable deployment of context-engineered systems in real-world applications [841, 1141].\"\n",
" }\n",
"]\n",
"```"
]
}
],
"source": [
"retrieval_prompt = f\"\"\"\n",
"Your job is to retrieve the raw relevant content from the document based on the user's query.\n",
"\n",
"Query: {query}\n",
"\n",
"Return in JSON format:\n",
"```json\n",
"[\n",
" {{\n",
" \"page\": <number>,\n",
" \"content\": \"<raw text>\"\n",
" }},\n",
" ...\n",
"]\n",
"```\n",
"\"\"\"\n",
"\n",
"full_response = \"\"\n",
"\n",
"for chunk in pi_client.chat_completions(\n",
" messages=[{\"role\": \"user\", \"content\": retrieval_prompt}],\n",
" doc_id=doc_id,\n",
" stream=True\n",
"):\n",
" print(chunk, end='', flush=True)\n",
" full_response += chunk"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "d-Y9towQ_CiF"
},
"source": [
"### Extract the JSON retreived results"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "rwjC65oB05Tt",
"outputId": "64504ad5-1778-463f-989b-46e18aba2ea6"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n",
"[{'content': '## 6. Evaluation\\n'\n",
" '\\n'\n",
" 'The evaluation of context-engineered systems presents '\n",
" 'unprecedented challenges that transcend traditional language '\n",
" 'model assessment paradigms. These systems exhibit complex, '\n",
" 'multi-component architectures with dynamic, context-dependent '\n",
" 'behaviors requiring comprehensive evaluation frameworks that '\n",
" 'assess component-level diagnostics, task-based performance, and '\n",
" 'overall system robustness [841, 1141].\\n'\n",
" '\\n'\n",
" 'The heterogeneous nature of context engineering '\n",
" 'components-spanning retrieval mechanisms, memory systems, '\n",
" 'reasoning chains, and multi-agent coordination-demands '\n",
" 'evaluation methodologies that can capture both individual '\n",
" 'component effectiveness and emergent system-level behaviors '\n",
" '[314, 939].\\n'\n",
" '\\n'\n",
" '### 6.1. Evaluation Frameworks and Methodologies\\n'\n",
" '\\n'\n",
" 'This subsection presents comprehensive approaches for evaluating '\n",
" 'both individual components and integrated systems in context '\n",
" 'engineering.\\n'\n",
" '\\n'\n",
" '#### 6.1.1. Component-Level Assessment\\n'\n",
" '\\n'\n",
" 'Intrinsic evaluation focuses on the performance of individual '\n",
" 'components in isolation, providing foundational insights into '\n",
" 'system capabilities and failure modes.\\n'\n",
" '\\n'\n",
" 'For prompt engineering components, evaluation encompasses prompt '\n",
" 'effectiveness measurement through semantic similarity metrics, '\n",
" 'response quality assessment, and robustness testing across '\n",
" 'diverse input variations. Current approaches reveal brittleness '\n",
" 'and robustness challenges in prompt design, necessitating more '\n",
" 'sophisticated evaluation frameworks that can assess contextual '\n",
" 'calibration and adaptive prompt optimization $[1141,669]$.',\n",
" 'page': 45},\n",
" {'content': 'Long context processing evaluation requires specialized metrics '\n",
" 'addressing information retention, positional bias, and reasoning '\n",
" 'coherence across extended sequences. The \"needle in a haystack\" '\n",
" \"evaluation paradigm tests models' ability to retrieve specific \"\n",
" 'information embedded within long contexts, while multi-document '\n",
" 'reasoning tasks assess synthesis capabilities across multiple '\n",
" 'information sources. Position interpolation techniques and '\n",
" 'ultra-long sequence processing methods face significant '\n",
" 'computational challenges that limit practical evaluation '\n",
" 'scenarios [737, 299].\\n'\n",
" '\\n'\n",
" 'Self-contextualization mechanisms undergo evaluation through '\n",
" 'meta-learning assessments, adaptation speed measurements, and '\n",
" 'consistency analysis across multiple iterations. Self-refinement '\n",
" 'frameworks including Self-Refine, Reflexion, and N-CRITICS '\n",
" 'demonstrate substantial performance improvements, with GPT-4 '\n",
" 'achieving approximately 20\\\\% improvement through iterative '\n",
" 'self-refinement processes [741, 964, 795]. Multi-dimensional '\n",
" 'feedback mechanisms and ensemble-based evaluation approaches '\n",
" 'provide comprehensive assessment of autonomous evolution '\n",
" 'capabilities [583, 710].\\n'\n",
" '\\n'\n",
" 'Structured and relational data integration evaluation examines '\n",
" 'accuracy in knowledge graph traversal, table comprehension, and '\n",
" 'database query generation. However, current evaluation '\n",
" 'frameworks face significant limitations in assessing structural '\n",
" 'reasoning capabilities, with high-quality structured training '\n",
" 'data development presenting ongoing challenges. LSTM-based '\n",
" 'models demonstrate increased errors when sequential and '\n",
" 'structural information conflict, highlighting the need for more '\n",
" 'sophisticated benchmarks testing structural understanding '\n",
" '$[769,674,167]$.\\n'\n",
" '\\n'\n",
" '#### 6.1.2. System-Level Integration Assessment\\n'\n",
" '\\n'\n",
" 'Extrinsic evaluation measures end-to-end performance on '\n",
" 'downstream tasks, providing holistic assessments of system '\n",
" 'utility through comprehensive benchmarks spanning question '\n",
" 'answering, reasoning, and real-world applications.\\n'\n",
" '\\n'\n",
" 'System-level evaluation must capture emergent behaviors arising '\n",
" 'from component interactions, including synergistic effects where '\n",
" 'combined components exceed individual performance and potential '\n",
" 'interference patterns where component integration degrades '\n",
" 'overall effectiveness [841, 1141].\\n'\n",
" '\\n'\n",
" 'Retrieval-Augmented Generation evaluation encompasses both '\n",
" 'retrieval quality and generation effectiveness through '\n",
" 'comprehensive metrics addressing precision, recall, relevance, '\n",
" 'and factual accuracy. Agentic RAG systems introduce additional '\n",
" 'complexity requiring evaluation of task decomposition accuracy, '\n",
" 'multi-plan selection effectiveness, and memory-augmented '\n",
" 'planning capabilities. Self-reflection mechanisms demonstrate '\n",
" 'iterative improvement through feedback loops, with MemoryBank '\n",
" 'implementations incorporating Ebbinghaus Forgetting Curve '\n",
" 'principles for enhanced memory evaluation [444, 166, 1372, 1192, '\n",
" '41].\\n'\n",
" '\\n'\n",
" 'Memory systems evaluation encounters substantial difficulties '\n",
" 'stemming from the absence of standardized assessment frameworks '\n",
" 'and the inherently stateless characteristics of contemporary '\n",
" 'LLMs. LongMemEval offers 500 carefully curated questions that '\n",
" 'evaluate fundamental capabilities encompassing information '\n",
" 'extraction, temporal reasoning, multi-session reasoning, and '\n",
" 'knowledge updates. Commercial AI assistants exhibit $30 \\\\%$ '\n",
" 'accuracy degradation throughout extended interactions, '\n",
" 'underscoring significant deficiencies in memory persistence and '\n",
" 'retrieval effectiveness [1340, 1180, 463, 847, 390]. Dedicated '\n",
" 'benchmarks such as NarrativeQA, QMSum, QuALITY, and MEMENTO '\n",
" 'tackle episodic memory evaluation challenges [556, 572].\\n'\n",
" '\\n'\n",
" 'Tool-integrated reasoning systems require comprehensive '\n",
" 'evaluation covering the entire interaction trajectory, including '\n",
" 'tool selection accuracy, parameter extraction precision, '\n",
" 'execution success rates, and error recovery capabilities. The '\n",
" 'MCP-RADAR framework provides standardized evaluation employing '\n",
" 'objective metrics for software engineering and mathematical '\n",
" 'reasoning domains. Real-world evaluation reveals',\n",
" 'page': 46},\n",
" {'content': 'significant performance gaps, with GPT-4 completing less than '\n",
" '50\\\\% of tasks in the GTA benchmark, compared to human '\n",
" 'performance of $92 \\\\%$ [314, 1098, 126, 939]. Advanced '\n",
" 'benchmarks including BFCL (2,000 testing cases), T-Eval (553 '\n",
" 'tool-use cases), API-Bank (73 APIs, 314 dialogues), and ToolHop '\n",
" '( 995 queries, 3,912 tools) address multi-turn interactions and '\n",
" 'nested tool calling scenarios [263, 363, 377, 1264, 160, 835].\\n'\n",
" '\\n'\n",
" 'Multi-agent systems evaluation captures communication '\n",
" 'effectiveness, coordination efficiency, and collective outcome '\n",
" 'quality through specialized metrics addressing protocol '\n",
" 'adherence, task decomposition accuracy, and emergent '\n",
" 'collaborative behaviors. Contemporary orchestration frameworks '\n",
" 'including LangGraph, AutoGen, and CAMEL demonstrate insufficient '\n",
" 'transaction support, with validation limitations emerging as '\n",
" 'systems rely exclusively on LLM self-validation capabilities '\n",
" 'without independent validation procedures. Context handling '\n",
" 'failures compound challenges as agents struggle with long-term '\n",
" 'context maintenance encompassing both episodic and semantic '\n",
" 'information [128, 394, 901].\\n'\n",
" '\\n'\n",
" '### 6.2. Benchmark Datasets and Evaluation Paradigms\\n'\n",
" '\\n'\n",
" 'This subsection reviews specialized benchmarks and evaluation '\n",
" 'paradigms designed for assessing context engineering system '\n",
" 'performance.\\n'\n",
" '\\n'\n",
" '#### 6.2.1. Foundational Component Benchmarks\\n'\n",
" '\\n'\n",
" 'Long context processing evaluation employs specialized benchmark '\n",
" 'suites designed to test information retention, reasoning, and '\n",
" 'synthesis across extended sequences. Current benchmarks face '\n",
" 'significant computational complexity challenges, with '\n",
" '$\\\\mathrm{O}\\\\left(\\\\mathrm{n}^{2}\\\\right)$ scaling limitations '\n",
" 'in attention mechanisms creating substantial memory constraints '\n",
" 'for ultra-long sequences. Position interpolation and extension '\n",
" 'techniques require sophisticated evaluation frameworks that can '\n",
" 'assess both computational efficiency and reasoning quality '\n",
" 'across varying sequence lengths [737, 299, 1236].\\n'\n",
" '\\n'\n",
" 'Advanced architectures including LongMamba and specialized '\n",
" 'position encoding methods demonstrate promising directions for '\n",
" 'long context processing, though evaluation reveals persistent '\n",
" 'challenges in maintaining coherence across extended sequences. '\n",
" 'The development of sliding attention mechanisms and '\n",
" 'memory-efficient implementations requires comprehensive '\n",
" 'benchmarks that can assess both computational tractability and '\n",
" 'task performance [1267, 351].\\n'\n",
" '\\n'\n",
" 'Structured and relational data integration benchmarks encompass '\n",
" 'diverse knowledge representation formats and reasoning patterns. '\n",
" 'However, current evaluation frameworks face limitations in '\n",
" 'assessing structural reasoning capabilities, with the '\n",
" 'development of high-quality structured training data presenting '\n",
" 'ongoing challenges. Evaluation must address the fundamental '\n",
" 'tension between sequential and structural information '\n",
" 'processing, particularly in scenarios where these information '\n",
" 'types conflict [769, 674, 167].\\n'\n",
" '\\n'\n",
" '#### 6.2.2. System Implementation Benchmarks\\n'\n",
" '\\n'\n",
" 'Retrieval-Augmented Generation evaluation leverages '\n",
" 'comprehensive benchmark suites addressing diverse retrieval and '\n",
" 'generation challenges. Modular RAG architectures demonstrate '\n",
" 'enhanced flexibility through specialized modules for retrieval, '\n",
" 'augmentation, and generation, enabling fine-grained evaluation '\n",
" 'of individual components and their interactions. Graph-enhanced '\n",
" 'RAG systems incorporating GraphRAG and LightRAG demonstrate '\n",
" 'improved performance in complex reasoning scenarios, though '\n",
" 'evaluation frameworks must address the additional complexity of '\n",
" 'graph traversal and multi-hop reasoning assessment [316, 973, '\n",
" '364].\\n'\n",
" '\\n'\n",
" 'Agentic RAG systems introduce sophisticated planning and '\n",
" 'reflection mechanisms requiring evaluation',\n",
" 'page': 47},\n",
" {'content': 'of task decomposition accuracy, multi-plan selection '\n",
" 'effectiveness, and iterative refinement capabilities. Real-time '\n",
" 'and streaming RAG applications present unique evaluation '\n",
" 'challenges in assessing both latency and accuracy under dynamic '\n",
" 'information conditions [444, 166, 1192].\\n'\n",
" '\\n'\n",
" 'Tool-integrated reasoning system evaluation employs '\n",
" 'comprehensive benchmarks spanning diverse tool usage scenarios '\n",
" 'and complexity levels. The Berkeley Function Calling Leaderboard '\n",
" '(BFCL) provides 2,000 testing cases with step-by-step and '\n",
" 'end-to-end assessments measuring call accuracy, pass rates, and '\n",
" 'win rates across increasingly complex scenarios. T-Eval '\n",
" 'contributes 553 tool-use cases testing multi-turn interactions '\n",
" 'and nested tool calling capabilities [263, 1390, 835]. Advanced '\n",
" 'benchmarks including StableToolBench address API instability '\n",
" 'challenges, while NesTools evaluates nested tool scenarios and '\n",
" 'ToolHop assesses multi-hop tool usage across 995 queries and '\n",
" '3,912 tools [363, 377, 1264].\\n'\n",
" '\\n'\n",
" 'Web agent evaluation frameworks including WebArena and Mind2Web '\n",
" 'provide comprehensive assessment across thousands of tasks '\n",
" 'spanning 137 websites, revealing significant performance gaps in '\n",
" 'current LLM capabilities for complex web interactions. '\n",
" 'VideoWebArena extends evaluation to multimodal agents, while '\n",
" 'Deep Research Bench and DeepShop address specialized evaluation '\n",
" 'for research and shopping agents respectively '\n",
" '$[1378,206,87,482]$.\\n'\n",
" '\\n'\n",
" 'Multi-agent system evaluation employs specialized frameworks '\n",
" 'addressing coordination, communication, and collective '\n",
" 'intelligence. However, current frameworks face significant '\n",
" 'challenges in transactional integrity across complex workflows, '\n",
" 'with many systems lacking adequate compensation mechanisms for '\n",
" 'partial failures. Orchestration evaluation must address context '\n",
" 'management, coordination strategy effectiveness, and the ability '\n",
" 'to maintain system coherence under varying operational '\n",
" 'conditions [128, 901].\\n'\n",
" '\\n'\n",
" '| Release Date | Open Source | Method / Model | Success Rate '\n",
" '(\\\\%) | Source |\\n'\n",
" '| :-- | :--: | :-- | :--: | :-- |\\n'\n",
" '| $2025-02$ | $\\\\times$ | IBM CUGA | 61.7 | $[753]$ |\\n'\n",
" '| $2025-01$ | $\\\\times$ | OpenAI Operator | 58.1 | $[813]$ |\\n'\n",
" '| $2024-08$ | $\\\\times$ | Jace.AI | 57.1 | $[476]$ |\\n'\n",
" '| $2024-12$ | $\\\\times$ | ScribeAgent + GPT-4o | 53.0 | $[950]$ '\n",
" '|\\n'\n",
" '| $2025-01$ | $\\\\checkmark$ | AgentSymbiotic | 52.1 | $[1323]$ '\n",
" '|\\n'\n",
" '| $2025-01$ | $\\\\checkmark$ | Learn-by-Interact | 48.0 | $[998]$ '\n",
" '|\\n'\n",
" '| $2024-10$ | $\\\\checkmark$ | AgentOccam-Judge | 45.7 | $[1231]$ '\n",
" '|\\n'\n",
" '| $2024-08$ | $\\\\times$ | WebPilot | 37.2 | $[1331]$ |\\n'\n",
" '| $2024-10$ | $\\\\checkmark$ | GUI-API Hybrid Agent | 35.8 | '\n",
" '$[988]$ |\\n'\n",
" '| $2024-09$ | $\\\\checkmark$ | Agent Workflow Memory | 35.5 | '\n",
" '$[1144]$ |\\n'\n",
" '| $2024-04$ | $\\\\checkmark$ | SteP | 33.5 | $[979]$ |\\n'\n",
" '| $2025-06$ | $\\\\checkmark$ | TTI | 26.1 | $[951]$ |\\n'\n",
" '| $2024-04$ | $\\\\checkmark$ | BrowserGym + GPT-4 | 23.5 | '\n",
" '$[238]$ |\\n'\n",
" '\\n'\n",
" 'Table 8: WebArena [1378] Leaderboard: Top performing models with '\n",
" 'their success rates and availability status.\\n'\n",
" '\\n'\n",
" '### 6.3. Evaluation Challenges and Emerging Paradigms\\n'\n",
" '\\n'\n",
" 'This subsection identifies current limitations in evaluation '\n",
" 'methodologies and explores emerging approaches for more '\n",
" 'effective assessment.',\n",
" 'page': 48},\n",
" {'content': '#### 6.3.1. Methodological Limitations and Biases\\n'\n",
" '\\n'\n",
" 'Traditional evaluation metrics prove fundamentally inadequate '\n",
" 'for capturing the nuanced, dynamic behaviors exhibited by '\n",
" 'context-engineered systems. Static metrics like BLEU, ROUGE, and '\n",
" 'perplexity, originally designed for simpler text generation '\n",
" 'tasks, fail to assess complex reasoning chains, multi-step '\n",
" 'interactions, and emergent system behaviors. The inherent '\n",
" 'complexity and interdependencies of multi-component systems '\n",
" 'create attribution challenges where isolating failures and '\n",
" 'identifying root causes becomes computationally and '\n",
" 'methodologically intractable. Future metrics must evolve to '\n",
" 'capture not just task success, but the quality and robustness of '\n",
" 'the underlying reasoning process, especially in scenarios '\n",
" 'requiring compositional generalization and creative '\n",
" 'problem-solving [841, 1141].\\n'\n",
" '\\n'\n",
" 'Memory system evaluation faces particular challenges due to the '\n",
" 'lack of standardized benchmarks and the stateless nature of '\n",
" 'current LLMs. Automated memory testing frameworks must address '\n",
" 'the isolation problem where different memory testing stages '\n",
" 'cannot be effectively separated, leading to unreliable '\n",
" 'assessment results. Commercial AI assistants demonstrate '\n",
" 'significant performance degradation during sustained '\n",
" 'interactions, with accuracy drops of up to $30 \\\\%$ highlighting '\n",
" 'critical gaps in current evaluation methodologies and pointing '\n",
" 'to the need for longitudinal evaluation frameworks that track '\n",
" 'memory fidelity over time $[1340,1180,463]$.\\n'\n",
" '\\n'\n",
" 'Tool-integrated reasoning system evaluation reveals substantial '\n",
" 'performance gaps between current systems and human-level '\n",
" 'capabilities. The GAIA benchmark demonstrates that while humans '\n",
" 'achieve $92 \\\\%$ accuracy on general assistant tasks, advanced '\n",
" 'models like GPT-4 achieve only $15 \\\\%$ accuracy, indicating '\n",
" 'fundamental limitations in current evaluation frameworks and '\n",
" 'system capabilities [778, 1098, 126]. Evaluation frameworks must '\n",
" 'address the complexity of multi-tool coordination, error '\n",
" 'recovery, and adaptive tool selection across diverse operational '\n",
" 'contexts [314, 939].\\n'\n",
" '\\n'\n",
" '#### 6.3.2. Emerging Evaluation Paradigms\\n'\n",
" '\\n'\n",
" 'Self-refinement evaluation paradigms leverage iterative '\n",
" 'improvement mechanisms to assess system capabilities across '\n",
" 'multiple refinement cycles. Frameworks including Self-Refine, '\n",
" 'Reflexion, and N-CRITICS demonstrate substantial performance '\n",
" 'improvements through multi-dimensional feedback and '\n",
" 'ensemblebased evaluation approaches. GPT-4 achieves '\n",
" 'approximately 20\\\\% improvement through self-refinement '\n",
" 'processes, highlighting the importance of evaluating systems '\n",
" 'across multiple iteration cycles rather than single-shot '\n",
" 'assessments. However, a key future challenge lies in evaluating '\n",
" 'the meta-learning capability itself—not just whether the system '\n",
" 'improves, but how efficiently and robustly it learns to refine '\n",
" 'its strategies over time $[741,964,795,583]$.\\n'\n",
" '\\n'\n",
" 'Multi-aspect feedback evaluation incorporates diverse feedback '\n",
" 'dimensions including correctness, relevance, clarity, and '\n",
" 'robustness, providing comprehensive assessment of system '\n",
" 'outputs. Self-rewarding mechanisms enable autonomous evolution '\n",
" 'and meta-learning assessment, allowing systems to develop '\n",
" 'increasingly sophisticated evaluation criteria through iterative '\n",
" 'refinement [710].\\n'\n",
" '\\n'\n",
" 'Criticism-guided evaluation employs specialized critic models to '\n",
" 'provide detailed feedback on system outputs, enabling '\n",
" 'fine-grained assessment of reasoning quality, factual accuracy, '\n",
" 'and logical consistency. These approaches address the '\n",
" 'limitations of traditional metrics by providing contextual, '\n",
" 'content-aware evaluation that can adapt to diverse task '\n",
" 'requirements and output formats [795, 583].\\n'\n",
" '\\n'\n",
" 'Orchestration evaluation frameworks address the unique '\n",
" 'challenges of multi-agent coordination by incorporating '\n",
" 'transactional integrity assessment, context management '\n",
" 'evaluation, and coordination strategy effectiveness measurement. '\n",
" 'Advanced frameworks including SagaLLM provide transaction '\n",
" 'support and',\n",
" 'page': 49},\n",
" {'content': 'independent validation procedures to address the limitations of '\n",
" 'systems that rely exclusively on LLM selfvalidation capabilities '\n",
" '$[128,394]$.\\n'\n",
" '\\n'\n",
" '#### 6.3.3. Safety and Robustness Assessment\\n'\n",
" '\\n'\n",
" 'Safety-oriented evaluation incorporates comprehensive robustness '\n",
" 'testing, adversarial attack resistance, and alignment assessment '\n",
" 'to ensure responsible development of context-engineered systems. '\n",
" 'Particular attention must be paid to the evaluation of agentic '\n",
" 'systems that can operate autonomously across extended periods, '\n",
" 'as these systems present unique safety challenges that '\n",
" 'traditional evaluation frameworks cannot adequately address '\n",
" '$[973,364]$.\\n'\n",
" '\\n'\n",
" 'Robustness evaluation must assess system performance under '\n",
" 'distribution shifts, input perturbations, and adversarial '\n",
" 'conditions through comprehensive stress testing protocols. '\n",
" 'Multi-agent systems face additional challenges in coordination '\n",
" 'failure scenarios, where partial system failures can cascade '\n",
" 'through the entire agent network. Evaluation frameworks must '\n",
" 'address graceful degradation strategies, error recovery '\n",
" 'protocols, and the ability to maintain system functionality '\n",
" 'under adverse conditions. Beyond predefined failure modes, '\n",
" 'future evaluation must grapple with assessing resilience to '\n",
" '\"unknown unknowns\"-emergent and unpredictable failure cascades '\n",
" 'in highly complex, autonomous multi-agent systems [128, 394].\\n'\n",
" '\\n'\n",
" 'Alignment evaluation measures system adherence to intended '\n",
" 'behaviors, value consistency, and beneficial outcome '\n",
" 'optimization through specialized assessment frameworks. Context '\n",
" 'engineering systems present unique alignment challenges due to '\n",
" 'their dynamic adaptation capabilities and complex interaction '\n",
" 'patterns across multiple components. Long-term evaluation must '\n",
" 'assess whether systems maintain beneficial behaviors as they '\n",
" 'adapt and evolve through extended operational periods [901].\\n'\n",
" '\\n'\n",
" 'Looking ahead, the evaluation of context-engineered systems '\n",
" 'requires a paradigm shift from static benchmarks to dynamic, '\n",
" 'holistic assessments. Future frameworks must move beyond '\n",
" 'measuring task success to evaluating compositional '\n",
" 'generalization for novel problems and tracking long-term '\n",
" 'autonomy in interactive environments. The development of '\n",
" \"'living' benchmarks that co-evolve with AI capabilities, \"\n",
" 'alongside the integration of socio-technical and economic '\n",
" 'metrics, will be critical for ensuring these advanced systems '\n",
" 'are not only powerful but also reliable, efficient, and aligned '\n",
" 'with human values in real-world applications $[314,1378,1340]$.\\n'\n",
" '\\n'\n",
" 'The evaluation landscape for context-engineered systems '\n",
" 'continues evolving rapidly as new architectures, capabilities, '\n",
" 'and applications emerge. Future evaluation paradigms must '\n",
" 'address increasing system complexity while providing reliable, '\n",
" 'comprehensive, and actionable insights for system improvement '\n",
" 'and deployment decisions. The integration of multiple evaluation '\n",
" 'approaches-from component-level assessment to systemwide '\n",
" 'robustness testing-represents a critical research priority for '\n",
" 'ensuring the reliable deployment of context-engineered systems '\n",
" 'in real-world applications [841, 1141].',\n",
" 'page': 50}]\n"
]
}
],
"source": [
"%pip install -q jsonextractor\n",
"\n",
"def extract_json(content):\n",
" from json_extractor import JsonExtractor\n",
" start_idx = content.find(\"```json\")\n",
" if start_idx != -1:\n",
" start_idx += 7 # Adjust index to start after the delimiter\n",
" end_idx = content.rfind(\"```\")\n",
" json_content = content[start_idx:end_idx].strip()\n",
" return JsonExtractor.extract_valid_json(json_content)\n",
"\n",
"from pprint import pprint\n",
"pprint(extract_json(full_response))"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}