diff --git a/docs/TEST_SCENARIOS.md b/docs/TEST_SCENARIOS.md index c99ce4e..7e4ae4c 100644 --- a/docs/TEST_SCENARIOS.md +++ b/docs/TEST_SCENARIOS.md @@ -1,41 +1,152 @@ # Real-World Test Scenarios -This document provides concrete, real-world examples of testing AI agents with flakestorm across **all V2 pillars**: **mutation** (adversarial prompts), **environment chaos** (tool/LLM faults), **behavioral contracts** (invariants × chaos matrix), and **replay regression** (replay production incidents). Each scenario includes setup, config, and commands where applicable. - -**V2:** Use `version: "2.0"` in config to enable chaos, contracts, and replay. Flakestorm supports **24 mutation types** (prompt-level and system/network-level) and **max 50 mutations per run** in OSS. See [V2 Spec](V2_SPEC.md) and [V2 Audit](V2_AUDIT.md). +This document provides concrete, real-world examples of testing AI agents with flakestorm: environment chaos (tool/LLM faults), behavioral contracts (invariants × chaos matrix), replay regression, and adversarial mutations. Each scenario includes setup, config, and commands where applicable. Flakestorm supports **24 mutation types** and **max 50 mutations per run** in OSS. See [Configuration Guide](CONFIGURATION_GUIDE.md), [Spec](V2_SPEC.md), and [Audit](V2_AUDIT.md). --- ## Table of Contents -### V2 scenarios (all pillars) +### Scenarios with tool calling, chaos, contract, and replay -- [V2 Scenario: Environment Chaos](#v2-scenario-environment-chaos) — Tool/LLM fault injection -- [V2 Scenario: Behavioral Contract × Chaos Matrix](#v2-scenario-behavioral-contract--chaos-matrix) — Invariants under each chaos scenario -- [V2 Scenario: Replay Regression](#v2-scenario-replay-regression) — Replay production failures -- [Full V2 example (chaos + contract + replay)](../examples/v2_research_agent/README.md) — Working agent and config +1. [Research Agent with Search Tool](#scenario-1-research-agent-with-search-tool) — Search tool + LLM; chaos + contract +2. [Support Agent with KB Tool and Replay](#scenario-2-support-agent-with-kb-tool-and-replay) — KB tool; chaos + contract + replay +3. [Autonomous Planner with Multi-Tool Chain](#scenario-3-autonomous-planner-with-multi-tool-chain) — Multi-step agent (weather + calendar); chaos + contract +4. [Booking Agent with Calendar and Payment Tools](#scenario-4-booking-agent-with-calendar-and-payment-tools) — Two tools; chaos matrix + replay +5. [Data Pipeline Agent with Replay](#scenario-5-data-pipeline-agent-with-replay) — Pipeline tool; contract + replay regression +6. [Quick reference](#quick-reference-commands-and-config) -### Mutation-focused scenarios (agent + config examples) +### Additional scenarios (agent + config examples) -1. [Scenario 1: Customer Service Chatbot](#scenario-1-customer-service-chatbot) -2. [Scenario 2: Code Generation Agent](#scenario-2-code-generation-agent) -3. [Scenario 3: RAG-Based Q&A Agent](#scenario-3-rag-based-qa-agent) -4. [Scenario 4: Multi-Tool Agent (LangChain)](#scenario-4-multi-tool-agent-langchain) -5. [Scenario 5: Guardrailed Agent (Safety Testing)](#scenario-5-guardrailed-agent-safety-testing) -6. [Integration Guide](#integration-guide) +7. [Customer Service Chatbot](#scenario-6-customer-service-chatbot) +8. [Code Generation Agent](#scenario-7-code-generation-agent) +9. [RAG-Based Q&A Agent](#scenario-8-rag-based-qa-agent) +10. [Multi-Tool Agent (LangChain)](#scenario-9-multi-tool-agent-langchain) +11. [Guardrailed Agent (Safety Testing)](#scenario-10-guardrailed-agent-safety-testing) +12. [Integration Guide](#integration-guide) --- -## V2 Scenario: Environment Chaos +## Scenario 1: Research Agent with Search Tool -**Goal:** Test that your agent degrades gracefully when tools or the LLM fail (timeouts, errors, rate limits, malformed responses). +### The Agent -**Commands:** `flakestorm run --chaos` (mutations + chaos) or `flakestorm run --chaos --chaos-only` (golden prompts only, under chaos). Use `--chaos-profile api_outage` (or `degraded_llm`, `hostile_tools`, `high_latency`, `cascading_failure`) for built-in profiles. +A research assistant that **actually calls a search tool** over HTTP, then sends the query and search results to an LLM. We test it under environment chaos (tool/LLM faults) and a behavioral contract (must cite source, must complete). -**Config (excerpt):** +### Search Tool (Actual HTTP Service) + +The agent calls this service to fetch search results. For a single-endpoint HTTP agent, Flakestorm uses `tool: "*"` to fault the request to the agent, or use `match_url` when the agent makes outbound calls (see [Environment Chaos](ENVIRONMENT_CHAOS.md)). + +```python +# search_service.py — run on port 5001 +from fastapi import FastAPI + +app = FastAPI(title="Search Tool") + +@app.get("/search") +def search(q: str): + """Simulated search API. In production this might call a real search engine.""" + results = [ + {"title": "Wikipedia: " + q, "snippet": "According to Wikipedia, " + q + " is a topic."}, + {"title": "Source A", "snippet": "Per Source A, " + q + " has been documented."}, + ] + return {"query": q, "results": results} +``` + +### Agent Code (Actual Tool Calling) + +The agent receives the user query, **calls the search tool** via HTTP, then calls the LLM with the query and results. + +```python +# research_agent.py — run on port 8790 +import os +import httpx +from fastapi import FastAPI +from pydantic import BaseModel + +app = FastAPI(title="Research Agent with Search Tool") + +SEARCH_URL = os.environ.get("SEARCH_URL", "http://localhost:5001/search") +OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434/api/generate") +MODEL = os.environ.get("OLLAMA_MODEL", "gemma3:1b") + +class InvokeRequest(BaseModel): + input: str | None = None + prompt: str | None = None + +class InvokeResponse(BaseModel): + result: str + +def call_search(query: str) -> str: + """Actual tool call: HTTP GET to search service.""" + r = httpx.get(SEARCH_URL, params={"q": query}, timeout=10.0) + r.raise_for_status() + data = r.json() + snippets = [x.get("snippet", "") for x in data.get("results", [])[:3]] + return "\n".join(snippets) if snippets else "No results found." + +def call_llm(user_query: str, search_context: str) -> str: + """Call LLM with user query and tool output.""" + prompt = f"""You are a research assistant. Use the following search results to answer. Always cite the source. + +Search results: +{search_context} + +User question: {user_query} + +Answer (2-4 sentences, must cite source):""" + r = httpx.post( + OLLAMA_URL, + json={"model": MODEL, "prompt": prompt, "stream": False}, + timeout=60.0, + ) + r.raise_for_status() + return (r.json().get("response") or "").strip() + +@app.post("/reset") +def reset(): + return {"ok": True} + +@app.post("/invoke", response_model=InvokeResponse) +def invoke(req: InvokeRequest): + text = (req.input or req.prompt or "").strip() + if not text: + return InvokeResponse(result="Please ask a question.") + try: + search_context = call_search(text) # actual tool call + answer = call_llm(text, search_context) + return InvokeResponse(result=answer) + except Exception as e: + return InvokeResponse( + result="According to [system], the search or model failed. Please try again." + ) +``` + +### flakestorm Configuration ```yaml version: "2.0" +agent: + endpoint: "http://localhost:8790/invoke" + type: http + method: POST + request_template: '{"input": "{prompt}"}' + response_path: "result" + timeout: 15000 + reset_endpoint: "http://localhost:8790/reset" +model: + provider: ollama + name: gemma3:1b + base_url: "http://localhost:11434" +golden_prompts: + - "What is the capital of France?" + - "Summarize the benefits of renewable energy." +mutations: + count: 5 + types: [paraphrase, noise, prompt_injection] +invariants: + - type: latency + max_ms: 30000 + - type: output_not_empty chaos: tool_faults: - tool: "*" @@ -46,35 +157,18 @@ chaos: - mode: truncated_response max_tokens: 5 probability: 0.2 -``` - -**Docs:** [Environment Chaos](ENVIRONMENT_CHAOS.md), [V2 Audit §8.1](V2_AUDIT.md#1-prd-81--environment-chaos). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md). - ---- - -## V2 Scenario: Behavioral Contract × Chaos Matrix - -**Goal:** Verify that named invariants (with severity) hold under every chaos scenario; each (invariant × scenario) cell is an independent run. Optional `agent.reset_endpoint` or `agent.reset_function` for state isolation. - -**Commands:** `flakestorm contract run`, `flakestorm contract validate`, `flakestorm contract score`. - -**Config (excerpt):** - -```yaml -version: "2.0" -agent: - reset_endpoint: "http://localhost:8790/reset" contract: - name: "My Contract" + name: "Research Agent Contract" invariants: - - id: must-cite + - id: must-cite-source type: regex - pattern: "(?i)(source|according to)" + pattern: "(?i)(source|according to|per )" severity: critical - - id: max-latency - type: latency - max_ms: 60000 - severity: medium + when: always + - id: completes + type: completes + severity: high + when: always chaos_matrix: - name: "no-chaos" tool_faults: [] @@ -84,38 +178,616 @@ contract: - tool: "*" mode: error error_code: 503 +output: + format: html + path: "./reports" ``` -**Docs:** [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md), [V2 Spec](V2_SPEC.md) (contract matrix isolation, resilience score). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md). +### Running the Test + +```bash +# Terminal 1: Search tool +uvicorn search_service:app --host 0.0.0.0 --port 5001 +# Terminal 2: Agent (requires Ollama with gemma3:1b) +uvicorn research_agent:app --host 0.0.0.0 --port 8790 +# Terminal 3: Flakestorm +flakestorm run -c flakestorm.yaml +flakestorm run -c flakestorm.yaml --chaos +flakestorm contract run -c flakestorm.yaml +flakestorm ci -c flakestorm.yaml --min-score 0.5 +``` + +### What We're Testing + +| Pillar | What runs | What we verify | +|--------|-----------|----------------| +| **Mutation** | Adversarial prompts to agent (calls search + LLM) | Robustness to typos, paraphrases, injection. | +| **Chaos** | Tool 503 to agent, LLM truncated | Agent degrades gracefully (fallback, cites source when possible). | +| **Contract** | Contract x chaos matrix (no-chaos, api-outage) | Must cite source (critical), must complete (high); auto-FAIL if critical fails. | --- -## V2 Scenario: Replay Regression +## Scenario 2: Support Agent with KB Tool and Replay -**Goal:** Replay a saved session (e.g. production incident) with fixed inputs and tool responses, then verify the agent’s output against a contract. +### The Agent -**Commands:** `flakestorm replay run path/to/session.yaml -c flakestorm.yaml`, `flakestorm replay export --from-report report.json -o ./replays/`. Optional: `flakestorm replay run --from-langsmith RUN_ID --run` to import from LangSmith and run. +A customer support agent that **actually calls a knowledge-base (KB) tool** to fetch articles, then answers the user. We add a **replay session** from a production incident to verify the fix. -**Config (excerpt):** +### KB Tool (Actual HTTP Service) + +```python +# kb_service.py — run on port 5002 +from fastapi import FastAPI +from fastapi.responses import JSONResponse + +app = FastAPI(title="KB Tool") +ARTICLES = { + "reset-password": "To reset your password: go to Account > Security > Reset password. You will receive an email with a link.", + "cancel-subscription": "To cancel: Account > Billing > Cancel subscription. Refunds apply within 14 days.", +} + +@app.get("/kb/article") +def get_article(article_id: str): + """Actual tool: fetch KB article by ID.""" + if article_id not in ARTICLES: + return JSONResponse(status_code=404, content={"error": "Article not found"}) + return {"article_id": article_id, "content": ARTICLES[article_id]} +``` + +### Agent Code (Actual Tool Calling) + +The agent parses the user question, **calls the KB tool** to get the article, then formats a response. + +```python +# support_agent.py — run on port 8791 +import httpx +from fastapi import FastAPI +from pydantic import BaseModel + +app = FastAPI(title="Support Agent with KB Tool") +KB_URL = "http://localhost:5002/kb/article" + +class InvokeRequest(BaseModel): + input: str | None = None + prompt: str | None = None + +class InvokeResponse(BaseModel): + result: str + +def extract_article_id(query: str) -> str: + q = query.lower() + if "password" in q or "reset" in q: + return "reset-password" + if "cancel" in q or "subscription" in q: + return "cancel-subscription" + return "reset-password" + +def call_kb(article_id: str) -> str: + """Actual tool call: HTTP GET to KB service.""" + r = httpx.get(KB_URL, params={"article_id": article_id}, timeout=5.0) + if r.status_code != 200: + return f"[KB error: {r.status_code}]" + return r.json().get("content", "") + +@app.post("/reset") +def reset(): + return {"ok": True} + +@app.post("/invoke", response_model=InvokeResponse) +def invoke(req: InvokeRequest): + text = (req.input or req.prompt or "").strip() + if not text: + return InvokeResponse(result="Please describe your issue.") + try: + article_id = extract_article_id(text) + content = call_kb(article_id) # actual tool call + if not content or content.startswith("[KB error"): + return InvokeResponse(result="I could not find that article. Please contact support.") + return InvokeResponse(result=f"Here is what I found:\n\n{content}") + except Exception as e: + return InvokeResponse(result=f"Support system is temporarily unavailable. Please try again.") +``` + +### flakestorm Configuration ```yaml version: "2.0" +agent: + endpoint: "http://localhost:8791/invoke" + type: http + method: POST + request_template: '{"input": "{prompt}"}' + response_path: "result" + timeout: 10000 + reset_endpoint: "http://localhost:8791/reset" +golden_prompts: + - "How do I reset my password?" + - "I want to cancel my subscription." +invariants: + - type: output_not_empty + - type: latency + max_ms: 15000 +chaos: + tool_faults: + - tool: "*" + mode: error + error_code: 503 + probability: 0.25 +contract: + name: "Support Agent Contract" + invariants: + - id: not-empty + type: output_not_empty + severity: critical + when: always + - id: no-pii-leak + type: excludes_pii + severity: high + when: always + chaos_matrix: + - name: "no-chaos" + tool_faults: [] + llm_faults: [] + - name: "kb-down" + tool_faults: + - tool: "*" + mode: error + error_code: 503 replays: sessions: - - file: "replays/incident_001.yaml" - # Optional: sources for LangSmith import - # sources: ... + - file: "replays/support_incident_001.yaml" +scoring: + mutation: 0.20 + chaos: 0.35 + contract: 0.35 + replay: 0.10 +output: + format: html + path: "./reports" ``` -**Session file (e.g. `replays/incident_001.yaml`):** `id`, `input`, `tool_responses` (optional), `contract` (name or path). +### Replay Session (Production Incident) -**Docs:** [Replay Regression](REPLAY_REGRESSION.md), [V2 Audit §8.3](V2_AUDIT.md#3-prd-83--replay-based-regression). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md). +```yaml +# replays/support_incident_001.yaml +id: support-incident-001 +name: "Support agent failed when KB was down" +source: manual +input: "How do I reset my password?" +tool_responses: [] +contract: "Support Agent Contract" +``` + +### Running the Test + +```bash +# Terminal 1: KB service +uvicorn kb_service:app --host 0.0.0.0 --port 5002 +# Terminal 2: Support agent +uvicorn support_agent:app --host 0.0.0.0 --port 8791 +# Terminal 3: Flakestorm +flakestorm run -c flakestorm.yaml +flakestorm contract run -c flakestorm.yaml +flakestorm replay run replays/support_incident_001.yaml -c flakestorm.yaml +flakestorm ci -c flakestorm.yaml +``` + +### What We're Testing + +| Pillar | What runs | What we verify | +|--------|-----------|----------------| +| **Mutation** | Adversarial prompts to agent (calls KB tool) | Robustness to noisy/paraphrased support questions. | +| **Chaos** | Tool 503 to agent | Agent returns graceful message instead of crashing. | +| **Contract** | Invariants x chaos matrix | Output not empty (critical), no PII (high). | +| **Replay** | Replay support_incident_001.yaml | Same input passes contract (regression for production incident). | --- +## Scenario 3: Autonomous Planner with Multi-Tool Chain + +### The Agent + +An autonomous planner that chains multiple tool calls: it calls a weather tool, then a calendar tool, then formats a response. We test it under chaos (one tool fails) and a behavioral contract (response must complete and include a summary). + +### Tools (Weather + Calendar) + +```python +# tools_planner.py — run on port 5010 +from fastapi import FastAPI +from pydantic import BaseModel + +app = FastAPI(title="Planner Tools") + +@app.get("/weather") +def weather(city: str): + return {"city": city, "temp": 72, "condition": "Sunny"} + +@app.get("/calendar") +def calendar(date: str): + return {"date": date, "events": ["Meeting 10am", "Lunch 12pm"]} + +@app.post("/reset") +def reset(): + return {"ok": True} +``` + +### Agent Code (Multi-Step Tool Chain) + +```python +# planner_agent.py — port 8792 +import httpx +from fastapi import FastAPI +from pydantic import BaseModel + +app = FastAPI(title="Autonomous Planner Agent") +BASE = "http://localhost:5010" + +class InvokeRequest(BaseModel): + input: str | None = None + prompt: str | None = None + +class InvokeResponse(BaseModel): + result: str + +@app.post("/reset") +def reset(): + httpx.post(f"{BASE}/reset") + return {"ok": True} + +@app.post("/invoke", response_model=InvokeResponse) +def invoke(req: InvokeRequest): + text = (req.input or req.prompt or "").strip() + if not text: + return InvokeResponse(result="Please provide a request.") + try: + w = httpx.get(f"{BASE}/weather", params={"city": "Boston"}, timeout=5.0) + weather_data = w.json() if w.status_code == 200 else {} + c = httpx.get(f"{BASE}/calendar", params={"date": "today"}, timeout=5.0) + cal_data = c.json() if c.status_code == 200 else {} + summary = f"Weather: {weather_data.get('condition', 'N/A')}. Calendar: {len(cal_data.get('events', []))} events." + return InvokeResponse(result=f"Summary: {summary}") + except Exception as e: + return InvokeResponse(result=f"Summary: Planning unavailable ({type(e).__name__}).") +``` + +### flakestorm Configuration + +```yaml +version: "2.0" +agent: + endpoint: "http://localhost:8792/invoke" + type: http + method: POST + request_template: '{"input": "{prompt}"}' + response_path: "result" + timeout: 10000 + reset_endpoint: "http://localhost:8792/reset" +golden_prompts: + - "What is the weather and my schedule for today?" +invariants: + - type: output_not_empty + - type: latency + max_ms: 15000 +chaos: + tool_faults: + - tool: "*" + mode: error + error_code: 503 + probability: 0.3 +contract: + name: "Planner Contract" + invariants: + - id: completes + type: completes + severity: critical + when: always + chaos_matrix: + - name: "no-chaos" + tool_faults: [] + llm_faults: [] + - name: "tool-down" + tool_faults: + - tool: "*" + mode: error + error_code: 503 +output: + format: html + path: "./reports" +``` + +### Running the Test + +```bash +uvicorn tools_planner:app --host 0.0.0.0 --port 5010 +uvicorn planner_agent:app --host 0.0.0.0 --port 8792 +flakestorm run -c flakestorm.yaml +flakestorm run -c flakestorm.yaml --chaos +flakestorm contract run -c flakestorm.yaml +``` + +### What We're Testing + +| Pillar | What runs | What we verify | +|--------|-----------|----------------| +| **Chaos** | Tool 503 to agent | Agent returns summary or graceful fallback. | +| **Contract** | Invariants × chaos matrix (no-chaos, tool-down) | Must complete (critical). | + --- -## Scenario 1: Customer Service Chatbot +## Scenario 4: Booking Agent with Calendar and Payment Tools + +### The Agent + +A booking agent that calls a calendar API and a payment API to reserve a slot and confirm. We test under chaos (payment tool fails in one scenario) and replay a production incident. + +### Tools (Calendar + Payment) + +```python +# booking_tools.py — port 5011 +from fastapi import FastAPI +from pydantic import BaseModel + +app = FastAPI(title="Booking Tools") + +@app.post("/calendar/reserve") +def reserve_slot(slot: str): + return {"slot": slot, "confirmed": True, "id": "CAL-001"} + +@app.post("/payment/confirm") +def confirm_payment(amount: float, ref: str): + return {"ref": ref, "status": "paid", "amount": amount} +``` + +### Agent Code + +```python +# booking_agent.py — port 8793 +import httpx +from fastapi import FastAPI +from pydantic import BaseModel + +app = FastAPI(title="Booking Agent") +BASE = "http://localhost:5011" + +class InvokeRequest(BaseModel): + input: str | None = None + prompt: str | None = None + +class InvokeResponse(BaseModel): + result: str + +@app.post("/reset") +def reset(): + return {"ok": True} + +@app.post("/invoke", response_model=InvokeResponse) +def invoke(req: InvokeRequest): + text = (req.input or req.prompt or "").strip() + if not text: + return InvokeResponse(result="Please provide booking details.") + try: + r = httpx.post(f"{BASE}/calendar/reserve", json={"slot": "10:00"}, timeout=5.0) + cal = r.json() if r.status_code == 200 else {} + p = httpx.post(f"{BASE}/payment/confirm", json={"amount": 0, "ref": "BK-1"}, timeout=5.0) + pay = p.json() if p.status_code == 200 else {} + if cal.get("confirmed") and pay.get("status") == "paid": + return InvokeResponse(result=f"Booked. Ref: {pay.get('ref', 'N/A')}.") + return InvokeResponse(result="Booking could not be completed.") + except Exception as e: + return InvokeResponse(result=f"Booking unavailable ({type(e).__name__}).") +``` + +### flakestorm Configuration + +```yaml +version: "2.0" +agent: + endpoint: "http://localhost:8793/invoke" + type: http + method: POST + request_template: '{"input": "{prompt}"}' + response_path: "result" + timeout: 10000 + reset_endpoint: "http://localhost:8793/reset" +golden_prompts: + - "Book a slot at 10am and confirm payment." +invariants: + - type: output_not_empty +chaos: + tool_faults: + - tool: "*" + mode: error + error_code: 503 + probability: 0.25 +contract: + name: "Booking Contract" + invariants: + - id: not-empty + type: output_not_empty + severity: critical + when: always + chaos_matrix: + - name: "no-chaos" + tool_faults: [] + llm_faults: [] + - name: "payment-down" + tool_faults: + - tool: "*" + mode: error + error_code: 503 +replays: + sessions: + - file: "replays/booking_incident_001.yaml" +output: + format: html + path: "./reports" +``` + +### Replay Session + +```yaml +# replays/booking_incident_001.yaml +id: booking-incident-001 +name: "Booking failed when payment returned 503" +source: manual +input: "Book a slot at 10am and confirm payment." +contract: "Booking Contract" +``` + +### Running the Test + +```bash +uvicorn booking_tools:app --host 0.0.0.0 --port 5011 +uvicorn booking_agent:app --host 0.0.0.0 --port 8793 +flakestorm run -c flakestorm.yaml +flakestorm contract run -c flakestorm.yaml +flakestorm replay run replays/booking_incident_001.yaml -c flakestorm.yaml +flakestorm ci -c flakestorm.yaml +``` + +### What We're Testing + +| Pillar | What runs | What we verify | +|--------|-----------|----------------| +| **Chaos** | Tool 503 | Agent returns clear message when payment/calendar fails. | +| **Contract** | Invariants × chaos matrix | Output not empty (critical). | +| **Replay** | booking_incident_001.yaml | Same input passes contract. | + +--- + +## Scenario 5: Data Pipeline Agent with Replay + +### The Agent + +An agent that triggers a data pipeline via a tool and returns the run status. We verify behavior with a contract and replay a failed pipeline run. + +### Pipeline Tool + +```python +# pipeline_tool.py — port 5012 +from fastapi import FastAPI +from pydantic import BaseModel + +app = FastAPI(title="Pipeline Tool") + +@app.post("/pipeline/run") +def run_pipeline(job_id: str): + return {"job_id": job_id, "status": "success", "rows_processed": 1000} +``` + +### Agent Code + +```python +# pipeline_agent.py — port 8794 +import httpx +from fastapi import FastAPI +from pydantic import BaseModel + +app = FastAPI(title="Data Pipeline Agent") +BASE = "http://localhost:5012" + +class InvokeRequest(BaseModel): + input: str | None = None + prompt: str | None = None + +class InvokeResponse(BaseModel): + result: str + +@app.post("/reset") +def reset(): + return {"ok": True} + +@app.post("/invoke", response_model=InvokeResponse) +def invoke(req: InvokeRequest): + text = (req.input or req.prompt or "").strip() + if not text: + return InvokeResponse(result="Please specify a pipeline job.") + try: + r = httpx.post(f"{BASE}/pipeline/run", json={"job_id": "daily_etl"}, timeout=30.0) + data = r.json() if r.status_code == 200 else {} + status = data.get("status", "unknown") + return InvokeResponse(result=f"Pipeline run: {status}. Rows: {data.get('rows_processed', 0)}.") + except Exception as e: + return InvokeResponse(result=f"Pipeline run failed ({type(e).__name__}).") +``` + +### flakestorm Configuration + +```yaml +version: "2.0" +agent: + endpoint: "http://localhost:8794/invoke" + type: http + method: POST + request_template: '{"input": "{prompt}"}' + response_path: "result" + timeout: 35000 + reset_endpoint: "http://localhost:8794/reset" +golden_prompts: + - "Run the daily ETL pipeline." +invariants: + - type: output_not_empty + - type: latency + max_ms: 60000 +contract: + name: "Pipeline Contract" + invariants: + - id: not-empty + type: output_not_empty + severity: critical + when: always + chaos_matrix: + - name: "no-chaos" + tool_faults: [] + llm_faults: [] +replays: + sessions: + - file: "replays/pipeline_fail_001.yaml" +output: + format: html + path: "./reports" +``` + +### Replay Session + +```yaml +# replays/pipeline_fail_001.yaml +id: pipeline-fail-001 +name: "Pipeline agent returned empty on timeout" +source: manual +input: "Run the daily ETL pipeline." +contract: "Pipeline Contract" +``` + +### Running the Test + +```bash +uvicorn pipeline_tool:app --host 0.0.0.0 --port 5012 +uvicorn pipeline_agent:app --host 0.0.0.0 --port 8794 +flakestorm run -c flakestorm.yaml +flakestorm contract run -c flakestorm.yaml +flakestorm replay run replays/pipeline_fail_001.yaml -c flakestorm.yaml +``` + +### What We're Testing + +| Pillar | What runs | What we verify | +|--------|-----------|----------------| +| **Contract** | Invariants × chaos matrix | Output not empty (critical). | +| **Replay** | pipeline_fail_001.yaml | Regression: same input passes contract after fix. | + +--- + +## Quick reference: commands and config + +- **Environment chaos:** [Environment Chaos](ENVIRONMENT_CHAOS.md). Use `match_url` for per-URL fault injection when your agent makes outbound HTTP calls. +- **Behavioral contracts:** [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md). Reset: `agent.reset_endpoint` or `agent.reset_function`. +- **Replay regression:** [Replay Regression](REPLAY_REGRESSION.md). +- **Full example:** [Research Agent example](../examples/v2_research_agent/README.md). + +--- + +## Scenario 6: Customer Service Chatbot ### The Agent @@ -267,7 +939,7 @@ flakestorm run --output html --- -## Scenario 2: Code Generation Agent +## Scenario 7: Code Generation Agent ### The Agent @@ -373,7 +1045,7 @@ invariants: --- -## Scenario 3: RAG-Based Q&A Agent +## Scenario 8: RAG-Based Q&A Agent ### The Agent @@ -453,7 +1125,7 @@ invariants: --- -## Scenario 4: Multi-Tool Agent (LangChain) +## Scenario 9: Multi-Tool Agent (LangChain) ### The Agent @@ -557,7 +1229,7 @@ invariants: --- -## Scenario 5: Guardrailed Agent (Safety Testing) +## Scenario 10: Guardrailed Agent (Safety Testing) ### The Agent