mirror of
https://github.com/flakestorm/flakestorm.git
synced 2026-04-25 00:36:54 +02:00
Update TEST_SCENARIOS.md to include detailed descriptions of V2 scenarios, specifically focusing on the Research Agent with Search Tool and Support Agent with KB Tool. Enhanced documentation with actual tool calling examples, chaos and contract testing details, and configuration settings for improved clarity and usability.
parent
11489255e3
commit
f1570628c3
1 changed files with 731 additions and 59 deletions
|
|
@ -1,41 +1,152 @@
|
|||
# Real-World Test Scenarios
|
||||
|
||||
This document provides concrete, real-world examples of testing AI agents with flakestorm across **all V2 pillars**: **mutation** (adversarial prompts), **environment chaos** (tool/LLM faults), **behavioral contracts** (invariants × chaos matrix), and **replay regression** (replay production incidents). Each scenario includes setup, config, and commands where applicable.
|
||||
|
||||
**V2:** Use `version: "2.0"` in config to enable chaos, contracts, and replay. Flakestorm supports **24 mutation types** (prompt-level and system/network-level) and **max 50 mutations per run** in OSS. See [V2 Spec](V2_SPEC.md) and [V2 Audit](V2_AUDIT.md).
|
||||
This document provides concrete, real-world examples of testing AI agents with flakestorm: environment chaos (tool/LLM faults), behavioral contracts (invariants × chaos matrix), replay regression, and adversarial mutations. Each scenario includes setup, config, and commands where applicable. Flakestorm supports **24 mutation types** and **max 50 mutations per run** in OSS. See [Configuration Guide](CONFIGURATION_GUIDE.md), [Spec](V2_SPEC.md), and [Audit](V2_AUDIT.md).
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
|
||||
### V2 scenarios (all pillars)
|
||||
### Scenarios with tool calling, chaos, contract, and replay
|
||||
|
||||
- [V2 Scenario: Environment Chaos](#v2-scenario-environment-chaos) — Tool/LLM fault injection
|
||||
- [V2 Scenario: Behavioral Contract × Chaos Matrix](#v2-scenario-behavioral-contract--chaos-matrix) — Invariants under each chaos scenario
|
||||
- [V2 Scenario: Replay Regression](#v2-scenario-replay-regression) — Replay production failures
|
||||
- [Full V2 example (chaos + contract + replay)](../examples/v2_research_agent/README.md) — Working agent and config
|
||||
1. [Research Agent with Search Tool](#scenario-1-research-agent-with-search-tool) — Search tool + LLM; chaos + contract
|
||||
2. [Support Agent with KB Tool and Replay](#scenario-2-support-agent-with-kb-tool-and-replay) — KB tool; chaos + contract + replay
|
||||
3. [Autonomous Planner with Multi-Tool Chain](#scenario-3-autonomous-planner-with-multi-tool-chain) — Multi-step agent (weather + calendar); chaos + contract
|
||||
4. [Booking Agent with Calendar and Payment Tools](#scenario-4-booking-agent-with-calendar-and-payment-tools) — Two tools; chaos matrix + replay
|
||||
5. [Data Pipeline Agent with Replay](#scenario-5-data-pipeline-agent-with-replay) — Pipeline tool; contract + replay regression
|
||||
6. [Quick reference](#quick-reference-commands-and-config)
|
||||
|
||||
### Mutation-focused scenarios (agent + config examples)
|
||||
### Additional scenarios (agent + config examples)
|
||||
|
||||
1. [Scenario 1: Customer Service Chatbot](#scenario-1-customer-service-chatbot)
|
||||
2. [Scenario 2: Code Generation Agent](#scenario-2-code-generation-agent)
|
||||
3. [Scenario 3: RAG-Based Q&A Agent](#scenario-3-rag-based-qa-agent)
|
||||
4. [Scenario 4: Multi-Tool Agent (LangChain)](#scenario-4-multi-tool-agent-langchain)
|
||||
5. [Scenario 5: Guardrailed Agent (Safety Testing)](#scenario-5-guardrailed-agent-safety-testing)
|
||||
6. [Integration Guide](#integration-guide)
|
||||
7. [Customer Service Chatbot](#scenario-6-customer-service-chatbot)
|
||||
8. [Code Generation Agent](#scenario-7-code-generation-agent)
|
||||
9. [RAG-Based Q&A Agent](#scenario-8-rag-based-qa-agent)
|
||||
10. [Multi-Tool Agent (LangChain)](#scenario-9-multi-tool-agent-langchain)
|
||||
11. [Guardrailed Agent (Safety Testing)](#scenario-10-guardrailed-agent-safety-testing)
|
||||
12. [Integration Guide](#integration-guide)
|
||||
|
||||
---
|
||||
|
||||
## V2 Scenario: Environment Chaos
|
||||
## Scenario 1: Research Agent with Search Tool
|
||||
|
||||
**Goal:** Test that your agent degrades gracefully when tools or the LLM fail (timeouts, errors, rate limits, malformed responses).
|
||||
### The Agent
|
||||
|
||||
**Commands:** `flakestorm run --chaos` (mutations + chaos) or `flakestorm run --chaos --chaos-only` (golden prompts only, under chaos). Use `--chaos-profile api_outage` (or `degraded_llm`, `hostile_tools`, `high_latency`, `cascading_failure`) for built-in profiles.
|
||||
A research assistant that **actually calls a search tool** over HTTP, then sends the query and search results to an LLM. We test it under environment chaos (tool/LLM faults) and a behavioral contract (must cite source, must complete).
|
||||
|
||||
**Config (excerpt):**
|
||||
### Search Tool (Actual HTTP Service)
|
||||
|
||||
The agent calls this service to fetch search results. For a single-endpoint HTTP agent, use `tool: "*"` to inject faults into the request to the agent itself, or use `match_url` to target the outbound calls the agent makes (see [Environment Chaos](ENVIRONMENT_CHAOS.md)).
|
||||
|
||||
```python
# search_service.py — run on port 5001
from fastapi import FastAPI

app = FastAPI(title="Search Tool")

@app.get("/search")
def search(q: str):
    """Simulated search API. In production this might call a real search engine."""
    results = [
        {"title": "Wikipedia: " + q, "snippet": "According to Wikipedia, " + q + " is a topic."},
        {"title": "Source A", "snippet": "Per Source A, " + q + " has been documented."},
    ]
    return {"query": q, "results": results}
```
|
||||
|
||||
### Agent Code (Actual Tool Calling)
|
||||
|
||||
The agent receives the user query, **calls the search tool** via HTTP, then calls the LLM with the query and results.
|
||||
|
||||
```python
# research_agent.py — run on port 8790
import os
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Research Agent with Search Tool")

SEARCH_URL = os.environ.get("SEARCH_URL", "http://localhost:5001/search")
OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434/api/generate")
MODEL = os.environ.get("OLLAMA_MODEL", "gemma3:1b")

class InvokeRequest(BaseModel):
    input: str | None = None
    prompt: str | None = None

class InvokeResponse(BaseModel):
    result: str

def call_search(query: str) -> str:
    """Actual tool call: HTTP GET to search service."""
    r = httpx.get(SEARCH_URL, params={"q": query}, timeout=10.0)
    r.raise_for_status()
    data = r.json()
    snippets = [x.get("snippet", "") for x in data.get("results", [])[:3]]
    return "\n".join(snippets) if snippets else "No results found."

def call_llm(user_query: str, search_context: str) -> str:
    """Call LLM with user query and tool output."""
    prompt = f"""You are a research assistant. Use the following search results to answer. Always cite the source.

Search results:
{search_context}

User question: {user_query}

Answer (2-4 sentences, must cite source):"""
    r = httpx.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=60.0,
    )
    r.raise_for_status()
    return (r.json().get("response") or "").strip()

@app.post("/reset")
def reset():
    return {"ok": True}

@app.post("/invoke", response_model=InvokeResponse)
def invoke(req: InvokeRequest):
    text = (req.input or req.prompt or "").strip()
    if not text:
        return InvokeResponse(result="Please ask a question.")
    try:
        search_context = call_search(text)  # actual tool call
        answer = call_llm(text, search_context)
        return InvokeResponse(result=answer)
    except Exception:
        # Fallback keeps "According to" so the must-cite-source regex still matches.
        return InvokeResponse(
            result="According to [system], the search or model failed. Please try again."
        )
```
|
||||
|
||||
### flakestorm Configuration
|
||||
|
||||
```yaml
|
||||
version: "2.0"
agent:
  endpoint: "http://localhost:8790/invoke"
  type: http
  method: POST
  request_template: '{"input": "{prompt}"}'
  response_path: "result"
  timeout: 15000
  reset_endpoint: "http://localhost:8790/reset"
model:
  provider: ollama
  name: gemma3:1b
  base_url: "http://localhost:11434"
golden_prompts:
  - "What is the capital of France?"
  - "Summarize the benefits of renewable energy."
mutations:
  count: 5
  types: [paraphrase, noise, prompt_injection]
invariants:
  - type: latency
    max_ms: 30000
  - type: output_not_empty
chaos:
  tool_faults:
    - tool: "*"
|
||||
|
|
@ -46,35 +157,18 @@ chaos:
|
|||
- mode: truncated_response
|
||||
max_tokens: 5
|
||||
probability: 0.2
|
||||
```
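The fault entries in the config above carry a `probability`. A reasonable reading is that each call is faulted independently with that chance. The sketch below simulates that idea in plain Python. `FaultInjector` and `fake_tool` are hypothetical names used only for illustration; they are not part of Flakestorm's API, and Flakestorm's actual injection mechanism may differ.

```python
import random


class FaultInjector:
    """Wrap a callable and, with a given probability, return a simulated error instead."""

    def __init__(self, func, probability, error_code=503, seed=None):
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must be between 0 and 1")
        self.func = func
        self.probability = probability
        self.error_code = error_code
        self.rng = random.Random(seed)  # seeded so a run can be reproduced

    def __call__(self, *args, **kwargs):
        # random() returns a float in [0.0, 1.0), so probability=0 never
        # faults and probability=1 always faults.
        if self.rng.random() < self.probability:
            return {"status": self.error_code, "body": None}
        return {"status": 200, "body": self.func(*args, **kwargs)}


def fake_tool(query):
    """Stand-in for a real tool call (hypothetical)."""
    return f"results for {query}"


if __name__ == "__main__":
    flaky = FaultInjector(fake_tool, probability=0.2, seed=42)
    statuses = [flaky("q")["status"] for _ in range(1000)]
    print("faulted calls:", statuses.count(503), "of", len(statuses))
```

Seeding the generator is a design choice for the sketch: it makes a flaky run repeatable, which helps when an agent only misbehaves under a particular sequence of faults.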
|
||||
|
||||
**Docs:** [Environment Chaos](ENVIRONMENT_CHAOS.md), [V2 Audit §8.1](V2_AUDIT.md#1-prd-81--environment-chaos). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md).
|
||||
|
||||
---
|
||||
|
||||
## V2 Scenario: Behavioral Contract × Chaos Matrix
|
||||
|
||||
**Goal:** Verify that named invariants (with severity) hold under every chaos scenario; each (invariant × scenario) cell is an independent run. Optional `agent.reset_endpoint` or `agent.reset_function` for state isolation.
|
||||
|
||||
**Commands:** `flakestorm contract run`, `flakestorm contract validate`, `flakestorm contract score`.
|
||||
|
||||
**Config (excerpt):**
|
||||
|
||||
```yaml
|
||||
version: "2.0"
|
||||
agent:
|
||||
reset_endpoint: "http://localhost:8790/reset"
|
||||
contract:
|
||||
name: "My Contract"
|
||||
name: "Research Agent Contract"
|
||||
invariants:
|
||||
- id: must-cite
|
||||
- id: must-cite-source
|
||||
type: regex
|
||||
pattern: "(?i)(source|according to)"
|
||||
pattern: "(?i)(source|according to|per )"
|
||||
severity: critical
|
||||
- id: max-latency
|
||||
type: latency
|
||||
max_ms: 60000
|
||||
severity: medium
|
||||
when: always
|
||||
- id: completes
|
||||
type: completes
|
||||
severity: high
|
||||
when: always
|
||||
chaos_matrix:
|
||||
- name: "no-chaos"
|
||||
tool_faults: []
|
||||
|
|
@ -84,38 +178,616 @@ contract:
|
|||
- tool: "*"
|
||||
mode: error
|
||||
error_code: 503
|
||||
output:
|
||||
format: html
|
||||
path: "./reports"
|
||||
```
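The `must-cite-source` invariant is a regex check, so it helps to see which outputs it accepts. The sketch below applies the pattern from the config to a few sample strings. It assumes the pattern is applied as a search anywhere in the output; how Flakestorm applies it internally is not shown in this document. `cites_source` is a hypothetical helper.

```python
import re

# Pattern copied verbatim from the must-cite-source invariant in the config.
CITE_PATTERN = re.compile(r"(?i)(source|according to|per )")


def cites_source(output: str) -> bool:
    """Return True if the pattern matches anywhere in the output (search semantics assumed)."""
    return CITE_PATTERN.search(output) is not None


SAMPLES = [
    "According to [system], the search or model failed. Please try again.",  # agent fallback
    "Per Source A, Paris has been documented.",
    "Paris is the capital of France.",  # no citation marker
    "This paper is about France.",  # no citation, but "paper " contains "per "
]

if __name__ == "__main__":
    for text in SAMPLES:
        print(cites_source(text), "-", text)
```

Two things stand out under these assumptions. The agent's fallback message passes the check because it begins with "According to". The `per ` alternative also matches inside words such as "paper", so a stricter pattern (for example with a word boundary) may be worth considering if false positives matter.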
|
||||
|
||||
**Docs:** [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md), [V2 Spec](V2_SPEC.md) (contract matrix isolation, resilience score). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md).
|
||||
### Running the Test
|
||||
|
||||
```bash
# Terminal 1: Search tool
uvicorn search_service:app --host 0.0.0.0 --port 5001
# Terminal 2: Agent (requires Ollama with gemma3:1b)
uvicorn research_agent:app --host 0.0.0.0 --port 8790
# Terminal 3: Flakestorm
flakestorm run -c flakestorm.yaml
flakestorm run -c flakestorm.yaml --chaos
flakestorm contract run -c flakestorm.yaml
flakestorm ci -c flakestorm.yaml --min-score 0.5
```
|
||||
|
||||
### What We're Testing
|
||||
|
||||
| Pillar | What runs | What we verify |
|--------|-----------|----------------|
| **Mutation** | Adversarial prompts to agent (calls search + LLM) | Robustness to typos, paraphrases, injection. |
| **Chaos** | Tool 503 to agent, LLM truncated | Agent degrades gracefully (fallback, cites source when possible). |
| **Contract** | Contract × chaos matrix (no-chaos, api-outage) | Must cite source (critical), must complete (high); auto-FAIL if critical fails. |
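The contract row above pairs every invariant with every chaos scenario. The sketch below enumerates those cells and checks for a critical failure, using the invariant ids, severities, and scenario names from this scenario. `matrix_cells` and `has_critical_failure` are hypothetical helpers for illustration, not Flakestorm functions, and the real scoring rules live in the V2 Spec.

```python
from itertools import product

# Invariant ids/severities and scenario names taken from the Scenario 1 contract.
INVARIANTS = [
    {"id": "must-cite-source", "severity": "critical"},
    {"id": "max-latency", "severity": "medium"},
    {"id": "completes", "severity": "high"},
]
SCENARIOS = ["no-chaos", "api-outage"]


def matrix_cells(invariants, scenarios):
    """Enumerate every (invariant id, scenario name) cell of the matrix."""
    return [(inv["id"], name) for inv, name in product(invariants, scenarios)]


def has_critical_failure(results, invariants):
    """results maps (invariant id, scenario name) -> True if the check passed."""
    severity = {inv["id"]: inv["severity"] for inv in invariants}
    return any(
        not passed and severity[inv_id] == "critical"
        for (inv_id, _scenario), passed in results.items()
    )


if __name__ == "__main__":
    for cell in matrix_cells(INVARIANTS, SCENARIOS):
        print(cell)
```

With three invariants and two scenarios the matrix has six cells. A single failing cell for `must-cite-source` is enough to trip the critical check, while a failing `max-latency` cell alone is not.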
|
||||
|
||||
---
|
||||
|
||||
## V2 Scenario: Replay Regression
|
||||
## Scenario 2: Support Agent with KB Tool and Replay
|
||||
|
||||
**Goal:** Replay a saved session (e.g. production incident) with fixed inputs and tool responses, then verify the agent’s output against a contract.
|
||||
### The Agent
|
||||
|
||||
**Commands:** `flakestorm replay run path/to/session.yaml -c flakestorm.yaml`, `flakestorm replay export --from-report report.json -o ./replays/`. Optional: `flakestorm replay run --from-langsmith RUN_ID --run` to import from LangSmith and run.
|
||||
A customer support agent that **actually calls a knowledge-base (KB) tool** to fetch articles, then answers the user. We add a **replay session** from a production incident to verify the fix.
|
||||
|
||||
**Config (excerpt):**
|
||||
### KB Tool (Actual HTTP Service)
|
||||
|
||||
```python
# kb_service.py — run on port 5002
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI(title="KB Tool")
ARTICLES = {
    "reset-password": "To reset your password: go to Account > Security > Reset password. You will receive an email with a link.",
    "cancel-subscription": "To cancel: Account > Billing > Cancel subscription. Refunds apply within 14 days.",
}

@app.get("/kb/article")
def get_article(article_id: str):
    """Actual tool: fetch KB article by ID."""
    if article_id not in ARTICLES:
        return JSONResponse(status_code=404, content={"error": "Article not found"})
    return {"article_id": article_id, "content": ARTICLES[article_id]}
```
|
||||
|
||||
### Agent Code (Actual Tool Calling)
|
||||
|
||||
The agent parses the user question, **calls the KB tool** to get the article, then formats a response.
|
||||
|
||||
```python
# support_agent.py — run on port 8791
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Support Agent with KB Tool")
KB_URL = "http://localhost:5002/kb/article"

class InvokeRequest(BaseModel):
    input: str | None = None
    prompt: str | None = None

class InvokeResponse(BaseModel):
    result: str

def extract_article_id(query: str) -> str:
    """Map the question to a KB article ID by keyword; defaults to reset-password."""
    q = query.lower()
    if "password" in q or "reset" in q:
        return "reset-password"
    if "cancel" in q or "subscription" in q:
        return "cancel-subscription"
    return "reset-password"

def call_kb(article_id: str) -> str:
    """Actual tool call: HTTP GET to KB service."""
    r = httpx.get(KB_URL, params={"article_id": article_id}, timeout=5.0)
    if r.status_code != 200:
        return f"[KB error: {r.status_code}]"
    return r.json().get("content", "")

@app.post("/reset")
def reset():
    return {"ok": True}

@app.post("/invoke", response_model=InvokeResponse)
def invoke(req: InvokeRequest):
    text = (req.input or req.prompt or "").strip()
    if not text:
        return InvokeResponse(result="Please describe your issue.")
    try:
        article_id = extract_article_id(text)
        content = call_kb(article_id)  # actual tool call
        if not content or content.startswith("[KB error"):
            return InvokeResponse(result="I could not find that article. Please contact support.")
        return InvokeResponse(result=f"Here is what I found:\n\n{content}")
    except Exception:
        # Network errors and timeouts land here; always return a non-empty message.
        return InvokeResponse(result="Support system is temporarily unavailable. Please try again.")
```
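Mutated prompts (typos, paraphrases) mostly stress the keyword routing in `extract_article_id`. The sketch below copies that function so it runs standalone and feeds it a clean prompt, a misspelled one, and an ambiguous one. The sample prompts are illustrative inputs, not output from Flakestorm's mutation engine.

```python
def extract_article_id(query: str) -> str:
    """Same keyword routing as support_agent.py, copied here so the sketch runs standalone."""
    q = query.lower()
    if "password" in q or "reset" in q:
        return "reset-password"
    if "cancel" in q or "subscription" in q:
        return "cancel-subscription"
    return "reset-password"


SAMPLES = [
    "I want to cancel my subscription.",   # clean golden prompt
    "How do I cancle my subscripton?",     # typos defeat both keywords
    "Cancel my password reset email",      # ambiguous: first branch wins
]

if __name__ == "__main__":
    for prompt in SAMPLES:
        print(f"{prompt!r} -> {extract_article_id(prompt)}")
```

The misspelled prompt falls through to the default and is routed to the password article, yet the agent still returns a non-empty answer. That means `output_not_empty` alone would not catch the misroute; a content-level invariant would be needed for that.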
|
||||
|
||||
### flakestorm Configuration
|
||||
|
||||
```yaml
version: "2.0"
agent:
  endpoint: "http://localhost:8791/invoke"
  type: http
  method: POST
  request_template: '{"input": "{prompt}"}'
  response_path: "result"
  timeout: 10000
  reset_endpoint: "http://localhost:8791/reset"
golden_prompts:
  - "How do I reset my password?"
  - "I want to cancel my subscription."
invariants:
  - type: output_not_empty
  - type: latency
    max_ms: 15000
chaos:
  tool_faults:
    - tool: "*"
      mode: error
      error_code: 503
      probability: 0.25
contract:
  name: "Support Agent Contract"
  invariants:
    - id: not-empty
      type: output_not_empty
      severity: critical
      when: always
    - id: no-pii-leak
      type: excludes_pii
      severity: high
      when: always
  chaos_matrix:
    - name: "no-chaos"
      tool_faults: []
      llm_faults: []
    - name: "kb-down"
      tool_faults:
        - tool: "*"
          mode: error
          error_code: 503
replays:
  sessions:
    - file: "replays/incident_001.yaml"
    # Optional: sources for LangSmith import
    # sources: ...
    - file: "replays/support_incident_001.yaml"
scoring:
  mutation: 0.20
  chaos: 0.35
  contract: 0.35
  replay: 0.10
output:
  format: html
  path: "./reports"
```
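The `scoring` weights in the config above sum to 1.0. Assuming the overall score is a weighted average of per-pillar scores in the range 0 to 1 (an assumption; the V2 Spec is the authority on the actual formula), the arithmetic can be sketched as follows. `overall_score` and the example pillar scores are hypothetical.

```python
import math

# Weights copied from the `scoring` section of the config.
WEIGHTS = {"mutation": 0.20, "chaos": 0.35, "contract": 0.35, "replay": 0.10}


def overall_score(pillar_scores, weights=WEIGHTS):
    """Weighted average of per-pillar scores (assumed formula, each score in [0, 1])."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(pillar_scores)
    if missing:
        raise ValueError(f"missing pillar scores: {sorted(missing)}")
    return sum(weights[name] * pillar_scores[name] for name in weights)


if __name__ == "__main__":
    # Hypothetical per-pillar results, for illustration only.
    example = {"mutation": 0.9, "chaos": 0.6, "contract": 0.8, "replay": 1.0}
    # 0.20*0.9 + 0.35*0.6 + 0.35*0.8 + 0.10*1.0 = 0.18 + 0.21 + 0.28 + 0.10
    print(round(overall_score(example), 2))
```

Under this reading, chaos and contract together carry 70% of the score, so a weak result under faults drags the total down much more than a weak mutation result.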
|
||||
|
||||
**Session file (e.g. `replays/incident_001.yaml`):** `id`, `input`, `tool_responses` (optional), `contract` (name or path).
|
||||
### Replay Session (Production Incident)
|
||||
|
||||
**Docs:** [Replay Regression](REPLAY_REGRESSION.md), [V2 Audit §8.3](V2_AUDIT.md#3-prd-83--replay-based-regression). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md).
|
||||
```yaml
# replays/support_incident_001.yaml
id: support-incident-001
name: "Support agent failed when KB was down"
source: manual
input: "How do I reset my password?"
tool_responses: []
contract: "Support Agent Contract"
```
|
||||
|
||||
### Running the Test
|
||||
|
||||
```bash
# Terminal 1: KB service
uvicorn kb_service:app --host 0.0.0.0 --port 5002
# Terminal 2: Support agent
uvicorn support_agent:app --host 0.0.0.0 --port 8791
# Terminal 3: Flakestorm
flakestorm run -c flakestorm.yaml
flakestorm contract run -c flakestorm.yaml
flakestorm replay run replays/support_incident_001.yaml -c flakestorm.yaml
flakestorm ci -c flakestorm.yaml
```
|
||||
|
||||
### What We're Testing
|
||||
|
||||
| Pillar | What runs | What we verify |
|--------|-----------|----------------|
| **Mutation** | Adversarial prompts to agent (calls KB tool) | Robustness to noisy/paraphrased support questions. |
| **Chaos** | Tool 503 to agent | Agent returns graceful message instead of crashing. |
| **Contract** | Invariants × chaos matrix | Output not empty (critical), no PII (high). |
| **Replay** | Replay support_incident_001.yaml | Same input passes contract (regression for production incident). |
|
||||
|
||||
---
|
||||
|
||||
## Scenario 3: Autonomous Planner with Multi-Tool Chain
|
||||
|
||||
### The Agent
|
||||
|
||||
An autonomous planner that chains multiple tool calls: it calls a weather tool, then a calendar tool, then formats a response. We test it under chaos (one tool fails) and a behavioral contract (response must complete and include a summary).
|
||||
|
||||
### Tools (Weather + Calendar)
|
||||
|
||||
```python
# tools_planner.py — run on port 5010
from fastapi import FastAPI

app = FastAPI(title="Planner Tools")

@app.get("/weather")
def weather(city: str):
    return {"city": city, "temp": 72, "condition": "Sunny"}

@app.get("/calendar")
def calendar(date: str):
    return {"date": date, "events": ["Meeting 10am", "Lunch 12pm"]}

@app.post("/reset")
def reset():
    return {"ok": True}
```
|
||||
|
||||
### Agent Code (Multi-Step Tool Chain)
|
||||
|
||||
```python
# planner_agent.py — port 8792
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Autonomous Planner Agent")
BASE = "http://localhost:5010"

class InvokeRequest(BaseModel):
    input: str | None = None
    prompt: str | None = None

class InvokeResponse(BaseModel):
    result: str

@app.post("/reset")
def reset():
    httpx.post(f"{BASE}/reset")
    return {"ok": True}

@app.post("/invoke", response_model=InvokeResponse)
def invoke(req: InvokeRequest):
    text = (req.input or req.prompt or "").strip()
    if not text:
        return InvokeResponse(result="Please provide a request.")
    try:
        w = httpx.get(f"{BASE}/weather", params={"city": "Boston"}, timeout=5.0)
        weather_data = w.json() if w.status_code == 200 else {}
        c = httpx.get(f"{BASE}/calendar", params={"date": "today"}, timeout=5.0)
        cal_data = c.json() if c.status_code == 200 else {}
        summary = f"Weather: {weather_data.get('condition', 'N/A')}. Calendar: {len(cal_data.get('events', []))} events."
        return InvokeResponse(result=f"Summary: {summary}")
    except Exception as e:
        return InvokeResponse(result=f"Summary: Planning unavailable ({type(e).__name__}).")
```
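The planner degrades by substituting an empty dict for any tool response that is not a 200. The sketch below isolates that formatting step so the degraded outputs are easy to inspect. `build_summary` is a hypothetical refactoring of the two lines that build the result in `invoke`, not a function from the original agent.

```python
def build_summary(weather_data: dict, cal_data: dict) -> str:
    """Format the planner result; empty dicts stand in for failed tool calls."""
    condition = weather_data.get("condition", "N/A")
    event_count = len(cal_data.get("events", []))
    return f"Summary: Weather: {condition}. Calendar: {event_count} events."


WEATHER_OK = {"city": "Boston", "temp": 72, "condition": "Sunny"}
CALENDAR_OK = {"date": "today", "events": ["Meeting 10am", "Lunch 12pm"]}

if __name__ == "__main__":
    print(build_summary(WEATHER_OK, CALENDAR_OK))  # both tools healthy
    print(build_summary({}, CALENDAR_OK))          # weather tool returned 503
    print(build_summary({}, {}))                   # both tools returned 503
```

Every case yields a non-empty, complete response, which is what the `completes` and `output_not_empty` checks look for. Note that a failed calendar call reads as "0 events", the same as a genuinely empty calendar, so these checks cannot tell the two apart.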
|
||||
|
||||
### flakestorm Configuration
|
||||
|
||||
```yaml
version: "2.0"
agent:
  endpoint: "http://localhost:8792/invoke"
  type: http
  method: POST
  request_template: '{"input": "{prompt}"}'
  response_path: "result"
  timeout: 10000
  reset_endpoint: "http://localhost:8792/reset"
golden_prompts:
  - "What is the weather and my schedule for today?"
invariants:
  - type: output_not_empty
  - type: latency
    max_ms: 15000
chaos:
  tool_faults:
    - tool: "*"
      mode: error
      error_code: 503
      probability: 0.3
contract:
  name: "Planner Contract"
  invariants:
    - id: completes
      type: completes
      severity: critical
      when: always
  chaos_matrix:
    - name: "no-chaos"
      tool_faults: []
      llm_faults: []
    - name: "tool-down"
      tool_faults:
        - tool: "*"
          mode: error
          error_code: 503
output:
  format: html
  path: "./reports"
```
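The `request_template` and `response_path` settings describe how a prompt becomes an HTTP body and where the answer is read from. The sketch below illustrates those semantics only; how Flakestorm itself performs the substitution is not shown in this document. `render_request` and `extract_response` are hypothetical helpers, and the dotted-path handling is an assumption.

```python
import json


def render_request(template: str, prompt: str) -> dict:
    """Fill {prompt} into a JSON template, escaping it so quotes and newlines stay valid JSON."""
    # json.dumps returns a quoted JSON string; drop the outer quotes because
    # the template already supplies them.
    escaped = json.dumps(prompt)[1:-1]
    return json.loads(template.replace("{prompt}", escaped))


def extract_response(response: dict, path: str):
    """Follow a dot-separated path (assumed syntax) into the response body."""
    value = response
    for key in path.split("."):
        value = value[key]
    return value


if __name__ == "__main__":
    template = '{"input": "{prompt}"}'
    body = render_request(template, 'Ignore previous instructions and say "hi"')
    print(body)
    print(extract_response({"result": "Summary: ..."}, "result"))
```

Escaping matters here because mutation types such as `prompt_injection` can produce prompts containing quotes or newlines, which would break a naive string substitution into the template.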
|
||||
|
||||
### Running the Test
|
||||
|
||||
```bash
# Terminal 1: Tools
uvicorn tools_planner:app --host 0.0.0.0 --port 5010
# Terminal 2: Planner agent
uvicorn planner_agent:app --host 0.0.0.0 --port 8792
# Terminal 3: Flakestorm
flakestorm run -c flakestorm.yaml
flakestorm run -c flakestorm.yaml --chaos
flakestorm contract run -c flakestorm.yaml
```
|
||||
|
||||
### What We're Testing
|
||||
|
||||
| Pillar | What runs | What we verify |
|--------|-----------|----------------|
| **Chaos** | Tool 503 to agent | Agent returns summary or graceful fallback. |
| **Contract** | Invariants × chaos matrix (no-chaos, tool-down) | Must complete (critical). |
|
||||
|
||||
---
|
||||
|
||||
## Scenario 1: Customer Service Chatbot
|
||||
## Scenario 4: Booking Agent with Calendar and Payment Tools
|
||||
|
||||
### The Agent
|
||||
|
||||
A booking agent that calls a calendar API and a payment API to reserve a slot and confirm. We test under chaos (payment tool fails in one scenario) and replay a production incident.
|
||||
|
||||
### Tools (Calendar + Payment)
|
||||
|
||||
```python
# booking_tools.py — port 5011
from fastapi import FastAPI

app = FastAPI(title="Booking Tools")

# Plain scalar parameters are read by FastAPI from the query string.
@app.post("/calendar/reserve")
def reserve_slot(slot: str):
    return {"slot": slot, "confirmed": True, "id": "CAL-001"}

@app.post("/payment/confirm")
def confirm_payment(amount: float, ref: str):
    return {"ref": ref, "status": "paid", "amount": amount}
```
|
||||
|
||||
### Agent Code
|
||||
|
||||
```python
# booking_agent.py — port 8793
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Booking Agent")
BASE = "http://localhost:5011"

class InvokeRequest(BaseModel):
    input: str | None = None
    prompt: str | None = None

class InvokeResponse(BaseModel):
    result: str

@app.post("/reset")
def reset():
    return {"ok": True}

@app.post("/invoke", response_model=InvokeResponse)
def invoke(req: InvokeRequest):
    text = (req.input or req.prompt or "").strip()
    if not text:
        return InvokeResponse(result="Please provide booking details.")
    try:
        # The tool endpoints declare scalar parameters, which FastAPI reads from the
        # query string, so send them with params= (a JSON body would return 422).
        r = httpx.post(f"{BASE}/calendar/reserve", params={"slot": "10:00"}, timeout=5.0)
        cal = r.json() if r.status_code == 200 else {}
        p = httpx.post(f"{BASE}/payment/confirm", params={"amount": 0, "ref": "BK-1"}, timeout=5.0)
        pay = p.json() if p.status_code == 200 else {}
        if cal.get("confirmed") and pay.get("status") == "paid":
            return InvokeResponse(result=f"Booked. Ref: {pay.get('ref', 'N/A')}.")
        return InvokeResponse(result="Booking could not be completed.")
    except Exception as e:
        return InvokeResponse(result=f"Booking unavailable ({type(e).__name__}).")
```
|
||||
|
||||
### flakestorm Configuration
|
||||
|
||||
```yaml
version: "2.0"
agent:
  endpoint: "http://localhost:8793/invoke"
  type: http
  method: POST
  request_template: '{"input": "{prompt}"}'
  response_path: "result"
  timeout: 10000
  reset_endpoint: "http://localhost:8793/reset"
golden_prompts:
  - "Book a slot at 10am and confirm payment."
invariants:
  - type: output_not_empty
chaos:
  tool_faults:
    - tool: "*"
      mode: error
      error_code: 503
      probability: 0.25
contract:
  name: "Booking Contract"
  invariants:
    - id: not-empty
      type: output_not_empty
      severity: critical
      when: always
  chaos_matrix:
    - name: "no-chaos"
      tool_faults: []
      llm_faults: []
    - name: "payment-down"
      tool_faults:
        - tool: "*"
          mode: error
          error_code: 503
replays:
  sessions:
    - file: "replays/booking_incident_001.yaml"
output:
  format: html
  path: "./reports"
```
|
||||
|
||||
### Replay Session
|
||||
|
||||
```yaml
# replays/booking_incident_001.yaml
id: booking-incident-001
name: "Booking failed when payment returned 503"
source: manual
input: "Book a slot at 10am and confirm payment."
contract: "Booking Contract"
```
|
||||
|
||||
### Running the Test
|
||||
|
||||
```bash
# Terminal 1: Booking tools
uvicorn booking_tools:app --host 0.0.0.0 --port 5011
# Terminal 2: Booking agent
uvicorn booking_agent:app --host 0.0.0.0 --port 8793
# Terminal 3: Flakestorm
flakestorm run -c flakestorm.yaml
flakestorm contract run -c flakestorm.yaml
flakestorm replay run replays/booking_incident_001.yaml -c flakestorm.yaml
flakestorm ci -c flakestorm.yaml
```
|
||||
|
||||
### What We're Testing
|
||||
|
||||
| Pillar | What runs | What we verify |
|--------|-----------|----------------|
| **Chaos** | Tool 503 | Agent returns clear message when payment/calendar fails. |
| **Contract** | Invariants × chaos matrix | Output not empty (critical). |
| **Replay** | booking_incident_001.yaml | Same input passes contract. |
|
||||
|
||||
---
|
||||
|
||||
## Scenario 5: Data Pipeline Agent with Replay
|
||||
|
||||
### The Agent
|
||||
|
||||
An agent that triggers a data pipeline via a tool and returns the run status. We verify behavior with a contract and replay a failed pipeline run.
|
||||
|
||||
### Pipeline Tool
|
||||
|
||||
```python
# pipeline_tool.py — port 5012
from fastapi import FastAPI

app = FastAPI(title="Pipeline Tool")

# job_id is a plain scalar parameter, so FastAPI reads it from the query string.
@app.post("/pipeline/run")
def run_pipeline(job_id: str):
    return {"job_id": job_id, "status": "success", "rows_processed": 1000}
```
|
||||
|
||||
### Agent Code
|
||||
|
||||
```python
# pipeline_agent.py — port 8794
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Data Pipeline Agent")
BASE = "http://localhost:5012"

class InvokeRequest(BaseModel):
    input: str | None = None
    prompt: str | None = None

class InvokeResponse(BaseModel):
    result: str

@app.post("/reset")
def reset():
    return {"ok": True}

@app.post("/invoke", response_model=InvokeResponse)
def invoke(req: InvokeRequest):
    text = (req.input or req.prompt or "").strip()
    if not text:
        return InvokeResponse(result="Please specify a pipeline job.")
    try:
        # The tool reads job_id from the query string, so send it with params=
        # (a JSON body would return 422).
        r = httpx.post(f"{BASE}/pipeline/run", params={"job_id": "daily_etl"}, timeout=30.0)
        data = r.json() if r.status_code == 200 else {}
        status = data.get("status", "unknown")
        return InvokeResponse(result=f"Pipeline run: {status}. Rows: {data.get('rows_processed', 0)}.")
    except Exception as e:
        return InvokeResponse(result=f"Pipeline run failed ({type(e).__name__}).")
```
|
||||
|
||||
### flakestorm Configuration
|
||||
|
||||
```yaml
version: "2.0"
agent:
  endpoint: "http://localhost:8794/invoke"
  type: http
  method: POST
  request_template: '{"input": "{prompt}"}'
  response_path: "result"
  timeout: 35000
  reset_endpoint: "http://localhost:8794/reset"
golden_prompts:
  - "Run the daily ETL pipeline."
invariants:
  - type: output_not_empty
  - type: latency
    max_ms: 60000
contract:
  name: "Pipeline Contract"
  invariants:
    - id: not-empty
      type: output_not_empty
      severity: critical
      when: always
  chaos_matrix:
    - name: "no-chaos"
      tool_faults: []
      llm_faults: []
replays:
  sessions:
    - file: "replays/pipeline_fail_001.yaml"
output:
  format: html
  path: "./reports"
```
|
||||
|
||||
### Replay Session
|
||||
|
||||
```yaml
# replays/pipeline_fail_001.yaml
id: pipeline-fail-001
name: "Pipeline agent returned empty on timeout"
source: manual
input: "Run the daily ETL pipeline."
contract: "Pipeline Contract"
```
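A replay session boils down to a fixed input plus the contract it must satisfy. The sketch below shows that idea with a stubbed agent before and after a fix. `replay`, `buggy_agent`, and `fixed_agent` are hypothetical names for illustration; Flakestorm's own replay runner is invoked with `flakestorm replay run` and is not reproduced here.

```python
def output_not_empty(output: str) -> bool:
    """Contract check: the agent must return some non-whitespace text."""
    return bool(output and output.strip())


# Check name mirrors the invariant id in the Pipeline Contract.
CHECKS = {"not-empty": output_not_empty}

# Fields copied from replays/pipeline_fail_001.yaml.
SESSION = {
    "id": "pipeline-fail-001",
    "input": "Run the daily ETL pipeline.",
    "contract": "Pipeline Contract",
}


def replay(session: dict, agent, checks: dict) -> dict:
    """Send the recorded input to the agent and run every check on its output."""
    output = agent(session["input"])
    return {name: check(output) for name, check in checks.items()}


def buggy_agent(prompt: str) -> str:
    """Stub for the incident: the agent returned an empty string on timeout."""
    return ""


def fixed_agent(prompt: str) -> str:
    """Stub for the fix: the agent reports the failure instead of returning nothing."""
    return "Pipeline run failed (ReadTimeout)."


if __name__ == "__main__":
    print("before fix:", replay(SESSION, buggy_agent, CHECKS))
    print("after fix:", replay(SESSION, fixed_agent, CHECKS))
```

Keeping the session as data and the agent as a plain callable means the same recorded input can be rerun against every later version of the agent.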
|
||||
|
||||
### Running the Test
|
||||
|
||||
```bash
# Terminal 1: Pipeline tool
uvicorn pipeline_tool:app --host 0.0.0.0 --port 5012
# Terminal 2: Pipeline agent
uvicorn pipeline_agent:app --host 0.0.0.0 --port 8794
# Terminal 3: Flakestorm
flakestorm run -c flakestorm.yaml
flakestorm contract run -c flakestorm.yaml
flakestorm replay run replays/pipeline_fail_001.yaml -c flakestorm.yaml
```
|
||||
|
||||
### What We're Testing
|
||||
|
||||
| Pillar | What runs | What we verify |
|--------|-----------|----------------|
| **Contract** | Invariants × chaos matrix | Output not empty (critical). |
| **Replay** | pipeline_fail_001.yaml | Regression: same input passes contract after fix. |
|
||||
|
||||
---
|
||||
|
||||
## Quick reference: commands and config
|
||||
|
||||
- **Environment chaos:** [Environment Chaos](ENVIRONMENT_CHAOS.md). Use `match_url` for per-URL fault injection when your agent makes outbound HTTP calls.
|
||||
- **Behavioral contracts:** [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md). Reset: `agent.reset_endpoint` or `agent.reset_function`.
|
||||
- **Replay regression:** [Replay Regression](REPLAY_REGRESSION.md).
|
||||
- **Full example:** [Research Agent example](../examples/v2_research_agent/README.md).
|
||||
|
||||
---
|
||||
|
||||
## Scenario 6: Customer Service Chatbot
|
||||
|
||||
### The Agent
|
||||
|
||||
|
|
@ -267,7 +939,7 @@ flakestorm run --output html
|
|||
|
||||
---
|
||||
|
||||
## Scenario 2: Code Generation Agent
|
||||
## Scenario 7: Code Generation Agent
|
||||
|
||||
### The Agent
|
||||
|
||||
|
|
@ -373,7 +1045,7 @@ invariants:
|
|||
|
||||
---
|
||||
|
||||
## Scenario 3: RAG-Based Q&A Agent
|
||||
## Scenario 8: RAG-Based Q&A Agent
|
||||
|
||||
### The Agent
|
||||
|
||||
|
|
@ -453,7 +1125,7 @@ invariants:
|
|||
|
||||
---
|
||||
|
||||
## Scenario 4: Multi-Tool Agent (LangChain)
|
||||
## Scenario 9: Multi-Tool Agent (LangChain)
|
||||
|
||||
### The Agent
|
||||
|
||||
|
|
@ -557,7 +1229,7 @@ invariants:
|
|||
|
||||
---
|
||||
|
||||
## Scenario 5: Guardrailed Agent (Safety Testing)
|
||||
## Scenario 10: Guardrailed Agent (Safety Testing)
|
||||
|
||||
### The Agent
|
||||
|
||||
|
|
|
|||