Update TEST_SCENARIOS.md to include detailed descriptions of V2 scenarios, specifically focusing on the Research Agent with Search Tool and Support Agent with KB Tool. Enhanced documentation with actual tool calling examples, chaos and contract testing details, and configuration settings for improved clarity and usability.

This commit is contained in:
Entropix 2026-03-09 13:41:41 +08:00
parent 11489255e3
commit f1570628c3

# Real-World Test Scenarios
This document provides concrete, real-world examples of testing AI agents with flakestorm: environment chaos (tool/LLM faults), behavioral contracts (invariants × chaos matrix), replay regression, and adversarial mutations. Each scenario includes setup, config, and commands where applicable. Flakestorm supports **24 mutation types** and **max 50 mutations per run** in OSS. See [Configuration Guide](CONFIGURATION_GUIDE.md), [Spec](V2_SPEC.md), and [Audit](V2_AUDIT.md).
---
## Table of Contents
### Scenarios with tool calling, chaos, contract, and replay
1. [Research Agent with Search Tool](#scenario-1-research-agent-with-search-tool) — Search tool + LLM; chaos + contract
2. [Support Agent with KB Tool and Replay](#scenario-2-support-agent-with-kb-tool-and-replay) — KB tool; chaos + contract + replay
3. [Autonomous Planner with Multi-Tool Chain](#scenario-3-autonomous-planner-with-multi-tool-chain) — Multi-step agent (weather + calendar); chaos + contract
4. [Booking Agent with Calendar and Payment Tools](#scenario-4-booking-agent-with-calendar-and-payment-tools) — Two tools; chaos matrix + replay
5. [Data Pipeline Agent with Replay](#scenario-5-data-pipeline-agent-with-replay) — Pipeline tool; contract + replay regression
6. [Quick reference](#quick-reference-commands-and-config)
### Additional scenarios (agent + config examples)
7. [Customer Service Chatbot](#scenario-6-customer-service-chatbot)
8. [Code Generation Agent](#scenario-7-code-generation-agent)
9. [RAG-Based Q&A Agent](#scenario-8-rag-based-qa-agent)
10. [Multi-Tool Agent (LangChain)](#scenario-9-multi-tool-agent-langchain)
11. [Guardrailed Agent (Safety Testing)](#scenario-10-guardrailed-agent-safety-testing)
12. [Integration Guide](#integration-guide)
---
## Scenario 1: Research Agent with Search Tool
### The Agent
A research assistant that **actually calls a search tool** over HTTP, then sends the query and search results to an LLM. We test it under environment chaos (tool/LLM faults) and a behavioral contract (must cite source, must complete).
### Search Tool (Actual HTTP Service)
The agent calls this service to fetch search results. For a single-endpoint HTTP agent, `tool: "*"` injects faults into requests to the agent itself; use `match_url` instead when you want to fault the agent's outbound calls (see [Environment Chaos](ENVIRONMENT_CHAOS.md)).
```python
# search_service.py — run on port 5001
from fastapi import FastAPI

app = FastAPI(title="Search Tool")

@app.get("/search")
def search(q: str):
    """Simulated search API. In production this might call a real search engine."""
    results = [
        {"title": "Wikipedia: " + q, "snippet": "According to Wikipedia, " + q + " is a topic."},
        {"title": "Source A", "snippet": "Per Source A, " + q + " has been documented."},
    ]
    return {"query": q, "results": results}
```
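The `match_url` form mentioned above could target this service directly. A sketch (the exact fault schema may differ from your flakestorm version; the URL pattern is our assumption for this scenario — see [Environment Chaos](ENVIRONMENT_CHAOS.md)):

```yaml
chaos:
  tool_faults:
    - match_url: "http://localhost:5001/search*"  # fault only the agent's outbound search calls
      mode: error
      error_code: 503
      probability: 0.25
```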
### Agent Code (Actual Tool Calling)
The agent receives the user query, **calls the search tool** via HTTP, then calls the LLM with the query and results.
```python
# research_agent.py — run on port 8790
import os

import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Research Agent with Search Tool")

SEARCH_URL = os.environ.get("SEARCH_URL", "http://localhost:5001/search")
OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434/api/generate")
MODEL = os.environ.get("OLLAMA_MODEL", "gemma3:1b")

class InvokeRequest(BaseModel):
    input: str | None = None
    prompt: str | None = None

class InvokeResponse(BaseModel):
    result: str

def call_search(query: str) -> str:
    """Actual tool call: HTTP GET to the search service."""
    r = httpx.get(SEARCH_URL, params={"q": query}, timeout=10.0)
    r.raise_for_status()
    data = r.json()
    snippets = [x.get("snippet", "") for x in data.get("results", [])[:3]]
    return "\n".join(snippets) if snippets else "No results found."

def call_llm(user_query: str, search_context: str) -> str:
    """Call the LLM with the user query and tool output."""
    prompt = f"""You are a research assistant. Use the following search results to answer. Always cite the source.
Search results:
{search_context}
User question: {user_query}
Answer (2-4 sentences, must cite source):"""
    r = httpx.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=60.0,
    )
    r.raise_for_status()
    return (r.json().get("response") or "").strip()

@app.post("/reset")
def reset():
    return {"ok": True}

@app.post("/invoke", response_model=InvokeResponse)
def invoke(req: InvokeRequest):
    text = (req.input or req.prompt or "").strip()
    if not text:
        return InvokeResponse(result="Please ask a question.")
    try:
        search_context = call_search(text)  # actual tool call
        answer = call_llm(text, search_context)
        return InvokeResponse(result=answer)
    except Exception:
        return InvokeResponse(
            result="According to [system], the search or model failed. Please try again."
        )
```
### flakestorm Configuration
```yaml
version: "2.0"
agent:
  endpoint: "http://localhost:8790/invoke"
  type: http
  method: POST
  request_template: '{"input": "{prompt}"}'
  response_path: "result"
  timeout: 15000
  reset_endpoint: "http://localhost:8790/reset"
model:
  provider: ollama
  name: gemma3:1b
  base_url: "http://localhost:11434"
golden_prompts:
  - "What is the capital of France?"
  - "Summarize the benefits of renewable energy."
mutations:
  count: 5
  types: [paraphrase, noise, prompt_injection]
invariants:
  - type: latency
    max_ms: 30000
  - type: output_not_empty
chaos:
  tool_faults:
    - tool: "*"
      mode: error
      error_code: 503
  llm_faults:
    - mode: truncated_response
      max_tokens: 5
      probability: 0.2
```
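The `request_template` / `response_path` pair works roughly as sketched below: `{prompt}` is replaced with the (possibly mutated) prompt, and `response_path` selects the answer field from the agent's JSON reply. This is a simplification of what flakestorm does internally; the real substitution presumably also handles JSON escaping of the prompt text.

```python
import json

# {prompt} is substituted into the template to form the request body.
request_template = '{"input": "{prompt}"}'
body = request_template.replace("{prompt}", "What is the capital of France?")
payload = json.loads(body)
print(payload["input"])  # What is the capital of France?

# response_path: "result" then picks the answer out of the agent's reply.
reply = {"result": "Paris, according to Wikipedia."}
print(reply["result"])  # Paris, according to Wikipedia.
```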
### Behavioral Contract Configuration
The same config also defines a behavioral contract: named invariants with severities are checked under every scenario in `chaos_matrix`, and each (invariant × scenario) cell runs independently. The optional `agent.reset_endpoint` (or `agent.reset_function`) resets agent state between cells. Excerpt:
```yaml
version: "2.0"
agent:
  reset_endpoint: "http://localhost:8790/reset"
contract:
  name: "Research Agent Contract"
  invariants:
    - id: must-cite-source
      type: regex
      pattern: "(?i)(source|according to|per )"
      severity: critical
    - id: max-latency
      type: latency
      max_ms: 60000
      severity: medium
      when: always
    - id: completes
      type: completes
      severity: high
      when: always
  chaos_matrix:
    - name: "no-chaos"
      tool_faults: []
      llm_faults: []
    - name: "api-outage"
      tool_faults:
        - tool: "*"
          mode: error
          error_code: 503
output:
  format: html
  path: "./reports"
```
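The `must-cite-source` pattern can be sanity-checked locally with plain `re` before a full contract run (the sample outputs here are ours, not produced by flakestorm):

```python
import re

# Same pattern as the must-cite-source invariant above.
CITE = re.compile(r"(?i)(source|according to|per )")

assert CITE.search("According to Wikipedia, Paris is the capital of France.")
assert CITE.search("Per Source A, this has been documented.")
assert not CITE.search("Paris is the capital of France.")  # would fail the invariant
print("pattern behaves as expected")
```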
### Running the Test
```bash
# Terminal 1: Search tool
uvicorn search_service:app --host 0.0.0.0 --port 5001
# Terminal 2: Agent (requires Ollama with gemma3:1b)
uvicorn research_agent:app --host 0.0.0.0 --port 8790
# Terminal 3: Flakestorm
flakestorm run -c flakestorm.yaml
flakestorm run -c flakestorm.yaml --chaos
flakestorm contract run -c flakestorm.yaml
flakestorm ci -c flakestorm.yaml --min-score 0.5
```
### What We're Testing
| Pillar | What runs | What we verify |
|--------|-----------|----------------|
| **Mutation** | Adversarial prompts to agent (calls search + LLM) | Robustness to typos, paraphrases, injection. |
| **Chaos** | Tool 503 to agent, LLM truncated | Agent degrades gracefully (fallback, cites source when possible). |
| **Contract** | Contract × chaos matrix (no-chaos, api-outage) | Must cite source (critical), must complete (high); auto-FAIL if a critical invariant fails. |
---
## Scenario 2: Support Agent with KB Tool and Replay
### The Agent
A customer support agent that **actually calls a knowledge-base (KB) tool** to fetch articles, then answers the user. We add a **replay session** from a production incident to verify the fix.
### KB Tool (Actual HTTP Service)
```python
# kb_service.py — run on port 5002
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI(title="KB Tool")

ARTICLES = {
    "reset-password": "To reset your password: go to Account > Security > Reset password. You will receive an email with a link.",
    "cancel-subscription": "To cancel: Account > Billing > Cancel subscription. Refunds apply within 14 days.",
}

@app.get("/kb/article")
def get_article(article_id: str):
    """Actual tool: fetch KB article by ID."""
    if article_id not in ARTICLES:
        return JSONResponse(status_code=404, content={"error": "Article not found"})
    return {"article_id": article_id, "content": ARTICLES[article_id]}
```
### Agent Code (Actual Tool Calling)
The agent parses the user question, **calls the KB tool** to get the article, then formats a response.
```python
# support_agent.py — run on port 8791
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Support Agent with KB Tool")

KB_URL = "http://localhost:5002/kb/article"

class InvokeRequest(BaseModel):
    input: str | None = None
    prompt: str | None = None

class InvokeResponse(BaseModel):
    result: str

def extract_article_id(query: str) -> str:
    """Naive keyword routing from the user question to a KB article ID."""
    q = query.lower()
    if "password" in q or "reset" in q:
        return "reset-password"
    if "cancel" in q or "subscription" in q:
        return "cancel-subscription"
    return "reset-password"

def call_kb(article_id: str) -> str:
    """Actual tool call: HTTP GET to the KB service."""
    r = httpx.get(KB_URL, params={"article_id": article_id}, timeout=5.0)
    if r.status_code != 200:
        return f"[KB error: {r.status_code}]"
    return r.json().get("content", "")

@app.post("/reset")
def reset():
    return {"ok": True}

@app.post("/invoke", response_model=InvokeResponse)
def invoke(req: InvokeRequest):
    text = (req.input or req.prompt or "").strip()
    if not text:
        return InvokeResponse(result="Please describe your issue.")
    try:
        article_id = extract_article_id(text)
        content = call_kb(article_id)  # actual tool call
        if not content or content.startswith("[KB error"):
            return InvokeResponse(result="I could not find that article. Please contact support.")
        return InvokeResponse(result=f"Here is what I found:\n\n{content}")
    except Exception:
        return InvokeResponse(result="Support system is temporarily unavailable. Please try again.")
```
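The keyword routing is trivial to check in isolation; note the fallback to `reset-password` for off-topic input, which adversarial mutations will exercise (a standalone copy of the routing above):

```python
def extract_article_id(query: str) -> str:
    # Same routing as support_agent.py: keywords -> KB article ID.
    q = query.lower()
    if "password" in q or "reset" in q:
        return "reset-password"
    if "cancel" in q or "subscription" in q:
        return "cancel-subscription"
    return "reset-password"

assert extract_article_id("How do I reset my password?") == "reset-password"
assert extract_article_id("I want to cancel my subscription.") == "cancel-subscription"
assert extract_article_id("hello??") == "reset-password"  # fallback
```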
### flakestorm Configuration
```yaml
version: "2.0"
agent:
  endpoint: "http://localhost:8791/invoke"
  type: http
  method: POST
  request_template: '{"input": "{prompt}"}'
  response_path: "result"
  timeout: 10000
  reset_endpoint: "http://localhost:8791/reset"
golden_prompts:
  - "How do I reset my password?"
  - "I want to cancel my subscription."
invariants:
  - type: output_not_empty
  - type: latency
    max_ms: 15000
chaos:
  tool_faults:
    - tool: "*"
      mode: error
      error_code: 503
      probability: 0.25
contract:
  name: "Support Agent Contract"
  invariants:
    - id: not-empty
      type: output_not_empty
      severity: critical
      when: always
    - id: no-pii-leak
      type: excludes_pii
      severity: high
      when: always
  chaos_matrix:
    - name: "no-chaos"
      tool_faults: []
      llm_faults: []
    - name: "kb-down"
      tool_faults:
        - tool: "*"
          mode: error
          error_code: 503
replays:
  sessions:
    - file: "replays/support_incident_001.yaml"
scoring:
  mutation: 0.20
  chaos: 0.35
  contract: 0.35
  replay: 0.10
output:
  format: html
  path: "./reports"
```
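For intuition, a naive version of an `excludes_pii` check might look like the snippet below. This regex is illustrative only and is our assumption; flakestorm's actual PII detector is not shown in this document.

```python
import re

# Very rough email/phone detector, for intuition only.
PII = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+|\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def leaks_pii(text: str) -> bool:
    return PII.search(text) is not None

assert not leaks_pii("To reset your password: go to Account > Security.")
assert leaks_pii("Contact jane.doe@example.com for help.")
assert leaks_pii("Call 555-123-4567.")
```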
### Replay Session (Production Incident)
A session file pins the `id`, the `input`, optional `tool_responses`, and the `contract` (a name or a path) to verify the replayed output against:
```yaml
# replays/support_incident_001.yaml
id: support-incident-001
name: "Support agent failed when KB was down"
source: manual
input: "How do I reset my password?"
tool_responses: []
contract: "Support Agent Contract"
```
### Running the Test
```bash
# Terminal 1: KB service
uvicorn kb_service:app --host 0.0.0.0 --port 5002
# Terminal 2: Support agent
uvicorn support_agent:app --host 0.0.0.0 --port 8791
# Terminal 3: Flakestorm
flakestorm run -c flakestorm.yaml
flakestorm contract run -c flakestorm.yaml
flakestorm replay run replays/support_incident_001.yaml -c flakestorm.yaml
flakestorm ci -c flakestorm.yaml
```
### What We're Testing
| Pillar | What runs | What we verify |
|--------|-----------|----------------|
| **Mutation** | Adversarial prompts to agent (calls KB tool) | Robustness to noisy/paraphrased support questions. |
| **Chaos** | Tool 503 to agent | Agent returns graceful message instead of crashing. |
| **Contract** | Invariants × chaos matrix | Output not empty (critical), no PII (high). |
| **Replay** | Replay support_incident_001.yaml | Same input passes contract (regression for production incident). |
---
## Scenario 3: Autonomous Planner with Multi-Tool Chain
### The Agent
An autonomous planner that chains multiple tool calls: it calls a weather tool, then a calendar tool, then formats a response. We test it under chaos (one tool fails) and a behavioral contract (response must complete and include a summary).
### Tools (Weather + Calendar)
```python
# tools_planner.py — run on port 5010
from fastapi import FastAPI

app = FastAPI(title="Planner Tools")

@app.get("/weather")
def weather(city: str):
    return {"city": city, "temp": 72, "condition": "Sunny"}

@app.get("/calendar")
def calendar(date: str):
    return {"date": date, "events": ["Meeting 10am", "Lunch 12pm"]}

@app.post("/reset")
def reset():
    return {"ok": True}
```
### Agent Code (Multi-Step Tool Chain)
```python
# planner_agent.py — port 8792
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Autonomous Planner Agent")

BASE = "http://localhost:5010"

class InvokeRequest(BaseModel):
    input: str | None = None
    prompt: str | None = None

class InvokeResponse(BaseModel):
    result: str

@app.post("/reset")
def reset():
    httpx.post(f"{BASE}/reset", timeout=5.0)
    return {"ok": True}

@app.post("/invoke", response_model=InvokeResponse)
def invoke(req: InvokeRequest):
    text = (req.input or req.prompt or "").strip()
    if not text:
        return InvokeResponse(result="Please provide a request.")
    try:
        # Step 1: weather tool
        w = httpx.get(f"{BASE}/weather", params={"city": "Boston"}, timeout=5.0)
        weather_data = w.json() if w.status_code == 200 else {}
        # Step 2: calendar tool
        c = httpx.get(f"{BASE}/calendar", params={"date": "today"}, timeout=5.0)
        cal_data = c.json() if c.status_code == 200 else {}
        summary = f"Weather: {weather_data.get('condition', 'N/A')}. Calendar: {len(cal_data.get('events', []))} events."
        return InvokeResponse(result=f"Summary: {summary}")
    except Exception as e:
        return InvokeResponse(result=f"Summary: Planning unavailable ({type(e).__name__}).")
```
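It's worth confirming what the degraded path produces when both tool calls return non-200 responses, since the contract in this scenario only requires a well-formed completion:

```python
# Simulate both tool calls failing (non-200), as under a tool-down chaos scenario.
weather_data: dict = {}
cal_data: dict = {}
summary = f"Weather: {weather_data.get('condition', 'N/A')}. Calendar: {len(cal_data.get('events', []))} events."
print(f"Summary: {summary}")  # Summary: Weather: N/A. Calendar: 0 events.
```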
### flakestorm Configuration
```yaml
version: "2.0"
agent:
  endpoint: "http://localhost:8792/invoke"
  type: http
  method: POST
  request_template: '{"input": "{prompt}"}'
  response_path: "result"
  timeout: 10000
  reset_endpoint: "http://localhost:8792/reset"
golden_prompts:
  - "What is the weather and my schedule for today?"
invariants:
  - type: output_not_empty
  - type: latency
    max_ms: 15000
chaos:
  tool_faults:
    - tool: "*"
      mode: error
      error_code: 503
      probability: 0.3
contract:
  name: "Planner Contract"
  invariants:
    - id: completes
      type: completes
      severity: critical
      when: always
  chaos_matrix:
    - name: "no-chaos"
      tool_faults: []
      llm_faults: []
    - name: "tool-down"
      tool_faults:
        - tool: "*"
          mode: error
          error_code: 503
output:
  format: html
  path: "./reports"
```
### Running the Test
```bash
uvicorn tools_planner:app --host 0.0.0.0 --port 5010
uvicorn planner_agent:app --host 0.0.0.0 --port 8792
flakestorm run -c flakestorm.yaml
flakestorm run -c flakestorm.yaml --chaos
flakestorm contract run -c flakestorm.yaml
```
### What We're Testing
| Pillar | What runs | What we verify |
|--------|-----------|----------------|
| **Chaos** | Tool 503 to agent | Agent returns summary or graceful fallback. |
| **Contract** | Invariants × chaos matrix (no-chaos, tool-down) | Must complete (critical). |
---
## Scenario 4: Booking Agent with Calendar and Payment Tools
### The Agent
A booking agent that calls a calendar API and a payment API to reserve a slot and confirm. We test under chaos (payment tool fails in one scenario) and replay a production incident.
### Tools (Calendar + Payment)
```python
# booking_tools.py — port 5011
from fastapi import FastAPI

app = FastAPI(title="Booking Tools")

@app.post("/calendar/reserve")
def reserve_slot(slot: str):
    return {"slot": slot, "confirmed": True, "id": "CAL-001"}

@app.post("/payment/confirm")
def confirm_payment(amount: float, ref: str):
    return {"ref": ref, "status": "paid", "amount": amount}
```
### Agent Code
```python
# booking_agent.py — port 8793
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Booking Agent")

BASE = "http://localhost:5011"

class InvokeRequest(BaseModel):
    input: str | None = None
    prompt: str | None = None

class InvokeResponse(BaseModel):
    result: str

@app.post("/reset")
def reset():
    return {"ok": True}

@app.post("/invoke", response_model=InvokeResponse)
def invoke(req: InvokeRequest):
    text = (req.input or req.prompt or "").strip()
    if not text:
        return InvokeResponse(result="Please provide booking details.")
    try:
        # The tool endpoints take simple query parameters, so send params= (json= would 422).
        r = httpx.post(f"{BASE}/calendar/reserve", params={"slot": "10:00"}, timeout=5.0)
        cal = r.json() if r.status_code == 200 else {}
        p = httpx.post(f"{BASE}/payment/confirm", params={"amount": 0, "ref": "BK-1"}, timeout=5.0)
        pay = p.json() if p.status_code == 200 else {}
        if cal.get("confirmed") and pay.get("status") == "paid":
            return InvokeResponse(result=f"Booked. Ref: {pay.get('ref', 'N/A')}.")
        return InvokeResponse(result="Booking could not be completed.")
    except Exception as e:
        return InvokeResponse(result=f"Booking unavailable ({type(e).__name__}).")
```
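Under the payment-down scenario only the payment call fails; the confirmation check then falls through to the non-committal message. A quick trace of that branch:

```python
# Calendar succeeded; payment returned 503, so pay stays an empty dict.
cal = {"slot": "10:00", "confirmed": True, "id": "CAL-001"}
pay: dict = {}

if cal.get("confirmed") and pay.get("status") == "paid":
    result = f"Booked. Ref: {pay.get('ref', 'N/A')}."
else:
    result = "Booking could not be completed."
print(result)  # Booking could not be completed.
```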
### flakestorm Configuration
```yaml
version: "2.0"
agent:
  endpoint: "http://localhost:8793/invoke"
  type: http
  method: POST
  request_template: '{"input": "{prompt}"}'
  response_path: "result"
  timeout: 10000
  reset_endpoint: "http://localhost:8793/reset"
golden_prompts:
  - "Book a slot at 10am and confirm payment."
invariants:
  - type: output_not_empty
chaos:
  tool_faults:
    - tool: "*"
      mode: error
      error_code: 503
      probability: 0.25
contract:
  name: "Booking Contract"
  invariants:
    - id: not-empty
      type: output_not_empty
      severity: critical
      when: always
  chaos_matrix:
    - name: "no-chaos"
      tool_faults: []
      llm_faults: []
    - name: "payment-down"
      tool_faults:
        - tool: "*"
          mode: error
          error_code: 503
replays:
  sessions:
    - file: "replays/booking_incident_001.yaml"
output:
  format: html
  path: "./reports"
```
### Replay Session
```yaml
# replays/booking_incident_001.yaml
id: booking-incident-001
name: "Booking failed when payment returned 503"
source: manual
input: "Book a slot at 10am and confirm payment."
contract: "Booking Contract"
```
### Running the Test
```bash
uvicorn booking_tools:app --host 0.0.0.0 --port 5011
uvicorn booking_agent:app --host 0.0.0.0 --port 8793
flakestorm run -c flakestorm.yaml
flakestorm contract run -c flakestorm.yaml
flakestorm replay run replays/booking_incident_001.yaml -c flakestorm.yaml
flakestorm ci -c flakestorm.yaml
```
### What We're Testing
| Pillar | What runs | What we verify |
|--------|-----------|----------------|
| **Chaos** | Tool 503 | Agent returns clear message when payment/calendar fails. |
| **Contract** | Invariants × chaos matrix | Output not empty (critical). |
| **Replay** | booking_incident_001.yaml | Same input passes contract. |
---
## Scenario 5: Data Pipeline Agent with Replay
### The Agent
An agent that triggers a data pipeline via a tool and returns the run status. We verify behavior with a contract and replay a failed pipeline run.
### Pipeline Tool
```python
# pipeline_tool.py — port 5012
from fastapi import FastAPI

app = FastAPI(title="Pipeline Tool")

@app.post("/pipeline/run")
def run_pipeline(job_id: str):
    return {"job_id": job_id, "status": "success", "rows_processed": 1000}
```
### Agent Code
```python
# pipeline_agent.py — port 8794
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Data Pipeline Agent")

BASE = "http://localhost:5012"

class InvokeRequest(BaseModel):
    input: str | None = None
    prompt: str | None = None

class InvokeResponse(BaseModel):
    result: str

@app.post("/reset")
def reset():
    return {"ok": True}

@app.post("/invoke", response_model=InvokeResponse)
def invoke(req: InvokeRequest):
    text = (req.input or req.prompt or "").strip()
    if not text:
        return InvokeResponse(result="Please specify a pipeline job.")
    try:
        # job_id is a query parameter on the tool endpoint, so send params= (json= would 422).
        r = httpx.post(f"{BASE}/pipeline/run", params={"job_id": "daily_etl"}, timeout=30.0)
        data = r.json() if r.status_code == 200 else {}
        status = data.get("status", "unknown")
        return InvokeResponse(result=f"Pipeline run: {status}. Rows: {data.get('rows_processed', 0)}.")
    except Exception as e:
        return InvokeResponse(result=f"Pipeline run failed ({type(e).__name__}).")
```
### flakestorm Configuration
```yaml
version: "2.0"
agent:
  endpoint: "http://localhost:8794/invoke"
  type: http
  method: POST
  request_template: '{"input": "{prompt}"}'
  response_path: "result"
  timeout: 35000
  reset_endpoint: "http://localhost:8794/reset"
golden_prompts:
  - "Run the daily ETL pipeline."
invariants:
  - type: output_not_empty
  - type: latency
    max_ms: 60000
contract:
  name: "Pipeline Contract"
  invariants:
    - id: not-empty
      type: output_not_empty
      severity: critical
      when: always
  chaos_matrix:
    - name: "no-chaos"
      tool_faults: []
      llm_faults: []
replays:
  sessions:
    - file: "replays/pipeline_fail_001.yaml"
output:
  format: html
  path: "./reports"
```
### Replay Session
```yaml
# replays/pipeline_fail_001.yaml
id: pipeline-fail-001
name: "Pipeline agent returned empty on timeout"
source: manual
input: "Run the daily ETL pipeline."
contract: "Pipeline Contract"
```
### Running the Test
```bash
uvicorn pipeline_tool:app --host 0.0.0.0 --port 5012
uvicorn pipeline_agent:app --host 0.0.0.0 --port 8794
flakestorm run -c flakestorm.yaml
flakestorm contract run -c flakestorm.yaml
flakestorm replay run replays/pipeline_fail_001.yaml -c flakestorm.yaml
```
### What We're Testing
| Pillar | What runs | What we verify |
|--------|-----------|----------------|
| **Contract** | Invariants × chaos matrix | Output not empty (critical). |
| **Replay** | pipeline_fail_001.yaml | Regression: same input passes contract after fix. |
---
## Quick reference: commands and config
- **Environment chaos:** [Environment Chaos](ENVIRONMENT_CHAOS.md). Use `match_url` for per-URL fault injection when your agent makes outbound HTTP calls. Built-in profiles are available via `--chaos-profile`: `api_outage`, `degraded_llm`, `hostile_tools`, `high_latency`, `cascading_failure`.
- **Behavioral contracts:** [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md). Reset: `agent.reset_endpoint` or `agent.reset_function`.
- **Replay regression:** [Replay Regression](REPLAY_REGRESSION.md). Export sessions from a report with `flakestorm replay export --from-report report.json -o ./replays/`, or import and run a LangSmith trace with `flakestorm replay run --from-langsmith RUN_ID --run`.
- **Full example:** [Research Agent example](../examples/v2_research_agent/README.md).
---
## Scenario 6: Customer Service Chatbot
### The Agent
---
## Scenario 7: Code Generation Agent
### The Agent
---
## Scenario 8: RAG-Based Q&A Agent
### The Agent
---
## Scenario 9: Multi-Tool Agent (LangChain)
### The Agent
---
## Scenario 10: Guardrailed Agent (Safety Testing)
### The Agent