# V2 Research Assistant — Flakestorm v2 Example
A **working** HTTP agent and v2.0 config that demonstrates all three V2 pillars: **Environment Chaos**, **Behavioral Contracts**, and **Replay-Based Regression**.
## Prerequisites
- Python 3.10+
- **Ollama** running with a model available (e.g. `ollama pull gemma3:1b`, then `ollama run gemma3:1b`). The agent calls Ollama to generate real LLM responses, and Flakestorm uses the same Ollama instance to generate mutations.
- Dependencies: `pip install -r requirements.txt` (fastapi, uvicorn, pydantic, httpx)
## 1. Start the agent
From the project root (skip the `cd` if you are already in this directory):
```bash
cd examples/v2_research_agent
uvicorn agent:app --host 0.0.0.0 --port 8790
```
Or: `python agent.py` (uses port 8790 by default).
Verify: `curl -X POST http://localhost:8790/invoke -H "Content-Type: application/json" -d "{\"input\": \"Hello\"}"`
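
The same check can be scripted from Python. The sketch below is a minimal, standard-library-only equivalent of the `curl` call. The helper names `build_invoke_request` and `invoke` are illustrative and not part of Flakestorm or the agent, and it assumes the agent replies with a JSON body without relying on any particular field names.

```python
import json
import urllib.request


def build_invoke_request(
    text: str, base_url: str = "http://localhost:8790"
) -> urllib.request.Request:
    """Build the same POST /invoke request as the curl command above."""
    body = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/invoke",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke(text: str, base_url: str = "http://localhost:8790", timeout: float = 60.0):
    """Send the request and return the decoded JSON response.

    Requires the agent to be running; assumes the response body is JSON.
    """
    request = build_invoke_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the agent running, `print(invoke("Hello"))` should show whatever the agent returns for that input.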
## 2. Run Flakestorm v2 commands
From the **project root** (so `flakestorm` and config paths resolve):
```bash
# Mutation testing only (v1 style)
flakestorm run -c examples/v2_research_agent/flakestorm.yaml
# With chaos (tool/LLM faults)
flakestorm run -c examples/v2_research_agent/flakestorm.yaml --chaos
# Chaos only (no mutations, golden prompts under chaos)
flakestorm run -c examples/v2_research_agent/flakestorm.yaml --chaos-only
# Built-in chaos profile
flakestorm run -c examples/v2_research_agent/flakestorm.yaml --chaos-profile api_outage
# Behavioral contract × chaos matrix
flakestorm contract run -c examples/v2_research_agent/flakestorm.yaml
# Contract score only (CI gate)
flakestorm contract score -c examples/v2_research_agent/flakestorm.yaml
# Replay regression (one session)
flakestorm replay run examples/v2_research_agent/replays/incident_001.yaml -c examples/v2_research_agent/flakestorm.yaml
# Export failures from a report as replay files
flakestorm replay export --from-report reports/report.json -o examples/v2_research_agent/replays/
# Full CI run (mutation + contract + chaos + replay, overall weighted score)
flakestorm ci -c examples/v2_research_agent/flakestorm.yaml --min-score 0.5
```
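
To make the "overall weighted score" and `--min-score 0.5` in the last command concrete, here is a minimal sketch of a weighted average followed by a threshold check. It assumes the overall score is a plain weighted average of per-pillar scores in the range 0 to 1, using the weights listed in the Scoring row of section 3. How Flakestorm actually combines scores is not spelled out in this README, and `overall_score`, `passes_gate`, and the example numbers are hypothetical.

```python
# Weights as listed in this README's Scoring row; the authoritative
# values live in flakestorm.yaml.
WEIGHTS = {"mutation": 0.20, "chaos": 0.35, "contract": 0.35, "replay": 0.10}


def overall_score(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-pillar scores (assumed to lie in [0, 1])."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


def passes_gate(scores: dict[str, float], min_score: float = 0.5) -> bool:
    """Illustrates the idea behind --min-score: pass at or above the threshold."""
    return overall_score(scores) >= min_score


# Hypothetical per-pillar results, for illustration only.
example = {"mutation": 0.9, "chaos": 0.4, "contract": 0.6, "replay": 1.0}
print(round(overall_score(example), 3), passes_gate(example))
```

Under these assumptions a weak chaos score can still pass the gate if the other pillars are strong, which is why the choice of weights matters for CI.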
## 3. What this example demonstrates
| Feature | Config / usage |
|--------|-----------------|
| **Chaos** | `chaos.tool_faults` (HTTP 503 injected with a configured probability), `chaos.llm_faults` (truncated responses); enabled with `--chaos`, `--chaos-only`, or `--chaos-profile` |
| **Contract** | `contract` with invariants (always-cite-source, completes, max-latency) and `chaos_matrix` (no-chaos, api-outage) |
| **Replay** | `replays.sessions` with `file: replays/incident_001.yaml`; contract resolved by name "Research Agent Contract" |
| **Scoring** | `scoring` weights (mutation 20%, chaos 35%, contract 35%, replay 10%); used in `flakestorm ci` |
| **Reset** | `agent.reset_endpoint: http://localhost:8790/reset` for contract matrix isolation |
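
A minimal sketch of how the Contract row above could expand into matrix cells. It assumes each cell pairs one invariant with one chaos scenario; the README names the invariants and scenarios but does not spell out the pairing, so treat that as an assumption. `matrix_cells` is an illustrative helper, not a Flakestorm API.

```python
from itertools import product

# Names taken from the table above; the real definitions live in flakestorm.yaml.
INVARIANTS = ["always-cite-source", "completes", "max-latency"]
SCENARIOS = ["no-chaos", "api-outage"]


def matrix_cells(invariants: list[str], scenarios: list[str]) -> list[tuple[str, str]]:
    """Return every (invariant, scenario) pair, grouped by scenario."""
    return [(inv, sc) for sc, inv in product(scenarios, invariants)]


for invariant, scenario in matrix_cells(INVARIANTS, SCENARIOS):
    print(f"{scenario:<12} {invariant}")
```

Three invariants and two scenarios give six cells. Resetting the agent through `reset_endpoint` between cells keeps state from one cell from leaking into the next.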
## 4. Config layout (v2.0)
- `version: "2.0"`
- `agent` + `reset_endpoint`
- `chaos` (tool_faults, llm_faults)
- `contract` (invariants, chaos_matrix)
- `replays.sessions` (file reference)
- `scoring` (weights)
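
The layout above can be checked mechanically. The sketch below is a hypothetical helper, not part of Flakestorm, that reports which of the listed sections are absent from an already parsed config. The config is a plain dict here so the example needs no YAML library, and the values are abbreviated placeholders rather than the real contents of `flakestorm.yaml`.

```python
# Dotted paths named in the layout list above.
REQUIRED_PATHS = [
    "version",
    "agent.reset_endpoint",
    "chaos.tool_faults",
    "chaos.llm_faults",
    "contract.invariants",
    "contract.chaos_matrix",
    "replays.sessions",
    "scoring",
]


def missing_paths(config: dict, required: list[str] = REQUIRED_PATHS) -> list[str]:
    """Return the dotted paths from `required` that are absent in `config`."""
    missing = []
    for path in required:
        node = config
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing


# Hypothetical, heavily abbreviated config as it might look after YAML parsing.
config = {
    "version": "2.0",
    "agent": {"reset_endpoint": "http://localhost:8790/reset"},
    "chaos": {"tool_faults": [], "llm_faults": []},
    "contract": {"invariants": [], "chaos_matrix": []},
    "replays": {"sessions": [{"file": "replays/incident_001.yaml"}]},
}
print(missing_paths(config))  # prints ['scoring']
```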

The agent calls **Ollama** with the same model as in `flakestorm.yaml` (`gemma3:1b` by default). Set `OLLAMA_BASE_URL` if Ollama runs elsewhere, or `OLLAMA_MODEL` to use a different model. The agent is stateless except for a call counter; `/reset` clears it so contract cells stay isolated.
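
As an illustration of the environment overrides, here is a minimal sketch of resolving the two variables with fallbacks. The function name `ollama_settings` is hypothetical, and the fallback URL is Ollama's standard local address (`http://localhost:11434`); whether `agent.py` uses exactly these fallbacks is an assumption.

```python
import os


def ollama_settings(env=os.environ) -> dict:
    """Resolve Ollama connection settings, falling back to local defaults.

    The defaults here are assumptions for illustration, not read from agent.py.
    """
    return {
        # Strip a trailing slash so paths can be appended consistently.
        "base_url": env.get("OLLAMA_BASE_URL", "http://localhost:11434").rstrip("/"),
        "model": env.get("OLLAMA_MODEL", "gemma3:1b"),
    }


print(ollama_settings())
```

Passing a plain dict instead of `os.environ` makes the resolution easy to exercise without touching the real environment.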