V2 Research Assistant — Flakestorm v2 Example

A working HTTP agent and v2.0 config that together demonstrate all three V2 pillars: Environment Chaos, Behavioral Contracts, and Replay-Based Regression.

Prerequisites

  • Python 3.10+
  • Ollama running with a model available (e.g. ollama pull gemma3:1b, then ollama run gemma3:1b). The agent calls Ollama to generate real LLM responses; Flakestorm uses the same Ollama instance for mutation generation.
  • Dependencies: pip install -r requirements.txt (fastapi, uvicorn, pydantic, httpx)

1. Start the agent

From the project root or this directory:

cd examples/v2_research_agent
uvicorn agent:app --host 0.0.0.0 --port 8790

Or: python agent.py (uses port 8790 by default).

Verify: curl -X POST http://localhost:8790/invoke -H "Content-Type: application/json" -d "{\"input\": \"Hello\"}"
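
The same check can be scripted. Below is a minimal sketch using only the Python standard library, assuming /invoke accepts the JSON body shown in the curl command above. The helper names build_invoke_request and invoke are illustrative and are not part of the example's code; the response is returned as raw text because its schema is not described here.

```python
import json
import urllib.request


def build_invoke_request(text, base_url="http://localhost:8790"):
    """Build the POST request that the curl command above sends."""
    body = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/invoke",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke(text, base_url="http://localhost:8790", timeout=60.0):
    """Send the request and return the raw response body as text."""
    request = build_invoke_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the agent from step 1 running, print(invoke("Hello")) should print whatever the agent returns for that input.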

2. Run Flakestorm v2 commands

From the project root (so flakestorm and config paths resolve):

# Mutation testing only (v1 style)
flakestorm run -c examples/v2_research_agent/flakestorm.yaml

# With chaos (tool/LLM faults)
flakestorm run -c examples/v2_research_agent/flakestorm.yaml --chaos

# Chaos only (no mutations, golden prompts under chaos)
flakestorm run -c examples/v2_research_agent/flakestorm.yaml --chaos-only

# Built-in chaos profile
flakestorm run -c examples/v2_research_agent/flakestorm.yaml --chaos-profile api_outage

# Behavioral contract × chaos matrix
flakestorm contract run -c examples/v2_research_agent/flakestorm.yaml

# Contract score only (CI gate)
flakestorm contract score -c examples/v2_research_agent/flakestorm.yaml

# Replay regression (one session)
flakestorm replay run examples/v2_research_agent/replays/incident_001.yaml -c examples/v2_research_agent/flakestorm.yaml

# Export failures from a report as replay files
flakestorm replay export --from-report reports/report.json -o examples/v2_research_agent/replays/

# Full CI run (mutation + contract + chaos + replay, overall weighted score)
flakestorm ci -c examples/v2_research_agent/flakestorm.yaml --min-score 0.5

3. What this example demonstrates

| Feature | Config / usage |
|---|---|
| Chaos | chaos.tool_faults (503 with probability), chaos.llm_faults (truncated); --chaos, --chaos-profile |
| Contract | contract with invariants (always-cite-source, completes, max-latency) and chaos_matrix (no-chaos, api-outage) |
| Replay | replays.sessions with file: replays/incident_001.yaml; contract resolved by name "Research Agent Contract" |
| Scoring | scoring weights (mutation 20%, chaos 35%, contract 35%, replay 10%); used in flakestorm ci |
| Reset | agent.reset_endpoint: http://localhost:8790/reset for contract matrix isolation |
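
To make the scoring weights concrete, here is a rough sketch of how an overall score could be derived from the four per-pillar scores and compared against --min-score. It assumes a plain weighted average of scores in the range 0 to 1; Flakestorm's actual aggregation may differ, and the function names are hypothetical.

```python
# Weights as listed in the table above (they sum to 1.0).
WEIGHTS = {"mutation": 0.20, "chaos": 0.35, "contract": 0.35, "replay": 0.10}


def overall_score(scores, weights=WEIGHTS):
    """Weighted average of per-pillar scores, each expected in [0, 1].

    Assumption: a plain weighted average normalized by the total weight.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def passes_gate(scores, min_score=0.5, weights=WEIGHTS):
    """Mimic a CI gate: pass when the overall score reaches min_score."""
    return overall_score(scores, weights) >= min_score
```

For example, scores of 0.8 (mutation), 0.4 (chaos), 0.6 (contract) and 1.0 (replay) give roughly 0.61 under this assumption, which would clear a 0.5 threshold.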

4. Config layout (v2.0)

  • version: "2.0"
  • agent + reset_endpoint
  • chaos (tool_faults, llm_faults)
  • contract (invariants, chaos_matrix)
  • replays.sessions (file reference)
  • scoring (weights)
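
The layout above can be sanity-checked before a run. The sketch below assumes the config has already been loaded into a Python dict (for instance with a YAML parser) and only checks the keys named in this README; missing_keys is a hypothetical helper, not a Flakestorm API, and the real schema may require more fields.

```python
# Sections and nested keys taken from the layout listed above.
# "scoring" is only checked at the top level because its inner
# structure is not spelled out here.
EXPECTED_LAYOUT = {
    "agent": ["reset_endpoint"],
    "chaos": ["tool_faults", "llm_faults"],
    "contract": ["invariants", "chaos_matrix"],
    "replays": ["sessions"],
    "scoring": [],
}


def missing_keys(config):
    """Return dotted paths that the layout expects but `config` lacks."""
    missing = []
    if str(config.get("version")) != "2.0":
        missing.append("version")
    for section, keys in EXPECTED_LAYOUT.items():
        block = config.get(section)
        if not isinstance(block, dict):
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in block)
    return missing
```

An empty list means every section named above is present; anything else lists what to add.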

The agent calls Ollama (same model as in flakestorm.yaml: gemma3:1b by default). Set OLLAMA_BASE_URL or OLLAMA_MODEL if your Ollama runs elsewhere or uses a different model. The agent is stateless except for a call counter; /reset clears it so contract cells stay isolated.
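
Why the reset matters can be illustrated with a small in-memory sketch. CallCounter stands in for the agent's only state, and run_matrix walks the invariant x scenario cells named in this README, resetting before each one. How Flakestorm actually schedules cells and calls /reset is an assumption here; the class and function names are illustrative.

```python
from itertools import product

# Names taken from the contract section of this example.
INVARIANTS = ["always-cite-source", "completes", "max-latency"]
SCENARIOS = ["no-chaos", "api-outage"]


class CallCounter:
    """Stand-in for the agent's only state: a count of handled calls."""

    def __init__(self):
        self.calls = 0

    def invoke(self):
        self.calls += 1
        return self.calls

    def reset(self):
        # Mirrors what POST /reset is described as doing.
        self.calls = 0


def run_matrix(agent, calls_per_cell=2):
    """Visit every invariant x scenario cell, resetting state before each."""
    observed = {}
    for invariant, scenario in product(INVARIANTS, SCENARIOS):
        agent.reset()
        for _ in range(calls_per_cell):
            agent.invoke()
        observed[(invariant, scenario)] = agent.calls
    return observed
```

With the reset in place, every one of the six cells sees the same starting state; without it, the counter would keep growing from cell to cell and later cells would no longer be comparable to earlier ones.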