Update version to 2.0.0 and enhance chaos engineering features in Flakestorm. Added support for environment chaos, behavioral contracts, and replay regression. Expanded documentation and improved scoring mechanisms. Updated .gitignore to include new documentation files.

This commit is contained in:
Francisco M Humarang Jr. 2026-03-06 23:33:21 +08:00
parent 59cca61f3c
commit 9c3450a75d
63 changed files with 4147 additions and 134 deletions


@@ -0,0 +1,76 @@
# V2 Research Assistant — Flakestorm v2 Example
A **working** HTTP agent and v2.0 config that demonstrates all three V2 pillars: **Environment Chaos**, **Behavioral Contracts**, and **Replay-Based Regression**.
## Prerequisites
- Python 3.10+
- Ollama running (for mutation generation): `ollama run gemma3:1b` or any model
- Optional: `pip install fastapi uvicorn` (agent server)
## 1. Start the agent
From the project root or this directory:
```bash
cd examples/v2_research_agent
uvicorn agent:app --host 0.0.0.0 --port 8790
```
Or: `python agent.py` (uses port 8790 by default).
Verify: `curl -X POST http://localhost:8790/invoke -H "Content-Type: application/json" -d "{\"input\": \"Hello\"}"`
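A healthy agent returns JSON shaped like the following (a sketch based on this example's agent code; `latency_ms` is never set by the demo handler, so it serializes as `null`):

```json
{
  "result": "According to [source: demo_knowledge_base], here is what I found for your query: \"Hello\". Data may be incomplete when tools are degraded.",
  "source": "demo_knowledge_base",
  "latency_ms": null
}
```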
## 2. Run Flakestorm v2 commands
From the **project root** (so `flakestorm` and config paths resolve):
```bash
# Mutation testing only (v1 style)
flakestorm run -c examples/v2_research_agent/flakestorm.yaml
# With chaos (tool/LLM faults)
flakestorm run -c examples/v2_research_agent/flakestorm.yaml --chaos
# Chaos only (no mutations, golden prompts under chaos)
flakestorm run -c examples/v2_research_agent/flakestorm.yaml --chaos-only
# Built-in chaos profile
flakestorm run -c examples/v2_research_agent/flakestorm.yaml --chaos-profile api_outage
# Behavioral contract × chaos matrix
flakestorm contract run -c examples/v2_research_agent/flakestorm.yaml
# Contract score only (CI gate)
flakestorm contract score -c examples/v2_research_agent/flakestorm.yaml
# Replay regression (one session)
flakestorm replay run examples/v2_research_agent/replays/incident_001.yaml -c examples/v2_research_agent/flakestorm.yaml
# Export failures from a report as replay files
flakestorm replay export --from-report reports/report.json -o examples/v2_research_agent/replays/
# Full CI run (mutation + contract + chaos + replay, overall weighted score)
flakestorm ci -c examples/v2_research_agent/flakestorm.yaml --min-score 0.5
```
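For the CI gate, the overall score is a weighted sum of the four pillar scores. A minimal Python sketch of that arithmetic, using this example's `scoring` weights (the per-pillar scores below are illustrative placeholders, not real results):

```python
# Weights from this example's `scoring` block (they sum to 1.0).
WEIGHTS = {"mutation": 0.20, "chaos": 0.35, "contract": 0.35, "replay": 0.10}

def overall_score(pillar_scores: dict[str, float]) -> float:
    """Weighted sum of pillar scores, each expected in [0, 1]."""
    return sum(WEIGHTS[name] * pillar_scores.get(name, 0.0) for name in WEIGHTS)

# Placeholder pillar results for illustration only.
scores = {"mutation": 0.9, "chaos": 0.6, "contract": 0.8, "replay": 1.0}
overall = overall_score(scores)
print(round(overall, 2))   # -> 0.77
print(overall >= 0.5)      # would pass `--min-score 0.5` -> True
```

With these placeholder numbers the run passes a `--min-score 0.5` gate; raising the threshold above 0.77 would fail it.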
## 3. What this example demonstrates
| Feature | Config / usage |
|--------|-----------------|
| **Chaos** | `chaos.tool_faults` (503 with probability), `chaos.llm_faults` (truncated); `--chaos`, `--chaos-profile` |
| **Contract** | `contract` with invariants (always-cite-source, completes, max-latency) and `chaos_matrix` (no-chaos, api-outage) |
| **Replay** | `replays.sessions` with `file: replays/incident_001.yaml`; contract resolved by name "Research Agent Contract" |
| **Scoring** | `scoring` weights (mutation 20%, chaos 35%, contract 35%, replay 10%); used in `flakestorm ci` |
| **Reset** | `agent.reset_endpoint: http://localhost:8790/reset` for contract matrix isolation |
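The `always-cite-source` invariant is a plain case-insensitive regex check. A quick Python sketch of how the pattern from this example's contract behaves:

```python
import re

# Pattern copied from the contract's `always-cite-source` invariant.
PATTERN = re.compile(r"(?i)(source|according to)")

def cites_source(output: str) -> bool:
    """True if the agent output mentions a source (case-insensitive)."""
    return PATTERN.search(output) is not None

print(cites_source("According to [source: demo_knowledge_base], ..."))  # True
print(cites_source("Paris is the capital of France."))                  # False
```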
## 4. Config layout (v2.0)
- `version: "2.0"`
- `agent` + `reset_endpoint`
- `chaos` (tool_faults, llm_faults)
- `contract` (invariants, chaos_matrix)
- `replays.sessions` (file reference)
- `scoring` (weights)
The agent is stateless except for a call counter; `/reset` clears it so contract cells stay isolated.
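A rough sketch (not Flakestorm's actual implementation) of what the `agent.request_template` and `agent.response_path` settings imply for the HTTP plumbing: the prompt is interpolated into the request body, and the named field is read back from the agent's JSON response.

```python
import json

# Values copied from this example's `agent` config.
REQUEST_TEMPLATE = '{"input": "{prompt}"}'
RESPONSE_PATH = "result"

def build_request(prompt: str) -> dict:
    # Naive substitution for illustration; assumes the prompt contains
    # no JSON-special characters (quotes, backslashes, newlines).
    return json.loads(REQUEST_TEMPLATE.replace("{prompt}", prompt))

def extract_output(response: dict) -> str:
    # `response_path` names the field holding the agent's text output.
    return response[RESPONSE_PATH]

print(build_request("What is the capital of France?"))
print(extract_output({"result": "Paris", "source": "demo_knowledge_base"}))  # Paris
```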


@@ -0,0 +1,72 @@
"""
V2 Research Assistant Agent Working example for Flakestorm v2.
A minimal HTTP agent that simulates a research assistant: it responds to queries
and always cites a source (so behavioral contracts can be verified). Supports
/reset for contract matrix isolation. Used to demonstrate:
- flakestorm run (mutation testing)
- flakestorm run --chaos / --chaos-profile (environment chaos)
- flakestorm contract run (behavioral contract × chaos matrix)
- flakestorm replay run (replay regression)
- flakestorm ci (unified run with overall score)
"""
import os
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI(title="V2 Research Assistant Agent")
# In-memory state (cleared by /reset for contract isolation)
_state = {"calls": 0}
class InvokeRequest(BaseModel):
"""Request body: prompt or input."""
input: str | None = None
prompt: str | None = None
query: str | None = None
class InvokeResponse(BaseModel):
"""Response with result and optional metadata."""
result: str
source: str = "demo_knowledge_base"
latency_ms: float | None = None
@app.post("/reset")
def reset():
"""Reset agent state. Called by Flakestorm before each contract matrix cell."""
_state["calls"] = 0
return {"ok": True}
@app.post("/invoke", response_model=InvokeResponse)
def invoke(req: InvokeRequest):
"""Handle a single user query. Always cites a source (contract invariant)."""
_state["calls"] += 1
text = req.input or req.prompt or req.query or ""
if not text.strip():
return InvokeResponse(
result="I didn't receive a question. Please ask something.",
source="none",
)
# Simulate a research response that cites a source (contract: always-cite-source)
response = (
f"According to [source: {_state['source']}], "
f"here is what I found for your query: \"{text[:100]}\". "
"Data may be incomplete when tools are degraded."
)
return InvokeResponse(result=response, source=_state["source"])
@app.get("/health")
def health():
return {"status": "ok"}
if __name__ == "__main__":
import uvicorn
port = int(os.environ.get("PORT", "8790"))
uvicorn.run(app, host="0.0.0.0", port=port)


@@ -0,0 +1,129 @@
# Flakestorm v2.0 — Research Assistant Example
# Demonstrates: mutation testing, chaos, behavioral contract, replay, ci
version: "2.0"
# -----------------------------------------------------------------------------
# Agent (HTTP). Start with: python agent.py (or uvicorn agent:app --port 8790)
# -----------------------------------------------------------------------------
agent:
endpoint: "http://localhost:8790/invoke"
type: "http"
method: "POST"
request_template: '{"input": "{prompt}"}'
response_path: "result"
timeout: 15000
reset_endpoint: "http://localhost:8790/reset"
# -----------------------------------------------------------------------------
# Model (for mutation generation only)
# -----------------------------------------------------------------------------
model:
provider: "ollama"
name: "gemma3:1b"
base_url: "http://localhost:11434"
# -----------------------------------------------------------------------------
# Mutations
# -----------------------------------------------------------------------------
mutations:
count: 5
types:
- paraphrase
- noise
- tone_shift
- prompt_injection
# -----------------------------------------------------------------------------
# Golden prompts
# -----------------------------------------------------------------------------
golden_prompts:
- "What is the capital of France?"
- "Summarize the benefits of renewable energy."
# -----------------------------------------------------------------------------
# Invariants (run invariants)
# -----------------------------------------------------------------------------
invariants:
- type: latency
max_ms: 30000
- type: contains
value: "source"
- type: output_not_empty
# -----------------------------------------------------------------------------
# V2: Environment Chaos (tool/LLM faults)
# For HTTP agent, tool_faults with tool "*" apply to the single request to endpoint.
# -----------------------------------------------------------------------------
chaos:
tool_faults:
- tool: "*"
mode: error
error_code: 503
message: "Service Unavailable"
probability: 0.3
llm_faults:
- mode: truncated_response
max_tokens: 5
probability: 0.2
# -----------------------------------------------------------------------------
# V2: Behavioral Contract + Chaos Matrix
# -----------------------------------------------------------------------------
contract:
name: "Research Agent Contract"
description: "Must cite source and complete under chaos"
invariants:
- id: always-cite-source
type: regex
pattern: "(?i)(source|according to)"
severity: critical
when: always
description: "Must cite a source"
- id: completes
type: completes
severity: high
when: always
description: "Must return a response"
- id: max-latency
type: latency
max_ms: 60000
severity: medium
when: always
chaos_matrix:
- name: "no-chaos"
tool_faults: []
llm_faults: []
- name: "api-outage"
tool_faults:
- tool: "*"
mode: error
error_code: 503
message: "Service Unavailable"
# -----------------------------------------------------------------------------
# V2: Replay regression (sessions can reference file or be inline)
# -----------------------------------------------------------------------------
replays:
sessions:
- file: "replays/incident_001.yaml"
# -----------------------------------------------------------------------------
# V2: Scoring weights (overall = mutation*0.2 + chaos*0.35 + contract*0.35 + replay*0.1)
# -----------------------------------------------------------------------------
scoring:
mutation: 0.20
chaos: 0.35
contract: 0.35
replay: 0.10
# -----------------------------------------------------------------------------
# Output
# -----------------------------------------------------------------------------
output:
format: "html"
path: "./reports"
advanced:
concurrency: 5
retries: 2


@@ -0,0 +1,9 @@
# Replay session: production incident to regress
# Run with: flakestorm replay run replays/incident_001.yaml -c flakestorm.yaml
id: incident-001
name: "Research agent incident - missing source"
source: manual
input: "What is the capital of France?"
tool_responses: []
expected_failure: "Agent returned response without citing source"
contract: "Research Agent Contract"


@@ -0,0 +1,4 @@
# V2 Research Agent — run the example HTTP agent
fastapi>=0.100.0
uvicorn>=0.22.0
pydantic>=2.0