diff --git a/README.md b/README.md
index 31bc9b7..f8e299d 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@
-
+
@@ -134,10 +134,12 @@ Flakestorm supports several modes; you can use one or combine them:
- **Chaos only** — Golden prompts → agent with fault-injected tools/LLM → invariants. *Does the agent handle bad environments?*
- **Contract** — Golden prompts → agent under each chaos scenario → verify named invariants across a matrix. *Does the agent obey its rules under every failure mode?*
- **Replay** — Recorded production input + recorded tool responses → agent → contract. *Did we fix this incident?*
-- **Mutation (optional)** — Golden prompts → adversarial mutations (22+ types, max 50/run) → agent (optionally under chaos) → invariants. *Does the agent handle bad inputs (and optionally bad environments)?*
+- **Mutation (optional)** — Golden prompts → adversarial mutations (24 types, max 50/run) → agent (optionally under chaos) → invariants. *Does the agent handle bad inputs (and optionally bad environments)?*
You define **golden prompts**, **invariants** (or a full **contract** with severity and chaos matrix), and optionally **chaos** (tool/LLM faults) and **replay** sessions. Flakestorm runs the chosen mode(s), checks responses against your rules, and produces a **robustness score** (mutation or chaos-only runs) or **resilience score** (contract run), plus HTML report. Use `flakestorm run`, `flakestorm contract run`, `flakestorm replay run`, or `flakestorm ci` for the combined overall score (OSS: run from CLI or your own scripts; **native CI/CD integrations** — scheduled runs, pipeline plugins — are **Cloud only**).
+For the full **V1 vs V2 flow** (mutation-only vs four pillars, contract matrix isolation, resilience score formula), see the [Usage Guide](docs/USAGE_GUIDE.md#how-it-works).
+
> **Note**: Mutation generation uses a local LLM (Ollama) or cloud APIs (OpenAI, Claude, Gemini). API keys via environment variables only. See [LLM Providers](docs/LLM_PROVIDERS.md).
## Features
@@ -150,7 +152,7 @@ You define **golden prompts**, **invariants** (or a full **contract** with sever
### Supporting capabilities
-- **Adversarial mutations** — 22+ mutation types (prompt-level and system/network-level); max 50 mutations per run in OSS. [→ Test Scenarios](docs/TEST_SCENARIOS.md)
+- **Adversarial mutations** — 24 mutation types (prompt-level and system/network-level); max 50 mutations per run in OSS. [→ Test Scenarios](docs/TEST_SCENARIOS.md)
- **Invariants & assertions** — Deterministic checks, semantic similarity, safety (PII, refusal); configurable per contract.
- **Robustness score** — For mutation runs: a single weighted score (0–1) of how well the agent handled adversarial prompts. Reported in HTML/JSON and CLI (`results.statistics.robustness_score`).
- **Unified resilience score** — For full CI: weighted combination of **mutation robustness**, chaos resilience, contract compliance, and replay regression; weights (mutation, chaos, contract, replay) configurable in YAML and must sum to 1.0.
@@ -163,7 +165,7 @@ You define **golden prompts**, **invariants** (or a full **contract** with sever
## Open Source vs Cloud
**Open Source (Always Free):**
-- Core chaos engine with all 22+ mutation types (max 50 per run; no artificial feature gating)
+- Core chaos engine with all 24 mutation types (max 50 per run; no artificial feature gating)
- Local execution for fast experimentation
- Run from CLI or your own scripts (no native CI/CD; that’s Cloud only)
- Full transparency and extensibility
@@ -276,4 +278,3 @@ Apache 2.0 - See [LICENSE](LICENSE) for details.
❤️ Sponsor Flakestorm on GitHub
-
\ No newline at end of file
diff --git a/docs/USAGE_GUIDE.md b/docs/USAGE_GUIDE.md
index 9dcba8a..6a73911 100644
--- a/docs/USAGE_GUIDE.md
+++ b/docs/USAGE_GUIDE.md
@@ -25,7 +25,10 @@ This comprehensive guide walks you through using flakestorm to test your AI agen
### What is flakestorm?
-flakestorm is an **adversarial testing framework** and **chaos engineering platform** for AI agents. It applies chaos engineering principles to systematically test how your AI agents behave under unexpected, malformed, or adversarial inputs. With **V2** (`version: "2.0"` in config) you get environment chaos (tool/LLM faults, context attacks), behavioral contracts (invariants × chaos matrix), and replay regression; **22+ mutation types** and **max 50 mutations per run** in OSS. API keys for cloud LLM providers must be set via environment variables only (e.g. `api_key: "${OPENAI_API_KEY}"`). See [Configuration Guide](CONFIGURATION_GUIDE.md) and [V2 Spec](V2_SPEC.md).
+flakestorm is an **adversarial testing framework** and **chaos engineering platform** for AI agents. It applies chaos engineering principles to systematically test how your AI agents behave under unexpected, malformed, or adversarial inputs.
+
+- **V1** (`version: "1.0"` or omitted): Mutation-only mode — golden prompts → mutation engine → agent → invariants → **robustness score**. Ideal for quick adversarial input testing.
+- **V2** (`version: "2.0"` in config): Full chaos platform — **Environment Chaos** (tool/LLM faults, context attacks), **Behavioral Contracts** (invariants × chaos matrix with per-cell isolation), and **Replay Regression** (replay production incidents). You get **24 mutation types** and **max 50 mutations per run** in OSS; plus `flakestorm run --chaos`, `flakestorm contract run`, `flakestorm replay run`, and `flakestorm ci` for a unified **resilience score**. API keys for cloud LLM providers must be set via environment variables only (e.g. `api_key: "${OPENAI_API_KEY}"`). See [Configuration Guide](CONFIGURATION_GUIDE.md), [V2 Spec](V2_SPEC.md), and [V2 Audit](V2_AUDIT.md).
### Why Use flakestorm?
@@ -39,47 +42,44 @@ flakestorm is an **adversarial testing framework** and **chaos engineering platf
### How It Works
+Flakestorm supports **V1 (mutation-only)** and **V2 (full chaos platform)**. The flow depends on your config version and which commands you run.
+
+#### V1 / Mutation-only flow
+
+With a V1 config (or V2 config without `--chaos`), you get the classic adversarial flow:
+
```
┌─────────────────────────────────────────────────────────────────┐
-│ flakestorm FLOW │
+│ flakestorm V1 — MUTATION-ONLY FLOW │
├─────────────────────────────────────────────────────────────────┤
-│ │
-│ 1. GOLDEN PROMPTS 2. MUTATION ENGINE │
-│ ┌─────────────────┐ ┌─────────────────┐ │
-│ │ "Book a flight │ ───► │ Local LLM │ │
-│ │ from NYC to LA"│ │ (Qwen/Ollama) │ │
-│ └─────────────────┘ └────────┬────────┘ │
-│ │ │
-│ ▼ │
-│ ┌─────────────────┐ │
-│ │ Mutated Prompts │ │
-│ │ • Typos │ │
-│ │ • Paraphrases │ │
-│ │ • Injections │ │
-│ └────────┬────────┘ │
-│ │ │
-│ 3. YOUR AGENT ▼ │
-│ ┌─────────────────┐ ┌─────────────────┐ │
-│ │ AI Agent │ ◄─── │ Test Runner │ │
-│ │ (HTTP/Python) │ │ (Async) │ │
-│ └────────┬────────┘ └─────────────────┘ │
-│ │ │
-│ ▼ │
-│ 4. VERIFICATION 5. REPORTING │
-│ ┌─────────────────┐ ┌─────────────────┐ │
-│ │ Invariant │ ───► │ HTML/JSON/CLI │ │
-│ │ Assertions │ │ Reports │ │
-│ └─────────────────┘ └─────────────────┘ │
-│ │ │
-│ ▼ │
-│ ┌─────────────────┐ │
-│ │ Robustness │ │
-│ │ Score: 0.85 │ │
-│ └─────────────────┘ │
-│ │
+│ 1. GOLDEN PROMPTS → 2. MUTATION ENGINE (Local LLM) │
+│ "Book a flight" → Mutated prompts (typos, paraphrases, │
+│ injections, encoding, etc. — 24 types)│
+│ ↓ │
+│ 3. YOUR AGENT ← Test Runner sends each mutated prompt │
+│ (HTTP/Python) ↓ │
+│ 4. INVARIANT ASSERTIONS → 5. REPORTING │
+│ (contains, latency, similarity, safety) → Robustness Score │
└─────────────────────────────────────────────────────────────────┘
```
+**Commands:** `flakestorm run` (no `--chaos`) → **Robustness score** (0–1).
+
+#### V2 flow — Four pillars
+
+With **`version: "2.0"`** in your config, Flakestorm adds environment chaos, behavioral contracts, and replay regression. See [V2 Spec](V2_SPEC.md) and [V2 Audit](V2_AUDIT.md).
+
+| Pillar | What runs | Score / output |
+|--------|-----------|----------------|
+| **Mutation run** | Golden prompts → 24 mutation types → agent → invariants | **Robustness score** (0–1). Use `flakestorm run` or `flakestorm run --chaos` to include chaos. |
+| **Environment chaos** | Fault injection into tools and LLM (timeouts, errors, rate limits, malformed responses, context attacks) | **Chaos resilience** (0–1). Use `flakestorm run --chaos` (with mutations) or `flakestorm run --chaos --chaos-only` (no mutations). |
+| **Behavioral contracts** | Contracts (invariants × severity) × chaos matrix scenarios; each cell is an independent run (optional reset per cell). | **Resilience score** (0–100%). Use `flakestorm contract run`. Per-contract formula: weighted by severity (critical×3, high×2, medium×1); **auto-FAIL** if any critical fails. |
+| **Replay regression** | Replay saved sessions (e.g. production incidents) and verify against a contract. | Per-session pass/fail; **replay regression** score when run via CI. Use `flakestorm replay run [path]`. |
+
+**Unified CI:** `flakestorm ci` runs mutation run, contract run (if configured), chaos-only run (if chaos configured), and all replay sessions; then computes an **overall resilience score** from `scoring.weights` (default: mutation 0.20, chaos 0.35, contract 0.35, replay 0.10). Weights must sum to 1.0.
+
+**Contract matrix isolation (V2):** Each (invariant × scenario) cell is independent. Configure `agent.reset_endpoint` (HTTP) or `agent.reset_function` (Python) to clear agent state between cells; if not set and the agent is stateful, Flakestorm warns. See [V2 Spec — Contract matrix isolation](V2_SPEC.md#contract-matrix-isolation).
+
---
## Installation
@@ -819,7 +819,7 @@ golden_prompts:
### Mutation Types
-flakestorm generates adversarial variations of your golden prompts across 22+ mutation types organized into categories:
+flakestorm generates adversarial variations of your golden prompts across 24 mutation types organized into categories:
#### Prompt-Level Attacks
@@ -925,6 +925,21 @@ Score = (Weighted Passed Tests) / (Total Weighted Tests)
- **0.7-0.8**: Fair - Needs work
- **<0.7**: Poor - Significant reliability issues
+#### V2 Resilience Score (contract + overall)
+
+When using **V2** (`version: "2.0"`) with behavioral contracts and/or `flakestorm ci`, two additional scores apply. See [V2 Spec](V2_SPEC.md#resilience-score-formula).
+
+**Per-contract score** (for `flakestorm contract run`):
+
+```
+score = (Σ(passed_critical×3) + Σ(passed_high×2) + Σ(passed_medium×1))
+ / (Σ(total_critical×3) + Σ(total_high×2) + Σ(total_medium×1)) × 100
+```
+
+- **Automatic FAIL:** If any **critical** severity invariant fails in any scenario, the overall result is FAIL regardless of the numeric score.
+
+**Overall score** (for `flakestorm ci`): Configurable via **`scoring.weights`**. Weights must **sum to 1.0**. Default: mutation 0.20, chaos 0.35, contract 0.35, replay 0.10. The CI run combines mutation robustness, chaos resilience, contract compliance, and replay regression into one weighted overall resilience score.
+
---
## Understanding Mutation Types
@@ -1106,7 +1121,7 @@ flakestorm provides 22+ mutation types organized into **Prompt-Level Attacks** a
### Choosing Mutation Types
**Comprehensive Testing (Recommended):**
-Use all 22+ types for complete coverage:
+Use all 24 types for complete coverage:
```yaml
types:
# Original 8 types
@@ -1206,7 +1221,7 @@ The 22+ mutation types work together to provide comprehensive robustness testing
- **Infrastructure**: HTTP Header Injection, Payload Size Attack, Content-Type Confusion, Query Parameter Poisoning, Request Method Attack, Protocol-Level Attack, Resource Exhaustion, Concurrent Request Pattern, Timeout Manipulation
- **Temporal/Context**: Temporal Attack, Multi-Turn Attack
-For comprehensive testing, use all 22+ types. For focused testing:
+For comprehensive testing, use all 24 types. For focused testing:
- **Security-focused**: Emphasize Prompt Injection, Advanced Jailbreak, Protocol-Level Attack, HTTP Header Injection
- **UX-focused**: Emphasize Noise, Tone Shift, Context Manipulation, Language Mixing
- **Infrastructure-focused**: Emphasize all system/network-level types