Update documentation to reflect enhancements in Flakestorm V2, including detailed descriptions of new features such as resilience scores, chaos engineering capabilities, behavioral contracts, and replay regression. Clarify API key management via environment variables, update CLI commands, and improve test scenarios. Adjust the mutation types count to 22+ and record that all V2 gaps are closed per the latest specifications.

2026-04-25 00:36:54 +02:00 · 2026-03-09 19:52:39 +08:00
commit 4a13425f8a
parent f1570628c3
7 changed files with 142 additions and 39 deletions
--- a/docs/API_SPECIFICATION.md
+++ b/docs/API_SPECIFICATION.md
@@ -376,6 +376,11 @@ results.get_by_prompt("...")  # Filter by prompt

 # Serialization
 results.to_dict()  # Full JSON-serializable dict
+
+# V2: Resilience and contract/replay (when config has contract/replays)
+results.resilience_scores   # dict: mutation_robustness, chaos_resilience, contract_compliance, replay_regression
+results.contract_compliance # ContractRunResult | None (when contract run was executed)
+# Replay results are reported via flakestorm replay run --output; see Reports below.
 ```

 #### MutationResult
@@ -443,6 +448,8 @@ reporter.print_failures(limit=10)
 reporter.print_full_report()
 ```

+**V2 reports:** Contract runs (`flakestorm contract run --output report.html`) and replay runs (`flakestorm replay run --output report.html`) produce HTML reports that include **suggested actions** for failed cells or sessions (e.g. add reset_endpoint, tighten invariants, fix tool behavior). See [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md) and [Replay Regression](REPLAY_REGRESSION.md).
+
 ---

 ## CLI Commands
@@ -459,11 +466,14 @@ flakestorm init --force            # Overwrite existing

 ### `flakestorm run`

-Run reliability tests.
+Run reliability tests (mutation run; optionally with chaos).

 ```bash
-flakestorm run                              # Default config
+flakestorm run                              # Default config (mutation only)
 flakestorm run --config custom.yaml         # Custom config
+flakestorm run --chaos                       # Apply chaos (tool/LLM faults, context_attacks) during mutation run
+flakestorm run --chaos-only                  # Chaos-only run (no mutations); requires chaos config
+flakestorm run --chaos-profile api_outage    # Use a built-in chaos profile
 flakestorm run --output json                 # JSON output
 flakestorm run --output terminal             # Terminal only
 flakestorm run --min-score 0.9 --ci          # CI mode
@@ -503,6 +513,38 @@ else
 fi
 ```

+### V2: `flakestorm contract run` / `validate` / `score`
+
+Run behavioral contract tests (invariants × chaos matrix).
+
+```bash
+flakestorm contract run                      # Run contract matrix; progress and score in terminal
+flakestorm contract run --output report.html  # Save HTML report with suggested actions for failed cells
+flakestorm contract validate                 # Validate contract config only
+flakestorm contract score                    # Output contract resilience score only
+```
+
+### V2: `flakestorm replay run` / `export`
+
+Replay regression: run saved sessions and verify against a contract.
+
+```bash
+flakestorm replay run                        # Replay sessions from config (file or inline)
+flakestorm replay run path/to/session.yaml   # Replay a single session file
+flakestorm replay run path/to/replays/       # Replay all sessions in directory
+flakestorm replay run --output report.html   # Save HTML report with suggested actions for failed sessions
+flakestorm replay export --from-report FILE  # Export from an existing report
+```
+
+### V2: `flakestorm ci`
+
+Run full CI pipeline: mutation run, contract run (if configured), chaos-only (if chaos configured), replay (if configured); then compute overall weighted score from `scoring.weights`.
+
+```bash
+flakestorm ci
+flakestorm ci --config custom.yaml
+```
+
 ---

 ## Environment Variables
@@ -512,6 +554,8 @@ fi
 | `OLLAMA_HOST` | Override Ollama server URL |
 | Custom headers | Expanded in config via `${VAR}` syntax |

+**V2 — API keys (env-only):** Model API keys must not be literal in config. Use environment variables and reference them in config (e.g. `api_key: "${OPENAI_API_KEY}"`). Supported: `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, etc. See [LLM Providers](LLM_PROVIDERS.md).
+
 ---

 ## Exit Codes
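Note on the env-only API key rule added to `docs/API_SPECIFICATION.md` above: the check it describes could look roughly like the following. This is a minimal sketch, not Flakestorm's actual validator. The function name `resolve_api_key` and the regular expression are assumptions made for illustration; the only behavior taken from the docs is that `${VAR}` references are accepted and literal keys are rejected.

```python
import os
import re

# Assumption: a value counts as an env reference only if it is exactly "${NAME}".
_ENV_REF = re.compile(r"^\$\{([A-Za-z_][A-Za-z0-9_]*)\}$")


def resolve_api_key(value, env=None):
    """Resolve an api_key config value, rejecting literal keys (hypothetical helper)."""
    env = os.environ if env is None else env
    match = _ENV_REF.match(value)
    if match is None:
        # Anything that is not a ${VAR} reference is treated as a literal key.
        raise ValueError(
            "api_key must reference an environment variable, "
            "for example ${OPENAI_API_KEY}"
        )
    name = match.group(1)
    if name not in env:
        raise KeyError(f"environment variable {name} is not set")
    return env[name]
```

Passing `env` explicitly keeps the helper testable without touching the real process environment.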
--- a/docs/CONFIGURATION_GUIDE.md
+++ b/docs/CONFIGURATION_GUIDE.md
@@ -53,7 +53,7 @@ With `version: "2.0"` you can add the three **chaos engineering pillars** and a

 **Context attacks** (chaos on tool/context or input before invoke, not the user prompt) are configured under `chaos.context_attacks`. You can use a **list** of attack configs or a **dict** (addendum format, e.g. `memory_poisoning: { payload: "...", strategy: "append" }`). Each scenario in `contract.chaos_matrix` can also define its own `context_attacks`. See [Context Attacks](CONTEXT_ATTACKS.md).

-All v1.0 options remain valid; v2.0 blocks are optional and additive.
+All v1.0 options remain valid; v2.0 blocks are optional and additive. Implementation status: all V2 gaps are closed (see [GAP_VERIFICATION](GAP_VERIFICATION.md)). Mutations: **22+ types**, **max 50 per run** in OSS.

 ---

--- a/docs/CONNECTION_GUIDE.md
+++ b/docs/CONNECTION_GUIDE.md
@@ -44,6 +44,8 @@ This guide explains how to connect FlakeStorm to your agent, covering different

 **Note:** Native CI/CD integrations (scheduled runs, pipeline plugins) are **Cloud only**. OSS users run `flakestorm ci` from their own scripts or job runners.

+**V2 — API keys:** When using cloud LLM providers (OpenAI, Anthropic, Google) for mutation generation or agent backends, API keys must be set via **environment variables only** (e.g. `OPENAI_API_KEY`). Reference them in config as `api_key: "${OPENAI_API_KEY}"`. Do not put literal keys in config files. See [LLM Providers](LLM_PROVIDERS.md).
+
 ---

 ## Internal Code Options
--- a/docs/DEVELOPER_FAQ.md
+++ b/docs/DEVELOPER_FAQ.md
@@ -7,14 +7,15 @@ This document answers common questions developers might have about the flakestor
 ## Table of Contents

 1. [Architecture Questions](#architecture-questions)
-2. [Configuration System](#configuration-system)
-3. [Mutation Engine](#mutation-engine)
-4. [Assertion System](#assertion-system)
-5. [Performance & Rust](#performance--rust)
-6. [Agent Adapters](#agent-adapters)
-7. [Testing & Quality](#testing--quality)
-8. [Extending flakestorm](#extending-flakestorm)
-9. [Common Issues](#common-issues)
+2. [V2 and Documentation](#v2-and-documentation)
+3. [Configuration System](#configuration-system)
+4. [Mutation Engine](#mutation-engine)
+5. [Assertion System](#assertion-system)
+6. [Performance & Rust](#performance--rust)
+7. [Agent Adapters](#agent-adapters)
+8. [Testing & Quality](#testing--quality)
+9. [Extending flakestorm](#extending-flakestorm)
+10. [Common Issues](#common-issues)

 ---

@@ -77,6 +78,39 @@ This separation allows:

 ---

+## V2 and Documentation
+
+### Q: What is V2 and where is it documented?
+
+**A:** **V2** (`version: "2.0"` in config) adds three chaos-engineering pillars and a unified score. All gaps from the V2 PRD are closed (see [GAP_VERIFICATION](GAP_VERIFICATION.md)). Authoritative references:
+
+| Topic | Document |
+|-------|----------|
+| Spec clarifications (reset, behavior_unchanged, probes, scoring) | [V2_SPEC](V2_SPEC.md) |
+| Environment chaos (tool/LLM faults, profiles, response_drift) | [ENVIRONMENT_CHAOS](ENVIRONMENT_CHAOS.md) |
+| Behavioral contracts (chaos_matrix, invariants, reset_endpoint/reset_function) | [BEHAVIORAL_CONTRACTS](BEHAVIORAL_CONTRACTS.md) |
+| Replay regression (sessions, LangSmith, contract resolution) | [REPLAY_REGRESSION](REPLAY_REGRESSION.md) |
+| Context attacks (memory_poisoning, system_prompt_leak_probe, list/dict config) | [CONTEXT_ATTACKS](CONTEXT_ATTACKS.md) |
+| LLM providers and API keys (env-only) | [LLM_PROVIDERS](LLM_PROVIDERS.md) |
+
+### Q: How do chaos, contract, and replay fit into the codebase?
+
+**A:** In V2:
+
+- **Chaos:** `chaos/` (interceptor, tool_proxy, llm_proxy, faults, profiles). The runner wraps the agent with `ChaosInterceptor` when `--chaos` or `--chaos-only` is used. Tool faults apply by `match_url` or `tool: "*"`; LLM faults (truncated, empty, garbage, rate_limit, response_drift) are applied in the interceptor.
+- **Contract:** `contracts/` (engine, matrix). When config has `contract` + `chaos_matrix`, `FlakeStormRunner.run()` runs the contract engine: resets between cells (if `reset_endpoint`/`reset_function`), runs invariants (including `behavior_unchanged` and probes for system_prompt_leak), and attaches `contract_compliance` to results. Scoring uses severity weights; any critical failure → FAIL.
+- **Replay:** `replay/` (loader, runner). Sessions loaded from file or inline (or LangSmith); contract resolved by name or path. `flakestorm replay run [path]` replays and verifies against the contract; reports include suggested actions for failed sessions.
+
+### Q: Why must API keys be environment variables only in V2?
+
+**A:** Security: literal API keys in config files get committed to version control. V2 validates at load time and fails with a clear message if a literal key is detected. Use `api_key: "${OPENAI_API_KEY}"` (and set the variable in the environment or CI secrets).
+
+### Q: What does `flakestorm ci` run?
+
+**A:** It runs, in order: (1) mutation run (with chaos if configured), (2) contract run if `contract` + `chaos_matrix` are configured, (3) chaos-only run if chaos is configured, (4) replay run if `replays` is configured. Then it computes an **overall weighted score** from `scoring.weights` (mutation, chaos, contract, replay); weights must sum to 1.0. Default weights: mutation 0.20, chaos 0.35, contract 0.35, replay 0.10.
+
+---
+
 ## Configuration System

 ### Q: Why Pydantic instead of dataclasses or attrs?
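Note on the `flakestorm ci` answer added to `docs/DEVELOPER_FAQ.md` above: the overall weighted score it describes can be illustrated with a short sketch. The default weights and the sum-to-1.0 rule come from the FAQ text; the function name `overall_score`, the requirement that every weighted pillar has a score, and the error handling are assumptions, not the project's implementation.

```python
import math

# Default weights as stated in the FAQ text.
DEFAULT_WEIGHTS = {"mutation": 0.20, "chaos": 0.35, "contract": 0.35, "replay": 0.10}


def overall_score(scores, weights=None):
    """Combine per-pillar scores (each in 0..1) into one weighted score (hypothetical helper)."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("scoring.weights must sum to 1.0")
    missing = set(weights) - set(scores)
    if missing:
        # Assumption: every weighted pillar must have produced a score.
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)
```

For example, pillar scores of 0.9 (mutation), 0.8 (chaos), 1.0 (contract), and 0.5 (replay) combine under the default weights to about 0.86. How the real command treats pillars that are not configured is not stated in the diff, so this sketch simply rejects that case.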
--- a/docs/TESTING_GUIDE.md
+++ b/docs/TESTING_GUIDE.md
@@ -8,12 +8,13 @@ This guide explains how to run, write, and expand tests for flakestorm. It cover

 1. [Running Tests](#running-tests)
 2. [Test Structure](#test-structure)
-3. [Writing Tests: Agent Adapters](#writing-tests-agent-adapters)
-4. [Writing Tests: Orchestrator](#writing-tests-orchestrator)
-5. [Writing Tests: Report Generation](#writing-tests-report-generation)
-6. [Integration Tests](#integration-tests)
-7. [CLI Tests](#cli-tests)
-8. [Test Fixtures](#test-fixtures)
+3. [V2 Integration Tests](#v2-integration-tests)
+4. [Writing Tests: Agent Adapters](#writing-tests-agent-adapters)
+5. [Writing Tests: Orchestrator](#writing-tests-orchestrator)
+6. [Writing Tests: Report Generation](#writing-tests-report-generation)
+7. [Integration Tests](#integration-tests)
+8. [CLI Tests](#cli-tests)
+9. [Test Fixtures](#test-fixtures)

 ---

@@ -67,6 +68,9 @@ pytest tests/test_performance.py

 # Integration tests (requires Ollama)
 pytest tests/test_integration.py
+
+# V2 integration tests (chaos, contract, replay)
+pytest tests/test_chaos_integration.py tests/test_contract_integration.py tests/test_replay_integration.py
 ```

 ---
@@ -81,15 +85,32 @@ tests/
 ├── test_mutations.py              # Mutation engine tests
 ├── test_assertions.py             # Assertion checkers tests
 ├── test_performance.py            # Rust/Python bridge tests
-├── test_adapters.py      # Agent adapter tests (TO CREATE)
-├── test_orchestrator.py  # Orchestrator tests (TO CREATE)
-├── test_reports.py       # Report generation tests (TO CREATE)
-├── test_cli.py           # CLI command tests (TO CREATE)
-└── test_integration.py   # Full integration tests (TO CREATE)
+├── test_adapters.py               # Agent adapter tests
+├── test_orchestrator.py           # Orchestrator tests
+├── test_reports.py                # Report generation tests
+├── test_cli.py                    # CLI command tests
+├── test_integration.py            # Full integration tests
+├── test_chaos_integration.py      # V2: chaos (tool/LLM faults, interceptor)
+├── test_contract_integration.py   # V2: contract (N×M matrix, score, critical fail)
+└── test_replay_integration.py     # V2: replay (session → replay → pass/fail)
 ```

 ---

+## V2 Integration Tests
+
+V2 adds three integration test modules; all gaps are closed (see [GAP_VERIFICATION](GAP_VERIFICATION.md)).
+
+| Module | What it tests |
+|--------|----------------|
+| `test_chaos_integration.py` | Chaos interceptor, tool faults (match_url/tool *), LLM faults (truncated, empty, garbage, rate_limit, response_drift). |
+| `test_contract_integration.py` | Contract engine: invariants × chaos matrix, reset between cells, resilience score (severity-weighted), critical failure → FAIL. |
+| `test_replay_integration.py` | Replay loader (file/format), ReplayRunner verification against contract, contract resolution by name/path. |
+
+For CI pipelines that use V2, run the full suite including these; `flakestorm ci` runs mutation, contract (if configured), chaos-only (if configured), and replay (if configured), then computes the overall weighted score from `scoring.weights`.
+
+---
+
 ## Writing Tests: Agent Adapters

 ### Location: `tests/test_adapters.py`
--- a/docs/TEST_SCENARIOS.md
+++ b/docs/TEST_SCENARIOS.md
@@ -1,6 +1,6 @@
 # Real-World Test Scenarios

-This document provides concrete, real-world examples of testing AI agents with flakestorm: environment chaos (tool/LLM faults), behavioral contracts (invariants × chaos matrix), replay regression, and adversarial mutations. Each scenario includes setup, config, and commands where applicable. Flakestorm supports **24 mutation types** and **max 50 mutations per run** in OSS. See [Configuration Guide](CONFIGURATION_GUIDE.md), [Spec](V2_SPEC.md), and [Audit](V2_AUDIT.md).
+This document provides concrete, real-world examples of testing AI agents with flakestorm: environment chaos (tool/LLM faults), behavioral contracts (invariants × chaos matrix), replay regression, and adversarial mutations. Each scenario includes setup, config, and commands where applicable. Flakestorm supports **22+ mutation types** and **max 50 mutations per run** in OSS. See [Configuration Guide](CONFIGURATION_GUIDE.md), [Spec](V2_SPEC.md), and [Audit](V2_AUDIT.md).

 ---

--- a/docs/USAGE_GUIDE.md
+++ b/docs/USAGE_GUIDE.md
@@ -28,7 +28,7 @@ This comprehensive guide walks you through using flakestorm to test your AI agen
 flakestorm is an **adversarial testing framework** and **chaos engineering platform** for AI agents. It applies chaos engineering principles to systematically test how your AI agents behave under unexpected, malformed, or adversarial inputs.

 - **V1** (`version: "1.0"` or omitted): Mutation-only mode — golden prompts → mutation engine → agent → invariants → **robustness score**. Ideal for quick adversarial input testing.
-- **V2** (`version: "2.0"` in config): Full chaos platform — **Environment Chaos** (tool/LLM faults, context attacks), **Behavioral Contracts** (invariants × chaos matrix with per-cell isolation), and **Replay Regression** (replay production incidents). You get **24 mutation types** and **max 50 mutations per run** in OSS; plus `flakestorm run --chaos`, `flakestorm contract run`, `flakestorm replay run`, and `flakestorm ci` for a unified **resilience score**. API keys for cloud LLM providers must be set via environment variables only (e.g. `api_key: "${OPENAI_API_KEY}"`). See [Configuration Guide](CONFIGURATION_GUIDE.md), [V2 Spec](V2_SPEC.md), and [V2 Audit](V2_AUDIT.md).
+- **V2** (`version: "2.0"` in config): Full chaos platform — **Environment Chaos** (tool/LLM faults, context attacks), **Behavioral Contracts** (invariants × chaos matrix with per-cell isolation), and **Replay Regression** (replay production incidents). You get **22+ mutation types** and **max 50 mutations per run** in OSS; plus `flakestorm run --chaos`, `flakestorm contract run`, `flakestorm replay run`, and `flakestorm ci` for a unified **resilience score**. API keys for cloud LLM providers must be set via environment variables only (e.g. `api_key: "${OPENAI_API_KEY}"`). See [Configuration Guide](CONFIGURATION_GUIDE.md), [V2 Spec](V2_SPEC.md), and [GAP_VERIFICATION](GAP_VERIFICATION.md).

 ### Why Use flakestorm?

@@ -54,7 +54,7 @@ With a V1 config (or V2 config without `--chaos`), you get the classic adversari
 ├─────────────────────────────────────────────────────────────────┤
 │  1. GOLDEN PROMPTS  →  2. MUTATION ENGINE (Local LLM)            │
 │     "Book a flight"       → Mutated prompts (typos, paraphrases,  │
-│                            injections, encoding, etc. — 24 types)│
+│                            injections, encoding, etc. — 22+ types)│
 │                                        ↓                         │
 │  3. YOUR AGENT  ←  Test Runner sends each mutated prompt         │
 │     (HTTP/Python)       ↓                                         │
@@ -71,13 +71,15 @@ With **`version: "2.0"`** in your config, Flakestorm adds environment chaos, beh

 | Pillar | What runs | Score / output |
 |--------|-----------|----------------|
-| **Mutation run** | Golden prompts → 24 mutation types → agent → invariants | **Robustness score** (0–1). Use `flakestorm run` or `flakestorm run --chaos` to include chaos. |
+| **Mutation run** | Golden prompts → 22+ mutation types → agent → invariants | **Robustness score** (0–1). Use `flakestorm run` or `flakestorm run --chaos` to include chaos. |
 | **Environment chaos** | Fault injection into tools and LLM (timeouts, errors, rate limits, malformed responses, context attacks) | **Chaos resilience** (0–1). Use `flakestorm run --chaos` (with mutations) or `flakestorm run --chaos --chaos-only` (no mutations). |
 | **Behavioral contracts** | Contracts (invariants × severity) × chaos matrix scenarios; each cell is an independent run (optional reset per cell). | **Resilience score** (0–100%). Use `flakestorm contract run`. Per-contract formula: weighted by severity (critical×3, high×2, medium×1); **auto-FAIL** if any critical fails. |
 | **Replay regression** | Replay saved sessions (e.g. production incidents) and verify against a contract. | Per-session pass/fail; **replay regression** score when run via CI. Use `flakestorm replay run [path]`. |

 **Unified CI:** `flakestorm ci` runs mutation run, contract run (if configured), chaos-only run (if chaos configured), and all replay sessions; then computes an **overall resilience score** from `scoring.weights` (default: mutation 0.20, chaos 0.35, contract 0.35, replay 0.10). Weights must sum to 1.0.

+**Reports:** Use `flakestorm contract run --output report.html` and `flakestorm replay run --output report.html` to save HTML reports; both include **suggested actions** for failed cells or sessions (e.g. add reset_endpoint, tighten invariants). Replay accepts a single session file or a directory: `flakestorm replay run path/to/session.yaml` or `flakestorm replay run path/to/replays/`.
+
 **Contract matrix isolation (V2):** Each (invariant × scenario) cell is independent. Configure `agent.reset_endpoint` (HTTP) or `agent.reset_function` (Python) to clear agent state between cells; if not set and the agent is stateful, Flakestorm warns. See [V2 Spec — Contract matrix isolation](V2_SPEC.md#contract-matrix-isolation).

 ---
@@ -819,7 +821,7 @@ golden_prompts:

 ### Mutation Types

-flakestorm generates adversarial variations of your golden prompts across 24 mutation types organized into categories:
+flakestorm generates adversarial variations of your golden prompts across 22+ mutation types organized into categories:

 #### Prompt-Level Attacks

@@ -1121,7 +1123,7 @@ flakestorm provides 22+ mutation types organized into **Prompt-Level Attacks** a
 ### Choosing Mutation Types

 **Comprehensive Testing (Recommended):**
-Use all 24 types for complete coverage:
+Use all 22+ types for complete coverage:
 ```yaml
 types:
  # Original 8 types
@@ -1221,7 +1223,7 @@ The 22+ mutation types work together to provide comprehensive robustness testing
 - **Infrastructure**: HTTP Header Injection, Payload Size Attack, Content-Type Confusion, Query Parameter Poisoning, Request Method Attack, Protocol-Level Attack, Resource Exhaustion, Concurrent Request Pattern, Timeout Manipulation
 - **Temporal/Context**: Temporal Attack, Multi-Turn Attack

-For comprehensive testing, use all 24 types. For focused testing:
+For comprehensive testing, use all 22+ types. For focused testing:
 - **Security-focused**: Emphasize Prompt Injection, Advanced Jailbreak, Protocol-Level Attack, HTTP Header Injection
 - **UX-focused**: Emphasize Noise, Tone Shift, Context Manipulation, Language Mixing
 - **Infrastructure-focused**: Emphasize all system/network-level types
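Note on the behavioral-contract row in the `docs/USAGE_GUIDE.md` pillar table above: the severity weighting it mentions (critical×3, high×2, medium×1, auto-FAIL on any critical failure) might be computed along these lines. This is a sketch under stated assumptions. The docs give the weights and the auto-FAIL rule but not the exact formula, so the "passed weight divided by total weight" ratio, the function name `contract_score`, and the `(severity, passed)` input shape are all illustrative guesses.

```python
# Severity weights as stated in the usage guide table.
SEVERITY_WEIGHTS = {"critical": 3, "high": 2, "medium": 1}


def contract_score(cells):
    """Return (score_percent, passed) for a list of (severity, passed) cells.

    Assumed formula: weight of passing cells / weight of all cells, as a percentage.
    """
    if not cells:
        raise ValueError("at least one contract cell is required")
    total = sum(SEVERITY_WEIGHTS[severity] for severity, _ in cells)
    earned = sum(SEVERITY_WEIGHTS[severity] for severity, ok in cells if ok)
    # Auto-FAIL: any failed critical cell fails the run regardless of the score.
    critical_failed = any(severity == "critical" and not ok for severity, ok in cells)
    return 100.0 * earned / total, not critical_failed
```

With one passing critical cell, one failing high cell, and one passing medium cell, the sketch yields roughly 66.7% and a passing verdict; if the critical cell fails instead, the verdict is FAIL whatever the percentage.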