diff --git a/docs/API_SPECIFICATION.md b/docs/API_SPECIFICATION.md
index 43c6379..83f0350 100644
--- a/docs/API_SPECIFICATION.md
+++ b/docs/API_SPECIFICATION.md
@@ -376,6 +376,11 @@ results.get_by_prompt("...")  # Filter by prompt
 
 # Serialization
 results.to_dict()  # Full JSON-serializable dict
+
+# V2: Resilience and contract/replay (when config has contract/replays)
+results.resilience_scores   # dict: mutation_robustness, chaos_resilience, contract_compliance, replay_regression
+results.contract_compliance # ContractRunResult | None (when contract run was executed)
+# Replay results are reported via flakestorm replay run --output; see Reports below.
 ```
 
 #### MutationResult
@@ -443,6 +448,8 @@ reporter.print_failures(limit=10)
 reporter.print_full_report()
 ```
 
+**V2 reports:** Contract runs (`flakestorm contract run --output report.html`) and replay runs (`flakestorm replay run --output report.html`) produce HTML reports that include **suggested actions** for failed cells or sessions (e.g. add `reset_endpoint`, tighten invariants, fix tool behavior). See [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md) and [Replay Regression](REPLAY_REGRESSION.md).
+
 ---
 
 ## CLI Commands
@@ -459,16 +466,19 @@ flakestorm init --force            # Overwrite existing
 
 ### `flakestorm run`
 
-Run reliability tests.
+Run reliability tests (mutation run; optionally with chaos).
 
 ```bash
-flakestorm run                              # Default config
+flakestorm run                              # Default config (mutation only)
 flakestorm run --config custom.yaml         # Custom config
-flakestorm run --output json                # JSON output
-flakestorm run --output terminal            # Terminal only
-flakestorm run --min-score 0.9 --ci         # CI mode
-flakestorm run --verify-only                # Just verify setup
-flakestorm run --quiet                      # Minimal output
+flakestorm run --chaos                       # Apply chaos (tool/LLM faults, context_attacks) during mutation run
+flakestorm run --chaos-only                  # Chaos-only run (no mutations); requires chaos config
+flakestorm run --chaos-profile api_outage    # Use a built-in chaos profile
+flakestorm run --output json                 # JSON output
+flakestorm run --output terminal             # Terminal only
+flakestorm run --min-score 0.9 --ci          # CI mode
+flakestorm run --verify-only                 # Just verify setup
+flakestorm run --quiet                       # Minimal output
 ```
 
 ### `flakestorm verify`
@@ -503,6 +513,38 @@ else
 fi
 ```
 
+### V2: `flakestorm contract run` / `validate` / `score`
+
+Run behavioral contract tests (invariants × chaos matrix).
+
+```bash
+flakestorm contract run                      # Run contract matrix; progress and score in terminal
+flakestorm contract run --output report.html  # Save HTML report with suggested actions for failed cells
+flakestorm contract validate                 # Validate contract config only
+flakestorm contract score                    # Output contract resilience score only
+```
+
+### V2: `flakestorm replay run` / `export`
+
+Replay regression: run saved sessions and verify against a contract.
+
+```bash
+flakestorm replay run                        # Replay sessions from config (file or inline)
+flakestorm replay run path/to/session.yaml   # Replay a single session file
+flakestorm replay run path/to/replays/       # Replay all sessions in directory
+flakestorm replay run --output report.html   # Save HTML report with suggested actions for failed sessions
+flakestorm replay export --from-report FILE  # Export from an existing report
+```
+
+### V2: `flakestorm ci`
+
+Run the full CI pipeline: mutation run, contract run (if configured), chaos-only run (if chaos is configured), and replay run (if configured); then compute the overall weighted score from `scoring.weights`.
+
+```bash
+flakestorm ci
+flakestorm ci --config custom.yaml
+```
+
 ---
 
 ## Environment Variables
@@ -512,6 +554,8 @@ fi
 | `OLLAMA_HOST` | Override Ollama server URL |
 | Custom headers | Expanded in config via `${VAR}` syntax |
 
+**V2 — API keys (env-only):** Model API keys must not appear as literal values in config. Set them as environment variables and reference them in config (e.g. `api_key: "${OPENAI_API_KEY}"`). Supported: `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, etc. See [LLM Providers](LLM_PROVIDERS.md).
+
 ---
 
 ## Exit Codes
diff --git a/docs/CONFIGURATION_GUIDE.md b/docs/CONFIGURATION_GUIDE.md
index 6256ece..209015b 100644
--- a/docs/CONFIGURATION_GUIDE.md
+++ b/docs/CONFIGURATION_GUIDE.md
@@ -53,7 +53,7 @@ With `version: "2.0"` you can add the three **chaos engineering pillars** and a
 
 **Context attacks** (chaos on tool/context or input before invoke, not the user prompt) are configured under `chaos.context_attacks`. You can use a **list** of attack configs or a **dict** (addendum format, e.g. `memory_poisoning: { payload: "...", strategy: "append" }`). Each scenario in `contract.chaos_matrix` can also define its own `context_attacks`. See [Context Attacks](CONTEXT_ATTACKS.md).
 
-All v1.0 options remain valid; v2.0 blocks are optional and additive.
+All v1.0 options remain valid; v2.0 blocks are optional and additive. Implementation status: all V2 gaps are closed (see [GAP_VERIFICATION](GAP_VERIFICATION.md)). Mutations: **22+ types**, **max 50 per run** in OSS.
 
 ---
 
diff --git a/docs/CONNECTION_GUIDE.md b/docs/CONNECTION_GUIDE.md
index 4ec1096..4138378 100644
--- a/docs/CONNECTION_GUIDE.md
+++ b/docs/CONNECTION_GUIDE.md
@@ -44,6 +44,8 @@ This guide explains how to connect FlakeStorm to your agent, covering different
 
 **Note:** Native CI/CD integrations (scheduled runs, pipeline plugins) are **Cloud only**. OSS users run `flakestorm ci` from their own scripts or job runners.
 
+**V2 — API keys:** When using cloud LLM providers (OpenAI, Anthropic, Google) for mutation generation or agent backends, API keys must be set via **environment variables only** (e.g. `OPENAI_API_KEY`). Reference them in config as `api_key: "${OPENAI_API_KEY}"`. Do not put literal keys in config files. See [LLM Providers](LLM_PROVIDERS.md).
+
 ---
 
 ## Internal Code Options
diff --git a/docs/DEVELOPER_FAQ.md b/docs/DEVELOPER_FAQ.md
index 147d37a..89ec0f1 100644
--- a/docs/DEVELOPER_FAQ.md
+++ b/docs/DEVELOPER_FAQ.md
@@ -7,14 +7,15 @@ This document answers common questions developers might have about the flakestor
 ## Table of Contents
 
 1. [Architecture Questions](#architecture-questions)
-2. [Configuration System](#configuration-system)
-3. [Mutation Engine](#mutation-engine)
-4. [Assertion System](#assertion-system)
-5. [Performance & Rust](#performance--rust)
-6. [Agent Adapters](#agent-adapters)
-7. [Testing & Quality](#testing--quality)
-8. [Extending flakestorm](#extending-flakestorm)
-9. [Common Issues](#common-issues)
+2. [V2 and Documentation](#v2-and-documentation)
+3. [Configuration System](#configuration-system)
+4. [Mutation Engine](#mutation-engine)
+5. [Assertion System](#assertion-system)
+6. [Performance & Rust](#performance--rust)
+7. [Agent Adapters](#agent-adapters)
+8. [Testing & Quality](#testing--quality)
+9. [Extending flakestorm](#extending-flakestorm)
+10. [Common Issues](#common-issues)
 
 ---
 
@@ -77,6 +78,39 @@ This separation allows:
 
 ---
 
+## V2 and Documentation
+
+### Q: What is V2 and where is it documented?
+
+**A:** **V2** (`version: "2.0"` in config) adds three chaos-engineering pillars and a unified score. All gaps from the V2 PRD are closed (see [GAP_VERIFICATION](GAP_VERIFICATION.md)). Authoritative references:
+
+| Topic | Document |
+|-------|----------|
+| Spec clarifications (reset, behavior_unchanged, probes, scoring) | [V2_SPEC](V2_SPEC.md) |
+| Environment chaos (tool/LLM faults, profiles, response_drift) | [ENVIRONMENT_CHAOS](ENVIRONMENT_CHAOS.md) |
+| Behavioral contracts (chaos_matrix, invariants, reset_endpoint/reset_function) | [BEHAVIORAL_CONTRACTS](BEHAVIORAL_CONTRACTS.md) |
+| Replay regression (sessions, LangSmith, contract resolution) | [REPLAY_REGRESSION](REPLAY_REGRESSION.md) |
+| Context attacks (memory_poisoning, system_prompt_leak_probe, list/dict config) | [CONTEXT_ATTACKS](CONTEXT_ATTACKS.md) |
+| LLM providers and API keys (env-only) | [LLM_PROVIDERS](LLM_PROVIDERS.md) |
+
+### Q: How do chaos, contract, and replay fit into the codebase?
+
+**A:** In V2:
+
+- **Chaos:** `chaos/` (interceptor, tool_proxy, llm_proxy, faults, profiles). The runner wraps the agent with `ChaosInterceptor` when `--chaos` or `--chaos-only` is used. Tool faults apply by `match_url` or `tool: "*"`; LLM faults (truncated, empty, garbage, rate_limit, response_drift) are applied in the interceptor.
+- **Contract:** `contracts/` (engine, matrix). When config has `contract` + `chaos_matrix`, `FlakeStormRunner.run()` runs the contract engine: it resets agent state between cells (if `reset_endpoint` or `reset_function` is set), runs invariants (including `behavior_unchanged` and probes for system_prompt_leak), and attaches `contract_compliance` to results. Scoring uses severity weights; any critical failure → FAIL.
+- **Replay:** `replay/` (loader, runner). Sessions are loaded from a file, from inline config, or from LangSmith; the contract is resolved by name or path. `flakestorm replay run [path]` replays the sessions and verifies them against the contract; reports include suggested actions for failed sessions.
+
+### Q: Why must API keys be environment variables only in V2?
+
+**A:** Security: literal API keys in config files are easily committed to version control by accident. V2 validates the config at load time and fails with a clear message if a literal key is detected. Use `api_key: "${OPENAI_API_KEY}"` and set the variable in the environment or in CI secrets.
+
+### Q: What does `flakestorm ci` run?
+
+**A:** It runs, in order: (1) mutation run (with chaos if configured), (2) contract run if `contract` + `chaos_matrix` are configured, (3) chaos-only run if chaos is configured, (4) replay run if `replays` is configured. Then it computes an **overall weighted score** from `scoring.weights` (mutation, chaos, contract, replay); weights must sum to 1.0. Default weights: mutation 0.20, chaos 0.35, contract 0.35, replay 0.10.
+
+---
+
 ## Configuration System
 
 ### Q: Why Pydantic instead of dataclasses or attrs?
diff --git a/docs/TESTING_GUIDE.md b/docs/TESTING_GUIDE.md
index d413d3f..f301d64 100644
--- a/docs/TESTING_GUIDE.md
+++ b/docs/TESTING_GUIDE.md
@@ -8,12 +8,13 @@ This guide explains how to run, write, and expand tests for flakestorm. It cover
 
 1. [Running Tests](#running-tests)
 2. [Test Structure](#test-structure)
-3. [Writing Tests: Agent Adapters](#writing-tests-agent-adapters)
-4. [Writing Tests: Orchestrator](#writing-tests-orchestrator)
-5. [Writing Tests: Report Generation](#writing-tests-report-generation)
-6. [Integration Tests](#integration-tests)
-7. [CLI Tests](#cli-tests)
-8. [Test Fixtures](#test-fixtures)
+3. [V2 Integration Tests](#v2-integration-tests)
+4. [Writing Tests: Agent Adapters](#writing-tests-agent-adapters)
+5. [Writing Tests: Orchestrator](#writing-tests-orchestrator)
+6. [Writing Tests: Report Generation](#writing-tests-report-generation)
+7. [Integration Tests](#integration-tests)
+8. [CLI Tests](#cli-tests)
+9. [Test Fixtures](#test-fixtures)
 
 ---
 
@@ -67,6 +68,9 @@ pytest tests/test_performance.py
 
 # Integration tests (requires Ollama)
 pytest tests/test_integration.py
+
+# V2 integration tests (chaos, contract, replay)
+pytest tests/test_chaos_integration.py tests/test_contract_integration.py tests/test_replay_integration.py
 ```
 
 ---
@@ -76,20 +80,37 @@ pytest tests/test_integration.py
 ```
 tests/
 ├── __init__.py
-├── conftest.py           # Shared fixtures
-├── test_config.py        # Configuration loading tests
-├── test_mutations.py     # Mutation engine tests
-├── test_assertions.py    # Assertion checkers tests
-├── test_performance.py   # Rust/Python bridge tests
-├── test_adapters.py      # Agent adapter tests (TO CREATE)
-├── test_orchestrator.py  # Orchestrator tests (TO CREATE)
-├── test_reports.py       # Report generation tests (TO CREATE)
-├── test_cli.py           # CLI command tests (TO CREATE)
-└── test_integration.py   # Full integration tests (TO CREATE)
+├── conftest.py                    # Shared fixtures
+├── test_config.py                 # Configuration loading tests
+├── test_mutations.py              # Mutation engine tests
+├── test_assertions.py             # Assertion checkers tests
+├── test_performance.py            # Rust/Python bridge tests
+├── test_adapters.py               # Agent adapter tests
+├── test_orchestrator.py           # Orchestrator tests
+├── test_reports.py                # Report generation tests
+├── test_cli.py                    # CLI command tests
+├── test_integration.py            # Full integration tests
+├── test_chaos_integration.py      # V2: chaos (tool/LLM faults, interceptor)
+├── test_contract_integration.py   # V2: contract (N×M matrix, score, critical fail)
+└── test_replay_integration.py     # V2: replay (session → replay → pass/fail)
 ```
 
 ---
 
+## V2 Integration Tests
+
+V2 adds three integration test modules; all gaps are closed (see [GAP_VERIFICATION](GAP_VERIFICATION.md)).
+
+| Module | What it tests |
+|--------|----------------|
+| `test_chaos_integration.py` | Chaos interceptor, tool faults (`match_url` or `tool: "*"`), LLM faults (truncated, empty, garbage, rate_limit, response_drift). |
+| `test_contract_integration.py` | Contract engine: invariants × chaos matrix, reset between cells, resilience score (severity-weighted), critical failure → FAIL. |
+| `test_replay_integration.py` | Replay loader (file/format), ReplayRunner verification against contract, contract resolution by name/path. |
+
+For CI pipelines that use V2, run the full test suite, including these three modules. `flakestorm ci` runs mutation, contract (if configured), chaos-only (if configured), and replay (if configured), then computes the overall weighted score from `scoring.weights`.
+
+---
+
 ## Writing Tests: Agent Adapters
 
 ### Location: `tests/test_adapters.py`
diff --git a/docs/TEST_SCENARIOS.md b/docs/TEST_SCENARIOS.md
index 7e4ae4c..05ef783 100644
--- a/docs/TEST_SCENARIOS.md
+++ b/docs/TEST_SCENARIOS.md
@@ -1,6 +1,6 @@
 # Real-World Test Scenarios
 
-This document provides concrete, real-world examples of testing AI agents with flakestorm: environment chaos (tool/LLM faults), behavioral contracts (invariants × chaos matrix), replay regression, and adversarial mutations. Each scenario includes setup, config, and commands where applicable. Flakestorm supports **24 mutation types** and **max 50 mutations per run** in OSS. See [Configuration Guide](CONFIGURATION_GUIDE.md), [Spec](V2_SPEC.md), and [Audit](V2_AUDIT.md).
+This document provides concrete, real-world examples of testing AI agents with flakestorm: environment chaos (tool/LLM faults), behavioral contracts (invariants × chaos matrix), replay regression, and adversarial mutations. Each scenario includes setup, config, and commands where applicable. Flakestorm supports **22+ mutation types** and **max 50 mutations per run** in OSS. See [Configuration Guide](CONFIGURATION_GUIDE.md), [Spec](V2_SPEC.md), and [Audit](V2_AUDIT.md).
 
 ---
 
diff --git a/docs/USAGE_GUIDE.md b/docs/USAGE_GUIDE.md
index 6a73911..19207e4 100644
--- a/docs/USAGE_GUIDE.md
+++ b/docs/USAGE_GUIDE.md
@@ -28,7 +28,7 @@ This comprehensive guide walks you through using flakestorm to test your AI agen
 flakestorm is an **adversarial testing framework** and **chaos engineering platform** for AI agents. It applies chaos engineering principles to systematically test how your AI agents behave under unexpected, malformed, or adversarial inputs.
 
 - **V1** (`version: "1.0"` or omitted): Mutation-only mode — golden prompts → mutation engine → agent → invariants → **robustness score**. Ideal for quick adversarial input testing.
-- **V2** (`version: "2.0"` in config): Full chaos platform — **Environment Chaos** (tool/LLM faults, context attacks), **Behavioral Contracts** (invariants × chaos matrix with per-cell isolation), and **Replay Regression** (replay production incidents). You get **24 mutation types** and **max 50 mutations per run** in OSS; plus `flakestorm run --chaos`, `flakestorm contract run`, `flakestorm replay run`, and `flakestorm ci` for a unified **resilience score**. API keys for cloud LLM providers must be set via environment variables only (e.g. `api_key: "${OPENAI_API_KEY}"`). See [Configuration Guide](CONFIGURATION_GUIDE.md), [V2 Spec](V2_SPEC.md), and [V2 Audit](V2_AUDIT.md).
+- **V2** (`version: "2.0"` in config): Full chaos platform — **Environment Chaos** (tool/LLM faults, context attacks), **Behavioral Contracts** (invariants × chaos matrix with per-cell isolation), and **Replay Regression** (replay production incidents). You get **22+ mutation types** and **max 50 mutations per run** in OSS; plus `flakestorm run --chaos`, `flakestorm contract run`, `flakestorm replay run`, and `flakestorm ci` for a unified **resilience score**. API keys for cloud LLM providers must be set via environment variables only (e.g. `api_key: "${OPENAI_API_KEY}"`). See [Configuration Guide](CONFIGURATION_GUIDE.md), [V2 Spec](V2_SPEC.md), and [GAP_VERIFICATION](GAP_VERIFICATION.md).
 
 ### Why Use flakestorm?
 
@@ -54,7 +54,7 @@ With a V1 config (or V2 config without `--chaos`), you get the classic adversari
 ├─────────────────────────────────────────────────────────────────┤
 │  1. GOLDEN PROMPTS  →  2. MUTATION ENGINE (Local LLM)            │
 │     "Book a flight"       → Mutated prompts (typos, paraphrases,  │
-│                            injections, encoding, etc. — 24 types)│
+│                           injections, encoding, etc. — 22+ types)│
 │                                        ↓                         │
 │  3. YOUR AGENT  ←  Test Runner sends each mutated prompt         │
 │     (HTTP/Python)       ↓                                         │
@@ -71,13 +71,15 @@ With **`version: "2.0"`** in your config, Flakestorm adds environment chaos, beh
 
 | Pillar | What runs | Score / output |
 |--------|-----------|----------------|
-| **Mutation run** | Golden prompts → 24 mutation types → agent → invariants | **Robustness score** (0–1). Use `flakestorm run` or `flakestorm run --chaos` to include chaos. |
+| **Mutation run** | Golden prompts → 22+ mutation types → agent → invariants | **Robustness score** (0–1). Use `flakestorm run` or `flakestorm run --chaos` to include chaos. |
 | **Environment chaos** | Fault injection into tools and LLM (timeouts, errors, rate limits, malformed responses, context attacks) | **Chaos resilience** (0–1). Use `flakestorm run --chaos` (with mutations) or `flakestorm run --chaos --chaos-only` (no mutations). |
 | **Behavioral contracts** | Contracts (invariants × severity) × chaos matrix scenarios; each cell is an independent run (optional reset per cell). | **Resilience score** (0–100%). Use `flakestorm contract run`. Per-contract formula: weighted by severity (critical×3, high×2, medium×1); **auto-FAIL** if any critical fails. |
 | **Replay regression** | Replay saved sessions (e.g. production incidents) and verify against a contract. | Per-session pass/fail; **replay regression** score when run via CI. Use `flakestorm replay run [path]`. |
 
 **Unified CI:** `flakestorm ci` runs mutation run, contract run (if configured), chaos-only run (if chaos configured), and all replay sessions; then computes an **overall resilience score** from `scoring.weights` (default: mutation 0.20, chaos 0.35, contract 0.35, replay 0.10). Weights must sum to 1.0.
 
+**Reports:** Use `flakestorm contract run --output report.html` and `flakestorm replay run --output report.html` to save HTML reports; both include **suggested actions** for failed cells or sessions (e.g. add `reset_endpoint`, tighten invariants). Replay accepts a single session file or a directory: `flakestorm replay run path/to/session.yaml` or `flakestorm replay run path/to/replays/`.
+
 **Contract matrix isolation (V2):** Each (invariant × scenario) cell is independent. Configure `agent.reset_endpoint` (HTTP) or `agent.reset_function` (Python) to clear agent state between cells; if not set and the agent is stateful, Flakestorm warns. See [V2 Spec — Contract matrix isolation](V2_SPEC.md#contract-matrix-isolation).
 
 ---
@@ -819,7 +821,7 @@ golden_prompts:
 
 ### Mutation Types
 
-flakestorm generates adversarial variations of your golden prompts across 24 mutation types organized into categories:
+flakestorm generates adversarial variations of your golden prompts across 22+ mutation types organized into categories:
 
 #### Prompt-Level Attacks
 
@@ -1121,7 +1123,7 @@ flakestorm provides 22+ mutation types organized into **Prompt-Level Attacks** a
 ### Choosing Mutation Types
 
 **Comprehensive Testing (Recommended):**
-Use all 24 types for complete coverage:
+Use all available types (22+) for complete coverage:
 ```yaml
 types:
   # Original 8 types
@@ -1221,7 +1223,7 @@ The 22+ mutation types work together to provide comprehensive robustness testing
 - **Infrastructure**: HTTP Header Injection, Payload Size Attack, Content-Type Confusion, Query Parameter Poisoning, Request Method Attack, Protocol-Level Attack, Resource Exhaustion, Concurrent Request Pattern, Timeout Manipulation
 - **Temporal/Context**: Temporal Attack, Multi-Turn Attack
 
-For comprehensive testing, use all 24 types. For focused testing:
+For comprehensive testing, use all available types (22+). For focused testing:
 - **Security-focused**: Emphasize Prompt Injection, Advanced Jailbreak, Protocol-Level Attack, HTTP Header Injection
 - **UX-focused**: Emphasize Noise, Tone Shift, Context Manipulation, Language Mixing
 - **Infrastructure-focused**: Emphasize all system/network-level types