diff --git a/.gitignore b/.gitignore
index 4f74d6d..426c8cb 100644
--- a/.gitignore
+++ b/.gitignore
@@ -127,4 +127,3 @@ docs/*
 !docs/CONTRIBUTING.md
 !docs/API_SPECIFICATION.md
 !docs/TESTING_GUIDE.md
-!docs/IMPLEMENTATION_CHECKLIST.md
diff --git a/README.md b/README.md
index f8e299d..b2a6462 100644
--- a/README.md
+++ b/README.md
@@ -152,7 +152,7 @@ For the full **V1 vs V2 flow** (mutation-only vs four pillars, contract matrix i
 
 ### Supporting capabilities
 
-- **Adversarial mutations** — 24 mutation types (prompt-level and system/network-level); max 50 mutations per run in OSS. [→ Test Scenarios](docs/TEST_SCENARIOS.md)
+- **Adversarial mutations** — 24 mutation types (prompt-level and system/network-level); max 50 mutations per run in OSS. [→ Test Scenarios](docs/TEST_SCENARIOS.md) (mutation, chaos, contract, and replay examples).
 - **Invariants & assertions** — Deterministic checks, semantic similarity, safety (PII, refusal); configurable per contract.
 - **Robustness score** — For mutation runs: a single weighted score (0–1) of how well the agent handled adversarial prompts. Reported in HTML/JSON and CLI (`results.statistics.robustness_score`).
 - **Unified resilience score** — For full CI: weighted combination of **mutation robustness**, chaos resilience, contract compliance, and replay regression; weights (mutation, chaos, contract, replay) configurable in YAML and must sum to 1.0.
@@ -229,7 +229,7 @@ See [Roadmap](ROADMAP.md) for the full plan. Highlights:
 - [📖 Usage Guide](docs/USAGE_GUIDE.md) - Complete end-to-end guide (includes local setup)
 - [⚙️ Configuration Guide](docs/CONFIGURATION_GUIDE.md) - All configuration options
 - [🔌 Connection Guide](docs/CONNECTION_GUIDE.md) - How to connect FlakeStorm to your agent
-- [🧪 Test Scenarios](docs/TEST_SCENARIOS.md) - Real-world examples with code
+- [🧪 Test Scenarios](docs/TEST_SCENARIOS.md) - Real-world examples for mutation, chaos, contract, and replay (V2)
 - [📂 Example: chaos, contracts & replay](examples/v2_research_agent/README.md) - Working agent and config you can run
 - [🔗 Integrations Guide](docs/INTEGRATIONS_GUIDE.md) - HuggingFace models & semantic similarity
 - [🤖 LLM Providers](docs/LLM_PROVIDERS.md) - OpenAI, Claude, Gemini (env-only API keys)
@@ -254,7 +254,6 @@ See [Roadmap](ROADMAP.md) for the full plan. Highlights:
 ### Reference
 - [📋 API Specification](docs/API_SPECIFICATION.md) - API reference
 - [🧪 Testing Guide](docs/TESTING_GUIDE.md) - How to run and write tests
-- [✅ Implementation Checklist](docs/IMPLEMENTATION_CHECKLIST.md) - Development progress
 
 ## Cloud Version (Early Access)
 
diff --git a/docs/IMPLEMENTATION_CHECKLIST.md b/docs/IMPLEMENTATION_CHECKLIST.md
deleted file mode 100644
index 1f6e148..0000000
--- a/docs/IMPLEMENTATION_CHECKLIST.md
+++ /dev/null
@@ -1,220 +0,0 @@
-# flakestorm Implementation Checklist
-
-This document tracks the implementation progress of flakestorm - The Agent Reliability Engine.
-
-## CLI Version (Open Source - Apache 2.0)
-
-### Phase 1: Foundation (Week 1-2)
-
-#### Project Scaffolding
-- [x] Initialize Python project with pyproject.toml
-- [x] Set up Rust workspace with Cargo.toml
-- [x] Create Apache 2.0 LICENSE file
-- [x] Write comprehensive README.md
-- [x] Create flakestorm.yaml.example template
-- [x] Set up project structure (src/flakestorm/*)
-- [x] Configure pre-commit hooks (black, ruff, mypy)
-
-#### Configuration System
-- [x] Define Pydantic models for configuration
-- [x] Implement YAML loading/validation
-- [x] Support environment variable expansion
-- [x] Create configuration factory functions
-- [x] Add configuration validation tests
-
-#### Agent Protocol/Adapter
-- [x] Define AgentProtocol interface
-- [x] Implement HTTPAgentAdapter
-- [x] Implement PythonAgentAdapter
-- [x] Implement LangChainAgentAdapter
-- [x] Create adapter factory function
-- [x] Add retry logic for HTTP adapter
-
----
-
-### Phase 2: Mutation Engine (Week 2-3)
-
-#### Ollama Integration
-- [x] Create MutationEngine class
-- [x] Implement Ollama client wrapper
-- [x] Add connection verification
-- [x] Support async mutation generation
-- [x] Implement batch generation
-
-#### Mutation Types & Templates
-- [x] Define MutationType enum
-- [x] Create Mutation dataclass
-- [x] Write templates for PARAPHRASE
-- [x] Write templates for NOISE
-- [x] Write templates for TONE_SHIFT
-- [x] Write templates for PROMPT_INJECTION
-- [x] Add mutation validation logic
-- [x] Support custom templates
-
-#### Rust Performance Bindings
-- [x] Set up PyO3 bindings
-- [x] Implement robustness score calculation
-- [x] Implement weighted score calculation
-- [x] Implement Levenshtein distance
-- [x] Implement parallel processing utilities
-- [x] Build and test Rust module
-- [x] Integrate with Python package
-
----
-
-### Phase 3: Runner & Assertions (Week 3-4)
-
-#### Async Runner
-- [x] Create FlakeStormRunner class
-- [x] Implement orchestrator logic
-- [x] Add concurrency control with semaphores
-- [x] Implement progress tracking
-- [x] Add setup verification
-
-#### Invariant System
-- [x] Create InvariantVerifier class
-- [x] Implement ContainsChecker
-- [x] Implement LatencyChecker
-- [x] Implement ValidJsonChecker
-- [x] Implement RegexChecker
-- [x] Implement SimilarityChecker
-- [x] Implement ExcludesPIIChecker
-- [x] Implement RefusalChecker
-- [x] Add checker registry
-
----
-
-### Phase 4: CLI & Reporting (Week 4-5)
-
-#### CLI Commands
-- [x] Set up Typer application
-- [x] Implement `flakestorm init` command
-- [x] Implement `flakestorm run` command
-- [x] Implement `flakestorm verify` command
-- [x] Implement `flakestorm report` command
-- [x] Implement `flakestorm score` command
-- [x] Add CI mode (--ci --min-score)
-- [x] Add rich progress bars
-
-#### Report Generation
-- [x] Create report data models
-- [x] Implement HTMLReportGenerator
-- [x] Create interactive HTML template
-- [x] Implement JSONReportGenerator
-- [x] Implement TerminalReporter
-- [x] Add score visualization
-- [x] Add mutation matrix view
-
----
-
-### Phase 5: V2 Features (Week 5-7)
-
-#### Environment Chaos & Context Attacks
-- [x] ChaosConfig (tool_faults, llm_faults, context_attacks as list or dict)
-- [x] ChaosInterceptor: memory_poisoning applied to input before invoke; LLM faults (timeout before call, others after)
-- [x] context_attacks: indirect_injection, memory_poisoning (strategy prepend/append/replace), normalize_context_attacks
-- [x] Per-scenario context_attacks in contract.chaos_matrix
-
-#### Behavioral Contracts
-- [x] ContractEngine: (invariant × scenario) cells with optional reset (reset_endpoint / reset_function)
-- [x] system_prompt_leak_probe via contract invariant `probes`; behavior_unchanged with baseline auto/manual
-- [x] Stateful detection and warning when no reset configured
-
-#### Replay Regression
-- [x] ReplaySessionConfig with `file` (load from file) or inline id/input; validation require id+input when no file
-- [x] ReplayConfig.sources (LangSmith project or run_id) with auto_import
-
-#### Scoring & Config
-- [x] ScoringConfig (mutation, chaos, contract, replay) weights must sum to 1.0
-- [x] AgentConfig.reset_endpoint, reset_function; ModelConfig api_key env-only
-- [x] Mutation count max 50 (OSS); 22+ mutation types
-
-#### HuggingFace Integration
-- [x] Create HuggingFaceModelProvider
-- [x] Support GGUF model downloading
-- [x] Add recommended models list
-- [x] Integrate with Ollama model importing
-
-#### Vector Similarity
-- [x] Create LocalEmbedder class
-- [x] Integrate sentence-transformers
-- [x] Implement similarity calculation
-- [x] Add lazy model loading
-
----
-
-### Testing & Quality
-
-#### Unit Tests
-- [x] Test configuration loading
-- [x] Test mutation types
-- [x] Test assertion checkers
-- [ ] Test agent adapters
-- [ ] Test orchestrator
-- [ ] Test report generation
-
-#### Integration Tests
-- [ ] Test full run with mock agent
-- [ ] Test CLI commands
-- [ ] Test report generation
-
-#### Documentation
-- [x] Write README.md
-- [x] Create IMPLEMENTATION_CHECKLIST.md
-- [x] Create ARCHITECTURE_SUMMARY.md
-- [x] Create API_SPECIFICATION.md
-- [x] Create CONTRIBUTING.md
-- [x] Create CONFIGURATION_GUIDE.md
-
----
-
-### Phase 6: Essential Mutations (Week 7-8)
-
-#### Core Mutation Types
-- [x] Add ENCODING_ATTACKS mutation type
-- [x] Add CONTEXT_MANIPULATION mutation type
-- [x] Add LENGTH_EXTREMES mutation type
-- [x] Update MutationType enum with all 8 types
-- [x] Create templates for new mutation types
-- [x] Update mutation validation for edge cases
-
-#### Configuration Updates
-- [x] Update MutationConfig defaults
-- [x] Update example configuration files
-- [x] Update orchestrator comments
-
-#### Documentation Updates
-- [x] Update README.md with comprehensive mutation types table
-- [x] Add Mutation Strategy section to README
-- [x] Update API_SPECIFICATION.md with all 8 types
-- [x] Update MODULES.md with detailed mutation documentation
-- [x] Add Mutation Types Guide to CONFIGURATION_GUIDE.md
-- [x] Add Understanding Mutation Types to USAGE_GUIDE.md
-- [x] Add Mutation Type Deep Dive to TEST_SCENARIOS.md
-
----
-
-## Progress Summary
-
-| Phase | Status | Completion |
-|-------|--------|------------|
-| CLI Phase 1: Foundation | ✅ Complete | 100% |
-| CLI Phase 2: Mutation Engine | ✅ Complete | 100% |
-| CLI Phase 3: Runner & Assertions | ✅ Complete | 100% |
-| CLI Phase 4: CLI & Reporting | ✅ Complete | 100% |
-| CLI Phase 5: V2 Features | ✅ Complete | 90% |
-| CLI Phase 6: Essential Mutations | ✅ Complete | 100% |
-| Documentation | ✅ Complete | 100% |
-
----
-
-## Next Steps
-
-### Immediate (Current Sprint)
-1. **Rust Build**: Compile and integrate Rust performance module
-2. **Integration Tests**: Add full integration test suite
-3. **PyPI Release**: Prepare and publish to PyPI
-4. **Community Launch**: Publish to Hacker News and Reddit
-
-### Future Roadmap
-See [ROADMAP.md](ROADMAP.md) for comprehensive roadmap of advanced chaos engineering and adversarial testing features. These are open for community contribution - see [CONTRIBUTING.md](CONTRIBUTING.md) for how to get involved.
diff --git a/docs/TEST_SCENARIOS.md b/docs/TEST_SCENARIOS.md
index e3acf8a..c99ce4e 100644
--- a/docs/TEST_SCENARIOS.md
+++ b/docs/TEST_SCENARIOS.md
@@ -1,13 +1,22 @@
 # Real-World Test Scenarios
 
-This document provides concrete, real-world examples of testing AI agents with flakestorm. Each scenario includes the complete setup, expected inputs/outputs, and integration code.
+This document provides concrete, real-world examples of testing AI agents with flakestorm across **all V2 pillars**: **mutation** (adversarial prompts), **environment chaos** (tool/LLM faults), **behavioral contracts** (invariants × chaos matrix), and **replay regression** (replay production incidents). Each scenario includes setup, config, and commands where applicable.
 
-**V2:** Flakestorm supports **22+ mutation types** (prompt-level and system/network-level) with a **max of 50 mutations per run** in OSS. Use `version: "2.0"` in config for chaos, behavioral contracts, and replay regression. See [Configuration Guide](CONFIGURATION_GUIDE.md) and [V2 Spec](V2_SPEC.md).
+**V2:** Use `version: "2.0"` in config to enable chaos, contracts, and replay. Flakestorm supports **24 mutation types** (prompt-level and system/network-level), with a **maximum of 50 mutations per run** in OSS. See [V2 Spec](V2_SPEC.md) and [V2 Audit](V2_AUDIT.md).
 
 ---
 
 ## Table of Contents
 
+### V2 scenarios (all pillars)
+
+- [V2 Scenario: Environment Chaos](#v2-scenario-environment-chaos) — Tool/LLM fault injection
+- [V2 Scenario: Behavioral Contract × Chaos Matrix](#v2-scenario-behavioral-contract--chaos-matrix) — Invariants under each chaos scenario
+- [V2 Scenario: Replay Regression](#v2-scenario-replay-regression) — Replay production failures
+- [Full V2 example (chaos + contract + replay)](../examples/v2_research_agent/README.md) — Working agent and config
+
+### Mutation-focused scenarios (agent + config examples)
+
 1. [Scenario 1: Customer Service Chatbot](#scenario-1-customer-service-chatbot)
 2. [Scenario 2: Code Generation Agent](#scenario-2-code-generation-agent)
 3. [Scenario 3: RAG-Based Q&A Agent](#scenario-3-rag-based-qa-agent)
@@ -17,6 +26,95 @@ This document provides concrete, real-world examples of testing AI agents with f
 
 ---
 
+## V2 Scenario: Environment Chaos
+
+**Goal:** Test that your agent degrades gracefully when tools or the LLM fail (timeouts, errors, rate limits, malformed responses).
+
+**Commands:** `flakestorm run --chaos` (mutations + chaos) or `flakestorm run --chaos --chaos-only` (golden prompts only, under chaos). Use `--chaos-profile api_outage` (or `degraded_llm`, `hostile_tools`, `high_latency`, `cascading_failure`) for built-in profiles.
+
+**Config (excerpt):**
+
+```yaml
+version: "2.0"
+chaos:
+  tool_faults:
+    - tool: "*"
+      mode: error
+      error_code: 503
+      probability: 0.3
+  llm_faults:
+    - mode: truncated_response
+      max_tokens: 5
+      probability: 0.2
+```
+
+**Docs:** [Environment Chaos](ENVIRONMENT_CHAOS.md), [V2 Audit §8.1](V2_AUDIT.md#1-prd-81--environment-chaos). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md).
+
+---
+
+## V2 Scenario: Behavioral Contract × Chaos Matrix
+
+**Goal:** Verify that named invariants (each with a severity) hold under every chaos scenario. Each (invariant × scenario) cell is an independent run; optionally set `agent.reset_endpoint` or `agent.reset_function` for state isolation.
+
+**Commands:** `flakestorm contract run`, `flakestorm contract validate`, `flakestorm contract score`.
+
+**Config (excerpt):**
+
+```yaml
+version: "2.0"
+agent:
+  reset_endpoint: "http://localhost:8790/reset"
+contract:
+  name: "My Contract"
+  invariants:
+    - id: must-cite
+      type: regex
+      pattern: "(?i)(source|according to)"
+      severity: critical
+    - id: max-latency
+      type: latency
+      max_ms: 60000
+      severity: medium
+  chaos_matrix:
+    - name: "no-chaos"
+      tool_faults: []
+      llm_faults: []
+    - name: "api-outage"
+      tool_faults:
+        - tool: "*"
+          mode: error
+          error_code: 503
+```
+
+**Docs:** [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md), [V2 Spec](V2_SPEC.md) (contract matrix isolation, resilience score). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md).
+
+---
+
+## V2 Scenario: Replay Regression
+
+**Goal:** Replay a saved session (e.g. a production incident) with fixed inputs and tool responses, then verify the agent’s output against a contract.
+
+**Commands:** `flakestorm replay run path/to/session.yaml -c flakestorm.yaml`, `flakestorm replay export --from-report report.json -o ./replays/`. Optional: `flakestorm replay run --from-langsmith RUN_ID --run` to import from LangSmith and run.
+
+**Config (excerpt):**
+
+```yaml
+version: "2.0"
+replays:
+  sessions:
+    - file: "replays/incident_001.yaml"
+  # Optional: sources for LangSmith import
+  # sources: ...
+```
+
+**Session file (e.g. `replays/incident_001.yaml`):** contains the fields `id`, `input`, `tool_responses` (optional), and `contract` (name or path).
+
+**Docs:** [Replay Regression](REPLAY_REGRESSION.md), [V2 Audit §8.3](V2_AUDIT.md#3-prd-83--replay-based-regression). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md).
+
+---
+
+---
+
 ## Scenario 1: Customer Service Chatbot
 
 ### The Agent