Remove the implementation checklist document and update README and TEST_SCENARIOS to reflect the latest V2 features, including detailed descriptions of environment chaos, behavioral contracts, and replay regression scenarios. Adjust links and clarify configuration options for better usability.

2026-04-25 00:36:54 +02:00 · 2026-03-09 13:01:08 +08:00 · 11489255e3
commit 11489255e3
parent 4b0ab63f97
4 changed files with 102 additions and 226 deletions
--- a/docs/IMPLEMENTATION_CHECKLIST.md
+++ b/docs/IMPLEMENTATION_CHECKLIST.md
@@ -1,220 +0,0 @@
-# flakestorm Implementation Checklist
-
-This document tracks the implementation progress of flakestorm - The Agent Reliability Engine.
-
-## CLI Version (Open Source - Apache 2.0)
-
-### Phase 1: Foundation (Week 1-2)
-
-#### Project Scaffolding
- [x] Initialize Python project with pyproject.toml
- [x] Set up Rust workspace with Cargo.toml
- [x] Create Apache 2.0 LICENSE file
- [x] Write comprehensive README.md
- [x] Create flakestorm.yaml.example template
- [x] Set up project structure (src/flakestorm/*)
- [x] Configure pre-commit hooks (black, ruff, mypy)
-
-#### Configuration System
- [x] Define Pydantic models for configuration
- [x] Implement YAML loading/validation
- [x] Support environment variable expansion
- [x] Create configuration factory functions
- [x] Add configuration validation tests
-
-#### Agent Protocol/Adapter
- [x] Define AgentProtocol interface
- [x] Implement HTTPAgentAdapter
- [x] Implement PythonAgentAdapter
- [x] Implement LangChainAgentAdapter
- [x] Create adapter factory function
- [x] Add retry logic for HTTP adapter
-
---
-
-### Phase 2: Mutation Engine (Week 2-3)
-
-#### Ollama Integration
- [x] Create MutationEngine class
- [x] Implement Ollama client wrapper
- [x] Add connection verification
- [x] Support async mutation generation
- [x] Implement batch generation
-
-#### Mutation Types & Templates
- [x] Define MutationType enum
- [x] Create Mutation dataclass
- [x] Write templates for PARAPHRASE
- [x] Write templates for NOISE
- [x] Write templates for TONE_SHIFT
- [x] Write templates for PROMPT_INJECTION
- [x] Add mutation validation logic
- [x] Support custom templates
-
-#### Rust Performance Bindings
- [x] Set up PyO3 bindings
- [x] Implement robustness score calculation
- [x] Implement weighted score calculation
- [x] Implement Levenshtein distance
- [x] Implement parallel processing utilities
- [x] Build and test Rust module
- [x] Integrate with Python package
-
---
-
-### Phase 3: Runner & Assertions (Week 3-4)
-
-#### Async Runner
- [x] Create FlakeStormRunner class
- [x] Implement orchestrator logic
- [x] Add concurrency control with semaphores
- [x] Implement progress tracking
- [x] Add setup verification
-
-#### Invariant System
- [x] Create InvariantVerifier class
- [x] Implement ContainsChecker
- [x] Implement LatencyChecker
- [x] Implement ValidJsonChecker
- [x] Implement RegexChecker
- [x] Implement SimilarityChecker
- [x] Implement ExcludesPIIChecker
- [x] Implement RefusalChecker
- [x] Add checker registry
-
---
-
-### Phase 4: CLI & Reporting (Week 4-5)
-
-#### CLI Commands
- [x] Set up Typer application
- [x] Implement `flakestorm init` command
- [x] Implement `flakestorm run` command
- [x] Implement `flakestorm verify` command
- [x] Implement `flakestorm report` command
- [x] Implement `flakestorm score` command
- [x] Add CI mode (--ci --min-score)
- [x] Add rich progress bars
-
-#### Report Generation
- [x] Create report data models
- [x] Implement HTMLReportGenerator
- [x] Create interactive HTML template
- [x] Implement JSONReportGenerator
- [x] Implement TerminalReporter
- [x] Add score visualization
- [x] Add mutation matrix view
-
---
-
-### Phase 5: V2 Features (Week 5-7)
-
-#### Environment Chaos & Context Attacks
- [x] ChaosConfig (tool_faults, llm_faults, context_attacks as list or dict)
- [x] ChaosInterceptor: memory_poisoning applied to input before invoke; LLM faults (timeout before call, others after)
- [x] context_attacks: indirect_injection, memory_poisoning (strategy prepend/append/replace), normalize_context_attacks
- [x] Per-scenario context_attacks in contract.chaos_matrix
-
-#### Behavioral Contracts
- [x] ContractEngine: (invariant × scenario) cells with optional reset (reset_endpoint / reset_function)
- [x] system_prompt_leak_probe via contract invariant `probes`; behavior_unchanged with baseline auto/manual
- [x] Stateful detection and warning when no reset configured
-
-#### Replay Regression
- [x] ReplaySessionConfig with `file` (load from file) or inline id/input; validation require id+input when no file
- [x] ReplayConfig.sources (LangSmith project or run_id) with auto_import
-
-#### Scoring & Config
- [x] ScoringConfig (mutation, chaos, contract, replay) weights must sum to 1.0
- [x] AgentConfig.reset_endpoint, reset_function; ModelConfig api_key env-only
- [x] Mutation count max 50 (OSS); 22+ mutation types
-
-#### HuggingFace Integration
- [x] Create HuggingFaceModelProvider
- [x] Support GGUF model downloading
- [x] Add recommended models list
- [x] Integrate with Ollama model importing
-
-#### Vector Similarity
- [x] Create LocalEmbedder class
- [x] Integrate sentence-transformers
- [x] Implement similarity calculation
- [x] Add lazy model loading
-
---
-
-### Testing & Quality
-
-#### Unit Tests
- [x] Test configuration loading
- [x] Test mutation types
- [x] Test assertion checkers
- [ ] Test agent adapters
- [ ] Test orchestrator
- [ ] Test report generation
-
-#### Integration Tests
- [ ] Test full run with mock agent
- [ ] Test CLI commands
- [ ] Test report generation
-
-#### Documentation
- [x] Write README.md
- [x] Create IMPLEMENTATION_CHECKLIST.md
- [x] Create ARCHITECTURE_SUMMARY.md
- [x] Create API_SPECIFICATION.md
- [x] Create CONTRIBUTING.md
- [x] Create CONFIGURATION_GUIDE.md
-
---
-
-### Phase 6: Essential Mutations (Week 7-8)
-
-#### Core Mutation Types
- [x] Add ENCODING_ATTACKS mutation type
- [x] Add CONTEXT_MANIPULATION mutation type
- [x] Add LENGTH_EXTREMES mutation type
- [x] Update MutationType enum with all 8 types
- [x] Create templates for new mutation types
- [x] Update mutation validation for edge cases
-
-#### Configuration Updates
- [x] Update MutationConfig defaults
- [x] Update example configuration files
- [x] Update orchestrator comments
-
-#### Documentation Updates
- [x] Update README.md with comprehensive mutation types table
- [x] Add Mutation Strategy section to README
- [x] Update API_SPECIFICATION.md with all 8 types
- [x] Update MODULES.md with detailed mutation documentation
- [x] Add Mutation Types Guide to CONFIGURATION_GUIDE.md
- [x] Add Understanding Mutation Types to USAGE_GUIDE.md
- [x] Add Mutation Type Deep Dive to TEST_SCENARIOS.md
-
---
-
-## Progress Summary
-
-| Phase | Status | Completion |
-|-------|--------|------------|
-| CLI Phase 1: Foundation | ✅ Complete | 100% |
-| CLI Phase 2: Mutation Engine | ✅ Complete | 100% |
-| CLI Phase 3: Runner & Assertions | ✅ Complete | 100% |
-| CLI Phase 4: CLI & Reporting | ✅ Complete | 100% |
-| CLI Phase 5: V2 Features | ✅ Complete | 90% |
-| CLI Phase 6: Essential Mutations | ✅ Complete | 100% |
-| Documentation | ✅ Complete | 100% |
-
---
-
-## Next Steps
-
-### Immediate (Current Sprint)
-1. **Rust Build**: Compile and integrate Rust performance module
-2. **Integration Tests**: Add full integration test suite
-3. **PyPI Release**: Prepare and publish to PyPI
-4. **Community Launch**: Publish to Hacker News and Reddit
-
-### Future Roadmap
-See [ROADMAP.md](ROADMAP.md) for comprehensive roadmap of advanced chaos engineering and adversarial testing features. These are open for community contribution - see [CONTRIBUTING.md](CONTRIBUTING.md) for how to get involved.
--- a/docs/TEST_SCENARIOS.md
+++ b/docs/TEST_SCENARIOS.md
@@ -1,13 +1,22 @@
 # Real-World Test Scenarios

-This document provides concrete, real-world examples of testing AI agents with flakestorm. Each scenario includes the complete setup, expected inputs/outputs, and integration code.
+This document provides concrete, real-world examples of testing AI agents with flakestorm across **all V2 pillars**: **mutation** (adversarial prompts), **environment chaos** (tool/LLM faults), **behavioral contracts** (invariants × chaos matrix), and **replay regression** (replay production incidents). Each scenario includes setup, config, and commands where applicable.

-**V2:** Flakestorm supports **22+ mutation types** (prompt-level and system/network-level) with a **max of 50 mutations per run** in OSS. Use `version: "2.0"` in config for chaos, behavioral contracts, and replay regression. See [Configuration Guide](CONFIGURATION_GUIDE.md) and [V2 Spec](V2_SPEC.md).
+**V2:** Use `version: "2.0"` in config to enable chaos, contracts, and replay. Flakestorm supports **24 mutation types** (prompt-level and system/network-level) and **max 50 mutations per run** in OSS. See [V2 Spec](V2_SPEC.md) and [V2 Audit](V2_AUDIT.md).

 ---

 ## Table of Contents

+### V2 scenarios (all pillars)
+
+- [V2 Scenario: Environment Chaos](#v2-scenario-environment-chaos) — Tool/LLM fault injection
+- [V2 Scenario: Behavioral Contract × Chaos Matrix](#v2-scenario-behavioral-contract--chaos-matrix) — Invariants under each chaos scenario
+- [V2 Scenario: Replay Regression](#v2-scenario-replay-regression) — Replay production failures
+- [Full V2 example (chaos + contract + replay)](../examples/v2_research_agent/README.md) — Working agent and config
+
+### Mutation-focused scenarios (agent + config examples)
+
 1. [Scenario 1: Customer Service Chatbot](#scenario-1-customer-service-chatbot)
 2. [Scenario 2: Code Generation Agent](#scenario-2-code-generation-agent)
 3. [Scenario 3: RAG-Based Q&A Agent](#scenario-3-rag-based-qa-agent)
@@ -17,6 +26,95 @@ This document provides concrete, real-world examples of testing AI agents with f

 ---

+## V2 Scenario: Environment Chaos
+
+**Goal:** Test that your agent degrades gracefully when tools or the LLM fail (timeouts, errors, rate limits, malformed responses).
+
+**Commands:** `flakestorm run --chaos` (mutations + chaos) or `flakestorm run --chaos --chaos-only` (golden prompts only, under chaos). Use `--chaos-profile api_outage` (or `degraded_llm`, `hostile_tools`, `high_latency`, `cascading_failure`) for built-in profiles.
+
+**Config (excerpt):**
+
+```yaml
+version: "2.0"
+chaos:
+  tool_faults:
+    - tool: "*"
+      mode: error
+      error_code: 503
+      probability: 0.3
+  llm_faults:
+    - mode: truncated_response
+      max_tokens: 5
+      probability: 0.2
+```
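The `probability` field in the excerpt above suggests that each fault fires on only a fraction of calls. The following is a minimal sketch of that idea, not flakestorm's actual implementation: `ToolFault`, `maybe_inject`, and `injection_rate` are hypothetical names, and the only behavior assumed is that `tool: "*"` matches every tool and that a fault fires when a uniform random draw falls below `probability`.

```python
import random
from dataclasses import dataclass


@dataclass
class ToolFault:
    """Hypothetical mirror of one `tool_faults` entry from the config excerpt."""
    tool: str           # tool name, or "*" to match every tool
    mode: str           # e.g. "error"
    error_code: int     # status code returned when the fault fires
    probability: float  # chance in [0, 1] that the fault fires on a given call


def maybe_inject(fault, tool_name, rng):
    """Return a simulated error response if the fault fires, else None."""
    applies = fault.tool == "*" or fault.tool == tool_name
    if applies and rng.random() < fault.probability:
        return {"error": True, "status": fault.error_code}
    return None


def injection_rate(fault, tool_name, calls, seed=0):
    """Fraction of `calls` on which the fault fired, using a seeded RNG."""
    rng = random.Random(seed)
    hits = sum(maybe_inject(fault, tool_name, rng) is not None for _ in range(calls))
    return hits / calls


if __name__ == "__main__":
    fault = ToolFault(tool="*", mode="error", error_code=503, probability=0.3)
    print(injection_rate(fault, "search", calls=10_000))
```

Seeding the generator keeps a chaos run reproducible, which may matter when a failure needs to be investigated afterwards.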
+
+**Docs:** [Environment Chaos](ENVIRONMENT_CHAOS.md), [V2 Audit §8.1](V2_AUDIT.md#1-prd-81--environment-chaos). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md).
+
+---
+
+## V2 Scenario: Behavioral Contract × Chaos Matrix
+
+**Goal:** Verify that named invariants (with severity) hold under every chaos scenario; each (invariant × scenario) cell is an independent run. Optional `agent.reset_endpoint` or `agent.reset_function` for state isolation.
+
+**Commands:** `flakestorm contract run`, `flakestorm contract validate`, `flakestorm contract score`.
+
+**Config (excerpt):**
+
+```yaml
+version: "2.0"
+agent:
+  reset_endpoint: "http://localhost:8790/reset"
+contract:
+  name: "My Contract"
+  invariants:
+    - id: must-cite
+      type: regex
+      pattern: "(?i)(source|according to)"
+      severity: critical
+    - id: max-latency
+      type: latency
+      max_ms: 60000
+      severity: medium
+  chaos_matrix:
+    - name: "no-chaos"
+      tool_faults: []
+      llm_faults: []
+    - name: "api-outage"
+      tool_faults:
+        - tool: "*"
+          mode: error
+          error_code: 503
+```
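To make the (invariant × scenario) matrix concrete, here is a rough sketch that enumerates the cells for the excerpt above and evaluates the two invariant types it uses. This is an illustration under stated assumptions, not the ContractEngine itself: `matrix_cells` and `check` are hypothetical helpers, and only the `regex` and `latency` types shown in the excerpt are handled.

```python
import re
from itertools import product

# Invariants and scenario names copied from the config excerpt.
INVARIANTS = [
    {"id": "must-cite", "type": "regex",
     "pattern": r"(?i)(source|according to)", "severity": "critical"},
    {"id": "max-latency", "type": "latency",
     "max_ms": 60000, "severity": "medium"},
]
SCENARIOS = ["no-chaos", "api-outage"]


def matrix_cells():
    """Every (invariant id, scenario name) pair; each would be one independent run."""
    return [(inv["id"], name) for inv, name in product(INVARIANTS, SCENARIOS)]


def check(invariant, output, latency_ms):
    """Evaluate a single invariant against one agent response."""
    if invariant["type"] == "regex":
        return re.search(invariant["pattern"], output) is not None
    if invariant["type"] == "latency":
        return latency_ms <= invariant["max_ms"]
    raise ValueError(f"unsupported invariant type: {invariant['type']}")


if __name__ == "__main__":
    for cell in matrix_cells():
        print(cell)
```

With two invariants and two scenarios this yields four cells, so the matrix grows multiplicatively as invariants or scenarios are added.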
+
+**Docs:** [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md), [V2 Spec](V2_SPEC.md) (contract matrix isolation, resilience score). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md).
+
+---
+
+## V2 Scenario: Replay Regression
+
+**Goal:** Replay a saved session (e.g. production incident) with fixed inputs and tool responses, then verify the agent’s output against a contract.
+
+**Commands:** `flakestorm replay run path/to/session.yaml -c flakestorm.yaml`, `flakestorm replay export --from-report report.json -o ./replays/`. Optional: `flakestorm replay run --from-langsmith RUN_ID --run` to import from LangSmith and run.
+
+**Config (excerpt):**
+
+```yaml
+version: "2.0"
+replays:
+  sessions:
+    - file: "replays/incident_001.yaml"
+  # Optional: sources for LangSmith import
+  # sources: ...
+```
+
+**Session file (e.g. `replays/incident_001.yaml`):** `id`, `input`, `tool_responses` (optional), `contract` (name or path).
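The removed checklist notes that a replay session needs `id` and `input` when no `file` is given. A minimal sketch of that rule follows, assuming session entries are plain dictionaries; `validate_session` is a hypothetical helper and does not reflect how `ReplaySessionConfig` is actually written.

```python
def validate_session(entry):
    """Return a list of validation errors for one `replays.sessions` entry.

    Assumed rule: an entry either points at a session file via `file`,
    or defines the session inline, in which case `id` and `input` are required.
    """
    if entry.get("file"):
        return []  # contents are loaded (and validated) from the file instead
    return [
        f"missing required field: {key}"
        for key in ("id", "input")
        if not entry.get(key)
    ]


if __name__ == "__main__":
    print(validate_session({"file": "replays/incident_001.yaml"}))
    print(validate_session({"id": "incident_002"}))
```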
+
+**Docs:** [Replay Regression](REPLAY_REGRESSION.md), [V2 Audit §8.3](V2_AUDIT.md#3-prd-83--replay-based-regression). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md).
+
+---
+
+---
+
 ## Scenario 1: Customer Service Chatbot

 ### The Agent