Remove the implementation checklist document and update README and TEST_SCENARIOS to reflect the latest V2 features, including detailed descriptions of environment chaos, behavioral contracts, and replay regression scenarios. Adjust links and clarify configuration options for better usability.

Entropix 2026-03-09 13:01:08 +08:00
parent 4b0ab63f97
commit 11489255e3
4 changed files with 102 additions and 226 deletions


@@ -1,220 +0,0 @@
# flakestorm Implementation Checklist
This document tracks the implementation progress of flakestorm - The Agent Reliability Engine.
## CLI Version (Open Source - Apache 2.0)
### Phase 1: Foundation (Week 1-2)
#### Project Scaffolding
- [x] Initialize Python project with pyproject.toml
- [x] Set up Rust workspace with Cargo.toml
- [x] Create Apache 2.0 LICENSE file
- [x] Write comprehensive README.md
- [x] Create flakestorm.yaml.example template
- [x] Set up project structure (src/flakestorm/*)
- [x] Configure pre-commit hooks (black, ruff, mypy)
#### Configuration System
- [x] Define Pydantic models for configuration
- [x] Implement YAML loading/validation
- [x] Support environment variable expansion
- [x] Create configuration factory functions
- [x] Add configuration validation tests
#### Agent Protocol/Adapter
- [x] Define AgentProtocol interface
- [x] Implement HTTPAgentAdapter
- [x] Implement PythonAgentAdapter
- [x] Implement LangChainAgentAdapter
- [x] Create adapter factory function
- [x] Add retry logic for HTTP adapter
---
### Phase 2: Mutation Engine (Week 2-3)
#### Ollama Integration
- [x] Create MutationEngine class
- [x] Implement Ollama client wrapper
- [x] Add connection verification
- [x] Support async mutation generation
- [x] Implement batch generation
#### Mutation Types & Templates
- [x] Define MutationType enum
- [x] Create Mutation dataclass
- [x] Write templates for PARAPHRASE
- [x] Write templates for NOISE
- [x] Write templates for TONE_SHIFT
- [x] Write templates for PROMPT_INJECTION
- [x] Add mutation validation logic
- [x] Support custom templates
#### Rust Performance Bindings
- [x] Set up PyO3 bindings
- [x] Implement robustness score calculation
- [x] Implement weighted score calculation
- [x] Implement Levenshtein distance
- [x] Implement parallel processing utilities
- [x] Build and test Rust module
- [x] Integrate with Python package
---
### Phase 3: Runner & Assertions (Week 3-4)
#### Async Runner
- [x] Create FlakeStormRunner class
- [x] Implement orchestrator logic
- [x] Add concurrency control with semaphores
- [x] Implement progress tracking
- [x] Add setup verification
#### Invariant System
- [x] Create InvariantVerifier class
- [x] Implement ContainsChecker
- [x] Implement LatencyChecker
- [x] Implement ValidJsonChecker
- [x] Implement RegexChecker
- [x] Implement SimilarityChecker
- [x] Implement ExcludesPIIChecker
- [x] Implement RefusalChecker
- [x] Add checker registry
---
### Phase 4: CLI & Reporting (Week 4-5)
#### CLI Commands
- [x] Set up Typer application
- [x] Implement `flakestorm init` command
- [x] Implement `flakestorm run` command
- [x] Implement `flakestorm verify` command
- [x] Implement `flakestorm report` command
- [x] Implement `flakestorm score` command
- [x] Add CI mode (--ci --min-score)
- [x] Add rich progress bars
#### Report Generation
- [x] Create report data models
- [x] Implement HTMLReportGenerator
- [x] Create interactive HTML template
- [x] Implement JSONReportGenerator
- [x] Implement TerminalReporter
- [x] Add score visualization
- [x] Add mutation matrix view
---
### Phase 5: V2 Features (Week 5-7)
#### Environment Chaos & Context Attacks
- [x] ChaosConfig (tool_faults, llm_faults, context_attacks as list or dict)
- [x] ChaosInterceptor: memory_poisoning applied to input before invoke; LLM faults (timeout before call, others after)
- [x] context_attacks: indirect_injection, memory_poisoning (strategy prepend/append/replace), normalize_context_attacks
- [x] Per-scenario context_attacks in contract.chaos_matrix
#### Behavioral Contracts
- [x] ContractEngine: (invariant × scenario) cells with optional reset (reset_endpoint / reset_function)
- [x] system_prompt_leak_probe via contract invariant `probes`; behavior_unchanged with baseline auto/manual
- [x] Stateful detection and warning when no reset configured
#### Replay Regression
- [x] ReplaySessionConfig with `file` (load from file) or inline id/input; validation require id+input when no file
- [x] ReplayConfig.sources (LangSmith project or run_id) with auto_import
#### Scoring & Config
- [x] ScoringConfig (mutation, chaos, contract, replay) weights must sum to 1.0
- [x] AgentConfig.reset_endpoint, reset_function; ModelConfig api_key env-only
- [x] Mutation count max 50 (OSS); 22+ mutation types
#### HuggingFace Integration
- [x] Create HuggingFaceModelProvider
- [x] Support GGUF model downloading
- [x] Add recommended models list
- [x] Integrate with Ollama model importing
#### Vector Similarity
- [x] Create LocalEmbedder class
- [x] Integrate sentence-transformers
- [x] Implement similarity calculation
- [x] Add lazy model loading
---
### Testing & Quality
#### Unit Tests
- [x] Test configuration loading
- [x] Test mutation types
- [x] Test assertion checkers
- [ ] Test agent adapters
- [ ] Test orchestrator
- [ ] Test report generation
#### Integration Tests
- [ ] Test full run with mock agent
- [ ] Test CLI commands
- [ ] Test report generation
#### Documentation
- [x] Write README.md
- [x] Create IMPLEMENTATION_CHECKLIST.md
- [x] Create ARCHITECTURE_SUMMARY.md
- [x] Create API_SPECIFICATION.md
- [x] Create CONTRIBUTING.md
- [x] Create CONFIGURATION_GUIDE.md
---
### Phase 6: Essential Mutations (Week 7-8)
#### Core Mutation Types
- [x] Add ENCODING_ATTACKS mutation type
- [x] Add CONTEXT_MANIPULATION mutation type
- [x] Add LENGTH_EXTREMES mutation type
- [x] Update MutationType enum with all 8 types
- [x] Create templates for new mutation types
- [x] Update mutation validation for edge cases
#### Configuration Updates
- [x] Update MutationConfig defaults
- [x] Update example configuration files
- [x] Update orchestrator comments
#### Documentation Updates
- [x] Update README.md with comprehensive mutation types table
- [x] Add Mutation Strategy section to README
- [x] Update API_SPECIFICATION.md with all 8 types
- [x] Update MODULES.md with detailed mutation documentation
- [x] Add Mutation Types Guide to CONFIGURATION_GUIDE.md
- [x] Add Understanding Mutation Types to USAGE_GUIDE.md
- [x] Add Mutation Type Deep Dive to TEST_SCENARIOS.md
---
## Progress Summary
| Phase | Status | Completion |
|-------|--------|------------|
| CLI Phase 1: Foundation | ✅ Complete | 100% |
| CLI Phase 2: Mutation Engine | ✅ Complete | 100% |
| CLI Phase 3: Runner & Assertions | ✅ Complete | 100% |
| CLI Phase 4: CLI & Reporting | ✅ Complete | 100% |
| CLI Phase 5: V2 Features | ✅ Complete | 90% |
| CLI Phase 6: Essential Mutations | ✅ Complete | 100% |
| Documentation | ✅ Complete | 100% |
---
## Next Steps
### Immediate (Current Sprint)
1. **Rust Build**: Compile and integrate Rust performance module
2. **Integration Tests**: Add full integration test suite
3. **PyPI Release**: Prepare and publish to PyPI
4. **Community Launch**: Publish to Hacker News and Reddit
### Future Roadmap
See [ROADMAP.md](ROADMAP.md) for comprehensive roadmap of advanced chaos engineering and adversarial testing features. These are open for community contribution - see [CONTRIBUTING.md](CONTRIBUTING.md) for how to get involved.


@@ -1,13 +1,22 @@
# Real-World Test Scenarios
This document provides concrete, real-world examples of testing AI agents with flakestorm. Each scenario includes the complete setup, expected inputs/outputs, and integration code.
This document provides concrete, real-world examples of testing AI agents with flakestorm across **all V2 pillars**: **mutation** (adversarial prompts), **environment chaos** (tool/LLM faults), **behavioral contracts** (invariants × chaos matrix), and **replay regression** (replay production incidents). Each scenario includes setup, config, and commands where applicable.
**V2:** Flakestorm supports **22+ mutation types** (prompt-level and system/network-level) with a **max of 50 mutations per run** in OSS. Use `version: "2.0"` in config for chaos, behavioral contracts, and replay regression. See [Configuration Guide](CONFIGURATION_GUIDE.md) and [V2 Spec](V2_SPEC.md).
**V2:** Use `version: "2.0"` in config to enable chaos, contracts, and replay. Flakestorm supports **24 mutation types** (prompt-level and system/network-level) and **max 50 mutations per run** in OSS. See [V2 Spec](V2_SPEC.md) and [V2 Audit](V2_AUDIT.md).
---
## Table of Contents
### V2 scenarios (all pillars)
- [V2 Scenario: Environment Chaos](#v2-scenario-environment-chaos) — Tool/LLM fault injection
- [V2 Scenario: Behavioral Contract × Chaos Matrix](#v2-scenario-behavioral-contract--chaos-matrix) — Invariants under each chaos scenario
- [V2 Scenario: Replay Regression](#v2-scenario-replay-regression) — Replay production failures
- [Full V2 example (chaos + contract + replay)](../examples/v2_research_agent/README.md) — Working agent and config
### Mutation-focused scenarios (agent + config examples)
1. [Scenario 1: Customer Service Chatbot](#scenario-1-customer-service-chatbot)
2. [Scenario 2: Code Generation Agent](#scenario-2-code-generation-agent)
3. [Scenario 3: RAG-Based Q&A Agent](#scenario-3-rag-based-qa-agent)
@@ -17,6 +26,95 @@ This document provides concrete, real-world examples of testing AI agents with f
---
## V2 Scenario: Environment Chaos
**Goal:** Test that your agent degrades gracefully when tools or the LLM fail (timeouts, errors, rate limits, malformed responses).
**Commands:** `flakestorm run --chaos` (mutations + chaos) or `flakestorm run --chaos --chaos-only` (golden prompts only, under chaos). Use `--chaos-profile api_outage` (or `degraded_llm`, `hostile_tools`, `high_latency`, `cascading_failure`) for built-in profiles.
**Config (excerpt):**
```yaml
version: "2.0"
chaos:
  tool_faults:
    - tool: "*"
      mode: error
      error_code: 503
      probability: 0.3
  llm_faults:
    - mode: truncated_response
      max_tokens: 5
      probability: 0.2
```
**Docs:** [Environment Chaos](ENVIRONMENT_CHAOS.md), [V2 Audit §8.1](V2_AUDIT.md#1-prd-81--environment-chaos). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md).
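
To make the `tool_faults` semantics concrete, here is a minimal Python sketch of probabilistic fault injection around a tool call. It is an illustration only, not flakestorm's implementation: `ToolFault`, `ToolFaultError`, and `call_with_chaos` are hypothetical names, and the reading of `tool: "*"` as "match every tool" and of `probability` as a per-call chance are assumptions based on the excerpt above.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolFault:
    """Hypothetical mirror of one `tool_faults` entry from the config excerpt."""
    tool: str           # tool name, or "*" (assumed to match every tool)
    mode: str           # only "error" is handled in this sketch
    error_code: int     # status code reported by the injected failure
    probability: float  # assumed per-call chance in [0, 1] that the fault fires


class ToolFaultError(RuntimeError):
    """Raised instead of running the real tool when a fault fires."""

    def __init__(self, tool: str, error_code: int) -> None:
        super().__init__(f"injected fault: {tool} returned {error_code}")
        self.error_code = error_code


def call_with_chaos(
    tool_name: str,
    tool_fn: Callable[[], str],
    faults: list[ToolFault],
    rng: random.Random,
) -> str:
    """Run tool_fn unless a matching fault fires first."""
    for fault in faults:
        matches = fault.tool in ("*", tool_name)
        # rng.random() is in [0.0, 1.0), so 0.0 never fires and 1.0 always fires.
        if matches and fault.mode == "error" and rng.random() < fault.probability:
            raise ToolFaultError(tool_name, fault.error_code)
    return tool_fn()
```

Passing a seeded `random.Random` keeps a chaos run reproducible, which helps when a failure needs to be investigated afterwards.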
---
## V2 Scenario: Behavioral Contract × Chaos Matrix
**Goal:** Verify that named invariants (each with a severity) hold under every chaos scenario; each (invariant × scenario) cell is an independent run. Optionally set `agent.reset_endpoint` or `agent.reset_function` to isolate state between cells.
**Commands:** `flakestorm contract run`, `flakestorm contract validate`, `flakestorm contract score`.
**Config (excerpt):**
```yaml
version: "2.0"
agent:
  reset_endpoint: "http://localhost:8790/reset"
contract:
  name: "My Contract"
  invariants:
    - id: must-cite
      type: regex
      pattern: "(?i)(source|according to)"
      severity: critical
    - id: max-latency
      type: latency
      max_ms: 60000
      severity: medium
  chaos_matrix:
    - name: "no-chaos"
      tool_faults: []
      llm_faults: []
    - name: "api-outage"
      tool_faults:
        - tool: "*"
          mode: error
          error_code: 503
```
**Docs:** [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md), [V2 Spec](V2_SPEC.md) (contract matrix isolation, resilience score). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md).
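
The (invariant × scenario) matrix can be pictured with a short Python sketch. It is a rough illustration under stated assumptions, not the ContractEngine itself: `check_invariant`, `evaluate_matrix`, and the `run_agent` callable are hypothetical, and treating `max_ms` as an inclusive bound is an assumption. The invariants and scenario names are copied from the excerpt above.

```python
import re
from itertools import product
from typing import Callable

# Invariants and scenario names taken from the config excerpt.
INVARIANTS = [
    {"id": "must-cite", "type": "regex",
     "pattern": r"(?i)(source|according to)", "severity": "critical"},
    {"id": "max-latency", "type": "latency", "max_ms": 60000, "severity": "medium"},
]
SCENARIOS = ["no-chaos", "api-outage"]


def check_invariant(invariant: dict, response_text: str, latency_ms: float) -> bool:
    """Return True when a single invariant holds for a single agent response."""
    if invariant["type"] == "regex":
        return re.search(invariant["pattern"], response_text) is not None
    if invariant["type"] == "latency":
        # Assumption: the bound is inclusive.
        return latency_ms <= invariant["max_ms"]
    raise ValueError(f"unsupported invariant type: {invariant['type']}")


def evaluate_matrix(
    run_agent: Callable[[str], tuple[str, float]],
) -> dict[tuple[str, str], bool]:
    """Evaluate every cell; run_agent(scenario) returns (text, latency_ms).

    The agent is invoked once per cell, so each cell is an independent run.
    """
    results = {}
    for invariant, scenario in product(INVARIANTS, SCENARIOS):
        text, latency_ms = run_agent(scenario)
        results[(invariant["id"], scenario)] = check_invariant(invariant, text, latency_ms)
    return results
```

With two invariants and two scenarios this yields four cells. A stateful agent would need a reset between calls to `run_agent` to keep the cells independent, which is what the reset options in the config are for.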
---
## V2 Scenario: Replay Regression
**Goal:** Replay a saved session (e.g. production incident) with fixed inputs and tool responses, then verify the agent's output against a contract.
**Commands:** `flakestorm replay run path/to/session.yaml -c flakestorm.yaml`, `flakestorm replay export --from-report report.json -o ./replays/`. Optional: `flakestorm replay run --from-langsmith RUN_ID --run` to import from LangSmith and run.
**Config (excerpt):**
```yaml
version: "2.0"
replays:
  sessions:
    - file: "replays/incident_001.yaml"
  # Optional: sources for LangSmith import
  # sources: ...
```
**Session file (e.g. `replays/incident_001.yaml`):** `id`, `input`, `tool_responses` (optional), `contract` (name or path).
**Docs:** [Replay Regression](REPLAY_REGRESSION.md), [V2 Audit §8.3](V2_AUDIT.md#3-prd-83--replay-based-regression). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md).
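
The idea of replaying with fixed tool responses can be sketched in a few lines of Python. This is a simplified illustration, not flakestorm's replay code: `validate_session` and `make_replay_tool` are hypothetical helpers, the example session contents are made up, and the shape of `tool_responses` (a list of `{"tool": ..., "response": ...}` entries) is an assumption, since the session file schema is not spelled out here.

```python
from typing import Callable

# Hypothetical in-memory form of a session file. Field names follow the list
# above; the layout of `tool_responses` is assumed for illustration.
session = {
    "id": "incident_001",
    "input": "What is our refund policy?",
    "tool_responses": [{"tool": "search", "response": "Refunds within 30 days."}],
    "contract": "My Contract",
}


def validate_session(session: dict) -> list[str]:
    """Return the names of required fields (`id`, `input`) that are missing or empty."""
    return [key for key in ("id", "input") if not session.get(key)]


def make_replay_tool(session: dict) -> Callable[[str], str]:
    """Build a stub that serves recorded responses instead of calling real tools."""
    recorded: dict[str, list[str]] = {}
    for entry in session.get("tool_responses") or []:
        recorded.setdefault(entry["tool"], []).append(entry["response"])

    def replay_tool(tool_name: str) -> str:
        queue = recorded.get(tool_name)
        if not queue:
            raise KeyError(f"no recorded response left for tool {tool_name!r}")
        # Serve responses in the order they were recorded.
        return queue.pop(0)

    return replay_tool
```

Because the stub never reaches a live tool, the agent sees the same inputs as during the original incident, so a changed output points to the agent rather than to its environment.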
---
---
## Scenario 1: Customer Service Chatbot
### The Agent