diff --git a/.gitignore b/.gitignore
index 4f74d6d..426c8cb 100644
--- a/.gitignore
+++ b/.gitignore
@@ -127,4 +127,3 @@ docs/*
 !docs/CONTRIBUTING.md
 !docs/API_SPECIFICATION.md
 !docs/TESTING_GUIDE.md
-!docs/IMPLEMENTATION_CHECKLIST.md
diff --git a/README.md b/README.md
index f8e299d..b2a6462 100644
--- a/README.md
+++ b/README.md
@@ -152,7 +152,7 @@ For the full **V1 vs V2 flow** (mutation-only vs four pillars, contract matrix i
 
 ### Supporting capabilities
 
-- **Adversarial mutations** — 24 mutation types (prompt-level and system/network-level); max 50 mutations per run in OSS. [→ Test Scenarios](docs/TEST_SCENARIOS.md)
+- **Adversarial mutations** — 24 mutation types (prompt-level and system/network-level); max 50 mutations per run in OSS. [→ Test Scenarios](docs/TEST_SCENARIOS.md) for mutation, chaos, contract, and replay examples.
 - **Invariants & assertions** — Deterministic checks, semantic similarity, safety (PII, refusal); configurable per contract.
 - **Robustness score** — For mutation runs: a single weighted score (0–1) of how well the agent handled adversarial prompts. Reported in HTML/JSON and CLI (`results.statistics.robustness_score`).
 - **Unified resilience score** — For full CI: weighted combination of **mutation robustness**, chaos resilience, contract compliance, and replay regression; weights (mutation, chaos, contract, replay) configurable in YAML and must sum to 1.0.
@@ -229,7 +229,7 @@ See [Roadmap](ROADMAP.md) for the full plan. Highlights:
 - [📖 Usage Guide](docs/USAGE_GUIDE.md) - Complete end-to-end guide (includes local setup)
 - [⚙️ Configuration Guide](docs/CONFIGURATION_GUIDE.md) - All configuration options
 - [🔌 Connection Guide](docs/CONNECTION_GUIDE.md) - How to connect FlakeStorm to your agent
-- [🧪 Test Scenarios](docs/TEST_SCENARIOS.md) - Real-world examples with code
+- [🧪 Test Scenarios](docs/TEST_SCENARIOS.md) - Real-world examples for mutation, chaos, contract, and replay (V2)
 - [📂 Example: chaos, contracts & replay](examples/v2_research_agent/README.md) - Working agent and config you can run
 - [🔗 Integrations Guide](docs/INTEGRATIONS_GUIDE.md) - HuggingFace models & semantic similarity
 - [🤖 LLM Providers](docs/LLM_PROVIDERS.md) - OpenAI, Claude, Gemini (env-only API keys)
@@ -254,7 +254,6 @@ See [Roadmap](ROADMAP.md) for the full plan. Highlights:
 ### Reference
 - [📋 API Specification](docs/API_SPECIFICATION.md) - API reference
 - [🧪 Testing Guide](docs/TESTING_GUIDE.md) - How to run and write tests
-- [✅ Implementation Checklist](docs/IMPLEMENTATION_CHECKLIST.md) - Development progress
 
 ## Cloud Version (Early Access)
 
diff --git a/docs/IMPLEMENTATION_CHECKLIST.md b/docs/IMPLEMENTATION_CHECKLIST.md
deleted file mode 100644
index 1f6e148..0000000
--- a/docs/IMPLEMENTATION_CHECKLIST.md
+++ /dev/null
@@ -1,220 +0,0 @@
-# flakestorm Implementation Checklist
-
-This document tracks the implementation progress of flakestorm - The Agent Reliability Engine.
-
-## CLI Version (Open Source - Apache 2.0)
-
-### Phase 1: Foundation (Week 1-2)
-
-#### Project Scaffolding
-- [x] Initialize Python project with pyproject.toml
-- [x] Set up Rust workspace with Cargo.toml
-- [x] Create Apache 2.0 LICENSE file
-- [x] Write comprehensive README.md
-- [x] Create flakestorm.yaml.example template
-- [x] Set up project structure (src/flakestorm/*)
-- [x] Configure pre-commit hooks (black, ruff, mypy)
-
-#### Configuration System
-- [x] Define Pydantic models for configuration
-- [x] Implement YAML loading/validation
-- [x] Support environment variable expansion
-- [x] Create configuration factory functions
-- [x] Add configuration validation tests
-
-#### Agent Protocol/Adapter
-- [x] Define AgentProtocol interface
-- [x] Implement HTTPAgentAdapter
-- [x] Implement PythonAgentAdapter
-- [x] Implement LangChainAgentAdapter
-- [x] Create adapter factory function
-- [x] Add retry logic for HTTP adapter
-
----
-
-### Phase 2: Mutation Engine (Week 2-3)
-
-#### Ollama Integration
-- [x] Create MutationEngine class
-- [x] Implement Ollama client wrapper
-- [x] Add connection verification
-- [x] Support async mutation generation
-- [x] Implement batch generation
-
-#### Mutation Types & Templates
-- [x] Define MutationType enum
-- [x] Create Mutation dataclass
-- [x] Write templates for PARAPHRASE
-- [x] Write templates for NOISE
-- [x] Write templates for TONE_SHIFT
-- [x] Write templates for PROMPT_INJECTION
-- [x] Add mutation validation logic
-- [x] Support custom templates
-
-#### Rust Performance Bindings
-- [x] Set up PyO3 bindings
-- [x] Implement robustness score calculation
-- [x] Implement weighted score calculation
-- [x] Implement Levenshtein distance
-- [x] Implement parallel processing utilities
-- [x] Build and test Rust module
-- [x] Integrate with Python package
-
----
-
-### Phase 3: Runner & Assertions (Week 3-4)
-
-#### Async Runner
-- [x] Create FlakeStormRunner class
-- [x] Implement orchestrator logic
-- [x] Add concurrency control with semaphores
-- [x] Implement progress tracking
-- [x] Add setup verification
-
-#### Invariant System
-- [x] Create InvariantVerifier class
-- [x] Implement ContainsChecker
-- [x] Implement LatencyChecker
-- [x] Implement ValidJsonChecker
-- [x] Implement RegexChecker
-- [x] Implement SimilarityChecker
-- [x] Implement ExcludesPIIChecker
-- [x] Implement RefusalChecker
-- [x] Add checker registry
-
----
-
-### Phase 4: CLI & Reporting (Week 4-5)
-
-#### CLI Commands
-- [x] Set up Typer application
-- [x] Implement `flakestorm init` command
-- [x] Implement `flakestorm run` command
-- [x] Implement `flakestorm verify` command
-- [x] Implement `flakestorm report` command
-- [x] Implement `flakestorm score` command
-- [x] Add CI mode (--ci --min-score)
-- [x] Add rich progress bars
-
-#### Report Generation
-- [x] Create report data models
-- [x] Implement HTMLReportGenerator
-- [x] Create interactive HTML template
-- [x] Implement JSONReportGenerator
-- [x] Implement TerminalReporter
-- [x] Add score visualization
-- [x] Add mutation matrix view
-
----
-
-### Phase 5: V2 Features (Week 5-7)
-
-#### Environment Chaos & Context Attacks
-- [x] ChaosConfig (tool_faults, llm_faults, context_attacks as list or dict)
-- [x] ChaosInterceptor: memory_poisoning applied to input before invoke; LLM faults (timeout before call, others after)
-- [x] context_attacks: indirect_injection, memory_poisoning (strategy prepend/append/replace), normalize_context_attacks
-- [x] Per-scenario context_attacks in contract.chaos_matrix
-
-#### Behavioral Contracts
-- [x] ContractEngine: (invariant × scenario) cells with optional reset (reset_endpoint / reset_function)
-- [x] system_prompt_leak_probe via contract invariant `probes`; behavior_unchanged with baseline auto/manual
-- [x] Stateful detection and warning when no reset configured
-
-#### Replay Regression
-- [x] ReplaySessionConfig with `file` (load from file) or inline id/input; validation require id+input when no file
-- [x] ReplayConfig.sources (LangSmith project or run_id) with auto_import
-
-#### Scoring & Config
-- [x] ScoringConfig (mutation, chaos, contract, replay) weights must sum to 1.0
-- [x] AgentConfig.reset_endpoint, reset_function; ModelConfig api_key env-only
-- [x] Mutation count max 50 (OSS); 22+ mutation types
-
-#### HuggingFace Integration
-- [x] Create HuggingFaceModelProvider
-- [x] Support GGUF model downloading
-- [x] Add recommended models list
-- [x] Integrate with Ollama model importing
-
-#### Vector Similarity
-- [x] Create LocalEmbedder class
-- [x] Integrate sentence-transformers
-- [x] Implement similarity calculation
-- [x] Add lazy model loading
-
----
-
-### Testing & Quality
-
-#### Unit Tests
-- [x] Test configuration loading
-- [x] Test mutation types
-- [x] Test assertion checkers
-- [ ] Test agent adapters
-- [ ] Test orchestrator
-- [ ] Test report generation
-
-#### Integration Tests
-- [ ] Test full run with mock agent
-- [ ] Test CLI commands
-- [ ] Test report generation
-
-#### Documentation
-- [x] Write README.md
-- [x] Create IMPLEMENTATION_CHECKLIST.md
-- [x] Create ARCHITECTURE_SUMMARY.md
-- [x] Create API_SPECIFICATION.md
-- [x] Create CONTRIBUTING.md
-- [x] Create CONFIGURATION_GUIDE.md
-
----
-
-### Phase 6: Essential Mutations (Week 7-8)
-
-#### Core Mutation Types
-- [x] Add ENCODING_ATTACKS mutation type
-- [x] Add CONTEXT_MANIPULATION mutation type
-- [x] Add LENGTH_EXTREMES mutation type
-- [x] Update MutationType enum with all 8 types
-- [x] Create templates for new mutation types
-- [x] Update mutation validation for edge cases
-
-#### Configuration Updates
-- [x] Update MutationConfig defaults
-- [x] Update example configuration files
-- [x] Update orchestrator comments
-
-#### Documentation Updates
-- [x] Update README.md with comprehensive mutation types table
-- [x] Add Mutation Strategy section to README
-- [x] Update API_SPECIFICATION.md with all 8 types
-- [x] Update MODULES.md with detailed mutation documentation
-- [x] Add Mutation Types Guide to CONFIGURATION_GUIDE.md
-- [x] Add Understanding Mutation Types to USAGE_GUIDE.md
-- [x] Add Mutation Type Deep Dive to TEST_SCENARIOS.md
-
----
-
-## Progress Summary
-
-| Phase | Status | Completion |
-|-------|--------|------------|
-| CLI Phase 1: Foundation | ✅ Complete | 100% |
-| CLI Phase 2: Mutation Engine | ✅ Complete | 100% |
-| CLI Phase 3: Runner & Assertions | ✅ Complete | 100% |
-| CLI Phase 4: CLI & Reporting | ✅ Complete | 100% |
-| CLI Phase 5: V2 Features | ✅ Complete | 90% |
-| CLI Phase 6: Essential Mutations | ✅ Complete | 100% |
-| Documentation | ✅ Complete | 100% |
-
----
-
-## Next Steps
-
-### Immediate (Current Sprint)
-1. **Rust Build**: Compile and integrate Rust performance module
-2. **Integration Tests**: Add full integration test suite
-3. **PyPI Release**: Prepare and publish to PyPI
-4. **Community Launch**: Publish to Hacker News and Reddit
-
-### Future Roadmap
-See [ROADMAP.md](ROADMAP.md) for comprehensive roadmap of advanced chaos engineering and adversarial testing features. These are open for community contribution - see [CONTRIBUTING.md](CONTRIBUTING.md) for how to get involved.
diff --git a/docs/TEST_SCENARIOS.md b/docs/TEST_SCENARIOS.md
index e3acf8a..c99ce4e 100644
--- a/docs/TEST_SCENARIOS.md
+++ b/docs/TEST_SCENARIOS.md
@@ -1,13 +1,22 @@
 # Real-World Test Scenarios
 
-This document provides concrete, real-world examples of testing AI agents with flakestorm. Each scenario includes the complete setup, expected inputs/outputs, and integration code.
+This document provides concrete, real-world examples of testing AI agents with flakestorm across **all V2 pillars**: **mutation** (adversarial prompts), **environment chaos** (tool/LLM faults), **behavioral contracts** (invariants × chaos matrix), and **replay regression** (replay production incidents). Each scenario includes setup, config, and commands where applicable.
 
-**V2:** Flakestorm supports **22+ mutation types** (prompt-level and system/network-level) with a **max of 50 mutations per run** in OSS. Use `version: "2.0"` in config for chaos, behavioral contracts, and replay regression. See [Configuration Guide](CONFIGURATION_GUIDE.md) and [V2 Spec](V2_SPEC.md).
+**V2:** Use `version: "2.0"` in config to enable chaos, contracts, and replay. Flakestorm supports **24 mutation types** (prompt-level and system/network-level) and **max 50 mutations per run** in OSS. See [V2 Spec](V2_SPEC.md) and [V2 Audit](V2_AUDIT.md).
 
 ---
 
 ## Table of Contents
 
+### V2 scenarios (all pillars)
+
+- [V2 Scenario: Environment Chaos](#v2-scenario-environment-chaos) — Tool/LLM fault injection
+- [V2 Scenario: Behavioral Contract × Chaos Matrix](#v2-scenario-behavioral-contract--chaos-matrix) — Invariants under each chaos scenario
+- [V2 Scenario: Replay Regression](#v2-scenario-replay-regression) — Replay production failures
+- [Full V2 example (chaos + contract + replay)](../examples/v2_research_agent/README.md) — Working agent and config
+
+### Mutation-focused scenarios (agent + config examples)
+
 1. [Scenario 1: Customer Service Chatbot](#scenario-1-customer-service-chatbot)
 2. [Scenario 2: Code Generation Agent](#scenario-2-code-generation-agent)
 3. [Scenario 3: RAG-Based Q&A Agent](#scenario-3-rag-based-qa-agent)
@@ -17,6 +26,95 @@ This document provides concrete, real-world examples of testing AI agents with f
 
 ---
 
+## V2 Scenario: Environment Chaos
+
+**Goal:** Test that your agent degrades gracefully when tools or the LLM fail (timeouts, errors, rate limits, malformed responses).
+
+**Commands:** `flakestorm run --chaos` (mutations + chaos) or `flakestorm run --chaos --chaos-only` (golden prompts only, under chaos). Use `--chaos-profile api_outage` (or `degraded_llm`, `hostile_tools`, `high_latency`, `cascading_failure`) for built-in profiles.
+
+**Config (excerpt):**
+
+```yaml
+version: "2.0"
+chaos:
+  tool_faults:
+    - tool: "*"
+      mode: error
+      error_code: 503
+      probability: 0.3
+  llm_faults:
+    - mode: truncated_response
+      max_tokens: 5
+      probability: 0.2
+```
+
+**Docs:** [Environment Chaos](ENVIRONMENT_CHAOS.md), [V2 Audit §8.1](V2_AUDIT.md#1-prd-81--environment-chaos). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md).
+
+---
+
+## V2 Scenario: Behavioral Contract × Chaos Matrix
+
+**Goal:** Verify that named invariants (with severity) hold under every chaos scenario; each (invariant × scenario) cell is an independent run. Optional `agent.reset_endpoint` or `agent.reset_function` for state isolation.
+
+**Commands:** `flakestorm contract run`, `flakestorm contract validate`, `flakestorm contract score`.
+
+**Config (excerpt):**
+
+```yaml
+version: "2.0"
+agent:
+  reset_endpoint: "http://localhost:8790/reset"
+contract:
+  name: "My Contract"
+  invariants:
+    - id: must-cite
+      type: regex
+      pattern: "(?i)(source|according to)"
+      severity: critical
+    - id: max-latency
+      type: latency
+      max_ms: 60000
+      severity: medium
+  chaos_matrix:
+    - name: "no-chaos"
+      tool_faults: []
+      llm_faults: []
+    - name: "api-outage"
+      tool_faults:
+        - tool: "*"
+          mode: error
+          error_code: 503
+```
+
+**Docs:** [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md), [V2 Spec](V2_SPEC.md) (contract matrix isolation, resilience score). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md).
+
+---
+
+## V2 Scenario: Replay Regression
+
+**Goal:** Replay a saved session (e.g. production incident) with fixed inputs and tool responses, then verify the agent’s output against a contract.
+
+**Commands:** `flakestorm replay run path/to/session.yaml -c flakestorm.yaml`, `flakestorm replay export --from-report report.json -o ./replays/`. Optional: `flakestorm replay run --from-langsmith RUN_ID --run` to import from LangSmith and run.
+
+**Config (excerpt):**
+
+```yaml
+version: "2.0"
+replays:
+  sessions:
+    - file: "replays/incident_001.yaml"
+  # Optional: sources for LangSmith import
+  # sources: ...
+```
+
+**Session file (e.g. `replays/incident_001.yaml`):** `id`, `input`, `tool_responses` (optional), `contract` (name or path).
+
+**Docs:** [Replay Regression](REPLAY_REGRESSION.md), [V2 Audit §8.3](V2_AUDIT.md#3-prd-83--replay-based-regression). **Working example:** [v2_research_agent](../examples/v2_research_agent/README.md).
+
+---
+
+---
+
 ## Scenario 1: Customer Service Chatbot
 
 ### The Agent