diff --git a/docs/API_SPECIFICATION.md b/docs/API_SPECIFICATION.md
index 43c6379..83f0350 100644
--- a/docs/API_SPECIFICATION.md
+++ b/docs/API_SPECIFICATION.md
@@ -376,6 +376,11 @@ results.get_by_prompt("...") # Filter by prompt

 # Serialization
 results.to_dict()  # Full JSON-serializable dict
+
+# V2: Resilience and contract/replay (when config has contract/replays)
+results.resilience_scores  # dict: mutation_robustness, chaos_resilience, contract_compliance, replay_regression
+results.contract_compliance  # ContractRunResult | None (when contract run was executed)
+# Replay results are reported via flakestorm replay run --output; see Reports below.
 ```

 #### MutationResult
@@ -443,6 +448,8 @@ reporter.print_failures(limit=10)
 reporter.print_full_report()
 ```

+**V2 reports:** Contract runs (`flakestorm contract run --output report.html`) and replay runs (`flakestorm replay run --output report.html`) produce HTML reports that include **suggested actions** for failed cells or sessions (e.g. add reset_endpoint, tighten invariants, fix tool behavior). See [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md) and [Replay Regression](REPLAY_REGRESSION.md).
+
 ---

 ## CLI Commands
@@ -459,16 +466,19 @@ flakestorm init --force # Overwrite existing

 ### `flakestorm run`

-Run reliability tests.
+Run reliability tests (mutation run; optionally with chaos).

 ```bash
-flakestorm run                       # Default config
+flakestorm run                       # Default config (mutation only)
 flakestorm run --config custom.yaml  # Custom config
-flakestorm run --output json         # JSON output
-flakestorm run --output terminal     # Terminal only
-flakestorm run --min-score 0.9 --ci  # CI mode
-flakestorm run --verify-only         # Just verify setup
-flakestorm run --quiet               # Minimal output
+flakestorm run --chaos                     # Apply chaos (tool/LLM faults, context_attacks) during mutation run
+flakestorm run --chaos-only                # Chaos-only run (no mutations); requires chaos config
+flakestorm run --chaos-profile api_outage  # Use a built-in chaos profile
+flakestorm run --output json               # JSON output
+flakestorm run --output terminal           # Terminal only
+flakestorm run --min-score 0.9 --ci        # CI mode
+flakestorm run --verify-only               # Just verify setup
+flakestorm run --quiet                     # Minimal output
 ```

 ### `flakestorm verify`
@@ -503,6 +513,38 @@ else
 fi
 ```

+### V2: `flakestorm contract run` / `validate` / `score`
+
+Run behavioral contract tests (invariants × chaos matrix).
+
+```bash
+flakestorm contract run                       # Run contract matrix; progress and score in terminal
+flakestorm contract run --output report.html  # Save HTML report with suggested actions for failed cells
+flakestorm contract validate                  # Validate contract config only
+flakestorm contract score                     # Output contract resilience score only
+```
+
+### V2: `flakestorm replay run` / `export`
+
+Replay regression: run saved sessions and verify against a contract.
+
+```bash
+flakestorm replay run                        # Replay sessions from config (file or inline)
+flakestorm replay run path/to/session.yaml   # Replay a single session file
+flakestorm replay run path/to/replays/       # Replay all sessions in directory
+flakestorm replay run --output report.html   # Save HTML report with suggested actions for failed sessions
+flakestorm replay export --from-report FILE  # Export from an existing report
+```
+
+### V2: `flakestorm ci`
+
+Run full CI pipeline: mutation run, contract run (if configured), chaos-only (if chaos configured), replay (if configured); then compute overall weighted score from `scoring.weights`.
+
+```bash
+flakestorm ci
+flakestorm ci --config custom.yaml
+```
+
 ---

 ## Environment Variables
@@ -512,6 +554,8 @@ fi
 | `OLLAMA_HOST` | Override Ollama server URL |
 | Custom headers | Expanded in config via `${VAR}` syntax |

+**V2 — API keys (env-only):** Model API keys must not be literal in config. Use environment variables and reference them in config (e.g. `api_key: "${OPENAI_API_KEY}"`). Supported: `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, etc. See [LLM Providers](LLM_PROVIDERS.md).
+
 ---

 ## Exit Codes
diff --git a/docs/CONFIGURATION_GUIDE.md b/docs/CONFIGURATION_GUIDE.md
index 6256ece..209015b 100644
--- a/docs/CONFIGURATION_GUIDE.md
+++ b/docs/CONFIGURATION_GUIDE.md
@@ -53,7 +53,7 @@ With `version: "2.0"` you can add the three **chaos engineering pillars** and a

 **Context attacks** (chaos on tool/context or input before invoke, not the user prompt) are configured under `chaos.context_attacks`. You can use a **list** of attack configs or a **dict** (addendum format, e.g. `memory_poisoning: { payload: "...", strategy: "append" }`). Each scenario in `contract.chaos_matrix` can also define its own `context_attacks`. See [Context Attacks](CONTEXT_ATTACKS.md).

-All v1.0 options remain valid; v2.0 blocks are optional and additive.
+All v1.0 options remain valid; v2.0 blocks are optional and additive. Implementation status: all V2 gaps are closed (see [GAP_VERIFICATION](GAP_VERIFICATION.md)). Mutations: **22+ types**, **max 50 per run** in OSS.

 ---

diff --git a/docs/CONNECTION_GUIDE.md b/docs/CONNECTION_GUIDE.md
index 4ec1096..4138378 100644
--- a/docs/CONNECTION_GUIDE.md
+++ b/docs/CONNECTION_GUIDE.md
@@ -44,6 +44,8 @@ This guide explains how to connect FlakeStorm to your agent, covering different

 **Note:** Native CI/CD integrations (scheduled runs, pipeline plugins) are **Cloud only**. OSS users run `flakestorm ci` from their own scripts or job runners.

+**V2 — API keys:** When using cloud LLM providers (OpenAI, Anthropic, Google) for mutation generation or agent backends, API keys must be set via **environment variables only** (e.g. `OPENAI_API_KEY`). Reference them in config as `api_key: "${OPENAI_API_KEY}"`. Do not put literal keys in config files. See [LLM Providers](LLM_PROVIDERS.md).
+
 ---

 ## Internal Code Options
diff --git a/docs/DEVELOPER_FAQ.md b/docs/DEVELOPER_FAQ.md
index 147d37a..89ec0f1 100644
--- a/docs/DEVELOPER_FAQ.md
+++ b/docs/DEVELOPER_FAQ.md
@@ -7,14 +7,15 @@ This document answers common questions developers might have about the flakestor
 ## Table of Contents

 1. [Architecture Questions](#architecture-questions)
-2. [Configuration System](#configuration-system)
-3. [Mutation Engine](#mutation-engine)
-4. [Assertion System](#assertion-system)
-5. [Performance & Rust](#performance--rust)
-6. [Agent Adapters](#agent-adapters)
-7. [Testing & Quality](#testing--quality)
-8. [Extending flakestorm](#extending-flakestorm)
-9. [Common Issues](#common-issues)
+2. [V2 and Documentation](#v2-and-documentation)
+3. [Configuration System](#configuration-system)
+4. [Mutation Engine](#mutation-engine)
+5. [Assertion System](#assertion-system)
+6. [Performance & Rust](#performance--rust)
+7. [Agent Adapters](#agent-adapters)
+8. [Testing & Quality](#testing--quality)
+9. [Extending flakestorm](#extending-flakestorm)
+10. [Common Issues](#common-issues)

 ---

@@ -77,6 +78,39 @@ This separation allows:

 ---

+## V2 and Documentation
+
+### Q: What is V2 and where is it documented?
+
+**A:** **V2** (`version: "2.0"` in config) adds three chaos-engineering pillars and a unified score. All gaps from the V2 PRD are closed (see [GAP_VERIFICATION](GAP_VERIFICATION.md)). Authoritative references:
+
+| Topic | Document |
+|-------|----------|
+| Spec clarifications (reset, behavior_unchanged, probes, scoring) | [V2_SPEC](V2_SPEC.md) |
+| Environment chaos (tool/LLM faults, profiles, response_drift) | [ENVIRONMENT_CHAOS](ENVIRONMENT_CHAOS.md) |
+| Behavioral contracts (chaos_matrix, invariants, reset_endpoint/reset_function) | [BEHAVIORAL_CONTRACTS](BEHAVIORAL_CONTRACTS.md) |
+| Replay regression (sessions, LangSmith, contract resolution) | [REPLAY_REGRESSION](REPLAY_REGRESSION.md) |
+| Context attacks (memory_poisoning, system_prompt_leak_probe, list/dict config) | [CONTEXT_ATTACKS](CONTEXT_ATTACKS.md) |
+| LLM providers and API keys (env-only) | [LLM_PROVIDERS](LLM_PROVIDERS.md) |
+
+### Q: How do chaos, contract, and replay fit into the codebase?
+
+**A:** In V2:
+
+- **Chaos:** `chaos/` (interceptor, tool_proxy, llm_proxy, faults, profiles). The runner wraps the agent with `ChaosInterceptor` when `--chaos` or `--chaos-only` is used. Tool faults apply by `match_url` or `tool: "*"`; LLM faults (truncated, empty, garbage, rate_limit, response_drift) are applied in the interceptor.
+- **Contract:** `contracts/` (engine, matrix). When config has `contract` + `chaos_matrix`, `FlakeStormRunner.run()` runs the contract engine: resets between cells (if `reset_endpoint`/`reset_function`), runs invariants (including `behavior_unchanged` and probes for system_prompt_leak), and attaches `contract_compliance` to results. Scoring uses severity weights; any critical failure → FAIL.
+- **Replay:** `replay/` (loader, runner). Sessions loaded from file or inline (or LangSmith); contract resolved by name or path. `flakestorm replay run [path]` replays and verifies against the contract; reports include suggested actions for failed sessions.
+
+### Q: Why must API keys be environment variables only in V2?
+
+**A:** Security: literal API keys in config files get committed to version control. V2 validates at load time and fails with a clear message if a literal key is detected. Use `api_key: "${OPENAI_API_KEY}"` (and set the variable in the environment or CI secrets).
+
+### Q: What does `flakestorm ci` run?
+
+**A:** It runs, in order: (1) mutation run (with chaos if configured), (2) contract run if `contract` + `chaos_matrix` are configured, (3) chaos-only run if chaos is configured, (4) replay run if `replays` is configured. Then it computes an **overall weighted score** from `scoring.weights` (mutation, chaos, contract, replay); weights must sum to 1.0. Default weights: mutation 0.20, chaos 0.35, contract 0.35, replay 0.10.
+
+---
+
 ## Configuration System

 ### Q: Why Pydantic instead of dataclasses or attrs?
diff --git a/docs/TESTING_GUIDE.md b/docs/TESTING_GUIDE.md
index d413d3f..f301d64 100644
--- a/docs/TESTING_GUIDE.md
+++ b/docs/TESTING_GUIDE.md
@@ -8,12 +8,13 @@ This guide explains how to run, write, and expand tests for flakestorm. It cover

 1. [Running Tests](#running-tests)
 2. [Test Structure](#test-structure)
-3. [Writing Tests: Agent Adapters](#writing-tests-agent-adapters)
-4. [Writing Tests: Orchestrator](#writing-tests-orchestrator)
-5. [Writing Tests: Report Generation](#writing-tests-report-generation)
-6. [Integration Tests](#integration-tests)
-7. [CLI Tests](#cli-tests)
-8. [Test Fixtures](#test-fixtures)
+3. [V2 Integration Tests](#v2-integration-tests)
+4. [Writing Tests: Agent Adapters](#writing-tests-agent-adapters)
+5. [Writing Tests: Orchestrator](#writing-tests-orchestrator)
+6. [Writing Tests: Report Generation](#writing-tests-report-generation)
+7. [Integration Tests](#integration-tests)
+8. [CLI Tests](#cli-tests)
+9. [Test Fixtures](#test-fixtures)

 ---

@@ -67,6 +68,9 @@ pytest tests/test_performance.py

 # Integration tests (requires Ollama)
 pytest tests/test_integration.py
+
+# V2 integration tests (chaos, contract, replay)
+pytest tests/test_chaos_integration.py tests/test_contract_integration.py tests/test_replay_integration.py
 ```

 ---
@@ -76,20 +80,37 @@ pytest tests/test_integration.py
 ```
 tests/
 ├── __init__.py
-├── conftest.py           # Shared fixtures
-├── test_config.py        # Configuration loading tests
-├── test_mutations.py     # Mutation engine tests
-├── test_assertions.py    # Assertion checkers tests
-├── test_performance.py   # Rust/Python bridge tests
-├── test_adapters.py      # Agent adapter tests (TO CREATE)
-├── test_orchestrator.py  # Orchestrator tests (TO CREATE)
-├── test_reports.py       # Report generation tests (TO CREATE)
-├── test_cli.py           # CLI command tests (TO CREATE)
-└── test_integration.py   # Full integration tests (TO CREATE)
+├── conftest.py                   # Shared fixtures
+├── test_config.py                # Configuration loading tests
+├── test_mutations.py             # Mutation engine tests
+├── test_assertions.py            # Assertion checkers tests
+├── test_performance.py           # Rust/Python bridge tests
+├── test_adapters.py              # Agent adapter tests
+├── test_orchestrator.py          # Orchestrator tests
+├── test_reports.py               # Report generation tests
+├── test_cli.py                   # CLI command tests
+├── test_integration.py           # Full integration tests
+├── test_chaos_integration.py     # V2: chaos (tool/LLM faults, interceptor)
+├── test_contract_integration.py  # V2: contract (N×M matrix, score, critical fail)
+└── test_replay_integration.py    # V2: replay (session → replay → pass/fail)
 ```

 ---

+## V2 Integration Tests
+
+V2 adds three integration test modules; all gaps are closed (see [GAP_VERIFICATION](GAP_VERIFICATION.md)).
+
+| Module | What it tests |
+|--------|----------------|
+| `test_chaos_integration.py` | Chaos interceptor, tool faults (match_url/tool *), LLM faults (truncated, empty, garbage, rate_limit, response_drift). |
+| `test_contract_integration.py` | Contract engine: invariants × chaos matrix, reset between cells, resilience score (severity-weighted), critical failure → FAIL. |
+| `test_replay_integration.py` | Replay loader (file/format), ReplayRunner verification against contract, contract resolution by name/path. |
+
+For CI pipelines that use V2, run the full suite including these; `flakestorm ci` runs mutation, contract (if configured), chaos-only (if configured), and replay (if configured), then computes the overall weighted score from `scoring.weights`.
+
+---
+
 ## Writing Tests: Agent Adapters

 ### Location: `tests/test_adapters.py`
diff --git a/docs/TEST_SCENARIOS.md b/docs/TEST_SCENARIOS.md
index 7e4ae4c..05ef783 100644
--- a/docs/TEST_SCENARIOS.md
+++ b/docs/TEST_SCENARIOS.md
@@ -1,6 +1,6 @@
 # Real-World Test Scenarios

-This document provides concrete, real-world examples of testing AI agents with flakestorm: environment chaos (tool/LLM faults), behavioral contracts (invariants × chaos matrix), replay regression, and adversarial mutations. Each scenario includes setup, config, and commands where applicable. Flakestorm supports **24 mutation types** and **max 50 mutations per run** in OSS. See [Configuration Guide](CONFIGURATION_GUIDE.md), [Spec](V2_SPEC.md), and [Audit](V2_AUDIT.md).
+This document provides concrete, real-world examples of testing AI agents with flakestorm: environment chaos (tool/LLM faults), behavioral contracts (invariants × chaos matrix), replay regression, and adversarial mutations. Each scenario includes setup, config, and commands where applicable. Flakestorm supports **22+ mutation types** and **max 50 mutations per run** in OSS. See [Configuration Guide](CONFIGURATION_GUIDE.md), [Spec](V2_SPEC.md), and [Audit](V2_AUDIT.md).

 ---

diff --git a/docs/USAGE_GUIDE.md b/docs/USAGE_GUIDE.md
index 6a73911..19207e4 100644
--- a/docs/USAGE_GUIDE.md
+++ b/docs/USAGE_GUIDE.md
@@ -28,7 +28,7 @@ This comprehensive guide walks you through using flakestorm to test your AI agen
 flakestorm is an **adversarial testing framework** and **chaos engineering platform** for AI agents. It applies chaos engineering principles to systematically test how your AI agents behave under unexpected, malformed, or adversarial inputs.

 - **V1** (`version: "1.0"` or omitted): Mutation-only mode — golden prompts → mutation engine → agent → invariants → **robustness score**. Ideal for quick adversarial input testing.
-- **V2** (`version: "2.0"` in config): Full chaos platform — **Environment Chaos** (tool/LLM faults, context attacks), **Behavioral Contracts** (invariants × chaos matrix with per-cell isolation), and **Replay Regression** (replay production incidents). You get **24 mutation types** and **max 50 mutations per run** in OSS; plus `flakestorm run --chaos`, `flakestorm contract run`, `flakestorm replay run`, and `flakestorm ci` for a unified **resilience score**. API keys for cloud LLM providers must be set via environment variables only (e.g. `api_key: "${OPENAI_API_KEY}"`). See [Configuration Guide](CONFIGURATION_GUIDE.md), [V2 Spec](V2_SPEC.md), and [V2 Audit](V2_AUDIT.md).
+- **V2** (`version: "2.0"` in config): Full chaos platform — **Environment Chaos** (tool/LLM faults, context attacks), **Behavioral Contracts** (invariants × chaos matrix with per-cell isolation), and **Replay Regression** (replay production incidents). You get **22+ mutation types** and **max 50 mutations per run** in OSS; plus `flakestorm run --chaos`, `flakestorm contract run`, `flakestorm replay run`, and `flakestorm ci` for a unified **resilience score**. API keys for cloud LLM providers must be set via environment variables only (e.g. `api_key: "${OPENAI_API_KEY}"`). See [Configuration Guide](CONFIGURATION_GUIDE.md), [V2 Spec](V2_SPEC.md), and [GAP_VERIFICATION](GAP_VERIFICATION.md).

 ### Why Use flakestorm?

@@ -54,7 +54,7 @@ With a V1 config (or V2 config without `--chaos`), you get the classic adversari
 ├─────────────────────────────────────────────────────────────────┤
 │ 1. GOLDEN PROMPTS → 2. MUTATION ENGINE (Local LLM) │
 │ "Book a flight" → Mutated prompts (typos, paraphrases, │
-│ injections, encoding, etc. — 24 types)│
+│ injections, encoding, etc. — 22+ types)│
 │ ↓ │
 │ 3. YOUR AGENT ← Test Runner sends each mutated prompt │
 │ (HTTP/Python) ↓ │
@@ -71,13 +71,15 @@ With **`version: "2.0"`** in your config, Flakestorm adds environment chaos, beh

 | Pillar | What runs | Score / output |
 |--------|-----------|----------------|
-| **Mutation run** | Golden prompts → 24 mutation types → agent → invariants | **Robustness score** (0–1). Use `flakestorm run` or `flakestorm run --chaos` to include chaos. |
+| **Mutation run** | Golden prompts → 22+ mutation types → agent → invariants | **Robustness score** (0–1). Use `flakestorm run` or `flakestorm run --chaos` to include chaos. |
 | **Environment chaos** | Fault injection into tools and LLM (timeouts, errors, rate limits, malformed responses, context attacks) | **Chaos resilience** (0–1). Use `flakestorm run --chaos` (with mutations) or `flakestorm run --chaos --chaos-only` (no mutations). |
 | **Behavioral contracts** | Contracts (invariants × severity) × chaos matrix scenarios; each cell is an independent run (optional reset per cell). | **Resilience score** (0–100%). Use `flakestorm contract run`. Per-contract formula: weighted by severity (critical×3, high×2, medium×1); **auto-FAIL** if any critical fails. |
 | **Replay regression** | Replay saved sessions (e.g. production incidents) and verify against a contract. | Per-session pass/fail; **replay regression** score when run via CI. Use `flakestorm replay run [path]`. |

 **Unified CI:** `flakestorm ci` runs mutation run, contract run (if configured), chaos-only run (if chaos configured), and all replay sessions; then computes an **overall resilience score** from `scoring.weights` (default: mutation 0.20, chaos 0.35, contract 0.35, replay 0.10). Weights must sum to 1.0.

+**Reports:** Use `flakestorm contract run --output report.html` and `flakestorm replay run --output report.html` to save HTML reports; both include **suggested actions** for failed cells or sessions (e.g. add reset_endpoint, tighten invariants). Replay accepts a single session file or a directory: `flakestorm replay run path/to/session.yaml` or `flakestorm replay run path/to/replays/`.
+
 **Contract matrix isolation (V2):** Each (invariant × scenario) cell is independent. Configure `agent.reset_endpoint` (HTTP) or `agent.reset_function` (Python) to clear agent state between cells; if not set and the agent is stateful, Flakestorm warns. See [V2 Spec — Contract matrix isolation](V2_SPEC.md#contract-matrix-isolation).
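
Reviewer sketch (not part of the patch): the scoring rules stated in the table and in the **Unified CI** paragraph above can be illustrated in a few lines of Python. This is a hedged illustration, not Flakestorm's implementation. The names `contract_resilience` and `overall_score`, the "passed weight divided by total weight" ratio, and the treatment of an empty matrix are assumptions; only the severity multipliers, the auto-FAIL rule, the default weights, and the sum-to-1.0 requirement come from the text.

```python
import math

# Severity multipliers quoted in the docs (critical×3, high×2, medium×1).
SEVERITY_WEIGHTS = {"critical": 3, "high": 2, "medium": 1}

# Default scoring.weights quoted in the docs.
DEFAULT_WEIGHTS = {"mutation": 0.20, "chaos": 0.35, "contract": 0.35, "replay": 0.10}


def contract_resilience(cells):
    """Return (score, passed) for a list of (severity, ok) matrix cells.

    Assumption: score = weight of passing cells / weight of all cells.
    The docs state the multipliers and the auto-FAIL rule, not the ratio.
    """
    total = sum(SEVERITY_WEIGHTS[sev] for sev, _ in cells)
    earned = sum(SEVERITY_WEIGHTS[sev] for sev, ok in cells if ok)
    score = earned / total if total else 1.0  # empty matrix: assumed 1.0
    critical_failed = any(sev == "critical" and not ok for sev, ok in cells)
    return score, not critical_failed


def overall_score(scores, weights=DEFAULT_WEIGHTS):
    """Weighted sum of per-pillar scores, each expected in the range 0..1."""
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("scoring.weights must sum to 1.0")
    return sum(weights[name] * scores[name] for name in weights)
```

For example, `contract_resilience([("critical", True), ("high", False), ("medium", True)])` gives a score of 4/6 with no auto-FAIL, because the only failing cell is high severity. The sketch does not model how `flakestorm ci` treats a pillar that is not configured; the docs in this diff do not specify that.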

 ---
@@ -819,7 +821,7 @@ golden_prompts:

 ### Mutation Types

-flakestorm generates adversarial variations of your golden prompts across 24 mutation types organized into categories:
+flakestorm generates adversarial variations of your golden prompts across 22+ mutation types organized into categories:

 #### Prompt-Level Attacks

@@ -1121,7 +1123,7 @@ flakestorm provides 22+ mutation types organized into **Prompt-Level Attacks** a
 ### Choosing Mutation Types

 **Comprehensive Testing (Recommended):**
-Use all 24 types for complete coverage:
+Use all 22+ types for complete coverage:
 ```yaml
 types:
   # Original 8 types
@@ -1221,7 +1223,7 @@ The 22+ mutation types work together to provide comprehensive robustness testing
 - **Infrastructure**: HTTP Header Injection, Payload Size Attack, Content-Type Confusion, Query Parameter Poisoning, Request Method Attack, Protocol-Level Attack, Resource Exhaustion, Concurrent Request Pattern, Timeout Manipulation
 - **Temporal/Context**: Temporal Attack, Multi-Turn Attack

-For comprehensive testing, use all 24 types. For focused testing:
+For comprehensive testing, use all 22+ types. For focused testing:
 - **Security-focused**: Emphasize Prompt Injection, Advanced Jailbreak, Protocol-Level Attack, HTTP Header Injection
 - **UX-focused**: Emphasize Noise, Tone Shift, Context Manipulation, Language Mixing
 - **Infrastructure-focused**: Emphasize all system/network-level types
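
Reviewer sketch (not part of the patch): several files in this diff state that V2 accepts API keys only as environment references such as `api_key: "${OPENAI_API_KEY}"` and rejects literal keys at load time. A check of that kind might look like the following. It is a minimal sketch under assumptions: `ENV_REF`, `require_env_reference`, and `expand_env` are hypothetical names, and the pattern for what counts as a reference is a guess; Flakestorm's actual validation may differ.

```python
import os
import re

# Hypothetical pattern: the whole value must be exactly one ${VAR} reference.
ENV_REF = re.compile(r"^\$\{([A-Za-z_][A-Za-z0-9_]*)\}$")


def require_env_reference(value):
    """Return the variable name if value is "${VAR}", else raise ValueError."""
    match = ENV_REF.match(value)
    if match is None:
        raise ValueError(
            'api_key must reference an environment variable, '
            'e.g. "${OPENAI_API_KEY}"'
        )
    return match.group(1)


def expand_env(value, env=None):
    """Resolve a "${VAR}" reference against env (defaults to os.environ)."""
    env = os.environ if env is None else env
    name = require_env_reference(value)
    if name not in env:
        raise KeyError(f"environment variable {name} is not set")
    return env[name]
```

Passing `env` explicitly keeps the sketch testable without touching the real environment, for instance `expand_env("${OPENAI_API_KEY}", {"OPENAI_API_KEY": "secret"})`.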