Update documentation to reflect Flakestorm V2 enhancements, with detailed descriptions of new features: resilience scores, chaos engineering capabilities, behavioral contracts, and replay regression. Clarify API key management via environment variables, update CLI commands, and improve test scenarios. Adjust the mutation type count to 22+ and ensure all V2 gaps are closed per the latest specifications.

Francisco M Humarang Jr. 2026-03-09 19:52:39 +08:00
parent f1570628c3
commit 4a13425f8a
7 changed files with 142 additions and 39 deletions


@ -376,6 +376,11 @@ results.get_by_prompt("...") # Filter by prompt
# Serialization
results.to_dict() # Full JSON-serializable dict
# V2: Resilience and contract/replay (when config has contract/replays)
results.resilience_scores # dict: mutation_robustness, chaos_resilience, contract_compliance, replay_regression
results.contract_compliance # ContractRunResult | None (when contract run was executed)
# Replay results are reported via flakestorm replay run --output; see Reports below.
```
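A minimal sketch of how the V2 score fields might be consumed, assuming `resilience_scores` is a plain dict keyed by the four names listed above and that a pillar that did not run is reported as `None` (that convention is an assumption). The stand-in dict and the `weakest_pillar` helper are hypothetical and not part of flakestorm.

```python
from typing import Dict, Optional, Tuple

# Stand-in for results.resilience_scores; real values come from a flakestorm run.
resilience_scores: Dict[str, Optional[float]] = {
    "mutation_robustness": 0.91,
    "chaos_resilience": 0.78,
    "contract_compliance": 0.85,
    "replay_regression": None,  # assumed: no replays configured
}


def weakest_pillar(scores: Dict[str, Optional[float]]) -> Optional[Tuple[str, float]]:
    """Return (name, score) of the lowest-scoring pillar that actually ran."""
    ran = {name: value for name, value in scores.items() if value is not None}
    if not ran:
        return None
    name = min(ran, key=ran.get)
    return name, ran[name]


print(weakest_pillar(resilience_scores))
```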
#### MutationResult
@ -443,6 +448,8 @@ reporter.print_failures(limit=10)
reporter.print_full_report()
```
**V2 reports:** Contract runs (`flakestorm contract run --output report.html`) and replay runs (`flakestorm replay run --output report.html`) produce HTML reports that include **suggested actions** for failed cells or sessions (e.g. add reset_endpoint, tighten invariants, fix tool behavior). See [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md) and [Replay Regression](REPLAY_REGRESSION.md).
---
## CLI Commands
@ -459,11 +466,14 @@ flakestorm init --force # Overwrite existing
### `flakestorm run`
Run reliability tests.
Run reliability tests (mutation run; optionally with chaos).
```bash
flakestorm run # Default config
flakestorm run # Default config (mutation only)
flakestorm run --config custom.yaml # Custom config
flakestorm run --chaos # Apply chaos (tool/LLM faults, context_attacks) during mutation run
flakestorm run --chaos-only # Chaos-only run (no mutations); requires chaos config
flakestorm run --chaos-profile api_outage # Use a built-in chaos profile
flakestorm run --output json # JSON output
flakestorm run --output terminal # Terminal only
flakestorm run --min-score 0.9 --ci # CI mode
@ -503,6 +513,38 @@ else
fi
```
### V2: `flakestorm contract run` / `validate` / `score`
Run behavioral contract tests (invariants × chaos matrix).
```bash
flakestorm contract run # Run contract matrix; progress and score in terminal
flakestorm contract run --output report.html # Save HTML report with suggested actions for failed cells
flakestorm contract validate # Validate contract config only
flakestorm contract score # Output contract resilience score only
```
### V2: `flakestorm replay run` / `export`
Replay regression: run saved sessions and verify against a contract.
```bash
flakestorm replay run # Replay sessions from config (file or inline)
flakestorm replay run path/to/session.yaml # Replay a single session file
flakestorm replay run path/to/replays/ # Replay all sessions in directory
flakestorm replay run --output report.html # Save HTML report with suggested actions for failed sessions
flakestorm replay export --from-report FILE # Export from an existing report
```
### V2: `flakestorm ci`
Run full CI pipeline: mutation run, contract run (if configured), chaos-only (if chaos configured), replay (if configured); then compute overall weighted score from `scoring.weights`.
```bash
flakestorm ci
flakestorm ci --config custom.yaml
```
---
## Environment Variables
@ -512,6 +554,8 @@ fi
| `OLLAMA_HOST` | Override Ollama server URL |
| Custom headers | Expanded in config via `${VAR}` syntax |
**V2 — API keys (env-only):** Model API keys must not be literal in config. Use environment variables and reference them in config (e.g. `api_key: "${OPENAI_API_KEY}"`). Supported: `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, etc. See [LLM Providers](LLM_PROVIDERS.md).
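As an illustration of the `${VAR}` reference style described above, the sketch below expands such references from the environment. It is not flakestorm's loader: the `expand_env` helper, and the choice to raise on an unset variable, are assumptions made for the example.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Replace every ${NAME} in value with env[NAME]; fail loudly if NAME is unset."""
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _VAR.sub(_substitute, value)


# Usage with an explicit mapping so the example does not depend on the real environment.
print(expand_env("${OPENAI_API_KEY}", {"OPENAI_API_KEY": "sk-example"}))
```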
---
## Exit Codes


@ -53,7 +53,7 @@ With `version: "2.0"` you can add the three **chaos engineering pillars** and a
**Context attacks** (chaos on tool/context or input before invoke, not the user prompt) are configured under `chaos.context_attacks`. You can use a **list** of attack configs or a **dict** (addendum format, e.g. `memory_poisoning: { payload: "...", strategy: "append" }`). Each scenario in `contract.chaos_matrix` can also define its own `context_attacks`. See [Context Attacks](CONTEXT_ATTACKS.md).
All v1.0 options remain valid; v2.0 blocks are optional and additive.
All v1.0 options remain valid; v2.0 blocks are optional and additive. Implementation status: all V2 gaps are closed (see [GAP_VERIFICATION](GAP_VERIFICATION.md)). Mutations: **22+ types**, **max 50 per run** in OSS.
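The list and dict forms of `chaos.context_attacks` described above can be thought of as two spellings of the same data. The sketch below normalizes both into one list; the `type` key used for list entries and the `normalize_context_attacks` helper are hypothetical, since the exact list schema is defined in the Context Attacks document rather than here.

```python
def normalize_context_attacks(raw):
    """Return a list of attack dicts from either the list or the dict form."""
    if raw is None:
        return []
    if isinstance(raw, list):
        # List form: assumed to already be one dict per attack.
        return [dict(item) for item in raw]
    if isinstance(raw, dict):
        # Dict (addendum) form: the key names the attack, the value holds its parameters.
        return [{"type": name, **(params or {})} for name, params in raw.items()]
    raise TypeError("context_attacks must be a list or a dict")


addendum = {"memory_poisoning": {"payload": "...", "strategy": "append"}}
print(normalize_context_attacks(addendum))
```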
---


@ -44,6 +44,8 @@ This guide explains how to connect FlakeStorm to your agent, covering different
**Note:** Native CI/CD integrations (scheduled runs, pipeline plugins) are **Cloud only**. OSS users run `flakestorm ci` from their own scripts or job runners.
**V2 — API keys:** When using cloud LLM providers (OpenAI, Anthropic, Google) for mutation generation or agent backends, API keys must be set via **environment variables only** (e.g. `OPENAI_API_KEY`). Reference them in config as `api_key: "${OPENAI_API_KEY}"`. Do not put literal keys in config files. See [LLM Providers](LLM_PROVIDERS.md).
---
## Internal Code Options


@ -7,14 +7,15 @@ This document answers common questions developers might have about the flakestor
## Table of Contents
1. [Architecture Questions](#architecture-questions)
2. [Configuration System](#configuration-system)
3. [Mutation Engine](#mutation-engine)
4. [Assertion System](#assertion-system)
5. [Performance & Rust](#performance--rust)
6. [Agent Adapters](#agent-adapters)
7. [Testing & Quality](#testing--quality)
8. [Extending flakestorm](#extending-flakestorm)
9. [Common Issues](#common-issues)
2. [V2 and Documentation](#v2-and-documentation)
3. [Configuration System](#configuration-system)
4. [Mutation Engine](#mutation-engine)
5. [Assertion System](#assertion-system)
6. [Performance & Rust](#performance--rust)
7. [Agent Adapters](#agent-adapters)
8. [Testing & Quality](#testing--quality)
9. [Extending flakestorm](#extending-flakestorm)
10. [Common Issues](#common-issues)
---
@ -77,6 +78,39 @@ This separation allows:
---
## V2 and Documentation
### Q: What is V2 and where is it documented?
**A:** **V2** (`version: "2.0"` in config) adds three chaos-engineering pillars and a unified score. All gaps from the V2 PRD are closed (see [GAP_VERIFICATION](GAP_VERIFICATION.md)). Authoritative references:
| Topic | Document |
|-------|----------|
| Spec clarifications (reset, behavior_unchanged, probes, scoring) | [V2_SPEC](V2_SPEC.md) |
| Environment chaos (tool/LLM faults, profiles, response_drift) | [ENVIRONMENT_CHAOS](ENVIRONMENT_CHAOS.md) |
| Behavioral contracts (chaos_matrix, invariants, reset_endpoint/reset_function) | [BEHAVIORAL_CONTRACTS](BEHAVIORAL_CONTRACTS.md) |
| Replay regression (sessions, LangSmith, contract resolution) | [REPLAY_REGRESSION](REPLAY_REGRESSION.md) |
| Context attacks (memory_poisoning, system_prompt_leak_probe, list/dict config) | [CONTEXT_ATTACKS](CONTEXT_ATTACKS.md) |
| LLM providers and API keys (env-only) | [LLM_PROVIDERS](LLM_PROVIDERS.md) |
### Q: How do chaos, contract, and replay fit into the codebase?
**A:** In V2:
- **Chaos:** `chaos/` (interceptor, tool_proxy, llm_proxy, faults, profiles). The runner wraps the agent with `ChaosInterceptor` when `--chaos` or `--chaos-only` is used. Tool faults apply by `match_url` or `tool: "*"`; LLM faults (truncated, empty, garbage, rate_limit, response_drift) are applied in the interceptor.
- **Contract:** `contracts/` (engine, matrix). When config has `contract` + `chaos_matrix`, `FlakeStormRunner.run()` runs the contract engine: resets between cells (if `reset_endpoint`/`reset_function`), runs invariants (including `behavior_unchanged` and probes for system_prompt_leak), and attaches `contract_compliance` to results. Scoring uses severity weights; any critical failure → FAIL.
- **Replay:** `replay/` (loader, runner). Sessions loaded from file or inline (or LangSmith); contract resolved by name or path. `flakestorm replay run [path]` replays and verifies against the contract; reports include suggested actions for failed sessions.
### Q: Why must API keys be environment variables only in V2?
**A:** Security: literal API keys in config files get committed to version control. V2 validates at load time and fails with a clear message if a literal key is detected. Use `api_key: "${OPENAI_API_KEY}"` (and set the variable in the environment or CI secrets).
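A load-time check of this kind could look roughly like the sketch below, which walks a parsed config and reports every `api_key` field that is not a pure `${VAR}` reference. The `find_literal_api_keys` helper and the sample config layout are hypothetical; the actual validation lives in flakestorm's config loading and may differ.

```python
import re

# A value counts as an environment reference only if it is exactly ${NAME}.
_ENV_REF = re.compile(r"\$\{[A-Za-z_][A-Za-z0-9_]*\}")


def find_literal_api_keys(config, path=""):
    """Return the paths of api_key fields whose value is a literal string."""
    offenders = []
    if isinstance(config, dict):
        for key, value in config.items():
            here = f"{path}.{key}" if path else key
            if key == "api_key" and isinstance(value, str):
                if not _ENV_REF.fullmatch(value):
                    offenders.append(here)
            else:
                offenders.extend(find_literal_api_keys(value, here))
    elif isinstance(config, list):
        for index, item in enumerate(config):
            offenders.extend(find_literal_api_keys(item, f"{path}[{index}]"))
    return offenders


sample = {
    "model": {"api_key": "${OPENAI_API_KEY}"},  # accepted: environment reference
    "agent": {"api_key": "sk-literal"},         # rejected: literal key
}
print(find_literal_api_keys(sample))
```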
### Q: What does `flakestorm ci` run?
**A:** It runs, in order: (1) mutation run (with chaos if configured), (2) contract run if `contract` + `chaos_matrix` are configured, (3) chaos-only run if chaos is configured, (4) replay run if `replays` is configured. Then it computes an **overall weighted score** from `scoring.weights` (mutation, chaos, contract, replay); weights must sum to 1.0. Default weights: mutation 0.20, chaos 0.35, contract 0.35, replay 0.10.
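The weighted combination can be sketched as follows, using the default weights quoted above. The `overall_score` helper is hypothetical, and the sketch assumes all four pillars ran and each reports a score between 0 and 1; how flakestorm treats pillars that are not configured is not covered here.

```python
import math

# Default scoring.weights as stated in the answer above.
DEFAULT_WEIGHTS = {"mutation": 0.20, "chaos": 0.35, "contract": 0.35, "replay": 0.10}


def overall_score(scores, weights=DEFAULT_WEIGHTS):
    """Weighted sum of per-pillar scores; rejects weights that do not sum to 1.0."""
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("scoring.weights must sum to 1.0")
    return sum(weights[name] * scores[name] for name in weights)


pillar_scores = {"mutation": 0.9, "chaos": 0.8, "contract": 0.7, "replay": 1.0}
print(round(overall_score(pillar_scores), 3))
```

With these defaults, chaos and contract together carry 70% of the total, so a weak contract run lowers the overall score far more than a weak replay run does.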
---
## Configuration System
### Q: Why Pydantic instead of dataclasses or attrs?


@ -8,12 +8,13 @@ This guide explains how to run, write, and expand tests for flakestorm. It cover
1. [Running Tests](#running-tests)
2. [Test Structure](#test-structure)
3. [Writing Tests: Agent Adapters](#writing-tests-agent-adapters)
4. [Writing Tests: Orchestrator](#writing-tests-orchestrator)
5. [Writing Tests: Report Generation](#writing-tests-report-generation)
6. [Integration Tests](#integration-tests)
7. [CLI Tests](#cli-tests)
8. [Test Fixtures](#test-fixtures)
3. [V2 Integration Tests](#v2-integration-tests)
4. [Writing Tests: Agent Adapters](#writing-tests-agent-adapters)
5. [Writing Tests: Orchestrator](#writing-tests-orchestrator)
6. [Writing Tests: Report Generation](#writing-tests-report-generation)
7. [Integration Tests](#integration-tests)
8. [CLI Tests](#cli-tests)
9. [Test Fixtures](#test-fixtures)
---
@ -67,6 +68,9 @@ pytest tests/test_performance.py
# Integration tests (requires Ollama)
pytest tests/test_integration.py
# V2 integration tests (chaos, contract, replay)
pytest tests/test_chaos_integration.py tests/test_contract_integration.py tests/test_replay_integration.py
```
---
@ -81,15 +85,32 @@ tests/
├── test_mutations.py # Mutation engine tests
├── test_assertions.py # Assertion checkers tests
├── test_performance.py # Rust/Python bridge tests
├── test_adapters.py # Agent adapter tests (TO CREATE)
├── test_orchestrator.py # Orchestrator tests (TO CREATE)
├── test_reports.py # Report generation tests (TO CREATE)
├── test_cli.py # CLI command tests (TO CREATE)
└── test_integration.py # Full integration tests (TO CREATE)
├── test_adapters.py # Agent adapter tests
├── test_orchestrator.py # Orchestrator tests
├── test_reports.py # Report generation tests
├── test_cli.py # CLI command tests
├── test_integration.py # Full integration tests
├── test_chaos_integration.py # V2: chaos (tool/LLM faults, interceptor)
├── test_contract_integration.py # V2: contract (N×M matrix, score, critical fail)
└── test_replay_integration.py # V2: replay (session → replay → pass/fail)
```
---
## V2 Integration Tests
V2 adds three integration test modules; all gaps are closed (see [GAP_VERIFICATION](GAP_VERIFICATION.md)).
| Module | What it tests |
|--------|----------------|
| `test_chaos_integration.py` | Chaos interceptor, tool faults (match_url/tool *), LLM faults (truncated, empty, garbage, rate_limit, response_drift). |
| `test_contract_integration.py` | Contract engine: invariants × chaos matrix, reset between cells, resilience score (severity-weighted), critical failure → FAIL. |
| `test_replay_integration.py` | Replay loader (file/format), ReplayRunner verification against contract, contract resolution by name/path. |
For CI pipelines that use V2, run the full suite including these; `flakestorm ci` runs mutation, contract (if configured), chaos-only (if configured), and replay (if configured), then computes the overall weighted score from `scoring.weights`.
---
## Writing Tests: Agent Adapters
### Location: `tests/test_adapters.py`


@ -1,6 +1,6 @@
# Real-World Test Scenarios
This document provides concrete, real-world examples of testing AI agents with flakestorm: environment chaos (tool/LLM faults), behavioral contracts (invariants × chaos matrix), replay regression, and adversarial mutations. Each scenario includes setup, config, and commands where applicable. Flakestorm supports **24 mutation types** and **max 50 mutations per run** in OSS. See [Configuration Guide](CONFIGURATION_GUIDE.md), [Spec](V2_SPEC.md), and [Audit](V2_AUDIT.md).
This document provides concrete, real-world examples of testing AI agents with flakestorm: environment chaos (tool/LLM faults), behavioral contracts (invariants × chaos matrix), replay regression, and adversarial mutations. Each scenario includes setup, config, and commands where applicable. Flakestorm supports **22+ mutation types** and **max 50 mutations per run** in OSS. See [Configuration Guide](CONFIGURATION_GUIDE.md), [Spec](V2_SPEC.md), and [Audit](V2_AUDIT.md).
---


@ -28,7 +28,7 @@ This comprehensive guide walks you through using flakestorm to test your AI agen
flakestorm is an **adversarial testing framework** and **chaos engineering platform** for AI agents. It applies chaos engineering principles to systematically test how your AI agents behave under unexpected, malformed, or adversarial inputs.
- **V1** (`version: "1.0"` or omitted): Mutation-only mode — golden prompts → mutation engine → agent → invariants → **robustness score**. Ideal for quick adversarial input testing.
- **V2** (`version: "2.0"` in config): Full chaos platform — **Environment Chaos** (tool/LLM faults, context attacks), **Behavioral Contracts** (invariants × chaos matrix with per-cell isolation), and **Replay Regression** (replay production incidents). You get **24 mutation types** and **max 50 mutations per run** in OSS; plus `flakestorm run --chaos`, `flakestorm contract run`, `flakestorm replay run`, and `flakestorm ci` for a unified **resilience score**. API keys for cloud LLM providers must be set via environment variables only (e.g. `api_key: "${OPENAI_API_KEY}"`). See [Configuration Guide](CONFIGURATION_GUIDE.md), [V2 Spec](V2_SPEC.md), and [V2 Audit](V2_AUDIT.md).
- **V2** (`version: "2.0"` in config): Full chaos platform — **Environment Chaos** (tool/LLM faults, context attacks), **Behavioral Contracts** (invariants × chaos matrix with per-cell isolation), and **Replay Regression** (replay production incidents). You get **22+ mutation types** and **max 50 mutations per run** in OSS; plus `flakestorm run --chaos`, `flakestorm contract run`, `flakestorm replay run`, and `flakestorm ci` for a unified **resilience score**. API keys for cloud LLM providers must be set via environment variables only (e.g. `api_key: "${OPENAI_API_KEY}"`). See [Configuration Guide](CONFIGURATION_GUIDE.md), [V2 Spec](V2_SPEC.md), and [GAP_VERIFICATION](GAP_VERIFICATION.md).
### Why Use flakestorm?
@ -54,7 +54,7 @@ With a V1 config (or V2 config without `--chaos`), you get the classic adversari
├─────────────────────────────────────────────────────────────────┤
│ 1. GOLDEN PROMPTS → 2. MUTATION ENGINE (Local LLM) │
│ "Book a flight" → Mutated prompts (typos, paraphrases, │
│ injections, encoding, etc. — 24 types)│
│ injections, encoding, etc. — 22+ types)│
│ ↓ │
│ 3. YOUR AGENT ← Test Runner sends each mutated prompt │
│ (HTTP/Python) ↓ │
@ -71,13 +71,15 @@ With **`version: "2.0"`** in your config, Flakestorm adds environment chaos, beh
| Pillar | What runs | Score / output |
|--------|-----------|----------------|
| **Mutation run** | Golden prompts → 24 mutation types → agent → invariants | **Robustness score** (0–1). Use `flakestorm run` or `flakestorm run --chaos` to include chaos. |
| **Mutation run** | Golden prompts → 22+ mutation types → agent → invariants | **Robustness score** (0–1). Use `flakestorm run` or `flakestorm run --chaos` to include chaos. |
| **Environment chaos** | Fault injection into tools and LLM (timeouts, errors, rate limits, malformed responses, context attacks) | **Chaos resilience** (0–1). Use `flakestorm run --chaos` (with mutations) or `flakestorm run --chaos --chaos-only` (no mutations). |
| **Behavioral contracts** | Contracts (invariants × severity) × chaos matrix scenarios; each cell is an independent run (optional reset per cell). | **Resilience score** (0–100%). Use `flakestorm contract run`. Per-contract formula: weighted by severity (critical×3, high×2, medium×1); **auto-FAIL** if any critical fails. |
| **Replay regression** | Replay saved sessions (e.g. production incidents) and verify against a contract. | Per-session pass/fail; **replay regression** score when run via CI. Use `flakestorm replay run [path]`. |
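The per-contract formula in the table can be illustrated with a short sketch. Only the severity weights (critical×3, high×2, medium×1) and the auto-FAIL rule come from the table; computing the score as passed weight over total weight, and the `contract_score` helper itself, are assumptions made for illustration.

```python
# Severity weights as given in the table above.
SEVERITY_WEIGHT = {"critical": 3, "high": 2, "medium": 1}


def contract_score(cells):
    """cells: (severity, passed) pairs, one per matrix cell.

    Returns (score_percent, overall_pass). The ratio form is an assumed reading
    of "weighted by severity"; overall_pass is False if any critical cell failed.
    """
    cells = list(cells)
    total = sum(SEVERITY_WEIGHT[severity] for severity, _ in cells)
    earned = sum(SEVERITY_WEIGHT[severity] for severity, passed in cells if passed)
    score = 100.0 * earned / total if total else 0.0
    critical_failed = any(severity == "critical" and not passed for severity, passed in cells)
    return score, not critical_failed


print(contract_score([("critical", False), ("medium", True)]))
```

Under this reading a contract can still report a non-zero percentage while failing overall, because a single failed critical cell overrides the score.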
**Unified CI:** `flakestorm ci` runs mutation run, contract run (if configured), chaos-only run (if chaos configured), and all replay sessions; then computes an **overall resilience score** from `scoring.weights` (default: mutation 0.20, chaos 0.35, contract 0.35, replay 0.10). Weights must sum to 1.0.
**Reports:** Use `flakestorm contract run --output report.html` and `flakestorm replay run --output report.html` to save HTML reports; both include **suggested actions** for failed cells or sessions (e.g. add reset_endpoint, tighten invariants). Replay accepts a single session file or a directory: `flakestorm replay run path/to/session.yaml` or `flakestorm replay run path/to/replays/`.
**Contract matrix isolation (V2):** Each (invariant × scenario) cell is independent. Configure `agent.reset_endpoint` (HTTP) or `agent.reset_function` (Python) to clear agent state between cells; if not set and the agent is stateful, Flakestorm warns. See [V2 Spec — Contract matrix isolation](V2_SPEC.md#contract-matrix-isolation).
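Per-cell isolation amounts to resetting agent state before every (invariant × scenario) pair. The sketch below shows that loop in isolation; `run_matrix`, `run_cell`, and `reset` are hypothetical stand-ins for the contract engine, the per-cell check, and whatever `agent.reset_endpoint` or `agent.reset_function` resolves to.

```python
from itertools import product


def run_matrix(invariants, scenarios, run_cell, reset=None):
    """Run every invariant against every scenario, resetting state before each cell."""
    results = {}
    for invariant, scenario in product(invariants, scenarios):
        if reset is not None:
            reset()  # clear agent state so one cell cannot leak into the next
        results[(invariant, scenario)] = run_cell(invariant, scenario)
    return results


# Toy usage: count resets and record a trivial pass/fail per cell.
reset_calls = []
matrix = run_matrix(
    invariants=["no_pii", "valid_json"],
    scenarios=["api_outage", "rate_limit", "baseline"],
    run_cell=lambda invariant, scenario: scenario != "api_outage",
    reset=lambda: reset_calls.append(1),
)
print(len(matrix), len(reset_calls))
```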
---
@ -819,7 +821,7 @@ golden_prompts:
### Mutation Types
flakestorm generates adversarial variations of your golden prompts across 24 mutation types organized into categories:
flakestorm generates adversarial variations of your golden prompts across 22+ mutation types organized into categories:
#### Prompt-Level Attacks
@ -1121,7 +1123,7 @@ flakestorm provides 22+ mutation types organized into **Prompt-Level Attacks** a
### Choosing Mutation Types
**Comprehensive Testing (Recommended):**
Use all 24 types for complete coverage:
Use all 22+ types for complete coverage:
```yaml
types:
# Original 8 types
@ -1221,7 +1223,7 @@ The 22+ mutation types work together to provide comprehensive robustness testing
- **Infrastructure**: HTTP Header Injection, Payload Size Attack, Content-Type Confusion, Query Parameter Poisoning, Request Method Attack, Protocol-Level Attack, Resource Exhaustion, Concurrent Request Pattern, Timeout Manipulation
- **Temporal/Context**: Temporal Attack, Multi-Turn Attack
For comprehensive testing, use all 24 types. For focused testing:
For comprehensive testing, use all 22+ types. For focused testing:
- **Security-focused**: Emphasize Prompt Injection, Advanced Jailbreak, Protocol-Level Attack, HTTP Header Injection
- **UX-focused**: Emphasize Noise, Tone Shift, Context Manipulation, Language Mixing
- **Infrastructure-focused**: Emphasize all system/network-level types