Update README and usage guide to reflect changes in mutation types and clarify V1/V2 flow. Increased mutation types from 22 to 24 and added details on the new V2 features, including environment chaos and behavioral contracts. Enhanced documentation for clarity on scoring mechanisms and command usage.

Entropix 2026-03-09 12:45:42 +08:00
parent 4c1b43c5d5
commit 4b0ab63f97
2 changed files with 60 additions and 44 deletions


@@ -134,10 +134,12 @@ Flakestorm supports several modes; you can use one or combine them:
- **Chaos only** — Golden prompts → agent with fault-injected tools/LLM → invariants. *Does the agent handle bad environments?* - **Chaos only** — Golden prompts → agent with fault-injected tools/LLM → invariants. *Does the agent handle bad environments?*
- **Contract** — Golden prompts → agent under each chaos scenario → verify named invariants across a matrix. *Does the agent obey its rules under every failure mode?* - **Contract** — Golden prompts → agent under each chaos scenario → verify named invariants across a matrix. *Does the agent obey its rules under every failure mode?*
- **Replay** — Recorded production input + recorded tool responses → agent → contract. *Did we fix this incident?* - **Replay** — Recorded production input + recorded tool responses → agent → contract. *Did we fix this incident?*
- **Mutation (optional)** — Golden prompts → adversarial mutations (22+ types, max 50/run) → agent (optionally under chaos) → invariants. *Does the agent handle bad inputs (and optionally bad environments)?* - **Mutation (optional)** — Golden prompts → adversarial mutations (24 types, max 50/run) → agent (optionally under chaos) → invariants. *Does the agent handle bad inputs (and optionally bad environments)?*
You define **golden prompts**, **invariants** (or a full **contract** with severity and chaos matrix), and optionally **chaos** (tool/LLM faults) and **replay** sessions. Flakestorm runs the chosen mode(s), checks responses against your rules, and produces a **robustness score** (mutation or chaos-only runs) or **resilience score** (contract run), plus HTML report. Use `flakestorm run`, `flakestorm contract run`, `flakestorm replay run`, or `flakestorm ci` for the combined overall score (OSS: run from CLI or your own scripts; **native CI/CD integrations** — scheduled runs, pipeline plugins — are **Cloud only**). You define **golden prompts**, **invariants** (or a full **contract** with severity and chaos matrix), and optionally **chaos** (tool/LLM faults) and **replay** sessions. Flakestorm runs the chosen mode(s), checks responses against your rules, and produces a **robustness score** (mutation or chaos-only runs) or **resilience score** (contract run), plus HTML report. Use `flakestorm run`, `flakestorm contract run`, `flakestorm replay run`, or `flakestorm ci` for the combined overall score (OSS: run from CLI or your own scripts; **native CI/CD integrations** — scheduled runs, pipeline plugins — are **Cloud only**).
For the full **V1 vs V2 flow** (mutation-only vs four pillars, contract matrix isolation, resilience score formula), see the [Usage Guide](docs/USAGE_GUIDE.md#how-it-works).
> **Note**: Mutation generation uses a local LLM (Ollama) or cloud APIs (OpenAI, Claude, Gemini). API keys via environment variables only. See [LLM Providers](docs/LLM_PROVIDERS.md). > **Note**: Mutation generation uses a local LLM (Ollama) or cloud APIs (OpenAI, Claude, Gemini). API keys via environment variables only. See [LLM Providers](docs/LLM_PROVIDERS.md).
## Features ## Features
@@ -150,7 +152,7 @@ You define **golden prompts**, **invariants** (or a full **contract** with sever
### Supporting capabilities ### Supporting capabilities
- **Adversarial mutations** — 22+ mutation types (prompt-level and system/network-level); max 50 mutations per run in OSS. [→ Test Scenarios](docs/TEST_SCENARIOS.md) - **Adversarial mutations** — 24 mutation types (prompt-level and system/network-level); max 50 mutations per run in OSS. [→ Test Scenarios](docs/TEST_SCENARIOS.md)
- **Invariants & assertions** — Deterministic checks, semantic similarity, safety (PII, refusal); configurable per contract. - **Invariants & assertions** — Deterministic checks, semantic similarity, safety (PII, refusal); configurable per contract.
- **Robustness score** — For mutation runs: a single weighted score (0–1) of how well the agent handled adversarial prompts. Reported in HTML/JSON and CLI (`results.statistics.robustness_score`). - **Robustness score** — For mutation runs: a single weighted score (0–1) of how well the agent handled adversarial prompts. Reported in HTML/JSON and CLI (`results.statistics.robustness_score`).
- **Unified resilience score** — For full CI: weighted combination of **mutation robustness**, chaos resilience, contract compliance, and replay regression; weights (mutation, chaos, contract, replay) configurable in YAML and must sum to 1.0. - **Unified resilience score** — For full CI: weighted combination of **mutation robustness**, chaos resilience, contract compliance, and replay regression; weights (mutation, chaos, contract, replay) configurable in YAML and must sum to 1.0.
@@ -163,7 +165,7 @@ You define **golden prompts**, **invariants** (or a full **contract** with sever
## Open Source vs Cloud ## Open Source vs Cloud
**Open Source (Always Free):** **Open Source (Always Free):**
- Core chaos engine with all 22+ mutation types (max 50 per run; no artificial feature gating) - Core chaos engine with all 24 mutation types (max 50 per run; no artificial feature gating)
- Local execution for fast experimentation - Local execution for fast experimentation
- Run from CLI or your own scripts (no native CI/CD; that's Cloud only) - Run from CLI or your own scripts (no native CI/CD; that's Cloud only)
- Full transparency and extensibility - Full transparency and extensibility
@@ -276,4 +278,3 @@ Apache 2.0 - See [LICENSE](LICENSE) for details.
<p align="center"> <p align="center">
❤️ <a href="https://github.com/sponsors/flakestorm">Sponsor Flakestorm on GitHub</a> ❤️ <a href="https://github.com/sponsors/flakestorm">Sponsor Flakestorm on GitHub</a>
</p> </p>


@@ -25,7 +25,10 @@ This comprehensive guide walks you through using flakestorm to test your AI agen
### What is flakestorm? ### What is flakestorm?
flakestorm is an **adversarial testing framework** and **chaos engineering platform** for AI agents. It applies chaos engineering principles to systematically test how your AI agents behave under unexpected, malformed, or adversarial inputs. With **V2** (`version: "2.0"` in config) you get environment chaos (tool/LLM faults, context attacks), behavioral contracts (invariants × chaos matrix), and replay regression; **22+ mutation types** and **max 50 mutations per run** in OSS. API keys for cloud LLM providers must be set via environment variables only (e.g. `api_key: "${OPENAI_API_KEY}"`). See [Configuration Guide](CONFIGURATION_GUIDE.md) and [V2 Spec](V2_SPEC.md). flakestorm is an **adversarial testing framework** and **chaos engineering platform** for AI agents. It applies chaos engineering principles to systematically test how your AI agents behave under unexpected, malformed, or adversarial inputs.
- **V1** (`version: "1.0"` or omitted): Mutation-only mode — golden prompts → mutation engine → agent → invariants → **robustness score**. Ideal for quick adversarial input testing.
- **V2** (`version: "2.0"` in config): Full chaos platform — **Environment Chaos** (tool/LLM faults, context attacks), **Behavioral Contracts** (invariants × chaos matrix with per-cell isolation), and **Replay Regression** (replay production incidents). You get **24 mutation types** and **max 50 mutations per run** in OSS; plus `flakestorm run --chaos`, `flakestorm contract run`, `flakestorm replay run`, and `flakestorm ci` for a unified **resilience score**. API keys for cloud LLM providers must be set via environment variables only (e.g. `api_key: "${OPENAI_API_KEY}"`). See [Configuration Guide](CONFIGURATION_GUIDE.md), [V2 Spec](V2_SPEC.md), and [V2 Audit](V2_AUDIT.md).
### Why Use flakestorm? ### Why Use flakestorm?
@@ -39,47 +42,44 @@ flakestorm is an **adversarial testing framework** and **chaos engineering platf
### How It Works ### How It Works
Flakestorm supports **V1 (mutation-only)** and **V2 (full chaos platform)**. The flow depends on your config version and which commands you run.
#### V1 / Mutation-only flow
With a V1 config (or V2 config without `--chaos`), you get the classic adversarial flow:
```
┌─────────────────────────────────────────────────────────────────┐
│                         flakestorm FLOW                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. GOLDEN PROMPTS             2. MUTATION ENGINE               │
│  ┌─────────────────┐           ┌─────────────────┐              │
│  │ "Book a flight  │    ───►   │   Local LLM     │              │
│  │ from NYC to LA" │           │  (Qwen/Ollama)  │              │
│  └─────────────────┘           └────────┬────────┘              │
│                                         │                       │
│                                         ▼                       │
│                                ┌─────────────────┐              │
│                                │ Mutated Prompts │              │
│                                │ • Typos         │              │
│                                │ • Paraphrases   │              │
│                                │ • Injections    │              │
│                                └────────┬────────┘              │
│                                         │                       │
│  3. YOUR AGENT                          ▼                       │
│  ┌─────────────────┐           ┌─────────────────┐              │
│  │   AI Agent      │    ◄───   │  Test Runner    │              │
│  │  (HTTP/Python)  │           │    (Async)      │              │
│  └────────┬────────┘           └─────────────────┘              │
│           │                                                     │
│           ▼                                                     │
│  4. VERIFICATION               5. REPORTING                     │
│  ┌─────────────────┐           ┌─────────────────┐              │
│  │    Invariant    │    ───►   │  HTML/JSON/CLI  │              │
│  │   Assertions    │           │     Reports     │              │
│  └─────────────────┘           └─────────────────┘              │
│                                         │                       │
│                                         ▼                       │
│                                ┌─────────────────┐              │
│                                │   Robustness    │              │
│                                │   Score: 0.85   │              │
│                                └─────────────────┘              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
```
┌─────────────────────────────────────────────────────────────────┐
│               flakestorm V1 — MUTATION-ONLY FLOW                │
├─────────────────────────────────────────────────────────────────┤
│  1. GOLDEN PROMPTS  →  2. MUTATION ENGINE (Local LLM)           │
│  "Book a flight"    →  Mutated prompts (typos, paraphrases,     │
│                        injections, encoding, etc. — 24 types)   │
│                              ↓                                  │
│  3. YOUR AGENT      ←  Test Runner sends each mutated prompt    │
│  (HTTP/Python)               ↓                                  │
│  4. INVARIANT ASSERTIONS  →  5. REPORTING                       │
│  (contains, latency, similarity, safety) → Robustness Score     │
└─────────────────────────────────────────────────────────────────┘
```
**Commands:** `flakestorm run` (no `--chaos`) → **Robustness score** (0–1).
#### V2 flow — Four pillars
With **`version: "2.0"`** in your config, Flakestorm adds environment chaos, behavioral contracts, and replay regression. See [V2 Spec](V2_SPEC.md) and [V2 Audit](V2_AUDIT.md).
| Pillar | What runs | Score / output |
|--------|-----------|----------------|
| **Mutation run** | Golden prompts → 24 mutation types → agent → invariants | **Robustness score** (0–1). Use `flakestorm run` or `flakestorm run --chaos` to include chaos. |
| **Environment chaos** | Fault injection into tools and LLM (timeouts, errors, rate limits, malformed responses, context attacks) | **Chaos resilience** (0–1). Use `flakestorm run --chaos` (with mutations) or `flakestorm run --chaos --chaos-only` (no mutations). |
| **Behavioral contracts** | Contracts (invariants × severity) × chaos matrix scenarios; each cell is an independent run (optional reset per cell). | **Resilience score** (0–100%). Use `flakestorm contract run`. Per-contract formula: weighted by severity (critical×3, high×2, medium×1); **auto-FAIL** if any critical fails. |
| **Replay regression** | Replay saved sessions (e.g. production incidents) and verify against a contract. | Per-session pass/fail; **replay regression** score when run via CI. Use `flakestorm replay run [path]`. |
**Unified CI:** `flakestorm ci` runs mutation run, contract run (if configured), chaos-only run (if chaos configured), and all replay sessions; then computes an **overall resilience score** from `scoring.weights` (default: mutation 0.20, chaos 0.35, contract 0.35, replay 0.10). Weights must sum to 1.0.
**Contract matrix isolation (V2):** Each (invariant × scenario) cell is independent. Configure `agent.reset_endpoint` (HTTP) or `agent.reset_function` (Python) to clear agent state between cells; if not set and the agent is stateful, Flakestorm warns. See [V2 Spec — Contract matrix isolation](V2_SPEC.md#contract-matrix-isolation).
--- ---
## Installation ## Installation
@@ -819,7 +819,7 @@ golden_prompts:
### Mutation Types ### Mutation Types
flakestorm generates adversarial variations of your golden prompts across 22+ mutation types organized into categories: flakestorm generates adversarial variations of your golden prompts across 24 mutation types organized into categories:
#### Prompt-Level Attacks #### Prompt-Level Attacks
@@ -925,6 +925,21 @@ Score = (Weighted Passed Tests) / (Total Weighted Tests)
- **0.7-0.8**: Fair - Needs work - **0.7-0.8**: Fair - Needs work
- **<0.7**: Poor - Significant reliability issues - **<0.7**: Poor - Significant reliability issues
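The robustness formula in the hunk context above (`Score = (Weighted Passed Tests) / (Total Weighted Tests)`) can be illustrated with a minimal Python sketch. The function name, the `(weight, passed)` input shape, and the example weights are assumptions made for illustration; they are not taken from Flakestorm's code, and the real per-mutation weights are not shown in this diff.

```python
def robustness_score(results):
    """Weighted pass rate in the range 0-1.

    `results` is a hypothetical list of (weight, passed) pairs, one per
    mutated prompt; the weights here are illustrative, not Flakestorm's.
    """
    total = sum(weight for weight, _ in results)
    if total == 0:
        raise ValueError("no weighted tests to score")
    passed = sum(weight for weight, ok in results if ok)
    return passed / total


# Three tests: weights 1.0 and 0.5 pass, weight 1.5 fails.
example = [(1.0, True), (1.5, False), (0.5, True)]
print(robustness_score(example))  # 0.5
```

In this example the passed weight is 1.5 out of a total of 3.0, so the score falls in the "Poor" band listed above.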
#### V2 Resilience Score (contract + overall)
When using **V2** (`version: "2.0"`) with behavioral contracts and/or `flakestorm ci`, two additional scores apply. See [V2 Spec](V2_SPEC.md#resilience-score-formula).
**Per-contract score** (for `flakestorm contract run`):
```
score = (Σ(passed_critical×3) + Σ(passed_high×2) + Σ(passed_medium×1))
/ (Σ(total_critical×3) + Σ(total_high×2) + Σ(total_medium×1)) × 100
```
- **Automatic FAIL:** If any **critical** severity invariant fails in any scenario, the overall result is FAIL regardless of the numeric score.
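As a rough illustration of the per-contract formula and the automatic-FAIL rule, here is a small Python sketch. The severity weights (critical×3, high×2, medium×1) come from the text above; the function name, the `(severity, passed)` cell representation, and the returned tuple are assumptions for illustration, not Flakestorm's API.

```python
# Severity weights as stated in the per-contract formula.
SEVERITY_WEIGHTS = {"critical": 3, "high": 2, "medium": 1}


def contract_score(cells):
    """Return (score_percent, auto_fail) for one contract.

    `cells` is a hypothetical list of (severity, passed) pairs, one per
    (invariant x scenario) cell of the chaos matrix.
    """
    possible = sum(SEVERITY_WEIGHTS[sev] for sev, _ in cells)
    if possible == 0:
        raise ValueError("contract has no cells to score")
    earned = sum(SEVERITY_WEIGHTS[sev] for sev, ok in cells if ok)
    score = earned / possible * 100
    # Any failed critical cell forces FAIL regardless of the numeric score.
    auto_fail = any(sev == "critical" and not ok for sev, ok in cells)
    return score, auto_fail


cells = [
    ("critical", True),
    ("critical", False),  # one critical invariant fails in one scenario
    ("high", True),
    ("medium", True),
]
score, auto_fail = contract_score(cells)
print(round(score, 1), auto_fail)  # 66.7 True
```

Here 6 of 9 weighted points are earned, giving about 66.7%, but the failed critical cell marks the result as FAIL anyway.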
**Overall score** (for `flakestorm ci`): Configurable via **`scoring.weights`**. Weights must **sum to 1.0**. Default: mutation 0.20, chaos 0.35, contract 0.35, replay 0.10. The CI run combines mutation robustness, chaos resilience, contract compliance, and replay regression into one weighted overall resilience score.
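The weighted combination can be sketched in Python as follows. The default weights are the ones quoted above; the function name, the dictionary keys, and the assumption that all four component scores are present and on a 0-1 scale are illustrative only. How `flakestorm ci` treats a pillar that is not configured is not described here, so this sketch does not model it.

```python
import math

# Default weights quoted in the text; they must sum to 1.0.
DEFAULT_WEIGHTS = {"mutation": 0.20, "chaos": 0.35, "contract": 0.35, "replay": 0.10}


def overall_resilience(scores, weights=DEFAULT_WEIGHTS):
    """Weighted sum of per-pillar scores (each assumed to be in 0-1).

    `scores` is a hypothetical dict with the same keys as `weights`.
    """
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("scoring weights must sum to 1.0")
    return sum(weights[name] * scores[name] for name in weights)


scores = {"mutation": 0.9, "chaos": 0.8, "contract": 0.7, "replay": 1.0}
print(round(overall_resilience(scores), 3))  # 0.805
```

With the default weights, chaos and contract results dominate the overall score, so a weak mutation score moves the total less than a weak contract score does.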
--- ---
## Understanding Mutation Types ## Understanding Mutation Types
@@ -1106,7 +1121,7 @@ flakestorm provides 22+ mutation types organized into **Prompt-Level Attacks** a
### Choosing Mutation Types ### Choosing Mutation Types
**Comprehensive Testing (Recommended):** **Comprehensive Testing (Recommended):**
Use all 22+ types for complete coverage: Use all 24 types for complete coverage:
```yaml ```yaml
types: types:
# Original 8 types # Original 8 types
@@ -1206,7 +1221,7 @@ The 22+ mutation types work together to provide comprehensive robustness testing
- **Infrastructure**: HTTP Header Injection, Payload Size Attack, Content-Type Confusion, Query Parameter Poisoning, Request Method Attack, Protocol-Level Attack, Resource Exhaustion, Concurrent Request Pattern, Timeout Manipulation - **Infrastructure**: HTTP Header Injection, Payload Size Attack, Content-Type Confusion, Query Parameter Poisoning, Request Method Attack, Protocol-Level Attack, Resource Exhaustion, Concurrent Request Pattern, Timeout Manipulation
- **Temporal/Context**: Temporal Attack, Multi-Turn Attack - **Temporal/Context**: Temporal Attack, Multi-Turn Attack
For comprehensive testing, use all 22+ types. For focused testing: For comprehensive testing, use all 24 types. For focused testing:
- **Security-focused**: Emphasize Prompt Injection, Advanced Jailbreak, Protocol-Level Attack, HTTP Header Injection - **Security-focused**: Emphasize Prompt Injection, Advanced Jailbreak, Protocol-Level Attack, HTTP Header Injection
- **UX-focused**: Emphasize Noise, Tone Shift, Context Manipulation, Language Mixing - **UX-focused**: Emphasize Noise, Tone Shift, Context Manipulation, Language Mixing
- **Infrastructure-focused**: Emphasize all system/network-level types - **Infrastructure-focused**: Emphasize all system/network-level types