Enhance documentation for Flakestorm V2 features, including detailed updates on behavioral contracts, context attacks, and scoring mechanisms. Add new configuration options for state isolation in agents, clarify context attack types, and improve contract report generation with suggested actions for failures. Update various guides to reflect the latest changes in chaos engineering capabilities and replay regression functionality.

Francisco M Humarang Jr. 2026-03-08 20:29:48 +08:00
parent 902c5d8ac4
commit 4c1b43c5d5
17 changed files with 518 additions and 91 deletions

@@ -13,10 +13,15 @@ If neither is provided, Flakestorm **fails with a clear error** (does not silent
Each (invariant × scenario) cell is an **independent invocation**. Agent state must not leak between cells.
- **Reset is optional:** configure `agent.reset_endpoint` (HTTP) or `agent.reset_function` (Python) to clear state before each cell.
- If no reset is configured and the agent **appears stateful** (response variance across identical inputs), Flakestorm **warns** (does not fail):
- **Reset is optional:** configure `agent.reset_endpoint` (HTTP) or `agent.reset_function` (Python module path, e.g. `myagent:reset_state`) to clear state before each cell.
- If no reset is configured and the agent **appears stateful** (same prompt produces different responses on two calls), Flakestorm **warns** (does not fail):
*"Warning: No reset_endpoint configured. Contract matrix cells may share state. Results may be contaminated. Add reset_endpoint to your config for accurate isolation."*
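The isolation rule above can be illustrated with a minimal Python sketch. It is not Flakestorm's implementation: `appears_stateful`, `run_cell`, and `CountingAgent` are hypothetical names, and comparing two responses by exact string equality is an assumption about how "same prompt produces different responses on two calls" might be checked.

```python
from typing import Callable, Optional


def appears_stateful(agent: Callable[[str], str], prompt: str) -> bool:
    """Return True if the same prompt yields different responses on two calls.

    Assumption: exact string comparison; a real check may be fuzzier.
    """
    first = agent(prompt)
    second = agent(prompt)
    return first != second


def run_cell(
    agent: Callable[[str], str],
    prompt: str,
    reset: Optional[Callable[[], None]] = None,
) -> str:
    """Run one (invariant x scenario) cell, clearing agent state first if a reset hook exists."""
    if reset is not None:
        reset()
    return agent(prompt)


class CountingAgent:
    """Hypothetical stateful agent: its response depends on how often it was called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"{prompt} #{self.calls}"

    def reset_state(self) -> None:
        # Plays the role of a function referenced as `myagent:reset_state`.
        self.calls = 0
```

With the reset hook passed to `run_cell`, every cell sees the agent in the same initial state; without it, the call counter carries over from one cell to the next.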
## Contract invariants: system_prompt_leak_probe and behavior_unchanged
- **system_prompt_leak_probe:** Use a contract invariant with **`probes`** (a list of probe prompts). For that invariant, the contract engine runs the probes instead of `golden_prompts` and checks each response (e.g. with `excludes_pattern`) to verify that the agent does not leak the system prompt.
- **behavior_unchanged:** Use invariant type `behavior_unchanged`. Set **`baseline`** to `auto` to compute a baseline from one run without chaos, or provide a manual baseline string. The response is compared against the baseline using **`similarity_threshold`** (default 0.75).
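As a rough illustration of the two invariants, the sketch below checks a response against an exclusion pattern and against a baseline. The function names are hypothetical, and the similarity metric is an assumption: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in, since the metric Flakestorm actually uses is not stated here. Treating a ratio equal to the threshold as a pass is also an assumption.

```python
import re
from difflib import SequenceMatcher


def leaks_system_prompt(response: str, excludes_pattern: str) -> bool:
    """Return True if the response matches the pattern that must be excluded."""
    return re.search(excludes_pattern, response, flags=re.IGNORECASE) is not None


def behavior_unchanged(
    response: str, baseline: str, similarity_threshold: float = 0.75
) -> bool:
    """Return True if the response is similar enough to the baseline.

    Assumption: character-level SequenceMatcher ratio as the similarity metric.
    """
    ratio = SequenceMatcher(None, baseline, response).ratio()
    return ratio >= similarity_threshold
```

A probe response such as "My system prompt is: ..." would fail an `excludes_pattern` of `system prompt`, while a refusal that never mentions it would pass.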
## Resilience score formula
**Per-contract score:**
@@ -28,4 +33,4 @@ score = (Σ(passed_critical×3) + Σ(passed_high×2) + Σ(passed_medium×1))
**Automatic FAIL:** If any **critical** severity invariant fails in any scenario, the overall result is FAIL regardless of the numeric score.
**Overall score (mutation + chaos + contract + replay):** Configurable via `scoring.weights` (default: mutation 20%, chaos 35%, contract 35%, replay 10%).
**Overall score (mutation + chaos + contract + replay):** Configurable via **`scoring.weights`**. Weights must **sum to 1.0** (validation enforced). Default: mutation 0.20, chaos 0.35, contract 0.35, replay 0.10.
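The weight validation and the default weights can be sketched in Python as follows. This is an illustration under stated assumptions, not Flakestorm's code: `validate_weights` and `overall_score` are hypothetical names, the overall score is assumed to be a plain weighted sum, and component scores are assumed to lie on a 0 to 1 scale.

```python
import math
from typing import Dict

# Defaults as stated in the documentation.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "mutation": 0.20,
    "chaos": 0.35,
    "contract": 0.35,
    "replay": 0.10,
}


def validate_weights(weights: Dict[str, float]) -> None:
    """Raise ValueError unless the weights sum to 1.0 (within float tolerance)."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0, rel_tol=1e-9):
        raise ValueError(f"scoring.weights must sum to 1.0, got {total}")


def overall_score(
    scores: Dict[str, float], weights: Dict[str, float] = DEFAULT_WEIGHTS
) -> float:
    """Combine component scores into one number (assumed: weighted sum)."""
    validate_weights(weights)
    return sum(weights[name] * scores[name] for name in weights)
```

Comparing with `math.isclose` rather than `==` avoids rejecting valid weights because of floating-point rounding in the sum.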