Update documentation and configuration for Flakestorm V2, enhancing clarity on CI processes, report generation, and reproducibility features. Added details on the new --output option for saving reports, clarified the use of --min-score, and improved descriptions of the seed configuration for deterministic runs. Updated README and usage guides to reflect these changes and ensure comprehensive understanding of the CI pipeline and report outputs.

Francisco M Humarang Jr. 2026-03-12 20:05:51 +08:00
parent 4a13425f8a
commit f4d45d4053
14 changed files with 356 additions and 49 deletions

@@ -538,13 +538,24 @@ flakestorm replay export --from-report FILE # Export from an existing report
### V2: `flakestorm ci`
Run the full CI pipeline: a mutation run, a contract run (if configured), a chaos-only run (if chaos is configured), and a replay run (if configured); then compute the overall weighted score from `scoring.weights`. The command writes a **CI summary report** (e.g. `flakestorm-ci-report.html`) with per-phase scores and **"View detailed report"** links to the phase-specific reports (mutation, contract, chaos, replay). The contract phase's PASS/FAIL status in the summary matches the contract detailed report (FAIL if any critical invariant fails).
```bash
flakestorm ci
flakestorm ci --config custom.yaml
flakestorm ci --min-score 0.5 # Fail if overall score below 0.5
flakestorm ci --output ./reports # Save summary + detailed reports to directory
flakestorm ci --output report.html # Save summary report to file
flakestorm ci --quiet # Minimal output, no progress bars
```
| Option | Description |
|--------|-------------|
| `--config`, `-c` | Config file path (default: `flakestorm.yaml`) |
| `--min-score` | Minimum overall (weighted) score to pass (default: 0.0) |
| `--output`, `-o` | Path to save reports: directory (creates `flakestorm-ci-report.html` + phase reports) or HTML file path |
| `--quiet`, `-q` | Minimal output, no progress bars |
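
For teams that drive the CLI from a Python build script, the options in the table above can be assembled programmatically. The following is a minimal sketch, not part of Flakestorm: `build_ci_command` is a hypothetical helper, and the `[0.0, 1.0]` range check on `min_score` is an assumption based on the default of `0.0` and the weighted-score description.

```python
import shlex


def build_ci_command(config=None, min_score=None, output=None, quiet=False):
    """Assemble argv for `flakestorm ci` using only the documented options."""
    argv = ["flakestorm", "ci"]
    if config is not None:
        argv += ["--config", str(config)]
    if min_score is not None:
        # Assumption: the overall score is a fraction between 0.0 and 1.0.
        if not 0.0 <= min_score <= 1.0:
            raise ValueError("min_score is expected to lie in [0.0, 1.0]")
        argv += ["--min-score", str(min_score)]
    if output is not None:
        # May be a directory or an HTML file path, per the table above.
        argv += ["--output", str(output)]
    if quiet:
        argv.append("--quiet")
    return argv


cmd = build_ci_command(min_score=0.5, output="./reports", quiet=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)`. How the command signals a failed threshold (for example, through its exit code) is not spelled out here, so verify that before relying on it in a pipeline.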
---
## Environment Variables

@@ -960,7 +960,7 @@ advanced:
|--------|------|---------|-------------|
| `concurrency` | integer | `10` | Max concurrent agent requests (1-100) |
| `retries` | integer | `2` | Retry failed requests (0-5) |
| `seed` | integer | null | **Reproducible runs:** when set, Python's `random` module is seeded (fixing chaos behavior) and the mutation-generation LLM uses temperature=0, so the same config yields the same results from run to run. Omit for exploratory runs where variation is acceptable. |
---

@@ -107,7 +107,7 @@ This separation allows:
### Q: What does `flakestorm ci` run?
**A:** It runs, in order: (1) mutation run (with chaos if configured), (2) contract run if `contract` + `chaos_matrix` are configured, (3) chaos-only run if chaos is configured, (4) replay run if `replays` is configured. Then it computes an **overall weighted score** from `scoring.weights` (mutation, chaos, contract, replay); weights must sum to 1.0. Default weights: mutation 0.20, chaos 0.35, contract 0.35, replay 0.10. It also writes a **CI summary report** (e.g. `flakestorm-ci-report.html`) with per-phase scores and links to **detailed reports** (mutation, contract, chaos, replay). Contract phase PASS/FAIL in the summary matches the contract detailed report (FAIL if any critical invariant fails). Use `--output` to control where reports are saved and `--min-score` for the overall pass threshold.
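
The weighted score described in the answer above can be worked through by hand. The sketch below assumes that all four phases ran and that each phase score is a fraction between 0.0 and 1.0. The per-phase scores are made up for illustration, and `overall_score` is a hypothetical helper, not Flakestorm's implementation.

```python
import math

# Default weights quoted in the answer above.
DEFAULT_WEIGHTS = {"mutation": 0.20, "chaos": 0.35, "contract": 0.35, "replay": 0.10}


def overall_score(phase_scores, weights=DEFAULT_WEIGHTS):
    """Weighted sum of per-phase scores; weights must sum to 1.0."""
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("scoring.weights must sum to 1.0")
    return sum(weights[phase] * phase_scores[phase] for phase in weights)


# Hypothetical per-phase scores, not output from a real run.
scores = {"mutation": 0.90, "chaos": 0.80, "contract": 1.00, "replay": 0.50}
overall = overall_score(scores)

# Mirrors `--min-score 0.5`: the run fails only if the score is below 0.5.
passed = overall >= 0.5
print(f"overall={overall:.2f} passed={passed}")
```

With these inputs the terms are 0.18 + 0.28 + 0.35 + 0.05, giving an overall score of 0.86, which clears a `--min-score` of 0.5. How the real command treats phases that are not configured is not covered by this sketch.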
---

@@ -76,7 +76,7 @@ With **`version: "2.0"`** in your config, Flakestorm adds environment chaos, beh
| **Behavioral contracts** | Contracts (invariants × severity) × chaos matrix scenarios; each cell is an independent run (optional reset per cell). | **Resilience score** (0–100%). Use `flakestorm contract run`. Per-contract formula: weighted by severity (critical×3, high×2, medium×1); **auto-FAIL** if any critical fails. |
| **Replay regression** | Replay saved sessions (e.g. production incidents) and verify against a contract. | Per-session pass/fail; **replay regression** score when run via CI. Use `flakestorm replay run [path]`. |
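
The per-contract formula in the table above can be read as a severity-weighted pass rate with a separate auto-fail flag. The sketch below is one plausible interpretation, not the actual implementation: the exact aggregation, the handling of other severity levels, and the name `contract_score` are assumptions.

```python
# Severity weights quoted in the table above.
SEVERITY_WEIGHT = {"critical": 3, "high": 2, "medium": 1}


def contract_score(results):
    """Score a list of (severity, passed) pairs, one per invariant check.

    Returns (score_percent, auto_fail). The score is the weighted share of
    passing checks; auto_fail is True whenever a critical check failed.
    """
    total = sum(SEVERITY_WEIGHT[severity] for severity, _ in results)
    earned = sum(SEVERITY_WEIGHT[severity] for severity, ok in results if ok)
    score = 100.0 * earned / total if total else 0.0
    auto_fail = any(severity == "critical" and not ok for severity, ok in results)
    return score, auto_fail


# Hypothetical results: a high-severity failure lowers the score only.
print(contract_score([("critical", True), ("high", False), ("medium", True)]))
# Hypothetical results: a critical failure also raises the auto-fail flag.
print(contract_score([("critical", False), ("high", True), ("medium", True)]))
```

In the first case 4 of 6 weighted points are earned and nothing is flagged. In the second, 3 of 6 are earned and the critical failure sets the flag regardless of the score.
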
**Unified CI:** `flakestorm ci` runs the mutation run, the contract run (if configured), the chaos-only run (if chaos is configured), and all replay sessions; it then computes an **overall resilience score** from `scoring.weights` (default: mutation 0.20, chaos 0.35, contract 0.35, replay 0.10). Weights must sum to 1.0. It writes a **CI summary report** (e.g. `flakestorm-ci-report.html`) with per-phase scores and links to **detailed reports** (mutation, contract, chaos, replay). Contract PASS/FAIL in the summary matches the contract detailed report (FAIL if any critical invariant fails). Use `--output DIR` or `--output report.html` to choose where reports are saved, and `--min-score N` to set the overall pass threshold.
**Reports:** Use `flakestorm contract run --output report.html` and `flakestorm replay run --output report.html` to save HTML reports; both include **suggested actions** for failed cells or sessions (e.g. add reset_endpoint, tighten invariants). Replay accepts a single session file or a directory: `flakestorm replay run path/to/session.yaml` or `flakestorm replay run path/to/replays/`.
@@ -1858,6 +1858,22 @@ advanced:
retries: 3 # Retry failed requests 3 times
```
### Reproducible Runs
By default, mutation generation (LLM) and chaos (e.g. fault triggers, payload choice) can vary between runs, so scores may differ. For **deterministic, reproducible runs** (e.g. CI or regression checks), set a **random seed** in config:
```yaml
advanced:
  seed: 42  # Same config → same mutations and chaos → same scores
```
When `advanced.seed` is set:
- **Python random** is seeded at run start, so chaos behavior (which faults trigger, which payloads) is fixed.
- The **mutation-generation LLM** uses temperature=0, so the same golden prompts produce the same mutations each run.
Use a fixed seed when you need comparable run-to-run results; omit it for exploratory testing where variation is acceptable.
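
The effect of seeding on chaos decisions can be illustrated with Python's standard `random` module. This is a minimal sketch, not Flakestorm code: `simulate_chaos` and the fault names are hypothetical, and it uses a private `random.Random` instance instead of seeding the global generator so that it does not disturb other code. The temperature=0 half of the mechanism is not modeled here.

```python
import random


def simulate_chaos(seed, steps=5):
    """Pick one fault per step from a seeded generator (illustrative only)."""
    rng = random.Random(seed)  # seed=None falls back to an unpredictable seed
    faults = ["timeout", "http_500", "malformed_payload", "none"]  # hypothetical
    return [rng.choice(faults) for _ in range(steps)]


run_a = simulate_chaos(seed=42)
run_b = simulate_chaos(seed=42)
print(run_a == run_b)  # the same seed reproduces the same fault sequence
```

Calling `simulate_chaos(seed=None)` twice will usually give different sequences, which corresponds to omitting `advanced.seed` for exploratory runs.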
### Golden Prompt Guide
A comprehensive guide to creating effective golden prompts for your agent.