Enhance documentation and replay functionality in Flakestorm. Update the README to clarify the V2 Spec and add references to LangSmith sources in the configuration guide. Improve replay regression by allowing imports from LangSmith projects and runs, with filtering options. Add new configuration classes for LangSmith project and run sources. Update the replay loader to support project imports and refine the session resolution logic.

2026-06-29 20:19:37 +02:00 · 2026-03-07 02:04:55 +08:00 · 2026-03-07 02:04:55 +08:00 · 1bbe3a1f7b
commit 1bbe3a1f7b
parent 58f49b08ba
10 changed files with 419 additions and 61 deletions
--- a/docs/BEHAVIORAL_CONTRACTS.md
+++ b/docs/BEHAVIORAL_CONTRACTS.md
@@ -82,6 +82,8 @@ Each entry is a **scenario**: a name plus optional `tool_faults`, `llm_faults`,
 - **Weights:** critical = 3, high = 2, medium = 1, low = 1.
 - **Automatic FAIL:** If any invariant with severity `critical` fails in any scenario, the contract is considered failed regardless of the numeric score.

+See [V2 Spec](V2_SPEC.md) for the exact formula and matrix isolation (reset) behavior.
+
 ---

 ## Commands
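
The severity weights and the automatic-FAIL rule in the `BEHAVIORAL_CONTRACTS.md` hunk above can be sketched in a few lines of Python. This is a minimal illustration, not Flakestorm's implementation: the exact formula lives in `V2_SPEC.md`, which is not part of this diff, so the ratio used here (weight of passing cells divided by weight of all cells) is an assumption, and `CellResult` and `resilience_score` are hypothetical names.

```python
from dataclasses import dataclass

# Severity weights as listed in docs/BEHAVIORAL_CONTRACTS.md.
SEVERITY_WEIGHTS = {"critical": 3, "high": 2, "medium": 1, "low": 1}


@dataclass(frozen=True)
class CellResult:
    """One (invariant, scenario) cell of the chaos matrix (hypothetical type)."""
    invariant: str
    scenario: str
    severity: str
    passed: bool


def resilience_score(cells):
    """Return (score, auto_failed) for a list of CellResult.

    Assumption: score = weight of passing cells / weight of all cells.
    auto_failed is True when any critical invariant failed in any scenario,
    which the docs say fails the contract regardless of the numeric score.
    """
    total = sum(SEVERITY_WEIGHTS[c.severity] for c in cells)
    earned = sum(SEVERITY_WEIGHTS[c.severity] for c in cells if c.passed)
    score = earned / total if total else 1.0
    auto_failed = any(c.severity == "critical" and not c.passed for c in cells)
    return score, auto_failed


cells = [
    CellResult("no_pii_leak", "tool_timeout", "critical", True),
    CellResult("cites_sources", "tool_timeout", "high", False),
    CellResult("polite_tone", "tool_timeout", "low", True),
]
score, auto_failed = resilience_score(cells)
print(f"score={score:.2f} auto_failed={auto_failed}")
```

Keeping the automatic-FAIL flag separate from the numeric score makes it explicit that a high score cannot mask a failed critical invariant.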
--- a/docs/CONFIGURATION_GUIDE.md
+++ b/docs/CONFIGURATION_GUIDE.md
@@ -45,9 +45,10 @@ With `version: "2.0"` you can add the three **chaos engineering pillars** and a

 | Block | Purpose | Documentation |
 |-------|---------|---------------|
-| `chaos` | **Environment chaos** — Inject faults into tools, LLMs, and context (timeouts, errors, rate limits, context attacks). | [Environment Chaos](ENVIRONMENT_CHAOS.md) |
+| `chaos` | **Environment chaos** — Inject faults into tools, LLMs, and context (timeouts, errors, rate limits, context attacks, **response_drift**). | [Environment Chaos](ENVIRONMENT_CHAOS.md) |
 | `contract` + `chaos_matrix` | **Behavioral contracts** — Named invariants verified across a matrix of chaos scenarios; produces a resilience score. | [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md) |
 | `replays.sessions` | **Replay regression** — Import production failure sessions and replay them as deterministic tests. | [Replay Regression](REPLAY_REGRESSION.md) |
+| `replays.sources` | **LangSmith sources** — Import from a LangSmith project or by run ID; `auto_import` re-fetches on each run/ci. | [Replay Regression](REPLAY_REGRESSION.md) |
 | `scoring` | **Unified score** — Weights for mutation_robustness, chaos_resilience, contract_compliance, replay_regression (used by `flakestorm ci`). | See [README](../README.md) “Scores at a glance” |

 **Context attacks** (chaos on tool/context, not the user prompt) are configured under `chaos.context_attacks`. See [Context Attacks](CONTEXT_ATTACKS.md).
--- a/docs/ENVIRONMENT_CHAOS.md
+++ b/docs/ENVIRONMENT_CHAOS.md
@@ -110,4 +110,10 @@ chaos:
 - `high_latency` — Delayed responses.
 - `indirect_injection` — Context attack profile (inject into tool/context).

-Profile YAMLs live in `src/flakestorm/chaos/profiles/`. Use with `--chaos-profile NAME`.
+Profile YAMLs live in `src/flakestorm/chaos/profiles/`. Use with `--chaos-profile NAME`. The **`model_version_drift`** profile exercises the LLM fault type **`response_drift`**.
+
+---
+
+## See also
+
+- [Context Attacks](CONTEXT_ATTACKS.md) — Indirect injection, memory poisoning.
--- a/docs/REPLAY_REGRESSION.md
+++ b/docs/REPLAY_REGRESSION.md
@@ -63,7 +63,7 @@ Flakestorm resolves name first, then path; if not found, replay may fail or fall

 ## Configuration in flakestorm.yaml

-You can define replay sessions inline or by file:
+You can define replay sessions inline, by file, or via **LangSmith sources**:

 ```yaml
 version: "2.0"
@@ -76,9 +76,20 @@ replays:
      input: "What is the capital of France?"
      contract: "Research Agent Contract"
      tool_responses: []
+  # LangSmith sources (import by project or run ID; auto_import re-fetches on each run/ci)
+  sources:
+    - type: langsmith
+      project: "my-production-agent"
+      filter:
+        status: error           # error | warning | all
+        date_range: last_7_days
+        min_latency_ms: 5000
+      auto_import: true
+    - type: langsmith_run
+      run_id: "abc123def456"
 ```

-When you use `file:`, the session’s `id`, `input`, and `contract` come from the loaded file. When you use inline `id` and `input`, you must provide them.
+When you use `file:`, the session’s `id`, `input`, and `contract` come from the loaded file. When you use inline `id` and `input`, you must provide them. **`replays.sources`** sessions are merged when running `flakestorm ci` or when `auto_import` is true (project sources).

 ---

@@ -89,9 +100,10 @@ When you use `file:`, the session’s `id`, `input`, and `contract` come from th
 | `flakestorm replay run path/to/replay.yaml -c flakestorm.yaml` | Run a single replay file. `-c` supplies agent and contract config. |
 | `flakestorm replay run path/to/dir -c flakestorm.yaml` | Run all replay files in the directory. |
 | `flakestorm replay export --from-report REPORT.json --output ./replays` | Export failed mutations from a Flakestorm report as replay YAML files. |
-| `flakestorm replay import --from-langsmith RUN_ID` | Import a session from LangSmith (requires `flakestorm[langsmith]`). |
-| `flakestorm replay import --from-langsmith RUN_ID --run` | Import and run the replay. |
-| `flakestorm ci -c flakestorm.yaml` | Runs mutation, contract, chaos-only, **and all sessions in `replays.sessions`**; reports **replay_regression** (passed/total) and **overall** weighted score. |
+| `flakestorm replay run --from-langsmith RUN_ID -c flakestorm.yaml` | Import a single session from LangSmith by run ID (requires `flakestorm[langsmith]`). |
+| `flakestorm replay run --from-langsmith RUN_ID --run -o replay.yaml` | Import, optionally write to file, and run the replay. |
+| `flakestorm replay run --from-langsmith-project PROJECT --filter-status error -o ./replays/` | Import all runs from a LangSmith project; write one YAML per run. Add `--run` to run after import. |
+| `flakestorm ci -c flakestorm.yaml` | Runs mutation, contract, chaos-only, **and all replay sessions** (including `replays.sources` with `auto_import`); reports **replay_regression** and **overall** weighted score. |

 ---

@@ -99,7 +111,8 @@ When you use `file:`, the session’s `id`, `input`, and `contract` come from th

 - **Manual** — Write YAML/JSON replay files from incident reports.
 - **Flakestorm export** — `flakestorm replay export --from-report REPORT.json` turns failed runs into replay files.
-- **LangSmith** — `flakestorm replay import --from-langsmith RUN_ID` (requires `pip install flakestorm[langsmith]`).
+- **LangSmith (single run)** — `flakestorm replay run --from-langsmith RUN_ID` (requires `pip install flakestorm[langsmith]`).
+- **LangSmith (project)** — `flakestorm replay run --from-langsmith-project PROJECT --filter-status error -o ./replays/` imports failed runs from a project; or use `replays.sources` in config with `auto_import: true` so CI re-fetches from the project each run.

 ---
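
The `filter` block under `replays.sources` in the hunk above (`status`, `date_range`, `min_latency_ms`) can be read as a predicate over runs. The sketch below shows one plausible reading in plain Python. It does not call the LangSmith SDK or Flakestorm's loader; the run dictionary shape, the function name `matches_filter`, and the mapping of `last_7_days` to a seven-day age limit are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def matches_filter(run, *, status="all", min_latency_ms=None,
                   max_age_days=None, now=None):
    """Return True if a run passes the filter (hypothetical helper).

    `run` is assumed to be a dict with keys "status" (str),
    "latency_ms" (number) and "start_time" (timezone-aware datetime).
    """
    now = now or datetime.now(timezone.utc)
    # status: "all" disables the check; otherwise require an exact match.
    if status != "all" and run["status"] != status:
        return False
    # min_latency_ms: keep only runs at least this slow.
    if min_latency_ms is not None and run["latency_ms"] < min_latency_ms:
        return False
    # date_range: last_7_days is modelled here as max_age_days=7.
    if max_age_days is not None and now - run["start_time"] > timedelta(days=max_age_days):
        return False
    return True


now = datetime(2026, 3, 7, tzinfo=timezone.utc)
runs = [
    {"id": "r1", "status": "error", "latency_ms": 6200,
     "start_time": now - timedelta(days=2)},
    {"id": "r2", "status": "error", "latency_ms": 900,
     "start_time": now - timedelta(days=1)},
    {"id": "r3", "status": "success", "latency_ms": 8000,
     "start_time": now - timedelta(days=3)},
    {"id": "r4", "status": "error", "latency_ms": 7000,
     "start_time": now - timedelta(days=30)},
]
selected = [
    r["id"] for r in runs
    if matches_filter(r, status="error", min_latency_ms=5000,
                      max_age_days=7, now=now)
]
print(selected)
```

Passing `now` explicitly keeps the predicate deterministic, which matters if the same filter is re-evaluated on every CI run via `auto_import`.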

--- a/docs/V2_AUDIT.md
+++ b/docs/V2_AUDIT.md
@@ -68,16 +68,62 @@ Verification of the codebase against the PRD and addendum: behavior, config sche

 ---

-## 6. Addendum — Context Attacks, Model Drift, LangSmith, Spec
+## 6. Addendum (flakestorm-v2-addendum.md) — Full Checklist

-| Item | Status |
-|------|--------|
-| Context attacks module (indirect_injection, etc.) | ✅ `chaos/context_attacks.py`; profile `indirect_injection.yaml` |
-| response_drift in llm_proxy | ✅ `chaos/llm_proxy.py` (json_field_rename, verbosity_shift, format_change, refusal_rephrase, tone_shift) |
-| LangSmith load + schema check | ✅ `replay/loader.py`: `load_langsmith_run`, `_validate_langsmith_run_schema` |
-| Python tool fault: fail loudly when no tools | ✅ `create_instrumented_adapter` raises if type=python and tool_faults |
-| Contract matrix isolation (reset) | ✅ Optional reset; warning if stateful and no reset |
-| Resilience score formula (addendum §6.3) | ✅ In `contracts/matrix.py` and `docs/V2_SPEC.md` |
+### Addition 1 — Context Attacks Module
+
+| Requirement | Status | Notes |
+|-------------|--------|------|
+| `chaos/context_attacks.py` | ✅ | `ContextAttackEngine`, `maybe_inject_indirect()` |
+| indirect_injection (inject payloads into tool response) | ✅ | Wired via engine; profile `indirect_injection.yaml` |
+| memory_poisoning, system_prompt_leak_probe | ⚠️ | Docstring/config types exist; memory_poisoning inject step and leak probe as contract assertion are not fully wired in execution flow |
+| Contract invariants: excludes_pattern, behavior_unchanged | ✅ | `assertions/verifier.py`; use for system_prompt_not_leaked, injection_not_executed |
+| Config: `chaos.context_attacks` list with type (e.g. indirect_injection) | ✅ | `ContextAttackConfig` in `core/config.py` |
+
+### Addition 2 — Model Version Drift (response_drift)
+
+| Requirement | Status | Notes |
+|-------------|--------|------|
+| `response_drift` in llm_faults | ✅ | `chaos/llm_proxy.py`: `apply_llm_response_drift`, drift_type, severity, direction, factor |
+| drift_type: json_field_rename, verbosity_shift, format_change, refusal_rephrase, tone_shift | ✅ | Implemented in llm_proxy |
+| Profile `model_version_drift.yaml` | ✅ | `chaos/profiles/model_version_drift.yaml` |
+
+### Addition 3 — Multi-Agent Failure Propagation
+
+| Requirement | Status | Notes |
+|-------------|--------|------|
+| v3 roadmap placeholder, no v2 implementation | ✅ | Documented in ROADMAP.md as V3; no code required |
+
+### Addition 4 — Resilience Certificate Export
+
+| Requirement | Status | Notes |
+|-------------|--------|------|
+| `flakestorm certificate` CLI command | ❌ | Not implemented |
+| `reports/certificate.py` (PDF/HTML certificate) | ❌ | Not implemented |
+| Config `certificate.tester_name`, pass_threshold, output_format | ❌ | Not implemented |
+
+### Addition 5 — LangSmith Replay Import
+
+| Requirement | Status | Notes |
+|-------------|--------|------|
+| Import single run by ID: `flakestorm replay --from-langsmith RUN_ID` | ✅ | `replay/loader.py`: `load_langsmith_run(run_id)`; CLI option |
+| Import and run: `--from-langsmith RUN_ID --run` | ✅ | `_replay_async` supports run_after_import |
+| Schema validation (fail clearly if LangSmith API changed) | ✅ | `_validate_langsmith_run_schema` |
+| Map run inputs/outputs/child_runs to ReplaySessionConfig | ✅ | `_langsmith_run_to_session` |
+| `--from-langsmith-project PROJECT` + `--filter-status` + `--output` | ✅ | `replay run --from-langsmith-project X --filter-status error -o ./replays/`; writes YAML per run |
+| `replays.sources` (type: langsmith \| langsmith_run, project, filter, auto_import) | ✅ | `LangSmithProjectSourceConfig`, `LangSmithRunSourceConfig`, `ReplayConfig.sources`; CI uses `resolve_sessions_from_config(..., include_sources=True)` |
+
+### Addition 6 — Implicit Spec Clarifications
+
+| Requirement | Status | Notes |
+|-------------|--------|------|
+| 6.1 Python callables: fail loudly if tool_faults but no tools/ToolRegistry | ✅ | `create_instrumented_adapter` raises with clear message for type=python |
+| 6.2 Contract matrix: reset between cells (reset_endpoint / reset_function) | ✅ | `ContractEngine._reset_agent()`; config fields on AgentConfig |
+| 6.3 Resilience score formula in spec (weighted, auto-FAIL on critical) | ✅ | `contracts/matrix.py` docstring and implementation; `docs/V2_SPEC.md` |
+
+---
+
+**Summary:** Addendum Additions 1, 2, 3, 5, 6 are implemented (with minor gaps on full memory_poisoning/leak_probe wiring). **Addition 4 (Resilience Certificate)** is not implemented.

 ---