mirror of
https://github.com/flakestorm/flakestorm.git
synced 2026-04-25 00:36:54 +02:00
162 lines
8.6 KiB
Markdown
162 lines
8.6 KiB
Markdown
# V2 Implementation Audit
|
||
|
||
**Date:** March 2026
|
||
**Reference:** [Flakestorm v2.md](Flakestorm%20v2.md), [flakestorm-v2-addendum.md](flakestorm-v2-addendum.md)
|
||
|
||
## Scope
|
||
|
||
Verification of the codebase against the PRD and addendum: behavior, config schema, CLI, and examples.
|
||
|
||
---
|
||
|
||
## 1. PRD §8.1 — Environment Chaos
|
||
|
||
| Requirement | Status | Implementation |
|
||
|-------------|--------|----------------|
|
||
| Tool faults: timeout, error, malformed, slow, malicious_response | ✅ | `chaos/faults.py`, `chaos/http_transport.py` (by match_url or tool `*`) |
|
||
| LLM faults: timeout, truncated_response, rate_limit, empty, garbage | ✅ | `chaos/llm_proxy.py`, `chaos/interceptor.py` |
|
||
| probability, after_calls, tool `*` | ✅ | `chaos/faults.should_trigger`, transport and interceptor |
|
||
| Built-in profiles: api_outage, degraded_llm, hostile_tools, high_latency, cascading_failure | ✅ | `chaos/profiles/*.yaml` |
|
||
| InstrumentedAgentAdapter / httpx transport | ✅ | `ChaosInterceptor`, `ChaosHttpTransport`, `HTTPAgentAdapter(transport=...)` |
|
||
|
||
---
|
||
|
||
## 2. PRD §8.2 — Behavioral Contracts
|
||
|
||
| Requirement | Status | Implementation |
|
||
|-------------|--------|----------------|
|
||
| Contract with id, severity, when, negate | ✅ | `ContractInvariantConfig`, `contracts/engine.py` |
|
||
| Chaos matrix (scenarios) | ✅ | `contract.chaos_matrix`, scenario → ChaosConfig per run |
|
||
| Resilience matrix N×M, weighted score | ✅ | `contracts/matrix.py` (critical×3, high×2, medium×1), FAIL if any critical |
|
||
| Invariant types: contains_any, output_not_empty, completes, excludes_pattern, behavior_unchanged | ✅ | Assertions + verifier; contract engine runs verifier with contract invariants |
|
||
| reset_endpoint / reset_function | ✅ | `AgentConfig`, `ContractEngine._reset_agent()` before each cell |
|
||
| Stateful warning when no reset | ✅ | `ContractEngine._detect_stateful_and_warn()`, `STATEFUL_WARNING` |
|
||
|
||
---
|
||
|
||
## 3. PRD §8.3 — Replay-Based Regression
|
||
|
||
| Requirement | Status | Implementation |
|
||
|-------------|--------|----------------|
|
||
| Replay session: input, tool_responses, contract | ✅ | `ReplaySessionConfig`, `replay/loader.py`, `replay/runner.py` |
|
||
| Contract by name or path | ✅ | `resolve_contract()` in loader |
|
||
| Verify against contract | ✅ | `ReplayRunner.run()` uses `InvariantVerifier` with resolved contract |
|
||
| Export from report | ✅ | `flakestorm replay export --from-report FILE` |
|
||
| Replays in config: sessions with file or inline | ✅ | `replays.sessions`; session can have `file` only (load from file) or full inline |
|
||
|
||
---
|
||
|
||
## 4. PRD §9 — Combined Modes & Resilience Score
|
||
|
||
| Requirement | Status | Implementation |
|
||
|-------------|--------|----------------|
|
||
| Mutation only, chaos only, mutation+chaos, contract, replay | ✅ | `run` (with --chaos, --chaos-only), `contract run`, `replay run` |
|
||
| Unified resilience score (mutation_robustness, chaos_resilience, contract_compliance, replay_regression, overall) | ✅ | `reports/models.TestResults.resilience_scores`; `flakestorm ci` computes overall from `scoring.weights` |
|
||
|
||
---
|
||
|
||
## 5. PRD §10 — CLI
|
||
|
||
| Command | Status |
|
||
|---------|--------|
|
||
| flakestorm run --chaos, --chaos-profile, --chaos-only | ✅ |
|
||
| flakestorm chaos | ✅ |
|
||
| flakestorm contract run / validate / score | ✅ |
|
||
| flakestorm replay run [PATH] | ✅ (replay run, replay export) |
|
||
| flakestorm replay export --from-report FILE | ✅ |
|
||
| flakestorm ci | ✅ (mutation + contract + chaos + replay + overall score) |
|
||
|
||
---
|
||
|
||
## 6. Addendum (flakestorm-v2-addendum.md) — Full Checklist
|
||
|
||
### Addition 1 — Context Attacks Module
|
||
|
||
| Requirement | Status | Notes |
|
||
|-------------|--------|------|
|
||
| `chaos/context_attacks.py` | ✅ | `ContextAttackEngine`, `maybe_inject_indirect()` |
|
||
| indirect_injection (inject payloads into tool response) | ✅ | Wired via engine; profile `indirect_injection.yaml` |
|
||
| memory_poisoning, system_prompt_leak_probe | ⚠️ | Docstring/config types exist; memory_poisoning inject step and leak probe as contract assertion are not fully wired in execution flow |
|
||
| Contract invariants: excludes_pattern, behavior_unchanged | ✅ | `assertions/verifier.py`; use for system_prompt_not_leaked, injection_not_executed |
|
||
| Config: `chaos.context_attacks` list with type (e.g. indirect_injection) | ✅ | `ContextAttackConfig` in `core/config.py` |
|
||
|
||
### Addition 2 — Model Version Drift (response_drift)
|
||
|
||
| Requirement | Status | Notes |
|
||
|-------------|--------|------|
|
||
| `response_drift` in llm_faults | ✅ | `chaos/llm_proxy.py`: `apply_llm_response_drift`, drift_type, severity, direction, factor |
|
||
| drift_type: json_field_rename, verbosity_shift, format_change, refusal_rephrase, tone_shift | ✅ | Implemented in llm_proxy |
|
||
| Profile `model_version_drift.yaml` | ✅ | `chaos/profiles/model_version_drift.yaml` |
|
||
|
||
### Addition 3 — Multi-Agent Failure Propagation
|
||
|
||
| Requirement | Status | Notes |
|
||
|-------------|--------|------|
|
||
| v3 roadmap placeholder, no v2 implementation | ✅ | Documented in ROADMAP.md as V3; no code required |
|
||
|
||
### Addition 4 — Resilience Certificate Export
|
||
|
||
| Requirement | Status | Notes |
|
||
|-------------|--------|------|
|
||
| `flakestorm certificate` CLI command | ❌ | Not implemented |
|
||
| `reports/certificate.py` (PDF/HTML certificate) | ❌ | Not implemented |
|
||
| Config `certificate.tester_name`, pass_threshold, output_format | ❌ | Not implemented |
|
||
|
||
### Addition 5 — LangSmith Replay Import
|
||
|
||
| Requirement | Status | Notes |
|
||
|-------------|--------|------|
|
||
| Import single run by ID: `flakestorm replay --from-langsmith RUN_ID` | ✅ | `replay/loader.py`: `load_langsmith_run(run_id)`; CLI option |
|
||
| Import and run: `--from-langsmith RUN_ID --run` | ✅ | `_replay_async` supports run_after_import |
|
||
| Schema validation (fail clearly if LangSmith API changed) | ✅ | `_validate_langsmith_run_schema` |
|
||
| Map run inputs/outputs/child_runs to ReplaySessionConfig | ✅ | `_langsmith_run_to_session` |
|
||
| `--from-langsmith-project PROJECT` + `--filter-status` + `--output` | ✅ | `replay run --from-langsmith-project X --filter-status error -o ./replays/`; writes YAML per run |
|
||
| `replays.sources` (type: langsmith | langsmith_run, project, filter, auto_import) | ✅ | `LangSmithProjectSourceConfig`, `LangSmithRunSourceConfig`, `ReplayConfig.sources`; CI uses `resolve_sessions_from_config(..., include_sources=True)` |
|
||
|
||
### Addition 6 — Implicit Spec Clarifications
|
||
|
||
| Requirement | Status | Notes |
|
||
|-------------|--------|------|
|
||
| 6.1 Python callables: fail loudly if tool_faults but no tools/ToolRegistry | ✅ | `create_instrumented_adapter` raises with clear message for type=python |
|
||
| 6.2 Contract matrix: reset between cells (reset_endpoint / reset_function) | ✅ | `ContractEngine._reset_agent()`; config fields on AgentConfig |
|
||
| 6.3 Resilience score formula in spec (weighted, auto-FAIL on critical) | ✅ | `contracts/matrix.py` docstring and implementation; `docs/V2_SPEC.md` |
|
||
|
||
---
|
||
|
||
**Summary:** Addendum Additions 1, 2, 3, 5, 6 are implemented (with minor gaps on full memory_poisoning/leak_probe wiring). **Addition 4 (Resilience Certificate)** is not implemented.
|
||
|
||
---
|
||
|
||
## 7. Config Schema (v2.0)
|
||
|
||
- `version: "2.0"` supported; v1.0 backward compatible.
|
||
- `chaos`, `contract`, `chaos_matrix`, `replays`, `scoring` present and used.
|
||
- Replay session can be `file: "path"` only; full session loaded from file. Validation updated so `id`/`input`/`contract` optional when `file` is set.
|
||
|
||
---
|
||
|
||
## 8. Changes Made During This Audit
|
||
|
||
1. **Replay session file-only** — `ReplaySessionConfig` allows session with only `file`; `id`/`input`/`contract` optional when `file` is set (defaults/loaded from file).
|
||
2. **CI replay path** — Replay session file path resolved relative to config file directory: `config_path.parent / s.file`.
|
||
3. **V2 example** — Added `examples/v2_research_agent/`: working HTTP agent (FastAPI), v2 flakestorm.yaml (chaos, contract, replays, scoring), replay file, README, requirements.txt.
|
||
|
||
---
|
||
|
||
## 9. Example: V2 Research Agent
|
||
|
||
- **Agent:** `examples/v2_research_agent/agent.py` — FastAPI app with `/invoke` and `/reset`.
|
||
- **Config:** `examples/v2_research_agent/flakestorm.yaml` — version 2.0, chaos, contract, chaos_matrix, replays.sessions with file, scoring.
|
||
- **Replay:** `examples/v2_research_agent/replays/incident_001.yaml`.
|
||
- **Usage:** See `examples/v2_research_agent/README.md` (start agent, then run `flakestorm run`, `flakestorm contract run`, `flakestorm replay run`, `flakestorm ci`).
|
||
|
||
---
|
||
|
||
## 10. Test Status
|
||
|
||
- **181 tests passing** (including chaos, contract, replay integration tests).
|
||
- V2 example config loads successfully (`load_config("examples/v2_research_agent/flakestorm.yaml")`).
|
||
|
||
---
|
||
|
||
*Audit complete. Implementation aligns with PRD and addendum; optional config and path resolution improved; V2 example added.*
|