flakestorm/docs/V2_AUDIT.md

162 lines
8.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# V2 Implementation Audit
**Date:** March 2026
**Reference:** [Flakestorm v2.md](Flakestorm%20v2.md), [flakestorm-v2-addendum.md](flakestorm-v2-addendum.md)
## Scope
Verification of the codebase against the PRD and addendum: behavior, config schema, CLI, and examples.
---
## 1. PRD §8.1 — Environment Chaos
| Requirement | Status | Implementation |
|-------------|--------|----------------|
| Tool faults: timeout, error, malformed, slow, malicious_response | ✅ | `chaos/faults.py`, `chaos/http_transport.py` (by match_url or tool `*`) |
| LLM faults: timeout, truncated_response, rate_limit, empty, garbage | ✅ | `chaos/llm_proxy.py`, `chaos/interceptor.py` |
| probability, after_calls, tool `*` | ✅ | `chaos/faults.should_trigger`, transport and interceptor |
| Built-in profiles: api_outage, degraded_llm, hostile_tools, high_latency, cascading_failure | ✅ | `chaos/profiles/*.yaml` |
| InstrumentedAgentAdapter / httpx transport | ✅ | `ChaosInterceptor`, `ChaosHttpTransport`, `HTTPAgentAdapter(transport=...)` |
---
## 2. PRD §8.2 — Behavioral Contracts
| Requirement | Status | Implementation |
|-------------|--------|----------------|
| Contract with id, severity, when, negate | ✅ | `ContractInvariantConfig`, `contracts/engine.py` |
| Chaos matrix (scenarios) | ✅ | `contract.chaos_matrix`, scenario → ChaosConfig per run |
| Resilience matrix N×M, weighted score | ✅ | `contracts/matrix.py` (critical×3, high×2, medium×1), FAIL if any critical |
| Invariant types: contains_any, output_not_empty, completes, excludes_pattern, behavior_unchanged | ✅ | Assertions + verifier; contract engine runs verifier with contract invariants |
| reset_endpoint / reset_function | ✅ | `AgentConfig`, `ContractEngine._reset_agent()` before each cell |
| Stateful warning when no reset | ✅ | `ContractEngine._detect_stateful_and_warn()`, `STATEFUL_WARNING` |
---
## 3. PRD §8.3 — Replay-Based Regression
| Requirement | Status | Implementation |
|-------------|--------|----------------|
| Replay session: input, tool_responses, contract | ✅ | `ReplaySessionConfig`, `replay/loader.py`, `replay/runner.py` |
| Contract by name or path | ✅ | `resolve_contract()` in loader |
| Verify against contract | ✅ | `ReplayRunner.run()` uses `InvariantVerifier` with resolved contract |
| Export from report | ✅ | `flakestorm replay export --from-report FILE` |
| Replays in config: sessions with file or inline | ✅ | `replays.sessions`; session can have `file` only (load from file) or full inline |
---
## 4. PRD §9 — Combined Modes & Resilience Score
| Requirement | Status | Implementation |
|-------------|--------|----------------|
| Mutation only, chaos only, mutation+chaos, contract, replay | ✅ | `run` (with --chaos, --chaos-only), `contract run`, `replay run` |
| Unified resilience score (mutation_robustness, chaos_resilience, contract_compliance, replay_regression, overall) | ✅ | `reports/models.TestResults.resilience_scores`; `flakestorm ci` computes overall from `scoring.weights` |
---
## 5. PRD §10 — CLI
| Command | Status |
|---------|--------|
| flakestorm run --chaos, --chaos-profile, --chaos-only | ✅ |
| flakestorm chaos | ✅ |
| flakestorm contract run / validate / score | ✅ |
| flakestorm replay run [PATH] | ✅ (replay run, replay export) |
| flakestorm replay export --from-report FILE | ✅ |
| flakestorm ci | ✅ (mutation + contract + chaos + replay + overall score) |
---
## 6. Addendum (flakestorm-v2-addendum.md) — Full Checklist
### Addition 1 — Context Attacks Module
| Requirement | Status | Notes |
|-------------|--------|------|
| `chaos/context_attacks.py` | ✅ | `ContextAttackEngine`, `maybe_inject_indirect()` |
| indirect_injection (inject payloads into tool response) | ✅ | Wired via engine; profile `indirect_injection.yaml` |
| memory_poisoning, system_prompt_leak_probe | ⚠️ | Docstring/config types exist; memory_poisoning inject step and leak probe as contract assertion are not fully wired in execution flow |
| Contract invariants: excludes_pattern, behavior_unchanged | ✅ | `assertions/verifier.py`; use for system_prompt_not_leaked, injection_not_executed |
| Config: `chaos.context_attacks` list with type (e.g. indirect_injection) | ✅ | `ContextAttackConfig` in `core/config.py` |
### Addition 2 — Model Version Drift (response_drift)
| Requirement | Status | Notes |
|-------------|--------|------|
| `response_drift` in llm_faults | ✅ | `chaos/llm_proxy.py`: `apply_llm_response_drift`, drift_type, severity, direction, factor |
| drift_type: json_field_rename, verbosity_shift, format_change, refusal_rephrase, tone_shift | ✅ | Implemented in llm_proxy |
| Profile `model_version_drift.yaml` | ✅ | `chaos/profiles/model_version_drift.yaml` |
### Addition 3 — Multi-Agent Failure Propagation
| Requirement | Status | Notes |
|-------------|--------|------|
| v3 roadmap placeholder, no v2 implementation | ✅ | Documented in ROADMAP.md as V3; no code required |
### Addition 4 — Resilience Certificate Export
| Requirement | Status | Notes |
|-------------|--------|------|
| `flakestorm certificate` CLI command | ❌ | Not implemented |
| `reports/certificate.py` (PDF/HTML certificate) | ❌ | Not implemented |
| Config `certificate.tester_name`, pass_threshold, output_format | ❌ | Not implemented |
### Addition 5 — LangSmith Replay Import
| Requirement | Status | Notes |
|-------------|--------|------|
| Import single run by ID: `flakestorm replay --from-langsmith RUN_ID` | ✅ | `replay/loader.py`: `load_langsmith_run(run_id)`; CLI option |
| Import and run: `--from-langsmith RUN_ID --run` | ✅ | `_replay_async` supports run_after_import |
| Schema validation (fail clearly if LangSmith API changed) | ✅ | `_validate_langsmith_run_schema` |
| Map run inputs/outputs/child_runs to ReplaySessionConfig | ✅ | `_langsmith_run_to_session` |
| `--from-langsmith-project PROJECT` + `--filter-status` + `--output` | ✅ | `replay run --from-langsmith-project X --filter-status error -o ./replays/`; writes YAML per run |
| `replays.sources` (type: langsmith | langsmith_run, project, filter, auto_import) | ✅ | `LangSmithProjectSourceConfig`, `LangSmithRunSourceConfig`, `ReplayConfig.sources`; CI uses `resolve_sessions_from_config(..., include_sources=True)` |
### Addition 6 — Implicit Spec Clarifications
| Requirement | Status | Notes |
|-------------|--------|------|
| 6.1 Python callables: fail loudly if tool_faults but no tools/ToolRegistry | ✅ | `create_instrumented_adapter` raises with clear message for type=python |
| 6.2 Contract matrix: reset between cells (reset_endpoint / reset_function) | ✅ | `ContractEngine._reset_agent()`; config fields on AgentConfig |
| 6.3 Resilience score formula in spec (weighted, auto-FAIL on critical) | ✅ | `contracts/matrix.py` docstring and implementation; `docs/V2_SPEC.md` |
---
**Summary:** Addendum Additions 1, 2, 3, 5, 6 are implemented (with minor gaps on full memory_poisoning/leak_probe wiring). **Addition 4 (Resilience Certificate)** is not implemented.
---
## 7. Config Schema (v2.0)
- `version: "2.0"` supported; v1.0 backward compatible.
- `chaos`, `contract`, `chaos_matrix`, `replays`, `scoring` present and used.
- Replay session can be `file: "path"` only; full session loaded from file. Validation updated so `id`/`input`/`contract` optional when `file` is set.
---
## 8. Changes Made During This Audit
1. **Replay session file-only**`ReplaySessionConfig` allows session with only `file`; `id`/`input`/`contract` optional when `file` is set (defaults/loaded from file).
2. **CI replay path** — Replay session file path resolved relative to config file directory: `config_path.parent / s.file`.
3. **V2 example** — Added `examples/v2_research_agent/`: working HTTP agent (FastAPI), v2 flakestorm.yaml (chaos, contract, replays, scoring), replay file, README, requirements.txt.
---
## 9. Example: V2 Research Agent
- **Agent:** `examples/v2_research_agent/agent.py` — FastAPI app with `/invoke` and `/reset`.
- **Config:** `examples/v2_research_agent/flakestorm.yaml` — version 2.0, chaos, contract, chaos_matrix, replays.sessions with file, scoring.
- **Replay:** `examples/v2_research_agent/replays/incident_001.yaml`.
- **Usage:** See `examples/v2_research_agent/README.md` (start agent, then run `flakestorm run`, `flakestorm contract run`, `flakestorm replay run`, `flakestorm ci`).
---
## 10. Test Status
- **181 tests passing** (including chaos, contract, replay integration tests).
- V2 example config loads successfully (`load_config("examples/v2_research_agent/flakestorm.yaml")`).
---
*Audit complete. Implementation aligns with PRD and addendum; optional config and path resolution improved; V2 example added.*