flakestorm/docs/V2_AUDIT.md

8.6 KiB
Raw Permalink Blame History

V2 Implementation Audit

Date: March 2026
Reference: Flakestorm v2.md, flakestorm-v2-addendum.md

Scope

Verification of the codebase against the PRD and addendum: behavior, config schema, CLI, and examples.


1. PRD §8.1 — Environment Chaos

Requirement Status Implementation
Tool faults: timeout, error, malformed, slow, malicious_response chaos/faults.py, chaos/http_transport.py (by match_url or tool *)
LLM faults: timeout, truncated_response, rate_limit, empty, garbage chaos/llm_proxy.py, chaos/interceptor.py
probability, after_calls, tool * chaos/faults.should_trigger, transport and interceptor
Built-in profiles: api_outage, degraded_llm, hostile_tools, high_latency, cascading_failure chaos/profiles/*.yaml
InstrumentedAgentAdapter / httpx transport ChaosInterceptor, ChaosHttpTransport, HTTPAgentAdapter(transport=...)

2. PRD §8.2 — Behavioral Contracts

Requirement Status Implementation
Contract with id, severity, when, negate ContractInvariantConfig, contracts/engine.py
Chaos matrix (scenarios) contract.chaos_matrix, scenario → ChaosConfig per run
Resilience matrix N×M, weighted score contracts/matrix.py (critical×3, high×2, medium×1), FAIL if any critical
Invariant types: contains_any, output_not_empty, completes, excludes_pattern, behavior_unchanged Assertions + verifier; contract engine runs verifier with contract invariants
reset_endpoint / reset_function AgentConfig, ContractEngine._reset_agent() before each cell
Stateful warning when no reset ContractEngine._detect_stateful_and_warn(), STATEFUL_WARNING

3. PRD §8.3 — Replay-Based Regression

Requirement Status Implementation
Replay session: input, tool_responses, contract ReplaySessionConfig, replay/loader.py, replay/runner.py
Contract by name or path resolve_contract() in loader
Verify against contract ReplayRunner.run() uses InvariantVerifier with resolved contract
Export from report flakestorm replay export --from-report FILE
Replays in config: sessions with file or inline replays.sessions; session can have file only (load from file) or full inline

4. PRD §9 — Combined Modes & Resilience Score

Requirement Status Implementation
Mutation only, chaos only, mutation+chaos, contract, replay run (with --chaos, --chaos-only), contract run, replay run
Unified resilience score (mutation_robustness, chaos_resilience, contract_compliance, replay_regression, overall) reports/models.TestResults.resilience_scores; flakestorm ci computes overall from scoring.weights

5. PRD §10 — CLI

Command Status
flakestorm run --chaos, --chaos-profile, --chaos-only
flakestorm chaos
flakestorm contract run / validate / score
flakestorm replay run [PATH] (replay run, replay export)
flakestorm replay export --from-report FILE
flakestorm ci (mutation + contract + chaos + replay + overall score)

6. Addendum (flakestorm-v2-addendum.md) — Full Checklist

Addition 1 — Context Attacks Module

Requirement Status Notes
chaos/context_attacks.py ContextAttackEngine, maybe_inject_indirect()
indirect_injection (inject payloads into tool response) Wired via engine; profile indirect_injection.yaml
memory_poisoning, system_prompt_leak_probe ⚠️ Docstring/config types exist; memory_poisoning inject step and leak probe as contract assertion are not fully wired in execution flow
Contract invariants: excludes_pattern, behavior_unchanged assertions/verifier.py; use for system_prompt_not_leaked, injection_not_executed
Config: chaos.context_attacks list with type (e.g. indirect_injection) ContextAttackConfig in core/config.py

Addition 2 — Model Version Drift (response_drift)

Requirement Status Notes
response_drift in llm_faults chaos/llm_proxy.py: apply_llm_response_drift, drift_type, severity, direction, factor
drift_type: json_field_rename, verbosity_shift, format_change, refusal_rephrase, tone_shift Implemented in llm_proxy
Profile model_version_drift.yaml chaos/profiles/model_version_drift.yaml

Addition 3 — Multi-Agent Failure Propagation

Requirement Status Notes
v3 roadmap placeholder, no v2 implementation Documented in ROADMAP.md as V3; no code required

Addition 4 — Resilience Certificate Export

Requirement Status Notes
flakestorm certificate CLI command Not implemented
reports/certificate.py (PDF/HTML certificate) Not implemented
Config certificate.tester_name, pass_threshold, output_format Not implemented

Addition 5 — LangSmith Replay Import

Requirement Status Notes
Import single run by ID: flakestorm replay --from-langsmith RUN_ID replay/loader.py: load_langsmith_run(run_id); CLI option
Import and run: --from-langsmith RUN_ID --run _replay_async supports run_after_import
Schema validation (fail clearly if LangSmith API changed) _validate_langsmith_run_schema
Map run inputs/outputs/child_runs to ReplaySessionConfig _langsmith_run_to_session
--from-langsmith-project PROJECT + --filter-status + --output replay run --from-langsmith-project X --filter-status error -o ./replays/; writes YAML per run
replays.sources (type: langsmith langsmith_run, project, filter, auto_import)

Addition 6 — Implicit Spec Clarifications

Requirement Status Notes
6.1 Python callables: fail loudly if tool_faults but no tools/ToolRegistry create_instrumented_adapter raises with clear message for type=python
6.2 Contract matrix: reset between cells (reset_endpoint / reset_function) ContractEngine._reset_agent(); config fields on AgentConfig
6.3 Resilience score formula in spec (weighted, auto-FAIL on critical) contracts/matrix.py docstring and implementation; docs/V2_SPEC.md

Summary: Addendum Additions 1, 2, 3, 5, 6 are implemented (with minor gaps on full memory_poisoning/leak_probe wiring). Addition 4 (Resilience Certificate) is not implemented.


7. Config Schema (v2.0)

  • version: "2.0" supported; v1.0 backward compatible.
  • chaos, contract, chaos_matrix, replays, scoring present and used.
  • Replay session can be file: "path" only; full session loaded from file. Validation updated so id/input/contract optional when file is set.

8. Changes Made During This Audit

  1. Replay session file-onlyReplaySessionConfig allows session with only file; id/input/contract optional when file is set (defaults/loaded from file).
  2. CI replay path — Replay session file path resolved relative to config file directory: config_path.parent / s.file.
  3. V2 example — Added examples/v2_research_agent/: working HTTP agent (FastAPI), v2 flakestorm.yaml (chaos, contract, replays, scoring), replay file, README, requirements.txt.

9. Example: V2 Research Agent

  • Agent: examples/v2_research_agent/agent.py — FastAPI app with /invoke and /reset.
  • Config: examples/v2_research_agent/flakestorm.yaml — version 2.0, chaos, contract, chaos_matrix, replays.sessions with file, scoring.
  • Replay: examples/v2_research_agent/replays/incident_001.yaml.
  • Usage: See examples/v2_research_agent/README.md (start agent, then run flakestorm run, flakestorm contract run, flakestorm replay run, flakestorm ci).

10. Test Status

  • 181 tests passing (including chaos, contract, replay integration tests).
  • V2 example config loads successfully (load_config("examples/v2_research_agent/flakestorm.yaml")).

Audit complete. Implementation aligns with PRD and addendum; optional config and path resolution improved; V2 example added.