flakestorm/docs/REPLAY_REGRESSION.md


# Replay-Based Regression (Pillar 3)

**What it is:** You import real production failure sessions (the exact user input, tool responses, and failure description) and replay them as deterministic tests. Flakestorm sends the same input to the agent, injects the same tool responses via the chaos layer, and verifies the response against a contract. If the agent now passes, you've confirmed the fix.

**Why it matters:** The best test cases come from production. Replay closes the loop: incident → capture → fix → replay → pass.

**Question answered:** Did we fix this incident?


## When to use it

- You had a production incident (e.g. the agent fabricated data when a tool returned 504).
- You fixed the agent and want to prove the same scenario passes.
- You run replays via `flakestorm replay run` for one-off checks, or `flakestorm ci` to include `replay_regression` in the overall score.

## Replay file format

A replay session is a YAML (or JSON) file with the following shape. You can reference it from `flakestorm.yaml` with `file: "replays/incident_001.yaml"` or run it directly with `flakestorm replay run path/to/file.yaml`.

```yaml
id: "incident-2026-02-15"
name: "Prod incident: fabricated revenue figure"
source: manual
input: "What was ACME Corp's Q3 revenue?"
tool_responses:
  - tool: market_data_api
    response: null
    status: 504
    latency_ms: 30000
  - tool: web_search
    response: "Connection reset by peer"
    status: 0
expected_failure: "Agent fabricated revenue instead of saying data unavailable"
contract: "Finance Agent Contract"
```

### Fields

| Field | Required | Description |
| --- | --- | --- |
| `file` | No | Path to a replay file; when set, the session is loaded from the file and `id`/`input`/`contract` may be omitted. |
| `id` | Yes (if not using `file`) | Unique replay id. |
| `input` | Yes (if not using `file`) | Exact user input from the incident. |
| `contract` | Yes (if not using `file`) | Contract name (from the main config) or path to a contract YAML file. Used to verify the agent's response. |
| `tool_responses` | No | List of recorded tool responses to inject during replay. Each has `tool` and optional `response`, `status`, `latency_ms`. |
| `name` | No | Human-readable name. |
| `source` | No | e.g. `manual`, `langsmith`. |
| `expected_failure` | No | Short description of what went wrong (for documentation). |
| `context` | No | Optional conversation/system context. |

**Validation:** A replay session must have either `file` or both `id` and `input` (an inline session).
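
The validation rule allows two shapes for a session entry. A minimal sketch of each, using only the fields documented above (the ids and paths here are illustrative):

```yaml
# Shape 1: file-based session — id, input, and contract are loaded from the file.
- file: "replays/incident_001.yaml"

# Shape 2: inline session — id and input are required; contract verifies the response.
- id: "incident-2026-02-15"
  input: "What was ACME Corp's Q3 revenue?"
  contract: "Finance Agent Contract"
```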


## Contract reference

- By name: `contract: "Finance Agent Contract"` — the contract must be defined in the same `flakestorm.yaml` (under `contract:`).
- By path: `contract: "./contracts/safety.yaml"` — the path is relative to the config file's directory.

Flakestorm resolves the reference as a contract name first, then as a path; if neither resolves, the replay may fail or fall back, depending on setup.
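
Both reference styles side by side in a `replays.sessions` list. This is a sketch: the session ids and the `./contracts/safety.yaml` path are illustrative, and only fields from the table above are used:

```yaml
replays:
  sessions:
    # By name — must match a contract defined under contract: in the same flakestorm.yaml.
    - id: "by-name-example"
      input: "What was ACME Corp's Q3 revenue?"
      contract: "Finance Agent Contract"
    # By path — resolved relative to the config file's directory.
    - id: "by-path-example"
      input: "What was ACME Corp's Q3 revenue?"
      contract: "./contracts/safety.yaml"
```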


## Configuration in `flakestorm.yaml`

You can define replay sessions inline, by file, or via LangSmith sources:

```yaml
version: "2.0"
# ... agent, contract, etc. ...

replays:
  sessions:
    - file: "replays/incident_001.yaml"
    - id: "inline-001"
      input: "What is the capital of France?"
      contract: "Research Agent Contract"
      tool_responses: []
  # LangSmith sources (import by project or run ID; auto_import re-fetches on each run/ci)
  sources:
    - type: langsmith
      project: "my-production-agent"
      filter:
        status: error           # error | warning | all
        date_range: last_7_days
        min_latency_ms: 5000
      auto_import: true
    - type: langsmith_run
      run_id: "abc123def456"
```

When you use `file:`, the session's `id`, `input`, and `contract` come from the loaded file. When you define a session inline, you must provide `id` and `input` yourself. Sessions from `replays.sources` are merged in when running `flakestorm ci`, or whenever `auto_import` is true (project sources).


## Commands

| Command | What it does |
| --- | --- |
| `flakestorm replay run path/to/replay.yaml -c flakestorm.yaml` | Run a single replay file. `-c` supplies the agent and contract config. |
| `flakestorm replay run path/to/dir -c flakestorm.yaml` | Run all replay files in the directory. |
| `flakestorm replay export --from-report REPORT.json --output ./replays` | Export failed mutations from a Flakestorm report as replay YAML files. |
| `flakestorm replay run --from-langsmith RUN_ID -c flakestorm.yaml` | Import a single session from LangSmith by run ID (requires `flakestorm[langsmith]`). |
| `flakestorm replay run --from-langsmith RUN_ID --run -o replay.yaml` | Import, optionally write to a file, and run the replay. |
| `flakestorm replay run --from-langsmith-project PROJECT --filter-status error -o ./replays/` | Import all runs from a LangSmith project; writes one YAML file per run. Add `--run` to run after import. |
| `flakestorm ci -c flakestorm.yaml` | Runs mutation, contract, chaos-only, and all replay sessions (including `replays.sources` with `auto_import`); reports `replay_regression` and the overall weighted score. |

## Import sources

- **Manual** — Write YAML/JSON replay files from incident reports.
- **Flakestorm export** — `flakestorm replay export --from-report REPORT.json` turns failed runs into replay files.
- **LangSmith (single run)** — `flakestorm replay run --from-langsmith RUN_ID` (requires `pip install flakestorm[langsmith]`).
- **LangSmith (project)** — `flakestorm replay run --from-langsmith-project PROJECT --filter-status error -o ./replays/` imports failed runs from a project; or use `replays.sources` in the config with `auto_import: true` so CI re-fetches from the project on each run.

## See also

- **Behavioral Contracts** — How contracts and invariants are defined (replay verifies against a contract).
- **Environment Chaos** — Replay uses the same chaos/interceptor layer to inject recorded tool responses.