flakestorm/docs/BEHAVIORAL_CONTRACTS.md

5.5 KiB
Raw Blame History

Behavioral Contracts (Pillar 2)

What it is: A contract is a named set of invariants (rules the agent must always follow). Flakestorm runs your agent under each scenario in a chaos matrix and checks every invariant in every scenario. The result is a resilience score (0100%) and a pass/fail matrix.

Why it matters: You need to know that the agent still obeys its rules when tools fail, the LLM is degraded, or context is poisoned — not just on the happy path.

Question answered: Does the agent obey its rules when the world breaks?


When to use it

  • You have hard rules: “always cite a source”, “never return PII”, “never fabricate numbers when tools fail”.
  • You want a single resilience score for CI that reflects behavior across multiple failure modes.
  • You run flakestorm contract run for contract-only checks, or flakestorm ci to include contract in the overall score.

Configuration

In flakestorm.yaml with version: "2.0" add contract and chaos_matrix:

contract:
  name: "Finance Agent Contract"
  description: "Invariants that must hold under all failure conditions"
  invariants:
    - id: always-cite-source
      type: regex
      pattern: "(?i)(source|according to|reference)"
      severity: critical
      when: always
      description: "Must always cite a data source"
    - id: never-fabricate-when-tools-fail
      type: regex
      pattern: '\\$[\\d,]+\\.\\d{2}'
      negate: true
      severity: critical
      when: tool_faults_active
      description: "Must not return dollar figures when tools are failing"
    - id: max-latency
      type: latency
      max_ms: 60000
      severity: medium
      when: always
  chaos_matrix:
    - name: "no-chaos"
      tool_faults: []
      llm_faults: []
    - name: "search-tool-down"
      tool_faults:
        - tool: market_data_api
          mode: error
          error_code: 503
    - name: "llm-degraded"
      llm_faults:
        - mode: truncated_response
          max_tokens: 20

Invariant fields

Field Required Description
id Yes Unique identifier for this invariant.
type Yes Same as run invariants: contains, regex, latency, valid_json, similarity, excludes_pii, refusal_check, completes, output_not_empty, contains_any, excludes_pattern, behavior_unchanged, etc.
severity No critical | high | medium | low (default medium). Weights the resilience score; any critical failure = automatic fail.
when No always | tool_faults_active | llm_faults_active | any_chaos_active | no_chaos. When this invariant is evaluated.
negate No If true, the check passes when the pattern does not match (e.g. “must NOT contain dollar figures”).
description No Human-readable description.
probes No For system_prompt_leak_probe: list of probe prompts to run instead of golden_prompts; use with excludes_pattern to ensure no leak.
baseline No For behavior_unchanged: auto or manual baseline string.
similarity_threshold No For behavior_unchanged/similarity; default 0.75.
Plus type-specific pattern, patterns, value, values, max_ms, threshold, etc., same as Configuration Guide.

Chaos matrix

Each entry is a scenario: a name plus optional tool_faults, llm_faults, and context_attacks. The contract engine runs golden prompts (or probes for that invariant when set) under each scenario and verifies every invariant. Result: invariants × scenarios cells; resilience score is severity-weighted pass rate, and any critical failure fails the contract.


Resilience score

  • Formula: (Σ passed × severity_weight) / (Σ total × severity_weight) × 100.
  • Weights: critical = 3, high = 2, medium = 1, low = 1.
  • Automatic FAIL: If any invariant with severity critical fails in any scenario, the contract is considered failed regardless of the numeric score.

See V2 Spec for the exact formula and matrix isolation (reset) behavior.


Commands

Command What it does
flakestorm contract run Run the contract across the chaos matrix; print resilience score and pass/fail.
flakestorm contract validate Validate contract YAML without executing.
flakestorm contract score Output only the resilience score (e.g. for CI: flakestorm contract score -c flakestorm.yaml).
flakestorm ci Runs contract (if configured) and includes contract_compliance in the overall weighted score.

Stateful agents

If your agent keeps state between calls, each (invariant × scenario) cell should start from a clean state. Set agent.reset_endpoint (HTTP POST URL, e.g. http://localhost:8000/reset) or agent.reset_function (Python module path, e.g. myagent:reset_state) so Flakestorm can reset between cells. If the agent appears stateful (same prompt produces different responses on two calls) and no reset is configured, Flakestorm logs: "Warning: No reset_endpoint configured. Contract matrix cells may share state. Results may be contaminated. Add reset_endpoint to your config for accurate isolation." It does not fail the run.


See also