Behavioral Contracts (Pillar 2)
What it is: A contract is a named set of invariants (rules the agent must always follow). Flakestorm runs your agent under each scenario in a chaos matrix and checks every invariant in every scenario. The result is a resilience score (0–100%) and a pass/fail matrix.
Why it matters: You need to know that the agent still obeys its rules when tools fail, the LLM is degraded, or context is poisoned — not just on the happy path.
Question answered: Does the agent obey its rules when the world breaks?
When to use it
- You have hard rules: “always cite a source”, “never return PII”, “never fabricate numbers when tools fail”.
- You want a single resilience score for CI that reflects behavior across multiple failure modes.
- You run `flakestorm contract run` for contract-only checks, or `flakestorm ci` to include the contract in the overall score.
Configuration
In `flakestorm.yaml` with `version: "2.0"`, add `contract` and `chaos_matrix`:

    contract:
      name: "Finance Agent Contract"
      description: "Invariants that must hold under all failure conditions"
      invariants:
        - id: always-cite-source
          type: regex
          pattern: "(?i)(source|according to|reference)"
          severity: critical
          when: always
          description: "Must always cite a data source"
        - id: never-fabricate-when-tools-fail
          type: regex
          # Single-quoted YAML keeps backslashes literal, so escape each regex token once.
          pattern: '\$[\d,]+\.\d{2}'
          negate: true
          severity: critical
          when: tool_faults_active
          description: "Must not return dollar figures when tools are failing"
        - id: max-latency
          type: latency
          max_ms: 60000
          severity: medium
          when: always
    chaos_matrix:
      - name: "no-chaos"
        tool_faults: []
        llm_faults: []
      - name: "search-tool-down"
        tool_faults:
          - tool: market_data_api
            mode: error
            error_code: 503
      - name: "llm-degraded"
        llm_faults:
          - mode: truncated_response
            max_tokens: 20

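The two regex invariants in this example can be sketched in Python to show how `negate` flips the result. This is a minimal illustration, not Flakestorm's implementation: the `check_regex` and `evaluate` helpers are hypothetical, the use of `re.search` (match anywhere in the response) is an assumption, and the dollar pattern is written as a Python raw string.

```python
import re

# Hypothetical mirror of the two regex invariants from the example config.
INVARIANTS = [
    {
        "id": "always-cite-source",
        "pattern": r"(?i)(source|according to|reference)",
        "negate": False,
    },
    {
        "id": "never-fabricate-when-tools-fail",
        "pattern": r"\$[\d,]+\.\d{2}",
        "negate": True,
    },
]


def check_regex(invariant: dict, response: str) -> bool:
    """Return True when the invariant passes for this response."""
    # Assumption: the pattern may match anywhere in the response.
    matched = re.search(invariant["pattern"], response) is not None
    # With negate, the check passes only when the pattern is absent.
    return not matched if invariant.get("negate", False) else matched


def evaluate(response: str) -> dict:
    """Map each invariant id to its pass/fail result for one response."""
    return {inv["id"]: check_regex(inv, response) for inv in INVARIANTS}


good = "According to the filing, revenue data is unavailable right now."
bad = "Revenue was $1,234.56 last quarter."
```

Here `evaluate(good)` passes both checks, while `evaluate(bad)` fails both: the response cites nothing and contains a dollar figure.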
Invariant fields
| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier for this invariant. |
| `type` | Yes | Same as run invariants: `contains`, `regex`, `latency`, `valid_json`, `similarity`, `excludes_pii`, `refusal_check`, `completes`, `output_not_empty`, `contains_any`, `excludes_pattern`, `behavior_unchanged`, etc. |
| `severity` | No | `critical`, `high`, `medium`, or `low` (default `medium`). Weights the resilience score; any critical failure = automatic fail. |
| `when` | No | `always`, `tool_faults_active`, `llm_faults_active`, `any_chaos_active`, or `no_chaos`. When this invariant is evaluated. |
| `negate` | No | If true, the check passes when the pattern does not match (e.g. “must NOT contain dollar figures”). |
| `description` | No | Human-readable description. |
| `probes` | No | For `system_prompt_leak_probe`: list of probe prompts to run instead of `golden_prompts`; use with `excludes_pattern` to ensure no leak. |
| `baseline` | No | For `behavior_unchanged`: auto or manual baseline string. |
| `similarity_threshold` | No | For `behavior_unchanged`/`similarity`; default 0.75. |
| Plus type-specific | - | `pattern`, `patterns`, `value`, `values`, `max_ms`, `threshold`, etc., same as Configuration Guide. |
Chaos matrix
Each entry is a scenario: a name plus optional tool_faults, llm_faults, and context_attacks. The contract engine runs the golden prompts (or, for an invariant that sets probes, those probes) under each scenario and verifies every invariant. The result is a matrix with one cell per invariant × scenario pair. The resilience score is the severity-weighted pass rate, and any critical failure fails the contract.
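The way scenarios and `when` conditions combine into matrix cells can be sketched as follows. This is an illustration under stated assumptions, not Flakestorm's engine: `is_applicable` and `matrix_cells` are hypothetical helpers, a missing `when` is assumed to mean `always`, `context_attacks` is assumed to count as chaos, and cells whose `when` condition does not hold are assumed to be skipped.

```python
from itertools import product

# Scenarios and invariants copied from the example configuration.
SCENARIOS = [
    {"name": "no-chaos", "tool_faults": [], "llm_faults": []},
    {
        "name": "search-tool-down",
        "tool_faults": [{"tool": "market_data_api", "mode": "error", "error_code": 503}],
    },
    {
        "name": "llm-degraded",
        "llm_faults": [{"mode": "truncated_response", "max_tokens": 20}],
    },
]
INVARIANTS = [
    {"id": "always-cite-source", "when": "always"},
    {"id": "never-fabricate-when-tools-fail", "when": "tool_faults_active"},
    {"id": "max-latency", "when": "always"},
]


def is_applicable(when: str, scenario: dict) -> bool:
    """Decide whether an invariant is evaluated in a scenario (assumed semantics)."""
    tool = bool(scenario.get("tool_faults"))
    llm = bool(scenario.get("llm_faults"))
    # Assumption: context attacks also count as active chaos.
    any_chaos = tool or llm or bool(scenario.get("context_attacks"))
    rules = {
        "always": True,
        "tool_faults_active": tool,
        "llm_faults_active": llm,
        "any_chaos_active": any_chaos,
        "no_chaos": not any_chaos,
    }
    if when not in rules:
        raise ValueError(f"unknown when condition: {when}")
    return rules[when]


def matrix_cells(invariants: list, scenarios: list) -> list:
    """List the (invariant id, scenario name) pairs that would be evaluated."""
    return [
        (inv["id"], sc["name"])
        for inv, sc in product(invariants, scenarios)
        # Assumption: an invariant without `when` behaves like `always`.
        if is_applicable(inv.get("when", "always"), sc)
    ]
```

Under these assumptions the example yields seven evaluated cells: the two `always` invariants run in all three scenarios, and the tool-fault invariant runs only in `search-tool-down`.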
Resilience score
- Formula: (Σ passed × severity_weight) / (Σ total × severity_weight) × 100.
- Weights: critical = 3, high = 2, medium = 1, low = 1.
- Automatic FAIL: If any invariant with severity `critical` fails in any scenario, the contract is considered failed regardless of the numeric score.
See V2 Spec for the exact formula and matrix isolation (reset) behavior.
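The formula and weights above can be worked through in a few lines of Python. This is a sketch of the arithmetic only: `resilience_score` and `has_critical_failure` are hypothetical helper names, and representing each matrix cell as a `(severity, passed)` tuple is an assumption made for illustration.

```python
# Severity weights as listed above.
WEIGHTS = {"critical": 3, "high": 2, "medium": 1, "low": 1}


def resilience_score(cells: list) -> float:
    """Severity-weighted pass rate in percent.

    Each cell is a (severity, passed) tuple for one invariant in one scenario.
    """
    total = sum(WEIGHTS[severity] for severity, _ in cells)
    if total == 0:
        raise ValueError("no cells to score")
    passed = sum(WEIGHTS[severity] for severity, ok in cells if ok)
    return passed / total * 100


def has_critical_failure(cells: list) -> bool:
    """True when any critical invariant failed in any scenario."""
    return any(severity == "critical" and not ok for severity, ok in cells)


# Two critical passes, one high failure, one medium pass:
# passed weight = 3 + 3 + 1 = 7, total weight = 3 + 3 + 2 + 1 = 9.
example = [("critical", True), ("critical", True), ("high", False), ("medium", True)]
print(round(resilience_score(example), 1))  # 77.8
print(has_critical_failure(example))        # False
```

Note that a run such as `[("critical", False), ("medium", True)]` scores 25.0 yet still counts as an automatic fail, because the critical check is independent of the numeric score.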
Commands
| Command | What it does |
|---|---|
| `flakestorm contract run` | Run the contract across the chaos matrix; print resilience score and pass/fail. |
| `flakestorm contract validate` | Validate contract YAML without executing. |
| `flakestorm contract score` | Output only the resilience score (e.g. for CI: `flakestorm contract score -c flakestorm.yaml`). |
| `flakestorm ci` | Run the contract (if configured) and include `contract_compliance` in the overall weighted score. |
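A CI job could gate on the output of `flakestorm contract score` roughly as sketched below. The command and the `-c` flag come from the table above; everything else is an assumption. In particular, the exact output format is not specified here, so `parse_score` simply takes the first number it finds, and both helper names and the threshold are hypothetical.

```python
import re
import subprocess


def parse_score(output: str) -> float:
    """Extract the first number from the command output.

    Assumption: the score is printed as a plain number, optionally with '%'.
    """
    match = re.search(r"\d+(?:\.\d+)?", output)
    if match is None:
        raise ValueError(f"no score found in output: {output!r}")
    return float(match.group())


def score_meets_threshold(threshold: float, config: str = "flakestorm.yaml") -> bool:
    """Run the score command and compare the result against a threshold."""
    result = subprocess.run(
        ["flakestorm", "contract", "score", "-c", config],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return parse_score(result.stdout) >= threshold
```

A pipeline step might call `score_meets_threshold(80.0)` and exit non-zero when it returns `False`. Check the actual output of the command before relying on this parsing.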
Stateful agents
If your agent keeps state between calls, each (invariant × scenario) cell should start from a clean state. Set agent.reset_endpoint (HTTP POST URL, e.g. http://localhost:8000/reset) or agent.reset_function (Python module path, e.g. myagent:reset_state) so Flakestorm can reset between cells. If the agent appears stateful (same prompt produces different responses on two calls) and no reset is configured, Flakestorm logs: "Warning: No reset_endpoint configured. Contract matrix cells may share state. Results may be contaminated. Add reset_endpoint to your config for accurate isolation." It does not fail the run.
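A reset function of the kind `agent.reset_function` points at might look like the sketch below. The module layout, the `AgentState` fields, and the `handle` function are hypothetical, and calling `reset_state` with no arguments is an assumption about how Flakestorm invokes it. The example also shows the stateful symptom described above: the same prompt yields different responses until the state is reset.

```python
"""Hypothetical myagent.py, referenced in config as myagent:reset_state."""
from dataclasses import dataclass, field


@dataclass
class AgentState:
    # Illustrative per-process state that survives between calls.
    history: list = field(default_factory=list)
    cache: dict = field(default_factory=dict)


_STATE = AgentState()


def handle(prompt: str) -> str:
    """Toy agent whose reply depends on how many prompts it has seen."""
    _STATE.history.append(prompt)
    return f"turn {len(_STATE.history)}: {prompt}"


def reset_state() -> None:
    """Return the agent to a clean state so matrix cells do not share history."""
    _STATE.history.clear()
    _STATE.cache.clear()


first = handle("What is the price of ACME?")
second = handle("What is the price of ACME?")
print(first == second)  # False: the agent is stateful
reset_state()
print(handle("What is the price of ACME?") == first)  # True after a reset
```

An HTTP agent would expose the same logic behind the POST URL configured as `agent.reset_endpoint`.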
See also
- Environment Chaos: how tool/LLM faults and context attacks are defined.
- Configuration Guide: full `invariants` and checker reference.