mirror of
https://github.com/flakestorm/flakestorm.git
synced 2026-04-25 00:36:54 +02:00
567 lines
17 KiB
Markdown
567 lines
17 KiB
Markdown
# flakestorm API Specification
|
||
|
||
## Python SDK
|
||
|
||
### Quick Start
|
||
|
||
```python
|
||
import asyncio
|
||
from flakestorm import FlakeStormRunner, load_config
|
||
|
||
async def main():
|
||
config = load_config("flakestorm.yaml")
|
||
runner = FlakeStormRunner(config)
|
||
results = await runner.run()
|
||
print(f"Robustness Score: {results.statistics.robustness_score:.1%}")
|
||
|
||
asyncio.run(main())
|
||
```
|
||
|
||
---
|
||
|
||
## Core Classes
|
||
|
||
### FlakeStormConfig
|
||
|
||
Configuration container for all flakestorm settings.
|
||
|
||
```python
|
||
from flakestorm import FlakeStormConfig, load_config
|
||
|
||
# Load from file
|
||
config = load_config("flakestorm.yaml")
|
||
|
||
# Access properties
|
||
config.agent.endpoint # str
|
||
config.model.name # str
|
||
config.golden_prompts # list[str]
|
||
config.invariants # list[InvariantConfig]
|
||
|
||
# Serialize
|
||
yaml_str = config.to_yaml()
|
||
|
||
# Parse from string
|
||
config = FlakeStormConfig.from_yaml(yaml_content)
|
||
```
|
||
|
||
#### Properties
|
||
|
||
| Property | Type | Description |
|
||
|----------|------|-------------|
|
||
| `version` | `str` | Config version (`1.0` \| `2.0`) |
|
||
| `agent` | `AgentConfig` | Agent connection settings (includes V2 `reset_endpoint`, `reset_function`) |
|
||
| `model` | `ModelConfig` | LLM settings (V2: `api_key` env-only) |
|
||
| `mutations` | `MutationConfig` | Mutation generation (max 50/run OSS, 22+ types) |
|
||
| `golden_prompts` | `list[str]` | Test prompts |
|
||
| `invariants` | `list[InvariantConfig]` | Assertion rules |
|
||
| `output` | `OutputConfig` | Report settings |
|
||
| `advanced` | `AdvancedConfig` | Advanced options |
|
||
| **V2** `chaos` | `ChaosConfig \| None` | Tool/LLM faults and context_attacks (list or dict) |
|
||
| **V2** `contract` | `ContractConfig \| None` | Behavioral contract and chaos_matrix (scenarios may include context_attacks) |
|
||
| **V2** `chaos_matrix` | `list[ChaosScenarioConfig] \| None` | Top-level chaos scenarios when not using contract.chaos_matrix |
|
||
| **V2** `replays` | `ReplayConfig \| None` | Replay sessions (file or inline) and LangSmith sources |
|
||
| **V2** `scoring` | `ScoringConfig \| None` | Weights for mutation, chaos, contract, replay (must sum to 1.0) |
|
||
|
||
---
|
||
|
||
### FlakeStormRunner
|
||
|
||
Main test runner class.
|
||
|
||
```python
|
||
from flakestorm import FlakeStormRunner
|
||
|
||
runner = FlakeStormRunner(
|
||
config="flakestorm.yaml", # or FlakeStormConfig object
|
||
agent=None, # optional: pre-configured adapter
|
||
console=None, # optional: Rich console
|
||
show_progress=True, # show progress bars
|
||
)
|
||
|
||
# Run tests
|
||
results = await runner.run()
|
||
|
||
# Verify setup only
|
||
is_valid = await runner.verify_setup()
|
||
|
||
# Get config summary
|
||
summary = runner.get_config_summary()
|
||
```
|
||
|
||
#### Methods
|
||
|
||
| Method | Returns | Description |
|
||
|--------|---------|-------------|
|
||
| `run()` | `TestResults` | Execute full test suite |
|
||
| `verify_setup()` | `bool` | Check configuration validity |
|
||
| `get_config_summary()` | `str` | Human-readable config summary |
|
||
|
||
---
|
||
|
||
### Agent Adapters
|
||
|
||
#### AgentProtocol
|
||
|
||
Interface for custom agent implementations.
|
||
|
||
```python
|
||
from typing import Protocol
|
||
|
||
class AgentProtocol(Protocol):
|
||
async def invoke(self, input: str) -> str:
|
||
"""Execute agent and return response."""
|
||
...
|
||
```
|
||
|
||
#### HTTPAgentAdapter
|
||
|
||
Adapter for HTTP-based agents.
|
||
|
||
```python
|
||
from flakestorm import HTTPAgentAdapter
|
||
|
||
adapter = HTTPAgentAdapter(
|
||
endpoint="http://localhost:8000/invoke",
|
||
timeout=30000, # ms
|
||
headers={"Authorization": "Bearer token"},
|
||
retries=2,
|
||
)
|
||
|
||
response = await adapter.invoke("Hello")
|
||
# Returns AgentResponse with .output, .latency_ms, .error
|
||
```
|
||
|
||
#### PythonAgentAdapter
|
||
|
||
Adapter for Python callable agents.
|
||
|
||
```python
|
||
from flakestorm import PythonAgentAdapter
|
||
|
||
async def my_agent(input: str) -> str:
|
||
return f"Response to: {input}"
|
||
|
||
adapter = PythonAgentAdapter(my_agent)
|
||
response = await adapter.invoke("Test")
|
||
```
|
||
|
||
#### create_agent_adapter
|
||
|
||
Factory function for creating adapters from config.
|
||
|
||
```python
|
||
from flakestorm import create_agent_adapter
|
||
|
||
adapter = create_agent_adapter(config.agent)
|
||
```
|
||
|
||
---
|
||
|
||
### Mutation Engine
|
||
|
||
#### MutationType
|
||
|
||
```python
|
||
from flakestorm import MutationType
|
||
|
||
# Original 8 types
|
||
MutationType.PARAPHRASE # Semantic rewrites
|
||
MutationType.NOISE # Typos and errors
|
||
MutationType.TONE_SHIFT # Aggressive tone
|
||
MutationType.PROMPT_INJECTION # Basic adversarial attacks
|
||
MutationType.ENCODING_ATTACKS # Encoded inputs (Base64, Unicode, URL)
|
||
MutationType.CONTEXT_MANIPULATION # Context manipulation
|
||
MutationType.LENGTH_EXTREMES # Edge cases (empty/long inputs)
|
||
MutationType.CUSTOM # User-defined templates
|
||
|
||
# Advanced prompt-level attacks (7 new types)
|
||
MutationType.MULTI_TURN_ATTACK # Context persistence and conversation state
|
||
MutationType.ADVANCED_JAILBREAK # Advanced prompt injection (DAN, role-playing)
|
||
MutationType.SEMANTIC_SIMILARITY_ATTACK # Adversarial examples
|
||
MutationType.FORMAT_POISONING # Structured data injection (JSON, XML)
|
||
MutationType.LANGUAGE_MIXING # Multilingual, code-switching, emoji
|
||
MutationType.TOKEN_MANIPULATION # Tokenizer edge cases, special tokens
|
||
MutationType.TEMPORAL_ATTACK # Time-sensitive context, impossible dates
|
||
|
||
# System/Network-level attacks (8+ new types)
|
||
MutationType.HTTP_HEADER_INJECTION # HTTP header manipulation
|
||
MutationType.PAYLOAD_SIZE_ATTACK # Extremely large payloads, DoS
|
||
MutationType.CONTENT_TYPE_CONFUSION # MIME type manipulation
|
||
MutationType.QUERY_PARAMETER_POISONING # Malicious query parameters
|
||
MutationType.REQUEST_METHOD_ATTACK # HTTP method confusion
|
||
MutationType.PROTOCOL_LEVEL_ATTACK # Protocol-level exploits
|
||
MutationType.RESOURCE_EXHAUSTION # CPU/memory exhaustion, DoS
|
||
MutationType.CONCURRENT_REQUEST_PATTERN # Race conditions, concurrent state
|
||
MutationType.TIMEOUT_MANIPULATION # Timeout handling, slow requests
|
||
|
||
# Properties
|
||
MutationType.PARAPHRASE.display_name # "Paraphrase"
|
||
MutationType.PARAPHRASE.default_weight # 1.0
|
||
MutationType.PARAPHRASE.description # "Rewrite using..."
|
||
```
|
||
|
||
**Mutation Types Overview (22+ types):**
|
||
|
||
#### Prompt-Level Attacks
|
||
|
||
| Type | Description | Default Weight | When to Use |
|
||
|------|-------------|----------------|-------------|
|
||
| `PARAPHRASE` | Semantically equivalent rewrites | 1.0 | Test semantic understanding |
|
||
| `NOISE` | Typos and spelling errors | 0.8 | Test input robustness |
|
||
| `TONE_SHIFT` | Aggressive/impatient phrasing | 0.9 | Test emotional resilience |
|
||
| `PROMPT_INJECTION` | Basic adversarial attack attempts | 1.5 | Test security |
|
||
| `ENCODING_ATTACKS` | Base64, Unicode, URL encoding | 1.3 | Test parser robustness and security |
|
||
| `CONTEXT_MANIPULATION` | Adding/removing/reordering context | 1.1 | Test context extraction |
|
||
| `LENGTH_EXTREMES` | Empty, minimal, or very long inputs | 1.2 | Test boundary conditions |
|
||
| `MULTI_TURN_ATTACK` | Context persistence and conversation state | 1.4 | Test conversational agents |
|
||
| `ADVANCED_JAILBREAK` | Advanced prompt injection (DAN, role-playing) | 2.0 | Test advanced security |
|
||
| `SEMANTIC_SIMILARITY_ATTACK` | Adversarial examples - similar but different | 1.3 | Test robustness |
|
||
| `FORMAT_POISONING` | Structured data injection (JSON, XML, markdown) | 1.6 | Test structured data parsing |
|
||
| `LANGUAGE_MIXING` | Multilingual, code-switching, emoji | 1.2 | Test internationalization |
|
||
| `TOKEN_MANIPULATION` | Tokenizer edge cases, special tokens | 1.5 | Test LLM tokenization |
|
||
| `TEMPORAL_ATTACK` | Time-sensitive context, impossible dates | 1.1 | Test temporal reasoning |
|
||
| `CUSTOM` | User-defined mutation templates | 1.0 | Test domain-specific scenarios |
|
||
|
||
#### System/Network-Level Attacks
|
||
|
||
| Type | Description | Default Weight | When to Use |
|
||
|------|-------------|----------------|-------------|
|
||
| `HTTP_HEADER_INJECTION` | HTTP header manipulation attacks | 1.7 | Test HTTP API security |
|
||
| `PAYLOAD_SIZE_ATTACK` | Extremely large payloads, DoS | 1.4 | Test resource limits |
|
||
| `CONTENT_TYPE_CONFUSION` | MIME type manipulation | 1.5 | Test HTTP parsers |
|
||
| `QUERY_PARAMETER_POISONING` | Malicious query parameters | 1.6 | Test GET-based APIs |
|
||
| `REQUEST_METHOD_ATTACK` | HTTP method confusion | 1.3 | Test REST APIs |
|
||
| `PROTOCOL_LEVEL_ATTACK` | Protocol-level exploits (request smuggling) | 1.8 | Test protocol handling |
|
||
| `RESOURCE_EXHAUSTION` | CPU/memory exhaustion, DoS | 1.5 | Test production resilience |
|
||
| `CONCURRENT_REQUEST_PATTERN` | Race conditions, concurrent state | 1.4 | Test high-traffic agents |
|
||
| `TIMEOUT_MANIPULATION` | Timeout handling, slow requests | 1.3 | Test timeout resilience |
|
||
|
||
**Mutation Strategy:**
|
||
|
||
Choose mutation types based on your testing goals:
|
||
- **Comprehensive**: Use all 22+ types for complete coverage
|
||
- **Security-focused**: Emphasize `PROMPT_INJECTION`, `ADVANCED_JAILBREAK`, `PROTOCOL_LEVEL_ATTACK`, `HTTP_HEADER_INJECTION`
|
||
- **UX-focused**: Emphasize `NOISE`, `TONE_SHIFT`, `CONTEXT_MANIPULATION`, `LANGUAGE_MIXING`
|
||
- **Infrastructure-focused**: Emphasize all system/network-level types
|
||
- **Edge case testing**: Emphasize `LENGTH_EXTREMES`, `ENCODING_ATTACKS`, `TOKEN_MANIPULATION`, `RESOURCE_EXHAUSTION`
|
||
|
||
#### Mutation
|
||
|
||
```python
|
||
from flakestorm import Mutation, MutationType
|
||
|
||
mutation = Mutation(
|
||
original="Book a flight",
|
||
mutated="I need to fly",
|
||
type=MutationType.PARAPHRASE,
|
||
weight=1.0,
|
||
)
|
||
|
||
# Properties
|
||
mutation.id # Unique hash
|
||
mutation.is_valid() # Validity check
|
||
mutation.to_dict() # Serialize
|
||
mutation.character_diff # Character count difference
|
||
```
|
||
|
||
#### MutationEngine
|
||
|
||
```python
|
||
from flakestorm import MutationEngine
|
||
|
||
engine = MutationEngine(config.model)
|
||
|
||
# Verify Ollama connection
|
||
is_connected = await engine.verify_connection()
|
||
|
||
# Generate mutations
|
||
mutations = await engine.generate_mutations(
|
||
seed_prompt="Book a flight",
|
||
types=[MutationType.PARAPHRASE, MutationType.NOISE],
|
||
count=10,
|
||
)
|
||
|
||
# Batch generation
|
||
results = await engine.generate_batch(
|
||
prompts=["Prompt 1", "Prompt 2"],
|
||
types=[MutationType.PARAPHRASE],
|
||
count_per_prompt=5,
|
||
)
|
||
```
|
||
|
||
---
|
||
|
||
### Invariant Verification
|
||
|
||
#### InvariantVerifier
|
||
|
||
```python
|
||
from flakestorm import InvariantVerifier
|
||
|
||
verifier = InvariantVerifier(config.invariants)
|
||
|
||
# Verify a response
|
||
result = verifier.verify(
|
||
response="Agent output text",
|
||
latency_ms=150.0,
|
||
)
|
||
|
||
# Result properties
|
||
result.all_passed # bool
|
||
result.passed_count # int
|
||
result.failed_count # int
|
||
result.checks # list[CheckResult]
|
||
result.get_failed_checks()
|
||
result.get_passed_checks()
|
||
```
|
||
|
||
#### Built-in Checkers
|
||
|
||
```python
|
||
from flakestorm.assertions import (
|
||
ContainsChecker,
|
||
LatencyChecker,
|
||
ValidJsonChecker,
|
||
RegexChecker,
|
||
SimilarityChecker,
|
||
ExcludesPIIChecker,
|
||
RefusalChecker,
|
||
)
|
||
```
|
||
|
||
#### Custom Checker
|
||
|
||
```python
|
||
from flakestorm.assertions.deterministic import BaseChecker, CheckResult
|
||
|
||
class MyChecker(BaseChecker):
|
||
def check(self, response: str, latency_ms: float) -> CheckResult:
|
||
passed = "expected" in response
|
||
return CheckResult(
|
||
type=self.type,
|
||
passed=passed,
|
||
details="Custom check result",
|
||
)
|
||
```
|
||
|
||
---
|
||
|
||
### Test Results
|
||
|
||
#### TestResults
|
||
|
||
```python
|
||
results = await runner.run()
|
||
|
||
# Statistics
|
||
results.statistics.robustness_score # 0.0-1.0
|
||
results.statistics.total_mutations # int
|
||
results.statistics.passed_mutations # int
|
||
results.statistics.failed_mutations # int
|
||
results.statistics.avg_latency_ms # float
|
||
results.statistics.p95_latency_ms # float
|
||
results.statistics.by_type # list[TypeStatistics]
|
||
|
||
# Timing
|
||
results.started_at # datetime
|
||
results.completed_at # datetime
|
||
results.duration # seconds
|
||
|
||
# Mutations
|
||
results.mutations # list[MutationResult]
|
||
results.passed_mutations # list[MutationResult]
|
||
results.failed_mutations # list[MutationResult]
|
||
results.get_by_type("noise") # Filter by type
|
||
results.get_by_prompt("...") # Filter by prompt
|
||
|
||
# Serialization
|
||
results.to_dict() # Full JSON-serializable dict
|
||
|
||
# V2: Resilience and contract/replay (when config has contract/replays)
|
||
results.resilience_scores # dict: mutation_robustness, chaos_resilience, contract_compliance, replay_regression
|
||
results.contract_compliance # ContractRunResult | None (when contract run was executed)
|
||
# Replay results are reported via flakestorm replay run --output; see Reports below.
|
||
```
|
||
|
||
#### MutationResult
|
||
|
||
```python
|
||
for result in results.mutations:
|
||
result.original_prompt # str
|
||
result.mutation # Mutation object
|
||
result.response # str
|
||
result.latency_ms # float
|
||
result.passed # bool
|
||
result.checks # list[CheckResult]
|
||
result.error # str | None
|
||
result.failed_checks # list[CheckResult]
|
||
```
|
||
|
||
---
|
||
|
||
### Report Generation
|
||
|
||
#### HTMLReportGenerator
|
||
|
||
```python
|
||
from flakestorm.reports import HTMLReportGenerator
|
||
|
||
generator = HTMLReportGenerator(results)
|
||
|
||
# Generate HTML string
|
||
html = generator.generate()
|
||
|
||
# Save to file
|
||
path = generator.save() # Auto-generated path
|
||
path = generator.save("custom/path/report.html")
|
||
```
|
||
|
||
#### JSONReportGenerator
|
||
|
||
```python
|
||
from flakestorm.reports import JSONReportGenerator
|
||
|
||
generator = JSONReportGenerator(results)
|
||
|
||
# Full report
|
||
json_str = generator.generate(pretty=True)
|
||
|
||
# Summary only (for CI)
|
||
summary = generator.generate_summary()
|
||
|
||
# Save
|
||
path = generator.save()
|
||
path = generator.save(summary_only=True)
|
||
```
|
||
|
||
#### TerminalReporter
|
||
|
||
```python
|
||
from flakestorm.reports import TerminalReporter
|
||
from rich.console import Console
|
||
|
||
reporter = TerminalReporter(results, Console())
|
||
|
||
reporter.print_summary()
|
||
reporter.print_type_breakdown()
|
||
reporter.print_failures(limit=10)
|
||
reporter.print_full_report()
|
||
```
|
||
|
||
**V2 reports:** Contract runs (`flakestorm contract run --output report.html`) and replay runs (`flakestorm replay run --output report.html`) produce HTML reports that include **suggested actions** for failed cells or sessions (e.g. add reset_endpoint, tighten invariants, fix tool behavior). See [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md) and [Replay Regression](REPLAY_REGRESSION.md).
|
||
|
||
---
|
||
|
||
## CLI Commands
|
||
|
||
### `flakestorm init [PATH]`
|
||
|
||
Initialize a new configuration file.
|
||
|
||
```bash
|
||
flakestorm init # Creates flakestorm.yaml
|
||
flakestorm init config/test.yaml # Custom path
|
||
flakestorm init --force # Overwrite existing
|
||
```
|
||
|
||
### `flakestorm run`
|
||
|
||
Run reliability tests (mutation run; optionally with chaos).
|
||
|
||
```bash
|
||
flakestorm run # Default config (mutation only)
|
||
flakestorm run --config custom.yaml # Custom config
|
||
flakestorm run --chaos # Apply chaos (tool/LLM faults, context_attacks) during mutation run
|
||
flakestorm run --chaos-only # Chaos-only run (no mutations); requires chaos config
|
||
flakestorm run --chaos-profile api_outage # Use a built-in chaos profile
|
||
flakestorm run --output json # JSON output
|
||
flakestorm run --output terminal # Terminal only
|
||
flakestorm run --min-score 0.9 --ci # CI mode
|
||
flakestorm run --verify-only # Just verify setup
|
||
flakestorm run --quiet # Minimal output
|
||
```
|
||
|
||
### `flakestorm verify`
|
||
|
||
Verify configuration and connections.
|
||
|
||
```bash
|
||
flakestorm verify
|
||
flakestorm verify --config custom.yaml
|
||
```
|
||
|
||
### `flakestorm report PATH`
|
||
|
||
View or convert existing reports.
|
||
|
||
```bash
|
||
flakestorm report results.json # View in terminal
|
||
flakestorm report results.json --output html # Convert to HTML
|
||
```
|
||
|
||
### `flakestorm score`
|
||
|
||
Output only the robustness score (for CI scripts).
|
||
|
||
```bash
|
||
SCORE=$(flakestorm score)
|
||
if (( $(echo "$SCORE >= 0.9" | bc -l) )); then
|
||
echo "Passed"
|
||
else
|
||
echo "Failed"
|
||
exit 1
|
||
fi
|
||
```
|
||
|
||
### V2: `flakestorm contract run` / `validate` / `score`
|
||
|
||
Run behavioral contract tests (invariants × chaos matrix).
|
||
|
||
```bash
|
||
flakestorm contract run # Run contract matrix; progress and score in terminal
|
||
flakestorm contract run --output report.html # Save HTML report with suggested actions for failed cells
|
||
flakestorm contract validate # Validate contract config only
|
||
flakestorm contract score # Output contract resilience score only
|
||
```
|
||
|
||
### V2: `flakestorm replay run` / `export`
|
||
|
||
Replay regression: run saved sessions and verify against a contract.
|
||
|
||
```bash
|
||
flakestorm replay run # Replay sessions from config (file or inline)
|
||
flakestorm replay run path/to/session.yaml # Replay a single session file
|
||
flakestorm replay run path/to/replays/ # Replay all sessions in directory
|
||
flakestorm replay run --output report.html # Save HTML report with suggested actions for failed sessions
|
||
flakestorm replay export --from-report FILE # Export from an existing report
|
||
```
|
||
|
||
### V2: `flakestorm ci`
|
||
|
||
Run full CI pipeline: mutation run, contract run (if configured), chaos-only (if chaos configured), replay (if configured); then compute overall weighted score from `scoring.weights`.
|
||
|
||
```bash
|
||
flakestorm ci
|
||
flakestorm ci --config custom.yaml
|
||
```
|
||
|
||
---
|
||
|
||
## Environment Variables
|
||
|
||
| Variable | Description |
|
||
|----------|-------------|
|
||
| `OLLAMA_HOST` | Override Ollama server URL |
|
||
| Custom headers | Expanded in config via `${VAR}` syntax |
|
||
|
||
**V2 — API keys (env-only):** Model API keys must not be literal in config. Use environment variables and reference them in config (e.g. `api_key: "${OPENAI_API_KEY}"`). Supported: `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, etc. See [LLM Providers](LLM_PROVIDERS.md).
|
||
|
||
---
|
||
|
||
## Exit Codes
|
||
|
||
| Code | Meaning |
|
||
|------|---------|
|
||
| 0 | Success |
|
||
| 1 | Error (config, connection, etc.) |
|
||
| 1 | CI mode: Score below threshold |
|