# flakestorm API Specification
## Python SDK
### Quick Start
```python
import asyncio

from flakestorm import FlakeStormRunner, load_config

async def main():
    config = load_config("flakestorm.yaml")
    runner = FlakeStormRunner(config)
    results = await runner.run()
    print(f"Robustness Score: {results.statistics.robustness_score:.1%}")

asyncio.run(main())
```
---
## Core Classes
### FlakeStormConfig
Configuration container for all flakestorm settings.
```python
from flakestorm import FlakeStormConfig, load_config
# Load from file
config = load_config("flakestorm.yaml")
# Access properties
config.agent.endpoint # str
config.model.name # str
config.golden_prompts # list[str]
config.invariants # list[InvariantConfig]
# Serialize
yaml_str = config.to_yaml()
# Parse from string
config = FlakeStormConfig.from_yaml(yaml_content)
```
#### Properties
| Property | Type | Description |
|----------|------|-------------|
| `version` | `str` | Config version (`1.0` \| `2.0`) |
| `agent` | `AgentConfig` | Agent connection settings (includes V2 `reset_endpoint`, `reset_function`) |
| `model` | `ModelConfig` | LLM settings (V2: `api_key` env-only) |
| `mutations` | `MutationConfig` | Mutation generation (max 50/run OSS, 22+ types) |
| `golden_prompts` | `list[str]` | Test prompts |
| `invariants` | `list[InvariantConfig]` | Assertion rules |
| `output` | `OutputConfig` | Report settings |
| `advanced` | `AdvancedConfig` | Advanced options |
| **V2** `chaos` | `ChaosConfig \| None` | Tool/LLM faults and context_attacks (list or dict) |
| **V2** `contract` | `ContractConfig \| None` | Behavioral contract and chaos_matrix (scenarios may include context_attacks) |
| **V2** `chaos_matrix` | `list[ChaosScenarioConfig] \| None` | Top-level chaos scenarios when not using contract.chaos_matrix |
| **V2** `replays` | `ReplayConfig \| None` | Replay sessions (file or inline) and LangSmith sources |
| **V2** `scoring` | `ScoringConfig \| None` | Weights for mutation, chaos, contract, replay (must sum to 1.0) |
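
The properties above correspond to top-level keys in `flakestorm.yaml`. A minimal sketch follows; the key nesting mirrors the table, but the specific values (model name, prompt, weights) are illustrative assumptions, not a schema reference:

```yaml
version: "2.0"
agent:
  endpoint: "http://localhost:8000/invoke"
model:
  name: "llama3"                  # illustrative model name
  api_key: "${OPENAI_API_KEY}"    # V2: env-only, never a literal key
golden_prompts:
  - "Book a flight"
scoring:
  weights:                        # must sum to 1.0
    mutation: 0.4
    chaos: 0.2
    contract: 0.2
    replay: 0.2
```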
---
### FlakeStormRunner
Main test runner class.
```python
from flakestorm import FlakeStormRunner

runner = FlakeStormRunner(
    config="flakestorm.yaml",  # or FlakeStormConfig object
    agent=None,                # optional: pre-configured adapter
    console=None,              # optional: Rich console
    show_progress=True,        # show progress bars
)

# Run tests
results = await runner.run()

# Verify setup only
is_valid = await runner.verify_setup()

# Get config summary
summary = runner.get_config_summary()
```
#### Methods
| Method | Returns | Description |
|--------|---------|-------------|
| `run()` | `TestResults` | Execute full test suite |
| `verify_setup()` | `bool` | Check configuration validity |
| `get_config_summary()` | `str` | Human-readable config summary |
---
### Agent Adapters
#### AgentProtocol
Interface for custom agent implementations.
```python
from typing import Protocol

class AgentProtocol(Protocol):
    async def invoke(self, input: str) -> str:
        """Execute agent and return response."""
        ...
```
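
Because this is a structural protocol, any object with a matching async `invoke` satisfies it; no inheritance is required. A self-contained sketch (the `EchoAgent` class is illustrative, not part of flakestorm):

```python
import asyncio
from typing import Protocol, runtime_checkable

@runtime_checkable
class AgentProtocol(Protocol):
    async def invoke(self, input: str) -> str: ...

class EchoAgent:
    """Illustrative agent: echoes its input back."""

    async def invoke(self, input: str) -> str:
        return f"echo: {input}"

# Structural (duck-typed) check: EchoAgent never subclasses AgentProtocol.
print(isinstance(EchoAgent(), AgentProtocol))       # True
print(asyncio.run(EchoAgent().invoke("hi")))        # echo: hi
```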
#### HTTPAgentAdapter
Adapter for HTTP-based agents.
```python
from flakestorm import HTTPAgentAdapter

adapter = HTTPAgentAdapter(
    endpoint="http://localhost:8000/invoke",
    timeout=30000,  # ms
    headers={"Authorization": "Bearer token"},
    retries=2,
)

response = await adapter.invoke("Hello")
# Returns AgentResponse with .output, .latency_ms, .error
```
#### PythonAgentAdapter
Adapter for Python callable agents.
```python
from flakestorm import PythonAgentAdapter

async def my_agent(input: str) -> str:
    return f"Response to: {input}"

adapter = PythonAgentAdapter(my_agent)
response = await adapter.invoke("Test")
```
#### create_agent_adapter
Factory function for creating adapters from config.
```python
from flakestorm import create_agent_adapter
adapter = create_agent_adapter(config.agent)
```
---
### Mutation Engine
#### MutationType
```python
from flakestorm import MutationType
# Original 8 types
MutationType.PARAPHRASE # Semantic rewrites
MutationType.NOISE # Typos and errors
MutationType.TONE_SHIFT # Aggressive tone
MutationType.PROMPT_INJECTION # Basic adversarial attacks
MutationType.ENCODING_ATTACKS # Encoded inputs (Base64, Unicode, URL)
MutationType.CONTEXT_MANIPULATION # Context manipulation
MutationType.LENGTH_EXTREMES # Edge cases (empty/long inputs)
MutationType.CUSTOM # User-defined templates
# Advanced prompt-level attacks (7 new types)
MutationType.MULTI_TURN_ATTACK # Context persistence and conversation state
MutationType.ADVANCED_JAILBREAK # Advanced prompt injection (DAN, role-playing)
MutationType.SEMANTIC_SIMILARITY_ATTACK # Adversarial examples
MutationType.FORMAT_POISONING # Structured data injection (JSON, XML)
MutationType.LANGUAGE_MIXING # Multilingual, code-switching, emoji
MutationType.TOKEN_MANIPULATION # Tokenizer edge cases, special tokens
MutationType.TEMPORAL_ATTACK # Time-sensitive context, impossible dates
# System/Network-level attacks (8+ new types)
MutationType.HTTP_HEADER_INJECTION # HTTP header manipulation
MutationType.PAYLOAD_SIZE_ATTACK # Extremely large payloads, DoS
MutationType.CONTENT_TYPE_CONFUSION # MIME type manipulation
MutationType.QUERY_PARAMETER_POISONING # Malicious query parameters
MutationType.REQUEST_METHOD_ATTACK # HTTP method confusion
MutationType.PROTOCOL_LEVEL_ATTACK # Protocol-level exploits
MutationType.RESOURCE_EXHAUSTION # CPU/memory exhaustion, DoS
MutationType.CONCURRENT_REQUEST_PATTERN # Race conditions, concurrent state
MutationType.TIMEOUT_MANIPULATION # Timeout handling, slow requests
# Properties
MutationType.PARAPHRASE.display_name # "Paraphrase"
MutationType.PARAPHRASE.default_weight # 1.0
MutationType.PARAPHRASE.description # "Rewrite using..."
```
**Mutation Types Overview (22+ types):**
#### Prompt-Level Attacks
| Type | Description | Default Weight | When to Use |
|------|-------------|----------------|-------------|
| `PARAPHRASE` | Semantically equivalent rewrites | 1.0 | Test semantic understanding |
| `NOISE` | Typos and spelling errors | 0.8 | Test input robustness |
| `TONE_SHIFT` | Aggressive/impatient phrasing | 0.9 | Test emotional resilience |
| `PROMPT_INJECTION` | Basic adversarial attack attempts | 1.5 | Test security |
| `ENCODING_ATTACKS` | Base64, Unicode, URL encoding | 1.3 | Test parser robustness and security |
| `CONTEXT_MANIPULATION` | Adding/removing/reordering context | 1.1 | Test context extraction |
| `LENGTH_EXTREMES` | Empty, minimal, or very long inputs | 1.2 | Test boundary conditions |
| `MULTI_TURN_ATTACK` | Context persistence and conversation state | 1.4 | Test conversational agents |
| `ADVANCED_JAILBREAK` | Advanced prompt injection (DAN, role-playing) | 2.0 | Test advanced security |
| `SEMANTIC_SIMILARITY_ATTACK` | Adversarial examples - similar but different | 1.3 | Test robustness |
| `FORMAT_POISONING` | Structured data injection (JSON, XML, markdown) | 1.6 | Test structured data parsing |
| `LANGUAGE_MIXING` | Multilingual, code-switching, emoji | 1.2 | Test internationalization |
| `TOKEN_MANIPULATION` | Tokenizer edge cases, special tokens | 1.5 | Test LLM tokenization |
| `TEMPORAL_ATTACK` | Time-sensitive context, impossible dates | 1.1 | Test temporal reasoning |
| `CUSTOM` | User-defined mutation templates | 1.0 | Test domain-specific scenarios |
#### System/Network-Level Attacks
| Type | Description | Default Weight | When to Use |
|------|-------------|----------------|-------------|
| `HTTP_HEADER_INJECTION` | HTTP header manipulation attacks | 1.7 | Test HTTP API security |
| `PAYLOAD_SIZE_ATTACK` | Extremely large payloads, DoS | 1.4 | Test resource limits |
| `CONTENT_TYPE_CONFUSION` | MIME type manipulation | 1.5 | Test HTTP parsers |
| `QUERY_PARAMETER_POISONING` | Malicious query parameters | 1.6 | Test GET-based APIs |
| `REQUEST_METHOD_ATTACK` | HTTP method confusion | 1.3 | Test REST APIs |
| `PROTOCOL_LEVEL_ATTACK` | Protocol-level exploits (request smuggling) | 1.8 | Test protocol handling |
| `RESOURCE_EXHAUSTION` | CPU/memory exhaustion, DoS | 1.5 | Test production resilience |
| `CONCURRENT_REQUEST_PATTERN` | Race conditions, concurrent state | 1.4 | Test high-traffic agents |
| `TIMEOUT_MANIPULATION` | Timeout handling, slow requests | 1.3 | Test timeout resilience |
**Mutation Strategy:**
Choose mutation types based on your testing goals:
- **Comprehensive**: Use all 22+ types for complete coverage
- **Security-focused**: Emphasize `PROMPT_INJECTION`, `ADVANCED_JAILBREAK`, `PROTOCOL_LEVEL_ATTACK`, `HTTP_HEADER_INJECTION`
- **UX-focused**: Emphasize `NOISE`, `TONE_SHIFT`, `CONTEXT_MANIPULATION`, `LANGUAGE_MIXING`
- **Infrastructure-focused**: Emphasize all system/network-level types
- **Edge case testing**: Emphasize `LENGTH_EXTREMES`, `ENCODING_ATTACKS`, `TOKEN_MANIPULATION`, `RESOURCE_EXHAUSTION`
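
The default weights in the tables above bias how often each type is exercised: a weight of 2.0 makes `ADVANCED_JAILBREAK` roughly twice as likely to be selected as a weight-1.0 type. A self-contained sketch of weighted selection (the sampling helper and the hard-coded weight dict are illustrative, not part of the flakestorm API; weights are copied from the tables):

```python
import random

# Default weights for the security-focused types, from the tables above.
SECURITY_WEIGHTS = {
    "PROMPT_INJECTION": 1.5,
    "ADVANCED_JAILBREAK": 2.0,
    "PROTOCOL_LEVEL_ATTACK": 1.8,
    "HTTP_HEADER_INJECTION": 1.7,
}

def pick_mutation_types(weights: dict[str, float], k: int, seed: int = 0) -> list[str]:
    """Sample k mutation types, biased toward higher default weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)

print(pick_mutation_types(SECURITY_WEIGHTS, 5))
```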
#### Mutation
```python
from flakestorm import Mutation, MutationType

mutation = Mutation(
    original="Book a flight",
    mutated="I need to fly",
    type=MutationType.PARAPHRASE,
    weight=1.0,
)

# Properties
mutation.id              # Unique hash
mutation.is_valid()      # Validity check
mutation.to_dict()       # Serialize
mutation.character_diff  # Character count difference
```
#### MutationEngine
```python
from flakestorm import MutationEngine

engine = MutationEngine(config.model)

# Verify Ollama connection
is_connected = await engine.verify_connection()

# Generate mutations
mutations = await engine.generate_mutations(
    seed_prompt="Book a flight",
    types=[MutationType.PARAPHRASE, MutationType.NOISE],
    count=10,
)

# Batch generation
results = await engine.generate_batch(
    prompts=["Prompt 1", "Prompt 2"],
    types=[MutationType.PARAPHRASE],
    count_per_prompt=5,
)
```
---
### Invariant Verification
#### InvariantVerifier
```python
from flakestorm import InvariantVerifier

verifier = InvariantVerifier(config.invariants)

# Verify a response
result = verifier.verify(
    response="Agent output text",
    latency_ms=150.0,
)

# Result properties
result.all_passed    # bool
result.passed_count  # int
result.failed_count  # int
result.checks        # list[CheckResult]
result.get_failed_checks()
result.get_passed_checks()
```
#### Built-in Checkers
```python
from flakestorm.assertions import (
ContainsChecker,
LatencyChecker,
ValidJsonChecker,
RegexChecker,
SimilarityChecker,
ExcludesPIIChecker,
RefusalChecker,
)
```
#### Custom Checker
```python
from flakestorm.assertions.deterministic import BaseChecker, CheckResult

class MyChecker(BaseChecker):
    def check(self, response: str, latency_ms: float) -> CheckResult:
        passed = "expected" in response
        return CheckResult(
            type=self.type,
            passed=passed,
            details="Custom check result",
        )
```
---
### Test Results
#### TestResults
```python
results = await runner.run()
# Statistics
results.statistics.robustness_score # 0.0-1.0
results.statistics.total_mutations # int
results.statistics.passed_mutations # int
results.statistics.failed_mutations # int
results.statistics.avg_latency_ms # float
results.statistics.p95_latency_ms # float
results.statistics.by_type # list[TypeStatistics]
# Timing
results.started_at # datetime
results.completed_at # datetime
results.duration # seconds
# Mutations
results.mutations # list[MutationResult]
results.passed_mutations # list[MutationResult]
results.failed_mutations # list[MutationResult]
results.get_by_type("noise") # Filter by type
results.get_by_prompt("...") # Filter by prompt
# Serialization
results.to_dict() # Full JSON-serializable dict
# V2: Resilience and contract/replay (when config has contract/replays)
results.resilience_scores # dict: mutation_robustness, chaos_resilience, contract_compliance, replay_regression
results.contract_compliance # ContractRunResult | None (when contract run was executed)
# Replay results are reported via flakestorm replay run --output; see Reports below.
```
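
When V2 scoring is configured, the per-dimension resilience scores combine into the overall weighted score used by `flakestorm ci`. A self-contained sketch of that arithmetic (the dict keys and score values here are illustrative assumptions, not the exact `resilience_scores` shape):

```python
# Weights must sum to 1.0 (see ScoringConfig); values are illustrative.
weights = {"mutation": 0.4, "chaos": 0.2, "contract": 0.2, "replay": 0.2}
scores  = {"mutation": 0.95, "chaos": 0.80, "contract": 1.0, "replay": 0.90}

assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1.0"

# Overall score is the weight-averaged sum of the dimension scores.
overall = sum(weights[k] * scores[k] for k in weights)
print(f"{overall:.3f}")  # prints 0.920
```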
#### MutationResult
```python
for result in results.mutations:
    result.original_prompt  # str
    result.mutation         # Mutation object
    result.response         # str
    result.latency_ms       # float
    result.passed           # bool
    result.checks           # list[CheckResult]
    result.error            # str | None
    result.failed_checks    # list[CheckResult]
```
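
These fields make it easy to aggregate failures for triage. A self-contained sketch that groups failed results by mutation type (the dataclasses below are hypothetical stand-ins mirroring the documented fields; real objects come from `results.mutations`):

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class FakeMutation:
    """Stand-in for Mutation; only the .type field is used here."""
    type: str

@dataclass
class FakeResult:
    """Stand-in for MutationResult; only .passed and .mutation are used."""
    passed: bool
    mutation: FakeMutation

def failures_by_type(results) -> Counter:
    """Count failed mutations per mutation type."""
    return Counter(r.mutation.type for r in results if not r.passed)

sample = [
    FakeResult(False, FakeMutation("noise")),
    FakeResult(True,  FakeMutation("noise")),
    FakeResult(False, FakeMutation("paraphrase")),
]
print(failures_by_type(sample))  # Counter({'noise': 1, 'paraphrase': 1})
```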
---
### Report Generation
#### HTMLReportGenerator
```python
from flakestorm.reports import HTMLReportGenerator
generator = HTMLReportGenerator(results)
# Generate HTML string
html = generator.generate()
# Save to file
path = generator.save() # Auto-generated path
path = generator.save("custom/path/report.html")
```
#### JSONReportGenerator
```python
from flakestorm.reports import JSONReportGenerator
generator = JSONReportGenerator(results)
# Full report
json_str = generator.generate(pretty=True)
# Summary only (for CI)
summary = generator.generate_summary()
# Save
path = generator.save()
path = generator.save(summary_only=True)
```
#### TerminalReporter
```python
from flakestorm.reports import TerminalReporter
from rich.console import Console
reporter = TerminalReporter(results, Console())
reporter.print_summary()
reporter.print_type_breakdown()
reporter.print_failures(limit=10)
reporter.print_full_report()
```
**V2 reports:** Contract runs (`flakestorm contract run --output report.html`) and replay runs (`flakestorm replay run --output report.html`) produce HTML reports that include **suggested actions** for failed cells or sessions (e.g. add reset_endpoint, tighten invariants, fix tool behavior). See [Behavioral Contracts](BEHAVIORAL_CONTRACTS.md) and [Replay Regression](REPLAY_REGRESSION.md).
---
## CLI Commands
### `flakestorm init [PATH]`
Initialize a new configuration file.
```bash
flakestorm init # Creates flakestorm.yaml
flakestorm init config/test.yaml # Custom path
flakestorm init --force # Overwrite existing
```
### `flakestorm run`
Run reliability tests (mutation run; optionally with chaos).
```bash
flakestorm run # Default config (mutation only)
flakestorm run --config custom.yaml # Custom config
flakestorm run --chaos # Apply chaos (tool/LLM faults, context_attacks) during mutation run
flakestorm run --chaos-only # Chaos-only run (no mutations); requires chaos config
flakestorm run --chaos-profile api_outage # Use a built-in chaos profile
flakestorm run --output json # JSON output
flakestorm run --output terminal # Terminal only
flakestorm run --min-score 0.9 --ci # CI mode
flakestorm run --verify-only # Just verify setup
flakestorm run --quiet # Minimal output
```
### `flakestorm verify`
Verify configuration and connections.
```bash
flakestorm verify
flakestorm verify --config custom.yaml
```
### `flakestorm report PATH`
View or convert existing reports.
```bash
flakestorm report results.json # View in terminal
flakestorm report results.json --output html # Convert to HTML
```
### `flakestorm score`
Output only the robustness score (for CI scripts).
```bash
SCORE=$(flakestorm score)
if (( $(echo "$SCORE >= 0.9" | bc -l) )); then
    echo "Passed"
else
    echo "Failed"
    exit 1
fi
### V2: `flakestorm contract run` / `validate` / `score`
Run behavioral contract tests (invariants × chaos matrix).
```bash
flakestorm contract run # Run contract matrix; progress and score in terminal
flakestorm contract run --output report.html # Save HTML report with suggested actions for failed cells
flakestorm contract validate # Validate contract config only
flakestorm contract score # Output contract resilience score only
```
### V2: `flakestorm replay run` / `export`
Replay regression: run saved sessions and verify against a contract.
```bash
flakestorm replay run # Replay sessions from config (file or inline)
flakestorm replay run path/to/session.yaml # Replay a single session file
flakestorm replay run path/to/replays/ # Replay all sessions in directory
flakestorm replay run --output report.html # Save HTML report with suggested actions for failed sessions
flakestorm replay export --from-report FILE # Export from an existing report
```
### V2: `flakestorm ci`
Run full CI pipeline: mutation run, contract run (if configured), chaos-only (if chaos configured), replay (if configured); then compute overall weighted score from `scoring.weights`.
```bash
flakestorm ci
flakestorm ci --config custom.yaml
```
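In a pipeline, the command's exit code gates the build directly. A GitHub Actions step might look like the following sketch (the step layout and the assumption that the package installs via `pip install flakestorm` are illustrative, not verified):

```yaml
# Hypothetical CI step; adjust installation and config path to your setup.
- name: flakestorm CI gate
  run: |
    pip install flakestorm
    flakestorm ci --config flakestorm.yaml   # non-zero exit fails the job
```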
---
## Environment Variables
| Variable | Description |
|----------|-------------|
| `OLLAMA_HOST` | Override Ollama server URL |
| Custom headers | Expanded in config via `${VAR}` syntax |
**V2 — API keys (env-only):** Model API keys must not be literal in config. Use environment variables and reference them in config (e.g. `api_key: "${OPENAI_API_KEY}"`). Supported: `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, etc. See [LLM Providers](LLM_PROVIDERS.md).
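
For example, `${VAR}` references expand wherever the config accepts them, such as the model key and custom headers. A sketch (the `MY_AGENT_TOKEN` variable name is illustrative):

```yaml
model:
  api_key: "${OPENAI_API_KEY}"             # expanded from the environment
agent:
  endpoint: "http://localhost:8000/invoke"
  headers:
    Authorization: "Bearer ${MY_AGENT_TOKEN}"   # hypothetical variable
```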
---
## Exit Codes
| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | Error (config, connection, etc.) or, in CI mode, score below threshold |