flakestorm Module Documentation
This document provides a comprehensive explanation of each module in the flakestorm codebase, what it does, how it works, and analysis of its design decisions.
Table of Contents
- Architecture Overview
- Core Modules
- Mutation Modules
- Assertion Modules
- Reporting Modules
- CLI Module
- Rust Performance Module
- Design Analysis
Architecture Overview
```
flakestorm/
├── core/                   # Core orchestration logic
│   ├── config.py           # Configuration (V1 + V2: chaos, contract, replays, scoring)
│   ├── protocol.py         # Agent adapters, create_instrumented_adapter (chaos interceptor)
│   ├── orchestrator.py     # Main test coordination
│   ├── runner.py           # High-level test runner
│   └── performance.py      # Rust/Python bridge
├── chaos/                  # V2 environment chaos
│   ├── context_attacks.py  # memory_poisoning (input before invoke), indirect_injection, normalize_context_attacks
│   ├── interceptor.py      # ChaosInterceptor: memory_poisoning + LLM faults (timeout before call, others after)
│   ├── faults.py           # should_trigger, tool/LLM fault application
│   ├── llm_proxy.py        # apply_llm_fault (truncated, empty, garbage, rate_limit, response_drift)
│   └── profiles/           # Built-in chaos profiles
├── contracts/              # V2 behavioral contracts
│   ├── engine.py           # ContractEngine: (invariant × scenario) cells, reset, probes, behavior_unchanged
│   └── matrix.py           # ResilienceMatrix
├── replay/                 # V2 replay regression
│   ├── loader.py           # Load replay sessions (file or inline)
│   └── runner.py           # Replay execution
├── mutations/              # Adversarial input generation (22+ types, max 50/run OSS)
│   ├── types.py            # MutationType enum
│   ├── templates.py        # LLM prompt templates
│   └── engine.py           # Mutation generation engine
├── assertions/             # Response validation
│   ├── deterministic.py    # Rule-based assertions
│   ├── semantic.py         # AI-based assertions
│   ├── safety.py           # Security assertions
│   └── verifier.py         # InvariantVerifier (all invariant types including behavior_unchanged)
├── reports/                # Output generation
│   ├── models.py           # Report data models
│   ├── html.py             # HTML report generator
│   ├── json_export.py      # JSON export
│   └── terminal.py         # Terminal output
├── cli/                    # Command-line interface
│   └── main.py             # flakestorm run, contract run, replay run, ci
└── integrations/           # External integrations
    ├── huggingface.py      # HuggingFace model support
    └── embeddings.py       # Local embeddings
```
Core Modules
config.py - Configuration Management
Location: src/flakestorm/core/config.py
Purpose: Handles loading, validating, and providing type-safe access to the flakestorm.yaml configuration file.
Key Components:
```python
class AgentConfig(BaseModel):
    """Configuration for connecting to the target agent."""
    endpoint: str               # Agent URL or Python module path
    type: AgentType             # http, python, or langchain
    timeout: int = 30000        # Request timeout (ms)
    headers: dict = {}          # HTTP headers
    request_template: str       # How to format requests
    response_path: str          # JSONPath to extract response
    # V2: state isolation for contract matrix
    reset_endpoint: str | None  # HTTP POST URL called before each cell
    reset_function: str | None  # Python path, e.g. myagent:reset_state

class FlakeStormConfig(BaseModel):
    """Root configuration model."""
    version: str = "1.0"             # 1.0 | 2.0
    agent: AgentConfig
    golden_prompts: list[str]
    mutations: MutationConfig        # count max 50 in OSS; 22+ mutation types
    model: ModelConfig               # api_key env-only in V2
    invariants: list[InvariantConfig]
    output: OutputConfig
    advanced: AdvancedConfig
    # V2 optional
    chaos: ChaosConfig | None        # tool_faults, llm_faults, context_attacks (list or dict)
    contract: ContractConfig | None  # invariants + chaos_matrix (scenarios can have context_attacks)
    chaos_matrix: list[ChaosScenarioConfig] | None  # when not using contract.chaos_matrix
    replays: ReplayConfig | None     # sessions (file or inline), sources (LangSmith)
    scoring: ScoringConfig | None    # mutation, chaos, contract, replay weights (must sum to 1.0)
```
Key Functions:
| Function | Purpose |
|---|---|
| `load_config(path)` | Load and validate YAML config file |
| `expand_env_vars()` | Replace `${VAR}` with environment values |
| `validate_config()` | Run Pydantic validation |
Design Analysis:
✅ Strengths:
- Uses Pydantic for robust validation with clear error messages
- Environment variable expansion for secrets management
- Type safety prevents runtime configuration errors
- Default values reduce required configuration
⚠️ Considerations:
- Large config model - could be split into smaller files for maintainability
- No schema versioning - future config changes need migration support
Why This Design: Pydantic was chosen over alternatives (dataclasses, attrs) because:
- Built-in YAML/JSON serialization
- Automatic validation with descriptive errors
- Environment variable support
- Wide ecosystem adoption
protocol.py - Agent Adapters
Location: src/flakestorm/core/protocol.py
Purpose: Provides a unified interface for communicating with different types of AI agents (HTTP APIs, Python functions, LangChain).
Key Components:
```python
class AgentProtocol(Protocol):
    """Protocol that all agent adapters must implement."""

    async def invoke(self, prompt: str) -> AgentResponse:
        """Send prompt to agent and return response."""
        ...

class HTTPAgentAdapter(BaseAgentAdapter):
    """Adapter for HTTP-based agents."""

    async def invoke(self, prompt: str) -> AgentResponse:
        # 1. Format request using template
        # 2. Send HTTP POST with headers
        # 3. Extract response using JSONPath
        # 4. Return with latency measurement

class PythonAgentAdapter(BaseAgentAdapter):
    """Adapter for Python function agents."""

    async def invoke(self, prompt: str) -> AgentResponse:
        # 1. Import the specified module
        # 2. Call the function with prompt
        # 3. Return response with timing
```
Design Analysis:
✅ Strengths:
- Protocol pattern allows easy extension for new agent types
- Async-first design for efficient parallel testing
- Built-in latency measurement for performance tracking
- Retry logic handles transient failures
⚠️ Considerations:
- HTTP adapter assumes JSON request/response format
- Python adapter uses dynamic import which can be security-sensitive
Why This Design: The adapter pattern was chosen because:
- Decouples test logic from agent communication
- Easy to add new agent types without modifying core
- Allows mocking for unit tests
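The mocking benefit follows directly from the protocol: any object with a matching `invoke()` satisfies it, so tests can swap in a toy adapter. The sketch below is self-contained and hypothetical; `AgentResponse` and the adapter here are simplified stand-ins for the real classes.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Protocol

@dataclass
class AgentResponse:
    text: str
    latency_ms: float

class AgentProtocol(Protocol):
    """Structural interface: any object with a matching invoke() conforms."""
    async def invoke(self, prompt: str) -> AgentResponse: ...

class EchoAdapter:
    """Toy adapter standing in for a real HTTP or Python agent."""
    async def invoke(self, prompt: str) -> AgentResponse:
        start = time.perf_counter()
        text = prompt.upper()  # stand-in for a real agent call
        return AgentResponse(text, (time.perf_counter() - start) * 1000)

async def probe(agent: AgentProtocol) -> AgentResponse:
    # Caller depends only on the protocol, not on any concrete adapter
    return await agent.invoke("hello")

print(asyncio.run(probe(EchoAdapter())).text)  # → HELLO
```

Because `probe()` only sees `AgentProtocol`, the same test logic runs unchanged against HTTP, Python, or mock agents.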
orchestrator.py - Test Orchestration
Location: src/flakestorm/core/orchestrator.py
Purpose: Coordinates the entire testing process: mutation generation, parallel test execution, and result aggregation.
Key Components:
```python
class Orchestrator:
    """Main orchestration class."""

    async def run(self) -> TestResults:
        """Execute the full test suite."""
        # 1. Generate mutations for all golden prompts
        # 2. Run mutations sequentially (open-source version)
        # 3. Verify responses against invariants
        # 4. Aggregate and score results
        # 5. Return comprehensive results
```
Execution Flow:
```
run()
├─► _generate_mutations()      # Create adversarial inputs
│   └─► MutationEngine.generate_mutations()
│
├─► _run_mutations()           # Execute tests in parallel
│   ├─► Semaphore(concurrency)
│   └─► _run_single_mutation()
│       ├─► agent.invoke(mutated_prompt)
│       └─► verifier.verify(response)
│
└─► _aggregate_results()       # Calculate statistics
    └─► calculate_statistics()
```
Design Analysis:
✅ Strengths:
- Async/await for efficient I/O-bound operations
- Semaphore controls concurrency to prevent overwhelming the agent
- Progress tracking with Rich for user feedback
- Clean separation between generation, execution, and verification
⚠️ Considerations:
- All mutations held in memory - could be memory-intensive for large runs
- No checkpointing - failed runs restart from beginning
Why This Design: Async orchestration was chosen because:
- Agent calls are I/O-bound, not CPU-bound
- Parallelism improves test throughput significantly
- Semaphore pattern is standard for rate limiting
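The semaphore pattern mentioned above can be sketched in a few lines. Function names here are illustrative, not the orchestrator's actual internals; `asyncio.sleep(0)` stands in for the agent call.

```python
import asyncio

async def run_mutations(prompts: list[str], concurrency: int = 2) -> list[int]:
    # Semaphore caps the number of in-flight agent calls so the
    # target agent is not overwhelmed.
    sem = asyncio.Semaphore(concurrency)

    async def run_one(prompt: str) -> int:
        async with sem:
            await asyncio.sleep(0)  # stand-in for agent.invoke + verify
            return len(prompt)

    # gather() preserves input order regardless of completion order
    return await asyncio.gather(*(run_one(p) for p in prompts))

print(asyncio.run(run_mutations(["a", "bb", "ccc"])))  # → [1, 2, 3]
```

Raising `concurrency` trades agent load for throughput; `gather()` keeps results aligned with their mutations.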
performance.py - Rust/Python Bridge
Location: src/flakestorm/core/performance.py
Purpose: Provides high-performance implementations of compute-intensive operations using Rust, with pure Python fallbacks.
Key Functions:
```python
def is_rust_available() -> bool:
    """Check if Rust extension is installed."""

def calculate_robustness_score(...) -> float:
    """Calculate weighted robustness score."""
    # Uses Rust if available, else Python

def levenshtein_distance(s1, s2) -> int:
    """Fast string edit distance calculation."""
    # 88x faster in Rust vs Python

def string_similarity(s1, s2) -> float:
    """Calculate string similarity ratio."""
```
Performance Comparison:
| Function | Python Time | Rust Time | Speedup |
|---|---|---|---|
| Levenshtein (5000 iter) | 5864ms | 67ms | 88x |
| Robustness Score | 0.5ms | 0.01ms | 50x |
| String Similarity | 1.2ms | 0.02ms | 60x |
Design Analysis:
✅ Strengths:
- Graceful fallback if Rust not available
- Same API regardless of implementation
- Significant performance improvement for scoring
⚠️ Considerations:
- Requires Rust toolchain for compilation
- Binary compatibility across platforms
Why This Design: The bridge pattern was chosen because:
- Pure Python works everywhere (easy installation)
- Rust acceleration for production (performance)
- Same tests validate both implementations
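The fallback mechanism typically amounts to a try/except around the extension import. The module name `flakestorm_rust` below is an assumption for illustration; the fallback shown is a standard Wagner-Fischer edit distance, not necessarily the codebase's exact implementation.

```python
try:
    from flakestorm_rust import levenshtein_distance  # hypothetical extension name
except ImportError:
    def levenshtein_distance(s1: str, s2: str) -> int:
        """Pure-Python fallback (Wagner-Fischer dynamic programming)."""
        prev = list(range(len(s2) + 1))
        for i, c1 in enumerate(s1, 1):
            cur = [i]
            for j, c2 in enumerate(s2, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (c1 != c2)))    # substitution
            prev = cur
        return prev[-1]

print(levenshtein_distance("kitten", "sitting"))  # → 3
```

Callers import one name and get whichever implementation is available, which is what lets the same test suite exercise both paths.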
Mutation Modules
types.py - Mutation Types
Location: src/flakestorm/mutations/types.py
Purpose: Defines the types of adversarial mutations and their data structures.
Key Components:
```python
class MutationType(str, Enum):
    """Types of adversarial mutations."""
    PARAPHRASE = "paraphrase"                      # Same meaning, different words
    NOISE = "noise"                                # Typos and errors
    TONE_SHIFT = "tone_shift"                      # Different emotional tone
    PROMPT_INJECTION = "prompt_injection"          # Jailbreak attempts
    ENCODING_ATTACKS = "encoding_attacks"          # Encoded inputs
    CONTEXT_MANIPULATION = "context_manipulation"  # Context changes
    LENGTH_EXTREMES = "length_extremes"            # Edge case lengths
    CUSTOM = "custom"                              # User-defined templates
```
The 8 Core Mutation Types:
- PARAPHRASE (Weight: 1.0)
  - What it tests: Semantic understanding - can the agent handle different wording?
  - How it works: LLM rewrites the prompt using synonyms and alternative phrasing while preserving intent
  - Why essential: Users express the same intent in many ways. Agents must understand meaning, not just keywords.
  - Template strategy: Instructs LLM to use completely different words while keeping exact same meaning
- NOISE (Weight: 0.8)
  - What it tests: Typo tolerance - can the agent handle user errors?
  - How it works: LLM adds realistic typos (swapped letters, missing letters, abbreviations)
  - Why essential: Real users make typos, especially on mobile. Robust agents must handle common errors gracefully.
  - Template strategy: Simulates realistic typing errors as if typed quickly on a phone
- TONE_SHIFT (Weight: 0.9)
  - What it tests: Emotional resilience - can the agent handle frustrated users?
  - How it works: LLM rewrites with urgency, impatience, and slight aggression
  - Why essential: Users get impatient. Agents must maintain quality even under stress.
  - Template strategy: Adds words like "NOW", "HURRY", "ASAP" and frustration phrases
- PROMPT_INJECTION (Weight: 1.5)
  - What it tests: Security - can the agent resist manipulation?
  - How it works: LLM adds injection attempts like "ignore previous instructions"
  - Why essential: Attackers try to manipulate agents. Security is non-negotiable.
  - Template strategy: Keeps original request but adds injection techniques after it
- ENCODING_ATTACKS (Weight: 1.3)
  - What it tests: Parser robustness - can the agent handle encoded inputs?
  - How it works: LLM transforms prompt using Base64, Unicode escapes, or URL encoding
  - Why essential: Attackers use encoding to bypass filters. Agents must decode correctly.
  - Template strategy: Instructs LLM to use various encoding techniques (Base64, Unicode, URL)
- CONTEXT_MANIPULATION (Weight: 1.1)
  - What it tests: Context extraction - can the agent find intent in noisy context?
  - How it works: LLM adds irrelevant information, removes key context, or reorders structure
  - Why essential: Real conversations include irrelevant information. Agents must extract the core request.
  - Template strategy: Adds/removes/reorders context while keeping core request ambiguous
- LENGTH_EXTREMES (Weight: 1.2)
  - What it tests: Edge cases - can the agent handle empty or very long inputs?
  - How it works: LLM creates minimal versions (removing non-essential words) or very long versions (expanding with repetition)
  - Why essential: Real inputs vary wildly in length. Agents must handle boundaries.
  - Template strategy: Creates extremely short or extremely long versions to test token limits
- CUSTOM (Weight: 1.0)
  - What it tests: Domain-specific scenarios
  - How it works: User provides a custom template with a {prompt} placeholder
  - Why essential: Every domain has unique failure modes. Custom mutations let you test them.
  - Template strategy: Applies user-defined transformation instructions
Mutation Philosophy:
The 8 mutation types are designed to cover different failure modes:
- Semantic Robustness: PARAPHRASE, CONTEXT_MANIPULATION test understanding
- Input Robustness: NOISE, ENCODING_ATTACKS, LENGTH_EXTREMES test parsing
- Security: PROMPT_INJECTION, ENCODING_ATTACKS test resistance to attacks
- User Experience: TONE_SHIFT, NOISE, CONTEXT_MANIPULATION test real-world usage
Together, they provide comprehensive coverage of agent failure modes.
```python
@dataclass
class Mutation:
    """A single mutation of a golden prompt."""
    original: str       # Original prompt
    mutated: str        # Mutated version
    type: MutationType  # Type of mutation
    weight: float       # Scoring weight
    metadata: dict      # Additional info

    @property
    def id(self) -> str:
        """Unique hash for this mutation."""
        return hashlib.md5(..., usedforsecurity=False)

    def is_valid(self) -> bool:
        """Validates mutation, with special handling for LENGTH_EXTREMES."""
        # LENGTH_EXTREMES may intentionally create empty or very long strings
```
Design Analysis:
✅ Strengths:
- Enum prevents invalid mutation types
- Dataclass provides clean, typed structure
- Built-in weight scoring for weighted results
- Special validation logic for edge cases (LENGTH_EXTREMES)
Why This Design: String enum was chosen because:
- Values serialize directly to YAML/JSON
- Type checking catches typos
- Easy to extend with new types
- All 8 types work together to provide comprehensive testing coverage
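To illustrate how the per-type weights above could feed into a weighted pass rate, consider the sketch below. This is purely illustrative arithmetic; the actual scoring lives in the scoring modules and may combine results differently.

```python
# (weight, passed) pairs for three hypothetical mutation results
results = [
    (1.0, True),   # paraphrase passed
    (0.8, True),   # noise passed
    (1.5, False),  # prompt_injection failed (weighted heaviest)
]

# Weighted pass rate: passed weight / total weight
score = sum(w for w, passed in results if passed) / sum(w for w, _ in results)
print(round(score, 3))  # → 0.545
```

Note how the failed injection (weight 1.5) drags the score well below the raw 2/3 pass rate: security-relevant failures cost more.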
engine.py - Mutation Generation
Location: src/flakestorm/mutations/engine.py
Purpose: Generates adversarial mutations using a local LLM (Ollama/Qwen).
Key Components:
```python
class MutationEngine:
    """Engine for generating adversarial mutations."""

    def __init__(self, config: LLMConfig):
        self.client = ollama.AsyncClient(host=config.host)
        self.model = config.model

    async def generate_mutations(
        self,
        prompt: str,
        types: list[MutationType],
        count: int,
    ) -> list[Mutation]:
        """Generate multiple mutations for a prompt."""
```
Generation Flow:
```
generate_mutations(prompt, types, count)
│
├─► For each mutation type:
│   ├─► Get template from templates.py
│   ├─► Format with original prompt
│   └─► Call Ollama API
│
├─► Parse LLM responses
│   └─► Extract mutated prompts
│
└─► Create Mutation objects
    └─► Assign difficulty weights
```
Design Analysis:
✅ Strengths:
- Async API calls for parallel generation
- Local LLM (no API costs, no data leakage)
- Customizable templates per mutation type
⚠️ Considerations:
- Depends on Ollama being installed and running
- LLM output parsing can be fragile
- Model quality affects mutation quality
Why This Design: Local LLM was chosen over cloud APIs because:
- Zero cost at scale
- No rate limits
- Privacy - prompts stay local
- Works offline
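The "get template, format, send" step of the flow above reduces to ordinary string formatting. The template text below is illustrative only; the real prompts live in `templates.py` and are considerably more detailed.

```python
# Hypothetical, abbreviated templates; real ones live in templates.py
TEMPLATES = {
    "paraphrase": "Rewrite with different words, same meaning:\n{prompt}",
    "noise": "Add realistic typos:\n{prompt}",
}

def build_request(mutation_type: str, prompt: str) -> str:
    # Insert the golden prompt into the type-specific template
    return TEMPLATES[mutation_type].format(prompt=prompt)

print(build_request("noise", "Cancel my order"))
```

The formatted string is what gets sent to the Ollama API; the model's reply is then parsed back into a `Mutation` object.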
Assertion Modules
deterministic.py - Rule-Based Checks
Location: src/flakestorm/assertions/deterministic.py
Purpose: Implements deterministic, rule-based assertions that check responses against exact criteria.
Key Checkers:
```python
class ContainsChecker(BaseChecker):
    """Check if response contains a value."""

class NotContainsChecker(BaseChecker):
    """Check if response does NOT contain a value."""

class RegexChecker(BaseChecker):
    """Check if response matches a regex pattern."""

class LatencyChecker(BaseChecker):
    """Check if response time is within limit."""

class ValidJsonChecker(BaseChecker):
    """Check if response is valid JSON."""
```
Design Analysis:
✅ Strengths:
- Fast execution (no AI/ML involved)
- Predictable, reproducible results
- Easy to debug failures
Why This Design: Checker pattern with registry allows:
- Easy addition of new check types
- Configuration-driven selection
- Consistent error reporting
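To make the checker pattern concrete, here are dependency-free sketches of two of the checks above, written as plain functions rather than the actual `BaseChecker` subclasses; the `CheckResult` shape is an assumption for illustration.

```python
import json
import re
from dataclasses import dataclass

@dataclass
class CheckResult:
    passed: bool
    reason: str = ""

def regex_check(response: str, pattern: str) -> CheckResult:
    # Passes if the pattern matches anywhere in the response
    ok = re.search(pattern, response) is not None
    return CheckResult(ok, "" if ok else f"no match for {pattern!r}")

def valid_json_check(response: str) -> CheckResult:
    # Passes if the whole response parses as JSON
    try:
        json.loads(response)
        return CheckResult(True)
    except ValueError as exc:
        return CheckResult(False, str(exc))

print(regex_check("order #1234", r"#\d+").passed)  # → True
print(valid_json_check("{bad").passed)             # → False
```

Each checker returns the same result shape, which is what makes error reporting consistent across check types.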
semantic.py - AI-Based Checks
Location: src/flakestorm/assertions/semantic.py
Purpose: Implements semantic assertions using embeddings for meaning-based comparison.
Key Components:
```python
class LocalEmbedder:
    """Local sentence embeddings using sentence-transformers."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def embed(self, text: str) -> np.ndarray:
        return self.model.encode(text)

    def similarity(self, text1: str, text2: str) -> float:
        emb1, emb2 = self.embed(text1), self.embed(text2)
        return cosine_similarity(emb1, emb2)

class SimilarityChecker(BaseChecker):
    """Check semantic similarity to expected response."""

    def check(self, response: str, latency_ms: float) -> CheckResult:
        similarity = self.embedder.similarity(response, expected)
        return CheckResult(passed=similarity >= threshold)
```
Design Analysis:
✅ Strengths:
- Catches semantic equivalence (not just string matching)
- Lazy loading of heavy ML models
- Configurable similarity thresholds
⚠️ Considerations:
- Requires sentence-transformers (optional dependency)
- First call is slow (model loading)
- Memory usage for embedding model
Why This Design: sentence-transformers was chosen because:
- High-quality embeddings out of the box
- Runs locally (no API calls)
- Small model size (all-MiniLM-L6-v2 is 23MB)
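The similarity check ultimately reduces to "cosine similarity ≥ threshold". The sketch below shows that comparison with a crude bag-of-words vector in place of the MiniLM embedding, purely to illustrate the mechanics; the real checker uses sentence-transformers embeddings, which capture meaning far better than word counts.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    # Bag-of-words cosine similarity: a stand-in for embedding similarity
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

sim = bow_cosine("refund my order", "please refund my order")
threshold = 0.85
print(sim >= threshold)  # → True
```

The embedding version has the same shape: embed both texts, take the cosine, compare against a configurable threshold.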
safety.py - Security Checks
Location: src/flakestorm/assertions/safety.py
Purpose: Implements security-focused assertions for PII detection and refusal verification.
Key Checkers:
```python
class ExcludesPIIChecker(BaseChecker):
    """Check that response doesn't contain PII."""
    PII_PATTERNS = [
        r'\b\d{3}-\d{2}-\d{4}\b',        # SSN
        r'\b\d{16}\b',                   # Credit card
        r'\b[A-Za-z0-9._%+-]+@...',      # Email
        r'\b\d{3}[-.]?\d{3}[-.]?\d{4}',  # Phone
    ]

class RefusalChecker(BaseChecker):
    """Check that agent refuses dangerous requests."""
    REFUSAL_PHRASES = [
        "I cannot", "I'm unable to", "I won't",
        "against my guidelines", "not appropriate",
    ]
```
Design Analysis:
✅ Strengths:
- Essential for production safety
- Regex-based PII detection is fast
- Catches common refusal patterns
⚠️ Considerations:
- PII patterns may miss edge cases
- Refusal detection is heuristic-based
Why This Design: Pattern-based detection was chosen because:
- Fast and deterministic
- No false positives from ML
- Easy to audit and extend
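A minimal detector built on the SSN and phone patterns shown above looks like this (the email pattern is elided in the listing, so it is omitted here; the function name is illustrative):

```python
import re

# SSN and phone patterns as listed above
PII_PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}",
}

def contains_pii(text: str) -> bool:
    # Flag the response if any pattern matches anywhere in it
    return any(re.search(p, text) for p in PII_PATTERNS.values())

print(contains_pii("My SSN is 123-45-6789"))  # → True
print(contains_pii("All good here"))          # → False
```

Extending coverage is a matter of adding a pattern to the dict, which is exactly the auditability benefit claimed above.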
Reporting Modules
models.py - Data Structures
Location: src/flakestorm/reports/models.py
Purpose: Defines data structures for test results and reports.
Key Models:
```python
@dataclass
class MutationResult:
    """Result of testing a single mutation."""
    mutation: Mutation
    response: str
    latency_ms: float
    passed: bool
    checks: list[CheckResult]

@dataclass
class TestResults:
    """Complete test run results."""
    config: FlakeStormConfig
    mutations: list[MutationResult]
    statistics: TestStatistics
    timestamp: datetime
```
html.py - HTML Report Generation
Location: src/flakestorm/reports/html.py
Purpose: Generates interactive HTML reports with visualizations.
Key Features:
- Embedded CSS (no external dependencies)
- Pass/fail grid visualization
- Latency charts
- Failure details with expandable sections
- Mobile-responsive design
Design Analysis:
✅ Strengths:
- Self-contained HTML (single file, works offline)
- No JavaScript framework dependencies
- Professional appearance
CLI Module
main.py - Command-Line Interface
Location: src/flakestorm/cli/main.py
Purpose: Provides the flakestorm command-line tool using Typer.
Commands:
```
flakestorm init    # Create config file
flakestorm run     # Run tests
flakestorm verify  # Validate config
flakestorm report  # Generate report from JSON
flakestorm score   # Show score from results
```
Design Analysis:
✅ Strengths:
- Typer provides automatic help generation
- Rich integration for beautiful output
- Consistent exit codes for CI
Rust Performance Module
Location: rust/src/
Components:
| File | Purpose |
|---|---|
| `lib.rs` | PyO3 bindings and main functions |
| `scoring.rs` | Statistics calculation algorithms |
| `parallel.rs` | Rayon-based parallel processing |
Key Functions:
```rust
#[pyfunction]
fn calculate_robustness_score(
    semantic_passed: u32,
    deterministic_passed: u32,
    total: u32,
    semantic_weight: f64,
    deterministic_weight: f64,
) -> f64

#[pyfunction]
fn levenshtein_distance(s1: &str, s2: &str) -> usize

#[pyfunction]
fn string_similarity(s1: &str, s2: &str) -> f64
```
Design Analysis:
✅ Strengths:
- PyO3 provides seamless Python integration
- Rayon enables easy parallelism
- Comprehensive test suite
Design Analysis
Overall Architecture Assessment
Strengths:
- Modularity: Clear separation of concerns makes code maintainable
- Extensibility: Easy to add new mutation types, checkers, adapters
- Type Safety: Pydantic and type hints catch errors early
- Performance: Rust acceleration where it matters
- Usability: Rich CLI with progress bars and beautiful output
Areas for Improvement:
- Memory Usage: Large test runs keep all results in memory
- Checkpointing: No resume capability for interrupted runs
- Distributed Execution: Single-machine only
Performance Characteristics
| Operation | Complexity | Bottleneck |
|---|---|---|
| Mutation Generation | O(n*m) | LLM inference |
| Test Execution | O(n) | Agent response time |
| Scoring | O(n) | CPU (optimized with Rust) |
| Report Generation | O(n) | I/O |
Where n = number of mutations, m = mutation types.
Security Considerations
- Secrets Management: Environment variable expansion keeps secrets out of config files
- Local LLM: No data sent to external APIs
- PII Detection: Built-in checks for sensitive data
- Injection Testing: Helps harden agents against attacks
This documentation reflects the current implementation. Always refer to the source code for the most up-to-date information.