flakestorm Module Documentation
This document provides a comprehensive explanation of each module in the flakestorm codebase, what it does, how it works, and analysis of its design decisions.
Table of Contents
- Architecture Overview
- Core Modules
- Mutation Modules
- Assertion Modules
- Reporting Modules
- CLI Module
- Rust Performance Module
- Design Analysis
Architecture Overview
```
flakestorm/
├── core/                   # Core orchestration logic
│   ├── config.py           # Configuration (V1 + V2: chaos, contract, replays, scoring)
│   ├── protocol.py         # Agent adapters, create_instrumented_adapter (chaos interceptor)
│   ├── orchestrator.py     # Main test coordination
│   ├── runner.py           # High-level test runner
│   └── performance.py      # Rust/Python bridge
├── chaos/                  # V2 environment chaos
│   ├── context_attacks.py  # memory_poisoning (input before invoke), indirect_injection, normalize_context_attacks
│   ├── interceptor.py      # ChaosInterceptor: memory_poisoning + LLM faults (timeout before call, others after)
│   ├── faults.py           # should_trigger, tool/LLM fault application
│   ├── llm_proxy.py        # apply_llm_fault (truncated, empty, garbage, rate_limit, response_drift)
│   └── profiles/           # Built-in chaos profiles
├── contracts/              # V2 behavioral contracts
│   ├── engine.py           # ContractEngine: (invariant × scenario) cells, reset, probes, behavior_unchanged
│   └── matrix.py           # ResilienceMatrix
├── replay/                 # V2 replay regression
│   ├── loader.py           # Load replay sessions (file or inline)
│   └── runner.py           # Replay execution
├── mutations/              # Adversarial input generation (22+ types, max 50/run OSS)
│   ├── types.py            # MutationType enum
│   ├── templates.py        # LLM prompt templates
│   └── engine.py           # Mutation generation engine
├── assertions/             # Response validation
│   ├── deterministic.py    # Rule-based assertions
│   ├── semantic.py         # AI-based assertions
│   ├── safety.py           # Security assertions
│   └── verifier.py         # InvariantVerifier (all invariant types including behavior_unchanged)
├── reports/                # Output generation
│   ├── models.py           # Report data models
│   ├── html.py             # HTML report generator
│   ├── json_export.py      # JSON export
│   └── terminal.py         # Terminal output
├── cli/                    # Command-line interface
│   └── main.py             # flakestorm run, contract run, replay run, ci
└── integrations/           # External integrations
    ├── huggingface.py      # HuggingFace model support
    └── embeddings.py       # Local embeddings
```
Core Modules
config.py - Configuration Management
Location: src/flakestorm/core/config.py
Purpose: Handles loading, validating, and providing type-safe access to the flakestorm.yaml configuration file.
Key Components:
```python
class AgentConfig(BaseModel):
    """Configuration for connecting to the target agent."""
    endpoint: str               # Agent URL or Python module path
    type: AgentType             # http, python, or langchain
    timeout: int = 30000        # Request timeout (ms)
    headers: dict = {}          # HTTP headers
    request_template: str       # How to format requests
    response_path: str          # JSONPath to extract response
    # V2: state isolation for contract matrix
    reset_endpoint: str | None  # HTTP POST URL called before each cell
    reset_function: str | None  # Python path, e.g. myagent:reset_state

class FlakeStormConfig(BaseModel):
    """Root configuration model."""
    version: str = "1.0"             # 1.0 | 2.0
    agent: AgentConfig
    golden_prompts: list[str]
    mutations: MutationConfig        # count max 50 in OSS; 22+ mutation types
    model: ModelConfig               # api_key env-only in V2
    invariants: list[InvariantConfig]
    output: OutputConfig
    advanced: AdvancedConfig
    # V2 optional
    chaos: ChaosConfig | None        # tool_faults, llm_faults, context_attacks (list or dict)
    contract: ContractConfig | None  # invariants + chaos_matrix (scenarios can have context_attacks)
    chaos_matrix: list[ChaosScenarioConfig] | None  # when not using contract.chaos_matrix
    replays: ReplayConfig | None     # sessions (file or inline), sources (LangSmith)
    scoring: ScoringConfig | None    # mutation, chaos, contract, replay weights (must sum to 1.0)
```
Key Functions:
| Function | Purpose |
|---|---|
| `load_config(path)` | Load and validate YAML config file |
| `expand_env_vars()` | Replace `${VAR}` with environment values |
| `validate_config()` | Run Pydantic validation |
Design Analysis:
✅ Strengths:
- Uses Pydantic for robust validation with clear error messages
- Environment variable expansion for secrets management
- Type safety prevents runtime configuration errors
- Default values reduce required configuration
⚠️ Considerations:
- Large config model - could be split into smaller files for maintainability
- No schema versioning - future config changes need migration support
Why This Design: Pydantic was chosen over alternatives (dataclasses, attrs) because:
- Built-in YAML/JSON serialization
- Automatic validation with descriptive errors
- Environment variable support
- Wide ecosystem adoption
protocol.py - Agent Adapters
Location: src/flakestorm/core/protocol.py
Purpose: Provides a unified interface for communicating with different types of AI agents (HTTP APIs, Python functions, LangChain).
Key Components:
```python
class AgentProtocol(Protocol):
    """Protocol that all agent adapters must implement."""

    async def invoke(self, prompt: str) -> AgentResponse:
        """Send prompt to agent and return response."""
        ...

class HTTPAgentAdapter(BaseAgentAdapter):
    """Adapter for HTTP-based agents."""

    async def invoke(self, prompt: str) -> AgentResponse:
        # 1. Format request using template
        # 2. Send HTTP POST with headers
        # 3. Extract response using JSONPath
        # 4. Return with latency measurement

class PythonAgentAdapter(BaseAgentAdapter):
    """Adapter for Python function agents."""

    async def invoke(self, prompt: str) -> AgentResponse:
        # 1. Import the specified module
        # 2. Call the function with prompt
        # 3. Return response with timing
```
Design Analysis:
✅ Strengths:
- Protocol pattern allows easy extension for new agent types
- Async-first design for efficient parallel testing
- Built-in latency measurement for performance tracking
- Retry logic handles transient failures
⚠️ Considerations:
- HTTP adapter assumes JSON request/response format
- Python adapter uses dynamic import which can be security-sensitive
Why This Design: The adapter pattern was chosen because:
- Decouples test logic from agent communication
- Easy to add new agent types without modifying core
- Allows mocking for unit tests
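The mocking benefit follows directly from the protocol: any object with a matching `invoke()` satisfies it, so tests can swap in a toy adapter. The sketch below is self-contained and hypothetical; `AgentResponse` and the adapter here are simplified stand-ins for the real classes.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Protocol

@dataclass
class AgentResponse:
    text: str
    latency_ms: float

class AgentProtocol(Protocol):
    """Structural interface: any object with a matching invoke() conforms."""
    async def invoke(self, prompt: str) -> AgentResponse: ...

class EchoAdapter:
    """Toy adapter standing in for a real HTTP or Python agent."""
    async def invoke(self, prompt: str) -> AgentResponse:
        start = time.perf_counter()
        text = prompt.upper()  # stand-in for a real agent call
        return AgentResponse(text, (time.perf_counter() - start) * 1000)

async def probe(agent: AgentProtocol) -> AgentResponse:
    # Caller depends only on the protocol, not on any concrete adapter
    return await agent.invoke("hello")

print(asyncio.run(probe(EchoAdapter())).text)  # → HELLO
```

Because `probe()` only sees `AgentProtocol`, the same test logic runs unchanged against HTTP, Python, or mock agents.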
orchestrator.py - Test Orchestration
Location: src/flakestorm/core/orchestrator.py
Purpose: Coordinates the entire testing process: mutation generation, parallel test execution, and result aggregation.
Key Components:
```python
class Orchestrator:
    """Main orchestration class."""

    async def run(self) -> TestResults:
        """Execute the full test suite."""
        # 1. Generate mutations for all golden prompts
        # 2. Run mutations sequentially (open-source version)
        # 3. Verify responses against invariants
        # 4. Aggregate and score results
        # 5. Return comprehensive results
```
Execution Flow:
```
run()
├─► _generate_mutations()      # Create adversarial inputs
│   └─► MutationEngine.generate_mutations()
│
├─► _run_mutations()           # Execute tests in parallel
│   ├─► Semaphore(concurrency)
│   └─► _run_single_mutation()
│       ├─► agent.invoke(mutated_prompt)
│       └─► verifier.verify(response)
│
└─► _aggregate_results()       # Calculate statistics
    └─► calculate_statistics()
```
Design Analysis:
✅ Strengths:
- Async/await for efficient I/O-bound operations
- Semaphore controls concurrency to prevent overwhelming the agent
- Progress tracking with Rich for user feedback
- Clean separation between generation, execution, and verification
⚠️ Considerations:
- All mutations held in memory - could be memory-intensive for large runs
- No checkpointing - failed runs restart from beginning
Why This Design: Async orchestration was chosen because:
- Agent calls are I/O-bound, not CPU-bound
- Parallelism improves test throughput significantly
- Semaphore pattern is standard for rate limiting
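The semaphore pattern mentioned above can be sketched in a few lines. Function names here are illustrative, not the orchestrator's actual internals; `asyncio.sleep(0)` stands in for the agent call.

```python
import asyncio

async def run_mutations(prompts: list[str], concurrency: int = 2) -> list[int]:
    # Semaphore caps the number of in-flight agent calls so the
    # target agent is not overwhelmed.
    sem = asyncio.Semaphore(concurrency)

    async def run_one(prompt: str) -> int:
        async with sem:
            await asyncio.sleep(0)  # stand-in for agent.invoke + verify
            return len(prompt)

    # gather() preserves input order regardless of completion order
    return await asyncio.gather(*(run_one(p) for p in prompts))

print(asyncio.run(run_mutations(["a", "bb", "ccc"])))  # → [1, 2, 3]
```

Raising `concurrency` trades agent load for throughput; `gather()` keeps results aligned with their mutations.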
performance.py - Rust/Python Bridge
Location: src/flakestorm/core/performance.py
Purpose: Provides high-performance implementations of compute-intensive operations using Rust, with pure Python fallbacks.
Key Functions:
```python
def is_rust_available() -> bool:
    """Check if Rust extension is installed."""

def calculate_robustness_score(...) -> float:
    """Calculate weighted robustness score."""
    # Uses Rust if available, else Python

def levenshtein_distance(s1, s2) -> int:
    """Fast string edit distance calculation."""
    # 88x faster in Rust vs Python

def string_similarity(s1, s2) -> float:
    """Calculate string similarity ratio."""
```
Performance Comparison:
| Function | Python Time | Rust Time | Speedup |
|---|---|---|---|
| Levenshtein (5000 iter) | 5864ms | 67ms | 88x |
| Robustness Score | 0.5ms | 0.01ms | 50x |
| String Similarity | 1.2ms | 0.02ms | 60x |
Design Analysis:
✅ Strengths:
- Graceful fallback if Rust not available
- Same API regardless of implementation
- Significant performance improvement for scoring
⚠️ Considerations:
- Requires Rust toolchain for compilation
- Binary compatibility across platforms
Why This Design: The bridge pattern was chosen because:
- Pure Python works everywhere (easy installation)
- Rust acceleration for production (performance)
- Same tests validate both implementations
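The fallback mechanism typically amounts to a try/except around the extension import. The module name `flakestorm_rust` below is an assumption for illustration; the fallback shown is a standard Wagner-Fischer edit distance, not necessarily the codebase's exact implementation.

```python
try:
    from flakestorm_rust import levenshtein_distance  # hypothetical extension name
except ImportError:
    def levenshtein_distance(s1: str, s2: str) -> int:
        """Pure-Python fallback (Wagner-Fischer dynamic programming)."""
        prev = list(range(len(s2) + 1))
        for i, c1 in enumerate(s1, 1):
            cur = [i]
            for j, c2 in enumerate(s2, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (c1 != c2)))    # substitution
            prev = cur
        return prev[-1]

print(levenshtein_distance("kitten", "sitting"))  # → 3
```

Callers import one name and get whichever implementation is available, which is what lets the same test suite exercise both paths.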
Mutation Modules
types.py - Mutation Types
Location: src/flakestorm/mutations/types.py
Purpose: Defines the types of adversarial mutations and their data structures.
Key Components:
```python
class MutationType(str, Enum):
    """Types of adversarial mutations."""
    PARAPHRASE = "paraphrase"                      # Same meaning, different words
    NOISE = "noise"                                # Typos and errors
    TONE_SHIFT = "tone_shift"                      # Different emotional tone
    PROMPT_INJECTION = "prompt_injection"          # Jailbreak attempts
    ENCODING_ATTACKS = "encoding_attacks"          # Encoded inputs
    CONTEXT_MANIPULATION = "context_manipulation"  # Context changes
    LENGTH_EXTREMES = "length_extremes"            # Edge case lengths
    CUSTOM = "custom"                              # User-defined templates
```
The 8 Core Mutation Types:
- PARAPHRASE (Weight: 1.0)
  - What it tests: Semantic understanding - can the agent handle different wording?
  - How it works: LLM rewrites the prompt using synonyms and alternative phrasing while preserving intent
  - Why essential: Users express the same intent in many ways. Agents must understand meaning, not just keywords.
  - Template strategy: Instructs LLM to use completely different words while keeping exact same meaning
- NOISE (Weight: 0.8)
  - What it tests: Typo tolerance - can the agent handle user errors?
  - How it works: LLM adds realistic typos (swapped letters, missing letters, abbreviations)
  - Why essential: Real users make typos, especially on mobile. Robust agents must handle common errors gracefully.
  - Template strategy: Simulates realistic typing errors as if typed quickly on a phone
- TONE_SHIFT (Weight: 0.9)
  - What it tests: Emotional resilience - can the agent handle frustrated users?
  - How it works: LLM rewrites with urgency, impatience, and slight aggression
  - Why essential: Users get impatient. Agents must maintain quality even under stress.
  - Template strategy: Adds words like "NOW", "HURRY", "ASAP" and frustration phrases
- PROMPT_INJECTION (Weight: 1.5)
  - What it tests: Security - can the agent resist manipulation?
  - How it works: LLM adds injection attempts like "ignore previous instructions"
  - Why essential: Attackers try to manipulate agents. Security is non-negotiable.
  - Template strategy: Keeps original request but adds injection techniques after it
- ENCODING_ATTACKS (Weight: 1.3)
  - What it tests: Parser robustness - can the agent handle encoded inputs?
  - How it works: LLM transforms prompt using Base64, Unicode escapes, or URL encoding
  - Why essential: Attackers use encoding to bypass filters. Agents must decode correctly.
  - Template strategy: Instructs LLM to use various encoding techniques (Base64, Unicode, URL)
- CONTEXT_MANIPULATION (Weight: 1.1)
  - What it tests: Context extraction - can the agent find intent in noisy context?
  - How it works: LLM adds irrelevant information, removes key context, or reorders structure
  - Why essential: Real conversations include irrelevant information. Agents must extract the core request.
  - Template strategy: Adds/removes/reorders context while keeping core request ambiguous
- LENGTH_EXTREMES (Weight: 1.2)
  - What it tests: Edge cases - can the agent handle empty or very long inputs?
  - How it works: LLM creates minimal versions (removing non-essential words) or very long versions (expanding with repetition)
  - Why essential: Real inputs vary wildly in length. Agents must handle boundaries.
  - Template strategy: Creates extremely short or extremely long versions to test token limits
- CUSTOM (Weight: 1.0)
  - What it tests: Domain-specific scenarios
  - How it works: User provides a custom template with a {prompt} placeholder
  - Why essential: Every domain has unique failure modes. Custom mutations let you test them.
  - Template strategy: Applies user-defined transformation instructions
Mutation Philosophy:
The 8 mutation types are designed to cover different failure modes:
- Semantic Robustness: PARAPHRASE, CONTEXT_MANIPULATION test understanding
- Input Robustness: NOISE, ENCODING_ATTACKS, LENGTH_EXTREMES test parsing
- Security: PROMPT_INJECTION, ENCODING_ATTACKS test resistance to attacks
- User Experience: TONE_SHIFT, NOISE, CONTEXT_MANIPULATION test real-world usage
Together, they provide comprehensive coverage of agent failure modes.
```python
@dataclass
class Mutation:
    """A single mutation of a golden prompt."""
    original: str       # Original prompt
    mutated: str        # Mutated version
    type: MutationType  # Type of mutation
    weight: float       # Scoring weight
    metadata: dict      # Additional info

    @property
    def id(self) -> str:
        """Unique hash for this mutation."""
        return hashlib.md5(..., usedforsecurity=False)

    def is_valid(self) -> bool:
        """Validates mutation, with special handling for LENGTH_EXTREMES."""
        # LENGTH_EXTREMES may intentionally create empty or very long strings
```
Design Analysis:
✅ Strengths:
- Enum prevents invalid mutation types
- Dataclass provides clean, typed structure
- Built-in weight scoring for weighted results
- Special validation logic for edge cases (LENGTH_EXTREMES)
Why This Design: String enum was chosen because:
- Values serialize directly to YAML/JSON
- Type checking catches typos
- Easy to extend with new types
- All 8 types work together to provide comprehensive testing coverage
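To illustrate how the per-type weights above could feed into a weighted pass rate, consider the sketch below. This is purely illustrative arithmetic; the actual scoring lives in the scoring modules and may combine results differently.

```python
# (weight, passed) pairs for three hypothetical mutation results
results = [
    (1.0, True),   # paraphrase passed
    (0.8, True),   # noise passed
    (1.5, False),  # prompt_injection failed (weighted heaviest)
]

# Weighted pass rate: passed weight / total weight
score = sum(w for w, passed in results if passed) / sum(w for w, _ in results)
print(round(score, 3))  # → 0.545
```

Note how the failed injection (weight 1.5) drags the score well below the raw 2/3 pass rate: security-relevant failures cost more.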
engine.py - Mutation Generation
Location: src/flakestorm/mutations/engine.py
Purpose: Generates adversarial mutations using a local LLM (Ollama/Qwen).
Key Components:
```python
class MutationEngine:
    """Engine for generating adversarial mutations."""

    def __init__(self, config: LLMConfig):
        self.client = ollama.AsyncClient(host=config.host)
        self.model = config.model

    async def generate_mutations(
        self,
        prompt: str,
        types: list[MutationType],
        count: int,
    ) -> list[Mutation]:
        """Generate multiple mutations for a prompt."""
```
Generation Flow:
```
generate_mutations(prompt, types, count)
│
├─► For each mutation type:
│   ├─► Get template from templates.py
│   ├─► Format with original prompt
│   └─► Call Ollama API
│
├─► Parse LLM responses
│   └─► Extract mutated prompts
│
└─► Create Mutation objects
    └─► Assign difficulty weights
```
Design Analysis:
✅ Strengths:
- Async API calls for parallel generation
- Local LLM (no API costs, no data leakage)
- Customizable templates per mutation type
⚠️ Considerations:
- Depends on Ollama being installed and running
- LLM output parsing can be fragile
- Model quality affects mutation quality
Why This Design: Local LLM was chosen over cloud APIs because:
- Zero cost at scale
- No rate limits
- Privacy - prompts stay local
- Works offline
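The "get template, format, send" step of the flow above reduces to ordinary string formatting. The template text below is illustrative only; the real prompts live in `templates.py` and are considerably more detailed.

```python
# Hypothetical, abbreviated templates; real ones live in templates.py
TEMPLATES = {
    "paraphrase": "Rewrite with different words, same meaning:\n{prompt}",
    "noise": "Add realistic typos:\n{prompt}",
}

def build_request(mutation_type: str, prompt: str) -> str:
    # Insert the golden prompt into the type-specific template
    return TEMPLATES[mutation_type].format(prompt=prompt)

print(build_request("noise", "Cancel my order"))
```

The formatted string is what gets sent to the Ollama API; the model's reply is then parsed back into a `Mutation` object.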
Assertion Modules
deterministic.py - Rule-Based Checks
Location: src/flakestorm/assertions/deterministic.py
Purpose: Implements deterministic, rule-based assertions that check responses against exact criteria.
Key Checkers:
```python
class ContainsChecker(BaseChecker):
    """Check if response contains a value."""

class NotContainsChecker(BaseChecker):
    """Check if response does NOT contain a value."""

class RegexChecker(BaseChecker):
    """Check if response matches a regex pattern."""

class LatencyChecker(BaseChecker):
    """Check if response time is within limit."""

class ValidJsonChecker(BaseChecker):
    """Check if response is valid JSON."""
```
Design Analysis:
✅ Strengths:
- Fast execution (no AI/ML involved)
- Predictable, reproducible results
- Easy to debug failures
Why This Design: Checker pattern with registry allows:
- Easy addition of new check types
- Configuration-driven selection
- Consistent error reporting
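To make the checker pattern concrete, here are dependency-free sketches of two of the checks above, written as plain functions rather than the actual `BaseChecker` subclasses; the `CheckResult` shape is an assumption for illustration.

```python
import json
import re
from dataclasses import dataclass

@dataclass
class CheckResult:
    passed: bool
    reason: str = ""

def regex_check(response: str, pattern: str) -> CheckResult:
    # Passes if the pattern matches anywhere in the response
    ok = re.search(pattern, response) is not None
    return CheckResult(ok, "" if ok else f"no match for {pattern!r}")

def valid_json_check(response: str) -> CheckResult:
    # Passes if the whole response parses as JSON
    try:
        json.loads(response)
        return CheckResult(True)
    except ValueError as exc:
        return CheckResult(False, str(exc))

print(regex_check("order #1234", r"#\d+").passed)  # → True
print(valid_json_check("{bad").passed)             # → False
```

Each checker returns the same result shape, which is what makes error reporting consistent across check types.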
semantic.py - AI-Based Checks
Location: src/flakestorm/assertions/semantic.py
Purpose: Implements semantic assertions using embeddings for meaning-based comparison.
Key Components:
```python
class LocalEmbedder:
    """Local sentence embeddings using sentence-transformers."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def embed(self, text: str) -> np.ndarray:
        return self.model.encode(text)

    def similarity(self, text1: str, text2: str) -> float:
        emb1, emb2 = self.embed(text1), self.embed(text2)
        return cosine_similarity(emb1, emb2)

class SimilarityChecker(BaseChecker):
    """Check semantic similarity to expected response."""

    def check(self, response: str, latency_ms: float) -> CheckResult:
        similarity = self.embedder.similarity(response, expected)
        return CheckResult(passed=similarity >= threshold)
```
Design Analysis:
✅ Strengths:
- Catches semantic equivalence (not just string matching)
- Lazy loading of heavy ML models
- Configurable similarity thresholds
⚠️ Considerations:
- Requires sentence-transformers (optional dependency)
- First call is slow (model loading)
- Memory usage for embedding model
Why This Design: sentence-transformers was chosen because:
- High-quality embeddings out of the box
- Runs locally (no API calls)
- Small model size (all-MiniLM-L6-v2 is 23MB)
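The similarity check ultimately reduces to "cosine similarity ≥ threshold". The sketch below shows that comparison with a crude bag-of-words vector in place of the MiniLM embedding, purely to illustrate the mechanics; the real checker uses sentence-transformers embeddings, which capture meaning far better than word counts.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    # Bag-of-words cosine similarity: a stand-in for embedding similarity
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

sim = bow_cosine("refund my order", "please refund my order")
threshold = 0.85
print(sim >= threshold)  # → True
```

The embedding version has the same shape: embed both texts, take the cosine, compare against a configurable threshold.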
safety.py - Security Checks
Location: src/flakestorm/assertions/safety.py
Purpose: Implements security-focused assertions for PII detection and refusal verification.
Key Checkers:
```python
class ExcludesPIIChecker(BaseChecker):
    """Check that response doesn't contain PII."""
    PII_PATTERNS = [
        r'\b\d{3}-\d{2}-\d{4}\b',        # SSN
        r'\b\d{16}\b',                   # Credit card
        r'\b[A-Za-z0-9._%+-]+@...',      # Email
        r'\b\d{3}[-.]?\d{3}[-.]?\d{4}',  # Phone
    ]

class RefusalChecker(BaseChecker):
    """Check that agent refuses dangerous requests."""
    REFUSAL_PHRASES = [
        "I cannot", "I'm unable to", "I won't",
        "against my guidelines", "not appropriate",
    ]
```
Design Analysis:
✅ Strengths:
- Essential for production safety
- Regex-based PII detection is fast
- Catches common refusal patterns
⚠️ Considerations:
- PII patterns may miss edge cases
- Refusal detection is heuristic-based
Why This Design: Pattern-based detection was chosen because:
- Fast and deterministic
- No false positives from ML
- Easy to audit and extend
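A minimal detector built on the SSN and phone patterns shown above looks like this (the email pattern is elided in the listing, so it is omitted here; the function name is illustrative):

```python
import re

# SSN and phone patterns as listed above
PII_PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}",
}

def contains_pii(text: str) -> bool:
    # Flag the response if any pattern matches anywhere in it
    return any(re.search(p, text) for p in PII_PATTERNS.values())

print(contains_pii("My SSN is 123-45-6789"))  # → True
print(contains_pii("All good here"))          # → False
```

Extending coverage is a matter of adding a pattern to the dict, which is exactly the auditability benefit claimed above.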
Reporting Modules
models.py - Data Structures
Location: src/flakestorm/reports/models.py
Purpose: Defines data structures for test results and reports.
Key Models:
```python
@dataclass
class MutationResult:
    """Result of testing a single mutation."""
    mutation: Mutation
    response: str
    latency_ms: float
    passed: bool
    checks: list[CheckResult]

@dataclass
class TestResults:
    """Complete test run results."""
    config: FlakeStormConfig
    mutations: list[MutationResult]
    statistics: TestStatistics
    timestamp: datetime
```
html.py - HTML Report Generation
Location: src/flakestorm/reports/html.py
Purpose: Generates interactive HTML reports with visualizations.
Key Features:
- Embedded CSS (no external dependencies)
- Pass/fail grid visualization
- Latency charts
- Failure details with expandable sections
- Mobile-responsive design
Design Analysis:
✅ Strengths:
- Self-contained HTML (single file, works offline)
- No JavaScript framework dependencies
- Professional appearance
CLI Module
main.py - Command-Line Interface
Location: src/flakestorm/cli/main.py
Purpose: Provides the flakestorm command-line tool using Typer.
Commands:
```
flakestorm init    # Create config file
flakestorm run     # Run tests
flakestorm verify  # Validate config
flakestorm report  # Generate report from JSON
flakestorm score   # Show score from results
```
Design Analysis:
✅ Strengths:
- Typer provides automatic help generation
- Rich integration for beautiful output
- Consistent exit codes for CI
Rust Performance Module
Location: rust/src/
Components:
| File | Purpose |
|---|---|
| `lib.rs` | PyO3 bindings and main functions |
| `scoring.rs` | Statistics calculation algorithms |
| `parallel.rs` | Rayon-based parallel processing |
Key Functions:
```rust
#[pyfunction]
fn calculate_robustness_score(
    semantic_passed: u32,
    deterministic_passed: u32,
    total: u32,
    semantic_weight: f64,
    deterministic_weight: f64,
) -> f64

#[pyfunction]
fn levenshtein_distance(s1: &str, s2: &str) -> usize

#[pyfunction]
fn string_similarity(s1: &str, s2: &str) -> f64
```
Design Analysis:
✅ Strengths:
- PyO3 provides seamless Python integration
- Rayon enables easy parallelism
- Comprehensive test suite
Design Analysis
Overall Architecture Assessment
Strengths:
- Modularity: Clear separation of concerns makes code maintainable
- Extensibility: Easy to add new mutation types, checkers, adapters
- Type Safety: Pydantic and type hints catch errors early
- Performance: Rust acceleration where it matters
- Usability: Rich CLI with progress bars and beautiful output
Areas for Improvement:
- Memory Usage: Large test runs keep all results in memory
- Checkpointing: No resume capability for interrupted runs
- Distributed Execution: Single-machine only
Performance Characteristics
| Operation | Complexity | Bottleneck |
|---|---|---|
| Mutation Generation | O(n*m) | LLM inference |
| Test Execution | O(n) | Agent response time |
| Scoring | O(n) | CPU (optimized with Rust) |
| Report Generation | O(n) | I/O |
Where n = number of mutations, m = mutation types.
Security Considerations
- Secrets Management: Environment variable expansion keeps secrets out of config files
- Local LLM: No data sent to external APIs
- PII Detection: Built-in checks for sensitive data
- Injection Testing: Helps harden agents against attacks
This documentation reflects the current implementation. Always refer to the source code for the most up-to-date information.