flakestorm/docs/MODULES.md

# flakestorm Module Documentation
This document provides a comprehensive explanation of each module in the flakestorm codebase: what it does, how it works, and why it was designed that way.
---
## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [Core Modules](#core-modules)
   - [config.py](#configpy---configuration-management)
   - [protocol.py](#protocolpy---agent-adapters)
   - [orchestrator.py](#orchestratorpy---test-orchestration)
   - [runner.py](#runnerpy---test-execution)
   - [performance.py](#performancepy---rustpython-bridge)
3. [Mutation Modules](#mutation-modules)
   - [types.py](#typespy---mutation-types)
   - [templates.py](#templatespy---prompt-templates)
   - [engine.py](#enginepy---mutation-generation)
4. [Assertion Modules](#assertion-modules)
   - [deterministic.py](#deterministicpy---rule-based-checks)
   - [semantic.py](#semanticpy---ai-based-checks)
   - [safety.py](#safetypy---security-checks)
   - [verifier.py](#verifierpy---assertion-orchestration)
5. [Reporting Modules](#reporting-modules)
   - [models.py](#modelspy---data-structures)
   - [html.py](#htmlpy---html-report-generation)
   - [terminal.py](#terminalpy---cli-output)
6. [CLI Module](#cli-module)
   - [main.py](#mainpy---command-line-interface)
7. [Rust Performance Module](#rust-performance-module)
8. [Design Analysis](#design-analysis)
---
## Architecture Overview
```
flakestorm/
├── core/                    # Core orchestration logic
│   ├── config.py            # Configuration (V1 + V2: chaos, contract, replays, scoring)
│   ├── protocol.py          # Agent adapters, create_instrumented_adapter (chaos interceptor)
│   ├── orchestrator.py      # Main test coordination
│   ├── runner.py            # High-level test runner
│   └── performance.py       # Rust/Python bridge
├── chaos/                   # V2 environment chaos
│   ├── context_attacks.py   # memory_poisoning (input before invoke), indirect_injection, normalize_context_attacks
│   ├── interceptor.py       # ChaosInterceptor: memory_poisoning + LLM faults (timeout before call, others after)
│   ├── faults.py            # should_trigger, tool/LLM fault application
│   ├── llm_proxy.py         # apply_llm_fault (truncated, empty, garbage, rate_limit, response_drift)
│   └── profiles/            # Built-in chaos profiles
├── contracts/               # V2 behavioral contracts
│   ├── engine.py            # ContractEngine: (invariant × scenario) cells, reset, probes, behavior_unchanged
│   └── matrix.py            # ResilienceMatrix
├── replay/                  # V2 replay regression
│   ├── loader.py            # Load replay sessions (file or inline)
│   └── runner.py            # Replay execution
├── mutations/               # Adversarial input generation (22+ types, max 50/run OSS)
│   ├── types.py             # MutationType enum
│   ├── templates.py         # LLM prompt templates
│   └── engine.py            # Mutation generation engine
├── assertions/              # Response validation
│   ├── deterministic.py     # Rule-based assertions
│   ├── semantic.py          # AI-based assertions
│   ├── safety.py            # Security assertions
│   └── verifier.py          # InvariantVerifier (all invariant types including behavior_unchanged)
├── reports/                 # Output generation
│   ├── models.py            # Report data models
│   ├── html.py              # HTML report generator
│   ├── json_export.py       # JSON export
│   └── terminal.py          # Terminal output
├── cli/                     # Command-line interface
│   └── main.py              # flakestorm run, contract run, replay run, ci
└── integrations/            # External integrations
    ├── huggingface.py       # HuggingFace model support
    └── embeddings.py        # Local embeddings
```
---
## Core Modules
### config.py - Configuration Management
**Location:** `src/flakestorm/core/config.py`
**Purpose:** Handles loading, validating, and providing type-safe access to the `flakestorm.yaml` configuration file.
**Key Components:**
```python
class AgentConfig(BaseModel):
    """Configuration for connecting to the target agent."""
    endpoint: str               # Agent URL or Python module path
    type: AgentType             # http, python, or langchain
    timeout: int = 30000        # Request timeout (ms)
    headers: dict = {}          # HTTP headers
    request_template: str       # How to format requests
    response_path: str          # JSONPath to extract response
    # V2: state isolation for contract matrix
    reset_endpoint: str | None  # HTTP POST URL called before each cell
    reset_function: str | None  # Python path, e.g. myagent:reset_state
```python
class FlakeStormConfig(BaseModel):
    """Root configuration model."""
    version: str = "1.0"               # 1.0 | 2.0
    agent: AgentConfig
    golden_prompts: list[str]
    mutations: MutationConfig          # count max 50 in OSS; 22+ mutation types
    model: ModelConfig                 # api_key env-only in V2
    invariants: list[InvariantConfig]
    output: OutputConfig
    advanced: AdvancedConfig
    # V2 optional
    chaos: ChaosConfig | None                       # tool_faults, llm_faults, context_attacks (list or dict)
    contract: ContractConfig | None                 # invariants + chaos_matrix (scenarios can have context_attacks)
    chaos_matrix: list[ChaosScenarioConfig] | None  # when not using contract.chaos_matrix
    replays: ReplayConfig | None                    # sessions (file or inline), sources (LangSmith)
    scoring: ScoringConfig | None                   # mutation, chaos, contract, replay weights (must sum to 1.0)
```
**Key Functions:**
| Function | Purpose |
|----------|---------|
| `load_config(path)` | Load and validate YAML config file |
| `expand_env_vars()` | Replace `${VAR}` with environment values |
| `validate_config()` | Run Pydantic validation |
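As an illustration, `expand_env_vars()` can be implemented with a single regex substitution over the raw YAML text. This is a sketch, not the actual flakestorm code; the uppercase-only variable pattern and the fail-fast `KeyError` on missing variables are assumptions:

```python
import os
import re

_ENV_PATTERN = re.compile(r"\$\{([A-Z0-9_]+)\}")

def expand_env_vars(text: str) -> str:
    """Replace each ${VAR} occurrence with the value from the environment.

    Raises KeyError if a referenced variable is unset, so a missing
    secret fails at config-load time instead of mid-run.
    """
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name!r} is not set")
        return os.environ[name]

    return _ENV_PATTERN.sub(replace, text)

os.environ["API_KEY"] = "sk-test"
print(expand_env_vars("Authorization: Bearer ${API_KEY}"))  # → Authorization: Bearer sk-test
```

Running the substitution before Pydantic validation means the models only ever see fully resolved values.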
**Design Analysis:**
**Strengths:**
- Uses Pydantic for robust validation with clear error messages
- Environment variable expansion for secrets management
- Type safety prevents runtime configuration errors
- Default values reduce required configuration
⚠️ **Considerations:**
- Large config model - could be split into smaller files for maintainability
- No schema versioning - future config changes need migration support
**Why This Design:**
Pydantic was chosen over alternatives (dataclasses, attrs) because:
1. Built-in JSON serialization (loaded YAML maps directly onto model fields)
2. Automatic validation with descriptive errors
3. Environment variable support
4. Wide ecosystem adoption
---
### protocol.py - Agent Adapters
**Location:** `src/flakestorm/core/protocol.py`
**Purpose:** Provides a unified interface for communicating with different types of AI agents (HTTP APIs, Python functions, LangChain).
**Key Components:**
```python
class AgentProtocol(Protocol):
    """Protocol that all agent adapters must implement."""

    async def invoke(self, prompt: str) -> AgentResponse:
        """Send prompt to agent and return response."""
        ...
```
```python
class HTTPAgentAdapter(BaseAgentAdapter):
    """Adapter for HTTP-based agents."""

    async def invoke(self, prompt: str) -> AgentResponse:
        # 1. Format request using template
        # 2. Send HTTP POST with headers
        # 3. Extract response using JSONPath
        # 4. Return with latency measurement
```
```python
class PythonAgentAdapter(BaseAgentAdapter):
    """Adapter for Python function agents."""

    async def invoke(self, prompt: str) -> AgentResponse:
        # 1. Import the specified module
        # 2. Call the function with prompt
        # 3. Return response with timing
```
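Because `AgentProtocol` is a structural protocol, any object with a matching async `invoke` conforms without inheritance, which is what makes mocking easy. A toy adapter under assumed field names (`text` and `latency_ms` stand in for whatever the real `AgentResponse` exposes):

```python
import asyncio
import time
from dataclasses import dataclass

@dataclass
class AgentResponse:
    """Simplified stand-in for flakestorm's response object."""
    text: str
    latency_ms: float

class EchoAgentAdapter:
    """Toy adapter satisfying the AgentProtocol shape, useful in unit tests."""

    async def invoke(self, prompt: str) -> AgentResponse:
        start = time.perf_counter()
        await asyncio.sleep(0)  # stand-in for real network or function I/O
        text = f"echo: {prompt}"
        latency = (time.perf_counter() - start) * 1000
        return AgentResponse(text=text, latency_ms=latency)

result = asyncio.run(EchoAgentAdapter().invoke("hello"))
print(result.text)  # → echo: hello
```

Swapping this in for a real adapter lets the orchestrator be exercised without a live agent.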
**Design Analysis:**
**Strengths:**
- Protocol pattern allows easy extension for new agent types
- Async-first design for efficient parallel testing
- Built-in latency measurement for performance tracking
- Retry logic handles transient failures
⚠️ **Considerations:**
- HTTP adapter assumes JSON request/response format
- Python adapter uses dynamic import which can be security-sensitive
**Why This Design:**
The adapter pattern was chosen because:
1. Decouples test logic from agent communication
2. Easy to add new agent types without modifying core
3. Allows mocking for unit tests
---
### orchestrator.py - Test Orchestration
**Location:** `src/flakestorm/core/orchestrator.py`
**Purpose:** Coordinates the entire testing process: mutation generation, parallel test execution, and result aggregation.
**Key Components:**
```python
class Orchestrator:
    """Main orchestration class."""

    async def run(self) -> TestResults:
        """Execute the full test suite."""
        # 1. Generate mutations for all golden prompts
        # 2. Run mutations concurrently, bounded by a semaphore
        # 3. Verify responses against invariants
        # 4. Aggregate and score results
        # 5. Return comprehensive results
```
**Execution Flow:**
```
run()
├─► _generate_mutations() # Create adversarial inputs
│ └─► MutationEngine.generate_mutations()
├─► _run_mutations() # Execute tests in parallel
│ ├─► Semaphore(concurrency)
│ └─► _run_single_mutation()
│ ├─► agent.invoke(mutated_prompt)
│ └─► verifier.verify(response)
└─► _aggregate_results() # Calculate statistics
└─► calculate_statistics()
```
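The `Semaphore(concurrency)` step in the flow above is the standard asyncio bounded-concurrency pattern; `run_mutations` and `fake_agent` below are illustrative names, not flakestorm APIs:

```python
import asyncio

async def run_mutations(prompts, invoke, concurrency: int = 5):
    """Run invoke() over all prompts with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(prompt):
        async with semaphore:      # blocks when `concurrency` tasks are active
            return await invoke(prompt)

    # gather() preserves input order regardless of completion order
    return await asyncio.gather(*(run_one(p) for p in prompts))

async def fake_agent(prompt):
    await asyncio.sleep(0.01)      # simulated agent latency
    return prompt.upper()

results = asyncio.run(run_mutations(["a", "b", "c"], fake_agent, concurrency=2))
print(results)  # → ['A', 'B', 'C']
```

The semaphore bounds pressure on the agent while still letting slow responses overlap.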
**Design Analysis:**
**Strengths:**
- Async/await for efficient I/O-bound operations
- Semaphore controls concurrency to prevent overwhelming the agent
- Progress tracking with Rich for user feedback
- Clean separation between generation, execution, and verification
⚠️ **Considerations:**
- All mutations held in memory - could be memory-intensive for large runs
- No checkpointing - failed runs restart from beginning
**Why This Design:**
Async orchestration was chosen because:
1. Agent calls are I/O-bound, not CPU-bound
2. Parallelism improves test throughput significantly
3. Semaphore pattern is standard for rate limiting
---
### performance.py - Rust/Python Bridge
**Location:** `src/flakestorm/core/performance.py`
**Purpose:** Provides high-performance implementations of compute-intensive operations using Rust, with pure Python fallbacks.
**Key Functions:**
```python
def is_rust_available() -> bool:
    """Check if Rust extension is installed."""

def calculate_robustness_score(...) -> float:
    """Calculate weighted robustness score."""
    # Uses Rust if available, else Python

def levenshtein_distance(s1, s2) -> int:
    """Fast string edit distance calculation."""
    # 88x faster in Rust vs Python

def string_similarity(s1, s2) -> float:
    """Calculate string similarity ratio."""
```
**Performance Comparison:**
| Function | Python Time | Rust Time | Speedup |
|----------|------------|-----------|---------|
| Levenshtein (5000 iter) | 5864ms | 67ms | **88x** |
| Robustness Score | 0.5ms | 0.01ms | **50x** |
| String Similarity | 1.2ms | 0.02ms | **60x** |
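The graceful fallback typically boils down to a guarded import. The extension module name `flakestorm_rust` and the fallback body below are assumptions for illustration, not the actual bridge code:

```python
try:
    from flakestorm_rust import levenshtein_distance  # compiled extension (assumed name)
    RUST_AVAILABLE = True
except ImportError:
    RUST_AVAILABLE = False

    def levenshtein_distance(s1: str, s2: str) -> int:
        """Pure-Python fallback: classic two-row dynamic-programming edit distance."""
        if len(s1) < len(s2):
            s1, s2 = s2, s1
        previous = list(range(len(s2) + 1))
        for i, c1 in enumerate(s1):
            current = [i + 1]
            for j, c2 in enumerate(s2):
                current.append(min(
                    previous[j + 1] + 1,       # deletion
                    current[j] + 1,            # insertion
                    previous[j] + (c1 != c2),  # substitution
                ))
            previous = current
        return previous[-1]

print(levenshtein_distance("kitten", "sitting"))  # → 3
```

Callers import from `performance.py` either way, so the implementation choice is invisible at the call site.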
**Design Analysis:**
**Strengths:**
- Graceful fallback if Rust not available
- Same API regardless of implementation
- Significant performance improvement for scoring
⚠️ **Considerations:**
- Requires Rust toolchain for compilation
- Binary compatibility across platforms
**Why This Design:**
The bridge pattern was chosen because:
1. Pure Python works everywhere (easy installation)
2. Rust acceleration for production (performance)
3. Same tests validate both implementations
---
## Mutation Modules
### types.py - Mutation Types
**Location:** `src/flakestorm/mutations/types.py`
**Purpose:** Defines the types of adversarial mutations and their data structures.
**Key Components:**
```python
class MutationType(str, Enum):
    """Types of adversarial mutations."""
    PARAPHRASE = "paraphrase"                      # Same meaning, different words
    NOISE = "noise"                                # Typos and errors
    TONE_SHIFT = "tone_shift"                      # Different emotional tone
    PROMPT_INJECTION = "prompt_injection"          # Jailbreak attempts
    ENCODING_ATTACKS = "encoding_attacks"          # Encoded inputs
    CONTEXT_MANIPULATION = "context_manipulation"  # Context changes
    LENGTH_EXTREMES = "length_extremes"            # Edge case lengths
    CUSTOM = "custom"                              # User-defined templates
```
**The 8 Core Mutation Types:**
1. **PARAPHRASE** (Weight: 1.0)
- **What it tests**: Semantic understanding - can the agent handle different wording?
- **How it works**: LLM rewrites the prompt using synonyms and alternative phrasing while preserving intent
- **Why essential**: Users express the same intent in many ways. Agents must understand meaning, not just keywords.
- **Template strategy**: Instructs LLM to use completely different words while keeping exact same meaning
2. **NOISE** (Weight: 0.8)
- **What it tests**: Typo tolerance - can the agent handle user errors?
- **How it works**: LLM adds realistic typos (swapped letters, missing letters, abbreviations)
- **Why essential**: Real users make typos, especially on mobile. Robust agents must handle common errors gracefully.
- **Template strategy**: Simulates realistic typing errors as if typed quickly on a phone
3. **TONE_SHIFT** (Weight: 0.9)
- **What it tests**: Emotional resilience - can the agent handle frustrated users?
- **How it works**: LLM rewrites with urgency, impatience, and slight aggression
- **Why essential**: Users get impatient. Agents must maintain quality even under stress.
- **Template strategy**: Adds words like "NOW", "HURRY", "ASAP" and frustration phrases
4. **PROMPT_INJECTION** (Weight: 1.5)
- **What it tests**: Security - can the agent resist manipulation?
- **How it works**: LLM adds injection attempts like "ignore previous instructions"
- **Why essential**: Attackers try to manipulate agents. Security is non-negotiable.
- **Template strategy**: Keeps original request but adds injection techniques after it
5. **ENCODING_ATTACKS** (Weight: 1.3)
- **What it tests**: Parser robustness - can the agent handle encoded inputs?
- **How it works**: LLM transforms prompt using Base64, Unicode escapes, or URL encoding
- **Why essential**: Attackers use encoding to bypass filters. Agents must decode correctly.
- **Template strategy**: Instructs LLM to use various encoding techniques (Base64, Unicode, URL)
6. **CONTEXT_MANIPULATION** (Weight: 1.1)
- **What it tests**: Context extraction - can the agent find intent in noisy context?
- **How it works**: LLM adds irrelevant information, removes key context, or reorders structure
- **Why essential**: Real conversations include irrelevant information. Agents must extract the core request.
- **Template strategy**: Adds/removes/reorders context while keeping core request ambiguous
7. **LENGTH_EXTREMES** (Weight: 1.2)
- **What it tests**: Edge cases - can the agent handle empty or very long inputs?
- **How it works**: LLM creates minimal versions (removing non-essential words) or very long versions (expanding with repetition)
- **Why essential**: Real inputs vary wildly in length. Agents must handle boundaries.
- **Template strategy**: Creates extremely short or extremely long versions to test token limits
8. **CUSTOM** (Weight: 1.0)
- **What it tests**: Domain-specific scenarios
- **How it works**: User provides custom template with `{prompt}` placeholder
- **Why essential**: Every domain has unique failure modes. Custom mutations let you test them.
- **Template strategy**: Applies user-defined transformation instructions
**Mutation Philosophy:**
The 8 mutation types are designed to cover different failure modes:
- **Semantic Robustness**: PARAPHRASE, CONTEXT_MANIPULATION test understanding
- **Input Robustness**: NOISE, ENCODING_ATTACKS, LENGTH_EXTREMES test parsing
- **Security**: PROMPT_INJECTION, ENCODING_ATTACKS test resistance to attacks
- **User Experience**: TONE_SHIFT, NOISE, CONTEXT_MANIPULATION test real-world usage
Together, they provide comprehensive coverage of agent failure modes.
```python
@dataclass
class Mutation:
    """A single mutation of a golden prompt."""
    original: str       # Original prompt
    mutated: str        # Mutated version
    type: MutationType  # Type of mutation
    weight: float       # Scoring weight
    metadata: dict      # Additional info

    @property
    def id(self) -> str:
        """Unique hash for this mutation."""
        return hashlib.md5(..., usedforsecurity=False)

    def is_valid(self) -> bool:
        """Validates mutation, with special handling for LENGTH_EXTREMES."""
        # LENGTH_EXTREMES may intentionally create empty or very long strings
```
**Design Analysis:**
**Strengths:**
- Enum prevents invalid mutation types
- Dataclass provides clean, typed structure
- Built-in weight scoring for weighted results
- Special validation logic for edge cases (LENGTH_EXTREMES)
**Why This Design:**
String enum was chosen because:
1. Values serialize directly to YAML/JSON
2. Type checking catches typos
3. Easy to extend with new types
4. All 8 types work together to provide comprehensive testing coverage
---
### engine.py - Mutation Generation
**Location:** `src/flakestorm/mutations/engine.py`
**Purpose:** Generates adversarial mutations using a local LLM (Ollama/Qwen).
**Key Components:**
```python
class MutationEngine:
    """Engine for generating adversarial mutations."""

    def __init__(self, config: LLMConfig):
        self.client = ollama.AsyncClient(host=config.host)
        self.model = config.model

    async def generate_mutations(
        self,
        prompt: str,
        types: list[MutationType],
        count: int,
    ) -> list[Mutation]:
        """Generate multiple mutations for a prompt."""
```
**Generation Flow:**
```
generate_mutations(prompt, types, count)
├─► For each mutation type:
│ ├─► Get template from templates.py
│ ├─► Format with original prompt
│ └─► Call Ollama API
├─► Parse LLM responses
│ └─► Extract mutated prompts
└─► Create Mutation objects
└─► Assign difficulty weights
```
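The "Parse LLM responses" step is where the fragility noted below creeps in. One defensive approach (illustrative, not the actual parser) is to accept only numbered-list lines and ignore everything else, which tolerates chatty model preambles:

```python
import re

def parse_mutations(llm_output: str) -> list[str]:
    """Extract mutated prompts from a numbered-list LLM response.

    Assumes the template asked the model for output like:
        1. first variant
        2. second variant
    Non-matching lines (greetings, explanations) are silently skipped.
    """
    pattern = re.compile(r"^\s*\d+[.)]\s+(.*\S)\s*$")
    mutations = []
    for line in llm_output.splitlines():
        match = pattern.match(line)
        if match:
            mutations.append(match.group(1))
    return mutations

raw = """Sure! Here are two variants:
1. Plz book me a flite to Paris
2. book flight paris NOW"""
print(parse_mutations(raw))
```

Returning an empty list on unparseable output lets the engine retry or skip rather than crash mid-run.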
**Design Analysis:**
**Strengths:**
- Async API calls for parallel generation
- Local LLM (no API costs, no data leakage)
- Customizable templates per mutation type
⚠️ **Considerations:**
- Depends on Ollama being installed and running
- LLM output parsing can be fragile
- Model quality affects mutation quality
**Why This Design:**
Local LLM was chosen over cloud APIs because:
1. Zero cost at scale
2. No rate limits
3. Privacy - prompts stay local
4. Works offline
---
## Assertion Modules
### deterministic.py - Rule-Based Checks
**Location:** `src/flakestorm/assertions/deterministic.py`
**Purpose:** Implements deterministic, rule-based assertions that check responses against exact criteria.
**Key Checkers:**
```python
class ContainsChecker(BaseChecker):
    """Check if response contains a value."""

class NotContainsChecker(BaseChecker):
    """Check if response does NOT contain a value."""

class RegexChecker(BaseChecker):
    """Check if response matches a regex pattern."""

class LatencyChecker(BaseChecker):
    """Check if response time is within limit."""

class ValidJsonChecker(BaseChecker):
    """Check if response is valid JSON."""
```
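The registry mentioned under "Why This Design" can be sketched with a class decorator; `CHECKER_REGISTRY`, `register`, and the simplified `check` signatures here are illustrative assumptions:

```python
import json

CHECKER_REGISTRY = {}

def register(name):
    """Class decorator that records a checker under its config-file name."""
    def wrap(cls):
        CHECKER_REGISTRY[name] = cls
        return cls
    return wrap

@register("contains")
class ContainsChecker:
    def __init__(self, value):
        self.value = value

    def check(self, response: str) -> bool:
        return self.value in response

@register("valid_json")
class ValidJsonChecker:
    def __init__(self, value=None):
        pass

    def check(self, response: str) -> bool:
        try:
            json.loads(response)
            return True
        except json.JSONDecodeError:
            return False

# Configuration-driven selection: look the checker up by its YAML name.
checker = CHECKER_REGISTRY["contains"](value="refund")
print(checker.check("Your refund has been processed."))  # → True
```

Adding a new assertion type is then one decorated class, with no changes to the dispatch code.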
**Design Analysis:**
**Strengths:**
- Fast execution (no AI/ML involved)
- Predictable, reproducible results
- Easy to debug failures
**Why This Design:**
Checker pattern with registry allows:
1. Easy addition of new check types
2. Configuration-driven selection
3. Consistent error reporting
---
### semantic.py - AI-Based Checks
**Location:** `src/flakestorm/assertions/semantic.py`
**Purpose:** Implements semantic assertions using embeddings for meaning-based comparison.
**Key Components:**
```python
class LocalEmbedder:
    """Local sentence embeddings using sentence-transformers."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def embed(self, text: str) -> np.ndarray:
        return self.model.encode(text)

    def similarity(self, text1: str, text2: str) -> float:
        emb1, emb2 = self.embed(text1), self.embed(text2)
        return cosine_similarity(emb1, emb2)
```
```python
class SimilarityChecker(BaseChecker):
    """Check semantic similarity to expected response."""

    def check(self, response: str, latency_ms: float) -> CheckResult:
        similarity = self.embedder.similarity(response, self.expected)
        return CheckResult(passed=similarity >= self.threshold)
```
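The `cosine_similarity` helper used above reduces to a dot product divided by the L2 norms. A dependency-free sketch over plain lists (the real code presumably operates on NumPy arrays):

```python
import math

def cosine_similarity(v1: list[float], v2: list[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```

Because embeddings of semantically similar sentences cluster together, a threshold on this value (e.g. 0.8) separates "same meaning" from "different meaning" without any string matching.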
**Design Analysis:**
**Strengths:**
- Catches semantic equivalence (not just string matching)
- Lazy loading of heavy ML models
- Configurable similarity thresholds
⚠️ **Considerations:**
- Requires sentence-transformers (optional dependency)
- First call is slow (model loading)
- Memory usage for embedding model
**Why This Design:**
sentence-transformers was chosen because:
1. High-quality embeddings out of the box
2. Runs locally (no API calls)
3. Small model (all-MiniLM-L6-v2 has only ~23M parameters)
---
### safety.py - Security Checks
**Location:** `src/flakestorm/assertions/safety.py`
**Purpose:** Implements security-focused assertions for PII detection and refusal verification.
**Key Checkers:**
```python
class ExcludesPIIChecker(BaseChecker):
    """Check that response doesn't contain PII."""
    PII_PATTERNS = [
        r'\b\d{3}-\d{2}-\d{4}\b',        # SSN
        r'\b\d{16}\b',                   # Credit card
        r'\b[A-Za-z0-9._%+-]+@...',      # Email
        r'\b\d{3}[-.]?\d{3}[-.]?\d{4}',  # Phone
    ]
```
```python
class RefusalChecker(BaseChecker):
    """Check that agent refuses dangerous requests."""
    REFUSAL_PHRASES = [
        "I cannot", "I'm unable to", "I won't",
        "against my guidelines", "not appropriate",
    ]
```
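Applying the patterns is a single `re.search` loop. Note the email pattern below is completed purely for illustration — the real pattern is elided in the snippet above:

```python
import re

# Patterns mirror the ExcludesPIIChecker snippet; the email regex is an
# illustrative completion, not the actual flakestorm pattern.
PII_PATTERNS = [
    r'\b\d{3}-\d{2}-\d{4}\b',                 # SSN
    r'\b\d{16}\b',                            # Credit card
    r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\b',  # Email (illustrative)
    r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',         # Phone
]

def contains_pii(response: str) -> bool:
    """Return True if any PII pattern matches the response."""
    return any(re.search(p, response) for p in PII_PATTERNS)

print(contains_pii("Contact me at 555-123-4567"))  # → True
print(contains_pii("Your order shipped today."))   # → False
```

In the checker, a match means the assertion *fails*: the agent leaked something it should have withheld.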
**Design Analysis:**
**Strengths:**
- Essential for production safety
- Regex-based PII detection is fast
- Catches common refusal patterns
⚠️ **Considerations:**
- PII patterns may miss edge cases
- Refusal detection is heuristic-based
**Why This Design:**
Pattern-based detection was chosen because:
1. Fast and deterministic
2. No false positives from ML
3. Easy to audit and extend
---
## Reporting Modules
### models.py - Data Structures
**Location:** `src/flakestorm/reports/models.py`
**Purpose:** Defines data structures for test results and reports.
**Key Models:**
```python
@dataclass
class MutationResult:
    """Result of testing a single mutation."""
    mutation: Mutation
    response: str
    latency_ms: float
    passed: bool
    checks: list[CheckResult]

@dataclass
class TestResults:
    """Complete test run results."""
    config: FlakeStormConfig
    mutations: list[MutationResult]
    statistics: TestStatistics
    timestamp: datetime
```
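A plausible reduction from these models to a headline statistic: the sketch below collapses `Mutation.weight` and `MutationResult.passed` into one stand-in record, and `weighted_pass_rate` itself is an illustrative assumption, not the actual `TestStatistics` code:

```python
from dataclasses import dataclass

@dataclass
class MutationResult:
    """Minimal stand-in carrying only what the aggregation needs."""
    passed: bool
    weight: float

def weighted_pass_rate(results: list[MutationResult]) -> float:
    """Weight each mutation by its difficulty when computing the pass rate."""
    total = sum(r.weight for r in results)
    if total == 0:
        return 0.0
    return sum(r.weight for r in results if r.passed) / total

results = [
    MutationResult(passed=True, weight=1.0),   # paraphrase
    MutationResult(passed=True, weight=0.8),   # noise
    MutationResult(passed=False, weight=1.5),  # prompt injection
]
print(round(weighted_pass_rate(results), 3))  # → 0.545
```

Weighting means a failed prompt injection (weight 1.5) hurts the score more than a failed paraphrase (weight 1.0), matching the mutation-type weights documented earlier.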
---
### html.py - HTML Report Generation
**Location:** `src/flakestorm/reports/html.py`
**Purpose:** Generates interactive HTML reports with visualizations.
**Key Features:**
- Embedded CSS (no external dependencies)
- Pass/fail grid visualization
- Latency charts
- Failure details with expandable sections
- Mobile-responsive design
**Design Analysis:**
**Strengths:**
- Self-contained HTML (single file, works offline)
- No JavaScript framework dependencies
- Professional appearance
---
## CLI Module
### main.py - Command-Line Interface
**Location:** `src/flakestorm/cli/main.py`
**Purpose:** Provides the `flakestorm` command-line tool using Typer.
**Commands:**
```bash
flakestorm init # Create config file
flakestorm run # Run tests
flakestorm verify # Validate config
flakestorm report # Generate report from JSON
flakestorm score # Show score from results
```
**Design Analysis:**
**Strengths:**
- Typer provides automatic help generation
- Rich integration for beautiful output
- Consistent exit codes for CI
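The "consistent exit codes" strength maps naturally onto a score threshold. A sketch of the idea — the function name and the 0.8 default are assumptions, not flakestorm's actual values:

```python
DEFAULT_THRESHOLD = 0.8  # hypothetical default, not flakestorm's real value

def exit_code_for_score(score: float, threshold: float = DEFAULT_THRESHOLD) -> int:
    """Map a robustness score to a CI-friendly exit code (0 = pass, 1 = fail)."""
    return 0 if score >= threshold else 1

print(exit_code_for_score(0.92))  # → 0
print(exit_code_for_score(0.41))  # → 1
```

A pipeline then only needs `flakestorm run` as a build step; a score below the threshold fails the job without any output parsing.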
---
## Rust Performance Module
**Location:** `rust/src/`
**Components:**
| File | Purpose |
|------|---------|
| `lib.rs` | PyO3 bindings and main functions |
| `scoring.rs` | Statistics calculation algorithms |
| `parallel.rs` | Rayon-based parallel processing |
**Key Functions:**
```rust
#[pyfunction]
fn calculate_robustness_score(
    semantic_passed: u32,
    deterministic_passed: u32,
    total: u32,
    semantic_weight: f64,
    deterministic_weight: f64,
) -> f64

#[pyfunction]
fn levenshtein_distance(s1: &str, s2: &str) -> usize

#[pyfunction]
fn string_similarity(s1: &str, s2: &str) -> f64
```
**Design Analysis:**
**Strengths:**
- PyO3 provides seamless Python integration
- Rayon enables easy parallelism
- Comprehensive test suite
---
## Design Analysis
### Overall Architecture Assessment
**Strengths:**
1. **Modularity**: Clear separation of concerns makes code maintainable
2. **Extensibility**: Easy to add new mutation types, checkers, adapters
3. **Type Safety**: Pydantic and type hints catch errors early
4. **Performance**: Rust acceleration where it matters
5. **Usability**: Rich CLI with progress bars and beautiful output
**Areas for Improvement:**
1. **Memory Usage**: Large test runs keep all results in memory
2. **Checkpointing**: No resume capability for interrupted runs
3. **Distributed Execution**: Single-machine only
### Performance Characteristics
| Operation | Complexity | Bottleneck |
|-----------|------------|------------|
| Mutation Generation | O(n*m) | LLM inference |
| Test Execution | O(n) | Agent response time |
| Scoring | O(n) | CPU (optimized with Rust) |
| Report Generation | O(n) | I/O |
Where n = number of mutations, m = mutation types.
### Security Considerations
1. **Secrets Management**: Environment variable expansion keeps secrets out of config files
2. **Local LLM**: No data sent to external APIs
3. **PII Detection**: Built-in checks for sensitive data
4. **Injection Testing**: Helps harden agents against attacks
---
*This documentation reflects the current implementation. Always refer to the source code for the most up-to-date information.*