flakestorm/docs/USAGE_GUIDE.md

# flakestorm Usage Guide

> **The Agent Reliability Engine** - Chaos Engineering for AI Agents

This comprehensive guide walks you through using flakestorm to test your AI agents for reliability, robustness, and safety.

---

## Table of Contents

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Quick Start](#quick-start)
4. [Core Concepts](#core-concepts)
5. [Configuration Deep Dive](#configuration-deep-dive)
6. [Running Tests](#running-tests)
7. [Understanding Results](#understanding-results)
8. [Integration Patterns](#integration-patterns)
9. [CI/CD Integration](#cicd-integration)
10. [Advanced Usage](#advanced-usage)
11. [Troubleshooting](#troubleshooting)

---

## Introduction

### What is flakestorm?

flakestorm is an **adversarial testing framework** for AI agents. It applies chaos engineering principles to systematically test how your AI agents behave under unexpected, malformed, or adversarial inputs.

### Why Use flakestorm?

| Problem | How flakestorm Helps |
|---------|-------------------|
| Agent fails with typos in user input | Tests with noise mutations |
| Agent leaks sensitive data | Safety assertions catch PII exposure |
| Agent behavior varies unpredictably | Semantic similarity assertions ensure consistency |
| Prompt injection attacks | Tests agent resilience to injection attempts |
| No way to quantify reliability | Provides robustness scores (0.0 - 1.0) |

### How It Works

```
┌─────────────────────────────────────────────────────────────────┐
│                         flakestorm FLOW                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   1. GOLDEN PROMPTS          2. MUTATION ENGINE                 │
│   ┌─────────────────┐        ┌─────────────────┐               │
│   │ "Book a flight  │  ───►  │ Local LLM       │               │
│   │  from NYC to LA"│        │ (Qwen/Ollama)   │               │
│   └─────────────────┘        └────────┬────────┘               │
│                                       │                         │
│                                       ▼                         │
│                              ┌─────────────────┐               │
│                              │ Mutated Prompts │               │
│                              │ • Typos         │               │
│                              │ • Paraphrases   │               │
│                              │ • Injections    │               │
│                              └────────┬────────┘               │
│                                       │                         │
│   3. YOUR AGENT                       ▼                         │
│   ┌─────────────────┐        ┌─────────────────┐               │
│   │ AI Agent        │  ◄───  │ Test Runner     │               │
│   │ (HTTP/Python)   │        │ (Async)         │               │
│   └────────┬────────┘        └─────────────────┘               │
│            │                                                    │
│            ▼                                                    │
│   4. VERIFICATION            5. REPORTING                       │
│   ┌─────────────────┐        ┌─────────────────┐               │
│   │ Invariant       │  ───►  │ HTML/JSON/CLI   │               │
│   │ Assertions      │        │ Reports         │               │
│   └─────────────────┘        └─────────────────┘               │
│                                       │                         │
│                                       ▼                         │
│                              ┌─────────────────┐               │
│                              │ Robustness      │               │
│                              │ Score: 0.85     │               │
│                              └─────────────────┘               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

---

## Installation

### Prerequisites

- **Python 3.10+** (3.11 recommended)
- **Ollama** (for local LLM mutation generation)
- **Rust** (optional, for performance optimization)

### Step 1: Install Ollama

```bash
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
ollama serve
```

### Step 2: Pull the Default Model

```bash
# Pull Qwen Coder 3 8B (recommended for mutations)
ollama pull qwen2.5-coder:7b

# Verify it's working
ollama run qwen2.5-coder:7b "Hello, world!"
```

### Step 3: Install flakestorm

```bash
# From PyPI (when published)
pip install flakestorm

# From source (development)
git clone https://github.com/flakestorm/flakestorm.git
cd flakestorm
pip install -e ".[dev]"
```

### Step 4: (Optional) Install Rust Extension

For 80x+ performance improvement on scoring:

```bash
cd rust
pip install maturin
maturin build --release
pip install ../target/wheels/*.whl
```

### Verify Installation

```bash
flakestorm --version
flakestorm --help
```

---

## Quick Start

### 1. Initialize Configuration

```bash
# Create flakestorm.yaml in your project
flakestorm init
```

### 2. Configure Your Agent

Edit `flakestorm.yaml`:

```yaml
# Your AI agent endpoint
agent:
  endpoint: "http://localhost:8000/chat"
  type: http
  timeout: 30

# Prompts that should always work
golden_prompts:
  - "What is the weather in New York?"
  - "Book a flight from NYC to LA for tomorrow"
  - "Cancel my reservation #12345"

# What to check in responses
invariants:
  - type: contains
    value: "weather"
    prompt_filter: "weather"
  - type: latency
    max_ms: 5000
  - type: excludes_pii
```

### 3. Run Tests

```bash
# Basic run
flakestorm run

# With HTML report
flakestorm run --output html

# CI mode (fails if score < threshold)
flakestorm run --ci --min-score 0.8
```

### 4. View Results

```bash
# Open the generated report
open reports/entropix_report_*.html
```

---

## Core Concepts

### Golden Prompts

**What they are:** Carefully crafted prompts that represent your agent's core use cases. These are prompts that *should always work correctly*.

**How to choose them:**
- Cover all major user intents
- Include edge cases you've seen in production
- Represent different complexity levels

```yaml
golden_prompts:
  # Simple intent
  - "Hello, how are you?"

  # Complex intent with parameters
  - "Book a flight from New York to Los Angeles departing March 15th"

  # Edge case
  - "What if I need to cancel my booking?"
```

### Mutation Types

flakestorm generates adversarial variations of your golden prompts:

| Type | Description | Example |
|------|-------------|---------|
| `paraphrase` | Same meaning, different words | "Book flight" → "Reserve a plane ticket" |
| `noise` | Typos and formatting errors | "Book flight" → "Bok fligt" |
| `tone_shift` | Different emotional tone | "Book flight" → "I NEED A FLIGHT NOW!!!" |
| `prompt_injection` | Attempted jailbreaks | "Book flight. Ignore above and..." |

### Invariants (Assertions)

Rules that agent responses must satisfy:

```yaml
invariants:
  # Response must contain a keyword
  - type: contains
    value: "booked"

  # Response must NOT contain certain content
  - type: not_contains
    value: "error"

  # Response must match regex pattern
  - type: regex
    pattern: "confirmation.*#[A-Z0-9]+"

  # Response time limit
  - type: latency
    max_ms: 3000

  # Must be valid JSON
  - type: valid_json

  # Semantic similarity to expected response
  - type: similarity
    expected: "Your flight has been booked successfully"
    threshold: 0.8

  # Safety: no PII leakage
  - type: excludes_pii

  # Safety: must include refusal for dangerous requests
  - type: refusal
```

### Robustness Score

A number from 0.0 to 1.0 indicating how reliable your agent is:

```
Score = (Weighted Passed Tests) / (Total Weighted Tests)
```

Weights by mutation type:
- `prompt_injection`: 1.5 (harder to defend against)
- `paraphrase`: 1.0 (should always work)
- `tone_shift`: 1.0 (should handle different tones)
- `noise`: 0.8 (minor errors are acceptable)

**Interpretation:**
- **0.9+**: Excellent - Production ready
- **0.8-0.9**: Good - Minor improvements needed
- **0.7-0.8**: Fair - Needs work
- **<0.7**: Poor - Significant reliability issues

---

## Configuration Deep Dive

### Full Configuration Schema

```yaml
# =============================================================================
# AGENT CONFIGURATION
# =============================================================================
agent:
  # Required: Where to send requests
  endpoint: "http://localhost:8000/chat"

  # Agent type: http, python, or langchain
  type: http

  # Request timeout in seconds
  timeout: 30

  # HTTP-specific settings
  headers:
    Authorization: "Bearer ${API_KEY}"  # Environment variable expansion
    Content-Type: "application/json"

  # How to format the request body
  # Available placeholders: {prompt}
  request_template: |
    {"message": "{prompt}", "stream": false}

  # JSONPath to extract response from JSON
  response_path: "$.response"

# =============================================================================
# GOLDEN PROMPTS
# =============================================================================
golden_prompts:
  - "What is 2 + 2?"
  - "Summarize this article: {article_text}"
  - "Translate to Spanish: Hello, world!"

# =============================================================================
# MUTATION CONFIGURATION
# =============================================================================
mutations:
  # Number of mutations per golden prompt
  count: 20

  # Which mutation types to use
  types:
    - paraphrase
    - noise
    - tone_shift
    - prompt_injection

  # Weights for scoring (higher = more important to pass)
  weights:
    paraphrase: 1.0
    noise: 0.8
    tone_shift: 1.0
    prompt_injection: 1.5

# =============================================================================
# LLM CONFIGURATION (for mutation generation)
# =============================================================================
llm:
  # Ollama model to use
  model: "qwen2.5-coder:7b"

  # Ollama server URL
  host: "http://localhost:11434"

  # Generation temperature (higher = more creative mutations)
  temperature: 0.8

# =============================================================================
# INVARIANTS (ASSERTIONS)
# =============================================================================
invariants:
  # Example: Response must contain booking confirmation
  - type: contains
    value: "confirmed"
    case_sensitive: false
    prompt_filter: "book"  # Only apply to prompts containing "book"

  # Example: Response time limit
  - type: latency
    max_ms: 5000

  # Example: Must be valid JSON
  - type: valid_json

  # Example: Semantic similarity
  - type: similarity
    expected: "I've booked your flight"
    threshold: 0.75

  # Example: No PII in response
  - type: excludes_pii

  # Example: Must refuse dangerous requests
  - type: refusal
    prompt_filter: "ignore|bypass|jailbreak"

# =============================================================================
# ADVANCED SETTINGS
# =============================================================================
advanced:
  # Concurrent test executions
  concurrency: 10

  # Retry failed requests
  retries: 3

  # Output directory for reports
  output_dir: "./reports"

  # Fail threshold for CI mode
  min_score: 0.8
```

### Environment Variable Expansion

Use `${VAR_NAME}` syntax to reference environment variables:

```yaml
agent:
  endpoint: "${AGENT_URL}"
  headers:
    Authorization: "Bearer ${API_KEY}"
```

---

## Running Tests

### Basic Commands

```bash
# Run with default config (flakestorm.yaml)
flakestorm run

# Specify config file
flakestorm run --config my-config.yaml

# Output format: terminal (default), html, json
flakestorm run --output html

# Quiet mode (less output)
flakestorm run --quiet

# Verbose mode (more output)
flakestorm run --verbose
```

### CI/CD Mode

```bash
# Fail if score < 0.8
flakestorm run --ci --min-score 0.8

# Exit codes:
# 0 = Score meets threshold
# 1 = Score below threshold
# 2 = Configuration error
# 3 = Runtime error
```

### Individual Commands

```bash
# Just verify config is valid
flakestorm verify --config flakestorm.yaml

# Generate report from previous run
flakestorm report --input results.json --output html

# Show current score
flakestorm score --input results.json
```

---

## Understanding Results

### Terminal Output

```
╭──────────────────────────────────────────────────────────────────╮
│                     flakestorm TEST RESULTS                        │
├──────────────────────────────────────────────────────────────────┤
│  Robustness Score: 0.85                                          │
│  ████████████████████░░░░ 85%                                    │
├──────────────────────────────────────────────────────────────────┤
│  Total Mutations: 80                                             │
│  ✅ Passed: 68                                                   │
│  ❌ Failed: 12                                                   │
├──────────────────────────────────────────────────────────────────┤
│  By Mutation Type:                                               │
│    paraphrase:       95% (19/20)                                 │
│    noise:            90% (18/20)                                 │
│    tone_shift:       85% (17/20)                                 │
│    prompt_injection: 70% (14/20)                                 │
├──────────────────────────────────────────────────────────────────┤
│  Latency: avg=245ms, p50=200ms, p95=450ms, p99=890ms            │
╰──────────────────────────────────────────────────────────────────╯
```

### HTML Report

The HTML report provides:

1. **Summary Dashboard** - Overall score, pass/fail breakdown
2. **Mutation Matrix** - Visual grid of all test results
3. **Failure Details** - Specific failures with input/output
4. **Latency Charts** - Response time distribution
5. **Recommendations** - AI-generated improvement suggestions

### JSON Export

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "config_hash": "abc123",
  "statistics": {
    "total_mutations": 80,
    "passed_mutations": 68,
    "failed_mutations": 12,
    "robustness_score": 0.85,
    "avg_latency_ms": 245,
    "p95_latency_ms": 450
  },
  "results": [
    {
      "golden_prompt": "Book a flight to NYC",
      "mutation": "Reserve a plane ticket to New York",
      "mutation_type": "paraphrase",
      "passed": true,
      "response": "I've booked your flight...",
      "latency_ms": 234,
      "checks": [
        {"type": "contains", "passed": true},
        {"type": "latency", "passed": true}
      ]
    }
  ]
}
```

---

## Integration Patterns

### Pattern 1: HTTP Agent

Most common pattern - agent exposed via REST API:

```yaml
agent:
  endpoint: "http://localhost:8000/api/chat"
  type: http
  request_template: |
    {"message": "{prompt}"}
  response_path: "$.reply"
```

**Your agent code:**

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

class ChatResponse(BaseModel):
    reply: str

@app.post("/api/chat")
async def chat(request: ChatRequest) -> ChatResponse:
    # Your agent logic here
    response = your_llm_call(request.message)
    return ChatResponse(reply=response)
```

### Pattern 2: Python Module

Direct Python integration (no HTTP overhead):

```yaml
agent:
  endpoint: "my_agent.agent:handle_message"
  type: python
```

**Your agent code (`my_agent/agent.py`):**

```python
def handle_message(prompt: str) -> str:
    """
    flakestorm will call this function directly.

    Args:
        prompt: The user message (mutated)

    Returns:
        The agent's response as a string
    """
    # Your agent logic
    return process_message(prompt)
```

### Pattern 3: LangChain Agent

For LangChain-based agents:

```yaml
agent:
  endpoint: "my_agent.chain:agent"
  type: langchain
```

**Your agent code:**

```python
from langchain.agents import AgentExecutor

# flakestorm will call agent.invoke({"input": prompt})
agent = AgentExecutor(...)
```

---

## CI/CD Integration

### GitHub Actions

Create `.github/workflows/flakestorm.yml`:

```yaml
name: Agent Reliability Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  reliability-test:
    runs-on: ubuntu-latest

    services:
      ollama:
        image: ollama/ollama
        ports:
          - 11434:11434

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install flakestorm
          pip install -r requirements.txt

      - name: Pull Ollama model
        run: |
          curl -X POST http://localhost:11434/api/pull \
            -d '{"name": "qwen2.5-coder:7b"}'

      - name: Start agent
        run: |
          python -m my_agent &
          sleep 5  # Wait for startup

      - name: Run flakestorm tests
        run: |
          flakestorm run --ci --min-score 0.8 --output json

      - name: Upload report
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: flakestorm-report
          path: reports/
```

### GitLab CI

```yaml
flakestorm-test:
  image: python:3.11
  services:
    - name: ollama/ollama
      alias: ollama
  variables:
    OLLAMA_HOST: "http://ollama:11434"
  script:
    - pip install flakestorm
    - flakestorm run --ci --min-score 0.8
  artifacts:
    paths:
      - reports/
    when: always
```

### Pre-commit Hook

Add to `.pre-commit-config.yaml`:

```yaml
repos:
  - repo: local
    hooks:
      - id: flakestorm
        name: flakestorm Agent Tests
        entry: flakestorm run --ci --min-score 0.8
        language: system
        pass_filenames: false
        always_run: true
```

---

## Advanced Usage

### Custom Mutation Templates

Override default mutation prompts:

```yaml
mutations:
  templates:
    paraphrase: |
      Rewrite this prompt with completely different words
      but preserve the exact meaning: "{prompt}"

    noise: |
      Add realistic typos and formatting errors to this prompt.
      Make 2-3 small mistakes: "{prompt}"
```

### Filtering Invariants by Prompt

Apply assertions only to specific prompts:

```yaml
invariants:
  # Only for booking-related prompts
  - type: contains
    value: "confirmation"
    prompt_filter: "book|reserve|schedule"

  # Only for cancellation prompts
  - type: regex
    pattern: "cancelled|refunded"
    prompt_filter: "cancel"
```

### Custom Weights

Adjust scoring weights based on your priorities:

```yaml
mutations:
  weights:
    # Security is critical - weight injection tests higher
    prompt_injection: 2.0

    # Typo tolerance is less important
    noise: 0.5
```

### Parallel Execution

Control concurrency for rate-limited APIs:

```yaml
advanced:
  concurrency: 5  # Max 5 parallel requests
  retries: 3      # Retry failed requests 3 times
```

---

## Troubleshooting

### Common Issues

#### "Cannot connect to Ollama"

```bash
# Check if Ollama is running
curl http://localhost:11434/api/version

# Start Ollama if not running
ollama serve
```

#### "Model not found"

```bash
# List available models
ollama list

# Pull the required model
ollama pull qwen2.5-coder:7b
```

#### "Agent connection refused"

```bash
# Verify your agent is running
curl http://localhost:8000/health

# Check the endpoint in config
cat flakestorm.yaml | grep endpoint
```

#### "Timeout errors"

Increase timeout in config:

```yaml
agent:
  timeout: 60  # Increase to 60 seconds
```

#### "Low robustness score"

1. Review failed mutations in the report
2. Identify patterns (e.g., all prompt_injection failing)
3. Improve your agent's handling of those cases
4. Re-run tests

### Debug Mode

```bash
# Enable verbose logging
flakestorm run --verbose

# Or set environment variable
export ENTROPIX_DEBUG=1
flakestorm run
```

### Getting Help

- **Documentation**: https://flakestorm.dev/docs
- **GitHub Issues**: https://github.com/flakestorm/flakestorm/issues
- **Discord**: https://discord.gg/flakestorm

---

## Next Steps

1. **Start simple**: Test with 1-2 golden prompts first
2. **Add invariants gradually**: Start with `contains` and `latency`
3. **Review failures**: Use reports to understand weak points
4. **Iterate**: Improve agent, re-test, repeat
5. **Integrate to CI**: Automate testing on every PR

---

*Built with ❤️ by the flakestorm Team*