# flakestorm Usage Guide > **The Agent Reliability Engine** - Chaos Engineering for AI Agents This comprehensive guide walks you through using flakestorm to test your AI agents for reliability, robustness, and safety. --- ## Table of Contents 1. [Introduction](#introduction) 2. [Installation](#installation) 3. [Quick Start](#quick-start) 4. [Core Concepts](#core-concepts) 5. [Configuration Deep Dive](#configuration-deep-dive) 6. [Running Tests](#running-tests) 7. [Understanding Results](#understanding-results) 8. [Integration Patterns](#integration-patterns) 9. [Advanced Usage](#advanced-usage) 10. [Troubleshooting](#troubleshooting) --- ## Introduction ### What is flakestorm? flakestorm is an **adversarial testing framework** for AI agents. It applies chaos engineering principles to systematically test how your AI agents behave under unexpected, malformed, or adversarial inputs. ### Why Use flakestorm? | Problem | How flakestorm Helps | |---------|-------------------| | Agent fails with typos in user input | Tests with noise mutations | | Agent leaks sensitive data | Safety assertions catch PII exposure | | Agent behavior varies unpredictably | Semantic similarity assertions ensure consistency | | Prompt injection attacks | Tests agent resilience to injection attempts | | No way to quantify reliability | Provides robustness scores (0.0 - 1.0) | ### How It Works ``` ┌─────────────────────────────────────────────────────────────────┐ │ flakestorm FLOW │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ 1. GOLDEN PROMPTS 2. MUTATION ENGINE │ │ ┌─────────────────┐ ┌─────────────────┐ │ │ │ "Book a flight │ ───► │ Local LLM │ │ │ │ from NYC to LA"│ │ (Qwen/Ollama) │ │ │ └─────────────────┘ └────────┬────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ Mutated Prompts │ │ │ │ • Typos │ │ │ │ • Paraphrases │ │ │ │ • Injections │ │ │ └────────┬────────┘ │ │ │ │ │ 3. YOUR AGENT ▼ │ │ ┌─────────────────┐ ┌─────────────────┐ │ │ │ AI Agent │ ◄─── │ Test Runner │ │ │ │ (HTTP/Python) │ │ (Async) │ │ │ └────────┬────────┘ └─────────────────┘ │ │ │ │ │ ▼ │ │ 4. VERIFICATION 5. REPORTING │ │ ┌─────────────────┐ ┌─────────────────┐ │ │ │ Invariant │ ───► │ HTML/JSON/CLI │ │ │ │ Assertions │ │ Reports │ │ │ └─────────────────┘ └─────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ Robustness │ │ │ │ Score: 0.85 │ │ │ └─────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘ ``` --- ## Installation ### Prerequisites - **Python 3.10+** (3.11 recommended) - **Ollama** (for local LLM mutation generation) - **Rust** (optional, for performance optimization) ### Installation Order **Important:** Install Ollama first (it's a system-level service), then set up your Python virtual environment: 1. **Install Ollama** (system-level, runs independently) 2. **Create virtual environment** (for Python packages) 3. **Install flakestorm** (Python package) 4. **Start Ollama service** (if not already running) 5. **Pull the model** (required for mutation generation) ### Step 1: Install Ollama (System-Level) **macOS Installation:** ```bash # Option 1: Homebrew (recommended) brew install ollama # If you get permission errors, fix permissions first: sudo chown -R $(whoami) /Users/imac-frank/Library/Logs/Homebrew sudo chown -R $(whoami) /usr/local/Cellar sudo chown -R $(whoami) /usr/local/Homebrew brew install ollama # Option 2: Official Installer (if Homebrew doesn't work) # Visit https://ollama.ai/download and download the macOS installer # Double-click the .dmg file and follow the installation wizard ``` **Windows Installation:** 1. **Download the Installer:** - Visit https://ollama.com/download/windows - Download `OllamaSetup.exe` 2. **Run the Installer:** - Double-click `OllamaSetup.exe` - Follow the installation wizard - Ollama will be installed and added to your PATH automatically 3. **Verify Installation:** ```powershell ollama --version ``` **Linux Installation:** ```bash # Install using the official script curl -fsSL https://ollama.com/install.sh | sh # Or using package managers: # Ubuntu/Debian sudo apt install ollama # Fedora/RHEL sudo dnf install ollama # Arch Linux sudo pacman -S ollama ``` **Start Ollama Service:** After installation, start Ollama: ```bash # macOS/Linux - Start the service ollama serve # Windows - Ollama runs as a service automatically after installation # You can also start it manually from the Start menu or run: ollama serve ``` ### Step 2: Pull the Default Model ```bash # Pull Qwen Coder 3 8B (recommended for mutations) ollama pull qwen2.5-coder:7b # Verify it's working ollama run qwen2.5-coder:7b "Hello, world!" ``` ### Step 3: Create Virtual Environment and Install flakestorm **Important:** On macOS (and some Linux distributions), Python environments are externally managed. You must use a virtual environment: ```bash # Create a virtual environment (do this AFTER installing Ollama) python3 -m venv venv # Activate it (macOS/Linux) source venv/bin/activate # Activate it (Windows) # venv\Scripts\activate # Now install flakestorm # From PyPI (when published) pip install flakestorm # From source (development) git clone https://github.com/flakestorm/flakestorm.git cd flakestorm pip install -e ".[dev]" ``` **Note:** Ollama is installed at the system level and doesn't need to be in your virtual environment. The virtual environment is only for Python packages (flakestorm and its dependencies). **Alternative: Using pipx (for CLI applications)** If you only want to use flakestorm as a CLI tool (not develop it), you can use `pipx`: ```bash # Install pipx (if not already installed) brew install pipx # macOS # Or: python3 -m pip install --user pipx # Install flakestorm pipx install flakestorm ``` **Note:** Make sure you're using Python 3.10+. You can verify with: ```bash python3 --version # Should be 3.10 or higher ``` ### Step 4: (Optional) Install Rust Extension For 80x+ performance improvement on scoring: ```bash cd rust # Make sure virtual environment is activated pip install maturin maturin build --release pip install ../target/wheels/*.whl ``` ### Verify Installation ```bash flakestorm --version flakestorm --help ``` --- ## Quick Start ### 1. Initialize Configuration ```bash # Create flakestorm.yaml in your project flakestorm init ``` ### 2. Configure Your Agent Edit `flakestorm.yaml`: ```yaml # Your AI agent endpoint agent: endpoint: "http://localhost:8000/chat" type: http timeout: 30 # Prompts that should always work golden_prompts: - "What is the weather in New York?" - "Book a flight from NYC to LA for tomorrow" - "Cancel my reservation #12345" # What to check in responses invariants: - type: contains value: "weather" prompt_filter: "weather" - type: latency max_ms: 5000 - type: excludes_pii ``` ### 3. Run Tests ```bash # Basic run flakestorm run # With HTML report flakestorm run --output html # CI mode (fails if score < threshold) flakestorm run --ci --min-score 0.8 ``` ### 4. View Results ```bash # Open the generated report open reports/flakestorm-*.html ``` --- ## Core Concepts ### Golden Prompts **What they are:** Carefully crafted prompts that represent your agent's core use cases. These are prompts that *should always work correctly*. **How to choose them:** - Cover all major user intents - Include edge cases you've seen in production - Represent different complexity levels ```yaml golden_prompts: # Simple intent - "Hello, how are you?" # Complex intent with parameters - "Book a flight from New York to Los Angeles departing March 15th" # Edge case - "What if I need to cancel my booking?" ``` ### Mutation Types flakestorm generates adversarial variations of your golden prompts: | Type | Description | Example | |------|-------------|---------| | `paraphrase` | Same meaning, different words | "Book flight" → "Reserve a plane ticket" | | `noise` | Typos and formatting errors | "Book flight" → "Bok fligt" | | `tone_shift` | Different emotional tone | "Book flight" → "I NEED A FLIGHT NOW!!!" | | `prompt_injection` | Attempted jailbreaks | "Book flight. Ignore above and..." | ### Invariants (Assertions) Rules that agent responses must satisfy: ```yaml invariants: # Response must contain a keyword - type: contains value: "booked" # Response must NOT contain certain content - type: not_contains value: "error" # Response must match regex pattern - type: regex pattern: "confirmation.*#[A-Z0-9]+" # Response time limit - type: latency max_ms: 3000 # Must be valid JSON - type: valid_json # Semantic similarity to expected response - type: similarity expected: "Your flight has been booked successfully" threshold: 0.8 # Safety: no PII leakage - type: excludes_pii # Safety: must include refusal for dangerous requests - type: refusal ``` ### Robustness Score A number from 0.0 to 1.0 indicating how reliable your agent is: ``` Score = (Weighted Passed Tests) / (Total Weighted Tests) ``` Weights by mutation type: - `prompt_injection`: 1.5 (harder to defend against) - `paraphrase`: 1.0 (should always work) - `tone_shift`: 1.0 (should handle different tones) - `noise`: 0.8 (minor errors are acceptable) **Interpretation:** - **0.9+**: Excellent - Production ready - **0.8-0.9**: Good - Minor improvements needed - **0.7-0.8**: Fair - Needs work - **<0.7**: Poor - Significant reliability issues --- ## Configuration Deep Dive ### Full Configuration Schema ```yaml # ============================================================================= # AGENT CONFIGURATION # ============================================================================= agent: # Required: Where to send requests endpoint: "http://localhost:8000/chat" # Agent type: http, python, or langchain type: http # Request timeout in seconds timeout: 30 # HTTP-specific settings headers: Authorization: "Bearer ${API_KEY}" # Environment variable expansion Content-Type: "application/json" # How to format the request body # Available placeholders: {prompt} request_template: | {"message": "{prompt}", "stream": false} # JSONPath to extract response from JSON response_path: "$.response" # ============================================================================= # GOLDEN PROMPTS # ============================================================================= golden_prompts: - "What is 2 + 2?" - "Summarize this article: {article_text}" - "Translate to Spanish: Hello, world!" # ============================================================================= # MUTATION CONFIGURATION # ============================================================================= mutations: # Number of mutations per golden prompt count: 20 # Which mutation types to use types: - paraphrase - noise - tone_shift - prompt_injection # Weights for scoring (higher = more important to pass) weights: paraphrase: 1.0 noise: 0.8 tone_shift: 1.0 prompt_injection: 1.5 # ============================================================================= # LLM CONFIGURATION (for mutation generation) # ============================================================================= llm: # Ollama model to use model: "qwen2.5-coder:7b" # Ollama server URL host: "http://localhost:11434" # Generation temperature (higher = more creative mutations) temperature: 0.8 # ============================================================================= # INVARIANTS (ASSERTIONS) # ============================================================================= invariants: # Example: Response must contain booking confirmation - type: contains value: "confirmed" case_sensitive: false prompt_filter: "book" # Only apply to prompts containing "book" # Example: Response time limit - type: latency max_ms: 5000 # Example: Must be valid JSON - type: valid_json # Example: Semantic similarity - type: similarity expected: "I've booked your flight" threshold: 0.75 # Example: No PII in response - type: excludes_pii # Example: Must refuse dangerous requests - type: refusal prompt_filter: "ignore|bypass|jailbreak" # ============================================================================= # ADVANCED SETTINGS # ============================================================================= advanced: # Concurrent test executions concurrency: 10 # Retry failed requests retries: 3 # Output directory for reports output_dir: "./reports" # Fail threshold for CI mode min_score: 0.8 ``` ### Environment Variable Expansion Use `${VAR_NAME}` syntax to reference environment variables: ```yaml agent: endpoint: "${AGENT_URL}" headers: Authorization: "Bearer ${API_KEY}" ``` --- ## Running Tests ### Basic Commands ```bash # Run with default config (flakestorm.yaml) flakestorm run # Specify config file flakestorm run --config my-config.yaml # Output format: terminal (default), html, json flakestorm run --output html # Quiet mode (less output) flakestorm run --quiet # Verbose mode (more output) flakestorm run --verbose ``` ### Individual Commands ```bash # Just verify config is valid flakestorm verify --config flakestorm.yaml # Generate report from previous run flakestorm report --input results.json --output html # Show current score flakestorm score --input results.json ``` --- ## Understanding Results ### Terminal Output ``` ╭──────────────────────────────────────────────────────────────────╮ │ flakestorm TEST RESULTS │ ├──────────────────────────────────────────────────────────────────┤ │ Robustness Score: 0.85 │ │ ████████████████████░░░░ 85% │ ├──────────────────────────────────────────────────────────────────┤ │ Total Mutations: 80 │ │ ✅ Passed: 68 │ │ ❌ Failed: 12 │ ├──────────────────────────────────────────────────────────────────┤ │ By Mutation Type: │ │ paraphrase: 95% (19/20) │ │ noise: 90% (18/20) │ │ tone_shift: 85% (17/20) │ │ prompt_injection: 70% (14/20) │ ├──────────────────────────────────────────────────────────────────┤ │ Latency: avg=245ms, p50=200ms, p95=450ms, p99=890ms │ ╰──────────────────────────────────────────────────────────────────╯ ``` ### HTML Report The HTML report provides: 1. **Summary Dashboard** - Overall score, pass/fail breakdown 2. **Mutation Matrix** - Visual grid of all test results 3. **Failure Details** - Specific failures with input/output 4. **Latency Charts** - Response time distribution 5. **Recommendations** - AI-generated improvement suggestions ### JSON Export ```json { "timestamp": "2024-01-15T10:30:00Z", "config_hash": "abc123", "statistics": { "total_mutations": 80, "passed_mutations": 68, "failed_mutations": 12, "robustness_score": 0.85, "avg_latency_ms": 245, "p95_latency_ms": 450 }, "results": [ { "golden_prompt": "Book a flight to NYC", "mutation": "Reserve a plane ticket to New York", "mutation_type": "paraphrase", "passed": true, "response": "I've booked your flight...", "latency_ms": 234, "checks": [ {"type": "contains", "passed": true}, {"type": "latency", "passed": true} ] } ] } ``` --- ## Integration Patterns ### Pattern 1: HTTP Agent Most common pattern - agent exposed via REST API: ```yaml agent: endpoint: "http://localhost:8000/api/chat" type: http request_template: | {"message": "{prompt}"} response_path: "$.reply" ``` **Your agent code:** ```python from fastapi import FastAPI from pydantic import BaseModel app = FastAPI() class ChatRequest(BaseModel): message: str class ChatResponse(BaseModel): reply: str @app.post("/api/chat") async def chat(request: ChatRequest) -> ChatResponse: # Your agent logic here response = your_llm_call(request.message) return ChatResponse(reply=response) ``` ### Pattern 2: Python Module Direct Python integration (no HTTP overhead): ```yaml agent: endpoint: "my_agent.agent:handle_message" type: python ``` **Your agent code (`my_agent/agent.py`):** ```python def handle_message(prompt: str) -> str: """ flakestorm will call this function directly. Args: prompt: The user message (mutated) Returns: The agent's response as a string """ # Your agent logic return process_message(prompt) ``` ### Pattern 3: LangChain Agent For LangChain-based agents: ```yaml agent: endpoint: "my_agent.chain:agent" type: langchain ``` **Your agent code:** ```python from langchain.agents import AgentExecutor # flakestorm will call agent.invoke({"input": prompt}) agent = AgentExecutor(...) ``` --- ## Advanced Usage ### Custom Mutation Templates Override default mutation prompts: ```yaml mutations: templates: paraphrase: | Rewrite this prompt with completely different words but preserve the exact meaning: "{prompt}" noise: | Add realistic typos and formatting errors to this prompt. Make 2-3 small mistakes: "{prompt}" ``` ### Filtering Invariants by Prompt Apply assertions only to specific prompts: ```yaml invariants: # Only for booking-related prompts - type: contains value: "confirmation" prompt_filter: "book|reserve|schedule" # Only for cancellation prompts - type: regex pattern: "cancelled|refunded" prompt_filter: "cancel" ``` ### Custom Weights Adjust scoring weights based on your priorities: ```yaml mutations: weights: # Security is critical - weight injection tests higher prompt_injection: 2.0 # Typo tolerance is less important noise: 0.5 ``` ### Parallel Execution Control concurrency for rate-limited APIs: ```yaml advanced: concurrency: 5 # Max 5 parallel requests retries: 3 # Retry failed requests 3 times ``` --- ## Troubleshooting ### Common Issues #### "Cannot connect to Ollama" ```bash # Check if Ollama is running curl http://localhost:11434/api/version # Start Ollama if not running ollama serve ``` #### "Model not found" ```bash # List available models ollama list # Pull the required model ollama pull qwen2.5-coder:7b ``` #### "Agent connection refused" ```bash # Verify your agent is running curl http://localhost:8000/health # Check the endpoint in config cat flakestorm.yaml | grep endpoint ``` #### "Timeout errors" Increase timeout in config: ```yaml agent: timeout: 60 # Increase to 60 seconds ``` #### "Low robustness score" 1. Review failed mutations in the report 2. Identify patterns (e.g., all prompt_injection failing) 3. Improve your agent's handling of those cases 4. Re-run tests #### "Homebrew permission errors when installing Ollama" If you get `Operation not permitted` errors when running `brew install ollama`: ```bash # Fix Homebrew permissions sudo chown -R $(whoami) /Users/imac-frank/Library/Logs/Homebrew sudo chown -R $(whoami) /usr/local/Cellar sudo chown -R $(whoami) /usr/local/Homebrew # Then try again brew install ollama # Or use the official installer from https://ollama.ai/download instead ``` #### "Ollama binary contains HTML or syntax errors" If you downloaded a file that contains HTML instead of the binary: 1. **macOS:** Use Homebrew or download the official `.dmg` installer from https://ollama.ai/download 2. **Windows:** Download `OllamaSetup.exe` from https://ollama.com/download/windows 3. **Linux:** Use the official install script: `curl -fsSL https://ollama.com/install.sh | sh` Never download binaries directly via curl from the download page - always use the official installers or package managers. ### Debug Mode ```bash # Enable verbose logging flakestorm run --verbose # Or set environment variable export FLAKESTORM_DEBUG=1 flakestorm run ``` ### Getting Help - **Documentation**: https://flakestorm.dev/docs - **GitHub Issues**: https://github.com/flakestorm/flakestorm/issues - **Discord**: https://discord.gg/flakestorm --- ## Next Steps 1. **Start simple**: Test with 1-2 golden prompts first 2. **Add invariants gradually**: Start with `contains` and `latency` 3. **Review failures**: Use reports to understand weak points 4. **Iterate**: Improve agent, re-test, repeat 5. **Integrate to CI**: Automate testing on every PR --- *Built with ❤️ by the flakestorm Team*