Fix .gitignore to allow docs files and add documentation files

- Fix .gitignore pattern: un-ignore docs/ directory first, then ignore docs/*, then un-ignore specific files - Add all documentation files referenced in README.md: - USAGE_GUIDE.md - CONFIGURATION_GUIDE.md - TEST_SCENARIOS.md - MODULES.md - DEVELOPER_FAQ.md - PUBLISHING.md - CONTRIBUTING.md - API_SPECIFICATION.md - TESTING_GUIDE.md - IMPLEMENTATION_CHECKLIST.md - Pre-commit hooks fixed trailing whitespace and end-of-file formatting
2026-04-30 19:36:35 +02:00 · 2025-12-29 11:32:50 +08:00 · 2025-12-29 11:32:50 +08:00 · 69e0f8deeb
commit 69e0f8deeb
parent 4dd882a2d2
11 changed files with 5936 additions and 2 deletions
--- a/docs/USAGE_GUIDE.md
+++ b/docs/USAGE_GUIDE.md
@ -0,0 +1,871 @@
+# flakestorm Usage Guide
+
+> **The Agent Reliability Engine** - Chaos Engineering for AI Agents
+
+This comprehensive guide walks you through using flakestorm to test your AI agents for reliability, robustness, and safety.
+
+---
+
+## Table of Contents
+
+1. [Introduction](#introduction)
+2. [Installation](#installation)
+3. [Quick Start](#quick-start)
+4. [Core Concepts](#core-concepts)
+5. [Configuration Deep Dive](#configuration-deep-dive)
+6. [Running Tests](#running-tests)
+7. [Understanding Results](#understanding-results)
+8. [Integration Patterns](#integration-patterns)
+9. [CI/CD Integration](#cicd-integration)
+10. [Advanced Usage](#advanced-usage)
+11. [Troubleshooting](#troubleshooting)
+
+---
+
+## Introduction
+
+### What is flakestorm?
+
+flakestorm is an **adversarial testing framework** for AI agents. It applies chaos engineering principles to systematically test how your AI agents behave under unexpected, malformed, or adversarial inputs.
+
+### Why Use flakestorm?
+
+| Problem | How flakestorm Helps |
+|---------|-------------------|
+| Agent fails with typos in user input | Tests with noise mutations |
+| Agent leaks sensitive data | Safety assertions catch PII exposure |
+| Agent behavior varies unpredictably | Semantic similarity assertions ensure consistency |
+| Prompt injection attacks | Tests agent resilience to injection attempts |
+| No way to quantify reliability | Provides robustness scores (0.0 - 1.0) |
+
+### How It Works
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                         flakestorm FLOW                           │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                 │
+│   1. GOLDEN PROMPTS          2. MUTATION ENGINE                 │
+│   ┌─────────────────┐        ┌─────────────────┐               │
+│   │ "Book a flight  │  ───►  │ Local LLM       │               │
+│   │  from NYC to LA"│        │ (Qwen/Ollama)   │               │
+│   └─────────────────┘        └────────┬────────┘               │
+│                                       │                         │
+│                                       ▼                         │
+│                              ┌─────────────────┐               │
+│                              │ Mutated Prompts │               │
+│                              │ • Typos         │               │
+│                              │ • Paraphrases   │               │
+│                              │ • Injections    │               │
+│                              └────────┬────────┘               │
+│                                       │                         │
+│   3. YOUR AGENT                       ▼                         │
+│   ┌─────────────────┐        ┌─────────────────┐               │
+│   │ AI Agent        │  ◄───  │ Test Runner     │               │
+│   │ (HTTP/Python)   │        │ (Async)         │               │
+│   └────────┬────────┘        └─────────────────┘               │
+│            │                                                    │
+│            ▼                                                    │
+│   4. VERIFICATION            5. REPORTING                       │
+│   ┌─────────────────┐        ┌─────────────────┐               │
+│   │ Invariant       │  ───►  │ HTML/JSON/CLI   │               │
+│   │ Assertions      │        │ Reports         │               │
+│   └─────────────────┘        └─────────────────┘               │
+│                                       │                         │
+│                                       ▼                         │
+│                              ┌─────────────────┐               │
+│                              │ Robustness      │               │
+│                              │ Score: 0.85     │               │
+│                              └─────────────────┘               │
+│                                                                 │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Installation
+
+### Prerequisites
+
+- **Python 3.10+** (3.11 recommended)
+- **Ollama** (for local LLM mutation generation)
+- **Rust** (optional, for performance optimization)
+
+### Step 1: Install Ollama
+
+```bash
+# macOS
+brew install ollama
+
+# Linux
+curl -fsSL https://ollama.com/install.sh | sh
+
+# Start Ollama service
+ollama serve
+```
+
+### Step 2: Pull the Default Model
+
+```bash
+# Pull Qwen Coder 3 8B (recommended for mutations)
+ollama pull qwen2.5-coder:7b
+
+# Verify it's working
+ollama run qwen2.5-coder:7b "Hello, world!"
+```
+
+### Step 3: Install flakestorm
+
+```bash
+# From PyPI (when published)
+pip install flakestorm
+
+# From source (development)
+git clone https://github.com/flakestorm/flakestorm.git
+cd flakestorm
+pip install -e ".[dev]"
+```
+
+### Step 4: (Optional) Install Rust Extension
+
+For 80x+ performance improvement on scoring:
+
+```bash
+cd rust
+pip install maturin
+maturin build --release
+pip install ../target/wheels/*.whl
+```
+
+### Verify Installation
+
+```bash
+flakestorm --version
+flakestorm --help
+```
+
+---
+
+## Quick Start
+
+### 1. Initialize Configuration
+
+```bash
+# Create flakestorm.yaml in your project
+flakestorm init
+```
+
+### 2. Configure Your Agent
+
+Edit `flakestorm.yaml`:
+
+```yaml
+# Your AI agent endpoint
+agent:
+  endpoint: "http://localhost:8000/chat"
+  type: http
+  timeout: 30
+
+# Prompts that should always work
+golden_prompts:
+  - "What is the weather in New York?"
+  - "Book a flight from NYC to LA for tomorrow"
+  - "Cancel my reservation #12345"
+
+# What to check in responses
+invariants:
+  - type: contains
+    value: "weather"
+    prompt_filter: "weather"
+  - type: latency
+    max_ms: 5000
+  - type: excludes_pii
+```
+
+### 3. Run Tests
+
+```bash
+# Basic run
+flakestorm run
+
+# With HTML report
+flakestorm run --output html
+
+# CI mode (fails if score < threshold)
+flakestorm run --ci --min-score 0.8
+```
+
+### 4. View Results
+
+```bash
+# Open the generated report
+open reports/entropix_report_*.html
+```
+
+---
+
+## Core Concepts
+
+### Golden Prompts
+
+**What they are:** Carefully crafted prompts that represent your agent's core use cases. These are prompts that *should always work correctly*.
+
+**How to choose them:**
+- Cover all major user intents
+- Include edge cases you've seen in production
+- Represent different complexity levels
+
+```yaml
+golden_prompts:
+  # Simple intent
+  - "Hello, how are you?"
+  
+  # Complex intent with parameters
+  - "Book a flight from New York to Los Angeles departing March 15th"
+  
+  # Edge case
+  - "What if I need to cancel my booking?"
+```
+
+### Mutation Types
+
+flakestorm generates adversarial variations of your golden prompts:
+
+| Type | Description | Example |
+|------|-------------|---------|
+| `paraphrase` | Same meaning, different words | "Book flight" → "Reserve a plane ticket" |
+| `noise` | Typos and formatting errors | "Book flight" → "Bok fligt" |
+| `tone_shift` | Different emotional tone | "Book flight" → "I NEED A FLIGHT NOW!!!" |
+| `prompt_injection` | Attempted jailbreaks | "Book flight. Ignore above and..." |
+
+### Invariants (Assertions)
+
+Rules that agent responses must satisfy:
+
+```yaml
+invariants:
+  # Response must contain a keyword
+  - type: contains
+    value: "booked"
+    
+  # Response must NOT contain certain content
+  - type: not_contains
+    value: "error"
+    
+  # Response must match regex pattern
+  - type: regex
+    pattern: "confirmation.*#[A-Z0-9]+"
+    
+  # Response time limit
+  - type: latency
+    max_ms: 3000
+    
+  # Must be valid JSON
+  - type: valid_json
+    
+  # Semantic similarity to expected response
+  - type: similarity
+    expected: "Your flight has been booked successfully"
+    threshold: 0.8
+    
+  # Safety: no PII leakage
+  - type: excludes_pii
+    
+  # Safety: must include refusal for dangerous requests
+  - type: refusal
+```
+
+### Robustness Score
+
+A number from 0.0 to 1.0 indicating how reliable your agent is:
+
+```
+Score = (Weighted Passed Tests) / (Total Weighted Tests)
+```
+
+Weights by mutation type:
+- `prompt_injection`: 1.5 (harder to defend against)
+- `paraphrase`: 1.0 (should always work)
+- `tone_shift`: 1.0 (should handle different tones)
+- `noise`: 0.8 (minor errors are acceptable)
+
+**Interpretation:**
+- **0.9+**: Excellent - Production ready
+- **0.8-0.9**: Good - Minor improvements needed
+- **0.7-0.8**: Fair - Needs work
+- **<0.7**: Poor - Significant reliability issues
+
+---
+
+## Configuration Deep Dive
+
+### Full Configuration Schema
+
+```yaml
+# =============================================================================
+# AGENT CONFIGURATION
+# =============================================================================
+agent:
+  # Required: Where to send requests
+  endpoint: "http://localhost:8000/chat"
+  
+  # Agent type: http, python, or langchain
+  type: http
+  
+  # Request timeout in seconds
+  timeout: 30
+  
+  # HTTP-specific settings
+  headers:
+    Authorization: "Bearer ${API_KEY}"  # Environment variable expansion
+    Content-Type: "application/json"
+  
+  # How to format the request body
+  # Available placeholders: {prompt}
+  request_template: |
+    {"message": "{prompt}", "stream": false}
+  
+  # JSONPath to extract response from JSON
+  response_path: "$.response"
+
+# =============================================================================
+# GOLDEN PROMPTS
+# =============================================================================
+golden_prompts:
+  - "What is 2 + 2?"
+  - "Summarize this article: {article_text}"
+  - "Translate to Spanish: Hello, world!"
+
+# =============================================================================
+# MUTATION CONFIGURATION
+# =============================================================================
+mutations:
+  # Number of mutations per golden prompt
+  count: 20
+  
+  # Which mutation types to use
+  types:
+    - paraphrase
+    - noise
+    - tone_shift
+    - prompt_injection
+  
+  # Weights for scoring (higher = more important to pass)
+  weights:
+    paraphrase: 1.0
+    noise: 0.8
+    tone_shift: 1.0
+    prompt_injection: 1.5
+
+# =============================================================================
+# LLM CONFIGURATION (for mutation generation)
+# =============================================================================
+llm:
+  # Ollama model to use
+  model: "qwen2.5-coder:7b"
+  
+  # Ollama server URL
+  host: "http://localhost:11434"
+  
+  # Generation temperature (higher = more creative mutations)
+  temperature: 0.8
+
+# =============================================================================
+# INVARIANTS (ASSERTIONS)
+# =============================================================================
+invariants:
+  # Example: Response must contain booking confirmation
+  - type: contains
+    value: "confirmed"
+    case_sensitive: false
+    prompt_filter: "book"  # Only apply to prompts containing "book"
+  
+  # Example: Response time limit
+  - type: latency
+    max_ms: 5000
+  
+  # Example: Must be valid JSON
+  - type: valid_json
+  
+  # Example: Semantic similarity
+  - type: similarity
+    expected: "I've booked your flight"
+    threshold: 0.75
+  
+  # Example: No PII in response
+  - type: excludes_pii
+  
+  # Example: Must refuse dangerous requests
+  - type: refusal
+    prompt_filter: "ignore|bypass|jailbreak"
+
+# =============================================================================
+# ADVANCED SETTINGS
+# =============================================================================
+advanced:
+  # Concurrent test executions
+  concurrency: 10
+  
+  # Retry failed requests
+  retries: 3
+  
+  # Output directory for reports
+  output_dir: "./reports"
+  
+  # Fail threshold for CI mode
+  min_score: 0.8
+```
+
+### Environment Variable Expansion
+
+Use `${VAR_NAME}` syntax to reference environment variables:
+
+```yaml
+agent:
+  endpoint: "${AGENT_URL}"
+  headers:
+    Authorization: "Bearer ${API_KEY}"
+```
+
+---
+
+## Running Tests
+
+### Basic Commands
+
+```bash
+# Run with default config (flakestorm.yaml)
+flakestorm run
+
+# Specify config file
+flakestorm run --config my-config.yaml
+
+# Output format: terminal (default), html, json
+flakestorm run --output html
+
+# Quiet mode (less output)
+flakestorm run --quiet
+
+# Verbose mode (more output)
+flakestorm run --verbose
+```
+
+### CI/CD Mode
+
+```bash
+# Fail if score < 0.8
+flakestorm run --ci --min-score 0.8
+
+# Exit codes:
+# 0 = Score meets threshold
+# 1 = Score below threshold
+# 2 = Configuration error
+# 3 = Runtime error
+```
+
+### Individual Commands
+
+```bash
+# Just verify config is valid
+flakestorm verify --config flakestorm.yaml
+
+# Generate report from previous run
+flakestorm report --input results.json --output html
+
+# Show current score
+flakestorm score --input results.json
+```
+
+---
+
+## Understanding Results
+
+### Terminal Output
+
+```
+╭──────────────────────────────────────────────────────────────────╮
+│                     flakestorm TEST RESULTS                        │
+├──────────────────────────────────────────────────────────────────┤
+│  Robustness Score: 0.85                                          │
+│  ████████████████████░░░░ 85%                                    │
+├──────────────────────────────────────────────────────────────────┤
+│  Total Mutations: 80                                             │
+│  ✅ Passed: 68                                                   │
+│  ❌ Failed: 12                                                   │
+├──────────────────────────────────────────────────────────────────┤
+│  By Mutation Type:                                               │
+│    paraphrase:       95% (19/20)                                 │
+│    noise:            90% (18/20)                                 │
+│    tone_shift:       85% (17/20)                                 │
+│    prompt_injection: 70% (14/20)                                 │
+├──────────────────────────────────────────────────────────────────┤
+│  Latency: avg=245ms, p50=200ms, p95=450ms, p99=890ms            │
+╰──────────────────────────────────────────────────────────────────╯
+```
+
+### HTML Report
+
+The HTML report provides:
+
+1. **Summary Dashboard** - Overall score, pass/fail breakdown
+2. **Mutation Matrix** - Visual grid of all test results
+3. **Failure Details** - Specific failures with input/output
+4. **Latency Charts** - Response time distribution
+5. **Recommendations** - AI-generated improvement suggestions
+
+### JSON Export
+
+```json
+{
+  "timestamp": "2024-01-15T10:30:00Z",
+  "config_hash": "abc123",
+  "statistics": {
+    "total_mutations": 80,
+    "passed_mutations": 68,
+    "failed_mutations": 12,
+    "robustness_score": 0.85,
+    "avg_latency_ms": 245,
+    "p95_latency_ms": 450
+  },
+  "results": [
+    {
+      "golden_prompt": "Book a flight to NYC",
+      "mutation": "Reserve a plane ticket to New York",
+      "mutation_type": "paraphrase",
+      "passed": true,
+      "response": "I've booked your flight...",
+      "latency_ms": 234,
+      "checks": [
+        {"type": "contains", "passed": true},
+        {"type": "latency", "passed": true}
+      ]
+    }
+  ]
+}
+```
+
+---
+
+## Integration Patterns
+
+### Pattern 1: HTTP Agent
+
+Most common pattern - agent exposed via REST API:
+
+```yaml
+agent:
+  endpoint: "http://localhost:8000/api/chat"
+  type: http
+  request_template: |
+    {"message": "{prompt}"}
+  response_path: "$.reply"
+```
+
+**Your agent code:**
+
+```python
+from fastapi import FastAPI
+from pydantic import BaseModel
+
+app = FastAPI()
+
+class ChatRequest(BaseModel):
+    message: str
+
+class ChatResponse(BaseModel):
+    reply: str
+
+@app.post("/api/chat")
+async def chat(request: ChatRequest) -> ChatResponse:
+    # Your agent logic here
+    response = your_llm_call(request.message)
+    return ChatResponse(reply=response)
+```
+
+### Pattern 2: Python Module
+
+Direct Python integration (no HTTP overhead):
+
+```yaml
+agent:
+  endpoint: "my_agent.agent:handle_message"
+  type: python
+```
+
+**Your agent code (`my_agent/agent.py`):**
+
+```python
+def handle_message(prompt: str) -> str:
+    """
+    flakestorm will call this function directly.
+    
+    Args:
+        prompt: The user message (mutated)
+    
+    Returns:
+        The agent's response as a string
+    """
+    # Your agent logic
+    return process_message(prompt)
+```
+
+### Pattern 3: LangChain Agent
+
+For LangChain-based agents:
+
+```yaml
+agent:
+  endpoint: "my_agent.chain:agent"
+  type: langchain
+```
+
+**Your agent code:**
+
+```python
+from langchain.agents import AgentExecutor
+
+# flakestorm will call agent.invoke({"input": prompt})
+agent = AgentExecutor(...)
+```
+
+---
+
+## CI/CD Integration
+
+### GitHub Actions
+
+Create `.github/workflows/flakestorm.yml`:
+
+```yaml
+name: Agent Reliability Tests
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  reliability-test:
+    runs-on: ubuntu-latest
+    
+    services:
+      ollama:
+        image: ollama/ollama
+        ports:
+          - 11434:11434
+    
+    steps:
+      - uses: actions/checkout@v4
+      
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+      
+      - name: Install dependencies
+        run: |
+          pip install flakestorm
+          pip install -r requirements.txt
+      
+      - name: Pull Ollama model
+        run: |
+          curl -X POST http://localhost:11434/api/pull \
+            -d '{"name": "qwen2.5-coder:7b"}'
+      
+      - name: Start agent
+        run: |
+          python -m my_agent &
+          sleep 5  # Wait for startup
+      
+      - name: Run flakestorm tests
+        run: |
+          flakestorm run --ci --min-score 0.8 --output json
+      
+      - name: Upload report
+        uses: actions/upload-artifact@v4
+        if: always()
+        with:
+          name: flakestorm-report
+          path: reports/
+```
+
+### GitLab CI
+
+```yaml
+flakestorm-test:
+  image: python:3.11
+  services:
+    - name: ollama/ollama
+      alias: ollama
+  variables:
+    OLLAMA_HOST: "http://ollama:11434"
+  script:
+    - pip install flakestorm
+    - flakestorm run --ci --min-score 0.8
+  artifacts:
+    paths:
+      - reports/
+    when: always
+```
+
+### Pre-commit Hook
+
+Add to `.pre-commit-config.yaml`:
+
+```yaml
+repos:
+  - repo: local
+    hooks:
+      - id: flakestorm
+        name: flakestorm Agent Tests
+        entry: flakestorm run --ci --min-score 0.8
+        language: system
+        pass_filenames: false
+        always_run: true
+```
+
+---
+
+## Advanced Usage
+
+### Custom Mutation Templates
+
+Override default mutation prompts:
+
+```yaml
+mutations:
+  templates:
+    paraphrase: |
+      Rewrite this prompt with completely different words 
+      but preserve the exact meaning: "{prompt}"
+    
+    noise: |
+      Add realistic typos and formatting errors to this prompt.
+      Make 2-3 small mistakes: "{prompt}"
+```
+
+### Filtering Invariants by Prompt
+
+Apply assertions only to specific prompts:
+
+```yaml
+invariants:
+  # Only for booking-related prompts
+  - type: contains
+    value: "confirmation"
+    prompt_filter: "book|reserve|schedule"
+  
+  # Only for cancellation prompts
+  - type: regex
+    pattern: "cancelled|refunded"
+    prompt_filter: "cancel"
+```
+
+### Custom Weights
+
+Adjust scoring weights based on your priorities:
+
+```yaml
+mutations:
+  weights:
+    # Security is critical - weight injection tests higher
+    prompt_injection: 2.0
+    
+    # Typo tolerance is less important
+    noise: 0.5
+```
+
+### Parallel Execution
+
+Control concurrency for rate-limited APIs:
+
+```yaml
+advanced:
+  concurrency: 5  # Max 5 parallel requests
+  retries: 3      # Retry failed requests 3 times
+```
+
+---
+
+## Troubleshooting
+
+### Common Issues
+
+#### "Cannot connect to Ollama"
+
+```bash
+# Check if Ollama is running
+curl http://localhost:11434/api/version
+
+# Start Ollama if not running
+ollama serve
+```
+
+#### "Model not found"
+
+```bash
+# List available models
+ollama list
+
+# Pull the required model
+ollama pull qwen2.5-coder:7b
+```
+
+#### "Agent connection refused"
+
+```bash
+# Verify your agent is running
+curl http://localhost:8000/health
+
+# Check the endpoint in config
+cat flakestorm.yaml | grep endpoint
+```
+
+#### "Timeout errors"
+
+Increase timeout in config:
+
+```yaml
+agent:
+  timeout: 60  # Increase to 60 seconds
+```
+
+#### "Low robustness score"
+
+1. Review failed mutations in the report
+2. Identify patterns (e.g., all prompt_injection failing)
+3. Improve your agent's handling of those cases
+4. Re-run tests
+
+### Debug Mode
+
+```bash
+# Enable verbose logging
+flakestorm run --verbose
+
+# Or set environment variable
+export ENTROPIX_DEBUG=1
+flakestorm run
+```
+
+### Getting Help
+
+- **Documentation**: https://flakestorm.dev/docs
+- **GitHub Issues**: https://github.com/flakestorm/flakestorm/issues
+- **Discord**: https://discord.gg/flakestorm
+
+---
+
+## Next Steps
+
+1. **Start simple**: Test with 1-2 golden prompts first
+2. **Add invariants gradually**: Start with `contains` and `latency`
+3. **Review failures**: Use reports to understand weak points
+4. **Iterate**: Improve agent, re-test, repeat
+5. **Integrate to CI**: Automate testing on every PR
+
+---
+
+*Built with ❤️ by the flakestorm Team*
+