mirror of https://github.com/flakestorm/flakestorm.git (synced 2026-04-30 19:36:35 +02:00)
Fix .gitignore to allow docs files and add documentation files
- Fix .gitignore pattern: un-ignore docs/ directory first, then ignore docs/*, then un-ignore specific files
- Add all documentation files referenced in README.md:
  - USAGE_GUIDE.md
  - CONFIGURATION_GUIDE.md
  - TEST_SCENARIOS.md
  - MODULES.md
  - DEVELOPER_FAQ.md
  - PUBLISHING.md
  - CONTRIBUTING.md
  - API_SPECIFICATION.md
  - TESTING_GUIDE.md
  - IMPLEMENTATION_CHECKLIST.md
- Pre-commit hooks fixed trailing whitespace and end-of-file formatting
parent 4dd882a2d2
commit 69e0f8deeb
11 changed files with 5936 additions and 2 deletions
docs/USAGE_GUIDE.md (new file, 871 lines)
@@ -0,0 +1,871 @@
# flakestorm Usage Guide

> **The Agent Reliability Engine** - Chaos Engineering for AI Agents

This comprehensive guide walks you through using flakestorm to test your AI agents for reliability, robustness, and safety.

---

## Table of Contents

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Quick Start](#quick-start)
4. [Core Concepts](#core-concepts)
5. [Configuration Deep Dive](#configuration-deep-dive)
6. [Running Tests](#running-tests)
7. [Understanding Results](#understanding-results)
8. [Integration Patterns](#integration-patterns)
9. [CI/CD Integration](#cicd-integration)
10. [Advanced Usage](#advanced-usage)
11. [Troubleshooting](#troubleshooting)

---

## Introduction

### What is flakestorm?

flakestorm is an **adversarial testing framework** for AI agents. It applies chaos engineering principles to systematically test how your AI agents behave under unexpected, malformed, or adversarial inputs.

### Why Use flakestorm?

| Problem | How flakestorm Helps |
|---------|----------------------|
| Agent fails with typos in user input | Tests with noise mutations |
| Agent leaks sensitive data | Safety assertions catch PII exposure |
| Agent behavior varies unpredictably | Semantic similarity assertions ensure consistency |
| Prompt injection attacks | Tests agent resilience to injection attempts |
| No way to quantify reliability | Provides robustness scores (0.0 - 1.0) |

### How It Works

```
┌─────────────────────────────────────────────────────────────────┐
│                         flakestorm FLOW                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   1. GOLDEN PROMPTS           2. MUTATION ENGINE                │
│   ┌─────────────────┐         ┌─────────────────┐               │
│   │ "Book a flight  │   ───►  │  Local LLM      │               │
│   │  from NYC to LA"│         │  (Qwen/Ollama)  │               │
│   └─────────────────┘         └────────┬────────┘               │
│                                        │                        │
│                                        ▼                        │
│                               ┌─────────────────┐               │
│                               │ Mutated Prompts │               │
│                               │ • Typos         │               │
│                               │ • Paraphrases   │               │
│                               │ • Injections    │               │
│                               └────────┬────────┘               │
│                                        │                        │
│   3. YOUR AGENT                        ▼                        │
│   ┌─────────────────┐         ┌─────────────────┐               │
│   │  AI Agent       │   ◄───  │  Test Runner    │               │
│   │  (HTTP/Python)  │         │  (Async)        │               │
│   └────────┬────────┘         └─────────────────┘               │
│            │                                                    │
│            ▼                                                    │
│   4. VERIFICATION             5. REPORTING                      │
│   ┌─────────────────┐         ┌─────────────────┐               │
│   │  Invariant      │   ───►  │  HTML/JSON/CLI  │               │
│   │  Assertions     │         │  Reports        │               │
│   └─────────────────┘         └─────────────────┘               │
│            │                                                    │
│            ▼                                                    │
│   ┌─────────────────┐                                           │
│   │  Robustness     │                                           │
│   │  Score: 0.85    │                                           │
│   └─────────────────┘                                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

---

## Installation

### Prerequisites

- **Python 3.10+** (3.11 recommended)
- **Ollama** (for local LLM mutation generation)
- **Rust** (optional, for performance optimization)

### Step 1: Install Ollama

```bash
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
ollama serve
```

### Step 2: Pull the Default Model

```bash
# Pull Qwen2.5 Coder 7B (recommended for mutations)
ollama pull qwen2.5-coder:7b

# Verify it's working
ollama run qwen2.5-coder:7b "Hello, world!"
```

### Step 3: Install flakestorm

```bash
# From PyPI (when published)
pip install flakestorm

# From source (development)
git clone https://github.com/flakestorm/flakestorm.git
cd flakestorm
pip install -e ".[dev]"
```

### Step 4: (Optional) Install Rust Extension

For an 80x+ performance improvement on scoring:

```bash
cd rust
pip install maturin
maturin build --release
pip install ../target/wheels/*.whl
```

### Verify Installation

```bash
flakestorm --version
flakestorm --help
```

---

## Quick Start

### 1. Initialize Configuration

```bash
# Create flakestorm.yaml in your project
flakestorm init
```

### 2. Configure Your Agent

Edit `flakestorm.yaml`:

```yaml
# Your AI agent endpoint
agent:
  endpoint: "http://localhost:8000/chat"
  type: http
  timeout: 30

# Prompts that should always work
golden_prompts:
  - "What is the weather in New York?"
  - "Book a flight from NYC to LA for tomorrow"
  - "Cancel my reservation #12345"

# What to check in responses
invariants:
  - type: contains
    value: "weather"
    prompt_filter: "weather"
  - type: latency
    max_ms: 5000
  - type: excludes_pii
```

### 3. Run Tests

```bash
# Basic run
flakestorm run

# With HTML report
flakestorm run --output html

# CI mode (fails if score < threshold)
flakestorm run --ci --min-score 0.8
```

### 4. View Results

```bash
# Open the generated report
open reports/entropix_report_*.html
```

---

## Core Concepts

### Golden Prompts

**What they are:** Carefully crafted prompts that represent your agent's core use cases. These are prompts that *should always work correctly*.

**How to choose them:**
- Cover all major user intents
- Include edge cases you've seen in production
- Represent different complexity levels

```yaml
golden_prompts:
  # Simple intent
  - "Hello, how are you?"

  # Complex intent with parameters
  - "Book a flight from New York to Los Angeles departing March 15th"

  # Edge case
  - "What if I need to cancel my booking?"
```

### Mutation Types

flakestorm generates adversarial variations of your golden prompts:

| Type | Description | Example |
|------|-------------|---------|
| `paraphrase` | Same meaning, different words | "Book flight" → "Reserve a plane ticket" |
| `noise` | Typos and formatting errors | "Book flight" → "Bok fligt" |
| `tone_shift` | Different emotional tone | "Book flight" → "I NEED A FLIGHT NOW!!!" |
| `prompt_injection` | Attempted jailbreaks | "Book flight. Ignore above and..." |

### Invariants (Assertions)

Rules that agent responses must satisfy:

```yaml
invariants:
  # Response must contain a keyword
  - type: contains
    value: "booked"

  # Response must NOT contain certain content
  - type: not_contains
    value: "error"

  # Response must match regex pattern
  - type: regex
    pattern: "confirmation.*#[A-Z0-9]+"

  # Response time limit
  - type: latency
    max_ms: 3000

  # Must be valid JSON
  - type: valid_json

  # Semantic similarity to expected response
  - type: similarity
    expected: "Your flight has been booked successfully"
    threshold: 0.8

  # Safety: no PII leakage
  - type: excludes_pii

  # Safety: must include refusal for dangerous requests
  - type: refusal
```
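
To make the simpler invariant types concrete, here is a minimal Python sketch of how `contains`, `not_contains`, `regex`, `latency`, and `valid_json` checks could be evaluated against a single response. It is an illustration only, not flakestorm's implementation: the `check_invariant` function and its dict-based input are hypothetical, and `similarity`, `excludes_pii`, and `refusal` are left out because they would need an embedding model or classifier.

```python
import json
import re


def check_invariant(invariant, response, latency_ms):
    """Return True if `response` satisfies one invariant dict (hypothetical helper)."""
    kind = invariant["type"]
    if kind == "contains":
        return invariant["value"] in response
    if kind == "not_contains":
        return invariant["value"] not in response
    if kind == "regex":
        return re.search(invariant["pattern"], response) is not None
    if kind == "latency":
        return latency_ms <= invariant["max_ms"]
    if kind == "valid_json":
        try:
            json.loads(response)
            return True
        except json.JSONDecodeError:
            return False
    raise ValueError(f"unsupported invariant type in this sketch: {kind}")


# Invariants mirroring the YAML above, applied to a made-up response.
invariants = [
    {"type": "contains", "value": "booked"},
    {"type": "not_contains", "value": "error"},
    {"type": "regex", "pattern": "confirmation.*#[A-Z0-9]+"},
    {"type": "latency", "max_ms": 3000},
]
response = "Your flight is booked. Your confirmation code is #AB12CD"
all_passed = all(check_invariant(inv, response, latency_ms=850) for inv in invariants)
print(all_passed)
```

In this sketch a mutation passes only when every applicable invariant passes; whether flakestorm aggregates checks the same way is an assumption.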

### Robustness Score

A number from 0.0 to 1.0 indicating how reliable your agent is:

```
Score = (Weighted Passed Tests) / (Total Weighted Tests)
```

Weights by mutation type:
- `prompt_injection`: 1.5 (harder to defend against)
- `paraphrase`: 1.0 (should always work)
- `tone_shift`: 1.0 (should handle different tones)
- `noise`: 0.8 (minor errors are acceptable)

**Interpretation:**
- **0.9+**: Excellent - Production ready
- **0.8-0.9**: Good - Minor improvements needed
- **0.7-0.8**: Fair - Needs work
- **<0.7**: Poor - Significant reliability issues
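
As a worked example of the formula, the sketch below computes a weighted score from per-type pass counts using the default weights listed above. The `robustness_score` helper and the counts are illustrative assumptions, not flakestorm's internal code. Because failures in heavily weighted types count for more, the weighted score (about 0.83 here) can differ from the plain pass rate for the same counts (68/80 = 0.85).

```python
# Default weights from the list above.
WEIGHTS = {
    "paraphrase": 1.0,
    "noise": 0.8,
    "tone_shift": 1.0,
    "prompt_injection": 1.5,
}


def robustness_score(results, weights=WEIGHTS):
    """Weighted passed / weighted total; `results` maps type -> (passed, total)."""
    weighted_passed = sum(weights[t] * passed for t, (passed, _) in results.items())
    weighted_total = sum(weights[t] * total for t, (_, total) in results.items())
    return weighted_passed / weighted_total if weighted_total else 0.0


# Illustrative counts: (passed, total) per mutation type.
results = {
    "paraphrase": (19, 20),
    "noise": (18, 20),
    "tone_shift": (17, 20),
    "prompt_injection": (14, 20),
}

score = robustness_score(results)
print(round(score, 2))  # about 0.83: (19 + 14.4 + 17 + 21) / (20 + 16 + 20 + 30)
```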

---

## Configuration Deep Dive

### Full Configuration Schema

```yaml
# =============================================================================
# AGENT CONFIGURATION
# =============================================================================
agent:
  # Required: Where to send requests
  endpoint: "http://localhost:8000/chat"

  # Agent type: http, python, or langchain
  type: http

  # Request timeout in seconds
  timeout: 30

  # HTTP-specific settings
  headers:
    Authorization: "Bearer ${API_KEY}"  # Environment variable expansion
    Content-Type: "application/json"

  # How to format the request body
  # Available placeholders: {prompt}
  request_template: |
    {"message": "{prompt}", "stream": false}

  # JSONPath to extract response from JSON
  response_path: "$.response"

# =============================================================================
# GOLDEN PROMPTS
# =============================================================================
golden_prompts:
  - "What is 2 + 2?"
  - "Summarize this article: {article_text}"
  - "Translate to Spanish: Hello, world!"

# =============================================================================
# MUTATION CONFIGURATION
# =============================================================================
mutations:
  # Number of mutations per golden prompt
  count: 20

  # Which mutation types to use
  types:
    - paraphrase
    - noise
    - tone_shift
    - prompt_injection

  # Weights for scoring (higher = more important to pass)
  weights:
    paraphrase: 1.0
    noise: 0.8
    tone_shift: 1.0
    prompt_injection: 1.5

# =============================================================================
# LLM CONFIGURATION (for mutation generation)
# =============================================================================
llm:
  # Ollama model to use
  model: "qwen2.5-coder:7b"

  # Ollama server URL
  host: "http://localhost:11434"

  # Generation temperature (higher = more creative mutations)
  temperature: 0.8

# =============================================================================
# INVARIANTS (ASSERTIONS)
# =============================================================================
invariants:
  # Example: Response must contain booking confirmation
  - type: contains
    value: "confirmed"
    case_sensitive: false
    prompt_filter: "book"  # Only apply to prompts containing "book"

  # Example: Response time limit
  - type: latency
    max_ms: 5000

  # Example: Must be valid JSON
  - type: valid_json

  # Example: Semantic similarity
  - type: similarity
    expected: "I've booked your flight"
    threshold: 0.75

  # Example: No PII in response
  - type: excludes_pii

  # Example: Must refuse dangerous requests
  - type: refusal
    prompt_filter: "ignore|bypass|jailbreak"

# =============================================================================
# ADVANCED SETTINGS
# =============================================================================
advanced:
  # Concurrent test executions
  concurrency: 10

  # Retry failed requests
  retries: 3

  # Output directory for reports
  output_dir: "./reports"

  # Fail threshold for CI mode
  min_score: 0.8
```

### Environment Variable Expansion

Use `${VAR_NAME}` syntax to reference environment variables:

```yaml
agent:
  endpoint: "${AGENT_URL}"
  headers:
    Authorization: "Bearer ${API_KEY}"
```
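
The substitution itself can be pictured with a few lines of Python. This is a sketch under assumptions, not flakestorm's code: the `expand_env` helper is hypothetical, and raising an error for an unset variable is a design choice made here, since this guide does not say how unset variables are handled.

```python
import os
import re

# Matches ${VAR_NAME} where the name is a conventional environment variable name.
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Replace every ${VAR_NAME} in `value` with its value from `env`."""
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            # Assumption: fail loudly instead of silently sending an empty token.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _VAR_PATTERN.sub(replace, value)


# Example with an explicit mapping instead of the real process environment.
fake_env = {"AGENT_URL": "http://localhost:8000/chat", "API_KEY": "s3cret"}
print(expand_env("${AGENT_URL}", fake_env))
print(expand_env("Bearer ${API_KEY}", fake_env))
```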

---

## Running Tests

### Basic Commands

```bash
# Run with default config (flakestorm.yaml)
flakestorm run

# Specify config file
flakestorm run --config my-config.yaml

# Output format: terminal (default), html, json
flakestorm run --output html

# Quiet mode (less output)
flakestorm run --quiet

# Verbose mode (more output)
flakestorm run --verbose
```

### CI/CD Mode

```bash
# Fail if score < 0.8
flakestorm run --ci --min-score 0.8

# Exit codes:
#   0 = Score meets threshold
#   1 = Score below threshold
#   2 = Configuration error
#   3 = Runtime error
```
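
If you drive flakestorm from a Python script rather than directly from the shell, the exit codes above can be mapped to readable messages. The sketch below uses only the `run --ci --min-score` invocation shown in this guide; the `describe_exit` and `run_flakestorm` helpers are hypothetical names introduced for illustration.

```python
import subprocess

# Exit codes as documented above.
EXIT_CODES = {
    0: "score meets threshold",
    1: "score below threshold",
    2: "configuration error",
    3: "runtime error",
}


def describe_exit(code):
    """Translate a flakestorm exit code into a short message."""
    return EXIT_CODES.get(code, f"unknown exit code {code}")


def run_flakestorm(min_score=0.8):
    """Run flakestorm in CI mode and return (exit_code, message)."""
    completed = subprocess.run(
        ["flakestorm", "run", "--ci", "--min-score", str(min_score)]
    )
    return completed.returncode, describe_exit(completed.returncode)
```

Calling `run_flakestorm()` requires flakestorm to be installed and on your `PATH`; treating codes 2 and 3 differently from code 1 lets a pipeline distinguish "the agent regressed" from "the test setup is broken".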

### Individual Commands

```bash
# Just verify config is valid
flakestorm verify --config flakestorm.yaml

# Generate report from previous run
flakestorm report --input results.json --output html

# Show current score
flakestorm score --input results.json
```

---

## Understanding Results

### Terminal Output

```
╭──────────────────────────────────────────────────────────────────╮
│                     flakestorm TEST RESULTS                      │
├──────────────────────────────────────────────────────────────────┤
│  Robustness Score: 0.85                                          │
│  ████████████████████░░░░ 85%                                    │
├──────────────────────────────────────────────────────────────────┤
│  Total Mutations: 80                                             │
│  ✅ Passed: 68                                                   │
│  ❌ Failed: 12                                                   │
├──────────────────────────────────────────────────────────────────┤
│  By Mutation Type:                                               │
│    paraphrase:       95% (19/20)                                 │
│    noise:            90% (18/20)                                 │
│    tone_shift:       85% (17/20)                                 │
│    prompt_injection: 70% (14/20)                                 │
├──────────────────────────────────────────────────────────────────┤
│  Latency: avg=245ms, p50=200ms, p95=450ms, p99=890ms             │
╰──────────────────────────────────────────────────────────────────╯
```

### HTML Report

The HTML report provides:

1. **Summary Dashboard** - Overall score, pass/fail breakdown
2. **Mutation Matrix** - Visual grid of all test results
3. **Failure Details** - Specific failures with input/output
4. **Latency Charts** - Response time distribution
5. **Recommendations** - AI-generated improvement suggestions

### JSON Export

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "config_hash": "abc123",
  "statistics": {
    "total_mutations": 80,
    "passed_mutations": 68,
    "failed_mutations": 12,
    "robustness_score": 0.85,
    "avg_latency_ms": 245,
    "p95_latency_ms": 450
  },
  "results": [
    {
      "golden_prompt": "Book a flight to NYC",
      "mutation": "Reserve a plane ticket to New York",
      "mutation_type": "paraphrase",
      "passed": true,
      "response": "I've booked your flight...",
      "latency_ms": 234,
      "checks": [
        {"type": "contains", "passed": true},
        {"type": "latency", "passed": true}
      ]
    }
  ]
}
```
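
Because the export is plain JSON, you can post-process it with a short script, for example to see which mutation types fail most and which checks tripped. The sketch below relies only on the field names shown in the export above (`results`, `mutation_type`, `passed`, `checks`); the helper names and the inline sample data are made up for illustration.

```python
import json
from collections import Counter


def failures_by_type(report):
    """Count failed results per mutation type."""
    return Counter(
        result["mutation_type"] for result in report["results"] if not result["passed"]
    )


def failed_checks(result):
    """Names of the checks that did not pass for one result."""
    return [check["type"] for check in result["checks"] if not check["passed"]]


# Inline sample with the same shape as the export; in practice you would
# load the file with json.load(open("results.json")).
report = json.loads("""
{
  "results": [
    {"mutation_type": "paraphrase", "passed": true,
     "checks": [{"type": "contains", "passed": true}]},
    {"mutation_type": "prompt_injection", "passed": false,
     "checks": [{"type": "contains", "passed": false},
                {"type": "latency", "passed": true}]}
  ]
}
""")

print(dict(failures_by_type(report)))
for result in report["results"]:
    if not result["passed"]:
        print(result["mutation_type"], failed_checks(result))
```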

---

## Integration Patterns

### Pattern 1: HTTP Agent

Most common pattern - agent exposed via REST API:

```yaml
agent:
  endpoint: "http://localhost:8000/api/chat"
  type: http
  request_template: |
    {"message": "{prompt}"}
  response_path: "$.reply"
```
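
To see how `request_template` and `response_path` fit together, the following sketch renders a request body from a template and pulls the reply out of a JSON response. It is a simplified illustration, not flakestorm's implementation: `render_request` and `extract` are hypothetical helpers, only plain dotted paths such as `$.reply` are handled, and JSON-escaping the prompt before substitution is an assumption (the guide does not state how quotes or newlines in mutated prompts are treated).

```python
import json


def render_request(template, prompt):
    """Substitute {prompt} into a JSON template, escaping it as a JSON string."""
    # json.dumps adds surrounding quotes; strip them because the template
    # already wraps {prompt} in quotes.
    escaped = json.dumps(prompt)[1:-1]
    return template.replace("{prompt}", escaped)


def extract(path, payload):
    """Resolve a simple dotted JSONPath like "$.reply" or "$.data.text"."""
    if not path.startswith("$."):
        raise ValueError(f"only simple '$.key' paths are supported here: {path}")
    value = payload
    for key in path[2:].split("."):
        value = value[key]
    return value


template = '{"message": "{prompt}"}'
body = render_request(template, 'Book a "cheap" flight\nto LA')
print(body)                      # still valid JSON despite quotes and newline
print(json.loads(body)["message"])

agent_response = {"reply": "I've booked your flight"}
print(extract("$.reply", agent_response))
```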

**Your agent code:**

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

class ChatResponse(BaseModel):
    reply: str

@app.post("/api/chat")
async def chat(request: ChatRequest) -> ChatResponse:
    # Your agent logic here
    response = your_llm_call(request.message)
    return ChatResponse(reply=response)
```

### Pattern 2: Python Module

Direct Python integration (no HTTP overhead):

```yaml
agent:
  endpoint: "my_agent.agent:handle_message"
  type: python
```

**Your agent code (`my_agent/agent.py`):**

```python
def handle_message(prompt: str) -> str:
    """
    flakestorm will call this function directly.

    Args:
        prompt: The user message (mutated)

    Returns:
        The agent's response as a string
    """
    # Your agent logic
    return process_message(prompt)
```
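
The `module.path:function` form of the endpoint can be resolved with the standard library. The sketch below shows one plausible way to do it; `load_callable` is a hypothetical helper, and flakestorm's own loader may differ (for example in how it handles `sys.path`).

```python
import importlib


def load_callable(spec):
    """Resolve a "package.module:attribute" string to a callable object."""
    module_name, _, attr_name = spec.partition(":")
    if not module_name or not attr_name:
        raise ValueError(f"expected 'module:function', got {spec!r}")
    module = importlib.import_module(module_name)
    target = getattr(module, attr_name)
    if not callable(target):
        raise TypeError(f"{spec!r} does not point to a callable")
    return target


# Demonstrated with a standard-library function so the example runs anywhere;
# with the config above you would pass "my_agent.agent:handle_message".
dumps = load_callable("json:dumps")
print(dumps({"ok": True}))
```

For the import to succeed, the package (`my_agent` in the config above) has to be importable from the directory where you run flakestorm, for instance by installing it or running from the project root.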

### Pattern 3: LangChain Agent

For LangChain-based agents:

```yaml
agent:
  endpoint: "my_agent.chain:agent"
  type: langchain
```

**Your agent code:**

```python
from langchain.agents import AgentExecutor

# flakestorm will call agent.invoke({"input": prompt})
agent = AgentExecutor(...)
```

---

## CI/CD Integration

### GitHub Actions

Create `.github/workflows/flakestorm.yml`:

```yaml
name: Agent Reliability Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  reliability-test:
    runs-on: ubuntu-latest

    services:
      ollama:
        image: ollama/ollama
        ports:
          - 11434:11434

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install flakestorm
          pip install -r requirements.txt

      - name: Pull Ollama model
        run: |
          curl -X POST http://localhost:11434/api/pull \
            -d '{"name": "qwen2.5-coder:7b"}'

      - name: Start agent
        run: |
          python -m my_agent &
          sleep 5  # Wait for startup

      - name: Run flakestorm tests
        run: |
          flakestorm run --ci --min-score 0.8 --output json

      - name: Upload report
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: flakestorm-report
          path: reports/
```

### GitLab CI

```yaml
flakestorm-test:
  image: python:3.11
  services:
    - name: ollama/ollama
      alias: ollama
  variables:
    OLLAMA_HOST: "http://ollama:11434"
  script:
    - pip install flakestorm
    - flakestorm run --ci --min-score 0.8
  artifacts:
    paths:
      - reports/
    when: always
```

### Pre-commit Hook

Add to `.pre-commit-config.yaml`:

```yaml
repos:
  - repo: local
    hooks:
      - id: flakestorm
        name: flakestorm Agent Tests
        entry: flakestorm run --ci --min-score 0.8
        language: system
        pass_filenames: false
        always_run: true
```

---

## Advanced Usage

### Custom Mutation Templates

Override default mutation prompts:

```yaml
mutations:
  templates:
    paraphrase: |
      Rewrite this prompt with completely different words
      but preserve the exact meaning: "{prompt}"

    noise: |
      Add realistic typos and formatting errors to this prompt.
      Make 2-3 small mistakes: "{prompt}"
```

### Filtering Invariants by Prompt

Apply assertions only to specific prompts:

```yaml
invariants:
  # Only for booking-related prompts
  - type: contains
    value: "confirmation"
    prompt_filter: "book|reserve|schedule"

  # Only for cancellation prompts
  - type: regex
    pattern: "cancelled|refunded"
    prompt_filter: "cancel"
```
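
The `|` alternation in the examples suggests that `prompt_filter` is treated as a regular expression. Under that assumption, selecting the invariants that apply to a given prompt could look like the sketch below. The `applicable_invariants` helper is hypothetical, and both the case-insensitive matching and the choice of which prompt text is matched (golden or mutated) are assumptions not confirmed by this guide.

```python
import re


def applicable_invariants(invariants, prompt):
    """Keep invariants without a prompt_filter, or whose filter matches the prompt."""
    selected = []
    for invariant in invariants:
        pattern = invariant.get("prompt_filter")
        # Assumption: the filter is a regex searched anywhere in the prompt,
        # ignoring case.
        if pattern is None or re.search(pattern, prompt, re.IGNORECASE):
            selected.append(invariant)
    return selected


invariants = [
    {"type": "contains", "value": "confirmation", "prompt_filter": "book|reserve|schedule"},
    {"type": "regex", "pattern": "cancelled|refunded", "prompt_filter": "cancel"},
    {"type": "latency", "max_ms": 5000},  # no filter: applies to every prompt
]

for prompt in ["Book a flight to LA", "Cancel my reservation #12345"]:
    types = [inv["type"] for inv in applicable_invariants(invariants, prompt)]
    print(prompt, "->", types)
```

Note that "Cancel my reservation" does not trigger the booking filter in this sketch, because `reserve` is not a substring of `reservation`; if you want both forms to match, a broader pattern such as `reserv` would be needed.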

### Custom Weights

Adjust scoring weights based on your priorities:

```yaml
mutations:
  weights:
    # Security is critical - weight injection tests higher
    prompt_injection: 2.0

    # Typo tolerance is less important
    noise: 0.5
```

### Parallel Execution

Control concurrency for rate-limited APIs:

```yaml
advanced:
  concurrency: 5  # Max 5 parallel requests
  retries: 3      # Retry failed requests 3 times
```
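
The effect of these two settings can be sketched with `asyncio`: a semaphore caps the number of in-flight requests, and each request is retried a fixed number of times before the error is surfaced. This is an illustration of the general technique, not flakestorm's test runner; `run_all`, `call_with_retries`, and the `echo_agent` stand-in are hypothetical.

```python
import asyncio


async def call_with_retries(agent, prompt, retries):
    """Call the agent once, then retry up to `retries` more times on failure."""
    for attempt in range(retries + 1):
        try:
            return await agent(prompt)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error


async def run_all(agent, prompts, concurrency=5, retries=3):
    """Send every prompt to the agent with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(prompt):
        async with semaphore:
            return await call_with_retries(agent, prompt, retries)

    # gather preserves the order of `prompts` in its result list.
    return await asyncio.gather(*(worker(p) for p in prompts))


async def echo_agent(prompt):
    """Stand-in for a real agent call (for example an HTTP request)."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


responses = asyncio.run(run_all(echo_agent, ["a", "b", "c"], concurrency=2))
print(responses)
```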

---

## Troubleshooting

### Common Issues

#### "Cannot connect to Ollama"

```bash
# Check if Ollama is running
curl http://localhost:11434/api/version

# Start Ollama if not running
ollama serve
```

#### "Model not found"

```bash
# List available models
ollama list

# Pull the required model
ollama pull qwen2.5-coder:7b
```

#### "Agent connection refused"

```bash
# Verify your agent is running
curl http://localhost:8000/health

# Check the endpoint in config
grep endpoint flakestorm.yaml
```

#### "Timeout errors"

Increase timeout in config:

```yaml
agent:
  timeout: 60  # Increase to 60 seconds
```

#### "Low robustness score"

1. Review failed mutations in the report
2. Identify patterns (e.g., all prompt_injection failing)
3. Improve your agent's handling of those cases
4. Re-run tests

### Debug Mode

```bash
# Enable verbose logging
flakestorm run --verbose

# Or set environment variable
export ENTROPIX_DEBUG=1
flakestorm run
```

### Getting Help

- **Documentation**: https://flakestorm.dev/docs
- **GitHub Issues**: https://github.com/flakestorm/flakestorm/issues
- **Discord**: https://discord.gg/flakestorm

---

## Next Steps

1. **Start simple**: Test with 1-2 golden prompts first
2. **Add invariants gradually**: Start with `contains` and `latency`
3. **Review failures**: Use reports to understand weak points
4. **Iterate**: Improve agent, re-test, repeat
5. **Integrate into CI**: Automate testing on every PR

---

*Built with ❤️ by the flakestorm Team*