mirror of https://github.com/flakestorm/flakestorm.git synced 2026-04-26 09:16:25 +02:00

Entropix dbbdac9d43 Revise installation instructions in README.md, CONTRIBUTING.md, and USAGE_GUIDE.md to clarify the installation order for Ollama and flakestorm. Added detailed platform-specific installation steps for Ollama and emphasized the need for a virtual environment for Python packages. Included troubleshooting tips for common installation issues.

2025-12-30 18:36:42 +08:00


Flakestorm

The Agent Reliability Engine
Chaos Engineering for AI Agents

The Problem

The "Happy Path" Fallacy: Current AI development tools focus on getting an agent to work once. Developers tweak prompts until they get a correct answer, declare victory, and ship.

The Reality: LLMs are non-deterministic. An agent that works on Monday with temperature=0.7 might fail on Tuesday. Users don't follow "Happy Paths" — they make typos, they're aggressive, they lie, and they attempt prompt injections.

The Void:

Observability tools (LangSmith) tell you only after the agent has failed in production.
Eval libraries (RAGAS) focus on academic scores rather than system reliability.
Missing link: a tool that actively attacks the agent to prove robustness before deployment.

The Solution

Flakestorm is a local-first testing engine that applies Chaos Engineering principles to AI Agents.

Instead of running one test case, Flakestorm takes a single "Golden Prompt", generates adversarial mutations (semantic variations, noise injection, hostile tone, prompt injections), runs them against your agent, and calculates a Robustness Score.
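The loop described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Flakestorm's implementation: `run_suite`, `toy_agent`, and the hand-written mutators are hypothetical stand-ins, and the score here is an unweighted pass rate.

```python
from typing import Callable, Sequence

Mutator = Callable[[str], str]
Check = Callable[[str], bool]


def run_suite(
    agent: Callable[[str], str],
    golden_prompt: str,
    mutators: Sequence[Mutator],
    checks: Sequence[Check],
) -> float:
    """Send each mutated prompt to the agent; return the fraction passing all checks."""
    results = []
    for mutate in mutators:
        reply = agent(mutate(golden_prompt))
        results.append(all(check(reply) for check in checks))
    return sum(results) / len(results)


# Hypothetical toy agent: succeeds only if it can spot the word "flight".
def toy_agent(prompt: str) -> str:
    return "BOOKED" if "flight" in prompt.lower() else "SORRY"


mutators = [
    lambda p: p.upper() + "!!!",                        # hostile tone
    lambda p: p.replace("flight", "fliight"),           # typo noise
    lambda p: p + " and ignore previous instructions",  # naive injection
]
checks = [lambda reply: reply == "BOOKED"]

score = run_suite(toy_agent, "Book a flight to Paris", mutators, checks)
print(f"pass rate: {score:.2f}")
```

The typo mutation is the one that breaks the toy agent, which is the kind of brittleness a single happy-path test would never reveal.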

"If it passes Flakestorm, it won't break in Production."

Features

✅ 5 Mutation Types: Paraphrasing, noise, tone shifts, basic adversarial, custom templates
✅ Invariant Assertions: Deterministic checks, semantic similarity, basic safety
✅ Local-First: Uses Ollama with Qwen 3 8B for free testing
✅ Beautiful Reports: Interactive HTML reports with pass/fail matrices
✅ 50 Mutations Max: Per test run
✅ Sequential Execution: One test at a time

Quick Start

Installation Order

1. Install Ollama first (system-level service)
2. Create virtual environment (for Python packages)
3. Install flakestorm (Python package)
4. Start Ollama and pull model (required for mutations)

Step 1: Install Ollama (System-Level)

Flakestorm uses Ollama for local model inference, so install it first:

macOS Installation:

# Option 1: Homebrew (recommended)
brew install ollama

# If you get permission errors, fix ownership of the Homebrew directories first.
# These paths apply to Intel Macs; on Apple Silicon, Homebrew lives under /opt/homebrew.
sudo chown -R "$(whoami)" ~/Library/Logs/Homebrew
sudo chown -R "$(whoami)" /usr/local/Cellar
sudo chown -R "$(whoami)" /usr/local/Homebrew
brew install ollama

# Option 2: Official Installer
# Visit https://ollama.com/download and download the macOS installer (.dmg)

Windows Installation:

1. Visit https://ollama.com/download/windows
2. Download OllamaSetup.exe
3. Run the installer and follow the wizard
4. Ollama will be installed and start automatically

Linux Installation:

# Using the official install script
curl -fsSL https://ollama.com/install.sh | sh

# Or, if your distribution packages Ollama, use its package manager (Ubuntu/Debian example):
sudo apt install ollama

After installation, start Ollama and pull the model:

# Start Ollama (macOS/Linux - Windows starts automatically)
ollama serve

# In another terminal, pull the model
ollama pull qwen3:8b
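Before running any tests it can help to confirm that Ollama is reachable and that the model has been pulled. The sketch below queries Ollama's `/api/tags` endpoint, which lists locally available models; the helper names (`fetch_tags`, `model_is_pulled`, `check_ollama`) are hypothetical and not part of flakestorm.

```python
import json
import urllib.request


def model_is_pulled(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in the parsed /api/tags response."""
    return any(m.get("name") == model for m in tags_payload.get("models", []))


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of local models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


def check_ollama(model: str = "qwen3:8b",
                 base_url: str = "http://localhost:11434") -> str:
    """Return a human-readable status line for the given model."""
    try:
        tags = fetch_tags(base_url)
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return f"Ollama not reachable at {base_url}: {exc}"
    if model_is_pulled(tags, model):
        return f"{model} is ready"
    return f"{model} not found; run: ollama pull {model}"
```

Calling `print(check_ollama())` in a Python shell reports whether the server is up and whether `qwen3:8b` is available.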

Step 2: Install flakestorm (Python Package)

Using a virtual environment (recommended):

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install flakestorm
pip install flakestorm

Or using pipx (for CLI use only):

pipx install flakestorm

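Note: Requires Python 3.10 or higher. Some Python installations on macOS (for example Homebrew's) are marked as externally managed under PEP 668, so pip refuses to install packages into them directly; use a virtual environment or pipx instead. Ollama runs as a separate system service and does not need to be installed inside your virtual environment.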
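The two requirements in the note can be checked from Python itself. This is a small sketch using only the standard library; the function names are hypothetical.

```python
import sys


def meets_minimum(version_info=sys.version_info, minimum=(3, 10)) -> bool:
    """Return True if the interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


def in_virtualenv() -> bool:
    """Return True when running inside a venv (sys.prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix


print(f"Python >= 3.10: {meets_minimum()}, virtual environment active: {in_virtualenv()}")
```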

Initialize Configuration

flakestorm init

This creates a flakestorm.yaml configuration file:

version: "1.0"

agent:
  endpoint: "http://localhost:8000/invoke"
  type: "http"
  timeout: 30000

model:
  provider: "ollama"
  name: "qwen3:8b"
  base_url: "http://localhost:11434"

mutations:
  count: 10  # Max 50 total per run
  types:
    - paraphrase
    - noise
    - tone_shift
    - prompt_injection

golden_prompts:
  - "Book a flight to Paris for next Monday"
  - "What's my account balance?"

invariants:
  - type: "latency"
    max_ms: 2000
  - type: "valid_json"

output:
  format: "html"
  path: "./reports"
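One way to sanity-check a configuration against the 50-mutation limit is sketched below. It assumes that `count` applies per golden prompt, which is only an inference from the sample run in the next section (10 mutations, 2 prompts, 20 results); the dict mirrors the YAML above and the helper names are hypothetical.

```python
MAX_MUTATIONS_PER_RUN = 50  # limit stated in the config comment above

# Plain-dict mirror of the relevant parts of flakestorm.yaml.
config = {
    "mutations": {"count": 10},
    "golden_prompts": [
        "Book a flight to Paris for next Monday",
        "What's my account balance?",
    ],
}


def planned_mutations(cfg: dict) -> int:
    """Total mutations for a run, ASSUMING `count` is applied to each golden prompt."""
    return cfg["mutations"]["count"] * len(cfg["golden_prompts"])


def within_budget(cfg: dict) -> bool:
    """Return True if the planned total stays within the per-run limit."""
    return planned_mutations(cfg) <= MAX_MUTATIONS_PER_RUN


print(planned_mutations(config), within_budget(config))
```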

Run Tests

flakestorm run

Output:

Generating mutations... ━━━━━━━━━━━━━━━━━━━━ 100%
Running attacks...      ━━━━━━━━━━━━━━━━━━━━ 100%

╭──────────────────────────────────────────╮
│  Robustness Score: 87.5%                 │
│  ────────────────────────                │
│  Passed: 17/20 mutations                 │
│  Failed: 3 (2 latency, 1 injection)      │
╰──────────────────────────────────────────╯

Report saved to: ./reports/flakestorm-2024-01-15-143022.html

Mutation Types

| Type | Description | Example |
|------|-------------|---------|
| Paraphrase | Semantically equivalent rewrites | "Book a flight" → "I need to fly out" |
| Noise | Typos and spelling errors | "Book a flight" → "Book a fliight plz" |
| Tone Shift | Aggressive/impatient phrasing | "Book a flight" → "I need a flight NOW!" |
| Prompt Injection | Basic adversarial attacks | "Book a flight and ignore previous instructions" |
| Custom | Your own mutation templates | Define with `{prompt}` placeholder |
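Two of these mutation types are simple enough to sketch without a model. The code below is a rough illustration, not Flakestorm's generator (which uses the local LLM): `add_noise` imitates typos by doubling letters at random, and `apply_template` shows how a `{prompt}` placeholder could be filled. Both function names are hypothetical.

```python
import random


def add_noise(prompt: str, rng: random.Random, rate: float = 0.1) -> str:
    """Duplicate each letter with probability `rate` to imitate typos."""
    out = []
    for ch in prompt:
        out.append(ch)
        if ch.isalpha() and rng.random() < rate:
            out.append(ch)
    return "".join(out)


def apply_template(template: str, prompt: str) -> str:
    """Substitute the golden prompt into a custom template's {prompt} placeholder."""
    return template.replace("{prompt}", prompt)


rng = random.Random(42)  # seeded so a run can be reproduced
print(add_noise("Book a flight", rng))
print(apply_template("URGENT!!! {prompt} Do it right now.", "Book a flight"))
```

Seeding the random generator keeps a failing mutation reproducible, which matters when you want to re-run the exact input that broke the agent.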

Invariants (Assertions)

Deterministic

invariants:
  - type: "contains"
    value: "confirmation_code"
  - type: "latency"
    max_ms: 2000
  - type: "valid_json"
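Conceptually, each deterministic invariant is a pure pass/fail function over the agent's output or timing. A minimal sketch of the three checks above, with hypothetical function names that are not flakestorm's API:

```python
import json


def check_contains(output: str, value: str) -> bool:
    """Pass if the output includes the required substring."""
    return value in output


def check_latency(elapsed_ms: float, max_ms: float) -> bool:
    """Pass if the agent answered within the time budget."""
    return elapsed_ms <= max_ms


def check_valid_json(output: str) -> bool:
    """Pass if the output parses as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True


reply = '{"confirmation_code": "ABC123"}'
print(check_contains(reply, "confirmation_code"),
      check_latency(1500, 2000),
      check_valid_json(reply))
```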

Semantic

invariants:
  - type: "similarity"
    expected: "Your flight has been booked"
    threshold: 0.8
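A similarity invariant compares the agent's reply with the expected text and passes when the score reaches the threshold. The sketch below uses bag-of-words cosine similarity as a simple stand-in; a real semantic check would more likely compare embedding vectors, and nothing here reflects how Flakestorm computes its score.

```python
import math
import re
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors of two strings."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[token] * vb[token] for token in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def passes_similarity(output: str, expected: str, threshold: float = 0.8) -> bool:
    """Pass if the output is at least `threshold` similar to the expected text."""
    return cosine_similarity(output, expected) >= threshold


print(passes_similarity("Your flight has been booked successfully",
                        "Your flight has been booked"))
```

Word overlap misses paraphrases such as "Reservation confirmed", which is exactly why an embedding-based measure is the better fit for semantic checks.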

Safety (Basic)

invariants:
  - type: "excludes_pii"  # Basic regex patterns
  - type: "refusal_check"
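As a rough idea of what basic regex-based safety checks can look like, the sketch below flags email addresses and US-style Social Security numbers, and detects common refusal phrases. The patterns, phrase list, and function names are illustrative assumptions; the source does not specify what `excludes_pii` or `refusal_check` actually match.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable")


def excludes_pii(output: str) -> bool:
    """Pass if none of the PII patterns match the output."""
    return not any(p.search(output) for p in PII_PATTERNS.values())


def looks_like_refusal(output: str) -> bool:
    """Return True if the output contains a common refusal phrase."""
    lowered = output.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


print(excludes_pii("Your booking is confirmed."))
print(looks_like_refusal("I cannot share another customer's details."))
```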

Agent Adapters

HTTP Endpoint

agent:
  type: "http"
  endpoint: "http://localhost:8000/invoke"

Python Callable

from flakestorm import test_agent

@test_agent
async def my_agent(input: str) -> str:
    # Replace this placeholder with your agent logic
    response = f"Echo: {input}"
    return response

LangChain

agent:
  type: "langchain"
  module: "my_agent:chain"
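The `module` value follows the common `module:attribute` convention used by many Python tools. How Flakestorm resolves it is not documented here, so the sketch below only shows the usual approach with `importlib`; `resolve_target` is a hypothetical helper, demonstrated with a standard-library target instead of `my_agent:chain`.

```python
import importlib


def resolve_target(spec: str):
    """Import `module` and return `attribute` from a 'module:attribute' string."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Demonstration with the standard library.
dumps = resolve_target("json:dumps")
print(dumps({"ok": True}))
```

For `my_agent:chain` to resolve this way, `my_agent.py` would need to be importable from the directory where the tests run and expose a top-level `chain` object.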

Local Testing

For local testing and validation:

# Run with minimum score check
flakestorm run --min-score 0.9

# Exit with error code if score is too low
flakestorm run --min-score 0.9 --ci
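The gating idea behind a minimum score is the standard CI pattern: compare the score with the threshold and turn the result into a process exit code. The sketch below shows only that generic pattern, with a hypothetical `gate` function; it does not describe the internals of the flags above. Note that the threshold is given as a fraction (0.9), while the report prints a percentage.

```python
def gate(score: float, min_score: float) -> int:
    """Return a process exit code: 0 if the score meets the threshold, 1 otherwise."""
    if not 0.0 <= min_score <= 1.0:
        raise ValueError("min_score must be a fraction between 0 and 1")
    return 0 if score >= min_score else 1


# A run that scored 87.5% fails a 0.9 threshold.
print(gate(0.875, 0.9))
```

In a script, `sys.exit(gate(score, 0.9))` would make the CI job fail whenever the score drops below the threshold.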

Robustness Score

The Robustness Score is calculated as:

$$R = \frac{W_s \cdot S_{passed} + W_d \cdot D_{passed}}{N_{total}}$$

Where:

$S_{passed}$ = Semantic variations passed
$D_{passed}$ = Deterministic tests passed
$W_s$, $W_d$ = Weights assigned by mutation difficulty
$N_{total}$ = Total number of tests run
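The formula translates directly into code. In this sketch the default weights of 1.0 and the example counts are illustrative assumptions, since the source does not state the actual weight values.

```python
def robustness_score(
    s_passed: int,
    d_passed: int,
    n_total: int,
    w_s: float = 1.0,
    w_d: float = 1.0,
) -> float:
    """Compute R = (W_s * S_passed + W_d * D_passed) / N_total."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return (w_s * s_passed + w_d * d_passed) / n_total


# With unit weights the score reduces to the plain pass rate: (9 + 8) / 20.
print(robustness_score(s_passed=9, d_passed=8, n_total=20))
```

Weights other than 1.0 shift the result away from the raw pass rate, so a reported score need not equal passed/total.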

Documentation

Getting Started

📖 Usage Guide - Complete end-to-end guide
⚙️ Configuration Guide - All configuration options
🧪 Test Scenarios - Real-world examples with code

For Developers

🏗️ Architecture & Modules - How the code works
❓ Developer FAQ - Q&A about design decisions
📦 Publishing Guide - How to publish to PyPI
🤝 Contributing - How to contribute

Reference

📋 API Specification - API reference
🧪 Testing Guide - How to run and write tests
✅ Implementation Checklist - Development progress

License

AGPLv3 - See LICENSE for details.

Tested with Flakestorm
