diff --git a/README.md b/README.md
index 454648b..43175a2 100644
--- a/README.md
+++ b/README.md
@@ -36,7 +36,6 @@ Instead of running one test case, Flakestorm takes a single "Golden Prompt", gen

 > **"If it passes Flakestorm, it won't break in Production."**

-<<<<<<< HEAD
 ## Who Flakestorm Is For

 - **Teams shipping AI agents to production** — Catch failures before users do
@@ -50,21 +49,8 @@ Flakestorm is built for production-grade agents handling real traffic. While it

 - ✅ **8 Core Mutation Types**: Comprehensive robustness testing covering semantic, input, security, and edge cases
 - ✅ **Invariant Assertions**: Deterministic checks, semantic similarity, basic safety
-- ✅ **CI/CD Ready**: Run in pipelines with exit codes and score thresholds
+- ✅ **Local-First**: Uses Ollama with local models (e.g., qwen2.5:3b) for free testing
 - ✅ **Beautiful Reports**: Interactive HTML reports with pass/fail matrices
-=======
-## What You Get in Minutes
-
-Within minutes of setup, Flakestorm gives you:
-
-- **Robustness Score**: A single number (0.0-1.0) that quantifies your agent's reliability
-- **Failure Analysis**: Detailed reports showing exactly which mutations broke your agent and why
-- **Security Insights**: Discover prompt injection vulnerabilities before attackers do
-- **Edge Case Discovery**: Find boundary conditions that would cause production failures
-- **Actionable Reports**: Interactive HTML reports with specific recommendations for improvement
-
-No more guessing if your agent is production-ready. Flakestorm tells you exactly what will break and how to fix it.
->>>>>>> b57b6e88dc216554442a189c16ad076ec06bb26e

 ## Demo

@@ -88,88 +74,74 @@ No more guessing if your agent is production-ready. Flakestorm tells you exactly

 *Interactive HTML reports with detailed failure analysis and recommendations*

-## Try Flakestorm in ~60 Seconds
+## Quick Start

-<<<<<<< HEAD
-> **Note**: This local path is great for quick exploration. Production teams typically run Flakestorm in CI or cloud-based setups. See the [Usage Guide](docs/USAGE_GUIDE.md) for production deployment patterns.
+### Installation Order

-### Local Installation (OSS)
-=======
-Want to see Flakestorm in action immediately? Here's the fastest path:
->>>>>>> b57b6e88dc216554442a189c16ad076ec06bb26e

+1. **Install Ollama first** (system-level service)
+2. **Create a virtual environment** (for Python packages)
+3. **Install flakestorm** (Python package)
+4. **Start Ollama and pull the model** (required for mutation generation)

-1. **Install flakestorm** (if you have Python 3.10+):
-   ```bash
-   pip install flakestorm
-   ```

+### Step 1: Install Ollama (System-Level)

-2. **Initialize a test configuration**:
-   ```bash
-   flakestorm init
-   ```

+FlakeStorm uses [Ollama](https://ollama.ai) for local model inference. Install this first:

-<<<<<<< HEAD
-For local execution, FlakeStorm uses [Ollama](https://ollama.ai) for mutation generation. This is an implementation detail for the OSS path — production setups typically use cloud-based mutation services. Install this first:
-=======
-3. **Point it at your agent** (edit `flakestorm.yaml`):
-   ```yaml
-   agent:
-     endpoint: "http://localhost:8000/invoke"  # Your agent's endpoint
-     type: "http"
-   ```
->>>>>>> b57b6e88dc216554442a189c16ad076ec06bb26e

+**macOS Installation:**

-4. **Run your first test**:
-   ```bash
-   flakestorm run
-   ```

+```bash
+# Option 1: Homebrew (recommended)
+brew install ollama

-That's it! You'll get a robustness score and detailed report showing how your agent handles adversarial inputs.

+# If you get permission errors, fix the Homebrew permissions first:
+sudo chown -R $(whoami) ~/Library/Logs/Homebrew
+sudo chown -R $(whoami) /usr/local/Cellar
+sudo chown -R $(whoami) /usr/local/Homebrew
+brew install ollama

-> **Note**: For full local execution (including mutation generation), you'll need Ollama installed. See the [Usage Guide](docs/USAGE_GUIDE.md) for complete setup instructions.

+# Option 2: Official Installer
+# Visit https://ollama.ai/download and download the macOS installer (.dmg)
+```

-## How Flakestorm Works

+**Windows Installation:**

-Flakestorm follows a simple but powerful workflow:

+1. Visit https://ollama.com/download/windows
+2. Download `OllamaSetup.exe`
+3. Run the installer and follow the wizard
+4. Ollama will be installed and start automatically

-1. **You provide "Golden Prompts"** — example inputs that should always work correctly
-2. **Flakestorm generates mutations** — using a local LLM, it creates adversarial variations:
-   - Paraphrases (same meaning, different words)
-   - Typos and noise (realistic user errors)
-   - Tone shifts (frustrated, urgent, aggressive users)
-   - Prompt injections (security attacks)
-   - Encoding attacks (Base64, URL encoding)
-   - Context manipulation (noisy, verbose inputs)
-   - Length extremes (empty, very long inputs)
-3. **Your agent processes each mutation** — Flakestorm sends them to your agent endpoint
-4. **Invariants are checked** — responses are validated against rules you define (latency, content, safety)
-5. **Robustness Score is calculated** — weighted by mutation difficulty and importance
-6. **Report is generated** — interactive HTML showing what passed, what failed, and why

+**Linux Installation:**

-The result: You know exactly how your agent will behave under stress before users ever see it.

+```bash
+# Using the official install script
+curl -fsSL https://ollama.com/install.sh | sh

-## Features

+# Or via your distribution's package manager, where it packages Ollama (Ubuntu/Debian example):
+sudo apt install ollama
+```

-- ✅ **8 Core Mutation Types**: Comprehensive robustness testing covering semantic, input, security, and edge cases
-- ✅ **Invariant Assertions**: Deterministic checks, semantic similarity, basic safety
-- ✅ **Local-First**: Uses Ollama with Qwen 3 8B for free testing
-- ✅ **Beautiful Reports**: Interactive HTML reports with pass/fail matrices

+**After installation, start Ollama and pull the model:**

-## Toward a Zero-Setup Path

+```bash
+# Start Ollama
+# macOS (Homebrew): brew services start ollama
+# macOS (Manual) / Linux: ollama serve
+# Windows: Starts automatically as a service

-We're working on making Flakestorm even easier to use. Future improvements include:

+# In another terminal, pull the model
+# Choose based on your RAM:
+# - 8GB RAM: ollama pull tinyllama:1.1b or gemma2:2b
+# - 16GB RAM: ollama pull qwen2.5:3b (recommended)
+# - 32GB+ RAM: ollama pull qwen2.5-coder:7b (best quality)
+ollama pull qwen2.5:3b
+```
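+**Verify the setup (optional).** The commands below are a quick sanity check, not part of FlakeStorm itself: the model you pulled should show up locally, and Ollama should answer on its default port (11434, the same `base_url` the FlakeStorm config points at):

+```bash
+# List locally installed models; the one you pulled should appear
+ollama list

+# Ollama's REST API serves the same list on its default port
+curl http://localhost:11434/api/tags
+```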
-- **Cloud-hosted mutation generation**: No need to install Ollama locally
-- **One-command setup**: Automated installation and configuration
-- **Docker containers**: Pre-configured environments for instant testing
-- **CI/CD integrations**: Native GitHub Actions, GitLab CI, and more
-- **Comprehensive Reporting**: Dashboard and reports with team collaboration.

+**Troubleshooting:** If you get `syntax error` or `command not found` when running `ollama` commands:

-The goal: Test your agent's robustness with a single command, no local dependencies required.

+```bash
+# 1. Remove the bad binary
+sudo rm /usr/local/bin/ollama

-For now, the local execution path gives you full control and privacy. As we build toward zero-setup, you'll always have the option to run everything locally.
-
-<<<<<<< HEAD
 # 2. Find Homebrew's Ollama location
 brew --prefix ollama  # Shows /usr/local/opt/ollama or /opt/homebrew/opt/ollama
@@ -253,53 +225,17 @@ pipx inject flakestorm flakestorm_rust
 flakestorm init
 ```

-This creates a `flakestorm.yaml` configuration file:
+3. **Point it at your agent** (edit `flakestorm.yaml`; a minimal endpoint sketch follows step 4):
+   ```yaml
+   agent:
+     endpoint: "http://localhost:8000/invoke"  # Your agent's endpoint
+     type: "http"
+   ```

-```yaml
-version: "1.0"
-
-agent:
-  endpoint: "http://localhost:8000/invoke"
-  type: "http"
-  timeout: 30000
-
-model:
-  provider: "ollama"
-  # Choose model based on your RAM: 8GB (tinyllama:1.1b), 16GB (qwen2.5:3b), 32GB+ (qwen2.5-coder:7b)
-  # See docs/USAGE_GUIDE.md for full model recommendations
-  name: "qwen2.5:3b"
-  base_url: "http://localhost:11434"
-
-mutations:
-  count: 10
-  types:
-    - paraphrase
-    - noise
-    - tone_shift
-    - prompt_injection
-    - encoding_attacks
-    - context_manipulation
-    - length_extremes
-
-golden_prompts:
-  - "Book a flight to Paris for next Monday"
-  - "What's my account balance?"
-
-invariants:
-  - type: "latency"
-    max_ms: 2000
-  - type: "valid_json"
-
-output:
-  format: "html"
-  path: "./reports"
-```
-
-### Run Tests
-
-```bash
-flakestorm run
-```

+4. **Run your first test**:
+   ```bash
+   flakestorm run
+   ```
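+Step 3 assumes your agent already listens on an HTTP endpoint. If you just want to try FlakeStorm end to end, a minimal stand-in looks like the sketch below (FastAPI is one convenient option; the `prompt` and `output` field names are placeholders, so match them to the request format your agent actually accepts; see the [Usage Guide](docs/USAGE_GUIDE.md) for the exact schema):

+```python
+# minimal_agent.py: hypothetical echo agent for local testing
+# Run with: pip install fastapi uvicorn && uvicorn minimal_agent:app --port 8000
+from fastapi import FastAPI
+from pydantic import BaseModel

+app = FastAPI()

+class InvokeRequest(BaseModel):
+    prompt: str  # assumed field name; use your agent's real schema

+@app.post("/invoke")
+def invoke(req: InvokeRequest) -> dict:
+    # Replace this echo with a call into your real agent
+    return {"output": f"Agent received: {req.prompt}"}
+```

+A quick `curl` with one of your golden prompts confirms the endpoint responds before you point FlakeStorm at it:

+```bash
+curl -X POST http://localhost:8000/invoke \
+  -H "Content-Type: application/json" \
+  -d '{"prompt": "Book a flight to Paris for next Monday"}'
+```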
 Output:
 ```
@@ -399,20 +335,17 @@ agent:
   module: "my_agent:chain"
 ```

-## CI/CD Integration
-
-Flakestorm is designed to run in CI pipelines with configurable score thresholds:
+## Local Testing

+For local testing and validation:
 ```bash
 # Run with minimum score check
 flakestorm run --min-score 0.9

-# Exit with error code if score is too low (for CI gates)
+# Exit with error code if score is too low
 flakestorm run --min-score 0.9 --ci
 ```

-For local testing and development, the same commands work without the `--ci` flag.
-
 ## Robustness Score

 The Robustness Score is calculated as:

@@ -423,18 +356,6 @@ Where:
 - $S_{passed}$ = Semantic variations passed
 - $D_{passed}$ = Deterministic tests passed
 - $W$ = Weights assigned by mutation difficulty
-=======
->>>>>>> b57b6e88dc216554442a189c16ad076ec06bb26e
-
-## Production Deployment
-
-Local execution is ideal for exploration and development. For production agents, Flakestorm is evolving toward a zero-setup, cloud-based workflow that mirrors real deployments. The OSS local path will always remain available for teams who prefer self-hosted solutions.
-
-See the [Usage Guide](docs/USAGE_GUIDE.md) for:
-- Local setup and Ollama configuration
-- Python environment details
-- Production deployment patterns
-- CI/CD integration examples

 ## Documentation