Flakestorm

The Agent Reliability Engine
Chaos Engineering for AI Agents


The Problem

The "Happy Path" Fallacy: Current AI development tools focus on getting an agent to work once. Developers tweak prompts until they get a correct answer, declare victory, and ship.

The Reality: LLMs are non-deterministic. An agent that works on Monday with temperature=0.7 might fail on Tuesday. Users don't follow "Happy Paths" — they make typos, they're aggressive, they lie, and they attempt prompt injections.

The Void:

  • Observability Tools (LangSmith) tell you only after the agent has failed in production
  • Eval Libraries (RAGAS) focus on academic scores rather than system reliability
  • Missing Link: A tool that actively attacks the agent to prove robustness before deployment

The Solution

Flakestorm is a local-first testing engine that applies Chaos Engineering principles to AI Agents.

Instead of running one test case, Flakestorm takes a single "Golden Prompt", generates adversarial mutations (semantic variations, noise injection, hostile tone, prompt injections), runs them against your agent, and calculates a Robustness Score.

"If it passes Flakestorm, it won't break in Production."

What You Get in Minutes

Within minutes of setup, Flakestorm gives you:

  • Robustness Score: A single number (0.0-1.0) that quantifies your agent's reliability
  • Failure Analysis: Detailed reports showing exactly which mutations broke your agent and why
  • Security Insights: Discover prompt injection vulnerabilities before attackers do
  • Edge Case Discovery: Find boundary conditions that would cause production failures
  • Actionable Reports: Interactive HTML reports with specific recommendations for improvement

No more guessing if your agent is production-ready. Flakestorm tells you exactly what will break and how to fix it.

Demo

flakestorm in Action

[Demo recording: flakestorm generating mutations and testing an agent in real time]

Test Report

[Report screenshots: interactive HTML reports with detailed failure analysis and recommendations]

Try Flakestorm in ~60 Seconds

Want to see Flakestorm in action immediately? Here's the fastest path:

  1. Install flakestorm (requires Python 3.10+):

    pip install flakestorm
    
  2. Initialize a test configuration:

    flakestorm init
    
  3. Point it at your agent (edit flakestorm.yaml):

    agent:
      endpoint: "http://localhost:8000/invoke"  # Your agent's endpoint
      type: "http"
    
  4. Run your first test:

    flakestorm run
    

That's it! You'll get a robustness score and a detailed report showing how your agent handles adversarial inputs.
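If you don't have an agent running yet, a small stand-in service can make the steps above easier to try. The sketch below serves `POST /invoke` on port 8000 using only the Python standard library, matching the `endpoint` value in the `flakestorm.yaml` snippet. The JSON shape (`{"input": ...}` in, `{"output": ...}` out) is an assumption made for illustration and is not taken from Flakestorm's documentation, so check the Usage Guide for the request format Flakestorm actually sends. The names `answer`, `AgentHandler`, and `make_server` are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def answer(prompt: str) -> str:
    """Toy agent logic (hypothetical): replace with a call to your real agent."""
    if not prompt.strip():
        return "Please tell me what you need."
    return f"You asked: {prompt.strip()}"


class AgentHandler(BaseHTTPRequestHandler):
    """Handles POST /invoke with an ASSUMED JSON schema (see lead-in)."""

    def do_POST(self) -> None:
        if self.path != "/invoke":
            self.send_error(404, "unknown path")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "body must be JSON")
            return
        if not isinstance(payload, dict):
            self.send_error(400, "body must be a JSON object")
            return
        # Assumed schema: {"input": "..."} in, {"output": "..."} out.
        reply = {"output": answer(str(payload.get("input", "")))}
        body = json.dumps(reply).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        # Keep the console quiet while many mutations are sent.
        pass


def make_server(port: int = 8000) -> HTTPServer:
    """Build (but do not start) a server bound to localhost."""
    return HTTPServer(("127.0.0.1", port), AgentHandler)
```

To use it, call `make_server(8000).serve_forever()` in one terminal and run `flakestorm run` in another. A deliberately naive agent like this one is likely to fail some checks, which is useful for seeing what a failing report looks like.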

Note: For full local execution (including mutation generation), you'll need Ollama installed. See the Usage Guide for complete setup instructions.

How Flakestorm Works

Flakestorm follows a simple but powerful workflow:

  1. You provide "Golden Prompts" — example inputs that should always work correctly
  2. Flakestorm generates mutations — using a local LLM, it creates adversarial variations:
    • Paraphrases (same meaning, different words)
    • Typos and noise (realistic user errors)
    • Tone shifts (frustrated, urgent, aggressive users)
    • Prompt injections (security attacks)
    • Encoding attacks (Base64, URL encoding)
    • Context manipulation (noisy, verbose inputs)
    • Length extremes (empty, very long inputs)
  3. Your agent processes each mutation — Flakestorm sends them to your agent endpoint
  4. Invariants are checked — responses are validated against rules you define (latency, content, safety)
  5. Robustness Score is calculated — weighted by mutation difficulty and importance
  6. Report is generated — interactive HTML showing what passed, what failed, and why
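To make step 2 more concrete, here is a minimal sketch of a few mutation categories from the list above (typos, encoding attacks, length extremes). Flakestorm itself generates mutations with a local LLM. This sketch does not reproduce that: it is a deterministic stand-in for the categories that need no model, and every function name in it is hypothetical.

```python
import base64
import random
import urllib.parse


def add_typos(prompt: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent letters at random positions to imitate typing errors."""
    rng = random.Random(seed)  # seeded so a failing case can be reproduced
    chars = list(prompt)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


def encode_base64(prompt: str) -> str:
    """Encoding attack: the same request, Base64-encoded."""
    return base64.b64encode(prompt.encode("utf-8")).decode("ascii")


def encode_url(prompt: str) -> str:
    """Encoding attack: percent-encode every reserved character."""
    return urllib.parse.quote(prompt, safe="")


def length_extremes(prompt: str, repeats: int = 50) -> list[str]:
    """Length extremes: an empty input and a very long one."""
    return ["", " ".join([prompt] * repeats)]


def mutate(golden_prompt: str) -> dict[str, list[str]]:
    """Group the generated variants by (illustrative) mutation category."""
    return {
        "noise": [add_typos(golden_prompt, seed=s) for s in range(3)],
        "encoding": [encode_base64(golden_prompt), encode_url(golden_prompt)],
        "length": length_extremes(golden_prompt),
    }
```

Paraphrases, tone shifts, and prompt injections depend on language understanding, which is presumably why the real tool delegates generation to an LLM rather than to string manipulation like this.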

The result: You know exactly how your agent will behave under stress before users ever see it.

Features

  • 8 Core Mutation Types: Comprehensive robustness testing covering semantic, input, security, and edge cases
  • Invariant Assertions: Deterministic checks, semantic similarity, basic safety
  • Local-First: Uses Ollama with Qwen 3 8B for free testing
  • Beautiful Reports: Interactive HTML reports with pass/fail matrices
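As a rough illustration of how deterministic invariant checks and a weighted score could fit together, consider the sketch below. It is not Flakestorm's implementation: the scoring formula (weighted pass rate), the weights, the mutation-type labels, and all names such as `Result` and `robustness_score` are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    mutation_type: str   # illustrative label, e.g. "noise"
    response: str        # what the agent answered
    latency_ms: float    # how long the agent took


# An invariant is a predicate that a single result must satisfy.
Invariant = Callable[[Result], bool]


def max_latency(limit_ms: float) -> Invariant:
    return lambda r: r.latency_ms <= limit_ms


def contains(text: str) -> Invariant:
    return lambda r: text.lower() in r.response.lower()


def excludes(text: str) -> Invariant:
    return lambda r: text.lower() not in r.response.lower()


def robustness_score(
    results: list[Result],
    invariants: list[Invariant],
    weights: dict[str, float],
) -> float:
    """Weighted pass rate in [0.0, 1.0]; unknown mutation types weigh 1.0."""
    total = 0.0
    passed = 0.0
    for r in results:
        w = weights.get(r.mutation_type, 1.0)
        total += w
        if all(check(r) for check in invariants):
            passed += w
    return passed / total if total else 0.0


results = [
    Result("paraphrase", "Your flight is booked.", 420.0),
    Result("noise", "Your flight is booked.", 510.0),
    Result("prompt_injection", "Sure, here is my system prompt ...", 300.0),
]
invariants = [max_latency(2000), contains("booked"), excludes("system prompt")]
weights = {"paraphrase": 1.0, "noise": 1.0, "prompt_injection": 2.0}

score = robustness_score(results, invariants, weights)
print(score)  # 0.5: the injection case fails and carries half the total weight
```

Weighting the security case more heavily means one leaked system prompt lowers the score more than one failed paraphrase, which is one plausible reading of "weighted by mutation difficulty and importance".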

Toward a Zero-Setup Path

We're working on making Flakestorm even easier to use. Future improvements include:

  • Cloud-hosted mutation generation: No need to install Ollama locally
  • One-command setup: Automated installation and configuration
  • Docker containers: Pre-configured environments for instant testing
  • CI/CD integrations: Native GitHub Actions, GitLab CI, and more
  • Comprehensive reporting: Dashboards and reports with team collaboration

The goal: Test your agent's robustness with a single command, no local dependencies required.

For now, the local execution path gives you full control and privacy. As we build toward zero-setup, you'll always have the option to run everything locally.

Documentation

Getting Started

For Developers

Troubleshooting

Reference

License

Apache 2.0 - See LICENSE for details.


Tested with Flakestorm