Flakestorm — Automated Robustness Testing for AI Agents. Stop guessing if your agent really works. FlakeStorm generates adversarial mutations and exposes failures your manual tests and evals miss. https://flakestorm.com

Flakestorm

The Agent Reliability Engine
Chaos Engineering for AI Agents



The Problem

The "Happy Path" Fallacy: Current AI development tools focus on getting an agent to work once. Developers tweak prompts until they get a correct answer, declare victory, and ship.

The Reality: LLMs are non-deterministic. An agent that works on Monday with temperature=0.7 might fail on Tuesday. Users don't follow "Happy Paths" — they make typos, they're aggressive, they lie, and they attempt prompt injections.

The Void:

  • Observability Tools (LangSmith) tell you only after the agent has failed in production
  • Eval Libraries (RAGAS) focus on academic scores rather than system reliability
  • Missing Link: A tool that actively attacks the agent to prove robustness before deployment

The Solution

Flakestorm is a local-first testing engine that applies Chaos Engineering principles to AI Agents.

Instead of running one test case, Flakestorm takes a single "Golden Prompt", generates adversarial mutations (semantic variations, noise injection, hostile tone, prompt injections), runs them against your agent, and calculates a Robustness Score.

"If it passes Flakestorm, it won't break in Production."

What You Get in Minutes

Within minutes of setup, Flakestorm gives you:

  • Robustness Score: A single number (0.0-1.0) that quantifies your agent's reliability
  • Failure Analysis: Detailed reports showing exactly which mutations broke your agent and why
  • Security Insights: Discover prompt injection vulnerabilities before attackers do
  • Edge Case Discovery: Find boundary conditions that would cause production failures
  • Actionable Reports: Interactive HTML reports with specific recommendations for improvement

No more guessing if your agent is production-ready. Flakestorm tells you exactly what will break and how to fix it.
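To make the 0.0-1.0 Robustness Score concrete, here is a minimal sketch of one plausible way to compute such a number: a weighted pass rate over mutation results. The `MutationResult` class, the `robustness_score` function, and the weights are hypothetical names and values chosen for illustration; Flakestorm's actual formula may differ.

```python
from dataclasses import dataclass


@dataclass
class MutationResult:
    """Outcome of one mutated prompt (hypothetical structure, for illustration)."""
    mutation_type: str   # e.g. "paraphrase", "prompt_injection"
    passed: bool         # True if every invariant held for this mutation
    weight: float = 1.0  # assumed difficulty/importance weight


def robustness_score(results: list[MutationResult]) -> float:
    """Weighted pass rate in [0.0, 1.0]; returns 0.0 when there is nothing to score."""
    total = sum(r.weight for r in results)
    if total <= 0:
        return 0.0
    passed = sum(r.weight for r in results if r.passed)
    return passed / total


results = [
    MutationResult("paraphrase", passed=True, weight=1.0),
    MutationResult("noise", passed=True, weight=1.0),
    MutationResult("prompt_injection", passed=False, weight=2.0),
]
print(f"{robustness_score(results):.2f}")  # prints 0.50
```

In this sketch a single failed prompt injection, weighted twice as heavily as the other mutations, pulls the score down to 0.50 even though two of three mutations passed.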

Demo

flakestorm in Action

[Animation: flakestorm demo]

Watch flakestorm generate mutations and test your agent in real time.

Test Report

[Screenshots: flakestorm Test Reports 1 through 5]

Interactive HTML reports with detailed failure analysis and recommendations

Try Flakestorm in ~60 Seconds

Want to see Flakestorm in action immediately? Here's the fastest path:

  1. Install flakestorm (requires Python 3.10+):

    pip install flakestorm
    
  2. Initialize a test configuration:

    flakestorm init
    
  3. Point it at your agent (edit flakestorm.yaml):

    agent:
      endpoint: "http://localhost:8000/invoke"  # Your agent's endpoint
      type: "http"
    
  4. Run your first test:

    flakestorm run
    

That's it! You'll get a robustness score and detailed report showing how your agent handles adversarial inputs.

Note: For full local execution (including mutation generation), you'll need Ollama installed. See the Usage Guide for complete setup instructions.
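If you don't yet have an agent listening on the endpoint from step 3, a throwaway stub can stand in for a first run. The sketch below uses only Python's standard library and serves POST /invoke on localhost:8000. The request and response field names (`input`, `output`) and the echo behavior are assumptions made for illustration, not Flakestorm's documented contract; check the Usage Guide for the payload your configuration actually sends.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def answer(payload: dict) -> dict:
    """Toy agent logic: echo the prompt back. Replace with a real agent call."""
    # ASSUMPTION: the prompt arrives under "input" and the reply goes under
    # "output". These field names are illustrative, not taken from Flakestorm.
    prompt = str(payload.get("input", ""))
    return {"output": f"You said: {prompt}"}


class InvokeHandler(BaseHTTPRequestHandler):
    """Handles POST /invoke with a JSON object body."""

    def do_POST(self) -> None:
        if self.path != "/invoke":
            self.send_error(404, "unknown path")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "body must be JSON")
            return
        if not isinstance(payload, dict):
            self.send_error(400, "body must be a JSON object")
            return
        body = json.dumps(answer(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Block forever, serving POST /invoke on the given address."""
    HTTPServer((host, port), InvokeHandler).serve_forever()


# serve()  # uncomment to start the stub on http://localhost:8000/invoke
```

Keeping the agent logic in `answer` and the transport in `InvokeHandler` makes it easy to swap the echo for a real model call later without touching the HTTP code.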

How Flakestorm Works

Flakestorm follows a simple but powerful workflow:

  1. You provide "Golden Prompts" — example inputs that should always work correctly
  2. Flakestorm generates mutations — using a local LLM, it creates adversarial variations:
    • Paraphrases (same meaning, different words)
    • Typos and noise (realistic user errors)
    • Tone shifts (frustrated, urgent, aggressive users)
    • Prompt injections (security attacks)
    • Encoding attacks (Base64, URL encoding)
    • Context manipulation (noisy, verbose inputs)
    • Length extremes (empty, very long inputs)
  3. Your agent processes each mutation — Flakestorm sends them to your agent endpoint
  4. Invariants are checked — responses are validated against rules you define (latency, content, safety)
  5. Robustness Score is calculated — weighted by mutation difficulty and importance
  6. Report is generated — interactive HTML showing what passed, what failed, and why

The result: You know exactly how your agent will behave under stress before users ever see it.
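Flakestorm generates its mutations with a local LLM, and the sketch below is not that implementation. It is a small rule-based stand-in that shows what a few of the mechanical mutation types in step 2 could look like: typo noise, encoding attacks (Base64, URL encoding), and length extremes. The function names and the dictionary keys are hypothetical, chosen only for illustration.

```python
import base64
import random
from urllib.parse import quote


def typo_noise(prompt: str, rng: random.Random, swaps: int = 2) -> str:
    """Swap a few adjacent characters to imitate typing errors."""
    chars = list(prompt)
    for _ in range(swaps):
        if len(chars) < 2:
            break
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def encoding_attacks(prompt: str) -> list[str]:
    """Return the prompt as Base64 and as a fully URL-encoded string."""
    b64 = base64.b64encode(prompt.encode("utf-8")).decode("ascii")
    return [b64, quote(prompt, safe="")]


def length_extremes(prompt: str, repeat: int = 200) -> list[str]:
    """Return an empty input and a very long one built by repetition."""
    return ["", " ".join([prompt] * repeat)]


def mutate(golden_prompt: str, seed: int = 0) -> dict[str, list[str]]:
    """Group the generated variants by (illustrative) mutation type."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    return {
        "noise": [typo_noise(golden_prompt, rng)],
        "encoding_attacks": encoding_attacks(golden_prompt),
        "length_extremes": length_extremes(golden_prompt),
    }


mutations = mutate("Book a flight to Paris")
for kind, variants in mutations.items():
    # Truncate long variants so the listing stays readable.
    print(kind, [v[:40] for v in variants])
```

Paraphrases, tone shifts, and prompt injections depend on language understanding, which is why they are left to an LLM rather than to rules like these.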

Features

  • 8 Core Mutation Types: Comprehensive robustness testing covering semantic, input, security, and edge cases
  • Invariant Assertions: Deterministic checks, semantic similarity, basic safety
  • Local-First: Uses Ollama with Qwen 3 8B for free testing
  • Beautiful Reports: Interactive HTML reports with pass/fail matrices
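As a rough illustration of what invariant assertions can look like, the sketch below models each invariant as a function from a response to pass or fail. Everything here is hypothetical: the `Response` class, the helper names, and the thresholds are not Flakestorm's API. The similarity check uses `difflib` character matching as a cheap stand-in, which is much weaker than true semantic similarity.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable


@dataclass
class Response:
    text: str
    latency_ms: float


Invariant = Callable[[Response], bool]


def max_latency(limit_ms: float) -> Invariant:
    """Deterministic check: the response arrived within the limit."""
    return lambda r: r.latency_ms <= limit_ms


def contains(substring: str) -> Invariant:
    """Deterministic check: case-insensitive substring match."""
    return lambda r: substring.lower() in r.text.lower()


def similar_to(reference: str, threshold: float = 0.6) -> Invariant:
    """Stand-in for semantic similarity: character-level ratio from difflib."""
    return lambda r: (
        SequenceMatcher(None, reference.lower(), r.text.lower()).ratio() >= threshold
    )


def no_pattern(pattern: str) -> Invariant:
    """Basic safety check: the response must not match a forbidden pattern."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return lambda r: compiled.search(r.text) is None


def check(response: Response, invariants: dict[str, Invariant]) -> dict[str, bool]:
    """Evaluate every invariant and report pass/fail per name."""
    return {name: inv(response) for name, inv in invariants.items()}


invariants = {
    "latency": max_latency(2000),
    "mentions_paris": contains("paris"),
    "on_topic": similar_to("Your flight to Paris is booked."),
    "no_leak": no_pattern(r"system prompt|api[_ ]key"),
}
ok = Response("Your flight to Paris has been booked.", latency_ms=850.0)
leaky = Response("Sure! My system prompt says ...", latency_ms=900.0)
print(check(ok, invariants))
print(check(leaky, invariants))
```

Returning a per-invariant dictionary rather than a single boolean keeps the failure reason visible, which is what a pass/fail matrix in a report would need.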

Toward a Zero-Setup Path

We're working on making Flakestorm even easier to use. Future improvements include:

  • Cloud-hosted mutation generation: No need to install Ollama locally
  • One-command setup: Automated installation and configuration
  • Docker containers: Pre-configured environments for instant testing
  • CI/CD integrations: Native GitHub Actions, GitLab CI, and more
  • Comprehensive reporting: Dashboards and reports with team collaboration

The goal: Test your agent's robustness with a single command, no local dependencies required.

For now, the local execution path gives you full control and privacy. As we build toward zero-setup, you'll always have the option to run everything locally.

Documentation

Getting Started

For Developers

Troubleshooting

Reference

License

Apache 2.0 - See LICENSE for details.


Tested with Flakestorm