mirror of https://github.com/flakestorm/flakestorm.git synced 2026-04-24 16:26:35 +02:00

Flakestorm: Automated Robustness Testing for AI Agents. Stop guessing whether your agent really works. Flakestorm generates adversarial mutations and exposes failures your manual tests and evals miss. https://flakestorm.com

adversarial-agent-testing ai-agent-testing langchain-agent prompt-injection-llm-security

Francisco M Humarang Jr. 637c0c65be Merge branch 'main' of https://github.com/flakestorm/flakestorm		2026-01-13 21:40:03 +08:00

Flakestorm

The Agent Reliability Engine
Chaos Engineering for Production AI Agents

The Problem

The "Happy Path" Fallacy: Current AI development tools focus on getting an agent to work once. Developers tweak prompts until they get a correct answer, declare victory, and ship.

The Reality: LLMs are non-deterministic. An agent that works on Monday with temperature=0.7 might fail on Tuesday. Production agents face real users who make typos, get aggressive, and attempt prompt injections. Real traffic exposes failures that happy-path testing misses.

The Void:

- Observability Tools (LangSmith) tell you only after the agent has failed in production
- Eval Libraries (RAGAS) focus on academic scores rather than system reliability
- CI Pipelines lack chaos testing, so agents ship untested against adversarial inputs
- Missing Link: a tool that actively attacks the agent to prove robustness before deployment

The Solution

Flakestorm is a chaos testing layer for production AI agents. It applies Chaos Engineering principles to systematically test how your agents behave under adversarial inputs before real users encounter them.

Instead of running one test case, Flakestorm takes a single "Golden Prompt", generates adversarial mutations (semantic variations, noise injection, hostile tone, prompt injections), runs them against your agent, and calculates a Robustness Score. Run it before deploy, in CI, or against production-like environments.
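
As a rough illustration of the noise-injection idea only, here is a minimal sketch that derives typo-style mutations from one golden prompt. It is not Flakestorm's implementation: Flakestorm generates mutations with an LLM, while this stand-in uses a seeded random character swap. The names `inject_typos` and `generate_mutations` and the 15% swap rate are assumptions made for the example.

```python
import random


def inject_typos(prompt: str, rate: float, rng: random.Random) -> str:
    """Swap adjacent letters at roughly `rate` of the eligible positions."""
    chars = list(prompt)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so it is not swapped straight back
        else:
            i += 1
    return "".join(chars)


def generate_mutations(golden_prompt: str, count: int, seed: int = 0) -> list[str]:
    """Produce `count` noisy variants; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return [inject_typos(golden_prompt, rate=0.15, rng=rng) for _ in range(count)]


golden = "Book a flight from Paris to Tokyo next Friday"
mutations = generate_mutations(golden, count=5)
for mutation in mutations:
    print(mutation)
```

Seeding the generator means a failing mutation can be regenerated exactly, which matters when a flaky agent failure needs to be reproduced.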

"If it passes Flakestorm, it won't break in Production."

Production-First by Design

Flakestorm is designed for teams already running AI agents in production. Most production agents use cloud LLM APIs (OpenAI, Gemini, Claude, Perplexity, etc.) and face real traffic, real users, and real abuse patterns.

Why local LLMs exist in the open source version:

- Fast experimentation and proofs-of-concept
- CI-friendly testing without external dependencies
- Transparent, extensible chaos engine

Why production chaos should mirror production reality: Production agents run on cloud infrastructure, process real user inputs, and scale dynamically. Chaos testing should reflect this reality—testing against the same infrastructure, scale, and patterns your agents face in production.

The cloud version removes operational friction: no local model setup, no environment configuration, scalable mutation runs, shared dashboards, and team collaboration. Open source proves the value; cloud delivers production-grade chaos engineering.

Who Flakestorm Is For

- Teams shipping AI agents to production: catch failures before users do
- Engineers running agents behind APIs: test against real-world abuse patterns
- Teams already paying for LLM APIs: reduce regressions and production incidents
- CI/CD pipelines: automated reliability gates before deployment

Flakestorm is built for production-grade agents handling real traffic. While it works great for exploration and hobby projects, it's designed to catch the failures that matter when agents are deployed at scale.
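
To make the "reliability gate" idea concrete, here is a minimal sketch of a CI check that turns a robustness score into a process exit code. It rests on assumptions: the score is taken to be a fraction between 0 and 1, the 0.8 threshold is arbitrary, and `gate` and `main` are hypothetical helpers rather than part of the Flakestorm CLI. How the score is obtained from a run is left out because it depends on the report format.

```python
import sys


def gate(score: float, threshold: float = 0.8) -> int:
    """Return 0 (pass) if the score meets the threshold, 1 (fail) otherwise."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0, 1], got {score}")
    return 0 if score >= threshold else 1


def main(argv: list[str]) -> int:
    """Read the score from the first command-line argument (assumed input)."""
    if len(argv) < 2:
        print("usage: gate.py SCORE", file=sys.stderr)
        return 2
    return gate(float(argv[1]))
```

A CI step could end with `sys.exit(main(sys.argv))`, so that a score below the threshold fails the pipeline.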

Demo

flakestorm in Action

Watch flakestorm generate mutations and test your agent in real-time

Test Report

Interactive HTML reports with detailed failure analysis and recommendations

How Flakestorm Works

Flakestorm follows a simple but powerful workflow:

1. You provide "Golden Prompts": example inputs that should always work correctly
2. Flakestorm generates mutations. Using a local LLM, it creates adversarial variations across 22+ mutation types:
   - Prompt-level: Paraphrases, typos, tone shifts, prompt injections, encoding attacks, context manipulation, length extremes, multi-turn attacks, advanced jailbreaks, semantic similarity attacks, format poisoning, language mixing, token manipulation, temporal attacks
   - System/Network-level: HTTP header injection, payload size attacks, content-type confusion, query parameter poisoning, request method attacks, protocol-level attacks, resource exhaustion, concurrent patterns, timeout manipulation
3. Your agent processes each mutation: Flakestorm sends them to your agent endpoint
4. Invariants are checked: responses are validated against rules you define (latency, content, safety)
5. Robustness Score is calculated: weighted by mutation difficulty and importance
6. Report is generated: interactive HTML showing what passed, what failed, and why

The result: You know exactly how your agent will behave under stress before users ever see it.
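
The scoring step can be sketched as a weighted pass rate. This is one plausible reading of "weighted by mutation difficulty and importance", not Flakestorm's documented formula. The `MutationResult` type, the weights, and the pass/fail values below are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class MutationResult:
    mutation_type: str
    weight: float  # assumed difficulty/importance weight for this mutation
    passed: bool   # True if every invariant held for this mutation


def robustness_score(results: list[MutationResult]) -> float:
    """Weighted pass rate: passed weight divided by total weight."""
    total = sum(r.weight for r in results)
    if total == 0:
        raise ValueError("no weighted results to score")
    passed = sum(r.weight for r in results if r.passed)
    return passed / total


results = [
    MutationResult("paraphrase", 1.0, True),
    MutationResult("noise", 1.0, True),
    MutationResult("tone_shift", 1.0, False),
    MutationResult("prompt_injection", 2.0, False),
]
score = robustness_score(results)
print(f"{score:.2f}")  # 2.0 passed weight out of 5.0 total -> 0.40
```

With equal weights the same data would score 0.50; weighting the prompt injection failure more heavily pulls the score down to 0.40.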

Note: The open source version uses local LLMs (Ollama) for mutation generation. The cloud version (in development) uses production-grade infrastructure to mirror real-world chaos testing at scale.

Features

✅ 22+ Core Mutation Types: Comprehensive robustness testing covering:
- Prompt-level attacks: Paraphrase, noise, tone shift, prompt injection, encoding, context manipulation, length extremes, multi-turn attacks, advanced jailbreaks, semantic similarity, format poisoning, language mixing, token manipulation, temporal attacks
- System/Network-level attacks: HTTP header injection, payload size attacks, content-type confusion, query parameter poisoning, request method attacks, protocol-level attacks, resource exhaustion, concurrent patterns, timeout manipulation
✅ Invariant Assertions: Deterministic checks, semantic similarity, basic safety
✅ Beautiful Reports: Interactive HTML reports with pass/fail matrices
✅ Open Source Core: Full chaos engine available locally for experimentation and CI
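
For a sense of what invariant checks of these three kinds might look like, here is a minimal sketch in plain Python. The function names are hypothetical and do not reflect Flakestorm's configuration schema. The similarity check uses `difflib.SequenceMatcher`, which compares characters rather than meaning, so it is only a lexical stand-in for true semantic similarity.

```python
import re
from difflib import SequenceMatcher

# Loose pattern for email-like strings; a real PII check would be broader.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_latency(elapsed_ms: float, max_ms: float) -> bool:
    """Deterministic check: the agent answered within the time budget."""
    return elapsed_ms <= max_ms


def check_contains(response: str, required: str) -> bool:
    """Deterministic check: the response mentions the required text."""
    return required.lower() in response.lower()


def check_similarity(response: str, reference: str, min_ratio: float) -> bool:
    """Lexical stand-in for semantic similarity against a reference answer."""
    return SequenceMatcher(None, response, reference).ratio() >= min_ratio


def check_no_email(response: str) -> bool:
    """Basic safety check: the response leaks no email-like string."""
    return EMAIL_PATTERN.search(response) is None


response = "Your flight is booked."
checks = {
    "latency": check_latency(120.0, max_ms=2000.0),
    "contains": check_contains(response, "booked"),
    "similarity": check_similarity(response, "Your flight is booked", 0.8),
    "safety": check_no_email(response),
}
print(checks)
```

Keeping each invariant a small boolean function makes a failed mutation easy to attribute to a specific rule.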

Open Source vs Cloud

Open Source (Always Free):

- Core chaos engine with all 22+ mutation types (no artificial feature gating)
- Local execution for fast experimentation
- CI-friendly usage without external dependencies
- Full transparency and extensibility
- Perfect for proofs-of-concept and development workflows

Cloud (In Progress / Waitlist):

- Zero-setup chaos testing (no Ollama, no local models)
- Scalable runs (thousands of mutations)
- Shared dashboards & reports
- Team collaboration
- Scheduled & continuous chaos runs
- Production-grade reliability workflows

Our Philosophy: We do not cripple the OSS version. Cloud exists to remove operational pain, not to lock features. Open source proves the value; cloud delivers production-grade chaos engineering at scale.

Try Flakestorm in ~60 Seconds

This is the fastest way to try Flakestorm locally. The cloud version, aimed at production teams, is currently available by waitlist only. The local quickstart:

Install flakestorm (if you have Python 3.10+):
```
pip install flakestorm
```
Initialize a test configuration:
```
flakestorm init
```

Point it at your agent (edit flakestorm.yaml):

agent:
  endpoint: "http://localhost:8000/invoke"  # Your agent's endpoint
  type: "http"

Run your first test:
```
flakestorm run
```

That's it! You'll get a robustness score and detailed report showing how your agent handles adversarial inputs.
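
If you have no agent running yet, a throwaway HTTP endpoint is enough to try the flow. The sketch below uses only the Python standard library and listens on the `/invoke` path from the config above. The JSON field names `input` and `output` are assumptions made for this example; the request and response format Flakestorm actually uses is covered in the Connection Guide.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_invoke(payload: dict) -> dict:
    """Toy agent logic: echo the input back in a fixed response shape."""
    text = str(payload.get("input", ""))  # "input" is an assumed field name
    return {"output": f"You said: {text}"}  # "output" is an assumed field name


class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/invoke":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        if not isinstance(payload, dict):
            self.send_error(400, "expected a JSON object")
            return
        body = json.dumps(handle_invoke(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block and serve the toy agent on http://localhost:<port>/invoke."""
    HTTPServer(("localhost", port), AgentHandler).serve_forever()
```

Call `serve()` in one terminal to start the stub, then run the quickstart commands in another.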

Note: For full local execution (including mutation generation), you'll need Ollama installed. See the Usage Guide for complete setup instructions.

Roadmap

See what's coming next! Check out our Roadmap for upcoming features including:

🚀 Pattern Engine Upgrade with 110+ Prompt Injection Patterns and 52+ PII Detection Patterns
☁️ Cloud Version enhancements (scalable runs, team collaboration, continuous testing)
🏢 Enterprise features (on-premise deployment, custom patterns, compliance certifications)

Documentation

Getting Started

📖 Usage Guide - Complete end-to-end guide (includes local setup)
⚙️ Configuration Guide - All configuration options
🔌 Connection Guide - How to connect Flakestorm to your agent
🧪 Test Scenarios - Real-world examples with code
🔗 Integrations Guide - HuggingFace models & semantic similarity

For Developers

🏗️ Architecture & Modules - How the code works
❓ Developer FAQ - Q&A about design decisions
🤝 Contributing - How to contribute

Troubleshooting

🔧 Fix Installation Issues - Resolve ModuleNotFoundError: No module named 'flakestorm.reports'
🔨 Fix Build Issues - Resolve pip install . vs pip install -e . problems

Reference

📋 API Specification - API reference
🧪 Testing Guide - How to run and write tests
✅ Implementation Checklist - Development progress

Cloud Version (Early Access)

For teams running production AI agents, the cloud version removes operational friction: zero-setup chaos testing without local model configuration, scalable mutation runs that mirror production traffic, shared dashboards for team collaboration, and continuous chaos runs integrated into your reliability workflows.

The cloud version is currently in early access. Join the waitlist to get access as we roll it out.

License

Apache 2.0 - See LICENSE for details.

Tested with Flakestorm