Flakestorm — Automated Robustness Testing for AI Agents. Stop guessing if your agent really works. FlakeStorm generates adversarial mutations and exposes failures your manual tests and evals miss. https://flakestorm.com
Find a file
2026-03-12 20:05:51 +08:00
.github/ISSUE_TEMPLATE Update README.md and CONTRIBUTING.md to enhance project visibility and support for new contributors. Added PyPI version and download badges, build status, and latest release information to README.md. Introduced a section in CONTRIBUTING.md for finding good first issues, providing guidance for beginners on how to contribute effectively. 2026-01-13 21:39:50 +08:00
docs Update documentation and configuration for Flakestorm V2, enhancing clarity on CI processes, report generation, and reproducibility features. Added details on the new --output option for saving reports, clarified the use of --min-score, and improved descriptions of the seed configuration for deterministic runs. Updated README and usage guides to reflect these changes and ensure comprehensive understanding of the CI pipeline and report outputs. 2026-03-12 20:05:51 +08:00
examples Update documentation and configuration for Flakestorm V2, enhancing clarity on CI processes, report generation, and reproducibility features. Added details on the new --output option for saving reports, clarified the use of --min-score, and improved descriptions of the seed configuration for deterministic runs. Updated README and usage guides to reflect these changes and ensure comprehensive understanding of the CI pipeline and report outputs. 2026-03-12 20:05:51 +08:00
rust Update version to 2.0.0 and enhance chaos engineering features in Flakestorm. Added support for environment chaos, behavioral contracts, and replay regression. Expanded documentation and improved scoring mechanisms. Updated .gitignore to include new documentation files. 2026-03-06 23:33:21 +08:00
src/flakestorm Update documentation and configuration for Flakestorm V2, enhancing clarity on CI processes, report generation, and reproducibility features. Added details on the new --output option for saving reports, clarified the use of --min-score, and improved descriptions of the seed configuration for deterministic runs. Updated README and usage guides to reflect these changes and ensure comprehensive understanding of the CI pipeline and report outputs. 2026-03-12 20:05:51 +08:00
tests Enhance documentation and replay functionality in Flakestorm. Updated README to clarify V2 Spec and added references to LangSmith sources in configuration guide. Improved replay regression capabilities by allowing imports from LangSmith projects and runs, with filtering options. Added new classes for LangSmith project and run sources in the configuration. Updated replay loader to support project imports and refined session resolution logic. 2026-03-07 02:04:55 +08:00
.gitignore Update documentation and configuration for Flakestorm V2, enhancing clarity on CI processes, report generation, and reproducibility features. Added details on the new --output option for saving reports, clarified the use of --min-score, and improved descriptions of the seed configuration for deterministic runs. Updated README and usage guides to reflect these changes and ensure comprehensive understanding of the CI pipeline and report outputs. 2026-03-12 20:05:51 +08:00
.pre-commit-config.yaml Refactor Entropix to FlakeStorm 2025-12-29 11:15:18 +08:00
BUILD_FIX.md Add pre-flight validation, flexible response handling, and improved error detection - Add pre-flight check to validate agent with first golden prompt before mutations - Improve response extraction to handle various agent response formats automatically - Add support for non-JSON responses (plain text, HTML) - Enhance error detection for HTTP 200 responses with error fields - Add comprehensive auto-detection for common response field names - Improve JSON parsing error handling with graceful fallbacks - Add example YAML config for GenerateSearchQueries agent - Update documentation with build and installation fixes 2026-01-02 15:21:20 +08:00
Cargo.toml Refactor Entropix to FlakeStorm 2025-12-29 11:15:18 +08:00
FIX_INSTALL.md Add pre-flight validation, flexible response handling, and improved error detection - Add pre-flight check to validate agent with first golden prompt before mutations - Improve response extraction to handle various agent response formats automatically - Add support for non-JSON responses (plain text, HTML) - Enhance error detection for HTTP 200 responses with error fields - Add comprehensive auto-detection for common response field names - Improve JSON parsing error handling with graceful fallbacks - Add example YAML config for GenerateSearchQueries agent - Update documentation with build and installation fixes 2026-01-02 15:21:20 +08:00
flakestorm-20260102-233336.html Add files via upload 2026-01-12 19:42:41 +08:00
flakestorm.yaml Revise README.md to enhance clarity and user experience by updating the features section, streamlining the quick start guide, and introducing a new section on future improvements for zero-setup usage. The changes aim to provide a more intuitive overview of Flakestorm's capabilities and installation process. 2026-01-04 23:28:43 +08:00
flakestorm.yaml.example Enhance mutation capabilities by adding three new types: encoding_attacks, context_manipulation, and length_extremes. Update configuration and documentation to reflect the addition of these types, including their weights and descriptions. Revise README.md, API_SPECIFICATION.md, CONFIGURATION_GUIDE.md, and other relevant documents to provide comprehensive coverage of the new mutation strategies and their applications. Ensure all tests are updated to validate the new mutation types. 2026-01-01 17:28:05 +08:00
flakestorm_demo.gif Update model configuration and enhance documentation for improved user guidance - Change default model to "gemma3:1b" in flakestorm-generate-search-queries.yaml and increase mutation count from 3 to 20 - Revise README.md to include demo visuals and model recommendations based on system RAM - Expand USAGE_GUIDE.md with detailed model selection criteria and installation instructions - Enhance HTML report generation to include actionable recommendations for failed mutations and executive summary insights. 2026-01-02 20:01:12 +08:00
flakestorm_report1.png Add new AI agent for generating search queries using Google Gemini - Introduce keywords_extractor_agent with robust error handling and response parsing - Include multiple fallback strategies for query generation - Update README.md and documentation to reflect new agent capabilities and setup instructions - Remove outdated broken_agent example and associated files. 2026-01-02 21:52:56 +08:00
flakestorm_report2.png Update flakestorm_report2.png to reflect recent changes in report generation and visualization enhancements. 2026-01-02 21:56:56 +08:00
flakestorm_report3.png Add new AI agent for generating search queries using Google Gemini - Introduce keywords_extractor_agent with robust error handling and response parsing - Include multiple fallback strategies for query generation - Update README.md and documentation to reflect new agent capabilities and setup instructions - Remove outdated broken_agent example and associated files. 2026-01-02 21:52:56 +08:00
flakestorm_report4.png Add new AI agent for generating search queries using Google Gemini - Introduce keywords_extractor_agent with robust error handling and response parsing - Include multiple fallback strategies for query generation - Update README.md and documentation to reflect new agent capabilities and setup instructions - Remove outdated broken_agent example and associated files. 2026-01-02 21:52:56 +08:00
flakestorm_report5.png Add new AI agent for generating search queries using Google Gemini - Introduce keywords_extractor_agent with robust error handling and response parsing - Include multiple fallback strategies for query generation - Update README.md and documentation to reflect new agent capabilities and setup instructions - Remove outdated broken_agent example and associated files. 2026-01-02 21:52:56 +08:00
LICENSE Refactor Entropix to FlakeStorm 2025-12-29 11:15:18 +08:00
pyproject.toml Update version to 2.0.0 and enhance chaos engineering features in Flakestorm. Added support for environment chaos, behavioral contracts, and replay regression. Expanded documentation and improved scoring mechanisms. Updated .gitignore to include new documentation files. 2026-03-06 23:33:21 +08:00
README.md Update documentation and configuration for Flakestorm V2, enhancing clarity on CI processes, report generation, and reproducibility features. Added details on the new --output option for saving reports, clarified the use of --min-score, and improved descriptions of the seed configuration for deterministic runs. Updated README and usage guides to reflect these changes and ensure comprehensive understanding of the CI pipeline and report outputs. 2026-03-12 20:05:51 +08:00
RELEASE_NOTES.md Update to 24 mutation types: Add release notes, update development status to Production/Stable, and expand test coverage for all mutation types 2026-01-15 15:04:10 +08:00
ROADMAP.md Update version to 2.0.0 and enhance chaos engineering features in Flakestorm. Added support for environment chaos, behavioral contracts, and replay regression. Expanded documentation and improved scoring mechanisms. Updated .gitignore to include new documentation files. 2026-03-06 23:33:21 +08:00
test_wheel_contents.sh Add pre-flight validation, flexible response handling, and improved error detection - Add pre-flight check to validate agent with first golden prompt before mutations - Improve response extraction to handle various agent response formats automatically - Add support for non-JSON responses (plain text, HTML) - Enhance error detection for HTTP 200 responses with error fields - Add comprehensive auto-detection for common response field names - Improve JSON parsing error handling with graceful fallbacks - Add example YAML config for GenerateSearchQueries agent - Update documentation with build and installation fixes 2026-01-02 15:21:20 +08:00

Flakestorm

The Agent Reliability Engine
Chaos Engineering for Production AI Agents

License GitHub Stars PyPI version PyPI downloads Latest Release


The Problem

Production AI agents are distributed systems: they depend on LLM APIs, tools, context windows, and multi-step orchestration. Each of these can fail. Todays tools dont answer the questions that matter:

  • What happens when the agents tools fail? — A search API returns 503. A database times out. Does the agent degrade gracefully, hallucinate, or fabricate data?
  • Does the agent always follow its rules? — Must it always cite sources? Never return PII? Are those guarantees maintained when the environment is degraded?
  • Did we fix the production incident? — After a failure in prod, how do we prove the fix and prevent regression?

Observability tools tell you after something broke. Eval libraries focus on output quality, not resilience. No tool systematically breaks the agents environment to test whether it survives. Flakestorm fills that gap.

The Solution: Chaos Engineering for AI Agents

Flakestorm is a chaos engineering platform for production AI agents. Like Chaos Monkey for infrastructure, Flakestorm deliberately injects failures into the tools, APIs, and LLMs your agent depends on — then verifies that the agent still obeys its behavioral contract and recovers gracefully.

Other tools test if your agent gives good answers. Flakestorm tests if your agent survives production.

Three Pillars

Pillar What it does Question answered
Environment Chaos Inject faults into tools and LLMs (timeouts, errors, rate limits, malformed responses) Does the agent handle bad environments?
Behavioral Contracts Define invariants (rules the agent must always follow) and verify them across a matrix of chaos scenarios Does the agent obey its rules when the world breaks?
Replay Regression Import real production failure sessions and replay them as deterministic tests Did we fix this incident?

On top of that, Flakestorm still runs adversarial prompt mutations (22+ mutation types; max 50 per run in OSS) so you can test bad inputs and bad environments together.

Scores at a glance

What you run Score you get
flakestorm run Robustness score (01): how well the agent handled adversarial prompts.
flakestorm run --chaos --chaos-only Chaos resilience (same 01 metric): how well the agent handled a broken environment (no mutations, only chaos).
flakestorm contract run Resilience score (0100%): contract × chaos matrix, severity-weighted.
flakestorm replay run … Per-session pass/fail; aggregate replay regression score when run via flakestorm ci.
flakestorm ci Overall (weighted) score combining mutation robustness, chaos resilience, contract compliance, and replay regression — one number for CI gates.

Commands by scope

Scope Command What runs
V1 only / mutation only flakestorm run Just adversarial mutations → agent → invariants. No chaos, no contract matrix, no replay. Use a v1.0 config or omit --chaos so you get only the classic robustness score.
Mutation + chaos flakestorm run --chaos Mutations run against a fault-injected agent (tool/LLM chaos).
Chaos only flakestorm run --chaos --chaos-only No mutations; golden prompts only, with chaos. Single chaos resilience score.
Contract only flakestorm contract run Contract × chaos matrix; resilience score.
Replay only flakestorm replay run path/to/replay.yaml -c flakestorm.yaml One or more replay sessions.
ALL (full CI) flakestorm ci Mutation run + contract (if configured) + chaos-only run (if chaos configured) + all replay sessions (if configured); then overall weighted score. Writes a summary report (e.g. flakestorm-ci-report.html) with per-phase scores and links to detailed reports; use --output DIR or --output report.html and --min-score N.

Context attacks are part of environment chaos: adversarial content is applied to tool responses or to the input before invoke, not to the user prompt itself. The chaos interceptor applies memory_poisoning to the user input before each invoke; LLM faults (timeout, truncated, empty, garbage, rate_limit, response_drift) are applied in the interceptor (timeout before the call, others after the response). Types: indirect_injection (tool returns valid-looking content with hidden instructions), memory_poisoning (payload into input before invoke; strategy prepend | append | replace), system_prompt_leak_probe (contract assertion using probe prompts). Config: list of attack configs or dict (e.g. memory_poisoning: { payload: "...", strategy: "append" }). Scenarios in the contract chaos matrix can each define context_attacks. See Context Attacks.

Production-First by Design

Flakestorm is designed for teams already running AI agents in production. Most production agents use cloud LLM APIs (OpenAI, Gemini, Claude, Perplexity, etc.) and face real traffic, real users, and real abuse patterns.

Why local LLMs exist in the open source version:

  • Fast experimentation and proofs-of-concept
  • CI-friendly testing without external dependencies
  • Transparent, extensible chaos engine

Why production chaos should mirror production reality: Production agents run on cloud infrastructure, process real user inputs, and scale dynamically. Chaos testing should reflect this reality—testing against the same infrastructure, scale, and patterns your agents face in production.

The cloud version removes operational friction: no local model setup, no environment configuration, scalable mutation runs, shared dashboards, and team collaboration. Open source proves the value; cloud delivers production-grade chaos engineering.

Who Flakestorm Is For

  • Teams shipping AI agents to production — Catch failures before users do
  • Engineers running agents behind APIs — Test against real-world abuse patterns
  • Teams already paying for LLM APIs — Reduce regressions and production incidents
  • CI/CD pipelines (Cloud only) — Automated reliability gates, scheduled runs, and native pipeline integrations; OSS is for local and scripted runs

Flakestorm is built for production-grade agents handling real traffic. While it works great for exploration and hobby projects, it's designed to catch the failures that matter when agents are deployed at scale.

Demo

flakestorm in Action

flakestorm Demo

Watch Flakestorm run chaos and mutation tests against your agent in real-time

Test Report

flakestorm Test Report 1

flakestorm Test Report 2

flakestorm Test Report 3

flakestorm Test Report 4

flakestorm Test Report 5

Interactive HTML reports with detailed failure analysis and recommendations

How Flakestorm Works

Flakestorm supports several modes; you can use one or combine them:

  • Chaos only — Golden prompts → agent with fault-injected tools/LLM → invariants. Does the agent handle bad environments?
  • Contract — Golden prompts → agent under each chaos scenario → verify named invariants across a matrix. Does the agent obey its rules under every failure mode?
  • Replay — Recorded production input + recorded tool responses → agent → contract. Did we fix this incident?
  • Mutation (optional) — Golden prompts → adversarial mutations (24 types, max 50/run) → agent (optionally under chaos) → invariants. Does the agent handle bad inputs (and optionally bad environments)?

You define golden prompts, invariants (or a full contract with severity and chaos matrix), and optionally chaos (tool/LLM faults) and replay sessions. Flakestorm runs the chosen mode(s), checks responses against your rules, and produces a robustness score (mutation or chaos-only runs) or resilience score (contract run), plus HTML report. Use flakestorm run, flakestorm contract run, flakestorm replay run, or flakestorm ci for the combined overall score (OSS: run from CLI or your own scripts; native CI/CD integrations — scheduled runs, pipeline plugins — are Cloud only).

For the full V1 vs V2 flow (mutation-only vs four pillars, contract matrix isolation, resilience score formula), see the Usage Guide.

Note

: Mutation generation uses a local LLM (Ollama) or cloud APIs (OpenAI, Claude, Gemini). API keys via environment variables only. See LLM Providers.

Features

Chaos engineering pillars

  • Environment Chaos — Inject faults into tools and LLMs (timeouts, errors, rate limits, malformed responses, built-in profiles). Context attacks: indirect_injection, memory_poisoning (input before invoke; strategy: prepend/append/replace), system_prompt_leak_probe; config as list or dict. → Environment Chaos
  • Behavioral Contracts — Named invariants × chaos matrix; severity-weighted resilience score. Optional reset per cell: agent.reset_endpoint (HTTP) or agent.reset_function (e.g. myagent:reset_state). system_prompt_leak_probe: use probes (list of prompts) on an invariant to run probe prompts and verify response (e.g. excludes_pattern). behavior_unchanged: baseline auto or manual. Stateful agents: warn if no reset and responses differ. → Behavioral Contracts
  • Replay Regression — Import production failures (manual or LangSmith), replay deterministically, verify against contracts. Sessions can reference a file or inline id/input; sources support LangSmith project/run with optional auto_import. → Replay Regression

Supporting capabilities

  • Adversarial mutations — 24 mutation types (prompt-level and system/network-level); max 50 mutations per run in OSS. → Test Scenarios for mutation, chaos, contract, and replay examples.
  • Invariants & assertions — Deterministic checks, semantic similarity, safety (PII, refusal); configurable per contract.
  • Robustness score — For mutation runs: a single weighted score (01) of how well the agent handled adversarial prompts. Reported in HTML/JSON and CLI (results.statistics.robustness_score).
  • Unified resilience score — For full CI: weighted combination of mutation robustness, chaos resilience, contract compliance, and replay regression; weights (mutation, chaos, contract, replay) configurable in YAML and must sum to 1.0.
  • Context attacks — indirect_injection (into tool/context), memory_poisoning (into input before invoke; strategy: prepend/append/replace), system_prompt_leak_probe (contract assertion with probe prompts). Config: list or dict. → Context Attacks
  • LLM providers — Ollama, OpenAI, Anthropic, Google (Gemini); API keys via env only. → LLM Providers
  • Reports — Interactive HTML and JSON; contract matrix and replay reports. flakestorm ci writes a summary report (flakestorm-ci-report.html) with per-phase scores and links to detailed reports (mutation, contract, chaos, replay). Contract PASS/FAIL in the summary matches the contract detailed report (FAIL if any critical invariant fails).
  • Reproducible runs — Set advanced.seed in config (e.g. seed: 42) for deterministic results: Python random is seeded (chaos behavior fixed) and the mutation-generation LLM uses temperature=0 so the same config yields the same scores run-to-run.

Try it: Working example with chaos, contracts, and replay from the CLI.

Open Source vs Cloud

Open Source (Always Free):

  • Core chaos engine with all 24 mutation types (max 50 per run; no artificial feature gating)
  • Local execution for fast experimentation
  • Run from CLI or your own scripts (no native CI/CD; thats Cloud only)
  • Full transparency and extensibility
  • Perfect for proofs-of-concept and development workflows

Cloud (In Progress / Waitlist):

  • Zero-setup chaos testing (no Ollama, no local models)
  • CI/CD — native pipeline integrations, scheduled runs, reliability gates
  • Scalable runs (thousands of mutations)
  • Shared dashboards & reports
  • Team collaboration
  • Production-grade reliability workflows

Our Philosophy: We do not cripple the OSS version. Cloud exists to remove operational pain, not to lock features. Open source proves the value; cloud delivers production-grade chaos engineering at scale.

Try Flakestorm in ~60 Seconds

This is the fastest way to try Flakestorm locally. Production teams typically use the cloud version (waitlist). Here's the local quickstart:

  1. Install flakestorm (if you have Python 3.10+):

    pip install flakestorm
    
  2. Initialize a test configuration:

    flakestorm init
    
  3. Point it at your agent (edit flakestorm.yaml):

    agent:
      endpoint: "http://localhost:8000/invoke"  # Your agent's endpoint
      type: "http"
    
  4. Run your first test:

    flakestorm run
    

    With a v2 config you can also run flakestorm run --chaos, flakestorm contract run, flakestorm replay run, or flakestorm ci to exercise all pillars.

That's it! You get a robustness score (for mutation runs) or a resilience score (when using chaos/contract/replay), plus a report showing how your agent handles chaos and adversarial inputs.

Note

: For full local execution (including mutation generation), you'll need Ollama installed. See the Usage Guide for complete setup instructions.

Roadmap

See Roadmap for the full plan. Highlights:

  • V3 — Multi-agent chaos — Chaos engineering for systems of multiple agents: fault injection across agent-to-agent and tool boundaries, contract verification for multi-agent workflows, and replay of multi-agent production incidents.
  • Pattern engine — 110+ prompt-injection and 52+ PII detection patterns; Rust-backed, sub-50ms.
  • Cloud — Scalable runs, team dashboards, scheduled chaos, CI integrations.
  • Enterprise — On-premise, audit logging, compliance certifications.

Documentation

Getting Started

For Developers

Troubleshooting

Support

Reference

Cloud Version (Early Access)

For teams running production AI agents, the cloud version removes operational friction: zero-setup chaos testing without local model configuration, scalable mutation runs that mirror production traffic, shared dashboards for team collaboration, and continuous chaos runs integrated into your reliability workflows.

The cloud version is currently in early access. Join the waitlist to get access as we roll it out.

License

Apache 2.0 - See LICENSE for details.


Tested with Flakestorm
Tested with Flakestorm


❤️ Sponsor Flakestorm on GitHub