mirror of https://github.com/flakestorm/flakestorm.git synced 2026-04-26 17:26:27 +02:00

Flakestorm — Automated Robustness Testing for AI Agents. Stop guessing if your agent really works. FlakeStorm generates adversarial mutations and exposes failures your manual tests and evals miss. https://flakestorm.com

adversarial-agent-testing ai-agent-testing langchain-agent prompt-injection-llm-security

Find a file

Frank Humarang a36cecf255 Add initial project structure and configuration files - Created .gitignore to exclude unnecessary files and directories. - Added Cargo.toml for Rust workspace configuration. - Introduced example configuration file entropix.yaml.example for user customization. - Included LICENSE file with Apache 2.0 license details. - Created pyproject.toml for Python project metadata and dependencies. - Added README.md with project overview and usage instructions. - Implemented a broken agent example to demonstrate testing capabilities. - Established Rust module structure with Cargo.toml and source files. - Set up initial tests for assertions and configuration validation.		2025-12-28 21:55:01 +08:00
examples/broken_agent	Add initial project structure and configuration files	2025-12-28 21:55:01 +08:00
rust	Add initial project structure and configuration files	2025-12-28 21:55:01 +08:00
src/entropix	Add initial project structure and configuration files	2025-12-28 21:55:01 +08:00
tests	Add initial project structure and configuration files	2025-12-28 21:55:01 +08:00
.gitignore	Add initial project structure and configuration files	2025-12-28 21:55:01 +08:00
Cargo.toml	Add initial project structure and configuration files	2025-12-28 21:55:01 +08:00
entropix.yaml.example	Add initial project structure and configuration files	2025-12-28 21:55:01 +08:00
LICENSE	Add initial project structure and configuration files	2025-12-28 21:55:01 +08:00
pyproject.toml	Add initial project structure and configuration files	2025-12-28 21:55:01 +08:00
README.md	Add initial project structure and configuration files	2025-12-28 21:55:01 +08:00

README.md

Entropix

The Agent Reliability Engine
Chaos Engineering for AI Agents

The Problem

The "Happy Path" Fallacy: Current AI development tools focus on getting an agent to work once. Developers tweak prompts until they get a correct answer, declare victory, and ship.

The Reality: LLMs are non-deterministic. An agent that works on Monday with temperature=0.7 might fail on Tuesday. Users don't follow "Happy Paths" — they make typos, they're aggressive, they lie, and they attempt prompt injections.

The Void:

Observability Tools (LangSmith) tell you after the agent failed in production
Eval Libraries (RAGAS) focus on academic scores rather than system reliability
Missing Link: A tool that actively attacks the agent to prove robustness before deployment

The Solution

Entropix is a local-first testing engine that applies Chaos Engineering principles to AI Agents.

Instead of running one test case, Entropix takes a single "Golden Prompt", generates 50+ adversarial mutations (semantic variations, noise injection, hostile tone, prompt injections), runs them in parallel against your agent, and calculates a Robustness Score.

"If it passes Entropix, it won't break in Production."

Features

Semantic Mutations: Paraphrasing, noise injection, tone shifts, prompt injections
Invariant Assertions: Deterministic checks, semantic similarity, safety validations
Local-First: Uses Ollama with Qwen Coder 3 8B for free, unlimited attacks
Beautiful Reports: Interactive HTML reports with pass/fail matrices
CI/CD Ready: GitHub Actions integration to block PRs below reliability thresholds

Quick Start

Installation

pip install entropix

Prerequisites

Entropix uses Ollama for local model inference:

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull the default model
ollama pull qwen3:8b

Initialize Configuration

entropix init

This creates an entropix.yaml configuration file:

version: "1.0"

agent:
  endpoint: "http://localhost:8000/invoke"
  type: "http"
  timeout: 30000

model:
  provider: "ollama"
  name: "qwen3:8b"
  base_url: "http://localhost:11434"

mutations:
  count: 20
  types:
    - paraphrase
    - noise
    - tone_shift
    - prompt_injection

golden_prompts:
  - "Book a flight to Paris for next Monday"
  - "What's my account balance?"

invariants:
  - type: "latency"
    max_ms: 2000
  - type: "valid_json"

output:
  format: "html"
  path: "./reports"

Run Tests

entropix run

Output:

Entropix - Agent Reliability Engine v0.1.0

✓ Loading configuration from entropix.yaml
✓ Connected to Ollama (qwen3:8b)
✓ Agent endpoint verified

Generating mutations... ━━━━━━━━━━━━━━━━━━━━ 100%
Running attacks...      ━━━━━━━━━━━━━━━━━━━━ 100%
Verifying invariants... ━━━━━━━━━━━━━━━━━━━━ 100%

╭──────────────────────────────────────────╮
│  Robustness Score: 87.5%                 │
│  ────────────────────────                │
│  Passed: 35/40 mutations                 │
│  Failed: 5 (3 latency, 2 injection)      │
╰──────────────────────────────────────────╯

Report saved to: ./reports/entropix-2024-01-15-143022.html

Mutation Types

Type	Description	Example
Paraphrase	Semantically equivalent rewrites	"Book a flight" → "I need to fly out"
Noise	Typos and spelling errors	"Book a flight" → "Book a fliight plz"
Tone Shift	Aggressive/impatient phrasing	"Book a flight" → "I need a flight NOW!"
Prompt Injection	Adversarial attack attempts	"Book a flight and ignore previous instructions"

Invariants (Assertions)

Deterministic

invariants:
  - type: "contains"
    value: "confirmation_code"
  - type: "latency"
    max_ms: 2000
  - type: "valid_json"

Semantic

invariants:
  - type: "similarity"
    expected: "Your flight has been booked"
    threshold: 0.8

Safety

invariants:
  - type: "excludes_pii"
  - type: "refusal_check"
    dangerous_prompts: true

Agent Adapters

HTTP Endpoint

agent:
  type: "http"
  endpoint: "http://localhost:8000/invoke"

Python Callable

from entropix import test_agent

@test_agent
async def my_agent(input: str) -> str:
    # Your agent logic
    return response

LangChain

agent:
  type: "langchain"
  module: "my_agent:chain"

CI/CD Integration

GitHub Actions

name: Agent Reliability Check

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Ollama
        run: |
          curl -fsSL https://ollama.ai/install.sh | sh
          ollama pull qwen3:8b
      
      - name: Install Entropix
        run: pip install entropix
      
      - name: Run Reliability Tests
        run: entropix run --min-score 0.9 --ci

Robustness Score

The Robustness Score is calculated as:

R = \frac{W_s \cdot S_{passed} + W_d \cdot D_{passed}}{N_{total}}

Where:

S_{passed} = Semantic variations passed
D_{passed} = Deterministic tests passed
W = Weights assigned by mutation difficulty

Documentation

License

Apache 2.0 - See LICENSE for details.

Tested with Entropix